# Weather Data Analysis Project

## Introduction

**Purpose:**

We experiment with publicly available historical and current weather data from the National Oceanic and Atmospheric Administration (NOAA). We want to download and parse the data, perform some computations, and analyze the results.

**Project Goals:**

1. Load station and temperature data from publicly available text files from the National Oceanic and Atmospheric Administration (NOAA).
2. Integrate missing data, smooth data, and plot temperature data.
3. Compute the daily records at a given location.
4. Compare the warmest year of a cold location with the coldest year of a warm one.

## Imports

In [1]:
import os
import urllib.request
from pathlib import Path

In [2]:
import numpy as np
import matplotlib.pyplot as pp
import seaborn

%matplotlib inline

## Loading Station Data

**Goals:**

1. Download a file over FTP.
2. Parse a space-separated text file into a Python dictionary.

We will be using data from the [Global Historical Climatology Network (GHCN) | National Centers for Environmental Information (NCEI)](https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn). NCEI was formerly known as the National Climatic Data Center (NCDC). The GHCN is an integrated database of climate summaries from land surface stations across the globe.

We start by downloading a text file, namely `ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt`, which contains an annotated list of land surface stations in the network. Since we know the exact file we need, we download it over FTP directly into a data directory of our choice (or into the current working directory, CWD, if no directory is given). Once the file is on disk, we read it in, strip the newline characters, and load its contents into a Python list.

**We write the following functions to save a file to disk using FTP and load a data file into a Python list.**

In [83]:
# =============================================================================
def save_ftp_file(ftp_link_address, file_path):
    """
    Downloads the given file using FTP and saves it to the given file path,
    specified as 'data_dir_name/file_name.ext', or to the CWD, simply
    specified as 'file_name.ext'.
    """
    # Create the data directory, if one is given and it does not exist.
    data_dir_name = os.path.dirname(file_path)
    if data_dir_name:
        os.makedirs(data_dir_name, exist_ok=True)

    # Bare file name for the messages; works with or without a directory.
    file_name = os.path.basename(file_path)

    # Download the given file using FTP to the desired location,
    # if the file does not exist on disk.
    if not os.path.isfile(file_path):
        urllib.request.urlretrieve(ftp_link_address, file_path)
        print('INFO: File saved successfully: "{}"'.format(file_name))
    else:
        print('INFO: File already exists: "{}"'.format(file_name))

         
# =============================================================================
def get_data(data_path):
    """
    Returns a clean list of entries read in from the data file.
    """    
    # If the data file exists, read it in, strip trailing whitespace
    # (including '\n'), and return a Python list.
    if os.path.isfile(data_path):
        with open(data_path, 'r') as data_file:
            return [line.rstrip() for line in data_file]
    else:
        print("ERROR: Data file not found at the given path.")
        print("       Check both directory and file exist or")
        print("       redownload data from the FTP site.")
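
The directory and file-name handling in `save_ftp_file` can also be expressed with `pathlib`, which the notebook already imports. The following is only a minimal sketch; the helper name `ensure_parent_dir` is hypothetical and not part of the notebook. It creates the parent directory of a target path if needed and returns the bare file name, for paths with or without a directory component.

```python
from pathlib import Path
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed; return the file name."""
    path = Path(file_path)
    # For a bare 'file_name.ext' the parent is '.', which already exists.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.name


# Try it in a throwaway directory so nothing is written next to the notebook.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'data' / 'ghcnd-stations.txt'
    print(ensure_parent_dir(target), target.parent.is_dir())
```

Unlike splitting on `'/'`, this approach also copes with nested directories and with platform-specific path separators.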

We can now **save and load the station data**.

In [98]:
ftp_link_address = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt'
data_path = 'data/ghcnd-stations.txt'

In [99]:
# Download station data.
save_ftp_file(ftp_link_address, data_path)

INFO: File already exists: "ghcnd-stations.txt"


In [100]:
# Load station data.
stationlist = get_data(data_path)

If the data was loaded successfully, we can see what it looks like and how many entries there are.

In [101]:
stationlist[:10]

['ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD',
 'ACW00011647  17.1333  -61.7833   19.2    ST JOHNS',
 'AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196',
 'AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194',
 'AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217',
 'AEM00041218  24.2620   55.6090  264.9    AL AIN INTL                            41218',
 'AF000040930  35.3170   69.0170 3366.0    NORTH-SALANG                   GSN     40930',
 'AFM00040938  34.2100   62.2280  977.2    HERAT                                  40938',
 'AFM00040948  34.5660   69.2120 1791.3    KABUL INTL                             40948',
 'AFM00040990  31.5000   65.8500 1010.0    KANDAHAR AIRPORT                       40990']

In [74]:
print("Number of entries: {}".format(len(stationlist)))

Number of entries: 115081


We can also **save the corresponding `readme.txt` file for reference**. This time, we do not need to load any data.

In [87]:
ftp_link_address = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt'
readme_file_path = 'data/ghcnd-readme.txt'

In [88]:
# Download readme file.
save_ftp_file(ftp_link_address, readme_file_path)

INFO: File already exists: "ghcnd-readme.txt"


From the readme file, the downloaded *ghcnd* data has the following format:

In [93]:
open(readme_file_path, 'r').readlines()[407:420]

['------------------------------\n',
 'Variable   Columns   Type\n',
 '------------------------------\n',
 'ID            1-11   Character\n',
 'LATITUDE     13-20   Real\n',
 'LONGITUDE    22-30   Real\n',
 'ELEVATION    32-37   Real\n',
 'STATE        39-40   Character\n',
 'NAME         42-71   Character\n',
 'GSN FLAG     73-75   Character\n',
 'HCN/CRN FLAG 77-79   Character\n',
 'WMO ID       81-85   Character\n',
 '------------------------------\n']
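
The column table above translates directly into Python slices: a 1-based inclusive range `a-b` becomes `line[a-1:b]`. The sketch below illustrates that conversion under a few assumptions: the names `STATION_FIELDS` and `parse_station_line` are hypothetical, and the sample line is rebuilt with format widths (using the SHARJAH values shown earlier) so that the columns are guaranteed to line up.

```python
# Column ranges from the readme table, converted from 1-based inclusive
# columns to 0-based, end-exclusive Python slices.
STATION_FIELDS = {
    'id': slice(0, 11),
    'latitude': slice(12, 20),
    'longitude': slice(21, 30),
    'elevation': slice(31, 37),
    'state': slice(38, 40),
    'name': slice(41, 71),
    'gsn_flag': slice(72, 75),
    'hcn_crn_flag': slice(76, 79),
    'wmo_id': slice(80, 85),
}


def parse_station_line(line):
    """Split one station line by column position and convert the numeric fields."""
    record = {key: line[cols].strip() for key, cols in STATION_FIELDS.items()}
    for key in ('latitude', 'longitude', 'elevation'):
        record[key] = float(record[key])
    return record


# Sample line rebuilt with explicit field widths (an illustration, not read from the file).
sample = '{:<11} {:>8} {:>9} {:>6} {:<2} {:<30} {:<3} {:<3} {:<5}'.format(
    'AE000041196', '25.3330', '55.5170', '34.0', '',
    'SHARJAH INTER. AIRP', 'GSN', '', '41196')

record = parse_station_line(sample)
print(record['id'], record['name'], record['gsn_flag'], record['wmo_id'])
```

Slicing by column keeps the station name separate from the flags and the WMO ID, which a whitespace split cannot do reliably because names contain spaces.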

**We want to concentrate for now on stations which are tagged as $\texttt{GSN}$**. We create a **dictionary of "Station Names" indexed by "Station ID"**, skipping all lines that do not have $\texttt{GSN}$ values. Here, "Station ID" is the first field of a line when we do a line split and "Station Names" is a string that joins all fields starting at index 4.

In [77]:
stations = {}

for line in stationlist:
    if "GSN" in line:
        fields = line.split()
        stations[fields[0]] = ' '.join(fields[4:])

Let us see some entries of the dictionary we just created and how many there are.

In [78]:
for key in list(stations.keys())[:10]:
    print("{}: {}".format(key, stations[key]))

AE000041196: SHARJAH INTER. AIRP GSN 41196
AF000040930: NORTH-SALANG GSN 40930
AG000060390: ALGER-DAR EL BEIDA GSN 60390
AG000060590: EL-GOLEA GSN 60590
AG000060611: IN-AMENAS GSN 60611
AG000060680: TAMANRASSET GSN 60680
AJ000037989: ASTARA GSN 37989
ALM00013615: TIRANA RINAS GSN 13615
AM000037781: ARAGAC VISOKOGORNAYA GSN 37781
AO000066160: LUANDA GSN 66160


In [79]:
len(stations)

994
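
Two properties of the filter above are worth keeping in mind. The test `"GSN" in line` matches the text anywhere on the line, and joining `fields[4:]` keeps every token after the elevation, which is why the names printed above end in `GSN` and the WMO ID. A possible alternative, sketched here with the column positions from the readme table, checks the flag column itself and takes the name from its own columns. The function `build_gsn_stations`, the helper `make_line`, and the second sample station are hypothetical.

```python
def make_line(station_id, lat, lon, elev, name, gsn='', wmo=''):
    """Build a station line with the column layout described in the readme."""
    return '{:<11} {:>8} {:>9} {:>6} {:<2} {:<30} {:<3} {:<3} {:<5}'.format(
        station_id, lat, lon, elev, '', name, gsn, '', wmo)


def build_gsn_stations(lines):
    """Map station ID to station name for lines whose GSN flag column holds 'GSN'."""
    stations_by_id = {}
    for line in lines:
        if line[72:75] == 'GSN':                  # GSN FLAG, columns 73-75
            stations_by_id[line[0:11]] = line[41:71].strip()  # NAME, columns 42-71
    return stations_by_id


sample_lines = [
    make_line('AE000041196', '25.3330', '55.5170', '34.0',
              'SHARJAH INTER. AIRP', gsn='GSN', wmo='41196'),
    # Hypothetical station: "GSN" appears in the name, but the flag column is empty.
    make_line('XX000000001', '0.0000', '0.0000', '1.0', 'EXAMPLE GSN ROAD'),
]

gsn_stations = build_gsn_stations(sample_lines)
print(gsn_stations)
```

With the substring test, the second sample line would also be selected; the column test skips it.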

Now we can look for patterns in the Station Name to retrieve values from some stations. We can create a function to do just that.

In [80]:
# =============================================================================
def find_station(station_name):
    return {ID:NAME for ID,NAME in stations.items() if station_name in NAME}

We are going to pick four stations as a starting point for analysis, and we want to get a list of their IDs.

In [81]:
station_names = ['IRKUTSK', 'LIHUE', 'MINNEAPOLIS', 'SAN DIEGO']
station_ids = [list(find_station(name).keys())[0] for name in station_names]
station_ids

['RSM00030710', 'USW00022536', 'USW00014922', 'USW00023188']
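
The lookup above takes the first match for each name and would raise an `IndexError` if a name matched no station. A slightly more defensive variant might look like the sketch below; `first_station_id` is a hypothetical helper, and the small dictionary reuses two entries printed earlier rather than the full `stations` dictionary.

```python
def first_station_id(station_name, stations_by_id):
    """Return the ID of the first station whose name contains station_name."""
    matches = [station_id for station_id, name in stations_by_id.items()
               if station_name in name]
    if not matches:
        raise LookupError('No station name contains "{}"'.format(station_name))
    return matches[0]


# Two entries copied from the dictionary output shown earlier.
sample_stations = {
    'AE000041196': 'SHARJAH INTER. AIRP GSN 41196',
    'AO000066160': 'LUANDA GSN 66160',
}

print(first_station_id('LUANDA', sample_stations))
```

Raising a `LookupError` with the offending name makes a misspelled station easier to spot than a bare `IndexError`.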

## Loading Temperature Data

**Goals:**

1. Parse a fixed-field text file using `np.genfromtxt`.
2. Use ranges of NumPy datetime64 objects.

We begin by downloading the daily weather files for the four stations we have selected, using their IDs. The files can be downloaded over FTP from the [Index of /pub/data/ghcn/daily/gsn/](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/gsn/).

The link address of a daily file has the format `ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/gsn/AE000041196.dly`. We can use the functions we created to download the four selected files, given their IDs.

In [85]:
ftp_link_address_root = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/gsn/'
for station_id in station_ids:
    # Use station_id rather than id to avoid shadowing the built-in id().
    ftp_link_address = ftp_link_address_root + station_id + '.dly'
    data_path = 'data/' + station_id + '.dly'

    save_ftp_file(ftp_link_address, data_path)

INFO: File already exists: "RSM00030710.dly"
INFO: File already exists: "USW00022536.dly"
INFO: File already exists: "USW00014922.dly"
INFO: File already exists: "USW00023188.dly"


From the downloaded readme file, the downloaded files have the following format:

In [97]:
open(readme_file_path, 'r').readlines()[108:131]

['------------------------------\n',
 'Variable   Columns   Type\n',
 '------------------------------\n',
 'ID            1-11   Character\n',
 'YEAR         12-15   Integer\n',
 'MONTH        16-17   Integer\n',
 'ELEMENT      18-21   Character\n',
 'VALUE1       22-26   Integer\n',
 'MFLAG1       27-27   Character\n',
 'QFLAG1       28-28   Character\n',
 'SFLAG1       29-29   Character\n',
 'VALUE2       30-34   Integer\n',
 'MFLAG2       35-35   Character\n',
 'QFLAG2       36-36   Character\n',
 'SFLAG2       37-37   Character\n',
 '  .           .          .\n',
 '  .           .          .\n',
 '  .           .          .\n',
 'VALUE31    262-266   Integer\n',
 'MFLAG31    267-267   Character\n',
 'QFLAG31    268-268   Character\n',
 'SFLAG31    269-269   Character\n',
 '------------------------------\n']

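This table describes a **fixed-width (fixed-field) text format**, where each field always occupies the same columns. Every line holds one station, year, month, and element, followed by 31 daily groups of one value and three flags.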
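
The column ranges in the table reduce to a list of field widths: 11, 4, 2, and 4 characters for the header fields, then 5, 1, 1, and 1 characters for each of the 31 daily groups. The sketch below splits one line using such a list in plain Python. The names `DLY_WIDTHS`, `split_fixed`, and `parse_dly_line` are hypothetical, and the sample line contains made-up values, not real observations. A width list of this shape is also what the `delimiter` argument of `np.genfromtxt` accepts for fixed-width files.

```python
# Field widths: ID, YEAR, MONTH, ELEMENT, then 31 x (VALUE, MFLAG, QFLAG, SFLAG).
DLY_WIDTHS = [11, 4, 2, 4] + [5, 1, 1, 1] * 31


def split_fixed(line, widths):
    """Cut a line into consecutive fields of the given widths."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width])
        start += width
    return fields


def parse_dly_line(line):
    """Return the header fields and the 31 daily values of one .dly line."""
    fields = split_fixed(line, DLY_WIDTHS)
    return {
        'id': fields[0],
        'year': int(fields[1]),
        'month': int(fields[2]),
        'element': fields[3],
        # VALUE1 is field 4; each following value is 4 fields further on.
        'values': [int(value) for value in fields[4::4]],
    }


# Made-up sample line: 31 values right-aligned in 5 columns, with blank flags.
sample_values = [150 + day for day in range(31)]
sample_line = 'USW00023188' + '2000' + '01' + 'TMAX' + ''.join(
    '{:>5}   '.format(value) for value in sample_values)

record = parse_dly_line(sample_line)
print(record['id'], record['year'], record['month'], record['element'])
print(len(record['values']), record['values'][:3])
```

The widths sum to 269, which matches the last column listed in the table (`SFLAG31`, column 269).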
