# NOAA CO-OPS Data Download
### Part 1 of 3

In this notebook, we will download atmospheric and water observations from the [National Oceanic and Atmospheric Administration](https://www.noaa.gov) (NOAA) [Center for Operational Oceanographic Products and Services](https://tidesandcurrents.noaa.gov/) (CO-OPS) data portal. The objective is to replicate the [Climatology for Virginia Key, FL](https://bmcnoldy.earth.miami.edu/vk/) page created and maintained by [Brian McNoldy](https://bmcnoldy.earth.miami.edu/) at the [University of Miami](https://welcome.miami.edu) [Rosenstiel School of Marine, Atmospheric, and Earth Science](http://earth.miami.edu).

For the sake of demonstration, we will focus on air and water temperature from Virginia Key, FL. Ultimately, however, there are several variables of interest:

- Air temperature
- Barometric pressure
- Water temperature
- Water level (*i.e.*, tides)
- Wind speed

This notebook will simply download the data, store the metadata, and write these to file. The second notebook, [NOAA-CO-OPS-records](NOAA-CO-OPS-records.ipynb), will filter these data and calculate a set of statistics and records. Part 3, [NOAA-CO-OPS-plots](NOAA-CO-OPS-plots.ipynb), will plot and display the data.

### Packages and configurations

First we import the packages we need.

In [1]:
from noaa_coops import Station
import datetime as dt
import pandas as pd
import numpy as np
import yaml
import os

By default, Python only displays a given warning the first time it is raised from a particular location. Ideally, we want code that does not raise any warnings, but it sometimes takes some trial and error to resolve the issue being warned about. So, for diagnostic purposes, we'll set the kernel to always display warnings.

In [2]:
import warnings
warnings.filterwarnings('always')
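
As a rough illustration of what this setting changes, the sketch below counts how many times the same warning is recorded under the `'default'` and `'always'` actions. The helper names `noisy` and `count_warnings` are made up for this example, and the filters are changed only inside a `warnings.catch_warnings` context so the kernel-wide setting is left alone.

```python
import warnings

def noisy():
    """Raise the same warning from the same line each time it is called."""
    warnings.warn("example warning", UserWarning)

def count_warnings(action, repeats=3):
    """Count the warnings recorded when noisy() runs 'repeats' times."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(repeats):
            noisy()
    return len(caught)

print(count_warnings('default'))  # repeats from the same location are suppressed
print(count_warnings('always'))   # every occurrence is recorded
```

With `'always'`, a warning that recurs while we are still debugging stays visible instead of disappearing after its first appearance.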

### Functions

Next, we define a number of functions that will come in handy later.

#### Helper functions

In [3]:
def camel(text):
    """Convert 'text' to camel case"""
    s = text.replace(',','').replace("-", " ").replace("_", " ")
    s = s.split()
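    # Guard on the split result so that empty or separator-only input is
    # returned unchanged instead of raising an IndexError on s[0]
    if len(s) == 0:
        return text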
    return s[0].lower() + ''.join(i.capitalize() for i in s[1:])

def get_units(variable, unit_system):
    """Return the desired units for 'variable'"""
    unit_options = dict({
        'Air Temperature': {'metric': 'C', 'english': 'F'},
        'Barometric Pressure': {'metric': 'mb', 'english': 'mb'},
        'Wind Speed': {'metric': 'm/s', 'english': 'kn'},
        'Wind Gust': {'metric': 'm/s', 'english': 'kn'},
        'Wind Direction': {'metric': 'deg', 'english': 'deg'},
        'Water Temperature': {'metric': 'C', 'english': 'F'},
        'Water Level': {'metric': 'm', 'english': 'ft'}
    })
    return unit_options[variable][unit_system]

def format_date(datestr):
    """Format date strings into YYYYMMDD format"""
    dtdt = pd.to_datetime(datestr)
    return dt.datetime.strftime(dtdt, '%Y%m%d')
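
To see what `camel` produces, here is a small self-contained sketch. A copy of the helper, guarding on the split result so that separator-only input is safe, is included so the snippet runs on its own. The inputs other than `'Virginia Key, FL'` are invented for illustration.

```python
def camel(text):
    """Convert 'text' to camel case (standalone copy for this example)"""
    s = text.replace(',', '').replace('-', ' ').replace('_', ' ').split()
    if not s:
        return text
    return s[0].lower() + ''.join(i.capitalize() for i in s[1:])

for name in ['Virginia Key, FL', 'naples-bay_north', '']:
    print(repr(name), '->', repr(camel(name)))
```

Note that `str.capitalize` lowercases everything after the first letter, which is why the state abbreviation `FL` becomes `Fl` in the directory name.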

#### Downloading data

In [4]:
def load_atemp(metadata, start_date, end_date, verbose=True):
    """Download air temperature data from NOAA CO-OPS between 'start_date'
    and 'end_date' for 'stationid', 'unit_system', and timezone 'tz'
    provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving air temperature data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Air Temperature']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    air_temp = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='air_temperature',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    air_temp.columns = ['atemp', 'atemp_flag']
    return air_temp

def load_wind(metadata, start_date, end_date, verbose=True):
    """Download wind data from NOAA CO-OPS between 'start_date' and
    'end_date' for 'stationid', 'unit_system', and timezone 'tz' provided
    in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving wind data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Wind']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    wind = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='wind',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    wind.columns = ['windspeed', 'winddir_deg', 'winddir',
                    'windgust', 'wind_flag']
    return wind

def load_atm_pres(metadata, start_date, end_date, verbose=True):
    """Download barometric pressure data from NOAA CO-OPS between
    'start_date' and 'end_date' for 'stationid', 'unit_system', and
    timezone 'tz' provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving barometric pressure data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Barometric Pressure']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    pressure = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='air_pressure',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    pressure.columns = ['apres', 'apres_flag']
    return pressure

def load_water_temp(metadata, start_date, end_date, verbose=True):
    """Download water temperature data from NOAA CO-OPS between
    'start_date' and 'end_date' for 'stationid', 'unit_system', and
    timezone 'tz' provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving water temperature data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Water Temperature']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    water_temp = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='water_temperature',
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    water_temp.columns = ['wtemp', 'wtemp_flag']
    return water_temp

def load_water_level(metadata, start_date, end_date, verbose=True):
    """Download water level data from NOAA CO-OPS between 'start_date' and
    'end_date' for 'stationid', 'unit_system', 'datum', and timezone 'tz'
    provided in 'metadata' dictionary.
    """
    if verbose:
        print('Retrieving water level tide data')
    station = Station(id=metadata['stationid'])
    if not start_date:
        start_date = format_date(station.data_inventory['Verified 6-Minute Water Level']['start_date'])
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    water_levels = station.get_data(
        begin_date=start_date,
        end_date=end_date,
        product='water_level',
        datum=metadata['datum'],
        units=metadata['unit_system'],
        time_zone=metadata['tz'])
    water_levels.columns = ['wlevel', 's', 'wlevel_flag', 'wlevel_qc']
    return water_levels

def download_data(metadata, start_date=None, end_date=None, verbose=True):
    """Download data from NOAA CO-OPS"""
    # List of data variables to combine at the end
    datasets = []
            
    # If no 'end_date' is passed, download through end of current date
    if not end_date:
        end_date = format_date(pd.to_datetime('today') + pd.Timedelta(days=1))
    
    # Air temperature
    if 'Air Temperature' in metadata['variables']:
        air_temp = load_atemp(metadata=metadata, start_date=start_date,
                              end_date=end_date, verbose=verbose)
        air_temp['atemp_flag'] = air_temp['atemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        air_temp.loc[air_temp['atemp_flag'] > 0, 'atemp'] = np.nan
        datasets.append(air_temp['atemp'])

    # Barometric pressure
    if 'Barometric Pressure' in metadata['variables']:
        pressure = load_atm_pres(metadata=metadata, start_date=start_date,
                                 end_date=end_date, verbose=verbose)
        pressure['apres_flag'] = pressure['apres_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        pressure.loc[pressure['apres_flag'] > 0, 'apres'] = np.nan
        datasets.append(pressure['apres'])

    # Wind
    if 'Wind Speed' in metadata['variables']:
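        # Wind gusts are returned alongside wind speed; list 'Wind Gust' right
        # after 'Wind Speed' (only once) so column names match the data order
        if 'Wind Gust' not in metadata['variables']:
            metadata['variables'].insert(
                metadata['variables'].index('Wind Speed') + 1, 'Wind Gust')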
        wind = load_wind(metadata=metadata, start_date=start_date,
                         end_date=end_date, verbose=verbose)
        wind['wind_flag'] = wind['wind_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        wind.loc[wind['wind_flag'] > 0, ['windspeed', 'windgust']] = np.nan
        datasets.append(wind[['windspeed', 'windgust']])

    # Water temperature
    if 'Water Temperature' in metadata['variables']:
        water_temp = load_water_temp(metadata=metadata, start_date=start_date,
                                     end_date=end_date, verbose=verbose)
        water_temp['wtemp_flag'] = water_temp['wtemp_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        water_temp.loc[water_temp['wtemp_flag'] > 0, 'wtemp'] = np.nan
        datasets.append(water_temp['wtemp'])

    # Water level (tides)
    # 'Water Level' is the variable name that get_units() recognizes
    if 'Water Level' in metadata['variables']:
        water_levels = load_water_level(metadata=metadata, start_date=start_date,
                                        end_date=end_date, verbose=verbose)
        water_levels['wlevel_flag'] = water_levels['wlevel_flag'].str.split(',', expand=True).astype(int).sum(axis=1)
        water_levels.loc[water_levels['wlevel_flag'] > 0, 'wlevel'] = np.nan
        datasets.append(water_levels['wlevel'])

    # Merge into single dataframe and rename columns
    newdata = pd.concat(datasets, axis=1)
    newdata.index.name = f'time_{metadata["tz"]}'
    newdata.columns = [i for i in metadata['variables']]
    return newdata
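
The flag handling inside `download_data` can be hard to picture without data in hand. The sketch below applies the same steps to a tiny made-up table. The values and comma-separated flag strings are invented, and the assumption (taken from the code above, not from NOAA documentation) is that any nonzero flag marks an observation to discard.

```python
import numpy as np
import pandas as pd

# Made-up observations with comma-separated quality flags
air_temp = pd.DataFrame(
    {'atemp': [71.2, 71.4, 99.9, 71.5],
     'atemp_flag': ['0,0,0', '0,0,0', '1,0,0', '0,0,1']},
    index=pd.date_range('2024-05-05 00:00', periods=4, freq='6min'))

# Split each flag string into integer columns and add them up per row
air_temp['atemp_flag'] = (air_temp['atemp_flag']
                          .str.split(',', expand=True)
                          .astype(int)
                          .sum(axis=1))

# Any row with a raised flag has its observation replaced with NaN
air_temp.loc[air_temp['atemp_flag'] > 0, 'atemp'] = np.nan
print(air_temp)
```

Summing the split flags collapses several indicators into one number per row, so a single comparison is enough to mask every flagged observation.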

### Load / download data

Now it's time to load the data. First, specify the station we want to load. This will be used to load saved data or download all data from a new station, if we have not yet retrieved data from this particular `stationname`.

`stationname` is a custom human-readable "City, ST" string for the station, while `id` is the NOAA CO-OPS station ID number.

In [5]:
stationname = 'Virginia Key, FL'
id = '8723214'

Derive the name of the directory containing the data from the station name. This is where the data are or will be saved locally.

In [6]:
dirname = camel(stationname)
outdir = os.path.join(os.getcwd(), dirname)

print(f"Station folder: {dirname}")
print(f"Full directory: {outdir}")

Station folder: virginiaKeyFl
Full directory: /home/climatology/virginiaKeyFl


Set a flag for printing status messages:

In [7]:
verbose = True

Let's see if we already have data from this station saved locally. This will be true if a directory already exists for the station.

If the directory `outdir` does not exist, then no data have been downloaded for this station, so we need to download everything through the present. This requires a few steps:

1. Create `outdir`
2. Load the configuration settings from `station-init.yml`. This file contains settings such as unit system, time zone, and what variables to retrieve. Using an init file like this makes it easier to keep the same settings across multiple stations. It will be read in as a Python dictionary, which we will call `meta` and will use to store all relevant metadata for the station.
3. Download the data and record the timestamp of the last observation in the metadata. This will be used later when updating the data.
4. Write the data and metadata to file.
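
The contents of `station-init.yml` are not shown in this notebook. As a sketch only, the snippet below parses a hypothetical init file whose keys and values are inferred from the metadata displayed later in this notebook; the real file may contain different entries.

```python
import yaml

# Hypothetical contents of station-init.yml (inferred, not the actual file)
init_text = """
unit_system: english
tz: lst
datum: MHHW
day_threshold: 2
hr_threshold: 3
variables:
  - Air Temperature
  - Water Temperature
"""

# safe_load turns the YAML mapping into a plain Python dictionary
meta = yaml.safe_load(init_text)
print(meta['unit_system'], meta['tz'], meta['variables'])
```

Station-specific entries such as the station name, ID, and output directory are not part of the init file; they are added to `meta` after it is loaded.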

On the other hand, if data already exist locally, we will load them from file and download any new data we do not yet have:

1. Load the data and metadata from file
2. Retrieve new data
3. Combine new data with existing data, update the 'last_updated' metadata entry, and write data and metadata to file

The noaa-coops tool only accepts dates without times, so it is possible to download data we already have. We therefore have to check what we download against what we already have to avoid duplicating data.

The most likely (and perhaps only) scenario is that the data we have for the most recent day are incomplete. For example, assume today is May 5, 2024 and we download data at noon. Also assume the start date is some earlier day, the last time we retrieved data, which is determined automatically from the metadata. Specifying an end date of `2024-05-05` will retrieve all data available through noon on May 5. In this case, we do not yet have these data, so we concatenate what we do not have to what we do have. However, if we then run the download function again (say, for diagnostic purposes) with the new start date of `2024-05-05` and the end date `2024-05-05`, it will again download the data through noon on May 5. But since we already have those data, we do not want to re-concatenate them.
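
To make the overlap concrete, the sketch below uses two made-up records: one already on file through 11:54 on May 5, and one returned by a later request for the same calendar day. The frame names and values are invented; only the `isin` filter mirrors the check used in the download cell below.

```python
import pandas as pd

# Hypothetical record already on file: 6-minute observations through 11:54
existing = pd.DataFrame(
    {'Air Temperature': 80.0},
    index=pd.date_range('2024-05-05 00:00', '2024-05-05 11:54', freq='6min'))

# A later request for the same day returns everything from midnight onward:
# the rows already on file plus one more hour of observations
downloaded = pd.DataFrame(
    {'Air Temperature': 81.0},
    index=pd.date_range('2024-05-05 00:00', '2024-05-05 12:54', freq='6min'))

# Keep only the timestamps that are not yet in the existing record
is_new = ~downloaded.index.isin(existing.index)
combined = pd.concat([existing, downloaded[is_new]], axis=0)

print(f"{is_new.sum()} new rows, {len(combined)} rows in total")
print("All timestamps unique:", combined.index.is_unique)
```

Filtering on the index before concatenating leaves the rows already on file untouched and appends only the observations that arrived since the last download.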

*This cell may take several seconds or minutes to run, depending on how much data is being downloaded.*

In [8]:
if not os.path.exists(outdir):
    if verbose:
        print('Creating new directory for this station.')
    os.makedirs(outdir)

    # Metadata configuration
    with open('station-init.yml') as d:
        meta = yaml.safe_load(d)
    meta['units'] = {k:get_units(k, meta['unit_system']) for k in meta['variables']}
    meta['outdir'] = outdir
    meta['stationname'] = stationname
    meta['stationid'] = id

    # Download all data (set start and end date to None to get all data)
    if verbose:
        print('Downloading all data for this station.')
    data = download_data(metadata=meta, start_date=None, end_date=None)
    data.to_csv(os.path.join(meta['outdir'], 'observational_data_record.csv.gz'),
                             compression='infer')
    print("Updated observational data written to file "\
          f"{os.path.join(meta['outdir'], 'observational_data_record.csv.gz')}.")

    # Save metadata
    meta['last_updated'] = str(data.index.max())
    if verbose:
        print(f"Metadata written to file {os.path.join(meta['outdir'], 'metadata.yml')}")
    with open(os.path.join(meta['outdir'], 'metadata.yml'), 'w') as fp:
        yaml.dump(meta, fp)
    
else:
    # Load the metadata
    if verbose:
        print('Loading metadata from file')
    with open(os.path.join(outdir, 'metadata.yml')) as m:
        meta = yaml.safe_load(m)
    
    # Load the historical data
    if verbose:
        print('Loading data from file')
    data = pd.read_csv(os.path.join(outdir, 'observational_data_record.csv.gz'),
                       index_col=f'time_{meta["tz"]}', parse_dates=True,
                       compression='infer')

    # Retrieve new data
    newdata = download_data(metadata=meta, start_date=format_date(meta['last_updated']))
    if sum(~newdata.index.isin(data.index)) == 0:
        print('No new data available.')
    else:
        data = pd.concat([data,
                          newdata[~newdata.index.isin(data.index)]], axis=0)
        data.to_csv(os.path.join(meta['outdir'], 'observational_data_record.csv.gz'),
                                 compression='infer')
        meta['last_updated'] = str(data.index.max())
        with open(os.path.join(meta['outdir'], 'metadata.yml'), 'w') as fp:
            yaml.dump(meta, fp)
        print("Updated observational data written to file "\
              f"{os.path.join(meta['outdir'], 'observational_data_record.csv.gz')}.")

Loading metadata from file
Loading data from file
Retrieving air temperature data
Retrieving water temperature data
No new data available.


Check the data and metadata for sanity:

In [9]:
data

time_lst,Air Temperature,Water Temperature
1994-01-28 00:00:00,,
1994-01-28 00:06:00,,
1994-01-28 00:12:00,,
1994-01-28 00:18:00,,
1994-01-28 00:24:00,,
...,...,...
2024-05-25 09:36:00,83.5,86.0
2024-05-25 09:42:00,83.5,86.2
2024-05-25 09:48:00,83.7,86.2
2024-05-25 09:54:00,83.8,86.2


In [10]:
meta

{'datum': 'MHHW',
 'day_threshold': 2,
 'hr_threshold': 3,
 'last_updated': '2024-05-25 10:00:00',
 'outdir': '/home/climatology/virginiaKeyFl',
 'stationid': '8723214',
 'stationname': 'Virginia Key, FL',
 'tz': 'lst',
 'unit_system': 'english',
 'units': {'Air Temperature': 'F', 'Water Temperature': 'F'},
 'variables': ['Air Temperature', 'Water Temperature']}

In [11]:
len(data.index.unique()) == data.shape[0]

True

The 'last_updated' metadata flag matches the last observation in the data record and corresponds to the most recently available observation. Also, every observation time is unique, so there are no duplicated entries. So, everything checks out.

In the next part, [NOAA-CO-OPS-records](NOAA-CO-OPS-records.ipynb), we will filter these data and calculate statistics and records.

***