# Data Wrangling
In this notebook we'll obtain all of our data, arrange it how we want, and store it in the proper directories.

In [1]:
%run setup.ipynb
data_dir = Path('../data')
output_dir = Path('../output')

tg_nc_dir = data_dir / 'tide_gauge_nc'

# Create the directories if they do not already exist
# (parents=True also creates missing parent directories)
for d in [data_dir, output_dir, tg_nc_dir]:
    d.mkdir(parents=True, exist_ok=True)
        

## Retrieve Tide Gauge Data

We are interested in tide gauge and altimetry data for the Hawaiian Islands (and surroundings) for 1993 through 2022.
Let's first establish where the tide gauges are by looking at the tide gauge dataset. We'll retrieve tide gauge data from the UHSLC (University of Hawaii Sea Level Center) fast-delivery dataset {cite:t}``. The fast-delivery data are released within 1-2 months of data collection and are subject only to basic quality control.

We'll retrieve the hourly data for our station group from UHSLC and save it to our data directory so we don't have to download it again.

In [2]:
# hawaii stations are: 
stationdict = {
    'Hilo': '060',
    'Kawaihae': '552',
    'Kahului': '059',
    'Mokuoloe': '061',
    'Honolulu': '057',
    'Nawiliwili': '058',
    'Johnston Island': '052',
    'Midway Island': '050',
    'Kaumalapau': '548',
    'Barbers Point': '547',
    'French Frigate Shoals': '014',
}
stationdict.values()

station_group = 'Hawaiian Islands'

glue('station_group', station_group)

'Hawaiian Islands'

````{margin}
```{note}
What about research quality data (RQD)? 
RQD undergo thorough and time-consuming QC, and are usually released 1-2 years after data is received. 
```
````

In [3]:
url = "https://uhslc.soest.hawaii.edu/data/netcdf/fast/hourly/"
uhslc_ids = list(stationdict.values())

for uhslc_id in uhslc_ids:
    fname = f'h{uhslc_id}.nc' # h for hourly, d for daily

    path = os.path.join(data_dir, 'tide_gauge_nc', fname)

    # only download files we do not already have
    if not os.path.exists(path):
        print(f'Downloading {fname} from {url} to {path}')
        # build the URL by concatenation; os.path.join is meant for file paths, not URLs
        urlretrieve(url + fname, path)

Now we merge all the datasets. This can take a while.

In [4]:
data_dir

PosixPath('../data')

```{caution}
In the following section I remove the trailing zero from the record_id of each tide gauge. Every dataset derived from this one uses the shortened record_id.
```
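
As a small illustration of what removing the trailing zero does, the sketch below uses made-up record_id values. It assumes each record_id is the UHSLC station id followed by a single trailing digit of 0, which is what the division by 10 in the next cell relies on.

```python
# Hypothetical record_ids, assuming the form <UHSLC id> followed by a trailing 0
record_ids = [570, 600, 5520]

# Integer division by 10 drops the last digit
short_ids = [rid // 10 for rid in record_ids]

# Zero-pad to three digits so the ids can be compared with the strings in stationdict
padded = [f"{sid:03d}" for sid in short_ids]

print(short_ids)
print(padded)
```

If a record_id ever ended in a digit other than 0, the division would silently discard it, so it is worth checking the raw values before relying on the shortened ids.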

In [5]:
# Load the data

import glob        
# Get a list of all .nc files in the data directory
files = glob.glob(os.path.join(data_dir,'tide_gauge_nc','h*.nc'))

# Open the datasets
datasets = [xr.open_dataset(file) for file in files]

#merge in batches of 2 to avoid memory issues
batch_size = 2
merged_datasets = []

for i in range(0, len(datasets), batch_size):
    batch = datasets[i:i+batch_size]
    merged_batch = xr.merge(batch)
    merged_datasets.append(merged_batch)

#merge the merged datasets
rsl = xr.merge(merged_datasets)

# convert byte strings to normal strings
rsl['station_name'] = rsl['station_name'].astype(str)
rsl['station_country'] = rsl['station_country'].astype(str)
rsl['ssc_id'] = rsl['ssc_id'].astype(str)

# remove the trailing zero from each record_id
rsl['record_id'] =(rsl['record_id']/10).astype(int)
rsl

In [6]:
def get_MHHW_uhslc_datums(id, datumname):
    """Return a single datum value from the UHSLC fast-delivery datum table for a station."""
    station = f'{int(id):03}'
    url = ('https://uhslc.soest.hawaii.edu/stations/TIDES_DATUMS/fd/LST/fd'
           + station + '/datumTable_' + station + '_m.html')
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    rows = soup.find('table').find_all('tr')[1:] # skip header row
    # rebuild a well-formed table (opening and closing tags) from the remaining rows
    table = '<table>' + ''.join([str(x) for x in rows]) + '</table>'
    datumtable = pd.read_html(io.StringIO(table))[0]
    datum = datumtable[datumtable['Datum'] == datumname]['Value'].values[0]
    # ensure datum is a float
    return float(datum)

```{note}
This could be extended to store every datum in the table, not only MHHW and MSL.
```
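
One possible way to collect every datum at once is sketched below. It is only a sketch: the sample HTML is invented, the two-column layout (datum name, value) is an assumption based on the 'Datum' and 'Value' columns used above, and `DatumTableParser` and `parse_all_datums` are hypothetical helper names. It uses the standard-library HTML parser so that it runs without a network connection.

```python
from html.parser import HTMLParser

class DatumTableParser(HTMLParser):
    """Collect the text of every <td> cell, one list per table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th> cells and stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

def parse_all_datums(html):
    """Return {datum name: value} for every row whose second cell is numeric."""
    parser = DatumTableParser()
    parser.feed(html)
    datums = {}
    for row in parser.rows:
        if len(row) >= 2:
            try:
                datums[row[0]] = float(row[1])
            except ValueError:
                continue  # skip rows without a numeric value
    return datums

# Invented sample table; the numbers are placeholders, not real datums
sample_html = """
<table>
<tr><th>Datum</th><th>Value</th></tr>
<tr><td>MHHW</td><td>1.50</td></tr>
<tr><td>MSL</td><td>1.20</td></tr>
<tr><td>Status</td><td>n/a</td></tr>
</table>
"""
all_datums = parse_all_datums(sample_html)
print(all_datums)
```

A dictionary like this could then be written into the dataset one variable per datum, in the same way MHHW and MSL are added in the next cell.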

In [7]:
# add MHHW to the dataset
rsl['MHHW'] = xr.DataArray([1000*get_MHHW_uhslc_datums(id, 'MHHW') for id in rsl['uhslc_id'].values], dims='record_id', coords={'record_id': rsl['record_id']})

rsl['MHHW'].attrs['units'] = 'mm'
rsl['MHHW'].attrs['long_name'] = 'Mean Higher High Water, rel. to station datum'

glue('datumname', 'MHHW')

# add MSL to the dataset
rsl['MSL'] = xr.DataArray([1000*get_MHHW_uhslc_datums(id, 'MSL') for id in rsl['uhslc_id'].values], dims='record_id', coords={'record_id': rsl['record_id']})

rsl['MSL'].attrs['units'] = 'mm'
rsl['MSL'].attrs['long_name'] = 'Mean Sea Level, rel. to station datum'

glue('datumname', 'MSL')

'MHHW'

'MSL'

In [8]:
rsl

In [9]:
#save rsl to the data directory
rsl.to_netcdf(data_dir / 'rsl_hawaii.nc')

## Retrieve data from NOAA CO-OPS API


In [10]:
import json
url = "https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations.json"

response = requests.get(url)

# get a list of all the stations
stations = json.loads(response.text)

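#from list of stations, get NOAA stations within the bounding box of UHSLC gauges (padded by 2 degrees)
# UHSLC longitudes are 0-360 and NOAA longitudes are -180 to 180, so subtract 360 from the UHSLC values
lat_min, lat_max = rsl['lat'].min() - 2, rsl['lat'].max() + 2
lon_min, lon_max = rsl['lon'].min() - 360 - 2, rsl['lon'].max() - 360 + 2
hawaii_stations = [
    station for station in stations['stations']
    if lat_min <= station['lat'] <= lat_max and lon_min <= station['lng'] <= lon_max
]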

#make a dictionary of all stations in hawaii with station name: station id
stationdictNOAA  = {station['name']: station['id'] for station in hawaii_stations}
stationdictNOAA 


{'Nawiliwili': '1611400',
 'Honolulu': '1612340',
 'Pearl Harbor': '1612401',
 'Mokuoloe': '1612480',
 'Kahului, Kahului Harbor': '1615680',
 'Kawaihae': '1617433',
 'Hilo, Hilo Bay, Kuhio Bay': '1617760',
 'Sand Island, Midway Islands': '1619910'}
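
The bounding-box selection depends on putting both longitude conventions on the same footing. The sketch below shows the idea with invented coordinates: `to_signed_lon` and `in_box` are hypothetical helpers, and the station entries are placeholders rather than real NOAA metadata. It assumes the box does not cross the antimeridian.

```python
def to_signed_lon(lon_360):
    """Convert a longitude in [0, 360) to the [-180, 180) convention."""
    return ((lon_360 + 180.0) % 360.0) - 180.0

def in_box(station, lat_min, lat_max, lon_min, lon_max, pad=2.0):
    """True if the station lies inside the box, padded by `pad` degrees."""
    return (lat_min - pad <= station["lat"] <= lat_max + pad
            and lon_min - pad <= station["lng"] <= lon_max + pad)

# Illustrative gauge extent in the 0-360 convention (not the real gauge positions)
gauge_lats = [19.7, 28.2]
gauge_lons_360 = [182.6, 204.9]

lon_min = to_signed_lon(min(gauge_lons_360))
lon_max = to_signed_lon(max(gauge_lons_360))

# Placeholder stations using signed longitudes, as in the NOAA station list
stations = [
    {"name": "A", "lat": 21.3, "lng": -157.9},
    {"name": "B", "lat": 37.8, "lng": -122.5},
    {"name": "C", "lat": 13.4, "lng": 144.7},
]
selected = [s["name"] for s in stations
            if in_box(s, min(gauge_lats), max(gauge_lats), lon_min, lon_max)]
print(selected)
```

Only station A falls inside the padded box here; B is too far north and C lies outside the box in both latitude and longitude.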

In [11]:
import requests
import pandas as pd
import xarray as xr
from datetime import datetime, timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_data_chunk(stationID, start_date, end_date):
    # Format dates for the URL
    begin_date_str = start_date.strftime('%Y%m%d %H:%M')
    end_date_str = end_date.strftime('%Y%m%d %H:%M')
    
    # Create the URL
    url = f'https://tidesandcurrents.noaa.gov/api/datagetter?begin_date={begin_date_str}&end_date={end_date_str}&station={stationID}&datum=STND&product=water_level&units=metric&time_zone=gmt&format=json'
    
    # Request data from NOAA API
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        try:
            data = response.json()
            if 'data' in data:
                return data['data']
            else:
                # print(f"No 'data' key in response for {begin_date_str} to {end_date_str}: {data}")
                return []
        except ValueError as e:
            print(f"JSON decoding failed for {begin_date_str} to {end_date_str}: {e}")
            return []
    else:
        # print(f"Failed to fetch data for {begin_date_str} to {end_date_str}, status code: {response.status_code}")
        return []

def fetch_noaa_water_level_parallel(stationID, start_date, end_date):
    # Convert dates to datetime objects
    start_date = datetime.strptime(start_date, '%Y%m%d %H:%M')
    end_date = datetime.strptime(end_date, '%Y%m%d %H:%M')
    
    # List to hold all data
    all_data = []
    
    # Generate date ranges in 31-day increments
    date_ranges = []
    current_start_date = start_date
    while current_start_date < end_date:
        current_end_date = current_start_date + timedelta(days=31)
        if current_end_date > end_date:
            current_end_date = end_date
        date_ranges.append((current_start_date, current_end_date))
        current_start_date = current_end_date + timedelta(seconds=1)
    
    total_ranges = len(date_ranges)
    
    # Fetch data in parallel
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_date_range = {executor.submit(fetch_data_chunk, stationID, start, end): (start, end) for start, end in date_ranges}
        completed_ranges = 0
        for future in as_completed(future_to_date_range):
            try:
                data_chunk = future.result()
                if data_chunk:
                    all_data.extend(data_chunk)
            except Exception as e:
                start, end = future_to_date_range[future]
                # print(f"Error fetching data for {start} to {end}: {e}")  #uncomment if you want to track progress
            
            completed_ranges += 1
            # print(f"Progress: {completed_ranges}/{total_ranges} chunks completed.") #uncomment if you want to track progress
    
    # Convert to DataFrame
    df = pd.DataFrame(all_data)
    if not df.empty:
        df['t'] = pd.to_datetime(df['t'])
        
        # Clean data: remove rows with empty 'v' values
        df = df[df['v'].str.strip() != '']
        
        # Convert 'v' to float
        df['v'] = df['v'].astype(float)
        
        # Remove duplicates
        df = df.drop_duplicates(subset='t')
        
        # Sort by time
        df = df.sort_values(by='t')
        
        # Create time series
        sea_level_ts = pd.Series(df['v'].values, index=df['t'])
        
        # Resample to hourly data
        sea_level_ts = sea_level_ts.resample('h').interpolate()
        
        # Ensure unique timestamps for xarray
        sea_level_ts = sea_level_ts[~sea_level_ts.index.duplicated(keep='first')]
        
        # Create xarray dataset with the correct dimension
        ds = xr.Dataset({'sea_level': ('t', sea_level_ts.values)}, coords={'t': sea_level_ts.index})

        # rename t to time
        ds = ds.rename({'t': 'time'})
        
        
        return ds
    else:
        print("No data fetched.")
        return xr.Dataset()
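
The 31-day chunking inside fetch_noaa_water_level_parallel can be checked on its own, without calling the API. The sketch below repeats the same loop in a hypothetical helper named `split_date_range` and prints the windows for a short example period.

```python
from datetime import datetime, timedelta

def split_date_range(start, end, days=31):
    """Split [start, end] into consecutive windows of at most `days` days."""
    ranges = []
    current = start
    while current < end:
        chunk_end = min(current + timedelta(days=days), end)
        ranges.append((current, chunk_end))
        # start the next window one second later so windows do not overlap
        current = chunk_end + timedelta(seconds=1)
    return ranges

chunks = split_date_range(datetime(2023, 1, 1), datetime(2023, 3, 15))
for chunk_start, chunk_end in chunks:
    print(chunk_start, "->", chunk_end)
```

Because each window starts one second after the previous one ends, the window start times drift by a second per chunk. This is harmless for hourly resampling, and the duplicate removal in the function covers any overlap at the boundaries.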


In [12]:

start_date = '19050101 00:00'
end_date = '20231231 00:00'

for station_name, station_id in stationdictNOAA.items():
    # check if we already have the data
    if os.path.exists(data_dir / f'tide_gauge_nc/noaa_{station_id}.nc'):
        print(f"Data for {station_name} already exists, skipping.")
        continue
    ds = fetch_noaa_water_level_parallel(station_id, start_date, end_date)
    ds['station_id'] = station_id
    ds['station_name'] = station_name
    ds['station_country'] = 'USA'
    ds['lat'] = float([station['lat'] for station in hawaii_stations if station['id'] == station_id][0])
    ds['lon'] = float([station['lng'] for station in hawaii_stations if station['id'] == station_id][0])
    ds.to_netcdf(data_dir /  f'tide_gauge_nc/noaa_{station_id}.nc')

Data for Nawiliwili already exists, skipping.
Data for Honolulu already exists, skipping.
Data for Pearl Harbor already exists, skipping.
Data for Mokuoloe already exists, skipping.
Data for Kahului, Kahului Harbor already exists, skipping.
Data for Kawaihae already exists, skipping.
Data for Hilo, Hilo Bay, Kuhio Bay already exists, skipping.
Data for Sand Island, Midway Islands already exists, skipping.


As before, we'll combine them into one dataset.

In [13]:
# Load the data

      
# Get a list of all noaa*.nc files in the data directory
files = glob.glob(os.path.join(data_dir,'tide_gauge_nc','noaa*.nc'))

# Open the datasets
datasets = [xr.open_dataset(file) for file in files]

# add the station_id as a coordinate
for ds in datasets:
    ds.coords['station_id'] = ds['station_id']
rsl = xr.concat(datasets, dim='station_id')

rsl['sea_level'].attrs['units'] = 'm'
rsl['sea_level'].attrs['long_name'] = 'Sea level, relative to station datum'

In [14]:
rsl

Let's add some more metadata to this dataset, including MSL and MHHW datums for each gauge.

In [15]:
def extract_datum_value(data, datum_name):
    """Return the value of datum_name from a NOAA datums response, or None if it is absent."""
    # Iterate through the datums list
    for datum in data.get('datums', []):
        # Check if the name matches the desired datum name
        if datum.get('name') == datum_name:
            return datum.get('value')
    return None

MHHW = np.zeros(len(rsl['station_id']))
MSL = np.zeros(len(rsl['station_id']))

for i in range(len(rsl['station_id'])):
    url = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/'+ str(rsl['station_id'][i].values) +'/datums.json?units=metric'

    response = requests.get(url)
    datums = json.loads(response.text)

    #extract the MHHW and MSL datums
    MHHW[i] = extract_datum_value(datums, 'MHHW')
    MSL[i] = extract_datum_value(datums, 'MSL')




In [16]:
rsl['MHHW'] = xr.DataArray(MHHW, dims='station_id', coords={'station_id': rsl['station_id']})
rsl['MHHW'].attrs['units'] = 'm'
rsl['MHHW'].attrs['long_name'] = 'Mean Higher High Water, rel. to station datum'

rsl['MSL'] = xr.DataArray(MSL, dims='station_id', coords={'station_id': rsl['station_id']})
rsl['MSL'].attrs['units'] = 'm'
rsl['MSL'].attrs['long_name'] = 'Mean Sea Level, rel. to station datum'


Shift the longitudes by 360 degrees to match the 0-360 degree convention used by UHSLC.

In [17]:
rsl['lon'] = rsl['lon'] + 360
rsl

In [18]:
#save rsl to the data directory
rsl.to_netcdf(data_dir / 'rsl_hawaii_noaa.nc')

## Retrieve altimetry data 
We are using the global ocean gridded L4 [Sea Surface Heights and Derived Variables](https://doi.org/10.48670/moi-00148) from Copernicus. 

To download a subset of the global altimetry data, either run get_CMEMS_data.py from this directory in a terminal (Python >= 3.9 with copernicus_marine_client installed) or uncomment the call to get_CMEMS_data and run it in this notebook. To read more about downloading data with the Copernicus Marine Toolbox (new as of December 2023), visit https://help.marine.copernicus.eu/en/articles/7949409-copernicus-marine-toolbox-introduction.

````{margin}
```{note}
You will need a username and password to access the CMEMS (Copernicus Marine Service) data if this is the first time running the client. To register for data access (free), visit https://data.marine.copernicus.eu/register.  
```
````

```{admonition} Large data download!
:class: warning
Getting errors on the code block below? Remember to uncomment "get_CMEMS_data()" to download. Note that if you change nothing in the function, it will download ~600 MB of data, which may take a long time!! You will only need to do this once. The dataset will be stored in the data directory you specify (which should be the same data directory we defined above).
```

In [19]:
# get the min and max lat and lon of rsl for altimetry data retrieval
minlat = float(rsl.lat.min().values)
maxlat = float(rsl.lat.max().values)
minlon = float(rsl.lon.min().values)
maxlon = float(rsl.lon.max().values)

In [20]:
def get_CMEMS_data(minlat, maxlat, minlon, maxlon, data_dir=data_dir):
    """
    Retrieves Copernicus Marine data for a specified region and time period.

    The region is padded by 2 degrees on every side. The time period is
    fixed inside the function (1993-01-01 through 2023-04-30).

    Args:
        minlat (float): Minimum latitude of the region.
        maxlat (float): Maximum latitude of the region.
        minlon (float): Minimum longitude of the region.
        maxlon (float): Maximum longitude of the region.
        data_dir (str or Path): Directory to save the retrieved data.
    """
    # pad the bounding box by 2 degrees on every side
    maxlat = maxlat+2
    minlat = minlat-2
    maxlon = maxlon+2
    minlon  = minlon-2

    start_date_str = "1993-01-01T00:00:00"
    end_date_str = "2023-04-30T23:59:59"

    copernicusmarine.subset(
        dataset_id="cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.25deg_P1D",
        dataset_version="202112",
        variables=["sla"],
        minimum_longitude=minlon,
        maximum_longitude=maxlon,
        minimum_latitude=minlat,
        maximum_latitude=maxlat,
        start_datetime=start_date_str,
        end_datetime=end_date_str,
        output_directory=data_dir,
        output_filename="cmems_L4_SSH_0.25deg_" + start_date_str[0:4] + "_" + end_date_str[0:4] + ".nc"
    )
fname_cmems = 'cmems_L4_SSH_0.25deg_1993_2023.nc'

In [21]:

# check if the file exists, if not, download it
if not os.path.exists(data_dir / fname_cmems):
    print('You will need to download the CMEMS data in a separate script')
    get_CMEMS_data(minlat, maxlat, minlon, maxlon, data_dir) #<<--- COMMENT OUT TO AVOID ACCIDENTAL DATA DOWNLOADS.
else:
    print('CMEMS data already downloaded, good to go!')

CMEMS data already downloaded, good to go!


Open the CMEMS data and take a look. We want to build an absolute sea level (ASL) dataset with the same structure as the relative sea level (RSL) data so that the two can be compared easily.

In [22]:
# open the CMEMS data
ds = xr.open_dataset(data_dir / fname_cmems)

ds

In [23]:
#load rsl_hawaii as rsl first, so that sla is built for the same
#record_id stations whose metadata is attached to it below
rsl = xr.open_dataset(data_dir / 'rsl_hawaii.nc')

# Extract data for the nearest point to the tide gauge location that has data
sla = []
tol = 0.25
for lat, lon in zip(rsl['lat'].values, rsl['lon'].values):
    sla.append(ds['sla'].sel(
        longitude=lon-360, latitude=lat, method='nearest'
    ))

    #if the data is null, nan average over the grid points within tol degrees
    if sla[-1].isnull().all():
        sla[-1] = ds['sla'].sel(
            longitude=slice(lon-360-tol, lon-360+tol), 
            latitude=slice(lat-tol, lat+tol)
        ).mean(dim=['latitude', 'longitude'])
        # label the averaged series with the gauge position, in the same
        # -180 to 180 longitude convention as the CMEMS grid
        sla[-1]['latitude'] = lat
        sla[-1]['longitude'] = lon-360

sla = xr.concat(sla, dim='record_id')

# make sla a dataset with variables from rsl
sla = sla.to_dataset(name='sla')
sla['station_name'] = rsl['station_name']

# Creating lat_str and lon_str arrays with 'record_id' as their dimension
lat_str = [f'{np.abs(lat):.3f}\u00B0{"N" if lat > 0 else "S"}' for lat in sla.latitude.values]
lon_str = [f'{np.abs(lon):.3f}\u00B0{"E" if lon > 0 else "W"}' for lon in sla.longitude.values]  

# Convert lists to DataArrays with 'record_id' as their dimension
lat_str_da = xr.DataArray(lat_str, dims=['record_id'], coords={'record_id': rsl['record_id']})
lon_str_da = xr.DataArray(lon_str, dims=['record_id'], coords={'record_id': rsl['record_id']})

# Assign these DataArrays to the sla dataset
sla['lat_str'] = lat_str_da
sla['lon_str'] = lon_str_da

# add original data source to attributes
sla.attrs['original_data_source'] = 'CMEMS L4 SSH 0.25deg'
sla.attrs['title'] = ds.attrs['title']
sla.attrs['source_file'] = str(data_dir / fname_cmems)

# ensure latitude and longitude are coordinates associated with a location
sla = sla.set_coords(['latitude', 'longitude'])

sla

In [None]:
#save sla to the data directory
sla.to_netcdf(data_dir / 'asl_hawaii.nc')
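
The idea behind the extraction above (take the nearest grid cell, and fall back to an average of nearby cells when that cell has no data) can be sketched on a toy grid. This is a simplified illustration with invented numbers: it works on a single time step in plain Python, whereas the notebook applies the same idea to full time series with xarray. `nearest_index` and `sample_with_fallback` are hypothetical helper names.

```python
import math

def nearest_index(values, target):
    """Index of the entry in `values` closest to `target`."""
    return min(range(len(values)), key=lambda i: abs(values[i] - target))

def sample_with_fallback(grid, lats, lons, lat, lon, tol=0.25):
    """Nearest-cell value, or the mean of valid cells within `tol` degrees."""
    i, j = nearest_index(lats, lat), nearest_index(lons, lon)
    value = grid[i][j]
    if not math.isnan(value):
        return value
    neighbours = [
        grid[a][b]
        for a in range(len(lats)) for b in range(len(lons))
        if abs(lats[a] - lat) <= tol and abs(lons[b] - lon) <= tol
        and not math.isnan(grid[a][b])
    ]
    return sum(neighbours) / len(neighbours) if neighbours else math.nan

# Toy 2 x 2 grid on a 0.25 degree spacing; one cell is masked (for example, land)
lats = [21.125, 21.375]
lons = [-158.125, -157.875]
grid = [[0.10, math.nan],
        [0.20, 0.30]]

masked_case = sample_with_fallback(grid, lats, lons, lat=21.2, lon=-157.9)
direct_case = sample_with_fallback(grid, lats, lons, lat=21.35, lon=-158.1)
print(masked_case, direct_case)
```

In the first call the nearest cell is masked, so the result is the mean of the three valid neighbours. In the second call the nearest cell has data and is returned directly.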

## Processing
### Process the tide gauge data to match CMEMS
Now we'll convert the tide gauge data into a daily record for the period of record (POR), in units of meters, to match the CMEMS data.

The next code blocks:
- extract tide gauge data for the period 1993-2022
- convert it from millimeters to meters
- resample the data to daily means (missing values are skipped when averaging, not dropped beforehand)
- and normalize it relative to the 1993-2012 epoch.
 

The resulting data is stored in the variable 'rsl_daily' with units in meters.
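
The steps listed above can be sketched in plain Python on a tiny synthetic record. This is only an illustration under stated assumptions: the hourly values are invented, `daily_mean_m` and `anomaly` are hypothetical helpers, and a day with no valid data is simply left out of the result, whereas the xarray resampling in the notebook would keep it as NaN.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

def daily_mean_m(hourly_mm):
    """hourly_mm: list of (datetime, value in mm or NaN) -> {date: daily mean in m}."""
    buckets = defaultdict(list)
    for t, v in hourly_mm:
        if not math.isnan(v):  # skip missing values when averaging
            buckets[t.date()].append(v / 1000.0)  # mm -> m
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}

def anomaly(daily, epoch_start, epoch_end):
    """Subtract the mean over [epoch_start, epoch_end] from every daily value."""
    epoch_vals = [v for d, v in daily.items() if epoch_start <= d <= epoch_end]
    epoch_mean = sum(epoch_vals) / len(epoch_vals)
    return {d: v - epoch_mean for d, v in daily.items()}

# Two synthetic days of hourly data: 1000 mm on day one, 1200 mm on day two
t0 = datetime(1993, 1, 1)
hourly = [(t0 + timedelta(hours=h), 1000.0 if h < 24 else 1200.0) for h in range(48)]
hourly[5] = (hourly[5][0], math.nan)  # one missing observation

daily = daily_mean_m(hourly)
# Use only the first day as the "epoch" so the result is easy to verify by hand
anom = anomaly(daily, t0.date(), t0.date())
print(daily)
print(anom)
```

With the first day as the reference, the anomaly is 0 m on day one and about 0.2 m on day two, and the single missing hour does not affect the daily mean.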

In [None]:
# Establish the period of record (POR): 1993-2022
start_date = dt.datetime(1993,1,1)
end_date = dt.datetime(2022,12,31)

# also save them as strings, for plotting
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')



In [None]:
# Extract the data for the period of record (POR)
tide_gauge_data_POR = rsl['sea_level'].sel(time=slice(start_date, end_date))

# Convert from mm to meters. NaN values are kept here and skipped by the daily mean below
tide_gauge_data_meters = tide_gauge_data_POR / 1000
# tide_gauge_data_clean = tide_gauge_data_meters.dropna(dim='time')

# Resample the tide gauge data to daily mean before subtracting the epoch mean
tide_gauge_daily_avg = tide_gauge_data_meters.resample(time='1D').mean()


In [None]:
tide_gauge_daily_avg

### Normalize the tide gauge data relative to the 1993-2012 epoch

In [None]:

# the epoch runs from the start of 1993 through the end of 2012, as stated in the heading above
epoch_start, epoch_end = start_date, '2012-12-31'
epoch_daily_avg = tide_gauge_daily_avg.sel(time=slice(epoch_start, epoch_end))
epoch_daily_mean = epoch_daily_avg.mean(dim='time')

In [None]:
tide_gauge_data_POR

In [None]:
# Subtract the epoch daily mean from the tide gauge daily average
rsl_daily = tide_gauge_daily_avg - epoch_daily_mean

# Set the attributes of the rsl_daily data
rsl_daily.attrs = tide_gauge_data_POR.attrs
rsl_daily.attrs['units'] = 'm'

# add lat and lon to the dataset
rsl_daily['lat'] = rsl['lat']
rsl_daily['lon'] = rsl['lon']

# add the station name
rsl_daily['station_name'] = rsl['station_name']

# change the variable name to sea level anomaly
rsl_daily.name = 'rsl_anomaly'

# change long name of the variable to sea level anomaly
rsl_daily.attrs['long_name'] = 'Sea Level Anomaly'
rsl_daily.attrs['epoch'] = '1993-2012'

rsl_daily

In [None]:
g = rsl_daily.plot(x='time', col='record_id', col_wrap=3, sharey=False, sharex=True, figsize=(15, 10))

# Use g.axs to iterate over the axes in the FacetGrid
for ax, rid in zip(g.axs.flat, rsl_daily.record_id):
    # Accessing the station_name coordinate for the current record_id directly
    station_name = rsl_daily.station_name.sel(record_id=rid).item()
    ax.set_title(station_name)

plt.show()


In [None]:
# Set tide gauge daily average relative to MHHW
rsl_daily_mhhw = tide_gauge_daily_avg - rsl['MHHW']/1000

# Set the attributes of the rsl_daily_mhhw data
rsl_daily_mhhw.attrs = tide_gauge_data_POR.attrs
rsl_daily_mhhw.attrs['long_name'] = 'water level above MHHW'
rsl_daily_mhhw.attrs['units'] = 'm'
# add lat and lon to the dataset
rsl_daily_mhhw['lat'] = rsl['lat']
rsl_daily_mhhw['lon'] = rsl['lon']
# add the station name
rsl_daily_mhhw['station_name'] = rsl['station_name']

rsl_daily_mhhw.name = 'rsl_mhhw'
# save rsl_daily to the data directory
# rsl_daily_mhhw.to_netcdf(data_dir / 'rsl_daily_hawaii_mhhw.nc')


#combine the two datasets
rsl_daily_combined = xr.merge([rsl_daily, rsl_daily_mhhw])

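# storm_time keeps dates from May onward unchanged and shifts
# dates from January through April back by one year
rsl_daily_combined['storm_time'] = rsl_daily_combined['time']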
rsl_daily_combined['storm_time'] = xr.DataArray(
    pd.to_datetime(rsl_daily_combined['storm_time'].values).to_series().apply(
        lambda x: x if x.month >= 5 else x - pd.DateOffset(years=1)
    ),
    dims=rsl_daily_combined['storm_time'].dims,
    coords=rsl_daily_combined['storm_time'].coords
)

#make storm time a coordinate
rsl_daily_combined = rsl_daily_combined.assign_coords(storm_time = rsl_daily_combined.storm_time)

# save rsl_daily_combined to the data directory
rsl_daily_combined.to_netcdf(data_dir / 'rsl_daily_hawaii.nc')

In [None]:
rsl_daily_combined
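
The storm_time shift can be checked on a few individual dates. The sketch below is a standard-library version of the same rule (the notebook itself uses pd.DateOffset), with a hypothetical function name `storm_time`. The handling of 29 February is an assumption: it falls back to 28 February of the previous year, which is the behaviour I would expect from subtracting a one-year pd.DateOffset.

```python
from datetime import date

def storm_time(d, first_month=5):
    """Return d unchanged from May onward, otherwise shifted back one year."""
    if d.month >= first_month:
        return d
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February
        return d.replace(year=d.year - 1, day=28)

examples = [date(2020, 4, 30), date(2020, 5, 1), date(2020, 2, 29)]
shifted = [storm_time(d) for d in examples]
for original, new in zip(examples, shifted):
    print(original, "->", new)
```

Grouping by the year of storm_time therefore places January through April in the same group as the preceding May through December.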


## Retrieve Climate Indices

In [None]:
# Download ENSO/ Oceanic Niño Index (ONI)

url = 'https://psl.noaa.gov/data/correlation/oni.data'

# parse the data into a timeseries
# use a raw string for the regex separator to avoid an invalid escape sequence warning
oni = pd.read_csv(url, sep=r'\s+', skiprows=1, header=None, names=['Year', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'])

# take only first 74 rows
oni = oni.head(74)

# Convert the 'Year' column to integer type
oni['Year'] = oni['Year'].astype(int)

# melt the dataframe to long format
oni = oni.melt(id_vars='Year', var_name='Month', value_name='ONI')

# Create a datetime column from the 'Year' and 'Month' columns, where month is the middle of the month
oni['Month'] = oni['Month'].astype(int)
oni['Day'] = 15
oni['Date'] = pd.to_datetime(oni[['Year', 'Month', 'Day']])

# Set the 'Date' column as the index
oni = oni.set_index('Date')

# Drop the 'Year', 'Month', and 'Day' columns
oni = oni.drop(columns=['Year', 'Month', 'Day'])

# make sure ONI is a float
oni['ONI'] = oni['ONI'].astype(float)


# sort the index by date
oni = oni.sort_index()

# First classify the ONI values as El Niño, La Niña, or Neutral
oni['ONI_event'] = oni['ONI'].apply(lambda x: 1 if x >= 0.5 else (-1 if x <= -0.5 else 0))


#for each month, sum ONI_event over that month and the next 4 months. A sum of 5 marks the start of a 5-month El Niño run, and a sum of -5 the start of a 5-month La Niña run
oni['ONI_event_duration_start'] = oni['ONI_event'].rolling(5).sum().shift(-4)

# Note this method will leave NaNs at the end of the series, but that's okay, because we are only interested in the start of the event
 
# flag the months that start a 5-month La Niña run (-5 in ONI_event_duration_start)
oni['La Nina'] = oni['ONI_event_duration_start'].apply(lambda x: 1 if x == -5 else 0)
# keep only the flags and carry each one forward over the following 4 months, so the whole run is marked
oni['La Nina'] = oni['La Nina'].where(oni['La Nina'] == 1).ffill(limit=4)

#do the same for El Nino
oni['El Nino'] = oni['ONI_event_duration_start'].apply(lambda x: 1 if x == 5 else 0)
# keep only the flags and carry each one forward over the following 4 months
oni['El Nino'] = oni['El Nino'].where(oni['El Nino'] == 1).ffill(limit=4)

#change the NaNs to 0 and turn Nino and Nina into boolean
oni['El Nino'] = oni['El Nino'].fillna(0).astype(bool)
oni['La Nina'] = oni['La Nina'].fillna(0).astype(bool)

#remove the intermediate columns
oni = oni.drop(columns=['ONI_event', 'ONI_event_duration_start'])

oni
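
The event rule used above (at least five consecutive months at or beyond the 0.5 threshold) can be tested on a short made-up series. The sketch below is a plain-Python version of the same logic, with hypothetical helper names `classify` and `flag_events`; the ONI values are invented.

```python
def classify(oni_values, threshold=0.5):
    """1 for warm months, -1 for cold months, 0 for neutral months."""
    return [1 if x >= threshold else (-1 if x <= -threshold else 0) for x in oni_values]

def flag_events(oni_values, sign, run=5):
    """Flag every month that belongs to a run of `run` months of the given sign."""
    states = classify(oni_values)
    flags = [False] * len(oni_values)
    for i in range(len(oni_values) - run + 1):
        # a window sums to sign * run only if all of its months have that sign
        if sum(states[i:i + run]) == sign * run:
            for k in range(i, i + run):
                flags[k] = True
    return flags

# Invented ONI values: five warm months in a row, then two cold months
sample = [0.1, 0.6, 0.7, 0.9, 1.0, 0.5, 0.2, -0.6, -0.7, -0.2]
el_nino = flag_events(sample, sign=1)
la_nina = flag_events(sample, sign=-1)
print(el_nino)
print(la_nina)
```

The five warm months are flagged as El Niño, while the two cold months are too short a run to count as La Niña.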


In [None]:
# plot it to see if we got it right
fig, ax = plt.subplots(figsize=(15, 5))

# Plotting the ONI timeseries
oni['ONI'].plot(ax=ax, color='black', label='ONI', legend=True)

# Shading El Niño events
ax.fill_between(oni.index, oni['ONI'].min(), oni['ONI'].max(), 
                where=oni['El Nino'] == 1, color='red', alpha=0.3, label='El Niño')

# Shading La Niña events
ax.fill_between(oni.index, oni['ONI'].min(), oni['ONI'].max(), 
                where=oni['La Nina'] == 1, color='blue', alpha=0.3, label='La Niña')

# Adding labels and title
ax.set_xlabel('Year')
ax.set_ylabel('ONI')
ax.set_title('Oceanic Niño Index (ONI) and El Niño/La Niña Events')

# Formatting the date axis to only show the year
ax.xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter('%Y'))

# Adding horizontal line at ONI = 0 for reference
ax.axhline(0, color='grey', linewidth=0.5, linestyle='--')

# Adding a legend
ax.legend(loc='upper left')

# set xlim to 1983 to 2023
ax.set_xlim(pd.Timestamp('1983-01-01'), pd.Timestamp('2023-12-31'))
ax.set_ylim(oni['ONI'].min(), oni['ONI'].max())

# Show the plot
plt.show()



In [None]:
# save to the data directory
oni.to_csv(data_dir / 'oni.csv')

In [43]:
# Read the Pacific Meridional Mode (PMM) data from the NOAA website
# NOTE: the '/tmp/' in this URL suggests a temporary file, so the link may no longer resolve
url = 'https://psl.noaa.gov/tmp/gcos_wgsp/data.137.110.216.75.254.20.11.4.txt'

# data is in date value format, with no header and a whitespace delimiter
# read it in with pandas

pmm = pd.read_csv(url, sep=r'\s+', header=None, names=['Date', 'PMM'])

# remove rows with missing values
pmm = pmm.dropna()

In [44]:
# Function to convert fraction of the year to datetime
def convert_fraction_to_date(date_float):
    # Extract year and fraction
    year = int(date_float)
    fraction = date_float - year
    
    # Calculate the total number of days in the year (consider leap years)
    days_in_year = 366 if (year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)) else 365
    
    # Calculate the day of the year based on the fraction
    day_of_year = int(np.floor(fraction * days_in_year)) + 1
    
    # Create a datetime object for January 1st of the given year
    start_date = pd.to_datetime(f"{year}-01-01")
    
    # Add the day offset to get the actual date
    return start_date + pd.to_timedelta(day_of_year - 1, unit='D')
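
As a cross-check of the conversion above, the sketch below repeats the same arithmetic with only the standard library and evaluates it for two easy cases. `fraction_to_date` is a hypothetical name, and the inputs are chosen for illustration rather than taken from the PMM file.

```python
import calendar
from datetime import date, timedelta

def fraction_to_date(year_fraction):
    """Convert a decimal year such as 2000.5 to a calendar date."""
    year = int(year_fraction)
    days_in_year = 366 if calendar.isleap(year) else 365
    # whole days elapsed since 1 January of that year
    offset = int((year_fraction - year) * days_in_year)
    return date(year, 1, 1) + timedelta(days=offset)

mid_leap_year = fraction_to_date(2000.5)
start_of_year = fraction_to_date(1999.0)
print(mid_leap_year, start_of_year)
```

Half of a 366-day year is 183 days, so 2000.5 lands on 2 July 2000, and a fraction of zero returns 1 January.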

In [45]:
# Apply the conversion function to the 'Date' column
pmm['Date'] = pmm['Date'].apply(convert_fraction_to_date)

# save to the data directory without the index
pmm.to_csv(data_dir / 'pmm.csv', index=False)