# Tabulate Grids

This Jupyter notebook reads pre-filtered ICOADS qualitative precipitation data and bins the observations into $1^\circ \times 1^\circ$ grid boxes (or, optionally, boxes of another size). The data are assumed to be stored as a `pandas` dataframe of ship observations, serialized to a `pickle` binary file.

In [None]:
# Import useful libraries
import numpy as np               # Numerical manipulations
import pandas as pd              # Data tables
import xarray as xr              # Gridded data
from os.path import join         # System interfacing
import pickle                    # Binary file input/output

In [None]:
# Export toggle: True if data will be exported to file
export = True 

In [None]:
data_dir = 'data/' # Directory for COADS data
export_dir = 'export/' # Directory to export data

First, we simply load the ship observations in and quickly examine the data to get a sense of what we're working with.

In [None]:
data_fp = join(data_dir, 'coads_1950-2019_filtered.pkl') # Filepath to a Pickle file containing ship data
with open(data_fp, 'rb') as f:
    df = pickle.load(f) # Load data

In [None]:
# Output a few example lines
display(df.head())

# Quickly check data
print('Min/max latitude in data:')
print(np.min(df.LAT), np.max(df.LAT))

print('Min/max longitude in data:')
print(np.min(df.LON), np.max(df.LON))

print('Data dimensions:')
print(df.shape)

print( 'Range of years:' )
print(df.YR.min(), '-', df.YR.max())

## Bin data

To bin the data, we first specify how the data will be binned. In this case, we will bin the data and aggregate counts in terms of years, months, latitudes (by 1 degree intervals), and longitudes (also by 1 degree intervals). **This section controls the binning intervals.**

In [None]:
# Spatial coordinates
dlat = 1 # Interval size of latitude
dlon = 1 # Interval size of longitude
lats = np.arange(-90,90,dlat) # Array of latitudes (bottom of grid cells)
lons = np.arange(0,360,dlon) # Array of longitudes (left of grid cells)

# Time coordinates
start_yr = df.YR.min() # The first year in the data
end_yr = df.YR.max() # The final year in the data
years = np.arange(start_yr,end_yr + 1) # Years in the data bounds
months = np.arange(1,13) # Months in the data bounds

Using the prescribed latitude and longitude bins, we then assign each observation the lower edge of the latitude bin and longitude bin it falls in. For $1^\circ$ bins, that value equals `floor()` of the latitude or longitude. A fixed binning could be computed more efficiently, but the method below lets the bins be changed to something other than $1^\circ \times 1^\circ$ boxes.

In [None]:
# Map longitudes and latitudes to common values for binning
lon_bin = np.digitize(df.LON.values, bins=lons)
lat_bin = np.digitize(df.LAT.values, bins=lats)

# Add these bin labels to the dataframe
df['LAT_BIN'] = lats[lat_bin-1]
df['LON_BIN'] = lons[lon_bin-1]
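
To make the mapping concrete, here is a minimal sketch on made-up latitudes, assuming a hypothetical $2^\circ$ bin width (`dlat_demo`, `lat_edges`, and `sample_lats` are illustrative names, not part of the notebook). It shows that `np.digitize` followed by `edges[idx - 1]` returns the lower edge of each cell, and compares that with the closed-form `floor()` expression.

```python
import numpy as np

# Hypothetical 2-degree bins (assumption; the notebook itself uses dlat = 1)
dlat_demo = 2
lat_edges = np.arange(-90, 90, dlat_demo)  # Lower edges of the latitude cells

# Made-up sample latitudes, including both poles
sample_lats = np.array([-90.0, -0.5, 0.0, 1.9, 2.0, 89.9, 90.0])

# np.digitize returns i with edges[i-1] <= x < edges[i], so i-1 indexes the lower edge
idx = np.digitize(sample_lats, bins=lat_edges)
lat_labels = lat_edges[idx - 1]

# Closed form for regularly spaced bins; it differs only at exactly 90 degrees,
# where digitize keeps the value in the last cell instead of opening a new one
floor_labels = (np.floor(sample_lats / dlat_demo) * dlat_demo).astype(int)

print(lat_labels)
print(floor_labels)
```

One caveat under this scheme: `np.digitize` returns 0 for values below the first edge, so `edges[idx - 1]` would wrap around to the last cell. That only matters if coordinates fall outside the bin range (for example, longitudes stored as negative values), which this sketch does not assume about the ICOADS file.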

Next, we aggregate the data into frequency counts based on various parameters of interest. Each subset will be stored as a new dataframe.

In [None]:
colgroups = ['YR','MO','LAT_BIN','LON_BIN'] # Columns to aggregate bins within

df_reports = df.groupby(colgroups).size().to_frame()                 # For all reports
df_precip = df.loc[df.WW >= 50].groupby(colgroups).size().to_frame() # For anything of at least a drizzle

In [None]:
# Adjust automated column headers to something more sensible...
# Names should be unique if we want to use a unified gridded dataset
df_reports.columns = ['total_count']
df_precip.columns = ['precip_count']
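
As a small illustration of the aggregation, the sketch below applies the same `groupby(...).size()` pattern to a made-up table. The column names follow the notebook; the values and the `demo_*` names are assumptions for illustration only.

```python
import pandas as pd

# Made-up observations; values are illustrative only
demo = pd.DataFrame({
    'YR':      [1950, 1950, 1950, 1950, 1951],
    'MO':      [1, 1, 1, 2, 1],
    'LAT_BIN': [10, 10, 10, 10, -5],
    'LON_BIN': [200, 200, 200, 200, 30],
    'WW':      [2, 61, 80, 10, 3],
})
keys = ['YR', 'MO', 'LAT_BIN', 'LON_BIN']

# Count all reports per (year, month, lat bin, lon bin)
demo_reports = demo.groupby(keys).size().to_frame('total_count')

# Count only reports with WW >= 50, the same threshold used above
demo_precip = demo.loc[demo.WW >= 50].groupby(keys).size().to_frame('precip_count')

print(demo_reports)
print(demo_precip)
```

Note that cells with reports but no precipitation do not appear in `demo_precip` at all; they only receive a count once the grid is completed in the conversion step.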

After the data is binned within the dataframe, we then convert the `pandas` dataframe into an `xarray` data array, which is tailored for gridded datasets. This allows us to map the data onto spatial coordinates.

In [None]:
# Convert to xarray and fill in any missing years, months, latitude, or longitude bins with zero counts
reindexer = {'YR': years, 'MO': months, 'LAT_BIN': lats, 'LON_BIN': lons}

ds_reports = df_reports.to_xarray().fillna(0).reindex(reindexer, fill_value=0)
ds_precip = df_precip.to_xarray().fillna(0).reindex(reindexer, fill_value=0)

# Merge data into a single xarray dataset
ds = xr.merge([ds_reports, ds_precip])

# Output a list of the variables included in the dataset
list(ds.keys())
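
The following minimal sketch shows why both `fillna(0)` and `reindex(..., fill_value=0)` are used, on a made-up two-level index (only `YR` and `LAT_BIN`, to keep it small; all values and `*_demo` names are assumptions). `to_xarray()` leaves NaN for index combinations that never occurred, and `reindex` adds coordinate values that are absent from the data entirely.

```python
import pandas as pd
import xarray as xr

# Made-up counts on a sparse (YR, LAT_BIN) index
counts_demo = pd.DataFrame(
    {'total_count': [3, 1]},
    index=pd.MultiIndex.from_tuples([(1950, 10), (1951, 12)], names=['YR', 'LAT_BIN']),
)

# 2 x 2 grid: the two combinations that never occurred become NaN
ds_demo = counts_demo.to_xarray()

# Replace NaN with 0, then extend to the full coordinate ranges with zero counts
full_coords = {'YR': [1950, 1951, 1952], 'LAT_BIN': [10, 11, 12]}
ds_full = ds_demo.fillna(0).reindex(full_coords, fill_value=0)

print(ds_full.total_count.values)
```

Because the NaN placeholders force a floating-point dtype, the counts remain floats after filling; cast with `.astype(int)` if integer counts are needed.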

The `xarray` dataset joining all variables can be exported as a netCDF file.

In [None]:
if export:
    ds.to_netcdf(path=join(export_dir, 'counts.nc'))

The `xarray` data arrays store frequency counts at each grid box for a given month and year. The counts can also be extracted as plain multi-dimensional `numpy` arrays and exported individually.

In [None]:
# Store count arrays
nreports = ds_reports.total_count.values
nprecip = ds_precip.precip_count.values

In [None]:
# Optionally, export the data
if export:
    # Use context managers so the files are closed after writing
    with open(join(export_dir, 'nreports.pkl'), 'wb') as f:
        pickle.dump(nreports, f)
    with open(join(export_dir, 'nprecip.pkl'), 'wb') as f:
        pickle.dump(nprecip, f)
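
For reference, a minimal sketch of reading such an array back, using a stand-in array and a temporary directory rather than the real export path (the file name `nreports_demo.pkl` and the array contents are assumptions). The exported arrays should keep the axis order of the grouped index: year, month, latitude, longitude.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in count array with axes (year, month, latitude, longitude)
demo_counts = np.zeros((2, 12, 180, 360))
demo_counts[0, 0, 100, 200] = 5

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_fp = os.path.join(tmp_dir, 'nreports_demo.pkl')  # Hypothetical file name

    # Write the array, then read it back
    with open(demo_fp, 'wb') as f:
        pickle.dump(demo_counts, f)
    with open(demo_fp, 'rb') as f:
        restored = pickle.load(f)

print(restored.shape)
```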