# Very short intro to loading NetCDF data in Python

Uses packages `netCDF4` (https://pypi.org/project/netCDF4/) and `xarray` (https://docs.xarray.dev/en/stable/).  

In [1]:
# file location:
filepath = '/projects/NS9853K/DATA/SFE/Forecasts/'
# file name:
filename = 'forecast_2022_5.nc4'

## Option 1: netCDF4 library

In [2]:
# data access
from netCDF4 import Dataset

In [13]:
# load data:
ds = Dataset(filepath + filename, mode='r')

print(ds.dimensions)

{'lon': <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 360, 'lat': <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 181, 'leadtime_month': <class 'netCDF4._netCDF4.Dimension'>: name = 'leadtime_month', size = 5, 'variable': <class 'netCDF4._netCDF4.Dimension'>: name = 'variable', size = 10}


### Data dimensions

The data are on a grid covering the whole globe at a resolution of 1 degree: there are 360 coordinates along the longitude dimension `lon` and 181 coordinates along the latitude dimension `lat`. The dimension `leadtime_month` gives the calendar month that the forecast is valid for, e.g. a value of 6 means that the global field at that coordinate value is the forecast for June. The `variable` dimension runs from 1 to 10 and indicates the meteorological parameter. Its `long_name` attribute lists which coordinate value belongs to which parameter:

In [12]:
ds.variables['variable'].long_name

'Variables indices are 1: 2m_temperature,  2: total_precipitation,  3: mean_sea_level_pressure,  4: sea_surface_temperature,  5: snowfall,  6: 10m_wind_speed,  7: u_component_of_wind_850hPa,  8: v_component_of_wind_850hPa,  9: 10m_u_component_of_wind,  10: 10m_v_component_of_wind'
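If you would rather look parameters up by name than remember their index, the `long_name` text can be turned into a dictionary. The following is only a small sketch: the string is copied from the output above, and the names `index_to_name` and `name_to_index` are made up for this illustration.

```python
import re

# Attribute text as printed above; in the notebook it would come from
# ds.variables['variable'].long_name
long_name = (
    "Variables indices are 1: 2m_temperature,  2: total_precipitation,  "
    "3: mean_sea_level_pressure,  4: sea_surface_temperature,  5: snowfall,  "
    "6: 10m_wind_speed,  7: u_component_of_wind_850hPa,  "
    "8: v_component_of_wind_850hPa,  9: 10m_u_component_of_wind,  "
    "10: 10m_v_component_of_wind"
)

# Each entry has the form "<index>: <name>"; the names consist only of
# letters, digits and underscores, so \w+ captures a full name.
pairs = re.findall(r"(\d+):\s*(\w+)", long_name)

index_to_name = {int(index): name for index, name in pairs}
name_to_index = {name: index for index, name in index_to_name.items()}

print(name_to_index["snowfall"])
```

With this, `name_to_index["2m_temperature"]` gives the coordinate value to use along the `variable` dimension.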

The coordinates can be loaded as shown below. Slicing with `[:]` returns NumPy arrays (by default, netCDF4 returns them as masked arrays).

In [None]:
lon = ds.variables['lon'][:]
lat = ds.variables['lat'][:]
leadtime_month = ds.variables['leadtime_month'][:]
variables = ds.variables['variable'][:]
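With the netCDF4 library, data are indexed by position rather than by coordinate value, so the coordinate arrays are mainly useful for finding the right positions. Below is a minimal sketch of that lookup. It uses stand-in arrays with the coordinate values reported for this file, `nearest_index` is a hypothetical helper, and the chosen point (roughly Oslo) is just an illustration.

```python
import numpy as np

# Stand-ins for the coordinate arrays loaded above, using the values
# reported for this file.
lon = np.arange(-179.0, 181.0)          # 360 values: -179 ... 180
lat = np.arange(-90.0, 91.0)            # 181 values: -90 ... 90
leadtime_month = np.arange(6.0, 11.0)   # 5 values: 6 ... 10
variables = np.arange(1, 11)            # 10 values: 1 ... 10


def nearest_index(coord, value):
    """Return the position of the coordinate entry closest to `value`."""
    return int(np.abs(coord - value).argmin())


i_var = nearest_index(variables, 1)          # 1 = 2m_temperature
i_month = nearest_index(leadtime_month, 6)   # 6 = June
i_lat = nearest_index(lat, 59.9)             # illustrative point near Oslo
i_lon = nearest_index(lon, 10.7)

print(i_var, i_month, i_lat, i_lon)

# Assuming the data variables are stored with the dimension order
# (variable, leadtime_month, lat, lon), a single value could be read as:
# ds.variables['mean_standardized_anomaly'][i_var, i_month, i_lat, i_lon]
```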

As you can see, we currently only have monthly forecast values. If needed, we can provide daily values here at some point.

### Data variables

The variables saved in the file are the coordinates of the data plus a number of different statistics of the forecast ensemble. (These variables are not to be confused with the dimension called `variable`. The naming could be something we change in the future.)

In [25]:
ds.variables.keys()

dict_keys(['lon', 'lat', 'leadtime_month', 'variable', 'mean_standardized_anomaly', 'Q_standard10', 'Q_standard25', 'Q_standard50', 'Q_standard75', 'Q_standard90', 'ExceedQ33', 'ExceedQ50', 'ExceedQ67', 'LandMask'])

For instance, `ExceedQ50` gives the predicted probability that the meteorological parameter, in a given forecast month and at a given point on Earth, exceeds the 50th percentile of the climatological distribution of that parameter. This climatological distribution was estimated from forecasts of past dates (so-called hindcasts) that cover the past 30 years. As an example, if this value for the parameter temperature were close to 1 for June, it would mean that the forecast is very confident that June will be warmer than normal.
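To make the idea of an exceedance probability concrete, here is a rough sketch for a single grid point and month. All numbers are invented, and counting ensemble members above the threshold is simply the most basic empirical estimate. It is an assumption, not necessarily how the values in the file were computed.

```python
import numpy as np

# Invented values for one grid point and one forecast month.
hindcast = np.array([11.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0, 16.0, 17.5])
ensemble = np.array([13.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.0, 18.0])

# 50th percentile of the climatological (hindcast) distribution
q50 = np.percentile(hindcast, 50)

# Fraction of current ensemble members that lie above that threshold
exceed_q50 = float(np.mean(ensemble > q50))

print(q50, exceed_q50)
```

Here seven of the eight invented ensemble members lie above the hindcast median, so the estimated probability is high.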

`mean_standardized_anomaly` is the mean of all the values produced by the forecast "ensemble". However, as the name indicates, it is not an absolute value (e.g. temperature will not be in degrees Celsius). Instead, the values that the models produce are standardized by subtracting the model's mean and dividing by the model's standard deviation, and it is these standardized values that are stored in this variable. To illustrate: a value of -1 for temperature for forecast month 7 at a certain lon/lat point means that the temperature in July is predicted to be one standard deviation below average July temperatures (at that particular point).

`Q_standard25` gives the 25th percentile of the ensemble distribution, again given in units of standard deviations (as for `mean_standardized_anomaly`).
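The standardization can be sketched in a few lines of numpy. The numbers below are invented, and taking the mean and standard deviation from a hindcast sample (with `ddof=1`) is an assumption made for this illustration. The details of how the file was actually produced are not described here.

```python
import numpy as np

# Invented July temperatures (degrees Celsius) at one grid point.
hindcast = np.array([14.0, 15.0, 16.0, 17.0, 18.0])   # climatology sample
ensemble = np.array([13.5, 14.5, 15.5])               # current forecast members

clim_mean = hindcast.mean()
clim_std = hindcast.std(ddof=1)   # sample standard deviation (assumed here)

# Express every ensemble member in units of standard deviations
standardized = (ensemble - clim_mean) / clim_std

# Analogues of `mean_standardized_anomaly` and `Q_standard25`
mean_standardized_anomaly = standardized.mean()
q_standard25 = np.percentile(standardized, 25)

print(mean_standardized_anomaly, q_standard25)
```

In this toy case the ensemble mean is a little less than one standard deviation below the climatological mean, so the result is close to -1.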

At some point we can talk about the kind of information you need. Perhaps you need actual absolute values of temperature, precipitation etc. Also, the forecast data here are statistics computed from a collection of many different single forecasts (260 in total, a so-called "multi-model ensemble"). So, for every forecast time step there is actually a whole distribution of values. Let me know if the full set of forecasts would be more useful and we can provide it. As you can imagine, those files will be a lot larger!

## Option 2: xarray

xarray is a useful package for gridded geographical data stored in NetCDF files. It is based on numpy and pandas and has the advantage that files can be opened "lazily": with `xr.open_dataset` the file is opened without reading the data into memory, and values are only read once they are actually needed (e.g. for a computation, when plotting, saving etc.). This protects the memory from overload, since the files can often be very big. Note that `xr.load_dataset`, used below, reads the whole file into memory straight away. I use xarray a lot for NetCDF files.

In [30]:
import xarray as xr

In [40]:
# loading is as simple as:
ds_xr = xr.load_dataset(filepath + filename)

print(ds_xr)

<xarray.Dataset>
Dimensions:                    (lon: 360, lat: 181, leadtime_month: 5,
                                variable: 10)
Coordinates:
  * lon                        (lon) float64 -179.0 -178.0 ... 179.0 180.0
  * lat                        (lat) float64 -90.0 -89.0 -88.0 ... 89.0 90.0
  * leadtime_month             (leadtime_month) float64 6.0 7.0 8.0 9.0 10.0
  * variable                   (variable) int32 1 2 3 4 5 6 7 8 9 10
Data variables:
    mean_standardized_anomaly  (variable, leadtime_month, lat, lon) float32 0...
    Q_standard10               (variable, leadtime_month, lat, lon) float32 -...
    Q_standard25               (variable, leadtime_month, lat, lon) float32 -...
    Q_standard50               (variable, leadtime_month, lat, lon) float32 0...
    Q_standard75               (variable, leadtime_month, lat, lon) float32 1...
    Q_standard90               (variable, leadtime_month, lat, lon) float32 1...
    ExceedQ33                  (variable, leadtime_mo

Data can be accessed as follows:

In [38]:
ds_xr.mean_standardized_anomaly

The data are held in a so-called `xr.Dataset`, which is basically a multi-dimensional analogue of a pandas dataframe. Single arrays inside an `xr.Dataset` are called `xr.DataArray` and can easily be converted to pandas dataframes in case that's what you need. But xarray also has many operations built in.

In [46]:
# example: print mean standardized anomaly of temperature for forecast month June:
print(ds_xr.mean_standardized_anomaly.sel(variable=1,leadtime_month=6).to_dataframe())

              leadtime_month  variable  mean_standardized_anomaly
lat   lon                                                        
-90.0 -179.0             6.0         1                   0.434989
      -178.0             6.0         1                   0.434989
      -177.0             6.0         1                   0.434989
      -176.0             6.0         1                   0.434989
      -175.0             6.0         1                   0.434989
...                      ...       ...                        ...
 90.0  176.0             6.0         1                   0.497094
       177.0             6.0         1                   0.497094
       178.0             6.0         1                   0.497094
       179.0             6.0         1                   0.497094
       180.0             6.0         1                   0.497094

[65160 rows x 3 columns]
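Besides converting to a dataframe, selection and reduction can be done directly in xarray. The sketch below uses a synthetic `xr.DataArray` with the same dimensions and coordinates as the file, filled with random numbers, so that it runs without access to the data. The selected point is an arbitrary illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the same dimensions and coordinates as the file;
# the values are random and carry no meaning.
rng = np.random.default_rng(0)
coords = {
    "variable": np.arange(1, 11),
    "leadtime_month": np.arange(6.0, 11.0),
    "lat": np.arange(-90.0, 91.0),
    "lon": np.arange(-179.0, 181.0),
}
da = xr.DataArray(
    rng.standard_normal((10, 5, 181, 360)).astype("float32"),
    coords=coords,
    dims=("variable", "leadtime_month", "lat", "lon"),
    name="mean_standardized_anomaly",
)

# Exact-label selection: temperature (variable 1) for June (month 6)
june_t2m = da.sel(variable=1, leadtime_month=6)

# Nearest-neighbour selection of a single grid point
point = june_t2m.sel(lat=59.9, lon=10.7, method="nearest")

# Reduction over named dimensions: one (unweighted) mean per forecast month
monthly_mean = da.sel(variable=1).mean(dim=("lat", "lon"))

print(point.lat.item(), point.lon.item(), monthly_mean.shape)
```

With the real file, `ds_xr.mean_standardized_anomaly` would take the place of `da`. Keep in mind that a plain mean over `lat` and `lon` is not area-weighted, so it should not be read as a true global average.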
