<code>calculate_S2S_model_bias.ipynb</code>.  This notebook calculates bias (model - obs) in sea ice extent for each S2S model as a function of forecast month and region. 

In [1]:
import xarray as xr
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from S2S_sea_ice_preprocess import load_model, create_aggregate_regions, create_model_climatology
from S2S_sea_ice_preprocess import create_obs_climatology 

## Overview

<li>1) Load model netCDF files, combine with CTRL, and use the common reforecast period. <br>
If NCEP, use the entire period. </li>
<li> 2) Add aggregate regions </li>
<li> 3) Create model climatology: calculate day of year for the valid date and lead time in weeks. </li>
<li> 4) Create observed climatology (static, using only the common reforecast period) </li>
<li> 5) Calculate bias at the desired lead period (0 - <code>max_lead</code>) for each region, in each model, as a function of forecast month  
    $$SIE_{bias} = \overline{SIE_{model}(m,date)} - SIE_{obs}(m,date),$$
    where the overline indicates averaging over lead days 0 - <code>max_lead</code> </li>
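
As a rough illustration of step 5, the sketch below applies the bias definition to a tiny made-up table. The numbers, the single implied region, and the names `model`, `obs`, `merged`, and `bias_by_month` are all assumptions for illustration; only the column names follow this notebook. It averages the model-minus-obs differences over lead days 0 to `max_lead`, which is one reading of the equation above.

```python
import pandas as pd

# Toy reforecasts: two init dates, lead days 0-2 (values are made up)
model = pd.DataFrame({
    'valid date': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                                  '2000-01-02', '2000-01-03', '2000-01-04']),
    'lead time (days)': [0, 1, 2, 0, 1, 2],
    'SIE': [10.0, 10.4, 10.9, 10.2, 10.6, 11.0],
})
# Toy observations on the same valid dates (also made up)
obs = pd.DataFrame({
    'valid date': pd.to_datetime(['2000-01-01', '2000-01-02',
                                  '2000-01-03', '2000-01-04']),
    'SIE': [10.0, 10.1, 10.3, 10.5],
})

max_lead = 1  # keep lead days 0..max_lead

# Pair each forecast with the observation on its valid date
merged = model.merge(obs, on='valid date', suffixes=('_model', '_obs'))
merged = merged[merged['lead time (days)'] <= max_lead]

# Bias = model - obs, averaged by the month of the valid date
merged['bias'] = merged['SIE_model'] - merged['SIE_obs']
merged['valid month'] = merged['valid date'].dt.month
bias_by_month = merged.groupby('valid month')['bias'].mean()
print(bias_by_month)
```

A positive value means the model has more ice extent than observed at those lead times.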

In [2]:
model_names_ALL = ['ecmwf','ncep','ukmo','metreofr']
obs_name = 'NSIDC_0079'
COMMON_RF = True # we want to compare the reforecasts to obs over the same 15 year period
MAX_LEAD = 1 #max lead in days

Here we want to look at all the models at once.  So we'll load each model, calculate the aggregate regions, get the ensemble mean, and create the climatology before combining everything into one dataframe.

In [3]:
SIE_df_ALL = pd.DataFrame()
SIE_df_weekly_ALL = pd.DataFrame()
for model_name in model_names_ALL:
    print('loading ',model_name)
    # Load
    SIE = load_model(model_name)
    print('loaded ',model_name)
    # Create aggregate regions
    SIE = create_aggregate_regions(SIE)
    print('combined regions')
    # Take ensemble mean and get lead time in days
    SIE_ens_mean = SIE.mean(dim='ensemble')
    regions = SIE.region_names
    lead_days = SIE.fore_time.dt.days
    # Convert to dataframe, rename some columns, and get the date of the forecast by adding the fore_time to init_date
    SIE_df = SIE_ens_mean.to_dataframe().reset_index()
    SIE_df['valid date'] = SIE_df['init_time'] + SIE_df['fore_time']
    SIE_df = SIE_df.rename(columns={'region_names':'region',
                               'fore_time':'lead time (days)',
                               'init_time':'init date',
                               'Extent':'SIE'})
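    # Create climatology
    SIE_df = create_model_climatology(SIE_df,7)
    SIE_df['model name'] = model_name
    
    # Add this model to the combined dataframe (DataFrame.append was removed in pandas 2.0)
    SIE_df_ALL = pd.concat([SIE_df_ALL, SIE_df])
    #SIE_df_weekly_ALL = pd.concat([SIE_df_weekly_ALL, SIE_df_weekly])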

loading  ecmwf
<xarray.Dataset>
Dimensions:       (ensemble: 10, fore_time: 46, init_time: 2080, nregions: 15)
Coordinates:
    region_names  (nregions) object dask.array<chunksize=(15,), meta=np.ndarray>
  * fore_time     (fore_time) timedelta64[ns] 0 days 1 days ... 44 days 45 days
  * ensemble      (ensemble) int32 0 1 2 3 4 5 6 7 8 9
  * nregions      (nregions) int64 99 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  * init_time     (init_time) datetime64[ns] 1998-08-06 ... 2018-08-01
Data variables:
    Extent        (ensemble, init_time, fore_time, nregions) float64 dask.array<chunksize=(10, 1, 46, 15), meta=np.ndarray>
loaded  ecmwf
combined regions
loading  ncep
<xarray.Dataset>
Dimensions:       (ensemble: 3, fore_time: 43, init_time: 4523, nregions: 15)
Coordinates:
    region_names  (nregions) object dask.array<chunksize=(15,), meta=np.ndarray>
  * fore_time     (fore_time) timedelta64[ns] 1 days 2 days ... 42 days 43 days
  * ensemble      (ensemble) int32 0 1 2
  * nregions      (nre

Load obs

In [4]:
obs_type = 'sipn_nc_yearly_agg'
filepath = '/home/disk/sipn/nicway/data/obs/{model_name}/{model_type}/'.format(model_name=obs_name,
                                                                              model_type=obs_type)
obs_filenames = xr.open_mfdataset(filepath+'/*.nc',combine='by_coords')
print('opening ',obs_filenames)
obs_SIE = obs_filenames.Extent
obs_regions = obs_filenames.nregions
obs_region_names = obs_filenames['region_names'].values
# Drop region names and re-add as a non-dask.array object.  This is stupid but oh well
obs_SIE = obs_SIE.drop_vars('region_names')
obs_SIE["region_names"] = ("nregions",obs_region_names)
print('obs loaded')

opening  <xarray.Dataset>
Dimensions:       (nregions: 15, time: 11322)
Coordinates:
    region_names  (nregions) object dask.array<chunksize=(15,), meta=np.ndarray>
  * nregions      (nregions) int64 99 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  * time          (time) datetime64[ns] 1989-01-01 1989-01-02 ... 2019-12-31
Data variables:
    Extent        (time, nregions) float64 dask.array<chunksize=(365, 15), meta=np.ndarray>
obs loaded


Add aggregate regions to obs and convert obs to Pandas dataframe

In [5]:
obs_SIE = create_aggregate_regions(obs_SIE)
obs_SIE = obs_SIE.to_dataframe().reset_index()
obs_SIE = obs_SIE.rename(columns={'Extent':'SIE','region_names':'region','time':'valid date'})


Calculate our observed climatology using either the full period or the common reforecast period only

In [6]:
if COMMON_RF:
    # Keep only the common reforecast years (1999-2014 inclusive)
    obs_SIE = obs_SIE[pd.to_datetime(obs_SIE['valid date']).dt.year.isin(np.arange(1999,2015))]
    time_str = 'COMMON_RF'
    print('common reforecast')
else:
    time_str = 'FULL_PERIOD'
    print('full period')
obs_SIE = create_obs_climatology(obs_SIE)
print('observed climatology created')

common reforecast
observed climatology created
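
The body of `create_obs_climatology` lives in `S2S_sea_ice_preprocess` and is not shown here. As a minimal sketch of what a static day-of-year climatology could look like, the function below is a hypothetical stand-in (`toy_obs_climatology`, the region name `'panArctic'`, and the synthetic data are all assumptions, not the module's actual code). It produces the `SIE clim` and `SIE anom` columns that later cells rely on.

```python
import numpy as np
import pandas as pd

def toy_obs_climatology(df):
    """Hypothetical stand-in for create_obs_climatology: day-of-year mean per region."""
    out = df.copy()
    out['valid day of year'] = pd.to_datetime(out['valid date']).dt.dayofyear
    grouped = out.groupby(['region', 'valid day of year'])['SIE']
    out['SIE clim'] = grouped.transform('mean')     # mean over all years for that day
    out['SIE anom'] = out['SIE'] - out['SIE clim']  # departure from the climatology
    return out

# Synthetic daily series: a seasonal cycle plus a small year-to-year trend
toy = pd.DataFrame({'valid date': pd.date_range('1999-01-01', '2001-12-31', freq='D')})
toy['region'] = 'panArctic'  # made-up region name
doy = toy['valid date'].dt.dayofyear
year = toy['valid date'].dt.year
toy['SIE'] = 10 + np.cos(2 * np.pi * doy / 365.25) + 0.1 * (year - 2000)

toy_clim = toy_obs_climatology(toy)
print(toy_clim.head())
```

Because the climatology is the mean over the years in the table, restricting the years first (as the `COMMON_RF` branch does) changes the baseline that the anomalies are measured against.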


In [7]:
obs_SIE['model name'] = obs_name

Group by model name, region, lead time (for model output only), and the forecast valid date, and subtract the observed SIE from the model prediction of SIE

In [8]:
cols = ['SIE','SIE clim','SIE anom']
# Select columns with a list; indexing a groupby with a bare tuple of keys is deprecated
SIE_model_gb = SIE_df_ALL.groupby(['region','valid date','model name','lead time (days)'])[cols].mean()
SIE_obs_gb = obs_SIE.groupby(['region','valid date'])[cols].mean()
# Error = model - obs, aligned on the shared index levels (region, valid date)
SIE_err = SIE_model_gb - SIE_obs_gb
# Fractional error relative to obs (converted to percent in the next cell)
SIE_err_pct = SIE_err.divide(SIE_obs_gb)



In [9]:
SIE_err_pct = SIE_err_pct*100
SIE_err[['SIE pct','SIE clim pct','SIE anom pct']] = SIE_err_pct
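
One thing that may be worth checking, as an assumption about the data rather than something this notebook states: where the observed value is zero (for example, an ice-free region), the percent error is undefined and pandas returns `inf` or `NaN`. The sketch below uses made-up numbers and hypothetical names (`err`, `obs`, `pct`) to show one way of masking those entries before further averaging.

```python
import numpy as np
import pandas as pd

# Made-up errors and observed values; the last two obs are zero
err = pd.DataFrame({'SIE': [0.5, -0.2, 0.1, 0.0]})
obs = pd.DataFrame({'SIE': [10.0, 4.0, 0.0, 0.0]})

# Percent error = 100 * (model - obs) / obs
pct = err.divide(obs) * 100

# Division by zero yields inf (or NaN for 0/0); treat both as missing
pct = pct.replace([np.inf, -np.inf], np.nan)
print(pct)
```

Masked entries are then skipped by default in pandas reductions such as `mean()`.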

In [10]:
SIE_err_rs = SIE_err.reset_index()
SIE_err_rs['valid month'] = pd.to_datetime(SIE_err_rs['valid date']).dt.month

In [11]:
fname_save = '../DATA/RAW_ERRORS_all_S2S_models_OBS_{obs_name}_{time_str}.csv'.format(obs_name=obs_name,time_str=time_str)
SIE_err_rs.to_csv(fname_save)