## Setup


In this first cell we'll load the necessary libraries and set up some logging and display options.

In [2]:
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

%matplotlib inline

Next we'll load our flow variables and time tendency forcings datasets into Xarray Dataset objects.

In [3]:
ds_h0 = xr.open_dataset('C:/home/cam_learn/fv091x180L26_dry_HS.cam.h0.2000-12-27-00000_lowres.nc', decode_times=False)
ds_h1 = xr.open_dataset('C:/home/cam_learn/fv091x180L26_dry_HS.cam.h1.2000-12-27-00000_lowres.nc', decode_times=False)

In [4]:
ds_h0

<xarray.Dataset>
Dimensions:       (ilev: 27, lat: 12, lev: 26, lon: 23, nbnd: 2, slat: 90, slon: 180, time: 720)
Coordinates:
  * ilev          (ilev) float64 2.194 4.895 9.882 18.05 29.84 44.62 61.61 ...
  * lat           (lat) float64 -90.0 -74.0 -58.0 -42.0 -26.0 -10.0 6.0 22.0 ...
  * lev           (lev) float64 3.545 7.389 13.97 23.94 37.23 53.11 70.06 ...
  * lon           (lon) float64 0.0 16.0 32.0 48.0 64.0 80.0 96.0 112.0 ...
  * slat          (slat) float64 -89.0 -87.0 -85.0 -83.0 -81.0 -79.0 -77.0 ...
  * slon          (slon) float64 -1.0 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 ...
  * time          (time) float64 0.0 0.02083 0.04167 0.0625 0.08333 0.1042 ...
Dimensions without coordinates: nbnd
Data variables:
    P0            float64 ...
    PS            (time, lat, lon) float32 ...
    T             (time, lev, lat, lon) float32 ...
    U             (time, lev, lat, lon) float32 ...
    V             (time, lev, lat, lon) float32 ...
    ch4v

In [5]:
ds_h1

<xarray.Dataset>
Dimensions:       (ilev: 27, lat: 12, lev: 26, lon: 23, nbnd: 2, slat: 90, slon: 180, time: 720)
Coordinates:
  * ilev          (ilev) float64 2.194 4.895 9.882 18.05 29.84 44.62 61.61 ...
  * lat           (lat) float64 -90.0 -74.0 -58.0 -42.0 -26.0 -10.0 6.0 22.0 ...
  * lev           (lev) float64 3.545 7.389 13.97 23.94 37.23 53.11 70.06 ...
  * lon           (lon) float64 0.0 16.0 32.0 48.0 64.0 80.0 96.0 112.0 ...
  * slat          (slat) float64 -89.0 -87.0 -85.0 -83.0 -81.0 -79.0 -77.0 ...
  * slon          (slon) float64 -1.0 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 ...
  * time          (time) float64 0.0 0.02083 0.04167 0.0625 0.08333 0.1042 ...
Dimensions without coordinates: nbnd
Data variables:
    P0            float64 ...
    PTTEND        (time, lev, lat, lon) float32 ...
    PUTEND        (time, lev, lat, lon) float32 ...
    PVTEND        (time, lev, lat, lon) float32 ...
    ch4vmr        (time) float64 ...
    co2vmr        

Look at the time variable in order to work out the initial date, number of steps, units, etc.

In [6]:
ds_h0.variables['time']

<xarray.IndexVariable 'time' (time: 720)>
array([ 0.      ,  0.020833,  0.041667, ..., 14.9375  , 14.958333, 14.979167])
Attributes:
    long_name:  time
    units:      days since 2000-12-27 00:00:00
    calendar:   noleap
    bounds:     time_bnds
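
The step size and total span can be read off the time axis with a little arithmetic. The following is a minimal sketch that uses a synthetic array (720 steps of 1/48 day) as a stand-in for `ds_h0.variables['time'].values`; the stand-in is an assumption chosen to match the values printed above.

```python
import numpy as np

# Synthetic stand-in for ds_h0.variables['time'].values: 720 steps of 1/48 day.
times = np.arange(720) / 48.0

step_days = np.diff(times)                 # spacing between consecutive samples, in days
step_minutes = step_days.mean() * 24 * 60  # convert days to minutes
span_days = times[-1] + step_days.mean()   # covered period, counting the last step

print(f"steps: {times.size}, step: {step_minutes:.1f} min, span: {span_days:.1f} days")
```

With these values the axis works out to 30-minute steps covering 15 days from the start date given in the `units` attribute.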

Make sure we have the same time values for the targets data.

In [7]:
if (ds_h0.variables['time'].values != ds_h1.variables['time'].values).any():
    print('ERROR: Non-matching time values')
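
An element-wise `!=` comparison assumes both arrays already have the same length. As a hedged alternative, the sketch below wraps `np.array_equal`, which returns `False` when the shapes differ instead of failing; the helper name `times_match` and the small sample arrays are hypothetical.

```python
import numpy as np

def times_match(a, b):
    """Return True only if both time axes have the same shape and values."""
    a = np.asarray(a)
    b = np.asarray(b)
    return bool(np.array_equal(a, b))

# Small synthetic time axes standing in for the two datasets' 'time' values.
features_time = np.arange(6) / 48.0
targets_time = np.arange(6) / 48.0

print(times_match(features_time, targets_time))        # identical axes
print(times_match(features_time, targets_time[:-1]))   # one axis is shorter
```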

Create an array of datetime values from the times.

In [8]:
from datetime import datetime, timedelta
times = ds_h0.variables['time'].values.flatten()
initial = datetime(2000, 12, 27)
datetimes = np.empty(shape=times.shape, dtype='datetime64[m]')
for i in range(datetimes.size):
    datetimes[i] = initial + timedelta(days=times[i])
timestamps = pd.Series(datetimes)
timestamps.head()

0   2000-12-27 00:00:00
1   2000-12-27 00:30:00
2   2000-12-27 01:00:00
3   2000-12-27 01:30:00
4   2000-12-27 02:00:00
dtype: datetime64[ns]
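
The same conversion can be done without a Python loop. This is a sketch under two assumptions: the time values are whole multiples of a minute, and the standard calendar is acceptable. The dataset declares a `noleap` calendar, but a 15-day run starting on 27 December never reaches 29 February, so the two calendars agree here.

```python
import numpy as np
import pandas as pd

# Stand-in for the first few 'time' values, in days since 2000-12-27.
times = np.arange(5) / 48.0

# Round to whole minutes first so floating-point noise cannot shift a timestamp.
minutes = np.round(times * 24 * 60).astype(np.int64)
datetimes = np.datetime64('2000-12-27T00:00', 'm') + minutes.astype('timedelta64[m]')

timestamps = pd.Series(datetimes)
print(timestamps.head())
```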

## Feature and target selection

As features we'll use the following flow variables:

* U (west-east (zonal) wind, m/s)
* V (south-north (meridional) wind, m/s)
* T (temperature, K)
* PS (surface pressure, Pa)

Time tendency forcings are the targets (labels) that our model should learn to predict.

* PTTEND (time tendency of the temperature)
* PUTEND (time tendency of the zonal wind)
* PVTEND (time tendency of the meridional wind)

Eventually we'll train/fit our model for an entire global 3-D grid, but for this example we'll select all lat/lon/time combinations for a single vertical level (index 0 along the `lev` dimension).

In [11]:
ps = pd.Series(ds_h0.variables['PS'].values[:, :, :].flatten())
t = pd.Series(ds_h0.variables['T'].values[:, 0, :, :].flatten())
u = pd.Series(ds_h0.variables['U'].values[:, 0, :, :].flatten())
v = pd.Series(ds_h0.variables['V'].values[:, 0, :, :].flatten())
pttend = pd.Series(ds_h1.variables['PTTEND'].values[:, 0, :, :].flatten())
putend = pd.Series(ds_h1.variables['PUTEND'].values[:, 0, :, :].flatten())
pvtend = pd.Series(ds_h1.variables['PVTEND'].values[:, 0, :, :].flatten())
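
Each flattened array holds one value per (time, lat, lon) combination, which for this dataset is 720 × 12 × 23 = 198,720 values, whereas the time axis has only 720 entries, so a 720-entry timestamp Series cannot label every row on its own. One possible way to keep the row labels aligned is to expand the coordinates in the same order that `flatten()` uses. The sketch below demonstrates this on a small synthetic grid; the grid sizes and values are illustrative assumptions, not data from the files.

```python
import numpy as np
import pandas as pd

# Small synthetic grid standing in for (time: 720, lat: 12, lon: 23).
n_time, n_lat, n_lon = 3, 2, 4
field = np.arange(n_time * n_lat * n_lon, dtype=np.float32).reshape(n_time, n_lat, n_lon)
times = pd.date_range('2000-12-27', periods=n_time, freq='30min')
lats = np.linspace(-90.0, 90.0, n_lat)
lons = np.arange(n_lon) * 16.0

# flatten() uses C order: lon varies fastest, then lat, then time.
flat = field.flatten()
time_idx = np.repeat(times.values, n_lat * n_lon)   # each time repeated per grid point
lat_idx = np.tile(np.repeat(lats, n_lon), n_time)   # each lat repeated per lon, per time
lon_idx = np.tile(lons, n_time * n_lat)             # lon cycle repeated for every row

df = pd.DataFrame({'timestamp': time_idx, 'lat': lat_idx, 'lon': lon_idx, 'PS': flat})
print(df.head())
```

With labels built this way, every row carries the time and location of the grid point it came from.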

Convert to Pandas DataFrames containing inputs (features) and outputs (label/target) for use when predicting time tendency forcings.

In [12]:
df_features = pd.DataFrame({'timestamp': timestamps,
                            'PS': ps,
                            'T': t,
                            'U': u,
                            'V': v})
df_features.set_index('timestamp', inplace=True)
df_features.head()

| timestamp | PS | T | U | V |
|---|---|---|---|---|
| 2000-12-27 00:00:00 | 101099.0625 | 210.862564 | -0.814972 | -0.28067 |
| 2000-12-27 00:30:00 | 101099.0625 | 210.862564 | -0.706038 | -0.494434 |
| 2000-12-27 01:00:00 | 101099.0625 | 210.862564 | -0.542403 | -0.669891 |
| 2000-12-27 01:30:00 | 101099.0625 | 210.862564 | -0.336744 | -0.793447 |
| 2000-12-27 02:00:00 | 101099.0625 | 210.862564 | -0.104996 | -0.85553 |


In [13]:
df_targets = pd.DataFrame({'timestamp': timestamps,
                           'PTTEND': pttend,
                           'PUTEND': putend,
                           'PVTEND': pvtend})
df_targets.set_index('timestamp', inplace=True)
df_targets.head()

| timestamp | PTTEND | PUTEND | PVTEND |
|---|---|---|---|
| 2000-12-27 00:00:00 | -3e-06 | 0.0 | 0.0 |
| 2000-12-27 00:30:00 | -3e-06 | 0.0 | 0.0 |
| 2000-12-27 01:00:00 | -3e-06 | 0.0 | 0.0 |
| 2000-12-27 01:30:00 | -3e-06 | 0.0 | 0.0 |
| 2000-12-27 02:00:00 | -3e-06 | 0.0 | 0.0 |


## Split the data into training and testing datasets

For simplicity we'll start with a random split of 80% for training and 20% for testing.

In [16]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df_features, df_targets, test_size=0.2, random_state=4)
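
`train_test_split` shuffles rows by default, so neighbouring time steps can end up in both sets. If a time-ordered hold-out is preferred, a chronological 80/20 split is one option. The sketch below is an alternative to the cell above, not what the notebook does, and it uses small synthetic frames as stand-ins for `df_features` and `df_targets`.

```python
import numpy as np
import pandas as pd

# Synthetic, time-indexed stand-ins for df_features and df_targets.
index = pd.date_range('2000-12-27', periods=10, freq='30min')
features = pd.DataFrame({'T': np.linspace(210.0, 211.0, 10)}, index=index)
targets = pd.DataFrame({'PTTEND': np.linspace(-3e-6, 0.0, 10)}, index=index)

# Keep the first 80% of the rows for training and the last 20% for testing.
split = int(len(features) * 0.8)
x_train, x_test = features.iloc[:split], features.iloc[split:]
y_train, y_test = targets.iloc[:split], targets.iloc[split:]

print(len(x_train), len(x_test))
```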

## Create the linear regression model

In [18]:
from sklearn import linear_model
model = linear_model.LinearRegression()
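
`LinearRegression` fits an ordinary least-squares model with an intercept, one set of coefficients per target column. To make that concrete, here is a minimal NumPy sketch of the same idea on synthetic, noise-free data; the array sizes mirror the four features and three targets used here, but the numbers themselves are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # 4 features, like PS, T, U, V
true_W = rng.normal(size=(4, 3))     # 3 targets, like PTTEND, PUTEND, PVTEND
true_b = np.array([0.5, -1.0, 2.0])
Y = X @ true_W + true_b              # noise-free synthetic targets

# Append a column of ones so the intercept is fitted together with the weights.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
W, b = coef[:-1], coef[-1]

print(np.allclose(W, true_W), np.allclose(b, true_b))
```

Because the synthetic targets are an exact linear function of the features, the least-squares solution recovers the weights and intercepts used to generate them.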

## Train and evaluate the model

Train the model by fitting it to the training data, then compute the mean squared error on the test data.

In [20]:
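# Fit the model; fit() returns the fitted estimator itself, not a training history.
model.fit(x_train, y_train)

# Mean squared error per target column on the held-out test set.
# axis=0 keeps one value per column regardless of the pandas version.
mse = np.mean((model.predict(x_test) - y_test)**2, axis=0)
print("MSE: {}".format(mse))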

MSE: PTTEND    2.764062e-23
PUTEND    0.000000e+00
PVTEND    0.000000e+00
dtype: float32
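
An MSE on its own is hard to judge because it depends on the scale of each target. One common sanity check is to compare it with a baseline that always predicts the training mean. The sketch below shows the comparison with small made-up arrays; `mse_per_target` is a hypothetical helper and the prediction values are illustrative, not output from the model above.

```python
import numpy as np

def mse_per_target(y_true, y_pred):
    """Mean squared error computed separately for each target column."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2, axis=0)

# Made-up targets with two columns; the second column is constant zero.
y_train = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
y_test = np.array([[2.0, 0.0], [4.0, 0.0]])
y_pred = np.array([[2.5, 0.0], [3.5, 0.0]])   # hypothetical model predictions

# Baseline: predict the training-set mean of each target for every test row.
baseline = np.broadcast_to(y_train.mean(axis=0), y_test.shape)

model_mse = mse_per_target(y_test, y_pred)
baseline_mse = mse_per_target(y_test, baseline)
print("model:   ", model_mse)
print("baseline:", baseline_mse)
```

A model is only adding value for a target where its error is clearly below the baseline; for a constant target, both errors are zero and the comparison says nothing about the model.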
