## Setup


In this first cell we'll load the necessary libraries and set up some logging and display options.

In [47]:
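import math
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import xarray as xr
from sklearn import metrics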

Next we'll load our flow variables and time tendency forcings datasets into Xarray Dataset objects.

In [48]:
ds_h0 = xr.open_dataset('C:/home/cam_learn/fv091x180L26_dry_HS.cam.h0.2000-12-27-00000_lowres.nc', decode_times=False)
ds_h1 = xr.open_dataset('C:/home/cam_learn/fv091x180L26_dry_HS.cam.h1.2000-12-27-00000_lowres.nc', decode_times=False)

In [49]:
ds_h0.info

<bound method Dataset.info of <xarray.Dataset>
Dimensions:       (ilev: 27, lat: 12, lev: 26, lon: 23, nbnd: 2, slat: 90, slon: 180, time: 720)
Coordinates:
  * ilev          (ilev) float64 2.194 4.895 9.882 18.05 29.84 44.62 61.61 ...
  * lat           (lat) float64 -90.0 -74.0 -58.0 -42.0 -26.0 -10.0 6.0 22.0 ...
  * lev           (lev) float64 3.545 7.389 13.97 23.94 37.23 53.11 70.06 ...
  * lon           (lon) float64 0.0 16.0 32.0 48.0 64.0 80.0 96.0 112.0 ...
  * slat          (slat) float64 -89.0 -87.0 -85.0 -83.0 -81.0 -79.0 -77.0 ...
  * slon          (slon) float64 -1.0 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 ...
  * time          (time) float64 0.0 0.02083 0.04167 0.0625 0.08333 0.1042 ...
Dimensions without coordinates: nbnd
Data variables:
    P0            float64 ...
    PS            (time, lat, lon) float32 ...
    T             (time, lev, lat, lon) float32 ...
    U             (time, lev, lat, lon) float32 ...
    V             (time, lev, lat, lon) float32 ...
    ch4v

In [51]:
ds_h1.info

<bound method Dataset.info of <xarray.Dataset>
Dimensions:       (ilev: 27, lat: 12, lev: 26, lon: 23, nbnd: 2, slat: 90, slon: 180, time: 720)
Coordinates:
  * ilev          (ilev) float64 2.194 4.895 9.882 18.05 29.84 44.62 61.61 ...
  * lat           (lat) float64 -90.0 -74.0 -58.0 -42.0 -26.0 -10.0 6.0 22.0 ...
  * lev           (lev) float64 3.545 7.389 13.97 23.94 37.23 53.11 70.06 ...
  * lon           (lon) float64 0.0 16.0 32.0 48.0 64.0 80.0 96.0 112.0 ...
  * slat          (slat) float64 -89.0 -87.0 -85.0 -83.0 -81.0 -79.0 -77.0 ...
  * slon          (slon) float64 -1.0 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 ...
  * time          (time) float64 0.0 0.02083 0.04167 0.0625 0.08333 0.1042 ...
Dimensions without coordinates: nbnd
Data variables:
    P0            float64 ...
    PTTEND        (time, lev, lat, lon) float32 ...
    PUTEND        (time, lev, lat, lon) float32 ...
    PVTEND        (time, lev, lat, lon) float32 ...
    ch4vmr        (time) float64 ...
    co2vmr        

Look at the time variable in order to work out the initial date, number of steps, units, etc.

In [52]:
ds_h0.variables['time']

<xarray.IndexVariable 'time' (time: 720)>
array([ 0.      ,  0.020833,  0.041667, ..., 14.9375  , 14.958333, 14.979167])
Attributes:
    long_name:  time
    units:      days since 2000-12-27 00:00:00
    calendar:   noleap
    bounds:     time_bnds
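
As a quick check of those attributes, the step size and run length can be derived from the time values themselves. The sketch below rebuilds an equivalent array with NumPy (an assumption standing in for `ds_h0.variables['time'].values`) rather than reading the file.

```python
import numpy as np

# Stand-in for ds_h0.variables['time'].values: 720 values in units of days,
# spaced 1/48 day apart (assumed here to match the output above).
times = np.arange(720) / 48.0

# Spacing between consecutive samples, converted from days to minutes.
step_minutes = np.diff(times) * 24 * 60
print(step_minutes.min(), step_minutes.max())

# Total span covered by the samples, in days (last value plus one step).
span_days = times[-1] + (times[1] - times[0])
print(span_days)
```

Under that assumption the samples are 30 minutes apart and cover 15 days, which agrees with the 720 steps reported for the `time` dimension.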

Make sure we have the same time values for the targets data.

In [53]:
if (ds_h0.variables['time'].values != ds_h1.variables['time'].values).any():
    print('ERROR: Non-matching time values')

Create array of datetime values from the times.

In [54]:
times = ds_h0.variables['time'].values.flatten()
initial = datetime(2000, 12, 27)
datetimes = np.empty(shape=times.shape, dtype='datetime64[m]')
for i in range(datetimes.size):
    datetimes[i] = initial + timedelta(days=times[i])
timestamps = pd.Series(datetimes)
timestamps

0     2000-12-27 00:00:00
1     2000-12-27 00:30:00
2     2000-12-27 01:00:00
3     2000-12-27 01:30:00
4     2000-12-27 02:00:00
              ...        
715   2001-01-10 21:30:00
716   2001-01-10 22:00:00
717   2001-01-10 22:30:00
718   2001-01-10 23:00:00
719   2001-01-10 23:30:00
Length: 720, dtype: datetime64[ns]
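
The loop above can also be written without explicit iteration. This is a minimal sketch using pandas' `to_timedelta`, with a synthetic `times` array standing in for the file values. It treats the offsets as ordinary elapsed days, which is only safe here because the run never crosses a leap day (the file declares a `noleap` calendar).

```python
import numpy as np
import pandas as pd

# Stand-in for ds_h0.variables['time'].values (days since 2000-12-27).
times = np.arange(720) / 48.0

# Add the day offsets to the initial date in one vectorized operation,
# then round to whole minutes to remove floating-point jitter.
initial = pd.Timestamp("2000-12-27")
timestamps = pd.Series(initial + pd.to_timedelta(times, unit="D")).dt.round("min")
print(timestamps.iloc[[0, 1, -1]])
```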

As features we'll use the following flow variables:

* U (west-east (zonal) wind, m/s)
* V (south-north (meridional) wind, m/s)
* T (temperature, K)
* PS (surface pressure, Pa)

Time tendency forcings are the targets (labels) that our model should learn to predict.

* PTTEND (time tendency of the temperature)
* PUTEND (time tendency of the zonal wind)
* PVTEND (time tendency of the meridional wind)

For the first lat, lon, and level we'll get all time steps for the feature variables. We'll do the same for the label variables PTTEND, PUTEND, and PVTEND.

In [55]:
ps = pd.Series(ds_h0.variables['PS'].values[:, 0, 0])
t = pd.Series(ds_h0.variables['T'].values[:, 0, 0, 0])
u = pd.Series(ds_h0.variables['U'].values[:, 0, 0, 0])
v = pd.Series(ds_h0.variables['V'].values[:, 0, 0, 0])
pttend = pd.Series(ds_h1.variables['PTTEND'].values[:, 0, 0, 0])
putend = pd.Series(ds_h1.variables['PUTEND'].values[:, 0, 0, 0])
pvtend = pd.Series(ds_h1.variables['PVTEND'].values[:, 0, 0, 0])
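
The positional indices in the cell above follow the dimension order reported by the dataset: `(time, lev, lat, lon)` for the 3-D fields and `(time, lat, lon)` for `PS`. A small NumPy sketch with a random array of the same shape (not the real data) shows what such a slice returns:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic field with the same layout as T, U, and V: (time, lev, lat, lon).
field = rng.normal(size=(720, 26, 12, 23)).astype(np.float32)

# Keep every time step, but only the first level, latitude, and longitude.
column = field[:, 0, 0, 0]
print(column.shape)  # one value per time step
```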

Convert to a Pandas DataFrame containing inputs (features) and output (label/target) for use when predicting temperature time tendencies.

In [57]:
df_temp = pd.DataFrame({'timestamp': timestamps,
                         'PS': ps,
                         'T': t,
                         'U': u,
                         'V': v,
                         'PTTEND': pttend})
df_temp.set_index('timestamp', inplace=True)
df_temp.head()

| timestamp | PS | T | U | V | PTTEND |
|---|---|---|---|---|---|
| 2000-12-27 00:00:00 | 101099.06 | 210.86 | -0.81 | -0.28 | -0.0 |
| 2000-12-27 00:30:00 | 101108.92 | 210.86 | -0.82 | -0.3 | -0.0 |
| 2000-12-27 01:00:00 | 101118.48 | 210.86 | -0.83 | -0.32 | -0.0 |
| 2000-12-27 01:30:00 | 101127.91 | 210.86 | -0.83 | -0.33 | -0.0 |
| 2000-12-27 02:00:00 | 101137.39 | 210.86 | -0.83 | -0.35 | -0.0 |


In [58]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 720 entries, 2000-12-27 00:00:00 to 2001-01-10 23:30:00
Data columns (total 5 columns):
PS        720 non-null float32
T         720 non-null float32
U         720 non-null float32
V         720 non-null float32
PTTEND    720 non-null float32
dtypes: float32(5)
memory usage: 19.7 KB


Define a function that will frame the time series as a supervised learning problem set (described [here](https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/)).

In [87]:
def convert_to_supervised(df,
                          previous_steps=1, 
                          forecast_steps=1,
                          dropnan=True):

    # original column names
    col_names = df.columns
    
    # list of columns and corresponding names we'll build from 
    # the originals found in the input DataFrame
    cols, names = list(), list()

    # input sequence (t-n, ... t-1)
    for i in range(previous_steps, 0, -1):
        cols.append(df.shift(i))
        names += [('%s(t-%d)' % (col_name, i)) for col_name in col_names]

    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, forecast_steps):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('%s(t)' % col_name) for col_name in col_names]
        else:
            names += [('%s(t+%d)' % (col_name, i)) for col_name in col_names]

    # put all the columns together into a single aggregated DataFrame
    agg = pd.concat(cols, axis=1)
    agg.columns = names

    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)

    return agg
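
To make the framing concrete, here is a toy illustration of the same shift-and-concatenate idea on a four-row frame. The column names `x` and `y` are made up for the example, and the code inlines the single-lag case rather than calling `convert_to_supervised`.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [10.0, 20.0, 30.0, 40.0]})

# Values from the previous row become the inputs ...
lagged = df.shift(1)
lagged.columns = ["%s(t-1)" % c for c in df.columns]

# ... and the current row supplies the values to predict.
current = df.copy()
current.columns = ["%s(t)" % c for c in df.columns]

# The first row has no predecessor, so it contains NaN and is dropped.
framed = pd.concat([lagged, current], axis=1).dropna()
print(framed)
```

Each remaining row pairs the values at step `t-1` with the values at step `t`, so a four-row input yields three supervised samples.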

Normalize all columns of the DataFrame (the feature variables and the PTTEND target) to the range [0, 1] using scikit-learn's MinMaxScaler.

In [72]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df_temp.values)
scaled

array([[0.50631714, 1.        , 0.22656655, 0.69448346, 0.        ],
       [0.5129471 , 0.9949341 , 0.22496124, 0.68799436, 0.00498581],
       [0.51937866, 0.98950195, 0.2233897 , 0.6815935 , 0.01043701],
       ...,
       [0.12479401, 0.23565674, 0.98422444, 0.7495539 , 0.76434326],
       [0.12210846, 0.23876953, 0.98267317, 0.7519671 , 0.76122856],
       [0.11936951, 0.2397461 , 0.98426366, 0.7531688 , 0.76023674]],
      dtype=float32)
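
With `feature_range=(0, 1)`, min-max scaling maps each column independently using `(x - min) / (max - min)`. The following NumPy sketch applies that formula to a small made-up array. It illustrates the arithmetic and is not a replacement for the scaler, which also stores the per-column minima and maxima needed to invert the transform later.

```python
import numpy as np

# Made-up data: two columns on very different scales.
data = np.array([[101000.0, 210.0],
                 [101500.0, 212.0],
                 [102000.0, 211.0]])

# Column-wise minimum and maximum.
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Scale each column to the range [0, 1].
scaled = (data - col_min) / (col_max - col_min)
print(scaled)

# Invert the transform to recover the original values.
restored = scaled * (col_max - col_min) + col_min
```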

Replace the original columns with the scaled values.

In [86]:
for i in range(len(df_temp.columns)):
    s = pd.Series(scaled[:, i])
    df_temp[df_temp.columns[i]] = s.values
df_temp.head()

| timestamp | PS | T | U | V | PTTEND |
|---|---|---|---|---|---|
| 2000-12-27 00:00:00 | 0.51 | 1.0 | 0.23 | 0.69 | 0.0 |
| 2000-12-27 00:30:00 | 0.51 | 0.99 | 0.22 | 0.69 | 0.0 |
| 2000-12-27 01:00:00 | 0.52 | 0.99 | 0.22 | 0.68 | 0.01 |
| 2000-12-27 01:30:00 | 0.53 | 0.98 | 0.22 | 0.68 | 0.02 |
| 2000-12-27 02:00:00 | 0.53 | 0.98 | 0.22 | 0.67 | 0.02 |


Frame the variables as a supervised learning problem (as described [here](https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/)).

In [88]:
reframed = convert_to_supervised(df_temp, 1, 1)
reframed.head()

| timestamp | PS(t-1) | T(t-1) | U(t-1) | V(t-1) | PTTEND(t-1) | PS(t) | T(t) | U(t) | V(t) | PTTEND(t) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2000-12-27 00:30:00 | 0.51 | 1.0 | 0.23 | 0.69 | 0.0 | 0.51 | 0.99 | 0.22 | 0.69 | 0.0 |
| 2000-12-27 01:00:00 | 0.51 | 0.99 | 0.22 | 0.69 | 0.0 | 0.52 | 0.99 | 0.22 | 0.68 | 0.01 |
| 2000-12-27 01:30:00 | 0.52 | 0.99 | 0.22 | 0.68 | 0.01 | 0.53 | 0.98 | 0.22 | 0.68 | 0.02 |
| 2000-12-27 02:00:00 | 0.53 | 0.98 | 0.22 | 0.68 | 0.02 | 0.53 | 0.98 | 0.22 | 0.67 | 0.02 |
| 2000-12-27 02:30:00 | 0.53 | 0.98 | 0.22 | 0.67 | 0.02 | 0.54 | 0.98 | 0.22 | 0.67 | 0.02 |


Drop the columns for variables we won't use as features or targets.

In [94]:
reframed.drop(columns=['PS(t)', 'T(t)', 'U(t)', 'V(t)'], inplace=True)

In [95]:
reframed.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 719 entries, 2000-12-27 00:30:00 to 2001-01-10 23:30:00
Data columns (total 6 columns):
PS(t-1)        719 non-null float32
T(t-1)         719 non-null float32
U(t-1)         719 non-null float32
V(t-1)         719 non-null float32
PTTEND(t-1)    719 non-null float32
PTTEND(t)      719 non-null float32
dtypes: float32(6)
memory usage: 22.5 KB


## Split the data into training and testing datasets

For simplicity we'll start with a roughly even split: the first 360 rows for training and the remaining 359 rows for testing.

In [96]:
# the values array is 719 rows x 6 columns
train = reframed.values[:360, :]   # rows 0 - 359
test = reframed.values[360:, :]    # rows 360 - 718

# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
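
As a sanity check on the shapes, the same slicing and reshaping can be run on a placeholder array with the dimensions of `reframed.values` (719 rows, 6 columns). The zeros below are a stand-in, not the notebook's data.

```python
import numpy as np

# Placeholder with the same shape as reframed.values.
values = np.zeros((719, 6), dtype=np.float32)

train, test = values[:360, :], values[360:, :]

# The last column is the target; the rest are inputs.
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

# Add a length-1 "timesteps" axis: [samples, timesteps, features].
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```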

In [97]:
train[:10, -1]

array([0.00498581, 0.01043701, 0.01571846, 0.02043343, 0.02449036,
       0.02793884, 0.03071404, 0.03261375, 0.03356171, 0.0336895 ],
      dtype=float32)

In [93]:
x = reframed['PTTEND(t)'].values[:10]
print(x)

[0.00498581 0.01043701 0.01571846 0.02043343 0.02449036 0.02793884
 0.03071404 0.03261375 0.03356171 0.0336895 ]


## Create the neural network


Next, we'll instantiate and configure a neural network using TensorFlow's [DNNRegressor](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNRegressor) class. We'll train this model using the GradientDescentOptimizer, which implements Mini-Batch Stochastic Gradient Descent (SGD). The learning_rate argument controls the size of the gradient step.

NOTE: To be safe, we also apply gradient clipping to our optimizer via `clip_gradients_by_norm`. Gradient clipping ensures that the magnitude of the gradients does not become too large during training, which can cause gradient descent to fail.
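
The idea behind clipping by norm can be shown in a few lines of NumPy. This is an illustrative sketch of global-norm clipping with a hypothetical helper named `clip_by_global_norm`; it is not TensorFlow's implementation.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Leave the gradients untouched when they are already small enough.
    scale = min(1.0, clip_norm / global_norm) if global_norm > 0 else 1.0
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, clip_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

The gradients keep their direction; only their overall length is reduced to the clipping threshold.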

We use `hidden_units` to define the structure of the NN. The `hidden_units` argument takes a list of ints, where each int corresponds to a hidden layer and gives the number of nodes in it. For example, consider the following assignment:

`hidden_units=[3, 10]`

The preceding assignment specifies a neural net with two hidden layers:

* The first hidden layer contains 3 nodes.
* The second hidden layer contains 10 nodes.

If we wanted to add more layers, we'd add more ints to the list. For example, `hidden_units=[10, 20, 30, 40]` would create four layers with ten, twenty, thirty, and forty units, respectively.

By default, all hidden layers will use ReLU activation and will be fully connected.
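
To see what `hidden_units=[3, 10]` implies for the layer sizes, the NumPy sketch below builds randomly initialized weights for a network with 5 inputs (matching the five `(t-1)` columns above, which is an assumption about how the features would be fed in) and one output, then runs a forward pass with ReLU on the hidden layers. It illustrates the architecture only and is not how `DNNRegressor` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, hidden_units, n_outputs = 5, [3, 10], 1

# One weight matrix and bias vector per fully connected layer.
sizes = [n_inputs] + hidden_units + [n_outputs]
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    # ReLU after each hidden layer; the output layer stays linear for regression.
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)
    return x @ weights[-1] + biases[-1]

batch = rng.normal(size=(4, n_inputs))
print([w.shape for w in weights], forward(batch).shape)
```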

In [12]:
# Use gradient descent as the optimizer for training the model,
# clipping gradients to a maximum norm of 5.0.
gd_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
gd_optimizer = tf.contrib.estimator.clip_gradients_by_norm(gd_optimizer, 5.0)

# Use two hidden layers with 3 and 10 nodes, respectively.
hidden_units = [3, 10]

# Instantiate the neural network. `feature_columns` must be defined
# before this cell is run.
dnn_regressor = tf.estimator.DNNRegressor(feature_columns=feature_columns,
                                          hidden_units=hidden_units,
                                          optimizer=gd_optimizer)

## Define the input function

To import our weather data into our DNNRegressor, we need to define an input function, which instructs TensorFlow how to preprocess the data, as well as how to batch, shuffle, and repeat it during model training.

First, we'll convert our xarray feature data into a dict of NumPy arrays. We can then use the TensorFlow Dataset API to construct a dataset object from our data, and then break our data into batches of `batch_size`, to be repeated for the specified number of epochs (`num_epochs`).

NOTE: When the default value of `num_epochs=None` is passed to `repeat()`, the input data will be repeated indefinitely.

Next, if `shuffle` is set to True, we'll shuffle the data so that it's passed to the model in random order during training. The `buffer_size` argument specifies the number of elements held in the buffer from which `shuffle` randomly samples.

Finally, our input function constructs an iterator for the dataset and returns the next batch of data.
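
The batching, repeating, and shuffling behaviour described above can be mimicked in plain NumPy. The generator below is a hypothetical stand-in (`iterate_batches` is not a TensorFlow function) that shuffles individual rows within each epoch, which is a simplification of how `Dataset.shuffle` samples from a buffer.

```python
import numpy as np

def iterate_batches(features, targets, batch_size=1, shuffle=True,
                    num_epochs=1, seed=0):
    """Yield (features, targets) batches; num_epochs=None repeats forever."""
    rng = np.random.default_rng(seed)
    n = len(targets)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        # Visit rows in random order when shuffling, otherwise in sequence.
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield features[idx], targets[idx]
        epoch += 1

features = np.arange(20, dtype=np.float32).reshape(10, 2)
targets = np.arange(10, dtype=np.float32)
batches = list(iterate_batches(features, targets, batch_size=4,
                               shuffle=False, num_epochs=2))
print(len(batches), [len(t) for _, t in batches])
```

With 10 rows and a batch size of 4, each epoch produces two full batches and one partial batch, so two epochs yield six batches in total.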

In [13]:
from tensorflow.python.data import Dataset

def get_input(features, 
              targets, 
              batch_size=1, 
              shuffle=True, 
              num_epochs=None):
    """
    Extracts a batch of elements from a dataset.
  
    Args:
      features: xarray Dataset of features
      targets: xarray Dataset of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. 
                  None == repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """
  
    # Convert the xarray feature data into a dict of NumPy arrays.
    features_dict = {}
    for var in features.variables:
        data_array = features[var]
        features_dict[var] = data_array.values

    # Do the same for the targets.
    targets_dict = {}
    for var in targets.variables:
        data_array = targets[var]
        targets_dict[var] = data_array.values

    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features_dict, targets_dict)) # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)
    
    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    
    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels

# Create input functions. Wrap get_input() in a lambda so we 
# can pass in features and targets as arguments.
input_training = lambda: get_input(features_training, 
                                   targets_training, 
                                   batch_size=10)
predict_input_training = lambda: get_input(features_training, 
                                           targets_training, 
                                           num_epochs=1, 
                                           shuffle=False)
predict_input_validation = lambda: get_input(features_validation, 
                                             targets_validation, 
                                             num_epochs=1, 
                                             shuffle=False)

## Train and evaluate the model

We can now call `train()` on our `dnn_regressor` to train the model. We'll loop over a number of periods, and on each pass we'll train the model, use it to make predictions, and compute the root mean squared error (RMSE) for both the training and validation datasets.

In [None]:
print("Training model...")
print("RMSE (on training data):")
training_rmse = []
validation_rmse = []

steps = 500
periods = 20
steps_per_period = steps // periods   # integer number of training steps per period

# Train the model inside a loop so that we can periodically assess loss metrics.
for period in range(periods):

    # Train the model, starting from the prior state.
    dnn_regressor.train(input_fn=input_training,
                        steps=steps_per_period)

    # Take a break and compute predictions, converting to numpy arrays.
    training_predictions = dnn_regressor.predict(input_fn=predict_input_training)
    training_predictions = np.array([item['predictions'][0] for item in training_predictions])
    
    validation_predictions = dnn_regressor.predict(input_fn=predict_input_validation)
    validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])
    
    # Compute training and validation loss.
    training_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(training_predictions, targets_training))
    validation_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(validation_predictions, targets_validation))
    
    # Print the current loss.
    print("  period %02d : %0.2f" % (period, training_root_mean_squared_error))
    
    # Add the loss metrics from this period to our list.
    training_rmse.append(training_root_mean_squared_error)
    validation_rmse.append(validation_root_mean_squared_error)

print("Model training finished.")

# Output a graph of loss metrics over periods.
plt.ylabel("RMSE")
plt.xlabel("Periods")
plt.title("Root Mean Squared Error vs. Periods")
plt.tight_layout()
plt.plot(training_rmse, label="training")
plt.plot(validation_rmse, label="validation")
plt.legend()

print("Final RMSE (on training data):   %0.2f" % training_root_mean_squared_error)
print("Final RMSE (on validation data): %0.2f" % validation_root_mean_squared_error)

Training model...
RMSE (on training data):
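
For reference, the RMSE reported in the loop is the square root of the mean squared difference between predictions and targets. A NumPy sketch with made-up numbers:

```python
import math
import numpy as np

predictions = np.array([0.10, 0.20, 0.40])
targets = np.array([0.10, 0.30, 0.20])

# Mean of the squared errors, then the square root.
mse = np.mean((predictions - targets) ** 2)
rmse = math.sqrt(mse)
print(rmse)
```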
