In [1]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

%matplotlib inline

Using TensorFlow backend.


Set our random seed so that results are reproducible from run to run.

In [None]:
seed = 21899

Read in the raw data for the first 100K records of the HCEPDB into a pandas dataframe

In [None]:
df = pd.read_csv('https://github.com/UWDIRECT/UWDIRECT.github.io/blob/master/Wi18_content/DSMCER/HCEPD_100K.csv?raw=true')
df.head()

Separate out the predictors from the output


In [None]:
X = df[['mass', 'voc', 'jsc', 'e_homo_alpha', 'e_gap_alpha', 
        'e_lumo_alpha']].values
Y = df[['pce']].values

Let's split these data 80/20 into training and test sets.  The `_pn` suffix marks arrays that are 'pre-normalization', i.e. not yet scaled.

In [None]:
X_train_pn, X_test_pn, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.20,
                                                    random_state=seed)

Now we need to fit a `StandardScaler` on the training data and apply that same scaling to the test data.

In [None]:
# create the scaler from the training data only and keep it for later use
X_train_scaler = StandardScaler().fit(X_train_pn)
# apply the scaler transform to the training data
X_train = X_train_scaler.transform(X_train_pn)

Now let's reuse that scaler transform on the test set.  Because the scaler was fit on the training data alone, no information from the test set leaks into training.  We'll start with a histogram of one unscaled test column (`voc`, index 1) so we have something to compare against after scaling.

In [None]:
plt.hist(X_test_pn[:,1])

OK, now apply the training scaler transform to the test set and plot a histogram of the same column.

In [None]:
X_test = X_train_scaler.transform(X_test_pn)

In [None]:
plt.hist(X_test[:,1])
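
To see what the scaler is doing under the hood, here is a minimal numpy sketch. It assumes synthetic data (the column values below are made up, not HCEPDB records) and mimics the fit/transform split by hand: statistics come from the training column only and are then applied to both sets.

```python
import numpy as np

# Synthetic stand-in for one feature column (hypothetical values, not HCEPDB data)
rng = np.random.default_rng(21899)
train_col = rng.normal(loc=5.0, scale=2.0, size=8000)
test_col = rng.normal(loc=5.0, scale=2.0, size=2000)

# "fit": compute statistics from the training column only
mu = train_col.mean()
sigma = train_col.std()  # population std (ddof=0), the convention StandardScaler uses

# "transform": apply the same training statistics to both sets
train_scaled = (train_col - mu) / sigma
test_scaled = (test_col - mu) / sigma

print("train mean/std: %.3f / %.3f" % (train_scaled.mean(), train_scaled.std()))
print("test  mean/std: %.3f / %.3f" % (test_scaled.mean(), test_scaled.std()))
```

The scaled training column has mean 0 and standard deviation 1 by construction. The scaled test column should be close to that but not exact, which is what you would expect to see in the histogram above.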

### Let's create the neural network layout

This is a simple neural network: the six inputs feed one hidden layer of six ReLU units, which feeds a single linear output.

In [None]:
def simple_model():
    # assemble the structure
    model = Sequential()
    model.add(Dense(6, input_dim=6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # compile the model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
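
To make the layout concrete, here is a rough numpy sketch of the same shape of network. It is an illustration only, not what Keras runs internally: the helper names `count_dense_params` and `forward` are made up for this example, and the weights are random draws rather than trained values.

```python
import numpy as np

def count_dense_params(layer_sizes):
    """Count weights and biases for a stack of fully connected layers.

    layer_sizes lists the input width followed by the width of each Dense layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

def forward(x, weights, biases):
    """Forward pass: ReLU on every layer except the last, which is linear."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        a = z if i == len(weights) - 1 else np.maximum(z, 0.0)
    return a

sizes = [6, 6, 1]  # 6 inputs -> Dense(6, relu) -> Dense(1)
rng = np.random.default_rng(0)
# small random weights and zero biases, chosen here only for illustration
weights = [rng.normal(scale=0.05, size=(i, o)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

batch = rng.normal(size=(4, 6))  # four made-up samples with six features each
print("trainable parameters:", count_dense_params(sizes))
print("output shape:", forward(batch, weights, biases).shape)
```

The count works out to 6*6 + 6 = 42 for the first layer and 6 + 1 = 7 for the output, 49 in total, which you can compare against `model.summary()`.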

Train the neural network with the following

In [None]:
# initialize the random seed, as it is used to generate
# the starting weights
np.random.seed(seed)
# create the NN framework
estimator = KerasRegressor(build_fn=simple_model,
        epochs=150, batch_size=25000, verbose=0)
# arguments passed to fit() take precedence over the constructor's,
# so this run uses batch_size=10000
history = estimator.fit(X_train, y_train, validation_split=0.33, epochs=150, 
        batch_size=10000, verbose=0)

The history object returned by the `fit` call records the loss at every epoch of the fitting run, for both the training and the validation portions.

In [None]:
print(history.history.keys())

In [None]:
print("final MSE for train is %.2f and for validation is %.2f" % 
      (history.history['loss'][-1], history.history['val_loss'][-1]))

Let's plot it!

In [None]:
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

Let's get the MSE for the test set.

In [None]:
test_loss = estimator.model.evaluate(X_test, y_test)
print("test set mse is %.2f" % test_loss)
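
For reference, the number being reported is just the average squared difference between targets and predictions. A small hand-rolled version, using made-up numbers, looks like this:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    # flatten both so an (n, 1) array and an (n,) array cannot broadcast to (n, n)
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

# made-up targets and predictions, shaped (n, 1) like y_test
y_true = [[1.0], [2.0], [4.0]]
y_pred = [[1.5], [2.0], [3.0]]
print("mse is %.4f" % mean_squared_error(y_true, y_pred))
```

The squared errors here are 0.25, 0 and 1, so the mean is 1.25 / 3.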

## NEAT!

So our test MSE is very similar to the training and validation MSE at the final step!

###  Let's look at another way to evaluate the set of models using cross validation

Use 10-fold cross validation to evaluate the models generated from our training set.  We'll use scikit-learn's tools for this.  Remember, this is only assessing our training set.  scikit-learn treats scores as 'higher is better', so the Keras wrapper reports the negative of the loss.  If you get negative values, flip the sign on the results to recover the MSE.

In [None]:
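# random_state only has an effect when shuffle=True (recent scikit-learn
# versions raise an error otherwise)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)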
results = cross_val_score(estimator, X_train, y_train, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (-1 * results.mean(), results.std()))
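
If the mechanics of k-fold feel opaque, the following sketch does the same thing by hand. It is a simplified stand-in under stated assumptions: the data are synthetic, a least-squares line replaces the neural network so it runs in a blink, and `kfold_indices` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, val_idx) pairs from a shuffled, near-equal partition."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx

# Made-up regression data; a least-squares line stands in for the network
rng = np.random.default_rng(21899)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

scores = []
for train_idx, val_idx in kfold_indices(len(X), n_splits=10, seed=21899):
    slope, intercept = np.polyfit(X[train_idx, 0], y[train_idx], deg=1)
    pred = slope * X[val_idx, 0] + intercept
    mse = np.mean((y[val_idx] - pred) ** 2)
    scores.append(-mse)  # "higher is better" convention: report negative MSE

scores = np.array(scores)
# flip the sign back when reporting, as in the cell above
print("MSE: %.3f (%.3f)" % (-scores.mean(), scores.std()))
```

Each sample lands in exactly one validation fold, and each fold's model never sees its own validation rows during fitting.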

#### Quick aside, `Pipeline`

Let's use scikit-learn's `Pipeline` workflow to run k-fold cross validation on the model.

With this tool, we create a workflow using the `Pipeline` object.  You provide a list of steps, each a `(name, object)` tuple, to be performed in order.  Putting `StandardScaler` inside the pipeline means it is refit on the training portion of each fold, which eliminates the possibility of the held-out fold leaking into the normalization.

In [None]:
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=simple_model,
        epochs=150, batch_size=25000, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
# pass the unscaled data so the pipeline's scaler is fit inside each fold
results = cross_val_score(pipeline, X_train_pn, y_train, cv=kfold)
print('MSE mean: %.4f ; std: %.4f' % (-1 * results.mean(), results.std()))
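
The idea behind a pipeline can be sketched in a few lines. The classes below (`Standardize`, `LeastSquares`, `MiniPipeline`) are toy stand-ins invented for this example, not scikit-learn's code, and the data are synthetic. The point is only to show that every step is fit on whatever rows `fit` receives, and that prediction reuses those fitted statistics.

```python
import numpy as np

class Standardize:
    """Toy scaler: learns column means and stds in fit, applies them in transform."""
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

class LeastSquares:
    """Toy linear model with an intercept, standing in for the neural network."""
    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([X, np.ones(len(X))]) @ self.coef_

class MiniPipeline:
    """Chain (name, step) tuples: transformers first, a model last."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X).transform(X)  # fit on the rows given, nothing else
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)  # reuse the statistics learned during fit
        return self.steps[-1][1].predict(X)

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y = X @ np.array([1.0, -2.0]) + 0.5  # noise-free, so the fit should be exact

train, held_out = np.arange(80), np.arange(80, 100)
pipe = MiniPipeline([("standardize", Standardize()), ("model", LeastSquares())])
pipe.fit(X[train], y[train])
pred = pipe.predict(X[held_out])
print("held-out MSE: %.2e" % np.mean((y[held_out] - pred) ** 2))
```

Because the scaler lives inside the pipeline, cross validation refits it for each fold automatically, so the held-out rows never influence the means and standard deviations.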

### Now, let's try a more sophisticated model

Let's add a second hidden layer this time.

In [None]:
def medium_model():
    # assemble the structure
    model = Sequential()
    model.add(Dense(6, input_dim=6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # compile the model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

In [None]:
# initialize the random seed, as it is used to generate
# the starting weights
np.random.seed(seed)
# create the NN framework
estimator = KerasRegressor(build_fn=medium_model,
        epochs=150, batch_size=25000, verbose=0)
# arguments passed to fit() take precedence over the constructor's,
# so this run uses batch_size=10000
history = estimator.fit(X_train, y_train, validation_split=0.33, epochs=150, 
        batch_size=10000, verbose=0)
print("final MSE for train is %.2f and for validation is %.2f" % 
      (history.history['loss'][-1], history.history['val_loss'][-1]))

In [None]:
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
test_loss = estimator.model.evaluate(X_test, y_test)
print("test set mse is %.2f" % test_loss)

_So it appears our more complex model improved performance_

### Free time!

Find example Keras code for the following two items:
* L1 and L2 regularization (note in keras, this can be done by layer)
* Dropout


#### Regularization
Let's start by adding L1 or L2 (or both) regularization to the hidden layer.

Hint: you need to define a new function that is the neural network model and add the correct parameters to the layer definition.  Then retrain and plot as above.  What parameters did you choose for your regularization?  Did it improve training?
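
Without giving away the Keras syntax, it may help to see what these penalties add to the loss. This is a conceptual numpy sketch, assuming the common convention of an L1 term proportional to the sum of absolute weights and an L2 term proportional to the sum of squared weights. The function name `penalized_loss` and all the numbers are made up for illustration.

```python
import numpy as np

def penalized_loss(y_true, y_pred, W, l1=0.0, l2=0.0):
    """MSE plus L1 and L2 penalties on one layer's weight matrix W."""
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = l1 * np.sum(np.abs(W)) + l2 * np.sum(W ** 2)
    return mse + penalty

# made-up weights and predictions
W = np.array([[0.5, -1.0], [2.0, 0.0]])
y_true = np.array([1.0, 2.0])
y_pred = np.array([1.0, 3.0])

print("no penalty: %.4f" % penalized_loss(y_true, y_pred, W))
print("with L1:    %.4f" % penalized_loss(y_true, y_pred, W, l1=0.01))
print("with L2:    %.4f" % penalized_loss(y_true, y_pred, W, l2=0.01))
```

Large weights now cost something even when they fit the training data well, so the optimizer is nudged toward smaller weights (L2) or toward exact zeros (L1).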

#### Dropout

Find the approach to specifying dropout on a layer using your best friend `bing`.  As with L1 and L2 above, this will involve defining a new network structure using a function and some new 'magical' dropout layers.
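
To take some of the magic out of it, here is a rough numpy sketch of what a dropout layer does to its inputs. It assumes the usual 'inverted dropout' formulation, where surviving units are scaled up during training so the expected activation is unchanged. The function `dropout` below is a hypothetical helper for illustration, not the Keras layer.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # at prediction time the values pass through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a unit survives
    return activations * mask / keep_prob

rng = np.random.default_rng(21899)
a = np.ones((1000, 4))  # made-up activations, all equal to 1
dropped = dropout(a, rate=0.25, rng=rng)
print("fraction zeroed: %.3f" % np.mean(dropped == 0.0))
print("mean activation: %.3f" % dropped.mean())
```

Roughly a quarter of the values are zeroed, yet the mean stays near 1 because the survivors are divided by the keep probability.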