## 7.4

Return to the permeability problem outlined in Exercise 6.2. Train several nonlinear regression models and evaluate the resampling and test set performance.

**(a)** Which nonlinear regression model gives the optimal resampling and test set performance?

**(b)** Do any of the nonlinear models outperform the optimal linear model you previously developed in Exercise 6.2? If so, what might this tell you about the underlying relationship between the predictors and the response?

**(c)** Would you recommend any of the models you have developed to replace the permeability laboratory experiment?

In [127]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.svm import SVR
from matplotlib import pyplot as plt
from matplotlib import cm
from pyearth import Earth
import tensorflow as tf
from tensorflow import keras

In [2]:
from sklearn.model_selection import train_test_split

def split_data(df, target, val_size=0.2, test_size=0.2, drop_columns=None):
    """Split df into train/val/test features (x) and targets (y).

    val_size and test_size are fractions of the full dataset.
    """
    # Avoid a mutable default argument.
    drop_columns = list(drop_columns) if drop_columns is not None else []
    dropped = [target] + drop_columns

    # Hold out the test set first, then carve the validation set out of the
    # remainder; val_size is rescaled so it stays a fraction of the full data.
    train_val, test = train_test_split(df, test_size=test_size)
    train, val = train_test_split(train_val, test_size=val_size / (1 - test_size))

    splits = {"train": train, "val": val, "train_val": train_val, "test": test}
    x = {name: part.drop(columns=dropped) for name, part in splits.items()}
    y = {name: part[target] for name, part in splits.items()}

    return x, y
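
The second `train_test_split` call rescales `val_size` because it only sees what is left after the test set has been removed. Below is a minimal sketch of that arithmetic, assuming scikit-learn rounds a fractional `test_size` up to the next whole sample. The helper `split_sizes` and the sample count of 1000 are illustrative and not part of the notebook.

```python
import math


def split_sizes(n_samples, val_size=0.2, test_size=0.2):
    """Return (n_train, n_val, n_test) for a two-stage hold-out split.

    Assumption: a fractional test size is rounded up to a whole sample.
    """
    # Stage 1: hold out the test set from the full data.
    n_test = math.ceil(test_size * n_samples)
    n_train_val = n_samples - n_test

    # Stage 2: val_size is a fraction of the FULL data, so rescale it to a
    # fraction of the remaining train+val pool before splitting again.
    val_fraction_of_pool = val_size / (1 - test_size)
    n_val = math.ceil(val_fraction_of_pool * n_train_val)
    n_train = n_train_val - n_val

    return n_train, n_val, n_test


print(split_sizes(1000))  # (600, 200, 200)
```

With the defaults, 0.2 / 0.8 = 0.25 of the remaining 80% is used for validation, which is 20% of the original data.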

In [39]:
fingerprints = pd.read_csv("data/fingerprints.csv", index_col="ID", header=0)
permeability = pd.read_csv("data/permeability.csv", index_col="ID", header=0)

data = pd.concat([fingerprints, permeability], axis="columns")
x, y = split_data(data, "permeability")

In [None]:
# Non-linear regression models
# Neural Networks
# - Implement a single example of nn regressor
# - Implement cross validation for nn
# - Test cross validation for baseline architecture
# - Google and implement better architectures.
# - Preprocessing

# MARS
# SVM
# KNN
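
The exercise asks for resampling performance, and the to-do list above mentions cross-validation. One possible shape for it is sketched below in plain Python so that it does not depend on any particular model library. The names `kfold_indices`, `cross_val_rmse`, and `mean_baseline` are hypothetical helpers, and the mean predictor only stands in for whichever model (neural network, MARS, SVM, KNN) is being evaluated.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx); each sample is in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


def cross_val_rmse(fit_predict, x, y, n_splits=5, seed=0):
    """Average validation RMSE over the folds.

    fit_predict(x_train, y_train, x_val) must return one prediction per x_val row.
    """
    scores = []
    for train_idx, val_idx in kfold_indices(len(x), n_splits, seed):
        x_train = [x[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        x_val = [x[i] for i in val_idx]
        preds = fit_predict(x_train, y_train, x_val)
        mse = sum((y[i] - p) ** 2 for i, p in zip(val_idx, preds)) / len(val_idx)
        scores.append(mse ** 0.5)
    return sum(scores) / len(scores)


def mean_baseline(x_train, y_train, x_val):
    """Placeholder model: always predict the training mean."""
    mean = sum(y_train) / len(y_train)
    return [mean] * len(x_val)
```

Passing every candidate model through the same `fit_predict` interface would keep the resampled RMSE values comparable across models.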

In [40]:
x_tensor, y_tensor = {}, {}

for key in x:
    x_tensor[key] = tf.convert_to_tensor(x[key])
    y_tensor[key] = tf.convert_to_tensor(y[key])

In [131]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])

In [132]:
loss_fn = tf.keras.losses.MeanSquaredError()

In [133]:
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor="val_loss",
        min_delta=1e-2,
        patience=5
    )
]
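
To make the effect of `min_delta` and `patience` concrete, here is a simplified sketch of the stopping rule in plain Python. It is an approximation for illustration, not Keras's actual implementation; the function name `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, min_delta=1e-2, patience=5):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the validation loss drops
    more than min_delta below the best value seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: the last real improvement is at epoch 3,
# and the later changes are all smaller than min_delta.
losses = [10.0, 8.0, 7.0, 6.995, 6.996, 7.1, 6.998, 6.997]
print(early_stopping_epoch(losses))  # 8
```

Under this rule, training stops `patience` epochs after the last improvement larger than `min_delta`.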

In [134]:
model.compile(optimizer='adam',
              loss=loss_fn)

In [135]:
model.fit(
    x_tensor["train_val"], 
    y_tensor["train_val"], 
    validation_split=0.2, 
    callbacks=callbacks,
    epochs=500)

Epoch 1/500
...
Epoch 37/500


<tensorflow.python.keras.callbacks.History at 0x7fa28cbb1a00>

In [136]:
r2_score(y_tensor["test"].numpy(), model.predict(x_tensor["test"]).reshape(-1))

0.3983113014487961

In [137]:
mean_squared_error(y_tensor["test"].numpy(), model.predict(x_tensor["test"]).reshape(-1))**0.5

11.872260667526938
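
The two test-set numbers above are R² and RMSE. As a reminder of what each one measures, here is a small self-contained sketch with made-up values (not the permeability data). The helper names `r2` and `rmse` are illustrative and follow the usual definitions: R² = 1 - SS_res / SS_tot, and RMSE is the square root of the mean squared error.

```python
def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the response."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up example: two predictions are off by one unit.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(round(r2(y_true, y_pred), 3))    # 0.9
print(round(rmse(y_true, y_pred), 3))  # 0.707
```

RMSE is on the scale of the response, so it can be read directly as a typical prediction error in permeability units. R² is unitless and compares the model against always predicting the mean.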