# T81-558: Applications of Deep Neural Networks
**Module 8: Kaggle Data Sets**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 8 Material

* Part 8.1: Introduction to Kaggle [[Video]](https://www.youtube.com/watch?v=v4lJBhdCuCU&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_1_kaggle_intro.ipynb)
* Part 8.2: Building Ensembles with Scikit-Learn and Keras [[Video]](https://www.youtube.com/watch?v=LQ-9ZRBLasw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_2_keras_ensembles.ipynb)
* Part 8.3: How Should you Architect Your Keras Neural Network: Hyperparameters [[Video]](https://www.youtube.com/watch?v=1q9klwSoUQw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_3_keras_hyperparameters.ipynb)
* Part 8.4: Bayesian Hyperparameter Optimization for Keras [[Video]](https://www.youtube.com/watch?v=sXdxyUCCm8s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_4_bayesian_hyperparameter_opt.ipynb)
* **Part 8.5: Current Semester's Kaggle** [[Video]](https://www.youtube.com/watch?v=48OrNYYey5E) [[Notebook]](t81_558_class_08_5_kaggle_project.ipynb)


In [None]:
# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
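
A helper like this is usually wrapped around a long-running step such as model training. The following is a minimal sketch of that usage. The `time.sleep` call is only a stand-in for real work, and the function is repeated so the example runs on its own.

```python
import time

# Nicely formatted time string (repeated from above so this runs standalone)
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

start_time = time.time()
time.sleep(0.1)  # stand-in for training or another slow step
elapsed = time.time() - start_time
print("Elapsed time: {}".format(hms_string(elapsed)))

# A fixed value shows the format: 3725.5 seconds is 1 hour, 2 minutes, 5.5 seconds
print(hms_string(3725.5))  # 1:02:05.50
```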

# Part 8.5: Current Semester's Kaggle

Kaggle competition site for the current semester (Fall 2019):

* [Fall 2019 Kaggle Assignment](https://kaggle.com/c/applications-of-deep-learningwustl-fall-2019)

Previous Kaggle competition sites for this class (NOT this semester's assignment, but feel free to reuse code from them):
* [Spring 2019 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2019)
* [Fall 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2018)
* [Spring 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-spring-2018)
* [Fall 2017 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2017)
* [Spring 2017 Kaggle Assignment](https://inclass.kaggle.com/c/applications-of-deep-learning-wustl-spring-2017)
* [Fall 2016 Kaggle Assignment](https://inclass.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2016)


# Iris as a Kaggle Competition

If the Iris data were used as a Kaggle competition, you would be given the following three files:

* [kaggle_iris_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv) - The data that Kaggle will evaluate you on. Contains only inputs; you must provide the answers. (contains x)
* [kaggle_iris_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_iris_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_sample.csv) - A sample submission for Kaggle. (contains x and y)

Important features of the Kaggle iris files (that differ from how we've previously seen files):

* The iris species is already index encoded.
* Your training data is in a separate file.
* You will load the test data to generate a submission file.
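
Index encoded means the species column holds integer class ids rather than names. The sketch below shows how such a column can be turned into the one-hot target matrix that a softmax output layer expects. The small DataFrame is made up for illustration and is not the real training file.

```python
import pandas as pd

# Made-up rows for illustration; the real file has more rows and columns
df_demo = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'sepal_l': [5.1, 7.0, 6.3, 4.9],
    'species': [0, 1, 2, 0],   # already index encoded
})

# One column per class; each row has a single 1 marking its class
dummies_demo = pd.get_dummies(df_demo['species'], dtype=float)
y_demo = dummies_demo.values

print(list(dummies_demo.columns))  # the class indexes
print(y_demo)
```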

The following program generates a submission file for "Iris Kaggle".  You can use it as a starting point for assignment 3.

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

df_train = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv",
                       na_values=['NA','?'])

# Drop the id column; it is a row identifier, not a feature
df_train.drop('id', axis=1, inplace=True)

# Count the distinct species (classes)
num_classes = df_train['species'].nunique()

print("Number of classes: {}".format(num_classes))

# Convert to numpy - Classification
x = df_train[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df_train['species']) # Classification
species = dummies.columns
y = dummies.values
    
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=45)

# Train, with early stopping
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))
model.add(Dense(25)) # No activation specified, so this layer is linear
model.add(Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto',
                       restore_best_weights=True)

model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=0,epochs=1000)

Number of classes: 3
Restoring model weights from the end of the best epoch.
Epoch 00096: early stopping


<tensorflow.python.keras.callbacks.History at 0x1a33005e10>

In [2]:
from sklearn import metrics

# Calculate multi log loss error
pred = model.predict(x_test)
score = metrics.log_loss(y_test, pred)
print("Log loss score: {}".format(score))


Log loss score: 0.13537994282320143
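
Multi-class log loss averages the negative log of the probability that the model assigned to each row's true class, so confident wrong answers are penalized heavily. The sketch below computes it by hand with numpy. `multi_log_loss` is a hypothetical helper name and the values are made up. For probability rows that sum to 1, the result should agree with `metrics.log_loss` up to clipping details.

```python
import numpy as np

# Illustrative values only: one-hot targets and predicted probabilities
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5]])

def multi_log_loss(y_true, y_prob, eps=1e-15):
    # Clip to avoid log(0), then average the negative log of the
    # probability assigned to each row's true class
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

loss = multi_log_loss(y_true, y_prob)
print("Log loss: {}".format(loss))
```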


In [3]:
# Generate Kaggle submit file

# Load the test data (inputs only; there is no species column)
df_test = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv",
                      na_values=['NA','?'])

# Keep the ids for the submission, then convert the features to numpy
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)
x = df_test[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values

# Generate predictions
pred = model.predict(x)
#pred

# Create submission data set

df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','species-0','species-1','species-2']

df_submit.to_csv("iris_submit.csv", index=False) # Write submit file locally

print(df_submit)


     id  species-0  species-1     species-2
0   100   0.013904   0.795336  1.907604e-01
1   101   0.001614   0.381189  6.171972e-01
2   102   0.002317   0.466447  5.312361e-01
3   103   0.973577   0.026410  1.352797e-05
4   104   0.981406   0.018586  8.082319e-06
5   105   0.979300   0.020693  6.637653e-06
6   106   0.994228   0.005771  5.370086e-07
7   107   0.000977   0.256403  7.426196e-01
8   108   0.017728   0.843886  1.383867e-01
9   109   0.000364   0.235242  7.643935e-01
10  110   0.000076   0.075360  9.245642e-01
11  111   0.997202   0.002798  1.343153e-07
12  112   0.052874   0.876635  7.049178e-02
13  113   0.000059   0.068381  9.315599e-01
14  114   0.986298   0.013698  3.236879e-06
15  115   0.002562   0.562420  4.350178e-01
16  116   0.054271   0.860268  8.546133e-02
17  117   0.981537   0.018459  3.920587e-06
18  118   0.989732   0.010266  1.793094e-06
19  119   0.990169   0.009830  1.572438e-06
20  120   0.015455   0.880904  1.036411e-01
21  121   0.021652   0.883587  9
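
Each row of the submission holds one probability per species, so the three probability columns should sum to roughly 1. A quick sanity check along those lines is sketched below. `check_submission` is a hypothetical helper, the column names are taken from the cell above, and the probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd

def check_submission(df, prob_cols, tol=1e-3):
    # Every row needs an id and probabilities that sum to about 1
    assert 'id' in df.columns
    assert not df.isnull().values.any()
    row_sums = df[prob_cols].sum(axis=1).values
    return bool(np.all(np.abs(row_sums - 1.0) < tol))

# Made-up predictions standing in for model.predict(x)
pred_demo = np.array([[0.01, 0.80, 0.19],
                      [0.97, 0.03, 0.00]])
df_demo = pd.DataFrame(pred_demo)
df_demo.insert(0, 'id', [100, 101])
df_demo.columns = ['id', 'species-0', 'species-1', 'species-2']

print(check_submission(df_demo, ['species-0', 'species-1', 'species-2']))
```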

# MPG as a Kaggle Competition (Regression)

If the Auto MPG data were used as a Kaggle competition, you would be given the following three files:

* [kaggle_auto_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv) - The data that Kaggle will evaluate you on. Contains only inputs; you must provide the answers. (contains x)
* [kaggle_auto_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_auto_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_sample.csv) - A sample submission for Kaggle. (contains x and y)

The MPG files follow the same layout as the iris files: the training data is in its own file, and you load the test data to generate a submission file.

The following program generates a submission file for "MPG Kaggle".

In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
import pandas as pd
import numpy as np
from sklearn import metrics

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_train.csv",
    na_values=['NA', '?'])

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values
y = df['mpg'].values # regression

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, 
                        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
          verbose=2,callbacks=[monitor],epochs=1000)

# Predict
pred = model.predict(x_test)

Train on 261 samples, validate on 88 samples
Epoch 1/1000
261/261 - 0s - loss: 57911.6912 - val_loss: 24465.1060
Epoch 2/1000
261/261 - 0s - loss: 14306.0091 - val_loss: 3008.5809
Epoch 3/1000
261/261 - 0s - loss: 1132.2912 - val_loss: 468.8732
Epoch 4/1000
261/261 - 0s - loss: 1021.1585 - val_loss: 1452.0326
Epoch 5/1000
261/261 - 0s - loss: 1362.6016 - val_loss: 799.4204
Epoch 6/1000
261/261 - 0s - loss: 555.7693 - val_loss: 304.9727
Epoch 7/1000
261/261 - 0s - loss: 303.6679 - val_loss: 310.3975
Epoch 8/1000
261/261 - 0s - loss: 317.3761 - val_loss: 296.4635
Epoch 9/1000
261/261 - 0s - loss: 283.5167 - val_loss: 252.1378
Epoch 10/1000
261/261 - 0s - loss: 252.0144 - val_loss: 244.2337
Epoch 11/1000
261/261 - 0s - loss: 250.3846 - val_loss: 244.8350
Epoch 12/1000
261/261 - 0s - loss: 247.3224 - val_loss: 241.5592
Epoch 13/1000
261/261 - 0s - loss: 245.5661 - val_loss: 242.9108
Epoch 14/1000
261/261 - 0s - loss: 245.4319 - val_loss: 241.9302
Epoch 15/1000
261/261 - 0s - loss: 243.2965

In [5]:
import numpy as np

# Measure RMSE error.  RMSE is common for regression.
# scikit-learn expects the true values first, then the predictions.
score = np.sqrt(metrics.mean_squared_error(y_test, pred))
print("Final score (RMSE): {}".format(score))

Final score (RMSE): 4.913462818455655
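
RMSE is the square root of the mean squared difference between predictions and targets, so it is expressed in the same units as the target (mpg here). The sketch below computes it by hand with numpy, using made-up values. It also shows a shape pitfall: a `Dense(1)` output has shape `(n, 1)` while the targets have shape `(n,)`, so the predictions should be flattened before subtracting.

```python
import numpy as np

# Illustrative values; pred has shape (n, 1) like Keras output, y has shape (n,)
pred_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 2.0, 5.0])

# Flatten first: subtracting (n, 1) and (n,) arrays directly would
# broadcast to an (n, n) matrix and give the wrong answer
errors = pred_demo.flatten() - y_demo
rmse = np.sqrt(np.mean(errors ** 2))
print("RMSE by hand: {}".format(rmse))

wrong = np.sqrt(np.mean((pred_demo - y_demo) ** 2))
print("Without flatten (incorrect): {}".format(wrong))
```

`metrics.mean_squared_error` handles the shape difference itself, which is why the cell above works without an explicit flatten.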


In [6]:
import pandas as pd

# Generate Kaggle submit file

# Load the test data (inputs only; there is no mpg column)
df_test = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv",
                      na_values=['NA','?'])

# Keep the ids for the submission
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)

# Handle missing values using the median from the training data (df)
df_test['horsepower'] = df_test['horsepower'].fillna(df['horsepower'].median())

# Convert to numpy - regression
x = df_test[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values

# Generate predictions
pred = model.predict(x)
#pred

# Create submission data set

df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','mpg']

df_submit.to_csv("auto_submit.csv", index=False) # Write submit file locally

print(df_submit)

     id        mpg
0   350  27.541994
1   351  25.862688
2   352  26.286074
3   353  28.067247
4   354  26.922823
5   355  29.997942
6   356  27.195421
7   357  27.913530
8   358  29.707598
9   359  27.877825
10  360  28.481096
11  361  27.393724
12  362  27.112352
13  363  29.838976
14  364  22.223558
15  365  11.889268
16  366  22.200274
17  367  20.895983
18  368  29.619097
19  369  29.607958
20  370  28.666338
21  371  28.795122
22  372  26.298723
23  373  26.294970
24  374  27.767382
25  375  28.209459
26  376  26.572680
27  377  28.218096
28  378  27.944262
29  379  25.759920
30  380  27.392458
31  381  27.163240
32  382  27.431566
33  383  27.360979
34  384  27.490450
35  385  27.584368
36  386  27.756899
37  387  25.689104
38  388  16.059099
39  389  24.993456
40  390  20.356920
41  391  27.254869
42  392  25.351488
43  393  26.956682
44  394  26.871660
45  395  27.191133
46  396  24.900072
47  397  28.000689
48  398  28.800188


# Module 8 Assignment

You can find the assignment for this module here: [assignment 8](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class8.ipynb)