<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_08_5_kaggle_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 8: Kaggle Data Sets**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 8 Material

* Part 8.1: Introduction to Kaggle [[Video]](https://www.youtube.com/watch?v=v4lJBhdCuCU&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_1_kaggle_intro.ipynb)
* Part 8.2: Building Ensembles with Scikit-Learn and Keras [[Video]](https://www.youtube.com/watch?v=LQ-9ZRBLasw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_2_keras_ensembles.ipynb)
* Part 8.3: How Should you Architect Your Keras Neural Network: Hyperparameters [[Video]](https://www.youtube.com/watch?v=1q9klwSoUQw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_3_keras_hyperparameters.ipynb)
* Part 8.4: Bayesian Hyperparameter Optimization for Keras [[Video]](https://www.youtube.com/watch?v=sXdxyUCCm8s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_4_bayesian_hyperparameter_opt.ipynb)
* **Part 8.5: Current Semester's Kaggle** [[Video]](https://www.youtube.com/watch?v=48OrNYYey5E) [[Notebook]](t81_558_class_08_5_kaggle_project.ipynb)


# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [1]:
# Start CoLab
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

Note: not using Google CoLab


# Part 8.5: Current Semester's Kaggle

Kaggke competition site for current semester (Spring 2020):

* Coming soon

Previous Kaggle competition sites for this class (NOT this semester's assignment, feel free to use code):
* [Fall 2019 Kaggle Assignment](https://kaggle.com/c/applications-of-deep-learningwustl-fall-2019)
* [Spring 2019 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2019)
* [Fall 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2018)
* [Spring 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-spring-2018)
* [Fall 2017 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2017)
* [Spring 2017 Kaggle Assignment](https://inclass.kaggle.com/c/applications-of-deep-learning-wustl-spring-2017)
* [Fall 2016 Kaggle Assignment](https://inclass.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2016)


# Iris as a Kaggle Competition

If the Iris data were used as a Kaggle, you would be given the following three files:

* [kaggle_iris_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv) - The data that Kaggle will evaluate you on.  Contains only input, you must provide answers.  (contains x)
* [kaggle_iris_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_iris_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_sample.csv) - A sample submission for Kaggle. (contains x and y)

Important features of the Kaggle iris files (that differ from how we've previously seen files):

* The iris species is already index encoded.
* Your training data is in a separate file.
* You will load the test data to generate a submission file.

The following program generates a submission file for "Iris Kaggle".  You can use it as a starting point for assignment 3.

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

df_train = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv",
                       na_values=['NA','?'])

# Encode feature vector
df_train.drop('id', axis=1, inplace=True)

num_classes = len(df_train.groupby('species').species.nunique())

print("Number of classes: {}".format(num_classes))

# Convert to numpy - Classification
x = df_train[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df_train['species']) # Classification
species = dummies.columns
y = dummies.values
    
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=45)

# Train, with early stopping
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))
model.add(Dense(25))
model.add(Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto',
                       restore_best_weights=True)

model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=0,epochs=1000)

Number of classes: 3
Restoring model weights from the end of the best epoch.
Epoch 00046: early stopping


<tensorflow.python.keras.callbacks.History at 0x1150f2c9448>

In [3]:
from sklearn import metrics

# Calculate multi log loss error
pred = model.predict(x_test)
score = metrics.log_loss(y_test, pred)
print("Log loss score: {}".format(score))


Log loss score: 0.22685463905334471


In [4]:
# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv",
                      na_values=['NA','?'])

# Convert to numpy - Classification
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)
x = df_test[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
y = dummies.values

# Generate predictions
pred = model.predict(x)
#pred

# Create submission data set

df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','species-0','species-1','species-2']

df_submit.to_csv("iris_submit.csv", index=False) # Write submit file locally

print(df_submit)


     id  species-0  species-1  species-2
0   100   0.026600   0.695883   0.277516
1   101   0.006443   0.421240   0.572316
2   102   0.011504   0.435335   0.553161
3   103   0.944092   0.053705   0.002203
4   104   0.948412   0.049458   0.002131
5   105   0.953590   0.044654   0.001756
6   106   0.978998   0.020543   0.000459
7   107   0.002823   0.281914   0.715264
8   108   0.037100   0.699165   0.263735
9   109   0.001407   0.275527   0.723067
10  110   0.000639   0.130363   0.868998
11  111   0.989023   0.010822   0.000156
12  112   0.075660   0.773884   0.150455
13  113   0.000598   0.168161   0.831242
14  114   0.963519   0.035327   0.001154
15  115   0.008615   0.527222   0.464163
16  116   0.074143   0.718673   0.207184
17  117   0.958608   0.040161   0.001231
18  118   0.969300   0.029848   0.000852
19  119   0.965449   0.033652   0.000899
20  120   0.036438   0.750558   0.213003
21  121   0.048888   0.758114   0.192998
22  122   0.005301   0.416095   0.578604
23  123   0.0008

### MPG as a Kaggle Competition (Regression)

If the Auto MPG data were used as a Kaggle, you would be given the following three files:

* [kaggle_mpg_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv) - The data that Kaggle will evaluate you on.  Contains only input, you must provide answers.  (contains x)
* [kaggle_mpg_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv) - The data that you will use to train. (contains x and y)
* [kaggle_mpg_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_sample.csv) - A sample submission for Kaggle. (contains x and y)

Important features of the Kaggle iris files (that differ from how we've previously seen files):

The following program generates a submission file for "MPG Kaggle".  

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics

save_path = "."

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_train.csv", 
    na_values=['NA', '?'])

cars = df['name']

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values
y = df['mpg'].values # regression

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, 
                        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
          verbose=2,callbacks=[monitor],epochs=1000)

# Predict
pred = model.predict(x_test)

Train on 261 samples, validate on 88 samples
Epoch 1/1000
261/261 - 1s - loss: 230113.1715 - val_loss: 151081.2528
Epoch 2/1000
261/261 - 0s - loss: 116223.8470 - val_loss: 59762.2560
Epoch 3/1000
261/261 - 0s - loss: 39098.6015 - val_loss: 13273.2504
Epoch 4/1000
261/261 - 0s - loss: 6680.5549 - val_loss: 1147.8249
Epoch 5/1000
261/261 - 0s - loss: 836.4371 - val_loss: 1430.0602
Epoch 6/1000
261/261 - 0s - loss: 1800.4659 - val_loss: 1944.2812
Epoch 7/1000
261/261 - 0s - loss: 1693.8137 - val_loss: 1202.8138
Epoch 8/1000
261/261 - 0s - loss: 954.3587 - val_loss: 758.4105
Epoch 9/1000
261/261 - 0s - loss: 718.5414 - val_loss: 722.2682
Epoch 10/1000
261/261 - 0s - loss: 727.7613 - val_loss: 732.7944
Epoch 11/1000
261/261 - 0s - loss: 720.3635 - val_loss: 711.4131
Epoch 12/1000
261/261 - 0s - loss: 697.7491 - val_loss: 701.7881
Epoch 13/1000
261/261 - 0s - loss: 692.8152 - val_loss: 699.4777
Epoch 14/1000
261/261 - 0s - loss: 692.8855 - val_loss: 696.9256
Epoch 15/1000
261/261 - 0s - los

Epoch 126/1000
261/261 - 0s - loss: 167.9202 - val_loss: 161.7919
Epoch 127/1000
261/261 - 0s - loss: 164.9150 - val_loss: 157.5832
Epoch 128/1000
261/261 - 0s - loss: 162.8307 - val_loss: 155.3778
Epoch 129/1000
261/261 - 0s - loss: 162.7091 - val_loss: 152.8135
Epoch 130/1000
261/261 - 0s - loss: 158.2949 - val_loss: 149.9091
Epoch 131/1000
261/261 - 0s - loss: 155.5795 - val_loss: 147.6993
Epoch 132/1000
261/261 - 0s - loss: 152.6130 - val_loss: 145.1060
Epoch 133/1000
261/261 - 0s - loss: 149.9227 - val_loss: 142.0749
Epoch 134/1000
261/261 - 0s - loss: 147.7978 - val_loss: 140.7187
Epoch 135/1000
261/261 - 0s - loss: 143.8276 - val_loss: 139.0353
Epoch 136/1000
261/261 - 0s - loss: 144.2169 - val_loss: 135.5472
Epoch 137/1000
261/261 - 0s - loss: 140.1991 - val_loss: 132.5593
Epoch 138/1000
261/261 - 0s - loss: 137.9367 - val_loss: 131.6151
Epoch 139/1000
261/261 - 0s - loss: 135.3024 - val_loss: 127.8738
Epoch 140/1000
261/261 - 0s - loss: 132.8866 - val_loss: 125.8925
Epoch 141/

Epoch 254/1000
261/261 - 0s - loss: 17.2177 - val_loss: 15.5885
Epoch 255/1000
261/261 - 0s - loss: 17.6090 - val_loss: 15.6199
Epoch 256/1000
261/261 - 0s - loss: 17.9155 - val_loss: 15.2379
Epoch 257/1000
261/261 - 0s - loss: 16.8268 - val_loss: 15.0690
Epoch 258/1000
261/261 - 0s - loss: 16.4666 - val_loss: 15.4264
Epoch 259/1000
261/261 - 0s - loss: 16.4720 - val_loss: 15.1090
Epoch 260/1000
261/261 - 0s - loss: 16.2128 - val_loss: 14.7747
Epoch 261/1000
261/261 - 0s - loss: 16.1341 - val_loss: 15.1045
Epoch 262/1000
261/261 - 0s - loss: 16.8712 - val_loss: 14.5642
Epoch 263/1000
261/261 - 0s - loss: 17.1231 - val_loss: 15.1537
Epoch 264/1000
261/261 - 0s - loss: 17.0708 - val_loss: 14.5035
Epoch 265/1000
261/261 - 0s - loss: 15.8614 - val_loss: 14.5180
Epoch 266/1000
261/261 - 0s - loss: 15.7627 - val_loss: 14.3447
Epoch 267/1000
261/261 - 0s - loss: 15.4339 - val_loss: 14.5208
Epoch 268/1000
261/261 - 0s - loss: 15.4930 - val_loss: 14.2319
Epoch 269/1000
261/261 - 0s - loss: 15.4

In [6]:
import numpy as np

# Measure RMSE error.  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(pred,y_test))
print("Final score (RMSE): {}".format(score))

Final score (RMSE): 3.6343379521241688


In [7]:
import pandas as pd

# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv",
                      na_values=['NA','?'])

# Convert to numpy - regression
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)

# Handle missing value
df_test['horsepower'] = df_test['horsepower'].fillna(df['horsepower'].median())

x = df_test[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values




# Generate predictions
pred = model.predict(x)
#pred

# Create submission data set

df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','mpg']

df_submit.to_csv("auto_submit.csv", index=False) # Write submit file locally

print(df_submit)

     id        mpg
0   350  33.464775
1   351  29.650288
2   352  31.757906
3   353  30.041067
4   354  30.385283
5   355  29.575613
6   356  31.040102
7   357  30.579294
8   358  27.454201
9   359  28.609385
10  360  23.245211
11  361  23.613840
12  362  25.289766
13  363  24.627230
14  364  22.129740
15  365  24.595829
16  366  25.160017
17  367  21.562229
18  368  28.345316
19  369  27.550282
20  370  29.932812
21  371  27.228390
22  372  28.579571
23  373  27.818499
24  374  25.329622
25  375  25.671453
26  376  33.618538
27  377  34.471153
28  378  34.811119
29  379  31.049118
30  380  31.899542
31  381  32.712051
32  382  31.650806
33  383  32.239178
34  384  33.770924
35  385  34.064060
36  386  33.950958
37  387  25.986696
38  388  28.534342
39  389  28.171833
40  390  28.468870
41  391  28.015144
42  392  28.989521
43  393  25.214060
44  394  25.769728
45  395  35.472599
46  396  29.209511
47  397  27.978807
48  398  27.268869


# Module 8 Assignment

You can find the first assignment here: [assignment 8](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class8.ipynb)