<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_08_5_kaggle_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 8: Kaggle Data Sets**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 8 Material

* Part 8.1: Introduction to Kaggle [[Video]](https://www.youtube.com/watch?v=7Mk46fb0Ayg&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_1_kaggle_intro.ipynb)
* Part 8.2: Building Ensembles with Scikit-Learn and PyTorch [[Video]](https://www.youtube.com/watch?v=przbLRCRL24&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_2_pytorch_ensembles.ipynb)
* Part 8.3: How Should you Architect Your PyTorch Neural Network: Hyperparameters [[Video]](https://www.youtube.com/watch?v=YTL2BR4U2Ng&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_3_pytorch_hyperparameters.ipynb)
* Part 8.4: Bayesian Hyperparameter Optimization for PyTorch [[Video]](https://www.youtube.com/watch?v=1f4psgAcefU&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_4_bayesian_hyperparameter_opt.ipynb)
* **Part 8.5: Current Semester's Kaggle** [[Video]] [[Notebook]](t81_558_class_08_5_kaggle_project.ipynb)

# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab and selects a GPU (CUDA or Apple MPS) for PyTorch if one is available.

In [1]:
# Start CoLab
try:
    import google.colab  # only importable inside Google CoLab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
import torch
device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using device: {device}")

Note: using Google CoLab
Using device: mps


# Part 8.5: Current Semester's Kaggle

Kaggle competition site for the current semester:
* [Fall 2023 Kaggle Assignment](https://www.kaggle.com/competitions/applications-of-deep-learning-wustl-fall-2023/overview)

Previous Kaggle competition sites for this class (NOT this semester's assignment, but feel free to reuse code from them):
* [Spring 2023 Kaggle Assignment](https://www.kaggle.com/competitions/applications-of-deep-learning-wustlspring-2023)
* [Fall 2022 Kaggle Assignment](https://www.kaggle.com/competitions/applications-of-deep-learning-wustlfall-2022)
* [Spring 2022 Kaggle Assignment](https://www.kaggle.com/c/tsp-cv)
* [Fall 2021 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learning-wustlfall-2021)
* [Spring 2021 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learning-wustl-spring-2021b)
* [Fall 2020 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learning-wustl-fall-2020)
* [Spring 2020 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2020)
* [Fall 2019 Kaggle Assignment](https://kaggle.com/c/applications-of-deep-learningwustl-fall-2019)
* [Spring 2019 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2019)
* [Fall 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2018)
* [Spring 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-spring-2018)
* [Fall 2017 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2017)
* [Spring 2017 Kaggle Assignment](https://inclass.kaggle.com/c/applications-of-deep-learning-wustl-spring-2017)
* [Fall 2016 Kaggle Assignment](https://inclass.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2016)


## Iris as a Kaggle Competition

If I used the Iris data as a Kaggle competition, I would give you the following three files:

* [kaggle_iris_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv) - The data that Kaggle will evaluate you on. It contains only input; you must provide answers.  (contains x)
* [kaggle_iris_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_iris_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_sample.csv) - A sample submission for Kaggle. (contains x and y)

Important features of the Kaggle iris files (that differ from how we've previously seen files):

* The iris species is already index encoded.
* Your training data is in a separate file.
* You will load the test data to generate a submission file.
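
To illustrate the first point, here is a small sketch of what index-encoded labels look like and how they map to one-hot vectors. The three-row frame is made-up data in the same column layout as the training file, not the actual contents of `kaggle_iris_train.csv`.

```python
import pandas as pd

# Hypothetical rows in the same layout as the training file (values are made up).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "sepal_l": [5.1, 6.3, 5.9],
    "sepal_w": [3.5, 2.9, 3.0],
    "petal_l": [1.4, 5.6, 4.2],
    "petal_w": [0.2, 1.8, 1.5],
    "species": [0, 2, 1],  # index encoded: integers, not species names
})

# One-hot encode the integer labels; one column per class, ordered by class index.
dummies = pd.get_dummies(df["species"])
y = dummies.values.astype("float32")

# argmax along each row recovers the original class index.
recovered = y.argmax(axis=1)
print(list(dummies.columns))
print(recovered.tolist())
```

Because the labels are already integers, the one-hot step is only needed when a loss function or metric expects one column per class.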

The following program generates a submission file for "Iris Kaggle". You can use it as a starting point for assignment 3.

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from sklearn import metrics
import numpy as np

# Read the data
df_train = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv", na_values=['NA', '?'])

# Encode feature vector
df_train.drop('id', axis=1, inplace=True)

num_classes = df_train['species'].nunique()
print("Number of classes: {}".format(num_classes))

# Convert to numpy - Classification
x = df_train[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df_train['species'])  # Classification
species = dummies.columns
y = dummies.values

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=45)

# Convert to PyTorch tensors
x_train, y_train = torch.tensor(x_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32)
x_test, y_test = torch.tensor(x_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32)

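# Define the model using torch.nn.Sequential
model = nn.Sequential(
    nn.Linear(x.shape[1], 50),
    nn.ReLU(),
    nn.Linear(50, 25),
    nn.ReLU(),  # without this, the two adjacent Linear layers collapse into one
    nn.Linear(25, y.shape[1]),
    nn.Softmax(dim=1)  # the model outputs class probabilities
)

optimizer = torch.optim.Adam(model.parameters())

# The model already applies Softmax, so use negative log-likelihood on the
# log-probabilities. nn.CrossEntropyLoss expects raw logits and would apply
# softmax a second time.
nll_loss = nn.NLLLoss()

def loss_fn(probs, target):
    return nll_loss(torch.log(torch.clamp(probs, min=1e-9)), target)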

# Training loop with early stopping
n_epochs = 1000
patience = 5
best_loss = float('inf')
early_stopping_counter = 0

for epoch in range(n_epochs):
    # Train
    model.train()
    optimizer.zero_grad()
    y_pred = model(x_train)
    loss = loss_fn(y_pred, torch.argmax(y_train, 1))
    loss.backward()
    optimizer.step()

    # Validate
    model.eval()
    with torch.no_grad():
        y_val_pred = model(x_test)
        val_loss = loss_fn(y_val_pred, torch.argmax(y_test, 1))
    
    if val_loss < best_loss:
        best_loss = val_loss
        early_stopping_counter = 0
    else:
        early_stopping_counter += 1
        if early_stopping_counter >= patience:
            print("Early Stopping!")
            break

Number of classes: 3
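
The early-stopping logic in the training loop can be isolated into a small helper. This is a sketch, not part of the course code: the `EarlyStopper` class and its `min_delta` parameter are names introduced here for illustration, and the validation losses are made-up numbers.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` checks."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def should_stop(self, val_loss):
        # An epoch counts as an improvement only if it beats the best loss by min_delta.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

Note that stopping this way leaves the model with the weights from the final epoch, not the best one. Saving a copy of `model.state_dict()` whenever the loss improves is a common extension.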


Now that we've trained the neural network, we can check its log loss.

In [3]:
# Calculate multi log loss error
model.eval()
with torch.no_grad():
    y_pred = model(x_test)
    y_pred = y_pred.numpy()
score = metrics.log_loss(y_test, y_pred)
print("Log loss score: {}".format(score))

Log loss score: 0.015514996025663396
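
Multi-class log loss is the mean negative logarithm of the probability the model assigned to the correct class, so it rewards confident correct answers and penalizes hesitant or confidently wrong ones. The sketch below computes it by hand on made-up predictions; the `multiclass_log_loss` function is written here for illustration and is not the scikit-learn implementation.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the correct class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Made-up predictions for three flowers (each row sums to 1).
y_true = [0, 1, 2]
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
hesitant = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30], [0.30, 0.30, 0.40]]

loss_confident = multiclass_log_loss(y_true, confident)
loss_hesitant = multiclass_log_loss(y_true, hesitant)
print(loss_confident)
print(loss_hesitant)
```

Both sets of predictions pick the right class every time, so their accuracy is identical, yet the hesitant model scores roughly 0.92 against roughly 0.02 for the confident one.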


Now we are ready to generate the Kaggle submission file.  We will use the iris test data that does not contain a $y$ target value.  It is our job to predict this value and submit it to Kaggle.

In [4]:
# Generate Kaggle submit file
df_test = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv", na_values=['NA', '?'])

# Convert to numpy - Classification
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)
x_kaggle = df_test[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
x_kaggle = torch.tensor(x_kaggle, dtype=torch.float32)

# Generate predictions
model.eval()
with torch.no_grad():
    pred_kaggle = model(x_kaggle)
pred_kaggle = pred_kaggle.numpy()

# Create submission data set
df_submit = pd.DataFrame(pred_kaggle)
df_submit.insert(0, 'id', ids)
df_submit.columns = ['id', 'species-0', 'species-1', 'species-2']

# Write submit file locally
df_submit.to_csv("iris_submit.csv", index=False)

print(df_submit.head())


    id     species-0  species-1     species-2
0  100  5.431684e-05   0.999945  3.297705e-07
1  101  6.042830e-09   0.010619  9.893807e-01
2  102  6.944081e-10   0.000963  9.990373e-01
3  103  9.997644e-01   0.000236  2.038801e-36
4  104  9.998689e-01   0.000131  3.686617e-37
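
Before uploading, it can help to sanity-check the submission frame. The sketch below uses a small made-up frame with the same column names as `df_submit`. The `check_submission` helper is hypothetical, and the checks it performs are assumptions about what a well-formed probability submission looks like, not a description of Kaggle's own validation.

```python
import pandas as pd

def check_submission(df, prob_cols, tol=1e-4):
    """Return a list of problems found in a probability submission frame."""
    if list(df.columns) != ["id"] + prob_cols:
        return ["unexpected columns"]
    problems = []
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df[prob_cols].isna().any().any():
        problems.append("missing probabilities")
    row_sums = df[prob_cols].sum(axis=1)
    if ((row_sums - 1.0).abs() > tol).any():
        problems.append("rows do not sum to 1")
    return problems

cols = ["species-0", "species-1", "species-2"]
good = pd.DataFrame({"id": [100, 101],
                     "species-0": [0.1, 0.8],
                     "species-1": [0.7, 0.1],
                     "species-2": [0.2, 0.1]})
bad = good.copy()
bad.loc[0, "species-2"] = 0.5  # row 0 now sums to 1.3

print(check_submission(good, cols))
print(check_submission(bad, cols))
```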


## MPG as a Kaggle Competition (Regression)

If the Auto MPG data were used as a Kaggle competition, you would be given the following three files:

* [kaggle_auto_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv) - The data that Kaggle will evaluate you on. It contains only input; you must provide answers.  (contains x)
* [kaggle_auto_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_auto_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_sample.csv) - A sample submission for Kaggle. (contains x and y)

The following program generates a submission file for "MPG Kaggle".  

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics

# Download and preprocess data
save_path = "."
df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_train.csv", na_values=['NA', '?'])
cars = df['name']
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

x = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']].values
y = df['mpg'].values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Convert numpy arrays to PyTorch tensors
x_train, x_test, y_train, y_test = map(torch.tensor, (x_train, x_test, y_train, y_test))
x_train, x_test = x_train.float(), x_test.float()
y_train, y_test = y_train.float().unsqueeze(1), y_test.float().unsqueeze(1)

# Define the neural network using Sequential
model = nn.Sequential(
    nn.Linear(x_train.shape[1], 25),
    nn.ReLU(),
    nn.Linear(25, 10),
    nn.ReLU(),
    nn.Linear(10, 1)
)

# Define loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters())

# Early stopping criteria
min_delta = 1e-3
patience = 5
best_loss = float('inf')
count = 0

# Training loop
for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    outputs = model(x_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        model.eval()
        val_outputs = model(x_test)
        val_loss = criterion(val_outputs, y_test)
        if val_loss < best_loss - min_delta:
            best_loss = val_loss
            count = 0
        else:
            count += 1
        if count > patience:
            print("Early stopping")
            break

Early stopping


Now that we've trained the neural network, we can check its RMSE error.

In [6]:
# Predict
model.eval()
with torch.no_grad():
    pred = model(x_test)

# Measure RMSE
score = torch.sqrt(criterion(pred, y_test))
print("Final score (RMSE):", score.item())

Final score (RMSE): 13.760814666748047
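
RMSE is the square root of the mean squared difference between predictions and targets, which puts it in the same units as the target (here, miles per gallon). The sketch below works through the arithmetic on made-up values; the `rmse` function is written for illustration and stands in for the `torch.sqrt(criterion(...))` call above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up MPG values: the errors are -2, 1, 3, and 0.
actual = [18.0, 25.0, 31.0, 22.0]
predicted = [20.0, 24.0, 28.0, 22.0]

score = rmse(actual, predicted)
print(score)  # sqrt((4 + 1 + 9 + 0) / 4) = sqrt(3.5)
```

Because squaring weights large errors more heavily, a single prediction that is far off raises RMSE more than several small misses of the same total size.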


Now we are ready to generate the Kaggle submission file.  We will use the MPG test data that does not contain a $y$ target value.  It is our job to predict this value and submit it to Kaggle.

In [7]:
# Measure RMSE
score = torch.sqrt(criterion(pred, y_test))
print("Final score (RMSE):", score.item())

# Predict on the Kaggle test set
df_test = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv", na_values=['NA', '?'])
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)
df_test['horsepower'] = df_test['horsepower'].fillna(df['horsepower'].median())
x = torch.tensor(df_test[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']].values).float()

with torch.no_grad():
    predictions = model(x)

# Prepare submission
df_submit = pd.DataFrame(predictions.numpy(), columns=['mpg'])
df_submit.insert(0, 'id', ids)
df_submit.to_csv("auto_submit.csv", index=False)
print(df_submit.head())

Final score (RMSE): 13.760814666748047
    id        mpg
0  350   9.085001
1  351  10.218105
2  352   9.354208
3  353  11.105295
4  354  10.152960
