### This notebook contains the code to generate the submission for the "Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines" competition.

The submission should contain 3 columns, with the respondent_id, the probability someone gets the H1N1 vaccine (h1n1_vaccine), and the probability that someone gets the flu shot (seasonal_vaccine).
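As a rough illustration of that format, the sketch below checks a submission CSV using only the standard library. The helper `check_submission` and the sample rows are hypothetical, and it assumes the three columns must appear in exactly the order listed above.

```python
import csv
import io

EXPECTED_COLUMNS = ["respondent_id", "h1n1_vaccine", "seasonal_vaccine"]


def check_submission(csv_text):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in EXPECTED_COLUMNS[1:]:
            try:
                p = float(row[col])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {col} is not a number")
                continue
            if not 0.0 <= p <= 1.0:
                problems.append(f"line {line_no}: {col}={p} is outside [0, 1]")
    return problems


# Hypothetical two-row submission; the second row has an invalid probability.
sample = (
    "respondent_id,h1n1_vaccine,seasonal_vaccine\n"
    "26707,0.37,0.63\n"
    "26708,0.32,1.2\n"
)
print(check_submission(sample))
```

Running a check like this on the saved file before uploading catches a stray index column or an out-of-range value early.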

To train the model, we use the training_set_features.csv data, with the training_set_labels.csv data as the known labels (0 or 1 for each vaccine). Finally, we predict the probabilities for the test_set_features.csv data.

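The submission is scored by the area under the receiver operating characteristic curve (ROC AUC), with the default "average='macro'", i.e. the unweighted mean of the ROC AUC for h1n1_vaccine and the ROC AUC for seasonal_vaccine.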
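To make the macro average concrete, here is a minimal pure-Python sketch. The helper names and the toy data are made up for illustration, and the pairwise loop is quadratic, so it is only meant to show the definition; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def binary_roc_auc(y_true, y_score):
    """ROC AUC for one binary target via pairwise comparison of scores.

    Equals the probability that a random positive is scored above a random
    negative, counting ties as one half.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(Y_true, Y_score):
    """Unweighted mean of the per-column ROC AUC (the 'macro' average)."""
    n_targets = len(Y_true[0])
    aucs = [
        binary_roc_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_targets)
    ]
    return sum(aucs) / n_targets


# Hypothetical toy data: columns are (h1n1_vaccine, seasonal_vaccine).
Y_true = [[0, 1], [1, 1], [0, 0], [1, 0]]
Y_score = [[0.1, 0.8], [0.7, 0.4], [0.3, 0.3], [0.2, 0.6]]

print(macro_roc_auc(Y_true, Y_score))  # 0.75 (each column scores 0.75)
```

Because each target is scored separately and then averaged, a model that ranks respondents well for one vaccine but poorly for the other is penalized accordingly.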

In this notebook we train a deep learning model (a fully connected neural network built with Keras).

In [1]:
import pandas as pd
from sklearn.metrics import roc_auc_score
from tensorflow import keras
from tensorflow.keras import layers

In [3]:
X_train_prep = pd.read_csv('X_train_prep.csv')
X_valid_prep = pd.read_csv('X_valid_prep.csv')
y_train = pd.read_csv('y_train.csv')
y_valid = pd.read_csv('y_valid.csv')

In [4]:
# Define, compile, and train the model
n_units = 512
model = keras.Sequential([
    layers.Dense(n_units, activation="relu", input_shape=[X_train_prep.shape[1]]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(n_units, activation="relu"),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    # layers.Dense(n_units, activation="relu"),
    # layers.Dropout(0.3),
    # layers.BatchNormalization(),
    # layers.Dense(n_units, activation="relu"),
    # layers.Dropout(0.3),
    # layers.BatchNormalization(),
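    # One output per target (h1n1_vaccine, seasonal_vaccine). Note that softmax
    # forces the two outputs to sum to 1, although the targets are independent.
    layers.Dense(2, activation="softmax")
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',  # pairs with the softmax output layer
    metrics=['accuracy'],
)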

early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

history = model.fit(
    X_train_prep, y_train,
    validation_data=(X_valid_prep, y_valid),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0
)
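The output layer above uses softmax, which normalizes the two outputs so that they always sum to 1. The sketch below, with made-up logits and hand-written `softmax` and `sigmoid` helpers, shows what that means for a respondent who is likely to get both vaccines. It is an illustration only, not what was run in this notebook.

```python
import math


def softmax(logits):
    """Softmax over one row of logits; the outputs always sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Element-wise sigmoid; each output is an independent probability."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical logits for one respondent who is likely to get both vaccines.
logits = [2.0, 3.0]

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

# Softmax cannot assign a high probability to both targets at once.
print("softmax:", [round(p, 3) for p in p_softmax], "sum =", round(sum(p_softmax), 3))
# Sigmoid treats each target separately, so both can be close to 1.
print("sigmoid:", [round(p, 3) for p in p_sigmoid], "sum =", round(sum(p_sigmoid), 3))
```

For two independent binary targets, the usual Keras pairing is `activation="sigmoid"` on the output layer with `loss="binary_crossentropy"`. Whether that would close the gap here is untested; the scores reported in this notebook all come from the softmax version.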

In [5]:
# Calculate the ROC AUC score on the validation set (macro-averaged over both targets)
y_pred = model.predict(X_valid_prep)
roc_auc = roc_auc_score(y_valid, y_pred)
print("The ROC AUC score is", roc_auc)

The ROC AUC score is 0.6431609416439366


Scores (ROC AUC) by number of layers and units per layer; the best score is in bold:

| Layers | 128 units | 256 units | 512 units | 1024 units |
| --- | --- | --- | --- | --- |
| 1 | 0.6128 | 0.6041 | 0.6109 | 0.6052 |
| 2 | 0.5866 | 0.6283 | **0.6310** | 0.6273 |
| 3 | 0.6086 | 0.6151 | 0.6252 | 0.6196 |
| 4 | 0.5869 | 0.6089 | 0.5595 | 0.6245 |

In [6]:
# Load test data
X_test_prep = pd.read_csv('X_test_prep.csv')

# Initialize the output DataFrame with the respondent IDs
X_test = pd.read_csv("test_set_features.csv")
output = pd.DataFrame(X_test["respondent_id"])

In [7]:
# Make predictions
y_pred = model.predict(X_test_prep)



In [69]:
# Split the predictions into one array per target
y_pred_h1n1 = y_pred[:, 0]
y_pred_seas = y_pred[:, 1]

# Add the predictions to the output dataframe
output["h1n1_vaccine"] = y_pred_h1n1
output["seasonal_vaccine"] = y_pred_seas
output.head()

| | respondent_id | h1n1_vaccine | seasonal_vaccine |
| --- | --- | --- | --- |
| 0 | 26707 | 0.36985 | 0.63015 |
| 1 | 26708 | 0.321287 | 0.678713 |
| 2 | 26709 | 0.349197 | 0.650803 |
| 3 | 26710 | 0.30606 | 0.69394 |
| 4 | 26711 | 0.312955 | 0.687046 |


In [70]:
# Save the output as csv
output.to_csv("submission_deep.csv", index=False)

### Final note:

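This model got a submission score of 0.6355, which is a lot worse than XGBoost. A likely contributor is the softmax output layer: it forces the two predicted probabilities to sum to 1 for every respondent, as the output above shows, even though the two vaccines are independent targets.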