Neural Network Classification
========================

## Instructions

Run each cell from top to bottom. Try to understand the output of each command. If in doubt, ask your neighbours or Jori.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from keras.optimizers import Adam
%matplotlib inline

We will load a dataset from the US Census Bureau and predict whether a person makes more than $50K a year or not.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/jvanlier/TIAS_ML_DL/master/Day2Notebooks/data/census.csv")
df.head()

Below are some more details about this dataset:


- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Note that many columns are textual. Most models need numeric input, so these columns have to be encoded first.

Our target is the column `more-than-50k`.

Luckily, `pd.get_dummies` makes it easy to one-hot encode the dataset:

In [None]:
df_ohe = pd.get_dummies(df, drop_first=True)
df_ohe.head()
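
To see what `pd.get_dummies` does with `drop_first=True`, here is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative and not taken from the census data). Numeric columns pass through unchanged, and each text column becomes one indicator column per category, minus the first category in sorted order.

```python
import pandas as pd

# Hypothetical toy frame, not part of the census dataset.
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["Female", "Male", "Female"],
    "race": ["White", "Black", "Other"],
})

toy_ohe = pd.get_dummies(toy, drop_first=True)

# "sex" has 2 categories  -> 1 indicator column (the first, "Female", is dropped).
# "race" has 3 categories -> 2 indicator columns (the first, "Black", is dropped).
print(list(toy_ohe.columns))
```

Dropping the first category loses no information: a row with all indicators for a column set to 0 must belong to the dropped category.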

Let's take a look at the skew in the target:

In [None]:
df_ohe["more-than-50k"].value_counts()

Yes, it's fairly skewed, with many more 0 instances than 1 instances. Let's use the F1 score this time instead of accuracy.
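
To make the motivation concrete, here is a small sketch with made-up labels (not the census data): a model that always predicts the majority class 0 reaches a high accuracy but an F1 score of zero. The helper `f1_from_labels` is written out by hand purely for illustration and returns 0 when there are no true positives.

```python
# Made-up, skewed labels: 8 negatives and 2 positives.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "lazy" model that always predicts the majority class.
toy_pred = [0] * len(toy_true)

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

def f1_from_labels(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # without true positives, F1 is taken to be 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_f1 = f1_from_labels(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.2f}, f1={toy_f1:.2f}")
```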

## Train-test split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df_ohe.drop("more-than-50k", axis=1), 
    df_ohe["more-than-50k"], 
    test_size=0.2, 
    random_state=0)
print(f"{len(X_train)} training instances and {len(X_test)} test instances.")

### Baseline
Before we start diving into Neural Nets, let's first try setting a baseline with Logistic Regression.

In [None]:
from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV(scoring="f1", max_iter=1000, cv=3, random_state=0)

`LogisticRegressionCV` uses an internal cross-validation loop to find a good value for the regularization parameter. This, as we know by now, helps with the overfitting problem.

As a warm-up, start with a fit on the training data. This should be familiar after last week!

In [None]:
# YOUR CODE HERE

Validate the model on both train and test.

In [None]:
# YOUR CODE HERE

It could also be useful to take a look at the confusion matrix. We discussed this last week. It contains the number of True Positives, False Positives, False Negatives and True Negatives.

Take a look at the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [None]:
# YOUR CODE HERE: import confusion matrix (hint: see top of documentation page for import path)

In [None]:
# YOUR CODE HERE: create confusion matrix (hint: you need to pass it predictions on X_test)

How many True Negatives do you have? And how many True Positives? Refer to the documentation to find out what each cell in the confusion matrix means.
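
As a reading aid, the sketch below builds a 2x2 confusion matrix by hand on made-up labels, using the layout described in the scikit-learn documentation: rows are true labels and columns are predicted labels. The arrays `cm_true` and `cm_pred` are illustrative only; on the real data you would still use `confusion_matrix` itself.

```python
import numpy as np

# Made-up labels for illustration only.
cm_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
cm_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Rows index the true label, columns index the predicted label.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(cm_true, cm_pred):
    cm[t, p] += 1

# For binary labels, flattening row by row gives TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```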

## Neural Networks
Let's now try to improve on this result by moving towards a more complex model.

`scikit-learn` provides a basic neural network, but it's not really used a lot. Most practitioners prefer Keras or PyTorch. We'll use Keras, which is backed by Google.

In [None]:
from keras import Sequential
from keras.layers import Dense

We'll start by making a network that mimics Logistic Regression.

In [None]:
np.random.seed(1)  # Leave this here! It helps with reproducibility of results.
model = Sequential()

Now, we need to add just the single sigmoid node. In Keras terminology, this is a `Dense` layer, with a single unit.

To initialize a dense layer, with 1 unit, `m` input features, and sigmoid activation function, use the following:

`dense = Dense(1, input_dim=m, activation="sigmoid")`

To figure out what `m` is, you may use `X_train.shape` or `len(X_train.columns)`.
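
As a rough sketch of what such a layer computes: a single sigmoid unit takes a weighted sum of the `m` inputs plus a bias and squashes it into the range (0, 1), which is the same functional form as Logistic Regression. The weights below are random placeholders rather than trained values, and `single_sigmoid_unit` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_sigmoid_unit(X, w, b):
    """Forward pass of one dense unit: sigmoid(X @ w + b)."""
    return sigmoid(X @ w + b)

rng = np.random.default_rng(0)
m_toy = 5                                # placeholder for the number of input features
X_toy = rng.normal(size=(3, m_toy))      # 3 made-up rows
w_toy = rng.normal(size=m_toy)           # one weight per input feature
b_toy = 0.0                              # one bias term

p_toy = single_sigmoid_unit(X_toy, w_toy, b_toy)
print(p_toy.shape)       # one probability per row
print(w_toy.size + 1)    # trainable parameters: m_toy weights + 1 bias
```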

In [None]:
# YOUR CODE HERE: create dense layer

Now add your layer to the model using `model.add(...)`:

In [None]:
# YOUR CODE HERE: add dense layer to model

The following command tells Keras how to optimize and evaluate the model. Unfortunately, there is no easy way to show the F1 score during optimization, so we will check it afterwards.

You hopefully remember *binary crossentropy* from last week?

In [None]:
model.compile(loss="binary_crossentropy", optimizer=Adam(lr=0.0003))

The following command starts training the neural network.

In [None]:
# validation_data is documented as a tuple (x_val, y_val)
model.fit(X_train, y_train, epochs=50, batch_size=200, validation_data=(X_test, y_test))

50 epochs should get you a validation loss of approximately 0.35.

Ok, now let's see the F1 score. First, we have to get predicted classes. Use `model.predict_classes(...)`.
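
For a single sigmoid output, getting predicted classes amounts to thresholding the predicted probabilities at 0.5. The sketch below shows this with NumPy on made-up probabilities; `toy_probs` stands in for the output of `model.predict(X_test)`, which has shape `(n, 1)`. The same expression is a workable substitute on newer Keras versions, where `predict_classes` is no longer available.

```python
import numpy as np

# Stand-in for model.predict(X_test): one probability per row, shape (n, 1).
toy_probs = np.array([[0.91], [0.08], [0.50], [0.63]])

# Threshold at 0.5 and flatten to a 1-D array of 0/1 labels.
toy_classes = (toy_probs > 0.5).astype("int32").ravel()
print(toy_classes)
```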

In [None]:
y_test_hat = # YOUR CODE HERE

In [None]:
from sklearn.metrics import f1_score

In [None]:
# YOUR CODE HERE: use f1_score function

How do you feel about this F1 score? Disappointing?

Well, that's as expected!

The Logistic Regression implementation in scikit-learn uses a very sophisticated optimizer (L-BFGS). Neural Networks use less sophisticated optimizers (backpropagation with gradient descent), which makes them harder to train. You need to get many things right, e.g. the number of epochs, the batch size and the learning rate. However, the less sophisticated optimizer used in Neural Networks *does* allow us to do backpropagation and update hidden layers, which we shall do soon!

But first, try running the `fit()` method again and see if this improves the F1 score. In contrast to scikit-learn, repeated calls to this `fit()` method do not overwrite the previous model, but in fact continue training! You may run this command multiple times, until you no longer see `val_loss` improving.
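
The "continue training" behaviour can be sketched with a hand-rolled model. The `TinyLogReg` class below is hypothetical and only mimics the idea; it is not how Keras is implemented. Its `fit()` starts from the current weights rather than resetting them, so a second call picks up where the first one left off.

```python
import numpy as np

class TinyLogReg:
    """Toy logistic regression trained with plain gradient descent."""

    def __init__(self, n_features, step_size=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.step_size = step_size

    def _predict_proba(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def loss(self, X, y):
        # Binary cross-entropy, clipped to avoid log(0).
        p = np.clip(self._predict_proba(X), 1e-12, 1 - 1e-12)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    def fit(self, X, y, epochs):
        # No re-initialisation here: training resumes from self.w and self.b.
        for _ in range(epochs):
            p = self._predict_proba(X)
            self.w -= self.step_size * (X.T @ (p - y)) / len(y)
            self.b -= self.step_size * float(np.mean(p - y))
        return self

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))                      # made-up features
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(float)

toy_model = TinyLogReg(n_features=3)
loss_start = toy_model.loss(X_demo, y_demo)
loss_first = toy_model.fit(X_demo, y_demo, epochs=20).loss(X_demo, y_demo)
loss_second = toy_model.fit(X_demo, y_demo, epochs=20).loss(X_demo, y_demo)  # continues
print(loss_start, loss_first, loss_second)
```

If `fit()` reset the weights on every call, the loss after the second call would simply equal the loss after the first.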

In [None]:
# YOUR CODE HERE: Run fit() again with 20 epochs. Keep everything else the same. Feel free to copy paste the command!

What is the F1 score now? Did training longer improve things?

In [None]:
# YOUR CODE HERE: what is the F1 score now? 

## Going deeper

Now, build a new neural network with a Dense hidden layer. It is defined much like before, but this time use 200 nodes instead of 1, and the tanh activation function instead of sigmoid:

In [None]:
np.random.seed(0)
model2 = Sequential()
hidden_layer = Dense(# YOUR CODE HERE )
model2.add(hidden_layer)
model2.add(Dense(1, activation="sigmoid"))
model2.compile(loss="binary_crossentropy", optimizer=Adam(lr=0.0003))
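
The architecture defined above (one tanh hidden layer with 200 nodes, followed by a single sigmoid output) can be sketched as a plain NumPy forward pass. The weights are random placeholders and `m_feat` is an assumed, illustrative number of input features, so only the shapes and the parameter count are meaningful here.

```python
import numpy as np

rng = np.random.default_rng(0)
m_feat, n_hidden = 10, 200       # m_feat is a placeholder for the number of input features

X_small = rng.normal(size=(4, m_feat))                  # 4 made-up rows
W1 = rng.normal(scale=0.1, size=(m_feat, n_hidden))     # input -> hidden weights
b1 = np.zeros(n_hidden)                                 # one bias per hidden node
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))          # hidden -> output weights
b2 = np.zeros(1)                                        # output bias

hidden = np.tanh(X_small @ W1 + b1)                     # shape (4, 200)
output = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))      # shape (4, 1), values in (0, 1)

n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, output.shape, n_params)
```

With 10 inputs this already gives 10 * 200 + 200 + 200 + 1 = 2401 trainable parameters, compared with 11 for the single sigmoid node.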

In [None]:
# validation_data is documented as a tuple (x_val, y_val)
model2.fit(X_train, y_train, epochs=10, batch_size=200, validation_data=(X_test, y_test))
f1_score(y_test, model2.predict_classes(X_test))

You should be able to reach an F1 score of about 0.67 to 0.68 after around 30 epochs, so run the cell above a couple of times. This is a small, but not insignificant, improvement over simple Logistic Regression!

## Open-ended bonus assignments

- Add a second hidden layer. Can you improve the score?
- Try tuning the learning rate, batch size, number of hidden nodes. What is the best F1 score you can get?
- Try training a Random Forest like we did last week. Feel free to copy-paste the appropriate bits of code from that notebook. How does the Random Forest compare to the Neural Network?