Neural Network Classification
========================

## Instructions

Run each cell from top to bottom. Try to understand the output of each command. If in doubt, ask your neighbours or Jori.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from keras.optimizers import Adam
%matplotlib inline

Using TensorFlow backend.


We will load a dataset from the US Census Bureau. We are going to predict whether a person makes more than $50K a year, or less.

In [0]:
df = pd.read_csv("https://raw.githubusercontent.com/jvanlier/TIAS_ML_DL/master/Day2Notebooks/data/census.csv")
df.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,more-than-50k
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [0]:
expected = 32561
assert len(df) == expected, f"expected {expected} rows but got {len(df)}"

In [0]:
type(df)

pandas.core.frame.DataFrame

Below are some more details about this dataset:


- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Note that there are many textual columns, which are somewhat annoying to deal with.

Our target is the column `more-than-50k`.

Luckily, we can use `pd.get_dummies` to one-hot encode the dataset easily:

In [0]:
df_ohe = pd.get_dummies(df, drop_first=True)
df_ohe.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,more-than-50k,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,13,2174,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,13,0,0,13,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,9,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,53,7,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,28,13,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
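To see what `pd.get_dummies` does with `drop_first=True`, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the census file). Numeric columns pass through unchanged. Each text column becomes one indicator column per category, minus the first category, which is implied when all the others are 0.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Male", "Male", "Female"],
    "workclass": ["State-gov", "Private", "Private"],
})

toy_ohe = pd.get_dummies(toy, drop_first=True)

# Categories are sorted alphabetically; the first one per column is dropped,
# so "Female" and "Private" do not get a column of their own.
print(list(toy_ohe.columns))
print(toy_ohe)
```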


In [0]:
df

In [0]:
df_ohe.columns

Index(['age', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'more-than-50k', 'workclass_ Federal-gov',
       'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private',
       'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
       'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 11th',
       'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th',
       'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm',
       'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate',
       'education_ HS-grad', 'education_ Masters', 'education_ Preschool',
       'education_ Prof-school', 'education_ Some-college',
       'marital-status_ Married-AF-spouse',
       'marital-status_ Married-civ-spouse',
       'marital-status_ Married-spouse-absent',
       'marital-status_ Never-married', 'marital-status_ Separated',
       'marital-status_ Widowed', 'occupation_ Adm-clerical',
       'occupation_ Arme

Let's take a look at the skew in the target:

In [0]:
df_ohe["more-than-50k"].value_counts()

0    24720
1     7841
Name: more-than-50k, dtype: int64

Yes, it's fairly skewed with many more 0 instances than 1 instances. Let's use the F1 score this time, instead of Accuracy.

In [0]:
24720 / (24720 + 7841)

0.7591904425539756
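The number above is the accuracy of a "model" that always predicts 0. A minimal sketch, using only the class counts printed earlier, of why accuracy flatters such a model while F1 does not (the helper `f1_from_counts` is a name made up for this illustration):

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken to be 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

n_neg, n_pos = 24720, 7841  # class counts from value_counts() above

# Always predicting 0: every negative is right, every positive is missed.
tp, fp, fn, tn = 0, 0, n_pos, n_neg

accuracy = (tp + tn) / (n_neg + n_pos)
f1 = f1_from_counts(tp, fp, fn)

print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}")
```

Accuracy looks respectable even though not a single high earner is found; F1 exposes that.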

## Train-test split

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    df_ohe.drop("more-than-50k", axis=1), 
    df_ohe["more-than-50k"], 
    test_size=0.2, 
    random_state=0)
print(f"{len(X_train)} training instances and {len(X_test)} test instances.")

26048 training instances and 6513 test instances.
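A quick sanity check of those numbers, assuming scikit-learn rounds the test share up and gives the remainder to the training set:

```python
import math

n_rows = 32561      # number of rows asserted earlier in the notebook
test_size = 0.2

n_test = math.ceil(test_size * n_rows)  # assumption: the test share is rounded up
n_train = n_rows - n_test

print(n_train, n_test)
```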


### Baseline
Before we start diving into Neural Nets, let's first try setting a baseline with Logistic Regression.

In [0]:
from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV(scoring="f1", max_iter=1000, cv=3, random_state=0)

LogisticRegressionCV uses an internal cross-validation loop to find a good value for the regularization parameter. This, as we know by now, helps with the overfitting problem.
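With the default `Cs=10`, the scikit-learn documentation describes the candidate values for the inverse regularization strength `C` as a logarithmic grid between 1e-4 and 1e4. A minimal sketch of such a grid (the variable name `candidate_Cs` is made up here, and this mimics the grid rather than calling scikit-learn internals):

```python
import numpy as np

# 10 values, evenly spaced on a log scale from 1e-4 to 1e4.
# Smaller C means stronger regularization.
candidate_Cs = np.logspace(-4, 4, 10)

for C in candidate_Cs:
    print(f"C = {C:.4g}")
```

After fitting, the value that won the cross-validation should be available as `lr.C_`.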

As a warm-up, start with a fit on the training data. This should be familiar after last week!

In [0]:
lr.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=3, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=1000,
           multi_class='warn', n_jobs=None, penalty='l2', random_state=0,
           refit=True, scoring='f1', solver='lbfgs', tol=0.0001, verbose=0)

Validate the model on both train and test.

In [0]:
print("Train score", lr.score(X_train, y_train))
print("Test score", lr.score(X_test, y_test))

Train score 0.6574928977272728
Test score 0.6572218382861091




It could also be useful to take a look at the confusion matrix. We discussed this last week. It contains the number of True Positives, False Positives, False Negatives and True Negatives.

Take a look at the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [0]:
# YOUR CODE HERE: import confusion matrix (hint: see top of documentation page for import path)
from sklearn.metrics import confusion_matrix

In [0]:
# YOUR CODE HERE: create confusion matrix (hint: you need to pass it predictions on X_test)
confusion_matrix(y_test, lr.predict(X_test))

array([[4570,  348],
       [ 644,  951]])

How many True Negatives do you have? And how many True Positives? Refer to the documentation to find out what each cell in the confusion matrix means.
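For reference, scikit-learn puts the true classes in the rows and the predicted classes in the columns, so for a binary problem the matrix reads `[[TN, FP], [FN, TP]]`. A short sketch that unpacks the matrix printed above and recomputes precision, recall and F1 by hand:

```python
# Values copied from the confusion matrix printed above.
cm = [[4570, 348],
      [644, 951]]

(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)  # of all predicted 1s, how many were right?
recall = tp / (tp + fn)     # of all actual 1s, how many did we find?
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.4f}")
```

The F1 value agrees with the test score printed by `lr.score` earlier, because the model was created with `scoring="f1"`.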

## Neural Networks
Let's now try to improve on this result by moving towards a more complex model.

`scikit-learn` provides a basic neural network, but it's not really used a lot. Most practitioners prefer Keras or PyTorch. We'll use Keras, which is backed by Google.

In [0]:
from keras import Sequential
from keras.layers import Dense

We'll start by making a network that mimics Logistic Regression.

In [0]:
np.random.seed(1)  # Leave this here! It ensures reproducibility of results.
model = Sequential()

Now, we need to add just the single sigmoid node. In Keras terminology, this is a `Dense` layer, with a single unit.

To initialize a dense layer, with 1 unit, `m` input features, and sigmoid activation function, use the following:

`dense = Dense(1, input_dim=m, activation="sigmoid")`

To figure out what `m` is, you may use `X_train.shape` or `len(X_train.columns)`.

In [0]:
X_train.shape[1]

99
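A single-unit sigmoid layer computes a weighted sum of the inputs plus a bias and squashes it through the sigmoid, which is the same prediction formula as Logistic Regression. A minimal NumPy sketch with random stand-in data and weights (`X_fake`, `w` and `b` are invented for illustration; Keras initializes and learns its own weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 99                                   # number of input features
X_fake = rng.normal(size=(5, m))         # 5 made-up instances
w = rng.normal(scale=0.1, size=(m, 1))   # one weight per input feature
b = np.zeros(1)                          # one bias for the single unit

p = sigmoid(X_fake @ w + b)              # shape (5, 1): one probability per instance
n_params = w.size + b.size

print(p.shape, n_params, "trainable parameters")
```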

In [0]:
# YOUR CODE HERE: create dense layer
dense = Dense(1, input_dim=X_train.shape[1], activation="sigmoid")

Now add your layer to the model using `model.add(...)`:

In [0]:
# YOUR CODE HERE: add dense layer to model
model.add(dense)

The following command tells Keras how to optimize and evaluate the model. Unfortunately, there is no easy way to show the F1 score during optimization, so we will check it afterwards.

You hopefully remember *binary crossentropy* from last week?

In [0]:
model.compile(loss="binary_crossentropy",  optimizer=Adam(lr=0.0003))
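As a reminder, binary crossentropy averages `-[y*log(p) + (1-y)*log(1-p)]` over all instances, so confident wrong predictions are punished heavily. A minimal sketch on four made-up predictions (the helper name `binary_crossentropy` and the numbers are illustrative only; real implementations typically also clip the probabilities to avoid `log(0)`):

```python
import math

def binary_crossentropy(y_true, p_pred):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all instances."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_pred)
    ]
    return sum(losses) / len(losses)

y_true = [0, 0, 1, 1]
good = [0.1, 0.2, 0.8, 0.9]  # right and fairly confident
bad = [0.9, 0.2, 0.8, 0.1]   # two confident mistakes

loss_good = binary_crossentropy(y_true, good)
loss_bad = binary_crossentropy(y_true, bad)

print(f"good: {loss_good:.3f}")
print(f"bad:  {loss_bad:.3f}")
```

For comparison, a model that always outputs 0.5 scores log(2), roughly 0.693, no matter what the labels are.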

The following command starts training the neural network.

In [0]:
model.fit(X_train, y_train, epochs=50, batch_size=200, validation_data=[X_test, y_test])

Train on 26048 samples, validate on 6513 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7fd6fb9420b8>

50 epochs should get you a validation loss of approximately 0.35.

Ok, now let's see the F1 score. First, we have to get predicted classes. Use `model.predict_classes(...)`.

In [0]:
y_test_hat = model.predict_classes(X_test)

In [0]:
from sklearn.metrics import f1_score

In [0]:
# YOUR CODE HERE: use f1_score function
f1_score(y_test, y_test_hat)

0.6271929824561403

How do you feel about this F1 score? Disappointing?

Well, that's as expected!

The Logistic Regression implementation in scikit-learn uses a very sophisticated optimizer (L-BFGS). Neural Networks use less sophisticated optimizers (backpropagation with gradient descent), which makes them harder to train. You need to get many things right: e.g. number of epochs, batch size and learning rate. However, the less sophisticated optimizer used in Neural Networks *does* allow us to do backpropagation and update hidden layers, which we shall do soon!

But first, try running the `fit()` method again and see if this improves the F1 score. In contrast to scikit-learn, repeated calls to this `fit()` method do not overwrite the previous model, but in fact continue training! You may run this command multiple times, until you no longer see `val_loss` improving.
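To make epochs, batch size and learning rate concrete, here is a minimal NumPy sketch of mini-batch gradient descent for a single sigmoid unit. It is a simplification: the notebook uses Keras with the Adam optimizer, whereas this uses plain gradient descent on synthetic data, and `TinyLogit` is a class invented for this illustration. Like Keras, calling `fit()` a second time continues from the current weights instead of starting over.

```python
import numpy as np

class TinyLogit:
    """Single sigmoid unit trained with plain mini-batch gradient descent."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def loss(self, X, y):
        # Binary crossentropy, clipped to avoid log(0).
        p = np.clip(self.predict_proba(X), 1e-7, 1 - 1e-7)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    def fit(self, X, y, epochs, batch_size, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            order = rng.permutation(len(X))  # reshuffle every epoch
            for start in range(0, len(X), batch_size):
                idx = order[start:start + batch_size]
                error = self.predict_proba(X[idx]) - y[idx]
                # Gradient of the loss w.r.t. weights and bias on this batch.
                self.w -= self.lr * X[idx].T @ error / len(idx)
                self.b -= self.lr * error.mean()
        return self.loss(X, y)

# Synthetic data, made up for the demo.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
noise = rng.normal(scale=0.5, size=1000)
y = (X @ np.array([1.5, -2.0, 0.5]) + noise > 0).astype(float)

tiny = TinyLogit(n_features=3)
loss_initial = tiny.loss(X, y)
loss_first = tiny.fit(X, y, epochs=5, batch_size=200)
loss_second = tiny.fit(X, y, epochs=5, batch_size=200)  # continues training

print(f"{loss_initial:.3f} -> {loss_first:.3f} -> {loss_second:.3f}")
```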

In [0]:
# YOUR CODE HERE: Run fit() again with 20 epochs. Keep everything else the same. Feel free to copy paste the command!
model.fit(X_train, y_train, epochs=20, batch_size=200, validation_data=[X_test, y_test])

Train on 26048 samples, validate on 6513 samples
Epoch 1/20
...
Epoch 20/20


<keras.callbacks.History at 0x7fd6fd865da0>

What is the F1 score now? Did training longer improve things?

In [0]:
# YOUR CODE HERE: what is the F1 score now? 
y_test_hat = model.predict_classes(X_test)
f1_score(y_test, y_test_hat)

0.6570048309178744


## Going deeper

Now, build a new neural network with a Dense hidden layer. It is defined much like before, but now use 200 nodes instead of 1, and the tanh activation function instead of sigmoid:

In [0]:
np.random.seed(0)
model2 = Sequential()
hidden_layer = Dense(200, input_dim=99, activation="tanh")
model2.add(hidden_layer)
model2.add(Dense(1, activation="sigmoid"))
model2.compile(loss="binary_crossentropy", optimizer=Adam(lr=0.0003))
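To see what this architecture computes, here is a NumPy sketch of the forward pass with random stand-in weights (all arrays are invented for illustration; Keras initializes and trains its own). It also counts the trainable parameters, which should agree with what `model2.summary()` reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h = 4, 99, 200                     # instances, input features, hidden nodes

X_fake = rng.normal(size=(n, m))         # made-up input batch
W1 = rng.normal(scale=0.1, size=(m, h))  # input -> hidden weights
b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, 1))  # hidden -> output weights
b2 = np.zeros(1)

hidden = np.tanh(X_fake @ W1 + b1)             # shape (n, 200)
p = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))  # shape (n, 1), probabilities

n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, p.shape, n_params)
```

Compared with the 100 parameters of the single sigmoid unit, this network has far more freedom to fit the data, which is also why it can overfit more easily.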

In [0]:
model2.fit(X_train, y_train, epochs=10, batch_size=200, validation_data=[X_test, y_test])
f1_score(y_test, model2.predict_classes(X_test))

Train on 26048 samples, validate on 6513 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


0.6837294332723949

Run the cell above a couple of times; each run adds 10 more epochs. You should be able to get .67 - .68 after around 30 epochs. This is a small - but not insignificant - improvement over simple Logistic Regression!

## Open-ended bonus assignments

- Add a second hidden layer. Can you improve the score?
- Try tuning the learning rate, batch size, number of hidden nodes. What is the best F1 score you can get?
- Try training a Random Forest like we did last week. Feel free to copy-paste the appropriate bits of code from that notebook. How does the Random Forest compare to the Neural Network?