In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In this exercise I'll build a series of MLPClassifiers using [this UCI dataset](https://archive.ics.uci.edu/ml/datasets/Adult) to predict, based on census demographic data, whether people were earning more or less than $50K annually.

In [2]:
df = pd.read_csv('adult.csv')

In [3]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')

In [4]:
df.dtypes

age               object
workclass         object
fnlwgt            object
education         object
education_num     object
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

So it looks like I have some data cleaning to do:

- Convert age to integer
- Replace '?' with 'Unknown' in workclass, education, occupation
- Replace '#NAME?' with 'Unknown' in race, sex
- Replace '#NAME?' with '40' in age (the mean age, rounded)
- Impute most frequent for age
- Drop fnlwgt, education_num, native_country
- Create dummies for workclass, education, marital_status, occupation, relationship, race, sex

income will be my target variable, with the model trying to predict whether a person is making "<=50K" or ">50K".
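The cleaning list mentions imputing the most frequent value for age, which the cells below do not show. A minimal sketch of one way to do it, assuming a toy column in place of df['age'] (the values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy column standing in for df['age'] (illustrative values only).
age = pd.Series(['39', '#NAME?', '50', '39', '28'])

# Coerce non-numeric placeholders such as '#NAME?' to NaN.
age_numeric = pd.to_numeric(age, errors='coerce')

# Fill the gaps with the most frequent observed age, then cast to int.
most_frequent_age = age_numeric.mode().iloc[0]
age_clean = age_numeric.fillna(most_frequent_age).astype(int)

print(age_clean.tolist())
```

This is an alternative to replacing the placeholder with a hard-coded '40'; which choice is better depends on how the age values are distributed.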

## Preprocessing

In [5]:
df['race'].value_counts()

White                 4021
Black                  493
#NAME?                 264
Asian-Pac-Islander     145
Amer-Indian-Eskimo      48
Other                   29
Name: race, dtype: int64

In [6]:
y = df['income']

In [7]:
df = df.drop(['income', 'fnlwgt', 'education_num', 'native_country'], axis=1)

In [8]:
df['race'] = df['race'].replace('#NAME?','Unknown')
df['age'] = df['age'].replace('#NAME?','40')
df['sex'] = df['sex'].replace('#NAME?','Unknown')
df['workclass'] = df['workclass'].replace('?','Unknown')
df['education'] = df['education'].replace('?','Unknown')
df['occupation'] = df['occupation'].replace('?','Unknown')

In [9]:
df['age'] = df['age'].astype(int)

In [10]:
df = pd.get_dummies(data=df, columns=['workclass', 
                                      'education', 
                                      'marital_status', 
                                      'occupation', 
                                      'relationship', 
                                      'race', 
                                      'sex'])

In [11]:
df.columns

Index(['age', 'capital_gain', 'capital_loss', 'hours_per_week',
       'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private',
       'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',
       'workclass_State-gov', 'workclass_Unknown', 'workclass_Without-pay',
       'education_10th', 'education_11th', 'education_12th',
       'education_1st-4th', 'education_5th-6th', 'education_7th-8th',
       'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc',
       'education_Bachelors', 'education_Doctorate', 'education_HS-grad',
       'education_Masters', 'education_Preschool', 'education_Prof-school',
       'education_Some-college', 'education_Unknown',
       'marital_status_Divorced', 'marital_status_Married-AF-spouse',
       'marital_status_Married-civ-spouse',
       'marital_status_Married-spouse-absent', 'marital_status_Never-married',
       'marital_status_Separated', 'marital_status_Widowed',
       'occupation_Adm-clerical', 'occupation_Armed-Forces',


In [12]:
df.dtypes

age                             int32
capital_gain                    int64
capital_loss                    int64
hours_per_week                  int64
workclass_Federal-gov           uint8
workclass_Local-gov             uint8
workclass_Private               uint8
workclass_Self-emp-inc          uint8
workclass_Self-emp-not-inc      uint8
workclass_State-gov             uint8
workclass_Unknown               uint8
workclass_Without-pay           uint8
education_10th                  uint8
education_11th                  uint8
education_12th                  uint8
education_1st-4th               uint8
education_5th-6th               uint8
education_7th-8th               uint8
education_9th                   uint8
education_Assoc-acdm            uint8
education_Assoc-voc             uint8
education_Bachelors             uint8
education_Doctorate             uint8
education_HS-grad               uint8
education_Masters               uint8
education_Preschool             uint8
education_Pr

## Preprocessing Notes and Model Build

The question of my unknown values is a consideration worth discussing. Should I replace those placeholders with NaN and then dropna? Perhaps. And perhaps I will after I've run the models for this neural network.
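As a rough sketch of the NaN-and-dropna option, assuming a small toy frame in place of the census data (the rows below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the census data (illustrative values only).
toy = pd.DataFrame({
    'age': ['39', '#NAME?', '50', '28'],
    'workclass': ['Private', '?', 'State-gov', 'Private'],
    'race': ['White', 'Black', '#NAME?', 'White'],
})

# Treat both placeholder strings as missing values.
cleaned = toy.replace(['?', '#NAME?'], np.nan)

# Count missing values per column before deciding whether to drop or impute.
missing_per_column = cleaned.isna().sum()

# Drop every row that has at least one missing value.
dropped = cleaned.dropna()

print(missing_per_column)
print(len(toy), 'rows ->', len(dropped), 'rows')
```

Checking the per-column counts first shows how many observations dropna would cost, which matters with only 5000 rows.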

The number of features created might be a problem for computational cost. I have already dropped native_country, the feature with the most distinct values; if training still takes too long I'll drop another feature and try again.

I split off my y target early on because the format of the class labels was proving problematic for my preprocessing. After preprocessing, both df and y still have 5000 observations. If the model doesn't like the characters in the labels, I'll go back and replace them with plain text strings.

To start off, I'll build a model with only a small number of perceptrons, just to prove quickly that it works. I don't expect good results from that. Then I'll add more perceptrons and then layers of perceptrons. Once built, I'll tune the model parameters as needed.

In [13]:
df.shape

(5000, 66)

In [14]:
y.shape

(5000,)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    df,
    y,
    test_size=0.2,
    random_state=42)

In [16]:
mlp = MLPClassifier(hidden_layer_sizes=(50,))
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(50,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [17]:
mlp.score(X_test, y_test)

0.805

In [18]:
cross_val_score(mlp, X_test, y_test, cv=5)

array([0.78606965, 0.76      , 0.77      , 0.795     , 0.76884422])

Wow. I have to say I wasn't expecting a single layer, 50 perceptron model to yield such a high score. And it ran very fast. So now I'll rerun this model for the following layer sizes:

- (1000,)
- (100, 100)
- (1000, 1000)
- (100, 100, 100)
- (1000, 1000, 1000)

If the time to run the model gets too long as the network gets wider and deeper, I'll remove that layer size.
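The same comparison could also be run as a loop that records fit time alongside accuracy. The sketch below is one possible way to do that; compare_layer_sizes is a hypothetical helper, and the data is synthetic (from make_classification) with deliberately small layer sizes so that it runs quickly:

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def compare_layer_sizes(X_train, X_test, y_train, y_test, layer_sizes, max_iter=200):
    """Fit one MLPClassifier per architecture.

    Returns a list of (hidden_layer_sizes, test accuracy, fit seconds).
    """
    results = []
    for sizes in layer_sizes:
        start = time.perf_counter()
        model = MLPClassifier(hidden_layer_sizes=sizes,
                              max_iter=max_iter,
                              random_state=42)  # fixed seed so runs are comparable
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        results.append((sizes, model.score(X_test, y_test), elapsed))
    return results


# Synthetic stand-in data; swap in the notebook's own split and layer sizes.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_results = compare_layer_sizes(Xtr, Xte, ytr, yte, [(10,), (10, 10)], max_iter=50)
for sizes, accuracy, seconds in demo_results:
    print(sizes, round(accuracy, 3), round(seconds, 2))
```

With the real data, passing the list of layer sizes above and reading the seconds column would show directly which architectures are too slow to keep.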

In [19]:
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [20]:
mlp.score(X_test, y_test)

0.799

In [21]:
cross_val_score(mlp, X_test, y_test, cv=5)

array([0.80099502, 0.77      , 0.765     , 0.795     , 0.79396985])

In [22]:
mlp = MLPClassifier(hidden_layer_sizes=(100,100))
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 100), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [23]:
mlp.score(X_test, y_test)

0.789

In [24]:
cross_val_score(mlp, X_test, y_test, cv=5)

array([0.76119403, 0.735     , 0.775     , 0.77      , 0.47738693])

In [25]:
mlp = MLPClassifier(hidden_layer_sizes=(1000,1000))
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 1000), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [26]:
mlp.score(X_test, y_test)

0.757

In [27]:
cross_val_score(mlp, X_test, y_test, cv=5)

array([0.76119403, 0.75      , 0.75      , 0.265     , 0.77889447])

In [28]:
mlp = MLPClassifier(hidden_layer_sizes=(100,100,100))
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 100, 100), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [29]:
mlp.score(X_test, y_test)

0.792

In [30]:
cross_val_score(mlp, X_test, y_test, cv=5)

array([0.74129353, 0.735     , 0.75      , 0.775     , 0.65326633])

In [31]:
# mlp = MLPClassifier(hidden_layer_sizes=(1000,1000,1000))
# mlp.fit(X_train, y_train)

In [32]:
# mlp.score(X_test, y_test)

In [33]:
# cross_val_score(mlp, X_test, y_test, cv=5)

Let's take a look at our scores and cross validations. Adding hidden layers did not have any significant positive impact on the model's performance. In fact, the multi-layer models all scored below the single-layer ones on the test set, and each had at least one fold that scored far below the rest (0.477, 0.265 and 0.653). Of the larger models, the single layer, 1000 perceptron model performed the best, with a score of 79.9% and cross validation scores of 0.801, 0.770, 0.765, 0.795 and 0.794 (a spread of only about 3.6 percentage points), which suggests its performance is stable across folds. Surprising really that the single layer, 50 perceptron model actually outperformed it on the test set (80.5%). But I like the consistency of the folds here.
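To make the fold comparison concrete, here is a small sketch that computes the mean and the max-minus-min spread for each architecture. The numbers are copied from the cross_val_score outputs above; the dictionary keys are just labels chosen for this example:

```python
import numpy as np

# Fold scores copied from the cross_val_score outputs above.
fold_scores = {
    '(50,)': [0.78606965, 0.76, 0.77, 0.795, 0.76884422],
    '(1000,)': [0.80099502, 0.77, 0.765, 0.795, 0.79396985],
    '(100, 100)': [0.76119403, 0.735, 0.775, 0.77, 0.47738693],
    '(1000, 1000)': [0.76119403, 0.75, 0.75, 0.265, 0.77889447],
    '(100, 100, 100)': [0.74129353, 0.735, 0.75, 0.775, 0.65326633],
}

# Map each architecture to (mean fold score, spread between best and worst fold).
summary = {}
for name, scores in fold_scores.items():
    arr = np.array(scores)
    summary[name] = (arr.mean(), arr.max() - arr.min())
    print(f'{name:>16}  mean={summary[name][0]:.3f}  spread={summary[name][1]:.3f}')
```

A small spread only says the folds agree with each other; it does not by itself rule out overfitting.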

Now I'll re-run the model for a single 100 perceptron layer and make some changes to the parameters. I'll delete models that perform in an uninteresting way or error out, and keep below only the models that are worth talking about. Note that as a default I am looking for scores that are better than 79.9% (or really low, so that I can research to understand why).

In [238]:
mlp = MLPClassifier(hidden_layer_sizes=(100, ), 
                    activation='relu', # {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default ‘relu’
                    solver='adam', # {‘lbfgs’, ‘sgd’, ‘adam’}, default ‘adam’
                    alpha=0.1, # float, optional, default 0.0001
                    batch_size='auto', # int, optional, default ‘auto’ (Must not be using solver='lbfgs')
                    learning_rate='constant', # {‘constant’, ‘invscaling’, ‘adaptive’}, default ‘constant’
                    learning_rate_init=0.001, # double, optional, default 0.001
                    power_t=0.5, # double, optional, default 0.5 (learning_rate = ‘invscaling’. solver=’sgd’.)
                    max_iter=200, # int, optional, default 200
                    shuffle=True, # bool, optional, default True
                    random_state=None, # int, RandomState instance or None, optional, default None
                    tol=0.0001, # float, optional, default 1e-4
                    verbose=False, # bool, optional, default False (I like this. I like more info)
                    warm_start=False, # bool, optional, default False 
                    momentum=0.9, # float, default 0.9 (solver=’sgd’)
                    nesterovs_momentum=True, # boolean, default True (must have solver=’sgd’ and momentum > 0)
                    early_stopping=False, # bool, default False (Only effective when solver=’sgd’ or ‘adam’)
                    validation_fraction=0.1, # float, optional, default 0.1 (Only used if early_stopping is True)
                    beta_1=0.9, # float, optional, default 0.9 (Only used when solver=’adam’)
                    beta_2=0.999, # float,  optional, default 0.999 (Only used when solver=’adam’)
                    epsilon=1e-08) #  float, optional, default 1e-8 (Only used when solver=’adam’)
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [239]:
mlp.score(X_test, y_test)

0.833

High score: 81.1%, but the score changes on each run (random_state=None), so within that margin of error no change produced significantly improved results.

There was a slight uptick in the average model score from increasing alpha to 0.1 from the 0.0001 default, that is, from increasing the L2 (ridge) regularization penalty. Otherwise the results were either below my best results but still generally above 70%, or the model just errored out.
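Because single runs vary, one way to compare alpha values more fairly is to fix random_state and average over cross validation folds. The sketch below assumes synthetic data from make_classification, a small (20,) layer, and an arbitrary list of alpha values, all chosen only to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; swap in the notebook's own features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

alpha_means = {}
for alpha in [0.0001, 0.01, 0.1]:
    model = MLPClassifier(hidden_layer_sizes=(20,),
                          alpha=alpha,       # L2 penalty strength
                          max_iter=100,
                          random_state=42)   # fixed seed so only alpha varies
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    alpha_means[alpha] = scores.mean()
    print(f'alpha={alpha}: mean={scores.mean():.3f}, std={scores.std():.3f}')
```

If the differences between mean scores are smaller than the fold-to-fold standard deviation, the change in alpha is probably not meaningful.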

In [240]:
cross_val_score(mlp, X_test, y_test, cv=5)

array([0.76119403, 0.755     , 0.79      , 0.75      , 0.76884422])