Now that we have an idea of how neural networks work and have seen a basic implementation of the calculations, we can look at the most important tools available in Python for building such models. The tools we will use are part of the Keras library, which provides a Python interface for artificial neural networks. Keras acts as an interface to the TensorFlow library (which we will not use directly in this course), a general end-to-end library for high-performance machine learning.

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Classification performance evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
# Keras is not part of Anaconda by default; install it from the Anaconda prompt first:
# pip install tensorflow
# pip install keras
# After this, you will be able to import the library

import keras

# We will work with a dataset to predict the hourly wages of employees
# We have some demographic information, industry, etc.
# So this is a regression problem

wage_data = pd.read_csv('hourly_wages.csv')

wage_data.head()

Unnamed: 0,wage_per_hour,union,education_yrs,experience_yrs,age,female,marr,south,manufacturing,construction
0,5.1,0,8,21,35,1,1,0,1,0
1,4.95,0,9,42,57,1,1,0,1,0
2,6.67,0,12,1,19,0,0,0,1,0
3,4.0,0,12,4,22,0,0,0,0,0
4,7.5,0,12,17,35,0,1,0,0,0


In [None]:
# In Keras, you construct a network layer by layer, specifying the details of each layer
# We will only work with the basic connection type between layers: each node in one layer
# is connected to every node in the next layer
# In Keras terminology, this is called Dense
# There are other layer types for various neural networks in which only a subset of the possible links is present
# You can learn about those from the Mining of Massive Datasets book (recurrent networks, LSTM)
# We also need a function to specify the sequence of layers in the network

from keras.layers import Dense
from keras.models import Sequential

# We split the data into predictors and target

X = wage_data[wage_data.columns[1:]]
y = wage_data['wage_per_hour']
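
To make the idea of a fully connected (Dense) layer concrete, here is a minimal NumPy sketch of what one such layer computes: every output node receives a weighted sum of all inputs plus a bias, optionally followed by an activation. The function name `dense_forward`, the toy data, and the random weights are assumptions made for illustration; this is not how Keras implements the layer internally.

```python
import numpy as np

def dense_forward(inputs, weights, bias, activation=None):
    """Compute activation(inputs @ weights + bias) for one fully connected layer."""
    z = inputs @ weights + bias
    if activation == "relu":
        return np.maximum(z, 0.0)
    return z

# Hypothetical toy data: 4 observations with 9 predictors, as in the wage data
rng = np.random.default_rng(0)
toy_inputs = rng.normal(size=(4, 9))

# Random placeholder weights; a real network would learn these during training
toy_weights = rng.normal(size=(9, 50))
toy_bias = np.zeros(50)

hidden = dense_forward(toy_inputs, toy_weights, toy_bias, activation="relu")
print(hidden.shape)  # one row per observation, one column per node in the layer
```

Because every input column is multiplied with every column of the weight matrix, each of the 9 inputs is linked to each of the 50 nodes, which is exactly the "all possible links" structure described above.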

In [None]:
# The first step in creating the neural network model is to initialize it
# We just specify that it will be a sequence of layers
wage_model = Sequential()

# We add the first layer: 50 nodes, relu activation function, and in this case we need to specify
# how many input nodes we will have, which is the number of columns of X
wage_model.add(Dense(50, activation='relu', input_shape=(len(X.columns),)))

# We also add a second layer with 32 nodes with relu activation
# As we already specified that the first hidden layer will have 50 nodes, we do not need to give that information here
wage_model.add(Dense(32, activation='relu'))

# Finally, an output layer with 1 node (no activation, which is the usual choice for regression)
wage_model.add(Dense(1))

# We compile the model, using the Adam optimizer and the mean squared error as the loss function
wage_model.compile(optimizer='adam', loss='mean_squared_error')
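
As a quick sanity check on the architecture, the number of trainable parameters can be worked out by hand: a Dense layer has one weight per input-node pair plus one bias per node. The sketch below assumes 9 predictor columns, as in the wage data; the helper name `dense_param_count` is made up for this example.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size followed by the sizes of the three Dense layers
layer_sizes = [9, 50, 32, 1]

per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(per_layer)

print(per_layer)     # parameters in each Dense layer
print(total_params)  # total number of trainable parameters
```

If the model was built as above, calling `wage_model.summary()` should report the same per-layer and total counts.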

In [None]:
# We can also visualize the structure of the model
# Most of the time we would not want a more detailed picture: drawing all
# 9 + 50 + 32 + 1 nodes with every possible connection between subsequent layers does not show anything useful

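# plot_model needs the pydot package and Graphviz to be installed
# In recent Keras versions it is imported from keras.utils (older versions used keras.utils.vis_utils)
from keras.utils import plot_model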

plot_model(wage_model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

In [None]:
# Now that we have created the neural network structure, we can fit the model to the data
# Unfortunately, it is very difficult to disable every random component involved in running Keras
# To get a reasonably good idea of model performance, we can increase the number of epochs
# (essentially iterations over the training data) that we run
# If we use 100 in this case, we will get somewhat similar values from run to run
# With verbose set to 0, we suppress progress messages during training (you can try running without it)

model_1 = wage_model.fit(X, y, shuffle=False, epochs = 100, verbose = 0)
print(model_1.history['loss'][-1])
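
The value printed above is the mean squared error on the training data in the last epoch, so it is measured in squared wage units. A small NumPy sketch shows how the loss is computed and how taking the square root gives an error in the original units. The true values are the five wages shown in the table earlier; the predictions are hypothetical numbers chosen for illustration, not output of the fitted model.

```python
import numpy as np

# First five hourly wages from the dataset preview
toy_true = np.array([5.1, 4.95, 6.67, 4.0, 7.5])

# Hypothetical predictions, made up for this example
toy_predicted = np.array([6.0, 5.0, 6.0, 5.0, 7.0])

# Mean squared error: average of the squared differences
mse = np.mean((toy_true - toy_predicted) ** 2)

# Root mean squared error: back in the original units (wage per hour)
rmse = np.sqrt(mse)

print(round(float(mse), 4), round(float(rmse), 4))
```

Reporting the square root of the loss is often easier to interpret, since it can be read as a typical prediction error in wage per hour.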

In [None]:
# We now turn to classification problems
# We will work again with the Titanic dataset to predict survival
# To avoid data transformation for now, the dataset only contains numerical columns

titanic = pd.read_csv('titanic_all_numeric.csv', sep = ';')
print(titanic.head())

# We can create the predictor and target datasets

X = titanic[titanic.columns[1:]]
y = titanic['survived']

# We create training and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

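# One important thing: with the categorical_crossentropy loss used below, Keras expects a one-hot encoded target,
# which we create with the to_categorical function
# (in recent Keras versions it lives in keras.utils; older versions also offered keras.utils.np_utils)

y_train_cat = keras.utils.to_categorical(y_train)
y_test_cat = keras.utils.to_categorical(y_test)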
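
To see what this conversion of the target does, here is a minimal NumPy sketch of one-hot encoding: each label becomes a row with a 1 in the column of its class and 0 elsewhere. The helper `one_hot` and the toy labels are assumptions for illustration; it mimics the idea, not the exact Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Turn integer class labels into rows with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1
    # Row i of the identity matrix is the one-hot vector of class i
    return np.eye(num_classes)[labels]

# Hypothetical survival labels: 0 = did not survive, 1 = survived
toy_labels = [0, 1, 1, 0]
encoded = one_hot(toy_labels)
print(encoded)
```

With two classes, the first column marks the passengers who did not survive and the second column those who did, which matches the two output nodes of the network built next.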

In [None]:
# We construct the model in the same manner

# Set up the model
model = Sequential()

# Add the first layer with 32 nodes and relu activation
model.add(Dense(32, activation='relu', input_shape=(len(X_train.columns),)))

# Add the output layer with a different activation that is appropriate for classification
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
model.fit(X_train, y_train_cat, epochs = 2000, verbose = 0)
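
The softmax activation and the categorical cross-entropy loss used above can be sketched in a few lines of NumPy. Softmax turns the raw scores of the output nodes into probabilities that sum to 1, and the loss is the average negative log-probability assigned to the true class. The function names and toy numbers are assumptions for illustration, and the sketch is simplified (Keras additionally guards against taking the log of zero).

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs):
    """Average negative log-probability of the true class."""
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))

# Hypothetical raw outputs of the two output nodes for three passengers
toy_scores = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]])
# Hypothetical one-hot targets for the same passengers
toy_targets = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

toy_probs = softmax(toy_scores)
toy_loss = categorical_crossentropy(toy_targets, toy_probs)

print(toy_probs.round(3))
print(round(toy_loss, 4))
```

The second passenger, whose two scores are equal, gets probability 0.5 for each class and therefore contributes the most to the loss.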

In [None]:
# We can make predictions

pred = model.predict(X_test)

# Each row holds the predicted probabilities of the two classes (not survived, survived)
# We convert them to a 0-1 label by picking the class with the higher probability

pred_bool = np.argmax(pred, axis=1)

print(classification_report(y_test, pred_bool))
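
The conversion from probabilities to class labels can be illustrated on a small made-up array. With two classes, picking the column with the larger probability is the same as checking whether the probability of the second class (survival) exceeds 0.5. The probabilities below are hypothetical, not model output.

```python
import numpy as np

# Hypothetical predicted probabilities: column 0 = not survived, column 1 = survived
toy_pred = np.array([
    [0.90, 0.10],
    [0.30, 0.70],
    [0.55, 0.45],
    [0.20, 0.80],
])

# Index of the largest probability in each row is the predicted class
labels_argmax = np.argmax(toy_pred, axis=1)

# Equivalent rule for two classes: threshold the survival probability at 0.5
labels_threshold = (toy_pred[:, 1] > 0.5).astype(int)

print(labels_argmax)
print(labels_threshold)
```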

In [None]:
# As a comparison, we solve the same problem with a random forest

from sklearn.ensemble import RandomForestClassifier

forest_titanic = RandomForestClassifier(n_estimators=400, random_state = 0)

# And we fit the training data
forest_titanic.fit(X_train, y_train)

# Finally look at the results

pred_forest = forest_titanic.predict(X_test)
print(confusion_matrix(y_test, pred_forest))

# As we can see, the results improve further: two previously misclassified cases are now classified correctly
print(classification_report(y_test, pred_forest))
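
To connect the confusion matrix with the numbers in the classification report, the following sketch builds a 2x2 confusion matrix by hand (rows are the true classes, columns the predicted classes, the same layout scikit-learn uses) and derives precision, recall, and accuracy for the positive class. The labels are made up for illustration and are not the Titanic results.

```python
import numpy as np

# Hypothetical true and predicted labels for ten passengers
toy_actual = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
toy_guess = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

def confusion_counts(y_true, y_pred):
    """Count cases per (true class, predicted class) pair for two classes."""
    matrix = np.zeros((2, 2), dtype=int)
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label, pred_label] += 1
    return matrix

cm = confusion_counts(toy_actual, toy_guess)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted survivors who really survived
recall = tp / (tp + fn)      # share of actual survivors the model found
accuracy = (tp + tn) / cm.sum()

print(cm)
print(precision, recall, accuracy)
```

The off-diagonal entries of the matrix are the misclassified cases, so comparing two models on the same test set amounts to comparing these two counts.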