# Building a neural network in Keras with TensorFlow #

### A walkthrough for the Machine Learning Club

The goal is to demonstrate how to build a simple neural network using Keras, a popular open-source neural network library.

The demonstration uses the SpamAssassin corpus from the Coursera / Stanford Machine Learning course, which many members of the group have already taken (see the week 7 assignment, 'exercise 6'). In the Coursera course, we trained a support vector machine (SVM) to classify spam. Here we will use a neural network instead.

In [1]:
import os
import numpy as np
import pickle
from matplotlib import pyplot as plt
%matplotlib inline
np.random.seed(1177)

In [2]:
import tensorflow as tf
import keras

Using TensorFlow backend.


### Setup steps:
- Create new environment and activate it
- Install python and packages per requirements.txt
- Run <code>conda install jupyter</code>
- Use <code>conda install nb_conda</code> to get Jupyter to use the environment


#### Warning: Keras and TensorFlow have a lot of dependencies, so installing them all will take a while.

TODO requirements.txt

- python
- tensorflow
- keras
- matplotlib
- numpy (a particular version, 16.4.?, to avoid TF warnings)
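
Since the requirements file is still a TODO, one way to find the versions to pin is to ask the active environment. The sketch below uses the standard library's `importlib.metadata` (Python 3.8 or later) and the package names from the list above; the helper name `installed_versions` is made up for this illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print lines in requirements.txt style for whatever is installed
for name, version in installed_versions(
    ["tensorflow", "keras", "matplotlib", "numpy"]
).items():
    print(f"{name}=={version}" if version else f"# {name} is not installed")
```

The printed lines can be pasted into requirements.txt once the environment is known to work.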

In [3]:
# Import the library written for Coursera exercise 6 (slightly modified for this demo),
# which provides functions for preprocessing emails, and tell it where the vocab list is saved
import utils
exampleDataPath = 'C:\\Users\\Jo\\Documents\\coursera\\ml-coursera-python-assignments\\Exercise6\\Data\\'
utils.setVocabListPath(os.path.join(exampleDataPath, 'vocab.txt'))

## Part 1: feature extraction

The data that we want to use is raw text from emails. To train a neural network, we need a fixed number of features for each training example. We therefore cannot use the text itself, but need to extract a set of features from each email. Per the Coursera course, we will use a vocabulary list to define a set of words that we are interested in. These are 'stem' words, with the endings removed (see examples below). We then parse each email in the corpus and record which of the vocab words are present in that email. All other info from the email is disregarded.
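
As a rough illustration of the idea (not the `utils` implementation used in this notebook), the sketch below marks which words of a small, made-up vocabulary occur in a list of stemmed tokens. The vocabulary, the tokens and the function name `binary_features` are all hypothetical.

```python
def binary_features(tokens, vocab):
    """Return a 0/1 vector with one entry per vocab word.

    Entry i is 1 if vocab[i] occurs anywhere in tokens, else 0.
    Word order and repeat counts are deliberately thrown away.
    """
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


# Hypothetical five-word vocabulary (the real list has 1899 entries)
vocab = ["anyon", "cost", "free", "host", "web"]
tokens = "anyon know how much it cost to host a web portal".split()

print(binary_features(tokens, vocab))  # [1, 1, 0, 1, 1]
```

Every email ends up with a vector of the same length, whatever its original size, which is what the network needs as input.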

In [4]:
# Take a look at the vocab list for info
with open(os.path.join(exampleDataPath, 'vocab.txt')) as fid:
    vocab_list_contents = fid.read()
print(', '.join(vocab_list_contents.split()[1:100:2]))

aa, ab, abil, abl, about, abov, absolut, abus, ac, accept, access, accord, account, achiev, acquir, across, act, action, activ, actual, ad, adam, add, addit, address, administr, adult, advanc, advantag, advertis, advic, advis, ae, af, affect, affili, afford, africa, after, ag, again, against, agenc, agent, ago, agre, agreement, aid, air, al


### 1.0 Example feature extraction
To illustrate how this works, here is an example email, along with the processed version (reduced to stem words), the matches in the vocab list, and the resulting feature vector.

In [5]:
# Extract Features from a sample email
with open(os.path.join(exampleDataPath, 'emailSample1.txt')) as fid:
    file_contents = fid.read()
print('----------------')
print(f'Unprocessed email (string length {len(file_contents)}):')
print('----------------')
print(file_contents)

processed_email, word_indices = utils.processEmail(file_contents, verbose=False)
features = utils.emailFeatures(word_indices)

print('----------------')
print(f'Processed email ({len(processed_email)} word stems):')
print('----------------')
print(' '.join(processed_email))


print('\n----------------')
print(f'Matching word indices ({len(word_indices)} matches in vocab list):')
print('----------------')
print(word_indices)

# Print Stats
print('\n----------------')
print(f'Feature vector (vector length {len(features)} with {sum(features>0)} non-zero entries):')
print('----------------')
print('...'+' '.join(features.astype(int).astype(str)[50:100])+'...')

----------------
Unprocessed email (string length 393):
----------------
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com


----------------
Processed email (63 word stems):
----------------
anyon know how much it cost to host a web portal well it depend on how mani visitor your expect thi can be anywher from less than number buck a month to a coupl of dollar number you should checkout httpaddr or perhap amazon ec number if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr

----------------
Matching word indices (55 matches in vocab list):
----------------
[85, 915, 793, 1076, 882, 369, 1698, 789, 1821, 1830, 8
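
The processed email above contains placeholder tokens such as `httpaddr`, `emailaddr`, `number` and `dollar`. Here is a minimal sketch of that kind of normalization, assuming regular-expression rules along the lines of the Coursera exercise. The exact patterns inside `utils.processEmail` may differ, and stemming is left out entirely.

```python
import re


def normalize_email(text):
    """Lower-case the text and replace URLs, addresses, numbers and '$'.

    Assumed rules, for illustration only; no stemming is performed.
    """
    text = text.lower()
    text = re.sub(r"(http|https)://\S+", "httpaddr", text)
    text = re.sub(r"\S+@\S+", "emailaddr", text)
    text = re.sub(r"\d+", "number", text)
    text = re.sub(r"\$+", "dollar ", text)
    # Replace anything that is not a letter or whitespace, then split into words
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


print(normalize_email("Visit http://www.rackspace.com/ for $100 or mail me@example.com"))
```

Collapsing every URL, address and number into one token means the network sees "this email contains a link" rather than thousands of distinct strings.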

### 1.1 Apply the above feature extraction process to each email to obtain our (labelled) training set

In [6]:
# One-off preprocessing, left commented out because the next cell loads its saved output (Xy.pickle)
# spamAssassinPath = "C:\\Users\\Jo\\Documents\\coursera\\ml-coursera-python-assignments\\Exercise6\\spam_assassin_corpus"
# spam_dir = os.path.join('spam_2', 'spam_2') # Using the original directory structure with annoyingly many directories
# easy_ham_dir = os.path.join('easy_ham', 'easy_ham')
# hard_ham_dir = os.path.join('hard_ham', 'hard_ham')
# training_dirs = {spam_dir:1, easy_ham_dir:0, hard_ham_dir:0}

# limit = None

# X = []
# y = []

# # First read all the data (spam, easy ham, hard ham)
# for training_dir in training_dirs.keys():
#     print(os.path.join(spamAssassinPath, training_dir))
#     done = 0
#     for root_dir, _, fnames in os.walk(os.path.join(spamAssassinPath, training_dir)):
#         for f in fnames:
#             with open(os.path.join(root_dir, f)) as fid:
#                 try:
#                     file_contents = fid.read()
#                 except BaseException:
#                     print(f"error reading file {f}")
#                 processed_email, word_indices = utils.processEmail(file_contents, verbose=False)
#                 X.append(utils.emailFeatures(word_indices))
#                 y.append(training_dirs[training_dir])
#                 done += 1
#             if limit and done >= limit:
#                 break
#     print(f"{done} files processed from directory {training_dir}")

# # Convert to numpy arrays
# X, y = np.array(X), np.array(y)

# # Dump to a file
# with open('Xy.pickle', 'wb') as f:
#     pickle.dump((X,y), f)

In [7]:
with open('Xy.pickle', 'rb') as f:
    data = pickle.load(f)

In [8]:
X, y = data[0], data[1]

In [9]:
samples, N = len(X), len(X[0])
print(f"Processed {samples} training samples into features of length {N}")

Processed 4198 training samples into features of length 1899


### 1.2 Randomly select training, validation and test datasets

In [10]:
random_idxs = np.random.permutation(samples) # all the indices in a random order
m = int(samples * 0.6)
test_size = (samples - m) // 2
X_train, y_train = X[random_idxs[:m]], y[random_idxs[:m]]
X_validate, y_validate = X[random_idxs[m:(m + test_size)]], y[random_idxs[m:(m + test_size)]]
X_test, y_test = X[random_idxs[-test_size:]], y[random_idxs[-test_size:]]
# Check we have a reasonable number of positive examples in each set
sum(y_train), sum(y_validate), sum(y_test)

(821, 285, 291)
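
The split arithmetic can be checked without loading the data. The sketch below uses the standard library's `random` module instead of numpy, so the actual shuffle order differs from the notebook's even with the same seed; the function name `split_indices` is made up for this illustration. For 4198 samples it gives 2518 / 840 / 840, which agrees with the "Train on 2518 samples, validate on 840 samples" line reported later.

```python
import random


def split_indices(n_samples, train_frac=0.6, seed=1177):
    """Shuffle range(n_samples) and cut it into train/validate/test.

    The remainder after the training share is halved; if it is odd,
    one index is left unused (as in the slicing used above).
    """
    idxs = list(range(n_samples))
    random.Random(seed).shuffle(idxs)
    m = int(n_samples * train_frac)
    test_size = (n_samples - m) // 2
    train = idxs[:m]
    validate = idxs[m:m + test_size]
    # Explicit start index so that test_size == 0 yields an empty list
    test = idxs[len(idxs) - test_size:]
    return train, validate, test


train, validate, test = split_indices(4198)
print(len(train), len(validate), len(test))  # 2518 840 840
```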

## Part 2: Build a neural network with Keras

Now that we have our training, validation and test sets, with the associated labels, we can train a neural network.

In [24]:
## Keras imports
from keras import models
from keras import layers
from keras import regularizers

### 2.1 Start with a very simple neural network

In [12]:
## Create the model and add some layers
model = models.Sequential()

# First (hidden) layer includes the dimension of the training feature vectors (N)
# The input layer is automatically added.
model.add(layers.Dense(4, input_dim=N, activation='relu'))

## Optional additional layers
# Standard fully connected layer
model.add(layers.Dense(4, activation='relu'))
# ... can add more layers here... see later.

# Output layer has 1 element (because we are doing binary classification) and uses sigmoid activation
model.add(layers.Dense(1, activation='sigmoid'))


In [13]:
## Compile the model, choosing the optimizer, the loss function (binary cross-entropy here; e.g. mean squared error for regression),
#  and the metrics that we want to keep track of.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 4)                 7600      
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 20        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 5         
=================================================================
Total params: 7,625
Trainable params: 7,625
Non-trainable params: 0
_________________________________________________________________
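
The parameter counts in the summary can be reproduced by hand: each fully connected layer has one weight per (input, unit) pair plus one bias per unit. A short sketch, with a made-up helper name:

```python
def dense_param_counts(n_inputs, layer_units):
    """Return the parameter count of each Dense layer in a stack.

    Each layer has (inputs * units) weights plus one bias per unit.
    """
    counts = []
    for units in layer_units:
        counts.append(n_inputs * units + units)
        n_inputs = units  # this layer's outputs feed the next layer
    return counts


counts = dense_param_counts(1899, [4, 4, 1])
print(counts, sum(counts))  # [7600, 20, 5] 7625
```

Almost all of the parameters sit in the first layer, because that is where the 1899 input features are connected.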


In [14]:
## Run the training process
model.fit(X_train, y_train, epochs=20, batch_size=128)

Epoch 1/20
...
Epoch 20/20


<keras.callbacks.callbacks.History at 0x1405f1c51d0>

In [15]:
## Check how it performs on the validation data
loss_and_metrics = model.evaluate(X_validate, y_validate, batch_size=128)
loss_and_metrics



[0.02583749167443741, 0.9916666746139526]

Result! We have a high level of accuracy (over 99% on the validation set) for spam classification on this dataset. The training accuracy was 100%.
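
The two numbers returned by `model.evaluate` are the loss (binary cross-entropy) and the accuracy. As a sketch of what they measure, the plain-Python functions below compute both from a list of labels and predicted probabilities. The sample values are invented, and the clipping constant `eps` is an assumption made here to avoid taking the log of zero.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Hypothetical labels and sigmoid outputs, for illustration only
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]
print(binary_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
```

Accuracy only cares which side of the threshold a prediction falls on, while the loss also penalizes confident wrong answers, so the two can move in different directions.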

### 2.2 Tackle a harder problem
To make the problem harder, let's use only the first 190 features (10% of the features). Note that this is not a random subset: it is the first 10% of the words in the vocab list, which is in alphabetical order (maybe a-c).

First run the same model as above, with two dense layers of 4 nodes each. This gives around 95% training accuracy. Then adjust the model architecture to try to improve the results:

- Attempt 1: add some more nodes (2 layers, 8 nodes each)
- Attempt 2: add another layer (3 layers, 8 nodes each)
- Attempt 3: add some more nodes (3 layers, 16 nodes each)

After attempt 3 we have around 99% training accuracy. This is probably the best we can do with this data.


In [44]:
features = 190
## Create a new model and add some layers
model = models.Sequential()
# First layer includes the dimension of the truncated feature vectors (features, not N)
model.add(layers.Dense(4, input_dim=features, activation='relu'))
# Standard fully connected layer
model.add(layers.Dense(4, activation='relu'))
# # Another standard fully connected layer
# model.add(layers.Dense(4, activation='relu'))
# Output layer has 1 element (binary classification) and uses sigmoid
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train[:,:features], y_train, epochs=150, batch_size=128)

Epoch 1/150
...
Epoch 150/150


<keras.callbacks.callbacks.History at 0x140744d2c18>

In [20]:
loss_and_metrics = model.evaluate(X_validate[:,:features], y_validate, batch_size=128)
loss_and_metrics



[1.1389494736989338, 0.8654761910438538]

### 2.3 Deal with overfitting (variance)

The above model is 99% accurate on the training data, but only 87% accurate on the validation data. This indicates <b>overfitting</b>, also known as high variance.

We can address this by adding regularization. 

- First add a dropout layer after each hidden fully connected layer. Use a dropout probability of 25% to 50% (the cell below uses 0.5).
- Next add L1 and/or L2 regularization (the cell below uses L2 with a factor of 0.005).
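
To make the two regularizers concrete, here is a small plain-Python sketch of what they compute. It assumes "inverted" dropout (surviving activations are scaled by 1 / (1 - rate) during training) and an L2 penalty equal to the factor times the sum of squared weights. The function names and sample numbers are hypothetical; this is not the Keras source.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (training time only).

    Survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    """
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def l2_penalty(weights, factor):
    """Extra loss term: factor * sum of squared weights."""
    return factor * sum(w * w for row in weights for w in row)


rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
print(l2_penalty([[1.0, -2.0], [3.0, 4.0]], factor=0.005))
```

Dropout stops the network from relying on any single node, while the L2 term pushes weights towards zero; both make it harder to memorize the training set.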

In [41]:
model = models.Sequential()
# First layer includes the dimension of the truncated feature vectors (features, not N)
model.add(layers.Dense(16, input_dim=features, activation='relu', kernel_regularizer=regularizers.l2(0.005)))
# Drop out nodes with probability 0.5
model.add(layers.Dropout(0.5))
# Standard fully connected layer
model.add(layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.005)))
# Dropout with probability 0.5
model.add(layers.Dropout(0.5))
# Standard fully connected layer
model.add(layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.005)))
# Dropout with probability 0.5
model.add(layers.Dropout(0.5))
# Output layer has 1 element (binary classification) and uses sigmoid
model.add(layers.Dense(1, activation='sigmoid'))

In [42]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train[:,:features], y_train, epochs=150, batch_size=128, validation_data=(X_validate[:,:features], y_validate))

Train on 2518 samples, validate on 840 samples
Epoch 1/150
...
Epoch 150/150


<keras.callbacks.callbacks.History at 0x1406f603e48>
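
The `History` object returned by `model.fit` keeps the per-epoch metrics in its `history` dictionary (with keys such as `loss` and `val_loss` when validation data is supplied). As a sketch, the helper below finds the epoch with the lowest validation loss in such a dictionary. The helper name and the numbers are made up for illustration and are not taken from the run above.

```python
def best_epoch(history, key="val_loss"):
    """Return (1-based epoch number, value) where history[key] is smallest."""
    values = history[key]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]


# Made-up numbers standing in for `model.fit(...).history`
fake_history = {
    "loss": [0.70, 0.52, 0.41, 0.35, 0.31],
    "val_loss": [0.68, 0.55, 0.49, 0.51, 0.56],
}
print(best_epoch(fake_history))  # (3, 0.49)
```

A validation loss that bottoms out and then rises while the training loss keeps falling is the usual sign that further epochs are only fitting the training set.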

In [20]:
loss_and_metrics = model.evaluate(X_validate[:,:features], y_validate, batch_size=128)
loss_and_metrics



[0.4741794018518357, 0.8845238089561462]