# Building a neural network in Keras with TensorFlow #

### A walkthrough for the Machine Learning Club

The goal is to demonstrate how to build a simple neural network using Keras, a popular open-source neural network library.

The demonstration uses the SpamAssassin corpus from the Coursera / Stanford Machine Learning course, which many members of the group have already taken (see the week 7 assignment, 'exercise 6'). In the Coursera course, we trained a support vector machine (SVM) to classify spam. Here we will use a neural network instead.

In [1]:
import os
import numpy as np
import re
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
import tensorflow as tf
import keras

Using TensorFlow backend.


### Setup steps:
- Create a new environment and activate it
- Install Python and the packages listed in requirements.txt
- Run <code>conda install jupyter</code>
- Use <code>conda install nb_conda</code> to get Jupyter to use the environment


#### Warning: Keras and TensorFlow have a lot of dependencies, so it will take a while to install them all.

TODO requirements.txt

- python
- tensorflow
- keras
- matplotlib
- numpy particular version 16.4.? to avoid TF warnings

In [3]:
# Import the library written for Coursera exercise 6 (slightly modified for this demo),
# which provides functions for preprocessing emails, and tell it where the vocab list is saved
import utils
spamAssassinPath = 'C:\\Users\\Jo\\Documents\\coursera\\ml-coursera-python-assignments\\Exercise6\\Data\\'
utils.setVocabListPath(os.path.join(spamAssassinPath, 'vocab.txt'))

## Part 1: feature extraction

The data that we want to use is raw text from emails. To train a neural network, we need a fixed number of features for each training example. We therefore cannot use the text itself; instead, we need to extract a set of features from each email. Following the Coursera course, we will use a vocabulary list to define the set of words that we are interested in. These are 'stem' words, with the endings removed (see examples below). We then parse each email in the corpus and record which of the vocab words are present in that email. All other information in the email is discarded.
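As a rough illustration of the idea (not the code in `utils`), the sketch below turns lists of word stems into fixed-length presence vectors. The five-word `toy_vocab`, the two toy emails, and the function name `presence_features` are all hypothetical and exist only for this example; the real vocabulary is read from `vocab.txt`.

```python
# Hypothetical five-word vocabulary; the real list is loaded from vocab.txt.
toy_vocab = ['account', 'click', 'free', 'meet', 'offer']


def presence_features(stems, vocab):
    """Return a list of 0/1 flags, one per vocab word (illustrative helper)."""
    index_of = {word: i for i, word in enumerate(vocab)}
    features = [0] * len(vocab)
    for stem in stems:
        if stem in index_of:
            # Record presence only; repeated words do not increase the value.
            features[index_of[stem]] = 1
    return features


# Two toy emails of different lengths, already reduced to word stems.
short_email = ['free', 'offer', 'click', 'free']
long_email = ['can', 'we', 'meet', 'to', 'discuss', 'the', 'account', 'tomorrow']

print(presence_features(short_email, toy_vocab))
print(presence_features(long_email, toy_vocab))
```

Both emails map to vectors of the same length as the vocabulary, however many words they contain, and words outside the vocabulary are simply dropped. That fixed length is what lets every email feed the same network input layer.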

In [4]:
# Take a look at a sample of the vocab list
with open(os.path.join(spamAssassinPath, 'vocab.txt')) as fid:
    vocab_list_contents = fid.read()
# The file alternates index and word, so an odd start with a step of 2 selects words only
vocab_list_contents.split()[21:45:2]

['access',
 'accord',
 'account',
 'achiev',
 'acquir',
 'across',
 'act',
 'action',
 'activ',
 'actual',
 'ad',
 'adam']

### Example feature extraction
To illustrate how this works, here is an example email, along with the processed version (reduced to stem words), the matches in the vocab list, and the resulting feature vector.

In [13]:
# Extract Features from a sample email
with open(os.path.join(spamAssassinPath, 'emailSample1.txt')) as fid:
    file_contents = fid.read()
print('----------------')
print(f'Unprocessed email (string length {len(file_contents)}):')
print('----------------')
print(file_contents)

processed_email, word_indices = utils.processEmail(file_contents, verbose=False)
features = utils.emailFeatures(word_indices)

print('----------------')
print(f'Processed email ({len(processed_email)} word stems):')
print('----------------')
print(' '.join(processed_email))


print('\n----------------')
print(f'Matching word indices ({len(word_indices)} matches in vocab list):')
print('----------------')
print(word_indices)

# Print Stats
print('\n----------------')
print(f'Feature vector (vector length {len(features)} with {sum(features>0)} non-zero entries):')
print('----------------')
print('...'+' '.join(features.astype(int).astype(str)[50:100])+'...')

----------------
Unprocessed email (string length 393):
----------------
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com


----------------
Processed email (63 word stems):
----------------
anyon know how much it cost to host a web portal well it depend on how mani visitor your expect thi can be anywher from less than number buck a month to a coupl of dollar number you should checkout httpaddr or perhap amazon ec number if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr

----------------
Matching word indices (55 matches in vocab list):
----------------
[85, 915, 793, 1076, 882, 369, 1698, 789, 1821, 1830, 8