### Sequential Models for Text
-------

We now use the Keras `Tokenizer` to preprocess the SMS spam data and feed the result through several sequential network architectures.

In [1]:
import pandas as pd
import numpy as np

In [2]:
from keras.preprocessing.text import Tokenizer

In [3]:
spam = pd.read_csv('data/sms_spam.csv')

In [4]:
spam.head()

| Unnamed: 0 | type | text |
|---|---|---|
| 0 | ham | Hope you are having a good week. Just checking in |
| 1 | ham | K..give back my thanks. |
| 2 | ham | Am also doing in cbe only. But have to pay. |
| 3 | spam | complimentary 4 STAR Ibiza Holiday or £10,000 ... |
| 4 | spam | okmail: Dear Dave this is your final notice to... |


In [5]:
X = spam['text']
y = np.where(spam['type'] == 'ham', 0, 1)

### `Tokenizer`
------
Here we cap the vocabulary at 500 words, fit the tokenizer on the texts, and then convert each text to a sequence of integer word indices with `.texts_to_sequences`. To ensure every sequence has the same length, we use the `pad_sequences` function.

In [6]:
# create a tokenizer that keeps only the most frequent words
tokenizer = Tokenizer(num_words = 500)

In [7]:
#fit it on text
tokenizer.fit_on_texts(X)

In [8]:
#generate sequences
X_vect = tokenizer.texts_to_sequences(X)

In [9]:
print(X_vect[:2])

[[122, 3, 22, 313, 4, 53, 110, 37, 8], [92, 134, 86, 11, 170]]
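
To make the integer sequences above less opaque, here is a minimal pure-Python sketch of the idea behind the tokenizer: rank words by frequency, give the most frequent word index 1, and drop any word whose index is not below `num_words`. The helper names `fit_word_index` and `to_sequences` and the toy texts are hypothetical, and the regular expression is a simplification of how Keras actually splits and filters text.

```python
import re
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping indices >= num_words."""
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        ids = [word_index[w] for w in words if w in word_index]
        sequences.append([i for i in ids if i < num_words])
    return sequences


# hypothetical toy corpus, not the SMS data
toy_texts = ["free prize call now", "call me now", "see you now"]
word_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, word_index, num_words=3)
print(toy_sequences)  # [[2, 1], [2, 1], [1]]
```

Rare words simply disappear from the sequence, which is why a ten-word message can map to fewer than ten integers.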


In [10]:
from keras.preprocessing.sequence import pad_sequences

In [11]:
# pad (or truncate) every sequence to length 100; by default zeros are added at the front
X_seq = pad_sequences(X_vect, maxlen=100)

In [12]:
#take a peek
X_seq[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       122,   3,  22, 313,   4,  53, 110,  37,   8], dtype=int32)
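
The leading zeros in the array above come from the default "pre" padding. As a rough illustration, the sketch below mimics that behavior in plain Python, assuming the defaults of padding at the front and truncating from the front; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; keep only the last `maxlen` items of longer sequences."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # drop items from the front if the sequence is too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


# hypothetical toy sequences: one shorter and one longer than maxlen
toy = [[122, 3, 22], [92, 134, 86, 11, 170, 7, 9]]
padded = pad_pre(toy, maxlen=5)
print(padded)  # [[0, 0, 122, 3, 22], [86, 11, 170, 7, 9]]
```

Every row now has the same length, which is what lets the sequences be stacked into a single 2D array.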

### Model
-------

In [19]:
from keras.layers import Embedding

In [20]:
from keras.layers import Dense
from keras.models import Sequential
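
Conceptually, an embedding layer is a lookup table with one row of weights per token id. The sketch below illustrates that idea in plain Python with random weights; in a real model the rows are learned during training. The names `embedding_table` and `embed` are hypothetical, and the embedding size is reduced to 4 purely for readability.

```python
import random

random.seed(0)

vocab_size = 500  # matches num_words in the tokenizer
embed_dim = 4     # the notebook uses 64; 4 keeps the example small

# one row of weights per possible token id (random here, learned in a real model)
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]


def embed(sequence):
    """Look up one vector per token id."""
    return [embedding_table[token_id] for token_id in sequence]


vectors = embed([0, 0, 122, 3, 22])
print(len(vectors), len(vectors[0]))  # 5 4
```

A padded sequence of length 100 therefore becomes a 100 x 64 matrix in the notebook's model, and identical token ids always map to the same vector.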

### Convolutional Networks in 1D
--------

In [21]:
from keras.layers import Conv1D, MaxPooling1D

In [22]:
X_seq.shape

(5559, 100)

In [23]:
y.shape

(5559,)

In [24]:
y = y.reshape(-1, 1)
y.shape

(5559, 1)

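In [26]:
from keras.layers import Flatten

model = Sequential()

# map each of the 500 token ids to a learned 64-dimensional vector
model.add(Embedding(input_dim = tokenizer.num_words, output_dim = 64))

# two convolution + pooling stages shrink the 100-step sequence to 4 steps
model.add(Conv1D(64, 8, activation = 'relu'))
model.add(MaxPooling1D(4))
model.add(Conv1D(64, 16, activation = 'relu'))
model.add(MaxPooling1D(2))

# flatten to one vector per message so the final Dense layer emits a single probability
model.add(Flatten())
model.add(Dense(100, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

# 'bce' is shorthand for binary cross-entropy
model.compile(loss = 'bce', metrics = ['accuracy'])

history = model.fit(X_seq, y, epochs = 20)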

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
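
It can help to trace how the sequence length changes through the convolution and pooling layers of this model. The sketch below assumes the Keras defaults of `'valid'` padding, a convolution stride of 1, and a pooling stride equal to the pool size; `conv1d_length` and `pool1d_length` are hypothetical helpers written for this illustration.

```python
def conv1d_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def pool1d_length(length, pool_size):
    """Output length of max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


length = 100  # padded sequence length
trace = [length]
for kind, size in [("conv", 8), ("pool", 4), ("conv", 16), ("pool", 2)]:
    if kind == "conv":
        length = conv1d_length(length, size)
    else:
        length = pool1d_length(length, size)
    trace.append(length)

print(trace)  # [100, 93, 23, 8, 4]
```

Under these assumptions, the tensor leaving the last pooling layer has 4 time steps of 64 channels per message.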
