### LSTMs in Keras

---

#### Add the imports

In [2]:
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, LSTM
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import backend as K
import numpy as np
from sklearn.datasets import fetch_20newsgroups
import spacy
import tqdm

---

#### Load some training data

In [3]:
categories = ['alt.atheism', 'sci.space']
data = fetch_20newsgroups(categories=categories)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


The plan:

- Clean the text: remove stop words and special characters, lowercase, lemmatize
- Vectorize the posts: turn our text into number equivalents. Use a Keras Embedding to make word vectors.
- Create our LSTM model
- Train and test it

In [4]:
X = data['data']
y = data['target']

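In [5]:
def clean_my_text(text):
    """Lowercase, tokenize and lemmatize text, dropping stop words and non-alphabetic tokens."""
    lemmatized = []
    text = text.lower()
    tokens = nlp(text)  # nlp is the spaCy pipeline loaded in the next cell
    for word in tokens:
        if not word.is_stop and word.is_alpha:
            lemmatized.append(word.lemma_)
    return lemmatized

In [7]:
# Named nlp rather than model, so the Keras model created later does not overwrite it
nlp = spacy.load('en_core_web_sm')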

In [8]:
clean_X = []

for text in tqdm.tqdm(X):
    results = clean_my_text(text)
    clean_X.append(results)

100%|██████████| 1073/1073 [01:35<00:00, 11.28it/s]


In [10]:
from sklearn.model_selection import train_test_split
# Note: this split of the raw token lists is redone on the padded sequences before training
Xtrain, Xtest, ytrain, ytest = train_test_split(clean_X, y)

#### We have some data, but now we need to preprocess it

In [27]:
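# Index 0 is reserved for the padding token ''. A bare set has no guaranteed order,
# so the words are sorted to keep '' first and the mapping reproducible.
vocab_list = [''] + sorted({word for text in clean_X for word in text})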

In [28]:
word_to_num = {}
num_to_word = {}
for i, word in enumerate(vocab_list):
    num_to_word[i] = word
    word_to_num[vocab_list[i]] = i

In [29]:
num_to_word[0]

''
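To make the mapping concrete, here is a minimal sketch on a made-up toy corpus (the `toy_` names and the data are illustrative assumptions, not part of the notebook). It reserves index 0 for the padding token and sorts the words so the mapping comes out the same on every run.

```python
# Toy tokenised corpus (hypothetical data standing in for clean_X)
toy_docs = [["cat", "sit", "mat"], ["dog", "sit"]]

# Reserve index 0 for the padding token '', then add the sorted unique words
toy_vocab = [""] + sorted({word for doc in toy_docs for word in doc})

# Forward and reverse lookups between words and integer ids
toy_word_to_num = {word: i for i, word in enumerate(toy_vocab)}
toy_num_to_word = {i: word for word, i in toy_word_to_num.items()}

print(toy_word_to_num)     # {'': 0, 'cat': 1, 'dog': 2, 'mat': 3, 'sit': 4}
print(toy_num_to_word[0])  # prints an empty line: index 0 is the padding token
```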

#### Now turn our posts into sequences of word indices, and pad them so they are all the same length

In [37]:
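# Note: sorted() orders the documents lexicographically, so this is the length of the
# first document in that order, not necessarily the longest one.
# pad_sequences below truncates any longer document to this length.
max_length = len(sorted(clean_X)[0])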

In [38]:
max_length

125
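Note that `sorted(clean_X)[0]` picks the lexicographically first document rather than the longest one, so `max_length` acts as a cut-off rather than a true maximum. A minimal sketch on made-up toy data (the names and documents are assumptions for illustration) shows the difference:

```python
# Hypothetical toy documents, each a list of tokens
toy_docs = [["apple"], ["zebra", "crossing", "sign"], ["mango", "juice"]]

# sorted() compares the token lists element by element, i.e. alphabetically
first_sorted_len = len(sorted(toy_docs)[0])

# The true maximum document length
longest_len = max(len(doc) for doc in toy_docs)

print(first_sorted_len, longest_len)  # 1 3
```

If the intent is to keep every token, `max(len(doc) for doc in clean_X)` would be the value to use, at the cost of much longer padded sequences.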

In [39]:
word_vec_X = [[word_to_num[word] for word in text] for text in clean_X]

In [40]:
word_vec_X = sequence.pad_sequences(word_vec_X, maxlen = max_length, padding = 'pre')

In [41]:
word_vec_X[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,  8016,  6062,
        5691, 12929,  2163,  6688,  6951,  6267,  4807,  4079,  9836,
        8839, 13915,  8512,  4913, 11037,  9723,  8332, 11034,  4071,
        3319,  2970,  2094,  6569,  8842, 13226,  5992,  9936,  4677,
        2186,  1646,  2186,   410,  9936,  2970,  6695,  9599,  9936,
        4677,  2696,  2980,  2186,  6375,  9936,  6569,  5865,   951,
       10531,  2970,  4295,  9936,  4677,  6558,  4742,   951,  4677,
         519, 13393,   116, 12259, 12748,  9186,   547,  8016],
      dtype=int32)
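The leading zeros above come from `padding = 'pre'`. As a rough pure-Python sketch of what `pad_sequences` does for a single sequence with its default `truncating='pre'` (the helper `pad_pre` is a hypothetical name, not a Keras function, and it assumes `maxlen` is positive):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with value, or keep only its last maxlen items."""
    seq = list(seq)[-maxlen:]  # too long: drop items from the front
    return [value] * (maxlen - len(seq)) + seq  # too short: pad on the left

print(pad_pre([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

Padding at the front means the real tokens sit at the end of the sequence, closest to the LSTM's final state.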

In [43]:
vocab_size = len(vocab_list) + 1

---

#### Now let's create the model - it's a standard Sequential with an Embedding and an LSTM layer added

Embedding:
This layer takes 3 parameters: the size of the vocab (input_dim), the number of dimensions of each word embedding (output_dim), and the length of each document (input_length), which we've standardised above. It returns a 2D matrix, with one row per word in the document and one column per dimension of the word embedding.

*Actually it's 3D, because the batch size is the first dimension in both input and output, but I find that confuses things more than it clarifies.*

Put another way:

The embedding **takes in** a factorized (integer-encoded) document, e.g.:

**[The, cat, sat, on, the, mat]**    becomes    **[1,2,3,4,1,5]**

And **outputs** a word embedded corpus:

**[1,2,3,4,1,5]**    becomes (let's assume output_dim=2)   **[[0.2,0.7], [0.6,0.3], [0.1,0.8], [0.2,0.1], [0.2,0.7], [0.4,0.9]]**
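Under the hood this is just a table lookup: each integer selects one row of a weight table. A minimal sketch using the numbers from the example above (the padding row for index 0 is an assumed value, and in Keras the table is learned during training rather than fixed):

```python
# One row per vocabulary index, output_dim = 2 columns
embedding_table = [
    [0.0, 0.0],  # 0: padding (assumed values for illustration)
    [0.2, 0.7],  # 1: the
    [0.6, 0.3],  # 2: cat
    [0.1, 0.8],  # 3: sat
    [0.2, 0.1],  # 4: on
    [0.4, 0.9],  # 5: mat
]

encoded_doc = [1, 2, 3, 4, 1, 5]  # [the, cat, sat, on, the, mat]

# Look up one row per word: the result has shape (6 words, 2 dimensions)
embedded_doc = [embedding_table[i] for i in encoded_doc]

print(embedded_doc)
```

Both occurrences of "the" map to the same row, so they get the same vector.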

In [46]:
model = Sequential()
model.add(Embedding(vocab_size, 64, input_length=max_length))
model.add(LSTM(512))
model.add(Dense(1, activation = 'sigmoid'))

In [47]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 125, 64)           925824    
_________________________________________________________________
lstm_1 (LSTM)                (None, 512)               1181696   
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
Total params: 2,108,033
Trainable params: 2,108,033
Non-trainable params: 0
_________________________________________________________________
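The parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, assuming the standard Keras LSTM with a bias term (the vocab size of 14466 is inferred from 925824 / 64, not printed anywhere in the notebook):

```python
vocab_size = 14466   # inferred from the summary: 925824 / 64
embedding_dim = 64
lstm_units = 512

# Embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 925824 1181696 513 2108033
```

More than half of the weights sit in the LSTM layer, so the number of units is the main knob for model size here.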


---

#### Now train and test

In [49]:
Xtrain, Xtest, ytrain, ytest = train_test_split(word_vec_X, y)

In [50]:
model.compile(optimizer='rmsprop', loss= 'binary_crossentropy', metrics = ['accuracy'])

In [51]:
model.fit(Xtrain, ytrain, epochs = 3, batch_size = 128, validation_split = 0.2)

Train on 643 samples, validate on 161 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x13f2344d0>
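The sample counts in the log follow from the two splits. A sketch of the arithmetic, assuming the default 25% test size of `train_test_split` and that `validation_split` holds out the last fraction of the training data:

```python
import math

n_samples = 1073  # number of documents, from the tqdm progress bar
test_size = 0.25  # train_test_split default when no size is given

# scikit-learn rounds the test set up; the rest is the training set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Keras holds out the final 20% of the training data for validation
validation_split = 0.2
n_fit = int(n_train * (1 - validation_split))
n_val = n_train - n_fit

print(n_test, n_train, n_fit, n_val)  # 269 804 643 161
```

So the 269 documents in Xtest are never seen during `fit`, and can be used afterwards for a final evaluation.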