<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Neural Nets for Sequential Data

-----
**OBJECTIVES**


- Use RNNs and CNNs to model text data
------

In [1]:
import pandas_datareader as pdr
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Sequential Models for Text
-------

Now we use the Keras `Tokenizer` to preprocess our spam data and feed it through several sequential network architectures.

In [2]:
import pandas as pd
import numpy as np

In [3]:
from keras.preprocessing.text import Tokenizer

In [4]:
spam = pd.read_csv('data/sms_spam.csv')

In [5]:
spam.head()

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000 ..."
4,spam,okmail: Dear Dave this is your final notice to...


### `Tokenizer`
------
Here, we set the vocabulary limit (`num_words`) to 500, fit the tokenizer on the texts, and then transform each text into a sequence of integer word indices with `.texts_to_sequences`. To ensure every sequence has the same length, we use the `pad_sequences` function.

In [7]:
# create a tokenizer and specify the vocabulary size
# num_words=500 restricts the sequences to the most frequently occurring words (indices below 500)
# the vocabulary is learned when the tokenizer is fit on the text
# the text is then turned into sequences of vocabulary indices
# depending on the size of the data set, a different vocabulary size may work better; try a few values
# unlike the bag-of-words method, this keeps the order of the words
tokenizer = Tokenizer(num_words=500)

In [8]:
#fit it on text
tokenizer.fit_on_texts(spam['text'])

In [11]:
#generate sequences
sequences = tokenizer.texts_to_sequences(spam['text'])

In [12]:
# these are the first three messages, tokenized
# the network cannot take rows of different lengths,
# so we pad every sequence with zeros to a uniform length
# (by default pad_sequences puts the zeros at the start of each sequence)
# we choose that length with the maxlen argument
sequences[:3]

[[122, 3, 22, 313, 4, 53, 110, 37, 8],
 [92, 134, 86, 11, 170],
 [60, 179, 155, 8, 62, 24, 17, 2, 387]]
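To make the integer sequences above less opaque, here is a rough pure-Python sketch of frequency-ranked tokenization. It is a simplification and not the Keras implementation: the helper names (`fit_word_index`, `texts_to_sequences`), the regular expression, and the toy texts are all assumptions made for illustration. The most frequent word gets index 1, index 0 is left free for padding, and indices at or above `num_words` are dropped.

```python
from collections import Counter
import re

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each word by its index, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        ids = [word_index[w] for w in words if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

# hypothetical toy corpus, not taken from the spam data
toy_texts = ["free prize call now", "call me now", "see you now"]
toy_index = fit_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```

With `num_words=3` only the two most frequent toy words survive, which mirrors how rare words disappear from the spam sequences.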

In [13]:
from keras.preprocessing.sequence import pad_sequences

In [14]:
# pad (or truncate) every sequence to length 100
X = pad_sequences(sequences, maxlen=100)

In [12]:
#take a peek
X[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       122,   3,  22, 313,   4,  53, 110,  37,   8], dtype=int32)
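The output above shows the zeros in front of the message, which matches the default `padding='pre'` and `truncating='pre'` behavior of `pad_sequences`. A minimal sketch of that behavior in plain Python follows; `pad_pre` is a hypothetical helper written for this note and it assumes `maxlen` is a positive integer.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                  # drop items from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=5)
print(toy_padded)
```

Padding at the front means the real words sit at the end of the sequence, so they are the last thing a recurrent layer reads.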

### Model
-------

In [15]:
from keras.layers import Embedding, Dense, SimpleRNN
from keras.models import Sequential

In [16]:
#sequential model
text_model1 = Sequential()
# embedding layer
# a word embedding maps each integer word index to a dense vector (here of size 64)
# the vectors are learned during training, which makes this an effective tool for text problems
text_model1.add(Embedding(input_dim = tokenizer.num_words, output_dim = 64))
#simple RNN
text_model1.add(SimpleRNN(16))
#dense layer
text_model1.add(Dense(20, activation = 'relu'))
#output
text_model1.add(Dense(1, activation = 'sigmoid'))
#compilation
text_model1.compile(loss = 'bce', metrics = ['accuracy'])

2022-04-13 19:17:42.646650: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
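As a rough picture of what the `Embedding` and `SimpleRNN` layers compute, the NumPy sketch below looks up a vector for each token and applies a tanh recurrence, returning only the final hidden state. The weights are random placeholders rather than trained values, and the names (`embedding`, `W_x`, `W_h`, `simple_rnn_last_state`) are assumptions for illustration; the sizes 500, 64, and 16 mirror the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 500, 64, 16

# placeholder weights; in the Keras model these are learned during fit()
embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W_x = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def simple_rnn_last_state(token_ids):
    """Run h_t = tanh(x_t W_x + h_{t-1} W_h + b) over the sequence and return the last h_t."""
    h = np.zeros(hidden_dim)
    for token in token_ids:
        x_t = embedding[token]              # embedding lookup: index -> dense vector
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

final_state = simple_rnn_last_state([0, 0, 122, 3, 22])
print(final_state.shape)
```

The single vector of length 16 is what the following `Dense` layers receive for each message.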


In [17]:
#make y binary
y = np.where(spam['type'] == 'ham', 0, 1)

In [18]:
# baseline accuracy: the share of ham messages, i.e. the accuracy of always predicting ham
1-np.sum(y)/len(y)

0.8656233135456017
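The baseline above works because ham is the majority class. A slightly more general sketch, assuming binary 0/1 labels, takes whichever class is more common; `majority_baseline` and the toy labels are hypothetical and not part of the notebook.

```python
import numpy as np

def majority_baseline(labels):
    """Accuracy obtained by always predicting the more common of two 0/1 classes."""
    labels = np.asarray(labels)
    positive_rate = labels.mean()
    return max(positive_rate, 1 - positive_rate)

toy_baseline = majority_baseline([0, 0, 0, 1])
print(toy_baseline)
```

Any model worth keeping should beat this number on held-out data.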

In [20]:
#fit it
history = text_model1.fit(X, y, epochs =10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [21]:
# try again with a validation set
#sequential model
text_model1 = Sequential()
# embedding layer
# a word embedding maps each integer word index to a dense vector (here of size 64)
# the vectors are learned during training, which makes this an effective tool for text problems
text_model1.add(Embedding(input_dim = tokenizer.num_words, output_dim = 64))
#simple RNN
text_model1.add(SimpleRNN(16))
#dense layer
text_model1.add(Dense(20, activation = 'relu'))
#output
text_model1.add(Dense(1, activation = 'sigmoid'))
#compilation
text_model1.compile(loss = 'bce', metrics = ['accuracy'])

history = text_model1.fit(X, y, validation_split=.2, epochs =10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
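One detail worth keeping in mind: Keras documents that `validation_split` holds out the last fraction of the samples, selected before any shuffling. If the rows of the data are ordered in some way, the validation set may not be representative. The sketch below only illustrates the idea of a tail split; the function `tail_split` and its index arithmetic are assumptions, not the library's code.

```python
import math

def tail_split(n_samples, validation_split):
    """Return (train_indices, validation_indices) where validation is the tail of the data."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    train_idx = list(range(split_at))
    val_idx = list(range(split_at, n_samples))
    return train_idx, val_idx

train_idx, val_idx = tail_split(10, 0.2)
print(train_idx, val_idx)
```

Shuffling the rows beforehand, or using the already imported `train_test_split`, is one way to avoid an unlucky tail.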


### Convolutional Networks in 1D
--------

In [None]:
from keras.layers import Conv1D, MaxPooling1D

In [38]:
X=spam['text']

y = np.where(spam.type == 'ham', 0 , 1)

In [43]:
tokenizer = Tokenizer(500)
tokenizer.fit_on_texts(X)
X= tokenizer.texts_to_sequences(X)
X= pad_sequences(X, maxlen=100)

In [26]:
tokenizer.num_words

500

In [44]:
conv_test = Sequential()
# embedding: output_dim sets the size of the vector returned for each word
conv_test.add(Embedding(tokenizer.num_words, output_dim=64 ))
# convolve: 16 filters, each sliding over windows of 10 consecutive words
conv_test.add(Conv1D(filters=16, kernel_size=10))
# then pool: take the maximum over each window of 4 consecutive positions
conv_test.add(MaxPooling1D(4))
# add the rest of a conventional network
# note: nothing here collapses the time axis, so the Dense layers act on every time step
conv_test.add(Dense(20, activation='relu'))
# binary classification, so we need a sigmoid
conv_test.add(Dense(1, activation='sigmoid'))
# compilation
conv_test.compile(loss='bce', metrics=['acc'])

In [45]:
history= conv_test.fit(X, y, validation_split=0.2, epochs=10)

Epoch 1/10


ValueError: in user code:

    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 860, in train_step
        loss = self.compute_loss(x, y, y_pred, sample_weight)
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 918, in compute_loss
        return self.compiled_loss(
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 201, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/losses.py", line 141, in __call__
        losses = call_fn(y_true, y_pred)
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/losses.py", line 245, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/losses.py", line 1932, in binary_crossentropy
        backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
    File "/Users/hannah.westberg/opt/anaconda3/lib/python3.9/site-packages/keras/backend.py", line 5247, in binary_crossentropy
        return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

    ValueError: `logits` and `labels` must have the same shape, received ((None, 22, 1) vs (None,)).
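The shape `(None, 22, 1)` in the error can be traced with simple arithmetic. With the default `'valid'` padding, `Conv1D` with `kernel_size=10` shortens the 100 time steps to 91, and `MaxPooling1D(4)` reduces those to 22. Because no layer collapses the time axis, the `Dense` layers are applied at each of the 22 steps, so the model emits 22 predictions per message while `y` holds one label per message. The helpers below are a hypothetical sketch of that arithmetic, assuming the default strides.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_valid_length(length, pool_size, stride=None):
    """Output length of 1D max pooling with 'valid' padding; stride defaults to pool_size."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1

steps_after_conv = conv1d_valid_length(100, kernel_size=10)
steps_after_pool = maxpool1d_valid_length(steps_after_conv, pool_size=4)
print(steps_after_conv, steps_after_pool)
```

One likely fix is to add a layer such as `Flatten()` or `GlobalMaxPooling1D()` (both available in `keras.layers`) between the pooling layer and the first `Dense` layer, so that each message yields a single output.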


### Exercise

Build a model on the tweets data from `tweets.csv`. 