# LSTM Example
This notebook works through an example of processing SMS text messages and classifying each one as spam or not spam.  

The data needs quite a bit of pre-processing, which I will cover only briefly.  Further information on this topic can be found in subsequent courses or in the associated reference links.  

The main purpose of this notebook is to show how to use an LSTM layer in a neural network on text data.

### Dataset
The dataset is available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/228/sms+spam+collection.  The full collection contains 5,574 messages: 747 spam and 4,827 non-spam ("ham").  

In [None]:
# Check whether tensorflow is installed by listing the installed packages:
!pip list
# If the server does not already include tensorflow, uncomment and run the install commands below.
# !pip install tensorflow==2.14.0
# !pip install dm-tree
# !pip install toml

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

# Workshop Functions
import sys
sys.path.append('..')
from WKDSS420_functions import * 

np.random.seed(1)

In [None]:
rawInput = pd.read_csv('SMSSpamCollection', sep='\t', names=['label','message'])
print(rawInput.loc[5,'label'], '\n', rawInput.loc[5,'message'])

In [None]:
df = pd.read_csv('SMSSpamCollection_clean.csv')

In [None]:
df.loc[5,'message']

In [None]:
# Most messages are short, but this one is very long
df.loc[1085,'message']

In [None]:
df.head()

### Tokenize input words and use result in LSTM NN

Sources:

https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47
and 
https://towardsdatascience.com/an-easy-tutorial-about-sentiment-analysis-with-deep-learning-and-keras-2bf52b9cba91

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

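max_words = 5000  # vocabulary size: only the most frequent words are kept
max_len = 100     # every message is padded or truncated to this many tokens

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df['message'])
sequences = tokenizer.texts_to_sequences(df['message'])
texts = pad_sequences(sequences, maxlen=max_len)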
texts.shape
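
To make the tokenizing and padding step less of a black box, here is a minimal pure-Python sketch of the idea. The helpers `fit_word_index` and `to_padded_sequence` and the toy messages are made up for illustration and are not part of Keras. The real `Tokenizer` also strips punctuation, which this sketch skips. The sketch assumes the Keras defaults of padding and truncating at the front of each sequence.

```python
from collections import Counter

def fit_word_index(messages, num_words):
    """Rank words by frequency; the most common word gets index 1 (0 is reserved for padding)."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # Keep only indices below num_words, i.e. the num_words - 1 most frequent words.
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_padded_sequence(message, word_index, maxlen):
    """Map words to indices, drop unknown words, then pad or truncate at the front."""
    seq = [word_index[w] for w in message.lower().split() if w in word_index]
    seq = seq[-maxlen:]                      # keep only the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq   # zeros go in front of the tokens

# Hypothetical toy messages, not taken from the dataset
toy_messages = ["free prize call now", "call me now", "see you now"]
word_index = fit_word_index(toy_messages, num_words=5)
padded = [to_padded_sequence(m, word_index, maxlen=4) for m in toy_messages]

print(word_index)  # {'now': 1, 'call': 2, 'free': 3, 'prize': 4}
print(padded)      # [[3, 4, 2, 1], [0, 0, 2, 1], [0, 0, 0, 1]]
```

Words outside the vocabulary ("me", "see", "you" in the toy data) simply disappear from the sequence, which is why a short message can end up as mostly zeros.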

In [None]:
texts[5]

In [None]:
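# Encode the labels as integers: ham -> 0, spam -> 1.
# Assigning a new column (instead of df.loc[:,'label'] = ...) lets pandas store it with an integer dtype.
mapping = {'ham':0, 'spam':1}
df['label'] = df['label'].map(mapping)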

In [None]:
df.loc[1:5,'label']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(texts, 
    df.loc[:,'label'].values, test_size=0.3, random_state=1)

In [None]:
type(y_train[1])

In [None]:
X_train.shape

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM

In [None]:
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=10, input_length=100)) # Embedding layer: each word index becomes a 10-dimensional vector
model.add(LSTM(3)) # A single LSTM layer with 3 units; a larger or deeper LSTM tends to overfit here
model.add(Dense(1,activation='sigmoid'))
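
To give a rough picture of what the LSTM layer computes, here is a minimal NumPy sketch of the standard LSTM update applied to one message. The weights are random and the "embedded message" is random noise, so this only illustrates the mechanics, not the trained model. The function `lstm_step`, the gate ordering inside the weight matrices, and all variable names are choices made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)        # input gate, forget gate, candidate, output gate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g           # forget part of the old memory, write new content
    h_t = o * np.tanh(c_t)             # expose a gated view of the memory
    return h_t, c_t

rng = np.random.default_rng(0)
embed_dim, units, steps = 10, 3, 100   # same sizes as the model above

W = rng.normal(scale=0.1, size=(embed_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

sequence = rng.normal(size=(steps, embed_dim))  # stand-in for one embedded message
h = np.zeros(units)
c = np.zeros(units)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # (3,)
```

Only the final `h` is passed on to the `Dense` layer, so the whole 100-token message is summarized in three numbers.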

In [None]:
model.summary()
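
The parameter counts reported by the summary can be checked by hand. The short calculation below assumes the layer sizes used above (5000-word vocabulary, 10-dimensional embedding, 3 LSTM units, 1 output unit) and the standard LSTM layout of four gate blocks, each with input weights, recurrent weights and a bias.

```python
vocab_size, embed_dim, units = 5000, 10, 3

# One 10-dimensional vector per word index
embedding_params = vocab_size * embed_dim

# Four gate blocks, each with input weights, recurrent weights and a bias
lstm_params = 4 * (embed_dim * units + units * units + units)

# One weight per LSTM unit plus a bias
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 50000
print(lstm_params)       # 168
print(dense_params)      # 4
print(total_params)      # 50172
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding width are the main levers on model size.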

In [None]:
# If model.fit raises "ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int)",
# the arrays most likely have dtype object. With some library versions you may need to recast
# the 4 numpy arrays to int, as done in the next cell.

In [None]:
X_train = np.asarray(X_train).astype(int)
y_train = np.asarray(y_train).astype(int)
X_test = np.asarray(X_test).astype(int)
y_test = np.asarray(y_test).astype(int)

In [None]:
model.compile(optimizer='adam',loss='binary_crossentropy') 
model.fit(x = X_train, y = y_train, epochs=25,validation_data=(X_test, y_test))

In [None]:
losses = pd.DataFrame(model.history.history)
losses.plot()
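
The loss curves do not directly say how many spam messages are caught. Because ham messages far outnumber spam, it can help to turn the sigmoid outputs into labels and look at precision and recall as well as accuracy. The sketch below uses NumPy only. The function `spam_metrics`, the 0.5 threshold, and the demo values are assumptions made for illustration, not results from this model.

```python
import numpy as np

def spam_metrics(y_true, y_prob, threshold=0.5):
    """Threshold predicted probabilities and compute basic classification metrics."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # spam correctly flagged
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # ham wrongly flagged
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam that slipped through

    accuracy = float(np.mean(y_pred == y_true))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical labels and probabilities, not output from the model above
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob_demo = [0.1, 0.4, 0.7, 0.9, 0.3, 0.2, 0.8, 0.05]

metrics = spam_metrics(y_true_demo, y_prob_demo)
print(metrics["accuracy"])  # 0.75
```

With the objects from this notebook, the call would look like `spam_metrics(y_test, model.predict(X_test))`. The `ravel()` inside the function flattens the `(n, 1)` array that `predict` returns.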