# Spam Detection with RNN
This notebook was built for the course ENGR 501 -- *Deep learning and reinforcement learning for engineering*, as supplementary material for the RNN lectures. It shows how a Recurrent Neural Network (RNN) model can be used to distinguish spam from non-spam messages. The dataset is the "SMS Spam Collection", which is available on Kaggle at "https://www.kaggle.com/ishansoni/sms-spam-collection-dataset".

## Imports

In [1]:
import pandas as pd
import numpy as np

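# Import everything from tensorflow.keras so the Keras and TensorFlow objects stay compatible
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Embedding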



## Read data

In [2]:
df = pd.read_csv('./spam.csv')
df.head()

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Create target-column
Map the text in the `label` column to an integer (`spam` to 1, `ham` to 0) and store it in a new `target` column.

In [3]:
df['target'] = df['label'].map( {'spam':1, 'ham':0 })

## Separate training set and testing set

In [4]:
df_train = df.sample(frac=.8, random_state=11)
df_test = df.drop(df_train.index)
print(df_train.shape, df_test.shape)

(4458, 3) (1114, 3)
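The split logic above (draw 80% of the rows at random, then keep the remaining rows as the test set) can be sketched with the standard library only. This is a minimal illustration: the names `n_rows`, `all_idx`, `train_idx` and `test_idx` are made up for the sketch, and Python's `random` module will not select the same rows as `df.sample`.

```python
import random

n_rows = 5572                       # 4458 + 1114, the sizes printed above
rng = random.Random(11)             # seeded for repeatability; not pandas' generator

all_idx = list(range(n_rows))
# Draw 80% of the row positions without replacement (the training rows).
train_idx = set(rng.sample(all_idx, round(0.8 * n_rows)))
# Everything that was not drawn becomes the test set, so the two sets are disjoint.
test_idx = [i for i in all_idx if i not in train_idx]

print(len(train_idx), len(test_idx))   # 4458 1114
```

Because the test rows are defined as "all rows not in the training sample", no message can appear in both sets.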


## Prepare y_train, y_test

In [5]:
y_train = df_train['target'].values
y_test = df_test['target'].values
print(y_train.shape, y_test.shape)

(4458,) (1114,)


## Prepare X_train, X_test

In [6]:
X_train = df_train['sms'].values
X_test = df_test['sms'].values
print(X_train.shape, X_test.shape)
print(X_train[0])

(4458,) (1114,)
Thanks again for your reply today. When is ur visa coming in. And r u still buying the gucci and bags. My sister things are not easy, uncle john also has his own bills so i really need to think about how to make my own money. Later sha.


## Tokenize
Create a `word_dict` that maps each word to an integer index. The indexes start at 1 and follow word frequency, so the most frequent word gets index 1.

In [7]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

word_dict = tokenizer.word_index

print(len(word_dict))
print(word_dict)

# for key in word_dict.keys():
#     print(key, word_dict[key])

7982
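To make the frequency ordering concrete, here is a minimal pure-Python sketch of how such a word index can be built. It is not the Keras implementation: `FILTERS`, `simple_tokens`, `build_word_index` and the toy messages are assumptions made for illustration, and the punctuation handling only approximates the `Tokenizer` defaults.

```python
from collections import Counter

# Characters replaced by spaces in this sketch (assumed to approximate the Tokenizer defaults).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokens(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_texts = ["Free entry, free prize!", "Ok lar, see you", "free call now, ok"]
toy_index = build_word_index(toy_texts)
print(toy_index["free"], toy_index["ok"], len(toy_index))   # 1 2 9
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value later on.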


## Create index sequences from sentences (word sequences)

Convert each message from a sequence of words into the corresponding sequence of word indexes.

In [8]:
X_train_seq = tokenizer.texts_to_sequences(X_train) 
X_test_seq = tokenizer.texts_to_sequences(X_test)

print(X_train_seq[:5])

# print(X_train_seq[0])
# print(f"The length of the first sentence is: {len(X_train_seq[0])}")
# print(df_train.iloc[0,:])
# for el in X_train_seq[0]:
#     print(tokenizer.index_word[el], end=' ')  # index_word maps index -> word (word_dict maps word -> index)


# print(X_train_seq[1])
# print(f"The length of the second sentence is: {len(X_train_seq[1])}")

# print(df_train.iloc[1,:])
# for el in X_train_seq[1]:
#     print(tokenizer.index_word[el], end=' ')  # index_word maps index -> word (word_dict maps word -> index)

[[172, 211, 12, 13, 87, 92, 45, 8, 32, 3799, 231, 9, 7, 86, 6, 81, 1020, 5, 3800, 7, 1999, 11, 635, 241, 21, 25, 436, 928, 1110, 178, 131, 206, 929, 2564, 23, 1, 154, 80, 2, 110, 82, 48, 2, 135, 11, 929, 227, 98, 1639], [257, 307, 2, 1426, 2565, 6, 33, 30, 1245, 1246, 15, 49, 5, 337, 709, 7, 1427, 1428, 581, 68, 34, 2000, 88, 2, 2001], [22, 636, 13, 283, 211, 7, 26, 3, 17, 94, 1429, 67], [13, 296, 2, 30, 18, 4, 2002, 1640, 491, 16, 22, 1247, 37, 930, 258, 183, 931, 671, 401, 349, 1111, 1112, 1113, 1114, 1021, 8, 4, 553, 360, 16], [99, 203, 166, 1, 184, 3, 117, 3801, 148, 2, 52, 48, 3802, 22]]
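The lookup behind this step can be sketched in a few lines. The vocabulary below is a hypothetical toy dictionary and `to_sequence` is an illustrative helper, not the Keras code. The sketch assumes a tokenizer created without an `oov_token`, as in this notebook, in which case words missing from the vocabulary are simply skipped.

```python
# Hypothetical toy vocabulary (word -> index); index 0 is left unused for padding.
toy_index = {"free": 1, "ok": 2, "entry": 3, "prize": 4}
# Inverse mapping (index -> word), playing the role of tokenizer.index_word.
toy_index_word = {i: w for w, i in toy_index.items()}

def to_sequence(text, word_index):
    """Replace each known word with its index; silently skip unknown words."""
    words = text.lower().replace(",", " ").replace("!", " ").split()
    return [word_index[w] for w in words if w in word_index]

seq_known = to_sequence("Free prize, ok!", toy_index)
seq_unknown = to_sequence("Free deep learning", toy_index)
print(seq_known)     # [1, 4, 2]
print(seq_unknown)   # [1]

# Decoding goes through the inverse mapping, not through the word -> index dictionary.
print(" ".join(toy_index_word[i] for i in seq_known))   # free prize ok
```

The second call shows why a sequence can be shorter than the original sentence: "deep" and "learning" are not in the toy vocabulary, so they leave no trace in the output.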


## Pad sequences to a fixed length
Every sequence is brought to a fixed length of 20 indexes.

In [9]:
X_train_pad = pad_sequences(X_train_seq, maxlen=20, padding='post') 
X_test_pad = pad_sequences(X_test_seq, maxlen=20, padding='post')
print(X_train_pad[:5])
X_train_pad.shape

# Question: what happens to sentences shorter than maxlen=20?

[[ 178  131  206  929 2564   23    1  154   80    2  110   82   48    2
   135   11  929  227   98 1639]
 [   6   33   30 1245 1246   15   49    5  337  709    7 1427 1428  581
    68   34 2000   88    2 2001]
 [  22  636   13  283  211    7   26    3   17   94 1429   67    0    0
     0    0    0    0    0    0]
 [  22 1247   37  930  258  183  931  671  401  349 1111 1112 1113 1114
  1021    8    4  553  360   16]
 [  99  203  166    1  184    3  117 3801  148    2   52   48 3802   22
     0    0    0    0    0    0]]


(4458, 20)
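The output above shows both cases: short sequences are filled with zeros at the end (`padding='post'`), and long sequences keep only their last 20 indexes, because `pad_sequences` truncates from the front unless `truncating` is changed. A minimal pure-Python sketch of that behaviour follows; `pad_post_truncate_pre` is a made-up helper name, not a Keras function.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then right-pad with `value` (maxlen must be > 0)."""
    kept = seq[-maxlen:]                       # drop items from the front when too long
    return kept + [value] * (maxlen - len(kept))

# Third training sequence from the output above (12 indexes, shorter than 20).
short_seq = [22, 636, 13, 283, 211, 7, 26, 3, 17, 94, 1429, 67]
# Artificial sequence of 25 indexes, longer than 20.
long_seq = list(range(1, 26))

padded_short = pad_post_truncate_pre(short_seq, maxlen=20)
padded_long = pad_post_truncate_pre(long_seq, maxlen=20)
print(padded_short)   # original 12 indexes followed by 8 zeros
print(padded_long)    # indexes 6 to 25; the first 5 were dropped
```

The padded short sequence matches the third row of `X_train_pad` printed above.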

## Create Keras-model
Use a simple RNN or a "Long Short-Term Memory" (LSTM) layer. The LSTM is active in the cell below; the `SimpleRNN` alternative is commented out.

In [10]:
T = len(X_train_pad[1])  # 20
V = len(word_dict) # 7982
D = 20

model = Sequential()
model.add(Embedding(input_dim=V+1, output_dim=D, input_length=T)) # "V+1" since word index starts from 1
# model.add(SimpleRNN(400))
model.add(LSTM(400))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 20)            159660    
                                                                 
 lstm (LSTM)                 (None, 400)               673600    
                                                                 
 dense (Dense)               (None, 1)                 401       
                                                                 
Total params: 833,661
Trainable params: 833,661
Non-trainable params: 0
_________________________________________________________________
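The parameter counts in the summary can be checked by hand. The sketch below reuses `V` and `D` from the cell above and introduces `H` (an assumed name) for the number of LSTM units. It relies on the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias.

```python
V, D, H = 7982, 20, 400   # vocabulary size, embedding width, LSTM units

embedding_params = (V + 1) * D          # one D-dimensional vector per index, including index 0
lstm_params = 4 * (H * (D + H) + H)     # 4 gates: input weights, recurrent weights, bias
dense_params = H * 1 + 1                # weights plus bias of the single sigmoid unit
total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params)   # 159660 673600 401
print(total_params)                                  # 833661

# A SimpleRNN with the same width has a single set of weights instead of four.
simple_rnn_params = H * (D + H) + H
print(simple_rnn_params)                             # 168400
```

The LSTM therefore holds four times as many parameters as a `SimpleRNN(400)` would, which accounts for most of the model's size.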



## Train model

In [11]:
history = model.fit(X_train_pad, y_train, epochs=10, batch_size=64, 
                    validation_data=(X_test_pad, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Test predictions

In [20]:
sms_test = ['This course is about deep learning']
sms_seq = tokenizer.texts_to_sequences(sms_test)  # words unseen during fitting are dropped

sms_pad = pad_sequences(sms_seq, maxlen=20, padding='post')
predict_x = model.predict(sms_pad)  # sigmoid output, shape (1, 1)
classes_x = (predict_x[0, 0] > 0.5)  # True means spam
classes_x



False

... the model classified the text as not spam. Correct!

In [21]:
sms_test = ['Free service for everyone']
sms_seq = tokenizer.texts_to_sequences(sms_test)  # words unseen during fitting are dropped

sms_pad = pad_sequences(sms_seq, maxlen=20, padding='post')
predict_x = model.predict(sms_pad)  # sigmoid output, shape (1, 1)
classes_x = (predict_x[0, 0] > 0.5)  # True means spam
classes_x



True

... the model classified the text as spam. Correct again!
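The `> 0.5` comparison used in both cells extends naturally to a batch of messages. The helper below is a small sketch under stated assumptions: `label_predictions` is a made-up name, the input is expected to have the `(n, 1)` layout that `model.predict` returns for this model, and the probabilities are invented for illustration rather than taken from the trained network.

```python
def label_predictions(probabilities, threshold=0.5):
    """Turn rows of sigmoid outputs (one probability per row) into text labels."""
    return ["spam" if row[0] > threshold else "ham" for row in probabilities]

# Made-up probabilities for illustration; these are not outputs of the trained model.
fake_probs = [[0.03], [0.97], [0.50]]
labels = label_predictions(fake_probs)
print(labels)   # ['ham', 'spam', 'ham']
```

A probability of exactly 0.5 is labelled `ham` here because the comparison is strict, which mirrors the `>` used in the cells above.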