# **In this exercise we will implement a simple RNN model to classify whether an SMS is spam or not**

# **We will use an SMS spam classification dataset for this implementation.**

**Data pre-processing**
*   Download the dataset from https://www.kaggle.com/uciml/sms-spam-collection-dataset and upload it to Colab
*   Import the necessary libraries
*   Read the dataset and drop the unwanted columns
*   Convert the string labels into integers and the messages into padded token sequences
*   Split the data into train and test dataset

**Building an RNN**
*  Use the Keras SimpleRNN layer to build a model
*  Compile the model and begin to train
*  Evaluate the model on test dataset
*  Make model predictions on custom data







## **Upload the CSV file to Colab**

In [1]:
#download the data and upload it to Colab
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


## **Import the necessary libraries**

In [36]:
#import the required libraries
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import SimpleRNN, LSTM, GRU, Embedding, Dense, Flatten
from keras.models import Sequential

## **Read the dataset and drop the unwanted columns**

In [37]:
#read dataset and drop the unwanted columns
data = pd.read_csv("spam.csv",encoding='latin-1')
dataset=data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'],axis=1)
dataset

|  | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ì_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


## **Convert the labels into integers from strings**

In [38]:
#convert the labels from words to integers 0 and 1
sms = []
classes = []
for index, row in dataset.iterrows():
    sms.append(row['v2'])
    if row['v1'] == 'ham':
        classes.append(0)
    else:
        classes.append(1)
sms = np.asarray(sms)
classes = np.asarray(classes)
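The label conversion above can also be written as an explicit lookup. The sketch below is a minimal pure-Python illustration, not the notebook's own code: `LABEL_TO_ID`, `encode_labels` and the toy `labels` list are hypothetical names and data. Unlike the loop above, which maps every label other than `'ham'` to 1, this version raises a `KeyError` on an unexpected label.

```python
from collections import Counter

# Hypothetical mapping that mirrors the notebook's encoding: ham -> 0, spam -> 1.
LABEL_TO_ID = {'ham': 0, 'spam': 1}

def encode_labels(labels):
    """Map string labels to integers; an unknown label raises KeyError."""
    return [LABEL_TO_ID[label] for label in labels]

# Toy labels for illustration only (not taken from the dataset).
labels = ['ham', 'ham', 'spam', 'ham']
encoded = encode_labels(labels)

# Counting the encoded labels gives a quick view of the class balance.
class_counts = Counter(encoded)
print(encoded, dict(class_counts))
```

Checking the class counts before training is useful because a skewed ham/spam ratio affects how the accuracy metric should be read.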

## **Convert the SMS messages into tokens and pad them to the same length**

In [39]:
#convert every sms into a sequence of integer tokens and pad each sequence to a fixed length
total_vocab = 10000
total_len = 500
tokens = Tokenizer(num_words=total_vocab)
tokens.fit_on_texts(sms)
sequences = tokens.texts_to_sequences(sms)
words_id = tokens.word_index
dataset = pad_sequences(sequences, maxlen=total_len)
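To make the tokenizing and padding step concrete, here is a simplified pure-Python sketch of the idea. It is an illustration under assumptions, not the Keras implementation: `build_word_index`, `encode` and `pad_left` are hypothetical helpers, and the whitespace split ignores the punctuation filtering that `Tokenizer` applies. It does follow the same conventions: more frequent words get smaller indices starting at 1, only indices below the vocabulary size are kept, and short sequences are padded with zeros on the left.

```python
from collections import Counter

def build_word_index(texts):
    """Assign indices starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode(text, word_index, vocab_size):
    """Keep only known words whose index is below vocab_size."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < vocab_size]

def pad_left(sequence, length):
    """Keep the last `length` tokens, then left-pad with zeros."""
    sequence = sequence[-length:]
    return [0] * (length - len(sequence)) + sequence

# Toy corpus for illustration only.
texts = ["free entry win", "ok win"]
word_index = build_word_index(texts)
encoded = encode("win free stuff", word_index, vocab_size=10)
padded = pad_left(encoded, length=5)
print(word_index, encoded, padded)
```

The zeros on the left match the padded arrays printed later in the notebook, where each row starts with a run of zeros and ends with the message tokens.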

## **Split the data into train and test set**

In [40]:
#split the data sequentially: the first 80% for training and the remaining 20% for testing
training = int(len(sms)*0.8)
sms_train = dataset[:training]
classes_train = classes[:training]
#slice to the end so that the last two messages are not dropped from the test set
sms_test = dataset[training:]
classes_test = classes[training:]
sms_train

array([[   0,    0,    0, ...,   58, 4411,  144],
       [   0,    0,    0, ...,  470,    6, 1929],
       [   0,    0,    0, ...,  659,  389, 2988],
       ...,
       [   0,    0,    0, ...,   15,    4,  316],
       [   0,    0,    0, ...,  956, 8057,  629],
       [   0,    0,    0, ...,   44,  102,  231]], dtype=int32)
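The split above takes the rows in file order. If the file happened to be ordered in some way, a shuffled split would be a safer choice. The following is a minimal sketch of that alternative, not the notebook's method: `shuffled_split`, its `seed` parameter and the toy data are hypothetical.

```python
import random

def shuffled_split(features, labels, train_fraction=0.8, seed=0):
    """Shuffle the row indices with a fixed seed, then cut at train_fraction."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

# Toy data for illustration: ten "messages" with alternating labels.
features = [[i, i + 1] for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, y_train, x_test, y_test = shuffled_split(features, labels)
print(len(x_train), len(x_test))
```

Fixing the seed keeps the split reproducible between runs, and features and labels stay aligned because both are indexed with the same shuffled positions.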

## **Build a simple RNN model and compile the model**

In [41]:
#build a simple RNN with one embedding layer and compile the model
embedding=32
model = Sequential()
model.add(Embedding(input_dim=total_vocab,
                    output_dim=embedding,
                    input_length=total_len))
model.add(SimpleRNN(units=embedding))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['acc'])
model.summary()


Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 32)                2080      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 322,113
Trainable params: 322,113
Non-trainable params: 0
_________________________________________________________________
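The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used above (vocabulary 10000, embedding size 32, 32 recurrent units). The helper function names are hypothetical; the formulas are the standard weight counts for an embedding table, a simple recurrent layer (input weights, recurrent weights and a bias) and a dense layer.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary index.
    return vocab_size * embedding_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + one bias per unit.
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # One weight per input for each unit, plus one bias per unit.
    return input_dim * units + units

total_vocab, embedding = 10000, 32
counts = [
    embedding_params(total_vocab, embedding),   # 320000
    simple_rnn_params(embedding, embedding),    # 2080
    dense_params(embedding, 1),                 # 33
]
print(counts, sum(counts))                      # total is 322113
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main lever for shrinking this model.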


## **Fit the data to the model and begin training**

In [42]:
#fit the data and begin training
model.fit(sms_train, classes_train, epochs=10, batch_size=60, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f1e32d42c88>
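With `validation_split=0.2`, Keras holds out the last 20% of the arrays passed to `fit` for validation, before any shuffling, and trains on the rest in batches. The sketch below estimates the resulting sizes. It is a rough illustration: `fit_sizes` is a hypothetical helper, the sample count in the usage line is made up, and the exact rounding inside Keras may differ by one sample.

```python
import math

def fit_sizes(n_samples, validation_split, batch_size):
    """Return (training samples, validation samples, batches per epoch)."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    n_train = split_at
    n_val = n_samples - split_at
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch

# Hypothetical sample count, with the batch size used above.
n_train, n_val, steps = fit_sizes(n_samples=4000, validation_split=0.25, batch_size=60)
print(n_train, n_val, steps)
```

Because the validation samples come from the end of the training arrays, the test set created earlier stays untouched until evaluation.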

## **Evaluate the model**

In [43]:
#evaluate the model on test data
acc = model.evaluate(sms_test, classes_test)
print("Test loss is {0:.2f} accuracy is {1:.2f}  ".format(acc[0],acc[1]))

Test loss is 0.11 accuracy is 0.98  
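Accuracy alone can look high when ham messages heavily outnumber spam, because predicting ham for everything is already right most of the time. Precision and recall for the spam class give a fuller picture. The sketch below computes them from a confusion matrix in pure Python; `spam_metrics` and the toy label lists are hypothetical and are not results from this model.

```python
def spam_metrics(y_true, y_pred):
    """Return accuracy, precision and recall, treating 1 (spam) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]
accuracy, precision, recall = spam_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

To apply this to the model, the predicted probabilities for `sms_test` would first need to be thresholded into 0/1 labels and compared against `classes_test`.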


## **Make predictions on custom messages**

In [44]:
#function that encodes a custom sms with the fitted tokenizer so it matches the training input
def custom_predict(sms):
    #texts_to_sequences lowercases the text, strips punctuation and skips unknown words
    sequence = tokens.texts_to_sequences([sms])
    #left-pad with zeros to total_len, giving an array of shape (1, 500)
    test_sentence = pad_sequences(sequence, maxlen=total_len)
    return test_sentence
custom_msg = 'hey good morning i have work'
test_sentence = custom_predict(custom_msg)
#predict_classes is unavailable in newer TensorFlow releases, so threshold the sigmoid output
pred = (model.predict(test_sentence) > 0.5).astype('int32')
print(pred)

[[0]]


In [50]:
custom_msg = 'Get a chance to win 10 million for FREE'
test_sentence = custom_predict(custom_msg)
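#threshold the sigmoid output at 0.5 (predict_classes is unavailable in newer TensorFlow releases)
pred = (model.predict(test_sentence) > 0.5).astype('int32')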
print(pred)

[[1]]
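The printed 0 and 1 come from thresholding the sigmoid output at 0.5, with 0 meaning ham and 1 meaning spam as in the label encoding earlier. The sketch below shows that step on plain probabilities and maps the result back to readable labels. `to_labels`, its `threshold` parameter and the sample probabilities are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Label a message as spam when its probability exceeds the threshold."""
    return ['spam' if p > threshold else 'ham' for p in probabilities]

# Made-up probabilities for illustration, not model outputs.
sample_probabilities = [0.03, 0.97, 0.5]
labels = to_labels(sample_probabilities)
strict_labels = to_labels(sample_probabilities, threshold=0.99)
print(labels, strict_labels)
```

Raising the threshold flags fewer messages as spam, which trades missed spam for fewer legitimate messages wrongly flagged.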
