# **● Problem Statement:**

In recent years, as mobile phones have become more popular, the Short
Message Service (SMS) has grown into a multi-billion-dollar industry. At the same time,
falling messaging costs have led to growth in unsolicited commercial advertisements
(spam) sent to mobile phones. Spam SMS costs mobile service providers money and cuts
into users' calling time. Users who open a spam message also risk exposure to viruses or
malware. Every unwanted message that arrives intrudes on the user's privacy, breaks their
concentration, and can cause frustration. Spam SMS is therefore one of the major problems
in wireless communication, and it keeps growing.

# **● Import required libraries**

In [7]:
import csv,io
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk
nltk.download('stopwords')  
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
df = pd.read_csv('/content/spam.csv', encoding="ISO-8859-1")

# **● Read dataset and do pre-processing**


In [15]:
df

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will Ì_ b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |


In [13]:
df.describe()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |
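
The summary shows that only `v1` (the label) and `v2` (the message) are fully populated; the three `Unnamed` columns hold values in at most 50 rows. A minimal pre-processing sketch, assuming the goal is to keep just those two columns, could look like this. The toy rows and the new column names `label` and `text` are illustrative assumptions, and the de-duplication step is optional and not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in for spam.csv with the same column layout (made-up rows).
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["Ok lar...", "Free entry in 2 a wkly comp", "Ok lar..."],
    "Unnamed: 2": [None, None, None],
    "Unnamed: 3": [None, None, None],
    "Unnamed: 4": [None, None, None],
})

# Keep only the label and message columns and give them descriptive names.
clean = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})

# Optional: drop repeated messages (describe() reports 5169 unique texts in 5572 rows).
deduped = clean.drop_duplicates(subset="text").reset_index(drop=True)

print(list(clean.columns))
print(len(deduped))
```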


# **● Create Model**

In [19]:
articles = []
labels = []

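# Read the raw CSV: column 0 holds the label, column 1 the message text.
with open("spam.csv", 'r', encoding="ISO-8859-1") as dataset:
    reader = csv.reader(dataset, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        labels.append(row[0])
        article = row[1]
        # Remove stop words that appear in lowercase with a space on each side.
        for word in STOPWORDS:
            token = ' ' + word + ' '
            article = article.replace(token, ' ')
        articles.append(article)

# Sequential 80/20 split. The size is not stored in a variable called `len`,
# because that would shadow Python's built-in len().
train_size = int(len(articles) * 0.8)
train_articles = articles[:train_size]
train_labels = labels[:train_size]

validation_articles = articles[train_size:]
validation_labels = labels[train_size:]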
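
The replace-based loop above only removes stop words written in lowercase with a space on both sides, so capitalised words or words at the start of a message survive. A token-based variant is sketched below as one possible alternative, not as the notebook's method. The helper name `remove_stopwords` and the small `SAMPLE_STOPWORDS` set are assumptions made for illustration; in the notebook the NLTK `STOPWORDS` set would be passed instead.

```python
import re

# Small illustrative stop-word set; the notebook uses NLTK's full English list.
SAMPLE_STOPWORDS = {"i", "to", "the", "in", "he", "a", "is"}

def remove_stopwords(text, stopwords):
    """Drop whitespace-separated tokens whose lowercase core is a stop word."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation before comparing, e.g. "the," -> "the".
        core = re.sub(r"^\W+|\W+$", "", token).lower()
        if core not in stopwords:
            kept.append(token)
    return " ".join(kept)

print(remove_stopwords("The guy is in the gym, he said", SAMPLE_STOPWORDS))
# prints: guy gym, said
```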

In [25]:
np.unique(validation_labels)

array(['ham', 'spam'], dtype='<U4')
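
`np.unique` confirms that both classes appear in the validation split, but not in what proportion. Because the split is sequential rather than shuffled, it can be worth comparing class shares between splits. The sketch below uses a hypothetical helper, `class_shares`, and rebuilds the full label list from the counts reported by `df.describe()` (4825 `ham` out of 5572 rows); in the notebook it could be called on `train_labels` and `validation_labels` directly.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of examples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Counts taken from df.describe(): 4825 'ham' out of 5572 rows.
all_labels = ["ham"] * 4825 + ["spam"] * (5572 - 4825)
shares = class_shares(all_labels)
print({label: round(share, 3) for label, share in shares.items()})
# prints: {'ham': 0.866, 'spam': 0.134}
```

With roughly 87% of messages labelled `ham`, a classifier that always predicts `ham` would already reach about that accuracy, which is a useful baseline when reading the training log.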

In [30]:
validation_articles

['Die... I accidentally deleted e msg suppose 2 put e sim archive. Haiz... I sad...',
 'Welcome UK-mobile-date msg FREE giving free calling 08719839835. Future mgs billed 150p daily. To cancel send \\go stop\\" 89123"',
 'This wishing great day. Moji told offer always speechless. You offer easily go great lengths behalf stunning. My exam next friday. After keep touch more. Sorry.',
 'Thanks reply today. When ur visa coming in. And r u still buying gucci bags. My sister things easy, uncle john also bills really need think make money. Later sha.',
 "Sorry I flaked last night, shit's seriously goin roommate, tonight?",
 "He said look pretty wif long hair wat. But thk he's cutting quite short 4 leh.",
 'Ranjith cal drpd Deeraj deepak 5min hold',
 '\\CHEERS FOR CALLIN BABE.SOZI CULDNT TALKBUT I WANNATELL U DETAILS LATER WENWECAN CHAT PROPERLY X\\""',
 'Hey u still gym?',
 "She said,'' u mind I go bedroom minute ? '' ''OK'', I sed sexy mood. She came 5 minuts latr wid cake...n My Wife,",
 'M

In [32]:
tokenizer = Tokenizer(num_words = 5572, oov_token='OOV')
tokenizer.fit_on_texts(train_articles)
word_index = tokenizer.word_index
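
To make the fitted `word_index` easier to read, here is a simplified imitation of how such an index can be built: the out-of-vocabulary token takes index 1, and the remaining words are ranked by frequency. This is a sketch under stated assumptions, not the Keras implementation. The real `Tokenizer` also filters punctuation and applies the `num_words` cap when converting texts to sequences; `build_word_index` and `to_sequence` are hypothetical helper names.

```python
from collections import Counter

def build_word_index(texts, oov_token="OOV"):
    """Simplified imitation: lowercase, split on whitespace, rank by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    index = {oov_token: 1}  # index 0 is left free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(text, index, oov_token="OOV"):
    """Replace each word by its index; unseen words map to the OOV index."""
    return [index.get(word, index[oov_token]) for word in text.lower().split()]

sample_texts = ["call me now", "call now", "ok call"]
sample_index = build_word_index(sample_texts)
print(sample_index)
# prints: {'OOV': 1, 'call': 2, 'now': 3, 'me': 4, 'ok': 5}
print(to_sequence("call you now", sample_index))
# prints: [2, 1, 3]
```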

In [49]:
word_index

{'OOV': 1,
 'i': 2,
 'u': 3,
 'call': 4,
 'you': 5,
 '2': 6,
 'get': 7,
 "i'm": 8,
 'ur': 9,
 'now': 10,
 'gt': 11,
 'lt': 12,
 '4': 13,
 'ok': 14,
 'free': 15,
 'go': 16,
 'know': 17,
 'me': 18,
 'like': 19,
 'good': 20,
 'no': 21,
 'it': 22,
 'got': 23,
 'come': 24,
 'day': 25,
 'love': 26,
 'time': 27,
 'send': 28,
 'text': 29,
 'want': 30,
 'how': 31,
 'going': 32,
 "i'll": 33,
 'txt': 34,
 'do': 35,
 'one': 36,
 'home': 37,
 'sorry': 38,
 'need': 39,
 'so': 40,
 'r': 41,
 'but': 42,
 'still': 43,
 'lor': 44,
 'n': 45,
 'today': 46,
 'reply': 47,
 'back': 48,
 'dont': 49,
 'if': 50,
 'see': 51,
 'stop': 52,
 'k': 53,
 'da': 54,
 'please': 55,
 'hi': 56,
 'take': 57,
 'tell': 58,
 'new': 59,
 'think': 60,
 'what': 61,
 'just': 62,
 'mobile': 63,
 'the': 64,
 'we': 65,
 'later': 66,
 'my': 67,
 'dear': 68,
 'pls': 69,
 'phone': 70,
 '1': 71,
 'ì': 72,
 'your': 73,
 'week': 74,
 'msg': 75,
 'well': 76,
 'much': 77,
 'and': 78,
 'is': 79,
 'night': 80,
 'hope': 81,
 'happy': 82,
 'this

In [37]:
# Map the text labels to integers: 'ham' -> 1 and 'spam' -> 2 (index 0 is never used).
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)

training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
train_sequences = tokenizer.texts_to_sequences(train_articles)
# Without maxlen, each call pads to the longest sequence in its own input,
# so the training and validation arrays can have different widths.
train_padded = pad_sequences(train_sequences)

validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
validation_sequences = tokenizer.texts_to_sequences(validation_articles)
validation_padded = pad_sequences(validation_sequences)
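
By default `pad_sequences` pads with zeros at the front (`padding='pre'`) and, when `maxlen` is given, truncates from the front as well. The pure-Python sketch below mirrors that default to show what happens when two splits are padded separately; `pad_pre` is a hypothetical helper, the token ids are made up, and `maxlen` is assumed to be at least 1 when supplied.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]  # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

train_like = [[4, 10], [2, 7, 15, 16]]
validation_like = [[3, 43, 9]]

print(pad_pre(train_like))                 # width 4
print(pad_pre(validation_like))            # width 3
print(pad_pre(validation_like, maxlen=4))  # width 4, matches the training split
```

The LSTM accepts variable-length input, so differing widths do not raise an error, but passing the same `maxlen` to every `pad_sequences` call keeps the array shapes consistent across training, validation, and prediction.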

# **● Add Layers (LSTM, Dense-(Hidden Layers), Output)**

In [75]:
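model = tf.keras.Sequential([
    # Turn each word index (vocabulary capped at 5572) into a 64-dimensional vector.
    tf.keras.layers.Embedding(5572, 64),
    # Read the sequence in both directions; the output is 2 * 32 = 64 features.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(32, activation='relu'),
    # 6 output units, although only label indices 1 ('ham') and 2 ('spam') occur in the data.
    tf.keras.layers.Dense(6, activation='softmax')
])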

In [76]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, None, 64)          356608    
                                                                 
 bidirectional_3 (Bidirectio  (None, 64)               24832     
 nal)                                                            
                                                                 
 dense_6 (Dense)             (None, 32)                2080      
                                                                 
 dense_7 (Dense)             (None, 6)                 198       
                                                                 
Total params: 383,718
Trainable params: 383,718
Non-trainable params: 0
_________________________________________________________________
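
The parameter counts in the summary can be reproduced by hand. The arithmetic below uses the standard formulas for embedding, LSTM, and dense layers with the layer sizes from the model definition; the variable names are chosen here for illustration only.

```python
vocab_size, embed_dim = 5572, 64
lstm_units, hidden_units, output_units = 32, 32, 6

# Embedding: one 64-dimensional vector per vocabulary entry.
embedding = vocab_size * embed_dim
# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction
# Dense layers: inputs * units weights plus one bias per unit.
hidden = (2 * lstm_units) * hidden_units + hidden_units
output = hidden_units * output_units + output_units

total = embedding + bidirectional + hidden + output
print(embedding, bidirectional, hidden, output, total)
# prints: 356608 24832 2080 198 383718
```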


# **● Compile the Model**

In [77]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
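
`sparse_categorical_crossentropy` takes integer class labels rather than one-hot vectors, which is why the label sequences of 1s and 2s can be passed to the model unchanged. For a single example the loss is the negative log of the probability assigned to the true class. The sketch below illustrates this with made-up softmax values; it leaves out details of the Keras implementation such as numerical clipping and averaging over the batch.

```python
import math

def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one example: negative log of the probability of the true class."""
    return -math.log(probabilities[label])

# Hypothetical softmax output over the 6 units; the values are made up.
probs = [0.01, 0.90, 0.05, 0.02, 0.01, 0.01]

loss_if_ham = sparse_categorical_crossentropy(1, probs)   # 'ham' is index 1
loss_if_spam = sparse_categorical_crossentropy(2, probs)  # 'spam' is index 2
print(round(loss_if_ham, 4), round(loss_if_spam, 4))
# prints: 0.1054 2.9957
```

A confident, correct prediction gives a loss near zero, while placing little probability on the true class is penalised heavily.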

# **● Fit the Model**

In [78]:
history = model.fit(train_padded, training_label_seq, epochs=15, validation_data=(validation_padded, validation_label_seq), verbose=2)

Epoch 1/15
140/140 - 10s - loss: 0.4353 - accuracy: 0.8981 - val_loss: 0.0660 - val_accuracy: 0.9848 - 10s/epoch - 68ms/step
Epoch 2/15
140/140 - 6s - loss: 0.0364 - accuracy: 0.9904 - val_loss: 0.0650 - val_accuracy: 0.9830 - 6s/epoch - 42ms/step
Epoch 3/15
140/140 - 6s - loss: 0.0118 - accuracy: 0.9982 - val_loss: 0.0361 - val_accuracy: 0.9901 - 6s/epoch - 42ms/step
Epoch 4/15
140/140 - 6s - loss: 0.0054 - accuracy: 0.9991 - val_loss: 0.0390 - val_accuracy: 0.9901 - 6s/epoch - 42ms/step
Epoch 5/15
140/140 - 6s - loss: 0.0038 - accuracy: 0.9991 - val_loss: 0.0448 - val_accuracy: 0.9892 - 6s/epoch - 43ms/step
Epoch 6/15
140/140 - 6s - loss: 0.0029 - accuracy: 0.9996 - val_loss: 0.0507 - val_accuracy: 0.9874 - 6s/epoch - 42ms/step
Epoch 7/15
140/140 - 6s - loss: 0.0025 - accuracy: 0.9989 - val_loss: 0.0552 - val_accuracy: 0.9892 - 6s/epoch - 42ms/step
Epoch 8/15
140/140 - 6s - loss: 0.0016 - accuracy: 0.9996 - val_loss: 0.0521 - val_accuracy: 0.9874 - 6s/epoch - 42ms/step
Epoch 9/15
140
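
In the epochs shown, validation loss is lowest at epoch 3 (0.0361) and then drifts upward while training loss keeps falling, which is a typical sign of overfitting. The sketch below applies a simple patience-based stopping rule to the logged values to show where training could have stopped. `stopping_epoch` is a hypothetical helper and the patience of 3 is an arbitrary choice. Keras offers the same idea through the `tf.keras.callbacks.EarlyStopping` callback, which can be passed to `model.fit` via `callbacks`.

```python
# Validation loss per epoch, copied from the training log (epochs 1 to 8).
val_loss = [0.0660, 0.0650, 0.0361, 0.0390, 0.0448, 0.0507, 0.0552, 0.0521]

def stopping_epoch(losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based, for a patience-based rule."""
    best_epoch, best_loss = 1, losses[0]
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(losses)

print(stopping_epoch(val_loss, patience=3))
# prints: (3, 6)
```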

# **● Save The Model**

In [79]:
model.save("spam.h5")



# **● Test The Model**

In [80]:
from tensorflow.keras.models import load_model
model = load_model("/content/spam.h5")

# Test 1

In [97]:
test = ["Nah I don't think he goes to usf, he lives around here though"]

seq = tokenizer.texts_to_sequences(test)
paddedSeq = pad_sequences(seq)
predicted_class = model.predict(paddedSeq)



In [98]:
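# The label tokenizer maps 'ham' to index 1 and 'spam' to index 2, so in practice
# only the entries at positions 1 and 2 of this list are ever selected.
labels = ['Not spam','Likely Not spam','Spam','Likely Spam','Cannot be determined']
labels[np.argmax(predicted_class)]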

'Likely Not spam'
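
Since the model was trained only on label indices 1 (`ham`) and 2 (`spam`), the prediction can also be decoded with a mapping that mirrors the label tokenizer. The sketch below is one way to do that, using a made-up probability row in place of `model.predict` output. `INDEX_TO_LABEL` and `decode_prediction` are hypothetical names; in the notebook the mapping could be read from `label_tokenizer.index_word` instead of being typed by hand.

```python
# Mapping implied by the label tokenizer: the more frequent label gets index 1.
INDEX_TO_LABEL = {1: "ham", 2: "spam"}

def decode_prediction(probabilities):
    """Return (label, confidence) for one softmax output row."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return INDEX_TO_LABEL.get(best, "unknown"), probabilities[best]

# Hypothetical model output for a single message; the numbers are made up.
example_row = [0.0, 0.02, 0.97, 0.01, 0.0, 0.0]
print(decode_prediction(example_row))
# prints: ('spam', 0.97)
```

Returning the confidence alongside the label makes it possible to flag borderline messages instead of forcing every prediction into one of the two classes.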

# Test 2

In [99]:
test = ["Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"]

seq = tokenizer.texts_to_sequences(test)
paddedSeq = pad_sequences(seq)
predicted_class = model.predict(paddedSeq)



In [100]:
labels[np.argmax(predicted_class)]

'Spam'

# Test 3

In [101]:
test = ["URGENT! You have won a 1 week FREE membership in our å£100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18"]

seq = tokenizer.texts_to_sequences(test)
paddedSeq = pad_sequences(seq)
predicted_class = model.predict(paddedSeq)



In [102]:
labels[np.argmax(predicted_class)]

'Spam'