# Spam Classification

The goal is to classify a given text message as 'spam' or 'ham' (not spam). This notebook was made using Google Colab.

In [0]:
# Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder

In [0]:
# Mount Google Drive
from google.colab import drive
from google.colab import files
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
# Set path to data and read CSV files
path = '/content/gdrive/My Drive/SpamProb/'
train = pd.read_csv(path+'train.csv', encoding='iso-8859-1')
test = pd.read_csv(path+'test.csv', encoding='iso-8859-1')

In [0]:
# Modifying the train DataFrame
train = train[['ID','v1','v2']]
train.index = train['ID']
train = train.drop('ID', axis = 1)
train

ID,v1,v2
1,ham,Call back !
2,ham,"Go until jurong point, crazy.. Available only ..."
3,ham,Ok lar... Joking wif u oni...
4,spam,Free entry in 2 a wkly comp to win FA Cup fina...
5,ham,U dun say so early hor... U c already then say...
6,ham,"Nah I don't think he goes to usf, he lives aro..."
7,spam,FreeMsg Hey there darling it's been 3 week's n...
8,ham,Even my brother is not like to speak with me. ...
9,ham,As per your request 'Melle Melle (Oru Minnamin...
10,spam,WINNER!! As a valued network customer you have...


In [0]:
# Encoding 'spam' and 'ham' as '1' and '0'
le = LabelEncoder()
train['v1'] = le.fit_transform(train['v1'])
train

ID,v1,v2
1,0,Call back !
2,0,"Go until jurong point, crazy.. Available only ..."
3,0,Ok lar... Joking wif u oni...
4,1,Free entry in 2 a wkly comp to win FA Cup fina...
5,0,U dun say so early hor... U c already then say...
6,0,"Nah I don't think he goes to usf, he lives aro..."
7,1,FreeMsg Hey there darling it's been 3 week's n...
8,0,Even my brother is not like to speak with me. ...
9,0,As per your request 'Melle Melle (Oru Minnamin...
10,1,WINNER!! As a valued network customer you have...
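
The mapping produced by the encoder can be checked on a toy example. The sketch below uses made-up labels and a separate `le_demo` object (both are illustrative, not part of the notebook's data). `LabelEncoder` sorts the classes, which is why 'ham' becomes 0 and 'spam' becomes 1, and `inverse_transform` is what later turns integer predictions back into strings.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["ham", "spam", "ham", "ham"]  # toy labels, not the real data

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, so index 0 is 'ham' and index 1 is 'spam'
print(list(le_demo.classes_))
print(encoded.tolist())

# inverse_transform maps integer codes back to the original strings
decoded = le_demo.inverse_transform(encoded)
print(decoded.tolist())
```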


In [0]:
# Train-test split on train DataFrame
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(train["v2"],train["v1"], test_size = 0.2, random_state = 10)

In [0]:
# CountVectorizer for text
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)
Xtraindf = vect.transform(X_train)
Xtestdf = vect.transform(X_test)
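
The vectorizer is fitted on the training text only, so words that appear only in the test text are ignored at transform time. A minimal sketch with made-up messages (the names `train_msgs`, `test_msgs` and `vect_demo` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_msgs = ["free entry to win", "call me back"]  # toy training messages
test_msgs = ["win a free phone"]                    # toy unseen message

vect_demo = CountVectorizer()
vect_demo.fit(train_msgs)  # the vocabulary comes from the training text only
train_counts = vect_demo.transform(train_msgs)
test_counts = vect_demo.transform(test_msgs)

print(sorted(vect_demo.vocabulary_))          # learned vocabulary
print(train_counts.shape, test_counts.shape)  # same number of columns
print(test_counts.toarray())                  # 'phone' was never seen, so it is not counted
```

Both matrices share the same columns, which is what lets a model trained on one be applied to the other.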

In [0]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(Xtraindf,y_train)
pred = model.predict(Xtestdf)
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
accuracy_score(y_test,pred)

0.9662100456621004
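
The cell above imports `confusion_matrix` and `classification_report` but only reports accuracy. When 'spam' is the minority class, accuracy can look high while spam is still being missed. The sketch below uses made-up labels (not the notebook's results) to show how the confusion matrix and the recall on the spam class expose this.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels: 8 ham (0) and 2 spam (1); one spam message is missed
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true_demo, y_pred_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class
spam_recall = recall_score(y_true_demo, y_pred_demo)  # share of spam that was caught

print(acc)
print(cm)
print(spam_recall)
```

On the real split, the same calls would take `y_test` and `pred` as arguments.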

In [0]:
# Modifying test DataFrame
test = test[['ID','v2']]
test.index = test['ID']
test = test.drop('ID', axis = 1)
test

ID,v2
1,Well obviously not because all the people in m...
2,Ok lor Ã_ reaching then message me.
3,Where's mummy's boy ? Is he being good or bad ...
4,Dhoni have luck to win some big title.so we wi...
5,Yes princess! I want to please you every night...
6,What Today-sunday..sunday is holiday..so no wo...
7,No probably &lt;#&gt; %.
8,Really do hope the work doesnt get stressful. ...
9,Have you seen who's back at Holby?!
10,Shall call now dear having food


In [0]:
# Making predictions using the model
testdf = vect.transform(test['v2'])
pred = model.predict(testdf)
test['v1'] = le.inverse_transform(pred)
test

ID,v2,v1
1,Well obviously not because all the people in m...,ham
2,Ok lor Ã_ reaching then message me.,ham
3,Where's mummy's boy ? Is he being good or bad ...,ham
4,Dhoni have luck to win some big title.so we wi...,ham
5,Yes princess! I want to please you every night...,ham
6,What Today-sunday..sunday is holiday..so no wo...,ham
7,No probably &lt;#&gt; %.,ham
8,Really do hope the work doesnt get stressful. ...,ham
9,Have you seen who's back at Holby?!,ham
10,Shall call now dear having food,ham


In [0]:
# Submission 1 CSV file
submit = test.drop('v2', axis = 1)
submit.to_csv(path+"submit1.csv")

In [0]:
# Viewing the submission
submit

ID,v1
1,ham
2,ham
3,ham
4,ham
5,ham
6,ham
7,ham
8,ham
9,ham
10,ham


In [0]:
# Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(Xtraindf,y_train)
prediction = dict()
prediction["Multinomial"] = model.predict(Xtestdf)
accuracy_score(y_test,prediction["Multinomial"])

0.9808219178082191

In [0]:
# Making predictions
pred = model.predict(testdf)
submit['v1'] = le.inverse_transform(pred)
submit

ID,v1
1,ham
2,ham
3,ham
4,spam
5,ham
6,ham
7,ham
8,ham
9,ham
10,ham


In [0]:
# Submission 2 CSV file
submit.to_csv(path+"submit2.csv")

In [0]:
# Adding length of message as a feature
train = train.reset_index(drop=True)  # replace the ID index with a 0-based one
train['len'] = train['v2'].str.len()
train

,v1,v2,len
0,0,Call back !,11
1,0,"Go until jurong point, crazy.. Available only ...",111
2,0,Ok lar... Joking wif u oni...,29
3,1,Free entry in 2 a wkly comp to win FA Cup fina...,155
4,0,U dun say so early hor... U c already then say...,49
5,0,"Nah I don't think he goes to usf, he lives aro...",61
6,1,FreeMsg Hey there darling it's been 3 week's n...,148
7,0,Even my brother is not like to speak with me. ...,77
8,0,As per your request 'Melle Melle (Oru Minnamin...,160
9,1,WINNER!! As a valued network customer you have...,158


In [0]:
# Adding number of Capital letters as a feature
train['capslen'] = train['v2'].str.count(r'[A-Z]')
train

,v1,v2,len,capslen
0,0,Call back !,11,1
1,0,"Go until jurong point, crazy.. Available only ...",111,3
2,0,Ok lar... Joking wif u oni...,29,2
3,1,Free entry in 2 a wkly comp to win FA Cup fina...,155,10
4,0,U dun say so early hor... U c already then say...,49,2
5,0,"Nah I don't think he goes to usf, he lives aro...",61,2
6,1,FreeMsg Hey there darling it's been 3 week's n...,148,7
7,0,Even my brother is not like to speak with me. ...,77,2
8,0,As per your request 'Melle Melle (Oru Minnamin...,160,10
9,1,WINNER!! As a valued network customer you have...,158,12
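
The two hand-made features can be verified on messages that appear in full in the table above. This is a small sketch on a throwaway DataFrame named `demo` (an illustrative name), using the vectorised string methods of pandas.

```python
import pandas as pd

# Two messages that are shown untruncated in the table above
demo = pd.DataFrame({"v2": ["Call back !", "Ok lar... Joking wif u oni..."]})

demo["len"] = demo["v2"].str.len()                # number of characters
demo["capslen"] = demo["v2"].str.count(r"[A-Z]")  # number of ASCII capital letters

print(demo)
```

Note that the pattern `[A-Z]` only counts unaccented ASCII capitals.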


In [0]:
# Train-test split
X_train,X_test,y_train,y_test = train_test_split(train[["v2","len","capslen"]],train["v1"], test_size = 0.2, random_state = 10)

In [0]:
# Using CountVectorizer for the text feature in X_train and X_test
v2_v = X_train['v2']
v2_vt = X_test['v2']
vect = CountVectorizer()
vect.fit(v2_v)
v2v = vect.transform(v2_v)
v2vt = vect.transform(v2_vt)
v2v1 = pd.DataFrame(v2v.todense())
v2vt1 = pd.DataFrame(v2vt.todense())
X_train = X_train.reset_index()
X_test = X_test.reset_index()
X_train = pd.concat([X_train,v2v1], axis = 1)
X_test = pd.concat([X_test,v2vt1], axis = 1)
X_train = X_train.drop(['index','v2'], axis = 1)
X_test = X_test.drop(['index','v2'], axis = 1)
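
Converting the count matrix to a dense DataFrame works here, but it stores every zero explicitly. One possible alternative, sketched below on toy data, is to keep everything sparse and stack the numeric columns next to the counts with `scipy.sparse.hstack`. The variable names and the two messages are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import CountVectorizer

msgs = ["FREE entry to win", "call me back"]  # toy messages
extra = np.array([[17, 4],                    # [len, capslen] of each toy message
                  [12, 0]])

counts = CountVectorizer().fit_transform(msgs)  # sparse word counts, shape (2, 7)

# Put the numeric columns first, then the word counts, without densifying
combined = hstack([csr_matrix(extra), counts]).tocsr()

print(combined.shape)
print(issparse(combined))
```

Estimators such as `MultinomialNB` accept a sparse matrix like `combined` directly.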

In [0]:
# Converting the DataFrames to NumPy arrays
features = np.asarray(X_train)
testX = np.asarray(X_test)

In [0]:
# Multinomial Naive Bayes
model = MultinomialNB()
model.fit(features,y_train)
pred = model.predict(testX)
accuracy_score(y_test,pred)

0.9671232876712329

In [0]:
# Viewing X_train
X_train

,len,capslen,0,1,2,3,4,5,6,7,...,7668,7669,7670,7671,7672,7673,7674,7675,7676,7677
0,48,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,23,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,153,6,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,23,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,57,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,29,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,177,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,50,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,154,6,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,158,20,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
# Modifying test DataFrame, adding features
testd = test.drop('v1', axis = 1).reset_index(drop=True)
testd['len'] = testd['v2'].str.len()
testd['capslen'] = testd['v2'].str.count(r'[A-Z]')

In [0]:
# Applying the CountVectorizer
v2_v = testd['v2']
v2v = vect.transform(v2_v)
v2v1 = pd.DataFrame(v2v.todense())
v2v1

,0,1,2,3,4,5,6,7,8,9,...,7668,7669,7670,7671,7672,7673,7674,7675,7676,7677
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
# Final test DataFrame
testd = testd.drop('v2', axis = 1)
testd = pd.concat([testd, v2v1], axis = 1)
testd

,len,capslen,0,1,2,3,4,5,6,7,...,7668,7669,7670,7671,7672,7673,7674,7675,7676,7677
0,79,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,36,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,122,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,54,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,74,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,50,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,25,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,61,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,35,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,31,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
# Making the features array
tests = np.asarray(testd)

In [0]:
# Making predictions and Submission 3 CSV file
pred = model.predict(tests)
submit['v1'] = le.inverse_transform(pred)
submit.to_csv(path+"submit3.csv")

In [0]:
# Multi-layer Perceptron Classifier
from sklearn.neural_network import MLPClassifier
model = MLPClassifier()
model.fit(features,y_train)
pred = model.predict(testX)
accuracy_score(y_test,pred)

0.982648401826484

In [0]:
# Making predictions and Submission 4 CSV file
pred = model.predict(tests)
submit['v1'] = le.inverse_transform(pred)
submit.to_csv(path+"submit4.csv")
submit

ID,v1
1,ham
2,ham
3,ham
4,ham
5,ham
6,ham
7,ham
8,ham
9,ham
10,ham


In [0]:
# Importing packages for Neural Networks - Keras
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

In [0]:
# DataFrame train1 without 'len' and 'capslen' 
train1 = train.drop(['len','capslen'], axis = 1)

In [0]:
# Train-test split (text only, using train1)
X_train,X_test,y_train,y_test = train_test_split(train1["v2"],train1["v1"], test_size = 0.2, random_state = 10)

In [0]:
# Tokenizer for text
max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X_train)
sequences = tok.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
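
With its default arguments, `pad_sequences` pads with zeros on the left and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The NumPy sketch below re-implements only that default behaviour for illustration. `pad_pre` is a hypothetical helper, not a Keras function, and the token ids are made up.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep the last `maxlen` ids of each sequence."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, seq in enumerate(seqs):
        kept = seq[-maxlen:]  # drop the oldest tokens if the sequence is too long
        if kept:
            out[i, maxlen - len(kept):] = kept  # right-align, zeros stay on the left
    return out

padded = pad_pre([[5, 2, 9], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Every row ends up with the same length, which is what the `Input` layer of shape `[max_len]` expects.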

In [0]:
# Defining the RNN
def RNN():
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

In [0]:
# RNN
model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          (None, 150)               0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 150, 50)           50000     
_________________________________________________________________
lstm_5 (LSTM)                (None, 64)                29440     
_________________________________________________________________
FC1 (Dense)                  (None, 256)               16640     
_________________________________________________________________
activation_9 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 257       
__________

In [0]:
# Training the RNN
model.fit(sequences_matrix,y_train,batch_size=128,epochs=10,
          validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])

Train on 3504 samples, validate on 876 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10


<keras.callbacks.History at 0x7f75f4449860>

In [0]:
# Applying the Tokenizer to test data
test_sequences = tok.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)

In [0]:
# Evaluating loss and accuracy on the held-out split
accr = model.evaluate(test_sequences_matrix,y_test)



In [0]:
# Printing loss and accuracy
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.074
  Accuracy: 0.977


In [0]:
# Viewing test DataFrame
test

ID,v2,v1
1,Well obviously not because all the people in m...,ham
2,Ok lor Ã_ reaching then message me.,ham
3,Where's mummy's boy ? Is he being good or bad ...,ham
4,Dhoni have luck to win some big title.so we wi...,ham
5,Yes princess! I want to please you every night...,ham
6,What Today-sunday..sunday is holiday..so no wo...,ham
7,No probably &lt;#&gt; %.,ham
8,Really do hope the work doesnt get stressful. ...,ham
9,Have you seen who's back at Holby?!,ham
10,Shall call now dear having food,ham


In [0]:
# Making predictions for the submission test data
test_sequences = tok.texts_to_sequences(test['v2'])
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
pred = model.predict(test_sequences_matrix)

In [0]:
# Round to 0 or 1 and flatten to a 1-D NumPy array of integers
pred = np.around(pred)
pred = pred.reshape(-1).astype(int)  # -1 infers the length instead of hard-coding 99
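
The sigmoid output layer returns one probability per message in an array of shape `(n, 1)`. A minimal sketch of turning such an array into flat integer labels, using made-up probabilities and an explicit 0.5 threshold (an assumption that matches rounding for these values):

```python
import numpy as np

# Made-up sigmoid outputs, one row per message
probs = np.array([[0.02], [0.97], [0.40], [0.61]])

# Threshold at 0.5, cast to integers and flatten to shape (n,)
labels_demo = (probs > 0.5).astype(int).ravel()

print(labels_demo)
```

An explicit threshold also makes it easy to trade off missed spam against false alarms by picking a value other than 0.5.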

In [0]:
# Making submission 5 CSV file
submit['v1'] = le.inverse_transform(pred)
submit.to_csv(path+"submit5.csv")
submit

ID,v1
1,ham
2,ham
3,ham
4,ham
5,ham
6,ham
7,ham
8,ham
9,ham
10,ham


With the RNN, an accuracy of 98.99% was achieved on the submission data (98 of 99 messages correct). The message with ID 69 was the only one misclassified: it was predicted as 'ham' but is actually 'spam'. This could probably be rectified by adding further features, or by feeding the 'len' and 'capslen' features into the RNN as well.