## SPAM HAM MESSAGE DETECTOR USING NLP AND ML
This notebook takes a text message as input and runs a trained model to tell whether the message is spam or ham (a legitimate message).

#### IMPORT LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import nltk

In [2]:
data='Hello this is hitin yadav. Chill guys'

#### CHECKING HOW THE WORD TOKENIZER AND SENTENCE TOKENIZER WORK

In [3]:
nltk.word_tokenize(data)

['Hello', 'this', 'is', 'hitin', 'yadav', '.', 'Chill', 'guys']

In [4]:
nltk.sent_tokenize(data)

['Hello this is hitin yadav.', 'Chill guys']

#### READING DATASET & RENAMING COLUMNS

In [5]:
df=pd.read_csv("SMSSpamCollection.tsv",sep='\t',header=None)
df.columns=['target','phrase']

In [6]:
df.head()

Unnamed: 0,target,phrase
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#### CHECKING ANY NULL VALUES IN DATASET

In [7]:
df.isnull().sum()

target    0
phrase    0
dtype: int64

#### APPLYING WORD TOKENIZER TO PHRASE COLUMN
A tokenizer divides a string of written language into its component words and punctuation marks.

In [13]:
'''for i in df.phrase:
    print(nltk.word_tokenize(i))'''

'for i in df.phrase:\n    print(nltk.word_tokenize(i))'

In [15]:
l=[]
for i in df.phrase:
    l.append(nltk.word_tokenize(i))

In [16]:
l[0]

['Go',
 'until',
 'jurong',
 'point',
 ',',
 'crazy',
 '..',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 '...',
 'Cine',
 'there',
 'got',
 'amore',
 'wat',
 '...']
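To make the idea of tokenization concrete, here is a minimal sketch that approximates word tokenization with a regular expression. It is not NLTK's algorithm: `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names, and contractions such as "don't" would be split differently than `nltk.word_tokenize` splits them.

```python
import re

# Rough stand-in for a word tokenizer: a run of word characters is one token,
# and a run of punctuation characters is one token. Whitespace is discarded.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Ok lar... Joking wif u oni...")
print(tokens)
# ['Ok', 'lar', '...', 'Joking', 'wif', 'u', 'oni', '...']
```

For this message the result happens to agree with the tokens NLTK produced, which shows why punctuation runs such as `...` end up as tokens of their own.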

#### CREATING A NEW COLUMN TOKEN TO SAVE THE TOKENIZED VALUES OF THE PHRASE COLUMN, WHICH ARE STORED IN LIST ( l )

In [17]:
df['token']=l

In [18]:
df.head()

Unnamed: 0,target,phrase,token
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,..."


#### IMPORTING PORTER STEMMER
Stemming and lemmatization are text normalization (sometimes called word normalization) techniques: they reduce the inflected forms of a word to a common base form.

For example, searching for fish on Google can also return results for fishes and fishing, because fish is the stem of both words.

In [19]:
from nltk.stem import PorterStemmer

In [20]:
ps=PorterStemmer()

In [21]:
ps.stem('doing')

'do'
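The idea behind stemming can be illustrated with a deliberately naive suffix stripper. This is only a toy, not the Porter algorithm (which applies many ordered rewrite rules); `naive_stem` and its suffix list are hypothetical.

```python
def naive_stem(word, suffixes=("ing", "es", "s")):
    """Lowercase a single word and strip the first matching suffix."""
    word = word.lower()
    for suffix in suffixes:
        # Leave at least two characters so very short words are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["fish", "fishes", "fishing", "doing"]]
print(stems)
# ['fish', 'fish', 'fish', 'do']
```

Like a real stemmer, the function works on one word at a time, so it has to be applied to individual tokens rather than to a whole sentence.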

In [22]:
df.token[:6]

0    [Go, until, jurong, point, ,, crazy, .., Avail...
1             [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, ..., U, c, alrea...
4    [Nah, I, do, n't, think, he, goes, to, usf, ,,...
5    [FreeMsg, Hey, there, darling, it, 's, been, 3...
Name: token, dtype: object

In [23]:
df.phrase[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [24]:
ps.stem(df.phrase[0])

'go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...'

In [25]:
l=[]
for i in df.token:
    l1=[]
    for j in i:
        l1.append(ps.stem(j))
    l.append(l1)

In [27]:
l[0]

['go',
 'until',
 'jurong',
 'point',
 ',',
 'crazi',
 '..',
 'avail',
 'onli',
 'in',
 'bugi',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 '...',
 'cine',
 'there',
 'got',
 'amor',
 'wat',
 '...']

In [28]:
df.shape

(5572, 3)

#### CREATING A NEW COLUMN STEM TO SAVE THE STEMMED WORDS OF THE TOKEN COLUMN, WHICH ARE STORED IN LIST ( l )

In [29]:
df['stem']=l

In [30]:
df.head()

Unnamed: 0,target,phrase,token,stem
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail...","[go, until, jurong, point, ,, crazi, .., avail..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entri, in, 2, a, wkli, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,...","[nah, i, do, n't, think, he, goe, to, usf, ,, ..."


#### IMPORTING WORD NET LEMMATIZER
Lemmatization groups together the different inflected forms of a word. It is similar to stemming in that it maps several words onto one common root, but it aims to return a valid dictionary word (the lemma).

In [31]:
from nltk.stem import WordNetLemmatizer

In [32]:
wl=WordNetLemmatizer()

In [33]:
wl.lemmatize('was',pos='v')

'be'

In [34]:
l=[]
for i in df.stem:
    l2=[]
    for j in i:
        l2.append(wl.lemmatize(j,pos='v'))   # pos is the part-of-speech tag; 'v' treats every token as a verb
    l.append(l2)

In [37]:
df['lem']=l

In [38]:
df.head()

Unnamed: 0,target,phrase,token,stem,lem
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail...","[go, until, jurong, point, ,, crazi, .., avail...","[go, until, jurong, point, ,, crazi, .., avail..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,...","[nah, i, do, n't, think, he, goe, to, usf, ,, ...","[nah, i, do, n't, think, he, goe, to, usf, ,, ..."


#### IMPORTING PUNCTUATION FROM STRING
We will remove all punctuation marks from the dataset so that tokens which carry no meaning are dropped from our data.

In [39]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [40]:
# Build a list so that more items to be removed from the data,
# other than punctuation, can be appended later.

a=[]
for i in string.punctuation:
    a.append(i)

In [41]:
a[:5]

['!', '"', '#', '$', '%']

In [42]:
l=[]
for i in df.lem:
    l2=[]
    for j in i:
        if j not in a:
            l2.append(j)
    l.append(l2)

In [43]:
df['punc']=l

In [44]:
df.head()

Unnamed: 0,target,phrase,token,stem,lem,punc
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail...","[go, until, jurong, point, ,, crazi, .., avail...","[go, until, jurong, point, ,, crazi, .., avail...","[go, until, jurong, point, crazi, .., avail, o..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,...","[nah, i, do, n't, think, he, goe, to, usf, ,, ...","[nah, i, do, n't, think, he, goe, to, usf, ,, ...","[nah, i, do, n't, think, he, goe, to, usf, he,..."
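The punc column above still contains tokens such as `..` and `...`, because a membership test against single punctuation characters only matches one-character tokens. A minimal sketch of a stricter filter follows; `is_punct_only` is a hypothetical helper and the token list is a shortened sample.

```python
import string

punct = set(string.punctuation)
tokens = ["go", "until", "point", ",", "crazi", "..", "avail", "...", "n't"]

# Membership test on single characters, as in the loop above:
# multi-character runs like '..' are not in the set, so they survive.
single_char_filter = [t for t in tokens if t not in punct]


def is_punct_only(token):
    """True when every character of the token is a punctuation mark."""
    return all(ch in punct for ch in token)


# Stricter filter: drop any token made up entirely of punctuation.
strict_filter = [t for t in tokens if not is_punct_only(t)]

print(single_char_filter)
print(strict_filter)
```

Tokens that mix letters and punctuation, such as `n't`, are kept by both filters.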


#### IMPORTING STOP WORDS
Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise. That’s why we want to remove these irrelevant words.


In [45]:
from nltk.corpus import stopwords

In [48]:
b=stopwords.words('english')
b[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [49]:
l=[]
for i in df.punc:
    l2=[]
    for j in i:
        if j not in b:
            l2.append(j)
    l.append(l2)

In [50]:
df['stopword']=l

In [51]:
df.head()

Unnamed: 0,target,phrase,token,stem,lem,punc,stopword
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail...","[go, until, jurong, point, ,, crazi, .., avail...","[go, until, jurong, point, ,, crazi, .., avail...","[go, until, jurong, point, crazi, .., avail, o...","[go, jurong, point, crazi, .., avail, onli, bu..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, 2, wkli, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, earli, hor, ..., u, c, alreadi, ..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,...","[nah, i, do, n't, think, he, goe, to, usf, ,, ...","[nah, i, do, n't, think, he, goe, to, usf, ,, ...","[nah, i, do, n't, think, he, goe, to, usf, he,...","[nah, n't, think, goe, usf, live, around, though]"


In [52]:
df.stopword.head()

0    [go, jurong, point, crazi, .., avail, onli, bu...
1               [ok, lar, ..., joke, wif, u, oni, ...]
2    [free, entri, 2, wkli, comp, win, fa, cup, fin...
3    [u, dun, say, earli, hor, ..., u, c, alreadi, ...
4    [nah, n't, think, goe, usf, live, around, though]
Name: stopword, dtype: object
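One detail visible in the output: `onli` (the stem of "only") is still present, because the stop-word list holds unstemmed words and the filter runs after stemming. The sketch below illustrates the effect of the ordering with a small hypothetical stop-word set; the stems are copied from the notebook's output with punctuation left out.

```python
# Hypothetical subset of an English stop-word list, for illustration only.
# A set gives constant-time membership tests, unlike a list.
stop_words = {"until", "only", "in", "there"}

# Lowercased tokens before stemming, and the corresponding stems.
lowered = ["go", "until", "jurong", "point", "crazy", "available", "only", "in", "bugis"]
stemmed = ["go", "until", "jurong", "point", "crazi", "avail", "onli", "in", "bugi"]

# Filtering after stemming: "onli" no longer matches the stop word "only".
after_stemming = [t for t in stemmed if t not in stop_words]

# Filtering before stemming: each stop word is still in its dictionary form.
before_stemming = [t for t in lowered if t not in stop_words]

print(after_stemming)
print(before_stemming)
```

Whether the extra tokens matter for the classifier is an empirical question; the sketch only shows why they remain.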

#### USING JOIN FUNCTION
We can see that the stopword data is in list form, and we want each message back as a single string. So we can use the join function to join the tokens of each list with a space.

In [53]:
''.join(['a','b','c'])

'abc'

In [54]:
l=[]
for i in df.stopword:
    l.append(" ".join(i))

In [60]:
l[:5]

['go jurong point crazi .. avail onli bugi n great world la e buffet ... cine get amor wat ...',
 'ok lar ... joke wif u oni ...',
 "free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri question std txt rate c 's appli 08452810075over18 's",
 'u dun say earli hor ... u c alreadi say ...',
 "nah n't think goe usf live around though"]

In [61]:
df['ready']=l

In [62]:
df.head()

Unnamed: 0,target,phrase,token,stem,lem,punc,stopword,ready
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail...","[go, until, jurong, point, ,, crazi, .., avail...","[go, until, jurong, point, ,, crazi, .., avail...","[go, until, jurong, point, crazi, .., avail, o...","[go, jurong, point, crazi, .., avail, onli, bu...",go jurong point crazi .. avail onli bugi n gre...
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]","[ok, lar, ..., joke, wif, u, oni, ...]",ok lar ... joke wif u oni ...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entri, 2, wkli, comp, win, fa, cup, fin...",free entri 2 wkli comp win fa cup final tkt 21...
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, so, earli, hor, ..., u, c, alrea...","[u, dun, say, earli, hor, ..., u, c, alreadi, ...",u dun say earli hor ... u c alreadi say ...
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,...","[nah, i, do, n't, think, he, goe, to, usf, ,, ...","[nah, i, do, n't, think, he, goe, to, usf, ,, ...","[nah, i, do, n't, think, he, goe, to, usf, he,...","[nah, n't, think, goe, usf, live, around, though]",nah n't think goe usf live around though


#### Count Vectorizer
Now we need to import the count vectorizer so that we can count how often each word occurs in each message of the dataset.

In [63]:
from sklearn.feature_extraction.text import CountVectorizer

In [64]:
cv= CountVectorizer()

We fit the count vectorizer on the ready column; this creates a sparse matrix of word counts.

In [65]:
c=cv.fit_transform(df.ready)
c

<5572x7339 sparse matrix of type '<class 'numpy.int64'>'
	with 49125 stored elements in Compressed Sparse Row format>

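A sparse matrix stores only its non-zero entries, which saves memory because most words do not occur in most messages. The entries are word counts, not just 0 and 1. Here we are converting our matrix to a dense array.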

In [66]:
c.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [67]:
c.toarray().shape

(5572, 7339)
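What the count matrix contains can be sketched in plain Python. This is an illustration of the bag-of-words idea, not scikit-learn's implementation: the three documents are made up, and splitting on whitespace skips CountVectorizer's own tokenization rules.

```python
from collections import Counter

docs = ["free entri free prize", "ok lar joke", "free call call call"]

# Vocabulary: sorted unique words, one matrix column per word.
vocab = sorted({word for doc in docs for word in doc.split()})
column = {word: col for col, word in enumerate(vocab)}

# Dense count matrix: one row per document, entries are word frequencies.
dense = [[0] * len(vocab) for _ in docs]
for r, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        dense[r][column[word]] = count

# Sparse view: keep only the non-zero entries as (row, column) -> count.
sparse = {(r, c): v for r, counts in enumerate(dense) for c, v in enumerate(counts) if v}

print(vocab)
print(dense)
print(len(sparse), "stored elements out of", len(docs) * len(vocab))
```

Entries greater than 1 appear whenever a word is repeated within a message, and the sparse view holds 8 of the 21 cells.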

In [69]:
# get_feature_names returns the list of all unique words in the dataset, one per column
# (newer scikit-learn versions replace it with get_feature_names_out)

cv.get_feature_names()[:10] # only checking first 10 names

['00',
 '000',
 '000pe',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02']

Defining x and y value 

In [73]:
x=c.toarray()
y=df['target']
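Calling `toarray()` materializes every zero. A back-of-the-envelope estimate of the cost, using the shape and stored-element count reported for the matrix, is sketched below. The byte sizes are assumptions: 8-byte integer values and 4-byte CSR index arrays.

```python
n_rows, n_cols, n_stored = 5572, 7339, 49125  # from the sparse matrix output
bytes_per_value = 8                           # int64 entries
bytes_per_index = 4                           # assumed 32-bit CSR indices

# Dense array: one value for every cell, zero or not.
dense_bytes = n_rows * n_cols * bytes_per_value

# CSR: one value and one column index per stored element,
# plus a row-pointer array with n_rows + 1 entries.
sparse_bytes = n_stored * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index

density = n_stored / (n_rows * n_cols)

print(f"dense:   {dense_bytes / 2**20:.0f} MiB")
print(f"sparse:  {sparse_bytes / 2**20:.2f} MiB")
print(f"density: {density:.2%}")
```

scikit-learn's naive Bayes classifiers and `train_test_split` accept sparse matrices, so passing `c` directly instead of `c.toarray()` is likely to work and would avoid the dense copy.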

Dividing the data into training and testing set

In [74]:
from sklearn.model_selection import train_test_split

In [75]:
xtrain,xtest,ytrain,ytest= train_test_split(x,y,test_size=0.30,random_state=100)

#### IMPORTING CLASSIFICATION ALGO NAIVE BAYES 
We try the multinomial and Bernoulli naive Bayes algorithms and check which performs better.
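The two variants differ in what they read from the count matrix. A minimal sketch with a made-up count vector: multinomial naive Bayes uses the counts themselves, while Bernoulli naive Bayes models only whether a word is present (scikit-learn's `BernoulliNB` binarizes its input with `binarize=0.0` by default).

```python
counts = [0, 3, 1, 0, 2]  # hypothetical word counts for one message

# Multinomial NB: the features are the counts as they are.
multinomial_features = counts

# Bernoulli NB: any count above the threshold 0.0 becomes 1 (word present).
bernoulli_features = [1 if c > 0.0 else 0 for c in counts]

print(multinomial_features)
print(bernoulli_features)
# [0, 3, 1, 0, 2]
# [0, 1, 1, 0, 1]
```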

In [76]:
#from sklearn.naive_bayes import MultinomialNB

In [77]:
#ml=MultinomialNB()

In [78]:
#ml.fit(xtrain,ytrain)

MultinomialNB()

In [79]:
#ml.score(xtest,ytest)

0.9766746411483254

In [81]:
#ypred= ml.predict(xtest)
#ypred

array(['spam', 'ham', 'spam', ..., 'ham', 'ham', 'ham'], dtype='<U4')

In [92]:
#from sklearn.metrics import confusion_matrix,f1_score,accuracy_score

In [94]:
#confusion_matrix(ytest,ypred)

array([[1423,   25],
       [  14,  210]], dtype=int64)

In [95]:
#accuracy_score(ytest,ypred)

0.9766746411483254

In [97]:
#f1_score(ytest,ypred,pos_label='spam')

0.9150326797385621

In [98]:
#f1_score(ytest,ypred,pos_label='ham')

0.9864818024263432

Fitting model using Bernoulli naive bayes and checking score

In [99]:
from sklearn.naive_bayes import BernoulliNB

In [100]:
bn=BernoulliNB()

In [101]:
bn.fit(xtrain,ytrain)

BernoulliNB()

In [102]:
bn.score(xtest,ytest)

0.9778708133971292

In [103]:
ypred1=bn.predict(xtest)
ypred1

array(['ham', 'ham', 'spam', ..., 'ham', 'ham', 'ham'], dtype='<U4')

In [104]:
from sklearn.metrics import confusion_matrix, f1_score,accuracy_score
confusion_matrix(ytest,ypred1)

array([[1443,    5],
       [  32,  192]], dtype=int64)

In [105]:
accuracy_score(ytest,ypred1)

0.9778708133971292

In [106]:
f1_score(ytest,ypred1,pos_label='spam')

0.9121140142517815

In [107]:
f1_score(ytest,ypred1,pos_label='ham')

0.9873417721518988
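The scores above can be recomputed by hand from the confusion matrix, which makes clear what each one measures. The sketch assumes scikit-learn's convention of rows for true labels and columns for predicted labels, in sorted label order (ham, then spam); `f1_from_counts` is a hypothetical helper.

```python
# Confusion matrix reported for BernoulliNB.
cm = [[1443, 5],
      [32, 192]]


def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# With spam as the positive class:
tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_spam = f1_from_counts(tp, fp, fn)
# With ham as the positive class the roles of the cells swap.
f1_ham = f1_from_counts(tp=tn, fp=fn, fn=fp)

print(accuracy, f1_spam, f1_ham)
```

The three values agree with `accuracy_score` and the two `f1_score` calls above up to floating-point rounding.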

Fitting model using SVM and checking score

In [108]:
#from sklearn.svm import SVC

In [109]:
#sc=SVC()

In [110]:
#sc.fit(xtrain,ytrain)

SVC()

In [111]:
#sc.score(xtest,ytest)

0.979066985645933

In [112]:
#ypred2=sc.predict(xtest)
#ypred2

array(['ham', 'ham', 'spam', ..., 'ham', 'ham', 'ham'], dtype=object)

In [113]:
#confusion_matrix(ytest,ypred2)

array([[1446,    2],
       [  33,  191]], dtype=int64)

In [114]:
#accuracy_score(ytest,ypred2)

0.979066985645933

In [115]:
#f1_score(ytest,ypred2,pos_label='spam')

0.9160671462829736

In [116]:
#f1_score(ytest,ypred2,pos_label='ham')

0.9880423641954219

#### OUTCOME: NAIVE BAYES AND SVM GIVE SIMILAR SCORES, BUT SVM TAKES MORE TIME TO PROCESS THAN NAIVE BAYES, SO WE WILL CONTINUE USING THE NAIVE BAYES ALGORITHM
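Accuracy alone hides how the three models differ. As a rough comparison, the sketch below derives spam precision and recall from the three confusion matrices reported in this notebook, again assuming rows are true labels and columns are predictions in the order ham, spam.

```python
matrices = {
    "MultinomialNB": [[1423, 25], [14, 210]],
    "BernoulliNB": [[1443, 5], [32, 192]],
    "SVC": [[1446, 2], [33, 191]],
}

results = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    results[name] = (precision, recall)
    print(f"{name:14s} precision={precision:.3f} recall={recall:.3f} "
          f"missed spam={fn} ham flagged={fp}")
```

On this split MultinomialNB misses the fewest spam messages (14) but flags the most ham (25), while BernoulliNB and SVC flag very little ham and miss more spam. Which trade-off is preferable depends on the relative cost of the two kinds of error.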

In [117]:
data= 'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575.'

In [118]:
cv.transform([data])

<1x7339 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [119]:
pred= cv.transform([data]).toarray()
pred

array([[0, 1, 0, ..., 0, 0, 0]], dtype=int64)

In [120]:
bn.predict(pred)

array(['spam'], dtype='<U4')

### CONCLUSION: We can see that our model predicted the message as spam
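Note that the raw message is passed straight to `cv.transform`, while the vectorizer was fitted on stemmed, lemmatized and stop-word-filtered text. CountVectorizer lowercases and tokenizes on its own, so the main gap is probably the stemming and stop-word steps. A minimal sketch of a reusable preprocessing function follows; `make_preprocessor` is a hypothetical helper, and the callables passed in are simple stand-ins so that the example runs without NLTK data.

```python
import string


def make_preprocessor(tokenize, stem, lemmatize, punctuation, stop_words):
    """Return a function that applies the training-time steps to one message."""
    def preprocess(text):
        tokens = [lemmatize(stem(tok)) for tok in tokenize(text)]
        kept = [t for t in tokens if t not in punctuation and t not in stop_words]
        return " ".join(kept)
    return preprocess


# Stand-ins for illustration. In the notebook these would be nltk.word_tokenize,
# ps.stem, lambda w: wl.lemmatize(w, pos='v'), the list a and the list b.
preprocess = make_preprocessor(
    tokenize=str.split,
    stem=str.lower,
    lemmatize=lambda w: w,
    punctuation=set(string.punctuation),
    stop_words={"to", "and", "from"},
)

cleaned = preprocess("SIX chances to win CASH !")
print(cleaned)
# six chances win cash
```

The cleaned string could then be vectorized with `cv.transform([cleaned])` before calling `bn.predict`, so that new messages go through the same steps as the training data.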

Let's check our model once again on different data.

In [121]:
data2= '''You get a lot of unwanted emails, such as subscriptions or promotional offers. A hacker tries to fill up your Inbox so that you can't find important security alerts from websites or services you signed up for with your Gmail account.

For example, if a hacker tries to get into your bank account, your bank can notify you by email. But if your Inbox is full of junk mail, you might miss the bank’s alert.'''

In [122]:
data2

"You get a lot of unwanted emails, such as subscriptions or promotional offers. A hacker tries to fill up your Inbox so that you can't find important security alerts from websites or services you signed up for with your Gmail account.\n\nFor example, if a hacker tries to get into your bank account, your bank can notify you by email. But if your Inbox is full of junk mail, you might miss the bank’s alert."

In [123]:
cv.transform([data2])

<1x7339 sparse matrix of type '<class 'numpy.int64'>'
	with 28 stored elements in Compressed Sparse Row format>

In [124]:
pred1= cv.transform([data2]).toarray()
pred1

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [125]:
bn.predict(pred1)

array(['ham'], dtype='<U4')

### CONCLUSION: Similarly, we can see that the model predicted this message as ham

                                   THANK YOU