The dataset I am experimenting with is the spam.csv dataset from one of the Kaggle competitions.

In [2]:
import pandas as pd
import numpy as np

In [3]:
msg = pd.read_csv('spam.csv', encoding = 'latin-1')

msg.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


There are 2 labels present: ham (legitimate) and spam.

In [7]:
msg.describe()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""",GE,"GNT:-)"""
freq,4825,30,3,2,2


The columns Unnamed: 2, Unnamed: 3 and Unnamed: 4 have very few non-null entries compared to the other two (50, 12 and 6 out of 5572 rows), so most of their values are NaN. We will remove them.
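The share of missing values per column can also be checked directly instead of reading it off the counts. Below is a minimal sketch on a small made-up frame (the rows and the 0.5 threshold are my assumptions, only the column names come from spam.csv):

```python
import pandas as pd

# Toy frame standing in for spam.csv (made-up rows, same column names)
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["ok lar", "free entry", "see you", "call me"],
    "Unnamed: 2": [None, None, "extra", None],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)

# Keep only the columns that are less than half empty (threshold is an assumption)
keep = missing_share[missing_share < 0.5].index.tolist()
print(keep)
```

On the real data the same two lines would be applied to msg rather than toy.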

In [8]:
msg = msg[['v1','v2']].copy()  # copy so later column assignments act on an independent frame

msg.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Next, we need to encode the "ham" and "spam" labels numerically. Let's set "ham" = 0 and "spam" = 1.

In [12]:
# Map both labels in one assignment; chained indexing (msg['v1'][mask] = 0) may write to a copy
msg['v1'] = msg['v1'].map({'ham': 0, 'spam': 1})
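One thing worth guarding against when encoding labels with a dictionary is a label that the dictionary does not cover, since pandas turns it into NaN silently. A small sketch of such a check, using made-up labels (the misspelled "Spam" is deliberate and not from the dataset):

```python
import pandas as pd

# Made-up labels; "Spam" with a capital S is a deliberate mismatch
labels = pd.Series(["ham", "spam", "ham", "Spam"])

# Series.map leaves NaN wherever the key is missing from the dictionary
encoded = labels.map({"ham": 0, "spam": 1})

# Collect the original labels that were not mapped
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here means every label was encoded.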

First, we convert all upper case letters to lower case so that the same word is always treated as the same token, regardless of capitalization.

In [15]:
def lower(x):
    return x.lower()

In [16]:
msg['v2'] = msg['v2'].apply(lower)
msg.head()

I will make use of NLTK for further processing of the messages. Next, we need to tokenize each message, that is, split it into distinct pieces (tokens). The function below also discards single-character tokens, which removes most punctuation; stop words are removed in a later step.

In [20]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tring\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [21]:
from nltk.tokenize import word_tokenize

In [25]:
def tokenize(x):
    comps = word_tokenize(x)
    comps = [i for i in comps if len(i) > 1]
    return comps

In [26]:
msg['v2'] = msg['v2'].apply(tokenize)
msg.head()

Unnamed: 0,v1,v2
0,0,"[go, until, jurong, point, crazy.., available,..."
1,0,"[ok, lar, ..., joking, wif, oni, ...]"
2,1,"[free, entry, in, wkly, comp, to, win, fa, cup..."
3,0,"[dun, say, so, early, hor, ..., already, then,..."
4,0,"[nah, do, n't, think, he, goes, to, usf, he, l..."
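The output above still contains tokens such as "...", because the length filter only drops tokens of one character. If that is not wanted, one option is to also require at least one letter or digit in each token. This is a sketch of that idea on a hand-written token list (the list and the stricter rule are my assumptions, not part of the pipeline above):

```python
# Hand-written token list for illustration
tokens = ["ok", "lar", "...", "joking", "wif", "u", "oni", "..."]

# Filter used in the tokenize function: drop single-character tokens only
by_length = [t for t in tokens if len(t) > 1]

# Stricter variant: additionally require at least one alphanumeric character
by_alnum = [t for t in tokens if len(t) > 1 and any(c.isalnum() for c in t)]

print(by_length)
print(by_alnum)
```

The stricter variant keeps contractions like n't, since they contain letters, and drops tokens made of punctuation only.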


Next, we need to remove stop words: very frequent words, such as articles, that carry little information for classification.

In [29]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tring\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [30]:
eng_sw = set(stopwords.words('english')) #set of all stop words in English (a set makes membership tests fast)

In [32]:
def stopword(x):
    x = [i for i in x if i not in eng_sw ]
    return x

In [33]:
msg['v2'] = msg['v2'].apply(stopword)
msg.head()

Unnamed: 0,v1,v2
0,0,"[go, jurong, point, crazy.., available, bugis,..."
1,0,"[ok, lar, ..., joking, wif, oni, ...]"
2,1,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,0,"[dun, say, early, hor, ..., already, say, ...]"
4,0,"[nah, n't, think, goes, usf, lives, around, th..."
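The filtering step itself is just a membership test per token. Here is a self-contained sketch with a tiny hand-picked stop word set (hypothetical, far smaller than the NLTK list) and a made-up token list:

```python
# Tiny hand-picked stop word set, for illustration only
toy_stopwords = {"in", "a", "to", "the"}

# Made-up token list
tokens = ["free", "entry", "in", "a", "weekly", "comp", "to", "win"]

# Keep every token that is not a stop word; set lookup is O(1) on average
filtered = [t for t in tokens if t not in toy_stopwords]
print(filtered)
```

Storing the stop words in a set rather than a list matters more as the number of messages grows, because the test runs once per token.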


Next, I will apply a technique called lemmatization to reduce noise in the text by transforming words to their base form. For example, runs, running and ran are all forms of the word run, so run is the lemma of all these words.

In [40]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tring\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [41]:
wnl = WordNetLemmatizer()  # create the lemmatizer once and reuse it

def lemmatizer(x):
    # pos="v" treats every token as a verb
    return [wnl.lemmatize(i, pos="v") for i in x]

In [42]:
msg['v2'] = msg['v2'].apply(lemmatizer)
msg.head()

Unnamed: 0,v1,v2
0,0,"[go, jurong, point, crazy.., available, bugis,..."
1,0,"[ok, lar, ..., joke, wif, oni, ...]"
2,1,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,0,"[dun, say, early, hor, ..., already, say, ...]"
4,0,"[nah, n't, think, go, usf, live, around, though]"
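To make the idea of a lemma concrete without WordNet, here is a toy sketch that uses a hand-written lookup table. The table and the function name are hypothetical; the real lemmatizer relies on WordNet's dictionary and rules instead:

```python
# Hypothetical hand-written lookup table; WordNet covers far more words
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def toy_lemmatize(tokens):
    # Fall back to the token itself when it is not in the table
    return [LEMMAS.get(t, t) for t in tokens]

result = toy_lemmatize(["ran", "running", "runs", "usf"])
print(result)
```

Unknown tokens such as usf pass through unchanged, which is also what we see in the output above.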


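For training, we need a train set and a test set, so we split the data 80/20. Note that the cell below was executed before the preprocessing cells (it is In [13]), which is why its output still shows the raw messages.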

In [13]:
from sklearn.model_selection import train_test_split

train_msg, test_msg = train_test_split(msg, test_size=0.2, random_state=42)

train_msg.head()

Unnamed: 0,v1,v2
1978,0,No I'm in the same boat. Still here at my moms...
3989,1,(Bank of Granite issues Strong-Buy) EXPLOSIVE ...
3935,0,They r giving a second chance to rahul dengra.
4078,0,O i played smash bros &lt;#&gt; religiously.
4086,1,PRIVATE! Your 2003 Account Statement for 07973...
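Since ham is much more frequent than spam (4825 of 5572 messages), it may be worth keeping the class ratio equal in both sets. A sketch of a stratified split on made-up data, assuming scikit-learn's stratify argument is acceptable here (the toy frame is mine, not the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 40 ham (0) and 10 spam (1) messages
toy = pd.DataFrame({
    "v1": [0] * 40 + [1] * 10,
    "v2": [f"message {i}" for i in range(50)],
})

# stratify keeps the ham/spam ratio the same in both parts
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["v1"]
)

print(len(train), len(test))
print(train["v1"].mean(), test["v1"].mean())
```

Without stratify the spam share of the test set depends on the random shuffle and can drift away from the share in the full data.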
