# Applying NLP techniques for an SMS Spam/Ham detector

The idea of this project is to create a model that can help us determine whether a given message is spam or ham (not spam). I won't dive deep into the theory behind it; instead, I'll focus on giving you a basic template and code that you can use for this purpose.
<br>We will use several libraries that are well known in the Python ML community, such as Pandas, scikit-learn and NLTK, so we don't have to reinvent the wheel. I encourage you to dive deeper into each of them to get a better understanding of what they offer.

## Machine Learning Pipeline

### 1) Raw Text - Model can't distinguish words

We will take our data from a dataset provided on Kaggle (you can find it here - https://www.kaggle.com/assumewisely/sms-spam-collection - but I've included it in this project as well).
<br>Everything starts with understanding the format of our data and determining HOW we can process it. Let's do it.

In [1]:
# Taking a look at the raw format of our data
# Using "with" makes sure the file is closed once we are done reading it
with open("SMSSpamCollection.tsv", "r") as f:
    file_content = f.read()
# Let's display the first 2000 characters of our file
file_content[0:2000]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aids patent.\nham\tI HAVE A DATE ON SUNDAY WITH WILL!!\nham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\nspam\tWINNER!! As a valued network customer you have been selected to receivea Â£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\nspam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera f

As you can see, the file is made up of multiple lines of text (you can tell by the "\n" separator), where each line has two columns separated by a tab (\t). The first column is the label (either spam or ham) and the second one is the content of that SMS.
<br>In other words, this is a tab-separated file.
<br>In this case, we can use a simple function from the Pandas library to read the content and manage it in a more organized way.
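
Before handing the file to Pandas, here's a minimal sketch that makes the two-column layout concrete using only Python's standard `csv` module. The `sample` string is a made-up two-line excerpt in the same "label, tab, text" layout; it is an assumption for illustration, not the real file.

```python
import csv
import io

# Hypothetical two-line sample in the same "label<TAB>text" layout as the file
sample = "ham\tNah I don't think he goes to usf\nspam\tWINNER!! Claim code KL341\n"

# csv.reader splits each line on the tab; QUOTE_NONE keeps quote characters
# inside a message from being treated as field delimiters
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [(label, body) for label, body in reader]

for label, body in rows:
    print(label, "->", body)
```

Each row comes back as a (label, body) pair, which is exactly the shape we will ask Pandas to give us, just with a lot less manual work.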

In [2]:
# Read the content of the file with Pandas.
import pandas as pd
# I wanna see at most 100 characters on each column
pd.set_option("display.max_colwidth", 100)

# A couple of tricks here: our file is not comma-separated, it's tab-separated, which is why we need to pass in
# the separator. We also set header=None to indicate that the file has no header row, and we supply the column names ourselves
data = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None, names=["label", "body_text"])

# By default, head() displays the first 5 rows; we can pass a parameter to choose how many we want to see
data.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


### 2) Clean our text and tokenize

After reading our data, we need to clean it. There are many things we want to remove from the content, since they won't add any value to our model. We want to focus on the words and the role they play in a message being spam or ham.
<br><b>IMPORTANT: if you are already familiar with these concepts, please jump directly to section (v), where I put all of this together</b>

#### i) Remove punctuation
This can be easily achieved thanks to the string module provided by Python. Its string.punctuation constant gives us all of the ASCII characters considered punctuation in the C locale.

In [3]:
import string

def remove_punctuation(text):
    # Keep every character that is not listed in string.punctuation
    line_with_no_punct = "".join([char for char in text if char not in string.punctuation])
    return line_with_no_punct

# Apply the function to every message
data['body_no_punct'] = data['body_text'].apply(remove_punctuation)
data.head()

Unnamed: 0,label,body_text,body_no_punct
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL
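
As a side note on the punctuation step, here's a small sketch of another way to get the same result with `str.translate`. The helper name `remove_punctuation_fast` is my own invention for this example; it also prints `string.punctuation` so you can see exactly which characters are being removed.

```python
import string

# Translation table that maps every ASCII punctuation character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    # Hypothetical helper: deletes the characters listed in string.punctuation
    return text.translate(PUNCT_TABLE)

print(string.punctuation)
print(remove_punctuation_fast("I HAVE A DATE ON SUNDAY WITH WILL!!"))
```

Keep in mind that `string.punctuation` only covers ASCII, so symbols such as "£" are left untouched by either approach.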


#### ii) Tokenization
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
<br>In our case, we will go through our dataframe row by row and split the SMS content into words.
<br>There are many different ways to tokenize: we can, for instance, use a regular expression, or we can get some help from libraries that already have a solid implementation, such as NLTK.

In [4]:
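from nltk.tokenize import word_tokenize
# Note: word_tokenize needs NLTK's tokenizer data. If it's missing, download it once with
# nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab' instead)

# I'm transforming each line of text to lowercase so we have a normalized version of all words and
# "Hello" and "hello" are considered the same word.
data['body_tokenized'] = data['body_no_punct'].apply(lambda x: word_tokenize(x.lower()))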
data.head()

Unnamed: 0,label,body_text,body_no_punct,body_tokenized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"
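
For comparison, here's a rough sketch of the regular expression route mentioned earlier in this section. The function name `regex_tokenize` is hypothetical, and this is a simplification, not a drop-in replacement for NLTK's tokenizer.

```python
import re

def regex_tokenize(text):
    # \w+ matches runs of letters, digits and underscores;
    # everything else (spaces, punctuation) acts as a separator
    return re.findall(r"\w+", text.lower())

print(regex_tokenize("I HAVE A DATE ON SUNDAY WITH WILL!!"))
print(regex_tokenize("Nah I don't think he goes to usf"))
```

Notice that this pattern splits "don't" into "don" and "t" when the apostrophe is still there, which is one reason the order of the cleaning steps matters.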


#### iii) Remove stopwords
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own. The term comes from search engines, which are programmed to ignore these words both when indexing entries for searching and when retrieving them as the result of a search query.
We don't want these words taking up space in our data or valuable processing time. We can remove them easily by keeping a list of the words we consider to be stop words. NLTK (Natural Language Toolkit) ships with stopword lists for a number of languages, including English.

In [5]:
from nltk.corpus import stopwords

# Let's get our English stopwords (stored as a set so the membership check below is fast)
# If the stopword corpus is missing, download it once with nltk.download('stopwords')
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(token_list):
    no_stopwords_list = [token for token in token_list if token not in english_stopwords]
    return no_stopwords_list

data['body_tokenized_no_stopwords'] = data['body_tokenized'].apply(remove_stopwords)

# Check what happened in the fifth row to get a better feel for what stopwords are
data.head()

Unnamed: 0,label,body_text,body_no_punct,body_tokenized,body_tokenized_no_stopwords
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"
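
To see why the fifth row shrinks so much, here's a tiny self-contained sketch. The `toy_stopwords` set is hand-picked by me for illustration and is NOT the NLTK list; the token list is the fifth row from the output above.

```python
# Tiny hand-picked stopword set, for illustration only (NOT the NLTK list)
toy_stopwords = {"i", "have", "a", "on", "with", "will", "the", "to"}

def remove_toy_stopwords(token_list):
    # Keep only the tokens that are not in the stopword set
    return [token for token in token_list if token not in toy_stopwords]

tokens = ["i", "have", "a", "date", "on", "sunday", "with", "will"]
print(remove_toy_stopwords(tokens))
```

One side effect worth noticing: after lowercasing, the name "WILL" becomes "will", which looks just like the auxiliary verb, so it gets dropped as a stopword too.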


#### iv) Transforming words to their base form (stemming/lemmatizing)
As you might imagine, our SMS messages can contain many different forms of the same word. Words such as "go", "going", "gone" and "goes" should all carry the same meaning for us (go). There are two processes that can help us transform words into their base forms:

<b>Stemming</b> is the process of reducing inflected words to their root form, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
<br>The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de- and mis-. Stems are created by removing the suffixes or prefixes used with a word, so stemming may produce results that are not actual words.
<br>This basically means chopping off the end of the word to leave only the base.
<br> For example:
<br> Stemming/stemmed -> Stem
<br> Electricity/electrical -> Electr
<br> Berries/Berry -> Berri
<br> Connection/Connected/connective -> Connect

<b>Lemmatizing</b> is the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma.
<br>In other words, it uses vocabulary analysis to remove inflectional endings and return the dictionary form of a word.

<br><b>When should you use one or the other?</b> Both aim at the same goal, but as you might imagine, there's a tradeoff between accuracy and speed: stemming is faster because it chops the ends off words using heuristics, but it has no understanding of the context in which a word is used, which can lead to words that don't even exist in the dictionary. Lemmatizing is usually more accurate, but slower.
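
To make that tradeoff a bit more tangible, here's a deliberately naive sketch. `toy_stem` is a crude suffix stripper (NOT the Porter algorithm), and `toy_lemmatize` is a hand-written lookup table standing in for a real vocabulary; both names and the table contents are assumptions made up for this example.

```python
# Toy suffix-stripping "stemmer": NOT the Porter algorithm, just the general idea
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least two characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Toy "lemmatizer": a hand-written lookup table standing in for a real vocabulary
TOY_LEMMAS = {"going": "go", "gone": "go", "goes": "go", "went": "go"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table
    return TOY_LEMMAS.get(word, word)

for word in ["going", "goes", "gone", "went"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix rules handle "going" and "goes" cheaply, but they can't do anything with irregular forms like "gone" or "went". The lookup gets those right, at the cost of needing a vocabulary behind it.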



In [6]:
# For efficiency purposes, I'm going to use stemming in this example. If you want to use lemmatizing instead, it's easy
# to do with NLTK as well (see https://www.geeksforgeeks.org/python-lemmatization-with-nltk)
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_words(token_list):
    # Stem every token in the list
    stemmed_words = [ps.stem(word) for word in token_list]
    return stemmed_words

data["cleaned_text"] = data["body_tokenized_no_stopwords"].apply(stem_words)
data.head()

# Notice in the first row how "searching" was transformed into "search" and "promise" into "promis"

Unnamed: 0,label,body_text,body_no_punct,body_tokenized,body_tokenized_no_stopwords,cleaned_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]"


#### v) Putting all together in just one function
Instead of going step by step, we can create a single function that takes us from the original SMS content straight to the cleaned list of tokens that we need.
<br>Let's do it!

In [7]:
# These imports were already done above; I'm repeating them so this cell works on its own, as if the previous
# explanations never happened
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# A set makes the "not in" check below faster than a list
english_stopwords = set(stopwords.words('english'))

def clean_text(text):
    # Remove punctuation
    text_no_punctuation = "".join([char for char in text if char not in string.punctuation])

    # Tokenize
    token_list = word_tokenize(text_no_punctuation.lower())

    # Remove stopwords
    token_list_no_sw = [token for token in token_list if token not in english_stopwords]

    # Transform words into their root versions through stemming
    cleaned_version = [ps.stem(token) for token in token_list_no_sw]
    return cleaned_version

# Transform our original content into the cleaned version of it
data["processed_text"] = data["body_text"].apply(clean_text)
data.head()

Unnamed: 0,label,body_text,body_no_punct,body_tokenized,body_tokenized_no_stopwords,cleaned_text,processed_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]","[date, sunday]"


In [8]:
# Remove the columns we used for explanation purposes (axis=1 means we are dropping columns, not rows)
data = data.drop(['body_no_punct', 'body_tokenized', 'body_tokenized_no_stopwords', 'cleaned_text'], axis=1)
data.head()

Unnamed: 0,label,body_text,processed_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"


### 4) Vectorize - convert to numeric form

### 5) Machine Learning Algorithm - fit/train our model

### 6) Spam filter - system to filter SMS messages