# Applying NLP techniques for an SMS Spam/Ham detector

The idea of this project is to create a model that can help us determine whether a given message is spam or ham (not spam). I won't dive deep into the theory behind it; instead, I will focus on providing a basic template and code that you can use for this purpose.
<br>We will be using several libraries that are well known in the ML community, such as Pandas, scikit-learn and NLTK, so we don't have to reinvent the wheel. I encourage you to dive deeper into each one of them to get a better understanding of what they support.
<br>For NLTK, after installing it through Anaconda or any other environment, you need to open a Jupyter notebook and run the following commands:
<br><b> >>>>>> import nltk
<br> >>>>>> nltk.download()</b>
<br>This will open a window where you can choose to download all the NLTK packages. After that, you can start using it.

## Machine Learning Pipeline

### 1) Raw Text - Model can't distinguish words

We will take our data from a sample dataset provided by Kaggle (you can find it here - https://www.kaggle.com/assumewisely/sms-spam-collection - but it's also included as part of this project). 
<br>Everything starts by understanding the format of our data and determining HOW we can process it. Let's do it.

In [1]:
# Taking a look at the raw format of our data
# (the with block makes sure the file is closed once we have read it)
with open("SMSSpamCollection.tsv", "r") as raw_file:
    file_content = raw_file.read()
# Let's display the first 2000 characters of our file
file_content[0:2000]

As you can see, the file is composed of multiple text lines (you can tell by the "\n" separator), where each line has two columns separated by a tab (\t). The first column is the label (either spam or ham) and the second one is the content of that SMS.
<br>In other words, this is a tab-separated file.
<br>In this case, we can use a simple method from the Pandas library to read the content and manage it in a more organized way.

In [2]:
# Read the content of the file with Pandas.
import pandas as pd
# I wanna see at most 100 characters on each column
pd.set_option("display.max_colwidth", 100)

# A couple of tricks here, our file is not a comma separated file, it's a tab separated file, that's why we need to pass 
# in the separator. On the other hand, we use header equals to None in order to indicate that there's no header column
data = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None, names=["label", "body_text"])

print(f"Total messages in our dataset: {data.shape[0]}")
print(f"Spam messages in our dataset: {(data['label']=='spam').sum()}")
print(f"Ham messages in our dataset: {(data['label']=='ham').sum()}")
# By default, head() displays the first 5 rows; we can pass a number to choose how many we want to see
data.head()

Total messages in our dataset: 5568
Spam messages in our dataset: 746
Ham messages in our dataset: 4822


Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


Well, this is not the biggest dataset out there: it only contains 5568 messages, and the classes are clearly imbalanced (only 746 of those 5568 messages, about 13%, are spam). We cannot expect the model to perform well on messages coming from a different distribution or using words outside the vocabulary we have here.
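
To make the imbalance concrete, here is a small sketch that uses only the counts printed above. It computes the share of spam and the accuracy that a trivial "always answer ham" rule would reach; the variable names are just for illustration.

```python
# Class counts reported above for this dataset
total_messages = 5568
spam_messages = 746
ham_messages = 4822

# Share of messages that are spam
spam_share = spam_messages / total_messages

# Accuracy of a trivial rule that labels every message as ham
always_ham_accuracy = ham_messages / total_messages

print(f"Spam share: {spam_share:.1%}")
print(f"Always-ham baseline accuracy: {always_ham_accuracy:.1%}")
```

Any model we train should beat that baseline, which is one reason to look at more than accuracy when evaluating it.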

### 2) Clean our text and tokenize

After reading our data, we need to clean it. There are many things that we want to remove from the content, since they won't add any value to our model. We want to focus on words and know the role they play in a message being spam or ham.

#### i) Remove punctuation
This can be easily achieved thanks to the string library provided by Python. Its string.punctuation constant contains all of the ASCII characters considered punctuation in the C locale.

In [3]:
import string

def remove_punctuation(text):
    # Join the kept characters back into a single string; otherwise we would end up with a list of characters
    line_with_no_punct = "".join([char for char in text if char not in string.punctuation])
    return line_with_no_punct

data['body_no_punct'] = data['body_text'].apply(lambda x: remove_punctuation(x))
data.head()

Unnamed: 0,label,body_text,body_no_punct
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL
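
As an alternative sketch (not what the cell above uses), the same removal can be done with a translation table built by `str.maketrans`. The function name `remove_punctuation_translate` is made up for this example.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletion)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    # translate() drops every character that the table maps to None
    return text.translate(PUNCT_TABLE)

print(remove_punctuation_translate("I HAVE A DATE ON SUNDAY WITH WILL!!"))
```

Both versions only know about ASCII punctuation, so characters such as "£" or typographic quotes stay in the text.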


#### ii) Tokenization
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
<br>In our case, we will go line by line through our dataframe and split the SMS content into words.
<br>There are many different ways to tokenize: we can, for instance, use a regular expression, or we can rely on libraries that already have a solid implementation, such as NLTK.

In [4]:
from nltk.tokenize import word_tokenize

# I'm transforming each line of text to lowercase so we can have a normalized version of all words and 
# "Hello" or "hello" are considered the same word.
data['body_tokenized'] = data['body_no_punct'].apply(lambda x: word_tokenize(x.lower()))
data.head()

Unnamed: 0,label,body_text,body_no_punct,body_tokenized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"
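
The regular-expression route mentioned in this section can be sketched with the standard `re` module. This is a simplified stand-in and not what `word_tokenize` does internally; `regex_tokenize` is a name invented for the example.

```python
import re

def regex_tokenize(text):
    # \w+ matches runs of letters, digits and underscores,
    # so whitespace and punctuation act as separators
    return re.findall(r"\w+", text.lower())

print(regex_tokenize("I HAVE A DATE ON SUNDAY WITH WILL!!"))
```

Note that this pattern splits contractions at the apostrophe ("don't" becomes "don" and "t"), so its output can differ from the pipeline above, where punctuation is removed before tokenizing.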


#### iii) Remove stopwords
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own. Search engines, for instance, are often programmed to ignore these words both when indexing entries and when retrieving them as the result of a search query.
We don't want these words to take up space in our data or consume valuable processing time.
<br>NLTK (Natural Language Toolkit) ships stopword lists for several languages (16 at the time of writing), which makes this task quite easy to accomplish.

In [5]:
from nltk.corpus import stopwords

#Let's get our english stopwords
english_stopwords = stopwords.words('english')

def remove_stopwords(token_list):
    no_stopwords_list = [token for token in token_list if token not in english_stopwords]
    return no_stopwords_list

data['body_tokenized_no_stopwords'] = data['body_tokenized'].apply(lambda x: remove_stopwords(x))

# Check what happened on the fifth row to understand a little bit more what stopwords are 
data.head()

Unnamed: 0,label,body_text,body_no_punct,body_tokenized,body_tokenized_no_stopwords
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"
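
`stopwords.words('english')` returns a plain list, so every `token not in english_stopwords` check scans that list. A small sketch of the same filter using a set is shown here; `TOY_STOPWORDS` is a tiny, hypothetical stand-in so the example runs without NLTK, and in the notebook you would wrap the NLTK list with `set(...)` instead.

```python
# Hypothetical, tiny stopword set for illustration only; NLTK's English list is much longer
TOY_STOPWORDS = {"i", "have", "a", "on", "with", "will", "he", "to", "here"}

def remove_stopwords_set(token_list, stopword_set=TOY_STOPWORDS):
    # Membership tests on a set take roughly constant time, unlike on a list
    return [token for token in token_list if token not in stopword_set]

tokens = ["i", "have", "a", "date", "on", "sunday", "with", "will"]
print(remove_stopwords_set(tokens))
```

For a dataset of this size the difference is small, but it adds up once the function is called for every message.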


#### iv) Transforming words to their base form (stemming/lemmatizing)
We might have many different forms of the same word in our SMS messages. Words such as "go", "going", "gone" and "goes" should all have the same meaning for us (go). There are two processes that can help us transform words into their base forms:

<b>Stemming</b> is the process of reducing inflected words to their root form, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
<br>The stem (root) is the part of the word to which you add affixes such as -ed, -ize, -s, de- or mis-. Stems are created by removing the suffixes or prefixes used with a word, so stemming may produce strings that are not actual words.
<br>This basically means chopping off the end of the word to leave only the base.
<br> For example:
<br> Stemming/stemmed -> Stem
<br> Electricity/electrical -> Electr
<br> Berries/Berry -> Berri
<br> Connection/Connected/connective -> Connect

<b>Lemmatizing</b> is the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma.
<br>In other words, it uses vocabulary and morphological analysis of words to remove inflectional endings and return the dictionary form of a word.

<br><b>When should you use one or the other?</b> Both aim at the same goal, but there's a tradeoff between accuracy and speed: stemming is faster because it chops the end of words using heuristics, but it has no understanding of the context in which a word is used, which can lead to words that don't even exist in the dictionary. Lemmatizing is typically more accurate but slower.
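
To get a feeling for what "chopping the end of words using heuristics" means, here is a deliberately naive suffix stripper. It is a toy written for this explanation and is NOT the Porter algorithm used later; the suffix list and the `min_stem_len` parameter are assumptions made up for the sketch.

```python
# Suffixes tried in order; a real stemmer has many more rules and conditions
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    for suffix in SUFFIXES:
        # Only strip the suffix if enough of the word is left afterwards
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["searching", "granted", "words", "lives", "sunday"]:
    print(word, "->", toy_stem(word))
```

Even these four rules already produce a non-word ("lives" turns into "liv"), which is the kind of side effect the tradeoff above refers to.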



In [6]:
# For efficiency purposes, I'm going to use Stemming on this example. If you wanted to use lemmatizing,
# you can do it with NLTK as well. (link in here: https://www.geeksforgeeks.org/python-lemmatization-with-nltk)
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_words(token_list):
    stemmed_words = [ps.stem(word) for word in token_list]
    return stemmed_words

data["cleaned_text"] = data["body_tokenized_no_stopwords"].apply(lambda x: stem_words(x))
data.head()

# Check on the first row how "searching" was transformed into "search" or "promise" into "promis"

Unnamed: 0,label,body_text,body_no_punct,body_tokenized,body_tokenized_no_stopwords,cleaned_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]"


#### v) Putting all together in just one function
Instead of going step by step, we can create a single function that will help us jump from the original sms content into the cleaned list of tokens that we need.
<br>Let's do it!

In [7]:
# I know these imports have been done before, I'm just trying to put all together in one function assuming that the previous
# explanations never happened
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()
english_stopwords = stopwords.words('english')

def clean_text(text):
    # Remove punctuation
    text_no_punctuation = "".join([char for char in text if char not in string.punctuation])
    
    # Tokenize
    token_list = word_tokenize(text_no_punctuation.lower())
    
    # Remove stopwords
    token_list_no_sw = [token for token in token_list if token not in english_stopwords]   
    
    # Transform words into their root versions through stemming
    cleaned_version = [ps.stem(token) for token in token_list_no_sw]
    return cleaned_version

# Transform our original content into the cleaned version of it
data["processed_text"] = data["body_text"].apply(lambda x: clean_text(x))
data.head()

Unnamed: 0,label,body_text,body_no_punct,body_tokenized,body_tokenized_no_stopwords,cleaned_text,processed_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]","[date, sunday]"


In [8]:
# Remove the columns we've used for explanation purposes
data = data.drop(['body_no_punct', 'body_tokenized', 'body_tokenized_no_stopwords', 'cleaned_text'], axis = 1)
data.head()

Unnamed: 0,label,body_text,processed_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"


### 4) Vectorize - convert to numeric form
Vectorizing is the process of encoding text as numbers to create feature vectors. Models work with numbers rather than strings, which is why we need to go through this transformation process.
<br>A feature vector is an n-dimensional vector of numerical features that represents some object. In our case, remember that each row contains the label (classification) and the message. We will convert this structure into a matrix that has one row per message and n columns, one for each possible word (token) used across all of our messages. The label is kept alongside as well.
As the values, we will count how many times each of these words or tokens appears in every single message that we have. As you will see later, what we put as the column value depends on the vectorization algorithm we end up using.
<br>
<br>For example, let's assume that we only have two messages:
<br> - "hello lucas" classified as spam
<br> - "nice to meet you lucas" classified as not spam
<br>

| body_text | hello | lucas | nice | to | meet | you | label |
|---|---|---|---|---|---|---|---|
| "hello lucas" | 1 | 1 | 0 | 0 | 0 | 0 | spam |
| "nice to meet you lucas" | 0 | 1 | 1 | 1 | 1 | 1 | ham |
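
The two-message matrix above can be rebuilt by hand with a few lines of plain Python. This is only a sketch of the counting idea (splitting on spaces, columns in order of first appearance), not how a library vectorizer is implemented.

```python
from collections import Counter

messages = ["hello lucas", "nice to meet you lucas"]

# Build the vocabulary in order of first appearance
vocabulary = []
for message in messages:
    for token in message.split():
        if token not in vocabulary:
            vocabulary.append(token)

# One row per message, one column per vocabulary token
count_matrix = []
for message in messages:
    counts = Counter(message.split())
    count_matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```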


<br> There are many ways of creating this matrix, that is, many vectorization methods. The one shown above is called Count Vectorization (it just counts how many times a token appears in a message). There's also the n-grams approach (https://medium.com/machine-learning-intuition/document-classification-part-2-text-processing-eaa26d16c719), which counts sequences of n adjacent tokens instead of single tokens, and TF-IDF (term frequency-inverse document frequency), which weights each count by how rare the token is across all messages rather than just counting.
<br><b>For this example, I will generate this matrix by using TF-IDF.</b>
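
Before using the library, here is a hand-rolled sketch of one common TF-IDF variant on the two example messages. It assumes a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; to my understanding this matches scikit-learn's default settings, but treat that as an assumption and check the documentation of your version.

```python
import math
from collections import Counter

docs = [["hello", "lucas"], ["nice", "to", "meet", "you", "lucas"]]
n_docs = len(docs)

# Document frequency: in how many messages each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf_row(doc):
    counts = Counter(doc)
    # Term count multiplied by the smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
        for token, count in counts.items()
    }
    # L2-normalize so every message vector has length 1
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    return {token: weight / norm for token, weight in weights.items()}

first = tfidf_row(docs[0])
print(first)
```

In the first message, "hello" ends up with a higher weight than "lucas" because "lucas" appears in both messages and is therefore less informative.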

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# We need to provide a function that will clean our text so that the vectorizer can use it
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

# Let's transform and vectorize the content
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

# The result should have as many rows as messages we have and one column for each
# distinct token used across all of our messages
print(X_tfidf.shape)

# This prints the vocabulary (the column names of the matrix), but as you might see, it's kind of hard to read.
# get_feature_names() was removed in scikit-learn 1.2; on older versions, use get_feature_names() instead
print(tfidf_vect.get_feature_names_out().tolist())

(5568, 8168)
['0', '008704050406', '0089mi', '0121', '01223585236', '01223585334', '0125698789', '02', '020603', '0207', '02070836089', '02072069400', '02073162414', '02085076972', '020903', '021', '050703', '0578', '06', '060505', '061104', '07008009200', '07046744435', '07090201529', '07090298926', '07099833605', '071104', '07123456789', '0721072', '07732584351', '07734396839', '07742676969', '07753741225', '0776xxxxxxx', '07786200117', '077xxx', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '0784987', '0789xxxxxxx', '0794674629107880867867', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '0800', '08000407165', '08000776320', '08000839402', '08000930705', '08000938767', '08001950382', '08002888812', '08002986030', '08002986906', '08002988890', '08006344447', '0808', '08081263000', '08081560665', '0825', '0844', '08448350055', '08448714184', '0845', '08450542832', '08452810071', '08452810073', '08452810075over18', '0870', '08700621170150p', '0

#### Apply TF-IDF to a smaller sample for visualization purposes

In [10]:
import pandas as pd

# Let's take only 20 messages from our pool
data_sample = data[0:20]

tfidf_vect_sample = TfidfVectorizer(analyzer=clean_text)
X_tfidf_sample = tfidf_vect_sample.fit_transform(data_sample['body_text'])
print(X_tfidf_sample.shape)
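# get_feature_names() was removed in scikit-learn 1.2; on older versions, use get_feature_names() instead
print(tfidf_vect_sample.get_feature_names_out().tolist())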

(20, 204)
['08002986030', '08452810075over18', '09061701461', '1', '100', '11', '12', '150pday', '16', '2', '20000', '2005', '21st', '3', '4', '4403ldnw1a7rw18', '4txtú120', '6day', '81010', '87077', '87121', '87575', '9', 'aft', 'aid', 'alreadi', 'anymor', 'appli', 'ard', 'around', 'b', 'bless', 'breather', 'brother', 'call', 'caller', 'callertun', 'camera', 'cash', 'chanc', 'claim', 'click', 'co', 'code', 'colour', 'comin', 'comp', 'copi', 'cost', 'credit', 'cri', 'csh11', 'cup', 'custom', 'da', 'date', 'dont', 'eg', 'eh', 'england', 'enough', 'entitl', 'entri', 'even', 'fa', 'feel', 'final', 'fine', 'finish', 'first', 'free', 'friend', 'fulfil', 'go', 'goalsteam', 'goe', 'gon', 'gota', 'grant', 'ha', 'help', 'hl', 'home', 'hour', 'httpwap', 'im', 'info', 'ive', 'jackpot', 'joke', 'k', 'kim', 'kl341', 'lar', 'latest', 'lccltd', 'like', 'link', 'live', 'lor', 'lunch', 'macedonia', 'make', 'may', 'mell', 'membership', 'messag', 'minnaminungint', 'miss', 'mobil', 'month', 'na', 'nah', '

#### Visualize the sparse matrix as a DataFrame
The vectorizer returns a sparse matrix (only the non-zero values are stored). Let's convert it to a dense array and wrap it in a pandas DataFrame so we can visualize it better.
<br>Keep in mind that each row corresponds to a given message from our dataset.

In [11]:
# We can now create a dataframe out of this object just to visualize it in a better way
X_tfidf_df = pd.DataFrame(X_tfidf_sample.toarray())
X_tfidf_df.columns = tfidf_vect_sample.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
X_tfidf_df.head()

Unnamed: 0,08002986030,08452810075over18,09061701461,1,100,11,12,150pday,16,2,...,wont,word,wwwdbuknet,xxxmobilemovieclub,xxxmobilemovieclubcomnqjkgighjjgcbl,ye,£100000,£900,ü,‘
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.238737,0.209853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.198986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.157831,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 5) Feature Engineering
This section can be quite big, depending on how well you understand the data that you're working with.
<br>We will fit the model using only the weights of each word or token present in each message (that is, the TF-IDF matrix). However, there are likely many other features in our data that could be relevant as well.
<br> For example, we can ask the following questions:
<br> - Are longer messages more likely to be spam?
<br> - Are messages that contain a lot of punctuation more likely to be spam?
<br> - Is there a relation between a message containing many capitalized words and it being spam?
<br> This list can grow quickly, and it's worth analyzing these ideas to see whether adding extra columns with this information would complement our model. I won't do it in this tutorial, but I encourage you to try it.
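
If you want to try those ideas, the following sketch computes three candidate features for a single message: its length without spaces, the number of punctuation characters, and the number of fully capitalized words. The function name and the exact feature definitions are assumptions for illustration, not something the rest of this notebook relies on.

```python
import string

def extra_features(text):
    # Message length, ignoring spaces
    non_space_chars = len(text) - text.count(" ")
    # How many characters are ASCII punctuation
    punct_count = sum(1 for char in text if char in string.punctuation)
    # How many whitespace-separated words are written fully in capitals
    upper_words = sum(1 for word in text.split() if word.isupper())
    return {"length": non_space_chars, "punct_count": punct_count, "upper_words": upper_words}

print(extra_features("I HAVE A DATE ON SUNDAY WITH WILL!!"))
```

With pandas, something like `data['body_text'].apply(extra_features)` would give one dictionary per message, which you could then turn into extra columns next to the TF-IDF matrix.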

### 6) Machine Learning Algorithm - fit/train our model
We've come to the most interesting part: our data is pre-processed and formatted so it can be understood by a machine learning algorithm.
<br>What is a machine learning model? We can think of it as a model that learns associations from the data on which it was trained and then makes predictions based on them. An ML model is good if it not only fits the data we train it with, but also generalizes to data it has never seen before.
<br> Since we train on messages that already carry a label (spam or ham), this type of ML is called supervised learning, and it's the approach we will follow in this example.

<br>
<br>There are a couple of concepts that have to be explained here. The first one is how we evaluate the model we train. For this, we have something called the holdout test set: a sample of data that is not used in fitting (training) the model and is kept aside to evaluate the model's ability to generalize to unseen data.
<br>We'll be primarily using K-fold cross-validation to evaluate our model (https://machinelearningmastery.com/k-fold-cross-validation/). The idea is that the full dataset is divided into k subsets and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to train the model. Each of the k evaluations gives us one value of the metric we are measuring.
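
The splitting logic can be sketched in plain Python. This generator is a hypothetical helper (no shuffling, contiguous folds) that yields the row indices used for training and testing in each of the k rounds.

```python
def k_fold_indices(n_samples, k):
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

Every row is used for testing exactly once and for training k-1 times.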

<br>
<br><b>Which metrics can we consider?</b>
<br>In this example, we will cover 3 of them:
<br> -Accuracy: the number of messages predicted correctly divided by the total number of messages.
<br> -Precision: in the context of our problem, the number of messages predicted as spam that are actually spam divided by the total number predicted as spam.
<br> -Recall: the number of messages predicted as spam that are actually spam divided by the total number that are actually spam.
<br>Something really important is that knowing only accuracy is not enough, especially with imbalanced classes like ours. Precision and recall share the same numerator (the number of spam messages correctly identified) while their denominators differ.
<br><b>If false positives are really costly (this depends on the business), we will want to optimize our model for precision. If false negatives are really costly, we want to optimize the model for recall.</b>
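
The three metrics can be computed by hand from two lists of labels. The sketch below uses a small made-up example (10 messages, 4 of them spam, and a model that only catches 2 of the spam messages); `spam_metrics` is a name invented for this illustration.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    # True positives, false positives and false negatives for the spam class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled as spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels: 4 spam and 6 ham messages; the model flags only 2 of the spam ones
y_true = ["spam"] * 4 + ["ham"] * 6
y_pred = ["spam"] * 2 + ["ham"] * 8
print(spam_metrics(y_true, y_pred))
```

Here accuracy is 0.8 and precision is 1.0, yet recall is only 0.5: half of the spam gets through, which accuracy alone would hide.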


<br>
<br><b>Ensemble Method</b><br>
Random forest (https://en.wikipedia.org/wiki/Random_forest#Algorithm) is a machine learning algorithm that falls into the category of ensemble learners. An ensemble method is a technique that creates multiple models and then combines them to produce better results than any of the single models individually.
<br>You can basically combine multiple weaker models into a stronger one.
<br> Random forest constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction. The number of trees in the model is set by the "n_estimators" parameter. Let's say we use 100 of them: each one votes individually on whether it thinks the message is spam or ham. We then combine the 100 votes and the majority wins (if 60 trees predict spam, the final prediction is spam).
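
The voting step described above can be sketched as plain hard voting. This is an illustration of the idea only; library implementations may aggregate differently (for example, by averaging the class probabilities of the trees), and `majority_vote` is a made-up helper.

```python
from collections import Counter

def majority_vote(tree_votes):
    # most_common(1) returns the label with the highest number of votes
    return Counter(tree_votes).most_common(1)[0][0]

# 100 hypothetical trees: 60 say spam, 40 say ham
votes = ["spam"] * 60 + ["ham"] * 40
print(majority_vote(votes))
```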


 

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# We use n_jobs=-1 so it can parallelize. IMPORTANT: we're using the default parameters here
rf = RandomForestClassifier(n_jobs=-1)

# n_splits is the k of the k-folds, for this case we will use 5.
k_fold = KFold(n_splits=5)

# Take into consideration that we need to send the features (the token matrix) separated from the labels that
# correspond to them. In this case, we will measure accuracy
cross_val_score(rf, X_tfidf, data['label'], cv=k_fold, scoring="accuracy", n_jobs=-1)

array([0.97755835, 0.97666068, 0.97486535, 0.96046721, 0.96855346])

#### Testing multiple parameters
As I mentioned before, there are many parameters that can be tuned. For instance, we didn't touch any parameters when defining our Random Forest model. There are two parameters that are quite interesting to experiment with: n_estimators and max_depth.
<br>A popular method for testing them is known as grid search. The idea behind it is to iterate through many different combinations of values for these parameters and determine which model produces the best results.
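
The "grid" is simply every combination of the candidate values. As a small sketch, `itertools.product` can enumerate them; the values used here are the ones this section tries for n_estimators and max_depth.

```python
from itertools import product

param_grid = {
    "n_estimators": [10, 25, 50, 75, 100],
    "max_depth": [10, 20, 30, None],
}

# One dictionary per combination of parameter values
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))
print(candidates[0])
```

Each extra parameter multiplies the number of models to train, so the grid grows quickly.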

In [13]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

In [14]:
# Split our data into train and test sets, keeping 20% of it for testing
x_train, x_test, y_train, y_test = train_test_split(X_tfidf, data['label'], test_size=0.2)

def train_random_forest(estimators,depth):
    # Create the random forest classifier
    rf = RandomForestClassifier(n_estimators=estimators, max_depth=depth, n_jobs=-1)
    
    # Train the model
    rf_model = rf.fit(x_train,y_train)
    
    # Generate predictions for the test data
    y_pred = rf_model.predict(x_test)
    
    # Analyze metrics for those predictions
    precision,recall,fscore,support = score(y_test,y_pred,pos_label="spam",average="binary")
    
    # Print them so we can visualize for each iteration
    accuracy = round(((y_pred == y_test).sum()/len(y_test)),3)
    print(f"Est: {estimators} / Depth: {depth} ----> " +  
          f"Precision: {round(precision,3)} / Recall: {round(recall,3)} / Accuracy: {accuracy}")
    
    
# Create a for loop to test different possibilities
for estimators in [10,25,50,75,100]:
    for depth in [10,20,30,None]:
        train_random_forest(estimators,depth)

Est: 10 / Depth: 10 ----> Precision: 1.0 / Recall: 0.265 / Accuracy: 0.898
Est: 10 / Depth: 20 ----> Precision: 1.0 / Recall: 0.626 / Accuracy: 0.948
Est: 10 / Depth: 30 ----> Precision: 0.99 / Recall: 0.665 / Accuracy: 0.952
Est: 10 / Depth: None ----> Precision: 1.0 / Recall: 0.742 / Accuracy: 0.964
Est: 25 / Depth: 10 ----> Precision: 1.0 / Recall: 0.174 / Accuracy: 0.885
Est: 25 / Depth: 20 ----> Precision: 1.0 / Recall: 0.561 / Accuracy: 0.939
Est: 25 / Depth: 30 ----> Precision: 1.0 / Recall: 0.697 / Accuracy: 0.958
Est: 25 / Depth: None ----> Precision: 1.0 / Recall: 0.8 / Accuracy: 0.972
Est: 50 / Depth: 10 ----> Precision: 1.0 / Recall: 0.219 / Accuracy: 0.891
Est: 50 / Depth: 20 ----> Precision: 1.0 / Recall: 0.581 / Accuracy: 0.942
Est: 50 / Depth: 30 ----> Precision: 1.0 / Recall: 0.697 / Accuracy: 0.958
Est: 50 / Depth: None ----> Precision: 1.0 / Recall: 0.774 / Accuracy: 0.969
Est: 75 / Depth: 10 ----> Precision: 1.0 / Recall: 0.239 / Accuracy: 0.894
Est: 75 / Depth: 20 

In this case, we can see that models with a high depth (or None, which lets each tree grow until its leaves are pure or hold fewer than min_samples_split samples) are the ones that work best. n_estimators plays a role as well, but it is not as relevant as depth.
<br>In this example, these are the parameters that end up generating the model that offers the best results:
<br>Est: 75 / Depth: None ----> Precision: 1.0 / Recall: 0.8 / Accuracy: 0.972
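
Several rows above pair a precision of 1.0 with a low recall, which can look odd at first. The following is a minimal sketch, in plain Python and with made-up labels, of how the three metrics are computed; `binary_metrics` is a hypothetical helper written for illustration, not part of Scikit Learn.

```python
def binary_metrics(y_true, y_pred, pos_label="spam"):
    """Return (precision, recall, accuracy) for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # spam caught
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # ham flagged as spam
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # spam missed
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy


# Toy labels (made up): 4 spam and 6 ham messages
y_true = ["spam"] * 4 + ["ham"] * 6
# A cautious model: it flags only one of the four spam messages and never flags ham
y_pred = ["spam", "ham", "ham", "ham"] + ["ham"] * 6

precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(precision, recall, accuracy)
```

In this toy case everything flagged as spam really is spam (precision 1.0), yet only one of four spam messages is caught (recall 0.25). Accuracy still looks decent because most messages are ham, which is the same pattern the shallow models show above.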

#### What if we want to combine k-fold cross validation with grid search?
In the previous example, we examined those parameters and the performance results by using a single holdout test method (this means we randomly split the data into train and test sets, but only once).
<br>We can combine k-fold cross validation with grid search to find the best values for our parameters. Here is how:

In [15]:
from sklearn.model_selection import GridSearchCV

In [16]:
# Instead of writing the for loops ourselves, we can define a dictionary with the parameters we want to test
params = {
    "n_estimators": [10,25,50,75,100],
    "max_depth": [10,20,30,None]
}

rf = RandomForestClassifier(n_jobs=-1)

# We will use a k = 5 for the folds
gs = GridSearchCV(rf, params, n_jobs=-1, cv=5)
gs_fit = gs.fit(X_tfidf, data['label'])

# We want to explore the results. The "cv_results_" attribute holds all of the results across all the
# folds and all the different params. This can get messy, so let's sort them and keep the top 5
pd.DataFrame(gs_fit.cv_results_).sort_values("mean_test_score", ascending=False)[0:5]

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 6.465872 | 0.223178 | 0.346413 | 0.054322 | None | 75 | {'max_depth': None, 'n_estimators': 75} | 0.978456 | 0.976661 | 0.97307 | 0.965858 | 0.963163 | 0.971441 | 0.00598 | 1 |
| 16 | 2.264186 | 0.171978 | 0.275709 | 0.026408 | None | 25 | {'max_depth': None, 'n_estimators': 25} | 0.974865 | 0.969479 | 0.974865 | 0.966757 | 0.966757 | 0.970545 | 0.003665 | 2 |
| 19 | 7.403054 | 0.749408 | 0.195345 | 0.078396 | None | 100 | {'max_depth': None, 'n_estimators': 100} | 0.977558 | 0.968582 | 0.973968 | 0.966757 | 0.965858 | 0.970544 | 0.004496 | 3 |
| 17 | 4.538704 | 0.237527 | 0.335309 | 0.067224 | None | 50 | {'max_depth': None, 'n_estimators': 50} | 0.978456 | 0.971275 | 0.966786 | 0.96496 | 0.967655 | 0.969826 | 0.004779 | 4 |
| 15 | 1.016557 | 0.285024 | 0.221025 | 0.06791 | None | 10 | {'max_depth': None, 'n_estimators': 10} | 0.973968 | 0.968582 | 0.964093 | 0.964061 | 0.964061 | 0.966953 | 0.003918 | 5 |


As we can see one more time, the model that gives the best results is the one with unlimited depth (None) and 75 estimators.
<br>We are sorting by mean_test_score, since this value is the average accuracy over the five held-out folds, that is, it measures how well our model generalizes from the data it was trained on to data it has not seen.
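
As a quick sanity check on what mean_test_score represents, the sketch below averages the five per-fold scores of the top-ranked row by hand. The numbers are copied from the results above; nothing is re-trained.

```python
from statistics import mean

# split0..split4 test scores for {'max_depth': None, 'n_estimators': 75},
# copied from the grid search results above
split_scores = [0.978456, 0.976661, 0.97307, 0.965858, 0.963163]

mean_score = mean(split_scores)
print(round(mean_score, 5))
```

The result agrees with the reported mean_test_score of 0.971441 to about five decimal places. The small remaining difference is likely due to the displayed fold scores being rounded.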

### What's next?
There are many other algorithms we can try for the same purpose, for example Gradient Boosting, but I will leave that for future POCs.
<br>Now that we have an idea of how to initialize our model, let's try to predict some made up messages and see what we get.

In [17]:
# Get our train/test data
x_train, x_test, y_train, y_test = train_test_split(X_tfidf, data['label'], test_size=0.2)

# Create the random forest classifier
rf = RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)

# Train the model
rf_model = rf.fit(x_train,y_train)

# Made up messages
unseen_msgs = ["WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.",
               "Hello, my name is Lucas",
               "You have still not claimed the compensation you are due for the accident you had. To start the process please reply YES. To opt out text STOP",
              ]

# Reuse the vectorizer that was already fitted on our dataset, so the new messages
# get the same vocabulary (our columns), IDF weights and any preprocessing it was configured with.
# Fitting a new vectorizer here would re-learn the IDF weights from these three messages alone.
unseen_msgs_tfidf = tfidf_vect.transform(unseen_msgs)

# Predict on our messages
y_pred = rf_model.predict(unseen_msgs_tfidf)
print(y_pred)

['spam' 'ham' 'ham']
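
When vectorizing new messages, one option is to call transform on a vectorizer that has already been fitted, so the columns line up with what the model was trained on. The following is a minimal sketch of that idea on a tiny made-up corpus; the messages and variable names are assumptions for illustration and are not taken from the SMS dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the training messages
train_msgs = ["win a free prize now", "are we meeting for lunch", "claim your free reward"]
new_msgs = ["free lunch prize", "hello there"]

vect = TfidfVectorizer()
X_train = vect.fit_transform(train_msgs)  # learns the vocabulary and IDF weights
X_new = vect.transform(new_msgs)          # reuses them; nothing is re-learned

# Both matrices share the same columns, so a model trained on X_train can score X_new
print(X_train.shape, X_new.shape)

# Words never seen during fitting are ignored, so the second message has no non-zero features
print(X_new.toarray()[1].sum())
```

A message made up only of unseen words ends up as an all-zero row, which gives the classifier nothing to work with. That is one reason unfamiliar messages can end up labelled as ham.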


#### Conclusion
As you can see, our model only identified the first message as spam (the third one looks like spam to me as well). This is expected: our dataset only covers about 6k messages, and most likely that is not enough data to make accurate predictions on every message out there.
<br>If we wanted to improve this model, clearly we would need more spam/ham examples to train the model with.