# Applying NLP techniques to build an SMS Spam/Ham detector

The idea of this project is to create a model that can help us determine whether a given message is spam or ham (not spam). I won't dive deep into the theory behind it; instead, I will focus on providing a basic template and code that you can use for this purpose.
<br>We will be using several libraries that are well known in the Python ML community, such as Pandas, scikit-learn and NLTK, so we don't have to reinvent the wheel. I encourage you to dive deeper into each of them to get a better understanding of what they offer.

## Machine Learning Pipeline

### 1) Raw Text - Model can't distinguish words

We will take our data from a dataset provided by Kaggle (you can find it here - https://www.kaggle.com/assumewisely/sms-spam-collection - but I've included it in this project as well).
<br>Everything starts with understanding the format of our data and determining HOW we can process it. Let's do it.

In [6]:
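# Taking a look at the raw format of our data.
# The "with" block closes the file once we are done reading it, and the explicit
# encoding avoids platform-dependent decoding of characters such as "£".
with open("SMSSpamCollection.tsv", "r", encoding="utf-8") as file:
    file_content = file.read()
# Let's display the first 2000 characters of our file
file_content[0:2000]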

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aids patent.\nham\tI HAVE A DATE ON SUNDAY WITH WILL!!\nham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\nspam\tWINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\nspam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera f

As you can see, the file is made up of multiple lines of text (you can tell by the "\n" separator), where each line consists of two columns separated by a tab ("\t"). The first column is the label (either spam or ham) and the second is the content of the SMS.
<br>In other words, this is a tab-separated file.
<br>In this case, we can use a simple function from the Pandas library to read the content and manage it in a more organized way.

In [11]:
# Read the content of the file with Pandas.
import pandas as pd

# A couple of things to note here: our file is tab separated rather than comma separated, so we need to pass in
# the separator. We also set header=None to indicate that the file has no header row, and we supply the column names ourselves.
dataset = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None, names=["label", "sms_content"])

dataset.head(10)

|   | label | sms_content |
|---|-------|-------------|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
| 5 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 6 | spam | WINNER!! As a valued network customer you have... |
| 7 | spam | Had your mobile 11 months or more? U R entitle... |
| 8 | ham | I'm gonna be home soon and i don't want to tal... |
| 9 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
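
Once the data is in a DataFrame, a quick look at how the labels are distributed can be useful. The snippet below is a minimal sketch, assuming the `label` and `sms_content` column names used above. To stay self-contained it builds a tiny DataFrame (called `sample` here, a name chosen for illustration) from four of the messages shown earlier instead of reading the file, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Tiny illustrative sample built from messages displayed earlier;
# with the real data you would use the DataFrame returned by pd.read_csv.
sample = pd.DataFrame(
    {
        "label": ["ham", "spam", "ham", "ham"],
        "sms_content": [
            "Nah I don't think he goes to usf, he lives around here though",
            "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
            "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
            "08452810075over18's",
            "I HAVE A DATE ON SUNDAY WITH WILL!!",
            "Even my brother is not like to speak with me. They treat me like aids patent.",
        ],
    }
)

# How many messages carry each label?
label_counts = sample["label"].value_counts()
print(label_counts)

# Message length in characters, a simple property to compare between classes
sample["length"] = sample["sms_content"].str.len()
print(sample.groupby("label")["length"].mean())
```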


### 2) Tokenize - tell the model what to look at
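
As a rough illustration of this step, tokenizing means splitting each message into the individual words (tokens) the model will look at. The sketch below is one simple way to do it with a regular expression from the standard library; the `tokenize` helper is a hypothetical name, and NLTK offers more sophisticated tokenizers.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into tokens on runs of non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


print(tokenize("I HAVE A DATE ON SUNDAY WITH WILL!!"))
# ['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']
```

Note that splitting on non-word characters also breaks contractions apart ("don't" becomes "don" and "t"), which may or may not be what you want.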

### 3) Clean text - remove stop words/punctuation, stemming, etc.
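
A minimal sketch of what cleaning could look like, covering only punctuation and stop word removal. The `STOP_WORDS` set and the `clean_text` helper are hypothetical and deliberately tiny; NLTK ships a much fuller English stop word list as well as stemmers such as `PorterStemmer`, which are left out here to keep the example dependency-free.

```python
import string

# Hypothetical, deliberately small stop word set used only for illustration
STOP_WORDS = {"a", "an", "the", "i", "to", "on", "with", "is", "in", "and", "he", "my", "me", "not"}


def clean_text(text):
    """Strip punctuation, lowercase, split on whitespace and drop stop words."""
    no_punctuation = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = no_punctuation.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_text("I HAVE A DATE ON SUNDAY WITH WILL!!"))
# ['have', 'date', 'sunday', 'will']
```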

### 4) Vectorize - convert to numeric form
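
To make the idea concrete: vectorizing turns each message into a row of numbers, for example one column per known word holding how often that word appears. The sketch below uses scikit-learn's `CountVectorizer` with its default settings on two of the messages shown earlier. This is one possible choice, not necessarily the vectorizer this project ends up with.

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "I HAVE A DATE ON SUNDAY WITH WILL!!",
    "Nah I don't think he goes to usf, he lives around here though",
]

# By default CountVectorizer lowercases the text and ignores single-character tokens
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # sparse matrix: one row per message

print(counts.shape)                    # (number of messages, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the words that became columns
print(counts.toarray())                # dense view of the word counts
```

Each column index can be looked up through `vectorizer.vocabulary_`, a dictionary mapping a word to its column.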

### 5) Machine Learning Algorithm - fit/train our model
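
As a sketch of what fitting could look like, the snippet below trains a Multinomial Naive Bayes classifier on word counts. The choice of `MultinomialNB` is an assumption made for illustration (it is a common baseline for text classification), and the five training messages are taken from the raw data shown earlier. A real run would use the full dataset and hold out part of it for evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative training set; far too small for a useful model
train_messages = [
    "Nah I don't think he goes to usf, he lives around here though",
    "I HAVE A DATE ON SUNDAY WITH WILL!!",
    "Even my brother is not like to speak with me. They treat me like aids patent.",
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
    "08452810075over18's",
    "WINNER!! As a valued network customer you have been selected to receivea "
    "£900 prize reward! To claim call 09061701461. Claim code KL341. "
    "Valid 12 hours only.",
]
train_labels = ["ham", "ham", "ham", "spam", "spam"]

# Turn the text into word counts, then fit the classifier on those counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Predicting on the training data only checks that the fit worked;
# it is not a measure of how the model performs on unseen messages.
predictions = classifier.predict(X_train)
print(list(predictions))
```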

### 6) Spam filter - system to filter messages
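
One way the final filter could be wired together is sketched below: a scikit-learn pipeline that vectorizes and classifies in one step, wrapped in a helper that keeps only the messages predicted as ham. The helper names `build_spam_filter` and `keep_ham` are hypothetical, and `MultinomialNB` is again an assumed choice. The three training messages come from the raw data shown earlier, and the demo classifies messages the model has already seen, which overstates real-world performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_spam_filter(messages, labels):
    """Fit a vectorizer + classifier pipeline on labelled messages."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)
    return model


def keep_ham(model, incoming):
    """Return only the incoming messages that the model predicts as ham."""
    predicted = model.predict(incoming)
    return [message for message, label in zip(incoming, predicted) if label == "ham"]


ham_1 = "I HAVE A DATE ON SUNDAY WITH WILL!!"
ham_2 = "Nah I don't think he goes to usf, he lives around here though"
spam_1 = (
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
    "08452810075over18's"
)

spam_filter = build_spam_filter([ham_1, ham_2, spam_1], ["ham", "ham", "spam"])

# Messages already seen during training, used here only to show the flow
inbox = keep_ham(spam_filter, [ham_1, spam_1])
print(inbox)
```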