# Building a Spam Filter with Naive Bayes

**Project Goal**: Design a filter to detect spam SMS messages.

**Dataset**: The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more details on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

# Part 1: Exploratory Data Analysis

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label', 'SMS'])

In [3]:
print(df.columns)

Index(['Label', 'SMS'], dtype='object')


In [4]:
print(str(len(df)))

5572


In [5]:
print(df.head(3))

  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...


In [6]:
def print_label_totals(df,column,labels):
    for label in labels:
        total_rows = len(df)
        num_label = len(df[df[column] == label])
        print(str(num_label)+" out of "+str(total_rows)+" are label: "+label+" - "+str((num_label/total_rows)*100)+"%")

## Part 2: Dividing test and training data

* Reserving 20% of the data for testing, 80% for training the model
* We'll randomize the dataset before splitting

In [9]:
randomized_df = df.sample(frac=1,random_state=1)
num_training_rows = int(len(df)*.8)
training_set = df.iloc[:num_training_rows]
test_set = df.iloc[num_training_rows:]
print_label_totals(training_set,'Label',['ham','spam'])
print_label_totals(test_set,'Label',['ham','spam'])

3855 out of 4457 are label: ham - 86.49315683194975%
602 out of 4457 are label: spam - 13.506843168050258%
970 out of 1115 are label: ham - 86.99551569506725%
145 out of 1115 are label: spam - 13.004484304932735%
