# Naive Bayes Algorithm and Spam Detection

This notebook implements the Naive Bayes algorithm for spam detection. It follows this tutorial: https://github.com/udacity/machine-learning/blob/master/projects/practice_projects/naive_bayes_tutorial/Naive_Bayes_tutorial.ipynb

In [81]:
# import modules
import urllib
import pandas as pd

# for importing zipfiles from url
from requests import get
from io import BytesIO
from zipfile import ZipFile

# machine learning
from sklearn.feature_extraction.text import CountVectorizer

## Get data and import into a Pandas dataframe

In [82]:
# download zipfile from website
data_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
request = get(data_path)
data_downloaded = BytesIO(request.content)
zip_file = ZipFile(data_downloaded)

# read the 'SMSSpamCollection' file from the zip archive and decode the raw bytes into a string
spam_data = zip_file.read('SMSSpamCollection').decode('utf-8')

# write data to a file (explicit UTF-8 so it matches the default encoding pd.read_csv expects)
with open('spam_data.txt', 'w', encoding='utf-8') as f:
    f.write(spam_data)
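
As an alternative to writing an intermediate text file, the archive member can be handed to pandas directly. The sketch below is only an illustration: it builds a tiny in-memory zip as a stand-in for the downloaded archive (the two rows and the names `buffer` and `toy_df` are made up for the example), then reads the member with `pd.read_csv`.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Build a tiny in-memory zip that stands in for the downloaded archive
# (hypothetical two-row sample, not the real data set).
buffer = BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('SMSSpamCollection', 'ham\tOk lar...\nspam\tFree entry\n')
buffer.seek(0)

# Open the member as a file-like object and pass it straight to pandas.
with ZipFile(buffer) as zip_file:
    with zip_file.open('SMSSpamCollection') as member:
        toy_df = pd.read_csv(member,
                             sep='\t',
                             header=None,
                             names=['label', 'sms_message'])

print(toy_df.shape)
```

With the real download, `buffer` would be replaced by the `BytesIO(request.content)` object created in the cell above.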

In [83]:
# Import into a pandas dataframe
df = pd.read_csv('spam_data.txt', 
                 sep='\t',
                 header=None,
                 names=['label', 'sms_message']
                )
df.head()
print(df.shape)

(5572, 2)


## Convert string labels into binary variables

In [84]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
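
One thing worth checking after the mapping: `Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so a misspelled or differently cased label would silently become missing. A minimal sketch of that check, using a made-up toy series (`labels`, with a deliberately capitalized `'Spam'`) rather than the real column:

```python
import pandas as pd

# Toy labels standing in for df['label']; 'Spam' is a deliberate mismatch.
labels = pd.Series(['ham', 'spam', 'ham', 'Spam'])
mapped = labels.map({'ham': 0, 'spam': 1})

# Values missing from the dictionary come back as NaN.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

On the real dataframe the same check would be `df['label'].isna().sum()`, which should be zero if every label was either `ham` or `spam`.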


## Bag of words

Machine learning algorithms need numeric features for training. To obtain them, use the Bag of Words model, which converts each sentence into a row of columns, one column per word in the data, holding the number of times that word occurs in the sentence.

The steps to convert text into features using the Bag of Words model are:
1. Convert all strings to lower case
2. Remove all punctuation
3. Tokenize the sentences (split each sentence into individual words using a delimiter and save them to a list)
4. Count the occurrences of each word in each sentence, using the tokens generated in step 3
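
The four steps above can be sketched by hand with the standard library. This is only an illustration under simple assumptions (whitespace as the delimiter, `string.punctuation` as the set of punctuation to strip); the names `sample_docs` and `bag_of_words` are made up for the example.

```python
import string
from collections import Counter

sample_docs = ['Hello, how are you!',
               'Win money, win from home.',
               'Call me now.',
               'Hello, Call hello you tomorrow?']

def bag_of_words(docs):
    """Return one Counter of word frequencies per document."""
    remove_punct = str.maketrans('', '', string.punctuation)
    counts = []
    for doc in docs:
        lowered = doc.lower()                       # step 1: lower case
        no_punct = lowered.translate(remove_punct)  # step 2: strip punctuation
        tokens = no_punct.split()                   # step 3: tokenize on whitespace
        counts.append(Counter(tokens))              # step 4: count each word
    return counts

frequency_list = bag_of_words(sample_docs)
print(frequency_list[1])
```

Unlike this sketch, `CountVectorizer` with its default settings only keeps tokens of two or more characters, so the two approaches can differ on single-character words.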

The example below demonstrates these steps using the `CountVectorizer` class from `sklearn.feature_extraction.text`.

In [85]:
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

count_vector = CountVectorizer(lowercase=True)
count_vector.fit(documents)

# transform documents to a document-term matrix, then to an array
doc_array = count_vector.transform(documents).toarray()
print(doc_array)

# load into dataframe
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = count_vector.get_feature_names_out()
frequency_matrix = pd.DataFrame(doc_array, 
                                columns=feature_names)
frequency_matrix.head()

[[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]


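|   | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|-----|------|------|-------|------|-----|----|-------|-----|----------|-----|-----|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |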
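
To see how the fitted vectorizer treats text it has not seen, the sketch below refits the same four documents and transforms a new sentence. The sentence in `new_doc` is made up for illustration. Words that were not in the fitted vocabulary (here `free`) are ignored, and with the default settings the single-character token `a` is dropped as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

count_vector = CountVectorizer(lowercase=True)
count_vector.fit(documents)

# vocabulary_ maps each token to its column index in the document-term matrix
win_column = count_vector.vocabulary_['win']
print(win_column)

# Transform a new (hypothetical) message with the already-fitted vocabulary
new_doc = ['Win a free call now']
new_array = count_vector.transform(new_doc).toarray()
print(new_array)
```

Only the `call`, `now` and `win` columns are non-zero in the result, because those are the only words of the new sentence that appear in the fitted vocabulary.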
