# Naive Bayes Algorithm and Spam Detection

This notebook is my implementation of the Naive Bayes algorithm for spam detection. I used the tutorial in this link as a guide: https://github.com/udacity/machine-learning/blob/master/projects/practice_projects/naive_bayes_tutorial/Naive_Bayes_tutorial.ipynb

In [88]:
# import modules
import urllib
import pandas as pd

# for importing zipfiles from url
from requests import get
from io import BytesIO
from zipfile import ZipFile

# machine learning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


## Get data and import into a Pandas dataframe

In [82]:
# download zipfile from website
data_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
request = get(data_path)
data_downloaded = BytesIO(request.content)
zip_file = ZipFile(data_downloaded)

# read the data file from the archive and decode its raw bytes into a string
spam_data = zip_file.read('SMSSpamCollection').decode('utf-8')

# write data to a file
with open('spam_data.txt', 'w', encoding='utf-8') as f:
    f.write(spam_data)

In [83]:
# Import into a pandas dataframe
df = pd.read_csv('spam_data.txt', 
                 sep='\t',
                 header=None,
                 names=['label', 'sms_message']
                )
df.head()
print(df.shape)

(5572, 2)


## Convert string labels into binary variables

In [84]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


## Bag of words

Machine learning algorithms need numeric features for training. To get numeric features from text, use the Bag of Words model: it converts each sentence into a row of columns, one column per word in the vocabulary, holding the number of times that word occurs in the sentence.

The steps to convert text into features using the Bag of Words model are:
1. Convert all strings to lower case
2. Remove all punctuation
3. Tokenize the sentences (split each sentence into individual words using a delimiter and save them to a list)
4. Count the occurrences of each word in each sentence, using the tokens generated in step 3
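
The four steps above can be sketched by hand with only the standard library. This is a minimal illustration, not how scikit-learn implements it: the helper name `bag_of_words` is hypothetical, punctuation is assumed to mean ASCII punctuation, and whitespace is assumed to be the delimiter.

```python
import string
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

def bag_of_words(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Translation table that deletes every ASCII punctuation character
    strip_punct = str.maketrans('', '', string.punctuation)
    # Steps 1-3: lower-case, remove punctuation, split on whitespace
    tokenized = [doc.lower().translate(strip_punct).split() for doc in docs]
    # Step 4: count the tokens of each document
    counts = [Counter(tokens) for tokens in tokenized]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted(set().union(*tokenized))
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows

vocabulary, rows = bag_of_words(documents)
print(vocabulary)
for row in rows:
    print(row)
```

One difference worth knowing: with its default settings, `CountVectorizer` only keeps tokens of two or more characters, whereas this sketch keeps single-character words as well. The sample sentences here contain none, so the counts agree.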

The example below demonstrates how to do this using the `CountVectorizer` class in `sklearn.feature_extraction.text`.

In [94]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

count_vector = CountVectorizer(lowercase=True)

# learn vocabulary
count_vector.fit(documents)

# transform documents to a document-term matrix, then to an array
doc_array = count_vector.transform(documents).toarray()
print(doc_array)

# load into dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = count_vector.get_feature_names_out()
frequency_matrix = pd.DataFrame(doc_array, 
                                columns=feature_names)
frequency_matrix.head()

[[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]


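|   | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|-----|------|------|-------|------|-----|----|-------|-----|----------|-----|-----|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |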


## Split data into training and testing
Split the messages and their labels into a training set (75%) and a test set (25%). These proportions are the defaults of `train_test_split` when no `test_size` is given.

In [91]:
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1
                                                   )

print("Full data size:", df.shape[0])
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

Full data size: 5572
Training set size: 4179
Test set size: 1393


## Apply Bag of Words processing to the split data
We fit the vocabulary only on the training data. Here's a great explanation of why we do this: https://sebastianraschka.com/faq/docs/scale-training-test.html. The basic idea is that we treat the test data as new and unseen. Therefore, we cannot learn the vocabulary of this new data by calling `fit()` on it, and must instead reuse the vocabulary learned from the training set.

In [97]:
count_vector = CountVectorizer()
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)
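
To see what reusing the training vocabulary means in practice, here is a small sketch on made-up messages (`train_docs` and `test_docs` are hypothetical, not taken from the SMS dataset). Words that appear only in the test message get no column, so the test matrix always has the same width as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, for illustration only
train_docs = ['win money now', 'call me tomorrow']
test_docs = ['win a free prize now']

vectorizer = CountVectorizer()
# Learn the vocabulary from the training messages and count them
train_matrix = vectorizer.fit_transform(train_docs)
# Count the test message against the already-learned vocabulary
test_matrix = vectorizer.transform(test_docs)

# Column order follows the alphabetically sorted vocabulary
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(test_matrix.toarray())
```

Here "free" and "prize" were never seen during fitting, so they are dropped; only "now" and "win" are counted for the test message.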

## Bayes Theorem review

The example in the tutorial finds the probability that a person has diabetes given that they got a positive test result. Let the probability of having diabetes, denoted by $P(D)$, be 0.01. This is often referred to as the __base rate__ or the __prior__ probability. Assume that the test is 90% accurate in both directions. This means that the probability of testing positive given that you have diabetes is $P(Pos|D) = 0.9$. Likewise, the probability of testing negative given that you don't have diabetes is $P(Neg|\neg D) = 0.9$.

The probability we are interested in is

$$P(D|Pos) = \frac{P(Pos|D)P(D)}{P(Pos)}$$

$P(Pos)$ is the probability of testing positive regardless of whether you have diabetes or not. People without diabetes can still test positive because the test is imperfect. The formula for $P(Pos)$ is

$$P(Pos) = P(D)P(Pos|D) + P(\neg D)P(Pos|\neg D)$$

Intuitively, $P(D)P(Pos|D)$ is the probability of testing positive _and_ having diabetes. $P(\neg D)P(Pos|\neg D)$ is the probability of testing positive _and_ not having diabetes, where $P(\neg D) = 1 - P(D)$ and $P(Pos|\neg D) = 1 - P(Neg|\neg D)$. Their sum, therefore, is just the probability of testing positive, whether or not you have diabetes.

Putting all of this together, the probability of having diabetes given that you tested positive is:

In [100]:
prob_d = 0.01
prob_pos_d = 0.9
prob_neg_no_d = 0.9
prob_pos = prob_d * prob_pos_d + (1 - prob_d) * (1 - prob_neg_no_d)

prob_d_pos = (prob_pos_d * prob_d) / prob_pos

print(prob_d_pos)

0.08333333333333336
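
As a check on the printed value, the same arithmetic can be worked by hand from the numbers given above. The probability of not having diabetes is $1 - 0.01 = 0.99$, and the probability of a positive result without diabetes is $1 - 0.9 = 0.1$, so

$$P(Pos) = 0.01 \times 0.9 + 0.99 \times 0.1 = 0.009 + 0.099 = 0.108$$

$$P(D|Pos) = \frac{0.9 \times 0.01}{0.108} = \frac{0.009}{0.108} = \frac{1}{12} \approx 0.0833$$

The trailing digits in the printed result are floating-point rounding. Even with a test that is 90% accurate, a positive result implies only about an 8.3% chance of diabetes, because the base rate is so low.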

