# Guided Project: Building a Spam Filter Using Naive Bayes
In this guided project, we apply what we learned about conditional probability and the Naive Bayes algorithm to build a spam filter. We use a dataset of 5,572 SMS messages that humans have already labeled as spam or non-spam ("ham"). The data was collected by Tiago A. Almeida and José María Gómez Hidalgo, and it is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [566]:
import pandas as pd 

In [567]:
texts = pd.read_csv('Datasets/SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [568]:
texts.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [569]:
texts.shape

(5572, 2)

In [570]:
texts['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Splitting the Data into a Training and Testing Set

In [571]:
# Shuffle once, then split: 4458 rows (about 80%) for training, 1114 rows (about 20%) for testing.
shuffled = texts.sample(frac=1, random_state=1)
train = shuffled.head(4458).copy()
test = shuffled.tail(1114).copy()

In [572]:
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

In [573]:
train["Label"].value_counts(normalize=True) * 100 

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [574]:
test["Label"].value_counts(normalize=True) * 100 

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

Both sets contain roughly 87% ham and 13% spam, close to the proportions in the full dataset, so the split is representative.
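
As a side note, the row counts 4458 and 1114 do not have to be hard-coded: with 5,572 rows, `round(5572 * 0.8)` gives 4458. The sketch below shows one way to compute the cut point on a small toy frame. The names `split_train_test` and `toy` are hypothetical and not part of the project code.

```python
import pandas as pd

def split_train_test(df, train_frac=0.8, seed=1):
    """Shuffle the rows once, then cut at round(len(df) * train_frac)."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    cut = round(len(shuffled) * train_frac)
    train_part = shuffled.iloc[:cut]
    test_part = shuffled.iloc[cut:].reset_index(drop=True)
    return train_part, test_part

# Hypothetical toy data: 10 labeled messages.
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

toy_train, toy_test = split_train_test(toy)
print(len(toy_train), len(toy_test))
```

Because the frame is shuffled only once, the two parts cannot overlap.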

## Cleaning the Data

In [575]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [576]:
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()

In [577]:
train['SMS'] = train['SMS'].str.split()

In [578]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [579]:
# Collect each distinct word once, in order of first appearance.
vocabulary = list(dict.fromkeys(word for row in train['SMS'] for word in row))

In [580]:
len(vocabulary)

7783

In [581]:
count_dic = {}
for word in vocabulary:
    count_dic[word] = [0] * len(train['SMS'])

In [582]:
for index, sms in enumerate(train['SMS']):
    for word in sms:
        count_dic[word][index] += 1

In [583]:
train_count = pd.DataFrame(count_dic)

In [584]:
train_count.shape

(4458, 7783)

In [585]:
train.shape

(4458, 2)

In [586]:
train = pd.concat([train, train_count], axis=1)

In [587]:
train.shape

(4458, 7785)
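
To make the transformation above concrete, here is a small sketch of the same idea on a three-message toy corpus: one column per vocabulary word, one row per message, each cell holding the number of times the word occurs in that message. The data and the names `toy_sms`, `toy_vocab`, and `toy_counts` are hypothetical, and `collections.Counter` is used here as a convenience rather than being the approach taken in the cells above.

```python
from collections import Counter

import pandas as pd

# Hypothetical, already-tokenized messages.
toy_sms = [
    ['free', 'prize', 'free'],
    ['see', 'you', 'soon'],
    ['free', 'soon'],
]

# Distinct words in order of first appearance.
toy_vocab = list(dict.fromkeys(word for sms in toy_sms for word in sms))

# One dict per message; Counter returns 0 for words the message lacks.
rows = []
for sms in toy_sms:
    counts = Counter(sms)
    rows.append({word: counts[word] for word in toy_vocab})

toy_counts = pd.DataFrame(rows)
print(toy_counts)
```

Each row sums to the length of the corresponding message, which is a quick sanity check on a table like `train_count`.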

## Calculating the Constant Values

In [588]:
spam_train = train[ train['Label'] == 'spam']
ham_train = train[ train['Label'] == 'ham']

In [589]:
# Total number of words across all spam messages.
n_spam = sum(len(row) for row in spam_train['SMS'])

In [590]:
n_spam

15190

In [591]:
# Total number of words across all ham messages.
n_ham = sum(len(row) for row in ham_train['SMS'])

In [592]:
n_ham

57237

In [593]:
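n_vocab = len(vocabulary)                      # number of distinct words in the training set
p_spam = spam_train.shape[0] / train.shape[0]  # prior P(Spam)
p_ham = ham_train.shape[0] / train.shape[0]    # prior P(Ham)
alpha = 1                                      # additive (Laplace) smoothing parameter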

In [594]:
ham_prob = {}
for word in vocabulary:
    ham_prob[word] = 0
spam_prob = {}
for word in vocabulary:
    spam_prob[word] = 0

In [595]:
for word in vocabulary:
    p_word_given_spam = (spam_train[word].sum() + alpha) /(n_spam + alpha*n_vocab)
    spam_prob[word] = p_word_given_spam
for word in vocabulary:
    p_word_given_ham = (ham_train[word].sum() + alpha) /(n_ham + alpha*n_vocab)
    ham_prob[word] = p_word_given_ham
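
Written as formulas, the two loops above compute additive (Laplace) smoothed estimates. The symbols are notation introduced here for illustration: $N_{w \mid Spam}$ is `spam_train[word].sum()`, $N_{Spam}$ is `n_spam`, $N_{Vocabulary}$ is `n_vocab`, and $\alpha$ is `alpha`; the ham quantities are defined the same way.

```latex
P(w \mid Spam) = \frac{N_{w \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w \mid Ham) = \frac{N_{w \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}

% With the values computed above (alpha = 1):
%   spam denominator: 15190 + 1 * 7783 = 22973
%   ham denominator:  57237 + 1 * 7783 = 65020
% A vocabulary word that never occurs in spam therefore gets
%   P(w | Spam) = (0 + 1) / 22973, roughly 4.35e-5, instead of 0.
```

The smoothing term matters because a single zero factor would otherwise force the whole product for that class to zero.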

## Creating the Spam Filter Function 

In [596]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_prob:
            p_spam_given_message *= spam_prob[word]
        if word in ham_prob:
            p_ham_given_message *= ham_prob[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [597]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [598]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
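
The values printed above are already on the order of 1e-21 to 1e-27 for short messages, and a product of many more small factors can underflow to 0.0 in floating point. One common workaround, which is not part of this project's code, is to add logarithms instead of multiplying probabilities. The sketch below illustrates the idea with hypothetical toy parameters (`toy_p_spam`, `toy_p_ham`, `toy_spam_prob`, `toy_ham_prob`) standing in for the values computed earlier; `classify_log` is likewise a hypothetical name.

```python
import math
import re

# Hypothetical toy parameters, not the values learned from the training set.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_prob = {'free': 0.05, 'prize': 0.04, 'see': 0.001, 'you': 0.01}
toy_ham_prob = {'free': 0.002, 'prize': 0.0005, 'see': 0.02, 'you': 0.03}

def classify_log(message):
    """Compare log-scores; sums of logs replace products of probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_spam_prob:
            log_spam += math.log(toy_spam_prob[word])
        if word in toy_ham_prob:
            log_ham += math.log(toy_ham_prob[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('Free prize!!'))
print(classify_log('See you'))
```

Because the logarithm is strictly increasing, comparing the log-scores selects the same label as comparing the original products whenever those products do not underflow.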


## Testing the Filter on the Test Data 

In [599]:
def test_classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_prob:
            p_spam_given_message *= spam_prob[word]
        if word in ham_prob:
            p_ham_given_message *= ham_prob[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [600]:
test['Prediction'] = test['SMS'].apply(test_classify)

In [601]:
test.head()

Unnamed: 0,Label,SMS,Prediction
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [602]:
correct = 0 
total = test.shape[0]

In [603]:
for index, row in test.iterrows():
    if row['Label'] == row['Prediction']:
        correct += 1

In [604]:
correct

1100

In [605]:
accuracy = correct / total * 100 

In [606]:
accuracy

98.74326750448833
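
The same kind of accuracy figure can also be computed without an explicit loop, and a cross-tabulation shows which kinds of errors occur. The sketch below uses a hypothetical five-row frame, `toy_results`, rather than the real `test` data, so its numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical labels and predictions for five messages.
toy_results = pd.DataFrame({
    'Label':      ['ham', 'ham', 'spam', 'spam', 'ham'],
    'Prediction': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# The mean of a boolean Series is the fraction of True values.
toy_accuracy = (toy_results['Label'] == toy_results['Prediction']).mean() * 100

# Rows are true labels, columns are predicted labels.
toy_confusion = pd.crosstab(toy_results['Label'], toy_results['Prediction'])

print(toy_accuracy)
print(toy_confusion)
```

In the table, the off-diagonal cells separate ham flagged as spam from spam that slipped through as ham, which a single accuracy number hides.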

In [607]:
test_wrong = test[ test['Label'] != test['Prediction']]
test_wrong.head(20) #The wrongly classified messages. 

Unnamed: 0,Label,SMS,Prediction
114,spam,Not heard from U4 a while. Call me now am here...,ham
135,spam,More people are dogging in your area now. Call...,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification
302,ham,No calls..messages..missed calls,spam
319,ham,We have sent JD for Customer Service cum Accou...,spam
504,spam,Oh my god! I've found your number again! I'm s...,ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham


## Conclusion
Using the Naive Bayes algorithm, we built a spam filter that correctly classified 1,100 of the 1,114 test messages, an accuracy of 98.74%.