# Week 7 - Naive Bayes for Spam Classification


Naive Bayes is a relatively simple probabilistic classification algorithm that is well suited to categorical data.

In machine learning, common applications of Naive Bayes include spam email classification, sentiment analysis, and document categorization. Its advantages over other commonly used classification algorithms are its simplicity, its speed, and its accuracy on small data sets.

## Data Description

We will be using data from the UCI Machine Learning Repository that contains YouTube comments from very popular music videos. Each comment in the data has been labeled as either spam or ham (a legitimate comment). We will use this data to train our Naive Bayes algorithm for YouTube comment spam classification.

In [1]:
# Import modules

# For data manipulation
import pandas as pd

# For matrix operations
import numpy as np

# For regex
import re

In [2]:
# Load data from the 'YoutubeCommentsSpam.csv' file using pandas
data_comments = pd.read_csv("YoutubeCommentsSpam.csv")

# Create column labels: 'content' and 'label'. 
# tips: the 'columns' attribute can be of help 
data_comments.columns = ["content", "label"]

# display the first rows of our dataset to make sure that the labels have been added
data_comments.head(10)

Unnamed: 0,content,label
0,+447935454150 lovely girl talk to me xxx,1
1,I always end up coming back to this song<br />,0
2,"my sister just received over 6,500 new <a rel=...",1
3,Cool,0
4,Hello I am from Palastine,1
5,Wow this video almost has a billion views! Did...,0
6,Go check out my rapping video called Four Whee...,1
7,Almost 1 billion,0
8,Aslamu Lykum... From Pakistan,1
9,Eminem is idol for very people in España and ...,0


$\textbf{WARNING: DO NOT check the links in the spam comments! ;)}$

In [3]:
# Show spam comments in data
# DO NOT FOLLOW THE LINKS BELOW!!! Seriously, they're spam...
print(data_comments.content[data_comments.label == 1])

0                +447935454150 lovely girl talk to me xxx
2       my sister just received over 6,500 new <a rel=...
4                               Hello I am from Palastine
6       Go check out my rapping video called Four Whee...
8                           Aslamu Lykum... From Pakistan
10                            Help me get 50 subs please 
12      Alright ladies, if you like this song, then ch...
15      <a href="https://www.facebook.com/groups/10087...
16                  Take a look at this video on YouTube:
17                 Check out our Channel for nice Beats!!
19                    Check out this playlist on YouTube:
21                                            like please
24      I shared my first song &quot;I Want You&quot;,...
25      Come and check out my music!Im spamming on loa...
26                    Check out this playlist on YouTube:
27      HUH HYUCK HYUCK IM SPECIAL WHO S WATCHING THIS...
30      Check out this video on YouTube:<br /><br />Lo...
33            

Browsing the comments that have been labeled as spam in this data, it seems that they are either unrelated to the video or are some form of advertisement. The phrase "check out" seems to be very popular in these comments.

## Summary Statistics and Data Cleaning

The table below shows that this data set consists of $1959$ YouTube comments; about $49\%$ of them are legitimate comments and about $51\%$ are spam. Because the two classes are roughly balanced, accuracy is a meaningful measure of how well our algorithm performs on the test data set. The average comment length is about $94$ characters, which is roughly $15$ words per comment.

In [4]:
# Add another column with corresponding comment length
# tips: use apply and lambda
data_comments['length'] = data_comments['content'].apply(lambda x: len(x))

# Display summary statistics (mean, stdev, min, max)
data_comments.describe()

Unnamed: 0,label,length
count,1959.0,1959.0
mean,0.512506,94.34048
std,0.499971,128.717314
min,0.0,2.0
25%,0.0,28.5
50%,1.0,47.0
75%,1.0,97.5
max,1.0,1199.0


To evaluate our Naive Bayes classification algorithm, we will split the data into a training set and a test set. The training set will be used to train the spam classification algorithm, and the test set will only be used to test its accuracy. In general, the training set should be bigger than the test set, and both should be drawn from the same population (in our case, YouTube comments on music videos). We will randomly select about $75\%$ of the data for training and the remaining $25\%$ or so for testing.

In [5]:
# Let's split data into training and test set (75% training, 25% test)

# Set seed so we get same random allocation on each run of code
np.random.seed(2019)

# Add column vector 'uniform' of randomly generated numbers from 0 to 1 
# tips: in numpy there is a method to draw sample from a uniform distribution
data_comments["uniform"] = np.random.uniform(0,1,data_comments.shape[0])

# As the numbers in our 'uniform' column are uniformly distributed, 
# about 75% of them should be less than 0.75, let's grab those 75%
# (.copy() gives an independent DataFrame, so adding columns to it later
# does not raise pandas' SettingWithCopyWarning)
data_comments_train = data_comments[data_comments["uniform"] <= 0.75].copy()

# same for the remaining 25% of the numbers, which should be greater than 0.75
data_comments_test = data_comments[data_comments["uniform"] > 0.75].copy()
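
Thresholding a uniform draw gives a split that is only approximately 75/25. As a minimal sketch, assuming an exact split size is wanted instead, one could shuffle the row positions and cut the shuffled list. The helper name `split_indices` and its parameters are hypothetical and are not part of this notebook.

```python
import random

def split_indices(n_rows, train_fraction=0.75, seed=2019):
    """Return (train_positions, test_positions) with an exact split size."""
    rng = random.Random(seed)              # local generator, reproducible
    positions = list(range(n_rows))
    rng.shuffle(positions)                 # random order of row positions
    cut = int(round(n_rows * train_fraction))
    return sorted(positions[:cut]), sorted(positions[cut:])

train_pos, test_pos = split_indices(1959)
print(len(train_pos), len(test_pos))
```

With pandas, the two lists of positions could then be passed to `data_comments.iloc[...]` to build the training and test frames.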


In [6]:
data_comments.head(10)


Unnamed: 0,content,label,length,uniform
0,+447935454150 lovely girl talk to me xxx,1,40,0.903482
1,I always end up coming back to this song<br />,0,46,0.393081
2,"my sister just received over 6,500 new <a rel=...",1,200,0.62397
3,Cool,0,4,0.637877
4,Hello I am from Palastine,1,25,0.880499
5,Wow this video almost has a billion views! Did...,0,73,0.299172
6,Go check out my rapping video called Four Whee...,1,58,0.702198
7,Almost 1 billion,0,16,0.903206
8,Aslamu Lykum... From Pakistan,1,29,0.881382
9,Eminem is idol for very people in España and ...,0,69,0.40575


In [7]:
# Check that the training data has both spam and ham comments
data_comments_train.label.describe()

count    1467.000000
mean        0.507157
std         0.500119
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: label, dtype: float64

In [8]:
# Same for the test data 
data_comments_test.label.describe()

count    492.000000
mean       0.528455
std        0.499698
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: label, dtype: float64

Both the training and test data have a good mix of spam and ham comments, so we are ready to move on to training the Naive Bayes classifier. 

In [9]:
# Join all the comments into one big string
# tips: 'separator'.join(list)
training_list_words = ' '.join(data_comments_train.content)

# Split the string of comments into a sorted list of unique words
# tips: use sorted() and set()
train_unique_words = sorted(set(training_list_words.split(' ')))

# Number of unique words in training 
vocab_size_train = len(train_unique_words)

In [10]:
train_unique_words[:10]

['',
 '!',
 '!!',
 '!!!',
 '!!!!',
 '!!!!!',
 '!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!<br']

In [11]:
training_list_words[:50]

'I always end up coming back to this song<br /> my '

In [12]:
# Description of summarized comments in training data
print('Unique words in training data: %s' % vocab_size_train)
print('First 5 words in our unique set of words: \n % s' % list(train_unique_words)[1:6])

Unique words in training data: 5494
First 5 words in our unique set of words: 
 ['!', '!!', '!!!', '!!!!', '!!!!!']


Your output should look something like this:

```
Unique words in training data: 5898
First 5 words in our unique set of words: 
['now!!!!!!', 'yellow', 'four', '/>.Pewdiepie', 'Does']
```


Currently "now!!" and "now!!!!", as well as "DOES", "DoEs", and "does", are all considered to be unique words. For the purposes of spam classification, it's probably better to process the data slightly to increase accuracy. In our case we can keep only letters and numbers, and convert all the comments to lower case.
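
As a small illustration of this normalization, here is a sketch of a tokenizer that applies both steps to a single comment. The function name `normalize_tokens` is hypothetical; it uses the same regular expression as the cells in this notebook, but it also drops tokens that become empty after cleaning.

```python
import re

def normalize_tokens(text):
    """Lowercase a comment, strip non-alphanumeric characters from each
    whitespace-separated token, and drop tokens that end up empty."""
    tokens = []
    for raw in text.split():
        cleaned = re.sub(r'[^a-z0-9A-Z]+', '', raw).lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens

print(normalize_tokens("now!! NOW!!!! DoEs does <br />"))
# ['now', 'now', 'does', 'does', 'br']
```

After this step, "now!!" and "NOW!!!!" map to the same token, which is what shrinks the vocabulary.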

In [13]:
# Only keep letters and numbers
# tips: use regex and list comprehension
train_unique_words = [re.sub(r'[^a-z0-9A-Z]+','', x) for x in train_unique_words]

# Convert to lower case and get unique set of words
# tips: use sorted and set ?
train_unique_words = sorted(set([x.lower() for x in train_unique_words]))

# Number of unique words in training 
vocab_size_train = len(train_unique_words)

# Description of summarized comments in training data
print('Unique words in processed training data: %s' % vocab_size_train)
print('First 5 words in our processed unique set of words: \n % s' % list(train_unique_words)[1:6])

Unique words in processed training data: 3487
First 5 words in our processed unique set of words: 
 ['0', '025', '04', '1', '10']


## Naive Bayes for Spam Classification

OK, so here's the deal:

- first, our data is split into 2 subsets: train and test (done above)

- then we create several functions to count how many times each word appears in spam and non-spam comments, and to compute the probability of each word appearing in spam/non-spam comments

- then come the 2 most important functions: train() and classify()

- and finally we check the accuracy of our predictions
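
Written out, the quantities that the functions below compute are the following. The notation is our own assumption and is not taken from the notebook: $c$ is a comment, $w$ a word, $n_{w,y}$ the number of times $w$ appears in training comments of class $y$, $N_y$ the total word count for class $y$, $V$ the vocabulary size, and $\alpha$ the Laplace smoothing parameter.

>$$\Pr(y \mid c) \propto \Pr(y)\prod_{w \in c}\Pr(w \mid y), \qquad \Pr(w \mid y) = \frac{n_{w,y} + \alpha}{N_y + \alpha V}, \qquad y \in \{\text{spam}, \text{ham}\}.$$

A comment is classified as spam when its spam score is larger than its ham score.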

Let's code!

In [14]:
# Dictionaries with words as "keys", and their counts in spam/ham comments as "values"
trainPositive = dict()
trainNegative = dict()

# Initialize the total word counts of each class to zero
positiveTotal = 0
negativeTotal = 0

# Initialize the prior probabilities to zero, but float ;) 
pSpam = 0.0
pNotSpam = 0.0

# Laplace smoothing
alpha = 1.0

In [15]:
# Initialize the per-word counts for both classes
for word in train_unique_words:
    # Skip empty words
    if not word.strip():
        continue
    # Start every count at zero; training will increment them
    trainPositive[word] = 0
    trainNegative[word] = 0

In [16]:
# Count the number of times each word appears in spam and ham comments
def processComment(comment,label):
    global negativeTotal
    global positiveTotal
    global trainNegative
    global trainPositive
    
    # Split the comment into words and keep only letters and numbers
    comment = comment.split()
    comment = [re.sub(r'[^a-z0-9A-Z]+','', x) for x in comment]
    
    # Go over each word in the comment
    for word in comment:
        word = word.lower()
        # Skip empty words and words outside the training vocabulary
        if not word or word not in trainPositive:
            continue
        if not label:      
            # Ham: increment trainNegative and negativeTotal
            trainNegative[word] += 1
            negativeTotal += 1
        else:
            # Spam: increment trainPositive and positiveTotal
            trainPositive[word] += 1
            positiveTotal += 1

In [17]:
# Define Prob(word|spam) and Prob(word|ham)
def conditionalWord(word,label):
   
    # Laplace smoothing parameter (alpha): we act as if each word appeared at least
    # once in spam and once in ham, so that the probability is never 0
    # reminder: to assign to a global variable inside a function 
    # you have to declare it using the keyword 'global'
    global trainPositive
    global trainNegative
    
    # Words never seen in training get a count of zero in both classes
    if word not in trainPositive:
        trainPositive[word] = 0
        trainNegative[word] = 0
    if label == 0:
        # Compute Prob(word|ham)
        return (trainNegative[word] + alpha) / float(negativeTotal + alpha * vocab_size_train)
           
    # word in spam comment
    else:
        # Compute Prob(word|spam)
        return (trainPositive[word] + alpha) / float(positiveTotal + alpha * vocab_size_train)
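
To see what the smoothing does numerically, here is a small sketch with made-up counts. The function name `smoothed_probability` and all the numbers are hypothetical; only the formula mirrors `conditionalWord`.

```python
def smoothed_probability(word_count, class_total, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of Prob(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical class with 8 word occurrences in total and a 4-word vocabulary
seen = smoothed_probability(word_count=3, class_total=8, vocab_size=4)
unseen = smoothed_probability(word_count=0, class_total=8, vocab_size=4)
print(seen, unseen)

# The smoothed probabilities over the whole vocabulary still sum to 1
counts = [3, 5, 0, 0]
total_probability = sum(smoothed_probability(c, 8, 4) for c in counts)
print(total_probability)
```

A word that never appeared in the class still gets a small positive probability (here 1/12), so a single unseen word cannot force the whole product to zero.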

In [18]:
# Define Prob(comment|spam) or Prob(comment|ham)
def conditionalComment(comment,label):
    
    # Initialize conditional probability
    prob_label_comment = 1.0
    
    # Split comments into list of words
    comment = comment.split(' ')
    comment = [re.sub(r'[^a-z0-9A-Z]+','', x) for x in comment]
    
    # Go through all words in comments
    for word in comment:
        word = word.lower()
        # Conditional independence is assumed here
        prob_label_comment *= conditionalWord(word,label)
    
    return prob_label_comment
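
One practical caveat: a product of many probabilities below 1 can underflow to `0.0` for long comments (the longest comment in this data has 1199 characters). A common remedy is to sum logarithms instead of multiplying. The sketch below illustrates the idea; `log_score` and its `word_probability` argument are hypothetical names, and this is not the method used to produce the results in this notebook.

```python
import math

def log_score(words, prior, word_probability):
    """Return log(prior) + sum of log Prob(word | class) over the words.

    word_probability is any function mapping a word to a probability > 0,
    for example a Laplace-smoothed estimate.
    """
    total = math.log(prior)
    for word in words:
        total += math.log(word_probability(word))
    return total

# Hypothetical toy model: every word has probability 1e-4
words = ["check"] * 100
direct_product = 0.5
for _ in words:
    direct_product *= 1e-4        # underflows to 0.0 well before the end

score = log_score(words, 0.5, lambda w: 1e-4)
print(direct_product)             # 0.0
print(score)                      # a finite negative number
```

Because the logarithm is increasing, comparing the two log scores gives the same decision as comparing the two products whenever the products do not underflow.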

In [19]:
# Train naive bayes by computing the word counts and prior probabilities in training data
def train():
    # reminder: we will need pSpam and pNotSpam here ;) 
    global pSpam
    global pNotSpam

    # Initialize our variables: the total number of comments and the number of spam comments 
    total = 0
    num_spam = 0
    
    # Go over each comment in training data 
    print('Starting training...')
    for idx in range(data_comments_train.shape[0]):
        row = data_comments_train.iloc[idx]
        # check if comment is spam or not
        if row.label == 1:
            num_spam += 1
        total += 1
        # update the word counts for spam or ham, depending on the label
        processComment(row.content, row.label)
    
    # Compute prior probabilities, P(spam), P(ham)
    pSpam = num_spam / total
    pNotSpam = 1 - pSpam
    print('Training done')


In [20]:
# Run naive bayes
train()

Starting training...
Training done


In [21]:
pNotSpam

0.49284253578732107

In [32]:
# Classify comment as spam or ham
def classify(comment):
    
    # pSpam and pNotSpam are only read here, so no 'global' declaration is needed
    
    # Compute value proportional to Pr(ham|comment)
    isNegative = conditionalComment(comment, 0) * pNotSpam
    
    # Compute value proportional to Pr(spam|comment)
    isPositive = conditionalComment(comment, 1) * pSpam
    
    # Output True = spam, False = ham depending on the 2 previously computed variables
    return (isPositive > isNegative)

In [33]:
# Get predictions on the test data
data_comments_test['prediction'] = data_comments_test.content.apply(classify)

# Check accuracy: 
# first the number of correct predictions 
correct_labels = data_comments_test[data_comments_test.label == data_comments_test.prediction].shape[0]
print(correct_labels)
# then the proportion of correct predictions
test_accuracy = correct_labels / data_comments_test.shape[0]

print("Proportion of comments classified correctly on test set: %s" % test_accuracy)

427
Proportion of comments classified correctly on test set: 0.8678861788617886
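
Accuracy alone does not say which kind of mistake the classifier makes. As a hedged sketch, one could also count false positives (ham flagged as spam) and false negatives (spam that slips through). The function `confusion_counts` and the toy labels below are hypothetical; they are not results from this data set.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives/negatives, with spam as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, predicted in zip(labels, predictions):
        if predicted and label:
            counts["tp"] += 1
        elif predicted and not label:
            counts["fp"] += 1
        elif not predicted and label:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical labels and predictions, for illustration only
example = confusion_counts([1, 1, 0, 0, 1], [True, False, False, True, True])
precision = example["tp"] / (example["tp"] + example["fp"])
recall = example["tp"] / (example["tp"] + example["fn"])
print(example, precision, recall)
```

In this notebook the two inputs would be the `label` and `prediction` columns of `data_comments_test`.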




Let's try to write some comments to see whether they are classified as spam or ham. Recall that "True" is for spam comments and "False" is for ham comments. 
Try your own!

In [34]:
# spam
classify("Guys check out my new chanell")

True

In [35]:
# spam
classify("I have solved P vs. NP, check my video https://www.youtube.com/watch?v=dQw4w9WgXcQ")

True

In [36]:
# ham
classify("I liked the video")

False

In [37]:
# ham
classify("Its great that this video has so many views")

False

## To Go Further: Extending Bag of Words by Using TF-IDF

So far we have been using the Bag of Words model to represent comments as vectors. 
The "Bag of Words" is a list of all unique words found in the training data. Each comment can then be represented by a vector that contains the frequency of each unique word in the comment. 

For example, if the training data contains the words $(hi, how, my, grade, are, you),$ then the text "how are you you" can be represented by $(0,1,0,0,1,2).$ The main reason we do this in our application is that comments can vary in length, but the number of unique words (and so the length of each vector) stays fixed. 
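
The example above can be reproduced with a few lines of code. This is only a sketch; the function name `bag_of_words_vector` is hypothetical and is not used elsewhere in this notebook.

```python
def bag_of_words_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocabulary = ["hi", "how", "my", "grade", "are", "you"]
vector = bag_of_words_vector("how are you you", vocabulary)
print(vector)  # [0, 1, 0, 0, 1, 2]
```

Every comment maps to a vector of the same length as the vocabulary, no matter how long the comment is.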

In our context, the TF-IDF is a measure of how important a word is in a comment relative to all the words in our training data. For example, if a word such as "the" appeared in most of the comments, its TF-IDF would be small, as this word does not help us differentiate across comments. 
Note that "TF" stands for "Term Frequency", and "IDF" stands for "Inverse Document Frequency". In particular, "TF", denoted by $tf(w,c)$, is the number of times the word $w$ appears in the given comment $c$, whereas "IDF" is a measure of how much information a given word provides in differentiating comments. Specifically, $IDF$ is formulated as 

>$idf(w, D) = \log\left(\frac{\text{Number of comments in train data $D$}}{\text{Number of comments containing the word $w$}}\right).$ 

To combine "TF" and "IDF", we simply take the product, hence 
>$$TFIDF = tf(w,c) \times idf(w, D) = (\text{Number of times $w$ appears in comment $c$})\times \log\left(\frac{\text{Number of comments in train data $D$}}{\text{Number of comments containing the word $w$}}\right).$$

Now the TF-IDF can be used to weight the vectors that result from the "Bag of Words" approach. For example, suppose a comment contains "this" 2 times, hence $tf = 2$. 
If we then had 1000 comments in our training data, and the word "this" appeared in 100 of them, then $idf = \log(1000/100) = \log(10) \approx 2.30$ using the natural logarithm (as `np.log` does). Therefore, in this example, the TF-IDF weight for the word "this" appearing twice in a particular comment would be $2 \times \log(10) \approx 4.61$. 
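
The arithmetic of this example can be checked directly. The sketch below assumes the natural logarithm, which is what `np.log` computes; the function name `tf_idf` is hypothetical.

```python
import math

def tf_idf(term_count, n_comments, n_comments_with_word):
    """tf * log(N / df), using the natural logarithm."""
    return term_count * math.log(n_comments / n_comments_with_word)

weight = tf_idf(term_count=2, n_comments=1000, n_comments_with_word=100)
print(round(weight, 3))  # 4.605
```

A word that appears in every comment would get a weight of exactly 0, since $\log(1) = 0$.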

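To incorporate TF-IDF into the Naive Bayes setting, we can compute 

>$$Pr(word|spam) = \frac{\sum_{c \text{ is spam}}TFIDF(word,c,D) + 1}{\sum_{w}\sum_{c \text{ is spam}}TFIDF(w,c,D)+ \text{Number of unique words in data}},$$ 

>where $TFIDF(word,c,D) = TF(word,c) \times IDF(word,D)$, the outer sum in the denominator runs over all words $w$ in the vocabulary, and the $+1$ in the numerator together with the number of unique words in the denominator is the same Laplace smoothing (with $\alpha = 1$) used earlier. 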

In [108]:
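def ft_cleartext(text, word):
    # Clean the text the same way as the comments, then count how many
    # times `word` occurs in it (an occurrence count, not a 0/1 indicator)
    text = text.lower().split(' ')
    text = [re.sub(r'[^a-z0-9A-Z]+','', x) for x in text]
    return text.count(word)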

In [115]:
# Compute tfidf(word, comment, data)
def TFIDF(comment, train):
    
    # Split comment into list of cleaned, lower-case words
    comment = comment.lower().split(' ')
    comment = [re.sub(r'[^a-z0-9A-Z]+','', x) for x in comment]
    
    # Initialize tf-idf for given comment
    tfidf_comment = np.zeros(len(comment))
    
    # Go over all words in comment
    for word_index, word in enumerate(comment):
        
        # Compute term frequency (tf): count frequency of word in comment
        tf = comment.count(word)
        
        # Total number of occurrences of the word across the training comments
        # (note: this counts repeats within a comment, so it is an upper bound
        # on the number of comments containing the word)
        num_comment_word = train['content'].apply(lambda x: ft_cleartext(x, word)).sum()
        
        # Compute inverse document frequency (idf)
        # log(Total number of comments / count computed above)
        idf = np.log(len(train.index) / num_comment_word)
        
        # Update tf-idf weight for word
        tfidf_comment[word_index] = tf * idf
        
    return tfidf_comment

In [116]:
TFIDF("Check out my new music video plz",data_comments_train)



array([1.29452269, 1.23888561, 1.24596946, 2.83662748, 2.49518423,
       1.86162915, 4.3465358 ])
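
The `TFIDF` function above rescans every training comment for each word, and it divides by a total occurrence count rather than by the number of comments containing the word. As a minimal sketch of one alternative, assuming true document frequencies are wanted, they can be precomputed once with `collections.Counter`. The names `tokenize`, `document_frequencies`, and `tfidf_weights` are hypothetical, and this is not the code that produced the results in this notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, keep only letters and digits."""
    cleaned = (re.sub(r'[^a-z0-9]+', '', t) for t in text.lower().split())
    return [t for t in cleaned if t]

def document_frequencies(comments):
    """Map each word to the number of comments that contain it."""
    df = Counter()
    for comment in comments:
        df.update(set(tokenize(comment)))   # set(): count each comment once
    return df

def tfidf_weights(comment, df, n_comments):
    """Return {word: tf * log(N / df)} for words seen in training."""
    tf = Counter(tokenize(comment))
    return {w: c * math.log(n_comments / df[w]) for w, c in tf.items() if df[w] > 0}

# Hypothetical three-comment corpus, for illustration only
corpus = ["check out my channel", "check check this out", "nice song"]
df = document_frequencies(corpus)
weights = tfidf_weights("check my song", df, len(corpus))
print(df["check"])   # 2
print(weights)
```

Computing the frequencies once makes training roughly linear in the total number of words, instead of rescanning the whole training set for every word of every comment.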

In [146]:
# Dictionaries with words as "keys", and their summed tf-idf weights in spam/ham comments as "values"
trainPositive_tfidf = dict()
trainNegative_tfidf = dict()

# Initialize the total tf-idf weight of each class to zero
positiveTotal_tfidf = 0
negativeTotal_tfidf = 0

# Initialize the prior probabilities to zero, but float ;) 
pSpam_tfidf = 0.0
pNotSpam_tfidf = 0.0

counter = 0

In [147]:
for word in train_unique_words:
    # Skip empty words
    if not word.strip():
        continue
    trainPositive_tfidf[word] = 0
    trainNegative_tfidf[word] = 0

In [148]:
def processComment_tfidf(comment,label):
    global negativeTotal_tfidf
    global positiveTotal_tfidf
    global trainNegative_tfidf
    global trainPositive_tfidf
    global counter
    
    # Show progress: number of comments processed so far
    counter += 1
    print(counter, end='\r')
    
    tfidf_comment = TFIDF(comment, data_comments_train)
    # Tokenize exactly as TFIDF does (split on ' '), so that zip() pairs
    # each word with its own weight
    comment = comment.lower().split(' ')
    comment = [re.sub(r'[^a-z0-9A-Z]+','', x) for x in comment]
    
    for (word, word_tfidf) in zip(comment, tfidf_comment):
        # Skip empty words
        if not word:
            continue
        # Skip words outside the training vocabulary
        if word not in trainPositive_tfidf:
            trainPositive_tfidf[word] = 0
            trainNegative_tfidf[word] = 0
            continue
        if not label:      
            trainNegative_tfidf[word] += word_tfidf
            negativeTotal_tfidf += word_tfidf
        else:
            trainPositive_tfidf[word] += word_tfidf
            positiveTotal_tfidf += word_tfidf
            

In [149]:
# And now implement TFIDF with your classifier function
# Have fun :D

def conditionalWord_tfidf(word,label):
    
    # Words never seen in training get a weight of zero in both classes
    if word not in trainPositive_tfidf:
        trainPositive_tfidf[word] = 0
        trainNegative_tfidf[word] = 0
    if label == 0:
        # Compute Prob(word|ham)
        return (trainNegative_tfidf[word] + alpha) / float(negativeTotal_tfidf + alpha * vocab_size_train)
           
    # word in spam comment
    else:
        # Compute Prob(word|spam)
        return (trainPositive_tfidf[word] + alpha) / float(positiveTotal_tfidf + alpha * vocab_size_train)


In [150]:
def conditionalComment_tfidf(comment,label):
    prob_label_comment = 1.0
    
    comment = comment.lower().split(' ')
    comment = [re.sub(r'[^a-z0-9A-Z]+','', x) for x in comment]
    for word in comment:
        prob_label_comment *= conditionalWord_tfidf(word, label)
    
    return prob_label_comment

In [151]:
# Train naive bayes with TF-IDF weights by computing several conditional probabilities in training data
def train_tfidf():
    # reminder: we will need pSpam_tfidf and pNotSpam_tfidf here ;) 
    global pSpam_tfidf
    global pNotSpam_tfidf

    # Go over each comment in training data and accumulate its TF-IDF weights
    print('Starting training...')
    data_comments_train.apply(lambda x: processComment_tfidf(x.content, x.label), axis=1)

    # The number of spam comments and the total number of comments
    num_spam = data_comments_train['label'].sum()
    total = data_comments_train.shape[0]
    
    # Compute prior probabilities, P(spam), P(ham)
    pSpam_tfidf = num_spam / total
    pNotSpam_tfidf = 1 - pSpam_tfidf
    print('Training done')

In [152]:
train_tfidf()

Starting training...
1



Training done


In [153]:
# Classify comment as spam or ham
def classify_tfidf(comment):
    
    # pSpam_tfidf and pNotSpam_tfidf are only read here, so no 'global' declaration is needed
    
    # Compute value proportional to Pr(ham|comment)
    isNegative = conditionalComment_tfidf(comment, 0) * pNotSpam_tfidf
    
    # Compute value proportional to Pr(spam|comment)
    isPositive = conditionalComment_tfidf(comment, 1) * pSpam_tfidf
    
    # Output True = spam, False = ham depending on the 2 previously computed variables
    return (isPositive > isNegative)

In [154]:
# Get predictions on the test data
data_comments_test['prediction_tfidf'] = data_comments_test.content.apply(classify_tfidf)

# Check accuracy: 
# first the number of correct predictions 
correct_labels_tfidf = data_comments_test[data_comments_test.label == data_comments_test.prediction_tfidf].shape[0]
print(correct_labels_tfidf)
# then the proportion of correct predictions
test_accuracy = correct_labels_tfidf / data_comments_test.shape[0]

print("Proportion of comments classified correctly on test set: %s" % test_accuracy)

423
Proportion of comments classified correctly on test set: 0.8597560975609756


