**Aim :** To implement Naive Bayes classifier in python 

**Theory :** Naive bayes is a relatively simple probabilistic classfication algorithm that is well suitable for categorical data (probabilities can be compuated as simple ratios) and uses the bayes theorem together with a strong (hence "naive") independence assumption. The basic idea behind Naive Bayes is that it assigns a probability to every category (finite outcome variable) based on the features in the data and chooses the outcome that is most likely as its prediction.

The "Naive" in the name refers to the algorithm assuming features in the data are independent conditional on the outcome category. For example suppose we were doing spam text classification, then given a spam text "Free, sign up now!", Naive Bayes would assume "Free", "sign", "up, "now" all occur indepedently of each other (that is $Pr(Free, sign, up, now|spam)$ = $Pr(Free|spam)\times Pr(sign|spam)\times Pr(up|spam) \times Pr(now|spam)$). This conditional independence assumption is considered to be a strong assumption that often doesn't hold in practice, hence the resulting probabilities from Naive Bayes are not to be taken too seriously. However the classifications resulting from Naive Bayes can still be accurate.

In machine learning, common application of Naive Bayes are spam email classification, sentiment analysis, document categorization. Naive bayes is advantageous over other commonly used classification algorithms in its simplicity, speed, and its accuracy on small data sets. Since Naive Bayes needs to be trained on a labeled data set it considered to a supervisd learning algorithm.

## DATA DESCRIPTION

We are using UCI machine learning repository that contains several youtube comments from popular music videos. Each comment is labelled as spam/ham(1/0).

In [1]:
#importing modules
import pandas as pd
import numpy as np
import re

In [2]:
#loading dataset
df = pd.read_csv('./YoutubeCommentsSpam.csv')

#creating column's labels
df.columns = ['comment','label']
df.head()


Unnamed: 0,comment,label
0,+447935454150 lovely girl talk to me xxx,1
1,I always end up coming back to this song<br />,0
2,"my sister just received over 6,500 new <a rel=...",1
3,Cool,0
4,Hello I am from Palastine,1


## DATA CLEANING

The table shows that this data set consist of 1959 yt comments and 49% of them are ham and 51% are spam. With average length of each comment as 96 characters.

In [4]:
#add column length
df['length'] = df['comment'].map(lambda text:len(text))

#summary of dataset
df[['label','length']].describe()

Unnamed: 0,label,length
count,1959.0,1959.0
mean,0.512506,94.34048
std,0.499971,128.717314
min,0.0,2.0
25%,0.0,28.5
50%,1.0,47.0
75%,1.0,97.5
max,1.0,1199.0


Splitting the dataset into training and testing parts

In [5]:
#(75% training and 25% testing)

np.random.seed(2017)
df['uniform'] = np.random.uniform(0,1,len(df.index))

df_train = df[df['uniform'] < 0.75]
df_test = df[df['uniform'] > 0.75]

df_train['label'].describe()

count    1485.000000
mean        0.509764
std         0.500073
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: label, dtype: float64

In [6]:
df_test['label'].describe()

count    474.000000
mean       0.521097
std        0.500083
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: label, dtype: float64

## IMPLEMENTING NAIVE BAYES CLASSIFIER

In [7]:
#joining all the comments into a big list
train_word_list = ''.join(df_train.iloc[:,0].values)

#split the list into unique words
train_unique_words = set(train_word_list.split(' '))

#number of unique words in df_train
vocab_size_train = len(train_unique_words)

#summary
print('unique words in df_train: %s' % vocab_size_train)
print('top 5 words: \n %s' % list(train_unique_words)[1:6])

unique words in df_train: 5898
top 5 words: 
 ['Hi.', 'enI', 'album', 'tsÅ«,', 'mv']


In [8]:
#only keep letters and numbers
train_unique_words = [re.sub(r'[^a-zA-Z0-9]','',words) for words in train_unique_words]

#convert to lower case and get unique set of words
train_unique_words = set([words.lower() for words in train_unique_words])

#summary
print('unique words in df_train: %s' % vocab_size_train)
print('top 5 words: \n %s' % list(train_unique_words)[1:6])

unique words in df_train: 5898
top 5 words: 
 ['songshicheck', 'itttttttt', 'chanel', 'fit', 'meaty']


In [9]:
#dict with comment words as 'keys', and their label as 'count'
trainSpamWords = dict()
trainHamWords = dict()

spamWordsCount = 0
hamWordsCount = 0

#initialize prob. of
pSpam = 0.0
pHam = 0.0

#laplace smoothing
alpha = 1

In [10]:
#initialize dict of words and their labels
for word in train_unique_words:
    trainSpamWords[word] = 0
    trainHamWords[word] = 0
    

In [11]:
#count no. of times word in comment appear in spam and ham comments
def processComment(comment,label):
    global spamWordsCount
    global hamWordsCount
    
    #split comment into words
    comment = comment.split(' ')
    
    for word in comment:
        #spam comments
        if(label == 1 and word != ' '):
            trainSpamWords[word] = trainSpamWords.get(word,0)+1
            spamWordsCount += 1
            
        elif(label == 0 and word != ' '):
            trainHamWords[word] = trainHamWords.get(word,0)+1
            hamWordsCount += 1

In [12]:
# Define P(word|spam) and P(word|ham)
def conditionalWord(word,label):
    
    #laplace smoothing
    global alpha
    
    #word in spam comment
    if(label == 1):
        #P(word|spam)
        return (trainSpamWords.get(word,0)+alpha)/(float)(spamWordsCount + vocab_size_train)
    
    #word in ham comment
    if(label == 0):
        #P(word|ham)
        return (trainHamWords.get(word,0)+alpha)/(float)(hamWordsCount + vocab_size_train)
    
    

In [13]:
# Define P(spam|comment) or P(ham|comment)
def conditionalComment(comment,label):
    
    #initialize conditional prob
    prob_label_comment = 1.0
    
    #Split comments into words
    comment = comment.split(' ')
    
    for word in comment:
        prob_label_comment *= conditionalWord(word,label)
        
    return prob_label_comment    

## TRAINING

In [14]:
#training here is computing several conditional probs
def train():
    
    global pSpam
    global pHam
    
    #initialize
    total = 0
    spamCount = 0
    
    for idx,row in df_train.iterrows():
        
        if row.label == 1:
            spamCount +=1
            
        total += 1
        
        processComment(row.comment,row.label)
        
    #compute prior probabilities P(spam), P(ham)
    pSpam = spamCount/float(total)
    pHam = (total - spamCount)/float(total)
    
    print('Training complete')
    

In [15]:
# run the train function
train()

Training complete


In [16]:
# Classify comment as spam or ham
def classify(comment):
    
    global pSpam
    global pHam
    
    #compute value proportional to P(comment|spam)
    isSpam = pSpam * conditionalComment(comment,1)
    
    #compute value proportional to P(comment|ham)
    isHam = pSpam * conditionalComment(comment,0)
    
    #Output True = spam, False = ham
    return (isSpam > isHam)
    

In [17]:
# Initialize spam prediction in test data
prediction_test = []

# Get prediction accuracy on test data
for comment in df_test['comment']:
    # CLassify comment
    prediction_test.append(classify(comment))
    
# Check accuracy
test_accuracy = np.mean(np.equal(prediction_test, df_test['label']))

print('Accuracy: %s' % test_accuracy)

Accuracy: 0.8037974683544303


## TESTING MODAL

In [18]:
# spam
classify("please check out my channel")

True

In [19]:
# spam
classify('call me on +912321')

True

In [20]:
# ham 
classify('very nice song keep it up')

False

In [21]:
#ham
classify('video was very good I really enjoyed her singing')


False

## USING SK LEARN

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df['comment'], df['label'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_test_counts = count_vect.transform(X_test)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)
pred = clf.predict(X_test_tfidf)
print(pred[:10])

[1 1 0 1 1 1 1 1 0 0]


# Confusion matrix

In [23]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)

array([[212,  25],
       [ 11, 242]], dtype=int64)

# Conlusion

Thus we have successfully implemented naive bayes classifier to classify youtube comments as spam or not spam.