## Introduction to Naive Bayes

Naive Bayes is a simple yet powerful classification algorithm based on Bayes theorem, with an assumption of independence among the predictors. The Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to the presence of any other feature. It can be used for both binary and multi-class classification problems.

## Bayes Theorem

Bayes theorem describes the probability of an event based on prior knowledge of conditions that may be related to the event. It gives us a way to compute a conditional probability.

Assume we have a Hypothesis(H) and evidence(E),

According to Bayes theorem, the relationship between the probability of Hypothesis before getting the evidence represented as P(H) and the probability of the hypothesis after getting the evidence represented as P(H|E) is:
 
P(H|E) = P(E|H)*P(H)/P(E)

Prior probability = P(H) is the probability of the hypothesis before getting the evidence

Posterior probability = P(H|E) is the probability of the hypothesis after getting the evidence

In general, for a classification problem,
 
P(class|data) = (P(data|class) * P(class)) / P(data)
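
To make the formula concrete, here is a minimal sketch of how the independence assumption turns it into a computation. The priors and per-feature likelihoods are made-up toy numbers, not values from any dataset in this notebook.

```python
import numpy as np

# Hypothetical toy numbers: two classes, two observed features
priors = np.array([0.6, 0.4])          # P(class)
likelihoods = np.array([[0.5, 0.2],    # P(feature_j | class 0)
                        [0.1, 0.7]])   # P(feature_j | class 1)

# Independence assumption: P(data|class) is the product over the features
joint = priors * likelihoods.prod(axis=1)   # P(data|class) * P(class)

# Dividing by P(data), the sum over classes, gives P(class|data)
posterior = joint / joint.sum()
print(posterior)
```

The class with the largest posterior is the predicted class; here that is class 0.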

## Bayes Theorem Example

Suppose we want to find the probability that a randomly picked card is a King, given that it is a face card.

There are 4 Kings in a deck of cards, which implies that P(King) = 4/52

All the Kings are face cards, so P(Face|King) = 1

There are 3 face cards in a suit of 13 cards and there are 4 suits in total, so P(Face) = 12/52

Therefore,

P(King|Face) = P(Face|King)*P(King)/P(Face) = (1 * 4/52)/(12/52) = 1/3
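
The same arithmetic can be checked in a few lines of Python. This is a small sketch using exact fractions; the variable names are illustrative and not part of the original notebook.

```python
from fractions import Fraction

# Probabilities for a standard 52-card deck
p_king = Fraction(4, 52)          # P(King): 4 kings in the deck
p_face = Fraction(12, 52)         # P(Face): 3 face cards x 4 suits
p_face_given_king = Fraction(1)   # P(Face|King): every king is a face card

# Bayes theorem: P(King|Face) = P(Face|King) * P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)  # 1/3
```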

There are various types of Naive Bayes models, but we will focus mainly on Gaussian and Multinomial.

https://scikit-learn.org/stable/modules/naive_bayes.html

In [2]:
import pandas as pd
import numpy  as np
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# First Example using Iris dataset

In [3]:
iris = pd.read_csv(r'Datasets\iris.csv')  # raw string so the backslash is not read as an escape

In [4]:
train,test = train_test_split(iris,test_size = 0.3) 
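
The split above is random, so the counts and accuracies shown later will vary from run to run. A minimal sketch of a reproducible, class-balanced split follows; it assumes scikit-learn's bundled iris data as a stand-in for the CSV file, and the seed value 42 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the notebook's iris.csv
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```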

In [5]:
x = train.iloc[:,0:4]
y = train.iloc[:,4]

In [6]:
ignb = GaussianNB()
imnb = MultinomialNB()

In [7]:
pred_gnb = ignb.fit(x,y).predict(test.iloc[:,0:4])
pred_mnb = imnb.fit(x,y).predict(test.iloc[:,0:4])

In [15]:
test.Species.value_counts()

virginica     18
versicolor    14
setosa        13
Name: Species, dtype: int64

In [8]:
confusion_matrix(test.iloc[:,4],pred_gnb)

array([[13,  0,  0],
       [ 0, 13,  1],
       [ 0,  0, 18]], dtype=int64)

In [9]:
pd.crosstab(test.iloc[:,4].values.flatten(),pred_gnb)

col_0,setosa,versicolor,virginica
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
setosa,13,0,0
versicolor,0,13,1
virginica,0,0,18


In [16]:
np.mean(pred_gnb == test.iloc[:,4].values.flatten())

0.9777777777777777
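
The accuracy above is the fraction of predictions that match the true labels. As a sketch, scikit-learn's `accuracy_score` gives the same number as the `np.mean` comparison; the toy labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: three of the four predictions are correct
y_true = np.array(["setosa", "versicolor", "virginica", "virginica"])
y_pred = np.array(["setosa", "versicolor", "versicolor", "virginica"])

manual = np.mean(y_pred == y_true)         # fraction of matching labels
library = accuracy_score(y_true, y_pred)   # same quantity from scikit-learn
print(manual, library)  # 0.75 0.75
```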

In [17]:
confusion_matrix(test.iloc[:,4],pred_mnb)

array([[13,  0,  0],
       [ 0, 14,  0],
       [ 0, 14,  4]], dtype=int64)

In [18]:
pd.crosstab(test.iloc[:,4].values.flatten(),pred_mnb) # flatten() converts a 2D array into a 1D array

col_0,setosa,versicolor,virginica
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
setosa,13,0,0
versicolor,0,14,0
virginica,0,14,4


In [19]:
np.mean(pred_mnb == test.iloc[:,4].values.flatten())

0.6888888888888889
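
The confusion matrix also shows where the Multinomial model goes wrong. A minimal sketch, re-entering the matrix printed above by hand (rows are actual classes, columns are predicted classes), computes the per-class recall and the overall accuracy:

```python
import numpy as np

# Multinomial NB confusion matrix from the output above
# (class order: setosa, versicolor, virginica)
cm = np.array([[13, 0, 0],
               [0, 14, 0],
               [0, 14, 4]])

# Recall per class: correct predictions divided by actual count of that class
recall = cm.diagonal() / cm.sum(axis=1)

# Overall accuracy: all correct predictions divided by all predictions
accuracy = cm.trace() / cm.sum()
print(recall, accuracy)
```

Only 4 of the 18 virginica flowers are recovered; the rest are predicted as versicolor, which accounts for the entire drop in accuracy.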

# Second Example using Diabetes dataset

In [20]:
Diabetes = pd.read_csv("Datasets\\Diabetes_RF.csv")

In [21]:
Diabetes.shape

(768, 9)

In [22]:
Diabetes.head(5)

Unnamed: 0,Number of times pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps skin fold thickness,2-Hour serum insulin,Body mass index,Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,YES
1,1,85,66,29,0,26.6,0.351,31,NO
2,8,183,64,0,0,23.3,0.672,32,YES
3,1,89,66,23,94,28.1,0.167,21,NO
4,0,137,40,35,168,43.1,2.288,33,YES


In [23]:
colnames = list(Diabetes.columns)

In [30]:
colnames

[' Number of times pregnant',
 ' Plasma glucose concentration',
 ' Diastolic blood pressure',
 ' Triceps skin fold thickness',
 ' 2-Hour serum insulin',
 ' Body mass index',
 ' Diabetes pedigree function',
 ' Age (years)',
 ' Class variable']
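
The column names listed above each start with a space, which is easy to miss when selecting columns by name. One option, sketched here on a hypothetical two-column frame rather than the real file, is to strip the whitespace once after loading:

```python
import pandas as pd

# Hypothetical frame with the same kind of padded column names
df = pd.DataFrame({" Age (years)": [50, 31], " Class variable": ["YES", "NO"]})

# Remove leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()
print(list(df.columns))  # ['Age (years)', 'Class variable']
```

The notebook itself avoids the problem by selecting columns by position from `colnames`.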

In [24]:
predictors = colnames[:8]
target = colnames[8]

In [25]:
DXtrain,DXtest,Dytrain,Dytest = train_test_split(Diabetes[predictors],Diabetes[target],test_size=0.3, random_state=7)

In [26]:
Dgnb = GaussianNB()
Dmnb = MultinomialNB()

In [27]:
Dpred_gnb = Dgnb.fit(DXtrain,Dytrain).predict(DXtest)
Dpred_mnb = Dmnb.fit(DXtrain,Dytrain).predict(DXtest)

In [28]:
confusion_matrix(Dytest,Dpred_gnb) 

array([[116,  31],
       [ 29,  55]], dtype=int64)

In [34]:
DXtest.shape

(231, 8)

In [29]:
pd.crosstab(Dytest,Dpred_gnb)

col_0,NO,YES
Class variable,Unnamed: 1_level_1,Unnamed: 2_level_1
NO,116,31
YES,29,55


In [38]:
print("Accuracy",(116+55)/(116+55+31+29)) # Testing accuracy of the Dgnb model

Accuracy 0.7402597402597403


In [40]:
confusion_matrix(Dytest,Dpred_mnb)

array([[99, 48],
       [41, 43]], dtype=int64)

In [41]:
pd.crosstab(Dytest,Dpred_mnb)

col_0,NO,YES
Class variable,Unnamed: 1_level_1,Unnamed: 2_level_1
NO,99,48
YES,41,43


In [44]:
print ("Accuracy",(99+43)/(99+43+48+41))

Accuracy 0.6147186147186147


# Third Example using Salary Dataset

In [45]:
salary_train = pd.read_csv(r"Datasets\SalaryData_Train.csv")  # raw strings keep the backslash literal
salary_test = pd.read_csv(r"Datasets\SalaryData_Test.csv")

In [46]:
salary_train.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [47]:
#Creating a list of categorical variables
string_columns=["workclass","education","maritalstatus","occupation","relationship","race","sex","native"]

In [48]:
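# Transforming categorical variables into numerical categories
from sklearn import preprocessing
number = preprocessing.LabelEncoder()
for i in string_columns:
    # Fit on the training column only, then reuse the same mapping for the test column
    # (assumes the test data contains no category that is missing from the training data)
    salary_train[i] = number.fit_transform(salary_train[i])
    salary_test[i] = number.transform(salary_test[i])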

In [49]:
salary_train.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,5,9,13,4,0,1,4,1,2174,0,40,37,<=50K
1,50,4,9,13,2,3,0,4,1,0,0,13,37,<=50K
2,38,2,11,9,0,5,1,4,1,0,0,40,37,<=50K
3,53,2,1,7,2,5,0,2,1,0,0,40,37,<=50K
4,28,2,9,13,2,9,5,2,0,0,0,40,4,<=50K
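
An encoder learns its category-to-integer mapping when it is fitted, so the train and test columns only share a mapping if the same fitted encoder is applied to both. The sketch below shows one way to do this with scikit-learn's `OrdinalEncoder`, which handles several columns at once and can map unseen test categories to a sentinel value. The toy frames are hypothetical stand-ins for `salary_train` and `salary_test`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for salary_train / salary_test
train = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                      "race": ["White", "Black", "White"]})
test = pd.DataFrame({"sex": ["Female", "Male"],
                     "race": ["Asian", "White"]})   # "Asian" is unseen in train

# Fit once on the training data; unseen test categories are encoded as -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train)
test_enc = encoder.transform(test)
print(test_enc)
```

Categories are numbered in sorted order, so `Female` is 0 and `Male` is 1 in both frames.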


In [50]:
# Why not use dummy variables here? Because after label encoding the number of columns remains the same
colnames = salary_train.columns
len(colnames)

14
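
To illustrate the point about column counts, here is a small sketch on a hypothetical two-column frame: dummy (one-hot) encoding adds one column per category, whereas label encoding would keep a single column.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"],
                   "age": [38, 39, 53]})

# One new column per distinct workclass value replaces the original column
dummies = pd.get_dummies(df, columns=["workclass"])
print(df.shape, dummies.shape)  # (3, 2) (3, 3)
```

With eight categorical columns, some of them with dozens of categories, the dummy-encoded salary data would be much wider than the 14 columns kept here.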

In [51]:
trainX = salary_train[colnames[0:13]]
trainY = salary_train[colnames[13]]
testX  = salary_test[colnames[0:13]]
testY  = salary_test[colnames[13]]

In [52]:
sgnb = GaussianNB()
smnb = MultinomialNB()

In [53]:
spred_gnb = sgnb.fit(trainX,trainY).predict(testX)
spred_mnb = smnb.fit(trainX,trainY).predict(testX)

In [54]:
#Testing accuracy for Gaussian model on this data
print(confusion_matrix(testY,spred_gnb))
print ("Accuracy",(10759+1209)/(10759+601+2491+1209))

[[10759   601]
 [ 2491  1209]]
Accuracy 0.7946879150066402


In [57]:
salary_test.shape

(15060, 14)

In [55]:
pd.crosstab(testY,spred_gnb)

col_0,<=50K,>50K
Salary,Unnamed: 1_level_1,Unnamed: 2_level_1
<=50K,10759,601
>50K,2491,1209


In [59]:
#Testing accuracy for Multinomial model on this data
print(confusion_matrix(testY,spred_mnb))
print("Accuracy",(10891+780)/(10891+469+2920+780))

[[10891   469]
 [ 2920   780]]
Accuracy 0.7749667994687915


# Fourth and final Example on Text data using ham_spam Dataset for classifying ham mails/messages and spam mails/messages

In [60]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

In [61]:
# Opening this dataset directly from the Datasets folder in Jupyter Home does not work; an explicit encoding is needed
email_data = pd.read_csv(r"Datasets\ham_spam.csv",encoding = "ISO-8859-1")

In [62]:
email_data.head()

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or å£10,000..."
4,spam,okmail: Dear Dave this is your final notice to...


In [63]:
email_data.shape

(5559, 2)

In [64]:
import re # Regular expression

In [65]:
# Data cleaning function; different data may need different logic
def cleaning_text(i):
    # Replace every run of non-letter characters with a space and lowercase the text
    i = re.sub("[^A-Za-z]+", " ", i).lower()
    # Keep only words longer than 5 characters
    w = []
    for word in i.split(" "):
        if len(word) > 5:
            w.append(word)
    return " ".join(w)

In [68]:
# How the split function works on string data
"This is Awsome 123 1312 $#%$# a i he yu nwj".split(" ")

['This', 'is', 'Awsome', '123', '1312', '$#%$#', 'a', 'i', 'he', 'yu', 'nwj']

In [69]:
# How our cleaning_text function works on the following example data
cleaning_text("this is awsome beutiful flowers pending from you 1231312 $#%$# a i he yu nwj")

'awsome beutiful flowers pending'

In [70]:
#Another example
cleaning_text("Hope you are having a good week. Just checking in")

'having checking'

In [71]:
email_data.text = email_data.text.apply(cleaning_text)

In [72]:
email_data.text

0                                         having checking
1                                                  thanks
2                                                        
3        complimentary holiday urgent collection landline
4         okmail notice collect tenerife holiday landline
                              ...                        
5554       giving really miracle reason everything looked
5555                     awesome remember somebody diesel
5556                       another customer london please
5557    energy channel leadership skills strong psychi...
5558                                               having
Name: text, Length: 5559, dtype: object

In [73]:
# User-defined function used for counting words and creating the bag of words
def split_into_words(i):
    return (i.split(" "))

In [74]:
from sklearn.model_selection import train_test_split

In [75]:
# Splitting the data into train and test
email_train,email_test = train_test_split(email_data,test_size=0.3)

In [76]:
#Creating a bag of words
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
emails_bow = CountVectorizer(analyzer=split_into_words).fit(email_data.text)

In [77]:
# ["mailing","body","texting"] #record 1
# ["mailing","awesome","good"] #record 2

# ["mailing","body","texting","good","awesome"] #total words



#        "mailing" "body" "texting" "good" "awesome"
#  0          1        1       1        0       0
 
#  1          1        0        0       1       1 
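
The toy table in the comments above can be reproduced with `CountVectorizer`. This is a minimal sketch using the default analyzer rather than the notebook's `split_into_words`; note that scikit-learn orders the columns alphabetically, so the column order differs from the hand-drawn table.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The two toy records from the comments above, joined into strings
records = ["mailing body texting", "mailing awesome good"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(records)   # sparse document-term matrix

# Recover the column order from the learned vocabulary (word -> column index)
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)   # ['awesome', 'body', 'good', 'mailing', 'texting']

# Row 0 is record 1: [0, 1, 0, 1, 1]; row 1 is record 2: [1, 0, 1, 1, 0]
print(bow.toarray())
```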

In [77]:
# Transforming all data into a sparse document-term matrix
all_emails_matrix = emails_bow.transform(email_data.text)
all_emails_matrix.shape

(5559, 4183)

In [78]:
# Transforming training data into a sparse document-term matrix
train_emails_matrix = emails_bow.transform(email_train.text)
train_emails_matrix.shape

(3891, 4183)

In [79]:
# Transforming testing data into a sparse document-term matrix
test_emails_matrix = emails_bow.transform(email_test.text)
test_emails_matrix.shape

(1668, 4183)

In [81]:
####### Without TFIDF matrices ########################
# TFIDF => Term Frequency Inverse Document Frequency

# Preparing a naive bayes model on the training data set

from sklearn.naive_bayes import MultinomialNB as MB
from sklearn.naive_bayes import GaussianNB as GB # for numerical data; gives lower accuracy on text data

In [82]:
# Multinomial Naive Bayes model building
classifier_mb = MB()
classifier_mb.fit(train_emails_matrix,email_train.type)
train_pred_m = classifier_mb.predict(train_emails_matrix)
accuracy_train_m = np.mean(train_pred_m==email_train.type)
accuracy_train_m

0.9838087895142636

In [83]:
pd.crosstab(train_pred_m, email_train.type)

type,ham,spam
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
ham,3376,44
spam,19,452


In [84]:
# Multinomial Naive Bayes model testing
test_pred_m = classifier_mb.predict(test_emails_matrix)
accuracy_test_m = np.mean(test_pred_m==email_test.type)
accuracy_test_m

0.9454436450839329

In [85]:
pd.crosstab(test_pred_m, email_test.type)

type,ham,spam
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
ham,1380,54
spam,37,197


In [86]:
# Gaussian Naive Bayes Model building
classifier_gb = GB()
classifier_gb.fit(train_emails_matrix.toarray(),email_train.type.values) # GaussianNB needs a dense array, so the sparse count matrix is converted with toarray()
train_pred_g = classifier_gb.predict(train_emails_matrix.toarray())
accuracy_train_g = np.mean(train_pred_g==email_train.type)
accuracy_train_g

0.7501927525057825

In [87]:
# Gaussian Naive Bayes Model testing
test_pred_g = classifier_gb.predict(test_emails_matrix.toarray())
accuracy_test_g = np.mean(test_pred_g==email_test.type)
accuracy_test_g

0.6241007194244604

In [88]:
# Learning Term weighting and normalizing on entire emails
tfidf_transformer = TfidfTransformer().fit(all_emails_matrix)

In [89]:
# Preparing TFIDF for train emails
train_tfidf = tfidf_transformer.transform(train_emails_matrix)

In [90]:
train_tfidf.shape

(3891, 4183)

In [91]:
# Preparing TFIDF for test emails
test_tfidf = tfidf_transformer.transform(test_emails_matrix)

In [92]:
test_tfidf.shape

(1668, 4183)
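
In the cells above, the vocabulary and the TF-IDF weights are learned from all emails and then applied to the train and test parts. One alternative, sketched here on hypothetical toy messages, is to chain the steps in a scikit-learn `Pipeline` so that every step is fitted on the training messages only. The step names `bow`, `tfidf` and `nb` are arbitrary labels.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages standing in for email_train / email_test
train_text = ["free prize winner claim", "meeting tomorrow morning",
              "claim free holiday prize", "lunch tomorrow office"]
train_type = ["spam", "ham", "spam", "ham"]
test_text = ["free holiday claim", "office meeting"]

# Counting, TF-IDF weighting and the classifier are fitted together
model = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
model.fit(train_text, train_type)
predictions = model.predict(test_text)
print(predictions)
```

Calling `predict` on new text applies the fitted vocabulary and weights automatically, so no separate `transform` calls are needed.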

In [93]:
# Multinomial Naive Bayes model building with train tfidf data
classifier_mb = MB()
classifier_mb.fit(train_tfidf,email_train.type)
train_pred_m = classifier_mb.predict(train_tfidf)  # prediction
accuracy_train_m = np.mean(train_pred_m==email_train.type)
accuracy_train_m

0.965047545618093

In [94]:
test_pred_m = classifier_mb.predict(test_tfidf)   #Prediction
accuracy_test_m = np.mean(test_pred_m==email_test.type)

In [95]:
accuracy_test_m

0.9364508393285371

In [96]:
# Gaussian Naive Bayes 
classifier_gb = GB()
classifier_gb.fit(train_tfidf.toarray(),email_train.type.values) # GaussianNB needs a dense array, so the sparse TFIDF matrix is converted with toarray()
train_pred_g = classifier_gb.predict(train_tfidf.toarray()) #Prediction
accuracy_train_g = np.mean(train_pred_g==email_train.type)
accuracy_train_g

0.7501927525057825

In [97]:
test_pred_g = classifier_gb.predict(test_tfidf.toarray())   #Prediction
accuracy_test_g = np.mean(test_pred_g==email_test.type) 
accuracy_test_g

0.6205035971223022

#### From all the models above, we have seen that when the data is continuous we should use Gaussian NB, and when the data is categorical or text we should use Multinomial NB