# Using Naive Bayes Classifier 

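Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem. It is called "naive" because it assumes the features are conditionally independent given the class. It is widely used for text classification tasks such as email spam filtering, which is what this notebook does.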
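
For reference, the rule such a classifier applies can be written as below. The symbols are notation introduced for this sketch and are not taken from the notebook: $y$ is the class (spam or ham) and $x_1, \dots, x_n$ are the features (here, word counts).

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The product over features is where the "naive" independence assumption enters.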

If you want to learn more about it, watch this video: [link](https://www.youtube.com/watch?v=Q8l0Vip5YUw)

If you prefer reading, go through this article: [link](https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c)

In [2]:
import pandas as pd
df=pd.read_csv('spam.csv')


In [3]:
df.head()

|   | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### General Summary of Spam and Ham 

In [4]:
df.groupby('Category').describe()

| Category | Message count | Message unique | Message top | Message freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 641 | Please call our customer service representativ... | 4 |


### Converting the Category column to numbers

In [5]:
df['spam'] = df['Category'].apply(lambda x : 1 if x=='spam' else 0)
df.head()

|   | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
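
The `apply` with a lambda works, but the same encoding can also be done with vectorized pandas operations. The sketch below uses a small hypothetical frame (`toy`, not the real `spam.csv`) to show two equivalent alternatives.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset.
toy = pd.DataFrame({'Category': ['ham', 'spam', 'ham']})

# Option 1: compare to 'spam' and cast the boolean result to 0/1.
toy['spam'] = (toy['Category'] == 'spam').astype(int)

# Option 2: map each label explicitly; unknown labels would become NaN,
# which makes unexpected category values easy to spot.
mapped = toy['Category'].map({'ham': 0, 'spam': 1})

print(toy['spam'].tolist(), mapped.tolist())
```

Both options give the same 0/1 column as the lambda version on data that contains only `ham` and `spam`.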


In [6]:
df.drop('Category',axis='columns',inplace=True)

In [8]:
df.head()

|   | Message | spam |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |


In [11]:
print("Total number of records",len(df))

Total number of records 5572


### Divide the dataset into train and test 

In [12]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(df['Message'],df['spam'],train_size=0.75)
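
Note that the split above is random, so the scores later in this notebook will vary slightly from run to run. The sketch below shows, on hypothetical toy data, how `random_state` makes the split repeatable and how `stratify` keeps the spam/ham ratio the same in both parts. The names `toy_X` and `toy_y` and the seed value are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham (0) and 4 spam (1) messages.
toy_X = [f"message {i}" for i in range(20)]
toy_y = [0] * 16 + [1] * 4

# random_state fixes the shuffle; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))   # sizes of the two parts
print(sum(y_tr), sum(y_te))   # spam count in each part
```

With an imbalanced dataset like this one (747 spam out of 5572 messages), stratifying avoids a test set that happens to contain unusually few spam messages.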

### Applying CountVectorizer on Message column 

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)

In [14]:
X_train_count.toarray()[0:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [22]:
print('The total number of unique words is',X_train_count.shape[1])

The total number of unique words is 7491
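
The matrix printed above is mostly zeros, which makes it hard to see what `CountVectorizer` is doing. Here is a minimal sketch on three hypothetical short messages (not taken from `spam.csv`) where the vocabulary is small enough to inspect.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
toy_docs = ["free entry win", "call me later", "win free prize"]

toy_v = CountVectorizer()
toy_counts = toy_v.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index.
toy_vocab = sorted(toy_v.vocabulary_, key=toy_v.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())

# Words never seen during fit (here "lottery") are silently ignored by transform.
unseen = toy_v.transform(["win lottery"]).toarray()
print(unseen)
```

Each row is one message and each column is one vocabulary word, so the 7491 columns in the real matrix correspond to the distinct words found in the training messages.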


Naive Bayes has three commonly used variants: <br>
<b>1) Bernoulli Naive Bayes</b><br>
<b>2) Multinomial Naive Bayes</b><br>
<b>3) Gaussian Naive Bayes</b><br>
If you want to know more about how they differ, go through this link: 
[Link](https://www.quora.com/What-is-the-difference-between-the-the-Gaussian-Bernoulli-Multinomial-and-the-regular-Naive-Bayes-algorithms)
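
As a rough illustration, the three variants differ mainly in how they interpret the same feature matrix. The sketch below uses a hypothetical 2 x 3 count matrix and plain NumPy. It only shows the view each variant takes of the data, not the full models.

```python
import numpy as np

# Hypothetical word-count matrix: 2 messages x 3 vocabulary words.
counts = np.array([[2, 0, 1],
                   [0, 3, 0]])

# Bernoulli NB models presence/absence of each word. scikit-learn's
# BernoulliNB binarizes its input at a threshold of 0.0 by default.
presence = (counts > 0).astype(int)

# Multinomial NB models how often each word occurs, so it uses the counts as-is.
total_words = counts.sum(axis=1)

# Gaussian NB treats each feature as continuous and summarizes it with a mean
# and variance (per class in the real model; over all rows here for brevity).
feature_mean = counts.mean(axis=0)
feature_var = counts.var(axis=0)

print(presence)
print(total_words, feature_mean, feature_var)
```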

### Bernoulli Naive Bayes

In [31]:
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB()

In [33]:
model.fit(X_train_count,Y_train)
model

BernoulliNB()

#### Predicting

In [35]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

In [36]:
emails_count = v.transform(emails)

In [37]:
model.predict(emails_count)

array([0, 0], dtype=int64)

#### Score 

In [38]:
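# Reuse the vocabulary fitted on the training set; do not re-fit on the test data.
X_test_count = v.transform(X_test)
# For classifiers, score() returns the mean accuracy on the given data.
model.score(X_test_count,Y_test)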

0.9748743718592965
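
Accuracy alone can be flattering here: with 4825 ham messages out of 5572, a model that always predicts ham would already be right about 87% of the time. A confusion matrix with precision and recall gives a clearer picture. The sketch below uses hypothetical labels (`y_true`, `y_pred`) rather than this notebook's actual predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)   # of predicted spam, how much is spam
rec = recall_score(y_true, y_pred)       # of real spam, how much was caught

print(cm)
print(acc, prec, rec)
```

On the real data, the same functions could be called with `Y_test` and `model.predict(X_test_count)`.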

### Multinomial Naive Bayes

In [39]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
model.fit(X_train_count,Y_train)

MultinomialNB()

#### Predicting

In [40]:
model.predict(emails_count)

array([0, 1], dtype=int64)

#### Score 

In [41]:
X_test_count = v.transform(X_test)
model.score(X_test_count,Y_test)

0.9877961234745154
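
The vectorize-then-classify steps can also be bundled into a single scikit-learn `Pipeline`, so raw text goes in and predictions come out. This is a minimal sketch on hypothetical toy messages, not the notebook's data. The step names `'vectorizer'` and `'nb'` are arbitrary labels chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages (1 = spam, 0 = ham).
toy_messages = [
    "free prize win now",
    "win free entry",
    "call me later",
    "see you tomorrow",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together.
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB()),
])
clf.fit(toy_messages, toy_labels)

# Raw strings can be passed directly; no manual v.transform() call is needed.
toy_predictions = clf.predict(["free win", "call you later"])
print(toy_predictions)
```

Bundling the steps this way also removes the risk of accidentally fitting the vectorizer on test data.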

### Gaussian Naive Bayes

In [46]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# GaussianNB does not accept sparse input, so convert the count matrix to a dense array.
model.fit(X_train_count.toarray(),Y_train)

GaussianNB()
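
Converting to a dense array has a memory cost, because every zero is stored explicitly. With a 75% split of 5572 messages and 7491 vocabulary words, the dense training matrix has roughly 31 million cells. The sketch below uses hypothetical toy messages to compare how many values the sparse matrix stores with how many cells the dense array holds.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
toy_docs = ["free entry win", "call me later", "win free prize"]

X_sparse = CountVectorizer().fit_transform(toy_docs)
X_dense = X_sparse.toarray()

stored_values = X_sparse.nnz   # non-zero entries kept by the sparse matrix
dense_cells = X_dense.size     # every cell, zeros included

print(stored_values, dense_cells)
```

The gap grows quickly with vocabulary size, since each message contains only a small fraction of all words.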

#### Predicting 

In [48]:
model.predict(emails_count.toarray())

array([0, 1], dtype=int64)

#### Score

In [50]:
X_test_count = v.transform(X_test)
model.score(X_test_count.toarray(),Y_test)

0.9038047379755922