# Naive Bayes classifier

## Spam email detection

In [4]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/nHIUYwN-5rM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

In [1]:
import pandas as pd


In [2]:
dfspam = pd.read_csv('C:/Users/Public/lmaaya/projects/python/JupiterNoteBook/data/spam.csv')
dfspam.head(2)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...


In [3]:
dfspam.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


## Convert the Category and Message columns into numbers

In [6]:
dfspam['spam'] = dfspam['Category'].apply(lambda x: 1 if x == 'spam' else 0) 
dfspam.head(5)

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
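
The same 0/1 label can also be built without `apply`, by comparing the whole column at once and casting the result to integers. The sketch below uses a small made-up frame (`toy`, not rows from spam.csv) so that it runs on its own; the column names mirror the ones above.

```python
import pandas as pd

# Toy frame standing in for dfspam (made-up rows, not taken from spam.csv)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at lunch", "win a free prize now", "ok thanks"],
})

# Vectorized comparison, then cast: True -> 1 (spam), False -> 0 (ham)
toy["spam"] = (toy["Category"] == "spam").astype(int)

labels = toy["spam"].tolist()
print(labels)
```

On a frame of this size the two approaches behave the same; the vectorized form mainly avoids a Python-level function call per row.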


In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(dfspam.Message, dfspam.spam, 
                                                    test_size=0.25)
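
The split above is random, so the exact rows in each set (and the scores reported later) change from run to run. A minimal sketch of a reproducible, class-balanced split, assuming you want one: `random_state` fixes the shuffle and `stratify` keeps the spam/ham ratio the same in both sets. The data here is synthetic and the seed value 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 messages, 4 of them labelled spam (1)
messages = [f"message {i}" for i in range(20)]
labels = [1 if i < 4 else 0 for i in range(20)]

X_tr, X_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.25,
    random_state=42,   # arbitrary seed; any fixed integer makes the split repeatable
    stratify=labels,   # preserve the spam/ham proportion in train and test
)

print(len(X_tr), len(X_te), sum(y_tr), sum(y_te))
```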

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:3]



In [13]:
X_train_count.shape  # (number of emails, vocabulary size): one row per email, one column per unique token

(4179, 7454)

In [14]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count, Y_train)

In [15]:
emails = [
'Hey Mohan, can we get together to watch football game tomorrow?',
"Upto 20% discount on parking, exclusive offer just for you. Don't miss this reward!"
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)
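
To see where such a prediction comes from, the sketch below recomputes the class probabilities by hand and compares them with `predict_proba`. It assumes the default settings of `MultinomialNB` (Laplace smoothing with `alpha=1.0`, priors estimated from the class frequencies) and uses four made-up training sentences instead of the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training set: two spam (1) and two ham (0) messages
train_docs = [
    "win free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me after lunch",
]
train_y = np.array([1, 1, 0, 0])

vec = CountVectorizer()
X = vec.fit_transform(train_docs).toarray()
nb = MultinomialNB().fit(X, train_y)

alpha = 1.0  # default smoothing in MultinomialNB
classes = (0, 1)

# log P(class): fraction of training messages in each class
log_prior = np.log(np.array([(train_y == c).mean() for c in classes]))

# Smoothed log P(word | class): (count + alpha) / (total count + alpha * vocab size)
word_counts = np.array([X[train_y == c].sum(axis=0) for c in classes])
log_theta = np.log(
    (word_counts + alpha)
    / (word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
)

# Score a new message: log prior + sum over words of count * log likelihood
x_new = vec.transform(["free lunch prize"]).toarray()
joint = x_new @ log_theta.T + log_prior

# Normalise the joint log scores into probabilities
manual_proba = np.exp(joint - np.logaddexp.reduce(joint, axis=1, keepdims=True))

print(manual_proba)
print(nb.predict_proba(x_new))
```

The "naive" part is the sum over words: each word contributes independently of the others, given the class.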

In [17]:
X_test_count = v.transform(X_test)
model.score(X_test_count, Y_test)

0.9856424982053122
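
`score` reports plain accuracy, which should be read against the class balance: with 4825 ham and 747 spam messages, a classifier that always answers "ham" would already reach about 0.87 on the full data set. A confusion matrix together with precision and recall for the spam class gives a fuller picture. The sketch below uses hand-written label lists purely for illustration; on the real data you would pass `Y_test` and `model.predict(X_test_count)` instead.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels only (1 = spam, 0 = ham), not results from the model above
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Precision: of the messages flagged as spam, how many really were spam
precision = precision_score(y_true, y_pred)

# Recall: of the real spam messages, how many were caught
recall = recall_score(y_true, y_pred)

print(cm)
print(precision, recall)
```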

## Define the transform using a Pipeline

Instead of converting the text into counts as a separate step before training, we can chain the vectorizer and the classifier in a Pipeline and feed it the raw text directly.

In [20]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
('vectorizer', CountVectorizer()),
('nb', MultinomialNB())
])

In [21]:
clf.fit(X_train, Y_train)

In [22]:
clf.score(X_test, Y_test)

0.9856424982053122

In [23]:
clf.predict(emails)

array([0, 1], dtype=int64)
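
As a self-contained check that the pipeline really is the two manual steps glued together, the sketch below fits both variants on a few made-up sentences and compares their predictions. The fitted steps remain accessible through `named_steps`, using the names given when the pipeline was built.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training set: two spam (1) and two ham (0) messages
docs = [
    "win free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me after lunch",
]
y = [1, 1, 0, 0]
new = ["free cash prize", "lunch after call"]

# Pipeline: raw text in, prediction out
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
]).fit(docs, y)
pipe_pred = pipe.predict(new).tolist()

# The same thing done by hand in two steps
vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(docs), y)
manual_pred = nb.predict(vec.transform(new)).tolist()

# Fitted steps can be inspected by name
vocab_size = len(pipe.named_steps["vectorizer"].vocabulary_)

print(pipe_pred, manual_pred, vocab_size)
```

Besides being shorter, the pipeline removes the risk of forgetting to apply the vectorizer (or applying a differently fitted one) to new text.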