# Naive Bayes Classifier with a Library
## Import the libraries we need

In [1]:
import pandas as pd

## Initialize the data

The data is used to classify emails as spam or not spam (ham). It is taken from [here](https://archive.ics.uci.edu/ml/datasets/spambase).

In [2]:
# Preparing the data for Naive Bayes
spam_dataframe = pd.read_csv('./emails.csv')
spam_dataframe

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |
| ... | ... | ... |
| 5723 | Subject: re : research and development charges... | 0 |
| 5724 | Subject: re : receipts from visit jim , than... | 0 |
| 5725 | Subject: re : enron case study update wow ! a... | 0 |
| 5726 | Subject: re : interest david , please , call... | 0 |


In [3]:
# calculating the probability of spam and ham
spam_probability = len(spam_dataframe[spam_dataframe['spam'] == 1]) / len(spam_dataframe)
ham_probability = len(spam_dataframe[spam_dataframe['spam'] == 0]) / len(spam_dataframe)

print('Spam Probability: ', spam_probability)
print('Ham Probability: ', ham_probability)

Spam Probability:  0.2388268156424581
Ham Probability:  0.7611731843575419
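The same class priors can also be read off with `value_counts(normalize=True)`. The snippet below is a minimal sketch on a hypothetical four-row DataFrame (the `toy` data is invented for illustration; only the `spam` column name matches the dataset above).

```python
import pandas as pd

# Hypothetical toy data standing in for emails.csv (same 'spam' column name).
toy = pd.DataFrame({
    "text": ["win money now", "meeting at noon", "lunch tomorrow", "project update"],
    "spam": [1, 0, 0, 0],
})

# Relative frequency of each label, i.e. the empirical class priors.
priors = toy["spam"].value_counts(normalize=True)
spam_prior = float(priors.get(1, 0.0))
ham_prior = float(priors.get(0, 0.0))

print("Spam prior:", spam_prior)  # 0.25
print("Ham prior:", ham_prior)    # 0.75
```

These priors are what a Naive Bayes model combines with the per-word likelihoods when it scores an email.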


## Preprocessing

We use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert the email text into a sparse matrix of token counts, with one row per email and one column per token in the vocabulary.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and build the document-term count matrix
vectorizer = CountVectorizer()
spamham_countVectorizer = vectorizer.fit_transform(spam_dataframe['text'])



In [5]:
spamham_countVectorizer.shape

(5728, 37303)

In [6]:
# Features (token counts) and target labels (1 = spam, 0 = ham)
label = spam_dataframe['spam']
X = spamham_countVectorizer
y = label

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
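The split above is random, so the reported numbers can vary from run to run. As a hedged sketch (not what this notebook does), passing `random_state` makes the split reproducible and `stratify` keeps the spam/ham ratio the same in both parts. The arrays `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 25% labelled spam (1).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 5 + [0] * 15)

# random_state fixes the shuffle; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_te))      # 16 4
print(y_tr.mean(), y_te.mean())  # 0.25 0.25
```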

## Using the Naive Bayes Classifier

In [8]:
from sklearn.naive_bayes import MultinomialNB

NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)

MultinomialNB()
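To see roughly what `fit` estimates, the sketch below recomputes the smoothed per-class word likelihoods by hand and compares them with the fitted model's `feature_log_prob_`. The count matrix `X_toy` and labels `y_toy` are hypothetical, and `alpha=1.0` is the library default (Laplace smoothing).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents x 3 tokens, with labels 1 = spam, 0 = ham.
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 2],
                  [0, 0, 1]])
y_toy = np.array([1, 1, 0, 0])

alpha = 1.0  # additive (Laplace) smoothing
clf = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual estimate: log((count of token in class + alpha) /
#                      (total tokens in class + alpha * vocabulary size))
manual = []
for c in clf.classes_:
    token_counts = X_toy[y_toy == c].sum(axis=0)
    smoothed = (token_counts + alpha) / (token_counts.sum() + alpha * X_toy.shape[1])
    manual.append(np.log(smoothed))
manual = np.array(manual)

matches = bool(np.allclose(manual, clf.feature_log_prob_))
print(matches)  # True
```

At prediction time these log-likelihoods are weighted by the token counts of a document and added to the log class prior; the class with the highest total wins.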

In [9]:
y_predict_test = NB_classifier.predict(X_test)
y_predict_test

array([0, 1, 0, ..., 0, 0, 0])
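In this notebook the vectorizer is fitted on the full dataset before splitting. One alternative, shown here only as a sketch on hypothetical toy texts, is to chain the vectorizer and the classifier with `make_pipeline`, so the vocabulary is learned from the training texts alone and raw strings can be passed straight to `predict`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training texts and labels (1 = spam, 0 = ham).
train_texts = [
    "free money now",
    "win free prize money",
    "meeting agenda attached",
    "project meeting tomorrow",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits CountVectorizer on the training texts only,
# then trains MultinomialNB on the resulting counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Tokens never seen in training (e.g. "offer") are simply ignored.
predictions = model.predict(["free money offer", "agenda for the meeting"])
print(predictions)  # [1 0]
```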

## Evaluation

In [10]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_predict_test))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       896
           1       0.95      0.98      0.97       250

    accuracy                           0.99      1146
   macro avg       0.97      0.99      0.98      1146
weighted avg       0.99      0.99      0.99      1146
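The precision and recall columns in the report come from the confusion matrix. The sketch below uses hypothetical labels (`y_true`, `y_pred`) to show how the values for the spam class (1) are derived and checks them against scikit-learn's own scores.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())
print(tn, fp, fn, tp)  # 4 1 1 2

# Precision: of the emails flagged as spam, how many really are spam.
precision = tp / (tp + fp)
# Recall: of the real spam emails, how many were flagged.
recall = tp / (tp + fn)

print(abs(precision - precision_score(y_true, y_pred)) < 1e-12)  # True
print(abs(recall - recall_score(y_true, y_pred)) < 1e-12)        # True
```

For spam filtering, low precision on class 1 means legitimate mail is being flagged, while low recall means spam is getting through.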

