## Naive Bayes

In this notebook we will do 3 things:
- Use Gaussian Naive Bayes to predict whether or not patients have breast cancer
- Use Multinomial Naive Bayes to predict whether or not an email is spam
- Apply Gaussian Naive Bayes to the iris data

### General Imports

In [1]:
import numpy as np
import pandas as pd
import math
import scipy.stats as stats
import matplotlib.pyplot as plt

### Part 1: GaussianNB

#### Load the data

In [2]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

X_df = pd.DataFrame(data=X, columns=cancer.feature_names)
X_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


#### Split the data

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

#### Scale the data

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

#### Create Naive Bayes model

In [6]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train_scaled, y_train)

#### Evaluate Model Performance

In [7]:
acc_train = gnb.score(X_train_scaled, y_train)
acc_test = gnb.score(X_test_scaled, y_test)

print(f"The accuracy on the training set is {100*acc_train:.2f}%")
print(f"The accuracy on the test set is {100*acc_test:.2f}%")

The accuracy on the training set is 93.66%
The accuracy on the test set is 92.31%


### Part 2: MultinomialNB

#### Load the data

In [8]:
dataset = pd.read_csv('emails.csv')

#Checking for duplicates and removing them
dataset.drop_duplicates(inplace = True)

sample_spam = dataset[dataset['spam'] == 1].sample(200, random_state=16)
sample_not_spam = dataset[dataset['spam'] == 0].sample(200, random_state=16)

sample = pd.concat([sample_spam, sample_not_spam])
sample.head()

Unnamed: 0,text,spam
475,Subject: assistance me my name is mr . newton...,1
1102,Subject: in the heart of your business ! corp...,1
704,Subject: low price software http : / / neonat...,1
625,Subject: work from home . free info we need h...,1
388,Subject: [ ilug ] here is the information you ...,1


In [9]:
#Every mail starts with 'Subject :' will remove this from each text 
sample['text'] = sample['text'].map(lambda text: text[9:])
sample.head()

Unnamed: 0,text,spam
475,assistance me my name is mr . newton gwarada ...,1
1102,in the heart of your business ! corporate ima...,1
704,low price software http : / / neonate . setup...,1
625,work from home . free info we need help . we ...,1
388,[ ilug ] here is the information you requested...,1


In [10]:
X = sample['text'].values
y = sample['spam'].values

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

counts = CountVectorizer()

counts.fit(X)

X_counts = counts.transform(X).toarray()
y_counts = y

In [12]:
X_counts.shape

(400, 10859)

In [13]:
X_counts_train, X_counts_test, y_counts_train, y_counts_test = train_test_split(X_counts, y_counts, random_state=42)

In [14]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

mnb.fit(X_counts_train, y_counts_train)

In [15]:
counts_acc_train = mnb.score(X_counts_train, y_counts_train)
counts_acc_test = mnb.score(X_counts_test, y_counts_test)

print(f"The accuracy on the training set is {100*counts_acc_train:.2f}%")
print(f"The accuracy on the test set is {100*counts_acc_test:.2f}%")

The accuracy on the training set is 100.00%
The accuracy on the test set is 93.00%


### Part 3: Multiclass data

See if you can build a Gaussian Naive Bayes model for the iris data. 

In [20]:
# import the data

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

X_df = pd.DataFrame(data=X, columns=iris.feature_names)
X_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [32]:
# split the data

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4, random_state=1)

In [25]:
# scale the data

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [26]:
# fit the model

from sklearn.naive_bayes import MultinomialNB

gnb_Multi = MultinomialNB()
gnb_Multi.fit(X_train_scaled, y_train)

In [30]:
# check that the model does indeed predict more than just 2 classes

y_pred = gnb_Multi.predict(X_test)
y_pred

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

In [34]:
gnb_Multi.classes_

array([0, 1, 2])

In [29]:
# evaluate model performance

acc_train = gnb_Multi.score(X_train_scaled, y_train)
acc_test =gnb_Multi.score(X_test_scaled, y_test)

print(f"The accuracy on the training set is {100*acc_train:.2f}%")
print(f"The accuracy on the test set is {100*acc_test:.2f}%")

The accuracy on the training set is 67.78%
The accuracy on the test set is 61.67%
