# Reference

## https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes

---
# **Introduction To Machine Learning**
## **Supervised Learning (= classification):**

*   k-Nearest Neighbor (kNN)
*   **naive Bayesian (NB)**
*   Decision Tree (DT)
*   Support Vector Machine (SVM)
---

Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from the class prior P(c), the predictor prior P(x), and the likelihood P(x|c). The naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class *c* is independent of the values of the other predictors. This assumption is called class conditional independence.
![Naive Bayes Equation](https://miro.medium.com/max/954/1*2SnqzKlKD9DC5qL8C4HaQQ.png)
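
Written out for a feature vector $x = (x_1, \dots, x_n)$, the theorem and the independence assumption can be stated as follows. The symbols $n$, $x_i$, and $\hat{c}$ are notation introduced here for illustration; they do not appear in the text above.

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

$$P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$$

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

P(x) is the same for every class, so it can be dropped when only the most probable class is needed.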

## <font color = #950CDF> Part 1: </font> <font color = #4854E8> Information of Dataset </font>
<b>Breast Cancer Wisconsin (Diagnostic) Data Set:</b> Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. This data set is originally from the University of Wisconsin CS department (ftp ftp.cs.wisc.edu > cd math-prog/cpo-dataset/machine-learn/WDBC/), but I found it in the University of California Irvine (UCI) Machine Learning Repository at the link below. The first column is a unique ID; the second column is a binary variable, ‘M’ for malignant and ‘B’ for benign. The remaining 30 columns are independent variables, all of them measurements of cell nucleus size and shape.

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

### <font color = #27C3E5> 1.1: </font> <font color = #41EA46> Import Libraries and Dataset </font>

#### <font color = blue>Import the Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_absolute_error, roc_auc_score

#### <font color = blue>Import the Dataset

In [None]:
df = pd.read_csv("breast-cancer.csv")
df = df.iloc[:, 1:32]
df.tail(3)

### <font color = #27C3E5> 1.2: </font> <font color = #41EA46> Data Information and Visualization </font>

#### <font color = blue> View all Rows and Cols

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#### <font color = blue> Data Information

In [None]:
df.info()

#### <font color = blue> Visualize Target Class Label Distribution

In [None]:
plt.style.use('fivethirtyeight')
malignant = df[df['diagnosis'] == 'M'].shape[0]
benign = df[df['diagnosis'] == 'B'].shape[0]

class_ = [malignant, benign]
label = ['malignant', 'benign']

plt.pie(class_, labels = label, shadow = True, wedgeprops = {'edgecolor': 'black'}, 
        autopct = '%1.1f%%', startangle= 90, colors=['red', 'green'])

plt.tight_layout()
plt.show()

![Machine Learning Project](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## <font color = #950CDF> Part 2: </font> <font color = #4854E8> Data Preprocessing </font>

### <font color = #27C3E5> 2.1: </font> <font color = #41EA46> Define Predictor and Target Attributes </font>

In [None]:
X = df.iloc[:, 1:32]
Y = df.iloc[:, 0]

#### <font color = blue> Predictor Attributes

In [None]:
X.tail(3)

#### <font color = blue> Target Attribute

In [None]:
Y.tail(3)

### <font color = #27C3E5> 2.2: </font> <font color = #41EA46> Split the Data into Training and Testing </font>

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(
                                                    X, 
                                                    Y,
                                                    test_size=0.2,
                                                    random_state=0)

#### <font color = blue> Training Data

In [None]:
print("X_train", X_train.shape)
print("Y_train", Y_train.shape)

#### <font color = blue> Testing Data

In [None]:
print("X_test", X_test.shape)
print("Y_test", Y_test.shape)

### <font color = #27C3E5> 2.3: </font> <font color = #41EA46> Check Missing Value </font>

In [None]:
df.isnull().sum()

### <font color = #27C3E5> 2.4: </font> <font color = #41EA46> Feature Selection - With Correlation </font>

#### <font color = blue> Correlation

In [None]:
corr = X_train.corr()
corr

#### <font color = blue> Visualize the Correlation

In [None]:
plt.figure(figsize = (20, 9))
matrix = np.triu(corr)        # mask the upper triangle (and diagonal) so only the lower triangle is drawn
sns.heatmap(corr, mask = matrix, annot = True, linewidth = 1.5)

#### <font color = blue> Remove Features (highest corr)

In [None]:
# the following function selects highly correlated features:
# for each pair whose absolute correlation exceeds the threshold, it flags the later column, so the earlier one is kept

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr
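
To see which column of a correlated pair gets flagged, here is a minimal sketch on a made-up frame. The column names `a`, `b`, `c` and their values are hypothetical; the function body is repeated only so that the cell runs on its own.

```python
import pandas as pd

def correlation(dataset, threshold):
    col_corr = set()  # names of columns flagged as highly correlated
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # only look at columns that come before column i
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr

# toy data: 'b' is an exact multiple of 'a', 'c' is only weakly related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

flagged = correlation(toy, 0.7)
print(flagged)
```

Only `b` is flagged: in each correlated pair the function keeps the column that appears first and marks the later one for removal.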

In [None]:
corr_features = correlation(X, 0.7)
len(set(corr_features))

In [None]:
corr_features

In [None]:
X_train = X_train.drop(corr_features,axis=1)
X_test = X_test.drop(corr_features,axis=1)

In [None]:
X_train.head()

In [None]:
X_train.shape   # 30 - 20 = 10 features remain

#### <font color = blue> Label Encoder

In [None]:
LE = LabelEncoder()
Y_train = LE.fit_transform(Y_train)
Y_test = LE.transform(Y_test)      # reuse the mapping learned from the training labels
Y_test                             # benign ('B') = 0,  malignant ('M') = 1
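
`LabelEncoder` assigns integer codes in sorted order of the class labels, which can be checked through its `classes_` attribute. A minimal sketch on made-up labels (the `*_toy` arrays are hypothetical and not part of the dataset):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_train_toy = np.array(["M", "B", "B", "M"])
y_test_toy = np.array(["B", "M"])

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train_toy)  # learn the mapping on training labels
y_test_enc = encoder.transform(y_test_toy)        # apply the same mapping to test labels

# classes_ lists the labels in the order of their integer codes
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(y_train_enc, y_test_enc)
```

Fitting once and calling `transform` on the test labels guarantees that both splits share one mapping, even if a class happens to be missing from the test split.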

#### <font color = blue> Feature Scaling

In [None]:
SS = StandardScaler()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)

pd.DataFrame(X_test).head()    # Same Scale
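
The scaler is fitted on the training data only and then applied unchanged to the test data. A small sketch of what that means numerically, using made-up arrays (`train_toy` and `test_toy` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_toy = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std per column
test_scaled = scaler.transform(test_toy)        # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test row is expressed in the same units, so its values need not fall in the training range.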

## <font color = #950CDF> Part 3: </font> <font color = #4854E8> Build Naive Bayes Classifier </font>

### <font color = #27C3E5> 3.1: </font> <font color = #41EA46> Implementation from Scratch </font>

#### <font color = blue> Build Model

In [None]:
class NaiveBayes:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)

        # calculate mean, var, and prior for each class
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)

        for idx, c in enumerate(self._classes):
            X_c = X[y == c]
            self._mean[idx, :] = X_c.mean(axis=0)
            self._var[idx, :] = X_c.var(axis=0)
            self._priors[idx] = X_c.shape[0] / float(n_samples)

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        posteriors = []

        # calculate posterior probability for each class
        for idx, c in enumerate(self._classes):
            prior = np.log(self._priors[idx])
            posterior = np.sum(np.log(self._pdf(idx, x)))
            posterior = prior + posterior
            posteriors.append(posterior)

        # return class with highest posterior probability
        return self._classes[np.argmax(posteriors)]

    def _pdf(self, class_idx, x):
        mean = self._mean[class_idx]
        var = self._var[class_idx]
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
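
The `_predict` method sums log-densities instead of multiplying densities. One likely reason is numerical: a product of many small values underflows to zero in floating point, while the sum of their logarithms stays finite. A minimal sketch with made-up values (1000 densities of 1e-5 each are an assumption for illustration, not taken from the dataset):

```python
import numpy as np

# hypothetical per-feature likelihoods, all small
likelihoods = np.full(1000, 1e-5)

direct_product = np.prod(likelihoods)   # underflows in float64
log_sum = np.sum(np.log(likelihoods))   # stays finite

print(direct_product)
print(log_sum)
```

Because the logarithm is monotonic, the class with the largest log-posterior is also the class with the largest posterior, so `np.argmax` gives the same answer either way.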

#### <font color = blue> Initialize Model

In [None]:
NB_scratch = NaiveBayes()

#### <font color = blue> Fit the Training Data into Model

In [None]:
NB_scratch.fit(X_train, Y_train)

#### <font color = blue> Predict the Test data

In [None]:
Y_pred_scratch = NB_scratch.predict(X_test)
Y_pred_scratch

#### <font color = blue> Accuracy Score

In [None]:
Accuracy_Scratch = accuracy_score(Y_test, Y_pred_scratch)   # scikit-learn convention: (y_true, y_pred)
print('Accuracy Score:', Accuracy_Scratch)

### <font color = #27C3E5> 3.2: </font> <font color = #41EA46> Implementation with Scikit-Learn </font>

#### <font color = blue> Import Model From Sklearn

In [None]:
from sklearn.naive_bayes import GaussianNB

#### <font color = blue> Initialize Model

In [None]:
NB_Sklearn = GaussianNB()

#### <font color = blue> Fit the Training Data into Model

In [None]:
NB_Sklearn.fit(X_train, Y_train)

#### <font color = blue> Predict the Test data

In [None]:
Y_pred_Sklearn = NB_Sklearn.predict(X_test)
Y_pred_Sklearn

#### <font color = blue> Accuracy Score

In [None]:
Accuracy_Sklearn = accuracy_score(Y_test, Y_pred_Sklearn)   # scikit-learn convention: (y_true, y_pred)
print('Accuracy Score:', Accuracy_Sklearn)

### <font color = #27C3E5> 3.3: </font> <font color = #41EA46> Comparison (Scratch vs. Scikit-Learn) </font>

In [None]:
accuracy = [Accuracy_Sklearn, Accuracy_Scratch]
label = ["Sklearn", "Scratch"]
plt.bar(label, accuracy, color = ['blue', 'red'])
plt.title("Sklearn vs Scratch")
plt.xlabel("Naive Bayes")
plt.ylabel("Accuracy")
plt.show()

Both implementations reach the same accuracy.

## <font color = #950CDF> Part 4: </font> <font color = #4854E8> Evaluate the Result </font>

### <font color = #27C3E5> 4.1: </font> <font color = #41EA46> Confusion Matrix</font>

In [None]:
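# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predicted classes
confusion_matrix_Scratch = confusion_matrix(Y_test, Y_pred_scratch)

#[row, column] = [true class, predicted class]
TP = confusion_matrix_Scratch[1, 1]
TN = confusion_matrix_Scratch[0, 0]
FP = confusion_matrix_Scratch[0, 1]
FN = confusion_matrix_Scratch[1, 0]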

group_names = ['TN','FP','FN','TP']

group_counts = ["{0:0.0f}".format(value) for value in confusion_matrix_Scratch.flatten()]

group_percentages = ["{0:.2%}".format(value) for value in confusion_matrix_Scratch.flatten()/np.sum(confusion_matrix_Scratch)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]

labels = np.asarray(labels).reshape(2,2)

sns.heatmap(confusion_matrix_Scratch, annot=labels, fmt='', cmap='Greens')
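
The cell labels TN, FP, FN, TP assume scikit-learn's layout, where `confusion_matrix(y_true, y_pred)` puts true classes in rows and predicted classes in columns. A minimal sketch on made-up labels (the `*_toy` lists are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# for a binary problem, flattening row by row gives TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(tn, fp, fn, tp)
```

Passing the arguments in the opposite order transposes the matrix, which swaps the FP and FN cells.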

### <font color = #27C3E5> 4.2: </font> <font color = #41EA46>  Evaluate the Results </font>

#### <font color = blue>4.2.1: Calculate the Results

In [None]:
# All scikit-learn metrics below take (y_true, y_pred) in that order

# Accuracy Score
Accuracy = accuracy_score(Y_test, Y_pred_scratch)
print('Accuracy Score:', Accuracy)

# Precision Score
Precision = precision_score(Y_test, Y_pred_scratch)
print('Precision Score:', Precision)

# True positive Rate (TPR) or Sensitivity or Recall
TPR = recall_score(Y_test, Y_pred_scratch)
print('True positive Rate:', TPR)

# False positive Rate (FPR)
FPR = FP / float(TN + FP)
print('False positive Rate', FPR)

# F1 Score or F-Measure or F-Score
F1 = f1_score(Y_test, Y_pred_scratch)
print('F1 Score:', F1)

# Specificity
Specificity = TN / (TN + FP)
print('Specificity:', Specificity)

# Mean Absolute Error
Error = mean_absolute_error(Y_test, Y_pred_scratch)
print('Mean Absolute Error:', Error)

# ROC Area
Roc = roc_auc_score(Y_test, Y_pred_scratch)
print('ROC Area:', Roc)
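
Precision and recall are not symmetric in their arguments, so the order `(y_true, y_pred)` matters. A small sketch on made-up labels (the `*_toy` lists are hypothetical) that compares the scikit-learn scores with the same quantities computed by hand from the confusion matrix counts:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true_toy = [0, 0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)     # share of predicted positives that are correct
recall_manual = tp / (tp + fn)        # share of actual positives that are found
specificity_manual = tn / (tn + fp)   # share of actual negatives that are found

precision_sk = precision_score(y_true_toy, y_pred_toy)
recall_sk = recall_score(y_true_toy, y_pred_toy)

# swapping the arguments silently turns precision into recall
precision_swapped = precision_score(y_pred_toy, y_true_toy)

print(precision_sk, recall_sk, specificity_manual, precision_swapped)
```

With the arguments swapped, `precision_score` returns the recall value instead, without raising any error.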

#### <font color = blue>4.2.2: Visualize the Results

In [None]:
plt.figure(figsize = (12, 5))

result = [Accuracy, Precision, TPR, FPR, F1, Specificity, Error, Roc]
label = ["Accuracy", "Precision", "TPR", "FPR", "F-Score", "Specificity", "Error", "Roc Area"]
colors=[ 'red', 'green', 'blue', 'darkgoldenrod', 'orange', 'purple', 'brown', 'darkcyan']

plt.bar(label, result, color = colors, edgecolor='black')