# Naive Bayes Classification Model
---

1.   **[Introduction to Naive Bayes](#1.-Introduction-to-Naive-Bayes)**
2.   **[Foundations of Naive Bayes](#2.-Foundations-of-Naive-Bayes)**
3.   **[Model Assumptions](#3.-Model-Assumptions)**
4.   **[Model Interpretation](#4.-Model-Interpretation)**
5.   **[Exploratory Data Analysis](#5.-Exploratory-Data-Analysis)**
6.   **[Model Construction](#6.-Model-Construction)**
7.   **[Model Results](#7.-Model-Results)**

---
<a name="1.-Introduction-to-Naive-Bayes"></a>
### 1. Introduction to Naive Bayes

#### 1.1 Definitions

**Naive Bayes |** A supervised classification technique that is based on Bayes' Theorem with an assumption of independence among predictors.

**Posterior Probability |** The probability of an event occurring after taking into consideration new information

**Response Variable (Dependent Variable) |** The outcome of interest that is being studied and is expected to change based on the values of the independent variables.

**Predictor Variables (Independent Variables) |** Variables that are used to predict or explain changes in the value of the response or dependent variable in a statistical model.

---
<a name="2.-Foundations-of-Naive-Bayes"></a>
### 2. Foundations of Naive Bayes

A Naive Bayes model is a supervised learning technique used for classification problems. As with all supervised learning techniques, to create a Naive Bayes model you must have a response variable and a set of predictor variables to train the model.

The Naive Bayes algorithm is based on Bayes’ Theorem, an equation that can be used to calculate the probability of an outcome or class, given the values of predictor variables. This value is known as the posterior probability.

That probability is calculated using three values:

- The overall probability of the outcome, P(A)
- The probability of the value of the predictor variable, P(B)
- The conditional probability, P(B|A) (Note: P(B|A) is interpreted as *the probability of B, given A.*)

The probability of the outcome overall, P(A), is multiplied by the conditional probability, P(B|A). This result is then divided by the probability of the predictor variable, P(B), to obtain the posterior probability.

**Bayes' Theorem |**  $P(A|B)= \frac{P(B|A) \cdot P(A)}{P(B)}$

The goal of Bayes’ Theorem is to find the probability of an event, A, given that another event B is true. In the context of a predictive model, the class label would be A and the predictor variable would be B. P(A) is considered the prior probability of event A before any evidence (feature) is seen. Then, P(A|B) is the posterior probability, or the probability of the class label after the evidence (feature) has been seen.
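As a worked illustration of the formula, the sketch below computes a posterior probability for a made-up spam filter. All of the numbers and names (`p_spam`, `p_word_given_spam`, and so on) are hypothetical values chosen for the example, not figures from a real dataset. P(B) is obtained here from the law of total probability.

```python
def posterior(prior_a: float, likelihood_b_given_a: float, evidence_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical inputs: A = "email is spam", B = "email contains the word 'offer'".
p_spam = 0.2                 # P(A), the prior
p_word_given_spam = 0.6      # P(B|A)
p_word_given_ham = 0.05      # P(B|not A)

# P(B) via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B), the posterior.
p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))
```

With these assumed inputs, seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.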

#### 2.1 Model Implementations
There are several implementations of Naive Bayes in scikit-learn, all of which are found in the `sklearn.naive_bayes` module. Each is optimized for different conditions of the predictor variables.

- **BernoulliNB**:      Used for binary/Boolean features
- **CategoricalNB**:    Used for categorical features
- **ComplementNB**:     Used for imbalanced datasets, often for text classification tasks
- **GaussianNB**:       Used for continuous features that are assumed to be normally distributed within each class
- **MultinomialNB**:    Used for multinomial (discrete count) features, such as word counts

---
<a name="3.-Model-Assumptions"></a>
### 3. Model Assumptions

Model assumptions are statements about the data that must be true in order to justify the use of a particular modeling technique.



#### 3.1 Naive Bayes Assumptions
- **Conditional Independence |** In Naive Bayes, the predictor variables (for example, two predictors B and C) are assumed to be conditionally independent of each other, given the target variable (A).
    - $P(B|C,A) = P(B|A)$: the probability of B, given C and A, is equal to the probability of B, given A.
- **Equal Predictive Power |** No predictor variable has any more predictive power than any other predictor. In other words, the individual predictor variables are assumed to contribute equally to the model’s prediction.
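The conditional independence assumption is what allows the model to multiply one probability per predictor instead of estimating a joint probability over all predictors at once. The sketch below illustrates this on a tiny, made-up dataset with two binary predictors. The data, the feature meanings, and the helper `naive_posterior` are hypothetical, and no smoothing is applied, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (contains_link, contains_offer) -> label.
rows = [
    ((1, 1), "spam"),
    ((1, 0), "spam"),
    ((1, 1), "spam"),
    ((0, 0), "ham"),
    ((0, 1), "ham"),
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((0, 0), "ham"),
]

class_counts = Counter(label for _, label in rows)

# feature_counts[label][feature_index][value] = number of matching rows.
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in rows:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1


def naive_posterior(features):
    """Estimate P(class | features) assuming conditionally independent features."""
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(A)
        for i, value in enumerate(features):
            # P(B_i | A), estimated by relative frequency within the class.
            score *= feature_counts[label][i][value] / n_label
        scores[label] = score
    total = sum(scores.values())  # normalizing constant; assumed nonzero here
    return {label: s / total for label, s in scores.items()}


post = naive_posterior((1, 1))
print(post)
```

For the input `(1, 1)`, the unnormalized scores are 0.25 for "spam" and 0.025 for "ham", so the normalized posterior for "spam" is 10/11, or about 0.91.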


#### 3.2 Assumption Violations

Even though these assumptions are often violated in real data, the model can still perform well.


---
<a name="4.-Model-Interpretation"></a>
### 4. Model Interpretation

#### 4.1 Confusion Matrix

A graphical representation of how accurately a classifier predicts the labels of a categorical variable, comparing predicted labels against actual labels.

####  4.2 Evaluation Metrics 
- Precision 
- Recall
- Accuracy
- F1 score

##### 4.2.1 Precision
The proportion of positive predictions that were true positives

$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

##### 4.2.2 Recall
The proportion of positives the model was able to identify correctly 

$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

##### 4.2.3 Accuracy
The proportion of data points that were correctly categorized

$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Predictions}}$

##### 4.2.4 F-Scores
**F1 score:** F1 score is a measurement that combines both precision and recall into a single expression, giving each equal importance. F1 score ranges from 0 to 1, with zero being the worst and one being the best.
- $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

**F-beta Score:**  In an F𝛽 score, 𝛽 is a factor that represents how many times more important recall is compared to precision. In the case of F1 score, 𝛽 = 1, and recall is therefore 1x as important as precision (i.e., they are equally important). However, an F2 score has 𝛽 = 2, which means recall is twice as important as precision; and if precision is twice as important as recall, then 𝛽 = 0.5.
- $F_\beta = (1+\beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$
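The precision, recall, accuracy, and F-score formulas can be checked by hand from the four confusion-matrix counts. The sketch below does this in plain Python. The function name `classification_metrics` and the counts passed to it are made up for illustration, and the function assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute precision, recall, accuracy and F-beta from confusion-matrix counts.

    Assumes tp + fp > 0, tp + fn > 0 and at least one nonzero score,
    so that no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_beta": f_beta,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
f1_metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)            # beta = 1
f2_metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30, beta=2.0)  # recall weighted more

print(f1_metrics)
print(round(f2_metrics["f_beta"], 4))
```

Because recall (about 0.667) is lower than precision (0.8) for these counts, the F2 score comes out lower than the F1 score: weighting recall more heavily penalizes the missed positives.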

##### 4.2.5 ROC Curves (Receiver Operating Characteristic)
A ROC curve visualizes the performance of a classifier at different classification thresholds. In binary classification, the classification threshold is the cutoff for differentiating the positive class from the negative class.
- Plots two key quantities against each other
    1. True Positive Rate (y-axis)
    2. False Positive Rate (x-axis)

**The more that the ROC curve hugs the top left corner of the plot, the better the model does at classifying the data.**

##### 4.2.6 AUC
Stands for Area Under the ROC Curve. Provides an aggregate measure of performance across all possible classification thresholds. AUC ranges in value from 0.0 to 1.0. A model whose predictions are 100% wrong has an AUC of 0.0, and a model whose predictions are 100% correct has an AUC of 1.0.
- An AUC of 0.5 corresponds to a random classifier, so an AUC of less than 0.5 indicates that the model performs worse than random guessing
- Python function: `metrics.roc_auc_score(y_test, y_score)`, where `y_score` holds predicted scores or probabilities for the positive class (for example, `nb.predict_proba(X_test)[:, 1]`)
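A minimal sketch of computing the ROC curve points and the AUC with scikit-learn follows. The labels and scores are small made-up values. The variable name `y_score` is an assumption standing in for continuous scores such as the positive-class column of `predict_proba`; hard 0/1 predictions can also be passed, but they describe only a single threshold.

```python
from sklearn import metrics

# Hypothetical true labels and predicted positive-class scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each candidate threshold.
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)

# Area under the ROC curve.
auc = metrics.roc_auc_score(y_true, y_score)
print(auc)
```

Here three of the four (positive, negative) pairs are ranked correctly, which is why the AUC is 0.75.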

#### 4.3 Considerations when choosing metrics

##### 4.3.1 When to use Precision
Using precision as an evaluation metric is especially helpful in contexts where the cost of a false positive is quite high and much higher than the cost of a false negative. For example, in the context of email spam detection, a false positive (predicting a non-spam email as spam) would be more costly than a false negative (predicting a spam email as non-spam). A non-spam email that is misclassified could contain important information, such as project status updates from a vendor to a client or assignment deadline announcements from an instructor to a class of students. 

##### 4.3.2 When to use Recall
Using recall as an evaluation metric is especially helpful in contexts where the cost of a false negative is quite high and much higher than the cost of a false positive. For example, in the context of fraud detection among credit card transactions, a false negative (predicting a fraudulent credit card charge as non-fraudulent) would be more costly than a false positive (predicting a non-fraudulent credit card charge as fraudulent). A fraudulent credit card charge that is misclassified could lead to the customer losing money, undetected.

##### 4.3.3 When to use Accuracy
It is helpful to use accuracy as an evaluation metric when you specifically want to know how much of the data at hand has been correctly categorized by the classifier. Accuracy is also an appropriate metric when the data is balanced, in other words, when the data has a roughly equal number of positive examples and negative examples. Otherwise, accuracy can be misleading.

For example, imagine that 95% of a dataset contains positive examples, and the remaining 5% contains negative examples. You train a logistic regression classifier on this data and use this classifier to predict on the same data. If you get an accuracy of 95%, that does not necessarily indicate that this classifier is effective. Since there is a much larger proportion of positive examples than negative examples, the classifier may be biased towards the majority class (positive), and a classifier that always predicts the positive class would reach the same 95% accuracy. When the data you are working with is imbalanced, consider either resampling it to be balanced or using an evaluation metric other than accuracy.
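The imbalance problem can be reproduced in a few lines. The sketch below uses a hypothetical stand-in "classifier" that ignores its input and always predicts the majority class. It is not a trained model, only an illustration of why accuracy alone can mislead on a 95/5 split.

```python
# Hypothetical imbalanced labels: 95 positive examples and 5 negative examples.
y_true = [1] * 95 + [0] * 5

# Stand-in "classifier" that always predicts the majority (positive) class.
y_pred = [1] * len(y_true)

# Accuracy: share of examples whose predicted label matches the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the negative class: share of true negatives that were identified.
true_negatives = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
negative_recall = true_negatives / y_true.count(0)

print(accuracy)
print(negative_recall)
```

The accuracy is 0.95 even though not a single negative example is identified, so a metric computed per class says much more about this model than accuracy does.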

---
<a name="5.-Exploratory-Data-Analysis"></a>
### 5. Exploratory Data Analysis

#### 5.1 Imports

In [None]:
# Import relevant libraries and modules.

import pandas as pd
from sklearn import naive_bayes
from sklearn import model_selection
from sklearn import metrics

# Load the dataset into a DataFrame and save in a variable
data = pd.read_csv("example_file.csv")

#### 5.2 Data Exploration

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

In [None]:
# Display the data type for each column. Note: scikit-learn's Naive Bayes models expect numeric data.
data.dtypes

#### 5.3 Model Preparation 

##### 5.3.1 Isolate variables

In [None]:
# Define the y (target) variable.
y = data['Dependant_Variable']

# Define the X (predictor) variables.
X = data.drop('Dependant_Variable', axis = 1)

In [None]:
# Display the first 10 rows of target data.
y.head(10)

In [None]:
# Display the first 10 rows of predictor variables.
X.head(10)

##### 5.3.2 Split Data

In [None]:
# Perform the split operation on data.
# Assign the outputs as follows: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=0)

In [None]:
# Print the shape (rows, columns) of the output from the train-test split.

# Print the shape of X_train.
print(X_train.shape)

# Print the shape of X_test.
print(X_test.shape)

# Print the shape of y_train.
print(y_train.shape)

# Print the shape of y_test.
print(y_test.shape)

---
<a name="6.-Model-Construction"></a>
### 6. Model Construction



Consider which Naive Bayes algorithm should be used:
- **BernoulliNB**:      Used for binary/Boolean features
- **CategoricalNB**:    Used for categorical features
- **ComplementNB**:     Used for imbalanced datasets, often for text classification tasks
- **GaussianNB**:       Used for continuous features that are assumed to be normally distributed within each class
- **MultinomialNB**:    Used for multinomial (discrete count) features, such as word counts

In [None]:
# Assign `nb` to be the appropriate implementation of Naive Bayes.
nb = naive_bayes.GaussianNB()

# Fit the model on your training data.
nb.fit(X_train, y_train)

# Apply your model to predict on your test data. Call this "y_pred".
y_pred = nb.predict(X_test)
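Because `example_file.csv` is a placeholder, the following self-contained sketch runs the same fit-and-predict steps on synthetic data so the workflow can be tried end to end. The synthetic dataset from `make_classification`, its parameters, and the `demo_`-prefixed variable names are assumptions made for illustration, chosen so they do not overwrite the notebook's own variables.

```python
from sklearn import metrics, model_selection, naive_bayes
from sklearn.datasets import make_classification

# Synthetic stand-in for a real dataset: 400 rows, 5 continuous features.
X_demo, y_demo = make_classification(
    n_samples=400,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Hold out 25% of the rows, keeping class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# GaussianNB is used because the synthetic features are continuous.
demo_nb = naive_bayes.GaussianNB()
demo_nb.fit(X_tr, y_tr)

# Hard class predictions and per-class posterior probabilities.
demo_pred = demo_nb.predict(X_te)
demo_proba = demo_nb.predict_proba(X_te)

print("Accuracy:", metrics.accuracy_score(y_te, demo_pred))
```

The `predict_proba` output has one column per class, and each row contains the posterior probabilities that the model assigns to that observation.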

---
<a name="7.-Model-Results"></a>
### 7. Model Results

#### 7.1 Evaluate Metrics

In [None]:
# Analyze the results by printing accuracy, precision, recall and F1 score.
# Print your accuracy score.
print('Accuracy score:', metrics.accuracy_score(y_test, y_pred))

# Print your precision score.
print('Precision score:', metrics.precision_score(y_test, y_pred))

# Print your recall score.
print('Recall score:', metrics.recall_score(y_test, y_pred))

# Print your F1 score.
print('F1 score:', metrics.f1_score(y_test, y_pred))

# Option 2: better formatted output

# print("Accuracy:", "%.6f" % metrics.accuracy_score(y_test, y_pred))
# print("Precision:", "%.6f" % metrics.precision_score(y_test, y_pred))
# print("Recall:", "%.6f" % metrics.recall_score(y_test, y_pred))
# print("F1 Score:", "%.6f" % metrics.f1_score(y_test, y_pred))

**Question:** What is the accuracy score for your model, and what does this tell you about the success of the model's performance?

**Question:** What are the precision and recall scores for your model, and what do they mean? Is one of these scores noticeably higher than the other?

**Question:** What is the F1 score of your model, and what does this score suggest about this model?

#### 7.2 Confusion Matrix

In [None]:
# Produce a confusion matrix for more clarity
cm = metrics.confusion_matrix(y_test, y_pred, labels=nb.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=nb.classes_)
disp.plot()

**Question:** What do you notice when observing your confusion matrix, and is it consistent with your other calculations?
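One way to relate a confusion matrix to the earlier metrics is to unpack its four cells and recompute precision and recall by hand. The sketch below does this for a binary problem. The labels are made-up values, and the `_demo` names are hypothetical so they do not clash with the notebook's variables.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for a binary problem.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes and columns are predicted classes, ordered as [0, 1].
cm_demo = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# For a 2x2 matrix, ravel() yields the cells in this order.
tn, fp, fn, tp = cm_demo.ravel()

precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)

print(cm_demo)
print(precision_demo, recall_demo)
```

The hand-computed values should match `metrics.precision_score` and `metrics.recall_score` on the same labels, which is a quick check that the matrix is being read in the right orientation.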