# Bayes Theorem
In probability, Bayes theorem describes the probability of an event based on prior knowledge of conditions that may be related to that event.

### Bayes Theorem Formula
<img src="images/nb/bayes_theorem.jpg" height="50%" width="50%"></img>
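
In case the image does not render, the standard statement of Bayes theorem for two events A and B (assuming P(B) > 0) is written out here. This is the textbook form and is presumably what the image shows:

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$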

### Bayes Theorem Example
Let's say there are two machines that produce wrenches.

Machine 1 produces 30 wrenches per hour.  
Machine 2 produces 20 wrenches per hour.

Out of all the wrenches produced, 1% are defective.

Out of all the defective parts, 50% came from Machine 1 and the other 50% came from Machine 2.

Question: What is the probability that a part produced by Machine 2 is defective?

### Bayes Theorem Example Solution
Probability a part is from Machine 1: ```P(Mach1) = 30 / 50 = 0.6 = 60%```  
Probability a part is from Machine 2: ```P(Mach2) =  20 / 50 = 0.4 = 40%```

(Given) Probability a part is defective: ```P(Defect) = 1%```

(Given) Likelihood a defective part is from Machine 1: ```P(Mach1 | Defect) = 50%```  
(Given) Likelihood a defective part is from Machine 2: ```P(Mach2 | Defect) = 50%```
- Notice how Machine 2 produces only 40% of the output, yet 50% of the defective parts come from it; therefore, Machine 2 has a higher defect rate than Machine 1 (both machines produce the same number of defective parts, but Machine 2 makes fewer parts overall)

#### Now let's calculate the probability using Bayes Theorem:
```P(Defect | Mach2)``` = ```P(Mach2 | Defect) * P(Defect) / P(Mach2)``` = ```0.5 * 0.01 / 0.4``` = ```0.0125 = 1.25%```

Intuitively, we just counted the number of defective wrenches that came from Machine 2 and divided by the total number of wrenches that came from Machine 2.
- This intuition is actually how Bayes Theorem works!
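
The calculation and the counting intuition above can be checked side by side. This is a minimal sketch; the batch size of 10,000 wrenches is a hypothetical number chosen only to make the counts easy to read.

```python
# Wrench example: recover P(Defect | Mach2) with Bayes theorem.
p_mach2 = 20 / 50            # share of all wrenches made by Machine 2
p_defect = 0.01              # given: share of all wrenches that are defective
p_mach2_given_defect = 0.50  # given: share of defective wrenches from Machine 2

p_defect_given_mach2 = p_mach2_given_defect * p_defect / p_mach2
print(f"P(Defect | Mach2) = {p_defect_given_mach2:.4f}")

# Counting check on a hypothetical batch of 10,000 wrenches (assumption).
total = 10_000
defective = total * p_defect                              # defective wrenches overall
defective_from_mach2 = defective * p_mach2_given_defect   # defective wrenches from Machine 2
made_by_mach2 = total * p_mach2                           # all wrenches from Machine 2
ratio = defective_from_mach2 / made_by_mach2
print(f"Counting ratio     = {ratio:.4f}")
```

Both lines print the same value, which is the point of the intuition: the formula is the counting ratio written in terms of probabilities.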

# Naive Bayes Classifier
We can use Bayes theorem to calculate the probability that a data point belongs to a category.

# Naive Bayes Classifier Example
Using two features, Age and Salary, determine whether a person Drives or Walks.

Is a "New data point" classified as a person that Walks or Drives?
<img src="images/nb/naive_bayes_example.png" height="50%" width="50%"></img>

Let's assume X = features (independent variables) of the "New data point":
- Age = 25, Salary = 30,000

### Solving Problem Using Bayes Theorem
#### 1. Calculate the probability that the person walks:  
<img src="images/nb/walks_bayes_theorem.png" height="50%" width="50%"></img>

P(Walks) = ```Number of Walkers / Total Observations``` = ```10 / 30```

#### We can solve for P(X) using the "radius" method:
<img src="images/nb/calculate_p(x).png" height="75%" width="75%"></img>
- Specify a radius hyperparameter, so the data points that lie within the radius are called the "Similar Observations"

P(X) = ```4 / 30```

P(X | Walks) = ```Number of Similar Observations Who Walk / Total Number of Walkers``` = ```3 / 10```

#### Now substitute these variables into the Bayes Theorem:
P(Walks | X) = ```(3 / 10 * 10 / 30) / (4 / 30)``` = ```0.75 = 75%```

#### 2. Calculate the probability that the person drives:  
<img src="images/nb/drives_bayes_theorem.png" height="50%" width="50%"></img>

To avoid repeating the same steps, we can reuse the reasoning from the Walks calculation. Because Walks and Drives are the only two classes, the two probabilities must add up to 100%, so the probability that the person drives is ```1 - 0.75 = 0.25 = 25%```.

#### 3. Conclusion
Therefore, because 75% > 25%, we can classify the "New data point" as a person that walks.
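
The whole worked example can be reproduced from the counts quoted above. This is a sketch rather than a general implementation: the number of drivers and the number of similar drivers are not stated in the text and are derived here on the assumption that Walks and Drives are the only two classes.

```python
# Counts read off the worked example.
total = 30
walkers = 10
similar = 4           # observations inside the radius around the new point
similar_walkers = 3

# Derived counts (assumption: Walks and Drives are the only classes).
drivers = total - walkers
similar_drivers = similar - similar_walkers

p_x = similar / total  # P(X)


def posterior(class_count, similar_in_class):
    """P(class | X) = P(X | class) * P(class) / P(X)."""
    prior = class_count / total
    likelihood = similar_in_class / class_count
    return likelihood * prior / p_x


posteriors = {
    "Walks": posterior(walkers, similar_walkers),
    "Drives": posterior(drivers, similar_drivers),
}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

The two posteriors sum to 1, which is why step 2 can be shortcut once step 1 is known.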

# Extras About Naive Bayes Classifier
### 1. Why is it called "Naive"?
Because the classifier makes "independence assumptions": it treats the features as independent of each other within each class. Bayes Theorem itself does not need this, but the ML Model relies on it, and the assumption is often not correct. It's kind of naive to assume it holds.

For example, in the above example, the model assumes that Age and Salary are independent within each class.
- In reality the two are usually correlated, because a person's salary tends to increase with age, so the assumption only holds approximately in this problem
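
Written out for the two features in this example, the naive assumption says that the likelihood factorizes into one term per feature. This is the standard textbook form of the assumption, shown here for the Walks class:

$$
P(X \mid \text{Walks}) = P(\text{Age} \mid \text{Walks}) \cdot P(\text{Salary} \mid \text{Walks})
$$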


### 2. What is P(X)?
P(X) is the probability that a point from the entire data set has similar features to the "New data point". We define a radius to determine how many points would be similar to the "New data point."
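
A minimal sketch of the radius idea follows. The observations, the new point, and the radius are all hypothetical values made up for illustration (they are not the data in the figure), and salary is expressed in thousands so that both features are on a comparable scale.

```python
import math

# Hypothetical (age, salary in thousands) observations; not the figure's data.
observations = [(25, 30), (27, 32), (45, 80), (23, 28), (50, 95), (26, 35)]
new_point = (25, 31)
radius = 3.0  # hyperparameter, chosen by hand for this illustration

# "Similar Observations" are the points within the radius of the new point.
similar = [p for p in observations if math.dist(p, new_point) <= radius]

p_x = len(similar) / len(observations)
print(similar, p_x)
```

A larger radius counts more points as similar, so the estimate of P(X) depends directly on this choice.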

### 3. How to handle more than 2 classes (classifications)?
We would calculate the probability for all those classes, and the greatest probability is the classification.

For example, suppose there are 3 classes and the "New data point" X has these probabilities:
- P(Study | X) = 25%
- P(Games | X) = 40%
- P(Talks | X) = 35%

The person would be classified as someone who Games because it has the greatest probability.
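
Picking the class is just taking the maximum over the computed probabilities. A short sketch using the three values from the example above:

```python
# Posterior probabilities from the three-class example.
posteriors = {"Study": 0.25, "Games": 0.40, "Talks": 0.35}

# The predicted class is the one with the greatest posterior probability.
predicted_class = max(posteriors, key=posteriors.get)
print(predicted_class)
```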

In [17]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [18]:
# import the data set
ads_df = pd.read_csv("datasets/social_network_ads.csv")

ads_df.head()

|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [19]:
# x is the Age and Estimated Salary columns
x = ads_df.iloc[:, [2, 3]].values

# y is the Purchased column
y = ads_df.iloc[:, 4].values

In [20]:
# split the data set into training and testing data sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

In [21]:
# import a Standardization Scaler for Feature Scaling
from sklearn.preprocessing import StandardScaler

# feature scale the training and testing sets
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)



# Naive Bayes Classifier

In [22]:
# import the gaussian naive bayes class
from sklearn.naive_bayes import GaussianNB

In [23]:
"""
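create a Gaussian naive Bayes classifier, then fit it to the training set
- GaussianNB has no radius hyperparameter; it estimates a mean and variance
  for each feature within each class and uses normal densities as the likelihoods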
"""
classifier = GaussianNB()
classifier.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
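
To make the Gaussian variant less of a black box, here is a hand-rolled sketch of the idea in pure Python: per class, estimate a prior plus a mean and variance for each feature, then multiply the prior by the per-feature normal densities. The toy data and the function names are hypothetical, and the sketch leaves out the variance smoothing that scikit-learn applies, so it is an illustration of the technique rather than a copy of the library's implementation.

```python
import math
from statistics import fmean, pvariance

# Hypothetical scaled training data: two features per row, binary labels.
x_toy = [(-1.2, -0.8), (-0.9, -1.1), (-1.0, -0.5),
         (1.1, 0.9), (0.8, 1.3), (1.0, 0.7)]
y_toy = [0, 0, 0, 1, 1, 1]


def fit(xs, ys):
    """Estimate a prior plus per-feature mean and variance for each class."""
    params = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        columns = list(zip(*rows))
        params[label] = {
            "prior": len(rows) / len(xs),
            "means": [fmean(c) for c in columns],
            "variances": [pvariance(c) for c in columns],
        }
    return params


def gaussian_pdf(value, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    exponent = -((value - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


def predict_proba(params, point):
    """Posterior per class: prior times product of feature densities, normalized."""
    scores = {}
    for label, p in params.items():
        score = p["prior"]
        for value, mean, variance in zip(point, p["means"], p["variances"]):
            score *= gaussian_pdf(value, mean, variance)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


params = fit(x_toy, y_toy)
probs = predict_proba(params, (-0.7, -0.9))
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Multiplying the per-feature densities is exactly where the naive independence assumption enters.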

In [24]:
# predict the test set results
y_pred = classifier.predict(x_test)

y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1])

# Confusion Matrix

In [25]:
# import the confusion matrix function
from sklearn.metrics import confusion_matrix

In [26]:
# create a confusion matrix that compares the y_test (actual) to the y_pred (prediction)
cm = confusion_matrix(y_test, y_pred)

"""
Read the Confusion Matrix diagonally:
65 + 25 = 90 correct predictions
7 + 3 = 10 incorrect predictions
"""
cm

array([[65,  3],
       [ 7, 25]])
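
The counts in the matrix can be turned into summary metrics. The sketch below copies the matrix from the output above into a plain list and assumes the usual scikit-learn layout, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = [[65, 3],
      [7, 25]]

(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all predictions that were correct
precision = tp / (tp + fp)     # share of predicted purchases that were real
recall = tp / (tp + fn)        # share of real purchases that were found

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.3f}")
```

Accuracy matches the 90 correct predictions out of 100 noted in the cell above; precision and recall show how those errors split between the two kinds of mistake.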