# Naive Bayes and Decision Trees

In [None]:
# Import the libraries
from math import log

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In this assignment we will try two classification models:
- Naive Bayes Classification
- Decision Trees

# Naive Bayes Classification

Naive Bayes is a classification technique based on Bayes' Theorem. We use Bayes' Theorem to find the probability of the target variable given the features.

$$ P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)} $$

where,
- $Y$ is the target variable
- $X$ is the feature variable
- $P(Y|X)$ is the posterior probability of the target given features
- $P(X|Y)$ is the likelihood which is the probability of features given the target
- $P(Y)$ is the prior probability of the target
- $P(X)$ is the evidence, i.e. the marginal probability of the features

We look for the class $y$ that maximizes the posterior probability $P(Y=y|X)$.  
Notice that we can ignore the denominator $P(X)$, since it does not depend on $y$ and is therefore the same for all classes.  

We can then formulate our classifier as:

$$ \hat{y} = \underset{y}{\operatorname{argmax}} P(Y=y|X) = \underset{y}{\operatorname{argmax}} P(X|Y=y)P(Y=y) $$
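
As a minimal sketch of why the denominator can be dropped, the snippet below uses made-up numbers (the priors and likelihoods are hypothetical, not taken from the car dataset) and checks that the winning class is the same before and after dividing by $P(X)$.

```python
# Toy numbers (hypothetical): two classes and one observed feature vector x.
priors = {"acc": 0.3, "unacc": 0.7}          # P(Y=y)
likelihoods = {"acc": 0.20, "unacc": 0.05}   # P(X=x | Y=y)

# Unnormalised posterior: P(X=x | Y=y) * P(Y=y)
scores = {y: likelihoods[y] * priors[y] for y in priors}

# Dividing by the evidence P(X=x) rescales every score by the same constant,
# so the winning class does not change.
evidence = sum(scores.values())
posteriors = {y: s / evidence for y, s in scores.items()}

y_hat = max(scores, key=scores.get)
print(y_hat, posteriors)
```

Here `acc` wins even though its prior is smaller, because its likelihood for this observation is four times larger.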

## Data Preprocessing

Let's first load the training dataset into a pandas dataframe:

In [None]:
df = pd.read_csv('car_train.csv')
df.head()

Here we have 6 different features and a `Decision` target variable.

Every feature here is categorical. Let's look at the unique values that each feature takes.

In [None]:
df.copy().apply(lambda x: x.unique())

In [None]:
# We'll convert this to a dictionary for later usage.
# Building it column by column works even if every column has the same number of unique values.
unique_values = {col: df[col].unique().tolist() for col in df.columns}

In [None]:
# Following is the form of the dictionary
unique_values

## Calculate the Prior and Likelihood values

Recall that for Naive Bayes, we essentially want to calculate

$$ P(Y=y|X) \propto P(X|Y=y)P(Y=y) $$

where $X$ is the feature vector and $Y$ is the target variable.  

### **Prior Probabilities $P(Y=y)$**

Here you will calculate the prior probabilities of each class and store them as a Python list:

$$
    prior\_list = \begin{bmatrix}
        p_1 & p_2 & p_3 & p_4
    \end{bmatrix}
$$
  
where,
- $p_1 = P(Decision = unacc)$
- $p_2 = P(Decision = acc)$
- $p_3 = P(Decision = good)$
- $p_4 = P(Decision = vgood)$

In [None]:
# You can do this in a single line
prior_list = None

In [None]:
# Let's convert this prior list to a pandas dataframe and display it
prior_df = None
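
One possible approach, sketched on a small made-up target column rather than on `df['Decision']`, is to take relative class frequencies with `value_counts(normalize=True)`. The names `toy_target`, `toy_prior` and `toy_prior_df` are hypothetical; the single `Prior` column is an assumption chosen to match how `prior_df` is indexed later in this notebook.

```python
import pandas as pd

# Hypothetical target column standing in for df['Decision'].
toy_target = pd.Series(["unacc", "unacc", "acc", "unacc", "good", "acc", "vgood", "unacc"])
classes = ["unacc", "acc", "good", "vgood"]

# Relative frequency of each class, reordered to match p_1..p_4.
toy_prior = toy_target.value_counts(normalize=True).reindex(classes, fill_value=0.0)
toy_prior_list = toy_prior.tolist()

# One row per class and a single 'Prior' column.
toy_prior_df = toy_prior.to_frame(name="Prior")
print(toy_prior_df)
```

The `reindex` call fixes the class order and assigns a prior of 0 to any class that never appears in the data.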

### **Likelihood Probabilities $P(X|Y)$**

The main assumption in Naive Bayes is that the features are conditionally independent given the target variable.  
This means that we can calculate the likelihood probabilities as:

$$ P(X|Y=y) = \prod_{i=1}^{n} P(X_i|Y=y) $$

where $X_i$ is the $i^{th}$ feature of $X$.

Let's take an example of a single feature (say buying price) and calculate the likelihood probabilities for each class.  
What we want is a pandas dataframe that looks something like this:

| - | unacc | acc | good | vgood |
| --- | --- | --- | --- | --- |
| vhigh | - | - | - | - |
| high | - | - | - | - |
| med | - | - | - | - |
| low | - | - | - | - |

where each cell is the likelihood of that feature value given the class, $P(X_i = x | Y = y)$, so every column sums to 1. You will need to build one such table for every feature.
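
A table of this shape can be sketched with `pd.crosstab`, shown here on a tiny made-up dataframe (`toy_df` and `toy_likelihood` are hypothetical names, and the data is not the car dataset). Passing `normalize="columns"` divides each count by its class total, which is one way to obtain the conditional probabilities.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "buying":   ["low", "low", "high", "high", "med", "low"],
    "Decision": ["acc", "unacc", "unacc", "unacc", "acc", "acc"],
})

# Count each (feature value, class) pair, then normalise every column so that
# each class column sums to 1, i.e. cell = P(buying = value | Decision = class).
toy_likelihood = pd.crosstab(toy_df["buying"], toy_df["Decision"], normalize="columns")
print(toy_likelihood)
```

Rows are feature values and columns are classes, matching the layout of the table above. A feature value that never occurs with a class gets a likelihood of 0, which is why a small constant is needed before taking logarithms later on.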

In [None]:
likelihood_feature_1 = None
likelihood_feature_2 = None
likelihood_feature_3 = None
likelihood_feature_4 = None
likelihood_feature_5 = None
likelihood_feature_6 = None

## Prediction on test set

### **Calculation of Posterior Probabilities $P(Y|X)$**

Recall that we want to find the class $\hat{y}$ that maximizes the posterior probability $P(Y|X)$ according to the formula:

$$ \hat{y} = \underset{y}{\operatorname{argmax}} P(Y=y|X) = \underset{y}{\operatorname{argmax}} P(X|Y=y)P(Y=y) $$

Since we assume that all the features are conditionally independent given the class (the Naive Bayes assumption), we can rewrite our objective as:

$$ \hat{y} = \underset{y}{\operatorname{argmax}} \; P(Y=y) \prod_{i=1}^{n} P(X_i|Y=y) $$


For numerical stability, we maximise the log-posterior instead of the posterior. Because the logarithm is monotonically increasing, the maximising class is unchanged, and the objective becomes:

$$ \hat{y} = \underset{y}{\operatorname{argmax}} \left( \log P(Y=y) + \sum_{i=1}^{n} \log P(X_i|Y=y) \right) $$
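
To illustrate the stability issue, here is a small sketch with made-up numbers. It assumes 400 features that each have a likelihood of 0.01, far more features than the 6 in this dataset, so it exaggerates the effect on purpose.

```python
from math import log

# Hypothetical example: 400 per-feature likelihoods of 0.01 each.
probs = [0.01] * 400

product = 1.0
for p in probs:
    product *= p  # the true value 1e-800 is too small for a float, so this underflows to 0.0

log_sum = sum(log(p) for p in probs)  # about -1842.07, easily representable

print(product, log_sum)
```

Once the product has collapsed to 0.0 for every class, the classes can no longer be compared, whereas the sums of logs remain distinct.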

Let's start with a single example.

Assume that we have a single example with the following feature values:

| buying | maint | doors | capacity | lug_boot | safety |
| --- | --- | --- | --- | --- | --- |
| low | vhigh | 2 | more | med | high |


We want to calculate the posterior probabilities for each class for this example.  
  
Do the following steps for each class:
- Calculate the log-likelihood probability for each feature value and add them together
- Add the above value with the log-prior probability of that class
- Store the result in a pandas dataframe

You should get a pandas dataframe that looks something like this:

| - | unacc | acc | good | vgood |
| --- | --- | --- | --- | --- |
| Log-Posterior | - | - | - | - |

where each cell is the log-posterior (up to an additive constant) of that class given the feature values of the example. You will need to compute one value for every class.

NOTE: Use the likelihood/prior values that you calculated earlier.

I'll do it for the class `unacc`. Implement the rest in a similar fashion, and feel free to change the code if you want to.

In [None]:
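# Feature values of the example, in column order
features = ['low', 'vhigh', '2', 'more', 'med', 'high']

# Likelihood tables in the same order as the features
likelihood_tables = [
    likelihood_feature_1, likelihood_feature_2, likelihood_feature_3,
    likelihood_feature_4, likelihood_feature_5, likelihood_feature_6,
]

# Let's store the log prior (take the log, since we are summing log-probabilities)
log_prior = log(prior_df.loc['unacc', 'Prior'])

log_likelihood = 0

for table, feature in zip(likelihood_tables, features):
    # Rows are feature values and columns are classes, as in the likelihood table layout
    log_likelihood += log(table.loc[feature, 'unacc'] + 1e-9) # Add the 1e-9 to avoid taking log(0)

log_posterior = log_prior + log_likelihood

print(f'The Log Posterior for class `unacc` is: {log_posterior:.3f}')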


#! Hint: Convert this to a function which takes the class_name and features as parameters and return the log_posterior
#! This should help you in the next section as well


In [None]:
def get_log_posterior(class_name: str, features: list):
    """Returns the log posterior for the class name given the features

    Args:
        class_name (str): class name. [unacc, acc, good, vgood]
        features (list): list of features
    """

    # Write your code here
    pass

Based on the log-posteriors, predict the decision for this example: the predicted class is the one with the largest log-posterior.
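
As a sketch of this last step, assuming the log-posteriors are stored in a one-row dataframe like the one described earlier, `idxmax` picks out the winning class. The numbers below are made up for illustration and are not results from the car data.

```python
import pandas as pd

# Hypothetical log-posteriors for one example (not computed from the car data).
toy_log_posterior = pd.DataFrame(
    {"unacc": [-3.58], "acc": [-1.78], "good": [-9.21], "vgood": [-20.72]},
    index=["Log-Posterior"],
)

# idxmax returns the column label holding the largest value in the row.
toy_prediction = toy_log_posterior.loc["Log-Posterior"].idxmax()
print(toy_prediction)
```

Log-posteriors are negative, so the "largest" value is the one closest to zero.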

### Prediction with Test set

Now that we know how to do this for a single datapoint, let's do this for the whole test set.

Let's start, as before, by loading the test set into a pandas dataframe:

In [None]:
df_test = pd.read_csv('car_test.csv')
df_test.head()

Split the dataframe into features and target:

In [None]:
df_x = df_test.drop('Decision', axis=1)
df_y = df_test['Decision']

Now, using the features in `df_x`, predict the `Decision` for each row in the same way as above:

In [None]:
predictions = []

for _, row in df_x.iterrows():
    features = list(row)
    
    # Write your code here
    pass

Compare the values of the prediction with the actual values. What is the accuracy?

In [None]:
true_values = list(df_y)

correct = 0
for pred, true_val in zip(predictions, true_values):
    if pred == true_val:
        correct += 1

accuracy = correct / len(true_values)

print(f'Test Accuracy using Naive Bayes Classifier is: {accuracy:.4f}')
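
The same accuracy can also be computed without an explicit loop. The sketch below uses made-up predictions and labels (`toy_predictions` and `toy_truth` are hypothetical) and assumes both sequences have the same length and order.

```python
import pandas as pd

toy_predictions = ["unacc", "acc", "unacc", "good"]
toy_truth = pd.Series(["unacc", "acc", "acc", "good"])

# Element-wise comparison gives booleans; their mean is the fraction correct.
toy_accuracy = (pd.Series(toy_predictions) == toy_truth).mean()
print(f"{toy_accuracy:.4f}")
```

Three of the four toy predictions match, so the accuracy is 0.75.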