# TP5 - Naive Bayes & Logistic Regression
---
_Author: CHRISTOFOROU Anthony_\
_Due Date: XX-XX-2023_\
_Updated: 29-11-2023_\
_Description: TP5 - AI_

---

In [20]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Modules
from assignment5.models.classifiers.naive_bayes import NaiveBayesClassifier
from assignment5.models.classifiers.logistic_regression import LogisticRegressionClassifier


# make figures appear inline
matplotlib.rcParams['figure.figsize'] = (15, 8)
%matplotlib inline

# notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2



# 0. Data

<div class="alert alert-block alert-info">
Before starting, we have to import the data and prepare it for the training!
</div>

### 0.1. Importing the data

In [21]:
# Load
train_data_path = 'data/data_train.csv'
test_data_path = 'data/data_test.csv'

train_df = pd.read_csv(train_data_path)
test_df = pd.read_csv(test_data_path)

In [22]:
# Train data
train_df.head()

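|   | Gender | Age | EstimatedSalary | Purchased |
|---|--------|-----|-----------------|-----------|
| 0 | Male   | 19  | 19000           | 0         |
| 1 | Male   | 35  | 20000           | 0         |
| 2 | Female | 26  | 43000           | 0         |
| 3 | Female | 27  | 57000           | 0         |
| 4 | Male   | 19  | 76000           | 0         |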


In [23]:
# Test data
test_df.head()

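|   | Gender | Age | EstimatedSalary | Purchased |
|---|--------|-----|-----------------|-----------|
| 0 | Female | 53  | 104000          | 1         |
| 1 | Male   | 35  | 75000           | 0         |
| 2 | Female | 38  | 65000           | 0         |
| 3 | Female | 47  | 51000           | 1         |
| 4 | Male   | 47  | 105000          | 1         |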


### 0.2. Preprocessing the data

In [24]:
# Create a label encoder instance
label_encoder = LabelEncoder()

# Convert 'Gender' to binary variables in both train and test datasets
train_df['Gender'] = label_encoder.fit_transform(train_df['Gender'])
test_df['Gender'] = label_encoder.transform(test_df['Gender'])

In [25]:
train_df.head()

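|   | Gender | Age | EstimatedSalary | Purchased |
|---|--------|-----|-----------------|-----------|
| 0 | 1      | 19  | 19000           | 0         |
| 1 | 1      | 35  | 20000           | 0         |
| 2 | 0      | 26  | 43000           | 0         |
| 3 | 0      | 27  | 57000           | 0         |
| 4 | 1      | 19  | 76000           | 0         |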


In [26]:
test_df.head()

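|   | Gender | Age | EstimatedSalary | Purchased |
|---|--------|-----|-----------------|-----------|
| 0 | 0      | 53  | 104000          | 1         |
| 1 | 1      | 35  | 75000           | 0         |
| 2 | 0      | 38  | 65000           | 0         |
| 3 | 0      | 47  | 51000           | 1         |
| 4 | 1      | 47  | 105000          | 1         |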


## 1. Naive Bayes Classifier

### 1.1 Empirical Distribution of the Labels

The `fit` method of the [`NaiveBayesClassifier`](assignment5/models/classifiers/naive_bayes.py) class computes the empirical distribution of the labels, that is, the prior probability of each class in the training set. The prior is the probability assigned to a class before any features are observed:

```python
self.params[c]['prior'] = X_c.shape[0] / X.shape[0]
```

This line calculates the proportion of instances in the training set that belong to class $ c $, forming the basis for subsequent probability calculations.
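
To make the prior computation concrete, here is a minimal standalone sketch. The label vector `y` is a hypothetical toy example, not the assignment data, and the dictionary layout is only an assumption for illustration:

```python
import numpy as np

# Hypothetical toy labels: 7 samples of class 0 and 3 samples of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Prior of class c = (number of samples labelled c) / (total number of samples).
classes, counts = np.unique(y, return_counts=True)
priors = {int(c): float(n) / y.shape[0] for c, n in zip(classes, counts)}

print(priors)  # {0: 0.7, 1: 0.3}
```

The priors sum to 1 by construction, since every sample belongs to exactly one class.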

### 1.2 Estimation of the Parameters of the Covariates' Distributions for Each Label Value

Still in the `fit` method, the mean and variance of each feature are estimated separately for each class. Naive Bayes assumes that the features are conditionally independent given the class, so each feature can be modeled by its own univariate Gaussian:

```python
self.params[c] = {
    'mean': X_c.mean(axis=0),
    'var': X_c.var(axis=0),
    # Additional parameters...
}
```

Here, `X_c.mean(axis=0)` and `X_c.var(axis=0)` compute the per-feature mean and variance over the rows of the training data that belong to class $ c $, which is all that is needed to model each feature's distribution.
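
The per-class estimation can be sketched on a small example. The arrays below are hypothetical toy data, and the loop is an illustration of the idea rather than the module's actual code:

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are two features.
X = np.array([[1.0, 10.0],
              [3.0, 14.0],
              [5.0, 20.0],
              [7.0, 30.0]])
y = np.array([0, 0, 1, 1])

params = {}
for c in np.unique(y):
    X_c = X[y == c]                  # rows belonging to class c
    params[int(c)] = {
        "mean": X_c.mean(axis=0),    # one mean per feature
        "var": X_c.var(axis=0),      # one variance per feature (ddof=0, NumPy's default)
    }

print(params[0]["mean"], params[0]["var"])
print(params[1]["mean"], params[1]["var"])
```

Note that `var` with NumPy's default `ddof=0` is the population variance (division by the class size), not the unbiased sample variance.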

### 1.3 Implementation of the Gaussian Density Function

The `gaussian_density` method implements the Gaussian density function. It returns the likelihood of observing a feature value under the normal distribution fitted for a given class:

```python
numerator = np.exp(- (x - mean) ** 2 / (2 * var))
denominator = np.sqrt(2 * np.pi * var)
return numerator / denominator
```

In this expression, `x` is the feature value, while `mean` and `var` are the Gaussian parameters for the feature under a specific class.
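
As a sanity check, the same formula can be written as a standalone function and evaluated at a point where the result is known. The name `gaussian_pdf` and its signature are assumptions made for this sketch; the class method takes different arguments:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate normal density, evaluated element-wise (hypothetical helper)."""
    numerator = np.exp(-((x - mean) ** 2) / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)
    return numerator / denominator

# At x == mean with var == 1 the density equals 1 / sqrt(2 * pi), about 0.3989.
peak = gaussian_pdf(np.array([0.0]), mean=np.array([0.0]), var=np.array([1.0]))
print(peak)
```

A feature whose variance is exactly zero within a class would cause a division by zero in this formula; adding a small constant to the variance is a common guard, although whether the module does so is not shown here.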

### 1.4 Prediction of Labels Given New Covariates

The `predict` method in [`NaiveBayesClassifier`](assignment5/models/classifiers/naive_bayes.py) predicts labels for new data points. For each instance it computes the log-posterior of every class from the instance's feature values and the parameters estimated during training, then selects the class with the highest value:

```python
for x in X:
    posteriors = []
    for c in self.classes:
        prior = np.log(self.params[c]['prior'])
        class_conditional = np.sum(np.log(self.gaussian_density(c, x)))
        posterior = prior + class_conditional
        posteriors.append(posterior)
    preds.append(np.argmax(posteriors))
```

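For each instance, the class with the maximum log-posterior is chosen. The log-posterior is the sum of the log-prior and the log-likelihoods of the features, which is the logarithm of the product of the prior and the per-feature likelihoods (up to a normalizing constant shared by all classes). Note that `np.argmax` returns a position in `posteriors`, not a label; the two coincide here provided `self.classes` is ordered as `[0, 1]`.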
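
The decision rule can be sketched end to end on toy parameters. Everything below (`log_gaussian`, `predict_labels`, and the parameter values) is hypothetical and only illustrates the log-space computation; unlike the snippet above, it maps the winning index back to its label explicitly:

```python
import numpy as np

def log_gaussian(x, mean, var):
    # Log of the normal density; staying in log space avoids underflow.
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_labels(X, classes, params):
    preds = []
    for x in X:
        log_posteriors = [
            np.log(params[c]["prior"])
            + np.sum(log_gaussian(x, params[c]["mean"], params[c]["var"]))
            for c in classes
        ]
        # argmax gives a position in the list; look up the label at that position.
        preds.append(classes[int(np.argmax(log_posteriors))])
    return np.array(preds)

# Hypothetical parameters: two classes, one feature, equal priors.
classes = [0, 1]
params = {
    0: {"prior": 0.5, "mean": np.array([0.0]), "var": np.array([1.0])},
    1: {"prior": 0.5, "mean": np.array([4.0]), "var": np.array([1.0])},
}
result = predict_labels(np.array([[0.5], [3.5]]), classes, params)
print(result)  # [0 1]
```

Each point is assigned to the class whose mean it is closest to, as expected when priors and variances are equal.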

In [27]:
X_train = train_df.drop('Purchased', axis=1).values
y_train = train_df['Purchased'].values

nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(X_train, y_train)

print(nb_classifier)

Naive Bayes Classifier Summary:
------------------------------------------------------------
Class      | Prior      | Mean                 | Variance
------------------------------------------------------------
0          | 0.7000     | 0.51, 32.27, 60100.84 | 0.25, 64.32, 625023444.67
1          | 0.3000     | 0.45, 45.00, 96549.02 | 0.25, 79.67, 1617855440.22


## 2. Logistic Regression Classifier

### 2.1 Derivations

#### (a) $ p(y_i | x_i; w, b) $

Given that $ y_i $ follows a Bernoulli distribution and $ p_i = \sigma(w^T \cdot x_i + b) $, where $ \sigma(z) = \frac{1}{1 + e^{-z}} $, the probability $ p(y_i | x_i; w, b) $ is:

$$
p(y_i | x_i; w, b) = p_i^{y_i} \cdot (1 - p_i)^{(1-y_i)}
$$

> This is the probability of observing the outcome $ y_i $, which can be either 0 or 1, for the given input $ x_i $. The probability $ p_i $ is obtained by passing the linear combination of inputs and parameters $ w^T x_i + b $ through the sigmoid function $ \sigma $.

#### (b) $ \log(p(y_i | x_i; w, b)) $

The log likelihood of $ p(y_i | x_i; w, b) $ is:

$$
\log(p(y_i | x_i; w, b)) = y_i \log(p_i) + (1 - y_i) \log(1 - p_i)
$$


<div class="alert alert-info"> Taking the logarithm of the probability expression is common practice because it turns products into sums, which are easier to differentiate and numerically more stable to compute. </div>
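
Both points can be checked numerically. The sketch below uses only the standard library; the probability values are hypothetical:

```python
import math

def bernoulli_pmf(y, p):
    # p^y * (1 - p)^(1 - y), the expression from (a)
    return p ** y * (1 - p) ** (1 - y)

def bernoulli_log_pmf(y, p):
    # y * log(p) + (1 - y) * log(1 - p), the expression from (b)
    return y * math.log(p) + (1 - y) * math.log(1 - p)

p = 0.8  # hypothetical predicted probability
for y in (0, 1):
    assert math.isclose(math.log(bernoulli_pmf(y, p)), bernoulli_log_pmf(y, p))

# A product of many small probabilities underflows; the sum of their logs does not.
probs = [1e-5] * 100
product = math.prod(probs)
log_sum = sum(math.log(q) for q in probs)
print(product)   # 0.0
print(log_sum)   # about -1151.29
```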

#### (c) $ \frac{\partial \sigma(z)}{\partial z} $

The derivative of the sigmoid function $ \sigma(z) $ with respect to $ z $ is:

$$
\begin{align} \frac{d\sigma(z)}{dz} 
&= \frac{d}{dz} \frac{1}{1 + \exp(-z)} \\ 
&= \frac{d}{dz} (1 + \exp(-z))^{-1} \\ 
&= -(1 + \exp(-z))^{-2} \cdot \frac{d}{dz} (1 + \exp(-z)) \\ 
&= -(1 + \exp(-z))^{-2} \cdot (-\exp(-z)) \\
&= \frac{1}{1 + \exp(-z)} \cdot \frac{\exp(-z)}{1 + \exp(-z)} \\ 
&= \sigma(z) \cdot \frac{\exp(-z)}{1 + \exp(-z)} \\ 
&= \sigma(z) \cdot \left(1 - \frac{1}{1 + \exp(-z)}\right) \\ 
&= \sigma(z) \cdot \left(1 - \sigma(z)\right) \end{align}
$$

> This derivative is a key component in the gradient descent algorithm used to update the weights of the logistic regression model. It indicates how a change in the weighted sum $ z $ affects the probability $ \sigma(z) $.
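
The identity can be verified with a central finite difference. This is a small check written for illustration, with arbitrarily chosen test points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Closed form derived above: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

h = 1e-6
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(numeric - sigmoid_grad(z)) < 1e-8

print(sigmoid_grad(0.0))  # 0.25, the maximum of the derivative
```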

#### (d) $ \frac{\partial \log(p(y_i | x_i; w, b))}{\partial w_j} $

The partial derivative of the log likelihood with respect to weight $ w_j $ is:

$$\begin{align} \frac{d\log(p(y_i|x_i;w,b))}{dw_j} 
&= \frac{d}{dw_j} \left(y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i})\right) \\ 
&= \frac{d}{dw_j} \left(y_{i} \log(\sigma(w^T x_i + b)) + (1 - y_{i}) \log(1 - \sigma(w^T x_i + b))\right) \\ 
&= y_{i} \frac{d}{dw_j} \log(\sigma(w^T x_i + b)) + (1 - y_{i}) \frac{d}{dw_j} \log(1 - \sigma(w^T x_i + b)) \\ 
&= y_{i} \frac{1}{\sigma(w^T x_i + b)} \cdot \frac{d}{dw_j} \sigma(w^T x_i + b) + (1 - y_{i}) \frac{1}{1 - \sigma(w^T x_i + b)} \cdot \frac{d}{dw_j} (1 - \sigma(w^T x_i + b)) \\ 
&= y_{i} \frac{1}{\sigma(w^T x_i + b)} \cdot \sigma(w^T x_i + b) \cdot (1 - \sigma(w^T x_i + b)) \cdot \frac{d}{dw_j} (w^T x_i + b) + (1 - y_{i}) \frac{1}{1 - \sigma(w^T x_i + b)} \cdot (-\sigma(w^T x_i + b)) \cdot (1 - \sigma(w^T x_i + b)) \cdot \frac{d}{dw_j} (w^T x_i + b) \\ 
&= y_{i} \cdot (1 - \sigma(w^T x_i + b)) \cdot x_{ij} + (1 - y_{i}) \cdot (-\sigma(w^T x_i + b)) \cdot x_{ij} \\ 
&= y_{i} \cdot x_{ij} - y_{i} \cdot \sigma(w^T x_i + b) \cdot x_{ij} - \sigma(w^T x_i + b) \cdot x_{ij} + y_{i} \cdot \sigma(w^T x_i + b) \cdot x_{ij} \\ 
&= y_{i} \cdot x_{ij} - \sigma(w^T x_i + b) \cdot x_{ij} \\ 
&= x_{ij} (y_{i} - \sigma(w^T x_i + b))\end{align}$$

#### (e) $ \frac{\partial \log(p(y_i | x_i; w, b))}{\partial b} $

The partial derivative of the log likelihood with respect to bias $ b $ is:

$$\begin{align} \frac{d\log(p(y_i|x_i;w,b))}{db} 
&= \frac{d}{db} \left(y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i})\right) \\ 
&= \frac{d}{db} \left(y_{i} \log(\sigma(w^T x_i + b)) + (1 - y_{i}) \log(1 - \sigma(w^T x_i + b))\right) \\ 
&= y_{i} \frac{d}{db} \log(\sigma(w^T x_i + b)) + (1 - y_{i}) \frac{d}{db} \log(1 - \sigma(w^T x_i + b)) \\ 
&= y_{i} \frac{1}{\sigma(w^T x_i + b)} \cdot \frac{d}{db} \sigma(w^T x_i + b) + (1 - y_{i}) \frac{1}{1 - \sigma(w^T x_i + b)} \cdot \frac{d}{db} (1 - \sigma(w^T x_i + b)) \\ 
&= y_{i} \frac{1}{\sigma(w^T x_i + b)} \cdot \sigma(w^T x_i + b) \cdot (1 - \sigma(w^T x_i + b)) \cdot \frac{d}{db} (w^T x_i + b) + (1 - y_{i}) \frac{1}{1 - \sigma(w^T x_i + b)} \cdot (-\sigma(w^T x_i + b)) \cdot (1 - \sigma(w^T x_i + b)) \cdot \frac{d}{db} (w^T x_i + b) \\ 
&= y_{i} \cdot (1 - \sigma(w^T x_i + b)) \cdot 1 + (1 - y_{i}) \cdot (-\sigma(w^T x_i + b)) \cdot 1 \\ 
&= y_{i} - \sigma(w^T x_i + b) \end{align}$$
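
The results of (d) and (e) can be checked against numerical gradients. The observation and parameter values below are hypothetical, and the helper names are assumptions made for this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(w, b, x, y):
    # Log-likelihood of a single observation, as in (b)
    p = sigmoid(w @ x + b)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

# Hypothetical single observation and parameters.
x = np.array([0.5, -1.2, 2.0])
y = 1
w = np.array([0.1, 0.3, -0.2])
b = 0.05

# Analytic gradients from (d) and (e).
residual = y - sigmoid(w @ x + b)
grad_w = x * residual
grad_b = residual

# Central finite differences for comparison.
h = 1e-6
num_grad_w = np.array([
    (log_lik(w + h * e, b, x, y) - log_lik(w - h * e, b, x, y)) / (2 * h)
    for e in np.eye(len(w))
])
num_grad_b = (log_lik(w, b + h, x, y) - log_lik(w, b - h, x, y)) / (2 * h)

print(np.allclose(grad_w, num_grad_w, atol=1e-7))
print(np.isclose(grad_b, num_grad_b, atol=1e-7))
```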

### 2.2 Implementation of the Logistic Regression Classifier

The `train` method of the [`LogisticRegressionClassifier`](assignment5/models/classifiers/logistic_regression.py) class trains the logistic regression model. It updates the weights and bias by gradient descent in order to minimize the cost function. Its inputs and main steps are described below.

#### (a) Matrix of Covariates $ X $
`X` is the matrix of training data: each row is an observation and each column is a feature. In the `train` function, its shape gives the number of samples and the number of features.

```python
n_samples, n_features = X.shape
```

#### (b) Vector of Labels $ y $
The `y` vector holds the actual class label of each observation in `X`; in binary classification, these labels are either 0 or 1. During training, the labels are compared with the predicted probabilities, which are computed as:

```python
model = self.sigmoid(np.dot(X, self.weights) + self.bias)
```

#### (c) Initial Weights Vector $ w $
The weights vector `w` starts with arbitrary values (here zeros) and is updated during training. It determines how strongly each feature influences the decision boundary.

```python
self.weights = np.zeros(n_features)
```

#### (d) Initial Bias Value
The bias term `b` is analogous to the intercept in linear regression and is used to make adjustments to the decision boundary. It starts at zero and is updated along with the weights.

```python
self.bias = 0
```

#### (e) Number of Iterations `num_iters`
`num_iters` is the number of gradient descent iterations. Each iteration uses the entire training set to compute the gradients. More iterations generally bring the parameters closer to the minimum of the cost function, at the price of more computation.

```python
for _ in range(num_iters):
```

#### (f) Learning Rate `learning_rate`
The `learning_rate` determines the step size of gradient descent. A smaller learning rate means smaller steps towards the minimum of the cost function, and therefore slower convergence; a learning rate that is too large can overshoot the minimum.

```python
self.weights -= learning_rate * dw
self.bias -= learning_rate * db
```

These components work together in the `train` method to iteratively update the weights and bias to minimize the cost function:

$$ -\sum_{i=1}^{N} \log(p(y_i | x_i; w, b)) $$

#### (g) Gradient Descent Updates
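In each iteration, the gradients of the cost function with respect to the weights and bias are computed. The gradient points in the direction of steepest ascent, so moving in the opposite direction decreases the cost. Because the cost is the negative log-likelihood, the residual appears as `model - y`, the negative of the $ y_i - \sigma(w^T x_i + b) $ derived in Section 2.1. The factor `1 / n_samples` averages the gradient over the training set.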

```python
dw = (1 / n_samples) * np.dot(X.T, (model - y))
db = (1 / n_samples) * np.sum(model - y)
```

#### (h) Parameter Updates
After computing the gradients, the weights and bias are updated in the direction that will reduce the cost function.

```python
self.weights -= learning_rate * dw
self.bias -= learning_rate * db
```

<div class='alert alert-info'> 
By repeatedly applying these steps for the specified number of iterations, the `train` method effectively tunes the weights and bias to fit the model to the training data, aiming to reduce prediction error and improve model accuracy. 
</div>
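
Putting steps (a) to (h) together, the loop can be sketched as a standalone function. This is a minimal illustration under the assumptions stated in the comments, not the module's actual `train` method; the toy data and the learning rate of 0.1 are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, num_iters=1000, learning_rate=0.1):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)   # (c) initial weights
    bias = 0.0                       # (d) initial bias
    for _ in range(num_iters):       # (e) fixed number of iterations
        model = sigmoid(X @ weights + bias)        # predicted probabilities
        dw = (1 / n_samples) * X.T @ (model - y)   # (g) averaged gradient w.r.t. weights
        db = (1 / n_samples) * np.sum(model - y)   # (g) averaged gradient w.r.t. bias
        weights -= learning_rate * dw              # (h) step against the gradient
        bias -= learning_rate * db
    return weights, bias

# Hypothetical, linearly separable 1-D toy data.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
weights, bias = train(X, y)
preds = (sigmoid(X @ weights + bias) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```

The toy feature is already centered and on a small scale, which is why a moderate learning rate behaves well here.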

In [28]:
lr_classifier = LogisticRegressionClassifier()
lr_classifier.train(X_train, y_train, num_iters=1000, learning_rate=0.001)

print(lr_classifier)

Logistic Regression Classifier Summary:
--------------------------------------------------
Weights: [-0.0649, -1.2060, 17.8235]
Bias: -0.1075


  return 1 / (1 + np.exp(z)) if z < 0 else 1 / (1 + np.exp(-z))
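
The line above appears to be a source line echoed by a runtime warning whose message did not survive the export. If that expression is the body of the classifier's sigmoid (an assumption, since the module source is not reproduced here), it is worth noting that for $ z < 0 $ it evaluates to $ \frac{1}{1 + e^{z}} = 1 - \sigma(z) $, which never falls below 0.5. A piecewise form that is both overflow-safe and equal to $ \sigma(z) $ everywhere could look like the following sketch (`stable_sigmoid` and `echoed_expression` are hypothetical names):

```python
import math

def stable_sigmoid(z):
    """Piecewise sigmoid that never exponentiates a positive number."""
    if z < 0:
        ez = math.exp(z)              # z < 0, so ez lies in [0, 1)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(-z))

def echoed_expression(z):
    # The expression from the echoed line, reproduced for comparison only.
    return 1 / (1 + math.exp(z)) if z < 0 else 1 / (1 + math.exp(-z))

print(stable_sigmoid(-2.0))      # about 0.1192
print(echoed_expression(-2.0))   # about 0.8808, i.e. 1 - sigmoid(-2)
print(stable_sigmoid(-1000.0))   # 0.0, without overflow
```

The two functions agree for $ z \ge 0 $ and differ only on the negative branch.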


## 3. Evaluation

Finally, we evaluate both the Naive Bayes and Logistic Regression models on the test data, using accuracy, precision, recall, and F1 score:

In [29]:
X_test = test_df.drop('Purchased', axis=1).values
y_test = test_df['Purchased'].values

nb_predictions = nb_classifier.predict(X_test)
lr_predictions = lr_classifier.predict(X_test)

evaluation_metrics = {
    'Model': ['Naive Bayes', 'Logistic Regression'],
    'Accuracy': [accuracy_score(y_test, nb_predictions), accuracy_score(y_test, lr_predictions)],
    'Precision': [precision_score(y_test, nb_predictions), precision_score(y_test, lr_predictions)],
    'Recall': [recall_score(y_test, nb_predictions), recall_score(y_test, lr_predictions)],
    'F1 Score': [f1_score(y_test, nb_predictions), f1_score(y_test, lr_predictions)]
}


evaluation_df = pd.DataFrame(evaluation_metrics)

evaluation_df

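|   | Model               | Accuracy | Precision | Recall   | F1 Score |
|---|---------------------|----------|-----------|----------|----------|
| 0 | Naive Bayes         | 0.75     | 0.964286  | 0.658537 | 0.782609 |
| 1 | Logistic Regression | 0.683333 | 0.683333  | 1.0      | 0.811881 |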


- **Naive Bayes** has higher accuracy (0.75 vs. 0.68) and precision (0.96 vs. 0.68) but lower recall (0.66 vs. 1.00) than Logistic Regression. It is more precise but misses more positive cases.
- **Logistic Regression** has perfect recall, and its precision equals its accuracy. That combination means it predicted the positive class for every test sample: it finds all positives only because every negative becomes a false positive. Its F1 score (0.81) is slightly higher than that of Naive Bayes (0.78).

The choice between these models depends on the requirements of the task at hand, in particular on whether precision or recall matters more.
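
To see why perfect recall together with precision equal to accuracy points to a classifier that labels everything as positive, the metrics can be computed by hand from the confusion-matrix counts. The label vector below is hypothetical: 41 positives out of 60 samples, chosen only so that the positive rate matches the 0.683 in the table.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and an "always positive" classifier.
y_true = [1] * 41 + [0] * 19
y_pred = [1] * 60
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With no true negatives and no false negatives, accuracy and precision both reduce to the share of positives in the data, and recall is 1.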