# Personal Data
Please fill in your details below to help us keep track of your submission.

Student Name: `here`

Student ID: `here`

### In this notebook you will implement a Bayes classifier and logistic regression. Please note that you are not allowed to use existing library implementations of these algorithms.

In [4]:
# imports
# You can add new libraries if you like
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve


# Datasets (synthetic data generator and the Breast Cancer Wisconsin loader)
from sklearn.datasets import make_classification
from sklearn.datasets import load_breast_cancer


# Metrics (10 pt)

- **Accuracy:** measures the overall correctness of the model as the ratio of correctly predicted instances (True Positives and True Negatives) to the total number of predictions.
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$
- **Precision:** measures how many of the predicted positive cases were actually positive, which makes it useful when false positives are costly (e.g., spam detection). The formula for precision is:
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
- **Recall:** also known as sensitivity, measures the model's ability to identify all actual positive cases, which is crucial when missing a positive instance is costly (e.g., medical diagnosis). The formula for recall is:
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
- **F1-Score:** the harmonic mean of precision and recall. It provides a single balanced measure when the positive and negative classes are imbalanced, so that neither metric dominates the evaluation. The formula for the F1-score is:
$$
\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
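
As a quick sanity check of the four formulas, the sketch below evaluates them on a small set of illustrative counts. The helper name `metrics_from_counts` and the counts themselves are assumptions made for this example; they are not part of the assignment template.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 for the given counts (hypothetical helper)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only.
example = metrics_from_counts(tp=50, fp=10, tn=90, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Each ratio falls back to 0.0 when its denominator is zero, which avoids a division error when, for instance, the model never predicts the positive class.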

Complete the functions below to calculate the four metrics introduced above.

In [None]:
# Metric calculation functions
# TODO: complete the code below to calculate the 4 metrics introduced in the cell above.

def accuracy(metrics): # 📣
    """calculates accuracy"""
    total = None
    acc = None
    return acc if total != 0 else 0.0

def precision(metrics): # 📣
    """calculates precision"""
    denominator = None
    prec = None
    return prec if denominator != 0 else 0.0

def recall(metrics): # 📣
    """calculates recall"""
    denominator = None
    rec = None
    return rec if denominator != 0 else 0.0

def f1_score(metrics): # 📣
    """Use the functions defined above to calculate the F1-score."""
    prec = None
    rec = None
    f1 = None
    return f1 if (prec + rec) != 0 else 0.0

# Confusion Matrix (5 pt)

A **Confusion Matrix** is a table used to evaluate the performance of a classification model in machine learning. It provides a detailed breakdown of the model's predictions compared to the actual labels. The matrix is particularly useful for understanding the types of errors the model is making.

The confusion matrix consists of four key metrics:
- **True Positive (TP)**: The model correctly predicted the positive class.
- **False Positive (FP)**: The model incorrectly predicted the positive class (Type I error).
- **True Negative (TN)**: The model correctly predicted the negative class.
- **False Negative (FN)**: The model incorrectly predicted the negative class (Type II error).

Here’s a visualization of a confusion matrix:

<!-- |                     | Predicted Negative | Predicted Positive |
|---------------------|--------------------|--------------------|
| **Actual Negative** | TN                 | FP                 |
| **Actual Positive** | FN                 | TP                 | -->

<!-- ![Confusion Matrix](https://glassboxmedicine.com/wp-content/uploads/2019/02/confusion-matrix.png) -->

<img src="https://glassboxmedicine.com/wp-content/uploads/2019/02/confusion-matrix.png" alt="Confusion Matrix" width="70%">


## How to Build a Confusion Matrix?

To build a confusion matrix, follow these steps:

1. **Train a Classification Model**:
   - Use a dataset to train a classification model (e.g., logistic regression, decision trees, etc.).

2. **Make Predictions**:
   - Use the trained model to predict labels for a test dataset.

3. **Compare Predictions with Actual Labels**:
   - Compare the predicted labels with the actual labels to determine the number of TP, FP, TN, and FN.

4. **Create the Matrix**:
   - Organize the results into a 2x2 matrix as shown above.
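
Steps 3 and 4 can be sketched with NumPy as follows. The label arrays are made up for illustration, and the layout (rows for actual classes, columns for predicted classes) is an assumption chosen to match the table convention used in this section.

```python
import numpy as np

# Illustrative labels; in practice these come from the test set and the model.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Step 3: compare predictions with actual labels.
counts = {
    "TP": int(np.sum((y_pred == 1) & (y_true == 1))),
    "FP": int(np.sum((y_pred == 1) & (y_true == 0))),
    "TN": int(np.sum((y_pred == 0) & (y_true == 0))),
    "FN": int(np.sum((y_pred == 0) & (y_true == 1))),
}

# Step 4: rows are actual classes, columns are predicted classes.
matrix = np.array([[counts["TN"], counts["FP"]],
                   [counts["FN"], counts["TP"]]])
print(counts)
print(matrix)
```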



Now complete the cell below to build a confusion matrix.


In [None]:
def confusion_matrix(metrics):
    """
    Plots a confusion matrix given the metrics (TP, FP, TN, FN).

    Parameters:
        metrics (dict): A dictionary containing the following keys:
            - 'TP': True Positives
            - 'FP': False Positives
            - 'TN': True Negatives
            - 'FN': False Negatives
    """
    # TODO: Create the confusion matrix as a 2x2 numpy array # 📣
    # (rows = actual class, columns = predicted class, matching the tick labels below)
    conf_matrix = None

    plt.figure(figsize=(5, 4))  # Set the figure size
    sns.heatmap(
        conf_matrix,            # The confusion matrix data
        annot=True,             # Annotate the cells with the numeric values
        fmt='d',                # Format the annotations as integers
        cmap='Blues',           # Use a blue color map
        xticklabels=['Predicted Negative', 'Predicted Positive'],  # X-axis labels
        yticklabels=['Actual Negative', 'Actual Positive']        # Y-axis labels
    )

    plt.xlabel('Predicted')  # X-axis label
    plt.ylabel('Actual')     # Y-axis label
    plt.title('Confusion Matrix')  # Title of the plot
    plt.grid(False)
    plt.show()

# example confusion matrix
metrics = {
    'TP': 50,  # True Positives
    'FP': 10,  # False Positives
    'TN': 90,  # True Negatives
    'FN': 5    # False Negatives
}

confusion_matrix(metrics)

# Dataset (15 pt)

We are using the Pima Indians diabetes dataset. The objective with this dataset is to create a binary classification model that predicts whether or not an individual has diabetes based on several medical indicators. The target variable is given as Outcome and takes on a value of 1 if the patient has diabetes and 0 otherwise. This is an imbalanced class problem because there are significantly more patients without diabetes than with diabetes.




In [7]:
# Run this code to read in the dataset and display the first 5 rows of the dataframe
data = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/eecs-491/master/assign/project/diabetes.csv')
data.head()

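|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|-------------|---------|---------------|---------------|---------|------|--------------------------|-----|---------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |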


## Impute Missing Values
Several columns in the dataset contain 0 values that are not physically possible (e.g., Glucose, BloodPressure).
To correct these values, we impute the zeros with the median of the column.
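
A minimal sketch of median imputation on a toy column is shown here. The values are made up, and the median is computed over the whole column with the zeros included, which is one possible reading of the instruction (an alternative would be to exclude the zeros before taking the median).

```python
import pandas as pd

# Toy column; the values are made up for illustration.
toy = pd.DataFrame({"Glucose": [0, 85, 183, 89, 137]})

# Assumption: the median is taken over the whole column, zeros included.
median_glucose = toy["Glucose"].median()

# Replace every 0 in the column with that median.
toy["Glucose"] = toy["Glucose"].replace(0, median_glucose)
print(toy["Glucose"].tolist())
```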



#### **Theoretical question 1:** Based on the explanation above, which columns need to be imputed? (Hint: there are 5 columns to impute.)
answer:


In [None]:
# Impute Missing Values
# To correct these values, replace the 0s with the median of the respective column.
# The median is a robust measure of central tendency and is less affected by outliers, making it a good choice for imputation.
# Use the `replace()` function to substitute 0s with the median value for each column.

data[''] = None
data[''] = None
data[''] = None
data[''] = None
data[''] = None
data

Now check whether there are any missing values.

In [None]:
# TODO: check whether any column contains missing (NaN) values. You can use the isna() function for this.
missing = None 
missing

## Normalize data
Machine learning algorithms like logistic regression can be affected by the scale of the input features. If one feature has values ranging from 0 to 1 and another has values from 0 to 1000, the algorithm might treat the feature with the larger range as more important, even if it’s not. This can lead to incorrect or poor results. Normalizing the features ensures that all features are treated equally by the algorithm.
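
To make the effect of scale concrete, the following sketch standardizes a toy matrix whose two columns differ by a factor of 100. The array is invented for illustration, and the code assumes no column is constant (a zero standard deviation would need a guard).

```python
import numpy as np

# Two features on very different scales (illustrative values).
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

col_mean = X_toy.mean(axis=0)   # per-column mean
col_std = X_toy.std(axis=0)     # per-column standard deviation
X_scaled = (X_toy - col_mean) / col_std

print(X_scaled)
```

After scaling, both columns hold the same values, so neither dominates because of its original units.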



#### **Theoretical question 2:** Give 2 more reasons why we should normalize the data before running an ML algorithm on it.
answer:


In [None]:
# TODO: complete the function below for normalizing the dataset.
def normalize_data(X: pd.DataFrame):
    """Normalize features (columns) to have zero mean and unit variance"""
    X = np.array(X)  # Convert DataFrame to numpy array if needed
    mean = None  # Column means
    std = None    # Column standard deviations
    result = None
    return result

In [None]:
# Extract the features (X) and the labels (target or y)
features = data.drop(columns='Outcome')
labels = data.Outcome

Pima_X = normalize_data(features)
Pima_y = labels

## Training and Testing Sets
We will be using 200 randomly selected observations for the test set. This leaves 568 observations in the training data from which we will fit our model.

You can use the `train_test_split` function from the scikit-learn library.
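
As a rough illustration, `train_test_split` accepts an integer `test_size`, which is interpreted as an absolute number of test samples. The sketch below uses random placeholder data with 768 rows and 8 feature columns; the variable names and the `random_state` value are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))      # placeholder features, 768 rows
y_demo = rng.integers(0, 2, size=768)   # random binary labels, for illustration only

# An integer test_size requests exactly that many test observations.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=200, random_state=42
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible between runs.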

In [None]:
# TODO: complete the line below to split the dataset into training and testing sets, using 200 observations for testing
X_train, X_test, y_train, y_test = None

print(type(X_train))
print(X_train.shape)
print(X_test.shape)

## Another dataset

The **Breast Cancer Wisconsin (Diagnostic) Dataset** from `sklearn.datasets` is a widely used dataset for binary classification tasks. It contains **569 samples** of breast cancer cases, each described by **30 numerical features** computed from digitized images of fine needle aspirate (FNA) of breast masses. The features include the mean, standard error, and worst (largest) values of 10 cell nucleus characteristics (e.g., radius, texture, perimeter, smoothness). The target variable indicates whether the tumor is **malignant (0) or benign (1)**, with **357 benign and 212 malignant** samples. This dataset is useful for training machine learning models in medical diagnosis.


In [None]:
data = load_breast_cancer()
print(list(data))
Cancer_X, Cancer_y = None , None


We need to get familiar with the dataset we are using in order to find the best model to fit to it.



In [None]:
# TODO: use DESCR attribute of the dataset to learn more about the dataset 
description = None
description

# Logistic Regression: Theory and Implementation (35 pt)

## 1. Introduction to Logistic Regression
**Logistic Regression** is a fundamental machine learning algorithm used for **binary classification** problems. Unlike **linear regression**, which predicts continuous values, logistic regression estimates the **probability** of an instance belonging to a particular class.

It is widely used in applications such as:
- Medical diagnosis (e.g., detecting cancerous vs. non-cancerous tumors)
- Spam detection (e.g., spam vs. non-spam emails)
- Credit scoring (e.g., loan approval or rejection)

---

## 2. Mathematical Formulation

### 2.1. Hypothesis Function (Sigmoid Function)
Logistic regression uses the **sigmoid function** (also known as the **logistic function**) to map real-valued inputs to a probability in the range $(0, 1)$:

$$
h(z) = \frac{1}{1 + e^{-z}}
$$

where $z$ is the **linear combination** of input features and weights:

$$
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = XW + b
$$

- $X$ is the input feature matrix
- $W$ is the weight vector (parameters)
- $b$ is the bias term

The output $h(z)$ represents the probability that a given instance belongs to the **positive class (1)**.
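
The two formulas above can be sketched numerically as follows. The feature matrix, weights and bias are illustrative values, and clipping $z$ to $[-500, 500]$ is an assumed safeguard against overflow in `np.exp`.

```python
import numpy as np


def sigmoid(z):
    """Logistic function; z is clipped so np.exp cannot overflow."""
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))


# Illustrative inputs and parameters.
X_demo = np.array([[0.0, 0.0],
                   [1.0, 2.0],
                   [-3.0, 1.0]])
w_demo = np.array([0.5, 0.25])
b_demo = 0.0

z = X_demo @ w_demo + b_demo   # linear combination, one value per row
probs = sigmoid(z)             # probability of the positive class
print(z)
print(probs)
```

A score of $z = 0$ maps to a probability of exactly 0.5, and larger positive scores move the probability toward 1.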

### 2.2. Decision Boundary
We classify an instance using a **threshold** (e.g., 0.5):

$$
\hat{y} =
\begin{cases}
1, & \text{if } h(z) \geq 0.5 \\
0, & \text{otherwise}
\end{cases}
$$
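
A short sketch of the thresholding rule, using made-up probabilities:

```python
import numpy as np

probs = np.array([0.10, 0.50, 0.49, 0.93])   # illustrative predicted probabilities
threshold = 0.5

# A probability equal to the threshold is assigned to the positive class.
y_hat = (probs >= threshold).astype(int)
print(y_hat)  # [0 1 0 1]
```

Raising the threshold trades recall for precision, since fewer instances are labeled positive.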

---

## 3. Cost Function (Log Loss)
To train a logistic regression model, we optimize a **cost function** that measures how well the model fits the data. The **log loss (cross-entropy loss)** is defined as:

$$
J(W, b) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h(z_i) + (1 - y_i) \log (1 - h(z_i)) \right]
$$

where:
-  $m$ is the number of training examples
-  $y_i$ is the actual label ($0$ or $1$)
-  $h(z_i)$ is the predicted probability

This function penalizes incorrect predictions:
- If $y = 1$ but $h(z)$ is small, the loss is high.
- If $y = 0$ but $h(z)$ is close to 1, the loss is high.
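
The penalizing behaviour can be checked on a toy example. In this sketch the helper name `log_loss` and the `eps` clipping constant are assumptions; clipping keeps the logarithm away from $\log 0$.

```python
import numpy as np


def log_loss(y, h, eps=1e-10):
    """Mean cross-entropy between labels y and predicted probabilities h."""
    h = np.clip(h, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


y = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])   # probabilities close to the labels
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])   # probabilities far from the labels

loss_right = log_loss(y, mostly_right)
loss_wrong = log_loss(y, mostly_wrong)
print(loss_right, loss_wrong)
```

The second set of predictions yields a much larger loss because each probability points away from its label.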

---

## 4. Regularization (L2 Regularization)
To prevent **overfitting**, we use **L2 regularization** (also called **Ridge Regularization**). This adds a penalty term to the cost function:

$$
J(W, b) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h(z_i) + (1 - y_i) \log (1 - h(z_i)) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} W_j^2
$$

where:
- $\lambda$ controls the strength of regularization.
- Larger $\lambda$ values reduce overfitting but may increase bias.
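
To see how $\lambda$ scales the penalty, the sketch below evaluates only the regularization term for a fixed weight vector. The helper name `l2_penalty` and the numbers are illustrative assumptions.

```python
import numpy as np


def l2_penalty(weights, lambda_, m):
    """L2 term of the cost: lambda / (2m) * sum of squared weights."""
    return lambda_ / (2 * m) * np.sum(weights ** 2)


w = np.array([1.0, -2.0])                     # sum of squares is 5
weak = l2_penalty(w, lambda_=0.1, m=4)        # 0.1 / 8 * 5
strong = l2_penalty(w, lambda_=1.0, m=4)      # 1.0 / 8 * 5
print(weak, strong)
```

The bias $b$ is not included in the sum, in line with the formula above.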

---

## 5. Gradient Descent for Parameter Optimization
We minimize the cost function using **gradient descent**:

$$
W_j := W_j - \alpha \frac{\partial J}{\partial W_j}
$$

$$
b := b - \alpha \frac{\partial J}{\partial b}
$$

where $\alpha$  is the **learning rate** and the gradients are:

$$
\frac{\partial J}{\partial W} = \frac{1}{m} X^T (h - y) + \frac{\lambda}{m} W
$$

$$
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (h_i - y_i)
$$
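
One gradient-descent update can be traced by hand on a two-sample toy problem. The data, the learning rate $\alpha = 0.1$ and $\lambda = 0.1$ are illustrative choices, and the weights start at zero so the regularization term vanishes in this first step.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([0.0, 1.0])
W = np.zeros(2)
b = 0.0
alpha, lambda_ = 0.1, 0.1
m = X.shape[0]

h = sigmoid(X @ W + b)                       # all 0.5 at the zero initialization
dW = X.T @ (h - y) / m + (lambda_ / m) * W   # gradient with respect to the weights
db = np.mean(h - y)                          # gradient with respect to the bias

W = W - alpha * dW
b = b - alpha * db
print(dW, db)
print(W, b)
```

Here $h - y = (0.5, -0.5)$, so the weight gradient is $(-0.5, -0.5)$ and the bias gradient is 0; the update moves both weights to 0.05.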



In [None]:
class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iters=1000, lambda_=0.1):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.lambda_ = lambda_  # L2 regularization strength
        self.weights = None  # learnable weights
        self.bias = None  # learnable bias
        self.loss_history = []
        self.classes_ = ['0', '1']

    def _sigmoid(self, z):
        # TODO: Clip values to avoid overflow in exp. you can use np.clip()
        z = None  # clip z to range (-500, 500)
        sigmoid = None # calculate the f(z) where f is the sigmoid function
        return sigmoid

    def _compute_loss(self, y, h):
        """
        Compute the logistic loss with L2 regularization.
        """
        epsilon = 1e-10  # Small value to avoid log(0)

        # TODO: Clip the predicted values (h) to be within [epsilon, 1 - epsilon]
        h = None 

        # TODO: Compute the logistic loss function
        loss = None

        # TODO: Compute L2 regularization term using self.lambda_ and self.weights
        regularization = None

        return loss + regularization

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        self.loss_history = []

        for _ in range(self.n_iters):
            linear_model = np.dot(X, self.weights) + self.bias
            h = self._sigmoid(linear_model)

            loss = self._compute_loss(y, h)
            self.loss_history.append(loss)

            # TODO: Compute gradients for weights (dw) and bias (db)

            # Hint: Compute the gradient for weights (dw) using the formula from Section 5.
            # Don't forget to include the L2 regularization term.
            dw = None

            # Hint: Compute the gradient for bias (db) using the formula from Section 5.
            db = None

            # TODO: Update the weights and bias using the learning rate
            self.weights -= None  # Update weights
            self.bias -= None  # Update bias

        return self


    def predict_proba(self, X):
        # TODO: Compute the probability of the positive class using the learned affine mapping (weights and bias).
        linear_model = None
        return self._sigmoid(linear_model)

    def predict(self, X, y_true, threshold=0.5):
        proba = self.predict_proba(X)

        if len(self.classes_) == 2:
            y_pred = (proba >= threshold).astype(int)
        else:
            y_pred = np.argmax(proba, axis=1)

        # TODO: Compute True Positives, False Positives, True Negatives, and False Negatives
        result = {
            'TP': None,
            'FP': None,
            'TN': None,
            'FN': None,
        }

        return result

    def score(self, X, y):
        # predict() returns the confusion-matrix counts, so accuracy is derived from them
        results = self.predict(X, y)
        return accuracy(results)


Run the `.fit()` method to fit the model on both datasets separately. First split each dataset into training and test sets, then train the model on the training set and run inference on the test set.

In [None]:
X = Pima_X
y = Pima_y
# TODO: Split into train and test sets
X_train_pima, X_test_pima, y_train_pima, y_test_pima = None

# TODO: Initialize and train LogisticRegression
lr_pima = None
lr_pima.fit(X_train_pima, y_train_pima)

# TODO: use .predict() function of LogisticRegression class for inference.
results = None

# Calculate metrics
print("Accuracy:", accuracy(results))
print("Precision:", precision(results))
print("Recall:", recall(results))
print("F1 Score:", f1_score(results))
confusion_matrix(results)

In [None]:
# TODO: Load dataset
data = load_breast_cancer()
X_bc, y_bc = data.data, data.target

# TODO: Split into train and test sets
X_train_bc, X_test_bc, y_train_bc, y_test_bc = None

# TODO: Initialize and train LogisticRegression
lr_bc = None
lr_bc.fit(X_train_bc, y_train_bc)

# TODO: use .predict() function of LogisticRegression class for inference.
results = None

# Calculate metrics
print("Accuracy:", accuracy(results))
print("Precision:", precision(results))
print("Recall:", recall(results))
print("F1 Score:", f1_score(results))

confusion_matrix(results)

## AUC



In [None]:
# Determine the roc curve and the auc and display
def calc_roc(probs, y_test): # 📣
  # Calculate the area under the roc curve
  auc = roc_auc_score(y_test, probs)
  # Calculate metrics for the roc curve
  fpr, tpr, thresholds = roc_curve(y_test, probs) 

  plt.style.use('bmh')
  plt.figure(figsize = (8, 8))

  # Plot the roc curve
  plt.plot(fpr, tpr, 'b')
  plt.xlabel('False Positive Rate', size = 16)
  plt.ylabel('True Positive Rate', size = 16)
  plt.title('Receiver Operating Characteristic Curve, AUC = %0.4f' % auc,
            size = 18)


In [None]:
lr_pred_proba = lr_pima.predict_proba(X_test_pima)
calc_roc(lr_pred_proba, y_test_pima)

In [None]:
lr_pred_proba = lr_bc.predict_proba(X_test_bc)
calc_roc(lr_pred_proba, y_test_bc)

# Bayesian Classifier: Theory and Implementation (35 pt)

## 1. Introduction to Bayesian Classifier
The Bayesian Classifier is based on **Bayes' Theorem**, which provides a way to calculate the probability of a class given some observed data. This method is highly effective for classification problems, especially when the features are conditionally independent given the class. It assumes that the data follows a **Gaussian (normal) distribution** within each class.

### Bayes' Theorem
Bayes' Theorem is given by:

$$
P(C_k | X) = \frac{P(X | C_k)P(C_k)}{P(X)}
$$

where:
- $P(C_k | X)$ is the posterior probability of class $C_k$ given the feature vector $X$.
- $P(X | C_k)$ is the likelihood, which is the probability of the feature vector $X$ given class $C_k$.
- $P(C_k)$ is the prior probability of class $C_k$.
- $P(X)$ is the evidence, or the total probability of the feature vector across all classes.

In the Bayesian Classifier, we need to calculate the likelihood of the feature vector for each class and multiply it by the prior probability of that class.
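
A small numeric illustration of Bayes' Theorem for two classes follows. The priors and likelihoods are invented values, not estimates from either dataset.

```python
import numpy as np

priors = np.array([0.6, 0.4])        # P(C_0), P(C_1), illustrative values
likelihoods = np.array([0.2, 0.5])   # P(X | C_0), P(X | C_1), illustrative values

joint = likelihoods * priors         # numerator of Bayes' Theorem, per class
evidence = joint.sum()               # P(X), the sum over all classes
posterior = joint / evidence
print(posterior)
```

Although class 0 has the larger prior, the higher likelihood under class 1 gives it the larger posterior (0.625 against 0.375). Because the evidence is the same for every class, comparing the numerators alone is enough to pick the most probable class.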

### Gaussian Assumption
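The likelihood $P(X | C_k)$ is modeled with a **Gaussian (Normal) distribution**. Under the conditional-independence assumption it factorizes over the features, $P(X | C_k) = \prod_i P(x_i | C_k)$, and for each feature $x_i$ in the feature vector $X$ we assume: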

$$
P(x_i | C_k) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
$$

where:
- $\mu$ is the mean of feature $x_i$ for class $C_k$,
- $\sigma^2$ is the variance of feature $x_i$ for class $C_k$.
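
In practice the density is usually evaluated in log space, where the product over conditionally independent features becomes a sum and very small values do not underflow. The sketch below assumes that approach; the helper name `gaussian_log_pdf` and all numbers are illustrative.

```python
import numpy as np


def gaussian_log_pdf(x, mu, var):
    """Element-wise log of the Gaussian density with mean mu and variance var."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)


x = np.array([1.0, 2.0])      # one sample with two features
mu = np.array([1.0, 0.0])     # per-feature means for one class
var = np.array([1.0, 4.0])    # per-feature variances for the same class

per_feature = gaussian_log_pdf(x, mu, var)
log_likelihood = per_feature.sum()   # log of the product of per-feature densities
print(np.exp(per_feature))
print(log_likelihood)
```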

### Prior Probability
The **prior probability** $P(C_k)$ is simply the fraction of samples that belong to class $C_k$. It can be computed as:

$$
P(C_k) = \frac{\text{Number of samples in class } C_k}{\text{Total number of samples}}
$$
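
The priors can be read directly off the label counts, as in this sketch on a made-up label vector:

```python
import numpy as np

y = np.array([0, 0, 0, 1])   # illustrative labels

# Unique class labels and how often each one occurs.
classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()
print(dict(zip(classes.tolist(), priors.tolist())))
```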

## 2. Implementation Overview

In the **`__init__`** method, we initialize parameters for the class:

- **`var_smoothing`**: A small value added to variances to prevent division by zero or extremely small numbers, which can cause instability.
- **`classes_`**: The list of unique classes in the target variable.
- **`priors_`**: The prior probabilities for each class.
- **`means_`**: The means of each feature for each class.
- **`variances_`**: The variances of each feature for each class.
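
The role of `var_smoothing` can be seen on a class whose samples share the same value for one feature. This sketch assumes the smoothing value is simply added to each variance, as the description above suggests; the data is invented.

```python
import numpy as np

# Samples of a single class; the second feature is constant within the class.
X_c = np.array([[1.0, 5.0],
                [2.0, 5.0],
                [3.0, 5.0]])
var_smoothing = 1e-9

raw_var = X_c.var(axis=0)                # second entry is exactly 0
smoothed_var = raw_var + var_smoothing   # strictly positive, safe to divide by
print(raw_var)
print(smoothed_var)
```

Without the added term, the Gaussian density for the constant feature would involve a division by zero.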



In [None]:
class BayesianClassifier:
    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing  # To prevent zero variances
        self.classes_ = None
        self.priors_ = None
        self.means_ = None
        self.variances_ = None

    def fit(self, X, y):
        self.classes_ = None  # TODO: Identify unique classes
        n_classes = None  # TODO: Get number of classes
        n_features = None  # TODO: Get number of features

        # Initialize arrays to store parameters
        self.means_ = None  # TODO: Initialize means
        self.variances_ = None  # TODO: Initialize variances
        self.priors_ = None  # TODO: Initialize priors


        # 📣
        for idx, c in enumerate(self.classes_):
            X_c = None  # TODO: Filter data by class
            self.means_[idx, :] = None  # TODO: Calculate mean for class
            self.variances_[idx, :] = None  # TODO: Calculate variance for class
            self.priors_[idx] = None  # TODO: Calculate prior for class

        return self

    def _calculate_log_proba(self, X):
        log_prob = []
        for idx in range(len(self.classes_)):
            prior = None  # TODO: Calculate log of prior
            mean = None  # TODO: Get mean of class
            variance = None  # TODO: Get variance of class

            # Gaussian log probability calculation
            log_likelihood = None  # TODO: Implement Gaussian log-likelihood # 📣
            log_prob.append(None)  # TODO: Add log-likelihood to log probability

        return None  # TODO: Return the log probabilities

    def predict_proba(self, X):
        log_prob = None  # TODO: Get log probabilities
        # Softmax to get probabilities
        prob = None  # TODO: Apply softmax
        prob /= None  # TODO: Normalize to get probabilities
        return prob

    def predict(self, X, y_true, threshold=0.5):
        proba = None  # TODO: Get predicted probabilities
        if len(self.classes_) == 2:  # Handle binary classification
            y_pred = None  # TODO: Convert to binary prediction # 📣
        else:  # Handle multi-class classification
            y_pred = None  # TODO: Select class with maximum probability # 📣

        # Calculate confusion matrix components
        results = {}
        results['TP'] = np.sum((y_pred == 1) & (y_true == 1))
        results['FP'] = np.sum((y_pred == 1) & (y_true == 0))
        results['TN'] = np.sum((y_pred == 0) & (y_true == 0))
        results['FN'] = np.sum((y_pred == 0) & (y_true == 1))

        return results

    def score(self, X, y):
        results = self.predict(X, y)
        return accuracy(results)


In [None]:
# TODO: Initialize and train BayesianClassifier
bc_pima = None
bc_pima.fit(X_train_pima, y_train_pima)

# TODO: use .predict() function of BayesianClassifier class for inference.
results = None

# Calculate metrics
print("Accuracy:", accuracy(results))
print("Precision:", precision(results))
print("Recall:", recall(results))
print("F1 Score:", f1_score(results))
confusion_matrix(results)

In [None]:
bc_pred_proba = bc_pima.predict_proba(X_test_pima)
calc_roc(bc_pred_proba[:, 1], y_test_pima)

In [None]:
# data = load_breast_cancer()
# X, y = data.data, data.target

# # Split into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Initialize and train BayesianClassifier
bayes_bc = None 
bayes_bc.fit(X_train_bc, y_train_bc)

# TODO: use .predict() function of BayesianClassifier class for inference.
results = None

# Calculate metrics
print("Accuracy:", accuracy(results))
print("Precision:", precision(results))
print("Recall:", recall(results))
print("F1 Score:", f1_score(results))
confusion_matrix(results)

In [None]:
bc_pred_proba = bayes_bc.predict_proba(X_test_bc)
calc_roc(bc_pred_proba[:, 1], y_test_bc)