# Lab 3 - Logistic regression Assignment

In this assignment, we will implement a logistic regression classifier for binary classification.

### 1. Data - Breast Cancer dataset

The Breast Cancer dataset contains features computed from digitized images of breast masses. 
It has 30 numerical features describing characteristics such as radius, texture, and smoothness. 

The target variable is binary, indicating whether a tumor is malignant (0) or benign (1) in scikit-learn's encoding.

In [4]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)

# Add target column
df['target'] = y

# Show first rows
print(df.head())

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             

We can now standardise the features so that each one has zero mean and unit variance.

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(df[feature_names])
X_scaled = scaler.transform(df[feature_names])

# Convert back to DataFrame (optional)
df_scaled = pd.DataFrame(X_scaled, columns=feature_names)
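
As a rough illustration of what `StandardScaler` computes with its default settings, the sketch below standardises a small made-up array by hand: subtract the per-column mean, then divide by the per-column population standard deviation. The array `X_demo` is hypothetical and is not part of the lab data.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [4.0, 400.0]])

mean = X_demo.mean(axis=0)   # per-column mean
std = X_demo.std(axis=0)     # per-column population std (ddof=0)

X_demo_scaled = (X_demo - mean) / std

print(X_demo_scaled.mean(axis=0))  # close to 0 for each column
print(X_demo_scaled.std(axis=0))   # close to 1 for each column
```

After scaling, both columns are on the same scale even though the raw values differ by a factor of 100.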

In [6]:
df_scaled

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.886690,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.243890,0.281190
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.511870,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955000,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.935010
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,-0.009560,-0.562450,...,1.298575,-1.466770,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.397100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,-0.312589,-0.931027,...,1.901185,0.117700,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,-1.058611,...,1.536720,2.047399,1.421940,1.494959,-0.691230,-0.394820,0.236573,0.733827,-0.531855,-0.973978
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,-0.809117,-0.895587,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,1.043695,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635


In [7]:
df[['target']]

Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0
...,...
564,0
565,0
566,0
567,0


In [8]:
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_scaled[feature_names], y, test_size=0.2, random_state=42)
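
The cells above fit the scaler on the full dataset before splitting. A common alternative, sketched here with NumPy on made-up data, is to estimate the mean and standard deviation from the training rows only and reuse them for the test rows, so that no test-set statistics influence preprocessing. All names (`X_all`, `train_idx`, `test_idx`, ...) are hypothetical, and this is not the code used in the lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 samples, 3 features
X_all = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

# Shuffle the row indices and hold out 20% for testing
idx = rng.permutation(len(X_all))
n_test = int(0.2 * len(X_all))
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Statistics are estimated from the training rows only
mean = X_all[train_idx].mean(axis=0)
std = X_all[train_idx].std(axis=0)

X_tr = (X_all[train_idx] - mean) / std
X_te = (X_all[test_idx] - mean) / std   # same statistics reused

print(X_tr.shape, X_te.shape)
```

With scikit-learn, the equivalent pattern would be to call `fit` on the training portion and `transform` on both portions.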

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize and train logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9737


### Assignment - Implement Logistic Regression

In this assignment, you will be asked to implement your own logistic regression model.
This is similar to what we demonstrated for the perceptron algorithm.

You will be asked to fill in the following class step by step.

In [10]:
import numpy as np

class MyLogisticRegression:
    def __init__(self, n_features):
        self.lr = None            # learning rate
        self.n_iter = None    # number of iterations
        self.weights = np.zeros(n_features)
        self.bias = 0
        self.n_features = n_features

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y, lr=0.01, n_iter=1000):
        # data shape
        n_samples, n_features = X.shape

        self.lr = lr            # learning rate
        self.n_iter = n_iter    # number of iterations
        
        # initialise model parameters
        self.weights = np.zeros(self.n_features)
        self.bias = 0
        
        # Gradient descent
        for _ in range(self.n_iter):
            # Forward pass on the whole dataset: sigma(Xw + b)
            y_predicted = self.predict_proba(X)

            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        
    def predict_proba(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        return self._sigmoid(linear_model)

    def predict(self, X, threshold=0.5):
        probs = self.predict_proba(X)
        return np.where(probs >= threshold, 1, 0)

### **Step 1 — Implement `predict_proba`**

In this step, you will implement the `predict_proba` method of the `MyLogisticRegression` class.  
This method should return the **predicted probabilities** that each sample belongs to class `1`.

The logistic regression model predicts probabilities using the **sigmoid function**:

$\hat{y} = Pr(y=1|x) = \sigma(Xw + b) , \quad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}}$

Here:  
- $X$ is the input data matrix (shape: *n_samples × n_features*).  
- $w$ is the weight vector (learned parameters).  
- $b$ is the bias term.  
- $\sigma$ is the sigmoid function
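
For large negative $z$, `np.exp(-z)` can overflow and trigger a runtime warning. One possible workaround, shown as a sketch rather than a required part of the assignment, evaluates the sigmoid through two algebraically equivalent branches so that the exponent is never positive. The function name `stable_sigmoid` is made up for this example, and it assumes array input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # z >= 0: 1 / (1 + exp(-z)), the exponent is <= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # z < 0: exp(z) / (1 + exp(z)), the exponent is < 0
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
p = stable_sigmoid(z)
print(p)
```

For standardised features and small weights the plain formula is usually fine; the two-branch form mainly matters when $|z|$ becomes very large.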

#### **Implementation Steps**
1. Compute the **linear combination** $z = Xw + b$.  
2. Apply the **sigmoid function** to obtain probabilities.  
3. Return these probabilities as a NumPy array.  
4. Make sure your code works **for a batch of data**, not just a single sample.  
   - $X$ should be a 2D array.  
   - Use matrix–vector operations (e.g., `np.dot(X, w) + b`) to ensure correct broadcasting.
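
To make the shapes concrete, here is a small sketch on made-up numbers. With all-zero weights and bias, $z = 0$ for every row, so every probability equals $\sigma(0) = 0.5$. The arrays `X_demo`, `w`, `w2` and the scalar `b` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical batch: 5 samples, 3 features
X_demo = np.arange(15, dtype=float).reshape(5, 3)
w = np.zeros(3)   # weight vector, shape (n_features,)
b = 0.0           # scalar bias, broadcast to every sample

z = np.dot(X_demo, w) + b   # shape (5,): one score per sample
probs = sigmoid(z)

print(z.shape)   # (5,)
print(probs)     # [0.5 0.5 0.5 0.5 0.5]

# Non-zero weights give a different probability per row
w2 = np.array([0.1, -0.2, 0.05])
probs2 = sigmoid(np.dot(X_demo, w2) + b)
print(probs2.shape)  # (5,)
```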


In [11]:
# Assuming X_train, X_test, y_train, y_test are ready and scaled
model = MyLogisticRegression(n_features=X_train.shape[1])
y_probs = model.predict_proba(X_test.iloc[0:5])

In [12]:
# check the outputs
y_probs

array([0.5, 0.5, 0.5, 0.5, 0.5])

### **Step 2 — Implement `predict`**

In this step, you will implement the `predict` method of the `MyLogisticRegression` class.  
This method should return the **predicted class labels** (`0` or `1`) for each input sample.

The `predict` method relies on the probabilities obtained from your `predict_proba` method.

#### **Implementation Steps**
1. Use your `predict_proba` method to compute the predicted probabilities $\hat{y}$ for all samples.  
2. Apply a **decision threshold** (commonly 0.5):  
   - If $\hat{y} \geq 0.5$, predict class `1`.  
   - Otherwise, predict class `0`.  
3. Return the predicted labels as a NumPy array of integers (`dtype=int`).
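
The thresholding step can be sketched as follows on hypothetical probabilities. Note that a probability of exactly 0.5 maps to class `1` under the `>=` rule, which is why an untrained model with zero weights predicts `1` everywhere.

```python
import numpy as np

# Hypothetical predicted probabilities
probs = np.array([0.10, 0.49, 0.50, 0.51, 0.90])

threshold = 0.5
labels = (probs >= threshold).astype(int)   # boolean mask -> 0/1 integers
print(labels)   # [0 0 1 1 1]

# A higher threshold makes the model more reluctant to predict class 1
strict_labels = (probs >= 0.8).astype(int)
print(strict_labels)   # [0 0 0 0 1]
```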

In [13]:
y_pred = model.predict(X_test.iloc[0:5])
y_pred

array([1, 1, 1, 1, 1])

### **Step 3 — Implement `fit`**

In this step, you will implement the `fit` method of the `MyLogisticRegression` class.  
This method should train the model using **gradient descent** to find the optimal parameters $w$ and $b$ that minimize the **binary cross-entropy loss**.

Use **full-batch gradient descent**, where every update is computed from the whole dataset.

The loss function for logistic regression is:

$$
L = -\frac{1}{n} \sum_{i=1}^{n} \big[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \big]
$$

where $\hat{y}_i = \sigma(X_i w + b)$.
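
Although the loss is not needed for the parameter updates, computing it can be a useful check. The sketch below is one way to evaluate it; the helper name `binary_cross_entropy` and the clipping constant `eps` are choices made for this example to avoid `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log never receives exactly 0
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])

# An untrained model outputs 0.5 everywhere, which gives a loss of log(2)
loss_untrained = binary_cross_entropy(y_true, np.full(4, 0.5))
print(loss_untrained)

# Confident, correct predictions give a much smaller loss
loss_good = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.2]))
print(loss_good)
```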

#### **Implementation Steps**
1. **Initialize the parameters**:
   - Set $w$ as a zero vector of shape `(n_features,)`.
   - Set $b = 0$.
2. **For a fixed number of iterations (epochs)**:
   - Use the whole dataset:
     - Compute the predicted probabilities: $\hat{y} = \sigma(X w + b)$ using your `predict_proba` function.
     - Compute the **gradients** of the loss with respect to $w$ and $b$:  
       - $\displaystyle \frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) X_i$, where $X_i$ is the $i$-th row of $X$; this is a vector of shape `(n_features,)`, the same shape as $w$
       - $\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)$
     - Update the parameters using the learning rate $\alpha$:  
       - $w = w - \alpha \frac{\partial L}{\partial w}$  
       - $b = b - \alpha \frac{\partial L}{\partial b}$

> ⚠️ Make sure your implementation works for a batch of samples (matrix $X$). Avoid explicit `for` loops over samples; the only loop you need is the one over iterations.
>
> ⚠️ You **do not need to compute the loss $L$** since the gradients are known in closed form.  
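
One way to gain confidence in the gradient formulas is a finite-difference check: perturb each parameter slightly, measure the change in the loss, and compare with the analytic gradient. The sketch below does this on small synthetic data. All names are hypothetical, and the loop over parameters is used only for checking, not for training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradients(X, y, w, b):
    """Analytic gradients from the formulas above (vectorised)."""
    err = sigmoid(X @ w + b) - y          # shape (n_samples,)
    dw = X.T @ err / len(y)               # shape (n_features,)
    db = err.mean()                       # scalar
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
b = 0.3

dw, db = gradients(X, y, w, b)

# Central finite differences, one parameter at a time
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    dw_num[j] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * h)
db_num = (loss(X, y, w, b + h) - loss(X, y, w, b - h)) / (2 * h)

print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```

If the two printed differences are tiny, the analytic and numerical gradients agree.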

In [14]:
model.fit(X=X_train, y=y_train, lr=0.01, n_iter=1000)
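
A quick sanity check for a `fit` implementation is to run the same update rule on data where the answer is known and confirm that the loss goes down. The sketch below is a standalone version of full-batch gradient descent on synthetic, linearly separable data. It is not the class above, and the names, data, and hyperparameters are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Synthetic data: the label depends only on the sign of the first feature
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w, b = np.zeros(2), 0.0
lr, n_iter = 0.1, 500
losses = []

for _ in range(n_iter):
    p = sigmoid(X @ w + b)
    # Record the loss before each update (clipped to avoid log(0))
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)
    losses.append(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)))
    # Full-batch gradient step
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

train_acc = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}, accuracy: {train_acc:.3f}")
```

The first recorded loss equals $\log 2$ because the model starts from zero weights and predicts 0.5 for every sample.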

### **Step 4 — Evaluate the Model**

Calculate the **test accuracy** of your model.

In [15]:
y_pred_test = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_test)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9912
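
Accuracy is simply the fraction of predictions that match the true labels. As a sketch on made-up label vectors (not the lab's test set), it can be computed directly with NumPy, together with the four confusion-matrix counts behind it.

```python
import numpy as np

# Hypothetical true labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_true == y_hat)
print(f"Accuracy: {accuracy:.4f}")   # Accuracy: 0.7500

# Confusion-matrix counts with class 1 as the positive class
tp = np.sum((y_true == 1) & (y_hat == 1))
tn = np.sum((y_true == 0) & (y_hat == 0))
fp = np.sum((y_true == 0) & (y_hat == 1))
fn = np.sum((y_true == 1) & (y_hat == 0))
print(tp, tn, fp, fn)   # 3 3 1 1
```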


In [16]:
y_pred_test

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0])