# Logistic/Softmax regression and Cross Entropy Loss with PyTorch

https://www.youtube.com/watch?v=1fXB0Lc9RMI&list=PLLeO8f6PhlKb_FAC7qxOBtxT9-8EPDAqk&index=4

## **Logistic Regression: A Classification Algorithm**
Logistic Regression is a **supervised learning algorithm** used for **classification tasks**. Despite its name, it is **not a regression algorithm** but rather a method for predicting **categorical outcomes** (e.g., "yes" or "no", "spam" or "not spam", etc.).

---

### **How Logistic Regression Works**
Logistic Regression works by applying the **sigmoid (logistic) function** to a **linear equation**. This converts continuous values into probabilities between **0 and 1**.

#### **Mathematical Representation**
1. **Linear Combination of Features**  
   Given input features \(X\), the model computes a weighted sum:

   $$
   z = W X + b
   $$

   Where:
   - \(X\) = input features
   - \(W\) = weights (learned during training)
   - \(b\) = bias term

2. **Applying the Sigmoid Function**  
   The output \(z\) is passed through the **sigmoid function**:

   $$
   \sigma(z) = \frac{1}{1 + e^{-z}}
   $$

   This ensures the output is a probability between 0 and 1.

3. **Decision Rule**  
   - If \( \sigma(z) > 0.5 \), classify as **1** (positive class).
   - If \( \sigma(z) \leq 0.5 \), classify as **0** (negative class).
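
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the weights `w`, bias `b`, and the two input samples are hypothetical values chosen only to show one positive and one negative prediction.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Return (probability, class) for one sample x with weights w and bias b."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b  # z = W X + b
    p = sigmoid(z)
    return p, int(p > threshold)  # class 1 only if p is strictly above 0.5

# Hypothetical weights and inputs, chosen only for illustration
w, b = [0.8, -0.4], 0.1
prob_pos, cls_pos = predict([2.0, 1.0], w, b)   # z = 1.6 - 0.4 + 0.1 = 1.3
prob_neg, cls_neg = predict([-1.0, 2.0], w, b)  # z = -0.8 - 0.8 + 0.1 = -1.5
print(prob_pos, cls_pos)
print(prob_neg, cls_neg)
```

Note that `z = 0` gives a probability of exactly 0.5, which the decision rule assigns to class 0.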

---

### **Types of Logistic Regression**
1. **Binary Logistic Regression**  
   - Used when there are **two** classes (e.g., spam or not spam).
   - Example: Predicting if an email is **spam (1) or not spam (0)**.

2. **Multiclass Logistic Regression (Softmax Regression)**  
   - Used when there are **more than two** classes.
   - Uses the **softmax function** instead of the sigmoid function.
   - Example: Predicting if an image contains a **cat (class 0), dog (class 1), or bird (class 2)**.
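
To make the multiclass case concrete, here is a small sketch of the softmax function applied to the cat/dog/bird example. The raw scores in `logits` are hypothetical, and the max-subtraction trick is a common numerical safeguard rather than something the text above requires.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["cat", "dog", "bird"]  # class 0, 1, 2 as in the example above
logits = [2.0, 1.0, 0.1]              # hypothetical raw scores from a linear layer
probs = softmax(logits)
predicted = class_names[probs.index(max(probs))]
print(probs, predicted)
```

With only two classes and scores `[z, 0]`, softmax reduces to the sigmoid of `z`, which is why binary logistic regression is a special case of softmax regression.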

---

### **Loss Function: Binary Cross-Entropy**
To measure how well the model is performing, we use the **Binary Cross-Entropy (Log Loss)**:

$$
Loss = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i}) \right]
$$

where:
- \( y_i \) = actual label (0 or 1)
- \( \hat{y_i} \) = predicted probability
- \( m \) = number of samples

This function **penalizes incorrect predictions more heavily** when the model is confident but wrong.
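
The formula can be evaluated by hand on a few samples to see this penalty. The sketch below uses three hypothetical labels and three hypothetical sets of predicted probabilities; the small `eps` clamp is an added safeguard against `log(0)`, not part of the formula itself.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over m samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1]
confident_right = binary_cross_entropy(y_true, [0.9, 0.1, 0.9])
unsure = binary_cross_entropy(y_true, [0.6, 0.4, 0.6])
confident_wrong = binary_cross_entropy(y_true, [0.1, 0.9, 0.1])

print(f"confident and right: {confident_right:.4f}")
print(f"unsure:              {unsure:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
```

The loss grows slowly while predictions are merely uncertain and sharply once they are confidently wrong, because `log(p)` goes to negative infinity as `p` approaches 0.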

# **Example: Logistic Regression in Python**
Using **scikit-learn** to train a binary classifier on synthetic data (the same workflow applies to a real task, such as classifying whether a tumor is malignant or benign):

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

## We follow the same steps as Linear regression
0. **Prepare Data**
1. **Design Model** (input size, output size, forward pass)
2. **Construct Loss and optimizer**
3. **Training Loop**
    - Forward Pass: Compute Prediction and Loss
    - Backward Pass: Gradients
    - Update Weights

Compared with linear regression, we only need a slight modification: the model applies one more step (a sigmoid) after its linear layer, and we select a different loss function (binary cross-entropy).


## Summary of Steps:
1. **Import Libraries**
2. **Load & Prepare Data**
3. **Split into Training & Testing Sets**
4. **Preprocess Data**
5. **Choose a Model**
6. **Train the Model**
7. **Make Predictions**
8. **Evaluate Performance**
9. **Hyperparameter Tuning (Optional)**
10. **Deploy the Model (Optional)**


### Code Understanding

- The model inherits from `nn.Module`, PyTorch's base class for all neural networks.

`self.linear = nn.Linear(n_input_features, 1)`:
- A single-layer model with `n_input_features` inputs (30 features from the dataset) and 1 output neuron (since it is binary classification).

`forward(self, x)`:
- Applies the linear transformation:
$$ y=Wx+b $$
- Applies the sigmoid activation:
$$\sigma(y)=\frac{1}{1+e^{-y}}$$
- This converts raw scores to probabilities (0 to 1).

`nn.BCELoss()`:
- Standard loss function for binary classification.
- Measures the difference between predicted probabilities and actual labels, so it expects inputs that have already passed through a sigmoid.

Optimizer: `torch.optim.SGD` (**Stochastic Gradient Descent**):
- `model.parameters()`: the parameters (weights and bias) that the optimizer will update.
- `lr=0.01`: the learning rate, which controls the step size of each gradient update.
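
As a quick sanity check, and assuming PyTorch is installed, the sketch below compares `nn.BCELoss` (with its default mean reduction) against the binary cross-entropy formula written out by hand. The probabilities and labels are hypothetical values, not outputs of the model in this notebook.

```python
import torch
import torch.nn as nn

# Hypothetical sigmoid outputs and labels, both shaped (n_samples, 1)
y_prob = torch.tensor([[0.9], [0.2], [0.6]])
y_true = torch.tensor([[1.0], [0.0], [1.0]])

# Built-in loss (averages over all samples by default)
loss_builtin = nn.BCELoss()(y_prob, y_true)

# Same quantity written out from the formula
loss_manual = -(y_true * torch.log(y_prob)
                + (1 - y_true) * torch.log(1 - y_prob)).mean()

print(loss_builtin.item(), loss_manual.item())
```

Both values should agree up to floating-point error.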

#### Why Do We Need to Perform Feature Scaling in Logistic Regression?
Feature scaling is important for many machine learning models, especially those trained with gradient-based optimization, as logistic regression is in this notebook.
1. Logistic Regression Uses Gradient Descent
    - Here, Logistic Regression learns its parameters using Gradient Descent, which updates weights by computing gradients.
    - If features have different scales, gradient updates will be uneven, leading to slow convergence or even failure to find the optimal solution.
2. Helps with Convergence Speed
    - When features share a similar scale, a single learning rate suits all weights, so fewer iterations are usually needed.
3. Prevents Numerical Instability
    - Some features might have very large values, causing issues like overflow or underflow during computation.
    - This is especially critical when using sigmoid activation in logistic regression: $\sigma(x)=\frac{1}{1+e^{-x}}$
    - If $x$ is very large, $e^{-x}$ approaches 0 and $\sigma(x)$ saturates at 1; if $x$ is very negative, $e^{-x}$ can overflow and $\sigma(x)$ saturates at 0. In both cases the gradient becomes vanishingly small.
4. Improves Model Performance
    - Proper scaling puts all features on a comparable footing during learning.
    - This can improve accuracy and generalization on new data.
5. Required for Many Regularization Methods
    - If you use $L1$ (Lasso) or $L2$ (Ridge) regularization, scaling ensures that the penalty treats all coefficients comparably.

  
- Example:

| Feature | Range            |
|---------|------------------|
| Age     | 18 - 80          |
| Salary  | 20,000 - 200,000 |

- Since "Salary" has much larger values, its gradients dominate the updates, and "Age" may be effectively ignored.
- Without Scaling:
    - Large values (like 200,000) will result in large weight updates.
    - Small values (like 18) will result in tiny updates.
    - This leads to imbalanced learning and poor optimization.
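
The imbalance can be seen numerically. The sketch below uses a hypothetical four-person dataset with Age and Salary columns and computes the gradient of the mean binary cross-entropy with respect to each weight at the starting point where all weights are zero (so every predicted probability is 0.5). The helper names are illustrative only.

```python
# Four hypothetical people: (age, salary) and a binary label
X = [(25, 30_000), (35, 60_000), (50, 120_000), (60, 180_000)]
y = [0, 0, 1, 1]

def bce_gradient_at_zero(X, y):
    """Gradient of mean BCE w.r.t. each weight when all weights are 0 (so p = 0.5)."""
    m, n = len(X), len(X[0])
    return [sum((0.5 - y_i) * x_i[j] for x_i, y_i in zip(X, y)) / m
            for j in range(n)]

def standardize(X):
    """Column-wise (x - mean) / std, using the population standard deviation."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - mu) ** 2 for v in c) / len(c)) ** 0.5
            for c, mu in zip(cols, means)]
    return [tuple((v - mu) / s for v, mu, s in zip(row, means, stds))
            for row in X]

grad_raw = bce_gradient_at_zero(X, y)
grad_scaled = bce_gradient_at_zero(standardize(X), y)

print("raw gradients    (age, salary):", grad_raw)
print("scaled gradients (age, salary):", grad_scaled)
```

On the raw data the Salary gradient is thousands of times larger than the Age gradient, so no single learning rate fits both weights; after standardization the two gradients have similar magnitude.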

### Which Scaling Method Should You Use?

#### 1. Standardization (Recommended for Logistic Regression)
$$
X_{\text{scaled}} = \frac{X - \mu}{\sigma}
$$
- **Zero mean, unit variance** (centered around 0).
- Works well for **gradient-based** models like Logistic Regression, SVM, and Neural Networks.
- **Used in the code below:**

```
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

#### 2. Min-Max Scaling (Alternative)
$$
X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$
- Scales values between **0 and 1**.
- Used for models that need **bounded inputs** (e.g., Neural Networks).
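
The two formulas can be compared side by side on a single feature. This is a dependency-free sketch with hypothetical age values; it uses the population standard deviation (`pstdev`), which is assumed here to match what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

ages = [18, 30, 45, 80]  # hypothetical values spanning the Age range used earlier

def standardize(values):
    """(x - mean) / std: result has zero mean and unit variance."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """(x - min) / (max - min): result lies in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

z_scores = standardize(ages)
unit_range = min_max(ages)

print("standardized:", [round(v, 3) for v in z_scores])
print("min-max:     ", [round(v, 3) for v in unit_range])
```

Standardized values are centered on 0 and can be negative, while min-max values are bounded but sensitive to outliers, since a single extreme value sets `min` or `max`.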


In [3]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# ----
# 0) Prepare data: import the Breast Cancer dataset from `sklearn.datasets`. It is a binary classification dataset.
# X contains features (independent variables). y contains the target labels (0 for malignant, 1 for benign).
bc = datasets.load_breast_cancer()
X, y = bc.data, bc.target

n_samples, n_features = X.shape
print(n_samples, n_features) # 569 samples and 30 different features (quite a lot)


# Splitting and Scaling the Data: 
# Splits 80% training and 20% testing data; random_state=1234 ensures reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Feature Scaling (Standardization): always recommended when doing logistic regression
# `StandardScaler()` scales each feature to zero mean and unit variance.
# `fit_transform(X_train)`: Learns the scaling from the training data and applies it.
# `transform(X_test)`: Applies the same (training) scaling to the test data.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Convert to tensor
X_train = torch.from_numpy(X_train.astype(np.float32))
X_test = torch.from_numpy(X_test.astype(np.float32))
y_train = torch.from_numpy(y_train.astype(np.float32))
y_test = torch.from_numpy(y_test.astype(np.float32))

# Reshape the y tensors: each is currently 1-D with shape (n,), and we want a column vector of shape (n, 1) so it matches the model output
y_train = y_train.view(y_train.shape[0], 1)
y_test = y_test.view(y_test.shape[0], 1)

# ----
# 1) model
# f = wx + b, followed by a sigmoid function that returns a value between 0 and 1.
# Let's create our own class, LogisticRegression, derived from nn.Module. Its __init__ takes self and the number of input features.
class LogisticRegression(nn.Module):
    def __init__(self, n_input_features):
        super(LogisticRegression, self).__init__()
        # It has only one layer, with n_input_features inputs and output size 1, i.e., a single probability at the end.
        self.linear = nn.Linear(n_input_features, 1)

    # Implement the forward pass, which takes self and the input data.
    # It must be defined at class level (not nested inside __init__) so that model(x) can call it.
    def forward(self, x):
        y_predicted = torch.sigmoid(self.linear(x))
        return y_predicted

model = LogisticRegression(n_features)

# ----
# 2) Loss and Optimizer
learning_rate = 0.01
criterion = nn.BCELoss() # binary cross entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)


# ----
# 3) Training Loop
num_epochs = 100
for epoch in range(num_epochs):
    # 1) Forward Pass and Loss: Computes predictions (y_predicted) and Computes loss (loss).
    y_predicted = model(X_train)
    loss = criterion(y_predicted, y_train)
        
    # 2) backward Pass: calculates gradients for all weights.
    loss.backward()

    # 3) Update Weights: updates the model's weights using the gradients.
    optimizer.step()

    # Zero the gradients: resets them before the next iteration (otherwise, they accumulate).
    optimizer.zero_grad()
    
    if (epoch+1) % 10 ==0: # Print loss every 10 epochs.
        print(f'epoch: {epoch+1}, loss={loss.item():.4f}')

569 30


### Testing or Evaluation
#### Understand the code structure for evaluating the model
- During training, PyTorch tracks all operations on tensors to compute gradients for backpropagation. However, in evaluation, we do not need to compute gradients, which saves memory and speeds up computations.
- `with torch.no_grad():` disables gradient tracking for the following code block.
- `y_predicted = model(X_test)`: The test data `X_test` is passed to the model to get predictions. Since the model applies the sigmoid activation function, the output will be a probability between $0$ and $1$.
- `y_predicted_cls = y_predicted.round()`: The output of the model is a probability (e.g., 0.7, 0.3, 0.9, etc.). To get class labels (0 or 1), we use `.round()`, which converts values $> 0.5$ to $1$ (positive class) and values $< 0.5$ to $0$ (negative class). A value of exactly $0.5$ rounds to $0$, because `torch.round` rounds halves to the nearest even integer.
- `acc = y_predicted_cls.eq(y_test).sum() / float(y_test.shape[0])`:
    - `y_predicted_cls.eq(y_test)` creates a Boolean tensor where:
        - True $(1)$ means the prediction is correct.
        - False $(0)$ means the prediction is incorrect.
    - `.sum()` counts the number of correct predictions.
    - `float(y_test.shape[0])` gets the total number of test samples.
    - Final calculation: $$ \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Samples}} $$
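
The accuracy computation described above can be traced on a tiny hand-made example before running it on the real test set. The probabilities and labels below are hypothetical, and the sketch assumes PyTorch is installed.

```python
import torch

# Hypothetical sigmoid outputs and true labels, both shaped (n_samples, 1)
y_predicted = torch.tensor([[0.92], [0.35], [0.61], [0.08], [0.47]])
y_test = torch.tensor([[1.0], [0.0], [0.0], [0.0], [1.0]])

with torch.no_grad():
    y_predicted_cls = y_predicted.round()         # class labels: 1, 0, 1, 0, 0
    correct = y_predicted_cls.eq(y_test)          # Boolean tensor, True where they match
    acc = correct.sum() / float(y_test.shape[0])  # 3 correct out of 5 samples

print(correct.flatten().tolist())
print(f"accuracy = {acc.item():.4f}")
```

Three of the five predictions match their labels, so the accuracy is 3/5 = 0.6.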

In [None]:
# Evaluate the model. This should not be part of the computational graph, so we don't want to track gradients.
with torch.no_grad(): # disables gradient tracking for the following code block.
    # Get the predicted classes for all test samples
    y_predicted = model(X_test)
    y_predicted_cls = y_predicted.round()
    # Calculate accuracy
    acc = y_predicted_cls.eq(y_test).sum() / float(y_test.shape[0])
    print(f'accuracy = {acc:.4f}') # ensures the accuracy is displayed with 4 decimal places.

### Exercise 
If the accuracy is not good, experiment with `num_epochs`, `learning_rate`, and a different `optimizer`.

Possible Improvements:
- Try Adam (Adaptive Moment Estimation) optimizer instead of SGD (`torch.optim.Adam`).
    - Instead of: `optimizer = torch.optim.SGD(model.parameters(), lr=0.01)`
    - Use: `optimizer = torch.optim.Adam(model.parameters(), lr=0.001)`
- Tune learning rate and number of epochs for better accuracy

### Why Use Adam?
- Faster Convergence
    - Adam updates weights using adaptive learning rates, so it often converges faster than vanilla SGD.
    - It is especially useful when dealing with large datasets or high-dimensional spaces.
- Adaptive Learning Rates (Automatic Learning Rate Adjustment)
    - In SGD, we use a single fixed learning rate, which may require manual tuning.
    - Adam adjusts the effective learning rate dynamically for each parameter based on the magnitude of past gradients, reducing the need for manual tuning.
- Less Hyperparameter Tuning Required (More Robust to Hyperparameters)
    - SGD requires careful tuning of the learning rate, momentum, etc.
    - Adam is less sensitive to hyperparameter choices and often works well with default settings.
- Better for Deep Learning and Large Datasets
- Handles Noisy and Sparse Gradients Well
    - SGD can struggle with sparse gradients (i.e., when some features rarely update).
    - Adam uses adaptive scaling, making it more efficient for sparse datasets and less sensitive to noisy gradients.
- Works Well for Non-Convex Problems
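
To illustrate what "adaptive" means, the sketch below compares a single SGD step with the first Adam step for two parameters whose gradients differ by five orders of magnitude. It is a hand-written version of the standard Adam update for step `t = 1`, not PyTorch's implementation; the gradient values and parameter names are hypothetical, and the hyperparameters are the learning rates used earlier plus Adam's usual defaults.

```python
import math

def sgd_step(grad, lr=0.01):
    """Plain SGD: the step is proportional to the raw gradient."""
    return -lr * grad

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update for step t = 1, starting from zero moment estimates."""
    m = (1 - beta1) * grad       # first moment (running mean of gradients)
    v = (1 - beta2) * grad ** 2  # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1)      # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

# Hypothetical gradients for two parameters
grads = {"rarely_active_weight": 0.001, "dominant_weight": 100.0}
sgd_steps = {name: sgd_step(g) for name, g in grads.items()}
adam_steps = {name: adam_first_step(g) for name, g in grads.items()}

print("SGD steps: ", sgd_steps)
print("Adam steps:", adam_steps)
```

The SGD steps differ by the same factor as the gradients, while both Adam steps have a magnitude close to the learning rate, because each gradient is divided by an estimate of its own typical size.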

### Comparison Table: Adam vs. SGD

| Optimizer  | Learning Rate | Convergence Speed | Handles Noisy Gradients | Works Well for Deep Learning |
|------------|--------------|-------------------|------------------------|------------------------------|
| **SGD**    | Fixed        | Slower            | Struggles              | Sometimes                    |
| **Adam**   | Adaptive     | Faster            | Better                 | Yes                          |


### When to Use Adam vs. SGD?
| Scenario                                      | Use Adam? | Use SGD? |
|----------------------------------------------|---------|---------|
| **Default choice for most models**           | ✅      | ❌      |
| **Deep learning (CNNs, RNNs, Transformers, etc.)** | ✅      | ❌      |
| **Small datasets, simple models**            | ❌      | ✅      |
| **Training speed is important**              | ✅      | ❌      |
| **Fine-tuning with pre-trained models**      | ✅      | ❌      |
| **Best final accuracy (with careful tuning)** | ❌      | ✅      |
