# 2. Logistic Regression

---

## References

[Geeks for Geeks - Logistic Regression in Machine Learning](https://www.geeksforgeeks.org/understanding-logistic-regression/)

[Scikit Learn - LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[StatQuest: Logistic Regression](https://youtu.be/yIYKR4sgzI8?si=k2EGU-u74p-jecQW)

[Google Machine Learning Crash Course - Logistic Regression](https://developers.google.com/machine-learning/crash-course/logistic-regression/)

---

## Notes

### Characteristics

- Supervised learning algorithm for classification tasks.
- Uses a linear model with a sigmoid activation function to map inputs to probabilities.
- Outputs probability values between 0 and 1.
- Can be extended to multiclass classification using softmax (multinomial logistic regression).

### Assumptions

1. **Linearity in log odds**: The relationship between input features and the log odds of the target variable is linear.
2. **Independence of observations**: Each training example is independent of others.
3. **No extreme multicollinearity**: Features should not be highly correlated with one another.
4. **Sufficiently large dataset**: Logistic regression works best when there are enough data points per class.
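5. **Features should be scaled**: Standardizing features helps gradient descent converge faster and more reliably. This is a practical recommendation rather than a statistical assumption.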
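
As a rough illustration of the scaling point, the sketch below standardizes each feature column to zero mean and unit variance with NumPy. The toy matrix `X` and the variable names are assumptions made for this example; in practice the mean and standard deviation would be computed on the training set only and reused for new data.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

# Column-wise statistics (one value per feature).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: subtract the mean and divide by the standard deviation.
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```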

### Inputs & Outputs

- **Input**: Feature matrix $X$ of shape $(n_{\text{samples}}, n_{\text{features}})$.
- **Output**: Predicted probabilities $\hat{y}$ of shape $(n_{\text{samples}},)$, one value per sample.
    - Each probability is converted to a class ($0$ or $1$) using a decision threshold (usually $0.5$).

### Parameters

- **Trainable Parameters**:
  - $\vec{w}$ (weights): $(n_{\text{features}},)$
  - $b$ (bias): scalar

* **Hyperparameters**:
  - $\alpha$ (learning rate)
  - epochs (number of training iterations)


### Runtime Complexity

- **Training Complexity**: $O(n \cdot d \cdot T)$
- **Inference Complexity**: $O(n \cdot d)$
* where
    - $n$: number of samples
    - $d$: number of features
    - $T$: number of training iterations

### Pros & Cons

- **Advantages**:
    - Simple, interpretable, and easy to implement.
    - Outputs probability scores, not just class labels.
    - Works well when the decision boundary is approximately linear.
    - Computationally efficient for small and medium-sized datasets.

* **Disadvantages**:
    - Struggles with data that is not linearly separable.
    - Sensitive to outliers.
    - Cannot capture complex feature interactions without feature engineering.


### Applications

- **Medical Diagnosis** (e.g., predicting cancer probability)
- **Spam Detection** (classifying emails as spam or not)
- **Credit Risk Assessment** (predicting loan default probability)
- **Customer Churn Prediction** (identifying users likely to cancel a subscription)

---

## Mathematics

### Model Equations

**Linear Combination of Inputs**:
$$z = b + w_1x_1 + w_2x_2 + \cdots + w_nx_n = b + \vec{w}^T\vec{x}$$

**Sigmoid Activation Function**:
$$\hat{y} = \sigma(z)=\frac{1}{1+e^{-z}}$$

**Class Decision with Threshold**:
$$\text{class} = \begin{cases}1&\hat{y}\geq0.5\\0&\hat{y}<0.5\end{cases}$$

where:
- $e^z$: the odds, i.e. the ratio of the probability of the favorable outcome to that of the unfavorable outcome ($\frac{p}{1-p}$)
- $\sigma(z)$: the probability ($p$)
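
The three model equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names (`sigmoid`, `predict_proba`, `predict`) and the toy values of `X`, `w`, and `b` are assumptions chosen for the example, and the plain `np.exp(-z)` form may overflow for very negative `z`.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores z to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Return y_hat = sigmoid(b + X @ w), one probability per row of X."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Convert probabilities to class labels 0/1 using the threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical toy values, chosen only for illustration.
X = np.array([[0.0, 0.0],
              [2.0, 1.0],
              [-3.0, 0.5]])
w = np.array([1.0, -1.0])
b = 0.0

z = X @ w + b            # linear combination of inputs
p = predict_proba(X, w, b)
odds = p / (1.0 - p)     # matches exp(z) up to rounding
labels = predict(X, w, b)
```

Here `z` is `[0, 1, -3.5]`, so the first sample sits exactly on the decision boundary ($\hat{y} = 0.5$) and is assigned class $1$ by the $\geq$ rule.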


### Loss Function

Binary Cross-Entropy Loss (Log Loss):
$$\ell(\hat{y}_i,y_i) = -y_i\log(\hat{y}_i)-(1-y_i)\log(1-\hat{y}_i)$$
$$J(\vec{w},b)=-\frac{1}{n}\sum\limits_{i=1}^{n}(y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i))$$

where:
- $y_i$ is the actual label (0 or 1),
- $\hat{y}_i$ is the predicted probability.
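
A small sketch of the cost $J$ in NumPy is shown below. The function name `binary_cross_entropy` and the `eps` clipping are assumptions added for this example; clipping is a common guard against $\log(0)$ and is not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss over all samples.

    eps is an assumption of this sketch: probabilities are clipped to
    [eps, 1 - eps] so that log(0) never occurs.
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob)
               + (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1.0, 0.0, 1.0])
mostly_right = np.array([0.9, 0.1, 0.8])
mostly_wrong = np.array([0.1, 0.9, 0.2])

loss_good = binary_cross_entropy(y_true, mostly_right)
loss_bad = binary_cross_entropy(y_true, mostly_wrong)
```

Predictions that put high probability on the wrong class are penalized much more heavily, so `loss_bad` comes out larger than `loss_good`.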

### Derivatives

For a single sample $i$ with $z_i = b + \vec{w}^T\vec{x}_i$ and $\hat{y}_i = \sigma(z_i)$:

\begin{align*}
    \frac{\partial\ell}{\partial \hat{y}_i} &= -\frac{y_i}{\hat{y}_i}+\frac{1-y_i}{1-\hat{y}_i}\\
    &=\frac{\hat{y}_i-y_i}{\hat{y}_i(1-\hat{y}_i)}\\
    \\
    \frac{\partial \hat{y}_i}{\partial z_i} &= \frac{e^{-z_i}}{(1+e^{-z_i})^2}\\
    &= \frac{1}{1+e^{-z_i}}\cdot\left(1-\frac{1}{1+e^{-z_i}}\right)\\
    &= \hat{y}_i\left(1-\hat{y}_i\right)\\
    \\
    \frac{\partial z_i}{\partial \vec{w}} &= \vec{x}_i\\
    \frac{\partial z_i}{\partial b} &= 1
\end{align*}
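
One way to sanity-check the sigmoid derivative is to compare the closed form $\hat{y}(1-\hat{y})$ against a central finite difference. The sketch below does this on a few sample points; the step size `h` and the grid of `z` values are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)   # a few test points
h = 1e-5                        # finite-difference step (assumed)

# Central difference approximation of d(sigmoid)/dz.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

# Closed form from the derivation: y_hat * (1 - y_hat).
analytic = sigmoid(z) * (1.0 - sigmoid(z))

max_abs_error = np.max(np.abs(numeric - analytic))
```

The two agree to within floating-point and truncation error, and the derivative peaks at $z = 0$, where it equals $0.25$.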


### Gradients

\begin{align*}
    \frac{\partial J}{\partial \vec{w}}&=\frac{1}{n}\sum\limits_{i=1}^{n}\frac{\partial\ell(\hat{y}_i,y_i)}{\partial \vec{w}}\\
    &=\frac{1}{n}\sum\limits_{i=1}^{n}\left(\frac{\partial\ell}{\partial \hat{y}_i}\cdot\frac{\partial \hat{y}_i}{\partial z_i}\cdot\frac{\partial z_i}{\partial \vec{w}}\right)\\
    &=\frac{1}{n}\sum\limits_{i=1}^{n}(\hat{y}_i-y_i)\vec{x}_i
\end{align*}

\begin{align*}
    \frac{\partial J}{\partial b}&=\frac{1}{n}\sum\limits_{i=1}^{n}\frac{\partial\ell(\hat{y}_i,y_i)}{\partial b}\\
    &=\frac{1}{n}\sum\limits_{i=1}^{n}\left(\frac{\partial\ell}{\partial \hat{y}_i}\cdot\frac{\partial \hat{y}_i}{\partial z_i}\cdot\frac{\partial z_i}{\partial b}\right)\\
    &=\frac{1}{n}\sum\limits_{i=1}^{n}(\hat{y}_i-y_i)
\end{align*}
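
In the chain-rule product, the factor $\hat{y}(1-\hat{y})$ from the sigmoid derivative cancels the denominator of $\frac{\partial\ell}{\partial\hat{y}}$, which is why only $(\hat{y}-y)$ remains. A vectorized sketch of both gradients, checked against finite differences of the loss, might look like this. The function names, the random toy data, and the step size `h` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w, b):
    """Binary cross-entropy J(w, b), averaged over samples."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def gradients(X, y, w, b):
    """Return (dJ/dw, dJ/db) using the closed-form expressions."""
    n = X.shape[0]
    error = sigmoid(X @ w + b) - y   # (y_hat - y), shape (n,)
    grad_w = X.T @ error / n         # shape (d,)
    grad_b = error.mean()            # scalar
    return grad_w, grad_b

# Hypothetical random data, for checking only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

grad_w, grad_b = gradients(X, y, w, b)

# Central finite differences of the loss, one coordinate at a time.
h = 1e-6
numeric_w = np.array([
    (loss(X, y, w + h * e, b) - loss(X, y, w - h * e, b)) / (2.0 * h)
    for e in np.eye(3)
])
numeric_b = (loss(X, y, w, b + h) - loss(X, y, w, b - h)) / (2.0 * h)
```

If the derivation is right, `grad_w` and `numeric_w` (and likewise `grad_b` and `numeric_b`) should agree closely.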

### Updates

$$\vec{w}=\vec{w}-\alpha\left(\frac{1}{n}\sum\limits_{i=1}^{n}(\hat{y}_i-y_i)\vec{x}_i\right)$$
$$b=b-\alpha\left(\frac{1}{n}\sum\limits_{i=1}^{n}(\hat{y}_i-y_i)\right)$$
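
Putting the pieces together, a minimal batch gradient descent loop could look like the sketch below. Each step moves $\vec{w}$ and $b$ against the gradient of $J$, that is, it subtracts $\alpha$ times the mean of $(\hat{y}-y)\vec{x}$ and of $(\hat{y}-y)$. The function name `fit`, the default values of `alpha` and `epochs`, the zero initialization, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, alpha=0.1, epochs=1000):
    """Batch gradient descent for binary logistic regression.

    alpha is the learning rate and epochs the number of full passes;
    both defaults are arbitrary choices for this sketch.
    """
    n, d = X.shape
    w = np.zeros(d)   # zero initialization (assumed)
    b = 0.0
    for _ in range(epochs):
        error = sigmoid(X @ w + b) - y     # (y_hat - y)
        w -= alpha * (X.T @ error) / n     # step against dJ/dw
        b -= alpha * error.mean()          # step against dJ/db
    return w, b

# Hypothetical 1-D data: negatives on the left, positives on the right.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w, b = fit(X, y)
labels = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

On this separable toy set the learned weight is positive and the thresholded predictions reproduce the labels. Each epoch costs $O(n \cdot d)$, which gives the $O(n \cdot d \cdot T)$ training cost over $T$ epochs.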

---

## Comments

Although multinomial logistic regression exists, only the binary (binomial) model will be implemented.