# 1. Linear Regression

---

## References

[Geeks for Geeks - Linear Regression in Machine Learning](https://www.geeksforgeeks.org/ml-linear-regression/)

[Scikit Learn - LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

---

## Notes

### Characteristics

- supervised learning algorithm for regression tasks
- linear relationship between inputs and outputs
- finds line of best fit

### Assumptions
1. Linearity: the relationship between features and target is linear
2. Independence: observations are independent of each other
3. Homoscedasticity: constant variance in residuals
4. Normality of Residuals: residuals should be normally distributed
5. No Perfect Multicollinearity: no feature should be an exact linear combination of the others (strongly correlated features also make the weights unstable)

Model performance may degrade if these assumptions are not satisfied.

### Inputs & Outputs

- **Input**: feature matrix $X$ of shape $(n_{\text{samples}}, n_{\text{features}})$
- **Output**: target variable $y$ of shape $(n_{\text{samples}},)$


### Parameters

- $\vec{w}$: weights, shape $(n_{\text{features}},)$
- $b$: bias, scalar

- **Hyperparameters** (when training with gradient descent):
    - $\alpha$: learning rate
    - number of epochs

### Runtime

- **Training**: $O(nd)$ per epoch of gradient descent
- **Inference**: $O(nd)$ for $n$ samples, i.e. $O(d)$ per sample

where $n=$ number of samples, $d=$ number of features

### Pros & Cons

- Pros:
    - Simple and Interpretable: easy to implement and explain
    - Works well when the relationship between features and target is approximately linear
    - Used as a baseline: often used before trying complex models
- Cons:
    - Sensitive to outliers: large errors from outliers can skew predictions
    - Assumes linear relationship: fails if data is non-linear
    - Multicollinearity issues: highly correlated features can lead to unstable weights
    - Limited capacity: can underfit complex, non-linear data

### Applications
- Predicting house prices
- Forecasting stock trends
- Medical risk assessment

---

## Mathematics

### Model Equation
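For a single sample with $d$ features:
$$\hat{y}=b+w_1x_1+w_2x_2+\cdots+w_dx_d=b+\vec{w}\cdot\vec{x}$$
For all $n$ samples at once, where $\vec{1}$ is a vector of $n$ ones (the scalar bias is added to every prediction):
$$\hat{\vec{y}}=X\vec{w}+b\vec{1}$$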
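
As an illustration, the model equation can be written as a single matrix-vector product in NumPy. This is a minimal sketch; the function name `predict` and the example values are assumptions chosen for illustration, not part of the original notes.

```python
import numpy as np

def predict(X, w, b):
    """Return one prediction per row of X: X @ w + b.

    X has shape (n_samples, n_features), w has shape (n_features,),
    and b is a scalar that NumPy broadcasts to every sample.
    """
    return X @ w + b

# Hypothetical example values: 3 samples, 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 2.0

y_hat = predict(X, w, b)
print(y_hat)
```

Broadcasting the scalar `b` plays the role of adding the bias to every entry of the prediction vector.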

### Loss Function

Mean Squared Error:
$$\ell(\hat{y}_i,y_i)=(\hat{y}_i-y_i)^2$$
$$J(\vec{w},b)=\frac{1}{n}\sum\limits_{i=1}^{n}(\hat{y}_i-y_i)^2$$
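
The loss above could be computed as follows. This is a small sketch; the helper name `mse` and the sample values are illustrative assumptions.

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: the average of the per-sample squared errors."""
    return np.mean((y_hat - y) ** 2)

# Hypothetical targets and predictions.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
loss = mse(y_hat, y)
print(loss)
```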

### Gradients

\begin{align*}
    \frac{\partial J}{\partial \vec{w}}&=\frac{\partial}{\partial \vec{w}}\left(\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2\right)\\
    &=\frac{1}{n}\sum_{i=1}^{n}\left(2(\hat{y}_i-y_i)\cdot\frac{\partial \hat{y}_i}{\partial\vec{w}}\right)\\
    &=\frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)\vec{x}_i
\end{align*}

\begin{align*}
    \frac{\partial J}{\partial b}&=\frac{\partial}{\partial b}\left(\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2\right)\\
    &=\frac{1}{n}\sum_{i=1}^{n}\left(2(\hat{y}_i-y_i)\cdot\frac{\partial\hat{y}_i}{\partial b}\right)\\
    &=\frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)
\end{align*}
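
One way to sanity-check these gradients is to compare them against central finite differences of the loss. The sketch below assumes NumPy; the names `gradients` and `mse_loss` and the random test data are hypothetical and only serve the illustration.

```python
import numpy as np

def mse_loss(X, y, w, b):
    """J(w, b): mean squared error of the linear model."""
    return np.mean((X @ w + b - y) ** 2)

def gradients(X, y, w, b):
    """Analytic gradients of J with respect to w and b."""
    n = X.shape[0]
    residual = X @ w + b - y                  # (y_hat - y), shape (n,)
    grad_w = (2.0 / n) * (X.T @ residual)     # sum over samples of residual_i * x_i
    grad_b = (2.0 / n) * residual.sum()
    return grad_w, grad_b

# Hypothetical random problem: 20 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b = 0.3

grad_w, grad_b = gradients(X, y, w, b)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
num_grad_w = np.array([
    (mse_loss(X, y, w + eps * e, b) - mse_loss(X, y, w - eps * e, b)) / (2 * eps)
    for e in np.eye(3)
])
num_grad_b = (mse_loss(X, y, w, b + eps) - mse_loss(X, y, w, b - eps)) / (2 * eps)

print(np.allclose(grad_w, num_grad_w, atol=1e-6))
print(abs(grad_b - num_grad_b) < 1e-6)
```

Note that `X.T @ residual` computes the sum over samples of the residual times the feature vector in one matrix-vector product.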

### Updates

$$\vec{w}\leftarrow\vec{w}-\alpha\left(\frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)\vec{x}_i\right)$$
$$b\leftarrow b-\alpha\left(\frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)\right)$$
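
Repeating these updates for a fixed number of epochs gives a basic batch gradient descent trainer. The following is a minimal sketch, not a reference implementation: the function name `fit_gd`, the default values `alpha=0.1` and `epochs=1000`, the zero initialisation, and the synthetic noiseless data are all assumptions made for the example.

```python
import numpy as np

def fit_gd(X, y, alpha=0.1, epochs=1000):
    """Fit weights and bias with full-batch gradient descent on the MSE."""
    n, d = X.shape
    w = np.zeros(d)   # assumed initialisation: all zeros
    b = 0.0
    for _ in range(epochs):
        residual = X @ w + b - y                     # y_hat - y for all samples
        w = w - alpha * (2.0 / n) * (X.T @ residual)  # weight update
        b = b - alpha * (2.0 / n) * residual.sum()    # bias update
    return w, b

# Hypothetical synthetic data generated from known parameters, without noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([3.0, -2.0])
true_b = 1.0
y = X @ true_w + true_b

w, b = fit_gd(X, y)
print(w, b)
```

On this well-scaled data the learned parameters should land very close to `true_w` and `true_b`; on features with very different scales, a smaller learning rate or feature standardisation would likely be needed.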

---

## Comments

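The optimal solution to the linear regression problem can also be found in closed form using the normal equation:
$$\vec{w}=(X^TX)^{-1}X^T\vec{y},$$
where a column of ones is appended to $X$ so that the bias $b$ is learned as an extra entry of $\vec{w}$.
However, forming $X^TX$ costs $O(nd^2)$ and solving the resulting system costs $O(d^3)$, so this approach becomes expensive as the number of features grows. It also requires $X^TX$ to be invertible, which fails under perfect multicollinearity.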
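
A closed-form fit along these lines could look like the sketch below. It assumes the bias is handled by appending a column of ones to $X$, and it solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly; the function name `fit_normal_equation` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X_aug^T X_aug) theta = X_aug^T y, where X_aug = [X, 1]."""
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])   # extra column of ones for the bias
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    return theta[:-1], theta[-1]              # (weights, bias)

# Hypothetical synthetic data generated from known parameters, without noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -0.5, 2.0])
true_b = 4.0
y = X @ true_w + true_b

w, b = fit_normal_equation(X, y)
print(w, b)
```

If $X^TX$ is singular or badly conditioned, `np.linalg.solve` may raise an error or return unstable values; `np.linalg.lstsq` is a common alternative in that case.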