# Linear Regression

**Linear regression** is a model that estimates the relationship between a scalar response (the dependent variable) and one or more explanatory variables (also called regressors or independent variables). The relationship is modeled using a linear predictor function, whose unknown parameters are estimated from the data.

In a **machine learning** context, linear regression is a **supervised algorithm** that learns from labeled training data by fitting the best possible linear function to the input features. Once trained, this function can be used to make predictions on new, unseen data.

The linear predictor function can have the following forms:

- One predictor:
$$
\hat{y} = \beta_0 + \beta_1 x
$$

- Multiple predictors ($p$ of them):
$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
$$
or
$$
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
$$

- Matrix form:
$$
\hat{y} = X \boldsymbol{\beta}
$$
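
As a small illustration of how the summation and matrix forms agree, the following NumPy sketch evaluates both on made-up numbers. The coefficient values and the `features` array are assumptions chosen only for the example.

```python
import numpy as np

# Hypothetical coefficients: intercept beta_0 followed by two slopes.
beta = np.array([1.0, 2.0, -0.5])

# Three observations with two predictors each (illustrative values).
features = np.array([
    [1.0, 2.0],
    [0.0, 4.0],
    [3.0, 1.0],
])

# Prepend a column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Matrix form: y_hat = X @ beta
y_hat_matrix = X @ beta

# Summation form, one observation at a time: beta_0 + sum_j beta_j * x_j
y_hat_sum = np.array([beta[0] + np.sum(beta[1:] * row) for row in features])

print(y_hat_matrix)
print(np.allclose(y_hat_matrix, y_hat_sum))
```

The column of ones is what lets a single matrix product carry the intercept along with the slopes.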

## Uses
Most applications of linear regression fall into one of two categories:

- **Prediction**: Fitting a model to observed data in order to predict the response variable using only the explanatory (independent) variables. Once trained, the model can generate predictions for new, unseen data.

- **Interpretation**: Understanding and quantifying how much of the variation in the response variable can be explained by changes in the explanatory variables. This is useful for identifying significant relationships and trends in data.

## Fitting

Linear regression models are often fitted using a method called **Ordinary Least Squares (OLS)**. The goal of OLS is to find the parameter values that minimize the sum of squared differences between the actual and predicted values of the response variable.

This difference is quantified using the **Mean Squared Error (MSE)**, which is the average of the squared differences between actual values ($y_i$) and predicted values ($\hat{y}_i$). In this setting the MSE serves as the loss function:

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2
$$

> **Note:** Some sources keep the $\frac{1}{n}$ factor when writing the fitting objective, while others drop it and minimize the sum of squared errors instead. The choice does not affect the estimated coefficients, because scaling the loss function by a positive constant does not move the location of its minimum.
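
A minimal sketch of the loss computation, assuming NumPy and a hypothetical helper named `mse_loss`. The arrays are invented for illustration. It also shows that the sum of squared errors is just the MSE multiplied by $n$.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative values, not taken from any real dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mse = mse_loss(y_true, y_pred)

# Same quantity without the 1/n factor (sum of squared errors).
sse = np.sum((y_true - y_pred) ** 2)

print(mse, sse, np.isclose(sse, len(y_true) * mse))
```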

To minimize this error, OLS solves for the parameters $\boldsymbol{\beta}$ using the following closed-form solution, which exists when $X^T X$ is invertible:

$$
\boldsymbol{\beta} = (X^T X)^{-1} X^T y
$$

Where:
- $X$: the feature matrix (with a column of ones for the intercept)
- $y$: the target vector
- $\boldsymbol{\beta}$: the vector of fitted coefficients

Once fitted, the model can predict outcomes using:

$$
\hat{y} = X \boldsymbol{\beta}
$$
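
The closed-form fit can be sketched in NumPy as follows. The synthetic data (a line with intercept 2 and slope 3 plus noise) is an assumption made for the example. Solving the linear system $(X^T X)\boldsymbol{\beta} = X^T y$ is used here in place of forming the inverse explicitly, since it yields the same coefficients and is usually better behaved numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line y = 2 + 3x plus noise (illustrative only).
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

# Feature matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# General least-squares solver, which also handles rank-deficient X.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions from the fitted coefficients: y_hat = X @ beta.
y_hat = X @ beta_normal

print(beta_normal)
print(np.allclose(beta_normal, beta_lstsq))
```

Both routes should agree on well-conditioned data. `np.linalg.lstsq` is the safer default when predictors may be nearly collinear.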

## Error and Residuals

In linear regression, understanding the distinction between **error** and **residual** is key.

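- The **error term** $\varepsilon_i$ represents the true, unobservable deviation of the observed response $y_i$ from the true regression line. It captures the effects of omitted variables, measurement noise, and randomness:

$$
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i
$$

  Here the $\beta_j$ are the true (unknown) parameters and $x_{ij}$ is the value of predictor $j$ for observation $i$.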

- The **residual** $e_i$ is the observed counterpart to the error. It is the difference between the actual value and the predicted value from the model:

$$
e_i = y_i - \hat{y}_i
$$

While the true error $\varepsilon_i$ is unknown, the residual $e_i$ is used to assess model performance and validate regression assumptions.
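
The distinction is easiest to see in a simulation, where the errors are known because we generate them ourselves. The sketch below assumes NumPy; the true coefficients and noise level are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100
x = rng.uniform(-1, 1, size=n)
true_beta = np.array([1.0, 2.0])         # known only because we simulate
errors = rng.normal(scale=0.3, size=n)   # epsilon_i: unobservable in practice

X = np.column_stack([np.ones(n), x])
y = X @ true_beta + errors               # true line plus error

# Fit by least squares and compute residuals e_i = y_i - y_hat_i.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# With an intercept in the model, residuals sum to (numerically) zero;
# the true errors generally do not.
print(residuals.sum(), errors.sum())

# The fitted line minimizes the sum of squares, so the residual sum of
# squares cannot exceed the sum of squared true errors.
print(np.sum(residuals ** 2) <= np.sum(errors ** 2))
```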


## Assumptions

Standard linear regression models with standard estimation techniques make a number of assumptions about the predictors, the response, and their relationship. These assumptions ensure that the estimated coefficients $\boldsymbol{\beta}$ are unbiased and that inference (confidence intervals and p-values) is valid.

- **Weak exogeneity**: The predictors are not correlated with the error term $\varepsilon$. A common violation is omitted variable bias, where a left-out variable that is correlated with the predictors ends up in the error term.

- **Linearity**: The relationship between the predictors and the response is linear in the parameters. That is, the expected value of the response variable is a linear combination of the predictors.

- **Constant variance (homoscedasticity)**: The variance of the error term is constant across all values of the independent variables. In other words, the spread of residuals should be uniform.

- **Independence of errors**: The errors should be independent of each other, meaning there is no autocorrelation.

- **Lack of perfect multicollinearity**: The independent variables should not be perfectly linearly related to each other. Perfect multicollinearity makes it impossible to estimate unique coefficients.

Violations of these assumptions can result in biased estimates of $\boldsymbol{\beta}$, biased standard errors, and untrustworthy confidence intervals and significance tests.

## Assessing Assumptions

Once a linear regression model is fitted, it's important to verify that the key assumptions are reasonably satisfied. Below are common techniques for diagnosing each assumption.

**Linearity**: Check whether the relationship between the predictors and response is linear.
- Plot residuals vs. fitted values. A random scatter around zero is consistent with linearity, while a systematic pattern (such as a curve) suggests nonlinearity.
- Use partial regression plots to check linearity with individual predictors.

**Homoscedasticity**: Ensure the spread of residuals is constant across fitted values.
- Plot residuals vs. fitted values and look for "funnel" or "bowtie" patterns.
- Run a Breusch-Pagan test.
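
One common formulation of the Breusch-Pagan test (the studentized variant attributed to Koenker) regresses the squared residuals on the predictors and uses $n R^2$ from that auxiliary regression as the test statistic. The NumPy-only sketch below assumes that formulation. The helper name `breusch_pagan_lm` and the simulated "funnel" data are hypothetical, and the statistic is compared against the 5% chi-square critical value for one degree of freedom instead of computing a p-value.

```python
import numpy as np

def breusch_pagan_lm(X, residuals):
    """LM statistic n * R^2 from regressing squared residuals on X.

    X is expected to include the intercept column.
    """
    e2 = residuals ** 2
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    ss_res = np.sum((e2 - X @ gamma) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    return len(e2) * (1.0 - ss_res / ss_tot)

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])

# Noise whose standard deviation grows with x: a "funnel" pattern.
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

lm = breusch_pagan_lm(X, residuals)

# 5% critical value of the chi-square distribution with 1 degree of freedom
# (one predictor besides the intercept).
CHI2_CRIT_DF1 = 3.841
print(lm, lm > CHI2_CRIT_DF1)
```

A statistic above the critical value is evidence against constant variance. Libraries such as statsmodels provide ready-made versions of this test that also report p-values.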

**Independence of Errors**: Ensure residuals are independent (no autocorrelation).
- Durbin-Watson test for time series or ordered data.
- Plot residuals over time or observation order.
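
The Durbin-Watson statistic is simple enough to compute directly: it is the sum of squared successive differences of the residuals divided by their sum of squares. Values near 2 suggest no first-order autocorrelation, and values well below 2 suggest positive autocorrelation. The sketch below assumes NumPy and applies a hypothetical `durbin_watson` helper to two simulated series standing in for residuals.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(3)
n = 1000

# Independent noise, standing in for well-behaved residuals.
independent = rng.normal(size=n)

# AR(1) series with coefficient 0.8: strong positive autocorrelation.
shocks = rng.normal(size=n)
correlated = np.zeros(n)
for t in range(1, n):
    correlated[t] = 0.8 * correlated[t - 1] + shocks[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(dw_independent, dw_correlated)
```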

**Normality of Errors**: Residuals should be roughly normally distributed for confidence intervals and p-values to be reliable, especially in small samples.
- Plot a histogram or Q-Q plot of residuals.
- Shapiro-Wilk or Kolmogorov-Smirnov tests for normality.

**No Perfect Multicollinearity**: Ensure no predictors are perfectly (or nearly perfectly) linearly dependent.
- Calculate the **Variance Inflation Factor (VIF)** for each feature.
- A VIF above roughly 5 to 10 is commonly read as a sign of problematic multicollinearity.
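
The VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on all the others. A minimal NumPy sketch follows. The helper name `variance_inflation_factors` and the simulated features (one of which is nearly a copy of another) are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(features):
    """VIF_j = 1 / (1 - R_j^2), regressing feature j on the others plus an intercept."""
    features = np.asarray(features, dtype=float)
    n, p = features.shape
    vifs = np.empty(p)
    for j in range(p):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        X = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        ss_res = np.sum((target - X @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(4)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)  # nearly a copy of a

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

In this setup the first and third features should show large VIFs, while the unrelated second feature stays close to 1.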

While mild violations are common and often tolerable, significant deviations from these assumptions can lead to biased estimates or invalid inference. These checks help decide when model adjustments are necessary.

## Overfitting and Regularization

A linear model may perform well on training data but poorly on unseen data. This is called **overfitting**. It often happens when the model is too complex for the amount of data available or when the data is noisy.

To address this, we use **regularization**, which adds a penalty to large coefficients. This encourages simpler models that generalize better.

- **L2 (Ridge)**: Shrinks coefficients but keeps all features
- **L1 (Lasso)**: Can shrink some coefficients to zero, effectively performing **feature selection**

Both methods keep the squared-error loss and add a penalty term to it, so they are alternatives to plain OLS fitting, not replacements for the loss itself.
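
To illustrate the difference between the two penalties, the sketch below fits scikit-learn's `Ridge` and `Lasso` on simulated data in which only two of five features affect the response. The data and the `alpha` values are arbitrary choices for the example. Note that scikit-learn calls the regularization strength `alpha` and scales the squared-error term of the Lasso objective differently from the unscaled sum of squares, so its `alpha` values are not directly interchangeable with a $\lambda$ defined on the unscaled sum.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
n = 300
features = rng.normal(size=(n, 5))

# Only the first two features influence the response (illustrative setup).
y = 3.0 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=10.0).fit(features, y)
lasso = Lasso(alpha=0.2).fit(features, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print("ridge:", ridge.coef_)
print("lasso:", lasso.coef_)
print(n_zero_ridge, n_zero_lasso)
```

Ridge typically leaves every coefficient nonzero, while Lasso tends to set the coefficients of the irrelevant features to exactly zero.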

## Regularized Loss Functions

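Regularization modifies the original squared-error loss by adding a penalty term. The formulas below use the sum of squared errors, which is the Mean Squared Error without the $\frac{1}{n}$ factor.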

- **Ordinary Least Squares (OLS)** minimizes:
$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

- **Ridge Regression (L2 penalty)** adds a squared penalty on coefficients:
$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

- **Lasso Regression (L1 penalty)** adds an absolute value penalty:
$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

Here $\lambda \ge 0$ controls the strength of regularization. Higher values shrink coefficients more aggressively, and $\lambda = 0$ recovers OLS.

In practice, $\lambda$ is often selected using cross-validation.

> **Note:** In most implementations (including scikit-learn), regularization is applied **only to the coefficients**, not the intercept $\beta_0$. This ensures that the model can still fit the baseline level of the response variable without penalty.
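
Ridge regression has a closed-form solution, and leaving the intercept unpenalized can be handled by centering the data first. The NumPy sketch below is one way to do this under the sum-of-squares ridge loss shown earlier. The helper name `fit_ridge` and the simulated data are assumptions for illustration, not a reproduction of any library's internals.

```python
import numpy as np

def fit_ridge(features, y, lam):
    """Ridge fit with an unpenalized intercept.

    Centers features and response, solves (Xc^T Xc + lam * I) beta = Xc^T yc,
    then recovers the intercept from the means.
    """
    x_mean = features.mean(axis=0)
    y_mean = y.mean()
    Xc = features - x_mean
    yc = y - y_mean
    p = features.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

rng = np.random.default_rng(5)
features = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.0])  # made-up values
y = 4.0 + features @ true_coef + rng.normal(scale=0.5, size=200)

# lam = 0 reduces to ordinary least squares.
b0_ols, coef_ols = fit_ridge(features, y, lam=0.0)

# A large penalty shrinks the slopes but leaves the intercept free.
b0_ridge, coef_ridge = fit_ridge(features, y, lam=100.0)

print(b0_ols, coef_ols)
print(b0_ridge, coef_ridge)
```

With the penalty applied only to the slopes, the intercept continues to track the mean level of the response even as the slopes shrink toward zero.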



# Code