# Simple Linear Regression
## Contents
- Description  
- Errors
- Cost Function
- Convergence Algorithm

## Description  
Simple linear regression is a statistical method used to model the relationship between two variables: one independent variable (predictor) and one dependent variable (outcome). The goal is to find a linear equation, $y = mx + b$, where $y$ is the dependent variable, $x$ is the independent variable, $m$ is the slope (rate of change), and $b$ is the intercept (value of $y$ when $x = 0$).
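
As a minimal sketch of what "finding the line" means, the snippet below computes $m$ and $b$ with the standard least-squares closed-form formulas. The closed form, the function name `fit_line`, and the toy data are assumptions added for illustration and are not part of the original notes.

```python
def fit_line(xs, ys):
    """Return (m, b) of the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - m * x_mean
    return m, b


# Toy data lying exactly on y = 2x + 1 (made up for this example).
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)  # 2.0 1.0
```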

## Errors
Simple linear regression tries to find the best-fit line, that is, the line with the minimum error. **Errors** (also called residuals) represent the difference between the actual observed values and the values predicted by the regression model. The error for each data point is calculated as:

$$ E_i = y_i - \hat{y}_i $$

Where:
- $E_i$  is the error for the $i$-th observation,
- $y_i$  is the actual observed value of the dependent variable,
- $\hat{y}_i$  is the predicted value from the regression model.
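
The error formula above can be sketched in a few lines. The helper name `residuals` and the sample values are hypothetical, chosen only to illustrate the sign convention (actual minus predicted).

```python
def residuals(y_true, y_pred):
    """Return the list of errors E_i = y_i - y_hat_i."""
    return [y - y_hat for y, y_hat in zip(y_true, y_pred)]


# Made-up observed and predicted values.
errors = residuals([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
print(errors)  # [0.5, 0.0, -1.0]
```

A positive error means the model under-predicted that point, and a negative error means it over-predicted.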

## Cost Function
The cost function measures the difference between the predicted values of the model and the actual target values. By minimizing this cost function, we can determine the optimal values for the model’s parameters and improve its performance.

The Mean Squared Error (MSE) is given by:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Where:
- $y_i$ is the actual value
- $\hat{y}_i$ is the predicted value
- $n$ is the total number of data points
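
A minimal sketch of the MSE formula in plain Python follows. The function name `mean_squared_error` and the sample values are assumptions for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Squared errors are 0.25, 0.0 and 1.0, so the MSE is 1.25 / 3.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
print(mse)
```

Squaring keeps positive and negative errors from cancelling out and penalizes large errors more heavily than small ones.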

Reference: https://medium.com/@yennhi95zz/3-understanding-the-cost-function-in-linear-regression-for-machine-learning-beginners-ec9edeecbdde

## Convergence Algorithm
Gradient descent is an iterative optimization algorithm that searches for a minimum of an objective function by repeatedly stepping in the direction opposite to the gradient. For linear regression, the objective is the cost function, and the weight update formula is:

$$ w_{t+1} = w_t - \eta \nabla L(w_t) $$

Where:
- $w_t$ is the weight at iteration $t$
- $w_{t+1}$ is the updated weight after iteration $t$
- $\eta$ is the learning rate
- $\nabla L(w_t)$ is the gradient of the loss function with respect to $w_t$
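
The update rule can be sketched for simple linear regression with MSE as the loss. This is an illustrative sketch, not a tuned implementation: the function name `gradient_descent`, the learning rate `eta=0.05`, the step count, and the toy data are all assumptions.

```python
def gradient_descent(xs, ys, eta=0.05, steps=2000):
    """Fit y = m*x + b by minimizing MSE with gradient descent."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction minus actual value under the current parameters.
        errs = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to m and b.
        grad_m = (2 / n) * sum(e * x for e, x in zip(errs, xs))
        grad_b = (2 / n) * sum(errs)
        # Update rule: w_{t+1} = w_t - eta * gradient.
        m -= eta * grad_m
        b -= eta * grad_b
    return m, b


# Toy data on the line y = 2x + 1; the result should approach (2, 1).
gd_m, gd_b = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
print(round(gd_m, 4), round(gd_b, 4))
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence takes many more iterations.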

Reference: https://www.geeksforgeeks.org/gradient-descent-in-linear-regression/

# Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression that models the relationship between two or more independent variables (predictors) and a single dependent variable (outcome). The goal is to find the best-fit equation that describes how the dependent variable changes with variations in the independent variables.

The formula for multiple linear regression is:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon $$

Where:
- $y$ is the dependent variable (the value we are trying to predict)
- $\beta_0$ is the intercept
- $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients for each independent variable
- $x_1, x_2, \dots, x_n$ are the independent variables (the predictors)
- $\epsilon$ is the error term (residuals)
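
To make the formula concrete, here is a small sketch that evaluates the prediction for one observation (the error term is omitted because it is not known at prediction time). The function name `predict` and the coefficient values are hypothetical.

```python
def predict(betas, features):
    """Evaluate y_hat = beta_0 + beta_1*x_1 + ... + beta_n*x_n."""
    intercept, coefs = betas[0], betas[1:]
    if len(coefs) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefs, features))


# beta_0 = 1, beta_1 = 2, beta_2 = -0.5 applied to x_1 = 3, x_2 = 4.
y_hat = predict([1.0, 2.0, -0.5], [3.0, 4.0])
print(y_hat)  # 5.0
```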


# Performance
- R-squared
- Adjusted R-squared

## R-squared
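R-squared ($R^2$) is a number that tells you how well the independent variable(s) in a statistical model explain the variation in the dependent variable. For a least-squares fit with an intercept, evaluated on the training data, it ranges from 0 to 1, where 1 indicates a perfect fit of the model to the data. On new data it can be negative when the model predicts worse than simply using the mean.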

The formula for R-squared ($R^2$) is:

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

Where:
- $R^2$ is the coefficient of determination
- $y_i$ is the actual value for data point $i$
- $\hat{y}_i$ is the predicted value for data point $i$
- $\bar{y}$ is the mean of the actual values
- $n$ is the total number of data points
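
A short sketch of the formula, assuming plain Python lists as input. The function name `r_squared` and the sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
# Residual sum of squares is 0.5, total sum of squares is 5.0.
r2 = r_squared(y, [1.5, 2.0, 3.0, 3.5])
# Always predicting the mean explains none of the variation.
r2_mean = r_squared(y, [2.5, 2.5, 2.5, 2.5])
print(r2, r2_mean)
```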


## Adjusted R-squared
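The adjusted R-squared is a modified version of R-squared that penalizes the number of predictors in a regression model. Unlike R-squared, which never decreases when a predictor is added to a least-squares model, the adjusted R-squared increases only when the new predictor improves the fit enough to offset the penalty. In other words, the adjusted R-squared shows whether adding predictors improves a regression model or not.

The formula for Adjusted R-squared ($R^2_{adj}$) is: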

$$ R^2_{adj} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - k - 1} \right) $$

Where:
- $R^2_{adj}$ is the Adjusted R-squared
- $R^2$ is the R-squared (coefficient of determination)
- $n$ is the total number of data points
- $k$ is the number of independent variables (predictors)
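
The penalty is easiest to see by holding $R^2$ fixed and varying $k$. The sketch below assumes a hypothetical helper `adjusted_r_squared`; the numbers are made up.

```python
def adjusted_r_squared(r2, n, k):
    """Adjust R-squared for n data points and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more data points than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same R-squared and sample size, different numbers of predictors.
adj_few = adjusted_r_squared(0.90, n=50, k=2)
adj_many = adjusted_r_squared(0.90, n=50, k=10)
print(adj_few, adj_many)
```

With the same $R^2$, the model that needed ten predictors receives a lower adjusted score than the model that needed two.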


Reference for R-squared vs. Adjusted R-squared: https://corporatefinanceinstitute.com/resources/data-science/adjusted-r-squared/#:~:text=Summary,adding%20value%20to%20the%20model.

# Polynomial Regression

Polynomial regression is a type of regression analysis where the relationship between the independent variable(s) and the dependent variable is modeled as a polynomial of a certain degree. Unlike linear regression, which assumes a straight-line relationship, polynomial regression can capture more complex, curved patterns in the data.

The formula for polynomial regression (of degree $d$) is:

$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + \beta_d x^d + \epsilon $$

Where:
- $y$ is the dependent variable (the value we are trying to predict)
- $\beta_0$ is the intercept
- $\beta_1, \beta_2, \dots, \beta_d$ are the coefficients for each degree of the independent variable
- $x$ is the independent variable (the predictor)
- $d$ is the degree of the polynomial
- $\epsilon$ is the error term (residuals)


By increasing the degree $d$, the model can fit more complex, non-linear relationships between the variables, although a degree that is too high tends to overfit the training data.
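
One way to read the formula is that polynomial regression is still linear in the coefficients: the powers $x, x^2, \dots, x^d$ act as separate features. The sketch below shows that expansion and a prediction; the helper names `poly_features` and `poly_predict` and the coefficient values are assumptions for illustration.

```python
def poly_features(x, degree):
    """Expand a single value x into the features [x, x^2, ..., x^degree]."""
    return [x ** p for p in range(1, degree + 1)]


def poly_predict(betas, x):
    """Evaluate beta_0 + beta_1*x + ... + beta_d*x^d using Horner's rule."""
    result = 0.0
    for beta in reversed(betas):
        result = result * x + beta
    return result


print(poly_features(2.0, 3))            # [2.0, 4.0, 8.0]
# y_hat = 1 + 0*x + 2*x^2 evaluated at x = 3.
print(poly_predict([1.0, 0.0, 2.0], 3.0))  # 19.0
```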

# Ridge Regression
Ridge regression is a type of linear regression that adds a regularization term to the cost function to prevent overfitting. It is particularly useful when the model has many features or when multicollinearity exists, which can lead to large and unstable coefficients in standard linear regression.

The cost function for ridge regression includes a penalty term, which is the sum of the squared values of the coefficients:

$$ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 $$

Where:
- $J(\beta)$ is the cost function
- $y_i$ is the actual value for data point $i$
- $\hat{y}_i$ is the predicted value for data point $i$
- $\lambda$ is the regularization parameter (controls the strength of regularization)
- $\beta_j$ is the coefficient of the $j$-th feature
- $n$ is the total number of data points
- $p$ is the total number of features (independent variables)

The regularization term shrinks the coefficients, reducing model complexity and helping prevent overfitting, especially when there are highly correlated variables.
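
The shrinkage effect can be sketched for the simplest case: one feature and no intercept, where setting the derivative of $J(\beta)$ to zero gives $\beta = \sum x_i y_i / (\sum x_i^2 + \lambda)$. The single-feature restriction, the function name `ridge_slope`, and the toy data are assumptions made to keep the example short.

```python
def ridge_slope(xs, ys, lam):
    """Minimize sum((y - beta*x)^2) + lam * beta^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty adds lam to the denominator, pulling beta toward zero.
    return sxy / (sxx + lam)


# Toy data on the line y = 2x (made up for this example).
xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]
for lam in (0.0, 2.0, 6.0):
    print(lam, ridge_slope(xs, ys, lam))  # slopes 2.0, 1.0, 0.5
```

The coefficient gets smaller as $\lambda$ grows, but it never reaches exactly zero for any finite $\lambda$.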

# Lasso Regression
Lasso regression (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that adds an L1 regularization term to the cost function. Like ridge regression, it helps prevent overfitting, but lasso has an additional feature: it can reduce some coefficients to exactly zero, effectively performing feature selection by eliminating irrelevant predictors.

The cost function for Lasso Regression is:

$$ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| $$

Where:
- $J(\beta)$ is the cost function
- $y_i$ is the actual value for data point $i$
- $\hat{y}_i$ is the predicted value for data point $i$
- $\lambda$ is the regularization parameter (controls the strength of regularization)
- $\beta_j$ is the coefficient of the $j$-th feature
- $|\beta_j|$ is the absolute value of the coefficient $\beta_j$ (L1 regularization)
- $n$ is the total number of data points
- $p$ is the total number of features (independent variables)

Lasso’s ability to shrink some coefficients to zero makes it especially useful for sparse models with many irrelevant features.
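
To illustrate how a coefficient can become exactly zero, the sketch below minimizes the lasso cost for one feature with no intercept. In that special case the minimizer is a soft-thresholded version of the least-squares slope. The single-feature restriction, the function name `lasso_slope`, and the toy data are assumptions for illustration.

```python
import math


def lasso_slope(xs, ys, lam):
    """Minimize sum((y - beta*x)^2) + lam * |beta| for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft thresholding: shrink |sxy| by lam / 2 and clip at zero.
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Toy data on the line y = 2x (made up for this example).
xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]
for lam in (0.0, 4.0, 8.0):
    print(lam, lasso_slope(xs, ys, lam))  # slopes 2.0, 1.0, 0.0
```

Once $\lambda$ is large enough, the coefficient is set to exactly zero, which is the feature-selection behavior described above.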

# Elastic Net Regression
Elastic Net regression is a linear regression model that combines the regularization techniques of both ridge regression (L2 regularization) and lasso regression (L1 regularization). It helps address the limitations of both methods by balancing their strengths, making it useful for models with high-dimensional data or multicollinearity. 

The cost function for Elastic Net Regression is:

$$ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right) $$

Where:
- $J(\beta)$ is the cost function
- $y_i$ is the actual value for data point $i$
- $\hat{y}_i$ is the predicted value for data point $i$
- $\lambda$ is the regularization parameter (controls the strength of regularization)
- $\alpha$ controls the balance between L1 (Lasso) and L2 (Ridge) regularization
- $|\beta_j|$ is the absolute value of the coefficient $\beta_j$ (L1 regularization)
- $\beta_j^2$ is the square of the coefficient $\beta_j$ (L2 regularization)
- $n$ is the total number of data points
- $p$ is the total number of features (independent variables)

Elastic Net can perform both feature selection (like lasso) and shrinkage (like ridge), making it particularly powerful for models with many correlated features.
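
The role of $\alpha$ can be sketched by evaluating the cost function directly. The function name `elastic_net_cost` and all values below are hypothetical; `betas` is assumed to hold only $\beta_1, \dots, \beta_p$ (no intercept), matching the sums in the formula.

```python
def elastic_net_cost(y_true, y_pred, betas, lam, alpha):
    """Squared error plus a mix of L1 and L2 penalties on the coefficients."""
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b * b for b in betas)
    return sse + lam * (alpha * l1 + (1 - alpha) * l2)


y_true, y_pred = [1.0, 2.0], [1.0, 3.0]   # squared error = 1
betas = [3.0, -4.0]                        # L1 sum = 7, L2 sum = 25
print(elastic_net_cost(y_true, y_pred, betas, lam=2.0, alpha=1.0))  # 15.0, lasso penalty only
print(elastic_net_cost(y_true, y_pred, betas, lam=2.0, alpha=0.0))  # 51.0, ridge penalty only
print(elastic_net_cost(y_true, y_pred, betas, lam=2.0, alpha=0.5))  # 33.0, equal mix
```

Setting $\alpha = 1$ recovers the lasso cost and $\alpha = 0$ recovers the ridge cost; values in between blend the two penalties.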

# Cross Validation
Cross-validation is a technique used to assess the performance of a machine learning model by training and testing it on different subsets of the data. It estimates how well the model generalizes to unseen data and helps detect issues like overfitting or underfitting. It provides a more reliable estimate of model performance than a single train-test split.

Common cross-validation techniques:
- Leave one out CV
- Leave P out CV
- K fold CV
- Stratified K fold CV
- Time Series CV

## Leave one out CV
In this method, the model is trained on all but one data point and tested on the single remaining point. This is repeated for each data point in the dataset, so for n data points, there are n training/testing cycles.
This method is computationally expensive for large datasets but provides an exhaustive evaluation of the model.
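
The splitting scheme can be sketched as a generator of index lists. The function name `leave_one_out_splits` is hypothetical, and model training is left out so that only the split logic is shown.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, test_indices) with exactly one test point per split."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, [i]


loo = list(leave_one_out_splits(4))
for train, test in loo:
    print(train, test)
```

For 4 data points this produces 4 splits, each holding out a different single index.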

## Leave P out CV
Leave P out (LPO) CV is a generalization of leave one out (LOO) CV: instead of leaving out one data point, P data points are left out for testing, and the model is trained on the remaining n−P data points. This process is repeated for all possible combinations of P data points, of which there are $\binom{n}{P}$. Like LOO CV, LPO CV is computationally intensive and is typically used only for small datasets.
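
A minimal sketch of the split enumeration, using `itertools.combinations` from the standard library. The function name `leave_p_out_splits` is an assumption for illustration.

```python
from itertools import combinations


def leave_p_out_splits(n, p):
    """Yield (train_indices, test_indices) for every way to hold out p points."""
    for test in combinations(range(n), p):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, list(test)


lpo = list(leave_p_out_splits(4, 2))
print(len(lpo))  # 6 splits, one per pair of held-out points
print(lpo[0])    # ([2, 3], [0, 1])
```

The number of splits grows combinatorially with n, which is why this method is rarely practical beyond small datasets.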

## K fold CV
The dataset is divided into k folds of (approximately) equal size. The model is trained on k−1 folds and tested on the remaining fold. This process is repeated k times, with each fold serving as the test set exactly once, and the k scores are typically averaged. It is more computationally efficient than LOO CV and provides a more reliable performance estimate than a single train-test split.
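
The fold construction can be sketched as follows. The function name `k_fold_splits` is hypothetical, and the sketch uses contiguous folds without shuffling; in practice the data is often shuffled first.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n points."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(k_fold_splits(10, 3))
for train, test in folds:
    print(test)  # [0, 1, 2, 3], then [4, 5, 6], then [7, 8, 9]
```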

## Stratified K fold CV
Similar to K-fold CV, but it ensures that each fold has approximately the same distribution of the target variable (e.g., in classification problems, the class proportions in each fold are similar to the original dataset). This is particularly useful when the dataset is imbalanced, ensuring that each fold is representative of the overall data distribution.
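
One simple way to keep class proportions similar is to deal the samples of each class into the folds in turn. This round-robin scheme and the function name `stratified_folds` are assumptions chosen for brevity, not the only way to stratify.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k test folds while balancing each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's samples round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Imbalanced toy labels: six samples of class "a", three of class "b".
labels = ["a"] * 6 + ["b"] * 3
strat = stratified_folds(labels, 3)
print(strat)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

Each fold ends up with two "a" samples and one "b" sample, matching the 2:1 ratio of the full dataset.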

## Time Series CV
In time series data, the order of observations is crucial, so standard K-fold CV, which can train on observations that come after the test set, is not appropriate. Instead, time series CV trains the model on past data and tests it on future data. Two common approaches are the expanding window method, where the training set grows with each iteration, and the rolling window method, where a fixed-size training window slides forward. In both, the test set is always forward-looking in time. This respects the temporal order of the data and avoids data leakage from future to past.
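
An expanding-window split can be sketched as below, where every training set contains only indices that come before its test set. The function name `expanding_window_splits` and its parameters `n_splits` and `test_size` are hypothetical.

```python
def expanding_window_splits(n, n_splits, test_size):
    """Yield (train_indices, test_indices) where training data always precedes test data."""
    first_test_start = n - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough data for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        # The training set is everything observed before the test window.
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


ts = list(expanding_window_splits(10, n_splits=3, test_size=2))
for train, test in ts:
    print(len(train), test)  # 4 [4, 5], then 6 [6, 7], then 8 [8, 9]
```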