# Linear Regression

## Definition

Linear regression is a statistical method used in machine learning and statistics to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. 

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Where:
- **Y** is the dependent variable (the one you want to predict).
- **X** is the independent variable (the one used for prediction).
- **β₀** is the intercept (the value of Y when X is 0).
- **β₁** is the slope (the change in Y for a one-unit change in X).
- **ε** is the error term (the part of Y that the linear relationship does not explain).

Linear regression can be extended to multiple independent variables, creating multiple linear regression, where the equation becomes:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n + \varepsilon$$

In this case, there are **n** independent variables, and each has its own coefficient **β<sub>j</sub>** that represents its contribution to the prediction of the dependent variable **Y** when the other variables are held fixed.
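As a small illustration of the equation above, the prediction for each observation is the intercept plus a dot product between its feature values and the coefficients. The numbers below are made up for illustration, and the error term is left out because it is not known at prediction time.

```python
import numpy as np

# Hypothetical values for a model with n = 3 independent variables.
intercept = 1.0                       # the constant term
coefs = np.array([2.0, -0.5, 0.25])   # one coefficient per variable

# Two observations, one per row; the columns are X_1, X_2, X_3.
X = np.array([
    [1.0, 2.0, 4.0],
    [0.0, 1.0, 8.0],
])

# Prediction = intercept + coef_1 * X_1 + coef_2 * X_2 + coef_3 * X_3, for every row.
y_hat = intercept + X @ coefs
print(y_hat)  # [3.  2.5]
```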

## Assumptions

Linear regression relies on several fundamental assumptions to be valid and produce reliable results. These assumptions are crucial for the proper interpretation of regression analyses:

1. **Linearity**: The relationship between the dependent variable (**Y**) and the independent variables (**X<sub>1</sub>**, **X<sub>2</sub>**, ..., **X<sub>n</sub>**) should be linear. This means that changes in the independent variables should result in proportional changes in the dependent variable. You can assess linearity through scatterplots and residual plots.

2. **Independence of Errors**: The errors (residuals) from the regression model should be independent of each other. In other words, the value of the residual for one observation should not depend on the values of residuals for other observations. Time-series data often violate this assumption because consecutive residuals tend to be correlated (autocorrelation).

3. **Homoscedasticity**: The variance of the residuals should be constant across all levels of the independent variables. This implies that the spread of residuals should not change as you move along the predictor values. Heteroscedasticity, where the spread of residuals varies, can lead to biased standard errors and affect the validity of hypothesis tests.

4. **Normality of Residuals**: The residuals should follow a normal distribution. This assumption matters most for small samples; for large samples, the Central Limit Theorem makes the coefficient estimates approximately normal even when the residuals are not. With small samples, deviations from normality can affect the accuracy of confidence intervals and p-values.
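The checks above are usually done with plots, but a few of them can be roughly sketched with numbers. The example below is only an illustration on synthetic data that satisfies the assumptions by construction. The Durbin-Watson statistic, the variance ratio, and the skewness and kurtosis summaries are common choices but not the only ones, and the variable names are invented here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: linear signal plus independent normal noise.
n = 500
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=n)

# Fit by least squares; the column of ones carries the intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Independence: the Durbin-Watson statistic is near 2 when consecutive
# residuals (in observation order) are uncorrelated.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Homoscedasticity: compare residual variance for low and high predictor values.
low = residuals[x < np.median(x)]
high = residuals[x >= np.median(x)]
variance_ratio = high.var() / low.var()

# Normality: skewness and excess kurtosis are both near 0 for normal residuals.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

print(durbin_watson, variance_ratio, skewness, excess_kurtosis)
```

Values far from 2, 1, 0 and 0 respectively would suggest looking more closely at the corresponding assumption, ideally with a residual plot.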

## Algorithm

### Ordinary Least Squares (OLS) in Linear Regression

In linear regression, Ordinary Least Squares (OLS) is a widely used method to estimate the parameters of a linear model. The primary goal of OLS is to find the best-fitting line through a set of data points by minimizing the sum of squared differences between observed values and predicted values.

OLS aims to find the values of **β₀** and **β₁** that minimize the sum of squared residuals (SSR), given by:

$$SSR = \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2$$

In this equation:
- **Y<sub>i</sub>** is the observed value of the dependent variable for the i-th data point.
- **X<sub>i</sub>** is the corresponding value of the independent variable.
- **n** is the number of data points.
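To make the SSR concrete, the sketch below evaluates it for two candidate lines on a tiny made-up data set. The helper name `ssr` and the data are illustrative assumptions.

```python
import numpy as np

def ssr(beta0, beta1, x, y):
    """Sum of squared residuals for the line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# Made-up data lying close to the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

good_fit = ssr(1.0, 2.0, x, y)   # line close to the data
poor_fit = ssr(0.0, 1.0, x, y)   # line that is too flat and too low
print(good_fit, poor_fit)        # roughly 0.1 and 29.5
```

OLS picks the pair of coefficients for which this quantity is as small as possible.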

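To minimize the SSR for any number of independent variables, it helps to write the model in matrix form, where **Y** is the vector of observed values, **X** is the matrix of feature values (with a column of ones for the intercept), and **β** is the vector of coefficients:

$$Y = X\beta + \varepsilon $$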

The goal is to estimate the coefficients **β** that minimize the sum of squared residuals:

$$SSR = \varepsilon^T \varepsilon = (Y - X\beta)^T (Y - X\beta)$$

To find the OLS solution, we take the derivative of SSR with respect to **β** and set it equal to zero:

$$ \frac{\partial SSR}{\partial \beta} = -2X^T(Y - X\beta) = 0$$

Solving for **β** gives the OLS estimator:

$$\beta = (X^TX)^{-1}X^TY$$

This closed-form expression is the optimal OLS solution for linear regression, provided that **X<sup>T</sup>X** is invertible.
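The derivation above can be sketched in a few lines of NumPy. The function name `ols_fit` and the synthetic data are assumptions made for illustration. The sketch solves the linear system rather than forming the inverse explicitly, which gives the same estimator with better numerical behaviour.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
true_beta = np.array([0.5, 2.0, -1.0])   # intercept first, then two slopes

# Prepend a column of ones so the first coefficient acts as the intercept.
X = np.column_stack([np.ones(n), features])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta_hat = ols_fit(X, y)

# Cross-check against NumPy's general least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With little noise and enough observations, `beta_hat` should land close to `true_beta`.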

### With Regularization

Regularized linear regression refers to a group of linear regression models that include a regularization term in their cost function to prevent overfitting and handle issues like multicollinearity. This regularization term penalizes the magnitude of the coefficients, thereby constraining them. The cost function for Ridge Regression is given by:

$$ J(\beta) = ||Y - X\beta||^2_2 + \lambda ||\beta||^2_2 $$

where:
- **Y** is the vector of observed values.
- **X** is the matrix of feature values.
- **β** is the vector of coefficients.
- **λ** is the regularization parameter.

Expanding the cost function gives:
   
   $$ J(\beta) = (Y - X\beta)^T(Y - X\beta) + \lambda \beta^T\beta $$

The gradient of J(**β**) with respect to **β** is:

   $$ \nabla_\beta J(\beta) = -2X^T(Y - X\beta) + 2\lambda\beta $$

To find the minimum, set the gradient to zero:

   $$ -2X^T(Y - X\beta) + 2\lambda\beta = 0 $$

Rearranging the terms to solve for **β**:

   $$ X^T(Y - X\beta) = \lambda\beta $$

   $$ X^TY - X^TX\beta = \lambda\beta $$

   $$ X^TY = (X^TX + \lambda I)\beta $$

Finally:

   $$\beta = (X^TX + \lambda I)^{-1}X^TY $$

   where **I** is the identity matrix of appropriate size.
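The closed-form Ridge solution above translates almost directly into NumPy. This is a minimal sketch on synthetic data; the name `ridge_fit` and the values of `lam` are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=50)

beta_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(X, y, lam=10.0)   # a larger lam shrinks the coefficients

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

Note that if **X** contains a column of ones, this formula as written penalizes the intercept as well; many implementations leave the intercept unpenalized.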

Lasso regression replaces the squared L2 penalty with an L1 penalty (λ times the sum of the absolute values of the coefficients). Unlike Ridge regression, Lasso **does not have a closed-form solution due to the absolute value in the regularization term**: the absolute value is not differentiable at zero, so the gradient cannot simply be set to zero and solved for **β**. Lasso is instead fitted with iterative methods, which makes it computationally more intensive than Ridge, especially as the number of features grows. However, its ability to **shrink some coefficients to exactly zero**, thereby performing feature selection, can be very beneficial in models with a large number of features.
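One way to see how an iterative Lasso solver produces exact zeros is proximal gradient descent (ISTA), sketched below. This is an illustrative choice and not necessarily what any particular library does; scikit-learn's `Lasso`, for example, uses coordinate descent and scales the squared-error term by 1/(2n), so its `alpha` is not directly comparable to `lam` here. The objective assumed in this sketch is 0.5 * ||Y - Xβ||² + λ||β||₁, and the function names are invented for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries smaller than t become zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient descent."""
    # Step size 1/L, where L is the largest eigenvalue of X^T X.
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (X @ beta - y)   # gradient of the smooth squared-error part
        beta = soft_threshold(beta - step * gradient, step * lam)
    return beta

rng = np.random.default_rng(0)
# With orthonormal columns the exact Lasso solution is soft_threshold(X^T y, lam),
# which gives a known answer to compare against.
X, _ = np.linalg.qr(rng.normal(size=(100, 5)))
true_beta = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

beta_lasso = lasso_ista(X, y, lam=0.5)
print(beta_lasso)
```

The soft-thresholding step is what sets small coefficients to exactly zero, which is the feature-selection behaviour described above.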

## Pros, Cons and Use cases

### Pros of Linear Regression

1. **Simplicity**: Linear regression is straightforward to understand and explain, making it a good starting point for predictive modeling.
2. **Efficient Computation**: It requires fewer computational resources than more complex algorithms.
3. **Interpretable Results**: The output of a linear regression model can be easily interpreted in terms of relationship strength and direction between variables.
4. **Basis for Other Methods**: Serves as a foundation for understanding more complex models in machine learning.
5. **Less Prone to Overfitting**: Because it has few parameters, a linear regression model with a small number of variables is less likely to fit noise in the data than more flexible models.

### Cons of Linear Regression

1. **Assumption of Linearity**: Linear regression assumes a linear relationship between the dependent and independent variables, which is not always the case in real-world data.
2. **Sensitive to Outliers**: Outliers can significantly affect the regression line and hence the predicted values.
3. **Multicollinearity**: The presence of high correlation between independent variables can distort the estimated coefficients and make them unreliable.
4. **Limited to Continuous Variables**: Linear regression is designed for a continuous numerical outcome, and categorical predictors must be encoded numerically (for example, as dummy variables) before they can be used.
5. **Can’t Model Complex Relationships**: It cannot capture non-linear relationships without transformation of variables.
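The last point can be illustrated with a short sketch: a straight line in x fits a quadratic relationship poorly, while adding the transformed feature x² keeps the model linear in its coefficients and fits well. The data and the helper name `fit_ssr` are made up for this example.

```python
import numpy as np

def fit_ssr(X, y):
    """Fit least squares on X and return the sum of squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return float(residuals @ residuals)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x ** 2 + rng.normal(scale=0.2, size=x.size)   # a clearly non-linear relationship

ones = np.ones_like(x)
ssr_linear = fit_ssr(np.column_stack([ones, x]), y)             # straight line in x
ssr_quadratic = fit_ssr(np.column_stack([ones, x, x ** 2]), y)  # adds the feature x^2

print(ssr_linear, ssr_quadratic)
```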


### When to use linear regression

**Emphasis on Inference**
- **Primary Goal**: Inference is the main objective. Linear regression is often superior for inferential purposes compared to other machine learning models.
- **Insights and Estimates**: Provides detailed estimates on how features influence the outcome variable, complete with confidence intervals and statistical tests for a thorough understanding.
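As a rough sketch of what these inferential outputs look like, the example below computes standard errors and approximate 95% confidence intervals for an OLS fit on synthetic data. It uses the normal quantile 1.96, which is an approximation suited to large samples; statistical packages typically use a t quantile instead. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])               # intercept column plus one feature
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

# Estimate of the error variance: SSR divided by (n - number of coefficients).
sigma2_hat = residuals @ residuals / (n - X.shape[1])

# Standard errors: square roots of the diagonal of sigma^2 * (X^T X)^{-1}.
std_errors = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

# Approximate 95% confidence intervals.
lower = beta_hat - 1.96 * std_errors
upper = beta_hat + 1.96 * std_errors
for name, b, lo, hi in zip(["intercept", "slope"], beta_hat, lower, upper):
    print(f"{name}: {b:.3f}  [{lo:.3f}, {hi:.3f}]")
```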

**Ideal as a Baseline Model**
- **Simplicity and Comparison**: Serves as an uncomplicated baseline for comparing more complex models.
- **Advantages in Clean Data**: Particularly effective with datasets having minimal missing values or outliers.
- **No Hyperparameter Tuning**: Standard (unregularized) linear regression has no hyperparameters to tune, which simplifies the model development process.

**Building Stakeholder Trust**
- **Familiarity and Credibility**: Linear regression's well-established nature makes it a trustworthy choice among stakeholders initially skeptical of complex machine learning models.
- **Step towards Advanced Modeling**: Once the linear regression model is accepted, it sets the stage for introducing and comparing more advanced models, demonstrating additional business value.


### When not to use linear regression

**Impact of Small Predictive Improvements**
- **Business Impact**: In scenarios where minor improvements in predictive accuracy can significantly affect business outcomes, exploring models beyond linear regression is advisable.
- **Alternative Models for Better Performance**: Models like gradient boosted trees often outperform linear regression, especially when relationships between features and outcome variables aren't perfectly linear.

**Time Constraints in Data Exploration**
- **Challenges with Linear Regression**: Linear regression can be adversely affected by issues like missing data, outliers, and correlated features.
- **Suitable Alternatives**: In situations with limited time for data cleaning and preprocessing, tree-based models such as random forests are preferable due to their resilience to these data issues.

**Situations with More Features Than Observations**
- **Inappropriateness of Standard Linear Regression**: When the number of features exceeds the number of observations, **X<sup>T</sup>X** is not invertible, so standard linear regression has no unique solution.
- **Solutions**: Opt for feature reduction strategies or models capable of handling high feature-to-observation ratios, such as ridge regression.

**Handling Many Correlated Features**
- **Limitation in Standard Regression**: Standard regression models struggle with multiple correlated features.
- **Better Option**: Ridge regression, a regularized version of linear regression, effectively manages correlated features and offers a more robust solution in such cases.
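Both of the last two situations come down to **X<sup>T</sup>X** being singular or nearly so. The sketch below, on made-up data with an illustrative `lam`, shows how adding λ**I** improves the conditioning when features are almost collinear and restores full rank when there are more features than observations.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0   # illustrative regularization strength

# Case 1: two almost identical features.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)
X = np.column_stack([x1, x2])
gram = X.T @ X

# A huge condition number means small changes in y cause large swings in the solution.
cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))
print(f"{cond_ols:.2e}  {cond_ridge:.2e}")

# Case 2: more features (5) than observations (3).
X_wide = rng.normal(size=(3, 5))
# X^T X has the same rank as X, at most 3 here, so the 5 x 5 matrix X^T X is singular.
rank_wide = np.linalg.matrix_rank(X_wide)
rank_ridge = np.linalg.matrix_rank(X_wide.T @ X_wide + lam * np.eye(5))
print(rank_wide, rank_ridge)
```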



## Code

In [31]:
from datasets import load_dataset
import numpy as np
import pandas as pd
import sklearn
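# Local helper module; import only the two functions used below instead of `*`.
from linear_regression_numpy import linear_regression_noreg, regularized_linear_regression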

In [32]:
dataset = load_dataset("lvwerra/red-wine", split='train')
df = dataset.to_pandas()
df.head()



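| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |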


In [49]:
X = np.array(df.iloc[:,:-1])  # feature matrix: every column except the last
y = np.array(df.iloc[:,-1])   # target vector: the last column (quality)
print(X.shape)
print(y.shape)

(1599, 11)
(1599,)


### NumPy Version

In [50]:
w1 = linear_regression_noreg(X,y)
w1

array([ 4.19374044e-03, -1.09974310e+00, -1.84145975e-01,  7.07117376e-03,
       -1.91141882e+00,  4.54780884e-03, -3.31855188e-03,  4.52914616e+00,
       -5.22898302e-01,  8.87076125e-01,  2.97022815e-01])

In [51]:
w1_reg = regularized_linear_regression(X,y,0.1)
w1_reg

array([ 1.17661420e-02, -1.10271896e+00, -1.92798798e-01,  6.86766942e-03,
       -1.76428188e+00,  4.42122417e-03, -3.19166628e-03,  4.10710674e+00,
       -4.24248520e-01,  8.80496709e-01,  2.99207408e-01])

In [52]:
np.matmul(X[0],w1)

5.039162131900398

In [53]:
np.matmul(X[0],w1_reg)

5.048832729962756

In [56]:
y_hat = np.matmul(X[0:2],w1)
y_hat

array([5.03916213, 5.14276918])

In [59]:
np.square(np.linalg.norm(np.matmul(X[0:2], w1) - y[0:2])) / 2  # mean squared error over the first 2 samples (sum of squared errors divided by n = 2)

0.010958355335190993

### scikit-learn Version

In [81]:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X,y)
model.score(X,y)

0.3605517030386882
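For a regressor such as `LinearRegression`, `score` returns the coefficient of determination R². A minimal sketch of that quantity is shown below on a tiny made-up example; the function name `r2_score_manual` is an assumption for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)            # squared error of the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Tiny made-up example, unrelated to the wine data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.5])
print(r2_score_manual(y_true, y_pred))  # approximately 0.95
```

Applied to the arrays in this notebook, `r2_score_manual(y, model.predict(X))` should agree with `model.score(X, y)`.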

In [82]:
model.coef_

array([ 2.49905527e-02, -1.08359026e+00, -1.82563948e-01,  1.63312698e-02,
       -1.87422516e+00,  4.36133331e-03, -3.26457970e-03, -1.78811638e+01,
       -4.13653144e-01,  9.16334413e-01,  2.76197699e-01])

In [83]:
model.intercept_

21.965208449448745
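The NumPy weight vector `w1` has 11 entries, one per feature column, and the NumPy predictions are computed as `X[0]` times `w1` with no constant added, so that fit appears to have no intercept, whereas `LinearRegression` estimates one by default. That likely explains much of the gap between the two coefficient vectors. The sketch below uses synthetic data (not the wine data) to show how a column of ones lets a plain least-squares solver recover an intercept; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(loc=5.0, size=(n, 2))   # synthetic features, not the wine data
y_demo = 10.0 + X_demo @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=n)

# Without an intercept column, the slopes must absorb the constant offset.
w_no_intercept, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Appending a column of ones lets least squares estimate the intercept as an extra weight.
X_aug = np.column_stack([X_demo, np.ones(n)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
slopes, intercept = w_aug[:-1], w_aug[-1]

print(w_no_intercept)
print(slopes, intercept)
```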

In [84]:
model.predict(X[0].reshape(1,-1))

array([5.03285045])

In [85]:
model.predict(X[0:2])

array([5.03285045, 5.13787975])

In [86]:
model_ridge = linear_model.Ridge(alpha=.5)
model_ridge.fit(X,y)
model_ridge.score(X,y)

0.3600080990560508

In [87]:
model_ridge.coef_

array([ 0.01137014, -1.10532042, -0.19550803,  0.00817498, -1.57644611,
        0.00448973, -0.00325348, -0.03720202, -0.46607059,  0.84996549,
        0.29601556])

In [88]:
model_ridge.predict(X[0:2])

array([5.04078482, 5.13782832])

In [98]:
model_lasso = linear_model.Lasso(alpha=.1)
model_lasso.fit(X,y)
model_lasso.score(X,y)

0.23937236014517005

In [99]:
model_lasso.coef_

array([ 0.031408  , -0.        ,  0.        ,  0.        , -0.        ,
        0.00571672, -0.00377281, -0.        , -0.        ,  0.        ,
        0.25583985])

In [100]:
model_lasso.predict(X[0:2])

array([5.36458877, 5.4350192 ])