In [20]:
# Least Absolute Shrinkage and Selection Operator (LASSO)

Regression is a statistical process that seeks to estimate the relationships among variables.

Regression models aim to construct a best-fit line or curve, known as a regression line, through data points in a manner that minimizes the overall distance between the data points and the line itself. This ‘distance’ is often referred to as an error or residual.


Most regression models are fit by minimizing a loss built from these residuals, most commonly the sum of squared residuals, so that the predictions land as close as possible to the observed values.


In the context of data science, regression models are used to forecast outcomes, test hypotheses, or determine relationships among variables.

Regression models yield continuous or numerical outputs.

Regression models are evaluated with metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. The three error metrics measure the difference between the actual and predicted numerical values; R-squared measures the share of the target's variance that the model explains.
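As a rough illustration of how these metrics are computed, the sketch below works them out by hand with NumPy. The arrays `y_true` and `y_hat` are made-up toy values, not taken from the student dataset.

```python
import numpy as np

# Toy values chosen only for illustration (not from the dataset)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

residuals = y_true - y_hat              # actual minus predicted

mae = np.mean(np.abs(residuals))        # mean absolute error
mse = np.mean(residuals ** 2)           # mean squared error
rmse = np.sqrt(mse)                     # same units as the target

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
print('R-squared:', r2)
```

The `sklearn.metrics` functions used later in this notebook (`mean_absolute_error`, `mean_squared_error`) should give the same MAE and MSE for the same inputs.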

Regression models predict a numerical value based on input features, and therefore do not require a decision boundary.



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [3]:
df = pd.read_csv('student-mat.csv')
print(df.head(4))

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...   
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...   
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...   
3     GP   F   15       U     GT3       T     4     2   health  services  ...   

  famrel freetime  goout  Dalc  Walc health absences  G1  G2  G3  
0      4        3      4     1     1      3        6   5   6   6  
1      5        3      3     1     1      3        4   5   5   6  
2      4        3      2     2     3      3       10   7   8  10  
3      3        2      2     1     1      5        2  15  14  15  

[4 rows x 33 columns]


In [7]:
# Predictor: the G1 grade; target: the number of absences
X = df['G1'].values.reshape(-1,1)
y = df['absences'].values.reshape(-1,1)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Lasso regression (LASSO stands for Least Absolute Shrinkage and Selection Operator) is a type of linear regression that uses a technique called regularization to improve the model’s predictive accuracy and interpretability.

Lasso regression starts by calculating the sum of squared residuals. It then adds a penalty term, the regularization parameter lambda (λ) multiplied by the sum of the absolute values of the coefficients, to discourage the coefficients of the independent variables from getting too large.

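$$\sum_{i}\Bigl(y_i - \beta_0 - \sum_{j}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j}\lvert\beta_j\rvert$$

where:

$y_i$ is the ith value of the variable we want to predict.
$\beta_0$ is the y-intercept.
$\beta_j$ is the coefficient for the jth predictor variable $x_{ij}$.
$\lambda$ is the regularization parameter.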
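The penalized objective described here (sum of squared residuals plus λ times the sum of absolute coefficients) can be written out directly. The following is a minimal sketch; the function name `lasso_objective` and the toy arrays are hypothetical and exist only for this illustration.

```python
import numpy as np

def lasso_objective(X, y, beta0, beta, lam):
    """Sum of squared residuals plus lam times the L1 norm of beta.

    The intercept beta0 is not penalized.
    """
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(beta))

# Toy data (made up): y equals 1*x1 + 2*x2 exactly
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, 2.0, 3.0])

# Perfect fit: zero residuals, so only the penalty 0.5 * (1 + 2) remains
fit_cost = lasso_objective(X_toy, y_toy, 0.0, np.array([1.0, 2.0]), lam=0.5)

# All-zero coefficients: no penalty, but the residuals are large
zero_cost = lasso_objective(X_toy, y_toy, 0.0, np.zeros(2), lam=0.5)

print(fit_cost, zero_cost)
```

Lasso looks for the coefficients that balance these two parts: a close fit raises the penalty, while small coefficients raise the residual term.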

Lasso regression addresses overfitting through its regularization term. By adding a penalty for large coefficients, Lasso regression discourages the model from relying too heavily on any one feature, promoting a more generalized model. Because the penalty uses absolute values, it can also shrink some coefficients exactly to zero.
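To see the effect of the penalty, the sketch below fits scikit-learn's `Lasso` for several `alpha` values on synthetic data. The data, the true coefficients (3 and -2), and the list of alphas are assumptions made up for illustration; none of it comes from the student dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features affect the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

l1_norms = []
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_syn, y_syn)
    # Total absolute size of the fitted coefficients
    l1_norms.append(np.sum(np.abs(model.coef_)))
    print('alpha =', alpha, 'coefficients:', np.round(model.coef_, 3))
```

As `alpha` grows, the total size of the coefficients should shrink, and a large enough `alpha` drives every coefficient to zero.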

In [9]:
from sklearn.linear_model import Lasso

In [13]:
# alpha is scikit-learn's name for the regularization parameter (lambda)
lasso = Lasso(alpha=0.5)
lasso.fit(X_train, y_train)

In [14]:
y_pred = lasso.predict(X_test)

In [15]:
print(lasso.coef_)

[-0.06560237]


The coefficient represents the change in the predicted number of absences for each one-unit change in the corresponding feature (here G1), taking into account the penalty term that we added. A coefficient of zero means that the corresponding feature was not selected by the model.
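With a single feature, as in this notebook, the effect of the penalty on the coefficient can be worked out in closed form. The sketch below assumes the objective that scikit-learn documents for `Lasso`, (1/(2n))·‖y − Xw‖² + alpha·‖w‖₁, with an unpenalized intercept. The function `lasso_one_feature` and the toy data are hypothetical names and values used only for illustration.

```python
import numpy as np

def lasso_one_feature(x, y, alpha):
    """Closed-form Lasso fit for one feature via soft-thresholding."""
    xc = x - x.mean()            # centering accounts for the intercept
    yc = y - y.mean()
    n = len(x)
    rho = xc @ yc / n            # scaled covariance between x and y
    z = xc @ xc / n              # scaled variance of x
    # Soft-threshold: move rho toward zero by alpha, stopping at zero
    coef = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    intercept = y.mean() - coef * x.mean()
    return coef, intercept

# Toy data (made up): y = 2 * x exactly
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

for alpha in [0.0, 1.0, 5.0]:
    coef, intercept = lasso_one_feature(x_toy, y_toy, alpha)
    print('alpha =', alpha, 'coef =', coef, 'intercept =', intercept)
```

With no penalty the slope is the ordinary least-squares value; a moderate `alpha` shrinks it, and once `alpha` exceeds the scaled covariance the coefficient becomes exactly zero, which is what "not selected" means.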

In [16]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

Mean Absolute Error: 5.759676817942001


In [17]:
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

Mean Squared Error: 117.58407159276702


In [19]:
import numpy as np
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Root Mean Squared Error: 10.843618934321098


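These metrics summarize the model’s errors in different ways. MAE is the average absolute error, here about 5.8 absences. MSE squares each error, so large misses dominate it. RMSE brings that back to the units of the target; an RMSE (about 10.8) well above the MAE suggests that a few large errors account for much of the total.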

Advantages:
1. Feature selection
2. Reduced overfitting
3. Handling multicollinearity

Disadvantages:
1. Selection of the regularization parameter
2. Limitations in feature selection
3. Difficulty handling complex relationships
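
Choosing the regularization parameter is commonly handled with cross-validation. The sketch below uses scikit-learn's `LassoCV` on synthetic data; the data and the choice of 5 folds are assumptions for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only the first two of five features affect the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

# LassoCV tries a grid of alpha values and keeps the one with the
# lowest cross-validated error
cv_model = LassoCV(cv=5, random_state=0).fit(X_syn, y_syn)

print('Selected alpha:', cv_model.alpha_)
print('Coefficients:', np.round(cv_model.coef_, 3))
```

The same pattern could be applied to `X_train` and `y_train` from this notebook in place of the fixed `alpha=0.5`, although the selected value and its effect on the metrics would need to be checked on the actual data.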