<a href="https://colab.research.google.com/github/maktaurus/ML-Work/blob/main/Classic_Algorithams/Ridge_and_Lasso_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ridge and Lasso Regression

**Linear regression:** A linear model that predicts a continuous output variable (y) from one or more input features (x).

**Types:**

1) **Simple Linear Regression:** one input feature.

2) **Multiple Linear Regression:** multiple input features.
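As a minimal sketch of the two types, the snippet below fits both with ordinary least squares using NumPy. The toy data and its coefficients are assumptions chosen for illustration; they are not taken from the Hitters dataset.

```python
import numpy as np

# Simple linear regression: one feature, noise-free toy data y = 1 + 2*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y_simple = 1.0 + 2.0 * x
X_simple = np.column_stack([np.ones_like(x), x])  # intercept column + feature
beta_simple, *_ = np.linalg.lstsq(X_simple, y_simple, rcond=None)

# Multiple linear regression: two features, y = 1 + 2*x1 - 3*x2.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y_multi = 1.0 + 2.0 * x1 - 3.0 * x2
X_multi = np.column_stack([np.ones_like(x1), x1, x2])
beta_multi, *_ = np.linalg.lstsq(X_multi, y_multi, rcond=None)

print(beta_simple)  # intercept and slope
print(beta_multi)   # intercept and one coefficient per feature
```

Because the toy data is noise-free, least squares recovers the generating coefficients up to floating-point error.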


**Polynomial regression:** A type of regression analysis in which the relationship between the independent variable (x) and the dependent variable (y) is modeled as a polynomial in x.
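A short sketch of polynomial regression with scikit-learn: the input is expanded into polynomial terms and an ordinary linear model is fitted on the expanded features. The quadratic toy data is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Noise-free toy data generated from y = 1 + 2x + 3x^2 (illustrative assumption).
x = np.linspace(-2.0, 2.0, 9).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2

# Expand x into the columns [x, x^2]; the model stays linear in its coefficients.
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)
```

The fitted intercept and coefficients should be close to the values used to generate the data.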

**Ridge and Lasso Regression**: Both are regularization methods used to prevent overfitting in linear regression models.

**Ridge Regression (L2 Regularization)** adds a penalty proportional to the sum of the squared coefficients to the least squares loss function, which shrinks the coefficients toward zero.
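To illustrate the shrinkage effect, here is a minimal NumPy sketch of the closed-form ridge solution without an intercept. The helper `ridge_coefficients` and the synthetic data are hypothetical and exist only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# The L2 norm of the coefficient vector decreases as the penalty strength grows.
alphas = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_coefficients(X, y, a)) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:>6}: ||coef|| = {n:.4f}")
```

With `alpha=0.0` the solution reduces to ordinary least squares; larger values pull every coefficient closer to zero without making any of them exactly zero.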

**Lasso Regression (L1 Regularization)** adds a penalty proportional to the sum of the absolute values of the coefficients to the least squares loss function, which can shrink some coefficients exactly to zero.
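The difference between the two penalties can be sketched on synthetic data in which only the first two of five features influence the target. The data, the noise level, and `alpha=0.5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features carry signal; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=0.5).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Lasso coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
```

In this toy setup, Ridge keeps small nonzero weights on the noise features, while Lasso is expected to drop them entirely, which is why Lasso is often used as a feature-selection step.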

### Hitters Dataset scikit-learn Example

In [18]:
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge, Lasso, LinearRegression,RidgeCV, LassoCV
from sklearn.preprocessing import LabelEncoder, StandardScaler,OrdinalEncoder
from sklearn.model_selection import train_test_split,RepeatedKFold
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import math

Load the dataset.

In [19]:
data = pd.read_csv("/content/Hitters.csv")
data.head()

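| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |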


In [20]:
len(data)

322

In [21]:
data.isnull().sum()

| Column | Null count |
|---|---|
| AtBat | 0 |
| Hits | 0 |
| HmRun | 0 |
| Runs | 0 |
| RBI | 0 |
| Walks | 0 |
| Years | 0 |
| CAtBat | 0 |
| CHits | 0 |
| CHmRun | 0 |


Identify the categorical columns and encode them.

In [22]:
cat = []

for col in data.columns:
  if data[col].dtype == "object":
    cat.append(col)
cat

['League', 'Division', 'NewLeague']

In [24]:
le = OrdinalEncoder()
data[cat] = le.fit_transform(data[cat])
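As a small sketch of what `OrdinalEncoder` does here, the toy frame below (a hypothetical stand-in for the `League` and `Division` columns) shows that each column's categories are sorted and mapped to 0.0, 1.0, and so on.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame mimicking two of the categorical columns.
toy = pd.DataFrame({"League": ["A", "N", "A"], "Division": ["E", "W", "W"]})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ holds the sorted categories per column; the position is the code.
print(encoder.categories_)
print(encoded)
```

For two-level columns like these, the encoding is equivalent to a 0/1 indicator. For columns with more than two unordered levels, an ordinal code imposes an artificial order, and one-hot encoding is usually the safer choice for linear models.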

In [25]:
data.head()

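| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | 0.0 | 0.0 | 446 | 33 | 20 | NaN | 0.0 |
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | 1.0 | 1.0 | 632 | 43 | 10 | 475.0 | 1.0 |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | 0.0 | 1.0 | 880 | 82 | 14 | 480.0 | 0.0 |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | 1.0 | 0.0 | 200 | 11 | 3 | 500.0 | 1.0 |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | 1.0 | 0.0 | 805 | 40 | 4 | 91.5 | 1.0 |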


Extract the rows whose target column (`Salary`) is null into a separate test set; these rows have no known salary and can be used for prediction.

In [26]:
test = data[data["Salary"].isnull()]
len(test)

59

In [27]:
data = data.dropna()
len(data)

263
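The split into unlabeled and labeled rows can be sketched on a tiny frame. The values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with two missing targets.
toy = pd.DataFrame({
    "Hits": [66, 81, 130, 141],
    "Salary": [np.nan, 475.0, 480.0, np.nan],
})

unlabeled = toy[toy["Salary"].isnull()]   # rows to predict later
labeled = toy.dropna(subset=["Salary"])   # rows usable for training

print(len(unlabeled), len(labeled))
```

Passing `subset=["Salary"]` drops rows based on the target only, whereas a bare `dropna()` would also drop rows with missing feature values. In the notebook above the two row counts add up to the original length (59 + 263 = 322), so every row lands in exactly one of the two groups.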

Split the data into training and validation sets.

In [28]:
x = data.drop("Salary", axis=1)
y = data["Salary"]

In [29]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
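As a quick check of what `test_size=0.2` means for 263 rows, the sketch below runs the same split on placeholder arrays of that length. The array contents are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned dataset.
X_demo = np.arange(263 * 2).reshape(263, 2)
y_demo = np.arange(263)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_val))
```

The held-out part is rounded up to a whole number of rows, which is consistent with the prediction tables in this notebook ending at index 52.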

In [30]:
# Standardize the features: fit the scaler on the training set only,
# then apply the same transformation to the validation set.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
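A minimal sketch of why the scaler is fitted on the training split only: the validation data is transformed with the training mean and standard deviation, so no information from the held-out rows leaks into preprocessing. The numbers are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
held_out = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train
held_out_scaled = scaler.transform(held_out) # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), held_out_scaled.ravel())
```

After scaling, the training column has mean 0 and unit variance, while the held-out value is expressed in training standard deviations. Standardizing matters for Ridge and Lasso because the penalty treats all coefficients on the same scale.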

Fit a linear regression model and compute the RMSE (the square root of the MSE) on the validation set.

In [50]:
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

In [51]:
lin_reg.coef_, lin_reg.intercept_

(array([-212.69292957,  325.15526197,   41.93129801,  -70.81014586,
         -36.58794806,  116.93621463,   38.0921268 , -535.08451309,
          72.26573651,  -40.67604204,  574.50034643,  311.82850479,
        -216.66359515,   53.98051636,  -56.91160847,   74.87906346,
          31.80249718,   -4.12966949,  -31.83928985]),
 543.6646238095238)

In [52]:
lin_pred = lin_reg.predict(x_test)
lin_pred = pd.DataFrame(lin_pred,columns=["Pred"])
lin_pred.head()

| | Pred |
|---|---|
| 0 | 597.241698 |
| 1 | 683.906734 |
| 2 | 899.763674 |
| 3 | 411.994225 |
| 4 | 340.34366 |


In [53]:
math.sqrt(mean_squared_error(y_test,lin_pred))

358.168040864513
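The value above is a root mean squared error, expressed in the same units as `Salary`. A small sketch with made-up numbers shows that taking the square root of `mean_squared_error` matches the manual formula.

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values, not taken from the Hitters data.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

rmse = math.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = math.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse, rmse_manual)
```

Because errors are squared before averaging, a single large miss (30 here) dominates the result more than several small ones.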

In [54]:
lin_combi = pd.concat([y_test.reset_index(drop=True), lin_pred], axis=1)
lin_combi.tail(10)

| | Salary | Pred |
|---|---|---|
| 43 | 490.0 | 336.780785 |
| 44 | 900.0 | 750.513335 |
| 45 | 700.0 | 714.230077 |
| 46 | 400.0 | 301.669344 |
| 47 | 115.0 | 253.560337 |
| 48 | 155.0 | 241.325676 |
| 49 | 625.0 | 642.893367 |
| 50 | 525.0 | 759.013414 |
| 51 | 250.0 | 308.171127 |
| 52 | 775.0 | 777.458692 |
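The `reset_index(drop=True)` call in the comparison above matters because `pd.concat` with `axis=1` aligns rows by index label. A small sketch with hypothetical values shows what happens without it.

```python
import pandas as pd

# y_test keeps the shuffled row labels from the original frame (hypothetical labels).
y_test_demo = pd.Series([490.0, 900.0], index=[17, 5], name="Salary")
# Predictions come back as a plain array, so the new frame is labeled 0..n-1.
pred_demo = pd.DataFrame({"Pred": [336.8, 750.5]})

misaligned = pd.concat([y_test_demo, pred_demo], axis=1)
aligned = pd.concat([y_test_demo.reset_index(drop=True), pred_demo], axis=1)

print(misaligned)
print(aligned)
```

Without the reset, the result contains one row per distinct label and fills the gaps with `NaN`; with it, targets and predictions line up row by row.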


Perform Ridge regression with cross-validation.

In [31]:
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

In [32]:
rcv = RidgeCV(alphas=np.arange(0.1, 10, 0.1), cv=cv, scoring="neg_mean_squared_error")
rcv.fit(x_train, y_train)
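After fitting, `RidgeCV` exposes the selected penalty strength as `alpha_`. The sketch below shows this on synthetic data from `make_regression`; the data and the three-value alpha grid are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold

# Synthetic regression problem (illustrative stand-in for the scaled features).
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=10.0, random_state=0
)

alphas = np.array([0.1, 1.0, 10.0])
cv_demo = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

ridge_cv = RidgeCV(alphas=alphas, cv=cv_demo, scoring="neg_mean_squared_error")
ridge_cv.fit(X_demo, y_demo)

print("Selected alpha:", ridge_cv.alpha_)
print("Coefficients:", ridge_cv.coef_)
```

If the selected value sits at the edge of the grid, that is usually a hint to widen the range of candidate alphas.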

In [33]:
rcv_pred = rcv.predict(x_test)
# rcv_pred = pd.DataFrame(rcv_pred,columns=["Pred"])
# rcv_pred.head()

In [34]:
math.sqrt(mean_squared_error(y_test,rcv_pred))

368.1428473202585

In [35]:
rcv.score(x_train,y_train), rcv.score(x_test,y_test)

(0.5644426470243216, 0.25069028462859044)
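For regressors, `score` returns the coefficient of determination R². A short sketch with made-up numbers shows how it is computed and that it can go negative when predictions are worse than always predicting the mean of the targets. The helper `r2_manual` is hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
good_pred = np.array([110.0, 190.0, 310.0, 390.0])
bad_pred = np.array([400.0, 300.0, 200.0, 100.0])  # reversed order

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r2_manual(y_true, good_pred), r2_score(y_true, good_pred))
print(r2_manual(y_true, bad_pred), r2_score(y_true, bad_pred))
```

A large gap between the training and validation R², as in the output above, indicates that the model fits the training data noticeably better than unseen data.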

Perform Lasso regression with cross-validation.

In [36]:
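# Note: tol=1 is a very loose convergence tolerance (scikit-learn's default is 1e-4),
# so the coordinate descent solver may stop before the coefficients have converged.
lcv = LassoCV(alphas=np.arange(0.1, 10, 0.1), cv=cv, tol=1)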
lcv.fit(x_train, y_train)
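A fitted `LassoCV` can be inspected in the same way: `alpha_` holds the selected penalty, and counting nonzero entries of `coef_` shows how many features survive. The sketch below uses synthetic data and the default tolerance; both are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedKFold

# Synthetic problem in which only 3 of the 8 features are informative.
X_demo, y_demo = make_regression(
    n_samples=100, n_features=8, n_informative=3, noise=5.0, random_state=0
)

alphas = np.array([0.1, 1.0, 10.0])
cv_demo = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

# tol is left at its default here rather than the loose tol=1.
lasso_cv = LassoCV(alphas=alphas, cv=cv_demo)
lasso_cv.fit(X_demo, y_demo)

n_nonzero = int(np.sum(lasso_cv.coef_ != 0))
print("Selected alpha:", lasso_cv.alpha_)
print("Nonzero coefficients:", n_nonzero, "of", lasso_cv.coef_.size)
```

Checking the number of surviving coefficients alongside the validation score helps tell whether a weak result comes from over-aggressive shrinkage or from an optimizer that stopped early.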

In [48]:
lcv_pred = lcv.predict(x_test)
lcv_pred = pd.DataFrame(lcv_pred,columns=["Pred"])
lcv_pred.head()

| | Pred |
|---|---|
| 0 | 754.716246 |
| 1 | 828.156069 |
| 2 | 938.00984 |
| 3 | 427.352647 |
| 4 | 488.926636 |


In [49]:
com = pd.concat([y_test.reset_index(drop=True), lcv_pred], axis=1)
com.tail(10)

| | Salary | Pred |
|---|---|---|
| 43 | 490.0 | 518.967222 |
| 44 | 900.0 | 1015.008599 |
| 45 | 700.0 | 893.694069 |
| 46 | 400.0 | 486.928692 |
| 47 | 115.0 | 286.790371 |
| 48 | 155.0 | 272.825407 |
| 49 | 625.0 | 684.043021 |
| 50 | 525.0 | 976.74702 |
| 51 | 250.0 | 473.594436 |
| 52 | 775.0 | 657.840382 |


In [43]:
math.sqrt(mean_squared_error(y_test,lcv_pred))

427.76464072214145

In [44]:
lcv.score(x_train,y_train), lcv.score(x_test,y_test)

(0.5016733287335055, -0.011668889814085537)