----

# **Demonstrate Lasso Regression and Its Benefits**

## **Author**   :  **Muhammad Adil Naeem**

## **Contact**   :   **madilnaeem0@gmail.com**
<br>

----

### **Import Libraries**

In [65]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LassoCV, Lasso
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

### **Load the Dataset**

In [66]:
df = pd.read_csv('/content/Hitters.csv')

### **First 5 rows of Dataset**

In [67]:
df.head()

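| | Unnamed: 0 | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | ... | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -Andy Allanson | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | ... | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
| 1 | -Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | ... | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 2 | -Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | ... | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 3 | -Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | ... | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 4 | -Andres Galarraga | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | ... | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |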


### **Shape of the Dataset**

In [68]:
print(f"This dataset consists of {df.shape[0]} rows and {df.shape[1]} columns.")

This dataset consists of 322 rows and 21 columns.


### **Missing values in dataset**

In [69]:
df.isnull().sum()

| | 0 |
|---|---|
| Unnamed: 0 | 0 |
| AtBat | 0 |
| Hits | 0 |
| HmRun | 0 |
| Runs | 0 |
| RBI | 0 |
| Walks | 0 |
| Years | 0 |
| CAtBat | 0 |
| CHits | 0 |


### **Detailed Information of Dataset**

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 21 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  322 non-null    object 
 1   AtBat       322 non-null    int64  
 2   Hits        322 non-null    int64  
 3   HmRun       322 non-null    int64  
 4   Runs        322 non-null    int64  
 5   RBI         322 non-null    int64  
 6   Walks       322 non-null    int64  
 7   Years       322 non-null    int64  
 8   CAtBat      322 non-null    int64  
 9   CHits       322 non-null    int64  
 10  CHmRun      322 non-null    int64  
 11  CRuns       322 non-null    int64  
 12  CRBI        322 non-null    int64  
 13  CWalks      322 non-null    int64  
 14  League      322 non-null    object 
 15  Division    322 non-null    object 
 16  PutOuts     322 non-null    int64  
 17  Assists     322 non-null    int64  
 18  Errors      322 non-null    int64  
 19  Salary      263 non-null    f

### **Replace Missing Values with Median**

In [71]:
df['Salary'].fillna(df['Salary'].median(skipna=True), inplace=True)

This line uses pandas to fill the missing values in the `Salary` column. Here is a breakdown of what it does:

- `df['Salary']`: This accesses the 'Salary' column in the DataFrame `df`.
- `.fillna(...)`: This method fills missing values (NaNs) in the selected column with the value passed to it.
- `df['Salary'].median(skipna=True)`: This calculates the median of the 'Salary' column, ignoring NaN values. `skipna=True` is already the default, so it is spelled out here only for clarity.
- `inplace=True`: This asks pandas to modify the selected Series directly rather than return a new one. Recent pandas releases warn about in-place calls made on a column selection like this one.

##### **Summary**
The code replaces any missing values in the 'Salary' column with the median salary of that column.
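
Because in-place calls on a column selection are discouraged in recent pandas releases, the same imputation can be written as a plain assignment. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not rows taken from the analysis above):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Hitters data (values are made up for illustration).
toy = pd.DataFrame({"Salary": [475.0, np.nan, 500.0, 91.5, np.nan]})

# Compute the median of the non-missing salaries (skipna=True is the default).
median_salary = toy["Salary"].median()

# Assign the filled column back instead of calling fillna(..., inplace=True)
# on a column selection.
toy["Salary"] = toy["Salary"].fillna(median_salary)

print(median_salary)
print(toy["Salary"].isnull().sum())
```

The assignment form does not depend on whether `toy["Salary"]` is a view or a copy, so it behaves the same across pandas versions.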

In [72]:
df.isnull().sum()

| | 0 |
|---|---|
| Unnamed: 0 | 0 |
| AtBat | 0 |
| Hits | 0 |
| HmRun | 0 |
| Runs | 0 |
| RBI | 0 |
| Walks | 0 |
| Years | 0 |
| CAtBat | 0 |
| CHits | 0 |


### **Encoding Categorical Variables**

In [73]:
dms = pd.get_dummies(df[['League', 'Division', 'NewLeague']], drop_first=True)

This line of code is used in Python with the Pandas library to perform one-hot encoding on categorical variables. Here's a breakdown of what it does:

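- **`pd.get_dummies(...)`**: This function converts categorical variables into dummy (indicator) variables. It returns a new DataFrame with one indicator column per category; the values are 0/1 in older pandas versions and boolean (`True`/`False`) by default from pandas 2.0 onward.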
  
- **`df[['League', 'Division', 'NewLeague']]`**: This selects the columns 'League', 'Division', and 'NewLeague' from the DataFrame `df` to be transformed.

- **`drop_first=True`**: This parameter indicates that the first category of each variable should be dropped to avoid multicollinearity. By doing this, only the remaining categories are kept as dummy variables.

##### **Summary**
The code creates a new DataFrame `dms` containing one-hot encoded variables for the 'League', 'Division', and 'NewLeague' columns from the original DataFrame `df`, excluding the first category of each variable.
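
To see which columns `drop_first=True` keeps, here is a small sketch on made-up rows that reuse the notebook's column names (the `toy` frame is an assumption for illustration). Passing `dtype=int` is optional and only forces 0/1 output regardless of pandas version:

```python
import pandas as pd

# Toy data using the same column names as the notebook; the rows are made up.
toy = pd.DataFrame({
    "League": ["A", "N", "A", "N"],
    "Division": ["E", "W", "W", "E"],
    "NewLeague": ["A", "N", "A", "N"],
})

# drop_first=True drops the alphabetically first category of each column
# (A, E, A), leaving one indicator per binary variable.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies.columns.tolist())
# ['League_N', 'Division_W', 'NewLeague_N']
print(dummies["League_N"].tolist())
# [0, 1, 0, 1]
```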

### **Splitting Data into Dependent and Independent Variables**

In [74]:
y = df['Salary']
x_ = df.drop(['Unnamed: 0', 'Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
X = pd.concat([x_, dms[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)

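- This code builds the target and the feature matrix. `y` is the 'Salary' column. `x_` drops the player-name column ('Unnamed: 0'), the target ('Salary'), and the three original categorical columns, then converts the remaining numeric columns to `float64`. `X` concatenates `x_` with the dummy columns 'League_N', 'Division_W', and 'NewLeague_N' from `dms`.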

### **Splitting Data for Training and Testing**

In [75]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### **Create and Fit Lasso Model**

In [76]:
lasso_model = Lasso().fit(X_train, y_train)

In [77]:
y_pred = lasso_model.predict(X_test)

### **Evaluate the Model**

In [78]:
# Intercept
lasso_model.intercept_

342.8733925858769

In [79]:
# Coefficients
lasso_model.coef_

array([-1.98558949e+00,  5.50494749e+00,  4.79612807e+00,  1.02123896e-01,
       -8.11521080e-01,  4.87004116e+00, -9.97808288e+00, -2.19391227e-01,
        6.16237616e-01,  9.03214960e-03,  8.73990383e-01,  7.84172593e-01,
       -8.13423037e-01,  1.83989460e-01,  4.04846687e-01, -4.08650952e+00,
        2.67092023e+01, -1.11463261e+02, -0.00000000e+00])
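
The last entry in the array above is exactly zero, which is how Lasso performs feature selection: the L1 penalty can shrink a coefficient all the way to zero. The sketch below illustrates the effect on synthetic data (the data, the names `X_demo` and `y_demo`, and the alpha values are assumptions for illustration, not the Hitters data), pairing coefficients with feature names and counting how many are zero as alpha grows:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only the first two of six features drive y.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"x{i}" for i in range(6)])
y_demo = 3.0 * X_demo["x0"] - 2.0 * X_demo["x1"] + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Label each coefficient with its feature name for easier reading.
    coefs = pd.Series(model.coef_, index=X_demo.columns)
    zero_counts[alpha] = int((coefs == 0).sum())
    print(f"alpha={alpha}: {zero_counts[alpha]} zero coefficient(s)")
```

The same `pd.Series(model.coef_, index=...)` pattern can be applied to `lasso_model` and `X_train.columns` to see which Hitters features were dropped.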

In [80]:
# rmse
np.sqrt(mean_squared_error(y_test, lasso_model.predict(X_test)))

345.6190692407428

In [81]:
# r2_score
r2_score(y_test, lasso_model.predict(X_test))

0.3657513009571691

### **LassoCV Model Training**

In [82]:
lasso_cv_model = LassoCV(alphas=np.random.randint(0,1000,100), cv=10, max_iter=10000,n_jobs=1).fit(X_train, y_train)

This line uses the `LassoCV` class from `sklearn.linear_model` to fit a Lasso regression whose regularization strength is chosen by cross-validation. Here is a breakdown of what it does:

- **`LassoCV(...)`**: This is a class that implements Lasso regression with built-in cross-validation to select the best alpha (regularization strength).

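- **`alphas=np.random.randint(0, 1000, 100)`**: This generates 100 random integers between 0 and 999 to be used as candidate values for the regularization parameter alpha. Lasso regression applies a penalty proportional to the absolute value of the coefficients, and alpha controls the strength of this penalty. Two caveats apply: the draw is not seeded, so the candidate set changes from run to run, and it can include 0, which corresponds to unpenalized least squares and is a value scikit-learn advises against using with Lasso.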

- **`cv=10`**: This specifies that 10-fold cross-validation should be used, which means the training data will be split into 10 subsets for validation purposes.

- **`max_iter=10000`**: This caps the number of iterations the coordinate descent solver may run. A higher cap gives the solver more room to converge; it does not make the model more complex.

- **`n_jobs=1`**: This runs the cross-validation on a single CPU core.

- **`.fit(X_train, y_train)`**: This method trains the Lasso model on the training data (`X_train` for features and `y_train` for the target variable).

#### **Summary**
The code trains a Lasso regression model using 10-fold cross-validation to find the best regularization parameter alpha, with a maximum of 10,000 solver iterations. The model is trained on the data contained in `X_train` and `y_train`.
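
As an alternative to random integer candidates, a fixed log-spaced grid keeps the search reproducible and strictly positive. This is a minimal sketch under stated assumptions: the data comes from `make_regression` rather than Hitters, and the grid bounds (`1e-2` to `1e3`) and the names `X_demo`, `y_demo`, `alpha_grid`, and `cv_model` are illustrative choices, not values taken from the analysis above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for X_train / y_train (hypothetical data, not Hitters).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Log-spaced, strictly positive candidates: reproducible and never alpha = 0.
alpha_grid = np.logspace(-2, 3, 50)

# Same cross-validation settings as the cell above, with the fixed grid.
cv_model = LassoCV(alphas=alpha_grid, cv=10, max_iter=10000).fit(X_demo, y_demo)

print(cv_model.alpha_)            # selected alpha, one of alpha_grid
print(cv_model.mse_path_.shape)   # (number of alphas, number of folds)
```

A log scale spends more candidates on small alphas, where the cross-validated error usually changes fastest; whether that range suits the Hitters data would need to be checked against its own error curve.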

#### **Best Alpha Value**

In [83]:
lasso_cv_model.alpha_

10

#### **Use the Best Alpha Value**

In [84]:
lasso_tuned = Lasso(alpha=lasso_cv_model.alpha_).fit(X_train, y_train)
y_pred_tuned = lasso_tuned.predict(X_test)

In [85]:
# rmse
np.sqrt(mean_squared_error(y_test,y_pred_tuned))

346.0550834571761

In [86]:
# r2_score
r2_score(y_test, y_pred_tuned)

0.36415002423959975

In [87]:
# coefficients of the tuned model
pd.Series(lasso_tuned.coef_, index=X_train.columns)

| | 0 |
|---|---|
| AtBat | -1.946241 |
| Hits | 5.207711 |
| HmRun | 2.861711 |
| Runs | 0.247893 |
| RBI | -0.102351 |
| Walks | 4.85347 |
| Years | -5.996889 |
| CAtBat | -0.247262 |
| CHits | 0.705343 |
| CHmRun | 0.128659 |
