## Problem Statement

The Indian Premier League (IPL) is a professional Twenty20 (T20) cricket league based in India. A player's auction price depends on his performance in Test matches and One Day Internationals, and on his primary skill. We use different regression techniques to predict the auction price of a player.

## About the dataset (IPL Auction data)

**PLAYER NAME**: Name of the player<br>
**AGE**: The age of the player is classified into three categories. Category 1 means the player is less than 25 years old, Category 2 means the player is between 25 and 35 years old, and Category 3 means the player is more than 35 years old.<br>
**COUNTRY**: Country of the player<br>
**PLAYING ROLE**: Player's primary skill<br>
**T-RUNS**: Total runs scored in the test matches<br>
**T-WKTS**: Total wickets taken in the test matches<br>
**ODI-RUNS-S**: Runs scored in One Day Internationals<br>
**ODI-SR-B**: Batting strike rate in One Day Internationals<br>
**ODI-WKTS**: Wickets taken in One Day Internationals<br>
**ODI-SR-BL**: Bowling strike rate in One Day Internationals<br>
**CAPTAINCY EXP**: Captained a team or not<br>
**RUNS-S**: Number of runs scored by a player<br>
**HS**: Highest score by a batsman in IPL<br>
**AVE**: Average runs scored by a batsman in IPL<br>
**SR-B**: Batting strike rate (runs scored per 100 balls faced) in IPL.<br>
**SIXERS**: Number of six runs scored by a player in IPL.<br>
**RUNS-C**: Number of runs conceded by a player<br>
**WKTS**: Number of wickets taken by a player in IPL.<br>
**AVE-BL**: Bowling average (number of runs conceded / number of wickets taken) in IPL.<br>
**ECON**: Economy rate of a bowler in IPL (number of runs conceded by the bowler per over).<br>
**SR-BL**: Bowling strike rate (ratio of the number of balls bowled to the number of wickets taken) in IPL.<br>
**SOLD PRICE**: Auction price of the player (Target Variable)<br>
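
The derived bowling statistics can be cross-checked against the raw columns. The sketch below recomputes `AVE-BL` from `RUNS-C` and `WKTS` for three rows of the sample displayed in section 2.1. The helper name `bowling_average` is hypothetical, and returning 0.0 for a player with no wickets is an assumption inferred from the sample rows, not a documented rule of the dataset.

```python
def bowling_average(runs_conceded, wickets):
    """Runs conceded per wicket, rounded to 2 decimals (0.0 if no wickets)."""
    if wickets == 0:
        return 0.0  # assumption: mirrors the 0.0 stored for wicketless players
    return round(runs_conceded / wickets, 2)

# (RUNS-C, WKTS, AVE-BL) copied from the sample rows of the dataset
sample_rows = [(307, 15, 20.47), (1059, 29, 36.52), (1125, 49, 22.96)]

for runs_c, wkts, ave_bl in sample_rows:
    # print the recomputed value next to the stored value
    print(runs_c, wkts, bowling_average(runs_c, wkts), ave_bl)
```

For these rows the recomputed value agrees with the stored `AVE-BL` to two decimal places.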

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [2]:
# import 'Pandas' 
import pandas as pd 

# import 'Numpy' 
import numpy as np

# import subpackage of Matplotlib
import matplotlib.pyplot as plt

# import 'Seaborn' 
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None
 
# to display the float values upto 6 decimal places     
pd.options.display.float_format = '{:.6f}'.format

# import train-test split 
from sklearn.model_selection import train_test_split

# import various functions from statsmodels
import statsmodels
import statsmodels.api as sm

# import 'stats'
from scipy import stats

# 'metrics' from sklearn is used for evaluating the model performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error


# import function to perform linear regression
from sklearn.linear_model import LinearRegression

# import StandardScaler to perform scaling
from sklearn.preprocessing import StandardScaler 

# import SGDRegressor from sklearn to perform linear regression with stochastic gradient descent
from sklearn.linear_model import SGDRegressor

# import function for ridge regression
from sklearn.linear_model import Ridge

# import function for lasso regression
from sklearn.linear_model import Lasso

# import function for elastic net regression
from sklearn.linear_model import ElasticNet

# import function to perform GridSearchCV
from sklearn.model_selection import GridSearchCV

In [3]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

<a id="prep"></a>
# 2. Data Preparation

<a id="read"></a>
## 2.1 Read the Data

#### Read the dataset and print the first five observations.

In [4]:
# load the csv file
# store the data in 'df_ipl'
df_ipl = pd.read_csv('ipl_player_auction.csv')

# display first five observations using head()
df_ipl.head()

Unnamed: 0,PLAYER NAME,AGE,COUNTRY,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,RUNS-S,HS,AVE,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,SOLD PRICE
0,Abdulla,2,South Africa,Allrounder,0,0,0,0.0,0,0.0,0,0,0,0.0,0.0,0,307,15,20.47,9.9,13.93,50000
1,Abdur Razzak,2,Bangladesh,Bowler,266,18,657,71.41,185,37.6,0,0,0,0.0,0.0,0,29,0,0.0,17.5,0.0,50000
2,Agarkar,2,India,Bowler,669,58,1269,80.62,288,32.9,0,167,39,18.56,121.01,5,1059,29,36.52,8.81,24.9,350000
3,Ashwin,1,India,Bowler,308,31,241,84.56,51,36.8,0,58,11,5.8,76.32,0,1125,49,22.96,8.23,22.14,850000
4,Badrinath,2,India,Batsman,109,0,79,45.93,0,0.0,0,1317,71,32.93,120.71,28,0,0,0.0,1.0,0.0,800000


**Let us now see the number of variables and observations in the data.**

In [5]:
# use 'shape' to check the dimension of data
df_ipl.shape

(130, 22)

**Interpretation:** The data has 130 observations and 22 variables.

<a id="dtype"></a>
## 2.2 Check the Data Type

**Check the data type of each variable. If the data type is not as per the data definition, change the data type.**

In [6]:
# use 'info()' to check the data type of each variable
df_ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PLAYER NAME    130 non-null    object 
 1   AGE            130 non-null    int64  
 2   COUNTRY        130 non-null    object 
 3   PLAYING ROLE   130 non-null    object 
 4   T-RUNS         130 non-null    int64  
 5   T-WKTS         130 non-null    int64  
 6   ODI-RUNS-S     130 non-null    int64  
 7   ODI-SR-B       130 non-null    float64
 8   ODI-WKTS       130 non-null    int64  
 9   ODI-SR-BL      130 non-null    float64
 10  CAPTAINCY EXP  130 non-null    int64  
 11  RUNS-S         130 non-null    int64  
 12  HS             130 non-null    int64  
 13  AVE            130 non-null    float64
 14  SR-B           130 non-null    float64
 15  SIXERS         130 non-null    int64  
 16  RUNS-C         130 non-null    int64  
 17  WKTS           130 non-null    int64  
 18  AVE-BL    

In [9]:
# convert numerical variables to categorical (object) 
# use astype() to change the data type

# change the data type of 'AGE' 
df_ipl['AGE'] = df_ipl['AGE'].astype('object')

# change the data type of 'CAPTAINCY EXP'
df_ipl['CAPTAINCY EXP'] = df_ipl['CAPTAINCY EXP'].astype('object')


In [10]:
df_ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PLAYER NAME    130 non-null    object 
 1   AGE            130 non-null    object 
 2   COUNTRY        130 non-null    object 
 3   PLAYING ROLE   130 non-null    object 
 4   T-RUNS         130 non-null    int64  
 5   T-WKTS         130 non-null    int64  
 6   ODI-RUNS-S     130 non-null    int64  
 7   ODI-SR-B       130 non-null    float64
 8   ODI-WKTS       130 non-null    int64  
 9   ODI-SR-BL      130 non-null    float64
 10  CAPTAINCY EXP  130 non-null    object 
 11  RUNS-S         130 non-null    int64  
 12  HS             130 non-null    int64  
 13  AVE            130 non-null    float64
 14  SR-B           130 non-null    float64
 15  SIXERS         130 non-null    int64  
 16  RUNS-C         130 non-null    int64  
 17  WKTS           130 non-null    int64  
 18  AVE-BL    

<a id="drop"></a>
## 2.3 Remove Insignificant Variables

The column `PLAYER NAME` contains the name of the player, which is redundant for further analysis. Thus, we drop the column.

In [None]:
# drop the column 'PLAYER NAME' using drop()
# 'axis = 1' drops the specified column
df_ipl = df_ipl.drop('PLAYER NAME', axis = 1)

<a id="null"></a>
## 2.4 Missing Value Treatment

First run a check for the presence of missing values and their percentage for each column. Then choose the right approach to treat them.

In [13]:
Total = df_ipl.isnull().sum().sort_values(ascending=False)          
Total

PLAYER NAME      0
AGE              0
SR-BL            0
ECON             0
AVE-BL           0
WKTS             0
RUNS-C           0
SIXERS           0
SR-B             0
AVE              0
HS               0
RUNS-S           0
CAPTAINCY EXP    0
ODI-SR-BL        0
ODI-WKTS         0
ODI-SR-B         0
ODI-RUNS-S       0
T-WKTS           0
T-RUNS           0
PLAYING ROLE     0
COUNTRY          0
SOLD PRICE       0
dtype: int64

In [11]:
# sort the variables on the basis of total null values in the variable
# 'isnull().sum()' returns the number of missing values in each variable
# 'ascending = False' sorts values in the descending order
# the variable with highest number of missing values will appear first
Total = df_ipl.isnull().sum().sort_values(ascending=False)          

# calculate percentage of missing values
# 'ascending = False' sorts values in the descending order
# the variable with highest percentage of missing values will appear first
Percent = (df_ipl.isnull().sum()*100/df_ipl.isnull().count()).sort_values(ascending=False)   

# concat the 'Total' and 'Percent' columns using 'concat' function
# pass a list of column names in parameter 'keys' 
# 'axis = 1' concats along the columns
missing_data = pd.concat([Total, Percent], axis = 1, keys = ['Total', 'Percentage of Missing Values'])    
missing_data

Unnamed: 0,Total,Percentage of Missing Values
PLAYER NAME,0,0.0
AGE,0,0.0
SR-BL,0,0.0
ECON,0,0.0
AVE-BL,0,0.0
WKTS,0,0.0
RUNS-C,0,0.0
SIXERS,0,0.0
SR-B,0,0.0
AVE,0,0.0


**Interpretation:** The above output shows that there are no missing values in the data.

<a id="dummy"></a>
## 2.5 Dummy Encode the Categorical Variables

#### Split the dependent and independent variables.

In [14]:
# store the target variable 'SOLD PRICE' in a dataframe 'df_target'
df_target = df_ipl['SOLD PRICE']

# store all the independent variables in a dataframe 'df_feature' 
# drop the column 'SOLD PRICE' using drop()
# 'axis = 1' drops the specified column
df_feature = df_ipl.drop('SOLD PRICE', axis = 1)

#### Filter numerical and categorical variables.

In [18]:
# filter the numerical features in the dataset
# 'select_dtypes' is used to select the variables with given data type
# 'include = [np.number]' will include all the numerical variables
df_num = df_feature.select_dtypes(include = [np.number])

# display numerical features
df_num.head(2)

Unnamed: 0,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,ODI-WKTS,ODI-SR-BL,RUNS-S,HS,AVE,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL
0,0,0,0,0.0,0,0.0,0,0,0.0,0.0,0,307,15,20.47,9.9,13.93
1,266,18,657,71.41,185,37.6,0,0,0.0,0.0,0,29,0,0.0,17.5,0.0


In [19]:
# filter the categorical features in the dataset
# 'select_dtypes' is used to select the variables with given data type
# 'include = [object]' will include all the categorical variables
df_cat = df_feature.select_dtypes(include = [object])

# display categorical features
df_cat.head()

Unnamed: 0,PLAYER NAME,AGE,COUNTRY,PLAYING ROLE,CAPTAINCY EXP
0,Abdulla,2,South Africa,Allrounder,0
1,Abdur Razzak,2,Bangladesh,Bowler,0
2,Agarkar,2,India,Bowler,0
3,Ashwin,1,India,Bowler,0
4,Badrinath,2,India,Batsman,0


In [26]:
df_cat  = df_cat.drop(['PLAYER NAME'],axis = 1)

In [1]:
categorical_columns =  list(df_cat.columns)

Regression models require numeric inputs, so they cannot use categorical variables directly. To overcome this, we use (n-1) dummy encoding.

**Encode each categorical variable and create (n-1) dummy variables for the n categories of the variable.**

In [28]:
dummy_var = pd.get_dummies(data = df_cat, prefix = None, prefix_sep='_',
               columns = categorical_columns,
               drop_first =True,
              dtype='int8')

In [32]:
 df_ipl['AGE'].value_counts()

AGE
2    86
3    28
1    16
Name: count, dtype: int64

In [29]:
dummy_var.head(2)

Unnamed: 0,AGE_2,AGE_3,COUNTRY_Bangladesh,COUNTRY_England,COUNTRY_India,COUNTRY_New Zealand,COUNTRY_Pakistan,COUNTRY_South Africa,COUNTRY_Sri Lanka,COUNTRY_West Indies,COUNTRY_Zimbabwe,PLAYING ROLE_Batsman,PLAYING ROLE_Bowler,PLAYING ROLE_W. Keeper,CAPTAINCY EXP_1
0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0


**Interpretation:** We can see that the dummy variables have been created. A '1' in the column 'AGE_2' indicates that the corresponding player is between 25 and 35 years old. A '0' in both 'AGE_2' and 'AGE_3' indicates that the player is less than 25 years old.
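
To make the (n-1) encoding concrete, here is a minimal NumPy sketch of what `drop_first = True` amounts to for a single variable. The five `age` values are made up for illustration, and this is not how pandas implements `get_dummies()` internally.

```python
import numpy as np

# hypothetical AGE categories for five players
age = np.array([2, 2, 1, 3, 2])

# sorted unique categories: [1, 2, 3]
categories = np.unique(age)

# full one-hot matrix: one column per category
one_hot = (age[:, None] == categories[None, :]).astype(np.int8)

# drop the first column (category 1) to keep n-1 dummies: AGE_2 and AGE_3
dummies = one_hot[:, 1:]
print(dummies)
```

The player in category 1 ends up with zeros in both remaining columns, so the dropped category acts as the reference level.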

<a id="scale"></a>
## 2.6 Scale the Data 

We scale the variables to bring them all onto a comparable range. This prevents some features from dominating the model solely because they are measured on a larger scale than others.
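
As a small illustration, standardization replaces each value by its z-score: subtract the column mean and divide by the column standard deviation. `StandardScaler` uses the population standard deviation (`ddof = 0`). The sketch below applies this by hand to the five `T-RUNS` values from the sample rows shown earlier.

```python
import numpy as np

# T-RUNS values of the first five players in the sample
x = np.array([0.0, 266.0, 669.0, 308.0, 109.0])

# z-score: subtract the mean and divide by the population standard deviation
z = (x - x.mean()) / x.std(ddof=0)

# after scaling, the mean is (numerically) 0 and the standard deviation is 1
print(z.mean(), z.std(ddof=0))
```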

In [36]:
# initialize the standard scaler
X_scaler = StandardScaler()

# scale all the numeric variables
# standardize all the columns of the dataframe 'df_num'
num_scaled = X_scaler.fit_transform(df_num)

# create a dataframe of scaled numerical variables
# pass the required column names to the parameter 'columns'
df_num_scaled = pd.DataFrame(num_scaled, columns = df_num.columns)

# standardize the target variable explicitly and store it in a new variable 'y'
y = (df_target - df_target.mean()) / df_target.std()

#### Concatenate scaled numerical and dummy encoded categorical variables.

In [37]:
# concat the dummy variables with numeric features to create a dataframe of all independent variables
# 'axis=1' concats the dataframes along columns 
X = pd.concat([df_num_scaled, dummy_var], axis = 1)

# display first five observations
X.head()

Unnamed: 0,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,ODI-WKTS,ODI-SR-BL,RUNS-S,HS,AVE,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AGE_2,AGE_3,COUNTRY_Bangladesh,COUNTRY_England,COUNTRY_India,COUNTRY_New Zealand,COUNTRY_Pakistan,COUNTRY_South Africa,COUNTRY_Sri Lanka,COUNTRY_West Indies,COUNTRY_Zimbabwe,PLAYING ROLE_Batsman,PLAYING ROLE_Bowler,PLAYING ROLE_W. Keeper,CAPTAINCY EXP_1
0,-0.674581,-0.468108,-0.703043,-2.757731,-0.68676,-1.277132,-0.839014,-1.307954,-1.693829,-3.102818,-0.745369,-0.30301,-0.096956,-0.123459,0.423147,-0.226928,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,-0.593774,-0.34146,-0.518927,0.008708,0.983269,0.133821,-0.839014,-1.307954,-1.693829,-3.102818,-0.745369,-0.802864,-0.786968,-1.111749,1.877704,-1.142498,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0
2,-0.471347,-0.060022,-0.347421,0.365505,1.913068,-0.042548,-0.566677,-0.232487,-0.014415,0.278728,-0.534721,1.049112,0.547056,0.651433,0.214533,0.494091,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0
3,-0.581015,-0.249993,-0.635505,0.518141,-0.226374,0.103801,-0.74443,-1.004617,-1.169012,-0.970105,-0.745369,1.167783,1.467072,-0.003242,0.103527,0.312686,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
4,-0.641468,-0.468108,-0.680904,-0.978393,-0.68676,-1.277132,1.308699,0.649946,1.285864,0.270344,0.434258,-0.855007,-0.786968,-1.111749,-1.280217,-1.142498,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0


**Interpretation:** The dataframe `X` now contains the 16 scaled numerical variables followed by the 15 dummy variables, i.e. 31 independent variables in total.

<a id="split"></a>
## 2.7 Train-Test Split

Before applying various regression techniques to predict the auction price of the player, let us split the dataset into a train set and a test set.
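
Conceptually, a random split shuffles the row indices and reserves a fixed share of them for testing. The sketch below shows the idea with NumPy for 130 rows and a 20% test share. It is a simplified illustration, not scikit-learn's exact algorithm, so it will not reproduce the rows that `train_test_split` selects for a given `random_state`.

```python
import numpy as np

n_samples, test_size = 130, 0.2

# shuffle the row indices reproducibly (the seed is an arbitrary choice)
rng = np.random.default_rng(10)
indices = rng.permutation(n_samples)

# the first 20% of the shuffled rows form the test set, the rest the train set
n_test = int(round(test_size * n_samples))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))
```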

In [38]:
# split data into train subset and test subset
# set 'random_state' to generate the same dataset each time you run the code 
# 'test_size' sets the proportion of data to be included in the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, test_size = 0.2)

# check the dimensions of the train & test subset using 'shape'
# print dimension of train set
print('X_train', X_train.shape)
print('y_train', y_train.shape)

# print dimension of test set
print('X_test', X_test.shape)
print('y_test', y_test.shape)

X_train (104, 31)
y_train (104,)
X_test (26, 31)
y_test (26,)


## Create generalized functions to calculate various metrics for models

#### Create a generalized function to calculate the RMSE for train and test set.

In [41]:
# create a generalized function to calculate the RMSE values for train set
def get_train_rmse(model):
    
    # For training set:
    # train_pred: prediction made by the model on the training dataset 'X_train'
    # y_train: actual values of the target variable for the train dataset

    # predict the output of the target variable from the train data 
    train_pred = model.predict(X_train)

    # calculate the MSE using the "mean_squared_error" function

    # MSE for the train data
    mse_train = mean_squared_error(y_train, train_pred)

    # take the square root of the MSE to calculate the RMSE
    # round the value upto 4 digits using 'round()'
    rmse_train = round(np.sqrt(mse_train), 4)
    
    # return the training RMSE
    return(rmse_train)

In [42]:
# create a generalized function to calculate the RMSE values for test set
def get_test_rmse(model):
    
    # For testing set:
    # test_pred: prediction made by the model on the test dataset 'X_test'
    # y_test: actual values of the target variable for the test dataset

    # predict the output of the target variable from the test data
    test_pred = model.predict(X_test)

    # MSE for the test data
    mse_test = mean_squared_error(y_test, test_pred)

    # take the square root of the MSE to calculate the RMSE
    # round the value upto 4 digits using 'round()'
    rmse_test = round(np.sqrt(mse_test), 4)

    # return the test RMSE
    return(rmse_test)

#### Create a generalized function to calculate the MAPE for test set.

In [43]:
# define a function to calculate MAPE
# pass the actual and predicted values as input to the function
# return the calculated MAPE 
def mape(actual, predicted):
    return (np.mean(np.abs((actual - predicted) / actual)) * 100)

def get_test_mape(model):
    
    # For testing set:
    # test_pred: prediction made by the model on the test dataset 'X_test'
    # y_test: actual values of the target variable for the test dataset

    # predict the output of the target variable from the test data
    test_pred = model.predict(X_test)
    
    # calculate the mape using the "mape()" function created above
    # calculate the MAPE for the test data
    mape_test = mape(y_test, test_pred)

    # return the MAPE for the test set
    return(mape_test)
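
A tiny worked example may help to see what the two error metrics measure. The arrays below are made up for illustration. Note that MAPE divides by the actual value, so it can become very large when an actual value is close to zero, which is possible for a standardized target such as `y`.

```python
import numpy as np

# hypothetical actual and predicted values
actual = np.array([2.0, 4.0, 6.0])
predicted = np.array([1.0, 4.0, 8.0])

# RMSE: square root of the mean of the squared errors
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# MAPE: mean of the absolute errors relative to the actual values, in percent
mape_value = np.mean(np.abs((actual - predicted) / actual)) * 100

print(round(float(rmse), 4), round(float(mape_value), 2))
```

Here the errors are 1, 0 and -2, so the RMSE is the square root of 5/3 and the MAPE is the mean of 50%, 0% and 33.33%.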

#### Create a generalized function to calculate the R-Squared

In [44]:
# define a function to get the R-squared value on the train set
def get_score(model):
    
    # score() returns the R-squared value
    r_sq = model.score(X_train, y_train)
    
    # return the R-squared value 
    return (r_sq)
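
`get_score()` returns only the R-squared. The adjusted R-squared, which penalizes the number of features, can be derived from it as `1 - (1 - R²)(n - 1)/(n - k - 1)` for `n` observations and `k` features in a model with an intercept. The helper `adjusted_r_squared` below is a hypothetical addition, and the R-squared of 0.584 is used only as an illustrative input.

```python
def adjusted_r_squared(r_sq, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r_sq) * (n_obs - 1) / (n_obs - n_features - 1)

# 104 training rows and 31 features, as in X_train
adj_r_sq = adjusted_r_squared(0.584, n_obs=104, n_features=31)
print(round(adj_r_sq, 4))
```

With 31 features and only 104 training rows, the adjustment is substantial, which is why the adjusted value is the more conservative measure of fit here.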

<a id="linreg"></a>
# 3. Multiple Linear Regression (OLS)

#### Build an MLR model on the training dataset.

In [45]:
# initiate linear regression model
linreg = LinearRegression()

# build the model using X_train and y_train
# use fit() to fit the regression model
MLR_model = linreg.fit(X_train, y_train)

# print the R-squared value for the model
# score() returns the R-squared value
MLR_model.score(X_train, y_train)

0.5839698427213922

In [48]:
r_score = get_score(MLR_model)
r_score

0.5839698427213922

In [None]:
train_rmse =get_train_rmse(MLR_model)
train_rmse

In [None]:
test_rmse = get_test_rmse(MLR_model)
test_rmse

In [None]:
# difference between the train and test RMSE
train_rmse - test_rmse

In [47]:
# print training RMSE
print('RMSE on train set: ', get_train_rmse(MLR_model))

# print test RMSE
print('RMSE on test set: ', get_test_rmse(MLR_model))

# calculate the difference between train and test set RMSE
difference = abs(get_test_rmse(MLR_model) - get_train_rmse(MLR_model))

# print the difference between train and test set RMSE
print('Difference between RMSE on train and test set: ', difference)

RMSE on train set:  0.6609
RMSE on test set:  0.8741
Difference between RMSE on train and test set:  0.21319999999999995


**Interpretation:** RMSE on the training set is 0.6609, while on the test set it is 0.8741. We can see that there is a large difference in the RMSE of the train and the test set. This implies that our model has overfitted on the train set. 

To deal with the problem of overfitting, we study the approach of `Regularization` in a later section. 

<a id="regu"></a>
# 5. Regularization

One way to deal with overfitting is to add `Regularization` to the model. Inflated coefficients are a common symptom of overfitting, so regularization adds a penalty that grows with the size of the coefficients and thereby prevents any feature from being weighted too heavily. In this section, we will learn about three regularization techniques:

1. Ridge Regression
2. Lasso Regression
3. Elastic Net Regression
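
For reference, the three techniques differ only in the penalty added to the squared-error loss. The objectives below follow the parameterisation documented for scikit-learn's `Ridge`, `Lasso` and `ElasticNet` estimators. The symbols are notation introduced here: $w$ is the coefficient vector, $n$ the number of training samples, $\alpha$ the `alpha` parameter and $\rho$ the `l1_ratio` parameter.

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

Because the loss terms are scaled differently, the same numeric value of `alpha` does not mean the same penalty strength for `Ridge` as it does for `Lasso` or `ElasticNet`.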

<a id="ridge"></a>
## 5.1 Ridge Regression

Data often exhibit multicollinearity among the independent variables, which makes the OLS coefficient estimates unstable. `Ridge Regression` handles such data by adding an L2-norm penalty on the coefficients.

#### Build regression model using Ridge Regression for alpha = 1.

In [49]:
# use Ridge() to perform ridge regression
# 'alpha' assigns the regularization strength to the model
# 'max_iter' assigns maximum number of iterations for the model to run 
ridge = Ridge(alpha = 1, max_iter = 500)

# fit the model on train set
ridge = ridge.fit(X_train, y_train)

# print RMSE for test set
# call the function 'get_test_rmse'
print('RMSE on test set:', get_test_rmse(ridge))

RMSE on test set: 0.8438


**Interpretation:** After applying the ridge regression with alpha equal to one, we get 0.8438 as the RMSE value.

#### Build regression model using Ridge Regression for alpha = 2.

In [50]:
# use Ridge() to perform ridge regression
# 'alpha' assigns the regularization strength to the model
# 'max_iter' assigns maximum number of iterations for the model to run
ridge = Ridge(alpha = 2, max_iter = 500)

# fit the model on train set
ridge.fit(X_train, y_train)


# print RMSE for test set
# call the function 'get_test_rmse'
print('RMSE on test set:', get_test_rmse(ridge))

RMSE on test set: 0.8367


**Interpretation:** After applying the ridge regression with alpha equal to two, the RMSE value decreased to 0.8367.

#### Visualize the change in values of coefficients obtained from `MLR_model (using OLS)` and `ridge_model`

**Interpretation:** The coefficients obtained from ridge regression have smaller values as compared to the coefficients obtained from linear regression using OLS.
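
The shrinkage effect can be reproduced on a small synthetic example. The sketch below uses the closed-form ridge solution (without an intercept, for brevity) on two made-up, highly correlated features and prints the coefficients for increasing values of `alpha`. The data and the helper `ridge_coefficients` are assumptions for illustration and are unrelated to the IPL dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data with two highly correlated features (illustrative only)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(size=100)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept; alpha = 0 gives OLS."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas_demo = [0, 1, 10, 100]
norms = []
for alpha in alphas_demo:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    norms.append(float(np.linalg.norm(coef)))
    # print alpha, the coefficients and their overall size (L2 norm)
    print(alpha, np.round(coef, 3), round(norms[-1], 3))
```

The L2 norm of the coefficient vector decreases as `alpha` grows, which is the behaviour described in the interpretation above.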

<a id="lasso"></a>
## 5.2 Lasso Regression

Lasso regression can shrink the coefficients of less important variables exactly to zero, which makes it useful for feature selection when we are dealing with a large number of variables. It is a regularization technique that uses the L1 norm.
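
Why the L1 penalty produces exact zeros is easiest to see in the special case of orthonormal features. There, the lasso solution is the soft-thresholded OLS solution: every coefficient is pulled toward zero by a fixed amount, and those smaller than that amount become exactly zero. The sketch below illustrates only this operator, with made-up coefficients and an arbitrary threshold. It is not how `Lasso()` is solved in general.

```python
import numpy as np

def soft_threshold(coef, threshold):
    """Shrink each coefficient toward zero and set small ones exactly to zero."""
    return np.sign(coef) * np.maximum(np.abs(coef) - threshold, 0.0)

# hypothetical OLS coefficients: two large and two small
ols_coef = np.array([3.0, -0.5, 0.2, -2.0])

lasso_like_coef = soft_threshold(ols_coef, threshold=1.0)
print(lasso_like_coef)
```

The two small coefficients are removed entirely, while the two large ones are kept with reduced magnitude.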

In [51]:
# use Lasso() to perform lasso regression
# 'alpha' assigns the regularization strength to the model
# 'max_iter' assigns maximum number of iterations for the model to run
lasso = Lasso(alpha = 0.01, max_iter = 500)

# fit the model on train set
lasso.fit(X_train, y_train)

# print RMSE for test set
# call the function 'get_test_rmse'
print('RMSE on test set:', get_test_rmse(lasso))


RMSE on test set: 0.8183


**Interpretation:** After applying the lasso regression with alpha equal to 0.01, the RMSE value is 0.8183.

<a id="elastic"></a>
## 5.3 Elastic Net Regression

This technique is a combination of the Ridge and Lasso regression techniques. It uses a linear combination of the L1 and L2 penalties.
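
To see how the mixing parameter blends the two penalties, the sketch below evaluates the penalty term for a made-up coefficient vector at three values of `l1_ratio`. The function `elastic_net_penalty` is a hypothetical helper that follows scikit-learn's documented parameterisation of the penalty.

```python
import numpy as np

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Elastic net penalty: weighted sum of the L1 and squared L2 norms."""
    l1 = np.sum(np.abs(coef))
    l2_squared = np.sum(coef ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2_squared

coef_demo = np.array([1.0, -2.0, 0.0])  # hypothetical coefficients

pure_l1 = elastic_net_penalty(coef_demo, alpha=0.1, l1_ratio=1.0)   # L1 term only
pure_l2 = elastic_net_penalty(coef_demo, alpha=0.1, l1_ratio=0.0)   # L2 term only
mixed = elastic_net_penalty(coef_demo, alpha=0.1, l1_ratio=0.01)    # mostly L2
print(pure_l1, pure_l2, mixed)
```

With `l1_ratio = 0.01`, the penalty is almost entirely the L2 term, so the model behaves much like ridge regression.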

In [52]:
# use ElasticNet() to perform Elastic Net regression
# 'alpha' assigns the regularization strength to the model
# 'l1_ratio' is the ElasticNet mixing parameter
# 'l1_ratio = 0' performs Ridge regression
# 'l1_ratio = 1' performs Lasso regression
# pass number of iterations to 'max_iter'
enet = ElasticNet(alpha = 0.1, l1_ratio = 0.01, max_iter = 500)

# fit the model on train data
enet.fit(X_train, y_train)


# print RMSE for test set
# call the function 'get_test_rmse'
print('RMSE on test set:', get_test_rmse(enet))

RMSE on test set: 0.8013


**Interpretation:** With the elastic-net regression, we get 0.8013 as the RMSE value.

#### Visualize the change in values of coefficients obtained from `MLR_model (using OLS)` and `Elastic Net regression`

<a id="GScv"></a>
# 6. GridSearchCV

Hyperparameters are the parameters of a model that are set by the user before training. GridSearchCV evaluates every combination of the supplied hyperparameter values using cross-validation and returns the best-performing combination. The following are some of the parameters that GridSearchCV takes:

1. estimator: pass the machine learning algorithm model
2. param_grid: takes a dictionary having parameter names as keys and list of parameters as values
3. cv: number of folds for k-fold cross validation
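
The idea behind the three parameters listed above can be sketched by hand: for every candidate value, fit the model on k-1 folds, score it on the held-out fold, average over the folds, and keep the best candidate. The NumPy sketch below does this for ridge regression on synthetic data. It is a simplified illustration with hypothetical helper names, not scikit-learn's implementation. In particular, with the default `scoring`, `GridSearchCV` maximizes the estimator's R-squared rather than minimizing the MSE used here.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, alpha, n_folds=5):
    """Mean validation MSE of ridge regression over k contiguous folds."""
    all_idx = np.arange(len(y))
    errors = []
    for val_idx in np.array_split(all_idx, n_folds):
        # train on every row that is not in the current validation fold
        train_idx = np.setdiff1d(all_idx, val_idx)
        coef = ridge_fit(X[train_idx], y[train_idx], alpha)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ coef) ** 2))
    return float(np.mean(errors))

# synthetic data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=60)

# evaluate every candidate alpha and keep the one with the lowest CV error
alpha_grid = [0.01, 0.1, 1, 10, 100]
cv_scores = {alpha: cv_mse(X_demo, y_demo, alpha) for alpha in alpha_grid}
best_alpha = min(cv_scores, key=cv_scores.get)

print(cv_scores)
print('best alpha:', best_alpha)
```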

### Find optimal value of alpha for `Ridge Regression`

In [53]:
# create a dictionary with the hyperparameter and its candidate values
# 'alpha' assigns the regularization strength to the model
tuned_parameters = [{'alpha':[1e-15, 1e-10, 1e-8, 1e-4,1e-3, 1e-2, 0.1, 1, 5, 19, 40, 60, 80, 100]}]
 
# initiate the ridge regression model
ridge = Ridge()

# use GridSearchCV() to find the optimal value of alpha
# estimator: pass the ridge regression model
# param_grid: pass the list 'tuned_parameters'
# cv: number of folds in k-fold i.e. here cv = 5
ridge_grid = GridSearchCV(estimator = ridge, 
                          param_grid = tuned_parameters, 
                          cv = 5)

# fit the model on X_train and y_train using fit()
ridge_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for Ridge Regression: ', ridge_grid.best_params_, '\n')

# print the RMSE for test set using the model having optimal value of alpha
print('RMSE on test set:', get_test_rmse(ridge_grid))

Best parameters for Ridge Regression:  {'alpha': 19} 

RMSE on test set: 0.7852


**Interpretation:** With the optimal value of alpha that we got from GridSearchCV (alpha = 19), the RMSE of the test set decreased to 0.7852.

### Find optimal value of alpha for `Lasso Regression`

In [54]:
# create a dictionary with the hyperparameter and its candidate values
# 'alpha' assigns the regularization strength to the model
tuned_parameters = [{'alpha':[1e-15, 1e-10, 1e-8, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 20]}]
 
# initiate the lasso regression model 
lasso = Lasso()

# use GridSearchCV() to find the optimal value of alpha
# estimator: pass the lasso regression model
# param_grid: pass the list 'tuned_parameters'
# cv: number of folds in k-fold i.e. here cv = 10
lasso_grid = GridSearchCV(estimator = lasso, 
                          param_grid = tuned_parameters, 
                          cv = 10)

# fit the model on X_train and y_train using fit()
lasso_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for Lasso Regression: ', lasso_grid.best_params_, '\n')

# print the RMSE for the test set using the model having optimal value of alpha
print('RMSE on test set:', get_test_rmse(lasso_grid))

Best parameters for Lasso Regression:  {'alpha': 0.1} 

RMSE on test set: 0.784


**Interpretation:** With the optimal value of alpha that we got from GridSearchCV, the RMSE of test set is 0.784.

### Find optimal value of alpha for `Elastic Net Regression`

In [55]:
# create a dictionary with the hyperparameters and their candidate values
# 'alpha' assigns the regularization strength to the model
# 'l1_ratio' is the ElasticNet mixing parameter
tuned_parameters = [{'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 20, 40, 60],
                      'l1_ratio':[0.0001, 0.0002, 0.001, 0.01, 0.1, 0.2]}]

# initiate the elastic net regression model  
enet = ElasticNet()

# use GridSearchCV() to find the optimal value of alpha and l1_ratio
# estimator: pass the elastic net regression model
# param_grid: pass the list 'tuned_parameters'
# cv: number of folds in k-fold i.e. here cv = 10
enet_grid = GridSearchCV(estimator = enet, 
                          param_grid = tuned_parameters, 
                          cv = 10)

# fit the model on X_train and y_train using fit()
enet_grid.fit(X_train, y_train)

# get the best parameters
print('Best parameters for Elastic Net Regression: ', enet_grid.best_params_, '\n')

# print the RMSE for the test set using the model having optimal value of alpha and l1-ratio
print('RMSE on test set:', get_test_rmse(enet_grid))

Best parameters for Elastic Net Regression:  {'alpha': 0.1, 'l1_ratio': 0.0001} 

RMSE on test set: 0.8004


**Interpretation:** With the optimal values of alpha and l1_ratio that we got from GridSearchCV, the RMSE of the test set is 0.8004. 

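**Interpretation:** Among the tuned models, `Lasso Regression (using GridSearchCV)` has the lowest test RMSE (0.784), marginally ahead of `Ridge Regression (using GridSearchCV)` with `alpha = 19` (0.7852). Both reduce the test RMSE well below that of the OLS model (0.8741), which indicates that regularization deals with the problem of overfitting effectively.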