**Multiple Linear Regression and Regularised Linear Regression (Ridge and Lasso)**

**Case study**: This data belongs to a loan aggregator agency that connects loan applicants with different financial institutions in an attempt to get them the best interest rate. The agency now wants to use past data to predict the interest rate a financial institution will offer, based only on the characteristics of the loan application.

To achieve that, they have decided to run a proof of concept (POC) with data from one particular financial institution. The data is in the file "loans data.csv".

**Multiple Linear Regression: RMSE: 1.9907**<BR>
**Regularised Linear Regression: 1. Ridge RMSE: 1.9859 2. Lasso RMSE: 1.9828**

In [1]:
import pandas as pd
import numpy as np
import math as m
import sklearn
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
%matplotlib inline

In [2]:
data_file=r'D:\Edvancer\R\Data sets\Data\loans data.csv'
ld=pd.read_csv(data_file)

In [3]:
ld.head()

Unnamed: 0,ID,Amount.Requested,Amount.Funded.By.Investors,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,FICO.Range,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length
0,81174.0,20000,20000,8.90%,36 months,debt_consolidation,14.90%,SC,MORTGAGE,6541.67,735-739,14,14272,2.0,< 1 year
1,99592.0,19200,19200,12.12%,36 months,debt_consolidation,28.36%,TX,MORTGAGE,4583.33,715-719,12,11140,1.0,2 years
2,80059.0,35000,35000,21.98%,60 months,debt_consolidation,23.81%,CA,MORTGAGE,11500.0,690-694,14,21977,1.0,2 years
3,15825.0,10000,9975,9.99%,36 months,debt_consolidation,14.30%,KS,MORTGAGE,3833.33,695-699,10,9346,0.0,5 years
4,33182.0,12000,12000,11.71%,36 months,credit_card,18.78%,NJ,RENT,3195.0,695-699,11,14469,0.0,9 years


In [4]:
ld.shape

(2500, 15)

In [5]:
ld.dtypes

ID                                float64
Amount.Requested                   object
Amount.Funded.By.Investors         object
Interest.Rate                      object
Loan.Length                        object
Loan.Purpose                       object
Debt.To.Income.Ratio               object
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                  object
Revolving.CREDIT.Balance           object
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
dtype: object

**Cleaning Variables**

1. **Problem**: The columns are read as object because their values contain the symbol "%" <br>
**Solution**: Remove the symbol so the values can be converted to numbers without becoming NA

In [6]:
for cols in ["Interest.Rate", "Debt.To.Income.Ratio"]:
    ld[cols]=ld[cols].astype("str") 
    ld[cols]=[x.replace("%","") for x in ld[cols]]

#Python's str.replace works on individual strings, so the column is first cast to str (missing values become the string "nan")
#The list comprehension strips "%" from every value and the result is assigned back to ld[cols]
#The columns are not converted to numbers here; that is done below together with the other columns
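
The same clean-up can also be written with pandas string methods. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the loans file); it assumes the only unwanted character is a trailing "%".

```python
import pandas as pd

# Toy stand-in for the two percentage columns (made-up values).
toy = pd.DataFrame({
    "Interest.Rate": ["8.90%", "12.12%", None],
    "Debt.To.Income.Ratio": ["14.90%", "28.36%", "23.81%"],
})

for col in ["Interest.Rate", "Debt.To.Income.Ratio"]:
    # .str.rstrip keeps missing values missing instead of turning them into "nan"
    toy[col] = pd.to_numeric(toy[col].str.rstrip("%"), errors="coerce")

print(toy.dtypes)
```

Unlike the cell above, this version converts to numbers in the same step, so it would replace the later `pd.to_numeric` call for these two columns.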

In [7]:
ld.dtypes

ID                                float64
Amount.Requested                   object
Amount.Funded.By.Investors         object
Interest.Rate                      object
Loan.Length                        object
Loan.Purpose                       object
Debt.To.Income.Ratio               object
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                  object
Revolving.CREDIT.Balance           object
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
dtype: object

2. **Problem**: Many columns that should be numeric have been imported as character (object) columns <br>
   **Possible reason**: Some non-numeric character values may exist in those columns <br>
   **Solution**: Convert all such columns to numbers; entries that cannot be parsed become NaN 

In [8]:
for col in ["Amount.Requested", "Amount.Funded.By.Investors", "Interest.Rate", "Debt.To.Income.Ratio",
           "Open.CREDIT.Lines", "Revolving.CREDIT.Balance", "Inquiries.in.the.Last.6.Months"]:
    ld[col]=pd.to_numeric(ld[col], errors="coerce")
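
Because `errors="coerce"` silently turns unparseable entries into NaN, it can be useful to check what was lost. A small sketch on made-up values (the series below is illustrative only):

```python
import pandas as pd

# Made-up raw column containing one stray "." and one genuinely missing value.
raw = pd.Series(["14", "12", ".", "10", None], name="Open.CREDIT.Lines")
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed as numbers
newly_missing = converted.isna() & raw.notna()
print(raw[newly_missing].tolist())
```

Running the same check on the real columns before overwriting them would show which raw strings caused the object dtype.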


In [9]:
ld.dtypes

ID                                float64
Amount.Requested                  float64
Amount.Funded.By.Investors        float64
Interest.Rate                     float64
Loan.Length                        object
Loan.Purpose                       object
Debt.To.Income.Ratio              float64
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
dtype: object

**Dummy variable creation for categorical variables**

In [10]:
ld["Loan.Length"].value_counts()

36 months    1950
60 months     548
.               1
Name: Loan.Length, dtype: int64

In [11]:
ll_dummy=pd.get_dummies(ld["Loan.Length"])

#Function get_dummies creates dummy variables for all the categories present in a categorical variable
#Result is a dataframe, we can then choose and drop the dummies that we want to drop 
#and attach the ones selected back to our original data.
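
As a hedged aside, `pd.get_dummies` can also encode several categorical columns of a dataframe in one call. The sketch below uses a made-up three-row frame; the prefixes "LL" and "HO" are arbitrary names chosen for illustration.

```python
import pandas as pd

# Made-up rows with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Loan.Length": ["36 months", "60 months", "36 months"],
    "Home.Ownership": ["RENT", "MORTGAGE", "OWN"],
    "Monthly.Income": [3195.0, 6541.67, 4583.33],
})

# One call encodes both categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(
    toy,
    columns=["Loan.Length", "Home.Ownership"],
    prefix=["LL", "HO"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Each variable with k categories yields k dummy columns, so one column per variable still has to be dropped by hand (or with `drop_first=True`) to avoid redundancy.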

In [12]:
ll_dummy.head() #3 columns, one per category, holding 1/0 indicators

Unnamed: 0,.,36 months,60 months
0,0,1,0
1,0,1,0
2,0,0,1
3,0,1,0
4,0,1,0


In [13]:
ld["loanlength_36"]=ll_dummy["36 months"]

In [14]:
#Once we are done with the dataframe ll_dummy, we can drop it. 
#Below is the general way of removing variables from the notebook environment.

%reset_selective ll_dummy

Once deleted, variables cannot be recovered. Proceed (y/[n])?  Y


In [15]:
ld=ld.drop('Loan.Length',axis=1)
#Now that we have created dummies for Loan.Length, we need to remove this from the dataframe.

In [16]:
#To list the variables in the environment, use the IPython magic "%who"
#%who

In [17]:
ld.dtypes

ID                                float64
Amount.Requested                  float64
Amount.Funded.By.Investors        float64
Interest.Rate                     float64
Loan.Purpose                       object
Debt.To.Income.Ratio              float64
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
loanlength_36                       uint8
dtype: object

In [18]:
#Examining now Loan.Purpose
ld["Loan.Purpose"].value_counts()

debt_consolidation    1307
credit_card            444
other                  200
home_improvement       152
major_purchase         101
small_business          87
car                     50
wedding                 39
medical                 30
moving                  29
vacation                21
house                   20
educational             15
renewable_energy         4
Name: Loan.Purpose, dtype: int64

In [19]:
#There are 14 categories
#we can either make 13 dummies or 
#we can club few of these categories together and reduce the number of effective categories 
#and then make dummy variables for those.

#Clubbing those categories which behave similarly in terms of their effect on response. 
#Means: Club those categories for which average interest rates are similar in the data.


In [20]:
round(ld.groupby("Loan.Purpose")["Interest.Rate"].mean())

Loan.Purpose
car                   11.0
credit_card           13.0
debt_consolidation    14.0
educational           11.0
home_improvement      12.0
house                 13.0
major_purchase        11.0
medical               12.0
moving                14.0
other                 13.0
renewable_energy      10.0
small_business        13.0
vacation              12.0
wedding               12.0
Name: Interest.Rate, dtype: float64

In [21]:
#The rounded means take 5 distinct values (10 to 14). renewable_energy (10) has only 4 rows and is left as is;
#the other 13 categories are clubbed into 4 groups by their rounded mean.
len(ld.index) #2500
#loc syntax:  df.loc[row label, "col name"]

for i in range(len(ld.index)):
    if ld["Loan.Purpose"][i] in ["car", "educational", "major_purchase"]: #11
        ld.loc[i, "Loan.Purpose"]="cem"
    elif ld["Loan.Purpose"][i] in ["home_improvement","medical","vacation","wedding"]: #12
        ld.loc[i,"Loan.Purpose"]="hmvw"
    elif ld["Loan.Purpose"][i] in ["credit_card","house","other","small_business"]: #13
        ld.loc[i, "Loan.Purpose"]="chos"
    elif ld["Loan.Purpose"][i] in ["debt_consolidation","moving"]: #14
        ld.loc[i, "Loan.Purpose"]="dm"
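
The row-by-row loop above can be slow on larger data. A possible vectorised alternative is sketched below on a made-up series; the `groups` dictionary mirrors the lists used in the loop, while the names `groups`, `lookup` and `clubbed` are introduced here for illustration only.

```python
import pandas as pd

# Made-up sample of loan purposes, including a missing value.
purpose = pd.Series(["car", "credit_card", "moving", "wedding", "renewable_energy", None])

groups = {
    "cem": ["car", "educational", "major_purchase"],
    "hmvw": ["home_improvement", "medical", "vacation", "wedding"],
    "chos": ["credit_card", "house", "other", "small_business"],
    "dm": ["debt_consolidation", "moving"],
}
# Invert to a {category: group} lookup table.
lookup = {cat: grp for grp, cats in groups.items() for cat in cats}

# map() returns NaN for categories not in the lookup; fillna restores the original label.
clubbed = purpose.map(lookup).fillna(purpose)
print(clubbed.tolist())
```

Categories that are not listed in any group, such as renewable_energy, keep their original label, which matches what the loop does.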
    

In [22]:
#Now we make dummies for the variables as the categories are reduced

Lp_dummies=pd.get_dummies(ld["Loan.Purpose"], prefix="LP")

In [23]:
Lp_dummies.head()

Unnamed: 0,LP_cem,LP_chos,LP_dm,LP_hmvw,LP_renewable_energy
0,0,0,1,0,0
1,0,0,1,0,0
2,0,0,1,0,0
3,0,0,1,0,0
4,0,1,0,0,0


In [24]:
#Add the dummies to the original data, then drop the original variable "Loan.Purpose"
#and LP_renewable_energy, which acts as the reference category

ld=pd.concat([ld, Lp_dummies],axis=1)
ld=ld.drop(["Loan.Purpose", "LP_renewable_energy"], axis=1)

In [25]:
ld.dtypes

ID                                float64
Amount.Requested                  float64
Amount.Funded.By.Investors        float64
Interest.Rate                     float64
Debt.To.Income.Ratio              float64
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
loanlength_36                       uint8
LP_cem                              uint8
LP_chos                             uint8
LP_dm                               uint8
LP_hmvw                             uint8
dtype: object

In [26]:
ld["State"].nunique()

47

In [27]:
ld["State"].value_counts()

CA    433
NY    255
TX    174
FL    169
IL    101
GA     97
PA     96
NJ     94
VA     78
MA     73
OH     71
MD     68
NC     64
CO     61
WA     58
CT     50
AZ     46
MI     45
AL     38
MN     38
MO     33
NV     32
OR     30
SC     28
WI     26
KY     23
LA     22
KS     21
OK     21
UT     16
NH     15
RI     15
WV     14
AR     13
NM     13
HI     12
AK     11
DC     11
DE      8
MT      7
VT      5
WY      4
SD      4
IN      3
MS      1
IA      1
.       1
Name: State, dtype: int64

In [28]:
round(ld.groupby("State")["Interest.Rate"].mean())

State
.     15.0
AK    17.0
AL    13.0
AR    13.0
AZ    13.0
CA    13.0
CO    13.0
CT    14.0
DC    14.0
DE    12.0
FL    13.0
GA    13.0
HI    16.0
IA    14.0
IL    13.0
IN    13.0
KS    14.0
KY    12.0
LA    15.0
MA    13.0
MD    13.0
MI    14.0
MN    14.0
MO    13.0
MS    16.0
MT    11.0
NC    13.0
NH    12.0
NJ    13.0
NM    14.0
NV    14.0
NY    13.0
OH    12.0
OK    14.0
OR    13.0
PA    13.0
RI    13.0
SC    13.0
SD    10.0
TX    13.0
UT    13.0
VA    13.0
VT    18.0
WA    13.0
WI    14.0
WV    14.0
WY    13.0
Name: Interest.Rate, dtype: float64

In [29]:
for i in range(len(ld.index)):
    if ld["State"][i] in ["DE", "KY", "NH", "OH"]: #12
        ld.loc[i, "State"]="4states"
    elif ld["State"][i] in ["AL","AR","AZ","CA", "CO", "FL", "GA", "IL", "IN", "MA", "MD", "MO", "NC"
                            , "NJ", "NY", "OR", "PA", "RI", "SC", "TX", "UT", "VA", "WA", "WY"]: #13
        ld.loc[i,"State"]="24states"
    elif ld["State"][i] in ["CT","DC","IA","KS", "MI", "MN", "NM", "NV", "OK", "WI", "WV"]: #14
        ld.loc[i, "State"]="11states"
    elif ld["State"][i] in [".","LA"]: #15
        ld.loc[i, "State"]="2states"

In [30]:
state_dummies=pd.get_dummies(ld["State"], prefix="S") 
#no need to write "S_" as the prefix: get_dummies inserts the "_" separator itself

In [31]:
state_dummies.head()

Unnamed: 0,S_11states,S_24states,S_2states,S_4states,S_AK,S_HI,S_MS,S_MT,S_SD,S_VT
0,0,1,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0


In [32]:
ld=pd.concat([ld, state_dummies], axis=1)

In [33]:
ld=ld.drop(["State", "S_AK", "S_HI","S_MS", "S_MT", "S_SD", "S_VT"],axis=1)

In [34]:
ld.dtypes

ID                                float64
Amount.Requested                  float64
Amount.Funded.By.Investors        float64
Interest.Rate                     float64
Debt.To.Income.Ratio              float64
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
loanlength_36                       uint8
LP_cem                              uint8
LP_chos                             uint8
LP_dm                               uint8
LP_hmvw                             uint8
S_11states                          uint8
S_24states                          uint8
S_2states                           uint8
S_4states                           uint8
dtype: object

In [35]:
#Looking next at Home.Ownership
ld["Home.Ownership"].value_counts()

MORTGAGE    1147
RENT        1146
OWN          200
OTHER          5
NONE           1
Name: Home.Ownership, dtype: int64

In [36]:
ld["ho_mort"]=np.where(ld["Home.Ownership"]=="MORTGAGE",1,0)
ld["ho_rent"]=np.where(ld["Home.Ownership"]=="RENT",1,0)
ld=ld.drop(["Home.Ownership"],axis=1)

#OTHER and NONE have very low frequencies, so they are ignored and the variable is treated as having 3 categories.
#Only two dummies are created; OWN (together with OTHER and NONE) is the baseline.

In [37]:
ld["FICO.Range"].head()
#checking how values are in the col

0    735-739
1    715-719
2    690-694
3    695-699
4    695-699
Name: FICO.Range, dtype: object

In [38]:
ld['F1'], ld['F2']=zip(*ld['FICO.Range'].apply(lambda x: x.split('-', 1)))

#The values can be converted to numeric by taking the average of the given range.
#To do that, first split the column on "-" so that both ends of the range are in separate columns,
#then simply average them.

#The * unpacks the sequence of [low, high] pairs so that zip can regroup them into one tuple of lows and one of highs.
#Open question: creating dummy variables column by column is tedious in Python. Is there a shorter way, like writing a function in R?

In [39]:
ld["Fico"]=0.5*(pd.to_numeric(ld["F1"])+pd.to_numeric(ld["F2"]))
ld=ld.drop(["FICO.Range", "F1", "F2"], axis=1)

#Created the new variable "Fico" by averaging F1 and F2, 
#then dropped the original variable FICO.Range along with F1 and F2.
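
The split-and-average step can also be done without temporary columns. A minimal sketch on three made-up FICO ranges, assuming every value has the form "low-high":

```python
import pandas as pd

# Made-up FICO ranges in the same "low-high" format as the data.
fico_range = pd.Series(["735-739", "715-719", "690-694"])

# expand=True returns a two-column dataframe holding the low and high ends.
bounds = fico_range.str.split("-", n=1, expand=True).astype(float)
fico_mid = bounds.mean(axis=1)
print(fico_mid.tolist())  # [737.0, 717.0, 692.0]
```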

In [40]:
#Looking at Employment Length variable now
ld["Employment.Length"].value_counts()

10+ years    653
< 1 year     249
2 years      243
3 years      235
5 years      202
4 years      191
1 year       177
6 years      163
7 years      127
8 years      108
9 years       72
.              2
Name: Employment.Length, dtype: int64

In [41]:
ld["Employment.Length"]=ld["Employment.Length"].astype("str")
ld["Employment.Length"]=[x.replace("years","") for x in ld["Employment.Length"]]
ld["Employment.Length"]=[x.replace("year","") for x in ld["Employment.Length"]]

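#Note: value_counts() skips NaN by default, so the missing values were not counted before.
#astype("str") turns them into the string "nan", which is counted below.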

In [42]:
ld["Employment.Length"].value_counts()

10+     653
< 1     249
2       243
3       235
5       202
4       191
1       177
6       163
7       127
8       108
nan      78
9        72
.         2
Name: Employment.Length, dtype: int64

In [43]:
#We can convert everything else to numbers, but the missing values (shown as "nan") are a problem. 
#We can look at the average interest rate across all values of Employment.Length 
#and then replace the missing values with the value that has the closest average response

round(ld.groupby("Employment.Length")["Interest.Rate"].mean(),2)

Employment.Length
.       11.34
1       12.49
10+     13.34
2       12.87
3       12.77
4       13.14
5       13.40
6       13.29
7       13.10
8       13.01
9       13.15
< 1     12.86
nan     12.78
Name: Interest.Rate, dtype: float64

In [44]:
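#Above, the missing values ("nan", 12.78) have an average similar to "< 1" (12.86).
#Caution: after astype("str") the missing values are the string "nan", not "n/a", so the first replace below matches nothing.
#They are coerced to NaN by to_numeric and removed by dropna later; replacing "nan" instead would impute them and change the results below.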

ld["Employment.Length"]=[x.replace("n/a","< 1") for x in ld["Employment.Length"]]
ld["Employment.Length"]=[x.replace("10+","10") for x in ld["Employment.Length"]]
ld["Employment.Length"]=[x.replace("< 1","0") for x in ld["Employment.Length"]]
ld["Employment.Length"]=pd.to_numeric(ld["Employment.Length"],errors="coerce")
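
The chain of string replacements can be condensed with a regular expression. This is only a sketch on made-up values, assuming the raw labels look like "10+ years", "< 1 year" and so on; it leaves "." and missing entries as NaN rather than imputing them.

```python
import numpy as np
import pandas as pd

# Made-up sample covering the label formats seen in Employment.Length.
emp = pd.Series(["10+ years", "< 1 year", "1 year", "7 years", ".", np.nan])

# Pull out the first run of digits; entries without digits ("." and NaN) become NaN.
years = pd.to_numeric(emp.str.extract(r"(\d+)", expand=False), errors="coerce")
# "< 1 year" contains the digit 1, so set it to 0 explicitly, as the notebook does.
years = years.mask(emp == "< 1 year", 0)
print(years.tolist())
```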

In [45]:
ld.dtypes

ID                                float64
Amount.Requested                  float64
Amount.Funded.By.Investors        float64
Interest.Rate                     float64
Debt.To.Income.Ratio              float64
Monthly.Income                    float64
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                 float64
loanlength_36                       uint8
LP_cem                              uint8
LP_chos                             uint8
LP_dm                               uint8
LP_hmvw                             uint8
S_11states                          uint8
S_24states                          uint8
S_2states                           uint8
S_4states                           uint8
ho_mort                             int32
ho_rent                             int32
Fico                              float64
dtype: object

In [46]:
#All variables as numbers now. 
#After dropping observations with missing values, can proceed to build the model.
ld.shape

(2500, 22)

In [47]:
ld.dropna(axis=0,inplace=True)

In [48]:
ld.shape

(2394, 22)

In [49]:
#Dropping ID from the predictor list because it doesn't make sense to include it in the model.
#Variable "Amount.Funded.By.Investors" is also dropped as it won't be available until the loan has been processed.

ld=ld.drop(["ID","Amount.Funded.By.Investors"],axis=1)

**Splitting Data into Train and Test**

In [50]:
#Option "random_state" is used to make our random operation reproducible
ld_train, ld_test = train_test_split(ld, test_size=0.2, random_state=2)

In [51]:
#The line below creates an object of class LinearRegression named lm. 
#We can use this object to access all methods related to LinearRegression.

lm=LinearRegression()

In [52]:
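x_train=ld_train.drop(["Interest.Rate"],axis=1)
y_train=ld_train["Interest.Rate"]
x_test=ld_test.drop(["Interest.Rate"],axis=1)
y_test=ld_test["Interest.Rate"]

#Note: unlike R's formula interface, scikit-learn's fit(X, y) takes the predictors and the response as separate arguments,
#so both are split out here. y_test is kept separately to score the predictions made from x_test.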

**Fitting the model: Linear Regression** 

In [53]:
lm.fit(x_train, y_train)

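#The output below is the fitted estimator echoed back by the notebook (fit returns the estimator itself),
#listing the parameters it was created with.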

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [54]:
#Predicting response on test data. 
p_test=lm.predict(x_test)

In [55]:
#Calculating error on the prediction
residual = p_test - y_test

In [56]:
#Calculating rmse for the residuals
#RMSE: Measure of performance on the test data

rmse_lm=np.sqrt(np.dot(residual, residual)/len(p_test))
rmse_lm

1.990785912983817
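
As a cross-check, the manual RMSE formula can be compared with scikit-learn's `mean_squared_error`. The arrays below are made-up numbers for illustration, not values from the loans data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([8.90, 12.12, 21.98, 9.99])    # made-up interest rates
y_pred = np.array([9.50, 11.00, 19.00, 10.50])   # made-up predictions

# Manual formula, as used in the notebook.
residual = y_pred - y_true
rmse_manual = np.sqrt(np.dot(residual, residual) / len(y_pred))

# Same quantity via scikit-learn: square root of the mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```

Both expressions compute the square root of the mean squared residual, so they should agree up to floating-point error.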

In [57]:
#Extracting coefficients produced by the model
Coefs=lm.coef_
Features=x_train.columns
list(zip(Features, Coefs))

[('Amount.Requested', 0.0001517734781713893),
 ('Debt.To.Income.Ratio', 0.0013760777043406445),
 ('Monthly.Income', -1.3704686177863027e-05),
 ('Open.CREDIT.Lines', -0.0287991258152676),
 ('Revolving.CREDIT.Balance', -4.352518471652548e-06),
 ('Inquiries.in.the.Last.6.Months', 0.3670299909239237),
 ('Employment.Length', 0.003011734971206336),
 ('loanlength_36', -3.223284110323758),
 ('LP_cem', -1.3324056691565713),
 ('LP_chos', -1.4697906063163133),
 ('LP_dm', -1.5261202468745294),
 ('LP_hmvw', -1.6907178776000285),
 ('S_11states', -0.8258522904438373),
 ('S_24states', -0.7795194771509176),
 ('S_2states', -0.5416254224158563),
 ('S_4states', -0.7239762343540967),
 ('ho_mort', -0.43300227711450073),
 ('ho_rent', -0.14430465555567923),
 ('Fico', -0.08655089086299964)]

**Penalising by Ridge/Lasso Regression**

We need to penalise the coefficients of variables that contribute little to the response and may be causing the model to overfit. 

**Ridge Regression** :
Since the penalty weight in ridge regression is a hyperparameter, we try multiple values of it and choose the best one through 10-fold cross validation.
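
For reference, these are the objectives that scikit-learn's `Ridge` and `Lasso` are documented to minimise, written with $n$ for the number of training rows, $w$ for the coefficient vector and $\alpha$ for the penalty weight (the `alpha` argument). The intercept is fitted separately and is not penalised.

```latex
\begin{aligned}
\text{Ridge:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
```

The two losses are scaled differently (the lasso loss is divided by $2n$), which is one reason why sensible `alpha` grids for the two models sit on very different ranges.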

In [58]:
# Finding best value of penalty weight with cross validation for ridge regression
alphas=np.linspace(2,10,100)

In [59]:
# We need to reset the index so that the positional fold indices from KFold line up with the row labels
# By default reset_index adds the old index as a column and uses a new sequential index
# drop=True avoids adding the old index as a column, so the index is simply reset to 0, 1, 2, ...
x_train.reset_index(drop=True,inplace=True)
y_train.reset_index(drop=True,inplace=True)

In [60]:
rmse_list=[]
for a in alphas:
    ridge = Ridge(fit_intercept=True, alpha=a)

    # computing RMSE pooled across 10-fold cross validation
    # (older scikit-learn versions used: KFold(len(x_train), n_folds=10))
    kf=KFold(n_splits=10)
    xval_err = 0
    for train, test in kf.split(x_train):
        # train and test are positional indices, so use iloc throughout
        ridge.fit(x_train.iloc[train], y_train.iloc[train])
        p = ridge.predict(x_train.iloc[test])
        err = p - y_train.iloc[test]
        xval_err += np.dot(err,err)
    rmse_10cv = np.sqrt(xval_err/len(x_train))
    # Uncomment below to print rmse values for individual alphas
    #print('{:.3f}\t {:.6f}\t '.format(a,rmse_10cv))
    rmse_list.append(rmse_10cv)
# convert to an array so the comparison is element-wise; the result is a one-element array
best_alpha=alphas[np.array(rmse_list)==min(rmse_list)]
print('Alpha with min 10cv error is : ',best_alpha )

Alpha with min 10cv error is :  [2.]
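
The manual search loop can also be expressed with scikit-learn's `GridSearchCV`. The sketch below runs on synthetic data (the features, coefficients and noise level are made up), so its numbers say nothing about the loans data. One difference to keep in mind: `GridSearchCV` averages the per-fold RMSE, whereas the loop above pools squared errors over all folds before taking the square root, so the two can pick slightly different alphas.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

search = GridSearchCV(
    Ridge(fit_intercept=True),
    param_grid={"alpha": np.linspace(2, 10, 100)},
    scoring="neg_root_mean_squared_error",  # scores are negated so that higher is better
    cv=KFold(n_splits=10),
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_
print(best_alpha, best_rmse)
```

By default the search refits the estimator on all of the supplied data with the best alpha, so `search.predict` can be used directly afterwards.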


In [61]:
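#KFold is used here without shuffling and the train/test split has a fixed random_state, so rerunning gives the same alpha.
#The best alpha can differ if the folds are shuffled or a different split is used.
#Here it sits at the lower edge of the grid (2), so smaller values may be worth trying.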

In [62]:
#Next we fit Ridge Regression on the entire train data with best value of alpha we just determined.

ridge=Ridge(fit_intercept=True,alpha=best_alpha[0]) #best_alpha is a one-element array; pass the scalar

ridge.fit(x_train,y_train)

p_test=ridge.predict(x_test)

residual=p_test-y_test

rmse_ridge=np.sqrt(np.dot(residual,residual)/len(p_test))

rmse_ridge

1.9859381185717309

In [63]:
list(zip(x_train.columns,ridge.coef_))

#From the coefficients below, we find that ridge regression shrinks coefficients but does not make them exactly 0,
#so it does not reduce the size of the model

[('Amount.Requested', 0.00015221457569401624),
 ('Debt.To.Income.Ratio', 0.0014394383981735047),
 ('Monthly.Income', -1.4258866489839921e-05),
 ('Open.CREDIT.Lines', -0.029420077914029612),
 ('Revolving.CREDIT.Balance', -4.286684797689766e-06),
 ('Inquiries.in.the.Last.6.Months', 0.3659184286025226),
 ('Employment.Length', 0.0030015194126455243),
 ('loanlength_36', -3.1993145081975034),
 ('LP_cem', -0.23261420558652884),
 ('LP_chos', -0.3742677638496975),
 ('LP_dm', -0.4309809288834763),
 ('LP_hmvw', -0.5897435807410064),
 ('S_11states', -0.6711751589154407),
 ('S_24states', -0.6299334409666998),
 ('S_2states', -0.3543570484347373),
 ('S_4states', -0.565171913475411),
 ('ho_mort', -0.41959513395028136),
 ('ho_rent', -0.13414996654604794),
 ('Fico', -0.08654878487272366)]

In [64]:
list(zip(Features,Coefs))

[('Amount.Requested', 0.0001517734781713893),
 ('Debt.To.Income.Ratio', 0.0013760777043406445),
 ('Monthly.Income', -1.3704686177863027e-05),
 ('Open.CREDIT.Lines', -0.0287991258152676),
 ('Revolving.CREDIT.Balance', -4.352518471652548e-06),
 ('Inquiries.in.the.Last.6.Months', 0.3670299909239237),
 ('Employment.Length', 0.003011734971206336),
 ('loanlength_36', -3.223284110323758),
 ('LP_cem', -1.3324056691565713),
 ('LP_chos', -1.4697906063163133),
 ('LP_dm', -1.5261202468745294),
 ('LP_hmvw', -1.6907178776000285),
 ('S_11states', -0.8258522904438373),
 ('S_24states', -0.7795194771509176),
 ('S_2states', -0.5416254224158563),
 ('S_4states', -0.7239762343540967),
 ('ho_mort', -0.43300227711450073),
 ('ho_rent', -0.14430465555567923),
 ('Fico', -0.08655089086299964)]

**Lasso Regression**

In [65]:
alphas=np.linspace(0.0001,1,100)
rmse_list=[]
for a in alphas:
    lasso = Lasso(fit_intercept=True, alpha=a,max_iter=10000)

    # computing RMSE pooled across 10-fold cross validation
    # (older scikit-learn versions used: KFold(len(x_train), n_folds=10))
    kf=KFold(n_splits=10)
    xval_err = 0
    for train, test in kf.split(x_train):
        # train and test are positional indices, so use iloc throughout
        lasso.fit(x_train.iloc[train], y_train.iloc[train])
        p =lasso.predict(x_train.iloc[test])
        err = p - y_train.iloc[test]
        xval_err += np.dot(err,err)
    rmse_10cv = np.sqrt(xval_err/len(x_train))
    rmse_list.append(rmse_10cv)
    # Uncomment below to print rmse values of individual alphas
    #print('{:.3f}\t {:.4f}\t '.format(a,rmse_10cv))
# convert to an array so the comparison is element-wise; the result is a one-element array
best_alpha=alphas[np.array(rmse_list)==min(rmse_list)]
print('Alpha with min 10cv error is : ',best_alpha )

Alpha with min 10cv error is :  [0.0203]


In [66]:
lasso=Lasso(fit_intercept=True,alpha=best_alpha[0]) #best_alpha is a one-element array; pass the scalar

lasso.fit(x_train,y_train)

p_test=lasso.predict(x_test)

residual=p_test-y_test

rmse_lasso=np.sqrt(np.dot(residual,residual)/len(p_test))

rmse_lasso

1.982894911139778

In [67]:
list(zip(x_train.columns,lasso.coef_))

[('Amount.Requested', 0.0001539293282091595),
 ('Debt.To.Income.Ratio', 0.0009321941365557245),
 ('Monthly.Income', -1.7893764804642685e-05),
 ('Open.CREDIT.Lines', -0.029156552269878),
 ('Revolving.CREDIT.Balance', -4.210314871770977e-06),
 ('Inquiries.in.the.Last.6.Months', 0.34610251057752983),
 ('Employment.Length', 0.0),
 ('loanlength_36', -3.0814760613021903),
 ('LP_cem', 0.0),
 ('LP_chos', 0.0),
 ('LP_dm', -0.0),
 ('LP_hmvw', -0.0),
 ('S_11states', -0.0),
 ('S_24states', -0.0),
 ('S_2states', 0.0),
 ('S_4states', 0.0),
 ('ho_mort', -0.21660653329324908),
 ('ho_rent', 0.0),
 ('Fico', -0.08669139128770449)]

In [68]:
#Lasso regression not only improves performance on the test data slightly, 
#but also makes the model smaller by setting many coefficients to exactly zero,
#thus excluding those variables from the model.
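
The contrast between the two penalties can be reproduced on synthetic data. In this sketch the data, the feature names `x0` to `x5` and the choice `alpha=0.5` are all made up for illustration; only the first two features truly affect the response.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only x0 and x1 carry signal, the other four features are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=300)

names = [f"x{i}" for i in range(6)]
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Features that the lasso keeps (non-zero coefficient).
kept = [n for n, c in zip(names, lasso.coef_) if c != 0]
# Ridge shrinks coefficients but leaves all of them non-zero.
ridge_nonzero = int(np.count_nonzero(ridge.coef_))
print(kept, ridge_nonzero)
```

Listing the non-zero lasso coefficients in this way is a quick method for reading off which variables remain in the fitted model.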
