Problem statement: We have been given data on people applying for a loan from a peer-to-peer
lending firm. The data contains the details of each loan application and the interest rate that
the platform eventually offered. Our solution needs to predict the interest rate that will be
offered, given the application details as inputs. We need to minimize the mean error (measured below as mean absolute error).
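
The mean absolute error averages the absolute gaps between actual and predicted rates. A minimal sketch on made-up numbers (the values are illustrative assumptions, not taken from the loan data):

```python
import numpy as np

# Illustrative values only; not taken from the loan data
actual_rates = np.array([18.49, 17.27, 14.33])
predicted_rates = np.array([17.49, 17.77, 15.33])

# MAE = mean of |actual - predicted|
mae = np.mean(np.abs(actual_rates - predicted_rates))
print(round(mae, 4))
```

Because the errors are not squared, a single badly mispredicted loan affects this metric less than it would affect a squared-error metric.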


In [33]:

# import for processing data
import pandas as pd 
import numpy as np


# imports for suppressing warnings
import warnings
warnings.filterwarnings('ignore')



In [34]:
# Full paths to the files that contain the data.
# The r prefix makes each path a raw string, so backslashes are taken literally
# and are not interpreted as the start of an escape sequence.

train_file=r'D:\Hackathon\Loan Data\loan_data_train.csv'
test_file=r'D:\Hackathon\Loan Data\loan_data_test.csv'


ld_train=pd.read_csv(train_file)
ld_test=pd.read_csv(test_file)               


In [35]:
ld_train.head()

Unnamed: 0,ID,Amount.Requested,Amount.Funded.By.Investors,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,FICO.Range,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length
0,79542.0,25000,25000.0,18.49%,60 months,debt_consolidation,27.56%,VA,MORTGAGE,8606.56,720-724,11,15210,3.0,5 years
1,75473.0,19750,19750.0,17.27%,60 months,debt_consolidation,13.39%,NY,MORTGAGE,6737.5,710-714,14,19070,3.0,4 years
2,67265.0,2100,2100.0,14.33%,36 months,major_purchase,3.50%,LA,OWN,1000.0,690-694,13,893,1.0,< 1 year
3,80167.0,28000,28000.0,16.29%,36 months,credit_card,19.62%,NV,MORTGAGE,7083.33,710-714,12,38194,1.0,10+ years
4,17240.0,24250,17431.82,12.23%,60 months,credit_card,23.79%,OH,MORTGAGE,5833.33,730-734,6,31061,2.0,10+ years


In [36]:
ld_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 15 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ID                              2199 non-null   float64
 1   Amount.Requested                2199 non-null   object 
 2   Amount.Funded.By.Investors      2199 non-null   object 
 3   Interest.Rate                   2200 non-null   object 
 4   Loan.Length                     2199 non-null   object 
 5   Loan.Purpose                    2199 non-null   object 
 6   Debt.To.Income.Ratio            2199 non-null   object 
 7   State                           2199 non-null   object 
 8   Home.Ownership                  2199 non-null   object 
 9   Monthly.Income                  2197 non-null   float64
 10  FICO.Range                      2200 non-null   object 
 11  Open.CREDIT.Lines               2196 non-null   object 
 12  Revolving.CREDIT.Balance        21

In [37]:
#ID : It doesn't make sense to include unique identifiers of the observation (ID vars) as input. We'll drop this column.

#Amount.Requested , Open.CREDIT.Lines, Revolving.CREDIT.Balance : Ideally these should have been numeric columns, but 
#if we look at the type assigned , it is object type. They must have come as object type because of some odd strings 
#in the data at one or more places . We'll convert them to numeric type.

#Amount.Funded.By.Investors : This information, although present in the data, will not come with loan application. If 
#we want to build a model for predicting Interest.Rate using loan application characteristics , then we can not include 
#this information in our model. We'll drop this column.

#Interest.Rate, Debt.To.Income.Ratio: These come as object type again because of the % symbol contained in it. We'll 
#first remove the % sign and then convert it to numeric type.

#State , Home.Ownership, Loan.Length, Loan.Purpose : We'll create dummies , ignoring categories with too few occurrences.

#Monthly.Income , Inquiries.in.the.Last.6.Months : We'll leave these as they are.

#FICO.Range: This comes as object type because the values are written as numeric ranges in the data. We could convert 
#this to dummies, but dummies would discard the information contained in the order of the values. 
#We'll instead take the average of each range using string processing.

#Employment.Length: This is of object type because it holds numeric values written in words. We could again choose to 
#treat it as a categorical variable, but then we'd lose the information in the order of the values, so we'll convert it to numbers.


In [38]:
#Combining the train and test datasets so that the columns are converted consistently. We add a placeholder Interest.Rate 
#column to ld_test and a data_type column to both datasets so that we can separate them again after the changes.

ld_test['Interest.Rate']='na'
ld_test['data_type']='test'
ld_train['data_type']='train'

id_all=pd.concat([ld_train,ld_test],axis=0)

In [39]:
id_all.shape

(2500, 16)

In [40]:
# removing ID and Amount.Funded.By.Investors

id_all.drop(['ID','Amount.Funded.By.Investors'],axis=1,inplace=True)


In [41]:
# Removing % signs from two columns
for col in ['Interest.Rate','Debt.To.Income.Ratio']:
    id_all[col]=id_all[col].str.replace("%","")


In [42]:
# converting columns to numeric with pandas
for col in ['Amount.Requested', 'Interest.Rate','Debt.To.Income.Ratio','Open.CREDIT.Lines','Revolving.CREDIT.Balance']:
     id_all[col]=pd.to_numeric(id_all[col],errors='coerce')
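
To illustrate what `errors='coerce'` does in the cell above, here is a small sketch on a toy column. The stray "." entry is an assumption made for illustration; the actual odd strings in the loan data are not shown in this notebook.

```python
import pandas as pd

# Toy column; the "." entry is a made-up example of a stray string
raw = pd.Series(["25000", "19750", ".", "2100"])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")
print(clean.isnull().sum())
```

The coerced entries show up as missing values, which is why the non-null counts of these columns drop slightly after conversion.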

In [43]:
# Processing FICO.Range
k=id_all['FICO.Range'].str.split("-",expand=True).astype(float)
id_all['fico']=0.5*(k[0]+k[1])
del id_all['FICO.Range']


In [44]:
# Processing Employment.Length
id_all['Employment.Length']=id_all['Employment.Length'].str.replace('years','')
id_all['Employment.Length']=id_all['Employment.Length'].str.replace('year','')
id_all['Employment.Length']=np.where(id_all['Employment.Length'].str[0]=='<',0, id_all['Employment.Length'])
id_all['Employment.Length']=np.where(id_all['Employment.Length'].str[:2]=='10',10, id_all['Employment.Length'])
id_all['Employment.Length']=pd.to_numeric(id_all['Employment.Length'],errors='coerce')
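
The same conversion can also be written with a regular expression. This is only an alternative sketch on a toy series, not the approach used in this notebook; the names `emp` and `years` are hypothetical.

```python
import pandas as pd

# Toy values in the same style as Employment.Length
emp = pd.Series(["5 years", "< 1 year", "10+ years", "1 year", None])

# Take the first run of digits from each entry and convert it to a number
years = pd.to_numeric(emp.str.extract(r"(\d+)", expand=False), errors="coerce")

# "< 1 year" should count as 0, so overwrite entries that start with "<"
years = years.where(~emp.str.startswith("<", na=False), 0.0)
print(years.tolist())
```

Missing entries stay missing, so they can still be imputed later like the other columns.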


In [45]:
id_all.head()

Unnamed: 0,Amount.Requested,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length,data_type,fico
0,25000.0,18.49,60 months,debt_consolidation,27.56,VA,MORTGAGE,8606.56,11.0,15210.0,3.0,5.0,train,722.0
1,19750.0,17.27,60 months,debt_consolidation,13.39,NY,MORTGAGE,6737.5,14.0,19070.0,3.0,4.0,train,712.0
2,2100.0,14.33,36 months,major_purchase,3.5,LA,OWN,1000.0,13.0,893.0,1.0,0.0,train,692.0
3,28000.0,16.29,36 months,credit_card,19.62,NV,MORTGAGE,7083.33,12.0,38194.0,1.0,10.0,train,712.0
4,24250.0,12.23,60 months,credit_card,23.79,OH,MORTGAGE,5833.33,6.0,31061.0,2.0,10.0,train,732.0


In [46]:
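# Creating dummies with frequency cutoff
cat_col=['State' , 'Home.Ownership', 'Loan.Length', 'Loan.Purpose']
for col in cat_col :
    k=id_all[col].value_counts(dropna=False)
    # keep categories seen more than 50 times; [:-1] leaves out the least frequent of them
    # so that it, together with the rare categories, acts as the baseline
    cats=k.index[k>50][:-1]
    for cat in cats:
        name=col+'_'+cat
        id_all[name]=(id_all[col]==cat).astype(int)
# this line sits outside the loop, so it removes only the last column in cat_col ('Loan.Purpose')
del id_all[col] 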

In [None]:
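# dropping the remaining original categorical columns now that their dummies exist
del id_all['State']
del id_all['Home.Ownership']
del id_all['Loan.Length']
# 'Loan.Purpose' was already removed at the end of the previous cell
#del id_all['Loan.Purpose']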
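
The frequency-cutoff logic can be wrapped in a helper that also drops the original column in the same step. This is a sketch on toy data; the function name `add_frequent_dummies` and the `min_count` parameter are hypothetical and not part of this notebook.

```python
import pandas as pd

def add_frequent_dummies(df, col, min_count):
    """Return a copy of df with 0/1 columns for the frequent categories of `col`.

    The least frequent qualifying category is left out so that it, together
    with all rare categories, forms the baseline.
    """
    counts = df[col].value_counts()
    frequent = counts.index[counts > min_count][:-1]
    out = df.drop(columns=[col])
    for cat in frequent:
        out[f"{col}_{cat}"] = (df[col] == cat).astype(int)
    return out

# Toy data: CA x4, NY x3, TX x2, WY x1
toy = pd.DataFrame({"State": ["CA"] * 4 + ["NY"] * 3 + ["TX"] * 2 + ["WY"]})
result = add_frequent_dummies(toy, "State", min_count=1)
print(list(result.columns))
```

With `min_count=1`, WY is too rare to get a column and TX is the omitted baseline, so only CA and NY get dummy columns.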

In [51]:
id_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 0 to 299
Data columns (total 32 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Amount.Requested                 2495 non-null   float64
 1   Interest.Rate                    2200 non-null   float64
 2   Debt.To.Income.Ratio             2499 non-null   float64
 3   Monthly.Income                   2497 non-null   float64
 4   Open.CREDIT.Lines                2491 non-null   float64
 5   Revolving.CREDIT.Balance         2495 non-null   float64
 6   Inquiries.in.the.Last.6.Months   2497 non-null   float64
 7   Employment.Length                2420 non-null   float64
 8   data_type                        2500 non-null   object 
 9   fico                             2500 non-null   float64
 10  State_CA                         2500 non-null   int32  
 11  State_NY                         2500 non-null   int32  
 12  State_TX             

In [54]:
# checking for missing values in the data
id_all.isnull().sum()


Amount.Requested                   0
Interest.Rate                      0
Debt.To.Income.Ratio               0
Monthly.Income                     0
Open.CREDIT.Lines                  0
Revolving.CREDIT.Balance           0
Inquiries.in.the.Last.6.Months     0
Employment.Length                  0
data_type                          0
fico                               0
State_CA                           0
State_NY                           0
State_TX                           0
State_FL                           0
State_IL                           0
State_GA                           0
State_PA                           0
State_NJ                           0
State_VA                           0
State_MA                           0
State_OH                           0
State_MD                           0
State_NC                           0
State_CO                           0
Home.Ownership_MORTGAGE            0
Home.Ownership_RENT                0
Loan.Length_36 months              0
L

In [53]:
# imputing missing values with averages of the columns
for col in id_all.columns:
    if id_all[col].isnull().sum()>0:
        id_all.loc[id_all[col].isnull(),col]=id_all[col].mean()
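
The cell above takes each column mean over the combined train and test rows. A possible variant, sketched here on a toy frame, computes the means from the training rows only so that no test information enters the imputation. The frame and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with the same data_type marker used in this notebook
toy = pd.DataFrame({
    "Monthly.Income": [1000.0, np.nan, 3000.0, np.nan],
    "data_type": ["train", "train", "train", "test"],
})

# Column means from training rows only
train_means = toy.loc[toy["data_type"] == "train"].mean(numeric_only=True)

# fillna with a Series fills each column with the value under the matching label
imputed = toy.fillna(train_means)
print(imputed)
```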

In [55]:
id_all.head()

Unnamed: 0,Amount.Requested,Interest.Rate,Debt.To.Income.Ratio,Monthly.Income,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length,data_type,fico,...,State_NC,State_CO,Home.Ownership_MORTGAGE,Home.Ownership_RENT,Loan.Length_36 months,Loan.Purpose_debt_consolidation,Loan.Purpose_credit_card,Loan.Purpose_other,Loan.Purpose_home_improvement,Loan.Purpose_major_purchase
0,25000.0,18.49,27.56,8606.56,11.0,15210.0,3.0,5.0,train,722.0,...,0,0,1,0,0,1,0,0,0,0
1,19750.0,17.27,13.39,6737.5,14.0,19070.0,3.0,4.0,train,712.0,...,0,0,1,0,0,1,0,0,0,0
2,2100.0,14.33,3.5,1000.0,13.0,893.0,1.0,0.0,train,692.0,...,0,0,0,0,1,0,0,0,0,1
3,28000.0,16.29,19.62,7083.33,12.0,38194.0,1.0,10.0,train,712.0,...,0,0,1,0,1,0,1,0,0,0
4,24250.0,12.23,23.79,5833.33,6.0,31061.0,2.0,10.0,train,732.0,...,0,0,1,0,0,0,1,0,0,0


In [56]:
pd_train=id_all[id_all['data_type']=='train']
pd_test=id_all[id_all['data_type']=='test']

In [58]:
# drop returns new frames here, which avoids modifying a slice of id_all in place
pd_test=pd_test.drop(['data_type','Interest.Rate'],axis=1)
pd_train=pd_train.drop('data_type',axis=1)
del id_all

In [59]:
pd_train.shape,pd_test.shape

((2200, 31), (300, 30))

In [64]:
# breaking data into two parts
from sklearn.model_selection import train_test_split
t1,t2=train_test_split(pd_train,test_size=0.2,random_state=123)
# test_size=0.2, means the data is being split into two parts in the 80:20 ratio.
# t1 will contain 80%, and t2 will get 20% of the obs.
# random_state=123, simply makes the random process reproducible
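
A quick check of the split sizes and of the reproducibility that `random_state` gives, sketched on a toy frame (the data is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

part_a, part_b = train_test_split(toy, test_size=0.2, random_state=123)
# Repeating the call with the same random_state gives the same rows
again_a, _ = train_test_split(toy, test_size=0.2, random_state=123)

print(part_a.shape, part_b.shape)
```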


In [65]:
#breaking data in training and test for X & Y
x_train=t1.drop('Interest.Rate',axis=1)
y_train=t1['Interest.Rate']
x_test=t2.drop('Interest.Rate',axis=1)
y_test=t2['Interest.Rate']


# Starting with Linear regression

In [66]:
# import the function for Linear Regression
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
# fit function , builds the model ( parameter estimation etc)
lr.fit(x_train,y_train)

LinearRegression()

In [68]:
predicted_values=lr.predict(x_test)
from sklearn.metrics import mean_absolute_error
# scikit-learn expects the true values first; MAE is symmetric, so the value is unchanged
mean_absolute_error(y_test,predicted_values)

1.653163703521755

In [69]:
from sklearn.model_selection import cross_val_score

In [70]:
errors =np.abs(cross_val_score(lr,x_train,y_train,cv=10,scoring='neg_mean_absolute_error'))
# cv=10 means 10-fold cross-validation
# scikit-learn scoring functions follow the convention that higher is better
# to stay consistent with that, the scorer available for mean absolute error is neg_mean_absolute_error
# we take the absolute value with np.abs to get the usual positive errors
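
The sign convention can be seen directly on synthetic data. This is a small sketch; the data and coefficients are made up and unrelated to the loan data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The scorer returns negated errors, so every fold score is <= 0
raw_scores = cross_val_score(LinearRegression(), X, y, cv=5,
                             scoring="neg_mean_absolute_error")
mae_per_fold = -raw_scores
print(raw_scores.round(3), mae_per_fold.round(3))
```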

In [71]:
errors

array([1.45111811, 1.58646582, 1.61879344, 1.57590657, 1.66592409,
       1.70202519, 1.60983759, 1.62351837, 1.5811983 , 1.52873518])

In [72]:
avg_error=errors.mean()
error_std=np.std(errors)
avg_error,error_std

(1.5943522675407418, 0.06618225100215)

# Using Regularization Ridge and Lasso

In [73]:
from sklearn.linear_model import Ridge,Lasso
from sklearn.model_selection import GridSearchCV


In [74]:
# we are going to try out 200 evenly spaced values from 1 to 100
# what we have been referring to as lambda is named alpha in the sklearn implementation
# we'll pass this dictionary to the GridSearchCV function
lambdas=np.linspace(1,100,200)
params={'alpha':lambdas}

In [76]:
# this is the model for which we are trying to estimate the best value of lambda
model=Ridge(fit_intercept=True)

In [77]:
# this function tries out all the values of lambda passed to it and records the cross-validated performance
# we'll extract the results using a custom function
grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring='neg_mean_absolute_error')

In [78]:
grid_search.fit(x_train,y_train)

GridSearchCV(cv=10, estimator=Ridge(),
             param_grid={'alpha': array([  1.        ,   1.49748744,   1.99497487,   2.49246231,
         2.98994975,   3.48743719,   3.98492462,   4.48241206,
         4.9798995 ,   5.47738693,   5.97487437,   6.47236181,
         6.96984925,   7.46733668,   7.96482412,   8.46231156,
         8.95979899,   9.45728643,   9.95477387,  10.45226131,
        10.94974874,  11.44723618,  11.94472362,  12.44221106,
        12.93969849,  13.43718593,  13...
        86.5678392 ,  87.06532663,  87.56281407,  88.06030151,
        88.55778894,  89.05527638,  89.55276382,  90.05025126,
        90.54773869,  91.04522613,  91.54271357,  92.04020101,
        92.53768844,  93.03517588,  93.53266332,  94.03015075,
        94.52763819,  95.02512563,  95.52261307,  96.0201005 ,
        96.51758794,  97.01507538,  97.51256281,  98.01005025,
        98.50753769,  99.00502513,  99.50251256, 100.        ])},
             scoring='neg_mean_absolute_error')

In [79]:
grid_search.best_estimator_

Ridge(alpha=14.4321608040201)

In [97]:
#Creating a report which lists the top-ranked parameter settings
def report(results, n_top=3):
    for i in range(1, n_top + 1): 
        # np.flatnonzero returns the indices of `True` values in a boolean array
        candidate = np.flatnonzero(results['rank_test_score'] == i)[0]
        # print the rank of the model
        # values passed to format are placed in the curly brackets when printing
        # 0, 1 etc. refer to the position of the values passed to format
        # .5f means 5 digits after the decimal point
        print("Model with rank: {0}".format(i))
        # this prints the cross-validated performance and its standard deviation
        print("Mean validation score: {0:.5f} (std: {1:.5f})".format(
              results['mean_test_score'][candidate],
              results['std_test_score'][candidate]))
        # prints the parameter combination for which this performance was obtained
        print("Parameters: {0}".format(results['params'][candidate]))
        print("")
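
The same information can also be viewed as a table by loading `cv_results_` into a DataFrame. This is an alternative sketch on synthetic data, not a replacement for `report`; the names `search` and `summary` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=120)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

# cv_results_ is a dict of equal-length arrays, so it converts directly
summary = (pd.DataFrame(search.cv_results_)
           [["rank_test_score", "mean_test_score", "std_test_score", "params"]]
           .sort_values("rank_test_score")
           .reset_index(drop=True))
print(summary)
```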

In [98]:
report(grid_search.cv_results_,3)

Model with rank: 1
Mean validation score: -1.59223 (std: 0.06409)
Parameters: {'alpha': 14.4321608040201}

Model with rank: 2
Mean validation score: -1.59223 (std: 0.06407)
Parameters: {'alpha': 14.92964824120603}

Model with rank: 3
Mean validation score: -1.59223 (std: 0.06410)
Parameters: {'alpha': 13.93467336683417}



In [99]:
ridge=grid_search.best_estimator_
ridge.fit(x_train,y_train)

Ridge(alpha=14.4321608040201)

In [105]:
# there is no ideal range for any parameter search
# in general, whatever range you search in,
# if the best value comes out at either edge of the range,
# you should expand the range on that side
# however, if the largest value of lambda keeps coming out as best,
# that suggests the variables carry little signal and are being shrunk heavily
# the same on the lower side suggests the variables are useful and little penalty is needed
lambdas=np.linspace(0.001,2,200)
params={'alpha':lambdas}
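
The edge check described in the comments above can be automated. The sketch below uses synthetic data and a log-spaced grid, which is an assumption of this example (the notebook uses `np.linspace`); log spacing covers several orders of magnitude with few candidates.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=150)

# 9 candidates from 0.001 to 10, evenly spaced on a log scale
alphas = np.logspace(-3, 1, 9)
search = GridSearchCV(Lasso(max_iter=10000), {"alpha": alphas},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
# If the winner sits on either end of the grid, the range should be widened
on_edge = bool(np.isclose(best_alpha, alphas[0]) or np.isclose(best_alpha, alphas[-1]))
print(best_alpha, "-> widen the grid" if on_edge else "-> inside the grid")
```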

In [106]:
modell=Lasso(fit_intercept=True)

In [111]:
grid_search=GridSearchCV(modell,param_grid=params,cv=10,scoring='neg_mean_absolute_error')

In [112]:
grid_search.fit(x_train,y_train)

GridSearchCV(cv=10, estimator=Lasso(),
             param_grid={'alpha': array([1.00000000e-03, 1.10452261e-02, 2.10904523e-02, 3.11356784e-02,
       4.11809045e-02, 5.12261307e-02, 6.12713568e-02, 7.13165829e-02,
       8.13618090e-02, 9.14070352e-02, 1.01452261e-01, 1.11497487e-01,
       1.21542714e-01, 1.31587940e-01, 1.41633166e-01, 1.51678392e-01,
       1.61723618e-01, 1.71768844e-01, 1.81814070e-01, 1...
       1.76895980e+00, 1.77900503e+00, 1.78905025e+00, 1.79909548e+00,
       1.80914070e+00, 1.81918593e+00, 1.82923116e+00, 1.83927638e+00,
       1.84932161e+00, 1.85936683e+00, 1.86941206e+00, 1.87945729e+00,
       1.88950251e+00, 1.89954774e+00, 1.90959296e+00, 1.91963819e+00,
       1.92968342e+00, 1.93972864e+00, 1.94977387e+00, 1.95981910e+00,
       1.96986432e+00, 1.97990955e+00, 1.98995477e+00, 2.00000000e+00])},
             scoring='neg_mean_absolute_error')

In [113]:
grid_search.best_estimator_

Lasso(alpha=0.011045226130653268)

In [114]:
report(grid_search.cv_results_,5)

Model with rank: 1
Mean validation score: -1.58988 (std: 0.06283)
Parameters: {'alpha': 0.011045226130653268}

Model with rank: 2
Mean validation score: -1.59217 (std: 0.06307)
Parameters: {'alpha': 0.021090452261306535}

Model with rank: 3
Mean validation score: -1.59335 (std: 0.06584)
Parameters: {'alpha': 0.001}

Model with rank: 4
Mean validation score: -1.59859 (std: 0.06463)
Parameters: {'alpha': 0.0311356783919598}

Model with rank: 5
Mean validation score: -1.60186 (std: 0.06611)
Parameters: {'alpha': 0.04118090452261307}



In [115]:
lasso=grid_search.best_estimator_
lasso.fit(x_train,y_train)

Lasso(alpha=0.011045226130653268)

In [116]:
list(zip(x_train.columns,lasso.coef_))

[('Amount.Requested', 0.00016148741790364033),
 ('Debt.To.Income.Ratio', -0.00391373223008555),
 ('Monthly.Income', -4.1969832983981255e-05),
 ('Open.CREDIT.Lines', -0.03056406263870382),
 ('Revolving.CREDIT.Balance', -2.192096219210904e-06),
 ('Inquiries.in.the.Last.6.Months', 0.3184694451469217),
 ('Employment.Length', 0.01929824885411669),
 ('fico', -0.0870579385779751),
 ('State_CA', -0.0),
 ('State_NY', 0.0),
 ('State_TX', 0.41719463341984836),
 ('State_FL', 0.0),
 ('State_IL', -0.0),
 ('State_GA', -0.0),
 ('State_PA', -0.4908334647768798),
 ('State_NJ', -0.06960805758661913),
 ('State_VA', 0.0),
 ('State_MA', -0.0),
 ('State_OH', -0.0),
 ('State_MD', 0.0),
 ('State_NC', -0.0),
 ('State_CO', 0.0),
 ('Home.Ownership_MORTGAGE', -0.2923210900760527),
 ('Home.Ownership_RENT', -0.0),
 ('Loan.Length_36 months', -3.106721650319979),
 ('Loan.Purpose_debt_consolidation', -0.23481797092212514),
 ('Loan.Purpose_credit_card', -0.2968946024411189),
 ('Loan.Purpose_other', 0.4360595666878573),



# Decision Tree with Randomized Search CV

In [137]:
from sklearn.model_selection import RandomizedSearchCV

In [138]:
from sklearn.tree import DecisionTreeRegressor

In [170]:
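# RandomizedSearchCV/GridSearchCV accept parameter values as dictionaries.
# In the example given below we have constructed a dictionary of the different parameter values that we want to
# try for the decision tree model

params={"splitter":["best","random"],
            "max_depth" : [1,3,5,7,9,11,12],
           "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
           # note: scikit-learn only accepts min_weight_fraction_leaf values in [0, 0.5];
           # candidates above 0.5 fail to fit and are reported with a nan score
           "min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
           # note: "auto" is deprecated in newer scikit-learn releases; for a regression tree it meant the same as None
           "max_features":["auto","log2","sqrt",None],
           "max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90] }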
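
For comparison, here is a sketch of a search space that stays inside the ranges scikit-learn accepts, run on synthetic data. The specific candidate values are assumptions chosen for illustration, not tuned recommendations for the loan data.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

valid_params = {
    "max_depth": [1, 3, 5, 7, 9],
    "min_samples_leaf": [1, 2, 5, 10],
    "min_weight_fraction_leaf": [0.0, 0.05, 0.1, 0.2],  # must lie in [0, 0.5]
    "max_features": ["log2", "sqrt", None],             # no "auto"
}

search = RandomizedSearchCV(DecisionTreeRegressor(random_state=0), valid_params,
                            n_iter=8, cv=5, scoring="neg_mean_absolute_error",
                            random_state=0)
search.fit(X, y)

# With only valid candidates, no configuration should end up with a nan score
n_failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print(search.best_params_, n_failed)
```

Keeping every candidate valid means none of the `n_iter` draws is wasted on a configuration that cannot be fitted.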


In [172]:
clf=DecisionTreeRegressor(random_state = 0)

In [173]:
random_search=RandomizedSearchCV(clf, cv=10, param_distributions=params,  scoring='neg_mean_absolute_error',n_iter=10, n_jobs=-1,verbose=True)

In [174]:
random_search.fit(x_train,y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


RandomizedSearchCV(cv=10, estimator=DecisionTreeRegressor(random_state=0),
                   n_jobs=-1,
                   param_distributions={'max_depth': [1, 3, 5, 7, 9, 11, 12],
                                        'max_features': ['auto', 'log2', 'sqrt',
                                                         None],
                                        'max_leaf_nodes': [None, 10, 20, 30, 40,
                                                           50, 60, 70, 80, 90],
                                        'min_samples_leaf': [1, 2, 3, 4, 5, 6,
                                                             7, 8, 9, 10],
                                        'min_weight_fraction_leaf': [0.1, 0.2,
                                                                     0.3, 0.4,
                                                                     0.5, 0.6,
                                                                     0.7, 0.8,
                                         

In [175]:
random_search.best_estimator_

DecisionTreeRegressor(max_depth=11, max_features='auto', max_leaf_nodes=60,
                      min_samples_leaf=5, min_weight_fraction_leaf=0.1,
                      random_state=0, splitter='random')

In [176]:
report(random_search.cv_results_,5)

Model with rank: 1
Mean validation score: -3.05762 (std: 0.18385)
Parameters: {'splitter': 'random', 'min_weight_fraction_leaf': 0.1, 'min_samples_leaf': 5, 'max_leaf_nodes': 60, 'max_features': 'auto', 'max_depth': 11}

Model with rank: 2
Mean validation score: -3.33130 (std: 0.16492)
Parameters: {'splitter': 'random', 'min_weight_fraction_leaf': 0.3, 'min_samples_leaf': 8, 'max_leaf_nodes': 10, 'max_features': None, 'max_depth': 12}

Model with rank: 3
Mean validation score: -3.34263 (std: 0.13876)
Parameters: {'splitter': 'random', 'min_weight_fraction_leaf': 0.4, 'min_samples_leaf': 10, 'max_leaf_nodes': 40, 'max_features': None, 'max_depth': 7}

Model with rank: 4
Mean validation score: -3.38657 (std: 0.16127)
Parameters: {'splitter': 'random', 'min_weight_fraction_leaf': 0.5, 'min_samples_leaf': 4, 'max_leaf_nodes': 80, 'max_features': 'auto', 'max_depth': 1}

Model with rank: 5
Mean validation score: nan (std: nan)
Parameters: {'splitter': 'best', 'min_weight_fraction_leaf': 0.9

# Gradient Boost

In [177]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV


In [178]:
gbm_params={'n_estimators':[50,100,200],
 'learning_rate': [0.01,.05,0.1,0.4,0.8,1],
 'max_depth':[1,2,3,4,5,6],
 'subsample':[0.5,0.8,1],
 'max_features':[5,10,15,20,28]
           }

In [179]:
modelGB=GradientBoostingRegressor()
random_search=RandomizedSearchCV(modelGB,scoring='neg_mean_absolute_error',
                                 param_distributions=gbm_params,cv=10,n_iter=10, n_jobs=-1,verbose=False)


In [181]:
random_search.fit(x_train,y_train)

RandomizedSearchCV(cv=10, estimator=GradientBoostingRegressor(), n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.05, 0.1, 0.4,
                                                          0.8, 1],
                                        'max_depth': [1, 2, 3, 4, 5, 6],
                                        'max_features': [5, 10, 15, 20, 28],
                                        'n_estimators': [50, 100, 200],
                                        'subsample': [0.5, 0.8, 1]},
                   scoring='neg_mean_absolute_error', verbose=False)

In [182]:
report(random_search.cv_results_,3)

Model with rank: 1
Mean validation score: -1.33933 (std: 0.06035)
Parameters: {'subsample': 0.5, 'n_estimators': 200, 'max_features': 20, 'max_depth': 5, 'learning_rate': 0.1}

Model with rank: 2
Mean validation score: -1.39358 (std: 0.05465)
Parameters: {'subsample': 0.8, 'n_estimators': 100, 'max_features': 10, 'max_depth': 1, 'learning_rate': 0.8}

Model with rank: 3
Mean validation score: -1.43570 (std: 0.09526)
Parameters: {'subsample': 1, 'n_estimators': 50, 'max_features': 28, 'max_depth': 5, 'learning_rate': 0.4}



# XG Boost with parameter tuning

In [183]:
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBRegressor


In [188]:
xgb_params = { "n_estimators":[25,50,100,150,200],
              "gamma":[0,2,5,8,10],
               "max_depth": [2,3,4,5,6,7,8],
               "min_child_weight":[0.5,1,2,5,10]
}

In [189]:
xgb1=XGBRegressor(subsample=0.8,colsample_bylevel=0.8,colsample_bytree=0.8)
grid_search=GridSearchCV(xgb1,cv=10,param_grid=xgb_params,scoring='neg_mean_absolute_error',verbose=False,n_jobs=-1)

In [190]:
grid_search.fit(x_train,y_train)

GridSearchCV(cv=10,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=0.8,
                                    colsample_bynode=None, colsample_bytree=0.8,
                                    enable_categorical=False, gamma=None,
                                    gpu_id=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_...
                                    num_parallel_tree=None, predictor=None,
                                    random_state=None, reg_alpha=None,
                                    reg_lambda=None, scale_pos_weight=None,
                                    subsample=0.8, tree_method=None,


In [200]:
grid_search.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
             colsample_bynode=1, colsample_bytree=0.8, enable_categorical=False,
             gamma=5, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=4, min_child_weight=2, missing=nan,
             monotone_constraints='()', n_estimators=25, n_jobs=6,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=0.8,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [187]:
report(grid_search.cv_results_,3)

Model with rank: 1
Mean validation score: -1.34584 (std: 0.08157)
Parameters: {'n_estimators': 25}

Model with rank: 2
Mean validation score: -1.37796 (std: 0.09421)
Parameters: {'n_estimators': 50}

Model with rank: 3
Mean validation score: -1.41636 (std: 0.09517)
Parameters: {'n_estimators': 100}



In [191]:
xgb_params={
 'subsample':[i/10 for i in range(5,11)],
 'colsample_bytree':[i/10 for i in range(5,11)],
 'colsample_bylevel':[i/10 for i in range(5,11)]
}

In [198]:
xgb3=XGBRegressor(learning_rate=0.300000012,n_estimators=25,min_child_weight=2,gamma=5,max_depth=4)
random_search=RandomizedSearchCV(xgb3,param_distributions=xgb_params,cv=10, n_iter=20,scoring='neg_mean_absolute_error',
                                  n_jobs=-1,verbose=False)
random_search.fit(x_train,y_train)
report(random_search.cv_results_,3)

Model with rank: 1
Mean validation score: -1.30246 (std: 0.07150)
Parameters: {'subsample': 0.9, 'colsample_bytree': 0.8, 'colsample_bylevel': 0.8}

Model with rank: 2
Mean validation score: -1.30513 (std: 0.07153)
Parameters: {'subsample': 0.8, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.8}

Model with rank: 3
Mean validation score: -1.30793 (std: 0.05549)
Parameters: {'subsample': 0.7, 'colsample_bytree': 1.0, 'colsample_bylevel': 0.9}



In [201]:
random_search.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
             colsample_bynode=1, colsample_bytree=0.8, enable_categorical=False,
             gamma=5, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=4, min_child_weight=2, missing=nan,
             monotone_constraints='()', n_estimators=25, n_jobs=6,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=0.9,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [199]:
xgb_params={
     'reg_lambda':[i/10 for i in range(0,50)],
     'reg_alpha':[i/10 for i in range(0,50)]
}

In [203]:
xgb4=XGBRegressor(learning_rate=0.300000012,n_estimators=25,min_child_weight=2, gamma=5,max_depth=4, colsample_bylevel= 0.8,
                   colsample_bytree= 0.8, subsample= 0.9)
random_search=RandomizedSearchCV(xgb4,param_distributions=xgb_params,cv=10, n_iter=20,scoring='neg_mean_absolute_error',
                                  n_jobs=-1,verbose=False)
random_search.fit(x_train,y_train)
report(random_search.cv_results_,3)


Model with rank: 1
Mean validation score: -1.28488 (std: 0.05015)
Parameters: {'reg_lambda': 1.1, 'reg_alpha': 3.8}

Model with rank: 2
Mean validation score: -1.28977 (std: 0.05323)
Parameters: {'reg_lambda': 2.1, 'reg_alpha': 2.4}

Model with rank: 3
Mean validation score: -1.29011 (std: 0.03941)
Parameters: {'reg_lambda': 1.3, 'reg_alpha': 3.8}



In [208]:
random_search.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
             colsample_bynode=1, colsample_bytree=0.8, enable_categorical=False,
             gamma=5, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=4, min_child_weight=2, missing=nan,
             monotone_constraints='()', n_estimators=25, n_jobs=6,
             num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=3.8, reg_lambda=1.1, scale_pos_weight=1, subsample=0.9,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [212]:
xgb5=XGBRegressor(learning_rate=0.300000012,n_estimators=25,min_child_weight=2,
                   gamma=5,max_depth=4, colsample_bylevel= 0.8,
                   colsample_bytree= 0.8, subsample= 0.9,
                   reg_lambda=1.1,reg_alpha=3.8)

In [213]:
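from sklearn.model_selection import cross_val_score
# note: this scores xgb4 (without the tuned reg_lambda/reg_alpha); pass xgb5 instead to score the final model
scores=-cross_val_score(xgb4,x_train,y_train,scoring='neg_mean_absolute_error',verbose=False,n_jobs=-1,cv=10)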

In [214]:
np.mean(scores)

1.3024568256789988

In [217]:
xgb5.fit(x_train,y_train)
predicted_values=xgb5.predict(x_test)
mean_absolute_error(y_test,predicted_values)

1.34216091299057

In [218]:
random_search.fit(x_train,y_train)
predicted_values=random_search.predict(x_test)
mean_absolute_error(y_test,predicted_values)

1.33129302406311

In [219]:
lasso.fit(x_train,y_train)
predicted_values=lasso.predict(x_test)
mean_absolute_error(y_test,predicted_values)

1.644737283350567
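
The hold-out errors printed in this notebook can be collected in one place for comparison. The sketch below only copies the values shown in the outputs above; the labels are descriptive names chosen for this example.

```python
import pandas as pd

# Hold-out MAE values copied from the cell outputs in this notebook
holdout_mae = pd.Series({
    "LinearRegression": 1.653163703521755,
    "Lasso (alpha=0.011)": 1.644737283350567,
    "XGBRegressor xgb5": 1.34216091299057,
    "XGBRegressor via random_search": 1.33129302406311,
}).sort_values()

print(holdout_mae.round(3))
```

Sorted this way, the two XGBoost results come first and the two linear models last, in line with the cross-validated scores reported earlier.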
