# Fullstory NYC Taxi Case Study - Feature Selection Notebook

Laura Evans September 2019

Project Setup:
    
The objective of this case study is to maximize my income as a taxi driver, assuming I have 10 hours each week to earn extra money. The dataset provided is the NYC Taxi and Limousine Commission (TLC) Yellow taxi trip records for June 2017.

This notebook is kept separate because of the memory demands of the hyperparameter tuning.

## Import libraries

In [1]:
import pandas as pd
import datetime as dt
import numpy as np
import plotly.graph_objs as go  # visualization
import plotly.offline as py  # visualization
import matplotlib.pyplot as plt  # plotting
import seaborn as sns
import gc
from sklearn.model_selection import train_test_split
gc.collect()

0

## Import June 2017 Taxi Training Data

In [2]:
#load data
data_train=pd.read_csv("C:/TLC/data_train.csv",index_col=[0])

#view data
data_train.head()


elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison



Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,...,lat2,direct_distance,speed_mph,dollar_per_min,airport_flag,direction,mean_speed,credit_tip_ratio,tip_impute,total_income
1387393,1,1,0.8,1,N,161,229,2,6.5,0.0,...,40.756589,1.687178,6.501129,0.880361,1,172.392369,5.303522,0.22254,1.0,7.5
1690849,2,1,1.62,1,N,161,90,1,9.0,0.0,...,40.742546,3.840808,8.849772,0.819423,1,39.025937,6.742664,0.214643,1.96,10.96
2917992,2,5,2.18,1,N,144,90,1,11.0,0.5,...,40.742546,3.931531,9.665024,0.812808,1,-91.640765,10.062616,0.219231,2.0,13.0
939511,1,1,1.0,1,N,231,45,1,8.0,0.5,...,40.713058,1.527833,5.504587,0.733945,1,146.317076,7.786154,0.23625,1.86,9.86
5103145,1,1,4.7,1,N,142,231,1,20.0,0.5,...,40.718696,10.46105,10.317073,0.731707,1,65.329742,13.837161,0.213509,4.25,24.25


In [3]:
data_train.shape

(4665435, 46)

## The Models

## Predict Income (Fare + Tip) 

I will drop PULocationID and DOLocationID in favor of latitude and longitude. The ID values are assigned somewhat haphazardly, whereas latitude and longitude reflect the geospatial relationship between the pickup and dropoff points.

I will drop the extra amount, MTA tax, and improvement surcharge because they take few unique values and have little inherent predictive power.

Finally, the tip amount, imputed tip amount, total amount, and fare amount are already components of the total_income model target, so I drop them as well.

In [4]:
data_model=data_train.drop(['PULocationID','DOLocationID','extra','mta_tax','tip_amount',
                             'fare_amount','total_amount','improvement_surcharge','tip_impute','dollar_per_min',
                            'direct_distance','PU_time','DO_time'],axis=1)

In [5]:
#convert the store and forward Y and N values to an indicator
data_model['store_and_fwd_ind']=np.where(data_model['store_and_fwd_flag']=="Y",1,0)
data_model=data_model.drop(['store_and_fwd_flag'],axis=1)



In [6]:
#create dataframe of continuous attributes for correlation calculations
data_cont=data_model.drop(['week_day','Borough_PU','Zone_PU','service_zone_PU',
                           'Borough_DO','Zone_DO','service_zone_DO','Airport_Ind'],axis=1)

In [7]:
#make a dataframe of continuous attributes for correlation calculations
#one hot encode the weekdays and airport classification
one_hot_week_day=pd.get_dummies(data_model['week_day'])
one_hot_airport=pd.get_dummies(data_model['Airport_Ind'])
one_hot_airport=one_hot_airport.add_suffix('_airport')

one_hot_pu_borough=pd.get_dummies(data_model['Borough_PU'])
one_hot_pu_borough = one_hot_pu_borough.add_suffix('_PU')
one_hot_do_borough=pd.get_dummies(data_model['Borough_DO'])
one_hot_do_borough = one_hot_do_borough.add_suffix('_DO')

data_cont = data_cont.join(one_hot_week_day)
data_cont = data_cont.join(one_hot_airport)
data_cont = data_cont.join(one_hot_pu_borough)
data_cont = data_cont.join(one_hot_do_borough)

Calculate Kendall's tau between attributes. I refrain from using Pearson's correlation because it assumes a linear relationship between attributes.
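
To illustrate why a rank-based measure suits non-linear relationships, here is a minimal pure-Python sketch. The helper names `kendall_tau` and `pearson_r` and the toy data are illustrative assumptions, not part of the notebook. The sketch computes the simple tau-a variant with no tie correction, and its quadratic pairwise loop is only practical for small inputs.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs (no tie correction)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def pearson_r(x, y):
    """Pearson's linear correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy data (made up): y grows monotonically but very non-linearly with x.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

tau = kendall_tau(x, y)  # every pair is concordant
r = pearson_r(x, y)      # penalized by the curvature

print(f"Kendall tau = {tau:.2f}, Pearson r = {r:.2f}")
```

Because every pair of points is ordered the same way in `x` and `y`, tau reports a perfect monotonic association, while Pearson's coefficient is pulled well below 1 by the curvature.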

In [8]:
#Correlation of attributes
#correlation=np.around(data_cont.corr(method="kendall"),2)
#correlation.to_csv("correlation_cont.csv")

In [9]:
correlation.style.background_gradient(cmap='coolwarm')

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,payment_type,tolls_amount,month,day,hour,minute,duration_min,tmpf,relh,precip_inch,lng1,lat1,lng2,lat2,speed_mph,airport_flag,direction,mean_speed,credit_tip_ratio,total_income,store_and_fwd_ind,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday,EWR_airport,JFK_airport,LGA_airport,None_airport,Bronx_PU,Brooklyn_PU,EWR_PU,Manhattan_PU,Queens_PU,Staten Island_PU,Bronx_DO,Brooklyn_DO,EWR_DO,Manhattan_DO,Queens_DO,Staten Island_DO
VendorID,1.0,0.22,0.02,0.01,-0.01,0.02,0,-0.0,0.01,-0.0,0.01,0.0,-0.0,0.0,0.01,0.0,0.01,0.0,0.03,-0.02,0.0,0.01,0.0,0.01,-0.06,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,0.01,0.01,-0.02,0.0,-0.0,0.0,-0.01,0.02,0.0,0.0,-0.0,0.0,-0.0,0.01,-0.0
passenger_count,0.22,1.0,0.02,0.01,0.02,0.02,0,-0.0,0.02,-0.0,0.02,0.01,-0.01,0.0,-0.01,-0.0,-0.0,-0.0,0.01,-0.01,-0.0,0.01,0.0,0.02,-0.01,-0.0,-0.01,0.03,0.02,-0.01,-0.01,-0.01,0.0,0.01,0.01,-0.01,-0.0,-0.0,-0.0,-0.01,0.01,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0
trip_distance,0.02,0.02,1.0,0.21,-0.07,0.3,0,0.0,0.01,-0.01,0.65,-0.02,0.01,-0.01,0.04,-0.07,0.09,-0.07,0.35,-0.35,-0.03,0.37,-0.34,0.78,0.0,-0.01,-0.0,0.01,0.04,-0.01,-0.01,-0.01,0.06,0.23,0.25,-0.35,0.01,0.03,0.0,-0.26,0.28,0.0,0.08,0.18,0.06,-0.32,0.23,0.02
RatecodeID,0.01,0.01,0.21,1.0,-0.01,0.49,0,0.0,-0.0,-0.0,0.2,0.01,-0.0,0.0,0.12,-0.12,0.04,-0.08,0.18,-0.53,0.04,0.19,-0.02,0.21,0.0,-0.0,0.01,-0.01,0.02,-0.01,-0.01,-0.01,0.29,0.76,-0.03,-0.53,-0.0,-0.02,0.01,-0.29,0.34,-0.0,-0.01,-0.03,0.29,-0.16,0.2,-0.0
payment_type,-0.01,0.02,-0.07,-0.01,1.0,-0.05,0,0.0,-0.02,-0.0,-0.06,0.02,-0.01,0.0,0.03,0.03,0.03,0.03,-0.04,0.04,0.0,-0.04,-0.01,-0.06,0.01,-0.0,0.0,0.03,0.02,-0.02,-0.01,-0.02,-0.01,0.01,-0.05,0.04,0.02,0.01,-0.0,-0.01,0.0,0.0,0.04,-0.01,-0.01,-0.01,0.02,0.0
tolls_amount,0.02,0.02,0.3,0.49,-0.05,1.0,0,0.0,-0.01,-0.0,0.28,0.01,-0.0,-0.01,0.17,0.01,0.1,-0.0,0.23,-0.7,0.01,0.25,0.08,0.3,0.0,0.0,0.02,-0.03,0.0,0.0,0.0,0.0,0.19,0.36,0.57,-0.7,0.0,-0.02,0.01,-0.41,0.47,0.0,0.07,0.0,0.19,-0.27,0.31,0.06
month,-0.0,0.0,0.0,-0.0,0.0,-0.0,1,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0
day,-0.0,-0.0,0.0,0.0,0.0,0.0,0,1.0,-0.01,-0.0,0.0,0.28,0.06,0.02,-0.0,-0.0,-0.0,-0.0,0.0,-0.01,-0.0,-0.01,0.0,0.0,-0.0,0.02,0.0,-0.07,-0.04,-0.03,0.04,0.08,0.0,0.0,0.0,-0.01,-0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.01,0.0
hour,0.01,0.02,0.01,-0.0,-0.02,-0.01,0,-0.01,1.0,-0.01,0.02,0.1,-0.15,-0.04,-0.0,-0.0,-0.01,0.0,-0.02,0.0,-0.0,-0.01,0.15,0.01,-0.0,-0.0,0.01,-0.02,-0.05,0.02,0.02,0.02,-0.02,0.01,-0.01,0.0,-0.0,-0.01,-0.0,-0.01,0.01,0.0,-0.01,0.01,-0.02,0.03,-0.04,0.0
minute,-0.0,-0.0,-0.01,-0.0,-0.0,-0.0,0,-0.0,-0.01,1.0,-0.01,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.01,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0


Even though my model algorithm, LightGBM, will handle correlation between attributes, I will review the attributes with high correlation. Correlation can distort the feature importance ranking in some algorithms. LightGBM is supposed to rank one of two highly correlated variables as significant and the other as insignificant.

In [10]:
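#keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
#use np directly: the pd.np alias is deprecated and absent from newer pandas versions
correlation_pairs=correlation.where(np.triu(np.ones(correlation.shape), k=1).astype(bool)).stack().reset_index()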
correlation_pairs.columns=['attribute1','attribute2','kendalls_tau']
high_corr=correlation_pairs.query('kendalls_tau>0.70 or kendalls_tau<-0.70')


In [11]:
print("Attributes with High Correlation")
high_corr

Attributes with High Correlation


Unnamed: 0,attribute1,attribute2,kendalls_tau
113,trip_distance,total_income,0.78
167,RatecodeID,JFK_airport,0.76
437,duration_min,total_income,0.84
736,airport_flag,LGA_airport,-0.74
737,airport_flag,None_airport,1.0
742,airport_flag,Queens_PU,-0.75
1019,EWR_airport,EWR_DO,0.99
1037,LGA_airport,None_airport,-0.74
1054,None_airport,Queens_PU,-0.75
1092,Manhattan_PU,Queens_PU,-0.88


* The airport flag is highly correlated with the airport classification attribute, which takes on the values of JFK, LGA, EWR, and None.

* Trip distance and total income are highly correlated. Total income is the target of one model, so I will keep trip distance in my modeling dataset.

* Duration and total income are correlated, but they are the targets of my two models and will be removed from my modeling dataset.

* I will remove speed_mph from my data and use mean_speed instead, because I will not know my future trip speed, but I will know the historical speed of the road for each trip-week_day-hour.

Note: credit_tip_ratio is negatively correlated with total income, suggesting that trips with a higher tip ratio tend to have lower fare amounts (and lower total income).
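
The idea behind replacing speed_mph with a historical mean can be sketched as a group-wise average. This is a hedged illustration only: the toy trips are made up, the helper `lookup_mean_speed` is hypothetical, and the grouping here uses just week day and hour, whereas the mean_speed column in the data may have been computed with additional keys such as the route.

```python
from collections import defaultdict

# Hypothetical toy trips: (week_day, hour, speed_mph). Values are made up.
trips = [
    ("Monday", 8, 6.0),
    ("Monday", 8, 8.0),
    ("Monday", 23, 18.0),
    ("Saturday", 8, 12.0),
    ("Saturday", 8, 14.0),
]

# Accumulate the sum and count of observed speeds per (week_day, hour) group.
totals = defaultdict(lambda: [0.0, 0])
for week_day, hour, speed in trips:
    totals[(week_day, hour)][0] += speed
    totals[(week_day, hour)][1] += 1

# Historical mean speed per group: available before a trip starts,
# unlike the speed of the trip itself.
mean_speed = {key: total / count for key, (total, count) in totals.items()}


def lookup_mean_speed(week_day, hour, default=None):
    """Return the historical mean speed for a group, or a default if unseen."""
    return mean_speed.get((week_day, hour), default)


print(lookup_mean_speed("Monday", 8))
```

A feature built this way depends only on past trips, so it can be looked up when planning a future shift.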


When deciding which attributes to keep, I would run a mutual information regression to measure each attribute's dependence on the target. I would also run some feature selection models (RFE with an extra-trees base, random forest, Boruta, and Lasso). Memory limits on my computer prevent me from running several of these scripts, so I have commented out my mutual information script below.

In [12]:
#find mutual information of attributes and target
#from sklearn.feature_selection import mutual_info_regression
#the airport indicator columns carry the '_airport' suffix added during one hot encoding
#X=data_cont[['airport_flag','LGA_airport','JFK_airport','EWR_airport','None_airport']]
#y=data_cont['total_income']
#mi=mutual_info_regression(X,y)
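
For intuition about what mutual information measures, here is a small pure-Python sketch for discrete variables. It is a simplified plug-in estimate over binned values, not the nearest-neighbor estimator that `mutual_info_regression` uses for continuous targets. The function names, the toy data, and the single bin edge are all illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        mi += p_xy * math.log(count * n / (count_x[x] * count_y[y]))
    return mi


def to_bins(values, edges):
    """Bin index of each value = number of edges strictly below it."""
    return [sum(v > e for e in edges) for v in values]


# Toy data (made up): an indicator that separates low from high incomes,
# and a second indicator that is unrelated to income.
income = [7.5, 9.9, 11.0, 13.0, 45.0, 52.0, 60.0, 58.0]
informative = [1, 1, 1, 1, 0, 0, 0, 0]
uninformative = [0, 1, 0, 1, 0, 1, 0, 1]

income_bins = to_bins(income, edges=[20.0])
mi_informative = mutual_information(informative, income_bins)
mi_uninformative = mutual_information(uninformative, income_bins)

print(mi_informative, mi_uninformative)
```

An attribute that fully determines the binned target reaches the maximum for a balanced binary target (ln 2), and an independent attribute scores zero. One way to ease the memory pressure described above would be to evaluate the estimate on a random subsample of rows, though the notebook did not do this.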

In [23]:
#split data for early stopping rounds
train,val=train_test_split(data_cont,test_size=0.50,random_state=42)

In [24]:
#create initial modeling dataset
#X_income_train=train.drop(['total_income','duration_min','airport_flag','service_zone_PU','service_zone_DO','Zone_PU','Zone_DO'],axis=1)
X_train=train.drop(['total_income','duration_min','airport_flag','speed_mph'],axis=1)
Y_income_train=train['total_income']
Y_duration_train=train['duration_min']
#X_income_val=val.drop(['total_income','duration_min','airport_flag','service_zone_PU','service_zone_DO','Zone_PU','Zone_DO'],axis=1)
X_val=val.drop(['total_income','duration_min','airport_flag','speed_mph'],axis=1)
Y_income_val=val['total_income'] 
Y_duration_val=val['duration_min']

In [11]:
#create a tuning grid for feature selection
import lightgbm as lgbm
n_estimators=[1000,1500]
learning_rate=[0.1]
max_depth=[5,10,12]
reg_alpha=[0,1,10]
reg_lambda=[0,1,10]
param_grid=dict(learning_rate=learning_rate,n_estimators=n_estimators,max_depth=max_depth,
                reg_alpha=reg_alpha,reg_lambda=reg_lambda)
#validation half of the data, used for early stopping
eval_set=[(X_val,Y_income_val)]
metric="rmse"
model=lgbm.LGBMRegressor(
                         #min_data_in_leaf=20000,
                         #note: bagging_fraction only takes effect when bagging_freq (subsample_freq) is non-zero
                         bagging_fraction=0.66,
                         #note: set here, the metric is not reflected in the training log, which reports l2
                         eval_metric=metric,
                         eval_set=eval_set,
                         early_stopping_rounds=100
                         #categorical_feature=['week_day','Borough_PU','Borough_DO','Airport_Ind']
                        )
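
To make the cost of this grid concrete, the sketch below enumerates the combinations with the standard library. It is an illustration, not part of the tuning code: the names `param_grid_sketch` and `cv_folds` are assumptions, and the fold count mirrors the `cv=2` setting used in the grid search that follows.

```python
from itertools import product

# Same values as the tuning grid; the name is changed to avoid clobbering it.
param_grid_sketch = {
    "learning_rate": [0.1],
    "n_estimators": [1000, 1500],
    "max_depth": [5, 10, 12],
    "reg_alpha": [0, 1, 10],
    "reg_lambda": [0, 1, 10],
}
cv_folds = 2  # assumed to match cv=2 in the grid search

# An exhaustive grid search tries the Cartesian product of all value lists.
candidates = [
    dict(zip(param_grid_sketch, values))
    for values in product(*param_grid_sketch.values())
]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds
n_total_fits = n_cv_fits + 1  # one extra fit on all training rows when refit=True

print(n_candidates, n_cv_fits, n_total_fits)
```

The candidate count agrees with the 54 entries in the mean_fit_time array reported in the results. At several hundred seconds per fit, this explains why the search is slow.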

In [13]:
from sklearn.model_selection import GridSearchCV
grid=GridSearchCV(model,param_grid,cv=2,n_jobs=-1,scoring='neg_median_absolute_error')
feature_grid=grid.fit(X_train,Y_income_train,eval_set=eval_set)
#feature_grid=grid.fit(X_income_train,Y_income_train)


Found `early_stopping_rounds` in params. Will use it instead of argument



[1]	valid_0's l2: 128.711
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's l2: 105.419
[3]	valid_0's l2: 86.4672
[4]	valid_0's l2: 71.0455
[5]	valid_0's l2: 58.4665
[6]	valid_0's l2: 48.2395
[7]	valid_0's l2: 39.9116
[8]	valid_0's l2: 33.0801
[9]	valid_0's l2: 27.5117
[10]	valid_0's l2: 22.9881
[11]	valid_0's l2: 19.256
[12]	valid_0's l2: 16.2033
[13]	valid_0's l2: 13.6943
[14]	valid_0's l2: 11.6499
[15]	valid_0's l2: 9.95722
[16]	valid_0's l2: 8.56971
[17]	valid_0's l2: 7.43117
[18]	valid_0's l2: 6.48255
[19]	valid_0's l2: 5.6928
[20]	valid_0's l2: 5.05525
[21]	valid_0's l2: 4.5013
[22]	valid_0's l2: 4.0584
[23]	valid_0's l2: 3.67579
[24]	valid_0's l2: 3.36463
[25]	valid_0's l2: 3.10019
[26]	valid_0's l2: 2.88049
[27]	valid_0's l2: 2.69437
[28]	valid_0's l2: 2.54224
[29]	valid_0's l2: 2.40812
[30]	valid_0's l2: 2.29706
[31]	valid_0's l2: 2.20003
[32]	valid_0's l2: 2.11781
[33]	valid_0's l2: 2.04868
[34]	valid_0's l2: 1.98892
[35]	valid_0's l2: 1.93912
[36]	

[298]	valid_0's l2: 1.39212
[299]	valid_0's l2: 1.39207
[300]	valid_0's l2: 1.39204
[301]	valid_0's l2: 1.39192
[302]	valid_0's l2: 1.3919
[303]	valid_0's l2: 1.39185
[304]	valid_0's l2: 1.39185
[305]	valid_0's l2: 1.3916
[306]	valid_0's l2: 1.39137
[307]	valid_0's l2: 1.39128
[308]	valid_0's l2: 1.39136
[309]	valid_0's l2: 1.39133
[310]	valid_0's l2: 1.39111
[311]	valid_0's l2: 1.39096
[312]	valid_0's l2: 1.39089
[313]	valid_0's l2: 1.39084
[314]	valid_0's l2: 1.39085
[315]	valid_0's l2: 1.39089
[316]	valid_0's l2: 1.39086
[317]	valid_0's l2: 1.39077
[318]	valid_0's l2: 1.39071
[319]	valid_0's l2: 1.39073
[320]	valid_0's l2: 1.39062
[321]	valid_0's l2: 1.39051
[322]	valid_0's l2: 1.3905
[323]	valid_0's l2: 1.39041
[324]	valid_0's l2: 1.39024
[325]	valid_0's l2: 1.39027
[326]	valid_0's l2: 1.39015
[327]	valid_0's l2: 1.3902
[328]	valid_0's l2: 1.39015
[329]	valid_0's l2: 1.38945
[330]	valid_0's l2: 1.38936
[331]	valid_0's l2: 1.38899
[332]	valid_0's l2: 1.3888
[333]	valid_0's l2: 1.388

[592]	valid_0's l2: 1.37153
[593]	valid_0's l2: 1.37145
[594]	valid_0's l2: 1.37139
[595]	valid_0's l2: 1.37133
[596]	valid_0's l2: 1.37128
[597]	valid_0's l2: 1.37128
[598]	valid_0's l2: 1.37126
[599]	valid_0's l2: 1.37127
[600]	valid_0's l2: 1.37117
[601]	valid_0's l2: 1.3711
[602]	valid_0's l2: 1.3711
[603]	valid_0's l2: 1.37108
[604]	valid_0's l2: 1.37109
[605]	valid_0's l2: 1.37105
[606]	valid_0's l2: 1.37098
[607]	valid_0's l2: 1.3709
[608]	valid_0's l2: 1.37089
[609]	valid_0's l2: 1.37084
[610]	valid_0's l2: 1.37076
[611]	valid_0's l2: 1.37067
[612]	valid_0's l2: 1.37059
[613]	valid_0's l2: 1.3706
[614]	valid_0's l2: 1.37055
[615]	valid_0's l2: 1.37052
[616]	valid_0's l2: 1.37052
[617]	valid_0's l2: 1.37053
[618]	valid_0's l2: 1.37043
[619]	valid_0's l2: 1.3704
[620]	valid_0's l2: 1.37035
[621]	valid_0's l2: 1.37036
[622]	valid_0's l2: 1.37041
[623]	valid_0's l2: 1.3704
[624]	valid_0's l2: 1.37031
[625]	valid_0's l2: 1.37031
[626]	valid_0's l2: 1.37019
[627]	valid_0's l2: 1.3700

[888]	valid_0's l2: 1.36212
[889]	valid_0's l2: 1.36202
[890]	valid_0's l2: 1.36201
[891]	valid_0's l2: 1.36201
[892]	valid_0's l2: 1.36197
[893]	valid_0's l2: 1.36201
[894]	valid_0's l2: 1.36199
[895]	valid_0's l2: 1.36199
[896]	valid_0's l2: 1.36199
[897]	valid_0's l2: 1.36195
[898]	valid_0's l2: 1.36193
[899]	valid_0's l2: 1.36184
[900]	valid_0's l2: 1.3618
[901]	valid_0's l2: 1.36178
[902]	valid_0's l2: 1.3618
[903]	valid_0's l2: 1.3618
[904]	valid_0's l2: 1.36179
[905]	valid_0's l2: 1.36177
[906]	valid_0's l2: 1.36171
[907]	valid_0's l2: 1.3617
[908]	valid_0's l2: 1.3617
[909]	valid_0's l2: 1.3617
[910]	valid_0's l2: 1.36165
[911]	valid_0's l2: 1.36162
[912]	valid_0's l2: 1.3616
[913]	valid_0's l2: 1.36161
[914]	valid_0's l2: 1.36158
[915]	valid_0's l2: 1.36158
[916]	valid_0's l2: 1.36158
[917]	valid_0's l2: 1.36142
[918]	valid_0's l2: 1.36133
[919]	valid_0's l2: 1.36135
[920]	valid_0's l2: 1.36136
[921]	valid_0's l2: 1.36136
[922]	valid_0's l2: 1.36135
[923]	valid_0's l2: 1.36133

[1180]	valid_0's l2: 1.3572
[1181]	valid_0's l2: 1.3572
[1182]	valid_0's l2: 1.35719
[1183]	valid_0's l2: 1.35717
[1184]	valid_0's l2: 1.35718
[1185]	valid_0's l2: 1.35721
[1186]	valid_0's l2: 1.35716
[1187]	valid_0's l2: 1.35716
[1188]	valid_0's l2: 1.35712
[1189]	valid_0's l2: 1.35713
[1190]	valid_0's l2: 1.3571
[1191]	valid_0's l2: 1.35709
[1192]	valid_0's l2: 1.35705
[1193]	valid_0's l2: 1.35703
[1194]	valid_0's l2: 1.35702
[1195]	valid_0's l2: 1.35702
[1196]	valid_0's l2: 1.35698
[1197]	valid_0's l2: 1.35697
[1198]	valid_0's l2: 1.357
[1199]	valid_0's l2: 1.35697
[1200]	valid_0's l2: 1.35694
[1201]	valid_0's l2: 1.35699
[1202]	valid_0's l2: 1.357
[1203]	valid_0's l2: 1.35697
[1204]	valid_0's l2: 1.35694
[1205]	valid_0's l2: 1.35694
[1206]	valid_0's l2: 1.35694
[1207]	valid_0's l2: 1.35692
[1208]	valid_0's l2: 1.35684
[1209]	valid_0's l2: 1.3568
[1210]	valid_0's l2: 1.35678
[1211]	valid_0's l2: 1.35677
[1212]	valid_0's l2: 1.35682
[1213]	valid_0's l2: 1.35678
[1214]	valid_0's l2: 1

[1464]	valid_0's l2: 1.35425
[1465]	valid_0's l2: 1.35425
[1466]	valid_0's l2: 1.35425
[1467]	valid_0's l2: 1.35424
[1468]	valid_0's l2: 1.35426
[1469]	valid_0's l2: 1.35424
[1470]	valid_0's l2: 1.35429
[1471]	valid_0's l2: 1.35429
[1472]	valid_0's l2: 1.3543
[1473]	valid_0's l2: 1.35429
[1474]	valid_0's l2: 1.3543
[1475]	valid_0's l2: 1.35432
[1476]	valid_0's l2: 1.35432
[1477]	valid_0's l2: 1.35427
[1478]	valid_0's l2: 1.35425
[1479]	valid_0's l2: 1.35424
[1480]	valid_0's l2: 1.35423
[1481]	valid_0's l2: 1.35423
[1482]	valid_0's l2: 1.35421
[1483]	valid_0's l2: 1.35419
[1484]	valid_0's l2: 1.35417
[1485]	valid_0's l2: 1.35416
[1486]	valid_0's l2: 1.35415
[1487]	valid_0's l2: 1.35413
[1488]	valid_0's l2: 1.35413
[1489]	valid_0's l2: 1.35416
[1490]	valid_0's l2: 1.35414
[1491]	valid_0's l2: 1.35414
[1492]	valid_0's l2: 1.35414
[1493]	valid_0's l2: 1.35415
[1494]	valid_0's l2: 1.35415
[1495]	valid_0's l2: 1.35411
[1496]	valid_0's l2: 1.35408
[1497]	valid_0's l2: 1.35409
[1498]	valid_0's

In [14]:
#sklearn.externals.joblib is deprecated and removed in newer scikit-learn releases; import joblib directly
import joblib
joblib.dump(feature_grid,'feature_grid_4.pkl')

['feature_grid_4.pkl']

In [15]:
feature_grid

GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=LGBMRegressor(bagging_fraction=0.66, boosting_type='gbdt', class_weight=None,
       colsample_bytree=1.0, early_stopping_rounds=100, eval_metric='rmse',
       eval_set=[(         VendorID  passenger_count  trip_distance  RatecodeID  payment_type  \
3563513         2                1     ...=0.0, reg_lambda=0.0, silent=True,
       subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'learning_rate': [0.1], 'n_estimators': [1000, 1500], 'max_depth': [5, 10, 12], 'reg_alpha': [0, 1, 10], 'reg_lambda': [0, 1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_median_absolute_error', verbose=0)

In [16]:
feature_grid.cv_results_


You are accessing a training score ('mean_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True


You are accessing a training score ('split0_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True


You are accessing a training score ('split1_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True


You are accessing a training score ('std_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True



{'mean_fit_time': array([387.28945243, 392.23439121, 388.67319095, 388.66528261,
        363.08106411, 362.89931679, 377.77932715, 384.4021585 ,
        382.62414777, 571.84080982, 529.25751066, 531.52253902,
        503.99278843, 507.22567427, 555.22936451, 584.83840132,
        550.52448857, 523.86825383, 267.79252315, 300.52709997,
        323.55315018, 323.02460015, 280.01907516, 335.49525225,
        320.66541684, 330.15760553, 324.38901579, 366.37480593,
        397.03802741, 407.52345037, 365.80158448, 302.96564639,
        425.31250286, 430.92187572, 444.01562834, 403.48349774,
        282.78124928, 329.53126216, 311.24414289, 308.2732265 ,
        302.52454937, 328.26186216, 308.25781608, 319.61816692,
        333.77447689, 365.73449409, 396.25690055, 429.0000006 ,
        408.72656059, 388.98435378, 389.61812794, 405.92899179,
        406.91370428, 341.03125477]),
 'mean_score_time': array([269.3947525 , 249.36501825, 324.42638993, 241.52907836,
        231.11926019, 242.0473

In [17]:
score_train = feature_grid.cv_results_['mean_train_score']
score_test = feature_grid.cv_results_['mean_test_score']
params=feature_grid.cv_results_['params']
fit_time =feature_grid.cv_results_['mean_fit_time']


You are accessing a training score ('mean_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True



In [18]:
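#loop variable names are chosen so they do not overwrite the train DataFrame
for train_score,test_score,fit_secs,param in zip(score_train,score_test,fit_time,params):
    print("%f, %f, %f, with: %r" % (train_score,test_score,fit_secs,param))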

-0.367195, -0.369482, 387.289452, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 0, 'reg_lambda': 0}
-0.366946, -0.368959, 392.234391, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 0, 'reg_lambda': 1}
-0.365338, -0.367792, 388.673191, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 0, 'reg_lambda': 10}
-0.366641, -0.369121, 388.665283, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 0}
-0.365638, -0.367912, 363.081064, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 1}
-0.366177, -0.368479, 362.899317, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 10}
-0.364914, -0.367309, 377.779327, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 10, 'reg_lambda': 0}
-0.365764, -0.368360, 384.402158, with: {'learning_rate': 0

In [19]:
#create a tuning grid for feature selection (trip duration target)
import lightgbm as lgbm
n_estimators=[1000,1500]
learning_rate=[0.1]
max_depth=[5,10,12]
reg_alpha=[0,1,10]
reg_lambda=[0,1,10]
param_grid=dict(learning_rate=learning_rate,n_estimators=n_estimators,max_depth=max_depth,
                reg_alpha=reg_alpha,reg_lambda=reg_lambda)
#validation half of the data, used for early stopping
eval_set=[(X_val,Y_duration_val)]
metric="rmse"
model=lgbm.LGBMRegressor(
                         #min_data_in_leaf=20000,
                         #note: bagging_fraction only takes effect when bagging_freq (subsample_freq) is non-zero
                         bagging_fraction=0.66,
                         #note: set here, the metric is not reflected in the training log, which reports l2
                         eval_metric=metric,
                         eval_set=eval_set,
                         early_stopping_rounds=100
                         #categorical_feature=['week_day','Borough_PU','Borough_DO','Airport_Ind']
                        )

In [20]:
from sklearn.model_selection import GridSearchCV
grid=GridSearchCV(model,param_grid,cv=2,n_jobs=-1,scoring='neg_median_absolute_error')
feature_grid=grid.fit(X_train,Y_duration_train,eval_set=eval_set)
#feature_grid=grid.fit(X_income_train,Y_income_train)


Found `early_stopping_rounds` in params. Will use it instead of argument



[1]	valid_0's l2: 118.366
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's l2: 97.7279
[3]	valid_0's l2: 80.6763
[4]	valid_0's l2: 66.7181
[5]	valid_0's l2: 55.245
[6]	valid_0's l2: 45.7526
[7]	valid_0's l2: 37.9513
[8]	valid_0's l2: 31.555
[9]	valid_0's l2: 26.2299
[10]	valid_0's l2: 21.8214
[11]	valid_0's l2: 18.2085
[12]	valid_0's l2: 15.2143
[13]	valid_0's l2: 12.7254
[14]	valid_0's l2: 10.6318
[15]	valid_0's l2: 8.93309
[16]	valid_0's l2: 7.51579
[17]	valid_0's l2: 6.32349
[18]	valid_0's l2: 5.34453
[19]	valid_0's l2: 4.52194
[20]	valid_0's l2: 3.84517
[21]	valid_0's l2: 3.26875
[22]	valid_0's l2: 2.78581
[23]	valid_0's l2: 2.38814
[24]	valid_0's l2: 2.05232
[25]	valid_0's l2: 1.77181
[26]	valid_0's l2: 1.53378
[27]	valid_0's l2: 1.33814
[28]	valid_0's l2: 1.16963
[29]	valid_0's l2: 1.03105
[30]	valid_0's l2: 0.912478
[31]	valid_0's l2: 0.810061
[32]	valid_0's l2: 0.725692
[33]	valid_0's l2: 0.652177
[34]	valid_0's l2: 0.590633
[35]	valid_0's l2: 0.5371

[288]	valid_0's l2: 0.154449
[289]	valid_0's l2: 0.154337
[290]	valid_0's l2: 0.154298
[291]	valid_0's l2: 0.154198
[292]	valid_0's l2: 0.154178
[293]	valid_0's l2: 0.15407
[294]	valid_0's l2: 0.153965
[295]	valid_0's l2: 0.153945
[296]	valid_0's l2: 0.153931
[297]	valid_0's l2: 0.153732
[298]	valid_0's l2: 0.153561
[299]	valid_0's l2: 0.153487
[300]	valid_0's l2: 0.15339
[301]	valid_0's l2: 0.153281
[302]	valid_0's l2: 0.15324
[303]	valid_0's l2: 0.153233
[304]	valid_0's l2: 0.153209
[305]	valid_0's l2: 0.153042
[306]	valid_0's l2: 0.15289
[307]	valid_0's l2: 0.152833
[308]	valid_0's l2: 0.152789
[309]	valid_0's l2: 0.152617
[310]	valid_0's l2: 0.152453
[311]	valid_0's l2: 0.152274
[312]	valid_0's l2: 0.152124
[313]	valid_0's l2: 0.151996
[314]	valid_0's l2: 0.151878
[315]	valid_0's l2: 0.151789
[316]	valid_0's l2: 0.151749
[317]	valid_0's l2: 0.151665
[318]	valid_0's l2: 0.151576
[319]	valid_0's l2: 0.151513
[320]	valid_0's l2: 0.151429
[321]	valid_0's l2: 0.151368
[322]	valid_0's l2

[573]	valid_0's l2: 0.139691
[574]	valid_0's l2: 0.13968
[575]	valid_0's l2: 0.139631
[576]	valid_0's l2: 0.139562
[577]	valid_0's l2: 0.139494
[578]	valid_0's l2: 0.139461
[579]	valid_0's l2: 0.139452
[580]	valid_0's l2: 0.139427
[581]	valid_0's l2: 0.139392
[582]	valid_0's l2: 0.139338
[583]	valid_0's l2: 0.139352
[584]	valid_0's l2: 0.139308
[585]	valid_0's l2: 0.139243
[586]	valid_0's l2: 0.13921
[587]	valid_0's l2: 0.139184
[588]	valid_0's l2: 0.139143
[589]	valid_0's l2: 0.13914
[590]	valid_0's l2: 0.139096
[591]	valid_0's l2: 0.139033
[592]	valid_0's l2: 0.138983
[593]	valid_0's l2: 0.138938
[594]	valid_0's l2: 0.138931
[595]	valid_0's l2: 0.138882
[596]	valid_0's l2: 0.138849
[597]	valid_0's l2: 0.138804
[598]	valid_0's l2: 0.138757
[599]	valid_0's l2: 0.138718
[600]	valid_0's l2: 0.138644
[601]	valid_0's l2: 0.138617
[602]	valid_0's l2: 0.138582
[603]	valid_0's l2: 0.138538
[604]	valid_0's l2: 0.13849
[605]	valid_0's l2: 0.13844
[606]	valid_0's l2: 0.138384
[607]	valid_0's l2:

[857]	valid_0's l2: 0.132922
[858]	valid_0's l2: 0.132914
[859]	valid_0's l2: 0.132905
[860]	valid_0's l2: 0.132884
[861]	valid_0's l2: 0.132854
[862]	valid_0's l2: 0.132859
[863]	valid_0's l2: 0.132859
[864]	valid_0's l2: 0.132862
[865]	valid_0's l2: 0.132872
[866]	valid_0's l2: 0.132863
[867]	valid_0's l2: 0.132841
[868]	valid_0's l2: 0.132798
[869]	valid_0's l2: 0.132786
[870]	valid_0's l2: 0.132791
[871]	valid_0's l2: 0.132759
[872]	valid_0's l2: 0.132758
[873]	valid_0's l2: 0.132753
[874]	valid_0's l2: 0.132736
[875]	valid_0's l2: 0.132742
[876]	valid_0's l2: 0.13272
[877]	valid_0's l2: 0.13272
[878]	valid_0's l2: 0.132717
[879]	valid_0's l2: 0.13274
[880]	valid_0's l2: 0.132705
[881]	valid_0's l2: 0.132673
[882]	valid_0's l2: 0.132646
[883]	valid_0's l2: 0.132624
[884]	valid_0's l2: 0.132616
[885]	valid_0's l2: 0.132581
[886]	valid_0's l2: 0.132559
[887]	valid_0's l2: 0.132543
[888]	valid_0's l2: 0.132504
[889]	valid_0's l2: 0.132495
[890]	valid_0's l2: 0.132472
[891]	valid_0's l

[1136]	valid_0's l2: 0.129318
[1137]	valid_0's l2: 0.129307
[1138]	valid_0's l2: 0.129262
[1139]	valid_0's l2: 0.12924
[1140]	valid_0's l2: 0.129241
[1141]	valid_0's l2: 0.129231
[1142]	valid_0's l2: 0.129233
[1143]	valid_0's l2: 0.129234
[1144]	valid_0's l2: 0.129245
[1145]	valid_0's l2: 0.129245
[1146]	valid_0's l2: 0.129239
[1147]	valid_0's l2: 0.129239
[1148]	valid_0's l2: 0.129241
[1149]	valid_0's l2: 0.129233
[1150]	valid_0's l2: 0.129226
[1151]	valid_0's l2: 0.129224
[1152]	valid_0's l2: 0.129219
[1153]	valid_0's l2: 0.129206
[1154]	valid_0's l2: 0.129198
[1155]	valid_0's l2: 0.129199
[1156]	valid_0's l2: 0.129199
[1157]	valid_0's l2: 0.129203
[1158]	valid_0's l2: 0.129204
[1159]	valid_0's l2: 0.129206
[1160]	valid_0's l2: 0.129199
[1161]	valid_0's l2: 0.129189
[1162]	valid_0's l2: 0.129188
[1163]	valid_0's l2: 0.129175
[1164]	valid_0's l2: 0.129138
[1165]	valid_0's l2: 0.129112
[1166]	valid_0's l2: 0.129119
[1167]	valid_0's l2: 0.129101
[1168]	valid_0's l2: 0.129091
[1169]	vali

[1412]	valid_0's l2: 0.126941
[1413]	valid_0's l2: 0.126935
[1414]	valid_0's l2: 0.126937
[1415]	valid_0's l2: 0.126948
[1416]	valid_0's l2: 0.12695
[1417]	valid_0's l2: 0.126954
[1418]	valid_0's l2: 0.126949
[1419]	valid_0's l2: 0.126934
[1420]	valid_0's l2: 0.126937
[1421]	valid_0's l2: 0.126923
[1422]	valid_0's l2: 0.126916
[1423]	valid_0's l2: 0.126927
[1424]	valid_0's l2: 0.126931
[1425]	valid_0's l2: 0.126935
[1426]	valid_0's l2: 0.126941
[1427]	valid_0's l2: 0.126928
[1428]	valid_0's l2: 0.126928
[1429]	valid_0's l2: 0.126918
[1430]	valid_0's l2: 0.126921
[1431]	valid_0's l2: 0.126932
[1432]	valid_0's l2: 0.126934
[1433]	valid_0's l2: 0.126952
[1434]	valid_0's l2: 0.126971
[1435]	valid_0's l2: 0.126965
[1436]	valid_0's l2: 0.126957
[1437]	valid_0's l2: 0.126953
[1438]	valid_0's l2: 0.126959
[1439]	valid_0's l2: 0.126959
[1440]	valid_0's l2: 0.126965
[1441]	valid_0's l2: 0.126956
[1442]	valid_0's l2: 0.126942
[1443]	valid_0's l2: 0.126945
[1444]	valid_0's l2: 0.126944
[1445]	vali

In [21]:
#sklearn.externals.joblib is deprecated and removed in newer scikit-learn releases; import joblib directly
import joblib
#joblib.dump(feature_grid,'feature_grid_5.pkl')

['feature_grid_5.pkl']

In [22]:
score_train = feature_grid.cv_results_['mean_train_score']
score_test = feature_grid.cv_results_['mean_test_score']
params=feature_grid.cv_results_['params']
fit_time =feature_grid.cv_results_['mean_fit_time']


You are accessing a training score ('mean_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True



In [23]:
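#loop variable names are chosen so they do not overwrite the train DataFrame
for train_score,test_score,fit_secs,param in zip(score_train,score_test,fit_time,params):
    print("%f, %f, %f, with: %r" % (train_score,test_score,fit_secs,param))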

-0.074805, -0.075055, 362.967633, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 0, 'reg_lambda': 0}
-0.076103, -0.076409, 363.225429, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 0, 'reg_lambda': 1}
-0.077100, -0.077435, 380.506688, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 0, 'reg_lambda': 10}
-0.075177, -0.075413, 394.491070, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 0}
-0.075538, -0.075778, 367.156264, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 1}
-0.076152, -0.076497, 367.780470, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 10}
-0.070573, -0.070883, 373.795319, with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 1000, 'reg_alpha': 10, 'reg_lambda': 0}
-0.070864, -0.071230, 361.209389, with: {'learning_rate': 0

## Fit an Initial Model with the Hyperparameters from Above to See Feature Importance

Since boosted trees are prone to overfitting, I wanted to avoid growing the trees too deep. I added regularization hyperparameters, set a bagging fraction, and added early stopping rounds. I tried several iterations of parameter grids before running the final grids above. Also, each hyperparameter tuning fit used only about one fourth of my data: I split my training dataset into two even halves so that one half could validate the early stopping rounds, and the 2-fold cross-validation then fit each candidate on half of the remaining half.
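
The early stopping rule referred to above ("Training until validation scores don't improve for 100 rounds") can be sketched in a few lines. This is a simplified illustration of the patience logic for a lower-is-better metric, not LightGBM's implementation. The function name and the toy validation scores are assumptions.

```python
def best_round_with_early_stopping(val_scores, patience):
    """Return (best_round, stopped_round), both 1-indexed, for a lower-is-better metric."""
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(val_scores, start=1):
        if score < best_score:
            # New best validation score: remember it and reset the wait.
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop training.
            return best_round, round_no
    return best_round, len(val_scores)


# Toy validation scores (made up): improvement stalls after round 3.
scores = [5.0, 4.0, 3.5, 3.6, 3.55, 3.7, 3.4]

best, stopped = best_round_with_early_stopping(scores, patience=3)
print(best, stopped)
```

With a patience of 3, training stops at round 6 and keeps round 3 as the best iteration, so the later improvement at round 7 is never seen. A larger patience, such as the 100 rounds used in this notebook, trades longer training for a lower chance of stopping on a temporary plateau.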

### Total Income Feature Selection

In [25]:
#fit a single model with the selected hyperparameters to inspect feature importance
import lightgbm as lgbm
eval_set=[(X_val,Y_income_val)]
metric="rmse"
model=lgbm.LGBMRegressor(
                         learning_rate=0.1,
                         n_estimators=1500,
                         max_depth=10,
                         bagging_fraction=0.66,
                         eval_metric=metric,
                         eval_set=eval_set,
                         early_stopping_rounds=100,
                         reg_alpha=10,
                         reg_lambda=1,
                        )

In [26]:
model_results=model.fit(X_train,Y_income_train,eval_set=eval_set,eval_metric='rmse')


Found `early_stopping_rounds` in params. Will use it instead of argument



[1]	valid_0's l2: 129.096	valid_0's rmse: 11.3621
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's l2: 106.131	valid_0's rmse: 10.302
[3]	valid_0's l2: 87.4635	valid_0's rmse: 9.35219
[4]	valid_0's l2: 72.2964	valid_0's rmse: 8.50273
[5]	valid_0's l2: 59.9414	valid_0's rmse: 7.74218
[6]	valid_0's l2: 49.9142	valid_0's rmse: 7.065
[7]	valid_0's l2: 41.7643	valid_0's rmse: 6.46253
[8]	valid_0's l2: 35.0999	valid_0's rmse: 5.92452
[9]	valid_0's l2: 29.6662	valid_0's rmse: 5.44667
[10]	valid_0's l2: 25.2397	valid_0's rmse: 5.02391
[11]	valid_0's l2: 21.6325	valid_0's rmse: 4.65107
[12]	valid_0's l2: 18.6766	valid_0's rmse: 4.32165
[13]	valid_0's l2: 16.2514	valid_0's rmse: 4.0313
[14]	valid_0's l2: 14.2745	valid_0's rmse: 3.77816
[15]	valid_0's l2: 12.6353	valid_0's rmse: 3.55461
[16]	valid_0's l2: 11.3033	valid_0's rmse: 3.36204
[17]	valid_0's l2: 10.2097	valid_0's rmse: 3.19526
[18]	valid_0's l2: 9.31315	valid_0's rmse: 3.05174
[19]	valid_0's l2: 8.57734	valid

[162]	valid_0's l2: 4.40835	valid_0's rmse: 2.09961
[163]	valid_0's l2: 4.40662	valid_0's rmse: 2.0992
[164]	valid_0's l2: 4.406	valid_0's rmse: 2.09905
[165]	valid_0's l2: 4.40505	valid_0's rmse: 2.09882
[166]	valid_0's l2: 4.4034	valid_0's rmse: 2.09843
[167]	valid_0's l2: 4.40286	valid_0's rmse: 2.0983
[168]	valid_0's l2: 4.40101	valid_0's rmse: 2.09786
[169]	valid_0's l2: 4.40014	valid_0's rmse: 2.09765
[170]	valid_0's l2: 4.39905	valid_0's rmse: 2.09739
[171]	valid_0's l2: 4.39784	valid_0's rmse: 2.0971
[172]	valid_0's l2: 4.39733	valid_0's rmse: 2.09698
[173]	valid_0's l2: 4.39633	valid_0's rmse: 2.09674
[174]	valid_0's l2: 4.3948	valid_0's rmse: 2.09638
[175]	valid_0's l2: 4.39394	valid_0's rmse: 2.09617
[176]	valid_0's l2: 4.39245	valid_0's rmse: 2.09582
[177]	valid_0's l2: 4.39179	valid_0's rmse: 2.09566
[178]	valid_0's l2: 4.39109	valid_0's rmse: 2.09549
[179]	valid_0's l2: 4.39062	valid_0's rmse: 2.09538
[180]	valid_0's l2: 4.39008	valid_0's rmse: 2.09525
[181]	valid_0's l2:

[321]	valid_0's l2: 4.27448	valid_0's rmse: 2.06748
[322]	valid_0's l2: 4.27383	valid_0's rmse: 2.06732
[323]	valid_0's l2: 4.27295	valid_0's rmse: 2.06711
[324]	valid_0's l2: 4.2724	valid_0's rmse: 2.06698
[325]	valid_0's l2: 4.27146	valid_0's rmse: 2.06675
[326]	valid_0's l2: 4.27063	valid_0's rmse: 2.06655
[327]	valid_0's l2: 4.27046	valid_0's rmse: 2.06651
[328]	valid_0's l2: 4.2701	valid_0's rmse: 2.06642
[329]	valid_0's l2: 4.26992	valid_0's rmse: 2.06638
[330]	valid_0's l2: 4.26959	valid_0's rmse: 2.0663
[331]	valid_0's l2: 4.269	valid_0's rmse: 2.06616
[332]	valid_0's l2: 4.26888	valid_0's rmse: 2.06613
[333]	valid_0's l2: 4.26862	valid_0's rmse: 2.06606
[334]	valid_0's l2: 4.26837	valid_0's rmse: 2.066
[335]	valid_0's l2: 4.26753	valid_0's rmse: 2.0658
[336]	valid_0's l2: 4.26669	valid_0's rmse: 2.0656
[337]	valid_0's l2: 4.2655	valid_0's rmse: 2.06531
[338]	valid_0's l2: 4.26468	valid_0's rmse: 2.06511
[339]	valid_0's l2: 4.26363	valid_0's rmse: 2.06486
[340]	valid_0's l2: 4.

[482]	valid_0's l2: 4.1975	valid_0's rmse: 2.04878
[483]	valid_0's l2: 4.19746	valid_0's rmse: 2.04877
[484]	valid_0's l2: 4.19713	valid_0's rmse: 2.04869
[485]	valid_0's l2: 4.19639	valid_0's rmse: 2.04851
[486]	valid_0's l2: 4.19635	valid_0's rmse: 2.0485
[487]	valid_0's l2: 4.19533	valid_0's rmse: 2.04825
[488]	valid_0's l2: 4.1952	valid_0's rmse: 2.04822
[489]	valid_0's l2: 4.19513	valid_0's rmse: 2.0482
[490]	valid_0's l2: 4.19501	valid_0's rmse: 2.04817
[491]	valid_0's l2: 4.19439	valid_0's rmse: 2.04802
[492]	valid_0's l2: 4.19375	valid_0's rmse: 2.04786
[493]	valid_0's l2: 4.19268	valid_0's rmse: 2.0476
[494]	valid_0's l2: 4.19252	valid_0's rmse: 2.04756
[495]	valid_0's l2: 4.19204	valid_0's rmse: 2.04745
[496]	valid_0's l2: 4.19199	valid_0's rmse: 2.04743
[497]	valid_0's l2: 4.19185	valid_0's rmse: 2.0474
[498]	valid_0's l2: 4.19174	valid_0's rmse: 2.04737
[499]	valid_0's l2: 4.19173	valid_0's rmse: 2.04737
[500]	valid_0's l2: 4.19133	valid_0's rmse: 2.04727
[501]	valid_0's l2

[645]	valid_0's l2: 4.15271	valid_0's rmse: 2.03782
[646]	valid_0's l2: 4.15269	valid_0's rmse: 2.03782
[647]	valid_0's l2: 4.15261	valid_0's rmse: 2.0378
[648]	valid_0's l2: 4.15251	valid_0's rmse: 2.03777
[649]	valid_0's l2: 4.15219	valid_0's rmse: 2.03769
[650]	valid_0's l2: 4.15199	valid_0's rmse: 2.03764
[651]	valid_0's l2: 4.15145	valid_0's rmse: 2.03751
[652]	valid_0's l2: 4.15141	valid_0's rmse: 2.0375
[653]	valid_0's l2: 4.15103	valid_0's rmse: 2.03741
[654]	valid_0's l2: 4.15089	valid_0's rmse: 2.03737
[655]	valid_0's l2: 4.15085	valid_0's rmse: 2.03736
[656]	valid_0's l2: 4.15087	valid_0's rmse: 2.03737
[657]	valid_0's l2: 4.15085	valid_0's rmse: 2.03736
[658]	valid_0's l2: 4.15073	valid_0's rmse: 2.03733
[659]	valid_0's l2: 4.1505	valid_0's rmse: 2.03728
[660]	valid_0's l2: 4.15013	valid_0's rmse: 2.03719
[661]	valid_0's l2: 4.1496	valid_0's rmse: 2.03706
[662]	valid_0's l2: 4.1494	valid_0's rmse: 2.03701
[663]	valid_0's l2: 4.14926	valid_0's rmse: 2.03697
[664]	valid_0's l

[806]	valid_0's l2: 4.12402	valid_0's rmse: 2.03077
[807]	valid_0's l2: 4.12386	valid_0's rmse: 2.03073
[808]	valid_0's l2: 4.12336	valid_0's rmse: 2.03061
[809]	valid_0's l2: 4.1232	valid_0's rmse: 2.03057
[810]	valid_0's l2: 4.12296	valid_0's rmse: 2.03051
[811]	valid_0's l2: 4.1229	valid_0's rmse: 2.03049
[812]	valid_0's l2: 4.1227	valid_0's rmse: 2.03044
[813]	valid_0's l2: 4.12254	valid_0's rmse: 2.0304
[814]	valid_0's l2: 4.12245	valid_0's rmse: 2.03038
[815]	valid_0's l2: 4.12228	valid_0's rmse: 2.03034
[816]	valid_0's l2: 4.12222	valid_0's rmse: 2.03033
[817]	valid_0's l2: 4.1221	valid_0's rmse: 2.0303
[818]	valid_0's l2: 4.12162	valid_0's rmse: 2.03018
[819]	valid_0's l2: 4.1213	valid_0's rmse: 2.0301
[820]	valid_0's l2: 4.1211	valid_0's rmse: 2.03005
[821]	valid_0's l2: 4.12095	valid_0's rmse: 2.03001
[822]	valid_0's l2: 4.12097	valid_0's rmse: 2.03002
[823]	valid_0's l2: 4.12076	valid_0's rmse: 2.02997
[824]	valid_0's l2: 4.12076	valid_0's rmse: 2.02997
[825]	valid_0's l2: 4

[967]	valid_0's l2: 4.10032	valid_0's rmse: 2.02492
[968]	valid_0's l2: 4.10019	valid_0's rmse: 2.02489
[969]	valid_0's l2: 4.09975	valid_0's rmse: 2.02478
[970]	valid_0's l2: 4.09949	valid_0's rmse: 2.02472
[971]	valid_0's l2: 4.09922	valid_0's rmse: 2.02465
[972]	valid_0's l2: 4.09907	valid_0's rmse: 2.02462
[973]	valid_0's l2: 4.09892	valid_0's rmse: 2.02458
[974]	valid_0's l2: 4.09879	valid_0's rmse: 2.02455
[975]	valid_0's l2: 4.09883	valid_0's rmse: 2.02456
[976]	valid_0's l2: 4.09864	valid_0's rmse: 2.02451
[977]	valid_0's l2: 4.09841	valid_0's rmse: 2.02445
[978]	valid_0's l2: 4.09827	valid_0's rmse: 2.02442
[979]	valid_0's l2: 4.09801	valid_0's rmse: 2.02435
[980]	valid_0's l2: 4.098	valid_0's rmse: 2.02435
[981]	valid_0's l2: 4.09796	valid_0's rmse: 2.02434
[982]	valid_0's l2: 4.09782	valid_0's rmse: 2.02431
[983]	valid_0's l2: 4.09755	valid_0's rmse: 2.02424
[984]	valid_0's l2: 4.09744	valid_0's rmse: 2.02421
[985]	valid_0's l2: 4.09735	valid_0's rmse: 2.02419
[986]	valid_0'

[1124]	valid_0's l2: 4.07886	valid_0's rmse: 2.01962
[1125]	valid_0's l2: 4.07879	valid_0's rmse: 2.0196
[1126]	valid_0's l2: 4.07868	valid_0's rmse: 2.01957
[1127]	valid_0's l2: 4.07865	valid_0's rmse: 2.01957
[1128]	valid_0's l2: 4.07843	valid_0's rmse: 2.01951
[1129]	valid_0's l2: 4.07842	valid_0's rmse: 2.01951
[1130]	valid_0's l2: 4.07843	valid_0's rmse: 2.01951
[1131]	valid_0's l2: 4.07837	valid_0's rmse: 2.0195
[1132]	valid_0's l2: 4.07813	valid_0's rmse: 2.01944
[1133]	valid_0's l2: 4.07811	valid_0's rmse: 2.01943
[1134]	valid_0's l2: 4.07815	valid_0's rmse: 2.01944
[1135]	valid_0's l2: 4.07805	valid_0's rmse: 2.01942
[1136]	valid_0's l2: 4.07785	valid_0's rmse: 2.01937
[1137]	valid_0's l2: 4.07789	valid_0's rmse: 2.01938
[1138]	valid_0's l2: 4.07741	valid_0's rmse: 2.01926
[1139]	valid_0's l2: 4.07747	valid_0's rmse: 2.01928
[1140]	valid_0's l2: 4.07734	valid_0's rmse: 2.01924
[1141]	valid_0's l2: 4.07721	valid_0's rmse: 2.01921
[1142]	valid_0's l2: 4.0772	valid_0's rmse: 2.01

[1283]	valid_0's l2: 4.06399	valid_0's rmse: 2.01593
[1284]	valid_0's l2: 4.06371	valid_0's rmse: 2.01586
[1285]	valid_0's l2: 4.06369	valid_0's rmse: 2.01586
[1286]	valid_0's l2: 4.06366	valid_0's rmse: 2.01585
[1287]	valid_0's l2: 4.06331	valid_0's rmse: 2.01576
[1288]	valid_0's l2: 4.06329	valid_0's rmse: 2.01576
[1289]	valid_0's l2: 4.06331	valid_0's rmse: 2.01577
[1290]	valid_0's l2: 4.06331	valid_0's rmse: 2.01577
[1291]	valid_0's l2: 4.06328	valid_0's rmse: 2.01576
[1292]	valid_0's l2: 4.06329	valid_0's rmse: 2.01576
[1293]	valid_0's l2: 4.06329	valid_0's rmse: 2.01576
[1294]	valid_0's l2: 4.06323	valid_0's rmse: 2.01575
[1295]	valid_0's l2: 4.06313	valid_0's rmse: 2.01572
[1296]	valid_0's l2: 4.06307	valid_0's rmse: 2.01571
[1297]	valid_0's l2: 4.06302	valid_0's rmse: 2.01569
[1298]	valid_0's l2: 4.06303	valid_0's rmse: 2.0157
[1299]	valid_0's l2: 4.06292	valid_0's rmse: 2.01567
[1300]	valid_0's l2: 4.06282	valid_0's rmse: 2.01564
[1301]	valid_0's l2: 4.06275	valid_0's rmse: 2.

[1438]	valid_0's l2: 4.0541	valid_0's rmse: 2.01348
[1439]	valid_0's l2: 4.05385	valid_0's rmse: 2.01342
[1440]	valid_0's l2: 4.05352	valid_0's rmse: 2.01334
[1441]	valid_0's l2: 4.05351	valid_0's rmse: 2.01333
[1442]	valid_0's l2: 4.05343	valid_0's rmse: 2.01331
[1443]	valid_0's l2: 4.0533	valid_0's rmse: 2.01328
[1444]	valid_0's l2: 4.05318	valid_0's rmse: 2.01325
[1445]	valid_0's l2: 4.0532	valid_0's rmse: 2.01326
[1446]	valid_0's l2: 4.05309	valid_0's rmse: 2.01323
[1447]	valid_0's l2: 4.05307	valid_0's rmse: 2.01322
[1448]	valid_0's l2: 4.05265	valid_0's rmse: 2.01312
[1449]	valid_0's l2: 4.05264	valid_0's rmse: 2.01312
[1450]	valid_0's l2: 4.05253	valid_0's rmse: 2.01309
[1451]	valid_0's l2: 4.0524	valid_0's rmse: 2.01306
[1452]	valid_0's l2: 4.05235	valid_0's rmse: 2.01305
[1453]	valid_0's l2: 4.05222	valid_0's rmse: 2.01301
[1454]	valid_0's l2: 4.05209	valid_0's rmse: 2.01298
[1455]	valid_0's l2: 4.05211	valid_0's rmse: 2.01299
[1456]	valid_0's l2: 4.05204	valid_0's rmse: 2.012
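
The line "Training until validation scores don't improve for 100 rounds" in the log above refers to a patience rule. The function below is a simplified sketch of that rule, not LightGBM's implementation; `best_iteration` and the toy scores are made up for illustration.

```python
def best_iteration(scores, patience):
    """Return (best round, round at which training stops), both 1-based.

    Training stops once `patience` rounds have passed without the validation
    score improving on the best value seen so far. If that never happens,
    training runs through the last round.
    """
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no
    return best_round, len(scores)

# Made-up validation RMSE values, not taken from the log above.
toy_scores = [5.0, 4.0, 3.5, 3.6, 3.55, 3.7, 3.4]
stopped_early = best_iteration(toy_scores, patience=3)
ran_to_end = best_iteration(toy_scores, patience=10)
```

In the log above the validation RMSE is still edging down in the last rounds shown, which suggests the 100-round patience may not have been reached before the 1500-round limit.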

In [27]:
feature_imp = pd.DataFrame(sorted(zip(model.booster_.feature_importance(importance_type='gain'),X_train.columns)), 
                               columns=['LGBM_Value','Feature']).sort_values(by=['LGBM_Value'],ascending=False)
feature_imp

Unnamed: 0,LGBM_Value,Feature
43,1790261000.0,trip_distance
42,48228000.0,mean_speed
41,16211820.0,RatecodeID
40,9541866.0,lng2
39,5937422.0,credit_tip_ratio
38,1418818.0,hour
37,990964.8,day
36,844205.8,lat2
35,626691.2,lng1
34,616459.7,lat1
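
The gain values span several orders of magnitude, so they are easier to read as shares. The sketch below uses only the ten rows shown above. The full table has more rows, so each share is relative to these ten rows rather than to the whole model.

```python
import pandas as pd

# Gain values copied from the ten rows of output shown above.
top_gain = pd.Series({
    "trip_distance": 1790261000.0,
    "mean_speed": 48228000.0,
    "RatecodeID": 16211820.0,
    "lng2": 9541866.0,
    "credit_tip_ratio": 5937422.0,
    "hour": 1418818.0,
    "day": 990964.8,
    "lat2": 844205.8,
    "lng1": 626691.2,
    "lat1": 616459.7,
})

# Share of the summed gain (over these ten rows) attributed to each feature.
share = top_gain / top_gain.sum()
trip_distance_share = share["trip_distance"]
```

Among these ten rows, `trip_distance` accounts for roughly 95% of the gain.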


Observations: 
   * I was surprised that the day of the week had lower importance in the model than I expected.
   * The pickup and dropoff boroughs were low in importance. This could be because location is already captured by the latitude and longitude values, which are much more granular. I expect traffic patterns in NYC to vary greatly across a small distance, and boroughs may cover too much space to reflect that.
   * The month value is not important because it is a constant (all of the data is from June). If I were building a yearly model, the month would matter more.
   * I will see how the model performs when dropping month and the pickup/dropoff boroughs. I will also run the model on the categorical attributes rather than their one-hot-encoded versions.
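
The follow-up steps in the last bullet can be sketched on a toy frame. The column names `month`, `PU_borough`, and `DO_borough` are hypothetical stand-ins, since the actual column names are not shown in this section.

```python
import pandas as pd

# Toy frame with hypothetical column names and values.
toy = pd.DataFrame({
    "trip_distance": [0.8, 1.62, 2.18],
    "month": [6, 6, 6],
    "PU_borough": ["Manhattan", "Manhattan", "Queens"],
    "DO_borough": ["Manhattan", "Brooklyn", "Manhattan"],
    "RatecodeID": [1, 1, 2],
})

# A column with a single distinct value cannot help the model split the data.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Drop the constant month column and the coarse borough columns.
reduced = toy.drop(columns=["month", "PU_borough", "DO_borough"])

# Mark a categorical attribute with the pandas category dtype
# instead of expanding it into one-hot columns.
reduced["RatecodeID"] = reduced["RatecodeID"].astype("category")
```

With its default `categorical_feature='auto'` setting, LightGBM's scikit-learn interface treats pandas `category` columns as categorical features.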

### Duration Feature Selection

In [20]:
#set up a model with the tuned hyperparameters for feature selection
import lightgbm as lgbm
eval_set=[(X_val,Y_duration_val)]
metric="rmse"
#note: LightGBM only applies bagging_fraction when bagging_freq > 0 (its default is 0)
model=lgbm.LGBMRegressor(
                         learning_rate=0.1,
                         n_estimators=1500,
                         max_depth=12,
                         bagging_fraction=0.66,
                         eval_metric=metric,
                         eval_set=eval_set,
                         early_stopping_rounds=100,
                         reg_alpha=10,
                         reg_lambda=1,
                        )

In [21]:
model_results=model.fit(X_train,Y_duration_train,eval_set=eval_set,eval_metric='rmse')


Found `early_stopping_rounds` in params. Will use it instead of argument



[1]	valid_0's l2: 118.368	valid_0's rmse: 10.8797
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's l2: 97.7299	valid_0's rmse: 9.88584
[3]	valid_0's l2: 80.6788	valid_0's rmse: 8.98214
[4]	valid_0's l2: 66.7209	valid_0's rmse: 8.16829
[5]	valid_0's l2: 55.2478	valid_0's rmse: 7.43288
[6]	valid_0's l2: 45.7554	valid_0's rmse: 6.76428
[7]	valid_0's l2: 37.954	valid_0's rmse: 6.16068
[8]	valid_0's l2: 31.5587	valid_0's rmse: 5.61771
[9]	valid_0's l2: 26.2338	valid_0's rmse: 5.1219
[10]	valid_0's l2: 21.8254	valid_0's rmse: 4.67176
[11]	valid_0's l2: 18.2118	valid_0's rmse: 4.26753
[12]	valid_0's l2: 15.2364	valid_0's rmse: 3.90338
[13]	valid_0's l2: 12.7344	valid_0's rmse: 3.56853
[14]	valid_0's l2: 10.6482	valid_0's rmse: 3.26316
[15]	valid_0's l2: 8.94149	valid_0's rmse: 2.99023
[16]	valid_0's l2: 7.51939	valid_0's rmse: 2.74215
[17]	valid_0's l2: 6.32848	valid_0's rmse: 2.51565
[18]	valid_0's l2: 5.34203	valid_0's rmse: 2.31128
[19]	valid_0's l2: 4.52354	val

[159]	valid_0's l2: 0.169838	valid_0's rmse: 0.412114
[160]	valid_0's l2: 0.169813	valid_0's rmse: 0.412083
[161]	valid_0's l2: 0.169784	valid_0's rmse: 0.412049
[162]	valid_0's l2: 0.169657	valid_0's rmse: 0.411894
[163]	valid_0's l2: 0.16932	valid_0's rmse: 0.411485
[164]	valid_0's l2: 0.168876	valid_0's rmse: 0.410946
[165]	valid_0's l2: 0.168785	valid_0's rmse: 0.410834
[166]	valid_0's l2: 0.168542	valid_0's rmse: 0.410539
[167]	valid_0's l2: 0.168269	valid_0's rmse: 0.410206
[168]	valid_0's l2: 0.168212	valid_0's rmse: 0.410136
[169]	valid_0's l2: 0.168209	valid_0's rmse: 0.410133
[170]	valid_0's l2: 0.167994	valid_0's rmse: 0.409871
[171]	valid_0's l2: 0.167901	valid_0's rmse: 0.409757
[172]	valid_0's l2: 0.167776	valid_0's rmse: 0.409604
[173]	valid_0's l2: 0.167807	valid_0's rmse: 0.409642
[174]	valid_0's l2: 0.167654	valid_0's rmse: 0.409455
[175]	valid_0's l2: 0.167483	valid_0's rmse: 0.409246
[176]	valid_0's l2: 0.167498	valid_0's rmse: 0.409266
[177]	valid_0's l2: 0.167422	

[314]	valid_0's l2: 0.153178	valid_0's rmse: 0.39138
[315]	valid_0's l2: 0.152951	valid_0's rmse: 0.391089
[316]	valid_0's l2: 0.152772	valid_0's rmse: 0.39086
[317]	valid_0's l2: 0.152749	valid_0's rmse: 0.390831
[318]	valid_0's l2: 0.152577	valid_0's rmse: 0.390611
[319]	valid_0's l2: 0.152474	valid_0's rmse: 0.390479
[320]	valid_0's l2: 0.152372	valid_0's rmse: 0.390348
[321]	valid_0's l2: 0.152322	valid_0's rmse: 0.390285
[322]	valid_0's l2: 0.152318	valid_0's rmse: 0.39028
[323]	valid_0's l2: 0.152306	valid_0's rmse: 0.390263
[324]	valid_0's l2: 0.152187	valid_0's rmse: 0.390111
[325]	valid_0's l2: 0.151956	valid_0's rmse: 0.389815
[326]	valid_0's l2: 0.151858	valid_0's rmse: 0.38969
[327]	valid_0's l2: 0.151838	valid_0's rmse: 0.389664
[328]	valid_0's l2: 0.15174	valid_0's rmse: 0.389538
[329]	valid_0's l2: 0.151611	valid_0's rmse: 0.389373
[330]	valid_0's l2: 0.151593	valid_0's rmse: 0.389349
[331]	valid_0's l2: 0.15158	valid_0's rmse: 0.389333
[332]	valid_0's l2: 0.15145	valid_

[470]	valid_0's l2: 0.143463	valid_0's rmse: 0.378765
[471]	valid_0's l2: 0.143415	valid_0's rmse: 0.378702
[472]	valid_0's l2: 0.143384	valid_0's rmse: 0.378661
[473]	valid_0's l2: 0.143379	valid_0's rmse: 0.378654
[474]	valid_0's l2: 0.143327	valid_0's rmse: 0.378586
[475]	valid_0's l2: 0.14328	valid_0's rmse: 0.378523
[476]	valid_0's l2: 0.143216	valid_0's rmse: 0.378439
[477]	valid_0's l2: 0.143226	valid_0's rmse: 0.378452
[478]	valid_0's l2: 0.143156	valid_0's rmse: 0.37836
[479]	valid_0's l2: 0.143103	valid_0's rmse: 0.37829
[480]	valid_0's l2: 0.143028	valid_0's rmse: 0.378191
[481]	valid_0's l2: 0.142972	valid_0's rmse: 0.378116
[482]	valid_0's l2: 0.142922	valid_0's rmse: 0.378051
[483]	valid_0's l2: 0.142857	valid_0's rmse: 0.377964
[484]	valid_0's l2: 0.142814	valid_0's rmse: 0.377907
[485]	valid_0's l2: 0.14277	valid_0's rmse: 0.377849
[486]	valid_0's l2: 0.142719	valid_0's rmse: 0.377781
[487]	valid_0's l2: 0.142644	valid_0's rmse: 0.377682
[488]	valid_0's l2: 0.14261	vali

[624]	valid_0's l2: 0.137648	valid_0's rmse: 0.371009
[625]	valid_0's l2: 0.137592	valid_0's rmse: 0.370934
[626]	valid_0's l2: 0.137544	valid_0's rmse: 0.37087
[627]	valid_0's l2: 0.137502	valid_0's rmse: 0.370812
[628]	valid_0's l2: 0.137482	valid_0's rmse: 0.370786
[629]	valid_0's l2: 0.137444	valid_0's rmse: 0.370734
[630]	valid_0's l2: 0.137429	valid_0's rmse: 0.370715
[631]	valid_0's l2: 0.137415	valid_0's rmse: 0.370695
[632]	valid_0's l2: 0.137396	valid_0's rmse: 0.370669
[633]	valid_0's l2: 0.137404	valid_0's rmse: 0.37068
[634]	valid_0's l2: 0.13738	valid_0's rmse: 0.370648
[635]	valid_0's l2: 0.137367	valid_0's rmse: 0.370631
[636]	valid_0's l2: 0.137358	valid_0's rmse: 0.370619
[637]	valid_0's l2: 0.137338	valid_0's rmse: 0.370592
[638]	valid_0's l2: 0.137329	valid_0's rmse: 0.370579
[639]	valid_0's l2: 0.13734	valid_0's rmse: 0.370595
[640]	valid_0's l2: 0.13733	valid_0's rmse: 0.37058
[641]	valid_0's l2: 0.137299	valid_0's rmse: 0.370539
[642]	valid_0's l2: 0.137242	valid

[778]	valid_0's l2: 0.134355	valid_0's rmse: 0.366544
[779]	valid_0's l2: 0.134351	valid_0's rmse: 0.366539
[780]	valid_0's l2: 0.134342	valid_0's rmse: 0.366527
[781]	valid_0's l2: 0.13432	valid_0's rmse: 0.366497
[782]	valid_0's l2: 0.134284	valid_0's rmse: 0.366448
[783]	valid_0's l2: 0.134253	valid_0's rmse: 0.366406
[784]	valid_0's l2: 0.134227	valid_0's rmse: 0.36637
[785]	valid_0's l2: 0.134214	valid_0's rmse: 0.366353
[786]	valid_0's l2: 0.13421	valid_0's rmse: 0.366347
[787]	valid_0's l2: 0.134189	valid_0's rmse: 0.366318
[788]	valid_0's l2: 0.134173	valid_0's rmse: 0.366296
[789]	valid_0's l2: 0.134143	valid_0's rmse: 0.366256
[790]	valid_0's l2: 0.134097	valid_0's rmse: 0.366192
[791]	valid_0's l2: 0.134086	valid_0's rmse: 0.366177
[792]	valid_0's l2: 0.134056	valid_0's rmse: 0.366137
[793]	valid_0's l2: 0.134058	valid_0's rmse: 0.36614
[794]	valid_0's l2: 0.134044	valid_0's rmse: 0.36612
[795]	valid_0's l2: 0.134034	valid_0's rmse: 0.366106
[796]	valid_0's l2: 0.134033	vali

[931]	valid_0's l2: 0.13236	valid_0's rmse: 0.363813
[932]	valid_0's l2: 0.132353	valid_0's rmse: 0.363803
[933]	valid_0's l2: 0.132319	valid_0's rmse: 0.363757
[934]	valid_0's l2: 0.132302	valid_0's rmse: 0.363733
[935]	valid_0's l2: 0.132288	valid_0's rmse: 0.363715
[936]	valid_0's l2: 0.132284	valid_0's rmse: 0.363708
[937]	valid_0's l2: 0.132284	valid_0's rmse: 0.363709
[938]	valid_0's l2: 0.132278	valid_0's rmse: 0.3637
[939]	valid_0's l2: 0.132258	valid_0's rmse: 0.363674
[940]	valid_0's l2: 0.132236	valid_0's rmse: 0.363642
[941]	valid_0's l2: 0.132199	valid_0's rmse: 0.363592
[942]	valid_0's l2: 0.132188	valid_0's rmse: 0.363576
[943]	valid_0's l2: 0.132162	valid_0's rmse: 0.363541
[944]	valid_0's l2: 0.132149	valid_0's rmse: 0.363522
[945]	valid_0's l2: 0.132124	valid_0's rmse: 0.363488
[946]	valid_0's l2: 0.132083	valid_0's rmse: 0.363432
[947]	valid_0's l2: 0.132061	valid_0's rmse: 0.363402
[948]	valid_0's l2: 0.13203	valid_0's rmse: 0.363359
[949]	valid_0's l2: 0.132018	val

[1084]	valid_0's l2: 0.130624	valid_0's rmse: 0.361419
[1085]	valid_0's l2: 0.130612	valid_0's rmse: 0.361403
[1086]	valid_0's l2: 0.130616	valid_0's rmse: 0.361408
[1087]	valid_0's l2: 0.130616	valid_0's rmse: 0.361408
[1088]	valid_0's l2: 0.130601	valid_0's rmse: 0.361388
[1089]	valid_0's l2: 0.130565	valid_0's rmse: 0.361338
[1090]	valid_0's l2: 0.13054	valid_0's rmse: 0.361303
[1091]	valid_0's l2: 0.130526	valid_0's rmse: 0.361284
[1092]	valid_0's l2: 0.130501	valid_0's rmse: 0.361249
[1093]	valid_0's l2: 0.130466	valid_0's rmse: 0.361201
[1094]	valid_0's l2: 0.130446	valid_0's rmse: 0.361173
[1095]	valid_0's l2: 0.130436	valid_0's rmse: 0.36116
[1096]	valid_0's l2: 0.130409	valid_0's rmse: 0.361122
[1097]	valid_0's l2: 0.130391	valid_0's rmse: 0.361097
[1098]	valid_0's l2: 0.130347	valid_0's rmse: 0.361035
[1099]	valid_0's l2: 0.13031	valid_0's rmse: 0.360985
[1100]	valid_0's l2: 0.13029	valid_0's rmse: 0.360957
[1101]	valid_0's l2: 0.130262	valid_0's rmse: 0.360919
[1102]	valid_0

[1234]	valid_0's l2: 0.129333	valid_0's rmse: 0.35963
[1235]	valid_0's l2: 0.129313	valid_0's rmse: 0.359602
[1236]	valid_0's l2: 0.129305	valid_0's rmse: 0.35959
[1237]	valid_0's l2: 0.129287	valid_0's rmse: 0.359566
[1238]	valid_0's l2: 0.129287	valid_0's rmse: 0.359565
[1239]	valid_0's l2: 0.129291	valid_0's rmse: 0.359571
[1240]	valid_0's l2: 0.129291	valid_0's rmse: 0.359571
[1241]	valid_0's l2: 0.129284	valid_0's rmse: 0.359561
[1242]	valid_0's l2: 0.129273	valid_0's rmse: 0.359546
[1243]	valid_0's l2: 0.129265	valid_0's rmse: 0.359535
[1244]	valid_0's l2: 0.129268	valid_0's rmse: 0.359539
[1245]	valid_0's l2: 0.12925	valid_0's rmse: 0.359513
[1246]	valid_0's l2: 0.129252	valid_0's rmse: 0.359517
[1247]	valid_0's l2: 0.129253	valid_0's rmse: 0.359518
[1248]	valid_0's l2: 0.129219	valid_0's rmse: 0.35947
[1249]	valid_0's l2: 0.129214	valid_0's rmse: 0.359463
[1250]	valid_0's l2: 0.129192	valid_0's rmse: 0.359433
[1251]	valid_0's l2: 0.129173	valid_0's rmse: 0.359407
[1252]	valid_0

[1384]	valid_0's l2: 0.128116	valid_0's rmse: 0.357932
[1385]	valid_0's l2: 0.128108	valid_0's rmse: 0.357922
[1386]	valid_0's l2: 0.128099	valid_0's rmse: 0.357909
[1387]	valid_0's l2: 0.128078	valid_0's rmse: 0.35788
[1388]	valid_0's l2: 0.128061	valid_0's rmse: 0.357856
[1389]	valid_0's l2: 0.128068	valid_0's rmse: 0.357866
[1390]	valid_0's l2: 0.128054	valid_0's rmse: 0.357846
[1391]	valid_0's l2: 0.128045	valid_0's rmse: 0.357834
[1392]	valid_0's l2: 0.128045	valid_0's rmse: 0.357834
[1393]	valid_0's l2: 0.128047	valid_0's rmse: 0.357837
[1394]	valid_0's l2: 0.128058	valid_0's rmse: 0.357852
[1395]	valid_0's l2: 0.128031	valid_0's rmse: 0.357814
[1396]	valid_0's l2: 0.128016	valid_0's rmse: 0.357793
[1397]	valid_0's l2: 0.127984	valid_0's rmse: 0.357749
[1398]	valid_0's l2: 0.12799	valid_0's rmse: 0.357757
[1399]	valid_0's l2: 0.12799	valid_0's rmse: 0.357757
[1400]	valid_0's l2: 0.127995	valid_0's rmse: 0.357763
[1401]	valid_0's l2: 0.127972	valid_0's rmse: 0.357732
[1402]	valid_
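
Both training logs print two columns per round, `l2` and `rmse`. The quick check below, using a few pairs copied from the logs, confirms that the second is the square root of the first, so either column can be used to track progress.

```python
import math

# (l2, rmse) pairs copied from the two training logs above.
logged = [
    (129.096, 11.3621),    # income model, round 1
    (4.05209, 2.01298),    # income model, round 1454
    (118.368, 10.8797),    # duration model, round 1
    (0.127972, 0.357732),  # duration model, round 1401
]

# Largest difference between sqrt(l2) and the logged rmse.
max_gap = max(abs(math.sqrt(l2) - rmse) for l2, rmse in logged)
```

The gap comes only from the rounding of the printed values.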

In [22]:
feature_imp = pd.DataFrame(sorted(zip(model.booster_.feature_importance(importance_type='gain'),X_train.columns)), 
                               columns=['LGBM_Value','Feature']).sort_values(by=['LGBM_Value'],ascending=False)
feature_imp

Unnamed: 0,LGBM_Value,Feature
44,1331071000.0,trip_distance
43,425976100.0,speed_mph
42,168373.7,mean_speed
41,78454.37,direction
40,73582.87,lng1
39,68160.31,lng2
38,51977.13,hour
37,51400.89,tolls_amount
36,42887.53,minute
35,40859.08,credit_tip_ratio


Observations:
   * Much like in the income model, boroughs and month were not important attributes.
   * Minute is slightly more important in the duration model than in the income model.
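
To make the comparison between the two models concrete, the sketch below lines up the rank of each feature using only the ten rows shown for each model. The helper `ranks` is made up for illustration, and a missing rank only means the feature is not among the ten rows shown.

```python
import pandas as pd

# Feature names in the order shown in the two importance tables above.
income_top = ["trip_distance", "mean_speed", "RatecodeID", "lng2",
              "credit_tip_ratio", "hour", "day", "lat2", "lng1", "lat1"]
duration_top = ["trip_distance", "speed_mph", "mean_speed", "direction",
                "lng1", "lng2", "hour", "tolls_amount", "minute",
                "credit_tip_ratio"]

def ranks(names):
    """Map each feature name to its 1-based position in the list."""
    return pd.Series(range(1, len(names) + 1), index=names)

# Rows are the union of both lists; NaN marks a feature outside a model's ten rows.
comparison = pd.DataFrame({
    "income_rank": ranks(income_top),
    "duration_rank": ranks(duration_top),
})
```

Here `minute` is ninth among the duration model's rows and does not appear among the income model's ten rows.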
   