## Problem Description
Tree roots growing under sidewalks often cause cracking or lifting of the pavement once the tree surpasses a certain size. This creates significant tripping hazards for pedestrians, and liability issues for property owners. Furthermore, the cost of repairing such damage is in excess of $100 million per year in the United States. As such, this project seeks to:

Predict the likelihood that a particular tree will result in sidewalk damage.

## Database Description
Street tree data from the TreesCount! 2015 Street Tree Census, conducted by volunteers and staff organized by NYC Parks & Recreation and partner organizations. Tree data collected includes tree species, diameter and perception of health. Accompanying blockface data is available indicating status of data collection and data release citywide.

In [1]:
import numpy as np
import pandas as pd
import os
import csv

# No warnings about setting value on copy of slice
pd.options.mode.chained_assignment = None

# Display up to 60 columns of a dataframe
pd.set_option('display.max_columns', 60)

# Matplotlib visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Set default font size
plt.rcParams['font.size'] = 24

# Internal ipython tool for setting figure size
from IPython.core.pylabtools import figsize

# Seaborn for visualization
import seaborn as sns
sns.set(font_scale = 2)

# Splitting data into training and testing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer, MinMaxScaler
from sklearn.metrics import confusion_matrix

# Machine Learning Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import tree

# LIME for explaining predictions
import lime 
import lime.lime_tabular

  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  from numpy.core.umath_tests import inner1d


In [2]:
data=pd.read_csv('/home/manish/my-custom-data/2015StreetTreesCensus_TREES.csv')
# data2=pd.read_csv('/home/manish/my-custom-data/Energy_Efficiency_Projects.csv')
# data3=pd.read_csv('/home/manish/my-custom-data/New_York_City_Leading_Causes_of_Death.csv')
# data4=pd.read_csv('/home/manish/my-custom-data/Public_Recycling_Bins.csv')
# data5=pd.read_csv('/home/manish/my-custom-data/Water_Consumption_In_The_New_York_City.csv')
# data6=pd.read_csv('/home/manish/my-custom-data/Youth_Behavior_Risk_Survey.csv')


In [3]:
data

Unnamed: 0,created_at,tree_id,block_id,the_geom,tree_dbh,stump_diam,curb_loc,status,health,spc_latin,spc_common,steward,guards,sidewalk,user_type,problems,root_stone,root_grate,root_other,trnk_wire,trnk_light,trnk_other,brnch_ligh,brnch_shoe,brnch_othe,address,zipcode,zip_city,cb_num,borocode,boroname,cncldist,st_assem,st_senate,nta,nta_name,boro_ct,state,Latitude,longitude,x_sp,y_sp
0,08/27/2015,180683,348711,POINT (-73.84421521958048 40.723091773924274),3,0,OnCurb,Alive,Fair,Acer rubrum,red maple,,,NoDamage,TreesCount Staff,,No,No,No,No,No,No,No,No,No,108-005 70 AVENUE,11375,Forest Hills,406,4,Queens,29,28,16,QN17,Forest Hills,4073900,New York,40.723092,-73.844215,1.027431e+06,202756.768749
1,09/03/2015,200540,315986,POINT (-73.81867945834878 40.79411066708779),21,0,OnCurb,Alive,Fair,Quercus palustris,pin oak,,,Damage,TreesCount Staff,Stones,Yes,No,No,No,No,No,No,No,No,147-074 7 AVENUE,11357,Whitestone,407,4,Queens,19,27,11,QN49,Whitestone,4097300,New York,40.794111,-73.818679,1.034456e+06,228644.837379
2,09/05/2015,204026,218365,POINT (-73.93660770459083 40.717580740099116),3,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,1or2,,Damage,Volunteer,,No,No,No,No,No,No,No,No,No,390 MORGAN AVENUE,11211,Brooklyn,301,3,Brooklyn,34,50,18,BK90,East Williamsburg,3044900,New York,40.717581,-73.936608,1.001823e+06,200716.891267
3,09/05/2015,204337,217969,POINT (-73.93445615919741 40.713537494833226),10,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,,,Damage,Volunteer,Stones,Yes,No,No,No,No,No,No,No,No,1027 GRAND STREET,11211,Brooklyn,301,3,Brooklyn,34,53,18,BK90,East Williamsburg,3044900,New York,40.713537,-73.934456,1.002420e+06,199244.253136
4,08/30/2015,189565,223043,POINT (-73.97597938483258 40.66677775537875),21,0,OnCurb,Alive,Good,Tilia americana,American linden,,,Damage,Volunteer,Stones,Yes,No,No,No,No,No,No,No,No,603 6 STREET,11215,Brooklyn,306,3,Brooklyn,39,44,21,BK37,Park Slope-Gowanus,3016500,New York,40.666778,-73.975979,9.909138e+05,182202.425999
5,08/30/2015,190422,106099,POINT (-73.98494997200308 40.770045625891846),11,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,1or2,Helpful,NoDamage,Volunteer,,No,No,No,No,No,No,No,No,No,8 COLUMBUS AVENUE,10023,New York,107,1,Manhattan,3,67,27,MN14,Lincoln Square,1014500,New York,40.770046,-73.984950,9.884187e+05,219825.522669
6,08/30/2015,190426,106099,POINT (-73.98533807200513 40.77020969000546),11,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,1or2,Helpful,NoDamage,Volunteer,,No,No,No,No,No,No,No,No,No,120 WEST 60 STREET,10023,New York,107,1,Manhattan,3,67,27,MN14,Lincoln Square,1014500,New York,40.770210,-73.985338,9.883112e+05,219885.278455
7,09/07/2015,208649,103940,POINT (-73.98729652382876 40.7627238542921),9,0,OnCurb,Alive,Good,Tilia americana,American linden,,,NoDamage,Volunteer,MetalGrates,No,Yes,No,No,No,No,No,No,No,311 WEST 50 STREET,10019,New York,104,1,Manhattan,3,75,27,MN15,Clinton,1012700,New York,40.762724,-73.987297,9.877691e+05,217157.856088
8,09/08/2015,209610,407443,POINT (-74.07625483097186 40.596579313729144),6,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,,,NoDamage,TreesCount Staff,,No,No,No,No,No,No,No,No,No,65 JEROME AVENUE,10305,Staten Island,502,5,Staten Island,50,64,23,SI14,Grasmere-Arrochar-Ft. Wadsworth,5006400,New York,40.596579,-74.076255,9.630732e+05,156635.554233
9,08/31/2015,192755,207508,POINT (-73.96974394191379 40.58635724735751),21,0,OffsetFromCurb,Alive,Fair,Platanus x acerifolia,London planetree,,,NoDamage,TreesCount Staff,,No,No,No,No,No,No,No,No,No,638 AVENUE Z,11223,Brooklyn,313,3,Brooklyn,47,45,23,BK26,Gravesend,3037402,New York,40.586357,-73.969744,9.926537e+05,152903.630594


In [4]:
data.columns

Index(['created_at', 'tree_id', 'block_id', 'the_geom', 'tree_dbh',
       'stump_diam', 'curb_loc', 'status', 'health', 'spc_latin', 'spc_common',
       'steward', 'guards', 'sidewalk', 'user_type', 'problems', 'root_stone',
       'root_grate', 'root_other', 'trnk_wire', 'trnk_light', 'trnk_other',
       'brnch_ligh', 'brnch_shoe', 'brnch_othe', 'address', 'zipcode',
       'zip_city', 'cb_num', 'borocode', 'boroname', 'cncldist', 'st_assem',
       'st_senate', 'nta', 'nta_name', 'boro_ct', 'state', 'Latitude',
       'longitude', 'x_sp', 'y_sp'],
      dtype='object')

In [5]:
data.status.value_counts()

Alive    652173
Stump     17654
Dead      13961
Name: status, dtype: int64

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683788 entries, 0 to 683787
Data columns (total 42 columns):
created_at    683788 non-null object
tree_id       683788 non-null int64
block_id      683788 non-null int64
the_geom      683788 non-null object
tree_dbh      683788 non-null int64
stump_diam    683788 non-null int64
curb_loc      683788 non-null object
status        683788 non-null object
health        652172 non-null object
spc_latin     652169 non-null object
spc_common    652169 non-null object
steward       652173 non-null object
guards        652172 non-null object
sidewalk      652172 non-null object
user_type     683788 non-null object
problems      652124 non-null object
root_stone    683788 non-null object
root_grate    683788 non-null object
root_other    683788 non-null object
trnk_wire     683788 non-null object
trnk_light    683788 non-null object
trnk_other    683788 non-null object
brnch_ligh    683788 non-null object
brnch_shoe    683788 non-null object
brnch

In [7]:
data.isnull().any()

created_at    False
tree_id       False
block_id      False
the_geom      False
tree_dbh      False
stump_diam    False
curb_loc      False
status        False
health         True
spc_latin      True
spc_common     True
steward        True
guards         True
sidewalk       True
user_type     False
problems       True
root_stone    False
root_grate    False
root_other    False
trnk_wire     False
trnk_light    False
trnk_other    False
brnch_ligh    False
brnch_shoe    False
brnch_othe    False
address       False
zipcode       False
zip_city      False
cb_num        False
borocode      False
boroname      False
cncldist      False
st_assem      False
st_senate     False
nta           False
nta_name      False
boro_ct       False
state         False
Latitude      False
longitude     False
x_sp          False
y_sp          False
dtype: bool

In [8]:
data.isnull().sum()

created_at        0
tree_id           0
block_id          0
the_geom          0
tree_dbh          0
stump_diam        0
curb_loc          0
status            0
health        31616
spc_latin     31619
spc_common    31619
steward       31615
guards        31616
sidewalk      31616
user_type         0
problems      31664
root_stone        0
root_grate        0
root_other        0
trnk_wire         0
trnk_light        0
trnk_other        0
brnch_ligh        0
brnch_shoe        0
brnch_othe        0
address           0
zipcode           0
zip_city          0
cb_num            0
borocode          0
boroname          0
cncldist          0
st_assem          0
st_senate         0
nta               0
nta_name          0
boro_ct           0
state             0
Latitude          0
longitude         0
x_sp              0
y_sp              0
dtype: int64

In [9]:
data.problems.value_counts()

None                                                                                               426280
Stones                                                                                              95673
BranchLights                                                                                        29452
Stones,BranchLights                                                                                 17808
RootOther                                                                                           11418
TrunkOther                                                                                          11143
BranchOther                                                                                          8352
Stones,TrunkOther                                                                                    5183
Stones,RootOther                                                                                     4468
WiresRope                                     

In [10]:
no_classification = data[data['sidewalk'].isnull()]
classification = data[data['sidewalk'].notnull()]
print(no_classification.shape)
print(classification.shape)

(31616, 42)
(652172, 42)


In [11]:
classification.isnull().sum()

created_at     0
tree_id        0
block_id       0
the_geom       0
tree_dbh       0
stump_diam     0
curb_loc       0
status         0
health         1
spc_latin      5
spc_common     5
steward        0
guards         1
sidewalk       0
user_type      0
problems      49
root_stone     0
root_grate     0
root_other     0
trnk_wire      0
trnk_light     0
trnk_other     0
brnch_ligh     0
brnch_shoe     0
brnch_othe     0
address        0
zipcode        0
zip_city       0
cb_num         0
borocode       0
boroname       0
cncldist       0
st_assem       0
st_senate      0
nta            0
nta_name       0
boro_ct        0
state          0
Latitude       0
longitude      0
x_sp           0
y_sp           0
dtype: int64

In [12]:
# problems is not required as it is already present as other fetures so removing it
classfication2=classification.drop(columns=['problems'],axis=1)

In [13]:
classfication2.isnull().sum()

created_at    0
tree_id       0
block_id      0
the_geom      0
tree_dbh      0
stump_diam    0
curb_loc      0
status        0
health        1
spc_latin     5
spc_common    5
steward       0
guards        1
sidewalk      0
user_type     0
root_stone    0
root_grate    0
root_other    0
trnk_wire     0
trnk_light    0
trnk_other    0
brnch_ligh    0
brnch_shoe    0
brnch_othe    0
address       0
zipcode       0
zip_city      0
cb_num        0
borocode      0
boroname      0
cncldist      0
st_assem      0
st_senate     0
nta           0
nta_name      0
boro_ct       0
state         0
Latitude      0
longitude     0
x_sp          0
y_sp          0
dtype: int64

In [14]:
classfication2.health.value_counts()

Good    528849
Fair     96504
Poor     26818
Name: health, dtype: int64

In [15]:
classfication2.guards.value_counts()

None       572305
Helpful     51866
Harmful     20252
Unsure       7748
Name: guards, dtype: int64

In [16]:
classfication2.spc_common.mode()

0    London planetree
dtype: object

In [17]:
classfication2.health.fillna(value='Good',inplace=True)
classfication2.guards.fillna(value='Unsure',inplace=True)
classfication2.spc_common.fillna(classfication2.spc_common.mode()[0],inplace=True)
classfication2.spc_latin.fillna(classfication2.spc_latin.mode()[0],inplace=True)

In [18]:
classfication2.isnull().sum()

created_at    0
tree_id       0
block_id      0
the_geom      0
tree_dbh      0
stump_diam    0
curb_loc      0
status        0
health        0
spc_latin     0
spc_common    0
steward       0
guards        0
sidewalk      0
user_type     0
root_stone    0
root_grate    0
root_other    0
trnk_wire     0
trnk_light    0
trnk_other    0
brnch_ligh    0
brnch_shoe    0
brnch_othe    0
address       0
zipcode       0
zip_city      0
cb_num        0
borocode      0
boroname      0
cncldist      0
st_assem      0
st_senate     0
nta           0
nta_name      0
boro_ct       0
state         0
Latitude      0
longitude     0
x_sp          0
y_sp          0
dtype: int64

In [19]:
spc_common=classfication2.spc_common.value_counts()
spc_latin=classfication2.spc_latin.value_counts()

In [20]:
spc_common_toreolace=[]
spc_dict=spc_common.to_dict()
for key,value in spc_dict.items():
    if value<20000:
        spc_common_toreolace.append(key)
spc_latin_toreolace=[]
spc_ldict=spc_latin.to_dict()
for key,value in spc_ldict.items():
    if value<20000:
        spc_latin_toreolace.append(key)

In [21]:

classfication2=classfication2.replace(spc_common_toreolace,'combined_common')

In [22]:
classfication2=classfication2.replace(spc_latin_toreolace,'combined_common')
classfication2['spc_latin'].value_counts()

combined_common                       245283
Platanus x acerifolia                  87019
Gleditsia triacanthos var. inermis     64262
Pyrus calleryana                       58931
Quercus palustris                      53185
Acer platanoides                       34189
Tilia cordata                          29742
Prunus                                 29279
Zelkova serrata                        29258
Ginkgo biloba                          21024
Name: spc_latin, dtype: int64

In [23]:
# distribution
classfication2['sidewalk'].value_counts()

NoDamage    464978
Damage      187194
Name: sidewalk, dtype: int64

Since it is a highly unbalanced dataset we have to balance it first.

In [24]:
no_damage=classfication2[classfication2['sidewalk']=="NoDamage"]
# no_damage=no_damage.sample(frac=0.6)
damage=classfication2[classfication2['sidewalk']=="Damage"]
damage=damage.sample(frac=1)
classfication2=pd.concat([no_damage,damage,damage],axis=0,ignore_index=True)
classfication2

Unnamed: 0,created_at,tree_id,block_id,the_geom,tree_dbh,stump_diam,curb_loc,status,health,spc_latin,spc_common,steward,guards,sidewalk,user_type,root_stone,root_grate,root_other,trnk_wire,trnk_light,trnk_other,brnch_ligh,brnch_shoe,brnch_othe,address,zipcode,zip_city,cb_num,borocode,boroname,cncldist,st_assem,st_senate,nta,nta_name,boro_ct,state,Latitude,longitude,x_sp,y_sp
0,08/27/2015,180683,348711,POINT (-73.84421521958048 40.723091773924274),3,0,OnCurb,Alive,Fair,combined_common,combined_common,,,NoDamage,TreesCount Staff,No,No,No,No,No,No,No,No,No,108-005 70 AVENUE,11375,Forest Hills,406,4,Queens,29,28,16,QN17,Forest Hills,4073900,New York,40.723092,-73.844215,1.027431e+06,202756.768749
1,08/30/2015,190422,106099,POINT (-73.98494997200308 40.770045625891846),11,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,1or2,Helpful,NoDamage,Volunteer,No,No,No,No,No,No,No,No,No,8 COLUMBUS AVENUE,10023,New York,107,1,Manhattan,3,67,27,MN14,Lincoln Square,1014500,New York,40.770046,-73.984950,9.884187e+05,219825.522669
2,08/30/2015,190426,106099,POINT (-73.98533807200513 40.77020969000546),11,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,1or2,Helpful,NoDamage,Volunteer,No,No,No,No,No,No,No,No,No,120 WEST 60 STREET,10023,New York,107,1,Manhattan,3,67,27,MN14,Lincoln Square,1014500,New York,40.770210,-73.985338,9.883112e+05,219885.278455
3,09/07/2015,208649,103940,POINT (-73.98729652382876 40.7627238542921),9,0,OnCurb,Alive,Good,combined_common,combined_common,,,NoDamage,Volunteer,No,Yes,No,No,No,No,No,No,No,311 WEST 50 STREET,10019,New York,104,1,Manhattan,3,75,27,MN15,Clinton,1012700,New York,40.762724,-73.987297,9.877691e+05,217157.856088
4,09/08/2015,209610,407443,POINT (-74.07625483097186 40.596579313729144),6,0,OnCurb,Alive,Good,Gleditsia triacanthos var. inermis,honeylocust,,,NoDamage,TreesCount Staff,No,No,No,No,No,No,No,No,No,65 JEROME AVENUE,10305,Staten Island,502,5,Staten Island,50,64,23,SI14,Grasmere-Arrochar-Ft. Wadsworth,5006400,New York,40.596579,-74.076255,9.630732e+05,156635.554233
5,08/31/2015,192755,207508,POINT (-73.96974394191379 40.58635724735751),21,0,OffsetFromCurb,Alive,Fair,Platanus x acerifolia,London planetree,,,NoDamage,TreesCount Staff,No,No,No,No,No,No,No,No,No,638 AVENUE Z,11223,Brooklyn,313,3,Brooklyn,47,45,23,BK26,Gravesend,3037402,New York,40.586357,-73.969744,9.926537e+05,152903.630594
6,09/05/2015,203719,302371,POINT (-73.91117076849402 40.78242822973097),11,0,OnCurb,Alive,Good,Platanus x acerifolia,London planetree,,,NoDamage,Volunteer,No,No,No,No,No,No,No,No,No,20-025 24 STREET,11105,Astoria,401,4,Queens,22,36,13,QN72,Steinway,4010500,New York,40.782428,-73.911171,1.008850e+06,224349.036588
7,09/05/2015,203726,302371,POINT (-73.91201956608866 40.78173511421239),8,0,OnCurb,Alive,Poor,Platanus x acerifolia,London planetree,,,NoDamage,Volunteer,No,No,No,No,No,No,No,No,No,20-055 24 STREET,11105,Astoria,401,4,Queens,22,36,13,QN72,Steinway,4010500,New York,40.781735,-73.912020,1.008615e+06,224096.273970
8,09/01/2015,195202,415896,POINT (-74.16267038247524 40.55710259269471),13,0,OnCurb,Alive,Fair,Platanus x acerifolia,London planetree,,,NoDamage,TreesCount Staff,Yes,No,No,No,No,No,No,No,No,35 FENWAY CIRCLE,10308,Staten Island,503,5,Staten Island,51,62,24,SI54,Great Kills,5014607,New York,40.557103,-74.162670,9.390480e+05,142285.957932
9,08/30/2015,189465,219493,POINT (-73.96821054029427 40.69473313907219),22,0,OnCurb,Alive,Good,Platanus x acerifolia,London planetree,3or4,Harmful,NoDamage,Volunteer,No,No,Yes,No,No,No,No,No,No,100 WAVERLY AVENUE,11205,Brooklyn,302,3,Brooklyn,35,50,25,BK69,Clinton Hill,3019100,New York,40.694733,-73.968211,9.930653e+05,192388.065077


In [25]:
classfication2=classfication2.sample(frac=1)
classfication2

Unnamed: 0,created_at,tree_id,block_id,the_geom,tree_dbh,stump_diam,curb_loc,status,health,spc_latin,spc_common,steward,guards,sidewalk,user_type,root_stone,root_grate,root_other,trnk_wire,trnk_light,trnk_other,brnch_ligh,brnch_shoe,brnch_othe,address,zipcode,zip_city,cb_num,borocode,boroname,cncldist,st_assem,st_senate,nta,nta_name,boro_ct,state,Latitude,longitude,x_sp,y_sp
594240,07/02/2015,47472,219312,POINT (-73.98840075692947 40.69057204202966),10,0,OnCurb,Alive,Fair,Gleditsia triacanthos var. inermis,honeylocust,1or2,,Damage,Volunteer,No,Yes,Yes,No,No,Yes,No,No,Yes,150 LIVINGSTON STREET,11201,Brooklyn,302,3,Brooklyn,33,52,25,BK38,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill,3003700,New York,40.690572,-73.988401,9.874667e+05,190870.668505
810400,07/18/2016,631096,307653,POINT (-73.77669538074173 40.71874891014502),8,0,OnCurb,Alive,Good,Ginkgo biloba,ginkgo,,,Damage,TreesCount Staff,No,No,No,No,No,No,No,No,No,86-086 CHEVY CHASE STREET,11432,Jamaica,408,4,Queens,24,24,11,QN06,Jamaica Estates-Holliswood,4047200,New York,40.718749,-73.776695,1.046151e+06,201215.032003
591481,12/21/2015,528922,413244,POINT (-74.18027807596867 40.551887857252034),11,0,OnCurb,Alive,Good,combined_common,combined_common,,,Damage,NYC Parks Staff,No,No,No,No,No,No,No,No,No,198 WOEHRLE AVENUE,10312,Staten Island,503,5,Staten Island,51,63,24,SI48,Arden Heights,5017008,New York,40.551888,-74.180278,9.341513e+05,140395.685794
227903,01/03/2016,542889,304651,POINT (-73.9050899362948 40.7533727100594),16,0,OnCurb,Alive,Good,Pyrus calleryana,Callery pear,,,NoDamage,Volunteer,Yes,No,No,No,No,No,No,No,No,56-002 NORTHERN BOULEVARD,11377,Woodside,402,4,Queens,26,30,12,QN63,Woodside,4025500,New York,40.753373,-73.905090,1.010546e+06,213764.919305
134131,11/02/2015,412773,338696,POINT (-73.79232237797451 40.5924123815478),4,0,OnCurb,Alive,Good,combined_common,combined_common,1or2,Unsure,NoDamage,Volunteer,No,No,No,No,No,No,Yes,No,Yes,319 BEACH 63 STREET,11692,Arverne,414,4,Queens,31,31,10,QN12,Hammels-Arverne-Edgemere,4096400,New York,40.592412,-73.792322,1.041928e+06,155176.589631
159322,11/06/2015,438866,506511,POINT (-73.87682662663791 40.879898132140994),5,0,OnCurb,Alive,Good,Ginkgo biloba,ginkgo,,,NoDamage,Volunteer,No,No,No,No,No,No,No,No,No,3402 TRYON AVENUE,10467,Bronx,207,2,Bronx,11,81,36,BX43,Norwood,2042300,New York,40.879898,-73.876827,1.018311e+06,259872.389754
202566,12/09/2015,505757,507628,POINT (-73.85852058178114 40.81494844014869),3,0,OnCurb,Alive,Good,Prunus,cherry,1or2,,NoDamage,TreesCount Staff,No,No,No,No,No,No,No,No,No,456 UNDERHILL AVENUE,10473,Bronx,209,2,Bronx,18,85,34,BX09,Soundview-Castle Hill-Clason Point-Harding Park,2000400,New York,40.814948,-73.858521,1.023412e+06,236216.436409
234729,12/22/2015,531525,343814,POINT (-73.87084441906444 40.74824007251451),27,0,OnCurb,Alive,Good,Quercus palustris,pin oak,,,NoDamage,TreesCount Staff,No,No,No,No,No,No,No,No,No,95-001 40 ROAD,11373,Elmhurst,404,4,Queens,21,34,13,QN29,Elmhurst,4046500,New York,40.748240,-73.870844,1.020036e+06,211907.070658
313820,12/27/2015,536539,415182,POINT (-74.14092608671861 40.56315500085412),2,0,OnCurb,Alive,Good,combined_common,combined_common,,,NoDamage,NYC Parks Staff,No,No,No,No,No,No,No,No,No,222 CORONA AVENUE,10306,Staten Island,503,5,Staten Island,51,62,24,SI54,Great Kills,5014606,New York,40.563155,-74.140926,9.450937e+05,144480.523753
601098,07/26/2015,95311,413869,POINT (-74.15187733034254 40.544561163701076),17,0,OnCurb,Alive,Good,Quercus palustris,pin oak,1or2,,Damage,TreesCount Staff,Yes,No,No,No,No,No,No,No,No,38 HOLLY AVENUE,10308,Staten Island,503,5,Staten Island,51,64,24,SI54,Great Kills,5015601,New York,40.544561,-74.151877,9.420392e+05,137711.436569


In [26]:
columns_to_remove_num=['tree_id','block_id','zipcode','boro_ct','Latitude','longitude','x_sp','y_sp','borocode']
numeric_subset=classfication2.select_dtypes(include=['number'])
numeric_subset2=numeric_subset.drop(columns=columns_to_remove_num,axis=1)

In [27]:
print(classfication2.columns)
print(numeric_subset2.columns)
categorical_features=classfication2[['status','health','spc_latin','spc_common','steward','guards','root_stone','root_grate','root_other','trnk_wire','trnk_light','trnk_other','brnch_ligh','brnch_shoe','brnch_othe','borocode']]
categorical_features=pd.get_dummies(categorical_features)
categorical_features

Index(['created_at', 'tree_id', 'block_id', 'the_geom', 'tree_dbh',
       'stump_diam', 'curb_loc', 'status', 'health', 'spc_latin', 'spc_common',
       'steward', 'guards', 'sidewalk', 'user_type', 'root_stone',
       'root_grate', 'root_other', 'trnk_wire', 'trnk_light', 'trnk_other',
       'brnch_ligh', 'brnch_shoe', 'brnch_othe', 'address', 'zipcode',
       'zip_city', 'cb_num', 'borocode', 'boroname', 'cncldist', 'st_assem',
       'st_senate', 'nta', 'nta_name', 'boro_ct', 'state', 'Latitude',
       'longitude', 'x_sp', 'y_sp'],
      dtype='object')
Index(['tree_dbh', 'stump_diam', 'cb_num', 'cncldist', 'st_assem',
       'st_senate'],
      dtype='object')


Unnamed: 0,borocode,status_Alive,health_Fair,health_Good,health_Poor,spc_latin_Acer platanoides,spc_latin_Ginkgo biloba,spc_latin_Gleditsia triacanthos var. inermis,spc_latin_Platanus x acerifolia,spc_latin_Prunus,spc_latin_Pyrus calleryana,spc_latin_Quercus palustris,spc_latin_Tilia cordata,spc_latin_Zelkova serrata,spc_latin_combined_common,spc_common_Callery pear,spc_common_Japanese zelkova,spc_common_London planetree,spc_common_Norway maple,spc_common_cherry,spc_common_combined_common,spc_common_ginkgo,spc_common_honeylocust,spc_common_littleleaf linden,spc_common_pin oak,steward_1or2,steward_3or4,steward_4orMore,steward_None,guards_Harmful,guards_Helpful,guards_None,guards_Unsure,root_stone_No,root_stone_Yes,root_grate_No,root_grate_Yes,root_other_No,root_other_Yes,trnk_wire_No,trnk_wire_Yes,trnk_light_No,trnk_light_Yes,trnk_other_No,trnk_other_Yes,brnch_ligh_No,brnch_ligh_Yes,brnch_shoe_No,brnch_shoe_Yes,brnch_othe_No,brnch_othe_Yes
594240,3,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,1
810400,4,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
591481,5,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
227903,4,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
134131,4,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1
159322,2,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
202566,2,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
234729,4,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
313820,5,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
601098,5,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0


In [28]:
data_final=pd.concat([numeric_subset2,categorical_features],axis=1)
data_final

Unnamed: 0,tree_dbh,stump_diam,cb_num,cncldist,st_assem,st_senate,borocode,status_Alive,health_Fair,health_Good,health_Poor,spc_latin_Acer platanoides,spc_latin_Ginkgo biloba,spc_latin_Gleditsia triacanthos var. inermis,spc_latin_Platanus x acerifolia,spc_latin_Prunus,spc_latin_Pyrus calleryana,spc_latin_Quercus palustris,spc_latin_Tilia cordata,spc_latin_Zelkova serrata,spc_latin_combined_common,spc_common_Callery pear,spc_common_Japanese zelkova,spc_common_London planetree,spc_common_Norway maple,spc_common_cherry,spc_common_combined_common,spc_common_ginkgo,spc_common_honeylocust,spc_common_littleleaf linden,spc_common_pin oak,steward_1or2,steward_3or4,steward_4orMore,steward_None,guards_Harmful,guards_Helpful,guards_None,guards_Unsure,root_stone_No,root_stone_Yes,root_grate_No,root_grate_Yes,root_other_No,root_other_Yes,trnk_wire_No,trnk_wire_Yes,trnk_light_No,trnk_light_Yes,trnk_other_No,trnk_other_Yes,brnch_ligh_No,brnch_ligh_Yes,brnch_shoe_No,brnch_shoe_Yes,brnch_othe_No,brnch_othe_Yes
594240,10,0,302,33,52,25,3,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,1
810400,8,0,408,24,24,11,4,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
591481,11,0,503,51,63,24,5,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
227903,16,0,402,26,30,12,4,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
134131,4,0,414,31,31,10,4,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1
159322,5,0,207,11,81,36,2,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
202566,3,0,209,18,85,34,2,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
234729,27,0,404,21,34,13,4,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
313820,2,0,503,51,62,24,5,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
601098,17,0,503,51,64,24,5,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0


In [29]:
tree_labels=classfication2['sidewalk'].map({'Damage':1,'NoDamage':0}).astype(np.int32)
tree_data=data_final
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(tree_labels)
print(le.classes_)
tree_labels=le.transform(tree_labels)
tree_labels

[0 1]


array([1, 1, 1, ..., 0, 0, 1])

In [30]:
scaler=MinMaxScaler()
scaler.fit(tree_data)
tree_data=scaler.transform(tree_data)

In [31]:
x,x_test,y,y_test= train_test_split(tree_data, tree_labels, test_size = 0.3, random_state = 42)

In [32]:
print(x.shape,'\n',x_test.shape,'\n',y.shape,'\n',y_test.shape)

(587556, 57) 
 (251810, 57) 
 (587556,) 
 (251810,)


In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC,SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier,RandomForestClassifier
from sklearn.metrics import accuracy_score,classification_report,f1_score,precision_score,recall_score
from sklearn.neighbors import KNeighborsClassifier

In [34]:
def fit_and_evaluate(model):
    
    # Train the model
    model.fit(x, y)
    
    # Make predictions and evalute
    model_pred = model.predict(x_test)
    model_acuracy = accuracy_score(y_test, model_pred)
    model_f1=f1_score(y_test,model_pred)
    model_precision=precision_score(y_test,model_pred)
    model_recall=recall_score(y_test,model_pred)
    conf=confusion_matrix(y_test,model_pred,labels=[1,0])
    # Return the performance metric
    return model_acuracy,model_f1,model_precision,model_recall,conf

In [35]:
lr=LogisticRegression()
accuracy,f1,precision,recall,_=fit_and_evaluate(lr)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

acc: 0.6848774869941623  f1:  0.5888826141244372  precision  0.7049243363929546  recall:  0.5056453693735375


In [36]:
svm=LinearSVC(C=1000)
accuracy,f1,precision,recall,_=fit_and_evaluate(svm)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

acc: 0.6362733807235614  f1:  0.4338677974063863  precision  0.7106035756949928  recall:  0.31226143976938064


In [37]:
gbtree=GradientBoostingClassifier()
accuracy,f1,precision,recall,_=fit_and_evaluate(gbtree)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

rf=RandomForestClassifier()
accuracy,f1,precision,recall,_=fit_and_evaluate(rf)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

nb=GaussianNB()
accuracy,f1,precision,recall,_=fit_and_evaluate(nb)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

ab=AdaBoostClassifier()
accuracy,f1,precision,recall,_=fit_and_evaluate(ab)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

bc=BaggingClassifier()
accuracy,f1,precision,recall,_=fit_and_evaluate(bc)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

# knn=KNeighborsClassifier()
# accuracy,f1,precision,recall,_=fit_and_evaluate(knn)
# print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)

acc: 0.6906199118382907  f1:  0.5912537055011937  precision  0.7205058694151045  recall:  0.5013212566618918
acc: 0.7339740280370121  f1:  0.696554598247855  precision  0.7095002999123333  recall:  0.6840728515121048
acc: 0.6646241213613439  f1:  0.5954423733766389  precision  0.6449935655278343  recall:  0.5529614833664018
acc: 0.6890393550692983  f1:  0.5987033952594492  precision  0.7060266885848281  recall:  0.5197031843620155
acc: 0.7331043246892498  f1:  0.6966768816937388  precision  0.7069410859529567  recall:  0.6867064674846298


In [38]:
gbtree_adv=GradientBoostingClassifier(max_depth=20,max_features=10,max_leaf_nodes=5,n_estimators=1000)
accuracy,f1,precision,recall,conf=fit_and_evaluate(gbtree)
print("acc:",accuracy," f1: ",f1," precision ",precision," recall: ",recall)
print(conf)

acc: 0.6906199118382907  f1:  0.5912537055011937  precision  0.7205058694151045  recall:  0.5013212566618918
[[ 56345  56048]
 [ 21857 117560]]


In [39]:
rf.feature_importances_

array([3.37675248e-01, 0.00000000e+00, 8.26751953e-02, 7.65624689e-02,
       9.49915074e-02, 6.05924711e-02, 6.68100094e-03, 0.00000000e+00,
       1.03820007e-02, 1.13945015e-02, 6.29930411e-03, 2.20901666e-03,
       2.41629487e-03, 2.97014673e-03, 3.34010357e-03, 2.35362693e-03,
       2.95983613e-03, 2.72599928e-03, 2.79032244e-03, 2.47055338e-03,
       4.28443282e-03, 2.67744702e-03, 2.53403419e-03, 5.42895132e-03,
       2.17871159e-03, 2.29537125e-03, 4.58284325e-03, 2.26668192e-03,
       2.72778082e-03, 2.78169140e-03, 2.91962562e-03, 9.37372397e-03,
       4.12066566e-03, 9.79898601e-04, 9.37275656e-03, 5.19985041e-03,
       7.01701986e-03, 8.65815883e-03, 3.55812875e-03, 7.03164411e-02,
       7.69465087e-02, 1.13477015e-03, 1.28635986e-03, 5.98876226e-03,
       5.40238314e-03, 3.89650743e-03, 3.90864073e-03, 5.97603509e-04,
       5.75116590e-04, 5.47127138e-03, 5.20208589e-03, 5.62224219e-03,
       8.58772357e-03, 3.02178856e-04, 2.98306796e-04, 4.96070180e-03,
      

In [40]:

# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

In [41]:
# Create the model to use for hyperparameter tuning# Create 
model = RandomForestClassifier(random_state = 42)

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25, 
                               scoring ="accuracy",
                               n_jobs = 1, verbose = 1, 
                               return_train_score = True,
                               random_state=42)

# Fit on the training data
random_cv.fit(x, y)
# Get all of the cv results and sort by the test performance
random_results = pd.DataFrame(random_cv.cv_results_).sort_values('mean_test_score', ascending = False)

random_results.head(10)

Fitting 4 folds for each of 25 candidates, totalling 100 fits


KeyboardInterrupt: 