<a id = 'top'></a>

# Crime in Boston, Revisited (Version 1.0)

**Ying Zhou**

**Table of contents**

[1. Data wrangling](#1)

[1.1 Exploration](#1.1)

[1.2 Removing irrelevant columns](#1.2)

[1.3 Process location data](#1.3)

[1.4 Process time](#1.4)

[1.5 Remove non-crimes](#1.5)

[1.6 Combine the two dataframes](#1.6)

[2. More preprocessing](#2)

[2.1 Preparation](#2.1)

[2.2 Adding weather data](#2.2B)

[2.3 Adding unemployment data](#2.2C)

[2.4 Adding holiday data](#2.2D)

[3. Choosing the best regressor](#3)

[3.1 Select and split](#2.2A)

[3.2 Linear Regressor](#2.2)

[3.3 BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor](#2.4)

[3.4 GradientBoostingRegressor, RandomForestRegressor, LGBMRegressor](#2.4)

[3.5 KNeighborsRegressor, RadiusNeighborsRegressor](#2.5)

[3.6 DecisionTreeRegressor](#2.6)

[3.7 Ridge, RidgeCV, BayesianRidge](#2.7)

[3.8 HuberRegressor, TheilSenRegressor, RANSACRegressor](#2.8)

[3.9 MLPRegressor](#2.9)

[3.10 SVR](#2.10)

[4. Tuning hyperparameters](#4)

Now let's return to the problem of crime in Boston. This time we will predict the number of crimes, do some validation, and finally use all of the data to predict future crime in Boston. We will skip the preliminary analysis, because the last project already explored it in detail, especially for the most recent 3-4 years.

Again let's first import the usual packages.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import datetime
import pickle
from bs4 import BeautifulSoup

Since we need to do some machine learning, let's import the regression-related parts of sklearn too. This local computer cannot handle deep learning, however, which is why we won't import Keras. If necessary, we will run some regressions on Google Colab.

In [2]:
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler

from sklearn.metrics import mean_squared_error, median_absolute_error, mean_absolute_error
from sklearn.metrics import r2_score, explained_variance_score
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor

from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.linear_model import Ridge, RidgeCV, BayesianRidge
from sklearn.linear_model import HuberRegressor, TheilSenRegressor, RANSACRegressor

from sklearn.svm import SVR

from sklearn.neural_network import MLPRegressor

import lightgbm as lgb

Since we need to draw graphs, we define our `multiliner` function here. It staggers tick labels over several lines, which leaves more room for them when they are really long.

In [3]:
def multiliner(string_list, n):
    length = len(string_list)
    for i in range(length):
        rem = i % n
        string_list[i] = '\n' * rem + string_list[i]
    return string_list
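
To make the effect of `multiliner` concrete, here is a small sketch with made-up labels (the label values are illustrative only). The function is repeated so that the snippet runs on its own; note that it modifies the list it receives, so the example passes a copy.

```python
def multiliner(string_list, n):
    # Same logic as the cell above: label i is pushed down by (i % n) newlines.
    for i in range(len(string_list)):
        string_list[i] = '\n' * (i % n) + string_list[i]
    return string_list

labels = ['Larceny', 'Aggravated Assault', 'Auto Theft', 'Robbery']  # illustrative labels
staggered = multiliner(list(labels), 2)  # pass a copy, because the list is modified in place
print(staggered)
```

With `n = 2`, every second label starts one line lower, so neighbouring labels no longer collide. The staggered list can then be handed to a tick-label setter such as `plt.xticks(positions, staggered)`.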

Time to get the data!

In [4]:
# The new crime dataset is updated regularly, so its download link may not be stable.
# We therefore scrape the preview page to obtain the current link.
new_preview_url = 'https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b'
new_req = requests.get(new_preview_url)
new_req.raise_for_status()  # raises on 4xx/5xx responses, so no separate status check is needed
new_soup = BeautifulSoup(new_req.text, 'html.parser')
new_url = new_soup.find_all('a',{'class':'btn btn-primary resource-url-analytics resource-type-None'})[0]['href']
old_url = 'https://data.boston.gov/dataset/eefad66a-e805-4b35-b170-d26e2028c373/resource/ba5ed0e2-e901-438c-b2e0-4acfc3c452b9/download/crime-incident-reports-july-2012-august-2015-source-legacy-system.csv'

In [5]:
df_new = pd.read_csv(new_url)
df_old = pd.read_csv(old_url)

[Return to top](#top)
<a id = '1'></a>
# 1. Data Wrangling

<a id = '1.1'></a>
[Return to top](#top)
## 1.1 Exploration

In [6]:
df_new.shape

(406126, 17)

In [7]:
df_old.head()

Unnamed: 0,COMPNOS,NatureCode,INCIDENT_TYPE_DESCRIPTION,MAIN_CRIMECODE,REPTDISTRICT,REPORTINGAREA,FROMDATE,WEAPONTYPE,Shooting,DOMESTIC,SHIFT,Year,Month,DAY_WEEK,UCRPART,X,Y,STREETNAME,XSTREETNAME,Location
0,120420285.0,BERPTA,RESIDENTIAL BURGLARY,05RB,D4,629,07/08/2012 06:00:00 AM,Other,No,No,Last,2012,7,Sunday,Part One,763273.1791,2951498.962,ABERDEEN ST,,"(42.34638135, -71.10379454)"
1,120419202.0,PSHOT,AGGRAVATED ASSAULT,04xx,B2,327,07/08/2012 06:03:00 AM,Firearm,Yes,No,Last,2012,7,Sunday,Part One,771223.1638,2940772.099,HOWARD AV,,"(42.31684135, -71.07458456)"
2,120419213.0,ARMROB,ROBBERY,03xx,D4,625,07/08/2012 06:26:00 AM,Firearm,No,No,Last,2012,7,Sunday,Part One,765118.8605,2950217.536,JERSEY ST,QUEENSBERRY ST,"(42.34284135, -71.09698955)"
3,120419223.0,ALARMC,COMMERCIAL BURGLARY,05CB,B2,258,07/08/2012 06:56:00 AM,Other,No,No,Last,2012,7,Sunday,Part One,773591.8648,2940638.174,COLUMBIA RD,,"(42.3164411, -71.06582908)"
4,120419236.0,ARMROB,ROBBERY,03xx,E18,496,07/08/2012 07:15:00 AM,Firearm,No,No,Last,2012,7,Sunday,Part One,759042.7315,2923832.681,COLLINS ST,,"(42.27051636, -71.11989955)"


In [8]:
df_new.head()

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
0,I192056997,3830,Motor Vehicle Accident Response,M/V - LEAVING SCENE - PERSONAL INJURY,C6,206,,2019-07-23 15:40:00,2019,7,Tuesday,15,Part Three,SEAPORT BLVD,42.353115,-71.048421,"(42.35311458, -71.04842121)"
1,I192056994,3006,Medical Assistance,SICK/INJURED/MEDICAL - PERSON,C11,335,,2019-07-23 18:46:00,2019,7,Tuesday,18,Part Three,HAMILTON ST,42.306587,-71.067479,"(42.30658655, -71.06747889)"
2,I192056992,3207,Property Found,PROPERTY - FOUND,E13,304,,2019-07-23 20:36:00,2019,7,Tuesday,20,Part Three,COLUMBUS AVE,42.318939,-71.098262,"(42.31893878, -71.09826197)"
3,I192056991,3006,Medical Assistance,SICK/INJURED/MEDICAL - PERSON,B2,316,,2019-07-23 20:48:00,2019,7,Tuesday,20,Part Three,PARK VIEW ST,42.311095,-71.092385,"(42.31109466, -71.09238463)"
4,I192056986,413,Aggravated Assault,ASSAULT - AGGRAVATED - BATTERY,A1,105,,2019-07-23 19:39:00,2019,7,Tuesday,19,Part One,SUMMER ST,42.355216,-71.060129,"(42.35521625, -71.06012863)"


In [9]:
df_old.shape

(268056, 20)

In [10]:
df_new.dtypes

INCIDENT_NUMBER         object
OFFENSE_CODE             int64
OFFENSE_CODE_GROUP      object
OFFENSE_DESCRIPTION     object
DISTRICT                object
REPORTING_AREA          object
SHOOTING                object
OCCURRED_ON_DATE        object
YEAR                     int64
MONTH                    int64
DAY_OF_WEEK             object
HOUR                     int64
UCR_PART                object
STREET                  object
Lat                    float64
Long                   float64
Location                object
dtype: object

In [11]:
df_old.dtypes

COMPNOS                      float64
NatureCode                    object
INCIDENT_TYPE_DESCRIPTION     object
MAIN_CRIMECODE                object
REPTDISTRICT                  object
REPORTINGAREA                  int64
FROMDATE                      object
WEAPONTYPE                    object
Shooting                      object
DOMESTIC                      object
SHIFT                         object
Year                           int64
Month                          int64
DAY_WEEK                      object
UCRPART                       object
X                            float64
Y                            float64
STREETNAME                    object
XSTREETNAME                   object
Location                      object
dtype: object

We are very interested in knowing whether the `Lat` / `Long` / `Location` data contains de facto NaN values that aren't labelled as NaN.

In [12]:
df_new['Lat'].value_counts()

 42.348624    1617
 42.361839    1574
 42.284826    1396
 42.328663    1293
 42.256216    1203
 42.297555    1063
 42.331521     971
 42.341288     969
-1.000000      903
 42.335119     883
 42.326966     840
 42.352312     832
 42.309719     828
 42.332108     816
 42.339542     816
 42.326968     789
 42.355123     769
 42.334018     713
 42.342850     690
 42.298489     684
 42.334288     664
 42.310434     663
 42.349802     630
 42.350959     625
 42.333679     621
 42.366435     607
 42.370818     606
 42.356024     603
 42.348406     596
 42.349056     594
              ... 
 42.298472       1
 42.337916       1
 42.340362       1
 42.301391       1
 42.259043       1
 42.279556       1
 42.269286       1
 42.263175       1
 42.347082       1
 42.294600       1
 42.389708       1
 42.322369       1
 42.359521       1
 42.379794       1
 42.333665       1
 42.355278       1
 42.294312       1
 42.304556       1
 42.283780       1
 42.318399       1
 42.380392       1
 42.311887  

In [13]:
df_new['Long'].value_counts()

-71.082776    1617
-71.059765    1574
-71.091374    1396
-71.085634    1293
-71.124019    1203
-71.059709    1063
-71.070853     971
-71.054679     969
-1.000000      903
-71.074917     883
-71.061986     840
-71.063705     832
-71.104294     828
-71.069409     816
-71.070144     816
-71.080519     789
-71.060880     769
-71.076381     713
-71.065162     690
-71.063133     684
-71.072395     664
-71.061340     663
-71.078410     630
-71.074128     625
-71.091878     621
-71.061354     607
-71.039291     606
-71.061776     603
-71.086883     596
-71.150498     594
              ... 
-71.134538       1
-71.104654       1
-71.070620       1
-71.095560       1
-71.103900       1
-71.081540       1
-71.126924       1
-71.156048       1
-71.083212       1
-71.116032       1
-71.079025       1
-71.146787       1
-71.076284       1
-71.124570       1
-71.086179       1
-71.070546       1
-71.071404       1
-71.134441       1
-71.053764       1
-71.050224       1
-71.070924       1
-71.064030  

In [14]:
df_new['Location'].value_counts()

(0.00000000, 0.00000000)       25634
(42.34862382, -71.08277637)     1617
(42.36183857, -71.05976489)     1574
(42.28482577, -71.09137369)     1396
(42.32866284, -71.08563401)     1293
(42.25621592, -71.12401947)     1203
(42.29755533, -71.05970910)     1063
(42.33152148, -71.07085307)      971
(42.34128751, -71.05467933)      969
(-1.00000000, -1.00000000)       903
(42.33511904, -71.07491710)      883
(42.32696647, -71.06198607)      840
(42.35231190, -71.06370510)      832
(42.30971857, -71.10429432)      828
(42.33210843, -71.07014395)      816
(42.33954199, -71.06940877)      816
(42.32696802, -71.08051941)      789
(42.35512339, -71.06087980)      769
(42.33401829, -71.07638124)      713
(42.34285014, -71.06516235)      690
(42.29848866, -71.06313294)      684
(42.33428841, -71.07239518)      664
(42.31043400, -71.06134010)      663
(42.34980175, -71.07840978)      630
(42.35095909, -71.07412780)      625
(42.33367922, -71.09187755)      621
(42.36643546, -71.06135413)      607
(

Other than the (0, 0) and (-1, -1) placeholders, the values are mostly reasonable. So we will apply a filter and treat completely absurd outliers as NaN.

In [15]:
df_old['Location'].value_counts()

(0.0, 0.0)                               14981
(42.3286598, -71.08561842)                1506
(42.32543556, -71.06387302)               1008
(42.28486136, -71.09132455)                843
(42.34130529, -71.0547108)                 735
(42.31037135, -71.06123456)                714
(42.34865634, -71.08256955)                699
(42.29754136, -71.05973457)                695
(42.36164815, -71.05998657)                675
(42.33950635, -71.06938956)                635
(42.25642136, -71.12394954)                624
(42.35237134, -71.06490456)                597
(42.33325635, -71.07289955)                595
(42.35230134, -71.06367456)                580
(42.33372337, -71.09095643)                532
(42.28714136, -71.14857453)                463
(42.34898135, -71.15091453)                431
(42.32723569, -71.08059616)                426
(42.35505634, -71.06084456)                425
(42.30972244, -71.10427304)                416
(42.34710135, -71.07960455)                397
(42.35075635,

<a id = '1.2'></a>
[Return to top](#top)
## 1.2 Removing irrelevant columns

As usual, we will drop the irrelevant columns. For example, I haven't figured out what an RA number actually is. `X` and `Y` in the old table are also irrelevant, so we will get rid of them too.

In [16]:
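# .copy() makes an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df_old_simplified = df_old[['INCIDENT_TYPE_DESCRIPTION', 'FROMDATE', 'Year' ,'Month', 'DAY_WEEK', 'UCRPART', 'STREETNAME', 'Location']].copy()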

In [17]:
df_old_simplified['INCIDENT_TYPE_DESCRIPTION'].value_counts()

VAL                                 27363
OTHER LARCENY                       24443
SIMPLE ASSAULT                      17697
MedAssist                           17128
MVAcc                               13832
VANDALISM                           13339
InvPer                              12937
LARCENY FROM MOTOR VEHICLE          12742
DRUG CHARGES                        12042
FRAUD                                8742
PropLost                             8522
TOWED                                7526
RESIDENTIAL BURGLARY                 6737
InvProp                              6592
AGGRAVATED ASSAULT                   5649
Service                              5353
ROBBERY                              4974
PersLoc                              4745
AUTO THEFT                           4620
PropFound                            4316
Argue                                2833
Arrest                               1959
OTHER                                1902
FIRE                              

So homogenizing the data can be hard: the old labels mix abbreviations such as `MedAssist` with full names. However, this still has to be done.

In [18]:
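# .copy() makes an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df_new_simplified = df_new[['OFFENSE_CODE_GROUP','OCCURRED_ON_DATE','YEAR','MONTH','DAY_OF_WEEK','HOUR','UCR_PART','STREET','Lat','Long']].copy()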

In [19]:
df_new_simplified['OFFENSE_CODE_GROUP'].value_counts()

Motor Vehicle Accident Response              47283
Larceny                                      32991
Medical Assistance                           30884
Investigate Person                           23810
Other                                        22831
Drug Violation                               21053
Simple Assault                               20280
Vandalism                                    19141
Verbal Disputes                              16907
Investigate Property                         14387
Towed                                        14230
Larceny From Motor Vehicle                   13277
Property Lost                                12788
Warrant Arrests                              10592
Aggravated Assault                           10107
Fraud                                         7800
Violations                                    7630
Missing Person Located                        6872
Residential Burglary                          6605
Auto Theft                     

We are definitely going to restrict our attention to major crimes.

In [20]:
df_old_simplified.dtypes

INCIDENT_TYPE_DESCRIPTION    object
FROMDATE                     object
Year                          int64
Month                         int64
DAY_WEEK                     object
UCRPART                      object
STREETNAME                   object
Location                     object
dtype: object

<a id = '1.3'></a>
[Return to top](#top)
## 1.3 Process location data

In [21]:
def get_lat_long(loc_string):
    loc_list = loc_string.lstrip('(').rstrip(')').split()
    return loc_list[0].strip(','), loc_list[1]

In [22]:
get_lat_long('(42.34638135, -71.10379454)')

('42.34638135', '-71.10379454')
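
As a possible alternative to calling `get_lat_long` twice per row, the two numbers can be pulled out of the whole column in one pass with `Series.str.extract`. This is only a sketch on a toy Series; the regular expression and the `coords` name are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_old_simplified['Location'].
locations = pd.Series(['(42.34638135, -71.10379454)', '(0.0, 0.0)', 'unknown'])

# Named groups become the column names of the resulting DataFrame;
# rows that do not match the pattern come back as NaN.
pattern = r'\(\s*(?P<Lat>-?[\d.]+)\s*,\s*(?P<Long>-?[\d.]+)\s*\)'
coords = locations.str.extract(pattern).astype(float)
print(coords)
```

Unlike `get_lat_long`, this returns floats directly, and malformed strings become NaN rather than raising an error.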

In [23]:
df_old_simplified['Lat'] = df_old_simplified['Location'].apply(lambda x: get_lat_long(x)[0])
df_old_simplified['Long'] = df_old_simplified['Location'].apply(lambda x: get_lat_long(x)[1])

In [24]:
df_old_simplified.tail()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Location,Lat,Long
268051,Motor Vehicle Accident Response,08/10/2015 02:38:00 AM,2015,8,Monday,Part Three,HARVARD ST,"(0.0, 0.0)",0.0,0.0
268052,Police Service Incidents,08/10/2015 04:46:00 AM,2015,8,Monday,Part Three,DORCHESTER AVE,"(0.0, 0.0)",0.0,0.0
268053,Motor Vehicle Accident Response,08/10/2015 04:48:00 AM,2015,8,Monday,Part Three,DECKARD ST,"(0.0, 0.0)",0.0,0.0
268054,Investigate Person,08/10/2015 05:01:00 AM,2015,8,Monday,Part Three,HAMMOND ST,"(0.0, 0.0)",0.0,0.0
268055,Motor Vehicle Accident Response,08/10/2015 05:20:00 AM,2015,8,Monday,Part Three,,"(0.0, 0.0)",0.0,0.0


In [25]:
del df_old_simplified['Location']

Now we need to turn the implausible coordinates into NaN.

In [26]:
def lat_na_er(num_string):
    try:
        num = float(num_string)
        if num < 40 or num > 45:
            return np.nan
        return num
    except ValueError as e:
        return np.nan
    

In [27]:
def long_na_er(num_string):
    try:
        num = float(num_string)
        if num < -75 or num > -70:
            return np.nan
        return num
    except ValueError as e:
        return np.nan
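
The two helpers above work one value at a time. A vectorized sketch of the same rule, using the same bounds (40 to 45 for latitude, -75 to -70 for longitude), could look like this. The name `clean_coordinate` and the toy DataFrame are assumptions for illustration.

```python
import pandas as pd

def clean_coordinate(series, low, high):
    # Unparseable entries become NaN, then values outside [low, high] are masked to NaN.
    numeric = pd.to_numeric(series, errors='coerce')
    return numeric.where(numeric.between(low, high))

# Toy data with the placeholder values seen earlier: (0, 0), (-1, -1), and junk.
demo = pd.DataFrame({'Lat': ['42.35', '0.0', '-1', 'bad'],
                     'Long': ['-71.06', '0.0', '-1', None]})
demo['Lat'] = clean_coordinate(demo['Lat'], 40, 45)
demo['Long'] = clean_coordinate(demo['Long'], -75, -70)
print(demo)
```

`between` is inclusive at both ends by default, which matches the `<` and `>` comparisons in `lat_na_er` and `long_na_er`.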
    

In [28]:
df_old_simplified['Lat'] = df_old_simplified['Lat'].apply(lat_na_er)
df_old_simplified['Long'] = df_old_simplified['Long'].apply(long_na_er)

In [29]:
df_old_simplified.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long
0,RESIDENTIAL BURGLARY,07/08/2012 06:00:00 AM,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,07/08/2012 06:03:00 AM,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585
2,ROBBERY,07/08/2012 06:26:00 AM,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,07/08/2012 06:56:00 AM,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,07/08/2012 07:15:00 AM,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199


In [30]:
df_old_simplified.describe()

Unnamed: 0,Year,Month,Lat,Long
count,268056.0,268056.0,253075.0,253075.0
mean,2013.538664,6.589134,42.323847,-71.08336
std,0.970562,3.323806,0.031772,0.030869
min,2012.0,1.0,42.232264,-71.178674
25%,2013.0,4.0,42.299386,-71.098625
50%,2014.0,7.0,42.32866,-71.078035
75%,2014.0,9.0,42.349236,-71.06228
max,2015.0,12.0,42.395105,-70.964365


Great. We need to do the same for the new one.

In [31]:
df_new_simplified['Lat'] = df_new_simplified['Lat'].apply(lat_na_er)
df_new_simplified['Long'] = df_new_simplified['Long'].apply(long_na_er)

In [32]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long
0,Motor Vehicle Accident Response,2019-07-23 15:40:00,2019,7,Tuesday,15,Part Three,SEAPORT BLVD,42.353115,-71.048421
1,Medical Assistance,2019-07-23 18:46:00,2019,7,Tuesday,18,Part Three,HAMILTON ST,42.306587,-71.067479
2,Property Found,2019-07-23 20:36:00,2019,7,Tuesday,20,Part Three,COLUMBUS AVE,42.318939,-71.098262
3,Medical Assistance,2019-07-23 20:48:00,2019,7,Tuesday,20,Part Three,PARK VIEW ST,42.311095,-71.092385
4,Aggravated Assault,2019-07-23 19:39:00,2019,7,Tuesday,19,Part One,SUMMER ST,42.355216,-71.060129


In [33]:
df_new_simplified.describe()

Unnamed: 0,YEAR,MONTH,HOUR,Lat,Long
count,406126.0,406126.0,406126.0,379589.0,379589.0
mean,2016.995999,6.567747,13.113467,42.322142,-71.082975
std,1.240457,3.336296,6.293223,0.031907,0.029704
min,2015.0,1.0,0.0,42.232413,-71.178674
25%,2016.0,4.0,9.0,42.297466,-71.097348
50%,2017.0,7.0,14.0,42.325574,-71.077665
75%,2018.0,9.0,18.0,42.348577,-71.062609
max,2019.0,12.0,23.0,42.395042,-70.963676


Now we need to process time.

<a id = '1.4'></a>
[Return to top](#top)
## 1.4 Process time

In [34]:
df_new_simplified['OCCURRED_ON_DATE'].isna().sum()

0

In [35]:
df_old_simplified['FROMDATE'].isna().sum()

0

At least there are no explicit NaNs. Now let's check the timeline.

In [36]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long
0,Motor Vehicle Accident Response,2019-07-23 15:40:00,2019,7,Tuesday,15,Part Three,SEAPORT BLVD,42.353115,-71.048421
1,Medical Assistance,2019-07-23 18:46:00,2019,7,Tuesday,18,Part Three,HAMILTON ST,42.306587,-71.067479
2,Property Found,2019-07-23 20:36:00,2019,7,Tuesday,20,Part Three,COLUMBUS AVE,42.318939,-71.098262
3,Medical Assistance,2019-07-23 20:48:00,2019,7,Tuesday,20,Part Three,PARK VIEW ST,42.311095,-71.092385
4,Aggravated Assault,2019-07-23 19:39:00,2019,7,Tuesday,19,Part One,SUMMER ST,42.355216,-71.060129


We need to round times to the nearest hour, because police officers don't really document minutes and seconds carefully (to see why this is true, please check out the old Crime in Boston project).

In [37]:
df_new_simplified['day'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[8:10]))
df_new_simplified['min'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[-5:-3]))
df_new_simplified['sec'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[-2:]))

In [38]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,day,min,sec
0,Motor Vehicle Accident Response,2019-07-23 15:40:00,2019,7,Tuesday,15,Part Three,SEAPORT BLVD,42.353115,-71.048421,23,40,0
1,Medical Assistance,2019-07-23 18:46:00,2019,7,Tuesday,18,Part Three,HAMILTON ST,42.306587,-71.067479,23,46,0
2,Property Found,2019-07-23 20:36:00,2019,7,Tuesday,20,Part Three,COLUMBUS AVE,42.318939,-71.098262,23,36,0
3,Medical Assistance,2019-07-23 20:48:00,2019,7,Tuesday,20,Part Three,PARK VIEW ST,42.311095,-71.092385,23,48,0
4,Aggravated Assault,2019-07-23 19:39:00,2019,7,Tuesday,19,Part One,SUMMER ST,42.355216,-71.060129,23,39,0


In [39]:
del df_new_simplified['OCCURRED_ON_DATE']

In [40]:
def is_leap(year):
    if year % 4 != 0:
        return False
    elif year % 100 != 0:
        return True
    elif year % 400 != 0:
        return False
    else:
        return True

def num_of_days(month, year):
    non_leap = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    if month != 2:
        return non_leap[month - 1]
    else:
        if is_leap(year):
            return 29
        else:
            return 28
tie_break_round_up = False #Tie break round up status
NEXT = {'Monday': 'Tuesday', 'Tuesday': 'Wednesday', 'Wednesday': 'Thursday', 'Thursday':'Friday','Friday':'Saturday','Saturday':'Sunday','Sunday':'Monday'}
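
For reference, the standard library's `calendar` module offers the same leap-year and days-per-month logic, which makes a convenient cross-check for `is_leap` and `num_of_days`. The snippet below is a small sketch of those calls, not a replacement used by the notebook.

```python
import calendar

# calendar.isleap applies the same divisible-by-4/100/400 rule as is_leap.
print(calendar.isleap(2016), calendar.isleap(1900), calendar.isleap(2000))

# calendar.monthrange(year, month) returns (weekday of the 1st, number of days in the month).
feb_2016 = calendar.monthrange(2016, 2)[1]
feb_2019 = calendar.monthrange(2019, 2)[1]
print(feb_2016, feb_2019)

# Days per month for a non-leap year, comparable to the non_leap list above.
days_2019 = [calendar.monthrange(2019, month)[1] for month in range(1, 13)]
print(days_2019)
```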



In [41]:
for index, row in df_new_simplified.iterrows():
    round_up = False #Round up this time?
    if df_new_simplified.at[index, 'min'] == 30 and df_new_simplified.at[index, 'sec'] == 0: #Tie break
        if tie_break_round_up:
            round_up = True
        tie_break_round_up = not tie_break_round_up
    if df_new_simplified.at[index, 'min'] > 30 or (df_new_simplified.at[index, 'min'] == 30 and df_new_simplified.at[index, 'sec'] > 0):
        round_up = True
    if round_up:
        df_new_simplified.at[index, 'HOUR'] = df_new_simplified.at[index, 'HOUR'] + 1
        if df_new_simplified.at[index, 'HOUR'] == 24:
            df_new_simplified.at[index, 'HOUR'] = 0
            df_new_simplified.at[index, 'day'] = df_new_simplified.at[index, 'day'] + 1
            df_new_simplified.at[index, 'DAY_OF_WEEK'] = NEXT[df_new_simplified.at[index, 'DAY_OF_WEEK']]
            if df_new_simplified.at[index, 'day'] > num_of_days(df_new_simplified.at[index, 'MONTH'], df_new_simplified.at[index, 'YEAR']):
                df_new_simplified.at[index, 'day'] = 1
                df_new_simplified.at[index, 'MONTH'] = df_new_simplified.at[index, 'MONTH'] + 1
                if df_new_simplified.at[index,'MONTH'] == 13:
                    df_new_simplified.at[index,'MONTH'] = 1
                    df_new_simplified.at[index, 'YEAR'] = df_new_simplified.at[index, 'YEAR'] + 1
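
The loop above rounds to the nearest hour and resolves exact ties (minute 30, second 0) by alternating between rounding down and rounding up, starting with rounding down. The same rule can be sketched with `datetime` objects, which handle the day, month, year, and weekday rollover automatically. The function names `round_to_hour` and `round_all` are hypothetical, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)

def round_to_hour(ts, round_tie_up):
    # Truncate to the hour, then add one hour if we are past the half-hour mark
    # (or exactly on it and this tie is supposed to round up).
    floor = ts.replace(minute=0, second=0, microsecond=0)
    past = ts - floor
    if past > HALF_HOUR or (past == HALF_HOUR and round_tie_up):
        return floor + timedelta(hours=1)
    return floor

def round_all(timestamps):
    round_tie_up = False  # the first tie rounds down, as in the loop above
    rounded = []
    for ts in timestamps:
        is_tie = ts - ts.replace(minute=0, second=0, microsecond=0) == HALF_HOUR
        rounded.append(round_to_hour(ts, round_tie_up))
        if is_tie:
            round_tie_up = not round_tie_up  # alternate on every tie
    return rounded

stamps = [datetime(2019, 12, 31, 23, 30, 0),  # tie: rounds down
          datetime(2019, 12, 31, 23, 30, 0),  # tie: rounds up, into the next year
          datetime(2016, 2, 28, 23, 45, 0),   # rolls over into Feb 29 of a leap year
          datetime(2019, 7, 23, 15, 40, 0)]
for original, result in zip(stamps, round_all(stamps)):
    print(original, '->', result, result.strftime('%A'))
```

Because the rollover is delegated to `datetime`, no weekday lookup table or days-per-month helper is needed in this variant.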

In [42]:
def extract_hour(old_string):
    hour = int(old_string[11:13])
    code = old_string[-2:]
    if hour == 12:
        hour = hour - 12
    if code == 'PM':
        hour = hour + 12
    return hour
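
As a cross-check on `extract_hour`, the 12-hour `FROMDATE` strings can also be parsed with `datetime.strptime`, where `%I` reads the 12-hour clock and `%p` reads the AM/PM marker. This is a minimal sketch; `FROMDATE_FORMAT` and `parse_fromdate` are hypothetical names, and the midnight and noon samples are made up to show the edge cases.

```python
from datetime import datetime

FROMDATE_FORMAT = '%m/%d/%Y %I:%M:%S %p'  # e.g. '07/08/2012 06:00:00 AM'

def parse_fromdate(text):
    # Returns a full datetime; .hour is already on the 24-hour clock.
    return datetime.strptime(text, FROMDATE_FORMAT)

samples = ['07/08/2012 06:00:00 AM',
           '07/08/2012 12:15:00 AM',  # just after midnight
           '07/08/2012 12:15:00 PM',  # just after noon
           '07/08/2012 11:59:00 PM']
for sample in samples:
    print(sample, '->', parse_fromdate(sample).hour)
```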

In [43]:
df_old_simplified['day'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[3:5]))
df_old_simplified['min'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[14:16]))
df_old_simplified['sec'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[17:19]))
df_old_simplified['hour'] = df_old_simplified['FROMDATE'].apply(extract_hour)

In [44]:
for index, row in df_old_simplified.iterrows():
    round_up = False #Round up this time?
    if df_old_simplified.at[index, 'min'] == 30 and df_old_simplified.at[index, 'sec'] == 0: #Tie break
        if tie_break_round_up:
            round_up = True
        tie_break_round_up = not tie_break_round_up
    if df_old_simplified.at[index, 'min'] > 30 or (df_old_simplified.at[index, 'min'] == 30 and df_old_simplified.at[index, 'sec'] > 0):
        round_up = True
    if round_up:
        df_old_simplified.at[index, 'hour'] = df_old_simplified.at[index, 'hour'] + 1
        if df_old_simplified.at[index, 'hour'] == 24:
            df_old_simplified.at[index, 'hour'] = 0
            df_old_simplified.at[index, 'day'] = df_old_simplified.at[index, 'day'] + 1
            df_old_simplified.at[index, 'DAY_WEEK'] = NEXT[df_old_simplified.at[index, 'DAY_WEEK']]
            if df_old_simplified.at[index, 'day'] > num_of_days(df_old_simplified.at[index, 'Month'], df_old_simplified.at[index, 'Year']):
                df_old_simplified.at[index, 'day'] = 1
                df_old_simplified.at[index, 'Month'] = df_old_simplified.at[index, 'Month'] + 1
                if df_old_simplified.at[index,'Month'] == 13:
                    df_old_simplified.at[index,'Month'] = 1
                    df_old_simplified.at[index, 'Year'] = df_old_simplified.at[index, 'Year'] + 1


In [45]:
df_old_simplified.head(10)

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,min,sec,hour
0,RESIDENTIAL BURGLARY,07/08/2012 06:00:00 AM,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,0,0,6
1,AGGRAVATED ASSAULT,07/08/2012 06:03:00 AM,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,3,0,6
2,ROBBERY,07/08/2012 06:26:00 AM,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,26,0,6
3,COMMERCIAL BURGLARY,07/08/2012 06:56:00 AM,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,56,0,7
4,ROBBERY,07/08/2012 07:15:00 AM,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,15,0,7
5,ROBBERY,07/08/2012 07:32:00 AM,2012,7,Sunday,Part One,SYDNEY ST,42.313282,-71.053006,8,32,0,8
6,ROBBERY,07/08/2012 07:50:00 AM,2012,7,Sunday,Part One,REGENT ST,42.324251,-71.08621,8,50,0,8
7,SIMPLE ASSAULT,07/08/2012 07:50:00 AM,2012,7,Sunday,Part Two,WASHINGTON ST,42.349246,-71.063785,8,50,0,8
8,MedAssist,07/08/2012 07:53:00 AM,2012,7,Sunday,Part Three,FANEUIL ST,42.351746,-71.16591,8,53,0,8
9,MedAssist,07/08/2012 08:05:00 AM,2012,7,Sunday,Part Three,RIVER ST,42.259383,-71.117294,8,5,0,8


In [46]:
del df_new_simplified['min']
del df_new_simplified['sec']
del df_old_simplified['min']
del df_old_simplified['sec']
del df_old_simplified['FROMDATE']

In [47]:
df_old_simplified.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,7


<a id = '1.5'></a>
[Return to top](#top)
## 1.5 Remove non-crimes

As usual we only care about major crimes.

In [48]:
df_new_clean = df_new_simplified.loc[(df_new_simplified['UCR_PART'] == 'Part One') | (df_new_simplified['OFFENSE_CODE_GROUP'] == 'Arson')]

In [49]:
df_new_clean['UCR_PART'].value_counts()

Part One    76883
Other         108
Name: UCR_PART, dtype: int64

In [50]:
df_new_clean['OFFENSE_CODE_GROUP'].value_counts()

Larceny                       32991
Larceny From Motor Vehicle    13277
Aggravated Assault            10107
Residential Burglary           6605
Auto Theft                     5903
Robbery                        5559
Commercial Burglary            1614
Other Burglary                  562
Homicide                        265
Arson                           108
Name: OFFENSE_CODE_GROUP, dtype: int64

In [51]:
df_old_O = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Other']
df_old_NA = df_old_simplified.loc[df_old_simplified['UCRPART'].isnull()]

In [52]:
df_old_O['INCIDENT_TYPE_DESCRIPTION'].value_counts()

MVAcc                              9671
PersLoc                            3479
PersMiss                            780
07RV                                613
Hazardous                           493
Service                             260
Plates                               45
ARSON                                30
Auto Theft Recovery                  29
MedAssist                            22
HateCrim                             19
License Plate Related Incidents       5
Arson                                 3
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

In [53]:
df_old_NA.shape

(0, 10)

In [54]:
df_old_simplified['UCRPART'].value_counts()

Part Two      98341
Part One      65261
Part three    55482
Part Three    33523
Other         15449
Name: UCRPART, dtype: int64

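The data is not clean: `Part three` and `Part Three` are the same category with inconsistent capitalization. That's fine, since we discard both below.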
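
One way to handle the inconsistent capitalization, sketched on a toy frame, is to normalize the case before comparing. Only the column name `UCRPART` and its labels come from this notebook; the rows and the `UCRPART_norm` column are made up for illustration.

```python
import pandas as pd

# Toy rows (hypothetical) that mimic the inconsistent casing seen above.
toy = pd.DataFrame({'UCRPART': ['Part One', 'Part three', 'Part Three', 'Other', 'Part Two']})

# str.title() capitalizes every word, so both spellings become 'Part Three'.
toy['UCRPART_norm'] = toy['UCRPART'].str.title()

part_three_count = (toy['UCRPART_norm'] == 'Part Three').sum()
print(toy['UCRPART_norm'].value_counts())
```

Since the notebook only keeps `Part One` and `Other`, this normalization is optional here.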

In [55]:
df_old_2 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part Two']
df_old_3 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part Three']
df_old_33 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part three']

In [56]:
df_old_33['INCIDENT_TYPE_DESCRIPTION'].value_counts()

MedAssist                   12401
InvPer                       9448
PropLost                     5890
TOWED                        5524
InvProp                      4862
Service                      3505
PropFound                    2964
Argue                        2065
Arrest                       1374
FIRE                         1294
PhoneCalls                    995
LICViol                       836
32GUN                         747
Gather                        718
Landlord                      716
DEATH INVESTIGATION           678
SearchWarr                    521
PropDam                       502
Plates                        228
Harbor                        150
VIOLATION OF LIQUOR LAWS       30
Explos                         23
Aircraft                        7
Labor                           4
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

In [57]:
df_old_semiclean = df_old_simplified.loc[(df_old_simplified['UCRPART'] == 'Part One') | (df_old_simplified['UCRPART'] == 'Other')]

The Part Two and Part Three incidents can be ignored, and so can everything in `Other` except arson.

In [58]:
df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'] = df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'].apply(lambda x: x.upper())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
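
The `SettingWithCopyWarning` appears because `df_old_semiclean` is a slice of `df_old_simplified`. One common way to avoid it is to take an explicit copy before assigning. A minimal sketch on toy data (the rows are made up):

```python
import pandas as pd

# Toy rows (hypothetical); the column names match the old dataframe.
toy = pd.DataFrame({
    'UCRPART': ['Part One', 'Other', 'Part Two'],
    'INCIDENT_TYPE_DESCRIPTION': ['Robbery', 'Arson', 'Argue'],
})

# .copy() makes the filtered frame independent of the original,
# so the assignment below does not trigger SettingWithCopyWarning.
semiclean = toy.loc[toy['UCRPART'].isin(['Part One', 'Other'])].copy()

# For string columns without missing values, .str.upper() gives the
# same result as apply(lambda x: x.upper()).
semiclean['INCIDENT_TYPE_DESCRIPTION'] = semiclean['INCIDENT_TYPE_DESCRIPTION'].str.upper()
print(semiclean)
```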


In [59]:
df_old_semiclean.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,7


In [60]:
df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                      24443
LARCENY FROM MOTOR VEHICLE         13265
MVACC                               9671
RESIDENTIAL BURGLARY                7119
AGGRAVATED ASSAULT                  6008
ROBBERY                             5193
AUTO THEFT                          4851
PERSLOC                             3479
COMMERCIAL BURGLARY                 1550
BENOPROP                            1367
LARCENY                             1288
PERSMISS                             780
07RV                                 613
HAZARDOUS                            493
SERVICE                              260
HOMICIDE                             144
PLATES                                45
ARSON                                 33
AUTO THEFT RECOVERY                   29
OTHER BURGLARY                        22
MEDASSIST                             22
HATECRIM                              19
MANSLAUG                               9
LICENSE PLATE RELATED INCIDENTS        5
RAPE AND ATTEMPTED                     2
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

In [61]:
df_old_clean = df_old_semiclean.loc[(df_old_semiclean['UCRPART'] == 'Part One') | (df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'] == 'Arson')]

In [62]:
df_old_clean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                 24443
LARCENY FROM MOTOR VEHICLE    13265
RESIDENTIAL BURGLARY           7119
AGGRAVATED ASSAULT             6008
ROBBERY                        5193
AUTO THEFT                     4851
COMMERCIAL BURGLARY            1550
BENOPROP                       1367
LARCENY                        1288
HOMICIDE                        144
OTHER BURGLARY                   22
MANSLAUG                          9
RAPE AND ATTEMPTED                2
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

`BENOPROP` means "Break and enter, no property taken". Since it falls under `Other` in the new data source, let's remove it. `RAPE AND ATTEMPTED` and `MANSLAUG` need to be removed as well because they are either not present in the new data source or not in `Part One` there.

In [63]:
df_old_clean = df_old_clean[df_old_clean['INCIDENT_TYPE_DESCRIPTION'] != 'BENOPROP'] 
df_old_clean = df_old_clean[df_old_clean['INCIDENT_TYPE_DESCRIPTION'] != 'MANSLAUG'] 
df_old_clean = df_old_clean[df_old_clean['INCIDENT_TYPE_DESCRIPTION'] != 'RAPE AND ATTEMPTED'] 
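
The three filters can also be written as one `isin` test. A minimal sketch on a toy frame (the rows are made up; `excluded` and `kept` are names introduced here):

```python
import pandas as pd

# Toy rows (hypothetical) using labels that appear in the old data.
toy = pd.DataFrame({'INCIDENT_TYPE_DESCRIPTION': [
    'ROBBERY', 'BENOPROP', 'MANSLAUG', 'LARCENY', 'RAPE AND ATTEMPTED']})

excluded = ['BENOPROP', 'MANSLAUG', 'RAPE AND ATTEMPTED']

# ~ negates the boolean mask, keeping rows whose label is not in the list.
kept = toy[~toy['INCIDENT_TYPE_DESCRIPTION'].isin(excluded)]
print(kept)
```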

In [64]:
df_old_clean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                 24443
LARCENY FROM MOTOR VEHICLE    13265
RESIDENTIAL BURGLARY           7119
AGGRAVATED ASSAULT             6008
ROBBERY                        5193
AUTO THEFT                     4851
COMMERCIAL BURGLARY            1550
LARCENY                        1288
HOMICIDE                        144
OTHER BURGLARY                   22
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

Now we can drop the UCR part columns (`UCRPART` in the old data, `UCR_PART` in the new data).

In [65]:
del df_old_clean['UCRPART']

In [66]:
del df_new_clean['UCR_PART']

Let's store the data so that it isn't lost.

In [67]:
df_old_clean.to_csv('old.csv')
df_new_clean.to_csv('new.csv')

<a id = '1.6'></a>
[Return to top](#top)
## 1.6 Combine the two dataframes

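Now it's time to combine the two dataframes. We rename their columns to a common set of names, put the columns in the same order, and concatenate the rows.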

In [68]:
df_new_clean.head()

Unnamed: 0,OFFENSE_CODE_GROUP,YEAR,MONTH,DAY_OF_WEEK,HOUR,STREET,Lat,Long,day
4,Aggravated Assault,2019,7,Tuesday,20,SUMMER ST,42.355216,-71.060129,23
7,Larceny,2019,7,Tuesday,21,BLUE HILL AVE,42.285154,-71.091022,23
14,Larceny,2019,7,Wednesday,16,MASSACHUSETTS AVE,42.336892,-71.077551,17
16,Larceny,2019,7,Tuesday,15,RIVER ST,42.271302,-71.074424,23
20,Larceny,2019,7,Tuesday,18,SUMMER ST,42.354262,-71.058833,23


In [69]:
df_old_clean.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,COLLINS ST,42.270516,-71.1199,8,7


In [70]:
df_new_clean.rename(index = str, columns = {'OFFENSE_CODE_GROUP':'crime', 'YEAR': 'year', 'MONTH': 'month', 'DAY_OF_WEEK': 'dayw', 'HOUR': 'hour','STREET':'street','Lat':'lat','Long':'long','day':'day'}, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


In [71]:
df_new_clean.head()

Unnamed: 0,crime,year,month,dayw,hour,street,lat,long,day
4,Aggravated Assault,2019,7,Tuesday,20,SUMMER ST,42.355216,-71.060129,23
7,Larceny,2019,7,Tuesday,21,BLUE HILL AVE,42.285154,-71.091022,23
14,Larceny,2019,7,Wednesday,16,MASSACHUSETTS AVE,42.336892,-71.077551,17
16,Larceny,2019,7,Tuesday,15,RIVER ST,42.271302,-71.074424,23
20,Larceny,2019,7,Tuesday,18,SUMMER ST,42.354262,-71.058833,23


In [72]:
df_old_clean.rename(index = str, columns = {'INCIDENT_TYPE_DESCRIPTION':'crime', 'Year': 'year', 'Month': 'month', 'DAY_WEEK': 'dayw', 'hour': 'hour','STREETNAME':'street','Lat':'lat','Long':'long','day':'day'}, inplace = True)

In [73]:
df_old_clean.head()

Unnamed: 0,crime,year,month,dayw,street,lat,long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,COLLINS ST,42.270516,-71.1199,8,7


In [74]:
correct_order = ['crime','year','month','day','dayw','hour','street','lat','long']

In [75]:
df_old_clean = df_old_clean[correct_order]
df_new_clean = df_new_clean[correct_order]

In [76]:
df_old_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
0,RESIDENTIAL BURGLARY,2012,7,8,Sunday,6,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,HOWARD AV,42.316841,-71.074585
2,ROBBERY,2012,7,8,Sunday,6,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,2012,7,8,Sunday,7,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,2012,7,8,Sunday,7,COLLINS ST,42.270516,-71.1199


In [77]:
df_new_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
4,Aggravated Assault,2019,7,23,Tuesday,20,SUMMER ST,42.355216,-71.060129
7,Larceny,2019,7,23,Tuesday,21,BLUE HILL AVE,42.285154,-71.091022
14,Larceny,2019,7,17,Wednesday,16,MASSACHUSETTS AVE,42.336892,-71.077551
16,Larceny,2019,7,23,Tuesday,15,RIVER ST,42.271302,-71.074424
20,Larceny,2019,7,23,Tuesday,18,SUMMER ST,42.354262,-71.058833


In [78]:
frames = [df_old_clean, df_new_clean]

In [79]:
df_clean = pd.concat(frames, ignore_index = True)

In [80]:
df_clean.tail()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
140869,Aggravated Assault,2015,11,20,Friday,11,BLUE HILL AVE,42.301897,-71.085549
140870,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
140871,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
140872,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
140873,Homicide,2015,7,9,Thursday,14,RIVER ST,42.255926,-71.123172


In [81]:
df_old_clean.shape

(63883, 9)

In [82]:
df_new_clean.shape

(76991, 9)

In [83]:
df_clean.shape

(140874, 9)

In [84]:
df_old_clean.shape[0] + df_new_clean.shape[0] == df_clean.shape[0]

True

Now we need to reconcile the crime labels, which differ in capitalization between the two sources.

In [85]:
df_clean['crime'] = df_clean['crime'].apply(lambda x: x.upper())

In [86]:
df_clean['crime'].value_counts()

LARCENY                       34279
LARCENY FROM MOTOR VEHICLE    26542
OTHER LARCENY                 24443
AGGRAVATED ASSAULT            16115
RESIDENTIAL BURGLARY          13724
AUTO THEFT                    10754
ROBBERY                       10752
COMMERCIAL BURGLARY            3164
OTHER BURGLARY                  584
HOMICIDE                        409
ARSON                           108
Name: crime, dtype: int64

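The two sources disagree on what `LARCENY` covers: the old data splits larcenies into `OTHER LARCENY`, `LARCENY FROM MOTOR VEHICLE` and a small `LARCENY` group, while the new data only has `Larceny` and `Larceny From Motor Vehicle`. Hence we simply merge all larcenies into `LARCENY`.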

In [87]:
df_clean['crime'] = df_clean['crime'].replace({'LARCENY FROM MOTOR VEHICLE': 'LARCENY', 'OTHER LARCENY': 'LARCENY'})

In [88]:
df_clean['crime'].value_counts()

LARCENY                 85264
AGGRAVATED ASSAULT      16115
RESIDENTIAL BURGLARY    13724
AUTO THEFT              10754
ROBBERY                 10752
COMMERCIAL BURGLARY      3164
OTHER BURGLARY            584
HOMICIDE                  409
ARSON                     108
Name: crime, dtype: int64

Now we save this dataframe so that a later run can start from any point after Chapter 1 by loading the file directly.

In [89]:
df_clean.to_csv('final.csv')

<a id = '2'></a>
[Return to top](#top)
# 2. More preprocessing

In [90]:
#df_clean = pd.read_csv('final.csv',index_col = 0) #Used if we skip Ch 1

In [91]:
df_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
0,RESIDENTIAL BURGLARY,2012,7,8,Sunday,6,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,HOWARD AV,42.316841,-71.074585
2,ROBBERY,2012,7,8,Sunday,6,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,2012,7,8,Sunday,7,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,2012,7,8,Sunday,7,COLLINS ST,42.270516,-71.1199


<a id = '2.1'></a>
[Return to top](#top)
## 2.1 Preparation

In [92]:
df_clean.isna().sum()

crime        0
year         0
month        0
day          0
dayw         0
hour         0
street    1534
lat       5703
long      5703
dtype: int64

Now we should drop the NAs.

In [93]:
df_final = df_clean.dropna()

In [94]:
df_final.shape

(134940, 9)

In [95]:
df_final.sort_values(['year','month','day'])

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
0,RESIDENTIAL BURGLARY,2012,7,8,Sunday,6,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,HOWARD AV,42.316841,-71.074585
2,ROBBERY,2012,7,8,Sunday,6,JERSEY ST,42.342841,-71.096990
3,COMMERCIAL BURGLARY,2012,7,8,Sunday,7,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,2012,7,8,Sunday,7,COLLINS ST,42.270516,-71.119900
5,ROBBERY,2012,7,8,Sunday,8,SYDNEY ST,42.313282,-71.053006
6,ROBBERY,2012,7,8,Sunday,8,REGENT ST,42.324251,-71.086210
7,RESIDENTIAL BURGLARY,2012,7,8,Sunday,11,CATBIRD COURT,42.288138,-71.094849
8,LARCENY,2012,7,8,Sunday,12,HILLSIDE ST,42.331666,-71.107630
9,AUTO THEFT,2012,7,8,Sunday,12,E 7TH ST,42.332171,-71.042240


In [96]:
def first_day(df):
    # Earliest date in df as an 'M/D/YYYY' string
    row = df.sort_values(['year','month','day']).iloc[0,:]
    return str(row.month) + '/' + str(row.day) + '/' + str(row.year)

def last_day(df):
    # Latest date in df as an 'M/D/YYYY' string
    row = df.sort_values(['year','month','day']).iloc[-1,:]
    return str(row.month) + '/' + str(row.day) + '/' + str(row.year)

In [97]:
def count_crimes(df, crime, year, month, day, hour):
    df1 = df[(df['crime'] == crime) & (df['year'] == year)]
    df2 = df1[(df1['month'] == month) & (df1['day'] == day)]
    return df2[df2['hour'] == hour].shape[0]

In [98]:
time_tuple_list = pd.date_range(start = first_day(df_final), end = last_day(df_final)).tolist()

In [99]:
crime_list = df_final.crime.unique().tolist()

In [100]:
df_temp = df_final[['crime','year','month','day','dayw','hour']].groupby(['crime','year','month','day','dayw','hour']).size().unstack(fill_value = 0).stack().reset_index()

In [101]:
df_temp.rename(index = str, columns = {0:'counts'}, inplace = True)

In [102]:
df_temp[df_temp.crime == 'ARSON'].head(200)

Unnamed: 0,crime,year,month,day,dayw,hour,counts
61248,ARSON,2015,6,20,Saturday,0,0
61249,ARSON,2015,6,20,Saturday,1,0
61250,ARSON,2015,6,20,Saturday,2,0
61251,ARSON,2015,6,20,Saturday,3,0
61252,ARSON,2015,6,20,Saturday,4,0
61253,ARSON,2015,6,20,Saturday,5,0
61254,ARSON,2015,6,20,Saturday,6,0
61255,ARSON,2015,6,20,Saturday,7,0
61256,ARSON,2015,6,20,Saturday,8,0
61257,ARSON,2015,6,20,Saturday,9,0


We have no arson records before mid-2015, which is odd: the old data did contain 33 `ARSON` incidents in cell 60, but none of them survived the filter in cell 61. So we drop `ARSON` for now, until that is sorted out or we find other crime descriptions that are essentially arson.
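
A possible explanation for the missing old arson records is the filter in cell 61: it compares against `'Arson'`, but cell 58 had already uppercased the descriptions to `'ARSON'`. The sketch below reproduces that effect on toy data (the rows are made up; only the column names and labels come from the notebook) and shows that normalizing case on both sides keeps the row.

```python
import pandas as pd

# Toy rows (hypothetical) with already-uppercased descriptions.
toy = pd.DataFrame({
    'UCRPART': ['Part One', 'Other', 'Other'],
    'INCIDENT_TYPE_DESCRIPTION': ['ROBBERY', 'ARSON', 'MVACC'],
})

is_part_one = toy['UCRPART'] == 'Part One'

# Comparing against the mixed-case 'Arson' misses the uppercased label.
case_sensitive = toy[is_part_one | (toy['INCIDENT_TYPE_DESCRIPTION'] == 'Arson')]

# Normalizing the column to upper case and comparing with 'ARSON' keeps it.
case_insensitive = toy[is_part_one | (toy['INCIDENT_TYPE_DESCRIPTION'].str.upper() == 'ARSON')]

print(len(case_sensitive), len(case_insensitive))
```

Applying such a fix in the notebook would change the row counts reported from cell 62 onward, so the outputs shown here reflect the original filter.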

In [103]:
crime_list.remove('ARSON')

In [104]:
df_al = df_final[df_final.crime != 'ARSON']
df_temp = df_al[['crime','year','month','day','dayw','hour']].groupby(['crime','year','month','day','dayw','hour']).size().unstack(fill_value = 0).stack().reset_index()

In [105]:
df_temp.head()

Unnamed: 0,crime,year,month,day,dayw,hour,0
0,AGGRAVATED ASSAULT,2012,7,8,Sunday,0,0
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,1,0
2,AGGRAVATED ASSAULT,2012,7,8,Sunday,2,0
3,AGGRAVATED ASSAULT,2012,7,8,Sunday,3,0
4,AGGRAVATED ASSAULT,2012,7,8,Sunday,4,0


In [106]:
df_temp.shape

(360456, 7)

In [107]:
dicc = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3: 'Thursday',4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
for ind, row in df_temp.iterrows():
    year = df_temp.at[ind, 'year']
    month = df_temp.at[ind, 'month']
    day = df_temp.at[ind, 'day']
    dayw = dicc[datetime.date(year, month, day).weekday()]
    if dayw != df_temp.at[ind, 'dayw']:
        print(row)
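
The loop above prints nothing, so every `dayw` label agrees with the calendar. On a table of this size `iterrows` is slow; the same check can be sketched without a Python loop. The rows below are toy data with dates taken from the tables in this notebook.

```python
import pandas as pd

# Toy rows; the dates and weekday labels appear in the tables above.
toy = pd.DataFrame({
    'year': [2012, 2012, 2019],
    'month': [7, 7, 7],
    'day': [8, 9, 23],
    'dayw': ['Sunday', 'Monday', 'Tuesday'],
})

# pd.to_datetime assembles dates from columns named year, month and day.
dates = pd.to_datetime(toy[['year', 'month', 'day']])

# day_name() returns English weekday names such as 'Sunday'.
mismatches = toy[dates.dt.day_name() != toy['dayw']]
print(mismatches)
```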

In [108]:
df_temp.hour.value_counts()

23    15019
22    15019
1     15019
2     15019
3     15019
4     15019
5     15019
6     15019
7     15019
8     15019
9     15019
10    15019
11    15019
12    15019
13    15019
14    15019
15    15019
16    15019
17    15019
18    15019
19    15019
20    15019
21    15019
0     15019
Name: hour, dtype: int64

In [109]:
cset = set()
for ind, row in df_temp.iterrows():
    item = (row.crime, row.year, row.month, row.day)
    cset.add(item)

In [110]:
len(cset)

15019

In [111]:
cset

{('AUTO THEFT', 2013, 7, 8),
 ('ROBBERY', 2016, 12, 29),
 ('LARCENY', 2014, 3, 5),
 ('ROBBERY', 2015, 5, 10),
 ('AGGRAVATED ASSAULT', 2016, 2, 7),
 ('LARCENY', 2014, 12, 11),
 ('RESIDENTIAL BURGLARY', 2018, 10, 31),
 ('AUTO THEFT', 2015, 3, 9),
 ('RESIDENTIAL BURGLARY', 2016, 6, 8),
 ('ROBBERY', 2013, 2, 20),
 ('LARCENY', 2019, 6, 15),
 ('AGGRAVATED ASSAULT', 2014, 10, 28),
 ('RESIDENTIAL BURGLARY', 2018, 9, 14),
 ('RESIDENTIAL BURGLARY', 2013, 3, 2),
 ('LARCENY', 2013, 4, 16),
 ('ROBBERY', 2014, 10, 18),
 ('AGGRAVATED ASSAULT', 2017, 1, 5),
 ('COMMERCIAL BURGLARY', 2017, 11, 22),
 ('ROBBERY', 2016, 7, 21),
 ('RESIDENTIAL BURGLARY', 2017, 1, 30),
 ('RESIDENTIAL BURGLARY', 2012, 7, 30),
 ('ROBBERY', 2013, 6, 21),
 ('ROBBERY', 2017, 8, 2),
 ('ROBBERY', 2017, 7, 6),
 ('RESIDENTIAL BURGLARY', 2019, 6, 24),
 ('RESIDENTIAL BURGLARY', 2014, 6, 4),
 ('ROBBERY', 2019, 3, 11),
 ('AUTO THEFT', 2015, 8, 5),
 ('LARCENY', 2017, 11, 18),
 ('AGGRAVATED ASSAULT', 2013, 10, 20),
 ('COMMERCIAL BURGLARY',

In [112]:
crime_set = set(crime_list)
time_set = set(time_tuple_list)

In [113]:
full_set = {(crime, time.year, time.month, time.day) for crime in crime_set for time in time_set}

In [114]:
len(full_set)

20576

In [115]:
zeroset = full_set - cset

In [116]:
cset - full_set

set()

In [117]:
uzeroset = cset - full_set

In [118]:
len(zeroset) + len(cset) - len(full_set)

0

In [119]:
full_set.issuperset(cset)

True

In [120]:

def process_row(tup):
    # tup is (crime, year, month, day); return 24 zero-count rows, one per hour
    dicc_list = []
    dayw = dicc[datetime.date(tup[1], tup[2], tup[3]).weekday()]
    for i in range(24):
        ind_dic = {'crime': tup[0], 'year': tup[1], 'month': tup[2], 'day': tup[3], 'dayw': dayw, 'hour': i, 'counts': 0}
        dicc_list.append(ind_dic)
    return dicc_list

In [121]:
def process_set(zeroset):
    dicc_list = []
    for row in zeroset:
        dicc_list.extend(process_row(row))
    return dicc_list

In [122]:
zero_dicc_list = process_set(zeroset)

In [123]:
len(zero_dicc_list) 

133368

In [124]:
24 * len(zeroset)

133368

In [125]:
df_zero = pd.DataFrame(zero_dicc_list, columns = ['crime', 'year', 'month', 'day', 'dayw', 'hour', 'counts'])

In [126]:
df_zero.head()

Unnamed: 0,crime,year,month,day,dayw,hour,counts
0,HOMICIDE,2016,9,3,Saturday,0,0
1,HOMICIDE,2016,9,3,Saturday,1,0
2,HOMICIDE,2016,9,3,Saturday,2,0
3,HOMICIDE,2016,9,3,Saturday,3,0
4,HOMICIDE,2016,9,3,Saturday,4,0


In [127]:
df_zero.shape

(133368, 7)

In [128]:
df_temp.rename(index = str, columns = {0:'counts'}, inplace = True)

In [129]:
df_ag = pd.concat([df_temp, df_zero], ignore_index = True)
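
The hand-built zero rows above cover the crime/day pairs that have no incidents at all; `unstack(fill_value = 0).stack()` alone only fills in hours for pairs that already appear in the data. As an alternative, the full crime/day/hour grid can be sketched with `pd.MultiIndex.from_product` and `reindex`. This assumes the date is held in a single `date` column, which is not how the notebook stores it, and the rows are made up.

```python
import pandas as pd

# Toy incident table (hypothetical values).
incidents = pd.DataFrame({
    'crime': ['ROBBERY', 'ROBBERY', 'LARCENY'],
    'date': pd.to_datetime(['2012-07-08', '2012-07-08', '2012-07-09']),
    'hour': [6, 6, 12],
})

# Observed counts per (crime, date, hour).
counts = incidents.groupby(['crime', 'date', 'hour']).size()

# Every crime on every day in the range at every hour of the day.
full_index = pd.MultiIndex.from_product(
    [incidents['crime'].unique(),
     pd.date_range(incidents['date'].min(), incidents['date'].max()),
     list(range(24))],
    names=['crime', 'date', 'hour'],
)

# Combinations that never occurred get a count of 0.
grid = counts.reindex(full_index, fill_value=0).rename('counts').reset_index()
print(grid.shape)
```

With 2 crimes, 2 days and 24 hours the toy grid has 96 rows, of which only two are non-zero.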

In [130]:
df_ag.to_csv('ag.csv')

If `ag.csv` is already up to date, we can read it here instead of rebuilding it.

In [131]:
df_ag = pd.read_csv('ag.csv', index_col = 0)

In [132]:
df_ag.dayw.value_counts()

Tuesday      70656
Sunday       70656
Monday       70656
Thursday     70464
Saturday     70464
Wednesday    70464
Friday       70464
Name: dayw, dtype: int64

In [133]:
len(crime_set)

8

In [134]:
df_ag.head()

Unnamed: 0,crime,year,month,day,dayw,hour,counts
0,AGGRAVATED ASSAULT,2012,7,8,Sunday,0,0
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,1,0
2,AGGRAVATED ASSAULT,2012,7,8,Sunday,2,0
3,AGGRAVATED ASSAULT,2012,7,8,Sunday,3,0
4,AGGRAVATED ASSAULT,2012,7,8,Sunday,4,0


In [135]:
df_ag['year'] = df_ag.year.astype('category')
df_ag['month'] = df_ag.month.astype('category')
df_ag['day'] = df_ag.day.astype('category')
df_ag['dayw'] = df_ag.dayw.astype('category')
df_ag['hour'] = df_ag.hour.astype('category')
df_ag['crime'] = df_ag.crime.astype('category')
df_ag['counts'] = df_ag.counts.astype(float)

In [136]:
df_ag.dtypes

crime     category
year      category
month     category
day       category
dayw      category
hour      category
counts     float64
dtype: object

In [137]:
df_ag.crime.value_counts()

ROBBERY                 61728
RESIDENTIAL BURGLARY    61728
OTHER BURGLARY          61728
LARCENY                 61728
HOMICIDE                61728
COMMERCIAL BURGLARY     61728
AUTO THEFT              61728
AGGRAVATED ASSAULT      61728
Name: crime, dtype: int64

<a id = '2.2B'></a>
[Return to top](#top)
## 2.2 Adding weather data

Now we need to add supplementary data: unemployment data from the Federal Reserve, weather data from NOAA, and holiday information. The weather data has already been preprocessed in a separate R notebook.

In [138]:
df_weather = pd.read_csv('weather_processed.csv')

In [139]:
df_weather.head()

Unnamed: 0,AWND,PRCP,SNOW,TMAX,TMIN,WSF2,WT01,WT02,WT03,WT04,...,WT13,WT14,WT15,WT16,WT17,WT18,WT22,YEAR,MONTH,DAY
0,8.95,0.0,0.0,89,71,19.9,0,0,0,0,...,0,0,0,0,0,0,0,2012,7,8
1,8.05,0.0,0.0,84,67,16.1,0,0,0,0,...,0,0,0,0,0,0,0,2012,7,9
2,7.38,0.0,0.0,83,65,14.1,0,0,0,0,...,0,0,0,0,0,0,0,2012,7,10
3,8.28,0.0,0.0,80,66,15.0,0,0,0,0,...,0,0,0,0,0,0,0,2012,7,11
4,10.74,0.0,0.0,86,66,19.9,0,0,0,0,...,0,0,0,0,0,0,0,2012,7,12


In [140]:
df_weather.dtypes

AWND     float64
PRCP     float64
SNOW     float64
TMAX       int64
TMIN       int64
WSF2     float64
WT01       int64
WT02       int64
WT03       int64
WT04       int64
WT05       int64
WT06       int64
WT08       int64
WT09       int64
WT13       int64
WT14       int64
WT15       int64
WT16       int64
WT17       int64
WT18       int64
WT22       int64
YEAR       int64
MONTH      int64
DAY        int64
dtype: object

In [141]:
df_weather['year'] = df_weather.YEAR.astype('category')
df_weather['month'] = df_weather.MONTH.astype('category')
df_weather['day'] = df_weather.DAY.astype('category')

In [142]:
df_weather.dtypes

AWND      float64
PRCP      float64
SNOW      float64
TMAX        int64
TMIN        int64
WSF2      float64
WT01        int64
WT02        int64
WT03        int64
WT04        int64
WT05        int64
WT06        int64
WT08        int64
WT09        int64
WT13        int64
WT14        int64
WT15        int64
WT16        int64
WT17        int64
WT18        int64
WT22        int64
YEAR        int64
MONTH       int64
DAY         int64
year     category
month    category
day      category
dtype: object

In [143]:
del df_weather['YEAR']
del df_weather['MONTH']
del df_weather['DAY']

In [144]:
df_weather.dtypes

AWND      float64
PRCP      float64
SNOW      float64
TMAX        int64
TMIN        int64
WSF2      float64
WT01        int64
WT02        int64
WT03        int64
WT04        int64
WT05        int64
WT06        int64
WT08        int64
WT09        int64
WT13        int64
WT14        int64
WT15        int64
WT16        int64
WT17        int64
WT18        int64
WT22        int64
year     category
month    category
day      category
dtype: object

<a id = '2.2C'></a>
[Return to top](#top)
## 2.3 Adding unemployment data

Now we need the unemployment data too. We don't use poverty data this time because it is not available even for 2018, let alone 2019. The unemployment series is not seasonally adjusted, because seasonal effects on crime are precisely what we want to capture.

In [145]:
df_ue = pd.read_csv('MAURN.csv')

In [146]:
df_ue

Unnamed: 0,DATE,MAURN
0,2012-07-01,6.9
1,2012-08-01,6.6
2,2012-09-01,6.6
3,2012-10-01,6.2
4,2012-11-01,6.3
5,2012-12-01,6.6
6,2013-01-01,7.6
7,2013-02-01,7.2
8,2013-03-01,7.0
9,2013-04-01,6.5


In [147]:
df_ue['year'] = df_ue['DATE'].apply(lambda x: int(x[:4]))
df_ue['month'] = df_ue['DATE'].apply(lambda x: int(x[5:7]))
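
The slicing works because `DATE` always has the form `YYYY-MM-DD`. An equivalent sketch parses the strings with `pd.to_datetime` and reads the parts from the `.dt` accessor; the two rows below are the first two from `MAURN.csv` shown above, and `toy_ue` is a name introduced here.

```python
import pandas as pd

# First two rows of the unemployment table shown above.
toy_ue = pd.DataFrame({'DATE': ['2012-07-01', '2012-08-01'], 'MAURN': [6.9, 6.6]})

# Parse once, then read year and month from the datetime accessor.
parsed = pd.to_datetime(toy_ue['DATE'], format='%Y-%m-%d')
toy_ue['year'] = parsed.dt.year
toy_ue['month'] = parsed.dt.month
print(toy_ue)
```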

In [148]:
del df_ue['DATE']

In [149]:
df_ue.head()

Unnamed: 0,MAURN,year,month
0,6.9,2012,7
1,6.6,2012,8
2,6.6,2012,9
3,6.2,2012,10
4,6.3,2012,11


In [150]:
df_ue['year'] = df_ue.year.astype('category')
df_ue['month'] = df_ue.month.astype('category')

In [151]:
df_ue.dtypes

MAURN     float64
year     category
month    category
dtype: object

Let's first merge the three datasets and then add holiday info.

In [152]:
df_temp1 = pd.merge(df_ag, df_weather, how = 'left', on = ['year','month','day'], validate = 'many_to_one')

In [153]:
df_temp1.isna().sum()

crime        0
year         0
month        0
day          0
dayw         0
hour         0
counts       0
AWND      2880
PRCP      2880
SNOW      2880
TMAX      2880
TMIN      2880
WSF2      2880
WT01      2880
WT02      2880
WT03      2880
WT04      2880
WT05      2880
WT06      2880
WT08      2880
WT09      2880
WT13      2880
WT14      2880
WT15      2880
WT16      2880
WT17      2880
WT18      2880
WT22      2880
dtype: int64
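
The 2880 missing values per weather column correspond to 15 days without a weather record (8 crimes times 24 hours is 192 rows per day, and 2880 / 192 = 15). To see which days those are, one option is the `indicator` argument of `pd.merge`. A minimal sketch with made-up rows (the `TMAX` value is hypothetical):

```python
import pandas as pd

# Toy left table: two rows on a day without weather data.
left = pd.DataFrame({
    'year': [2019, 2019, 2019],
    'month': [7, 7, 7],
    'day': [22, 23, 23],
    'counts': [0.0, 1.0, 2.0],
})

# Toy weather table covering only one of the two days.
weather = pd.DataFrame({'year': [2019], 'month': [7], 'day': [22], 'TMAX': [89]})

# indicator=True adds a '_merge' column telling where each row was found.
merged = pd.merge(left, weather, how='left', on=['year', 'month', 'day'],
                  validate='many_to_one', indicator=True)

# Days present in the left table but absent from the weather table.
missing_days = (merged.loc[merged['_merge'] == 'left_only', ['year', 'month', 'day']]
                .drop_duplicates())
print(missing_days)
```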

In [154]:
df_temp2 = pd.merge(df_temp1, df_ue, how = 'left', on = ['year','month'], validate = 'many_to_one')

In [155]:
df_temp2

Unnamed: 0,crime,year,month,day,dayw,hour,counts,AWND,PRCP,SNOW,...,WT08,WT09,WT13,WT14,WT15,WT16,WT17,WT18,WT22,MAURN
0,AGGRAVATED ASSAULT,2012,7,8,Sunday,0,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,1,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
2,AGGRAVATED ASSAULT,2012,7,8,Sunday,2,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
3,AGGRAVATED ASSAULT,2012,7,8,Sunday,3,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
4,AGGRAVATED ASSAULT,2012,7,8,Sunday,4,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
5,AGGRAVATED ASSAULT,2012,7,8,Sunday,5,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
6,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,1.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
7,AGGRAVATED ASSAULT,2012,7,8,Sunday,7,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
8,AGGRAVATED ASSAULT,2012,7,8,Sunday,8,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9
9,AGGRAVATED ASSAULT,2012,7,8,Sunday,9,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9


In [156]:
df_temp2.crime.value_counts()

ROBBERY                 61728
RESIDENTIAL BURGLARY    61728
OTHER BURGLARY          61728
LARCENY                 61728
HOMICIDE                61728
COMMERCIAL BURGLARY     61728
AUTO THEFT              61728
AGGRAVATED ASSAULT      61728
Name: crime, dtype: int64

In [157]:
df_na_free = df_temp2.dropna()

In [158]:
df_na_free.shape

(489408, 29)

In [159]:
df_na_free.tail(100)

Unnamed: 0,crime,year,month,day,dayw,hour,counts,AWND,PRCP,SNOW,...,WT08,WT09,WT13,WT14,WT15,WT16,WT17,WT18,WT22,MAURN
493724,OTHER BURGLARY,2018,2,14,Wednesday,20,0.0,11.18,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
493725,OTHER BURGLARY,2018,2,14,Wednesday,21,0.0,11.18,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
493726,OTHER BURGLARY,2018,2,14,Wednesday,22,0.0,11.18,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
493727,OTHER BURGLARY,2018,2,14,Wednesday,23,0.0,11.18,0.00,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
493728,OTHER BURGLARY,2014,9,21,Sunday,0,0.0,7.61,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.5
493729,OTHER BURGLARY,2014,9,21,Sunday,1,0.0,7.61,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.5
493730,OTHER BURGLARY,2014,9,21,Sunday,2,0.0,7.61,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.5
493731,OTHER BURGLARY,2014,9,21,Sunday,3,0.0,7.61,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.5
493732,OTHER BURGLARY,2014,9,21,Sunday,4,0.0,7.61,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.5
493733,OTHER BURGLARY,2014,9,21,Sunday,5,0.0,7.61,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.5


In [160]:
df_na_free.to_csv('temp.csv')

<a id = '2.2D'></a>
[Return to top](#top)
## 2.4 Adding holiday data

Now it is time to add the holidays. We only use the Massachusetts holidays that are widely observed.

In [161]:
import holidays

In [162]:
ma_holidays = holidays.CountryHoliday('US', prov=None, state='MA')

In [163]:
df_na_free.sort_values(by = ['year','month','day'], inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [164]:
df_na_free.iat[0, 3]

8

In [165]:
row_count = df_na_free.shape[0]
start_dat = datetime.datetime(df_na_free.iat[0, 1], df_na_free.iat[0, 2], df_na_free.iat[0, 3])
end_dat = datetime.datetime(df_na_free.iat[row_count - 1, 1], df_na_free.iat[row_count - 1, 2], df_na_free.iat[row_count - 1, 3])
datelist = pd.date_range(start = start_dat, end = end_dat).tolist()
hol_list = ["New Year's Day", 'Memorial Day', 'Independence Day', 'Labor Day', 'Thanksgiving', 'Christmas Day']
leng = len(datelist)
hs_bit = False 
cal_list = [None] * leng
for i in range(leng):
    date_ = datelist[i]
    holiday = ma_holidays.get(date_)
    if holiday and holiday in hol_list:#Holiday 
        cal_list[i] = str(holiday)
        if i:#Not the first day in the range
            cal_list[i - 1] = str(holiday) + ' Eve'
        if i != leng - 1:#Not the last day in the range
            cal_list[i + 1] = 'Post-' + str(holiday)
    if date_.month == 12 and date_.day >= 27 and date_.day <= 30:
        cal_list[i] = 'Holiday Season'
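
The labelling rule in the loop above can be isolated in a small function so it can be checked on a short date range. This is a sketch, not the notebook's code: `label_days` and the plain `dict` calendar are hypothetical stand-ins (any object with a `.get(date)` method would do), and the 'Holiday Season' rule is left out.

```python
import datetime

def label_days(dates, calendar, major):
    """Label each date as a major holiday, its eve, or the day after (None otherwise)."""
    labels = [None] * len(dates)
    for i, d in enumerate(dates):
        name = calendar.get(d)
        if name in major:
            labels[i] = name
            if i > 0:                      # not the first day in the range
                labels[i - 1] = name + ' Eve'
            if i < len(dates) - 1:         # not the last day in the range
                labels[i + 1] = 'Post-' + name
    return labels

# Hypothetical stand-in for the holiday calendar.
calendar = {datetime.date(2018, 7, 4): 'Independence Day'}
dates = [datetime.date(2018, 7, 3) + datetime.timedelta(days=k) for k in range(3)]

labels = label_days(dates, calendar, {'Independence Day'})
print(labels)
```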

In [166]:
hol_dic_list = []
for i in range(leng):
    date_ = datelist[i]
    year = date_.year
    month = date_.month
    day = date_.day
    hol_dic_list.append({'year':year,'month':month,'day':day,'HOLIDAY':cal_list[i]})

In [167]:
df_holiday = pd.DataFrame(hol_dic_list)

In [168]:
df_holiday['year'] = df_holiday.year.astype('category')
df_holiday['month'] = df_holiday.month.astype('category')
df_holiday['day'] = df_holiday.day.astype('category')

In [169]:
df_ag_final = pd.merge(df_temp2, df_holiday, how = 'left', on = ['year','month','day'], validate = 'many_to_one')

In [170]:
df_ag_final.head(100)

Unnamed: 0,crime,year,month,day,dayw,hour,counts,AWND,PRCP,SNOW,...,WT09,WT13,WT14,WT15,WT16,WT17,WT18,WT22,MAURN,HOLIDAY
0,AGGRAVATED ASSAULT,2012,7,8,Sunday,0,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,1,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
2,AGGRAVATED ASSAULT,2012,7,8,Sunday,2,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
3,AGGRAVATED ASSAULT,2012,7,8,Sunday,3,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
4,AGGRAVATED ASSAULT,2012,7,8,Sunday,4,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
5,AGGRAVATED ASSAULT,2012,7,8,Sunday,5,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
6,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,1.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
7,AGGRAVATED ASSAULT,2012,7,8,Sunday,7,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
8,AGGRAVATED ASSAULT,2012,7,8,Sunday,8,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,
9,AGGRAVATED ASSAULT,2012,7,8,Sunday,9,0.0,8.95,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,


Now we save this dataframe so that a later run can start from any point after Chapter 2 by loading the file directly.

In [171]:
df_ag_final.to_csv('ag_final.csv')

<a id = '3'></a>
[Return to top](#top)
# 3. Choosing the best regressor

In [172]:
#df_ag_final = pd.read_csv('ag_final.csv', index_col = 0) #Used if we skip Ch 2
#df_ag_final['year'] = df_ag_final.year.astype('category')
#df_ag_final['month'] = df_ag_final.month.astype('category')
#df_ag_final['day'] = df_ag_final.day.astype('category')

<a id = '2.2A'></a>
[Return to top](#top)
## 3.1 Select and split

Now that the data is clean, we need to turn the categorical columns into dummy variables.

In [173]:
df_ag_final.columns

Index(['crime', 'year', 'month', 'day', 'dayw', 'hour', 'counts', 'AWND',
       'PRCP', 'SNOW', 'TMAX', 'TMIN', 'WSF2', 'WT01', 'WT02', 'WT03', 'WT04',
       'WT05', 'WT06', 'WT08', 'WT09', 'WT13', 'WT14', 'WT15', 'WT16', 'WT17',
       'WT18', 'WT22', 'MAURN', 'HOLIDAY'],
      dtype='object')

In [174]:
df_ag_final.shape

(493824, 30)

We need to change the dtype of `HOLIDAY` to string first, so that `None` isn't considered `NA` and dropped by the `dropna` call below.

In [175]:
df_ag_final['HOLIDAY'] = df_ag_final['HOLIDAY'].astype(str)
df_ag_final['HOLIDAY'] = df_ag_final['HOLIDAY'].astype('category')
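
To see why the cast matters, here is a minimal sketch on a toy frame. The data and the names `toy`, `rows_without_cast` and `rows_with_cast` are made up for illustration. It uses `Series.map(str)` in place of `astype(str)`, on the assumption that both give the same result in the environment this notebook was run in; how `astype(str)` treats missing values can differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Toy data: None marks a day without a holiday, NaN marks missing weather.
toy = pd.DataFrame({
    'TMAX': [89.0, np.nan, 84.0],
    'HOLIDAY': pd.Series([None, 'Labor Day', None], dtype=object),
})

# Without the cast, dropna() treats None as missing and drops those rows too.
rows_without_cast = len(toy.dropna())

# With the cast, None becomes the literal string 'None' and survives dropna().
toy_cast = toy.copy()
toy_cast['HOLIDAY'] = toy_cast['HOLIDAY'].map(str).astype('category')
rows_with_cast = len(toy_cast.dropna())

print(rows_without_cast, rows_with_cast)
```

In the toy frame only the row with the missing `TMAX` is dropped once `HOLIDAY` has been cast.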

In [176]:
df_ag_final.dropna(inplace = True)

In [177]:
df_ag_final.shape

(489408, 30)

In [222]:
df_ag_final.HOLIDAY.value_counts()

None                     460416
Holiday Season             5376
New Year's Day Eve         1344
Thanksgiving Eve           1344
Christmas Day Eve          1344
Labor Day                  1344
Labor Day Eve              1344
Memorial Day               1344
Memorial Day Eve           1344
New Year's Day             1344
Christmas Day              1344
Post-Christmas Day         1344
Post-Labor Day             1344
Post-Memorial Day          1344
Post-New Year's Day        1344
Post-Thanksgiving          1344
Thanksgiving               1344
Post-Independence Day      1152
Independence Day Eve       1152
Independence Day           1152
nan                           0
Name: HOLIDAY, dtype: int64

In [178]:
df_ag_temp = df_ag_final.groupby(['crime', 'year', 'month', 'day', 'dayw','AWND',
       'PRCP', 'SNOW', 'TMAX', 'TMIN', 'WSF2', 'WT01', 'WT02', 'WT03', 'WT04',
       'WT05', 'WT06', 'WT08', 'WT09', 'WT13', 'WT14', 'WT15', 'WT16', 'WT17',
       'WT18', 'WT22', 'MAURN', 'HOLIDAY'])['counts'].sum().reset_index(name = 'counts')

In [179]:
df_ag_temp.shape

(20392, 29)

In [180]:
20392 * 24

489408
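
The check above works because every crime type has 24 hourly rows per day, and the `groupby` collapses them into one daily row. Below is a small sketch of the same pattern with made-up data and only two hours per day; the frame `hourly` and the key list `keys` are hypothetical.

```python
import pandas as pd

# Hypothetical hourly data: 2 crimes x 2 days x 2 hours.
hourly = pd.DataFrame({
    'crime':  ['ROBBERY'] * 4 + ['LARCENY'] * 4,
    'day':    [1, 1, 2, 2] * 2,
    'hour':   [0, 1, 0, 1] * 2,
    'TMAX':   [89.0, 89.0, 84.0, 84.0] * 2,   # constant within a day
    'counts': [1.0, 0.0, 2.0, 3.0, 0.0, 0.0, 4.0, 1.0],
})

# Group by every column except `hour`, then sum the hourly counts.
keys = ['crime', 'day', 'TMAX']
daily = hourly.groupby(keys)['counts'].sum().reset_index(name='counts')

print(daily)
print(len(hourly), len(daily) * 2)
```

Daily weather columns can serve as grouping keys here only because they take the same value for every hour of a given day.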

In [181]:
df_ag_temp.head(100)

Unnamed: 0,crime,year,month,day,dayw,AWND,PRCP,SNOW,TMAX,TMIN,...,WT13,WT14,WT15,WT16,WT17,WT18,WT22,MAURN,HOLIDAY,counts
0,AGGRAVATED ASSAULT,2012,7,8,Sunday,8.95,0.00,0.0,89.0,71.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,8.0
1,AGGRAVATED ASSAULT,2012,7,9,Monday,8.05,0.00,0.0,84.0,67.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,6.0
2,AGGRAVATED ASSAULT,2012,7,10,Tuesday,7.38,0.00,0.0,83.0,65.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,11.0
3,AGGRAVATED ASSAULT,2012,7,11,Wednesday,8.28,0.00,0.0,80.0,66.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,7.0
4,AGGRAVATED ASSAULT,2012,7,12,Thursday,10.74,0.00,0.0,86.0,66.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,8.0
5,AGGRAVATED ASSAULT,2012,7,13,Friday,10.96,0.00,0.0,90.0,69.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,11.0
6,AGGRAVATED ASSAULT,2012,7,14,Saturday,10.51,0.00,0.0,91.0,72.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,10.0
7,AGGRAVATED ASSAULT,2012,7,15,Sunday,8.72,0.00,0.0,91.0,72.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,6.9,,2.0
8,AGGRAVATED ASSAULT,2012,7,16,Monday,9.62,0.00,0.0,88.0,72.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,6.9,,8.0
9,AGGRAVATED ASSAULT,2012,7,17,Tuesday,12.30,0.00,0.0,97.0,76.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.9,,10.0


In [182]:
df_ag_temp.columns

Index(['crime', 'year', 'month', 'day', 'dayw', 'AWND', 'PRCP', 'SNOW', 'TMAX',
       'TMIN', 'WSF2', 'WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08',
       'WT09', 'WT13', 'WT14', 'WT15', 'WT16', 'WT17', 'WT18', 'WT22', 'MAURN',
       'HOLIDAY', 'counts'],
      dtype='object')

In [183]:
df_ag2 = df_ag_temp

In [184]:
df_dummies = pd.get_dummies(df_ag2)

In [185]:
df_dummies.head()

Unnamed: 0,AWND,PRCP,SNOW,TMAX,TMIN,WSF2,WT01,WT02,WT03,WT04,...,HOLIDAY_None,HOLIDAY_Post-Christmas Day,HOLIDAY_Post-Independence Day,HOLIDAY_Post-Labor Day,HOLIDAY_Post-Memorial Day,HOLIDAY_Post-New Year's Day,HOLIDAY_Post-Thanksgiving,HOLIDAY_Thanksgiving,HOLIDAY_Thanksgiving Eve,HOLIDAY_nan
0,8.95,0.0,0.0,89.0,71.0,19.9,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1,8.05,0.0,0.0,84.0,67.0,16.1,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
2,7.38,0.0,0.0,83.0,65.0,14.1,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
3,8.28,0.0,0.0,80.0,66.0,15.0,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
4,10.74,0.0,0.0,86.0,66.0,19.9,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
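
As a small illustration of what `pd.get_dummies` does to a mixed frame, here is a sketch on toy data (the frame `toy` and the names `toy_dummies`, `X_toy` and `y_toy` are assumptions made for the example). The notebook's own column selection leaves out `HOLIDAY_None` and `HOLIDAY_nan`; the sketch mimics that by dropping `HOLIDAY_None`.

```python
import pandas as pd

# Toy frame with one numeric feature, two categorical ones and the target.
toy = pd.DataFrame({
    'TMAX': [89.0, 84.0, 83.0],
    'dayw': pd.Series(['Sunday', 'Monday', 'Sunday'], dtype=object),
    'HOLIDAY': pd.Categorical(['None', 'Labor Day', 'None']),
    'counts': [8.0, 6.0, 11.0],
})

# Object and category columns are expanded; numeric columns pass through.
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())

# Leave out the 'None' level and separate the features from the target.
X_toy = toy_dummies.drop(columns=['HOLIDAY_None', 'counts'])
y_toy = toy_dummies['counts']
print(X_toy.columns.tolist())
```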


In [186]:
df_dummies.columns.tolist()

['AWND',
 'PRCP',
 'SNOW',
 'TMAX',
 'TMIN',
 'WSF2',
 'WT01',
 'WT02',
 'WT03',
 'WT04',
 'WT05',
 'WT06',
 'WT08',
 'WT09',
 'WT13',
 'WT14',
 'WT15',
 'WT16',
 'WT17',
 'WT18',
 'WT22',
 'MAURN',
 'counts',
 'crime_AGGRAVATED ASSAULT',
 'crime_AUTO THEFT',
 'crime_COMMERCIAL BURGLARY',
 'crime_HOMICIDE',
 'crime_LARCENY',
 'crime_OTHER BURGLARY',
 'crime_RESIDENTIAL BURGLARY',
 'crime_ROBBERY',
 'year_2012',
 'year_2013',
 'year_2014',
 'year_2015',
 'year_2016',
 'year_2017',
 'year_2018',
 'year_2019',
 'month_1',
 'month_2',
 'month_3',
 'month_4',
 'month_5',
 'month_6',
 'month_7',
 'month_8',
 'month_9',
 'month_10',
 'month_11',
 'month_12',
 'day_1',
 'day_2',
 'day_3',
 'day_4',
 'day_5',
 'day_6',
 'day_7',
 'day_8',
 'day_9',
 'day_10',
 'day_11',
 'day_12',
 'day_13',
 'day_14',
 'day_15',
 'day_16',
 'day_17',
 'day_18',
 'day_19',
 'day_20',
 'day_21',
 'day_22',
 'day_23',
 'day_24',
 'day_25',
 'day_26',
 'day_27',
 'day_28',
 'day_29',
 'day_30',
 'day_31',
 'dayw_F

In [187]:
df_dummies = df_dummies[['AWND',
 'PRCP',
 'SNOW',
 'TMAX',
 'TMIN',
 'WSF2',
 'WT01',
 'WT02',
 'WT03',
 'WT04',
 'WT05',
 'WT06',
 'WT08',
 'WT09',
 'WT13',
 'WT14',
 'WT15',
 'WT16',
 'WT17',
 'WT18',
 'WT22',
 'MAURN',
 'crime_AGGRAVATED ASSAULT',
 'crime_AUTO THEFT',
 'crime_COMMERCIAL BURGLARY',
 'crime_HOMICIDE',
 'crime_LARCENY',
 'crime_OTHER BURGLARY',
 'crime_RESIDENTIAL BURGLARY',
 'crime_ROBBERY',
 'year_2012',
 'year_2013',
 'year_2014',
 'year_2015',
 'year_2016',
 'year_2017',
 'year_2018',
 'year_2019',
 'month_1',
 'month_2',
 'month_3',
 'month_4',
 'month_5',
 'month_6',
 'month_7',
 'month_8',
 'month_9',
 'month_10',
 'month_11',
 'month_12',
 'day_1',
 'day_2',
 'day_3',
 'day_4',
 'day_5',
 'day_6',
 'day_7',
 'day_8',
 'day_9',
 'day_10',
 'day_11',
 'day_12',
 'day_13',
 'day_14',
 'day_15',
 'day_16',
 'day_17',
 'day_18',
 'day_19',
 'day_20',
 'day_21',
 'day_22',
 'day_23',
 'day_24',
 'day_25',
 'day_26',
 'day_27',
 'day_28',
 'day_29',
 'day_30',
 'day_31',
 'dayw_Friday',
 'dayw_Monday',
 'dayw_Saturday',
 'dayw_Sunday',
 'dayw_Thursday',
 'dayw_Tuesday',
 'dayw_Wednesday',
 'HOLIDAY_Christmas Day',
 'HOLIDAY_Christmas Day Eve',
 'HOLIDAY_Holiday Season',
 'HOLIDAY_Independence Day',
 'HOLIDAY_Independence Day Eve',
 'HOLIDAY_Labor Day',
 'HOLIDAY_Labor Day Eve',
 'HOLIDAY_Memorial Day',
 'HOLIDAY_Memorial Day Eve',
 "HOLIDAY_New Year's Day",
 "HOLIDAY_New Year's Day Eve",
 'HOLIDAY_Post-Christmas Day',
 'HOLIDAY_Post-Independence Day',
 'HOLIDAY_Post-Labor Day',
 'HOLIDAY_Post-Memorial Day',
 "HOLIDAY_Post-New Year's Day",
 'HOLIDAY_Post-Thanksgiving',
 'HOLIDAY_Thanksgiving',
 'HOLIDAY_Thanksgiving Eve','counts']]

In [188]:
df_dummies.head()

Unnamed: 0,AWND,PRCP,SNOW,TMAX,TMIN,WSF2,WT01,WT02,WT03,WT04,...,HOLIDAY_New Year's Day Eve,HOLIDAY_Post-Christmas Day,HOLIDAY_Post-Independence Day,HOLIDAY_Post-Labor Day,HOLIDAY_Post-Memorial Day,HOLIDAY_Post-New Year's Day,HOLIDAY_Post-Thanksgiving,HOLIDAY_Thanksgiving,HOLIDAY_Thanksgiving Eve,counts
0,8.95,0.0,0.0,89.0,71.0,19.9,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,8.0
1,8.05,0.0,0.0,84.0,67.0,16.1,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,6.0
2,7.38,0.0,0.0,83.0,65.0,14.1,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,11.0
3,8.28,0.0,0.0,80.0,66.0,15.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,7.0
4,10.74,0.0,0.0,86.0,66.0,19.9,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,8.0


In [189]:
X = df_dummies.iloc[:,:-1]
y = df_dummies['counts']

In [190]:
X.shape

(20392, 107)

In [191]:
y.shape

(20392,)

In [192]:
df_dummies.dtypes

AWND                             float64
PRCP                             float64
SNOW                             float64
TMAX                             float64
TMIN                             float64
WSF2                             float64
WT01                             float64
WT02                             float64
WT03                             float64
WT04                             float64
WT05                             float64
WT06                             float64
WT08                             float64
WT09                             float64
WT13                             float64
WT14                             float64
WT15                             float64
WT16                             float64
WT17                             float64
WT18                             float64
WT22                             float64
MAURN                            float64
crime_AGGRAVATED ASSAULT           uint8
crime_AUTO THEFT                   uint8
crime_COMMERCIAL

In [193]:
y.head()

0     8.0
1     6.0
2    11.0
3     7.0
4     8.0
Name: counts, dtype: float64

In [194]:
X_trainn, X_test, y_trainn, y_test = train_test_split(X, y, test_size=0.2, random_state=52)
X_train, X_val, y_train, y_val = train_test_split(X_trainn, y_trainn, test_size=0.2, random_state=52)

In [195]:
X_train.shape

(13050, 107)
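
The two calls give roughly a 64/16/20 split into training, validation and test sets. A quick sketch with a dummy array reproduces the sizes; the array `X_demo` is an assumption, only its row count is taken from `df_dummies`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with the same number of rows as df_dummies.
X_demo = np.arange(20392).reshape(-1, 1)
y_demo = np.zeros(20392)

# First hold out 20% for the final test, then 20% of the rest for validation.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=52)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=52)

print(len(X_tr), len(X_va), len(X_te))  # 13050 3263 4079
```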

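Now we scale the features. The scaler is fit on the training set only and then applied to the validation set, so no information from the validation data leaks into the scaling.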

In [196]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
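
A minimal sketch of the same fit-on-train, transform-both pattern on toy numbers (the arrays `X_tr` and `X_va` are invented for the example). The held-out row is scaled with the training mean and standard deviation, so its scaled values need not be centred.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_va = np.array([[3.0, 70.0]])

# Learn mean and standard deviation from the training rows only.
demo_scaler = StandardScaler().fit(X_tr)

X_tr_s = demo_scaler.transform(X_tr)
X_va_s = demo_scaler.transform(X_va)   # reuses the training statistics

print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print(X_va_s)
```

Presumably `X_test` has to go through the same `scaler.transform` call before a scaled model is evaluated on it.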


In [197]:
regressor_list = []
ev_train = []
ev_test = []
r2_train = []
r2_test = []
mse_train = []
mse_test = []
mae_train = []
mae_test = []
mdae_train = []
mdae_test = []

In [198]:
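def regression(regressor, x_train, x_test, y_train):
    # Fit the regressor on the training data and return its predictions
    # for the training set and for the held-out set.
    reg = regressor
    reg.fit(x_train, y_train)  # fitting happens here, so a separate fit() call beforehand is not required

    y_train_reg = reg.predict(x_train)
    y_test_reg = reg.predict(x_test)

    return y_train_reg, y_test_reg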

In [199]:
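def scores(regressor, y_train, y_test, y_train_reg, y_test_reg):
    # Record five metrics in the global lists and print them. The "Test" labels
    # refer to whichever held-out set is passed in (the validation set in this chapter).
    regressor_list.append(str(regressor))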
    
    ev_train_c = explained_variance_score(y_train, y_train_reg)
    ev_train.append(ev_train_c)
    ev_test_c = explained_variance_score(y_test, y_test_reg)
    ev_test.append(ev_test_c)
    
    r2_train_c = r2_score(y_train, y_train_reg)
    r2_train.append(r2_train_c)
    r2_test_c = r2_score(y_test, y_test_reg)
    r2_test.append(r2_test_c)
    
    mse_train_c = mean_squared_error(y_train, y_train_reg)
    mse_train.append(mse_train_c)
    mse_test_c = mean_squared_error(y_test, y_test_reg)
    mse_test.append(mse_test_c)

    mae_train_c = mean_absolute_error(y_train, y_train_reg)
    mae_train.append(mae_train_c)
    mae_test_c = mean_absolute_error(y_test, y_test_reg)
    mae_test.append(mae_test_c)  
    
    mdae_train_c = median_absolute_error(y_train, y_train_reg)
    mdae_train.append(mdae_train_c)
    mdae_test_c = median_absolute_error(y_test, y_test_reg)
    mdae_test.append(mdae_test_c)
    
    print("______________________________________________________________________________")
    print(str(regressor))
    print("______________________________________________________________________________")
    print("EV score. Train: ", ev_train_c)
    print("EV score. Test: ", ev_test_c)
    print("---------")
    print("R2 score. Train: ", r2_train_c)
    print("R2 score. Test: ", r2_test_c)
    print("---------")
    print("MSE score. Train: ", mse_train_c)
    print("MSE score. Test: ", mse_test_c)
    print("---------")
    print("MAE score. Train: ", mae_train_c)
    print("MAE score. Test: ", mae_test_c)
    print("---------")
    print("MdAE score. Train: ", mdae_train_c)
    print("MdAE score. Test: ", mdae_test_c)
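
To get a feel for the five metrics, here is a tiny worked example with made-up numbers in which every prediction is exactly 1 too high. It is only a sketch; `y_true` and `y_pred` are hypothetical. Explained variance ignores a constant offset while R2 penalises it, which is why the two scores can differ for a biased model.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, r2_score,
                             mean_squared_error, mean_absolute_error,
                             median_absolute_error)

y_true = np.array([8.0, 6.0, 11.0, 7.0])
y_pred = y_true + 1.0   # constant bias of +1

ev = explained_variance_score(y_true, y_pred)   # residuals have zero variance
r2 = r2_score(y_true, y_pred)                   # 1 - 4/14, the bias is penalised
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mdae = median_absolute_error(y_true, y_pred)

print(ev, r2, mse, mae, mdae)
```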

<a id = '2.2'></a>
[Return to top](#top)
## 3.2 Linear Regressor

Let's first try linear regression.

In [200]:
lreg = LinearRegression()
lreg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(lreg, X_train, X_val, y_train)
scores(lreg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
______________________________________________________________________________
EV score. Train:  0.8660215526656384
EV score. Test:  0.8716934120795978
---------
R2 score. Train:  0.8660212222799015
R2 score. Test:  0.8716569856471773
---------
MSE score. Train:  14.915922686639982
MSE score. Test:  15.134643475338141
---------
MAE score. Train:  2.3067840855518305
MAE score. Test:  2.3127311312842704
---------
MdAE score. Train:  1.481689453125
MdAE score. Test:  1.4913360807956897


In [201]:
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(sgd_reg, X_train, X_val, y_train)
scores(sgd_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='invscaling', loss='squared_loss', max_iter=None,
       n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
       verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.8260816632496372
EV score. Test:  0.8666903389722732
---------
R2 score. Train:  0.8260422499163514
R2 score. Test:  0.8666795074609245
---------
MSE score. Train:  19.36680118406767
MSE score. Test:  15.721604582143033
---------
MAE score. Train:  2.4508070003845694
MAE score. Test:  2.421779620647653
---------
MdAE score. Train:  1.5700086417397965
MdAE score. Test:  1.5508467715440943




<a id = '2.3'></a>
[Return to top](#top)
## 3.3 BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor

In [202]:
ba_reg = BaggingRegressor()
ba_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ba_reg, X_train, X_val, y_train)
scores(ba_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
BaggingRegressor(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=None, oob_score=False, random_state=None,
         verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9809484923428795
EV score. Test:  0.8981377982161151
---------
R2 score. Train:  0.9809470249143541
R2 score. Test:  0.8981336348702832
---------
MSE score. Train:  2.121177011494253
MSE score. Test:  12.012427214220041
---------
MAE score. Train:  0.7734559386973179
MAE score. Test:  1.9842782715292673
---------
MdAE score. Train:  0.3999999999999999
MdAE score. Test:  1.1


In [203]:
ada_reg = AdaBoostRegressor(learning_rate=1,n_estimators=100)
ada_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ada_reg, X_train, X_val, y_train)
scores(ada_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
AdaBoostRegressor(base_estimator=None, learning_rate=1, loss='linear',
         n_estimators=100, random_state=None)
______________________________________________________________________________
EV score. Train:  0.8802211266175866
EV score. Test:  0.8803685435565827
---------
R2 score. Train:  0.8698100865556593
R2 score. Test:  0.8709326421155485
---------
MSE score. Train:  14.494106578379599
MSE score. Test:  15.220060520902704
---------
MAE score. Train:  2.811284012050363
MAE score. Test:  2.8557883785481235
---------
MdAE score. Train:  2.3170895647790286
MdAE score. Test:  2.3170895647790286


In [204]:
et_reg = ExtraTreesRegressor(n_estimators=100)
et_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(et_reg, X_train, X_val, y_train)
scores(et_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
          oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  1.0
EV score. Test:  0.9014238713004334
---------
R2 score. Train:  1.0
R2 score. Test:  0.9014022191093879
---------
MSE score. Train:  0.0
MSE score. Test:  11.626984676677905
---------
MAE score. Train:  0.0
MAE score. Test:  1.9658473797119214
---------
MdAE score. Train:  0.0
MdAE score. Test:  1.0300000000000002


<a id = '2.4'></a>
[Return to top](#top)
## 3.4 GradientBoostingRegressor, RandomForestRegressor, LGBMRegressor

In [205]:
gb_reg = GradientBoostingRegressor()
gb_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(gb_reg, X_train, X_val, y_train)
scores(gb_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.910772627428149
EV score. Test:  0.9077022522662475
---------
R2 score. Train:  0.910772627428149
R2 score. Test:  0.9077013329096059
---------
MSE score. Train:  9.933726919006649
MSE score. Test:  10.884171816487475
---------
MAE score. Train:  1.8968863155071785
MAE score. Test:  1.96541192

In [206]:
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(rf_reg, X_train, X_val, y_train)
scores(rf_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9866242609600007
EV score. Test:  0.9063450928704111
---------
R2 score. Train:  0.986624181230806
R2 score. Test:  0.9063395077594479
---------
MSE score. Train:  1.489136429118774
MSE score. Test:  11.04476285626724
---------
MAE score. Train:  0.7014068965517242
MAE score. Test:  1.9115905608335888
---------
MdAE score. Train:  0.3899999999999997
MdAE score. Test:  1.02


In [207]:

gbm = lgb.LGBMRegressor(num_leaves=31,
                        learning_rate=0.01,
                        n_estimators=1000)
gbm.fit(X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric='l1',
        early_stopping_rounds=5)

#print('Starting predicting...')
# predict
y_train_reg = gbm.predict(X_train, num_iteration=gbm.best_iteration_)
y_val_reg = gbm.predict(X_val, num_iteration=gbm.best_iteration_)
scores(gbm, y_train, y_val, y_train_reg, y_val_reg)

[1]	valid_0's l1: 6.92967	valid_0's l2: 115.847
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's l1: 6.8684	valid_0's l2: 113.773
[3]	valid_0's l1: 6.80772	valid_0's l2: 111.74
[4]	valid_0's l1: 6.74772	valid_0's l2: 109.75
[5]	valid_0's l1: 6.68825	valid_0's l2: 107.797
[6]	valid_0's l1: 6.62944	valid_0's l2: 105.886
[7]	valid_0's l1: 6.57115	valid_0's l2: 104.01
[8]	valid_0's l1: 6.51351	valid_0's l2: 102.173
[9]	valid_0's l1: 6.45646	valid_0's l2: 100.375
[10]	valid_0's l1: 6.40009	valid_0's l2: 98.615
[11]	valid_0's l1: 6.34415	valid_0's l2: 96.883
[12]	valid_0's l1: 6.28894	valid_0's l2: 95.1921
[13]	valid_0's l1: 6.23413	valid_0's l2: 93.5284
[14]	valid_0's l1: 6.18002	valid_0's l2: 91.9043
[15]	valid_0's l1: 6.1263	valid_0's l2: 90.3059
[16]	valid_0's l1: 6.07319	valid_0's l2: 88.7411
[17]	valid_0's l1: 6.02079	valid_0's l2: 87.2102
[18]	valid_0's l1: 5.96901	valid_0's l2: 85.702
[19]	valid_0's l1: 5.91782	valid_0's l2: 84.2263
[20]	valid_0's l1: 5.867	

[181]	valid_0's l1: 2.30439	valid_0's l2: 13.9116
[182]	valid_0's l1: 2.2984	valid_0's l2: 13.8557
[183]	valid_0's l1: 2.29209	valid_0's l2: 13.7931
[184]	valid_0's l1: 2.28632	valid_0's l2: 13.7357
[185]	valid_0's l1: 2.28047	valid_0's l2: 13.6773
[186]	valid_0's l1: 2.2747	valid_0's l2: 13.6177
[187]	valid_0's l1: 2.26909	valid_0's l2: 13.5591
[188]	valid_0's l1: 2.26407	valid_0's l2: 13.5079
[189]	valid_0's l1: 2.25893	valid_0's l2: 13.4603
[190]	valid_0's l1: 2.25384	valid_0's l2: 13.4092
[191]	valid_0's l1: 2.24929	valid_0's l2: 13.3645
[192]	valid_0's l1: 2.24459	valid_0's l2: 13.3199
[193]	valid_0's l1: 2.24049	valid_0's l2: 13.276
[194]	valid_0's l1: 2.23571	valid_0's l2: 13.2296
[195]	valid_0's l1: 2.2308	valid_0's l2: 13.1821
[196]	valid_0's l1: 2.22611	valid_0's l2: 13.1373
[197]	valid_0's l1: 2.22168	valid_0's l2: 13.0968
[198]	valid_0's l1: 2.21725	valid_0's l2: 13.0504
[199]	valid_0's l1: 2.21297	valid_0's l2: 13.0084
[200]	valid_0's l1: 2.20859	valid_0's l2: 12.9647
[201

[368]	valid_0's l1: 1.96488	valid_0's l2: 10.8522
[369]	valid_0's l1: 1.96457	valid_0's l2: 10.8517
[370]	valid_0's l1: 1.96443	valid_0's l2: 10.8516
[371]	valid_0's l1: 1.96377	valid_0's l2: 10.8493
[372]	valid_0's l1: 1.96308	valid_0's l2: 10.8467
[373]	valid_0's l1: 1.96237	valid_0's l2: 10.8425
[374]	valid_0's l1: 1.96207	valid_0's l2: 10.841
[375]	valid_0's l1: 1.96139	valid_0's l2: 10.838
[376]	valid_0's l1: 1.96109	valid_0's l2: 10.8363
[377]	valid_0's l1: 1.961	valid_0's l2: 10.8385
[378]	valid_0's l1: 1.96074	valid_0's l2: 10.8382
[379]	valid_0's l1: 1.96061	valid_0's l2: 10.8386
[380]	valid_0's l1: 1.96035	valid_0's l2: 10.8369
[381]	valid_0's l1: 1.9596	valid_0's l2: 10.8328
[382]	valid_0's l1: 1.95939	valid_0's l2: 10.8324
[383]	valid_0's l1: 1.95926	valid_0's l2: 10.8325
[384]	valid_0's l1: 1.95856	valid_0's l2: 10.8294
[385]	valid_0's l1: 1.95836	valid_0's l2: 10.8296
[386]	valid_0's l1: 1.95812	valid_0's l2: 10.8282
[387]	valid_0's l1: 1.95792	valid_0's l2: 10.8281
[388]

<a id = '2.5'></a>
[Return to top](#top)
## 3.5 KNeighborsRegressor, RadiusNeighborsRegressor

In [208]:
kn_reg = KNeighborsRegressor(n_neighbors=15)
kn_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(kn_reg, X_train, X_val, y_train)
scores(kn_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=15, p=2,
          weights='uniform')
______________________________________________________________________________
EV score. Train:  0.6286747259025022
EV score. Test:  0.5614919301555228
---------
R2 score. Train:  0.6280387982892348
R2 score. Test:  0.5599812938398957
---------
MSE score. Train:  41.410622051937
MSE score. Test:  51.88849797391631
---------
MAE score. Train:  3.915959131545338
MAE score. Test:  4.348963121871488
---------
MdAE score. Train:  2.6
MdAE score. Test:  2.8


In [209]:
rn_reg = RadiusNeighborsRegressor(radius = 10)
rn_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(rn_reg, X_train, X_val, y_train)
scores(rn_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
RadiusNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=None, p=2, radius=10,
             weights='uniform')
______________________________________________________________________________
EV score. Train:  0.23572963680315984
EV score. Test:  0.20856303397276588
---------
R2 score. Train:  0.23572649502184628
R2 score. Test:  0.20829209850054498
---------
MSE score. Train:  85.08694216868783
MSE score. Test:  93.36088049843163
---------
MAE score. Train:  5.9378538785961705
MAE score. Test:  6.217524360339318
---------
MdAE score. Train:  4.4226829143674635
MdAE score. Test:  4.5234657039711195


<a id = '2.6'></a>
[Return to top](#top)
## 3.6 DecisionTreeRegressor

In [210]:
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(dt_reg, X_train, X_val, y_train)
scores(dt_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
______________________________________________________________________________
EV score. Train:  1.0
EV score. Test:  0.8245126152551021
---------
R2 score. Train:  1.0
R2 score. Test:  0.8245119454292755
---------
MSE score. Train:  0.0
MSE score. Test:  20.69414649095924
---------
MAE score. Train:  0.0
MAE score. Test:  2.570946981305547
---------
MdAE score. Train:  0.0
MdAE score. Test:  1.0


<a id = '2.7'></a>
[Return to top](#top)
## 3.7 Ridge, RidgeCV, BayesianRidge

In [211]:
rid_reg = Ridge()
rid_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(rid_reg, X_train, X_val, y_train)
scores(rid_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
______________________________________________________________________________
EV score. Train:  0.8661309797211773
EV score. Test:  0.8716762484042033
---------
R2 score. Train:  0.8661309797211771
R2 score. Test:  0.8716357156042334
---------
MSE score. Train:  14.903703337156355
MSE score. Test:  15.137151710930732
---------
MAE score. Train:  2.3032340324813796
MAE score. Test:  2.3103442243162085
---------
MdAE score. Train:  1.4767444155222171
MdAE score. Test:  1.4799024293586402


In [212]:
ric_reg = RidgeCV()
ric_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ric_reg, X_train, X_val, y_train)
scores(ric_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
______________________________________________________________________________
EV score. Train:  0.86613043871525
EV score. Test:  0.8716594432458014
---------
R2 score. Train:  0.86613043871525
R2 score. Test:  0.8716189979068323
---------
MSE score. Train:  14.903763567610177
MSE score. Test:  15.139123118499503
---------
MAE score. Train:  2.302958590951548
MAE score. Test:  2.3101536554304065
---------
MdAE score. Train:  1.475292066339255
MdAE score. Test:  1.4824178715961356


In [213]:
br_reg = BayesianRidge()
br_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(br_reg, X_train, X_val, y_train)
scores(br_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.8661292851387465
EV score. Test:  0.8716449540869293
---------
R2 score. Train:  0.8661292851387465
R2 score. Test:  0.8716045618170357
---------
MSE score. Train:  14.903891995845077
MSE score. Test:  15.14082547116242
---------
MAE score. Train:  2.3027316286479973
MAE score. Test:  2.3099957367683452
---------
MdAE score. Train:  1.474179354328009
MdAE score. Test:  1.4824942218325177


<a id = '2.8'></a>
[Return to top](#top)
## 3.8 HuberRegressor, TheilSenRegressor, RANSACRegressor

In [214]:
hu_reg = HuberRegressor()
hu_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(hu_reg, X_train, X_val, y_train)
scores(hu_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
HuberRegressor(alpha=0.0001, epsilon=1.35, fit_intercept=True, max_iter=100,
        tol=1e-05, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.8604546964553823
EV score. Test:  0.8672332321034703
---------
R2 score. Train:  0.8597160149409656
R2 score. Test:  0.866108718448451
---------
MSE score. Train:  15.617884495750403
MSE score. Test:  15.788913958091436
---------
MAE score. Train:  2.148462126072797
MAE score. Test:  2.1405980251973538
---------
MdAE score. Train:  1.1145426957987974
MdAE score. Test:  1.0624309904125768


In [215]:
ts_reg = TheilSenRegressor()
ts_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ts_reg, X_train, X_val, y_train)
scores(ts_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
TheilSenRegressor(copy_X=True, fit_intercept=True, max_iter=300,
         max_subpopulation=10000, n_jobs=None, n_subsamples=None,
         random_state=None, tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.8536964470768756
EV score. Test:  0.8635076522491181
---------
R2 score. Train:  0.8502902510838656
R2 score. Test:  0.8597084523985269
---------
MSE score. Train:  16.667259384426814
MSE score. Test:  16.54365503458371
---------
MAE score. Train:  2.4970547648266708
MAE score. Test:  2.4684186629117995
---------
MdAE score. Train:  1.5586882421687673
MdAE score. Test:  1.524489179180473


In [216]:
ran_reg = RANSACRegressor()
ran_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ran_reg, X_train, X_val, y_train)
scores(ran_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
RANSACRegressor(base_estimator=None, is_data_valid=None, is_model_valid=None,
        loss='absolute_loss', max_skips=inf, max_trials=100,
        min_samples=None, random_state=None, residual_threshold=None,
        stop_n_inliers=inf, stop_probability=0.99, stop_score=inf)
______________________________________________________________________________
EV score. Train:  -1.5333554102738554e+25
EV score. Test:  -1.4479236966822343e+25
---------
R2 score. Train:  -1.541733473399785e+25
R2 score. Test:  -1.4605966571743452e+25
---------
MSE score. Train:  1.716419397457033e+27
MSE score. Test:  1.722385108303187e+27
---------
MAE score. Train:  8778636869973.04
MAE score. Test:  8762876501199.338
---------
MdAE score. Train:  1.95703125
MdAE score. Test:  1.8955078125


<a id = '2.9'></a>
[Return to top](#top)
## 3.9 MLPRegressor

In [217]:
mlp_reg = MLPRegressor(max_iter=3000)
mlp_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(mlp_reg, X_train, X_val, y_train)
scores(mlp_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=3000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9784103928302242
EV score. Test:  0.8762615103339596
---------
R2 score. Train:  0.9782534117117805
R2 score. Test:  0.8762074187956501
---------
MSE score. Train:  2.4210582834464316
MSE score. Test:  14.59804096753689
---------
MAE score. Train:  1.1394581647032844
MAE score. Test:  2.6188643184785327
---------
MdAE score. Train:  0.8626107099373974
MdAE score. T

<a id = '2.10'></a>
[Return to top](#top)
## 3.10 SVR

In [218]:

# Note: degree only applies to the 'poly' kernel; the default 'rbf' kernel ignores it.
svr_reg = SVR(degree = 2)
svr_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(svr_reg, X_train, X_val, y_train)
scores(svr_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
SVR(C=1.0, cache_size=200, coef0=0.0, degree=2, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.818198263745133
EV score. Test:  0.7869868253217395
---------
R2 score. Train:  0.8112134179776052
R2 score. Test:  0.7778917505257301
---------
MSE score. Train:  21.0177022782216
MSE score. Test:  26.191757967313176
---------
MAE score. Train:  2.1349102337731627
MAE score. Test:  2.4387797599978436
---------
MdAE score. Train:  0.9538834745483618
MdAE score. Test:  1.0989921345454974


<a id = '4'></a>
[Return to top](#top)
# 4. Tuning hyperparameters

We will tune the hyperparameters of one ensemble method: random forest.

<a id = '3.1'></a>
[Return to top](#top)
## 4.1 Tuning random forests

In [228]:
X_train.shape

(13050, 107)

In [232]:
np.amin(X_train)

-3.1486467899651593

In [233]:
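# Note: the target here is continuous, so hpsklearn's any_regressor would be the
# matching component; any_classifier is imported but never used below.
from hpsklearn import HyperoptEstimator, any_classifier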
from hyperopt import tpe
estim = HyperoptEstimator(algo = tpe.suggest)

estim.fit(X_train, y_train)

print(estim.score(X_val, y_val))
print(estim.best_model())

  0%|          | 0/1 [00:00<?, ?it/s, best loss: ?]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [219]:
from sklearn.model_selection import RandomizedSearchCV

In [220]:
param_dist = {'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 3, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
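
To get a feel for how much of this search space a randomized search visits, the sketch below counts the combinations in `param_dist` and compares that with the `n_iter = 100` and `cv = 3` settings passed to `RandomizedSearchCV` in this notebook. It uses only the standard library and does not run any search.

```python
from math import prod

# Same search space as param_dist in the notebook.
param_dist = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 3, 4],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

n_iter = 100  # number of candidates the randomized search samples
cv = 3        # number of cross-validation folds per candidate

# Size of the full grid: product of the number of options per hyperparameter.
grid_size = prod(len(values) for values in param_dist.values())
total_fits = n_iter * cv
coverage = n_iter / grid_size

print(grid_size)   # 5280
print(total_fits)  # 300
print(f"{coverage:.1%} of the grid is sampled")
```

An exhaustive grid search over the same space with 3 folds would need 5280 × 3 = 15840 fits, which is why a randomized search is the practical choice here.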

In [234]:
rf_reg = RandomForestRegressor()
regs = RandomizedSearchCV(estimator = rf_reg, param_distributions = param_dist, n_iter = 100, cv = 3, verbose=3, random_state=42, n_jobs = -1)
regs.fit(X_trainn, y_trainn)
y_trainn_reg, y_test_reg = regression(regs, X_trainn, X_test, y_trainn)
print(regs.best_estimator_)
scores(regs, y_trainn, y_test, y_trainn_reg, y_test_reg)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed: 25.4min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 134.7min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed: 307.2min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 320.1min finished


Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed: 34.8min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 155.1min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed: 342.1min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 352.4min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=10,
           min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=100, n_jobs=-1,
    

In [235]:
rf = regs.best_estimator_

In [237]:
rf_final = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=10,
           min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Now it is time to train the final model on the full dataset and save it.

In [238]:
rf_final.fit(X, y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=10,
           min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [239]:
import pickle
# Use a context manager so the file is closed once the model has been written.
with open('/Users/CatLover/Documents/DataScience/BostonCrime/rf_final.p', 'wb') as f:
    pickle.dump(rf_final, f)
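
A saved model is only useful if it can be read back. The sketch below shows the save/load round trip with `pickle`, using a plain dictionary as a stand-in for the fitted model and a temporary directory as a hypothetical location, so it runs without the notebook's data or paths.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (assumption for illustration);
# any picklable object is handled the same way.
model = {"n_estimators": 800, "max_depth": 10}

# Hypothetical location; the notebook writes to its own path instead.
path = os.path.join(tempfile.mkdtemp(), "rf_final.p")

# Save: the context manager closes the file after writing.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load: open in binary read mode and unpickle.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

For a real scikit-learn estimator it is safest to load the pickle with the same scikit-learn version that created it, and to unpickle only files from a trusted source.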