<a id = 'top'></a>

# Crime in Boston, Revisited (Version 1.0)

**Ying Zhou**

**Table of contents**

[1. Data wrangling](#1)

[1.1 Exploration](#1.1)

[1.2 Removing irrelevant columns](#1.2)

[1.3 Process location data](#1.3)

[1.4 Process time](#1.4)

[1.5 Remove non-crimes](#1.5)

[1.6 Combine the two dataframes](#1.6)

Now let's return to the problem of crime in Boston. This time we will predict the number of crimes, do some validation, and finally use all of the data to forecast crime in Boston in the future. We will skip the preliminary analysis, because the last 3-4 years of data were already explored in detail in the previous project.

Again let's first import the usual packages.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

Since we need to do some machine learning, let's import the regression-related parts of sklearn too. This local computer cannot handle deep learning, which is why we won't import Keras. If necessary, we will run some regressions on Google Colab.

In [2]:
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler

from sklearn.metrics import mean_squared_error, median_absolute_error, mean_absolute_error
from sklearn.metrics import r2_score, explained_variance_score
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor

from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.linear_model import Ridge, RidgeCV, BayesianRidge
from sklearn.linear_model import HuberRegressor, TheilSenRegressor, RANSACRegressor

from sklearn.neural_network import MLPRegressor

Since we need to draw graphs, we define our `multiliner` function here. It prefixes tick labels with a growing number of newlines so that neighbouring labels sit on different lines, which leaves more room when the tick labels are really long.

In [3]:
def multiliner(string_list, n):
    length = len(string_list)
    for i in range(length):
        rem = i % n
        string_list[i] = '\n' * rem + string_list[i]
    return string_list
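
As a quick illustration of what `multiliner` returns, here is a small sketch. The helper is repeated so the cell runs on its own, and the label names are only examples. Note that the helper edits the list it is given, so the sketch passes a copy.

```python
def multiliner(string_list, n):
    # Same logic as the helper above, repeated so this example is self-contained.
    length = len(string_list)
    for i in range(length):
        rem = i % n
        string_list[i] = '\n' * rem + string_list[i]
    return string_list

# Example labels only; any list of strings works.
labels = ['Larceny', 'Aggravated Assault', 'Residential Burglary', 'Auto Theft']

# Pass a copy, because the helper modifies its argument in place.
staggered = multiliner(list(labels), 2)
# Every second label now starts with one newline, so it is drawn one line lower.
```

With `n = 2` the labels alternate between two heights; a larger `n` spreads them over more lines.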

Time to get the data!

In [4]:
new_url = 'https://og-production-open-data-bostonma-892364687672.s3.amazonaws.com/resources/12cb3883-56f5-47de-afa5-3b1cf61b257b/tmpzokwq5wj.csv?Signature=giZBvWOyWPaSWlxS2QZQHaZ5WiI%3D&Expires=1561927370&AWSAccessKeyId=AKIAJJIENTAPKHZMIPXQ'
old_url = 'https://og-production-open-data-bostonma-892364687672.s3.amazonaws.com/resources/ba5ed0e2-e901-438c-b2e0-4acfc3c452b9/crime-incident-reports-july-2012-august-2015-source-legacy-system.csv?Signature=2yHVUsMcwrQvciLozm3e2dOXnco%3D&Expires=1561927471&AWSAccessKeyId=AKIAJJIENTAPKHZMIPXQ'

In [5]:
df_new = pd.read_csv(new_url)
df_old = pd.read_csv(old_url)
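
The signed URLs above carry an `Expires` parameter, so they may no longer work and a saved copy of the CSV files may be needed instead. The sketch below shows one way to read such a copy while pinning the types of identifier-like columns. The in-memory text is a made-up stand-in for a real file, and the choice of columns to read as strings is an assumption.

```python
import io
import pandas as pd

# Tiny made-up stand-in for a downloaded CSV file.
csv_text = (
    "INCIDENT_NUMBER,REPORTING_AREA,Lat\n"
    "I192049735,618,42.340438\n"
    "I192049736,,42.33144\n"
)

# Reading identifier-like columns as strings avoids pandas guessing a mixed type.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'INCIDENT_NUMBER': str, 'REPORTING_AREA': str},
)
```

For a real file, `io.StringIO(csv_text)` would be replaced by the path of the saved CSV.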



[Return to top](#top)
<a id = '1'></a>
# 1. Data Wrangling

<a id = '1.1'></a>
[Return to top](#top)
## 1.1 Exploration

In [7]:
df_new.shape

(399168, 17)

In [8]:
df_old.head()

Unnamed: 0,COMPNOS,NatureCode,INCIDENT_TYPE_DESCRIPTION,MAIN_CRIMECODE,REPTDISTRICT,REPORTINGAREA,FROMDATE,WEAPONTYPE,Shooting,DOMESTIC,SHIFT,Year,Month,DAY_WEEK,UCRPART,X,Y,STREETNAME,XSTREETNAME,Location
0,120420285.0,BERPTA,RESIDENTIAL BURGLARY,05RB,D4,629,07/08/2012 06:00:00 AM,Other,No,No,Last,2012,7,Sunday,Part One,763273.1791,2951498.962,ABERDEEN ST,,"(42.34638135, -71.10379454)"
1,120419202.0,PSHOT,AGGRAVATED ASSAULT,04xx,B2,327,07/08/2012 06:03:00 AM,Firearm,Yes,No,Last,2012,7,Sunday,Part One,771223.1638,2940772.099,HOWARD AV,,"(42.31684135, -71.07458456)"
2,120419213.0,ARMROB,ROBBERY,03xx,D4,625,07/08/2012 06:26:00 AM,Firearm,No,No,Last,2012,7,Sunday,Part One,765118.8605,2950217.536,JERSEY ST,QUEENSBERRY ST,"(42.34284135, -71.09698955)"
3,120419223.0,ALARMC,COMMERCIAL BURGLARY,05CB,B2,258,07/08/2012 06:56:00 AM,Other,No,No,Last,2012,7,Sunday,Part One,773591.8648,2940638.174,COLUMBIA RD,,"(42.3164411, -71.06582908)"
4,120419236.0,ARMROB,ROBBERY,03xx,E18,496,07/08/2012 07:15:00 AM,Firearm,No,No,Last,2012,7,Sunday,Part One,759042.7315,2923832.681,COLLINS ST,,"(42.27051636, -71.11989955)"


In [10]:
df_new.head()

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
0,I192049736,3831,Motor Vehicle Accident Response,M/V - LEAVING SCENE - PROPERTY DAMAGE,,,,2019-06-28 21:43:00,2019,6,Friday,21,Part Three,,42.33144,-71.094469,"(42.33143980, -71.09446919)"
1,I192049735,2900,Other,VAL - VIOLATION OF AUTO LAW - OTHER,D4,618.0,,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895,"(42.34043847, -71.08895009)"
2,I192049735,2906,Violations,VAL - OPERATING UNREG/UNINS CAR,D4,618.0,,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895,"(42.34043847, -71.08895009)"
3,I192049733,3006,Medical Assistance,SICK/INJURED/MEDICAL - PERSON,A1,95.0,,2019-06-29 20:18:00,2019,6,Saturday,20,Part Three,STATE ST,42.359613,-71.051958,"(42.35961316, -71.05195809)"
4,I192049732,3114,Investigate Property,INVESTIGATE PROPERTY,B3,410.0,,2019-06-29 21:16:00,2019,6,Saturday,21,Part Three,CEDAR ST,42.273353,-71.075589,"(42.27335315, -71.07558873)"


In [9]:
df_old.shape

(268056, 20)

In [11]:
df_new.dtypes

INCIDENT_NUMBER         object
OFFENSE_CODE             int64
OFFENSE_CODE_GROUP      object
OFFENSE_DESCRIPTION     object
DISTRICT                object
REPORTING_AREA          object
SHOOTING                object
OCCURRED_ON_DATE        object
YEAR                     int64
MONTH                    int64
DAY_OF_WEEK             object
HOUR                     int64
UCR_PART                object
STREET                  object
Lat                    float64
Long                   float64
Location                object
dtype: object

In [12]:
df_old.dtypes

COMPNOS                      float64
NatureCode                    object
INCIDENT_TYPE_DESCRIPTION     object
MAIN_CRIMECODE                object
REPTDISTRICT                  object
REPORTINGAREA                  int64
FROMDATE                      object
WEAPONTYPE                    object
Shooting                      object
DOMESTIC                      object
SHIFT                         object
Year                           int64
Month                          int64
DAY_WEEK                      object
UCRPART                       object
X                            float64
Y                            float64
STREETNAME                    object
XSTREETNAME                   object
Location                      object
dtype: object

We are very interested in knowing whether the `Lat` / `Long` / `Location` data contains de facto NaN values that aren't labelled as NaN.

In [13]:
df_new['Lat'].value_counts()

 42.348624    1592
 42.361839    1552
 42.284826    1375
 42.328663    1277
 42.256216    1187
 42.297555    1040
 42.341288     960
 42.331521     955
-1.000000      900
 42.335119     875
 42.352312     832
 42.326966     826
 42.309719     812
 42.339542     808
 42.332108     782
 42.326968     781
 42.355123     759
 42.334018     699
 42.298489     678
 42.342850     678
 42.310434     659
 42.334288     648
 42.350959     623
 42.349802     621
 42.333679     619
 42.366435     605
 42.356024     596
 42.370818     594
 42.352418     585
 42.349056     581
              ... 
 42.259312       1
 42.380392       1
 42.357697       1
 42.326103       1
 42.276166       1
 42.343468       1
 42.294600       1
 42.311653       1
 42.263175       1
 42.309143       1
 42.340034       1
 42.284756       1
 42.317088       1
 42.357355       1
 42.323489       1
 42.333038       1
 42.246237       1
 42.285579       1
 42.357968       1
 42.343208       1
 42.279179       1
 42.328330  

In [14]:
df_new['Long'].value_counts()

-71.082776    1592
-71.059765    1552
-71.091374    1375
-71.085634    1277
-71.124019    1187
-71.059709    1040
-71.054679     960
-71.070853     955
-1.000000      900
-71.074917     875
-71.063705     832
-71.061986     826
-71.104294     812
-71.069409     808
-71.070144     782
-71.080519     781
-71.060880     759
-71.076381     699
-71.063133     678
-71.065162     678
-71.061340     659
-71.072395     648
-71.074128     623
-71.078410     621
-71.091878     619
-71.061354     605
-71.061776     596
-71.039291     594
-71.065255     585
-71.150498     581
              ... 
-71.166579       1
-71.069261       1
-71.160414       1
-71.114912       1
-71.090630       1
-71.067131       1
-71.150049       1
-71.068327       1
-71.171106       1
-71.110494       1
-71.151113       1
-71.046307       1
-71.095733       1
-71.163494       1
-71.070267       1
-71.058111       1
-71.055086       1
-71.137020       1
-71.022056       1
-71.125175       1
-71.119572       1
-71.150656  

In [15]:
df_new['Location'].value_counts()

(0.00000000, 0.00000000)       25455
(42.34862382, -71.08277637)     1592
(42.36183857, -71.05976489)     1552
(42.28482577, -71.09137369)     1375
(42.32866284, -71.08563401)     1277
(42.25621592, -71.12401947)     1187
(42.29755533, -71.05970910)     1040
(42.34128751, -71.05467933)      960
(42.33152148, -71.07085307)      955
(-1.00000000, -1.00000000)       900
(42.33511904, -71.07491710)      875
(42.35231190, -71.06370510)      832
(42.32696647, -71.06198607)      826
(42.30971857, -71.10429432)      812
(42.33954199, -71.06940877)      808
(42.33210843, -71.07014395)      782
(42.32696802, -71.08051941)      781
(42.35512339, -71.06087980)      759
(42.33401829, -71.07638124)      699
(42.29848866, -71.06313294)      678
(42.34285014, -71.06516235)      678
(42.31043400, -71.06134010)      659
(42.33428841, -71.07239518)      648
(42.35095909, -71.07412780)      623
(42.34980175, -71.07840978)      621
(42.33367922, -71.09187755)      619
(42.36643546, -71.06135413)      605
(

Other than the (0, 0) and (-1, -1) entries, the coordinates are mostly reasonable. So we will apply a filter and treat completely absurd outliers as NAs.
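
Before filtering, it can help to count how many values are placeholders rather than real coordinates. The sketch below does this on a few illustrative values. The 40 to 45 degree band is the same rough range that `lat_na_er` uses later in this notebook, and the sample series is made up.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the real column would be df_new['Lat'].
lat = pd.Series([42.348624, -1.0, 0.0, 42.361839, np.nan])

# Rough latitude band around Boston; anything outside it is treated as a placeholder.
plausible = lat.between(40, 45)

# Values that are present but implausible, i.e. de facto NaNs not labelled as NaN.
suspicious = lat.notna() & ~plausible
n_suspicious = int(suspicious.sum())
```

Here `n_suspicious` counts the `-1.0` and `0.0` entries but not the genuine `NaN`.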

In [16]:
df_old['Location'].value_counts()

(0.0, 0.0)                               14981
(42.3286598, -71.08561842)                1506
(42.32543556, -71.06387302)               1008
(42.28486136, -71.09132455)                843
(42.34130529, -71.0547108)                 735
(42.31037135, -71.06123456)                714
(42.34865634, -71.08256955)                699
(42.29754136, -71.05973457)                695
(42.36164815, -71.05998657)                675
(42.33950635, -71.06938956)                635
(42.25642136, -71.12394954)                624
(42.35237134, -71.06490456)                597
(42.33325635, -71.07289955)                595
(42.35230134, -71.06367456)                580
(42.33372337, -71.09095643)                532
(42.28714136, -71.14857453)                463
(42.34898135, -71.15091453)                431
(42.32723569, -71.08059616)                426
(42.35505634, -71.06084456)                425
(42.30972244, -71.10427304)                416
(42.34710135, -71.07960455)                397
(42.35075635,

<a id = '1.2'></a>
[Return to top](#top)
## 1.2 Removing irrelevant columns

As usual, we will filter out what's irrelevant. For example, I haven't figured out what an RA number actually is. The `X` and `Y` columns in the old table are also irrelevant, so we will get rid of them.

In [141]:
df_old_simplified = df_old[['INCIDENT_TYPE_DESCRIPTION', 'FROMDATE', 'Year' ,'Month', 'DAY_WEEK', 'UCRPART', 'STREETNAME', 'Location']]

In [142]:
df_old_simplified['INCIDENT_TYPE_DESCRIPTION'].value_counts()

VAL                                 27363
OTHER LARCENY                       24443
SIMPLE ASSAULT                      17697
MedAssist                           17128
MVAcc                               13832
VANDALISM                           13339
InvPer                              12937
LARCENY FROM MOTOR VEHICLE          12742
DRUG CHARGES                        12042
FRAUD                                8742
PropLost                             8522
TOWED                                7526
RESIDENTIAL BURGLARY                 6737
InvProp                              6592
AGGRAVATED ASSAULT                   5649
Service                              5353
ROBBERY                              4974
PersLoc                              4745
AUTO THEFT                           4620
PropFound                            4316
Argue                                2833
Arrest                               1959
OTHER                                1902
FIRE                              

So homogenizing the incident categories of the two datasets can be hard. However, it still has to be done.

In [143]:
df_new_simplified = df_new[['OFFENSE_CODE_GROUP','OCCURRED_ON_DATE','YEAR','MONTH','DAY_OF_WEEK','HOUR','UCR_PART','STREET','Lat','Long']]

In [144]:
df_new_simplified['OFFENSE_CODE_GROUP'].value_counts()

Motor Vehicle Accident Response              46488
Larceny                                      32413
Medical Assistance                           30270
Investigate Person                           23342
Other                                        22474
Drug Violation                               20747
Simple Assault                               19909
Vandalism                                    18845
Verbal Disputes                              16572
Investigate Property                         14118
Towed                                        14049
Larceny From Motor Vehicle                   13077
Property Lost                                12542
Warrant Arrests                              10384
Aggravated Assault                            9910
Fraud                                         7675
Violations                                    7500
Missing Person Located                        6746
Residential Burglary                          6538
Auto Theft                     

We are definitely going to restrict our attention to major crimes.

In [145]:
df_old_simplified.dtypes

INCIDENT_TYPE_DESCRIPTION    object
FROMDATE                     object
Year                          int64
Month                         int64
DAY_WEEK                     object
UCRPART                      object
STREETNAME                   object
Location                     object
dtype: object

<a id = '1.3'></a>
[Return to top](#top)
## 1.3 Process location data

In [146]:
def get_lat_long(loc_string):
    loc_list = loc_string.lstrip('(').rstrip(')').split()
    return loc_list[0].strip(','), loc_list[1]

In [147]:
get_lat_long('(42.34638135, -71.10379454)')

('42.34638135', '-71.10379454')
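
The same parsing can also be done for a whole column at once. This is only a sketch of an alternative, not what the notebook runs: it uses `Series.str.extract` with a regular expression of my own choosing, applied to two sample strings in the `Location` format.

```python
import pandas as pd

# Two sample strings in the same format as the Location column.
loc = pd.Series(['(42.34638135, -71.10379454)', '(0.0, 0.0)'])

# Capture the two numbers inside the parentheses; each group becomes a column.
coords = loc.str.extract(r'\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)').astype(float)
coords.columns = ['Lat', 'Long']
```

Unlike `get_lat_long`, this returns floats directly, and rows that do not match the pattern come out as `NaN`.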

In [148]:
df_old_simplified['Lat'] = df_old_simplified['Location'].apply(lambda x: get_lat_long(x)[0])
df_old_simplified['Long'] = df_old_simplified['Location'].apply(lambda x: get_lat_long(x)[1])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
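
This warning appears because `df_old_simplified` was created by selecting columns from `df_old`, and pandas cannot tell whether the assignment is meant for the subset or for the original. The assignment still works here, but a common way to avoid the warning is to take an explicit copy when creating the subset. A minimal sketch on a toy dataframe (the column values are made up):

```python
import pandas as pd

# Toy stand-in for df_old; the values are illustrative only.
df = pd.DataFrame({'Location': ['(42.1, -71.2)', '(0.0, 0.0)'], 'X': [1.0, 2.0]})

# .copy() makes the subset an independent dataframe.
subset = df[['Location']].copy()

# Adding a column to the copy does not warn and does not touch df.
subset['is_zero'] = subset['Location'] == '(0.0, 0.0)'
```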
  


In [149]:
df_old_simplified.tail()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Location,Lat,Long
268051,Motor Vehicle Accident Response,08/10/2015 02:38:00 AM,2015,8,Monday,Part Three,HARVARD ST,"(0.0, 0.0)",0.0,0.0
268052,Police Service Incidents,08/10/2015 04:46:00 AM,2015,8,Monday,Part Three,DORCHESTER AVE,"(0.0, 0.0)",0.0,0.0
268053,Motor Vehicle Accident Response,08/10/2015 04:48:00 AM,2015,8,Monday,Part Three,DECKARD ST,"(0.0, 0.0)",0.0,0.0
268054,Investigate Person,08/10/2015 05:01:00 AM,2015,8,Monday,Part Three,HAMMOND ST,"(0.0, 0.0)",0.0,0.0
268055,Motor Vehicle Accident Response,08/10/2015 05:20:00 AM,2015,8,Monday,Part Three,,"(0.0, 0.0)",0.0,0.0


In [150]:
del df_old_simplified['Location']

Now we need to turn the invalid coordinates into NAs.

In [151]:
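def lat_na_er(num_string):
    # Return the latitude as a float, or NaN if it cannot be parsed
    # or lies outside the rough 40..45 degree band around Boston.
    try:
        num = float(num_string)
        if num < 40 or num > 45:
            return np.nan
        return num
    except ValueError:
        return np.nan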
    

In [152]:
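def long_na_er(num_string):
    # Return the longitude as a float, or NaN if it cannot be parsed
    # or lies outside the rough -75..-70 degree band around Boston.
    try:
        num = float(num_string)
        if num < -75 or num > -70:
            return np.nan
        return num
    except ValueError:
        return np.nan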
    

In [153]:
df_old_simplified['Lat'] = df_old_simplified['Lat'].apply(lat_na_er)
df_old_simplified['Long'] = df_old_simplified['Long'].apply(long_na_er)



In [154]:
df_old_simplified.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long
0,RESIDENTIAL BURGLARY,07/08/2012 06:00:00 AM,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,07/08/2012 06:03:00 AM,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585
2,ROBBERY,07/08/2012 06:26:00 AM,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,07/08/2012 06:56:00 AM,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,07/08/2012 07:15:00 AM,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199


In [155]:
df_old_simplified.describe()

Unnamed: 0,Year,Month,Lat,Long
count,268056.0,268056.0,253075.0,253075.0
mean,2013.538664,6.589134,42.323847,-71.08336
std,0.970562,3.323806,0.031772,0.030869
min,2012.0,1.0,42.232264,-71.178674
25%,2013.0,4.0,42.299386,-71.098625
50%,2014.0,7.0,42.32866,-71.078035
75%,2014.0,9.0,42.349236,-71.06228
max,2015.0,12.0,42.395105,-70.964365


Great. We need to do the same for the new dataframe.

In [156]:
df_new_simplified['Lat'] = df_new_simplified['Lat'].apply(lat_na_er)
df_new_simplified['Long'] = df_new_simplified['Long'].apply(long_na_er)



In [157]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long
0,Motor Vehicle Accident Response,2019-06-28 21:43:00,2019,6,Friday,21,Part Three,,42.33144,-71.094469
1,Other,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895
2,Violations,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895
3,Medical Assistance,2019-06-29 20:18:00,2019,6,Saturday,20,Part Three,STATE ST,42.359613,-71.051958
4,Investigate Property,2019-06-29 21:16:00,2019,6,Saturday,21,Part Three,CEDAR ST,42.273353,-71.075589


In [158]:
df_new_simplified.describe()

Unnamed: 0,YEAR,MONTH,HOUR,Lat,Long
count,399168.0,399168.0,399168.0,372813.0,372813.0
mean,2016.96132,6.56267,13.112218,42.322165,-71.08296
std,1.222671,3.36401,6.28956,0.031897,0.029694
min,2015.0,1.0,0.0,42.232413,-71.178674
25%,2016.0,4.0,9.0,42.297521,-71.097315
50%,2017.0,7.0,14.0,42.32561,-71.077649
75%,2018.0,9.0,18.0,42.34861,-71.062607
max,2019.0,12.0,23.0,42.395042,-70.963676
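
The element-wise helpers `lat_na_er` and `long_na_er` could also be written in vectorized form. The sketch below is one possible version, with a hypothetical helper name `clean_coordinate` and made-up sample values. It keeps the same bounds, with the limits themselves counted as valid, just as in the helpers.

```python
import numpy as np
import pandas as pd

def clean_coordinate(values, low, high):
    """Return floats, with unparseable or out-of-range entries replaced by NaN."""
    # errors='coerce' turns anything that is not a number into NaN.
    nums = pd.to_numeric(values, errors='coerce')
    # where() keeps values inside [low, high] and blanks out the rest.
    return nums.where(nums.between(low, high))

# Made-up samples: strings as in the old table, floats as in the new one.
lat = clean_coordinate(pd.Series(['42.346381', '0.0', '-1.0', 'abc']), 40, 45)
lon = clean_coordinate(pd.Series([-71.103795, 0.0, -1.0, np.nan]), -75, -70)
```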


Now we need to process time.

<a id = '1.4'></a>
[Return to top](#top)
## 1.4 Process time

In [159]:
df_new_simplified['OCCURRED_ON_DATE'].isna().sum()

0

In [160]:
df_old_simplified['FROMDATE'].isna().sum()

0

At least there are no explicit NAs. Now let's check the timeline.

In [161]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long
0,Motor Vehicle Accident Response,2019-06-28 21:43:00,2019,6,Friday,21,Part Three,,42.33144,-71.094469
1,Other,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895
2,Violations,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895
3,Medical Assistance,2019-06-29 20:18:00,2019,6,Saturday,20,Part Three,STATE ST,42.359613,-71.051958
4,Investigate Property,2019-06-29 21:16:00,2019,6,Saturday,21,Part Three,CEDAR ST,42.273353,-71.075589


We need to round times to the nearest hour because police officers don't really document minutes and seconds carefully (to see why, please check out the old Crime in Boston project).

In [162]:
df_new_simplified['day'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[8:10]))
df_new_simplified['min'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[-5:-3]))
df_new_simplified['sec'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[-2:]))



In [163]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,day,min,sec
0,Motor Vehicle Accident Response,2019-06-28 21:43:00,2019,6,Friday,21,Part Three,,42.33144,-71.094469,28,43,0
1,Other,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895,29,57,0
2,Violations,2019-06-29 20:57:00,2019,6,Saturday,20,Part Two,HUNTINGTON AVE,42.340438,-71.08895,29,57,0
3,Medical Assistance,2019-06-29 20:18:00,2019,6,Saturday,20,Part Three,STATE ST,42.359613,-71.051958,29,18,0
4,Investigate Property,2019-06-29 21:16:00,2019,6,Saturday,21,Part Three,CEDAR ST,42.273353,-71.075589,29,16,0


In [164]:
del df_new_simplified['OCCURRED_ON_DATE']

In [165]:
def is_leap(year):
    if year % 4 != 0:
        return False
    elif year % 100 != 0:
        return True
    elif year % 400 != 0:
        return False
    else:
        return True

def num_of_days(month, year):
    non_leap = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    if month != 2:
        return non_leap[month - 1]
    else:
        if is_leap(year):
            return 29
        else:
            return 28
tie_break_round_up = False #Tie break round up status
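
For comparison, the standard library already knows the length of each month. The sketch below wraps `calendar.monthrange` in a hypothetical helper named `num_of_days_std`, which should agree with `num_of_days` above.

```python
import calendar

def num_of_days_std(month, year):
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]

# February in a leap year and in a normal year.
feb_2016 = num_of_days_std(2, 2016)
feb_2019 = num_of_days_std(2, 2019)
```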




In [166]:
for index, row in df_new_simplified.iterrows():
    round_up = False #Round up this time?
    if df_new_simplified.at[index, 'min'] == 30 and df_new_simplified.at[index, 'sec'] == 0: #Tie break
        if tie_break_round_up:
            round_up = True
        tie_break_round_up = not tie_break_round_up
    if df_new_simplified.at[index, 'min'] > 30 or (df_new_simplified.at[index, 'min'] == 30 and df_new_simplified.at[index, 'sec'] > 0):
        round_up = True
    if round_up:
        df_new_simplified.at[index, 'HOUR'] = df_new_simplified.at[index, 'HOUR'] + 1
        if df_new_simplified.at[index, 'HOUR'] == 24:
            df_new_simplified.at[index, 'HOUR'] = 0
            df_new_simplified.at[index, 'day'] = df_new_simplified.at[index, 'day'] + 1
            if df_new_simplified.at[index, 'day'] > num_of_days(df_new_simplified.at[index, 'MONTH'], df_new_simplified.at[index, 'YEAR']):
                df_new_simplified.at[index, 'day'] = 1
                df_new_simplified.at[index, 'MONTH'] = df_new_simplified.at[index, 'MONTH'] + 1
                if df_new_simplified.at[index,'MONTH'] == 13:
                    df_new_simplified.at[index,'MONTH'] = 1
                    df_new_simplified.at[index, 'YEAR'] = df_new_simplified.at[index, 'YEAR'] + 1
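
The row-by-row loop above can be slow on several hundred thousand rows. Below is a sketch of a vectorized alternative that works on real timestamps, so the roll-over into the next day, month or year is handled by pandas. The function name `round_to_hour` and the sample timestamps are my own. The tie rule mirrors the alternating flag (first tie down, second tie up, and so on) within a single series. Since the loop above does not update `DAY_OF_WEEK` when a time rolls past midnight, the sketch derives the day name from the rounded timestamp instead.

```python
import pandas as pd

def round_to_hour(timestamps):
    """Round to the nearest hour; exact :30:00 ties alternate down, up, down, ..."""
    ts = pd.to_datetime(timestamps)
    # Start of the hour each timestamp falls in.
    floored = ts.dt.normalize() + pd.to_timedelta(ts.dt.hour, unit='h')
    remainder = ts - floored
    half = pd.Timedelta(minutes=30)
    is_tie = remainder == half
    # The 2nd, 4th, ... tie rounds up, like a flag that starts as False.
    tie_up = is_tie & (is_tie.cumsum() % 2 == 0)
    round_up = (remainder > half) | tie_up
    return floored + pd.to_timedelta(round_up.astype(int), unit='h')

# Sample timestamps in the OCCURRED_ON_DATE format, including two exact ties.
raw = pd.Series(['2019-06-28 21:43:00', '2019-06-29 20:18:00', '2019-12-31 23:45:00',
                 '2019-01-01 10:30:00', '2019-01-01 11:30:00'])
rounded = round_to_hour(raw)

# Rebuild the separate columns from the rounded timestamp.
parts = pd.DataFrame({'YEAR': rounded.dt.year, 'MONTH': rounded.dt.month,
                      'day': rounded.dt.day, 'HOUR': rounded.dt.hour,
                      'DAY_OF_WEEK': rounded.dt.day_name()})
```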

In [167]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,day,min,sec
0,Motor Vehicle Accident Response,2019,6,Friday,22,Part Three,,42.33144,-71.094469,28,43,0
1,Other,2019,6,Saturday,21,Part Two,HUNTINGTON AVE,42.340438,-71.08895,29,57,0
2,Violations,2019,6,Saturday,21,Part Two,HUNTINGTON AVE,42.340438,-71.08895,29,57,0
3,Medical Assistance,2019,6,Saturday,20,Part Three,STATE ST,42.359613,-71.051958,29,18,0
4,Investigate Property,2019,6,Saturday,21,Part Three,CEDAR ST,42.273353,-71.075589,29,16,0


In [168]:
c = '07/08/2012 00:45:39 PM'

In [169]:
int(c[17:19])

39

In [170]:
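def extract_hour(old_string):
    # Convert the 12-hour clock in 'MM/DD/YYYY hh:mm:ss AM/PM' to an hour from 0 to 23.
    hour = int(old_string[11:13])
    code = old_string[-2:]
    if hour == 12:  # 12 AM is hour 0; 12 PM becomes 0 + 12 below
        hour = hour - 12
    if code == 'PM':
        hour = hour + 12
    return hour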

In [171]:
extract_hour(c)

12
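
As a cross-check on `extract_hour`, the old timestamp format can also be parsed by pandas with an explicit format string, where `%I` is the 12-hour clock and `%p` is the AM/PM marker. A small sketch on made-up timestamps in the `FROMDATE` format:

```python
import pandas as pd

# Made-up timestamps in the FROMDATE format, covering the 12 AM and 12 PM cases.
old = pd.Series(['07/08/2012 06:00:00 AM', '07/08/2012 12:05:00 AM',
                 '07/08/2012 12:05:00 PM', '07/08/2012 07:15:00 PM'])

parsed = pd.to_datetime(old, format='%m/%d/%Y %I:%M:%S %p')
hours = parsed.dt.hour.tolist()
```

The resulting hours are on the 24-hour clock, so they can be compared directly with what `extract_hour` returns for the same strings.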

In [172]:
df_old_simplified['day'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[3:5]))
df_old_simplified['min'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[14:16]))
df_old_simplified['sec'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[17:19]))
df_old_simplified['hour'] = df_old_simplified['FROMDATE'].apply(extract_hour)


In [173]:
for index, row in df_old_simplified.iterrows():
    round_up = False #Round up this time?
    if df_old_simplified.at[index, 'min'] == 30 and df_old_simplified.at[index, 'sec'] == 0: #Tie break
        if tie_break_round_up:
            round_up = True
        tie_break_round_up = not tie_break_round_up
    if df_old_simplified.at[index, 'min'] > 30 or (df_old_simplified.at[index, 'min'] == 30 and df_old_simplified.at[index, 'sec'] > 0):
        round_up = True
    if round_up:
        df_old_simplified.at[index, 'hour'] = df_old_simplified.at[index, 'hour'] + 1
        if df_old_simplified.at[index, 'hour'] == 24:
            df_old_simplified.at[index, 'hour'] = 0
            df_old_simplified.at[index, 'day'] = df_old_simplified.at[index, 'day'] + 1
            if df_old_simplified.at[index, 'day'] > num_of_days(df_old_simplified.at[index, 'Month'], df_old_simplified.at[index, 'Year']):
                df_old_simplified.at[index, 'day'] = 1
                df_old_simplified.at[index, 'Month'] = df_old_simplified.at[index, 'Month'] + 1
                if df_old_simplified.at[index,'Month'] == 13:
                    df_old_simplified.at[index,'Month'] = 1
                    df_old_simplified.at[index, 'Year'] = df_old_simplified.at[index, 'Year'] + 1


In [175]:
df_old_simplified.head(10)

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,min,sec,hour
0,RESIDENTIAL BURGLARY,07/08/2012 06:00:00 AM,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,0,0,6
1,AGGRAVATED ASSAULT,07/08/2012 06:03:00 AM,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,3,0,6
2,ROBBERY,07/08/2012 06:26:00 AM,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,26,0,6
3,COMMERCIAL BURGLARY,07/08/2012 06:56:00 AM,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,56,0,7
4,ROBBERY,07/08/2012 07:15:00 AM,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,15,0,7
5,ROBBERY,07/08/2012 07:32:00 AM,2012,7,Sunday,Part One,SYDNEY ST,42.313282,-71.053006,8,32,0,8
6,ROBBERY,07/08/2012 07:50:00 AM,2012,7,Sunday,Part One,REGENT ST,42.324251,-71.08621,8,50,0,8
7,SIMPLE ASSAULT,07/08/2012 07:50:00 AM,2012,7,Sunday,Part Two,WASHINGTON ST,42.349246,-71.063785,8,50,0,8
8,MedAssist,07/08/2012 07:53:00 AM,2012,7,Sunday,Part Three,FANEUIL ST,42.351746,-71.16591,8,53,0,8
9,MedAssist,07/08/2012 08:05:00 AM,2012,7,Sunday,Part Three,RIVER ST,42.259383,-71.117294,8,5,0,8


In [176]:
del df_new_simplified['min']
del df_new_simplified['sec']
del df_old_simplified['min']
del df_old_simplified['sec']
del df_old_simplified['FROMDATE']

In [177]:
df_old_simplified.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,7


<a id = '1.5'></a>
[Return to top](#top)
## 1.5 Remove non-crimes

As usual we only care about major crimes.

In [194]:
df_new_clean = df_new_simplified.loc[(df_new_simplified['UCR_PART'] == 'Part One') | (df_new_simplified['OFFENSE_CODE_GROUP'] == 'Arson')]

In [195]:
df_new_clean['UCR_PART'].value_counts()

Part One    75602
Other         107
Name: UCR_PART, dtype: int64

In [196]:
df_new_clean['OFFENSE_CODE_GROUP'].value_counts()

Larceny                       32413
Larceny From Motor Vehicle    13077
Aggravated Assault             9910
Residential Burglary           6538
Auto Theft                     5787
Robbery                        5465
Commercial Burglary            1593
Other Burglary                  555
Homicide                        264
Arson                           107
Name: OFFENSE_CODE_GROUP, dtype: int64

In [185]:
df_old_O = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Other']
df_old_NA = df_old_simplified.loc[df_old_simplified['UCRPART'].isnull()]

In [186]:
df_old_O['INCIDENT_TYPE_DESCRIPTION'].value_counts()

MVAcc                              9671
PersLoc                            3479
PersMiss                            780
07RV                                613
Hazardous                           493
Service                             260
Plates                               45
ARSON                                30
Auto Theft Recovery                  29
MedAssist                            22
HateCrim                             19
License Plate Related Incidents       5
Arson                                 3
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

In [188]:
df_old_NA.shape

(0, 10)

In [189]:
df_old_simplified['UCRPART'].value_counts()

Part Two      98341
Part One      65261
Part three    55482
Part Three    33523
Other         15449
Name: UCRPART, dtype: int64

The data is unclean: `Part three` and `Part Three` both appear. That's fine.
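The two spellings of Part Three refer to the same category. Below is a minimal sketch of how such labels could be normalized before filtering. The frame `toy` and its values are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Toy stand-in for df_old_simplified; the values are made up for illustration.
toy = pd.DataFrame({'UCRPART': ['Part One', 'Part three', 'Part Three', 'Other']})

# str.title() capitalizes every word, so both spellings collapse into one label.
toy['UCRPART'] = toy['UCRPART'].str.title()

print(toy['UCRPART'].value_counts())
```

In this notebook the two spellings are simply inspected separately, which works because neither of them is kept.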

In [190]:
df_old_2 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part Two']
df_old_3 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part Three']
df_old_33 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part three']

In [193]:
df_old_33['INCIDENT_TYPE_DESCRIPTION'].value_counts()

MedAssist                   12401
InvPer                       9448
PropLost                     5890
TOWED                        5524
InvProp                      4862
Service                      3505
PropFound                    2964
Argue                        2065
Arrest                       1374
FIRE                         1294
PhoneCalls                    995
LICViol                       836
32GUN                         747
Gather                        718
Landlord                      716
DEATH INVESTIGATION           678
SearchWarr                    521
PropDam                       502
Plates                        228
Harbor                        150
VIOLATION OF LIQUOR LAWS       30
Explos                         23
Aircraft                        7
Labor                           4
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

In [197]:
df_old_semiclean = df_old_simplified.loc[(df_old_simplified['UCRPART'] == 'Part One') | (df_old_simplified['UCRPART'] == 'Other')]

OK, I think the Part Twos, the Part Threes, and everything in `Other` except arson can be ignored.

In [199]:
df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'] = df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'].apply(lambda x: x.upper())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [200]:
df_old_semiclean.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,7
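The `SettingWithCopyWarning` raised when upper-casing the descriptions comes from assigning to a frame that was itself sliced out of another frame. A minimal sketch of one common way to avoid it is to take an explicit copy of the slice first. The frame `toy` and the name `subset` are hypothetical and the rows are made up.

```python
import pandas as pd

# Toy stand-in for df_old_simplified; the rows are made up for illustration.
toy = pd.DataFrame({
    'INCIDENT_TYPE_DESCRIPTION': ['Arson', 'ARSON', 'MVAcc'],
    'UCRPART': ['Other', 'Other', 'Part Two'],
})

# .copy() makes the filtered frame independent of the original,
# so the assignment below is unambiguous and raises no warning.
subset = toy.loc[toy['UCRPART'] == 'Other'].copy()

# The vectorized .str.upper() does the same job as apply(lambda x: x.upper()).
subset['INCIDENT_TYPE_DESCRIPTION'] = subset['INCIDENT_TYPE_DESCRIPTION'].str.upper()

print(subset['INCIDENT_TYPE_DESCRIPTION'].value_counts())
```

After this step both spellings of arson share the single label `ARSON`, so any later comparison has to use the upper-case form.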


In [201]:
df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                      24443
LARCENY FROM MOTOR VEHICLE         13265
MVACC                               9671
RESIDENTIAL BURGLARY                7119
AGGRAVATED ASSAULT                  6008
ROBBERY                             5193
AUTO THEFT                          4851
PERSLOC                             3479
COMMERCIAL BURGLARY                 1550
BENOPROP                            1367
LARCENY                             1288
PERSMISS                             780
07RV                                 613
HAZARDOUS                            493
SERVICE                              260
HOMICIDE                             144
PLATES                                45
ARSON                                 33
AUTO THEFT RECOVERY                   29
MEDASSIST                             22
OTHER BURGLARY                        22
HATECRIM                              19
MANSLAUG                               9
LICENSE PLATE RELATED INCIDENTS        5
RAPE AND ATTEMPT

In [202]:
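# The descriptions were upper-cased above, so the comparison with 'Arson' matches nothing:
# only the Part One rows survive, as the counts below show. Matching 'ARSON' would keep the old arson records.
df_old_clean = df_old_semiclean.loc[(df_old_semiclean['UCRPART'] == 'Part One') | (df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'] == 'Arson')]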

In [203]:
df_old_clean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                 24443
LARCENY FROM MOTOR VEHICLE    13265
RESIDENTIAL BURGLARY           7119
AGGRAVATED ASSAULT             6008
ROBBERY                        5193
AUTO THEFT                     4851
COMMERCIAL BURGLARY            1550
BENOPROP                       1367
LARCENY                        1288
HOMICIDE                        144
OTHER BURGLARY                   22
MANSLAUG                          9
RAPE AND ATTEMPTED                2
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

`BENOPROP` means "Break and enter, no property taken". Since it is in `Other` in the new data source, let's remove it. `RAPE AND ATTEMPTED` and `MANSLAUG` need to be removed as well because they are either absent from the new data source or not in `Part One` there.

In [204]:
removed_types = ['BENOPROP', 'MANSLAUG', 'RAPE AND ATTEMPTED']
df_old_clean = df_old_clean[~df_old_clean['INCIDENT_TYPE_DESCRIPTION'].isin(removed_types)]

In [205]:
df_old_clean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                 24443
LARCENY FROM MOTOR VEHICLE    13265
RESIDENTIAL BURGLARY           7119
AGGRAVATED ASSAULT             6008
ROBBERY                        5193
AUTO THEFT                     4851
COMMERCIAL BURGLARY            1550
LARCENY                        1288
HOMICIDE                        144
OTHER BURGLARY                   22
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

Now we can drop the `UCRPART` and `UCR_PART` columns.

In [207]:
del df_old_clean['UCRPART']

In [208]:
del df_new_clean['UCR_PART']

Let's store the data so that it isn't lost.

In [211]:
df_old_clean.to_csv('old.csv')
df_new_clean.to_csv('new.csv')

<a id = '1.6'></a>
[Return to top](#top)
## 1.6 Combine the two dataframes

Now it's time to merge the two dataframes. 

In [209]:
df_new_clean.head()

Unnamed: 0,OFFENSE_CODE_GROUP,YEAR,MONTH,DAY_OF_WEEK,HOUR,STREET,Lat,Long,day
6,Larceny From Motor Vehicle,2019,6,Saturday,21,BEACON ST,42.355052,-71.073907,29
13,Residential Burglary,2019,6,Saturday,18,SUNNYSIDE ST,,,29
18,Auto Theft,2019,6,Saturday,16,FRANCIS ST,42.336063,-71.107828,29
20,Larceny,2019,6,Friday,7,SWAN AVE,,,28
23,Aggravated Assault,2019,6,Saturday,19,WASHINGTON ST,42.355123,-71.06088,29


In [210]:
df_old_clean.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,COLLINS ST,42.270516,-71.1199,8,7


In [213]:
df_new_clean.rename(index = str, columns = {'OFFENSE_CODE_GROUP':'crime', 'YEAR': 'year', 'MONTH': 'month', 'DAY_OF_WEEK': 'dayw', 'HOUR': 'hour','STREET':'street','Lat':'lat','Long':'long','day':'day'}, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


In [214]:
df_new_clean.head()

Unnamed: 0,crime,year,month,dayw,hour,street,lat,long,day
6,Larceny From Motor Vehicle,2019,6,Saturday,21,BEACON ST,42.355052,-71.073907,29
13,Residential Burglary,2019,6,Saturday,18,SUNNYSIDE ST,,,29
18,Auto Theft,2019,6,Saturday,16,FRANCIS ST,42.336063,-71.107828,29
20,Larceny,2019,6,Friday,7,SWAN AVE,,,28
23,Aggravated Assault,2019,6,Saturday,19,WASHINGTON ST,42.355123,-71.06088,29


In [215]:
df_old_clean.rename(index = str, columns = {'INCIDENT_TYPE_DESCRIPTION':'crime', 'Year': 'year', 'Month': 'month', 'DAY_WEEK': 'dayw', 'hour': 'hour','STREETNAME':'street','Lat':'lat','Long':'long','day':'day'}, inplace = True)

In [216]:
df_old_clean.head()

Unnamed: 0,crime,year,month,dayw,street,lat,long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,COLLINS ST,42.270516,-71.1199,8,7


In [217]:
correct_order = ['crime','year','month','day','dayw','hour','street','lat','long']

In [218]:
df_old_clean = df_old_clean[correct_order]
df_new_clean = df_new_clean[correct_order]

In [219]:
df_old_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
0,RESIDENTIAL BURGLARY,2012,7,8,Sunday,6,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,HOWARD AV,42.316841,-71.074585
2,ROBBERY,2012,7,8,Sunday,6,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,2012,7,8,Sunday,7,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,2012,7,8,Sunday,7,COLLINS ST,42.270516,-71.1199


In [220]:
df_new_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
6,Larceny From Motor Vehicle,2019,6,29,Saturday,21,BEACON ST,42.355052,-71.073907
13,Residential Burglary,2019,6,29,Saturday,18,SUNNYSIDE ST,,
18,Auto Theft,2019,6,29,Saturday,16,FRANCIS ST,42.336063,-71.107828
20,Larceny,2019,6,28,Friday,7,SWAN AVE,,
23,Aggravated Assault,2019,6,29,Saturday,19,WASHINGTON ST,42.355123,-71.06088


In [221]:
frames = [df_old_clean, df_new_clean]

In [222]:
df_clean = pd.concat(frames, ignore_index = True)

In [224]:
df_clean.tail()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
139587,Aggravated Assault,2015,11,20,Friday,11,BLUE HILL AVE,42.301897,-71.085549
139588,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
139589,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
139590,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
139591,Homicide,2015,7,9,Thursday,14,RIVER ST,42.255926,-71.123172


In [225]:
df_old_clean.shape

(63883, 9)

In [226]:
df_new_clean.shape

(75709, 9)

In [227]:
df_clean.shape

(139592, 9)

In [228]:
df_old_clean.shape[0] + df_new_clean.shape[0] == df_clean.shape[0]

True

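Now we need to merge the crime categories. The old source uses upper-case names and the new one title case, so we first convert everything to upper case.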

In [229]:
df_clean['crime'] = df_clean['crime'].apply(lambda x: x.upper())

In [230]:
df_clean['crime'].value_counts()

LARCENY                       33701
LARCENY FROM MOTOR VEHICLE    26342
OTHER LARCENY                 24443
AGGRAVATED ASSAULT            15918
RESIDENTIAL BURGLARY          13657
ROBBERY                       10658
AUTO THEFT                    10638
COMMERCIAL BURGLARY            3143
OTHER BURGLARY                  577
HOMICIDE                        408
ARSON                           107
Name: crime, dtype: int64

There is a disparity in what `LARCENY` means: the old source files most larcenies under `OTHER LARCENY`, while the new one has no such category. Hence we will simply merge all larcenies into `LARCENY`.

In [231]:
df_clean['crime'] = df_clean['crime'].replace({'LARCENY FROM MOTOR VEHICLE': 'LARCENY', 'OTHER LARCENY': 'LARCENY'})

In [232]:
df_clean['crime'].value_counts()

LARCENY                 84486
AGGRAVATED ASSAULT      15918
RESIDENTIAL BURGLARY    13657
ROBBERY                 10658
AUTO THEFT              10638
COMMERCIAL BURGLARY      3143
OTHER BURGLARY            577
HOMICIDE                  408
ARSON                     107
Name: crime, dtype: int64

Now we can store the file.

In [233]:
df_clean.to_csv('final.csv')

<a id = '2'></a>
[Return to top](#top)
# 2. Regressions

<a id = '2.1'></a>
[Return to top](#top)
## 2.1 Preparation

In [235]:
df_clean.isna().sum()

crime        0
year         0
month        0
day          0
dayw         0
hour         0
street    1525
lat       5712
long      5712
dtype: int64

Some rows lack a street or coordinates, so we drop every row that has a missing value.

In [236]:
df_final = df_clean.dropna()

In [237]:
df_final.shape

(133656, 9)

In [238]:
df_final.isna().sum()

crime     0
year      0
month     0
day       0
dayw      0
hour      0
street    0
lat       0
long      0
dtype: int64

In [254]:
df_reg1_alt = df_final.groupby(['crime', 'year', 'month', 'dayw', 'hour']).size().unstack(fill_value=0)

In [255]:
df_reg1_alt.head(30)

crime,year,month,dayw,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
AGGRAVATED ASSAULT,2012,7,Friday,1,4,1,2,0,0,0,1,1,0,...,1,0,1,1,1,0,3,0,0,3
AGGRAVATED ASSAULT,2012,7,Monday,3,1,1,2,0,0,0,0,1,0,...,0,2,1,2,1,3,3,1,1,3
AGGRAVATED ASSAULT,2012,7,Saturday,1,3,2,2,1,0,0,0,0,1,...,1,1,1,0,2,2,2,0,1,0
AGGRAVATED ASSAULT,2012,7,Sunday,3,1,2,1,3,1,2,0,1,0,...,0,0,1,0,1,1,2,2,3,3
AGGRAVATED ASSAULT,2012,7,Thursday,0,0,1,0,0,0,0,0,0,1,...,1,0,1,2,0,0,1,1,1,0
AGGRAVATED ASSAULT,2012,7,Tuesday,3,2,1,0,0,0,1,0,0,3,...,5,1,1,2,2,4,0,2,0,1
AGGRAVATED ASSAULT,2012,7,Wednesday,1,2,1,1,0,0,0,0,0,2,...,0,3,1,1,2,0,1,0,0,0
AGGRAVATED ASSAULT,2012,8,Friday,3,2,3,0,0,1,1,0,0,0,...,1,1,2,1,2,1,3,2,5,3
AGGRAVATED ASSAULT,2012,8,Monday,4,1,3,1,3,0,0,1,0,0,...,0,0,1,1,1,0,3,0,0,2
AGGRAVATED ASSAULT,2012,8,Saturday,3,1,6,4,0,0,1,1,0,1,...,1,2,3,0,2,0,1,2,4,2
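The wide table above comes from counting rows per group and then moving the hour into the columns. Here is a minimal sketch of that `groupby` / `size` / `unstack` pattern on made-up incidents. Only the column names mirror the notebook; `toy` and `wide` are hypothetical names.

```python
import pandas as pd

# Made-up incidents; only the column names mirror the notebook.
toy = pd.DataFrame({
    'crime': ['ROBBERY', 'ROBBERY', 'ROBBERY', 'LARCENY'],
    'dayw':  ['Friday', 'Friday', 'Monday', 'Friday'],
    'hour':  [0, 0, 1, 1],
})

# size() counts the rows in each (crime, dayw, hour) group;
# unstack() moves the last index level (hour) into the columns,
# and fill_value=0 fills in hours with no incident for a given row.
wide = toy.groupby(['crime', 'dayw', 'hour']).size().unstack(fill_value=0)

print(wide)
```

Note that `fill_value=0` only fills missing hours for index combinations that occur at least once. A (crime, dayw) pair with no incidents at all gets no row.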


In [256]:
df_reg2 = df_final.groupby(['crime', 'year', 'month']).size().reset_index(name = 'counts')

In [257]:
df_reg2.head(40)

Unnamed: 0,crime,year,month,counts
0,AGGRAVATED ASSAULT,2012,7,162
1,AGGRAVATED ASSAULT,2012,8,205
2,AGGRAVATED ASSAULT,2012,9,206
3,AGGRAVATED ASSAULT,2012,10,140
4,AGGRAVATED ASSAULT,2012,11,134
5,AGGRAVATED ASSAULT,2012,12,149
6,AGGRAVATED ASSAULT,2013,1,134
7,AGGRAVATED ASSAULT,2013,2,128
8,AGGRAVATED ASSAULT,2013,3,158
9,AGGRAVATED ASSAULT,2013,4,166


In [258]:
df_reg2.dtypes

crime     object
year       int64
month      int64
counts     int64
dtype: object

In [260]:
df_reg2['year'] = df_reg2.year.astype('category')
df_reg2['month'] = df_reg2.month.astype('category')
df_reg2['counts'] = df_reg2.counts.astype(float)

In [261]:
df_reg2.dtypes

crime       object
year      category
month     category
counts     float64
dtype: object

In [262]:
df_reg2.shape

(677, 4)

In [264]:
df_dummies = pd.get_dummies(df_reg2, columns = ['crime', 'year', 'month'])

In [265]:
df_dummies.head()

Unnamed: 0,counts,crime_AGGRAVATED ASSAULT,crime_ARSON,crime_AUTO THEFT,crime_COMMERCIAL BURGLARY,crime_HOMICIDE,crime_LARCENY,crime_OTHER BURGLARY,crime_RESIDENTIAL BURGLARY,crime_ROBBERY,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,162.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,205.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,206.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,140.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,134.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
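For readers unfamiliar with one-hot encoding, here is a minimal sketch of what `pd.get_dummies` does when the columns to encode are listed explicitly. The frame `toy` is made up, and `dtype=int` is an assumption chosen here so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Made-up monthly counts; only the column names mirror the notebook.
toy = pd.DataFrame({
    'crime': ['ROBBERY', 'LARCENY', 'ROBBERY'],
    'month': [7, 8, 7],
    'counts': [10.0, 50.0, 12.0],
})

# Each listed column is replaced by one indicator column per distinct value;
# columns that are not listed (counts) are kept unchanged and come first.
encoded = pd.get_dummies(toy, columns=['crime', 'month'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Because `counts` stays in the first position, slicing with `iloc[:, 1:]` afterwards selects exactly the indicator columns.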


In [266]:
X = df_dummies.iloc[:,1:]
y = df_dummies['counts']

In [269]:
y.shape

(677,)

In [283]:
y.head()

0    162.0
1    205.0
2    206.0
3    140.0
4    134.0
Name: counts, dtype: float64

In [278]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12)

In [279]:
X_train.shape

(453, 29)

In [301]:
regressor_list = []
ev_train = []
ev_test = []
r2_train = []
r2_test = []
mse_train = []
mse_test = []
mae_train = []
mae_test = []
mdae_train = []
mdae_test = []

In [297]:
def regression(regressor, x_train, x_test, y_train):
    """Fit the regressor on the training data and return its train and test predictions."""
    regressor.fit(x_train, y_train)

    y_train_reg = regressor.predict(x_train)
    y_test_reg = regressor.predict(x_test)

    return y_train_reg, y_test_reg

In [298]:
def scores(regressor, y_train, y_test, y_train_reg, y_test_reg):
    """Compute the train/test metrics, append them to the global lists and print them."""
    regressor_list.append(str(regressor))
    
    ev_train_c = explained_variance_score(y_train, y_train_reg)
    ev_train.append(ev_train_c)
    ev_test_c = explained_variance_score(y_test, y_test_reg)
    ev_test.append(ev_test_c)
    
    r2_train_c = r2_score(y_train, y_train_reg)
    r2_train.append(r2_train_c)
    r2_test_c = r2_score(y_test, y_test_reg)
    r2_test.append(r2_test_c)
    
    mse_train_c = mean_squared_error(y_train, y_train_reg)
    mse_train.append(mse_train_c)
    mse_test_c = mean_squared_error(y_test, y_test_reg)
    mse_test.append(mse_test_c)

    mae_train_c = mean_absolute_error(y_train, y_train_reg)
    mae_train.append(mae_train_c)
    mae_test_c = mean_absolute_error(y_test, y_test_reg)
    mae_test.append(mae_test_c)  
    
    mdae_train_c = median_absolute_error(y_train, y_train_reg)
    mdae_train.append(mdae_train_c)
    mdae_test_c = median_absolute_error(y_test, y_test_reg)
    mdae_test.append(mdae_test_c)
    
    print("______________________________________________________________________________")
    print(str(regressor))
    print("______________________________________________________________________________")
    print("EV score. Train: ", ev_train_c)
    print("EV score. Test: ", ev_test_c)
    print("---------")
    print("R2 score. Train: ", r2_train_c)
    print("R2 score. Test: ", r2_test_c)
    print("---------")
    print("MSE score. Train: ", mse_train_c)
    print("MSE score. Test: ", mse_test_c)
    print("---------")
    print("MAE score. Train: ", mae_train_c)
    print("MAE score. Test: ", mae_test_c)
    print("---------")
    print("MdAE score. Train: ", mdae_train_c)
    print("MdAE score. Test: ", mdae_test_c)
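The module-level lists filled by `scores` make it easy to compare models afterwards. Below is a minimal sketch of one way to tabulate them. The `_demo` lists and their values are placeholders invented for illustration, not results from this notebook.

```python
import pandas as pd

# Placeholder values standing in for the lists that scores() fills; not real results.
regressor_list_demo = ['ModelA()', 'ModelB()']
r2_test_demo = [0.90, 0.95]
mae_test_demo = [40.0, 25.0]

# One row per regressor, sorted so the lowest test MAE comes first.
summary = pd.DataFrame({
    'regressor': regressor_list_demo,
    'r2_test': r2_test_demo,
    'mae_test': mae_test_demo,
}).sort_values('mae_test').reset_index(drop=True)

print(summary)
```

The same pattern works for the remaining metric lists, as long as every list has one entry per call to `scores`.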

In [325]:
df_dummies.corr()

Unnamed: 0,counts,crime_AGGRAVATED ASSAULT,crime_ARSON,crime_AUTO THEFT,crime_COMMERCIAL BURGLARY,crime_HOMICIDE,crime_LARCENY,crime_OTHER BURGLARY,crime_RESIDENTIAL BURGLARY,crime_ROBBERY,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
counts,1.0,-0.024914,-0.159253,-0.095976,-0.197394,-0.236739,0.944793,-0.169531,-0.044923,-0.0977,...,-0.033161,-0.019335,-0.002567,0.007619,0.051694,0.035514,0.015475,0.013705,-0.001495,0.003223
crime_AGGRAVATED ASSAULT,-0.024914,1.0,-0.094313,-0.141653,-0.141653,-0.141653,-0.141653,-0.105131,-0.141653,-0.141653,...,0.000841,0.002883,0.000841,-0.005092,0.000841,0.000841,-0.001168,-0.001168,-0.001168,0.000841
crime_ARSON,-0.159253,-0.094313,1.0,-0.094313,-0.094313,-0.094313,-0.094313,-0.069997,-0.094313,-0.094313,...,-0.007021,-0.028655,-0.007021,0.033628,-0.007021,-0.007021,0.014263,0.014263,0.014263,-0.007021
crime_AUTO THEFT,-0.095976,-0.141653,-0.094313,1.0,-0.141653,-0.141653,-0.141653,-0.105131,-0.141653,-0.141653,...,0.000841,0.002883,0.000841,-0.005092,0.000841,0.000841,-0.001168,-0.001168,-0.001168,0.000841
crime_COMMERCIAL BURGLARY,-0.197394,-0.141653,-0.094313,-0.141653,1.0,-0.141653,-0.141653,-0.105131,-0.141653,-0.141653,...,0.000841,0.002883,0.000841,-0.005092,0.000841,0.000841,-0.001168,-0.001168,-0.001168,0.000841
crime_HOMICIDE,-0.236739,-0.141653,-0.094313,-0.141653,-0.141653,1.0,-0.141653,-0.105131,-0.141653,-0.141653,...,0.000841,0.002883,0.000841,-0.005092,0.000841,0.000841,-0.001168,-0.001168,-0.001168,0.000841
crime_LARCENY,0.944793,-0.141653,-0.094313,-0.141653,-0.141653,-0.141653,1.0,-0.105131,-0.141653,-0.141653,...,0.000841,0.002883,0.000841,-0.005092,0.000841,0.000841,-0.001168,-0.001168,-0.001168,0.000841
crime_OTHER BURGLARY,-0.169531,-0.105131,-0.069997,-0.105131,-0.105131,-0.105131,-0.105131,1.0,-0.105131,-0.105131,...,-0.0011,0.000401,-0.0011,0.014748,-0.0011,-0.0011,-0.002578,-0.002578,-0.002578,-0.0011
crime_RESIDENTIAL BURGLARY,-0.044923,-0.141653,-0.094313,-0.141653,-0.141653,-0.141653,-0.141653,-0.105131,1.0,-0.141653,...,0.000841,0.002883,0.000841,-0.005092,0.000841,0.000841,-0.001168,-0.001168,-0.001168,0.000841
crime_ROBBERY,-0.0977,-0.141653,-0.094313,-0.141653,-0.141653,-0.141653,-0.141653,-0.105131,-0.141653,1.0,...,0.000841,0.002883,0.000841,-0.005092,0.000841,0.000841,-0.001168,-0.001168,-0.001168,0.000841


<a id = '2.2'></a>
[Return to top](#top)
## 2.2 Linear Regressor

Let's first try linear regression.

In [302]:
lreg = LinearRegression()
lreg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(lreg, X_train, X_test, y_train)
scores(lreg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
______________________________________________________________________________
EV score. Train:  0.9302959771837847
EV score. Test:  0.967042657923691
---------
R2 score. Train:  0.93028392413455
R2 score. Test:  0.9667115200018876
---------
MSE score. Train:  5631.896247240618
MSE score. Test:  3932.46875
---------
MAE score. Train:  37.48565121412803
MAE score. Test:  42.049107142857146
---------
MdAE score. Train:  23.0
MdAE score. Test:  27.0


In [307]:
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(sgd_reg, X_train, X_test, y_train)
scores(sgd_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='invscaling', loss='squared_loss', max_iter=None,
       n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
       verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.5555254734227664
EV score. Test:  0.5576120182265658
---------
R2 score. Train:  0.5554936143894806
R2 score. Test:  0.541462827795476
---------
MSE score. Train:  35908.702747783565
MSE score. Test:  54168.38199010318
---------
MAE score. Train:  109.23507795523254
MAE score. Test:  140.67546268402785
---------
MdAE score. Train:  71.36729308297123
MdAE score. Test:  79.14053216733369




In [294]:
X.head()

Unnamed: 0,crime_AGGRAVATED ASSAULT,crime_ARSON,crime_AUTO THEFT,crime_COMMERCIAL BURGLARY,crime_HOMICIDE,crime_LARCENY,crime_OTHER BURGLARY,crime_RESIDENTIAL BURGLARY,crime_ROBBERY,year_2012,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0


In [295]:
y.head()

0    162.0
1    205.0
2    206.0
3    140.0
4    134.0
Name: counts, dtype: float64

<a id = '2.3'></a>
[Return to top](#top)
## 2.3 BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor

In [303]:
ba_reg = BaggingRegressor()
ba_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(ba_reg, X_train, X_test, y_train)
scores(ba_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
BaggingRegressor(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=None, oob_score=False, random_state=None,
         verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9912881520085127
EV score. Test:  0.983460642537918
---------
R2 score. Train:  0.9912612183824068
R2 score. Test:  0.9833076464884634
---------
MSE score. Train:  705.9478145695364
MSE score. Test:  1971.918169642857
---------
MAE score. Train:  9.71523178807947
MAE score. Test:  24.928124999999998
---------
MdAE score. Train:  3.1999999999999993
MdAE score. Test:  9.699999999999996


In [304]:
ada_reg = AdaBoostRegressor()
ada_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(ada_reg, X_train, X_test, y_train)
scores(ada_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None)
______________________________________________________________________________
EV score. Train:  0.9494708579569429
EV score. Test:  0.9503523103085866
---------
R2 score. Train:  0.9487582978345273
R2 score. Test:  0.9502465237887683
---------
MSE score. Train:  4139.474956750488
MSE score. Test:  5877.52851483853
---------
MAE score. Train:  51.61402001795291
MAE score. Test:  58.39801753028452
---------
MdAE score. Train:  50.795698924731184
MdAE score. Test:  52.939419087136926


In [305]:
et_reg = ExtraTreesRegressor()
et_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(et_reg, X_train, X_test, y_train)
scores(et_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
          oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  1.0
EV score. Test:  0.9637986153568655
---------
R2 score. Train:  1.0
R2 score. Test:  0.9630562470808416
---------
MSE score. Train:  0.0
MSE score. Test:  4364.277187499999
---------
MAE score. Train:  0.0
MAE score. Test:  35.685267857142854
---------
MdAE score. Train:  0.0
MdAE score. Test:  12.5




<a id = '2.4'></a>
[Return to top](#top)
## 2.4 GradientBoostingRegressor, RandomForestRegressor

In [306]:
gb_reg = GradientBoostingRegressor()
gb_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(gb_reg, X_train, X_test, y_train)
scores(gb_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.987283918069472
EV score. Test:  0.9772170739399785
---------
R2 score. Train:  0.987283918069472
R2 score. Test:  0.9772147335977058
---------
MSE score. Train:  1027.2473488375983
MSE score. Test:  2691.692384047804
---------
MAE score. Train:  19.343972992963135
MAE score. Test:  30.7306989

In [308]:
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(rf_reg, X_train, X_test, y_train)
scores(rf_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9915092111350801
EV score. Test:  0.9807125564593433
---------
R2 score. Train:  0.9915064717771375
R2 score. Test:  0.980559977839495
---------
MSE score. Train:  686.1354304635761
MSE score. Test:  2296.5085714285715
---------
MAE score. Train:  10.020529801324502
MAE score. Test:  27.383928571428573
---------
MdAE score. Train:  3.1999999999999886
MdAE score. Test:  11.200000000000017




In [323]:
import lightgbm as lgb
gbm = lgb.LGBMRegressor(num_leaves=31,
                        learning_rate=0.01,
                        n_estimators=1000)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=5)

#print('Starting predicting...')
# predict
y_train_reg = gbm.predict(X_train, num_iteration=gbm.best_iteration_)
y_test_reg = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
scores(gbm, y_train, y_test, y_train_reg, y_test_reg)

[1]	valid_0's l2: 119653	valid_0's l1: 212.521
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's l2: 117436	valid_0's l1: 210.556
[3]	valid_0's l2: 115263	valid_0's l1: 208.612
[4]	valid_0's l2: 113132	valid_0's l1: 206.687
[5]	valid_0's l2: 111044	valid_0's l1: 204.781
[6]	valid_0's l2: 108996	valid_0's l1: 202.895
[7]	valid_0's l2: 106988	valid_0's l1: 201.028
[8]	valid_0's l2: 105019	valid_0's l1: 199.18
[9]	valid_0's l2: 103089	valid_0's l1: 197.351
[10]	valid_0's l2: 101197	valid_0's l1: 195.54
[11]	valid_0's l2: 99342	valid_0's l1: 193.748
[12]	valid_0's l2: 97523.2	valid_0's l1: 191.973
[13]	valid_0's l2: 95739.9	valid_0's l1: 190.216
[14]	valid_0's l2: 93991.6	valid_0's l1: 188.48
[15]	valid_0's l2: 92277.4	valid_0's l1: 186.763
[16]	valid_0's l2: 90596.7	valid_0's l1: 185.062
[17]	valid_0's l2: 88948.9	valid_0's l1: 183.379
[18]	valid_0's l2: 87333.4	valid_0's l1: 181.712
[19]	valid_0's l2: 85749.4	valid_0's l1: 180.062
[20]	valid_0's l2: 84196.3	valid

[234]	valid_0's l2: 6461.57	valid_0's l1: 46.5485
[235]	valid_0's l2: 6434.38	valid_0's l1: 46.4407
[236]	valid_0's l2: 6407.71	valid_0's l1: 46.3278
[237]	valid_0's l2: 6381.18	valid_0's l1: 46.2195
[238]	valid_0's l2: 6355.41	valid_0's l1: 46.1085
[239]	valid_0's l2: 6329.77	valid_0's l1: 46.0022
[240]	valid_0's l2: 6304.84	valid_0's l1: 45.8926
[241]	valid_0's l2: 6280.34	valid_0's l1: 45.784
[242]	valid_0's l2: 6255.71	valid_0's l1: 45.6698
[243]	valid_0's l2: 6232	valid_0's l1: 45.5693
[244]	valid_0's l2: 6208.45	valid_0's l1: 45.4677
[245]	valid_0's l2: 6185.57	valid_0's l1: 45.363
[246]	valid_0's l2: 6163.06	valid_0's l1: 45.2653
[247]	valid_0's l2: 6140.68	valid_0's l1: 45.1664
[248]	valid_0's l2: 6118.93	valid_0's l1: 45.0644
[249]	valid_0's l2: 6097.56	valid_0's l1: 44.9634
[250]	valid_0's l2: 6076.54	valid_0's l1: 44.8691
[251]	valid_0's l2: 6055.61	valid_0's l1: 44.7738
[252]	valid_0's l2: 6035.28	valid_0's l1: 44.6753
[253]	valid_0's l2: 6015.3	valid_0's l1: 44.5779
[254]	

<a id = '2.5'></a>
[Return to top](#top)
## 2.5 KNeighborsRegressor, RadiusNeighborsRegressor

In [309]:
kn_reg = KNeighborsRegressor()
kn_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(kn_reg, X_train, X_test, y_train)
scores(kn_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform')
______________________________________________________________________________
EV score. Train:  0.8035257727364177
EV score. Test:  0.8163236423734623
---------
R2 score. Train:  0.803366521906197
R2 score. Test:  0.8142576922939433
---------
MSE score. Train:  15884.705695364235
MSE score. Test:  21942.30017857143
---------
MAE score. Train:  64.10684326710816
MAE score. Test:  86.81339285714286
---------
MdAE score. Train:  20.599999999999994
MdAE score. Test:  27.299999999999997


In [310]:
rn_reg = RadiusNeighborsRegressor()
rn_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(rn_reg, X_train, X_test, y_train)
scores(rn_reg, y_train, y_test, y_train_reg, y_test_reg)



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
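A likely cause of this error (an interpretation, not something the traceback states) is the default `radius=1.0`. With one-hot features, two rows that differ in a single category are √2 apart, so a test row whose exact combination is absent from the training set has no neighbors. Its prediction is then NaN, which the metric functions reject. A minimal sketch on made-up data, where the `_demo` names are hypothetical:

```python
import warnings
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

# Two one-hot training rows and one unseen one-hot test row (made-up data).
X_train_demo = np.array([[1, 0, 0], [0, 1, 0]])
y_train_demo = np.array([10.0, 20.0])
X_test_demo = np.array([[0, 0, 1]])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the "no neighbors" warning
    # The default radius=1.0 is smaller than sqrt(2), so no neighbor is found.
    pred_default = RadiusNeighborsRegressor().fit(X_train_demo, y_train_demo).predict(X_test_demo)
    # A radius above sqrt(2) reaches both training rows and averages their targets.
    pred_wide = RadiusNeighborsRegressor(radius=1.5).fit(X_train_demo, y_train_demo).predict(X_test_demo)

print(pred_default, pred_wide)
```

Whether a larger radius gives useful predictions on the real data is a separate question. The sketch only shows where the NaN comes from.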

<a id = '2.6'></a>
[Return to top](#top)
## 2.6 DecisionTreeRegressor

In [311]:
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(dt_reg, X_train, X_test, y_train)
scores(dt_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
______________________________________________________________________________
EV score. Train:  1.0
EV score. Test:  0.9585666838807293
---------
R2 score. Train:  1.0
R2 score. Test:  0.9575884434383146
---------
MSE score. Train:  0.0
MSE score. Test:  5010.205357142857
---------
MAE score. Train:  0.0
MAE score. Test:  37.50892857142857
---------
MdAE score. Train:  0.0
MdAE score. Test:  12.0
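The perfect train scores (R2 of 1.0, MSE of 0.0) show that the unconstrained tree memorizes the training set. A minimal sketch of how the tree could be constrained with `GridSearchCV` follows. It uses synthetic data, and the grid values are arbitrary choices for illustration, not settings tested on the crime data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (stand-in for the notebook's X_train / y_train).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The unconstrained tree fits the training data (almost) perfectly.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
print("Full tree.  Train R2:", full_tree.score(X_tr, y_tr),
      " Test R2:", full_tree.score(X_te, y_te))

# Search over depth and leaf size with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)
print("Best parameters:", search.best_params_)
print("Tuned tree. Train R2:", search.score(X_tr, y_tr),
      " Test R2:", search.score(X_te, y_te))
```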


<a id = '2.7'></a>
[Return to top](#top)
## 2.7 Ridge, RidgeCV, BayesianRidge

In [312]:
rid_reg = Ridge()
rid_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(rid_reg, X_train, X_test, y_train)
scores(rid_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
______________________________________________________________________________
EV score. Train:  0.9300704800472378
EV score. Test:  0.9666377670362386
---------
R2 score. Train:  0.9300704800472377
R2 score. Test:  0.9660487834272384
---------
MSE score. Train:  5649.138969803619
MSE score. Test:  4010.759824553049
---------
MAE score. Train:  37.11906102475521
MAE score. Test:  41.718117354680395
---------
MdAE score. Train:  23.346204632868762
MdAE score. Test:  26.575464976093755


In [313]:
ric_reg = RidgeCV()
ric_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(ric_reg, X_train, X_test, y_train)
scores(ric_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
______________________________________________________________________________
EV score. Train:  0.9304979106611214
EV score. Test:  0.9673426287863283
---------
R2 score. Train:  0.9304979106611214
R2 score. Test:  0.9668946221261571
---------
MSE score. Train:  5614.609704631936
MSE score. Test:  3910.838342670234
---------
MAE score. Train:  37.46057984687948
MAE score. Test:  41.93998317406733
---------
MdAE score. Train:  23.85737570893454
MdAE score. Test:  26.634577337366565
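RidgeCV only tries the default alphas (0.1, 1 and 10) above, and the output does not show which one it picked. The sketch below, on synthetic data from `make_regression`, shows how a wider grid could be passed and how the selected value can be read from the fitted `alpha_` attribute. The grid itself is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (stand-in for X_train / y_train).
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# 13 candidate alphas, evenly spaced on a log scale from 0.001 to 1000.
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)

# alpha_ holds the regularization strength chosen by cross-validation.
print("Selected alpha:", ridge_cv.alpha_)
```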


In [314]:
br_reg = BayesianRidge()
br_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(br_reg, X_train, X_test, y_train)
scores(br_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.9304823633572259
EV score. Test:  0.9672960076815633
---------
R2 score. Train:  0.9304823633572259
R2 score. Test:  0.966831266698133
---------
MSE score. Train:  5615.865667498133
MSE score. Test:  3918.3227108619203
---------
MAE score. Train:  37.41150545469795
MAE score. Test:  41.88726963352501
---------
MdAE score. Train:  24.110828286664997
MdAE score. Test:  27.91167153340777


<a id = '2.8'></a>
[Return to top](#top)
## 2.8 HuberRegressor, TheilSenRegressor, RANSACRegressor

In [315]:
hu_reg = HuberRegressor()
hu_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(hu_reg, X_train, X_test, y_train)
scores(hu_reg, y_train, y_test, y_train_reg, y_test_reg)

ValueError: HuberRegressor convergence failed: l-BFGS-b solver terminated with ABNORMAL_TERMINATION_IN_LNSRCH
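One common cause of this L-BFGS failure is features on very different scales, which makes the line search unstable. Assuming that is what happens here (it has not been verified on the crime data), a possible workaround is to standardize the features in a pipeline and allow more iterations. The sketch below uses synthetic data, and `huber_pipe` is a made-up name.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features whose scales differ by several orders of magnitude.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.0003]) + rng.normal(scale=0.5, size=200)

# Standardize first, then fit the robust regressor with a higher iteration cap.
huber_pipe = make_pipeline(StandardScaler(), HuberRegressor(max_iter=1000))
huber_pipe.fit(X_demo, y_demo)

r2_huber = huber_pipe.score(X_demo, y_demo)
print("R2 on the synthetic data:", r2_huber)
```

With the notebook's helpers this would amount to passing the pipeline instead of the bare `HuberRegressor()` to `regression` and `scores`, provided those helpers accept any estimator with `fit` and `predict`.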

In [316]:
ts_reg = TheilSenRegressor()
ts_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(ts_reg, X_train, X_test, y_train)
scores(ts_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
TheilSenRegressor(copy_X=True, fit_intercept=True, max_iter=300,
         max_subpopulation=10000, n_jobs=None, n_subsamples=None,
         random_state=None, tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.9264663467695186
EV score. Test:  0.9643155122752458
---------
R2 score. Train:  0.9264197116755328
R2 score. Test:  0.9631859642607021
---------
MSE score. Train:  5944.06016892319
MSE score. Test:  4348.95330499865
---------
MAE score. Train:  34.92049296483508
MAE score. Test:  40.391997981775084
---------
MdAE score. Train:  20.735701080555856
MdAE score. Test:  24.06863783775823


In [317]:
ran_reg = RANSACRegressor()
ran_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(ran_reg, X_train, X_test, y_train)
scores(ran_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
RANSACRegressor(base_estimator=None, is_data_valid=None, is_model_valid=None,
        loss='absolute_loss', max_skips=inf, max_trials=100,
        min_samples=None, random_state=None, residual_threshold=None,
        stop_n_inliers=inf, stop_probability=0.99, stop_score=inf)
______________________________________________________________________________
EV score. Train:  0.9004313076134113
EV score. Test:  0.9485699720222694
---------
R2 score. Train:  0.8983172087254725
R2 score. Test:  0.9465886235829546
---------
MSE score. Train:  8214.273730684326
MSE score. Test:  6309.647321428572
---------
MAE score. Train:  39.094922737306845
MAE score. Test:  42.424107142857146
---------
MdAE score. Train:  15.0
MdAE score. Test:  17.5


<a id = '2.9'></a>
[Return to top](#top)
## 2.9 MLPRegressor

In [320]:
mlp_reg = MLPRegressor(max_iter=1000)
mlp_reg.fit(X_train, y_train)
y_train_reg, y_test_reg = regression(mlp_reg, X_train, X_test, y_train)
scores(mlp_reg, y_train, y_test, y_train_reg, y_test_reg)

______________________________________________________________________________
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=1000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9424551748890773
EV score. Test:  0.9744970845635339
---------
R2 score. Train:  0.9424489741386339
R2 score. Test:  0.9741183156557403
---------
MSE score. Train:  4649.163088281384
MSE score. Test:  3057.481593840793
---------
MAE score. Train:  31.8338392974262
MAE score. Test:  36.020691119117984
---------
MdAE score. Train:  19.55716298864514
MdAE score. Test:

