<a id = 'top'></a>

# Crime in Boston, Revisited (Version 1.0)

**Ying Zhou**

**Table of contents**

[1. Data Wrangling](#1)

[1.1 Exploration](#1.1)

[1.2 Removing irrelevant columns](#1.2)

[1.3 Process location data](#1.3)

[1.4 Process time](#1.4)

[1.5 Remove non-crimes](#1.5)

[1.6 Combine the two dataframes](#1.6)

[2. Regressions](#2)

[2.1 Preparation](#2.1)

[2.2 Select and split](#2.2A)

[2.3 Linear Regressor](#2.2)

[2.4 BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor](#2.4)

[2.5 GradientBoostingRegressor, RandomForestRegressor, LGBMRegressor](#2.4)

[2.6 KNeighborsRegressor, RadiusNeighborsRegressor](#2.5)

[2.7 DecisionTreeRegressor](#2.6)

[2.8 Ridge, RidgeCV, BayesianRidge](#2.7)

[2.9 HuberRegressor, TheilSenRegressor, RANSACRegressor](#2.8)

[2.10 MLPRegressor](#2.9)

[2.11 SVR](#2.10)

[3. Tuning hyperparameters](#3)

Now let's return to the problem of crime in Boston. This time we will predict the number of crimes, validate the models, and finally use all of the data to forecast future crime in Boston. We will skip the preliminary analysis, because the last 3-4 years were already explored in detail in the previous project.

Again let's first import the usual packages.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import datetime
from bs4 import BeautifulSoup

Since we need to do some machine learning, let's import the regression-related parts of sklearn too. This local computer cannot handle deep learning, which is why we won't import Keras. If necessary, we will run some regressions on Google Colab.

In [2]:
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler

from sklearn.metrics import mean_squared_error, median_absolute_error, mean_absolute_error
from sklearn.metrics import r2_score, explained_variance_score
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor

from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.linear_model import Ridge, RidgeCV, BayesianRidge
from sklearn.linear_model import HuberRegressor, TheilSenRegressor, RANSACRegressor

from sklearn.svm import SVR

from sklearn.neural_network import MLPRegressor

import lightgbm as lgb


Since we will be drawing graphs, we define our `multiliner` function here. It staggers tick labels across several lines, which leaves more room when the labels are really long.

In [3]:
def multiliner(string_list, n):
    length = len(string_list)
    for i in range(length):
        rem = i % n
        string_list[i] = '\n' * rem + string_list[i]
    return string_list
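As a quick illustration of what `multiliner` does (the labels below are made up for the example), each label is pushed down by `i % n` line breaks, so neighbouring tick labels end up on different rows. This is only a sketch; the function body is repeated so the snippet runs on its own.

```python
def multiliner(string_list, n):
    # Same logic as the function above, repeated so the snippet is self-contained.
    for i in range(len(string_list)):
        string_list[i] = '\n' * (i % n) + string_list[i]
    return string_list

# Hypothetical tick labels, used only for illustration.
labels = ['Larceny', 'Aggravated Assault', 'Residential Burglary', 'Robbery']

# Pass a copy: the function modifies its argument in place.
staggered = multiliner(list(labels), 2)
print(staggered)
```

Note that the function mutates the list it receives, so passing a copy keeps the original labels intact.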

Time to get the data!

In [4]:
# The new crime dataset is updated regularly, so its download link may not be stable.
# We therefore use the preview page to obtain the current link.
new_preview_url = 'https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b'
new_req = requests.get(new_preview_url)
new_req.raise_for_status()  # raises on an HTTP error, so no separate status check is needed
new_soup = BeautifulSoup(new_req.text, 'html.parser')
new_url = new_soup.find_all('a',{'class':'btn btn-primary resource-url-analytics resource-type-None'})[0]['href']
old_url = 'https://data.boston.gov/dataset/eefad66a-e805-4b35-b170-d26e2028c373/resource/ba5ed0e2-e901-438c-b2e0-4acfc3c452b9/download/crime-incident-reports-july-2012-august-2015-source-legacy-system.csv'

In [5]:
df_new = pd.read_csv(new_url)
df_old = pd.read_csv(old_url)


[Return to top](#top)
<a id = '1'></a>
# 1. Data Wrangling

<a id = '1.1'></a>
[Return to top](#top)
## 1.1 Exploration

In [6]:
df_new.shape

(404051, 17)

In [7]:
df_old.head()

Unnamed: 0,COMPNOS,NatureCode,INCIDENT_TYPE_DESCRIPTION,MAIN_CRIMECODE,REPTDISTRICT,REPORTINGAREA,FROMDATE,WEAPONTYPE,Shooting,DOMESTIC,SHIFT,Year,Month,DAY_WEEK,UCRPART,X,Y,STREETNAME,XSTREETNAME,Location
0,120420285.0,BERPTA,RESIDENTIAL BURGLARY,05RB,D4,629,07/08/2012 06:00:00 AM,Other,No,No,Last,2012,7,Sunday,Part One,763273.1791,2951498.962,ABERDEEN ST,,"(42.34638135, -71.10379454)"
1,120419202.0,PSHOT,AGGRAVATED ASSAULT,04xx,B2,327,07/08/2012 06:03:00 AM,Firearm,Yes,No,Last,2012,7,Sunday,Part One,771223.1638,2940772.099,HOWARD AV,,"(42.31684135, -71.07458456)"
2,120419213.0,ARMROB,ROBBERY,03xx,D4,625,07/08/2012 06:26:00 AM,Firearm,No,No,Last,2012,7,Sunday,Part One,765118.8605,2950217.536,JERSEY ST,QUEENSBERRY ST,"(42.34284135, -71.09698955)"
3,120419223.0,ALARMC,COMMERCIAL BURGLARY,05CB,B2,258,07/08/2012 06:56:00 AM,Other,No,No,Last,2012,7,Sunday,Part One,773591.8648,2940638.174,COLUMBIA RD,,"(42.3164411, -71.06582908)"
4,120419236.0,ARMROB,ROBBERY,03xx,E18,496,07/08/2012 07:15:00 AM,Firearm,No,No,Last,2012,7,Sunday,Part One,759042.7315,2923832.681,COLLINS ST,,"(42.27051636, -71.11989955)"


In [8]:
df_new.head()

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
0,I192054879,3301,Verbal Disputes,VERBAL DISPUTE,B2,314,,2019-07-16 20:35:00,2019,7,Tuesday,20,Part Three,HOWLAND ST,42.314448,-71.089934,"(42.31444840, -71.08993418)"
1,I192054875,2403,Disorderly Conduct,DISTURBING THE PEACE,C11,361,,2019-07-16 21:10:00,2019,7,Tuesday,21,Part Two,MATHER ST,42.294686,-71.063125,"(42.29468596, -71.06312544)"
2,I192054874,3301,Verbal Disputes,VERBAL DISPUTE,B3,466,,2019-07-16 20:34:00,2019,7,Tuesday,20,Part Three,OUTLOOK RD,42.282455,-71.094488,"(42.28245514, -71.09448768)"
3,I192054873,2662,Ballistics,BALLISTICS EVIDENCE/FOUND,B2,325,,2019-07-16 21:18:00,2019,7,Tuesday,21,Part Two,QUINCY ST,42.313322,-71.075915,"(42.31332153, -71.07591511)"
4,I192054866,617,Larceny,LARCENY THEFT FROM BUILDING,C6,194,,2019-07-16 20:37:00,2019,7,Tuesday,20,Part One,DORCHESTER AVE,42.33014,-71.056958,"(42.33014012, -71.05695787)"


In [9]:
df_old.shape

(268056, 20)

In [10]:
df_new.dtypes

INCIDENT_NUMBER         object
OFFENSE_CODE             int64
OFFENSE_CODE_GROUP      object
OFFENSE_DESCRIPTION     object
DISTRICT                object
REPORTING_AREA          object
SHOOTING                object
OCCURRED_ON_DATE        object
YEAR                     int64
MONTH                    int64
DAY_OF_WEEK             object
HOUR                     int64
UCR_PART                object
STREET                  object
Lat                    float64
Long                   float64
Location                object
dtype: object

In [11]:
df_old.dtypes

COMPNOS                      float64
NatureCode                    object
INCIDENT_TYPE_DESCRIPTION     object
MAIN_CRIMECODE                object
REPTDISTRICT                  object
REPORTINGAREA                  int64
FROMDATE                      object
WEAPONTYPE                    object
Shooting                      object
DOMESTIC                      object
SHIFT                         object
Year                           int64
Month                          int64
DAY_WEEK                      object
UCRPART                       object
X                            float64
Y                            float64
STREETNAME                    object
XSTREETNAME                   object
Location                      object
dtype: object

We are very interested in knowing whether the `Lat` / `Long` / `Location` data contains de facto NaN values that aren't labelled as NaN.

In [12]:
df_new['Lat'].value_counts()

 42.348624    1611
 42.361839    1573
 42.284826    1394
 42.328663    1286
 42.256216    1200
 42.297555    1055
 42.331521     969
 42.341288     965
-1.000000      905
 42.335119     883
 42.326966     835
 42.352312     832
 42.309719     822
 42.339542     811
 42.332108     803
 42.326968     787
 42.355123     766
 42.334018     706
 42.342850     687
 42.298489     684
 42.310434     670
 42.334288     654
 42.349802     626
 42.350959     624
 42.333679     620
 42.366435     607
 42.370818     602
 42.356024     600
 42.349056     591
 42.348406     591
              ... 
 42.301570       1
 42.321381       1
 42.284653       1
 42.349891       1
 42.292743       1
 42.318082       1
 42.282142       1
 42.352737       1
 42.348822       1
 42.332089       1
 42.357649       1
 42.350426       1
 42.288607       1
 42.345610       1
 42.272102       1
 42.294009       1
 42.309141       1
 42.331083       1
 42.310795       1
 42.356965       1
 42.373353       1
 42.358054  

In [13]:
df_new['Long'].value_counts()

-71.082776    1611
-71.059765    1573
-71.091374    1394
-71.085634    1286
-71.124019    1200
-71.059709    1055
-71.070853     969
-71.054679     965
-1.000000      905
-71.074917     883
-71.061986     835
-71.063705     832
-71.104294     822
-71.069409     811
-71.070144     803
-71.080519     787
-71.060880     766
-71.076381     706
-71.065162     687
-71.063133     684
-71.061340     670
-71.072395     654
-71.078410     626
-71.074128     624
-71.091878     620
-71.061354     607
-71.039291     602
-71.061776     600
-71.086883     591
-71.150498     591
              ... 
-71.084590       1
-71.123187       1
-71.140785       1
-71.142722       1
-71.051884       1
-71.060373       1
-71.061490       1
-71.086027       1
-71.052596       1
-71.048927       1
-71.050892       1
-71.096301       1
-71.120902       1
-71.127557       1
-71.026605       1
-71.053756       1
-71.050320       1
-71.068767       1
-71.064799       1
-71.164690       1
-71.075020       1
-71.105416  

In [14]:
df_new['Location'].value_counts()

(0.00000000, 0.00000000)       26219
(42.34862382, -71.08277637)     1611
(42.36183857, -71.05976489)     1573
(42.28482577, -71.09137369)     1394
(42.32866284, -71.08563401)     1286
(42.25621592, -71.12401947)     1200
(42.29755533, -71.05970910)     1055
(42.33152148, -71.07085307)      969
(42.34128751, -71.05467933)      965
(-1.00000000, -1.00000000)       905
(42.33511904, -71.07491710)      883
(42.32696647, -71.06198607)      835
(42.35231190, -71.06370510)      832
(42.30971857, -71.10429432)      822
(42.33954199, -71.06940877)      811
(42.33210843, -71.07014395)      803
(42.32696802, -71.08051941)      787
(42.35512339, -71.06087980)      766
(42.33401829, -71.07638124)      706
(42.34285014, -71.06516235)      687
(42.29848866, -71.06313294)      684
(42.31043400, -71.06134010)      670
(42.33428841, -71.07239518)      654
(42.34980175, -71.07840978)      626
(42.35095909, -71.07412780)      624
(42.33367922, -71.09187755)      620
(42.36643546, -71.06135413)      607
(

Other than the (0, 0) and (-1, -1) entries, the values look mostly reasonable. So we will apply a filter and treat completely absurd outliers as NAs.

In [15]:
df_old['Location'].value_counts()

(0.0, 0.0)                               14981
(42.3286598, -71.08561842)                1506
(42.32543556, -71.06387302)               1008
(42.28486136, -71.09132455)                843
(42.34130529, -71.0547108)                 735
(42.31037135, -71.06123456)                714
(42.34865634, -71.08256955)                699
(42.29754136, -71.05973457)                695
(42.36164815, -71.05998657)                675
(42.33950635, -71.06938956)                635
(42.25642136, -71.12394954)                624
(42.35237134, -71.06490456)                597
(42.33325635, -71.07289955)                595
(42.35230134, -71.06367456)                580
(42.33372337, -71.09095643)                532
(42.28714136, -71.14857453)                463
(42.34898135, -71.15091453)                431
(42.32723569, -71.08059616)                426
(42.35505634, -71.06084456)                425
(42.30972244, -71.10427304)                416
(42.34710135, -71.07960455)                397
(42.35075635,

<a id = '1.2'></a>
[Return to top](#top)
## 1.2 Removing irrelevant columns

As usual, we will filter out what's irrelevant. For example, I haven't figured out what an RA number actually is. The `X` and `Y` columns in the old table are also irrelevant, so we will get rid of them.

In [16]:
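# .copy() makes an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
df_old_simplified = df_old[['INCIDENT_TYPE_DESCRIPTION', 'FROMDATE', 'Year' ,'Month', 'DAY_WEEK', 'UCRPART', 'STREETNAME', 'Location']].copy()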

In [17]:
df_old_simplified['INCIDENT_TYPE_DESCRIPTION'].value_counts()

VAL                                 27363
OTHER LARCENY                       24443
SIMPLE ASSAULT                      17697
MedAssist                           17128
MVAcc                               13832
VANDALISM                           13339
InvPer                              12937
LARCENY FROM MOTOR VEHICLE          12742
DRUG CHARGES                        12042
FRAUD                                8742
PropLost                             8522
TOWED                                7526
RESIDENTIAL BURGLARY                 6737
InvProp                              6592
AGGRAVATED ASSAULT                   5649
Service                              5353
ROBBERY                              4974
PersLoc                              4745
AUTO THEFT                           4620
PropFound                            4316
Argue                                2833
Arrest                               1959
OTHER                                1902
FIRE                              

So homogenizing the two datasets can be hard. However, it still has to be done.

In [18]:
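# .copy() makes an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
df_new_simplified = df_new[['OFFENSE_CODE_GROUP','OCCURRED_ON_DATE','YEAR','MONTH','DAY_OF_WEEK','HOUR','UCR_PART','STREET','Lat','Long']].copy()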

In [19]:
df_new_simplified['OFFENSE_CODE_GROUP'].value_counts()

Motor Vehicle Accident Response              47057
Larceny                                      32780
Medical Assistance                           30702
Investigate Person                           23681
Other                                        22730
Drug Violation                               20957
Simple Assault                               20174
Vandalism                                    19053
Verbal Disputes                              16805
Investigate Property                         14310
Towed                                        14177
Larceny From Motor Vehicle                   13217
Property Lost                                12734
Warrant Arrests                              10513
Aggravated Assault                           10044
Fraud                                         7756
Violations                                    7591
Missing Person Located                        6834
Residential Burglary                          6591
Auto Theft                     

We are definitely going to restrict our attention to major crimes.

In [20]:
df_old_simplified.dtypes

INCIDENT_TYPE_DESCRIPTION    object
FROMDATE                     object
Year                          int64
Month                         int64
DAY_WEEK                     object
UCRPART                      object
STREETNAME                   object
Location                     object
dtype: object

<a id = '1.3'></a>
[Return to top](#top)
## 1.3 Process location data

In [21]:
def get_lat_long(loc_string):
    loc_list = loc_string.lstrip('(').rstrip(')').split()
    return loc_list[0].strip(','), loc_list[1]

In [22]:
get_lat_long('(42.34638135, -71.10379454)')

('42.34638135', '-71.10379454')
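The same parsing can also be done for a whole column at once. The snippet below is a minimal sketch, assuming every entry has the `(lat, long)` shape shown above; the regular expression and the sample series are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two sample strings in the same format as the `Location` column.
locations = pd.Series(['(42.34638135, -71.10379454)', '(0.0, 0.0)'])

# Named groups become the column names of the resulting dataframe.
pattern = r'\(\s*(?P<Lat>-?[\d.]+)\s*,\s*(?P<Long>-?[\d.]+)\s*\)'
coords = locations.str.extract(pattern).astype(float)
print(coords)
```

Rows that do not match the pattern come back as NaN in both columns, and the values are already floats rather than strings.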

In [23]:
df_old_simplified['Lat'] = df_old_simplified['Location'].apply(lambda x: get_lat_long(x)[0])
df_old_simplified['Long'] = df_old_simplified['Location'].apply(lambda x: get_lat_long(x)[1])

In [24]:
df_old_simplified.tail()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Location,Lat,Long
268051,Motor Vehicle Accident Response,08/10/2015 02:38:00 AM,2015,8,Monday,Part Three,HARVARD ST,"(0.0, 0.0)",0.0,0.0
268052,Police Service Incidents,08/10/2015 04:46:00 AM,2015,8,Monday,Part Three,DORCHESTER AVE,"(0.0, 0.0)",0.0,0.0
268053,Motor Vehicle Accident Response,08/10/2015 04:48:00 AM,2015,8,Monday,Part Three,DECKARD ST,"(0.0, 0.0)",0.0,0.0
268054,Investigate Person,08/10/2015 05:01:00 AM,2015,8,Monday,Part Three,HAMMOND ST,"(0.0, 0.0)",0.0,0.0
268055,Motor Vehicle Accident Response,08/10/2015 05:20:00 AM,2015,8,Monday,Part Three,,"(0.0, 0.0)",0.0,0.0


In [25]:
del df_old_simplified['Location']

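Now we need to turn the placeholder coordinates into NAs. Any latitude outside [40, 45] or longitude outside [-75, -70] cannot be in Boston, so we replace it with NaN.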

In [26]:
def lat_na_er(num_string):
    try:
        num = float(num_string)
        if num < 40 or num > 45:
            return np.nan
        return num
    except ValueError as e:
        return np.nan
    

In [27]:
def long_na_er(num_string):
    try:
        num = float(num_string)
        if num < -75 or num > -70:
            return np.nan
        return num
    except ValueError as e:
        return np.nan
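For reference, the two range checks can be expressed as one vectorized helper. This is a sketch under the same bounds as above; `mask_out_of_range` and the sample values are hypothetical names introduced only for this example.

```python
import pandas as pd

def mask_out_of_range(values, low, high):
    """Return `values` as floats, with unparseable or out-of-range entries set to NaN."""
    nums = pd.to_numeric(values, errors='coerce')  # non-numeric strings become NaN
    return nums.where(nums.between(low, high))      # keep only values in [low, high]

# Hypothetical sample: one valid latitude, two placeholders, one garbage string.
sample = pd.Series(['42.3464', '0.0', '-1', 'abc'])
lat = mask_out_of_range(sample, 40, 45)
print(lat)
```

Calling it with `(-75, -70)` would give the longitude version, so a single function could replace the two near-identical ones.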
    

In [28]:
df_old_simplified['Lat'] = df_old_simplified['Lat'].apply(lat_na_er)
df_old_simplified['Long'] = df_old_simplified['Long'].apply(long_na_er)

In [29]:
df_old_simplified.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long
0,RESIDENTIAL BURGLARY,07/08/2012 06:00:00 AM,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,07/08/2012 06:03:00 AM,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585
2,ROBBERY,07/08/2012 06:26:00 AM,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,07/08/2012 06:56:00 AM,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,07/08/2012 07:15:00 AM,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199


In [30]:
df_old_simplified.describe()

Unnamed: 0,Year,Month,Lat,Long
count,268056.0,268056.0,253075.0,253075.0
mean,2013.538664,6.589134,42.323847,-71.08336
std,0.970562,3.323806,0.031772,0.030869
min,2012.0,1.0,42.232264,-71.178674
25%,2013.0,4.0,42.299386,-71.098625
50%,2014.0,7.0,42.32866,-71.078035
75%,2014.0,9.0,42.349236,-71.06228
max,2015.0,12.0,42.395105,-70.964365


Great. We need to do the same for the new dataframe.

In [31]:
df_new_simplified['Lat'] = df_new_simplified['Lat'].apply(lat_na_er)
df_new_simplified['Long'] = df_new_simplified['Long'].apply(long_na_er)

In [32]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long
0,Verbal Disputes,2019-07-16 20:35:00,2019,7,Tuesday,20,Part Three,HOWLAND ST,42.314448,-71.089934
1,Disorderly Conduct,2019-07-16 21:10:00,2019,7,Tuesday,21,Part Two,MATHER ST,42.294686,-71.063125
2,Verbal Disputes,2019-07-16 20:34:00,2019,7,Tuesday,20,Part Three,OUTLOOK RD,42.282455,-71.094488
3,Ballistics,2019-07-16 21:18:00,2019,7,Tuesday,21,Part Two,QUINCY ST,42.313322,-71.075915
4,Larceny,2019-07-16 20:37:00,2019,7,Tuesday,20,Part One,DORCHESTER AVE,42.33014,-71.056958


In [33]:
df_new_simplified.describe()

Unnamed: 0,YEAR,MONTH,HOUR,Lat,Long
count,404051.0,404051.0,404051.0,376927.0,376927.0
mean,2016.985801,6.565981,13.112436,42.32215,-71.082957
std,1.235361,3.344443,6.292317,0.0319,0.029678
min,2015.0,1.0,0.0,42.232413,-71.178674
25%,2016.0,4.0,9.0,42.297521,-71.097273
50%,2017.0,7.0,14.0,42.325586,-71.077665
75%,2018.0,9.0,18.0,42.348601,-71.062607
max,2019.0,12.0,23.0,42.395042,-70.963676


Now we need to process time.

<a id = '1.4'></a>
[Return to top](#top)
## 1.4 Process time

In [34]:
df_new_simplified['OCCURRED_ON_DATE'].isna().sum()

0

In [35]:
df_old_simplified['FROMDATE'].isna().sum()

0

At least there are no explicit NAs. Now let's check the timeline.

In [36]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long
0,Verbal Disputes,2019-07-16 20:35:00,2019,7,Tuesday,20,Part Three,HOWLAND ST,42.314448,-71.089934
1,Disorderly Conduct,2019-07-16 21:10:00,2019,7,Tuesday,21,Part Two,MATHER ST,42.294686,-71.063125
2,Verbal Disputes,2019-07-16 20:34:00,2019,7,Tuesday,20,Part Three,OUTLOOK RD,42.282455,-71.094488
3,Ballistics,2019-07-16 21:18:00,2019,7,Tuesday,21,Part Two,QUINCY ST,42.313322,-71.075915
4,Larceny,2019-07-16 20:37:00,2019,7,Tuesday,20,Part One,DORCHESTER AVE,42.33014,-71.056958


We need to round times to the nearest hour because police officers don't really document minutes and seconds carefully (to see why this is true, please check out the old Crime in Boston project).

In [37]:
df_new_simplified['day'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[8:10]))
df_new_simplified['min'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[-5:-3]))
df_new_simplified['sec'] = df_new_simplified['OCCURRED_ON_DATE'].apply(lambda x: int(x[-2:]))

In [38]:
df_new_simplified.head()

Unnamed: 0,OFFENSE_CODE_GROUP,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,day,min,sec
0,Verbal Disputes,2019-07-16 20:35:00,2019,7,Tuesday,20,Part Three,HOWLAND ST,42.314448,-71.089934,16,35,0
1,Disorderly Conduct,2019-07-16 21:10:00,2019,7,Tuesday,21,Part Two,MATHER ST,42.294686,-71.063125,16,10,0
2,Verbal Disputes,2019-07-16 20:34:00,2019,7,Tuesday,20,Part Three,OUTLOOK RD,42.282455,-71.094488,16,34,0
3,Ballistics,2019-07-16 21:18:00,2019,7,Tuesday,21,Part Two,QUINCY ST,42.313322,-71.075915,16,18,0
4,Larceny,2019-07-16 20:37:00,2019,7,Tuesday,20,Part One,DORCHESTER AVE,42.33014,-71.056958,16,37,0


In [39]:
del df_new_simplified['OCCURRED_ON_DATE']

In [40]:
def is_leap(year):
    if year % 4 != 0:
        return False
    elif year % 100 != 0:
        return True
    elif year % 400 != 0:
        return False
    else:
        return True

def num_of_days(month, year):
    non_leap = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    if month != 2:
        return non_leap[month - 1]
    else:
        if is_leap(year):
            return 29
        else:
            return 28
tie_break_round_up = False #Tie break round up status
NEXT = {'Monday': 'Tuesday', 'Tuesday': 'Wednesday', 'Wednesday': 'Thursday', 'Thursday':'Friday','Friday':'Saturday','Saturday':'Sunday','Sunday':'Monday'}
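The standard library offers the same calendar facts, which is handy for cross-checking the helpers above. The snippet is only a sketch; `num_of_days_std` is a hypothetical name chosen so that it does not clash with `num_of_days`.

```python
import calendar

def num_of_days_std(month, year):
    # calendar.monthrange returns (weekday of the 1st, number of days in the month).
    return calendar.monthrange(year, month)[1]

# Leap-year edge cases: divisible by 4, by 100, and by 400.
for year in (2016, 1900, 2000, 2019):
    print(year, calendar.isleap(year), num_of_days_std(2, year))
```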



In [41]:
for index, row in df_new_simplified.iterrows():
    round_up = False #Round up this time?
    if df_new_simplified.at[index, 'min'] == 30 and df_new_simplified.at[index, 'sec'] == 0: #Tie break
        if tie_break_round_up:
            round_up = True
        tie_break_round_up = not tie_break_round_up
    if df_new_simplified.at[index, 'min'] > 30 or (df_new_simplified.at[index, 'min'] == 30 and df_new_simplified.at[index, 'sec'] > 0):
        round_up = True
    if round_up:
        df_new_simplified.at[index, 'HOUR'] = df_new_simplified.at[index, 'HOUR'] + 1
        if df_new_simplified.at[index, 'HOUR'] == 24:
            df_new_simplified.at[index, 'HOUR'] = 0
            df_new_simplified.at[index, 'day'] = df_new_simplified.at[index, 'day'] + 1
            df_new_simplified.at[index, 'DAY_OF_WEEK'] = NEXT[df_new_simplified.at[index, 'DAY_OF_WEEK']]
            if df_new_simplified.at[index, 'day'] > num_of_days(df_new_simplified.at[index, 'MONTH'], df_new_simplified.at[index, 'YEAR']):
                df_new_simplified.at[index, 'day'] = 1
                df_new_simplified.at[index, 'MONTH'] = df_new_simplified.at[index, 'MONTH'] + 1
                if df_new_simplified.at[index,'MONTH'] == 13:
                    df_new_simplified.at[index,'MONTH'] = 1
                    df_new_simplified.at[index, 'YEAR'] = df_new_simplified.at[index, 'YEAR'] + 1
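The day, month and year rollover handled by hand in the loop above can also be delegated to `datetime` arithmetic. The function below is a minimal sketch, not the notebook's method: `round_to_hour` and its `tie_up` parameter are hypothetical, and the alternating tie-break would have to be supplied by the caller through `tie_up`.

```python
from datetime import datetime, timedelta

def round_to_hour(ts, tie_up=False):
    """Round a datetime to the nearest hour; `tie_up` decides the exact HH:30:00 case."""
    floor = ts.replace(minute=0, second=0, microsecond=0)
    offset = ts - floor
    half = timedelta(minutes=30)
    if offset > half or (offset == half and tie_up):
        # Adding a timedelta takes care of day, month and year rollover.
        return floor + timedelta(hours=1)
    return floor

# 23:45 on New Year's Eve rolls over into the next year.
rounded = round_to_hour(datetime(2015, 12, 31, 23, 45))
print(rounded, rounded.strftime('%A'))
```

The weekday then follows from the rounded value itself, so a lookup such as `NEXT` is not needed in this variant.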

In [42]:
def extract_hour(old_string):
    hour = int(old_string[11:13])
    code = old_string[-2:]
    if hour == 12:
        hour = hour - 12
    if code == 'PM':
        hour = hour + 12
    return hour
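The 12-hour conversion is easy to get wrong around midnight and noon, so it is worth checking against `datetime.strptime`. This is a sketch that assumes the `MM/DD/YYYY HH:MM:SS AM` format seen in `FROMDATE` and an English (or C) locale for the AM/PM marker; `extract_hour_strptime` is a hypothetical name.

```python
from datetime import datetime

def extract_hour_strptime(old_string):
    # %I is the hour on a 12-hour clock and %p is the AM/PM marker.
    return datetime.strptime(old_string, '%m/%d/%Y %I:%M:%S %p').hour

samples = [
    '07/08/2012 12:05:00 AM',  # just after midnight
    '07/08/2012 06:56:00 AM',
    '07/08/2012 12:05:00 PM',  # just after noon
    '07/08/2012 06:56:00 PM',
]
for s in samples:
    print(s, extract_hour_strptime(s))
```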

In [43]:
df_old_simplified['day'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[3:5]))
df_old_simplified['min'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[14:16]))
df_old_simplified['sec'] = df_old_simplified['FROMDATE'].apply(lambda x: int(x[17:19]))
df_old_simplified['hour'] = df_old_simplified['FROMDATE'].apply(extract_hour)

In [44]:
for index, row in df_old_simplified.iterrows():
    round_up = False #Round up this time?
    if df_old_simplified.at[index, 'min'] == 30 and df_old_simplified.at[index, 'sec'] == 0: #Tie break
        if tie_break_round_up:
            round_up = True
        tie_break_round_up = not tie_break_round_up
    if df_old_simplified.at[index, 'min'] > 30 or (df_old_simplified.at[index, 'min'] == 30 and df_old_simplified.at[index, 'sec'] > 0):
        round_up = True
    if round_up:
        df_old_simplified.at[index, 'hour'] = df_old_simplified.at[index, 'hour'] + 1
        if df_old_simplified.at[index, 'hour'] == 24:
            df_old_simplified.at[index, 'hour'] = 0
            df_old_simplified.at[index, 'day'] = df_old_simplified.at[index, 'day'] + 1
            df_old_simplified.at[index, 'DAY_WEEK'] = NEXT[df_old_simplified.at[index, 'DAY_WEEK']]
            if df_old_simplified.at[index, 'day'] > num_of_days(df_old_simplified.at[index, 'Month'], df_old_simplified.at[index, 'Year']):
                df_old_simplified.at[index, 'day'] = 1
                df_old_simplified.at[index, 'Month'] = df_old_simplified.at[index, 'Month'] + 1
                if df_old_simplified.at[index,'Month'] == 13:
                    df_old_simplified.at[index,'Month'] = 1
                    df_old_simplified.at[index, 'Year'] = df_old_simplified.at[index, 'Year'] + 1


In [45]:
df_old_simplified.head(10)

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,FROMDATE,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,min,sec,hour
0,RESIDENTIAL BURGLARY,07/08/2012 06:00:00 AM,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,0,0,6
1,AGGRAVATED ASSAULT,07/08/2012 06:03:00 AM,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,3,0,6
2,ROBBERY,07/08/2012 06:26:00 AM,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,26,0,6
3,COMMERCIAL BURGLARY,07/08/2012 06:56:00 AM,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,56,0,7
4,ROBBERY,07/08/2012 07:15:00 AM,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,15,0,7
5,ROBBERY,07/08/2012 07:32:00 AM,2012,7,Sunday,Part One,SYDNEY ST,42.313282,-71.053006,8,32,0,8
6,ROBBERY,07/08/2012 07:50:00 AM,2012,7,Sunday,Part One,REGENT ST,42.324251,-71.08621,8,50,0,8
7,SIMPLE ASSAULT,07/08/2012 07:50:00 AM,2012,7,Sunday,Part Two,WASHINGTON ST,42.349246,-71.063785,8,50,0,8
8,MedAssist,07/08/2012 07:53:00 AM,2012,7,Sunday,Part Three,FANEUIL ST,42.351746,-71.16591,8,53,0,8
9,MedAssist,07/08/2012 08:05:00 AM,2012,7,Sunday,Part Three,RIVER ST,42.259383,-71.117294,8,5,0,8


In [46]:
del df_new_simplified['min']
del df_new_simplified['sec']
del df_old_simplified['min']
del df_old_simplified['sec']
del df_old_simplified['FROMDATE']

In [47]:
df_old_simplified.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,7


<a id = '1.5'></a>
[Return to top](#top)
## 1.5 Remove non-crimes

As usual we only care about major crimes.

In [48]:
df_new_clean = df_new_simplified.loc[(df_new_simplified['UCR_PART'] == 'Part One') | (df_new_simplified['OFFENSE_CODE_GROUP'] == 'Arson')]

In [49]:
df_new_clean['UCR_PART'].value_counts()

Part One    76473
Other         107
Name: UCR_PART, dtype: int64

In [50]:
df_new_clean['OFFENSE_CODE_GROUP'].value_counts()

Larceny                       32780
Larceny From Motor Vehicle    13217
Aggravated Assault            10044
Residential Burglary           6591
Auto Theft                     5884
Robbery                        5525
Commercial Burglary            1607
Other Burglary                  561
Homicide                        264
Arson                           107
Name: OFFENSE_CODE_GROUP, dtype: int64

In [51]:
df_old_O = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Other']
df_old_NA = df_old_simplified.loc[df_old_simplified['UCRPART'].isnull()]

In [52]:
df_old_O['INCIDENT_TYPE_DESCRIPTION'].value_counts()

MVAcc                              9671
PersLoc                            3479
PersMiss                            780
07RV                                613
Hazardous                           493
Service                             260
Plates                               45
ARSON                                30
Auto Theft Recovery                  29
MedAssist                            22
HateCrim                             19
License Plate Related Incidents       5
Arson                                 3
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

In [53]:
df_old_NA.shape

(0, 10)

In [54]:
df_old_simplified['UCRPART'].value_counts()

Part Two      98341
Part One      65261
Part three    55482
Part Three    33523
Other         15449
Name: UCRPART, dtype: int64

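The labels are inconsistent (`Part three` and `Part Three` both occur), but that's fine because we only keep `Part One` and `Other` below.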

In [55]:
df_old_2 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part Two']
df_old_3 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part Three']
df_old_33 = df_old_simplified.loc[df_old_simplified['UCRPART'] == 'Part three']

In [56]:
df_old_33['INCIDENT_TYPE_DESCRIPTION'].value_counts()

MedAssist                   12401
InvPer                       9448
PropLost                     5890
TOWED                        5524
InvProp                      4862
Service                      3505
PropFound                    2964
Argue                        2065
Arrest                       1374
FIRE                         1294
PhoneCalls                    995
LICViol                       836
32GUN                         747
Gather                        718
Landlord                      716
DEATH INVESTIGATION           678
SearchWarr                    521
PropDam                       502
Plates                        228
Harbor                        150
VIOLATION OF LIQUOR LAWS       30
Explos                         23
Aircraft                        7
Labor                           4
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

In [57]:
# .copy() gives an independent dataframe, so the column assignment in the next cell does not trigger SettingWithCopyWarning
df_old_semiclean = df_old_simplified.loc[(df_old_simplified['UCRPART'] == 'Part One') | (df_old_simplified['UCRPART'] == 'Other')].copy()

OK, I think the `Part Two` and `Part Three` incidents, and every `Other` incident except arson, can be ignored.

In [58]:
# Upper-case the labels so that spelling variants such as 'ARSON' and 'Arson' collapse into one
df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'] = df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'].str.upper()


In [59]:
df_old_semiclean.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,UCRPART,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,Part One,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,Part One,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,Part One,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,Part One,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,Part One,COLLINS ST,42.270516,-71.1199,8,7


In [60]:
df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                      24443
LARCENY FROM MOTOR VEHICLE         13265
MVACC                               9671
RESIDENTIAL BURGLARY                7119
AGGRAVATED ASSAULT                  6008
ROBBERY                             5193
AUTO THEFT                          4851
PERSLOC                             3479
COMMERCIAL BURGLARY                 1550
BENOPROP                            1367
LARCENY                             1288
PERSMISS                             780
07RV                                 613
HAZARDOUS                            493
SERVICE                              260
HOMICIDE                             144
PLATES                                45
ARSON                                 33
AUTO THEFT RECOVERY                   29
OTHER BURGLARY                        22
MEDASSIST                             22
HATECRIM                              19
MANSLAUG                               9
LICENSE PLATE RELATED INCIDENTS        5
RAPE AND ATTEMPT

In [61]:
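# NOTE: the labels were upper-cased in In [58], so the case-sensitive test against 'Arson'
# matches nothing and the 33 'ARSON' rows are dropped here (see the counts below).
df_old_clean = df_old_semiclean.loc[(df_old_semiclean['UCRPART'] == 'Part One') | (df_old_semiclean['INCIDENT_TYPE_DESCRIPTION'] == 'Arson')]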

In [62]:
df_old_clean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                 24443
LARCENY FROM MOTOR VEHICLE    13265
RESIDENTIAL BURGLARY           7119
AGGRAVATED ASSAULT             6008
ROBBERY                        5193
AUTO THEFT                     4851
COMMERCIAL BURGLARY            1550
BENOPROP                       1367
LARCENY                        1288
HOMICIDE                        144
OTHER BURGLARY                   22
MANSLAUG                          9
RAPE AND ATTEMPTED                2
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

`BENOPROP` means "Break and enter, no property taken". Since it is in `Other` in the new data source, let's remove it. `RAPE AND ATTEMPTED` and `MANSLAUG` need to be removed as well because they are either not present in the new data source or not in `Part One` there.

In [63]:
df_old_clean = df_old_clean[df_old_clean['INCIDENT_TYPE_DESCRIPTION'] != 'BENOPROP'] 
df_old_clean = df_old_clean[df_old_clean['INCIDENT_TYPE_DESCRIPTION'] != 'MANSLAUG'] 
df_old_clean = df_old_clean[df_old_clean['INCIDENT_TYPE_DESCRIPTION'] != 'RAPE AND ATTEMPTED'] 
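
The three filters above can also be written as a single membership test. The following is a minimal sketch on a made-up frame; `toy`, `labels_to_drop` and `toy_kept` are hypothetical names used only for illustration, and the rows are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for df_old_clean (hypothetical rows, not the real data)
toy = pd.DataFrame({
    'INCIDENT_TYPE_DESCRIPTION': [
        'ROBBERY', 'BENOPROP', 'MANSLAUG', 'LARCENY', 'RAPE AND ATTEMPTED',
    ]
})

# Labels that have no Part One counterpart in the new data source
labels_to_drop = ['BENOPROP', 'MANSLAUG', 'RAPE AND ATTEMPTED']

# Keep only the rows whose label is NOT in the drop list
toy_kept = toy[~toy['INCIDENT_TYPE_DESCRIPTION'].isin(labels_to_drop)]
print(toy_kept)
```

With `isin`, adding another label to drop only requires extending the list.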

In [64]:
df_old_clean['INCIDENT_TYPE_DESCRIPTION'].value_counts()

OTHER LARCENY                 24443
LARCENY FROM MOTOR VEHICLE    13265
RESIDENTIAL BURGLARY           7119
AGGRAVATED ASSAULT             6008
ROBBERY                        5193
AUTO THEFT                     4851
COMMERCIAL BURGLARY            1550
LARCENY                        1288
HOMICIDE                        144
OTHER BURGLARY                   22
Name: INCIDENT_TYPE_DESCRIPTION, dtype: int64

Now we can drop the UCR part column from both dataframes (`UCRPART` in the old data, `UCR_PART` in the new data).

In [65]:
del df_old_clean['UCRPART']

In [66]:
del df_new_clean['UCR_PART']

Let's store the data so that it isn't lost.

In [67]:
df_old_clean.to_csv('old.csv')
df_new_clean.to_csv('new.csv')

<a id = '1.6'></a>
[Return to top](#top)
## 1.6 Combine the two dataframes

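Now it's time to combine the two dataframes: we rename the columns to a common set of names, put them in the same order and then concatenate the rows.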

In [68]:
df_new_clean.head()

Unnamed: 0,OFFENSE_CODE_GROUP,YEAR,MONTH,DAY_OF_WEEK,HOUR,STREET,Lat,Long,day
4,Larceny,2019,7,Tuesday,21,DORCHESTER AVE,42.33014,-71.056958,16
11,Larceny,2019,7,Tuesday,18,TREMONT ST,42.336409,-71.08565,16
17,Larceny,2019,7,Tuesday,20,COLUMBIA RD,,,16
19,Larceny,2019,7,Thursday,23,COLUMBUS AVE,42.346575,-71.074387,11
22,Other Burglary,2019,7,Sunday,7,GREEN ST,42.311438,-71.109264,14


In [69]:
df_old_clean.head()

Unnamed: 0,INCIDENT_TYPE_DESCRIPTION,Year,Month,DAY_WEEK,STREETNAME,Lat,Long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,COLLINS ST,42.270516,-71.1199,8,7


In [70]:
# Assign the renamed copy instead of renaming in place; an in-place rename on a slice triggers SettingWithCopyWarning
df_new_clean = df_new_clean.rename(index = str, columns = {'OFFENSE_CODE_GROUP':'crime', 'YEAR': 'year', 'MONTH': 'month', 'DAY_OF_WEEK': 'dayw', 'HOUR': 'hour','STREET':'street','Lat':'lat','Long':'long','day':'day'})


In [71]:
df_new_clean.head()

Unnamed: 0,crime,year,month,dayw,hour,street,lat,long,day
4,Larceny,2019,7,Tuesday,21,DORCHESTER AVE,42.33014,-71.056958,16
11,Larceny,2019,7,Tuesday,18,TREMONT ST,42.336409,-71.08565,16
17,Larceny,2019,7,Tuesday,20,COLUMBIA RD,,,16
19,Larceny,2019,7,Thursday,23,COLUMBUS AVE,42.346575,-71.074387,11
22,Other Burglary,2019,7,Sunday,7,GREEN ST,42.311438,-71.109264,14


In [72]:
df_old_clean.rename(index = str, columns = {'INCIDENT_TYPE_DESCRIPTION':'crime', 'Year': 'year', 'Month': 'month', 'DAY_WEEK': 'dayw', 'hour': 'hour','STREETNAME':'street','Lat':'lat','Long':'long','day':'day'}, inplace = True)

In [73]:
df_old_clean.head()

Unnamed: 0,crime,year,month,dayw,street,lat,long,day,hour
0,RESIDENTIAL BURGLARY,2012,7,Sunday,ABERDEEN ST,42.346381,-71.103795,8,6
1,AGGRAVATED ASSAULT,2012,7,Sunday,HOWARD AV,42.316841,-71.074585,8,6
2,ROBBERY,2012,7,Sunday,JERSEY ST,42.342841,-71.09699,8,6
3,COMMERCIAL BURGLARY,2012,7,Sunday,COLUMBIA RD,42.316441,-71.065829,8,7
4,ROBBERY,2012,7,Sunday,COLLINS ST,42.270516,-71.1199,8,7


In [74]:
correct_order = ['crime','year','month','day','dayw','hour','street','lat','long']

In [75]:
df_old_clean = df_old_clean[correct_order]
df_new_clean = df_new_clean[correct_order]

In [76]:
df_old_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
0,RESIDENTIAL BURGLARY,2012,7,8,Sunday,6,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,HOWARD AV,42.316841,-71.074585
2,ROBBERY,2012,7,8,Sunday,6,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,2012,7,8,Sunday,7,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,2012,7,8,Sunday,7,COLLINS ST,42.270516,-71.1199


In [77]:
df_new_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
4,Larceny,2019,7,16,Tuesday,21,DORCHESTER AVE,42.33014,-71.056958
11,Larceny,2019,7,16,Tuesday,18,TREMONT ST,42.336409,-71.08565
17,Larceny,2019,7,16,Tuesday,20,COLUMBIA RD,,
19,Larceny,2019,7,11,Thursday,23,COLUMBUS AVE,42.346575,-71.074387
22,Other Burglary,2019,7,14,Sunday,7,GREEN ST,42.311438,-71.109264


In [78]:
frames = [df_old_clean, df_new_clean]

In [79]:
df_clean = pd.concat(frames, ignore_index = True)

In [80]:
df_clean.tail()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
140458,Aggravated Assault,2015,11,20,Friday,11,BLUE HILL AVE,42.301897,-71.085549
140459,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
140460,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
140461,Larceny,2018,12,13,Thursday,0,BROOKLEDGE ST,42.309563,-71.089902
140462,Homicide,2015,7,9,Thursday,14,RIVER ST,42.255926,-71.123172


In [81]:
df_old_clean.shape

(63883, 9)

In [82]:
df_new_clean.shape

(76580, 9)

In [83]:
df_clean.shape

(140463, 9)

In [84]:
df_old_clean.shape[0] + df_new_clean.shape[0] == df_clean.shape[0]

True

Now we need to unify the crime labels, since the two sources use different capitalization and categories.

In [85]:
df_clean['crime'] = df_clean['crime'].apply(lambda x: x.upper())

In [86]:
df_clean['crime'].value_counts()

LARCENY                       34068
LARCENY FROM MOTOR VEHICLE    26482
OTHER LARCENY                 24443
AGGRAVATED ASSAULT            16052
RESIDENTIAL BURGLARY          13710
AUTO THEFT                    10735
ROBBERY                       10718
COMMERCIAL BURGLARY            3157
OTHER BURGLARY                  583
HOMICIDE                        408
ARSON                           107
Name: crime, dtype: int64

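The two sources split larceny differently: the old one has `OTHER LARCENY`, `LARCENY FROM MOTOR VEHICLE` and `LARCENY`, while the new one has only `LARCENY` and `LARCENY FROM MOTOR VEHICLE`. Hence we will simply merge all larcenies into `LARCENY`.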

In [87]:
df_clean['crime'] = df_clean['crime'].replace({'LARCENY FROM MOTOR VEHICLE': 'LARCENY', 'OTHER LARCENY': 'LARCENY'})

In [88]:
df_clean['crime'].value_counts()

LARCENY                 84993
AGGRAVATED ASSAULT      16052
RESIDENTIAL BURGLARY    13710
AUTO THEFT              10735
ROBBERY                 10718
COMMERCIAL BURGLARY      3157
OTHER BURGLARY            583
HOMICIDE                  408
ARSON                     107
Name: crime, dtype: int64

Now we can store the file.

In [89]:
df_clean.to_csv('final.csv')

<a id = '2'></a>
[Return to top](#top)
# 2. Regressions

In [90]:
df_clean = pd.read_csv('final.csv',index_col = 0)

In [91]:
df_clean.head()

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
0,RESIDENTIAL BURGLARY,2012,7,8,Sunday,6,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,HOWARD AV,42.316841,-71.074585
2,ROBBERY,2012,7,8,Sunday,6,JERSEY ST,42.342841,-71.09699
3,COMMERCIAL BURGLARY,2012,7,8,Sunday,7,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,2012,7,8,Sunday,7,COLLINS ST,42.270516,-71.1199


<a id = '2.1'></a>
[Return to top](#top)
## 2.1 Preparation

In [92]:
df_clean.isna().sum()

crime        0
year         0
month        0
day          0
dayw         0
hour         0
street    1529
lat       5834
long      5834
dtype: int64

Now we drop every row with a missing `street`, `lat` or `long`.

In [93]:
df_final = df_clean.dropna()

In [94]:
df_final.shape

(134403, 9)

In [95]:
df_final.sort_values(['year','month','day'])

Unnamed: 0,crime,year,month,day,dayw,hour,street,lat,long
0,RESIDENTIAL BURGLARY,2012,7,8,Sunday,6,ABERDEEN ST,42.346381,-71.103795
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,6,HOWARD AV,42.316841,-71.074585
2,ROBBERY,2012,7,8,Sunday,6,JERSEY ST,42.342841,-71.096990
3,COMMERCIAL BURGLARY,2012,7,8,Sunday,7,COLUMBIA RD,42.316441,-71.065829
4,ROBBERY,2012,7,8,Sunday,7,COLLINS ST,42.270516,-71.119900
5,ROBBERY,2012,7,8,Sunday,8,SYDNEY ST,42.313282,-71.053006
6,ROBBERY,2012,7,8,Sunday,8,REGENT ST,42.324251,-71.086210
7,RESIDENTIAL BURGLARY,2012,7,8,Sunday,11,CATBIRD COURT,42.288138,-71.094849
8,LARCENY,2012,7,8,Sunday,12,HILLSIDE ST,42.331666,-71.107630
9,AUTO THEFT,2012,7,8,Sunday,12,E 7TH ST,42.332171,-71.042240


In [96]:
def first_day(df):
    # Earliest date in df, formatted as month/day/year for pd.date_range
    row = df.sort_values(['year','month','day']).iloc[0,:]
    return str(row.month) + '/' + str(row.day) + '/' + str(row.year)
def last_day(df):
    # Latest date in df, in the same month/day/year format
    row = df.sort_values(['year','month','day']).iloc[-1,:]
    return str(row.month) + '/' + str(row.day) + '/' + str(row.year)

In [97]:
def count_crimes(df, crime, year, month, day, hour):
    df1 = df[(df['crime'] == crime) & (df['year'] == year)]
    df2 = df1[(df1['month'] == month) & (df1['day'] == day)]
    return df2[df2['hour'] == hour].shape[0]

In [98]:
time_tuple_list = pd.date_range(start = first_day(df_final), end = last_day(df_final)).tolist()

In [99]:
crime_list = df_final.crime.unique().tolist()

In [100]:
df_temp = df_final[['crime','year','month','day','dayw','hour']].groupby(['crime','year','month','day','dayw','hour']).size().unstack(fill_value = 0).stack().reset_index()

In [101]:
df_temp.rename(index = str, columns = {0:'counts'}, inplace = True)

In [102]:
df_temp[df_temp.crime == 'ARSON'].head(200)

Unnamed: 0,crime,year,month,day,dayw,hour,counts
61056,ARSON,2015,6,20,Saturday,0,0
61057,ARSON,2015,6,20,Saturday,1,0
61058,ARSON,2015,6,20,Saturday,2,0
61059,ARSON,2015,6,20,Saturday,3,0
61060,ARSON,2015,6,20,Saturday,4,0
61061,ARSON,2015,6,20,Saturday,5,0
61062,ARSON,2015,6,20,Saturday,6,0
61063,ARSON,2015,6,20,Saturday,7,0
61064,ARSON,2015,6,20,Saturday,8,0
61065,ARSON,2015,6,20,Saturday,9,0


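We have no information about arson before mid-2015, which looks weird at first. A likely cause lies in section 1.5: the old source does record arson (33 rows after upper-casing), but the filter there compares against `'Arson'` after the labels were upper-cased, so those rows were dropped. So we drop `ARSON` for now, until the old arson records are restored or we find other crime descriptions that are essentially arson.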
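
One way to check where each crime category starts in the data is to compute its earliest date. The sketch below does this on a small made-up frame; `toy` and `first_seen` are hypothetical names and the rows are illustrative, not the contents of `df_final`.

```python
import pandas as pd

# Toy stand-in for df_final (made-up rows)
toy = pd.DataFrame({
    'crime': ['ARSON', 'ROBBERY', 'ARSON', 'ROBBERY'],
    'year':  [2015, 2012, 2016, 2014],
    'month': [6, 7, 1, 3],
    'day':   [20, 8, 5, 9],
})

# Assemble a proper datetime column from the year/month/day columns
toy['date'] = pd.to_datetime(toy[['year', 'month', 'day']])

# Earliest recorded date per crime category
first_seen = toy.groupby('crime')['date'].min()
print(first_seen)
```

A category whose earliest date is much later than the others is a hint that some of its records were lost or are filed under a different label.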

In [103]:
crime_list.remove('ARSON')

In [104]:
df_al = df_final[df_final.crime != 'ARSON']
df_temp = df_al[['crime','year','month','day','dayw','hour']].groupby(['crime','year','month','day','dayw','hour']).size().unstack(fill_value = 0).stack().reset_index()

In [105]:
df_temp.head()

Unnamed: 0,crime,year,month,day,dayw,hour,0
0,AGGRAVATED ASSAULT,2012,7,8,Sunday,0,0
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,1,0
2,AGGRAVATED ASSAULT,2012,7,8,Sunday,2,0
3,AGGRAVATED ASSAULT,2012,7,8,Sunday,3,0
4,AGGRAVATED ASSAULT,2012,7,8,Sunday,4,0


In [106]:
df_temp.shape

(359280, 7)

In [107]:
# Sanity check: print every row whose recorded weekday disagrees with its calendar date
dicc = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3: 'Thursday',4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
for ind, row in df_temp.iterrows():
    year = df_temp.at[ind, 'year']
    month = df_temp.at[ind, 'month']
    day = df_temp.at[ind, 'day']
    dayw = dicc[datetime.date(year, month, day).weekday()]
    if dayw != df_temp.at[ind, 'dayw']:
        print(row)
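
The loop above visits the rows one by one, which is slow on a few hundred thousand rows. A vectorized version of the same check might look like the sketch below. It runs on a made-up frame, and `toy`, `computed` and `mismatches` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for df_temp (made-up rows)
toy = pd.DataFrame({
    'year':  [2012, 2012, 2019],
    'month': [7, 7, 7],
    'day':   [8, 9, 16],
    'dayw':  ['Sunday', 'Monday', 'Tuesday'],
})

# Weekday names derived from the calendar date, for all rows at once
computed = pd.to_datetime(toy[['year', 'month', 'day']]).dt.day_name()

# Rows where the recorded weekday disagrees with the calendar
mismatches = toy[computed != toy['dayw']]
print(len(mismatches))
```

An empty `mismatches` frame corresponds to the loop printing nothing.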

In [110]:
df_temp.hour.value_counts()

23    14970
22    14970
1     14970
2     14970
3     14970
4     14970
5     14970
6     14970
7     14970
8     14970
9     14970
10    14970
11    14970
12    14970
13    14970
14    14970
15    14970
16    14970
17    14970
18    14970
19    14970
20    14970
21    14970
0     14970
Name: hour, dtype: int64

In [111]:
cset = set()
for ind, row in df_temp.iterrows():
    item = (row.crime, row.year, row.month, row.day)
    cset.add(item)

In [112]:
len(cset)

14970

In [115]:
cset

{('AGGRAVATED ASSAULT', 2014, 6, 12),
 ('RESIDENTIAL BURGLARY', 2014, 12, 8),
 ('LARCENY', 2015, 7, 18),
 ('AUTO THEFT', 2018, 8, 11),
 ('COMMERCIAL BURGLARY', 2017, 7, 21),
 ('LARCENY', 2019, 1, 6),
 ('ROBBERY', 2018, 11, 16),
 ('AGGRAVATED ASSAULT', 2013, 8, 16),
 ('RESIDENTIAL BURGLARY', 2015, 6, 20),
 ('LARCENY', 2014, 11, 6),
 ('AUTO THEFT', 2013, 5, 12),
 ('ROBBERY', 2015, 1, 12),
 ('RESIDENTIAL BURGLARY', 2018, 9, 13),
 ('AGGRAVATED ASSAULT', 2016, 9, 7),
 ('RESIDENTIAL BURGLARY', 2013, 2, 13),
 ('ROBBERY', 2012, 10, 30),
 ('AGGRAVATED ASSAULT', 2015, 5, 30),
 ('ROBBERY', 2017, 11, 13),
 ('AUTO THEFT', 2012, 10, 9),
 ('AGGRAVATED ASSAULT', 2018, 12, 11),
 ('AUTO THEFT', 2017, 6, 14),
 ('ROBBERY', 2014, 1, 15),
 ('RESIDENTIAL BURGLARY', 2015, 5, 9),
 ('ROBBERY', 2019, 7, 4),
 ('LARCENY', 2018, 10, 22),
 ('AGGRAVATED ASSAULT', 2016, 4, 24),
 ('AUTO THEFT', 2014, 11, 9),
 ('AUTO THEFT', 2019, 1, 12),
 ('LARCENY', 2016, 6, 3),
 ('ROBBERY', 2017, 6, 30),
 ('COMMERCIAL BURGLARY', 2012

In [116]:
crime_set = set(crime_list)
time_set = set(time_tuple_list)

In [117]:
full_set = {(crime, time.year, time.month, time.day) for crime in crime_set for time in time_set}

In [118]:
len(full_set)

20520

In [119]:
zeroset = full_set - cset

In [120]:
cset - full_set

set()

In [121]:
uzeroset = cset - full_set

In [122]:
len(zeroset) + len(cset) - len(full_set)

0

In [123]:
full_set.issuperset(cset)

True

In [124]:

def process_row(tup):
    dicc_list = []
    dayw = dicc[datetime.date(tup[1], tup[2], tup[3]).weekday()]
    for i in range(24):
        ind_dic = {'crime': tup[0], 'year': tup[1], 'month': tup[2], 'day': tup[3], 'dayw': dayw, 'hour': i, 'counts': 0}
        dicc_list.append(ind_dic)
    return dicc_list

In [125]:
def process_set(zeroset):
    dicc_list = []
    for row in zeroset:
        dicc_list.extend(process_row(row))
    return dicc_list

In [126]:
zero_dicc_list = process_set(zeroset)

In [127]:
len(zero_dicc_list) 

133200

In [128]:
24 * len(zeroset)

133200

In [129]:
df_zero = pd.DataFrame(zero_dicc_list, columns = ['crime', 'year', 'month', 'day', 'dayw', 'hour', 'counts'])

In [130]:
df_zero.head()

Unnamed: 0,crime,year,month,day,dayw,hour,counts
0,COMMERCIAL BURGLARY,2019,6,13,Thursday,0,0
1,COMMERCIAL BURGLARY,2019,6,13,Thursday,1,0
2,COMMERCIAL BURGLARY,2019,6,13,Thursday,2,0
3,COMMERCIAL BURGLARY,2019,6,13,Thursday,3,0
4,COMMERCIAL BURGLARY,2019,6,13,Thursday,4,0


In [131]:
df_zero.shape

(133200, 7)

In [132]:
df_temp.rename(index = str, columns = {0:'counts'}, inplace = True)

In [133]:
df_ag = pd.concat([df_temp, df_zero], ignore_index = True)
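
The zero-filling above is built from Python sets and a list of dictionaries. A possible alternative, sketched here on a made-up frame, builds the full crime × date × hour grid with `pd.MultiIndex.from_product` and reindexes the grouped counts onto it. The names `toy`, `counts`, `full_index` and `filled` are assumptions for illustration, and this is not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for the incident-level data (made-up rows)
toy = pd.DataFrame({
    'crime': ['ROBBERY', 'ROBBERY', 'LARCENY'],
    'date':  pd.to_datetime(['2012-07-08', '2012-07-08', '2012-07-09']),
    'hour':  [6, 6, 12],
})

# Observed counts per (crime, date, hour)
counts = toy.groupby(['crime', 'date', 'hour']).size()

# Every combination of crime, calendar day and hour, observed or not
full_index = pd.MultiIndex.from_product(
    [toy['crime'].unique(),
     pd.date_range(toy['date'].min(), toy['date'].max()),
     range(24)],
    names=['crime', 'date', 'hour'],
)

# Combinations that never occurred get an explicit count of 0
filled = counts.reindex(full_index, fill_value=0).reset_index(name='counts')
print(filled.shape)
```

The grid has one row per crime, day and hour, so its length can be checked against the product of the three level sizes, much like the `24 * len(zeroset)` check above.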

In [134]:
df_ag.to_csv('ag.csv')

In [135]:
df_ag.dayw.value_counts()

Tuesday      70464
Monday       70464
Sunday       70464
Saturday     70272
Friday       70272
Thursday     70272
Wednesday    70272
Name: dayw, dtype: int64

In [138]:
len(crime_set)

8

In [139]:
df_ag.head()

Unnamed: 0,crime,year,month,day,dayw,hour,counts
0,AGGRAVATED ASSAULT,2012,7,8,Sunday,0,0
1,AGGRAVATED ASSAULT,2012,7,8,Sunday,1,0
2,AGGRAVATED ASSAULT,2012,7,8,Sunday,2,0
3,AGGRAVATED ASSAULT,2012,7,8,Sunday,3,0
4,AGGRAVATED ASSAULT,2012,7,8,Sunday,4,0


In [140]:
df_ag['year'] = df_ag.year.astype('category')
df_ag['month'] = df_ag.month.astype('category')
df_ag['day'] = df_ag.day.astype('category')
df_ag['dayw'] = df_ag.dayw.astype('category')
df_ag['hour'] = df_ag.hour.astype('category')
df_ag['crime'] = df_ag.crime.astype('category')
df_ag['counts'] = df_ag.counts.astype(float)

In [141]:
df_ag.dtypes

crime     category
year      category
month     category
day       category
dayw      category
hour      category
counts     float64
dtype: object

In [142]:
df_ag.crime.value_counts()

ROBBERY                 61560
RESIDENTIAL BURGLARY    61560
OTHER BURGLARY          61560
LARCENY                 61560
HOMICIDE                61560
COMMERCIAL BURGLARY     61560
AUTO THEFT              61560
AGGRAVATED ASSAULT      61560
Name: crime, dtype: int64

<a id = '2.2A'></a>
[Return to top](#top)
## 2.2 Select and split

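Now that the data are clean, we do some feature engineering: we sum the hourly counts by crime, year, month and day of the week, and then one-hot encode the categorical columns.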

In [143]:
df_ag2 = df_ag.groupby(['crime', 'year', 'month', 'dayw'])['counts'].sum().reset_index(name = 'counts')

In [144]:
df_ag2

Unnamed: 0,crime,year,month,dayw,counts
0,AGGRAVATED ASSAULT,2012,7,Friday,25.0
1,AGGRAVATED ASSAULT,2012,7,Monday,27.0
2,AGGRAVATED ASSAULT,2012,7,Saturday,21.0
3,AGGRAVATED ASSAULT,2012,7,Sunday,25.0
4,AGGRAVATED ASSAULT,2012,7,Thursday,14.0
5,AGGRAVATED ASSAULT,2012,7,Tuesday,33.0
6,AGGRAVATED ASSAULT,2012,7,Wednesday,17.0
7,AGGRAVATED ASSAULT,2012,8,Friday,32.0
8,AGGRAVATED ASSAULT,2012,8,Monday,22.0
9,AGGRAVATED ASSAULT,2012,8,Saturday,38.0


In [145]:
df_dummies = pd.get_dummies(df_ag2)
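
To make the encoding step concrete, here is a minimal sketch of `pd.get_dummies` on a made-up frame (`toy` and `encoded` are hypothetical names). Only categorical, object or string columns are expanded, which is why the time columns were cast to `category` earlier. The `dtype=int` argument is an assumption added here: recent pandas versions return boolean columns by default, while the output shown in this notebook contains 0/1 integers.

```python
import pandas as pd

# Toy stand-in for df_ag2: one categorical column and the numeric target
toy = pd.DataFrame({
    'crime':  pd.Categorical(['ROBBERY', 'LARCENY']),
    'counts': [3.0, 5.0],
})

# Expand the categorical column into one 0/1 indicator column per category
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The numeric `counts` column is left untouched and stays first, which is why the features can be taken as every column after the first.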

In [146]:
df_dummies.head()

Unnamed: 0,counts,crime_AGGRAVATED ASSAULT,crime_AUTO THEFT,crime_COMMERCIAL BURGLARY,crime_HOMICIDE,crime_LARCENY,crime_OTHER BURGLARY,crime_RESIDENTIAL BURGLARY,crime_ROBBERY,year_2012,...,month_10,month_11,month_12,dayw_Friday,dayw_Monday,dayw_Saturday,dayw_Sunday,dayw_Thursday,dayw_Tuesday,dayw_Wednesday
0,25.0,1,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
1,27.0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
2,21.0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,25.0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,14.0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [147]:
df_dummies.columns

Index(['counts', 'crime_AGGRAVATED ASSAULT', 'crime_AUTO THEFT',
       'crime_COMMERCIAL BURGLARY', 'crime_HOMICIDE', 'crime_LARCENY',
       'crime_OTHER BURGLARY', 'crime_RESIDENTIAL BURGLARY', 'crime_ROBBERY',
       'year_2012', 'year_2013', 'year_2014', 'year_2015', 'year_2016',
       'year_2017', 'year_2018', 'year_2019', 'month_1', 'month_2', 'month_3',
       'month_4', 'month_5', 'month_6', 'month_7', 'month_8', 'month_9',
       'month_10', 'month_11', 'month_12', 'dayw_Friday', 'dayw_Monday',
       'dayw_Saturday', 'dayw_Sunday', 'dayw_Thursday', 'dayw_Tuesday',
       'dayw_Wednesday'],
      dtype='object')

In [148]:
X = df_dummies.iloc[:,1:]
y = df_dummies['counts']

In [149]:
y.shape

(4760,)

In [150]:
y.head()

0    25.0
1    27.0
2    21.0
3    25.0
4    14.0
Name: counts, dtype: float64

In [151]:
X_trainn, X_test, y_trainn, y_test = train_test_split(X, y, test_size=0.2, random_state=52)
X_train, X_val, y_train, y_val = train_test_split(X_trainn, y_trainn, test_size=0.2, random_state=52)
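
The two calls above first hold out 20% of the rows as a test set and then hold out 20% of the remainder as a validation set, which gives roughly a 64% / 16% / 20% split. The sketch below reproduces only the sizes, using placeholder arrays with the same number of rows as `y` (4760); the `_toy` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the real feature matrix
X_toy = np.arange(4760).reshape(-1, 1)
y_toy = np.arange(4760)

# First split: 80% remainder, 20% test
X_rest_toy, X_test_toy, y_rest_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=52)

# Second split: 80% of the remainder for training, 20% for validation
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest_toy, y_rest_toy, test_size=0.2, random_state=52)

sizes = (len(X_train_toy), len(X_val_toy), len(X_test_toy))
print(sizes)
```

The training size agrees with the `X_train.shape` reported in the next cell.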

In [152]:
X_train.shape

(3046, 35)

In [153]:
regressor_list = []
ev_train = []
ev_test = []
r2_train = []
r2_test = []
mse_train = []
mse_test = []
mae_train = []
mae_test = []
mdae_train = []
mdae_test = []

In [154]:
def regression(regressor, x_train, x_test, y_train):
    reg = regressor
    reg.fit(x_train, y_train)
    
    y_train_reg = reg.predict(x_train)
    y_test_reg = reg.predict(x_test)
    
    return y_train_reg, y_test_reg

In [155]:
def scores(regressor, y_train, y_test, y_train_reg, y_test_reg):
    regressor_list.append(str(regressor))
    
    ev_train_c = explained_variance_score(y_train, y_train_reg)
    ev_train.append(ev_train_c)
    ev_test_c = explained_variance_score(y_test, y_test_reg)
    ev_test.append(ev_test_c)
    
    r2_train_c = r2_score(y_train, y_train_reg)
    r2_train.append(r2_train_c)
    r2_test_c = r2_score(y_test, y_test_reg)
    r2_test.append(r2_test_c)
    
    mse_train_c = mean_squared_error(y_train, y_train_reg)
    mse_train.append(mse_train_c)
    mse_test_c = mean_squared_error(y_test, y_test_reg)
    mse_test.append(mse_test_c)

    mae_train_c = mean_absolute_error(y_train, y_train_reg)
    mae_train.append(mae_train_c)
    mae_test_c = mean_absolute_error(y_test, y_test_reg)
    mae_test.append(mae_test_c)  
    
    mdae_train_c = median_absolute_error(y_train, y_train_reg)
    mdae_train.append(mdae_train_c)
    mdae_test_c = median_absolute_error(y_test, y_test_reg)
    mdae_test.append(mdae_test_c)
    
    print("______________________________________________________________________________")
    print(str(regressor))
    print("______________________________________________________________________________")
    print("EV score. Train: ", ev_train_c)
    print("EV score. Test: ", ev_test_c)
    print("---------")
    print("R2 score. Train: ", r2_train_c)
    print("R2 score. Test: ", r2_test_c)
    print("---------")
    print("MSE score. Train: ", mse_train_c)
    print("MSE score. Test: ", mse_test_c)
    print("---------")
    print("MAE score. Train: ", mae_train_c)
    print("MAE score. Test: ", mae_test_c)
    print("---------")
    print("MdAE score. Train: ", mdae_train_c)
    print("MdAE score. Test: ", mdae_test_c)
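
The parallel lists filled by `scores` can later be turned into a comparison table. As a hedged sketch of that idea, assuming a table with one row per regressor is the goal, the snippet below collects two metrics per model as dictionaries and builds a dataframe from them. `metric_row`, `rows` and `summary` are hypothetical names, and the predictions are made up.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

def metric_row(name, y_true, y_pred):
    # One table row: the model name plus its scores on the given data
    return {
        'regressor': name,
        'r2': r2_score(y_true, y_pred),
        'mae': mean_absolute_error(y_true, y_pred),
    }

y_true = [1.0, 2.0, 3.0]
rows = [
    metric_row('perfect', y_true, [1.0, 2.0, 3.0]),  # exact predictions
    metric_row('offset', y_true, [2.0, 3.0, 4.0]),   # every prediction off by 1
]

# One row per regressor, one column per metric
summary = pd.DataFrame(rows).set_index('regressor')
print(summary)
```

Keeping each model's scores together in one record avoids the risk of the separate lists getting out of step.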

<a id = '2.2'></a>
[Return to top](#top)
## 2.3 Linear Regressor

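Let's first try linear regression. Note that in this section the scores labelled `Test` in the output are computed on the validation split (`X_val`, `y_val`).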

In [156]:
lreg = LinearRegression()
# regression() fits the model itself, so a separate fit call is not needed
y_train_reg, y_val_reg = regression(lreg, X_train, X_val, y_train)
scores(lreg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
______________________________________________________________________________
EV score. Train:  0.9143945432873462
EV score. Test:  0.8827995315420987
---------
R2 score. Train:  0.9143945262730107
R2 score. Test:  0.8827992274044586
---------
MSE score. Train:  170.25226815938166
MSE score. Test:  233.90416895748749
---------
MAE score. Train:  7.529039108667105
MAE score. Test:  8.118243520341208
---------
MdAE score. Train:  4.65625
MdAE score. Test:  4.91015625


In [157]:
sgd_reg = SGDRegressor()
# regression() fits the model itself, so a separate fit call is not needed
y_train_reg, y_val_reg = regression(sgd_reg, X_train, X_val, y_train)
scores(sgd_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='invscaling', loss='squared_loss', max_iter=None,
       n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
       verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9054579782838977
EV score. Test:  0.8730438206672417
---------
R2 score. Train:  0.9054503434555083
R2 score. Test:  0.8730325025847367
---------
MSE score. Train:  188.04046960510124
MSE score. Test:  253.39617060390353
---------
MAE score. Train:  7.745740899775923
MAE score. Test:  8.143369530164447
---------
MdAE score. Train:  4.930616293433571
MdAE score. Test:  5.11858268049243




<a id = '2.3'></a>
[Return to top](#top)
## 2.4 BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor

In [158]:
ba_reg = BaggingRegressor()
# regression() fits the model itself, so a separate fit call is not needed
y_train_reg, y_val_reg = regression(ba_reg, X_train, X_val, y_train)
scores(ba_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
BaggingRegressor(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=None, oob_score=False, random_state=None,
         verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9908782098682681
EV score. Test:  0.951245415609386
---------
R2 score. Train:  0.9908779716495478
R2 score. Test:  0.9512317155118306
---------
MSE score. Train:  18.141900853578466
MSE score. Test:  97.3296062992126
---------
MAE score. Train:  2.2193368351936966
MAE score. Test:  5.548556430446194
---------
MdAE score. Train:  1.0
MdAE score. Test:  2.9000000000000004


In [159]:
ada_reg = AdaBoostRegressor(learning_rate=1,n_estimators=100)
# regression() fits the model itself, so a separate fit call is not needed
y_train_reg, y_val_reg = regression(ada_reg, X_train, X_val, y_train)
scores(ada_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
AdaBoostRegressor(base_estimator=None, learning_rate=1, loss='linear',
         n_estimators=100, random_state=None)
______________________________________________________________________________
EV score. Train:  0.9067996317974019
EV score. Test:  0.873257042123503
---------
R2 score. Train:  0.9059081651990938
R2 score. Test:  0.8729562921466618
---------
MSE score. Train:  187.12995317589844
MSE score. Test:  253.54826805845966
---------
MAE score. Train:  9.413667303591383
MAE score. Test:  9.975730284832332
---------
MdAE score. Train:  7.269639781987321
MdAE score. Test:  7.269639781987321


In [160]:
et_reg = ExtraTreesRegressor(n_estimators=100)
# regression() fits the model itself, so a separate fit call is not needed
y_train_reg, y_val_reg = regression(et_reg, X_train, X_val, y_train)
scores(et_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
          oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  1.0
EV score. Test:  0.919522358923459
---------
R2 score. Train:  1.0
R2 score. Test:  0.9195195630832128
---------
MSE score. Train:  0.0
MSE score. Test:  160.61933123359577
---------
MAE score. Train:  0.0
MAE score. Test:  6.938477690288715
---------
MdAE score. Train:  0.0
MdAE score. Test:  3.7399999999999993


<a id = '2.4'></a>
[Return to top](#top)
## 2.5 GradientBoostingRegressor, RandomForestRegressor, LGBMRegressor

In [161]:
gb_reg = GradientBoostingRegressor()
# regression() fits the model itself, so a separate fit call is not needed
y_train_reg, y_val_reg = regression(gb_reg, X_train, X_val, y_train)
scores(gb_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9513838884549657
EV score. Test:  0.9223412832864206
---------
R2 score. Train:  0.9513838884549657
R2 score. Test:  0.9223009688448833
---------
MSE score. Train:  96.68778057380284
MSE score. Test:  155.06832343041123
---------
MAE score. Train:  5.565297417861016
MAE score. Test:  6.3364163

In [162]:
rf_reg = RandomForestRegressor(n_estimators=100)
# regression() fits the model itself, so a separate fit call is not needed
y_train_reg, y_val_reg = regression(rf_reg, X_train, X_val, y_train)
scores(rf_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9923313967267854
EV score. Test:  0.9537432524190378
---------
R2 score. Train:  0.9923312787386604
R2 score. Test:  0.9537275974117084
---------
MSE score. Train:  15.25156198292843
MSE score. Test:  92.3484345144357
---------
MAE score. Train:  2.08919238345371
MAE score. Test:  5.3308792650918635
---------
MdAE score. Train:  0.9900000000000047
MdAE score. Test:  2.705


In [163]:

gbm = lgb.LGBMRegressor(num_leaves=31,
                        learning_rate=0.01,
                        n_estimators=1000)
# Early stopping halts training once the eval_set metrics have not
# improved for 5 rounds.
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=5)

# Predict with the best iteration found by early stopping.
y_train_reg = gbm.predict(X_train, num_iteration=gbm.best_iteration_)
y_val_reg = gbm.predict(X_val, num_iteration=gbm.best_iteration_)
scores(gbm, y_train, y_val, y_train_reg, y_val_reg)

[1]	valid_0's l1: 27.5675	valid_0's l2: 1921.27
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's l1: 27.3076	valid_0's l2: 1886.27
[3]	valid_0's l1: 27.0514	valid_0's l2: 1851.97
[4]	valid_0's l1: 26.7975	valid_0's l2: 1818.34
[5]	valid_0's l1: 26.5462	valid_0's l2: 1785.39
[6]	valid_0's l1: 26.2972	valid_0's l2: 1753.08
[7]	valid_0's l1: 26.0509	valid_0's l2: 1721.41
[8]	valid_0's l1: 25.8071	valid_0's l2: 1690.37
[9]	valid_0's l1: 25.5656	valid_0's l2: 1659.95
[10]	valid_0's l1: 25.3266	valid_0's l2: 1630.12
[11]	valid_0's l1: 25.0905	valid_0's l2: 1600.89
[12]	valid_0's l1: 24.8571	valid_0's l2: 1572.25
[13]	valid_0's l1: 24.6262	valid_0's l2: 1544.17
[14]	valid_0's l1: 24.3978	valid_0's l2: 1516.65
[15]	valid_0's l1: 24.1711	valid_0's l2: 1489.67
[16]	valid_0's l1: 23.9472	valid_0's l2: 1463.23
[17]	valid_0's l1: 23.7251	valid_0's l2: 1437.31
[18]	valid_0's l1: 23.5054	valid_0's l2: 1411.89
[19]	valid_0's l1: 23.2884	valid_0's l2: 1387
[20]	valid_0's l1: 2

[168]	valid_0's l1: 7.53564	valid_0's l2: 201.835
[169]	valid_0's l1: 7.49984	valid_0's l2: 200.62
[170]	valid_0's l1: 7.46238	valid_0's l2: 199.169
[171]	valid_0's l1: 7.427	valid_0's l2: 197.848
[172]	valid_0's l1: 7.39207	valid_0's l2: 196.675
[173]	valid_0's l1: 7.35638	valid_0's l2: 195.327
[174]	valid_0's l1: 7.32259	valid_0's l2: 194.19
[175]	valid_0's l1: 7.28897	valid_0's l2: 192.894
[176]	valid_0's l1: 7.25716	valid_0's l2: 191.681
[177]	valid_0's l1: 7.22495	valid_0's l2: 190.619
[178]	valid_0's l1: 7.19263	valid_0's l2: 189.456
[179]	valid_0's l1: 7.16066	valid_0's l2: 188.316
[180]	valid_0's l1: 7.12927	valid_0's l2: 187.143
[181]	valid_0's l1: 7.09891	valid_0's l2: 186.023
[182]	valid_0's l1: 7.06714	valid_0's l2: 184.845
[183]	valid_0's l1: 7.03788	valid_0's l2: 183.873
[184]	valid_0's l1: 7.0072	valid_0's l2: 182.759
[185]	valid_0's l1: 6.97745	valid_0's l2: 181.705
[186]	valid_0's l1: 6.94943	valid_0's l2: 180.735
[187]	valid_0's l1: 6.92027	valid_0's l2: 179.693
[188]

[343]	valid_0's l1: 5.35754	valid_0's l2: 126.049
[344]	valid_0's l1: 5.35593	valid_0's l2: 125.943
[345]	valid_0's l1: 5.35364	valid_0's l2: 125.874
[346]	valid_0's l1: 5.35278	valid_0's l2: 125.784
[347]	valid_0's l1: 5.3509	valid_0's l2: 125.704
[348]	valid_0's l1: 5.349	valid_0's l2: 125.576
[349]	valid_0's l1: 5.34744	valid_0's l2: 125.518
[350]	valid_0's l1: 5.34518	valid_0's l2: 125.454
[351]	valid_0's l1: 5.344	valid_0's l2: 125.369
[352]	valid_0's l1: 5.34175	valid_0's l2: 125.286
[353]	valid_0's l1: 5.33998	valid_0's l2: 125.161
[354]	valid_0's l1: 5.33745	valid_0's l2: 125.079
[355]	valid_0's l1: 5.33589	valid_0's l2: 125.006
[356]	valid_0's l1: 5.33432	valid_0's l2: 124.917
[357]	valid_0's l1: 5.33275	valid_0's l2: 124.858
[358]	valid_0's l1: 5.33138	valid_0's l2: 124.769
[359]	valid_0's l1: 5.32944	valid_0's l2: 124.703
[360]	valid_0's l1: 5.32731	valid_0's l2: 124.641
[361]	valid_0's l1: 5.32637	valid_0's l2: 124.555
[362]	valid_0's l1: 5.32458	valid_0's l2: 124.483
[363]

[520]	valid_0's l1: 5.19795	valid_0's l2: 116.245
[521]	valid_0's l1: 5.19809	valid_0's l2: 116.209
[522]	valid_0's l1: 5.1976	valid_0's l2: 116.189
[523]	valid_0's l1: 5.19668	valid_0's l2: 116.159
[524]	valid_0's l1: 5.19673	valid_0's l2: 116.101
[525]	valid_0's l1: 5.19661	valid_0's l2: 116.089
[526]	valid_0's l1: 5.19569	valid_0's l2: 116.046
[527]	valid_0's l1: 5.19605	valid_0's l2: 116.003
[528]	valid_0's l1: 5.19608	valid_0's l2: 115.944
[529]	valid_0's l1: 5.19527	valid_0's l2: 115.922
[530]	valid_0's l1: 5.19452	valid_0's l2: 115.895
[531]	valid_0's l1: 5.19407	valid_0's l2: 115.858
[532]	valid_0's l1: 5.19434	valid_0's l2: 115.812
[533]	valid_0's l1: 5.19356	valid_0's l2: 115.769
[534]	valid_0's l1: 5.19411	valid_0's l2: 115.739
[535]	valid_0's l1: 5.19414	valid_0's l2: 115.687
[536]	valid_0's l1: 5.19369	valid_0's l2: 115.659
[537]	valid_0's l1: 5.19377	valid_0's l2: 115.62
[538]	valid_0's l1: 5.19342	valid_0's l2: 115.61
[539]	valid_0's l1: 5.19352	valid_0's l2: 115.543
[54
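The LightGBM run above stops once the monitored scores have not improved for 5 rounds. The same idea can be sketched with scikit-learn alone, using `GradientBoostingRegressor` with `n_iter_no_change` and `validation_fraction`. This is a swapped-in technique rather than the LightGBM call itself, and the data and the names `X_demo`, `y_demo` and `es_reg` are made up for illustration. Note that the cell above passes `(X_test, y_test)` as the monitored set, whereas this sketch holds out part of the training data and leaves the test set untouched for the final evaluation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Boston crime features).
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(500, 4))
y_demo = 10 * np.sin(X_demo[:, 0]) + 5 * X_demo[:, 1] + rng.normal(0, 1, 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# validation_fraction holds out 10% of the training data; boosting stops once
# the held-out loss has not improved for n_iter_no_change consecutive rounds.
es_reg = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1,
                                   n_iter_no_change=5, validation_fraction=0.1,
                                   random_state=0)
es_reg.fit(X_tr, y_tr)

print("Trees actually fitted:", es_reg.n_estimators_)
print("Test R2:", es_reg.score(X_te, y_te))
```

`n_estimators_` reports how many boosting rounds were kept, which plays a role similar to `best_iteration_` in the LightGBM cell.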

<a id = '2.5'></a>
[Return to top](#top)
## 2.6 KNeighborsRegressor, RadiusNeighborsRegressor

In [164]:
kn_reg = KNeighborsRegressor(n_neighbors=15)
kn_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(kn_reg, X_train, X_val, y_train)
scores(kn_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=15, p=2,
          weights='uniform')
______________________________________________________________________________
EV score. Train:  0.8788201260222026
EV score. Test:  0.8510219200568501
---------
R2 score. Train:  0.8788186529821859
R2 score. Test:  0.85099129024237
---------
MSE score. Train:  241.00560735390675
MSE score. Test:  297.38505686789154
---------
MAE score. Train:  9.043838914423286
MAE score. Test:  9.609448818897638
---------
MdAE score. Train:  5.800000000000001
MdAE score. Test:  6.266666666666664


In [165]:
rn_reg = RadiusNeighborsRegressor()
rn_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(rn_reg, X_train, X_val, y_train)
scores(rn_reg, y_train, y_val, y_train_reg, y_val_reg)



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
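A likely cause of this error (an assumption, since the traceback is not shown) is that with the default `radius=1.0` some samples have no training neighbor within the radius, so their prediction is NaN and the scoring functions reject it. A minimal sketch on made-up data, with hypothetical names `X_demo`, `y_demo` and `X_new`:

```python
import warnings

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

# Synthetic data (assumption): 200 training points inside [0, 10]^5.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 5))
y_demo = X_demo.sum(axis=1)

# A query point far away from every training point.
X_new = np.array([[100.0] * 5])

# With a small radius the query has no neighbors, so the prediction is NaN.
small = RadiusNeighborsRegressor(radius=1.0).fit(X_demo, y_demo)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the "no neighbors" warning
    pred_small = small.predict(X_new)

# A radius large enough to reach the training points gives a finite value.
large = RadiusNeighborsRegressor(radius=250.0).fit(X_demo, y_demo)
pred_large = large.predict(X_new)

print(pred_small, pred_large)
```

If this is indeed the cause, choosing `radius` according to the scale of the features should let the regressor be scored like the others.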

<a id = '2.6'></a>
[Return to top](#top)
## 2.7 DecisionTreeRegressor

In [166]:
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(dt_reg, X_train, X_val, y_train)
scores(dt_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
______________________________________________________________________________
EV score. Train:  1.0
EV score. Test:  0.9205177432206231
---------
R2 score. Train:  1.0
R2 score. Test:  0.9205137529679716
---------
MSE score. Train:  0.0
MSE score. Test:  158.63517060367454
---------
MAE score. Train:  0.0
MAE score. Test:  7.044619422572178
---------
MdAE score. Train:  0.0
MdAE score. Test:  4.0


<a id = '2.7'></a>
[Return to top](#top)
## 2.8 Ridge, RidgeCV, BayesianRidge

In [167]:
rid_reg = Ridge()
rid_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(rid_reg, X_train, X_val, y_train)
scores(rid_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
______________________________________________________________________________
EV score. Train:  0.9145347104027429
EV score. Test:  0.882768654058932
---------
R2 score. Train:  0.9145347104027429
R2 score. Test:  0.8827680888111018
---------
MSE score. Train:  169.9734697951207
MSE score. Test:  233.96631399834618
---------
MAE score. Train:  7.46700585346394
MAE score. Test:  8.075628439716919
---------
MdAE score. Train:  4.585740758048404
MdAE score. Test:  4.786096187970363


In [168]:
ric_reg = RidgeCV()
ric_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ric_reg, X_train, X_val, y_train)
scores(ric_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
______________________________________________________________________________
EV score. Train:  0.9145408245979972
EV score. Test:  0.8827918055209272
---------
R2 score. Train:  0.9145408245979972
R2 score. Test:  0.8827912307626803
---------
MSE score. Train:  169.9613098763129
MSE score. Test:  233.92012830492345
---------
MAE score. Train:  7.471854301634164
MAE score. Test:  8.084360713438809
---------
MdAE score. Train:  4.589682625790569
MdAE score. Test:  4.772476909789841


In [169]:
br_reg = BayesianRidge()
br_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(br_reg, X_train, X_val, y_train)
scores(br_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.9145400405534609
EV score. Test:  0.8827858697472574
---------
R2 score. Train:  0.9145400405534609
R2 score. Test:  0.8827852978420795
---------
MSE score. Train:  169.9628691850211
MSE score. Test:  233.93196896801706
---------
MAE score. Train:  7.470338527172346
MAE score. Test:  8.081688849244465
---------
MdAE score. Train:  4.587430666085272
MdAE score. Test:  4.764661097083625


<a id = '2.8'></a>
[Return to top](#top)
## 2.9 HuberRegressor, TheilSenRegressor, RANSACRegressor

In [172]:
hu_reg = HuberRegressor()
hu_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(hu_reg, X_train, X_val, y_train)
scores(hu_reg, y_train, y_val, y_train_reg, y_val_reg)

ValueError: HuberRegressor convergence failed: l-BFGS-b solver terminated with ABNORMAL_TERMINATION_IN_LNSRCH
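Convergence failures like this one are often related to features on very different scales or to too few iterations. Whether that is the cause here is an assumption. A minimal sketch on made-up data (`X_demo`, `y_demo` and `hu_pipe` are hypothetical names) that standardizes the features and raises `max_iter`:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three features on very different scales.
rng = np.random.RandomState(0)
X_demo = rng.normal(0, 1, size=(300, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.0003]) + rng.normal(0, 0.5, 300)

# Standardize first, then fit the Huber loss with a larger iteration budget.
hu_pipe = make_pipeline(StandardScaler(), HuberRegressor(max_iter=1000))
hu_pipe.fit(X_demo, y_demo)

print("Train R2:", hu_pipe.score(X_demo, y_demo))
```

The pipeline applies the same scaling at prediction time, so it can be passed to the scoring helpers in place of the bare regressor.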

In [173]:
ts_reg = TheilSenRegressor()
ts_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ts_reg, X_train, X_val, y_train)
scores(ts_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
TheilSenRegressor(copy_X=True, fit_intercept=True, max_iter=300,
         max_subpopulation=10000, n_jobs=None, n_subsamples=None,
         random_state=None, tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.9140037764667392
EV score. Test:  0.8816885743768864
---------
R2 score. Train:  0.913968955575851
R2 score. Test:  0.8816661051029172
---------
MSE score. Train:  171.09864366902104
MSE score. Test:  236.16560482006412
---------
MAE score. Train:  7.265004004882804
MAE score. Test:  7.841129494372931
---------
MdAE score. Train:  4.303513460283774
MdAE score. Test:  4.408136681506226


In [174]:
ran_reg = RANSACRegressor()
ran_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(ran_reg, X_train, X_val, y_train)
scores(ran_reg, y_train, y_val, y_train_reg, y_val_reg)

______________________________________________________________________________
RANSACRegressor(base_estimator=None, is_data_valid=None, is_model_valid=None,
        loss='absolute_loss', max_skips=inf, max_trials=100,
        min_samples=None, random_state=None, residual_threshold=None,
        stop_n_inliers=inf, stop_probability=0.99, stop_score=inf)
______________________________________________________________________________
EV score. Train:  -1.4966453012597644e+24
EV score. Test:  -1.4639210852020116e+24
---------
R2 score. Train:  -1.712540040434718e+24
R2 score. Test:  -1.669921956472948e+24
---------
MSE score. Train:  3.405901673152552e+27
MSE score. Test:  3.332757103919688e+27
---------
MAE score. Train:  20721286507857.785
MAE score. Test:  20276279657685.297
---------
MdAE score. Train:  3.23046875
MdAE score. Test:  3.640625
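The RANSAC scores above pair a small MdAE with an astronomically large MSE. That is the pattern we would expect if most predictions are fine and a handful are wildly off. A minimal sketch with made-up numbers shows how a single bad prediction produces it:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

# Made-up targets 1..10 and predictions that are all off by exactly 1.
y_true = np.arange(1.0, 11.0)
y_pred = y_true + 1.0

# Same predictions, except that one of them blows up.
y_pred_bad = y_pred.copy()
y_pred_bad[0] = 1e12

mse_ok = mean_squared_error(y_true, y_pred)
mdae_ok = median_absolute_error(y_true, y_pred)
mse_bad = mean_squared_error(y_true, y_pred_bad)
mdae_bad = median_absolute_error(y_true, y_pred_bad)

print("MSE :", mse_ok, "->", mse_bad)    # explodes because of one point
print("MdAE:", mdae_ok, "->", mdae_bad)  # the median is unchanged
```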


<a id = '2.9'></a>
[Return to top](#top)
## 2.10 MLPRegressor

In [175]:
mlp_reg = MLPRegressor(max_iter=3000)
mlp_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(mlp_reg, X_train, X_val, y_train)
scores(mlp_reg, y_train, y_val, y_train_reg, y_val_reg)



______________________________________________________________________________
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=3000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9808386218201977
EV score. Test:  0.9487174518816162
---------
R2 score. Train:  0.9808074545491883
R2 score. Test:  0.9486641365827317
---------
MSE score. Train:  38.1701572632328
MSE score. Test:  102.45386787482674
---------
MAE score. Train:  3.966820190595234
MAE score. Test:  6.041489458745662
---------
MdAE score. Train:  2.4904955212390756
MdAE score. Test

<a id = '2.10'></a>
[Return to top](#top)
## 2.11 SVR

In [176]:

# Note: degree only affects the 'poly' kernel; the default 'rbf' kernel ignores it.
svr_reg = SVR(degree=7)
svr_reg.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(svr_reg, X_train, X_val, y_train)
scores(svr_reg, y_train, y_val, y_train_reg, y_val_reg)



______________________________________________________________________________
SVR(C=1.0, cache_size=200, coef0=0.0, degree=7, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)
______________________________________________________________________________
EV score. Train:  0.3099166488249139
EV score. Test:  0.2921215679782172
---------
R2 score. Train:  0.20836396906827515
R2 score. Test:  0.19344291444113937
---------
MSE score. Train:  1574.4066816643797
MSE score. Test:  1609.6913069461746
---------
MAE score. Train:  17.07964623542133
MAE score. Test:  16.873520880698724
---------
MdAE score. Train:  3.243198098022847
MdAE score. Test:  3.6300865944170266
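The SVR scores above show a low MdAE but a poor R2, which suggests that large targets are underfit. Two plausible contributors (a guess, not a diagnosis) are unscaled features and the default `C=1.0`, which limits how strongly each training point can influence the fit. A minimal sketch on made-up data that scales the features and compares two values of `C`. All names here are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (assumption): targets spread over roughly -100..100.
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = 40 * np.sin(X_demo[:, 0]) + 20 * X_demo[:, 1] + rng.normal(0, 2, 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Same RBF kernel and scaling, two different regularization strengths.
svr_small_c = make_pipeline(StandardScaler(), SVR(C=1.0)).fit(X_tr, y_tr)
svr_large_c = make_pipeline(StandardScaler(), SVR(C=100.0)).fit(X_tr, y_tr)

r2_small_c = svr_small_c.score(X_te, y_te)
r2_large_c = svr_large_c.score(X_te, y_te)
print("R2 with C=1:  ", r2_small_c)
print("R2 with C=100:", r2_large_c)
```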


<a id = '3'></a>
[Return to top](#top)
# 3. Tuning hyperparameters

We will tune one ensemble method, the random forest, which scored well on the validation set above (R2 of about 0.954).

<a id = '3.1'></a>
[Return to top](#top)
## 3.1 Tuning random forests

In [179]:
# 98 candidate values for the number of trees: 40, 60, ..., 1980
rf_params = {"n_estimators": np.arange(40, 2000, 20)}

In [180]:
rf_reg = RandomForestRegressor()
regs = GridSearchCV(rf_reg, rf_params)
regs.fit(X_train, y_train)
y_train_reg, y_val_reg = regression(regs, X_train, X_val, y_train)
print(regs.best_estimator_)
scores(regs, y_train, y_val, y_train_reg, y_val_reg)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1140, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={
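Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_` and `cv_results_`, which make it easier to see how much the score really depends on `n_estimators`. A minimal sketch on made-up data with a much smaller grid. The names `demo_params` and `demo_search` are hypothetical, and `cv=3` is set explicitly as an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: not the Boston crime features).
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 3))
y_demo = 10 * np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 1, 200)

# A deliberately tiny grid so the search finishes quickly.
demo_params = {"n_estimators": [10, 30, 50]}
demo_search = GridSearchCV(RandomForestRegressor(random_state=0), demo_params,
                           cv=3, scoring="neg_mean_squared_error")
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)

# Mean cross-validated MSE for every candidate (scores are negated MSE).
for n, score in zip(demo_search.cv_results_["param_n_estimators"],
                    demo_search.cv_results_["mean_test_score"]):
    print(n, -score)
```

If the cross-validated scores are nearly flat across candidates, a smaller forest may be almost as good and much cheaper to train.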

In [181]:
rf_reg = RandomForestRegressor(n_estimators=1140)
rf_reg.fit(X_trainn, y_trainn)
y_trainn_reg, y_test_reg = regression(rf_reg, X_trainn, X_test, y_trainn)
scores(rf_reg, y_trainn, y_test, y_trainn_reg, y_test_reg)

______________________________________________________________________________
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1140, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
______________________________________________________________________________
EV score. Train:  0.9932817452476971
EV score. Test:  0.9523165082641395
---------
R2 score. Train:  0.9932815512677632
R2 score. Test:  0.9523163096763901
---------
MSE score. Train:  13.371063414227649
MSE score. Test:  93.30776531434365
---------
MAE score. Train:  2.0049282212885156
MAE score. Test:  5.288408521303259
---------
MdAE score. Train:  0.992543859649123
MdAE score. Test:  2.7214912280701755


Now it is time to train the final model on all the data and save it.

In [182]:
rf_reg_final = RandomForestRegressor(n_estimators=1140)
rf_reg_final.fit(X, y)

NameError: name 'pickle' is not defined

In [183]:
import pickle

# Use a context manager so the file is closed once the model is written.
with open('/Users/CatLover/Documents/DataScience/BostonCrime/final_predictor.p', 'wb') as f:
    pickle.dump(rf_reg_final, f)
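To use the saved model later, it can be loaded back with `pickle.load`. A minimal round-trip sketch on made-up data, written to a temporary directory rather than the path above. The names `demo_model`, `demo_path` and `restored` are hypothetical:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model (assumption: stands in for rf_reg_final).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = X_demo.sum(axis=1)
demo_model = RandomForestRegressor(n_estimators=20, random_state=0)
demo_model.fit(X_demo, y_demo)

# Save to a temporary file, then load it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_predictor.p")
with open(demo_path, "wb") as f:
    pickle.dump(demo_model, f)
with open(demo_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions match:", same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, so it is worth recording that version next to the file.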