#### Food Inspection Data Set Description
"The Health Division of the Department of Inspectional Services ensures that all food establishments in the City of Boston meet relevant sanitary codes and standards. Businesses that serve food are inspected at least once a year, and follow-up inspections are performed on high risk establishments. Health inspections are also conducted in response to complaints of unsanitary conditions or illness."  *(Source - https://data.boston.gov)*

Each violation is assigned a severity level (the `viollevel` column), marked with one to three asterisks:

1) `*` Minor Violation

2) `**` Major Violation

3) `***` Severe Violation

*(Source - https://restaurantprediction.weebly.com)*
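
The severity markers can be turned into readable, ordered labels. The following is a minimal sketch on toy rows (not real inspection records); the `severity` column name and the `LEVEL_LABELS` dictionary are assumptions introduced for illustration only.

```python
import pandas as pd

# Labels follow the list above; 'viollevel' is the column name used in the data set.
LEVEL_LABELS = {"*": "Minor", "**": "Major", "***": "Severe"}

# Toy rows for illustration only.
toy = pd.DataFrame({"viollevel": ["*", "***", None, "**"]})

# map() leaves missing or unmapped markers as NaN.
toy["severity"] = toy["viollevel"].map(LEVEL_LABELS)

# An ordered categorical makes sorting and comparisons respect Minor < Major < Severe.
toy["severity"] = pd.Categorical(
    toy["severity"], categories=["Minor", "Major", "Severe"], ordered=True
)
print(toy)
```

Keeping the labels ordered means that aggregations such as the worst violation per inspection can be expressed with `max()`.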


#### Data Cleaning Summary
1) Convert attributes to appropriate data type and format

    a) licenseno and property_id to object from int and float, respectively
    
    b) the four -dttm columns and statusdate (5 columns in total) to datetime
    
    c) zip to 5-digit format
    
2) Filter data by "active" license

3) Keep only rows whose inspection result date ('resultdttm') falls in 2015 or later (to match df_crime)

4) Take care of missing values - **TO BE DONE**

    a) 21 rows have zip 00000. They belong to two businesses, both in Boston, MA, so the correct zip codes could potentially be filled in.
    
    b) No other rows necessarily need to be deleted.
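
The outstanding fill for step 4a could be approached with a per-business lookup. This is only a sketch under stated assumptions: the business names and zip codes below are placeholders rather than values from the data set, and `zip_lookup` is a hypothetical dictionary that would have to be researched by hand.

```python
import pandas as pd

# Toy rows; names and zip codes are placeholders, not real records.
toy = pd.DataFrame({
    "businessname": ["Business A", "Business A", "Business B", "Business C"],
    "zip": ["00000", "00000", "00000", "02116"],
})

# Hypothetical lookup from business name to its researched zip code.
zip_lookup = {"Business A": "02108", "Business B": "02110"}

is_placeholder = toy["zip"] == "00000"

# Look up a replacement only for placeholder rows; rows without a known
# replacement keep the 00000 marker so they remain easy to find later.
filled = toy.loc[is_placeholder, "businessname"].map(zip_lookup)
toy.loc[is_placeholder, "zip"] = filled.fillna("00000")
print(toy)
```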

#### Current Final Clean Data Set Information

1) Name = dfins_activefrm15

2) Shape = 171,293 rows x 26 columns

In [161]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
pd.set_option('display.max_columns', None)  # display all columns on screen

In [162]:
df_ins = pd.read_csv(r'C:\Users\ale\Desktop\MIST6150\Project\df_ins.csv')

In [163]:
df_ins.head(3)

Unnamed: 0,businessname,dbaname,legalowner,namelast,namefirst,licenseno,issdttm,expdttm,licstatus,licensecat,descript,result,resultdttm,violation,viollevel,violdesc,violdttm,violstatus,statusdate,comments,address,city,state,zip,property_id,location
0,100 Percent Delicia Food,,BRENNAN PATRICK E,Marte,Civelis,87059,2013-04-05 12:47:23,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2013-02-15 12:19:42,15-4-202.16,*,Non-Food Contact Surfaces,2013-02-15 12:19:42,Fail,,Provide glass storage rack.,635 Hyde Park AVE,Roslindale,MA,2131,77476.0,"(42.278590000, -71.119440000)"
1,100 Percent Delicia Food,,BRENNAN PATRICK E,Marte,Civelis,87059,2013-04-05 12:47:23,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2013-02-15 12:19:42,16-4-501.11/.15,*,Dishwashng Facilities,2013-02-15 12:19:42,Fail,,Provide dish rack over 3 bay sink to replace m...,635 Hyde Park AVE,Roslindale,MA,2131,77476.0,"(42.278590000, -71.119440000)"
2,100 Percent Delicia Food,,BRENNAN PATRICK E,Marte,Civelis,87059,2013-04-05 12:47:23,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2013-02-15 12:19:42,16-4-501.11/.15,*,Dishwashng Facilities,2013-02-15 12:19:42,Fail,,Provide dish washer at ware washing area.,635 Hyde Park AVE,Roslindale,MA,2131,77476.0,"(42.278590000, -71.119440000)"


In [164]:
df_ins.shape

(562720, 26)

In [165]:
df_ins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562720 entries, 0 to 562719
Data columns (total 26 columns):
businessname    562720 non-null object
dbaname         5426 non-null object
legalowner      392419 non-null object
namelast        562720 non-null object
namefirst       562606 non-null object
licenseno       562720 non-null int64
issdttm         562720 non-null object
expdttm         562720 non-null object
licstatus       562720 non-null object
licensecat      562720 non-null object
descript        562720 non-null object
result          562720 non-null object
resultdttm      562720 non-null object
violation       530897 non-null object
viollevel       530897 non-null object
violdesc        530896 non-null object
violdttm        562720 non-null object
violstatus      530897 non-null object
statusdate      233765 non-null object
comments        562716 non-null object
address         562720 non-null object
city            562694 non-null object
state           562720 non-null ob

In [166]:
# work on a copy (df_ins_clean) so the raw df_ins stays unchanged
df_ins_clean = df_ins.copy()

In [167]:
# change data type of "issdttm", "expdttm", "resultdttm", "violdttm", and "statusdate" to date/time 
df_ins_clean['issdttm']=pd.to_datetime(df_ins['issdttm'], format='%Y-%m-%d', errors = 'coerce')
df_ins_clean['expdttm']=pd.to_datetime(df_ins['expdttm'], format='%Y-%m-%d', errors = 'coerce')
df_ins_clean['resultdttm']=pd.to_datetime(df_ins['resultdttm'], format='%Y-%m-%d', errors = 'coerce')
df_ins_clean['violdttm']=pd.to_datetime(df_ins['violdttm'], format='%Y-%m-%d', errors = 'coerce')
df_ins_clean['statusdate']=pd.to_datetime(df_ins['statusdate'], format='%Y-%m-%d', errors = 'coerce')
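
The five conversions can also be written as a loop. The sketch below runs on a toy frame and spells out the time component in the format string, because the sample rows look like `2013-04-05 12:47:23`. Whether every row in the real file follows that layout is an assumption, and recent pandas releases match an explicit `format` more strictly than older ones, so the number of coerced `NaT` values is worth checking after the conversion.

```python
import pandas as pd

DATE_COLUMNS = ["issdttm", "expdttm", "resultdttm", "violdttm", "statusdate"]

# Toy frame: one well-formed timestamp, one malformed value, one missing value.
toy = pd.DataFrame(
    {col: ["2013-04-05 12:47:23", "not a date", None] for col in DATE_COLUMNS}
)

for col in DATE_COLUMNS:
    # errors="coerce" turns anything that does not match the format into NaT.
    toy[col] = pd.to_datetime(
        toy[col], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )

# Count how many values could not be parsed in one of the columns.
n_unparsed = toy["issdttm"].isna().sum()
print(toy.dtypes)
print("unparsed issdttm values:", n_unparsed)
```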

In [168]:
# change data type of "licenseno" from integer to object, and "property_id" from float to object
df_ins_clean['licenseno']=df_ins['licenseno'].astype(str)
df_ins_clean['property_id']=df_ins['property_id'].astype(str)
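
One side effect of casting a float column with `astype(str)` is that values keep their trailing `.0` and, in the pandas version used here, missing values are likely to become the literal text `nan`, which a later null check no longer counts as missing. A possible alternative, sketched on a toy series and assuming every non-missing `property_id` is a whole number, goes through the nullable integer type first:

```python
import numpy as np
import pandas as pd

# Toy values shaped like property_id: whole numbers stored as floats, plus a gap.
toy = pd.Series([77476.0, np.nan, 1234.0], name="property_id")

# Plain string cast: keeps the '.0' suffix on every value.
as_plain_str = toy.astype(str)

# Nullable integer first, then the string dtype: '77476', <NA>, '1234'.
as_id = toy.astype("Int64").astype("string")

print(as_plain_str.tolist())
print(as_id.tolist())
```

With this variant missing identifiers stay visible to `isnull()` instead of being hidden inside a string.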

In [169]:
# change zip to correct 5-digit format
df_ins_clean['zip']=df_ins['zip'].astype(str).str.zfill(5)
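
A quick check after padding can confirm that every zip is exactly five digits and show how many rows carry the `00000` placeholder. A small sketch on toy values, assuming the column was read as integers:

```python
import pandas as pd

# Toy integers as they would appear after read_csv drops the leading zero.
raw = pd.Series([2131, 2116, 0], name="zip")

# Same padding step as in the cell above.
zip5 = raw.astype(str).str.zfill(5)

# True where the value consists of exactly five digits.
is_valid = zip5.str.fullmatch(r"\d{5}")

# Rows that only hold the all-zero placeholder.
is_placeholder = zip5 == "00000"

print(zip5.tolist())
print("all valid:", bool(is_valid.all()), "| placeholders:", int(is_placeholder.sum()))
```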

In [170]:
# confirm data type change for df_ins_clean
df_ins_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562720 entries, 0 to 562719
Data columns (total 26 columns):
businessname    562720 non-null object
dbaname         5426 non-null object
legalowner      392419 non-null object
namelast        562720 non-null object
namefirst       562606 non-null object
licenseno       562720 non-null object
issdttm         562607 non-null datetime64[ns]
expdttm         562449 non-null datetime64[ns]
licstatus       562720 non-null object
licensecat      562720 non-null object
descript        562720 non-null object
result          562720 non-null object
resultdttm      556323 non-null datetime64[ns]
violation       530897 non-null object
viollevel       530897 non-null object
violdesc        530896 non-null object
violdttm        530894 non-null datetime64[ns]
violstatus      530897 non-null object
statusdate      233765 non-null datetime64[ns]
comments        562716 non-null object
address         562720 non-null object
city            562694 non-null 

In [171]:
df_ins_clean.head(3)

Unnamed: 0,businessname,dbaname,legalowner,namelast,namefirst,licenseno,issdttm,expdttm,licstatus,licensecat,descript,result,resultdttm,violation,viollevel,violdesc,violdttm,violstatus,statusdate,comments,address,city,state,zip,property_id,location
0,100 Percent Delicia Food,,BRENNAN PATRICK E,Marte,Civelis,87059,2013-04-05 12:47:23,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2013-02-15 12:19:42,15-4-202.16,*,Non-Food Contact Surfaces,2013-02-15 12:19:42,Fail,NaT,Provide glass storage rack.,635 Hyde Park AVE,Roslindale,MA,2131,77476.0,"(42.278590000, -71.119440000)"
1,100 Percent Delicia Food,,BRENNAN PATRICK E,Marte,Civelis,87059,2013-04-05 12:47:23,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2013-02-15 12:19:42,16-4-501.11/.15,*,Dishwashng Facilities,2013-02-15 12:19:42,Fail,NaT,Provide dish rack over 3 bay sink to replace m...,635 Hyde Park AVE,Roslindale,MA,2131,77476.0,"(42.278590000, -71.119440000)"
2,100 Percent Delicia Food,,BRENNAN PATRICK E,Marte,Civelis,87059,2013-04-05 12:47:23,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2013-02-15 12:19:42,16-4-501.11/.15,*,Dishwashng Facilities,2013-02-15 12:19:42,Fail,NaT,Provide dish washer at ware washing area.,635 Hyde Park AVE,Roslindale,MA,2131,77476.0,"(42.278590000, -71.119440000)"


In [172]:
df_ins_clean.describe()

Unnamed: 0,businessname,dbaname,legalowner,namelast,namefirst,licenseno,issdttm,expdttm,licstatus,licensecat,descript,result,resultdttm,violation,viollevel,violdesc,violdttm,violstatus,statusdate,comments,address,city,state,zip,property_id,location
count,562720,5426,392419.0,562720,562606.0,562720.0,562607,562449,562720,562720,562720,562720,556323,530897,530897,530896,530894,530897,233765,562716.0,562720,562694,562720,562720.0,562720.0,402155
unique,7060,97,2648.0,6479,3060.0,8304.0,8276,17,3,4,4,17,114094,92,5,89,87638,3,29825,225545.0,4540,56,4,43.0,3666.0,3140
top,Subway,LCY Inc.,,CVS PHARMACY INC.,,23987.0,2012-03-12 13:27:51,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2012-12-30 00:00:00,23-4-602.13,*,Non-Food Contact Surfaces Clean,2007-11-02 00:00:00,Fail,2013-10-29 15:39:07,,1 Citywide ST,Boston,MA,2116.0,,"(42.355830000, -71.060400000)"
freq,2503,371,31498.0,2287,267285.0,881.0,881,346837,355029,268680,268680,238342,4770,42724,402223,42724,121,291534,88,56996.0,4368,210887,550572,44223.0,129834.0,2244
first,,,,,,,2007-01-01 15:15:05,2007-12-31 00:00:00,,,,,2006-04-04 08:49:18,,,,2006-11-21 00:00:00,,2006-12-27 00:00:00,,,,,,,
last,,,,,,,2019-03-07 10:40:45,2019-12-31 23:59:00,,,,,2019-03-08 13:54:29,,,,2019-03-08 13:54:29,,2019-03-08 15:42:26,,,,,,,


In [173]:
# filter by license = active
dfins_active=df_ins_clean[df_ins_clean['licstatus']=='Active']
dfins_active.shape

(355029, 26)

In [174]:
# keep rows with resultdttm >= 2015-01-01 (to match with df_crime)
dfins_activefrm15=dfins_active[dfins_active['resultdttm']>='2015']
dfins_activefrm15.shape

(171293, 26)
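
The two filters can be combined into one boolean mask. The sketch below uses toy rows and an explicit `pd.Timestamp` cutoff; it also shows that rows whose `resultdttm` is `NaT` drop out of the comparison, and it takes a `.copy()` so that later column assignments do not act on a view of the parent frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "licstatus": ["Active", "Inactive", "Active", "Active"],
    "resultdttm": pd.to_datetime(
        ["2014-12-31 23:59:59", "2016-03-01 09:00:00", "2015-01-02 10:07:21", None]
    ),
})

cutoff = pd.Timestamp("2015-01-01")

# NaT compares as False, so rows without a result date are excluded as well.
mask = (toy["licstatus"] == "Active") & (toy["resultdttm"] >= cutoff)

# copy() gives an independent frame that is safe to modify afterwards.
subset = toy.loc[mask].copy()
print(subset)
```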

In [175]:
dfins_activefrm15.describe()

Unnamed: 0,businessname,dbaname,legalowner,namelast,namefirst,licenseno,issdttm,expdttm,licstatus,licensecat,descript,result,resultdttm,violation,viollevel,violdesc,violdttm,violstatus,statusdate,comments,address,city,state,zip,property_id,location
count,171293,959,112516.0,171293,171293.0,171293.0,171226,171226,171293,171293,171293,171293,171293,165314,165314,165314,165314,165314,74373,171293.0,171293,171293,171293,171293.0,171293.0,126411
unique,3477,18,1552.0,2955,1542.0,3840.0,3824,3,1,4,4,12,36181,90,3,88,30223,2,9566,76645.0,2881,47,4,36.0,2504.0,2075
top,Subway,1844 Inc.,,CVS PHARMACY INC.,,137896.0,2015-06-30 08:46:46,2019-12-31 23:59:00,Active,FS,Eating & Drinking,HE_Fail,2018-11-15 07:53:49,23-4-602.13,*,Non-Food Contact Surfaces Clean,2018-11-15 07:53:49,Fail,2017-06-30 15:24:18,,1 Citywide ST,Boston,MA,2116.0,,"(42.285670000, -71.155480000)"
freq,919,177,12445.0,1052,74269.0,415.0,415,164383,171293,86370,86370,74759,50,13142,123698,13142,50,90942,57,7507.0,2437,66769,168888,16646.0,27157.0,576
first,,,,,,,2011-11-02 14:01:20,2016-12-31 23:59:00,,,,,2015-01-02 10:07:21,,,,2007-06-05 14:20:00,,2015-01-02 15:33:37,,,,,,,
last,,,,,,,2019-03-07 10:40:45,2019-12-31 23:59:00,,,,,2019-03-08 13:54:29,,,,2019-03-08 13:54:29,,2019-03-08 15:42:26,,,,,,,


In [184]:
dfins_activefrm15.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171293 entries, 16 to 562719
Data columns (total 26 columns):
businessname    171293 non-null object
dbaname         959 non-null object
legalowner      112516 non-null object
namelast        171293 non-null object
namefirst       171293 non-null object
licenseno       171293 non-null object
issdttm         171226 non-null datetime64[ns]
expdttm         171226 non-null datetime64[ns]
licstatus       171293 non-null object
licensecat      171293 non-null object
descript        171293 non-null object
result          171293 non-null object
resultdttm      171293 non-null datetime64[ns]
violation       165314 non-null object
viollevel       165314 non-null object
violdesc        165314 non-null object
violdttm        165314 non-null datetime64[ns]
violstatus      165314 non-null object
statusdate      74373 non-null datetime64[ns]
comments        171293 non-null object
address         171293 non-null object
city            171293 non-null o

In [177]:
# check for null data
dfins_activefrm15.isnull().sum()

businessname         0
dbaname         170334
legalowner       58777
namelast             0
namefirst            0
licenseno            0
issdttm             67
expdttm             67
licstatus            0
licensecat           0
descript             0
result               0
resultdttm           0
violation         5979
viollevel         5979
violdesc          5979
violdttm          5979
violstatus        5979
statusdate       96920
comments             0
address              0
city                 0
state                0
zip                  0
property_id          0
location         44882
dtype: int64
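
Raw counts are easier to judge next to the share of rows they represent (for instance, `dbaname` is missing in roughly 99% of the 171,293 rows). A minimal sketch on a toy frame; the `summary` table and its column names are introduced here for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "dbaname": [None, None, None, "x"],
    "zip": ["02116"] * 4,
    "location": [None, "a", "b", "c"],
})

# Missing count and percentage per column.
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```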

In [178]:
# there are 35 unique zip codes (excluding the 00000 placeholder)
dfins_activefrm15.groupby(['zip']).count()

zip,businessname,dbaname,legalowner,namelast,namefirst,licenseno,issdttm,expdttm,licstatus,licensecat,descript,result,resultdttm,violation,viollevel,violdesc,violdttm,violstatus,statusdate,comments,address,city,state,property_id,location
0,21,0,0,21,21,21,21,21,21,21,21,21,21,16,16,16,16,16,7,21,21,21,21,21,0
2108,5357,0,3882,5357,5357,5357,5357,5357,5357,5357,5357,5357,5357,5117,5117,5117,5117,5117,2354,5357,5357,5357,5357,5357,4603
2109,6177,0,3385,6177,6177,6177,6177,6177,6177,6177,6177,6177,6177,5946,5946,5946,5946,5946,2761,6177,6177,6177,6177,6177,3175
2110,4725,12,3211,4725,4725,4725,4725,4725,4725,4725,4725,4725,4725,4350,4350,4350,4350,4350,1976,4725,4725,4725,4725,4725,3683
2111,5079,132,3545,5079,5079,5079,5074,5074,5079,5079,5079,5079,5079,4707,4707,4707,4707,4707,2225,5079,5079,5079,5079,5079,4446
2113,4022,0,3260,4022,4022,4022,4021,4021,4022,4022,4022,4022,4022,3893,3893,3893,3893,3893,1765,4022,4022,4022,4022,4022,3630
2114,6618,151,4015,6618,6618,6618,6617,6617,6618,6618,6618,6618,6618,6211,6211,6211,6211,6211,2839,6618,6618,6618,6618,6618,5217
2115,10172,40,7670,10172,10172,10172,10172,10172,10172,10172,10172,10172,10172,9681,9681,9681,9681,9681,4396,10172,10172,10172,10172,10172,7448
2116,16646,50,12046,16646,16646,16646,16645,16645,16646,16646,16646,16646,16646,16234,16234,16234,16234,16234,7192,16646,16646,16646,16646,16646,14422
2117,94,0,94,94,94,94,94,94,94,94,94,94,94,94,94,94,94,94,43,94,94,94,94,94,94
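
When only the number of rows per zip and the number of distinct zip codes are needed, `value_counts()` and `nunique()` give a more compact view than a full `groupby(...).count()`. A sketch on toy values:

```python
import pandas as pd

toy = pd.DataFrame({"zip": ["00000", "02116", "02116", "02108", "02116"]})

# Rows per zip, ordered by zip code rather than by frequency.
rows_per_zip = toy["zip"].value_counts().sort_index()

# Distinct zip codes, ignoring the all-zero placeholder.
n_real_zips = toy.loc[toy["zip"] != "00000", "zip"].nunique()

print(rows_per_zip)
print("distinct zip codes:", n_real_zips)
```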


In [199]:
# entries with zip = 00000 (the only zip value that sorts below '02108')
dfins_activefrm15[dfins_activefrm15['zip']<'02108']

Unnamed: 0,businessname,dbaname,legalowner,namelast,namefirst,licenseno,issdttm,expdttm,licstatus,licensecat,descript,result,resultdttm,violation,viollevel,violdesc,violdttm,violstatus,statusdate,comments,address,city,state,zip,property_id,location
203955,Frog Pond Boston @ Boston Common,,,The Skating Club of Boston,,75274,2012-02-24 14:04:35,2019-12-31 23:59:00,Active,FT,Eating & Drinking w/ Take Out,HE_Filed,2015-06-30 11:39:06,23-4-602.13,*,Non-Food Contact Surfaces Clean,2015-06-30 11:39:06,Fail,NaT,clean drain compartments to fryers. Clean inte...,,,,0,,
203956,Frog Pond Boston @ Boston Common,,,The Skating Club of Boston,,75274,2012-02-24 14:04:35,2019-12-31 23:59:00,Active,FT,Eating & Drinking w/ Take Out,HE_Filed,2018-06-29 10:17:36,23-4-602.13,*,Non-Food Contact Surfaces Clean,2018-06-29 10:17:36,Fail,NaT,Clean to remove ice build up from dippin dots ...,,,,0,,
203969,Frog Pond Boston @ Boston Common,,,The Skating Club of Boston,,75274,2012-02-24 14:04:35,2019-12-31 23:59:00,Active,FT,Eating & Drinking w/ Take Out,HE_Pass,2016-12-02 12:42:09,,,,NaT,,NaT,,,,,0,,
203970,Frog Pond Boston @ Boston Common,,,The Skating Club of Boston,,75274,2012-02-24 14:04:35,2019-12-31 23:59:00,Active,FT,Eating & Drinking w/ Take Out,HE_Pass,2017-07-14 12:50:14,,,,NaT,,NaT,,,,,0,,
280679,LEGAL SEA FOOD,,,LEGAL SEA FOODS INC.,RISK BENEFITS,25105,2012-01-11 07:50:32,2019-12-31 23:59:00,Active,MFW,Mobile Food Walk On,HE_Fail,2018-06-07 08:27:17,,,,NaT,,NaT,,CITYWIDE,BOSTON,MA,0,,
280680,LEGAL SEA FOOD,,,LEGAL SEA FOODS INC.,RISK BENEFITS,25105,2012-01-11 07:50:32,2019-12-31 23:59:00,Active,MFW,Mobile Food Walk On,HE_Fail,2018-06-07 08:40:39,05-4-302.12,*,Food Thermometers Provided,2018-06-07 08:40:39,Fail,NaT,provide accurate cooks thermometer,CITYWIDE,BOSTON,MA,0,,
280681,LEGAL SEA FOOD,,,LEGAL SEA FOODS INC.,RISK BENEFITS,25105,2012-01-11 07:50:32,2019-12-31 23:59:00,Active,MFW,Mobile Food Walk On,HE_Fail,2018-06-07 08:40:39,14-4-202.11,*,Food Contact Surfaces Design,2018-06-07 08:40:39,Fail,NaT,provide extra serving utensils,CITYWIDE,BOSTON,MA,0,,
280682,LEGAL SEA FOOD,,,LEGAL SEA FOODS INC.,RISK BENEFITS,25105,2012-01-11 07:50:32,2019-12-31 23:59:00,Active,MFW,Mobile Food Walk On,HE_Fail,2018-06-07 08:40:39,17-4-302.14,*,Test Kit Provided,2018-06-07 08:40:39,Fail,NaT,provide proper test kit,CITYWIDE,BOSTON,MA,0,,
280683,LEGAL SEA FOOD,,,LEGAL SEA FOODS INC.,RISK BENEFITS,25105,2012-01-11 07:50:32,2019-12-31 23:59:00,Active,MFW,Mobile Food Walk On,HE_Fail,2018-06-07 08:40:39,27-5-103.11-.12,***,Hot and Cold Water,2018-06-07 08:40:39,Fail,NaT,repair handsink to provide hot and cold water ...,CITYWIDE,BOSTON,MA,0,,
280684,LEGAL SEA FOOD,,,LEGAL SEA FOODS INC.,RISK BENEFITS,25105,2012-01-11 07:50:32,2019-12-31 23:59:00,Active,MFW,Mobile Food Walk On,HE_Fail,2018-06-07 08:40:39,29-5-201/02.11,*,Installed and Maintained,2018-06-07 08:40:39,Fail,NaT,repair leak from under steamtable,CITYWIDE,BOSTON,MA,0,,
