## NYC Restaurant Inspections 
### Author: Jack Robbins

**Dataset Used**: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data

In [200]:
# Important imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

In [201]:
inspections = pd.read_csv("data/DOHMH_New_York_City_Restaurant_Inspection_Results_20241121.csv", low_memory=False)

In [202]:
inspections

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,...,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location Point1
0,50145878,ANOTHER COUNTRY,Manhattan,10,EAST 16 STREET,10003.0,9175411321,,01/01/1900,,...,,40.737151,-73.992263,105.0,2.0,5200.0,1016078.0,1.008430e+09,MN13,
1,50148522,OK CANAAN,Queens,4318,MAIN ST,11355.0,7188868844,,01/01/1900,,...,,40.751984,-73.826484,407.0,20.0,79702.0,4115474.0,4.051258e+09,QN22,
2,50118771,DELI PIZZA LUNCHEONETTE,Brooklyn,603,AVENUE Z,11223.0,3472071540,Pizza,02/23/2022,Violations were cited in the following area(s).,...,Pre-permit (Operational) / Initial Inspection,40.586061,-73.971655,313.0,47.0,37402.0,3195689.0,3.072130e+09,BK26,
3,50161894,PATOK,Manhattan,104,W 35TH ST,10001.0,9173276332,,01/01/1900,,...,,40.750616,-73.987711,105.0,4.0,10900.0,,1.000000e+00,MN17,
4,50114783,"HUDSON NIA JFK T1, JV",Queens,JFK,TERMINAL 1,11430.0,201 8218189,,01/01/1900,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262409,50016265,ADORO LEI,Manhattan,287,HUDSON STREET,10013.0,6466665096,Pizza,10/31/2024,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,40.725439,-74.007638,102.0,3.0,3700.0,1010311.0,1.005940e+09,MN24,
262410,50091187,Luv Pizza,Manhattan,485,7 AVENUE,10018.0,6466924282,Pizza,04/13/2022,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,40.752458,-73.989544,105.0,3.0,10900.0,1015235.0,1.008128e+09,MN17,
262411,50136479,CECCHI'S BAR & GRILL,Manhattan,105,WEST 13 STREET,10011.0,9172501151,American,06/02/2023,Violations were cited in the following area(s).,...,Pre-permit (Non-operational) / Initial Inspection,40.736882,-73.997842,102.0,3.0,7100.0,1010653.0,1.006090e+09,MN23,
262412,50095003,TWO BROTHERS PIZZA,Bronx,3039,BUHRE AVENUE,10461.0,7188246261,Pizza,11/04/2022,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,40.847437,-73.831102,210.0,13.0,26602.0,2046599.0,2.041960e+09,BX10,


In [203]:
# Let's get an idea of the shape of the dataframe
inspections.shape

(262414, 27)

In [204]:
null_values = inspections.isnull().sum()
print(null_values)

CAMIS                         0
DBA                           0
BORO                          0
BUILDING                    375
STREET                        3
ZIPCODE                    2684
PHONE                         3
CUISINE DESCRIPTION        3221
INSPECTION DATE               0
ACTION                     3221
VIOLATION CODE             4838
VIOLATION DESCRIPTION      4838
CRITICAL FLAG                 0
SCORE                     13310
GRADE                    136589
GRADE DATE               145829
RECORD DATE                   0
INSPECTION TYPE            3221
Latitude                    360
Longitude                   360
Community Board            3326
Council District           3311
Census Tract               3311
BIN                        4656
BBL                         645
NTA                        3326
Location Point1          262414
dtype: int64
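
Raw null counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `null_share` name are illustrative assumptions, not part of the notebook); `isnull().mean()` gives the fraction of missing values per column.

```python
import pandas as pd

# Toy frame standing in for `inspections`; the values are made up for illustration.
toy = pd.DataFrame({
    "SCORE": [12.0, None, 30.0, 7.0],
    "GRADE": ["A", None, None, None],
    "BORO": ["Queens", "Bronx", "Bronx", "Brooklyn"],
})

# The mean of a boolean mask is the fraction of True values,
# so this is the share of missing entries in each column.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Applied to the real dataframe, the same call would be `inspections.isnull().mean()`.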


### Let's drop unneeded columns

The columns BIN, BBL, NTA, and Location Point1 have no official description on the data page and are therefore of no use to us, so we'll get rid of them. We'll also remove the CAMIS, DBA, BUILDING, STREET, PHONE, VIOLATION DESCRIPTION, GRADE DATE, RECORD DATE, Latitude, and Longitude columns. These columns are documented, but they are not useful for this analysis, so it makes sense to remove them as well.

In [205]:
inspections.drop(['Location Point1', 'NTA', 'BBL', 'BIN', 'CAMIS', 'GRADE DATE', 'PHONE', 'Latitude', 'Longitude', 'DBA',
                  'BUILDING', 'STREET', 'VIOLATION DESCRIPTION', 'RECORD DATE'], axis=1, inplace=True)
null_values = inspections.isnull().sum()
print(null_values)

BORO                        0
ZIPCODE                  2684
CUISINE DESCRIPTION      3221
INSPECTION DATE             0
ACTION                   3221
VIOLATION CODE           4838
CRITICAL FLAG               0
SCORE                   13310
GRADE                  136589
INSPECTION TYPE          3221
Community Board          3326
Council District         3311
Census Tract             3311
dtype: int64


### Let's analyze these findings 
As we can see above, GRADE has a lot of null values (136,589 of 262,414 rows). The grade would be very interesting for us to look at, so instead of dropping the column, we'll try to fill those nulls.

In [206]:
inspections['SCORE'].describe()

count    249104.000000
mean         24.115743
std          18.176170
min           0.000000
25%          12.000000
50%          20.000000
75%          32.000000
max         168.000000
Name: SCORE, dtype: float64

In [207]:
# Let's remove all rows that have a null score
inspections.dropna(subset=['SCORE'], inplace=True)

### Filling in missing grades

According to the [NYC Department of Health](https://www.nyc.gov/assets/doh/downloads/pdf/about/healthcode/health-code-chapter23.pdf), the letter grade is determined by the score as follows:
* Grade A: 0-13 points scored
* Grade B: 14-27 points scored
* Grade C: >=28 points scored

We can use this to figure out what the grades are now

In [208]:
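def grade_from_score(score):
    """Map an inspection score to a letter grade using the thresholds above."""
    if score < 14:
        return 'A'
    elif score < 28:
        return 'B'
    else:
        return 'C'

# Derive the grade from the score for every row. Note that this overwrites
# any grade that was already present, not just the missing ones.
inspections['GRADE'] = inspections['SCORE'].astype(int).apply(grade_from_score)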
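
The same threshold mapping can also be written as a single binning step. This is a minimal sketch using `pd.cut` on a made-up series of scores (the `scores` and `grades` names are illustrative assumptions); it is an alternative to the helper function above, not what the notebook runs.

```python
import numpy as np
import pandas as pd

# Toy scores that sit on each side of the A/B/C boundaries.
scores = pd.Series([0, 13, 14, 27, 28, 168])

# Bins are right-inclusive by default:
# (-inf, 13] -> A, (13, 27] -> B, (27, inf) -> C.
grades = pd.cut(scores, bins=[-np.inf, 13, 27, np.inf], labels=["A", "B", "C"])
print(grades.tolist())
```

Note that `pd.cut` returns a categorical column rather than plain strings, which may or may not be what later steps expect.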

In [209]:
null_values = inspections.isnull().sum()
print(null_values)

BORO                      0
ZIPCODE                2523
CUISINE DESCRIPTION       0
INSPECTION DATE           0
ACTION                    0
VIOLATION CODE         1020
CRITICAL FLAG             0
SCORE                     0
GRADE                     0
INSPECTION TYPE           0
Community Board        3094
Council District       3079
Census Tract           3079
dtype: int64


In [210]:
inspections.shape

(249104, 13)

## Handling the rest of the nulls

At this point, we have a small number of rows that still contain null values. This is an acceptable loss for us, so we will go through and drop them.

In [211]:
inspections.dropna(how='any', inplace=True)
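
To check that the loss really is small, the row count can be compared before and after the drop. The following is a minimal sketch on a made-up toy frame (`toy`, `rows_before`, and `rows_dropped` are illustrative names, not from the notebook).

```python
import pandas as pd

# Toy frame standing in for `inspections`; the values are made up for illustration.
toy = pd.DataFrame({
    "ZIPCODE": [11223.0, None, 10013.0, 10461.0],
    "Census Tract": [37402.0, 3700.0, None, 26602.0],
    "SCORE": [36.0, 7.0, 29.0, 16.0],
})

rows_before = len(toy)
cleaned = toy.dropna(how="any")  # drop a row if any of its columns is null
rows_dropped = rows_before - len(cleaned)

print(f"dropped {rows_dropped} of {rows_before} rows ({rows_dropped / rows_before:.1%})")
```

For the real data, the notebook's own outputs show 249,104 rows before this step and 245,038 after it, a loss of 4,066 rows, or about 1.6%.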

In [212]:
# Let's see how we did
null_values = inspections.isnull().sum()
print(null_values)

BORO                   0
ZIPCODE                0
CUISINE DESCRIPTION    0
INSPECTION DATE        0
ACTION                 0
VIOLATION CODE         0
CRITICAL FLAG          0
SCORE                  0
GRADE                  0
INSPECTION TYPE        0
Community Board        0
Council District       0
Census Tract           0
dtype: int64


In [213]:
inspections.info()

<class 'pandas.core.frame.DataFrame'>
Index: 245038 entries, 2 to 262413
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   BORO                 245038 non-null  object 
 1   ZIPCODE              245038 non-null  float64
 2   CUISINE DESCRIPTION  245038 non-null  object 
 3   INSPECTION DATE      245038 non-null  object 
 4   ACTION               245038 non-null  object 
 5   VIOLATION CODE       245038 non-null  object 
 6   CRITICAL FLAG        245038 non-null  object 
 7   SCORE                245038 non-null  float64
 8   GRADE                245038 non-null  object 
 9   INSPECTION TYPE      245038 non-null  object 
 10  Community Board      245038 non-null  float64
 11  Council District     245038 non-null  float64
 12  Census Tract         245038 non-null  float64
dtypes: float64(5), object(8)
memory usage: 26.2+ MB


## Remaining cleanup

Even though we no longer have any null values in our dataframe, that does not mean there isn't still junk in there. Let's look at each feature and see what still needs to be cleaned up.

In [214]:
# Boro looks good to me
inspections['BORO'].unique()

array(['Brooklyn', 'Manhattan', 'Bronx', 'Queens', 'Staten Island'],
      dtype=object)

In [215]:
# Let's deal with zipcodes. Notice how zipcodes are floats, when in reality they should be ints. Let's fix this real quick
inspections['ZIPCODE'] = inspections['ZIPCODE'].astype(int)

In [216]:
# Let's take a look at cuisine description
inspections['CUISINE DESCRIPTION'].unique()

array(['Pizza', 'Seafood', 'American', 'Creole/Cajun', 'Sandwiches',
       'Mexican', 'Jewish/Kosher', 'Spanish', 'Latin American',
       'Bakery Products/Desserts', 'Chinese', 'Other', 'Caribbean',
       'Thai', 'Creole', 'Mediterranean', 'Italian', 'Filipino',
       'Tex-Mex', 'Salads', 'Korean', 'Pancakes/Waffles',
       'Frozen Desserts', 'Japanese', 'Portuguese', 'Russian', 'Irish',
       'Coffee/Tea', 'Bangladeshi', 'Indian', 'Hamburgers', 'Tapas',
       'Chicken', 'Asian/Asian Fusion', 'Indonesian', 'Greek', 'French',
       'German', 'Southeast Asian', 'Donuts', 'Soul Food',
       'Eastern European', 'Fusion', 'Chinese/Cuban', 'Vegan',
       'Middle Eastern', 'Vegetarian', 'Pakistani', 'Peruvian', 'Polish',
       'Bagels/Pretzels', 'Sandwiches/Salads/Mixed Buffet',
       'Juice, Smoothies, Fruit Salads', 'Chinese/Japanese', 'Steakhouse',
       'Bottled Beverages', 'Barbecue', 'African', 'Nuts/Confectionary',
       'English', 'Armenian', 'New American', 'Turkish', '

In [217]:
# How many rows have the placeholder cuisine 'Not Listed/Not Applicable'?
inspections[inspections['CUISINE DESCRIPTION'] == 'Not Listed/Not Applicable']

Unnamed: 0,BORO,ZIPCODE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,CRITICAL FLAG,SCORE,GRADE,INSPECTION TYPE,Community Board,Council District,Census Tract
2296,Brooklyn,11201,Not Listed/Not Applicable,06/24/2024,Violations were cited in the following area(s).,10G,Not Critical,25.0,B,Cycle Inspection / Initial Inspection,302.0,33.0,1500.0
3485,Brooklyn,11211,Not Listed/Not Applicable,12/14/2022,Violations were cited in the following area(s).,04K,Critical,11.0,A,Cycle Inspection / Initial Inspection,301.0,34.0,52300.0
5465,Queens,11101,Not Listed/Not Applicable,12/02/2021,Violations were cited in the following area(s).,04A,Critical,12.0,A,Pre-permit (Operational) / Initial Inspection,402.0,26.0,700.0
9358,Manhattan,10012,Not Listed/Not Applicable,07/11/2024,Violations were cited in the following area(s).,10F,Not Critical,32.0,C,Cycle Inspection / Initial Inspection,102.0,2.0,6500.0
10333,Brooklyn,11201,Not Listed/Not Applicable,12/19/2022,Violations were cited in the following area(s).,10F,Not Critical,3.0,A,Pre-permit (Operational) / Second Compliance I...,302.0,33.0,1500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
244034,Brooklyn,11201,Not Listed/Not Applicable,02/22/2022,Violations were cited in the following area(s).,04N,Critical,0.0,A,Pre-permit (Operational) / Initial Inspection,302.0,33.0,1500.0
247519,Brooklyn,11201,Not Listed/Not Applicable,02/22/2022,Violations were cited in the following area(s).,10H,Not Critical,0.0,A,Pre-permit (Operational) / Initial Inspection,302.0,33.0,1500.0
250994,Queens,11101,Not Listed/Not Applicable,05/21/2024,Violations were cited in the following area(s).,10F,Not Critical,9.0,A,Cycle Inspection / Initial Inspection,402.0,26.0,700.0
251180,Manhattan,10002,Not Listed/Not Applicable,10/20/2022,Violations were cited in the following area(s).,10F,Not Critical,13.0,A,Pre-permit (Operational) / Initial Inspection,103.0,1.0,1800.0


In [218]:
# We'll get rid of these rows because they're likely junk
indices_to_drop = inspections[inspections['CUISINE DESCRIPTION'] == 'Not Listed/Not Applicable'].index
inspections.drop(index=indices_to_drop, inplace=True)
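
The same filtering can be done with a boolean mask, which also makes it easy to count the affected rows first. This is a minimal sketch on a made-up toy frame (`toy`, `is_placeholder`, and `filtered` are illustrative names, not from the notebook).

```python
import pandas as pd

# Toy frame standing in for `inspections`; the values are made up for illustration.
toy = pd.DataFrame({
    "CUISINE DESCRIPTION": ["Pizza", "Not Listed/Not Applicable", "Thai", "Not Listed/Not Applicable"],
    "SCORE": [36, 25, 7, 11],
})

placeholder = "Not Listed/Not Applicable"

# True for every row whose cuisine is the placeholder value.
is_placeholder = toy["CUISINE DESCRIPTION"] == placeholder
n_placeholder = int(is_placeholder.sum())

# Keep only the rows where the mask is False.
filtered = toy[~is_placeholder]

print(n_placeholder)
print(filtered["CUISINE DESCRIPTION"].tolist())
```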

In [219]:
# All fixed now...
inspections['CUISINE DESCRIPTION'].unique()

array(['Pizza', 'Seafood', 'American', 'Creole/Cajun', 'Sandwiches',
       'Mexican', 'Jewish/Kosher', 'Spanish', 'Latin American',
       'Bakery Products/Desserts', 'Chinese', 'Other', 'Caribbean',
       'Thai', 'Creole', 'Mediterranean', 'Italian', 'Filipino',
       'Tex-Mex', 'Salads', 'Korean', 'Pancakes/Waffles',
       'Frozen Desserts', 'Japanese', 'Portuguese', 'Russian', 'Irish',
       'Coffee/Tea', 'Bangladeshi', 'Indian', 'Hamburgers', 'Tapas',
       'Chicken', 'Asian/Asian Fusion', 'Indonesian', 'Greek', 'French',
       'German', 'Southeast Asian', 'Donuts', 'Soul Food',
       'Eastern European', 'Fusion', 'Chinese/Cuban', 'Vegan',
       'Middle Eastern', 'Vegetarian', 'Pakistani', 'Peruvian', 'Polish',
       'Bagels/Pretzels', 'Sandwiches/Salads/Mixed Buffet',
       'Juice, Smoothies, Fruit Salads', 'Chinese/Japanese', 'Steakhouse',
       'Bottled Beverages', 'Barbecue', 'African', 'Nuts/Confectionary',
       'English', 'Armenian', 'New American', 'Turkish', '

### Dealing with Inspection Date
The inspection date by itself is fine, but I suspect that we won't be able to get trends out of day-by-day dates. Instead, we'll extract the month from each date and replace the entire column with an INSPECTION MONTH column.

In [220]:
# A helper function for us
def date_to_months(date):
    # Split along a /
    split = date.split("/")
    
    return int(split[0])

inspections['INSPECTION MONTH'] = inspections['INSPECTION DATE'].apply(date_to_months)
inspections.drop('INSPECTION DATE', axis=1, inplace=True)
inspections

Unnamed: 0,BORO,ZIPCODE,CUISINE DESCRIPTION,ACTION,VIOLATION CODE,CRITICAL FLAG,SCORE,GRADE,INSPECTION TYPE,Community Board,Council District,Census Tract,INSPECTION MONTH
2,Brooklyn,11223,Pizza,Violations were cited in the following area(s).,09B,Not Critical,36.0,C,Pre-permit (Operational) / Initial Inspection,313.0,47.0,37402.0,2
5,Brooklyn,11217,Seafood,Violations were cited in the following area(s).,09B,Not Critical,7.0,A,Pre-permit (Operational) / Initial Inspection,306.0,39.0,12901.0,11
51,Brooklyn,11214,American,Violations were cited in the following area(s).,09B,Not Critical,8.0,A,Cycle Inspection / Initial Inspection,313.0,47.0,34800.0,4
54,Brooklyn,11210,Creole/Cajun,Violations were cited in the following area(s).,02B,Critical,12.0,A,Cycle Inspection / Initial Inspection,318.0,45.0,74000.0,2
59,Brooklyn,11217,Sandwiches,Violations were cited in the following area(s).,04M,Critical,12.0,A,Cycle Inspection / Re-inspection,306.0,39.0,12901.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
262409,Manhattan,10013,Pizza,Violations were cited in the following area(s).,10F,Not Critical,29.0,C,Cycle Inspection / Initial Inspection,102.0,3.0,3700.0,10
262410,Manhattan,10018,Pizza,Violations were cited in the following area(s).,04J,Critical,17.0,B,Cycle Inspection / Initial Inspection,105.0,3.0,10900.0,4
262411,Manhattan,10011,American,Violations were cited in the following area(s).,10F,Not Critical,3.0,A,Pre-permit (Non-operational) / Initial Inspection,102.0,3.0,7100.0,6
262412,Bronx,10461,Pizza,Violations were cited in the following area(s).,08A,Not Critical,16.0,B,Cycle Inspection / Initial Inspection,210.0,13.0,26602.0,11
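
An alternative to splitting the string by hand is to parse the dates and read the month off the result. This is a minimal sketch, assuming every date follows the MM/DD/YYYY layout seen in the data; the `dates`, `parsed`, and `months` names are illustrative.

```python
import pandas as pd

# Sample dates in the same MM/DD/YYYY layout as the INSPECTION DATE column.
dates = pd.Series(["02/23/2022", "10/31/2024", "01/01/1900"])

# Parse with an explicit format so a malformed date raises instead of being guessed.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# The .dt accessor exposes the calendar parts of each timestamp.
months = parsed.dt.month
print(months.tolist())
```

Parsing first also keeps the year and day available through `.dt.year` and `.dt.day` if they turn out to be useful later.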


In [221]:
# Let's look at action. This looks fine to me, we'll encode it later
inspections['ACTION'].unique()

array(['Violations were cited in the following area(s).',
       'Establishment re-closed by DOHMH.',
       'Establishment re-opened by DOHMH.',
       'Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.',
       'No violations were recorded at the time of this inspection.'],
      dtype=object)

In [222]:
# Let's now look at violation code. These also look fine to me, and we'll encode them later
inspections['VIOLATION CODE'].unique()

array(['09B', '02B', '04M', '09C', '04L', '10J', '09E', '04K', '05B',
       '10C', '04A', '03I', '06F', '08A', '10B', '06C', '10F', '02G',
       '06D', '08C', '06E', '06A', '05D', '04N', '08B', '04J', '10D',
       '10H', '09A', '02H', '04C', '06B', '04H', '10G', '04F', '28-06',
       '05H', '10E', '05E', '04O', '10I', '10A', '05F', '28-05', '04P',
       '06G', '02A', '03A', '02C', '02I', '05C', '05A', '04D', '04E',
       '02F', '03C', '03B', '06I', '03E', '07A', '02D', '28-07', '06H',
       '04B', '09D', '03F', '03G', '04I', '03D', '22F', '18-11', '22G'],
      dtype=object)

In [223]:
# Let's have a look at critical_flag. Also looks completely fine to me
inspections['CRITICAL FLAG'].unique()

array(['Not Critical', 'Critical'], dtype=object)

In [224]:
# Let's just encode it right now
inspections['CRITICAL FLAG'] = inspections['CRITICAL FLAG'].apply(lambda x: 1 if x == 'Critical' else 0)
inspections

Unnamed: 0,BORO,ZIPCODE,CUISINE DESCRIPTION,ACTION,VIOLATION CODE,CRITICAL FLAG,SCORE,GRADE,INSPECTION TYPE,Community Board,Council District,Census Tract,INSPECTION MONTH
2,Brooklyn,11223,Pizza,Violations were cited in the following area(s).,09B,0,36.0,C,Pre-permit (Operational) / Initial Inspection,313.0,47.0,37402.0,2
5,Brooklyn,11217,Seafood,Violations were cited in the following area(s).,09B,0,7.0,A,Pre-permit (Operational) / Initial Inspection,306.0,39.0,12901.0,11
51,Brooklyn,11214,American,Violations were cited in the following area(s).,09B,0,8.0,A,Cycle Inspection / Initial Inspection,313.0,47.0,34800.0,4
54,Brooklyn,11210,Creole/Cajun,Violations were cited in the following area(s).,02B,1,12.0,A,Cycle Inspection / Initial Inspection,318.0,45.0,74000.0,2
59,Brooklyn,11217,Sandwiches,Violations were cited in the following area(s).,04M,1,12.0,A,Cycle Inspection / Re-inspection,306.0,39.0,12901.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
262409,Manhattan,10013,Pizza,Violations were cited in the following area(s).,10F,0,29.0,C,Cycle Inspection / Initial Inspection,102.0,3.0,3700.0,10
262410,Manhattan,10018,Pizza,Violations were cited in the following area(s).,04J,1,17.0,B,Cycle Inspection / Initial Inspection,105.0,3.0,10900.0,4
262411,Manhattan,10011,American,Violations were cited in the following area(s).,10F,0,3.0,A,Pre-permit (Non-operational) / Initial Inspection,102.0,3.0,7100.0,6
262412,Bronx,10461,Pizza,Violations were cited in the following area(s).,08A,0,16.0,B,Cycle Inspection / Initial Inspection,210.0,13.0,26602.0,11
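
The same 0/1 encoding can be produced without a lambda by comparing the whole column at once. This is a minimal sketch on a made-up series (`flags` and `encoded` are illustrative names).

```python
import pandas as pd

# Toy values matching the two categories seen in CRITICAL FLAG.
flags = pd.Series(["Not Critical", "Critical", "Critical", "Not Critical"])

# The comparison yields booleans; casting to int turns True/False into 1/0.
encoded = (flags == "Critical").astype(int)
print(encoded.tolist())
```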


In [225]:
# Let's now check in on score. It's stored as a float, but scores are whole numbers, so we'll convert it to int
inspections['SCORE'] = inspections['SCORE'].astype(int)

In [226]:
# Almost there, let's check the grade now. This looks good to me
inspections['GRADE'].unique()

array(['C', 'A', 'B'], dtype=object)

In [227]:
# Let's check inspection type. This looks good to me as well, we'll encode it later on
inspections['INSPECTION TYPE'].unique()

array(['Pre-permit (Operational) / Initial Inspection',
       'Cycle Inspection / Initial Inspection',
       'Cycle Inspection / Re-inspection',
       'Cycle Inspection / Reopening Inspection',
       'Pre-permit (Operational) / Re-inspection',
       'Pre-permit (Operational) / Compliance Inspection',
       'Pre-permit (Operational) / Reopening Inspection',
       'Cycle Inspection / Compliance Inspection',
       'Pre-permit (Non-operational) / Initial Inspection',
       'Pre-permit (Operational) / Second Compliance Inspection',
       'Inter-Agency Task Force / Initial Inspection',
       'Cycle Inspection / Second Compliance Inspection',
       'Pre-permit (Non-operational) / Re-inspection',
       'Pre-permit (Non-operational) / Second Compliance Inspection',
       'Pre-permit (Non-operational) / Compliance Inspection'],
      dtype=object)

In [228]:
# The community board, council district and census tract are all floats, and should be converted to ints. We'll do that right now
inspections['Council District'] = inspections['Council District'].astype(int)
inspections['Community Board'] = inspections['Community Board'].astype(int)
inspections['Census Tract'] = inspections['Census Tract'].astype(int)

In [229]:
inspections

Unnamed: 0,BORO,ZIPCODE,CUISINE DESCRIPTION,ACTION,VIOLATION CODE,CRITICAL FLAG,SCORE,GRADE,INSPECTION TYPE,Community Board,Council District,Census Tract,INSPECTION MONTH
2,Brooklyn,11223,Pizza,Violations were cited in the following area(s).,09B,0,36,C,Pre-permit (Operational) / Initial Inspection,313,47,37402,2
5,Brooklyn,11217,Seafood,Violations were cited in the following area(s).,09B,0,7,A,Pre-permit (Operational) / Initial Inspection,306,39,12901,11
51,Brooklyn,11214,American,Violations were cited in the following area(s).,09B,0,8,A,Cycle Inspection / Initial Inspection,313,47,34800,4
54,Brooklyn,11210,Creole/Cajun,Violations were cited in the following area(s).,02B,1,12,A,Cycle Inspection / Initial Inspection,318,45,74000,2
59,Brooklyn,11217,Sandwiches,Violations were cited in the following area(s).,04M,1,12,A,Cycle Inspection / Re-inspection,306,39,12901,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
262409,Manhattan,10013,Pizza,Violations were cited in the following area(s).,10F,0,29,C,Cycle Inspection / Initial Inspection,102,3,3700,10
262410,Manhattan,10018,Pizza,Violations were cited in the following area(s).,04J,1,17,B,Cycle Inspection / Initial Inspection,105,3,10900,4
262411,Manhattan,10011,American,Violations were cited in the following area(s).,10F,0,3,A,Pre-permit (Non-operational) / Initial Inspection,102,3,7100,6
262412,Bronx,10461,Pizza,Violations were cited in the following area(s).,08A,0,16,B,Cycle Inspection / Initial Inspection,210,13,26602,11


In [230]:
inspections.info()

<class 'pandas.core.frame.DataFrame'>
Index: 244945 entries, 2 to 262413
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   BORO                 244945 non-null  object
 1   ZIPCODE              244945 non-null  int64 
 2   CUISINE DESCRIPTION  244945 non-null  object
 3   ACTION               244945 non-null  object
 4   VIOLATION CODE       244945 non-null  object
 5   CRITICAL FLAG        244945 non-null  int64 
 6   SCORE                244945 non-null  int64 
 7   GRADE                244945 non-null  object
 8   INSPECTION TYPE      244945 non-null  object
 9   Community Board      244945 non-null  int64 
 10  Council District     244945 non-null  int64 
 11  Census Tract         244945 non-null  int64 
 12  INSPECTION MONTH     244945 non-null  int64 
dtypes: int64(7), object(6)
memory usage: 26.2+ MB


## Data cleaning finished -> Time to Encode

Now that the data is all cleaned, we can encode all of the categorical columns in preparation for exploratory data analysis and model training.

In [231]:
# We'll make a separate label encoder for each here just in case we want to make a prediction and encode some data later
boro_le = preprocessing.LabelEncoder()
desc_le = preprocessing.LabelEncoder()
action_le = preprocessing.LabelEncoder()
code_le = preprocessing.LabelEncoder()
grade_le = preprocessing.LabelEncoder()
type_le = preprocessing.LabelEncoder()

inspections['BORO'] = boro_le.fit_transform(inspections['BORO'])
inspections['CUISINE DESCRIPTION'] = desc_le.fit_transform(inspections['CUISINE DESCRIPTION'])
inspections['ACTION'] = action_le.fit_transform(inspections['ACTION'])
inspections['VIOLATION CODE'] = code_le.fit_transform(inspections['VIOLATION CODE'])
inspections['GRADE'] = grade_le.fit_transform(inspections['GRADE'])
inspections['INSPECTION TYPE'] = type_le.fit_transform(inspections['INSPECTION TYPE'])
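
Keeping a fitted encoder per column is what makes it possible to encode new values or decode predictions later. The following is a minimal sketch on a made-up series of boroughs; `demo_le` and the other names are illustrative and separate from the encoders defined above.

```python
import pandas as pd
from sklearn import preprocessing

# Toy borough values; the real column is inspections['BORO'].
boros = pd.Series(["Brooklyn", "Manhattan", "Bronx", "Queens", "Staten Island", "Brooklyn"])

demo_le = preprocessing.LabelEncoder()
codes = demo_le.fit_transform(boros)

# classes_ holds the labels in sorted order; a label's position is its integer code.
print(list(demo_le.classes_))

# Encode new data with the already fitted encoder.
new_codes = demo_le.transform(["Queens", "Bronx"])

# Map integer codes back to the original labels.
decoded = demo_le.inverse_transform(codes[:3])

print(codes.tolist(), new_codes.tolist(), decoded.tolist())
```

Note that `transform` raises a `ValueError` for a label the encoder did not see during fitting, so new data has to use the same category values.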

In [232]:
# The final cleaned and encoded dataset
inspections

Unnamed: 0,BORO,ZIPCODE,CUISINE DESCRIPTION,ACTION,VIOLATION CODE,CRITICAL FLAG,SCORE,GRADE,INSPECTION TYPE,Community Board,Council District,Census Tract,INSPECTION MONTH
2,1,11223,66,4,52,0,36,2,11,313,47,37402,2
5,1,11217,74,4,52,0,7,0,11,306,39,12901,11
51,1,11214,2,4,52,0,8,0,1,313,47,34800,4
54,1,11210,25,4,1,1,12,0,1,318,45,74000,2
59,1,11217,71,4,27,1,12,0,2,306,39,12901,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
262409,2,10013,66,4,61,0,29,2,1,102,3,3700,10
262410,2,10018,66,4,24,1,17,1,1,105,3,10900,4
262411,2,10011,2,4,61,0,3,0,7,102,3,7100,6
262412,0,10461,66,4,48,0,16,1,1,210,13,26602,11
