## NYC Restaurant Inspections 
### Author: Jack Robbins

**Dataset Used**: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data

In [1]:
# Important imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

In [2]:
inspections = pd.read_csv("data/DOHMH_New_York_City_Restaurant_Inspection_Results_20241121.csv")

In [3]:
inspections

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,...,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location Point1
0,50161678,CHIPOTLE MEXICAN GRILL #3056,Brooklyn,1746,ATLANTIC AVENUE,11213.0,6143187413,,01/01/1900,,...,,40.677506,-73.932340,308.0,36.0,30900.0,3251126.0,3.013360e+09,BK61,
1,50153724,ALICE'S TEA CUP CHAPTER II,Manhattan,156,EAST 64 STREET,10065.0,6464107205,,01/01/1900,,...,,40.765200,-73.965568,108.0,4.0,12000.0,1042114.0,1.013980e+09,MN40,
2,50154473,MONSIEUR BISTRO,Manhattan,853,LEXINGTON AVENUE,10065.0,3476076861,,01/01/1900,,...,,40.765661,-73.965661,108.0,4.0,12000.0,1042380.0,1.013990e+09,MN40,
3,50157000,HEA SOUTHEAST ASIAN STREET FOOD,Queens,3636,PRINCE ST,11354.0,9177431029,,01/01/1900,,...,,40.761392,-73.832895,407.0,20.0,86900.0,4534920.0,4.049708e+09,QN22,
4,50122996,MARTINY'S,Manhattan,121,EAST 17 STREET,10003.0,1646644923,,01/01/1900,,...,,40.735975,-73.987620,105.0,2.0,5000.0,1082518.0,1.008730e+09,MN21,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259052,41337363,NATIONAL BAKERY,Bronx,1617,WESTCHESTER AVENUE,10472.0,7188930660,Bakery Products/Desserts,10/21/2024,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,40.829441,-73.874825,209.0,17.0,6200.0,2025353.0,2.037770e+09,BX08,
259053,50044368,CHOPSTICK,Brooklyn,884,REMSEN AVENUE,11236.0,7186293555,Chinese,01/28/2022,Violations were cited in the following area(s).,...,Cycle Inspection / Re-inspection,40.646378,-73.912834,318.0,42.0,96000.0,3397396.0,3.079200e+09,BK50,
259054,50123737,WONDERFUL RESTAURANT,Bronx,518,EAST 240 STREET,10470.0,7183252800,Chinese,09/14/2022,Violations were cited in the following area(s).,...,Pre-permit (Operational) / Initial Inspection,40.901165,-73.861981,212.0,11.0,45102.0,2019826.0,2.033980e+09,BX62,
259055,41086368,LUZ DE AMERICA RESTAURANT,Queens,10430,ROOSEVELT AVENUE,11368.0,7186512060,Latin American,01/19/2023,Violations were cited in the following area(s).,...,Cycle Inspection / Re-inspection,40.750175,-73.860548,404.0,21.0,40300.0,4048773.0,4.019840e+09,QN26,


In [4]:
# Let's get an idea of the shape of the dataset (rows, columns)
inspections.shape

(259057, 27)

In [5]:
null_values = inspections.isnull().sum()
print(null_values)

CAMIS                         0
DBA                         966
BORO                          0
BUILDING                    372
STREET                        3
ZIPCODE                    2662
PHONE                         3
CUISINE DESCRIPTION        2846
INSPECTION DATE               0
ACTION                     2846
VIOLATION CODE             4444
VIOLATION DESCRIPTION      4444
CRITICAL FLAG                 0
SCORE                     12765
GRADE                    134200
GRADE DATE               143494
RECORD DATE                   0
INSPECTION TYPE            2846
Latitude                    353
Longitude                   353
Community Board            3297
Council District           3287
Census Tract               3287
BIN                        4610
BBL                         638
NTA                        3297
Location Point1          259057
dtype: int64


### Let's drop unneeded columns

The columns BIN, BBL, NTA, and Location Point1 have no official description on the data page, so we can't interpret them reliably; Location Point1 is also null in every row. We'll drop them. We'll also drop CAMIS, GRADE DATE, PHONE, Latitude, and Longitude: these columns are documented, but they aren't useful for this analysis.

In [6]:
inspections.drop(
    columns=['Location Point1', 'NTA', 'BBL', 'BIN', 'CAMIS',
             'GRADE DATE', 'PHONE', 'Latitude', 'Longitude'],
    inplace=True,
)
null_values = inspections.isnull().sum()
print(null_values)

DBA                         966
BORO                          0
BUILDING                    372
STREET                        3
ZIPCODE                    2662
CUISINE DESCRIPTION        2846
INSPECTION DATE               0
ACTION                     2846
VIOLATION CODE             4444
VIOLATION DESCRIPTION      4444
CRITICAL FLAG                 0
SCORE                     12765
GRADE                    134200
RECORD DATE                   0
INSPECTION TYPE            2846
Community Board            3297
Council District           3287
Census Tract               3287
dtype: int64
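
As a side note, the drop above fails with a `KeyError` if the cell is run a second time, because the columns are already gone. A minimal sketch of a re-runnable variant, assuming a made-up toy frame (`toy`, `to_drop`, and `trimmed` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the inspections table (values are made up for illustration)
toy = pd.DataFrame({
    "DBA": ["EXAMPLE CAFE"],
    "BORO": ["Queens"],
    "PHONE": ["0000000000"],
    "BIN": [1.0],
})

# "NTA" is deliberately absent from toy to show that errors="ignore" skips it
to_drop = ["PHONE", "BIN", "NTA"]

# Returning a new frame (no inplace) leaves the original untouched,
# so the cell can be re-run safely
trimmed = toy.drop(columns=to_drop, errors="ignore")
print(list(trimmed.columns))
```

Whether to mutate in place or reassign is a style choice; the reassigning form just makes it easier to re-run cells out of order.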


### Let's analyze these findings
As we can see above, GRADE has a lot of null values: 134,200 of 259,057 rows. (GRADE DATE, which we dropped, had even more.) The grade would be very interesting for us to look at, so we need to either fill those nulls or drop the column. SCORE has far fewer nulls (12,765), so let's look at it first.
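
Raw null counts are easier to judge as fractions of the row count; for GRADE, 134,200 of 259,057 is roughly 52%. A small sketch of how that could be computed per column, shown on a made-up toy frame (`toy` and `null_fraction` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the inspections table (values are made up for illustration)
toy = pd.DataFrame({
    "SCORE": [12.0, 20.0, np.nan, 35.0],
    "GRADE": ["A", np.nan, np.nan, np.nan],
})

# isnull() yields booleans; their mean is the fraction of missing values
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)
```

On the real data the same expression would be `inspections.isnull().mean()`.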

In [7]:
inspections['SCORE'].describe()

count    246292.000000
mean         24.091895
std          18.210236
min           0.000000
25%          12.000000
50%          20.000000
75%          32.000000
max         168.000000
Name: SCORE, dtype: float64

In [8]:
# Let's remove all rows that have a null score
inspections.dropna(subset=['SCORE'], inplace=True)

### Filling in missing grades

From the [NYC Department of Health](https://www.nyc.gov/assets/doh/downloads/pdf/about/healthcode/health-code-chapter23.pdf) website, the letter grade is determined by the inspection score as follows:
* Grade A: 0-13 points scored
* Grade B: 14-27 points scored
* Grade C: >=28 points scored

We can use this mapping to derive a grade from each score.

In [None]:
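def grade_from_score(score):
    """Map an inspection score to a letter grade (A: 0-13, B: 14-27, C: 28+)."""
    if score < 14:
        return 'A'
    elif score < 28:
        return 'B'
    else:
        return 'C'

# Derive the grade from the score for every row. Note that this overwrites
# any GRADE already recorded, not just the nulls. Rows with a null SCORE
# were dropped above, so the int cast is safe.
inspections['GRADE'] = inspections['SCORE'].astype(int).map(grade_from_score)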
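
The cell above recomputes GRADE for every row, including rows that already had one. If the goal is to fill only the missing grades, one possible alternative is to bin the scores with `pd.cut` and pass the result to `fillna`. This is a sketch on made-up toy data, not the notebook's method; `toy`, `derived`, and `GRADE_FILLED` are hypothetical names, and the `'Z'` value stands in for any non-A/B/C grade that might already be recorded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the inspections table (values are made up for illustration)
toy = pd.DataFrame({
    "SCORE": [0.0, 13.0, 14.0, 27.0, 28.0, 168.0],
    "GRADE": ["A", np.nan, "B", np.nan, np.nan, "Z"],
})

# Bins are right-inclusive: (-1, 13] -> A, (13, 27] -> B, (27, inf] -> C
derived = pd.cut(
    toy["SCORE"],
    bins=[-1, 13, 27, np.inf],
    labels=["A", "B", "C"],
).astype(object)

# fillna only touches rows where GRADE is missing, so recorded values survive
toy["GRADE_FILLED"] = toy["GRADE"].fillna(derived)
print(toy)
```

Here the last row keeps its recorded `'Z'` even though its score falls in the C band, which is the behavioral difference from overwriting the whole column.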

In [None]:
null_values = inspections.isnull().sum()
print(null_values)