# Is it possible to predict whether a catering provider will pass or fail an inspection?

This notebook introduces some concepts of artificial intelligence, specifically machine learning. We use a dataset of food inspections carried out in Chicago. More details can be found below.

https://www.kaggle.com/datasets/tjkyner/chicago-food-inspections

This notebook demonstrates the extent to which a decision tree can accurately predict the outcome of an inspection.

# Step 1 - Let's explore the data

We import the relevant libraries and find that the data is stored in one reasonably sized file. The dataset has 17 columns and more than 220,000 rows. It appears to be quite complete: most columns have very few missing observations, the exception being Violations, which is empty for about 27% of the rows.

In [1]:

import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit

%matplotlib inline



for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/chicago-food-inspections/Food_Inspections.csv


In [2]:
source = "/kaggle/input/chicago-food-inspections/Food_Inspections.csv"
data   = pd.read_csv(source) 
data.shape

(221468, 17)

In [3]:
data.dtypes

Inspection ID        int64
DBA Name            object
AKA Name            object
License #          float64
Facility Type       object
Risk                object
Address             object
City                object
State               object
Zip                float64
Inspection Date     object
Inspection Type     object
Results             object
Violations          object
Latitude           float64
Longitude          float64
Location            object
dtype: object

In [4]:
data.isnull().sum()/data.shape[0]

Inspection ID      0.000000
DBA Name           0.000009
AKA Name           0.011203
License #          0.000077
Facility Type      0.022184
Risk               0.000321
Address            0.000000
City               0.000754
State              0.000248
Zip                0.000235
Inspection Date    0.000000
Inspection Type    0.000005
Results            0.000000
Violations         0.267538
Latitude           0.003414
Longitude          0.003414
Location           0.003414
dtype: float64
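The share of missing values per column can also be obtained with a mean over the boolean mask, and the same mask tells us how many rows are affected. The sketch below uses a small made-up frame (`toy`) rather than the real file, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the inspections data (all values are made up).
toy = pd.DataFrame({
    "Results": ["Pass", "Fail", "Pass", "Pass"],
    "Violations": ["32. FOOD...", np.nan, np.nan, "18. NO EVIDENCE..."],
    "Zip": [60615.0, 60611.0, np.nan, 60607.0],
})

# isnull() returns a boolean frame; the mean of booleans is the share of True.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that have at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)
```

Sorting the shares puts the least complete columns first, which makes a column such as Violations easy to spot.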

In [5]:
rows = data.isnull()  # boolean mask: True where a value is missing

# Step 2 - Let's clean the data


Several columns store company names and other information that could identify a business. For that reason we remove some of the columns to protect those businesses. We keep the outcome of the inspection, the facility type, risk, city, zip code, inspection type, violations, latitude and longitude. Latitude and longitude could still lead to identification; we assume that the high concentration of businesses in the Chicago area limits that risk.

In [6]:
cols = ['Results','Facility Type', 'Risk','City','Zip','Inspection Type','Violations','Latitude','Longitude']
data = data.loc[:,cols]
data


| | Results | Facility Type | Risk | City | Zip | Inspection Type | Violations | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pass | School | Risk 1 (High) | CHICAGO | 60615.0 | Canvass | NaN | 41.798029 | -87.602463 |
| 1 | No Entry | Restaurant | Risk 1 (High) | CHICAGO | 60611.0 | Non-Inspection | NaN | 41.891652 | -87.622604 |
| 2 | Not Ready | Restaurant | Risk 1 (High) | CHICAGO | 60607.0 | License | NaN | 41.867330 | -87.642117 |
| 3 | Out of Business | Restaurant | Risk 1 (High) | CHICAGO | 60659.0 | Canvass | NaN | 41.985362 | -87.689652 |
| 4 | No Entry | Restaurant | Risk 1 (High) | CHICAGO | 60625.0 | Non-Inspection | NaN | 41.975927 | -87.699046 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221463 | Pass | Long-Term Care Facility | Risk 1 (High) | CHICAGO | 60611.0 | Canvass | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO... | 41.897438 | -87.626020 |
| 221464 | Fail | Daycare (2 - 6 Years) | Risk 1 (High) | CHICAGO | 60827.0 | License | 18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN... | 41.655907 | -87.599022 |
| 221465 | Pass | Restaurant | Risk 2 (Medium) | CHICAGO | 60602.0 | Short Form Complaint | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS... | 41.883115 | -87.625173 |
| 221466 | Pass | Restaurant | Risk 2 (Medium) | CHICAGO | 60656.0 | Suspected Food Poisoning | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL... | 41.962768 | -87.836840 |


Several columns hold text values that repeat across rows. We therefore convert those columns to numerical category codes, which lets a learning algorithm pick up patterns in them.
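As a minimal sketch of what this conversion does, the example below encodes a small made-up column (`results_toy`, a hypothetical stand-in for a text column). It shows that the categories are sorted, that each code is a position in that sorted list, and that a missing value receives the code -1.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values are illustrative only).
results_toy = pd.Series(["Pass", "Fail", np.nan, "Pass", "No Entry"])

encoded = pd.Categorical(results_toy)

# Categories are sorted; each code is an index into this list.
print(list(encoded.categories))  # ['Fail', 'No Entry', 'Pass']

# Missing values get the code -1.
print(encoded.codes.tolist())    # [2, 0, -1, 2, 1]

# Mapping from code back to label, handy when reading model output.
code_to_label = dict(enumerate(encoded.categories))
print(code_to_label)
```

Note that the codes are labels, not quantities: a code of 2 is not "twice" a code of 1. A decision tree can still split on them, but the ordering it sees is simply alphabetical.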

## Results

In [7]:
len(data.Results.unique())

7

In [8]:
data["Results"]=  pd.Categorical(data.Results)
data.dtypes

Results            category
Facility Type        object
Risk                 object
City                 object
Zip                 float64
Inspection Type      object
Violations           object
Latitude            float64
Longitude           float64
dtype: object

In [9]:
results = data.Results.cat.codes
results

0         5
1         2
2         3
3         4
4         2
         ..
221463    5
221464    1
221465    5
221466    5
221467    5
Length: 221468, dtype: int8

## Facility type

In [10]:
len(data['Facility Type'].unique())

503

In [11]:
data["Facility Type"]=  pd.Categorical(data["Facility Type"])
data.dtypes

Results            category
Facility Type      category
Risk                 object
City                 object
Zip                 float64
Inspection Type      object
Violations           object
Latitude            float64
Longitude           float64
dtype: object

In [12]:
facility_type = data['Facility Type'].cat.codes
facility_type

0         405
1         385
2         385
3         385
4         385
         ... 
221463    276
221464    142
221465    385
221466    385
221467    216
Length: 221468, dtype: int16

## Inspection Type

In [13]:
len(data['Inspection Type'].unique())

111

In [14]:
data["Inspection Type"]=  pd.Categorical(data['Inspection Type'])
data.dtypes

Results            category
Facility Type      category
Risk                 object
City                 object
Zip                 float64
Inspection Type    category
Violations           object
Latitude            float64
Longitude           float64
dtype: object

In [15]:
inspect_types = data["Inspection Type"].cat.codes
inspect_types

0         15
1         52
2         44
3         15
4         52
          ..
221463    15
221464    44
221465    75
221466    80
221467    45
Length: 221468, dtype: int8

## Violations

In [16]:
len(data['Violations'].unique())

161238

In [17]:
data['Violations'] = pd.Categorical(data.Violations)
data.dtypes

Results            category
Facility Type      category
Risk                 object
City                 object
Zip                 float64
Inspection Type    category
Violations         category
Latitude            float64
Longitude           float64
dtype: object

In [18]:
violations = data.Violations.cat.codes
violations

0             -1
1             -1
2             -1
3             -1
4             -1
           ...  
221463    128018
221464     17634
221465    141150
221466     79227
221467        -1
Length: 221468, dtype: int32

## Cities

In [19]:
len(data['City'].unique())

74

In [20]:
data["City"]=  pd.Categorical(data.City)
data.dtypes

Results            category
Facility Type      category
Risk                 object
City               category
Zip                 float64
Inspection Type    category
Violations         category
Latitude            float64
Longitude           float64
dtype: object

In [21]:
cities = data['City'].cat.codes
cities

0         16
1         16
2         16
3         16
4         16
          ..
221463    16
221464    16
221465    16
221466    16
221467    16
Length: 221468, dtype: int8

## Risks

In [22]:
len(data.Risk.unique())

5

In [23]:
data["Risk"] = pd.Categorical(data.Risk)
data.dtypes

Results            category
Facility Type      category
Risk               category
City               category
Zip                 float64
Inspection Type    category
Violations         category
Latitude            float64
Longitude           float64
dtype: object

In [24]:
risks = data.Risk.cat.codes
risks

0         1
1         1
2         1
3         1
4         1
         ..
221463    1
221464    1
221465    2
221466    2
221467    3
Length: 221468, dtype: int8

In [26]:
# Assemble the encoded columns and the numeric columns into one DataFrame.
cleaned_data = pd.DataFrame({
    'results': results,
    'facility_type': facility_type,
    'inspect_type': inspect_types,
    'violations': violations,
    'cities': cities,
    'risk': risks,
    'zip': data.Zip,
    'Lat': data.Latitude,
    'Long': data.Longitude,
})
cleaned_data.shape


(221468, 9)

In [27]:
cleaned_data.dtypes

results             int8
facility_type      int16
inspect_type        int8
violations         int32
cities              int8
risk                int8
zip              float64
Lat              float64
Long             float64
dtype: object

In [28]:
cleaned_data.isnull().sum()

results            0
facility_type      0
inspect_type       0
violations         0
cities             0
risk               0
zip               52
Lat              756
Long             756
dtype: int64

In [29]:
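# Replace missing zip codes and coordinates with the sentinel value -1.
# Assigning the result back avoids chained in-place calls, which newer
# pandas versions may not apply to the original frame.
cleaned_data = cleaned_data.fillna({'zip': -1, 'Lat': -1, 'Long': -1})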
cleaned_data.isnull().sum()

results          0
facility_type    0
inspect_type     0
violations       0
cities           0
risk             0
zip              0
Lat              0
Long             0
dtype: int64

# Step 3 - Let's prepare the data for learning

In [30]:
cleaned_data.shape

(221468, 9)

In [31]:
X = cleaned_data.iloc[:, 1:]
X 

| | facility_type | inspect_type | violations | cities | risk | zip | Lat | Long |
|---|---|---|---|---|---|---|---|---|
| 0 | 405 | 15 | -1 | 16 | 1 | 60615.0 | 41.798029 | -87.602463 |
| 1 | 385 | 52 | -1 | 16 | 1 | 60611.0 | 41.891652 | -87.622604 |
| 2 | 385 | 44 | -1 | 16 | 1 | 60607.0 | 41.867330 | -87.642117 |
| 3 | 385 | 15 | -1 | 16 | 1 | 60659.0 | 41.985362 | -87.689652 |
| 4 | 385 | 52 | -1 | 16 | 1 | 60625.0 | 41.975927 | -87.699046 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 221463 | 276 | 15 | 128018 | 16 | 1 | 60611.0 | 41.897438 | -87.626020 |
| 221464 | 142 | 44 | 17634 | 16 | 1 | 60827.0 | 41.655907 | -87.599022 |
| 221465 | 385 | 75 | 141150 | 16 | 2 | 60602.0 | 41.883115 | -87.625173 |
| 221466 | 385 | 80 | 79227 | 16 | 2 | 60656.0 | 41.962768 | -87.836840 |


In [32]:
y = cleaned_data.iloc[:, 0]
y

0         5
1         2
2         3
3         4
4         2
         ..
221463    5
221464    1
221465    5
221466    5
221467    5
Name: results, Length: 221468, dtype: int8

In [33]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)

# Split the data into a training set (60%) and a held-out set (40%).
# split() yields positional indices, so rows are selected with .iloc.
for train_index, test_index in split.split(X, y):
    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]

print("X_train", X_train.shape)
print("X_test", X_test.shape)

X_train (132880, 8)
X_test (88588, 8)


In [36]:
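split = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

# Split the held-out set into equal test and validation sets.
# The indices returned by split() are positions within X_test and y_test,
# so we index those frames (not the full X and y) to avoid reusing training rows.
test_index, valid_index = next(split.split(X_test, y_test))
X_valid, y_valid = X_test.iloc[valid_index], y_test.iloc[valid_index]
X_test, y_test = X_test.iloc[test_index], y_test.iloc[test_index]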

print("X_test", X_test.shape)
print("X_valid", X_valid.shape)

X_test (11073, 8)
X_valid (11074, 8)
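A three-way split is only useful if the three sets do not share rows. The sketch below shows one way to build and check such a split on synthetic data; `X_toy` and `y_toy` are made-up stand-ins, and `train_test_split` with `stratify` is used here as a shorter alternative to `StratifiedShuffleSplit`, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with an imbalanced three-class target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"a": rng.normal(size=1000),
                      "b": rng.integers(0, 5, size=1000)})
y_toy = pd.Series(rng.choice([0, 1, 2], size=1000, p=[0.6, 0.3, 0.1]))

# 60% training, 40% held out, keeping class proportions similar.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42)

# Split the held-out part in half: 20% test, 20% validation.
X_te, X_va, y_te, y_va = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)

# The three index sets should be pairwise disjoint and cover every row.
idx_tr, idx_te, idx_va = set(X_tr.index), set(X_te.index), set(X_va.index)
print(len(idx_tr), len(idx_te), len(idx_va))
print(idx_tr.isdisjoint(idx_te), idx_tr.isdisjoint(idx_va), idx_te.isdisjoint(idx_va))
```

If any of the disjointness checks printed False, scores on the test or validation set would be inflated by rows the model had already seen.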


# Step 4 - Learning 
This phase fits a model that maps the feature values _X_ to the known outcome _y_. We will use a decision tree.

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
tree.plot_tree(clf)

[Text(0.6613796973032965, 0.9895833333333334, 'x[2] <= 62860.5\ngini = 0.659\nsamples = 132880\nvalue = [43, 25610, 4673, 1451, 11477, 69276, 20350]'),
 Text(0.4093132244477975, 0.96875, 'x[2] <= 1.0\ngini = 0.774\nsamples = 73352\nvalue = [43, 22133, 4523, 1437, 11466, 16962, 16788]'),
 Text(0.19869313520732038, 0.9479166666666666, 'x[1] <= 15.5\ngini = 0.675\nsamples = 35495\nvalue = [43, 1952, 4340, 1411, 11460, 15923, 366]'),
 Text(0.1445757562736948, 0.9270833333333334, 'x[4] <= 1.5\ngini = 0.498\nsamples = 16739\nvalue = [23, 272, 3040, 37, 11271, 2065, 31]'),
 Text(0.10936984455392888, 0.90625, 'x[0] <= 389.5\ngini = 0.584\nsamples = 10007\nvalue = [15, 194, 2586, 28, 5731, 1430, 23]'),
 Text(0.06409873212270423, 0.8854166666666666, 'x[6] <= 41.902\ngini = 0.567\nsamples = 9438\nvalue = [12, 178, 2562, 27, 5545, 1091, 23]'),
 Text(0.030612077591897623, 0.8645833333333334, 'x[6] <= 41.872\ngini = 0.538\nsamples = 4381\nvalue = [8, 73, 820, 17, 2781, 674, 8]'),
 Text(0.00982817351
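A fully grown tree on this many rows is too large to read as a plot. One option, sketched below on synthetic data, is to cap the depth and print the rules as text. The data from `make_classification`, the feature names `f0` to `f3`, and the choice of `max_depth=3` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-class problem standing in for the inspection data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0)

# Limiting the depth keeps the tree small enough to inspect by eye.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow.fit(X_demo, y_demo)

# Print the learned rules as indented text (hypothetical feature names).
rules = export_text(shallow, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
print("depth:", shallow.get_depth(), "leaves:", shallow.get_n_leaves())
```

A shallow tree usually fits the training data less closely than a full one, so this is a trade between readability and accuracy.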

In [None]:
y_pred_train = clf.predict(X_train)
confusion_matrix(y_train,y_pred_train)

In [None]:
y_pred_test = clf.predict(X_test)
confusion_matrix(y_test,y_pred_test)

In [None]:
y_pred_valid = clf.predict(X_valid)
confusion_matrix(y_valid,y_pred_valid)
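Confusion matrices with seven classes are hard to compare by eye. A possible complement, sketched here on synthetic data, is to compare accuracy on the training and test sets and to print per-class precision and recall with `classification_report`. The data set and its noise level (`flip_y=0.2`) are assumptions chosen to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with 20% label noise.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=4, n_classes=3,
    flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=42)

# An unrestricted tree can memorise the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")

# Precision, recall and F1 for each class on the unseen data.
report = classification_report(y_te, model.predict(X_te), zero_division=0)
print(report)
```

A large gap between the two accuracies suggests the tree is overfitting; the per-class rows of the report show which outcomes are predicted poorly, which matters when some classes are rare.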

## What is the distribution of the data?


Most variables are unevenly distributed: some categories appear far more often than others. Some variables are also likely to depend on one another; the geographical columns (cities, zip, latitude and longitude) are one example. Such structure may help the model learn patterns.

In [None]:
plt.hist(cleaned_data["results"])

In [None]:
data.Results.unique()

In [None]:
plt.hist(cleaned_data["facility_type"])

In [None]:
plt.hist(cleaned_data["inspect_type"])

In [None]:
plt.hist(cleaned_data["violations"])

In [None]:
plt.hist(cleaned_data["cities"])

In [None]:
plt.hist(cleaned_data["zip"])

In [None]:
plt.hist(cleaned_data["risk"])
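Histograms of category codes group neighbouring codes into bins, which can hide how often each individual category occurs. As a sketch of an alternative, the example below counts each label directly with `value_counts`; `results_toy` is a made-up column, not the real data.

```python
import pandas as pd

# Toy outcome column (values are illustrative only).
results_toy = pd.Series(["Pass", "Pass", "Fail", "Pass w/ Conditions",
                         "Pass", "Fail", "No Entry", "Pass"])

# Absolute count per category, most frequent first.
counts = results_toy.value_counts()
print(counts)

# Same information as a share of all rows.
shares = results_toy.value_counts(normalize=True)
print(shares)
```

On the real data, the same calls on `data.Results` would show the class imbalance by label rather than by code.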

In [None]:
cleaned_data.groupby(['inspect_type','cities','results']).count()

Some correlation may exist between the results and the type of inspection, although the two variables may also be independent.

In [None]:
sns.heatmap(cleaned_data.corr())
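The heatmap above uses Pearson correlation, which treats category codes as if they were ordered quantities. For two categorical variables, a contingency table with a chi-squared statistic is one common alternative. The sketch below computes Cramér's V, a value between 0 (no association) and 1 (perfect association), on synthetic data; the helper `cramers_v` and all arrays are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000

# Synthetic codes: results_toy copies inspect_toy 70% of the time,
# unrelated_toy is drawn independently.
inspect_toy = rng.integers(0, 3, size=n)
results_toy = np.where(rng.random(n) < 0.7, inspect_toy,
                       rng.integers(0, 3, size=n))
unrelated_toy = rng.integers(0, 3, size=n)

def cramers_v(a, b):
    """Cramér's V between two categorical arrays (hypothetical helper)."""
    table = pd.crosstab(pd.Series(a), pd.Series(b)).to_numpy()
    chi2 = chi2_contingency(table, correction=False)[0]
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (table.sum() * k)))

v_related = cramers_v(inspect_toy, results_toy)
v_unrelated = cramers_v(inspect_toy, unrelated_toy)
print(f"related:   {v_related:.3f}")
print(f"unrelated: {v_unrelated:.3f}")
```

Applied to `cleaned_data['inspect_type']` and `cleaned_data['results']`, a value close to 0 would support independence, while a clearly larger value would point to an association.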