#### 오늘 여러분은 Chicago시의 공중보건부가 진행한 위생 검사의 실패 여부를 예측하는 모델을 만들어야 합니다.

여러분의 모델이 예측할 target은 `Inspection Fail` 칼럼입니다.   
칼럼 값은 아래와 같습니다:
- 식당이 위생 검사에 불합격한 경우: **1**
- 식당이 검사를 통과한 경우: **0**

In [1]:
# Google Colab을 사용하는 경우 해당 셀을 실행하세요
%%capture
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install eli5
    !pip install pandas-profiling==2.*
    !pip install pdpbox
    !pip install shap

In [1]:
# 데이터셋을 불러오기 위해 판다스 라이브러리를 불러옵니다
import pandas as pd

train_url = 'https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/food_inspection_sc23x/food_ins_train.csv'
test_url  = 'https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/food_inspection_sc23x/food_ins_test.csv'

# train, test 데이터셋을 불러옵니다
train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

# 데이터셋 확인
assert train.shape == (60000, 17)
assert test.shape  == (20000, 17)

# Part 1: 데이터 전처리 (Data Preprocessing)

In [None]:
## 1.1 데이터셋을 파악하기 위한 EDA를 진행하세요
#> EDA를 하는 방식 및 라이브러리에 대한 제한은 없습니다. 단, 시간 분배에 주의하세요.

In [2]:
# 데이터 타입 확인 및 데이터 수 파악+
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Inspection ID    60000 non-null  int64  
 1   DBA Name         60000 non-null  object 
 2   AKA Name         59283 non-null  object 
 3   License #        59996 non-null  float64
 4   Facility Type    58573 non-null  object 
 5   Risk             59976 non-null  object 
 6   Address          60000 non-null  object 
 7   City             59955 non-null  object 
 8   State            59985 non-null  object 
 9   Zip              59987 non-null  float64
 10  Inspection Date  60000 non-null  object 
 11  Inspection Type  60000 non-null  object 
 12  Violations       44130 non-null  object 
 13  Latitude         59822 non-null  float64
 14  Longitude        59822 non-null  float64
 15  Location         59822 non-null  object 
 16  Inspection Fail  60000 non-null  int64  
dtypes: float64(4

In [3]:
# 결측값 확인
train.isnull().sum()

Inspection ID          0
DBA Name               0
AKA Name             717
License #              4
Facility Type       1427
Risk                  24
Address                0
City                  45
State                 15
Zip                   13
Inspection Date        0
Inspection Type        0
Violations         15870
Latitude             178
Longitude            178
Location             178
Inspection Fail        0
dtype: int64

컬럼이 여러개 이므로 필요없는 컬럼을 전처리 단계에서 제거후 자세히 살펴볼 예정

## 1.2 EDA의 결과를 토대로 Feature Engineering 및 Preprocessing을 진행하세요
> 새로운 feature를 만드는 작업뿐만이 아니라, 필요한 feature가 적절한 데이터 타입을 가지고 있지 않다면 변환합니다

In [4]:
# 데이터확인
train

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Violations,Latitude,Longitude,Location,Inspection Fail
0,2050629,MY SWEET STATION INC,MY SWEET STATION,2327223.0,Restaurant,Risk 1 (High),2511 N LINCOLN AVE,CHICAGO,IL,60614.0,2017-05-18,Canvass,,41.927577,-87.651528,"(-87.65152817242594, 41.92757677830966)",0
1,2078428,OUTTAKES,RED MANGO,2125004.0,Restaurant,Risk 2 (Medium),10 S DEARBORN ST FL,CHICAGO,IL,60603.0,2017-08-14,Canvass,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO...",41.881807,-87.629543,"(-87.62954311539407, 41.88180696006542)",0
2,1591748,JAFFA BAGELS,JAFFA BAGELS,2278918.0,Restaurant,Risk 1 (High),225 N MICHIGAN AVE,CHICAGO,IL,60601.0,2015-12-15,Complaint,"30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABEL...",41.886377,-87.624382,"(-87.62438167043969, 41.88637740620821)",0
3,1230035,FRANKS 'N' DAWGS,FRANKS 'N' DAWGS,2094329.0,Restaurant,Risk 1 (High),1863 N CLYBOURN AVE,CHICAGO,IL,60614.0,2012-07-10,Canvass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,41.914990,-87.654994,"(-87.65499361162448, 41.91498953039437)",0
4,1228186,SOUTH COAST,SOUTH COAST SUSHI,1817424.0,Restaurant,Risk 1 (High),1700 S MICHIGAN AVE,CHICAGO,IL,60616.0,2013-09-20,Canvass,,41.858996,-87.624106,"(-87.62410566978502, 41.85899630014676)",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59995,2316057,LITTLE GENIUS COMMUNITY DAYCARE 11,LITTLE GENIUS COMMUNITY DAYCARE,2359451.0,Daycare (2 - 6 Years),Risk 1 (High),1000 W 103RD ST,CHICAGO,IL,60643.0,2019-10-18,License,"55. PHYSICAL FACILITIES INSTALLED, MAINTAINED ...",41.706982,-87.647758,"(-87.64775773310967, 41.706982259265786)",0
59996,1170444,A J FOOD & LIQUOR INC.,FERAS FOOD & LIQUOR,2157174.0,Grocery Store,Risk 3 (Low),4265 W CERMAK RD,CHICAGO,IL,60623.0,2012-08-30,License Re-Inspection,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO...",41.851324,-87.732192,"(-87.73219217780218, 41.85132422344611)",0
59997,1098317,PITCHFORK FOOD & SALOON,PITCHFORK FOOD & SALOON,1271831.0,Restaurant,Risk 1 (High),2922-2924 W IRVING PARK RD,CHICAGO,IL,60618.0,2012-04-03,Short Form Complaint,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,41.954067,-87.701753,"(-87.70175255165984, 41.954066875501354)",0
59998,1632103,DATA RESTAURANT,DAATA DARBAR RESTAURANT,2446971.0,Restaurant,Risk 1 (High),2306 W DEVON AVE,CHICAGO,IL,60659.0,2016-02-23,License,,41.997876,-87.687764,"(-87.6877635508729, 41.99787636318572)",0


In [5]:
# 필요없는 고유번호 및 위치정보 컬럼 제거
train = train.drop(columns =['Inspection ID', 'DBA Name', 'Latitude', 'Longitude', 'Location','Violations', 'Inspection Date', 'Zip', 'License #','AKA Name','Address','State'],axis = 1 )
test = test.drop(columns = ['Inspection ID', 'DBA Name', 'Latitude', 'Longitude', 'Location','Violations', 'Inspection Date', 'Zip', 'License #','AKA Name','Address','State'], axis = 1)
train

Unnamed: 0,Facility Type,Risk,City,Inspection Type,Inspection Fail
0,Restaurant,Risk 1 (High),CHICAGO,Canvass,0
1,Restaurant,Risk 2 (Medium),CHICAGO,Canvass,0
2,Restaurant,Risk 1 (High),CHICAGO,Complaint,0
3,Restaurant,Risk 1 (High),CHICAGO,Canvass,0
4,Restaurant,Risk 1 (High),CHICAGO,Canvass,0
...,...,...,...,...,...
59995,Daycare (2 - 6 Years),Risk 1 (High),CHICAGO,License,0
59996,Grocery Store,Risk 3 (Low),CHICAGO,License Re-Inspection,0
59997,Restaurant,Risk 1 (High),CHICAGO,Short Form Complaint,0
59998,Restaurant,Risk 1 (High),CHICAGO,License,0


# 나중에 전처리 시도할것


In [6]:
train['Facility Type']

0                   Restaurant
1                   Restaurant
2                   Restaurant
3                   Restaurant
4                   Restaurant
                 ...          
59995    Daycare (2 - 6 Years)
59996            Grocery Store
59997               Restaurant
59998               Restaurant
59999               Restaurant
Name: Facility Type, Length: 60000, dtype: object

In [7]:
A = train[train['Facility Type']=='Restaurant'].index.to_list()
B = train[train['Facility Type']=='Grocery Store'].index.to_list()
C = train[train['Facility Type']=='School'].index.to_list()
D = train[train['Facility Type']=="Children's Services Facility"].index.to_list()
E = train[train['Facility Type']=='Bakery'].index.to_list()
F = train[train['Facility Type']=='Daycare (2 - 6 Years)'].index.to_list()
G = train[train['Facility Type']=='Daycare Above and Under 2 Years'].index.to_list()

In [11]:
len(A)

39922

In [28]:
len(B)

7863

In [8]:
total = A+B+C+D+E+F+G
total

[0,
 1,
 2,
 3,
 4,
 7,
 11,
 12,
 14,
 17,
 18,
 19,
 22,
 26,
 28,
 29,
 31,
 33,
 38,
 40,
 42,
 45,
 46,
 47,
 49,
 50,
 51,
 55,
 56,
 59,
 61,
 63,
 64,
 66,
 67,
 68,
 69,
 72,
 73,
 74,
 76,
 77,
 78,
 79,
 80,
 84,
 86,
 87,
 88,
 89,
 92,
 93,
 94,
 95,
 97,
 98,
 99,
 101,
 102,
 103,
 104,
 105,
 106,
 108,
 109,
 111,
 115,
 116,
 118,
 120,
 121,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 135,
 136,
 139,
 141,
 142,
 145,
 146,
 147,
 149,
 150,
 151,
 153,
 154,
 155,
 156,
 157,
 158,
 160,
 161,
 162,
 163,
 166,
 167,
 168,
 169,
 171,
 173,
 174,
 175,
 180,
 181,
 182,
 183,
 184,
 186,
 188,
 190,
 191,
 193,
 194,
 195,
 197,
 198,
 200,
 201,
 202,
 203,
 208,
 209,
 210,
 211,
 216,
 217,
 218,
 221,
 222,
 224,
 225,
 226,
 227,
 228,
 229,
 231,
 232,
 234,
 235,
 236,
 238,
 239,
 241,
 242,
 243,
 244,
 245,
 246,
 247,
 248,
 249,
 251,
 252,
 253,
 254,
 255,
 257,
 258,
 259,
 260,
 261,
 262,
 263,
 265,
 266,
 268,
 269,
 270,
 271,
 272,
 

In [9]:
a = train['Facility Type'][total]
b= train['Facility Type']

In [10]:
c = pd.concat([a, b])

In [42]:
a 

Unnamed: 0,Facility Type
0,Restaurant
1,Restaurant
2,Restaurant
3,Restaurant
4,Restaurant
...,...
59512,Daycare Above and Under 2 Years
59652,Daycare Above and Under 2 Years
59847,Daycare Above and Under 2 Years
59894,Daycare Above and Under 2 Years


In [46]:
c

Unnamed: 0,Facility Type
0,Restaurant
1,Restaurant
2,Restaurant
3,Restaurant
4,Restaurant
...,...
59995,Daycare (2 - 6 Years)
59996,Grocery Store
59997,Restaurant
59998,Restaurant


In [52]:
new = c.drop_duplicates(['Facility Type'], keep = False) # 겹치는거 전체 제거
new

Unnamed: 0,Facility Type
179,PALETERIA /ICECREAM SHOP
496,COLLEGE
1335,TOBACCO STORE
1695,CHARTER SCHOOL/CAFETERIA
1794,HERBALCAL
...,...
56590,Assisted Living
57827,DAYCARE 2 YRS TO 12 YRS
58927,EVENT SPACE
59371,FURNITURE STORE


In [55]:
drop_index = new.index.to_list()
drop_index

[179,
 496,
 1335,
 1695,
 1794,
 2017,
 2163,
 2272,
 2685,
 2821,
 3181,
 3400,
 4220,
 4569,
 4819,
 5502,
 5997,
 6261,
 7367,
 7481,
 7828,
 7839,
 8552,
 8962,
 9013,
 9351,
 10081,
 11127,
 11572,
 11815,
 11911,
 12105,
 12520,
 12629,
 12852,
 12905,
 13025,
 13157,
 13491,
 13764,
 13849,
 14109,
 14466,
 15626,
 15637,
 15704,
 15769,
 15939,
 16161,
 16349,
 17122,
 17290,
 17564,
 17930,
 18060,
 18233,
 18365,
 18376,
 18631,
 18653,
 18692,
 18924,
 19029,
 19395,
 19708,
 19752,
 20080,
 21501,
 21709,
 22024,
 22125,
 22707,
 24023,
 24283,
 24713,
 25082,
 26752,
 26929,
 27478,
 27715,
 28122,
 28154,
 28264,
 28628,
 29145,
 29277,
 31021,
 31886,
 32319,
 33040,
 33239,
 33257,
 33646,
 33775,
 33777,
 34565,
 34836,
 35163,
 35312,
 35526,
 37466,
 38102,
 38868,
 39209,
 39366,
 41119,
 41698,
 42208,
 42767,
 42857,
 43192,
 43454,
 44470,
 44838,
 45151,
 45633,
 48284,
 48763,
 48939,
 49189,
 49837,
 50457,
 50873,
 51151,
 51298,
 51455,
 51621,
 51874,
 520

In [58]:
train['Facility Type'][drop_index] = 'Others'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [69]:
train['Facility Type'].tail(50)

59950    Children's Services Facility
59951                      Restaurant
59952                  Long Term Care
59953                      Restaurant
59954                      Restaurant
59955                      Restaurant
59956                      Restaurant
59957                      Restaurant
59958                          Liquor
59959                      Restaurant
59960                          School
59961                      Restaurant
59962                      Restaurant
59963                      Restaurant
59964                      Restaurant
59965                      Restaurant
59966                      Restaurant
59967                      Restaurant
59968                      Restaurant
59969                      Restaurant
59970                      Restaurant
59971                      Restaurant
59972                      Restaurant
59973                      Restaurant
59974                      Restaurant
59975                      Restaurant
59976       

In [62]:
train['Facility Type'][drop_index]

179      Others
496      Others
1335     Others
1695     Others
1794     Others
          ...  
56590    Others
57827    Others
58927    Others
59371    Others
59597    Others
Name: Facility Type, Length: 146, dtype: object

In [31]:
train[train['Facility Type'].iloc[total]]

0                             Restaurant
1                             Restaurant
2                             Restaurant
3                             Restaurant
4                             Restaurant
                      ...               
59512    Daycare Above and Under 2 Years
59652    Daycare Above and Under 2 Years
59847    Daycare Above and Under 2 Years
59894    Daycare Above and Under 2 Years
59935    Daycare Above and Under 2 Years
Name: Facility Type, Length: 54782, dtype: object

In [12]:
train['Facility Type'].value_counts().head(50)

Restaurant                           39922
Grocery Store                         7863
School                                3605
Children's Services Facility           994
Bakery                                 841
Daycare (2 - 6 Years)                  805
Daycare Above and Under 2 Years        752
Long Term Care                         402
Catering                               394
Liquor                                 261
Mobile Food Dispenser                  255
Daycare Combo 1586                     220
Mobile Food Preparer                   191
Golden Diner                           171
Hospital                               164
Wholesale                              142
TAVERN                                  80
Daycare (Under 2 Years)                 77
Special Event                           64
Shared Kitchen User (Long Term)         51
BANQUET HALL                            41
Shelter                                 38
Shared Kitchen                          37
GAS STATION

In [None]:
train['Facility Type'].value_counts().head(16).to_dict().values()

dict_values([39922, 7863, 3605, 994, 841, 805, 752, 402, 394, 261, 255, 220, 191, 171, 164, 142])

In [None]:
key_list = []
for i in train['Facility Type'].value_counts().head(16).to_dict().keys():
  key_list.append(i)
key_list


['Restaurant',
 'Grocery Store',
 'School',
 "Children's Services Facility",
 'Bakery',
 'Daycare (2 - 6 Years)',
 'Daycare Above and Under 2 Years',
 'Long Term Care',
 'Catering',
 'Liquor',
 'Mobile Food Dispenser',
 'Daycare Combo 1586',
 'Mobile Food Preparer',
 'Golden Diner',
 'Hospital',
 'Wholesale']

In [None]:
len(train[train['Facility Type']=='Restaurant'].values)

39922

In [None]:
train['Facility Type'].unique()

In [None]:
list = []
for i in train['Facility Type'].unique():
  if len(train[train['Facility Type']==i].values) < 200:
    list= list.append('Others') 
  else:
    list.append(i)
list

AttributeError: ignored

In [None]:
list = []
for i in train['Facility Type'].unique():
  if len(train[train['Facility Type']==i].values) < 200:
    list.append(train['Facility Type'].replace(i, 'Others'))
  else:
    list.append(i)
list

['Restaurant',
 'Mobile Food Dispenser',
 'School',
 'Grocery Store',
 "Children's Services Facility",
 0                   Restaurant
 1                   Restaurant
 2                   Restaurant
 3                   Restaurant
 4                   Restaurant
                  ...          
 59995    Daycare (2 - 6 Years)
 59996            Grocery Store
 59997               Restaurant
 59998               Restaurant
 59999               Restaurant
 Name: Facility Type, Length: 60000, dtype: object,
 0                   Restaurant
 1                   Restaurant
 2                   Restaurant
 3                   Restaurant
 4                   Restaurant
                  ...          
 59995    Daycare (2 - 6 Years)
 59996            Grocery Store
 59997               Restaurant
 59998               Restaurant
 59999               Restaurant
 Name: Facility Type, Length: 60000, dtype: object,
 'Long Term Care',
 0                   Restaurant
 1                   Restaurant
 2    

In [None]:
|train['Facility Type'].value_counts()

Restaurant    60000
Name: Facility Type, dtype: int64

In [None]:
for i in train['Facility Type']:
  if i != key_list:
    i == 'others'
  else: 
    i == i
  train['Facility Type'] == i
  train['Facility Type']


  

KeyboardInterrupt: ignored

In [None]:
# 200개이하의 것들을 others로 바꾸려면??
train['Facility Type'].value_counts().values

array([41281,  7853,  3568,   994,   836,   805,   752,   402,   394,
         261,   255,   217,   191,   171,   164,   141,   126,    80,
          77,    64,    51,    41,    38,    37,    33,    31,    31,
          26,    25,    25,    19,    19,    19,    19,    18,    16,
          15,    15,    14,    13,    12,    12,    12,    12,    11,
          11,    10,    10,     9,     9,     9,     9,     8,     8,
           8,     8,     8,     7,     7,     7,     7,     7,     7,
           7,     7,     7,     7,     6,     6,     6,     6,     6,
           6,     6,     5,     5,     5,     5,     5,     5,     5,
           5,     5,     5,     5,     5,     5,     5,     5,     5,
           5,     5,     5,     4,     4,     4,     4,     4,     4,
           4,     4,     4,     4,     4,     4,     4,     4,     4,
           4,     4,     4,     4,     4,     4,     4,     4,     4,
           4,     4,     4,     4,     4,     3,     3,     3,     3,
           3,     3,

In [None]:
train[train['City']=='Chicago']['City'].values =='CHICAGO' 

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False])

# 전처리

In [None]:
# 결측값 처리
train['Facility Type'] = train['Facility Type'].fillna('Restaurant')
train['Risk'] =train['Risk'].fillna('Risk 1 (High)')
train['City'] =train['City'].fillna('CHICAGO')

test['Facility Type'] = test['Facility Type'].fillna('Restaurant')
test['Risk'] =test['Risk'].fillna('Risk 1 (High)')
test['City'] =test['City'].fillna('CHICAGO')

In [None]:
# city 오타 정리
train[train['City']=='Chicago'] = 'CHICAGO'
train[train['City']=='chicago'] = 'CHICAGO'
train[train['City']=='CCHICAGO'] = 'CHICAGO'

In [None]:
test[test['City']=='Chicago'] = 'CHICAGO'
test[test['City']=='chicago'] = 'CHICAGO'
test[test['City']=='CCHICAGO'] = 'CHICAGO'

In [None]:
# 높은 카디널리티 컬럼 제거
train = train.drop(columns = 'Facility Type')
test = test.drop(columns = 'Facility Type')
train

Unnamed: 0,Risk,City,Inspection Type,Inspection Fail
0,Risk 1 (High),CHICAGO,Canvass,0
1,Risk 2 (Medium),CHICAGO,Canvass,0
2,Risk 1 (High),CHICAGO,Complaint,0
3,Risk 1 (High),CHICAGO,Canvass,0
4,Risk 1 (High),CHICAGO,Canvass,0
...,...,...,...,...
59995,Risk 1 (High),CHICAGO,License,0
59996,Risk 3 (Low),CHICAGO,License Re-Inspection,0
59997,Risk 1 (High),CHICAGO,Short Form Complaint,0
59998,Risk 1 (High),CHICAGO,License,0


In [None]:
# city컬럼 전처리
for i in train['City']:
  if i != 'CHICAGO':
    train[train['City']==i] = 'Others'   

In [None]:
for i in test['City']:
  if i != 'CHICAGO':
    test[test['City']==i] = 'Others'   

In [None]:
train['City'].value_counts()

CHICAGO    59942
Others        58
Name: City, dtype: int64

In [None]:
test['City'].value_counts()

CHICAGO    19980
Others        20
Name: City, dtype: int64

In [None]:
train[train['Inspection Type']=='Canvass']['Inspection Type'].count()

31829

In [None]:
train[train['Inspection Type']=='Canvass']['Inspection Type']

0        Canvass
1        Canvass
3        Canvass
4        Canvass
6        Canvass
          ...   
59984    Canvass
59986    Canvass
59987    Canvass
59990    Canvass
59992    Canvass
Name: Inspection Type, Length: 31829, dtype: object

In [None]:
train['Inspection Type'].value_counts()

Canvass                                      31829
License                                       7817
Canvass Re-Inspection                         6424
Complaint                                     5534
License Re-Inspection                         2633
Complaint Re-Inspection                       2230
Short Form Complaint                          2039
Suspected Food Poisoning                       253
Consultation                                   192
Tag Removal                                    175
License-Task Force                             168
CHICAGO                                        126
Recent Inspection                              121
Task Force Liquor 1475                          79
Out of Business                                 79
Others                                          58
Suspected Food Poisoning Re-inspection          50
Complaint-Fire                                  39
Short Form Fire-Complaint                       35
No Entry                       

In [None]:
 for i in train['Inspection Type']:
  if train[train['Inspection Type']==i]['Inspection Type'].count() < 2000:
    train[train['Inspection Type']] == 'Others'


In [None]:
train['Inspection Type'].value_counts()

NameError: ignored