## Bias & Fairness

In [1]:
import pandas as pd
import pickle
from sklearn.tree import DecisionTreeClassifier

In [2]:
def load_pickle(path):
    # Open the file in a context manager so the handle is closed after loading
    with open(path, "rb") as f:
        return pickle.load(f)

df_raw = load_pickle("df_raw.pkl")
df_clean = load_pickle("df_clean.pkl")
df_fe = load_pickle("df_fe.pkl")
m = load_pickle("best_model.pkl")
df_trn_tst = load_pickle("df_trn_tst.pkl")

## Facility type classification

We can group the facility types by taking the five most common ones (Restaurant, Grocery Store, School, Children's services facility, and Bakery) and adding a sixth group made up of the four 'care' facility types (Daycare 2-6 years, Daycare above and under 2 years, Long term care, and Daycare combo). Everything else would fall into the 'other' category.

In [21]:
tipo = pd.DataFrame(df_fe.facility_type.value_counts())
tipo['name'] = tipo.index
tipo.index = range(len(tipo.name))
tipo.head(15)

   facility_type                             name
0         103554                       restaurant
1          20887                    grocery store
2          11886                           school
3           2973     children's services facility
4           2368                           bakery
5           2300            daycare (2 - 6 years)
6           2222  daycare above and under 2 years
7           1109                   long term care
8            907                         catering
9            770            mobile food dispenser


The 'other' category would account for roughly 6% of the records.

In [10]:
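# Share of records covered by the first eight facility types (numeric column only)
tipo[['facility_type']].iloc[0:8].sum() / tipo.facility_type.sum()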

facility_type    0.94149
dtype: float64
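
To see how coverage grows as more facility types are kept, the cumulative share can be computed in one pass. The following is only a sketch on made-up counts, not the inspection data; `counts`, `cum_share`, and `k` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Made-up counts, sorted in descending order as value_counts() would return them
counts = pd.Series([60, 20, 12, 5, 3], index=["a", "b", "c", "d", "e"])

# Share of all rows covered by the top-k categories, for every k
cum_share = counts.cumsum() / counts.sum()
print(cum_share)

# Smallest number of categories needed to cover at least 90% of the rows
k = int((cum_share >= 0.90).values.argmax()) + 1
print(k)
```

Reading the cutoff from the cumulative share makes it easier to justify how many types are kept as separate groups before the rest is merged.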

In [25]:
grupo1 = tipo.iloc[0:4,1].tolist()
grupo2 = tipo.iloc[[5,6,7,11],1].tolist()
grupo2

['daycare (2 - 6 years)',
 'daycare above and under 2 years',
 'long term care',
 'daycare combo 1586']

In [28]:
df_fe['class'] = df_fe['facility_type'].apply(lambda x: x if x in grupo1 else ('daycare' if x in grupo2 else 'other'))
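
The same grouping can also be written as a lookup table, which may be easier to audit than a nested conditional. The sketch below runs on toy data; `to_class`, `toy`, and `toy_class` are hypothetical names, and the two lists are hard-coded stand-ins for `grupo1` and `grupo2`.

```python
import pandas as pd

# Hard-coded stand-ins for grupo1 and grupo2 from the cells above
grupo1 = ["restaurant", "grocery store", "school", "children's services facility"]
grupo2 = ["daycare (2 - 6 years)", "daycare above and under 2 years",
          "long term care", "daycare combo 1586"]

# Dominant types keep their own name; care-type facilities collapse into 'daycare'
to_class = {name: name for name in grupo1}
to_class.update({name: "daycare" for name in grupo2})

# Toy sample standing in for df_fe['facility_type']
toy = pd.Series(["restaurant", "long term care", "bakery", "school"])

# Types missing from the lookup table map to NaN, which then becomes 'other'
toy_class = toy.map(to_class).fillna("other")
print(toy_class.tolist())
```

With a dictionary, every facility type that receives its own class is listed in one place, so a type that was left out by mistake is easy to spot.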

### Resulting categories

In [29]:
df_fe['class'].value_counts()

restaurant                      103554
grocery store                    20887
school                           11886
other                            10957
daycare                           6196
children's services facility      2973
Name: class, dtype: int64

## Zip Code classification

The other way to assign categories and define protected groups is by zip code, classifying the codes by the development or income level of the different areas of Chicago:
- High
- Low-Medium
- Downtown

Zip codes that cannot be placed in any of these groups go into the 'other' category.

***References:***
- https://www.chicago.gov/city/en.html
- https://www.chicago.gov/content/dam/city/sites/covid/reports/2020-04-24/ChicagoCommunityAreaandZipcodeMap.pdf
- https://voorheescenter.wordpress.com/2015/10/13/the-affordability-challenge-chicago-updates-the-affordable-requirements-ordinance/
- https://www.chicagohealthatlas.org/community-areas

The zip codes are grouped according to information from the City of Chicago.

In [12]:
lev = pd.read_csv('levels.csv')
lev['zip'] = lev['zip'].astype(str)

In [13]:
lev.index = lev.zip
dic = lev.level.to_dict()

In [14]:
def zips(x):
    # Look up the level for a zip code; codes missing from the table become 'other'
    return dic.get(x, 'other')

df_fe['level'] = df_fe['zip'].apply(zips)
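
Because the lookup keys are strings, a zip code stored with a different type or with stray whitespace would silently fall into 'other'. A quick check like the one below lists the codes that were not matched so they can be inspected by hand. It is a sketch on toy data: `toy_levels`, `toy_df`, and `unmatched` are hypothetical names, and the zip codes and their levels are placeholders, not real assignments.

```python
import pandas as pd

# Placeholder lookup table standing in for `dic`
toy_levels = {"11111": "downtown", "22222": "high", "33333": "low-mid"}

# Toy frame standing in for df_fe; note the integer and the padded string
toy_df = pd.DataFrame({"zip": ["11111", 22222, " 33333", "99999"]})

# Normalize to trimmed strings before the lookup
toy_df["zip"] = toy_df["zip"].astype(str).str.strip()
toy_df["level"] = toy_df["zip"].map(toy_levels).fillna("other")

# Codes that ended up in 'other', with their frequencies
unmatched = toy_df.loc[toy_df["level"] == "other", "zip"].value_counts()
print(unmatched)
```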

### Resulting categories

In [15]:
df_fe.level.value_counts()

high        88290
low-mid     36769
downtown    24847
other        6547
Name: level, dtype: int64
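
Once each inspection carries a group label, a common first step in a bias audit is to compare error rates across groups. The sketch below computes the false positive rate per group on toy data. The notebook does not state which metric or reference group is used, so the column names `y_true`, `y_pred`, and `group`, the choice of false positive rate, and the reference group are all assumptions made for illustration.

```python
import pandas as pd

# Toy labels and predictions; not taken from the inspection data
toy = pd.DataFrame({
    "group":  ["high", "high", "high", "low-mid", "low-mid", "low-mid"],
    "y_true": [0, 0, 1, 0, 0, 1],
    "y_pred": [1, 0, 1, 1, 1, 1],
})

# False positive rate: share of true negatives that were predicted positive
negatives = toy[toy["y_true"] == 0]
fpr = negatives.groupby("group")["y_pred"].mean()
print(fpr)

# Disparity relative to a reference group ('high' is an arbitrary choice here)
disparity = fpr / fpr["high"]
print(disparity)
```

A disparity far from 1 for a given group would suggest that the model treats it differently from the reference group on this metric.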

In [5]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 156453 entries, 0 to 156452
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   license           156453 non-null  object        
 1   facility_type     156453 non-null  object        
 2   label_risk        156453 non-null  int32         
 3   zip               156453 non-null  object        
 4   inspection_date   156453 non-null  datetime64[ns]
 5   inspection_type   156452 non-null  object        
 6   aka_name          156453 non-null  object        
 7   violations_count  156453 non-null  int32         
 8   sin_mnth          156453 non-null  float64       
 9   cos_mnth          156453 non-null  float64       
 10  sin_wkd           156453 non-null  float64       
 11  cos_wkd           156453 non-null  float64       
 12  label_results     156453 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int32(2), int64(1), objec