## Fairness in Machine Learning project

Chosen dataset: Communities and Crime

First, download and preprocess the data. The dataset has many variables, and some of them can probably be dropped to reduce model complexity. Five features are non-predictive: state, county, community, communityname, and fold. We keep fold for now, since it may be useful for cross-validation, and drop the other four. Missing values are marked with "?", so we convert them to NaN.

In [21]:
import pandas as pd 
import numpy as np 
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
communities_and_crime = fetch_ucirepo(id=183) 
  
# data (as pandas dataframes) 
X = communities_and_crime.data.features 
y = communities_and_crime.data.targets 

X = X.drop(["state", "county", "community", "communityname"], axis="columns")
X = X.replace('?', np.nan)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Columns: 123 entries, fold to PolicBudgPerPop
dtypes: float64(99), int64(1), object(23)
memory usage: 1.9+ MB
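
The summary above still lists 23 object columns, because values that were read as strings stay strings after "?" is replaced. As a minimal sketch of one alternative, `pd.to_numeric` with `errors="coerce"` can handle the marker and the dtype conversion in one step. The toy frame and its values below are made up for illustration and only stand in for `X`.

```python
import pandas as pd

# Toy frame standing in for X (values are made up); "?" marks a missing value
toy = pd.DataFrame({
    "OtherPerCap": ["0.36", "?", "0.22"],
    "PolicPerPop": ["?", "?", "0.18"],
    "population": [0.19, 0.0, 0.04],
})

# errors="coerce" turns anything that cannot be parsed as a number,
# including "?", into NaN and returns a numeric column
toy = toy.apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy.isnull().sum())
```

Note that coercion silently converts any other non-numeric string to NaN as well, so it is only appropriate when "?" is the sole non-numeric value expected.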


Next, we have to deal with the null values. Let's count the nulls in each column to decide whether to drop entire columns or only the rows with missing values.

In [22]:
null_counts = X.isnull().sum() # count the null values in each column
null_counts = null_counts[null_counts > 0] # keep only columns with missing values
columns_with_nans = list(null_counts.index) # collect column names
for name, count in null_counts.items(): # print column names with missing values
    print(name, count)

OtherPerCap 1
LemasSwornFT 1675
LemasSwFTPerPop 1675
LemasSwFTFieldOps 1675
LemasSwFTFieldPerPop 1675
LemasTotalReq 1675
LemasTotReqPerPop 1675
PolicReqPerOffic 1675
PolicPerPop 1675
RacialMatchCommPol 1675
PctPolicWhite 1675
PctPolicBlack 1675
PctPolicHisp 1675
PctPolicAsian 1675
PctPolicMinor 1675
OfficAssgnDrugUnits 1675
NumKindsDrugsSeiz 1675
PolicAveOTWorked 1675
PolicCars 1675
PolicOperBudg 1675
LemasPctPolicOnPatr 1675
LemasGangUnitDeploy 1675
PolicBudgPerPop 1675
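
Raw counts are easier to judge as a share of the rows: 1675 of 1994 is roughly 84%. The sketch below shows one way to compute that share per column and flag the mostly-missing ones. The toy data and the 0.5 cutoff are assumptions for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X (values are made up)
toy = pd.DataFrame({
    "OtherPerCap": [0.36, np.nan, 0.22, 0.28, 0.11],
    "PolicPerPop": [np.nan, np.nan, 0.18, np.nan, np.nan],
    "population": [0.19, 0.0, 0.04, 0.01, 0.02],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column
missing_share = toy.isnull().mean()

threshold = 0.5  # hypothetical cutoff for "mostly missing"
mostly_missing = missing_share[missing_share > threshold].index.tolist()

print(missing_share[missing_share > 0])
print(mostly_missing)
```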


OtherPerCap has only a single missing value, while the other listed features are missing in 1675 of the 1994 rows (about 84%). Hence, we drop those features entirely and then drop rows with missing values, which removes the single row with a missing OtherPerCap.

In [None]:
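cols_to_drop = [c for c in columns_with_nans if c != "OtherPerCap"] # keep OtherPerCap, which has only one missing value
X = X.drop(cols_to_drop, axis="columns")
X = X.dropna(axis="index")
X["OtherPerCap"] = X["OtherPerCap"].astype(float) # convert from string to float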
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1993 entries, 0 to 1993
Columns: 101 entries, fold to LemasPctOfficDrugUn
dtypes: float64(100), int64(1)
memory usage: 1.6 MB
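
Dropping a row from `X` leaves `y` with its original 1994 rows, so the targets may need to be aligned before any model is fit. A minimal sketch of one way to do that, reusing the surviving index of the features, is shown below. The toy frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy features and targets sharing the same row index (values are made up)
toy_X = pd.DataFrame({
    "OtherPerCap": [0.36, np.nan, 0.22],
    "population": [0.19, 0.0, 0.04],
})
toy_y = pd.DataFrame({"ViolentCrimesPerPop": [0.20, 0.67, 0.43]})

# Drop rows with missing feature values; dropna keeps the original index labels
toy_X = toy_X.dropna(axis="index")

# Select the target rows whose labels survived in the features
toy_y = toy_y.loc[toy_X.index]

print(toy_X.shape, toy_y.shape)
```

Because `dropna` preserves index labels (the output above shows `Index: 1993 entries, 0 to 1993`), selecting by label with `.loc` keeps each target paired with its own feature row.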


Next up is feature selection.