# One-feature classifier exploration

This notebook builds one-feature classifiers to explore which individual features carry the most predictive power.

## Reading the data

In [20]:
import sys
sys.path.insert(0, '../scripts')
import classifiers
import preparation
import pandas as pd

In [2]:
df = pd.read_csv("../data/raw/block-groups.csv")

Now we keep only the block groups in Cook County for the year 2016.

In [27]:
df = df[df['parent-location']=='Cook County, Illinois']
#df = df[(df['year']>=2012) & (df['year']<=2016)]
df = df[df['year']==2016]

Now we generate our desired label: block groups in the upper 10% of the ratio # evictions / # eviction filings. I arbitrarily call this ratio 'evictions-effectiveness'.

In [43]:
df['evictions-effectiveness'] = df['evictions'] / df['eviction-filings']
preparation.fill_nas_other(df, 'evictions-effectiveness', 0)
perc90 = df['evictions-effectiveness'].quantile(0.9)
df.loc[df['evictions-effectiveness'] >= perc90, 'upper10'] = 1
df.loc[df['upper10'].isna(), 'upper10'] = 0
df.groupby('upper10').sum()


upper10,GEOID,year,population,poverty-rate,renter-occupied-households,pct-renter-occupied,median-gross-rent,median-household-income,median-property-value,rent-burden,...,pct-multiple,pct-other,eviction-filings,evictions,eviction-rate,eviction-filing-rate,low-flag,imputed,subbed,evictions-effectiveness
0.0,606494704496006,7178976,4694760.0,52765.8,814671.0,150261.7,3691755.0,214857686.0,854393402.0,120388.2,...,5332.83,639.3,29124.0,9721.0,5428.25,18573.4,0,0,0,811.189072
1.0,73576556381738,870912,541633.0,4255.5,79349.0,15232.33,460204.0,28339489.5,112424400.0,13896.5,...,730.75,41.75,1486.0,1182.0,1255.43,1464.89,0,0,0,379.584359
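
As a self-contained illustration of the labeling step, the sketch below repeats it on a small made-up table. The toy values are invented, and `fillna(0)` stands in for `preparation.fill_nas_other`, whose exact behavior is an assumption here.

```python
import pandas as pd

# Toy data (invented for illustration): 10 block groups.
toy = pd.DataFrame({
    "evictions":        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "eviction-filings": [0, 10, 10, 10, 10, 10, 10, 10, 10, 10],
})

# Share of filings that end in an eviction; 0 / 0 yields NaN.
toy["evictions-effectiveness"] = toy["evictions"] / toy["eviction-filings"]
# Assumption: missing ratios are set to 0, mirroring fill_nas_other(df, col, 0).
toy["evictions-effectiveness"] = toy["evictions-effectiveness"].fillna(0)

# Flag the rows at or above the 90th percentile of the ratio.
perc90 = toy["evictions-effectiveness"].quantile(0.9)
toy["upper10"] = (toy["evictions-effectiveness"] >= perc90).astype(int)

print(toy["upper10"].value_counts())
```

Building the flag from a boolean comparison avoids the intermediate NaN values that the two-step `.loc` assignment has to fill afterwards, and it yields an integer column rather than a float one.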


## Now we analyze

This is the approach:

1. Select the features we'll use
2. Impute missing values
3. Loop over the features and build every classifier on each feature alone
4. Present all the results in a table to compare them
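
Steps 1, 3 and 4 can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's pipeline: it calls scikit-learn estimators directly instead of the project's `classifiers` module, the column names `signal` and `noise` are invented, and cross-validated ROC AUC is only one possible way to compare the models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
# Synthetic data (invented): 'signal' drives the label, 'noise' does not.
data = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
label = (data["signal"] > 0.5).astype(int)

estimators = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

rows = []
for feature in data.columns:
    X = data[[feature]].to_numpy()  # a single column, shape (n, 1)
    for name, estimator in estimators.items():
        # 5-fold cross-validated ROC AUC for this (feature, model) pair.
        scores = cross_val_score(estimator, X, label, cv=5, scoring="roc_auc")
        rows.append({"feature": feature, "model": name, "auc": scores.mean()})

# One row per feature, one column per model.
results = pd.DataFrame(rows).pivot(index="feature", columns="model", values="auc")
print(results.round(3))
```

In such a table, a feature with real predictive power stands out across models, while an uninformative one stays near an AUC of 0.5.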

In [5]:
features = ['population', 'poverty-rate', 'renter-occupied-households', 'pct-renter-occupied',
            'median-gross-rent', 'median-household-income', 'median-property-value', 'rent-burden', 'pct-white',
            'pct-af-am', 'pct-hispanic', 'pct-am-ind', 'pct-asian', 'pct-nh-pi', 'pct-multiple', 'pct-other']

In [23]:
impute_median = ['median-gross-rent', 'median-household-income', 'median-property-value', 'rent-burden']
for feature in impute_median:
    preparation.fill_nas_median(df, feature)
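
The implementation of `preparation.fill_nas_median` is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it fills missing values in place with the column median (the function name `fill_nas_median_sketch` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd


def fill_nas_median_sketch(frame, column):
    """Replace missing values in `column` with that column's median, in place."""
    frame[column] = frame[column].fillna(frame[column].median())


# Toy data (invented): one missing rent value.
toy = pd.DataFrame({"median-gross-rent": [800.0, np.nan, 1200.0, 1000.0]})
fill_nas_median_sketch(toy, "median-gross-rent")
print(toy)
```

The median is a common choice for skewed quantities such as rent, income and property value, because a few extreme block groups shift it far less than they would shift the mean.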

In [25]:
# Each feature as an (n_samples, 1) array, the 2-D shape scikit-learn estimators expect
test_sets = {}
for feature in features:
    test_sets[feature] = df[feature].values.reshape(-1, 1)

In [26]:
models = {}
# Classifiers need a discrete target, so train on the binary 'upper10' label;
# the continuous 'evictions-effectiveness' ratio raises
# "ValueError: Unknown label type: 'continuous'".
label = df['upper10'].astype(int)
for feature in features:
    X = test_sets[feature]  # 2-D single-column array for this feature
    models['boosting-{}'.format(feature)] = classifiers.boosting(X, label)
    models['bagging-{}'.format(feature)] = classifiers.bagging(X, label)
    models['random_forest-{}'.format(feature)] = classifiers.random_forest(X, label)
    models['svm-{}'.format(feature)] = classifiers.svm(X, label)
    models['logistic_regression-{}'.format(feature)] = classifiers.logistic_regression(X, label)
    models['decision_trees-{}'.format(feature)] = classifiers.decision_trees(X, label)
    models['nearest_neighbors-{}'.format(feature)] = classifiers.nearest_neighbors(X, label)
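
scikit-learn classifiers raise `ValueError: Unknown label type: 'continuous'` when the target is a real-valued column rather than a set of classes. Assuming the `classifiers` helpers wrap scikit-learn estimators (this notebook does not confirm it), a quick way to see how a target will be interpreted is `sklearn.utils.multiclass.type_of_target`. The arrays below are invented stand-ins for a ratio column and a 0/1 flag.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

ratio = np.array([0.0, 0.25, 0.5, 0.9])  # invented values, like a ratio column
flag = np.array([0, 0, 0, 1])            # invented values, like a 0/1 flag

print(type_of_target(ratio))  # continuous
print(type_of_target(flag))   # binary
```

Only the second kind of target is accepted by a classifier, which is why the binary `upper10` column, not the raw ratio, is the label to train on.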