# One-feature classifier exploration

This notebook fits classifiers on one feature at a time, as a way of exploring which features have the most predictive power on their own.

## Reading the data

In [1]:
import sys
sys.path.insert(0, '../scripts')
import classifiers
import preparation
import evaluation
import pandas as pd

In [2]:
df = pd.read_csv("../data/raw/block-groups.csv")

Now we keep only the block groups in Cook County, Illinois, for the year 2016.

In [3]:
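df = df[df['parent-location']=='Cook County, Illinois']
#df = df[(df['year']>=2012) & (df['year']<=2016)]
# .copy() makes the filtered frame independent, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
df = df[df['year']==2016].copy()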

Now we generate our desired label: block groups in the top 10% of the ratio of evictions to eviction filings. I arbitrarily call this ratio 'evictions-effectiveness'.

In [4]:
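# Share of eviction filings that ended in an eviction
df['evictions-effectiveness'] = df['evictions'] / df['eviction-filings']
# Replace missing ratios (for example, 0 filings) with 0
preparation.fill_nas_other(df, 'evictions-effectiveness', 0)
# Label is 1 for block groups at or above the 90th percentile, 0 otherwise
perc90 = df['evictions-effectiveness'].quantile(0.9)
df.loc[df['evictions-effectiveness'] >= perc90, 'upper10'] = 1
df.loc[df['upper10'].isna(), 'upper10'] = 0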
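
The helper `preparation.fill_nas_other` is not shown in this notebook, so the following is only a minimal sketch of the same labeling step in plain pandas, run on a made-up toy frame. It assumes that block groups with zero filings should get a ratio of 0, which is one reading of the cell above rather than confirmed behavior.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame({
    "evictions":        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "eviction-filings": [0, 10, 10, 10, 10, 10, 10, 10, 10, 10],
})

# Ratio of evictions to filings. In pandas, 0/0 yields NaN and x/0 yields inf,
# so both are mapped to 0 here (an assumption about the intended handling).
ratio = toy["evictions"] / toy["eviction-filings"]
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
toy["evictions-effectiveness"] = ratio

# Flag rows at or above the 90th percentile as the positive class.
perc90 = toy["evictions-effectiveness"].quantile(0.9)
toy["upper10"] = (toy["evictions-effectiveness"] >= perc90).astype(int)
```

Building the label from a single boolean comparison avoids the intermediate NaN values that the two-step `.loc` assignment relies on.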

## Now we analyze

This is the approach:

1. Select the features we'll use
2. Impute missing values
3. Loop over the features and fit every classifier on each feature individually
4. Present all the results in a table to compare them
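
Steps 3 and 4 can be sketched as follows. This is an illustration only: it uses scikit-learn estimators directly instead of the notebook's `classifiers` module (whose internals are not shown), synthetic data in place of the block-group data, and cross-validated accuracy as an assumed scoring choice. The column names `informative` and `noise` and the `estimators` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): one informative column, one noise column.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["label"] = (toy["informative"] > 0).astype(int)

# Hypothetical set of models; the notebook's `classifiers` module may configure them differently.
estimators = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

rows = []
y = toy["label"].to_numpy()
for feature in ["informative", "noise"]:
    X = toy[[feature]].to_numpy()  # 2-D array with a single column
    for name, estimator in estimators.items():
        # 5-fold cross-validation, so every score is computed on held-out rows.
        scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
        rows.append({"feature": feature, "model": name, "mean_accuracy": scores.mean()})

# One row per (feature, model) pair, best scores first.
results = (
    pd.DataFrame(rows)
    .sort_values("mean_accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

With a label that marks only about 10% of rows as positive, accuracy can look high even for an uninformative feature, so a metric such as precision, recall, or ROC AUC may be a better choice for the `scoring` argument.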

In [5]:
features = ['population', 'poverty-rate', 'renter-occupied-households', 'pct-renter-occupied',
            'median-gross-rent', 'median-household-income', 'median-property-value', 'rent-burden', 'pct-white',
            'pct-af-am', 'pct-hispanic', 'pct-am-ind', 'pct-asian', 'pct-nh-pi', 'pct-multiple', 'pct-other']

In [6]:
impute_median = ['median-gross-rent', 'median-household-income', 'median-property-value', 'rent-burden']
for feature in impute_median:
    preparation.fill_nas_median(df, feature)
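
Since `preparation.fill_nas_median` is not shown here, the following sketch illustrates what median imputation typically looks like in pandas. The function `fill_with_median` and the toy values are hypothetical stand-ins, not the project's actual helper or data.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (hypothetical values).
toy = pd.DataFrame({"median-gross-rent": [800.0, np.nan, 1200.0, 1000.0, np.nan]})

def fill_with_median(frame, column):
    """Hypothetical stand-in: replace NaNs in `column` with that column's median, in place."""
    # Series.median() skips NaN by default, so the median comes from observed values only.
    frame[column] = frame[column].fillna(frame[column].median())

fill_with_median(toy, "median-gross-rent")
```

Note that a median computed on the full dataset also uses rows that may later serve for evaluation; computing it on the training portion only is a common way to avoid that leakage.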

In [7]:
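# One single-column 2-D array per feature, shape (n_rows, 1).
# Despite the name, these arrays are the inputs the classifiers below are fit on.
test_sets = {}
for feature in features:
    test_sets[feature] = df[feature].values.reshape(-1, 1)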

In [10]:
models = []
for feature in features:
    models.append(classifiers.boosting(test_sets[feature], df['upper10']))
    models.append(classifiers.bagging(test_sets[feature], df['upper10']))
    models.append(classifiers.random_forest(test_sets[feature], df['upper10']))
    models.append(classifiers.svm(test_sets[feature], df['upper10']))
    models.append(classifiers.logistic_regression(test_sets[feature], df['upper10']))
    models.append(classifiers.decision_tree(test_sets[feature], df['upper10']))
    models.append(classifiers.nearest_neighbors(test_sets[feature], df['upper10']))



In [12]:
ar = df['population']

8380      435.0
8397     1496.0
8414     2175.0
8431     1785.0
8448     4339.0
8465     1105.0
8482     1052.0
8499     1252.0
8516     1010.0
8533     2046.0
8550     1292.0
8567     1729.0
8584     1755.0
8601      780.0
8618     2476.0
8635     1050.0
8652     1640.0
8669     1147.0
8686     1180.0
8703     2233.0
8720     1442.0
8737      638.0
8754     2662.0
8771      903.0
8788     1753.0
8805     1240.0
8822     1875.0
8839     2124.0
8856     1564.0
8873     1515.0
          ...  
75751    1701.0
75768    1924.0
75785     869.0
75802     426.0
75819     214.0
75836    1396.0
75853     376.0
75870    1285.0
75887    1563.0
75904    1374.0
75921     557.0
75938    1265.0
75955    1273.0
75972    1848.0
75989    1281.0
76006    8572.0
76023    1093.0
76040    1546.0
76057    1019.0
76074    1421.0
76091     961.0
76108     892.0
76125     906.0
76142     428.0
76159     962.0
76176     635.0
76193     699.0
76210       0.0
76227       0.0
76244       0.0
Name: population, Length