# One-feature classifier exploration

This notebook builds one-feature classifiers to explore which individual features carry the most predictive power.

## Reading the data

In [1]:
import sys
sys.path.insert(0, '../scripts')
import classifiers
import preparation
import evaluation
import pandas as pd

In [2]:
df = pd.read_csv("../data/raw/block-groups.csv")

Next, we keep only the block groups in Cook County, Illinois, for the year 2016.

In [3]:
df = df[df['parent-location']=='Cook County, Illinois']
#df = df[(df['year']>=2012) & (df['year']<=2016)]
df = df[df['year']==2016]

Now we generate the label: block groups in the upper 10% of the ratio of evictions to eviction filings. I arbitrarily call this ratio 'evictions-effectiveness'.

In [4]:
df['evictions-effectiveness'] = df['evictions'] / df['eviction-filings']
preparation.fill_nas_other(df, 'evictions-effectiveness', 0)
perc90 = df['evictions-effectiveness'].quantile(0.9)
df.loc[df['evictions-effectiveness'] >= perc90, 'upper10'] = 1
df.loc[df['upper10'].isna(), 'upper10'] = 0
y = df['upper10']
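
As a rough, self-contained illustration of the label construction, here is a sketch on made-up data. It assumes that `preparation.fill_nas_other` replaces missing values with the given constant; that helper is not shown in this notebook, so the pandas calls below are a stand-in rather than its actual implementation. The sketch also maps infinite ratios (evictions recorded with zero filings) to 0, which is an extra assumption.

```python
import numpy as np
import pandas as pd

# Toy data: the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "evictions":        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "eviction-filings": [0, 10, 10, 10, 10, 10, 10, 10, 10, 10],
})

# 0 / 0 yields NaN and x / 0 yields inf; both are mapped to 0 here (assumption).
ratio = toy["evictions"] / toy["eviction-filings"]
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

# Flag the rows at or above the 90th percentile of the ratio.
threshold = ratio.quantile(0.9)
toy["upper10"] = (ratio >= threshold).astype(int)
```

Because the comparison uses `>=`, ties at the threshold are all flagged, so the positive class can end up slightly larger than 10% of the rows.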

## Analysis

This is the approach:

1. Select the features we'll use
2. Impute missing values
3. Loop over the features and fit each classifier on each feature individually
4. Present all the results in a table to compare them
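
The steps above can be sketched on synthetic data as follows. This is a minimal sketch, not the code in the `classifiers` or `evaluation` modules: it assumes scikit-learn estimators, uses only two model types, and the names `demo`, `label`, `builders`, and `results` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: two features, with a label driven by feature_a only.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
label = (demo["feature_a"] > 0.5).astype(int)

# Factories so that every (feature, model) pair gets a fresh estimator.
builders = {
    "logistic_regression": lambda: LogisticRegression(),
    "decision_tree": lambda: DecisionTreeClassifier(max_depth=3, random_state=0),
}

rows = []
for feature in demo.columns:
    X = demo[[feature]].to_numpy()  # single-column matrix, shape (n, 1)
    for name, build in builders.items():
        model = build().fit(X, label)
        rows.append({
            "feature": feature,
            "model": name,
            "train_accuracy": model.score(X, label),  # in-sample score
        })

# One row per (feature, model) pair, ready to compare side by side.
results = pd.DataFrame(rows)
```

The score in this sketch is computed on the same rows used for fitting, so it is optimistic; a held-out split would give a fairer comparison between features.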

In [5]:
features = ['population', 'poverty-rate', 'renter-occupied-households', 'pct-renter-occupied',
            'median-gross-rent', 'median-household-income', 'median-property-value', 'rent-burden', 'pct-white',
            'pct-af-am', 'pct-hispanic', 'pct-am-ind', 'pct-asian', 'pct-nh-pi', 'pct-multiple', 'pct-other']

In [6]:
impute_median = ['median-gross-rent', 'median-household-income', 'median-property-value', 'rent-burden']
for feature in impute_median:
    preparation.fill_nas_median(df, feature)
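
For reference, median imputation can be sketched in plain pandas as below. This assumes `preparation.fill_nas_median` fills each column's missing values with that column's median; the helper itself is not shown here, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values.
toy = pd.DataFrame({"median-gross-rent": [800.0, np.nan, 1200.0, 1000.0, np.nan]})

for col in ["median-gross-rent"]:
    # Series.median() skips NaN by default, so the median comes from observed values only.
    toy[col] = toy[col].fillna(toy[col].median())
```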

In [7]:
# Single-column design matrices, keyed by feature name
test_sets = {}
for feature in features:
    test_sets[feature] = df[feature].values.reshape(-1, 1)

In [8]:
# Classifier builders from the classifiers module, applied in this order to every feature
builders = [classifiers.boosting, classifiers.bagging, classifiers.random_forest, classifiers.svm,
            classifiers.logistic_regression, classifiers.decision_tree, classifiers.nearest_neighbors]
models = []
test_sets_list = []
for feature in features:
    for build in builders:
        models.append(build(test_sets[feature], y))
        test_sets_list.append(df[feature])



In [9]:
table = evaluation.evaluation_table2(models, test_sets_list, y)

Predicting every data point's value to be 0, the accuracy is 89.2 %


  prec = true_positives / (false_positive + true_positives)
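
The two kinds of output above can be illustrated with a small sketch. The first line of output is a majority-class baseline: predicting 0 everywhere is already about 89% accurate because the positive class is rare. The echoed `prec = ...` line looks like the source line of a division warning, which would arise when a model predicts no positives at all, so that precision becomes 0 / 0; the warning text itself is not shown, so this reading is an assumption. The function `safe_precision` below is hypothetical and not part of the `evaluation` module.

```python
import numpy as np

# Toy labels with a rare positive class (2 of 10).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Baseline: predict the majority class (0) for every row.
y_pred = np.zeros_like(y_true)
baseline_accuracy = float((y_pred == y_true).mean())


def safe_precision(y_true, y_pred):
    """Precision that returns NaN instead of dividing by zero."""
    true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
    false_positives = int(((y_pred == 1) & (y_true == 0)).sum())
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        # No positive predictions: precision is undefined.
        return float("nan")
    return true_positives / predicted_positives


precision_all_zero = safe_precision(y_true, y_pred)
```

Any one-feature model should be compared against this baseline accuracy rather than against 50%, and an undefined precision is a sign that the model never predicts the positive class.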


## Exporting results

We export the results table to a CSV file.

In [16]:
table.to_csv('../outputs/one-feature_classifiers_May20.csv')