## US National Census (Income)

*About this Dataset*

The **US Adult Census** dataset (1994) relates income to the following social factors:

- *age*: continuous.
- *workclass*: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- *fnlwgt*: continuous.
- *education*: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- *education-num*: continuous.
- *marital-status*: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- *occupation*: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- *relationship*: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- *race*: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- *sex*: Female, Male.
- *capital-gain*: continuous.
- *capital-loss*: continuous.
- *hours-per-week*: continuous.
- *native-country*: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Each row is labelled with one of two income classes: ">50K" or "<=50K".

Note: This dataset was obtained from the UCI repository and can be found at

https://archive.ics.uci.edu/ml/datasets/census+income, http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/

### Preprocessing

In [198]:
from pathlib import Path
import os
import pandas as pd
import numpy as np

path = Path(os.getcwd()).parent

columns = ['Age','Workclass','fnlgwt','Education','Education Num','Marital Status',
           'Occupation','Relationship','Race','Sex','Capital Gain','Capital Loss',
           'Hours/Week','Country','Above/Below 50K']

train = pd.read_csv(path / 'data/census_income/adult.data', names=columns)
test = pd.read_csv(path / 'data/census_income/adult.test', names=columns)
test = test.iloc[1:] # drop the first row of adult.test, which is a non-data header line

df = pd.concat([train, test])
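The raw files put a space after every comma and encode missing values as `?`, which is why later cells compare against strings such as `' ?'` and `' >50K'`. As an alternative, `read_csv` can strip the leading spaces and mark missing values at load time. The sketch below uses two illustrative rows in an in-memory buffer rather than the real files; if this approach were adopted, the string literals in the later cells would need their leading spaces removed.

```python
import io
import pandas as pd

columns = ['Age','Workclass','fnlgwt','Education','Education Num','Marital Status',
           'Occupation','Relationship','Race','Sex','Capital Gain','Capital Loss',
           'Hours/Week','Country','Above/Below 50K']

# Two illustrative rows in the adult.data layout (not read from the real file).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

# skipinitialspace drops the space after each comma; na_values turns '?' into NaN.
demo = pd.read_csv(sample, names=columns, skipinitialspace=True, na_values='?')

print(demo[['Workclass', 'Occupation', 'Above/Below 50K']])
```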

In [200]:
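df.replace(' ?', np.nan, inplace=True) # missing values are encoded as ' ?'
# dropna() and reset_index() return new frames; unassigned, they leave df unchanged.
# To apply them, assign the result: df = df.dropna().reset_index(drop=True)
df.dropna()
df.reset_index()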

ctg = ['Workclass', 'Sex', 'Education', 'Marital Status', 
       'Occupation', 'Relationship', 'Race', 'Country'] # Categorical to Numerical

for c in ctg:
    df = pd.concat([df, pd.get_dummies(df[c], 
                                       prefix=c,
                                       dummy_na=False)], axis=1).drop([c],axis=1)

df_high = df[df['Above/Below 50K'] == " >50K"].copy(deep=True)
df_low = df[df['Above/Below 50K'] == " <=50K"].copy(deep=True)
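Most pandas methods, including `dropna` and `reset_index`, return a new DataFrame unless the result is assigned (or `inplace=True` is passed). A minimal sketch on a hypothetical three-row frame shows the difference; note that applying the assigned form to the real data would change the row counts reported further down.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
demo = pd.DataFrame({'Workclass': [' Private', np.nan, ' State-gov'],
                     'Age': [39, 50, 38]})

demo.dropna()               # result is discarded, demo is unchanged
rows_before = len(demo)

demo = demo.dropna().reset_index(drop=True)  # assign to keep the result
rows_after = len(demo)

print(rows_before, rows_after)
```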

In [201]:
df_high.head(3) # Income >50K

| | Age | fnlgwt | Education Num | Capital Gain | Capital Loss | Hours/Week | Above/Below 50K | Workclass_ Federal-gov | Workclass_ Local-gov | Workclass_ Never-worked | ... | Country_ Portugal | Country_ Puerto-Rico | Country_ Scotland | Country_ South | Country_ Taiwan | Country_ Thailand | Country_ Trinadad&Tobago | Country_ United-States | Country_ Vietnam | Country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 52 | 209642.0 | 9.0 | 0.0 | 0.0 | 45.0 | >50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8 | 31 | 45781.0 | 14.0 | 14084.0 | 0.0 | 50.0 | >50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | 42 | 159449.0 | 13.0 | 5178.0 | 0.0 | 40.0 | >50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [202]:
df_low.head(3) # Income <=50K

| | Age | fnlgwt | Education Num | Capital Gain | Capital Loss | Hours/Week | Above/Below 50K | Workclass_ Federal-gov | Workclass_ Local-gov | Workclass_ Never-worked | ... | Country_ Portugal | Country_ Puerto-Rico | Country_ Scotland | Country_ South | Country_ Taiwan | Country_ Thailand | Country_ Trinadad&Tobago | Country_ United-States | Country_ Vietnam | Country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516.0 | 13.0 | 2174.0 | 0.0 | 40.0 | <=50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 50 | 83311.0 | 13.0 | 0.0 | 0.0 | 13.0 | <=50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 38 | 215646.0 | 9.0 | 0.0 | 0.0 | 40.0 | <=50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [203]:
print("df_high / df_low:\n", len(df_high), "/", len(df_low), "=", round(len(df_high)/len(df_low),4))

df_high / df_low:
 7841 / 24720 = 0.3172
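The two counts add up to 32,561, which is the number of rows in adult.data alone, so the rows from adult.test do not appear to match either filter. A likely cause is that the labels in adult.test end with a period (`' >50K.'`, `' <=50K.'`). A minimal sketch, on a hypothetical label column, of normalizing the labels before filtering:

```python
import pandas as pd

# Hypothetical labels: two in the adult.data style, two in the adult.test style.
labels = pd.Series([' >50K', ' <=50K', ' >50K.', ' <=50K.'])

# An exact comparison only matches the label without the trailing period.
matched_raw = int((labels == ' >50K').sum())

# Strip surrounding whitespace and any trailing period, then compare.
clean = labels.str.strip().str.rstrip('.')
matched_clean = int((clean == '>50K').sum())

print(matched_raw, matched_clean)
```

Normalizing the labels on the real data would bring the test rows into `df_high` and `df_low`, so the counts above would change.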


### Experiment

Construct a dataset in which high-income instances are **over-represented** by **undersampling** the low-income instances (df_low).

Initial attempt: halve the size of df_low ==> expect about half of df_high to be removed by the algorithm.

Run the algorithm.

Check the results.
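The undersampling step can be sketched with `DataFrame.sample`, which draws a random subset without replacement and takes a seed for reproducibility. The frames, the `undersample` helper, and the 0.5 fraction below are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for df_low and df_high.
demo_low = pd.DataFrame({'x': rng.normal(size=1000), 'label': ' <=50K'})
demo_high = pd.DataFrame({'x': rng.normal(size=300), 'label': ' >50K'})

def undersample(df, frac, seed=0):
    """Return a random fraction of the rows, drawn without replacement."""
    return df.sample(frac=frac, random_state=seed)

low_half = undersample(demo_low, 0.5)

ratio_before = len(demo_high) / len(demo_low)
ratio_after = len(demo_high) / len(low_half)
print(ratio_before, ratio_after)
```

Halving the low-income group doubles the high/low ratio, which is the over-representation the experiment then tries to undo.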

In [204]:
# Randomize data
df_low = df_low.reindex(np.random.permutation(df_low.index))
df_high = df_high.reindex(np.random.permutation(df_high.index))

low = df_low.head(21000).copy(deep=True)
high = df_high.copy(deep=True)

print("high / low:\n", len(high), "/", len(low), "=", round(len(high)/len(low),4))

high / low:
 7841 / 21000 = 0.3734


In [252]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict


iterations = 5

# TODO: replace with a while loop that exits on a KS-test condition
for i in range(iterations):
    
    rf = RandomForestClassifier(n_estimators=100, 
                                bootstrap=True,
                                max_features = 'sqrt')
    
    data = pd.concat([low, high], sort=True)
    
    preds = cross_val_predict(rf, 
                              data.drop(['Above/Below 50K'], axis=1),
                              data['Above/Below 50K'], 
                              cv=10,
                              method='predict_proba')
    
    # TODO: use temperature sampling instead of removing the highest-probability instance from "high"
    
    # predict_proba columns follow the sorted class labels: [' <=50K', ' >50K']
    data['preds'] = preds[:, 1] # P(>50K)
    
    # idxmax takes an axis, not a column list, so select 'preds' for the >50K rows first
    is_high = data['Above/Below 50K'] == ' >50K'
    drop_id = data.loc[is_high, 'preds'].idxmax(skipna=True)
    high = high.drop(drop_id).copy()
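The two ideas noted in the loop's comments, temperature sampling and a KS-test exit condition, could be sketched as the helpers below. Both are assumptions about what those comments intend: `temperature_pick` and `distributions_match` are hypothetical names, the notebook does not say which two samples the KS test should compare, and the scores are made up. The sketch uses `scipy.stats.ks_2samp` and assumes at least one probability is greater than zero.

```python
import numpy as np
from scipy.stats import ks_2samp

def temperature_pick(probs, temperature=0.5, rng=None):
    """Pick one position at random, weighted by probs ** (1 / temperature).

    A low temperature approaches always picking the maximum; a high
    temperature approaches a uniform choice.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    weights = probs ** (1.0 / temperature)
    weights = weights / weights.sum()  # assumes weights.sum() > 0
    return int(rng.choice(len(probs), p=weights))

def distributions_match(a, b, alpha=0.05):
    """True when a two-sample KS test does not reject equality at level alpha."""
    return ks_2samp(a, b).pvalue > alpha

rng = np.random.default_rng(0)
scores = np.array([0.10, 0.60, 0.95])  # hypothetical P(>50K) for three "high" rows
picked = temperature_pick(scores, temperature=0.5, rng=rng)
print(picked)
```

In the loop, `picked` would index the row to drop from `high`, and a `while not distributions_match(...)` condition could replace the fixed iteration count once the two samples to compare are decided.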