## US National Census (Income)

*About this Dataset*

**US Adult Census** (1994) relates income to social factors: 

- *age*: continuous.
- *workclass*: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- *fnlwgt*: continuous.
- *education*: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- *education-num*: continuous.
- *marital-status*: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- *occupation*: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- *relationship*: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- *race*: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- *sex*: Female, Male.
- *capital-gain*: continuous.
- *capital-loss*: continuous.
- *hours-per-week*: continuous.
- *native-country*: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Each row is labelled with its income class, either ">50K" or "<=50K".

Note: This dataset was obtained from the UCI repository. It is available at:

- https://archive.ics.uci.edu/ml/datasets/census+income
- http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/

### Preprocessing

In [16]:
from pathlib import Path
import os

import pandas as pd
import numpy as np
np.seterr(divide = 'ignore')

import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

from scipy import stats

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [84]:
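def sample(softmax, temperature):
    """Draw one index from a temperature-scaled 1-D distribution, returned as a one-hot vector."""
    EPSILON = 10e-16 # to avoid taking the log of zero
    softmax = (np.array(softmax) + EPSILON).astype('float64')
    
    # temperature < 1 sharpens the distribution, temperature > 1 flattens it
    preds = np.log(softmax) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)

    return probas[0]

def bootstrap(n):
    """Return a 0/1 list of length n; 1 marks positions hit by n draws with replacement."""
    mask = n*[0]
    for _ in range(n):
        mask[np.random.randint(n)] = 1
    
    return mask

def remove(preds, max_drop, temperature):
    """Draw max_drop one-hot samples and sum them into per-row removal counts."""
    to_drop = []
    for _ in range(max_drop):
        to_drop.append(sample(preds, temperature))
        
    # to_drop is a list of arrays, so convert it before summing
    return np.array(to_drop).sum(axis=0)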

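The `bootstrap` helper above flags a row when at least one of n draws with replacement hits it, so roughly 1 - 1/e (about 63%) of the rows are expected to be flagged. A minimal vectorized sketch of the same idea follows; the name `bootstrap_mask` and the use of `numpy.random.default_rng` are assumptions for illustration and do not appear in the notebook.

```python
import numpy as np

def bootstrap_mask(n, rng):
    """Hypothetical vectorized variant: 1 marks rows hit by n draws with replacement."""
    mask = np.zeros(n, dtype=int)
    mask[rng.integers(0, n, size=n)] = 1
    return mask

rng = np.random.default_rng(0)
mask = bootstrap_mask(100_000, rng)
in_bag_fraction = mask.mean()

# the flagged fraction should land close to 1 - 1/e
print(round(in_bag_fraction, 3), round(1 - np.exp(-1), 3))
```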
In [85]:
path = Path(os.getcwd()).parent

columns = ['Age','Workclass','fnlgwt','Education','Education Num','Marital Status',
           'Occupation','Relationship','Race','Sex','Capital Gain','Capital Loss',
           'Hours/Week','Country','Above/Below 50K']

train = pd.read_csv(os.path.join(path, 'data/census_income/adult.data'), names=columns)
test = pd.read_csv(os.path.join(path, 'data/census_income/adult.test'), names=columns)
test = test.iloc[1:] # drop first row from test set

df = pd.concat([train, test]).copy(deep=True)

del train, test

In [86]:
df.replace(' >50K.', ' >50K', inplace=True)
df.replace(' <=50K.', ' <=50K', inplace=True)

df.dropna(inplace=True)
df.reset_index(inplace=True)

ctg = ['Workclass', 'Sex', 'Education', 'Marital Status', 
       'Occupation', 'Relationship', 'Race', 'Country'] # Categorical to Numerical

for c in ctg:
    df = pd.concat([df, pd.get_dummies(df[c], 
                                       prefix=c,
                                       dummy_na=False)], axis=1).drop([c],axis=1)

df_high = df[df['Above/Below 50K'] == " >50K"].copy(deep=True)
df_low = df[df['Above/Below 50K'] == " <=50K"].copy(deep=True)

df_low = df_low.reindex(np.random.permutation(df_low.index))
df_high = df_high.reindex(np.random.permutation(df_high.index))

rep = df.copy(deep=True)
nonrep = pd.concat([df_low.head(12000).copy(deep=True),
                     df_high.head(9000).copy(deep=True)], sort=True)

print('Rep: \n', rep['Above/Below 50K'].value_counts(), '\n')
print('Nonrep: \n', nonrep['Above/Below 50K'].value_counts())

nonrep['label'] = 1
rep['label'] = 0

del df, df_low, df_high

data = pd.concat([nonrep, rep], sort=True)

Rep: 
  <=50K    37155
 >50K     11687
Name: Above/Below 50K, dtype: int64 

Nonrep: 
  <=50K    12000
 >50K      9000
Name: Above/Below 50K, dtype: int64

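The counts printed above imply a "<=50K" to ">50K" ratio of about 3.18 in `rep` and about 1.33 in `nonrep`. A small sketch of that arithmetic, using the printed counts directly (the helper name `class_ratio` is hypothetical):

```python
def class_ratio(low, high):
    """Ratio of <=50K rows to >50K rows."""
    return low / high

rep_ratio = class_ratio(37155, 11687)
nonrep_ratio = class_ratio(12000, 9000)

print(f"rep: {rep_ratio:.2f}, nonrep: {nonrep_ratio:.2f}")
```

The gap between the two ratios is the class imbalance deliberately introduced into `nonrep`.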

### Experiment

- Should only one instance be dropped at a time?
- How can a good temperature value be found?
- https://www.sciencedirect.com/science/article/abs/pii/S0167865513000020

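On the temperature question: the `sample` helper rescales each probability as p^(1/T) and renormalizes, so a small T concentrates mass on the largest entry while a large T approaches a uniform draw. The sketch below compares a few values on a toy vector; the function name `temperature_scale`, the toy probabilities, and the max-subtraction for numerical stability are assumptions added for illustration.

```python
import numpy as np

def temperature_scale(probs, temperature, eps=1e-15):
    """Rescale a probability vector the way sample() does, without drawing from it."""
    logits = np.log(np.asarray(probs, dtype="float64") + eps) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return scaled / scaled.sum()

toy = [0.6, 0.3, 0.1]
scaled_by_t = {t: temperature_scale(toy, t) for t in (0.5, 1.0, 5.0)}

for t, scaled in scaled_by_t.items():
    print(t, np.round(scaled, 3))
```

Inspecting how quickly the distribution flattens over a small grid like this is one way to narrow down candidate temperatures before running the full loop.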

### MRS Algorithm

In [5]:
temperature = 1 
max_drop = 1000
limit = 5000

meta = ['label', 'Above/Below 50K', 'probs', 'index', 'bootstrap']

#ks = []
#auc = []
#ratio = []
#subset = []

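while (len(data[data.label == 1]) > max_drop and 
       len(data.label) > limit):
    
    nonrep = data[data.label == 1].copy(deep=True)
    rep = data[data.label == 0].copy(deep=True)
    
    # one bootstrap flag per row (len), not per cell (.size)
    nonrep['bootstrap'] = bootstrap(len(nonrep))
    rep['bootstrap'] = bootstrap(len(rep))
    
    # recombine so the bootstrap flags are available on data
    data = pd.concat([nonrep, rep], sort=True)
    train = data[data.bootstrap == 1]
    test = data[data.bootstrap == 0].copy(deep=True)
    
    # errors='ignore' because the 'probs' column is not created in this cell
    dt = DecisionTreeClassifier(max_depth=7)
    dt.fit(train.drop(meta, axis=1, errors='ignore'), train.label)
    preds = dt.predict_proba(test.drop(meta, axis=1, errors='ignore'))
    
    # assumption: removals are sampled by the predicted probability of label 1 (nonrep)
    test['removed'] = remove(preds[:, 1], max_drop, temperature)
    test = test[test.removed == 0].drop('removed', axis=1)
        
    data = pd.concat([train, test], sort=True)
    
    print('Dataset size:', len(data))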
    '''
    # EVALUATION
    subset.append(len(data[data.label == 'nonrep']))
    ratio.append(data[data.label == 'nonrep']['Above/Below 50K'].value_counts()[0]/
                 data[data.label == 'nonrep']['Above/Below 50K'].value_counts()[1])
    ks_two = stats.ks_2samp(data[data.label == 'nonrep']['probs'], 
                            data[data.label == 'rep']['probs'])
    ks.append(ks_two)
    auc.append(metrics.roc_auc_score([1 if k == 'nonrep' else 0 for k in data.label], data.probs))
    
    print('auc =', metrics.roc_auc_score([1 if k == 'nonrep' else 0 for k in data.label], data.probs))
    print('length of current dataframe:', len(data.label))
    '''

KeyboardInterrupt: 

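The commented-out evaluation above tracks a two-sample Kolmogorov-Smirnov test and the ROC AUC of the rep/nonrep classifier. A self-contained sketch of both metrics follows; the beta-distributed scores are synthetic and invented for illustration, so the printed numbers are not results from this dataset. An AUC near 0.5 and a small KS statistic would suggest that the two sets are hard to tell apart.

```python
import numpy as np
from scipy import stats
from sklearn import metrics

rng = np.random.default_rng(0)

# synthetic classifier scores: nonrep rows skew higher than rep rows
probs_rep = rng.beta(2, 3, size=2000)
probs_nonrep = rng.beta(3, 2, size=2000)

# KS statistic and p-value for the difference between the two score distributions
ks_stat, p_value = stats.ks_2samp(probs_nonrep, probs_rep)

# AUC of separating nonrep (1) from rep (0) using the scores
labels = np.concatenate([np.ones_like(probs_nonrep), np.zeros_like(probs_rep)])
scores = np.concatenate([probs_nonrep, probs_rep])
auc_score = metrics.roc_auc_score(labels, scores)

print(f"KS={ks_stat:.3f}, p={p_value:.3g}, AUC={auc_score:.3f}")
```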
In [None]:
subset = [21000+11687-i for i in subset]

ax = sns.lineplot(x=subset, y=auc)
plt.title('auc')
plt.show()

In [None]:
ax = sns.lineplot(x=subset, y=ratio)
# reference line: <=50K to >50K ratio in rep (37155 / 11687, about 3.18)
ax = sns.lineplot(x=subset, y=len(ratio)*[3.18])

plt.title('ratio')
plt.show()

In [None]:
ax = sns.lineplot(x=subset, y=[a[0] for a in ks])
plt.title('ks-stats')
plt.show()

In [None]:
ax = sns.lineplot(x=subset, y=[a[1] for a in ks])
plt.title('p-value')
plt.show()