# UCI Census Logistic Regression

Here we train a logistic regression model to predict whether a person's income is below $50K or at least $50K. The predictions are saved to a file that can be preprocessed and loaded into FairVis.

The data is the Census-Income (KDD) dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29

The model is trained using scikit-learn.

In [7]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

Load feature labels

In [2]:
fname = "data/census/columns.txt"
with open(fname) as f:
    # One column name per line; strip the trailing newline from each
    header_names = [line.strip() for line in f]

In [3]:
df = pd.read_csv("data/census/census-income-data.csv", sep=",", header=None, names=header_names)

In [4]:
df.head(5)

| | age | class_of_worker | detailed_industry_recode | detailed_occupation_recode | education | wage_per_hour | enroll_in_edu_inst_last_wk | marital_stat | major_industry_code | major_occupation_code | ... | country_of_birth_father | country_of_birth_mother | country_of_birth_self | citizenship | own_business_or_self_employed | fill_inc_questionnaire_for_veteran's_admin | veterans_benefits | weeks_worked_in_year | year | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | Not in universe | 0 | 0 | High school graduate | 0 | Not in universe | Widowed | Not in universe or children | Not in universe | ... | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 1 | 58 | Self-employed-not incorporated | 4 | 34 | Some college but no degree | 0 | Not in universe | Divorced | Construction | Precision production craft & repair | ... | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 94 | - 50000. |
| 2 | 18 | Not in universe | 0 | 0 | 10th grade | 0 | High school | Never married | Not in universe or children | Not in universe | ... | Vietnam | Vietnam | Vietnam | Foreign born- Not a citizen of U S | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
| 3 | 9 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | ... | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |
| 4 | 10 | Not in universe | 0 | 0 | Children | 0 | Not in universe | Never married | Not in universe or children | Not in universe | ... | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 0 | 0 | 94 | - 50000. |


Convert label to binary

In [5]:
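# instance_weight is a sampling weight, not a predictor, so it is dropped
df.drop('instance_weight', axis=1, inplace=True)
classes = df['class']
# Raw values keep a leading space from the CSV; map them to 0 (below 50K) and 1 (50K or more)
classes = classes.map({' - 50000.': 0, ' 50000+.': 1})
# Remove the label from the feature frame
df.drop('class', axis=1, inplace=True)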
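
The mapping above depends on the exact strings in the file, including the leading space. As a minimal sketch (the toy Series `raw` and the names `label_map` and `labels` are illustrative, not part of the notebook), stripping whitespace first and checking for unmapped values can guard against silent `NaN` labels, since `Series.map` returns `NaN` for any key missing from the dict:

```python
import pandas as pd

# Toy stand-in for the raw `class` column; the leading spaces mimic
# values parsed from a CSV whose fields follow ", " separators.
raw = pd.Series([" - 50000.", " 50000+.", " - 50000."], name="class")

# Stripping whitespace makes the keys independent of how the file was parsed.
label_map = {"- 50000.": 0, "50000+.": 1}
labels = raw.str.strip().map(label_map)

# Series.map yields NaN for values missing from the dict, so fail loudly
# instead of passing NaN labels on to the classifier.
if labels.isna().any():
    raise ValueError(f"unmapped class values: {raw[labels.isna()].unique()}")

labels = labels.astype(int)
print(labels.tolist())
```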

One-hot encode data

In [6]:
# Columns with dtype object hold the categorical (string) features
cols = list(df.select_dtypes(include=['object']).columns)
train_oh = df.copy()

for c in cols:
    # Prefix each indicator column with the name of its source column
    one_hot = pd.get_dummies(train_oh[c], prefix=c)

    train_oh = train_oh.drop(c, axis = 1)
    train_oh = train_oh.join(one_hot)
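
The loop above encodes one column at a time. As a sketch of an equivalent single call on a toy frame (the frame `toy` and the list `cat_cols` are illustrative and reuse only column names and values shown in the preview above), `pd.get_dummies` accepts a `columns` argument, replaces the listed columns with indicator columns, and prefixes each with its source column name:

```python
import pandas as pd

# Toy frame reusing a few column names and values from the preview.
toy = pd.DataFrame({
    "age": [73, 58, 18],
    "education": ["High school graduate", "Some college but no degree", "10th grade"],
    "marital_stat": ["Widowed", "Divorced", "Never married"],
})

# Categorical columns are listed explicitly here; the notebook instead
# discovers them with select_dtypes.
cat_cols = ["education", "marital_stat"]

# Each listed column is replaced by indicator columns named "<column>_<value>";
# columns not listed (here `age`) pass through unchanged.
toy_oh = pd.get_dummies(toy, columns=cat_cols)
print(toy_oh.columns.tolist())
```

The result has one indicator column per distinct value, so a column with many categories can widen the feature matrix considerably.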

Train and run classifier

In [11]:
X = train_oh.values
y = classes.values

clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=200)
clf.fit(X, y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
out = clf.predict(X)

In [13]:
# clf.predict already returns hard 0/1 labels, so this thresholding leaves `out` unchanged
pos = out > .5
out[pos] = 1
out[~pos] = 0
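
Thresholding at 0.5 only changes anything when applied to probabilities. The sketch below, on synthetic data (the arrays `X_toy` and `y_toy` and the model `toy_clf` are made up for illustration), contrasts `predict`, which returns class labels, with `predict_proba`, which returns per-class probabilities that can then be thresholded:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, unrelated to the census data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

toy_clf = LogisticRegression(solver="lbfgs", max_iter=200).fit(X_toy, y_toy)

hard = toy_clf.predict(X_toy)                # class labels, already 0 or 1
proba = toy_clf.predict_proba(X_toy)[:, 1]   # estimated P(class == 1)

# Thresholding the positive-class probability at 0.5 gives the same labels
# as predict for a binary model.
from_proba = (proba > 0.5).astype(int)
print(np.array_equal(hard, from_proba))
```

Working from `predict_proba` is useful when a cutoff other than 0.5 is wanted.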

In [14]:
# Accuracy on the same rows the model was fit on (no held-out split is used)
print("Model accuracy: ", sum(out == y) / y.shape[0])

Model accuracy:  0.9464973962901521
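
A single accuracy figure is easier to interpret next to a baseline and a per-class breakdown, particularly when one class is much more common than the other. The following is a rough sketch on made-up arrays (`y_true` and `y_pred` stand in for `y` and `out`, and the numbers are not taken from the census data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions standing in for `y` and `out`.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

accuracy = (y_pred == y_true).mean()

# Accuracy obtained by always predicting the most frequent class.
baseline = max(y_true.mean(), 1 - y_true.mean())

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("accuracy:", accuracy)
print("majority baseline:", baseline)
print(cm)
```

In this toy case the model is only modestly better than the baseline and misses half of the positive class, which the accuracy figure alone would not show.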


In [16]:
# Attach true labels and predictions to the original (not one-hot encoded) frame and save it
df['class'] = classes
df['out'] = out
df.to_csv("processed/census_out.csv", index=False)

In [17]:
df.shape

(199523, 42)
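
Before loading the saved file into a tool such as FairVis, a quick sanity check is to compare accuracy across subgroups. This is a minimal sketch on a toy frame (the frame `scored` and its values are invented; only the column names `education`, `class`, and `out` come from the notebook):

```python
import pandas as pd

# Toy stand-in for the saved frame: one feature plus the label and the prediction.
scored = pd.DataFrame({
    "education": ["Children", "Children", "10th grade", "10th grade", "High school graduate"],
    "class": [0, 0, 0, 1, 1],
    "out":   [0, 0, 0, 0, 1],
})

# Mark each row as correct or not, then aggregate per subgroup.
scored["correct"] = scored["class"] == scored["out"]
by_group = scored.groupby("education")["correct"].agg(["mean", "size"])
by_group.columns = ["accuracy", "n"]
print(by_group)
```

Reporting the group size next to the accuracy helps avoid over-reading results from very small subgroups.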