https://archive.ics.uci.edu/ml/datasets/Adult

Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

## Data Set Information:

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
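
The extraction condition can be read as a row filter on the raw census table. A minimal sketch in pandas, assuming the raw fields are available as columns named exactly as in the condition (`AAGE`, `AGI`, `AFNLWGT`, `HRSWK`). The three sample rows are made up for illustration and do not come from the census data.

```python
import pandas as pd

# Hypothetical raw records; only the column names come from the condition above.
raw = pd.DataFrame({
    "AAGE":    [39, 15, 50],
    "AGI":     [52000, 0, 80],
    "AFNLWGT": [77516, 1200, 83311],
    "HRSWK":   [40, 0, 13],
})

# Keep only the rows that satisfy all four conditions at once.
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[mask]
print(len(clean))  # 1: the second row fails on age, the third on AGI
```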

The prediction task is to determine whether a person makes over $50K a year.

## Attribute Information:

Listing of attributes:

income (class label): >50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [29]:
# Loading libraries
import pandas as pd
import numpy as np

In [30]:
# Global variables and constants
DATA_FILENAME = 'adult.data'
TEST_FILENAME = 'adult.test'
COLUMNS = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

In [31]:
# Loading data
data = pd.read_csv(DATA_FILENAME, index_col = False, names=COLUMNS)
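
The raw file separates fields with a comma followed by a space, and the dataset marks unknown values with `?`. The sketch below shows one way to handle both at load time, assuming that layout. It is an optional variant, not what the notebook does. The first sample row matches the first record shown later in the notebook, and the second row is made up to demonstrate the `?` handling.

```python
import io
import pandas as pd

COLUMNS = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income']

# Two records in the assumed raw layout (", " separators, "?" for unknowns).
raw_text = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

sample = pd.read_csv(
    io.StringIO(raw_text),
    names=COLUMNS,
    index_col=False,
    skipinitialspace=True,  # drop the blank that follows each comma
    na_values="?",          # turn "?" into a real missing value
)

# Count missing values per column; only columns with gaps are printed.
missing = sample.isna().sum()
print(missing[missing > 0])
```

With this variant, `data.info()` would report fewer non-null entries for the columns that contain `?`, instead of counting the placeholder as a regular category.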

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 4.0+ MB


In [33]:
data.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
age,32561,38.581647,13.640433,17,28,37,48,90
fnlwgt,32561,189778.366512,105549.977697,12285,117827,178356,237051,1484705
education-num,32561,10.080679,2.57272,1,9,10,12,16
capital-gain,32561,1077.648844,7385.292085,0,0,0,0,99999
capital-loss,32561,87.30383,402.960219,0,0,0,0,4356
hours-per-week,32561,40.437456,12.347429,1,40,40,45,99


In [34]:
data.iloc[0:2, :]

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K


In [35]:
# Loading test data; skiprows=1 skips the first line of adult.test, which is not a record
test = pd.read_csv(TEST_FILENAME, index_col=False, names=COLUMNS, skiprows=1)

In [36]:
test.iloc[0:2, :]

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
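
The test labels end with a period (`<=50K.`), while the training labels do not (`<=50K`). A minimal sketch of normalizing them so both files share the same label strings. The short series are illustrative stand-ins for the `income` columns.

```python
import pandas as pd

# Stand-ins for the income columns of the training and test frames.
train_income = pd.Series(["<=50K", ">50K", "<=50K"])
test_income = pd.Series(["<=50K.", "<=50K.", ">50K."])

# Strip surrounding whitespace, then the trailing period.
test_income_clean = test_income.str.strip().str.rstrip(".")

print(sorted(test_income_clean.unique()))
print(set(test_income_clean) <= set(train_income))
```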


In [37]:
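from sklearn.preprocessing import LabelEncoder

def label_encoding(data):
    """Replace every string (object) column with integer codes, in place."""
    # A fresh encoding is fitted on each call, so codes from two frames agree
    # only when a column holds the same set of values in both.
    for col in data.columns:
        if data[col].dtype == 'object':
            data[col] = LabelEncoder().fit_transform(data[col])
    return data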
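
`label_encoding` fits a fresh encoder on whatever frame it receives, so a category that appears in the training file but not in the test file shifts the codes of every later category in that column. The sketch below shows one way to keep codes aligned: learn the category order from the training frame and reuse it on the test frame. It uses `pd.Categorical` rather than the notebook's `LabelEncoder`. The helper names `learn_categories` and `apply_categories` are hypothetical, and the tiny frames are illustrative.

```python
import pandas as pd

def learn_categories(frame):
    """Return {column: sorted unique values} for every non-numeric column."""
    text_columns = frame.select_dtypes(exclude="number").columns
    return {col: sorted(frame[col].unique()) for col in text_columns}

def apply_categories(frame, categories):
    """Encode the listed columns using a fixed category order.

    Values that were not seen when the categories were learned become -1.
    """
    encoded = frame.copy()
    for col, cats in categories.items():
        encoded[col] = pd.Categorical(frame[col], categories=cats).codes
    return encoded

# Illustrative frames: the test frame lacks one of the training categories.
train = pd.DataFrame({"country": ["Haiti", "Holand-Netherlands", "Honduras"],
                      "age": [30, 40, 50]})
test = pd.DataFrame({"country": ["Honduras", "Haiti"], "age": [25, 35]})

cats = learn_categories(train)
print(apply_categories(train, cats)["country"].tolist())  # [0, 1, 2]
print(apply_categories(test, cats)["country"].tolist())   # [2, 0]
```

Encoding the test frame on its own would instead give `Honduras` the code 1, which a model trained on the first frame would read as `Holand-Netherlands`.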

In [38]:
data = label_encoding(data)

In [39]:
data.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
age,32561,38.581647,13.640433,17,28,37,48,90
workclass,32561,3.868892,1.45596,0,4,4,4,8
fnlwgt,32561,189778.366512,105549.977697,12285,117827,178356,237051,1484705
education,32561,10.29821,3.870264,0,9,11,12,15
education-num,32561,10.080679,2.57272,1,9,10,12,16
marital-status,32561,2.611836,1.506222,0,2,2,4,6
occupation,32561,6.57274,4.228857,0,3,7,10,14
relationship,32561,1.446362,1.606771,0,0,1,3,5
race,32561,3.665858,0.848806,0,4,4,4,4
sex,32561,0.669205,0.470506,0,0,1,1,1


In [40]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100)

In [41]:
# Fit on every column except the last; the last column (income) is the target
clf.fit(data.iloc[:, :-1], data.iloc[:, -1])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
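
The printed parameters show `random_state=None`, so every fit draws different bootstrap samples and feature subsets, and scores and importances vary slightly between runs. A minimal sketch of making a forest reproducible by fixing the seed. It uses synthetic data rather than the census data, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the census data): 200 rows, 4 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Two forests built with the same seed grow identical trees.
first = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
second = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

same = np.allclose(first.feature_importances_, second.feature_importances_)
print(same)
```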

In [42]:
test = label_encoding(test)

In [43]:
test.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
age,16281,38.767459,13.849187,17,28,37,48,90
workclass,16281,3.873534,1.480682,0,4,4,4,8
fnlwgt,16281,189435.677784,105714.907671,13492,116736,177831,238384,1490400
education,16281,10.268841,3.88298,0,9,11,12,15
education-num,16281,10.072907,2.567545,1,9,10,12,16
marital-status,16281,2.632578,1.510611,0,2,2,4,6
occupation,16281,6.587617,4.233925,0,3,7,10,14
relationship,16281,1.437135,1.592903,0,0,1,3,5
race,16281,3.67244,0.840327,0,4,4,4,4
sex,16281,0.667035,0.471289,0,0,1,1,1


In [44]:
# Mean accuracy on the test set (the last column holds the true labels)
clf.score(test.iloc[:, :-1], test.iloc[:, -1])

0.85406301824212272

In [45]:
test['predictions'] = clf.predict(test.iloc[:,:-1])

In [46]:
test[['income', 'predictions']]

,income,predictions
0,0,0
1,0,0
2,1,0
3,1,1
4,0,0
5,0,0
6,0,0
7,1,1
8,0,0
9,0,0
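
A single accuracy figure does not show which kind of error the model makes. As a small worked example, the sketch below counts the four confusion-matrix cells for the ten rows displayed above and compares accuracy with always guessing the majority class. The numbers describe only these ten rows, not the full test set.

```python
# The ten (income, predictions) pairs displayed above.
income      = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
predictions = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

pairs = list(zip(income, predictions))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # >50K predicted as >50K
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # <=50K predicted as <=50K
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # <=50K predicted as >50K
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # >50K predicted as <=50K

accuracy = (tp + tn) / len(pairs)
# Accuracy of always predicting the most frequent class.
majority_baseline = max(income.count(0), income.count(1)) / len(income)

print(tp, tn, fp, fn)               # 2 7 0 1
print(accuracy, majority_baseline)  # 0.9 0.7
```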


In [47]:
clf.feature_importances_

array([ 0.14723056,  0.0396438 ,  0.17273492,  0.0306471 ,  0.09567723,
        0.07947219,  0.0663147 ,  0.09334537,  0.01397436,  0.01091782,
        0.11292096,  0.03580971,  0.08436168,  0.01694959])

In [64]:
# Pair each feature name with its importance; the last column (income) is the target, so drop it
feature_importances = dict(zip(data.columns[:-1], clf.feature_importances_))

In [65]:
feature_importances

{'age': 0.13659148779990879,
 'capital-gain': 0.12151453349908771,
 'capital-loss': 0.035658316534364937,
 'education': 0.028701338572021094,
 'education-num': 0.099279942557564255,
 'fnlwgt': 0.17586197184043012,
 'hours-per-week': 0.076972067419910306,
 'marital-status': 0.067178544889087102,
 'native-country': 0.016456379255672914,
 'occupation': 0.064444143353620337,
 'race': 0.013709634958327454,
 'relationship': 0.11688727885385063,
 'sex': 0.00801778875030844,
 'workclass': 0.038726571715845863}

In [76]:
feature_importances = pd.DataFrame(data=list(feature_importances.items()), columns=['feature', 'importance'])

In [79]:
feature_importances.sort_values(by=['importance'], ascending=False)

,feature,importance
11,fnlwgt,0.175862
2,age,0.136591
3,capital-gain,0.121515
1,relationship,0.116887
8,education-num,0.09928
5,hours-per-week,0.076972
12,marital-status,0.067179
13,occupation,0.064444
0,workclass,0.038727
9,capital-loss,0.035658
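
The same ranking can be built in one step with a `pd.Series`, which skips the intermediate dictionary-to-DataFrame conversion. A minimal sketch, with the importances copied from the dictionary output above and rounded to six decimals:

```python
import pandas as pd

# Importances copied from the dictionary printed earlier (rounded).
importances = {
    "age": 0.136591, "capital-gain": 0.121515, "capital-loss": 0.035658,
    "education": 0.028701, "education-num": 0.099280, "fnlwgt": 0.175862,
    "hours-per-week": 0.076972, "marital-status": 0.067179,
    "native-country": 0.016456, "occupation": 0.064444, "race": 0.013710,
    "relationship": 0.116887, "sex": 0.008018, "workclass": 0.038727,
}

# Sort from most to least important; the index holds the feature names.
ranking = pd.Series(importances, name="importance").sort_values(ascending=False)
print(ranking.head(3))
```

With a fitted model, the series would be built as `pd.Series(clf.feature_importances_, index=data.columns[:-1])`. Note that `fnlwgt` ranks first even though it is a census sampling weight, not a personal attribute. Impurity-based importances tend to favor continuous columns with many distinct values, so this ranking is best read with some caution.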
