<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Problem Statement</a></span><ul class="toc-item"><li><span><a href="#Abstract" data-toc-modified-id="Abstract-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Abstract</a></span></li><li><span><a href="#Attribute-Description" data-toc-modified-id="Attribute-Description-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Attribute Description</a></span></li></ul></li><li><span><a href="#Data-Source" data-toc-modified-id="Data-Source-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Source</a></span></li><li><span><a href="#Importing-libraries" data-toc-modified-id="Importing-libraries-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Importing libraries</a></span></li><li><span><a href="#Dataset-Preparation" data-toc-modified-id="Dataset-Preparation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Dataset Preparation</a></span></li><li><span><a href="#Importing-the-Cleaned-Dataset(data.csv)" data-toc-modified-id="Importing-the-Cleaned-Dataset(data.csv)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Importing the Cleaned Dataset(data.csv)</a></span><ul class="toc-item"><li><span><a href="#Python-code-to-clean-the-dataset-is-here" data-toc-modified-id="Python-code-to-clean-the-dataset-is-here-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Python code to clean the dataset is <a href="./Cleaning_Data.ipynb" target="_blank">here</a></a></span></li></ul></li><li><span><a href="#Model-Building" data-toc-modified-id="Model-Building-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Model Building</a></span><ul class="toc-item"><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Train Test Split</a></span></li><li><span><a href="#Hyperparameter-Tuning" data-toc-modified-id="Hyperparameter-Tuning-6.2"><span 
class="toc-item-num">6.2&nbsp;&nbsp;</span>Hyperparameter Tuning</a></span><ul class="toc-item"><li><span><a href="#Random-Forest-Classifier" data-toc-modified-id="Random-Forest-Classifier-6.2.1"><span class="toc-item-num">6.2.1&nbsp;&nbsp;</span>Random Forest Classifier</a></span></li><li><span><a href="#XGBoost-Regressor" data-toc-modified-id="XGBoost-Regressor-6.2.2"><span class="toc-item-num">6.2.2&nbsp;&nbsp;</span>XGBoost Regressor</a></span></li></ul></li></ul></li></ul></div>

# Problem Statement

**Can the data help predict whether income is '>50K'?** 

The data contains anonymized census information such as age, occupation, education, and working class. The goal is to train a binary classifier that predicts income, which takes one of two values: '>50K' or '<=50K'. The dataset has 48842 instances and 14 attributes, mixing categorical and numerical features, and some entries are missing.

The aim is to predict income correctly from these attributes.
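
As a small illustration of the binary target, the two income labels can be mapped to 0 and 1. The Series below is made up for the example; only the label strings match the `income` column shown later in this notebook.

```python
import pandas as pd

# Toy labels for illustration only (not rows from the dataset).
income = pd.Series(['<=50K', '>50K', '<=50K', '>50K', '>50K'])

# 1 marks the positive class '>50K', 0 marks '<=50K'.
target = (income == '>50K').astype(int)
print(target.tolist())
```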

## Abstract

Predict whether income exceeds $50K/yr based on census data. The dataset is also known as the "Census Income" dataset.

## Attribute Description

* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

# Data Source

* UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/datasets/Adult
* Kaggle : https://www.kaggle.com/wenruliu/adult-income-dataset

# Importing libraries

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('Adult.csv')

# Dataset Preparation

In [3]:
df.shape

(48842, 15)

In [4]:
df.head()

,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [5]:
df.describe()

,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


In [6]:
df.describe(include='all')

,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
count,48842.0,48842,48842.0,48842,48842.0,48842,48842,48842,48842,48842,48842.0,48842.0,48842.0,48842,48842
unique,,9,,16,,7,15,6,5,2,,,,42,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,,15784,,22379,6172,19716,41762,32650,,,,43832,37155
mean,38.643585,,189664.1,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,105604.0,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117550.5,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178144.5,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237642.0,,12.0,,,,,,0.0,0.0,45.0,,


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [8]:
df.isnull().sum(axis=0)

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64
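
`isnull()` reports no missing values because the file marks missing entries with the string `?` rather than leaving them empty. One option, sketched here on a hypothetical three-row sample rather than on `Adult.csv` itself, is to tell `pd.read_csv` to treat `?` as missing through its `na_values` parameter.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same style as Adult.csv.
raw = io.StringIO(
    "age,workclass,occupation\n"
    "25,Private,Sales\n"
    "18,?,?\n"
    "44,Private,?\n"
)

# na_values='?' makes pandas parse the placeholder as NaN.
sample = pd.read_csv(raw, na_values='?')
missing_counts = sample.isnull().sum()
print(missing_counts)
```

With the placeholder parsed as NaN, `isnull().sum()` counts the missing entries per column directly.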

In [9]:
df['workclass'].value_counts()

Private             33906
Self-emp-not-inc     3862
Local-gov            3136
?                    2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64

In [10]:
df['education'].value_counts()

HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64

In [11]:
df['marital-status'].value_counts()

Married-civ-spouse       22379
Never-married            16117
Divorced                  6633
Separated                 1530
Widowed                   1518
Married-spouse-absent      628
Married-AF-spouse           37
Name: marital-status, dtype: int64

In [12]:
df['occupation'].value_counts()

Prof-specialty       6172
Craft-repair         6112
Exec-managerial      6086
Adm-clerical         5611
Sales                5504
Other-service        4923
Machine-op-inspct    3022
?                    2809
Transport-moving     2355
Handlers-cleaners    2072
Farming-fishing      1490
Tech-support         1446
Protective-serv       983
Priv-house-serv       242
Armed-Forces           15
Name: occupation, dtype: int64

In [13]:
df['relationship'].value_counts()

Husband           19716
Not-in-family     12583
Own-child          7581
Unmarried          5125
Wife               2331
Other-relative     1506
Name: relationship, dtype: int64

In [14]:
df['race'].value_counts()

White                 41762
Black                  4685
Asian-Pac-Islander     1519
Amer-Indian-Eskimo      470
Other                   406
Name: race, dtype: int64

In [15]:
df['gender'].value_counts()

Male      32650
Female    16192
Name: gender, dtype: int64

In [16]:
df['native-country'].value_counts()

United-States                 43832
Mexico                          951
?                               857
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                        

In [17]:
df['income'].value_counts()

<=50K    37155
>50K     11687
Name: income, dtype: int64
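
The two classes are imbalanced, so it helps to know the accuracy of a trivial model that always predicts the majority class. The short calculation below uses only the counts printed above.

```python
# Class counts copied from the value_counts() output above.
counts = {'<=50K': 37155, '>50K': 11687}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_label] / total
print(f'{majority_label}: {baseline_accuracy:.4f}')
```

Any classifier evaluated on accuracy should be read against this baseline.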

In [18]:
(df['age']=='?').value_counts()

False    48842
Name: age, dtype: int64

In [19]:
(df['fnlwgt']=='?').value_counts()

False    48842
Name: fnlwgt, dtype: int64

In [20]:
(df['capital-gain']=='?').value_counts()

False    48842
Name: capital-gain, dtype: int64

In [21]:
(df['capital-loss']=='?').value_counts()

False    48842
Name: capital-loss, dtype: int64
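
The numeric columns are parsed as integers, so they cannot contain the `?` placeholder, and comparing them with a string is not needed. A possible shortcut, shown on a made-up frame, is to check all non-numeric columns in one step.

```python
import pandas as pd

# Toy frame for illustration; column names follow the dataset.
toy = pd.DataFrame({
    'age': [25, 38, 18],
    'workclass': ['Private', '?', '?'],
    'occupation': ['Sales', 'Tech-support', '?'],
})

# Numeric columns cannot hold the '?' placeholder, so check the rest.
text_cols = toy.select_dtypes(exclude='number')
placeholder_counts = text_cols.eq('?').sum()
print(placeholder_counts)
```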

In [22]:
((df['workclass']=='?') | (df['occupation']=='?') | (df['native-country']=='?')).value_counts()

False    45222
True      3620
dtype: int64

In [23]:
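# Note: the conditions are combined with &, so only rows where all three
# columns are '?' are dropped (46 rows); rows with one or two '?' remain.
df.drop(df[(df['workclass']=='?') & (df['occupation']=='?') & (df['native-country']=='?')].index, inplace = True) 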

In [24]:
((df['workclass']=='?') & (df['occupation']=='?') ).value_counts()

False    46043
True      2753
dtype: int64
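
The difference between combining the conditions with `&` and with `|` can be seen on a small made-up frame: `&` selects only rows where every listed column is `?`, while `|` selects rows where at least one is.

```python
import pandas as pd

# Toy frame for illustration; column names follow the dataset.
toy = pd.DataFrame({
    'workclass': ['Private', '?', '?', 'Private', 'State-gov'],
    'occupation': ['Sales', '?', '?', '?', 'Sales'],
    'native-country': ['?', '?', 'Mexico', 'Cuba', 'India'],
})

is_missing = toy.eq('?')
all_missing = is_missing.all(axis=1)  # same as combining with &
any_missing = is_missing.any(axis=1)  # same as combining with |

print(all_missing.sum(), any_missing.sum())

# Keep only rows without any '?' placeholder.
complete_rows = toy[~any_missing]
print(complete_rows)
```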

# Importing the Cleaned Dataset(data.csv)

## Python code to clean the dataset is [here](./Cleaning_Data.ipynb)

In [25]:
dataset = pd.read_csv('data.csv')

In [26]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              48842 non-null  float64
 1   workclass        48842 non-null  int64  
 2   fnlwgt           48842 non-null  float64
 3   education        48842 non-null  int64  
 4   educational-num  48842 non-null  float64
 5   marital-status   48842 non-null  int64  
 6   occupation       48842 non-null  int64  
 7   relationship     48842 non-null  int64  
 8   race             48842 non-null  int64  
 9   gender           48842 non-null  int64  
 10  capital-gain     48842 non-null  float64
 11  capital-loss     48842 non-null  float64
 12  hours-per-week   48842 non-null  float64
 13  native-country   48842 non-null  int64  
 14  income           48842 non-null  int64  
dtypes: float64(6), int64(9)
memory usage: 5.6 MB
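
The cleaning notebook is not reproduced here. As a rough sketch of how text columns could end up as `int64` and numeric columns as `float64`, the example below integer-codes the categories and standardizes a numeric column on a made-up frame. Both choices are assumptions for illustration, not the steps actually used to produce `data.csv`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame for illustration; not rows taken from data.csv.
toy = pd.DataFrame({
    'age': [25, 38, 28, 44],
    'workclass': ['Private', 'Private', 'Local-gov', 'Private'],
    'income': ['<=50K', '<=50K', '>50K', '>50K'],
})

encoded = toy.copy()
# Integer-code each text column (codes follow sorted category order).
for col in ['workclass', 'income']:
    encoded[col] = encoded[col].astype('category').cat.codes.astype('int64')

# Standardize the numeric column to zero mean and unit variance.
encoded['age'] = StandardScaler().fit_transform(encoded[['age']]).ravel()
print(encoded.dtypes)
```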


# Model Building

## Train Test Split

In [27]:
# Features: every column except the last one.
x = dataset.iloc[:, :-1].values
# Target: the last column (income).
y = dataset.iloc[:, -1].values


In [28]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1)
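
The split above is purely random. With imbalanced classes, one alternative is a stratified split, which keeps the class proportions equal in both parts. The sketch below uses toy arrays and the `stratify` parameter of `train_test_split`; it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 75 samples of class 0, 25 of class 1.
y_toy = np.array([0] * 75 + [1] * 25)
x_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 75/25 class ratio in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```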

In [29]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

In [30]:
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('XGB', xgb.XGBClassifier()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))


LR: 0.797763 (0.005448)
LDA: 0.813631 (0.005501)
KNN: 0.778287 (0.007981)
CART: 0.813861 (0.004679)
NB: 0.793668 (0.005058)
RFC: 0.857190 (0.005008)
XGB: 0.872418 (0.002912)
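
To make the comparison easier to read, the mean accuracies printed above can be ranked. The values below are copied from that output; nothing is recomputed.

```python
import pandas as pd

# Mean 10-fold accuracies copied from the printed results above.
mean_accuracy = pd.Series({
    'LR': 0.797763, 'LDA': 0.813631, 'KNN': 0.778287, 'CART': 0.813861,
    'NB': 0.793668, 'RFC': 0.857190, 'XGB': 0.872418,
})

# Highest accuracy first.
ranking = mean_accuracy.sort_values(ascending=False)
print(ranking)
```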


## Hyperparameter Tuning

By accuracy score, RFC and XGB fit the dataset best.<br>
We therefore tune the hyperparameters of both RFC and XGB to find the better of the two.

### Random Forest Classifier


In [31]:
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 800, num = 8)]
max_depth = [int(x) for x in np.linspace(5, 25, num = 5)]
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               }

In [32]:
from sklearn.model_selection import RandomizedSearchCV

In [33]:
rf_random = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid,scoring='accuracy', n_iter = 100, cv = 5, verbose=2, random_state=0, n_jobs = -1)

In [34]:
rf_random.fit(x_train,y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   51.6s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed: 11.7min finished


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'max_depth': [5, 10, 15, 20, 25],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500, 600, 700, 800]},
                   random_state=0, scoring='accuracy', verbose=2)
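
The log reports 40 candidates even though `n_iter=100` was requested. When every parameter is given as a list, the search samples without replacement, so it cannot try more combinations than the grid contains. The size of the grid can be checked with `ParameterGrid`:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as defined for the random forest search above.
n_estimators = [int(x) for x in np.linspace(start=100, stop=800, num=8)]
max_depth = [int(x) for x in np.linspace(5, 25, num=5)]
random_grid = {'n_estimators': n_estimators, 'max_depth': max_depth}

# Number of distinct parameter combinations (8 values x 5 values).
n_candidates = len(ParameterGrid(random_grid))
print(n_candidates)
```

With a grid this small, the randomized search covers every combination, which makes it equivalent to an exhaustive grid search.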

In [35]:
rf_random.best_params_

{'n_estimators': 500, 'max_depth': 20}

In [36]:
rf_random.best_score_

0.8651754213159084

In [37]:
predictions=rf_random.predict(x_test)

In [38]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, predictions)
print(cm)
accuracy_score(y_test, predictions)

[[6963  424]
 [ 898 1484]]


0.8646739686764254

### XGBoost Classifier


In [39]:
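# Note: the rates are written as strings here, which is why best_params_
# later reports '0.05' in quotes; floats (0.05, 0.1, ...) are the usual form.
learning_rate = ['0.05','0.1', '0.2','0.3','0.5','0.6']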
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'learning_rate': learning_rate
               }
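
Instead of fixed lists, a randomized search can also draw values from distributions, which yields numeric learning rates directly. The sketch below only samples candidate settings with `ParameterSampler` and does not fit any model. The search space is hypothetical; its ranges simply mirror the lists above.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Hypothetical search space; the ranges mirror the lists used above.
distributions = {
    'learning_rate': loguniform(0.05, 0.6),
    'n_estimators': randint(100, 1201),  # upper bound is exclusive
    'max_depth': randint(5, 31),         # upper bound is exclusive
}

# Draw five candidate settings reproducibly.
samples = list(ParameterSampler(distributions, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

A dictionary of this kind can be passed as `param_distributions` to `RandomizedSearchCV` in place of a dictionary of lists.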

In [40]:
xg_random = RandomizedSearchCV(estimator = xgb.XGBClassifier(), param_distributions = random_grid,scoring='accuracy', n_iter = 100, cv = 5, verbose=2, random_state=0, n_jobs = -1)

In [41]:
xg_random.fit(x_train,y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 30.6min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 91.0min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 141.1min finished




RandomizedSearchCV(cv=5,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=nan,
                                           monotone_constraints=None,
                                           n_estimators=100,...
                                           reg_lambda=None,
                                           scale_pos_weight=None,
                                           subsample=No

In [42]:
xg_random.best_params_

{'n_estimators': 500, 'max_depth': 5, 'learning_rate': '0.05'}

In [43]:
xg_random.best_score_

0.8747216513955871

In [44]:
predictions=xg_random.predict(x_test)

In [45]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, predictions)
print(cm)
accuracy_score(y_test, predictions)

[[6957  430]
 [ 795 1587]]


0.8746033370867028
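
Accuracy alone hides how each model treats the smaller class. As a rough comparison, precision and recall for class 1 can be derived from the two confusion matrices printed in this notebook. The sketch assumes class 1 is the '>50K' group, which the row totals suggest but the encoding shown here does not confirm.

```python
import numpy as np

# Confusion matrices copied from the outputs in this notebook
# (rows: true class, columns: predicted class).
matrices = {
    'RFC': np.array([[6963, 424], [898, 1484]]),
    'XGB': np.array([[6957, 430], [795, 1587]]),
}

def summarize(cm):
    """Return accuracy, plus precision and recall for class 1."""
    tn, fp, fn, tp = cm.ravel()
    return {
        'accuracy': (tp + tn) / cm.sum(),
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, scores in summary.items():
    print(name, {k: round(float(v), 4) for k, v in scores.items()})
```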