## Problem Statement
#### In this assignment, students predict whether a person earns more than 50K per year, using XGBoost on the classic Adult dataset.

##### The description of the dataset is as follows:
- Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions:

((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

- Attribute Information: Listing of attributes:

- wage class (target): >50K, <=50K.

- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov,Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands

In [1]:
## importing libraries
import pandas as pd
import numpy as np

In [2]:
## importing the training and testing datasets
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)
test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', skiprows = 1, header = None)

In [3]:
print ("Train Size : ",train_set.shape)
print ("Test Size : ", test_set.shape)

Train Size :  (32561, 15)
Test Size :  (16281, 15)


In [4]:
## Adding feature names to the dataset
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
'occupation','relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
'native_country', 'wage_class']
train_set.columns = col_labels
test_set.columns = col_labels

In [5]:
train_set.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


###### Checking for null values

In [6]:
train_set.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
wage_class        0
dtype: int64

In [7]:
test_set.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
wage_class        0
dtype: int64

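- `isnull()` reports no null values in either set. Missing entries are encoded as the string ` ?` instead (visible in the workclass and occupation counts below), so they are not counted here.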
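
`isnull()` only counts values that pandas parsed as NaN, so a placeholder such as `?` goes unnoticed. A minimal sketch of how such placeholders could be surfaced at load time, assuming a small inline sample in the same comma-plus-space layout as the raw file (the rows and the reduced column list are hypothetical, chosen only for illustration):

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same ", "-separated layout as adult.data
# (only four of the fifteen columns are kept for brevity).
raw = (
    "39, State-gov, Adm-clerical, <=50K\n"
    "54, ?, ?, >50K\n"
    "28, Private, ?, <=50K\n"
)
cols = ["age", "workclass", "occupation", "wage_class"]

# Default parsing keeps the leading space, so "?" is read as the string " ?".
naive = pd.read_csv(io.StringIO(raw), header=None, names=cols)
naive_nulls = int(naive.isnull().sum().sum())

# skipinitialspace drops the space after each comma; na_values maps "?" to NaN.
clean = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                    skipinitialspace=True, na_values="?")
missing_per_column = clean.isnull().sum()

print(naive_nulls)
print(missing_per_column)
```

Whether to impute, drop, or keep `?` as its own category is a separate modelling choice; the notebook keeps it as a category.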

##### Find categorical variables in the dataset

In [8]:
list(train_set.dtypes[train_set.dtypes == "object"].index)

['workclass',
 'education',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native_country',
 'wage_class']

##### Count the values of each categorical variable

In [9]:
train_set.workclass.value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

In [10]:
train_set.education.value_counts()

 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64

In [11]:
train_set.marital_status.value_counts()

 Married-civ-spouse       14976
 Never-married            10683
 Divorced                  4443
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: marital_status, dtype: int64

In [12]:
train_set.occupation.value_counts()

 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 ?                    1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: occupation, dtype: int64

In [13]:
train_set.relationship.value_counts()

 Husband           13193
 Not-in-family      8305
 Own-child          5068
 Unmarried          3446
 Wife               1568
 Other-relative      981
Name: relationship, dtype: int64

In [14]:
train_set.race.value_counts()

 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: race, dtype: int64

In [15]:
train_set.sex.value_counts()

 Male      21790
 Female    10771
Name: sex, dtype: int64

In [16]:
train_set.native_country.unique().shape

(42,)

In [17]:
test_set.native_country.unique().shape

(41,)

##### The number of unique native_country values differs between the train and test data. This will cause issues for label and one-hot encoding.

In [18]:
### Finding extra native country in train dataset
train_set.native_country[~ train_set.native_country.isin(test_set.native_country)].value_counts()

 Holand-Netherlands    1
Name: native_country, dtype: int64

##### The train set contains one country that is absent from the test set, which breaks the encoding. As a workaround, the same row is added to the test data (note that this places one training example in the test set)

In [19]:
## DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
test_set=pd.concat([test_set,train_set[train_set.native_country==" Holand-Netherlands"]])

In [20]:
test_set.native_country.unique().shape

(42,)

In [21]:
### Re-checking for countries present only in the train dataset. None remain
train_set.native_country[~ train_set.native_country.isin(test_set.native_country)].value_counts()

Series([], Name: native_country, dtype: int64)

In [22]:
train_set.wage_class.value_counts()

 <=50K    24720
 >50K      7841
Name: wage_class, dtype: int64

In [23]:
test_set.wage_class.value_counts()

 <=50K.    12435
 >50K.      3846
 <=50K         1
Name: wage_class, dtype: int64

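##### wage_class labels in the test data are inconsistent: the test file appends a period (`<=50K.`, `>50K.`), while the row copied from the train set has none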

In [24]:
test_set.wage_class=test_set.wage_class.map({' <=50K.': ' <=50K', ' >50K.': ' >50K'})

In [25]:
## corrected the wage_class
test_set.wage_class.value_counts()

 <=50K    12435
 >50K      3846
Name: wage_class, dtype: int64
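
Note that `map` with a dictionary returns NaN for any value that is not a key. The row copied from the train set carries the label ` <=50K` without a period, so it becomes NaN, and `value_counts()` skips NaN by default. This is why the counts above sum to 16281 rather than 16282. A minimal sketch of a normalization that handles both spellings, using a hypothetical four-value sample:

```python
import pandas as pd

# Hypothetical sample: test-file labels end with ".", the copied training row does not.
labels = pd.Series([" <=50K.", " >50K.", " <=50K.", " <=50K"])

# Dictionary-based map as in the cell above: values that are not keys become NaN.
mapped = labels.map({" <=50K.": " <=50K", " >50K.": " >50K"})

# Stripping the trailing period leaves already-clean labels untouched.
normalized = labels.str.rstrip(".")

print(mapped.isnull().sum())
print(normalized.value_counts())
```

In this notebook the NaN label happens to be harmless, because the later dummy encoding turns it into a 0 in the `wage_class_ >50K` column, which matches the row's true class.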

##### Replacing categorical variables with numbers
- Of the 9 categorical variables, 2 (sex and wage_class) have only two values each, so they can be represented by 0 and 1
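
A minimal sketch of such a 0/1 mapping, assuming a hypothetical three-row frame with the same leading spaces as the raw values (the notebook itself ends up using dummy variables for all categorical columns instead):

```python
import pandas as pd

# Hypothetical sample with the leading spaces found in the raw values.
df = pd.DataFrame({
    "sex": [" Male", " Female", " Male"],
    "wage_class": [" <=50K", " >50K", " <=50K"],
})

# Strip whitespace first so the dictionary keys do not depend on it.
df["sex"] = df["sex"].str.strip().map({"Female": 0, "Male": 1})
df["wage_class"] = df["wage_class"].str.strip().map({"<=50K": 0, ">50K": 1})

print(df)
```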

In [26]:
train_set.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [27]:
##test_set.wage_class=test_set.wage_class.map({' <=50K': 0, ' >50K': 1})
##train_set.wage_class=train_set.wage_class.map({' <=50K': 0, ' >50K': 1})


In [28]:
train_set.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [29]:
## Reading the categorical variables of train_set
categorical_variables = list(train_set.dtypes[train_set.dtypes == "object"].index)

### Replacing categorical variables with numbers
for variables in categorical_variables:
    ##Replacing with dummy variables
    dummies = pd.get_dummies(train_set[variables],prefix=variables)
    train_set=pd.concat([train_set,dummies.iloc[:,1:]],axis=1)
    train_set.drop([variables],inplace=True,axis=1)
    ##onehot_encoder=OneHotEncoder(train_set[variables])
    ##train_set=onehot_encoder.fit_transform(train_set)


In [30]:
train_set.shape

(32561, 101)
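
The 101 columns are the 6 continuous features plus, for each of the 9 categorical variables, one indicator per level minus the dropped first level (8 + 15 + 6 + 14 + 5 + 4 + 1 + 41 + 1 = 95). The loop above can likely be written more compactly with the `columns` and `drop_first` arguments of `pd.get_dummies`. A minimal sketch on a hypothetical toy frame, comparing both forms:

```python
import pandas as pd

# Hypothetical toy frame standing in for train_set.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": [" Male", " Female", " Male"],
    "race": [" White", " Black", " Other"],
})
categorical = ["sex", "race"]

# Loop form, as in the cell above: encode, drop the first level, drop the original.
looped = df.copy()
for col in categorical:
    dummies = pd.get_dummies(looped[col], prefix=col, dtype=int)
    looped = pd.concat([looped, dummies.iloc[:, 1:]], axis=1)
    looped = looped.drop(columns=[col])

# Single-call form: drop_first=True removes the first level of each variable.
compact = pd.get_dummies(df, columns=categorical, drop_first=True, dtype=int)

print(list(compact.columns))
```

`dtype=int` is passed here so both forms yield integer indicators regardless of the pandas version's default.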

In [31]:
train_set.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,wage_class_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [32]:
## Reading the categorical variables of test_set
categorical_variables = list(test_set.dtypes[test_set.dtypes == "object"].index)

### Replacing categorical variables with numbers
for variables in categorical_variables:
    ##Replacing with dummy variables
    dummies = pd.get_dummies(test_set[variables],prefix=variables)
    test_set=pd.concat([test_set,dummies.iloc[:,1:]],axis=1)
    test_set.drop([variables],inplace=True,axis=1)
    ##onehot_encoder=OneHotEncoder(train_set[variables])
    ##train_set=onehot_encoder.fit_transform(train_set)


In [33]:
test_set.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,wage_class_ >50K
0,25,226802,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
1,38,89814,9,0,0,50,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,28,336951,12,0,0,40,0,1,0,0,...,0,0,0,0,0,0,1,0,0,1
3,44,160323,10,7688,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,1
4,18,103497,10,0,0,30,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [34]:
test_set.shape

(16282, 101)
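
Encoding the two sets separately produces matching columns only when both contain the same category levels, which is why a row was copied into the test set earlier. One alternative, shown here as a sketch on hypothetical toy frames rather than as what this notebook does, is to reindex the encoded test frame to the training columns:

```python
import pandas as pd

# Hypothetical toy frames: country "C" appears only in the training data.
train = pd.DataFrame({"country": ["A", "B", "C"], "age": [30, 40, 50]})
test = pd.DataFrame({"country": ["A", "B"], "age": [25, 35]})

train_enc = pd.get_dummies(train, columns=["country"], dtype=int)
test_enc = pd.get_dummies(test, columns=["country"], dtype=int)

# Give the test frame exactly the training columns, filling absent ones with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

This keeps the test set free of training rows; columns that exist only in the test frame would be dropped by the same call.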

In [35]:
X_train=train_set.iloc[:,:-1].values
y_train=train_set.iloc[:,-1].values
X_test=test_set.iloc[:,:-1].values
y_test=test_set.iloc[:,-1].values

In [36]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)



In [37]:
from xgboost import XGBClassifier
classifier=XGBClassifier(max_depth=10)
classifier.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=10, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [38]:
X_test.size

1628200

In [39]:
y_pred=classifier.predict(X_test)



In [40]:
from sklearn.metrics import confusion_matrix, accuracy_score
confusion_matrix(y_test,y_pred)

array([[11749,   687],
       [ 1378,  2468]], dtype=int64)

In [42]:
accuracy_score(y_test,y_pred)

0.8731728288907996
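
The accuracy can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The short sketch below also derives precision and recall for the >50K class (roughly 0.78 and 0.64), which accuracy alone does not show given the class imbalance. The variable names are illustrative:

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
tn, fp = 11749, 687    # true <=50K: predicted <=50K, predicted >50K
fn, tp = 1378, 2468    # true >50K:  predicted <=50K, predicted >50K

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of predicted >50K that are correct
recall = tp / (tp + fn)      # share of actual >50K that are found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```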

#### Finding the feature importances

In [52]:
import matplotlib.pyplot as plt
feature_importance=pd.Series(classifier.feature_importances_,index=train_set.columns[train_set.columns!='wage_class_ >50K'])
feature_importance=feature_importance.sort_values(ascending=False)
feature_importance

fnlwgt                                        0.231334
age                                           0.177302
hours_per_week                                0.100267
education_num                                 0.060374
capital_gain                                  0.057318
capital_loss                                  0.054643
occupation_ Exec-managerial                   0.020099
workclass_ Private                            0.019488
sex_ Male                                     0.016737
occupation_ Prof-specialty                    0.015055
workclass_ Self-emp-not-inc                   0.014368
relationship_ Wife                            0.011616
occupation_ Sales                             0.010776
education_ Some-college                       0.010699
workclass_ Local-gov                          0.009935
marital_status_ Married-civ-spouse            0.009782
occupation_ Craft-repair                      0.009171
relationship_ Not-in-family                   0.009094
race_ Whit