### Predicting Income with Random Forests

In this project, we use a dataset of census information from the UCI Machine Learning Repository.

https://archive.ics.uci.edu/ml/datasets/census+income

Using this census data with a random forest, we will try to predict whether a person earns more than $50,000 a year.

Let’s get started!

#### Import Modules

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

#### Import Data

There is a small problem with our data that is easy to miss: every string has an extra space at the start. For example, the first row's native-country is " United-States", but we want it to be "United-States". This happens because income.csv has a space after each comma. To fix it, we can pass the parameter delimiter=", " to read_csv(). pandas treats a delimiter longer than one character as a regular expression, which only the Python parsing engine supports, so we also pass engine='python' to make that choice explicit and avoid a parser warning.

In [3]:
# The two-character delimiter ", " consumes the space that follows each comma.
income_data = pd.read_csv('income.csv', header=0, delimiter=", ", engine='python')
income_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
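The leading-space problem described above can also be handled with the `skipinitialspace` parameter of `read_csv()`, which keeps the default single-character delimiter. The following is a minimal sketch, not the notebook's own approach; the in-memory `sample_csv` string is a hypothetical stand-in for income.csv.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics income.csv: a space follows every comma.
sample_csv = (
    "age, workclass, native-country, income\n"
    "39, State-gov, United-States, <=50K\n"
    "28, Private, Cuba, <=50K\n"
)

# skipinitialspace=True drops the whitespace that follows each delimiter,
# so the default "," separator can be used unchanged.
sample = pd.read_csv(io.StringIO(sample_csv), skipinitialspace=True)

print(sample.columns.tolist())
print(repr(sample.loc[0, "native-country"]))
```

Both the column names and the string values come back without the leading space, so no multi-character delimiter is needed.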


In [16]:
income_data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

#### Attribute Information:

Listing of attributes:

- income: >50K, <=50K.
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
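The split between continuous and categorical attributes in the list above can be checked against a loaded DataFrame with `select_dtypes()`. This is a small sketch on a hypothetical three-row sample rather than the real income_data; the variable names `continuous_cols` and `categorical_cols` are illustrative only.

```python
import pandas as pd

# Hypothetical sample with two continuous and two categorical attributes.
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", "<=50K", "<=50K"],
})

# Numeric columns correspond to the attributes listed as "continuous".
continuous_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else holds string categories and must be encoded before
# scikit-learn can use it.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```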

In [14]:
income_data.iloc[0]

age                           39
workclass              State-gov
fnlwgt                     77516
education              Bachelors
education-num                 13
marital-status     Never-married
occupation          Adm-clerical
relationship       Not-in-family
race                       White
sex                         Male
capital-gain                2174
capital-loss                   0
hours-per-week                40
native-country     United-States
income                     <=50K
Name: 0, dtype: object

#### Format The Data For Scikit-learn

In [26]:
labels = income_data['income']
labels.head()

0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: income, dtype: object

In [32]:
data = income_data[[
    'age', 
    'capital-gain', 
    'capital-loss', 
    'hours-per-week',
    'sex'
]]

data.head()

| | age | capital-gain | capital-loss | hours-per-week | sex |
|---|---|---|---|---|---|
| 0 | 39 | 2174 | 0 | 40 | Male |
| 1 | 50 | 0 | 0 | 13 | Male |
| 2 | 38 | 0 | 0 | 40 | Male |
| 3 | 53 | 0 | 0 | 40 | Male |
| 4 | 28 | 0 | 0 | 40 | Female |


In [33]:
# 'sex' is still a string column, which the forest cannot use, so drop it for now.
train_data, test_data, train_labels, test_labels = train_test_split(
    data.drop('sex', axis='columns'), labels, random_state=1
)

#### Create The Random Forest

In [34]:
forest = RandomForestClassifier(random_state = 1)

In [35]:
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [42]:
print("Forest Test Score: {:.3f}".format(forest.score(test_data, test_labels)))

Forest Test Score: 0.822


In [40]:
# Encode sex as an integer: 0 for "Male", 1 for any other value.
income_data['sex-int'] = income_data['sex'].apply(lambda value: 0 if value == "Male" else 1)
income_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | sex-int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 1 |
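The lambda above sends every value other than "Male" to 1, which would also include any unexpected entry. As a hedged alternative, an explicit dictionary passed to `Series.map()` leaves unknown values as NaN so they are easy to spot. The sketch below uses a hypothetical Series, and the "?" entry is invented for illustration; the attribute list only names Female and Male for this column.

```python
import pandas as pd

# Hypothetical column; the "?" entry is made up to show what happens
# to a value that is missing from the mapping.
sex = pd.Series(["Male", "Female", "Male", "?"])

# Explicit mapping: anything not listed becomes NaN instead of silently becoming 1.
sex_codes = {"Male": 0, "Female": 1}
sex_int = sex.map(sex_codes)

# Collect the raw values that the mapping did not cover.
unmapped = sex[sex_int.isna()]

print(sex_int.tolist())
print(unmapped.tolist())
```

Note that the presence of NaN makes the mapped column a float column; once every value is covered, it can be cast back with `astype(int)`.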


In [41]:
data = income_data[[
    'age', 
    'capital-gain', 
    'capital-loss', 
    'hours-per-week',
    'sex-int'
]]

data.head()

| | age | capital-gain | capital-loss | hours-per-week | sex-int |
|---|---|---|---|---|---|
| 0 | 39 | 2174 | 0 | 40 | 0 |
| 1 | 50 | 0 | 0 | 13 | 0 |
| 2 | 38 | 0 | 0 | 40 | 0 |
| 3 | 53 | 0 | 0 | 40 | 0 |
| 4 | 28 | 0 | 0 | 40 | 1 |


In [43]:
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

In [44]:
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [45]:
print("Forest Test Score: {:.3f}".format(forest.score(test_data, test_labels)))

Forest Test Score: 0.827


In [48]:
print(income_data['native-country'].value_counts())

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [51]:
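# Encode native country as an integer: 0 for "United-States", 1 for any other value (including "?").
income_data['country-int'] = income_data['native-country'].apply(lambda value: 0 if value == "United-States" else 1)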
income_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | sex-int | country-int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 0 | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 1 | 1 |


In [52]:
data = income_data[[
    'age', 
    'capital-gain', 
    'capital-loss', 
    'hours-per-week',
    'sex-int',
    'country-int'
]]

data.head()

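| | age | capital-gain | capital-loss | hours-per-week | sex-int | country-int |
|---|---|---|---|---|---|---|
| 0 | 39 | 2174 | 0 | 40 | 0 | 0 |
| 1 | 50 | 0 | 0 | 13 | 0 | 0 |
| 2 | 38 | 0 | 0 | 40 | 0 | 0 |
| 3 | 53 | 0 | 0 | 40 | 0 | 0 |
| 4 | 28 | 0 | 0 | 40 | 1 | 1 |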


In [53]:
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, random_state=1
)

In [54]:
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [55]:
print("Forest Test Score: {:.3f}".format(forest.score(test_data, test_labels)))

Forest Test Score: 0.823
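The three experiments above repeat the same split, fit, and score steps with a different list of columns each time. One possible way to keep such comparisons tidy is a small helper, sketched here. The function name `score_features` and the synthetic `toy` data are assumptions made for illustration; they are not part of the original notebook, and the printed scores say nothing about the census results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def score_features(frame, labels, columns, random_state=1):
    """Split the data, fit a fresh forest on `columns`, and return test accuracy."""
    train_data, test_data, train_labels, test_labels = train_test_split(
        frame[columns], labels, random_state=random_state
    )
    forest = RandomForestClassifier(random_state=random_state)
    forest.fit(train_data, train_labels)
    return forest.score(test_data, test_labels)


# Synthetic stand-in for income_data (NOT the real census data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "hours-per-week": rng.integers(10, 60, size=200),
    "sex-int": rng.integers(0, 2, size=200),
})
# Toy labels depend only on hours-per-week, purely for demonstration.
toy_labels = pd.Series(np.where(toy["hours-per-week"] > 40, ">50K", "<=50K"))

feature_sets = {
    "numeric only": ["age", "hours-per-week"],
    "numeric + sex-int": ["age", "hours-per-week", "sex-int"],
}

# Score each feature set with the same split and the same forest settings.
scores = {name: score_features(toy, toy_labels, cols) for name, cols in feature_sets.items()}
for name, score in scores.items():
    print("{}: {:.3f}".format(name, score))
```

Because each call builds a new forest and reuses the same `random_state`, differences between scores come from the feature set rather than from a different split.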
