# Predicting Income

In this project, we use an ensemble machine learning technique, **Random Forest**, on a dataset containing [census information from UCI’s Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult/). Using this census data, we will try to predict whether a person makes more than $50,000 a year.

Random Forest helps prevent the overfitting to training data that single Decision Trees are often prone to. The idea is simple: a Random Forest contains many Decision Trees that work together to classify new points. Every tree gets a vote, and the class with the most votes wins.
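
The voting idea can be sketched in a few lines of plain Python. This is only an illustration of majority voting: the `majority_vote` helper and the `tree_votes` list are hypothetical and not part of the project. As far as its documentation describes, scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so treat this as the intuition rather than the library's exact procedure.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions from five individual trees for one person
tree_votes = ['<=50K', '>50K', '<=50K', '<=50K', '>50K']

print(majority_vote(tree_votes))  # <=50K (three votes against two)
```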

Attribute Information:

1. age: continuous.
2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. fnlwgt: continuous.
4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5. education-num: continuous.
6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. sex: Female, Male.
11. capital-gain: continuous.
12. capital-loss: continuous.
13. hours-per-week: continuous.
14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
15. income: >50K, <=50K.

## Investigate The Data

Let’s begin with imports and investigating the data available to us. 

In [1]:
# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Basic imports related to analysis
import pandas as pd

# Imports for ml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Fields are separated by a comma followed by a space; skipinitialspace drops that space
income_data = pd.read_csv('income.csv', skipinitialspace=True)
income_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [3]:
income_data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [4]:
income_data[income_data['income'] == '>50K'].head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
| 10 | 37 | Private | 280464 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 80 | United-States | >50K |
| 11 | 30 | State-gov | 141297 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | India | >50K |


In [5]:
income_data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

## Feature Modelling

Now we can begin putting the data in a format that our Random Forest can work with. First, we need to separate the labels from the rest of the data.

In [6]:
income_data.income.unique()

array(['<=50K', '>50K'], dtype=object)

In [7]:
income_data['income-num'] = income_data.income.map({'>50K': 1, '<=50K': 0})
income_data.head(2)


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | income-num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 |
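
One thing worth checking after a dictionary mapping like the one above: `Series.map` returns `NaN` for any value that is not a key in the dictionary, so a stray space in the raw file would silently produce a missing label. The sketch below uses a small hypothetical series (`raw`) rather than the real data to show the effect and one possible guard.

```python
import pandas as pd

label_map = {'>50K': 1, '<=50K': 0}

# Hypothetical raw values; the last one carries a stray leading space
raw = pd.Series(['<=50K', '>50K', ' >50K'])

mapped = raw.map(label_map)
print(mapped.isna().sum())  # 1 value did not match any key

# Stripping whitespace before mapping avoids the silent NaN
cleaned = raw.str.strip().map(label_map)
print(cleaned.tolist())  # [0, 1, 1]
```

On the real frame, a check such as `income_data['income-num'].isna().sum()` would confirm that every row received a label.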


In [8]:
labels = income_data[['income-num']]
labels.head()

| | income-num |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [9]:
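# Correlation of every numeric column with the label (numeric_only requires pandas 1.5+)
income_data.corr(numeric_only=True).loc[['income-num']]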

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | income-num |
|---|---|---|---|---|---|---|---|
| income-num | 0.234037 | -0.009463 | 0.335154 | 0.223329 | 0.150526 | 0.229689 | 1.0 |


Several columns in `object` format might be useful for prediction. Let's map them to numbers according to their values.

In [10]:
income_data['native-country'].value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

For instance, since the majority of the data comes from `United-States`, it might make sense to create a column where every row that contains `United-States` becomes a `1` and any other country becomes a `0`.

In [11]:
income_data['native-country-num'] = income_data['native-country'].apply(lambda row: 1 if row == 'United-States' else 0)
income_data.head(2)


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | income-num | native-country-num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 | 1 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 | 1 |
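
The same indicator column can also be built with a vectorized comparison instead of `apply`. The sketch below runs on a tiny hypothetical frame (`toy`), not the real data. Note that the `?` placeholder seen in the value counts ends up as `0`, together with every other non-US value.

```python
import pandas as pd

# Hypothetical miniature version of the native-country column
toy = pd.DataFrame({'native-country': ['United-States', 'Mexico', '?', 'United-States']})

# The comparison yields booleans; astype(int) turns True/False into 1/0
toy['native-country-num'] = (toy['native-country'] == 'United-States').astype(int)

print(toy['native-country-num'].tolist())  # [1, 0, 0, 1]
```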


The situation with the `sex` column is similar: we can replace `Female` with `1` and `Male` with `0`.

In [12]:
income_data.sex.value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

In [13]:
income_data['sex-num'] = income_data.sex.map({'Female': 1, 'Male': 0})
income_data.head(2)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | income-num | native-country-num | sex-num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 | 1 | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 | 1 | 0 |


In [14]:
income_data.race.unique()

array(['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
       'Other'], dtype=object)

In [15]:
income_data = pd.get_dummies(data=income_data, columns=['race'])
income_data.head(2)


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | sex | capital-gain | ... | native-country | income | income-num | native-country-num | sex-num | race_Amer-Indian-Eskimo | race_Asian-Pac-Islander | race_Black | race_Other | race_White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | Male | 2174 | ... | United-States | <=50K | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | Male | 0 | ... | United-States | <=50K | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |


In [16]:
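# Correlation of every numeric column with the label (numeric_only requires pandas 1.5+)
income_data.corr(numeric_only=True).loc[['income-num']]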

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | income-num | native-country-num | sex-num | race_Amer-Indian-Eskimo | race_Asian-Pac-Islander | race_Black | race_Other | race_White |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| income-num | 0.234037 | -0.009463 | 0.335154 | 0.223329 | 0.150526 | 0.229689 | 1.0 | 0.03447 | -0.21598 | -0.028721 | 0.010543 | -0.089089 | -0.03183 | 0.085224 |


The correlations of the new `race` features and `native-country-num` with `income-num` are very low, so including them in the training set is unlikely to help much. However, we can test that.
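
For reference, the figures reported by `corr()` are Pearson correlation coefficients (the pandas default), which measure only linear association. The sketch below computes the same quantity by hand in plain Python. The `pearson` helper and the toy lists are hypothetical and stand in for a feature column and the `income-num` label.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values: weekly hours and a 0/1 income label
hours = [20, 35, 40, 45, 60, 50]
high_income = [0, 0, 0, 1, 1, 1]

r = pearson(hours, high_income)
print(round(r, 3))
```

Because a tree-based model can also pick up non-linear patterns, a low coefficient is a hint rather than proof that a feature is useless, which is one reason to check it empirically.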

## Create The Random Forest

In [17]:
# Select features 
data_0 = income_data[[
    'age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex-num', 'native-country-num',
    'race_Amer-Indian-Eskimo', 'race_Asian-Pac-Islander', 'race_Black', 'race_Other', 'race_White'
    ]]
# Split data and labels for training and testing sets
train_data_0, test_data_0, train_labels_0, test_labels_0 = train_test_split(data_0, labels, random_state=1)
# Instantiate classifier
forest_0 = RandomForestClassifier(random_state=1)
# Train classifier (ravel turns the one-column label frame into the 1-D array scikit-learn expects)
forest_0.fit(train_data_0, train_labels_0.values.ravel())
# Get accuracy of the classifier
print(round(forest_0.score(test_data_0, test_labels_0), 4)*100, '%')

82.07 %


In [18]:
# Select features 
data = income_data[[
    'age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex-num' ]]
# Split data and labels for training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
# Instantiate classifier
forest = RandomForestClassifier(random_state=1)
# Train classifier (ravel turns the one-column label frame into the 1-D array scikit-learn expects)
forest.fit(train_data, train_labels.values.ravel())
# Get accuracy of the classifier
print(round(forest.score(test_data, test_labels), 4)*100, '%')

82.73 %
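
The two cells above repeat the same select, split, fit, and score steps, so the comparison could be wrapped in a small helper. The following is a minimal sketch under stated assumptions: `score_feature_set` is a hypothetical helper, the data is synthetic (generated with `make_classification`) rather than the census file, and `n_estimators=50` is chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def score_feature_set(frame, target, columns, seed=1):
    """Train a forest on the given columns and return its test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        frame[columns], target, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

# Synthetic stand-in data: six numeric features and a binary label
features, target_values = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=1)
toy = pd.DataFrame(features, columns=[f'f{i}' for i in range(6)])
toy_labels = pd.Series(target_values, name='label')

scores = {
    'all features': score_feature_set(toy, toy_labels, list(toy.columns)),
    'first three': score_feature_set(toy, toy_labels, ['f0', 'f1', 'f2']),
}
for name, accuracy in scores.items():
    print(f'{name}: {accuracy:.2%}')
```

With the census data, the same helper would be called with `income_data`, the label column, and each list of column names. The gap between 82.07 % and 82.73 % is small enough that it may be within the noise of a single train/test split, so repeating the comparison over several seeds or with cross-validation would give a firmer answer.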
