# Predicting Income with Random Forests

In this project, I use a dataset of census information from UCI’s Machine Learning Repository.

Using this census data, I train a random forest to predict whether or not a person makes more than $50,000 a year.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

### Investigating The Data

In [4]:
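# Fields are separated by a comma followed by a space; skipinitialspace strips that space
income_data = pd.read_csv("income.csv", header=0, sep=",", skipinitialspace=True)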

In [5]:
print(income_data.iloc[0])

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
income                    <=50K
Name: 0, dtype: object


### Formatting The Data For Scikit-learn

In [7]:
# Select the target as a 1-D Series, which is the shape scikit-learn expects for labels
labels = income_data["income"]

In [23]:
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
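
The cell above maps "Male" to 0 and every other value to 1, one row at a time. As a minimal sketch, the same encoding can be written as a whole-column comparison. The tiny `demo` frame below is made up for illustration and is not taken from `income.csv`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the census data.
demo = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

# Row-by-row version, mirroring the notebook cell.
demo["sex-int"] = demo["sex"].apply(lambda value: 0 if value == "Male" else 1)

# Vectorized equivalent: compare the whole column, then cast booleans to integers.
demo["sex-int-vectorized"] = (demo["sex"] != "Male").astype(int)

print(demo)
```

Both columns hold the same values; the vectorized form avoids calling a Python function once per row, which tends to matter more as the dataset grows.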

In [24]:
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int"]]

In [25]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)

### Creating the Random Forest

In [26]:
forest = RandomForestClassifier(random_state=1)

In [27]:
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

In [28]:
print(forest.score(test_data, test_labels))

0.8272939442328953


As-is, the model has an accuracy of just over 82% on the test set.
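
For a classifier, `score` returns the mean accuracy: the fraction of test rows whose predicted label matches the true label. The sketch below checks that by hand. It runs on synthetic data from `make_classification` rather than the census file, and names such as `demo_forest` are only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the census dataset).
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

demo_forest = RandomForestClassifier(random_state=1)
demo_forest.fit(X_train, y_train)

# Accuracy by hand: the share of predictions that equal the true labels.
manual_accuracy = np.mean(demo_forest.predict(X_test) == y_test)

# Accuracy as reported by the classifier's score method.
score_accuracy = demo_forest.score(X_test, y_test)

print(manual_accuracy, score_accuracy)
```

The two printed numbers should agree, since both count correct predictions and divide by the number of test rows.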

### Changing Column Types

In [29]:
print(income_data["native-country"].value_counts())

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [30]:
income_data["country-int"] = income_data["native-country"].apply(lambda row: 0 if row == "United-States" else 1)

In [31]:
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int", "country-int"]]

In [32]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)

In [33]:
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

In [34]:
print(forest.score(test_data, test_labels))

0.8225033779633951


Adding the country column to the model slightly decreases the accuracy, from about 82.7% to about 82.3%.
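
One way to investigate a result like this is the fitted forest's `feature_importances_` attribute, which gives each column a share of the total impurity reduction. The sketch below uses synthetic data with hypothetical column names (`signal`, `noise-flag`), where the label depends only on `signal`. It illustrates the technique and is not an analysis of the census columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption): one informative column and one random 0/1 flag.
rng = np.random.default_rng(1)
n_rows = 400
demo = pd.DataFrame({
    "signal": rng.normal(size=n_rows),
    "noise-flag": rng.integers(0, 2, size=n_rows),
})
# The label is determined entirely by the "signal" column.
demo_labels = (demo["signal"] > 0).astype(int)

demo_forest = RandomForestClassifier(random_state=1)
demo_forest.fit(demo, demo_labels)

# Pair each importance with its column name; the importances sum to 1.
importances = pd.Series(demo_forest.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```

Applied to the notebook's own `forest` and `data`, a low importance for `country-int` would suggest the column adds little beyond the existing features, though impurity-based importances are only a rough guide.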