# Predicting Income with Random Forests

Using census data and a random forest, we will try to predict whether a person makes more than $50,000 a year.

In this project, we will use a dataset containing census information from UCI’s Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/census+income).

In [1]:
import pandas as pd

<h3>Data</h3>

In [2]:
income_data = pd.read_csv('income.csv', header=0)

In [3]:
income_data.iloc[0]

age                            39
 workclass              State-gov
 fnlwgt                     77516
 education              Bachelors
 education-num                 13
 marital-status     Never-married
 occupation          Adm-clerical
 relationship       Not-in-family
 race                       White
 sex                         Male
 capital-gain                2174
 capital-loss                   0
 hours-per-week                40
 native-country     United-States
 income                     <=50K
Name: 0, dtype: object

From the first row above, we can see that this person made $50,000 or less. The 'income' column contains this information, and it is the label we will try to predict.

There is a small problem with our data that is a little hard to catch: every string, including the column names, has an extra space at the start. For example, the first row’s native-country is " United-States", but we want it to be "United-States". This happens because every comma in income.csv is followed by a space.

In [4]:
income_data = pd.read_csv('income.csv', header=0, delimiter=', ', engine='python')

In [5]:
income_data.iloc[0]

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
income                    <=50K
Name: 0, dtype: object
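
As an alternative to the two-character delimiter, `read_csv` also accepts `skipinitialspace=True`, which drops whitespace that directly follows each comma. The sketch below is only an illustration: it parses a small made-up sample (not the real income.csv) with and without that option.

```python
import io

import pandas as pd

# A two-row sample in the same "comma followed by a space" layout as income.csv.
# The rows are made up for illustration.
sample_csv = (
    "age, workclass, native-country, income\n"
    "39, State-gov, United-States, <=50K\n"
    "50, Private, Cuba, >50K\n"
)

# Default parsing keeps the space after each comma, in headers and in values.
raw = pd.read_csv(io.StringIO(sample_csv), header=0)

# skipinitialspace=True strips that space, so a plain "," separator is enough.
clean = pd.read_csv(io.StringIO(sample_csv), header=0, skipinitialspace=True)

print(list(raw.columns))
print(list(clean.columns))
print(repr(raw.loc[0, " native-country"]), repr(clean.loc[0, "native-country"]))
```

Printing with `repr` makes the leading space visible, which is a quick way to spot this kind of problem.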

<h3>Formatting The Data For Scikit-learn</h3>

Now that we have our data imported into a DataFrame, we can begin putting it in a format that our Random Forest can work with. To do this, we need to separate the labels from the rest of the data.

In [6]:
labels = income_data['income']  # 1-D Series; scikit-learn expects a 1-D target
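
The shape of the label object matters to scikit-learn: its classifiers expect a 1-D target, and a single-column DataFrame still works but triggers a DataConversionWarning during `fit`. The sketch below, on a made-up frame named `toy`, shows how the two ways of selecting a column differ.

```python
import pandas as pd

# Made-up frame standing in for income_data.
toy = pd.DataFrame({"age": [39, 50, 38], "income": ["<=50K", ">50K", "<=50K"]})

labels_2d = toy[["income"]]  # list of column names -> DataFrame with one column
labels_1d = toy["income"]    # single column name -> 1-D Series

print(type(labels_2d).__name__, labels_2d.shape)
print(type(labels_1d).__name__, labels_1d.shape)
```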

We will also want to pick which columns to use when trying to predict income.

In [7]:
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex"]]

Finally, we want to split our data and labels into a training set and a test set. We will use the training set to build the random forest, and the test set to see how accurate it is. 

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
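
To make the split more concrete, here is a minimal sketch on made-up data (the names `toy_data` and `toy_labels` are hypothetical). When no `test_size` is given, `train_test_split` holds out 25% of the rows, and `random_state` makes the shuffle reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 8 rows, one feature column, two classes.
toy_data = pd.DataFrame({"age": [25, 32, 47, 51, 38, 29, 60, 44]})
toy_labels = pd.Series(
    ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"]
)

# No test_size given, so 25% of the rows (2 of 8) go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    toy_data, toy_labels, random_state=1
)

print(len(X_train), len(X_test))

# Features and labels are shuffled together, so their indexes still line up.
print(list(X_train.index) == list(y_train.index))
```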

<h3>Creating The Random Forest</h3>

In [10]:
from sklearn.ensemble import RandomForestClassifier

In [11]:
forest = RandomForestClassifier(random_state=1)

In [12]:
forest.fit(train_data, train_labels)

ValueError: could not convert string to float: 'Female'

There is a problem with using the "sex" column when training the random forest.

That column holds values like "Male" and "Female". scikit-learn’s random forest cannot use columns that contain strings; every feature has to be numeric, such as an integer or a float. For now, we remove the "sex" column from data.

In [13]:
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week"]]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

Now that our training set doesn’t have a column containing strings, we have successfully fit our random forest.
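
One way to catch string columns before calling `fit` is to inspect the column dtypes. This is a small sketch on a made-up frame, not part of the original workflow.

```python
import pandas as pd

# Made-up frame with the same mix of numeric and string columns.
toy = pd.DataFrame(
    {
        "age": [39, 50],
        "hours-per-week": [40, 13],
        "sex": ["Male", "Female"],
    }
)

# Columns that are not numeric need to be dropped or encoded before fitting.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```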

In [14]:
forest.score(test_data, test_labels)

0.8222577078982926
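
For a classifier, `score` returns the mean accuracy: the fraction of test rows whose predicted label matches the true label. The sketch below checks this on synthetic data. The data, the name `toy_forest`, and `n_estimators=20` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 numeric features, binary label.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
toy_forest = RandomForestClassifier(n_estimators=20, random_state=1)
toy_forest.fit(X_train, y_train)

# Accuracy computed by hand from the predictions.
manual_accuracy = accuracy_score(y_test, toy_forest.predict(X_test))

# The two numbers should agree.
print(toy_forest.score(X_test, y_test), manual_accuracy)
```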

<h3>Changing Column Types</h3>

Now that we know the random forest works, let’s go back and try to add the "sex" column. If we transformed those strings into integers, we could use this data! If we take every row and make every "Male" a 0 and every "Female" a 1, we could then use the column in our random forest.

In [15]:
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
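
Note that the lambda turns every value other than "Male" into 1, including any unexpected entries. A possible alternative, sketched here on made-up values, is `Series.map` with an explicit dictionary, which leaves unmapped values as NaN so they are easy to spot.

```python
import pandas as pd

# Made-up values; the last one simulates an unexpected entry.
sex = pd.Series(["Male", "Female", "Female", "unknown"])

# apply + lambda: anything that is not "Male" becomes 1.
via_lambda = sex.apply(lambda value: 0 if value == "Male" else 1)

# map + dict: values missing from the dict become NaN.
via_map = sex.map({"Male": 0, "Female": 1})

print(via_lambda.tolist())
print(int(via_map.isna().sum()))
```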

Next, we add the "sex-int" column to data and retrain.

In [16]:
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int"]]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
forest.score(test_data, test_labels)

0.8272939442328953

There are a couple of other string columns that might be useful. Let’s try transforming the values in the "native-country" column. First, we take a look at the different values that exist in the column.

In [17]:
income_data["native-country"].value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

Since the majority of the data comes from "United-States", it might make sense to make a column where every row that contains "United-States" becomes a 0 and any other country becomes a 1.

In [18]:
income_data["country-int"] = income_data["native-country"].apply(lambda row: 0 if row == "United-States" else 1)
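
The same flag can be built without a lambda by comparing the whole column at once. The sketch below uses a made-up sample and also shows one consequence of this encoding: rows with the "?" placeholder are counted as "other country".

```python
import pandas as pd

# Made-up sample that includes the "?" placeholder seen in value_counts().
country = pd.Series(["United-States", "Mexico", "?", "United-States"])

# Vectorized equivalent of the lambda: True/False cast to 1/0.
country_int = (country != "United-States").astype(int)

# Rows with an unknown country also end up as 1.
unknown_flagged = int(country_int[country == "?"].sum())

print(country_int.tolist(), unknown_flagged)
```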

When mapping strings to numbers like this, it is important to make sure that the numeric order makes sense. For example, it wouldn’t make much sense to map "United-States" to 0, "Germany" to 1, and "Mexico" to 2. If we did this, we would be saying that Germany sits between the United States and Mexico, and that Mexico is closer to Germany than it is to the United States.

However, if a column had values like "low", "medium", and "high", mapping them to 0, 1, and 2 would make sense because the strings themselves have a natural order.
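
The two cases can be sketched side by side. For ordered values, an explicit dictionary keeps the order. For unordered values, one common option that this project does not use is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per category so that no order is implied. The frame and column names below are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "level": ["low", "high", "medium"],                  # ordered categories
        "country": ["United-States", "Germany", "Mexico"],   # unordered categories
    }
)

# Ordered values: integers preserve low < medium < high.
toy["level-int"] = toy["level"].map({"low": 0, "medium": 1, "high": 2})

# Unordered values: one 0/1 column per category.
country_dummies = pd.get_dummies(toy["country"], prefix="country", dtype=int)

encoded = pd.concat([toy[["level-int"]], country_dummies], axis=1)
print(encoded)
```

The trade-off is that one-hot encoding adds a column per category, which can be a lot for a column like "native-country".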

In [19]:
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int", "country-int"]]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
forest.score(test_data, test_labels)

0.8225033779633951

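The best accuracy our model reached is 82.73%, using the "age", "capital-gain", "capital-loss", "hours-per-week", and "sex-int" columns. Adding "country-int" lowered the accuracy slightly, to 82.25%.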
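
The repeated split, fit, and score steps can be wrapped in a loop when comparing feature sets. The following is a rough sketch on synthetic data, so its scores say nothing about the census results. The frame `toy`, the labels, and `n_estimators=20` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for income_data; the values are randomly generated.
rng = np.random.RandomState(1)
toy = pd.DataFrame(
    {
        "age": rng.randint(18, 80, size=300),
        "hours-per-week": rng.randint(10, 60, size=300),
        "sex-int": rng.randint(0, 2, size=300),
    }
)
toy_labels = (toy["age"] + toy["hours-per-week"] > 85).astype(int)

feature_sets = {
    "numeric only": ["age", "hours-per-week"],
    "numeric + sex-int": ["age", "hours-per-week", "sex-int"],
}

scores = {}
for name, columns in feature_sets.items():
    # Same random_state for every split, so each feature set sees the same rows.
    X_train, X_test, y_train, y_test = train_test_split(
        toy[columns], toy_labels, random_state=1
    )
    model = RandomForestClassifier(n_estimators=20, random_state=1)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Name of the feature set with the highest test accuracy.
best = max(scores, key=scores.get)
print(scores, best)
```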