Let's start by importing everything we will need in this project.

In [1]:
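# Silence all warnings by replacing warnings.warn with a no-op.
# Note that this also hides warnings that could be useful, e.g. from scikit-learn.
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn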
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import numpy as np

I'm going to load our data and name the columns, since the file doesn't define them.

In [2]:
income_data = pd.read_csv('income.data', delimiter = ", ")
income_data.columns = ['age', 'workclass', 'fnlwgt', 'education', 'edication-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
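
One thing to watch for here: if income.data has no header row, read_csv will by default use the first line of the file as column names, and assigning income_data.columns afterwards replaces that line instead of keeping it as data. Below is a minimal sketch of an alternative that keeps every row. The two sample rows are copied from the head() output in this notebook, and the in-memory `sample` buffer is only a stand-in for the real file (an assumption for illustration).

```python
import io
import pandas as pd

columns = ['age', 'workclass', 'fnlwgt', 'education', 'edication-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week',
           'native-country', 'income']

# Stand-in for income.data: two header-less rows in the same layout.
sample = io.StringIO(
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
)

# header=None tells pandas that the first line is data, not column names.
# skipinitialspace=True drops the space that follows each comma.
df = pd.read_csv(sample, header=None, names=columns, skipinitialspace=True)

print(len(df))              # both rows are kept
print(df.loc[0, 'age'])
```

With the real file, the same arguments could be passed to pd.read_csv('income.data', ...). Keeping the first row would change the row count slightly, so the numbers further down may differ a little.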

In [3]:

income_data.head()


Unnamed: 0,age,workclass,fnlwgt,education,edication-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [4]:
income_data.iloc[0]

age                               50
workclass           Self-emp-not-inc
fnlwgt                         83311
education                  Bachelors
edication-num                     13
marital-status    Married-civ-spouse
occupation           Exec-managerial
relationship                 Husband
race                           White
sex                             Male
capital-gain                       0
capital-loss                       0
hours-per-week                    13
native-country         United-States
income                         <=50K
Name: 0, dtype: object

Since our goal is to predict whether or not a person makes more than $50,000, the labels for our model will be the column containing the information about income.

In [5]:
labels = income_data['income']

The model needs numeric training data, so I'm going to add new columns that encode the values I'm interested in as numbers.

In [6]:
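# Encode each category as 0 or 1 (the lambda receives one cell value at a time).
income_data["sex-int"] = income_data["sex"].apply(lambda value: 0 if value == "Male" else 1)
# Note: "usa" is 0 for United-States and 1 for every other country.
income_data["usa"] = income_data["native-country"].apply(lambda value: 0 if value == "United-States" else 1)
# This overwrites the original workclass column; it is not used as a feature below.
income_data["workclass"] = income_data["workclass"].apply(lambda value: 0 if value == "Private" else 1)
#print(income_data["workclass"].value_counts())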
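
The same 0/1 encoding can also be written without apply, by comparing the whole column at once. This is only a sketch on a tiny made-up frame (the `toy` DataFrame is an assumption for illustration), but it follows the same convention as above: 0 for "Male" and "United-States", 1 for everything else.

```python
import pandas as pd

# Tiny made-up frame with the two text columns we encode.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "native-country": ["United-States", "Cuba", "United-States"],
})

# A comparison returns True/False per row; astype(int) turns that into 1/0.
toy["sex-int"] = (toy["sex"] != "Male").astype(int)
toy["usa"] = (toy["native-country"] != "United-States").astype(int)

print(toy[["sex-int", "usa"]])
```

On a large dataset this form is usually faster than a Python lambda, and it makes the "1 means not equal" convention visible in the code.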

Let's put the features we think are most important for predicting income into a variable called data.

In [7]:
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex-int', 'usa']]

It's time to split it into training and test data, create a Random Forest Classifier model and train it.

In [8]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
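
To make it clearer what this split does: when no test_size is given, train_test_split holds out 25% of the rows for testing, and a fixed random_state makes the split repeatable. Here is a small sketch on made-up data (the `toy_data` and `toy_labels` names are assumptions for illustration, not part of the income dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 made-up rows with a single feature and a two-class label.
toy_data = pd.DataFrame({"x": np.arange(100)})
toy_labels = pd.Series(["<=50K"] * 80 + [">50K"] * 20)

# Same call shape as in the notebook: no test_size, fixed random_state.
a_train, a_test, a_train_y, a_test_y = train_test_split(
    toy_data, toy_labels, random_state=1)
b_train, b_test, b_train_y, b_test_y = train_test_split(
    toy_data, toy_labels, random_state=1)

print(len(a_train), len(a_test))            # 75 25
print(a_test.index.equals(b_test.index))    # True: same seed, same split
```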

In [9]:
forest = RandomForestClassifier(random_state = 1)
forest.fit(train_data, train_labels)


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

Let's find out which features from our list contribute most to the model's predictions and measure the model's accuracy.

In [10]:
forest.feature_importances_

array([0.31329156, 0.28935528, 0.11537207, 0.20597669, 0.06629625,
       0.00970814])

We can see that, of the features we chose, 'age', 'capital-gain' and 'hours-per-week' carry the most weight in the model (about 0.31, 0.29 and 0.21).
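
The array is easier to read when each value is paired with its column name. A minimal sketch, using the importances printed above and the column order from data (typed in by hand here so the block runs on its own):

```python
import pandas as pd

feature_names = ['age', 'capital-gain', 'capital-loss',
                 'hours-per-week', 'sex-int', 'usa']
# Values copied from the forest.feature_importances_ output above.
importances = [0.31329156, 0.28935528, 0.11537207,
               0.20597669, 0.06629625, 0.00970814]

# Label each importance with its feature and sort from largest to smallest.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In the notebook itself the same thing could be done with pd.Series(forest.feature_importances_, index=data.columns). The importances add up to 1, so each value can be read as a share of the total.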

In [11]:
print(forest.score(test_data, test_labels))

0.8153562653562654


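The accuracy of our model on the test data is about 82% (0.815). How good that is depends on the class balance, so it is worth comparing it with the accuracy we would get by always predicting the more common income class.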
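
One way to put an accuracy score in context is to compare it with the accuracy of always guessing the most frequent label. Below is a small sketch of that baseline; `majority_baseline_accuracy` is a hypothetical helper written for this example and the labels are made up.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels: three of four people earn <=50K.
toy_labels = ["<=50K", "<=50K", "<=50K", ">50K"]
print(majority_baseline_accuracy(toy_labels))  # 0.75
```

In the notebook, calling it on test_labels would give the number the forest's score has to beat.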

Now I want to predict the income of a made-up person. For that I created a list of feature values, in the same order as the columns of data, and transformed it into a numpy array with a single row.

In [12]:
# Order: age, capital-gain, capital-loss, hours-per-week, sex-int, usa
simple_data1 = [28, 3200, 0, 35, 1, 1]
simple_data = np.array(simple_data1)
simple_data = simple_data.reshape(1,-1)
print(forest.predict(simple_data))

['<=50K']
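
Since the order of the values matters, it can be safer to build a one-row DataFrame with named columns instead of a bare array. A sketch, assuming the same six feature names as in data:

```python
import pandas as pd

feature_names = ['age', 'capital-gain', 'capital-loss',
                 'hours-per-week', 'sex-int', 'usa']
simple_data1 = [28, 3200, 0, 35, 1, 1]

# Wrapping the list in another list makes it a single row, not a column.
person = pd.DataFrame([simple_data1], columns=feature_names)
print(person)
```

Passing person to forest.predict should give the same result as the reshaped array, and the column names make it obvious which number is which.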


Now I want to create a DecisionTreeClassifier and compare its score and prediction with the Random Forest's.

In [13]:
classifier = tree.DecisionTreeClassifier()
classifier.fit(train_data, train_labels)

print(classifier.score(test_data, test_labels))
print(classifier.predict(simple_data))

0.8133906633906633
['<=50K']


We can see that their scores are almost the same, although the Random Forest model is slightly more accurate (0.815 vs 0.813).
Both models predicted that the made-up individual's income (in simple_data) would be at most $50,000.
