# Predicting Income: A Supervised Classification Project

### Introduction

In this project, we use a dataset of census information from UCI’s Machine Learning Repository.

Using this census data with a random forest, we try to predict whether or not a person makes more than $50,000 per year.

The data was extracted by Barry Becker from the 1994 Census database. A set of reasonably clean records was selected using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`
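
The extraction filter quoted above can be read as a row-wise boolean mask. The following is a minimal sketch, assuming a hypothetical raw table whose columns use the names from the condition (`AAGE`, `AGI`, `AFNLWGT`, `HRSWK`). The values are made up for illustration and are not census records.

```python
import pandas as pd

# Hypothetical raw records; column names follow the extraction condition.
raw = pd.DataFrame({
    "AAGE":    [15, 34, 52, 41],
    "AGI":     [0, 42000, 90, 61000],
    "AFNLWGT": [1200, 88000, 45000, 1],
    "HRSWK":   [0, 40, 35, 50],
})

# Keep a row only if all four conditions hold at once.
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[mask]
print(clean)
```

Only the second made-up row satisfies every condition, so it is the only one kept.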

Data source: UCI's Machine Learning Repository (https://archive.ics.uci.edu/dataset/20/census+income)

### Importing modules

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

### Loading the data

In [31]:
header = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

In [32]:
df = pd.read_csv('full_data.csv', engine='python', index_col=None, names=header)

### Investigating the data

In [33]:
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [60]:
df.occupation.value_counts()

Craft-repair         6020
Prof-specialty       6008
Exec-managerial      5984
Adm-clerical         5540
Sales                5408
Other-service        4808
Machine-op-inspct    2970
Transport-moving     2316
Handlers-cleaners    2046
Farming-fishing      1480
Tech-support         1420
Protective-serv       976
Priv-house-serv       232
Armed-Forces           14
Name: occupation, dtype: int64

In [62]:
df.race.value_counts()

White                 38903
Black                  4228
Asian-Pac-Islander     1303
Amer-Indian-Eskimo      435
Other                   353
Name: race, dtype: int64

In [63]:
df.sex.value_counts()

Male      30527
Female    14695
Name: sex, dtype: int64

### Cleaning the data

##### Stripping spaces from strings

In [36]:
string_columns = df.select_dtypes('object').columns
df[string_columns] = df[string_columns].apply(lambda x: x.str.strip())
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
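
Stripping matters because a value with a stray leading space does not compare equal to the clean value, which would also hide the `'?'` placeholders handled in the next step. The following is a small sketch on made-up values, assuming the raw strings carry a leading space. It selects the non-numeric columns with `exclude="number"`, a slight variation on the cell above.

```python
import pandas as pd

# Made-up frame with leading spaces, as they might appear right after loading.
demo = pd.DataFrame({"workclass": [" Private", " ?"], "age": [38, 52]})
print((demo["workclass"] == "Private").tolist())  # comparison before stripping

# Strip whitespace from every non-numeric column; numeric columns are untouched.
string_columns = demo.select_dtypes(exclude="number").columns
demo[string_columns] = demo[string_columns].apply(lambda x: x.str.strip())
print((demo["workclass"] == "Private").tolist())  # comparison after stripping
```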


##### Dropping records with missing values ('?')

In [37]:
df = df.replace("?", np.nan)

In [38]:
df = df.dropna().reset_index()  # without drop=True, the old row labels are kept as a new 'index' column

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           45222 non-null  int64 
 1   age             45222 non-null  int64 
 2   workclass       45222 non-null  object
 3   fnlwgt          45222 non-null  int64 
 4   education       45222 non-null  object
 5   education-num   45222 non-null  int64 
 6   marital-status  45222 non-null  object
 7   occupation      45222 non-null  object
 8   relationship    45222 non-null  object
 9   race            45222 non-null  object
 10  sex             45222 non-null  object
 11  capital-gain    45222 non-null  int64 
 12  capital-loss    45222 non-null  int64 
 13  hours-per-week  45222 non-null  int64 
 14  native-country  45222 non-null  object
 15  income          45222 non-null  object
dtypes: int64(7), object(9)
memory usage: 5.5+ MB
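
The `info()` output above lists 16 columns rather than 15 because `reset_index()` keeps the old row labels as a new `index` column. Since that column is not dropped later, it is passed to the models as a feature. If that is not intended, `reset_index(drop=True)` discards it. The following is a minimal sketch on a made-up frame. Whether this change would alter the scores reported below has not been checked here.

```python
import numpy as np
import pandas as pd

# Made-up frame with one '?' placeholder.
demo = pd.DataFrame({
    "occupation": ["Sales", "?", "Tech-support"],
    "hours-per-week": [40, 20, 38],
})

# Mark '?' as missing, then drop incomplete rows.
demo = demo.replace("?", np.nan).dropna()

kept = demo.reset_index()              # old labels become an 'index' column
dropped = demo.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
```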


### Formatting the data

##### Separating the dependent and independent variables 

In [132]:
labels = df.income
data = df.drop(columns=['income', 'education', 'race', 'sex'])

Here, we also remove the variables 'education', 'race' and 'sex', having observed that excluding them improves model performance.

##### Transforming categorical variables into binary (one-hot) columns

In [133]:
data = pd.get_dummies(data)
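
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up two-column frame (not the census data). Numeric columns pass through unchanged, while each distinct string value becomes its own indicator column.

```python
import pandas as pd

# Made-up frame: one numeric column and one categorical column.
demo = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# Each category of 'workclass' turns into a separate indicator column.
encoded = pd.get_dummies(demo)
print(encoded)
```

The new columns are named `<original column>_<category>`, so a column with many distinct values, such as 'native-country', adds one feature per value.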

##### Splitting the data into training and test sets

In [134]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.3, random_state=1)
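
As a quick check of what `test_size=0.3` means for the 45222 cleaned records, the sketch below splits a placeholder array of the same length. The array stands in for the real data and is only used to count rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder array with as many rows as the cleaned data (45222).
rows = np.arange(45222)

# Same split settings as the cell above.
train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=1)
print(len(train_rows), len(test_rows))
```

With these settings the test set holds 13567 rows and the training set the remaining 31655, so every accuracy score below is a fraction of 13567 test predictions.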

### Creating the Random Forest

##### Creating the classifier and fitting it to the training data

In [135]:
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)

##### Testing the accuracy of the random forest classifier

In [136]:
forest.score(test_data, test_labels)

0.8586275521485959

In [143]:
# forest.feature_importances_
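
The raw `feature_importances_` array is hard to read because it carries no column names. One way to inspect it, sketched here on synthetic data with hypothetical feature names rather than on the census features, is to wrap it in a labeled `pd.Series` and sort it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the census features (not the real data).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=1
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

forest_demo = RandomForestClassifier(random_state=1).fit(X, y)

# Pair each importance with its column name and sort, largest first.
importances = pd.Series(forest_demo.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook itself, the same pattern would use `forest.feature_importances_` with `train_data.columns` as the index.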

### Comparing to a single decision tree

##### Creating the decision tree classifier and fitting it to the training data

In [139]:
dt = tree.DecisionTreeClassifier(criterion='gini')
dt.fit(train_data, train_labels)

##### Testing the accuracy of the decision tree classifier

In [140]:
dt.score(test_data, test_labels)

0.8116016805483894

In [144]:
# dt.feature_importances_

### Conclusion

The random forest classifier performs well at predicting whether a person makes more than $50,000, reaching about 85.9% accuracy on the test set. The single decision tree is also fairly accurate, at about 81.2%, but underperforms the random forest. A likely explanation is that a random forest averages many trees, each trained on a different bootstrap sample and on random subsets of the features, which reduces the overfitting that a single unpruned decision tree is prone to and therefore gives more accurate predictions on new data.
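
One way to check the overfitting explanation is to compare training accuracy with test accuracy for each model, since a large gap between the two is the usual sign of overfitting. The sketch below does this on synthetic, noisy data generated with `make_classification`. It is an illustration of the comparison, not a rerun of the census experiment, and its numbers should not be read as results for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data standing in for the census features.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
}

# Record (training accuracy, test accuracy) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```

In the notebook, the same comparison would call `forest.score(train_data, train_labels)` and `dt.score(train_data, train_labels)` alongside the test scores already reported.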