#### FOUNDATIONS OF MACHINE LEARNING: SUPERVISED LEARNING

<br>

# Predicting Income with Random Forests

Using census data, we will train a random forest to predict whether a person earns more than \\$50,000 per year.

<br>

#### Attribute Information:
- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- `income`: more than \\$50,000 (`>50K`) or less than or equal to \\$50,000 (`<=50K`).

<br>

The data comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census%20income).

<hr>

In [56]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

### Investigate the Data

In [12]:
income_data = pd.read_csv('income.csv', header = 0, delimiter = ", ", engine='python')
income_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [13]:
# Inspect the first row of the DataFrame
print(income_data.iloc[0])

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
income                    <=50K
Name: 0, dtype: object


### Format the Data for `scikit-learn`

In [15]:
labels = income_data[['income']]
print(labels)

      income
0      <=50K
1      <=50K
2      <=50K
3      <=50K
4      <=50K
...      ...
32556  <=50K
32557   >50K
32558  <=50K
32559  <=50K
32560   >50K

[32561 rows x 1 columns]


In [20]:
# The 'sex' column holds strings ('Male'/'Female') rather than numbers, so the classifier
# cannot use it directly. Leave it out for now and start with three numeric columns.

data = income_data[['capital-gain', 'capital-loss', 'hours-per-week']]
print(data)

       capital-gain  capital-loss  hours-per-week
0              2174             0              40
1                 0             0              13
2                 0             0              40
3                 0             0              40
4                 0             0              40
...             ...           ...             ...
32556             0             0              38
32557             0             0              40
32558             0             0              40
32559             0             0              20
32560         15024             0              40

[32561 rows x 3 columns]


In [21]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
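
The call above does not pass `test_size`, so it relies on scikit-learn's default of holding out 25% of the rows for testing. The following is a minimal sketch of that behavior on a made-up 20-row frame; `toy_data` and `toy_labels` are hypothetical stand-ins, not part of the census data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, so the default 25% test split is easy to check
toy_data = pd.DataFrame({'hours-per-week': range(20)})
toy_labels = pd.Series(['<=50K', '>50K'] * 10)

# Same call shape as above: no test_size given, fixed random_state for reproducibility
toy_train, toy_test, toy_train_labels, toy_test_labels = train_test_split(
    toy_data, toy_labels, random_state=1
)

print(len(toy_train), len(toy_test))  # 15 5
```

Fixing `random_state` makes the split repeatable, which is why the two splits later in this notebook (same seed, same row count) assign the same rows to the test set.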

### Create the Random Forest

In [22]:
forest = RandomForestClassifier(random_state = 1)
# train_labels is a one-column DataFrame; ravel() flattens it to the 1-D array scikit-learn expects
forest.fit(train_data, train_labels.values.ravel())


RandomForestClassifier(random_state=1)

In [23]:
print(forest.score(test_data, test_labels))

0.8373664169021005
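
An accuracy figure is easier to interpret next to a baseline, because the two income classes are not equally common. One possible check, sketched here on made-up data, is scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The names `toy_features`, `toy_labels`, and `baseline` are hypothetical; on the real data the same idea would use `train_data`, `train_labels`, `test_data`, and `test_labels`.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: 6 of the 8 labels are '<=50K'
toy_features = pd.DataFrame({'hours-per-week': [40, 13, 40, 40, 50, 38, 40, 20]})
toy_labels = pd.Series(['<=50K', '<=50K', '<=50K', '>50K', '<=50K', '<=50K', '>50K', '<=50K'])

# Always predict the majority class, ignoring the features entirely
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_features, toy_labels)

baseline_score = baseline.score(toy_features, toy_labels)
print(baseline_score)  # 0.75, the share of the majority class
```

A model is only adding value to the extent that its score exceeds this majority-class baseline.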


### Changing Column Types

In [55]:
print(income_data['native-country'].value_counts())

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

In [49]:
income_data['sex-int'] = income_data['sex'].apply(lambda x: 0 if x == 'Male' else 1)
income_data['country-int'] = income_data['native-country'].apply(lambda x: 0 if x == 'United-States' else 1)
income_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | sex-int | country-int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 0 | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 1 | 1 |
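
The `country-int` encoding above maps every value other than `United-States` to 1, which includes the `?` placeholder seen in the value counts. An alternative worth considering is one-hot encoding, which keeps each category as its own 0/1 column. The sketch below uses `pd.get_dummies` on a made-up column; `toy`, `cleaned`, and the `Unknown` label are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy column, including the '?' placeholder for a missing country
toy = pd.DataFrame({'native-country': ['United-States', 'Cuba', '?', 'Mexico', 'United-States']})

# Give the placeholder an explicit name so it becomes its own category
cleaned = toy['native-country'].replace('?', 'Unknown')

# One indicator column per category; each row has exactly one column set
one_hot = pd.get_dummies(cleaned, prefix='country')
print(one_hot.columns.tolist())
```

With about 40 countries, this would add many sparse columns, so the simpler binary flag used above is a reasonable trade-off for a first model.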


In [50]:
new_data = income_data[['capital-gain', 'capital-loss', 'hours-per-week', 'sex-int', 'country-int']]
print(new_data)

       capital-gain  capital-loss  hours-per-week  sex-int  country-int
0              2174             0              40        0            0
1                 0             0              13        0            0
2                 0             0              40        0            0
3                 0             0              40        0            0
4                 0             0              40        1            1
...             ...           ...             ...      ...          ...
32556             0             0              38        1            0
32557             0             0              40        0            0
32558             0             0              40        1            0
32559             0             0              20        0            0
32560         15024             0              40        1            0

[32561 rows x 5 columns]


In [51]:
new_train_data, new_test_data, new_train_labels, new_test_labels = train_test_split(new_data, labels, random_state = 1)

In [52]:
# Refit the same forest on the five-column data; ravel() flattens the one-column label DataFrame
forest.fit(new_train_data, new_train_labels.values.ravel())


RandomForestClassifier(random_state=1)

In [60]:
# Each value is the importance of the corresponding column in new_data, in column order; the values sum to 1
print(forest.feature_importances_)

[0.481075   0.20724286 0.21995084 0.08552218 0.00620913]
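
A bare array of importances is hard to read without the column names next to it. One way to pair them, sketched here on made-up data, is to wrap the array in a `pandas.Series` indexed by the feature names. The names `toy`, `toy_labels`, and `toy_forest` are hypothetical; on the real data the same pattern would use `forest.feature_importances_` and `new_data.columns`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data with a subset of the notebook's column names
toy = pd.DataFrame({
    'capital-gain': [0, 0, 15024, 0, 7688, 0, 0, 5178],
    'hours-per-week': [40, 13, 40, 20, 50, 38, 40, 45],
    'sex-int': [0, 0, 1, 0, 0, 1, 1, 0],
})
toy_labels = ['<=50K', '<=50K', '>50K', '<=50K', '>50K', '<=50K', '<=50K', '>50K']

toy_forest = RandomForestClassifier(n_estimators=50, random_state=1)
toy_forest.fit(toy, toy_labels)

# Label each importance with its column name and sort from most to least important
importances = pd.Series(
    toy_forest.feature_importances_, index=toy.columns
).sort_values(ascending=False)
print(importances)
```

Read this way, the output above says that `capital-gain` accounts for roughly 48% of the importance while `country-int` contributes under 1%.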


In [53]:
print(forest.score(new_test_data, new_test_labels))

0.8361380665765876


### Compare with Decision Tree

In [58]:
classifier = DecisionTreeClassifier(random_state = 1)
classifier.fit(new_train_data, new_train_labels)

DecisionTreeClassifier(random_state=1)

In [59]:
print(classifier.score(new_test_data, new_test_labels))

0.8389632723252671
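
The two test scores differ by about 0.3 percentage points on a single train/test split, which is too small a gap to conclude that one model is better. Cross-validation gives a steadier comparison by averaging over several splits. The sketch below uses `cross_val_score` on synthetic data from `make_classification`; the data, the `models` dictionary, and the choice of 5 folds are assumptions for illustration, not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five census features (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

models = {
    'decision tree': DecisionTreeClassifier(random_state=1),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=1),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold serves once as the held-out test set
    cv_scores[name] = cross_val_score(model, X, y, cv=5)
    print(name, cv_scores[name].mean().round(3), cv_scores[name].std().round(3))
```

Comparing the mean scores alongside their spread shows whether a gap between the models is larger than the variation from one split to the next.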
