<a href="https://colab.research.google.com/github/maztig/Machine-Learning/blob/main/IncomePrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Import and Analysis**

In [200]:
import pandas as pd

In [201]:
df = pd.read_csv("Income.csv")

In [202]:
df

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [203]:
for x in df.columns:
  print(x)

age
workclass
fnlwgt
education
educational-num
marital-status
occupation
relationship
race
gender
capital-gain
capital-loss
hours-per-week
native-country
income


# **Data manipulation**

## **Turning names into binary**

Each distinct string value in a categorical column becomes its own column holding a binary value (0 or 1).

This is done because the model needs numeric inputs and cannot work with raw strings.
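As a minimal sketch of this encoding, the toy frame below (hypothetical values, not the real dataset) shows what `pd.get_dummies` produces for one column. The `dtype=int` argument is an addition here: it requests 0/1 integers, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with a single categorical column (hypothetical values).
toy = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private", "?"]})

# One column per distinct value; dtype=int requests 0/1 instead of booleans.
encoded = pd.get_dummies(toy["workclass"], prefix="workclass", dtype=int)

# Replace the original string column with its encoded columns.
toy = pd.concat([toy.drop("workclass", axis=1), encoded], axis=1)
print(toy)
```

Each row has exactly one 1 among the new columns, marking the value it originally held.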

In [204]:
# One-hot encode each categorical column: drop the original and append its dummy columns.
df = pd.concat([df.drop("occupation", axis=1), pd.get_dummies(df.occupation).add_prefix("occupation_")], axis=1)
df = pd.concat([df.drop("workclass", axis=1), pd.get_dummies(df.workclass).add_prefix("workclass_")], axis=1)
# "education" is dropped instead of encoded: "educational-num" already holds it as a number.
df = df.drop("education", axis=1)
df = pd.concat([df.drop("marital-status", axis=1), pd.get_dummies(df["marital-status"]).add_prefix("marital-status_")], axis=1)
df = pd.concat([df.drop("relationship", axis=1), pd.get_dummies(df.relationship).add_prefix("relationship_")], axis=1)
df = pd.concat([df.drop("race", axis=1), pd.get_dummies(df.race).add_prefix("race_")], axis=1)
df = pd.concat([df.drop("native-country", axis=1), pd.get_dummies(df["native-country"]).add_prefix("native-country_")], axis=1)

In [205]:
# Encode the two-valued columns as 0/1: "Male" -> 1 and ">50K" -> 1, everything else -> 0.
df["gender"] = df["gender"].apply(lambda x: 1 if x == "Male" else 0)
df["income"] = df["income"].apply(lambda x: 1 if x == ">50K" else 0)

In [206]:
df

Unnamed: 0,age,fnlwgt,educational-num,gender,capital-gain,capital-loss,hours-per-week,income,occupation_?,occupation_Adm-clerical,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,25,226802,7,1,0,0,40,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,38,89814,9,1,0,0,50,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,28,336951,12,1,0,0,40,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,44,160323,10,1,7688,0,40,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,18,103497,10,0,0,0,30,0,1,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,257302,12,0,0,0,38,0,0,0,...,0,0,0,0,0,0,0,1,0,0
48838,40,154374,9,1,0,0,40,1,0,0,...,0,0,0,0,0,0,0,1,0,0
48839,58,151910,9,0,0,0,40,0,0,1,...,0,0,0,0,0,0,0,1,0,0
48840,22,201490,9,1,0,0,20,0,0,1,...,0,0,0,0,0,0,0,1,0,0


# **Training**

The dataset is split into a training set (80% of the rows) and a test set (20% of the rows).

In [207]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2)
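The split above is random, so the rows in each part change on every run. A minimal sketch of a reproducible variant, assuming a toy frame with hypothetical values: `random_state` fixes the shuffle and `stratify` keeps the class ratio the same in both parts. Neither argument is used in the notebook itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced binary target (hypothetical values).
toy = pd.DataFrame({"age": range(100), "income": [1] * 25 + [0] * 75})

# random_state fixes the shuffle; stratify keeps the 25/75 class ratio in both parts.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["income"]
)

print(len(train_toy), len(test_toy))
print(train_toy["income"].mean(), test_toy["income"].mean())
```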

The features X and the target Y are defined; Y (the income column) is what the model should predict.

In [208]:
train_X = train_df.drop("income", axis=1)
train_Y = train_df["income"]

test_X = test_df.drop("income", axis=1)
test_Y = test_df["income"]

A RandomForestClassifier is used in this project, since the Y value is a class label that is only ever 0 or 1.

In [209]:
forest = RandomForestClassifier()

In [210]:
forest.fit(train_X, train_Y)

# **Testing**

In [211]:
forest.score(test_X, test_Y)

0.8579179035725253
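For a classifier, `score` returns the mean accuracy, i.e. the fraction of test rows predicted correctly. The sketch below checks this on synthetic data (a stand-in, not the income dataset) and also computes the accuracy of always predicting the most frequent class, which can be a useful reference point when one class dominates. The `DummyClassifier` baseline is an addition for illustration and is not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the income data (not the real dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# score() is the fraction of correct predictions, i.e. plain accuracy.
acc = model.score(X_test, y_test)
same = accuracy_score(y_test, model.predict(X_test))
base_acc = baseline.score(X_test, y_test)
print(acc, same, base_acc)
```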

After training and evaluating, you can inspect how important each feature of the dataset is to the model. This is interesting and useful for finding features whose share of the importance is suspiciously high or negligibly low.

In [212]:
importances = dict(zip(forest.feature_names_in_, forest.feature_importances_))
importances = {k: v for k, v in sorted(importances.items(), key = lambda x: x[1], reverse=True)}

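It would be smart to drop "fnlwgt": it is a census sampling weight rather than a property of the person, yet it comes out as the most important feature.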
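One way to check whether dropping a column hurts is to train once with it and once without it, then compare test accuracy. The sketch below does this on synthetic data in which "age" determines the target and "fnlwgt" is pure noise; both the data and the `fit_and_score` helper are hypothetical and only illustrate the procedure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in: "age" drives the target, "fnlwgt" is pure noise.
toy = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "fnlwgt": rng.integers(10_000, 500_000, n),
})
toy["income"] = (toy["age"] > 45).astype(int)

train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=0)

def fit_and_score(drop_cols):
    """Train on all columns except the target and drop_cols; return test accuracy."""
    cols = ["income"] + drop_cols
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(train_toy.drop(columns=cols), train_toy["income"])
    return model.score(test_toy.drop(columns=cols), test_toy["income"])

with_weight = fit_and_score([])
without_weight = fit_and_score(["fnlwgt"])
print(with_weight, without_weight)
```

If accuracy stays about the same without the column, the column can be removed without losing predictive power.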

In [213]:
importances

{'fnlwgt': 0.17155276206820386,
 'age': 0.1554383570946664,
 'educational-num': 0.11112649724334617,
 'capital-gain': 0.0957346762132802,
 'hours-per-week': 0.08500234662443626,
 'marital-status_Married-civ-spouse': 0.06023166764083589,
 'relationship_Husband': 0.042934170133298734,
 'capital-loss': 0.031607804715937,
 'marital-status_Never-married': 0.023365398577276566,
 'occupation_Exec-managerial': 0.0190739726757666,
 'occupation_Prof-specialty': 0.017593688131087654,
 'gender': 0.014844236581667222,
 'relationship_Not-in-family': 0.012342517501958626,
 'relationship_Wife': 0.009056226646045832,
 'workclass_Private': 0.008699632885383737,
 'relationship_Own-child': 0.007705307877062976,
 'workclass_Self-emp-not-inc': 0.0076950679542710865,
 'occupation_Other-service': 0.007473698994164809,
 'marital-status_Divorced': 0.007198148500415021,
 'occupation_Craft-repair': 0.005987662016564044,
 'occupation_Sales': 0.005930831554809137,
 'race_White': 0.005715412009125494,
 'relationship