# Classification for Adult Dataset

`Name`: Henrique Abe Fukushima

`NUSP`: 13682465

In this notebook, we explore the Adult dataset with four classifiers studied in class. In EP1, we performed an Exploratory Data Analysis (EDA) that resulted in both data cleaning and feature selection, so here we skip the EDA and go straight to data preparation. Each classifier then gets its own section for training and tuning. The last section is a final analysis that compares the models to find out which performs best on this task.

This notebook is structured as follows:
1. Dataprep
2. Logistic Regression
3. Random Forest
4. Support Vector Machine (SVM)
5. XGBoost
6. Comparison and Conclusion

We begin by importing libraries and data.

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

# Classifiers' Libraries
from sklearn.linear_model import LogisticRegression # Logistic Regression
from sklearn.ensemble import RandomForestClassifier # Random Forest
from sklearn.svm import SVC                         # Support Vector Classifier
from xgboost import XGBClassifier                   # XGBoost

# Training and Validation Libraries
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [13]:
df = pd.read_csv('data/train_data.csv')
df.set_index('Id', inplace=True)
df.rename(columns={'age': 'Age',
                   'workclass': 'Workclass',
                   'fnlwgt': 'Final Weight',
                   'education': 'Education',
                   'education.num': 'Education Number',
                   'marital.status': 'Marital Status',
                   'occupation': 'Occupation',
                   'relationship': 'Relationship',
                   'race': 'Race',
                   'sex': 'Sex',
                   'capital.gain': 'Capital Gain',
                   'capital.loss': 'Capital Loss',
                   'hours.per.week': 'Hours per Week',
                   'native.country': 'Native Country',
                   'income': 'Target'
                   }, inplace=True)
df.head()

Id,Age,Workclass,Final Weight,Education,Education Number,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per Week,Native Country,Target
16280,34,Private,204991,Some-college,10,Divorced,Exec-managerial,Own-child,White,Male,0,0,44,United-States,<=50K
16281,58,Local-gov,310085,10th,6,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States,<=50K
16282,25,Private,146117,Some-college,10,Never-married,Machine-op-inspct,Not-in-family,White,Male,0,0,42,United-States,<=50K
16283,24,Private,138938,Some-college,10,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K
16284,57,Self-emp-inc,258883,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,5178,0,60,Hungary,>50K


# 1. Dataprep

In [14]:
# Null Values
df.replace("?", np.nan, inplace=True)
df.isna().sum()

Age                    0
Workclass           1836
Final Weight           0
Education              0
Education Number       0
Marital Status         0
Occupation          1843
Relationship           0
Race                   0
Sex                    0
Capital Gain           0
Capital Loss           0
Hours per Week         0
Native Country       583
Target                 0
dtype: int64
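
Missing values in different columns can fall on the same rows, so the per-column counts above do not add up to the number of rows that `dropna()` removes. A minimal sketch of how that number could be checked first, using a hypothetical toy frame (`toy` and its values are made up for illustration and are not the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; "?" marks a missing value,
# as in the raw Adult data.
toy = pd.DataFrame({
    'Workclass': ['Private', '?', 'Local-gov', '?'],
    'Occupation': ['Sales', '?', 'Tech-support', 'Sales'],
    'Native Country': ['United-States', 'Cuba', '?', 'United-States'],
})
toy = toy.replace('?', np.nan)

# Rows with at least one missing value: exactly the rows dropna() would remove.
rows_with_na = int(toy.isna().any(axis=1).sum())
share_dropped = rows_with_na / len(toy)
print(f'Rows with missing values: {rows_with_na} ({share_dropped:.1%})')
```

Running the same two lines on `df` before the drop would show how much of the training data is lost by removing incomplete rows.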

In [None]:
# Drop Null Values
df = df.dropna()

In [19]:
# Duplicated Rows
print('Duplicated rows:', df.duplicated().sum())

Duplicated rows: 23


In [21]:
# Drop Duplicated Rows
df.drop_duplicates(inplace=True)

In [None]:
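# Binary encoders: each returns 1 for the reference category and 0 otherwise

def race_feat(race):
    # 1 for 'White', 0 for every other race
    return int(race == 'White')

def country_feat(country):
    # 1 for 'United-States', 0 for every other country
    return int(country == 'United-States')

def sex_feat(sex):
    # 1 for 'Male', 0 for 'Female'
    return int(sex == 'Male')

def income_feat(income):
    # 1 for '>50K' (positive class), 0 for '<=50K'
    return int(income == '>50K')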

In [None]:
# Apply binary categorization
df['Race'] = df['Race'].apply(race_feat)
df['Native Country'] = df['Native Country'].apply(country_feat)
df['Sex'] = df['Sex'].apply(sex_feat)
df['Target'] = df['Target'].apply(income_feat)
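
The same binary encoding can also be written without helper functions by comparing each column against its reference category. The sketch below is an alternative formulation, not the approach used in this notebook, and runs on a hypothetical toy frame (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame; column names follow the notebook's renamed columns.
toy = pd.DataFrame({
    'Race': ['White', 'Black', 'Asian-Pac-Islander'],
    'Sex': ['Male', 'Female', 'Male'],
    'Target': ['>50K', '<=50K', '<=50K'],
})

# Each comparison yields a boolean Series; astype(int) maps True/False to 1/0.
toy['Race'] = toy['Race'].eq('White').astype(int)
toy['Sex'] = toy['Sex'].eq('Male').astype(int)
toy['Target'] = toy['Target'].eq('>50K').astype(int)
print(toy)
```

The vectorized comparison avoids a Python-level function call per row, which mostly matters for larger frames; the result is the same 0/1 columns.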

In [None]:
# Ordinal encoding: education levels ranked from lowest (1) to highest (16)
EDUCATION_ORDER = {
    'Preschool': 1,
    '1st-4th': 2,
    '5th-6th': 3,
    '7th-8th': 4,
    '9th': 5,
    '10th': 6,
    '11th': 7,
    '12th': 8,
    'HS-grad': 9,
    'Some-college': 10,
    'Assoc-voc': 11,
    'Assoc-acdm': 12,
    'Bachelors': 13,
    'Masters': 14,
    'Prof-school': 15,
    'Doctorate': 16,
}

def education_feat(education):
    # Returns None for any label that is not in the mapping
    return EDUCATION_ORDER.get(education)

In [None]:
df['Education'] = df['Education'].apply(education_feat)
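
In the rows shown by `df.head()` earlier, the `Education Number` column already matches this ordinal code (`Some-college` is 10, `10th` is 6, `HS-grad` is 9). If that holds for the whole dataset, the encoded `Education` column duplicates `Education Number`, and one of the two could be dropped. A minimal sketch of such a check, on a hypothetical toy frame built from those three label/number pairs (`toy` and `education_order` are illustrative names, and the mapping is only a subset):

```python
import pandas as pd

# Toy frame with the education labels and numbers seen in df.head().
toy = pd.DataFrame({
    'Education': ['Some-college', '10th', 'HS-grad'],
    'Education Number': [10, 6, 9],
})

# Subset of the ordinal mapping, enough to cover the toy rows.
education_order = {'10th': 6, 'HS-grad': 9, 'Some-college': 10}
encoded = toy['Education'].map(education_order)

# True only if every encoded value equals the existing numeric column.
is_redundant = bool(encoded.eq(toy['Education Number']).all())
print('Education duplicates Education Number:', is_redundant)
```

On the real data, the same comparison between `df['Education']` and `df['Education Number']` would confirm or rule out the redundancy before training.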