# I. Problem statement

Dataset: https://archive.ics.uci.edu/ml/datasets/Adult

Prediction task is to determine whether a person makes over 50K a year.

* Class probabilities for adult.all file
* Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
* Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)

Description of fnlwgt (final weight)

The weights on the CPS files are controlled to independent estimates of the
civilian noninstitutional population of the US.  These are prepared monthly
for us by Population Division here at the Census Bureau.  We use 3 sets of
controls.
  These are:
          1.  A single cell estimate of the population 16+ for each state.
          2.  Controls for Hispanic Origin by age and sex.
          3.  Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through
them 6 times so that by the end we come back to all the controls we used.

The term estimate refers to population totals derived from CPS by creating
"weighted tallies" of any specified socio-economic characteristics of the
population.

People with similar demographic characteristics should have
similar weights.  There is one important caveat to remember
about this statement.  That is that since the CPS sample is
actually a collection of 51 state samples, each with its own
probability of selection, the statement only applies within
state.



# II. Variables

* age: continuous.

* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.

* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

* education-num: continuous.

* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

* sex: Female, Male.

* capital-gain: continuous.

* capital-loss: continuous.

* hours-per-week: continuous.

* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

# III. Import Data

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import warnings

warnings.filterwarnings('ignore')

colNames = ['Age', 'WorkClass', 'FnlWgt', 'Education', 'Education-Num', 'Marital-Status', 'Occupation', 
            'Relationship', 'Race', 'Sex', 'Capital-Gain', 'Capital-Loss', 'Hours-Per-Week', 'Native-Country', 'Salary']
dataset_train = pd.read_csv('data/adult-train.csv', header=None, names = colNames)
dataset_train.head(10)

dataset_test = pd.read_csv('data/adult-test.csv', header=None, names = colNames)
dataset_test.head(10)

Unnamed: 0,Age,WorkClass,FnlWgt,Education,Education-Num,Marital-Status,Occupation,Relationship,Race,Sex,Capital-Gain,Capital-Loss,Hours-Per-Week,Native-Country,Salary
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K.
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K.
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K.
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K.
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K.


# IV. Data preprocessing

## a) Data Wrangling

* We tranform the 'Salary' attribute to the categorial values 'Yes' if people salary is greater than 50 K, 'No' if otherwise.

* Remove unknown data from both train and test data

* Normalize data

In [7]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Check code https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish#Some-Final-Encoding to encode the 
# category values to numerical values

def encode_features(df_train, df_test):
    features = ['WorkClass', 'Education', 'Marital-Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Native-Country']
    df_combined = pd.concat([df_train[features], df_test[features]])
    
    for feature in features:
        le = LabelEncoder()  
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

dataset_train, dataset_test = encode_features(dataset_train, dataset_test)

dataset_train['SalaryB'] = (dataset_train['Salary'] == ' >50K')
dropColumns = ['Salary', 'SalaryB']
X_train = dataset_train.drop(dropColumns, axis=1)
Y_train = dataset_train['SalaryB']

dataset_test['SalaryB'] = (dataset_train['Salary'] == ' >50K')
X_test = dataset_test.drop(dropColumns, axis=1)
Y_test = dataset_test['SalaryB']

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# V. Algorithms

## Common method

In [16]:
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
import numpy as np

def bestParamsClassifier(classifier, params):
    # Type of scoring used to compare parameter combinations
    acc_scorer = make_scorer(accuracy_score)
    # Run the grid search
    grid_obj = GridSearchCV(classifier, parameters, scoring=acc_scorer)
    grid_obj.fit(X_train, Y_train)

    # Set the clf to the best combination of parameters
    classifier = grid_obj.best_estimator_
    return classifier

from sklearn.model_selection import learning_curve

def calLearningCurve(estimator, X, y, train_sizes, cv):
    training_sizes, training_scores, test_cores = learning_curve(estimator, X, y, cv=10, scoring='accuracy', train_sizes=np.linspace(0.1, 1.0, 20))
    # Create means and standard deviations of training set scores
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)

    # Create means and standard deviations of test set scores
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)

## a)  KNN


In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import validation_curve

from sklearn.metrics import classification_report

?learning_curve
# validation_curve(KNeighborsClassifier(), X_train, Y_train, cv = 3)
# Reference: https://www.dataquest.io/blog/learning-curves-machine-learning/
#train_sizes, train_scores, validation_scores = learning_curve(KNeighborsClassifier(), X_train, Y_train, , cv = 3)


for i in [3, 5, 10, 20]:
    knn_classifier = KNeighborsClassifier(n_neighbors=i)
    knn_classifier.fit(X_train, Y_train)
    Y_pred = knn_classifier.predict(X_test)
    print("KNN with neighbors ", i)
    print(classification_report(Y_test, Y_pred))
    


KNN with neighbors  3
              precision    recall  f1-score   support

       False       0.76      0.80      0.78     12384
        True       0.23      0.19      0.20      3897

   micro avg       0.65      0.65      0.65     16281
   macro avg       0.49      0.49      0.49     16281
weighted avg       0.63      0.65      0.64     16281

KNN with neighbors  5
              precision    recall  f1-score   support

       False       0.76      0.81      0.78     12384
        True       0.23      0.17      0.20      3897

   micro avg       0.66      0.66      0.66     16281
   macro avg       0.49      0.49      0.49     16281
weighted avg       0.63      0.66      0.64     16281

KNN with neighbors  10
              precision    recall  f1-score   support

       False       0.76      0.85      0.80     12384
        True       0.22      0.13      0.17      3897

   micro avg       0.68      0.68      0.68     16281
   macro avg       0.49      0.49      0.48     16281
weighte

## b) Decision tree

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

for i in [3, 4, 5, 6]:
    dt_classifier = DecisionTreeClassifier(max_depth=i)
    dt_classifier.fit(X_train, Y_train)
    Y_pred = dt_classifier.predict(X_test)
    print("Decision tree with max_dept ", i)
    print(classification_report(Y_test, Y_pred))


Decision tree with max_dept  3
              precision    recall  f1-score   support

       False       0.76      0.78      0.77     12384
        True       0.23      0.21      0.22      3897

   micro avg       0.64      0.64      0.64     16281
   macro avg       0.49      0.49      0.49     16281
weighted avg       0.63      0.64      0.64     16281

Decision tree with max_dept  4
              precision    recall  f1-score   support

       False       0.76      0.83      0.79     12384
        True       0.23      0.16      0.19      3897

   micro avg       0.67      0.67      0.67     16281
   macro avg       0.49      0.49      0.49     16281
weighted avg       0.63      0.67      0.65     16281

Decision tree with max_dept  5
              precision    recall  f1-score   support

       False       0.76      0.82      0.79     12384
        True       0.23      0.17      0.19      3897

   micro avg       0.66      0.66      0.66     16281
   macro avg       0.49      0.49  

## c) Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Choose some parameter combinations to try
parameters = {'n_estimators': [4, 6, 9], 
              'max_features': ['log2', 'sqrt','auto'], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1,5,8]
             }

for i in [30,60,90]:
    rd_classifier = RandomForestClassifier(n_estimators=i)
    rd_classifier.fit(X_train, Y_train)
    Y_pred = rd_classifier.predict(X_test)
    print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

       False       0.76      0.82      0.79     12384
        True       0.24      0.17      0.20      3897

   micro avg       0.67      0.67      0.67     16281
   macro avg       0.50      0.50      0.50     16281
weighted avg       0.64      0.67      0.65     16281

              precision    recall  f1-score   support

       False       0.76      0.82      0.79     12384
        True       0.24      0.17      0.20      3897

   micro avg       0.67      0.67      0.67     16281
   macro avg       0.50      0.50      0.49     16281
weighted avg       0.63      0.67      0.65     16281

              precision    recall  f1-score   support

       False       0.76      0.82      0.79     12384
        True       0.24      0.17      0.20      3897

   micro avg       0.67      0.67      0.67     16281
   macro avg       0.50      0.50      0.49     16281
weighted avg       0.63      0.67      0.65     16281



## d) AdaBoost

In [8]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report

for i in [3,5,7,9]:
    ada_classifier = AdaBoostClassifier(n_estimators=i, learning_rate=0.5)
    ada_classifier.fit(X_train, Y_train)
    Y_pred = ada_classifier.predict(X_test)
    print(metrics.accuracy_score(Y_test, Y_pred))
    print(classification_report(Y_test, Y_pred))

0.7353970886309195
              precision    recall  f1-score   support

       False       0.76      0.95      0.85     12384
        True       0.22      0.04      0.07      3897

   micro avg       0.74      0.74      0.74     16281
   macro avg       0.49      0.50      0.46     16281
weighted avg       0.63      0.74      0.66     16281

0.7355813524967754
              precision    recall  f1-score   support

       False       0.76      0.95      0.85     12384
        True       0.22      0.04      0.07      3897

   micro avg       0.74      0.74      0.74     16281
   macro avg       0.49      0.50      0.46     16281
weighted avg       0.63      0.74      0.66     16281

0.7010625882931024
              precision    recall  f1-score   support

       False       0.76      0.89      0.82     12384
        True       0.22      0.10      0.14      3897

   micro avg       0.70      0.70      0.70     16281
   macro avg       0.49      0.50      0.48     16281
weighted avg     

## e) Neural Network 

In [9]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from sklearn.metrics import classification_report

for i in [100, 200, 300, 400, 500]:
    mlp_classifier = MLPClassifier(hidden_layer_sizes=(13,13,13),max_iter=i)
    mlp_classifier.fit(X_train, Y_train)
    Y_pred = mlp_classifier.predict(X_test)
    print(metrics.accuracy_score(Y_test, Y_pred))
    print(classification_report(Y_test, Y_pred))

0.741539217492783
              precision    recall  f1-score   support

       False       0.76      0.97      0.85     12384
        True       0.20      0.03      0.05      3897

   micro avg       0.74      0.74      0.74     16281
   macro avg       0.48      0.50      0.45     16281
weighted avg       0.63      0.74      0.66     16281

0.7288864320373442
              precision    recall  f1-score   support

       False       0.76      0.94      0.84     12384
        True       0.21      0.05      0.08      3897

   micro avg       0.73      0.73      0.73     16281
   macro avg       0.48      0.50      0.46     16281
weighted avg       0.63      0.73      0.66     16281

0.7574473312450095
              precision    recall  f1-score   support

       False       0.76      0.99      0.86     12384
        True       0.19      0.00      0.01      3897

   micro avg       0.76      0.76      0.76     16281
   macro avg       0.48      0.50      0.43     16281
weighted avg      

# VI. References