In [104]:
# Read the raw rows; the context manager closes the file when done.
with open('adult.data') as file:
    lines = file.readlines()
degree = {
    'Doctorate': 15.,
    'Masters': 14. ,
    'Bachelors': 13.,
    'Some-college': 12.,
    'Prof-school': 11.,
    'Assoc-acdm': 10.,
    'Assoc-voc': 9.,
    'HS-grad': 8.,
    '12th': 7.,
    '11th': 6.,
    '10th': 5.,
    '9th': 4.,
    '7th-8th': 3.,
    '5th-6th': 2.,
    '1st-4th': 1.,
    'Preschool': 0.,
}
occupation = {
    'Tech-support':0., 
    'Craft-repair':1., 
    'Other-service':2., 
    'Sales':3., 
    'Exec-managerial':4., 
    'Prof-specialty':5., 
    'Handlers-cleaners':6., 
    'Machine-op-inspct':7., 
    'Adm-clerical':8., 
    'Farming-fishing':9., 
    'Transport-moving':10., 
    'Priv-house-serv':11., 
    'Protective-serv':12., 
    'Armed-Forces':13.,
}
race = {
    'White': 0.0,
    'Asian-Pac-Islander': 1.0,
    'Amer-Indian-Eskimo': 2.0,
    'Other':3.0,
    'Black':4.0
}
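X, Y = [], []
# The loop stops one short of len(lines), so the final entry of `lines` is never read.
for i in range(len(lines)-1):
    # Skip rows that contain a missing value, marked with '?'.
    if '?' in lines[i]:
        continue
    l = lines[i].strip().split(',')
    # Features: columns 0, 12 and 4 as numbers, plus encoded education, occupation and race.
    X.append([float(l[0]), float(l[12]), float(l[4]), degree[l[3].strip()], occupation[l[6].strip()], race[l[8].strip()]])
    # Label: -1.0 for ' <=50K', 1.0 for anything else.
    if l[-1] == ' <=50K':
        Y.append(-1.0)
    else:
        Y.append(1.0)
# The first 20% of the rows are the test set and the remainder is the training set.
# The +1 means the row at index int(0.2*len(X)) lands in neither set.
x_test, x_train = X[0:int(0.2*len(X))], X[int(0.2*len(X))+1:]
y_test, y_train = Y[0:int(0.2*len(Y))], Y[int(0.2*len(Y))+1:]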

In [105]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Default model with set random state
clf = DecisionTreeClassifier(random_state = 44)
clf.fit(x_train, y_train)
print("Model Accuracy:", clf.score(x_test, y_test))

Model Accuracy: 0.7491710875331565


In [106]:
clf = DecisionTreeClassifier(random_state = 44, class_weight='balanced')
clf.fit(x_train, y_train)
print("Model Accuracy:", clf.score(x_test, y_test))

Model Accuracy: 0.7141909814323607


In [107]:
clf = DecisionTreeClassifier(random_state = 44, max_leaf_nodes=100)
clf.fit(x_train, y_train)
print("Model Accuracy:", clf.score(x_test, y_test))

Model Accuracy: 0.7907824933687002


In [108]:
clf = DecisionTreeClassifier(random_state = 44, max_depth=10)
clf.fit(x_train, y_train)
print("Model Accuracy:", clf.score(x_test, y_test))

Model Accuracy: 0.7828249336870027


In [109]:
clf = DecisionTreeClassifier(random_state = 44, min_impurity_decrease=0.001)
clf.fit(x_train, y_train)
print("Model Accuracy:", clf.score(x_test, y_test))

Model Accuracy: 0.7834880636604774


By changing several parameters from their defaults, I was able to increase the accuracy of the model. The class_weight parameter sets how much weight each class receives during training; by default every class has the same weight. Setting class_weight='balanced' weights the classes inversely to their frequencies, which lowered accuracy here from 0.749 to 0.714, so the default equal weighting is the better choice when accuracy is the goal. With each of the other parameters, accuracy rose over the default model by about 3.4 to 4.2 percentage points (roughly 4.5% to 5.6% in relative terms).
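
The size of each improvement can be checked directly from the accuracies printed above. The snippet below is a small sketch of that arithmetic; the helper name `gains` is made up for illustration, and the scores are the cell outputs rounded to four decimals.

```python
# Test accuracies copied from the cell outputs above (rounded to 4 decimals).
baseline = 0.7492
tuned = {
    "max_leaf_nodes=100": 0.7908,
    "max_depth=10": 0.7828,
    "min_impurity_decrease=0.001": 0.7835,
}


def gains(baseline, score):
    """Return (gain in percentage points, gain relative to the baseline in percent)."""
    absolute = (score - baseline) * 100
    relative = (score - baseline) / baseline * 100
    return absolute, relative


for name, score in tuned.items():
    absolute, relative = gains(baseline, score)
    print(f"{name}: +{absolute:.1f} points, +{relative:.1f}% relative")
```

Reporting both numbers avoids ambiguity, since a "5% increase" can mean either five percentage points or a 5% relative change.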

In [110]:
# Bagging Decision Trees implementation

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Note: scikit-learn 1.2 and later name this parameter `estimator`
# (`base_estimator` was deprecated and later removed).
clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_leaf_nodes=100), n_estimators=10, random_state=44)
clf.fit(x_train, y_train)
print("Model Accuracy:", clf.score(x_test, y_test))
# Note: this cross-validates on the held-out 20% only, refitting the model on each fold of it.
val = cross_val_score(clf, x_test, y_test, cv=10)
print("10-Cross Validation Score:", np.average(val))

Model Accuracy: 0.7934350132625995
10-Cross Validation Score: 0.7824937124531866


In [111]:
# AdaBoost implementation

from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=10, random_state=44, algorithm='SAMME')
clf.fit(x_train, y_train)
print("Model Accuracy:", clf.score(x_test, y_test))
val = cross_val_score(clf, x_test, y_test, cv=10)
print("10-Cross Validation Score:", np.average(val))

Model Accuracy: 0.7854774535809018
10-Cross Validation Score: 0.782660648193909


### Model Effectiveness

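In terms of test accuracy, the bagging model built on DecisionTreeClassifier with max_leaf_nodes=100 performed best at 0.793, ahead of the single tree with max_leaf_nodes=100 (0.791) and AdaBoost (0.785). This is the expected direction: bagging trains each tree on a bootstrap sample of the training data and combines their predictions, which tends to reduce the variance of a single tree. The margin is small, though, and the 10-fold cross-validation scores for bagging (0.7825) and AdaBoost (0.7827) are nearly identical.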
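
To make the bagging idea concrete, here is a minimal pure-Python sketch of bootstrap sampling followed by a majority vote. It is an illustration under simplifying assumptions, not what BaggingClassifier does internally: the weak learner `fit_stump`, the toy data, and all helper names are hypothetical, and scikit-learn averages predicted class probabilities when the base estimator provides them rather than counting hard votes.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_stump(X, y):
    """Hypothetical weak learner: the best single threshold on feature 0."""
    best = None
    for threshold in sorted({row[0] for row in X}):
        for sign in (1.0, -1.0):
            preds = [sign if row[0] > threshold else -sign for row in X]
            correct = sum(p == t for p, t in zip(preds, y))
            if best is None or correct > best[0]:
                best = (correct, threshold, sign)
    _, threshold, sign = best
    return lambda row: sign if row[0] > threshold else -sign


def bagged_predict(models, row):
    """Majority vote over the individual models' predictions."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


# Toy data (an assumption for illustration): label is 1.0 when the feature is >= 10.
rng = random.Random(44)
X_toy = [[float(v)] for v in range(20)]
y_toy = [1.0 if v >= 10 else -1.0 for v in range(20)]

# Each model sees a different bootstrap sample of the same data.
models = [fit_stump(*bootstrap_sample(X_toy, y_toy, rng)) for _ in range(11)]
print(bagged_predict(models, [19.0]), bagged_predict(models, [0.0]))
```

An odd number of models is used so that a two-class vote cannot tie.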

### Metrics

The metric I am using to determine the model effectiveness is the sklearn method score(), which returns the mean accuracy for a given model on the input data. The formula for accuracy is $A = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$.

I'm using accuracy because it is a widely accepted measure of correctness for classifiers. Related metrics include precision, the fraction of predictions for a given class that are correct, and recall, the fraction of the actual members of that class that the model finds. These metrics can be more valuable than accuracy when the costs of false positives or false negatives need to be considered. Since this problem doesn't necessarily have an immediate use or a risk attached to failure, accuracy should be a good metric.
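
The three metrics can be compared on a small hand-made example. The sketch below uses the same -1.0/1.0 label encoding as the notebook; the helper names and the toy label lists are assumptions made for illustration and are not taken from the dataset.

```python
def confusion_counts(y_true, y_pred, positive=1.0):
    """Count true/false positives and negatives for the chosen positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Correct predictions divided by total predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    """Of the rows predicted positive, the fraction that really are positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Of the rows that really are positive, the fraction predicted positive."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels: three positives, five negatives.
y_true = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0]
y_pred = [1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 1.0]

print("accuracy: ", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("recall:   ", recall(y_true, y_pred))
```

In this toy case accuracy looks moderate while recall is low, which is the kind of gap that accuracy alone would hide.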