# Modelling of adult data #

I am going to test a few models to predict high income based on the features we have (race, sex, education, etc.).

The models I am going to try are:

1. Logistic regression
1. Random Forest
1. Naive Bayes

I will transform the features slightly for the models.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
adult_data = pd.read_csv('data/adult.data', header=None)
adult_test = pd.read_csv('data/adult.test', header=None, skiprows = 1)

adult_data.columns = ['age', 'type_employer', 'fnlwgt', 'education', 
                "education_num","marital", "occupation", "relationship", "race","sex",
                "capital_gain", "capital_loss", "hr_per_week","country","income"]
adult_test.columns = ['age', 'type_employer', 'fnlwgt', 'education', 
                "education_num","marital", "occupation", "relationship", "race","sex",
                "capital_gain", "capital_loss", "hr_per_week","country","income"]
# Fix the slightly different formatting of the labels in the test file (trailing period)
adult_test.replace([' <=50K.',' >50K.'],[' <=50K',' >50K'], inplace=True)

## Processing the dataframe ##



In [3]:
# First divide between target and features:
train_target = pd.get_dummies(adult_data, columns=['income'], drop_first=True)['income_ >50K']
train_features = adult_data.drop(columns = ['income'])
test_target = pd.get_dummies(adult_test, columns=['income'], drop_first=True)['income_ >50K']
test_features = adult_test.drop(columns = ['income'])

In [4]:
from sklearn.exceptions import NotFittedError

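class transform_adult_data():
    """ Put the dataframe in the format we want for our models, loosely following the convention of
        sklearn pipelines, i.e. implementing fit and transform. Note that, unlike sklearn estimators,
        fit returns the transformed training frame rather than self.
    """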
    
    def __init__(self):
        self.columns = None
    
    def fit(self, df):
        df = self.__transform(df)
        self.columns = df.columns
        return df
    
    def transform(self, df):
        if self.columns is None:
            raise NotFittedError('Call fit before using transform.')
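        # Align to the columns seen in fit: dummy columns missing from df are filled with 0 and
        # columns not seen in fit are dropped (DataFrame.append was removed in pandas 2.0).
        return self.__transform(df).reindex(columns=self.columns, fill_value=0)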
    
    def __transform(self, df):
        df = pd.get_dummies(df, columns=['sex'], drop_first=True)
        # Education and race are strongly associated with high income, but both are categorical
        # (the relationship isn't linear), so we one-hot encode them with the other categoricals.
        df = pd.get_dummies(df, columns=['education','race','type_employer','marital','occupation'])
        # In EDA we saw that the trend for age reverses around the age of 50, so we turn age
        # into two features, one for age 50 and over, the other for age under 50.
        df['age >= 50'] = df['age'] * (df['age'] >= 50)
        df['age < 50'] = df['age'] * (df['age'] < 50)
        # Drop the columns that we won't be using as features (or the target):
        df.drop(columns=['fnlwgt','education_num', 'relationship',
                         'capital_gain','capital_loss','hr_per_week','country','age'], inplace=True)
        return df
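
The age split used in the transformer can be illustrated on a toy frame. This is only a sketch with made-up ages, not the real data: each new column carries the age on one side of 50 and zero on the other, so a linear model can fit a different slope on each side.

```python
import pandas as pd

# Made-up ages, chosen to sit on both sides of the threshold.
toy = pd.DataFrame({"age": [25, 49, 50, 70]})

# Multiplying by the boolean mask keeps the age where the condition holds
# and sets it to 0 elsewhere.
toy["age >= 50"] = toy["age"] * (toy["age"] >= 50)
toy["age < 50"] = toy["age"] * (toy["age"] < 50)

print(toy)
```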

In [5]:
processing = transform_adult_data()
train_data = processing.fit(train_features)
train_data.shape

(32561, 55)

In [6]:
test_data = processing.transform(test_features)
test_data.shape

(16281, 55)
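
The train and test frames need identical dummy columns, even when a category present in training never appears in the test set. One way to guarantee this is `DataFrame.reindex`, sketched below on toy data. The category values are hypothetical.

```python
import pandas as pd

# Toy frames: category "C" appears in training but not in test.
train = pd.DataFrame({"race": ["A", "B", "C"]})
test = pd.DataFrame({"race": ["A", "A"]})

train_dummies = pd.get_dummies(train, columns=["race"])
test_dummies = pd.get_dummies(test, columns=["race"])

# Reorder/extend the test columns to match training; missing dummies become 0.
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(aligned.columns))
print(aligned.shape)
```

Columns that exist only in the test frame would be dropped by the same call, which keeps the feature matrix consistent with what the model saw during training.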

## Logistic regression ##

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

In [8]:
scaler = StandardScaler()
scaler.fit(train_data)

model = LogisticRegression()

model.fit(scaler.transform(train_data), train_target)

prediction = model.predict(scaler.transform(test_data))

acc = metrics.accuracy_score(test_target, prediction)
confusion = metrics.confusion_matrix(test_target, prediction)

print('The accuracy is {:.2f}%'.format(acc*100))
print('The confusion matrix is:')
print(confusion)

The accuracy is 83.29%
The confusion matrix is:
[[11490   945]
 [ 1775  2071]]


## Random Forest ##

In [9]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

model.fit(train_data, train_target)

prediction = model.predict(test_data)

acc = metrics.accuracy_score(test_target, prediction)
confusion = metrics.confusion_matrix(test_target, prediction)
print('The accuracy is {:.2f}%'.format(acc*100))
print('The confusion matrix is:')
print(confusion)

The accuracy is 81.20%
The confusion matrix is:
[[11078  1357]
 [ 1704  2142]]


## Naive Bayes ##

### Preprocessing ###

I have decided to use Categorical Naive Bayes, so the pre-processing is slightly different.

The main differences are that I do not use dummy variables and that I bin age into 20 equal-width groups spanning the age range of the training data. I will also keep some of the features I discarded for the regression (such as `relationship`).
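
As an illustration of equal-width binning, here is a sketch on made-up ages. It assumes `pd.cut` with edges from `np.linspace`, which is an alternative to a hand-written binning function rather than the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages; the minimum and maximum define the binning range.
ages = pd.Series([17, 20, 35, 50, 90])

# 20 equal-width bins need 21 edges.
edges = np.linspace(ages.min(), ages.max(), 20 + 1)

# labels=False returns the integer bin index; include_lowest=True puts the
# minimum age in the first bin instead of leaving it unassigned.
codes = pd.cut(ages, bins=edges, labels=False, include_lowest=True)

print(codes.tolist())
```

Note that `pd.cut` returns NaN for values outside the edges, so ages in new data that fall outside the training range would need clipping first.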

In [10]:
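def bin_value(value, bins):
    # bins holds the bin edges, so there are len(bins)-1 bins, indexed 0 to len(bins)-2.
    # Any value below the first edge goes in the first bin and any value above the last edge
    # goes in the last bin.
    for i in range(len(bins)-1):
        if value <= bins[i+1]:
            return i
    return len(bins)-2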

class Bayes_transform():
    def __init__(self, num_bins=20):
        self.num_bins = num_bins
        self.age_bins = None
        
    def fit(self, df, y=None):
        self.age_bins = np.linspace(min(df['age']), max(df['age']), self.num_bins+1)
        return self
    
    def fit_transform(self, df, y=None):
        self.fit(df)
        return self.transform(df)
    
    def transform(self, df):
        if self.age_bins is None:
            raise NotFittedError('Call fit before using transform.')
        df = df.copy()
        # Bin age into groups
        df['age'] = df['age'].map(lambda x: bin_value(x, self.age_bins))
        # Drop the columns that we won't be using as features (either continuous or not interesting):
        df.drop(columns=['fnlwgt','education_num', 'capital_gain','capital_loss',
                         'hr_per_week','country'], inplace=True)
        return df

### Model/Pipeline ###

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

In [12]:
naive_bayes_pipeline = Pipeline([('feature_preparation', Bayes_transform(20)),
                                 ('encoder', OrdinalEncoder()), 
                                 ('classifier', CategoricalNB())])

naive_bayes_pipeline.fit(train_features, train_target)
prediction = naive_bayes_pipeline.predict(test_features)

acc = metrics.accuracy_score(test_target, prediction)
confusion = metrics.confusion_matrix(test_target, prediction)
print('The accuracy is {:.2f}%'.format(acc*100))
print('The confusion matrix is:')
print(confusion)

The accuracy is 79.86%
The confusion matrix is:
[[10065  2370]
 [  909  2937]]


## Comparison ##

Between logistic regression and random forest, the short answer is that logistic regression does a better job: its accuracy is higher (83.29% vs 81.20%). Comparing logistic regression and naive Bayes is not so straightforward, because although the accuracy of the former is better (83.29% vs 79.86%), the number of true positives in the latter is much larger (2937 vs 2071). I plan on having a look at a couple of other metrics, but ultimately which model is better depends on the application and on which kind of error is less desirable.
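
As a starting point for other metrics, precision and recall for the high-income class can be computed directly from the confusion matrices printed above. This is a minimal sketch in plain Python. The helper name `precision_recall` is hypothetical, and it assumes sklearn's layout of rows as true classes and columns as predicted classes.

```python
# Confusion matrices copied from the cell outputs above.
# sklearn layout: [[TN, FP], [FN, TP]] with ">50K" as the positive class.
confusions = {
    "logistic regression": [[11490, 945], [1775, 2071]],
    "random forest": [[11078, 1357], [1704, 2142]],
    "naive bayes": [[10065, 2370], [909, 2937]],
}

def precision_recall(matrix):
    """Return (precision, recall, f1) for the positive class (hypothetical helper)."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

results = {name: precision_recall(m) for name, m in confusions.items()}
for name, (p, r, f1) in results.items():
    print("{:<20} precision {:.3f}  recall {:.3f}  F1 {:.3f}".format(name, p, r, f1))
```

On these numbers, naive Bayes trades precision for recall: its recall is about 0.76 against about 0.54 for logistic regression, while its precision is about 0.55 against about 0.69.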