Today we're going to use a simple but rich data set from the UCI Machine Learning Repository. The Adult Income data set is drawn from US Census data and is well suited to studying the features (also called regressors or predictors) that help determine whether an adult US resident is likely to have a household income greater than $50,000.

The data includes age, workclass, a weight variable (to account for the unbalanced sampling), education level, time spent in education (in years), marital status, occupation, relationship, race, sex, the individual's native country, and a target column that indicates whether the person attained a household income greater than $50,000. All in all, an interesting data set for socio-economic research. So let's get our hands dirty and load up some data!

In [1]:
from sklearn import naive_bayes
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data 

Load the adult data set, which is just a .txt file. There are no column labels. Read the docs for the data set here: https://archive.ics.uci.edu/ml/datasets/Adult, and use the built-in Pandas data frame options to attach the column labels to the data frame.
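A minimal sketch of attaching the column labels at read time, assuming the raw file is comma-separated, has no header row, and marks missing values with `?`. The column names follow the output shown in this notebook; `raw_text` is a hypothetical stand-in holding two illustrative rows in place of the real file.

```python
import io
import pandas as pd

# Column names for the Adult data set (the raw file has no header row).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

# Two illustrative rows standing in for the real file (hypothetical stand-in).
raw_text = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, ?, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

sample = pd.read_csv(
    io.StringIO(raw_text),
    header=None,            # the file has no header row
    names=COLUMNS,          # attach the labels ourselves
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat '?' as a missing value
)
print(sample.dtypes)
```

To read the real file, the `io.StringIO(raw_text)` argument would be replaced by the path to the downloaded data.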

In [2]:
income = pd.read_csv('/Users/HudsonCavanagh/Documents/adult.csv')
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,small
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,small
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,small
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,small
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,small


In [3]:
income.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.134597,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.025423,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


In [4]:
income.isnull().sum().sum()

22746

In [5]:
Sex = pd.get_dummies(income['sex'])
Workclass = pd.get_dummies(income['workclass']) 
Marital = pd.get_dummies(income['marital-status'])
Occupation = pd.get_dummies(income['occupation'])
Relationship = pd.get_dummies(income['relationship'])
Race = pd.get_dummies(income['race'])
Country = pd.get_dummies(income['native-country'])
Target = pd.get_dummies(income['income'])

# Clean up the data set by deleting unused columns

one_hot_dat = pd.concat([income, Sex, Workclass, Marital, Occupation, Relationship, Race, Country, Target], axis = 1)
# Drop the original categorical columns (now one-hot encoded) and the numeric columns we are not using
one_hot_dat = one_hot_dat.drop(columns=[
    'sex', 'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'capital-gain',
    'capital-loss', 'hours-per-week', 'native-country', 'income',
])

one_hot_dat.head()

Unnamed: 0,Female,Male,Federal-gov,Local-gov,Never-worked,Private,Self-emp-inc,Self-emp-not-inc,State-gov,Without-pay,...,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia,large,small
0,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,1
1,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,1
2,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
3,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
4,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


# Convert the categorical variables into unordered integral values

scikit-learn's Naive Bayes classifiers require numerical input (this is not true of every Naive Bayes implementation). Since we have decided to analyze only the unordered categorical variables, we can use a one-hot encoding to convert our categorical data into a numerical data frame.

**Note**: Do not use scikit-learn's implementation of one-hot encoding. We want to get you familiar with a range of methods, and as you should know by now, there are many ways to do the same thing. If you want a challenge, you can write the procedure with both the scikit-learn and the Pandas methods.
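As a small illustration of the Pandas route, here is a sketch on a toy frame (the three rows are made up for the example). Passing `columns=` encodes only the listed columns and prefixes each new column with its source name, which keeps the resulting labels unambiguous.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
    "age": [39, 28, 50],
})

# Encode only the categorical columns; dtype=int gives 0/1 rather than booleans.
encoded = pd.get_dummies(toy, columns=["sex", "race"], dtype=int)
print(encoded.columns.tolist())
```

The untouched `age` column is kept as-is, and each category becomes its own 0/1 indicator column.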

# Challenge Problem: Alternative Encoding Scheme to One-Hot Encoding

Besides one-hot encoding, we could also map each string label in our categorical features to an integer value. Having used a Pandas data frame method for the one-hot encoding, we will now try a scikit-learn method for the integer encoding. Please check the docs and read up on: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html. Proceed with the encoding and build a Naive Bayes and a Logistic classifier for both encodings. Do we get similar results? What should we expect? And why?
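A minimal sketch of the integer encoding on a toy column (the four values are illustrative). `LabelEncoder` sorts the distinct labels and uses each label's position in that sorted list as its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column.
workclass = pd.Series(["Private", "State-gov", "Private", "Local-gov"])

encoder = LabelEncoder()
codes = encoder.fit_transform(workclass)

print(encoder.classes_)  # distinct labels in sorted order; a label's index is its code
print(codes)             # integer code for each row
```

Keep in mind that the integer codes carry an ordering (0 < 1 < 2) that the original labels do not have, which is worth considering when you compare the two encodings.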

In [15]:
one_hot_dat['large'].value_counts()

0    41001
1     7841
Name: large, dtype: int64

In [16]:

# 'large' is already a 0/1 indicator column, so comparing it to the string 'large' always gave 0
one_hot_dat['target'] = one_hot_dat['large'].astype(int)
one_hot_dat['target'].mean()

0.16053806...

# Summarize the data and engage in elementary data exploration

For some data exploration, use Pandas histogram methods to display the features. 
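A short sketch of the Pandas histogram method, using the age and hours-per-week values from the first five rows shown earlier. The `Agg` backend line is only there so the example runs as a plain script without a display; in a notebook it can be omitted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Values copied from the first five rows of the data set shown above.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "hours-per-week": [40, 13, 40, 40, 40],
})

# DataFrame.hist draws one histogram per numeric column.
axes = toy.hist(bins=5, figsize=(8, 3))

# For a categorical column, the count per category is the analogous summary.
counts = pd.Series(["Private", "State-gov", "Private"]).value_counts()
print(counts)
```

For the one-hot columns, a bar chart of `value_counts()` is usually more readable than a histogram, since each column only takes the values 0 and 1.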

In [6]:
partition_val = np.random.rand(len(one_hot_dat)) < 0.70
train = one_hot_dat[partition_val]
test = one_hot_dat[~partition_val]

In [7]:
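# The one-hot target columns are named 'large' and 'small', not '<=50K'
target_train = train['large']
# Drop both target indicators (and the saved row index) so the label does not leak into the features
feature_train = train.drop(columns=['large', 'small', 'target', 'Unnamed: 0'], errors='ignore')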

# Partition the data

Without using any method or library that does this automatically, please partition the data set 70/30. You can use anything from the math, pandas, or numpy libraries, but do not use any others.
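One possible way to do this with numpy alone is sketched below. The helper name `split_70_30` and the fixed seed are assumptions made for the example. Unlike a random boolean mask, which yields roughly 70% of the rows, shuffling the row positions and cutting the shuffled order gives an exact split.

```python
import numpy as np
import pandas as pd

def split_70_30(frame, train_fraction=0.70, seed=0):
    """Shuffle the row positions and cut them at train_fraction (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))             # shuffled row positions
    cut = int(round(train_fraction * len(frame)))   # number of training rows
    return frame.iloc[order[:cut]], frame.iloc[order[cut:]]

# Toy frame standing in for the one-hot data.
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
train_part, test_part = split_70_30(toy)
print(len(train_part), len(test_part))
```

Fixing the seed makes the partition reproducible, which helps when comparing classifiers on the same split.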

# Define your feature set and define your target 

# Run Naive Bayes Classifier

Instantiate the Naive Bayes predictor from scikit-learn with the training data. 

In [None]:
Cat_Naive_Bayes = naive_bayes.MultinomialNB()
Cat_Naive_Bayes.fit(feature_train, target_train)


# Check Accuracy / Score for Naive Bayes

Define the target and feature set for the test data.

In [None]:
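# Use the same target and feature definitions as for the training data
target_test = test['large']
feature_test = test.drop(columns=['large', 'small', 'target', 'Unnamed: 0'], errors='ignore')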

Score the Naive Bayes classifier on the test data.

In [None]:
Cat_Naive_Bayes.score(feature_test, target_test)

# Check Accuracy / Score for a Logistic Classifier 

Define a logistic regression and train it with the feature and target set.

Produce the accuracy score of the logistic regression from the test set.
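A minimal sketch of these two steps, assuming scikit-learn's `LogisticRegression` with its default settings apart from a higher `max_iter`. The data here is synthetic (random 0/1 features with a label that mostly follows the first column), standing in for the one-hot frame.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic 0/1 features standing in for the one-hot frame.
X = rng.integers(0, 2, size=(200, 4))
# The label copies column 0, flipped for about 10% of the rows.
flip = rng.random(200) < 0.1
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# Simple 70/30 split of the synthetic rows.
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
accuracy = logit.score(X_test, y_test)
print(accuracy)
```

In the notebook itself, the same `fit` and `score` calls would take the training and test features and targets defined earlier.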

Was that what you expected? All we did was remove the non-categorical variables and impose a one-hot encoding. Should we have expected Naive Bayes to underperform the Logistic classifier? Here are some other things you can think about:

1. What other metrics outside of simple accuracy can we utilize to measure performance?
2. Could pair-wise correlation between features in our feature set have caused an issue with Naive Bayes? Which assumptions of Naive Bayes might cause this to happen?
3. How could we improve the performance of Naive Bayes? 
4. What about the numerical features we left out? Should we bring them back in? How?
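On the first question, a small sketch of metrics beyond accuracy, using scikit-learn's `confusion_matrix`, `precision_score`, `recall_score`, and `f1_score`. The labels below are made up and deliberately imbalanced. Imbalance matters here: with 41001 of the 48842 rows in the majority class, always predicting the majority class would already score roughly 84% accuracy.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

# Made-up labels: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of the predicted positives, how many are right
recall = recall_score(y_true, y_pred)        # of the true positives, how many were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(matrix)
print(accuracy, precision, recall, f1)
```

In this toy case the accuracy is 0.8 while precision and recall are both only 0.5, which shows how accuracy alone can flatter a classifier on imbalanced data.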

If you want to expand on your analysis, why not build a correlation matrix, or print a summary of the logistic regression? Would an ANOVA table help in our assessment in this case?
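A correlation matrix can be sketched with `DataFrame.corr()`. The toy rows below are made up for the example. One thing it makes visible is that indicator columns derived from the same variable are strongly (here perfectly) negatively correlated, and that indicators from related variables such as sex and relationship move together.

```python
import pandas as pd

# Made-up rows with two related categorical variables.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "relationship": ["Husband", "Not-in-family", "Wife", "Not-in-family", "Husband"],
})

# One-hot encode, then compute pair-wise Pearson correlations between the indicator columns.
encoded = pd.get_dummies(toy, dtype=int)
corr = encoded.corr()
print(corr.round(2))
```

Strong off-diagonal entries like these are a hint that the features are far from conditionally independent.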