In [2]:
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import pandas as pd

## Read in the data 


In [27]:
def strip(text):
    try:
        return text.strip()
    except AttributeError:
        return text

In [35]:
# Strip stray whitespace from these columns while parsing.
stripped_columns = ['education', 'marital-status', 'occupation', 'race',
                    'sex', 'hours-per-week', 'prediction']
converters = {column: strip for column in stripped_columns}

train_data = pd.read_csv('census.train', converters=converters)
test_data = pd.read_csv('census.test', converters=converters)

# Drop the unnamed first column that was read in from the CSV.
train_data = train_data.drop(['Unnamed: 0'], axis=1)
test_data = test_data.drop(['Unnamed: 0'], axis=1)
train_data

Unnamed: 0,age,people-represented,education,marital-status,occupation,race,sex,hours-per-week,prediction
0,39,77516,Bachelors,Never-married,Adm-clerical,White,Male,40,<=50K
1,50,83311,Bachelors,Married,Exec-managerial,White,Male,13,<=50K
2,38,215646,HS-grad,Divorced,Handlers-cleaners,White,Male,40,<=50K
3,53,234721,Some HS,Married,Handlers-cleaners,Black,Male,40,<=50K
4,28,338409,Bachelors,Married,Prof-specialty,Black,Female,40,<=50K
5,37,284582,Graduate,Married,Exec-managerial,White,Female,40,<=50K
6,49,160187,Some HS,Married,Other-service,Black,Female,16,<=50K
7,52,209642,HS-grad,Married,Exec-managerial,White,Male,45,>50K
8,31,45781,Graduate,Never-married,Prof-specialty,White,Female,50,>50K
9,42,159449,Bachelors,Married,Exec-managerial,White,Male,40,>50K


In [37]:
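income_boundary = train_data['prediction']
# Binary target: 1 where income is <=50K, 0 otherwise
y = np.where(income_boundary == '<=50K', 1, 0)
# Remove the target column so only features remain
train_data = train_data.drop(['prediction'], axis=1)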


## Understanding the data
Before working with a dataset, I spend a little time getting to know it. I skim the table to get an idea of what types of variables are represented, some of their possible values, and so on. One of the first things I check is the number of unique values in each column.

In [38]:
# iterate through the columns and count the unique values in each column
for column in train_data:
    print(column, len(np.unique(train_data[column])))

age 73
people-represented 21648
education 7
marital-status 4
occupation 15
race 5
sex 2
hours-per-week 94


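These counts suggest which techniques will suit each column: education, marital-status, occupation, race, and sex take only a handful of values and are categorical, while age, people-represented, and hours-per-week take many values and can be treated as continuous.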
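
The same counts can be used to make a rough first split between categorical and continuous columns. The sketch below is only an illustration: the small frame is made up, and the cutoff `MAX_CATEGORIES` is an arbitrary, hypothetical threshold rather than anything derived from the census data.

```python
import pandas as pd

# Toy frame standing in for train_data (values are made up).
toy = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37],
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'race': ['White', 'White', 'White', 'Black', 'Black', 'White'],
})

# Hypothetical cutoff: columns with fewer unique values count as categorical.
MAX_CATEGORIES = 4

# nunique() gives the same per-column counts as the loop above, as a Series.
unique_counts = toy.nunique()
categorical = unique_counts[unique_counts < MAX_CATEGORIES].index.tolist()
continuous = unique_counts[unique_counts >= MAX_CATEGORIES].index.tolist()

print('categorical:', categorical)
print('continuous:', continuous)
```

A cutoff like this is only a starting point; a numeric column with few distinct values would still need a judgment call.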

## Dealing with categorical data
Before we can frame this problem properly, we need to deal with the categorical data. There are a number of techniques for this, and the right one depends on the type of data.

The simplest technique is to enumerate all possible categories and replace each category with its numerical position. Fortunately, sklearn has a class just for this: [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

Let's start with the simplest case: sex.

In [48]:
from sklearn.preprocessing import LabelEncoder as LE

num_encoder = LE()

#to convert into numbers
print('before encoding', np.unique(train_data['sex']))
train_data['sex'] = num_encoder.fit_transform(train_data['sex'])
print('after encoding', np.unique(train_data['sex']))

before encoding ['Female' 'Male']
after encoding [0 1]


Of course we will want to do the same for the other categorical variables. **Write a function that does this for a given list of column names**

In [46]:
def le_categories(df, labels):
    assert all(label in df.columns for label in labels), "Labels not in column names"
    ### STUDENT SOL START ###
    
    ### STUDENT SOL END ###
le_categories(train_data, ['sex', 'education', 'marital-status', 'occupation', 'race'])

SyntaxError: unexpected EOF while parsing (<ipython-input-46-52f13c7b7c46>, line 3)
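
One possible way to fill in the exercise is sketched below. It is not the notebook's official solution: it uses pandas categorical codes as a stand-in for `LabelEncoder` (both number the categories in sorted order), it returns an encoded copy instead of modifying the frame in place, and it runs on a small made-up frame.

```python
import pandas as pd

def le_categories(df, labels):
    # Check each requested column individually.
    missing = [label for label in labels if label not in df.columns]
    assert not missing, "Labels not in column names: {}".format(missing)

    encoded = df.copy()  # leave the caller's frame untouched
    for label in labels:
        # Categories are sorted, so codes run 0..n-1 in sorted order.
        encoded[label] = encoded[label].astype('category').cat.codes
    return encoded

# Toy frame (made-up values) to try the function on.
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male'],
    'race': ['White', 'Black', 'White'],
    'age': [39, 28, 50],
})
encoded = le_categories(toy, ['sex', 'race'])
print(encoded)
```

Columns that are not listed, such as `age` here, pass through unchanged.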

### There's another way (and it's sometimes better!)
Another technique to encode categorical data is called *One Hot Encoding*. Essentially, you take every possible category of a categorical feature and make it its own binary feature.

One-hot encoding can improve a classifier's performance by removing the artificial ordering that the integer labels produced by label encoding imply. Although it adds more features, which can reduce performance, other techniques exist to offset this loss. Now let's check this out.
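
To make the idea concrete, here is a minimal sketch of one-hot encoding a single column by hand with NumPy. The `education` values are a made-up sample, and indexing an identity matrix is just one way to build the binary features; sklearn's `OneHotEncoder` and pandas' `get_dummies` do the same job.

```python
import numpy as np

# Made-up sample of a categorical column.
education = np.array(['Bachelors', 'HS-grad', 'Graduate', 'Bachelors'])

# categories: sorted unique values
# codes: position of each row's value within categories
categories, codes = np.unique(education, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing by codes yields one binary column per category.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, so no category is numerically "larger" than another.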

In [None]:
#OHE EXAMPLE

**Adapt the above example to work for a dataframe, given a list of column names**

In [None]:

def ohe_categories(df, labels):
    ### STUDENT SOL START ###
    pass
    ### STUDENT SOL END ###
ohe_categories(train_data, ['sex', 'education', 'marital-status', 'occupation', 'race'])
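
One possible way to approach this exercise, offered as a sketch and not as the intended solution, is pandas' `get_dummies`, which replaces each listed column with one indicator column per category. The toy frame is made up for illustration.

```python
import pandas as pd

def ohe_categories(df, labels):
    # Check each requested column individually.
    missing = [label for label in labels if label not in df.columns]
    assert not missing, "Labels not in column names: {}".format(missing)

    # Each listed column becomes one 0/1 column per category,
    # named <column>_<category>; other columns are kept as they are.
    return pd.get_dummies(df, columns=labels, dtype=int)

# Toy frame (made-up values) to try the function on.
toy = pd.DataFrame({
    'age': [39, 28, 50],
    'sex': ['Male', 'Female', 'Male'],
})
encoded = ohe_categories(toy, ['sex'])
print(encoded)
```

The original `sex` column is dropped and replaced by `sex_Female` and `sex_Male`.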

## Moving On
Now that we've explored categorical data, we can do some tricks to improve performance using continuous data as well. 

A popular and useful technique is called binning. It breaks a range of continuous values into equally sized chunks (bins), then assigns each data point to its bin. This greatly reduces the number of possible values and can significantly improve performance in many cases.
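
As a minimal sketch of equal-width binning, the snippet below uses `pd.cut` to split a handful of ages into four bins. The sample ages and the choice of four bins are assumptions made for illustration.

```python
import pandas as pd

# Made-up ages spanning 17 to 77.
ages = pd.Series([17, 25, 38, 42, 53, 61, 77])

# bins=4 splits the range [min, max] into 4 equal-width intervals;
# labels=False returns each value's bin index (0..3) instead of an Interval.
age_bin = pd.cut(ages, bins=4, labels=False)

print(age_bin.tolist())
```

Seven distinct ages collapse to four bin indices, which is the reduction in possible values described above.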