### Introduction
We continue our work with census data from [Project 1](https://gist.github.com/kjprice/820c75bd8e5c3f2558f4576f38893dae) and take a deeper look at the data. Moving beyond exploratory data analysis, we now classify the data based on the given attributes.

In [1]:
import pandas as pd
import numpy as np
import sys
import os

# load in raw dataset
person_raw = pd.read_csv('../data/person-subset-2.5percent.csv')

# clean data (as performed in Project 1)
# will provide us with a new dataset "df"
# ...and a list of "important_features"
# (exec/open works in both Python 2 and 3; execfile exists only in Python 2)
exec(open('../python/clean_data_person.py').read())

Let's take a look at some of the `important_features` discovered from the previous project:

In [2]:
df[important_features].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60357 entries, 0 to 78317
Data columns (total 12 columns):
PINCP     60357 non-null float64
POVPIP    57892 non-null float64
JWMNP     32486 non-null float64
AGEP      60357 non-null int64
PWGTP     60357 non-null int64
PAP       60357 non-null float64
CIT       60357 non-null object
OC        60357 non-null bool
ENG       60357 non-null object
COW       60357 non-null object
PUMA      60357 non-null category
SEX       60357 non-null object
dtypes: bool(1), category(1), float64(4), int64(2), object(4)
memory usage: 5.3+ MB


### New Categorical Variable

In addition to the attributes above, as defined in our [previous project](https://gist.github.com/kjprice/820c75bd8e5c3f2558f4576f38893dae), we need a variable to serve as the target of a classification analysis. This variable should be categorical and, ideally, continue our theme of "predicting income". Income (`PINCP`), as we currently have it, is a continuous variable. We will use it to create a new categorical variable called `affluency`, which takes the value "general" when an individual makes less than $100,000 and "rich" otherwise:

In [3]:
df['affluency'] = pd.cut(df.PINCP, [-1, 99999.99, 1e12], labels=('general', 'rich'))

important_features = important_features + ['affluency']

lr = df[important_features].copy(deep=True)
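
To see how the bin edges in `pd.cut` behave, here is a minimal sketch on a made-up income series (the values are illustrative assumptions, not census data). Intervals are closed on the right by default, so 99,999.99 lands in "general" and 100,000 in "rich". Any value at or below the lowest edge (-1) falls outside every bin and becomes `NaN`.

```python
import pandas as pd

# Hypothetical incomes chosen to probe the bin boundaries
incomes = pd.Series([-5000.0, 0.0, 45000.0, 99999.99, 100000.0, 250000.0])

# Same edges as the notebook: (-1, 99999.99] -> general, (99999.99, 1e12] -> rich
affluency = pd.cut(incomes, [-1, 99999.99, 1e12], labels=['general', 'rich'])

print(pd.DataFrame({'income': incomes, 'affluency': affluency}))
```

If negative incomes below -1 can occur in the data, lowering the first edge would keep them from becoming missing values.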

### Cleanup
Now that we have our new categorical variable, and a new dataset (`lr`) based on our `important_features`, let's try to clean up our data.

First, we will remove unwanted fields:

In [4]:
del lr['POVPIP']
del lr['PUMA']

Then we group "Travel Time" (`JWMNP`) into labeled bins, placing missing values in their own "na" bin:

In [5]:
lr.JWMNP = lr.JWMNP.fillna(-1)
lr['travel_time'] = pd.cut(lr.JWMNP, (-2, 0, 15, 40, 60, lr.JWMNP.max()), labels=['na', 'short', 'half hour', 'hour', 'long'])
del lr['JWMNP']
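
The binning above can be checked on a small, made-up series of travel times (illustrative values only). Missing values are first filled with -1 so that they fall into the `(-2, 0]` bin labeled "na". Note that using the series maximum as the last edge assumes the maximum is greater than 60, since `pd.cut` requires strictly increasing edges.

```python
import numpy as np
import pandas as pd

# Hypothetical travel times in minutes, including one missing value
travel = pd.Series([np.nan, 5, 15, 30, 40, 45, 60, 90])

# Missing values become -1 so they land in the first bin
filled = travel.fillna(-1)

# Same edges and labels as the notebook; the last edge is the observed maximum
bins = (-2, 0, 15, 40, 60, filled.max())
labels = ['na', 'short', 'half hour', 'hour', 'long']
travel_time = pd.cut(filled, bins, labels=labels)

print(travel_time.tolist())
```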

Then, from our variables `affluency` and `SEX`, we will create the boolean variables `wealthy` and `is_male` respectively:

In [6]:
lr['wealthy'] = lr.affluency == 'rich'
del lr['affluency']
lr['is_male'] = lr.SEX == 'Male'
lr.is_male = lr.is_male.astype(int)
del lr['SEX']
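
As a small sketch of what these comparisons produce, consider a made-up frame (the rows are assumptions for illustration). Comparing a categorical column against a label yields a boolean column, and a missing category compares as `False` rather than staying missing.

```python
import pandas as pd

# Hypothetical rows; the last affluency value is missing on purpose
toy = pd.DataFrame({
    'affluency': pd.Categorical(['general', 'rich', None],
                                categories=['general', 'rich']),
    'SEX': ['Male', 'Female', 'Male'],
})

# Boolean flag: True only where the label is exactly 'rich'
toy['wealthy'] = toy.affluency == 'rich'

# Integer flag: 1 for 'Male', 0 otherwise
toy['is_male'] = (toy.SEX == 'Male').astype(int)

print(toy[['wealthy', 'is_male']])
```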

Finally, we can perform one-hot-encoding on our other categorical variables `travel_time`, `CIT`, `ENG`, `COW`:

In [7]:
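# one-hot encode each remaining categorical variable and drop the original column
# note: get_dummies joins prefix and value with '_', so a prefix ending in '_'
# produces column names with a double underscore (e.g. 'Citizen__US Born')
one_hot_travel_time = pd.get_dummies(lr.travel_time, prefix='Travel_Time_')
del lr['travel_time']
one_hot_citizenship = pd.get_dummies(lr.CIT, prefix='Citizen_')
lr = pd.concat((lr, one_hot_citizenship), axis=1)
del lr['CIT']
one_hot_english = pd.get_dummies(lr.ENG, prefix='English_')
lr = pd.concat((lr, one_hot_english), axis=1)
del lr['ENG']
one_hot_worker_class = pd.get_dummies(lr.COW, prefix='Worker_Class_')
del lr['COW']
# note: one_hot_travel_time and one_hot_worker_class are never concatenated onto lr,
# so only the citizenship and English dummies appear in the output below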
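
As an alternative sketch (not the method used in this notebook), `pd.get_dummies` can encode several columns in one call through its `columns` argument, which drops the original columns and appends the dummies automatically. The toy frame and the prefix names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned census data
toy = pd.DataFrame({
    'AGEP': [34, 51, 27],
    'CIT': ['US Born', 'Naturalized', 'US Born'],
    'ENG': ['Speaks only English', 'Well', 'Speaks only English'],
})

# Encode both columns at once; originals are dropped, dummies are appended
encoded = pd.get_dummies(
    toy,
    columns=['CIT', 'ENG'],
    prefix={'CIT': 'Citizen', 'ENG': 'English'},
    dtype='uint8',
)

print(encoded.columns.tolist())
```

Encoding in one call avoids repeating the concatenate-then-delete steps for each variable, which makes it harder to leave an encoded frame unattached by accident.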

Let's see how our dataset looks now:

In [8]:
lr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60357 entries, 0 to 78317
Data columns (total 17 columns):
PINCP                           60357 non-null float64
AGEP                            60357 non-null int64
PWGTP                           60357 non-null int64
PAP                             60357 non-null float64
OC                              60357 non-null bool
wealthy                         60357 non-null bool
is_male                         60357 non-null int64
Citizen__Born Abroad)           60357 non-null uint8
Citizen__Naturalized            60357 non-null uint8
Citizen__Non-Citizen            60357 non-null uint8
Citizen__US Born                60357 non-null uint8
Citizen__US Territory Born      60357 non-null uint8
English__Not at all             60357 non-null uint8
English__Not well               60357 non-null uint8
English__Speaks only English    60357 non-null uint8
English__Very well              60357 non-null uint8
English__Well                   60357 non

Great, we now have numeric fields to work with, so we can begin our analysis...