## Introduction
We continue our work with census data from [Project 1](https://gist.github.com/kjprice/820c75bd8e5c3f2558f4576f38893dae) and the [Mini Project](https://gist.github.com/kjprice/bb3ecd63e7a2e5050c018d63c875a79b) to take a deeper look at our data. We move beyond exploratory data analysis and now classify the data based on the given attributes.

In [40]:
import numpy as np 
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
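from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold  # used when dividing the data
from sklearn.preprocessing import MinMaxScaler  # used to scale the features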

# load in raw dataset
person_raw = pd.read_csv('../data/person-subset-2.5percent.csv')

# clean data (as performed in Project 1)
# will provide us with a new dataset "df"
# ...and a list of "important_features"
execfile('../python/clean_data_person.py')

Let's take a look at some of the `important_features` discovered from the first project:

In [12]:
df[important_features].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60357 entries, 0 to 60356
Data columns (total 15 columns):
PINCP        60357 non-null float64
POVPIP       57892 non-null float64
JWMNP        32486 non-null float64
AGEP         60357 non-null int64
PWGTP        60357 non-null int64
PAP          60357 non-null float64
CIT          60357 non-null object
ENG          60357 non-null object
COW          60357 non-null object
PUMA         60357 non-null category
SEX          60357 non-null object
MIL          60357 non-null object
SCHL         60357 non-null float64
MAR          60357 non-null object
affluency    60312 non-null category
dtypes: category(2), float64(5), int64(2), object(6)
memory usage: 6.7+ MB


### Data Descriptions

Attributes used with their descriptions:
  * AGEP:  Age of person (continuous 0-95)
  * CIT:  Citizenship status (categorical - numerical key)
  * CIT_CAT:  Citizenship status (categorical - string)
  * COW:  Class of worker (categorical - string)
  * ENG:  Ability to speak English (categorical scale of 1-4 and native speakers)
  * JWMNP:  Travel time to work (continuous - minutes of commute to work)
  * JWTR:  Means of transportation to work (categorical - 12 modes of transportation)
  * MAR:  Marital status (categorical - 5 categories: married, divorced, separated, single, widow)
  * PAP:  Public assistance income past 12 months (continuous variable of dollars of assistance received)
  * PINCP:  Total person's income (continuous of total income)
  * POVPIP:  Income-to-poverty ratio recode (continuous with a cap at 501)
  * SCHL:  Educational attainment (continuous - years of completed education)
  * SEX:  Sex (gender of female or male)

### Classification variable

As in the previous assignment, we create a new variable called `affluency`, which takes one of two levels: "general" or "rich" (income of $100,000 or more):

In [13]:
def create_affluency():
    global lr
    global important_features

    lr = df[important_features].copy(deep=True)
    lr['affluency'] = pd.cut(df.PINCP, [-1, 99999.99, 1e12], labels=('general', 'rich'))
create_affluency()
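
To make the bin boundaries concrete, here is a minimal sketch that applies the same `pd.cut` call to a handful of made-up incomes (the values are illustrative, not census rows). Because `pd.cut` intervals are closed on the right by default, an income of exactly 100,000 lands in "rich", and any value at or below -1 falls outside both bins and becomes `NaN`. That may be why `affluency` has slightly fewer non-null entries than `PINCP`, though we have not verified it here.

```python
import pandas as pd

# Illustrative incomes chosen to sit on either side of the bin edges.
incomes = pd.Series([0.0, 45000.0, 99999.99, 100000.0, 250000.0, -5000.0])

# Same bins and labels as create_affluency(): (-1, 99999.99] and (99999.99, 1e12].
affluency = pd.cut(incomes, [-1, 99999.99, 1e12], labels=('general', 'rich'))

for income, level in zip(incomes, affluency):
    print(income, level)
```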

### Cleanup

As in the previous project, we clean up our dataset further. We delete variables that we do not need:
 - `POVPIP`
 - `PINCP`
 - `PUMA`
 - `MIL`
 - `MAR`
 - `SCHL`

We group "Travel Time" (`JWMNP`) into appropriate intervals (previously a continuous variable):
 - `na`
 - `short`
 - `half hour`
 - `hour`
 - `long`
 
Then, from our variables `affluency` and `SEX`, we will create the boolean variables `wealthy` and `is_male` respectively.

Finally, we perform one-hot encoding on our remaining categorical variables `travel_time`, `CIT`, `ENG`, and `COW`.
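
The real implementation of these steps lives in `clean_data_classification.py`, which is not reproduced here. As a rough sketch of the same idea on a toy frame, the snippet below bins travel time, derives the two boolean-style columns, and one-hot encodes the binned variable. The bin edges (15, 30, and 60 minutes) and the `'Male'`/`'Female'` coding of `SEX` are assumptions made for illustration only.

```python
import pandas as pd

# Toy rows standing in for the cleaned census data (values are made up).
toy = pd.DataFrame({
    'JWMNP': [5.0, 30.0, 55.0, 120.0, float('nan')],
    'SEX': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'affluency': ['general', 'rich', 'general', 'general', 'general'],
})

# Hypothetical bin edges in minutes; the project's actual edges may differ.
travel = pd.cut(toy['JWMNP'], [0, 15, 30, 60, 1e6],
                labels=['short', 'half hour', 'hour', 'long'])
# People without a commute have no travel time; give them their own 'na' level.
toy['travel_time'] = travel.cat.add_categories(['na']).fillna('na')

# Boolean-style targets and predictors derived from the categorical columns.
toy['wealthy'] = toy['affluency'] == 'rich'
toy['is_male'] = (toy['SEX'] == 'Male').astype(int)

# One-hot encode the binned variable and drop the columns it replaces.
encoded = pd.get_dummies(
    toy.drop(columns=['JWMNP', 'SEX', 'affluency']),
    columns=['travel_time'], prefix='Travel_Time', prefix_sep='__')
print(encoded.columns.tolist())
```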

All of this gives us the following:

In [17]:
execfile('../python/clean_data_classification.py')
lr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60357 entries, 0 to 60356
Data columns (total 48 columns):
AGEP                                               60357 non-null int64
PWGTP                                              60357 non-null int64
PAP                                                60357 non-null float64
wealthy                                            60357 non-null bool
is_male                                            60357 non-null int64
Travel_Time__half hour                             60357 non-null uint8
Travel_Time__hour                                  60357 non-null uint8
Travel_Time__long                                  60357 non-null uint8
Travel_Time__na                                    60357 non-null uint8
Travel_Time__short                                 60357 non-null uint8
Citizen__Born Abroad)                              60357 non-null uint8
Citizen__Naturalized                               60357 non-null uint8
Citizen__Non-Citizen      

### PCA

A principal components analysis with five components is also performed on a scaled version of our dataset. We may use it later in our analysis:

In [22]:
df[pca_features].head()

|   | PCA_0 | PCA_1 | PCA_2 | PCA_3 | PCA_4 |
|---|---|---|---|---|---|
| 0 | 1.884722 | 1.800257 | 0.924182 | -0.699644 | -0.178205 |
| 1 | 0.337951 | -1.349852 | 0.459928 | -0.865059 | 0.822025 |
| 2 | -1.203093 | -0.315044 | -0.527054 | -0.989047 | 1.147465 |
| 3 | -0.173147 | -1.279213 | -0.209784 | -0.61414 | -0.449465 |
| 4 | 1.361579 | 1.421788 | -0.016835 | -1.109769 | 0.966567 |
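
The PCA code itself is part of the cleaning scripts and is not shown in this notebook. A minimal sketch of a five-component PCA on scaled data might look like the following. It runs on random stand-in data, and the choice of `StandardScaler` is an assumption, since the text above only says the data were scaled.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the numeric feature matrix (200 rows, 8 features).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 8))

# Scale first so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X_toy)

# Keep the first five principal components.
pca = PCA(n_components=5)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_.sum())
```

The summed explained-variance ratio gives a quick sense of how much of the original variation the five components retain.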


## Evaluation metric

Ultimately, we hope to classify individuals as either _wealthy_ (income of at least \$100,000) or _general_ (income below \$100,000). We therefore want our model to be as accurate as possible and will treat accuracy as the most important metric.

Our dataset is highly imbalanced, with only about 7% of the population classified as "wealthy". Because of this imbalance, we also want to keep an eye on the F-measure: a model could take the easy way out and reach roughly 93% accuracy simply by classifying every individual as "general". The F-measure will therefore be a secondary metric to accuracy.
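
A small worked example shows why accuracy alone can mislead here. The sketch below uses 1,000 hypothetical people, 7% of them wealthy, and a "lazy" model that labels everyone as general. Accuracy looks strong, but the F-measure for the wealthy class collapses to zero because the model never finds a single wealthy individual.

```python
# 1,000 hypothetical people, 7% of them wealthy (True).
y_true = [True] * 70 + [False] * 930
# A "lazy" model that predicts general (False) for everyone.
y_pred = [False] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t and p)          # wealthy, predicted wealthy
fp = sum(1 for t, p in pairs if (not t) and p)    # general, predicted wealthy
fn = sum(1 for t, p in pairs if t and (not p))    # wealthy, predicted general

accuracy = sum(1 for t, p in pairs if t == p) / float(len(pairs))
precision = tp / float(tp + fp) if (tp + fp) else 0.0
recall = tp / float(tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print('accuracy: %.2f' % accuracy)
print('F-measure: %.2f' % f1)
```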

## Dividing Data

To help mitigate overfitting, we use 10-fold cross-validation to split our dataset. Because our dataset is so imbalanced, we use a stratified k-fold cross-validator, which keeps the share of wealthy individuals roughly the same in every fold.

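Additionally, we scale each feature to the [0, 1] range with `MinMaxScaler` so that a variety of classification techniques can be used: this puts all features on a comparable scale for distance-based methods such as KNN and keeps every value non-negative, as `MultinomialNB` requires.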

In [25]:
def create_stratified_transformed_dataset():
   global lr2
   ### Create response and explanatory variables
   lr2 = lr.copy(deep=True)
   y = lr2.wealthy.values
   del lr2['wealthy']
   X = lr2.values
   
   ### Scale X values to the [0, 1] range (min-max scaling)
   scl_obj = MinMaxScaler()
   scl_obj.fit(X)
   X = scl_obj.transform(X)
   
   skf = StratifiedKFold(n_splits=10)

   skf.get_n_splits(X, y)
   
   _data = []
   for train_index, test_index in skf.split(X, y):
      X_train, X_test = X[train_index], X[test_index]
      y_train, y_test = y[train_index], y[test_index]
      _data.append([X_train, X_test, y_train, y_test])
      
   return _data


test_train_data = create_stratified_transformed_dataset()
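
As a quick sanity check on what stratification does, the sketch below splits a toy label vector with the same 7% positive rate and prints the share of positives in each test fold. The toy arrays are placeholders; only the `StratifiedKFold` usage mirrors the function above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with 7% positives; the features do not matter for the split itself.
y_toy = np.array([True] * 70 + [False] * 930)
X_toy = np.zeros((len(y_toy), 1))

skf = StratifiedKFold(n_splits=10)
rates = [y_toy[test_idx].mean() for _, test_idx in skf.split(X_toy, y_toy)]

# Each test fold should hold close to the overall 7% share of positives.
print([round(float(r), 3) for r in rates])
```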

## Creating models

We will try to classify the individuals of our dataset as being wealthy (or not) using several techniques including Naive Bayes, Decision Trees, and KNN. Then we will try to regress our dataset to predict the actual income of the individual using linear regression.

### Helper functions

We create two helper functions to reduce cognitive overload by paring our code down to the essential pieces:

In [35]:
def print_accuracy(title, accuracy, avg=False):
    # Report accuracy as a percentage; per-fold lines get a '**' prefix
    accuracy = round(accuracy*100, 2)
    avg_text = 'avg' if avg else ''
    bullet = '' if avg else '**'
    print('%s%s%% %s accuracy - %s' % (bullet, accuracy, avg_text, title))

def fit_and_test(title, clf, show_individual_accuracies=False, print_confusion=False):
    # Fit `clf` on each training fold and score it on the matching test fold
    accuracies = []
    for X_train, X_test, y_train, y_test in test_train_data:
        clf.fit(X_train, y_train)
        y_hat = clf.predict(X_test)

        acc = mt.accuracy_score(y_test, y_hat)
        accuracies.append(acc)

        if print_confusion:
            print(mt.confusion_matrix(y_test, y_hat))

        if show_individual_accuracies:
            print_accuracy(title, acc)
    # Summarize with the mean accuracy across all folds
    print_accuracy(title, np.mean(accuracies), avg=True)
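
The helper above reports accuracy only. Since the F-measure is our secondary metric, one possible way to collect both in a single pass is scikit-learn's `cross_validate`, shown here as an alternative to the hand-written loop rather than as the method used in this notebook. The data come from `make_classification` with an assumed 93/7 class split, so the printed numbers say nothing about the census data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 7% positives).
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.93, 0.07], random_state=0)

# Score the same classifier on accuracy and F-measure over 10 stratified folds.
scores = cross_validate(
    DecisionTreeClassifier(class_weight='balanced', random_state=0),
    X_toy, y_toy,
    cv=StratifiedKFold(n_splits=10),
    scoring=['accuracy', 'f1'],
)

print('avg accuracy:  %.4f' % scores['test_accuracy'].mean())
print('avg F-measure: %.4f' % scores['test_f1'].mean())
```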



### Decision Trees

In [41]:
dt_clf = DecisionTreeClassifier(max_features=None, class_weight='balanced')
fit_and_test('decision tree (no max feature)', dt_clf)
dt_clf = DecisionTreeClassifier(max_features=4, class_weight='balanced')
fit_and_test('decision tree (max_features=4)', dt_clf)
dt_clf = RandomForestClassifier(class_weight='balanced')
fit_and_test('Random Forest', dt_clf)

89.7% avg accuracy - decision tree (no max feature)
89.73% avg accuracy - decision tree (max_features=4)
92.28% avg accuracy - Random Forest


We get nearly **90%** accuracy using a decision tree. Changing the `max_features` parameter makes little difference, apart from a slight speed boost when limiting the features. A random forest, however, raises accuracy to about **92%**.

### Multinomial Naive Bayes

In [43]:
mb_clf = MultinomialNB(alpha=100)
fit_and_test('bayes multinomial (alpha=100)', mb_clf, show_individual_accuracies=False)
mb_clf = MultinomialNB(alpha=1)
fit_and_test('bayes multinomial (alpha=1)', mb_clf, show_individual_accuracies=False)
mb_clf = MultinomialNB(alpha=.001)
fit_and_test('bayes multinomial (alpha=.001)', mb_clf, show_individual_accuracies=False)

92.74% avg accuracy - bayes multinomial (alpha=100)
92.82% avg accuracy - bayes multinomial (alpha=1)
92.82% avg accuracy - bayes multinomial (alpha=.001)


With multinomial naive Bayes, we consistently get an accuracy of around **93%**. Changing the `alpha` parameter does not seem to produce a significant change in accuracy.
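
One plausible reason for this insensitivity is the way `alpha` enters the model. In multinomial naive Bayes, `alpha` is an additive smoothing term: the likelihood of feature i in class y is estimated as (N_yi + alpha) / (N_y + alpha * n), where N_yi is the total for that feature in the class, N_y is the class total, and n is the number of features. The sketch below applies that formula to made-up totals (the helper name `smoothed_likelihoods` is ours, not part of scikit-learn). When the totals are large relative to `alpha`, the estimates barely move.

```python
def smoothed_likelihoods(feature_totals, alpha):
    """Additively smoothed feature likelihoods for a single class."""
    class_total = sum(feature_totals)
    n_features = len(feature_totals)
    return [(total + alpha) / float(class_total + alpha * n_features)
            for total in feature_totals]

# Hypothetical per-feature totals for one class (not taken from the census data).
totals = [900.0, 100.0, 0.0]

for alpha in (0.001, 1, 100):
    estimates = smoothed_likelihoods(totals, alpha)
    print(alpha, [round(p, 4) for p in estimates])
```

With tens of thousands of rows per fold, the class totals in our data are likely far larger than any of the `alpha` values tried, which would be consistent with the nearly identical accuracies above.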