In [1]:
%%javascript
/**********************************************************************************************
Known Mathjax Issue with Chrome - a rounding issue adds a border to the right of mathjax markup
https://github.com/mathjax/MathJax/issues/1300
A quick hack to fix this based on stackoverflow discussions: 
http://stackoverflow.com/questions/34277967/chrome-rendering-mathjax-equations-with-a-trailing-vertical-line
**********************************************************************************************/

$('.math>span').css("border-left-color","transparent")

<IPython.core.display.Javascript object>

In [None]:
%reload_ext autoreload
%autoreload 2

# DAMLAS - Machine Learning At Scale
## Assignment - HW4
Data Analytics and Machine Learning at Scale
Target, Minneapolis

---
__Name:__  Niki Deeny   
__Class:__  Summer 2016    
__Email:__  niki.deeny@target.com     
__Week:__   04

# Table of Contents <a name="TOC"></a> 

1.  [HW Introduction](#1)   
2.  [HW References](#2)
3.  [HW 4 Problems](#4)   
    4.0.  [Final Project description](#4.0)   
    4.1.  [Build a decision tree to predict whether you can play tennis or not](#4.1)   
    4.2.  [Regression Tree (OPTIONAL Homework)](#4.2)    
    4.3.  [Predict survival on the Titanic](#4.3)    
    4.4.  [Heritage Healthcare Prize (Predict # Days in Hospital next year)](#4.4)  


<a name="1">
# 1 Instructions
[Back to Table of Contents](#TOC)
* Homework submissions are due by Thursday, 08/18/2016 at 11AM (CT).


* Prepare a single Jupyter notebook (not a requirement). Please include the questions and question numbers alongside your responses.
Submit your homework notebook via the following form:

   + [Submission Link - Google Form](http://goo.gl/forms/er3OFr5eCMWDngB72)


### Documents:
* IPython Notebook, published and viewable online.
* PDF export of IPython Notebook.

<a name="2">
# 2 Useful References
[Back to Table of Contents](#TOC)

* [Lecture Slides on Decision Trees and Ensembles](https://www.dropbox.com/s/lm4vuocqoq6mq7k/Lecture-13-Decision-Trees-PLanet.pdf?dl=0)

* Chapter 17 on decision Trees,   https://www.dropbox.com/s/5ca98ah5chqlcmn/Data_Science_from_Scratch%20%281%29.pdf?dl=0   [Please do not share this PDF]
* Karau, Holden, Konwinski, Andy, Wendell, Patrick, & Zaharia, Matei. (2015). Learning Spark: Lightning-fast big data analysis. Sebastopol, CA: O’Reilly Publishers.
* Hastie, Trevor, Tibshirani, Robert, & Friedman, Jerome. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Stanford, CA: Springer Science+Business Media. __(Download for free [here](http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf))__
* Ryza, Sandy, Laserson, Uri, Owen, Sean, & Wills, Josh. (2015). Advanced analytics with Spark: Patterns for learning from data at scale. Sebastopol, CA: O’Reilly Publishers.
---


## 3.  HW4  <a name="4"></a>
[Back to Table of Contents](#TOC)

 <a name="4.0"></a>
## HW4.0 Final Project description

Please prepare your project description using the following format
* 200-word abstract
* data source and description
* pipeline of steps (in a block diagram)
* Metrics for success

PLEASE NOTE: We will probably have project team sizes of 3 people plus/minus 1

 <a name="4.1"></a>
## HW4.1 Build a decision tree to predict whether you can play tennis or not

[Back to Table of Contents](#TOC)

Decision Trees

Write a program in Python (or, optionally, in Spark) to implement the ID3 decision tree algorithm. Build a tree to predict PlayTennis from the other attributes, but do not use the Day attribute in your tree. Your program should read a space-delimited dataset from a file called dataset.txt and print your decision tree and the training set accuracy in some readable format. For example, here is the tennis dataset. The first line contains the names of the fields:

<PRE>
Day outlook temperature humidity wind playtennis
d1 sunny hot high FALSE no
d2 sunny hot high TRUE no
d3 overcast hot high FALSE yes
d4 rainy mild high FALSE yes
d5 rainy cool normal FALSE yes
d6 rainy cool normal TRUE no
d7 overcast cool normal TRUE yes
d8 sunny mild high FALSE no
d9 sunny cool normal FALSE yes
d10 rainy mild normal FALSE yes
d11 sunny mild normal TRUE yes
d12 overcast mild high TRUE yes
d13 overcast hot normal FALSE yes
d14 rainy mild high TRUE no
</PRE>

The last column is the classification attribute, and will always contain the values yes or no.
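
A minimal sketch of reading this format, assuming a whitespace-delimited file whose first line is the header, whose `Day` column is ignored, and whose last column holds the yes/no label. The function name `load_dataset` and its `skip_columns` parameter are hypothetical. It produces `(attribute_dict, label)` pairs, the same shape as the `inputs` list used later in this notebook.

```python
def load_dataset(lines, skip_columns=("Day",)):
    """Parse whitespace-delimited rows into (attribute_dict, label) pairs.

    `lines` is any iterable of strings (an open file works). The first
    non-empty line holds the field names; the last field is the label.
    """
    rows = [line.split() for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    skip = set(name.lower() for name in skip_columns)
    data = []
    for record in records:
        # pair every non-label field with its column name, dropping skipped columns
        attributes = dict(
            (name, value)
            for name, value in zip(header[:-1], record[:-1])
            if name.lower() not in skip
        )
        data.append((attributes, record[-1].lower() == "yes"))
    return data


# two rows of the tennis data, used only as an illustration
sample = """Day outlook temperature humidity wind playtennis
d1 sunny hot high FALSE no
d3 overcast hot high FALSE yes
""".splitlines()

parsed = load_dataset(sample)
```

With a real file, the call would look like `load_dataset(open("dataset.txt"))`. Nothing in the function depends on the tennis column names, so it should also accept datasets with a different number of rows or columns.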

For output, you can choose how to draw the tree so long as it is clear what the tree is. You might find it easier if you turn the decision tree on its side, and use indentation to show levels of the tree as it grows from the left. For example:

<PRE>
outlook = sunny
|  humidity = high: no
|  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|  wind = TRUE: no
|  wind = FALSE: yes

</PRE>

You don't need to make your tree output look exactly like above: feel free to print out something similarly readable if you think it is easier to code.
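
One possible way to produce that sideways layout is sketched below. It assumes the tree is stored as nested `(attribute, {value: subtree})` tuples with `True`/`False` leaves and an optional `None` key for a default branch. The helper name `format_tree` is hypothetical.

```python
def format_tree(tree, indent=""):
    """Return the tree as a list of text lines, one branch per line."""
    if not isinstance(tree, tuple):
        # the whole tree is a single leaf
        return [indent + ("yes" if tree else "no")]
    attribute, subtrees = tree
    lines = []
    # sort branch values for a stable order; None marks the default branch
    for value in sorted(v for v in subtrees if v is not None):
        subtree = subtrees[value]
        if isinstance(subtree, tuple):
            lines.append("%s%s = %s" % (indent, attribute, value))
            lines.extend(format_tree(subtree, indent + "|  "))
        else:
            label = "yes" if subtree else "no"
            lines.append("%s%s = %s: %s" % (indent, attribute, value, label))
    return lines


example_tree = ("outlook", {
    "sunny": ("humidity", {"high": False, "normal": True, None: False}),
    "overcast": True,
    "rainy": ("wind", {"TRUE": False, "FALSE": True, None: True}),
    None: True,
})
print("\n".join(format_tree(example_tree)))
```

Returning a list of lines rather than printing directly keeps the function easy to check and lets the caller decide where the text goes.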

You may find Python dictionaries especially useful here, as they will give you a quick and easy way to help manage counting the number of times you see a particular attribute.

Here are some FAQs that I've gotten in the past regarding this assignment, and some I might get if I don't answer them now.

__Should my code work for other datasets besides the tennis dataset?__ 
Yes. We will give your program a different dataset to try it out with. You may assume that our dataset is correct and well-formatted, but you should not make assumptions regarding the number of rows, the number of columns, or the values that will appear within. The last column will always be the classification, and will always contain yes or no values.

__Is it possible that some value, like "normal," could appear in more than one column?__
Yes. In addition to the column "humidity", we might have had another column called "skycolor" which could have values "normal," "weird," and "bizarre."

__Could "yes" and "no" appear as possible values in columns other than the classification column?__
Yes. In addition to the classification column "playtennis," we might have had another column called "seasonalweather" which would contain "yes" and "no."

In [192]:
import os
import sys #current as of 9/26/2015

# spark_home = os.environ['SPARK_HOME'] = '/Users/jshanahan/Dropbox/Lectures-UC-Berkeley-ML-Class-2015/spark-1.6.1-bin-hadoop2.6/'
spark_home = os.environ['SPARK_HOME'] = '/users/z084224/Downloads/spark-1.6.2-bin-hadoop2.6'
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.9-src.zip'))

# First, we initialize the Spark environment

# import findspark
#findspark.init()

import pyspark
from pyspark.sql import SQLContext

# # We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# # In this case, it is local multicore execution with "local[*]"
app_name = "example-logs"
master = "local[*]"

# Don't run this stuff twice
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)


print(sc)
print(sqlContext)


# # Import some libraries to work with dates
import dateutil.parser
import dateutil.relativedelta as dateutil_rd

<pyspark.context.SparkContext object at 0x106b04850>
<pyspark.sql.context.SQLContext object at 0x10741d690>


In [193]:
inputs = [
    
    ({'outlook':'sunny', 'temperature':'hot', 'humidity':'high', 'wind':'FALSE'}, False),
    ({'outlook':'sunny', 'temperature':'hot', 'humidity':'high', 'wind':'TRUE'}, False),
    ({'outlook':'overcast', 'temperature':'hot', 'humidity':'high', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'mild', 'humidity':'high', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'cool', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'cool', 'humidity':'normal', 'wind':'TRUE'}, False),
    ({'outlook':'overcast', 'temperature':'cool', 'humidity':'normal', 'wind':'TRUE'}, True),
    ({'outlook':'sunny', 'temperature':'mild', 'humidity':'high', 'wind':'FALSE'}, False),
    ({'outlook':'sunny', 'temperature':'cool', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'mild', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'sunny', 'temperature':'mild', 'humidity':'normal', 'wind':'TRUE'}, True),
    ({'outlook':'overcast', 'temperature':'mild', 'humidity':'high', 'wind':'TRUE'}, True),
    ({'outlook':'overcast', 'temperature':'hot', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'mild', 'humidity':'high', 'wind':'TRUE'}, False)
    
    ]

In [198]:
from __future__ import division
from collections import Counter, defaultdict
from functools import partial
import math, random

def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

def class_probabilities(labels):
    total_count = len(labels)
    return [count/total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):        
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)

def partition_entropy(subsets):
    """find the entropy from this partition of data into subsets"""
    total_count = sum(len(subset) for subset in subsets)    
    return sum(data_entropy(subset)*len(subset)/total_count for subset in subsets )

def group_by(items, key_fn):
    """returns a defaultdict(list), where each input item 
    is in the list whose key is key_fn(item)"""
    groups = defaultdict(list)
    for item in items:
        key = key_fn(item)
        groups[key].append(item)
    return groups
    
def partition_by(inputs, attribute):
    """returns a dict of inputs partitioned by the attribute
    each input is a pair (attribute_dict, label)"""
    return group_by(inputs, lambda x: x[0][attribute]) 

def partition_entropy_by(inputs, attribute):
    """computes the entropy corresponding to the given partition"""        
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())

def classify(tree, input):
    """classify the input using the given decision tree"""
    
    # if this is a leaf node, return its value
    if tree in [True, False]:
        return tree
   
    # otherwise find the correct subtree
    attribute, subtree_dict = tree

    subtree_key = input.get(attribute)  # None if input is missing attribute

    if subtree_key not in subtree_dict: # if no subtree for key,
        subtree_key = None              # we'll use the None subtree

    subtree = subtree_dict[subtree_key] # choose the appropriate subtree
    return classify(subtree, input)     # and use it to classify the input

In [199]:
def build_tree_id3(inputs, split_candidates=None):

    # if this is our first pass, 
    # all keys of the first input are split candidates
    if split_candidates is None:
        split_candidates = inputs[0][0].keys()

    # count Trues and Falses in the inputs
    num_inputs = len(inputs)
    num_trues = len([label for item, label in inputs if label])
    num_falses = num_inputs - num_trues
    
    if num_trues == 0:                  # if only Falses are left
        return False                    # return a "False" leaf
        
    if num_falses == 0:                 # if only Trues are left
        return True                     # return a "True" leaf

    if not split_candidates:            # if no split candidates left
        return num_trues >= num_falses  # return the majority leaf
                            
    # otherwise, split on the best attribute
    best_attribute = min(split_candidates,
        key=partial(partition_entropy_by, inputs))

    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates 
                      if a != best_attribute]
    
    # recursively build the subtrees
    subtrees = { attribute : build_tree_id3(subset, new_candidates)
                 for attribute, subset in partitions.iteritems() }

    subtrees[None] = num_trues > num_falses # default case

    return (best_attribute, subtrees)

print "Here is the tree"
tree = build_tree_id3(inputs)
print tree

Here is the tree
('outlook', {None: True, 'rainy': ('wind', {'FALSE': True, 'TRUE': False, None: True}), 'overcast': True, 'sunny': ('humidity', {'high': False, None: False, 'normal': True})})


In [55]:
correct = 0
for features, label in inputs:
    result = classify(tree, features)
    if result == label:
        correct += 1
        print 'Classification Accurate!'
    else:
        print 'Classification Inaccurate...'

# report training-set accuracy computed from the results, not hardcoded
print "Classification Accuracy was %d%%" % (100.0 * correct / len(inputs))

Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accuracy was 100%


__HW4.1.2  Is it possible to produce some set of correct training examples that will get the algorithm
to include the attribute Temperature in the learned tree, even though the true target concept is
independent of Temperature? If no, explain. If yes, give such a set.__

__HW4.1.3  Now, build a tree using only examples D1–D7. What is the classification accuracy for the
training set? What is the accuracy for the test set (examples D8–D14)? Explain why you think these
are the results.__

In [62]:
# training set: examples D1-D7
input_d7 = [
    ({'outlook':'sunny', 'temperature':'hot', 'humidity':'high', 'wind':'FALSE'}, False),
    ({'outlook':'sunny', 'temperature':'hot', 'humidity':'high', 'wind':'TRUE'}, False),
    ({'outlook':'overcast', 'temperature':'hot', 'humidity':'high', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'mild', 'humidity':'high', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'cool', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'cool', 'humidity':'normal', 'wind':'TRUE'}, False),
    ({'outlook':'overcast', 'temperature':'cool', 'humidity':'normal', 'wind':'TRUE'}, True)
    ]

# test set: examples D8-D14
test = [
    ({'outlook':'sunny', 'temperature':'mild', 'humidity':'high', 'wind':'FALSE'}, False),
    ({'outlook':'sunny', 'temperature':'cool', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'mild', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'sunny', 'temperature':'mild', 'humidity':'normal', 'wind':'TRUE'}, True),
    ({'outlook':'overcast', 'temperature':'mild', 'humidity':'high', 'wind':'TRUE'}, True),
    ({'outlook':'overcast', 'temperature':'hot', 'humidity':'normal', 'wind':'FALSE'}, True),
    ({'outlook':'rainy', 'temperature':'mild', 'humidity':'high', 'wind':'TRUE'}, False)
    ]

print "Here is the tree"
tree = build_tree_id3(input_d7)
print tree

correct = 0
for features, label in test:
    result = classify(tree, features)
    if result == label:
        correct += 1
        print 'Classification Accurate!'
    else:
        print 'Classification Inaccurate...'

# report test-set accuracy computed from the results, not hardcoded
print "Classification Accuracy was %d/%d" % (correct, len(test))

Here is the tree
('outlook', {None: True, 'rainy': ('wind', {'FALSE': True, 'TRUE': False, None: True}), 'overcast': True, 'sunny': False})
Classification Accurate!
Classification Inaccurate...
Classification Accurate!
Classification Inaccurate...
Classification Accurate!
Classification Accurate!
Classification Accurate!
Classification Accuracy was 5/7


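The tree fits its own training examples perfectly, because ID3 keeps splitting until every leaf is pure, but accuracy drops on examples it was not trained on. Every sunny day in the training set is labelled "no", so the tree predicts "no" for all sunny days and misclassifies the sunny "yes" days in the test set.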

__HW4.1.4 In this case, and others, there are only a few labelled examples available for training (that
is, no additional data is available for testing or validation). Suggest a concrete pruning strategy, that
can be readily embedded in the algorithm, to avoid over fitting. Explain why you think this strategy
should work.__

Reduced-error pruning. Starting at the leaves, each internal node is replaced with a leaf that predicts its most common class. If the prediction accuracy is not affected, the change is kept.  
This reduces the complexity of the tree, which helps to avoid overfitting, while making sure the prediction accuracy doesn't fall drastically.
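
The idea can be sketched as follows. This is an illustration under stated assumptions, not part of the answer above: it assumes a few labelled examples (`held_out`) are set aside from training, and it uses the `None` default branch of each node as that node's majority-class leaf. The names `accuracy` and `prune` are hypothetical.

```python
def classify(tree, features):
    """Walk an (attribute, {value: subtree}) tree down to a True/False leaf."""
    if isinstance(tree, bool):
        return tree
    attribute, subtrees = tree
    key = features.get(attribute)
    if key not in subtrees:
        key = None  # fall back to the default branch
    return classify(subtrees[key], features)


def accuracy(tree, examples):
    """Fraction of (features, label) pairs classified correctly (1.0 if none)."""
    if not examples:
        return 1.0
    hits = sum(1 for features, label in examples
               if classify(tree, features) == label)
    return hits / float(len(examples))


def prune(tree, held_out):
    """Bottom-up pruning: collapse a node to its default (majority) leaf when
    that does not lower accuracy on the held-out examples reaching the node."""
    if isinstance(tree, bool):
        return tree
    attribute, subtrees = tree
    pruned = {}
    for value, subtree in subtrees.items():
        if value is None:
            pruned[None] = subtree
        else:
            reaching = [(f, l) for f, l in held_out if f.get(attribute) == value]
            pruned[value] = prune(subtree, reaching)
    candidate = (attribute, pruned)
    majority_leaf = subtrees[None]
    if accuracy(majority_leaf, held_out) >= accuracy(candidate, held_out):
        return majority_leaf
    return candidate


full_tree = ('outlook', {
    None: True,
    'rainy': ('wind', {'FALSE': True, 'TRUE': False, None: True}),
    'overcast': True,
    'sunny': ('humidity', {'high': False, None: False, 'normal': True}),
})
# hypothetical held-out examples, made up for illustration
held_out = [
    ({'outlook': 'sunny', 'humidity': 'normal'}, False),
    ({'outlook': 'sunny', 'humidity': 'high'}, False),
    ({'outlook': 'overcast'}, True),
    ({'outlook': 'rainy', 'wind': 'TRUE'}, False),
    ({'outlook': 'rainy', 'wind': 'FALSE'}, True),
]
pruned_tree = prune(full_tree, held_out)
```

In this made-up case the humidity split under `sunny` does not help on the held-out examples, so that node collapses to a single leaf while the `wind` split under `rainy` survives. A node that no held-out example reaches is also collapsed, which is a deliberate bias toward smaller trees.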

 <a name="4.2"></a>
 ## HW4.2 Regression Tree (OPTIONAL Homework) 
 
[Back to Table of Contents](#TOC)

Implement a decision tree algorithm for regression for two continuous input variables and one categorical input variable on a single-core computer using Python. 

- Use the IRIS dataset to evaluate your code, where the input variables are Petal.Length, Petal.Width, and Species, and the target (output) variable is Sepal.Length. 
- Use the same dataset to train and test your implementation. 
- Stop expanding nodes once you have fewer than ten (10) examples (along with the usual stopping criteria). 
- Report the mean squared error for your implementation and contrast that with the MSE from scikit-learn's implementation on this dataset (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
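
As a starting point, the core step of a regression tree is choosing a threshold on a continuous variable that minimizes the squared error of the two resulting groups. The sketch below shows only that step, on made-up numbers rather than the IRIS data. The function names `sse` and `best_threshold` are hypothetical, and a full solution would still need the recursion, the categorical split, and the ten-example stopping rule.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / float(len(values))
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, ys):
    """Return (threshold, total_sse) for the split x <= threshold that
    minimizes the combined SSE of the two sides; (None, sse(ys)) if no
    split improves on leaving the node unsplit."""
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# toy data: the target jumps from 1 to 5 between x = 3 and x = 10
split = best_threshold([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5])
```

Each leaf would then predict the mean of its training targets, and the MSE asked for above is the average squared difference between those predictions and the true values.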


 <a name="4.3"></a>
## HW4.3 Predict survival on the Titanic using Python (Logistic regression, SVMs, Random Forests)

[Back to Table of Contents](#TOC)

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, you need to review (and edit the code) in this [notebook](http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/kmbgrkhh73931lo/Titanic-EDA-LogisticRegression.ipynb) to analyze what sorts of people were likely to survive. In particular, please look at how the tools of machine learning are used to predict which passengers survived the tragedy. Please share any useful graphs/analysis you come up with via the group email.

For more details see:

* https://www.kaggle.com/c/titanic

In [65]:
!pip install -r requirements.txt

Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
You are using pip version 8.1.1, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [66]:
!pip install --upgrade pip

Collecting pip
  Downloading pip-8.1.2-py2.py3-none-any.whl (1.2MB)
    100% |████████████████████████████████| 1.2MB 293kB/s 
Installing collected packages: pip
  Found existing installation: pip 8.1.1
    Uninstalling pip-8.1.1:
      Successfully uninstalled pip-8.1.1
Successfully installed pip-8.1.2


In [74]:
!pip install -r requirements.txt

You must give at least one requirement to install (see "pip help install")


In [75]:
import kaggleaux as ka

ImportError: No module named kaggleaux

In [18]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric import smoothers_lowess
from pandas import Series, DataFrame
from patsy import dmatrices
from sklearn import datasets, svm
from KaggleAux import predict as ka # see github.com/agconti/kaggleaux for more details

ImportError: No module named KaggleAux

Importing didn't work, so I downloaded the test and train data directly from Kaggle.com.

In [76]:
!head test.csv

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q
899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C


In [77]:
!head train.csv

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S


Roughly 900 records in train, 420 records in test

In [19]:
df = pd.read_csv("train.csv")

In [6]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [20]:
df = df.drop(['Ticket','Cabin'], axis=1)
# Remove NaN values
df = df.dropna() 

### Predicting Survival with Models

It makes sense to use Logistic Regression since we are predicting a binary outcome (survived or did not survive).  
This is what the notebook's author does first, and it is what I would have attempted as well.

I tried adding the "Fare" variable, which he excluded.

In [12]:
# model formula
# here the ~ sign is an = sign, and the features of our dataset
# are written as a formula to predict survived. The C() lets our 
# regression know that those variables are categorical.
# Ref: http://patsy.readthedocs.org/en/latest/formulas.html
formula = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp  + C(Embarked) + Fare' 
# create a results dictionary to hold our regression results for easy analysis later        
results = {} 

# create a regression friendly dataframe using patsy's dmatrices function
y,x = dmatrices(formula, data=df, return_type='dataframe')

# instantiate our model
model = sm.Logit(y,x)

# fit our model to the training data
res = model.fit()

# save the result for outputing predictions later
results['Logit'] = [res, formula]
res.summary()

Optimization terminated successfully.
         Current function value: 0.444229
         Iterations 6


|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 703 |
| Method: | MLE | Df Model: | 8 |
| Date: | Mon, 15 Aug 2016 | Pseudo R-squ.: | 0.3417 |
| Time: | 11:15:53 | Log-Likelihood: | -316.29 |
| converged: | True | LL-Null: | -480.45 |
|  |  | LLR p-value: | 3.814e-66 |

|  | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 4.4240 | 0.534 | 8.277 | 0.000 | 3.376 5.472 |
| C(Pclass)[T.2] | -1.2037 | 0.328 | -3.675 | 0.000 | -1.846 -0.562 |
| C(Pclass)[T.3] | -2.4182 | 0.340 | -7.119 | 0.000 | -3.084 -1.752 |
| C(Sex)[T.male] | -2.6163 | 0.218 | -11.992 | 0.000 | -3.044 -2.189 |
| C(Embarked)[T.Q] | -0.8154 | 0.598 | -1.363 | 0.173 | -1.988 0.357 |
| C(Embarked)[T.S] | -0.4047 | 0.274 | -1.475 | 0.140 | -0.943 0.133 |
| Age | -0.0433 | 0.008 | -5.202 | 0.000 | -0.060 -0.027 |
| SibSp | -0.3794 | 0.125 | -3.036 | 0.002 | -0.624 -0.134 |
| Fare | 0.0012 | 0.002 | 0.468 | 0.640 | -0.004 0.006 |


You can see that the variables "Embarked" and "Fare" are not significant (p-value greater than 0.05).  
Let's rerun the above excluding those variables.

In [17]:
newformula = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp' 
# create a results dictionary to hold our regression results for easy analysis later        
newresults = {} 

# create a regression friendly dataframe using patsy's dmatrices function
y,x = dmatrices(newformula, data=df, return_type='dataframe')

# instantiate our model
model = sm.Logit(y,x)

# fit our model to the training data
res = model.fit()

# save the result for outputing predictions later
newresults['Logit'] = [res, newformula]
res.summary()

Optimization terminated successfully.
         Current function value: 0.445774
         Iterations 6


|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 714 |
| Model: | Logit | Df Residuals: | 708 |
| Method: | MLE | Df Model: | 5 |
| Date: | Mon, 15 Aug 2016 | Pseudo R-squ.: | 0.34 |
| Time: | 11:23:59 | Log-Likelihood: | -318.28 |
| converged: | True | LL-Null: | -482.26 |
|  |  | LLR p-value: | 9.745e-69 |

|  | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 4.3342 | 0.451 | 9.617 | 0.000 | 3.451 5.218 |
| C(Pclass)[T.2] | -1.4144 | 0.285 | -4.967 | 0.000 | -1.972 -0.856 |
| C(Pclass)[T.3] | -2.6526 | 0.286 | -9.280 | 0.000 | -3.213 -2.092 |
| C(Sex)[T.male] | -2.6277 | 0.215 | -12.235 | 0.000 | -3.049 -2.207 |
| Age | -0.0448 | 0.008 | -5.442 | 0.000 | -0.061 -0.029 |
| SibSp | -0.3802 | 0.122 | -3.129 | 0.002 | -0.618 -0.142 |


You can see the pseudo R-squared stays basically the same (0.34) even if you take out those two variables.  
Let's try adding logs and squares of the variables to see if we can increase the pseudo R-squared.
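
For reference, the "Pseudo R-squ." value that statsmodels reports for a Logit fit is McFadden's pseudo R-squared, which is 1 minus the ratio of the fitted log-likelihood to the null (intercept-only) log-likelihood. The short check below recomputes it from the Log-Likelihood and LL-Null values in the two summaries above. The function name `mcfadden_pseudo_r2` is hypothetical.

```python
def mcfadden_pseudo_r2(log_likelihood, null_log_likelihood):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1.0 - log_likelihood / null_log_likelihood


# (Log-Likelihood, LL-Null) pairs copied from the two summaries above
with_embarked_and_fare = mcfadden_pseudo_r2(-316.29, -480.45)
without_them = mcfadden_pseudo_r2(-318.28, -482.26)

print(round(with_embarked_and_fare, 4))
print(round(without_them, 4))
```

Because this measure compares log-likelihoods rather than explained variance, it is not directly comparable to the R-squared of a linear regression.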

In [91]:
import csv
from math import log
                    
with open('train.csv','r') as csvinput:
    with open('train_alt.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput, lineterminator='\n')
        reader = csv.reader(csvinput)

        all = []
        row = next(reader)
        row.append('logAge')
        all.append(row)

        for row in reader:
            try:
                row.append(log(float(row[5])))
            except ValueError:
                row.append(row[5])
            all.append(row)
        
        writer.writerows(all)


with open('train_alt.csv','r') as csvinput:
    with open('train_alt2.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput, lineterminator='\n')
        reader = csv.reader(csvinput)

        all = []
        row = next(reader)
        row.append('logSib')
        all.append(row)

        for row in reader:
            try:
                row.append(log(float(row[6])))
            except ValueError:
                row.append(row[6])
            all.append(row)
            
        writer.writerows(all)

In [93]:
!head train_alt2.csv

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,logAge,logSib
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,3.091042453358316,0.0
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C,3.6375861597263857,0.0
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,3.258096538021482,0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,3.5553480614894135,0.0
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,3.5553480614894135,0
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,0
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S,3.9889840465642745,0
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S,0.6931471805599453,1.0986122886681098
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S,3.295836866004329,0
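
Note that `math.log` raises `ValueError` for 0, so in the cell above a `SibSp` of 0 falls into the `except` branch and is written back unchanged as `0`. That makes it indistinguishable from `log(1) = 0.0`, as the rows in the output show. One common alternative is `log(1 + x)` via `math.log1p`, which is defined at 0. It is sketched here as an assumption, not as what this notebook does, and the helper name `log1p_or_blank` is hypothetical.

```python
from math import log1p


def log1p_or_blank(text):
    """Return log(1 + x) for a numeric string, or '' when the field is
    blank or otherwise not a valid input."""
    try:
        return log1p(float(text))
    except ValueError:
        return ""


# SibSp values 0 and 1 now map to different numbers; a missing Age stays blank
transformed = [log1p_or_blank(value) for value in ["0", "1", "3", ""]]
```

Blank fields stay blank, so a later `dropna()` should still remove rows with missing ages.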


In [101]:
new_df = pd.read_csv("train_alt2.csv")
new_df = new_df.drop(['Ticket','Cabin'], axis=1)
# Remove NaN values
new_df = new_df.dropna() 

newformula = 'Survived ~ C(Pclass) + C(Sex) + logAge + logSib + C(Sex)*C(Pclass)' 
# create a results dictionary to hold our regression results for easy analysis later        
newresults = {} 

# create a regression friendly dataframe using patsy's dmatrices function
y,x = dmatrices(newformula, data=new_df, return_type='dataframe')

# instantiate our model
model = sm.Logit(y,x)

# fit our model to the training data
res = model.fit()

# save the result for outputing predictions later
newresults['Logit'] = [res, newformula]
res.summary()

Optimization terminated successfully.
         Current function value: 0.408298
         Iterations 7


|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 704 |
| Method: | MLE | Df Model: | 7 |
| Date: | Mon, 15 Aug 2016 | Pseudo R-squ.: | 0.3949 |
| Time: | 15:28:42 | Log-Likelihood: | -290.71 |
| converged: | True | LL-Null: | -480.45 |
|  |  | LLR p-value: | 5.955e-78 |

|  | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 7.3656 | 0.879 | 8.376 | 0.000 | 5.642 9.089 |
| C(Pclass)[T.2] | -1.1038 | 0.736 | -1.500 | 0.134 | -2.547 0.339 |
| C(Pclass)[T.3] | -4.0546 | 0.641 | -6.324 | 0.000 | -5.311 -2.798 |
| C(Sex)[T.male] | -3.7302 | 0.632 | -5.903 | 0.000 | -4.969 -2.492 |
| C(Sex)[T.male]:C(Pclass)[T.2] | -0.9388 | 0.836 | -1.123 | 0.262 | -2.578 0.700 |
| C(Sex)[T.male]:C(Pclass)[T.3] | 2.2432 | 0.691 | 3.247 | 0.001 | 0.889 3.597 |
| logAge | -1.1099 | 0.172 | -6.463 | 0.000 | -1.446 -0.773 |
| logSib | -1.8396 | 0.426 | -4.314 | 0.000 | -2.675 -1.004 |


Slightly better, but not by much.  

In [107]:
results

{'Logit': [<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x11ca409d0>,
  'Survived ~ C(Pclass) + C(Sex) + Age + SibSp  + C(Embarked) + Fare']}

### Try the decision tree from 4.1

Let's try a decision tree approach instead.  
Putting the training dataset into a shape we can put into the decision tree algorithm.

In [38]:
import csv
import json
from sys import argv

with open("train.csv",'r') as f:
    with open("updated_train.csv",'w') as f1:
        f.next() # skip header line
        for line in f:
            f1.write(line)

# keep only the columns we need (PassengerId, Survived, Name, Sex)
with open('updated_train.csv','rb') as source:
    rdr = csv.reader(source)
    with open('updated_train_2.csv','wb') as result:
        wtr = csv.writer(result)
        for r in rdr:
            wtr.writerow( (r[0], r[1], r[3], r[4]) )

# convert each CSV row to one JSON object per line;
# the with-blocks close both files so train.json is fully written
fieldnames = ("PassengerId","Survived","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked")
with open('updated_train.csv', 'r') as csvfile:
    with open('train.json', 'w') as jsonfile:
        reader = csv.DictReader(csvfile, fieldnames)
        for row in reader:
            json.dump(row, jsonfile)
            jsonfile.write('\n')

In [172]:
train_lines = []
for line in open('train.json', 'r'):
    try:
        train_lines.append(json.loads(line))
    except ValueError:
        pass

In [219]:
for thing in train_lines:
    try:
        del thing['Age']
        del thing['Name']
        del thing['Ticket']
    except KeyError:
        pass

print train_lines
    
# d = {'a':1,'b':2}
# del d['a']
# print d

[{u'Fare': u'7.25', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'male', u'Survived': u'0', u'SibSp': u'1', u'Ticket': u'A/5 21171', u'Cabin': u''}, {u'Fare': u'71.2833', u'Embarked': u'C', u'Parch': u'0', u'Pclass': u'1', u'Sex': u'female', u'Survived': u'1', u'SibSp': u'1', u'Ticket': u'PC 17599', u'Cabin': u'C85'}, {u'Fare': u'7.925', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'female', u'Survived': u'1', u'SibSp': u'0', u'Ticket': u'STON/O2. 3101282', u'Cabin': u''}, {u'Fare': u'53.1', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'1', u'Sex': u'female', u'Survived': u'1', u'SibSp': u'1', u'Ticket': u'113803', u'Cabin': u'C123'}, {u'Fare': u'8.05', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'male', u'Survived': u'0', u'SibSp': u'0', u'Ticket': u'373450', u'Cabin': u''}, {u'Fare': u'8.4583', u'Embarked': u'Q', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'male', u'Survived': u'0', u'SibSp': u'0', u'Ticket': u'330877', u'Cabin': u''},

In [220]:
train_data = []
for thing in train_lines:
    try:
        # pop() removes the label from the feature dict and returns it
        survived = thing.pop('Survived') == '1'
    except KeyError:
        # skip records that have no label
        continue
    train_data.append((thing, survived))

print train_data

[({u'Fare': u'7.25', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'male', u'SibSp': u'1', u'Ticket': u'A/5 21171', u'Cabin': u''}, False), ({u'Fare': u'71.2833', u'Embarked': u'C', u'Parch': u'0', u'Pclass': u'1', u'Sex': u'female', u'SibSp': u'1', u'Ticket': u'PC 17599', u'Cabin': u'C85'}, True), ({u'Fare': u'7.925', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'female', u'SibSp': u'0', u'Ticket': u'STON/O2. 3101282', u'Cabin': u''}, True), ({u'Fare': u'53.1', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'1', u'Sex': u'female', u'SibSp': u'1', u'Ticket': u'113803', u'Cabin': u'C123'}, True), ({u'Fare': u'8.05', u'Embarked': u'S', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'male', u'SibSp': u'0', u'Ticket': u'373450', u'Cabin': u''}, False), ({u'Fare': u'8.4583', u'Embarked': u'Q', u'Parch': u'0', u'Pclass': u'3', u'Sex': u'male', u'SibSp': u'0', u'Ticket': u'330877', u'Cabin': u''}, False), ({u'Fare': u'51.8625', u'Embarked': u'S', u'Parch': u'0', u'Pc

In [200]:
print "Here is the tree"
tree = build_tree_id3(train_data)
print tree

Here is the tree
(u'Name', {u'Mellors, Mr. William John': True, u'Ridsdale, Miss. Lucy': True, u'Johnson, Mr. Alfred': False, u'Svensson, Mr. Olof': False, u'Richards, Mrs. Sidney (Emily Hocking)': True, u'Hendekovic, Mr. Ignjac': False, u'Burke, Mr. Jeremiah': False, u'Turpin, Mr. William John Robert': False, u'Brown, Mr. Thomas William Solomon': False, u'Corn, Mr. Harry': False, u'Hart, Mr. Benjamin': False, u'Sage, Miss. Dorothy Edith "Dolly"': False, u'Thorne, Mrs. Gertrude Maybelle': True, u'Skoog, Master. Harald': False, u'Andreasson, Mr. Paul Edvin': False, u'Klaber, Mr. Herman': False, u'Carter, Rev. Ernest Courtenay': False, u'Mamee, Mr. Hanna': True, u'Sjoblom, Miss. Anna Sofia': True, u'Lefebre, Miss. Jeannie': False, u'Coutts, Master. Eden Leslie "Neville"': True, u'Thorneycroft, Mr. Percival': False, u'Turcin, Mr. Stjepan': False, u'Smith, Mr. Thomas': False, u'Appleton, Mrs. Edward Dale (Charlotte Lamson)': True, u'Lovell, Mr. John Hall ("Henry")': False, u'Andersson, Mis
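One way to read the output above: the root of the tree is `Name`, and every name maps straight to a leaf. The sketch below illustrates why an ID3-style learner tends to prefer such an attribute, assuming `build_tree_id3` ranks attributes by information gain (the conventional ID3 criterion; the notebook's own implementation is not shown here). The functions `entropy` and `information_gain` and the rows in `toy_rows` are hypothetical and are not taken from the Titanic data.

```python
from collections import Counter, defaultdict
from math import log


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    return -sum((n / total) * log(n / total, 2) for n in Counter(labels).values())


def information_gain(rows, attribute):
    """Entropy reduction from splitting (features, label) pairs on one attribute."""
    partitions = defaultdict(list)
    for features, label in rows:
        partitions[features[attribute]].append(label)
    total = float(len(rows))
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy([label for _, label in rows]) - remainder


# Hypothetical toy rows in the same (features, label) shape as train_data.
toy_rows = [
    ({'Name': 'A', 'Sex': 'male'}, False),
    ({'Name': 'B', 'Sex': 'male'}, True),
    ({'Name': 'C', 'Sex': 'female'}, True),
    ({'Name': 'D', 'Sex': 'female'}, True),
    ({'Name': 'E', 'Sex': 'male'}, False),
    ({'Name': 'F', 'Sex': 'female'}, False),
]

gain_name = information_gain(toy_rows, 'Name')  # every partition holds one row
gain_sex = information_gain(toy_rows, 'Sex')    # partitions still mix both labels
```

Because each `Name` value appears once, every partition is pure and the split removes all of the entropy, so `Name` wins over `Sex` even though it cannot generalize to unseen passengers.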

The printed tree is hard to make sense of, but the build did work.  
Classification rate on the training set:

In [202]:
correct = 0
for line in train_data:
    result = classify(tree, line[0])
    if result == line[1]:
        correct += 1
        print 'Classification Accurate!'
    else:
        print 'Classification Inaccurate...'

# compute the accuracy from the counts instead of hard-coding it
print "Classification Accuracy was %.0f%%" % (100.0 * correct / len(train_data))

Classification Accurate!
Classification Accurate!
Classification Accurate!
...
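The accuracy above is measured on the same rows the tree was built from, so it says little about unseen passengers. A minimal sketch of a held-out evaluation follows. The helpers `train_test_split` and `accuracy`, the stand-in classifier `predict_by_sex`, and the rows in `toy_rows` are all hypothetical illustrations, not part of the notebook.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, rows):
    """Fraction of (features, label) rows for which predict(features) == label."""
    if not rows:
        return 0.0
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / float(len(rows))


def predict_by_sex(features):
    """Hypothetical stand-in classifier: predicts survival for females only."""
    return features['Sex'] == 'female'


# Hypothetical rows in the same (features, label) shape as train_data.
toy_rows = [({'Sex': 'female' if i % 2 else 'male'}, i % 2 == 1) for i in range(20)]

train_rows, test_rows = train_test_split(toy_rows)
held_out_accuracy = accuracy(predict_by_sex, test_rows)
```

With the notebook's own functions, the same pattern would build the tree from `train_rows` only and pass a wrapper around `classify(tree, features)` as `predict`. How `classify` treats attribute values it never saw during training (for example a new `Name`) depends on its implementation.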


 <a name="4.4"></a>
 ## HW4.4 Heritage Healthcare Prize (Predict # Days in Hospital next year)
[Back to Table of Contents](#TOC)

1. Introduction 

The Heritage Health Prize (HHP) was a data science challenge sponsored by The Heritage Provider Network. It took place from April 4, 2011 to April 4, 2013. For information on the winning entries, please see here.

Please see the following notebooks for more background and candidate solutions:


- Spark Map-Reduce + MLlib solution (with optional extensions): [Notebook](http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/v52cxipe7yftf97/HeritageHealthPrizeUnitTestNotebook_Spark-Map-Reduce.ipynb)

- Spark SQL + MLlib solution (with optional extensions): [Notebook](http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/s2wxg6g982oho5m/HeritageHealthPrizeUnitTestNotebook_SQL_FINAL.ipynb)


Please look at Section 7 in both notebooks and complete one or more of the suggested next steps, e.g.:

* Please complete the EDA extensions using inspiration from the Titanic Notebook from above.
* __Complete Section 3.B: EDA-0. Gather information to see what transformations may need to be done on the data.__
Answer questions about each raw DataFrame. In general, is the data in good shape? For example, in each of the Target DataFrames (df_target_Y1, df_target_Y2, df_target_Y3), what values does DaysInHospital take on? Are they all integers? What values does ClaimsTruncated take on? Are they all integers? In the Claims DataFrame (df_claims), how many different ProviderIDs are there? How many different PrimaryConditionGroups are there? What are their values? What values can the CharlesonIndex take on? Are they integers? In the Drug Count DataFrame (df_drug_count), what values can DrugCount take on? Are they all integers? Given this information, what transformations are needed?

* __Complete Section 3.D: EDA-1. Create tables and graphs to display information about the transformed DataFrames.__
For inspiration, see the Titanic notebook discussed above. Answer questions about each DataFrame. For example, in each of the Target DataFrames (df_target_Y1, df_target_Y2, df_target_Y3), what is the minimum, maximum, mean, and standard deviation of DaysInHospital? In the Claims DataFrame, group by MemberID and Year and count the number of records. What is the minimum, maximum, mean, and standard deviation of the count? Do the same for the Drug Count and Lab Count DataFrames, etc.


* __Please generate an ensemble of decision tree models using 100 trees with 8 nodes and report the loss.__
Try additional models. See possibilities here (e.g. Decision Tree Regressor, Gradient-Boosted Trees Regressor, Random Forest Regressor). See an example here. Tune their hyperparameters. Try different feature selections. Try a two-step model.
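The EDA-1 step above (group by MemberID and Year, count the records, then summarize the counts) can be sketched in plain Python before moving it to Spark. The `claims` records and the helper `summarize` are hypothetical; real claim rows carry many more columns.

```python
from collections import Counter
from math import sqrt


def summarize(values):
    """Return (min, max, mean, population std dev) of a non-empty list of numbers."""
    n = float(len(values))
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return min(values), max(values), mean, sqrt(variance)


# Hypothetical claim records reduced to (MemberID, Year) pairs.
claims = [
    (1, 'Y1'), (1, 'Y1'), (1, 'Y2'),
    (2, 'Y1'), (2, 'Y1'), (2, 'Y1'), (2, 'Y1'),
    (3, 'Y2'),
]

# Equivalent of GROUP BY MemberID, Year followed by COUNT(*).
claims_per_member_year = Counter(claims)

count_min, count_max, count_mean, count_std = summarize(
    list(claims_per_member_year.values())
)
```

Note that `summarize` computes the population standard deviation (dividing by n). A library that reports the sample standard deviation (dividing by n - 1) will give a slightly larger value on the same counts.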

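For reporting the loss, a minimal sketch is given below. It assumes the loss meant here is the root mean squared logarithmic error (RMSLE) that the Heritage Health Prize used to score DaysInHospital predictions; if the referenced notebooks define the loss differently, use their definition instead. The function name `rmsle` and the sample values are hypothetical.

```python
from math import log, sqrt


def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(log(p + 1.0) - log(a + 1.0)) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical DaysInHospital predictions and true values.
loss = rmsle([0.0, 1.0, 3.0], [0, 1, 7])
```

The `+ 1.0` inside the logarithm keeps the metric defined for members with zero days in hospital, which is the most common value of the target.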