# Bonus: Working with Continuous Data

## Introduction

In this lesson, we'll move into the details of how a decision tree is trained with continuous data.  We'll continue to use our movies dataset as an example.  This is an optional reading, as the main concepts of decision trees remain the same.

### Exploring our data

As always, we'll get started by loading up our data.  The data that we'll use is a [set of movie data](https://github.com/fivethirtyeight/data/tree/master/bechdel) put together by the website 538.  Because we haven't learned anything about data cleaning yet, we'll work with a cleaned-up version of this data.  Time to load it up:

In [48]:
import pandas as pd
df = pd.read_csv('./imdb_movies.csv', index_col = 0)
df[:10]

| | title | genre | budget | runtime | year | month | revenue |
|---|---|---|---|---|---|---|---|
| 0 | Avatar | Action | 237000000 | 162.0 | 2009 | 12 | 2787965087 |
| 1 | Pirates of the Caribbean: At World's End | Adventure | 300000000 | 169.0 | 2007 | 5 | 961000000 |
| 2 | Spectre | Action | 245000000 | 148.0 | 2015 | 10 | 880674609 |
| 3 | The Dark Knight Rises | Action | 250000000 | 165.0 | 2012 | 7 | 1084939099 |
| 4 | John Carter | Action | 260000000 | 132.0 | 2012 | 3 | 284139100 |
| 5 | Spider-Man 3 | Fantasy | 258000000 | 139.0 | 2007 | 5 | 890871626 |
| 6 | Tangled | Animation | 260000000 | 100.0 | 2010 | 11 | 591794936 |
| 7 | Avengers: Age of Ultron | Action | 280000000 | 141.0 | 2015 | 4 | 1405403694 |
| 8 | Harry Potter and the Half-Blood Prince | Adventure | 250000000 | 153.0 | 2009 | 7 | 933959197 |
| 9 | Batman v Superman: Dawn of Justice | Action | 250000000 | 151.0 | 2016 | 3 | 873260194 |


### Working with continuous features

Now this data is a little bit different from our previous data.  

1. The `budget` feature is continuous

Notice that almost every movie has a different value for a feature like `budget`.  So how can we split on a feature like this?  

This time, our model treats each observed budget value as a candidate split point.  It first splits the observations into those with a budget lower than `237000000` (the first budget value) and those with a budget greater than or equal to `237000000`.  It then moves on to the next budget value and checks how well that value serves as a split point.

The candidate that does the best job of separating the data into groups of movies with similar revenues is chosen as the split.

So this is how a decision tree can work with continuous features: for each observed value, it divides the observations into those below that value and those at or above it.
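The enumeration of candidate split points described above can be sketched in a few lines. This is only an illustration, not how any particular library does it: the helper name `split_by_threshold` is hypothetical, and the budgets are the first five values from the table above. Some implementations place thresholds differently, for example midway between adjacent observed values.

```python
# The first five budget values from the table above.
budgets = [237000000, 300000000, 245000000, 250000000, 260000000]

def split_by_threshold(values, threshold):
    """Return (below, at_or_above) groups for one candidate split point."""
    below = [v for v in values if v < threshold]
    at_or_above = [v for v in values if v >= threshold]
    return below, at_or_above

# Every distinct observed value is tried as a candidate split point.
candidate_thresholds = sorted(set(budgets))

for threshold in candidate_thresholds:
    below, at_or_above = split_by_threshold(budgets, threshold)
    # Print the threshold and how many observations land on each side.
    print(threshold, len(below), len(at_or_above))
```

Note that splitting at the smallest observed value leaves the "below" group empty, so that candidate cannot separate anything.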

### Working with a continuous target

Let's say that we try splits based on movie budget, putting all of the movies with a budget under 10 million into one group and all of the movies with a budget over 10 million into another.  Did this do a good job of separating our data?  How do we rank our splits?

Let's take a look at two potential splits: one at a budget of 10 million dollars and another at a budget of 15 million dollars.  Which split does a better job of grouping the (imaginary) data?

In [63]:
split_ten_million_under = [200, 250, 270]
split_ten_million_over = [500, 600, 700, 900]

split_fifteen_million_under = [200, 250, 270, 500, 600]
split_fifteen_million_over = [700, 900]

Well, it looks like the first split does a better job, as it produces a first group with revenues that are all close together.  But how do we quantify this?  One solution is to use the mean squared error.  

Mean squared error just measures how close the datapoints are to the mean.  The smaller the mean squared error, the closer the data is to the mean, and thus the better the split -- as the split is grouping the target data together.
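Written as a formula, this looks as follows. The notation is ours and introduced only for illustration: $y_1, \dots, y_n$ are the target values (revenues) in one group, and $\bar{y}$ is their mean.

$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,
\qquad
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2
$$

Each term $(y_i - \bar{y})^2$ is the squared distance of one datapoint from the group mean, so a group whose values sit close to the mean has a small MSE.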

Ok, now let's calculate the mean squared errors for the split of a 10 million dollar budget.

First, we'll calculate the mean.

In [17]:
def calc_mean(targets): 
    return sum(targets)/len(targets)

In [64]:
# split_ten_million_under = [200, 250, 270]

mean_ten_mil = calc_mean(split_ten_million_under)
mean_ten_mil
# 240.0

240.0

Just by looking, we can already see that the numbers in `split_ten_million_under` hover pretty close to the mean.  But we can calculate this formally.

In [45]:
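def squared_diffs(targets):
    # compute the mean once, rather than once per element
    mean = calc_mean(targets)
    return [(target_val - mean)**2 for target_val in targets]

def mse(targets):
    # average squared distance of each target from the group mean
    diffs = squared_diffs(targets)
    return sum(diffs)/len(diffs)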

In [65]:
mse(split_ten_million_under)

866.6666666666666

Next, we calculate the `mse` for the observations with a budget over 10 million.

In [66]:
mse(split_ten_million_over)

21875.0

So splitting the movies into those below and above ten million gives us a different `mse` for each group.  How do we know how good this split was in total?  We can weight each group's `mse` by the share of the data that falls in that group: 3 of the 7 observations for the first group and 4 of the 7 for the second.

In [67]:
(3/7)*866 + (4/7)*21875.0
# 12871

12871.142857142857

Ok, so we have a total `cost` of about 12871.  The lower the cost, the better the split.
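The calculation above can be written as a formula. The subscripts are our own labels for the two groups: $n_{\text{under}}$ and $n_{\text{over}}$ are the group sizes, and $n$ is their total.

$$
\text{cost} = \frac{n_{\text{under}}}{n}\,\text{MSE}_{\text{under}} + \frac{n_{\text{over}}}{n}\,\text{MSE}_{\text{over}}
= \frac{3}{7}(866) + \frac{4}{7}(21875) \approx 12871
$$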

Let's compare this with our second split of the data: 

In [68]:
split_fifteen_million_under = [200, 250, 270, 500, 600]
split_fifteen_million_over = [700, 900]

In [69]:
mse(split_fifteen_million_under)

24584.0

In [70]:
mse(split_fifteen_million_over)

10000.0

Then we take the weighted sum.

In [71]:
(5/7)*24584 + (2/7)*10000 
# 20417

20417.142857142855

The first split has the lower cost (about 12871 versus 20417), meaning its groups are closer together, so that split is better according to our criterion.

Understanding the details of mean squared error is only moderately important.  The important part is that we are able to quantify how close together a collection of data is: the smaller the combined mean squared error, the closer together the data.  So when we split our data, we choose the split that results in the lowest weighted sum of the mean squared error.  
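Putting the pieces together, choosing a split can be sketched as a loop over candidate thresholds that keeps the one with the lowest weighted cost. This is a minimal sketch under stated assumptions: the names `weighted_cost` and `best_split` are hypothetical, and the budgets (in millions) are invented so that thresholds near 10 and 15 reproduce the two imaginary splits above.

```python
def calc_mean(targets):
    return sum(targets) / len(targets)

def mse(targets):
    mean = calc_mean(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)

def weighted_cost(under, over):
    """Weight each group's mse by its share of the observations."""
    total = len(under) + len(over)
    return (len(under) / total) * mse(under) + (len(over) / total) * mse(over)

# Hypothetical budgets (in millions), paired with the imaginary revenues.
budgets = [2, 5, 8, 11, 13, 20, 30]
revenues = [200, 250, 270, 500, 600, 700, 900]

def best_split(feature_values, targets):
    """Try each observed value as a threshold; return the lowest-cost one."""
    best_threshold, best_cost = None, float("inf")
    for threshold in sorted(set(feature_values)):
        under = [t for f, t in zip(feature_values, targets) if f < threshold]
        over = [t for f, t in zip(feature_values, targets) if f >= threshold]
        if not under or not over:
            continue  # a split that leaves one side empty separates nothing
        cost = weighted_cost(under, over)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

threshold, cost = best_split(budgets, revenues)
print(threshold, round(cost))
```

With these made-up budgets, the chosen threshold separates the same two groups as the 10 million dollar split, which matches the comparison worked out by hand.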

### Predicting with continuous data

Now, let's say that splitting based on budget was the last split for our data. That is, that the split resulted in two leaf nodes.  So again, we have these two groups.

In [72]:
split_ten_million_under = [200, 250, 270]
split_ten_million_over = [500, 600, 700, 900]

And now we have a new observation that ends up in the leaf node for those with budgets under 10 million.  How do we predict this observation's revenue?

We just calculate the mean of the training targets in that leaf node.

In [73]:
calc_mean(split_ten_million_under)

240.0

So we predict this observation will have a revenue of 240, as that is the average of the leaf node.
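The whole prediction step, routing an observation to a leaf and returning that leaf's mean, can be sketched as a single function. The name `predict_revenue` and its `threshold` parameter are hypothetical, and the function assumes the tree consists of just this one split.

```python
def calc_mean(targets):
    return sum(targets) / len(targets)

# Training revenues that ended up in each leaf node.
split_ten_million_under = [200, 250, 270]
split_ten_million_over = [500, 600, 700, 900]

def predict_revenue(budget, threshold=10_000_000):
    """Route an observation to a leaf by budget, then return that leaf's mean."""
    if budget < threshold:
        return calc_mean(split_ten_million_under)
    return calc_mean(split_ten_million_over)

print(predict_revenue(8_000_000))   # lands in the "under" leaf
print(predict_revenue(50_000_000))  # lands in the "over" leaf
```

Every observation that lands in the same leaf receives the same prediction, so a tree with two leaves can only ever output two distinct revenue values.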

### Summary

In this lesson, we learned about working with continuous data.  We saw that we can still split on a continuous feature: we try each observed value as a split point and determine how well it separates our data.

Then, for our continuous target data, we saw that we can use the mean squared error to calculate how closely grouped together our data is after a split.  The lower the mean squared error, the closer together the data, and the better the split.  

Finally, we saw that for continuous target data, we can still make a prediction for an observation that falls into a group: the prediction is simply the mean of the training target values that fell into that group. 
