<a href="https://colab.research.google.com/github/ludawg44/jigsawlabs/blob/master/16May20_2_training_decision_trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Decision Trees

## Introduction

In the last lesson, we learned about the hypothesis function for decision trees. We saw that it asks a series of questions to predict whether the target falls in one category or the other.

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/decision-tree-real-estate.png" width="50%">

> Remember that a hypothesis function takes in a set of features and comes close to predicting the correct target.

But how does a machine learning algorithm arrive at a hypothesis function?  In this lesson we'll find out.

## Our Data

Let's take a look at some training data of past customer leads.  Below, we have eight observations with feature data and the corresponding target value.

| Attended College | Under Thirty | Borough   | Income | Customer |
| ---------------- | ------------ | --------- | ------ | :------: |
| ?                | Yes          | Manhattan | < 55   |    0     |
| Yes              | Yes          | Brooklyn  | < 55   |    0     |
| ?                | No           | Brooklyn  | < 55   |    1     |
| No               | No           | Queens    | > 55   |    1     |
| ?                | No           | Queens    | < 55   |    1     |
| Yes              | No           | Queens    | > 55   |    0     |
| Yes              | No           | Queens    | > 55   |    0     |
| Yes              | Yes          | Manhattan | > 55   |    0     |

As we know, we start with this training data, and from it we want to find a hypothesis function that will predict the target of our past data, and ideally of our future data.
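
To make the table concrete, here is a minimal sketch of how we might hold these eight observations in plain Python. The column names (`college`, `under_thirty`, and so on) and the variable names are assumptions chosen for this example, not something the lesson prescribes.

```python
# Hypothetical column names for this sketch; "customer" is the target.
COLUMNS = ["college", "under_thirty", "borough", "income", "customer"]

# One tuple per row of the training table, in the same order.
ROWS = [
    ("?",   "Yes", "Manhattan", "< 55", 0),
    ("Yes", "Yes", "Brooklyn",  "< 55", 0),
    ("?",   "No",  "Brooklyn",  "< 55", 1),
    ("No",  "No",  "Queens",    "> 55", 1),
    ("?",   "No",  "Queens",    "< 55", 1),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "Yes", "Manhattan", "> 55", 0),
]

# Each observation becomes a dictionary of feature values plus the target.
leads = [dict(zip(COLUMNS, row)) for row in ROWS]

features = [column for column in COLUMNS if column != "customer"]
targets = [lead["customer"] for lead in leads]

print(len(leads), "observations,", sum(targets), "customers")
# 8 observations, 3 customers
```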

So now we want to find a pattern in the data that arrives at a hypothesis function.  That's training.  Let's get to it.

## The First Steps

Ok, so looking at our training data, how would you use the data to predict if someone will become a customer or not?

| Attended College | Under Thirty | Borough   | Income | Customer |
| ---------------- | ------------ | --------- | ------ | :------: |
| ?                | Yes          | Manhattan | < 55   |    0     |
| Yes              | Yes          | Brooklyn  | < 55   |    0     |
| ?                | No           | Brooklyn  | < 55   |    1     |
| No               | No           | Queens    | > 55   |    1     |
| ?                | No           | Queens    | < 55   |    1     |
| Yes              | No           | Queens    | > 55   |    0     |
| Yes              | No           | Queens    | > 55   |    0     |
| Yes              | Yes          | Manhattan | > 55   |    0     |

We begin by finding the feature that most separates the data by the outcomes. For example, let's split the training data by the attended college feature. 

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-4.png" width="50%">

> Focus in on the 1s and 0s in the bottom layer of the tree.  

What the above graph indicates is that:
* None of the leads that *did* go to college ended up becoming customers.
* The one lead that didn't go to college did become a customer, and the leads with a `?` are mixed.

> You can check the training data table to confirm this.

Now let's try another feature.  Let's see what happens if we were to split the data based on someone's income.

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-3.png" width="50%">

Here, if we split our data according to income, we do not separate customers from non-customers as cleanly.  Notice that each subgroup has both customers and non-customers.

So separating our data by the attended college feature does a better job of distinguishing between customers and non-customers than separating data by the income feature.
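
We can check this comparison against the training table with a few lines of Python. This is only a sketch: the column names and the helper `split_by` are hypothetical names introduced for this example. It groups the target values by each value of a feature so we can see which subgroups are mixed.

```python
from collections import defaultdict

# Hypothetical column names for this sketch; "customer" is the target.
COLUMNS = ["college", "under_thirty", "borough", "income", "customer"]
ROWS = [
    ("?",   "Yes", "Manhattan", "< 55", 0),
    ("Yes", "Yes", "Brooklyn",  "< 55", 0),
    ("?",   "No",  "Brooklyn",  "< 55", 1),
    ("No",  "No",  "Queens",    "> 55", 1),
    ("?",   "No",  "Queens",    "< 55", 1),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "Yes", "Manhattan", "> 55", 0),
]
leads = [dict(zip(COLUMNS, row)) for row in ROWS]


def split_by(observations, feature):
    """Group the target values by each value of the given feature."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs[feature]].append(obs["customer"])
    return dict(groups)


print(split_by(leads, "college"))
# {'?': [0, 1, 1], 'Yes': [0, 0, 0, 0], 'No': [1]}
print(split_by(leads, "income"))
# {'< 55': [0, 0, 1, 1], '> 55': [1, 0, 0, 0]}
```

With the college split, only the `?` subgroup contains both 1s and 0s. With the income split, both subgroups do.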

## Our Training Procedure

So that is what we'll do:

1. We'll go through every feature, one by one, and find the feature that most separates our training data between customers and non-customers.  

Here, the feature that most separates our training data between customers and non-customers is `college`.

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-4.png" width="50%">

> For this lesson, let's skip over *how* we find the feature that most separates the training data.  For now, we'll simply eyeball it.
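
If you'd like something slightly more concrete than eyeballing, here is one rough stand-in. It is an assumption made for this sketch, not the measure this course or any particular library uses (common measures include Gini impurity and entropy). For each feature, we simply count how many observations land in a subgroup that contains only one target value. The name `pure_count` and the column names are hypothetical.

```python
from collections import defaultdict

# Hypothetical column names for this sketch; "customer" is the target.
COLUMNS = ["college", "under_thirty", "borough", "income", "customer"]
ROWS = [
    ("?",   "Yes", "Manhattan", "< 55", 0),
    ("Yes", "Yes", "Brooklyn",  "< 55", 0),
    ("?",   "No",  "Brooklyn",  "< 55", 1),
    ("No",  "No",  "Queens",    "> 55", 1),
    ("?",   "No",  "Queens",    "< 55", 1),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "Yes", "Manhattan", "> 55", 0),
]
leads = [dict(zip(COLUMNS, row)) for row in ROWS]
features = ["college", "under_thirty", "borough", "income"]


def pure_count(observations, feature):
    """Count observations that fall in subgroups with a single target value."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs[feature]].append(obs["customer"])
    return sum(len(t) for t in groups.values() if len(set(t)) == 1)


scores = {feature: pure_count(leads, feature) for feature in features}
best = max(scores, key=scores.get)

print(scores)
# {'college': 5, 'under_thirty': 3, 'borough': 2, 'income': 0}
print(best)
# college
```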

So now that we found this feature, we can make college the first test in our hypothesis function.  

> Hypothesis function, version 1.

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-6.png" width="50%">

So right now, following the training table, if a lead attended college, the hypothesis function predicts she will not become a customer, and if she did not attend, it predicts she will become a customer.  

Notice that our hypothesis function **does not make a prediction** for the observations with a value of `?`.  This is because the `?` subgroup still contains both customers and non-customers, so the split did not perfectly separate it.  
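
Version 1 of the hypothesis function can be sketched in Python as follows. Rather than hard-coding the predictions, this sketch reads them off the training table: a feature value gets a prediction only when its subgroup has a single target value, and `None` otherwise. The names `learn_rule`, `college_rule`, and `hypothesis_v1` are hypothetical, as are the column names.

```python
from collections import defaultdict

# Hypothetical column names for this sketch; "customer" is the target.
COLUMNS = ["college", "under_thirty", "borough", "income", "customer"]
ROWS = [
    ("?",   "Yes", "Manhattan", "< 55", 0),
    ("Yes", "Yes", "Brooklyn",  "< 55", 0),
    ("?",   "No",  "Brooklyn",  "< 55", 1),
    ("No",  "No",  "Queens",    "> 55", 1),
    ("?",   "No",  "Queens",    "< 55", 1),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "Yes", "Manhattan", "> 55", 0),
]
leads = [dict(zip(COLUMNS, row)) for row in ROWS]


def learn_rule(observations, feature):
    """Map each feature value to its target if the subgroup is pure, else None."""
    groups = defaultdict(set)
    for obs in observations:
        groups[obs[feature]].add(obs["customer"])
    return {
        value: (next(iter(targets)) if len(targets) == 1 else None)
        for value, targets in groups.items()
    }


college_rule = learn_rule(leads, "college")


def hypothesis_v1(lead):
    """Predict from the college feature alone; None means no prediction yet."""
    return college_rule.get(lead["college"])


for value in ["Yes", "No", "?"]:
    print(value, "->", hypothesis_v1({"college": value}))
```

The `?` value maps to `None` because that subgroup is still mixed.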

Remember?

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-4.png" width="40%">

So we separate this remaining mixed data in the `?` subgroup by making another split.  

This brings us to the next step.

2. Find the feature that most separates our *remaining mixed data*.

So as we saw above, splitting our observations by the attended college feature perfectly separated all of our data, except for those with a `?` value for college.  Let's zoom in on those three observations.

| Attended College | Under Thirty | Borough   | Income | Customer |
| ---------------- | ------------ | --------- | ------ | :------: |
| ?                | Yes          | Manhattan | < 55   |    0     |
| ?                | No           | Brooklyn  | < 55   |    1     |
| ?                | No           | Queens    | < 55   |    1     |


Our next step is to find the feature that most separates this *remaining data*.  Look at the table above.  Do you see it?

Well, it's whether or not someone is under 30.  Let's take a look.

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/remaining-leads.png" width="40%">

So splitting the remaining data by under 30 *perfectly separates* this mixed group of data.  And now that we've identified this feature to split by, we add this to our hypothesis function. 

This takes our hypothesis function from here:

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-6.png" width="50%">

To here:

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/decision-tree-real-estate.png" width="50%">
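
In code, the second version of the hypothesis function might look like the sketch below. As before, the predictions are learned from the training table rather than hard-coded, and the names (`learn_rule`, `hypothesis_v2`, the column names) are hypothetical. The second rule is learned only from the three remaining `?` observations.

```python
from collections import defaultdict

# Hypothetical column names for this sketch; "customer" is the target.
COLUMNS = ["college", "under_thirty", "borough", "income", "customer"]
ROWS = [
    ("?",   "Yes", "Manhattan", "< 55", 0),
    ("Yes", "Yes", "Brooklyn",  "< 55", 0),
    ("?",   "No",  "Brooklyn",  "< 55", 1),
    ("No",  "No",  "Queens",    "> 55", 1),
    ("?",   "No",  "Queens",    "< 55", 1),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "Yes", "Manhattan", "> 55", 0),
]
leads = [dict(zip(COLUMNS, row)) for row in ROWS]


def learn_rule(observations, feature):
    """Map each feature value to its target if the subgroup is pure, else None."""
    groups = defaultdict(set)
    for obs in observations:
        groups[obs[feature]].add(obs["customer"])
    return {
        value: (next(iter(targets)) if len(targets) == 1 else None)
        for value, targets in groups.items()
    }


# First test: college, learned from all of the training data.
college_rule = learn_rule(leads, "college")

# Second test: under thirty, learned only from the remaining mixed group.
remaining = [lead for lead in leads if lead["college"] == "?"]
under_thirty_rule = learn_rule(remaining, "under_thirty")


def hypothesis_v2(lead):
    """Ask about college first, then about age if college did not decide."""
    prediction = college_rule.get(lead["college"])
    if prediction is None:
        prediction = under_thirty_rule.get(lead["under_thirty"])
    return prediction


correct = sum(hypothesis_v2(lead) == lead["customer"] for lead in leads)
print(correct, "of", len(leads), "training observations predicted correctly")
# 8 of 8 training observations predicted correctly
```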

Let's summarize our training procedure for finding our hypothesis function as follows:

**Start with a mixed group of training data, and then:**

1. Find the feature that most separates the data by the target values, and

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/all-obs.png" width="30%" style="float: left;">
<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-4.png" width="30%" style="float: left;">

2. Add that feature to the hypothesis function to separate the mixed group

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/customer-leads-6.png" width="30%" style="float: left;">

3. Then for each remaining group of mixed data, repeat steps one and two

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/remaining-obs.png" width="30%" style="float: left;">
<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/remaining-leads.png" width="30%" style="float: left;">

**Stop** when there are no remaining mixed groups, and you'll have the final hypothesis function.

<img src="https://storage.googleapis.com/curriculum-assets/intro-to-ml/decision-tree-real-estate.png" width="30%" style="float: left;">
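
The whole procedure can be sketched as a short recursive function. Treat this as an illustration under stated assumptions, not a definitive implementation: "most separates" is approximated by counting observations that land in single-target subgroups, ties go to whichever feature is listed first, and a majority vote is used if we run out of features. All function and column names are hypothetical.

```python
from collections import Counter, defaultdict

TARGET = "customer"
FEATURES = ["college", "under_thirty", "borough", "income"]
ROWS = [
    ("?",   "Yes", "Manhattan", "< 55", 0),
    ("Yes", "Yes", "Brooklyn",  "< 55", 0),
    ("?",   "No",  "Brooklyn",  "< 55", 1),
    ("No",  "No",  "Queens",    "> 55", 1),
    ("?",   "No",  "Queens",    "< 55", 1),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "No",  "Queens",    "> 55", 0),
    ("Yes", "Yes", "Manhattan", "> 55", 0),
]
leads = [dict(zip(FEATURES + [TARGET], row)) for row in ROWS]


def split_by(observations, feature):
    """Group whole observations by each value of the given feature."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs[feature]].append(obs)
    return groups


def pure_count(observations, feature):
    """Assumed score: observations that land in single-target subgroups."""
    return sum(
        len(group)
        for group in split_by(observations, feature).values()
        if len({obs[TARGET] for obs in group}) == 1
    )


def train(observations, features):
    targets = [obs[TARGET] for obs in observations]
    # Stop: the group is no longer mixed, so it becomes a prediction.
    if len(set(targets)) == 1:
        return targets[0]
    # Assumed fallback: no features left, so predict the most common target.
    if not features:
        return Counter(targets).most_common(1)[0][0]
    # Step 1: find the feature that most separates the data (first one wins ties).
    best = max(features, key=lambda f: pure_count(observations, f))
    rest = [f for f in features if f != best]
    # Steps 2 and 3: add the feature as a test, then repeat for each subgroup.
    return {
        "feature": best,
        "branches": {
            value: train(group, rest)
            for value, group in split_by(observations, best).items()
        },
    }


def predict(tree, lead):
    """Follow the tests down the tree; returns None for an unseen feature value."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(lead[tree["feature"]])
    return tree


tree = train(leads, FEATURES)
print(tree)
```

On this training data the sketch tests college first and then under thirty for the `?` group. Note that borough also separates those three remaining observations, so the second choice depends on the tie-breaking assumption above.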

## Summary

In this lesson we went through the training procedure for decision trees.  We saw that the process involved the following steps.

1. Find the feature that most separates the data by the target values, and
2. Add that feature to the hypothesis function to separate the mixed group
3. Then for each remaining group of mixed data, repeat steps one and two

**Stop** when there are no remaining mixed groups.
