# Decision Trees

## Introduction

Imagine that we have the following customer data.

In [84]:
import pandas as pd
df = pd.read_csv('./customer_data.csv').iloc[:, 1:]

In [85]:
import numpy as np
# Move the target column to the end and add a placeholder column for predictions.
updated_df = df.drop(columns=['customer'])
updated_df.loc[:, 'customer'] = df.customer
updated_df['customer_predictions'] = '?'

In [86]:
updated_df = updated_df.rename(columns={'graduate education': 'attended_college'})
updated_df

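|   | attended_college | under_thirty | borough | income | customer | customer_predictions |
|---|---|---|---|---|---|---|
| 0 | ? | Yes | Manhattan | < 55 | 0 | ? |
| 1 | Yes | Yes | Brooklyn | < 55 | 0 | ? |
| 2 | ? | No | Brooklyn | < 55 | 1 | ? |
| 3 | No | No | Queens | > 100 | 1 | ? |
| 4 | ? | No | Queens | 55 - 100 | 1 | ? |
| 5 | Yes | No | Manhattan | > 100 | 0 | ? |
| 6 | Yes | No | Queens | > 100 | 0 | ? |
| 7 | ? | Yes | Brooklyn | 55 - 100 | 0 | ? |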


Our target is the `customer` column, where a 1 or a 0 indicates whether a customer lead became a customer (1 means that they did). The remaining columns, apart from the `customer_predictions` placeholder, are our features.

Now, to use the above data to predict whether someone will become a customer, we decide to use a series of tests. These are the tests that we can use:

* Does the lead have a graduate level of education?
* Is the lead under thirty?
* Is the lead from Manhattan, Brooklyn, or Queens?
* Is the lead's income under 55k, over 100k, or between 55k and 100k?

We use these tests to discover whether the responses can help us predict whether or not someone will become a customer.

### The quality of a test

We have at least one test to ask of each feature in our dataset, so the next question is: how do we prioritize these tests?

A good test is one that separates those who become customers from those who do not. If we ask whether a lead earns less than 55k and find the same proportion of customers and non-customers as we started with, then the test didn't help us target customers.
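To make that comparison concrete, here is a minimal sketch that compares the overall share of customers with the share inside each income bracket. The dataframe is rebuilt inline from the table at the top so that the snippet runs on its own; the names `overall_rate` and `rate_by_income` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Income and customer columns copied from the customer table (rows 0 to 7).
df = pd.DataFrame({
    'income': ['< 55', '< 55', '< 55', '> 100', '55 - 100', '> 100', '> 100', '55 - 100'],
    'customer': [0, 0, 1, 1, 1, 0, 0, 0],
})

# Share of customers in the whole dataset (the mean of a 0/1 column).
overall_rate = df['customer'].mean()

# Share of customers within each answer to the income test.
rate_by_income = df.groupby('income')['customer'].mean()

print(overall_rate)
print(rate_by_income)
```

The overall share is 0.375, and the brackets come out at roughly 0.33, 0.33 and 0.5, so no answer to the income test isolates customers from non-customers.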

In [87]:
df[df['income'] == '< 55']['customer']

0    0
1    0
2    1
Name: customer, dtype: int64

In [29]:
df[df['income'] == '> 100']['customer']

3    1
5    0
6    0
Name: customer, dtype: int64

In [30]:
df[df['income'] == '55 - 100']['customer']

4    1
7    0
Name: customer, dtype: int64

We'll score a test by counting the number of leads that fall into a subgroup whose members are all of the same type (that is, all customers or all non-customers). None of the three income subgroups is uniform, so our income test scores a zero.
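This scoring rule can be sketched as a small helper. `purity_score` is a hypothetical name, not a function from the original notebook, and the data is rebuilt inline from the table at the top so that the snippet is self-contained.

```python
import pandas as pd

def purity_score(df, column_name, target):
    """Count the rows that land in a subgroup whose target values all agree."""
    score = 0
    for _, group in df.groupby(column_name):
        # A subgroup is uniform when it holds exactly one distinct target value.
        if group[target].nunique() == 1:
            score += len(group)
    return score

# Income and customer columns copied from the customer table (rows 0 to 7).
df = pd.DataFrame({
    'income': ['< 55', '< 55', '< 55', '> 100', '55 - 100', '> 100', '> 100', '55 - 100'],
    'customer': [0, 0, 1, 1, 1, 0, 0, 0],
})

income_score = purity_score(df, 'income', 'customer')
print(income_score)
```

Every income bracket mixes customers and non-customers, so no rows are counted.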

### Going through all of the tests

In [45]:
def split_data_by(df, column_name, target):
    # Return one dataframe per unique value of column_name,
    # keeping only that column and the target column.
    selected_dfs = []
    for value in df[column_name].unique():
        selected = df[df[column_name] == value]
        selected = selected[[column_name, target]]
        selected_dfs.append(selected)
    return selected_dfs

* graduate education

In [46]:
split_data_by(df, 'graduate education', 'customer')

[  graduate education  customer
 0                  ?         0
 2                  ?         1
 4                  ?         1
 7                  ?         0,   graduate education  customer
 1                Yes         0
 5                Yes         0
 6                Yes         0,   graduate education  customer
 3                 No         1]

This scores a 4: the three `Yes` leads and the single `No` lead fall into uniform subgroups.

* under thirty

In [47]:
split_data_by(df, 'under_thirty', 'customer')

[  under_thirty  customer
 0          Yes         0
 1          Yes         0
 7          Yes         0,   under_thirty  customer
 2           No         1
 3           No         1
 4           No         1
 5           No         0
 6           No         0]

This scores a 3: the three leads under thirty all share the same outcome.

* borough

In [48]:
split_data_by(df, 'borough', 'customer')

[     borough  customer
 0  Manhattan         0
 5  Manhattan         0,     borough  customer
 1  Brooklyn         0
 2  Brooklyn         1
 7  Brooklyn         0,   borough  customer
 3  Queens         1
 4  Queens         1
 6  Queens         0]

This scores a 2: only the two Manhattan leads form a uniform subgroup.

So our final results are:

| education | under 30 | borough | income |
|:---------:|:--------:|:-------:|:------:|
| 4 | 3 | 2 | 0 |


The highest scoring test is education, so that is the test that we begin with.
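As a rough sketch, the same ranking can be computed in one pass over the features. The helper `purity_score` and the variables `scores` and `best_feature` are hypothetical names introduced for illustration, and the dataframe is rebuilt inline from the table at the top rather than read from `customer_data.csv`.

```python
import pandas as pd

def purity_score(df, column_name, target):
    """Count the rows that land in a subgroup whose target values all agree."""
    score = 0
    for _, group in df.groupby(column_name):
        if group[target].nunique() == 1:
            score += len(group)
    return score

# Feature and target columns copied from the customer table (rows 0 to 7).
df = pd.DataFrame({
    'graduate education': ['?', 'Yes', '?', 'No', '?', 'Yes', 'Yes', '?'],
    'under_thirty': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes'],
    'borough': ['Manhattan', 'Brooklyn', 'Brooklyn', 'Queens',
                'Queens', 'Manhattan', 'Queens', 'Brooklyn'],
    'income': ['< 55', '< 55', '< 55', '> 100', '55 - 100', '> 100', '> 100', '55 - 100'],
    'customer': [0, 0, 1, 1, 1, 0, 0, 0],
})

# Score every feature column against the target.
features = [column for column in df.columns if column != 'customer']
scores = {column: purity_score(df, column, 'customer') for column in features}

# Pick the feature whose test places the most leads in uniform subgroups.
best_feature = max(scores, key=scores.get)

print(scores)
print(best_feature)
```

Ties are not handled specially in this sketch; `max` simply returns the first feature with the highest score.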

In [74]:
split_data_by(updated_df, 'attended_college', 'customer')

[  attended_college  customer
 0                ?         0
 2                ?         1
 4                ?         1
 7                ?         0,   attended_college  customer
 1              Yes         0
 5              Yes         0
 6              Yes         0,   attended_college  customer
 3               No         1]

In [93]:
updated_df[updated_df['attended_college'] == 'Yes']['customer_predictions']

1    ?
5    ?
6    ?
Name: customer_predictions, dtype: object

### Our criteria

In [None]:
updated_df[:, 'customer_predictions'] = 