## Decision Trees and Sparse

Decision trees can handle categorical data in several ways.  In this lab we'll walk through three of them: default category codes, ordered category codes, and one hot encoding.

### Working with categories

First let's load up our customer data.

In [1]:
import pandas as pd
df = pd.read_csv('./customer_sparse.csv', index_col=0)

In [3]:
# df

Our classification tree can only work with numbers, so we first need to convert the categorical features to numeric values.

a) Convert all columns, except for `customer`, to the `category` dtype.

In [20]:
df_cats = None

In [11]:
# df_cats.dtypes

# under_thirty       category
# borough            category
# education_level    category
# dtype: object
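
As a hedged illustration of step a), the sketch below rebuilds the eight-row customer table by hand (an assumption standing in for `customer_sparse.csv`, using the rows displayed later in this lab) and shows one possible way to convert the feature columns to the `category` dtype. It is not necessarily the intended solution.

```python
import pandas as pd

# Hand-built stand-in for customer_sparse.csv (assumption: same rows as the lab's dataframe).
df = pd.DataFrame({
    "under_thirty": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
    "borough": ["Manhattan", "Brooklyn", "Brooklyn", "Queens",
                "Queens", "Manhattan", "Queens", "Brooklyn"],
    "education_level": ["high school", "college grad", "high school", "some college",
                        "high school", "law school", "mba", "graduate school"],
    "customer": [0, 0, 1, 1, 1, 0, 0, 0],
})

# Drop the target, then cast every remaining column to the category dtype.
df_cats = df.drop(columns="customer").astype("category")
print(df_cats.dtypes)
```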

b) Convert all category values to numbers.

Assign the variables `X` and `y`: the target `y` is the `customer` column, and `X` holds all other features as their numeric category codes.

In [19]:
X = None

In [21]:
X.head().to_numpy()

# array([[1, 1, 2],
#        [1, 0, 0],
#        [0, 0, 2],
#        [0, 2, 5],
#        [0, 2, 2]], dtype=int8)
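
One possible way to obtain these codes is the `cat.codes` accessor, applied column by column. The sketch below assumes a hand-built copy of the customer table in place of the CSV file; treat it as an illustration rather than the expected answer.

```python
import pandas as pd

# Hand-built stand-in for customer_sparse.csv (assumption: same rows as the lab's dataframe).
df = pd.DataFrame({
    "under_thirty": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
    "borough": ["Manhattan", "Brooklyn", "Brooklyn", "Queens",
                "Queens", "Manhattan", "Queens", "Brooklyn"],
    "education_level": ["high school", "college grad", "high school", "some college",
                        "high school", "law school", "mba", "graduate school"],
    "customer": [0, 0, 1, 1, 1, 0, 0, 0],
})
df_cats = df.drop(columns="customer").astype("category")

# Replace each category value with its integer code.
X = df_cats.apply(lambda col: col.cat.codes)
y = df["customer"]

print(X.head().to_numpy())
```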

In [18]:
y = df.customer

y.shape
# (8,)

(8,)

2. Fit the data with a decision tree

    * Fit a decision tree classifier to `X` and `y`, and assign it to the variable `dtc_1`.

In [22]:
dtc_1 = None
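
A minimal sketch of this fitting step, assuming scikit-learn's `DecisionTreeClassifier` with default settings and a hand-built copy of the customer table. The `random_state=0` argument is an added assumption to make the run repeatable; the lab does not specify it.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hand-built stand-in for customer_sparse.csv (assumption: same rows as the lab's dataframe).
df = pd.DataFrame({
    "under_thirty": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
    "borough": ["Manhattan", "Brooklyn", "Brooklyn", "Queens",
                "Queens", "Manhattan", "Queens", "Brooklyn"],
    "education_level": ["high school", "college grad", "high school", "some college",
                        "high school", "law school", "mba", "graduate school"],
    "customer": [0, 0, 1, 1, 1, 0, 0, 0],
})
X = df.drop(columns="customer").astype("category").apply(lambda col: col.cat.codes)
y = df["customer"]

# Fit an unpruned tree; with default settings it keeps splitting until every leaf is pure.
dtc_1 = DecisionTreeClassifier(random_state=0).fit(X, y)

# Accuracy on the training rows themselves (not a measure of generalization).
train_accuracy = dtc_1.score(X, y)
print(train_accuracy)
```

Training accuracy only confirms that the tree separated the rows it was shown; it says nothing about which features it used to do so.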

> Now press shift + enter to view the tree.

In [None]:
from sklearn import tree
from graphviz import Source

graph_1 = Source(tree.export_graphviz(dtc_1, out_file=None,
                                      feature_names=X.columns))

graph_1

Now this may seem pretty good.  After all, the tree did successfully separate all of our data.  But if we look at our original dataframe, we can see that the amount of education should be a strong indicator of whether someone becomes a customer.

In [15]:
df

| | under_thirty | borough | education_level | customer |
|---|---|---|---|---|
| 0 | Yes | Manhattan | high school | 0 |
| 1 | Yes | Brooklyn | college grad | 0 |
| 2 | No | Brooklyn | high school | 1 |
| 3 | No | Queens | some college | 1 |
| 4 | No | Queens | high school | 1 |
| 5 | No | Manhattan | law school | 0 |
| 6 | No | Queens | mba | 0 |
| 7 | Yes | Brooklyn | graduate school | 0 |


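Why was this feature not selected in our decision tree? The problem is that the codes assigned do not correspond to the "amount" of education.  By default, pandas sorts the categories alphabetically and numbers them in that order, so `college grad` receives a lower code than `high school`.  We can see the order of the categories by looking at `cat.categories` for the `education_level` column.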

In [17]:
df_cats.education_level.cat.categories

Index(['college grad', 'graduate school', 'high school', 'law school', 'mba',
       'some college'],
      dtype='object')
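
To make the mismatch concrete, the sketch below pairs each integer code with its label. The series is a hand-built copy of the `education_level` column (an assumption in place of the CSV file), and `code_to_label` is a name introduced only for this illustration.

```python
import pandas as pd

# Hand-built copy of the education_level column (assumption: same values as the lab's dataframe).
edu = pd.Series(
    ["high school", "college grad", "high school", "some college",
     "high school", "law school", "mba", "graduate school"],
    dtype="category",
    name="education_level",
)

# The code of a category is its position in cat.categories.
code_to_label = dict(enumerate(edu.cat.categories))
for code, label in code_to_label.items():
    print(code, label)
```

Here `high school` lands between `graduate school` and `law school`, so a threshold on the code cannot cleanly separate lower from higher education.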

### The ideal coercion

Use the `set_categories` method to place the values in the correct order:

`['high school', 'some college', 'college grad', 'law school', 'mba', 'graduate school']`

Assign this correctly ordered column to the variable `ordered_edu`.

In [27]:
ordered_edu = None
ordered_edu

# 0        high school
# 1       college grad
# 2        high school
# 3       some college
# 4        high school
# 5         law school
# 6                mba
# 7    graduate school
# Name: education_level, dtype: category
# Categories (6, object): [high school, some college, college grad, law school, mba, graduate school]
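
One way this reordering could look is sketched below, using `Series.cat.set_categories` on a hand-built copy of the column (an assumption in place of the CSV file). The variable `edu_order` is introduced here only for illustration.

```python
import pandas as pd

# Hand-built copy of the education_level column (assumption: same values as the lab's dataframe).
edu = pd.Series(
    ["high school", "college grad", "high school", "some college",
     "high school", "law school", "mba", "graduate school"],
    dtype="category",
    name="education_level",
)

# Desired order, from least to most education, as listed in the lab.
edu_order = ["high school", "some college", "college grad",
             "law school", "mba", "graduate school"]

# set_categories keeps the values but renumbers the codes to follow edu_order.
ordered_edu = edu.cat.set_categories(edu_order)
print(ordered_edu.cat.codes.tolist())
```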

Let's create a new feature matrix called `X_ordered`.  It is a copy of our `X` matrix, except that its `education_level` column holds the category codes from `ordered_edu`.

In [28]:
X_ordered = None

Now that the education level values are correctly ordered, let's fit a classifier called `dtc_2` to the `X_ordered` data.

In [29]:
dtc_2 = None
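
The sketch below shows one possible way to build `X_ordered` and fit the second tree, then reads the root split directly from the fitted tree instead of from the rendered graph. It assumes a hand-built copy of the customer table and an added `random_state=0`; `edu_order` and `root_feature` are names introduced only for this illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hand-built stand-in for customer_sparse.csv (assumption: same rows as the lab's dataframe).
df = pd.DataFrame({
    "under_thirty": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
    "borough": ["Manhattan", "Brooklyn", "Brooklyn", "Queens",
                "Queens", "Manhattan", "Queens", "Brooklyn"],
    "education_level": ["high school", "college grad", "high school", "some college",
                        "high school", "law school", "mba", "graduate school"],
    "customer": [0, 0, 1, 1, 1, 0, 0, 0],
})
df_cats = df.drop(columns="customer").astype("category")
X = df_cats.apply(lambda col: col.cat.codes)
y = df["customer"]

# Reorder the education categories, then swap the new codes into a copy of X.
edu_order = ["high school", "some college", "college grad",
             "law school", "mba", "graduate school"]
ordered_edu = df_cats["education_level"].cat.set_categories(edu_order)
X_ordered = X.assign(education_level=ordered_edu.cat.codes)

dtc_2 = DecisionTreeClassifier(random_state=0).fit(X_ordered, y)

# tree_.feature[0] is the column index used by the root node's split.
root_feature = X_ordered.columns[dtc_2.tree_.feature[0]]
print(root_feature)
```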

Press shift + enter on the code below to see the updated decision tree.

In [None]:
from sklearn import tree
from graphviz import Source

graph_2 = Source(tree.export_graphviz(dtc_2, out_file=None,
                                      feature_names=X_ordered.columns))

graph_2

This time our decision tree splits on the `education_level` feature at the first level.

### One hot encoding

The other way to work with categorical data is one hot encoding, which splits each feature into multiple columns, one per category value. Let's see how this performs.  Begin with the dataframe.

In [26]:
df

| | under_thirty | borough | education_level | customer |
|---|---|---|---|---|
| 0 | Yes | Manhattan | high school | 0 |
| 1 | Yes | Brooklyn | college grad | 0 |
| 2 | No | Brooklyn | high school | 1 |
| 3 | No | Queens | some college | 1 |
| 4 | No | Queens | high school | 1 |
| 5 | No | Manhattan | law school | 0 |
| 6 | No | Queens | mba | 0 |
| 7 | Yes | Brooklyn | graduate school | 0 |


From there, convert the appropriate features to categories.  Assign the result to the variable `df_sparse_cats`.

In [29]:
df_sparse_cats = df.select_dtypes(include='object').astype('category')

Then use `get_dummies` to convert the categorical features to dummy variables.  Assign this dummied data to the variable `dummied_X`.

In [30]:
dummied_X = None
dummied_X
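
As a hedged sketch of this step, `pd.get_dummies` can be called directly on the categorical dataframe. The example assumes a hand-built copy of the customer table, and it selects the feature columns by dropping `customer` instead of using `select_dtypes`, which is a substitution made only to keep the sketch independent of how string columns are typed.

```python
import pandas as pd

# Hand-built stand-in for customer_sparse.csv (assumption: same rows as the lab's dataframe).
df = pd.DataFrame({
    "under_thirty": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
    "borough": ["Manhattan", "Brooklyn", "Brooklyn", "Queens",
                "Queens", "Manhattan", "Queens", "Brooklyn"],
    "education_level": ["high school", "college grad", "high school", "some college",
                        "high school", "law school", "mba", "graduate school"],
    "customer": [0, 0, 1, 1, 1, 0, 0, 0],
})
df_sparse_cats = df.drop(columns="customer").astype("category")

# One indicator column per (feature, category value) pair.
dummied_X = pd.get_dummies(df_sparse_cats)
print(dummied_X.shape)
print(list(dummied_X.columns))
```

Each row has exactly one active indicator per original feature, so most entries are zero, which is why this representation is called sparse.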

Now train a new tree with this data and assign it to the variable `dtc_3`.

In [31]:
from sklearn.tree import DecisionTreeClassifier
dtc_3 = None

In [None]:
from sklearn import tree
from graphviz import Source

graph_3 = Source(tree.export_graphviz(dtc_3, out_file=None,
                                      feature_names=dummied_X.columns))


In [None]:
graph_3

### Discussion Questions

Compare the three techniques.

Which technique performed the best?  Why do you think that technique performed the best?

### Resources

[Are categorical variables getting lost in your random forests? (Roam Analytics)](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)