In this notebook, I try to answer one question: do we need one-hot encoding in a random forest model? The answer is **YES**.

There is [another post](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769) on this topic, which claims that one-hot encoding makes tree-based ensembles (including random forests) worse for two main reasons. First, one-hot encoding a categorical variable induces sparsity in the dataset, which is undesirable. Second, the dummy variables resulting from one-hot encoding are given less importance than they should be, which obscures the feature importance ranking.

However, the first reason matters little: sparsity in itself is not related to the predictive performance of a random forest. The second reason is simply wrong, as it relies on the biased and problematic feature importance calculation in the sklearn package. The **rfpimp** package provides a better feature importance method that is based on permuting feature values, and it can also estimate the importance of a group of variables.
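
To make the idea of permutation importance concrete, here is a minimal sketch on synthetic data. It uses `sklearn.inspection.permutation_importance` rather than rfpimp, and the data, feature names, and parameter values are assumptions chosen for illustration only. Each column of the held-out set is shuffled in turn, and the resulting drop in accuracy is taken as that feature's importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Synthetic data (an assumption for illustration): only column 0 carries signal.
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=500, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in accuracy.
result = permutation_importance(model, X_va, y_va, n_repeats=10, random_state=0)
for name, mean in zip(["signal", "noise_1", "noise_2"], result.importances_mean):
    print(f"{name:8s} {mean: .4f}")
```

In this toy setting the two noise columns should score close to zero, which is the behaviour one would hope for from an importance measure.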

In the following parts, I use Kobe Bryant's shot selection data to build a model that predicts whether a shot, given its circumstances, goes in or not. The aim is to demonstrate the difference between label encoding and one-hot encoding. According to some analysis, the two variables describing the type of shot (**'action_type'** and **'combined_shot_type'**) are the most important for predicting the outcome of a given shot, and both are categorical.

I will run random forest on the dataset with label encoding (using different orders) and with one-hot encoding. This will answer the following questions:

- Is there a difference between label encoding and one-hot encoding in this classification task?
- In label encoding, does the category order of the categorical variable matter for the classification task?
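
As a quick illustration of the two encodings compared in these questions, the sketch below encodes a toy column both ways with plain pandas. The column name and values are hypothetical and only loosely modelled on the shot-type variables.

```python
import pandas as pd

# Toy column (hypothetical values, loosely modelled on combined_shot_type).
shots = pd.DataFrame({"shot": ["Jump Shot", "Dunk", "Layup", "Dunk"]})

# Label encoding: one integer column; the codes follow the lexical category order.
label_encoded = shots["shot"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(shots, columns=["shot"])

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Label encoding keeps a single column but imposes an order (here Dunk < Jump Shot < Layup), while one-hot encoding adds one column per category and imposes none.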

## Load packages

Note that fastai v0.7 is needed rather than fastai v1.0. [This link](https://forums.fast.ai/t/fastai-v0-7-install-issues-thread/24652) tells you how to install fastai v0.7. 

In [157]:
from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics
import os
import pandas as pd
import numpy as np

# Part 1: label encoding

## Read Data

In [158]:
# os.chdir("/home/rk9cx/Kaggle/Kobe Shot Selection/")
df = pd.read_csv("data.csv", index_col = False, low_memory=False, parse_dates=["game_date"])
test = df[df['shot_made_flag'].isna()]
train = df[~df['shot_made_flag'].isna()]

In [159]:
print("Number of rows:", train.shape[0])
print("Number of features:", train.shape[1])

Number of rows: 25697
Number of features: 25


In [160]:
train.dtypes

action_type                   object
combined_shot_type            object
game_event_id                  int64
game_id                        int64
lat                          float64
loc_x                          int64
loc_y                          int64
lon                          float64
minutes_remaining              int64
period                         int64
playoffs                       int64
season                        object
seconds_remaining              int64
shot_distance                  int64
shot_made_flag               float64
shot_type                     object
shot_zone_area                object
shot_zone_basic               object
shot_zone_range               object
team_id                        int64
team_name                     object
game_date             datetime64[ns]
matchup                       object
opponent                      object
shot_id                        int64
dtype: object

## Preprocessing

### Step 1: converting date into different features

In [161]:
#converting date into different features
# add_datepart converts a datetime64 column of df into 13 columns containing
# the information from the date. The changes are applied in place.
# New columns: Year Month Week Day Dayofweek Dayofyear Is_month_end Is_month_start
# Is_quarter_end Is_quarter_start Is_year_end Is_year_start Elapsed
add_datepart(train, 'game_date')
add_datepart(test, 'game_date')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


### Step 2: converting categorical variables using label encoding


Pandas has a category data type, but by default it does not turn anything into a category. The **train_cats** function from fastai converts every column of strings in a dataframe into a column of 'Categorical' values. The changes are applied in place. By default, this function uses the lexical order of the categories.
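
For readers without fastai v0.7 installed, the conversion can be approximated in plain pandas. The helper below, `to_ordered_categories`, is a hypothetical stand-in written for this note and not the fastai implementation; unlike **train_cats**, it returns a copy instead of modifying the dataframe in place.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def to_ordered_categories(df):
    """Hypothetical stand-in for train_cats: returns a converted copy."""
    out = df.copy()
    for name, col in df.items():
        # Only string columns are converted; numeric columns are left untouched.
        if is_string_dtype(col):
            out[name] = col.astype("category").cat.as_ordered()
    return out

toy = pd.DataFrame({"shot": ["Layup", "Dunk", "Jump Shot"], "distance": [2, 0, 15]})
converted = to_ordered_categories(toy)

print(converted["shot"].cat.categories.tolist())  # lexical order of the categories
print(converted["shot"].cat.codes.tolist())       # integer codes used by label encoding
```

The integer codes are what the random forest sees after label encoding, so the lexical order of the category names determines the numeric order of the categories.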

In [162]:
# converting categorical variables using label encoding
train_cats(train)
apply_cats(test, train)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[n] = pd.Categorical(c, categories=trn[n].cat.categories, ordered=True)


In [164]:
# check the order in the action_type column
print(train.action_type.cat.categories)

Index(['Alley Oop Dunk Shot', 'Alley Oop Layup shot', 'Cutting Layup Shot',
       'Driving Bank shot', 'Driving Dunk Shot',
       'Driving Finger Roll Layup Shot', 'Driving Finger Roll Shot',
       'Driving Floating Bank Jump Shot', 'Driving Floating Jump Shot',
       'Driving Hook Shot', 'Driving Jump shot', 'Driving Layup Shot',
       'Driving Reverse Layup Shot', 'Driving Slam Dunk Shot', 'Dunk Shot',
       'Fadeaway Bank shot', 'Fadeaway Jump Shot', 'Finger Roll Layup Shot',
       'Finger Roll Shot', 'Floating Jump shot', 'Follow Up Dunk Shot',
       'Hook Bank Shot', 'Hook Shot', 'Jump Bank Shot', 'Jump Hook Shot',
       'Jump Shot', 'Layup Shot', 'Pullup Bank shot', 'Pullup Jump shot',
       'Putback Dunk Shot', 'Putback Layup Shot', 'Putback Slam Dunk Shot',
       'Reverse Dunk Shot', 'Reverse Layup Shot', 'Reverse Slam Dunk Shot',
       'Running Bank shot', 'Running Dunk Shot',
       'Running Finger Roll Layup Shot', 'Running Finger Roll Shot',
       'Running Hook

### Step 3: split the dataframe into predictor and response variable and impute missing values

In [166]:
# proc_df takes a data frame, splits off the response variable, and turns the
# rest into an entirely numeric data frame (using label encoding). It also
# imputes missing values with the median.

# Returned values:
# df: the transformed data frame without the response variable.
# y: the values of the response variable.
# nas: a dictionary of the columns in which missing values were filled, with the associated medians.
df, y, nas = proc_df(train, y_fld='shot_made_flag')
# Reuse the training medians (na_dict=nas) when processing the test set.
df_test, y_test, _ = proc_df(test, y_fld='shot_made_flag', na_dict=nas)

In [167]:
# train-valid split using sklearn
from sklearn.model_selection import train_test_split
# test_size: if int, it represents the absolute number of test (here: validation) samples
X_train, X_valid, y_train, y_valid = train_test_split(df, y, test_size = 5000, random_state = 42)

## Modelling using Random Forest

In [168]:
# how many columns in X_train
print(X_train.columns.size)
# inspect the X_train and y_train
X_train.dtypes

36


action_type                 int8
combined_shot_type          int8
game_event_id              int64
game_id                    int64
lat                      float64
loc_x                      int64
loc_y                      int64
lon                      float64
minutes_remaining          int64
period                     int64
playoffs                   int64
season                      int8
seconds_remaining          int64
shot_distance              int64
shot_type                   int8
shot_zone_area              int8
shot_zone_basic             int8
shot_zone_range             int8
team_id                    int64
team_name                   int8
matchup                     int8
opponent                    int8
shot_id                    int64
game_Year                  int64
game_Month                 int64
game_Week                  int64
game_Day                   int64
game_Dayofweek             int64
game_Dayofyear             int64
game_Is_month_end           bool
game_Is_mo

In [105]:
# Is the RandomForestClassifier stable on this dataset?
# Run the RandomForestClassifier ten times, and print the mean and standard deviation of the log loss
# n_estimators = 100
num_estimator = 100
vec_log_loss = []
for i in range(10):
    m = RandomForestClassifier(n_jobs=-1, n_estimators=num_estimator)
    m.fit(X_train, y_train)
    # validation error
    vec_log_loss.append(metrics.log_loss(y_valid,m.predict_proba(X_valid)))

print("10-run RandomForest Classifier using cross-entropy")
print("Mean of log_loss", np.mean(vec_log_loss))
print("Std_dev of log_loss", np.std(vec_log_loss))

10-run RandomForest Classifier using cross-entropy
Mean of log_loss 0.622414845504307
Std_dev of log_loss 0.0013381796245117178
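
To put the log loss values in context, the sketch below computes the binary cross-entropy by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made up for illustration. It also computes the loss of a constant prediction of 0.5, which equals ln 2 (about 0.693) and serves as a rough reference point for the values reported here.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_made = np.array([0.8, 0.3, 0.6, 0.9, 0.4])  # made-up predicted P(y = 1)

# Binary cross-entropy: mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(p_made) + (1 - y_true) * np.log(1 - p_made))

# Reference point: always predicting 0.5 yields ln(2), whatever the labels are.
baseline = log_loss(y_true, np.full(len(y_true), 0.5))

print(manual, log_loss(y_true, p_made), baseline)
```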


## Feature Importance: using permutation importance rather than **Gini importance** from sklearn

In [113]:
# using feature importance function from sklearn
list_feature_imp = m.feature_importances_

feature_list = list(X_train.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, list_feature_imp)]

# Sort the feature importance by most important first
feature_importances = sorted(feature_importances, key=lambda x:x[1], reverse=True)

# Print out the feature and importances
for feature, importance in feature_importances:
    print('Variable: {:20} Importance: {}'.format(feature, importance))

Variable: action_type          Importance: 0.09
Variable: game_event_id        Importance: 0.06
Variable: seconds_remaining    Importance: 0.06
Variable: shot_id              Importance: 0.06
Variable: lat                  Importance: 0.05
Variable: loc_x                Importance: 0.05
Variable: loc_y                Importance: 0.05
Variable: lon                  Importance: 0.05
Variable: matchup              Importance: 0.05
Variable: game_Dayofyear       Importance: 0.05
Variable: game_Elapsed         Importance: 0.05
Variable: game_id              Importance: 0.04
Variable: minutes_remaining    Importance: 0.04
Variable: shot_distance        Importance: 0.04
Variable: opponent             Importance: 0.04
Variable: game_Day             Importance: 0.04
Variable: combined_shot_type   Importance: 0.03
Variable: game_Week            Importance: 0.03
Variable: game_Dayofweek       Importance: 0.03
Variable: period               Importance: 0.02
Variable: season               Importanc

It is shown that the feature importance of **action_type** is 0.09, and that of **combined_shot_type** is 0.03.

From the list above, the second most important feature is **game_event_id**, which describes **DDD**. This doesn't make sense. It happens because **game_event_id** has a much higher cardinality (i.e., number of unique values) than **combined_shot_type**, and therefore has a higher probability of being selected as a split variable.
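
The cardinality effect can be reproduced on synthetic data. In the sketch below (all feature names and values are assumptions for illustration), two features are pure noise: one is binary and the other is ID-like with a unique value per row. Neither carries information about the label, yet the impurity-based importance tends to favour the ID-like one, because deep trees can keep splitting on it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
signal = rng.integers(0, 2, size=n)         # informative binary feature
flip = rng.random(n) < 0.3                  # 30% label noise
y = np.where(flip, 1 - signal, signal)
noise_binary = rng.integers(0, 2, size=n)   # uninformative, 2 unique values
noise_id = rng.permutation(n)               # uninformative, n unique values (ID-like)

X = np.column_stack([signal, noise_binary, noise_id])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based (Gini) importances, computed on the training data.
for name, imp in zip(["signal", "noise_binary", "noise_id"], model.feature_importances_):
    print(f"{name:13s} {imp:.3f}")
```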

**We should not use this feature importance method, as it leads to erroneous conclusions.**

In [123]:
# using feature importance from rfpimp
from rfpimp import *
import rfpimp
# print(rfpimp.__version__)
print(importances(m, X_valid, y_valid))

                       Importance
Feature                          
action_type                0.0970
combined_shot_type         0.0066
seconds_remaining          0.0044
shot_distance              0.0010
game_Is_quarter_end        0.0008
minutes_remaining          0.0004
game_Is_year_start         0.0002
shot_zone_basic            0.0002
game_Is_year_end           0.0000
team_name                  0.0000
team_id                    0.0000
game_Is_quarter_start     -0.0004
game_Is_month_start       -0.0004
game_Is_month_end         -0.0004
shot_type                 -0.0006
game_Dayofweek            -0.0012
playoffs                  -0.0014
shot_zone_range           -0.0016
lon                       -0.0018
game_Year                 -0.0020
game_event_id             -0.0024
matchup                   -0.0024
game_Day                  -0.0028
shot_zone_area            -0.0028
game_Month                -0.0030
lat                       -0.0050
season                    -0.0050
game_id       

From the result using **rfpimp**, it is shown that the feature importance of **action_type** is 0.0970, and that of **combined_shot_type** is 0.0066.

# Part 2: Changing the order in action_type

This part investigates the influence of changing the order in action_type. Specifically, the order will be randomly shuffled ten times, and the resulting log loss is recorded and compared.
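
Before running the full experiment, it may help to see what shuffling the category order does to the integer codes. The sketch below uses a toy series with hypothetical values. It relies on `set_categories` returning a new series, since, as far as I know, the `inplace=True` argument used in the cell below is no longer available in recent pandas releases.

```python
import pandas as pd
from sklearn.utils import shuffle

s = pd.Series(["Layup", "Dunk", "Jump Shot", "Dunk"]).astype("category").cat.as_ordered()
original = list(s.cat.categories)

# Shuffle the category order; the values stay the same, only their codes change.
shuffled = shuffle(original, random_state=0)
s_shuffled = s.cat.set_categories(shuffled, ordered=True)

print(original, s.cat.codes.tolist())
print(shuffled, s_shuffled.cat.codes.tolist())
```

The values of the series are untouched; only the mapping from category to integer code changes, which is exactly what the random forest sees under label encoding.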

In [175]:
import random
from sklearn import utils
list_mean_log_loss = []
list_cat_action_type = train.action_type.cat.categories
for i in range(10):
    # change the order in action_type using random shuffle
    train.action_type.cat.set_categories(utils.shuffle(list_cat_action_type,random_state=i), ordered=True, inplace=True)
    print(train.action_type.cat.categories)
#     # apply the changes to the test
#     apply_cats(test, train)
    # split the X and y
    df, y, nas = proc_df(train, y_fld='shot_made_flag')
    # split the train and valid data
    X_train, X_valid, y_train, y_valid = train_test_split(df, y, test_size = 5000, random_state = 42)
    # rerun the random forest model
    num_estimator = 100
    vec_log_loss = []
    for _ in range(10):
        m = RandomForestClassifier(n_jobs=-1, n_estimators=num_estimator)
        m.fit(X_train, y_train)
        # validation error
        vec_log_loss.append(metrics.log_loss(y_valid,m.predict_proba(X_valid)))
    
    # record the log loss
    list_mean_log_loss.append(np.mean(vec_log_loss))
print(list_mean_log_loss)

Index(['Running Slam Dunk Shot', 'Putback Layup Shot', 'Pullup Jump shot',
       'Driving Finger Roll Layup Shot', 'Tip Layup Shot', 'Running Jump Shot',
       'Running Finger Roll Layup Shot', 'Reverse Slam Dunk Shot',
       'Turnaround Hook Shot', 'Driving Floating Bank Jump Shot',
       'Turnaround Finger Roll Shot', 'Tip Shot', 'Hook Shot',
       'Running Finger Roll Shot', 'Driving Finger Roll Shot',
       'Driving Reverse Layup Shot', 'Driving Bank shot', 'Running Layup Shot',
       'Follow Up Dunk Shot', 'Driving Hook Shot', 'Driving Jump shot',
       'Alley Oop Dunk Shot', 'Layup Shot', 'Finger Roll Shot',
       'Pullup Bank shot', 'Running Bank shot', 'Putback Slam Dunk Shot',
       'Slam Dunk Shot', 'Jump Hook Shot', 'Step Back Jump shot',
       'Running Pull-Up Jump Shot', 'Putback Dunk Shot',
       'Driving Floating Jump Shot', 'Jump Shot', 'Turnaround Jump Shot',
       'Jump Bank Shot', 'Hook Bank Shot', 'Fadeaway Jump Shot',
       'Turnaround Fadeaway shot',

Index(['Fadeaway Jump Shot', 'Running Tip Shot', 'Driving Slam Dunk Shot',
       'Driving Layup Shot', 'Running Layup Shot',
       'Running Reverse Layup Shot', 'Floating Jump shot', 'Driving Bank shot',
       'Follow Up Dunk Shot', 'Finger Roll Shot', 'Step Back Jump shot',
       'Running Hook Shot', 'Running Dunk Shot', 'Fadeaway Bank shot',
       'Pullup Bank shot', 'Turnaround Fadeaway shot',
       'Running Finger Roll Layup Shot', 'Turnaround Hook Shot',
       'Alley Oop Layup shot', 'Dunk Shot', 'Jump Bank Shot', 'Hook Shot',
       'Driving Floating Bank Jump Shot', 'Reverse Slam Dunk Shot',
       'Layup Shot', 'Running Finger Roll Shot', 'Driving Reverse Layup Shot',
       'Putback Dunk Shot', 'Hook Bank Shot', 'Slam Dunk Shot',
       'Putback Slam Dunk Shot', 'Finger Roll Layup Shot',
       'Driving Finger Roll Shot', 'Driving Hook Shot', 'Reverse Layup Shot',
       'Cutting Layup Shot', 'Running Jump Shot', 'Driving Floating Jump Shot',
       'Pullup Jump shot', 

The log loss obtained with the shuffled category orders of action_type is very similar to that obtained with the lexical order. This suggests that the category order of this variable has little effect on this classification task.

The order being irrelevant to the classification accuracy means that the **action_type** variable might be represented as an unordered categorical variable (aka a nominal variable). The most appropriate way to model a nominal variable is one-hot encoding, which is introduced in Part 3.

# Part 3: One-hot Encoding (using get_dummies)

In [125]:
# os.chdir("/home/rk9cx/Kaggle/Kobe Shot Selection/")
df = pd.read_csv("data.csv", index_col = False, low_memory=False, parse_dates=["game_date"])
df = pd.get_dummies(df, columns=['action_type','combined_shot_type'])
test = df[df['shot_made_flag'].isna()]
train = df[~df['shot_made_flag'].isna()]
print(train.shape)

(25697, 86)


In [126]:
# converting date into different features
add_datepart(train, 'game_date')
add_datepart(test, 'game_date')



In [129]:
train.shape

(25697, 98)

In [130]:
# converting the remaining categorical variables using label encoding
train_cats(train)
apply_cats(test, train)



In [131]:
train.shape

(25697, 98)

In [132]:
df, y, nas = proc_df(train, 'shot_made_flag')

In [133]:
# train-valid split using sklearn
from sklearn.model_selection import train_test_split
# test_size: if int, it represents the absolute number of test (here: validation) samples
X_train, X_valid, y_train, y_valid = train_test_split(df, y, test_size = 5000, random_state = 42)

In [177]:
# Is the RandomForestClassifier stable on this dataset?
# Run the RandomForestClassifier ten times, and print the mean and standard deviation of the log loss
# n_estimators = 100
num_estimator = 100
vec_log_loss = []

for i in range(10):
    m = RandomForestClassifier(n_jobs=-1, n_estimators=num_estimator)
    m.fit(X_train, y_train)
    # validation error
    vec_log_loss.append(metrics.log_loss(y_valid,m.predict_proba(X_valid)))

print("10-run RandomForest Classifier using cross-entropy")
print("Mean of log_loss", np.mean(vec_log_loss))
print("Std_dev of log_loss", np.std(vec_log_loss))



## Feature importance

In [136]:
# using feature importance from rfpimp
# install the rfpimp in Anaconda: conda install rfpimp
# compute the permutation importance of each individual feature. 
from rfpimp import *
import rfpimp
# print(rfpimp.__version__)
importance_single_feature = importances(m, X_valid, y_valid)
print(importance_single_feature)

                               Importance
Feature                                  
action_type_Jump Shot              0.0718
action_type_Layup Shot             0.0190
seconds_remaining                  0.0040
combined_shot_type_Jump Shot       0.0040
action_type_Running Jump Shot      0.0030
...                                   ...
opponent                          -0.0058
game_id                           -0.0062
game_Elapsed                      -0.0064
matchup                           -0.0066
game_Day                          -0.0094

[97 rows x 1 columns]


In [155]:
# importance of the feature groups action_type* and combined_shot_type*
# In the importances function, the features parameter is a list of the feature groups whose importance is to be computed.
# By default, the features that are not in any element of the features list are gathered into one extra group, whose permutation importance is also computed.

list_feature = list(X_valid.columns)
print("Importance of features deriving from action_type variable")
print(importances(m, X_valid, y_valid, features = [[k for k in list_feature if 'action_type' in k],[k for k in list_feature if 'combined_shot_type' in k]]))

Importance of features deriving from action_type variable
                                                    Importance
Feature                                                       
action_type_Alley Oop Dunk Shot\naction_type_Al...      0.1168
combined_shot_type_Bank Shot\ncombined_shot_typ...      0.0116
game_Day\ngame_id\nloc_y\ngame_Month\ngame_Is_m...      0.0066


The importance of the **action_type** group is 0.1168, and that of the **combined_shot_type** group is 0.0116. Moreover, comparing these with the per-feature importances listed above shows that each of the two groups is more important than any individual variable outside them.
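
One plausible way to compute a group importance by hand is to shuffle all columns of the group with the same row permutation and measure the drop in a validation score. The sketch below does this on synthetic data; the function `group_permutation_importance`, the data, and the use of accuracy as the score are assumptions for illustration, and this is not necessarily how rfpimp implements it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def group_permutation_importance(model, X, y, group, rng):
    """Drop in accuracy when the columns in `group` are shuffled together."""
    baseline = accuracy_score(y, model.predict(X))
    X_perm = X.copy()
    idx = rng.permutation(len(X))
    # The same row permutation is applied to every column in the group,
    # so each shuffled row still holds a valid one-hot pattern.
    X_perm[group] = X[group].to_numpy()[idx]
    return baseline - accuracy_score(y, model.predict(X_perm))

rng = np.random.default_rng(0)
n = 3000
cat = rng.choice(["a", "b", "c", "d"], size=n)
X = pd.get_dummies(pd.DataFrame({"cat": cat}), columns=["cat"]).astype(int)
X["noise"] = rng.normal(size=n)
# The label depends on the category only: a and b are mostly 1, c and d mostly 0.
p = np.where(np.isin(cat, ["a", "b"]), 0.8, 0.2)
y = (rng.random(n) < p).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=1000, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

dummy_cols = [c for c in X.columns if c.startswith("cat_")]
imp_group = group_permutation_importance(model, X_va, y_va, dummy_cols, rng)
imp_noise = group_permutation_importance(model, X_va, y_va, ["noise"], rng)
print("cat group:", imp_group)
print("noise    :", imp_noise)
```

Shuffling the dummy columns jointly treats the one-hot encoded variable as a single unit, so its importance is not diluted across many indicator columns.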

This is consistent with our knowledge of this dataset that the two variables are more important than the others in this classification task.

# TLDR

1. In a random forest, **one-hot encoding** is well suited to representing a categorical variable when you have no knowledge of an order among its categories. One-hot encoding doesn't make the model inefficient, nor does it obscure the feature importance ranking, provided that a correct feature importance method is used.
2. Theoretically, for a nominal variable associated with unordered categories, one-hot encoding is better than label encoding, as label encoding requires a specific order between the categories. 
3. The **RandomForestClassifier.feature_importances_** attribute from **sklearn** is problematic. Always use the **importances** function from **rfpimp** instead. It is based on the widely accepted permutation method, which was originally proposed for random forests.