## Extracting RandomForest Feature Importance
In this exercise, we will extract the feature importances of a Random Forest regressor trained to predict the customer drop-out ratio.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import altair as alt

In [2]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter09/Dataset/phpYYZ4Qc.csv'

In [3]:
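# load the dataset and separate the target column 'rej' from the features
df = pd.read_csv(file_url)
y = df.pop('rej')

# hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=1)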

In [4]:
# train a Random Forest regressor with shallow trees and large leaves
rf_model = RandomForestRegressor(random_state=1, 
                                n_estimators=50, 
                                max_depth=6, 
                                min_samples_leaf=60)
rf_model.fit(X_train, y_train)

RandomForestRegressor(max_depth=6, min_samples_leaf=60, n_estimators=50,
                      random_state=1)

In [5]:
# predictions on train and test sets
preds_train = rf_model.predict(X_train)
preds_test = rf_model.predict(X_test)

# MSE of train and test sets
train_mse = mean_squared_error(y_train, preds_train)
test_mse = mean_squared_error(y_test, preds_test)

print('Train MSE: {}'.format(train_mse))
print('Test MSE: {}'.format(test_mse))

Train MSE: 0.007315982781336234
Test MSE: 0.007489642004973965


The MSE on the testing set (about 0.0075) is low and very close to the MSE on the training set (about 0.0073), so the model shows no sign of overfitting.
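
To make the comparison between the two scores concrete, the following sketch converts them to RMSE (which is in the same units as the target) and computes the relative gap. The two MSE values are copied from the output above; the 5% threshold is an arbitrary assumption chosen for illustration, not a standard cutoff.

```python
import math

# MSE values copied from the output above
train_mse = 0.007315982781336234
test_mse = 0.007489642004973965

# RMSE is expressed in the same units as the target variable
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# relative increase of the test error over the training error
relative_gap = (test_mse - train_mse) / train_mse

# assumption: 5% is an arbitrary illustrative threshold
GAP_THRESHOLD = 0.05
looks_overfit = relative_gap > GAP_THRESHOLD

print(f'Train RMSE: {train_rmse:.4f}')
print(f'Test RMSE: {test_rmse:.4f}')
print(f'Relative MSE gap: {relative_gap:.2%}')
print(f'Looks overfit: {looks_overfit}')
```

The test error is only a little over 2% higher than the training error, which is consistent with a model that generalizes about as well as it fits.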

In [6]:
# print variable importance using .feature_importances_
rf_model.feature_importances_

array([0.00000000e+00, 7.56405224e-04, 8.89442010e-05, 9.46275333e-04,
       4.08153931e-05, 1.97210546e-01, 5.03587073e-04, 2.31999967e-04,
       6.15222081e-03, 3.52461267e-03, 0.00000000e+00, 5.69504288e-01,
       1.13616416e-04, 4.90638284e-04, 1.87909452e-04, 3.20591202e-04,
       2.12958787e-04, 1.90764978e-01, 5.75581836e-03, 4.67864791e-04,
       0.00000000e+00, 0.00000000e+00, 1.75187909e-02, 3.51906210e-04,
       4.85916389e-04, 2.89740583e-05, 1.27170564e-03, 1.12059338e-03,
       1.97954549e-04, 3.01220348e-04, 0.00000000e+00, 1.44886927e-03])
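
The raw array is hard to read on its own. As a rough sketch, the values can be ranked with NumPy to see how concentrated the importance is. The array below is copied from the output above, and the reported positions refer to the column order of `df`; the variable names are illustrative.

```python
import numpy as np

# importances copied from the array printed above
importances = np.array([
    0.00000000e+00, 7.56405224e-04, 8.89442010e-05, 9.46275333e-04,
    4.08153931e-05, 1.97210546e-01, 5.03587073e-04, 2.31999967e-04,
    6.15222081e-03, 3.52461267e-03, 0.00000000e+00, 5.69504288e-01,
    1.13616416e-04, 4.90638284e-04, 1.87909452e-04, 3.20591202e-04,
    2.12958787e-04, 1.90764978e-01, 5.75581836e-03, 4.67864791e-04,
    0.00000000e+00, 0.00000000e+00, 1.75187909e-02, 3.51906210e-04,
    4.85916389e-04, 2.89740583e-05, 1.27170564e-03, 1.12059338e-03,
    1.97954549e-04, 3.01220348e-04, 0.00000000e+00, 1.44886927e-03,
])

# column positions sorted from most to least important
order = np.argsort(importances)[::-1]
top3 = order[:3]
top3_share = importances[top3].sum()

print('Sum of all importances:', round(importances.sum(), 6))
print('Positions of the three largest values:', top3.tolist())
print('Share held by the top three:', round(top3_share, 4))
```

The importances sum to 1, and the three largest values (at positions 11, 5, and 17) account for roughly 96% of the total, so the model relies almost entirely on three features.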

In [7]:
# create dataframe for feature importances
df_feature_imp = pd.DataFrame()
df_feature_imp['features'] = df.columns
df_feature_imp['importance'] = rf_model.feature_importances_
df_feature_imp.head()

  features  importance
0     a1cx    0.000000
1     a1cy    0.000756
2     a1sx    0.000089
3     a1sy    0.000946
4    a1rho    0.000041


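From this output, we can see that, among the first five features, a1cy and a1sy have the highest importance, so they contribute more to the predictions than a1cx, a1sx, and a1rho. All five values are below 0.001, however, which is tiny compared with the largest entries in the full array (up to about 0.57), so none of these five features matters much to the model.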
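
Because `head()` only shows the first rows in column order, sorting by importance gives a more useful view. The sketch below uses a small stand-in DataFrame (`df_head`, a hypothetical name) built from the five rows shown above.

```python
import pandas as pd

# first five rows copied from the output above, for illustration only
df_head = pd.DataFrame({
    'features': ['a1cx', 'a1cy', 'a1sx', 'a1sy', 'a1rho'],
    'importance': [0.0, 0.000756, 0.000089, 0.000946, 0.000041],
})

# sort from most to least important and renumber the rows
df_sorted = df_head.sort_values('importance', ascending=False).reset_index(drop=True)
print(df_sorted)
```

Applying the same `sort_values` call to the full `df_feature_imp` would list the most important features first.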

In [8]:
# plot bar chart using altair of feature importance
alt.Chart(df_feature_imp).mark_bar().encode(x='importance', y='features')

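In the chart, a1pop, a2pop, and a3pop stand out as the most important features. Note that feature importance only measures how much the model relies on a feature. It does not indicate whether higher values of these features raise or lower the predicted drop-out ratio. Establishing the direction of the effect requires a further analysis, such as a partial dependence plot.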
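
Impurity-based importances are non-negative and do not by themselves say whether a feature pushes predictions up or down. One hedged way to check direction is a manual partial-dependence curve: force one feature to a series of values and average the model's predictions at each. The sketch below uses synthetic data (an assumption, not the exercise dataset), and `average_prediction_curve` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data (assumption): the target rises with x0 and falls with x1
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.05, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

def average_prediction_curve(model, X, feature_idx, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

grid = np.linspace(0.1, 0.9, 5)
curve_x0 = average_prediction_curve(model, X, 0, grid)
curve_x1 = average_prediction_curve(model, X, 1, grid)

# both importances are positive, yet the curves move in opposite directions
print('Importances:', model.feature_importances_)
print('x0 curve:', curve_x0)
print('x1 curve:', curve_x1)
```

Running the same helper on `rf_model` with the columns of interest would show whether the predicted target rises or falls as each feature increases.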