## Random Forest Feature Importance
In this notebook we use a random forest (an ensemble method built on decision trees) to evaluate the importance of each feature in our dataset. We do this instead of computing feature importance on our KNN model because we were unable to compute it for the divorce dataset with that model; scikit-learn's KNN classifier does not provide a built-in feature importance attribute. We use the random forest implementation from scikit-learn.
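
For reference, one model-agnostic option that can be applied to a KNN model is permutation importance, which scikit-learn provides as `sklearn.inspection.permutation_importance`. The sketch below is only an illustration and is not the approach used in this notebook: it runs on synthetic data from `make_classification` (an assumption standing in for the divorce data), and the names `X_demo`, `y_demo`, and `knn` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the divorce data (assumption: 170 rows, binary target)
X_demo, y_demo = make_classification(
    n_samples=170, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Fit a KNN classifier; it has no feature_importances_ attribute of its own
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in accuracy
result = permutation_importance(knn, X_val, y_val, n_repeats=20, random_state=0)
knn_importances = result.importances_mean  # one value per feature

print(knn_importances)
```

A larger mean drop in accuracy suggests the model relies more on that feature; values near zero (or slightly negative) suggest it does not.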

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# Import divorce data
divorce = pd.read_csv('data/divorce_data.csv',sep=';')

# Processing features:
X = divorce.drop('Divorce',axis=1) # .dropna() - dataset has no missing values
t = divorce['Divorce']

# Display summary statistics for the features
print(f"These are the summary statistics for the features:\n {X.describe()}")
print("--------------------------------------------------------------------")
print(f"This is the number of rows in the dataset: {len(X)}")
print("--------------------------------------------------------------------")
print(f"This is the number of unique values in each column:\n {X.nunique()}")

These are the summary statistics for the features:
                Q1          Q2          Q3          Q4          Q5          Q6  \
count  170.000000  170.000000  170.000000  170.000000  170.000000  170.000000   
mean     1.776471    1.652941    1.764706    1.482353    1.541176    0.747059   
std      1.627257    1.468654    1.415444    1.504327    1.632169    0.904046   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
50%      2.000000    2.000000    2.000000    1.000000    1.000000    0.000000   
75%      3.000000    3.000000    3.000000    3.000000    3.000000    1.000000   
max      4.000000    4.000000    4.000000    4.000000    4.000000    4.000000   

               Q7          Q8          Q9         Q10  ...         Q45  \
count  170.000000  170.000000  170.000000  170.000000  ...  170.000000   
mean     0.494118    1.452941    1.458824    1.576471  ...    2.458824

#### Train-Test Split 

In [15]:
# Done with sklearn
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, t, test_size=0.33, random_state=42)

# Print the train X and y shapes
print(f"Train X shape: {train_X.shape}")
print(f"Train y shape: {train_y.shape}")

# Print the summary statistics of the training features and testing features
print(f"Train X summary statistics:\n {train_X.describe()}")
print("--------------------------------------------------------------------")
print(f"Test X summary statistics:\n {val_X.describe()}")

Train X shape: (113, 54)
Train y shape: (113,)
Train X summary statistics:
                Q1          Q2          Q3          Q4          Q5          Q6  \
count  113.000000  113.000000  113.000000  113.000000  113.000000  113.000000   
mean     1.672566    1.566372    1.690265    1.433628    1.477876    0.769912   
std      1.589264    1.387933    1.382800    1.469184    1.581688    0.906414   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
50%      2.000000    1.000000    2.000000    1.000000    1.000000    1.000000   
75%      3.000000    3.000000    3.000000    3.000000    3.000000    1.000000   
max      4.000000    4.000000    4.000000    4.000000    4.000000    4.000000   

               Q7          Q8          Q9         Q10  ...         Q45  \
count  113.000000  113.000000  113.000000  113.000000  ...  113.000000   
mean     0.477876    1.371681    1.398230    1
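
With only 170 rows, a 33% hold-out leaves 57 validation samples, so the class balance of each part can shift from split to split. A minimal sketch of a stratified alternative follows, assuming placeholder data: `X_demo` and `t_demo` are hypothetical stand-ins with the same shape as `X` and `t`, and the 85/85 class balance is an assumption made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as X (170 rows, 54 features).
# The 85/85 class balance is an assumption for illustration only.
X_demo = np.zeros((170, 54))
t_demo = np.array([0] * 85 + [1] * 85)

# stratify=t_demo keeps the class proportions (almost) equal in both parts
tr_X, va_X, tr_t, va_t = train_test_split(
    X_demo, t_demo, test_size=0.33, random_state=42, stratify=t_demo
)

print(tr_X.shape, va_X.shape)                # rows and columns of each part
print(np.bincount(tr_t), np.bincount(va_t))  # class counts in each part
```

The same `stratify` argument could be passed in the cell above as `stratify=t` if a balanced split is wanted.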

#### Random Forest Model
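Here we instantiate and train a random forest using scikit-learn's RandomForestClassifier. We use 100 trees, limit each tree to a maximum depth of 5, and fix the random seed so the result is reproducible; all other parameters are left at their defaults.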

In [29]:
from sklearn.ensemble import RandomForestClassifier

# NOTE: uncomment the following line to draw a new random train/validation split each time you run the code
# train_X, val_X, train_y, val_y = train_test_split(X, t, test_size=0.33)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Fit the model
rf.fit(train_X, train_y)

# Print the training and testing accuracies
print(f"Train accuracy: {rf.score(train_X, train_y)}")
print(f"Test accuracy: {rf.score(val_X, val_y)}")

# Get feature importances from our random forest model
importances = rf.feature_importances_

# Display the importances sorted in descending order in a dataframe
feature_importances = pd.DataFrame({
    'feature':train_X.columns,
    'importance':importances
}).sort_values('importance',ascending=False)

display(feature_importances)

Train accuracy: 1.0
Test accuracy: 0.9824561403508771


    feature  importance
17  Q18      0.144867
18  Q19      0.103354
39  Q40      0.09745
35  Q36      0.086462
10  Q11      0.070627
29  Q30      0.06751
16  Q17      0.05702
19  Q20      0.041687
15  Q16      0.038775
13  Q14      0.03775
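
To make clearer what `feature_importances_` reports, the sketch below checks how the forest-level values relate to the individual trees. As far as we understand scikit-learn's implementation, the forest averages the impurity-based importances of its trees and rescales the result to sum to 1. The example uses synthetic data from `make_classification` (an assumption, not the divorce data), and `forest`, `per_tree`, and `manual` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the divorce data (assumption)
X_demo, y_demo = make_classification(
    n_samples=170, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances of every individual tree, one row per tree
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then rescale so the importances sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.allclose(manual, forest.feature_importances_))
print(forest.feature_importances_.sum())
```

Because the values are normalized, each importance is best read as a share of the total rather than as an absolute effect size.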


#### Feature Importance
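In this section we compute the average importance of each feature in the divorce dataset by fitting scikit-learn's RandomForestClassifier repeatedly (10,000 fits, as set by iterations in the code below) and averaging the importance of each feature across all fits. The train/validation split is drawn once, before the loop, so the average smooths out the randomness of the forests but not that of the split. We then plot the average feature importance for each feature in the dataset.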

In [36]:
iterations = 10000

# Randomly split the data once; every forest below is fit on this same split
train_X, val_X, train_y, val_y = train_test_split(X, t, test_size=0.33)

# Create empty array to hold feature importances
feature_importances_values = np.zeros((iterations, len(train_X.columns)))

# Create the random forest model and save the feature importances
for i in range(iterations):
    rf = RandomForestClassifier(n_estimators=100, max_depth=5)
    rf.fit(train_X, train_y)
    feature_importances_values[i,:] = rf.feature_importances_

# Create a dataframe of the mean feature importances
feature_importances_df = pd.DataFrame({
    'feature':train_X.columns,
    'importance':feature_importances_values.mean(axis=0)
}).sort_values('importance',ascending=False)

display(feature_importances_df.head(10))

    feature  importance
17  Q18      0.095916
39  Q40      0.071994
18  Q19      0.066364
25  Q26      0.064343
19  Q20      0.064011
8   Q9       0.061889
10  Q11      0.061301
15  Q16      0.058346
28  Q29      0.043989
27  Q28      0.039828
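
The ranking here differs from the single seeded run earlier, which suggests the importances vary noticeably between fits. One way to see how much is to record the spread alongside the mean and to redraw the split on every run. The sketch below is a variation on the cell above, not a replacement for it: it uses synthetic data from `make_classification` (an assumption), a small hypothetical `n_runs`, and placeholder column names `F1` to `F8`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the divorce data (assumption)
X_arr, y_demo = make_classification(
    n_samples=170, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"F{i + 1}" for i in range(8)])

n_runs = 20  # hypothetical value, kept small so the sketch runs quickly
runs = np.zeros((n_runs, X_demo.shape[1]))

for seed in range(n_runs):
    # New split and new forest seed on every run
    tr_X, _, tr_y, _ = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=seed
    )
    forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=seed)
    runs[seed] = forest.fit(tr_X, tr_y).feature_importances_

# Mean and standard deviation of each feature's importance across runs
summary = pd.DataFrame({
    "feature": X_demo.columns,
    "mean_importance": runs.mean(axis=0),
    "std_importance": runs.std(axis=0),
}).sort_values("mean_importance", ascending=False)

print(summary)
```

Features whose standard deviation is large relative to their mean are the ones whose rank is most likely to change between runs.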


In [37]:
display(feature_importances_df)

    feature  importance
17  Q18      0.095916
39  Q40      0.071994
18  Q19      0.066364
25  Q26      0.064343
19  Q20      0.064011
8   Q9       0.061889
10  Q11      0.061301
15  Q16      0.058346
28  Q29      0.043989
27  Q28      0.039828
