<a href="https://colab.research.google.com/github/oriol-pomarol/codegeo_workshops/blob/main/2_feature_importance/2_feature_importance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2 Feature importance - deciphering ML’s predictions
During this workshop we will explore three different ways to estimate the importance of our input features for the model outputs: impurity feature importance, permutation feature importance and SHAP feature importance.

## 2.1 Setting up our (random forest) model
To begin estimating feature importance we need a model. We will use the same model as in [the previous "understanding random forest" workshop](../1_understanding_random_forest/1_understanding_random_forest.ipynb). Surprise, surprise, it's a random forest model. The code blocks below (1) load the data, (2) split our data into training and testing datasets, (3) train our random forest model and (4) provide a simple evaluation of the model performance.

Note that this code is nearly identical to the code in the "understanding random forest" workshop. If you have any problems understanding what is happening, please take a look there.

In [None]:
import pandas as pd

# Load the data
data_url =  "https://raw.githubusercontent.com/Jignesh1594/CodeGeoworkshop_02_understanding_RF/master/data.csv"
data = pd.read_csv(data_url, delimiter=",", on_bad_lines='skip')
data.head()

In [None]:
import sklearn.model_selection as model_selection

# Split the data into training and test sets
input_data = data[['WLHv', 'RH', 'EV24', 'QMeuse', 'QRhine']]
output_data = data['value']
X_train, X_test, y_train, y_test = model_selection.train_test_split(input_data,
                                                                    output_data, 
                                                                    test_size=0.1,
                                                                    shuffle=False)

print(f"Train sample size is {X_train.index.size} and test sample size is {X_test.index.size}")

In [None]:
import sklearn.ensemble as ensemble

# Train the model
model = ensemble.RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
model

In [None]:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
from matplotlib.style import use
import matplotlib.dates as mdates

# Evaluate the model
y_pred = model.predict(X_test)
mse = metrics.mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

prediction = {"date": y_test.index,
              "actual": y_test,
              "predicted": y_pred}
prediction = pd.DataFrame(prediction)

# Plot the prediction
use('ggplot')
fig, ax = plt.subplots()
prediction.plot.line(x = "date",
                     ax=ax)
ax.set_title("Actual vs Predicted")
ax.set_ylabel("Water level (m)")
ax.set_xlabel("Date")
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
fig.tight_layout()

### 2.1.1 Input features
It seems the model has a decent performance, but how is it making those predictions? To find out, we will investigate the importance of the input features for these predictions. Naturally, what is particularly important for this workshop is the input data for the random forest model. Here we use five inputs, called 'WLHv', 'RH', 'EV24', 'QMeuse' and 'QRhine', that represent the water level, rainfall, evaporation, discharge in the Meuse and discharge in the Rhine, respectively. The code below will make a quick plot of all the input data.

In [None]:
import matplotlib.pyplot as plt
from matplotlib.style import use
import matplotlib.dates as mdates

for input_feature in X_test.columns:
    input_series = {"date": X_test.index,
                    "input": X_test[input_feature]}
    input_series = pd.DataFrame(input_series)
    
    # Plot the input feature
    use('ggplot')
    fig, ax = plt.subplots()
    input_series.plot.line(x="date",
                           ax=ax)
    ax.set_title("Random forest input")
    ax.set_ylabel(f"{input_feature}")
    ax.set_xlabel("Date")
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
    fig.tight_layout()

## 2.2 Impurity feature importance
Impurity feature importance is a type of feature importance specific to tree-based models such as random forests, and is sometimes called "gini importance" or "mean decrease in impurity". This is the same measure used in the [previous workshop on "understanding random forests"](../1_understanding_random_forest/1_understanding_random_forest.ipynb). Impurity feature importance is defined as the total decrease in node impurity, weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node), averaged over all trees of the ensemble. Simply put, it measures how much less varied the data in a node becomes after a split decision (based on a specific input feature) is made; features whose splits reduce the variation the most are the most important.

### 2.2.1 Engage some braincells
Impurity feature importance is directly calculated by the sklearn package and stored in a property of the *RandomForestRegressor* class. Go to the [RandomForestRegressor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), identify the property that stores the feature importance, and put it in the code below in place of "feature_importance_property" (first line). If today is not the day to engage your braincells, you can see the answer in the text below.

In [None]:
importances = model."feature_importance_property"

# Register the feature importance
feature_importance = {"feature": X_test.columns.to_list(),
                      "importance": importances.tolist()}
feature_importance = pd.DataFrame(feature_importance)
feature_importance.head()

<details>
    <summary>Click to see the solution</summary>
    The feature importance property of the *RandomForestRegressor* class is *feature_importances_*
</details>

In [None]:
import matplotlib.pyplot as plt
from matplotlib.style import use

# Plot the feature importance
use('ggplot')
fig, ax = plt.subplots()
feature_importance.plot.bar(x = "feature",
                            y = "importance",
                            ax=ax)
ax.set_title("Impurity feature importance")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
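
To make the definition above more concrete, the sketch below recomputes the impurity importance by hand from the fitted trees and compares it with the value sklearn reports. It runs on synthetic data, and the names `X_demo`, `y_demo`, `forest` and `tree_impurity_importance` are assumptions made for illustration; they are not part of the workshop code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the workshop dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

def tree_impurity_importance(tree, n_features):
    """Sum the weighted impurity decrease of every split, per feature."""
    t = tree.tree_
    importance = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # a leaf has no split, so it contributes nothing
            continue
        # Impurity before the split minus impurity after it, each weighted
        # by the number of samples that reach the node
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        importance[t.feature[node]] += decrease / t.weighted_n_node_samples[0]
    return importance / importance.sum()  # normalise to sum to 1 per tree

# Average over all trees of the ensemble and normalise once more
manual = np.mean([tree_impurity_importance(est, 3) for est in forest.estimators_], axis=0)
manual = manual / manual.sum()

print("manual :", manual.round(3))
print("sklearn:", forest.feature_importances_.round(3))
```

The two printed rows should agree, which shows that the reported importance is just this weighted sum of impurity decreases. For a regression forest with default settings the impurity is the variance of the target within a node.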

## 2.3 Permutation feature importance
Permutation feature importance is a more general method that can be applied to any type of machine learning model (e.g. random forest, neural network, LSTM) to determine the feature importance. Permutation feature importance is defined as the decrease in a model score when the values of a single feature are adjusted (permuted). This is achieved by permuting the input features one at a time, predicting the model outputs with the permuted input feature, and comparing the original model outputs with the permuted model outputs.

Here we will try four different types of permutation. Three types of permutation aim to eliminate the signal of a specific input feature by taking the minimum, mean and maximum of the input feature, whereas the final permutation type aims to introduce a lot of noise to the signal of a specific input feature by shuffling the dates around. Here we assess the difference between the original model outputs and the permuted model outputs using the mean squared error.

In [None]:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
from matplotlib.style import use
import matplotlib.dates as mdates

mean_importance = {"feature": [],
                   "importance": [],}

for input_feature in X_test.columns:
    # Mean permutation importance:
    # Replaces the input feature with its mean to determine performance
    X_test_permuted = X_test.copy()
    X_test_permuted[input_feature] = X_test_permuted[input_feature].mean()
    
    permutation = {"date": X_test.index,
                  "actual": X_test[input_feature],
                  "permuted": X_test_permuted[input_feature]}
    permutation = pd.DataFrame(permutation)
    
    # Plot the permutation
    use('ggplot')
    fig, ax = plt.subplots()
    permutation.plot.line(x="date",
                         ax=ax)
    ax.set_title("Actual vs Permuted")
    ax.set_ylabel(f"{input_feature}")
    ax.set_xlabel("Date")
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
    fig.tight_layout()
    
    # Register the feature importance
    y_pred_permuted = model.predict(X_test_permuted)
    mse_permuted = metrics.mean_squared_error(y_pred, y_pred_permuted)
    
    mean_importance["feature"].append(input_feature)
    mean_importance["importance"].append(mse_permuted)
    print(f'Permutation (mean) importance of {input_feature}: {mse_permuted}')
    

### 2.3.1 Engaging some brain cells
Now that I have shown how to calculate the mean permutation feature importance, do the same for the minimum, maximum and shuffle feature importance in the three code blocks below. Just copy the above code and adjust where necessary. Make sure you register the importance information to the correct dictionary: *minimum_importance*, *maximum_importance* and *shuffle_importance* respectively. A quick tip is to take a look at the [DataFrame sample function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) for the shuffle importance. If today is not the day to engage your braincells, you can see the answer in the text below.

In [None]:
minimum_importance = {"feature": [],
                   "importance": [],}

In [None]:
maximum_importance = {"feature": [],
                   "importance": [],}

In [None]:
shuffle_importance = {"feature": [],
                   "importance": [],}

<details>
    <summary>Click to see the minimum solution</summary>
    ``` python
    import sklearn.metrics as metrics
    import matplotlib.pyplot as plt
    from matplotlib.style import use
    import matplotlib.dates as mdates

    minimum_importance = {"feature": [],
                    "importance": [],}

    for input_feature in X_test.columns:
        X_test_permuted = X_test.copy()
        X_test_permuted[input_feature] = X_test_permuted[input_feature].min()
        
        permutation = {"date": X_test.index,
                    "actual": X_test[input_feature],
                    "permuted": X_test_permuted[input_feature]}
        permutation = pd.DataFrame(permutation)
        
        # Plot the permutation
        use('ggplot')
        fig, ax = plt.subplots()
        permutation.plot.line(x="date",
                            ax=ax)
        ax.set_title("Actual vs Permuted")
        ax.set_ylabel(f"{input_feature}")
        ax.set_xlabel("Date")
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        fig.tight_layout()
        
        # Register the feature importance
        y_pred_permuted = model.predict(X_test_permuted)
        mse_permuted = metrics.mean_squared_error(y_pred, y_pred_permuted)
        
        minimum_importance["feature"].append(input_feature)
        minimum_importance["importance"].append(mse_permuted)
        print(f'Permutation (minimum) importance of {input_feature}: {mse_permuted}')
    ```
</details>

<details>
    <summary>Click to see the maximum solution</summary>
    ``` python
    import sklearn.metrics as metrics
    import matplotlib.pyplot as plt
    from matplotlib.style import use
    import matplotlib.dates as mdates

    maximum_importance = {"feature": [],
                    "importance": [],}

    for input_feature in X_test.columns:
        X_test_permuted = X_test.copy()
        X_test_permuted[input_feature] = X_test_permuted[input_feature].max()
        
        permutation = {"date": X_test.index,
                    "actual": X_test[input_feature],
                    "permuted": X_test_permuted[input_feature]}
        permutation = pd.DataFrame(permutation)
        
        # Plot the permutation
        use('ggplot')
        fig, ax = plt.subplots()
        permutation.plot.line(x="date",
                            ax=ax)
        ax.set_title("Actual vs Permuted")
        ax.set_ylabel(f"{input_feature}")
        ax.set_xlabel("Date")
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        fig.tight_layout()
        
        # Register the feature importance
        y_pred_permuted = model.predict(X_test_permuted)
        mse_permuted = metrics.mean_squared_error(y_pred, y_pred_permuted)
        
        maximum_importance["feature"].append(input_feature)
        maximum_importance["importance"].append(mse_permuted)
        print(f'Permutation (maximum) importance of {input_feature}: {mse_permuted}')
    ```
</details>

<details>
    <summary>Click to see the shuffle solution</summary>
    ``` python
    import sklearn.metrics as metrics
    import matplotlib.pyplot as plt
    from matplotlib.style import use
    import matplotlib.dates as mdates

    shuffle_importance = {"feature": [],
                    "importance": [],}

    for input_feature in X_test.columns:
        X_test_permuted = X_test.copy()
        X_test_permuted[input_feature] = X_test_permuted[input_feature].sample(frac = 1).values
        
        permutation = {"date": X_test.index,
                    "actual": X_test[input_feature],
                    "permuted": X_test_permuted[input_feature]}
        permutation = pd.DataFrame(permutation)
        
        # Plot the permutation
        use('ggplot')
        fig, ax = plt.subplots()
        permutation.plot.line(x="date",
                            ax=ax)
        ax.set_title("Actual vs Permuted")
        ax.set_ylabel(f"{input_feature}")
        ax.set_xlabel("Date")
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        fig.tight_layout()
        
        # Register the feature importance
        y_pred_permuted = model.predict(X_test_permuted)
        mse_permuted = metrics.mean_squared_error(y_pred, y_pred_permuted)
        
        shuffle_importance["feature"].append(input_feature)
        shuffle_importance["importance"].append(mse_permuted)
        print(f'Permutation (shuffle) importance of {input_feature}: {mse_permuted}')
    ```
</details>
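
The four solutions differ only in how the feature column is replaced, so they can be folded into one small helper. The sketch below is one possible way to do that; it uses synthetic data, and the names `replace_feature`, `demo_model`, `X_eval` and `table` are hypothetical and not part of the workshop code. It skips the plots and only collects the importance values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# The three constant replacements; "shuffle" is handled separately below
CONSTANTS = {"mean": lambda c: c.mean(),
             "minimum": lambda c: c.min(),
             "maximum": lambda c: c.max()}

def replace_feature(X, feature, how, rng):
    """Return a copy of X with one column set to a constant or shuffled."""
    X_new = X.copy()
    if how == "shuffle":
        X_new[feature] = rng.permutation(X_new[feature].to_numpy())
    else:
        X_new[feature] = CONSTANTS[how](X_new[feature])
    return X_new

# Synthetic data (an assumption for illustration, not the workshop dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y_demo = 3.0 * X_demo["signal"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)
X_fit, X_eval = X_demo.iloc[:250], X_demo.iloc[250:]
demo_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_demo.iloc[:250])
baseline_pred = demo_model.predict(X_eval)

# One row per feature, one column per replacement type
rows = []
for feature in X_eval.columns:
    row = {"feature": feature}
    for how in ["mean", "minimum", "maximum", "shuffle"]:
        X_new = replace_feature(X_eval, feature, how, rng)
        # Same measure as in the workshop: MSE between original and permuted outputs
        row[how] = mean_squared_error(baseline_pred, demo_model.predict(X_new))
    rows.append(row)
table = pd.DataFrame(rows)
print(table)
```

Passing a seeded random generator makes the shuffle reproducible, which the `sample(frac = 1)` call in the solution above is not unless a `random_state` is given.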

### 2.3.2 Let's finally plot the importances

In [None]:
import matplotlib.pyplot as plt
from matplotlib.style import use

mean_importance = pd.DataFrame(mean_importance)
minimum_importance = pd.DataFrame(minimum_importance)
maximum_importance = pd.DataFrame(maximum_importance)
shuffle_importance = pd.DataFrame(shuffle_importance)

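# Copy the column selection so that adding columns does not trigger a SettingWithCopyWarning
permutation_importance = mean_importance[['feature']].copy()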
permutation_importance["mean"] = mean_importance['importance']
permutation_importance["minimum"] = minimum_importance['importance']
permutation_importance["maximum"] = maximum_importance['importance']
permutation_importance["shuffle"] = shuffle_importance['importance']

# Plot the feature importance
use('ggplot')
fig, ax = plt.subplots()
permutation_importance.plot.bar(x="feature",
                                ax=ax)
ax.set_title("Permutation feature importance")
ax.set_ylabel("Mean squared error")
fig.tight_layout()
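
As a cross-check, scikit-learn ships its own shuffle-based implementation, `sklearn.inspection.permutation_importance`. The sketch below shows how it could be called on synthetic data; the names `X_demo`, `demo_model` and `summary` are assumptions for illustration. Note one difference from the code above: sklearn measures the drop in a score computed against the *true* target values, whereas this workshop compares the original and permuted *predictions*, so the numbers are not directly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (an assumption for illustration, not the workshop dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y_demo = 3.0 * X_demo["signal"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

# Keep the last rows apart for evaluation, as in the workshop split
X_fit, X_eval = X_demo.iloc[:250], X_demo.iloc[250:]
y_fit, y_eval = y_demo.iloc[:250], y_demo.iloc[250:]
demo_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each feature n_repeats times and record the drop in score
result = permutation_importance(demo_model, X_eval, y_eval,
                                scoring="neg_mean_squared_error",
                                n_repeats=10, random_state=0)

summary = pd.DataFrame({"feature": X_eval.columns,
                        "importance_mean": result.importances_mean,
                        "importance_std": result.importances_std})
print(summary)
```

Repeating the shuffle several times gives a standard deviation alongside the mean, which helps to judge whether a small importance is distinguishable from noise.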

## 2.4 SHAP importance
Lastly we take a look at SHAP importance values. The SHAP (SHapley Additive exPlanations) package is a popular Python library used for interpreting the output of machine learning models. It provides a unified framework for explaining the predictions made by black-box models. SHAP values, in particular, quantify the contribution of each feature to the prediction made by the model. They provide a measure of feature importance and help in understanding the impact of individual features on the model's output.
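
To give a feeling for what a SHAP value represents, the sketch below computes exact Shapley values for a tiny toy function by enumerating all feature subsets. This is not how the SHAP package works internally for random forests (it uses much faster algorithms), and `shapley_values`, `toy_model` and the all-zero `baseline` are assumptions made purely for illustration. A feature counts as "absent" here when it is set to its baseline value.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        # Features in the subset take their real value, the rest the baseline
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return predict(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a subset of this size in the Shapley formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Marginal contribution of feature i when added to the subset
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

def toy_model(z):
    # One additive term and one interaction term
    return 2.0 * z[0] + z[1] * z[2]

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(toy_model, x, baseline)
print(phi)  # → [2. 3. 3.]
print(phi.sum(), toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the interaction term is shared equally between the two features involved. This additivity is what the plots further down rely on.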

Here we first need to install the SHAP package (especially on Google Colab) if it is not yet installed. Then we build a SHAP explainer and use our test dataset to generate the feature importance values. *Note this may take some time!* Afterwards, we plot the feature importance of our model using SHAP's built-in plotting functions [beeswarm](https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/beeswarm.html) and [bar](https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/bar.html).

In [None]:
%pip install shap

In [None]:
import shap

explainer = shap.Explainer(model)
shap_values = explainer(X_test)
shap_values

In [None]:
shap.plots.beeswarm(shap_values)
shap.plots.bar(shap_values)

### 2.4.1 Engage some braincells
SHAP also provides functionality to plot the *attribution* (both positive and negative) of the input features *for every single prediction our model has made* using the [waterfall](https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/waterfall.html) plot. Look at the documentation and figure out how to make a waterfall plot for the date at index 105 in the code block below. If today is not the day to engage your braincells, you can see the answer in the text below.

<details>
    <summary>Click to see the solution</summary>
    ```python
    date_index = 105
    shap.plots.waterfall(shap_values[date_index])
    ```
</details>