# Responsible Machine Learning Exercise 07 

22.05.2023


## 1. Feature Permutation Importance

### Pen and Paper
Calculate the feature permutation importance of the number of years of education, given the data below and the following linear model for income:
$$\hat{y} = 2 \cdot Education + 1 \cdot Height.$$

| Height | Years of Education | Income (k) |
|-|-|-|
| 1.8 | 16 | 40 |
| 1.6 | 20 | 50 |
| 1.75 | 12 | 100 |
| 1.8 | 8 | 25 |

Use the permutation $(1 \rightarrow 3, 2\rightarrow 4, 3 \rightarrow 2, 4 \rightarrow 1)$ and mean squared error.
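
The general procedure can be sketched on a separate toy example: compute the baseline loss, shuffle one feature column, recompute the loss, and take the difference. The data, the names `a`, `b`, `toy_model`, and the index array `order` are made up for illustration and are not the exercise data. How the arrow notation above translates into an index array is a convention you have to fix yourself; the sketch simply assumes that row `i` receives the value from row `order[i]`.

```python
import numpy as np

# Hypothetical toy data (not the exercise data).
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 0.0])
y = np.array([2.0, 5.0, 6.0])


def toy_model(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Illustrative linear model: y_hat = 2 * a + 1 * b."""
    return 2 * a + 1 * b


def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


# 1. Loss on the original data.
baseline = mse(y, toy_model(a, b))

# 2. Permute only column a; row i receives the value from row order[i].
order = np.array([1, 2, 0])
a_perm = a[order]

# 3. Loss with the permuted column, all other columns unchanged.
permuted = mse(y, toy_model(a_perm, b))

# 4. Importance = increase in loss caused by the permutation.
importance = permuted - baseline
print(baseline, permuted, importance)  # 0.0 8.0 8.0
```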



In [1]:
# Calculator cell

### Coding

In this task, you will implement the feature permutation importance algorithm by filling in the parts in the following cells.
Places where you have to write code are marked with `TODO`.

To check whether your implementation is correct, you can use the [eli5 library](https://eli5.readthedocs.io/en/latest/index.html).
To install it, open a shell/Anaconda Prompt and activate your Python environment.
If you followed the instructions from the first exercise, you can use the following commands:

```shell
conda activate xai
conda install -c conda-forge eli5
```

Afterwards, you should be able to import from eli5.
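
A quick way to confirm that the package is visible to the current environment, without running the full notebook, is a check like the following. This is only a sketch; the variable name `eli5_available` is illustrative.

```python
import importlib.util

# find_spec returns None if the package cannot be found in this environment.
eli5_available = importlib.util.find_spec("eli5") is not None
print("eli5 found" if eli5_available else "eli5 not found, check the active environment")
```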

In [None]:
from typing import Callable, List, Union

import matplotlib.pyplot as plt
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import train_test_split


dataset = fetch_california_housing()
print(
    "We predict the median house value using a linear regression model based on the following features:"
)
print(dataset.feature_names)

# Train a model and reserve testing data
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=0
)

model = LinearRegression()
model.fit(x_train, y_train)

print("\nThe model performs as follows:")
print(
    f"R2={model.score(x_test, y_test):.4f}, "
    f"MSE={mean_squared_error(y_test, model.predict(x_test)):.4f}"
)


In [None]:
def feature_permutation_importance(
    model: BaseEstimator,
    x: np.ndarray,
    y: np.ndarray,
    feat_idx: int,
    loss: Callable = mean_squared_error,
    random_state: int = 100,
) -> float:
    """Calculates the increase in loss when one feature column is permuted.

    The loss on the original data is compared with the loss on a copy of the data
    in which the column `feat_idx` has been shuffled.

    Args:
        model (BaseEstimator): trained model
        x (np.ndarray): original input features
        y (np.ndarray): targets
        feat_idx (int): the index of the column to permute
        loss (Callable, optional): a loss function. Defaults to mean_squared_error.
        random_state (int, optional): seeds the permutation. Defaults to 100.

    Returns:
        float: loss on the permuted data minus loss on the original data
    """
    # TODO: Fill in the function.
    pass


# TODO: Write a comment to explain the purpose of the rest of this cell.
results = []
for i, feature_name in enumerate(dataset.feature_names):
    imps = np.array(
        [
            feature_permutation_importance(model, x_test, y_test, i, random_state=j)
            for j in range(5)
        ]
    )
    results.append((i, imps.mean(), imps.std()))

for i, mean, std in sorted(results, key=lambda x: x[1], reverse=True):
    print(f"x{i}:{dataset.feature_names[i]}, importance={mean:.4f} ± {std:.4f}")



Compare your result to the implementation in eli5.

* If the weight/importance score of a feature is close to zero, what does that mean for how the model uses the feature in its predictions?

* What is the most important feature?

* What effect does removing the most important feature have on the model performance? Is there an intuitive explanation? Explain why the score does or does not make sense.


In [None]:
import eli5
from eli5.sklearn import PermutationImportance

importance = PermutationImportance(
    estimator=model,
    scoring=make_scorer(mean_squared_error, greater_is_better=False),
    random_state=0,
).fit(x_test, y_test)
eli5.show_weights(importance)
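
As an additional cross-check that does not depend on eli5, scikit-learn provides `sklearn.inspection.permutation_importance`. The following is a minimal sketch on synthetic data; the names `x_toy`, `y_toy`, and `toy_model` are illustrative and not part of the exercise. With `scoring="neg_mean_squared_error"`, the reported importance is the drop in score, which corresponds to the increase in MSE.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: column 0 matters most, column 1 a little, column 2 not at all.
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * x_toy[:, 0] + 0.5 * x_toy[:, 1] + rng.normal(scale=0.1, size=200)

toy_model = LinearRegression().fit(x_toy, y_toy)

# Importance = baseline score minus score after shuffling one column,
# averaged over n_repeats shuffles.
result = permutation_importance(
    toy_model,
    x_toy,
    y_toy,
    scoring="neg_mean_squared_error",
    n_repeats=5,
    random_state=0,
)

for i in np.argsort(result.importances_mean)[::-1]:
    print(
        f"x{i}: importance={result.importances_mean[i]:.4f} "
        f"± {result.importances_std[i]:.4f}"
    )
```

To apply the same check to the exercise, the fitted `model` together with `x_test` and `y_test` could be passed in place of the toy objects.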


## 2. PDP Plot

### Coding

Create the PDP plot of the model $\hat{f}(x) = 100 \cdot Height - 1 \cdot Weight$ for the weight feature.
Use the following values for weight: [50, 60, 70, 80, 90].

You can compute the plot by hand or by implementing the method in code (recommended).

| Height | Weight | Walking speed |
|-|-|-|
| 1.8 | 80 | 110 |
| 1.6 | 80 | 90 |
| 1.75 | 65 | 100 |
| 1.8 | 70 | 120 |
| 1.6 | 50 | 100 |
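
For reference, the partial dependence of the model on weight at a grid value $w$ is usually defined as the average prediction over all rows of the dataset, with weight fixed to $w$ and height left at its observed value. The notation below ($n$ for the number of rows, $Height^{(i)}$ for the height in row $i$) is an assumption of this note and is not taken from the exercise sheet:

$$\hat{f}_{Weight}(w) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(Height^{(i)}, w\right)$$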

In [None]:
def prediction(height: float, weight: float) -> float:
    r"""Return $\hat{f}(x) = 100 \cdot Height - 1 \cdot Weight$"""
    # TODO
    pass


def pdp_plot(
    weight_grid: List[Union[float, int]], heights: List[Union[float, int]]
) -> None:
    """Create a PDP plot for the specific problem in this task.

    Args:
        weight_grid (List[Union[float, int]]): The values of weight for which the predictions are calculated
        heights (List[Union[float, int]]): The height values of the dataset
    """
    # TODO
    # You can use matplotlib.pyplot for plotting
    pass


weights = [80, 80, 65, 70, 50]
heights = [1.8, 1.6, 1.75, 1.8, 1.6]

pdp_plot(weight_grid=[50, 60, 70, 80, 90], heights=heights)


### Interpreting PDP Plots

We now train a more complex model (a random forest) on the housing dataset used earlier.
Using the inspection capabilities of sklearn, we can easily create [PDP plots](https://scikit-learn.org/stable/modules/partial_dependence.html).
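
Conceptually, such a plot is built by fixing one feature to each grid value in turn and averaging the predictions over all rows. The following is a minimal sketch of that averaging step, not sklearn's actual implementation; `manual_partial_dependence`, `toy_predict`, and `x_toy` are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence

import numpy as np


def manual_partial_dependence(
    predict: Callable[[np.ndarray], np.ndarray],
    x: np.ndarray,
    feat_idx: int,
    grid: Sequence[float],
) -> np.ndarray:
    """Average prediction for each grid value of one feature."""
    averages = []
    for value in grid:
        x_mod = x.copy()  # leave the original data untouched
        x_mod[:, feat_idx] = value  # fix the feature for every row
        averages.append(predict(x_mod).mean())
    return np.array(averages)


def toy_predict(x: np.ndarray) -> np.ndarray:
    """Hypothetical model with an interaction: f(x) = x0 * x1."""
    return x[:, 0] * x[:, 1]


x_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
pd_values = manual_partial_dependence(toy_predict, x_toy, feat_idx=0, grid=[0.0, 1.0, 2.0])
print(pd_values)  # [ 0. 20. 40.]
```

For a fitted estimator, `predict` would be the estimator's `predict` method, for example `rf.predict` once the random forest below has been trained.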

In [None]:
rf = RandomForestRegressor(random_state=0)
rf.fit(x_train, y_train)

print("The model performs as follows:")
print(
    f"R2={rf.score(x_test, y_test):.4f}, "
    f"MSE={mean_squared_error(y_test, rf.predict(x_test)):.4f}"
)


In [None]:
fig, axes = plt.subplots(2, 4, figsize=(8, 4))
PartialDependenceDisplay.from_estimator(
    rf,
    x_test,
    list(range(len(dataset.feature_names))),
    feature_names=dataset.feature_names,
    ax=axes,
)
# Title every subplot, not only those in the first row.
for ax, feature_name in zip(axes.ravel(), dataset.feature_names):
    ax.set_title(feature_name)
fig.tight_layout()


For each feature, describe its effect on the prediction and argue whether it is important for the prediction.

What is a problem here w.r.t. the data points we use to create these plots?
