# Explanatory Models

In this chapter we train a feature-rich machine learning model for **explanatory purposes**. Our interest shifts from the predictive power of the model to its explanatory power: not only do we want the model to make accurate predictions, we also want it to **demonstrate the relationship between features and target variables in a robust and explainable way**.

## Preamble

In [None]:
import data_science_learning_paths
data_science_learning_paths.setup_plot_style()

In [None]:
import pandas
import matplotlib.pyplot as plt
import seaborn

## Dataset: House Prices

We use the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) dataset for the following examples. Our goal is to model the sale price of a house given various attributes of the property. We preprocess the dataset as follows:
- exclude columns with mostly missing values
- encode certain attributes measured on a quality scale as integers
- one-hot-encode categorical attributes

In [None]:
data_v0 = data_science_learning_paths.datasets.read_house_prices(
    encode_ordinal=True,
    drop_sparse=True,
    encode_categorial=True,
    drop_first_level=False,
)

In [None]:
data_v0.shape

In [None]:
data_v0.head()

In [None]:
target = "SalePrice"
features = data_v0.columns.difference([target])

A linear model can benefit from scaling the attributes to a common interval, because coefficients of features on the same scale are more directly comparable:

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
features_scaled = pandas.DataFrame(
    MinMaxScaler().fit_transform(data_v0[features]),
    columns=features,
    index=data_v0.index,  # keep the original index so the later concat aligns rows
)

In [None]:
features_scaled.head()

In [None]:
data_v0 = pandas.concat([features_scaled, data_v0[target]], axis=1)

In [None]:
data_v0.head()

## Fitting the Model

For this example, we select a **linear model** and fit it with **ordinary least squares regression**. For explanatory modelling, the [`statsmodels`](https://www.statsmodels.org/stable/index.html) library is a good choice, since it reports a large amount of statistical diagnostic information along with the model.

In [None]:
import statsmodels.api as sm

Let's fit the model and output the summary:

In [None]:
sm.OLS(data_v0[target], data_v0[features]).fit().summary()

In order to inspect the parameters and metadata of the model, we need the result of the `fit` method:

In [None]:
result = sm.OLS(data_v0[target], data_v0[features]).fit()

We are interested in the parameters of the model, i.e. the estimated coefficients, as well as the associated $p$-values.

In [None]:
result.params.plot(kind="bar")

In [None]:
result.pvalues.plot(kind="bar")

Again, sorted by the value of the coefficients:

In [None]:
result.params.sort_values().plot(kind="bar")

In [None]:
result.pvalues.reindex(result.params.sort_values().index).plot(kind="bar")

## Symptom: Non-Significant Coefficients

We observe that many of the coefficients have very high $p$-values, and this is not limited to the features that were assigned very small coefficients. If we want to use the coefficients of the model for explanation, this means trouble: observe how the model weights the features when trained on two different samples.

In [None]:
data_sample_a = data_v0.sample(frac=0.5)
data_sample_b = data_v0.sample(frac=0.5)

In [None]:
sm.OLS(
    data_sample_a[target], 
    data_sample_a[features]
).fit().params.plot(kind="bar")

In [None]:
sm.OLS(
    data_sample_b[target], 
    data_sample_b[features]
).fit().params.plot(kind="bar")

## Diagnosis: Multicollinearity

Let's have a look at the correlations between our numerous features. For this purpose, we plot a correlation matrix using the [`yellowbrick`](https://www.scikit-yb.org/) ML visualization library.

In [None]:
from yellowbrick.features import Rank2D

In [None]:
f, ax = plt.subplots(1, 1, figsize=(16, 16))

visualizer = Rank2D(features=data_v0.columns, algorithm='pearson', ax=ax)

visualizer.fit(data_v0, None)  # Fit the data to the visualizer
visualizer.transform(data_v0)  # Transform the data

visualizer.show()  # Draw the plot (called `poof()` in Yellowbrick versions before 1.0)
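
With this many features, the heat map can be hard to read. As a complement, the strongest pairwise correlations can be listed directly. The following is a minimal sketch on synthetic data and is not part of the original notebook; the helper `strongest_pairs` and the toy columns `a`, `b`, `c` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas


def strongest_pairs(frame, top=5):
    """Return the `top` feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so that every pair appears once
    # and the diagonal (correlation of a feature with itself) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)


# Toy data (an assumption for illustration): `b` is a noisy copy of `a`,
# `c` is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pandas.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

print(strongest_pairs(toy, top=3))
```

Applied to the data of this chapter, a call such as `strongest_pairs(data_v0[features])` would list the most redundant feature pairs first.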


**What is multicollinearity?**

[Multicollinearity](https://en.m.wikipedia.org/wiki/Multicollinearity) means that the feature set is **redundant** in the sense that features can be predicted (linearly) from other features with high accuracy. **Perfect multicollinearity** exists when a feature is an exact [linear combination](https://en.m.wikipedia.org/wiki/Linear_combination) of other features.

**Why is it problematic?**

If features are correlated in this way, a model can rely on either one for its prediction, _arbitrarily_. More specifically, the model arbitrarily assigns weight to either one of the collinear features, and the correlated features may end up with non-significant coefficients. This also fits the definition of **overfitting**: our model depends greatly on the specific training sample, and its results are not generalizable. For example, we might get very different feature importances if we train the model with slightly different data.
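
The arbitrariness can be reproduced on a small synthetic example. The following sketch is an illustration under stated assumptions, not the house price data: `x2` is constructed as an almost exact copy of `x1`, and the helper `fit_on_random_half` is a hypothetical name. The individual coefficients can differ strongly between two random halves of the data, while their sum, which is what the data actually determines, stays close to the true value of 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # almost a copy of x1
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)  # target depends on the shared signal

X = np.column_stack([x1, x2])


def fit_on_random_half(seed):
    """Fit least squares (no intercept) on a random half of the rows."""
    idx = np.random.default_rng(seed).choice(n, size=n // 2, replace=False)
    coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return coef


coef_a = fit_on_random_half(1)
coef_b = fit_on_random_half(2)

print("sample a:", coef_a, "sum:", coef_a.sum())
print("sample b:", coef_b, "sum:", coef_b.sum())
```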

**Caveat**

Multicollinearity is not the only possible root cause of the symptom (non-significant coefficients) that we saw. For example, trying to estimate too many model parameters from too few data points may be another one.



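## Treatment: Reducing Multicollinearity

In the following we will attempt feature selection to treat the multicollinearity problem.

### Eliminating Perfect Multicollinearity

There is one obvious source of multicollinearity here, and we have introduced it ourselves in the preprocessing: by simple one-hot encoding of categorical features, we created _perfect_ multicollinearity. For every categorical attribute, the dummy columns sum to one in each row, so any one of them can be computed exactly from the remaining ones together with a constant column (or the dummy columns of another attribute).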

Consider the attribute that represents the building type. Let's encode it:

In [None]:
raw_data = pandas.read_csv("../.assets/data/house/prices.csv")

In [None]:
raw_data["BldgType"].unique()

In [None]:
pandas.get_dummies(
    raw_data[["BldgType"]]
).sample(10)

There is an easy fix, provided by `pandas`: with `drop_first=True`, one level per attribute is dropped from the encoding.

In [None]:
pandas.get_dummies(
    raw_data[["BldgType"]],
    drop_first=True
).sample(10)
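
The effect of `drop_first=True` on the redundancy can be checked by comparing the rank of the encoded matrix with its number of columns. This is a minimal sketch on a hand-made toy table, not the house price data; the values of `BldgType` and `Street` below and the helper `column_rank` are assumptions made for illustration.

```python
import numpy as np
import pandas

# Toy table with two categorical attributes (hypothetical values).
toy = pandas.DataFrame({
    "BldgType": ["1Fam", "Duplex", "Twnhs", "1Fam", "Duplex", "Twnhs"],
    "Street": ["Pave", "Pave", "Grvl", "Grvl", "Grvl", "Pave"],
})

full = pandas.get_dummies(toy, dtype=float)
reduced = pandas.get_dummies(toy, drop_first=True, dtype=float)


def column_rank(dummies):
    """Return (rank, number of columns) of the encoded feature matrix."""
    matrix = dummies.to_numpy()
    return int(np.linalg.matrix_rank(matrix)), matrix.shape[1]


# Full encoding: the dummies of each attribute sum to one, so the columns
# are linearly dependent and the rank is lower than the column count.
print("full:   ", column_rank(full))

# After dropping one level per attribute, rank and column count agree.
print("reduced:", column_rank(reduced))
```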

Let's look at the situation when applying this in the preprocessing:

In [None]:
data_v1 = data_science_learning_paths.datasets.read_house_prices(
    encode_ordinal=True,
    drop_sparse=True,
    encode_categorial=True,
    drop_first_level=True,  # drop one level when one-hot-encoding!
)
target = "SalePrice"
features = data_v1.columns.difference([target])

### Exercise: What changed after removing perfect multicollinearity?

Repeat the above modeling and diagnosis steps. Did dropping the first level of categorical variables improve the robustness of the model?

In [None]:
# Your code here

### Diagnostic Instrument: Clustered Correlation Plot

Diagnosing and fixing multicollinearity issues can get difficult when faced with a large number of variables. A diagnostic method that has been proposed is the use of a **clustered correlation plot**:


1. Compute a correlation matrix of the features. 

We can use Pearson's correlation coefficient, but since we do not care about the direction of the correlation, we take the absolute value.

In [None]:
feature_data = data_v1[features]
feature_correlations = feature_data.corr().abs()
feature_correlations.head()

2. Compute an affinity-based clustering of the features, treating the strength of their correlations as the measure of affinity.

scikit-learn's [`AffinityPropagation`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) algorithm is well suited for this purpose:


In [None]:
from sklearn.cluster import AffinityPropagation

In [None]:
cluster_labels = AffinityPropagation(
    affinity="precomputed"
).fit_predict(feature_correlations)

clusters = pandas.Series(
    cluster_labels,
    index=feature_data.columns
)
clusters.head()

3. Order the columns and rows of the correlation matrix by their clusters.

In the result, variables that belong to the same correlation cluster are placed next to each other. The clusters are now clearly visible as blocks along the diagonal.
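
The reordering in step 3 can be sketched on a tiny hand-made example. The matrix `toy_corr` and the labels in `toy_clusters` are made-up values for illustration, not output of the clustering above; plotting the reordered matrix is left to the exercise below.

```python
import pandas

# Hypothetical absolute correlations between four features.
toy_corr = pandas.DataFrame(
    [
        [1.0, 0.1, 0.9, 0.2],
        [0.1, 1.0, 0.2, 0.8],
        [0.9, 0.2, 1.0, 0.1],
        [0.2, 0.8, 0.1, 1.0],
    ],
    index=list("abcd"),
    columns=list("abcd"),
)

# Hypothetical cluster labels: `a` and `c` belong together, as do `b` and `d`.
toy_clusters = pandas.Series([0, 1, 0, 1], index=toy_corr.index)

# Sort the features by cluster label (stable sort keeps the original order
# within a cluster) and apply the same order to rows and columns.
order = toy_clusters.sort_values(kind="mergesort").index
toy_corr_clustered = toy_corr.loc[order, order]

print(toy_corr_clustered)
```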

### Exercise: Clustered Correlation Plots

Implement a clustered correlation plot function that helps you inspect clusters of collinear variables. Apply it to the house price regression problem and interpret the results. Can you identify variables that should be dropped as features to improve the model?



In [None]:
# Your code here

## Open-Ended Exercise: Improve the Explanatory Model

Use all data science tools available to you to obtain a more robust explanatory model.

## References

- [Lucas Bernardi @ PyData Amsterdam 2017: Diagnosing Machine Learning Models](https://www.youtube.com/watch?v=ZD8LA3n6YvI)

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_