# 4.3 Interpretability

This notebook reproduces the example from Section 4.3 of the paper 'Linux Kernel Configurations at Scale: A Dataset for Performance and Evolution Analysis' (EASE 2025). It trains a Random Forest regressor on TuxKConfig version 5.8 (OpenML ID: 46744) to identify the configuration options that most influence kernel binary size.

## Steps:
1. **Load Dataset**: Fetch version 5.8 from OpenML.
2. **Train Model**: Fit a Random Forest regressor.
3. **Extract Importance**: Rank the options by the model's impurity-based feature importance scores and print the top 5.



In [None]:
import openml
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

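# Step 1: Load TuxKConfig v5.8
dataset = openml.datasets.get_dataset(46744)
# get_data returns (X, y, categorical_indicator, attribute_names),
# so the two extra values must be unpacked or discarded.
X, y, _, _ = dataset.get_data(target='Binary_Size')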

# Step 2: Train Random Forest
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Step 3: Extract and rank top 5 options
importances = model.feature_importances_
feature_names = X.columns
top_indices = importances.argsort()[-5:][::-1]
top_features = [(feature_names[i], importances[i]) for i in top_indices]
for feature, score in top_features:
    print(f'{feature}: {score:.4f}')
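
The same three steps can be tried offline on synthetic data, which is handy for checking the ranking logic without downloading the full dataset. The sketch below is illustrative only: the option names (`OPT_0` to `OPT_7`), the synthetic target, and the helper `top_k_importances` are assumptions made for this example and are not part of TuxKConfig or the paper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def top_k_importances(model, feature_names, k=5):
    """Return the k (name, score) pairs with the highest feature importance.

    Hypothetical helper: wraps the argsort-based ranking used in the cell above.
    """
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:k]  # indices sorted by descending score
    return [(feature_names[i], float(importances[i])) for i in order]


# Synthetic stand-in for a configuration matrix: 8 binary options, 500 configs.
rng = np.random.default_rng(0)
names = [f"OPT_{i}" for i in range(8)]  # made-up option names
X_demo = pd.DataFrame(rng.integers(0, 2, size=(500, 8)), columns=names)

# Assumed ground truth: only OPT_0 and OPT_3 affect the target, plus small noise.
y_demo = 50.0 * X_demo["OPT_0"] + 10.0 * X_demo["OPT_3"] + rng.normal(0.0, 1.0, size=500)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

ranking = top_k_importances(demo_model, names, k=3)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Because the synthetic target depends far more strongly on `OPT_0` than on `OPT_3`, those two options should head the ranking, with the remaining options scoring close to zero. On the real dataset the scores are impurity-based and sum to 1 across all options, so they are best read as a relative ranking rather than as effect sizes in bytes.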