# 4.1 Performance Prediction

This notebook reproduces the example from Section 4.1 of the paper 'Linux Kernel Configurations at Scale: A Dataset for Performance and Evolution Analysis' (EASE 2025). It uses a simple linear regression model to predict the binary size (in MB) of the vmlinux file for Linux kernel version 4.15, based on the TuxKConfig dataset available on OpenML (ID: 46739).

## Steps:
1. **Load the Dataset**: Fetch version 4.15 from OpenML.
2. **Prepare Data**: Split into training (80%) and test (20%) sets.
3. **Train Model**: Fit a linear regression model.
4. **Evaluate**: Calculate the Mean Absolute Error (MAE).
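
To make the metric in step 4 concrete, here is a minimal, dependency-free sketch of the Mean Absolute Error: the average of the absolute differences between predicted and actual values. The helper name `mean_absolute_error` and the toy sizes below are assumptions made for illustration; they are not taken from the TuxKConfig dataset or from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Return the average of |prediction - truth| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical binary sizes in MB, invented for illustration only.
toy_true = [10.0, 20.0, 30.0, 40.0]
toy_pred = [12.0, 18.0, 33.0, 40.0]

# Absolute errors are 2, 2, 3 and 0, so the mean is 7 / 4.
toy_mae = mean_absolute_error(toy_true, toy_pred)
print(f"MAE: {toy_mae:.2f} MB")  # prints "MAE: 1.75 MB"
```

Because the MAE is expressed in the same unit as the target, an MAE of 1.75 MB means the predictions are off by 1.75 MB on average.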



In [None]:
import openml
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Step 1: Fetch TuxKConfig v4.15 from OpenML
dataset = openml.datasets.get_dataset(46739)
# get_data returns (X, y, categorical_indicator, attribute_names); keep only X and y
X, y, _, _ = dataset.get_data(target='Binary_Size')

# Step 2: Split data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 4: Predict and calculate MAE
predictions = model.predict(X_test)
mae = np.abs(predictions - y_test).mean()
print(f'MAE: {mae:.2f} MB')