# Figure 1: Boxplot of Binary Size Distribution

This notebook reproduces Figure 1 from the paper 'Linux Kernel Configurations at Scale: A Dataset for Performance and Evolution Analysis' (EASE 2025). It creates a boxplot showing the binary size distribution (in MB) for each Linux kernel version in the TuxKConfig dataset, from 4.13 to 5.8.

## Steps:
1. **Load Datasets**: Fetch all versions from OpenML.
2. **Extract Sizes**: Collect binary size data for each version.
3. **Plot Boxplot**: Visualize the distribution with Matplotlib.

The figure shows medians, IQRs, whiskers, and outliers, as described in Section 4.
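
To make these boxplot elements concrete, here is a minimal sketch of how they can be computed by hand. It assumes Matplotlib's default whisker rule (whiskers reach the most extreme points within 1.5 × IQR of the quartiles) and only roughly mirrors what `plt.boxplot` does internally. The helper name `box_stats` and the sample values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Summarize one sample roughly the way a default boxplot draws it."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]  # ignore missing values
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points beyond these are drawn as outliers
    lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = v[(v >= lo_fence) & (v <= hi_fence)]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside.min(),   # lowest point inside the fences
        "whisker_high": inside.max(),  # highest point inside the fences
        "outliers": np.sort(v[(v < lo_fence) | (v > hi_fence)]),
    }

# Made-up sizes for illustration only (not values from the dataset)
example = [10, 12, 13, 14, 15, 16, 18, 60]
stats = box_stats(example)
print(stats)
```

Applying a helper like this to each entry of `sizes` would give a numeric counterpart to the figure, which can be easier to compare across versions than reading values off the plot.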

In [None]:
import openml
import matplotlib.pyplot as plt
import pandas as pd

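# Steps 1 & 2: Load each dataset and extract its binary sizes
# Maps kernel version -> OpenML dataset ID
versions = {'4.13': 46759, '4.15': 46739, '4.20': 46740, '5.0': 46741,
            '5.4': 46742, '5.7': 46743, '5.8': 46744}
sizes = []
for ver, dataset_id in versions.items():  # avoid shadowing the built-in id()
    dataset = openml.datasets.get_dataset(dataset_id)
    # get_data returns (X, y, categorical_indicator, attribute_names)
    _, y, _, _ = dataset.get_data(target='Binary_Size')
    sizes.append(y)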

# Step 3: Plot boxplot (one box per kernel version)
# Note: Matplotlib 3.9+ renames `labels` to `tick_labels`
plt.boxplot(sizes, labels=list(versions))
plt.title('Binary Size Distribution (MB) Across Versions 4.13 to 5.8')
plt.ylabel('Binary Size (MB)')
plt.show()