# MatBench MP bandgap prediction

This notebook runs a bandgap prediction experiment on the `matbench_mp_gap` dataset using:
- composition-based features (`ElementFraction` from matminer)
- simple regression models from scikit-learn
- train/test evaluation and a parity plot


## setup
Install the dependencies with the cell below.


In [None]:
# if running in Colab or a fresh environment, install dependencies:
!pip install matminer pymatgen scikit-learn matplotlib


## imports


In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from matminer.datasets import load_dataset
from matminer.featurizers.composition import ElementFraction
from pymatgen.core import Composition
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


## load the `matbench_mp_gap` dataset


In [None]:
dataset_name = 'matbench_mp_gap'
df = load_dataset(dataset_name)

print('Loaded dataset:', dataset_name)
print('Shape:', df.shape)
df.head()


## basic cleaning and subsetting
For this dataset, matminer returns a DataFrame with the columns `structure` and `gap pbe`. We keep both and drop rows with missing values.


In [None]:
target_col = 'gap pbe'
structure_col = 'structure'

df = df[[structure_col, target_col]].dropna()
print('After dropping NaNs:', df.shape)

# optional: subsample for faster experimentation if the environment is limited
MAX_SAMPLES = 5000  # adjust based on available resources
if len(df) > MAX_SAMPLES:
    df = df.sample(n=MAX_SAMPLES, random_state=0)
    print(f'Subsampled to {len(df)} rows for experimentation')


## convert to compositions and featurize
We convert each `Structure` object to its `Composition`, then featurize with `ElementFraction`, which gives one column per element holding that element's atomic fraction in the compound.
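
To make the featurization concrete, here is a minimal pure-Python sketch of what an element-fraction vector encodes. It is not matminer's implementation: the truncated `ELEMENTS` list and the `element_fractions` helper are hypothetical names used only for illustration, and the input is a hand-written dict of element amounts rather than a pymatgen `Composition`.

```python
# Illustrative only: a hand-rolled element-fraction vector, not matminer's code.
ELEMENTS = ["H", "O", "Si", "Fe"]  # hypothetical, heavily truncated element list


def element_fractions(amounts):
    """Map element amounts (e.g. {"Fe": 2, "O": 3}) to atomic fractions."""
    total = sum(amounts.values())
    # One entry per element in ELEMENTS; absent elements get a fraction of 0.
    return [amounts.get(el, 0) / total for el in ELEMENTS]


fe2o3 = element_fractions({"Fe": 2, "O": 3})
print(dict(zip(ELEMENTS, fe2o3)))
# prints {'H': 0.0, 'O': 0.6, 'Si': 0.0, 'Fe': 0.4}
```

Because the vector only records how much of each element is present, two compounds with the same formula but different crystal structures would get identical features under this scheme.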


In [None]:
# convert structures to compositions
df['composition'] = df[structure_col].apply(lambda s: s.composition)

# use ElementFraction as a light, robust composition featurizer
featurizer = ElementFraction()
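df_feat = featurizer.featurize_dataframe(df, 'composition', ignore_errors=True)

# build feature matrix X and target y
feature_cols = [
    c for c in df_feat.columns
    if c not in [structure_col, target_col, 'composition']
]

# with ignore_errors=True, rows that fail to featurize get NaN features;
# drop them so the regression model does not receive missing values
df_feat = df_feat.dropna(subset=feature_cols)

X = df_feat[feature_cols]
y = df_feat[target_col]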

print('Number of features:', X.shape[1])
X.head()


## train/test split and model training
We start with a simple `LinearRegression` model, holding out 20% of the rows as a test set.
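
The training cell scores the model with MAE and R². As a reminder of what those two numbers measure, here is a minimal NumPy sketch on made-up values. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the dataset; the formulas are meant to match what `mean_absolute_error` and `r2_score` compute with their default settings.

```python
import numpy as np

# Made-up bandgaps in eV, purely for illustration.
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_hat = np.array([0.5, 1.0, 1.5, 3.0])

# MAE: average absolute deviation, in the same units as the target.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus the ratio of the residual to the total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE: {mae_manual:.3f} eV, R^2: {r2_manual:.3f}")
# prints MAE: 0.250 eV, R^2: 0.900
```

MAE is easy to read because it stays in eV, while R² says how much better the model is than always predicting the mean bandgap.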


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Test MAE: {mae:.3f} eV')
print(f'Test R^2: {r2:.3f}')


## parity plot
Visualize predicted vs. true bandgaps on the test set.


In [None]:
plt.figure(figsize=(5, 5))
plt.scatter(y_test, y_pred, s=5, alpha=0.5)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, 'r--', label='ideal')
plt.xlabel('True bandgap (eV)')
plt.ylabel('Predicted bandgap (eV)')
plt.title(f'Bandgap prediction ({dataset_name})')
plt.legend()
plt.tight_layout()
plt.show()


## next steps
Ideas to extend this notebook:
- try more complex models (e.g. `RandomForestRegressor`, `GradientBoostingRegressor`).
- perform k-fold cross-validation instead of a single train/test split.
- do basic error analysis (residual histograms, error vs. composition).
- experiment with different matminer featurizers and compare performance.
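
As a starting point for the first two ideas, the sketch below runs 5-fold cross-validation with a `RandomForestRegressor`. It uses synthetic data so that it runs on its own: `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`, and the hyperparameters are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the notebook's X and y (assumption: swap in the real ones).
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, 0.5, 3.0]) + rng.normal(0, 0.1, 200)

# Shuffle before splitting so each fold is a random subset of the rows.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn scorers are "higher is better", so MAE comes back negated.
scores = cross_val_score(
    rf, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
)
mae_per_fold = -scores

print(f"CV MAE: {mae_per_fold.mean():.3f} +/- {mae_per_fold.std():.3f}")
```

Reporting the mean and spread across folds gives a more stable estimate than a single 80/20 split, at the cost of fitting the model five times.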
