# XGBoost Matbench Benchmark
###### Created June 13, 2022
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/matbench/benchmarks/matbench_v0.1_lattice_xgboost/notebook.ipynb)

![logo](https://github.com/materialsproject/matbench/blob/main/benchmarks/matbench_v0.1_dummy/matbench_logo.png?raw=1)


# Description

This directory contains a Matbench submission that uses an XGBoost model to regress formation energy from lattice-level descriptors only: the lattice parameter lengths and angles, the unit cell volume, and the space group number.
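For readers less familiar with the lattice descriptors, the unit cell volume is itself a function of the three lengths and three angles. The sketch below uses the standard triclinic cell volume formula; the function name `cell_volume` is only illustrative and is not part of the notebook, which reads the volume directly from each structure.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume from lengths and angles (angles in degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    # General (triclinic) formula; reduces to a*b*c when all angles are 90 degrees.
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


# A cubic cell and a hexagonal cell as quick sanity checks
print(round(cell_volume(4.0, 4.0, 4.0, 90, 90, 90), 6))
print(round(cell_volume(3.0, 3.0, 5.0, 90, 90, 120), 6))
```

Because the volume can be derived from the other six numbers, supplying it as a separate feature adds no new information in principle, although it may spare the trees from having to approximate that product themselves.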

# Benchmark name
Matbench v0.1
- `matbench_mp_e_form` task

# Package versions
###### List all versions of packages required to run your notebook, including the matbench version used.
- matbench==0.1.0
- scikit-learn==1.1.1
- xgboost==1.6.1
- pandas==1.4.2
- numpy==1.22.1
- typing==3.10.5

# Algorithm description
The model is an XGBoost regressor trained with a fixed set of hyperparameters (listed in the training cell below); no hyperparameter optimization is performed.
- XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
- The XGBoost library provides parallel tree boosting (also known as GBDT or GBM).
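As a rough illustration of what "gradient boosting with decision trees" means, the sketch below boosts one-split trees (stumps) on a toy one-dimensional problem with squared-error loss. It is a simplified teaching example, not XGBoost's algorithm: XGBoost additionally uses second-order gradient information, regularization, and row and column subsampling. All function names and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit each new stump to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, xi):
    total = model["base"]
    for stump in model["stumps"]:
        total += model["learning_rate"] * predict_stump(stump, xi)
    return total


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: y = x^2 on eight points
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [xi ** 2 for xi in x]

model = boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each stump's contribution, which is the same role `learning_rate` plays in the hyperparameters used later in this notebook.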


# Relevant citations
###### List all relevant citations for your algorithm
- [Dunn et al.](https://doi.org/10.1038/s41524-020-00406-3)
- [Chen et al.](http://doi.acm.org/10.1145/2939672.2939785)
- [Ong et al.](https://doi.org/10.1016/j.commatsci.2012.10.028)


# Any other relevant info
- This is an initial notebook submission using fixed XGBoost parameters without tuning. Another model with hyperparameter optimization is to follow.

---



In [None]:
# Install and import the required libraries and classes
%pip install matbench xgboost

from typing import List

import pandas as pd
import xgboost as xgb
from matbench.bench import MatbenchBenchmark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting matbench
  Downloading matbench-0.5-py3-none-any.whl (9.9 MB)
Collecting matminer==0.7.4
  Downloading matminer-0.7.4-py3-none-any.whl (1.4 MB)
Collecting monty==2021.8.17
  Downloading monty-2021.8.17-py3-none-any.whl (65 kB)
Collecting scikit-learn==1.0
  Downloading scikit_learn-1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.1 MB)
Collecting pint>=0.17
  Downloading Pint-0.18-py2.py3-none-any.whl (209 kB)
Collecting pymatgen>=2022.0.11
  Downloading pymatgen-2022.0.17.tar.gz (40.6 MB)

# Running the actual benchmark

Create a benchmark using the matbench_mp_e_form task, extract into tabular data, and train on an XGBoost model.
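The "extract into tabular data" step can also be written as one small helper that turns a structure into a row of features, instead of maintaining eight parallel lists. The sketch below is one possible arrangement, not the notebook's own code: `structure_to_row` and `structures_to_rows` are hypothetical names, and the `SimpleNamespace` object is a stand-in that only mimics the few `pymatgen.Structure` attributes used here so the example runs without pymatgen installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List

FEATURE_NAMES = ["a", "b", "c", "alpha", "beta", "gamma", "volume", "space_group"]


def structure_to_row(structure: Any) -> Dict[str, float]:
    """Collect the lattice-level features used in this benchmark."""
    lattice = structure.lattice
    alpha, beta, gamma = lattice.angles
    return {
        "a": lattice.a,
        "b": lattice.b,
        "c": lattice.c,
        "alpha": alpha,
        "beta": beta,
        "gamma": gamma,
        "volume": structure.volume,
        # get_space_group_info() returns (symbol, number); keep the number
        "space_group": structure.get_space_group_info()[1],
    }


def structures_to_rows(structures: Iterable[Any]) -> List[Dict[str, float]]:
    return [structure_to_row(s) for s in structures]


# Hypothetical stand-in for a cubic pymatgen.Structure (illustration only)
fake_structure = SimpleNamespace(
    lattice=SimpleNamespace(a=4.0, b=4.0, c=4.0, angles=(90.0, 90.0, 90.0)),
    volume=64.0,
    get_space_group_info=lambda: ("Pm-3m", 221),
)

rows = structures_to_rows([fake_structure])
print(rows[0])
```

With real data, `pd.DataFrame(structures_to_rows(train_inputs))` would be expected to give the same columns as the DataFrame built in the cell below, and the same helper could be reused for the test inputs.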



In [None]:
# Create a benchmark
mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

# Run our benchmark on xgboost model
for task in mb.tasks:
  task.load()

  for fold in task.folds:
    # Per-fold feature lists for the training data (one scalar per structure)
    latt_a: List[float] = []
    latt_b: List[float] = []
    latt_c: List[float] = []
    alpha: List[float] = []
    beta: List[float] = []
    gamma: List[float] = []
    volume: List[float] = []
    space_group: List[int] = []

    # Training targets
    formation_energy: List[float] = []

    # Per-fold feature lists for the test data
    t_latt_a: List[float] = []
    t_latt_b: List[float] = []
    t_latt_c: List[float] = []
    t_alpha: List[float] = []
    t_beta: List[float] = []
    t_gamma: List[float] = []
    t_volume: List[float] = []
    t_space_group: List[int] = []

    # Get the training inputs (an array of pymatgen.Structure or string Compositions, e.g. "Fe2O3")
    train_inputs, train_outputs = task.get_train_and_val_data(fold)

    # Extract lattice features via the public Structure.lattice property
    for structure in train_inputs:
      lattice = structure.lattice
      latt_a.append(lattice.a)
      latt_b.append(lattice.b)
      latt_c.append(lattice.c)
      alpha.append(lattice.angles[0])
      beta.append(lattice.angles[1])
      gamma.append(lattice.angles[2])
      volume.append(structure.volume)
      # get_space_group_info() returns (symbol, number); keep the number
      space_group.append(structure.get_space_group_info()[1])

    # Get the training outputs (an array of either bools or floats, depending on problem)
    formation_energy.extend(train_outputs)

    # Do all model tuning and selection with the training data only
    # Transfer train_inputs and train_outputs into a pandas DataFrame
    X = pd.DataFrame(
        {
            "a": latt_a,
            "b": latt_b,
            "c":latt_c,
            "alpha": alpha,
            "beta": beta,
            "gamma": gamma,
            "volume": volume,
            "space_group": space_group
        },
    )
    y = pd.Series(name="formation_energy", data=formation_energy)

    train = xgb.DMatrix(X, label=y)

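    # Fixed (untuned) parameters passed to the native xgb.train API.
    # Note: n_estimators and importance_type are arguments of the scikit-learn
    # wrapper (e.g. xgb.XGBRegressor); xgb.train does not use them, so the
    # number of boosting rounds is set by num_round below.
    hyperparam = {
        'max_depth': 4,
        'learning_rate': 0.05,
        'n_estimators': 1000,
        'verbosity': 1,
        'booster': "gbtree",
        'tree_method': "auto",
        'n_jobs': 1,
        'gamma': 0.0001,
        'min_child_weight': 8,
        'max_delta_step': 0,
        'subsample': 0.6,
        'colsample_bytree': 0.7,
        'colsample_bynode': 1,
        'reg_alpha': 0,
        'reg_lambda': 4,
        'scale_pos_weight': 1,
        'base_score': 0.6,
        'num_parallel_tree': 1,
        'importance_type': "gain",
        'eval_metric': "rmse",
        'nthread': 4,
    }

    # Number of boosting rounds used by xgb.train
    num_round = 100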

    # Train XGBoost model
    my_model = xgb.train(hyperparam, train, num_round)

    # Get test data (an array of pymatgen.Structure or string compositions, e.g., "Fe2O3")
    test_inputs_raw = task.get_test_data(fold, include_target=False)

    # Extract the same lattice features for the test structures
    for structure in test_inputs_raw:
      lattice = structure.lattice
      t_latt_a.append(lattice.a)
      t_latt_b.append(lattice.b)
      t_latt_c.append(lattice.c)
      t_alpha.append(lattice.angles[0])
      t_beta.append(lattice.angles[1])
      t_gamma.append(lattice.angles[2])
      t_volume.append(structure.volume)
      t_space_group.append(structure.get_space_group_info()[1])

    test_inputs = pd.DataFrame(
      {
          "a": t_latt_a,
          "b": t_latt_b,
          "c": t_latt_c,
          "alpha": t_alpha,
          "beta": t_beta,
          "gamma": t_gamma,
          "volume":t_volume,
          "space_group": t_space_group
      },
    )

    test = xgb.DMatrix(test_inputs)

    # Make predictions on the test data, returning an array of either bool or float, depending on problem
    predictions = my_model.predict(test)

    # Record predictions into the benchmark object
    task.record(fold, predictions)

2022-06-10 23:47:18 INFO     Initialized benchmark 'matbench_v0.1' with 1 tasks: 
['matbench_mp_e_form']
2022-06-10 23:47:18 INFO     Loading dataset 'matbench_mp_e_form'...
Fetching matbench_mp_e_form.json.gz from https://ml.materialsproject.org/projects/matbench_mp_e_form.json.gz to /usr/local/lib/python3.7/dist-packages/matminer/datasets/matbench_mp_e_form.json.gz


Fetching https://ml.materialsproject.org/projects/matbench_mp_e_form.json.gz in MB: 166.735872MB [00:00, 220.44MB/s]                                


2022-06-10 23:51:10 INFO     Dataset 'matbench_mp_e_form loaded.
2022-06-11 00:03:17 INFO     Recorded fold matbench_mp_e_form-0 successfully.
2022-06-11 00:15:14 INFO     Recorded fold matbench_mp_e_form-1 successfully.
2022-06-11 00:27:12 INFO     Recorded fold matbench_mp_e_form-2 successfully.








# Check out the results of the benchmark

- Validate the benchmark to make sure everything is OK. If you did not get any error messages during the recording process, your benchmark results will almost certainly be valid.

- Check error metrics

- Add some metadata related to this benchmark, if applicable.
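For reference, the four error metrics reported per fold can be sketched in plain Python as follows. This assumes the conventional definitions (with MAPE expressed as a fraction rather than a percentage, and no true value equal to zero); Matbench's own implementation may differ in details. The function name and the toy numbers are illustrative only, not benchmark results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Conventional regression error metrics for one fold."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "mae": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "max_error": max(errors),
        # Relative error; assumes no true value is exactly zero
        "mape": sum(e / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Toy values for illustration only
y_true = [-1.0, 0.5, 2.0]
y_pred = [-0.5, 0.5, 1.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because MAPE divides by the true value, targets close to zero can inflate it considerably, which is worth keeping in mind when reading the `mape` entries for a quantity such as formation energy.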

In [None]:
# Make sure our benchmark is valid
valid = mb.is_valid
print(f"is valid: {valid}")


# Check how the algorithm is doing using the scores
import pprint
pprint.pprint(mb.scores)

# Get some more info about the benchmark
mb.get_info()

is valid: True
{'matbench_mp_e_form': {'mae': {'max': 0.7559645762744662,
                                'mean': 0.7514603730363221,
                                'min': 0.7463943260812504,
                                'std': 0.004167347004583424},
                        'mape': {'max': 8.208108588940437,
                                 'mean': 6.904368768866061,
                                 'min': 4.8884393331071925,
                                 'std': 1.323520300873098},
                        'max_error': {'max': 4.242506746409874,
                                      'mean': 4.057536813383573,
                                      'min': 3.9335069535836924,
                                      'std': 0.10426539042254096},
                        'rmse': {'max': 0.9454158134116134,
                                 'mean': 0.9414775887737938,
                                 'min': 0.936303190895938,
                                 'std': 0.0038121183426142904}}}


# Save benchmark to file

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Save the valid benchmark to file to include with your submission
mb.to_file("/content/drive/MyDrive/sparks-baird/xtal2png/results.json.gz")

2022-06-11 00:02:22 INFO     Successfully wrote MatbenchBenchmark to file '/content/drive/MyDrive/sparks-baird/xtal2png/results.json.gz'.
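To inspect a saved file outside of Matbench, a `.json.gz` file can usually be opened with the standard library alone. The sketch below assumes the saved benchmark is gzip-compressed JSON, as the file extension suggests; `read_json_gz` is a hypothetical helper, and the small payload written here is a placeholder for demonstration, not the Matbench results schema.

```python
import gzip
import json
import os
import tempfile


def read_json_gz(path):
    """Load a gzip-compressed JSON file into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip a small placeholder payload to show the helper working
payload = {"example_key": [1, 2, 3]}
path = os.path.join(tempfile.mkdtemp(), "example.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    json.dump(payload, handle)

loaded = read_json_gz(path)
print(sorted(loaded))
```

Pointing `read_json_gz` at the `results.json.gz` path used above and printing the top-level keys would be one quick way to confirm the file was written as expected before submitting it.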


Citation:
Dunn, A., Wang, Q., Ganose, A., Dopp, D., Jain, A. 
Benchmarking Materials Property Prediction Methods: 
The Matbench Test Set and Automatminer Reference Algorithm. 
npj Computational Materials 6, 138 (2020). 
https://doi.org/10.1038/s41524-020-00406-3
