# Baseline Comparison for Ax+CrabNet v1.2.1 Using Default CrabNet Hyperparameters

###### Created January 8, 2022

# Description

This notebook serves as a baseline comparison for [Ax+CrabNet v1.2.1](https://matbench.materialsproject.org/Full%20Benchmark%20Data/matbench_v0.1_Ax_CrabNet_v1.2.1/), using CrabNet's default hyperparameters. Please see [that submission's notebook](https://github.com/materialsproject/matbench/blob/main/benchmarks/matbench_v0.1_Ax_CrabNet_v1.2.1/notebook.ipynb) for details.

# Benchmark name
Matbench v0.1

# Package versions
- crabnet==1.2.1
- scikit_learn==1.0.2
- matbench==0.5

## Imports

In [1]:
from os.path import join
from pathlib import Path

import numpy as np
import pandas as pd

import gc
import torch

import crabnet
from crabnet.train_crabnet import get_model
from sklearn.metrics import mean_absolute_error

from matbench.bench import MatbenchBenchmark

## Setup

`dummy` switches between a fast run (`True`), which trains on only the first 100 rows of each fold, and the full run (`False`). The full run was used for this matbench submission.

In [2]:
dummy = False

Specify the directories for experiments, figures, and results, and create them if they do not already exist.

In [3]:
# create dir https://stackoverflow.com/a/273227/13697228
experiment_dir = join("experiments", "default")
figure_dir = join("figures", "default")
result_dir = join("results", "default")
Path(experiment_dir).mkdir(parents=True, exist_ok=True)
Path(figure_dir).mkdir(parents=True, exist_ok=True)
Path(result_dir).mkdir(parents=True, exist_ok=True)
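
The same directory setup can also be written with `pathlib` alone. The following is a minimal sketch, not the notebook's code; the helper name `make_output_dirs` and its `root` argument are hypothetical, introduced here only for illustration.

```python
from pathlib import Path
import tempfile


def make_output_dirs(root, name="default"):
    """Create experiments/figures/results subdirectories under `root` (hypothetical helper)."""
    dirs = {}
    for kind in ("experiments", "figures", "results"):
        path = Path(root) / kind / name
        # parents=True creates missing parent folders; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        dirs[kind] = path
    return dirs


# Demonstrate in a temporary folder so the working directory is left untouched
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_output_dirs(tmp)
    created = {kind: path.is_dir() for kind, path in dirs.items()}
    print(created)
```

Passing `"."` as `root` would reproduce the relative `experiments/default`, `figures/default`, and `results/default` layout used in this notebook.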

## Experimental Bandgap Matbench task

Please ignore the fact that the train and val MAE values in the output are identical. Because no validation set is specified, the reported val MAE is just the train MAE.
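
If distinct train and val MAE values are wanted, part of the training data could be held out as a validation set. The sketch below shows only the split itself with pandas, on a toy DataFrame with the same `formula` and `target` columns used in this notebook. The formulas, target values, 80/20 ratio, and the helper name `split_train_val` are assumptions for illustration. Whether and how `get_model` accepts a validation DataFrame should be checked against the CrabNet documentation.

```python
import pandas as pd


def split_train_val(df, val_frac=0.2, seed=0):
    """Randomly hold out a fraction of rows as a validation set (hypothetical helper)."""
    val_df = df.sample(frac=val_frac, random_state=seed)
    # Everything not sampled into the validation set stays in the training set
    train_df = df.drop(val_df.index)
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)


# Toy data standing in for train_val_df; the target values are made up for illustration
toy_df = pd.DataFrame(
    {
        "formula": ["NaCl", "SiO2", "GaAs", "ZnO", "TiO2", "MgO", "CdTe", "InP", "Al2O3", "GaN"],
        "target": [8.5, 9.0, 1.4, 3.3, 3.0, 7.8, 1.5, 1.3, 8.8, 3.4],
    }
)

train_df, val_df = split_train_val(toy_df)
print(len(train_df), len(val_df))
```

Fixing `seed` keeps the split reproducible across runs, which matters when comparing against other benchmark entries.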

In [4]:
mb = MatbenchBenchmark(autoload=False, subset=["matbench_expt_gap"])

default_maes = []
task = list(mb.tasks)[0]
task.load()
for i, fold in enumerate(task.folds):
    train_inputs, train_outputs = task.get_train_and_val_data(fold)

    train_val_df = pd.DataFrame(
        {"formula": train_inputs.values, "target": train_outputs.values}
    )
    if dummy:
        train_val_df = train_val_df[:100]

    test_inputs, test_outputs = task.get_test_data(fold, include_target=True)

    test_df = pd.DataFrame({"formula": test_inputs, "target": test_outputs})

    default_params = dict(
        fudge=0.02,
        d_model=512,
        out_dims=3,
        N=3,
        heads=4,
        out_hidden=[1024, 512, 256, 128],
        emb_scaler=1.0,
        pos_scaler=1.0,
        pos_scaler_log=1.0,
        bias=False,
        dim_feedforward=2048,
        dropout=0.1,
        elem_prop="mat2vec",
        pe_resolution=5000,
        ple_resolution=5000,
        epochs=40,
        epochs_step=10,
        criterion=None,
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-6,
        weight_decay=0,
        adam=False,
        min_trust=None,
        alpha=0.5,
        k=6,
        base_lr=1e-4,
        max_lr=6e-3,
    )

    default_model = get_model(
        mat_prop="expt_gap",
        train_df=train_val_df,
        learningcurve=False,
        force_cpu=False,
        **default_params,
    )

    default_true, default_pred, default_formulas, default_sigma = default_model.predict(
        test_df
    )

    default_mae = mean_absolute_error(default_true, default_pred)
    default_maes.append(default_mae)

    task.record(fold, default_pred, params=default_params)

    # deallocate CUDA memory https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/28
    del default_model
    gc.collect()
    torch.cuda.empty_cache()

2022-02-05 09:36:02 INFO     Initialized benchmark 'matbench_v0.1' with 1 tasks: 
['matbench_expt_gap']
2022-02-05 09:36:02 INFO     Loading dataset 'matbench_expt_gap'...
2022-02-05 09:36:03 INFO     Dataset 'matbench_expt_gap loaded.

Model architecture: out_dims, d_model, N, heads
3, 512, 3, 4
Running on compute device: cuda:0
Model size: 11987206 parameters



Generating EDM: 100%|██████████| 3683/3683 [00:00<00:00, 123102.35formulae/s]

loading data with up to 4 elements in the formula





training with batchsize 256 (2**8.000)
stepping every 150 training passes, cycling lr every 10 epochs
checkin at 20 epochs to match lr scheduler
Epoch: 0/40 --- train mae: 0.989 val mae: 0.989
Epoch: 19/40 --- train mae: 0.359 val mae: 0.359
Epoch: 39/40 --- train mae: 0.203 val mae: 0.203
Saving network (expt_gap) to models/trained_models/expt_gap.pth


Generating EDM: 100%|██████████| 921/921 [00:00<00:00, 184715.44formulae/s]

loading data with up to 4 elements in the formula
2022-02-05 09:36:44 INFO     Recorded fold matbench_expt_gap-0 successfully.






Model architecture: out_dims, d_model, N, heads
3, 512, 3, 4
Running on compute device: cuda:0
Model size: 11987206 parameters



Generating EDM: 100%|██████████| 3683/3683 [00:00<00:00, 102578.62formulae/s]


loading data with up to 4 elements in the formula
training with batchsize 256 (2**8.000)
stepping every 150 training passes, cycling lr every 10 epochs
checkin at 20 epochs to match lr scheduler
Epoch: 0/40 --- train mae: 0.98 val mae: 0.98
Epoch: 19/40 --- train mae: 0.348 val mae: 0.348
Epoch: 39/40 --- train mae: 0.212 val mae: 0.212
Saving network (expt_gap) to models/trained_models/expt_gap.pth


Generating EDM: 100%|██████████| 921/921 [00:00<00:00, 184671.29formulae/s]

loading data with up to 4 elements in the formula
2022-02-05 09:37:23 INFO     Recorded fold matbench_expt_gap-1 successfully.






Model architecture: out_dims, d_model, N, heads
3, 512, 3, 4
Running on compute device: cuda:0
Model size: 11987206 parameters



Generating EDM: 100%|██████████| 3683/3683 [00:00<00:00, 102577.94formulae/s]


loading data with up to 4 elements in the formula
training with batchsize 256 (2**8.000)
stepping every 150 training passes, cycling lr every 10 epochs
checkin at 20 epochs to match lr scheduler
Epoch: 0/40 --- train mae: 1.01 val mae: 1.01
Epoch: 19/40 --- train mae: 0.353 val mae: 0.353
Epoch: 39/40 --- train mae: 0.23 val mae: 0.23
Saving network (expt_gap) to models/trained_models/expt_gap.pth


Generating EDM: 100%|██████████| 921/921 [00:00<00:00, 184759.61formulae/s]

loading data with up to 4 elements in the formula
2022-02-05 09:38:03 INFO     Recorded fold matbench_expt_gap-2 successfully.






Model architecture: out_dims, d_model, N, heads
3, 512, 3, 4
Running on compute device: cuda:0
Model size: 11987206 parameters



Generating EDM: 100%|██████████| 3683/3683 [00:00<00:00, 136771.19formulae/s]


loading data with up to 4 elements in the formula
training with batchsize 256 (2**8.000)
stepping every 150 training passes, cycling lr every 10 epochs
checkin at 20 epochs to match lr scheduler
Epoch: 0/40 --- train mae: 1.03 val mae: 1.03
Epoch: 19/40 --- train mae: 0.344 val mae: 0.344
Epoch: 39/40 --- train mae: 0.215 val mae: 0.215
Saving network (expt_gap) to models/trained_models/expt_gap.pth


Generating EDM: 100%|██████████| 921/921 [00:00<00:00, 153914.81formulae/s]

loading data with up to 4 elements in the formula
2022-02-05 09:38:44 INFO     Recorded fold matbench_expt_gap-3 successfully.






Model architecture: out_dims, d_model, N, heads
3, 512, 3, 4
Running on compute device: cuda:0
Model size: 11987206 parameters



Generating EDM: 100%|██████████| 3684/3684 [00:00<00:00, 131921.35formulae/s]


loading data with up to 4 elements in the formula
training with batchsize 256 (2**8.000)
stepping every 150 training passes, cycling lr every 10 epochs
checkin at 20 epochs to match lr scheduler
Epoch: 0/40 --- train mae: 1.04 val mae: 1.04
Epoch: 19/40 --- train mae: 0.36 val mae: 0.36
Epoch: 39/40 --- train mae: 0.215 val mae: 0.215
Saving network (expt_gap) to models/trained_models/expt_gap.pth


Generating EDM: 100%|██████████| 920/920 [00:00<00:00, 131819.75formulae/s]

loading data with up to 4 elements in the formula
2022-02-05 09:39:30 INFO     Recorded fold matbench_expt_gap-4 successfully.





## Export matbench file

In [5]:
my_metadata = {"algorithm_version": crabnet.__version__}

mb.add_metadata(my_metadata)

mb.to_file(join(result_dir, "expt_gap_benchmark.json.gz"))

print(default_maes)
print(np.mean(default_maes))


2022-02-05 09:39:31 INFO     User metadata added successfully!
2022-02-05 09:39:31 INFO     Successfully wrote MatbenchBenchmark to file 'results\default\expt_gap_benchmark.json.gz'.
[0.34893551791222294, 0.3673720894124557, 0.4106350690530902, 0.36771265866313774, 0.38388433999827376]
0.375707935007836
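
Beyond the mean, the spread across folds gives a sense of how stable the baseline is. The following is a short sketch using the five fold MAEs printed above, copied by hand so the snippet runs on its own; the variable name `fold_maes` is introduced here for illustration.

```python
import numpy as np

# Per-fold test MAEs copied from the notebook output above
fold_maes = np.array(
    [
        0.34893551791222294,
        0.3673720894124557,
        0.4106350690530902,
        0.36771265866313774,
        0.38388433999827376,
    ]
)

mean_mae = fold_maes.mean()
# ddof=1 gives the sample standard deviation across the five folds
std_mae = fold_maes.std(ddof=1)
best_fold = int(fold_maes.argmin())
worst_fold = int(fold_maes.argmax())

print(f"mean MAE: {mean_mae:.4f}, std: {std_mae:.4f}")
print(f"best fold: {best_fold}, worst fold: {worst_fold}")
```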
