
CatBERTa: Text-Based Catalyst Property Prediction

Paper

CatBERTa is a Transformer-based energy prediction model designed for efficient catalyst property prediction. It addresses the challenge of predicting adsorption energy, a critical property for catalyst reactivity, without relying on precise atomic coordinates or complex graph representations. Instead, CatBERTa leverages Transformer-based language models to process human-interpretable textual inputs and predict energies accurately. (The repository's header image was generated by Leonardo.ai.)

Key Features

  • Employs a pre-trained RoBERTa encoder to predict adsorption energies from textual inputs.
  • Processes human-interpretable text to embed target features for energy prediction.
  • Analyzes attention scores to reveal how CatBERTa focuses on the incorporated features.
  • Achieves a mean absolute error (MAE) of 0.75 eV, comparable to earlier Graph Neural Networks (GNNs).
  • Enhances energy difference predictions by effectively canceling out systematic errors for chemically similar systems. (for more details, refer to the paper: Beyond Independent Error Assumptions in Large GNNs).
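The last point can be illustrated with a small numeric sketch (all values below are made up for illustration): when two chemically similar systems share a systematic prediction error, that error cancels in their energy difference.

```python
# Illustrative sketch (made-up numbers): a shared systematic bias in two
# predictions cancels when taking the energy difference.
true_e1, true_e2 = -1.20, -0.80   # hypothetical true adsorption energies (eV)
bias = 0.50                       # systematic error shared by similar systems

pred_e1 = true_e1 + bias
pred_e2 = true_e2 + bias

# Absolute errors of the individual predictions carry the full bias...
err1 = abs(pred_e1 - true_e1)     # 0.50 eV
err2 = abs(pred_e2 - true_e2)     # 0.50 eV

# ...but the predicted energy difference is exact: the bias cancels.
diff_err = abs((pred_e1 - pred_e2) - (true_e1 - true_e2))  # 0.0 eV
print(err1, err2, diff_err)
```

Real errors are of course only partially systematic, so the cancellation observed in practice is partial rather than exact.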


Getting Started

Follow these steps to start using CatBERTa for predicting catalyst adsorption energy:

Prerequisites

Before you begin, ensure you have the following prerequisites installed:

  • Python 3.8
  • PyTorch 1.11
  • transformers 4.29

Installation

  1. Clone the CatBERTa repository:

    # clone the source code of CatBERTa
    git clone https://github.com/hoon-ock/CatBERTa.git
    cd CatBERTa

Dataset

  1. Preprocessed textual data

    The data folder houses the preprocessed textual data derived from the Open Catalyst 2020 dataset. Due to storage limitations, we offer a small subset of our training and validation data as an illustrative example. This subset showcases the format and structure of the data that CatBERTa utilizes for energy prediction.

    For access to the full dataset, please visit the following link: Google Drive - Full Dataset.

  2. Structural data

    The Open Catalyst Project dataset serves as a crucial source of textual generation for CatBERTa. This comprehensive dataset comprises a diverse collection of structural relaxation trajectories of adsorbate-catalyst systems, each accompanied by their corresponding energies.

    To access the Open Catalyst Project dataset and learn more about its attributes, please refer to the official repository: Open Catalyst Project Dataset

Checkpoints

For access to the model checkpoints, please reach out to us.

Model Training

Finetuning for energy prediction

The training configurations for CatBERTa can be found in the config/ft_config.yaml file.

During the training process, CatBERTa automatically creates and manages checkpoints to keep track of model progress. The checkpoints are saved in the checkpoint/finetune folder. This folder is created automatically if it doesn't exist and serves as the storage location for checkpoints.

$ python finetune_regression.py
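The exact contents of config/ft_config.yaml are defined by the repository; as a hypothetical illustration of how such a YAML configuration can be inspected in Python (the field names below are invented for this sketch, not taken from CatBERTa):

```python
import yaml  # requires PyYAML

# Hypothetical config text; the real keys live in config/ft_config.yaml.
example_cfg = """
model_name: roberta-base    # invented field names, illustration only
batch_size: 16
epochs: 5
ckpt_dir: checkpoint/finetune
"""

cfg = yaml.safe_load(example_cfg)
print(cfg["ckpt_dir"])  # checkpoint/finetune
```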

Model Prediction

Energy and Embedding Prediction

To analyze energy and embedding predictions using CatBERTa, you can utilize the catberta_prediction.py script. This script allows you to generate predictions for either energy or embedding values.

$ python catberta_prediction.py --target energy --base --ckpt_dir "Path/to/checkpoint" --data_path "Path/to/data"

or

$ python catberta_prediction.py --target embed --base --ckpt_dir "Path/to/checkpoint" --data_path "Path/to/data"
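Once energy predictions are in hand, the evaluation metric quoted above (MAE, 0.75 eV) is simply the mean absolute deviation between predicted and reference energies. A minimal sketch with made-up numbers:

```python
# MAE between predicted and reference adsorption energies.
# Values here are invented for illustration.
preds  = [-1.10, -0.55, -2.30]   # hypothetical model outputs (eV)
labels = [-1.00, -0.80, -2.00]   # hypothetical DFT reference energies (eV)

mae = sum(abs(p, ) if False else abs(p - t) for p, t in zip(preds, labels)) / len(preds)
print(round(mae, 3))  # 0.217
```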

Attention Score Analysis

The AttentionVisualizer repository provides a robust toolkit for visualizing and analyzing attention scores. To use it with CatBERTa, load the finetuned RoBERTa encoder into the AttentionVisualizer package.
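A minimal sketch of extracting attention scores from a RoBERTa encoder with Hugging Face transformers. To stay self-contained, this uses a tiny randomly initialized model and dummy token ids; for real analysis you would load the finetuned CatBERTa checkpoint (e.g. via `from_pretrained` or `load_state_dict`) and tokenize actual textual inputs.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Toy-sized, randomly initialized RoBERTa; swap in the finetuned
# CatBERTa checkpoint for real attention analysis.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=40,
)
model = RobertaModel(config)
model.eval()

input_ids = torch.tensor([[0, 5, 6, 7, 8, 9, 10, 2]])  # dummy token ids
with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# One attention tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); rows are softmax-normalized.
print(len(out.attentions), out.attentions[0].shape)
```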

Citation

@article{ock2023catberta,
  author  = {Ock, Janghoon and Guntuboina, Chakradhar and Barati Farimani, Amir},
  title   = {Catalyst Energy Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models},
  journal = {ACS Catalysis},
  volume  = {13},
  number  = {24},
  pages   = {16032--16044},
  year    = {2023},
  doi     = {10.1021/acscatal.3c04956},
  url     = {https://doi.org/10.1021/acscatal.3c04956}
}

Contact

For questions or support, feel free to contact us at jock@andrew.cmu.edu.
