
CatBERTa: Text-Based Catalyst Property Prediction

Paper

CatBERTa is a Transformer-based energy prediction model designed for efficient catalyst property prediction. It addresses the challenge of predicting adsorption energy, a critical property for catalyst reactivity, without relying on precise atomic coordinates or complex graph representations. Instead, CatBERTa leverages Transformer-based language models to process human-interpretable textual inputs and predict energies accurately. (The repository's header image was generated by Leonardo.ai.)

Key Features

  • Employs a pre-trained RoBERTa encoder to predict adsorption energies from textual inputs.
  • Processes human-interpretable text to embed target features for energy prediction.
  • Analyzes attention scores to reveal how CatBERTa focuses on the incorporated features.
  • Achieves a mean absolute error (MAE) of 0.75 eV, comparable to earlier Graph Neural Networks (GNNs).
  • Enhances energy difference predictions by effectively canceling out systematic errors for chemically similar systems. (for more details, refer to the paper: Beyond Independent Error Assumptions in Large GNNs).
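The last point can be illustrated with a small numeric sketch (all values below are made up for illustration): when two chemically similar systems share a systematic prediction error, that error cancels in their energy difference.

```python
# Illustrative sketch (made-up numbers): a shared systematic bias in two
# predictions cancels when taking the energy difference.
true_e1, true_e2 = -1.20, -0.80   # hypothetical true adsorption energies (eV)
bias = 0.50                       # systematic error shared by similar systems

pred_e1 = true_e1 + bias
pred_e2 = true_e2 + bias

# Absolute errors of the individual predictions carry the full bias...
err1 = abs(pred_e1 - true_e1)     # 0.50 eV
err2 = abs(pred_e2 - true_e2)     # 0.50 eV

# ...but the predicted energy difference is exact: the bias cancels.
diff_err = abs((pred_e1 - pred_e2) - (true_e1 - true_e2))  # 0.0 eV
print(err1, err2, diff_err)
```

Real errors are of course only partially systematic, so the cancellation observed in practice is partial rather than exact.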


Getting Started

Follow these steps to start using CatBERTa for predicting catalyst adsorption energy:

Prerequisites

Before you begin, ensure you have the following prerequisites installed:

  • Python 3.8
  • PyTorch 1.11
  • transformers 4.29

Installation

  1. Clone the CatBERTa repository:

    # clone the source code of CatBERTa
    git clone https://github.com/hoon-ock/CatBERTa.git
    cd CatBERTa

Dataset

  1. Preprocessed textual data

    The data folder houses the preprocessed textual data derived from the Open Catalyst 2020 dataset. Due to storage limitations, we offer a small subset of our training and validation data as an illustrative example. This subset showcases the format and structure of the data that CatBERTa utilizes for energy prediction.

    For access to the full dataset, please visit the following link: Google Drive - Full Dataset.

  2. Structural data

    The Open Catalyst Project dataset serves as a crucial source of textual generation for CatBERTa. This comprehensive dataset comprises a diverse collection of structural relaxation trajectories of adsorbate-catalyst systems, each accompanied by their corresponding energies.

    To access the Open Catalyst Project dataset and learn more about its attributes, please refer to the official repository: Open Catalyst Project Dataset

Checkpoints

For access to the model checkpoints, please reach out to us.

Model Training

Finetuning for energy prediction

The training configurations for CatBERTa can be found in the config/ft_config.yaml file.

During the training process, CatBERTa automatically creates and manages checkpoints to keep track of model progress. The checkpoints are saved in the checkpoint/finetune folder. This folder is created automatically if it doesn't exist and serves as the storage location for checkpoints.

$ python finetune_regression.py
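The exact contents of config/ft_config.yaml are defined by the repository; as a hypothetical illustration of how such a YAML configuration can be inspected in Python (the field names below are invented for this sketch, not taken from CatBERTa):

```python
import yaml  # requires PyYAML

# Hypothetical config text; the real keys live in config/ft_config.yaml.
example_cfg = """
model_name: roberta-base    # invented field names, illustration only
batch_size: 16
epochs: 5
ckpt_dir: checkpoint/finetune
"""

cfg = yaml.safe_load(example_cfg)
print(cfg["ckpt_dir"])  # checkpoint/finetune
```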

Model Prediction

Energy and Embedding Prediction

To analyze energy and embedding predictions using CatBERTa, you can utilize the catberta_prediction.py script. This script allows you to generate predictions for either energy or embedding values.

$ python catberta_prediction.py --target energy --base --ckpt_dir "Path/to/checkpoint" --data_path "Path/to/data"

or

$ python catberta_prediction.py --target embed --base --ckpt_dir "Path/to/checkpoint" --data_path "Path/to/data"
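Once energy predictions are in hand, the evaluation metric quoted above (MAE, 0.75 eV) is simply the mean absolute deviation between predicted and reference energies. A minimal sketch with made-up numbers:

```python
# MAE between predicted and reference adsorption energies.
# Values here are invented for illustration.
preds  = [-1.10, -0.55, -2.30]   # hypothetical model outputs (eV)
labels = [-1.00, -0.80, -2.00]   # hypothetical DFT reference energies (eV)

mae = sum(abs(p, ) if False else abs(p - t) for p, t in zip(preds, labels)) / len(preds)
print(round(mae, 3))  # 0.217
```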

Attention Score Analysis

The AttentionVisualizer repository provides a robust toolkit for visualizing and analyzing attention scores. To use it with CatBERTa, load the finetuned RoBERTa encoder into the AttentionVisualizer package.
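A minimal sketch of extracting attention scores from a RoBERTa encoder with Hugging Face transformers. To stay self-contained, this uses a tiny randomly initialized model and dummy token ids; for real analysis you would load the finetuned CatBERTa checkpoint (e.g. via `from_pretrained` or `load_state_dict`) and tokenize actual textual inputs.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Toy-sized, randomly initialized RoBERTa; swap in the finetuned
# CatBERTa checkpoint for real attention analysis.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=40,
)
model = RobertaModel(config)
model.eval()

input_ids = torch.tensor([[0, 5, 6, 7, 8, 9, 10, 2]])  # dummy token ids
with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# One attention tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); rows are softmax-normalized.
print(len(out.attentions), out.attentions[0].shape)
```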

Citation

@article{ock2023catberta,
  author  = {Ock, Janghoon and Guntuboina, Chakradhar and Barati Farimani, Amir},
  title   = {Catalyst Energy Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models},
  journal = {ACS Catalysis},
  volume  = {13},
  number  = {24},
  pages   = {16032--16044},
  year    = {2023},
  doi     = {10.1021/acscatal.3c04956},
  url     = {https://doi.org/10.1021/acscatal.3c04956}
}

Contact

For questions or support, feel free to contact us at jock@andrew.cmu.edu.
