CAALM: Carbohydrate Activity Annotation with protein Language Models

⚙️ Installation

Clone the Repository

git clone https://github.com/lczong/CAALM.git
cd CAALM

Set Up a Virtual Environment (Recommended)

conda create -n caalm python=3.10
conda activate caalm

Install PyTorch

Follow the installation below, or choose the build that matches your device (official guide | previous versions)

# CUDA 12.6
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# CPU only
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu

Install FAISS

# CPU (via pip or conda)
pip install faiss-cpu        # option 1
conda install faiss-cpu -c pytorch  # option 2

# GPU (conda recommended — pip may not work correctly)
conda install faiss-gpu -c pytorch

Install the Package
```
pip install .
```

Download Model Assets

Download the full CAALM Hugging Face repository into a directory named models in the project root:

python -c "from huggingface_hub import snapshot_download; snapshot_download('lczong/CAALM', local_dir='models')"

The expected layout after download is:

models/
├── level0/          # Level 0 binary classifier
├── level1/          # Level 1 multi-label classifier
└── level2/
    ├── model.pt     # Level 2 projection checkpoint
    ├── faiss/       # FAISS indices (<CLASS>.faiss)
    └── refdb/       # Reference TSVs (<CLASS>_labels.tsv)

📖 Usage

Prediction Flow

CAALM runs three levels in sequence:

Level 0 predicts whether a sequence is CAZy or non-CAZy.
If Level 0 predicts CAZy, Level 1 predicts one or more major CAZy classes from GT, GH, CBM, CE, PL, and AA.
Level 2 retrieves family labels from the FAISS index and reference database for each predicted Level 1 major class.

If Level 1 predicts multiple classes such as GH|CBM, Level 2 searches both major-class databases and writes one family prediction per major class.

Example Command

A convenience script is provided to run the example with one command:

./scripts/predict_example.sh

Or invoke the CLI directly:

caalm input/example.fasta

The output name defaults to the input filename stem (here example, from input/example.fasta), and output files are written to ./outputs/. To customise:

caalm your_sequences.fasta -o results --output-name my_run

Use caalm --help to see all options grouped by category.

Common Options

# Use a specific GPU
caalm input.fasta -d cuda:0

# Enable mixed precision for faster inference
caalm input.fasta --mixed-precision bf16

# Increase batch size for large-memory GPUs
caalm input.fasta -b 16

# Increase the level 2 projection batch size independently
caalm input.fasta -b2 1024

# Save level 1 embeddings for downstream analysis
caalm input.fasta --save-level1-embeddings

# Save level 0 embeddings
caalm input.fasta --save-level0-embeddings

# Save level 2 projected embeddings
caalm input.fasta --save-level2-embeddings

Models

The recommended setup is to download the full CAALM Hugging Face repository into a local models directory (see Installation step 6). If local files are not found, Level 0 and Level 1 will try to download from Hugging Face automatically.

Level	Description	Default path	CLI override
Level 0	Binary CAZy / non-CAZy classifier	`./models/level0`	`--level0-model`
Level 1	Multi-label major class classifier	`./models/level1`	`--level1-model`
Level 2	Projection checkpoint	`./models/level2/model.pt`	`--level2-model`
Level 2	FAISS indices (`<CLASS>.faiss`)	`./models/level2/faiss`	`--level2-faiss-dir`
Level 2	Reference TSVs (`<CLASS>_labels.tsv`)	`./models/level2/refdb`	`--level2-label-tsv-dir`

If --level2-families is omitted, Level 2 automatically uses each sequence's predicted Level 1 classes.

Outputs

Each run writes three main files under --output-dir with the prefix --output-name. When requested, embedding arrays are also saved as .npy files only.

*_predictions.tsv

sequence_id
pred_is_cazy
pred_cazy_class
pred_cazy_family

Notes:

pred_is_cazy is CAZy for CAZy sequences and Non-CAZy for non-CAZy sequences.
pred_cazy_class is empty for non-CAZy sequences.
pred_cazy_family is empty for non-CAZy sequences.
For multi-label Level 1 predictions, both pred_cazy_class and pred_cazy_family use | as the separator.

*_probabilities.jsonl

One JSON object per sequence.
level0.prob_is_cazy: probability from the binary classifier.
level1.class_probabilities: probabilities for GT, GH, CBM, CE, PL, and AA.
level2.predicted_families: family predictions for each predicted major class, including score, matched reference sequence, and vote count.
Saved probabilities and Level 2 scores are rounded to 5 decimal places.

*_statistics.tsv

Summary counts and percentages for Level 0, Level 1, and Level 2 outputs.

Optional embedding outputs

*_level0_embeddings.npy when --save-level0-embeddings is used.
*_level1_embeddings.npy when --save-level1-embeddings is used.
*_level2_embeddings.npy when --save-level2-embeddings is used.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
input		input
scripts		scripts
src/caalm		src/caalm
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CAALM: Carbohydrate Activity Annotation with protein Language Models

⚙️ Installation

📖 Usage

Prediction Flow

Example Command

Common Options

Models

Outputs

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CAALM: Carbohydrate Activity Annotation with protein Language Models

⚙️ Installation

📖 Usage

Prediction Flow

Example Command

Common Options

Models

Outputs

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages