1. Clone the Repository

       git clone https://github.com/lczong/CAALM.git
       cd CAALM

2. Set Up a Virtual Environment (Recommended)

       conda create -n caalm python=3.10
       conda activate caalm

3. Install PyTorch

   Follow the installation commands below, or choose the build that matches your device (see the official PyTorch installation guide or its list of previous versions).

       # CUDA 12.6
       pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126

       # CPU only
       pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu

4. Install FAISS

       # CPU (via pip or conda)
       pip install faiss-cpu                 # option 1
       conda install faiss-cpu -c pytorch    # option 2

       # GPU (conda recommended; pip may not work correctly)
       conda install faiss-gpu -c pytorch

5. Install the Package

       pip install .

6. Download Model Assets

   Download the full CAALM Hugging Face repository into a directory named `models` in the project root:

       python -c "from huggingface_hub import snapshot_download; snapshot_download('lczong/CAALM', local_dir='models')"

   The expected layout after download is:

       models/
       ├── level0/          # Level 0 binary classifier
       ├── level1/          # Level 1 multi-label classifier
       └── level2/
           ├── model.pt     # Level 2 projection checkpoint
           ├── faiss/       # FAISS indices (<CLASS>.faiss)
           └── refdb/       # Reference TSVs (<CLASS>_labels.tsv)
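To confirm that the download produced the layout above, a small check along these lines may help. This is a minimal sketch and not part of CAALM: the function name `missing_model_assets` is hypothetical, and it only tests that the listed paths exist, not that their contents are valid.

```python
from pathlib import Path

# Paths taken from the expected layout, relative to the models directory.
EXPECTED_PATHS = [
    "level0",
    "level1",
    "level2/model.pt",
    "level2/faiss",
    "level2/refdb",
]


def missing_model_assets(models_dir="models"):
    """Return the expected entries that do not exist under models_dir.

    Hypothetical helper for illustration; it checks existence only.
    """
    root = Path(models_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_model_assets()
    if missing:
        print("Missing model assets:", ", ".join(missing))
    else:
        print("All expected model assets are present.")
```

Run it from the project root so that the relative `models` path resolves correctly.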
CAALM runs three levels in sequence:
- Level 0 predicts whether a sequence is `CAZy` or `non-CAZy`.
- If Level 0 predicts CAZy, Level 1 predicts one or more major CAZy classes from `GT`, `GH`, `CBM`, `CE`, `PL`, and `AA`.
- Level 2 retrieves family labels from the FAISS index and reference database for each predicted Level 1 major class.
If Level 1 predicts multiple classes such as GH|CBM, Level 2 searches both major-class databases and writes one family prediction per major class.
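The control flow of the three levels can be sketched as follows. This is an illustration of the cascade only, not CAALM's implementation: `run_cascade` and the three callables passed to it are hypothetical stand-ins for the real models, and the family labels in the usage example are made-up outputs.

```python
def run_cascade(sequence, level0, level1, level2):
    """Combine three per-level callables into one prediction record.

    All three callables are hypothetical placeholders:
      level0(sequence)       -> bool, True meaning CAZy
      level1(sequence)       -> list of major classes, e.g. ["GH", "CBM"]
      level2(sequence, cls)  -> one family label for that major class
    """
    # Level 0 gates everything else: non-CAZy sequences stop here.
    if not level0(sequence):
        return {"is_cazy": "Non-CAZy", "classes": "", "families": ""}

    # Level 1 may return several major classes.
    classes = level1(sequence)

    # Level 2 is queried once per predicted major class.
    families = [level2(sequence, cls) for cls in classes]

    return {
        "is_cazy": "CAZy",
        "classes": "|".join(classes),
        "families": "|".join(families),
    }


if __name__ == "__main__":
    toy_families = {"GH": "GH5", "CBM": "CBM2"}  # made-up Level 2 answers
    record = run_cascade(
        "MKTAYIAKQR",
        level0=lambda seq: True,
        level1=lambda seq: ["GH", "CBM"],
        level2=lambda seq, cls: toy_families[cls],
    )
    print(record)
```

The per-class loop mirrors the `GH|CBM` case described above: each major class gets its own lookup, and the results are joined with `|`.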
A convenience script is provided to run the example with one command:

    ./scripts/predict_example.sh

Or invoke the CLI directly:

    caalm input/example.fasta

The output name defaults to the input filename stem (here `example`, from `input/example.fasta`), and output files are written to `./outputs/`. To customise:

    caalm your_sequences.fasta -o results --output-name my_run

Use `caalm --help` to see all options grouped by category.
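For scripting around the CLI, the naming rule can be mirrored in a few lines. This is a sketch under one assumption: that each file is named by appending the documented suffix (`_predictions.tsv`, `_probabilities.jsonl`, `_statistics.tsv`) directly to the output name. The helper `expected_outputs` is hypothetical and not part of CAALM.

```python
from pathlib import Path

# Suffixes of the three main output files described in this README.
OUTPUT_SUFFIXES = ["_predictions.tsv", "_probabilities.jsonl", "_statistics.tsv"]


def expected_outputs(fasta_path, output_dir="outputs", output_name=None):
    """Guess the main output paths for a run (hypothetical helper).

    Assumes files are named <output_name><suffix> inside output_dir,
    with output_name defaulting to the input filename stem.
    """
    name = output_name or Path(fasta_path).stem
    return [Path(output_dir) / f"{name}{suffix}" for suffix in OUTPUT_SUFFIXES]


if __name__ == "__main__":
    for path in expected_outputs("input/example.fasta"):
        print(path)
```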
    # Use a specific GPU
    caalm input.fasta -d cuda:0

    # Enable mixed precision for faster inference
    caalm input.fasta --mixed-precision bf16

    # Increase batch size for large-memory GPUs
    caalm input.fasta -b 16

    # Increase the level 2 projection batch size independently
    caalm input.fasta -b2 1024

    # Save level 1 embeddings for downstream analysis
    caalm input.fasta --save-level1-embeddings

    # Save level 0 embeddings
    caalm input.fasta --save-level0-embeddings

    # Save level 2 projected embeddings
    caalm input.fasta --save-level2-embeddings

The recommended setup is to download the full CAALM Hugging Face repository into a local `models` directory (see Installation step 6). If local files are not found, Level 0 and Level 1 will try to download from Hugging Face automatically.
| Level | Description | Default path | CLI override |
|---|---|---|---|
| Level 0 | Binary CAZy / non-CAZy classifier | `./models/level0` | `--level0-model` |
| Level 1 | Multi-label major class classifier | `./models/level1` | `--level1-model` |
| Level 2 | Projection checkpoint | `./models/level2/model.pt` | `--level2-model` |
| Level 2 | FAISS indices (`<CLASS>.faiss`) | `./models/level2/faiss` | `--level2-faiss-dir` |
| Level 2 | Reference TSVs (`<CLASS>_labels.tsv`) | `./models/level2/refdb` | `--level2-label-tsv-dir` |
If --level2-families is omitted, Level 2 automatically uses each sequence's predicted Level 1 classes.
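If the model assets live outside the project root, all five overrides have to point at the new location. The sketch below builds such a command line. It assumes the custom directory keeps the same internal layout as `./models`; `caalm_command` is a hypothetical helper, and only the flag names come from the table.

```python
import shlex
from pathlib import Path


def caalm_command(fasta, models_dir):
    """Build a caalm argument list that points every level at models_dir.

    Hypothetical helper; assumes models_dir mirrors the default layout.
    """
    root = Path(models_dir)
    return [
        "caalm", str(fasta),
        "--level0-model", (root / "level0").as_posix(),
        "--level1-model", (root / "level1").as_posix(),
        "--level2-model", (root / "level2" / "model.pt").as_posix(),
        "--level2-faiss-dir", (root / "level2" / "faiss").as_posix(),
        "--level2-label-tsv-dir", (root / "level2" / "refdb").as_posix(),
    ]


if __name__ == "__main__":
    # Print a shell-ready command; pass the list to subprocess.run to execute it.
    print(shlex.join(caalm_command("input.fasta", "/data/caalm_models")))
```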
Each run writes three main files under `--output-dir` with the prefix `--output-name`. When requested, embedding arrays are also saved, in `.npy` format only.
*_predictions.tsv
- `sequence_id`
- `pred_is_cazy`
- `pred_cazy_class`
- `pred_cazy_family`

Notes:

- `pred_is_cazy` is `CAZy` for CAZy sequences and `Non-CAZy` for non-CAZy sequences.
- `pred_cazy_class` is empty for non-CAZy sequences.
- `pred_cazy_family` is empty for non-CAZy sequences.
- For multi-label Level 1 predictions, both `pred_cazy_class` and `pred_cazy_family` use `|` as the separator.
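A predictions file with these columns could be read along the following lines. This is a minimal sketch that assumes the file is tab-separated with a header row containing the four column names; `read_predictions` is a hypothetical helper, not part of CAALM.

```python
import csv

# Columns whose values may hold several labels joined by "|".
MULTI_LABEL_COLUMNS = ("pred_cazy_class", "pred_cazy_family")


def read_predictions(path):
    """Load a predictions TSV into a list of dicts (hypothetical helper).

    Multi-label columns are split on "|" into lists; empty cells,
    as written for non-CAZy sequences, become empty lists.
    """
    rows = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in MULTI_LABEL_COLUMNS:
                value = row.get(column) or ""
                row[column] = [label for label in value.split("|") if label]
            rows.append(row)
    return rows
```

Splitting into lists keeps the class and family labels aligned by position, so `zip(row["pred_cazy_class"], row["pred_cazy_family"])` pairs each major class with its family.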
*_probabilities.jsonl
- One JSON object per sequence.
- `level0.prob_is_cazy`: probability from the binary classifier.
- `level1.class_probabilities`: probabilities for `GT`, `GH`, `CBM`, `CE`, `PL`, and `AA`.
- `level2.predicted_families`: family predictions for each predicted major class, including score, matched reference sequence, and vote count.
- Saved probabilities and Level 2 scores are rounded to 5 decimal places.
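The JSONL file can be streamed one record at a time. The sketch below rests on two assumptions about the record shape that should be checked against a real output file: that the dotted names denote nested objects (for example `record["level0"]["prob_is_cazy"]`), and that `class_probabilities` maps class names to numbers. Both helper functions are hypothetical.

```python
import json


def iter_probabilities(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def top_level1_class(record):
    """Return the major class with the highest Level 1 probability.

    Assumes record["level1"]["class_probabilities"] is a dict of
    class name -> probability; returns None if it is absent or empty.
    """
    probabilities = (record.get("level1") or {}).get("class_probabilities") or {}
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)
```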
*_statistics.tsv
- Summary counts and percentages for Level 0, Level 1, and Level 2 outputs.
Optional embedding outputs
- `*_level0_embeddings.npy` when `--save-level0-embeddings` is used.
- `*_level1_embeddings.npy` when `--save-level1-embeddings` is used.
- `*_level2_embeddings.npy` when `--save-level2-embeddings` is used.