A new DNA foundation model for angiosperms, with LoRA fine-tuned models for accessible chromatin, gene expression, and protein translation.
PlantCaduceus (PlantCAD for short) is a plant DNA language model based on the Caduceus architecture, which extends the efficient, linear-time Mamba sequence modeling framework with bi-directionality and reverse complement equivariance, two properties designed specifically for DNA sequences. PlantCAD is pre-trained on a curated dataset of 16 angiosperm genomes. It showed state-of-the-art cross-species performance in predicting TIS, TTS, splice donor, and splice acceptor sites. PlantCAD's zero-shot scoring enables genome-wide identification of deleterious mutations and known causal variants in Arabidopsis, sorghum, and maize.
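To make "reverse complement equivariance" concrete, here is a minimal sketch using a toy per-position feature function. It is not the Caduceus architecture; `toy_model` and `flip_positions_and_channels` are hypothetical names introduced only for illustration. The property shown is that feeding the reverse complement of a sequence yields the original output flipped along both the position and channel axes.

```python
# Toy illustration of reverse complement (RC) equivariance.
# Assumption: this is NOT PlantCAD or Caduceus code, only a hand-built example.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
PURINES = {"A", "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def toy_model(seq: str) -> list:
    """Hypothetical 2-channel 'model': per position, emit
    (base is a purine, complement of base is a purine)."""
    return [
        (float(base in PURINES), float(COMPLEMENT[base] in PURINES))
        for base in seq.upper()
    ]


def flip_positions_and_channels(features: list) -> list:
    """Reverse the position order and the channel order of each position."""
    return [tuple(reversed(channels)) for channels in reversed(features)]


seq = "AACGTG"
rc_seq = reverse_complement(seq)

# Equivariance: model(RC(x)) equals the flipped model(x).
assert toy_model(rc_seq) == flip_positions_and_channels(toy_model(seq))
print(rc_seq)
```

In a model with this property, predictions for a locus do not depend on which strand the sequence was read from, which is why it is a useful constraint for DNA.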
New to PlantCAD? Try our Google Colab demo - no installation required!
For local usage: See installation instructions here, then use notebooks/examples.ipynb to get started.
Pre-trained models have been uploaded to HuggingFace 🤗: PlantCAD and PlantCAD2.
| Model | Max Input Length | Model Size | Embedding Size |
|---|---|---|---|
| PlantCAD | | | |
| PlantCaduceus_l20 | 512bp | 20M | 384 |
| PlantCaduceus_l24 | 512bp | 40M | 512 |
| PlantCaduceus_l28 | 512bp | 128M | 768 |
| PlantCaduceus_l32 | 512bp | 225M | 1024 |
| PlantCAD2 | | | |
| PlantCAD2-Small | 8192bp | 88M | 768 |
| PlantCAD2-Medium | 8192bp | 311M | 1024 |
| PlantCAD2-Large | 8192bp | 694M | 1536 |
⚠️ Important: The "Max Input Length" is a hard limit; your input sequences cannot exceed this length. Use `-contextSize 512` for PlantCAD models and up to `-contextSize 8192` for PlantCAD2 models. See Model Recommendations for guidance on which model to use.
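Because the maximum input length is a hard limit, longer sequences have to be cut down before they reach the model. The sketch below shows one way to extract a window of at most `context_size` bases around a position of interest. The helper `extract_window` is hypothetical and is not part of the PlantCAD codebase; how the repository's own scripts position a variant inside the window is not specified here.

```python
def extract_window(sequence: str, center: int, context_size: int = 512) -> str:
    """Return at most `context_size` bases around the 0-based index `center`.

    Hypothetical helper for illustration only. The window is shifted inward
    near the sequence ends so it stays as long as possible.
    """
    if context_size <= 0:
        raise ValueError("context_size must be positive")
    if not 0 <= center < len(sequence):
        raise ValueError("center must be a valid index into sequence")

    half = context_size // 2
    start = max(0, center - half)
    end = min(len(sequence), start + context_size)
    # If the window was clipped at the right edge, extend it to the left.
    start = max(0, end - context_size)
    return sequence[start:end]


# Example: a 2,000 bp toy sequence trimmed for a 512 bp PlantCAD model.
toy_sequence = "ACGT" * 500
window = extract_window(toy_sequence, center=1000, context_size=512)
print(len(window))
```

For PlantCAD2 models the same helper could be called with `context_size=8192`.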
| Option | Best for |
|---|---|
| Google Colab | Beginners (no installation required) |
| Local installation | Regular use (requires NVIDIA GPU) |
| Docker | Reproducible environments |
Get sequence embeddings with PlantCAD:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

device = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained('kuleshov-group/PlantCaduceus_l32')
model = AutoModelForMaskedLM.from_pretrained(
    'kuleshov-group/PlantCaduceus_l32', trust_remote_code=True
).to(device)

# Example input sequence (PlantCaduceus_l32 accepts at most 512 bp)
sequence = "CTTAATTAATATTGCCTTTGTAATAACGCGCGAAACACAAATCTTCTCTGCCTAATGCAGTAGTCATGTGTTGACTCCTTCAAAATTTCCAAGAAGTTAGTGGCTGGTGTGTCATTGTCTTCATCTTTTTTTTTTTTTTTTTAAAAATTGAATGCGACATGTACTCCTCAACGTATAAGCTCAATGCTTGTTACTGAAACATCTCTTGTCTGATTTTTTCAGGCTAAGTCTTACAGAAAGTGATTGGGCACTTCAATGGCTTTCACAAATGAAAAAGATGGATCTAAGGGATTTGTGAAGAGAGTGGCTTCATCTTTCTCCATGAGGAAGAAGAAGAATGCAACAAGTGAACCCAAGTTGCTTCCAAGATCGAAATCAACAGGTTCTGCTAACTTTGAATCCATGAGGCTACCTGCAACGAAGAAGATTTCAGATGTCACAAACAAAACAAGGATCAAACCATTAGGTGGTGTAGCACCAGCACAACCAAGAAGGGAAAAGATCGATGATCG"
input_ids = tokenizer.encode_plus(
    sequence, return_tensors="pt", return_attention_mask=False,
    return_token_type_ids=False
)["input_ids"].to(device)

# Run the model without tracking gradients and keep all hidden states
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

# Last-layer hidden states as a float32 NumPy array
embeddings = outputs.hidden_states[-1].to(torch.float32).cpu().numpy()

# Average forward and reverse complement embeddings
hidden_size = embeddings.shape[-1] // 2
forward = embeddings[..., 0:hidden_size]
reverse = embeddings[..., hidden_size:][:, ::-1, :]
averaged_embeddings = (forward + reverse) / 2
print(averaged_embeddings.shape)
```

See notebooks/examples.ipynb for more detailed examples.
| Guide | Description |
|---|---|
| Zero-shot SNP & Region Scoring | Score variants (VCF) or genomic regions (BED) using log-likelihood ratios |
| Zero-shot SV Scoring | Score structural variants (deletions & insertions) |
| XGBoost Classifiers | Train or use pre-trained classifiers for TIS, TTS, splice sites |
| In-silico Mutagenesis | Large-scale simulation and analysis of genetic variants |
| Fine-tuned PlantCAD2 Models | LoRA models for chromatin, expression, translation |
| Zero-shot Evaluation | PlantCAD2 zero-shot benchmark results |
| Pre-training | Pre-train or fine-tune PlantCAD models from scratch |
| Model Recommendations | Which model to use, inference speed benchmarks, GPU memory guide |
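The zero-shot scoring guides listed above rely on log-likelihood ratios. As a rough sketch of the idea, the function below computes such a ratio from a probability distribution over nucleotides at a single position. The probabilities are made-up numbers rather than model output, `log_likelihood_ratio` is a hypothetical helper, and the alternate-over-reference direction of the ratio is an assumption; the scoring guide defines the convention the project actually uses.

```python
import math
from typing import Mapping


def log_likelihood_ratio(probs: Mapping[str, float], ref: str, alt: str) -> float:
    """Return log(P(alt) / P(ref)) for one position.

    Hypothetical helper for illustration. `probs` maps each nucleotide to the
    probability assigned to it at the position of interest.
    """
    p_ref, p_alt = probs[ref], probs[alt]
    if p_ref <= 0 or p_alt <= 0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_alt) - math.log(p_ref)


# Made-up probabilities for a position where the reference allele is A.
position_probs = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
score = log_likelihood_ratio(position_probs, ref="A", alt="C")
print(round(score, 3))
```

Under this convention, a strongly negative score means the alternate allele is much less likely than the reference allele given the surrounding sequence context.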
If you find PlantCAD useful for your research, please consider citing our papers:
- Zhai, J., Gokaslan, A., Schiff, Y., Berthel, A., Liu, Z. Y., Lai, W. L., Miller, Z. R., Scheben, A., Stitzer, M. C., Romay, M. C., Buckler, E. S., & Kuleshov, V. (2025). Cross-species modeling of plant genomes at single nucleotide resolution using a pretrained DNA language model. Proceedings of the National Academy of Sciences, 122(24), e2421738122. https://doi.org/10.1073/pnas.2421738122
- Zhai J., Gokaslan A., Hsu SK., Chen SP., Liu ZY., Marroquin E., Czech E., Cannon B., Berthel A., Romay MC., Pennell M., Kuleshov V.\*, Buckler ES.\* PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms. bioRxiv. 2025 Nov 19. https://doi.org/10.1101/2025.08.27.672609
Maintained by Jingjing Zhai.
- For collaboration inquiries: jz963@cornell.edu or zhaijingjing603@gmail.com
- General questions, bug reports, and feature requests: please open an issue
