Reinforcement-guided generative protein language models enable de novo design of highly diverse AAV capsids
This repository contains the code accompanying the arXiv submission of the paper:
“Reinforcement-guided generative protein language models enable de novo design of highly diverse AAV capsids”
It includes two sets of scripts, each with one or more software environments.
Set 1 contains the main scripts used for data processing, analysis, and visualization:
- 01_preliminaryAnalysis_and_mutationalLandscape.py – Preliminary data processing and mutation landscape analysis
- 02_distanceFinetunedSequences.py – Calculates novelty scores for all fine-tuned generated sequences
- 03_plottingNovelty.py – Analyzes a set of randomly chosen sequences and generates the plots presented in the paper
- 04_biophysicalAnalysis.py – Implements polarity and charge filter analysis and the grid-based selection strategy
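To illustrate what a distance-based novelty score can look like, here is a minimal sketch. It assumes novelty is the edit distance from a generated sequence to its nearest reference sequence. That definition, the function names `levenshtein` and `novelty_score`, and the toy sequences are illustrative assumptions; the metric actually used is defined in 02_distanceFinetunedSequences.py.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def novelty_score(candidate: str, references: Iterable[str]) -> int:
    """Hypothetical novelty: edit distance to the nearest reference sequence."""
    return min(levenshtein(candidate, ref) for ref in references)


# Toy amino-acid strings, not real capsid data.
toy_references = ["ACDE", "ACDF"]
print(novelty_score("ACGG", toy_references))
```

A larger score means the candidate is further from everything in the reference set. For long sequences or large reference sets, a dedicated alignment or edit-distance library would likely be faster than this pure-Python loop.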
Environment for Set 1: environment1.yml
Set 2 contains the scripts used for model training and sequence generation:
- 05_supervisedFineTuning.py – Supervised fine-tuning of ProGen2-small on viable AAV sequences, with masked loss on the generated middle region
- 06_rlTraining_viabilityDiversity.py – Reinforcement learning with viability and reference-diversity rewards (KL-regularized policy updates)
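The two training ideas named above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code in the scripts. The masked loss is assumed to average the negative log-likelihood only over positions flagged as belonging to the generated middle region. The RL reward is assumed to be the sum of the viability and diversity terms minus a KL penalty estimated from sequence log-probabilities. The function names, the additive combination, and the `beta` value are all hypothetical.

```python
import math
from typing import Sequence


def masked_nll(token_logprobs: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1.

    Positions with mask 0 (assumed here to be the fixed flanking context)
    do not contribute to the loss.
    """
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have equal length")
    n_active = sum(loss_mask)
    if n_active == 0:
        raise ValueError("loss_mask selects no positions")
    total = sum(lp * m for lp, m in zip(token_logprobs, loss_mask))
    return -total / n_active


def kl_regularized_reward(
    viability: float,
    diversity: float,
    policy_logprob: float,
    reference_logprob: float,
    beta: float = 0.1,  # assumed penalty weight, for illustration only
) -> float:
    """Task reward minus a KL penalty against a frozen reference model.

    The log-ratio of a sequence sampled from the policy is a single-sample
    estimate of KL(policy || reference).
    """
    kl_estimate = policy_logprob - reference_logprob
    return viability + diversity - beta * kl_estimate


# Three tokens; only the middle one counts toward the loss.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(masked_nll(logprobs, [0, 1, 0]))
print(kl_regularized_reward(1.0, 0.5, policy_logprob=-10.0, reference_logprob=-12.0))
```

In a real training loop the same masking is usually applied to per-token cross-entropy tensors, and the KL term keeps the fine-tuned policy close to the supervised model while the rewards push toward viable and diverse sequences.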
Environments for Set 2: requirements2.yml and requirements3.yml
All scripts are provided as used by the authors. Users may need to adapt them to their own data or setup.
For questions or issues, please contact Ana Filipa Rodrigues (afdrodrigues@fc.ul.pt).
Authors:
- Lucas Ferraz
- Ana Filipa Rodrigues
- Pedro Giesteira Cotovio
- Mafalda Ventura
- Gabriela M. Silva
- Ana Sofia Coroadinha
- Miguel Machuqueiro
- Cátia Pesquita
This work was supported by FCT – Fundação para a Ciência e Tecnologia, I.P. under the LASIGE Research Unit, ref. UID/00408/2025, DOI: https://doi.org/10.54499/UID/00408/2025, and partially supported by project 41, HfPT: Health from Portugal, funded by the Portuguese Plano de Recuperação e Resiliência.
It was also partially supported by the CancerScan project, which received funding from the European Union’s Horizon Europe Research and Innovation Action (EIC Pathfinder Open) under grant agreement No. 101186829.
Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Innovation Council and SMEs Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.