FluorCode: Predicting Fluorescent Protein Photophysical Properties with LoRA-Fine-Tuned Protein Language Models

Code and data for the ICML 2026 AI4Science Workshop paper.

Overview

FluorCode predicts six photophysical properties of fluorescent proteins (FPs) from amino acid sequence:

Property	Unit	Description
ex_max	nm	Excitation maximum wavelength
em_max	nm	Emission maximum wavelength
qy	0-1	Quantum yield
ext_coeff	M-1cm-1	Molar extinction coefficient
pka	-	Acid dissociation constant
brightness	%	Relative brightness (ext_coeff x qy)

We compare three approaches:

FPredX (baseline) - XGBoost on MSA one-hot encoding
LoRA-ESM2 + XGBoost - LoRA-fine-tuned ESM2-650M embeddings with XGBoost
LoRA-ESM2 + MLP - LoRA-fine-tuned ESM2-650M embeddings with a 2-layer MLP

Repository Structure

data/
  fetch_fpbase.py           # Download raw data from FPbase API
  identify_chromophore.py   # Identify chromophore tripeptide positions
  fold_simplefold.py        # Fold sequences with SimpleFold
  parse_structures.py       # Parse PDB structures for feature extraction
  graft_chromophore.py      # Graft chromophore into predicted structures
  sequence/                 # Curated sequence data and metadata

model/
  Baseline_FPredX/          # FPredX baseline (one-hot + XGBoost)
  LoRA_ESM2/                # LoRA fine-tuning notebook + training results
  LoRA_ESM2_Structure/      # Structural feature ablation (pocket3d)

benchmark/
  BENCHMARK_REPORT.md       # Full benchmark methodology and results
  compare_models.py         # Head-to-head model comparison script
  clustered/                # MMseqs2-clustered cross-validation results

inference/                  # Standalone prediction from sequence
  model.py                  # Model architecture
  predict.py                # CLI + Python API for predictions

figures/                    # Paper figure generation scripts + outputs
data_visual/                # Exploratory data visualizations

Installation

pip install -r requirements.txt

Requires Python >= 3.9. ESM2 weights (~2.5 GB) are downloaded automatically on first run.

Quick Start

Predict properties for new sequences

cd inference
python predict.py \
    --sequence MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLK \
    --checkpoint ../model/LoRA_ESM2/checkpoints/fold_0/best.pt

Reproduce paper figures

# Figure 2: Dataset statistics
python figures/plot_fig2_data.py

# Figure 3: Random CV comparison
# Figure 4: Clustered CV comparison
python figures/plot_comparison.py

# Supplementary scatter plot
python figures/plot_scatter_em.py

Reproduce benchmarks

The benchmark scripts require LoRA embeddings (lora_embeddings_all_folds.npz, ~84 MB). Download from HuggingFace and place in model/LoRA_ESM2/:

# After downloading embeddings:
python figures/run_mlp_benchmark.py      # MLP benchmark (all targets, all schemes)
python figures/run_xgb_benchmark.py      # XGBoost benchmark (qy, ext_coeff, pka)

Pre-computed results are included in figures/mlp_benchmark_results.csv and figures/xgb_extra_results.csv.

Model Checkpoints

This repository includes a single fold checkpoint (model/LoRA_ESM2/checkpoints/fold_0/best.pt, ~32 MB). All 20 fold checkpoints for ensemble prediction are available on HuggingFace:

Coming soon

Each checkpoint contains LoRA adapter weights, attention pooling parameters, prediction head weights, and target normalization statistics.

Key Results

We evaluate all models under two cross-validation schemes:

Random CV: standard 20-fold splits (seed 42), no sequence-identity constraints.
Clustered CV: MMseqs2 group K-fold at 90% / 70% / 50% identity (183 / 82 / 37 clusters). Members of a cluster always share a fold, preventing family-level leakage.

Pearson R — Random vs. Clustered (50% identity)

Target	FPredX (rand)	LoRA+XGB (rand)	LoRA+MLP (rand)	FPredX (50%)	LoRA+XGB (50%)	LoRA+MLP (50%)
ex_max	0.89	0.95	0.97	0.58	0.93	0.95
em_max	0.92	0.97	0.97	0.63	0.94	0.95
qy	0.75	0.93	0.96	0.16	0.92	0.95
ext_coeff	0.70	0.96	0.97	0.14	0.95	0.96
pka	0.48	0.88	0.91	0.13	0.87	0.91
brightness	0.78	0.91	0.96	0.13	0.90	0.96

MAE degradation across clustering thresholds

Mean absolute error for excitation / emission wavelength (nm). FPredX inflates sharply as family-level leakage is removed; LoRA-ESM2 remains stable.

Model	Random	90%	70%	50%
FPredX	12.7 / 8.3	23.7 / 17.3	25.6 / 19.4	35.6 / 30.5
LoRA-ESM2 (XGB)	10.1 / 6.9	12.3 / 8.7	13.0 / 9.3	13.8 / 11.8
LoRA-ESM2 (MLP)	8.9 / 6.7	9.7 / 8.4	9.7 / 8.1	11.3 / 9.9

Takeaways

Under random CV the gap between one-hot FPredX and LoRA-ESM2 looks modest (2–3 nm MAE on spectral targets), but this protocol leaks family identity across folds.
Under 50%-identity clustered CV, FPredX collapses to near-noise on non-spectral targets (qy / ext_coeff / pKa / brightness, R ≈ 0.13–0.16), while LoRA-ESM2 retains R ≥ 0.87 across all six properties.
The MLP head consistently outperforms the XGBoost head — the LoRA backbone, chromophore-aware attention pooling, and MLP head are trained jointly, so the pooling can adapt to the downstream task.
Adding Pocket-3D chromophore-anchored structural descriptors (~95 dims, from 913 grafted + minimized structures) yields no consistent gain beyond LoRA-ESM2 in this setting.

Full per-target MAE / RMSE / R tables across all four schemes are in benchmark/BENCHMARK_REPORT.md and the paper appendix.

Data Pipeline

The curated dataset is included in data/sequence/. To rebuild from scratch:

python data/fetch_fpbase.py          # Download raw data from FPbase API
python data/identify_chromophore.py  # Identify chromophore positions

Requirements

See requirements.txt. Core dependencies:

torch>=2.0, fair-esm, numpy, pandas, scikit-learn, scipy, xgboost, matplotlib, biopython, optuna

Citation

@inproceedings{fluorcode2026,
  title={FluorCode: Predicting Fluorescent Protein Photophysical Properties with LoRA-Fine-Tuned Protein Language Models},
  author={Sou, Rico Chi Kit and Ziajowska, Alicja},
  booktitle={ICML 2026 Workshop on AI for Science},
  year={2026}
}

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FluorCode: Predicting Fluorescent Protein Photophysical Properties with LoRA-Fine-Tuned Protein Language Models

Overview

Repository Structure

Installation

Quick Start

Predict properties for new sequences

Reproduce paper figures

Reproduce benchmarks

Model Checkpoints

Key Results

Pearson R — Random vs. Clustered (50% identity)

MAE degradation across clustering thresholds

Takeaways

Data Pipeline

Requirements

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
benchmark		benchmark
data		data
data_visual		data_visual
figures		figures
inference		inference
model		model
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

FluorCode: Predicting Fluorescent Protein Photophysical Properties with LoRA-Fine-Tuned Protein Language Models

Overview

Repository Structure

Installation

Quick Start

Predict properties for new sequences

Reproduce paper figures

Reproduce benchmarks

Model Checkpoints

Key Results

Pearson R — Random vs. Clustered (50% identity)

MAE degradation across clustering thresholds

Takeaways

Data Pipeline

Requirements

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages