
Add caalm#11087

Merged
vagkaratzas merged 17 commits into master from add-caalm
Mar 31, 2026

@vagkaratzas
Contributor

@vagkaratzas vagkaratzas commented Mar 30, 2026

New protein annotation software for CAZyme prediction from amino acid sequences.
The PR includes the setup module needed to download the models from Hugging Face.

The Bioconda recipe review process is slow, so for now this uses a pip installation (environment and containers built through Seqera Containers).

This is the CPU version. I will probably create a separate GPU version (dramatic speed increase) after this gets merged.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs?
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions - See version_topics
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@vagkaratzas
Contributor Author

The test_level*_embeddings outputs seem to change between my local machine and GitHub runners; investigating.

@vagkaratzas
Contributor Author

> test_level*_embeddings, seem to change between my local machine and GitHub runners; investigating

I ran some more tests: the outputs stay the same within a system but change depending on the CPU model. This is probably due to non-determinism in the FAISS index search or the ESM embedding computation, not a container/environment difference.

Likely causes (by the bot; I excluded one that didn't make sense for this case):

1. CPU architecture / SIMD instructions: FAISS uses AVX2/AVX-512 on modern CPUs and falls back to SSE4 or scalar on older ones. The floating-point operations are reordered differently depending on the SIMD path, producing subtly different embeddings due to floating-point non-associativity. GitHub runners use different CPU generations than your local machine.

2. FAISS approximate nearest-neighbour: if Level 2 uses an IVF or HNSW index, the search is approximate and the results can differ when the hardware SIMD path changes, even with the same query vectors.

The fact that it's stable within the same machine across Singularity/Docker/Conda confirms it's not a software version or library issue: the same binary code hits the same SIMD path on the same CPU. But cross-platform (your machine → GitHub runner), the CPU capabilities differ, changing the low-level floating-point execution path.
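
For anyone unfamiliar with the non-associativity point, here is a minimal Python sketch of the effect. It is not taken from caalm, FAISS, or ESM; the helper names `sum_left` and `sum_right` are made up for illustration, and the two groupings only stand in for the different accumulation orders that different SIMD paths might use.

```python
import math


def sum_left(a: float, b: float, c: float) -> float:
    """Add three floats, grouping the first two together: (a + b) + c."""
    return (a + b) + c


def sum_right(a: float, b: float, c: float) -> float:
    """Add the same three floats, grouping the last two together: a + (b + c)."""
    return a + (b + c)


left = sum_left(0.1, 0.2, 0.3)
right = sum_right(0.1, 0.2, 0.3)

# Same inputs, same mathematical sum, but the rounded results differ
# because floating-point addition is not associative.
print(left == right)

# The two results still agree within a small relative tolerance.
print(math.isclose(left, right, rel_tol=1e-9))
```

If this really is the cause, comparing the embeddings with a tolerance (or snapshotting something more stable than the raw values) would probably be more robust across runners than an exact match.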

@vagkaratzas vagkaratzas requested a review from jfy133 March 31, 2026 10:44
Comment on lines +10 to +12
path("models/level0"), emit: level0
path("models/level1"), emit: level1
path("models/level2"), emit: level2
Member

I would put these all in one tuple: they are all related and can't be used in any other way (as with .bam and .bai). That way you don't have to do any .combine shenanigans in this case.

Contributor Author

Also agreed. Coming up

@vagkaratzas vagkaratzas requested a review from jfy133 March 31, 2026 15:19
@vagkaratzas vagkaratzas added this pull request to the merge queue Mar 31, 2026
Merged via the queue into master with commit 2153095 Mar 31, 2026
29 checks passed
@vagkaratzas vagkaratzas deleted the add-caalm branch March 31, 2026 15:34