Fix #1113: Hash content-affecting selection params into PLINK export …#1158
Merged
jonbrenas merged 5 commits into malariagen:master on Mar 20, 2026
…ntrolled filenames
Per maintainer feedback, the onus of avoiding filename collisions should
be on the user. Added an optional 'out' parameter that lets the user
specify a custom output file prefix. When provided, PLINK files are
written as {output_dir}/{out}.bed/.bim/.fam. When omitted, the existing
auto-generated prefix from SNP selection parameters is used.
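The prefix selection described in this commit message could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names plink_file_prefix and plink_file_paths and the auto_prefix argument are hypothetical stand-ins, and only the 'out' parameter and the {output_dir}/{out}.bed/.bim/.fam layout come from the description above.

```python
from pathlib import Path
from typing import List, Optional


def plink_file_prefix(
    output_dir: str, auto_prefix: str, out: Optional[str] = None
) -> Path:
    """Return the path prefix shared by the .bed/.bim/.fam files.

    `auto_prefix` is a hypothetical stand-in for the prefix derived from
    the SNP selection parameters; how it is built is not shown here.
    """
    # A user-supplied `out` takes precedence over the auto-generated prefix.
    name = out if out is not None else auto_prefix
    return Path(output_dir) / name


def plink_file_paths(prefix: Path) -> List[Path]:
    # Append the extension as text instead of using Path.with_suffix(),
    # so a prefix that already contains dots is not truncated.
    return [Path(f"{prefix}{ext}") for ext in (".bed", ".bim", ".fam")]
```

With out="cohort_a" the three files share the cohort_a prefix inside output_dir; with out omitted, the auto-generated prefix is used unchanged.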
70a7fe9 to 3eca47d
jonbrenas approved these changes on Mar 20, 2026
Fix #1113: Hash content-affecting selection params into PLINK export filenames
Fixes: #1113
Summary
The PLINK export method biallelic_snps_to_plink() builds output filename prefixes from only a subset of parameters (region, n_snps, min_minor_ac, max_missing_an, thin_offset). Parameters like sample_sets, sample_query, sample_query_options, sample_indices, site_mask, and random_seed are not reflected in the filename. This means two calls with different sample selections can silently overwrite each other's .bed/.bim/.fam files.
What This PR Does
New helper: _plink_content_hash()
Added a module-level helper in to_plink.py that canonicalizes the content-affecting parameters and hashes them:
- sample_sets: sorted if list/tuple (order-insensitive for set membership)
- sample_indices: numpy arrays converted to Python lists
- random_seed: numpy integers converted to Python int
- Hashing uses the _hash_params() utility from util.py (JSON dump with sort_keys=True → MD5)
Updated filename format
Tests
- test_plink_converter (updated)
- test_plink_filename_collision: different sample_sets → different hash
- test_plink_filename_random_seed: different random_seed → different hash
- test_plink_filename_canonicalization: ["a","b"] and ["b","a"] → same hash
Backward Compatibility
Existing PLINK files generated with the old naming convention will no longer be auto-discovered when overwrite=False. This is intentional: the old filenames were not unique identifiers of content, which is precisely the bug being fixed.
Design Decisions
- Why sort sample_sets? sample_sets=["a","b"] and ["b","a"] represent the same cohort and should produce the same output files.
- Why _hash_params()? It's the existing project helper already used for results-cache keys: consistent "house style" and well-tested.
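The canonicalize-then-hash approach described in this PR could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged code: hash_params_sketch is a stand-in for the project's _hash_params() (assumed here to be a JSON dump with sort_keys=True followed by MD5, as the description says), plink_content_hash_sketch is a hypothetical name, and the parameter list is taken from the Summary. To stay dependency-free, the sketch duck-types numpy values instead of importing numpy.

```python
import hashlib
import json
import operator


def hash_params_sketch(params: dict) -> str:
    # Stand-in for _hash_params(): JSON dump with sorted keys, then MD5.
    payload = json.dumps(params, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def plink_content_hash_sketch(
    sample_sets=None,
    sample_query=None,
    sample_query_options=None,
    sample_indices=None,
    site_mask=None,
    random_seed=None,
) -> str:
    # Sort list/tuple sample sets so ["a", "b"] and ["b", "a"] hash alike.
    if isinstance(sample_sets, (list, tuple)):
        sample_sets = sorted(sample_sets)
    # numpy arrays expose .tolist(), which yields plain Python lists of ints.
    if sample_indices is not None and hasattr(sample_indices, "tolist"):
        sample_indices = sample_indices.tolist()
    # operator.index() turns numpy integers (and Python ints) into Python int.
    if random_seed is not None:
        random_seed = operator.index(random_seed)
    return hash_params_sketch(
        {
            "sample_sets": sample_sets,
            "sample_query": sample_query,
            "sample_query_options": sample_query_options,
            "sample_indices": sample_indices,
            "site_mask": site_mask,
            "random_seed": random_seed,
        }
    )
```

Under these assumptions, reordering sample_sets leaves the hash unchanged, while changing the cohort or the random seed produces a different hash, which is the behaviour the new tests check.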