Skip to content

Fix #1113: Hash content-affecting selection params into PLINK export …#1158

Merged
jonbrenas merged 5 commits intomalariagen:masterfrom
khushthecoder:GH1113-plink-filename-hash
Mar 20, 2026
Merged

Fix #1113: Hash content-affecting selection params into PLINK export …#1158
jonbrenas merged 5 commits intomalariagen:masterfrom
khushthecoder:GH1113-plink-filename-hash

Conversation

@khushthecoder
Copy link
Copy Markdown
Contributor

Fix #1113: Hash content-affecting selection params into PLINK export filenames

Fixes: #1113

Summary

The PLINK export method biallelic_snps_to_plink() builds output filename prefixes from only a subset of parameters (region, n_snps, min_minor_ac, max_missing_an, thin_offset). Parameters like sample_sets, sample_query, sample_query_options, sample_indices, site_mask, and random_seed are not reflected in the filename. This means two calls with different sample selections can silently overwrite each other's .bed/.bim/.fam files.

What This PR Does

New helper: _plink_content_hash()

Added a module-level helper in to_plink.py that:

  1. Canonicalizes inputs before hashing:
    • sample_sets: sorted if list/tuple (order-insensitive for set membership)
    • sample_indices: numpy arrays converted to Python lists
    • random_seed: numpy integers converted to Python int
  2. Hashes using the existing _hash_params() utility from util.py (JSON dump with sort_keys=True → MD5)
  3. Truncates to 12 hex chars for a readable but collision-resistant suffix

Updated filename format

# Before (collision-prone):
{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}

# After (collision-safe):
{output_dir}/{region}.{n_snps}.{min_minor_ac}.{max_missing_an}.{thin_offset}.{content_hash}

Tests

Test Purpose
test_plink_converter (updated) Uses API return value instead of hardcoded path format — more robust
test_plink_filename_collision Different sample_sets → different hash
test_plink_filename_random_seed Different random_seed → different hash
test_plink_filename_canonicalization ["a","b"] and ["b","a"] → same hash

Backward Compatibility

Existing PLINK files generated with the old naming convention will no longer be auto-discovered when overwrite=False. This is intentional — the old filenames were not unique identifiers of content, which is precisely the bug being fixed.

Design Decisions

  • Why hashing instead of embedding raw strings? Keeps filenames short, avoids filesystem path length limits, handles arbitrarily long query strings.
  • Why canonicalization? sample_sets=["a","b"] and ["b","a"] represent the same cohort and should produce the same output files.
  • Why _hash_params()? It's the existing project helper already used for results-cache keys — consistent "house style" and well-tested.

…ntrolled filenames

Per maintainer feedback, the onus on avoiding filename collisions should
be on the user. Added an optional 'out' parameter that lets the user
specify a custom output file prefix. When provided, PLINK files are
written as {output_dir}/{out}.bed/.bim/.fam. When omitted, the existing
auto-generated prefix from SNP selection parameters is used.
@khushthecoder khushthecoder force-pushed the GH1113-plink-filename-hash branch from 70a7fe9 to 3eca47d Compare March 20, 2026 07:28
@jonbrenas jonbrenas merged commit 0b58f9e into malariagen:master Mar 20, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hash PLINK export filenames by content-affecting selection parameters to prevent collisions

2 participants