The genomic QC pipeline cleans and prepares imputed genotype data for pQTL analysis. All rules are based on Alessia Mapelli’s work: []


  • Singularity

See also environment.yml and Makefile.

Getting started

The output is written to the path defined by the workspace_path variable in the config.yaml file. By default, this path is ./results.
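As a minimal sketch of how the output location could be resolved, assuming the contents of config.yaml have already been loaded into a dictionary. The helper name `resolve_workspace` is hypothetical and not part of the pipeline:

```python
from pathlib import Path

# Default output location, as stated in this README.
DEFAULT_WORKSPACE = "./results"


def resolve_workspace(config: dict) -> Path:
    """Return the output directory, falling back to the default.

    `config` stands in for the parsed config.yaml (an assumption: the
    pipeline itself may read the value differently).
    """
    # A missing or empty workspace_path falls back to the default.
    return Path(config.get("workspace_path") or DEFAULT_WORKSPACE)


if __name__ == "__main__":
    print(resolve_workspace({}))
    print(resolve_workspace({"workspace_path": "/scratch/qc_run"}))
```

The example path `/scratch/qc_run` is only an illustration; any writable directory can be used.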

Rules description

  1. list_rs:
    Purpose: Generate lists of all rsIDs and pseudo biallelic variants from the initial pgen file.
    Output: Two files – one containing all rsIDs and the other containing pseudo biallelic variants.

  2. recode_pgen:
    Purpose: Replace the IDs in the imputed pgen file with a new format: chr:pos:ref:alt.
    Output: An updated pgen file with the new ID format.

  3. selected_sample:
    Purpose: Select individuals who are present in the 2018 data and also have corresponding proteomic data.
    Output: A filtered list of individuals.

  4. filter_var:
    Purpose: Perform several quality control steps: remove additional failed samples, identify and remove heterozygosity outliers, filter on minor allele frequency (MAF), remove related samples, and filter variants on Hardy-Weinberg equilibrium (HWE).
    Output: A cleaned dataset with high-quality variants and samples.

  5. create_bgen:
    Purpose: Convert the filtered data from the previous steps into bgen format, a commonly used format for storing large-scale genotype data.
    Output: A bgen file containing the cleaned genotype data.

  6. qctool:
    Purpose: Compute SNP statistics using qctool, ensuring the quality of the variants.
    Output: SNP statistics file.

  7. get_hq_variants:
    Purpose: Filter variants to retain only those with an info score greater than 0.7.
    Output: A list of high-quality variants.

  8. filter_hq_variants:
    Purpose: Extract SNPs with an info score greater than 0.7 from the pgen file and create a new pgen file for each chromosome.
    Output: pgen files for each chromosome containing only high-quality variants.

  9. merge_filter_hq_variants:
    Purpose: Merge the chromosome-specific pgen files from the previous step into a single pgen file.
    Output: A combined pgen file containing high-quality variants from all chromosomes.

  10. update_pgen_id:
    Purpose: Update the variant IDs in the pgen file to the format chr:pos:A0:A1, with A0 and A1 in alphabetical order.
    Output: An updated pgen file with harmonised IDs.

  11. update_pgen_alleles:
    Purpose: Harmonise the alleles in the pgen file to match the new IDs.
    Output: A pgen file with harmonised alleles.

  12. merge_filter_hq_variants_new_id_alleles_pgen:
    Purpose: Merge all the pgen files from the previous step into a final single pgen file.
    Output: A final combined pgen file with harmonised IDs and alleles, ready for pQTL analysis.

  13. pgen2bed:
    Purpose: Convert the pgen file to bed format, with hard-call-threshold set to 0.49999999.
    Output: A bed file with harmonised alleles and as few missing genotype calls as possible.

  14. merge_filter_hq_variants_new_id_alleles_bed:
    Purpose: Merge all the bed files from the previous step into a final single bed file.
    Output: A final combined bed file.
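The two ID formats used by the rules above (chr:pos:ref:alt in recode_pgen and chr:pos:A0:A1 in update_pgen_id) can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes alleles are uppercase strings compared with plain string ordering; the pipeline's own rules may handle chromosome prefixes or multi-character alleles differently.

```python
def raw_variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build an ID in the chr:pos:ref:alt format."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def harmonised_variant_id(chrom: str, pos: int, allele_a: str, allele_b: str) -> str:
    """Build an ID in the chr:pos:A0:A1 format.

    A0 and A1 are the two alleles in alphabetical order, so the ID no
    longer depends on which allele was labelled ref and which alt.
    """
    a0, a1 = sorted([allele_a, allele_b])
    return f"{chrom}:{pos}:{a0}:{a1}"


if __name__ == "__main__":
    # The same variant with ref and alt swapped gets two raw IDs
    # but a single harmonised ID.
    print(raw_variant_id("1", 12345, "T", "C"))
    print(raw_variant_id("1", 12345, "C", "T"))
    print(harmonised_variant_id("1", 12345, "T", "C"))
    print(harmonised_variant_id("1", 12345, "C", "T"))
```

Sorting the alleles makes the ID independent of the ref/alt orientation, which is why the alleles stored in the file then have to be brought in line with the new IDs (update_pgen_alleles).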


  1. pgen folder (contains raw pgen files with the new ID format: chr:pos:ref:alt)
    • qc_recoded subfolder (contains pgen files that have been through the quality control and recoding steps but are not yet harmonised)
    • qc_recoded_harmonised subfolder (contains pgen files that have been quality controlled, recoded, and harmonised)
  2. bed folder
    • qc_recoded_harmonised subfolder (contains bed files that have been harmonised)
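The folder structure listed above can be sketched as a set of paths. This assumes the pgen and bed folders sit directly under the workspace_path directory, which this README does not state explicitly; the function name `output_layout` and the dictionary keys are hypothetical.

```python
from pathlib import Path


def output_layout(workspace: Path) -> dict:
    """Map each output folder described above to a path under `workspace`."""
    pgen = workspace / "pgen"
    bed = workspace / "bed"
    return {
        # Raw pgen files with chr:pos:ref:alt IDs.
        "pgen": pgen,
        # Quality controlled and recoded, not yet harmonised.
        "pgen_qc_recoded": pgen / "qc_recoded",
        # Quality controlled, recoded, and harmonised.
        "pgen_qc_recoded_harmonised": pgen / "qc_recoded_harmonised",
        # Harmonised bed files.
        "bed_qc_recoded_harmonised": bed / "qc_recoded_harmonised",
    }


if __name__ == "__main__":
    for name, path in output_layout(Path("./results")).items():
        print(f"{name}: {path}")
```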

