ht-diva/genomics_QC_pipeline

genomics_QC_pipeline

The genomic QC pipeline cleans and prepares imputed genotype data for pQTL analysis. All rules are based on Alessia Mapelli’s work: https://github.com/ht-diva/pqtl-believe-interval/blob/main/Script_QC_INTERVAL_genomics.R

Requirements

  • Singularity

See also environment.yml and Makefile.

Getting started

The output is written to the path defined by the workspace_path variable in the config.yaml file. By default, this path is ./results.
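
As a minimal sketch of how this fallback could work, assuming the contents of config.yaml have already been loaded into a dictionary: the helper name resolve_workspace is hypothetical and not part of the pipeline, which may resolve the path differently.

```python
from pathlib import Path

# Default output location stated in this README.
DEFAULT_WORKSPACE = "./results"


def resolve_workspace(config: dict) -> Path:
    """Return the output directory for a run.

    Hypothetical helper: falls back to the default when workspace_path
    is missing or empty in the loaded config.
    """
    return Path(config.get("workspace_path") or DEFAULT_WORKSPACE)


# No workspace_path set: the default is used.
print(resolve_workspace({}))
# Explicit workspace_path (illustrative value only).
print(resolve_workspace({"workspace_path": "/scratch/qc_run"}))
```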

Rules description

  1. list_rs:
    Purpose: Generate lists of all rsIDs and pseudo biallelic variants from the initial pgen file.
    Output: Two files – one containing all rsIDs and the other containing pseudo biallelic variants.

  2. recode_pgen:
    Purpose: Replace the IDs in the imputed pgen file with a new format: chr:pos:ref:alt.
    Output: An updated pgen file with the new ID format.

  3. selected_sample:
    Purpose: Select individuals who are present in the 2018 data and have corresponding proteomic data.
    Output: A filtered list of individuals.

  4. filter_var:
    Purpose: Perform several quality control steps: remove additional failed samples, identify and remove heterozygosity outliers, filter variants on minor allele frequency (MAF) and Hardy-Weinberg equilibrium (HWE), and remove related samples.
    Output: A cleaned dataset with high-quality variants and samples.

  5. create_bgen:
    Purpose: Convert the filtered data from the previous steps into bgen format, a commonly used format for storing large-scale genotype data.
    Output: A bgen file containing the cleaned genotype data.

  6. qctool:
    Purpose: Compute SNP statistics using qctool, ensuring the quality of the variants.
    Output: SNP statistics file.

  7. get_hq_variants:
    Purpose: Filter variants to retain only those with an info score greater than 0.7.
    Output: A list of high-quality variants.

  8. filter_hq_variants:
    Purpose: Extract SNPs with an info score greater than 0.7 from the pgen file and create a new pgen file for each chromosome.
    Output: pgen files for each chromosome containing only high-quality variants.

  9. merge_filter_hq_variants:
    Purpose: Merge the chromosome-specific pgen files from the previous step into a single pgen file.
    Output: A combined pgen file containing high-quality variants from all chromosomes.

  10. update_pgen_id:
    Purpose: Update the variant IDs in the pgen file to the format chr:pos:A0:A1, with A0 and A1 in alphabetical order.
    Output: An updated pgen file with harmonised IDs.

  11. update_pgen_alleles:
    Purpose: Harmonise the alleles in the pgen file to match the new IDs.
    Output: A pgen file with harmonised alleles.

  12. merge_filter_hq_variants_new_id_alleles_pgen:
    Purpose: Merge all the pgen files from the previous step into a final single pgen file.
    Output: A final combined pgen file with harmonised IDs and alleles, ready for pQTL analysis.

  13. pgen2bed:
    Purpose: Convert the pgen file to bed format, with the hard-call threshold set to 0.49999999.
    Output: A bed file with harmonised alleles and as few missing hard calls as possible.

  14. merge_filter_hq_variants_new_id_alleles_bed:
    Purpose: Merge all the bed files from the previous step into a final single bed file.
    Output: A final combined bed file.
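
The info-score filter in rules 7 and 8 can be sketched as follows. This is only an illustration: the column names rsid and info, the tab delimiter, and the '#' comment lines are assumptions about the SNP statistics file, and the function name high_quality_ids is hypothetical. The toy rows are made up for the example.

```python
import csv
import io

INFO_THRESHOLD = 0.7  # threshold stated in rules 7 and 8


def high_quality_ids(stats_file, id_col="rsid", info_col="info"):
    """Return the IDs of variants whose info score is strictly above the threshold.

    Column names and the tab delimiter are assumptions about the file layout.
    """
    # Drop comment lines (assumed to start with '#') before parsing the header.
    rows = (line for line in stats_file if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    kept = []
    for row in reader:
        try:
            info = float(row[info_col])
        except ValueError:
            continue  # skip non-numeric entries such as "NA"
        if info > INFO_THRESHOLD:
            kept.append(row[id_col])
    return kept


# Made-up toy input, for illustration only.
toy = io.StringIO(
    "# example header comment\n"
    "rsid\tinfo\n"
    "1:1000:A:G\t0.95\n"
    "1:2000:C:T\t0.70\n"
    "1:3000:G:T\tNA\n"
)
print(high_quality_ids(toy))
```

Note that the comparison is strict, so a variant with an info score of exactly 0.7 is dropped, matching the "greater than 0.7" wording above.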
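
The effect of the hard-call threshold in rule 13 can be sketched as below. This assumes biallelic dosages between 0 and 2 and the usual convention that a dosage becomes a missing hard call when its distance from the nearest integer exceeds the threshold. The function hard_call is a hypothetical illustration, not code from the pipeline.

```python
def hard_call(dosage: float, threshold: float = 0.49999999):
    """Convert a dosage in [0, 2] to a hard call (0, 1 or 2).

    Returns None (missing) when the dosage is further from the nearest
    integer than the threshold allows.
    """
    nearest = round(dosage)
    if abs(dosage - nearest) > threshold:
        return None
    return nearest


# With the threshold used by the pipeline, an uncertain dosage is still called.
print(hard_call(1.4))
# With a stricter threshold of 0.1, the same dosage becomes missing.
print(hard_call(1.4, threshold=0.1))
```

Because the largest possible distance to an integer is 0.5, a threshold of 0.49999999 leaves essentially only exact ties (such as a dosage of 0.5) as missing.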

Output

  1. pgen folder (contains raw pgen files with the new IDs format: chr:pos:ref:alt)
    • qc_recoded subfolder (contains pgen files that have been quality controlled and recoded but not yet harmonised)
    • qc_recoded_harmonised subfolder (contains pgen files that have been quality controlled, recoded, and harmonised)
  2. bed folder
    • qc_recoded_harmonised subfolder (contains bed files that have been harmonised.)
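
To illustrate the difference between the recoded IDs (chr:pos:ref:alt) and the harmonised IDs (chr:pos:A0:A1, alleles in alphabetical order), here is a minimal sketch. The function names are hypothetical, alleles are assumed to be uppercase strings, and the chromosome label is used exactly as given.

```python
def recoded_id(chrom, pos, ref, alt):
    """Build an ID in the recoded format chr:pos:ref:alt."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def harmonised_id(chrom, pos, ref, alt):
    """Build an ID in the harmonised format chr:pos:A0:A1.

    A0 and A1 are the two alleles sorted alphabetically, so the ID no
    longer depends on which allele is labelled ref or alt.
    """
    a0, a1 = sorted([ref, alt])
    return f"{chrom}:{pos}:{a0}:{a1}"


# Illustrative variant (made-up position).
print(recoded_id("1", 12345, "T", "C"))
print(harmonised_id("1", 12345, "T", "C"))
```

With the harmonised form, the same variant gets the same ID regardless of which allele a given dataset treats as the reference.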
