PLINK QC pipeline

magosil86 edited this page Nov 22, 2015 · 15 revisions

Pipeline for Sample and SNP QC using PLINK (versions 1.07 and 1.9).

This program implements a QC workflow for human GWAS analysis using PLINK binary files.

Sample QC tasks include checking for:

  1. discordant sex information
  2. individual missingness
  3. heterozygosity scores
  4. relatedness

SNP QC tasks include checking:

  1. minor allele frequencies
  2. SNP missingness
  3. differential missingness
  4. Hardy-Weinberg equilibrium deviations

Assumptions: Case-control status has been specified in the .fam file (phenotype information can be added using the --make-pheno flag in PLINK 1.9).

Pipeline Options: For datasets missing sex information, set the sexinfo_available variable in PlinkUserInput.py to False, e.g. sexinfo_available = False

User interaction: To facilitate user interaction, the pipeline tasks have been grouped into smaller sub-pipelines.

  1. pipeline_qcplink_tasks1-5of20.py
  2. pipeline_qcplink_tasks6-14of20.py
  3. pipeline_qcplink_tasks15-20of20.py

Important note: Individuals identified as duplicates or as closely related (IBD > 0.1875) are written to fail_IBD_qcplink.txt but are NOT removed during QC. The user decides at which point to remove those individuals from the dataset. At that point they can be removed using the PLINK command below:

plink --bfile qced_plink --remove fail_IBD_qcplink.txt --make-bed --out clean_inds_qcplink
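Before running the removal command, it can help to preview how many samples it would drop. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and it assumes that fail_IBD_qcplink.txt lists a family ID and an individual ID in its first two columns, which is the layout PLINK's --remove flag expects.

```python
def read_id_pairs(path):
    """Return the (FID, IID) pair from the first two columns of each non-empty line."""
    pairs = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                pairs.append((fields[0], fields[1]))
    return pairs


def preview_removal(fam_path, fail_path):
    """Summarise what removing the fail list from the .fam file would do."""
    fam_ids = read_id_pairs(fam_path)          # .fam columns 1-2 are FID and IID
    fail_ids = set(read_id_pairs(fail_path))   # samples flagged for removal
    removed = [pair for pair in fam_ids if pair in fail_ids]
    # IDs in the fail list that do not occur in the .fam file (possible mismatch)
    unmatched = sorted(fail_ids - set(fam_ids))
    return {
        "total": len(fam_ids),
        "removed": len(removed),
        "kept": len(fam_ids) - len(removed),
        "unmatched": unmatched,
    }
```

For example, preview_removal("qced_plink.fam", "fail_IBD_qcplink.txt") returns the sample counts before and after removal, plus any listed IDs that are absent from the .fam file.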

Reference: Anderson, C. et al. Data quality control in genetic case-control association studies. Nature Protocols. 5, 1564-1573, 2010

Running pipeline_qcplink_tasks1-5of20.py

Pipeline files needed to run tasks 1-5

  1. pipeline_qcplink_tasks1-5of20.py
  2. pipeline_qcplink_tasks1-5of20_config.py
  3. pipeline_qcplink_tasks1-5of20_stages_config.py
  4. PlinkUserInput.py

Note: The above files should be visible from the witsGWAS/ directory

Update PlinkUserInput.py in preparation for running tasks 1-5 of the PLINK QC pipeline

cd witsGWAS/

Edit the following variables in PlinkUserInput.py

emacs PlinkUserInput.py

| Variable | Description | Value type |
|---|---|---|
| projectname | Name of the project as one word (e.g. RAW_GWA_DATA) | String |
| author | Project author | String |
| sexinfo_available | Specifies whether sex information is available | Boolean |
| plink_binary_files | Path to the PLINK binary files | String |
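As a rough illustration, the four variables might look like this in PlinkUserInput.py. The file appears to use plain Python assignments (as in the sexinfo_available = False example earlier). All values below are placeholders. The source does not say whether plink_binary_files should point to a directory or to a file prefix, so the path shown is only an assumption.

```python
# PlinkUserInput.py (excerpt). All values are illustrative placeholders.
projectname = "RAW_GWA_DATA"   # project name as one word, no spaces
author = "jsmith"              # hypothetical author name
sexinfo_available = True       # set to False if the dataset lacks sex information
# Hypothetical path; check the shipped PlinkUserInput.py for the expected form
plink_binary_files = "/path/to/plink_binary_files"
```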

A flowchart of tasks 1-5 of the PLINK QC pipeline

[Figure: flowchart_qcplink_1-5]

The flowchart above can be generated by typing the commands below at the Unix prompt. A flowchart.svg file will be generated and stored in the current project folder: projects/projectname-pipeline-author-timestamp/

cd witsGWAS/
rubra pipeline_qcplink_tasks1-5of20.py --config pipeline_qcplink_tasks1-5of20_config.py pipeline_qcplink_tasks1-5of20_stages_config.py PlinkUserInput.py --style flowchart

Side note for WITS cluster users: Log into a node first, because flowcharts cannot be generated from cream.

qsub -I -q medium

Viewing the inputs and expected outputs of each task/job via a pipeline printout

cd witsGWAS/
rubra pipeline_qcplink_tasks1-5of20.py --config pipeline_qcplink_tasks1-5of20_config.py pipeline_qcplink_tasks1-5of20_stages_config.py PlinkUserInput.py --style print

A printout of the pipeline tasks will be shown on screen (stdout).

Running tasks 1-5 of the PLINK QC pipeline

cd witsGWAS/
rubra pipeline_qcplink_tasks1-5of20.py --config pipeline_qcplink_tasks1-5of20_config.py pipeline_qcplink_tasks1-5of20_stages_config.py PlinkUserInput.py --style run

A new folder for the results will be created under the witsGWAS/projects/ directory

Tip: Running pipelines from within screen sessions minimizes the chances of the pipeline run being interrupted by broken network connections.
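The flowchart, print and run invocations above differ only in the value passed to --style. The sketch below builds the command line from a task range and a style. The helper rubra_command is hypothetical and uses only the file-naming pattern and flags shown in the commands above.

```python
def rubra_command(task_range, style):
    """Build the rubra argument list for one sub-pipeline, e.g. task_range='1-5'."""
    if style not in ("flowchart", "print", "run"):
        raise ValueError("style must be 'flowchart', 'print' or 'run'")
    base = "pipeline_qcplink_tasks{}of20".format(task_range)
    return [
        "rubra", base + ".py",
        "--config", base + "_config.py", base + "_stages_config.py",
        "PlinkUserInput.py",
        "--style", style,
    ]


# Print the command for inspection before running it from witsGWAS/
print(" ".join(rubra_command("1-5", "print")))
```

The same helper covers the 6-14 and 15-20 sub-pipelines by passing "6-14" or "15-20". The returned list could also be passed to subprocess.run with cwd set to the witsGWAS/ directory.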

Running pipeline_qcplink_tasks6-14of20.py

Pipeline files needed to run tasks 6-14

  1. pipeline_qcplink_tasks6-14of20.py
  2. pipeline_qcplink_tasks6-14of20_config.py
  3. pipeline_qcplink_tasks6-14of20_stages_config.py
  4. PlinkUserInput.py

Note: The above files should be visible from the witsGWAS/ directory

Update PlinkUserInput.py in preparation for running tasks 6-14 of the PLINK QC pipeline

cd witsGWAS/

Edit the following variables in PlinkUserInput.py

emacs PlinkUserInput.py

| Variable | Description | Value type |
|---|---|---|
| current_dir | Path to the directory holding results from running pipeline_qcplink_tasks1-5of20.py | String |
| cut_het_high | Specifies upper heterozygosity cutoff | Float |
| cut_het_low | Specifies lower heterozygosity cutoff | Float |
| cut_miss | Specifies individual missingness cutoff | Float |
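To show how the three cutoffs might interact, here is a minimal sketch of a per-sample check. It is not the pipeline's own code. It assumes the heterozygosity rate is computed as (N(NM) - O(HOM)) / N(NM) from the columns of a PLINK .het file, and that missingness is the F_MISS column of a PLINK .imiss file. The cutoff values are placeholders, to be chosen after inspecting the plots from tasks 1-5.

```python
def heterozygosity_rate(n_nonmissing, observed_hom):
    """Heterozygosity rate from the PLINK .het columns N(NM) and O(HOM)."""
    return (n_nonmissing - observed_hom) / float(n_nonmissing)


def fails_sample_qc(f_miss, het_rate, cut_miss, cut_het_low, cut_het_high):
    """True if a sample exceeds the missingness cutoff or lies outside the het range."""
    too_many_missing = f_miss > cut_miss
    het_outlier = het_rate < cut_het_low or het_rate > cut_het_high
    return too_many_missing or het_outlier


# Placeholder cutoffs, for illustration only
cut_miss, cut_het_low, cut_het_high = 0.03, 0.23, 0.34

rate = heterozygosity_rate(n_nonmissing=1000, observed_hom=700)
print(fails_sample_qc(0.01, rate, cut_miss, cut_het_low, cut_het_high))
```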

A flowchart of tasks 6-14 of the PLINK QC pipeline

[Figure: flowchart_6-14_smx]

As with the flowchart for tasks 1-5, the flowchart above can be generated by typing the commands below at the Unix prompt.

cd witsGWAS/
rubra pipeline_qcplink_tasks6-14of20.py --config pipeline_qcplink_tasks6-14of20_config.py pipeline_qcplink_tasks6-14of20_stages_config.py PlinkUserInput.py --style flowchart

Key point: Note the difference in the legend keys (task-to-run vs up-to-date task) between the flowchart for tasks 1-5 and the flowchart for tasks 6-14. The flowchart for tasks 1-5 was generated before running tasks 1-5, whereas the flowchart for tasks 6-14 was generated after running tasks 6-14.

Viewing the inputs and expected outputs of tasks 6-14 in the PLINK QC pipeline

cd witsGWAS/
rubra pipeline_qcplink_tasks6-14of20.py --config pipeline_qcplink_tasks6-14of20_config.py pipeline_qcplink_tasks6-14of20_stages_config.py PlinkUserInput.py --style print

A printout of the pipeline tasks will be shown on screen (stdout).

Running tasks 6-14 of the PLINK QC pipeline

cd witsGWAS/
rubra pipeline_qcplink_tasks6-14of20.py --config pipeline_qcplink_tasks6-14of20_config.py pipeline_qcplink_tasks6-14of20_stages_config.py PlinkUserInput.py --style run

The results from running tasks 6-14 will be added to the project folder created during the PLINK QC pipeline run for tasks 1-5

Running pipeline_qcplink_tasks15-20of20.py

Pipeline files needed to run tasks 15-20

  1. pipeline_qcplink_tasks15-20of20.py
  2. pipeline_qcplink_tasks15-20of20_config.py
  3. pipeline_qcplink_tasks15-20of20_stages_config.py
  4. PlinkUserInput.py

Note: The above files should be visible from the witsGWAS/ directory

Update PlinkUserInput.py in preparation for running tasks 15-20 of the PLINK QC pipeline

cd witsGWAS/

Edit the following variables in PlinkUserInput.py

emacs PlinkUserInput.py

Note: All the variables take values of type Float

| Variable | Description |
|---|---|
| cut_geno | Specifies SNP missingness cutoff |
| cut_diff_miss | Specifies differential missingness cutoff |
| cut_hwe | Specifies HWE P-value cutoff |
| cut_maf | Specifies minor allele frequency (MAF) cutoff |
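The sketch below illustrates one way the four SNP cutoffs could be applied to per-SNP statistics. The direction of each comparison is an assumption based on common GWAS QC practice: a SNP fails when its missingness is above cut_geno, or when its differential-missingness P-value, HWE P-value or MAF falls below the corresponding cutoff. The pipeline's own code is not shown here and may differ. The function name, dictionary keys and cutoff values are hypothetical.

```python
def snp_failure_reasons(stats, cut_geno, cut_diff_miss, cut_hwe, cut_maf):
    """Return the list of QC checks that a SNP fails (empty list means it passes)."""
    reasons = []
    if stats["f_miss"] > cut_geno:            # too many missing genotypes
        reasons.append("missingness")
    if stats["diff_miss_p"] < cut_diff_miss:  # case/control missingness differs
        reasons.append("differential missingness")
    if stats["hwe_p"] < cut_hwe:              # deviates from Hardy-Weinberg equilibrium
        reasons.append("HWE")
    if stats["maf"] < cut_maf:                # minor allele too rare
        reasons.append("MAF")
    return reasons


# Placeholder cutoffs and statistics, for illustration only
cutoffs = dict(cut_geno=0.04, cut_diff_miss=0.00001, cut_hwe=0.00001, cut_maf=0.01)
snp = {"f_miss": 0.06, "diff_miss_p": 0.5, "hwe_p": 0.2, "maf": 0.005}
print(snp_failure_reasons(snp, **cutoffs))
```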

A flowchart of tasks 15-20 of the PLINK QC pipeline

[Figure: flowchart_15-20_smx]

As with the flowchart for tasks 1-5, the flowchart above can be generated by typing the commands below at the Unix prompt.

cd witsGWAS/
rubra pipeline_qcplink_tasks15-20of20.py --config pipeline_qcplink_tasks15-20of20_config.py pipeline_qcplink_tasks15-20of20_stages_config.py PlinkUserInput.py --style flowchart

Viewing the inputs and expected outputs of tasks 15-20 in the PLINK QC pipeline

cd witsGWAS/
rubra pipeline_qcplink_tasks15-20of20.py --config pipeline_qcplink_tasks15-20of20_config.py pipeline_qcplink_tasks15-20of20_stages_config.py PlinkUserInput.py --style print

A printout of the pipeline tasks will be shown on screen (stdout).

Running tasks 15-20 of the PLINK QC pipeline

cd witsGWAS/
rubra pipeline_qcplink_tasks15-20of20.py --config pipeline_qcplink_tasks15-20of20_config.py pipeline_qcplink_tasks15-20of20_stages_config.py PlinkUserInput.py --style run

The results from running tasks 15-20 will be added to the project folder created during the PLINK QC pipeline run for tasks 1-5

To identify individuals with divergent ancestry:

Use the variant of the pipeline_qcplink_tasks15-20of20 sub-pipeline whose file names carry the suffix _extended:

cd witsGWAS/
rubra pipeline_qcplink_tasks15-20of20_extended.py --config pipeline_qcplink_tasks15-20of20_stages_config_extended.py pipeline_qcplink_tasks15-20of20_config_extended.py PlinkUserInput.py --style run

Example dataset with results of analysis

See plinkqc and plinkqc_no-sex-info in the example_datasets sub-directory
