# Genotype PLINK file quality control

This workflow implements some preliminary data QC steps for PLINK input files. Inputs in VCF format need to be converted to PLINK format before QC is performed.

## Overview

This notebook includes workflows to:

- Compute the kinship matrix in the sample and identify related individuals
- Perform genotype and sample QC by MAF, missing data and HWE
- Perform LD pruning for follow-up PCA analysis on genotypes, as needed

A potential limitation is that the workflow requires all samples and chromosomes to be merged into one single file in order to perform both sample-level and variant-level QC. In our experience, however, the pipeline runs on a single merged PLINK file of 200K exomes with 15 million variants.

## Methods

Depending on the context of your problem, the workflow can be executed in two ways:

1. Run the `qc` command to perform genotype data QC and LD pruning, generating a subset of variants in preparation for analyses such as PCA.
2. Run `king` first, on either the original data or a subset of common variants, to identify unrelated individuals. The `king` pipeline splits the samples into related and unrelated individuals. Then perform `qc` on the unrelated individuals only, and finally extract the same set of QC-ed variants from the related individuals.

## Input format

A whole-genome PLINK bim/bed/fam bundle. For input in VCF format and/or per-chromosome VCF or PLINK format, please use `vcf_to_plink` and `merge_plink` in the [genotype formatting](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.html) pipeline to convert them to a single PLINK file bundle.

## Default QC parameters

- Kinship coefficient for related individuals: 0.0625
- MAF default: 0
    - The default keeps both common and rare variants
    - Recommended MAF for PCA: 0.01, [we should stick to common variants](https://bmcgenomdata.biomedcentral.com/articles/10.1186/s12863-020-0833-x)
    - Recommended MAC for single variant analysis: 5
- Variant level missingness threshold: 0.1
- Sample level missingness threshold: 0.1
- LD pruning via PLINK for PCA analysis:
    - window 50 
    - shift 10 
    - r2 0.1

## Minimal working example

Minimal working example data-set as well as the singularity container `bioinfo.sif` can be downloaded from [Google Drive](https://drive.google.com/drive/u/0/folders/1ahIZGnmjcGwSd-BI91C9ayd_Ya8sB2ed).

The `chr1_chr6` data-set was merged from `chr1` and `chr6` data, using `merge_plink` command from [genotype formatting](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/genotype/genotype_formatting.html) pipeline.


### Example 1: perform QC on both rare and common variants

In [None]:
sos run GWAS_QC.ipynb qc_no_prune \
    --cwd output/genotype \
    --genoFile output/genotype/chr1_chr6.bed \
    --container container/bioinfo.sif

### Example 2: QC common variants in unrelated individuals and extract those variants from related individuals

Identify related individuals and split the data into related and unrelated samples:

In [None]:
sos run GWAS_QC.ipynb king \
    --cwd output/genotype \
    --genoFile output/genotype/chr1_chr6.bed \
    --name 20220110 \
    --container container/bioinfo.sif

Variant level and sample level QC on unrelated individuals, in preparation for PCA analysis:

In [None]:
sos run GWAS_QC.ipynb qc \
    --cwd output/genotype \
    --genoFile output/genotype/chr1_chr6.20220110.unrelated.bed \
    --maf-filter 0.01 \
    --name for_pca \
    --container container/bioinfo.sif

Extract the previously selected variants from related individuals in preparation for PCA, applying only the sample-level missingness filter:

In [None]:
sos run GWAS_QC.ipynb qc_no_prune \
    --cwd output/genotype \
    --genoFile output/genotype/chr1_chr6.20220110.related.bed \
    --keep-variants output/genotype/chr1_chr6.20220110.unrelated.for_pca.filtered.prune.in \
    --maf-filter 0 --geno-filter 0 --mind-filter 0.1 --hwe-filter 0 \
    --name for_pca \
    --container container/bioinfo.sif
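The file names passed from one command to the next in Example 2 follow from the `output:` statements of the workflow steps. The sketch below reproduces that naming in Python so it is easier to see what to pass to `--genoFile` and `--keep-variants`. The helper names (`_prefix`, `king_outputs`, `qc_outputs`) are hypothetical and not part of the workflow, and the patterns are read off the step definitions in this notebook rather than from any documented interface.

```python
from pathlib import PurePosixPath


def _prefix(cwd: str, geno_file: str, name: str = "") -> str:
    """Output prefix: <cwd>/<input base name>[.<name>]."""
    base = PurePosixPath(geno_file).stem  # drops the directory and ".bed"
    return f"{cwd}/{base}{('.' + name) if name else ''}"


def king_outputs(cwd: str, geno_file: str, name: str = "") -> dict:
    """Files written by the last `king` step (hypothetical helper)."""
    prefix = _prefix(cwd, geno_file, name)
    return {
        "unrelated_bed": f"{prefix}.unrelated.bed",
        "related_bed": f"{prefix}.related.bed",
    }


def qc_outputs(cwd: str, geno_file: str, name: str = "") -> dict:
    """Files written by `qc` when no `--keep-variants` list is given."""
    prefix = _prefix(cwd, geno_file, name) + ".filtered"
    return {
        "filtered_bed": f"{prefix}.bed",
        "pruned_bed": f"{prefix}.prune.bed",
        "prune_in": f"{prefix}.prune.in",
    }


king = king_outputs("output/genotype", "output/genotype/chr1_chr6.bed", "20220110")
qc = qc_outputs("output/genotype", king["unrelated_bed"], "for_pca")
print(king["unrelated_bed"])
print(qc["prune_in"])
```

When a `--keep-variants` file is supplied, the workflow additionally inserts `.extracted` after `.filtered` in the output name; the sketch leaves that case out.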

## Command interface

In [1]:
sos run GWAS_QC.ipynb -h

usage: sos run GWAS_QC.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  king
  qc_no_prune
  qc

Global Workflow Options:
  --cwd output (as path)
                        the output directory for generated files
  --name ''
                        A string to identify your analysis run
  --genoFile  paths

                        PLINK binary files
  --remove-samples . (as path)
                        The path to the file that contains the list of samples
                        to remove (format FID, IID)
  --keep-samples . (as path)
                        The path to the file that contains the list of samples
                        to keep (format FID, IID)
  --keep-variants

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path("output")
# A string to identify your analysis run
parameter: name = ""
# PLINK binary files
parameter: genoFile = paths
# The path to the file that contains the list of samples to remove (format FID, IID)
parameter: remove_samples = path('.')
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
# The path to the file that contains the list of variants to keep
parameter: keep_variants = path('.')
# The path to the file that contains the list of variants to exclude
parameter: exclude_variants = path('.')
# Kinship coefficient threshold for related individuals
# (e.g. first degree above 0.25, second degree above 0.125, third degree above 0.0625)
parameter: kinship = 0.0625
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container option
parameter: container = ""
if not container:
    container = None
# use this function to edit memory string for PLINK input
from sos.utils import expand_size
cwd = path(f"{cwd:a}")
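The degree-based values quoted in the comment on the `kinship` parameter follow from the expected kinship coefficient halving with each additional degree of relatedness. A minimal sketch of that relationship; the function name `expected_kinship` is hypothetical and not part of the workflow:

```python
def expected_kinship(degree: int) -> float:
    """Expected kinship coefficient for a relationship of the given degree.

    Degree 1 (parent-offspring, full siblings) gives 0.25, and each further
    degree halves the value.
    """
    if degree < 1:
        raise ValueError("degree must be >= 1")
    return 0.5 ** (degree + 1)


for d in (1, 2, 3):
    print(d, expected_kinship(d))
```

The default threshold of 0.0625 therefore corresponds to the expected value for third-degree relatives.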

## Estimate kinship in the sample

The output is a list of related individuals, as well as the pairwise kinship table (`.kin0`).

In [None]:
# Inference of relationships in the sample to identify closely related individuals
[king_1]
# Minimum MAF of variants used for kinship estimation (the maximum is set to 1 - kin_maf)
parameter: kin_maf = 0.01
input: genoFile
output: f'{cwd}/{_input:bn}{("."+name) if name else ""}.kin0'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    plink2 \
      --bfile ${_input:n} \
      --make-king-table \
      --king-table-filter ${kinship} \
      ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
      ${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
      --min-af ${kin_maf} \
      --max-af ${1-kin_maf} \
      --out ${_output:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} 

In [None]:
# Select a list of unrelated individuals, optionally attempting to maximize the number of unrelated individuals retained
[king_2]
# If set to true, the unrelated individuals in a family will be kept without being reported. 
# Otherwise (use `--no-maximize-unrelated`) the entire family will be removed
# Note that attempting to maximize unrelated individuals is computationally intensive on large data.
parameter: maximize_unrelated = False
output: f'{_input:n}.related_id'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R:  container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library(dplyr)
    library(igraph)
    # Remove related individuals while keeping maximum number of individuals
    # this function is simplified from: 
    # https://rdrr.io/cran/plinkQC/src/R/utils.R
    #' @param relatedness [data.frame] containing pair-wise relatedness estimates
    #' (in column [relatednessRelatedness]) for individual 1 (in column
    #' [relatednessIID1]) and individual 2 (in column [relatednessIID2]). Columns
    #' relatednessIID1, relatednessIID2 and relatednessRelatedness have to be present,
    #' while additional columns such as family IDs can be present. Default column
    #' names correspond to column names in output of plink --genome
    #' (\url{https://www.cog-genomics.org/plink/1.9/ibd}). All original
    #' columns for pair-wise highIBDTh fails will be returned in fail_IBD.
    #' @param relatednessTh [double] Threshold for filtering related individuals.
    #' Individuals, whose pair-wise relatedness estimates are greater than this
    #' threshold are considered related.
    relatednessFilter <- function(relatedness, 
                                  relatednessTh,
                                  relatednessIID1="IID1", 
                                  relatednessIID2="IID2",
                                  relatednessRelatedness="KINSHIP") {
        # format data
        if (!(relatednessIID1 %in% names(relatedness))) {
            stop(paste("Column", relatednessIID1, "for relatedness not found!"))
        }
        if (!(relatednessIID2 %in% names(relatedness))) {
            stop(paste("Column", relatednessIID2, "for relatedness not found!"))
        }
        if (!(relatednessRelatedness %in% names(relatedness))) {
            stop(paste("Column", relatednessRelatedness,
                       "for relatedness not found!"))
        }

        iid1_index <- which(colnames(relatedness) == relatednessIID1)
        iid2_index <- which(colnames(relatedness) == relatednessIID2)

        relatedness[,iid1_index] <- as.character(relatedness[,iid1_index])
        relatedness[,iid2_index] <- as.character(relatedness[,iid2_index])

        relatedness_names <- names(relatedness)
        names(relatedness)[iid1_index] <- "IID1"
        names(relatedness)[iid2_index] <- "IID2"
        names(relatedness)[names(relatedness) == relatednessRelatedness] <- "M"

        # Remove symmetric IID rows
        relatedness_original <- relatedness
        relatedness <- dplyr::select(relatedness, IID1, IID2, M)

        sortedIDs <- data.frame(t(apply(relatedness, 1, function(pair) {
            c(sort(c(pair[1], pair[2])))
            })), stringsAsFactors=FALSE)
        keepIndex <- which(!duplicated(sortedIDs))

        relatedness_original <- relatedness_original[keepIndex,]
        relatedness <- relatedness[keepIndex,]

        # individuals with at least one pair-wise comparison > relatednessTh
        # return NULL to failIDs if no one fails the relatedness check
        highRelated <- dplyr::filter(relatedness, M > relatednessTh)
        if (nrow(highRelated) == 0) {
            return(list(relatednessFails=NULL, failIDs=NULL))
        }

        # all samples with related individuals
        allRelated <- c(highRelated$IID1, highRelated$IID2)
        uniqueIIDs <- unique(allRelated)

        # Further selection of samples with relatives in cohort
        multipleRelative <- unique(allRelated[duplicated(allRelated)])
        singleRelative <- uniqueIIDs[!uniqueIIDs %in% multipleRelative]

        highRelatedMultiple <- highRelated[highRelated$IID1 %in% multipleRelative |
                                            highRelated$IID2 %in% multipleRelative,]
        highRelatedSingle <- highRelated[highRelated$IID1 %in% singleRelative &
                                           highRelated$IID2 %in% singleRelative,]

        # Individuals with exactly one relative in the cohort
        if(length(singleRelative) != 0) {
          # exclude the first individual (IID1) of each such pair
          failIDs_single <- highRelatedSingle[,1]
            
        } else {
          failIDs_single <- NULL
        }
  
        # An individual has multiple relatives
        if(length(multipleRelative) != 0) {
            relatedPerID <- lapply(multipleRelative, function(x) {
                tmp <- highRelatedMultiple[rowSums(
                    cbind(highRelatedMultiple$IID1 %in% x,
                          highRelatedMultiple$IID2 %in% x)) != 0,1:2]
                rel <- unique(unlist(tmp))
                return(rel)
            })
            names(relatedPerID) <- multipleRelative

            keepIDs_multiple <- lapply(relatedPerID, function(x) {
                pairwise <- t(combn(x, 2))
                index <- (highRelatedMultiple$IID1 %in% pairwise[,1] &
                              highRelatedMultiple$IID2 %in% pairwise[,2]) |
                    (highRelatedMultiple$IID1 %in% pairwise[,2] &
                         highRelatedMultiple$IID2 %in% pairwise[,1])
                combination <- highRelatedMultiple[index,]
                combination_graph <- igraph::graph_from_data_frame(combination,
                                                                   directed=FALSE)
                all_iv_set <- igraph::ivs(combination_graph)
                length_iv_set <- sapply(all_iv_set, function(x) length(x))

                if (all(length_iv_set == 1)) {
                    # check how often they occur elsewhere
                    occurrence <- sapply(x, function(id) {
                        sum(sapply(relatedPerID, function(idlist) id %in% idlist))
                    })
                    # if occurrence the same everywhere, pick the first, else keep
                    # the one with minimum occurrence elsewhere
                    if (length(unique(occurrence)) == 1) {
                        nonRelated <- sort(x)[1]
                    } else {
                        nonRelated <- names(occurrence)[which.min(occurrence)]
                    }
                } else {
                    # keep the IDs (vertex names) of the largest independent set
                    nonRelated <- names(unlist(all_iv_set[which.max(length_iv_set)]))
                }
                return(nonRelated)
            })
            keepIDs_multiple <- unique(unlist(keepIDs_multiple))
            failIDs_multiple <- c(multipleRelative[!multipleRelative %in%
                                                       keepIDs_multiple])
        } else {
            failIDs_multiple <- NULL
        }
        allFailIIDs <- c(failIDs_single, failIDs_multiple)
        relatednessFails <- lapply(allFailIIDs, function(id) {
            fail_inorder <- relatedness_original$IID1 == id &
                relatedness_original$M > relatednessTh
            fail_inreverse <- relatedness_original$IID2 == id &
                relatedness_original$M > relatednessTh
            if (any(fail_inreverse)) {
                inreverse <- relatedness_original[fail_inreverse, ]
                id1 <- iid1_index
                id2 <- iid2_index
                inreverse[,c(id1, id2)] <- inreverse[,c(id2, id1)]
                names(inreverse) <- relatedness_names
            } else {
                inreverse <- NULL
            }
            inorder <- relatedness_original[fail_inorder, ]
            names(inorder) <- relatedness_names
            return(rbind(inorder, inreverse))
        })
        relatednessFails <- do.call(rbind, relatednessFails)
        if (nrow(relatednessFails) == 0) {
            relatednessFails <- NULL
            failIDs <- NULL
        } else {
            names(relatednessFails) <- relatedness_names
            rownames(relatednessFails) <- 1:nrow(relatednessFails)
            uniqueFails <- relatednessFails[!duplicated(relatednessFails[,iid1_index]),]
            failIDs <- uniqueFails[,iid1_index]
        }
        return(list(relatednessFails=relatednessFails, failIDs=failIDs))
    }
    
  
    # main code
    kin0 <- read.table(${_input:r}, header=FALSE, stringsAsFactors=FALSE)
    colnames(kin0) <- c("FID1","ID1","FID2","ID2","NSNP","HETHET","IBS0","KINSHIP")
    if (${"TRUE" if maximize_unrelated else "FALSE"}) {
        rel <- relatednessFilter(kin0, ${kinship}, "ID1", "ID2", "KINSHIP")$failIDs
        tmp1 <- kin0[,1:2]
        tmp2 <- kin0[,3:4]
        colnames(tmp1) = colnames(tmp2) = c("FID", "ID")
        # Get the family ID for these rels so there are two columns FID and IID in the output
        lookup <- dplyr::distinct(rbind(tmp1,tmp2))
        dat <- lookup[which(lookup[,2] %in% rel),]
    } else {
        rel <- kin0 %>% filter(KINSHIP >= ${kinship})
        IID <- sort(unique(unlist(rel[, c("ID1", "ID2")])))
        dat <- data.frame(IID)
        dat <- dat %>%
            mutate(FID = 0) %>%
            select(FID, IID)
    }
    cat("There are", nrow(dat),"related individuals using a kinship threshold of ${kinship}\n")
    write.table(dat,${_output:r}, quote=FALSE, row.names=FALSE, col.names=FALSE)
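The `maximize_unrelated` branch above searches for a largest independent vertex set with igraph. As a rough illustration of the underlying idea only, the sketch below uses a simpler greedy heuristic: repeatedly drop the individual who still has the most relatives. This is a swapped-in technique, not the method used by the R code, and it may retain fewer individuals than the exact search. The function name and the input layout (tuples of two IDs and a kinship value) are assumptions made for this example.

```python
from collections import defaultdict


def greedy_related_to_remove(pairs, threshold):
    """Return a sorted list of IDs to remove so that no remaining pair
    has kinship >= threshold.

    pairs: iterable of (id1, id2, kinship) tuples (hypothetical layout).
    """
    # Build an undirected graph linking individuals related above the threshold
    neighbours = defaultdict(set)
    for id1, id2, kinship in pairs:
        if kinship >= threshold and id1 != id2:
            neighbours[id1].add(id2)
            neighbours[id2].add(id1)

    removed = []
    while True:
        # Individuals who still have at least one related partner
        candidates = {i: n for i, n in neighbours.items() if n}
        if not candidates:
            break
        # Most-connected individual first; ties broken by ID for determinism
        worst = min(candidates, key=lambda i: (-len(candidates[i]), i))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return sorted(removed)


pairs = [("A", "B", 0.25), ("A", "C", 0.25), ("D", "E", 0.1), ("F", "G", 0.01)]
print(greedy_related_to_remove(pairs, 0.0625))
```

In this toy input, removing `A` resolves two related pairs at once, so `B` and `C` are both retained; removing everyone in a related pair, as the default branch does, would drop all five of `A` to `E`.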

In [None]:
# Split genotype data into related and unrelated samples, if related individuals are detected
[king_3]
input: output_from(2), genoFile
output: unrelated_bed = f'{cwd}/{_input[0]:bn}.unrelated.bed',
        related_bed = f'{cwd}/{_input[0]:bn}.related.bed'
related_id = [x.strip() for x in open(_input[0]).readlines()]
stop_if(len(related_id) == 0)
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash:  expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container
    plink2 \
      --bfile ${_input[1]:n} \
      --remove ${_input[0]} \
      ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
      --make-bed \
      --out ${_output[0]:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a 

    plink2 \
      --bfile ${_input[1]:n} \
      --keep ${_input[0]} \
      --make-bed \
      --out ${_output[1]:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a 

## Genotype and sample QC

QC the genetic data based on MAF, sample and variant missingness and Hardy-Weinberg Equilibrium (HWE).

In this step you may also provide a list of samples to keep, for example in the case when you would like to subset a sample based on their ancestries to perform independent analyses on each of these groups.

The default parameters are set to reflect some suggestions in Table 1 of [this paper](https://dx.doi.org/10.1002%2Fmpr.1608).
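To make the filtered quantities concrete, here is a minimal sketch of how MAF and the per-variant missing rate can be computed for one variant. It assumes diploid genotypes coded as alternate-allele counts (0, 1, 2) with `None` for a missing call; the function name `variant_stats` is hypothetical, and the HWE test is not covered here.

```python
def variant_stats(genotypes):
    """Return (maf, missing_rate) for one variant.

    genotypes: list of alternate-allele counts per sample (0, 1 or 2),
    with None marking a missing call (assumed encoding).
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if not called:
        return float("nan"), missing_rate
    # Each called diploid genotype contributes two alleles
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate


maf, missing = variant_stats([0, 1, 2, None, 0])
print(maf, missing)
```

With the default thresholds, this example variant would fail the variant-level missingness filter, since its missing rate of 0.2 exceeds 0.1.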

In [2]:
# Filter SNPs and select individuals 
[qc_no_prune, qc_1 (basic QC filters)]
# minimum MAF filter to use. 0 means do not apply this filter.
parameter: maf_filter = 0.0
# maximum MAF filter to use. 0 means do not apply this filter.
parameter: maf_max_filter = 0.0
# minimum MAC filter to use. 0 means do not apply this filter.
parameter: mac_filter = 0.0
# maximum MAC filter to use. 0 means do not apply this filter.
parameter: mac_max_filter = 0.0 
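# Maximum missingness per-variant. 0 means do not apply this filter.
parameter: geno_filter = 0.1
# Maximum missingness per-sample. 0 means do not apply this filter.
parameter: mind_filter = 0.1
# HWE exact test p-value threshold; variants with a p-value below it are removed. 0 means do not apply this filter.
parameter: hwe_filter = 1e-06
# Other PLINK arguments, given without the leading dashes, e.g. snps-only, write-samples, etc.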
parameter: other_args = []
# Only output SNP and sample list, rather than the PLINK binary format of subset data
parameter: meta_only = False

fail_if(not (keep_samples.is_file() or keep_samples == path('.')), msg = f'Cannot find ``{keep_samples}``')
fail_if(not (keep_variants.is_file() or keep_variants == path('.')), msg = f'Cannot find ``{keep_variants}``')
fail_if(not (remove_samples.is_file() or remove_samples == path('.')), msg = f'Cannot find ``{remove_samples}``')

input: genoFile, group_by=1
output: f'{cwd}/{_input:bn}{("."+name) if name else ""}.filtered{".extracted" if keep_variants.is_file() else ""}{".bed" if not meta_only else ".snplist"}'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, volumes=[f'{cwd}:{cwd}'], expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink2 \
      --bfile ${_input:n} \
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} \
      ${('--max-maf %s' % maf_max_filter) if maf_max_filter > 0 else ''} \
      ${('--mac %s' % mac_filter) if mac_filter > 0 else ''} \
      ${('--max-mac %s' % mac_max_filter) if mac_max_filter > 0 else ''} \
      ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} \
      ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} \
      ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
      ${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
      ${('--exclude %s' % exclude_variants) if exclude_variants.is_file() else ""} \
      ${('--extract %s' % keep_variants) if keep_variants.is_file() else ""} \
      ${('--make-bed') if not meta_only else "--write-snplist --write-samples"} \
      ${paths(["--%s" % x for x in other_args]) if other_args else ""} \
      --out ${_output:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a 
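The step above adds a filter flag to the `plink2` command only when the corresponding parameter is greater than zero. A small Python sketch of that convention, using the same defaults; the function name `qc_flags` is hypothetical and exists only for illustration:

```python
def qc_flags(maf_filter=0.0, maf_max_filter=0.0, mac_filter=0.0,
             mac_max_filter=0.0, geno_filter=0.1, mind_filter=0.1,
             hwe_filter=1e-06):
    """Build the filter portion of the plink2 command line.

    A value of 0 means the flag is left out entirely, mirroring the
    conditional expressions in the workflow step.
    """
    candidates = [
        ("--maf", maf_filter),
        ("--max-maf", maf_max_filter),
        ("--mac", mac_filter),
        ("--max-mac", mac_max_filter),
        ("--geno", geno_filter),
        ("--hwe", hwe_filter),
        ("--mind", mind_filter),
    ]
    flags = []
    for flag, value in candidates:
        if value > 0:
            flags += [flag, str(value)]
    return flags


print(qc_flags())
print(qc_flags(geno_filter=0, hwe_filter=0))
```

Note that under this convention `--geno-filter 0` disables the variant missingness filter; it does not request zero missingness.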

In [1]:
# LD pruning
[qc_2 (LD pruning)]
# Window size
parameter: window = 50
# Shift window every 10 snps
parameter: shift = 10
# Pairwise r2 threshold used for pruning. 0 means skip this step.
parameter: r2 = 0.1
stop_if(r2==0)
output: bed=f'{cwd}/{_input:bn}.prune.bed', prune=f'{cwd}/{_input:bn}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    plink \
    --bfile ${_input:n} \
    --indep-pairwise ${window} ${shift} ${r2}  \
    --out ${_output["prune"]:nn} \
    --threads ${numThreads} \
    --memory ${int(expand_size(mem) * 0.9)/1e6}
   
    plink \
    --bfile ${_input:n} \
    --extract ${_output['prune']} \
    --make-bed \
    --out ${_output['bed']:n} \
    --threads ${numThreads} \
    --memory ${int(expand_size(mem) * 0.9)/1e6}
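For intuition about the `window`, `shift` and `r2` parameters, the sketch below implements a simplified window-based pruning in pure Python. It is an illustration under stated assumptions, not PLINK's implementation: genotypes are complete (no missing calls), coded as 0/1/2 allele counts, and when a pair exceeds the threshold the variant with the lower MAF is dropped (the later one on ties). The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no correlation signal
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def prune(genotypes, window=50, shift=10, r2=0.1):
    """Return indices of variants kept after simplified LD pruning.

    genotypes: list of variants, each a list of 0/1/2 counts per sample.
    """
    m = len(genotypes)
    keep = [True] * m
    maf = []
    for g in genotypes:
        f = sum(g) / (2 * len(g))
        maf.append(min(f, 1 - f))

    start = 0
    while start < m:
        # Variants in the current window that have not been pruned yet
        idx = [i for i in range(start, min(start + window, m)) if keep[i]]
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                i, j = idx[a], idx[b]
                if keep[i] and keep[j] and r_squared(genotypes[i], genotypes[j]) > r2:
                    # Drop the lower-MAF variant; on ties drop the later one
                    keep[i if maf[i] < maf[j] else j] = False
        start += shift  # slide the window forward
    return [i for i in range(m) if keep[i]]


toy = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to the first variant
    [2, 0, 1, 1, 0, 2],  # uncorrelated with the first variant
]
print(prune(toy))
```

In the toy data, the second variant duplicates the first and is pruned, while the third is uncorrelated with the first and is kept.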