
Anatomy of Q1

Liang Zhisheng edited this page Nov 21, 2021 · 5 revisions

1st Q

The 1st Q is mainly implemented by the “sample_qc” function in the “modules.py” script.

For the personal genome:

  1. recode_and_sex_impute:
  • plink --recode vcf: The user-specified genotype file (VCF or PLINK format) is read in and converted to VCF by PLINK.
  • plink --impute-sex: To infer the genetically determined sex.
  2. sample_qc:
  • plink --missing: To calculate the missing rate.
  3. chr_and_vep:
  • plink --make-just-bim: To generate a bim file, which is used to count the number of SNPs per chromosome.
  • python packages pandas & matplotlib: The variants are read in and processed by pandas, and then displayed as a bar plot by matplotlib.
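
The per-chromosome count in chr_and_vep can be sketched as follows. This is a minimal illustration, not the code in “modules.py”: the function name `count_snps_by_chromosome` and the example rows are hypothetical, and the standard library stands in for pandas so the snippet runs without extra packages. It assumes the usual six-column, header-less .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2).

```python
from collections import Counter


def count_snps_by_chromosome(bim_lines):
    """Count variants per chromosome from the lines of a PLINK .bim file.

    Hypothetical helper for illustration; the chromosome code is taken
    from the first whitespace-separated column of each line.
    """
    counts = Counter()
    for line in bim_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[0]] += 1
    return dict(counts)


# Made-up example rows in .bim layout
example_bim = [
    "1\trs1\t0\t1000\tA\tG",
    "1\trs2\t0\t2000\tC\tT",
    "2\trs3\t0\t1500\tG\tA",
]
snp_counts = count_snps_by_chromosome(example_bim)
```

The resulting dictionary maps each chromosome code to its SNP count, which is the kind of table a bar plot would be drawn from.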

For the reference population genome:

  1. get_maf:
  • plink --freq: To generate frequency statistics.
  • python packages pandas & matplotlib: The frequency statistics are read in by pandas, and then displayed as a histogram by matplotlib.
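
As a rough sketch of what the histogram step computes, the snippet below bins minor allele frequencies into equal-width intervals over [0, 0.5]. It assumes the PLINK 1.9 .frq layout (a header line containing a `MAF` column); the function name `maf_histogram`, the bin count, and the example rows are hypothetical, and the standard library is used instead of pandas and matplotlib so the snippet stays self-contained.

```python
import math


def maf_histogram(frq_lines, n_bins=5):
    """Count variants per MAF bin over [0, 0.5].

    Hypothetical helper: expects a header line with a 'MAF' column,
    as in PLINK 1.9 .frq output. Rows whose MAF is missing or
    non-numeric (for example 'NA') are skipped.
    """
    header = frq_lines[0].split()
    maf_col = header.index("MAF")
    counts = [0] * n_bins
    for line in frq_lines[1:]:
        fields = line.split()
        if len(fields) <= maf_col:
            continue  # malformed or blank line
        try:
            maf = float(fields[maf_col])
        except ValueError:
            continue  # e.g. 'NA'
        if math.isnan(maf):
            continue
        # A MAF of exactly 0.5 falls into the last bin
        idx = min(int(maf / 0.5 * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up example rows in .frq layout
example_frq = [
    " CHR  SNP   A1   A2      MAF  NCHROBS",
    "   1  rs1    A    G     0.02      200",
    "   1  rs2    C    T     0.07      200",
    "   1  rs3    G    A     0.25      198",
    "   2  rs4    T    C     0.48      200",
    "   2  rs5    A    C       NA        0",
]
maf_counts = maf_histogram(example_frq)
```

With five bins, each bin spans 0.1 of MAF; the list of counts corresponds to the bar heights of the histogram.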

2a. PCA:

  • plink --maf 0.01 --misng 0.01 --hwe 1E-50: To retain high-quality SNPs.
  • plink --indep-pairwise --kb 50kb --r2 0.2: To keep independent SNPs.
  • plink --pca --freq: To calculate the principal components and SNP frequencies.
  • plink --read-freq --score no-mean-imputation variance-standardize: To generate the PCs for the “personal genome”.
  • python packages pandas & matplotlib: The principal components are read in by pandas, and then plotted by matplotlib.
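
The first three PCA steps (filter, prune, compute PCs) can be chained roughly as below. This is a sketch under stated assumptions, not the pipeline's actual invocation: the function `build_pca_commands` and the file prefixes are hypothetical, `--geno 0.01` is assumed as the per-SNP missingness filter, and `--indep-pairwise 50kb 1 0.2` is assumed as the concrete form of the 50 kb window and r2 threshold of 0.2 listed above. The commands are only built as argument lists here, not executed.

```python
def build_pca_commands(ref_prefix, out_prefix, plink="plink"):
    """Return argument lists for QC filtering, LD pruning, and PCA.

    Hypothetical helper; thresholds mirror the values listed above,
    but the exact flags used by the pipeline are an assumption.
    """
    qc_prefix = f"{out_prefix}.qc"
    prune_prefix = f"{out_prefix}.prune"

    # Step 1: keep high-quality SNPs (MAF, missingness, HWE filters)
    qc = [plink, "--bfile", ref_prefix,
          "--maf", "0.01", "--geno", "0.01", "--hwe", "1e-50",
          "--make-bed", "--out", qc_prefix]

    # Step 2: LD pruning; writes <prune_prefix>.prune.in
    prune = [plink, "--bfile", qc_prefix,
             "--indep-pairwise", "50kb", "1", "0.2",
             "--out", prune_prefix]

    # Step 3: PCA and allele frequencies on the pruned SNP set
    pca = [plink, "--bfile", qc_prefix,
           "--extract", f"{prune_prefix}.prune.in",
           "--pca", "--freq",
           "--out", f"{out_prefix}.pca"]

    return [qc, prune, pca]


commands = build_pca_commands("reference", "work/ref")
```

Each list could then be passed to `subprocess.run(cmd, check=True)` in order. The projection of the personal genome with `--read-freq` and `--score` is left out because the score-file columns are not given on this page.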

2b. UMAP

  • python package UMAP: To reduce the genetic data to two dimensions.
  • python packages pandas & matplotlib: The two-dimensional result is read in by pandas, and then displayed by matplotlib.
  3. check_concordance:
  • python package matplotlib_venn: To display a Venn diagram of the variants in the two genotype files.
  4. ref_qc: To show the QC results of the reference population.
  • plink --freq --missing --test-missing --hardy --het --check-sex --genome: To obtain the minor allele frequency, missing genotype rates, Hardy-Weinberg equilibrium, heterozygosity rate, sex discrepancy, and cryptic relatedness in the reference population.
  • python packages pandas & matplotlib: The information obtained above is read in by pandas, and then displayed as histograms by matplotlib.
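
For the concordance check, the three region sizes of a two-set Venn diagram can be computed with plain set operations, as sketched below. The function name `venn_region_sizes` and the example IDs are hypothetical, and matching variants by ID is an assumption; the pipeline may match on other keys such as position and alleles.

```python
def venn_region_sizes(ids_a, ids_b):
    """Return (only in A, only in B, in both) for two collections of variant IDs.

    Hypothetical helper; duplicates within a collection are counted once.
    """
    set_a, set_b = set(ids_a), set(ids_b)
    only_a = len(set_a - set_b)
    only_b = len(set_b - set_a)
    both = len(set_a & set_b)
    return only_a, only_b, both


# Made-up variant IDs for the two genotype files
personal = ["rs1", "rs2", "rs3", "rs4"]
reference = ["rs3", "rs4", "rs5"]
regions = venn_region_sizes(personal, reference)
```

The tuple is ordered the way `matplotlib_venn.venn2` expects its `subsets` argument, so it could be drawn with `venn2(subsets=regions, set_labels=("personal", "reference"))` when that package is installed.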