## Psomagen Illumina data, 25 Aug 2020

Use the same method to check the quality of Psomagen and Jax data

## Notes
1. gsutil -mq: -m runs operations in parallel and -q means quiet, no verbose output at all
2. gsutil doesn't work with Python 3.8, which is what is installed in my plot env
3. Set the global variables and import the libraries first
4. Remember to update WRKDIR in previously used notebooks
5. Remember not to add comments before %%bash (the cell magic must be the first line of the cell)
6. Remember to replace 'usuhsID' with 'Sample_Name' in sample_ids
7. Make sure the json or tool dir is there before running the next steps

In [9]:
!mkdir /Users/pengl7/Downloads/WGS/UNHS

In [11]:
WRKDIR = '/Users/pengl7/Downloads/WGS/UNHS'
COHORT = 'UNHS'

In [12]:
import pandas as pd
import numpy as np
import time
import json
import os
import seaborn as sns
import matplotlib.pyplot as plt

#### Check the vcf files inside the destination folder

In [8]:
!gsutil ls -lh gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.chrX.vcf.gz* 

   5.6 MiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.chrX.gtonly.vcf.gz
 85.35 KiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.chrX.gtonly.vcf.gz.tbi
 29.15 MiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.chrX.vcf.gz
  93.7 KiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.chrX.vcf.gz.tbi
174.85 MiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.gtonly.vcf.gz
  1.61 MiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.gtonly.vcf.gz.tbi
  1.18 GiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.vcf.gz
  1.91 MiB  2020-07-27T18:38:33Z  gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.vcf.gz.tbi
TOTAL: 8 objects, 1491329990 bytes (1.39 GiB)


## Run QC on samples

In [16]:
!mkdir {WRKDIR}/genotypes
!gsutil -m cp gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.vcf.gz* {WRKDIR}/genotypes/

mkdir: /Users/pengl7/Downloads/WGS/UNHS/genotypes: File exists
Copying gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.vcf.gz...
==> NOTE: You are downloading one or more large file(s), which would            
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

Copying gs://singlecellindi/WGS/2001UNHS-0021/hg38/genotypes/UNHS.vcf.gz.tbi...
| [2/2 files][  1.2 GiB/  1.2 GiB] 100% Done   6.1 MiB/s ETA 00:00:00           
Operation completed over 2 objects/1.2 GiB.                                      


## Preprocessing for the purpose of QC 
Following Raph's suggestion, use plink instead of VQSR.
1. Filter
2. Extract chrX

Raph and Anni's notebook uses two versions of plink for different purposes. For example, the multiallelics option is available in plink2, but the --check-sex flag doesn't work in plink2.
1. Use plink2 with --make-pgen multiallelics=-, otherwise an error occurs during the plink trim step that writes the bed file
2. Use plink1.9 for --check-sex and the het rate check
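
The split between the two plink versions can be sketched as a pair of small command builders. This is only an illustration: the helper names `plink2_import_cmd` and `plink19_check_cmd` and the `/tmp` path are hypothetical, while the flags are the ones used in the cells of this notebook. The helpers only build argument lists; they do not run plink.

```python
from pathlib import Path


def plink2_import_cmd(vcf, out_prefix):
    """Build the plink2 VCF -> pgen conversion command (hypothetical helper)."""
    return [
        "plink2", "--vcf", str(vcf), "--double-id",
        "--silent", "--allow-extra-chr",
        "--make-pgen", "multiallelics=-",
        "--out", str(out_prefix),
    ]


def plink19_check_cmd(bfile_prefix, out_prefix, check):
    """Build a plink1.9 per-sample check command; check is 'sex' or 'het'."""
    flags = {
        "sex": ["--check-sex", "0.25", "0.75"],
        "het": ["--het", "--autosome"],
    }
    if check not in flags:
        raise ValueError(f"unknown check: {check!r}")
    return ["plink1.9", "--bfile", str(bfile_prefix), *flags[check],
            "--out", str(out_prefix)]


# Placeholder working directory, not the one used in this notebook.
wrkdir = Path("/tmp/WGS/UNHS")
cmd = plink2_import_cmd(wrkdir / "genotypes/UNHS.vcf.gz", wrkdir / "genotypes/UNHS")
print(" ".join(cmd))
```

A list built this way could be handed to `subprocess.run(cmd, check=True)` instead of the `!plink2 ...` shell escape, which makes it easier to reuse the same flags across cohorts.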

In [18]:
#convert from vcf to plink2

input_vcf = f'{WRKDIR}/genotypes/{COHORT}.vcf.gz'
out_file_set = f'{WRKDIR}/genotypes/{COHORT}'

!plink2 --vcf {input_vcf} --double-id \
--silent --allow-extra-chr --make-pgen multiallelics=- --out {out_file_set}

In [19]:
#check creation and logging
!ls -lth {WRKDIR}/genotypes/{COHORT}.*
!tail {WRKDIR}/genotypes/{COHORT}.*log

-rw-r--r--  1 pengl7  NIH\Domain Users   1.3K Aug 26 18:17 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.log
-rw-r--r--  1 pengl7  NIH\Domain Users    45M Aug 26 18:17 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.pgen
-rw-r--r--  1 pengl7  NIH\Domain Users   2.6G Aug 26 18:17 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.pvar
-rw-r--r--  1 pengl7  NIH\Domain Users   260B Aug 26 18:17 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.psam
-rw-r--r--  1 pengl7  NIH\Domain Users   1.2G Aug 26 16:12 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.vcf.gz
-rw-r--r--  1 pengl7  NIH\Domain Users   1.9M Aug 26 16:09 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.vcf.gz.tbi
-rw-r--r--  1 pengl7  NIH\Domain Users    29M Aug 26 16:08 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.chrX.vcf.gz
-rw-r--r--  1 pengl7  NIH\Domain Users    94K Aug 26 16:08 /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.chrX.vcf.gz.tbi
/Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS-temporary.psam.
9347639 variant

## Trim variants to QC set

What do geno, maf, and hwe mean?

- --geno filters out all variants whose missing call rate exceeds the provided value (default 0.1), while --mind does the same for samples.
- --maf filters out all variants with minor allele frequency below the provided threshold (default 0.01).
- --hwe filters out all variants whose Hardy-Weinberg equilibrium exact test p-value is below the provided threshold.
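
To make the first two filters concrete, here is a minimal sketch on a made-up genotype matrix. It assumes diploid calls coded as ALT allele counts with `np.nan` for missing calls; the function names are hypothetical and this is not how plink implements the filters internally. The HWE exact test is left out because it is more involved.

```python
import numpy as np

# Toy genotype matrix: rows = variants, columns = samples.
# Values are ALT allele counts (0, 1, 2); np.nan marks a missing call.
genotypes = np.array([
    [0, 1, 2, 1, 0, 1],            # complete calls, common variant
    [0, 0, 0, 0, 0, 0],            # complete calls, monomorphic
    [1, np.nan, np.nan, 2, 0, 1],  # 2 of 6 calls missing
])


def missing_rate(g):
    """Fraction of samples with a missing call, per variant."""
    return np.isnan(g).mean(axis=1)


def minor_allele_freq(g):
    """Minor allele frequency per variant, ignoring missing calls."""
    called = np.sum(~np.isnan(g), axis=1)
    alt_freq = np.nansum(g, axis=1) / (2 * called)
    return np.minimum(alt_freq, 1 - alt_freq)


# Same thresholds as --geno 0.05 --maf 0.05 in the cell below.
keep = (missing_rate(genotypes) <= 0.05) & (minor_allele_freq(genotypes) >= 0.05)
print(keep)
```

Only the first variant survives: the second has a minor allele frequency of 0 and the third is missing a third of its calls.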

In [21]:
#trim variant to QC set
input_file_set = f'{WRKDIR}/genotypes/{COHORT}'
out_file_set = f'{WRKDIR}/qc/{COHORT}.geno05maf05hwe000001'

!mkdir -p {WRKDIR}/qc
!plink2 --pfile {input_file_set} --make-bed --geno 0.05 --maf 0.05 --hwe 0.000001 --out {out_file_set}

PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.geno05maf05hwe000001.log.
Options in effect:
  --geno 0.05
  --hwe 0.000001
  --maf 0.05
  --make-bed
  --out /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.geno05maf05hwe000001
  --pfile /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS

Start time: Wed Aug 26 18:18:39 2020
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 12 threads (change this with --threads).
9 samples (0 females, 0 males, 9 ambiguous; 9 founders) loaded from
/Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.psam.
10435116 variants loaded from
/Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.pvar.
Note: No phenotype data present.
Calculating allele frequencies... 1011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757

### Extract chrX for the sex check at the --make-bed step using the --chr X option

In [22]:
#extract chrX and trim variants to QC set (no HWE filter here)
input_file_set = f'{WRKDIR}/genotypes/{COHORT}'
out_file_set = f'{WRKDIR}/qc/{COHORT}.chrX.geno05maf05'

!plink2 --pfile {input_file_set} --make-bed --chr X --geno 0.05 --maf 0.05 --out {out_file_set}

PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.chrX.geno05maf05.log.
Options in effect:
  --chr X
  --geno 0.05
  --maf 0.05
  --make-bed
  --out /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.chrX.geno05maf05
  --pfile /Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS

Start time: Wed Aug 26 18:19:39 2020
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 12 threads (change this with --threads).
9 samples (0 females, 0 males, 9 ambiguous; 9 founders) loaded from
/Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.psam.
298292 out of 10435116 variants loaded from
/Users/pengl7/Downloads/WGS/UNHS/genotypes/UNHS.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
--geno: 12598 variants removed due to missing genotype data.
19913 variants removed due to allele frequency threshold(

In [24]:
%ls -lth {WRKDIR}/qc

total 639968
-rw-r--r--  1 pengl7  NIH\Domain Users   1.3K Aug 26 18:19 UNHS.chrX.geno05maf05.log
-rw-r--r--  1 pengl7  NIH\Domain Users   779K Aug 26 18:19 UNHS.chrX.geno05maf05.bed
-rw-r--r--  1 pengl7  NIH\Domain Users   6.8M Aug 26 18:19 UNHS.chrX.geno05maf05.bim
-rw-r--r--  1 pengl7  NIH\Domain Users   301B Aug 26 18:19 UNHS.chrX.geno05maf05.fam
-rw-r--r--  1 pengl7  NIH\Domain Users   1.3K Aug 26 18:18 UNHS.geno05maf05hwe000001.log
-rw-r--r--  1 pengl7  NIH\Domain Users    28M Aug 26 18:18 UNHS.geno05maf05hwe000001.bed
-rw-r--r--  1 pengl7  NIH\Domain Users   262M Aug 26 18:18 UNHS.geno05maf05hwe000001.bim
-rw-r--r--  1 pengl7  NIH\Domain Users   301B Aug 26 18:18 UNHS.geno05maf05hwe000001.fam


### Check sex

plink2 doesn't have the --check-sex flag, so switch to plink1.9:
PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
Error: Unrecognized flag ('--check-sex').


In [70]:
%%bash -s "$COHORT" "$WRKDIR"
#check sex
COHORT=${1}
WRKDIR=${2}"/qc"
#keep only chrX variants in the hg38 non-PAR region (approximate boundaries)
awk '$4 > 2800000 && $4 < 155700000 {print $2}' ${WRKDIR}/${COHORT}.chrX.geno05maf05.bim \
    > ${WRKDIR}/${COHORT}.chrX.list

plink1.9 --bfile ${WRKDIR}/${COHORT}.chrX.geno05maf05 --extract ${WRKDIR}/${COHORT}.chrX.list \
--check-sex 0.25 0.75 --out ${WRKDIR}/${COHORT}.sex

PLINK v1.90p 64-bit (16 Jun 2020)              www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.sex.log.
Options in effect:
  --bfile /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.chrX.geno05maf05
  --check-sex 0.25 0.75
  --extract /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.chrX.list
  --out /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.sex

16384 MB RAM detected; reserving 8192 MB for main workspace.
265781 variants loaded from .bim file.
9 people (0 males, 0 females, 9 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.sex.nosex
.
--extract: 265717 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 9 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%



In [71]:
#check sex test results
sexcheck_df = pd.read_csv(f'{WRKDIR}/qc/{COHORT}.sex.sexcheck', sep=r'\s+')
print(f'sexcheck shape is {sexcheck_df.shape}')

#all samples flagged PROBLEM, including those with no sex recorded in the .fam (PEDSEX == 0)
fail_sexcheck_df = sexcheck_df.loc[sexcheck_df['STATUS'] == 'PROBLEM']
print(f'number of samples failing sexcheck {fail_sexcheck_df.shape[0]}')
print(fail_sexcheck_df)

#only samples that have a recorded sex and still fail
fail_sexcheck_df = sexcheck_df.loc[(sexcheck_df['STATUS'] == 'PROBLEM') &
                                   (sexcheck_df['PEDSEX'] != 0)]
print(f'number of samples failing sexcheck excluding missing info {fail_sexcheck_df.shape[0]}')
print(fail_sexcheck_df)

sexcheck shape is (9, 6)
number of samples failing sexcheck 9
                     FID                    IID  PEDSEX  SNPSEX   STATUS  \
0             GT19-38445             GT19-38445       0       0  PROBLEM   
1             GT19-38446             GT19-38446       0       0  PROBLEM   
2             GT19-38447             GT19-38447       0       0  PROBLEM   
3             GT19-38448             GT19-38448       0       0  PROBLEM   
4             GT19-38449             GT19-38449       0       1  PROBLEM   
5             GT19-38450             GT19-38450       0       0  PROBLEM   
6             GT19-38451             GT19-38451       0       0  PROBLEM   
7             GT19-38452             GT19-38452       0       1  PROBLEM   
8  NIST-reference-sample  NIST-reference-sample       0       1  PROBLEM   

        F  
0  0.7483  
1  0.7475  
2  0.7352  
3  0.7365  
4  0.7646  
5  0.7494  
6  0.7487  
7  0.7610  
8  0.7501  
number of samples failing sexcheck excluding missing info
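
Every sample is flagged PROBLEM here because the .fam file carries no sex information (PEDSEX is 0), not necessarily because the genotypes look wrong. The sketch below mimics how the SNPSEX and STATUS columns relate to F under the `--check-sex 0.25 0.75` thresholds used above. The function names are hypothetical, and the behaviour exactly at the thresholds is an assumption.

```python
def call_snpsex(f, female_max=0.25, male_min=0.75):
    """Return 2 (female), 1 (male) or 0 (ambiguous) from the X inbreeding coefficient F."""
    if f < female_max:
        return 2
    if f > male_min:
        return 1
    return 0


def sexcheck_status(pedsex, snpsex):
    """'OK' only when the recorded and inferred sex match and are nonzero."""
    return "OK" if pedsex != 0 and pedsex == snpsex else "PROBLEM"


# F values taken from the table above.
for f in (0.7483, 0.7646, 0.7494, 0.7501):
    snpsex = call_snpsex(f)
    print(f, snpsex, sexcheck_status(0, snpsex))
```

With these thresholds 0.7646 and 0.7501 are called male (1) and 0.7483 and 0.7494 stay ambiguous (0), matching the SNPSEX column. Note how close all nine F values sit to the 0.75 cutoff.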

In [60]:
#check missingness
!plink2 --bfile {WRKDIR}/qc/{COHORT}.geno05maf05hwe000001 --missing --autosome \
--out {WRKDIR}/qc/{COHORT}.missing

PLINK v1.90p 64-bit (16 Jun 2020)              www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.log.
Options in effect:
  --autosome
  --bfile /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.geno05maf05hwe000001
  --missing
  --out /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing

16384 MB RAM detected; reserving 8192 MB for main workspace.
9542779 out of 9820474 variants loaded from .bim file.
9 people (0 males, 0 females, 9 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 9 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 don

In [62]:
%ls -lth {WRKDIR}/qc/{COHORT}.missing*

-rw-r--r--  1 pengl7  NIH\Domain Users   1.0K Aug 26 18:58 /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.log
-rw-r--r--  1 pengl7  NIH\Domain Users   860B Aug 26 18:58 /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.imiss
-rw-r--r--  1 pengl7  NIH\Domain Users   419M Aug 26 18:58 /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.lmiss
-rw-r--r--  1 pengl7  NIH\Domain Users   220B Aug 26 18:58 /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.nosex
-rw-r--r--  1 pengl7  NIH\Domain Users   155M Aug 26 18:57 /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.vmiss
-rw-r--r--  1 pengl7  NIH\Domain Users   362B Aug 26 18:57 /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.missing.smiss


In [61]:
#check missingness results
misstest_df = pd.read_csv(f'{WRKDIR}/qc/{COHORT}.missing.smiss', sep=r'\s+')

#find failed: samples with more than 5% missing calls
misstest_failed_df = misstest_df.loc[misstest_df['F_MISS'] > 0.05]

print(f'number of samples failing missingness test {misstest_failed_df.shape[0]}')

print(misstest_failed_df)

number of samples failing missingness test 0
Empty DataFrame
Columns: [FID, IID, MISS_PHENO, N_MISS, N_GENO, F_MISS]
Index: []


### check het rates

In [54]:
#check het rates, plink2 doesn't work
!plink1.9 --bfile {WRKDIR}/qc/{COHORT}.geno05maf05hwe000001 --het --autosome \
--out {WRKDIR}/qc/{COHORT}.het

PLINK v1.90p 64-bit (16 Jun 2020)              www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.het.log.
Options in effect:
  --autosome
  --bfile /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.geno05maf05hwe000001
  --het
  --out /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.het

16384 MB RAM detected; reserving 8192 MB for main workspace.
9542779 out of 9820474 variants loaded from .bim file.
9 people (0 males, 0 females, 9 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /Users/pengl7/Downloads/WGS/UNHS/qc/UNHS.het.nosex
.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 9 founders and 0 nonfounders present.
Calculating allele frequencies... done.
9542779 varia

In [55]:
#check het rate results

hets_df = pd.read_csv(f'{WRKDIR}/qc/{COHORT}.het.het', sep=r'\s+')

#find failed: |F| > 0.15
hets_failed_df = hets_df.loc[(hets_df['F'] > 0.15) | (hets_df['F'] < -0.15)]

print(f'number of samples failing heterozygosity check {hets_failed_df.shape[0]}')

print(hets_failed_df)

number of samples failing heterozygosity check 0
Empty DataFrame
Columns: [FID, IID, O(HOM), E(HOM), N(NM), F]
Index: []
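
For reference, the F column in the .het file is the method-of-moments inbreeding coefficient, F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). A small sketch with made-up counts shows how the ±0.15 cutoff used above behaves; the sample names and numbers are invented, not taken from this cohort.

```python
import pandas as pd

# Made-up per-sample counts using the .het column names.
toy_het = pd.DataFrame({
    "IID": ["sample_a", "sample_b"],
    "O(HOM)": [6000, 7500],       # observed homozygous genotypes
    "E(HOM)": [6000.0, 6000.0],   # expected homozygous genotypes
    "N(NM)": [10000, 10000],      # non-missing genotypes
})

# F = (observed hom - expected hom) / (non-missing - expected hom)
toy_het["F"] = (toy_het["O(HOM)"] - toy_het["E(HOM)"]) / (
    toy_het["N(NM)"] - toy_het["E(HOM)"]
)

# Same rule as the cell above, written with abs().
toy_failed = toy_het.loc[toy_het["F"].abs() > 0.15]
print(toy_failed[["IID", "F"]])
```

Here sample_a has F = 0 and passes, while sample_b has F = 0.375 (an excess of homozygous calls) and would be flagged.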
