## Notebook for prepping the ADRD Aging genotypes for NIMH Human Brain Collection Core
The original ped/map files for the 3 Illumina chip types are [here](gs://nihnialng-aging-brain/genotypes): H1M, H5M4, H650K

- H1M=Human1M-Duov3_B (3 samples)
- H650K=HumanHap650Yv3.0 (7 samples)
- H5M4=HumanOmni5-Quad (3 samples)

The steps needed are:
- merge genotypes from the different Illumina platforms
- lift over from hg19 to hg38
- re-order the chromosomes from the typical numeric order to 10X's lexicographical order

In [1]:
!date

Fri Jul 16 21:36:37 EDT 2021


#### import libraries and set notebook variables

In [2]:
import pandas as pd
import os

In [28]:
# naming
project = 'adrd'
cohort = 'aging'
bank = 'nhbcc'

# directories
wrk_dir = '/labshare/raph/datasets/adrd_neuro'
cohort_dir = f'{wrk_dir}/{cohort}'
cohort_bckt = 'gs://nihnialng-aging-brain'
genos_dir = f'{cohort_dir}/genotypes'
genos_bcket = f'{cohort_bckt}/genotypes'
tools_dir = f'{wrk_dir}/tools'
fasta_index_bucket_path = 'gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta*'
fasta_dict_bucket_path = 'gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict'
demuxlet_vcf_file = f'{genos_dir}/{project}_{cohort}_{bank}.hg38.demuxlet.vcf.gz'

# constants
capture_out = !(nproc)
max_threads = int(capture_out[0])
# MemTotal is reported in kB, so max_mem is in kB (used with the 'k' suffix for java -Xmx)
capture_out = !grep MemTotal /proc/meminfo | awk '{print $2}'
max_mem = int(capture_out[0])

hbcc_ped_prefixes = ['H1M', 'H5M4', 'H650K']

hbcc_id_dict = {'200901070006_R03C01': 'NHBCC-2790', '204321360038_R01C01': 'NHBCC-2628', 
                '3999495136_R02C01': 'NHBCC-1119', '4031091001_A': 'NHBCC-1613', 
                '4031091029_A': 'NHBCC-1615', '4031091062_A': 'NHBCC-1669', 
                '4040296037_A': 'NHBCC-1556', '4040296068_A': 'NHBCC-1604', 
                '4040296088_A': 'NHBCC-1137', '4256126251_A': 'NHBCC-1187', 
                '4463344637_R01C01': 'NHBCC-1275', '4572348358_R01C02': 'NHBCC-831', 
                '4584664028_R01C01': 'NHBCC-1340'}

# when combining the final vcf for demuxlet, the chromosomes need to be
# sorted to match 10X's lexicographical order
autosomes = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9]
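
The hand-typed list above is the order you get when the autosome numbers are sorted by their string form, which is what lexicographical order means here. A minimal sketch that derives the same order (the name `lexicographic_autosomes` is illustrative and not part of the original notebook):

```python
# Sort autosome numbers 1..22 by their string form, so that '10' sorts before '2'.
# `lexicographic_autosomes` is a hypothetical name used only for this illustration.
lexicographic_autosomes = sorted(range(1, 23), key=str)

print(lexicographic_autosomes)
```

Deriving the list this way avoids transcription slips if the chromosome set ever changes.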

#### pull down the NIMH HBCC provided plink ped files

In [4]:
for hbcc_prefix in hbcc_ped_prefixes:
    this_cmd = f'gsutil -mq cp {genos_bcket}/{hbcc_prefix}.* {genos_dir}/'
    print(f'{hbcc_prefix}: {this_cmd}')
    !{this_cmd}

H1M: gsutil -mq cp gs://nihnialng-aging-brain/genotypes/H1M.* /labshare/raph/datasets/adrd_neuro/aging/genotypes/
H5M4: gsutil -mq cp gs://nihnialng-aging-brain/genotypes/H5M4.* /labshare/raph/datasets/adrd_neuro/aging/genotypes/
H650K: gsutil -mq cp gs://nihnialng-aging-brain/genotypes/H650K.* /labshare/raph/datasets/adrd_neuro/aging/genotypes/


In [5]:
!ls -lh {genos_dir}

total 333M
-rw-rw-r--. 1 gibbsr gibbsr  35M Jul 16 21:36 H1M.map
-rw-rw-r--. 1 gibbsr gibbsr  14M Jul 16 21:36 H1M.ped
-rw-rw-r--. 1 gibbsr gibbsr 130M Jul 16 21:36 H5M4.map
-rw-rw-r--. 1 gibbsr gibbsr  51M Jul 16 21:36 H5M4.ped
-rw-rw-r--. 1 gibbsr gibbsr  20M Jul 16 21:36 H650K.map
-rw-rw-r--. 1 gibbsr gibbsr  18M Jul 16 21:36 H650K.ped
-rw-rw-r--. 1 gibbsr gibbsr  234 Jul 16 13:47 merge-list.txt
-rw-rw-r--. 1 gibbsr gibbsr 4.7M Jul 16 13:50 variants_to_keep.txt


#### see which variants are shared across the provided platforms, by variant name

In [6]:
shared_variants = None
for hbcc_prefix in hbcc_ped_prefixes:
    map_df = pd.read_csv(f'{genos_dir}/{hbcc_prefix}.map', sep=r'\s+', header=None)
    # in a map file the variant name is the 2nd column
#     print(map_df.shape)
#     display(map_df.head())
    if shared_variants is None:
        shared_variants = set(map_df[1])
    else:
        shared_variants = shared_variants & set(map_df[1])
    print(f'{hbcc_prefix} shape={map_df.shape} shared variant size = {len(shared_variants)}')

H1M shape=(1192666, 4) shared variant size = 1192666
H5M4 shape=(4437269, 4) shared variant size = 971362
H650K shape=(660918, 4) shared variant size = 561035
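
The running intersection above can be illustrated on toy data. This is a minimal sketch, assuming each platform is represented only by its list of variant names; the helper `shared_variant_names` and the toy chip names are hypothetical and not part of the notebook:

```python
from functools import reduce


def shared_variant_names(variant_lists):
    """Return the set of variant names present in every list (hypothetical helper)."""
    return reduce(set.intersection, (set(names) for names in variant_lists))


# toy variant names standing in for the 2nd column of three .map files
toy_maps = {
    'chipA': ['rs1', 'rs2', 'rs3', 'rs4'],
    'chipB': ['rs2', 'rs3', 'rs4', 'rs5'],
    'chipC': ['rs3', 'rs4', 'rs6'],
}

shared = shared_variant_names(toy_maps.values())
print(sorted(shared))
```

Matching by name only counts a site as shared when all chips use the same identifier for it, so this count may understate the positional overlap.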


#### a decent number of variants are shared, so proceed to merging

#### convert to bfiles

In [7]:
for hbcc_prefix in hbcc_ped_prefixes:
    this_cmd = f'plink --ped {genos_dir}/{hbcc_prefix}.ped \
--map {genos_dir}/{hbcc_prefix}.map \
--make-bed --out {genos_dir}/{project}_{cohort}_{bank}_{hbcc_prefix} --silent'
    print(this_cmd)
    !{this_cmd}

plink --ped /labshare/raph/datasets/adrd_neuro/aging/genotypes/H1M.ped --map /labshare/raph/datasets/adrd_neuro/aging/genotypes/H1M.map --make-bed --out /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc_H1M --silent
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc_H1M.hh );
many commands treat these as missing.
treat these as missing.
plink --ped /labshare/raph/datasets/adrd_neuro/aging/genotypes/H5M4.ped --map /labshare/raph/datasets/adrd_neuro/aging/genotypes/H5M4.map --make-bed --out /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc_H5M4 --silent
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc_H5M4.hh );
many commands treat these as missing.
treat these as missing.
plink --ped /labshare/raph/datasets/adrd_neuro/aging/genotypes/H650K.ped --map /labshare/raph/datasets/adrd_neuro/aging/genotypes/H650K.map --make-bed --out /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc_H650K --silent
treat th

In [9]:
!ls -lhtr {genos_dir}

total 596M
-rw-rw-r--. 1 gibbsr gibbsr  234 Jul 16 13:47 merge-list.txt
-rw-rw-r--. 1 gibbsr gibbsr 4.7M Jul 16 13:50 variants_to_keep.txt
-rw-rw-r--. 1 gibbsr gibbsr  14M Jul 16 21:36 H1M.ped
-rw-rw-r--. 1 gibbsr gibbsr  35M Jul 16 21:36 H1M.map
-rw-rw-r--. 1 gibbsr gibbsr  51M Jul 16 21:36 H5M4.ped
-rw-rw-r--. 1 gibbsr gibbsr 130M Jul 16 21:36 H5M4.map
-rw-rw-r--. 1 gibbsr gibbsr  18M Jul 16 21:36 H650K.ped
-rw-rw-r--. 1 gibbsr gibbsr  20M Jul 16 21:36 H650K.map
-rw-rw-r--. 1 gibbsr gibbsr 6.7K Jul 16 21:37 adrd_aging_nhbcc_H1M.hh
-rw-rw-r--. 1 gibbsr gibbsr 1.2M Jul 16 21:37 adrd_aging_nhbcc_H1M.bed
-rw-rw-r--. 1 gibbsr gibbsr   92 Jul 16 21:37 adrd_aging_nhbcc_H1M.fam
-rw-rw-r--. 1 gibbsr gibbsr  40M Jul 16 21:37 adrd_aging_nhbcc_H1M.bim
-rw-rw-r--. 1 gibbsr gibbsr 1.8K Jul 16 21:37 adrd_aging_nhbcc_H1M.log
-rw-rw-r--. 1 gibbsr gibbsr 107K Jul 16 21:37 adrd_aging_nhbcc_H5M4.hh
-rw-rw-r--. 1 gibbsr gibbsr 4.3M Jul 16 21:37 adrd_aging_nhbcc_H5M4.bed
-rw-rw-r--. 1 gibbsr gibbsr   95 J

#### merge the per-chip-type plink bfiles

In [10]:
# merge the files into a single plink binary set
merge_file_set = f'{genos_dir}/merge-list.txt'
bfile_set = f'{genos_dir}/{project}_{cohort}_{bank}'

with open(merge_file_set, 'w') as file_handler:
    for hbcc_prefix in hbcc_ped_prefixes:
        prefix_file_set = f'{genos_dir}/{project}_{cohort}_{bank}_{hbcc_prefix}'
        file_handler.write(f'{prefix_file_set}\n')

# merge the per platform bfiles into a merged bfile
this_cmd = f'plink --merge-list {merge_file_set} --make-bed --allow-no-sex \
--keep-allele-order --silent --out {bfile_set}'
print(this_cmd)
!{this_cmd}

# if there was a missnp problem, remove those variants and re-attempt the merge
if os.path.exists(f'{bfile_set}-merge.missnp'):
    for hbcc_prefix in hbcc_ped_prefixes:
        !plink --bfile {genos_dir}/{project}_{cohort}_{bank}_{hbcc_prefix} \
--silent --exclude {bfile_set}-merge.missnp \
--keep-allele-order --make-bed \
--out {genos_dir}/{project}_{cohort}_{bank}_{hbcc_prefix}.temp

    with open(merge_file_set, 'w') as file_handler:
        for hbcc_prefix in hbcc_ped_prefixes:
            prefix_file_set = f'{genos_dir}/{project}_{cohort}_{bank}_{hbcc_prefix}.temp'
            file_handler.write(f'{prefix_file_set}\n')
        
    !plink --merge-list {merge_file_set} --make-bed --allow-no-sex \
--keep-allele-order --silent --out {bfile_set} --geno 0.05

plink --merge-list /labshare/raph/datasets/adrd_neuro/aging/genotypes/merge-list.txt --make-bed --allow-no-sex --keep-allele-order --silent --out /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc
Error: 1 variant with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc-merge.missnp.
  alleles probably remain in your data.  If LD between nearby SNPs is high,
  --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc_H1M.temp.hh
); many commands treat these as missing.
treat these as missing.
/labsha
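
The first merge attempt fails because at least one variant shows 3+ alleles once the platforms are combined, and plink lists such variants in the `-merge.missnp` file. The idea can be sketched on toy data as below. This is an illustration only, not plink's implementation; the helper `variants_with_extra_alleles` and the toy records are hypothetical:

```python
from collections import defaultdict


def variants_with_extra_alleles(bim_tables):
    """Flag variant IDs whose combined allele set across tables has more than 2 alleles.

    bim_tables: iterable of lists of (variant_id, allele1, allele2) tuples.
    The '0' code (missing allele) is ignored when collecting alleles.
    """
    alleles = defaultdict(set)
    for table in bim_tables:
        for variant_id, a1, a2 in table:
            alleles[variant_id].update(a for a in (a1, a2) if a != '0')
    return sorted(v for v, seen in alleles.items() if len(seen) > 2)


# toy records: rs2 is reported on opposite strands by the two chips
chip_a = [('rs1', 'A', 'G'), ('rs2', 'C', 'T')]
chip_b = [('rs1', 'A', 'G'), ('rs2', 'G', 'A')]

flagged = variants_with_extra_alleles([chip_a, chip_b])
print(flagged)
```

The notebook takes the simple route of excluding the flagged variants from every platform before re-merging, rather than attempting a strand flip.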

In [11]:
!ls -lhtr {genos_dir}/{project}_{cohort}_{bank}.*

-rw-rw-r--. 1 gibbsr gibbsr  113 Jul 16 21:38 /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.nosex
-rw-rw-r--. 1 gibbsr gibbsr 113K Jul 16 21:38 /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hh
-rw-rw-r--. 1 gibbsr gibbsr 2.0M Jul 16 21:38 /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.bed
-rw-rw-r--. 1 gibbsr gibbsr  363 Jul 16 21:38 /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.fam
-rw-rw-r--. 1 gibbsr gibbsr  17M Jul 16 21:38 /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.bim
-rw-rw-r--. 1 gibbsr gibbsr  51M Jul 16 21:38 /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.log


In [12]:
!tail -n 25 {genos_dir}/{project}_{cohort}_{bank}.log

/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc-merge.bim +
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc-merge.fam .
4665423 variants loaded from .bim file.
13 people (4 males, 2 females, 7 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 13 founders and 0 nonfounders present.
Calculating allele frequencies... done.
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hh ); many
commands treat these as missing.
treat these as missing.
Total genotyping rate is 0.344533.
4155982 variants removed due to missing genotype data (--geno).
509441 variants and 13 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.bed +
/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbc
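
The `--geno 0.05` filter in the re-merge explains the large drop reported in this log. With only 13 samples, a single missing call gives a missing rate of 1/13 (about 0.077), which already exceeds 0.05, so a variant survives only if it is called in every sample. A variant absent from any one chip is missing for all of that chip's samples and is therefore removed. A short worked check of the arithmetic, using the numbers from the log above (variable names are illustrative):

```python
import math

n_samples = 13
geno_threshold = 0.05

# plink's --geno removes variants whose missing-call rate exceeds the threshold,
# so this is the largest number of missing calls a variant can have and still pass
max_missing_calls = math.floor(geno_threshold * n_samples)

# counts taken from the plink log shown above
variants_loaded = 4665423
variants_removed = 4155982
variants_kept = variants_loaded - variants_removed

print(f'max missing calls allowed: {max_missing_calls}')
print(f'variants kept: {variants_kept}')
```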

#### load the merged bim file and check the per-chromosome variant counts

In [13]:
bim_df = pd.read_csv(f'{bfile_set}.bim', sep=r'\s+', header=None)
print(bim_df.shape)
display(bim_df.sample(10))
display(bim_df[0].value_counts())

(509441, 6)


| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 200057 | 6 | rs17051952 | 125.028 | 124857859 | 0 | A |
| 434501 | 17 | rs12951306 | 20.0268 | 7020507 | G | A |
| 375133 | 13 | rs7983863 | 57.6872 | 59768722 | A | G |
| 359722 | 12 | rs11109229 | 111.15 | 98208150 | A | C |
| 235964 | 7 | rs13221692 | 172.293 | 153367707 | A | C |
| 460968 | 18 | rs8085865 | 121.724 | 76047493 | G | A |
| 223076 | 7 | rs6460671 | 84.5521 | 71020307 | G | A |
| 486174 | 21 | rs2834385 | 41.7415 | 35506430 | A | G |
| 403148 | 14 | rs7146643 | 124.833 | 105757392 | G | A |
| 180933 | 6 | rs642859 | 37.1041 | 16796614 | A | G |


2     41233
1     38796
3     34364
6     33237
5     31397
4     30507
8     28929
7     27247
10    27091
11    24907
12    24819
9     23941
13    19802
14    16599
18    15727
15    15088
16    14891
20    12937
23    12559
17    12455
19     7605
21     7277
22     6659
24      981
0       302
26       78
25       13
Name: 0, dtype: int64
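
The counts above use plink's numeric chromosome codes, where values outside 1 to 22 denote the sex chromosomes, the pseudo-autosomal region, mitochondria, and unplaced variants. A small lookup makes the tail of the table easier to read. This is a sketch; the names `PLINK_CHROM_CODES` and `chrom_label` are hypothetical, and the counts are copied from the output above:

```python
# plink 1.9 numeric codes for non-autosomal chromosomes (human)
PLINK_CHROM_CODES = {
    0: 'unplaced',
    23: 'X',
    24: 'Y',
    25: 'XY (pseudo-autosomal)',
    26: 'MT',
}


def chrom_label(code):
    """Return a readable label for a plink chromosome code (hypothetical helper)."""
    return PLINK_CHROM_CODES.get(code, str(code))


# non-autosomal counts copied from the value_counts output above
non_autosomal_counts = {23: 12559, 24: 981, 0: 302, 26: 78, 25: 13}
for code, n_variants in non_autosomal_counts.items():
    print(f'{chrom_label(code)}: {n_variants}')
```

Only chromosomes 1 to 22 are carried through to the final demuxlet VCF, since the `autosomes` list drives the later per-chromosome split.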

#### find variants to exclude that aren't SNVs

In [14]:
display(bim_df[4].value_counts())
display(bim_df[5].value_counts())
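# note: the allele columns are read as strings, so the integer 0 in this list does not
# match the '0' missing-allele code; variants with a '0' allele are dropped by this filter
nucleotides = ['A', 'C', 'G', 'T', 'N', 'a', 'c', 'g', 't', 'n', 0]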
vars_to_include = bim_df.loc[(bim_df[4].isin(nucleotides)) & 
                             (bim_df[5].isin(nucleotides))]
print(vars_to_include.shape)
display(vars_to_include.head())

vars_to_include[1].to_csv(f'{genos_dir}/variants_to_keep.txt', index=False, header=False)

A    255595
G    183262
C     43351
0     27225
T         8
Name: 4, dtype: int64

A    238689
G    221471
C     49262
T        19
Name: 5, dtype: int64

(482216, 6)


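| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1 | 0 | rs10015934 | 0.0 | 0 | A | G |
| 2 | 0 | rs1004236 | 0.0 | 0 | A | G |
| 3 | 0 | rs1006094 | 0.0 | 0 | A | G |
| 4 | 0 | rs10084637 | 0.0 | 0 | C | A |
| 5 | 0 | rs10155688 | 0.0 | 0 | G | A |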
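
The filter keeps a variant only when both allele columns hold a single nucleotide. The kept count matches 509441 minus the 27225 variants whose first allele is `0`, which suggests the `0` code is being compared as a string. A minimal sketch of the same filter on toy rows, assuming alleles are strings; `is_biallelic_snv` and the toy rows are hypothetical:

```python
NUCLEOTIDES = {'A', 'C', 'G', 'T'}


def is_biallelic_snv(a1, a2):
    """True when both alleles are single nucleotides, ignoring case (hypothetical helper)."""
    return a1.upper() in NUCLEOTIDES and a2.upper() in NUCLEOTIDES


# toy bim-like rows: (variant_id, allele1, allele2)
rows = [
    ('rs1', 'A', 'G'),   # SNV, kept
    ('rs2', '0', 'A'),   # missing first allele, dropped
    ('rs3', 'I', 'D'),   # insertion/deletion coding, dropped
    ('rs4', 'AT', 'A'),  # multi-base allele, dropped
]
kept = [variant_id for variant_id, a1, a2 in rows if is_biallelic_snv(a1, a2)]
print(kept)

# a string '0' is not equal to the integer 0, so it is not found in a list holding 0
print('0' in ['A', 'C', 'G', 'T', 0])
```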


#### convert to vcf, exclude InDels

In [15]:
this_cmd = f'plink2 --bfile {bfile_set} --silent \
--export vcf-4.2 bgz id-paste=iid --out {bfile_set} \
--output-chr chrM --not-chr 0 --snps-only \
--extract {genos_dir}/variants_to_keep.txt'

print(this_cmd)
!{this_cmd}

plink2 --bfile /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc --silent --export vcf-4.2 bgz id-paste=iid --out /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc --output-chr chrM --not-chr 0 --snps-only --extract /labshare/raph/datasets/adrd_neuro/aging/genotypes/variants_to_keep.txt
reconstruct them. Consider rerunning with a suitable --export id-delim= value.


#### re-ID the NIMH HBCC samples using the provided mapping

In [16]:
# since using bcftools reheader to update IDs the sample order has to match
# check expected samples present and if any ID formats need correction
# all files have same sample set so checking one is fine
temp_sample_list_file = f'{bfile_set}.sample.list'
!bcftools query --list-samples {bfile_set}.vcf.gz > \
{temp_sample_list_file}

ids_present_df = pd.read_csv(temp_sample_list_file, header=None)
ids_present_df.columns = ['ID']
print(ids_present_df.shape)

# look up the new ID
ids_present_df['newID'] = ids_present_df['ID'].apply(hbcc_id_dict.get)

ids_present_df.to_csv(f'{bfile_set}.rename.sample.list',
                      index=False, header=False, sep='\t')

ids_present_df.head()

(13, 1)


| | ID | newID |
|---|---|---|
| 0 | 4040296088_A | NHBCC-1137 |
| 1 | 4040296037_A | NHBCC-1556 |
| 2 | 204321360038_R01C01 | NHBCC-2628 |
| 3 | 4584664028_R01C01 | NHBCC-1340 |
| 4 | 200901070006_R03C01 | NHBCC-2790 |
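
Because `dict.get` returns `None` for an unknown key, a sample ID missing from the mapping would silently produce an empty new name in the rename file. A minimal sketch of a guard that fails loudly instead; the helper `build_rename_lines` is hypothetical, and the two IDs are taken from the mapping defined earlier in the notebook:

```python
def build_rename_lines(sample_ids, id_map):
    """Return 'old<TAB>new' lines, raising if any sample has no new ID (hypothetical helper)."""
    missing = [sample for sample in sample_ids if sample not in id_map]
    if missing:
        raise KeyError(f'no new ID for: {missing}')
    return [f'{sample}\t{id_map[sample]}' for sample in sample_ids]


# two entries copied from hbcc_id_dict, for illustration
toy_map = {'4040296088_A': 'NHBCC-1137', '4040296037_A': 'NHBCC-1556'}
rename_lines = build_rename_lines(['4040296088_A', '4040296037_A'], toy_map)
print('\n'.join(rename_lines))
```

The old-name/new-name pair layout is the same two-column form the notebook writes for `bcftools reheader`.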


In [17]:
in_vcf = f'{bfile_set}.vcf.gz'
out_vcf = f'{bfile_set}.renamed.vcf.gz'
this_cmd = f'bcftools reheader --sample {bfile_set}.rename.sample.list \
--output {out_vcf} --threads 2 {in_vcf}'
print(this_cmd)
!{this_cmd}

bcftools reheader --sample /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.rename.sample.list --output /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.renamed.vcf.gz --threads 2 /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.vcf.gz


#### lift over from hg19 to hg38
Use Picard LiftoverVcf; it is slower than CrossMap but better.

#### get Picard

In [18]:
# grab picard jar
!wget --quiet  https://github.com/broadinstitute/picard/releases/download/2.25.5/picard.jar \
-O {tools_dir}/picard.jar

#### grab necessary ref files

In [19]:
this_cmd = f'gsutil -mq cp -P {fasta_index_bucket_path} {tools_dir}/'
print(this_cmd)
!{this_cmd}

this_cmd = f'gsutil -mq cp -P {fasta_dict_bucket_path} {tools_dir}/'
print(this_cmd)
!{this_cmd}

gsutil -mq cp -P gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta* /labshare/raph/datasets/adrd_neuro/tools/
gsutil -mq cp -P gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict /labshare/raph/datasets/adrd_neuro/tools/


In [20]:
# also need the hg19 files
this_cmd = f'gsutil -mq cp -P gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta* {tools_dir}/'
print(this_cmd)
!{this_cmd}

this_cmd = f'gsutil -mq cp -P gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.dict {tools_dir}/'
print(this_cmd)
!{this_cmd}

gsutil -mq cp -P gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta* /labshare/raph/datasets/adrd_neuro/tools/
gsutil -mq cp -P gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.dict /labshare/raph/datasets/adrd_neuro/tools/


In [21]:
!wget --quiet  http://hgdownload.cse.ucsc.edu/gbdb/hg19/liftOver/hg19ToHg38.over.chain.gz \
-O {tools_dir}/hg19ToHg38.over.chain.gz

#### run the liftover

In [22]:
vcf = f'{bfile_set}.renamed.vcf.gz'
out_vcf_name = f'{bfile_set}.hg38.vcf.gz'
out_vcf_unmapped_name = f'{bfile_set}.hg38unmapped.vcf.gz'

this_cmd = f'java -Xmx{max_mem}k -jar {tools_dir}/picard.jar LiftoverVcf \
INPUT={vcf} \
OUTPUT={out_vcf_name} \
CHAIN={tools_dir}/hg19ToHg38.over.chain.gz \
REJECT={out_vcf_unmapped_name} \
REFERENCE_SEQUENCE={tools_dir}/Homo_sapiens_assembly38.fasta \
MAX_RECORDS_IN_RAM=500000 QUIET=true RECOVER_SWAPPED_REF_ALT=true'

print(this_cmd)
!{this_cmd}

java -Xmx1056503524k -jar /labshare/raph/datasets/adrd_neuro/tools/picard.jar LiftoverVcf INPUT=/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.renamed.vcf.gz OUTPUT=/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.vcf.gz CHAIN=/labshare/raph/datasets/adrd_neuro/tools/hg19ToHg38.over.chain.gz REJECT=/labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38unmapped.vcf.gz REFERENCE_SEQUENCE=/labshare/raph/datasets/adrd_neuro/tools/Homo_sapiens_assembly38.fasta MAX_RECORDS_IN_RAM=500000 QUIET=true RECOVER_SWAPPED_REF_ALT=true
INFO	2021-07-16 21:54:50	LiftoverVcf	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    LiftoverVcf -INPUT /labshare/raph/datasets/
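
After the liftover it can be useful to compare how many records landed in the lifted VCF versus the reject VCF. A minimal sketch, assuming the outputs are gzip-readable (bgzip files are); the helper `count_vcf_records` and the demo file are hypothetical and not part of the notebook:

```python
import gzip
import os
import tempfile


def count_vcf_records(path):
    """Count non-header, non-empty lines in a gzip/bgzip compressed VCF (hypothetical helper)."""
    with gzip.open(path, 'rt') as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith('#'))


# small demo file so the sketch runs on its own
demo_dir = tempfile.mkdtemp()
demo_vcf = os.path.join(demo_dir, 'demo.vcf.gz')
with gzip.open(demo_vcf, 'wt') as handle:
    handle.write('##fileformat=VCFv4.2\n')
    handle.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
    handle.write('chr1\t100\trs1\tA\tG\t.\t.\t.\n')
    handle.write('chr2\t200\trs2\tC\tT\t.\t.\t.\n')

n_records = count_vcf_records(demo_vcf)
print(n_records)
```

In the notebook the same function could be pointed at `out_vcf_name` and `out_vcf_unmapped_name` to see what fraction of variants failed to lift.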

#### split the vcf by chromosome so it can be recombined in the specified chromosome order

In [36]:
%%time
in_vcf = f'{bfile_set}.hg38.vcf.gz'
for chrom in autosomes:
    # plink2 appends .vcf.gz to this output prefix
    out_vcf = f'{bfile_set}.hg38.chr{chrom}'
    # use plink2 instead of bcftools so header is reduced specifically for contigs
#     this_cmd = f'bcftools view --threads {max_threads} --output-type z \
# --regions chr{chrom} --output {out_vcf} {in_vcf}'
    this_cmd = f'plink2 --vcf {in_vcf} --silent --chr {chrom} --not-chr 0 \
--export vcf-4.2 bgz id-paste=iid --out {out_vcf} --output-chr chrM --allow-extra-chr'
    print(chrom, end='.')
    !{this_cmd}

1.10.11.12.13.14.15.16.17.18.19.2.20.21.22.3.4.5.6.7.8.9.CPU times: user 84 ms, sys: 267 ms, total: 351 ms
Wall time: 5.53 s


#### concat chromosome vcfs (in order) into genome vcf

In [37]:
vcf_files = []
for chrom in autosomes:
    vcf_files.append(f'{bfile_set}.hg38.chr{chrom}.vcf.gz')

vcf_files_arg = ' '.join(vcf_files)
this_cmd = f'bcftools concat --output-type z --output {demuxlet_vcf_file} \
--threads {max_threads} --no-version {vcf_files_arg}'   

# print(this_cmd)
!{this_cmd}

Checking the headers and starting positions of 22 files
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr1.vcf.gz	0.010531 seconds
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr10.vcf.gz	0.008038 seconds
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr11.vcf.gz	0.006002 seconds
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr12.vcf.gz	0.005656 seconds
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr13.vcf.gz	0.005266 seconds
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr14.vcf.gz	0.005382 seconds
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr15.vcf.gz	0.003957 seconds
Concatenating /labshare/raph/datasets/adrd_neuro/aging/genotypes/adrd_aging_nhbcc.hg38.chr16.vcf.gz	0.005047 seconds
Concatena
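
Since the point of the split and concat is the chromosome order, it may be worth confirming the order in the final file. A minimal sketch that reads the contigs in order of appearance and checks they are lexicographically sorted; the helper `contig_order` and the demo file are hypothetical and not part of the notebook:

```python
import gzip
import os
import tempfile


def contig_order(path):
    """Return contig names in order of appearance; a contig split into
    non-adjacent blocks shows up more than once (hypothetical helper)."""
    order = []
    with gzip.open(path, 'rt') as handle:
        for line in handle:
            if line.startswith('#') or not line.strip():
                continue
            chrom = line.split('\t', 1)[0]
            if not order or order[-1] != chrom:
                order.append(chrom)
    return order


# small demo file so the sketch runs on its own
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.concat.vcf.gz')
with gzip.open(demo_path, 'wt') as handle:
    handle.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
    for chrom, pos in [('chr1', 10), ('chr1', 20), ('chr10', 5), ('chr2', 7)]:
        handle.write(f'{chrom}\t{pos}\t.\tA\tG\t.\t.\t.\n')

observed = contig_order(demo_path)
is_lexicographic = observed == sorted(set(observed))
print(observed, is_lexicographic)
```

For the real output, `contig_order(demuxlet_vcf_file)` would be compared against `[f'chr{c}' for c in autosomes]`.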

#### index the final vcf

In [38]:
!tabix --preset vcf {demuxlet_vcf_file}