# Estimating Contamination Rate with hapCON for Male aDNA Samples

Starting from version 0.4a1, hapROH package now has an extension called hapCON to estimate contamination for male aDNA samples.

This small notebook walks you through how to use hapCON to run estimate contamination in your male aDNA sample. We will use one 1240k sample SUA001, from Sardinia, and a WGS sample DA43, from Mongolia, XiongNu to illustrate hapCON's usage on two different reference panels. In this tutorial, both samples have been downsampled to 0.1x to keep runtime minimal.

You can download the two BAM files at https://www.dropbox.com/sh/tgvwq75mvixeyic/AAAGURMdDGWLIxGzwAgtGm1Sa?dl=0

You can download the reference panel, the bed file used by samtools to get readcounts, and the metadata for reference panel at https://www.dropbox.com/s/1vv8mz9athedpiq/data.zip?dl=0


#hapCON with 1240k Reference Panel

You can run hapCON either from samtools's mpileup file or directly from BAM file. Running hapCON with BAM file is slower. We will first see how to run hapCON from samtools's mpileup file.

To generate the mpileup file for SUA001, we need a bed file to specify regions of interest, which is in the dropbox link provided above. Our bed file assumes that the contig name in your BAM file doesn't have chr or Chr prefix. If that is the case for your BAM file, please reset the header of your BAM file by "samtools reheader -c 'perl -pe "s/^(@SQ.*)(\tSN:)chr/\$1\$2/"' in.bam > out.bam". If you are unsure about the contig name of your BAM file, you can check it by "samtools view -H in.bam".

We have assumed that you have put the BAM file at ./Data, and please change the path to bed file for 1240k panel according to your setup. After that, we can run the following to generate the mpileup file,


In [1]:
path2bam="./Data/SUA001.bam"
path2bed1240k="/mnt/archgen/users/yilei/Data/1000G/1000g1240khdf5/all1240/1240kChrX.bed"
!samtools index $path2bam
!samtools mpileup --positions $path2bed1240k -r X -q 30 -Q 30 -o ./Data/SUA001.mpileup $path2bam

[mpileup] 1 samples in 1 input files


With the mpileup in hand, we can now ran hapCON to estimate contamination rate. Below is an example run with default setting. 

Please change the path to reference panel and meta data according to your setup. 

The function hapCON_chrom_BFGS should run for about 1 minute and a half. It produces two output files, which by default reside in the same directory as the input mpileup file. The first output file is a hdf5 file, which is used as an intermediary data file for our method, and can be removed by setting cleanup=True in the function. The second file is the contamination estimate, which is named as $iid.hapcon.OOA_CEU.txt (TODO: change this file name later so that it no longer has the OOA_CEU appendix).

In [9]:
import sys
#sys.path.insert(0, "/mnt/archgen/users/yilei/tools/hapROH/package") # TODO DELETE THIS LATER
from hapsburg.PackagesSupport.hapsburg_run import hapCon_chrom_BFGS
path2ref1240k="/mnt/archgen/users/yilei/Data/1000G/1000g1240khdf5/all1240/chrX.hdf5"
path2meta="/mnt/archgen/users/yilei/Data/1000G/1000g1240khdf5/all1240/meta_df_all.csv"

In [10]:
hapCon_chrom_BFGS(iid="SUA001", mpileup="./Data/SUA001.mpileup",
    h5_path1000g = path2ref1240k, meta_path_ref = path2meta)

exclude 1033 sites outside the specified region
exclude 0 non-SNP sites
number of major reads at flanking sites: 10636
number of minor reads at flanking sites: 16
number of major reads at focal sites: 1291
number of minor reads at focal sites: 27
err rate at flanking sites: 0.001502
err rate at focal sites: 0.020486
saving sample as SUA001 in /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/SUA001.hdf5
estimated genotyping error by flanking sites: 0.001502
number of sites covered by at least one read: 3999, fraction covered: 0.085
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/SUA001.hdf5
finished reading mpileup file, takes 1.625.
number of sites covered by at least one read: 3999
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/SUA001.hdf5
estimated contamination rate: 0.102144(0.076822 - 0.127466)


(0.10214392880751269, 0.0768220864631985, 0.1274657711518269)

Now you have a finished a hapCON run on SUA001! The estimated contamination should be about 10%. This is a highly contaminated sample! Now let's try to run hapCON directly from a BAM file. Running hapCON come BAM file is a bit slower. The following code should take about 1min.

In [11]:
hapCon_chrom_BFGS(iid="SUA001", bam="./Data/SUA001.bam",
    h5_path1000g = path2ref1240k, meta_path_ref = path2meta)

total number of mapped reads: 14755
number of major reads at flanking sites: 10521
number of minor reads at flanking sites: 15
number of major reads at focal sites: 1291
number of minor reads at focal sites: 27
err rate at flanking sites: 0.001424
err rate at focal sites: 0.020486
saving sample as SUA001 in /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/SUA001.hdf5
estimated genotyping error by flanking sites: 0.001424
number of sites covered by at least one read: 3999, fraction covered: 0.085
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/SUA001.hdf5
finished reading bam file, takes 62.481.
number of sites covered by at least one read: 3999
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/SUA001.hdf5
estimated contamination rate: 0.102215(0.076894 - 0.127536)


(0.1022147826552523, 0.07689382204728816, 0.12753574326321643)

#hapCON with 1000G Reference Panel
With WGS data, we recommend using hapCON with the 1000G reference panel instead. This reference panel contains all biallelic sites with MAF greater than 5% in the 1000Genome dataset, therefore it is much more powerful than the 1240k reference panel. We will use DA43, a Mongolia XiongNu WGS sample. Let's first generate a mpileup file for it.

We have assumed that you have put the BAM file of DA43 at ./Data, and please change the path to bed file for 1000G panel according to your setup. After that, we can run the following to generate the mpileup file,

In [5]:
path2bam="./Data/DA43.bam"
path2bed1kg="/mnt/archgen/users/yilei/Data/1000G/1000g1240khdf5/all1240/maf5FilterChrX.bed"
!samtools index $path2bam
!samtools mpileup --positions $path2bed1kg -r X -q 30 -Q 30 -o ./Data/DA43.mpileup $path2bam

[mpileup] 1 samples in 1 input files


With the mpileup file, we can run hapCON on DA43 similar as we did to SUA001. Please change the path to the 1000G referene panel according to your setup. Running hapCON with 1000G panel is slower than that with 1240k panel, as it contains 4 times more sites. 

In [6]:
path2ref1kg="/mnt/archgen/users/yilei/Data/1000G/1000g1240khdf5/all1240/maf5_filter_chrX.hdf5"
hapCon_chrom_BFGS(iid="DA43", mpileup="./Data/DA43.mpileup",
    h5_path1000g = path2ref1240k, meta_path_ref = path2meta)

exclude 1033 sites outside the specified region
exclude 0 non-SNP sites
number of major reads at flanking sites: 18220
number of minor reads at flanking sites: 184
number of major reads at focal sites: 390
number of minor reads at focal sites: 13
err rate at flanking sites: 0.009998
err rate at focal sites: 0.032258
saving sample as DA43 in /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/DA43.hdf5
estimated genotyping error by flanking sites: 0.009998
number of sites covered by at least one read: 3595, fraction covered: 0.077
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/DA43.hdf5
finished reading mpileup file, takes 6.753.
number of sites covered by at least one read: 3595
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/DA43.hdf5
estimated contamination rate: 0.022853(0.008851 - 0.036855)


(0.022852809591185665, 0.008850652657120381, 0.03685496652525095)

The estimated contamination should be between 2% to 3%. Now you have finished your first trial with 1000G reference panel!

Alternatively, we can also run directly from the BAM file. This should take about 1min.

In [7]:
hapCon_chrom_BFGS(iid="DA43", bam="./Data/DA43.bam",
    h5_path1000g = path2ref1240k, meta_path_ref = path2meta)

total number of mapped reads: 281907
number of major reads at flanking sites: 4776
number of minor reads at flanking sites: 41
number of major reads at focal sites: 569
number of minor reads at focal sites: 16
err rate at flanking sites: 0.008512
err rate at focal sites: 0.027350
saving sample as DA43 in /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/DA43.hdf5
estimated genotyping error by flanking sites: 0.008512
number of sites covered by at least one read: 5074, fraction covered: 0.108
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/DA43.hdf5
finished reading bam file, takes 54.945.
number of sites covered by at least one read: 5074
hdf5 file saved to /mnt/archgen/users/yilei/tools/hapROH/Notebooks/Vignettes/Data/DA43.hdf5
estimated contamination rate: 0.021743(0.008636 - 0.034850)


(0.021743082853858616, 0.008635834460422537, 0.0348503312472947)

# what's more
More detailed documentation (including the full list of user-adjustable parameters) about hapCON can be seen at https://haproh.readthedocs.io/en/latest/hapCON.html