SAMTOOLS DEPTH ON MULTIPLE BAMS
===============================

In [1]:
! samtools --version

samtools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.


``samtools depth`` requires:
- indexed bam files 
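
The index requirement can be checked up front. The following is a minimal sketch, not part of the original workflow: the helper name `missing_indexes` is hypothetical, and it assumes an index sits next to each BAM as `<name>.bam.bai`, `<name>.bai`, or `<name>.bam.csi`.

```python
import os


def missing_indexes(bam_paths):
    """Return the BAM paths that have no index file next to them."""
    missing = []
    for bam in bam_paths:
        stem = bam[:-4] if bam.endswith(".bam") else bam
        # Assumed index locations: x.bam.bai, x.bai, or x.bam.csi
        candidates = [bam + ".bai", stem + ".bai", bam + ".csi"]
        if not any(os.path.exists(path) for path in candidates):
            missing.append(bam)
    return missing
```

Passing the lines of the input list file (one BAM path per line, stripped of newlines) should return an empty list when every BAM is indexed.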

I'm using default filters in the next section.
See ``depth`` [manual](http://www.htslib.org/doc/samtools.html) for more info.

Create a list of input file paths
----------------------------------
Example:

In [20]:
%%bash
INPUT_DIRECTORY="/hackathon/Hackathon_Project_4/FINAL_BAM/"
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/DEPTH"
OUTPUT_FILE="input_all_rg_final.txt"

ls ${INPUT_DIRECTORY}/*bam > ${OUTPUT_DIRECTORY}/${OUTPUT_FILE}
head -n 5 ${OUTPUT_DIRECTORY}/${OUTPUT_FILE}
wc -l ${OUTPUT_DIRECTORY}/${OUTPUT_FILE}


/hackathon/Hackathon_Project_4/FINAL_BAM//ENCFF000CWZ_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam
/hackathon/Hackathon_Project_4/FINAL_BAM//ENCFF000CXF_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam
/hackathon/Hackathon_Project_4/FINAL_BAM//ENCFF000CXQ_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam
/hackathon/Hackathon_Project_4/FINAL_BAM//ENCFF000CZN_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam
/hackathon/Hackathon_Project_4/FINAL_BAM//ENCFF000CZY_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam
148 /hackathon/Hackathon_Project_4/DEPTH/input_all_rg_final.txt


Remove problematic samples (skip this):

In [21]:
%%bash
#grep -vf <(awk -F ":" '{print $3}' /hackathon/Hackathon_Project_4/problematic_samples.txt) /hackathon/Hackathon_Project_4/DEPTH/input_all_rg.txt > /hackathon/Hackathon_Project_4/DEPTH/input_all_rg_problematic_excluded.txt

Generating depths with a list of input bams
--------------------------------------------

This step takes a while if the number of input files is large. The ``-@`` (threads) option works for ``sort`` and ``view``, but it does not seem to work for ``depth`` in v1.3.1; I posted a comment [here](https://github.com/samtools/samtools/issues/348). An alternative is to run ``depth`` on each BAM in parallel. The challenge then is to aggregate the per-sample outputs by position, since each sample's file may cover a different set of positions.
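
The aggregation step mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the method used in this notebook: `merge_depths` is a hypothetical helper, each sample is given as an iterable of `(chromosome, position, depth)` rows, and positions absent from a sample are filled with 0. It holds everything in memory, so it only suits small regions.

```python
def merge_depths(sample_rows):
    """Merge per-sample depth rows into a position-by-sample matrix.

    sample_rows: list with one entry per sample; each entry is an iterable
    of (chromosome, position, depth) rows. Missing positions become 0.
    """
    n_samples = len(sample_rows)
    merged = {}
    for index, rows in enumerate(sample_rows):
        for chrom, pos, depth in rows:
            key = (chrom, int(pos))
            # Create a zero-filled row the first time a position is seen
            merged.setdefault(key, [0] * n_samples)[index] = int(depth)
    # Sort by chromosome name, then numeric position
    return [[chrom, pos] + depths
            for (chrom, pos), depths in sorted(merged.items())]
```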

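Mapping quality and base quality filters used for ``depth`` are meant to match [GATK haplotype caller](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php) defaults (minimum mapping quality 20, minimum base quality 10). Note that in ``samtools depth``, ``-q`` is the base quality threshold and ``-Q`` is the mapping quality threshold, so matching those defaults requires ``-q 10 -Q 20``; the commands recorded below pass ``-q20 -Q10``, which applies the two thresholds the other way round. We also increased the threshold for maximum depth. Note that ``depth`` does not include flag filters, so there may be discrepancies between GATK coverage and the output of ``samtools depth`` in regions with unmapped or duplicate reads.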

In [17]:
%%bash
INPUT_DIRECTORY="/hackathon/Hackathon_Project_4/DEPTH"
INPUT_FILE="input_all_rg_final.txt"
OUTPUT_FILE="depths_all_rg_final.txt"
REGION="chr22"
wc -l ${INPUT_DIRECTORY}/${INPUT_FILE}
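# using -d 1000000 to override default coverage cap
# NOTE: the command is only echoed here (dry run); drop the echo and the quotes to run it.
# As written, $? below is the exit status of echo, so depth.success is always created.
echo "time samtools depth -d 1000000 -q20  -Q10 -f ${INPUT_DIRECTORY}/${INPUT_FILE} -r $REGION > ${INPUT_DIRECTORY}/${OUTPUT_FILE}"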
if [ $? -ne 0 ]; 
then
    touch "${INPUT_DIRECTORY}/depth.failure"
else
    touch "${INPUT_DIRECTORY}/depth.success"
fi

echo "Depths obtained ${INPUT_DIRECTORY}/${OUTPUT_FILE}"

148 /hackathon/Hackathon_Project_4/DEPTH/input_all_rg_final.txt
time samtools depth -d 1000000 -q20  -Q10 -f /hackathon/Hackathon_Project_4/DEPTH/input_all_rg_final.txt -r chr22 > /hackathon/Hackathon_Project_4/DEPTH/depths_all_rg_final.txt
Depths obtained /hackathon/Hackathon_Project_4/DEPTH/depths_all_rg_final.txt
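
Because ``-q`` and ``-Q`` are easy to mix up, a small helper that names the thresholds explicitly may help. This is a sketch only: `depth_args` is a hypothetical function, and its defaults (base quality 10, mapping quality 20) are an assumption about the HaplotypeCaller defaults being matched. It relies on ``-q`` being the base quality threshold and ``-Q`` the mapping quality threshold in ``samtools depth``.

```python
def depth_args(bam_list, region, min_base_quality=10,
               min_mapping_quality=20, max_depth=1000000):
    """Build the argument list for a multi-BAM `samtools depth` call."""
    return [
        "samtools", "depth",
        "-d", str(max_depth),            # raise the per-position depth cap
        "-q", str(min_base_quality),     # -q: minimum base quality
        "-Q", str(min_mapping_quality),  # -Q: minimum mapping quality
        "-f", bam_list,                  # file listing one BAM path per line
        "-r", region,
    ]
```

The returned list can be passed to `subprocess.run` with `stdout` redirected to the output file.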


The output is a position by sample matrix. The columns are ``chromosome``, ``position``, followed by the coverage for each sample in the order of the input file. The following is a small data set with 50 samples generated with ``input50.txt``.

In [None]:
%%bash

# depths file; path as used in the sanity check section (adjust if it lives in DEPTH/test/)
DEPTHS_FILE="/hackathon/Hackathon_Project_4/DEPTH/depths50chr22.txt"
printf "Number of columns: %s\n" "$(awk -F'\t' '{print NF; exit}' ${DEPTHS_FILE})"
head ${DEPTHS_FILE}
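
Since the matrix has no header line, the sample columns have to be labelled from the input list. A minimal sketch, assuming sample names are taken from the BAM file names without the extension; `depth_header` and `parse_depth_line` are hypothetical helpers:

```python
import os


def depth_header(bam_paths):
    """Column names: chromosome, position, then one name per BAM in list order."""
    samples = [os.path.splitext(os.path.basename(p))[0] for p in bam_paths]
    return ["chromosome", "position"] + samples


def parse_depth_line(line, header):
    """Turn one tab-separated matrix row into (chromosome, position, {sample: depth})."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(header):
        raise ValueError(
            "expected {} columns, found {}".format(len(header), len(fields)))
    depths = {name: int(value) for name, value in zip(header[2:], fields[2:])}
    return fields[0], int(fields[1]), depths
```

The column-count check catches the case where the depth file was generated from a different input list than the one used for labelling.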


Running depth on individual bams with parallel
----------------------------------------------



In [18]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/DEPTH/filtered"
INDEXED_BAMPATHS="/hackathon/Hackathon_Project_4/FINAL_BAM/*bam"
DEPTH_COMMANDS_FILE=${OUTPUT_DIRECTORY}/"depth_commands"
CHROMOSOME="chr22"
for bam in $INDEXED_BAMPATHS;
do
    # strip the .bam extension to build the output and log file names
    PREFIX="${OUTPUT_DIRECTORY}/$(basename "$bam" .bam)"
    DEPTH_COMMAND="samtools depth -aa -d 1000000 -q20 -Q10 $bam -r $CHROMOSOME > ${PREFIX}.DEPTH 2> ${PREFIX}.log"
    echo "$DEPTH_COMMAND"
done > $DEPTH_COMMANDS_FILE

wc -l ${DEPTH_COMMANDS_FILE}
head -n 5 ${DEPTH_COMMANDS_FILE}

148 /hackathon/Hackathon_Project_4/DEPTH/filtered/depth_commands
samtools depth -aa -d 1000000 -q20 -Q10 /hackathon/Hackathon_Project_4/FINAL_BAM/ENCFF000CWZ_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam -r chr22 > /hackathon/Hackathon_Project_4/DEPTH/filtered/ENCFF000CWZ_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.DEPTH 2> /hackathon/Hackathon_Project_4/DEPTH/filtered/ENCFF000CWZ_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.log
samtools depth -aa -d 1000000 -q20 -Q10 /hackathon/Hackathon_Project_4/FINAL_BAM/ENCFF000CXF_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam -r chr22 > /hackathon/Hackathon_Project_4/DEPTH/filtered/ENCFF000CXF_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.DEPTH 2> /hackathon/Hackathon_Project_4/DEPTH/filtered/ENCFF000CXF_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.log
samtools depth -aa -d 1000000 -q20 -Q10 /hackathon/Hackathon_Project_4/FINAL_BAM/ENCFF000CXQ_chr22_RNA-seq_rg_recalibrated_fix_header_no_mmap.bam -r chr22 > /hackathon

Now we have a commands file we can use with ``parallel``. If the depth files have already been generated, comment out the line that runs ``parallel`` to avoid re-running it.

In [19]:
%%bash
nproc 
CORENUM=32 # change number of cores here
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/DEPTH/filtered"
DEPTH_COMMANDS_FILE=${OUTPUT_DIRECTORY}/"depth_commands"
echo ${DEPTH_COMMANDS_FILE}
time parallel --gnu -j $CORENUM < ${DEPTH_COMMANDS_FILE}

64
/hackathon/Hackathon_Project_4/DEPTH/filtered/depth_commands



real	11m16.991s
user	90m41.689s
sys	2m0.921s
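
If per-command exit codes are wanted, the same commands file can also be driven from Python. This is an alternative sketch, not what was run above: `run_commands` is a hypothetical helper that runs each line through the shell with a thread pool and reports the exit status of every command.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, jobs=4):
    """Run shell command lines concurrently.

    Returns (command, exit_code) pairs in the same order as the input.
    """
    def run(command):
        # shell=True so that the redirections in each line are honoured
        return command, subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, commands))
```

Reading the commands file line by line and passing the lines to `run_commands` gives a list that can be filtered for non-zero exit codes.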


Sanity checking
----------------

Here are a couple of files generated with individual bams:

``/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth``
``/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth``

These are samples 1 and 10 in the input file.

In [5]:
%%bash
grep -n ENCFF000ARG /hackathon/Hackathon_Project_4/DEPTH/input50.txt
grep -n ENCFF000JAG /hackathon/Hackathon_Project_4/DEPTH/input50.txt

echo "ENCFF000ARG"

# get coverage for sample in single sample depth file
echo "Single sample"
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth  | head
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth  | head | cut -f 2 > /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.test.position

# get coverage for selected sample in multi-sample depth file
echo "Multi-sample"
grep -w -f /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.test.position /hackathon/Hackathon_Project_4/DEPTH/depths50chr22.txt | cut -f 1,2,3
echo 

echo "ENCFF000JAG"

# get coverage for sample in single sample depth file
echo "Single sample"
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth  | head
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth  | head | cut -f 2 > /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.test.position

# get coverage for selected sample in multi-sample depth file
echo "Multi-sample"
grep -w -f /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.test.position /hackathon/Hackathon_Project_4/DEPTH/depths50chr22.txt | cut -f 1,2,12

1:/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000ARG.bam
10:/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000JAG.bam
ENCFF000ARG
Single sample
chr22	16850789	12
chr22	16850790	12
chr22	16850791	12
chr22	16850792	12
chr22	16850793	13
chr22	16850794	13
chr22	16850795	13
chr22	16850796	14
chr22	16850797	16
chr22	16850798	17
Multi-sample
chr22	16850789	12
chr22	16850790	12
chr22	16850791	12
chr22	16850792	12
chr22	16850793	13
chr22	16850794	13
chr22	16850795	13
chr22	16850796	14
chr22	16850797	16
chr22	16850798	17

ENCFF000JAG
Single sample
chr22	16117756	14
chr22	16117757	14
chr22	16117758	14
chr22	16117759	14
chr22	16117760	14
chr22	16117761	14
chr22	16117762	14
chr22	16117763	14
chr22	16117764	14
chr22	16117765	14
Multi-sample
chr22	16117756	14
chr22	16117757	14
chr22	16117758	14
chr22	16117759	14
chr22	16117760	14
chr22	16117761	14
chr22	16117762	14
chr22	16117763	14
chr22	16117764	14
chr22	16117765	14


awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth FNR=153973) fatal: print to "standard output" failed (Broken pipe)
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth FNR=153973) fatal: print to "standard output" failed (Broken pipe)
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth FNR=33467) fatal: print to "standard output" failed (Broken pipe)
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth FNR=33467) fatal: print to "standard output" failed (Broken pipe)


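The coverage in each single-sample file should match the corresponding column of the multi-sample file (column 3 for sample 1, column 12 for sample 10), and it does for the positions shown above.

You might see the following message:

``
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth FNR=153973) fatal: print to "standard output" failed (Broken pipe)
...
``

It is harmless: ``head`` exits after printing 10 lines and closes the pipe while ``awk`` is still writing.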
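
The same comparison can be done over every position rather than the first ten. A minimal sketch, assuming tab-separated lines and treating a position that is absent from one file as depth 0; `compare_depths` is a hypothetical helper and `sample_index` is the 1-based line number of the sample in the input list:

```python
def compare_depths(single_lines, multi_lines, sample_index):
    """Return (chromosome, position, single_depth, multi_depth) for each mismatch."""
    column = sample_index + 1  # 0-based field: chrom=0, pos=1, sample 1=2
    single = {}
    for line in single_lines:
        chrom, pos, depth = line.rstrip("\n").split("\t")[:3]
        single[(chrom, pos)] = int(depth)

    mismatches = []
    seen = set()
    for line in multi_lines:
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1])
        seen.add(key)
        expected = single.get(key, 0)  # absent in single file: assume 0
        observed = int(fields[column])
        if observed != expected:
            mismatches.append((key[0], int(key[1]), expected, observed))

    # Covered positions in the single file that the matrix never mentions
    for (chrom, pos), depth in single.items():
        if (chrom, pos) not in seen and depth != 0:
            mismatches.append((chrom, int(pos), depth, 0))
    return mismatches
```

An empty result means the single-sample file and the selected matrix column agree everywhere.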

In [None]:
%%bash

#samtools depth -f /hackathon/Hackathon_Project_4/DEPTH/input_all_rg_problematic_excluded.txt -q20 -Q10 -d 1000000 -r "chr22" |head 

Generate a file with CHROM, POS, POS+1, sum of coverage
--------------------------------------------------------

In [None]:
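# script to generate a file with CHROM, POS, POS+1, sum of coverage
# Assumes the input columns are CHROM, POS, POS+1, then one coverage column per sample.
# For raw samtools depth output (CHROM, POS, coverage...), line[2] is already the
# first sample's coverage, so the indices below would need to change.
import csv

input_file="/hackathon/Hackathon_Project_4/DEPTH/HUGE.txt"
output_file="/hackathon/Hackathon_Project_4/DEPTH/HUGE_cov.txt"

with open(input_file, 'r') as ifile, open(output_file, 'w') as ofile:
    csvreader = csv.reader(ifile, delimiter="\t")
    # write Unix line endings instead of the csv default "\r\n"
    csvwriter = csv.writer(ofile, delimiter="\t", lineterminator="\n")
    counter = 0
    for line in csvreader:
        # treat empty fields as zero coverage
        cov_sum = sum(int(x) if x else 0 for x in line[3:])
        csvwriter.writerow([line[0], int(line[1]), int(line[2]), cov_sum])
        counter += 1
        if counter % 100000 == 0:
            print("Parsed {} lines...".format(counter))

print("Done")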
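
For input that comes straight from ``samtools depth`` (CHROM, POS, then per-sample coverage, with no POS+1 column), the row transformation would look slightly different. A minimal sketch under that assumption; `coverage_interval` is a hypothetical helper and empty fields are counted as 0:

```python
def coverage_interval(fields):
    """Map [CHROM, POS, cov_1, ..., cov_n] to [CHROM, POS, POS+1, summed coverage]."""
    chrom, pos = fields[0], int(fields[1])
    # Coverage starts at the third column in raw depth output
    total = sum(int(value) for value in fields[2:] if value)
    return [chrom, pos, pos + 1, total]
```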