SAMTOOLS DEPTH ON MULTIPLE BAMS
===============================

In [1]:
! samtools --version

samtools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.


``samtools depth`` requires:
- indexed BAM files (the index is needed for the ``-r`` region query used below)
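
Since a missing index makes ``depth`` fail on a region query, it can help to check the input list first. The following is a minimal sketch, not part of the original workflow: ``missing_indexes`` is a hypothetical helper, and it assumes the index sits next to the BAM as ``<name>.bam.bai``, ``<name>.bai`` or ``<name>.bam.csi``.

```shell
# Hypothetical helper: print every BAM in a list file that has no index
# next to it. Assumes one BAM path per line.
missing_indexes() {
    # $1: text file with one BAM path per line
    while IFS= read -r bam; do
        [ -z "$bam" ] && continue            # skip blank lines
        if [ ! -e "${bam}.bai" ] && [ ! -e "${bam%.bam}.bai" ] && [ ! -e "${bam}.csi" ]; then
            printf '%s\n' "$bam"
        fi
    done < "$1"
}

# Toy demonstration with empty placeholder files (not real BAMs)
demo_dir=$(mktemp -d)
touch "$demo_dir/a.bam" "$demo_dir/a.bam.bai" "$demo_dir/b.bam"
printf '%s\n' "$demo_dir/a.bam" "$demo_dir/b.bam" > "$demo_dir/list.txt"
missing_indexes "$demo_dir/list.txt"         # lists only b.bam
```

Each path reported this way could then be passed to ``samtools index`` before running ``depth``.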

I'm using the default filters in the next section.
See the ``depth`` section of the [manual](http://www.htslib.org/doc/samtools.html) for more info.

Create a list of input file paths
----------------------------------
Example:

In [13]:
%%bash
INPUT_DIRECTORY="/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams"
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/DEPTH"
OUTPUT_FILE="input_all_rg.txt"

ls ${INPUT_DIRECTORY}/*bam > ${OUTPUT_DIRECTORY}/${OUTPUT_FILE}
head ${OUTPUT_DIRECTORY}/${OUTPUT_FILE}

/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000ARG_chr22_ChIP-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000ARI_chr22_ChIP-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000ARL_chr22_ChIP-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000ARN_chr22_ChIP-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000IYN_chr22_RNA-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000IYW_chr22_RNA-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000IZE_chr22_RNA-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000IZF_chr22_RNA-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000JAD_chr22_RNA-seq_rg.bam
/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000J

Generating depths
-----------------

This step takes a while if the number of input files is large. The ``-@`` option works for ``sort`` and ``view``, but it does not seem to work for ``depth`` in v1.3.1 (I posted a comment [here](https://github.com/samtools/samtools/issues/348)). An alternative is to run ``depth`` on each sample in parallel. The challenge is then to aggregate the positions from the individual samples into one matrix, since by default each per-sample output lists only the positions covered in that sample.
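
The aggregation step mentioned above could look roughly like the sketch below. It assumes per-sample files in the three-column ``samtools depth`` format (chromosome, position, depth) and treats a position that is missing from a sample as depth 0. ``merge_depths`` and the toy files are hypothetical, and the whole table is held in memory, so this is only meant for small regions.

```shell
# Hypothetical helper: merge per-sample depth files (chrom, pos, depth)
# into one position-by-sample matrix. Columns follow the argument order.
# Limitation: an empty input file gets no column, because it has no first line.
merge_depths() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { nfiles++ }                  # a new input file starts
        {
            key = $1 OFS $2                    # chromosome + position
            seen[key] = 1
            depth[key, nfiles] = $3
        }
        END {
            for (key in seen) {
                line = key
                for (i = 1; i <= nfiles; i++) {
                    d = ((key, i) in depth) ? depth[key, i] : 0
                    line = line OFS d
                }
                print line
            }
        }
    ' "$@" | sort -k1,1 -k2,2n                 # restore positional order
}

# Toy demonstration with two made-up samples
demo_dir=$(mktemp -d)
printf 'chr22\t10\t5\nchr22\t11\t6\n' > "$demo_dir/sampleA.depth"
printf 'chr22\t11\t2\nchr22\t12\t3\n' > "$demo_dir/sampleB.depth"
merge_depths "$demo_dir/sampleA.depth" "$demo_dir/sampleB.depth"
```

The per-sample files would need to be produced with the same ``-d`` and filter settings for the merged matrix to be comparable with a single multi-BAM run.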

In [15]:
%%bash
INPUT_DIRECTORY="/hackathon/Hackathon_Project_4/DEPTH"
INPUT_FILE="input_all_rg.txt"
OUTPUT_FILE="depths_all_rg.txt"
REGION="chr22"
wc -l ${INPUT_DIRECTORY}/${INPUT_FILE}
# using -d 1000000 to override default coverage cap
samtools depth -d 1000000 -f ${INPUT_DIRECTORY}/${INPUT_FILE} -r $REGION > ${INPUT_DIRECTORY}/${OUTPUT_FILE}
echo "Depths obtained ${INPUT_DIRECTORY}/${OUTPUT_FILE}"

224 /hackathon/Hackathon_Project_4/DEPTH/input_all_rg.txt
Depths obtained /hackathon/Hackathon_Project_4/DEPTH/depths_all_rg.txt


[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
samtools depth: can't load index for "/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/ENCFF000VVL_chr22_ChIP-seq_rg.bam"


The output is a position-by-sample matrix. The columns are ``chromosome`` and ``position``, followed by the coverage for each sample in the order listed in the input file. The following is a small data set with 50 samples, generated with ``input50.txt``.
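
Because the matrix has no header row, the sample order has to be recovered from the input list. As a sketch under that assumption, a header line can be derived from the same list that was passed to ``-f``. ``depth_header`` is a hypothetical helper, and the column names ``chromosome`` and ``position`` are simply the labels used in this notebook.

```shell
# Hypothetical helper: build a tab-separated header for the depth matrix
# from the list of BAM paths (one per line) that was given to `-f`.
depth_header() {
    # $1: list file with one BAM path per line
    awk -v OFS='\t' '
        BEGIN { header = "chromosome" OFS "position" }
        NF {                                   # skip blank lines
            n = split($0, parts, "/")          # keep the file name only
            name = parts[n]
            sub(/\.bam$/, "", name)            # drop the extension
            header = header OFS name
        }
        END { print header }
    ' "$1"
}

# Toy demonstration with made-up paths
demo_list=$(mktemp)
printf '%s\n' /data/sampleA.bam /data/sampleB.bam > "$demo_list"
depth_header "$demo_list"
```

The header could be written to a new file, followed by the depth matrix, so that downstream tools can select columns by sample name.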

In [4]:
%%bash

printf "Number of columns: $(awk -F'\t' '{print NF; exit}' /hackathon/Hackathon_Project_4/DEPTH/depths50chr22.txt)\n"
head /hackathon/Hackathon_Project_4/DEPTH/depths50chr22.txt


Number of columns: 52
chr22	16050002	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
chr22	16050003	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
chr22	16050004	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
chr22	16050005	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
chr22	16050006	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
chr22	16050007	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0
chr22	16050008	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0
chr22	16050009	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0
chr22	16050010	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Sanity checking
----------------

Here are a couple of files generated from individual BAMs:

``/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth``
``/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth``

In [5]:
%%bash
grep -n ENCFF000ARG /hackathon/Hackathon_Project_4/DEPTH/input50.txt
grep -n ENCFF000JAG /hackathon/Hackathon_Project_4/DEPTH/input50.txt

echo "ENCFF000ARG"

# get coverage for sample in single sample depth file
echo "Single sample"
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth  | head
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth  | head | cut -f 2 > /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.test.position

# get coverage for selected sample in multi-sample depth file
# (ENCFF000ARG is line 1 of input50.txt, so its coverage is column 1 + 2 = 3)
echo "Multi-sample"
grep -w -f /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.test.position /hackathon/Hackathon_Project_4/DEPTH/depths50chr22.txt | cut -f 1,2,3
echo 

echo "ENCFF000JAG"

# get coverage for sample in single sample depth file
echo "Single sample"
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth  | head
awk -F "\t" '$3>10' /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth  | head | cut -f 2 > /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.test.position

# get coverage for selected sample in multi-sample depth file
# (ENCFF000JAG is line 10 of input50.txt, so its coverage is column 10 + 2 = 12)
echo "Multi-sample"
grep -w -f /hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.test.position /hackathon/Hackathon_Project_4/DEPTH/depths50chr22.txt | cut -f 1,2,12

1:/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000ARG.bam
10:/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000JAG.bam
ENCFF000ARG
Single sample
chr22	16850789	12
chr22	16850790	12
chr22	16850791	12
chr22	16850792	12
chr22	16850793	13
chr22	16850794	13
chr22	16850795	13
chr22	16850796	14
chr22	16850797	16
chr22	16850798	17
Multi-sample
chr22	16850789	12
chr22	16850790	12
chr22	16850791	12
chr22	16850792	12
chr22	16850793	13
chr22	16850794	13
chr22	16850795	13
chr22	16850796	14
chr22	16850797	16
chr22	16850798	17

ENCFF000JAG
Single sample
chr22	16117756	14
chr22	16117757	14
chr22	16117758	14
chr22	16117759	14
chr22	16117760	14
chr22	16117761	14
chr22	16117762	14
chr22	16117763	14
chr22	16117764	14
chr22	16117765	14
Multi-sample
chr22	16117756	14
chr22	16117757	14
chr22	16117758	14
chr22	16117759	14
chr22	16117760	14
chr22	16117761	14
chr22	16117762	14
chr22	16117763	14
chr22	16117764	14
chr22	16117765	14


awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth FNR=153973) fatal: print to "standard output" failed (Broken pipe)
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth FNR=153973) fatal: print to "standard output" failed (Broken pipe)
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth FNR=33467) fatal: print to "standard output" failed (Broken pipe)
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000JAG.chr22.depth FNR=33467) fatal: print to "standard output" failed (Broken pipe)


The coverage in each single-sample file should match the corresponding column of the multi-sample file at the same positions.
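
The spot check above looks at ten positions per sample. A fuller comparison could be sketched as follows, assuming the sample's column in the matrix is its line number in the input list plus 2, and that a position absent from the single-sample file means depth 0. ``count_depth_mismatches`` is a hypothetical helper, and it only checks positions that appear in the matrix.

```shell
# Hypothetical helper: count matrix rows whose value for one sample differs
# from that sample's single-sample depth file.
# Limitation: the single-sample file must not be empty (NR == FNR idiom).
count_depth_mismatches() {
    # $1: single-sample depth file (chrom, pos, depth)
    # $2: multi-sample depth matrix
    # $3: line number of the sample in the input list (1-based)
    awk -F'\t' -v col="$(( $3 + 2 ))" '
        NR == FNR { single[$1 FS $2] = $3; next }   # first file: remember depths
        {
            key = $1 FS $2
            expected = (key in single) ? single[key] : 0
            if (($col + 0) != (expected + 0)) mismatches++
        }
        END { print mismatches + 0 }
    ' "$1" "$2"
}

# Toy demonstration: the sample is line 2 of the list, so column 4 of the matrix
demo_dir=$(mktemp -d)
printf 'chr22\t11\t2\nchr22\t12\t3\n' > "$demo_dir/single.depth"
printf 'chr22\t10\t5\t0\nchr22\t11\t6\t2\nchr22\t12\t0\t3\n' > "$demo_dir/matrix.txt"
count_depth_mismatches "$demo_dir/single.depth" "$demo_dir/matrix.txt" 2
```

A result of 0 would indicate that every row of the matrix agrees with the single-sample file for that sample.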

You might see the following message, which appears because ``head`` exits after ten lines while ``awk`` is still writing to the pipe:

``
awk: cmd. line:1: (FILENAME=/hackathon/Hackathon_Project_4/DEPTH/single/ENCFF000ARG.chr22.depth FNR=153973) fatal: print to "standard output" failed (Broken pipe)
...
``
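
One way to avoid this message is to let ``awk`` stop on its own after the required number of rows instead of piping into ``head``. This is a minimal sketch on toy data. ``first_covered`` is a hypothetical helper, not something used in the cells above.

```shell
# Hypothetical helper: print the first N rows of a depth file whose depth
# exceeds a threshold, without a downstream `head` closing the pipe early.
first_covered() {
    # $1: depth file (chrom, pos, depth)
    # $2: depth threshold (rows with depth > threshold are kept)
    # $3: number of rows to print
    awk -F'\t' -v min="$2" -v max_rows="$3" '
        $3 > min { print; if (++printed >= max_rows) exit }
    ' "$1"
}

# Toy demonstration: first two positions with depth above 10
demo_file=$(mktemp)
printf 'chr22\t1\t5\nchr22\t2\t12\nchr22\t3\t14\nchr22\t4\t20\n' > "$demo_file"
first_covered "$demo_file" 10 2
```

Writing this output to a file once and reusing it for both the display and the position list would also avoid scanning the depth file twice.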