SAMTOOLS MPILEUP ON MULTIPLE BAMS
=================================

In [1]:
! samtools --version

samtools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.


See the ``mpileup`` [manual](http://samtools.sourceforge.net/mpileup.shtml).
``samtools mpileup`` requires:
- an indexed reference FASTA
- indexed BAM files
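
Both requirements can be checked before any commands are generated. The following is a minimal sketch, not part of the original workflow: `check_indexes` is a hypothetical helper name, and it assumes the indexes sit next to their inputs as `<reference>.fai` (written by `samtools faidx`) and `<file>.bam.bai` (written by `samtools index`), which matches the file naming used in this notebook.

```bash
# Hypothetical helper: report inputs whose index file is missing.
# Usage: check_indexes REFERENCE BAM [BAM ...]
# Returns 0 when every index is present, 1 otherwise.
check_indexes() {
    local reference="$1"
    shift
    local missing=0
    # Assumption: the FASTA index lives next to the reference as <reference>.fai
    if [ ! -e "${reference}.fai" ]; then
        echo "missing FASTA index: run 'samtools faidx ${reference}'"
        missing=1
    fi
    local bam
    for bam in "$@"; do
        # Assumption: the BAM index is named <file>.bam.bai
        if [ ! -e "${bam}.bai" ]; then
            echo "missing BAM index: run 'samtools index ${bam}'"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as `check_indexes "$REFERENCE" /path/to/*.bam`, with the paths adjusted to the local layout.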

The input BAM files used in this example are aligned to hg19.

We want to create a commands file so that the ``mpileup`` runs can be parallelized. Chromosome 22 is used as the example here. The commands use the default ``mpileup`` filtering parameters; for example, per-file depth is capped at 8000. That cap may be too low for RNA-seq, where highly expressed genes can exceed it, so it may need to be raised. We can review and refine that later.
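
If the depth cap does need to be raised, `samtools mpileup` accepts `-d` to set the maximum per-file depth. The sketch below shows how a generated command might carry that option. `build_mpileup_command` is a hypothetical helper, and the value `100000` in the usage line is an arbitrary placeholder, not a recommendation.

```bash
# Hypothetical helper: print one mpileup command with an explicit depth cap.
# Usage: build_mpileup_command REFERENCE BAM REGION OUTDIR MAX_DEPTH
build_mpileup_command() {
    local reference="$1" bam="$2" region="$3" outdir="$4" max_depth="$5"
    local sample
    # Derive the output name from the BAM file name, e.g. sample.bam -> sample.mpileup
    sample=$(basename "$bam" .bam)
    echo "samtools mpileup -d $max_depth -f $reference $bam -r $region > ${outdir}/${sample}.mpileup"
}

# Example (prints the command, does not run it):
# build_mpileup_command hg19.fa sample.bam chr22 /tmp/out 100000
```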



In [2]:
%%bash
REFERENCE="/hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa"
INDEXED_BAMPATHS="/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/*bai" # TODO: change this to bam when everything is indexed
CHROMOSOME="chr22" # change this to any chromosome you like
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP"
MPILEUP_COMMANDS_FILE=${OUTPUT_DIRECTORY}/"mpileup_commands"


# Iterate over the index files directly; parsing the output of ls is fragile.
for bai in $INDEXED_BAMPATHS
do
    bam="${bai%.bai}" # strip only the trailing .bai to recover the BAM path
    MPILEUP_COMMAND="samtools mpileup -f $REFERENCE $bam -r $CHROMOSOME > ${OUTPUT_DIRECTORY}/$(basename "$bai" .bam.bai).mpileup"
    echo "$MPILEUP_COMMAND"
done > "$MPILEUP_COMMANDS_FILE"

head ${MPILEUP_COMMANDS_FILE}

samtools mpileup -f /hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa /hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000ARG.bam -r chr22 > /hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG.mpileup
samtools mpileup -f /hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa /hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000ARG_chr22.bam -r chr22 > /hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG_chr22.mpileup
samtools mpileup -f /hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa /hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000ARG_chr22_rg.bam -r chr22 > /hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG_chr22_rg.mpileup
samtools mpileup -f /hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa /hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/ENCFF000ARI.bam -r chr22 > /hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARI.mpileup
samtools mpileup -f /hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa /h

Now we have a commands file we can use with ``parallel``.

In [3]:
%%bash
nproc
CORENUM=2 # change number of cores here
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP"
MPILEUP_COMMANDS_FILE=${OUTPUT_DIRECTORY}/"mpileup_commands"
echo ${MPILEUP_COMMANDS_FILE}
cat ${MPILEUP_COMMANDS_FILE} |parallel --gnu -j $CORENUM

64
/hackathon/Hackathon_Project_4/MPILEUP/mpileup_commands


[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000


The mpileup files should now be in the output directory.

In [4]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP"
ls ${OUTPUT_DIRECTORY}/*mpileup

/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG.benchmarked.mpileup
/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG_chr22.mpileup
/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG_chr22_rg.mpileup
/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG.mpileup
/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARI.benchmarked.mpileup
/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARI_chr22.mpileup
/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARI_chr22_rg.mpileup
/hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARI.mpileup
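
A listing shows which files exist, but not whether each command actually produced output. As a rough sketch, the commands file itself can be used as the list of expected outputs. `report_bad_outputs` is a hypothetical helper, and it assumes every line ends with `> OUTPUT_PATH`, as the generated commands above do.

```bash
# Hypothetical helper: print every expected output that is missing or empty.
# Usage: report_bad_outputs COMMANDS_FILE
report_bad_outputs() {
    local commands_file="$1" line out
    while IFS= read -r line; do
        # Keep only the text after the last "> " on the line (the output path)
        out="${line##*> }"
        # -s is true only for files that exist and have a size greater than zero
        [ -s "$out" ] || echo "$out"
    done < "$commands_file"
}
```

Calling `report_bad_outputs "$MPILEUP_COMMANDS_FILE"` prints nothing when all outputs are present and non-empty.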


Spot-check a few high-confidence variants from the benchmark VCF:

In [5]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP"
BENCHMARK_VCF="/hackathon/Hackathon_Project_4/BENCHMARK/Benchmark-Test-V1.vcf"

# Match on POS (column 2) only, because every mpileup file here is restricted to one chromosome.
# For pileups that span several chromosomes, match on both CHROM and POS from the VCF instead.
grep -f <(cut -f 2 ${BENCHMARK_VCF}) ${OUTPUT_DIRECTORY}/ENCFF000ARG.mpileup > ${OUTPUT_DIRECTORY}/ENCFF000ARG.benchmarked.mpileup
grep -f <(cut -f 2 ${BENCHMARK_VCF}) ${OUTPUT_DIRECTORY}/ENCFF000ARI.mpileup > ${OUTPUT_DIRECTORY}/ENCFF000ARI.benchmarked.mpileup
wc -l ${OUTPUT_DIRECTORY}/ENCFF000ARG.benchmarked.mpileup
wc -l ${OUTPUT_DIRECTORY}/ENCFF000ARI.benchmarked.mpileup
wc -l ${BENCHMARK_VCF}

25 /hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARG.benchmarked.mpileup
25 /hackathon/Hackathon_Project_4/MPILEUP/ENCFF000ARI.benchmarked.mpileup
25 /hackathon/Hackathon_Project_4/BENCHMARK/Benchmark-Test-V1.vcf
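
The comment in the cell above suggests matching on CHROM and POS when more than one chromosome is involved. Matching on exact fields also avoids two weaknesses of `grep -f` on positions alone: a position pattern can match as a substring of another number, and VCF header lines end up in the pattern list. The following is a minimal sketch using `awk`. `filter_mpileup_by_vcf` is a hypothetical helper, and it assumes the VCF and the mpileup file use the same chromosome naming (for example `chr22` in both).

```bash
# Hypothetical helper: keep mpileup lines whose CHROM and POS both appear in a VCF.
# Usage: filter_mpileup_by_vcf BENCHMARK_VCF MPILEUP_FILE
filter_mpileup_by_vcf() {
    local vcf="$1" mpileup="$2"
    awk -F '\t' '
        # First file (the VCF): remember CHROM/POS of every non-header record
        FILENAME == ARGV[1] {
            if ($0 !~ /^#/) wanted[$1 FS $2] = 1
            next
        }
        # Second file (the mpileup): print lines whose first two columns were seen
        ($1 FS $2) in wanted
    ' "$vcf" "$mpileup"
}
```

For example, `filter_mpileup_by_vcf "$BENCHMARK_VCF" sample.mpileup > sample.benchmarked.mpileup`, where `sample.mpileup` stands for any of the generated files.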


In [6]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP"
head ${OUTPUT_DIRECTORY}/ENCFF000ARG.benchmarked.mpileup
echo
head ${OUTPUT_DIRECTORY}/ENCFF000ARI.benchmarked.mpileup


chr22	18268130	C	13	.T....T..TT..	4<<<79<<<<<<<
chr22	19132325	A	14	...G...CG.GGG.	69<<70;1<2<<.<
chr22	24585835	T	7	,c,cccc	<</<<..
chr22	26727129	C	13	TT.T.,...T.T^0.	<<<<<9<<<<<<;
chr22	30787190	t	11	.CC.......C	:<<55<<<<<<
chr22	33396853	C	11	..TT.TT.,TT	260:<<<7<<<
chr22	33396904	C	2	t,	<8
chr22	33397045	T	9	,$cccccc,c	<<<<<<<<<
chr22	33414359	G	11	AAAAAAA,A.A	9994<<7<<<<
chr22	33445237	A	12	.C.C...CCC.C	9<<<;;<<<7<<

chr22	18268130	C	19	T...t..T.TT.....T.^g.	4:=8B9B<7<=<B4A<<;<
chr22	19132325	A	18	.....G...GGGG.GG.^b.	8=B@A<;<B8CAC5BCB>
chr22	24585835	T	16	,CCCC,,c,,,,,,cC	<17<<BA<><AB;B;<
chr22	26727129	C	21	.T.,TT.T....,,t...TTT	A@=?8364<<<6B1CA@A8<B
chr22	30787190	t	14	CC.CcC.CC...CC	8B8BBB5AB<A<01
chr22	33396853	C	13	T..Tt.t.TTTT.	82<5<<<<<=>5<
chr22	33396904	C	15	tt,,.,,,.,t,,,^f,	<A<<89<9B<<<<14
chr22	33397045	T	7	cc,,c,,	<BB<<B7
chr22	33414359	G	12	,A.A...A.,,A	<;<<<<B<<@<<
chr22	33445237	A	19	CCC.CCC..cC..CC...,	<.@3<=;A1C@;<5CCCC7
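
To make the spot check easier to read, the read-base column (column 5) can be summarized as counts of reference and non-reference bases. In the pileup format, `.` and `,` mark a match to the reference on the forward and reverse strand, letters mark mismatches, `^` followed by one character marks the start of a read, and `$` marks the end of a read. The sketch below is a simplification: `count_ref_alt` is a hypothetical helper, and it assumes the lines contain no indel notation (`+`/`-` followed by a length and sequence), which holds for the lines shown above but not for pileups in general.

```bash
# Hypothetical helper: summarize each pileup line as
#   CHROM POS REF DEPTH REF_MATCHES NON_REF_BASES
# Usage: count_ref_alt FILE.mpileup   (or pipe pileup lines on stdin)
count_ref_alt() {
    awk -F '\t' '{
        bases = $5
        gsub(/\^./, "", bases)                 # drop read-start marker plus its mapping-quality character
        gsub(/\$/, "", bases)                  # drop read-end markers
        ref = gsub(/[.,]/, "", bases)          # gsub returns the number of matches removed
        alt = gsub(/[ACGTNacgtn]/, "", bases)  # mismatching bases on either strand
        print $1, $2, $3, $4, ref, alt
    }' "$@"
}
```

For the first ENCFF000ARG line above (`chr22 18268130 C 13 .T....T..TT..`), this reports 9 reference matches and 4 non-reference bases.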
