SAMTOOLS MPILEUP ON MULTIPLE BAMS
=================================

In [None]:
! samtools --version

See the ``mpileup`` [manual](http://samtools.sourceforge.net/mpileup.shtml).
``samtools mpileup`` requires:
- an indexed reference FASTA (``samtools faidx``)
- indexed BAM files (``samtools index``)

The input BAM files used in this example were aligned to hg19.

We want to create a commands file for parallelization, with one ``samtools mpileup`` command per BAM file. We are using chromosome 22 as an example here.
Parameters similar to GATK's default read filters are applied:
- maximum depth 1000000 (``-d``)
- minimum mapping quality 20 (``-q``; [GATK haplotype caller](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php) default)
- minimum base quality 10 (``-Q``; [GATK haplotype caller](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php) default)
- exclude reads that are unmapped, not the primary alignment, fail platform/vendor quality checks, or are PCR or optical duplicates (``--ff 1796``; [GATK haplotype caller](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php) default)
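
The value 1796 passed to ``--ff`` is the sum of the four SAM flag bits named in the last bullet. A minimal sketch of that arithmetic follows; the bit values come from the SAM specification, while the names ``EXCLUDED_FLAGS``, ``combined_mask`` and ``is_excluded`` are illustrative only and not part of samtools.

```python
# SAM flag bits for the four read categories excluded above
# (bit values as defined in the SAM specification).
EXCLUDED_FLAGS = {
    "read unmapped": 0x4,
    "not primary alignment": 0x100,
    "fails platform/vendor quality checks": 0x200,
    "PCR or optical duplicate": 0x400,
}


def combined_mask(flags):
    """OR the individual bits together into one filter mask."""
    mask = 0
    for bit in flags.values():
        mask |= bit
    return mask


def is_excluded(read_flag, mask):
    """A read is dropped when any of the masked bits is set in its FLAG."""
    return (read_flag & mask) != 0


FILTER_MASK = combined_mask(EXCLUDED_FLAGS)
print(FILTER_MASK)
```

A properly paired, mapped read with FLAG 99 passes the filter, while the same read marked as a duplicate (FLAG 99 + 1024 = 1123) is excluded.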




In [None]:
%%bash
REFERENCE="/hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa"
INDEXED_BAMPATHS="/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/rg_bams/*bam"
CHROMOSOME="chr22" # change this to any chromosome you like
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP/rg_bams"
MPILEUP_COMMANDS_FILE=${OUTPUT_DIRECTORY}/"mpileup_commands"


# Let the shell expand the glob directly instead of parsing `ls` output.
for bam in $INDEXED_BAMPATHS;
do
    # Strip only the trailing .bam extension to name the output and log files.
    SAMPLE=$(basename "$bam" .bam)
    MPILEUP_COMMAND="samtools mpileup -d 1000000 --ff 1796 -q20 -Q10 -f $REFERENCE $bam -r $CHROMOSOME > ${OUTPUT_DIRECTORY}/${SAMPLE}.mpileup 2> ${OUTPUT_DIRECTORY}/${SAMPLE}.log"
    echo "$MPILEUP_COMMAND"
done > "$MPILEUP_COMMANDS_FILE"

wc -l ${MPILEUP_COMMANDS_FILE}
head -n 5 ${MPILEUP_COMMANDS_FILE}
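
The same command lines could also be assembled in Python. The sketch below is one hedged way to do it, not the method used in this notebook: ``mpileup_command`` is a hypothetical helper, the paths in the example call are made up, and the samtools options are simply the ones from the cell above.

```python
import os
import shlex


def mpileup_command(reference, bam, region, out_dir):
    """Build one shell command line that writes <sample>.mpileup and <sample>.log."""
    sample = os.path.basename(bam)
    if sample.endswith(".bam"):
        sample = sample[: -len(".bam")]  # drop only the trailing extension
    out_path = os.path.join(out_dir, sample + ".mpileup")
    log_path = os.path.join(out_dir, sample + ".log")
    args = [
        "samtools", "mpileup",
        "-d", "1000000",   # maximum depth
        "--ff", "1796",    # excluded flag bits
        "-q", "20",        # minimum mapping quality
        "-Q", "10",        # minimum base quality
        "-f", reference,
        "-r", region,
        bam,
    ]
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    quoted = " ".join(shlex.quote(arg) for arg in args)
    return "{} > {} 2> {}".format(quoted, shlex.quote(out_path), shlex.quote(log_path))


# Hypothetical paths, for illustration only.
print(mpileup_command("/ref/hg19.fa", "/data/sample.bam.sorted.bam", "chr22", "/out"))
```

Removing only the trailing ``.bam`` keeps file names such as ``sample.bam.sorted.bam`` intact, which a global ``sed 's/.bam/.../g'`` substitution would not.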

Now we have a commands file that we can feed to ``parallel``. The outputs have already been generated, so the last line of the next cell only needs to be executed to re-run the pileups; comment it out otherwise.

In [None]:
%%bash
nproc 
CORENUM=32 # change number of cores here
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP/rg_bams"
MPILEUP_COMMANDS_FILE=${OUTPUT_DIRECTORY}/"mpileup_commands"
echo ${MPILEUP_COMMANDS_FILE}
time cat ${MPILEUP_COMMANDS_FILE} |parallel --gnu -j $CORENUM

The mpileup files should now be in the output directory. It took about an hour to run almost 500 samples with 32 cores.
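
For readers without GNU ``parallel``, a commands file of this kind can be run with a bounded number of concurrent jobs from Python. This is a rough sketch under the assumption that each line is an independent shell command; ``read_commands``, ``run_commands`` and ``failed_commands`` are hypothetical helper names, and the sketch is not a replacement for the ``parallel`` call used here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(path):
    """Return the non-empty lines of a commands file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def run_command(command):
    """Run one shell command line and return its exit status."""
    return subprocess.run(command, shell=True).returncode


def run_commands(commands, jobs):
    """Run the commands with at most `jobs` running at once.

    Threads are sufficient because each worker only waits on a subprocess.
    The returned exit statuses are in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_command, commands))


def failed_commands(commands, statuses):
    """List the commands whose exit status was non-zero."""
    return [cmd for cmd, status in zip(commands, statuses) if status != 0]
```

Checking ``failed_commands(commands, run_commands(commands, jobs))`` afterwards gives a quick list of samples to inspect in their ``.log`` files.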

In [None]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP/rg_bams"
ls ${OUTPUT_DIRECTORY}/*mpileup|wc -l

Sanity Checks
-------------
Spot-check a few high-confidence variants.
The two BAMs used for initial testing, ENCFF000ARG and ENCFF000ARI, are not among the 479 RG BAMs, so the following could not be reproduced with the RG BAMs and uses the pileups in the parent ``MPILEUP`` directory instead.

In [None]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP"
BENCHMARK_VCF="/hackathon/Hackathon_Project_4/BENCHMARK/Benchmark-Test-V1.vcf"

# Match on POS (column 2) only, because we predefined a specific chromosome;
# compare CHROM and POS together when the pileup spans several chromosomes.
# awk compares the position column exactly, whereas `grep -f` would also match
# the same digits anywhere else in a line (e.g. 1234 inside 12345).
awk -F'\t' 'NR==FNR { if ($0 !~ /^#/) pos[$2]; next } $2 in pos' ${BENCHMARK_VCF} ${OUTPUT_DIRECTORY}/ENCFF000ARG.mpileup > ${OUTPUT_DIRECTORY}/ENCFF000ARG.benchmarked.mpileup
awk -F'\t' 'NR==FNR { if ($0 !~ /^#/) pos[$2]; next } $2 in pos' ${BENCHMARK_VCF} ${OUTPUT_DIRECTORY}/ENCFF000ARI.mpileup > ${OUTPUT_DIRECTORY}/ENCFF000ARI.benchmarked.mpileup
wc -l ${OUTPUT_DIRECTORY}/ENCFF000ARG.benchmarked.mpileup
wc -l ${OUTPUT_DIRECTORY}/ENCFF000ARI.benchmarked.mpileup
wc -l ${BENCHMARK_VCF}
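
The selection step keeps pileup rows whose position appears in the benchmark VCF. A minimal Python sketch of the same idea, comparing the position column exactly rather than matching text anywhere in the line, is shown here; ``benchmark_positions`` and ``filter_pileup`` are hypothetical names, and like the cell above the sketch assumes a single chromosome.

```python
def benchmark_positions(vcf_lines):
    """Collect the POS column (as text) from VCF records, skipping header lines."""
    positions = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        positions.add(line.split("\t")[1])
    return positions


def filter_pileup(pileup_lines, positions):
    """Keep pileup rows whose second column is exactly one of the positions."""
    return [line for line in pileup_lines if line.split("\t")[1] in positions]


# Toy data, for illustration only.
vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr22\t1234\t.\tA\tG\n",
]
pileup = [
    "chr22\t1234\tA\t3\t...\tIII\n",
    "chr22\t12345\tC\t2\t..\tII\n",   # contains "1234" as a substring of POS
    "chr22\t500\tG\t1234\t.\tI\n",    # contains "1234" in the depth column
]
print(filter_pileup(pileup, benchmark_positions(vcf)))
```

Only the first toy row is kept; a plain substring match would have kept all three.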

In [None]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP"
head ${OUTPUT_DIRECTORY}/ENCFF000ARG.benchmarked.mpileup
echo
head ${OUTPUT_DIRECTORY}/ENCFF000ARI.benchmarked.mpileup


SAMTOOLS MPILEUP WITH A SINGLE MERGED BAM
=========================================

The merged BAM ``/hackathon/Hackathon_Project_4/VariantCall_HAPLOTYPE/merged_rg.chr22.bam`` was built from ENCFF000ARG and ENCFF000ARI. Its reads are tagged with read groups, so the source file of each read is known.

In [None]:
%%bash

REFERENCE="/hackathon/Hackathon_Project_4/REFERENCE_GENOME/hg19.fa"
bam="/hackathon/Hackathon_Project_4/VariantCall_HAPLOTYPE/merged_rg.chr22.bam"
CHROMOSOME="chr22" # change this to any chromosome you like
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP/merged"

# No read filters are passed here, unlike the per-sample commands above.
samtools mpileup -f "$REFERENCE" "$bam" -r "$CHROMOSOME" > "${OUTPUT_DIRECTORY}/$(basename "$bam" .bam).mpileup"

Next we pull out the subset of positions from the benchmark data set.

In [None]:
%%bash
OUTPUT_DIRECTORY="/hackathon/Hackathon_Project_4/MPILEUP/merged"
BENCHMARK_VCF="/hackathon/Hackathon_Project_4/BENCHMARK/Benchmark-Test-V1.vcf"

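# Compare the position column exactly; `grep -f` would also match the same digits elsewhere in a line.
awk -F'\t' 'NR==FNR { if ($0 !~ /^#/) pos[$2]; next } $2 in pos' ${BENCHMARK_VCF} ${OUTPUT_DIRECTORY}/merged_rg.chr22.mpileup > ${OUTPUT_DIRECTORY}/merged_rg.chr22.benchmarked.mpileup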


COMPARING COVERAGE BETWEEN THE TWO APPROACHES
=============================================

In [None]:
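import csv
import os
import pandas

out_dir = "/hackathon/Hackathon_Project_4/MPILEUP"
ENCFF000ARG = os.path.join(out_dir, "ENCFF000ARG.benchmarked.mpileup")
ENCFF000ARI = os.path.join(out_dir, "ENCFF000ARI.benchmarked.mpileup")
merged = os.path.join(out_dir, "merged", "merged_rg.chr22.benchmarked.mpileup")

# mpileup columns: 0 chromosome, 1 position, 2 reference base, 3 depth, 4 read bases, 5 base qualities.
# Base-quality strings can contain quote characters, so quote handling is disabled.
read_opts = dict(header=None, quoting=csv.QUOTE_NONE)
ENCFF000ARG_data = pandas.read_table(ENCFF000ARG, **read_opts)
ENCFF000ARI_data = pandas.read_table(ENCFF000ARI, **read_opts)
merged_data = pandas.read_table(merged, **read_opts)

# All three tables must list the same positions in the same order.
assert all(ENCFF000ARG_data[1]==ENCFF000ARI_data[1])
assert all(ENCFF000ARI_data[1]==merged_data[1])

# Per-sample depths summed, side by side with the depth from the merged BAM.
pandas.DataFrame({"sum_of_cov": ENCFF000ARG_data[3]+ENCFF000ARI_data[3], "merged": merged_data[3]})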
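
The comparison above checks whether the per-sample depths add up to the depth reported for the merged BAM. A dependency-free sketch of that check on toy pileup lines is given below; ``depth_by_position`` and ``compare_depths`` are hypothetical helpers, and the sketch assumes the single-sample mpileup layout with depth in the fourth column.

```python
def depth_by_position(pileup_lines):
    """Map (chromosome, position) to the depth column of single-sample mpileup lines."""
    depths = {}
    for line in pileup_lines:
        fields = line.rstrip("\n").split("\t")
        depths[(fields[0], int(fields[1]))] = int(fields[3])
    return depths


def compare_depths(sample_a, sample_b, merged):
    """For each merged position, report summed depth, merged depth, and whether they agree.

    A position missing from a per-sample pileup counts as depth 0.
    """
    rows = []
    for key in sorted(merged):
        summed = sample_a.get(key, 0) + sample_b.get(key, 0)
        rows.append((key[0], key[1], summed, merged[key], summed == merged[key]))
    return rows


# Toy pileup lines, for illustration only.
a = depth_by_position(["chr22\t100\tA\t3\t...\tIII\n", "chr22\t200\tC\t1\t.\tI\n"])
b = depth_by_position(["chr22\t100\tA\t2\t..\tII\n"])
m = depth_by_position(["chr22\t100\tA\t5\t.....\tIIIII\n", "chr22\t200\tC\t2\t..\tII\n"])
for row in compare_depths(a, b, m):
    print(row)
```

A row whose last field is ``False`` would suggest that the per-sample and merged runs did not see the same reads, for example because they were run with different filters.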



SUM OF COVERAGE
===============

In [None]:
import os

# Write one expected mpileup path per sample identifier.
identifiers="/hackathon/Hackathon_Project_4/ENCODE_DATA_GM12878/COMPLETED/filesToUse.txt"
output_file="/hackathon/Hackathon_Project_4/MPILEUP/samples.txt"
with open(identifiers, 'r') as ifile, open(output_file, 'w') as ofile:
    for row in ifile:
        # Each row holds a sample identifier; append the .mpileup extension.
        orow = os.path.join("/hackathon/Hackathon_Project_4/MPILEUP", row.strip() + ".mpileup" ) + "\n"
        ofile.write(orow)

Generate a file with CHROM, POS, POS+1, sum of coverage
-------------------------------------------------------
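
Each input row is assumed to hold CHROM, POS and POS+1 followed by one depth column per sample, with empty cells where a sample has no coverage; that layout is inferred from the script in the next cell and is not documented elsewhere. A small sketch of the per-row transformation, using the hypothetical helper ``sum_coverage`` and a made-up row:

```python
def sum_coverage(row):
    """Collapse the per-sample depth columns of one row into a single total.

    row[0:3] are CHROM, POS, POS+1; every later column is one sample's depth,
    and an empty string is treated as zero coverage.
    """
    depths = (int(value) if value else 0 for value in row[3:])
    return [row[0], int(row[1]), int(row[2]), sum(depths)]


# Made-up row: three samples, the second one with no coverage.
example = ["chr22", "16050000", "16050001", "3", "", "5"]
print(sum_coverage(example))
```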

In [6]:
# Script to generate a file with CHROM, POS, POS+1, sum of coverage.
import csv

input_file="/hackathon/Hackathon_Project_4/DEPTH/HUGE.txt"
output_file="/hackathon/Hackathon_Project_4/DEPTH/HUGE_cov.txt"

# newline='' lets the csv module handle line endings itself.
with open(input_file, 'r', newline='') as ifile, open(output_file, 'w', newline='') as ofile:
    csvreader = csv.reader(ifile, delimiter="\t")
    # Write plain "\n" line endings instead of the csv default "\r\n".
    csvwriter = csv.writer(ofile, delimiter="\t", lineterminator="\n")
    counter = 0
    for line in csvreader:
        # Columns 3 onward hold per-sample depths; empty cells count as 0.
        cov_sum = sum(int(x) if x else 0 for x in line[3:])
        csvwriter.writerow([line[0], int(line[1]), int(line[2]), cov_sum])
        counter += 1
        if counter % 100000 == 0:
            print("Parsed {} lines...".format(counter))

print("Done")

Parsed 100000 lines...
Parsed 200000 lines...
Parsed 300000 lines...
Parsed 400000 lines...
Parsed 500000 lines...
Parsed 600000 lines...
Parsed 700000 lines...
Parsed 800000 lines...
Parsed 900000 lines...
Parsed 1000000 lines...
Parsed 1100000 lines...
Parsed 1200000 lines...
Parsed 1300000 lines...
Parsed 1400000 lines...
Parsed 1500000 lines...
Parsed 1600000 lines...
Parsed 1700000 lines...
Parsed 1800000 lines...
Parsed 1900000 lines...
Parsed 2000000 lines...
Parsed 2100000 lines...
Parsed 2200000 lines...
Parsed 2300000 lines...
Parsed 2400000 lines...
Parsed 2500000 lines...
Parsed 2600000 lines...
Parsed 2700000 lines...
Parsed 2800000 lines...
Parsed 2900000 lines...
Parsed 3000000 lines...
Parsed 3100000 lines...
Parsed 3200000 lines...
Parsed 3300000 lines...
Parsed 3400000 lines...
Parsed 3500000 lines...
Parsed 3600000 lines...
Parsed 3700000 lines...
Parsed 3800000 lines...
Parsed 3900000 lines...
Parsed 4000000 lines...
Parsed 4100000 lines...
Parsed 4200000 lines...
P