## A tutorial on samtools

Personally, I have always found it hard to get comfortable with some of the most popular and widely used tools in bioinformatics. Some of these, for instance samtools, Bowtie and PLINK, have become standard, and I have seen people working in these fields rely on them very heavily. Coming from a statistics background, I have always found it hard to grasp this software well. This is my personal effort at learning about these tools.

I am going to start with samtools.

SAM/BAM (BAM being the binary representation of SAM) is currently the de facto standard for storing large nucleotide sequence alignments. If you are dealing with high-throughput sequencing data, at some point you will probably have to deal with SAM/BAM files, so familiarise yourself with them! (http://davetang.org/wiki/tiki-index.php?page=SAMTools#Extracting_only_the_first_read_from_paired_end_BAM_files)

I have also learned that, however deep you dig into a sequencing analysis, the lowest level you usually reach is a BAM file. In other words, most downstream sequencing analyses start from a BAM file, so it is good to get a proper idea of how to handle one.

To demonstrate samtools, we need a BAM file to experiment with. T004_all_chr.bam is the example BAM file used here. The data is available in the same folder as this file on my GitHub.

Throughout this demo, I will assume the user has no prior knowledge of samtools or of BAM files, so this tutorial runs the risk of being too basic for someone already well versed in some aspects of the BAM format and samtools.

In [1]:
%%bash
samtools view T004_all_chr.bam | head -n 1  ## first line of the BAM file

DBRHHJN1:287:C11BUACXX:2:1108:18196:37453	0	chr1	11591	1	100M	*	0	0	GTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCT	B@CFFFFFHGHHHJJJJIIJJJIJJJJJJJJJJIIIIJJJJJJJJIJJJ?FFIIGIJJJJJGHGIJJJJEHEIHEEFGHHFFDFEDEEEDDDB?B?>CDD	AS:i:200	XS:i:200	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:100	YT:Z:UU


Okay, so now we know what the first line of a BAM file looks like. Every line will typically look like this.

In [4]:
%%bash 
samtools view T004_all_chr.bam | head -n 3  ## first 3 lines of the BAM file

DBRHHJN1:287:C11BUACXX:2:1108:18196:37453	0	chr1	11591	1	100M	*	0	0	GTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCT	B@CFFFFFHGHHHJJJJIIJJJIJJJJJJJJJJIIIIJJJJJJJJIJJJ?FFIIGIJJJJJGHGIJJJJEHEIHEEFGHHFFDFEDEEEDDDB?B?>CDD	AS:i:200	XS:i:200	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:100	YT:Z:UU
DBRHHJN1:287:C11BUACXX:2:1102:2899:38685	0	chr1	11729	1	100M	*	0	0	TTTTAAATTTCCACTGATGATTTTACTGAATGGCCGGTGTTGAGAATGACTGTGCAAATTTGCCGGATTTCCTTCGCTGTTCCTGCATGTAGTTTAAACG	<@;BDBDDDFHFFGGHFDAAFE?C,AFHI@CHEGE@C10@7DB<F>B*/?84?<C88B<8CAC;@8=ABCB@@@@=ACB3>A>>>C?(5(5;(5@DCC38	AS:i:173	XS:i:173	XN:i:0	XM:i:4	XO:i:0	XG:i:0	NM:i:4	MD:Z:24G3C23C21T25	YT:Z:UU
DBRHHJN1:287:C11BUACXX:2:1110:3641:63170	0	chr1	11754	1	100M	*	0	0	CTGCATGGCCGGTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTC	@@@FFFFFHHHH?EFHJJIIIJDIJHGGGHJIIIJJIIJGIGBGHIJJIGGGGGEHHFEFFDDE@CECEEEDDDDBDBDDDDDDDDDDDDDD5>CCDDAD	AS:i:200	XS:i:200	XN:i:0	XM:i:0	XO:i:0	XG:

In [6]:
%%bash
samtools view T004_all_chr.bam | tail -n 3  ## last 3 lines of the BAM file

DBRHHJN1:287:C11BUACXX:2:1105:21004:26547	4	*	0	0	*	*	0	0	GCCACACGGACCACAAGCAGTATAGCCCCAGATAGCGCCCCCAGTCTGCTCCTGTCGCAGGCAGTGAACGCCCGGGGTAGTGGAGCCAAAAAACACCTGG	+1=DB;DD:F<?A:;2<?C>E4<?CD>??;DF?<9?DF<:;-;@7==4;A>>?###############################################	YT:Z:UU
DBRHHJN1:287:C11BUACXX:2:1107:11638:91641	4	*	0	0	*	*	0	0	CTATCAATTCTCTGTACGTGCTTCATGTTAGATTTCCAGTCATATGTTTGATTTTCTTTTTAGAATGGTCTTCATTTCAGATAATTTCAAATCTAAAGCC	CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJFIJJIJJJHIJJHIJHIJJHIHIJJJIJJIJIEGHJJIJIJIIJHJJIJJJJEHJGHGGHHHFFFFFFE	YT:Z:UU
DBRHHJN1:287:C11BUACXX:2:1107:17196:42409	4	*	0	0	*	*	0	0	CAAAAAGAACTGAGGAGGCTCCCGCCACAGCTGCAGGACCCACCTCTTCGCCTTGGTTCCCTTGAACACGAGCTTGGTAGACTTCACATGAGAGTACTCG	CCCFFFFFHHFHHJJJJJJJJJIIIIIIJIGIGIIIJGIJJIIJGIIJGGHHFHHFDFBEEECEEDDCDDDDDDDCBDDDCCDDDDDDDDCDDD>CDDDB	YT:Z:UU


You will notice that we are printing only a few lines of this file. This is because the file itself is pretty big. How many lines does it contain?

In [7]:
%%bash
samtools view T004_all_chr.bam | wc -l

 2450208
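
A note on why this count is meaningful: `samtools view` prints one alignment record per line and leaves out the header unless you pass `-h`, so counting lines here counts reads. As a small self-contained sketch (the SAM text below is made up for illustration and is not taken from T004_all_chr.bam), header lines start with `@` and alignment records do not:

```shell
# A tiny, made-up SAM text: two header lines (starting with "@")
# followed by three abbreviated, tab-separated alignment records.
sam=$(printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\nr1\t0\tchr1\t11591\t1\t4M\t*\t0\t0\tGTTC\tB@CF\nr2\t0\tchr1\t11729\t1\t4M\t*\t0\t0\tTTTT\t<@;B\nr3\t4\t*\t0\t0\t*\t*\t0\t0\tGCCA\t+1=D\n')

# Every line, header included (tr strips the padding some wc versions add).
total_lines=$(printf '%s\n' "$sam" | wc -l | tr -d '[:space:]')

# Only the alignment records: lines that do not begin with "@".
records=$(printf '%s\n' "$sam" | grep -vc '^@')

echo "lines: $total_lines, records: $records"
```

On the real file, `samtools view -c T004_all_chr.bam` should give the same count directly, without piping every record through `wc`.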


Now let us invest some time in finding out what the different columns in the BAM file mean. We first check how many columns there are.

In [9]:
%%bash
samtools view T004_all_chr.bam | head -n 1 | awk '{ print NF}'

20
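
One caveat I should flag: by default `awk` splits on any whitespace, whereas SAM fields are separated by tabs, and only the first 11 fields are mandatory. The number of columns can therefore differ from read to read. Here is a minimal sketch using two shortened records modelled on the output above (the read names `read1` and `read2` and the 4-base sequences are placeholders of mine, not real data):

```shell
# A mapped read with 9 optional tags and an unmapped read with 1 optional tag.
# SEQ and QUAL are shortened purely to keep the example readable.
mapped=$(printf 'read1\t0\tchr1\t11591\t1\t100M\t*\t0\t0\tGTTC\tB@CF\tAS:i:200\tXS:i:200\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:100\tYT:Z:UU')
unmapped=$(printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tGCCA\t+1=D\tYT:Z:UU')

# -F'\t' makes awk split on tabs only, matching the SAM layout.
counts=$(printf '%s\n%s\n' "$mapped" "$unmapped" | awk -F'\t' '{ print NF }')

echo "$counts"   # one column count per record
```

To see the same breakdown for the whole file, something like `samtools view T004_all_chr.bam | awk -F'\t' '{ print NF }' | sort -n | uniq -c` should work.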
