# Objective:
In this section we will demonstrate how to assess expression of specific variant alleles at particular positions in the RNA-seq BAM.
# BAM Read Counting
Using one of the variant positions identified, count the number of reads supporting the reference and variant alleles.
## 1. First, use samtools mpileup to visualize a region of alignment with a variant.
Each line consists of **chromosome**, **1-based coordinate**, **reference base**, the **number of reads covering the site**, **read bases** and **base qualities**. When more than one BAM file is given, the last three columns are repeated for each file.
- In the read bases column, a **dot** stands for a **match to the reference base on the forward strand**, a **comma** for a match on the reverse strand, ACGTN for a mismatch on the forward strand and acgtn for a mismatch on the reverse strand. A pattern \+[0-9]+[ACGTNacgtn]+ indicates an insertion between this reference position and the next one; the integer gives the length of the insertion and is followed by the inserted sequence. A caret (^) marks the start of a read and is followed by a character encoding the mapping quality, a dollar sign ($) marks the end of a read, and > or < marks a reference skip such as a spliced intron.
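
The encoding described above can be tallied with a few lines of shell and awk. The following is only a minimal sketch: `count_pileup_bases` is a hypothetical helper name, not part of samtools. It assumes that read-start markers (`^` plus one mapping-quality character), read-end markers (`$`) and indel patterns should be dropped before counting, and it ignores reference skips (`>`, `<`) and deletion placeholders (`*`).

```shell
# Hypothetical helper (not part of samtools): tally one mpileup read-bases string.
count_pileup_bases() {
    printf '%s\n' "$1" | awk '{
        s = $0
        gsub(/\^./, "", s)   # drop read-start marker and its mapping-quality character
        gsub(/\$/, "", s)    # drop read-end markers
        # drop indel patterns such as +2AG or -1t (sign, length, then that many bases)
        while (match(s, /[+-][0-9]+/)) {
            n = substr(s, RSTART + 1, RLENGTH - 1) + 0
            s = substr(s, 1, RSTART - 1) substr(s, RSTART + RLENGTH + n)
        }
        fwd = gsub(/\./, "", s)            # reference matches, forward strand
        rev = gsub(/,/, "", s)             # reference matches, reverse strand
        mm  = gsub(/[ACGTNacgtn]/, "", s)  # mismatches on either strand
        printf "ref_fwd=%d ref_rev=%d mismatch=%d\n", fwd, rev, mm
    }'
}

# Made-up example string: 2 forward matches, 2 reverse matches, mismatches A and g
count_pileup_bases '.,A^].$,+2ACg'
# prints: ref_fwd=2 ref_rev=2 mismatch=2
```

The read-bases string is passed in single quotes so that the shell does not try to expand `$` or other special characters.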

See samtools pileup/mpileup documentation for more explanation of the output:
- http://samtools.sourceforge.net/pileup.shtml
- http://samtools.sourceforge.net/mpileup.shtml

In [1]:
cd $RNA_HOME
mkdir -p bam_readcount
cd bam_readcount

# Create indexed reference sequence fasta file (faidx) for use with mpileup
echo $RNA_REF_FASTA
samtools faidx $RNA_REF_FASTA

# Run samtools mpileup on a region of interest
samtools mpileup -f $RNA_REF_FASTA -r 22:18918457-18918467 $RNA_ALIGN_DIR/UHR.bam $RNA_ALIGN_DIR/HBR.bam

/home/ubuntu/workspace/rnaseq/refs/chr22_with_ERCC92.fa
[mpileup] 2 samples in 2 input files
22	18918457	A	7	..,.,.,	3EJJDD?	32	>.$.,.,,.,,..,,....,....,,,,.,.,.	ICD@mAJCIJDDJJDDDDJEEHJJIJEGEIDI
22	18918458	G	7	..,.,.,	8EIIEDC	31	>.$,.,,.,,..,,....,....,,,,.,.,.	IB@k:JBJJDDJJDDDDJDEGHJGJFGEIDJ
22	18918459	C	7	..,.,.,	8DJJDH?	30	>,$.,,.,,..,,....,....,,,,.,.,.	I@mAIDIIDDIJBDDDJDDFHJIJFIHJDJ
22	18918460	T	7	..,.,.,	4DIIAEB	29	>.,,.,,..,,....,....,,,,.,.,.	ImFIDGJDDIICDDCJDEFGIHJFIHJDJ
22	18918461	C	7	..,.,.,	4DJIDH@	29	>.,,.,,..,,....,....,,,,.,.,.	ImHJDJJDDIJDDDDJDDFHJFJFIEJDJ
22	18918462	C	8	..,.,.,^].	:DJJFE?@	29	>.,,.,,..,,....,....,,,,.,.,.	ImHJDIJDDIJDDDCIDDFHIGJHIDIDJ
22	18918463	A	8	..,.,.,.	>DIJBH5@	29	>.,,.,,..,,....,....,,,,.,.,.	IkDIaHJDDEICDDCGDDFFHFHHGBGFJ
22	18918464	C	8	..,.,.,.	?DJJFG@@	29	>.,,.,t..,,....,....,,,,.,.,.	IkDJgJJDDJJCDDCJDDEFHHJHGHHFJ
22	18918465	G	9	..,.,.,.^].	?DJJHG?FF	30	>.,a.,,..,,....,....,,,,.,A,.^].	IkBIgJJDDJJBDDDHDDDFJBJJIFIFJC
22	18918466	T	9	..,

## 2. Now, use bam-readcount to count reference and variant bases at a specific position. 
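First, create a site list file with some positions of interest (we will create a file called snvs.bed using the echo command).
It will contain a single line giving the chromosome, start and end of a variant position on chr22 (region 22:38483683-38483683). Despite the .bed extension, bam-readcount treats these as 1-based coordinates, e.g.:
22 38483683 38483683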

In [2]:
# create a bed file with some positions of interest
echo "22 38483683 38483683"
echo "22 38483683 38483683" > snvs.bed

22 38483683 38483683
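
When positions are already written in `chrom:start-end` notation, they can be converted to site list lines instead of being typed by hand. This is a minimal sketch: `region_to_site` is a hypothetical helper, and it assumes chromosome names contain neither `:` nor `-`. The output is space-separated to match the echo command above.

```shell
# Hypothetical helper: turn "chrom:start-end" into a "chrom start end" site list line.
# Assumes the chromosome name itself contains no ':' or '-'.
region_to_site() {
    printf '%s\n' "$1" | awk -F'[:-]' 'NF == 3 { printf "%s %s %s\n", $1, $2, $3 }'
}

region_to_site "22:38483683-38483683"
# prints: 22 38483683 38483683
```

Redirecting the output of several such calls into one file would give a multi-position site list for the `-l` option.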


In [3]:
# Run bam-readcount on this list for the UHR and HBR merged BAM files
bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/UHR.bam 2>/dev/null
bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/HBR.bam 2>/dev/null

22	38483683	G	326	=:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00	A:163:60.00:37.15:0.00:94:69:0.53:0.01:38.37:96:0.56:99.98:0.54	C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00	G:163:60.00:37.06:0.74:84:79:0.45:0.00:1.90:89:0.53:99.86:0.54	T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00	N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
22	38483683	G	206	=:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00	A:75:60.00:38.41:0.00:44:31:0.52:0.01:38.99:44:0.55:99.99:0.50	C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00	G:131:60.00:37.31:0.00:76:55:0.50:0.00:1.22:77:0.56:99.95:0.54	T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00	N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00


In [4]:
# Now, run it again, but ignore stderr and redirect stdout to a file:
bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/UHR.bam 2>/dev/null 1>UHR_bam-readcounts.txt
bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/HBR.bam 2>/dev/null 1>HBR_bam-readcounts.txt

In [5]:
# From this output you could parse the read counts for each base
cat UHR_bam-readcounts.txt | perl -ne '@data=split("\t", $_); @Adata=split(":", $data[5]); @Cdata=split(":", $data[6]); @Gdata=split(":", $data[7]); @Tdata=split(":", $data[8]); print "UHR Counts\t$data[0]\t$data[1]\tA: $Adata[1]\tC: $Cdata[1]\tT: $Tdata[1]\tG: $Gdata[1]\n";'
cat HBR_bam-readcounts.txt | perl -ne '@data=split("\t", $_); @Adata=split(":", $data[5]); @Cdata=split(":", $data[6]); @Gdata=split(":", $data[7]); @Tdata=split(":", $data[8]); print "HBR Counts\t$data[0]\t$data[1]\tA: $Adata[1]\tC: $Cdata[1]\tT: $Tdata[1]\tG: $Gdata[1]\n";'

UHR Counts	22	38483683	A: 163	C: 0	T: 0	G: 163
HBR Counts	22	38483683	A: 75	C: 0	T: 0	G: 131
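
One way to summarize these counts is as a variant allele fraction (VAF): variant reads divided by the sum of reference and variant reads. The sketch below assumes G is the reference allele (the third column of the bam-readcount output) and A is the variant allele. `vaf` is a hypothetical helper name, and the counts are copied by hand from the output above.

```shell
# Hypothetical helper: VAF = variant reads / (reference reads + variant reads).
# Usage: vaf REF_COUNT VARIANT_COUNT
vaf() {
    awk -v ref="$1" -v var="$2" 'BEGIN {
        total = ref + var
        if (total == 0) { print "NA"; exit }   # no coverage, fraction undefined
        printf "%.3f\n", var / total
    }'
}

vaf 163 163   # UHR: G = 163, A = 163, prints 0.500
vaf 131 75    # HBR: G = 131, A = 75, prints 0.364
```

Under these assumptions the variant allele accounts for about half of the reads in UHR and roughly a third in HBR at this position.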


Besides the Perl one-liners above, here’s a [bam-readcount tutorial](https://github.com/genome/bam-readcount/tree/master/tutorial) that uses Python to parse bam-readcount output and identify an Omicron SARS-CoV-2 variant of concern from raw sequence data.