# Counting Reads
The next step after mapping reads is to count the number of reads that fall within each annotated gene in the genome, so let's set up a count directory.

## Shell Variables

In [None]:
# Source the config script
source bioinf_intro_config.sh

mkdir -p $COUNT_OUT
ls $CUROUT

## Counting Reads

In [None]:
htseq-count --help

We will use [htseq-count](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) to do the counting, but first we need to make some decisions, because the `htseq-count` defaults do not work with some annotation files. Here are the **most important** command-line options that we need to consider:
* `--format=<format>`: Format of the input data. Possible values are `sam` (for text SAM files) and `bam` (for binary BAM files). Default is `sam`.
* `--stranded=<yes/no/reverse>`: Whether the data is from a strand-specific assay (default: `yes`). For `stranded=no`, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For `stranded=yes` and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For `stranded=reverse`, these rules are reversed.
* `--type=<feature type>`: Feature type (3rd column in GFF file) to be evaluated; all features of other types are ignored. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is `exon`.
* `--idattr=<id attribute>`: GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identify the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is `gene_id`.

And here is how we will set those options:
* `--format=bam`: STAR generated BAM files for us.
* `--stranded=reverse`: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense; `htseq-count` considers this to be "reverse".

We need to look at the annotation file (`$GTF`) to understand what exactly the `--type` and `--idattr` options refer to, and why we can leave them at their defaults here.

In [None]:
head -20 $GTF

In [None]:
head -50 $GTF | cut -c -55
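Scrolling through the first lines only shows a small part of the annotation. As a rough sketch, the two helper functions below summarize the whole file instead: one tallies the feature types in column 3 (the values `--type` selects from), and the other counts `exon` lines that lack a `gene_id` attribute (the attribute `--idattr` uses by default). The function names are hypothetical, and the attribute check assumes GTF-style attributes of the form `gene_id "..."`.

```shell
# Hypothetical helper: tally feature types (column 3) in a GTF/GFF file.
# Prints one "type<TAB>count" line per feature type, sorted by type name.
gtf_feature_types() {
    awk -F'\t' '!/^#/ { n[$3]++ }
                END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}

# Hypothetical helper: count exon lines with no GTF-style gene_id attribute.
# A result of 0 suggests the default --idattr=gene_id will find an ID on every exon.
gtf_exons_without_gene_id() {
    awk -F'\t' '!/^#/ && $3 == "exon" && $9 !~ /gene_id "/ { n++ }
                END { print n + 0 }' "$1"
}
```

Calling `gtf_feature_types $GTF` and `gtf_exons_without_gene_id $GTF` should show whether `exon` features are present and whether each one carries a `gene_id`.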

## Running htseq-count
So now we are ready!  We run `htseq-count` using `htseq-count ALIGNMENT_FILE GFF_FILE`.  Here is our command for our test sample:
    

In [None]:
htseq-count --quiet \
    --format=bam \
    --stranded=reverse \
    ${STAR_OUT}/27_MA_P_S38_L002_R1_Aligned.out.bam \
    $GTF > ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

Let's take a quick peek at the results:

In [None]:
head ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

There's also some useful information at the end of the file:

In [None]:
tail ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

### Sanity Check: Total read counts

In [None]:
head ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab

In [None]:
# Sum column 4 (STAR's counts for reverse-stranded libraries)
awk '{s+=$4} END {printf "%.0f\n", s}' ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab
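Column 4 is used above because, according to the STAR manual, the columns of `ReadsPerGene.out.tab` after the gene ID hold unstranded counts (column 2), counts equivalent to `htseq-count --stranded=yes` (column 3), and counts equivalent to `htseq-count --stranded=reverse` (column 4). A minimal sketch for comparing the three per-gene totals side by side follows; the function name is hypothetical. For a dUTP library we would expect the "reverse" total to be much larger than the "forward" one.

```shell
# Hypothetical helper: sum each count column of a STAR ReadsPerGene.out.tab file,
# skipping the N_* summary lines at the top of the file.
star_column_totals() {
    awk -F'\t' '
        $1 !~ /^N_/ { u += $2; f += $3; r += $4 }
        END {
            printf "unstranded\t%.0f\n", u
            printf "forward\t%.0f\n", f
            printf "reverse\t%.0f\n", r
        }' "$1"
}
```

For our test sample this could be run as `star_column_totals ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab`.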

In [None]:
# Sum the count column of the htseq-count output
awk '{s+=$2} END {printf "%.0f\n", s}' ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

In [None]:
ls ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab

In [None]:
cat ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out

In [None]:
grep "Uniquely mapped reads number" ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out

An easy way to check how many reads map unambiguously to genes is to use `grep` to remove the lines that report problem reads (at the end of the `htseq-count` file), then run our `awk` command again.

In [None]:
# Anchor the pattern so only the special counter lines (which start with "__") match
grep '^__' ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

In [None]:
grep -v '^__' ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv | awk '{s+=$2} END {printf "%.0f\n", s}'
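The two `grep` steps can also be folded into a single pass. The sketch below (with a hypothetical function name) separates the special counter lines that `htseq-count` prefixes with `__`, such as `__no_feature` and `__ambiguous`, from the per-gene lines, and reports both totals along with the percentage assigned to genes.

```shell
# Hypothetical helper: summarize an htseq-count table in one pass.
# Lines starting with "__" are htseq-count's special counters; all others are genes.
htseq_summary() {
    awk -F'\t' '
        /^__/ { special += $2; next }
        { assigned += $2 }
        END {
            total = assigned + special
            printf "assigned\t%.0f\n", assigned
            printf "not_assigned\t%.0f\n", special
            if (total > 0) printf "percent_assigned\t%.1f\n", 100 * assigned / total
        }' "$1"
}
```

For our test sample this could be run as `htseq_summary ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv`.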

In [None]:
# Anchor the pattern so only STAR's summary lines (which start with "N_") match
grep '^N_' ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab

In [None]:
grep -v '^N_' ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab | awk '{s+=$4} END {printf "%.0f\n", s}'
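Comparing totals only tells us whether the two tools agree overall. As a further sanity check, the sketch below compares them gene by gene, assuming that column 4 of STAR's table is the right counterpart to `--stranded=reverse`. The function name is hypothetical, and small differences may still appear depending on tool versions and settings.

```shell
# Hypothetical helper: list genes whose STAR column-4 count differs from htseq-count.
# $1: STAR ReadsPerGene.out.tab   $2: htseq-count table
compare_star_htseq() {
    awk -F'\t' '
        # First file: remember STAR counts, skipping the N_* summary lines.
        NR == FNR { if ($1 !~ /^N_/) star[$1] = $4; next }
        # Second file: report genes present in both tables with different counts.
        $1 !~ /^__/ && ($1 in star) && star[$1] != $2 {
            printf "%s\t%s\t%s\n", $1, star[$1], $2
            n++
        }
        END { printf "genes_differing\t%d\n", n }
    ' "$1" "$2"
}
```

Each output line gives the gene ID, the STAR count, and the `htseq-count` count; the last line reports how many genes differ.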

<!---

`cut -f3 $GFF | sort | uniq -c | head -n50`

--->
