# Counting Transcripts - Pipeline B

> This is part three of the Counting Transcripts series. We will cover another popular pipeline: Trimmomatic/kallisto. We will continue to use the data downloaded in [part 1]() of the series. This notebook should be located in the same folder as the previous notebooks.

## Setup

In [None]:
###IMPORTANT: if you ever restart this notebook, you MUST rerun this code cell
import os

#feel free to change this location; we will be working in this folder for the entire tutorial
os.environ['WORKDIR'] = './data' 

### Pipeline B: Trimmomatic/kallisto
**2a.** This pipeline will also use [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) to clean the reads.
If you have walked through part 2 of the series, you can skip this step.
If you still need to clean the reads, see instructions for using [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) in [pipeline A](), then return here.

[Kallisto](https://pachterlab.github.io/kallisto/) is a program developed in Lior Pachter's group that appeared in [2015](https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/). It is based on _pseudo-alignment_: instead of aligning each read base by base, it uses the [k-mers](https://en.wikipedia.org/wiki/K-mer) in a read to work out which transcripts the read is compatible with. It is fast, accurate, robust, and simple to use. All you need to do is create an index of your transcriptome (or any target reference sequence), and you will be ready to run it on your FASTQ files. This [video](https://www.youtube.com/watch?v=94wphB3GKBM) has a great explanation of how kallisto differs from other mapping tools.
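
To make the idea of pseudo-alignment more concrete, here is a toy sketch in Python. It is not how kallisto is implemented (kallisto works on a de Bruijn graph of the transcriptome and is far more efficient); it only illustrates the principle of intersecting the sets of transcripts that share each k-mer of a read. The function names, the sequences, and the choice of `k=5` are all made up for this illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return every overlapping substring of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read.

    The result is empty if the read is shorter than k or if any of its
    k-mers is missing from the index.
    """
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()

# Hypothetical sequences that differ at a single base, for illustration only
toy_transcripts = {
    "tx1": "ATGGCGTACGTTAGC",
    "tx2": "ATGGCGTACCTTAGC",
}
toy_index = build_index(toy_transcripts, k=5)

shared = pseudoalign("ATGGCGTA", toy_index, k=5)  # read from the identical region
unique = pseudoalign("GTACGTTA", toy_index, k=5)  # read covering the differing base
print(sorted(shared), sorted(unique))
```

The first read comes from a region the two toy transcripts share, so it stays compatible with both; the second read spans the base where they differ, so only `tx1` survives the intersection. No base-by-base alignment is computed at any point, which is where the speed comes from.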

**2b.** The first step is to create a kallisto index, and we can use the same transcriptome FASTA file from pipeline A.

In [None]:
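# make sure the folder for the index exists before kallisto writes to it
!mkdir -p $WORKDIR/kallisto
!kallisto index -i $WORKDIR/kallisto/transcripts.idx $WORKDIR/microbesOnline/4932.transcriptomes.fasta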

**2c.** Kallisto combines the mapping and quantifying steps into one. Pipeline A first uses STAR to map the reads to a reference genome sequence and then uses Cufflinks to quantify the aligned reads. Here, we can run just one command to get a file representing levels of gene expression.

In [None]:
!kallisto quant -i $WORKDIR/kallisto/transcripts.idx -o $WORKDIR/kallisto/output \
    $WORKDIR/cleanFASTQ/SRR5511057.forward_paired.fastq $WORKDIR/cleanFASTQ/SRR5511057.reverse_paired.fastq

Once the above command has finished, you should see a new folder called `output`. If you look inside, there should be a file called `abundance.tsv` with five columns:
 1. The first column is `target_id`, which is the identifier of the transcript from the FASTA transcriptome we downloaded.
 2. The second column is `length`, which is the length of the transcript in base pairs.
 3. The third column is `eff_length`, which stands for effective length. This is the transcript length adjusted for the lengths of the sequenced fragments: roughly, the number of positions in the transcript at which a fragment could start.
 4. The fourth column is `est_counts`, which stands for estimated counts. This is the estimated number of reads that came from the transcript. It can be fractional, because a read that is compatible with several transcripts is divided among them.
 5. The fifth column is `tpm`, which stands for transcripts per million. This is the metric for the level of gene expression, normalized for both transcript length and sequencing depth.
 
For more information about these expression units, visit [this blog post](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/).
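
As a small worked example of how TPM relates to the other columns, the sketch below applies the standard TPM definition: divide each transcript's estimated counts by its effective length, then rescale so the values sum to one million. The function name and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def tpm_from_counts(est_counts, eff_lengths):
    """Convert estimated counts to TPM.

    rate_i = est_counts_i / eff_length_i
    tpm_i  = rate_i / sum(rates) * 1e6
    """
    rates = [count / length for count, length in zip(est_counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        # nothing was expressed, so every TPM is zero
        return [0.0 for _ in rates]
    return [rate / total * 1e6 for rate in rates]

# Hypothetical values for three transcripts
toy_counts = [100.0, 300.0, 600.0]
toy_eff_lengths = [500.0, 1500.0, 1000.0]
toy_tpm = tpm_from_counts(toy_counts, toy_eff_lengths)
print(toy_tpm)
```

The second toy transcript has three times as many counts as the first, but it is also three times as long, so the two end up with the same TPM. This is why `tpm`, rather than `est_counts`, is the column to use when comparing expression between transcripts.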


**Question:** What would you expect the most expressed transcript in yeast to be?<sub>(3)</sub>
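
Once you have made your guess, one way to check it is to sort `abundance.tsv` by the `tpm` column. The cell below is a minimal sketch using only the Python standard library. The helper name `top_transcripts` is made up for this tutorial, and the path assumes the output location used in the `kallisto quant` command above; adjust it if you changed `WORKDIR` or the output folder.

```python
import csv
import os

def top_transcripts(abundance_path, n=5):
    """Return (target_id, tpm) for the n rows with the highest tpm."""
    with open(abundance_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row["tpm"]), reverse=True)
    return [(row["target_id"], float(row["tpm"])) for row in rows[:n]]

# Assumed location: the -o folder passed to kallisto quant earlier
abundance_path = os.path.join(
    os.environ.get("WORKDIR", "./data"), "kallisto", "output", "abundance.tsv"
)

if os.path.exists(abundance_path):
    for target_id, tpm in top_transcripts(abundance_path):
        print(target_id, tpm)
else:
    print("abundance.tsv not found; run kallisto quant first")
```

The identifiers printed are the `target_id` values from the transcriptome FASTA file, so you may need to look them up to see which genes they correspond to.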