# Constructing a transcriptome index with `kb`

This tutorial shows how to build a transcriptome index for use with **kallisto | bustools** using `kb`.

## Download reference files

Download the genome (DNA) FASTA file and the GTF annotation file for your organism from the database of your choice. This tutorial uses the mouse reference files (GRCm38, release 98) from [Ensembl](https://uswest.ensembl.org/info/data/ftp/index.html).

In [0]:
%%time
!wget ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
!wget ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz

--2020-01-13 22:23:27--  ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
           => ‘Mus_musculus.GRCm38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-98/fasta/mus_musculus/dna ... done.
==> SIZE Mus_musculus.GRCm38.dna.primary_assembly.fa.gz ... 805984352
==> PASV ... done.    ==> RETR Mus_musculus.GRCm38.dna.primary_assembly.fa.gz ... done.
Length: 805984352 (769M) (unauthoritative)


2020-01-13 22:23:35 (95.9 MB/s) - ‘Mus_musculus.GRCm38.dna.primary_assembly.fa.gz’ saved [805984352]

--2020-01-13 22:23:37--  ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
           => ‘Mus_musculus.GRCm38.98.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org).
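
Before building the index, it can be worth checking that each download is a readable gzip file with the expected kind of content. The sketch below is one way to do that from Python; the helper name `peek_gzip_lines` and the stand-in demo file are assumptions made for illustration and are not part of `kb`.

```python
import gzip
import os
import tempfile
from itertools import islice


def peek_gzip_lines(path, n=5):
    """Return the first n text lines of a gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demonstrate on a tiny stand-in file (made-up content, not real Ensembl data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.fa.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write(">chr_demo\nACGT\nTTGA\n")
    demo_lines = peek_gzip_lines(demo_path, n=2)

print(demo_lines)
```

Calling `peek_gzip_lines("Mus_musculus.GRCm38.98.gtf.gz")` on the real download should return header and annotation lines; a truncated or corrupt download would typically raise an error while reading.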

## Install `kb`

In [0]:
!pip install kb-python

Collecting kb-python
  Downloading https://files.pythonhosted.org/packages/62/c9/2e5b8fa2cd873a23ae1aeb128b33165d6a9387a2f56ea1fafec1d6d32477/kb_python-0.24.4-py3-none-any.whl (35.4MB)
     |████████████████████████████████| 35.4MB 119kB/s 
Collecting loompy>=3.0.6
  Downloading https://files.pythonhosted.org/packages/36/52/74ed37ae5988522fbf87b856c67c4f80700e6452410b4cd80498c5f416f9/loompy-3.0.6.tar.gz (41kB)
     |████████████████████████████████| 51kB 202kB/s 
Collecting anndata>=0.6.22.post1
  Downloading https://files.pythonhosted.org/packages/2b/72/87196c15f68d9865c31a43a10cf7c50bcbcedd5607d09f9aada0b3963103/anndata-0.6.22.post1-py3-none-any.whl (47kB)
     |████████████████████████████████| 51kB 6.8MB/s 
Collecting numpy-groupies
  Downloading https://files.pythonhosted.org/packages/57/ae/18217b57ba3e4bb8a44ecbfc161ed065f6d1b90c75d404bd6ba8d6f024e2/numpy_groupies-0.9.10.tar.gz (43kB)
     |████████████████████████████████| 51kB 7.5

## Build the index

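`kb ref` uses the GTF annotation to extract the transcript (cDNA) sequences from the genome FASTA, writes them to the file given by `-f1`, and builds a kallisto index from them at the path given by `-i`. It also writes a transcript-to-gene mapping to the file given by `-g`.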

In [0]:
%%time
!kb ref -i transcriptome.idx -g transcripts_to_genes.txt -f1 cdna.fa \
Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
Mus_musculus.GRCm38.98.gtf.gz

[2020-01-13 22:26:03,328]    INFO Decompressing Mus_musculus.GRCm38.98.gtf.gz to tmp
[2020-01-13 22:26:06,666]    INFO Creating transcript-to-gene mapping at transcripts_to_genes.txt
[2020-01-13 22:26:36,616]    INFO Decompressing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz to tmp
[2020-01-13 22:27:01,812]    INFO Sorting tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa
[2020-01-13 22:34:09,011]    INFO Sorting tmp/Mus_musculus.GRCm38.98.gtf
[2020-01-13 22:34:57,702]    INFO Splitting genome into cDNA at cdna.fa
[2020-01-13 22:36:02,242]    INFO Indexing to transcriptome.idx
CPU times: user 5.08 s, sys: 886 ms, total: 5.97 s
Wall time: 19min 38s
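
The log above includes a step that creates a transcript-to-gene mapping from the GTF. As a rough illustration of what such a mapping involves, the sketch below pulls `transcript_id` and `gene_id` out of the attribute column of each `transcript` record. This is not `kb`'s implementation; the function name `transcript_to_gene_pairs` and the example records (with made-up IDs) are assumptions for illustration only.

```python
import re

# GTF attributes are written as: key "value"; key "value"; ...
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene_pairs(gtf_lines):
    """Yield (transcript_id, gene_id) for each 'transcript' record."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) != 9 or fields[2] != "transcript":
            continue
        attributes = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
        yield attributes["transcript_id"], attributes["gene_id"]


# Hypothetical records with made-up IDs, not taken from the Ensembl file.
example_lines = [
    "#!demo header\n",
    '1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A";\n',
    '1\tdemo\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n',
    '1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n',
]

pairs = list(transcript_to_gene_pairs(example_lines))
print(pairs)
```

Only the `transcript` record contributes a pair here; the `gene` and `exon` records are skipped, so each transcript appears once in the mapping.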
