# Constructing a velocity index with `kb`

This tutorial shows how to generate a velocity index for use with **kallisto | bustools** using `kb`.

## Download reference files

Download the genomic (DNA) FASTA and GTF annotations for your desired organism from the database of your choice. This tutorial uses mouse reference files downloaded from [Ensembl](https://uswest.ensembl.org/info/data/ftp/index.html).

In [0]:
%%time
!wget ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
!wget ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz

--2020-01-16 00:33:28--  ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
           => ‘Mus_musculus.GRCm38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-98/fasta/mus_musculus/dna ... done.
==> SIZE Mus_musculus.GRCm38.dna.primary_assembly.fa.gz ... 805984352
==> PASV ... done.    ==> RETR Mus_musculus.GRCm38.dna.primary_assembly.fa.gz ... done.
Length: 805984352 (769M) (unauthoritative)


2020-01-16 00:38:18 (2.66 MB/s) - ‘Mus_musculus.GRCm38.dna.primary_assembly.fa.gz’ saved [805984352]

--2020-01-16 00:38:18--  ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
           => ‘Mus_musculus.GRCm38.98.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org).
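
Large FTP downloads can end up truncated, so it may be worth checking the files before building the index. The following is a minimal sketch, not part of `kb`: the helper name `count_fasta_records` is hypothetical. It reads a plain or gzipped FASTA file from start to finish and counts the header lines, so a corrupted or truncated gzip file should raise an error instead of returning a count.

```python
import gzip


def count_fasta_records(path):
    """Count FASTA records (lines starting with '>') in a plain or gzipped file.

    Hypothetical helper for illustration; reading the whole file also
    exercises gzip decompression, so a truncated archive raises an error.
    """
    # Pick the opener based on the file extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    n_records = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                n_records += 1
    return n_records
```

For example, `count_fasta_records("Mus_musculus.GRCm38.dna.primary_assembly.fa.gz")` returns the number of sequences in the downloaded genome. Expect this to take a while on a file of this size.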

## Install `kb`

In [0]:
!pip install git+https://github.com/pachterlab/kb_python@count-kite

Collecting git+https://github.com/pachterlab/kb_python@count-kite
  Cloning https://github.com/pachterlab/kb_python (to revision count-kite) to /tmp/pip-req-build-0qd_9um1
  Running command git clone -q https://github.com/pachterlab/kb_python /tmp/pip-req-build-0qd_9um1
  Running command git checkout -b count-kite --track origin/count-kite
  Switched to a new branch 'count-kite'
  Branch 'count-kite' set up to track remote branch 'count-kite' from 'origin'.
Building wheels for collected packages: kb-python
  Building wheel for kb-python (setup.py) ... done
  Created wheel for kb-python: filename=kb_python-0.24.4-cp36-none-any.whl size=80991434 sha256=4dc5ecd507f54a3aa99e9f4862f2ea29dd9a0ce442af305bf0eff1177f35e22f
  Stored in directory: /tmp/pip-ephem-wheel-cache-7b9entel/wheels/8e/56/56/c89223de74af26792675e82f4bb5223e7cf0d653a33038e34c
Successfully built kb-python
Installing collected packages: kb-python
Successfully installed kb-python-0.24.4


## Build the index

`kb` automatically splits the genome into cDNA and intron FASTA files. Because Google Colab has limited memory, we split the index into parts (here, we use `-n 8`). This reduces the maximum memory `kb` uses, at the cost of a longer `kb count` runtime.
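
Because the memory trade-off is controlled by a single flag, it can be convenient to parameterize the command. The following is a minimal sketch, not part of `kb` itself: the helper name `build_kb_ref_command` is hypothetical, and it only reuses the flags and output file names that appear in this tutorial's command. It builds the argument list and prints it without running anything.

```python
import shlex


def build_kb_ref_command(fasta, gtf, n_parts=8):
    """Assemble the `kb ref` argument list used in this tutorial.

    Hypothetical helper for illustration. `n_parts` is passed to `-n`,
    the number of parts the index is split into.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    return [
        "kb", "ref",
        "-i", "index.idx",        # output index
        "-g", "t2g.txt",          # transcript-to-gene mapping
        "-f1", "cdna.fa",         # cDNA FASTA
        "-f2", "intron.fa",       # intron FASTA
        "-c1", "cdna_t2c.txt",    # cDNA transcripts-to-capture
        "-c2", "intron_t2c.txt",  # intron transcripts-to-capture
        "--workflow", "lamanno",
        "-n", str(n_parts),
        fasta,
        gtf,
    ]


cmd = build_kb_ref_command(
    "Mus_musculus.GRCm38.dna.primary_assembly.fa.gz",
    "Mus_musculus.GRCm38.98.gtf.gz",
)
# Print the shell-quoted command; to execute it, one could pass `cmd`
# to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems and makes it easy to try a different number of parts if the build runs out of memory.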

In [0]:
%%time
!kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno -n 8 \
Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
Mus_musculus.GRCm38.98.gtf.gz

[2020-01-16 03:31:03,222]    INFO Preparing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz, Mus_musculus.GRCm38.98.gtf.gz
[2020-01-16 03:31:03,222]    INFO Decompressing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz to tmp
[2020-01-16 03:31:30,853]    INFO Sorting tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa to /content/tmp/tmpl3plby1k
[2020-01-16 03:38:59,002]    INFO Decompressing Mus_musculus.GRCm38.98.gtf.gz to tmp
[2020-01-16 03:39:03,235]    INFO Sorting tmp/Mus_musculus.GRCm38.98.gtf to /content/tmp/tmp7ebqamug
[2020-01-16 03:40:00,940]    INFO Splitting genome tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa into cDNA at /content/tmp/tmp1zp3oo9w
[2020-01-16 03:41:14,163]    INFO Wrote 142446 cDNA transcripts
[2020-01-16 03:41:14,168]    INFO Creating cDNA transcripts-to-capture at /content/tmp/tmpxjcopm_m
[2020-01-16 03:41:15,248]    INFO Splitting genome into introns at /content/tmp/tmpmo19l2ry
[2020-01-16 03:45:44,829]    INFO Wrote 647972 intron sequences
[2020-01-16 03:4