# Pre-processing and analysis of mixed-species single-cell RNA-seq data with kallisto|bustools.

In this notebook, we will perform pre-processing and analysis of the [10x Genomics 1k 1:1 mixture of fresh frozen human and mouse cells](https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/1k_hgmm_v3) dataset using the **kallisto | bustools** workflow, implemented with a wrapper called `kb`. It was developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi.

In [0]:
!date

Thu Jan 16 18:54:23 UTC 2020


## Pre-processing

### Download the data

__Note:__ We use the `-O` option for `wget` to rename the files so they are easy to identify.

In [0]:
%%time
!wget https://caltech.box.com/shared/static/8oeuskecfr9ujlufqj3b7frj74rxfzcc.txt -O checksums.txt
!wget https://caltech.box.com/shared/static/ags4jxbqrceuqewb0zy7kyuuggazqb0j.gz -O 1k_hgmm_v3_S1_L001_R1_001.fastq.gz
!wget https://caltech.box.com/shared/static/39tknal6wm4lhvozu6bf6vczb475bnuu.gz -O 1k_hgmm_v3_S1_L001_R2_001.fastq.gz
!wget https://caltech.box.com/shared/static/x2hwq2q3weuggtffjfgd1e8a1m1y7wj9.gz -O 1k_hgmm_v3_S1_L002_R1_001.fastq.gz
!wget https://caltech.box.com/shared/static/0g7lnuieg8jxlxswrssdtz809gus75ek.gz -O 1k_hgmm_v3_S1_L002_R2_001.fastq.gz
!wget https://caltech.box.com/shared/static/0avmybuxqcw8haa1hf0n72oyb8zriiuu.gz -O 1k_hgmm_v3_S1_L003_R1_001.fastq.gz
!wget https://caltech.box.com/shared/static/hp10z2yr8u3lbzoj1qflz83r2v9ohs6q.gz -O 1k_hgmm_v3_S1_L003_R2_001.fastq.gz
!wget https://caltech.box.com/shared/static/fx8fduedje53dvf3xixyyaqzugn7yy85.gz -O 1k_hgmm_v3_S1_L004_R1_001.fastq.gz
!wget https://caltech.box.com/shared/static/lpt6uzmueh1l2vx71nvsdj3pwqh8z3ak.gz -O 1k_hgmm_v3_S1_L004_R2_001.fastq.gz

--2020-01-16 18:54:36--  https://caltech.box.com/shared/static/8oeuskecfr9ujlufqj3b7frj74rxfzcc.txt
Resolving caltech.box.com (caltech.box.com)... 103.116.4.197
Connecting to caltech.box.com (caltech.box.com)|103.116.4.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/8oeuskecfr9ujlufqj3b7frj74rxfzcc.txt [following]
--2020-01-16 18:54:36--  https://caltech.box.com/public/static/8oeuskecfr9ujlufqj3b7frj74rxfzcc.txt
Reusing existing connection to caltech.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://caltech.app.box.com/public/static/8oeuskecfr9ujlufqj3b7frj74rxfzcc.txt [following]
--2020-01-16 18:54:36--  https://caltech.app.box.com/public/static/8oeuskecfr9ujlufqj3b7frj74rxfzcc.txt
Resolving caltech.app.box.com (caltech.app.box.com)... 103.116.4.199
Connecting to caltech.app.box.com (caltech.app.box.com)|103.116.4.199|:443... connected.
HTTP request sent, awaiting response... 302 F

Then, we verify the integrity of the files we downloaded to make sure they were not corrupted during the download.

In [0]:
!md5sum -c checksums.txt --ignore-missing

1k_hgmm_v3_S1_L001_R1_001.fastq.gz: OK
1k_hgmm_v3_S1_L001_R2_001.fastq.gz: OK
1k_hgmm_v3_S1_L002_R1_001.fastq.gz: OK
1k_hgmm_v3_S1_L002_R2_001.fastq.gz: OK
1k_hgmm_v3_S1_L003_R1_001.fastq.gz: OK
1k_hgmm_v3_S1_L003_R2_001.fastq.gz: OK
1k_hgmm_v3_S1_L004_R1_001.fastq.gz: OK
1k_hgmm_v3_S1_L004_R2_001.fastq.gz: OK
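
For readers who prefer to run the same check from Python, the following is a minimal sketch of what `md5sum -c checksums.txt --ignore-missing` does. The function names `md5_of_file` and `verify_checksums` are hypothetical helpers, not part of `kb`. The sketch assumes the usual `md5sum` line layout (digest, whitespace, file name) and resolves file names relative to the directory of the checksum file, whereas `md5sum` itself uses the current working directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Map each listed file that exists to True (digest matches) or False.

    Files that are listed but absent are skipped, mirroring --ignore-missing.
    Assumption: names are resolved relative to the checksum file's directory.
    """
    checksum_path = Path(checksum_file)
    results = {}
    for line in checksum_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()  # md5sum marks binary mode with '*'
        target = checksum_path.parent / name
        if not target.exists():
            continue
        results[name] = md5_of_file(target) == expected.lower()
    return results
```

Calling `verify_checksums("checksums.txt")` in the notebook's working directory would return a dictionary with one entry per downloaded FASTQ file; any `False` value indicates a file worth downloading again.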


### Install `kb`

Install `kb` for running the kallisto|bustools workflow.

In [0]:
!pip install git+https://github.com/pachterlab/kb_python@count-kite

Collecting git+https://github.com/pachterlab/kb_python@count-kite
  Cloning https://github.com/pachterlab/kb_python (to revision count-kite) to /tmp/pip-req-build-a0qz7ipg
  Running command git clone -q https://github.com/pachterlab/kb_python /tmp/pip-req-build-a0qz7ipg
  Running command git checkout -b count-kite --track origin/count-kite
  Switched to a new branch 'count-kite'
  Branch 'count-kite' set up to track remote branch 'count-kite' from 'origin'.
Collecting anndata>=0.6.22.post1
  Downloading https://files.pythonhosted.org/packages/2b/72/87196c15f68d9865c31a43a10cf7c50bcbcedd5607d09f9aada0b3963103/anndata-0.6.22.post1-py3-none-any.whl (47kB)
     |████████████████████████████████| 51kB 1.5MB/s 
Collecting loompy>=3.0.6
  Downloading https://files.pythonhosted.org/packages/36/52/74ed37ae5988522fbf87b856c67c4f80700e6452410b4cd80498c5f416f9/loompy-3.0.6.tar.gz (41kB)
     |████████████████████████████████| 51kB 5.2MB/s 
Collecting tqdm>=4.39.0
  Do

### Download human and mouse reference files

We will download the following files from Ensembl:
* Mouse genome (FASTA)
* Mouse genome annotations (GTF)
* Human genome (FASTA)
* Human genome annotations (GTF)

In [0]:
%%time
!wget ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
!wget ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
!wget ftp://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
!wget ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.gtf.gz

--2020-01-16 19:02:01--  ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
           => ‘Mus_musculus.GRCm38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-98/fasta/mus_musculus/dna ... done.
==> SIZE Mus_musculus.GRCm38.dna.primary_assembly.fa.gz ... 805984352
==> PASV ... done.    ==> RETR Mus_musculus.GRCm38.dna.primary_assembly.fa.gz ... done.
Length: 805984352 (769M) (unauthoritative)


2020-01-16 19:05:30 (3.76 MB/s) - ‘Mus_musculus.GRCm38.dna.primary_assembly.fa.gz’ saved [805984352]

--2020-01-16 19:05:31--  ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
           => ‘Mus_musculus.GRCm38.98.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org).

### Build the mixed species index

`kb` can build a single transcriptome index with multiple references. The FASTAs and GTFs must be passed in as a comma-separated list.

__Note:__ Because Google Colab offers limited RAM, we split the index into 4 parts with the `-n 4` option.

In [0]:
%%time
!kb ref -i mixed_index.idx -g mixed_t2g.txt -f1 mixed_cdna.fa -n 4 \
Mus_musculus.GRCm38.dna.primary_assembly.fa.gz,Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
Mus_musculus.GRCm38.98.gtf.gz,Homo_sapiens.GRCh38.98.gtf.gz

[2020-01-16 19:09:00,016]    INFO Preparing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz, Mus_musculus.GRCm38.98.gtf.gz
[2020-01-16 19:09:00,016]    INFO Decompressing Mus_musculus.GRCm38.98.gtf.gz to tmp
[2020-01-16 19:09:03,821]    INFO Creating transcript-to-gene mapping at /content/tmp/tmp_54cxwsm
[2020-01-16 19:09:41,067]    INFO Decompressing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz to tmp
[2020-01-16 19:10:07,073]    INFO Sorting tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa to /content/tmp/tmp8njkjc7p
[2020-01-16 19:17:20,153]    INFO Sorting tmp/Mus_musculus.GRCm38.98.gtf to /content/tmp/tmpdb4z6qdv
[2020-01-16 19:18:16,788]    INFO Splitting genome tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa into cDNA at /content/tmp/tmpbwrr1fgf
[2020-01-16 19:19:12,867]    INFO Wrote 142446 cDNA transcripts
[2020-01-16 19:19:12,870]    INFO Preparing Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, Homo_sapiens.GRCh38.98.gtf.gz
[2020-01-16 19:19:12,870]    INFO Decompressing Hom
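
The log above shows each FASTA being prepared together with the GTF of the same species, so it seems sensible to keep both comma-separated lists in the same species order. The sketch below builds the two arguments from one list of (FASTA, GTF) pairs so the orders cannot drift apart. The variable names are illustrative only; the file names are the ones downloaded earlier in this notebook.

```python
# One (FASTA, GTF) pair per species, in the order used by the kb ref call above.
references = [
    ("Mus_musculus.GRCm38.dna.primary_assembly.fa.gz", "Mus_musculus.GRCm38.98.gtf.gz"),
    ("Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz", "Homo_sapiens.GRCh38.98.gtf.gz"),
]

# Derive both comma-separated arguments from the same list of pairs.
fasta_argument = ",".join(fasta for fasta, _ in references)
gtf_argument = ",".join(gtf for _, gtf in references)

print(fasta_argument)
print(gtf_argument)
```

The two printed strings match the positional arguments passed to `kb ref` in the cell above.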

### Generate an RNA count matrix in H5AD format

The following command generates an RNA count matrix of cells (rows) by genes (columns) in H5AD format, a binary format used to store [AnnData](https://anndata.readthedocs.io/en/stable/) objects. Notice that we provide the index and transcript-to-gene mapping we built in the previous step to the `-i` and `-g` arguments, respectively. Because the index was split into 4 parts, the parts are passed to `-i` as a comma-separated list. These reads were generated with the 10x Genomics Chromium Single Cell v3 chemistry, hence the `-x 10xv3` argument. To view other supported technologies, run `kb --list`.

__Note:__ If you would like a Loom file instead, replace the `--h5ad` flag with `--loom`. If you want to use the raw matrix output by `kb` instead of the converted H5AD or Loom file, omit these flags.

In [0]:
%%time
!kb count -i mixed_index.idx.0,mixed_index.idx.1,mixed_index.idx.2,mixed_index.idx.3 \
-g mixed_t2g.txt -x 10xv3 -o output --h5ad -t 2 \
1k_hgmm_v3_S1_L001_R1_001.fastq.gz 1k_hgmm_v3_S1_L001_R2_001.fastq.gz \
1k_hgmm_v3_S1_L002_R1_001.fastq.gz 1k_hgmm_v3_S1_L002_R2_001.fastq.gz \
1k_hgmm_v3_S1_L003_R1_001.fastq.gz 1k_hgmm_v3_S1_L003_R2_001.fastq.gz \
1k_hgmm_v3_S1_L004_R1_001.fastq.gz 1k_hgmm_v3_S1_L004_R2_001.fastq.gz

[2020-01-16 19:57:50,686]    INFO Generating BUS file using 4 indices
[2020-01-16 19:57:50,686]    INFO Generating BUS file to output/tmp/bus_part0 from
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L001_R1_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L001_R2_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L002_R1_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L002_R2_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L003_R1_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L003_R2_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L004_R1_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO         1k_hgmm_v3_S1_L004_R2_001.fastq.gz
[2020-01-16 19:57:50,686]    INFO Using index mixed_index.idx.0
[2020-01-16 20:14:53,228]    INFO Generating BUS file to output/tmp/bus_part1 from
[2020-01-16 20:14:53,228]    INFO         1k_hgmm_v3_S1_L001_R1_001.fastq.gz
[2020-0
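
The `kb count` command above lists the FASTQ files lane by lane, each R1 file followed by its R2 mate. As a rough sketch, the same argument list could be generated rather than typed by hand. The helper `lane_fastqs` is hypothetical; the flags and file names are copied from the command in this notebook.

```python
import shlex


def lane_fastqs(sample, lanes):
    """Return FASTQ names lane by lane, each R1 file followed by its R2 mate."""
    files = []
    for lane in lanes:
        for read in ("R1", "R2"):
            files.append(f"{sample}_L{lane:03d}_{read}_001.fastq.gz")
    return files


fastqs = lane_fastqs("1k_hgmm_v3_S1", range(1, 5))

# The four index parts, named as in the kb count call above.
index_parts = ",".join(f"mixed_index.idx.{i}" for i in range(4))

command = [
    "kb", "count", "-i", index_parts, "-g", "mixed_t2g.txt",
    "-x", "10xv3", "-o", "output", "--h5ad", "-t", "2", *fastqs,
]

# Print a shell-ready version of the command for inspection.
print(shlex.join(command))
```

Building the list programmatically makes it harder to drop a lane or swap a read pair when the number of lanes changes.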

## Analysis

See [this notebook](https://github.com/pachterlab/MBGBLHGP_2019/blob/master/Supplementary_Figure_6_7/analysis/hgmm10k_v3_single_gene.Rmd) for how to process and load count matrices for a species mixing experiment.
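
As a rough illustration of the kind of question a species-mixing analysis asks, the sketch below assigns a cell to human, mouse, or mixed based on the fraction of its counts that fall on human genes. It is not the method used in the linked notebook. It assumes gene identifiers are Ensembl gene IDs, where human IDs start with `ENSG` and mouse IDs with `ENSMUSG`. The 0.9 purity threshold, the function names, and the toy counts are all illustrative assumptions.

```python
def species_of_gene(gene_id):
    """Guess the species from an Ensembl gene ID prefix (assumption of this sketch)."""
    if gene_id.startswith("ENSMUSG"):
        return "mouse"
    if gene_id.startswith("ENSG"):
        return "human"
    return None


def classify_cell(gene_counts, purity=0.9):
    """Label one cell from a {gene_id: count} mapping.

    purity is a hypothetical threshold: a cell is called human (or mouse) when
    at least that fraction of its species-assignable counts is human (or mouse).
    """
    totals = {"human": 0, "mouse": 0}
    for gene_id, count in gene_counts.items():
        species = species_of_gene(gene_id)
        if species is not None:
            totals[species] += count
    assigned = totals["human"] + totals["mouse"]
    if assigned == 0:
        return "unassigned"
    human_fraction = totals["human"] / assigned
    if human_fraction >= purity:
        return "human"
    if human_fraction <= 1 - purity:
        return "mouse"
    return "mixed"


# Toy counts for three cells; the gene IDs are for illustration only.
cells = {
    "cell_a": {"ENSG00000141510": 95, "ENSMUSG00000059552": 5},
    "cell_b": {"ENSG00000141510": 2, "ENSMUSG00000059552": 98},
    "cell_c": {"ENSG00000141510": 50, "ENSMUSG00000059552": 50},
}
labels = {name: classify_cell(counts) for name, counts in cells.items()}
print(labels)
```

Cells labelled mixed under such a rule are candidates for human-mouse doublets, which is what a 1:1 species-mixing experiment is designed to reveal.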