**In this notebook you will see how to go from FASTQ sequencing files to unspliced (U) and spliced (S) count matrices.**

We use the loom file output to store the U and S matrices for meK-Means inference.

## **Install Packages**

In [1]:
!pip3 install loompy --quiet


In [2]:
!pip install kb-python==0.27.2 --quiet

In [3]:
import loompy as lp

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



In [4]:
!wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz

In [5]:
!tar -vxzf sratoolkit.tar.gz

## **Download fastqs**

> We will be using the scMixology dataset with three human lung adenocarcinoma cell lines HCC827, H1975 and H2228 mixed together and sequenced with 10xv2 scRNAseq reagents (see the sc_10x data on the [scMixology Github](https://github.com/LuyiTian/sc_mixology)).

In [6]:
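# Three-cell-line mix, GSM3022245
# Only one run is downloaded for now because of Colab disk limits.
# The sratoolkit directory name below assumes version 3.1.0 was extracted above; adjust it if the download unpacks a different version.

!./sratoolkit.3.1.0-ubuntu64/bin/prefetch SRR6782109 --max-size 6000000000 -O ./ && ./sratoolkit.3.1.0-ubuntu64/bin/fasterq-dump --include-technical --split-files SRR6782109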



2024-04-04T17:41:03 prefetch.3.1.0: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2024-04-04T17:41:04 prefetch.3.1.0: 1) Downloading 'SRR6782109'...
2024-04-04T17:41:04 prefetch.3.1.0: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2024-04-04T17:41:04 prefetch.3.1.0:  Downloading via HTTPS...
2024-04-04T17:47:17 prefetch.3.1.0:  HTTPS download succeed
2024-04-04T17:47:43 prefetch.3.1.0:  'SRR6782109' is valid
2024-04-04T17:47:43 prefetch.3.1.0: 1) 'SRR6782109' was downloaded successfully
2024-04-04T17:47:43 prefetch.3.1.0: 'SRR6782109' has 0 unresolved dependencies
spots read      : 109,178,700
reads read      : 218,357,400
reads written   : 218,357,400
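
Before running kb-python it can help to confirm which FASTQ file holds the barcode/UMI read. The 10xv2 chemistry uses a 16 bp cell barcode followed by a 10 bp UMI, so that read should be 26 bp long, while the cDNA read is longer. The helper below is a minimal sketch using only the standard library; the function name `read_length_counts` and the `n_reads` parameter are illustrative and not part of any tool used in this notebook.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_counts(fastq_path, n_reads=10000):
    """Tally sequence lengths over the first n_reads records of a FASTQ file."""
    # Support both plain and gzip-compressed FASTQ files.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lengths = Counter()
    with opener(fastq_path, "rt") as handle:
        # A FASTQ record spans four lines; the sequence is the second line,
        # so take lines 1, 5, 9, ... (0-based) of the file.
        for seq in islice(handle, 1, 4 * n_reads, 4):
            lengths[len(seq.strip())] += 1
    return lengths


# Hypothetical usage in this notebook (file names as written by fasterq-dump above):
# print(read_length_counts("./SRR6782109_1.fastq"))
# print(read_length_counts("./SRR6782109_2.fastq"))
```

If the files follow the usual order, the first file should report a length of 26 and the second the cDNA read length, which matches the order in which the two files are passed to `kb count` later on.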


## **Run kb-python to get U/S Counts**

> The kb-python run writes a loom file that stores these count matrices. We will use the cells kept in the filtered barcodes output.

Since this is human data, we need the human reference kallisto index and intron/exon annotations (GRCh38-2020-A).



In [None]:
# Download the saved index and intron/exon files that were generated with kb ref

# If wget does not work, use the download link https://doi.org/10.22002/6wyra-tar37 to get refdata-gex-GRCh38-2020-A.tar.gz
!wget --content-disposition "https://data.caltech.edu/records/6wyra-tar37/files/refdata-gex-GRCh38-2020-A.tar.gz?download=1"
!tar -vxzf refdata-gex-GRCh38-2020-A.tar.gz


In [None]:
# #Equivalent mouse references here:
# !wget --content-disposition https://data.caltech.edu/records/1dd7a-cc411/files/refdata-gex-mm10-2020-A.tar.gz?download=1
# !tar -vxzf refdata-gex-mm10-2020-A.tar.gz

In [6]:
!mkdir -p scMix

In [None]:
! kb count --verbose \
-i ./refdata-gex-GRCh38-2020-A/index.idx \
-g ./refdata-gex-GRCh38-2020-A/t2g_grch38.txt \
-x 10xv2 \
-o ./scMix \
-t 2 \
-c1 ./refdata-gex-GRCh38-2020-A/cdna_t2c.txt \
-c2 ./refdata-gex-GRCh38-2020-A/intron_t2c.txt \
--workflow lamanno --filter bustools --overwrite --loom \
./SRR6782109_1.fastq \
./SRR6782109_2.fastq

# -t is set to 2 because Colab provides only two threads
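
Outside Colab, the same call can be assembled in Python so that the thread count and paths are easy to change. The sketch below only rebuilds the argument list shown in the cell above; the helper name `build_kb_count_cmd` and its parameters are hypothetical, and the default file names assume the layout of the human reference archive downloaded earlier.

```python
import os
import shlex


def build_kb_count_cmd(ref_dir, out_dir, fastqs, technology="10xv2",
                       t2g_name="t2g_grch38.txt", threads=None):
    """Assemble the `kb count` argument list used in this notebook."""
    # Fall back to the number of available CPUs when no thread count is given.
    threads = threads or os.cpu_count() or 1
    return [
        "kb", "count", "--verbose",
        "-i", f"{ref_dir}/index.idx",
        "-g", f"{ref_dir}/{t2g_name}",
        "-x", technology,
        "-o", out_dir,
        "-t", str(threads),
        "-c1", f"{ref_dir}/cdna_t2c.txt",
        "-c2", f"{ref_dir}/intron_t2c.txt",
        "--workflow", "lamanno", "--filter", "bustools", "--overwrite", "--loom",
        *fastqs,
    ]


cmd = build_kb_count_cmd(
    "./refdata-gex-GRCh38-2020-A",
    "./scMix",
    ["./SRR6782109_1.fastq", "./SRR6782109_2.fastq"],
    threads=2,
)
# Print the command for inspection; to execute it, pass `cmd` to
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```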

The commented-out code below shows how to create the index and txt files used above.
(A new index will be needed for kb-python >=0.28.)

In [None]:
# # Build the kallisto index (this runs out of memory on Colab)

# # Download genome information

# # The workflow has two steps:
# # 1.   kb ref makes a kallisto index of introns and exons (below).
# # 2.   kb count generates U/S counts for each cell across the reference transcriptome (above).


# !wget --content-disposition https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
# #refdata-gex-mm10-2020-A for mouse

# !tar -zxvf refdata-gex-GRCh38-2020-A.tar.gz

# !kb ref -i ./refdata-gex-GRCh38-2020-A/index.idx \
# -g ./refdata-gex-GRCh38-2020-A/t2g_grch38.txt \
# -f1 ./refdata-gex-GRCh38-2020-A/cdna.fa \
# -f2 ./refdata-gex-GRCh38-2020-A/intron.fa \
# -c1 ./refdata-gex-GRCh38-2020-A/cdna_t2c.txt \
# -c2 ./refdata-gex-GRCh38-2020-A/intron_t2c.txt \
# --workflow lamanno \
# ./refdata-gex-GRCh38-2020-A/fasta/genome.fa \
# ./refdata-gex-GRCh38-2020-A/genes/genes.gtf

**Here is the final loom file that we will use for analysis**

In [None]:
ds = lp.connect('./scMix/counts_filtered/adata.loom')

Unspliced and spliced counts are stored in the loom file's layers.

In [None]:
# Matrices are gene x cell (rows are genes, columns are cells)
print(ds.layers['unspliced'][:,:].shape) #or 'spliced'

Gene names and cell barcodes are stored under the row attribute 'gene_name' and the column attribute 'barcode', respectively.

In [None]:
ds.ra['gene_name'][0:10]

In [None]:
ds.ca['barcode'][0:10]

In [None]:
ds.close()
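
The cells above open the loom file, read the layers and attributes one at a time, and close the connection by hand. As a minimal sketch, the same steps can be wrapped in a helper that returns the U and S matrices together with the gene names and barcodes. The function names `load_unspliced_spliced` and `load_from_path` are illustrative; the layer and attribute names are the ones used above. `loompy` is imported inside `load_from_path` so that the first helper has no hard dependency on it.

```python
def load_unspliced_spliced(ds):
    """Return (U, S, gene_names, barcodes) from an open loom connection."""
    # Slicing with [:, :] reads the full gene x cell matrix into memory.
    U = ds.layers["unspliced"][:, :]
    S = ds.layers["spliced"][:, :]
    if U.shape != S.shape:
        raise ValueError(f"U {U.shape} and S {S.shape} have different shapes")
    return U, S, ds.ra["gene_name"], ds.ca["barcode"]


def load_from_path(path):
    """Open a loom file read-only, load U/S counts, and close it again."""
    import loompy as lp  # assumed to be installed as in the first cell

    # The context manager closes the file even if loading raises an error.
    with lp.connect(path, mode="r") as ds:
        return load_unspliced_spliced(ds)


# Hypothetical usage with the file written by kb count above:
# U, S, genes, barcodes = load_from_path("./scMix/counts_filtered/adata.loom")
# print(U.shape, S.shape, len(genes), len(barcodes))
```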