<a href="https://colab.research.google.com/github/lauraluebbert/LSEP_2023/blob/main/Fig1/Validation/1_smartseq_generate_count_matrix.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Validation using SARS-CoV-2-infected human iPSC-derived cardiomyocytes
Data source: https://doi.org/10.1016/j.xcrm.2020.100052  

___
# Install software

In [1]:
!pip install -q ffq gget kb_python


In [30]:
# Install kb_python, kallisto and bustools from dev branch
# After the official release, everything in this cell will become 'pip install kb-python'
!pip install -q git+https://github.com/pachterlab/kb_python.git@devel

!git clone https://github.com/pachterlab/kallisto.git
!mkdir kallisto/build && cd kallisto/build && cmake .. && make && make install

!git clone https://github.com/BUStools/bustools.git && cd bustools && git checkout devel
!mkdir bustools/build && cd bustools/build && cmake .. && make && make install

  Preparing metadata (setup.py) ... done
Cloning into 'kallisto'...
remote: Enumerating objects: 7711, done.
remote: Counting objects: 100% (735/735), done.
remote: Compressing objects: 100% (165/165), done.
remote: Total 7711 (delta 590), reused 572 (delta 570), pack-reused 6976
Receiving objects: 100% (7711/7711), 8.73 MiB | 12.43 MiB/s, done.
Resolving deltas: 100% (5102/5102), done.
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test COMPILER_SUPPORTS_CXX17
-- Performing Test COMPILER_SUPPORTS_CXX

# Download SMART-Seq data

In [39]:
import json
import glob

In [13]:
# Get ftp download links for raw data with ffq and store results in json file
!ffq SRR11777734 SRR11777735 SRR11777736 SRR11777737 SRR11777738 SRR11777739 \
    --ftp \
    -o ffq.json

[2023-06-28 20:44:47,919]    INFO Parsing run SRR11777734
[2023-06-28 20:44:49,634]    INFO Parsing run SRR11777735
[2023-06-28 20:44:51,372]    INFO Parsing run SRR11777736
[2023-06-28 20:44:52,802]    INFO Parsing run SRR11777737
[2023-06-28 20:44:54,275]    INFO Parsing run SRR11777738
[2023-06-28 20:44:55,882]    INFO Parsing run SRR11777739


In [None]:
# Load ffq output
with open("ffq.json") as f:
    data_json = json.load(f)

In [15]:
# Download raw data using FTP links fetched by ffq
for dataset in data_json:
    url = dataset["url"]
    !curl -O $url

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2962M  100 2962M    0     0  31.3M      0  0:01:34  0:01:34 --:--:-- 31.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2634M  100 2634M    0     0  30.2M      0  0:01:27  0:01:27 --:--:-- 31.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2484M  100 2484M    0     0  31.0M      0  0:01:20  0:01:20 --:--:-- 31.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3018M  100 3018M    0     0  31.3M      0  0:01:36  0:01:36 --:--:-- 31.1M
  % Total    % Received % Xferd  Average Speed   Tim
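
Large FASTQ downloads can be silently truncated, so it may be worth checking the files before aligning. The following is a minimal sketch, assuming each entry in the `ffq --ftp` output carries an `md5` field next to `url` (an assumption about the JSON layout; inspect `ffq.json` to confirm). `md5sum` and `find_bad_downloads` are hypothetical helpers, not part of `ffq`.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_downloads(entries, directory="."):
    """Return names of files that are missing or whose MD5 differs from the record.

    Assumes every entry is a dict with "url" and "md5" keys (hypothetical layout).
    """
    bad = []
    for entry in entries:
        # The last URL path component is the file name used for the local copy.
        name = entry["url"].rsplit("/", 1)[-1]
        path = Path(directory) / name
        if not path.is_file() or md5sum(path) != entry["md5"]:
            bad.append(name)
    return bad
```

Calling `find_bad_downloads(data_json)` after the download loop should return an empty list if every file arrived intact; any names it returns are candidates for re-downloading.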

In [54]:
# Generate batch file pointing to each of the fastqs
with open("batch.txt", "w") as batchfile:
    for fastq in glob.glob("*fastq.gz"):
        batchfile.write(fastq.split(".")[0] + "\t" + fastq + "\n")

In [16]:
# Download PalmDB reference files
# Download the ID to taxonomy mapping
!curl -O https://raw.githubusercontent.com/lauraluebbert/LSEP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv?token=GHSAT0AAAAAAB5INUMZR6BPENGRSFHMZDUGZE4VQJA
# Download the customized transcripts to gene mapping
!curl -O https://raw.githubusercontent.com/lauraluebbert/LSEP_2023/main/PalmDB/palmdb_clustered_t2g.txt?token=GHSAT0AAAAAAB5INUMYWA4NZMY7FLBXGWBWZE4VQWQ
# Download the RdRP amino acid sequences
!curl -O https://raw.githubusercontent.com/lauraluebbert/LSEP_2023/main/PalmDB/palmdb_rdrp_seqs.fa?token=GHSAT0AAAAAAB5INUMYDHDHXFO2JKKSSNQSZE4VRBQ

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18.7M  100 18.7M    0     0  19.0M      0 --:--:-- --:--:-- --:--:-- 19.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4454k  100 4454k    0     0  11.7M      0 --:--:-- --:--:-- --:--:-- 11.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 33.7M  100 33.7M    0     0  41.0M      0 --:--:-- --:--:-- --:--:-- 41.0M


# Build virus reference index from PalmDB amino acid sequences and mask host sequences
You can find the `kb` manual and tutorials [here](https://www.kallistobus.tools/).

The `--aa` argument tells `kb` that this is an amino acid reference.  

The `--d-list` argument is the path to a FASTA file of **host** sequences, which will be masked in the index. Here, we use [`gget`](https://github.com/pachterlab/gget) to fetch the FTP download link for the human genome (Ensembl release 109) and pass it directly to `kb`.

We are using `--workflow custom` here since we do not have a .gtf file for the PalmDB fasta file.

Building the index will take some time (~20 min), since the human genome is quite large.
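
Since `--aa` only makes sense when the reference FASTA holds amino acid sequences, a quick sanity check before indexing can catch a mixed-up file. The sketch below is a rough heuristic, not something `kb` does itself: `looks_like_protein` is a hypothetical helper that reports whether any sampled sequence line uses letters outside the nucleotide alphabet (A, C, G, T, U, N).

```python
def looks_like_protein(fasta_lines, max_records=100):
    """Heuristic: True if sampled sequences use letters outside A, C, G, T, U, N."""
    nucleotide_letters = set("ACGTUN")
    records_seen = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: count the record and stop once enough were sampled.
            records_seen += 1
            if records_seen > max_records:
                break
            continue
        # Any character that is not a nucleotide letter suggests amino acids.
        if set(line.upper()) - nucleotide_letters:
            return True
    return False
```

For example, opening `palmdb_rdrp_seqs.fa` with `open(...)` and passing the file handle to `looks_like_protein` is expected to return `True` for an amino acid reference. Short peptides made up only of A, C, G, T, and N would be misclassified, so treat the result as a hint rather than a guarantee.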

In [41]:
!gget ref -r 109 -w dna --ftp human -d

Wed Jun 28 21:56:26 2023 INFO Fetching reference information for homo_sapiens from Ensembl release: 109.
http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  840M  100  840M    0     0   629k      0  0:22:46  0:22:46 --:--:--  656k


In [50]:
%%time
!kb ref \
  --overwrite --verbose \
  --workflow custom \
  --aa \
  --d-list Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
  -t 2 \
  -i index.idx \
  --kallisto /usr/local/bin/kallisto \
  --bustools /usr/local/bin/bustools \
  palmdb_rdrp_seqs.fa

[2023-06-28 23:04:31,786]   DEBUG [main] Printing verbose output
[2023-06-28 23:04:33,992]   DEBUG [main] kallisto binary located at /usr/local/bin/kallisto
[2023-06-28 23:04:33,992]   DEBUG [main] bustools binary located at /usr/local/bin/bustools
[2023-06-28 23:04:33,992]   DEBUG [main] Creating `tmp` directory
[2023-06-28 23:04:33,992]   DEBUG [main] Namespace(list=False, command='ref', tmp=None, keep_tmp=False, verbose=True, i='index.idx', g=None, f1=None, include_attribute=None, exclude_attribute=None, f2=None, c1=None, c2=None, d=None, k=None, t=2, d_list='Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz', aa=True, workflow='custom', distinguish=False, make_unique=False, overwrite=True, kallisto='/usr/local/bin/kallisto', bustools='/usr/local/bin/bustools', fasta='palmdb_rdrp_seqs.fa', gtf=None, feature=None, no_mismatches=False, flank=None)
[2023-06-28 23:04:33,993]    INFO [ref_custom] Indexing palmdb_rdrp_seqs.fa to index.idx
[2023-06-28 23:04:33,993]   DEBUG [ref_custom] kallis

In [None]:
%%time
!kb ref \
  --workflow custom \
  --aa \
  --d-list $(gget ref -r 109 -w dna --ftp human) \
  -t 2 \
  -i index.idx \
  --kallisto /usr/local/bin/kallisto \
  --bustools /usr/local/bin/bustools \
  palmdb_rdrp_seqs.fa

# Align sequencing data and generate virus count matrix
The `-x` technology argument tells `kb` where to find the barcode and UMI in the data. We will treat the SMART-Seq data like bulk data for this validation.  

Instead of passing one fastq file at a time, we are using a batch file to tell `kb` where to find all of the data at once.
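
The batch file written earlier in this notebook has one line per sample: a sample ID, a tab, and the FASTQ path. Before starting a long `kb count` run, it can help to confirm that the IDs are unique and that every listed file exists. This is a minimal sketch under that assumed layout; `check_batch` is a hypothetical helper, not part of `kb`.

```python
import os
from collections import Counter


def check_batch(batch_path):
    """Return (duplicate_ids, missing_files) for a tab-separated batch file.

    Assumes each non-empty line holds a sample ID followed by one or more
    FASTQ paths, all separated by tabs.
    """
    ids, missing = [], []
    with open(batch_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            fields = line.split("\t")
            sample_id, fastq_paths = fields[0], fields[1:]
            ids.append(sample_id)
            # Record every listed path that is not present on disk.
            missing.extend(p for p in fastq_paths if not os.path.isfile(p))
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    return duplicates, missing
```

Running `check_batch("batch.txt")` should return two empty lists when the batch file is consistent with the files in the working directory.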

In [None]:
%%time
!kb count \
  --aa \
  --h5ad \
  -t 2 \
  -i index.idx \
  -g palmdb_clustered_t2g.txt \
  -x bulk \
  --parity single \
  -o kb_results \
  --kallisto /usr/local/bin/kallisto \
  --bustools /usr/local/bin/bustools \
  batch.txt

[2023-06-29 01:25:26,165]    INFO [count] Using index index.idx to generate BUS file to kb_results from
[2023-06-29 01:25:26,165]    INFO [count]         /content/kb_results/tmp/tmp3sqn70d4
