# Validation using SARS-CoV-2-infected human iPSC-derived cardiomyocytes
Data source: https://doi.org/10.1016/j.xcrm.2020.100052  

___
# Install software

In [1]:
!pip install -q ffq gget kb_python


# Download SMART-Seq data

In [2]:
import json
import glob

In [3]:
# Get ftp download links for raw data with ffq and store results in json file
!ffq SRR11777734 SRR11777735 SRR11777736 SRR11777737 SRR11777738 SRR11777739 \
    --ftp \
    -o ffq.json

[2023-12-08 19:50:15,033]    INFO Parsing run SRR11777734
[2023-12-08 19:50:17,153]    INFO Parsing run SRR11777735
[2023-12-08 19:50:19,009]    INFO Parsing run SRR11777736
[2023-12-08 19:50:20,833]    INFO Parsing run SRR11777737
[2023-12-08 19:50:22,663]    INFO Parsing run SRR11777738
[2023-12-08 19:50:24,488]    INFO Parsing run SRR11777739


In [4]:
# Load ffq output
with open("ffq.json") as f:
    data_json = json.load(f)
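
Before starting the download, it can help to check how many files ffq returned and how much data they amount to. The following is a minimal sketch, not part of the original analysis: it assumes each record is a dictionary with a `filesize` field in bytes (only the `url` key is confirmed by the next cell), and the example record is made up.

```python
def summarize_records(records):
    """Return (number of files, total size in GB) for a list of ffq-style records."""
    # "filesize" is an assumed field name; records without it count as 0 bytes
    total_bytes = sum(int(r.get("filesize", 0)) for r in records)
    return len(records), total_bytes / 1e9

# Hypothetical record for illustration only; real records come from ffq.json
example_records = [
    {"url": "ftp://example.org/SRRXXXXXXXX.fastq.gz", "filesize": 3_000_000_000},
]
n_files, total_gb = summarize_records(example_records)
print(f"{n_files} file(s), {total_gb:.1f} GB to download")
```

In the notebook, `summarize_records(data_json)` would be called on the list loaded above.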

In [5]:
# Download raw data using FTP links fetched by ffq
for dataset in data_json:
    url = dataset["url"]
    !curl -O $url

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2962M  100 2962M    0     0  42.5M      0  0:01:09  0:01:09 --:--:-- 38.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2634M  100 2634M    0     0  38.7M      0  0:01:07  0:01:07 --:--:-- 39.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2484M  100 2484M    0     0  43.0M      0  0:00:57  0:00:57 --:--:-- 44.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3018M  100 3018M    0     0  43.1M      0  0:01:09  0:01:09 --:--:-- 44.5M
  % Total    % Received % Xferd  Average Speed   Tim
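
Since each file is several gigabytes, it may be worth confirming that the downloads are complete before aligning. The sketch below compares the MD5 checksum of each downloaded file with the one in its ffq record. It assumes the records carry an `md5` field alongside `url`, which is an assumption about the ffq output; the function names are hypothetical helpers.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_corrupt_downloads(records):
    """Return the local filenames whose checksum differs from the record's md5."""
    bad = []
    for record in records:
        # curl -O saves each file under the last component of its URL
        filename = record["url"].split("/")[-1]
        if md5_of_file(filename) != record["md5"]:  # "md5" key is assumed
            bad.append(filename)
    return bad

# Usage in the notebook (assuming the files are in the working directory):
# corrupt = find_corrupt_downloads(data_json)
```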

In [6]:
# The data is split across several FASTQ files, so we generate a batch file
# with one line per file: sample ID (filename up to the first "."), a tab, and the path
with open("batch.txt", "w") as batchfile:
    for fastq in sorted(glob.glob("*fastq.gz")):
        batchfile.write(fastq.split(".")[0] + "\t" + fastq + "\n")
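
The batch file layout can be previewed without touching the disk. This is a small sketch of the same naming rule as the cell above; the filenames are hypothetical, and sorting is added only so the line order is reproducible, since `glob` does not guarantee an order.

```python
def batch_lines(fastqs):
    """Map each FASTQ filename to a 'sample_id<TAB>filename' line, in sorted order."""
    return [f"{fq.split('.')[0]}\t{fq}" for fq in sorted(fastqs)]

# Hypothetical filenames for illustration
for line in batch_lines(["sampleB.fastq.gz", "sampleA.fastq.gz"]):
    print(line)
```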

In [17]:
# Download PalmDB reference files

# Download the ID to taxonomy mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
# Download the customized transcripts to gene mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa

--2023-12-08 20:35:13--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19705497 (19M) [text/plain]
Saving to: ‘ID_to_taxonomy_mapping.csv’


2023-12-08 20:35:13 (153 MB/s) - ‘ID_to_taxonomy_mapping.csv’ saved [19705497/19705497]

--2023-12-08 20:35:13--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4561689 (4.3M) [text/plain]
Saving to: 

# Build virus reference index from PalmDB amino acid sequences and mask host sequences
You can find the `kb` manual and tutorials [here](https://www.kallistobus.tools/).

The `--aa` argument tells `kb` that this is an amino acid reference.  

The `--d-list` argument is the path to the **host** sequences, here the concatenated human genome and transcriptome. These sequences will be masked in the index. We use [`gget`](https://github.com/pachterlab/gget) to fetch the human genome and transcriptome (Ensembl release 110).

We are using `--workflow custom` here since we do not have a .gtf file for the PalmDB fasta file.

Building the index will take some time (~20 min), since the human genome is quite large.

In [8]:
!gget ref -r 110 -w cdna,dna -d human

Fri Dec  8 19:57:56 2023 INFO Fetching reference information for homo_sapiens from Ensembl release: 110.
{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 110,
            "release_date": "2023-04-22",
            "release_time": "04:25",
            "bytes": "75M"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 110,
            "release_date": "2023-04-21",
            "release_time": "17:28",
            "bytes": "841M"
        }
    }
}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75.2M  100 75.2M    0     0   627k      0  0:02:02  0:02:02 --:--:--  642k
  % Total    

In [9]:
# Concatenate human genome and transcriptome into one file
!cat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz > Homo_sapiens.GRCh38.cdna_dna.fa.gz
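
Concatenating the two compressed files directly with `cat` works because the gzip format allows several compressed members to be joined back to back; decompressing the result yields the concatenated contents. A small, self-contained illustration with in-memory data (the FASTA records are made up):

```python
import gzip

# Two independently compressed "files" with made-up FASTA records
cdna = gzip.compress(b">transcript1\nACGT\n")
dna = gzip.compress(b">chr1\nTTGGCCAA\n")

# Joining the bytes has the same effect as `cat a.gz b.gz > combined.gz`
combined = cdna + dna

# Python's gzip module reads multi-member data as one stream
print(gzip.decompress(combined).decode())
```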

In [18]:
%%time
!kb ref \
  --workflow custom \
  --aa \
  --d-list /content/Homo_sapiens.GRCh38.cdna_dna.fa.gz \
  -t 20 \
  -i index.idx \
  /content/palmdb_rdrp_seqs.fa

[2023-12-08 20:35:28,359]   DEBUG [main] Printing verbose output
[2023-12-08 20:35:30,565]   DEBUG [main] kallisto binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/kallisto/kallisto
[2023-12-08 20:35:30,565]   DEBUG [main] bustools binary located at /usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bustools
[2023-12-08 20:35:30,565]   DEBUG [main] Creating `tmp` directory
[2023-12-08 20:35:30,566]   DEBUG [main] Namespace(list=False, command='ref', tmp=None, keep_tmp=False, verbose=True, i='index.idx', g=None, f1=None, include_attribute=None, exclude_attribute=None, f2=None, c1=None, c2=None, d=None, k=None, t=20, d_list='/content/Homo_sapiens.GRCh38.cdna_dna.fa.gz', d_list_overhang=1, aa=True, workflow='custom', distinguish=False, make_unique=False, overwrite=True, kallisto='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/usr/local/lib/python3.10/dist-packages/kb_python/bins/linux/bustools/bu

# Align sequencing data and generate virus count matrix
The `-x` technology argument tells `kb` where to find the barcode and UMI in the data. We will treat the SMART-Seq data like bulk data for this validation.

Instead of passing one fastq file at a time, we are using a batch file to tell `kb` where to find all of the data at once.

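`--batch-barcodes` stores the sample identifiers in the barcodes. Note that in the run recorded below, the installed `kb` version rejected this flag in combination with `-x bulk`, so the command exited with an error before any alignment took place.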

In [19]:
%%time
!kb count \
  --aa \
  --h5ad \
  -t 20 \
  -i index.idx \
  -g palmdb_clustered_t2g.txt \
  -x bulk \
  --parity single \
  -o kb_results \
  --batch-barcodes \
  batch.txt

usage: kb [-h] [--list] <CMD> ...
kb: error: `--batch-barcodes` may not be used for technology BULK
CPU times: user 50.7 ms, sys: 9.99 ms, total: 60.7 ms
Wall time: 8.13 s


Download generated alignment:

In [20]:
from google.colab import files

!zip -r kb_results.zip kb_results
files.download("kb_results.zip")

updating: kb_results/ (stored 0%)

