<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/LSCHWCP_2023/Notebooks/Figure_3%20/Figure_3a/3_human_SARSCoV_validation_smartseq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Validation using SARS-CoV-2-infected human iPSC-derived cardiomyocytes
Data source: https://doi.org/10.1016/j.xcrm.2020.100052  

___
# Install software

In [None]:
!pip install -q ffq gget kb_python

# Download SMART-Seq data

In [None]:
import json
import glob

In [None]:
# Get ftp download links for raw data with ffq and store results in json file
!ffq SRR11777734 SRR11777735 SRR11777736 SRR11777737 SRR11777738 SRR11777739 \
    --ftp \
    -o ffq.json

In [None]:
# Load ffq output
with open("ffq.json") as f:
    data_json = json.load(f)

In [None]:
# Download raw data using FTP links fetched by ffq
for dataset in data_json:
    url = dataset["url"]
    !curl -O $url

In [None]:
# Since the data is split into many fastq files, we will generate a batch file pointing to each of the fastqs
# Each line has the form: <sample ID> <tab> <fastq file name>
with open("batch.txt", "w") as batchfile:
    for fastq in glob.glob("*fastq.gz"):
        batchfile.write(fastq.split(".")[0] + "\t" + fastq + "\n")
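
The batch file has one line per FASTQ file: a sample ID, a tab, and the file name. The sketch below builds those lines in a standalone function so they can be inspected before anything is written to disk. `make_batch_lines` is a hypothetical helper; the sorted order and the duplicate-ID check are additions that the cell above does not perform.

```python
def make_batch_lines(fastq_paths):
    """Build "sample_id<TAB>path" lines, using the text before the first "." as the ID."""
    lines = []
    seen = set()
    for path in sorted(fastq_paths):  # sorted for a reproducible sample order
        sample_id = path.split(".")[0]
        if sample_id in seen:
            raise ValueError(f"Duplicate sample ID: {sample_id}")
        seen.add(sample_id)
        lines.append(f"{sample_id}\t{path}")
    return lines
```

For example, `print("\n".join(make_batch_lines(glob.glob("*fastq.gz"))))` previews the batch file contents for the downloaded data.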

In [None]:
# Download PalmDB reference files
# Download the ID to taxonomy mapping
!curl -O https://raw.githubusercontent.com/lauraluebbert/LSEP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
# Download the customized transcripts to gene mapping
!curl -O https://raw.githubusercontent.com/lauraluebbert/LSEP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!curl -O https://raw.githubusercontent.com/lauraluebbert/LSEP_2023/main/PalmDB/palmdb_rdrp_seqs.fa

# Build virus reference index from PalmDB amino acid sequences and mask host sequences
You can find the `kb` manual and tutorials [here](https://www.kallistobus.tools/).

The `--aa` argument tells `kb` that this is an amino acid reference.  

The `--d-list` argument is the path to the **host** sequences, which will be masked in the index. Here, we use [`gget`](https://github.com/pachterlab/gget) to fetch the human genome and transcriptome (Ensembl release 110) and concatenate them into a single file.

We are using `--workflow custom` here since we do not have a .gtf file for the PalmDB fasta file.

Building the index will take some time (~20 min), since the human genome is quite large.

In [None]:
!gget ref -r 110 -w cdna,dna -d human

In [None]:
# Concatenate human genome and transcriptome into one file
!cat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz > Homo_sapiens.GRCh38.cdna_dna.fa.gz
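
Concatenating two gzip files with `cat` yields a valid multi-member gzip file, so the archives do not need to be decompressed before merging. The sketch below illustrates this with two small in-memory FASTA records; the sequences and record names are made up for the demonstration.

```python
import gzip

# Two separately compressed FASTA snippets (toy sequences, not real data)
part_a = gzip.compress(b">transcript_1\nACGT\n")
part_b = gzip.compress(b">chromosome_1\nGGCC\n")

# Byte-level concatenation, equivalent to `cat a.gz b.gz > merged.gz`
merged = part_a + part_b

# gzip.decompress reads every member and returns their joined contents
combined = gzip.decompress(merged)
print(combined.decode())
```

Both records come back in order, which is the behavior the merged host FASTA relies on.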

In [None]:
%%time
!kb ref \
  --overwrite --verbose \
  --workflow custom \
  --aa \
  --d-list Homo_sapiens.GRCh38.cdna_dna.fa.gz \
  -t 20 \
  -i index.idx \
  palmdb_rdrp_seqs.fa

# Align sequencing data and generate virus count matrix
The `-x` technology argument tells `kb` where to find the barcode and UMI in the data. We will treat the SMART-Seq data like bulk data for this validation.  

Instead of passing one fastq file at a time, we are using a batch file to tell `kb` where to find all of the data at once.

`--batch-barcodes` stores the sample identifiers in the barcodes.

In [None]:
%%time
!kb count \
  --aa \
  --h5ad \
  -t 20 \
  -i index.idx \
  -g palmdb_clustered_t2g.txt \
  -x bulk \
  --parity single \
  -o kb_results \
  --batch-barcodes \
  batch.txt

Download generated alignment:

In [None]:
from google.colab import files

!zip -r kb_results.zip kb_results
files.download("kb_results.zip")