<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/Notebooks/Figure_3/Figure_3a/3_human_SARSCoV_validation_smartseq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Validation using SARS-CoV-2 infected human iPSC-derived cardiomyocytes
Data from https://www.cell.com/cell-reports-medicine/pdf/S2666-3791(20)30068-9.pdf

In [None]:
# Number of threads to use during alignments
threads = 8 # Change to 2 if not using TPU runtime

## Install software

In [None]:
!pip install -q ffq gget kb_python

## Download SMART-Seq data

In [None]:
import json
import glob

# Get ftp download links for raw data with ffq and store results in json file
!ffq SRR11777734 SRR11777735 SRR11777736 SRR11777737 SRR11777738 SRR11777739 \
    --ftp \
    -o ffq.json

# Load ffq output (the context manager closes the file automatically)
with open("ffq.json") as f:
    data_json = json.load(f)

# Download raw data using FTP links fetched by ffq
for dataset in data_json:
    url = dataset["url"]
    !curl -O $url
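
The download cell relies on each entry of the ffq JSON output carrying a `url` field. As a minimal sketch of that step outside of notebook magics, the helper below extracts only the FASTQ links from such a list. The function name `fastq_urls` and the example entries are hypothetical placeholders, not real ffq output.

```python
import json


def fastq_urls(ffq_json_text):
    """Return the URLs ending in .fastq.gz from ffq --ftp style JSON text."""
    entries = json.loads(ffq_json_text)
    return [entry["url"] for entry in entries if entry["url"].endswith(".fastq.gz")]


# Hypothetical example entries (placeholders, not real ffq output)
example = json.dumps([
    {"accession": "SRRXXXXXXX", "url": "ftp://example.org/SRRXXXXXXX.fastq.gz"},
    {"accession": "SRRXXXXXXX", "url": "ftp://example.org/SRRXXXXXXX.bam"},
])

urls = fastq_urls(example)
print(urls)
```

Filtering on the file suffix is a design choice for this sketch; it keeps the later `glob.glob("*.fastq.gz")` step consistent with what was downloaded.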

## Download optimized PalmDB reference files


In [None]:
# Download the ID to taxonomy mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/ID_to_taxonomy_mapping.csv
# Download the customized transcripts to gene mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa

## Build virus reference index from PalmDB amino acid sequences and mask host (here, human) sequences
You can find the kb manual and tutorials [here](https://www.kallistobus.tools/).

The `--aa` argument tells kb that this is an amino acid reference.

The `--d-list` argument is the path to the host sequences. These sequences will be masked in the index. Here, we use gget to fetch the human genome and transcriptome (Ensembl release 110) and concatenate them into a single file.

We are using `--workflow custom` here since we do not have a .gtf file for the PalmDB fasta file.

Building the index will take some time (~20 min), since the human genome is quite large.

In [None]:
!gget ref -r 110 -w cdna,dna -d human

In [None]:
# Concatenate human genome and transcriptome into one file
!cat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz > Homo_sapiens.GRCh38.cdna_dna.fa.gz
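
Concatenating two `.gz` files with `cat` works because the gzip format allows several compressed members back to back in one file. The sketch below illustrates this with Python's standard `gzip` module on two tiny made-up FASTA records (the sequences are placeholders, not real reference data).

```python
import gzip

# Two independently compressed "files" (toy FASTA records)
part_a = gzip.compress(b">transcript_1\nACGT\n")
part_b = gzip.compress(b">chromosome_1\nTTGA\n")

# Byte-level concatenation, equivalent to `cat a.fa.gz b.fa.gz > ab.fa.gz`
combined = part_a + part_b

# gzip.decompress handles multi-member gzip data
decompressed = gzip.decompress(combined)
print(decompressed.decode())
```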

In [None]:
%%time
!kb ref \
  --workflow custom \
  --aa \
  --d-list Homo_sapiens.GRCh38.cdna_dna.fa.gz \
  -t $threads \
  -i index.idx \
  palmdb_rdrp_seqs.fa

## Align data using kallisto translated search

Get fastq files:

In [None]:
fastqs = []
for filename in glob.glob("*.fastq.gz"):
    fastqs.append(filename.split("/")[-1])

fastqs.sort()
fastqs

Loop over the files and align them one at a time (alternative: use a batch file to align multiple fastqs at the same time). The `-x` technology argument tells kb where to find the barcode and UMI in the data. We will treat the SMART-Seq data like bulk data for this validation.

In [None]:
%%time
for fastq in fastqs:
    sample = fastq.split(".fastq.gz")[0]

    # !mkdir -p $sample

    !kb count \
        --aa \
        -t $threads \
        -i index.idx \
        -g palmdb_clustered_t2g.txt \
        -x bulk \
        --parity single \
        -o $sample \
        $fastq

    # !$kallisto bus \
    #         -i index.idx \
    #         -o $sample/ \
    #         --aa \
    #         -t 30 \
    #         -x bulk \
    #         $fastq_folder/$fastq \
    #         &> $sample/kb_out.txt

    # !$bustools sort -o $sample/output_sorted.bus $sample/output.bus

    # !$bustools count \
    #     --genecounts \
    #     --cm \
    #     -o $sample/bustools_count/ \
    #     -g $virus_t2g \
    #     -e $sample/matrix.ec \
    #     -t $sample/transcripts.txt \
    #     $sample/output_sorted.bus
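
Outside of a notebook, the same per-file loop could be driven from plain Python. The following is only a sketch: it assembles the `kb count` argument list used above for one fastq file, and the helper name `build_kb_count_cmd` is hypothetical. It does not run kb itself.

```python
import shlex


def build_kb_count_cmd(fastq, threads, index="index.idx", t2g="palmdb_clustered_t2g.txt"):
    """Assemble the kb count call used in this notebook for a single fastq file."""
    # Sample name = file name without the .fastq.gz suffix
    sample = fastq.split(".fastq.gz")[0]
    return [
        "kb", "count",
        "--aa",
        "-t", str(threads),
        "-i", index,
        "-g", t2g,
        "-x", "bulk",
        "--parity", "single",
        "-o", sample,
        fastq,
    ]


cmd = build_kb_count_cmd("SRR11777734.fastq.gz", threads=8)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting issues if a file name ever contains spaces.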

## Plot virus counts

In [None]:
!pip install -q kb_python

import kb_python.utils as kb_utils
import anndata
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.colors
%config InlineBackend.figure_format='retina'

def nd(arr):
    """
    Flatten a numpy matrix (or any array-like) into a 1-D ndarray.
    """
    return np.asarray(arr).reshape(-1)
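
As a small illustration of why `nd` is handy: sums over a `numpy.matrix` stay two-dimensional, and `nd` flattens them into a plain 1-D array. The toy matrix below is made up for this example.

```python
import numpy as np


def nd(arr):
    """Flatten a numpy matrix (or any array-like) into a 1-D ndarray."""
    return np.asarray(arr).reshape(-1)


m = np.matrix([[1, 2], [3, 4]])  # toy data
col_sums = m.sum(axis=0)         # still 2-D: shape (1, 2)
flat = nd(col_sums)              # 1-D: shape (2,)
print(col_sums.shape, flat.shape, flat)
```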

In [None]:
u_tax_csv = "ID_to_taxonomy_mapping.csv"

Create adata objects from count matrices:

In [None]:
adatas = []
for fastq in fastqs:
    # Load data
    sample = fastq.split(".fastq.gz")[0]

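    # NOTE: these paths assume the bustools_count/ layout of the manual workflow
    # commented out above; adjust them if your kb count run wrote its matrix
    # elsewhere (e.g. under counts_unfiltered/).
    # Filepath to counts
    X = f"{sample}/bustools_count/output.mtx"
    # Filepath to gene metadata
    var_path = f"{sample}/bustools_count/output.genes.txt"
    # Filepath to barcode metadata
    obs_path = f"{sample}/bustools_count/output.barcodes.txt"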

    # Create AnnData object
    adata = kb_utils.import_matrix_as_anndata(X, obs_path, var_path)

    # Add sample name
    adata.obs["sample"] = sample

    # Append to adata list
    adatas.append(adata)

In [None]:
adata = anndata.concat(adatas, merge="same")
# Set sample as index
adata.obs = adata.obs.set_index("sample")
adata.obs

In [None]:
tax_df = pd.read_csv(u_tax_csv)
tax_df[tax_df["species"].str.contains("Severe acute respiratory syndrome")]

In [None]:
fig, ax = plt.subplots(figsize=(6, 7))
fontsize = 16
width = 0.75

x_labels = ['Infected 1', 'Infected 2', 'Infected 3', 'Control 1', 'Control 2', 'Control 3']

target_ids = tax_df[tax_df["species"].str.contains("Severe acute respiratory syndrome-related coronavirus")]["rep_ID"].unique()

counts = []
samples = adata.obs.index.values
labels = samples
for sample in samples:
    counts.append(adata.X[adata.obs.index == sample, adata.var.index.isin(target_ids)].sum())

x = np.arange(len(labels))

ax.bar(x, counts, width=width, color="#003049", edgecolor="black")

ax.set_yscale("symlog")
ax.set_ylabel("kallisto (raw counts for SARS-CoV)", fontsize=fontsize)
# ax.set_xlabel("Sample", fontsize=fontsize)

ax.set_xticks(x, x_labels, rotation=45, ha="right")

ax.tick_params(axis="both", labelsize=fontsize)
ax.set_title("SARS-CoV-2 infected human\niPSC-derived cardiomyocytes", fontsize=fontsize+2)

ax.grid(True, which="both", color="lightgray", ls="--", lw=1)
ax.set_axisbelow(True)

# plt.tight_layout()

plt.savefig("smartseq_benchmark_PRJNA631969.png", dpi=300, bbox_inches="tight")

fig.show()

In [None]:
counts
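
The per-sample summation in the plotting cell can also be written without a loop. Below is a minimal sketch on a toy matrix: a dense NumPy array stands in for `adata.X` (which is typically sparse), and the sample names, virus IDs, and counts are all hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for adata.obs.index, adata.var.index, and adata.X
toy_samples = np.array(["s1", "s2", "s3"])
toy_virus_ids = np.array(["u1", "u2", "u3", "u4"])
toy_X = np.array([
    [5, 1, 0, 2],
    [0, 0, 7, 0],
    [3, 4, 1, 1],
])

# IDs of interest (in the notebook: the rep_IDs of the target species)
toy_target_ids = ["u2", "u4"]

# Boolean mask over columns, then sum the selected columns per row (sample)
col_mask = np.isin(toy_virus_ids, toy_target_ids)
toy_counts = toy_X[:, col_mask].sum(axis=1)

for name, count in zip(toy_samples, toy_counts):
    print(name, count)
```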