<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/Notebooks/Supp_Fig_4/Supp_Fig_4c/1_show_primer_bias_splitcode_alignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Align SPLIT-Seq data from lung samples from mice infected with SARS-CoV-2
Reference: https://doi.org/10.1038/s41586-022-05344-2

Note: Running this notebook requires ~240 GB of disk space, which exceeds the disk space provided by Google Colab.
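
Before starting, it can be useful to confirm that the working directory has enough free space. The sketch below uses `shutil.disk_usage` from the Python standard library; the helper name `has_enough_space` is hypothetical, and the 240 GB threshold is simply the approximate figure from the note above.

```python
import shutil

REQUIRED_GB = 240  # approximate requirement stated in the note above


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if `path` has at least `required_gb` gigabytes free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if not has_enough_space():
    print(f"Warning: less than {REQUIRED_GB} GB free in the working directory.")
```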

## Install software

In [None]:
!pip install -q kb_python anndata gget

# Download kallisto v0.50.0 and build it from source
!git clone https://github.com/pachterlab/kallisto.git --branch v0.50.0
!cd kallisto && mkdir build && cd build && cmake .. && make
# Path to the compiled kallisto binary
kallisto = "kallisto/build/src/kallisto"

import numpy as np
import anndata
import pandas as pd
import json
import os
import glob
import matplotlib.pyplot as plt
import matplotlib as mpl
%config InlineBackend.figure_format='retina'

def nd(arr):
    """
    Flatten a NumPy matrix or array into a 1-D ndarray.
    """
    return np.asarray(arr).reshape(-1)

In [None]:
# Download the customized transcripts to gene mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa

virus_fasta = "palmdb_rdrp_seqs.fa"
virus_t2g = "palmdb_clustered_t2g.txt"

In [None]:
# Number of threads to use in alignment
threads = 2

## Download raw data

In [None]:
!pip install -q ffq
import json

out = "data.json"

# Download the complete dataset
!ffq GSE199498 --ftp -o $out

with open(out) as f:
    data = json.load(f)

print(len(data))

for dataset in data:
    url = dataset["url"]
    !curl -O $url
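
Large downloads can fail partway, so it may help to check which files are present before aligning. The sketch below assumes, as in the loop above, that each entry in `data` carries a `url` key and that `curl -O` saves the file under the last path component of that URL. `missing_downloads` is a hypothetical helper, not part of ffq.

```python
import os
from urllib.parse import urlparse


def missing_downloads(entries, directory="."):
    """Return the file names expected from `entries` that are absent from `directory`."""
    missing = []
    for entry in entries:
        # curl -O names the local file after the last path component of the URL
        filename = os.path.basename(urlparse(entry["url"]).path)
        if not os.path.isfile(os.path.join(directory, filename)):
            missing.append(filename)
    return missing
```

Calling `missing_downloads(data)` after the download loop returns an empty list when every file is present. Note that this checks only for existence, not for complete or uncorrupted files.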

## Align data to PalmDB using kallisto translated search

Generate the virus index, masking sequences that also occur in the host (here, mouse) genome and transcriptome:

In [None]:
# Get host genome and transcriptome using gget
!gget ref -w cdna,dna -r 110 -d mouse

host_cdna = "Mus_musculus.GRCm39.cdna.all.fa.gz"
host_dna = "Mus_musculus.GRCm39.dna.primary_assembly.fa.gz"

# Concatenate host genome and transcriptome into a single file
host_combined = "combined.cdna_dna.all.fa.gz"
!cat $host_cdna $host_dna > $host_combined
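
Concatenating the two compressed files directly with `cat` works because the gzip format allows several compressed members to be stored back to back, and decompressors read them as one continuous stream. A small sketch of this property with Python's `gzip` module, using made-up FASTA records rather than the real reference files:

```python
import gzip

# Two independently compressed toy FASTA records (placeholder sequences)
cdna_part = gzip.compress(b">transcript1\nACGT\n")
dna_part = gzip.compress(b">chr1\nTTGA\n")

# Byte-level concatenation, equivalent to `cat a.gz b.gz > combined.gz`
combined = cdna_part + dna_part

# Decompressing the multi-member stream yields both records in order
decoded = gzip.decompress(combined)
print(decoded.decode())
```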

In [None]:
%%time
# Generate virus reference index
virus_index = "virus_index.idx"

!kb ref \
    --aa \
    --kallisto $kallisto \
    -t $threads \
    --d-list $host_combined \
    --workflow custom \
    -i $virus_index \
    $virus_fasta

[2025-02-20 12:42:23,590]    INFO [ref_custom] Indexing palmdb_rdrp_seqs.fa to virus_index.idx
[2025-02-20 12:45:52,317]    INFO [ref_custom] Finished creating custom index
CPU times: user 1.51 s, sys: 602 ms, total: 2.11 s
Wall time: 3min 33s


Get fastq files:

In [None]:
import os
import glob

In [None]:
# Collect the names of all FASTQ files in the working directory, sorted alphabetically
fastqs = sorted(os.path.basename(filename) for filename in glob.glob("*.fastq.gz"))
fastqs

['SRR18496012_1.fastq.gz',
 'SRR18496012_2.fastq.gz',
 'SRR18496013_1.fastq.gz',
 'SRR18496013_2.fastq.gz',
 'SRR18496014_1.fastq.gz',
 'SRR18496014_2.fastq.gz',
 'SRR18496015_1.fastq.gz',
 'SRR18496015_2.fastq.gz',
 'SRR18496016_1.fastq.gz',
 'SRR18496016_2.fastq.gz',
 'SRR18496017_1.fastq.gz',
 'SRR18496017_2.fastq.gz',
 'SRR18496018_1.fastq.gz',
 'SRR18496018_2.fastq.gz',
 'SRR18496019_1.fastq.gz',
 'SRR18496019_2.fastq.gz']

In [None]:
len(fastqs)

16

In [None]:
# Extract the SRR accession (the part before the first underscore) from each file name
samples = [fastq.split("_")[0] for fastq in fastqs]

In [None]:
samples = list(set(samples))
len(samples)

8
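
The sample accessions can also be derived together with a check that both read files are present for each sample. The sketch below assumes the `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz` naming seen in the listing above. `pair_fastqs` is a hypothetical helper, and it returns samples in sorted order so that repeated runs process them in the same sequence.

```python
from collections import defaultdict


def pair_fastqs(filenames):
    """Group FASTQ file names by accession and return {sample: (read1, read2)}."""
    by_sample = defaultdict(dict)
    for name in filenames:
        sample, _, rest = name.partition("_")
        mate = rest.split(".")[0]  # "1" or "2" under the assumed naming scheme
        by_sample[sample][mate] = name

    pairs = {}
    for sample in sorted(by_sample):
        mates = by_sample[sample]
        if set(mates) != {"1", "2"}:
            raise ValueError(
                f"{sample}: expected read files _1 and _2, found {sorted(mates.values())}"
            )
        pairs[sample] = (mates["1"], mates["2"])
    return pairs
```

With the file list above, `pair_fastqs(fastqs)` would give a dictionary whose keys can be used in place of `samples`.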

Align data to PalmDB:  
The SPLIT-Seq barcode onlist files (r1_RT_replace.txt and r1r2r3.txt) were provided by Delaney Sullivan (07/15/2023).

In [None]:
# Download SPLIT-Seq barcode onlist files
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/Supp_Fig_4/Supp_Fig_4c/r1_RT_replace.txt
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/Supp_Fig_4/Supp_Fig_4c/r1r2r3.txt

In [None]:
%%time
out_folder = "kb_out"
for sample in samples:
    fastq1 = sample + "_1.fastq.gz"
    fastq2 = sample + "_2.fastq.gz"

    !mkdir -p $out_folder/$sample

    !kb count \
        --aa \
        --h5ad \
        --kallisto $kallisto \
        -t $threads \
        -i $virus_index \
        -g $virus_t2g \
        -x SPLIT-Seq \
        -r r1_RT_replace.txt \
        -w r1r2r3.txt \
        -o $out_folder/$sample/ \
        $fastq1 $fastq2

[2025-02-20 12:45:58,579]    INFO [count] Using index virus_index.idx to generate BUS file to palmdb/SRR18496016/ from
[2025-02-20 12:45:58,579]    INFO [count]         SRR18496016_1.fastq.gz
[2025-02-20 12:45:58,579]    INFO [count]         SRR18496016_2.fastq.gz
[2025-02-20 13:08:21,793]    INFO [count] Sorting BUS file palmdb/SRR18496016/output.bus to palmdb/SRR18496016/tmp/output.s.bus
[2025-02-20 13:08:23,965]    INFO [count] Inspecting BUS file palmdb/SRR18496016/tmp/output.s.bus
[2025-02-20 13:08:25,081]    INFO [count] Correcting BUS records in palmdb/SRR18496016/tmp/output.s.bus to palmdb/SRR18496016/tmp/output.s.c.bus with on-list r1r2r3.txt
[2025-02-20 13:08:26,197]    INFO [count] Sorting BUS file palmdb/SRR18496016/tmp/output.s.c.bus to palmdb/SRR18496016/output.unfiltered.bus
[2025-02-20 13:08:27,841]    INFO [count] Generating count matrix palmdb/SRR18496016/counts_unfiltered/cells_x_genes from BUS file palmdb/SRR18496016/output.unfiltered.bus
[2025-02-20 13:08:29,643]  