<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/Notebooks/align_macaque_PBMC_data/5_virus_dlist_cdna_dna_amb/1_align_dlist_cdna_dna_amb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Align sequencing reads to PalmDB with kallisto translated search, masking host genomes and transcriptomes with the D-list and discarding ambiguous reads as host instead of assigning them to virus
This feature relies on an unreleased version of kallisto, available in the [dlist_discard_ambiguities](https://github.com/pachterlab/kallisto/tree/dlist_discard_ambiguities) branch of the GitHub repository. We install it below.

In [None]:
# Install kallisto from the dlist_discard_ambiguities branch
!git clone -q https://github.com/pachterlab/kallisto.git --branch dlist_discard_ambiguities
!cd kallisto && mkdir build && cd build && cmake .. && make

# Install bustools from source
!git clone -q https://github.com/BUStools/bustools.git
!cd bustools && mkdir build && cd build && cmake .. && make

# Define paths to kallisto and bustools binaries
kallisto = "/content/kallisto/build/src/kallisto"
bustools = "/content/bustools/build/src/bustools"

In [None]:
# Download the customized transcripts to gene mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa

virus_fasta = "palmdb_rdrp_seqs.fa"
virus_t2g = "palmdb_clustered_t2g.txt"

In [None]:
# Number of threads to use in alignment
threads = 2

### Download raw sequencing data

In [None]:
!pip install -q ffq
import json

out = "GSE158390_data.json"

# # Download the complete dataset (106 paired fastqs containing a total of 30 billion reads)
# !ffq GSE158390 --ftp -o $out

# Download only two fastq pairs to demonstrate this notebook
!ffq SRR12698499 SRR12698500 --ftp -o $out

# Load the ffq metadata; the context manager closes the file automatically
with open(out) as f:
    data = json.load(f)

print(len(data))

for dataset in data:
    url = dataset["url"]
    !curl -O $url
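
The loop above downloads every entry that `ffq` returns. As a minimal sketch, assuming each record carries a `filetype` key alongside `url` (check this against your own JSON file), one could keep only the FASTQ links before downloading. The helper name `fastq_urls` and the example records are hypothetical and only illustrate the filtering step:

```python
def fastq_urls(records):
    """Return the URLs of records whose filetype is 'fastq' (hypothetical helper)."""
    urls = []
    for record in records:
        # Assumption: records that lack a 'filetype' key are kept as FASTQ
        if record.get("filetype", "fastq") == "fastq":
            urls.append(record["url"])
    return urls


# Made-up records that mimic the assumed structure of the ffq output
example_records = [
    {"filetype": "fastq", "url": "ftp://example.org/SRRXXXX_1.fastq.gz"},
    {"filetype": "bam", "url": "ftp://example.org/SRRXXXX.bam"},
]
print(fastq_urls(example_records))
```

In the notebook, `fastq_urls(data)` could then replace the direct iteration over `data`.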

Generate a sample batch file to align all FASTQ files simultaneously:

In [None]:
import glob

fastqs = []
for filename in glob.glob("*.fastq.gz"):
    fastqs.append(filename.split("/")[-1])

fastqs.sort()

# Get sample names
samples = []
for fastq in fastqs:
    samples.append(fastq.split("_")[0])

# Deduplicate and sort so the batch file has a reproducible order
samples = sorted(set(samples))

# Generate sample batch file
sample_batch_file = "batch.txt"
with open(sample_batch_file, "w") as batch_file:
    for sample in samples:
        fastq1 = sample + "_1.fastq.gz"
        fastq2 = sample + "_2.fastq.gz"
        batch_file.write(sample + "\t" + fastq1 + "\t" + fastq2 + "\n")
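
The batch file above assumes that every sample has both a `_1` and a `_2` FASTQ file. If a download was interrupted, that may not hold. The following sketch, with a hypothetical helper named `complete_pairs`, shows one way to list only the samples for which both mates are present:

```python
from collections import defaultdict


def complete_pairs(filenames):
    """Return sorted sample names that have both '_1' and '_2' FASTQ files."""
    suffix = ".fastq.gz"
    mates = defaultdict(set)
    for name in filenames:
        if not name.endswith(suffix):
            continue
        # Split '<sample>_<mate>' at the last underscore
        sample, _, mate = name[: -len(suffix)].rpartition("_")
        if sample:
            mates[sample].add(mate)
    return sorted(s for s, found in mates.items() if {"1", "2"} <= found)


# Made-up file names: the second sample is missing its '_2' mate
example_files = ["SRR1_1.fastq.gz", "SRR1_2.fastq.gz", "SRR2_1.fastq.gz"]
print(complete_pairs(example_files))
```

Passing `glob.glob("*.fastq.gz")` to this helper would give a list that could be compared with `samples` before writing `batch.txt`.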

### Align to PalmDB with a D-list implementation that also masks ambiguous kmers

In [None]:
# Get host genomes and transcriptomes
!pip install -q gget
!gget ref -w cdna,dna -r 110 -d canis_lupus_familiaris
!gget ref -w cdna,dna -r 110 -d macaca_mulatta

canine_cdna = "Canis_lupus_familiaris.ROS_Cfam_1.0.cdna.all.fa.gz"
macaque_cdna = "Macaca_mulatta.Mmul_10.cdna.all.fa.gz"
canine_dna = "Canis_lupus_familiaris.ROS_Cfam_1.0.dna.toplevel.fa.gz"
macaque_dna = "Macaca_mulatta.Mmul_10.dna.toplevel.fa.gz"

Create modified D-list files. We create copies of each genome and transcriptome in which every FASTA header is replaced with ">>". In this modified version of kallisto, the ">>" header tells the D-list to also discard ambiguous sequences derived from these files, so that ambiguous reads are treated as host rather than assigned to virus:

In [None]:
canine_cdna_amb = "ambigious_Canis_lupus_familiaris.ROS_Cfam_1.0.cdna.all.fa.gz"
macaque_cdna_amb = "ambigious_Macaca_mulatta.Mmul_10.cdna.all.fa.gz"
canine_dna_amb = "ambigious_Canis_lupus_familiaris.ROS_Cfam_1.0.dna.toplevel.fa.gz"
macaque_dna_amb = "ambigious_Macaca_mulatta.Mmul_10.dna.toplevel.fa.gz"

canine_macaque_fasta = "combined.cdna_dna_ambigious.fa.gz"

In [None]:
%%time
# Replace all headers from ">string" to ">>" to tell kallisto we want to extract ambiguous k-mers from these sequences
!gzip -dc $canine_cdna | sed '/^>/ s/.*/>>/' | gzip -c > $canine_cdna_amb
!gzip -dc $macaque_cdna | sed '/^>/ s/.*/>>/' | gzip -c > $macaque_cdna_amb
!gzip -dc $canine_dna | sed '/^>/ s/.*/>>/' | gzip -c > $canine_dna_amb
!gzip -dc $macaque_dna | sed '/^>/ s/.*/>>/' | gzip -c > $macaque_dna_amb
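
For readers less familiar with `sed`, the header replacement above can also be sketched in Python. This is an illustrative equivalent, not the method used in this notebook, and the function names `mask_fasta_headers` and `mask_gzipped_fasta` are hypothetical:

```python
import gzip
import io


def mask_fasta_headers(src, dst):
    """Copy FASTA lines from src to dst, replacing every header line with '>>'."""
    for line in src:
        dst.write(">>\n" if line.startswith(">") else line)


def mask_gzipped_fasta(in_path, out_path):
    """Apply mask_fasta_headers to a gzipped FASTA and write a gzipped copy."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        mask_fasta_headers(src, dst)


# Tiny in-memory demonstration with made-up sequences
demo_in = io.StringIO(">ENST0001 gene:A\nACGT\n>ENST0002 gene:B\nTTGA\n")
demo_out = io.StringIO()
mask_fasta_headers(demo_in, demo_out)
print(demo_out.getvalue())
```

For the multi-gigabyte genome files the shell pipeline is likely faster; the Python version mainly makes the transformation explicit.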

In [None]:
# Concatenate normal + ambiguous cDNA and DNA from macaque and dog into a single file
!cat $canine_cdna $macaque_cdna $canine_dna $macaque_dna $canine_cdna_amb $macaque_cdna_amb $canine_dna_amb $macaque_dna_amb > $canine_macaque_fasta
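
Concatenating `.gz` files with `cat` works because the gzip format allows several compressed members in one file, which decompress to the concatenation of the original contents. The short sketch below demonstrates this property with Python's `gzip` module and made-up sequences. It says nothing about kallisto itself, which is assumed here to read multi-member gzip input as the cell above relies on:

```python
import gzip

# Two independently compressed FASTA fragments (made-up sequences)
part_normal = gzip.compress(b">seq1\nACGT\n")
part_masked = gzip.compress(b">>\nACGT\n")

# Byte-wise concatenation, which is what `cat a.gz b.gz > combined.gz` does
combined = part_normal + part_masked

# Decompressing the combined stream yields both fragments back to back
text = gzip.decompress(combined).decode()
print(text)
```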

Generate virus index:

In [None]:
virus_index = "virus_index.idx"

# Generate virus reference index
!$kallisto index \
    --aa \
    -t $threads \
    --d-list=$canine_macaque_fasta \
    -i $virus_index \
    $virus_fasta

Align to PalmDB and correct barcodes using the host onlist:

In [None]:
out_folder = "virus_dlist_cdna_dna_amb_alignment_results"

In [None]:
%%time
!$kallisto bus \
        -i $virus_index \
        -o $out_folder \
        --aa \
        -t $threads \
        -B $sample_batch_file \
        --batch-barcodes \
        -x 0,0,12:0,12,20:1,0,0
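
The custom technology string passed to `-x` has the form `barcode:UMI:sequence`, where each field is a `file,start,stop` triplet with a 0-based file index and a stop of `0` meaning "to the end of the read". The sketch below decodes the string used above to make the read layout explicit. The helper `describe_technology` is hypothetical and handles only one triplet per field, which is all this string needs:

```python
def describe_technology(tech):
    """Decode a 'bc:umi:seq' technology string into a dict of positions."""
    labels = ("barcode", "umi", "sequence")
    layout = {}
    for label, field in zip(labels, tech.split(":")):
        file_idx, start, stop = (int(v) for v in field.split(","))
        layout[label] = {
            "file": file_idx,
            "start": start,
            # A stop of 0 means the field runs to the end of the read
            "stop": None if stop == 0 else stop,
        }
    return layout


layout = describe_technology("0,0,12:0,12,20:1,0,0")
for label, pos in layout.items():
    print(label, pos)
```

Under this reading, the first FASTQ file of each pair holds a 12-base barcode followed by an 8-base UMI, and the second file is used in full as the biological sequence.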

In [None]:
# Download cell barcode onlist generated during alignment to host
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/align_macaque_PBMC_data/bustools_onlist.txt

In [None]:
%%time
!$bustools sort \
    -m 4G \
    -t $threads \
    -o $out_folder/output_sorted.bus \
    $out_folder/output.bus

!$bustools correct \
    -w bustools_onlist.txt \
    -o $out_folder/output_sorted_corrected.bus \
    $out_folder/output_sorted.bus

!$bustools sort \
    -m 4G \
    -t $threads \
    -o $out_folder/output_sorted_corrected_sorted.bus \
    $out_folder/output_sorted_corrected.bus

!$bustools count \
    --genecounts \
    -o $out_folder/bustools_count/ \
    -g $virus_t2g \
    -e $out_folder/matrix.ec \
    -t $out_folder/transcripts.txt \
    $out_folder/output_sorted_corrected_sorted.bus