<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/Notebooks/align_macaque_PBMC_data/8_virus_bwa/2_align_bwa_host_removed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Align sequencing reads to PalmDB with kallisto translated search after removing host sequences using bwa
The removal of host sequences using bwa is shown in [this notebook](https://github.com/pachterlab/LSCHWCP_2023/blob/main/Notebooks/align_macaque_PBMC_data/8_virus_bwa/1_remove_host_reads_with_bwa.ipynb).

In [None]:
# Install kallisto from source
!git clone -q https://github.com/pachterlab/kallisto.git
!cd kallisto && mkdir build && cd build && cmake .. && make

# Install bustools from source
!git clone -q https://github.com/BUStools/bustools.git
!cd bustools && mkdir build && cd build && cmake .. && make

# Define paths to kallisto and bustools binaries
kallisto = "/content/kallisto/build/src/kallisto"
bustools = "/content/bustools/build/src/bustools"

In [None]:
# Download the custom transcript-to-gene mapping
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
# Download the RdRP amino acid sequences
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa

virus_fasta = "palmdb_rdrp_seqs.fa"
virus_t2g = "palmdb_clustered_t2g.txt"

In [None]:
# Number of threads to use in alignment
threads = 8

Create a new batch file listing the FASTQ files from which host reads were removed by bwa alignment:

In [None]:
import os
import glob

In [None]:
# Download unmapped reads from bwa alignment to host from Caltech Data
# This folder contains the files that are saved inside the bwa_unmapped_reads/raw folder in the previous notebook (https://github.com/pachterlab/LSCHWCP_2023/blob/main/Notebooks/align_macaque_PBMC_data/8_virus_bwa/1_remove_host_reads_with_bwa.ipynb)
!wget -O bwa_unmapped_reads.tar.gz "https://data.caltech.edu/records/sh33z-hrx98/files/bwa_unmapped_reads.tar.gz?download=1"
!tar -xvf bwa_unmapped_reads.tar.gz

In [None]:
fastq_folder = "bwa_unmapped_reads_raw"

# Collect the FASTQ file names (without the folder) in sorted order
fastqs = sorted(os.path.basename(path) for path in glob.glob(f"{fastq_folder}/*.fastq"))

In [None]:
len(fastqs)

In [None]:
# Derive sample names by stripping the "_1.fastq" / "_2.fastq" suffix;
# sort them so the order of the batch file is reproducible
samples = sorted({fastq.rsplit("_", 1)[0] for fastq in fastqs})
len(samples)

In [None]:
sample_batch_file = "batch.txt"
# One line per sample: sample ID, first FASTQ, second FASTQ (tab-separated)
with open(sample_batch_file, "w") as batch_file:
    for sample in samples:
        fastq1 = f"{fastq_folder}/{sample}_1.fastq"
        fastq2 = f"{fastq_folder}/{sample}_2.fastq"
        batch_file.write(f"{sample}\t{fastq1}\t{fastq2}\n")
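
Before starting the alignment, it can help to confirm that every line of the batch file has three tab-separated fields and that both FASTQ files exist. The following is a minimal sketch; `check_batch_file` is a hypothetical helper introduced here for illustration and is not part of kallisto or the original workflow.

```python
import os


def check_batch_file(batch_path):
    """Return a list of problems found in a batch file (empty if none)."""
    problems = []
    with open(batch_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            # Each line should hold: sample ID, first FASTQ, second FASTQ
            if len(fields) != 3:
                problems.append(
                    f"line {line_number}: expected 3 tab-separated fields, "
                    f"found {len(fields)}"
                )
                continue
            sample, *paths = fields
            for path in paths:
                if not os.path.isfile(path):
                    problems.append(
                        f"line {line_number}: missing file {path} "
                        f"for sample {sample}"
                    )
    return problems
```

Calling `check_batch_file(sample_batch_file)` should return an empty list if all listed files are present; any returned messages point to the lines to fix.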

Generate virus reference index (no masking):

In [None]:
virus_index = "virus_index.idx"

!$kallisto index \
    --aa \
    -t $threads \
    -i $virus_index \
    $virus_fasta

Align the reads, then correct barcodes against the host cell barcode onlist:

In [None]:
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/align_macaque_PBMC_data/bustools_onlist.txt

In [None]:
out_folder = "virus_bwa_alignment_results"

In [None]:
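# -x describes the read layout as barcode:UMI:cDNA, each given as file,start,stop:
# barcode = bases 0-12 and UMI = bases 12-20 of the first file;
# cDNA = the full read in the second file (a stop of 0 means "to the end of the read")
!$kallisto bus \
    -i $virus_index \
    -o $out_folder \
    --aa \
    -t $threads \
    -B $sample_batch_file \
    --batch-barcodes \
    -x 0,0,12:0,12,20:1,0,0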
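
The technology string passed to `-x` above can be hard to read at a glance. The sketch below splits it into its barcode, UMI, and sequence parts. `ReadSegment` and `parse_technology_string` are hypothetical names introduced for illustration, and the sketch assumes exactly one `file,start,stop` triplet per part, as in the string used here.

```python
from typing import NamedTuple


class ReadSegment(NamedTuple):
    file_index: int  # which FASTQ file of the pair (0-based)
    start: int       # first base of the segment (0-based)
    stop: int        # end of the segment; 0 means "to the end of the read"


def parse_technology_string(tech):
    """Split a 'barcode:UMI:sequence' string into named segments."""
    parts = tech.split(":")
    if len(parts) != 3:
        raise ValueError("expected three ':'-separated parts")
    layout = {}
    for name, part in zip(("barcode", "umi", "sequence"), parts):
        values = [int(value) for value in part.split(",")]
        if len(values) != 3:
            raise ValueError(f"{name}: expected file,start,stop")
        layout[name] = ReadSegment(*values)
    return layout


layout = parse_technology_string("0,0,12:0,12,20:1,0,0")
barcode_length = layout["barcode"].stop - layout["barcode"].start
umi_length = layout["umi"].stop - layout["umi"].start
```

For the string used in this notebook, this gives a 12-base barcode followed by an 8-base UMI in the first file, with the second file used in full as the sequence to align.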

In [None]:
%%time
# Sort the BUS records (by barcode, UMI, and equivalence class)
!$bustools sort \
    -m 4G \
    -t $threads \
    -o $out_folder/output_sorted.bus \
    $out_folder/output.bus

# Correct barcodes against the host cell barcode onlist
!$bustools correct \
    -w bustools_onlist.txt \
    -o $out_folder/output_sorted_corrected.bus \
    $out_folder/output_sorted.bus

# Sort again, since barcode correction can change the record order
!$bustools sort \
    -m 4G \
    -t $threads \
    -o $out_folder/output_sorted_corrected_sorted.bus \
    $out_folder/output_sorted_corrected.bus

# Generate the count matrix, aggregating counts with the virus t2g mapping
!$bustools count \
    --genecounts \
    -o $out_folder/bustools_count/ \
    -g $virus_t2g \
    -e $out_folder/matrix.ec \
    -t $out_folder/transcripts.txt \
    $out_folder/output_sorted_corrected_sorted.bus
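
To illustrate the idea behind the barcode correction step, the sketch below rescues a barcode that differs from exactly one onlist barcode by a single base. This is a conceptual sketch only: `correct_barcode` is a hypothetical function, and how `bustools correct` handles ambiguous or uncorrectable barcodes internally may differ from what is shown here.

```python
def correct_barcode(barcode, onlist):
    """Return the matching onlist barcode, or None if none can be assigned.

    A barcode is kept if it is on the onlist, and replaced if exactly one
    onlist barcode lies at Hamming distance 1 from it.
    """
    if barcode in onlist:
        return barcode
    candidates = set()
    for position, original in enumerate(barcode):
        for base in "ACGT":
            if base == original:
                continue
            # Substitute a single base and look the result up in the onlist
            neighbor = barcode[:position] + base + barcode[position + 1:]
            if neighbor in onlist:
                candidates.add(neighbor)
    # Ambiguous (several candidates) or unmatched barcodes are not corrected
    return candidates.pop() if len(candidates) == 1 else None


example_onlist = {"AAAA", "CCCC", "AATT"}  # toy onlist for illustration
exact = correct_barcode("AAAA", example_onlist)      # already on the onlist
rescued = correct_barcode("CCCG", example_onlist)    # one mismatch to CCCC
ambiguous = correct_barcode("AAAT", example_onlist)  # near AAAA and AATT
```

Because correction changes barcodes, records that were sorted before correction are no longer guaranteed to be in order, which is why the BUS file is sorted a second time before counting.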