<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/Notebooks/Supp_Fig_2/1_align_zebov_subset_kraken2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Align a subset of the macaque PBMC Zaire ebolavirus (ZEBOV) dataset using Kraken2 (standard nucleotide alignment) and generate BAM files to visualize the alignment

In [1]:
# Install Kraken2 v2.1.2 (pinning the version for reproducibility)
!git clone https://github.com/DerrickWood/kraken2.git --branch v2.1.2
!cd kraken2 && ./install_kraken2.sh ./

kraken2 = "/content/kraken2/kraken2"
kraken2_build = "/content/kraken2/kraken2-build"

Cloning into 'kraken2'...
remote: Enumerating objects: 1064, done.
remote: Counting objects: 100% (355/355), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 1064 (delta 295), reused 282 (delta 281), pack-reused 709
Receiving objects: 100% (1064/1064), 454.00 KiB | 2.34 MiB/s, done.
Resolving deltas: 100% (777/777), done.
Note: switching to '84b2874e0ba5ffc9abaebe630433a430cd0f69f4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

make: Entering directory '/content/kraken2/src'
g++ -fopenmp -Wa

In [2]:
# Number of threads used for alignment
threads = 2

### Download the raw sequencing files and subset to the first 100,000,000 reads


In [3]:
!pip install -q ffq
import json

out = "data.json"
!ffq SRR12698539 --ftp -o $out

# Load the ffq metadata; the context manager closes the file automatically
with open(out) as f:
    data = json.load(f)

print(len(data))

for dataset in data:
    url = dataset["url"]
    !curl -O $url

[2023-12-12 21:23:18,015]    INFO Parsing run SRR12698539
2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6625M  100 6625M    0     0  36.9M      0  0:02:59  0:02:59 --:--:-- 37.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.6G  100 20.6G    0     0  33.2M      0  0:10:35  0:10:35 --:--:-- 32.6M
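Only the `_2` FASTQ file is used in the following cells, so the download loop could be restricted to that file. The snippet below is a minimal sketch of such a filter. It relies only on the `url` key used above. `select_urls` is a hypothetical helper, and the example records are made up for illustration.

```python
def select_urls(records, suffix="_2.fastq.gz"):
    """Return the URLs of ffq-style records whose path ends with `suffix`."""
    return [rec["url"] for rec in records if rec["url"].endswith(suffix)]


# Hypothetical records that mimic the structure loaded from data.json
example = [
    {"url": "ftp://example.org/SRR12698539_1.fastq.gz"},
    {"url": "ftp://example.org/SRR12698539_2.fastq.gz"},
]
print(select_urls(example))
```

In the notebook, one could then loop over `select_urls(data)` instead of `data` and skip the download of the unused file.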


In [4]:
fastq = "SRR12698539_2.fastq.gz"
test_fastq = "SRR12698539_2_short.fastq"

# Keep only the first 100,000,000 reads (4 lines per FASTQ record = 400,000,000 lines)
!zcat $fastq | head -400000000 > $test_fastq
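The `head -400000000` call keeps 400,000,000 lines, which corresponds to 100,000,000 reads because each FASTQ record spans four lines. A rough Python equivalent is sketched below, assuming strictly four-line records (no wrapped sequences). `subset_fastq` is a hypothetical helper and not part of the original workflow.

```python
import gzip
from itertools import islice


def subset_fastq(gz_path, out_path, n_reads):
    """Write the first `n_reads` records of a gzipped FASTQ to a plain-text file.

    Returns the number of complete records written.
    """
    n_lines = 4 * n_reads  # header, sequence, '+', quality
    written = 0
    with gzip.open(gz_path, "rt") as src, open(out_path, "w") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written // 4


# Usage in this notebook would look like:
# subset_fastq("SRR12698539_2.fastq.gz", "SRR12698539_2_short.fastq", 100_000_000)
```

The shell pipeline is likely faster for a file of this size; the Python version mainly makes the lines-to-reads arithmetic explicit.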

### Run Kraken2

Build a Kraken2 viral index and add the ZEBOV genome to the standard viral reference library (otherwise ZEBOV will not be detected). The Zaire ebolavirus (ZEBOV) genome ViralProj14703 (linked to NC_002549.1) was downloaded from https://www.ncbi.nlm.nih.gov/data-hub/genome/?taxon=186538.

In [5]:
krakendb = "kraken2-2.1.2/krakendb"

In [6]:
# Download ZEBOV genome
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/ebola_ref/GCA_000848505.1_ViralProj14703_genomic.fna

--2023-12-12 21:40:56--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/ebola_ref/GCA_000848505.1_ViralProj14703_genomic.fna
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19260 (19K) [text/plain]
Saving to: ‘GCA_000848505.1_ViralProj14703_genomic.fna’


2023-12-12 21:40:57 (10.1 MB/s) - ‘GCA_000848505.1_ViralProj14703_genomic.fna’ saved [19260/19260]
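Before adding the genome to the Kraken2 library, it can be useful to confirm that the downloaded FASTA contains what is expected. The sketch below counts sequences and bases. `fasta_stats` is a hypothetical helper and the toy input is made up. The result for the real file could be compared against the `Completed processing of 1 sequences, 18959 bp` line in the build log further down.

```python
def fasta_stats(lines):
    """Return (number of sequences, total bases) for FASTA-formatted lines."""
    n_seqs = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_seqs += 1  # header line starts a new sequence
        else:
            n_bases += len(line)
    return n_seqs, n_bases


# Toy input; for the real file one could pass
# open("GCA_000848505.1_ViralProj14703_genomic.fna") instead.
toy = [">seq1 example\n", "ACGT\n", "AC\n", ">seq2\n", "GGG\n"]
print(fasta_stats(toy))  # (2, 9)
```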



In [7]:
!$kraken2_build --db $krakendb --download-taxonomy

# Apply fix (https://github.com/DerrickWood/kraken2/issues/292#issuecomment-1206837801) first so the following line works
!$kraken2_build --db $krakendb --download-library viral

# Add ZEBOV genome
!$kraken2_build --db $krakendb --add-to-library GCA_000848505.1_ViralProj14703_genomic.fna

!$kraken2_build --db $krakendb --build --threads $threads

Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.
rsync_from_ncbi.pl: unexpected FTP path (new server?) for https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174
Masking low-complexity regions of new file...Unable to find dustmasker in path, can't mask low-complexity sequences
Creating sequence ID to taxonomy ID map (step 1)...
Found 1/1 targets, searched through 2316311 accession IDs, search complete.
Sequence ID to taxonomy ID map complete. [0.321s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 90696 bytes
Capacity estimation complete. [0.010s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 4 bits reserved for taxid.
Completed processing of 1 sequences, 18959 bp
Writing 

Align the sequencing reads to the custom Kraken2 reference index:

In [8]:
outfolder = "zebov_subset_alignment"
!mkdir -p $outfolder/kraken

In [9]:
!$kraken2 \
    --db $krakendb \
    --threads $threads \
    --minimum-hit-groups 3 \
    --report-minimizer-data \
    --report $outfolder/kraken/SRR12698503.k2report \
    $test_fastq > $outfolder/kraken/SRR12698503.kraken2

Loading database information... done.
100000000 sequences (8800.00 Mbp) processed in 449.308s (13353.9 Kseq/m, 1175.14 Mbp/m).
  23981 sequences classified (0.02%)
  99976019 sequences unclassified (99.98%)
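The summary above can also be recomputed from the per-read output file. The sketch below assumes the standard tab-separated Kraken2 output, where the first column is `C` (classified) or `U` (unclassified), the second is the read ID, and the third is the assigned taxonomy ID. `summarize_kraken_output` is a hypothetical helper and the example lines are made up.

```python
from collections import Counter


def summarize_kraken_output(lines):
    """Count reads per classification status and classified reads per taxonomy ID."""
    status = Counter()
    per_taxid = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status[fields[0]] += 1
        if fields[0] == "C":
            per_taxid[fields[2]] += 1
    return status, per_taxid


# Made-up lines in the assumed five-column layout
example = [
    "C\tread1\t186538\t88\t186538:54\n",
    "U\tread2\t0\t88\t0:54\n",
    "C\tread3\t186538\t88\t186538:20 0:34\n",
]
status, per_taxid = summarize_kraken_output(example)
print(status["C"], status["U"], per_taxid["186538"])  # 2 1 2
```

For the real run, the lines could be streamed from the `.kraken2` file written above, for example `summarize_kraken_output(open(path))`.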


### Extract Kraken reads
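The extract_kraken_reads.py script (from the KrakenTools GitHub repo) extracts the reads that Kraken2 assigned to a given taxon, identified by the taxonomy ID passed with the -t parameter. With --include-children, reads assigned to descendant taxa listed in the report are extracted as well: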

In [10]:
!pip install -q biopython


In [11]:
# Define ZEBOV taxonomy ID
ebov_tid = 186538

In [12]:
# Download script
!curl -O https://raw.githubusercontent.com/jenniferlu717/KrakenTools/master/extract_kraken_reads.py

!python extract_kraken_reads.py \
    -k $outfolder/kraken/SRR12698503.kraken2 \
    --include-children \
    -s $test_fastq \
    -t $ebov_tid \
    -r $outfolder/kraken/SRR12698503.k2report \
    -o $outfolder/kraken/SRR12698503_EBOV.tid10298.1.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19061  100 19061    0     0  62885      0 --:--:-- --:--:-- --:--:-- 62700
PROGRAM START TIME: 12-12-2023 21:57:01
>> STEP 0: PARSING REPORT FILE zebov_subset_alignment/kraken/SRR12698503.k2report
	3 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS zebov_subset_alignment/kraken/SRR12698503.kraken2
	100.00 million reads processed
	23981 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
	23981 read IDs found (100.00 mill reads processed)
	23981 reads printed to file
	Generated file: zebov_subset_alignment/kraken/SRR12698503_EBOV.tid10298.1.fa
PROGRAM END TIME: 12-12-2023 22:24:52
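Conceptually, the extraction above has two steps: collect the IDs of reads assigned to the taxa of interest, then copy those reads from the FASTQ file into a FASTA file. The sketch below illustrates that idea only and is not the KrakenTools implementation. It assumes the standard tab-separated Kraken2 output and four-line FASTQ records, takes the taxonomy IDs explicitly instead of expanding children from the report, and uses hypothetical helper names and made-up example data.

```python
def read_ids_for_taxids(kraken_lines, taxids):
    """Collect IDs of classified reads whose assigned taxonomy ID is in `taxids`."""
    wanted = {str(t) for t in taxids}
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C" and fields[2] in wanted:
            ids.add(fields[1])
    return ids


def fastq_to_fasta_subset(fastq_lines, keep_ids):
    """Return FASTA records for the FASTQ reads whose ID is in `keep_ids`."""
    records = []
    it = iter(fastq_lines)
    # Consume the input four lines at a time: header, sequence, '+', quality
    for header, seq, _plus, _qual in zip(it, it, it, it):
        read_id = header[1:].split()[0]  # ID is the header up to the first space
        if read_id in keep_ids:
            records.append(f">{read_id}\n{seq.strip()}\n")
    return records


kraken_example = [
    "C\tread1\t186538\t4\t186538:1\n",
    "U\tread2\t0\t4\t0:1\n",
]
fastq_example = [
    "@read1 extra\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "TTTT\n", "+\n", "IIII\n",
]
ids = read_ids_for_taxids(kraken_example, [186538])
print(fastq_to_fasta_subset(fastq_example, ids))
```

Writing FASTA rather than FASTQ matches the `.fa` output above, which is why the later Bowtie2 call reads its input with `-f`.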


### Align extracted reads to the ZEBOV genome using Bowtie2

In [13]:
# Install bowtie2
!wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/bowtie2-2.2.5-linux-x86_64.zip
!unzip bowtie2-2.2.5-linux-x86_64.zip
bowtie2_build = "bowtie2-2.2.5/bowtie2-build"
bowtie2 = "bowtie2-2.2.5/bowtie2"

--2023-12-12 22:24:52--  https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/bowtie2-2.2.5-linux-x86_64.zip
Resolving sourceforge.net (sourceforge.net)... 104.18.37.111, 172.64.150.145, 2606:4700:4400::ac40:9691, ...
Connecting to sourceforge.net (sourceforge.net)|104.18.37.111|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/bowtie2-2.2.5-linux-x86_64.zip/ [following]
--2023-12-12 22:24:53--  https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/bowtie2-2.2.5-linux-x86_64.zip/
Reusing existing connection to sourceforge.net:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/bowtie2-2.2.5-linux-x86_64.zip/download [following]
--2023-12-12 22:24:53--  https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/bowtie2-2.2.5-linux-x86_64.zip/download
Reusing existing con

Generate Bowtie2 genome index:

In [14]:
b_index = "b_index"
!mkdir -p $b_index

In [15]:
!$bowtie2_build \
    GCA_000848505.1_ViralProj14703_genomic.fna \
    $b_index/ebov

Settings:
  Output files: "b_index/ebov.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  GCA_000848505.1_ViralProj14703_genomic.fna
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 4739
Using parameters --bmax 3555 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 3555 --dcv 1024
Const

Align extracted reads to ZEBOV genome:

In [16]:
!$bowtie2 \
    -x $b_index/ebov \
    -f -p $threads \
    -U $outfolder/kraken/SRR12698503_EBOV.tid10298.1.fa \
    -S $outfolder/kraken/SRR12698503_EBOV_aligned.sam

23981 reads; of these:
  23981 (100.00%) were unpaired; of these:
    1006 (4.19%) aligned 0 times
    22975 (95.81%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
95.81% overall alignment rate


### Use SAMtools to convert the SAM file to a sorted and indexed BAM file

In [17]:
# Install SAMtools
!wget https://github.com/samtools/samtools/releases/download/1.6/samtools-1.6.tar.bz2
!tar -vxjf samtools-1.6.tar.bz2
!cd samtools-1.6; make
samtools = "samtools-1.6/samtools"

--2023-12-12 22:24:59--  https://github.com/samtools/samtools/releases/download/1.6/samtools-1.6.tar.bz2
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/3666841/bf7ca5b8-a473-11e7-88be-37125f5eb797?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231212%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231212T222459Z&X-Amz-Expires=300&X-Amz-Signature=0cce325c5302037d9fd6dd41290847e16ac1dc4e98f0c55bfc9e28a7dfd7fe89&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=3666841&response-content-disposition=attachment%3B%20filename%3Dsamtools-1.6.tar.bz2&response-content-type=application%2Foctet-stream [following]
--2023-12-12 22:24:59--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/3666841/bf7ca5b8-a473-11e7-88be-37125f5eb797?X-Amz-

In [18]:
!$samtools view \
    -bS -F4 $outfolder/kraken/SRR12698503_EBOV_aligned.sam \
    > $outfolder/kraken/SRR12698503_EBOV_aligned.bam
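The `-F4` option drops records whose FLAG field has bit `0x4` set, which in the SAM format marks a read as unmapped, so only aligned reads reach the BAM file. The sketch below shows the same bit test in plain Python. `split_mapped` is a hypothetical helper and the SAM lines are made up for illustration.

```python
def split_mapped(sam_lines):
    """Return (mapped, unmapped) read counts based on SAM FLAG bit 0x4."""
    mapped = 0
    unmapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines carry no alignment record
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up SAM content: one header line and three alignment records
sam_example = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "r1\t0\tref\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "r3\t16\tref\t200\t42\t4M\t*\t0\t0\tACGT\tIIII\n",
]
print(split_mapped(sam_example))  # (2, 1)
```

Applied to the SAM file from the Bowtie2 step, the two counts should correspond to the aligned and unaligned totals in its alignment summary.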

In [19]:
!$samtools sort \
    $outfolder/kraken/SRR12698503_EBOV_aligned.bam \
    -o $outfolder/kraken/SRR12698503_EBOV_sorted.bam

In [20]:
!$samtools index \
    $outfolder/kraken/SRR12698503_EBOV_sorted.bam