# Novel Recombinants

This notebook contains supplementary analyses of the 929 novel recombinants identified in the `sc2ts` paper.

This notebook has no prerequisite tool dependencies. All Python operations use packages from the standard library, and Unix-based tools are downloaded as pre-compiled, standalone binaries.

## Data

- Viridian metadata: `data/run_metadata.v04.tsv.gz`
   - URL: https://figshare.com/articles/dataset/Supplementary_table_S1/25712982?file=45969195
- Viridian sequences: `data/Viridian_tree_cons_seqs/<#>.cons.fa.xz`
   - URL: https://doi.org/10.6084/m9.figshare.25713225
- Viridian index: `data/Viridian_tree_cons_seqs/index.tsv.xz`

> **Note**: Keep files in their compressed form (`gz`, `xz`). The tools used here read compressed files directly, so no decompression step is needed.
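The same idea carries over to the Python side of the notebook. As a minimal sketch, assuming only the standard library, the helper below opens a `gz` or `xz` file as text and reads the header of a TSV without writing a decompressed copy to disk. The function names `open_compressed` and `read_tsv_header` are illustrative and are not part of the original notebook.

```python
import csv
import gzip
import lzma


def open_compressed(path):
    """Open a .gz or .xz file as text; fall back to a plain text file."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", newline="")
    return open(path, "rt", newline="")


def read_tsv_header(path):
    """Return the column names of a (possibly compressed) TSV file."""
    with open_compressed(path) as handle:
        return next(csv.reader(handle, delimiter="\t"))
```

For example, `read_tsv_header("data/run_metadata.v04.tsv.gz")` would list the metadata columns once the file has been downloaded.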

Change the working directory to the root of the `sc2ts-paper` project.

In [4]:
import os
curr_dir = os.path.basename(os.getcwd())

# If we're still in the notebooks directory, move up to the project root.
if curr_dir != "sc2ts-paper":
  sc2ts_dir = os.path.join(os.getcwd(), "..")
  _ = os.chdir(sc2ts_dir)

Set up input/output files and directories.

In [None]:
metadata = os.path.join("data", "run_metadata.v04.tsv.gz")
sequences = os.path.join("data", "Viridian_tree_cons_seqs")
index = os.path.join("data", "Viridian_tree_cons_seqs", "index.tsv.xz")
recombinants = os.path.join("data", "recombinants.csv")
results = os.path.join("results", "novel_recombinants")

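# Directory for the downloaded nextclade dataset (matches the path used in the Align section).
nextclade_dataset = os.path.join("dataset", "nextclade")

# Create the working directories if they do not already exist.
for directory in ["bin", results, nextclade_dataset]:
  os.makedirs(directory, exist_ok=True)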

## Dependencies

### Install csvtk

`csvtk` is used as a unix-based dataframe engine.

In [None]:
! wget -q -O csvtk.tar.gz https://github.com/shenwei356/csvtk/releases/download/v0.32.0/csvtk_linux_amd64.tar.gz
! tar -xf csvtk.tar.gz
! mv csvtk bin/
! rm -f csvtk.tar.gz
! bin/csvtk --help | head -n 5

### Install seqkit

`seqkit` is a unix-based tool for sequence queries and manipulation.

In [None]:
! wget -q -O seqkit.tar.gz https://github.com/shenwei356/seqkit/releases/download/v2.9.0/seqkit_linux_amd64.tar.gz
! tar -xf seqkit.tar.gz
! mv seqkit bin/
! rm -f seqkit.tar.gz
! bin/seqkit --help | head -n 5

### Install nextclade

`nextclade` is a unix-based tool for sequence alignment and lineage assignment.

In [None]:
! wget -q -O bin/nextclade https://github.com/nextstrain/nextclade/releases/download/3.10.2/nextclade-x86_64-unknown-linux-musl
! chmod +x bin/nextclade
! bin/nextclade --help | head -n 5

### Install rebar

`rebar` is a unix-based tool for recombinant sequence detection.

In [None]:
! wget -q -O bin/rebar https://github.com/phac-nml/rebar/releases/download/v0.2.1/rebar-x86_64-unknown-linux-musl
! chmod +x bin/rebar
! bin/rebar --help | head -n 5

## Metadata and Sequences

Extract Viridian metadata for the novel recombinants.

In [None]:
! bin/csvtk cut -f sample_id {recombinants} \
  | tail -n+2 \
  | bin/csvtk grep -t -f Run -P - {metadata} \
  | bin/csvtk merge -t -f Run {index} - \
  > {results}/metadata.tsv
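The pipeline above does two things: it keeps only the metadata rows whose `Run` matches a recombinant `sample_id`, then joins those rows to the Viridian index on `Run`. A minimal Python sketch of that logic follows, assuming rows are already loaded as dictionaries and that the join keeps only runs present in both tables. The function name `select_and_join` is hypothetical and is not part of the notebook.

```python
def select_and_join(sample_ids, metadata_rows, index_rows, key="Run"):
    """Keep metadata rows whose key is in sample_ids, then inner-join to the index."""
    wanted = set(sample_ids)
    index_by_key = {row[key]: row for row in index_rows}
    joined = []
    for row in metadata_rows:
        run = row[key]
        # Inner join: a run must be requested and present in the index.
        if run in wanted and run in index_by_key:
            # Index columns first, then metadata columns.
            joined.append({**index_by_key[run], **row})
    return joined
```

A run that is missing from the index is silently dropped by this kind of join, so comparing the number of joined rows against the number of requested sample IDs is a cheap sanity check.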

Extract Viridian batch numbers for the novel recombinants.

In [26]:
! bin/csvtk cut -t -f Batch {results}/metadata.tsv | tail -n+2 | sort -g | uniq > {results}/batches.txt

Extract Viridian consensus sequences for the novel recombinants.

In [None]:
! cat results/novel_recombinants/batches.txt | while read batch; do \
  echo Batch: ${batch} 1>&2; \
  bin/csvtk grep -t -f Batch -p ${batch} results/novel_recombinants/metadata.tsv \
    | bin/csvtk cut -t -f Run \
    | tail -n+2 \
    | bin/seqkit grep -w 0 -f - data/Viridian_tree_cons_seqs/${batch}.cons.fa.xz; \
done > results/novel_recombinants/sequences.fasta
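Because the sequences are collected batch by batch, it can be worth confirming that every requested run ended up in the output FASTA. The sketch below is one way to do that with the standard library, assuming each record ID is the first whitespace-delimited token of its header line. The function names `fasta_ids` and `missing_ids` are illustrative only.

```python
def fasta_ids(lines):
    """Yield the ID (first whitespace-delimited header token) of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            yield parts[0] if parts else ""


def missing_ids(expected, fasta_lines):
    """Return the expected IDs that have no record in the FASTA lines, sorted."""
    return sorted(set(expected) - set(fasta_ids(fasta_lines)))
```

Passing an open handle on `results/novel_recombinants/sequences.fasta` as `fasta_lines`, together with the `Run` column of `metadata.tsv` as `expected`, should return an empty list if the extraction was complete.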

## Align

Download the SARS-CoV-2 `nextclade` dataset.

In [34]:
! bin/nextclade dataset get \
  --name sars-cov-2 \
  --tag 2025-01-28--16-39-09Z \
  --output-dir dataset/nextclade

Align the sequences.

In [36]:
! bin/nextclade run \
  --input-dataset dataset/nextclade \
  --jobs 2 \
  --output-tsv {results}/nextclade.tsv \
  --output-fasta {results}/nextclade.fasta \
  {results}/sequences.fasta

## Detect Recombination

Download the SARS-CoV-2 `rebar` dataset.

In [40]:
! bin/rebar dataset download \
  --name sars-cov-2 \
  --tag 2025-01-28 \
  --verbosity error \
  --output-dir dataset/rebar

Detect recombination in the sequences.

In [41]:
! bin/rebar run \
  --dataset-dir dataset/rebar \
  --threads 2 \
  --alignment {results}/nextclade.fasta \
  --output-dir {results}/rebar

[INFO  rebar::run] Creating output directory: "results/novel_recombinants/rebar"
[INFO  rebar::run] Number of threads available: 12
[INFO  rebar::run] Using 2 thread(s).
[INFO  rebar::dataset::load] Loading dataset: "dataset/rebar"
[INFO  rebar::run] Loading query alignment: "results/novel_recombinants/nextclade.fasta"
[INFO  rebar::run] Running recombination search.
929/929 (100%) | Sequences / Second: 3.4466/s | Elapsed: 00:04:29 | ETA: 00:00:00
[INFO  rebar::run] Exporting CLI Run Args: "results/novel_recombinants/rebar/run_args.json"
[INFO  rebar::run] Exporting linelist: "results/novel_recombinants/rebar/linelist.tsv"
[INFO  rebar::run] Exporting recombination barcodes: "results/novel_recombinants/rebar/barcodes"
[INFO  rebar::run] Done.


Plot parents and breakpoints.

In [46]:
! bin/rebar plot \
  --run-dir {results}/rebar \
  --annotations dataset/rebar/annotations.tsv \
  --verbosity error

Example plot of a novel recombinant:

<img src=../results/novel_recombinants/rebar/plots/novel_B.1.177_B.1.221_15325-20660.png width=480>

## Filter

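Keep only the recombinants that meet all of the following criteria:

- `max_run_length` <= 2 in `data/recombinants.csv`
- Not assigned a Pango recombinant lineage (name starting with `X`) by `nextclade`
- Not of Delta or Omicron origin according to the `nextclade` WHO clade; `rebar` populations starting with `X`, `BA`, `BQ`, or `HR` are also excluded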

In [62]:
! bin/csvtk filter2 -f '$max_run_length<=2' data/recombinants.csv \
  | bin/csvtk cut -f sample_id \
  | tail -n+2 \
  | bin/csvtk grep -t -f seqName -P - {results}/nextclade.tsv \
  | bin/csvtk grep -t -f clade_who -r -p "Delta|Omicron" -v \
  | bin/csvtk grep -t -f Nextclade_pango -r -p "^X" -v \
  | bin/csvtk cut -t -f seqName \
  | tail -n+2 \
  | bin/csvtk grep -t -f strain -P - {results}/rebar/linelist.tsv \
  | bin/csvtk grep -t -f population -r -p "^(X|BA|BQ|HR)" -v \
  > {results}/rebar/linelist.filter.tsv
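The filter chain above can also be read as a single predicate over three rows describing the same sample. The sketch below restates it in Python, assuming the sample is present in `data/recombinants.csv`, `nextclade.tsv`, and the rebar linelist, and that each row is a dictionary keyed by column name. The function name `keep` and the constant names are hypothetical.

```python
import re

# Patterns copied from the csvtk commands; matching is case-sensitive.
EXCLUDED_WHO = re.compile(r"Delta|Omicron")
EXCLUDED_POPULATIONS = re.compile(r"^(X|BA|BQ|HR)")


def keep(sc2ts_row, nextclade_row, rebar_row):
    """Return True if a sample passes every filter in the pipeline."""
    return (
        float(sc2ts_row["max_run_length"]) <= 2
        and not EXCLUDED_WHO.search(nextclade_row["clade_who"])
        and not nextclade_row["Nextclade_pango"].startswith("X")
        and not EXCLUDED_POPULATIONS.search(rebar_row["population"])
    )
```

In the shell pipeline a sample that is missing from any of the three tables is dropped implicitly by the `grep -P -` steps; a Python version would need to handle that case before calling `keep`.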