### **Bioprospecting pipeline 1.0**

***
#### **Introduction**
***

A Biosynthetic Gene Cluster (BGC) can be defined as a physically clustered group of two or more genes that encode the biosynthetic enzymes of a pathway (Medema 2015). The compounds encoded by BGCs span diverse structural classes, including polyketides, peptides, oligosaccharides, terpenoids, and alkaloids, among others. These compounds are commonly known as natural products and are a rich source of candidate molecules for pharmaceutical applications, for instance as vitamins, antibiotics, and antifungals (Newman and Cragg 2020). In addition, from an ecological perspective, natural product BGCs are traits with key roles in the survival of organisms and their adaptation to the environment, participating in defense, nutrient scavenging, quorum sensing, etc. (Fischbach and Voigt 2010; O’Brien and Wright 2011).

The advent of new sequencing technologies and the availability of large numbers of genome sequences have boosted our capacity to mine encoded BGCs, especially from microorganisms, which are known to be highly prolific producers of natural products. As a result, genome mining has accelerated the discovery of new BGCs and has become a central approach to studying these genomic elements (Ziemert, Alanjary, and Weber 2016). Metagenomics, as a culture-independent approach, offers unique opportunities to mine the natural product BGCs encoded by environmental microbial communities (Reddy 2012; Charlop-Powers 2015; Lemetre 2017; Rego 2020). However, given the length of these genomic elements, commonly on the order of tens of kb, their identification in metagenomic data requires exceptionally high-quality assemblies. In spite of recent advances (e.g., biosyntheticSPAdes (Meleshko 2019) and BiG-MAP (Pascal Andreu 2021)), identifying the different BGC classes in metagenomic data remains a major challenge that must be addressed to expand bioprospecting endeavors (Nayfach 2021; Robinson 2021; Paoli 2022).

The workflow we will develop capitalizes on recent bioinformatic advances in BGC bioprospecting analysis to maximize the exploitation of metagenomic data. Briefly, our strategy consists of generating metagenome assemblies, including the reconstruction of metagenome assembled genomes (MAGs), identifying BGC sequences, clustering these sequences into Gene Cluster Families (GCFs), computing the BGC coverage and GCF abundance, and performing the functional and taxonomic annotation.
Fig. 1 shows the implementation of this workflow.

**Fig. 1**. Assembly-based bioprospecting pipeline. Metagenomic samples are preprocessed and assembled into contigs or Metagenome Assembled Genomes (MAGs) with [VEBA](https://github.com/jolespin/veba). Subsequently, the BGCs are annotated in the assembled data with [antiSMASH](https://github.com/antismash/antismash), and the identified sequences are matched against precomputed Gene Cluster Family (GCF) models of [MIBiG v3](https://mibig.secondarymetabolites.org/) using the [BiG-SLICE](https://github.com/pereiramemo/bigslice) tool to determine or approximate the BGC functions (if these are placed close to known BGCs) and the biosynthetic novelty (i.e., minimum distance to a known BGC sequence). In addition, BGC sequences are taxonomically annotated with [MMseqs taxonomy](https://github.com/soedinglab/MMseqs2#taxonomy), and the unassembled short-read data is mapped onto the assembled contigs (or MAGs) to compute the coverage of the BGC sequences. The outputs of the pipeline consist of the following tables: 1) the metadata table containing the functional and taxonomic annotation and the biosynthetic novelty; 2) the BGC class abundance table; 3) the GCF abundance tables.

![Figure 1](../figures/Bioprospectig_reads_vs_assembly_dev.png)

***
#### **Prepare the work environment**
***

In [None]:
%load_ext rpy2.ipython
%set_env WORKDIR=workdir
%set_env REPO=/home/epereira/workspace/dev/new_atlantis/repos/bioprospecting
%set_env seqtk=/nfs/bin/seqtk/seqtk

In [None]:
%%bash
# -p creates missing parent directories and makes the cell safe to re-run.
mkdir -p "${WORKDIR}"/data/sola
mkdir -p "${WORKDIR}"/outputs/antismash/taxonomy
mkdir -p "${WORKDIR}"/outputs/bgc_abund
mkdir -p "${WORKDIR}"/outputs/bgc_taxa

***
#### **Input data**
***

The input data, as described in the figure above, are metagenomic samples previously preprocessed and assembled with VEBA. For the purpose of this notebook, we will analyze 38 metagenomes of the [SOLA dataset](https://pubmed.ncbi.nlm.nih.gov/29925880/). Namely, the input consists of the assembled contigs and the mapping information (i.e., BAM files).

We can download the data from the New Atlantis Cloud Lab.  

In [None]:
%%bash

# aws s3 cp s3://newatlantis-case-studies/SOLA-samples/ ${WORKDIR}/data/sola --recursive
# aws s3 cp s3://newatlantis-case-studies/mibig_gcf_models/ ${WORKDIR}/data/mibig_gcf_models --recursive


***
#### **1) Identify BGC sequences**
***

For this we will use our wrapper script [run_antismash](https://github.com/pereiramemo/bioprospecting/blob/main/run_scripts/run_antismash.sh), which runs our container of antiSMASH.  

In [None]:
%%bash

SCAFFOLDS=$(ls ${WORKDIR}/data/sola/ERR*/output/scaffolds.fasta)
for SCAFFOLD in ${SCAFFOLDS}; do
  SAMPLE_NAME=$(echo "${SCAFFOLD}" | sed "s/.*\(ERR[0-9]\+\)\/output.*/\1/");
  OUTPUT_DIR="${WORKDIR}/outputs/antismash/${SAMPLE_NAME}";
  "${REPO}"/run_scripts/run_antismash.sh "${SCAFFOLD}" "${OUTPUT_DIR}" \
  --cpus 40 \
  --genefinding-tool prodigal-m \
  --taxon bacteria \
  --allow-long-headers \
  --minlength 5000;
done
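
The `sed` expression in the loop above derives the sample name from the scaffold path. The following is a minimal sketch of that step on a made-up path (the accession `ERR0000001` is hypothetical). It uses `[0-9][0-9]*` instead of the GNU-specific `\+` so that it also runs with non-GNU `sed`, and shows a `dirname`/`basename` alternative that assumes the same `<sample>/output/scaffolds.fasta` layout.

```shell
# Hypothetical scaffold path; the accession is made up for illustration.
SCAFFOLD="workdir/data/sola/ERR0000001/output/scaffolds.fasta"

# Same idea as the sed call in the loop, written with portable syntax.
SAMPLE_NAME=$(echo "${SCAFFOLD}" | sed 's/.*\(ERR[0-9][0-9]*\)\/output.*/\1/')

# Alternative without a regular expression: the sample directory is
# two levels above the FASTA file.
SAMPLE_NAME_ALT=$(basename "$(dirname "$(dirname "${SCAFFOLD}")")")

echo "${SAMPLE_NAME} ${SAMPLE_NAME_ALT}"
```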

***
#### **2) Taxonomic annotation of BGCs**
***

For this we will use our script [run_mmseqs_taxonomy.sh](https://github.com/pereiramemo/bioprospecting/blob/main/run_scripts/run_mmseqs_taxonomy.sh), which runs a container of [MMseqs](https://github.com/soedinglab/MMseqs2), using [UniProtKB/Swiss-Prot](https://www.uniprot.org/uniprotkb?facets=reviewed%3Atrue&query=%2A) as the reference database.

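First, we have to organize the data and select the scaffolds in which a BGC was annotated. Note that the scaffold IDs are read from the per-sample coverage tables in `outputs/bgc_abund`, which are written in step 5 (Compute coverage), so that step must be run before this cell: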

In [None]:
%%bash
cat "${WORKDIR}"/outputs/bgc_abund/ERR*.tsv | cut -f1 | sort | uniq > "${WORKDIR}"/outputs/bgc_taxa/ids.txt
cat "${WORKDIR}"/data/sola/ERR*/output/scaffolds.fasta >  "${WORKDIR}"/outputs/bgc_taxa/all.fasta
"${seqtk}" subseq \
"${WORKDIR}"/outputs/bgc_taxa/all.fasta \
"${WORKDIR}"/outputs/bgc_taxa/ids.txt > \
"${WORKDIR}"/outputs/bgc_taxa/bgc.fasta

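What the cell above does to the data can be reproduced on toy input. The sketch below is only an illustration: the file contents and the temporary directory are made up, and a small `awk` filter stands in for `seqtk subseq` so that the example has no external dependency. It assumes, as the cell does, that column 1 of the coverage tables holds the scaffold ID.

```shell
TMP=$(mktemp -d)

# Toy stand-in for the coverage tables: column 1 is the scaffold ID,
# the second column is a made-up value.
printf 'ERR0000001__k141_1\t12.5\nERR0000001__k141_1\t3.0\nERR0000001__k141_7\t8.1\n' > "${TMP}/cov.tsv"

# Toy stand-in for the concatenated scaffolds.
printf '>ERR0000001__k141_1\nACGT\n>ERR0000001__k141_2\nGGCC\n>ERR0000001__k141_7\nTTAA\n' > "${TMP}/all.fasta"

# Unique scaffold IDs (sort -u is equivalent to sort | uniq).
cut -f1 "${TMP}/cov.tsv" | sort -u > "${TMP}/ids.txt"

# Keep only the FASTA records whose ID is listed in ids.txt.
awk 'NR == FNR { keep[$1] = 1; next }
     /^>/ { id = substr($1, 2); print_rec = (id in keep) }
     print_rec' "${TMP}/ids.txt" "${TMP}/all.fasta" > "${TMP}/bgc.fasta"

# Number of scaffolds retained.
N_SELECTED=$(grep -c '^>' "${TMP}/bgc.fasta")
echo "${N_SELECTED}"
```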
Now we can perform the taxonomic annotation.

In [None]:
%%bash

"${REPO}"/run_scripts/run_mmseqs_taxonomy.sh \
"${WORKDIR}"/outputs/bgc_taxa/bgc.fasta \
"${WORKDIR}"/outputs/bgc_taxa/bgc_taxa_annot \
--threads 40 \
--tax-lineage 1 \
-v 0

Lastly, we will format the taxonomic annotation table to make it more suitable for downstream analyses.

In [None]:
%%bash

gawk -v FS="\t" -v OFS="\t" '{
  seq_id = $1; tax_level = $3; tax_path = $9;
  sample = gensub(/__.*/,"","g", seq_id);
  print sample,seq_id,tax_level,tax_path
  }' "${WORKDIR}"/outputs/bgc_taxa/bgc_taxa_annot_lca.tsv > \
  "${WORKDIR}"/outputs/bgc_taxa/bgc_taxa_annot_lca_formatted.tsv

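To make the reformatting concrete, here is a minimal sketch applied to a single made-up line. The nine-column layout and all values are assumptions for illustration; only columns 1, 3, and 9 matter, as in the cell above. Plain `awk` with `sub()` replaces the gawk-only `gensub()`, which gives the same result here because the pattern `__.*` can match at most once per ID.

```shell
# Made-up LCA line with nine tab-separated columns.
LINE=$(printf 'ERR0000001__k141_1\t1224\tphylum\tPseudomonadota\t5\t4\t3\t0.75\td_Bacteria;p_Pseudomonadota')

FORMATTED=$(printf '%s\n' "${LINE}" | awk -F '\t' -v OFS='\t' '{
  sample = $1
  sub(/__.*/, "", sample)   # strip everything from the first "__": sample name
  print sample, $1, $3, $9  # sample, sequence ID, taxonomic level, lineage
}')

printf '%s\n' "${FORMATTED}"
```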
***
#### **3) BGC mapping**
***

Here we will assess how similar the BGC sequences found in the metagenomic samples are to known BGC sequences.  
For this analysis we will run [BiG-SLICE](https://github.com/pereiramemo/bigslice) to map our BGCs against (previously constructed) GCF models of the [MIBiG database V3](https://mibig.secondarymetabolites.org/).

To be able to run [BiG-SLICE](https://github.com/pereiramemo/bigslice) we have to format the input properly, that is, create the [datasets.tsv and taxonomy files](https://github.com/medema-group/bigslice/wiki/Input-folder).

In [None]:
%%bash

# datasets.tsv: one line per sample (dataset name, dataset path, taxonomy file, description).
ls -d "${WORKDIR}/outputs/antismash/"ERR* | \
while read LINE; do
  DATASET=$(basename "${LINE}")
  echo -e "${DATASET}\t./\ttaxonomy/${DATASET}_taxonomy.tsv\tdataset_${DATASET}"
done > "${WORKDIR}/outputs/antismash/datasets.tsv"

# Taxonomy files: one line per BGC region, appended to the file of its sample.
# Because of the append (>>), delete old taxonomy files before re-running this cell.
ls "${WORKDIR}/outputs/antismash/"ERR*/scaffolds/ERR*region*.gbk | \
while read LINE; do
  # Replacing the matched suffix with "/" lets basename return the remaining prefix.
  DATASET=$(basename "${LINE/__k*.region*.gbk//}")
  SEQID=$(basename "${LINE/.region*.gbk//}")
  OUTPUT_FILE="${WORKDIR}/outputs/antismash/taxonomy/${DATASET}_taxonomy.tsv"
  echo -e "${SEQID}\tBacteria" >> "${OUTPUT_FILE}"
done
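
The dataset name and sequence ID are both derived from the name of the antiSMASH region file. As a minimal sketch, the same two values can be obtained with POSIX suffix removal (`%%`), which may be easier to read than the pattern substitution used in the cell. The file name below is made up, and the sketch assumes the `<sample>__k<contig>.region<NNN>.gbk` naming seen in this notebook.

```shell
# Hypothetical antiSMASH region file; the names are made up for illustration.
GBK="workdir/outputs/antismash/ERR0000001/scaffolds/ERR0000001__k141_1.region001.gbk"

FILE=$(basename "${GBK}")
DATASET="${FILE%%__k*}"     # drop everything from the first "__k": sample name
SEQID="${FILE%%.region*}"   # drop everything from the first ".region": scaffold ID

# Line that would be appended to the taxonomy file, and the file it belongs to.
printf '%s\tBacteria\n' "${SEQID}"
echo "taxonomy/${DATASET}_taxonomy.tsv"
```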

Now we do the mapping.

In [None]:
%%bash

"${REPO}"/run_scripts/run_bigslice.sh query \
"${WORKDIR}/outputs/antismash/" \
"${WORKDIR}/data/mibig_gcf_models/" \
--num_threads 40 \
--threshold_pct 0.1 \
--query_name SOLA

***
#### **4) BGC clustering**
***

Here we will again use [BiG-SLICE](https://github.com/medema-group/bigslice), this time to cluster our BGC sequences into Gene Cluster Families (GCFs).  

In [None]:
%%bash

"${REPO}"/run_scripts/run_bigslice.sh cluster \
"${WORKDIR}/outputs/antismash/" \
"${WORKDIR}/outputs/sola_clust/" \
--num_threads 40 \
--threshold_pct 0.1

***
#### **5) Compute coverage**
***

To do this, we will first concatenate all BGC gbk files from each metagenomic sample.  

In [None]:
%%bash

ls -d ${WORKDIR}/outputs/antismash/ERR* | \
while read LINE; do
  SAMPLE=$(basename "${LINE}")
  cat "${LINE}"/scaffolds/"${SAMPLE}"*.region*.gbk > "${WORKDIR}"/outputs/bgc_abund/"${SAMPLE}".gbk
done

Now we can run our custom script [get_cov.py](https://github.com/pereiramemo/bioprospecting/blob/main/aux_scripts/get_cov.py) to compute the coverage of each BGC.

In [None]:
%%bash

ls "${WORKDIR}"/outputs/bgc_abund/*.gbk | \
while read LINE; do
  SAMPLE=$(basename "${LINE}" .gbk)
  "${REPO}"/aux_scripts/get_cov.py \
  --input_gbk "${WORKDIR}"/outputs/bgc_abund/"${SAMPLE}".gbk \
  --input_bam "${WORKDIR}"/data/sola/"${SAMPLE}"/output/mapped.sorted.bam \
  --sample_name "${SAMPLE}" \
  --output_tsv "${WORKDIR}"/outputs/bgc_abund/"${SAMPLE}".tsv
done

cat "${WORKDIR}"/outputs/bgc_abund/ERR*.tsv  > "${WORKDIR}"/outputs/bgc_abund/bgc_abund.tsv
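
For readers unfamiliar with the quantity being computed, the sketch below shows one common definition of region coverage: the mean per-base read depth over the BGC interval. This is an assumption for illustration and not necessarily how `get_cov.py` computes it. The input mimics the three-column output of `samtools depth` (contig, 1-based position, depth); the contig names, region coordinates, and depths are made up.

```shell
# Toy per-base depth table (contig, position, depth); all values are made up.
DEPTH=$(printf 'k141_1\t1\t4\nk141_1\t2\t6\nk141_1\t3\t8\nk141_1\t4\t2\nk141_2\t1\t9\n')

# Hypothetical BGC region on contig k141_1, positions 2 to 4 inclusive.
# Dividing by the region length counts positions missing from the table as depth 0.
MEAN_COV=$(printf '%s\n' "${DEPTH}" | awk -F '\t' -v contig="k141_1" -v start=2 -v end=4 '
  $1 == contig && $2 >= start && $2 <= end { sum += $3 }
  END { printf "%.2f", sum / (end - start + 1) }')

echo "${MEAN_COV}"
```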
