### **Bioprospecting pipeline 1.0**

#### **Introduction**

A Biosynthetic Gene Cluster (BGC) can be defined as a physically clustered group of two or more genes that encode the biosynthetic enzymes of a pathway (Medema 2015). The compounds produced by BGCs span diverse chemical structural classes, including, among others, polyketides, peptides, oligosaccharides, terpenoids, and alkaloids. These compounds are commonly known as natural products and are a rich source of candidate molecules for pharmaceutical applications, for instance as vitamins, antibiotics, and antifungals (Newman and Cragg 2020). In addition, from an ecological perspective, natural product BGCs are traits with key roles in the survival of organisms and their adaptation to the environment, participating in defense, nutrient scavenging, quorum sensing, etc. (Fischbach and Voigt 2010; O’Brien and Wright 2011).

The advent of new sequencing technologies and the availability of large numbers of genome sequences have boosted our capacity to mine encoded BGCs, especially from microorganisms, which are known to be highly prolific producers of natural products. As a result, genome mining has accelerated the discovery of new BGCs, positioning itself as a central approach for studying these genomic elements (Ziemert, Alanjary, and Weber 2016). Metagenomics, as a culture-independent approach, offers unique opportunities to mine the natural product BGCs encoded by environmental microbial communities (Reddy 2012; Charlop-Powers 2015; Lemetre 2017; Rego 2020). However, given the length of these genomic elements, commonly on the order of tens of kb, their identification in metagenomic data requires exceptionally high-quality assemblies. Despite recent advances (e.g., biosyntheticSPAdes (Meleshko 2019) and BiG-MAP (Pascal Andreu 2021)), identifying the different BGC classes in metagenomic data remains a major challenge that must be addressed to expand bioprospecting endeavors (Nayfach 2021; Robinson 2021; Paoli 2022).

The workflow we will develop capitalizes on recent bioinformatic advances in BGC bioprospecting analysis to maximize the exploitation of metagenomic data. Briefly, our strategy consists of generating metagenome assemblies, including the reconstruction of metagenome-assembled genomes (MAGs), identifying BGC sequences, clustering these sequences into Gene Cluster Families (GCFs), computing the BGC coverage and GCF abundance, and performing the functional and taxonomic annotation.
Fig. 1 shows the implementation of this workflow.
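
To make the coverage and abundance step more concrete, here is a minimal Python sketch. It rests on two assumptions that are not stated in this document: that BGC coverage is the mean per-base read depth over the BGC region, and that GCF abundance is the sum of the coverages of the BGCs assigned to that GCF. The function names, identifiers, and toy depth values are hypothetical and only serve as illustration; the pipeline itself may compute these quantities differently.

```python
from collections import defaultdict
from typing import Dict, List


def mean_coverage(depths: List[int], start: int, end: int) -> float:
    """Mean per-base depth over the half-open interval [start, end) of a contig."""
    if not 0 <= start < end <= len(depths):
        raise ValueError("BGC interval must lie inside the contig")
    region = depths[start:end]
    return sum(region) / len(region)


def gcf_abundance(
    bgc_coverage: Dict[str, float], bgc_to_gcf: Dict[str, str]
) -> Dict[str, float]:
    """Sum the coverage of all BGCs assigned to each GCF (assumed definition)."""
    totals: Dict[str, float] = defaultdict(float)
    for bgc_id, coverage in bgc_coverage.items():
        totals[bgc_to_gcf[bgc_id]] += coverage
    return dict(totals)


# Toy data (hypothetical): per-base read depth for two contigs.
contig_depths = {
    "contig_1": [0, 2, 4, 4, 6, 8, 8, 2],
    "contig_2": [10, 10, 10, 10],
}
# Hypothetical BGC regions as (contig, start, end), 0-based and half-open.
bgc_regions = {"bgc_a": ("contig_1", 2, 6), "bgc_b": ("contig_2", 0, 4)}
# Hypothetical BGC-to-GCF assignment.
bgc_to_gcf = {"bgc_a": "GCF_1", "bgc_b": "GCF_1"}

bgc_cov = {
    bgc_id: mean_coverage(contig_depths[contig], start, end)
    for bgc_id, (contig, start, end) in bgc_regions.items()
}
gcf_abund = gcf_abundance(bgc_cov, bgc_to_gcf)
```

In a real run the per-base depths would come from the read mapping rather than from hard-coded lists, and a normalization by sequencing depth may be needed before comparing samples.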

![Figure 1](../figures/Bioprospectig_reads_vs_assembly_dev.png)

**Fig. 1**. Assembly-based bioprospecting pipeline. Metagenomic samples are preprocessed and assembled into contigs or Metagenome Assembled Genomes (MAGs) with [VEBA](https://github.com/jolespin/veba). Subsequently, the BGCs are annotated in the assembled data with [antiSMASH](https://github.com/antismash/antismash), and the identified sequences are matched against precomputed Gene Cluster Family (GCF) models of [MIBiG v3](https://mibig.secondarymetabolites.org/) using the [BiG-SLICE](https://github.com/pereiramemo/bigslice) tool to determine or approximate the BGC functions (if the sequences are placed close to known BGCs) and the biosynthetic novelty (i.e., the minimum distance to a known BGC sequence). In addition, BGC sequences are taxonomically annotated with [MMseqs taxonomy](https://github.com/soedinglab/MMseqs2#taxonomy), and the unassembled short-read data are mapped onto the assembled contigs (or MAGs) to compute the coverage of the BGC sequences. The outputs of the pipeline consist of the following tables: 1) the metadata table containing the functional and taxonomic annotation and the biosynthetic novelty; 2) the BGC class abundance table; 3) the GCF abundance tables.
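
The notion of biosynthetic novelty used in the caption (minimum distance to a known BGC sequence) can be illustrated with a toy Python sketch. This is not how BiG-SLICE works internally: BiG-SLICE derives its own feature vectors and distance values, whereas the sketch below assumes plain Euclidean distance on made-up three-dimensional vectors. The function name `biosynthetic_novelty` and all identifiers and values are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def biosynthetic_novelty(
    query: Sequence[float], references: Dict[str, Sequence[float]]
) -> Tuple[str, float]:
    """Return the nearest reference BGC and its distance to the query.

    A larger distance means the query BGC is further from anything known,
    i.e., more novel under this (assumed) Euclidean metric.
    """
    if not references:
        raise ValueError("at least one reference BGC is required")
    nearest_id = min(references, key=lambda ref: math.dist(query, references[ref]))
    return nearest_id, math.dist(query, references[nearest_id])


# Hypothetical feature vectors for two known BGCs and one query BGC.
reference_bgcs = {
    "ref_polyketide": [1.0, 0.0, 0.0],
    "ref_terpene": [0.0, 1.0, 0.0],
}
query_bgc = [1.0, 0.0, 2.0]

nearest, novelty = biosynthetic_novelty(query_bgc, reference_bgcs)
```

Here the query is closest to `ref_polyketide`, so its function would be approximated from that reference, while the distance itself serves as the novelty score.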

#### **Input data**

The data we will use, as shown in Fig. 1, are metagenomic samples previously preprocessed and assembled with VEBA. For the purpose of this notebook, we will analyze 38 metagenomes of the SOLA dataset. Specifically, the input consists of the assembled contigs and the mapping information (i.e., BAM files).