new-atlantis-labs/bioprospecting

Bioprospecting

This repository contains the bioinformatic tools and custom code dedicated to the bioprospecting of Biosynthetic Gene Clusters (BGCs) in marine metagenomic samples.

The included tools are containerized and must be run through their corresponding run_* wrapper script.
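As a rough illustration of driving a wrapper from Python, the sketch below locates the run_*.sh script inside a module's src folder and builds the command line for it. The helper names `find_wrapper` and `build_command` are hypothetical and not part of this repository. The arguments each wrapper accepts are not described in this README, so they are passed through unchanged.

```python
from pathlib import Path


def find_wrapper(module_dir):
    """Return the single run_*.sh wrapper inside <module_dir>/src."""
    scripts = sorted(Path(module_dir, "src").glob("run_*.sh"))
    if len(scripts) != 1:
        raise FileNotFoundError(
            f"expected exactly one run_*.sh in {module_dir}/src, found {len(scripts)}"
        )
    return scripts[0]


def build_command(module_dir, *args):
    """Build the argv list for a module's wrapper script.

    Arguments are forwarded as-is (assumption: the wrapper takes positional
    or flag arguments; check the module's *_execution.ipynb for the real ones).
    """
    return ["bash", str(find_wrapper(module_dir)), *map(str, args)]
```

The resulting list can be handed to `subprocess.run(..., check=True)`, so that a failing container run raises an error instead of passing silently.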

Figure 1

Fig. 1. Assembly-based bioprospecting pipeline. The input consists of metagenomic samples previously preprocessed and assembled with VEBA. The pipeline is organized into five main tasks. 1) Identify BGC sequences: BGCs are annotated in the assembled metagenomic data with antiSMASH; 2) Taxonomic annotation: the metagenomic contigs containing BGC sequences are taxonomically annotated with MMseqs taxonomy against the UniRef100 reference database; 3) BGC mapping: the metagenomic BGC sequences are mapped with BiG-SLICE against previously constructed Gene Cluster Family (GCF) models of the MIBiG v3 database; 4) BGC clustering: the metagenomic BGC sequences are clustered into GCFs with BiG-SLICE; 5) Compute coverage: the coverage of the metagenomic contigs, previously estimated with VEBA, is used to determine the coverage of the BGC sequences. The pipeline outputs three tables: Table 1: functional and taxonomic annotation, biosynthetic novelty, and closest GCF id; Table 2: BGC class abundance table; Table 3: GCF abundance table.
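To make task 5 and the abundance tables more concrete, here is a minimal sketch under two loud assumptions that the caption does not confirm: each BGC simply inherits the coverage of the contig it lies on, and a class or GCF abundance is the sum of the coverages of its BGCs. All names and values are made up for illustration. The notebooks in this repository may compute these quantities differently.

```python
from collections import defaultdict


def bgc_coverage(bgc_to_contig, contig_coverage):
    """Assumption: a BGC gets the coverage of the contig it sits on."""
    return {bgc: contig_coverage[contig] for bgc, contig in bgc_to_contig.items()}


def abundance_table(bgc_cov, bgc_to_group):
    """Assumption: abundance of a group (BGC class or GCF id) is the
    summed coverage of the BGCs assigned to it."""
    totals = defaultdict(float)
    for bgc, cov in bgc_cov.items():
        totals[bgc_to_group[bgc]] += cov
    return dict(totals)


# Toy, made-up inputs for illustration only.
contig_coverage = {"contig_1": 12.0, "contig_2": 3.5}
bgc_to_contig = {"bgc_a": "contig_1", "bgc_b": "contig_1", "bgc_c": "contig_2"}
bgc_to_class = {"bgc_a": "NRPS", "bgc_b": "terpene", "bgc_c": "NRPS"}

cov = bgc_coverage(bgc_to_contig, contig_coverage)
class_abundance = abundance_table(cov, bgc_to_class)
```

Passing a BGC-to-GCF mapping instead of `bgc_to_class` would give the GCF-level table in the same way.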

Repository structure

.
├── execution
│   ├── bgc_annotation
│   │   ├── antismash_execution.ipynb
│   │   └── src
│   │       ├── antismash_Dockerfile
│   │       ├── requirements.txt
│   │       ├── run_antismash.sh
│   │       └── utilitites.ipynb
│   ├── bgc_clustering
│   │   ├── bigslice_execution.ipynb
│   │   └── src
│   │       ├── bigslice_Dockerfile
│   │       ├── requirements.txt
│   │       ├── run_bigslice.sh
│   │       └── utilitites.ipynb
│   ├── bgc_mapping
│   │   ├── bigslice_execution.ipynb
│   │   └── src
│   │       ├── bigslice_Dockerfile
│   │       ├── requirements.txt
│   │       ├── run_bigslice.sh
│   │       └── utilities.ipynb
│   └── bgc_taxonomy
│       ├── mmseqs_execution.ipynb
│       └── src
│           ├── mmseqs_Dockerfile
│           ├── requirements.txt
│           ├── run_mmseqs_taxonomy.sh
│           └── utilities.ipynb
├── figures
│   └── Bioprospectig_pipeline_dev.png
└── README.md

The execution folder contains the modules that make up the bioprospecting pipeline: bgc_annotation, bgc_clustering, bgc_mapping, and bgc_taxonomy.
Each module is a folder containing:
The *_execution.ipynb notebook, which contains the code and documentation needed to run the analysis.
The src folder, with the following files:

  • *_Dockerfile: builds the container that runs the module's main tool(s).
  • requirements.txt: lists the packages needed to run the *_execution.ipynb notebook.
  • utilities.ipynb (spelled utilitites.ipynb in bgc_annotation and bgc_clustering): defines the functions used in the *_execution.ipynb notebook.
  • run_*.sh: a wrapper script to easily execute the containerized tool(s).

The figures folder contains the figures included in this repository as part of its documentation.
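The module layout described above can be checked with a small script. This is only a sketch: `missing_files` is a hypothetical helper, not something shipped in this repository, and the glob patterns are inferred from the tree shown earlier.

```python
from pathlib import Path

MODULES = ["bgc_annotation", "bgc_clustering", "bgc_mapping", "bgc_taxonomy"]

# Patterns relative to each module folder. "utilit*.ipynb" is deliberately
# loose so that it matches both spellings that appear in the tree.
PATTERNS = [
    "*_execution.ipynb",
    "src/*_Dockerfile",
    "src/requirements.txt",
    "src/run_*.sh",
    "src/utilit*.ipynb",
]


def missing_files(repo_root):
    """Map each module to the expected patterns that match no file."""
    report = {}
    for module in MODULES:
        module_dir = Path(repo_root, "execution", module)
        missing = [p for p in PATTERNS if not any(module_dir.glob(p))]
        if missing:
            report[module] = missing
    return report
```

Calling `missing_files(".")` from the repository root returns an empty dictionary when every module has the expected files.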
