![image](https://cdn.discordapp.com/attachments/996200880351215636/1065002848355631165/New_Atlantis.png) 

---
# Execution of the antiSMASH Tool
---

## Introduction

 
This notebook demonstrates how to run antiSMASH on assembled metagenomic data to identify Biosynthetic Gene Clusters (BGCs) in the assembled contigs. `antiSMASH` stands for "antibiotics and Secondary Metabolite Analysis Shell". It is a comprehensive genome mining tool for the automated identification, annotation, and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genomes. Secondary metabolites are bioactive compounds, often with therapeutic potential, including antibiotics, antifungals, and antitumor agents.
`antiSMASH` has many features, but here we use only its ability to identify BGCs. It can identify many known types of BGCs, such as those responsible for the production of polyketides, non-ribosomal peptides, and terpenes.
   
   
The `antiSMASH` tool provides biologists and bioinformaticians with a useful means of deep-diving into genomic data to accelerate the discovery and diversification of biologically active secondary metabolites.

Markdown cell giving a brief explanation of the tool. Cover what insight the tool is expected to generate and give an extremely brief description of its methods.

## Input Data 
---

The input data are metagenomic samples previously preprocessed and assembled with [VEBA](https://github.com/jolespin/veba), a metagenomic assembly tool. For the purpose of this notebook, we will analyze metagenomes originally from the SOLA dataset. 
The input consists of:
1. **Assembled contigs**: .fasta files.

2. **Mapping files** of the assembled contigs in .bam files.
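
As a minimal sketch of how the two input types might be located before a run, the helper below pairs each `.fasta` file with a `.bam` file that shares its file stem. The function name `find_inputs` and the pairing-by-stem naming convention are assumptions made for illustration; neither VEBA nor antiSMASH is known to prescribe them.

```python
from pathlib import Path


def find_inputs(input_dir):
    """Pair each .fasta file with the .bam file sharing its stem.

    Hypothetical helper: pairing by file stem is an assumed naming
    convention, so adjust it to match how your files are named.
    """
    input_dir = Path(input_dir)
    pairs = {}
    for fasta in sorted(input_dir.glob("*.fasta")):
        bam = fasta.with_suffix(".bam")
        # Record None when no matching mapping file exists.
        pairs[fasta.stem] = (fasta, bam if bam.exists() else None)
    return pairs
```

Samples whose second element is `None` have contigs but no mapping file, which is worth checking before any step that needs the alignments.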


## Output Data 
---

`antiSMASH` outputs an array of data related to the identified secondary metabolite biosynthesis gene clusters (BGCs). The output is highly comprehensive and includes the following:

1. **Gene Cluster Annotations:** For each identified BGC, antiSMASH provides an annotation of the genes in the cluster, including the predicted function and potential products.

2. **Cluster Border Prediction:** It includes the coordinates of the predicted gene cluster borders in the genome sequence.

3. **Biosynthetic Pathway Predictions:** Based on identified enzymatic domains, it predicts potential biosynthetic pathways that could be used by the organism to produce secondary metabolites.

4. **Comparative Analysis Data:** The software performs a comparative analysis of the identified gene clusters against known BGCs, yielding similarities and potential novelty.

5. **Cluster Networking Data:** antiSMASH can generate data on co-occurring clusters, indicating their potential functional relationships.

6. **Data Files:** The software delivers the analysis results in various formats, including text-format summary files, GenBank (.gbk) files for downstream analysis, and SVG images for cluster diagrams.

7. **Interactive HTML reports:** antiSMASH creates an easy-to-navigate HTML report summarizing all the results of the analysis.

8. **Relaxed Cluster Prediction:** In addition to the core areas of BGCs, antiSMASH identifies regions of the genome that could potentially contribute to secondary metabolite production, even if they don’t form a traditional cluster.
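
To get a quick overview of what a run produced, the file types listed above can be tallied by extension. This is a rough sketch; `summarize_outputs` is a hypothetical helper, and it makes no assumption about antiSMASH's directory layout beyond walking whatever directory it is given.

```python
from collections import Counter
from pathlib import Path


def summarize_outputs(output_dir):
    """Count files under output_dir by extension (e.g. .gbk, .html, .svg)."""
    counts = Counter()
    for path in Path(output_dir).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return dict(counts)
```

An empty result, or one with no `.gbk` entries, is a cheap signal that the run did not finish as expected.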



---
# Environment
---

This section should point to where the requirements file is stored and should list the necessary dependencies, install instructions, and any auxiliary information required to set up the environment. 
All requirements files should include the "import-ipynb" package, in addition to any packages required to run the tool, if you are using a notebook to host source code. This package lets you keep utility functions in a notebook-based file, where they can be stored and documented before being applied later.

In [None]:
# install import-ipynb, which is used later to import the utility notebook
!pip install import-ipynb -q
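
After installing, it can help to confirm that the packages are visible to the running kernel. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions based on the modules this notebook loads, not a definitive requirements list.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this kernel."""
    return [name for name in module_names if find_spec(name) is None]


# Module names assumed from the cells in this notebook; extend as needed.
REQUIRED = ["import_ipynb", "rpy2"]
```

Calling `missing_packages(REQUIRED)` returns an empty list when everything is installed. Note that the importable module name (`import_ipynb`) differs from the pip package name (`import-ipynb`).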

The following instructions set environment pathing variables.

In [1]:
%load_ext rpy2.ipython
%set_env WORKDIR=workdir
%set_env REPO=/home/epereira/workspace/dev/new_atlantis/repos/bioprospecting
%set_env seqtk=/nfs/bin/seqtk/seqtk

import os 
cwd = os.getcwd()
cwd

env: WORKDIR=workdir
env: REPO=/home/epereira/workspace/dev/new_atlantis/repos/bioprospecting
env: seqtk=/nfs/bin/seqtk/seqtk


'/home/jovyan/shared/Active_Projects/Templates'
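
Variables set with `%set_env` are visible to Python through `os.environ`. As a small sketch, the hypothetical helper `read_env` below collects the ones a notebook depends on and fails early if any is unset, rather than letting a later shell command fail with a less obvious error.

```python
import os


def read_env(names):
    """Return {name: value} for the given variables, failing if any is unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise KeyError(f"Unset environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}
```

With the cell above executed, `read_env(["WORKDIR", "REPO", "seqtk"])` would return the three values shown in its output.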

In [None]:
%%bash
# conda not functional yet
# %%bash must be the first line of the cell; -y skips the confirmation prompt
conda install -y numpy -c conda-forge




Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

  current version: 23.1.0
  latest version: 23.9.0

Please update conda by running

    $ conda update -n base -c conda-forge conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.9.0

# All requested packages already installed.



## R Based Dependencies
---

Insert handling and instructions for the above.

---
## Import Statements (code)(import ipynb)
---

In [None]:
import import_ipynb 

# import all utility functions from the NB_utility notebook
from src.NB_utility import *

You may also use a regular Python module to host your source code and import it here. The purpose of using a notebook over a traditional module is readability with markdown, more than any additional utility. Since notebooks require slightly different handling than importing a Python module, the examples herein use a utility notebook. See src/NB_utility.ipynb for documentation instructions.

Put all python import statements in a single cell in this section.
If any global settings for packages are altered (such as setting matplotlib to inline) do that here in a separate cell immediately after the import statements.

---
# Parameters
---

Add handling for bash flags and R. Consider an object-oriented parameter approach (a parameter object?).

This section should describe all parameters and settings passed to the tool. All user-supplied arguments should be defined and explained in this section.
This section should alternate markdown and code: the markdown explains what a parameter does and what the options are. The first parameter cell should always be paths to data.
The first section should contain one cell that only specifies data locations, such that a user only has to edit this cell in order to point the notebook to their data. The next parameter cell should contain paths to any config files or reference databases. The rest of the section should alternate between markdown explaining the parameter and code setting the parameter to some default value in either Python or environment variables. Duplicate the following examples for each parameter used during data cleaning and execution of the tool.
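
One possible shape for a parameter object is a small dataclass that holds the Python-side values and can export them as environment variables for shell cells. This is only a sketch: the class name `ToolParams`, the `threads` field, and the `NB_` prefix are hypothetical choices, not part of the template.

```python
import os
from dataclasses import asdict, dataclass


@dataclass
class ToolParams:
    """Hypothetical container for user-supplied parameters."""

    input_dir: str
    output_dir: str
    threads: int = 4  # illustrative parameter, not taken from the template

    def to_env(self, prefix="NB_"):
        """Export every field as an environment variable for shell cells."""
        exported = {
            f"{prefix}{key.upper()}": str(value)
            for key, value in asdict(self).items()
        }
        os.environ.update(exported)
        return exported
```

Keeping all parameters in one object means a bash cell can read, for example, `$NB_INPUT_DIR` while Python code reads `params.input_dir`, and both come from a single definition.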

## Input and output data directories 
---

These variables are strings that contain the paths to the input and output data directories.

In [2]:
input_dir = 'path/to/data'
output_dir = 'path/to/data'

This cell sets an environment variable to be used later. This is a useful way to set parameters for shell scripts or bash used later in the notebook.

In [3]:
import os 

os.environ['MESSAGE'] = 'Hello Worlds'

---
# Data Precleaning (if required) 
---

This section aims to demonstrate any modifications made to the data to take it from the original source to the format the tool needs for processing. Following the instructions in this section, a user should be able to replicate the steps required to take the original data set (or any similar data set) and perform the transformations required for the tool. If any metadata, augmentations to reference databases, or similar are required to run your tool, all those steps should also be included here. 
This section should follow these steps:
- Load data into notebook (pd.read_csv, bash csv, etc)(if required)
- Validate the loaded data is the correct format (for user replication) 
- Perform any other necessary validations (depending on protocol/tool)
- Alternate markdown cell explaining manipulation with code cell executing the manipulation.
- Mark each step heading with ### to insert it into the table of contents 

If any of these steps are not required for your specific tool, omit them.
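
For the assembled contigs used in this notebook, the load-and-validate steps might look like the sketch below, which reads a FASTA file with the standard library only, rejects unexpected characters, and optionally drops short contigs. The function names and the `min_length` parameter are hypothetical, and no particular length cutoff is implied by the source data.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is None:
                raise ValueError(f"{path}: sequence data before first '>' header")
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def validate_contigs(path, min_length=0):
    """Return contigs of at least min_length; raise on invalid nucleotide codes."""
    allowed = set("ACGTUNRYSWKMBDHV")  # IUPAC nucleotide codes
    kept = []
    for header, seq in read_fasta(path):
        bad = set(seq.upper()) - allowed
        if bad:
            raise ValueError(f"{header}: unexpected characters {sorted(bad)}")
        if len(seq) >= min_length:
            kept.append((header, seq))
    return kept
```

Raising on the first invalid record keeps the failure close to its cause, which is usually easier to debug than an error from the tool itself.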

---
# Execution of Tool 
---

This section aims to demonstrate how to execute the tool and performs a sample run on test data. This portion of the notebook may be fairly code intensive and is the most important part of the notebook. To improve readability and clarity, most of the verbose code segments should be written as functions in Python, as shell scripts stored in the src directory, or as scripts in whichever language is necessary for the tool you are using.

If the step you hope to perform involves more than a couple of lines of code, please see the function definition format in src/NB_utility.ipynb and wrap your code in a function using that format. If you prefer to use bash scripts, write your commands into a shell file, and then execute them in this portion of the notebook. Once your code is wrapped as a function in the utility notebook, you can import and run it using the format below.
Each step should be separated by a markdown heading with a brief explanation followed by the necessary code.
Use the parameter variables defined earlier in the notebook as arguments for functions written here. If there is an input that is not already included in the parameters section, include it there.

Example function:

In [2]:
#remember this code was executed earlier 
import import_ipynb
from src.NB_utility import *

#now I can execute the hello_world function that was defined in the utility_nb
hello_world()

importing Jupyter notebook from /home/jovyan/shared/Active_Projects/Templates/src/NB_utility.ipynb
Hello World!


True
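
For antiSMASH itself, a wrapper in the utility notebook might resemble the sketch below. It is not the authors' implementation: the function names are hypothetical, and the flags (`--output-dir`, `--cpus`, `--genefinding-tool prodigal-m`) reflect my understanding of the antiSMASH command line, so verify them against `antismash --help` for the installed version before relying on them.

```python
import shlex
import subprocess


def build_antismash_command(contigs_fasta, output_dir, cpus=4):
    """Assemble an antiSMASH command line as a list of arguments.

    The flags are assumptions to check against `antismash --help`.
    """
    return [
        "antismash",
        "--output-dir", str(output_dir),
        "--cpus", str(cpus),
        # Assembled contigs carry no gene annotations, so a gene finder is
        # requested; prodigal-m is assumed to select Prodigal's metagenomic mode.
        "--genefinding-tool", "prodigal-m",
        str(contigs_fasta),
    ]


def run_antismash(contigs_fasta, output_dir, cpus=4, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    command = build_antismash_command(contigs_fasta, output_dir, cpus)
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Defaulting to `dry_run=True` lets a user inspect the exact command before committing to a run that may take a long time on large assemblies.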

---
# Data Post Processing (if required) 
---

## Write to output directory
---
If the tool does not do it automatically, use this cell to write the output data to the output directory defined in the parameter section.
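
When results need to be copied by hand, something like the following would do it. This is a minimal sketch with a hypothetical helper name, `write_outputs`; it assumes the results sit in a single directory and relies on `shutil.copytree` with `dirs_exist_ok`, which requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def write_outputs(results_dir, output_dir):
    """Copy everything under results_dir into output_dir, creating it if needed."""
    results_dir, output_dir = Path(results_dir), Path(output_dir)
    if not results_dir.is_dir():
        raise FileNotFoundError(f"No results directory at {results_dir}")
    # dirs_exist_ok (Python 3.8+) lets repeated runs update an existing directory.
    shutil.copytree(results_dir, output_dir, dirs_exist_ok=True)
    # Report the files now present, relative to the output directory.
    return sorted(
        p.relative_to(output_dir) for p in output_dir.rglob("*") if p.is_file()
    )
```

The returned list of relative paths can be displayed in the notebook as a record of what was written.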

This section aims to contain all the code necessary to perform the data cleaning, formatting or analysis that would be performed on the output of this tool. Use the same formatting as previously mentioned in the execution section of the notebook:
- Offload long code sections to src/NB_utility.ipynb and import the code 
- Add validation to catch errors and irregularities in the data 
- Alternate code and markdown cells 
- Include a markdown header for each step using ### to add it to the table of contents
- Display data and transformations where necessary. 

---
# Visualization 
---

If you would like to include a visualization, generate it here.
Phrase the code used to generate the visualization as a function in the format mentioned in the execution section of this notebook.
Place the function in the utility notebook so that it can be reused to generate new visualizations on future data. 
If the visualization has additional options and parameters, there is no need to add them to the parameters section; they can instead go into a miniature parameter section within this section.

---
# Conclusion
---
Include any final parting thoughts in this section.
This section may also include:
- Common mistakes and fixes. 
- Debugging tips.
- Contact for the author.
- Any other information you would like to include.