![image](https://cdn.discordapp.com/attachments/996200880351215636/1065002848355631165/New_Atlantis.png) 

---
# Execution of the antiSMASH Tool
---

## Introduction

 
This notebook demonstrates how to run `antiSMASH` on assembled metagenomic data to identify Biosynthetic Gene Clusters (BGCs) in the assembled contigs. `antiSMASH` (antibiotics and Secondary Metabolite Analysis SHell) is a comprehensive genome mining tool for the automated identification, annotation, and analysis of secondary metabolite biosynthetic gene clusters in bacterial and fungal genomes. Secondary metabolites are bioactive compounds, often with therapeutic potential, including antibiotics, antifungals, and antitumor agents.
`antiSMASH` has many features, but here we use only its ability to identify BGCs. It can identify a wide range of known BGC types, such as those responsible for the production of polyketides, non-ribosomal peptides, and terpenes.
   
   
The `antiSMASH` tool provides biologists and bioinformaticians with a useful means of deep-diving into genomic data to accelerate the discovery and diversification of biologically active secondary metabolites.

`antiSMASH` uses a rule-based approach to identify many different types of biosynthetic pathways involved in secondary metabolite (SM) production. These rules are based on the annotation of specific domains (using HMM profiles) and are listed [here](https://docs.antismash.secondarymetabolites.org/glossary/#clustertypes).
In addition, `antiSMASH` integrates secondary metabolite analysis modules such as [ClusterBlast](https://docs.antismash.secondarymetabolites.org/modules/clusterblast/), [Cluster Compare](https://docs.antismash.secondarymetabolites.org/modules/cluster_compare/), and [CompaRiPPson](https://docs.antismash.secondarymetabolites.org/modules/comparippson/).
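
The rule-based idea described above can be pictured with a toy sketch. The rule names and domain names below are hypothetical placeholders, not real `antiSMASH` rules, and real rules are richer than a simple "all domains present" check. The sketch only shows the principle: a cluster type is reported when every domain its rule requires has been annotated.

```python
# Hypothetical rules: cluster type -> set of domains that must all be present.
# These names are placeholders for illustration, not antiSMASH rule definitions.
TOY_RULES = {
    "toy_type_1": {"domain_A", "domain_B"},
    "toy_type_2": {"domain_C"},
}


def matching_types(domain_hits, rules=TOY_RULES):
    """Return the cluster types whose required domains are all among the hits."""
    hits = set(domain_hits)
    # A rule matches when its required domains are a subset of the observed hits.
    return sorted(name for name, required in rules.items() if required <= hits)
```

For example, `matching_types(["domain_A", "domain_B"])` matches only the first toy rule, because `domain_C` is absent.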

## 1. Initialization

### 1.1 Create directories (Run then comment)
---

In [None]:
# import os 

# if not os.path.exists('./src'):
#     os.makedirs('./src')
#     os.makedirs('./data')
#     os.makedirs('./data/input_data')
#     os.makedirs('./data/output_data')
#     os.makedirs('./data/reference_data')
#     with open('./src/requirements.txt', 'w') as f:
#         pass
#     with open('./src/utilities.ipynb', 'a') as nb:
#         pass
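
The cell above creates the folders only when `./src` does not exist yet, which is why it has to be commented out after the first run. As a hedged alternative, the sketch below builds the same folder layout idempotently, so it can stay enabled. The function name `create_layout` is an assumption introduced here, and the sketch deliberately leaves out the empty utilities notebook file.

```python
from pathlib import Path


def create_layout(root="."):
    """Create the notebook's folder layout; safe to call repeatedly."""
    root = Path(root)
    for sub in ("src", "data/input_data", "data/output_data", "data/reference_data"):
        # exist_ok=True makes the call a no-op when the folder is already there.
        (root / sub).mkdir(parents=True, exist_ok=True)
    # touch() creates the file if missing and leaves existing content untouched.
    (root / "src" / "requirements.txt").touch()
    return root
```

Calling `create_layout(".")` from the notebook's working directory reproduces the tree without needing to comment the cell out afterwards.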

### 1.2 Input Data 
---

The input data are metagenomic samples previously preprocessed and assembled with [VEBA](https://github.com/jolespin/veba), a metagenomic assembly tool.
The input consists of:
1. **Assembled contigs**: `.fasta` files.

2. **Mapping files**: reads mapped to the assembled contigs, as `.bam` files.


### 1.3 Output Data 
---

`antiSMASH` outputs an array of data related to the identified secondary metabolite biosynthesis gene clusters (BGCs). The output is highly comprehensive and may vary depending on which modules are included in the execution of the tool. In this notebook, however, we run the tool in minimal mode, that is, only to annotate BGC domains.
In this case, the output of `antiSMASH` consists of a list of GenBank files containing the sequences of the identified BGCs.
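
As a minimal sketch of how those GenBank files could be collected after a run, the helper below lists them from an output directory. It assumes the per-region files follow the `<record>.region<NNN>.gbk` naming pattern; that pattern and the helper name `list_region_files` are assumptions made here, so check them against the files your run actually produces.

```python
from pathlib import Path


def list_region_files(output_dir):
    """Return the sorted per-region GenBank files found directly in output_dir."""
    # Assumed naming pattern: <record>.region<NNN>.gbk
    return sorted(Path(output_dir).glob("*.region*.gbk"))
```

Whole-record `.gbk` files in the same directory are skipped because they lack the `.region` part in their name.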


### 1.4 Data Loading
---

For the purpose of this notebook, we will analyze metagenomes originally from the [SOLA dataset](https://www.nature.com/articles/s41396-018-0158-1). This dataset consists of 40 metagenomic samples of the surface seawater collected monthly from January 2012 to February 2015 at the SOLA station, located in the northwestern Mediterranean (42°31′N, 03°11′E). 


Download the FASTA and BAM files for the demonstration sample from the New Atlantis S3 resources.

NOTE: Additional tests are required to ensure the S3 client is installed on JHUB and always works. K+T task.

NOTE: Add these utility bash commands to the user environment. Also provide functions written in Python for this.


In [None]:
! #aws s3 cp s3://newatlantis-case-studies/SOLA-samples/ERR2604088 ${WORKDIR}/data/sola/${SAMPLE}* --recursive

---
## 2. Environment
---

### 2.1 Main dependencies
___

The `antiSMASH` operating environment is fully described in the Dockerfile located in the `src` subfolder of this notebook. The Docker image contains all packages needed to run `antiSMASH`; each is installed via apt at the most recent version available there.
These packages include:

- [hmmer 2/3](http://hmmer.org)
- [diamond](https://github.com/bbuchfink/diamond)
- [fasttree](http://www.microbesonline.org/fasttree/)
- [prodigal](https://github.com/hyattpd/Prodigal)
- [ncbi blast](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)
- [muscle](https://www.ebi.ac.uk/Tools/msa/muscle/)
- [glimmerhmm](https://ccb.jhu.edu/software/glimmerhmm/man.shtml)
    
The `antiSMASH` [distribution](https://antismash.secondarymetabolites.org) and its associated reference data are pulled directly from their website.
For more information and detail on how these packages are installed, see the Dockerfile.

NOTE: Add pointer to image on dockerhub when uploaded.

### 2.2 Notebook utility installs
___

The `import-ipynb` package installed here makes it possible to use a reference notebook as a Python module.

In [52]:
# Install import-ipynb, which is used later to import the utilities notebook
!pip3 install import-ipynb -q

### 2.3 R based dependencies
---

Insert handling and instructions for the above.

### 2.4 Import statements (code)(import ipynb)
---

In [1]:
import import_ipynb
from src.utilitites import *

importing Jupyter notebook from /home/epereira/workspace/dev/new_atlantis/repos/bioprospecting/execution/bgc_annotation/src/utilitites.ipynb


### 2.5 Session environment variables
---

The ID of the sample being processed.

In [2]:
%env SAMPLE=ERR2604088

env: SAMPLE=ERR2604088


### 2.6 Input and output data files and directories 
---

In [3]:
%env INPUT_FOLDER=./data/input_data/ERR2604088

env: INPUT_FOLDER=./data/input_data/ERR2604088


In [4]:
%env INPUT_FASTA=./data/input_data/ERR2604088/output/scaffolds.fasta

env: INPUT_FASTA=./data/input_data/ERR2604088/output/scaffolds.fasta


In [9]:
%env INPUT_BAM=./data/input_data/ERR2604088/output/mapped.sorted.bam

env: INPUT_BAM=./data/input_data/ERR2604088/output/mapped.sorted.bam


In [12]:
%env OUTPUT_DIR=./data/output_data/sola_antismashed/ERR2604088

env: OUTPUT_DIR=./data/output_data/sola_antismashed/ERR2604088
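
Because every later step depends on these variables, it can help to check them before launching the tool. The sketch below is one hedged way to do that: it reports which of the input variables are unset or point at a file that does not exist. The function name `missing_inputs` is an assumption introduced here; the variable names are the ones set above.

```python
import os


def missing_inputs(names=("INPUT_FASTA", "INPUT_BAM"), env=None):
    """Return the variable names that are unset or do not point at a file."""
    env = os.environ if env is None else env
    missing = []
    for name in names:
        path = env.get(name)
        # Flag the variable when it is unset, empty, or not an existing file.
        if not path or not os.path.isfile(path):
            missing.append(name)
    return missing
```

An empty list from `missing_inputs()` means both input files were found; anything else names the variables to fix before running section 5.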


---
## 3. Parameters
---

Add handling for bash flags and r - object oriented parameter approach. parameter object?

### 3.1 Run Time ENV Variables
___

`--taxon` {bacteria,fungi} Taxonomic classification of input sequence. (default: bacteria)

In [17]:
%env TAXON=bacteria

env: TAXON=bacteria


`-c CPUS, --cpus CPUS` How many CPUs to use in parallel. (default: 48)  
Note: Should be set/optimized per job scheduler.

In [18]:
%env CPUS=4

env: CPUS=4


`--genefinding-tool` {glimmerhmm,prodigal,prodigal-m,none,error} Specify algorithm used for gene finding: GlimmerHMM, Prodigal, Prodigal Metagenomic/Anonymous mode, or none. The 'error' option will raise an error if genefinding is attempted. The 'none' option will not run genefinding. (default: error).  
Note: This is the tool `antiSMASH` uses to predict genes in the input sequences. We recommend Prodigal; for metagenomic contigs, as here, its metagenomic mode (`prodigal-m`) is used.

In [19]:
%env GENEFINDING_TOOL=prodigal-m

env: GENEFINDING_TOOL=prodigal-m


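`--minlength` MINLENGTH Only process sequences larger than `<minlength>` (default: 1000).  
Note: Section 5 passes `${MINLENGTH}` to this flag, so set it with `%env MINLENGTH=<value>` before executing the tool.

`--minimal` Only run core detection modules, no analysis modules unless explicitly enabled.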
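
One way to keep the run-time flags above together is a small parameter object, sketched below. The class name `AntismashParams` and its method are hypothetical and not part of `antiSMASH` or of this notebook's utilities. The flags and the default values mirror the ones listed and set in this section.

```python
from dataclasses import dataclass


@dataclass
class AntismashParams:
    """Hypothetical container for the run-time flags used in this notebook."""

    taxon: str = "bacteria"
    cpus: int = 4
    genefinding_tool: str = "prodigal-m"
    minlength: int = 1000
    minimal: bool = True

    def to_args(self):
        """Render the parameters as a list of command-line arguments."""
        args = [
            "--taxon", self.taxon,
            "--cpus", str(self.cpus),
            "--genefinding-tool", self.genefinding_tool,
            "--minlength", str(self.minlength),
        ]
        # --minimal is a switch, so it is only appended when enabled.
        if self.minimal:
            args.append("--minimal")
        return args
```

`AntismashParams(cpus=8).to_args()` yields a list that can be appended to the command line of the run script, which keeps parameter choices in one reviewable place.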

---
## 4. Data Precleaning (if required) 
---

In [15]:
# Empty

---
## 5. Execution of Tool
---

The recommended way to run `antiSMASH` is the shell script provided in the `src` directory. The script runs the tool in the appropriate Docker container and saves the results to the output directory specified in section 2.6. Run `run_antismash.sh . . --help` for more detail on how the tool is run. The following cell runs the script on the sample's assembled contigs and writes the results to the target directory.

### Execute antiSMASH shell script
---

In [None]:
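%%bash
# Variables set with %env are inherited by this shell.
# MINLENGTH falls back to the documented default of 1000 when it is unset.
./src/run_antismash.sh "${INPUT_FASTA}" "${OUTPUT_DIR}" \
  --cpus "${CPUS}" \
  --genefinding-tool "${GENEFINDING_TOOL}" \
  --taxon "${TAXON}" \
  --allow-long-headers \
  --minlength "${MINLENGTH:-1000}" \
  --minimal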

### Compute the coverage of the identified BGC sequences
---

In [11]:
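import os

# Read the paths set in section 2. A missing variable raises KeyError
# instead of silently passing a placeholder string to get_coverage.
input_dir = os.environ["OUTPUT_DIR"]  # the antiSMASH results are the input here
input_bam = os.environ["INPUT_BAM"]
sample = os.environ["SAMPLE"]

# get_coverage comes from the utilities notebook imported in section 2.4
get_coverage(input_dir=input_dir, input_bam=input_bam, sample_name=sample)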

| Unnamed: 0 | acc | bgc_class | on_edge | start | end | coverage | sample_name |
|---|---|---|---|---|---|---|---|
| ERR2604088__k119_156069-1 | ERR2604088__k119_156069 | redox-cofactor | True | 0 | 6581 | 1.042547 | ERR2604088 |
| ERR2604088__k119_116291-1 | ERR2604088__k119_116291 | betalactone | True | 0 | 18303 | 1.01803 | ERR2604088 |
| ERR2604088__k119_33754-1 | ERR2604088__k119_33754 | terpene | True | 0 | 18531 | 1.053532 | ERR2604088 |
| ERR2604088__k119_73676-1 | ERR2604088__k119_73676 | betalactone | True | 0 | 8132 | 1.064437 | ERR2604088 |
| ERR2604088__k119_146697-1 | ERR2604088__k119_146697 | terpene | True | 0 | 5125 | 1.062829 | ERR2604088 |
| ERR2604088__k119_9801-1 | ERR2604088__k119_9801 | terpene | True | 0 | 14890 | 1.048019 | ERR2604088 |
| ERR2604088__k119_111249-1 | ERR2604088__k119_111249 | phosphonate | True | 0 | 11770 | 1.031606 | ERR2604088 |
| ERR2604088__k119_102369-1 | ERR2604088__k119_102369 | terpene | True | 0 | 19680 | 1.043699 | ERR2604088 |
| ERR2604088__k119_8575-1 | ERR2604088__k119_8575 | terpene | True | 0 | 6550 | 1.049924 | ERR2604088 |
| ERR2604088__k119_63734-1 | ERR2604088__k119_63734 | betalactone | True | 0 | 9507 | 1.033344 | ERR2604088 |
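
The `get_coverage` helper lives in the utilities notebook and is not shown here, so the sketch below is not its implementation. It only illustrates one common definition of coverage, mean per-base depth over a region, under the assumptions that alignments are given as half-open `(start, end)` intervals on the contig and that the region is half-open as well. The function name `mean_coverage` is hypothetical.

```python
def mean_coverage(alignments, start, end):
    """Mean per-base depth over the half-open region [start, end).

    alignments: iterable of (aln_start, aln_end) half-open intervals on the
    same contig as the region.
    """
    length = end - start
    if length <= 0:
        raise ValueError("region must have positive length")
    covered_bases = 0
    for aln_start, aln_end in alignments:
        # Count only the part of the alignment that falls inside the region.
        overlap = min(aln_end, end) - max(aln_start, start)
        if overlap > 0:
            covered_bases += overlap
    return covered_bases / length
```

Under this definition, two alignments `(0, 100)` and `(50, 150)` over the region `[0, 100)` contribute 100 and 50 bases, giving a mean depth of 1.5.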


---
## Conclusion
---
Include any final parting thoughts in this section.
This section may also include:
- Common mistakes and fixes.
- Debugging tips.
- Contact for the author.
- Any other information you would like to include.
