![image](https://cdn.discordapp.com/attachments/996200880351215636/1065002848355631165/New_Atlantis.png) 

---
# Execution of the antiSMASH Tool
---

## Introduction

 
This notebook demonstrates running `antiSMASH` on assembled metagenomic data to identify Biosynthetic Gene Clusters (BGCs) in the assembled contigs. `antiSMASH` (antibiotics and Secondary Metabolite Analysis Shell) is a comprehensive genome mining tool for the automated identification, annotation, and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genomes. Secondary metabolites are bioactive compounds, often with therapeutic potential, including antibiotics, antifungals and antitumorals.
`antiSMASH` has many features, but here we use only its ability to identify BGCs. It can identify many known types of BGCs, such as those responsible for the production of polyketides, non-ribosomal peptides, terpenes, etc.
   
   
The `antiSMASH` tool provides biologists and bioinformaticians with a useful means of deep-diving into genomic data to accelerate the discovery and diversification of biologically active secondary metabolites.

**Markdown cell giving a brief explanation of the tool. Cover what insight the tool is expected to generate and give an extremely brief description of its methods.**

I don't have a good grasp of why or how the tool works, so please provide some brief comments on this.

## Input Data 
---

The input data are metagenomic samples previously preprocessed and assembled with [VEBA](https://github.com/jolespin/veba), a metagenomic assembly tool. For the purpose of this notebook, we will analyze metagenomes originally from the SOLA dataset. 
The input consists of:
1. **Assembled contigs**: .fasta files.

2. **Mapping files** of the assembled contigs in .bam files.


## Output Data 
---

`antiSMASH` outputs an array of data related to the identified secondary metabolite biosynthesis gene clusters (BGCs). The output is highly comprehensive and includes the following:

- **Gene Cluster Annotations:** For each identified BGC, antiSMASH provides an annotation of the genes in the cluster, including the predicted function and potential products. These annotations are written to a GenBank file.
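
To make the GenBank output more concrete, here is a minimal sketch of how the BGC regions could be listed from such a file using only the standard library. It assumes that antiSMASH marks each BGC as a `region` feature carrying one or more `/product` qualifiers; the function name `list_regions` and the sample text are hypothetical and exist only for illustration. For real analyses a full GenBank parser such as Biopython is likely the better choice.

```python
import re
from typing import List, Tuple

def list_regions(genbank_text: str) -> List[Tuple[str, List[str]]]:
    """Return (location, products) for every `region` feature in GenBank text.

    Assumption: each BGC is recorded as a `region` feature with `/product`
    qualifiers. Verify this against the files your antiSMASH version writes.
    """
    regions: List[Tuple[str, List[str]]] = []
    current = None  # the region feature currently being read, if any
    for line in genbank_text.splitlines():
        # Feature keys start after exactly five spaces in the feature table.
        feature = re.match(r"^ {5}(\S+)\s+(\S+)", line)
        if feature:
            current = None
            if feature.group(1) == "region":
                current = (feature.group(2), [])
                regions.append(current)
            continue
        # Qualifier lines are indented further and start with a slash.
        product = re.match(r'^\s+/product="([^"]+)"', line)
        if current is not None and product:
            current[1].append(product.group(1))
    return regions

# Hypothetical excerpt of a feature table, for illustration only.
SAMPLE_GENBANK = """\
FEATURES             Location/Qualifiers
     region          1..25000
                     /product="T1PKS"
                     /region_number="1"
     CDS             100..900
                     /product="hypothetical protein"
     region          40000..61000
                     /product="NRPS"
                     /product="terpene"
"""

for location, products in list_regions(SAMPLE_GENBANK):
    print(location, ", ".join(products))
```

Products attached to other features, such as the `CDS` entry in the sample, are ignored because only qualifiers that follow a `region` key are collected.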


NOTE

**Got this from the internet, so please verify accuracy and remove outputs that aren't used for this purpose.**


---
# Environment
---

## Main Dependencies
___

The `antiSMASH` operating environment is fully described in the Dockerfile located in the src subfolder of this notebook. The Docker image contains all packages necessary to run `antiSMASH`; each is installed via apt at the most recent available version.
These packages include:

- [hmmer 2/3](http://hmmer.org)
- [diamond](https://github.com/bbuchfink/diamond)
- [fasttree](http://www.microbesonline.org/fasttree/)
- [prodigal](https://github.com/hyattpd/Prodigal)
- [ncbi blast](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)
- [muscle](https://www.ebi.ac.uk/Tools/msa/muscle/)
- [glimmerhmm](https://ccb.jhu.edu/software/glimmerhmm/man.shtml)
    
The `antiSMASH` [distribution](https://antismash.secondarymetabolites.org) and its associated reference data are pulled directly from their website.
For more information and detail on how these packages are installed, see the Dockerfile.

NOTE: Add pointer to image on dockerhub when uploaded.

## Notebook Utility Installs
___

The import-ipynb package installed here makes it possible to use a reference notebook as a Python module.

In [1]:
#install for import-ipynb that will be applied later
!pip install import-ipynb -q

## Pathing and Directory Environment Variables 
___

**The following instructions set environment pathing variables and load a Jupyter extension for running R.**

In [2]:
%load_ext rpy2.ipython
#working directory for executables
%set_env WORKDIR=data

#pathing for seqtk package (no trailing comment here: %set_env would store it as part of the value)
%set_env seqtk=/nfs/bin/seqtk/seqtk

#sets environ relatively in case files move, this sets the REPO variable to be the parent directory of this notebook.
import os
cwd = os.getcwd()
parent_directory = os.path.dirname(cwd)

#sets env variable "REPO" to the path of the parent directory
os.environ['REPO'] = parent_directory


env: WORKDIR=data
env: seqtk=/nfs/bin/seqtk/seqtk


**These commands create the data and output directories if they don't already exist.**

In [3]:
%%bash 
mkdir -p "${WORKDIR}/data/sola"
mkdir -p "${WORKDIR}/outputs/antismash/taxonomy"
mkdir -p "${WORKDIR}/outputs/bgc_abund"
mkdir -p "${WORKDIR}/outputs/bgc_taxa"
mkdir -p "${WORKDIR}/outputs/tables"
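
The same directory layout can also be created from a Python cell, which avoids depending on the shell. The sketch below mirrors the directories created by the bash cell; the function name `make_directories` is hypothetical, and the demonstration runs in a temporary directory so that it has no side effects.

```python
import tempfile
from pathlib import Path
from typing import List

# Sub-directories created by the bash cell, relative to WORKDIR.
SUBDIRS = [
    "data/sola",
    "outputs/antismash/taxonomy",
    "outputs/bgc_abund",
    "outputs/bgc_taxa",
    "outputs/tables",
]

def make_directories(workdir: str) -> List[Path]:
    """Create every expected sub-directory under `workdir`; safe to re-run."""
    created = []
    for sub in SUBDIRS:
        path = Path(workdir) / sub
        # parents=True creates intermediate folders, exist_ok=True makes
        # repeated calls harmless (the equivalent of `mkdir -p`).
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for path in make_directories(tmp):
        print(path.relative_to(tmp))
```

In the notebook itself the call would be `make_directories(os.environ["WORKDIR"])`, assuming `WORKDIR` has been set as shown earlier.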


## R Based Dependencies
---

Insert handling and instructions for the above.

---
## Import Statements (code)(import ipynb)
---

In [None]:
import import_ipynb 
#import all utility functions from the NB_utility notebook
from src.NB_utility import *

NOTE: your implementation does not appear to need this section, but that may change once we add features to the JHub so I am leaving it here for now.

---
# Parameters
---

Add handling for bash flags and r - object oriented parameter approach. parameter object?

This section should describe all parameters and settings passed to the tool. All user-supplied arguments should be defined and explained in this section.
This section should alternate markdown and code: the markdown explains what a parameter does and what the options are. The first parameter cell should always be paths to data.
The first section should contain one cell that only specifies data locations, such that a user only has to edit this cell in order to point the notebook to their data. The next parameter cell should contain paths to any config files or reference databases. The rest of the section should alternate between markdown explaining the parameter and code setting the parameter to some default value in either Python or env variables. Duplicate the following examples for each parameter used during data cleaning and execution of the tool.
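
One possible shape for the parameter object mentioned above is a small dataclass that holds the run-time values and turns them into command-line flags. This is only a sketch: the class name `AntismashParams` is hypothetical, the defaults copy the values used in this notebook, and the sets of accepted values are assumptions that should be checked against `antismash --help` for the installed version.

```python
from dataclasses import dataclass
from typing import List

# Assumed sets of accepted values; confirm with `antismash --help`.
TAXA = {"bacteria", "fungi"}
GENEFINDING_TOOLS = {"glimmerhmm", "prodigal", "prodigal-m", "none", "error"}

@dataclass(frozen=True)
class AntismashParams:
    """Run-time parameters for one antiSMASH execution (hypothetical helper)."""
    sample: str = "ERR2604088"
    taxon: str = "bacteria"
    cpus: int = 40
    genefinding_tool: str = "prodigal-m"
    minlength: int = 5000

    def __post_init__(self) -> None:
        # Fail early on values the tool would reject anyway.
        if self.taxon not in TAXA:
            raise ValueError(f"unknown taxon: {self.taxon!r}")
        if self.genefinding_tool not in GENEFINDING_TOOLS:
            raise ValueError(f"unknown gene finding tool: {self.genefinding_tool!r}")
        if self.cpus < 1 or self.minlength < 0:
            raise ValueError("cpus must be >= 1 and minlength must be >= 0")

    def to_flags(self) -> List[str]:
        """Render the flags in the order used by the execution cell."""
        return [
            "--cpus", str(self.cpus),
            "--genefinding-tool", self.genefinding_tool,
            "--taxon", self.taxon,
            "--allow-long-headers",
            "--minlength", str(self.minlength),
        ]

params = AntismashParams()
print(" ".join(params.to_flags()))
```

Keeping validation in `__post_init__` means a misspelled tool name is caught when the parameters are defined rather than partway through a long run.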

## Input and output data directories 
---

These variables are strings that contain the paths to the input and output data directories.

They were set earlier in the notebook. The template is still being adjusted, so there is no need to worry about this section; the previous directory structure works well.

## Run Time ENV Variables
___

### Example Sample ID
---

The ID of the sample being processed.

In [None]:
%set_env SAMPLE=ERR2604088

### Taxonomy of Focus 
---

The specific taxonomy of interest.

In [None]:
%set_env TAXON=bacteria

### Number CPUs
---

The number of CPUs allocated to running the task. NOTE:(Should be set/optimized per job scheduler)

In [None]:
%set_env CPU=40

### Gene Finding Tool
___

Tool that `antiSMASH` uses to predict genes when the input sequences carry no gene annotations. We recommend prodigal for this purpose (here in its metagenomic mode, `prodigal-m`).

In [None]:
%set_env GENETOOL=prodigal-m

### Minimum Length 
---

Minimum length of a sequence (contig) to be analyzed; shorter sequences are skipped.

In [None]:
%set_env MINLENGTH=5000
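
To see what a given minimum length means for a sample, it can help to count how many contigs would pass the threshold before launching the tool. The sketch below reads FASTA text with the standard library only. The function names are hypothetical, and treating the threshold as inclusive is an assumption about how `antiSMASH` applies `--minlength`.

```python
from typing import Dict, Iterable, List

def sequence_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record name to the length of its sequence."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def passing_minlength(lengths: Dict[str, int], minlength: int) -> List[str]:
    """Names of records at least `minlength` long (inclusive: an assumption)."""
    return [name for name, length in lengths.items() if length >= minlength]

# Tiny made-up FASTA for illustration; real contigs are far longer.
EXAMPLE_FASTA = """\
>contig_1 example
ACGTACGTACGT
>contig_2 example
ACGT
>contig_3 example
ACGTA
CGTAC
"""

lengths = sequence_lengths(EXAMPLE_FASTA.splitlines())
kept = passing_minlength(lengths, minlength=10)
print(f"{len(kept)} of {len(lengths)} contigs pass:", kept)
```

For a real sample, pass an open file handle instead of the example text, for instance `sequence_lengths(open(path))` with `path` pointing at the `scaffolds.fasta` file.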

---
# Data Loading
---

## Download Fasta File Containing Sample
---

Download the demonstration sample from New Atlantis S3 resources. 

NOTE: additional tests required to ensure the AWS CLI (`aws s3`) is installed on JHUB and will always work. K+T task.

In [None]:
%%bash
aws s3 cp "s3://newatlantis-case-studies/SOLA-samples/${SAMPLE}" "${WORKDIR}/data/sola/${SAMPLE}" --recursive

## Download MIBiG Gene Cluster Families (GCF) Models
---

Download the benchmarked Gene Cluster Families (GCF) models from the MIBiG database; they are used for reference in the following analysis.

NOTE: Please provide more info what these are and name MIBiG by name in the input data section of this notebook.

In [None]:
%%bash
aws s3 cp s3://newatlantis-case-studies/mibig_gcf_models ${WORKDIR}/data/mibig_gcf_models --recursive

---
# Execution of Tool 
---

The best way to run `antiSMASH` is with the shell script provided in the src directory. The script runs the tool in the proper Docker container and writes the results to the output directories specified in the previous sections. View `run_antismash.sh` for more detail on how the tool is run.

## Execute antiSMASH shell script
---

In [2]:
%%bash

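# Run antiSMASH on every assembled sample found under the SOLA data directory
for SCAFFOLD in "${WORKDIR}"/data/sola/ERR*/output/scaffolds.fasta; do
  # Skip the unexpanded pattern when no sample has been downloaded yet
  [ -e "${SCAFFOLD}" ] || continue
  # Extract the sample accession (e.g. ERR2604088) from the scaffold path
  SAMPLE_NAME=$(echo "${SCAFFOLD}" | sed "s/.*\(ERR[0-9]\+\)\/output.*/\1/")
  OUTPUT_DIR="${WORKDIR}/outputs/antismash/${SAMPLE_NAME}"
  "${REPO}"/execution/src/run_antismash.sh "${SCAFFOLD}" "${OUTPUT_DIR}" \
    --cpus "${CPU}" \
    --genefinding-tool "${GENETOOL}" \
    --taxon "${TAXON}" \
    --allow-long-headers \
    --minlength "${MINLENGTH}"
done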
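
The loop above can also be expressed in Python, which makes the sample discovery easy to inspect before anything is launched. This is a sketch rather than a replacement for the shell cell: the helper names `find_scaffolds` and `build_command` are hypothetical, and the commands are only printed, not executed. The directory layout and the script path follow the bash cell.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List

# Same idea as the sed expression: capture the ERR accession before /output.
SAMPLE_PATTERN = re.compile(r"(ERR[0-9]+)/output/scaffolds\.fasta$")

def find_scaffolds(workdir: str) -> Dict[str, Path]:
    """Map sample accession to its scaffolds.fasta under `workdir`."""
    found: Dict[str, Path] = {}
    pattern = "data/sola/ERR*/output/scaffolds.fasta"
    for path in sorted(Path(workdir).glob(pattern)):
        match = SAMPLE_PATTERN.search(path.as_posix())
        if match:
            found[match.group(1)] = path
    return found

def build_command(script: str, scaffold: Path, output_dir: str,
                  flags: List[str]) -> List[str]:
    """Assemble the argument list passed to run_antismash.sh."""
    return [script, str(scaffold), str(output_dir), *flags]

# Demonstration with an empty placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    scaffold = Path(workdir) / "data/sola/ERR2604088/output/scaffolds.fasta"
    scaffold.parent.mkdir(parents=True)
    scaffold.touch()
    flags = ["--cpus", "40", "--taxon", "bacteria", "--minlength", "5000"]
    for sample, path in find_scaffolds(workdir).items():
        output_dir = f"{workdir}/outputs/antismash/{sample}"
        print(build_command("execution/src/run_antismash.sh", path,
                            output_dir, flags))
```

To actually run a command, the list can be handed to `subprocess.run(command, check=True)`, which raises an error when the script exits with a non-zero status.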


NOTE: Add any necessary print statements to the shell script to ensure it is working, though docker handles this well.
Add a return to the shell script that prints to confirm a successful run.
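
Until the shell script reports its own status, a run can be checked from the notebook by looking at what was written to the output directory. The sketch below assumes that `antiSMASH` writes one GenBank file per detected region, named like `<record>.region001.gbk`, next to a GenBank file for the whole input. That naming is an assumption to verify against real output, and the function names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

def region_files(output_dir: str) -> List[Path]:
    """Per-region GenBank files (assumed naming: *.region###.gbk)."""
    return sorted(Path(output_dir).glob("*.region[0-9][0-9][0-9].gbk"))

def run_summary(output_dir: str) -> str:
    """One-line status for an antiSMASH output directory."""
    out = Path(output_dir)
    if not out.is_dir():
        return f"FAILED: {out} does not exist"
    if not any(out.glob("*.gbk")):
        return f"FAILED: no GenBank files in {out}"
    count = len(region_files(output_dir))
    return f"OK: {count} BGC region file(s) in {out}"

# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    print(run_summary(tmp))
    (Path(tmp) / "scaffolds.gbk").touch()
    (Path(tmp) / "contig_1.region001.gbk").touch()
    print(run_summary(tmp))
```

A run that finishes without finding any BGCs would report `OK: 0`, so the two failure messages only flag runs that produced no GenBank output at all.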

---
# Conclusion
---
Include any final parting thoughts in this section.
This section may also include:
- Common mistakes and fixes. 
- Debugging tips.
- Contact for the author.
- Any other information you would like to include

# Write what you want in the above.