---
# Introduction
---

## Initialization Code (Run then remove)

In [3]:
import os 

if not os.path.exists('./src'):
    os.makedirs('./src')
    os.makedirs('./data')
    os.makedirs('./data/input_data')
    os.makedirs('./data/output_data')
    os.makedirs('./data/reference_data')
    with open('./src/requirements.txt', 'w') as f:
        pass
    with open('./src/utilities.ipynb', 'a') as nb:
        pass

## Tool Purpose  
---

[BiG-SLICE](https://github.com/pereiramemo/bigslice) is a tool designed to cluster biosynthetic gene cluster (BGC) sequences into Gene Cluster Families (GCFs) based on their protein domain composition. It uses the [Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)](https://en.wikipedia.org/wiki/BIRCH) algorithm, a clustering algorithm with near-linear time complexity.
The tool can be executed in clustering mode or query mode, which perform the de novo clustering of BGC sequences and the positioning of query BGC sequences onto previously computed GCF models, respectively.   
This notebook is dedicated to the execution of the BiG-SLICE tool in clustering mode.

## Input Data 
---

Input data consists of BGC sequences (complete or partial) annotated in contigs of Metagenome-Assembled Genomes (MAGs), stored as GenBank files and named following the [antiSMASH](https://github.com/antismash/antismash) or [MIBiG](https://mibig.secondarymetabolites.org/) nomenclature (i.e., `<genome_name>.regionXXX.gbk` and `BGCXXXXXXX.gbk`, respectively).
These sequences have to be organized in a directory structure having the dataset and genomes subfolders as specified [here](https://github.com/medema-group/bigslice/wiki/Input-folder).
This is the input data that the user must provide to run this notebook. In addition, to be able to execute the BiG-SLICE tool, this notebook automatically generates the `datasets.tsv` and `taxonomy.tsv` files as described [here](https://github.com/medema-group/bigslice/wiki/Input-folder#datasetstsv).  
For demonstration purposes, we analyze 38 metagenomic samples of the [SOLA dataset](https://pubmed.ncbi.nlm.nih.gov/29925880/), a time series spanning three years (from 2012 to 2015) obtained from a coastal northwestern Mediterranean site.
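
As a quick sanity check of the naming conventions described above, the following minimal sketch classifies file names with two regular expressions. The helper `classify_bgc_filename` is hypothetical (it is not part of BiG-SLICE or of `src/utilities.ipynb`), and the digit counts are simply read off the `XXX` and `XXXXXXX` placeholders.

```python
import re

# Patterns derived from the naming conventions quoted in the text.
# Assumption: "XXX" stands for 3 digits and "XXXXXXX" for 7 digits.
ANTISMASH_RE = re.compile(r"^.+\.region\d{3}\.gbk$")
MIBIG_RE = re.compile(r"^BGC\d{7}\.gbk$")


def classify_bgc_filename(filename):
    """Return 'mibig', 'antismash', or None for an unrecognized file name."""
    if MIBIG_RE.match(filename):
        return "mibig"
    if ANTISMASH_RE.match(filename):
        return "antismash"
    return None


for name in [
    "ERR2604088__k119_156069.region001.gbk",
    "BGC0000001.gbk",
    "contig_1.gbk",
]:
    print(name, "->", classify_bgc_filename(name))
```

Files reported as `None` would need to be renamed or excluded before they are placed in the input folder.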

## Output Data 
---

Description of the data output by the tool.
- Qualitative (Functional Capacity, Proteomic Assembly, etc.)
- File types and formatting (.fasta, Blast6, csv/dataframe with schema, etc.)

---
# Environment
---

**Dependencies**

[Docker](https://www.docker.com/)  
[tidyverse R package](https://www.tidyverse.org/)  
[RSQLite R package](https://cran.r-project.org/web/packages/RSQLite/index.html)

**Installations**

In [6]:
!pip3 install import-ipynb -q
!Rscript -e 'install.packages("tidyverse", repos = "https://cloud.r-project.org")' > /dev/null 2>&1
!Rscript -e 'install.packages("RSQLite", repos = "https://cloud.r-project.org")' > /dev/null 2>&1

---
# Import Statements (code)(import ipynb)
---

In [1]:
import import_ipynb
from src.utilities import *

importing Jupyter notebook from /home/epereira/workspace/dev/new_atlantis/repos/bgc_clust/src/utilities.ipynb


---
# Parameters
---

Define the folders for input and output data:

In [65]:
%env INPUT_DIR=./data/input_data/sola_antismash/
%env OUTPUT_DIR=./data/output_data/

env: INPUT_DIR=./data/input_data/sola_antismash/
env: OUTPUT_DIR=./data/output_data/


`--num_threads`: Number of CPUs used to run BiG-SLICE (stored here in the `CPU` environment variable).

In [9]:
%env CPU=40

env: CPU=40


`--threshold_pct`: Calculate clustering threshold (T) based on a random sampling of pairwise distances between the data, taking the N-th percentile value as the threshold.

In [12]:
%env THRESHOLD_PCT=0.1

env: THRESHOLD_PCT=0.1
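
To make the idea behind `--threshold_pct` concrete, here is a minimal sketch of a percentile-based threshold estimated from randomly sampled pairwise distances. It is not BiG-SLICE's implementation: the Euclidean distance, the number of sampled pairs, the linear interpolation, and the 0 to 100 percentile scale are all assumptions made for illustration, and `estimate_threshold` is a hypothetical helper.

```python
import math
import random


def percentile(sorted_values, pct):
    """Linear-interpolation percentile of an ascending list (pct in [0, 100])."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def estimate_threshold(features, pct, n_pairs=1000, seed=0):
    """Sample random pairs of feature vectors and return the pct-th
    percentile of their Euclidean distances."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_pairs):
        a, b = rng.sample(features, 2)  # two distinct items
        distances.append(math.dist(a, b))
    distances.sort()
    return percentile(distances, pct)


# Toy feature vectors standing in for BGC domain-composition features.
toy_features = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print("T at the 0.1th percentile:", estimate_threshold(toy_features, 0.1))
print("T at the 50th percentile:", estimate_threshold(toy_features, 50))
```

Under these assumptions, a smaller percentile yields a smaller threshold, which would allow only very similar items to be grouped together.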



---
# Data Precleaning (if required) 
---

Once we have annotated the BGC sequences in our assembled metagenomic samples using the bgc_annot notebook, we have the following folder structure:

> * input_folder/
>   * metagenomic_dataset_1/
>     * assembly/
>       * contig_1.region001.gbk
>       * contig_2.region002.gbk
>       * ...
>   * metagenomic_dataset_2/
>     * assembly/
>       * contig_1.region001.gbk
>       * contig_2.region002.gbk
>       * ...  
>   * ...        

Let's check what our example dataset looks like:

In [38]:
!ls "${INPUT_DIR}"/* | head -12

./data/input_data/sola_antismash//ERR2604071:
scaffolds

./data/input_data/sola_antismash//ERR2604073:
scaffolds

./data/input_data/sola_antismash//ERR2604074:
scaffolds

./data/input_data/sola_antismash//ERR2604075:
scaffolds



When running the following command, we should see all the identified BGC sequences in GenBank format:

In [42]:
!find "${INPUT_DIR}" -mindepth 3 -maxdepth 3 -type f -name "*region*.gbk" | head 

./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_156069.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_116291.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_33754.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_73676.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_146697.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_9801.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_111249.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_102369.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_8575.region001.gbk
./data/input_data/sola_antismash/ERR2604088/scaffolds/ERR2604088__k119_63734.region001.gbk
find: ‘standard output’: Broken pipe
find: write error
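
The same layout check can be sketched in Python, which also gives a per-folder count instead of a truncated listing. The helper `count_region_files` is hypothetical; its glob pattern mirrors the depth and name filter of the `find` command above.

```python
import os
from collections import Counter
from pathlib import Path


def count_region_files(input_dir):
    """Count *region*.gbk files per (dataset, genome) folder pair."""
    root = Path(input_dir)
    counts = Counter()
    # <input_dir>/<dataset>/<genome>/<file>.gbk, i.e. exactly three levels deep
    for gbk in root.glob("*/*/*region*.gbk"):
        if gbk.is_file():
            dataset, genome = gbk.relative_to(root).parts[:2]
            counts[(dataset, genome)] += 1
    return counts


input_dir = os.environ.get("INPUT_DIR", "./data/input_data/sola_antismash/")
for (dataset, genome), n in sorted(count_region_files(input_dir).items()):
    print(f"{dataset}/{genome}: {n} BGC files")
```

A dataset folder that is missing from the printed list has no matching GenBank files at the expected depth.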


Now that we have confirmed that the GenBank files are present and the directory has the proper structure, we generate the `datasets.tsv` and `taxonomy.tsv` files.

In [3]:
create_dataset_table("./data/input_data/sola_antismash/")
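
The implementation of `create_dataset_table` lives in `src/utilities.ipynb` and is not shown in this notebook. As a rough illustration of the kind of file such a step produces, the hypothetical helper below writes a `datasets.tsv` with one row per dataset folder. The four-column layout (dataset name, folder path, taxonomy path, description) and the `taxonomy/<dataset>_taxonomy.tsv` naming are assumptions based on the BiG-SLICE input folder wiki and should be checked against it.

```python
import tempfile
from pathlib import Path


def sketch_datasets_tsv(input_dir):
    """Write <input_dir>/datasets.tsv listing every dataset subfolder.

    Hypothetical helper; the column layout is an assumption.
    """
    root = Path(input_dir)
    datasets = sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name != "taxonomy"
    )
    lines = ["# Dataset name\tPath to folder\tPath to taxonomy\tDescription"]
    for name in datasets:
        lines.append(f"{name}\t{name}/\ttaxonomy/{name}_taxonomy.tsv\t{name}")
    out_path = root / "datasets.tsv"
    out_path.write_text("\n".join(lines) + "\n")
    return out_path


# Demonstration on a throwaway folder, so no real data is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("ERR2604071", "ERR2604073"):
        (Path(tmp) / sample / "scaffolds").mkdir(parents=True)
    print(sketch_datasets_tsv(tmp).read_text())
```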

Next, we create the taxonomy files. They are created only to satisfy BiG-SLICE's input requirements; the taxonomic information itself is analyzed in a different notebook.

In [4]:
create_taxonomy_tables("./data/input_data/sola_antismash/")
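
Because these taxonomy files only need to satisfy the input format, a placeholder table is enough. The sketch below is a hypothetical stand-in for `create_taxonomy_tables`, whose actual code is in `src/utilities.ipynb`. The nine-column header, the trailing slash on genome folder names, and the output location under `taxonomy/` are assumptions based on the BiG-SLICE input folder wiki.

```python
import tempfile
from pathlib import Path

# Assumed header: genome folder followed by eight taxonomic fields.
TAXONOMY_HEADER = (
    "# Genome folder\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tOrganism"
)


def sketch_taxonomy_tsv(input_dir, dataset, placeholder="Unknown"):
    """Write a placeholder taxonomy table for one dataset folder."""
    root = Path(input_dir)
    genomes = sorted(p.name for p in (root / dataset).iterdir() if p.is_dir())
    rows = [TAXONOMY_HEADER]
    for genome in genomes:
        # One row per genome folder, all ranks set to the placeholder value
        rows.append("\t".join([f"{genome}/"] + [placeholder] * 8))
    tax_dir = root / "taxonomy"
    tax_dir.mkdir(exist_ok=True)
    out_path = tax_dir / f"{dataset}_taxonomy.tsv"
    out_path.write_text("\n".join(rows) + "\n")
    return out_path


# Demonstration on a throwaway folder, so no real data is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ERR2604071" / "scaffolds").mkdir(parents=True)
    print(sketch_taxonomy_tsv(tmp, "ERR2604071").read_text())
```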

---
# Execution of Tool 
---

This section aims to demonstrate how to execute the tool and performs a sample run on test data. This portion of the notebook may be fairly code intensive and is the most important part of the notebook. To improve readability and clarity, most of the verbose code segments should be written as Python functions, as shell scripts stored in the src directory, or as scripts in whichever language the tool requires.

If the step you hope to perform involves more than a couple of lines of code, please see the function definition format in src/NB_utility.ipynb and wrap your code in a function using that format. If you prefer to use bash scripts, write your commands into a shell file, and then execute them in this portion of the notebook. Once your code is wrapped as a function in the utility NB, you can import and run it using the format below.
Each step should be separated by a markdown heading with a brief explanation followed by the necessary code.
Use the parameter variables defined earlier in the notebook as arguments for functions written here. If there is an input that is not already included in the parameters section, include it there.

In [6]:
!"src/run_bigslice.sh" cluster \
"${INPUT_DIR}" \
"${OUTPUT_DIR}bigslice_clust/" \
--num_threads "${CPU}" \
--threshold_pct "${THRESHOLD_PCT}"

pid 102's current affinity list: 0-47
pid 102's new affinity list: 47
pid 103's current affinity list: 0-47
pid 103's new affinity list: 46
pid 104's current affinity list: 0-47
pid 104's new affinity list: 45
pid 105's current affinity list: 0-47
pid 105's new affinity list: 44
pid 106's current affinity list: 0-47
pid 106's new affinity list: 43
pid 107's current affinity list: 0-47
pid 107's new affinity list: 42
pid 108's current affinity list: 0-47
pid 108's new affinity list: 41
pid 109's current affinity list: 0-47
pid 109's new affinity list: 40
pid 110's current affinity list: 0-47
pid 110's new affinity list: 39
pid 111's current affinity list: 0-47
pid 111's new affinity list: 38
pid 112's current affinity list: 0-47
pid 112's new affinity list: 37
pid 113's current affinity list: 0-47
pid 113's new affinity list: 36
pid 114's current affinity list: 0-47
pid 114's new affinity list: 35
pid 115's current affinity list: 0-47
pid 115's new affinity list: 34
pid 116's current af

---
# Data Post Processing (if required) 
---

## Write to output directory
---
If the tool does not do it automatically, use this cell to write the output data to the output directory defined in the parameter section.

This section aims to contain all the code necessary to perform the data cleaning, formatting or analysis that would be performed on the output of this tool. Use the same formatting as previously mentioned in the execution section of the notebook:
- Offload long code sections to the src/Utility_NB and import the code 
- Add validation to catch errors and irregularities in the data 
- Alternate code and markdown cells 
- Include a markdown header for each step using ### to add it to the table of contents
- Display data and transformations where necessary. 
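
As one possible validation step from the list above, the sketch below summarizes a SQLite file by listing its tables and row counts. It relies on two assumptions: that the BiG-SLICE results are stored in a SQLite database (suggested by the RSQLite dependency), and that the database sits at the hypothetical path used in the usage lines. The helper `summarize_sqlite` is illustrative and makes no assumption about table names.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def summarize_sqlite(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # mode=ro opens the file read-only and fails if it does not exist
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            for table in tables
        }


# Assumed output location; adjust to wherever the results database is written.
db_path = Path("./data/output_data/bigslice_clust/result/data.db")
if db_path.exists():
    for table, n_rows in summarize_sqlite(db_path).items():
        print(f"{table}: {n_rows} rows")
else:
    print(f"No database found at {db_path}")
```

An empty table that is expected to hold results is a useful early warning that the run did not complete.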

---
# Visualization 
---

If you would like to include a visualization, generate it here.
Phrase the code used to generate the visualization as a function in the format mentioned in the execution section of this notebook.
Place the function in the utility NB so that it can be reused to generate new visualizations on future data. 
If the visualization has additional options and parameters, there is no need to add them to the parameters section; they can be included in a miniature parameter section within this section.

---
# Conclusion
---
Include any final parting thoughts in this section.
This section may also include:
- Common mistakes and fixes.
- Debugging tips.
- Contact for the author.
- Any other information you would like to include.