![image](https://cdn.discordapp.com/attachments/996200880351215636/1065002848355631165/New_Atlantis.png) 

---
# Execution of the BiG-SLICE Tool: mapping mode
---

## Introduction
---

[BiG-SLICE](https://github.com/pereiramemo/bigslice) is a tool designed to cluster BGC sequences into Gene Cluster Families (GCFs) based on their protein domain composition, using the [Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)](https://en.wikipedia.org/wiki/BIRCH) algorithm, a clustering algorithm with near-linear time complexity.
The tool can be executed in clustering mode or query mode, which perform the de novo clustering of BGC sequences and the positioning of query BGC sequences onto previously computed GCF models, respectively.   
This notebook is dedicated to the execution of the BiG-SLICE tool in mapping (query) mode.

## 1. Initialization

### 1.1 Create directories (run then comment)
---

In [2]:
# import os 

# if not os.path.exists('./src'):
#     os.makedirs('./src')
#     os.makedirs('./data')
#     os.makedirs('./data/input_data')
#     os.makedirs('./data/output_data')
#     os.makedirs('./data/reference_data')
#     with open('./src/requirements.txt', 'w') as f:
#         pass
#     with open('./src/utilities.ipynb', 'a') as nb:
#         pass

### 1.2 Input data 
---

Input data consists of BGC sequences (complete or partial) annotated in contigs of Metagenome Assembled Genomes (MAGs), stored as GenBank files and named following the [antiSMASH](https://github.com/antismash/antismash) or [MIBiG](https://mibig.secondarymetabolites.org/) nomenclature (i.e., `<genome_name>.regionXXX.gbk` and `BGCXXXXXXX.gbk`, respectively).
These sequences have to be organized in a directory structure having the dataset and genomes subfolders as specified [here](https://github.com/medema-group/bigslice/wiki/Input-folder).
This is the input data that the user must provide to run this notebook. However, because BiG-SLICE also requires them, the `datasets.tsv` and `taxonomy.tsv` files are generated automatically here, as described [here](https://github.com/medema-group/bigslice/wiki/Input-folder#datasetstsv).  
For demonstration purposes, we analyze 38 metagenomic samples from the [SOLA dataset](https://pubmed.ncbi.nlm.nih.gov/29925880/), a time series spanning three years (2012 to 2015) obtained from a coastal northwestern Mediterranean site. As a reference, we use the [MIBiG](https://mibig.secondarymetabolites.org/) v3 database, previously clustered with [BiG-SLICE](https://github.com/pereiramemo/bigslice).
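
Before running the tool, it can help to confirm that the GenBank files follow one of the two naming schemes described above. The following is a minimal sketch, not part of BiG-SLICE: the function name `classify_bgc_filename` is hypothetical, and the digit counts (three for antiSMASH regions, seven for MIBiG accessions) are assumptions inferred from the `XXX` and `XXXXXXX` placeholders.

```python
import re

# Patterns inferred from the naming schemes described in the text.
# The digit counts (3 and 7) are assumptions, not taken from BiG-SLICE.
ANTISMASH_RE = re.compile(r"^(?P<genome>.+)\.region(?P<region>\d{3})\.gbk$")
MIBIG_RE = re.compile(r"^BGC\d{7}\.gbk$")


def classify_bgc_filename(name):
    """Return 'mibig', 'antismash', or None for a GenBank file name."""
    if MIBIG_RE.match(name):
        return "mibig"
    if ANTISMASH_RE.match(name):
        return "antismash"
    return None
```

For example, `classify_bgc_filename("contig_12.region001.gbk")` returns `'antismash'`, `classify_bgc_filename("BGC0000001.gbk")` returns `'mibig'`, and any other file name returns `None`, which makes it easy to list files that need renaming.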

### 1.3 Output data 
---

A SQLite database is created with the following schema: 
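
The schema can also be inspected directly from the database once it exists. The sketch below uses only the Python standard library; the function name `describe_sqlite_schema` is hypothetical, and the location of the database file depends on the output folder of the run, so no path is assumed here.

```python
import sqlite3


def describe_sqlite_schema(conn):
    """Map each table name in an open SQLite connection to its column names."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema
```

A typical call would open the database with `sqlite3.connect(<path to the database file>)`, pass the connection to `describe_sqlite_schema`, and print the resulting dictionary.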

### 1.4 Data loading
---

Data to be mapped

In [21]:
#cp -r ../bgc_clustering/data/input_data/sola_antismashed ./data/input_data/

Reference GCF models

In [None]:
#! aws s3 cp s3://newatlantis-case-studies/mibig_gcf_models/ ./data/output_data/mibig_gcf_models --recursive

---
## 2. Environment
---

### 2.1 Main dependencies
___

[Docker](https://www.docker.com/)  
[tidyverse R package](https://www.tidyverse.org/)  
[RSQLite R package](https://cran.r-project.org/web/packages/RSQLite/index.html)

### 2.2 Notebook utility installs
___

The import-ipynb package installed here makes it possible to use a reference notebook as a Python module.

In [6]:
#install for import-ipynb that will be applied later
# !pip3 install import-ipynb -q

### 2.3 R Based dependencies
---

In [None]:
# !Rscript -e 'install.packages("tidyverse")' &> /dev/null
# !Rscript -e 'install.packages("RSQLite")' &> /dev/null

### 2.4 Import statements (code)(import ipynb)
---

In [2]:
# Empty

### 2.5 Session environment variables
---

In [8]:
# Empty

### 2.6 Input and output data files and directories 
---

In [23]:
%env INPUT_DIR=./data/input_data/sola_antismashed

env: INPUT_DIR=./data/input_data/sola_antismashed


In [24]:
%env REF_DIR=./data/output_data/mibig_gcf_models

env: REF_DIR=./data/output_data/mibig_gcf_models
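
Since both variables must point to existing directories before the tool runs, a quick check can catch a missing download early. This is a minimal sketch; the helper name `missing_dirs` is hypothetical and not part of the notebook utilities.

```python
import os
from pathlib import Path


def missing_dirs(var_names, environ=os.environ):
    """Return the variables that are unset or do not point to a directory."""
    missing = []
    for name in var_names:
        value = environ.get(name)
        if not value or not Path(value).is_dir():
            missing.append(name)
    return missing
```

Calling `missing_dirs(["INPUT_DIR", "REF_DIR"])` returns an empty list when both directories are in place, and the names of the offending variables otherwise.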


---
## 3. Parameters
---

`-t <N>, --num_threads <N>` The number of parallel jobs to run (default: 48).

In [10]:
%env NUM_THREADS=40

env: NUM_THREADS=40


`--threshold_pct <N>` Calculate the clustering threshold (T) from a random sample of pairwise distances between the data, taking the N-th percentile value as the threshold.  
Mutually exclusive with `--threshold`; use `-1` to turn off this parameter (default: -1).

In [11]:
%env THRESHOLD_PCT=0.1

env: THRESHOLD_PCT=0.1
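
The percentile rule described above can be illustrated with a small, self-contained sketch. It is not BiG-SLICE's implementation: the Euclidean distance, the sampling scheme, the linear interpolation, the 0 to 100 percentile scale, and the names `percentile_threshold`, `n_pairs` and `seed` are all assumptions made for illustration.

```python
import math
import random


def percentile_threshold(points, pct, n_pairs=1000, seed=0):
    """Estimate a threshold as the pct-th percentile of sampled pairwise distances."""
    rng = random.Random(seed)
    # Draw random pairs of distinct points and record their Euclidean distance.
    distances = sorted(
        math.dist(*rng.sample(points, 2)) for _ in range(n_pairs)
    )
    # Linear interpolation between the two closest ranks.
    rank = (pct / 100) * (len(distances) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return distances[lower] * (1 - weight) + distances[upper] * weight
```

With feature vectors stored as a list of tuples, `percentile_threshold(points, 0.1)` would return a distance below which roughly 0.1% of the sampled pairs fall, so a small percentile yields a strict threshold and tight clusters.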


`--query_name <name>` Give a unique name to the query run so that it will be easier to trace within the output visualization.

In [12]:
%env QUERY_NAME=sola

env: QUERY_NAME=sola


---
## 4. Data Precleaning (if required) 
---

In [1]:
# Empty

---
## 5. Execution of Tool 
---

In [32]:
!echo "${QUERY_NAME}"

sola


In [None]:
! ./src/run_bigslice.sh query "${INPUT_DIR}" "${REF_DIR}" \
--num_threads "${NUM_THREADS}" \
--threshold_pct "${THRESHOLD_PCT}" \
--query_name "${QUERY_NAME}" &> /dev/null
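
The shell cell above sends all output to `/dev/null`, which makes failures hard to diagnose. As an alternative, the same call can be assembled in Python and its output written to a log file. This is a sketch under assumptions: the function names and the log file name `bigslice_query.log` are hypothetical, while the script path and flags are copied from the cell above.

```python
import os
import subprocess


def build_query_command(environ=os.environ):
    """Assemble the wrapper call used in the shell cell from environment variables."""
    return [
        "./src/run_bigslice.sh", "query",
        environ["INPUT_DIR"], environ["REF_DIR"],
        "--num_threads", environ["NUM_THREADS"],
        "--threshold_pct", environ["THRESHOLD_PCT"],
        "--query_name", environ["QUERY_NAME"],
    ]


def run_query(log_path="bigslice_query.log", environ=os.environ):
    """Run the wrapper, write stdout and stderr to a log file, return the exit code."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            build_query_command(environ), stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

A non-zero return value from `run_query()` signals that the run failed, and the log file then holds the messages needed to find out why.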

---
## 6. Data Post Processing (if required) 
---

## Write to output directory
---
If the tool does not do it automatically, use this cell to write the output data to the output directory defined in the parameter section.

This section aims to contain all the code necessary to perform the data cleaning, formatting or analysis that would be performed on the output of this tool. Use the same formatting as previously mentioned in the execution section of the notebook:
- Offload long code sections to the src/Utility_NB and import the code 
- Add validation to catch errors and irregularities in the data 
- Alternate code and markdown cells 
- Include a markdown header for each step using ### to add it to the table of contents
- Display data and transformations where necessary. 