![image](https://cdn.discordapp.com/attachments/996200880351215636/1065002848355631165/New_Atlantis.png) 

---
# Execution of the MMseqs2 tool: taxonomy mode
---

## Introduction
---

[MMseqs2](https://github.com/soedinglab/MMseqs2) (Many-against-Many sequence searching) is a software suite for searching and clustering very large protein and nucleotide sequence sets. It also includes a [workflow dedicated to the taxonomic annotation of sequences](https://github.com/soedinglab/MMseqs2#taxonomy), which works in three steps: a search against a reference database that carries taxonomy information, the selection of the most representative hits according to one of several strategies, and the computation of the lowest common ancestor (LCA) of those hits.
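
To make the last step concrete, here is a minimal sketch of a lowest common ancestor computation over lineages. It is an illustration only, not MMseqs2's implementation: the function name `lowest_common_ancestor` and the example lineages are assumptions made for this sketch, and real lineages come from the taxonomy attached to the reference database.

```python
def lowest_common_ancestor(lineages):
    """Return the longest root-to-leaf prefix shared by all lineages.

    Each lineage is a list of taxon names ordered from root to leaf.
    Illustrative helper, not part of MMseqs2.
    """
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so the LCA is never deeper than it.
    for ranks in zip(*lineages):
        if all(name == ranks[0] for name in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of three selected hits for one query sequence.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrionales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Alteromonadales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
lca = lowest_common_ancestor(hits)
print(lca[-1] if lca else "root")  # prints "Proteobacteria"
```

The more the selected hits disagree, the shallower the assigned taxon becomes, which is why the hit selection strategy matters for the final annotation.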

## 1. Initialization

### 1.1 Create directories (run once, then comment out)
---

In [3]:
# import os 

# if not os.path.exists('./src'):
#     os.makedirs('./src')
#     os.makedirs('./data')
#     os.makedirs('./data/input_data')
#     os.makedirs('./data/output_data')
#     os.makedirs('./data/reference_data')
#     with open('./src/requirements.txt', 'w') as f:
#         pass
#     with open('./src/utilities.ipynb', 'a') as nb:
#         pass

### 1.2 Input data 
---

The input data consists of the [antiSMASH](https://github.com/antismash/antismash) output, which contains the sequences annotated with biosynthetic gene clusters (BGCs) as GenBank files, and the metagenome-assembled genomes (MAGs) or contigs generated with [VEBA](https://github.com/jolespin/veba) (i.e., the sequences in which the BGCs were annotated).

### 1.3 Output data 
---

The output is a table containing the taxonomic annotation of the BGC-annotated sequences.

### 1.4 Data loading
---

Here we retrieve the previously generated antiSMASH output, in which the tool was used to annotate the BGC sequences of the [SOLA dataset](https://www.nature.com/articles/s41396-018-0158-1), together with the assembly data of the same dataset.

To save space, we create soft links instead of copying the data.

In [None]:
! ln -s $(readlink -m ../bgc_annotation/data/output_data/sola_antismashed/) ./data/input_data/

In [None]:
! ln -s  $(readlink -m ../bgc_annotation/data/input_data/sola/) ./data/input_data/
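
The same linking step can be done from Python, which avoids failing when the link already exists on a re-run. This is a minimal sketch under the assumption that the source directories live where the shell commands above expect them; `link_into` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path


def link_into(source, dest_dir):
    """Create a soft link inside dest_dir that points at the resolved source.

    Hypothetical helper mirroring `ln -s $(readlink -m SOURCE) DEST_DIR/`.
    """
    source = Path(source).resolve()  # canonical absolute path, like readlink -m
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    link = dest_dir / source.name
    if link.is_symlink() or link.exists():
        return link  # leave an existing entry untouched
    link.symlink_to(source, target_is_directory=source.is_dir())
    return link
```

In this notebook it could be called as `link_into("../bgc_annotation/data/output_data/sola_antismashed", "./data/input_data")` and `link_into("../bgc_annotation/data/input_data/sola", "./data/input_data")`.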

---
## 2. Environment
---

### 2.1 Main dependencies
___

[seqtk](https://github.com/lh3/seqtk)

### 2.2 Notebook utility installs
___

The import-ipynb package installed here makes it possible to import a reference notebook as a Python module.

In [16]:
# Install import-ipynb, which is used later in the notebook
!pip3 install import-ipynb -q

DEPRECATION: Loading egg at /home/epereira/anaconda3/lib/python3.11/site-packages/mglib-1.1-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330

### 2.3 R-based dependencies
---

In [17]:
# Empty

### 2.4 Import statements (code)(import ipynb)
---

In [18]:
# Empty

### 2.5 Session environment variables
---

The ID of the sample being processed.

In [54]:
%env SAMPLE=ERR2604088

env: SAMPLE=ERR2604088


Tools

In [46]:
%env seqtk=/home/bioinf/bin/seqtk/seqtk

env: seqtk=/home/bioinf/bin/seqtk/seqtk


Directories

In [55]:
%env INPUT_DIR_ANTISMASH=./data/input_data/sola_antismashed/ERR2604088

env: INPUT_DIR_ANTISMASH=./data/input_data/sola_antismashed/ERR2604088


In [56]:
%env INPUT_DIR_DATASET=./data/input_data/sola/ERR2604088

env: INPUT_DIR_DATASET=./data/input_data/sola/ERR2604088


In [62]:
%env OUTPUT_DIR=./data/output_data/

env: OUTPUT_DIR=./data/output_data/
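
The sample ID is repeated in several of the variables above, so switching samples means editing each cell. As a sketch of one alternative, the paths can be derived from a single sample ID and exported through `os.environ`, which the `!` shell cells inherit. The helper name `sample_environment` and its default roots are assumptions based on the directory layout used in this notebook.

```python
import os


def sample_environment(sample,
                       input_root="./data/input_data",
                       output_root="./data/output_data/"):
    """Build the per-sample variables from one sample ID (hypothetical helper)."""
    return {
        "SAMPLE": sample,
        "INPUT_DIR_ANTISMASH": f"{input_root}/sola_antismashed/{sample}",
        "INPUT_DIR_DATASET": f"{input_root}/sola/{sample}",
        "OUTPUT_DIR": output_root,
    }


env = sample_environment("ERR2604088")
os.environ.update(env)  # same effect as the individual %env cells
for name, value in env.items():
    print(f"env: {name}={value}")
```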


---
## 3. Parameters
---

`--threads INT`  Number of CPU cores used (all by default).

In [71]:
%env THREADS=40

env: THREADS=40


`--tax-lineage INT` 0: don't show, 1: add all lineage names, 2: add all lineage taxids [0]

In [24]:
%env TAX_LINAGE=1

env: TAX_LINAGE=1


`-v INT` Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

In [75]:
%env V=0

env: V=0


---
## 4. Data Precleaning (if required) 
---

Get the IDs of all contigs containing a BGC.

In [58]:
!awk '$1 == "ACCESSION" {print $2}' "${INPUT_DIR_ANTISMASH}"/*/*.region*.gbk > "${INPUT_DIR_ANTISMASH}"/acc.txt
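
For readers less familiar with awk, the filter above can be sketched in Python. This is an illustration rather than a replacement: `collect_accessions` is a hypothetical helper, it assumes the same `*/*.region*.gbk` layout, and unlike the awk command it drops duplicate IDs (a contig carrying several BGC regions is reported once).

```python
from pathlib import Path


def collect_accessions(antismash_dir):
    """Return unique ACCESSION values from */*.region*.gbk, in first-seen order."""
    accessions = []
    seen = set()
    for gbk in sorted(Path(antismash_dir).glob("*/*.region*.gbk")):
        with open(gbk) as handle:
            for line in handle:
                fields = line.split()
                # Same condition as the awk filter: the first field is ACCESSION.
                # Lines without a value after the keyword are skipped.
                if len(fields) >= 2 and fields[0] == "ACCESSION":
                    if fields[1] not in seen:
                        seen.add(fields[1])
                        accessions.append(fields[1])
    return accessions
```

Writing the result with `Path(...).write_text("\n".join(ids) + "\n")` would give a file with one ID per line, like `acc.txt`.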

Extract the contig sequences from the assembly fasta file.

In [67]:
!"${seqtk}" subseq "${INPUT_DIR_DATASET}/output/scaffolds.fasta" "${INPUT_DIR_ANTISMASH}"/acc.txt > "${OUTPUT_DIR}/${SAMPLE}".fasta
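
Conceptually, this step keeps every FASTA record whose name (the first word after `>`) appears in the ID list. The sketch below shows that logic on in-memory lines. It is not seqtk's implementation, and `subset_fasta` and the example records are made up for illustration.

```python
def subset_fasta(fasta_lines, wanted_ids):
    """Yield the lines of FASTA records whose name is in wanted_ids."""
    wanted = set(wanted_ids)
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token of the header.
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
        if keep:
            yield line


# Hypothetical scaffolds; real input would be read from scaffolds.fasta.
records = [
    ">NODE_1_length_900 cov=3.2\n",
    "ACGT\n",
    "TTGA\n",
    ">NODE_2_length_500\n",
    "GGCC\n",
    ">NODE_3_length_300\n",
    "AATT\n",
]
selected = list(subset_fasta(records, ["NODE_1_length_900", "NODE_3_length_300"]))
print("".join(selected), end="")
```

For full assemblies seqtk remains the practical choice. The sketch only shows what ends up in the sample FASTA file.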

---
## 5. Execution of Tool
---

In [None]:
! ./src/run_mmseqs_taxonomy.sh \
"${OUTPUT_DIR}/${SAMPLE}".fasta \
"${OUTPUT_DIR}/${SAMPLE}"_bgc_taxonomy.tsv \
--threads "${THREADS}" \
--tax-lineage "${TAX_LINAGE}" \
-v "${V}"
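
The contents of `./src/run_mmseqs_taxonomy.sh` are not shown in this notebook. As a rough sketch of what such a wrapper might run, the code below assembles an `mmseqs easy-taxonomy` command line with the three parameters set in section 3. The reference database path and the temporary directory are hypothetical placeholders, and the wrapper may well do things differently.

```python
import shlex


def build_easy_taxonomy_command(query_fasta, target_db, output_prefix, tmp_dir,
                                threads=40, tax_lineage=1, verbosity=0):
    """Assemble an `mmseqs easy-taxonomy` call as an argument list.

    Positional order: query FASTA, target database, output prefix, tmp dir.
    """
    return [
        "mmseqs", "easy-taxonomy",
        query_fasta, target_db, output_prefix, tmp_dir,
        "--threads", str(threads),
        "--tax-lineage", str(tax_lineage),
        "-v", str(verbosity),
    ]


cmd = build_easy_taxonomy_command(
    "./data/output_data/ERR2604088.fasta",
    "./data/reference_data/reference_db",   # hypothetical database path
    "./data/output_data/ERR2604088_bgc_taxonomy",
    "./data/output_data/tmp",               # hypothetical temporary directory
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting issues.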

---
## 6. Data Post Processing (if required)
---

---
## Conclusion
---
Include any final parting thoughts in this section.
This section may also include:
- Common mistakes and fixes. 
- Debugging tips.
- Contact for the author.
- Any other information you would like to include