## **MetaBioPros 1.0**

#### This notebook integrates the metagenomic bioprospecting analysis, version 1.0.  
#### The analyses included are:  
#### 0. Set env
#### 1. Identification of BGCs
#### 2. Taxonomic annotation of BGCs
#### 3. Mapping of BGC sequences onto reference Gene Cluster Families (GCFs)
#### 4. BGC diversity estimates, functional prediction, and novelty assessment

#### **Dependencies to run this notebook (outside the tools we provide):**  
#### [aws cli](https://aws.amazon.com/cli/)  
#### [bash and R kernels](https://evodify.com/python-r-bash-jupyter-notebook/)  
#### [Docker](https://www.docker.com/)
#### [RSQLite R library](https://cran.r-project.org/web/packages/RSQLite/index.html)
#### [tidyverse R library](https://www.tidyverse.org/)

**0. Set env**

In [1]:
%load_ext rpy2.ipython
%set_env WORKDIR=workdir
%set_env REPO=/home/epereira/workspace/dev/new_atlantis/repos/bioprospecting

env: WORKDIR=workdir
env: REPO=/home/epereira/workspace/dev/new_atlantis/repos/bioprospecting


In [2]:
%%bash
mkdir -p ${WORKDIR}/data/sola
mkdir -p ${WORKDIR}/outputs/antismash
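
Before going further, it can help to confirm that the command-line dependencies listed at the top of the notebook are on the `PATH`. The following is a minimal sketch: the helper `missing_tools` is hypothetical, and the executable names `aws`, `docker`, and `Rscript` are assumptions about how those dependencies are installed.

```shell
# Report which of the given commands are not found on the PATH.
# The tool names passed below are assumptions that mirror the dependency list.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "${tool}" > /dev/null 2>&1; then
      echo "${tool}"
    fi
  done
}

MISSING=$(missing_tools aws docker Rscript)
if [ -n "${MISSING}" ]; then
  # Unquoted on purpose so that the names are printed on one line
  echo "Missing dependencies:" ${MISSING}
else
  echo "All command-line dependencies found."
fi
```

The R libraries (RSQLite, tidyverse) are not covered by this check; they would have to be checked from within R.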

**1. Identification of BGCs**
 
We will be using the [SOLA metagenomic dataset](https://www.nature.com/articles/s41396-018-0158-1), already assembled with [VEBA](https://github.com/jolespin/veba).
Let’s first get the data.

In [None]:
%%bash

# aws s3 cp s3://newatlantis-case-studies/SOLA-samples/ ${WORKDIR}/data/sola --recursive

This dataset contains the assembled scaffolds (\*.fasta) and the mapping files (\*.bam).

In [3]:
%%bash
ls ${WORKDIR}/data/sola/ERR*/output | head -12

workdir/data/sola/ERR2604071/output:
featurecounts.tsv.gz
mapped.sorted.bam
mapped.sorted.bam.bai
scaffolds.fasta
scaffolds.fasta.1.bt2
scaffolds.fasta.2.bt2
scaffolds.fasta.3.bt2
scaffolds.fasta.4.bt2
scaffolds.fasta.rev.1.bt2
scaffolds.fasta.rev.2.bt2
scaffolds.fasta.saf


Now that we have the data, let's run [antiSMASH](https://github.com/antismash/antismash) to identify the BGC sequences.  
For this we will use our wrapper script [run_antismash](https://github.com/pereiramemo/bioprospecting/blob/main/run_scripts/run_antismash.sh), which runs a containerized antiSMASH version 6.0.0.  
Note that version 7.0.0 is available, but for compatibility with the downstream analysis we'll use version 6.0.0 for now.  
Because the wrapper script runs a containerized version of antiSMASH, the first two positional parameters must be the input and output folders, respectively.  
To see the help we run:

In [9]:
%%bash
"${REPO}"/run_scripts/run_antismash.sh . . --help-showall


########### antiSMASH 6.0.0 #############

usage: antismash [--taxon {bacteria,fungi}] [--output-dir OUTPUT_DIR]
                 [--output-basename OUTPUT_BASENAME] [--reuse-results PATH]
                 [--limit LIMIT] [--minlength MINLENGTH] [--start START]
                 [--end END] [--databases PATH] [--write-config-file PATH]
                 [--without-fimo]
                 [--executable-paths EXECUTABLE=PATH,EXECUTABLE2=PATH2,...]
                 [--allow-long-headers] [-v] [-d] [--logfile PATH]
                 [--list-plugins] [--check-prereqs]
                 [--limit-to-record RECORD_ID] [-V] [--profiling]
                 [--skip-sanitisation] [--skip-zip-file] [--minimal]
                 [--enable-genefunctions] [--enable-tta]
                 [--enable-lanthipeptides] [--enable-thiopeptides]
                 [--enable-nrps-pks] [--enable-sactipeptides]
                 [--enable-lassopeptides] [--enable-t2pks] [--enable-html]
                 [--genefinding-tool 
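
The convention described above (first two positional parameters for input and output, everything else passed through to antiSMASH) could be sketched as follows. This is a hypothetical illustration, not the content of `run_antismash.sh`: the function name `build_antismash_cmd`, the image name in `ANTISMASH_IMAGE`, the mount points `/input` and `/output`, and the assumption that the image passes its arguments straight to antiSMASH are all placeholders. The function only prints the command it would run.

```shell
# Hypothetical sketch of a wrapper around a containerized antiSMASH.
# ANTISMASH_IMAGE and the mount points /input and /output are assumptions,
# not taken from run_antismash.sh.
ANTISMASH_IMAGE="${ANTISMASH_IMAGE:-antismash/standalone:6.0.0}"

build_antismash_cmd() {
  local input="$1"    # first positional parameter: input path
  local output="$2"   # second positional parameter: output folder
  shift 2             # the remaining arguments are passed through unchanged

  # Dry run: print the command instead of executing it.
  # A real wrapper would also convert both paths to absolute paths,
  # since Docker bind mounts expect them.
  echo docker run --rm \
    -v "$(dirname "${input}"):/input:ro" \
    -v "${output}:/output" \
    "${ANTISMASH_IMAGE}" \
    "/input/$(basename "${input}")" --output-dir /output "$@"
}

build_antismash_cmd /data/sample/scaffolds.fasta /results/sample --taxon bacteria
```

Mounting the input read-only and mapping the output folder into the container is one common way to let a containerized tool read and write host files; the actual script may do this differently.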

Let's run antiSMASH on the SOLA dataset.

In [None]:
%%bash

# Process the first three samples only
SCAFFOLDS=$(ls ${WORKDIR}/data/sola/ERR*/output/scaffolds.fasta | head -3)
for SCAFFOLD in ${SCAFFOLDS}; do

  # Extract the sample accession (e.g. ERR2604071) from the path
  SAMPLE_NAME=$(echo "${SCAFFOLD}" | sed "s/.*\(ERR[0-9]\+\)\/output.*/\1/")
  OUTPUT_DIR="${WORKDIR}/outputs/antismash/${SAMPLE_NAME}"
  echo "${SAMPLE_NAME}"

  "${REPO}"/run_scripts/run_antismash.sh "${SCAFFOLD}" "${OUTPUT_DIR}" \
  --cpus 40 \
  --genefinding-tool prodigal-m \
  --taxon bacteria \
  --allow-long-headers \
  --minlength 5000

done

ERR2604071
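
The sample name is taken from the scaffold path with a `sed` expression that uses `\+`, a GNU extension to basic regular expressions. As a hedged alternative, the same accession can be recovered with `dirname` and `basename` alone, assuming the `<sample>/output/scaffolds.fasta` layout shown earlier. The function name `sample_name_from_path` is hypothetical.

```shell
# Extract the sample accession (the directory two levels above the file)
# using only dirname/basename, so no regular expression is needed.
# Assumes the layout <sample>/output/scaffolds.fasta.
sample_name_from_path() {
  basename "$(dirname "$(dirname "$1")")"
}

sample_name_from_path "workdir/data/sola/ERR2604071/output/scaffolds.fasta"
# prints: ERR2604071
```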


The annotated BGC sequences can be found in `${WORKDIR}/outputs/antismash/`.

In [8]:
%%bash
ls ${WORKDIR}/outputs/antismash/

ERR2604071
ERR2604073
ERR2604074


Let's organize this data to run [BiG-SLiCE](https://github.com/medema-group/bigslice): we create the [datasets.tsv and taxonomy files](https://github.com/medema-group/bigslice/wiki/Input-folder).

In [23]:
%%bash

# Write one row per sample: dataset name, path to folder, taxonomy file, description
ls -d "${WORKDIR}/outputs/antismash/"ERR* | \
while read LINE; do

  DATASET=$(basename "${LINE}")
  echo -e "${DATASET}\t./\ttaxonomy/${DATASET}_taxonomy.tsv\tdataset_${DATASET}"

done > "${WORKDIR}/outputs/antismash/datasets.tsv"

# The taxonomy folder must exist before the files below are written
mkdir -p "${WORKDIR}/outputs/antismash/taxonomy"

# Write a minimal taxonomy file per dataset, labeling every sample as Bacteria
cut -f3 "${WORKDIR}/outputs/antismash/datasets.tsv" | \
while read LINE; do
  DATASET=$(basename "${LINE}" _taxonomy.tsv)
  echo -e "${DATASET}/\tBacteria" > "${WORKDIR}/outputs/antismash/${LINE}"
done
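
To make the layout of `datasets.tsv` concrete, here is a small self-contained sketch that builds the file for a toy input folder and checks that every row has four tab-separated columns, as in the cell above. The sample names `ERR0000001` and `ERR0000002` and the temporary folder are illustrative only, not part of the SOLA dataset.

```shell
# Build a datasets.tsv for a toy input folder and check its shape.
# The sample names and the temporary folder are illustrative only.
INPUT_DIR=$(mktemp -d)
mkdir -p "${INPUT_DIR}/ERR0000001" "${INPUT_DIR}/ERR0000002" "${INPUT_DIR}/taxonomy"

for DIR in "${INPUT_DIR}"/ERR*/; do
  DATASET=$(basename "${DIR}")
  # Columns: dataset name, path to folder, path to taxonomy file, description
  printf '%s\t%s\t%s\t%s\n' \
    "${DATASET}" "./" "taxonomy/${DATASET}_taxonomy.tsv" "dataset_${DATASET}"
done > "${INPUT_DIR}/datasets.tsv"

# Every row should have exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad }' "${INPUT_DIR}/datasets.tsv" \
  && echo "datasets.tsv looks well formed"
cat "${INPUT_DIR}/datasets.tsv"
```

`printf` is used here instead of `echo -e` because its handling of `\t` does not depend on the shell.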


In [26]:
%%bash
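# Download the BiG-SLiCE paper run result (about 17 GB) and extract it into ${WORKDIR}/data/
# wget http://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip --directory-prefix ${WORKDIR}/data/
# unzip ${WORKDIR}/data/full_run_result.zip -d ${WORKDIR}/data/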

--2023-09-04 22:21:03--  http://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip
Resolving bioinformatics.nl (bioinformatics.nl)... 137.224.16.5
Connecting to bioinformatics.nl (bioinformatics.nl)|137.224.16.5|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip [following]
--2023-09-04 22:21:04--  https://www.bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip
Resolving www.bioinformatics.nl (www.bioinformatics.nl)... 137.224.16.5
Connecting to www.bioinformatics.nl (www.bioinformatics.nl)|137.224.16.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18075963830 (17G) [application/zip]
Saving to: ‘workdir/data/full_run_result.zip’

     0K .......... .......... .......... .......... ..........  0%  137K 35h53m
    50K .......... .......... .......... .......... ..........  0%  274K 26h54m
   100K

Process is interrupted.


In [32]:
%%bash
"${REPO}"/run_scripts/run_bigslice.sh query . . --help

usage: bigslice [-i <folder_path>] [--resume] [--complete] [--threshold T]
                [--threshold_pct <N>] [--query <folder_path>]
                [--query_name <name>] [--run_id <id>] [--n_ranks N_RANKS]
                [-t <N>] [--hmmscan_chunk_size <N>] [--subpfam_chunk_size <N>]
                [--extraction_chunk_size EXTRACTION_CHUNK_SIZE] [--scratch]
                [--early_dumping] [-h] [--program_db_folder PROGRAM_DB_FOLDER]
                [--version]
                <output_folder_path>

                            _________
 ___    _____|\____|\____|\ \____    \
|   \  |       }     }     }  ___)_ _/__
| >  | _--/--|/----|/----|/__/  __||  __|
|   < | ||  __/ (  (| | \___/  /__/|  _|
| >  || || |_ | _)  ) |_ | |\  \__ | |__
|____/|_| \___/|___/|___||_| \____||____| [ Version 1.1.0 ]

Biosynthetic Gene clusters - Super Linear Clustering Engine
(https://github.com/medema-group/bigslice)

positional arguments:
  <output_folder_path>  [Mandatory] the path to the (newly c

And now run it:

In [33]:
%%bash
"${REPO}"/run_scripts/run_bigslice.sh query \
"${WORKDIR}/outputs/antismash/" \
"${WORKDIR}/data/full_run_result" \
--num_threads 40 \
--query_name SOLA

pid 103's current affinity list: 0-47
pid 103's new affinity list: 47
pid 104's current affinity list: 0-47
pid 104's new affinity list: 46
pid 105's current affinity list: 0-47
pid 105's new affinity list: 45
pid 106's current affinity list: 0-47
pid 106's new affinity list: 44
pid 107's current affinity list: 0-47
pid 107's new affinity list: 43
pid 108's current affinity list: 0-47
pid 108's new affinity list: 42
pid 109's current affinity list: 0-47
pid 109's new affinity list: 41
pid 110's current affinity list: 0-47
pid 110's new affinity list: 40
pid 111's current affinity list: 0-47
pid 111's new affinity list: 39
pid 112's current affinity list: 0-47
pid 112's new affinity list: 38
pid 113's current affinity list: 0-47
pid 113's new affinity list: 37
pid 114's current affinity list: 0-47
pid 114's new affinity list: 36
pid 115's current affinity list: 0-47
pid 115's new affinity list: 35
pid 116's current affinity list: 0-47
pid 116's new affinity list: 34
pid 117's current af

The results are written to the folder `${WORKDIR}/data/full_run_result`:

In [35]:
%%bash
ls "${WORKDIR}/data/full_run_result/reports"

1
reports.db


The main results are the SQLite databases. Although we could browse them with the mini web application based on the Flask library, we will import them into an R environment to have full control over the results.

In [66]:
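%%R
# DBI provides dbConnect(), dbListTables(), and dbReadTable()
library(DBI)

# Query report database (report id 1) and the reference run database
conn_reports_db <- dbConnect(RSQLite::SQLite(), "workdir/data/full_run_result/reports/1/data.db")
conn_data_db <- dbConnect(RSQLite::SQLite(), "workdir/data/full_run_result/result/data.db")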
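
The report id `1` is hardcoded in the path above. If the query is run more than once, a small helper could pick the most recent report folder instead. This sketch assumes that report folders are named with plain integers, as in the `reports` listing. The function `latest_report_id` and the demo folder are hypothetical.

```shell
# Print the highest-numbered subfolder of a reports directory.
# Assumes report folders are named with plain integers; other entries
# (such as reports.db) are ignored.
latest_report_id() {
  ls "$1" | grep -E '^[0-9]+$' | sort -n | tail -n 1
}

# Illustrative check on a throwaway folder (not the real results).
DEMO_REPORTS=$(mktemp -d)
mkdir -p "${DEMO_REPORTS}/1" "${DEMO_REPORTS}/2" "${DEMO_REPORTS}/10"
touch "${DEMO_REPORTS}/reports.db"
echo "latest report: $(latest_report_id "${DEMO_REPORTS}")"
```

Sorting numerically with `sort -n` matters once there are ten or more reports, since a plain lexical sort would place `10` before `2`.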


In [67]:
%%R
dbListTables(conn_reports_db)
dbReadTable(conn_reports_db, "sqlite_sequence")

  name  seq
1  bgc   37
2  cds  449
3  hsp 1109


In [56]:
%%R
dbListTables(conn_data_db)

 [1] "bgc"               "bgc_class"         "bgc_features"     
 [4] "bgc_taxonomy"      "cds"               "chem_class"       
 [7] "chem_subclass"     "chem_subclass_map" "clustering"       
[10] "dataset"           "enum_bgc_type"     "enum_run_status"  
[13] "gcf"               "gcf_membership"    "gcf_models"       
[16] "hmm"               "hmm_db"            "hsp"              
[19] "hsp_alignment"     "hsp_subpfam"       "run"              
[22] "run_bgc_status"    "run_log"           "schema"           
[25] "sqlite_sequence"   "subpfam"           "taxon"            
[28] "taxon_class"      
