# Exploring antiSMASH Data Outputs

We've run antiSMASH on all our encapsulin contigs. Running it on individual contigs is supported, but it won't give as many hits as running it on full genomes, since a contig may only contain part of a biosynthetic gene cluster (BGC).

The output of antiSMASH is a folder for each contig with a nicely formatted HTML file showing us figures, diagrams, tables, and pretty webpages with responsive elements. This is nice for viewing a single hit or a few hits together, but having to go through thousands of hits this way is clearly not scalable!

To get around this, I've already written a script, `scripts/31_create_antismash_table.py`, which uses the `beautifulsoup4` Python package to parse the HTML files and extract the information into a CSV file that we can load into pandas. This package is usually used as a web scraper for getting data off Wikipedia or other websites, but it works just as well for parsing many local HTML files in one go.
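As a rough illustration of the idea (not the actual contents of `scripts/31_create_antismash_table.py`), here is a minimal sketch of pulling table rows out of HTML with `beautifulsoup4` and writing them as CSV. The HTML fragment, the `table_rows` helper, and the column names are all assumptions made up for the example; real antiSMASH pages are laid out differently.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML fragment, NOT the real antiSMASH page layout.
EXAMPLE_HTML = """
<table class="region-table">
  <tr><th>Region</th><th>Type</th><th>From</th><th>To</th></tr>
  <tr><td>2.1</td><td>RiPP-like</td><td>1</td><td>10651</td></tr>
  <tr><td>1.0</td><td>T3PKS,RiPP-like</td><td>9189</td><td>33394</td></tr>
</table>
"""


def table_rows(html):
    """Return each data row of the first table as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row only has <th> cells, so it is skipped
            rows.append(cells)
    return rows


rows = table_rows(EXAMPLE_HTML)

# Write the rows as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Region", "Cluster Type", "Cluster Start", "Cluster End"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Looping the same function over every HTML file in the output folders would give one combined table, which is the general shape of what the script produces.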

Let's explore this DataFrame and see what kind of predictions antiSMASH has given us:

In [2]:
import pandas as pd

antismash_df = pd.read_csv("../metadata/antiSMASH_predictions.csv")
antismash_df

Unnamed: 0,MGYP,MGYA,ERZ,Region,Encapsulin Start,Encapsulin End,Cluster Start,Cluster End,Cluster Type,Closest Match,Identity
0,MGYP000403980963,MGYA00581430,ERZ1688022,2.1,4818,5651,1,10651,RiPP-like,,
1,MGYP000403980963,MGYA00587457,ERZ1764700,2.1,207,1040,1,6040,RiPP-like,,
2,MGYP000271170847,MGYA00593710,ERZ505213,1.0,17638,18435,12638,23435,RiPP-like,,
3,MGYP000215260709,MGYA00587336,ERZ1764663,1.0,53,865,1,5865,RiPP-like,,
4,MGYP000403980963,MGYA00590709,ERZ3455898,1.0,812,1645,1,1689,RiPP-like,,
...,...,...,...,...,...,...,...,...,...,...,...
351,MGYP000403980963,MGYA00588432,ERZ1689943,2.1,6145,6978,1145,9815,RiPP-like,,
352,MGYP003561695899,MGYA00585632,ERZ2484278,1.0,30366,31184,9189,33394,"T3PKS,RiPP-like",,
353,MGYP000403980963,MGYA00587953,ERZ1690184,2.1,787,1620,1,3916,RiPP-like,,
354,MGYP000403980963,MGYA00588877,ERZ1689804,1.0,338,1171,1,4153,RiPP-like,,
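A quick way to summarise a table like this is to count how many regions fall under each cluster type. The sketch below uses a handful of rows copied from the output above as an inline sample (`sample_csv` and `sample_df` are names made up for the example); on the real data the same `value_counts` call would be applied to `antismash_df`.

```python
import io

import pandas as pd

# A few rows copied from the table above, used as a stand-in for the full CSV.
sample_csv = """MGYP,MGYA,ERZ,Region,Encapsulin Start,Encapsulin End,Cluster Start,Cluster End,Cluster Type,Closest Match,Identity
MGYP000403980963,MGYA00581430,ERZ1688022,2.1,4818,5651,1,10651,RiPP-like,,
MGYP000271170847,MGYA00593710,ERZ505213,1.0,17638,18435,12638,23435,RiPP-like,,
MGYP003561695899,MGYA00585632,ERZ2484278,1.0,30366,31184,9189,33394,"T3PKS,RiPP-like",,
MGYP000403980963,MGYA00587953,ERZ1690184,2.1,787,1620,1,3916,RiPP-like,,
"""

sample_df = pd.read_csv(io.StringIO(sample_csv))

# Number of regions per predicted cluster type, most common first.
type_counts = sample_df["Cluster Type"].value_counts()
print(type_counts)
```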


## antiSMASH Cluster Types

A big problem with antiSMASH predictions is false positives - since it relies on HMMs and rule-based assignments, any incorrectly predicted protein function can lead to an incorrect assignment as a BGC.

Previous literature has discussed the idea that encapsulins were/are often [misannotated as bacteriocins or linocins](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8182298/). From my manual review of this data, it would appear that antiSMASH still uses the old, outdated label for Pfam family PF04454 - this used to be called `Linocin M18` but has now been renamed `Encapsulating Protein for Peroxidase`.

As such, any BGC containing an encapsulin with the PF04454 annotation (as assigned by antiSMASH's implementation of the Pfam HMMs) will probably be assigned to the "RiPP-like" bacteriocin BGC type. These are probably red herrings that don't correspond to any informative BGC annotation, so let's ignore the clusters labelled only as "RiPP-like" and see what else we can find:

In [5]:
antismash_df = antismash_df[antismash_df["Cluster Type"] != "RiPP-like"]
antismash_df.sort_values(by="Cluster Type")

Unnamed: 0,MGYP,MGYA,ERZ,Region,Encapsulin Start,Encapsulin End,Cluster Start,Cluster End,Cluster Type,Closest Match,Identity
108,MGYP003667787347,MGYA00592995,ERZ3422244,2.1,17354,18160,2511,23160,"NAGGN,RiPP-like",,
80,MGYP000432667684,MGYA00375779,ERZ794912,1.1,133300,134109,115644,146871,"NRPS-like,RiPP-like",neoantimycin,0.2
216,MGYP000432667684,MGYA00375795,ERZ842394,2.1,12779,13588,1,31244,"NRPS-like,RiPP-like",neoantimycin,0.2
245,MGYP000432667684,MGYA00375784,ERZ794917,3.1,161115,161924,143459,186281,"NRPS-like,RiPP-like",neoantimycin,0.2
254,MGYP003561695899,MGYA00585631,ERZ2484272,1.0,6331,7149,1,11596,"T3PKS,RiPP-like",,
352,MGYP003561695899,MGYA00585632,ERZ2484278,1.0,30366,31184,9189,33394,"T3PKS,RiPP-like",,
123,MGYP003110288594,MGYA00587027,ERZ2772623,4.1,6130,7038,2223,27464,arylpolyene,,
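Note that the `!=` filter is an exact string comparison, so combined labels such as `T3PKS,RiPP-like` are kept. As a small sketch of how the co-occurring types could be pulled out of those combined labels, the example below splits each label on commas; the `other_types` helper is made up for illustration, and the labels are copied from the tables above.

```python
import pandas as pd

# Cluster type labels as they appear in the tables above.
cluster_types = pd.Series(
    ["NAGGN,RiPP-like", "NRPS-like,RiPP-like", "T3PKS,RiPP-like", "arylpolyene", "RiPP-like"],
    name="Cluster Type",
)

# Exact comparison: only the plain "RiPP-like" label is dropped.
kept = cluster_types[cluster_types != "RiPP-like"]


def other_types(label):
    """Split a comma-separated label and drop the RiPP-like component."""
    return [part for part in label.split(",") if part != "RiPP-like"]


# For each remaining region, the types predicted alongside (or instead of) RiPP-like.
co_occurring = kept.apply(other_types)
print(co_occurring)
```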


Since we only have 7 of these clusters to go through, let's move their output data to a new folder so we can download it and manually inspect the antiSMASH outputs for each one:

In [11]:
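import os
import shutil

# Create the destination folder if it doesn't already exist
os.makedirs("../non_RiPP-like_antiSMASH_clusters", exist_ok=True)

for index, row in antismash_df.iterrows():
    # Look the columns up by name rather than by position
    mgya = row["MGYA"]
    erz = row["ERZ"]

    # Copy the whole antiSMASH output folder for this MGYA/ERZ pair
    shutil.copytree(
        f"../antiSMASH/{mgya}-{erz}/",
        f"../non_RiPP-like_antiSMASH_clusters/{mgya}-{erz}",
        dirs_exist_ok=True,  # lets the cell be re-run (Python 3.8+)
    )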
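Since `shutil.copytree` raises an error if a source folder is missing, it can be worth checking which output folders exist before copying. Below is a minimal sketch under the assumption that outputs are stored in `{MGYA}-{ERZ}` folders as in the cell above; the `missing_outputs` helper is hypothetical, and a temporary directory stands in for `../antiSMASH`.

```python
import tempfile
from pathlib import Path

# Two (MGYA, ERZ) pairs taken from the filtered table above.
pairs = [("MGYA00592995", "ERZ3422244"), ("MGYA00587027", "ERZ2772623")]


def missing_outputs(source_root, pairs):
    """Return the '{MGYA}-{ERZ}' folder names that are absent under source_root."""
    root = Path(source_root)
    return [
        f"{mgya}-{erz}"
        for mgya, erz in pairs
        if not (root / f"{mgya}-{erz}").is_dir()
    ]


# Demonstration: only the first folder is created, so the second is reported.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "MGYA00592995-ERZ3422244").mkdir()
    missing = missing_outputs(tmp, pairs)

print(missing)
```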
