# Exploring DeepBGC Data Outputs

We've supplemented the antiSMASH BGC predictions with further predictions from DeepBGC, a deep-learning prediction tool. Unlike antiSMASH, this tool gives us an easily parsable CSV file as output.

Let's explore this DataFrame and see what kind of predictions DeepBGC has given us:

In [20]:
import pandas as pd


# Drop the columns we don't need for this analysis
deepbgc_df = pd.read_csv("../metadata/deepbgc_hits.csv").drop(
    ["detector", "detector_version", "detector_label", "bgc_candidate_id", "antibacterial", "cytotoxic",
     "inhibitor", "antifungal","Alkaloid", "NRP", "Other", "Polyketide", "RiPP", "Saccharide", "Terpene", "Hit?", "Contig"],
axis=1)
deepbgc_df

Unnamed: 0,Cluster Start,Cluster End,Cluster Length,num_proteins,num_domains,num_bio_domains,deepbgc_score,product_activity,product_class,protein_ids,bio_pfam_ids,pfam_ids,MGYP,ERZ,MGYC,Encapsulin Start,Encapsulin End,Strand,MGYA
0,1545,26938,25393,19,9,2,0.75971,antibacterial,Saccharide,ERZ2772546.64-NODE-64-length-43846-cov-5.80870...,PF04055;PF00534,PF01165;PF07068;PF13884;PF13884;PF00534;PF1369...,MGYP003110882604,ERZ2772546,MGYC001052550128,3131,4792,1,MGYA00587038
1,532,2822,2290,2,2,0,0.54969,antibacterial,,ERZ3421826.29466-NODE-29466-length-4415-cov-18...,,PF07068;PF04783,MGYP003627698165,ERZ3421826,MGYC001312960595,533,1711,-1,MGYA00592968
2,259,31105,30846,33,11,0,0.63556,antibacterial,,ERZ840836.15-NODE-15-length-44244-cov-6.431533...,,PF10124;PF15027;PF02204;PF00932;PF13479;PF0279...,MGYP001235300537,ERZ840836,MGYC001646726449,260,1168,1,MGYA00589046
3,9720,19487,9767,14,4,0,0.64053,antibacterial,,ERZ1758307.29-NODE-29-length-46129-cov-4.06559...,,PF13420;PF17236;PF00386;PF00386,MGYP001806711746,ERZ1758307,MGYC000169393703,14086,15279,-1,MGYA00583072
4,2,28879,28877,23,40,9,0.79064,,,ERZ2772644.528-NODE-528-length-28880-cov-4.962...,PF04321;PF01041;PF08241;PF07993;PF01050;PF0089...,PF14362;PF13385;PF04965;PF03776;PF08772;PF0972...,MGYP003144195444,ERZ2772644,MGYC001067578561,3,1469,-1,MGYA00587101
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177,0,21448,21448,25,66,14,0.91369,antibacterial,Saccharide,ERZ3421824.335-NODE-335-length-84768-cov-11.57...,PF07993;PF02350;PF01488;PF03721;PF00908;PF0432...,PF07068;PF10108;PF03104;PF13482;PF00692;PF1371...,MGYP003638549746,ERZ3421824,MGYC001310042284,1,1167,1,MGYA00592972
178,194,11195,11001,16,9,0,0.63363,antibacterial,,ERZ2772518.788-NODE-788-length-14799-cov-14.68...,,PF13539;PF00166;PF10124;PF10614;PF03783;PF1125...,MGYP003109746915,ERZ2772518,MGYC001052550269,3482,4393,1,MGYA00587071
179,3436,25241,21805,39,19,1,0.83849,antibacterial,Saccharide,ERZ2772574.516-NODE-516-length-33977-cov-28.03...,PF00534,PF13692;PF00534;PF13439;PF13640;PF13759;PF1364...,MGYP003131986813,ERZ2772574,MGYC001062182714,22191,23099,-1,MGYA00587035
180,695,26279,25584,34,10,0,0.63500,antibacterial,,ERZ2772574.538-NODE-538-length-33404-cov-5.912...,,PF16559;PF12760;PF12850;PF11655;PF11655;PF0954...,MGYP003132011367,ERZ2772574,MGYC001062183451,1585,2463,1,MGYA00587035


Out of 182 hits, most (128) are predicted to be antibacterial; the rest are either cytotoxic (31) or have no assigned activity (23):

In [21]:
deepbgc_df["product_activity"].value_counts()

antibacterial    128
cytotoxic         31
Name: product_activity, dtype: int64
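
`value_counts()` drops missing values by default, which is why the hits with no predicted activity do not appear in the output above. Here is a minimal sketch of how `dropna=False` would surface them, using a small made-up Series rather than the real `deepbgc_df` (the toy values are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for deepbgc_df["product_activity"]; values are illustrative only
activity = pd.Series(["antibacterial", "cytotoxic", None, "antibacterial", None])

# dropna=False keeps missing values as their own category in the counts
counts = activity.value_counts(dropna=False)
print(counts)

# Number of entries without a predicted activity
n_undefined = int(activity.isna().sum())
print(f"Undefined: {n_undefined}")
```

On the real table the same call would be `deepbgc_df["product_activity"].value_counts(dropna=False)`.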

Most hits don't have an assigned product class, but let's see what the classes are for the assigned ones:

In [22]:
deepbgc_df["product_class"].value_counts()

Saccharide    36
RiPP           6
Terpene        2
Other          2
Polyketide     1
Name: product_class, dtype: int64
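
To see how predicted activity and product class relate, the two columns could be cross-tabulated. The sketch below runs on a small made-up DataFrame (not the real data), and the `"unassigned"` label is an assumption introduced here so that missing values stay visible in the table:

```python
import pandas as pd

# Toy stand-in for the two DeepBGC columns; values are illustrative only
toy_df = pd.DataFrame({
    "product_activity": ["antibacterial", "antibacterial", "cytotoxic", None],
    "product_class": ["Saccharide", None, "RiPP", "Saccharide"],
})

# Replace missing values with an explicit label so they show up in the table
filled = toy_df.fillna("unassigned")

# Rows are activities, columns are product classes, cells are hit counts
table = pd.crosstab(filled["product_activity"], filled["product_class"])
print(table)
```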

These `Saccharide` BGCs might be interesting - let's check their Pfam annotations and see if we can spot anything cool.

First, we need functions to access the Pfam labels in the DataFrame and add their text descriptions:

In [26]:
# Load the Pfam label dictionary downloaded from the GoogleResearch/Proteinfer GitHub repo.
# Some important labels are missing from it, so manually curated entries are added below.

import gzip
import json

# gzip.open in text mode decompresses the file on the fly for json.load
with gzip.open("../DBs/label_descriptions.json.gz", "rt", encoding="utf-8") as f:
    labels_dict = json.load(f)

labels_dict["PF19821"] = "Phage capsid protein"
labels_dict["PF19307"] = "Phage capsid-like protein"
labels_dict['PF19289'] = "PmbA/TldA metallopeptidase C-terminal domain"
labels_dict['PF19290'] = "PmbA/TldA metallopeptidase central domain"
labels_dict['PF20211'] = "Family of unknown function (DUF6571)"
labels_dict['PF19782'] = "Family of unknown function (DUF6267)"
labels_dict['PF19343'] = "Family of unknown function (DUF5923)"
labels_dict['PF20036'] = "Major capsid protein 13-like"
labels_dict['PF18960'] = "Family of unknown function (DUF5702)"
labels_dict['PF18906'] = "Phage tail tube protein"
labels_dict['PF19753'] = "Family of unknown function (DUF6240)"

def get_label(pfam):
    # Return the text description for a Pfam accession, or None if it is unknown
    return labels_dict.get(pfam)

# Convert the ";"-separated Pfam list of a DataFrame entry into a free-text description
def get_pfam_list_labels(pfam_list):
    pfam_list = pfam_list.split(";")

    try:
        return " | ".join([get_label(pfam) for pfam in pfam_list])
    except TypeError:
        # join() fails if any label is None, so a single unknown accession
        # (or the "None" placeholder) turns the whole description into "None"
        return "None"
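
One side effect of the `TypeError` fallback is that a single unlabelled Pfam accession collapses the whole description to `"None"`. As a possible alternative (not what this notebook uses), the raw accession could be kept in place of a missing label. The sketch below is self-contained; `toy_labels` and `describe_pfams` are hypothetical names, and `PF99999` is a made-up accession:

```python
# Toy label dictionary standing in for labels_dict; only two entries are included
toy_labels = {
    "PF00534": "Glycosyl transferases group 1",
    "PF04055": "Radical SAM superfamily",
}

def describe_pfams(pfam_ids, labels, sep=" | "):
    """Hypothetical variant of get_pfam_list_labels that keeps unknown IDs."""
    # Missing or empty entries get the same "None" placeholder as before
    if not isinstance(pfam_ids, str) or not pfam_ids:
        return "None"
    # dict.get falls back to the raw accession when no label is known
    return sep.join(labels.get(pfam, pfam) for pfam in pfam_ids.split(";"))

print(describe_pfams("PF04055;PF00534;PF99999", toy_labels))
```

With this variant, unlabelled accessions stay visible in the description and can be curated later, instead of hiding the labels that were found.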


In [28]:
deepbgc_df["bio_pfam_ids"] = deepbgc_df["bio_pfam_ids"].fillna("None")
deepbgc_df["Descriptions"] = deepbgc_df["bio_pfam_ids"].apply(get_pfam_list_labels)
deepbgc_df["Descriptions"]

0      Radical SAM superfamily | Glycosyl transferase...
1                                                   None
2                                                   None
3                                                   None
4      RmlD substrate binding domain | DegT/DnrJ/EryC...
                             ...                        
177    Male sterility protein | UDP-N-acetylglucosami...
178                                                 None
179                        Glycosyl transferases group 1
180                                                 None
181                        Glycosyl transferases group 1
Name: Descriptions, Length: 182, dtype: object

## Saccharide BGCs

Now that we have our text descriptions for each operon, let's have a look at the functions of the proteins found in these predicted Saccharide operons:

In [35]:
saccharide_df = deepbgc_df[deepbgc_df["product_class"] == "Saccharide"].sort_values(by="MGYP")

for row in saccharide_df.to_dict(orient="records"):
    print(row["MGYP"])
    print(row["Descriptions"])
    print()

MGYP001178754852
Male sterility protein | DegT/DnrJ/EryC1/StrS aminotransferase family | RmlD substrate binding domain | Glycosyl transferases group 1 | 4Fe-4S binding domain | Glycosyl transferase family 2 | KR domain | Cytidylyltransferase | Polysaccharide biosynthesis protein | NAD dependent epimerase/dehydratase family

MGYP001216717877
Glycosyl transferases group 1

MGYP001238560740
Male sterility protein | Acetyltransferase (GNAT) family | Polysaccharide biosynthesis protein | DegT/DnrJ/EryC1/StrS aminotransferase family | Glycosyl transferase family 2 | NAD dependent epimerase/dehydratase family | Glycosyl transferases group 1 | RmlD substrate binding domain | UDP-N-acetylglucosamine 2-epimerase

MGYP001412933479
Glycosyl transferase family 2 | Putative zinc binding domain | UDP-glucose/GDP-mannose dehydrogenase family; NAD binding domain | NAD dependent epimerase/dehydratase family | Methyltransferase domain | RmlD substrate binding domain | UDP-glucose/GDP-mannose dehydrogenas

Looks like they all contain glycosyltransferases, along with a bunch of other enzymes such as epimerases and aminotransferases. Getting a nice graphical visualization of these BGCs with genes and functions is going to be non-trivial, so for now let's write some code to print the operon of each DeepBGC prediction as text:

In [43]:
saccharide_df.head()

Unnamed: 0,Cluster Start,Cluster End,Cluster Length,num_proteins,num_domains,num_bio_domains,deepbgc_score,product_activity,product_class,protein_ids,bio_pfam_ids,pfam_ids,MGYP,ERZ,MGYC,Encapsulin Start,Encapsulin End,Strand,MGYA,Descriptions
159,4509,32614,28105,40,57,10,0.96757,antibacterial,Saccharide,ERZ829061.940-NODE-940-length-32616-cov-5.1691...,PF07993;PF01041;PF04321;PF00534;PF00037;PF0053...,PF07068;PF06305;PF02153;PF16363;PF02719;PF0137...,MGYP001178754852,ERZ829061,MGYC001564165145,4510,6132,-1,MGYA00589052,Male sterility protein | DegT/DnrJ/EryC1/StrS ...
181,1118,28010,26892,34,12,1,0.65797,antibacterial,Saccharide,ERZ2772560.369-NODE-369-length-43388-cov-33.90...,PF00534,PF10124;PF13759;PF13640;PF13759;PF13640;PF1364...,MGYP001216717877,ERZ2772560,MGYC001060599069,1119,2027,1,MGYA00587091,Glycosyl transferases group 1
53,1,40762,40761,33,37,9,0.82601,,Saccharide,ERZ2772632.193-NODE-193-length-42424-cov-17.70...,PF07993;PF00583;PF02719;PF01041;PF00535;PF0137...,PF00354;PF13385;PF04965;PF07227;PF09723;PF0432...,MGYP001238560740,ERZ2772632,MGYC001067565813,19292,20434,-1,MGYA00587024,Male sterility protein | Acetyltransferase (GN...
79,348,29574,29226,35,59,12,0.87255,antibacterial,Saccharide,ERZ2772564.469-NODE-469-length-30021-cov-18.58...,PF00535;PF08421;PF03721;PF01370;PF08242;PF0432...,PF07068;PF10111;PF00535;PF02562;PF13245;PF1360...,MGYP001412933479,ERZ2772564,MGYC001060598356,349,2136,1,MGYA00587008,Glycosyl transferase family 2 | Putative zinc ...
74,280,26423,26143,43,15,2,0.91334,antibacterial,Saccharide,ERZ842402.482388-NODE-984347-length-26423-cov-...,PF00534;PF04820,PF00166;PF13949;PF08291;PF13640;PF12640;PF1375...,MGYP001437231829,ERZ842402,MGYC001773986987,2857,3765,1,MGYA00590531,Glycosyl transferases group 1 | Tryptophan hal...


In [75]:
for row in saccharide_df.to_dict(orient="records"):
    mgya = row["MGYA"]
    erz = row["ERZ"]
    mgyp = row["MGYP"]
    protein_ids = row["protein_ids"].split(";")

    # Keep only Pfam hits on this cluster's proteins with a deepbgc_score above 0.5
    pfam_df = (
        pd.read_csv(f"../deepbgc/{mgya}-{erz}/{mgya}-{erz}.pfam.tsv", sep="\t")
        .query("protein_id in @protein_ids")
        .query("deepbgc_score > 0.5")
    )

    # Normalise the encapsulin coordinates so that start <= end
    encapsulin_start = min(row["Encapsulin Start"], row["Encapsulin End"])
    encapsulin_end = max(row["Encapsulin Start"], row["Encapsulin End"])

    print(f"----------{mgyp}----------")
    print()
    for protein in protein_ids:
        protein_hits = pfam_df[pfam_df["protein_id"] == protein]
        pfams = protein_hits["pfam_id"].values

        if len(pfams) > 0:
            # Take the gene coordinates from the first Pfam hit of this protein
            gene_start = protein_hits["gene_start"].values[0]
            gene_end = protein_hits["gene_end"].values[0]
            start = min(gene_start, gene_end)
            end = max(gene_start, gene_end)
            pfam_labels = " | ".join([get_label(pfam) for pfam in pfams])
            # Flag the gene whose coordinates fully contain the encapsulin
            if encapsulin_start >= start and encapsulin_end <= end:
                print("ENCAPSULIN")
            print(f"{start} - {pfam_labels} - {end}")

    print()

----------MGYP001178754852----------

ENCAPSULIN
4509 - Major capsid protein Gp23 - 6132
7928 - Lipopolysaccharide assembly protein A domain - 8144
9299 - Prephenate dehydrogenase - 9539
9538 - GDP-mannose 4;6 dehydratase | Polysaccharide biosynthesis protein | NAD dependent epimerase/dehydratase family | RmlD substrate binding domain | 3-beta hydroxysteroid dehydrogenase/isomerase family | Male sterility protein | KR domain - 10441
10440 - Glycosyl transferase family 2 | Glycosyltransferase like family 2 - 11112
11122 - DAHP synthetase I family | NeuB family - 11953
11949 - Cytidylyltransferase | MobA-like NTP transferase domain | 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase - 12531
12527 - D-isomer specific 2-hydroxyacid dehydrogenase; NAD binding domain | D-isomer specific 2-hydroxyacid dehydrogenase; catalytic domain - 13490
13464 - Glycosyl transferase family 2 - 14697
14698 - DegT/DnrJ/EryC1/StrS aminotransferase family | Cys/Met metabolism PLP-dependent enzyme | Amin
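
The `ENCAPSULIN` flag in the loop above is an interval-containment test: a gene is flagged when the encapsulin coordinates fall entirely inside it. A minimal standalone sketch of that check follows; the helper name `contains` is hypothetical, and the coordinates are taken from the first cluster printed above:

```python
def contains(gene_start, gene_end, feature_start, feature_end):
    """Return True if the feature lies entirely within the gene.

    Both coordinate pairs are sorted first, so the check works
    regardless of strand orientation.
    """
    g_lo, g_hi = sorted((gene_start, gene_end))
    f_lo, f_hi = sorted((feature_start, feature_end))
    return g_lo <= f_lo and f_hi <= g_hi

# Encapsulin (4510-6132) against the first gene of the cluster (4509-6132)
print(contains(4509, 6132, 4510, 6132))

# The same encapsulin against a downstream gene (7928-8144)
print(contains(7928, 8144, 4510, 6132))
```

Because the test requires full containment, a gene that only partially overlaps the encapsulin coordinates would not be flagged.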

In [76]:
saccharide_df[saccharide_df["MGYP"] == "MGYP001238560740"]

Unnamed: 0,Cluster Start,Cluster End,Cluster Length,num_proteins,num_domains,num_bio_domains,deepbgc_score,product_activity,product_class,protein_ids,bio_pfam_ids,pfam_ids,MGYP,ERZ,MGYC,Encapsulin Start,Encapsulin End,Strand,MGYA,Descriptions
53,1,40762,40761,33,37,9,0.82601,,Saccharide,ERZ2772632.193-NODE-193-length-42424-cov-17.70...,PF07993;PF00583;PF02719;PF01041;PF00535;PF0137...,PF00354;PF13385;PF04965;PF07227;PF09723;PF0432...,MGYP001238560740,ERZ2772632,MGYC001067565813,19292,20434,-1,MGYA00587024,Male sterility protein | Acetyltransferase (GN...


## Comparing antiSMASH and DeepBGC

Both tools returned only a small set of BGC predictions. This is likely due to the difficulty of functionally annotating metagenomic proteins, as well as the limitations on contig length.

Let's compare the number and type of outputs of each package:

In [52]:
antismash_df = pd.read_csv("../metadata/antiSMASH_predictions.csv")

antismash_mgyps = set(antismash_df["MGYP"].unique())
deepbgc_mgyps = set(deepbgc_df["MGYP"].unique())

print(f"antiSMASH Hits: {len(antismash_mgyps)}")
print(f"DeepBGC hits: {len(deepbgc_mgyps)}")
print()
print(f"Unique antiSMASH Hits: {len(antismash_mgyps.difference(deepbgc_mgyps))}")
print(f"Unique DeepBGC hits: {len(deepbgc_mgyps.difference(antismash_mgyps))}")
print(f"Shared hits: {len(antismash_mgyps.intersection(deepbgc_mgyps))}")

antiSMASH Hits: 52
DeepBGC hits: 109

Unique antiSMASH Hits: 44
Unique DeepBGC hits: 101
Shared hits: 8
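
Note that `set()` collapses the 182 DeepBGC rows to 109 unique MGYP accessions, so the counts above are per encapsulin rather than per predicted cluster. If the overlap is needed as a table rather than as counts, an outer merge with `indicator=True` is one option. The sketch below uses made-up accessions (`A` to `D`) in place of the real tables:

```python
import pandas as pd

# Toy accession lists; these are made up and not taken from the real tables
antismash_toy = pd.DataFrame({"MGYP": ["A", "B", "C"]})
deepbgc_toy = pd.DataFrame({"MGYP": ["B", "C", "D", "D"]})

# Deduplicate first so each accession is counted once, as with set()
merged = antismash_toy.drop_duplicates().merge(
    deepbgc_toy.drop_duplicates(), on="MGYP", how="outer", indicator=True
)

# The _merge column is "left_only", "right_only" or "both" for each accession
summary = merged["_merge"].value_counts()
print(summary)

# Accessions predicted by both tools
shared = merged.loc[merged["_merge"] == "both", "MGYP"].tolist()
print(shared)
```

Here `left_only` corresponds to antiSMASH-only hits and `right_only` to DeepBGC-only hits, mirroring the set differences computed above.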
