In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

First create a dict with each of the extra label sets.

In [39]:
labels_dict = {}
for label_file in ['extra_antibiotics',
                   'extra_antifungal',
                   'extra_siderophore']:
    with open(label_file, 'r') as f:
        labels_dict[label_file] = [x.strip() for x in f.readlines()]

In [35]:
metadata = pd.read_csv('data_mibig_fixed.tsv', sep='\t')

I'll now produce a table for each type of label. Each table shows every label of that type and how many examples in `data_mibig_fixed.tsv` carry it.

In [64]:
for label_type, label_list in labels_dict.items():
    print('-'*100)
    print(f'Label Type is {label_type}')
    # Boolean mask: rows of the metadata whose Name is one of the extra labels.
    idx = metadata['Name'].isin(label_list)
    new_labels = metadata.loc[idx, 'Name']
    print(f'Total number of labels: {len(label_list)}, Total number of examples: {len(new_labels)}')
    print(new_labels.value_counts())
    print(f'The following labels do not appear in MIBIG metadata: {list(set(label_list) - set(new_labels))}')

----------------------------------------------------------------------------------------------------
Label Type is extra_antibiotics
Total number of labels: 37, Total number of examples: 52
kanamycin          5
neomycin           3
streptomycin       3
tobramycin         3
pyrrolomycin       3
salinomycin        2
thiolactomycin     2
nogalamycin        2
uncialamycin       2
rifamycin          2
saframycin A       1
rhodomycin         1
meilingmycin       1
anthramycin        1
apramycin          1
lankamycin         1
blasticidin        1
duramycin          1
alnumycin A        1
congocidine        1
erythromycin       1
pyridomycin        1
teixobactin        1
radamycin          1
kasugamycin        1
tyrocidine         1
mersacidin         1
cremeomycin        1
gramicidin         1
brevicidine        1
enduracidin        1
streptoseomycin    1
daptomycin         1
lysobactin         1
mycemycin A        1
Name: Name, dtype: int64
The following labels do not appear in MIBIG metada

For comparison, here is the count of the fifty most frequent values of the `Name` column.

In [61]:
print('Top Fifty Most Common Values of Name')
metadata['Name'].value_counts()[:50]

Top Fifty Most Common Values of Name


capsular polysaccharide    30
carotenoid                 17
O-antigen                  11
ectoine                     9
lipopolysaccharide          8
aflatoxin                   8
exopolysaccharide           6
melanin                     6
kanamycin                   5
Myxochromide D              5
Myxochromide A              5
glycopeptidolipid           5
prodigiosin                 4
cylindrospermopsin          4
1-heptadecene               4
epothilone                  4
S-layer glycan              3
monoterpenes-diterpenes     3
pederin                     3
mycophenolic acid           3
aerobactin                  3
rebeccamycin                3
eicosapentaenoic acid       3
neomycin                    3
toyocamycin                 3
geldanamycin                3
aclacinomycin               3
pyrrolomycin                3
meridamycin                 3
lasalocid                   3
tobramycin                  3
Myxochromide S              3
piericidin A1               3
staurospor

Observations:

- Very sparse labels. Even if we try to predict `Name` as one overall label, most of its values are too rare for us to trust any method that claims to predict them. Even the most frequent value of `Name` occurs in only $30/1815 \approx 1.7\%$ of examples.
- Some of the labels don't appear in any of our BGCs. 
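
To put numbers on the sparsity described in these observations, a small helper along the following lines could be used. This is only a sketch: `sparsity_summary`, the `min_count` threshold, and the toy data are illustrative assumptions and not part of the analysis above. On the real data it would be called as `sparsity_summary(metadata['Name'])`.

```python
import pandas as pd


def sparsity_summary(names: pd.Series, min_count: int = 5) -> dict:
    """Summarize how thinly the values of a label column are spread.

    `min_count` is an arbitrary, assumed threshold for "frequent enough
    to be worth modelling"; adjust it to taste.
    """
    counts = names.value_counts()      # sorted descending, NaN excluded
    n_examples = int(counts.sum())     # number of non-null examples
    frequent = counts[counts >= min_count]
    return {
        "n_examples": n_examples,
        "n_distinct": int(len(counts)),
        "n_singletons": int((counts == 1).sum()),
        # Share of examples carrying the single most common label.
        "top_fraction": float(counts.iloc[0] / n_examples),
        # Share of examples whose label occurs at least min_count times.
        "fraction_covered": float(frequent.sum() / n_examples),
    }


# Made-up toy data, only to show the shape of the result.
toy = pd.Series(["a"] * 5 + ["b"] * 2 + ["c", "d", "e"])
summary = sparsity_summary(toy)
print(summary)
```

A low `fraction_covered` together with a high `n_singletons` would back up the point that most values of `Name` are too rare to predict individually.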

Questions: 

1. How were these labels assigned to BGCs? For example, how much was assigned by hand versus pulled from one database or another?

2. Is there any upfront structure we can propose about these labels? I think if we cluster and then fish for similarities, we will always find something. Given that we have lots of labels that appear only once or twice, we should keep any fishing expedition reeled in :)

3. Are there any well-established hierarchical groupings of these labels? I'm sure we can make up plausible ones, but if standard ones exist that would be very helpful. Summarizing these molecules in terms of simple properties like size could also be an excellent way to use this information. I think whatever we do will have to be relatively coarse.
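
As a picture of what the coarsening in question 3 might look like in practice, here is a minimal sketch that maps individual names onto broader classes and recounts. The `label_to_class` mapping, the `coarsen` helper, and the `"unassigned"` fallback are all hypothetical; the class assignments are for illustration only and would need to come from an agreed-upon source. The toy counts mirror a few rows of the antibiotics table above.

```python
import pandas as pd

# Hypothetical coarse grouping, for illustration only. A real mapping
# should be taken from a standard classification, not invented here.
label_to_class = {
    "kanamycin": "aminoglycoside",
    "neomycin": "aminoglycoside",
    "streptomycin": "aminoglycoside",
    "tobramycin": "aminoglycoside",
    "erythromycin": "macrolide",
    "daptomycin": "lipopeptide",
}


def coarsen(names: pd.Series, mapping: dict, fallback: str = "unassigned") -> pd.Series:
    """Replace each name with its coarse class; unmapped names get `fallback`."""
    return names.map(mapping).fillna(fallback)


# Toy stand-in for metadata['Name'], using counts from the table above.
toy_names = pd.Series(
    ["kanamycin"] * 5
    + ["neomycin"] * 3
    + ["streptomycin"] * 3
    + ["tobramycin"] * 3
    + ["erythromycin", "daptomycin"]
    + ["pyrrolomycin"] * 3   # deliberately left out of the mapping
)

coarse_counts = coarsen(toy_names, label_to_class).value_counts()
print(coarse_counts)
```

Fixing the mapping before looking at any clustering output is what keeps this from becoming a post-hoc justification.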

If the siderophores or antibiotics have some agreed-upon hierarchical structure (even for a subset of them), seeing if clustering of transport domains recovers that _is_ a reasonable thing imo. But I would be skeptical if we just clustered and justified whatever we found _post-hoc_. If no such hierarchical structure exists, thinking about simple summaries of these molecule classes is another way to coarsen