### Validating the contents of the pickle file (the pool of BGCs)

In [2]:
import pickle

pickle_file = "../preprocessed_bgcs.pkl"

with open(pickle_file, 'rb') as f:
    bgc_data = pickle.load(f)

print(f"Successfully loaded {len(bgc_data)} BGCs from pickle file.")

Successfully loaded 3013 BGCs from pickle file.


This matches the number of BGCs found in data_exploration.ipynb.

Next, validating the metadata:

In [15]:
bgc = bgc_data[0]
print("BGC metadata:")
print(f" Accession: {bgc.get('accession')}")
print(f" Biosynthesis Class: {bgc.get('biosynthesis', {}).get('classes', [{}])[0].get('class')}")
print(f" Structure: {bgc.get('compounds', [{}])[0].get('structure')}")
print(f" Mass: {bgc.get('compounds', [{}])[0].get('mass')}")
print(f" Formula: {bgc.get('compounds', [{}])[0].get('formula')}")
print(f" Taxonomy: {bgc.get('taxonomy', {}).get('name')}")

BGC metadata:
 Accession: BGC0002584
 Biosynthesis Class: ribosomal
 Structure: CC[C@H](C)[C@@H](C(=O)OC)NC(=O)C1=C(OC(=N1)C2=CSC(=N2)C3=C(OC(=N3)C4=COC(=N4)[C@H](C(C)C)NC(=O)/C(=N/O)/C(C)C)C)C
 Mass: None
 Formula: None
 Taxonomy: Streptomyces sp. FXJ1.264


For comparison, the values below were taken directly from the original JSON for BGC0002584:
- "accession": "BGC0002584"
- "class": "ribosomal"
- "structure": "CC[C@H](C)[C@@H](C(=O)OC)NC(=O)C1=C(OC(=N1)C2=CSC(=N2)C3=C(OC(=N3)C4=COC(=N4)[C@H](C(C)C)NC(=O)/C(=N/O)/C(C)C)C)C"
- no mass in data
- no formula in data
- "name": "Streptomyces sp. FXJ1.264"
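One caveat about the lookups above: `.get('compounds', [{}])[0]` only falls back to `[{}]` when the key is absent. If the key is present but holds an empty list, `[0]` raises `IndexError`. A minimal sketch of a safer accessor follows; `first_item` and the sample record are hypothetical and not part of the preprocessing code.

```python
def first_item(record, *keys):
    """Follow nested dict keys, then return the first dict in the list found there.

    Returns {} if any key is missing, the list is empty, or the first
    element is not a dict, so a chained .get() on the result never raises.
    """
    node = record
    for key in keys:
        if not isinstance(node, dict):
            return {}
        node = node.get(key)
    if isinstance(node, list) and node and isinstance(node[0], dict):
        return node[0]
    return {}


# Hypothetical record: 'compounds' is present but empty.
sample = {
    "accession": "EXAMPLE0001",
    "biosynthesis": {"classes": [{"class": "ribosomal"}]},
    "compounds": [],
}

biosynthesis_class = first_item(sample, "biosynthesis", "classes").get("class")
structure = first_item(sample, "compounds").get("structure")
print(f" Biosynthesis Class: {biosynthesis_class}")
print(f" Structure: {structure}")
```

With this helper, a BGC that has no compounds prints `None` for the structure instead of stopping the cell.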

Next, verify graph creation success (BGCs can have multiple compounds, so the number of graphs will be higher than the total count of BGCs):

In [17]:
graph_count = sum(1 for bgc in bgc_data for c in bgc.get('compounds', []) if c.get('mol_graph'))
print(f"{graph_count} molecular graphs present.")

4401 molecular graphs present.
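To check the assumption that BGCs often hold more than one compound, a histogram of compounds per BGC is useful. This is a sketch on hypothetical sample records; on the real data, `sample_bgcs` would be replaced by `bgc_data`.

```python
from collections import Counter


def compounds_per_bgc(bgcs):
    """Map 'number of compounds' -> 'number of BGCs with that many compounds'."""
    return Counter(len(bgc.get("compounds", [])) for bgc in bgcs)


# Hypothetical records standing in for bgc_data.
sample_bgcs = [
    {"accession": "EXAMPLE0001", "compounds": [{}, {}]},
    {"accession": "EXAMPLE0002", "compounds": [{}]},
    {"accession": "EXAMPLE0003"},  # no 'compounds' key at all
]

hist = compounds_per_bgc(sample_bgcs)
for n_compounds, n_bgcs in sorted(hist.items()):
    print(f"{n_bgcs} BGC(s) with {n_compounds} compound(s)")
```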


Next, verify that no graphs are missing:

In [18]:
missing_graphs = sum(1 for bgc in bgc_data for c in bgc.get('compounds', []) if c.get("mol_graph") is None)
print(f"{missing_graphs} compounds are missing graphs ")

1042 compounds are missing graphs 


Let's see how many total compounds there are, and where these missing graphs might be coming from:

In [23]:
total_compounds = sum(len(bgc.get('compounds', [])) for bgc in bgc_data)
print(f"Total compounds: {total_compounds}")

Total compounds: 5443
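The three counts are consistent: 4401 graphs + 1042 missing = 5443 compounds. Note that the first count tests truthiness while the second tests `is None`, so a graph object that exists but evaluates as falsy (possible for some graph types when empty) would be counted by neither. A single-pass tally, sketched below on hypothetical data, makes that third bucket explicit; the strings are stand-ins for real graph objects.

```python
def tally_graphs(bgcs):
    """Return (with_graph, missing, falsy_but_present) over all compounds."""
    with_graph = missing = falsy_but_present = 0
    for bgc in bgcs:
        for compound in bgc.get("compounds", []):
            graph = compound.get("mol_graph")
            if graph is None:
                missing += 1
            elif graph:
                with_graph += 1
            else:
                falsy_but_present += 1
    return with_graph, missing, falsy_but_present


# Hypothetical records; strings and [] stand in for graph objects.
sample_bgcs = [
    {"compounds": [{"mol_graph": "graph-1"}, {"mol_graph": None}]},
    {"compounds": [{"mol_graph": "graph-2"}, {"mol_graph": []}]},
    {},  # BGC without a 'compounds' key
]

with_graph, missing, falsy_but_present = tally_graphs(sample_bgcs)
total = with_graph + missing + falsy_but_present
print(f"{with_graph} with graph, {missing} missing, "
      f"{falsy_but_present} present but falsy, {total} total")
```

On the real pool, the third number should come out as 0 given the totals reported above.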


I suspect most of these come from BGCs with multiple compounds, and that an issue in the
mol_to_graph module is causing the missing graphs.

In [24]:
bgcs_with_multiple_compounds = [bgc for bgc in bgc_data if len(bgc.get('compounds', [])) > 1]
missing_graphs_from_multi_compound_bgcs = sum(1 for bgc in bgcs_with_multiple_compounds for c in bgc.get('compounds', []) if c.get('mol_graph') is None)
print(f"Number of missing graphs that come from BGCs with multiple compounds: {missing_graphs_from_multi_compound_bgcs}")

Number of missing graphs that come from BGCs with multiple compounds: 563


This is a lot (563 of 1042, about 54%), but still only about half, so there is more going on than a simple issue with mol_to_graph.
Next, manually inspect some of the BGCs with missing graphs.

In [25]:
missing_graphs_bgcs = [
    (bgc.get('accession'), c.get('structure'))
    for bgc in bgc_data
    for c in bgc.get('compounds', [])
    if c.get('mol_graph') is None
]

# Print the accession and SMILES string for the first 10 compounds missing graphs
for accession, smiles_str in missing_graphs_bgcs[:10]:
    print(f"Accession: {accession}, SMILES string: {smiles_str}")

Accession: BGC0001748, SMILES string: None
Accession: BGC0001748, SMILES string: None
Accession: BGC0001748, SMILES string: None
Accession: BGC0001748, SMILES string: None
Accession: BGC0001748, SMILES string: None
Accession: BGC0001748, SMILES string: None
Accession: BGC0001733, SMILES string: None
Accession: BGC0000764, SMILES string: None
Accession: BGC0001756, SMILES string: None
Accession: BGC0001756, SMILES string: None


Now the issue is apparent: some compounds are missing SMILES strings. <br>
This can be verified manually by checking the original JSON files. <br>

Manual verification confirmed that some BGCs list compounds with no SMILES string attached.
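The first ten rows all show a missing SMILES string, but that alone does not rule out conversion failures elsewhere in the list. One way to separate the two causes is to split the missing graphs by whether a SMILES string was available. This is a sketch on hypothetical records; `classify_missing` is not part of the existing code.

```python
def classify_missing(bgcs):
    """Split compounds without a graph by likely cause."""
    no_smiles = conversion_failed = 0
    for bgc in bgcs:
        for compound in bgc.get("compounds", []):
            if compound.get("mol_graph") is not None:
                continue  # graph exists, nothing to explain
            if compound.get("structure"):
                # SMILES was present, so graph construction itself failed
                conversion_failed += 1
            else:
                # None or empty string: there was nothing to convert
                no_smiles += 1
    return {"no_smiles": no_smiles, "conversion_failed": conversion_failed}


# Hypothetical records; the string "graph" stands in for a graph object.
sample_bgcs = [
    {"accession": "EXAMPLE0001", "compounds": [
        {"structure": None, "mol_graph": None},
        {"structure": "CCO", "mol_graph": "graph"},
    ]},
    {"accession": "EXAMPLE0002", "compounds": [
        {"structure": "not-a-valid-smiles", "mol_graph": None},
    ]},
]

causes = classify_missing(sample_bgcs)
print(causes)
```

If `conversion_failed` is 0 on the real pool, every missing graph traces back to missing source data rather than to mol_to_graph.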

There are two options:
1. Use only those BGCs that have a graph for each of their compounds
2. Use an external API to look up the compound name (which is provided in many of the compounds with missing SMILES strings) and query for its related SMILES string if it exists.

For now, I will modify the pool to accept only BGCs with a graph for each of their compounds. I will come back to this issue later, but I think there are enough acceptable BGCs to build an initial population and begin iterative development on the GA.
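A minimal sketch of what such a filter could look like, on hypothetical records. The names `has_all_graphs` and `filter_pool` are illustrative, not the actual preprocessing code, and dropping BGCs that have no compounds at all is an assumption made here.

```python
def has_all_graphs(bgc):
    """True if the BGC has at least one compound and every compound has a graph."""
    compounds = bgc.get("compounds", [])
    # Assumption: a BGC with no compounds is not useful for the pool.
    return bool(compounds) and all(
        c.get("mol_graph") is not None for c in compounds
    )


def filter_pool(bgcs):
    """Keep only BGCs whose compounds all have molecular graphs."""
    return [bgc for bgc in bgcs if has_all_graphs(bgc)]


# Hypothetical records; the string "graph" stands in for a graph object.
pool = [
    {"accession": "EXAMPLE0001", "compounds": [{"mol_graph": "graph"}, {"mol_graph": None}]},
    {"accession": "EXAMPLE0002", "compounds": [{"mol_graph": "graph"}]},
    {"accession": "EXAMPLE0003", "compounds": []},
]

kept = filter_pool(pool)
print(f"Kept {len(kept)} of {len(pool)} BGCs: {[b['accession'] for b in kept]}")
```

Filtering at the BGC level, not the compound level, keeps each retained BGC complete: no BGC ends up in the pool with only a subset of its compounds.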

A filter was added to remove BGCs that had compounds missing SMILES strings. Final result:

In [28]:
pickle_file = "../preprocessed_bgcs.pkl"

with open(pickle_file, 'rb') as f:
    bgc_data = pickle.load(f)

print(f"Successfully loaded {len(bgc_data)} BGCs from pickle file.")

Successfully loaded 2349 BGCs from pickle file.


In [29]:
missing_graphs = sum(1 for bgc in bgc_data for c in bgc.get('compounds', []) if c.get("mol_graph") is None)
print(f"{missing_graphs} compounds are missing graphs ")

0 compounds are missing graphs 


Success: no compounds are missing graphs, and the pool still holds 2349 of the original 3013 BGCs (about 78%), a solid population size.