# Balance Metabolic Models

Using Frowin's scripts, I obtained stoichiometrically consistent models; the next step is mainly to balance the charges.

Work plan (Martina):
* Check that all metabolites have a chemical formula (via COBRApy).
* Then tackle the mass imbalances (if there are any).
* Then the charge imbalances.
* This is mainly about finding the correct reaction pathways in the various databases (BiGG, MetaCyc, KEGG, etc.). This is the part of the protocol that also discusses the proton imbalances.
* The end goal is that calling .check_mass_balance() on a reaction (in COBRApy) returns an empty result ({}), meaning that the reaction has no remaining mass or charge imbalance.
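
To make the goal in the last bullet concrete, here is a minimal pure-Python sketch of what a mass/charge balance check computes. It is not COBRApy's implementation; `parse_formula` and `imbalance` are hypothetical helpers, and the formulas and charges are illustrative values for the hexokinase reaction. A balanced reaction gives an empty dict, while a wrong metabolite charge shows up under the key "charge".

```python
import re
from collections import defaultdict

def parse_formula(formula):
    """Turn a formula string such as 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = defaultdict(int)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)

def imbalance(stoichiometry, metabolites):
    """Sum elements and charge over a reaction; return only the non-zero entries."""
    totals = defaultdict(float)
    for met_id, coefficient in stoichiometry.items():
        formula, charge = metabolites[met_id]
        totals["charge"] += coefficient * charge
        for element, count in parse_formula(formula).items():
            totals[element] += coefficient * count
    return {key: value for key, value in totals.items() if abs(value) > 1e-9}

# illustrative (formula, charge) pairs; negative coefficients are reactants
metabolites = {
    "glc__D_c": ("C6H12O6", 0),
    "atp_c": ("C10H12N5O13P3", -4),
    "g6p_c": ("C6H11O9P", -2),
    "adp_c": ("C10H12N5O10P2", -3),
    "h_c": ("H", 1),
}
hex1 = {"glc__D_c": -1, "atp_c": -1, "g6p_c": 1, "adp_c": 1, "h_c": 1}

balanced = imbalance(hex1, metabolites)

# same reaction, but with a deliberately wrong charge on g6p_c
wrong = dict(metabolites, g6p_c=("C6H11O9P", 0))
unbalanced = imbalance(hex1, wrong)

print(balanced, unbalanced)  # {} {'charge': 2.0}
```

The simple regex only handles plain formulas (no brackets, no R groups), which is enough for the illustration.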

## Imports & Paths

In [32]:
import os
import csv
from collections import Counter
from cobra.io import read_sbml_model
from cobra.manipulation.validate import check_mass_balance
import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
import ast

In [2]:
models_path = "Models/mass_balance/"

In [3]:
# import the models after mass balancing with Frowin's scripts
models = {}
for model_name in (f for f in os.listdir(models_path) if f.endswith(".xml")):
    model = read_sbml_model(os.path.join(models_path, model_name))
    model.solver = "cplex"
    models[model_name[:3]] = model  # the first three characters of the file name are the model ID (AA1 ... AA7)

# os.listdir() returns the files in arbitrary order, so sort the dictionary alphabetically (AA1 ... AA7)
models = {key: models[key] for key in sorted(models)}
AA1, AA2, AA3, AA4, AA5, AA6, AA7 = [models[f"AA{i}"] for i in range(1, 8)]
# model_list = [f"AA{i}" for i in range(1, 8)]

Restricted license - for non-production use only - expires 2026-11-23


## Functions

In [4]:
def get_objective_value(model):
    print(f"value of objective for {model} is {model.optimize().objective_value}")

In [5]:
# checks the charge balance of every reaction in a model (mass-unbalanced reactions, if any, also show up in the results)
def check_charge_balance(model):
    unbalanced_reactions = check_mass_balance(model)
    print("There are {0} charge unbalanced reactions in {1}".format(len(unbalanced_reactions), model) )
    return unbalanced_reactions

We use check_mass_balance() from cobra to find charge-unbalanced (!) reactions.
The function reports both mass- and charge-unbalanced reactions. However, the mass-unbalanced reactions were already eliminated in the previous steps (with Frowin's scripts); only AA3 still has 2 mass-unbalanced reactions (these are also charge unbalanced, so they may need closer attention in general).
So although the function checks both, the results here are, apart from these two AA3 reactions, charge-unbalanced reactions only.
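
Since the check reports both kinds of imbalance, a small helper can separate the charge-only cases from those that also involve elements (like the two AA3 reactions). This is a sketch under the assumption that each result entry maps to a dict with element symbols and/or the key "charge"; `split_imbalances` and the reaction IDs are made up for illustration.

```python
def split_imbalances(unbalanced):
    """Separate charge-only imbalances from those that also involve elements."""
    charge_only, mass_involved = {}, {}
    for rxn_id, imbalance in unbalanced.items():
        # everything except the "charge" key is an element imbalance
        elements = {key: value for key, value in imbalance.items() if key != "charge"}
        if elements:
            mass_involved[rxn_id] = imbalance
        else:
            charge_only[rxn_id] = imbalance
    return charge_only, mass_involved

# hypothetical result dict, keyed by reaction ID
example = {
    "RXN_A": {"charge": -1.0},
    "RXN_B": {"charge": 2.0, "H": 2.0},
    "RXN_C": {"O": 1.0},
}
charge_only, mass_involved = split_imbalances(example)
print(sorted(charge_only), sorted(mass_involved))  # ['RXN_A'] ['RXN_B', 'RXN_C']
```

With the real output of check_charge_balance(model) the keys are Reaction objects, so one would probably convert them first, e.g. {rxn.id: imb for rxn, imb in result.items()}.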

In [216]:
# returns a pandas DataFrame with metabolite info for a given cobra model: bigg_id, model_id, formula and charge
# NOTE: bigg_id could be wrong because it is derived from the model_id simply by removing the trailing _compartment
def extract_met_info_model(model):
    met_infos = []

    for met in model.metabolites:
        met_infos.append({
            "bigg_id": met.id.rsplit("_", 1)[0],  # strip compartment so that it matches the actual BIGG ID that also doesn't have compartments (e.g., glc__D_c to glc__D)
            "model_id": met.id,
            "model_formula": met.formula,
            "model_charge": met.charge
        })

    met_infos = pd.DataFrame(met_infos)
    return met_infos
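
Regarding the NOTE above: one way to make the ID stripping a bit safer is to remove the suffix only when it is a known compartment. A minimal sketch, assuming the compartments are c, p and e (the ones that appear in this notebook); `strip_compartment` is a hypothetical helper, not part of cobra.

```python
def strip_compartment(met_id, compartments=("c", "p", "e")):
    """Remove a trailing _<compartment> only if the suffix is a known compartment."""
    base, sep, suffix = met_id.rpartition("_")
    if sep and suffix in compartments:
        return base
    return met_id

print(strip_compartment("glc__D_c"))  # glc__D
print(strip_compartment("h2o_p"))     # h2o
print(strip_compartment("some_met"))  # some_met (made-up ID without compartment, left untouched)
```

The plain rsplit("_", 1)[0] would turn the last example into "some", which would then silently fail to match anything in BiGG.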

In [215]:
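# merges the model metabolites with the BiGG table (global df_bigg, loaded in the BiGG section) and flags whether the model charge/formula appears among the BiGG charges/formulas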
def compare_bigg_modelMets(model_mets):
    # Merge on BiGG ID (you can tune how you strip compartments if needed)
    merged = model_mets.merge(df_bigg, on="bigg_id", how="left")

    merged["charge_match"] = merged.apply(
        lambda row: row["model_charge"] in row["charges"] if isinstance(row["charges"], list) else False,
        axis=1
    )

    merged["formula_match"] = merged.apply(
        lambda row: row["model_formula"] in row["formulas"] if isinstance(row["formulas"], (list, set)) else False, axis=1
    )

    return merged

In [217]:
def get_mismatches_after_merge(df_merge):
    mismatches = df_merge.loc[(df_merge['formula_match'] == False) | (df_merge['charge_match'] == False)]
    mismatches = mismatches[["model_id", "bigg_id", "model_charge", "charges", "model_formula", "formulas", "charge_match", "formula_match"]]
    return mismatches

In [270]:
def get_confmat_charge_formula(df_merge):
    if isinstance(df_merge, pd.DataFrame):
        conf_matrix = df_merge.groupby(["charge_match", "formula_match"]).size().reset_index(name='count')
        print(conf_matrix)
    elif isinstance(df_merge, dict):
        # NOTE: assumes that every model has all four True/False combinations and that the dict is ordered AA1 ... AA7
        conf_matrix = {
            "charge_match": ["False", "False", "True", "True"],
            "formula_match": ["False", "True", "False", "True"],
        }
        conf_matrix = pd.DataFrame(conf_matrix)
        for i, item in enumerate(df_merge.values()):
            conf_matrix_model = item.groupby(["charge_match", "formula_match"]).size().reset_index(name='count')
            name = f"AA{i+1}"
            conf_matrix.insert(i+2, name, conf_matrix_model["count"])
        print(conf_matrix)
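
The dict branch above lines up the counts by row position, which only works while every model has all four True/False combinations (groupby drops combinations without rows). A minimal pure-Python sketch of a count that always returns all four combinations; `match_counts` is a hypothetical helper and the rows are toy data.

```python
from collections import Counter
from itertools import product

def match_counts(rows):
    """Count (charge_match, formula_match) pairs, including combinations with zero rows."""
    counts = Counter((bool(charge), bool(formula)) for charge, formula in rows)
    return {combo: counts.get(combo, 0) for combo in product([False, True], repeat=2)}

toy_rows = [(True, True), (True, True), (False, True)]
toy_counts = match_counts(toy_rows)
print(toy_counts)
# {(False, False): 0, (False, True): 1, (True, False): 0, (True, True): 2}
```

For a merged DataFrame the rows could come from zip(df["charge_match"], df["formula_match"]).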

## Evaluate current state of models regarding charge balance

In [8]:
# check the flux through the objective; the values also show that the models are not yet realistic, because objective values this high are not feasible in vivo
for model in models.values():
    get_objective_value(model)

value of objective for AA1 is 66.94193425569041
value of objective for AA2 is 58.481001782528594
value of objective for AA3 is 43.436128611698244
value of objective for AA4 is 102.68616306092056
value of objective for AA5 is 43.93218694400248
value of objective for AA6 is 65.18134873230218
value of objective for AA7 is 50.884800170810124


In [120]:
# check demand/sink reactions, which may need curation: are they supposed to exist?
print("demands", AA1.demands)
print("sinks", AA1.sinks)
# print("exchanges", AA1.exchanges)

demands []
sinks [<Reaction sink_2ohph_c at 0x7cf0d019aa40>, <Reaction sink_hemeO_c at 0x7cf0d01244c0>, <Reaction sink_mobd_c at 0x7cf0d019a410>]


### Get all charge-unbalanced reactions for all models

In [9]:
# dictionary to store unbalanced reactions
unbalanced_reactions_dict = {}

for name, model in models.items():
    unbalanced_reactions = check_charge_balance(model)
    unbalanced_reactions_dict[name] = unbalanced_reactions

# these numbers are in accordance with the numbers for "charge balance" reactions in the memote report

There are 445 charge unbalanced reactions in AA1
There are 473 charge unbalanced reactions in AA2
There are 372 charge unbalanced reactions in AA3
There are 410 charge unbalanced reactions in AA4
There are 360 charge unbalanced reactions in AA5
There are 451 charge unbalanced reactions in AA6
There are 452 charge unbalanced reactions in AA7


In [10]:
# We know how many unbalanced reactions each model has on its own, but how many distinct reactions is that across all models?
unique_reactions = set()

# Loop through all models and collect reaction IDs
for model_name, unbalanced_reactions in unbalanced_reactions_dict.items():
    # Add the reaction IDs to the set (a set behaves like a mathematical set, i.e. it only holds unique elements)
    unique_reactions.update(reaction.id for reaction in unbalanced_reactions.keys())

# list of all reaction IDs that are charge unbalanced in at least one model
unique_reaction_ids = list(unique_reactions)

print("There are {0} charge unbalanced reactions throughout all models.".format(len(unique_reaction_ids)))
# print(unique_reaction_ids)


There are 808 charge unbalanced reactions throughout all models.
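
The union above says how many distinct reactions are affected, but not how they are distributed over the models. A minimal sketch of how the overlap could be counted; `reaction_overlap` is a hypothetical helper and the IDs are toy data, not the real results.

```python
from collections import Counter

def reaction_overlap(unbalanced_ids_by_model):
    """Count in how many models each reaction ID is unbalanced."""
    counts = Counter()
    for ids in unbalanced_ids_by_model.values():
        counts.update(set(ids))  # set() so a reaction counts once per model
    n_models = len(unbalanced_ids_by_model)
    shared_by_all = sorted(r for r, n in counts.items() if n == n_models)
    model_specific = sorted(r for r, n in counts.items() if n == 1)
    return counts, shared_by_all, model_specific

toy = {
    "AA1": ["R1", "R2", "R3"],
    "AA2": ["R2", "R3"],
    "AA3": ["R3", "R4"],
}
counts, shared_by_all, model_specific = reaction_overlap(toy)
print(len(counts), shared_by_all, model_specific)  # 4 ['R3'] ['R1', 'R4']
```

With the notebook's data the input could be built as {name: [rxn.id for rxn in rxns] for name, rxns in unbalanced_reactions_dict.items()}. Reactions shared by all models are probably the best place to start curating, since one fix helps everywhere.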


In [69]:
# get reactions for specific model
reaction_names_aa1 = unbalanced_reactions_dict['AA1'].keys()
reaction_names_list = [reaction.id for reaction in reaction_names_aa1]
print(reaction_names_list)

['1P2CBXLCYCL', '1P2CBXLR', '23CTI1', '23CTI2', '23DK5MPPISO', '24DECOAR', '2AACLPGT160', '2AACLPGT181', '2AACLPPEAT160', '2AACLPPEAT181', '2ACLMM', '2OH3K5MPPISO', '3HAACOAT140', '3HAD40', '3OAR40', '4CMLCL_kt', 'A6PAG', 'AACPS1', 'AACPS3', 'AACPS5', 'AADa', 'AADb', 'AALDH', 'ACACT5r_1', 'ACACT6r_1', 'ACHBS', 'ACLS_a', 'ACLS_d', 'ACM6PH', 'ACMANApts', 'ACOAD10f', 'ACOAD11f', 'ACOAD12f', 'ACOAD13f', 'ACOAD14f', 'ACOAD15f', 'ACOAD16f', 'ACOAD17f', 'ACOAD18f', 'ACOAD19f', 'ACOAD20f', 'ACOAD21f', 'ACOAD23f', 'ACOAD25f', 'ACOAD26f', 'ACOAD29f', 'ACOAD3', 'ACOAD30f', 'ACOAD31f', 'ACOAD34f', 'ACOAD3f', 'ACOAD4_1', 'ACOAD6', 'ACOAD6f', 'ACP1_FMN', 'ACPPAT140', 'ACPPAT160', 'ACPPAT181', 'ACSERHS', 'ACSPHAC100', 'ACSPHAC101', 'ACSPHAC120', 'ACSPHAC121', 'ACSPHAC121d6', 'ACSPHAC140', 'ACSPHAC141', 'ACSPHAC141d5', 'ACSPHAC142', 'ACSPHAC160', 'ACSPHAC40', 'ACSPHAC50', 'ACSPHAC60', 'ACSPHAC70', 'ACSPHAC80', 'ACSPHAC90', 'ACSPHACP100', 'ACSPHACP40', 'ACSPHACP50', 'ACSPHACP60', 'ACSPHACP70', 'ACSPHAC

In [11]:
metabolite_counter_compartment = Counter()
metabolite_counter_name = Counter()
seen_reactions = set()  # Track reactions we've already counted

for model in models.values():
    for rxn_id in unique_reaction_ids:
        if rxn_id in model.reactions and rxn_id not in seen_reactions:
            reaction = model.reactions.get_by_id(rxn_id)
            for metabolite in reaction.metabolites:
                metabolite_counter_compartment[metabolite.id] += 1  # compartment specific, e.g. h2o_c and h2o_p are different metabolites
                metabolite_counter_name[metabolite.name] += 1  # keyed by name, so h2o is a single entry regardless of compartment
            seen_reactions.add(rxn_id)  # Mark this reaction as counted

# Write to CSV
with open("metabolite_counts.csv", mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["metabolite_id (compartment specific)", "count"])
    for met_name, count in metabolite_counter_compartment.items():
        writer.writerow([met_name, count])

In [12]:
# number of unique metabolites that take part in unbalanced reactions

# compartment specific, e.g. h2o_c and h2o_p are counted separately
print(len(metabolite_counter_compartment))
# by name, so h2o exists only once
print(len(metabolite_counter_name))

990
880


In [13]:
# the numbers show how often each metabolite takes part in an unbalanced reaction
metabolite_counter_compartment

Counter({'h_c': 417,
         'h2o_c': 268,
         'atp_c': 138,
         'coa_c': 125,
         'ppi_c': 103,
         'pi_c': 87,
         'amp_c': 85,
         'adp_c': 62,
         'h2o_p': 51,
         'co2_c': 50,
         'nadh_c': 48,
         'nad_c': 46,
         'h_p': 44,
         'pyr_c': 39,
         'nadph_c': 37,
         'nadp_c': 34,
         'fad_c': 34,
         'fadh2_c': 34,
         'ACP_c': 31,
         'o2_c': 28,
         'cmp_c': 26,
         'g3p_c': 21,
         'glyc3p_c': 19,
         'fe2_c': 19,
         'glu__L_c': 19,
         'nh4_c': 17,
         'pep_c': 17,
         'pi_p': 14,
         'r5p_c': 13,
         'accoa_c': 13,
         'f6p_c': 12,
         'gly_c': 12,
         'fmn_c': 11,
         'ser__L_c': 10,
         'ctp_c': 10,
         'akg_c': 10,
         '2dr1p_c': 10,
         'dhap_c': 9,
         'pppi_c': 8,
         'gtp_c': 8,
         'asp__L_c': 8,
         'uacgam_c': 8,
         'thmpp_c': 8,
         'ala__L_c': 8,
         

In [179]:
AA1.metabolites.get_by_id("man1p_c").charge

0

## BiGG

In [209]:
### Code by ChatGPT ###
# downloads the information for all BiGG metabolites (BiGG ID, name, formulae and charges) and saves it to a CSV file
# there are 9088 metabolites (this is the right number, also according to the BiGG website)

# Get list of all universal metabolites
base_url = "http://bigg.ucsd.edu/api/v2/"
list_url = base_url + "universal/metabolites"
response = requests.get(list_url)

# check if request is going through
if response.status_code != 200:
    raise Exception("Failed to fetch metabolite list")

metabolites = response.json()["results"]
print(f"Found {len(metabolites)} metabolites. Fetching details...")

# because this cell is one big block of ChatGPT code, I didn't extract the function into the Functions section
# fetches the details of one metabolite, i.e. BiGG ID, name, formulas and charges
def fetch_metabolite_details(met):
    bigg_id = met.get("bigg_id", "")
    name = met.get("name", "")
    url = f"{base_url}universal/metabolites/{bigg_id}"
    try:
        r = requests.get(url, timeout=10)
        if r.status_code == 200:
            data = r.json()  # converts JSON response to a dictionary (data)
            formulae = data.get("formulae", [])  # if formula not available, use empty list []
            charges = data.get("charges", [])

            # Safe quoting for CSV
            safe_name = str(name).replace('"', "'")
            name = f'"{safe_name}"'

            return {
                "bigg_id": bigg_id,
                "name": name,
                "formulas": str(formulae),
                "charges": str(charges)
            }
    except Exception as e:
        print(f"Error with {bigg_id}: {e}")

    safe_name = str(name).replace('"', "'")
    name = f'"{safe_name}"'
    return {
        "bigg_id": bigg_id,
        "name": name,
        "formulas": "[]",
        "charges": "[]"
    }


# Use ThreadPoolExecutor to parallelize requests
results = []
with ThreadPoolExecutor(max_workers=25) as executor:
    futures = [executor.submit(fetch_metabolite_details, met) for met in metabolites]
    for i, future in enumerate(as_completed(futures)):
        results.append(future.result())
        if i % 500 == 0:
            print(f"{i}/{len(metabolites)} done...")

# Save to CSV
df = pd.DataFrame(results)
df.to_csv("bigg_metabolites_complete.csv", index=False, quoting=csv.QUOTE_MINIMAL)
print("Saved to bigg_metabolites_complete.csv")


Found 9088 metabolites. Fetching details...
0/9088 done...
500/9088 done...
1000/9088 done...
1500/9088 done...
2000/9088 done...
2500/9088 done...
3000/9088 done...
3500/9088 done...
4000/9088 done...
4500/9088 done...
5000/9088 done...
5500/9088 done...
6000/9088 done...
6500/9088 done...
7000/9088 done...
7500/9088 done...
8000/9088 done...
8500/9088 done...
9000/9088 done...
Saved to bigg_metabolites_complete.csv
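
One thing to keep in mind: fetch_metabolite_details falls back to empty lists when a request fails, and an empty charges list later counts as a charge mismatch in compare_bigg_modelMets. A possible way to reduce such silent gaps is to retry failed requests. The following is only a sketch; `fetch_with_retry` and `flaky` are hypothetical, and the fetch function is passed in so that the example runs without network access.

```python
import time

def fetch_with_retry(fetch, key, retries=3, delay=0.0):
    """Call fetch(key) up to `retries` times; return (result, ok)."""
    for attempt in range(retries):
        try:
            return fetch(key), True
        except Exception:
            # wait a little longer after each failed attempt
            time.sleep(delay * (attempt + 1))
    return None, False

# stand-in for a real request: fails twice, then succeeds
calls = {"n": 0}
def flaky(key):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"bigg_id": key, "charges": [0]}

result, ok = fetch_with_retry(flaky, "h2o")
print(result, ok, calls["n"])  # {'bigg_id': 'h2o', 'charges': [0]} True 3
```

Returning the ok flag next to the result would make it possible to tell "BiGG has no charge for this metabolite" apart from "the request failed".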


In [210]:
# Read previously created CSV with all BIGG metabolites
df_bigg = pd.read_csv("bigg_metabolites_complete.csv", quotechar='"')

# Convert stringified lists back to real lists
df_bigg["formulas"] = df_bigg["formulas"].apply(ast.literal_eval)
df_bigg["charges"] = df_bigg["charges"].apply(ast.literal_eval)
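
The conversion matters because compare_bigg_modelMets only accepts real lists (isinstance check); a list that is still a string would always count as a mismatch. A minimal sketch of the round trip with a slightly more defensive parser; `parse_list_cell` is a hypothetical helper that also tolerates non-string cells such as NaN.

```python
import ast

def parse_list_cell(cell):
    """Parse a stringified list from the CSV; non-string cells (e.g. NaN) become []."""
    if not isinstance(cell, str):
        return []
    value = ast.literal_eval(cell)
    return list(value) if isinstance(value, (list, tuple, set)) else [value]

stored = str([-2, -1])               # what ends up in the CSV: "[-2, -1]"
restored = parse_list_cell(stored)   # back to a real list of ints
print(restored, -2 in restored)      # [-2, -1] True
print(parse_list_cell("['C5H10O5']"), parse_list_cell(float("nan")))  # ['C5H10O5'] []
```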


In [211]:
df_bigg

| | bigg_id | name | formulas | charges |
|---|---|---|---|---|
| 0 | 10fthf7glu | "10-formyltetrahydrofolate-\[Glu\](7)" | [C50H57N13O25] | [-8] |
| 1 | 10fthf5glu | "10-formyltetrahydrofolate-\[Glu\](5)" | [C40H45N11O19] | [-6] |
| 2 | 10fthf6glu | "10-formyltetrahydrofolate-\[Glu\](6)" | [C45H51N12O22] | [-7] |
| 3 | 10fthf | "10-Formyltetrahydrofolate" | [C20H21N7O7] | [-2] |
| 4 | 10fthfglu__L | "10-Formyltetrahydrofolyl L-glutamate" | [C25H28N8O10] | [] |
| ... | ... | ... | ... | ... |
| 9083 | zymst | "Zymosterol C27H44O" | [C27H44O] | [0] |
| 9084 | zym_int2 | "Zymosterol intermediate 2 C27H42O" | [C27H42O] | [0] |
| 9085 | zymstnl | "5alpha-cholest-8-en-3beta-ol" | [C27H46O] | [0] |
| 9086 | xylu__L | "L-Xylulose" | [C5H10O5] | [0] |


In [212]:
model_mets = {f"AA{i}_mets": extract_met_info_model(models[f"AA{i}"]) for i in range(1, 8)}

In [213]:
model_mets["AA1_mets"]

| | bigg_id | model_id | model_formula | model_charge |
|---|---|---|---|---|
| 0 | 10fthf | 10fthf_c | C20H21N7O7 | -2 |
| 1 | 12dgr120 | 12dgr120_c | C27H52O5 | 0 |
| 2 | 12dgr120 | 12dgr120_p | C27H52O5 | 0 |
| 3 | 12dgr140 | 12dgr140_c | C31H60O5 | 0 |
| 4 | 12dgr140 | 12dgr140_p | C31H60O5 | 0 |
| ... | ... | ... | ... | ... |
| 1527 | xylb | xylb_e | C10H18O9 | 0 |
| 1528 | xylu__D | xylu__D_c | C5H10O5 | 0 |
| 1529 | zn2 | zn2_c | Zn | 2 |
| 1530 | zn2 | zn2_e | Zn | 2 |


In [227]:
model_merged = {f"AA{i}_merged": compare_bigg_modelMets(model_mets[f"AA{i}_mets"]) for i in range(1, 8)}
AA1_merged, AA2_merged, AA3_merged, AA4_merged, AA5_merged, AA6_merged, AA7_merged = [model_merged[f"AA{i}_merged"] for i in range(1, 8)]

In [228]:
AA1_merged

| | bigg_id | model_id | model_formula | model_charge | name | formulas | charges | charge_match | formula_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10fthf | 10fthf_c | C20H21N7O7 | -2 | "10-Formyltetrahydrofolate" | [C20H21N7O7] | [-2] | True | True |
| 1 | 12dgr120 | 12dgr120_c | C27H52O5 | 0 | "1,2-Diacyl-sn-glycerol (didodecanoyl, n-C12:0)" | [C27H52O5] | [0] | True | True |
| 2 | 12dgr120 | 12dgr120_p | C27H52O5 | 0 | "1,2-Diacyl-sn-glycerol (didodecanoyl, n-C12:0)" | [C27H52O5] | [0] | True | True |
| 3 | 12dgr140 | 12dgr140_c | C31H60O5 | 0 | "1,2-Diacyl-sn-glycerol (ditetradecanoyl, n-C1... | [C31H60O5] | [0] | True | True |
| 4 | 12dgr140 | 12dgr140_p | C31H60O5 | 0 | "1,2-Diacyl-sn-glycerol (ditetradecanoyl, n-C1... | [C31H60O5] | [0] | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1527 | xylb | xylb_e | C10H18O9 | 0 | "Xylobiose" | [C10H18O9] | [] | False | True |
| 1528 | xylu__D | xylu__D_c | C5H10O5 | 0 | "D-Xylulose" | [C5H10O5] | [0] | True | True |
| 1529 | zn2 | zn2_c | Zn | 2 | "Zinc" | [Zn] | [2] | True | True |
| 1530 | zn2 | zn2_e | Zn | 2 | "Zinc" | [Zn] | [2] | True | True |


In [224]:
model_mismatch = {f"AA{i}_mismatch": get_mismatches_after_merge(model_merged[f"AA{i}_merged"]) for i in range(1, 8)}

In [225]:
model_mismatch["AA1_mismatch"]

| | model_id | bigg_id | model_charge | charges | model_formula | formulas | charge_match | formula_match |
|---|---|---|---|---|---|---|---|---|
| 28 | 1btol_c | 1btol | 0 | [] | C4H10O | [C4H10O] | False | True |
| 32 | 1p2cbxl_c | 1p2cbxl | 0 | [-1] | C5H6NO2 | [C5H6NO2] | False | True |
| 35 | 23ddhb_c | 23ddhb | 0 | [-1] | C7H7O4 | [C7H7O4] | False | True |
| 46 | 2agpe160_c | 2agpe160 | 0 | [0] | C21H44NO7P | [C21H44NO7P1] | True | False |
| 47 | 2agpe160_p | 2agpe160 | 0 | [0] | C21H44NO7P | [C21H44NO7P1] | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1507 | val__D_p | val__D | 0 | [0] | C5H9NO2 | [C5H11NO2] | True | False |
| 1522 | xyl3_c | xyl3 | 0 | [] | C15H26O13 | [C15H26O13] | False | True |
| 1523 | xyl3_e | xyl3 | 0 | [] | C15H26O13 | [C15H26O13] | False | True |
| 1526 | xylb_c | xylb | 0 | [] | C10H18O9 | [C10H18O9] | False | True |
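
Some of the formula mismatches above are only notational: for 2agpe160 the model has C21H44NO7P while BiGG writes C21H44NO7P1, which is the same composition but fails a plain string comparison. A minimal sketch of a comparison on element counts instead; `formula_counts` and `formula_matches` are hypothetical helpers, and the simple regex does not handle brackets or R groups.

```python
import re
from collections import Counter

def formula_counts(formula):
    """Parse a plain formula string into element counts, e.g. 'C5H9NO2'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula or ""):
        counts[element] += int(number) if number else 1
    return counts

def formula_matches(model_formula, bigg_formulas):
    """True if any BiGG formula has the same element counts as the model formula."""
    target = formula_counts(model_formula)
    return any(formula_counts(f) == target for f in bigg_formulas)

print(formula_matches("C21H44NO7P", ["C21H44NO7P1"]))  # True  (P and P1 are the same)
print(formula_matches("C5H9NO2", ["C5H11NO2"]))        # False (real difference of 2 H)
```

The val__D case stays a mismatch, as it should: the two formulas differ by two hydrogens, which might point to a protonation-state difference worth checking.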


In [229]:
get_confmat_charge_formula(AA1_merged)

   charge_match  formula_match  count
0         False          False    178
1         False           True    198
2          True          False    103
3          True           True   1053


In [271]:
get_confmat_charge_formula(model_merged)

  charge_match formula_match   AA1   AA2   AA3   AA4   AA5   AA6   AA7
0        False         False   178   233   149    96   133   195   189
1        False          True   198   258   235   181   176   238   230
2         True         False   103   103    96   112    95   103    74
3         True          True  1053  1273  1028  1377  1038  1266  1183


In [None]:
# to-do
# general: look into duplicate metabolites
# cross-check the BiGG info only against the unbalanced reactions

In [None]:
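# when actually making changes to the model later, e.g. trying out other charges, it is best to use
# with model:
#   <changes go here>
# this way the model is not overwritten and you first have a work-in-progress version in which to test whether the changes are actually good
# caveat (to verify): as far as I know, the context manager only reverts context-aware changes such as bounds or the objective;
# direct attribute edits like met.charge may persist, in which case working on model.copy() is the safer option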