<a href="https://colab.research.google.com/github/hariszaf/metabolic_toy_model/blob/main/Antony2025/gapfillingGSMMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Gapfilling Genome-Scale Metabolic Models with DNNGIOR**

## Google Colab Part

In [None]:
# @title Install dependencies
!pip install cobra
!pip install dnngior --no-deps

In [None]:
import os
def create_gurobi_license():
    license_content = (
        "# Gurobi WLS license file\n"
        "# Your credentials are private and should not be shared or copied to public repositories.\n"
        "# Visit https://license.gurobi.com/manager/doc/overview for more information.\n"
        "WLSACCESSID=\n"
        "WLSSECRET=\n"
        "LICENSEID="
    )
    with open("/content/licenses/gurobi.lic", "w") as f:
        f.write(license_content)
    print("License file created at /content/licenses/gurobi.lic")

# Create directory for the license
os.makedirs("/content/licenses", exist_ok=True)

# Generate the license file
create_gurobi_license()

# Point Gurobi to the license file via an environment variable
os.environ['GRB_LICENSE_FILE'] = '/content/licenses/gurobi.lic'

!pip install gurobipy

In [None]:
!git clone https://github.com/hariszaf/metabolic_toy_model.git

In [None]:
%cd metabolic_toy_model/Antony2025

## Introduction

During this workshop we are going to read and write COBRA models.
From the dnngior package we will use the `Gapfill` class and the `NN` class from the `NN_Predictor` module.

In [None]:
from cobra.io import read_sbml_model, write_sbml_model
from dnngior.gapfill_class import Gapfill
from dnngior.NN_Predictor import NN
import pandas as pd
import numpy as np

### Gap-filling models: a reminder of why we do this
Let's load the *Bifidobacterium* model to use as an example.

In [None]:
path_to_draft_model = "./files/Bifidobacterium adolescentis_atcc_15703.sbml"
draft_model = read_sbml_model(path_to_draft_model)
draft_model.summary()

As we established before, this model does not produce biomass because reactions that are essential for growth are missing. Because there is no flux through the objective, we cannot optimize the model.

The solution is a gap-filling algorithm that takes reactions from a database of known reactions and adds them to the model until there is flux through the objective. We will not have direct genomic evidence for these reactions (unlike the reactions in the draft model), but we do know that they are required to get a functioning model.

![gapfilling](https://github.com/hariszaf/metabolic_toy_model/blob/main/Antony2025/images/gapfilling.png?raw=1)


There are multiple solutions to this problem, as there are many ways to complete the metabolism. Generally you will try to add as few reactions as possible, but we might be able to do a bit better. If we have information on which reactions are more likely to be 'really' missing, we can create models that follow reality more closely.

This is where DNNGIOR helps us: it takes the reactions from the draft model and uses a neural network to predict which reactions are missing. We can then prioritize adding those reactions.

If we want to use the default settings, it's actually quite simple. We can use the `Gapfill` class of DNNGIOR and give it the path to our draft model; the class will take care of the rest.

In [None]:
gapfill_complete_medium = Gapfill(draftModel = path_to_draft_model)

The `Gapfill` class keeps track of many things, including:

1. The reactions in the draft model
2. The prediction by the neural network
3. The added reactions
4. and of course: the gap-filled model

In [None]:
gf_model = gapfill_complete_medium.gapfilledModel
print("1. Number of reactions in the draft model: {}".format(len(gapfill_complete_medium.draft_reaction_ids)))
print("2. Number of predicted reactions: {}".format(sum(gapfill_complete_medium.predicted_reactions>0.5)))
print("3. Number of reactions added: {}".format(len(gapfill_complete_medium.added_reactions)))
print("4. Gapfilled model:")
gf_model.summary()


In principle this means we are done: if all we want is a gap-filled model, we can save it using cobra and continue our analysis. But there are still many things to consider, so let's have a closer look at how DNNGIOR works.

In [None]:
write_sbml_model(gf_model, 'gapfilled_Bifidobacterium adolescentis_atcc_15703.sbml')

### How does DNNGIOR work

DNNGIOR gapfilling takes two steps:
1. Make a prediction based on the reactions in the draft model
2. Use these predictions to weigh the solutions of the linear programming algorithm

This algorithm tries to solve the following objective:

minimize: $\sum_{r \in m} c_r f_r \mid f_b > 0$

This means that every reaction has a cost ($c_r$), and the algorithm will try to minimize the total cost of all reactions that are added. The algorithm will only consider solutions with flux through the objective ($f_b$), as that is the end goal.
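
To make the effect of the costs concrete, here is a minimal toy sketch. It is not DNNGIOR's linear program: the reaction names, costs, and the two "routes" to biomass are hypothetical, and the flux-weighted objective is simplified to a plain sum of the costs of the added reactions. It only illustrates that a cost-weighted search can prefer a longer solution over the one with the fewest reactions.

```python
from itertools import combinations

# Hypothetical candidate reactions and their costs (illustrative values only).
costs = {"rxnA": 0.1, "rxnB": 0.9, "rxnC": 0.2, "rxnD": 1.0}

# Toy stand-in for "flux through the objective is possible":
# biomass can be made if at least one of these routes is fully present.
routes = [{"rxnA", "rxnC"}, {"rxnB"}]


def produces_biomass(added):
    """Return True if the added reactions complete at least one route."""
    return any(route <= added for route in routes)


def cheapest_solution(costs):
    """Brute-force search for the feasible reaction set with the lowest total cost."""
    best_set, best_cost = None, float("inf")
    for size in range(1, len(costs) + 1):
        for subset in combinations(sorted(costs), size):
            added = set(subset)
            if not produces_biomass(added):
                continue  # infeasible: no flux through the objective
            total = sum(costs[r] for r in added)
            if total < best_cost:
                best_set, best_cost = added, total
    return best_set, best_cost


solution, total_cost = cheapest_solution(costs)
print(sorted(solution), round(total_cost, 2))  # prints: ['rxnA', 'rxnC'] 0.3
```

Adding only `rxnB` would need fewer reactions, but the two low-cost reactions together are cheaper, so they are selected.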

These costs are (partially) determined by the prediction. So let's have a look at these predictions. We can load the neural network like this:

In [None]:
from dnngior.variables import TRAINED_NN_MSEED
NN_MSEED = NN(path=TRAINED_NN_MSEED)

Then we make a prediction for our example draft model:

In [None]:
prediction = NN_MSEED.predict(draft_model)

This will give us a prediction for what reactions the neural network thinks are missing based on the draft reactions.

In [None]:
pd.Series(prediction).plot.hist(bins=100, title='Neural Network Predictions')

These ~2000 reactions are part of the microbial reactome (all the reactions that were present somewhere in the phylogeny). The predictions are inverted (1-p) to get the cost, as we want reactions with a high prediction score to have a low cost.

All other reactions in the database will get the default cost which is normally set to 1.0 but can be changed if you want to prioritize reactions from the reactome:

In [None]:
gapfill_higher_def_cost = Gapfill(draftModel = path_to_draft_model, objectiveName = 'bio1', default_cost=10, gapfill=False)

In [None]:
w = gapfill_higher_def_cost.weights
pd.Series(w).plot.hist(bins=100, label='Higher default cost', logy=True)
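
The cost assignment described above (1-p for reactions scored by the network, a default cost for everything else) can be sketched in a few lines. This is an illustration only, not DNNGIOR's internal code: the function name `predictions_to_costs` and the reaction IDs are hypothetical.

```python
def predictions_to_costs(predictions, candidate_ids, default_cost=1.0):
    """Turn prediction scores into costs.

    Scored reactions get cost 1 - p; reactions without a score get default_cost.
    """
    costs = {}
    for rxn_id in candidate_ids:
        if rxn_id in predictions:
            costs[rxn_id] = 1.0 - predictions[rxn_id]
        else:
            costs[rxn_id] = default_cost
    return costs


# Hypothetical scores for two reactions; a third candidate has no score.
predictions = {"rxn_high": 0.95, "rxn_low": 0.05}
candidates = ["rxn_high", "rxn_low", "rxn_unscored"]

costs = predictions_to_costs(predictions, candidates, default_cost=10)
print({rxn_id: round(cost, 2) for rxn_id, cost in costs.items()})
```

With a default cost of 10, any unscored reaction is far more expensive than even a poorly predicted reactome reaction, which is the effect of raising `default_cost`.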

# Customizing the candidate reaction weights

The NN weights provide a great start to guide the gap-filling process, and often they give the best solution, but sometimes you will want to fine-tune the costs based on preferences or additional knowledge. To do this, we can exclude reactions from the candidate list or change the costs associated with reactions.


## Blacklisting reactions

You might want to exclude specific reactions from the gap-filling database (e.g. reactions you know cannot be present based on other data). This can be done using the `black_list` argument:

In [None]:
blackList = ['rxn99999_c0']
gapfill_with_blacklist = Gapfill(path_to_draft_model, black_list = blackList, objectiveName = 'bio1')

This removes these reactions from the candidates, so they will never be added to the model.

Note, however, that sometimes reactions are unavoidable (i.e. no solution can be found without them), and then the gap-filling would fail. A solution to this is to use the `grey_list`: these reactions are given a much higher cost. By default they get a cost of 1,000, but you can change this using `punish_cost`. The result is that they will only be added when strictly necessary.

In [None]:
greyList = ['rxn04070_c0','rxn05467_c0','rxn00543_c0']
gapfill_with_greylist = Gapfill(path_to_draft_model, grey_list = greyList, punish_cost = 5000, objectiveName = 'bio1')
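
Conceptually, the two lists act on the candidate costs as sketched below. This is a simplified illustration under stated assumptions, not DNNGIOR's implementation: the helper `apply_lists` and the reaction IDs are hypothetical, and letting the blacklist take precedence over the grey list is a choice made for this sketch.

```python
def apply_lists(weights, black_list=(), grey_list=(), punish_cost=1000):
    """Return a new cost dict with blacklisted reactions removed and
    grey-listed reactions set to punish_cost."""
    black = set(black_list)
    grey = set(grey_list)
    adjusted = {}
    for rxn_id, cost in weights.items():
        if rxn_id in black:
            continue  # blacklisted: no longer a candidate at all
        adjusted[rxn_id] = punish_cost if rxn_id in grey else cost
    return adjusted


# Hypothetical candidate costs.
weights = {"rxn_1": 0.2, "rxn_2": 0.7, "rxn_3": 1.0}

adjusted = apply_lists(weights, black_list=["rxn_1"], grey_list=["rxn_2"], punish_cost=5000)
print(adjusted)  # prints: {'rxn_2': 5000, 'rxn_3': 1.0}
```

A grey-listed reaction stays available as a last resort, whereas a blacklisted one can make the problem unsolvable if it turns out to be essential.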

## Challenge:

Grey-list 5 of the reactions that were added to the original gap-filled Bifido model, then gap-fill the draft model again.

In [None]:
# @title Solution

greyList = list(gapfill_complete_medium.added_reactions)[:5]
gapfill_with_greylist = Gapfill(path_to_draft_model, grey_list = greyList)

### Manual weights

You can also set weights manually. This is by far the most flexible option, as you can make any change to any reaction you want. To make these changes, it is useful to set the `gapfill` parameter to `False`. This stops the `Gapfill` class from automatically continuing to the gap-filling step.

In [None]:
ungapfilled = Gapfill(draftModel = path_to_draft_model, gapfill=False)

Then you can change the costs of your candidates using the `Gapfill.set_weights(scores)` function, or set individual reactions directly: `ungapfilled.weights['rxn0001'] = 0.4`

Reloading your model will reset them back to the NN-predicted weights, but there is also a function for this:

In [None]:
ungapfilled.reset_weights()

Once you are ready to continue, you can use the class function `gapfill()` to resume the gap-filling process:

In [None]:
ungapfilled.gapfill()
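
As an example of preparing manual weights, the sketch below lowers the cost of reactions for which you have extra evidence. The helper `favour_reactions`, the scaling factor, and the reaction IDs are hypothetical; the sketch works on a plain dictionary and does not call DNNGIOR itself.

```python
def favour_reactions(weights, favoured_ids, factor=0.1):
    """Return a copy of weights in which favoured reactions are made cheaper.

    IDs that are not among the candidates are ignored rather than added.
    """
    updated = dict(weights)  # copy so the original weights stay untouched
    for rxn_id in favoured_ids:
        if rxn_id in updated:
            updated[rxn_id] = updated[rxn_id] * factor
    return updated


# Hypothetical candidate costs and a list of reactions with extra evidence.
weights = {"rxn_a": 0.8, "rxn_b": 0.5}
new_weights = favour_reactions(weights, ["rxn_a", "rxn_not_a_candidate"], factor=0.5)
print(new_weights)
```

A dictionary prepared this way could then be handed to `set_weights` before calling `gapfill()`, in the same way as the weight dictionaries used elsewhere in this notebook.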

## Challenge:

Gap-fill a model with random weights and save it as `gf_random_Bifido.sbml`

In [None]:
# @title Solution
random_uniform = np.random.uniform(0,1, len(gapfill_complete_medium.weights))
random_weights = {k:v for k,v in zip(gapfill_complete_medium.weights.keys(), random_uniform)}

ungapfilled_model = Gapfill(draftModel = path_to_draft_model, gapfill=False)
ungapfilled_model.set_weights(random_weights)
ungapfilled_model.gapfill()

gf_random = ungapfilled_model.gapfilledModel

write_sbml_model(gf_random, 'gf_random_Bifido.sbml')


## Gap-filling with a different medium

By default, the gap-filler assumes that your model lives in a complete medium, meaning that it can import any metabolite. In reality, however, organisms don't always have this luxury, so for many applications you will want to assume a more specific medium. This makes sure that the right reactions are added for an organism to synthesize metabolites not readily available in its environment.

We can define a medium file that looks like this:

In [None]:
medium_file_path = '../files/biochemistry/Nitrogen-Nitrite_media.tsv'
nit_medium = pd.read_csv(medium_file_path, sep='\t')
print(nit_medium)

We can then provide this to the gap-filler using the `medium_file` or `medium` parameter (`medium_file` takes a path and `medium` takes a pandas DataFrame).

## Challenge:

Gap-fill the Bifido model with nitrogen medium and compare the number of added reactions with the one gap-filled on a complete medium

In [None]:
# @title Solution
gapfill_nit_medium = Gapfill(draftModel = path_to_draft_model, objectiveName = 'bio1', medium_file = medium_file_path)
print("Number of reactions added nitrogen medium:", len(gapfill_nit_medium.added_reactions))
print("Number of reactions added complete medium:", len(gapfill_complete_medium.added_reactions))

## Challenge:

Gap-fill the model without phosphate and see which different reactions get added.

In [None]:
# @title Solution

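# Row 19 is taken to be the phosphate entry of this medium.
# Assign through .loc so the DataFrame itself is updated (chained indexing only changes a copy).
nit_medium.loc[nit_medium.index[19], 'max_flux'] = 0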
gapfill_nit_wop = Gapfill(draftModel = path_to_draft_model, objectiveName = 'bio1', medium = nit_medium)

# set_a = set(gapfill_nit_medium.added_reactions)
# set_b = set(gapfill_nit_wop.added_reactions)

# print(set_a.difference(set_b))
# print(set_b.difference(set_a))
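
The set comparison hinted at in the commented lines can be wrapped in a small helper that also reports the shared reactions. The function name `compare_added` and the reaction IDs below are hypothetical; in the notebook you would pass the `added_reactions` of two `Gapfill` objects instead.

```python
def compare_added(reactions_a, reactions_b):
    """Compare two collections of added reaction IDs."""
    set_a, set_b = set(reactions_a), set(reactions_b)
    return {
        "only_a": sorted(set_a - set_b),  # added only in the first run
        "only_b": sorted(set_b - set_a),  # added only in the second run
        "shared": sorted(set_a & set_b),  # added in both runs
    }


# Hypothetical results of two gap-filling runs.
with_phosphate = ["rxn_1", "rxn_2", "rxn_3"]
without_phosphate = ["rxn_2", "rxn_3", "rxn_4", "rxn_5"]

comparison = compare_added(with_phosphate, without_phosphate)
for key, rxn_ids in comparison.items():
    print(key, rxn_ids)
```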

# Batch gapfilling using the command line interface (CLI)

If you have many models that you want to gap-fill with the same medium, you can use the CLI:

`python fasta2model_CLI.py -f DIR_FASTA -o output_folder`

This command will create an output folder (-o) containing a subfolder with base ungapfilled models, a subfolder with gapfilled models, a log, and a tsv file telling you the number of added reactions.

This CLI has limited functionality and assumes the same conditions for all gap-filling, but you can change the standard gap-filling medium using the -e parameter.

`python fasta2model_CLI.py -f DIR_FASTA -o DIR_OUTPUT -e PATH_TO_MEDIUM_FILE`

If you already have base models, you can use the -m parameter to provide a folder with base models and skip the base model building step.

`python fasta2model_CLI.py -m DIR_MODELS -o DIR_OUTPUT`


To gap-fill all one-per-phylum models it would look like this:

`!python DNNGIOR/dnngior/fasta2model_CLI.py -m one_per_phylum_models -o one_per_phylum_gapfilled -sm .sbml`
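
If you prefer to launch the CLI from Python, the command can be assembled as an argument list. This is a minimal sketch that uses only the `-f`, `-m`, `-o` and `-e` flags described above; the helper `build_cli_command` is hypothetical, and the check that exactly one of the FASTA or model folders is given is an assumption of this sketch, not documented CLI behaviour.

```python
def build_cli_command(output_dir, fasta_dir=None, models_dir=None,
                      medium_file=None, script="fasta2model_CLI.py"):
    """Assemble the argument list for the batch gap-filling CLI."""
    # Assumption of this sketch: start either from FASTA files or from base models.
    if (fasta_dir is None) == (models_dir is None):
        raise ValueError("Provide exactly one of fasta_dir or models_dir")
    cmd = ["python", script]
    if fasta_dir is not None:
        cmd += ["-f", fasta_dir]
    else:
        cmd += ["-m", models_dir]
    cmd += ["-o", output_dir]
    if medium_file is not None:
        cmd += ["-e", medium_file]
    return cmd


cmd = build_cli_command("DIR_OUTPUT", models_dir="DIR_MODELS",
                        medium_file="PATH_TO_MEDIUM_FILE")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```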