<a href="https://colab.research.google.com/github/hariszaf/metabolic_toy_model/blob/main/Antony2025/reconstructingDraftGSMMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reconstructing Draft Genome-Scale Metabolic Models**

### **Basic setup**

Set up **Gurobi** and **COBRApy**. See the [setting up your environment](https://colab.research.google.com/github/hariszaf/metabolic_toy_model/blob/main/Antony2025/preparingYourEnvironment.ipynb) notebook.

### **Reconstructing Draft Genome-Scale Metabolic Models**

Draft models are incomplete: they contain only reactions supported by genome-based evidence. They are not capable of producing biomass.

To reconstruct draft models, we will use the [ModelSEED](https://academic.oup.com/nar/article/49/D1/D575/5912569?login=true) pipeline.

See also the [ModelSEED web interface](https://modelseed.org/).


### **Installing ModelSEED**

If you are working on your own machine, remember to activate the conda environment first:

```bash
conda activate gsmmWorkshop
```

### **Clone the ModelSEEDpy repository**

In [None]:
!git clone https://github.com/ModelSEED/ModelSEEDpy

### **Install it**

In [None]:
!pip install ModelSEEDpy/.

### **Working with genomes**

To reconstruct models, we first need genome sequences. We can use the genomes from EMBL, [ENSEMBL Bacteria](https://bacteria.ensembl.org/index.html), or another public genomic database.

[Here](https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/species_EnsemblBacteria.txt) is a list of all the genomes available.

With this list we can perform queries, such as retrieving all the genomes from a given taxon.

We also need a package to manipulate NCBI taxonomies: [taxoniq](https://github.com/taxoniq/taxoniq).


### **Parsing taxonomies**

In [None]:
!pip install taxoniq

### **Creating a genomes list**

Before reconstructing models, let's perform the following tasks:


1. Make a Python dictionary mapping each ENSEMBL genome to its taxonomy and to the URL of the FASTA file containing its genome-encoded proteins.

`genomes[genome] = [{taxonomic ranks}, URL of the peptide FASTA]`


2. Use our list to find a genome that interests us. For example, we will use a *Bifidobacterium adolescentis* genome;


3. Download all genomes belonging to a specific genus. For example, we will use all *Shewanella* genomes;


4. Get one representative genome per phylum.
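
The dictionary layout and the queries listed above can be sketched on a toy example. Every entry below is a hypothetical placeholder (keys, lineages, and URLs are made up for illustration), and the helper names `find_by_taxon` and `one_per_phylum` are not part of any library.

```python
# Toy version of the genome dictionary: key -> [lineage dict, peptide FASTA URL].
# All entries are hypothetical placeholders, not real ENSEMBL records.
toy_genomes = {
    "1_genus_a_species_one": [
        {"species": "Genus_a species one", "genus": "Genus_a", "phylum": "Phylum_x"},
        "https://example.org/genus_a_species_one.pep.all.fa.gz",
    ],
    "2_genus_a_species_two": [
        {"species": "Genus_a species two", "genus": "Genus_a", "phylum": "Phylum_x"},
        "https://example.org/genus_a_species_two.pep.all.fa.gz",
    ],
    "3_genus_b_species_one": [
        {"species": "Genus_b species one", "genus": "Genus_b", "phylum": "Phylum_y"},
        "https://example.org/genus_b_species_one.pep.all.fa.gz",
    ],
}


def find_by_taxon(genomes, name):
    """Return the keys of all genomes whose lineage contains `name` at any rank."""
    return [key for key, (lineage, _url) in genomes.items() if name in lineage.values()]


def one_per_phylum(genomes):
    """Map each phylum to the URL of the first genome seen for it."""
    picked = {}
    for lineage, url in genomes.values():
        phylum = lineage.get("phylum")
        if phylum is not None and phylum not in picked:
            picked[phylum] = url
    return picked


print(find_by_taxon(toy_genomes, "Genus_a"))
print(one_per_phylum(toy_genomes))
```

Keeping the lineage as a `{rank: name}` dictionary means the same lookup works for a species, a genus, or a phylum name.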

#### **ENSEMBL Genome Dictionary:**

Download the genome list file.

In [None]:
import urllib.request
urllib.request.urlretrieve("https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/species_EnsemblBacteria.txt", "species_EnsemblBacteria.txt")

!ls -la

Make a function to generate the download URL.

In [None]:
import taxoniq
import os



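def getProteinFast(l):
    """Build the peptide FASTA URL for one row (list of fields) of the genome list."""
    p1 = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/fasta/"
    # the second "_"-separated field of l[13] is the collection number
    p2 = "bacteria_" + l[13].split("_")[1] + "_collection/"
    # l[1] names the genome folder on the FTP server
    p3 = l[1] + "/pep/"
    # l[4] goes into the file name, with spaces replaced by underscores
    st = l[4].replace(" ", "_")
    p4 = l[1][0].upper() + l[1][1:] + "." + st + ".pep.all.fa.gz"

    return p1 + p2 + p3 + p4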

genomes = {}
with open('species_EnsemblBacteria.txt') as f:
    f.readline()
    for line in f:
        a = line.strip().split('\t')
        try:
            taxonomy = taxoniq.Taxon(int(a[3]))
            genomes[a[3] + "_" + a[1]] = [{rank.rank.name: rank.scientific_name for rank in taxonomy.ranked_lineage}, getProteinFast(a)]

        except KeyError:
            pass

genomes
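
To make the URL pattern easier to follow, here is a minimal sketch of the same construction with named arguments instead of list indices. The function name `peptide_fasta_url`, the argument names, and the example `core_db` string are assumptions made for illustration; only the resulting URL pattern comes from this notebook.

```python
BASE_URL = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/fasta/"


def peptide_fasta_url(species, assembly, core_db):
    """Build the peptide FASTA URL from three fields of one genome-list row.

    species  -- field 1 of the row (genome folder name)
    assembly -- field 4 of the row (assembly name)
    core_db  -- field 13 of the row (its second "_" field is the collection number)
    """
    collection = "bacteria_" + core_db.split("_")[1] + "_collection"
    file_name = (
        species[0].upper() + species[1:] + "." + assembly.replace(" ", "_") + ".pep.all.fa.gz"
    )
    return f"{BASE_URL}{collection}/{species}/pep/{file_name}"


# Example values; the core_db string is an assumed example of the format.
url = peptide_fasta_url(
    species="bifidobacterium_adolescentis_atcc_15703_gca_000010425",
    assembly="ASM1042v1",
    core_db="bacteria_100_collection_core_57_110_1",
)
print(url)
```

With these inputs the sketch reproduces the *Bifidobacterium adolescentis* URL used in the next section.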

#### **Retrieve a genome**

In [None]:
for genome in genomes:
  if "Bifidobacterium adolescentis" in genomes[genome][0].values():
    print(genome, f" url: {genomes[genome][1]}")

Let's pick the first one in the list.

In [None]:
import gzip
import shutil

url = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/fasta/bacteria_100_collection/bifidobacterium_adolescentis_atcc_15703_gca_000010425/pep/Bifidobacterium_adolescentis_atcc_15703_gca_000010425.ASM1042v1.pep.all.fa.gz"
gz_id = "Bifidobacterium adolescentis_atcc_15703.fa.gz"
fast_id = "Bifidobacterium adolescentis_atcc_15703.fa"

urllib.request.urlretrieve(url, gz_id)
with gzip.open(gz_id, 'rb') as f_in:
    with open(fast_id, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

os.remove(gz_id)
!ls -la
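
The download, decompress, and clean-up steps recur throughout this notebook, so they could be wrapped in a small helper. The sketch below is one possible way to do that; `gunzip` and `fetch_fasta` are hypothetical names, and the demo uses a made-up local file so that it runs without network access.

```python
import gzip
import os
import shutil
import tempfile
import urllib.request


def gunzip(gz_path, out_path, remove_source=True):
    """Decompress gz_path into out_path and optionally delete the archive."""
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    if remove_source:
        os.remove(gz_path)
    return out_path


def fetch_fasta(url, out_path):
    """Download a gzipped FASTA from url and store it decompressed at out_path."""
    gz_path = out_path + ".gz"
    urllib.request.urlretrieve(url, gz_path)
    return gunzip(gz_path, out_path)


# Offline demo: compress a tiny, made-up FASTA record, then restore it.
demo_dir = tempfile.mkdtemp()
demo_gz = os.path.join(demo_dir, "demo.fa.gz")
demo_fa = os.path.join(demo_dir, "demo.fa")
with gzip.open(demo_gz, "wb") as handle:
    handle.write(b">protein_1\nMKT\n")

gunzip(demo_gz, demo_fa)
with open(demo_fa, "rb") as handle:
    restored = handle.read()
print(restored)
```

A call such as `fetch_fasta(url, fast_id)` would then stand in for the download cell above.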

#### **Download all Shewanella genomes**

In [None]:
!mkdir shewanella_genomes
#get all Shewanella genomes
root = 'shewanella_genomes'
for genome in genomes:
    if "shewanella" in genome:
        url = genomes[genome][1]
        download_path = os.path.join(root, genome + ".fa.gz")
        extracted_path = os.path.join(root, genome + ".fa")

        urllib.request.urlretrieve(url, download_path)

        with gzip.open(download_path, 'rb') as f_in:
            with open(extracted_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        print(f"Extracted file saved as {extracted_path}")
        os.remove(download_path)

In [None]:
!ls shewanella_genomes -la

#### **Get one genome per Phylum**

In [None]:
!mkdir one_per_phylum

phyla = {}

for genome in genomes:
    if 'phylum' in genomes[genome][0]:#has an annotated phylum
        if genomes[genome][0]['phylum'] not in phyla:
            phyla[genomes[genome][0]['phylum']] = genomes[genome][1]


root = 'one_per_phylum'
for genome in phyla:
    url = phyla[genome]
    download_path = os.path.join(root, genome + ".fa.gz")
    extracted_path = os.path.join(root, genome + ".fa")

    urllib.request.urlretrieve(url, download_path)

    with gzip.open(download_path, 'rb') as f_in:
        with open(extracted_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    print(f"Extracted file saved as {extracted_path}")
    os.remove(download_path)

In [None]:
!ls one_per_phylum -la

### **Annotate Genome**

Before building a draft model, we need to annotate the genome using RAST.

Let's do this with our *Bifidobacterium adolescentis* genome.

In [None]:
import modelseedpy
from modelseedpy.core.msgenome import MSGenome

from modelseedpy.core.rast_client import RastClient
rast = RastClient()



genome_file = 'Bifidobacterium adolescentis_atcc_15703.fa'



genome = MSGenome.from_fasta(genome_file)

rast.annotate_genome(genome)

for i in genome.features:
    print(i.description)
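
Each protein record in the FASTA file becomes one genome feature, identified by its header line. The batch function further down passes `split = ' '` to `MSGenome.from_fasta`; the plain-Python sketch below mimics what such a header separator is assumed to do (ID before the first separator, description after it). It does not call ModelSEEDpy, and the records are made up.

```python
def read_fasta_headers(text, split=" "):
    """Yield (feature_id, description) for every header line of a FASTA string."""
    for line in text.splitlines():
        if line.startswith(">"):
            # split only once, so the description keeps its own spaces
            parts = line[1:].split(split, 1)
            feature_id = parts[0]
            description = parts[1] if len(parts) > 1 else ""
            yield feature_id, description


# Hypothetical two-record protein FASTA
example = ">prot_1 pep hypothetical protein\nMKTAYIAK\n>prot_2\nMSLLTEVE\n"
headers = list(read_fasta_headers(example))
print(headers)
```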

#### **Reconstruct Draft Model**



To reconstruct the draft genome-scale metabolic model, we use the `MSBuilder.build_metabolic_model` function of `ModelSEEDpy`. The input to the function is a RAST-annotated genome.

In [None]:
from modelseedpy import MSBuilder

model_id = 'Bifidobacterium adolescentis_atcc_15703'

base_model = MSBuilder.build_metabolic_model(model_id = model_id,
                                             genome   = genome,
                                             index    = "0",
                                             classic_biomass = True,
                                             gapfill_model   = False,
                                             gapfill_media   = None,
                                             annotate_with_rast = True,
                                             allow_all_non_grp_reactions = True
                                            )

We can see the model is a draft and does not produce biomass.

In [None]:
base_model.optimize()
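
Why a draft model typically cannot grow can be illustrated with a toy reachability check: if a single reaction in a pathway lacks genome evidence, everything downstream of it becomes unreachable. The sketch below is not flux balance analysis and does not use COBRApy; the reactions, metabolite names, and the `producible` helper are all made up for illustration.

```python
def producible(reactions, seed):
    """Return all metabolites reachable from `seed` by repeatedly applying
    reactions whose substrates are all available."""
    available = set(seed)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available


# Hypothetical three-step pathway: reaction id -> (substrates, products)
complete = {
    "R1": (["glucose"], ["g6p"]),
    "R2": (["g6p"], ["pyruvate"]),
    "R3": (["pyruvate"], ["biomass"]),
}
# Draft version: R2 is missing because no gene was annotated for it
draft = {rid: rxn for rid, rxn in complete.items() if rid != "R2"}

print("biomass" in producible(complete, {"glucose"}))  # True
print("biomass" in producible(draft, {"glucose"}))     # False
```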

We save the draft model

In [None]:
import cobra

model_name = "Bifidobacterium adolescentis_atcc_15703.sbml"
cobra.io.write_sbml_model(cobra_model = base_model, filename = model_name)

!ls -la

### **Batch reconstruction**

Let's reconstruct one draft model per phylum. First we make a function to reconstruct draft models.

In [None]:
def reconstruct_draft_model(model_id, input_protein_fasta, output_model_sbml):
    genome = MSGenome.from_fasta(input_protein_fasta, split = ' ')
    rast.annotate_genome(genome)

    base_model = MSBuilder.build_metabolic_model(model_id = model_id,
                                             genome   = genome,
                                             index    = "0",
                                             classic_biomass = True,
                                             gapfill_model   = False,
                                             gapfill_media   = None,
                                             annotate_with_rast = True,
                                             allow_all_non_grp_reactions = True
                                            )

    cobra.io.write_sbml_model(cobra_model = base_model, filename = output_model_sbml)

    return base_model


Now let's run the function for the genomes in the `one_per_phylum` folder and write the outputs to the folder `one_per_phylum_models`.

We are only going to reconstruct three models to avoid overloading the RAST server.

In [None]:
!mkdir one_per_phylum_models

rast = RastClient()

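root = "one_per_phylum"
out_root = "one_per_phylum_models"
# keep only FASTA files, then take the first three
# (a new variable name avoids overwriting the `genomes` dictionary)
fasta_files = [name for name in os.listdir(root) if name.endswith(".fa")][0:3]


for name in fasta_files:
    model_id = name[:-len(".fa")]
    model = reconstruct_draft_model(model_id, os.path.join(root, name), os.path.join(out_root, model_id + ".sbml"))
    print(f"Reconstructed {model_id}")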
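
Selecting the input files can be separated from the reconstruction itself, which makes it possible to check which models would be built before contacting the RAST server. A minimal sketch, assuming a hypothetical helper `plan_batch`; the file names in the demo are placeholders created in a temporary folder, and sorting is added because `os.listdir` returns entries in arbitrary order.

```python
import os
import tempfile


def plan_batch(in_dir, out_dir, limit=3):
    """List (model_id, fasta_path, sbml_path) for up to `limit` .fa files in in_dir."""
    fasta_files = sorted(name for name in os.listdir(in_dir) if name.endswith(".fa"))
    jobs = []
    for name in fasta_files[:limit]:
        model_id = name[: -len(".fa")]
        jobs.append(
            (model_id, os.path.join(in_dir, name), os.path.join(out_dir, model_id + ".sbml"))
        )
    return jobs


# Demo on empty placeholder files
demo_in = tempfile.mkdtemp()
for name in ["Bacillota.fa", "Actinomycetota.fa", "notes.txt", "Pseudomonadota.fa", "Chlamydiota.fa"]:
    open(os.path.join(demo_in, name), "w").close()

jobs = plan_batch(demo_in, "one_per_phylum_models")
for model_id, fasta_path, sbml_path in jobs:
    print(model_id, "->", sbml_path)
```

Each tuple could then be passed to `reconstruct_draft_model(model_id, fasta_path, sbml_path)`.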



In [None]:
!ls one_per_phylum_models -la

### **What if we don't have a protein fasta?**

We can use [`Pyrodigal`](https://joss.theoj.org/papers/10.21105/joss.04296) to predict open reading frames and make a protein multi-FASTA from our DNA sequence.

In [None]:
!pip install pyrodigal
!pip install biopython

In [None]:
url = 'https://www.ebi.ac.uk/ena/browser/api/fasta/OZ061323?download=true'

download_path = 'OZ061323.fa'

urllib.request.urlretrieve(url, download_path)

!ls -la

In [None]:
from Bio import SeqIO
import pyrodigal

record = SeqIO.read("OZ061323.fa", "fasta")
dna_seq = str(record.seq)  # Convert the sequence to a plain string

print(f"sequence length: {len(dna_seq)}")

gene_finder = pyrodigal.GeneFinder()

gene_finder.train(dna_seq)

genes = gene_finder.find_genes(dna_seq)

with open("OZ061323.pep.faa", "w") as f:
    genes.write_translations(f, sequence_id="seqXYZ")

In [None]:
!head OZ061323.pep.faa

In [None]:
model_id = 'OZ061323'

model = reconstruct_draft_model(model_id, 'OZ061323.pep.faa', 'OZ061323.sbml')

In [None]:
!ls -la

In [None]:
model.optimize()

#### **Bonus quest: Homework**

**Build a pan-genome model**

1) Build a draft model for all the *Shewanella* genomes that we downloaded;

2) Make a new model by joining all the reactions that occur at least once in a *Shewanella* genome.
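
As a starting hint, the "occurs at least once" rule is a set union over the reaction IDs of the individual models. The sketch below works on plain lists of made-up reaction IDs; with real COBRApy models the IDs could be collected from `model.reactions`, and the actual reaction objects would still need to be added to a new model. `pan_reactions` is a hypothetical helper name.

```python
from collections import Counter


def pan_reactions(reactions_per_model, min_models=1):
    """Return the reaction IDs present in at least `min_models` models."""
    counts = Counter()
    for reaction_ids in reactions_per_model.values():
        counts.update(set(reaction_ids))  # count each reaction once per model
    return {rid for rid, n in counts.items() if n >= min_models}


# Hypothetical reaction content of three strain models
toy = {
    "strain_a": ["rxn1", "rxn2", "rxn3"],
    "strain_b": ["rxn2", "rxn3", "rxn4"],
    "strain_c": ["rxn3", "rxn5"],
}

pan = pan_reactions(toy)                        # union: at least one model
core = pan_reactions(toy, min_models=len(toy))  # shared by every model
print(sorted(pan))
print(sorted(core))
```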