<a href="https://colab.research.google.com/github/hariszaf/metabolic_toy_model/blob/main/Antony2025/reconstructingDraftGSMMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reconstructing Draft Genome-Scale Metabolic Models**

### **Basic setup**

Make sure you have enabled **Gurobi** and **COBRApy**. 

If you are on the codespace, you can always check the [`prep_env.ipynb`](../prep_env.ipynb) notebook.


### **Reconstructing Draft Genome-Scale Metabolic Models**

Draft models are incomplete models containing only genome-based evidence. They are not capable of producing biomass.

To reconstruct draft models, we will use the [ModelSEED](https://academic.oup.com/nar/article/49/D1/D575/5912569?login=true) pipeline.

ModelSEED is a key resource for metabolic modeling, so [let's have a look at their web interface](https://modelseed.org/)!


### **Working with genomes**

To reconstruct models, we first need genome sequences. 

We can use the genomes from EMBL, [ENSEMBL Bacteria](https://bacteria.ensembl.org/index.html), or another public genomic database.

[Here](https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/species_EnsemblBacteria.txt) is a list of all the ENSEMBL genomes available.
With this list we can perform queries, such as retrieving all the genomes from a given taxon.

We also need a package to manipulate NCBI taxonomies: [taxoniq](https://github.com/taxoniq/taxoniq).


### **Creating a genomes list**

Before reconstructing models, let's perform the following tasks:


1. Make a Python dictionary mapping the ENSEMBL genomes to their taxonomies and to the FASTA file containing their genome-encoded proteins.

`genome : [{taxonomic ranks}, {webpage containing their peptide fasta}]`


2. Use our list to find a genome that interests us. For example, we will use a *Bifidobacterium adolescentis* genome.


3. Download all genomes belonging to a specific genus.


4. Build a draft reconstruction for at least one of those.
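
The layout of the dictionary from step 1 can be sketched with a single hand-made entry. This is only an illustration: the key, the taxid `0` and the `example.org` URL are placeholders, and the lowercase rank names (`genus`, `species`) are an assumption about how the lineage will be stored.

```python
# One hypothetical entry: key -> [lineage dict, peptide FASTA URL].
# The taxid "0" and the example.org URL are placeholders, not real records.
example_genomes = {
    "0_example_genome": [
        {"genus": "Bifidobacterium", "species": "Bifidobacterium adolescentis"},
        "https://example.org/example_genome.pep.all.fa.gz",
    ],
}

# Each value unpacks into the lineage and the URL of the protein FASTA.
lineage, fasta_url = example_genomes["0_example_genome"]
print(lineage["genus"])
print(fasta_url)
```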

#### **ENSEMBL Genome Dictionary:**

Download the file listing all genomes.

In [None]:
import urllib.request
urllib.request.urlretrieve("https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/species_EnsemblBacteria.txt", "species_EnsemblBacteria.txt")

!ls -la

In [None]:
!head species_EnsemblBacteria.txt

In [None]:
!wc -l species_EnsemblBacteria.txt

Let's build a function to generate the URL we'll use to download our genome of choice.

In [None]:
import taxoniq
import os

In [None]:
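def get_protein_fasta_url(info):
    """
    Build the URL of the peptide FASTA for one genome.

    info: a line from the species_EnsemblBacteria.txt file, split on tabs
    """
    base_url = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/fasta/"
    # info[13] is expected to look like "bacteria_<N>_collection_..."; keep <N>
    species_prefix = info[13].split("_")[1]
    collection_path = f"bacteria_{species_prefix}_collection/"
    # info[1] is the species name, which is also the directory name
    species_dir = f"{info[1]}/pep/"

    # info[4] fills the middle part of the file name; replace any spaces
    strain = info[4].replace(" ", "_")
    filename = f"{info[1].capitalize()}.{strain}.pep.all.fa.gz"

    return f"{base_url}{collection_path}{species_dir}{filename}"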
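
To see what the function produces, here is a minimal sketch that calls it on a hand-made list instead of a real line from the file. Only the fields at indices 1, 4 and 13 are filled in. The value at index 13 is an assumed shape (`bacteria_<N>_collection_...`), chosen so that the result matches the *Bifidobacterium adolescentis* URL used later in this notebook.

```python
def get_protein_fasta_url(info):
    """Same logic as the function above; `info` is one tab-split line of the species file."""
    base_url = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/fasta/"
    species_prefix = info[13].split("_")[1]
    collection_path = f"bacteria_{species_prefix}_collection/"
    species_dir = f"{info[1]}/pep/"
    strain = info[4].replace(" ", "_")
    filename = f"{info[1].capitalize()}.{strain}.pep.all.fa.gz"
    return f"{base_url}{collection_path}{species_dir}{filename}"


# Hand-made line: only indices 1, 4 and 13 are used; the rest are "-" placeholders.
sample = ["-"] * 14
sample[1] = "bifidobacterium_adolescentis_atcc_15703_gca_000010425"
sample[4] = "ASM1042v1"
sample[13] = "bacteria_100_collection_core"  # assumed shape of this field

url = get_protein_fasta_url(sample)
print(url)
```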

Let us now build the dictionary described above, holding the lineage and the FASTA URL for each genome.

In [None]:
genomes = {}

with open('species_EnsemblBacteria.txt') as f:
    next(f)  # Skip header
    for line in f:
        fields = line.strip().split('\t')
        taxid = fields[3]
        species_name = fields[1]
        key = f"{taxid}_{species_name}"
        
        try:
            taxonomy = taxoniq.Taxon(int(taxid))
            lineage = {
                rank.rank.name: rank.scientific_name
                for rank in taxonomy.ranked_lineage
            }
            fasta_url = get_protein_fasta_url(fields)
            genomes[key] = [lineage, fasta_url]

        except KeyError:
            continue  # Skip if taxid not found

In [None]:
species_name

In [None]:
lineage

In [None]:
taxonomy.ranked_lineage[0].rank.name

Let's now have a look at the actual dictionary (`genomes`) we built. 

In [None]:
list(genomes.keys())[:5]

In [None]:
genomes["123820__actinobacillus_rossii_gca_900444965"]

#### **Find a bacterium that you like!**

Spend some time (but not too much 😅) on the [ProTraits](http://protraits.irb.hr/) resource to find something that interests you!

Bacteria can do almost anything and survive almost everywhere. 

Another very useful resource for species traits is [BacDive](https://bacdive.dsmz.de/).

#### **Retrieve a genome**

Replace *Bifidobacterium adolescentis* with the species of your interest and give the following chunk a go.
This will tell you whether ENSEMBL has any genome whose lineage ***includes*** the taxon name you gave.

In [None]:
for genome in genomes:
  if "Bifidobacterium adolescentis" in genomes[genome][0].values():
    print(genome, f" url: {genomes[genome][1]}")
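
The same kind of lookup can be restricted to one taxonomic rank, which is handy for collecting every genome of a genus. The sketch below runs on a toy dictionary with the same layout as `genomes`. The helper name `urls_for_taxon`, the toy keys and the `example.org` URLs are hypothetical, and the lowercase rank names are an assumption about the lineage keys.

```python
def urls_for_taxon(genomes, taxon_name, rank="genus"):
    """Return {genome key: FASTA URL} for entries whose lineage has `taxon_name` at `rank`."""
    return {
        key: fasta_url
        for key, (lineage, fasta_url) in genomes.items()
        if lineage.get(rank) == taxon_name
    }


# Toy data with the same layout as `genomes`; keys and URLs are placeholders.
toy_genomes = {
    "1_genome_a": [
        {"genus": "Bifidobacterium", "species": "Bifidobacterium adolescentis"},
        "https://example.org/a.pep.all.fa.gz",
    ],
    "2_genome_b": [
        {"genus": "Bifidobacterium", "species": "Bifidobacterium longum"},
        "https://example.org/b.pep.all.fa.gz",
    ],
    "3_genome_c": [
        {"genus": "Escherichia", "species": "Escherichia coli"},
        "https://example.org/c.pep.all.fa.gz",
    ],
}

hits = urls_for_taxon(toy_genomes, "Bifidobacterium")
for key, url in hits.items():
    print(key, url)
```

Matching on a single rank avoids accidental hits when the same name appears at another level of the lineage.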

Let's pick the first one in the list.

In [None]:
import gzip
import shutil

url = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/current/fasta/bacteria_100_collection/bifidobacterium_adolescentis_atcc_15703_gca_000010425/pep/Bifidobacterium_adolescentis_atcc_15703_gca_000010425.ASM1042v1.pep.all.fa.gz"
gz_id = "Bifidobacterium adolescentis_atcc_15703.fa.gz"
fast_id = "Bifidobacterium adolescentis_atcc_15703.fa"

urllib.request.urlretrieve(url, gz_id)
with gzip.open(gz_id, 'rb') as f_in:
    with open(fast_id, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

os.remove(gz_id)

!ls -ltrh
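
If you plan to fetch several genomes, the download and decompression steps above can be wrapped in a small helper. This is a minimal sketch using only the standard library. The name `download_and_decompress` is hypothetical, and the function assumes the URL points to a gzip-compressed file.

```python
import gzip
import os
import shutil
import urllib.request


def download_and_decompress(url, out_path):
    """Download a gzipped file from `url` and write its decompressed content to `out_path`."""
    gz_path = out_path + ".gz"
    urllib.request.urlretrieve(url, gz_path)

    # Stream the decompressed bytes to disk without loading the whole file in memory.
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    os.remove(gz_path)  # keep only the decompressed FASTA
    return out_path
```

A call such as `download_and_decompress(url, fast_id)` would then replace the cell above.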

### **Annotate Genome**

Before building a draft model, we need to annotate the genome using RAST.

Let's do this with our *Bifidobacterium adolescentis* genome.

In [None]:
import modelseedpy
from modelseedpy.core.msgenome import MSGenome

from modelseedpy.core.rast_client import RastClient


genome_file = 'Bifidobacterium adolescentis_atcc_15703.fa'
genome      = MSGenome.from_fasta(genome_file)

rast = RastClient()
rast.annotate_genome(genome)

for i in genome.features:
    print(i.description)

#### **Reconstruct Draft Model**



To reconstruct the draft genome-scale metabolic model, we use the `build_metabolic_model` method of the `MSBuilder` class in `ModelSEEDpy`. Its input is a RAST-annotated genome.

In [None]:
from modelseedpy import MSBuilder

model_id = 'Bifidobacterium adolescentis_atcc_15703'

base_model = MSBuilder.build_metabolic_model(
    model_id = model_id,
    genome   = genome,     # genome here is the one annotated in the previous chunk
    index    = "0",
    classic_biomass = True,
    gapfill_model   = False,
    gapfill_media   = None,
    annotate_with_rast = True,
    allow_all_non_grp_reactions = True
)

We can see the model is a draft and does not produce biomass.

In [None]:
base_model.optimize()

We can save the draft reconstruction using the `cobra` library.

In [None]:
import cobra

We will discuss the `cobra` library thoroughly in the following section.

In [None]:
model_name = "Bifidobacterium adolescentis_atcc_15703.sbml"
cobra.io.write_sbml_model(cobra_model = base_model, filename = model_name)

!ls -la

### **What if we don't have a protein fasta?**

*Note: not to be performed during the workshop.*

We can use [`Pyrodigal`](https://joss.theoj.org/papers/10.21105/joss.04296) to predict open reading frames and make a protein multifasta from our DNA sequence.

In [None]:
!pip install pyrodigal
!pip install biopython

In [None]:
url = 'https://www.ebi.ac.uk/ena/browser/api/fasta/OZ061323?download=true'

download_path = 'OZ061323.fa'

urllib.request.urlretrieve(url, download_path)

!ls -la

In [None]:
from Bio import SeqIO
import pyrodigal

record = SeqIO.read("OZ061323.fa", "fasta")
dna_seq = str(record.seq)  # Convert the sequence to a plain string

print(f"sequence length: {len(dna_seq)}")

gene_finder = pyrodigal.GeneFinder()

gene_finder.train(dna_seq)

genes = gene_finder.find_genes(dna_seq)

with open("OZ061323.pep.faa", "w") as f:
    genes.write_translations(f, sequence_id="seqXYZ")

In [None]:
!head OZ061323.pep.faa
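
The next cell calls a helper named `reconstruct_draft_model`. A minimal sketch of such a helper is given below, assuming it simply chains the steps shown earlier in this notebook: load the protein FASTA, annotate it with RAST, build the draft model with the same `MSBuilder` arguments, and write it to SBML. The function body is an assumption. It needs `modelseedpy` and `cobra` installed, plus network access to the RAST service.

```python
def reconstruct_draft_model(model_id, protein_fasta, sbml_path):
    """Annotate a protein FASTA with RAST, build a draft model and save it as SBML."""
    # Imports live inside the function so that defining it does not require the packages.
    import cobra
    from modelseedpy import MSBuilder
    from modelseedpy.core.msgenome import MSGenome
    from modelseedpy.core.rast_client import RastClient

    genome = MSGenome.from_fasta(protein_fasta)
    RastClient().annotate_genome(genome)  # contacts the RAST service

    # Same arguments as in the draft reconstruction cell earlier in the notebook.
    model = MSBuilder.build_metabolic_model(
        model_id=model_id,
        genome=genome,
        index="0",
        classic_biomass=True,
        gapfill_model=False,
        gapfill_media=None,
        annotate_with_rast=True,
        allow_all_non_grp_reactions=True,
    )

    cobra.io.write_sbml_model(model, sbml_path)
    return model
```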

In [None]:
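model_id = 'OZ061323'

# `reconstruct_draft_model` must be defined before this cell runs. It is expected to wrap
# the steps shown earlier: RAST annotation, MSBuilder.build_metabolic_model and SBML export.
model = reconstruct_draft_model(model_id, 'OZ061323.pep.faa', 'OZ061323.sbml')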

In [None]:
!ls -la

In [None]:
model.optimize()