In [1]:
%load_ext autoreload
%aimport gempipe, gempipe.flowchart
%autoreload 1 

In [9]:
from gempipe import Flowchart

# Read the flowchart definition; the context manager closes the file automatically
with open('flowcharts/part_3.flowchart', 'r') as file:
    header = 'flowchart LR \n'
    flowchart = Flowchart(header + file.read())
flowchart.render(height=300, zoom=1)

# Part 3: gempipe derive

`gempipe derive` derives strain-specific GSMMs, taking the presence/absence matrix (PAM) and the pan-GSMM as its main inputs. All the available options are shown below:

In [2]:
import subprocess

# Print the help message of `gempipe derive`
command = "gempipe derive -h"
process = subprocess.Popen(command, shell=True)
response = process.wait()


usage: gempipe derive [-h] [-v] [-c] [-o] [--verbose] [-im] [-ip] [-ir] [-ig]
                      [-m] [--minflux] [--biolog] [--sbml]

gempipe v1.33.4, please cite "TODO". Full documentation available at
https://gempipe.readthedocs.io/en/latest/index.html.

optional arguments:
  -h, --help           Show this help message and exit.
  -v, --version        Show version number and exit.
  -c , --cores         How many parallel processes to use. (default: 1)
  -o , --outdir        Main output directory (will be created if not
                       existing). (default: ./)
  --verbose            Make stdout messages more verbose, including debug
                       messages. (default: False)
  -im , --inpanmodel   Path to the input pan-model. (default: -)
  -ip , --inpam        Path to the input PAM. (default: -)
  -ir , --inreport     Path to the input report file. (default: -)
  -ig , --ingannots    Path to the input genes annotation file. (default: -)
  -m , --media         Medium

## Deriving strain-specific models

Three inputs are required by `gempipe derive` to start: 

1. the pan-GSMM (`-im`/`--inpanmodel`), in its final version. Note that a draft pan-GSMM was produced in [Part 1 (`gempipe recon`)](part_1_gempipe_recon.ipynb) and subsequently curated in [Part 2](part_2_manual_curation.ipynb) with a dedicated [API](https://gempipe.readthedocs.io/en/latest/autoapi/gempipe/curate/index.html).

2. the presence/absence matrix (PAM) (`-ip`/`--inpam`), as provided by `gempipe recon`. Note that this table contains clusters in rows and strains in columns, and each cell contains the strain's genes belonging to the corresponding cluster (read the dedicated [paragraph](https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#creation-of-a-pam)).

3. the report (`-ir`/`--inreport`) produced by `gempipe recon`, containing the strain-to-species relationships.

An example command line is reported below:

    gempipe derive -c 16 -im curation/panmodel.json -ip reconout/pam.csv -ir reconout/report.csv
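
Since this tutorial already calls the command line from Python, the same command could also be assembled and launched from a notebook cell. The snippet below is only a minimal sketch: `build_derive_command` is a hypothetical helper (it is not part of gempipe), it uses only flags listed in the help message above, and the file paths are the illustrative ones from the example.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_derive_command(panmodel, pam, report, cores=1, outdir="./"):
    """Assemble a `gempipe derive` call as a list of arguments.

    Hypothetical helper: only flags shown in the help message are used.
    """
    return [
        "gempipe", "derive",
        "-c", str(cores),
        "-o", str(outdir),
        "-im", str(panmodel),
        "-ip", str(pam),
        "-ir", str(report),
    ]


inputs = {
    "panmodel": "curation/panmodel.json",  # illustrative path
    "pam": "reconout/pam.csv",             # illustrative path
    "report": "reconout/report.csv",       # illustrative path
}
command = build_derive_command(cores=16, **inputs)
print(shlex.join(command))

# Launch only if gempipe is installed and all the input files exist.
if shutil.which("gempipe") and all(Path(p).exists() for p in inputs.values()):
    subprocess.run(command, check=True)
```

Passing the arguments as a list (without `shell=True`) avoids quoting problems when paths contain spaces.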

When `gempipe derive` starts, a strain-specific GSMM is created for each genome that passed quality filtering in [Part 1](https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#filtering-the-genomes). This procedure may be called "derivation by subtraction": starting from a copy of the pan-GSMM, all the genes that are missing from the corresponding strain, according to the PAM content, are removed. Removing genes can in turn lead to the removal of metabolic reactions, in accordance with the GPR definitions.
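
The idea of "derivation by subtraction" can be illustrated with a toy example. The sketch below is not the gempipe implementation: the PAM, the reaction identifiers and the helper functions are all hypothetical, and it assumes that the GPRs of the pan-GSMM are boolean rules written in terms of cluster identifiers. A reaction is kept only if its GPR can still be satisfied by the clusters for which the strain has at least one gene.

```python
import ast


def gpr_is_satisfied(gpr, present_clusters):
    """Evaluate a boolean GPR rule ("and"/"or") against the available clusters."""
    if not gpr.strip():
        return True  # assumption: reactions without a GPR are always kept

    def evaluate(node):
        if isinstance(node, ast.BoolOp):
            values = [evaluate(child) for child in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.Name):
            return node.id in present_clusters
        raise ValueError(f"unsupported GPR syntax: {gpr!r}")

    return evaluate(ast.parse(gpr, mode="eval").body)


def derive_strain(pan_reactions, pam, strain):
    """Return the reactions of the toy pan-model that survive in one strain."""
    # Clusters with at least one gene in this strain, according to the PAM.
    present = {cluster for cluster, cells in pam.items() if cells.get(strain)}
    return {rid: gpr for rid, gpr in pan_reactions.items()
            if gpr_is_satisfied(gpr, present)}


# Toy PAM: clusters in rows, strains in columns, gene IDs in the cells.
pam = {
    "Cluster_1": {"strainA": "A_0001", "strainB": "B_0042"},
    "Cluster_2": {"strainA": "A_0002", "strainB": None},
    "Cluster_3": {"strainA": None, "strainB": "B_0107"},
}
# Toy pan-model: reaction ID -> GPR rule.
pan_reactions = {
    "R1": "Cluster_1",
    "R2": "Cluster_1 and Cluster_2",
    "R3": "Cluster_2 or Cluster_3",
    "R4": "",
}

strain_a = derive_strain(pan_reactions, pam, "strainA")
strain_b = derive_strain(pan_reactions, pam, "strainB")
print(sorted(strain_a))
print(sorted(strain_b))
```

In this toy case `strainB` loses `R2`, because one of the two subunits required by its GPR (`Cluster_2`) is missing, while `R3` is retained thanks to the alternative `Cluster_3`.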

💡 **Tip!** If [`gempipe recon`](part_1_gempipe_recon.ipynb) started from GenBank files, then a genome annotation file was produced. You can pass it to `gempipe derive` using `-ig`/`--ingannots` to improve the "gene annotation" section of the [Memote reports](https://memote.readthedocs.io/en/latest/index.html).

## Gap-filling strain-specific models

After the above process, a strain-specific GSMM may not grow. There are several possible reasons:

* the biomass assembly reaction coming from the pan-GSMM contains some strain-specific precursors. In this case, it would be better to first remove the strain-specific precursors from the biomass assembly during the manual curation in [Part 2](part_2_manual_curation.ipynb).

* the GPRs of some reactions essential to form biomass don't take into account strain-specific gene variants. By design, the [clustering step](https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#creation-of-a-pam) in `gempipe recon` is performed at high sequence identity (90%), meaning that two orthologous genes, if sufficiently dissimilar, may be grouped in different clusters even if their function is the same. Although `gempipe recon` does its best to manage these cases, it may be necessary to expand some GPRs during the manual curation in [Part 2](part_2_manual_curation.ipynb).

To handle the cases where a strain-specific GSMM doesn't grow, an automated, strain-specific **gap-filling** step is performed. You have to provide one or more growth media recipes (`--media`), preferably minimal, known or assumed to enable growth of **all** the input strains. The parameter `--minflux` lets you specify the minimum flux through the objective to require during this gap-filling. However, each strain that doesn't grow deserves a manual inspection, to check whether one of the cases discussed above has occurred; this can lead to an improved manual curation.

A medium recipe can be described using a **json file** following a rigid syntax, composed of the keys `name` and `exchanges`, where `exchanges` maps each medium component to its lower bound. For example:

```
{"name": "my_medium_name", "exchanges": {"EX_ca2_e": -1000, "EX_cl_e": -1000, "EX_co2_e": -1000, "EX_cobalt2_e": -1000, "EX_cu2_e": -1000, "EX_h_e": -1000, "EX_fe2_e": -1000, "EX_fe3_e": -1000, "EX_h2o_e": -1000, "EX_pi_e": -1000, "EX_glc__D_e": -1000, "EX_so4_e": -1000, "EX_k_e": -1000, "EX_zn2_e": -1000, "EX_mg2_e": -1000, "EX_mn2_e": -1000, "EX_mobd_e": -1000, "EX_nh4_e": -1000, "EX_o2_e": -1000}}
```
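
Because the syntax is rigid, it can be convenient to generate the recipe programmatically and check it before passing it to `gempipe derive`. The following is a minimal sketch using only the Python standard library: `check_medium` is a hypothetical helper (not part of gempipe), the file name is arbitrary, and the check only covers what is stated above (the two keys and numeric lower bounds).

```python
import json
import tempfile
from numbers import Real
from pathlib import Path


def check_medium(recipe):
    """Check the `name`/`exchanges` keys and that every lower bound is a number."""
    missing = {"name", "exchanges"} - set(recipe)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for exchange_id, lower_bound in recipe["exchanges"].items():
        if isinstance(lower_bound, bool) or not isinstance(lower_bound, Real):
            raise ValueError(f"{exchange_id}: lower bound must be a number")
    return recipe


# A subset of the components listed in the example above.
medium = {
    "name": "my_medium_name",
    "exchanges": {
        "EX_glc__D_e": -1000,
        "EX_nh4_e": -1000,
        "EX_pi_e": -1000,
        "EX_so4_e": -1000,
        "EX_o2_e": -1000,
    },
}

# Write the recipe to a json file, then read it back and validate it.
with tempfile.TemporaryDirectory() as tmpdir:
    medium_path = Path(tmpdir) / "my_medium.json"  # arbitrary file name
    medium_path.write_text(json.dumps(check_medium(medium)))
    reloaded = check_medium(json.loads(medium_path.read_text()))

print(reloaded["name"], len(reloaded["exchanges"]))
```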



## Biolog® PM simulations

[Biolog® PM](https://www.biolog.com/products/metabolic-characterization-microplates/microbial-phenotype/) growth assays are commonly used to drive manual curation and to validate a GSMM. Using the `--biolog` option, you can automatically simulate all the main [Biolog® PM plates](https://www.biolog.com/wp-content/uploads/2024/06/00A-042-Rev-E-Phenotype-MicroArrays-1-10-Plate-Maps.pdf) (such as PM1, PM2A, PM3, PM4, and so on) for each modeled strain. Alternatively, the same function can be called interactively via the [gempipe API](autoapi/gempipe/curate/index) (see the function [gempipe.biolog_preview](https://gempipe.readthedocs.io/en/latest/autoapi/gempipe/curate/gaps/index.html#gempipe.curate.gaps.biolog_preview)).

With this feature, the substrates composing the Biolog® PM plates are iterated and tested for utilization (see below for an example output). The same substrate (`substrate`) can be considered as a C, N, P, or S source (`source`) according to the plate of origin; in these cases, a separate growth simulation is performed for each nutritive source. For example, if the starting C source is glucose (`exr_before`), then it is first removed and an FBA is performed. Then the new C source (`exr_after`) is introduced, if it is present in the GSMM (`exr_after_present`), and the FBA is performed again. If the objective value increases, then the strain is assumed to be able to use the substrate (`growth=True`). For more information please read the [gempipe paper](how_to_cite.ipynb).
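
The substitution logic described above can be sketched in a few lines. This is an illustration only, not the gempipe code: `check_substrate_use` and `toy_simulate` are hypothetical names, and the "FBA" is replaced by a stand-in function that returns a made-up objective value from the set of open exchange reactions.

```python
def check_substrate_use(simulate, medium, exr_before, exr_after,
                        model_exchanges, uptake=-1000):
    """Swap one nutrient source for another and compare the objective values."""
    # 1) Remove the starting source and simulate.
    without_source = {ex: lb for ex, lb in medium.items() if ex != exr_before}
    value_before = simulate(without_source)

    # 2) The new source can be tested only if its exchange exists in the model.
    if exr_after not in model_exchanges:
        return {"exr_after_present": False, "growth": None, "value_after": None}

    # 3) Introduce the new source and simulate again.
    with_new_source = dict(without_source)
    with_new_source[exr_after] = uptake
    value_after = simulate(with_new_source)

    # 4) Growth is assumed if the objective value increased.
    return {"exr_after_present": True,
            "growth": value_after > value_before,
            "value_after": value_after}


def toy_simulate(medium):
    """Stand-in for an FBA: needs NH4 plus one usable C source (made-up values)."""
    carbon_yield = {"EX_glc__D_e": 20.0, "EX_glcr_e": 15.0}
    best = max((y for ex, y in carbon_yield.items() if ex in medium), default=0.0)
    return best if "EX_nh4_e" in medium else 0.0


model_exchanges = {"EX_glc__D_e", "EX_glcr_e", "EX_gal_e", "EX_nh4_e"}
medium = {"EX_glc__D_e": -1000, "EX_nh4_e": -1000}

results = {
    exr_after: check_substrate_use(toy_simulate, medium, "EX_glc__D_e",
                                   exr_after, model_exchanges)
    for exr_after in ["EX_glcr_e", "EX_gal_e", "EX_arab__L_e"]
}
for exr_after, result in results.items():
    print(exr_after, result)
```

In this toy setting the first substrate restores growth, the second is present in the model but does not support growth, and the third cannot be tested because its exchange reaction is absent.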

In [8]:
import pandas as pnd

biolog_example = pnd.read_csv('tutoring_materials/K27.csv', index_col=0)
biolog_example

| | substrate | source | formula | exr_before | exr_before_present | exr_after | exr_after_present | growth | value | status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | L-Arabinose | C | | EX_glc__D_e | True | EX_arab__L_e | False | | | |
| 1 | D-Saccharic Acid | C | C6H8O8 | EX_glc__D_e | True | EX_glcr_e | True | True | 19.728400 | optimal |
| 2 | D-Galactose | C | C6H12O6 | EX_glc__D_e | True | EX_gal_e | True | False | 0.000000 | optimal |
| 3 | L-Aspartic Acid | N | C4H6NO4 | EX_nh4_e | True | EX_asp__L_e | True | True | 28.743739 | optimal |
| 4 | L-Aspartic Acid | C | C4H6NO4 | EX_glc__D_e | True | EX_asp__L_e | True | True | 17.614535 | optimal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646 | Thiourea | S | | EX_so4_e | True | | False | | | |
| 647 | D-Serine | N | C3H7NO3 | EX_nh4_e | True | EX_ser__D_e | True | True | 34.731331 | optimal |
| 648 | D-Serine | C | C3H7NO3 | EX_glc__D_e | True | EX_ser__D_e | True | True | 33.295776 | optimal |
| 649 | Succinic Acid | C | C4H4O4 | EX_glc__D_e | True | EX_succ_e | True | True | 17.473351 | optimal |
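
A table like this one is easy to summarize, for example by counting how many substrates are predicted to be used for each kind of source. The sketch below uses only the standard library and a handful of rows copied from the output shown above (with only the columns needed here), so the counts refer to this excerpt and not to the full table; the same tally could be computed on the whole file with pandas.

```python
import csv
import io
from collections import Counter

# Excerpt of the rows displayed above (subset of the columns).
excerpt = """substrate,source,exr_after_present,growth
L-Arabinose,C,False,
D-Saccharic Acid,C,True,True
D-Galactose,C,True,False
L-Aspartic Acid,N,True,True
L-Aspartic Acid,C,True,True
Thiourea,S,False,
D-Serine,N,True,True
D-Serine,C,True,True
Succinic Acid,C,True,True
"""

rows = list(csv.DictReader(io.StringIO(excerpt)))

# Substrates predicted to be used, counted per nutritive source.
used = Counter(row["source"] for row in rows if row["growth"] == "True")

# Substrates that could not be simulated (exchange reaction not in the GSMM).
not_simulated = [row["substrate"] for row in rows
                 if row["exr_after_present"] == "False"]

print(dict(used))
print(not_simulated)
```

Substrates listed in `not_simulated` are natural starting points for manual curation, since the model has no exchange reaction for them.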


## Deriving species-specific models

A species-specific GSMM is here defined as a GSMM composed of the reactions that are present in **all** the strain-specific GSMMs of the species. After gap-filling the strain-specific GSMMs, one species-specific GSMM is derived for each of the species given as input to `gempipe recon`. Having a representative GSMM for each species enables you to perform **comparisons** at the species level, detecting for example which pathways distinguish one species from the others.
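
According to this definition, the reaction content of a species-specific GSMM is the intersection of the reaction sets of its strains. The toy sketch below illustrates just that set operation: the strain names, reaction identifiers and the helper `derive_species_reactions` are hypothetical, and the real procedure operates on full models rather than on plain sets.

```python
def derive_species_reactions(strain_reactions, strain_to_species):
    """Intersect the reaction sets of the strains belonging to each species."""
    species_reactions = {}
    for strain, reactions in strain_reactions.items():
        species = strain_to_species[strain]
        if species not in species_reactions:
            species_reactions[species] = set(reactions)
        else:
            species_reactions[species] &= set(reactions)
    return species_reactions


# Toy reaction content of four gap-filled strain-specific models.
strain_reactions = {
    "strain1": {"PGI", "PFK", "LACZ"},
    "strain2": {"PGI", "PFK"},
    "strain3": {"PGI", "XYLI1"},
    "strain4": {"PGI", "XYLI1", "LACZ"},
}
# Toy strain-to-species relationships (as in the report of `gempipe recon`).
strain_to_species = {
    "strain1": "species A",
    "strain2": "species A",
    "strain3": "species B",
    "strain4": "species B",
}

species_reactions = derive_species_reactions(strain_reactions, strain_to_species)
for species, reactions in species_reactions.items():
    print(species, sorted(reactions))

# Reactions distinguishing species A from species B in this toy case.
print(sorted(species_reactions["species A"] - species_reactions["species B"]))
```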



## Output files

`gempipe derive` produces two **main** sets of output files in the output directory (`-o`/`--outdir`, by default the current directory):

* **strain_models_gf/*.json.** The gap-filled strain-specific GSMMs.

* **species_model/*.json.** The species-specific GSMMs.

Other useful files are the following:

* **strain_models/*.json.** The strain-specific GSMMs, before gap-filling. 

* **derive_strains.csv.** Table showing metrics for each strain-specific GSMM, such as the number of reactions, the objective value and the solver status, before and after the gap-filling step. 

* **derive_species.csv.** Table showing metrics for each species-specific GSMM, such as the number of reactions, the objective value and the solver status. One row for each species given as input to `gempipe recon`.
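
After a run, a quick way to check what was produced is to list the files described above. This is a minimal sketch with the standard library only: `list_derive_outputs` is a hypothetical helper, and it assumes that the output directory is laid out exactly as in the lists above.

```python
from pathlib import Path


def list_derive_outputs(outdir="./"):
    """Collect the output files of `gempipe derive` found under `outdir`."""
    outdir = Path(outdir)
    tables = [outdir / "derive_strains.csv", outdir / "derive_species.csv"]
    return {
        "strain_models_gf": sorted(outdir.glob("strain_models_gf/*.json")),
        "species_model": sorted(outdir.glob("species_model/*.json")),
        "strain_models": sorted(outdir.glob("strain_models/*.json")),
        "tables": [path for path in tables if path.exists()],
    }


outputs = list_derive_outputs("./")
for group, paths in outputs.items():
    print(f"{group}: {len(paths)} file(s)")
```

Comparing the number of files in `strain_models` and `strain_models_gf` is a simple sanity check that every strain went through the gap-filling step.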