# Introduction

gempipe is a tool for drafting, curating and analyzing pan and multi-strain reconstructions of genome-scale metabolic models (GSMMs or GEMs).

## In brief

gempipe can start from genomes or, if a reliable annotation is available, directly from proteomes. Genomes are filtered for quality using both technical and biological metrics. Genes are then annotated and grouped into clusters, and an extensive gene-recovery procedure is applied to counteract possible errors introduced during genome assembly or gene calling.

Gene clusters are used to build a reference-free reconstruction based on the [CarveMe semi-curated universe](https://github.com/cdanielmachado/carveme). Different rules are applied to generate the GPRs (gene-protein-reaction associations), accounting for alternative isoforms and respecting the original enzyme complex definitions stored in [BiGG](http://bigg.ucsd.edu).
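To make the idea of a GPR concrete, here is a minimal sketch of how a boolean rule can be evaluated against the gene clusters found in a strain. In the usual convention, `or` joins alternative isoforms and `and` joins the subunits of an enzyme complex. The function name `gpr_is_satisfied` and the `Cluster_N` identifiers are hypothetical and chosen only for illustration; this is not the gempipe implementation.

```python
import ast


def gpr_is_satisfied(rule: str, present: set) -> bool:
    """Evaluate a boolean GPR rule against a set of present gene clusters.

    Assumption: cluster identifiers are valid Python names, so the rule
    can be parsed with the standard-library `ast` module.
    """

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BoolOp):
            values = [visit(value) for value in node.values]
            # `and` = all subunits of a complex; `or` = any isoform.
            if isinstance(node.op, ast.And):
                return all(values)
            return any(values)
        if isinstance(node, ast.Name):
            return node.id in present
        raise ValueError(f"Unsupported element in GPR rule: {rule!r}")

    return visit(ast.parse(rule, mode="eval"))


# Hypothetical rule: a two-subunit complex, or a single-gene isoform.
rule = "(Cluster_1 and Cluster_2) or Cluster_3"
complex_incomplete = gpr_is_satisfied(rule, {"Cluster_1"})
isoform_present = gpr_is_satisfied(rule, {"Cluster_3"})
```

With this rule, a strain carrying only `Cluster_1` does not satisfy the GPR because the complex is incomplete, while a strain carrying only `Cluster_3` does.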

The reference-free reconstruction is used as a source of new reactions for the expansion of an **optional** user-provided **reference**, thus capturing strain-specific metabolism that falls outside the scope of the reference itself. This expansion respects the design decisions of the reference in terms of metabolite formulas and charges, and of reaction balance.

The resulting draft pan-GSMM is then annotated _de novo_ with accessions from many databases, and duplicate metabolites and reactions are optionally removed.

Although a fully automated reconstruction mode ([`gempipe autopilot`](gempipe_autopilot.ipynb)) is provided, gempipe, unlike tools such as [CarveMe](https://carveme.readthedocs.io/en/latest/index.html) or [gapseq](https://gapseq.readthedocs.io/en/latest/), strongly encourages manual curation. To facilitate [manual curation](part_2_manual_curation.ipynb), gempipe provides an application programming interface ([API](autoapi/gempipe/interface/index)) with dedicated functions.

Once the pan-GSMM is finalized, it is used to derive a strain-specific GSMM for each input genome or proteome, exploiting the gene-cluster information and, where growth media are defined by the user, ensuring biomass production on them. At this point, auxotrophies and growth-enabling substrates can be predicted, and [Biolog® screenings](https://www.biolog.com/products/metabolic-characterization-microplates/microbial-phenotype/) can be simulated.
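The role of the gene-cluster information in this derivation can be illustrated with a deliberately simplified sketch. It assumes a presence/absence matrix stored as a nested dictionary and a pan-model in which each reaction is linked to a set of alternative clusters, any one of which is sufficient. All names and data here (`pam`, `pan_reactions`, `derive_strain_reactions`) are hypothetical; the actual procedure also handles full GPRs and growth on the defined media.

```python
# Hypothetical presence/absence matrix: cluster -> strain -> 1 (present) or 0.
pam = {
    "Cluster_1": {"strain_A": 1, "strain_B": 1},
    "Cluster_2": {"strain_A": 1, "strain_B": 0},
    "Cluster_3": {"strain_A": 0, "strain_B": 1},
}

# Hypothetical pan-model: reaction -> alternative clusters (any one suffices).
pan_reactions = {
    "R1": {"Cluster_1"},
    "R2": {"Cluster_2"},
    "R3": {"Cluster_2", "Cluster_3"},
}


def derive_strain_reactions(pam, pan_reactions):
    """Return, for each strain, the sorted pan-model reactions it retains."""
    strains = sorted({strain for row in pam.values() for strain in row})
    derived = {}
    for strain in strains:
        # Clusters marked as present for this strain in the matrix.
        present = {cluster for cluster, row in pam.items() if row.get(strain)}
        # Keep a reaction if at least one of its clusters is present.
        derived[strain] = sorted(
            reaction
            for reaction, clusters in pan_reactions.items()
            if clusters & present
        )
    return derived


strain_reactions = derive_strain_reactions(pam, pan_reactions)
```

In this toy example `strain_A` keeps all three reactions, while `strain_B` loses `R2` because its only supporting cluster is absent.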

Finally, specific functions of the gempipe [API](autoapi/gempipe/interface/index) can be used to analyze the deck of strain-specific GSMMs: phylometabolic trees can be created, strains can be divided into homogeneous metabolic groups, discriminative metabolic features can be extracted, the core metabolism of a species can be identified, etc.
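As a rough illustration of two of these analyses, the sketch below computes a core reaction set and pairwise distances between strains from their reaction content, using only the Python standard library. The strain data are invented, and the Jaccard distance is an assumption made for this example: the gempipe API functions may rely on different metrics and inputs.

```python
from itertools import combinations

# Hypothetical reaction content of three strain-specific models.
strain_reactions = {
    "strain_A": {"R1", "R2", "R3"},
    "strain_B": {"R1", "R3"},
    "strain_C": {"R1", "R4"},
}

# Core metabolism: reactions shared by every strain.
core = set.intersection(*strain_reactions.values())


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


# Pairwise distances, a possible input for clustering or tree building.
distances = {
    (s, t): jaccard_distance(strain_reactions[s], strain_reactions[t])
    for s, t in combinations(sorted(strain_reactions), 2)
}
```

A distance matrix of this kind is the sort of input from which strains could be grouped or a tree could be drawn; reactions absent from `core` are candidates for discriminative features.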

## Components and workflow

gempipe is composed of three command-line programs and an API. The gempipe workflow is divided into four parts:

* [**Part 1.**](part_1_gempipe_recon.ipynb) Creation of the **draft pan-GSMM** and the presence/absence matrix (**PAM**), starting either from genomes or proteomes (command line program [`gempipe recon`](part_1_gempipe_recon.ipynb)).
* [**Part 2.**](part_2_manual_curation.ipynb) Manual curation of the draft pan-GSMM, for example using functions provided by the [gempipe API](autoapi/gempipe/interface/index).
* [**Part 3.**](part_3_gempipe_derive.ipynb) Derivation of strain-specific GSMMs, starting from the PAM and the curated pan-GSMM (command line program [`gempipe derive`](part_3_gempipe_derive.ipynb)). 
* [**Part 4.**](part_4_multi_strain_analysis.ipynb) Analysis of the deck of strain-specific GSMMs, for example using functions provided by the [gempipe API](autoapi/gempipe/interface/index).

As a (_discouraged_) alternative to manual curation, the additional command-line program [`gempipe autopilot`](gempipe_autopilot.ipynb) is provided. It internally calls [`gempipe recon`](part_1_gempipe_recon.ipynb) and [`gempipe derive`](part_3_gempipe_derive.ipynb), linking them by performing an automated gap-filling on the draft pan-GSMM.

The **interactive** flowchart of gempipe is shown below. It can be **zoomed** and **panned** to see the details. Some nodes are **clickable** and point to the corresponding section of the documentation.

In [1]:
%load_ext autoreload
%aimport gempipe, gempipe.flowchart
%autoreload 1

In [8]:
from pathlib import Path

from gempipe import Flowchart

# Read the four flowchart fragments, one per workflow part, in order.
parts = [
    Path(f'flowcharts/part_{i}.flowchart').read_text()
    for i in range(1, 5)
]

# Prepend the left-to-right layout header and render the combined flowchart.
header = 'flowchart LR \n'
flowchart = Flowchart(header + ''.join(parts))
flowchart.render(height=300, zoom=2)