# BOFdat step3

## Finding species-specific metabolic end goals

In [1]:
from BOFdat import step3
import pandas as pd

## Example using the *E. coli* genome-scale model *i*ML1515 and the biomass objective function generated in BOFdat step2

Metabolic end goals may vary considerably from one species to another. For instance, the peptidoglycan present on the surface of *E. coli* may differ slightly in another gram-negative bacterium and more markedly in a gram-positive one. Such differences are obvious for components that define the cell's visible phenotype, but what should one do when the metabolites do not visibly change the cell's phenotype? Identifying these seemingly invisible yet crucial species-specific metabolites is the goal of BOFdat Step3.

Using a genetic algorithm, a metaheuristic, BOFdat searches for the biomass composition that best replicates gene essentiality.

### Steps

1. Generate an initial population

2. Generate optimal biomass compositions

3. Cluster the metabolic end goals

4. Determine stoichiometric coefficients

5. Update


### 1. Generate an initial population

In BOFdat step3, a genetic algorithm is implemented to define the optimal biomass composition. The individuals subjected to the evolution are indexed boolean lists in which each position corresponds to a defined metabolite. The genetic algorithm requires an initial population of these individuals. To generate it, BOFdat first screens the entire model for solvable metabolites.
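To make the encoding concrete, here is a minimal sketch of an indexed boolean individual, written in plain Python rather than with BOFdat itself. The metabolite identifiers, the helper names `random_individual` and `decode`, and the fixed number of selected metabolites are all assumptions made for illustration; BOFdat's own population generation may work differently.

```python
import random

def random_individual(n_metabolites, n_selected, rng):
    """Return a boolean list with exactly n_selected positions set to True."""
    chosen = set(rng.sample(range(n_metabolites), n_selected))
    return [i in chosen for i in range(n_metabolites)]

def decode(individual, metabolite_index):
    """Map a boolean individual back to the metabolites it selects."""
    return [met for met, selected in zip(metabolite_index, individual) if selected]

# Hypothetical metabolite identifiers, for illustration only
metabolite_index = ["atp_c", "nad_c", "coa_c", "fad_c", "thf_c", "pydx5p_c"]

rng = random.Random(0)  # seeded so the sketch is reproducible
population = [random_individual(len(metabolite_index), 3, rng) for _ in range(4)]

for individual in population:
    print(individual, decode(individual, metabolite_index))
```

Each position in the list refers to the same metabolite in every individual, which is what lets the genetic operators recombine individuals position by position.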


In [None]:
step3.generate_initial_population()

### 2. Generate optimal biomass compositions

Each individual in the population defines a different biomass composition that can be evaluated for fitness. Here, the evaluation function is the quality of the gene essentiality prediction. Ideally, the constraints applied by the biomass objective function on the model force flux through certain reactions, as happens *in vivo*. In other words, metabolites that the cell needs to produce in order to double should be represented in the biomass objective function. The *in silico* knock-out of a gene then cuts the flux through the corresponding reaction and the cell is not able to grow, replicating *in vivo* gene essentiality.

A single-gene knock-out is executed for each biomass composition generated. Both the *in vivo* and the *in silico* essentiality results are converted into boolean vectors of essential and non-essential genes. The agreement between the two vectors can then be measured with a standard distance metric (Hamming) or with the Matthews Correlation Coefficient (MCC), and this measure is used as the fitness. The MCC is maximized throughout the evolution, selecting the biomass compositions that yield the highest values. The standard genetic operators, mutation and cross-over, are applied to the individuals to create genetic diversity.
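The two fitness measures mentioned above can be sketched in a few lines of plain Python. This is an illustration of the metrics themselves, not BOFdat's implementation; the function names and the two example vectors are made up, with `True` standing for an essential gene.

```python
import math

def confusion_counts(predicted, observed):
    """Count true/false positives and negatives between two boolean vectors."""
    pairs = list(zip(predicted, observed))
    tp = sum(p and o for p, o in pairs)
    tn = sum((not p) and (not o) for p, o in pairs)
    fp = sum(p and (not o) for p, o in pairs)
    fn = sum((not p) and o for p, o in pairs)
    return tp, tn, fp, fn

def mcc(predicted, observed):
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    tp, tn, fp, fn = confusion_counts(predicted, observed)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

def hamming(predicted, observed):
    """Number of genes whose predicted and observed essentiality disagree."""
    return sum(p != o for p, o in zip(predicted, observed))

# Made-up essentiality vectors for six genes (True = essential)
observed = [True, True, False, False, True, False]    # in vivo
predicted = [True, False, False, False, True, True]   # in silico

print("MCC:", mcc(predicted, observed))
print("Hamming distance:", hamming(predicted, observed))
```

Note the opposite directions: a better prediction has a higher MCC but a lower Hamming distance, so a genetic algorithm would maximize the former and minimize the latter.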

In [None]:
step3.find_metabolites()
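For readers unfamiliar with the genetic operators mentioned in this step, the following sketch shows one common form of each, applied to boolean individuals: one-point cross-over and bit-flip mutation. These specific variants, the function names, and the mutation rate are assumptions chosen for illustration; the operators that BOFdat actually uses may differ.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length individuals at a random cut point."""
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def flip_mutation(individual, rate, rng):
    """Flip each position independently with probability `rate`."""
    return [(not gene) if rng.random() < rate else gene for gene in individual]

rng = random.Random(42)  # seeded so the sketch is reproducible
parent_a = [True, True, False, False, False, True]
parent_b = [False, True, True, False, True, False]

child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
mutant = flip_mutation(child_a, rate=0.1, rng=rng)

print(child_a, child_b, mutant)
```

Cross-over recombines metabolites already present in the population, while mutation can introduce metabolites that no current individual selects.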

### 3. Cluster the metabolic end goals

The output of a genetic algorithm may vary from one initial population to another. To get a sense of what would be an optimal result, it is generally advised to run more than one evolution and aggregate the results. BOFdat does so by clustering the metabolites that appeared most frequently in the optimal result of every evolution. The distance matrix for every metabolite in the network is generated and reduced to the most frequent metabolites. Hierarchical clustering is then applied to the reduced distance matrix in order to generate clusters of metabolic end goals that can be used for curation.
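The aggregation described above can be sketched as two steps: count how often each metabolite appears across the best results of several evolutions, then cluster the most frequent ones from a distance matrix. The sketch below uses plain Python with a simple single-linkage agglomerative clustering cut at a distance threshold, as a stand-in for a full hierarchical clustering routine. The evolution results, the pairwise distances, and the threshold are all hypothetical, and BOFdat's own clustering may be configured differently.

```python
from collections import Counter

def most_frequent(results, top_n):
    """Return the top_n metabolites by number of appearances across results."""
    counts = Counter(met for result in results for met in result)
    return [met for met, _ in counts.most_common(top_n)]

def single_linkage(items, distance, threshold):
    """Merge the closest clusters until the smallest gap exceeds threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gap = min(distance[frozenset((a, b))]
                          for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        gap, i, j = best
        if gap > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical best results from three independent evolutions
results = [
    ["atp_c", "nad_c", "coa_c", "thf_c"],
    ["atp_c", "nadh_c", "coa_c", "thf_c"],
    ["atp_c", "nad_c", "accoa_c", "fad_c"],
]
frequent = most_frequent(results, top_n=4)

# Hypothetical pairwise distances between the frequent metabolites
distance = {
    frozenset(("atp_c", "nad_c")): 1,
    frozenset(("coa_c", "thf_c")): 2,
    frozenset(("atp_c", "coa_c")): 4,
    frozenset(("atp_c", "thf_c")): 5,
    frozenset(("nad_c", "coa_c")): 5,
    frozenset(("nad_c", "thf_c")): 4,
}
clusters = single_linkage(frequent, distance, threshold=2)
print(clusters)
```

Each resulting cluster groups metabolites that are close to one another, so a curator can inspect one cluster at a time and decide which metabolic end goal it represents.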

In [None]:
step3.cluster_metabolites()

### 4. Determine stoichiometric coefficients

In [None]:
step3.determine_coefficients()

### 5. Update