# Empirical Dynamic Modeling with Machine Learning

**Advanced Computational Systems**  
_Kevin Siswandi_  
Biological Information Processing Group  

This notebook walks you through the process of discovering dynamics from data. To do this, we use a synthetic dataset of virtual strains that was generated to mimic strains of engineered *E. coli* for Limonene production. The general outline is:
1. Data Analysis
2. Creation of training data + data augmentation
3. Pipeline building and model training
4. Error analysis (derivative and integrated dynamics)

Prerequisites:
- install data science and scientific computing stack: `pandas`, `numpy`, and `scipy`
- download the dataset `limonene_train.csv` and `limonene_test.csv`
- install visualization tools: `seaborn` and `matplotlib`
- install machine learning libraries: `scikit-learn` (imported as `sklearn`) and `tpot`

Parts that you need to complete are marked with `#TODO`. For more information regarding the dataset and methods, see:
* [Weaver et al., 2015](https://pubmed.ncbi.nlm.nih.gov/24981116/) in Wiley Biotechnology and Bioengineering
* [Costello & Martin, 2018](https://www.nature.com/articles/s41540-018-0054-3) in npj Systems Biology and Applications

In [1]:
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter
import seaborn as sns
import matplotlib.pyplot as plt

## Data Analysis

The first step is to load the data and take a look at some of its basic properties. Answer the following questions:

**Questions**:
1. How many test and training strains are there in the dataset?
2. For each strain, how many time points are there?
3. How many variables are there in the dataset?
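The questions above can be answered with a few pandas calls. The following is a minimal sketch on a made-up toy frame, not on the real CSV files; it assumes only that the data has a `Strain` column and an `Hour` column, as the rest of the notebook does.

```python
import pandas as pd

# Toy stand-in for the real CSV; only the 'Strain' and 'Hour' column names follow the notebook.
toy = pd.DataFrame({
    "Strain":   [1, 1, 1, 2, 2, 2],
    "Hour":     [0, 24, 48, 0, 24, 48],
    "AtoB":     [0.1, 0.4, 0.5, 0.2, 0.3, 0.6],
    "Limonene": [0.0, 0.8, 1.5, 0.0, 0.5, 1.1],
})

# Number of distinct strains
n_strains = toy["Strain"].nunique()

# Number of time points recorded for each strain
points_per_strain = toy.groupby("Strain")["Hour"].count()

# Number of measured variables (everything except the identifier and time columns)
n_variables = toy.drop(columns=["Strain", "Hour"]).shape[1]

print(n_strains, points_per_strain.tolist(), n_variables)
```

The same calls should carry over to `train_data` and `test_data` once the files are loaded.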

In [3]:
# TODO: change the path to where you save the dataset
train_data = pd.read_csv('data/limonene_train.csv')
test_data = pd.read_csv('data/limonene_test.csv')

# TODO: exploratory data analysis
# how many strains are there in the training data and test data
# how many time points are there for each strain

## YOUR CODE HERE



# display the first 5 rows
train_data.head()


In the following, we will investigate the distribution of the data by generating some plots. 

**Question**: Do the proteins/enzymes share the same distribution? To answer this question, plot the distribution of every enzyme with seaborn, for example using `histplot(..., kde=True)` (the older `distplot(...)` has been deprecated since seaborn 0.11).

In [2]:
# list of metabolites as targets
targets = ['Acetyl-CoA','Acetoacetyl-CoA','HMG-CoA','Mev','MevP','MevPP','IPP','DMAPP','GPP','Limonene']

# list of enzymes as features
features = ['AtoB','HMGR','HMGS','MK','PMK','PMD','Idi','GPPS','LS']

# TODO: plot the distribution of every protein

## YOUR CODE HERE
for feature in features:
    print(feature)
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(train_data[feature], kde=True)
    plt.title(feature)
    plt.show()


In the dataset, all metabolites start from the same initial concentration (0.2), except Limonene, which starts at zero. This setup is typical in bioengineering, where the objective is to produce Limonene as the useful product (e.g. as a biofuel). The metabolite concentrations are obtained by solving a system of ordinary differential equations based on Michaelis-Menten kinetics, using the initial conditions above.

**Task**: create a trace plot of every metabolite using `lineplot` from seaborn (the replacement for the older `tsplot`, which is no longer part of seaborn).
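A trace plot with uncertainty shows, for each time point, an average across strains together with a spread around it. As a rough sketch on made-up numbers (the values below are not from the dataset), the same quantities can be computed directly with a group-by:

```python
import pandas as pd

# Toy measurements for two hypothetical strains at three time points.
toy = pd.DataFrame({
    "Strain":   [1, 1, 1, 2, 2, 2],
    "Hour":     [0, 24, 48, 0, 24, 48],
    "Limonene": [0.0, 1.0, 2.0, 0.0, 3.0, 4.0],
})

# Mean and sample standard deviation across strains at every time point
trace = toy.groupby("Hour")["Limonene"].agg(["mean", "std"])
print(trace)
```

Passing the long-format frame to `sns.lineplot(data=toy, x="Hour", y="Limonene")` should produce the corresponding plot, since seaborn aggregates repeated `x` values and draws an uncertainty band around the mean by default.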


In [4]:
# TODO: plot the distribution or time-series (with uncertainty) of the metabolites

## YOUR CODE HERE
species = train_data.columns[2:]
print(species)
## YOUR CODE HERE

Index(['Acetoacetyl-CoA', 'Acetyl-CoA', 'AtoB', 'DMAPP', 'GPP', 'GPPS',
       'HMG-CoA', 'HMGR', 'HMGS', 'IPP', 'Idi', 'LS', 'Limonene', 'MK', 'Mev',
       'MevP', 'MevPP', 'PMD', 'PMK'],
      dtype='object')


Dynamic modeling:
$$ \dot{x} = f(x) $$
Given $f$, we solve by numerical integration to get $x(t)$. Here, we want to learn $f(x)$ directly from data. To do this:
1. Create training data by numerically estimating the derivative $\dot{x}$, which serves as the output/target, with $x$ as the features.
2. Train a machine learning model using $x$ as features and the computed derivatives as targets.
3. Solve the initial value problem using a numerical integration scheme.

Here, we learn the dynamics $f(x)$ from data using machine learning.
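To make the three steps concrete, here is a minimal sketch on a one-dimensional toy system, $\dot{x} = -0.5x$. The toy system, the choice of `LinearRegression`, and the use of `solve_ivp` are illustrative assumptions, not the setup used later in this notebook.

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.linear_model import LinearRegression

# Synthetic trajectory of dx/dt = -0.5 * x, standing in for measured data.
t = np.linspace(0.0, 5.0, 200)
x = 2.0 * np.exp(-0.5 * t)

# Step 1: features are the states, targets are numerically estimated derivatives.
x_dot = np.gradient(x, t)

# Step 2: fit a regressor that maps x to dx/dt.
model = LinearRegression().fit(x.reshape(-1, 1), x_dot)

# Step 3: integrate the learned right-hand side from the initial condition.
def learned_rhs(_t, state):
    return model.predict(state.reshape(1, -1))

solution = solve_ivp(learned_rhs, (t[0], t[-1]), [x[0]], t_eval=t)
max_abs_error = np.max(np.abs(solution.y[0] - x))
print(f"max abs error: {max_abs_error:.4f}")
```

The learned slope should come out close to $-0.5$, and the integrated trajectory should stay close to the original data.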

Subsetting a data frame by position (the first `index` rows):
```
df_subset = df.iloc[0:index]
```

## Data Augmentation

As the first step, we will transform the data that you've just looked at into a form suitable for training machine learning models to learn dynamics. In the cell below, you will implement a function that generates such a dataset. The function also artificially creates additional data points for training (i.e. data augmentation). It consists of the following steps:
1. interpolating the noisy time-series data onto a finer time grid
2. smoothing the interpolated measurements with a Savitzky-Golay filter
3. computing the derivatives numerically at the interpolated points

Steps 1-2 also act as the data augmentation procedure that provides sufficient training data, while step 3 creates the target/output for the machine learning model. **Your task** is to complete the `generate_dataset(...)` function below, taking into account the following:
* computation of the time step for interpolation -- to generate 100 data points between start and end time, what is the time step needed between every time point?
* use `np.linspace` to generate the (e.g. 100, more generally `n_dim`) data points between start time and end time.
* apply `savgol_filter` to the interpolated points
* compute the gradients of the metabolites numerically using `np.gradient`.
* create a multi-index dataframe -- filtered data as features and computed gradients as targets

In [5]:
## TODO: write a function to create and augment the training data

def generate_dataset(data, strain_list, feature_list, target_list, n_dim):
    
    """
    Generate and augment the training data {X, y} for model fitting, using savgol filter as the smoothing method.
    
    Arguments:
    
    data -- time-series data frame of measurements, with 'Strain' as the index
    strain_list -- list of unique strains in `data`
    feature_list -- list of features to be used
    target_list -- list of targets
    n_dim -- number of data points to generate via interpolation
    
    Returns:
    ml_data -- a pandas multi-index dataframe containing features x and targets y.
    
    """
    
    ml_data = pd.DataFrame()
    
    for strain in strain_list:
        measurement_data = {}

        # Interpolate -> Filter -> Add to the table
        for measurement in feature_list + target_list:

            # extract measurement for the specific strain
            measurement_series = data.loc[strain][measurement]
            T = data.loc[strain]['Hour'] # series of time points
            
            ## TODO: extract the start time and end time and the time step
            minT, maxT = None, None # start time and end time
            delT = None # time step for interpolation
        
            # Interpolate data
            interpolation = interp1d(T,
                                     measurement_series.tolist(),
                                     kind='linear')
            
            # TODO: generate time points to interpolate over using np.linspace
            time_points = None
            
            # Consider the interpolated data over time
            interpolated_measurement = interpolation(time_points)
            
            # TODO: apply savgol filter to interpolated measurement, using window length of 7 and polyorder of 2
            filtered_measurement = None

            # TODO: fill in the data to a multi-index data frame
            if measurement in feature_list:
                # use the filtered measurement of this enzyme as features
                measurement_data[('feature',measurement)] = None # YOUR CODE HERE
            if measurement in target_list:
                # use the filtered measurement of this metabolite as a feature
                measurement_data[('feature',measurement)] = None # YOUR CODE HERE
                # additionally compute gradients of the filtered measurement and use it as target
                measurement_data[('target',measurement)] = None # YOUR CODE HERE
   
        # Create a table
        strain_data = pd.DataFrame(measurement_data,
                                   index=pd.MultiIndex.from_product([[strain],np.linspace(minT,maxT,n_dim)],
                                   names=['Strain', 'Time']))
        ml_data = pd.concat([ml_data,strain_data])
        
    return ml_data


In our case, we assume that the system is described by the autonomous ordinary differential equation:

$$ \dot{m} = f(m, p)$$

where $m$ is the vector of metabolite concentrations and $p$ is the vector of protein concentrations. Our goal is to train a model that learns the dynamics $f$, instead of constructing a system of ODEs from knowledge of the pathway mechanisms. This is done by training a machine learning model with the metabolite and protein concentrations as features and the derivatives as targets. To do this, we will apply the `generate_dataset(...)` function above to create training data consisting of pairs of features and targets (derivatives).

In [6]:
# make sure that the dataframe is indexed by the strain
train_data = train_data.set_index('Strain')
test_data = test_data.set_index("Strain")

To test if your implementation is correct, run the function `generate_dataset(...)` that you have written above using the dataset provided. Note that you need to extract the strain list (both for training and test data) from the original data frame.

**HINT**: every strain is identified with a number in the 'Strain' column.

In [7]:
# TODO: extract training and test strains
te_strains = None
tr_strains = None

# TODO: apply the function above to create the training and test data
# choose an appropriate number of data points to generate (recommended: 200)


In [8]:
## uncomment to check the generated training dataset

# display(ml_train)
# display(ml_test)

## Manual Model Building

By now, you have read in the dataset, performed some EDA, and created suitable training data. Now, let's see how we can learn dynamics from the data. The easiest way is to train a TPOT model, which automates the whole process (see below). However, let's first build a manual pipeline and use TPOT afterwards. We will use three representative classes of machine learning models:
1. Random Forest
2. Neural Network
3. Linear Regression

For every model, there would be slightly different preprocessing steps needed (this will be automated by TPOT later). When building a manual pipeline, you can use `Pipeline(...)` from `sklearn`. You need to complete the following.
+ create a random forest regressor with a sufficient number of estimators (e.g. 20)
+ create a pipeline consisting of:
    - standard scaling
    - a linear regressor that is bagged to improve fit. Use `BaggingRegressor(...)` with ridge regression as the base estimator.
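+ create a pipeline consisting of:
    - standard scaling
    - a neural network (multi-layer perceptron) regressor. Use `MLPRegressor(...)` with 4 hidden layers (each of size 5), adam solver, tanh activation, and adaptive learning rate. Note that scikit-learn only applies the `learning_rate` schedule when `solver='sgd'`, so this setting has no effect with adam.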
    
If some of the terms above are unfamiliar to you, check out the following readings:
* [standard scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) -- centering mean to zero and scaling to unit variance
* [bagging](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html): bootstrap aggregation -- training multiple estimators on bootstrap resamples of the training set and aggregating their predictions afterwards
* [ridge regression](https://en.wikipedia.org/wiki/Tikhonov_regularization) -- linear regression with L2 regularization
* [Adam optimization](https://arxiv.org/abs/1412.6980)
* [adaptive learning rate](https://wiki.tum.de/display/lfdv/Adaptive+Learning+Rate+Method)
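As a rough sketch of how the three model types described above could be assembled, the snippet below builds them and checks two of them on a small synthetic regression problem. The step names, `n_estimators=10` for bagging, `max_iter`, and `random_state` are illustrative assumptions rather than values prescribed by this notebook.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rf_sketch = RandomForestRegressor(n_estimators=20, random_state=0)

lr_sketch = Pipeline([
    ("scale", StandardScaler()),
    # The base estimator is passed positionally because its keyword name
    # differs between scikit-learn versions.
    ("bagged_ridge", BaggingRegressor(Ridge(), n_estimators=10, random_state=0)),
])

nn_sketch = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(5, 5, 5, 5), solver="adam",
                         activation="tanh", learning_rate="adaptive",
                         max_iter=2000, random_state=0)),
])

# Tiny synthetic regression problem to confirm that the models fit and score.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
scores = {name: est.fit(X, y).score(X, y)
          for name, est in [("rf", rf_sketch), ("lr", lr_sketch)]}
print(scores)
```

Scaling sits inside the pipeline so that it is re-fit on each training fold during cross-validation; the random forest is left unscaled because tree splits do not depend on feature scale.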

In [9]:
# the features are the multi-dimensional time-series concentrations
# the target is the derivative of the dynamics

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# TODO: create a random forest model with 20 estimators
rf_model = None

# TODO: create a pipeline consisting of: standard scaling of the data + bagged ridge regression
lr_model = None

# TODO: create a pipeline consisting of: standard scaling of the data + MLP regressor
nn_model = None

Each model is trained using inputs as follows:
- the features are the multi-dimensional time-series concentrations
- the target is the derivative of the dynamics

Here, the target variables are the derivatives of the metabolites in the reaction. We are going to implement this so that each target variable is trained with a distinct model. To achieve this, it is important to copy the model passed in to the function with the `clone` function from `sklearn.base`. In the next cell, **your task** will be to create a function that will:
* perform the training process for the models that we specify
* return two dictionaries:
    - a dictionary containing a trained model for every target variable
    - a dictionary containing the cross-validation score for each target
* optionally save the cross-validation plot to file

**Tips**: When executing `train_data(...)` with `plot=True`, make sure to specify a figure path and create the directory if it doesn't exist. Otherwise, saving the figure raises an error, which is costly because the training takes a long time to run. Also note that defining a function called `train_data` rebinds the name that previously held the training data frame, so keep a copy of the data frame under another name if you need it later.
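The role of `clone` can be seen in a small sketch: each target gets its own unfitted copy of a template estimator, so fitting one target does not overwrite the model of another. The toy frame and the choice of `LinearRegression` are assumptions for illustration; only the `('feature', name)` / `('target', name)` column layout follows the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# Toy frame with the ('feature', name) / ('target', name) column layout.
rng = np.random.default_rng(0)
feat = rng.normal(size=(50, 2))
toy = pd.DataFrame({
    ("feature", "A"): feat[:, 0],
    ("feature", "B"): feat[:, 1],
    ("target", "A"): 2.0 * feat[:, 0],
    ("target", "B"): feat[:, 0] - feat[:, 1],
})

template = LinearRegression()
X = toy["feature"].values
models, scores = {}, {}
for name in toy["target"].columns:
    y = toy[("target", name)].values
    models[name] = clone(template).fit(X, y)  # independent copy per target
    scores[name] = models[name].score(X, y)   # R^2 on the training data
print(scores)
```

Without `clone`, every dictionary entry would point to the same estimator object, and only the last fitted target would survive.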

In [10]:
## TODO: write a function that performs training
from sklearn.base import clone
from sklearn.model_selection import ShuffleSplit

figure_path = './plots/' # make sure that this directory exists!

def train_data(data,model,plot=False,model_type=None):
    
    """
    Train the input data {X, y}.
    
    Arguments:
    
    data -- multi-index dataframe of time-series measurements, preprocessed by interpolating and filtering
    model -- a selected machine learning model
    plot -- decide to plot the result or not
    model_type -- determine the input model
    
    Returns:
    model_dict -- a trained model dictionary for each target
    score_dict -- a training score dictionary for each target
    
    """
            
    model_dict = {}
    score_dict = {}

    avg_score = 0
    n = 0

    for target_idx in data.columns:
    
        # All we want to train are targets
        if target_idx[0] == 'feature':
            continue
        target = target_idx[1]
        
        # TODO: create the data matrix X and the target vector y
        X = None # YOUR CODE HERE
        y = None # YOUR CODE HERE
        
        if model_type == 'tpot':
            X = np.array(X)
            y = np.array(y)
        
        # TODO: train the model
        # IMPORTANT: clone the model to train a different one for each target
        if model_type == 'tpot':
            # if TPOT, use the best pipeline found
            model_dict[target] = None # YOUR CODE HERE
        else:
            # if RF/NN/LR, simply fit X and y
            model_dict[target] = None # YOUR CODE HERE
        
        # Plot results, if required
        if plot:
            
            # The training_plot function is defined below for you to complete
            CV_plot = training_plot(model_dict[target],
                                    target,X,y,
                                    cv=ShuffleSplit())
            
            axis = plt.gca()
            axis.set_ylim([-0.1, 1.1])
            
            strip_target = ''.join([char for char in target if char != '/'])
            print(strip_target)
            
            CV_plot.savefig(figure_path + strip_target + '_' + model_type + '_CV_plot.pdf',transparent=False)
            
            plt.show()
    
        # TODO: evaluate the model score
        # Every model in the sklearn API has its own default scoring metric (see the respective docs), which can be easily accessed via the score method
        score = None ## YOUR CODE HERE
            
        print('Target: {}, R2 score: {:f}'.format(target,score))
        score_dict[target] = score
    
    # TODO: compute the average score over all targets
    avg_score = None #YOUR CODE HERE
    print('Average training score:', avg_score)
    
    return model_dict,score_dict

In [11]:
# TODO: complete the function to plot training curves below
from sklearn.model_selection import learning_curve

def training_plot(estimator,title,X,y,
                  cv=None,n_jobs=1, 
                  train_sizes=np.linspace(.1, 1.0, 5)):
    
    """
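    Generate a learning-curve plot for the training process.

    Arguments:
    
    estimator -- a machine learning model.
    title -- a title for the chart.
    X -- array of features.
    y -- target array corresponding to X.
    cv -- a cross-validation generator.
    n_jobs -- the number of jobs to run in parallel.
    train_sizes -- relative or absolute training-set sizes used to generate the learning curve.
    
    Returns:
    plt -- the matplotlib.pyplot module holding the generated figure.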
    
    """
    
    plt.figure()
    plt.title(title)
        
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    # TODO: call the learning_curve function to get scores for different training sizes
    train_sizes = None
    train_scores = None
    test_scores = None
    # YOUR CODE HERE
    
    # TODO: compute the mean and standard deviation of the scores
    train_scores_mean = None #YOUR CODE HERE
    train_scores_std = None #YOUR CODE HERE
    
    test_scores_mean = None #YOUR CODE HERE
    test_scores_std = None #YOUR CODE HERE
    
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


## Automated Model Building

Now, let's automate the machine learning pipeline with TPOT. TPOT searches for the best pipeline using genetic programming and cross-validation: it automates the most tedious part of machine learning by exploring many possible pipelines to find the best one for the specific dataset:

![An example Machine Learning pipeline](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-ml-pipeline.png)

Once TPOT is finished searching (or you get tired of waiting), it returns the best scikit-learn pipeline it found so you can tinker with the pipeline from there.

![An example TPOT pipeline](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-pipeline-example.png)

All images above are courtesy of [Epistasis Lab](https://github.com/EpistasisLab/tpot). More Hints:
- If you need a config dict, take a look at https://github.com/EpistasisLab/tpot/blob/master/tpot/config/regressor.py
- specify a max time limit of 30 minutes to avoid the algorithm running for too long
- use `ShuffleSplit` from scikit-learn for the cross validation method
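The `ShuffleSplit` hint can be tried out independently of TPOT. The sketch below uses it with scikit-learn's `learning_curve` on synthetic data; the `Ridge` estimator, the split settings, and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic regression data with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=120)

# Five random 80/20 train/validation splits
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Scores for five increasing training-set sizes; rows are sizes, columns are splits
train_sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))

test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)
print(train_sizes, test_mean)
```

The same `cv` object can be handed to any scikit-learn-compatible tool that accepts a cross-validation generator.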

In [13]:
from tpot import TPOTRegressor
from source.tpot_config import tpot_config_dict

# TODO: create a TPOT regressor
tpot_model = None

In [12]:
# TODO: call the train_data function to train a random forest model (may take ~15 mins to compute)

model_type = 'random_forest'
rf_dict= None # YOUR CODE HERE

In [15]:
# TODO: call the train_data function to train a neural network model (may take ~15 mins to compute)
import time

t0 = time.time()

model_type = 'neural_network'
nn_dict = None # YOUR CODE HERE

t1 = time.time()
print("Elapsed time: ", t1-t0)

Elapsed time:  8.0108642578125e-05


In [16]:
# TODO: call the train_data function to train a linear regression model (may take ~15 mins to compute)

import time

t0 = time.time()

model_type = 'linear_regression'
lr_dict = None # YOUR CODE HERE

t1 = time.time()
print("Elapsed time: ", t1-t0)

Elapsed time:  8.320808410644531e-05


In [17]:
# TODO: call the train_data function to train an automated TPOT model
# (warning, this may take several hours! skip this and come back later if needed after you finish the other sections)

import time

t0 = time.time()

model_type = 'tpot'
tpot_dict = None # YOUR CODE HERE

t1 = time.time()
print("Elapsed time: ", t1-t0)

Elapsed time:  7.390975952148438e-05


**Questions**:
1. You may have noticed that random forest performs poorly for predicting Limonene, but shows OK performance for predicting other metabolites. Why do you think this is so?
2. Try to improve the performance of the neural network by doing some hyperparameter tuning (e.g. reducing the number of hidden layers).

In [18]:
# TODO: complete the function to compute RMSE error for every metabolite and the total RMSE
# make sure that the directory specified in figure_path exists! otherwise the execution will return an error.
import math

figure_path = './plots/' # make sure that this directory exists!

def compute_error(data,model_dict,plot=False,model_type=None):
    
    """
    Check the error of the predicted derivatives.
    
    Arguments:
    
    data -- time-series data of measurements, preprocessed by interpolating and filtering
    model_dict -- a dictionary of trained models for each target
    plot -- whether to plot the result
    model_type -- name of the model, used in the saved figure file names
    
    """
    
    # list of errors
    # this contains the error for every metabolite (target)
    error_list = []

    for target in model_dict:
        # Extract input target
        y_test = data[('target',target)].values
    
        # Extract predicted target
        feature_list = [('feature',feature) for feature in data['feature'].columns]
        target_data = data[feature_list]
        y_prediction = model_dict[target].predict(target_data.values)
    
        # TODO: Compute squared error and append it to the list of errors
        ## YOUR CODE HERE
        error = None
        error_list.append(error)
        
        # Compute mean and standard deviation of squared error
        ## YOUR CODE HERE
        mu = None
        sigma = None
        print(target,'RMSE:',mu,'standard deviation:',sigma)
        
        if plot:
            plt.figure(figsize=(13,4))
            plt.subplot(121)
            sns.histplot(error, kde=True, stat='density')
            
            plt.title(target + ' Derivative '+ 'Error Residual Histogram')
            plt.xlabel('Derivative Residual Error')
            plt.ylabel('Probability Density')
    
            plt.subplot(122)
            error_plot(target,y_prediction,y_test) # this function is provided below
    
            strip_target = ''.join([char for char in target if char != '/'])
            plt.savefig(figure_path + strip_target +'_'+ model_type + '_Error_Residuals.pdf')
            plt.show()

    # TODO: compute total error from the error list
    ## YOUR CODE HERE

    mu = None
    sigma = None
    print('Total Derivative','Mean Error:',mu,'Error Standard Deviation:',sigma)
    

In [19]:
# the function error plot is provided to you below.

def error_plot(name,pred,real):
    
    """
    Generate a scatter plot of predicted versus actual derivatives.

    Arguments:
    
    name -- a name for the title.
    pred -- a list of predicted derivatives
    real -- a list of actual derivatives
    
    """

    plt.scatter(pred,real)
    plt.title(name + ' Predicted vs. Actual')
    
    axis = plt.gca()
    axis.plot([-120,120], [-120,120], ls="--", c=".3")
    
    padding_y = (max(real) - min(real))*0.1
    plt.ylim(min(real)-padding_y,max(real)+padding_y)
    
    padding_x = (max(pred) - min(pred))*0.1
    plt.xlim(min(pred)-padding_x,max(pred)+padding_x)
    
    plt.xlabel('Predicted ' + name)
    plt.ylabel('Actual ' + name)
 

In [20]:
## TODO: call the compute error function for the random forest model,
# using the test data and also training (in-sample) data

#compute_error(ml_test,rf_dict,plot=True,model_type='random_forest')

In [21]:
## TODO: call the compute error function for the neural network model,
# the linear regression model
# using the test data and also training (in-sample) data

#compute_error(ml_test,lr_dict,plot=True,model_type='linear_regression')

In [22]:
## TODO: call the compute error function for the tpot model
# using the test data and also training (in-sample) data

#compute_error(ml_test,tpot_dict,plot=True,model_type='tpot')

Next, we are going to use this trained model to make predictions. To do that, we will solve the initial value problem by numerical integration. The function for solving the IVP is provided for you. The general idea is:
1. Create the derivative function from the learned model
2. solve the IVP with the derivative function and initial conditions

Below, you will find two functions:
1. `int_ode(...)` for solving the initial value problem
2. `ml_ode(...)` for defining the derivative dynamics learned by machine learning (representing the 'ODE equation'). This function will be passed to the `int_ode` function to be integrated.

**Your task** is to apply the provided functions `ml_ode(...)` and `int_ode(...)` to integrate the dynamics. In our setting, the function that describes the dynamics of the metabolites, $\dot{m} = f(m,p)$, takes the protein concentrations as given.
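The idea of taking the protein concentrations as given can be sketched with a one-metabolite toy problem: the protein profile $p(t)$ is interpolated from data and fed into the right-hand side while the metabolite is integrated. The right-hand side $\dot{m} = p(t) - m$ below is a hand-written stand-in for a learned model, chosen because it has an analytic solution; it is not the dynamics used in this notebook.

```python
import numpy as np
from scipy.integrate import ode
from scipy.interpolate import interp1d

# Hypothetical protein profile p(t), held constant here for an analytic check.
hours = np.linspace(0.0, 5.0, 6)
protein = interp1d(hours, np.ones_like(hours), fill_value="extrapolate")

def g(m, t):
    # Stand-in for the learned model: dm/dt = p(t) - m
    return protein(t) - m

# scipy's ode expects f(t, y), so the argument order is swapped as in int_ode
solver = ode(lambda t, m: g(m, t)).set_integrator("dopri5", atol=1e-8, rtol=1e-8)
solver.set_initial_value([0.0], hours[0])

# Step through the requested time points, collecting the state at each one
trajectory = [np.array([0.0])]
for next_t in hours[1:]:
    trajectory.append(solver.integrate(next_t))
trajectory = np.array(trajectory).ravel()

exact = 1.0 - np.exp(-hours)  # analytic solution for p(t) = 1, m(0) = 0
print(np.max(np.abs(trajectory - exact)))
```

With a trained model, `g` would call the model's `predict` method on the concatenated protein and metabolite values instead of the closed-form expression.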

In [23]:
# you are provided with the function to integrate below
from scipy.integrate import ode

def int_ode(g,y0,times,solver='scipy'):
    
    """
    Integrate the ODE generated by ml_ode.
    
    Arguments:
    g -- the right-hand side of the ODE to be integrated, called as g(x, t)
    y0 -- the initial condition as a list of concentrations
    times -- a list of time coordinates
    solver -- name of the package used as the ODE solver ('scipy' or 'assimulo')
    
    Returns:
    x -- the solution of the ODE problem at the requested times
    
    """
    
    if solver == 'assimulo':
        from assimulo.problem import Explicit_Problem
        from assimulo.solvers import Dopri5
        
        # Set up ODE
        rhs = lambda t,x: g(x,t)
        model = Explicit_Problem(rhs,y0,min(times))
        sim = Dopri5(model)
        
        # Perform integration
        _,x = sim.simulate(max(times),max(times))
        return np.array(x)[np.array(times).astype(int)].tolist()
    
    elif solver == 'scipy':
        # Set up ODE
        f = lambda t,x: g(x,t)
        r = ode(f).set_integrator('dopri5',
                                  nsteps=1e4,
                                  atol=1e-3)
    
        r.set_initial_value(y0,times[0])
    
        #widgets.FloatProgress(min=0, max=max(times))
    
        # Perform integration
        x = [y0,]
        currentT = times[0]
        max_delT = 10
    
        for nextT in times[1:]:
        
            while r.t < nextT:
            
                if nextT-currentT < max_delT:
                    dt = nextT-currentT
                else:
                    dt = max_delT
                
                value = r.integrate(r.t + dt)
                currentT = r.t

                f.value = currentT
            
            x.append(value)
        return x

In [24]:
# you are provided with the function to define the derivatives below

def ml_ode(model_dict, data, targets, features, time_index='Hour'):
    
    """
    Set up an ODE from the trained models.
    
    Arguments:
    model_dict -- a dictionary of trained models for each target
    data -- raw time-series data of measurements
    targets -- list of targets
    features -- list of features
    time_index -- label of the time index in the input data
    
    Returns:
    f -- the ODE right-hand side, called as f(x, t)
    
    """
    
    # Create interpolations for each feature
    ml_interpolation = {}
    
    for feature in data.columns:
        feature_columns = feature
        
        if isinstance(feature,tuple):
            if feature[0]=='feature':
                feature = feature[1]
            else:
                continue

        if feature in features:    
            X,y = data.reset_index()[time_index].tolist(), data[feature_columns].tolist()

            ml_interpolation[feature] = interp1d(X,y)
            
    # Define the function to be integrated
    def f(x,t):
        x_dot = []
        
        # Create derivatives for each target
        for target in targets:
            x_pred = []
            
            # loop over all species
            for feature in data.columns:
                if isinstance(feature,tuple):
                    if feature[0]=='feature':
                        feature = feature[1]
                    else:
                        continue
                
                if feature in features:
                    x_pred = np.append(x_pred, ml_interpolation[feature](t))
                elif feature in targets:
                    x_pred = np.append(x_pred, x[targets.index(feature)])
                
            model_prediction = model_dict[target].predict(x_pred.reshape(1,-1))
            x_dot = np.append(x_dot,model_prediction)   
            
        return x_dot
    return f

In [25]:
# TODO: write a function to integrate the dynamics and predict time points
import random
from scipy.integrate import quad
figure_path = './plots/'

def predict_integrate(ts_data,tr_data,model_dict,targets,features,pathway,
              plot=False,model_type=None,solver='scipy'):
    
    """
    Integrate the learned 'ODE' and use it for simulations
    
    Arguments:
    
    tr_data -- raw time-series data of measurements used for training
    ts_data -- raw time-series data of measurements used for testing
    model_dict -- a dictionary of trained models for each target
    pathway -- a selected pathway
    targets -- list of targets
    features -- list of features
    plot -- decide to plot the result or not
    model_type -- determine the input model
    solver -- string of package used for ode solver
    
    """
    
    rmse_average = []
    rmse_percent = []
    
    ts = ts_data 
    
    # Pick one strain at random
    strains = ts.index.get_level_values(0).unique().tolist()
    strain = random.sample(strains,1)
    
    test_data = ts.loc[strain]
        
    # TODO: get the initial conditions from test_data
    y0 = None
    
    # TODO: call ml_ode function to construct the 'ODE'
    g = None

    # Get the time points
    times = test_data.reset_index()['Hour'].tolist()
        
    # TODO: call int_ode to integrate the 'ODE' g
    fit = None
        
    # Format the output as a table
    fit_data = pd.DataFrame(fit, 
                            index=times, 
                            columns = targets).rename_axis('Hour')
    
    # Set up real data and predicted targets
    real = test_data[targets]
    pred = fit_data
        
    # Display them
    print('Real data:')
    display(real)
    print('Predicted data:')
    display(pred)
        
        
    for metabolite in fit_data.columns:
        t,X = times, real[metabolite].tolist()
        real_fcn = interp1d(t,X)
        pred_fcn = interp1d(times,pred[metabolite])
            
        '''
        Optional 
        times =  real[metabolite].dropna().index.tolist()
        real_fcn = interp1d(times,real[metabolite].dropna())
        pred_fcn = interp1d(times,pred[metabolite].loc[times])
        '''
            
        # Calculate RMSE average
        integrand = lambda t: (real_fcn(t) - pred_fcn(t))**2
        rmse = math.sqrt(quad(integrand,min(times),max(times),limit=200)[0])
        rmse_average.append(rmse)
            
        # Calculate RMSE percentage
        percent_integrand = lambda t: abs(real_fcn(t) - pred_fcn(t))/(real_fcn(t)*max(times))
        rmsep = math.sqrt(quad(percent_integrand,min(times),max(times),limit=200)[0])
        rmse_percent.append(rmsep)
        
        print('ML Fit:',metabolite,rmse,
              'RMSE percentage:',rmsep*100)
    
    print('ML model aggregate error')
    print('Average RMSE:',sum(rmse_average)/len(rmse_average))
    print('Total percentage error:',sum(rmse_percent)/len(rmse_percent)*100)
        
    if plot:
        tr = tr_data
        fitT = list(map(list, zip(*fit)))
        
        # Create interpolation functions for each feature
        interp_f = {}
            
        for feature in test_data.columns:
            t,X = test_data.reset_index()['Hour'].tolist(), test_data[feature].tolist()
            interp_f[feature] = interp1d(t,X)
        
        plt.figure(figsize=(12,8))
        
        common_targets = ['Acetyl-CoA','Acetoacetyl-CoA','HMG-CoA','Mev','IPP', 'Limonene']
        for i,target in enumerate(common_targets):
            plt.subplot(2,3,i+1)
            
            for strain in tr_strains:
                strain_interp_f = {}
                strain_df = tr.loc[strain]
                
                X,y = strain_df.reset_index()['Hour'].tolist(), strain_df[target].tolist()
                strain_interp_f[target] = interp1d(X,y)
                
                actual_data = [strain_interp_f[target](t) for t in times]
                
                train_line, = plt.plot(times,actual_data,'r--')
                    
            actual_data = [interp_f[target](t) for t in times]
            
            pos_pred = [max(fitT[i][j],0) for j,t in enumerate(times)]
            prediction_line, = plt.plot(times,pos_pred)
            
            test_line, = plt.plot(times,actual_data,'g--')
            
            plt.ylabel(target)
            plt.xlabel('Time [h]')
            plt.xlim([0,72])
    
        if pathway=='isopentenol':
            product = 'Isopentenol'
        elif pathway=='limonene':
            product = 'Limonene'
    
        plt.tight_layout()
        plt.subplots_adjust(top=0.90)
        plt.subplots_adjust(bottom=0.12)
        plt.suptitle('Prediction of ' + product + ' Strain Dynamics', fontsize=18)
        plt.figlegend((train_line,test_line,prediction_line), 
                      ('Training Set Data','Test Data','Machine Learning Model Prediction'), 
                      loc = 'lower center', ncol=5, labelspacing=0. ) 
            
        plt.savefig(figure_path + product + model_type +'_prediction.eps', format='eps', dpi=600)
        plt.show()

In [26]:
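# TODO: call the function predict_integrate to integrate dynamics and make predictions
# NOTE: the second argument must be the training data frame; the name `train_data` was rebound to a function in In [10]
#predict_integrate(test_data,train_data,lr_dict,targets,features,'limonene',plot=True,model_type='linear_regression')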

Congratulations! You have now seen how machine learning can be used for empirical dynamic modeling in systems biology. Next, you can explore other ways to use what you have learned. For example, you can use the learned model to run simulations and explore the metabolomics/proteomics phase space. You can gain further insights, e.g. by performing PCA and visualizing the simulations in the 2-D phase space of the principal components.