# Simple Protein Folding Optimization

This notebook is derived from the [Protein Folding](https://www.gurobi.com/jupyter_models/protein-folding/)
Jupyter notebook example published by Gurobi.

Here, the setting is the same, but we use Rexl rather than Python to formulate the model.
Rexl includes support for multiple MIP solvers, including [Gurobi](https://www.gurobi.com/),
[HiGHS](https://highs.dev/), and [GLPK](https://www.gnu.org/software/glpk/).
This makes it easy to compare their relative performance.

## Problem Description

For background, see the
[source notebook](https://colab.research.google.com/github/Gurobi/modeling-examples/blob/master/protein_folding/protein_folding.ipynb)
published by Gurobi.

We summarize here. A certain protein consists of `Num` amino acids arranged linearly.
A subset of the amino acids are _hydrophobic_. The protein tends to fold so that
the number of _matches_ between _hydrophobic_ amino acids is maximized.

A fold at position `R` (between amino acids `R` and `R+1`) allows positions `P` and `Q` to match
if `R` and `R+1` are halfway between `P` and `Q`, that is, if `R = (P + Q - 1) div 2`.
This requires `Q - P` to be odd; the model below also requires `Q - P >= 3`.
However, if there is any other fold between `P` and `Q`, then they will not be matched.
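
The matching rule can be sketched in Python, the language of the source notebook. The function name `fold_position` is hypothetical and is not part of Rexl or the Gurobi notebook. The sketch assumes the eligibility test used by the model in this notebook: `Q - P` must be odd and at least 3.

```python
def fold_position(p, q):
    """Return the fold position R that lets p and q match, or None.

    Assumes the eligibility rule used by the model in this notebook:
    q - p must be odd and at least 3.
    """
    gap = q - p
    if gap < 3 or gap % 2 != 1:
        return None
    # R and R+1 sit halfway between p and q.
    return (p + q - 1) // 2


print(fold_position(2, 5))   # 3
print(fold_position(5, 12))  # 8
print(fold_position(4, 6))   # None (even gap)
```

The first two calls correspond to matches that appear in the solver output later in this notebook.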

These pictures are taken from the Gurobi notebook. They show a protein in unfolded and folded forms,
with the hydrophobic amino acids shaded.

![chain](https://github.com/Gurobi/modeling-examples/blob/master/protein_folding/chain.PNG?raw=1)

![folding](https://github.com/Gurobi/modeling-examples/blob/master/protein_folding/folding.PNG?raw=1)


## Model Formulation

Rexl supports the concept of a _module_, which contains various kinds of symbols:
* A _parameter_ (`param`) is defined by a formula. The formula can reference other parameters and constants.
* A _constant_ (`const`) is defined by a formula. The formula can reference parameters and other constants.
* A _free variable_ (`var`) is defined by its domain. There are multiple ways to specify a domain.
  The formulas that define the domain can reference only parameters and constants, not variables.
* A _computed variable_ is defined by a formula. The formula can reference any kind of symbol.
  There are multiple kinds of computed variables:
  * A measure (`msr`) is typically used as an objective (to maximize or minimize).
  * A constraint (`con`) is boolean valued. A solution must make every constraint `true`.
  * A let (`let`) is typically for displaying information or as an intermediate result used by other computed variables.

New instances of a module can be constructed with different values for the parameters and free variables.
When a solver is invoked to maximize or minimize a measure, the result (when successful) is a new instance
of the module with the free variable values set.

We now define the `Folding` module. Its parameters are the number of positions in the protein
and the positions of the hydrophobic amino acids. A later section defines an alternate form of the
parameters, where the protein is specified via a text value containing `H` for a hydrophobic
location and `-` for others.

In [1]:
Folding := module {
    // The number of amino acids in the protein.
    param Num := 50;
    // The positions of the hydrophobic amino acids.
    param HydroPhobic := [2,4,5,6,11,12,17,20,21,25,27,28,30,31,33,37,44,46];
    const NumHydroPhobic := HydroPhobic->Count();

    // Construct a table of possible matches, including the matching positions (P and Q)
    // and the fold location (R). Then augment with an assigned index (Id) and with
    // the sequence of bad fold locations (X).
    const PossibleMatches :=
        CrossJoin(P: HydroPhobic, Q: HydroPhobic,
            Q - P >= 3 and (Q - P) mod 2 = 1, // The join predicate.
            { P, Q, R: (P + Q - 1) div 2 })     // The result.
        // Add the index (Id) and bad fold locations (X).
        +>{ Id: #, X: Range(P, Q)->TakeIf(it != R) };

    // The number of possible matches.
    const NumPossibleMatches := PossibleMatches->Count();

    // Define the free variables:
    // * A true/false variable for each possible match.
    // * A true/false variable for each possible fold location.
    var Matches := Tensor.Fill(false, NumPossibleMatches);
    var Folds := Tensor.Fill(false, Num);

    // We want to maximize the number of matches.
    msr NumMatches := Sum(Matches.Values);

    // Construct the constraints.
    // For a match, there must be a fold at R.
    // For a match, there must NOT be a fold at any of the bad fold locations (X).
    con MustFold := PossibleMatches->ForEach(Matches[Id] <= Folds[R]);
    con CantFold := PossibleMatches->ForEach(Matches[Id] + Folds[X] <= 1);

    // Define the table of matches.
    let MatchTable := PossibleMatches->TakeIf(Matches[Id])->{ Id, P, Q, R };
};
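
For readers more familiar with Python, here is a rough sketch of what the `PossibleMatches` table contains. It is not Rexl and not the Gurobi notebook's code. The function name `possible_matches` is hypothetical, and the sketch assumes that `Range(P, Q)` covers `P` up to but not including `Q` and that `#` is a zero-based row index.

```python
HYDROPHOBIC = [2, 4, 5, 6, 11, 12, 17, 20, 21, 25, 27, 28, 30, 31, 33, 37, 44, 46]


def possible_matches(hydrophobic):
    """List candidate matches as dicts with keys Id, P, Q, R, X."""
    table = []
    for p in hydrophobic:
        for q in hydrophobic:
            # Same predicate as the CrossJoin: odd gap of at least 3.
            if q - p >= 3 and (q - p) % 2 == 1:
                r = (p + q - 1) // 2
                # Bad fold locations: every position in [p, q) other than r.
                x = [k for k in range(p, q) if k != r]
                table.append({"Id": len(table), "P": p, "Q": q, "R": r, "X": x})
    return table


matches = possible_matches(HYDROPHOBIC)
print(len(matches))
print(matches[17])
```

Under these assumptions the default parameters give 75 candidate matches, and the `Id` values agree with those shown in the match tables later in this notebook (for example, `Id` 17 is the pair `P = 5`, `Q = 12`).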

In [2]:
Folding.Num;
Folding.NumHydroPhobic;

50
18
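
To make the roles of `MustFold` and `CantFold` concrete, the following Python sketch brute-forces the same objective for a very short protein. This is an illustration only: it enumerates every fold assignment, which is what a MIP solver avoids, and the function name `max_matches` and the tiny example protein are made up for this sketch.

```python
from itertools import product


def max_matches(num, hydrophobic):
    """Brute-force the best match count for a very short protein."""
    # Candidate matches as (R, X) pairs, using the same rule as PossibleMatches.
    candidates = []
    for p in hydrophobic:
        for q in hydrophobic:
            if q - p >= 3 and (q - p) % 2 == 1:
                r = (p + q - 1) // 2
                candidates.append((r, [k for k in range(p, q) if k != r]))

    best = 0
    # Try every assignment of the fold variables (2**num of them).
    for folds in product([False, True], repeat=num):
        # A match can be switched on only if its fold at R is present (MustFold)
        # and none of its bad locations X is folded (CantFold).
        count = sum(
            1 for r, x in candidates
            if folds[r] and not any(folds[k] for k in x)
        )
        best = max(best, count)
    return best


print(max_matches(7, [0, 3, 6]))  # 2: folds at 1 and 4 match (0, 3) and (3, 6)
```

For a fixed set of folds, switching on every match that the constraints allow is optimal, so taking the maximum over fold sets gives the same optimum as the model.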


## Optimize

Now optimize using each of the supported MIP solvers. We run each solver twice to see both cold and warm
solve times. [HiGHS](https://highs.dev/) really shines on this problem: in the runs below, its warm solve
takes about 73ms, versus about 327ms for Gurobi and about 2.7s for GLPK.

If you don't have a Gurobi license, the Gurobi runs will fail (result in `null`). Also note that the first
run of Gurobi includes time to verify the license, which may artificially inflate the time.

In [3]:
#!time
Best1 := Folding->Maximize(NumMatches, "highs");

Solver: HiGHS


Wall time: 135.6336ms

In [4]:
#!time
Best1 := Folding->Maximize(NumMatches, "highs");

Solver: HiGHS


Wall time: 72.9459ms

In [5]:
#!time
Best2 := Folding->Maximize(NumMatches, "gurobi");

Solver: Gurobi


Wall time: 3039.4419ms

In [6]:
#!time
Best2 := Folding->Maximize(NumMatches, "gurobi");

Solver: Gurobi


Wall time: 327.4464ms

In [7]:
#!time
Best3 := Folding->Maximize(NumMatches, "glpk");

Solver: GLPK


Wall time: 2702.3877ms

In [8]:
#!time
Best3 := Folding->Maximize(NumMatches, "glpk");

Solver: GLPK


Wall time: 2687.5423ms
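
The six wall times above can be compared with a few lines of Python. This is only a convenience sketch: the numbers are copied from the cells above, each is a single measurement, and the dictionary layout is an arbitrary choice.

```python
# Wall times in milliseconds, copied from the six timed cells above.
# Each entry is (first run, second run).
wall_ms = {
    "highs": (135.6336, 72.9459),
    "gurobi": (3039.4419, 327.4464),
    "glpk": (2702.3877, 2687.5423),
}

# The solver with the smallest second-run (warm) time.
fastest_warm = min(wall_ms, key=lambda name: wall_ms[name][1])

for name, (cold, warm) in wall_ms.items():
    slowdown = warm / wall_ms[fastest_warm][1]
    print(f"{name:7s} cold/warm = {cold / warm:5.2f}  "
          f"warm vs {fastest_warm} = {slowdown:5.1f}x")
```

The cold/warm ratio for Gurobi is much larger than for the other two solvers, which is consistent with the license check being included in its first run.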

In [9]:
Best1.NumMatches;
Best2.NumMatches;
Best3.NumMatches;

10
10
10


All three solvers find 10 matches. See if the solutions themselves match.

In [10]:
Best1.MatchTable;
Best2.MatchTable;
Best3.MatchTable;

Seq<{i8,i8,i8,i8}>
   0) { Id: 0, P: 2, Q: 5, R: 3 }
   1) { Id: 17, P: 5, Q: 12, R: 8 }
   2) { Id: 23, P: 6, Q: 11, R: 8 }
   3) { Id: 36, P: 12, Q: 17, R: 14 }
   4) { Id: 43, P: 17, Q: 20, R: 18 }
   5) { Id: 48, P: 20, Q: 25, R: 22 }
   6) { Id: 57, P: 25, Q: 28, R: 26 }
   7) { Id: 61, P: 27, Q: 30, R: 28 }
   8) { Id: 68, P: 30, Q: 37, R: 33 }
   9) { Id: 73, P: 37, Q: 44, R: 40 }
Seq<{i8,i8,i8,i8}>
   0) { Id: 0, P: 2, Q: 5, R: 3 }
   1) { Id: 17, P: 5, Q: 12, R: 8 }
   2) { Id: 23, P: 6, Q: 11, R: 8 }
   3) { Id: 36, P: 12, Q: 17, R: 14 }
   4) { Id: 43, P: 17, Q: 20, R: 18 }
   5) { Id: 48, P: 20, Q: 25, R: 22 }
   6) { Id: 57, P: 25, Q: 28, R: 26 }
   7) { Id: 64, P: 28, Q: 31, R: 29 }
   8) { Id: 67, P: 30, Q: 33, R: 31 }
   9) { Id: 74, P: 37, Q: 46, R: 41 }
Seq<{i8,i8,i8,i8}>
   0) { Id: 0, P: 2, Q: 5, R: 3 }
   1) { Id: 17, P: 5, Q: 12, R: 8 }
   2) { Id: 23, P: 6, Q: 11, R: 8 }
   3) { Id: 36, P: 12, Q: 17, R: 14 }
   4) { Id: 43, P: 17, Q: 20

In [11]:
Best1.MatchTable = Best2.MatchTable;
Best1.MatchTable = Best3.MatchTable;
Best2.MatchTable = Best3.MatchTable;

Seq<bool>
   0) true
   1) true
   2) true
   3) true
   4) true
   5) true
   6) true
   7) false
   8) false
   9) false
Seq<bool>
   0) true
   1) true
   2) true
   3) true
   4) true
   5) true
   6) true
   7) true
   8) false
   9) false
Seq<bool>
   0) true
   1) true
   2) true
   3) true
   4) true
   5) true
   6) true
   7) false
   8) true
   9) false
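
The tables agree on their first rows but differ near the end, so the optimal folding is not unique. As a sanity check, the Python sketch below tests whether a match table satisfies both constraints. The function name `is_feasible` is hypothetical, and the sketch assumes that the only folds are those at the `R` positions of the selected matches. The two tables are copied from the first two outputs of cell 10.

```python
def is_feasible(matches):
    """Check a list of (P, Q, R) matches against the two constraints."""
    # Each selected match forces a fold at its R (MustFold).
    folds = {r for _, _, r in matches}
    for p, q, r in matches:
        # No fold may sit at any other location in [p, q) (CantFold).
        if any(k in folds for k in range(p, q) if k != r):
            return False
    return True


best1 = [(2, 5, 3), (5, 12, 8), (6, 11, 8), (12, 17, 14), (17, 20, 18),
         (20, 25, 22), (25, 28, 26), (27, 30, 28), (30, 37, 33), (37, 44, 40)]
best2 = best1[:7] + [(28, 31, 29), (30, 33, 31), (37, 46, 41)]

print(is_feasible(best1), is_feasible(best2))       # True True
print(sum(a == b for a, b in zip(best1, best2)))    # 7 rows in common
```

Both tables are feasible and contain 10 matches, so they are different solutions with the same objective value.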


### Double

Now double the length of the protein and optimize.

In [12]:
Double := Folding=>{ Num: Num * 2, HydroPhobic: HydroPhobic ++ (Num + HydroPhobic) };
Double.Num;

100
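
In Python terms, the new parameter values would look roughly like the sketch below. It assumes that `++` concatenates sequences and that `Num + HydroPhobic` adds `Num` to each position; the variable names are illustrative.

```python
num = 50
hydrophobic = [2, 4, 5, 6, 11, 12, 17, 20, 21, 25, 27, 28, 30, 31, 33, 37, 44, 46]

# Append a shifted copy of the positions, as in the record update above.
double_num = num * 2
double_hydrophobic = hydrophobic + [num + h for h in hydrophobic]

print(double_num, len(double_hydrophobic))  # 100 36
print(double_hydrophobic[18:21])            # [52, 54, 55]
```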


In [13]:
#!time
Best1 := Double->Maximize(NumMatches, "highs");
Best1.NumMatches;

Solver: HiGHS
20


Wall time: 4530.2879ms

In [14]:
#!time
Best2 := Double->Maximize(NumMatches, "gurobi");
Best2.NumMatches;

Solver: Gurobi
20


Wall time: 49301.131ms

In [15]:
// GLPK struggles with this. I haven't let it run to completion.
// #!time
// Best3 := Double->Maximize(NumMatches, "glpk");
// Best3.NumMatches;

## Different Protein Input

Here is a second module formulation that takes a single parameter named `Protein` as a text value
containing `H` and `-` characters. The `Num` and `HydroPhobic` symbols are now constants.

In [16]:
Folding2 := module {
    param Protein := "--H-HHH----HH----H--HH---H-HH-HH-H---H------H-H---";

    // The number of amino acids in the protein.
    const Num := Protein.Len;
    // The positions of the hydrophobic amino acids.
    const HydroPhobic := Range(Num)->TakeIf(Protein[it:*1] = "H");
    const NumHydroPhobic := HydroPhobic->Count();

    // Construct a table of possible matches, including the matching positions (P and Q)
    // and the fold location (R). Then augment with an assigned index (Id) and with
    // the sequence of bad fold locations (X).
    const PossibleMatches :=
        CrossJoin(P: HydroPhobic, Q: HydroPhobic,
            Q - P >= 3 and (Q - P) mod 2 = 1, // The join predicate.
            { P, Q, R: (P + Q - 1) div 2 })     // The result.
        // Add the index (Id) and bad fold locations (X).
        +>{ Id: #, X: Range(P, Q)->TakeIf(it != R) };

    // The number of possible matches.
    const NumPossibleMatches := PossibleMatches->Count();

    // Define the free variables:
    // * A true/false variable for each possible match.
    // * A true/false variable for each possible fold location.
    var Matches := Tensor.Fill(false, NumPossibleMatches);
    var Folds := Tensor.Fill(false, Num);

    // We want to maximize the number of matches.
    msr NumMatches := Sum(Matches.Values);

    // Construct the constraints.
    // For a match, there must be a fold at R.
    // For a match, there must NOT be a fold at any of the bad fold locations (X).
    con MustFold := PossibleMatches->ForEach(Matches[Id] <= Folds[R]);
    con CantFold := PossibleMatches->ForEach(Matches[Id] + Folds[X] <= 1);

    // Define the table of matches.
    let MatchTable := PossibleMatches->TakeIf(Matches[Id])->{ Id, P, Q, R };
};

In [17]:
Folding2.Num;
Folding2.NumHydroPhobic;

50
18
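
The derivation of `Num` and `HydroPhobic` from the text value can be cross-checked with a short Python sketch. This is an equivalent computation, not Rexl. It confirms that the default `Protein` text encodes the same positions as the default `HydroPhobic` parameter of the first module.

```python
protein = "--H-HHH----HH----H--HH---H-HH-HH-H---H------H-H---"

# Zero-based positions of the 'H' characters.
hydrophobic = [i for i, ch in enumerate(protein) if ch == "H"]

print(len(protein), len(hydrophobic))  # 50 18
print(hydrophobic)
```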


In [18]:
#!time
Best := Folding2->Maximize(NumMatches, "highs");
Best.NumMatches;

Solver: HiGHS
10


Wall time: 67.3397ms

### Double

Now double the length of the protein and optimize. Note the simplicity of the new
`Protein` value in terms of the old.

In [19]:
Double := Folding2=>{ Protein: Protein & Protein };
Double.Num;
Double.Protein;

100
--H-HHH----HH----H--HH---H-HH-HH-H---H------H-H-----H-HHH----HH----H--HH---H-HH-HH-H---H------H-H---


In [20]:
#!time
Best := Double->Maximize(NumMatches, "highs");
Best.NumMatches;

Solver: HiGHS
20


Wall time: 4879.3951ms

### Double and Reverse

This time, concatenate the original `Protein` text with its reverse. Does this allow better matching?

In [21]:
Double := Folding2=>{ Protein: Protein & Protein[::-1] };
Double.Num;
Double.Protein;

100
--H-HHH----HH----H--HH---H-HH-HH-H---H------H-H------H-H------H---H-HH-HH-H---HH--H----HH----HHH-H--


In [22]:
#!time
Best := Double->Maximize(NumMatches, "highs");
Best.NumMatches;

Solver: HiGHS
22


Wall time: 4113.9823ms

Appending the reverse allows 22 matches, versus 20 for the plain doubling. Now concatenate in the opposite order.

In [23]:
Double := Folding2=>{ Protein: Protein[::-1] & Protein };
Double.Num;
Double.Protein;

100
---H-H------H---H-HH-HH-H---HH--H----HH----HHH-H----H-HHH----HH----H--HH---H-HH-HH-H---H------H-H---


In [24]:
#!time
Best := Double->Maximize(NumMatches, "highs");
Best.NumMatches;

Solver: HiGHS
22


Wall time: 4119.0905ms