# Introduction

**This demo shows how to run a distributed protein folding simulation using GROMACS within a Bittensor subnet.**

In this subnet:
- Validators select a protein (pdb_id), download the structure and prepare input files
- Miners run the simulation and send back their results
- Scoring is based on free energy of the folded structure
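
As a rough illustration of the scoring idea, the sketch below turns reported energies into rewards, with lower energy scoring higher. The function name `rank_by_energy`, the hotkeys, and the energy values are all hypothetical. The subnet's actual reward comes from `protein.reward`, which may work differently.

```python
def rank_by_energy(energies: dict) -> dict:
    """Map each miner hotkey to a reward in [0, 1]; the lowest energy gets 1.0.

    Hypothetical helper for illustration only, not the subnet's reward function.
    """
    lo, hi = min(energies.values()), max(energies.values())
    if hi == lo:
        # All miners reported the same energy, so they share the top reward
        return {hotkey: 1.0 for hotkey in energies}
    # Linear rescaling: best (lowest) energy -> 1.0, worst (highest) -> 0.0
    return {hotkey: (hi - e) / (hi - lo) for hotkey, e in energies.items()}


# Made-up energies, in arbitrary units
rewards = rank_by_energy({'miner_a': -1200.0, 'miner_b': -1000.0, 'miner_c': -1100.0})
print(rewards)
```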

In [None]:
import sys
import bittensor as bt

from typing import Tuple
from folding.protocol import FoldingSynapse
from folding.miners.forward import forward
from folding.validators.protein import Protein

bt.trace()

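def memory(files: dict):
    """Print the approximate in-memory size of each file's contents and the total, in KB."""
    total = 0
    for k, v in files.items():
        # sys.getsizeof includes Python object overhead, so sizes are approximate
        size_kb = sys.getsizeof(v)/1024
        total += size_kb
        print(f'file {k!r}: {size_kb:.2f} KB')
    print('------')
    print(f'Total: {total:.2f} KB')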

### The Protein class contains the protein sequence and the current state of the protein folding simulation.

In [None]:
protein = Protein(max_steps=50)
protein

### The validator is currently responsible for preparing the protein for the simulation.

In [None]:
protein.md_inputs

In [None]:
memory(protein.md_inputs)


# Simulation using only Synapse

In [None]:
synapse = FoldingSynapse(pdb_id=protein.pdb_id, md_inputs=protein.md_inputs)#, mdrun_args='-maxh 0.01')
synapse

### Simulate the miner receiving the synapse and performing the MD simulation.

In [None]:
forward(synapse)

### Simulation results are attached to the synapse

In [None]:
synapse.deserialize()

In [None]:
memory(synapse.md_output)

### Perform reward calculation for the miner

In [None]:
protein.gro_path

In [None]:
protein.gro_hash(protein.gro_path)

In [None]:
protein.gro_hash('/Users/steffencruz/Desktop/py/opentensor/folding/data/1UBQ/dendrite/test/md_0_1.gro')


In [None]:
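import re

# Example .gro atom line: residue number and name, atom name, atom number, coordinates
example = '    1MET      N    1   4.502'
pattern = re.compile(r'\s*(\d+\w+)\s+(\w+\d*\s*\d+)\s+(\-?\d+\.\d+)+')

def gro_content(gro_path, begin=0, end=-1):
    """Concatenate the residue and atom identifiers of the atom lines in a .gro file.

    Note: with the default end=-1 the final atom line is excluded from the slice.
    """
    bt.logging.info(f'Reading content of {gro_path!r}')
    with open(gro_path, 'rb') as f:
        # First line is the title, second the atom count, last the box vectors
        name, length, *lines, _ = f.readlines()
    name = name.decode().strip()
    length = int(length)
    bt.logging.info(f'{name=}, {length=}, {len(lines)=}')
    parts = []
    # start=begin keeps the reported atom line number correct when begin > 0
    for i, line in enumerate(lines[begin:end], start=begin):
        line = line.decode().strip()
        match = pattern.match(line)
        if not match:
            raise ValueError(f'Error parsing atom line {i+1} in {gro_path!r}: {line!r}')
        parts.append(match.group(1) + match.group(2).replace(' ', ''))
    return name + ''.join(parts)

# Compare the validator's reference file with a miner's output file
end = 500000
ref = gro_content(protein.gro_path, begin=0, end=end)
pred = gro_content('/Users/steffencruz/Desktop/py/opentensor/folding/data/1UBQ/dendrite/test/md_0_1.gro', begin=0, end=end)
bt.logging.success(ref)
bt.logging.success(pred)
bt.logging.success(ref==pred)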

In [None]:
pattern.match('312SOL    HW1 1938  -0.044   0.387  -0.016\n')
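
An alternative to the regular expression is to use the fixed column widths of the GROMACS `.gro` format: 5 characters each for residue number, residue name, atom name and atom number, followed by three 8-character coordinates. The sketch below assumes the line has not been stripped, since the columns are positional. `GroAtom` and `parse_gro_atom` are hypothetical names and are not part of the folding package.

```python
from typing import NamedTuple, Tuple


class GroAtom(NamedTuple):
    resnum: int
    resname: str
    atomname: str
    atomnum: int
    xyz: Tuple[float, float, float]


def parse_gro_atom(line: str) -> GroAtom:
    """Parse one atom line of a .gro file by column position (velocities are ignored)."""
    return GroAtom(
        resnum=int(line[0:5]),
        resname=line[5:10].strip(),
        atomname=line[10:15].strip(),
        atomnum=int(line[15:20]),
        # Three coordinates, 8 characters each, starting at column 20
        xyz=tuple(float(line[20 + 8 * i: 28 + 8 * i]) for i in range(3)),
    )


# Build a correctly padded line with the .gro column layout
line = '%5d%-5s%5s%5d%8.3f%8.3f%8.3f' % (312, 'SOL', 'HW1', 1938, -0.044, 0.387, -0.016)
atom = parse_gro_atom(line)
print(atom)
```

Positional parsing avoids ambiguity when an atom name and atom number run together without whitespace, which can happen for large systems.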

In [None]:
reward = protein.reward(synapse.md_output, hotkey='test')

reward

# Outlook
We have demonstrated that a molecular dynamics simulation can be carried out in the context of a subnet.

- We only have a single pdb_id (1UBQ) right now, but we can easily extend this to a list of pdb_ids. Let's establish a way to access all PDB entries that are eligible for simulation. NOTE: given a pdb_id, we can already download the files from a database; we just need a list of eligible pdb_ids. This can be a static list kept in a file, or it can also come from a database.
- Use a GROMACS Python API, if it makes sense to do so (https://gromacs-py.readthedocs.io/en/latest/notebook/00_basic_example.html, https://github.com/Becksteinlab/GromacsWrapper). Making sense means that we either get much cleaner code, or we can do something that we cannot do with the command line tool.
- We need to actually fold some proteins and understand the expected results: is this code stable, does it produce the expected results, and how should we interpret them?
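
The static-list option from the first point could look roughly like the sketch below. The file layout (one pdb_id per line, `#` for comments) and the helper names `load_pdb_ids` and `select_pdb_id` are assumptions made for illustration. Apart from 1UBQ, the ids in the demo file are placeholders whose eligibility has not been established.

```python
import random
import tempfile
from pathlib import Path


def load_pdb_ids(path) -> list:
    """Read one pdb_id per line, ignoring blank lines and '#' comments."""
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.split('#', 1)[0].strip()
        if line:
            ids.append(line.upper())
    return ids


def select_pdb_id(ids: list, seed=None) -> str:
    """Pick one pdb_id at random; pass a seed for reproducible selection."""
    return random.Random(seed).choice(ids)


# Demo with a temporary file standing in for the real list
with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / 'eligible_pdb_ids.txt'
    list_path.write_text('# eligible proteins\n1ubq\n\n1crn  # placeholder entry\n')
    ids = load_pdb_ids(list_path)

print(ids, select_pdb_id(ids, seed=0))
```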

# Validation Flow
We need to understand how large this problem space is so that we do not exhaust all of the proteins too quickly and effectively kill the PoW component.
Key points are:
1. How many proteins are eligible for simulation?
2. Is there a principled way to modify initial conditions to create new protein challenges?
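
One candidate for the second point is to vary the random seed used to generate initial velocities, which GROMACS exposes through the `gen_vel`, `gen_temp` and `gen_seed` options of the `.mdp` file. The sketch below only renders those lines. The helper name `velocity_overrides` and the 300 K default are assumptions, and whether different seeds make sufficiently distinct challenges is still an open question.

```python
import random


def velocity_overrides(seed: int, temperature: float = 300.0) -> str:
    """Render .mdp lines that regenerate initial velocities from a given seed."""
    return '\n'.join([
        'gen_vel  = yes',
        f'gen_temp = {temperature:g}',
        f'gen_seed = {seed}',
    ]) + '\n'


# A validator could draw a fresh seed per challenge (fixed here for reproducibility)
seed = random.Random(0).randrange(1, 2**31 - 1)
snippet = velocity_overrides(seed)
print(snippet)
```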

# Reward Mechanism
We need to benchmark the reward mechanism in many ways before we can deploy this on mainnet. We need to understand the expected results in terms of miner rewards, competitiveness, dependency on hardware, etc. Main points are:
1. How long does it take to fold a protein? How many simulations can a miner do in a day?
2. How much does it cost to fold a protein? How much does it cost to run a simulation?
3. How busy will validators be? How many simulations can a validator do in a day?
4. What is the expected reward for a miner? How does this depend on the hardware?
5. How does the reward depend on the protein? How does the reward depend on the simulation parameters?
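
The first two questions reduce to simple arithmetic once a simulation has been timed. The sketch below shows the calculation. The 600 second runtime and the hourly cost of 1.2 are made-up placeholders, not measurements, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulations_per_day(seconds_per_simulation: float, parallel_jobs: int = 1) -> int:
    """Number of complete simulations that fit in one day."""
    if seconds_per_simulation <= 0:
        raise ValueError('seconds_per_simulation must be positive')
    return int(SECONDS_PER_DAY * parallel_jobs // seconds_per_simulation)


def cost_per_simulation(hourly_cost: float, seconds_per_simulation: float) -> float:
    """Hardware cost of a single simulation, in the same currency as hourly_cost."""
    return hourly_cost * seconds_per_simulation / 3600


# Placeholder inputs: replace with measured values from real runs
per_day = simulations_per_day(600)
cost = cost_per_simulation(1.2, 600)
print(per_day, round(cost, 3))
```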



## Remaining Steps
- Run on testnet [DONE]
- Run on mainnet


## Opportunities for Improvements
- Improved customization of input files (e.g. force field, box, mdp templates)
- Performance optimization (file usage, simulation length, parallelization)
- Allow for different miners (e.g. AI models versus GPU models versus CPU models)
- Perturbation of the structure (e.g. mutation) to prevent lookup attacks
- More complex scoring function (e.g. based on RMSD)
- More complex simulation (e.g. folding of a protein with multiple chains)
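
For the RMSD-based scoring idea, a minimal sketch of the distance measure is shown below. It assumes both structures list the same atoms in the same order and performs no superposition, whereas a real comparison would usually align the structures first. The function name `rmsd` and the coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError('coordinate lists must be non-empty and of equal length')
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


# Two atoms; only the second atom moves, by 2 units along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(reference, predicted)
print(value)
```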

