> <p><small><small>Copyright 2020 DeepMind Technologies Limited.</p>
> <p><small><small> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at </p>
> <p><small><small> <a href="https://www.apache.org/licenses/LICENSE-2.0">https://www.apache.org/licenses/LICENSE-2.0</a> </p>
> <p><small><small> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. </p>


**Aim**

This Colab aims to give you a basic intuition for what machine learning is. It does so by walking you through solving a toy protein folding problem.

Protein folding is a very important problem, as proteins are the basic building blocks of everything alive on Earth, including our bodies. Proteins work together like miniature machines to execute many functions in our bodies, from transmitting signals and providing structure to our cells to working as chemical factories and peacekeepers of our bodies. Proteins start their lives as one-dimensional chains of amino acids, but then they “fold” into 3D shapes. It is similar to how in origami you fold a 2D sheet of paper into a 3D crane or a unicorn, but proteins start out even simpler than a 2D sheet: as 1D sequences. Understanding how proteins acquire their 3D structure is one of the most difficult unsolved problems in biology. We get the 1D sequence of a protein from its DNA or RNA, but it is very difficult to predict what shape it will take when it folds into a 3D structure. Scientists are currently using machine learning to get closer to answering this important biological question.

**Disclaimer**


*This is unrelated to DeepMind's AlphaFold algorithm.  If you'd like to know more about AlphaFold, please see [our Blog post](https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery).*

This code is intended for educational purposes, and in the name of readability for a non-technical audience does not always follow best practices for software engineering.

**Links to resources**
- [What is Colab?](https://colab.sandbox.google.com/notebooks/intro.ipynb) If you have never used Colab before, get started here!

# Let's explore proteins and machine learning!






> ## Example of a protein.




Let's see an animation of one of the most important protein machines in our bodies: ATP synthase.

The job of ATP synthase is to make ATP, the energy currency of the cell. You may know that mitochondria are the powerhouse of the cell; ATP synthase is the protein inside them that produces the ATP.

In [None]:
#@title Look how complex this structure is!
from IPython.display import YouTubeVideo
YouTubeVideo('GM9buhWJjlA', width=600, height=400)


> ## Introduction




---
In nature, our cells make proteins bit by bit by building a chain from building blocks called amino acids. Proteins then fold into shape naturally in the crowded, watery insides of the cell. We know the sequence of amino acids in a protein because of the **genetic code**, but we don't always know what its 3D shape is. Why is its shape important? Because a protein's function, what it does in our body, is determined by its shape!
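
To make the idea of the genetic code a little more concrete, here is a minimal sketch of how a stretch of DNA is read three letters at a time to spell out a chain of amino acids. The table below is only a tiny excerpt of the standard genetic code, the helper name `translate` is made up for this illustration, and the DNA snippet is invented rather than taken from a real gene.

```python
# A tiny excerpt of the standard genetic code: three DNA letters (a codon)
# stand for one amino acid. The full table has 64 entries.
CODON_TABLE = {
    'ATG': 'M',  # Methionine (also the usual start signal)
    'GCC': 'A',  # Alanine
    'AAA': 'K',  # Lysine
    'TGG': 'W',  # Tryptophan
    'GAA': 'E',  # Glutamate
    'TAA': '*',  # Stop signal
}


def translate(dna):
  """Reads DNA three letters at a time and returns the amino acid chain."""
  protein = ''
  for start in range(0, len(dna) - 2, 3):
    amino_acid = CODON_TABLE[dna[start:start + 3]]
    if amino_acid == '*':  # A stop codon ends the chain.
      break
    protein += amino_acid
  return protein


# A made-up DNA snippet, not a real gene.
print(translate('ATGGCCAAATGGGAATAA'))  # MAKWE
```

Notice that the result is still just a 1D string of letters. Nothing in it tells us directly which 3D shape the chain will fold into, and that is the hard part.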

---
In this colab:




- You will play with shaping and visualizing proteins. You will use sliders to fix misfolded proteins and track your progress using an error function.
- You will learn about some concepts that are important in machine learning.



Through these activities, you will get a feel for what machine learning is and why it can help us solve some of the most challenging problems, like protein folding!

---



> ## Now let's get everything we need to work with proteins.





*Instructions:*  Click on the button on the left to run the cell.

In [None]:
#@title Run this cell!

print("Great job! You just ran a cell.\n"
      "The colab is now loading the necessary libraries and "
      "data in. It might take a couple of minutes.\n"
      "Meanwhile, you can go watch the video above again!")

from IPython.utils import io

print('Downloading necessary libraries...')
def install_libraries():
  !pip install biopython
  !pip install -U ProDy
  !pip install py3Dmol
  
with io.capture_output() as captured:
  install_libraries()

# Import libraries.
import math
import numpy as np
import scipy.spatial.transform  # Provides the Rotation class used in _rotate.
import copy
import py3Dmol
import matplotlib.pyplot as plt

from google.colab import widgets
from matplotlib import pylab

from mpl_toolkits.mplot3d import Axes3D
import prody  # Needed for the prody.utilities call in _visualize_protein.
from prody import *

print('Done downloading the libraries!')
print('Loading functions...')

step = 0
# Visualisation utility.
def _visualize_protein(*alist, **kwargs):
  """Slightly modified and simplified version of py3Dmol's view3d function."""
  width = kwargs.get('width', 400)
  height = kwargs.get('height', 400)

  # Create a viewer of the requested size.
  view = py3Dmol.view(width=width, height=height)

  for atoms in alist:
    pdb = prody.utilities.createStringIO()
    writePDBStream(pdb, atoms)
    view.addAsOneMolecule(pdb.getvalue(), 'pdb')
    view.setStyle({'model': -1}, {'cartoon': {'color': 'spectrum'}})
    view.setStyle({'model': -1, 'hetflag': True}, {'stick': {}})
    view.setStyle({'model': -1, 'bonds': 0}, {'sphere': {'radius': 0.5}})

  view.setBackgroundColor('0xeeeeee')
  view.zoomTo()
  return view

def _rotate(protein, position, angles):
  """Splits a protein into two parts (at position), and rotates the second part
  by the given angles (as Euler rotation angles in X, Y, Z, in radians)."""

  # Build a rotation transformation for the second part of the protein.
  rotation = scipy.spatial.transform.Rotation.from_euler('xyz', angles=angles)

  coords = protein.getCoords()  # Atom coordinates.
  first = coords[:position]
  second = coords[position:]
  # Rotate about the last atom: move it to the origin, rotate, move it back.
  offset = coords[-1]
  second = rotation.apply(second - offset) + offset

  rotated_protein = copy.deepcopy(protein)
  rotated_protein.setCoords(np.concatenate([first, second]))
  return rotated_protein
 
  
# Download protein data.
def download_proteins():
  """Downloads and unpacks the CASP13 target protein structures."""
  with io.capture_output() as captured:
    !wget http://predictioncenter.org/download_area/CASP13/targets/casp13.targets.T-D.4public.tar.gz
    !apt-get install p7zip-full
    !p7zip -d casp13.targets.T-D.4public.tar.gz
    !tar -xvf casp13.targets.T-D.4public.tar.gz
  print('Downloaded the necessary proteins!')
    

def load_target_protein(target_pdb):
  """Loads the target protein structure into colab."""
  with io.capture_output() as captured:
    target_protein = parsePDB(target_pdb)
  print('Loaded target protein.')
  return target_protein
  

def get_rmsd_metric(pdb_orig, pdb_wiggled):
  """Compute the RMSD error between two proteins."""
  pdb_orig = copy.deepcopy(pdb_orig)

  coords = pdb_wiggled.getCoords()
  pdb_orig.addCoordset(coords)
  pdb_orig.setACSIndex(0)

  alignCoordsets(pdb_orig.calpha)
  mean_rmsd = np.mean(calcRMSD(pdb_orig))
  return round(mean_rmsd, 4)


# Used for dynamically updating error curves.
def _view_results(grid, steps, rmsds):
  with grid.output_to(0, 0):
    grid.clear_cell()
    pylab.figure(figsize=(5, 5))
    pylab.plot(steps, rmsds, 'gray')
    pylab.xlabel('Steps')
    pylab.ylabel('Error')
    pylab.xticks(steps)


TARGET_PDB = 'T0953s2-D1.pdb'
download_proteins()
target_protein = load_target_protein(TARGET_PDB)


> ## What does _Protein Folding_ look like?

At DeepMind, we've created a computer model that folds proteins from the amino acid chain. Below is an animation of the *folding* process.

<img src="https://storage.googleapis.com/dm-educational/assets/protein_folding/T0870-D1.gif" />

# Now it's time to solve some problems!

<center>
<img src="https://storage.googleapis.com/dm-educational/assets/protein_folding/scientist.png" width="200" />
</center>

In the following three problems, you will be given a protein misfolded at a particular bond, and your task is to repair it.  Move the sliders to try and repair the protein. With each move you make, a graph will update to show your progress.

In [None]:
#@title Run this cell to prepare the problems.

# These are the initial values for the problems:
steps_p1 = []
rmsds_p1 = []

steps_p2 = []
rmsds_p2 = []

steps_p3 = []
rmsds_p3 = []


> ## **Problem #1:**
 

A protein was misfolded at bond number *165* by a *30 degree* rotation.

---

Below you have access to a slider that can help you fix the misfolded protein. We will keep track of your attempts by plotting the **error** (a number telling you how close to the answer you are).  Every time you try a new angle, we will calculate the error and update the **error curve**.  Try to get zero error!

We also visualise the **misfolded** protein and the **correct** protein.  Notice they are initially a bit different.  They should look the same when you have found the correct angle to fix the misfold.
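
If you are curious what such an error number could look like in code, here is a minimal sketch of a root-mean-square deviation (RMSD) between two sets of 3D points. It is a simplification made for illustration: the helper name `rmsd` and the three-point "proteins" are invented, and the step of lining the two structures up before comparing them is left out.

```python
import math


def rmsd(coords_a, coords_b):
  """Root-mean-square deviation between two equally long lists of 3D points."""
  total = 0.0
  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
    total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
  return math.sqrt(total / len(coords_a))


# Two made-up three-atom structures. Only the last atom differs.
correct = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
misfolded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]

print(rmsd(correct, correct))    # Identical structures give 0.0.
print(rmsd(correct, misfolded))  # sqrt(9 / 3), roughly 1.732.
```

The further apart the matching atoms are, the larger the number, which is why a lower error means a better fix.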


In [None]:
#@title Fix the misfolded protein {run: "auto"}

bond_number = 165  #@param {type: "integer"}

angle = 0  #@param {type: "slider", min: -180, max: 180}

assert(0 <= bond_number <= 344), "Bond number must be between 0 and 344"

misfolded_protein = _rotate(target_protein, 165,
                            [math.radians(30), 0, 0])
misfolded_protein = _rotate(misfolded_protein, bond_number,
                            [math.radians(angle), 0, 0])
steps_p1.append(len(steps_p1))
rmsds_p1.append(get_rmsd_metric(target_protein, misfolded_protein))

grid = widgets.Grid(2, 3)

_view_results(grid, steps_p1, rmsds_p1)

with grid.output_to(1, 0):
  print("Error curve, lower is better")

with grid.output_to(1, 1):
  print("Misfolded protein")

with grid.output_to(1, 2):
  print("Correct protein")

with grid.output_to(0, 1):
  grid.clear_cell()
  _visualize_protein(misfolded_protein).show()

with grid.output_to(0, 2):
  grid.clear_cell()
  _visualize_protein(target_protein).show()

if rmsds_p1[-1] < 0.001:
  print("Excellent! This looks like the correct protein!")
else:
  print("That's not quite right, try again.")


### Solution

**`bond_number`**: 165  (don't change this value)

**`angle`**: -30  (we need to fix the misfold of 30 degrees, by rotating back, or -30 degrees)
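
Why does rotating back by the same amount work? The sketch below checks it on a single made-up atom, under the assumption that both rotations happen around the same fixed point and the same axis (here the X direction). The helper `rotate_about_x` is written just for this example and is not part of the Colab's own code.

```python
import math


def rotate_about_x(point, pivot, degrees):
  """Rotates a 3D point around a line through pivot parallel to the X axis."""
  angle = math.radians(degrees)
  x, y, z = (p - q for p, q in zip(point, pivot))  # Position relative to pivot.
  rotated = (x,
             y * math.cos(angle) - z * math.sin(angle),
             y * math.sin(angle) + z * math.cos(angle))
  return tuple(r + q for r, q in zip(rotated, pivot))


pivot = (1.0, 1.0, 0.0)  # Made-up fixed point.
atom = (2.0, 3.0, 1.0)   # Made-up atom position.

misfolded = rotate_about_x(atom, pivot, 30)
repaired = rotate_about_x(misfolded, pivot, -30)

print(misfolded)  # Differs from the original atom.
print(repaired)   # Back at the original position, up to tiny rounding errors.
```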

> ## **Problem #2:**

A protein was misfolded by a *42* degree rotation at some bond between *100* and *120*.

---

Just like before, you have access to a slider that you can use to set the right rotation.  However, this time we don't know at which bond the rotation happened!

You'll have to try different values for the **`bond_number`**.  We recommend you set the correct **`angle`** before you start changing the **`bond_number`**.

Once again, try to get zero error!

In [None]:
#@title Fix the misfolded protein {run: "auto"}

bond_number = 100  #@param {type: "integer"}

angle = 0  #@param {type: "slider", min: -180, max: 180}

assert(0 <= bond_number <= 344), "Bond number must be between 0 and 344"

misfolded_protein = _rotate(target_protein, 115,
                            [math.radians(42), 0, 0])
misfolded_protein = _rotate(misfolded_protein, bond_number,
                            [math.radians(angle), 0, 0])
steps_p2.append(len(steps_p2))
rmsds_p2.append(get_rmsd_metric(target_protein, misfolded_protein))

grid = widgets.Grid(2, 3)

_view_results(grid, steps_p2, rmsds_p2)

with grid.output_to(1, 0):
  print("Error curve, lower is better")

with grid.output_to(1, 1):
  print("Misfolded protein")

with grid.output_to(1, 2):
  print("Correct protein")

with grid.output_to(0, 1):
  grid.clear_cell()
  _visualize_protein(misfolded_protein).show()

with grid.output_to(0, 2):
  grid.clear_cell()
  _visualize_protein(target_protein).show()

if rmsds_p2[-1] < 0.001:
  print("Excellent! This looks like the correct protein!")
else:
  print("That's not quite right, try again.")


### Solution

**`bond_number`**: 115

**`angle`**: -42
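
Trying every bond by hand gets tedious quickly, and this is exactly the kind of search a computer is happy to do for us. Here is a minimal sketch on a toy chain of 30 points (a made-up spiral, not the real protein) with the misfold hidden at position 15. The helpers `rotate_tail` and `error` are simplified stand-ins written for this example only.

```python
import math


def rotate_tail(coords, position, degrees):
  """Rotates coords[position:] about the X direction, pivoting on the last point."""
  angle = math.radians(degrees)
  _, py, pz = coords[-1]
  result = list(coords[:position])
  for x, y, z in coords[position:]:
    dy, dz = y - py, z - pz
    result.append((x,
                   py + dy * math.cos(angle) - dz * math.sin(angle),
                   pz + dy * math.sin(angle) + dz * math.cos(angle)))
  return result


def error(coords_a, coords_b):
  """Root-mean-square distance between matching points (no alignment)."""
  total = sum((a - b) ** 2
              for point_a, point_b in zip(coords_a, coords_b)
              for a, b in zip(point_a, point_b))
  return math.sqrt(total / len(coords_a))


# A made-up 30-point spiral standing in for the protein.
correct = [(float(i), math.sin(i), math.cos(i)) for i in range(30)]
misfolded = rotate_tail(correct, 15, 42)  # The "unknown" misfold.

best_bond, best_error = None, float('inf')
for bond in range(10, 21):
  repaired = rotate_tail(misfolded, bond, -42)  # The angle is already known.
  current_error = error(correct, repaired)
  if current_error < best_error:
    best_bond, best_error = bond, current_error

print(best_bond, round(best_error, 4))
```

The loop keeps whichever position gave the smallest error, which in this toy setup is the hidden position 15.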

> ## **Problem #3:**

A protein was misfolded at a bond by an unknown number of degrees.

**Note**: we do not tell you by how much it was rotated nor at which bond it happened!

---

You'll have to figure out a good strategy to change the **`angle`** and the **`bond_number`** until you get to the right solution.

This is a **HARD** problem!  Don't worry if you don't get it right.

In [None]:
#@title Fix the misfolded protein {run: "auto"}

bond_number = 0  #@param {type: "integer"}

angle = 0  #@param {type: "slider", min: -180, max: 180}

assert(0 <= bond_number <= 344), "Bond number must be between 0 and 344"

misfolded_protein = _rotate(target_protein, 123,
                            [math.radians(-99), 0, 0])
misfolded_protein = _rotate(misfolded_protein, bond_number,
                            [math.radians(angle), 0, 0])
steps_p3.append(len(steps_p3))
rmsds_p3.append(get_rmsd_metric(target_protein, misfolded_protein))

grid = widgets.Grid(2, 3)

_view_results(grid, steps_p3, rmsds_p3)

with grid.output_to(1, 0):
  print("Error curve, lower is better")

with grid.output_to(1, 1):
  print("Misfolded protein")

with grid.output_to(1, 2):
  print("Correct protein")

with grid.output_to(0, 1):
  grid.clear_cell()
  _visualize_protein(misfolded_protein).show()

with grid.output_to(0, 2):
  grid.clear_cell()
  _visualize_protein(target_protein).show()

if rmsds_p3[-1] < 0.001:
  print("Excellent! This looks like the correct protein!")
else:
  print("That's not quite right, try again.")


### Solution

**`bond_number`**: 123

**`angle`**: 99

# Machine learning (Extra credit-ish)

Folding a protein that had a single place broken wasn't _too_ hard, right?  But imagine that you had to find the right combination of rotations for tens, hundreds, or even thousands of different places!  Then the problem would be a lot harder...
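
To get a feel for how quickly this blows up, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that each broken place can be set to any whole-number angle on the slider range from -180 to 180, which gives 361 settings per place.

```python
# Assumption for illustration: 361 whole-number angles (-180..180) per place.
settings_per_place = 361

for places in [1, 2, 10]:
  combinations = settings_per_place ** places
  print(places, 'place(s):', combinations, 'combinations')
```

Two places already give 130321 combinations, and ten places give roughly 3.8 × 10^25. Nobody wants to try those by hand!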

The whole field of _Machine Learning_ revolves around this very important concept:

### Computers are good at doing boring repetitive things.  If you know when you have improved something, just make a computer do that _over and over_.


## Programming

In the following sections, we will walk you through making your very own machine learning system to fold proteins.  Don't worry, you don't need to know programming before doing this.  And if you get stuck, we have the solutions so you don't miss out on the fun!

> ## Iterations

You can make the computer repeat a command as many times as you wish.

To do this, we use the **`for`** command, e.g.

In [None]:
for i in range(10):
  print("Hello")

The details of how this works don't matter.  In this exercise, you will always use

    for <a variable name> in range(<a number>):
      <some commands>
      
You can use the variable that you put after **`for`**, if you want.  It will take the value 0, 1, 2, ..., all the way up to the number that you put inside `range`, minus one.  For example:

In [None]:
for somename in range(7):
  print(somename)

> ## Variables

You can also save something you compute into a new variable.  For this you use the equals sign (=).  Unlike in Maths, in Python the equals sign means to save some value in a variable, rather than an equation to solve.  For example:

In [None]:
square = 5*5
print(square)

You can even overwrite a variable with something that's computed from its own value. This might seem strange, but it is OK.

In [None]:
square = square * square
print(square)

And of course you can combine iteration and variable assignments to make interesting things. For example, to print all the square numbers between 0 and 100:

In [None]:
for number in range(11):
  square = number * number
  print(square)

> ## What is _Machine Learning_?


Basically it is a way to teach computers to learn by trial and error.  In a nutshell, it involves the following steps:

1.  Start somewhere
2.  Improve the current solution by a little bit
3.  Repeat step 2 until you get a solution that is good enough.

Or, if you prefer equations ;)

    next_solution = current_solution + improvement

And because in programming we can use the same variable to update itself:

    solution = solution + improvement

So computers learn a little bit like we do: by making mistakes, and learning something from them.
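
Here is a minimal sketch of that recipe on a toy problem that has nothing to do with proteins: finding the number whose square is 49. The `error` function, the starting point and the nudge size are all made up for this illustration.

```python
def error(guess):
  """How far the square of our guess is from 49."""
  return abs(guess * guess - 49)


solution = 1.0    # 1. Start somewhere.
nudge_size = 0.5

for _ in range(100):
  # 2. Try a small nudge in each direction and keep whichever helps most.
  improvement = 0.0
  for nudge in (nudge_size, -nudge_size):
    if error(solution + nudge) < error(solution + improvement):
      improvement = nudge
  solution = solution + improvement
  # 3. The loop repeats step 2.

print(solution, error(solution))  # 7.0 0.0
```

Once no nudge makes things better, the improvement is zero and the solution simply stays where it is.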

_____

In this section you will make a program that will teach the computer to fold a protein with 10 breakages of unknown rotation.

You will have the following functions available to you:

*  **`get_initial_candidate()`**: Gets a random guess for the rotation angles.
*  **`get_improvement()`**: Tries to get a better candidate by nudging the angles a bit and seeing if this is better than before.  If it tries many times and cannot improve the protein, it just gives up and returns something that doesn't change anything (i.e. rotates everything by zero, or rather, doesn't rotate anything).  It might also end up making things worse... that's still OK.
*  **`view_results()`**: Visualises the current progress.
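
The "nudge the angles and sometimes accept a worse guess" idea can be sketched on its own, without any proteins. The example below is a simplified stand-in, not the Colab's actual function: the target angles, the `error` function and the variable names are invented, and a `temperature` value that shrinks over time controls how often a worse guess is still accepted.

```python
import random

rng = random.Random(0)  # Fixed seed so the run is repeatable.
target = [30, -99, 42]  # Made-up "correct" angles for a toy problem.


def error(candidate):
  """Sum of absolute differences between the candidate and the target."""
  return sum(abs(c - t) for c, t in zip(candidate, target))


candidate = [rng.randint(-180, 180) for _ in target]
errors = [error(candidate)]
temperature = 1.0

for _ in range(2000):
  delta = [rng.randint(-5, 5) for _ in target]
  nudged = [c + d for c, d in zip(candidate, delta)]
  # Accept improvements, and occasionally accept a worse guess while the
  # temperature is still high.
  if error(nudged) < errors[-1] or rng.random() < temperature:
    candidate = nudged
    temperature *= 0.99
  errors.append(error(candidate))

print('first error:', errors[0], 'last error:', errors[-1])
```

Early on almost every nudge is accepted, so the error can wander up as well as down. As the temperature drops, the program becomes pickier and mostly keeps only the nudges that help.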

> ## Practice

In [None]:
#@title Run this cell:  Start using machine learning!  You can always re-run if you need to.

import time

rmsds_ml = [8]
steps_ml = [0]

_temperature = 1.0
_temp_decay = 0.99
_num_iters = 3
_timer = time.time()

offsets = [165, 42, 115, 123, -99, 30, -179, 2, 0, 53]

def get_initial_candidate():
  global _temperature
  global _timer
  _timer = time.time()
  _temperature = 1.0
  return np.random.randint(-180, 180, [10])

def _eval_candidate(c):
  rotated = target_protein
  for i, rot in enumerate(c):
    rotated = _rotate(rotated, 15 + 35*i,
                      [math.radians(rot - offsets[i]), 0, 0])

  return get_rmsd_metric(target_protein, rotated)  

def get_improvement():
  global _temperature
  global _temp_decay
  global _num_iters
  for _ in range(_num_iters):
    delta = np.random.randint(-5, 5, [10])

    rmsd = _eval_candidate(delta + candidate)
    if rmsd < rmsds_ml[-1] or np.random.random() < _temperature:
      rmsds_ml.append(rmsd)
      steps_ml.append(len(steps_ml))
      _temperature *= _temp_decay
      return delta

  rmsd = _eval_candidate(candidate)
  rmsds_ml.append(rmsd)
  steps_ml.append(len(steps_ml))
  return np.zeros([10])

from google.colab import widgets
from matplotlib import pylab

grid = widgets.Grid(1, 1)

def view_results():
  global _timer
  if _timer + 1 < time.time():
    _timer = time.time()
    with grid.output_to(0, 0):
      grid.clear_cell()
      pylab.figure(figsize=(5, 5))
      pylab.plot(steps_ml, rmsds_ml, 'gray')
      pylab.xlabel('Steps')
      pylab.ylabel('Error')


print("Ready to learn!")

### *Warm-up problem*

Make a new candidate by using the **`get_initial_candidate()`** function above, and save it in a variable called **`candidate`**.

_____

#### Solution

If you get stuck, you can see the hidden cell below, but do give it a try.

In [None]:
candidate = get_initial_candidate()

### Your code goes in the cell below

### *Some improvements*


Now try to update your candidate with an improvement and then visualise your result.  You can either add this improvement directly, or you can save it first in a new variable, up to you.  Remember to use the **`get_improvement()`** function described above.

Since you already created your candidate above, you don't need to do it again here, but you can if you want.

You will have to make sure that your improvement is saved in the **`candidate`** variable so that the system can keep track of how you are doing.

If you are feeling adventurous, and like to copy-paste a lot... try stringing a few improvements together before visualising.

______

**NOTE**:  It is absolutely normal for the system to sometimes get a bit worse before improving again.  **Remember: This is how we learn, by making mistakes.**

______

#### Solution

As above, here is the solution, in case you get stuck :)

In [None]:
candidate = candidate + get_improvement()
candidate = candidate + get_improvement()
candidate = candidate + get_improvement()
candidate = candidate + get_improvement()
view_results()

### Your code goes in the cell below

### *Final exercise*

Now let's put all of it together.

Write a program that creates a random candidate, and then improves it 1000 times, visualising its results along the way.

Put your code in the cell below and run it :)

____

#### Solution

As usual, here is the solution in the hidden cell below.  Try to solve the problem, and look only if you get stuck or are curious.

In [None]:
candidate = get_initial_candidate()

for i in range(1000):
  candidate = candidate + get_improvement()
  view_results()


### Your code goes in the cell below