Source: https://benjamin.computer/posts/2018-02-10-mres-part1.html

# 1. Learning AI and biology together

Researching machine learning and structural biology is hard, but rewarding at the same time. With even a little know-how, the chances of discovering something new are very high, which is rare in other fields.

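## Structural biology and computer science

Biology is hugely complex, so the volume of experimental data is enormous too. But quantity isn't the real problem; heterogeneity is. Experimental data is, in a word, a mess. With countless file formats and differing workflows, it's a wonder any results come out at all.

Computers are getting more complex too. Recently there has been a lot of progress in artificial intelligence. These advances in computer science are bringing us a little closer to understanding biology.

I'm very interested in structural biology, so I'm taking this opportunity to apply computer science to it.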

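## Deep learning and polypeptides

Protein folding is the process by which a chain of individual amino acids takes on the folded structure that lets it function. The process is complex, which makes it hard to predict. If we could predict a protein's structure from its amino acid sequence alone, we could design better drugs.

What I'm attempting is narrower and more specific: predicting antibody-specific polypeptides. If it works, we could make better antibodies for use in therapeutics and diagnostics.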

Deep learning has been the hot, sexy topic for quite a while now. It's funny how an old idea can suddenly come back and take the world by storm. Biologists have been using neural nets for some time now. PSIPRED is a good example of a program I've used before that uses neural nets. It tries to detect secondary structure when given an amino acid sequence. The neural networks in the news are typically image and classifier related and the things they can do are quite amazing. The canonical example is GoogLeNet, which builds on the work of LeCun et al.

![](https://pharmchem.ucsf.edu/sites/pharmchem.ucsf.edu/files/styles/pharmacy_half/public/3eeb-coulombic.png?itok=t7CtNDUU)
<center> A classic example of a large protein binding to another chemical. Image courtesy of https://pharmchem.ucsf.edu/research/compchem/bioinformatics </center>

Such nets are called deep because they stack several layers, and the image nets in particular rely on convolutions. A convolution operation takes a kernel of a certain size and convolves it over the input. The kernel might be, say, an 8 x 8 patch that sums up all the values within it, reducing the input by a factor of 8 (roughly) in each direction when the patches don't overlap. However, at the same time, applying several kernels creates a third dimension that only gets deeper with each layer. For instance, an input image might be 256 x 256 in size, made up of single values. After the first convolution it might be 128 x 128 x 3 depending on your kernel and its operation. A further operation might reduce this to 64 x 64 x 6. The last dimension keeps getting deeper as the original dimensions get smaller.
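
To make the shape arithmetic concrete, here is a minimal NumPy sketch of a strided convolution. It is not how Tensorflow implements it; the `conv2d` helper, the 2 x 2 kernel size and the stride of 2 are assumptions chosen only so that the shapes match the 256 x 256 to 128 x 128 x 3 to 64 x 64 x 6 example.

```python
import numpy as np

def conv2d(x, kernels, stride):
    """Slide each kernel over x with no padding.

    x has shape (height, width, channels_in);
    kernels has shape (k, k, channels_in, channels_out).
    """
    k = kernels.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w, kernels.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Multiply the patch by every kernel and sum over the patch
            out[i, j, :] = np.tensordot(patch, kernels, axes=3)
    return out

rng = np.random.default_rng(0)
image = rng.random((256, 256, 1))            # single-valued 256 x 256 input

kernels1 = rng.normal(size=(2, 2, 1, 3))     # three 2 x 2 kernels (assumed sizes)
layer1 = conv2d(image, kernels1, stride=2)

kernels2 = rng.normal(size=(2, 2, 3, 6))     # six 2 x 2 kernels over 3 channels
layer2 = conv2d(layer1, kernels2, stride=2)

print(image.shape, layer1.shape, layer2.shape)
```

Each layer halves the width and height while the number of kernels sets the new depth.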

## Tensorflow

Google released Tensorflow some time ago, and I used it briefly when I was looking at Natural Language Processing. This time around I thought I'd learn a little more first. I'd recommend the Coursera introduction to machine learning. You'll get the bare basics which you'll need to get started. I decided to use Tensorflow over Caffe or Keras because it's easy to set up and there's plenty of documentation. Not only that, it's pitched at just the right level - it's quite powerful but quick to use. I decided to set up a machine to run all of my nets on. I managed to get hold of a machine being thrown out from the University. With a spare graphics card kicking around I had a machine I could install Arch Linux on and take advantage of Tensorflow's GPU support. Tensorflow has support for a few languages, the most popular being Python (which is what I use). It's good to know that C is also an option for a more final product.

Tensorflow has the idea of the graph and the session. Tensors are supposed to flow through the graph in a session. You build a graph that is made up of operations that pass tensors around. With this graph built, you create a session which takes some inputs and runs them through the graph. Hackaday has a really good introduction to Tensorflow that describes how the classic idea of a set of neurons and links maps on to a series of matrix multiplications. Essentially, most tensors we are likely to deal with are either matrices (2D) or maybe 3D.
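
The neurons-to-matrices idea can be checked in a few lines of NumPy. This is only an illustrative sketch; the layer sizes (4 inputs, 3 outputs) and the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=4)          # activations of 4 input neurons
weights = rng.normal(size=(4, 3))    # one weight per link: 4 inputs -> 3 outputs
biases = rng.normal(size=3)          # one bias per output neuron

# Neuron by neuron: each output sums its weighted incoming links
loop_out = np.array([
    np.tanh(sum(inputs[i] * weights[i, j] for i in range(4)) + biases[j])
    for j in range(3)
])

# The same layer written as a single matrix multiplication
matrix_out = np.tanh(inputs @ weights + biases)

print(np.allclose(loop_out, matrix_out))
```

Both forms give the same three outputs, which is why a fully-connected layer is usually written as one multiplication and one addition.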

The following is a short example of one of my early test nets. This example creates a convolutional neural net that has 1 convolutional layer and two fully-connected layers. At points, I resize the tensors to perform certain operations, but mostly, it's a series of matrix multiplications and additions. Finally, I use a tanh activation function. If you read a lot of the Tensorflow examples out there, you'll see a lot of ReLUs being used, but for our purposes, we need a nice range between -1 and 1.

In [None]:
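import tensorflow as tf  # written against the TensorFlow 1.x API

# FLAGS, weight_variable, bias_variable and conv1d are support code not shown here.
# The wrapper name create_graph is assumed; the snippet only showed `return graph`.
def create_graph():
    graph = tf.Graph()

    with tf.device('/gpu:0'):
        with graph.as_default():
            # Input: one boolean per (position, amino acid) pair
            tf_train_dataset = tf.placeholder(tf.bool,
                [None, FLAGS.max_cdr_length, FLAGS.num_acids], name="train_input")

            # Four output values per position
            output_size = FLAGS.max_cdr_length * 4
            dmask = tf.placeholder(tf.float32, [None, output_size], name="dmask")
            # Assumed: target placeholder, since run_session looks up "train_test:0"
            tf_train_test = tf.placeholder(tf.float32, [None, output_size], name="train_test")
            x = tf.cast(tf_train_dataset, dtype=tf.float32)

            # Convolutional layer
            W_conv0 = weight_variable([FLAGS.window_size,
                FLAGS.num_acids, FLAGS.num_acids], "weight_conv_0")
            b_conv0 = bias_variable([FLAGS.num_acids], "bias_conv_0")
            h_conv0 = tf.tanh(conv1d(x, W_conv0) + b_conv0)

            # First fully-connected layer, applied to the flattened conv output
            dim_size = FLAGS.num_acids * FLAGS.max_cdr_length
            W_f = weight_variable([dim_size, output_size], "weight_hidden")
            b_f = bias_variable([output_size], "bias_hidden")
            h_conv0_flat = tf.reshape(h_conv0, [-1, dim_size])
            h_f = tf.tanh(tf.matmul(h_conv0_flat, W_f) + b_f) * dmask

            # Output layer; tanh keeps the values between -1 and 1
            W_o = weight_variable([output_size, output_size], "weight_output")
            b_o = bias_variable([output_size], "bias_output")
            y_conv = tf.tanh((tf.matmul(h_f, W_o) + b_o) * dmask, name="output")

    return graph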

With this graph in place, I can then run it on my GPU with the following session:

In [None]:


def run_session(graph, datasets):
    ''' Run the session once we have a graph, training methodology and a dataset '''
    with tf.device('/gpu:0'):
        with tf.Session(graph=graph) as sess:
            training_input, training_output, validate_input, validate_output, test_input, test_output = datasets
            # Pull out the bits of the graph we need
            ginput = graph.get_tensor_by_name("train_input:0")
            gtest = graph.get_tensor_by_name("train_test:0")
            goutput = graph.get_tensor_by_name("output:0")
            gmask = graph.get_tensor_by_name("dmask:0")
            stepnum = 0
            # Working out the accuracy
            basic_error = cost(goutput, gtest) 
            # Setup all the logging for tensorboard 
            variable_summaries(basic_error, "Error")
            merged = tf.summary.merge_all() 
            train_writer = tf.summary.FileWriter('./summaries/train',graph)
            train_step = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(basic_error)

            tf.global_variables_initializer().run()

            while stepnum < len(training_input):
                item_is, item_os = next_item(training_input, training_output, FLAGS)
                mask = create_mask(item_is)
                summary, _ = sess.run([merged, train_step],
                    feed_dict={ginput: item_is, gtest: item_os, gmask: mask})

                # Report the error on the validation set every 100 steps
                if stepnum % 100 == 0:
                    mask = create_mask(validate_input)
                    validation_error = basic_error.eval(
                        feed_dict={ginput: validate_input, gtest: validate_output, gmask: mask})
                    print('step %d, validation error %g' % (stepnum, validation_error))

                train_writer.add_summary(summary, stepnum)
                stepnum += 1

            # save our trained net
            # saver = tf.train.Saver()
            # saver.save(sess, 'saved/nn02')


There are a few little gotchas here that are worth mentioning. It's important to call:

tf.global_variables_initializer().run()

Tensorflow gets upset if the variables in the system are not initialized. Most examples don't do things the way I do them but I wanted to partition my program a little differently. If you've named your tensors and placeholders, you can reference them later by name:

ginput = graph.get_tensor_by_name("train_input:0")

A placeholder like ginput does exactly what you'd expect. It's like a sort of socket that you plug your data into. If I pass it a Numpy array of data, Tensorflow will make a tensor out of it and send it on its way around the graph.

There's more to this example that I've not included, such as the cost functions and various support functions to create and initialise proper weights but I think we can agree that it's not a lot of code to generate a fully usable neural network.
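
For a rough idea of what the missing mask and cost helpers could look like, here is a NumPy sketch. These are hypothetical stand-ins, not the code used in the post: the implementation of `create_mask`, the `masked_mse` name, the assumption that padded positions are all-False rows, and the grouping of four output values per position are all guesses for illustration.

```python
import numpy as np

def create_mask(batch, values_per_position=4):
    """Hypothetical mask builder.

    batch has shape (n, max_length, num_acids) and holds booleans;
    a position with no amino acid set is treated as padding.
    """
    present = batch.any(axis=2).astype(np.float32)            # (n, max_length)
    # Repeat each flag so it covers all output values for that position
    return np.repeat(present, values_per_position, axis=1)    # (n, max_length * 4)

def masked_mse(predicted, target, mask):
    """Mean squared error over the unmasked outputs only."""
    squared = ((predicted - target) * mask) ** 2
    return squared.sum() / max(mask.sum(), 1.0)

# One sequence of two residues, padded to length 3, with 2 possible acids
batch = np.array([[[True, False], [False, True], [False, False]]])
mask = create_mask(batch)

predicted = np.zeros((1, 12))
target = np.ones((1, 12))
print(mask)
print(masked_mse(predicted, target, mask))
```

The padded position contributes nothing to the error, so the net is not penalised for whatever it outputs there.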

## How do I research good like?

Those of you who have done a research-based masters degree or a PhD will no doubt have your own war stories. I'm sure there has been plenty written about the research experience, and although I'm still going through the process, I have a few things I can mention.

Firstly, have a plan, and then realise that the plan is more like a framework. I've changed around bits of my plan already, but the things I've intended to cover, I've mostly covered. The order and priorities have changed, but overall, I can say whether or not I'm on track. The very nature of research will present you with new things you never expected and that will force changes, but be aware of what you are spending your time on. Make sure you keep plenty of time spare for writing. In my case, I've got around 4 months penciled down, which might not even be enough.

Secondly, regular contact with folks, especially your supervisor, is important. I work remotely - very remotely! I live in Washington DC but I'm still talking to my supervisor in London every couple of weeks with some news. It helps keep you on the straight and narrow and reminds you that you are not alone. In fact, this is very important to remember. I've had help from my wife, my friends and even people I've never met at the London Biohackspace, so keep in touch with folk.

Having the right tools is important, so long as you are spending time using the tools to get the work done! I've set up an AI machine that I've since not messed with; I can rely on it to just work. I've not upgraded Tensorflow or any of the libraries on it and I won't do until the job is done. I use Zotero to keep all my references in check, LaTeX for all my writing and I use a timer to record how long I'm spending each week on research. I'm a big fan of Trello for keeping track of the things you need to be doing and any ideas that come into your head.

Finally, choose your project wisely. I went around at least 6 different supervisors, asking them about the projects that interested me. Not only that, I spoke with a couple of friends (who are both doctors in computer science) about which project sounded right for me, and I'm really glad I did! I've ended up with a project I truly enjoy and perhaps, that's the most important thing. That way, you'll get it finished and to a high standard. Love of the project is needed to get through the tough stages (and I've already had a couple of these). Ask people you trust what they think of your options. You'll likely make a better choice. 

## Going further

In the next post, I'll talk a bit about the different kinds of neural networks we can make: from conv nets to LSTMs. I'll go into a little more detail about the various tests and algorithms we use to assess biological structures and what problems I've encountered on the way. I'll also talk a little about Jupyter notebooks and how we can make science a bit more accessible.

# 2. Protein Loops in Tensorflow - A.I Bio Part 2

In the last post I talked about some of the basics of structural biology. I'm focusing on these annoying loops that form part of the antibody - the bits that do the work. My theory is maybe neural networks can do a better job than other methods thus far. But how do we even begin to approach this problem? How can we go from a list of amino acids to a full 3D structure?

## Representing proteins

Amino acids have an amine and carboxyl group, with what is known as a sidechain that hangs off the side. I'm no chemist, so I'm sure someone out there will tell me what these words mean. But for our purposes, what we need to know is the amine and carboxyl groups form a backbone - a (polypeptide) chain if you will. It is this chain that we want to chase. For now, we don't actually need to know the sidechain details.

Furthermore, we can represent this backbone using three dihedral angles, known as phi, psi and omega. Looking at the backbone we can count along each acid (known as a residue) and read off the atoms. If we start from the N-terminus (the end with the free amine group), we get: nitrogen, carbon, carbon, nitrogen, carbon... etc

The first carbon atom we come across is known as the Alpha Carbon (Ca). The second is the Carboxyl Carbon (C). So for each residue, we have N-Ca-C, N-Ca-C, and repeat. If we know the distances and angles between these atoms we can calculate these dihedral angles. From these angles we can recreate the structure again if we need to.

A dihedral angle is essentially the angle between two planes that intersect. We take 4 atoms in our structure, find the two planes then the angle between them. Phi is the twist around the Alpha Carbon and the Nitrogen, with Psi being the twist around the Carboxyl Carbon and the Alpha Carbon.
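
As a sketch of the calculation, the function below measures the signed dihedral angle defined by four points with NumPy. The function name and the test coordinates are made up for illustration. With backbone atoms, phi would use C (previous residue), N, Ca, C; psi would use N, Ca, C, N (next residue); and omega would use Ca, C, N (next), Ca (next).

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed angle in degrees between the plane through p0, p1, p2
    and the plane through p1, p2, p3."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p1 - p0
    b1 = p2 - p1            # the bond we are twisting around
    b2 = p3 - p2
    n1 = np.cross(b0, b1)   # normal of the first plane
    n2 = np.cross(b1, b2)   # normal of the second plane
    x = np.dot(n1, n2)
    y = np.linalg.norm(b1) * np.dot(b0, n2)
    return np.degrees(np.arctan2(y, x))

# Four points with a known 60 degree twist around the z axis
twist = np.radians(60.0)
angle = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1],
                 [np.cos(twist), np.sin(twist), 1])
print(round(angle, 3))
```

Using `arctan2` rather than `arccos` keeps the sign of the twist, which matters because phi and psi range over the full -180 to 180 degrees.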

Omega is a bit of a special case. It is the twist around the Nitrogen and the Carboxyl Carbon but it very rarely changes from around 180 degrees. It hovers roughly +/- 15 degrees from this, except in special cases. As with so many things in biology, there is always a special case!

## Structures in neural networks

There has been some interesting work of late in creating 3D objects with neural networks. As far as I can tell, there are two main ways to create 3D structures: combining existing ones and discretising the space. There are a few workshops and conferences such as CVPR and 3DDL at NIPS, and some GitHub repositories like Synthesize3, 3D pose estimation and Deep Lung. So far, the field seems to be fairly new, with most of the code being on the bleeding edge.

Fortunately, we don't have to do any of that. Angles can be represented as two numbers, the sine and cosine of the angle in question. Both of these functions range from -1 to 1 and can therefore fit nicely into the classic model of a neuron in our network. If we use tanh as opposed to something like a ReLU, we can cover this range quite comfortably (we can ignore the slight non-linear problem here for now - we are hacking this somewhat!).
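
Here is a small sketch of that encoding in NumPy. The `encode` and `decode` names are made up for the example. It also shows a side benefit: angles either side of the 180 degree wrap-around end up close together, which a raw angle in degrees would not give you.

```python
import numpy as np

def encode(angle_deg):
    """Turn an angle in degrees into a (sin, cos) pair in the range -1 to 1."""
    r = np.radians(angle_deg)
    return np.array([np.sin(r), np.cos(r)])

def decode(pair):
    """Recover the angle in degrees from a (sin, cos) pair."""
    return np.degrees(np.arctan2(pair[0], pair[1]))

round_trip = decode(encode(-147.45))

# 179 and -179 degrees are only 2 degrees apart on the circle
gap = np.linalg.norm(encode(179.0) - encode(-179.0))

# A pair that is not exactly on the unit circle still decodes sensibly
scaled = decode(0.5 * encode(60.0))

print(round_trip, gap, scaled)
```

Because `arctan2` only cares about the ratio of the two values, a net whose outputs are slightly off the unit circle still yields a usable angle.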

## NeRF wars!

So how do we go from angles back to a structure we can look at in order to do a comparison? One could do an awful lot of trig, but there is a more elegant algorithm called NeRF (and sn-NeRF). Annoyingly, the paper is behind the Wiley paywall, but the general gist of the algorithm is to start with 3 atoms, and place the next atom based on the positions of the previous three.

NeRF has two steps. Firstly, the atom is placed using the known bond angles, distances and the previous positions using a little trigonometry with the bond and torsion angles. The last step creates a matrix that rotates the atom into the correct reference plane.

The paper misses out two vital issues! One thing the paper does not go into is the fact that one should start from a Nitrogen at the N-terminus! I'd been trying to go from the C-terminus, and it didn't really work so well. I suspect this is one of those cases where the knowledge is just assumed in that domain. Secondly, there is a cheeky minus sign in the X value of the second section.

I had a little help from James Phillips at the London Biohackspace who took a look at my code, and noticed the issue with the starting end! Always good if you can get another person to look over your code when you are stuck.

![](https://c1.staticflickr.com/5/4719/38597405410_757336f5da_o.gif)
<center>from https://benjamin.computer/posts/2018-03-16-mres-part2.html</center>

Some of the things you can do with the NeRF algorithm. The real structure is in grey. The coloured structure is attempting to match its endpoint.

Here is the NeRF algorithm in full, in Python for those who might want it.


In [5]:
import numpy as np
import math, itertools

class NeRF(object):

    def __init__(self):
        # TODO - PROLINE has different lengths which we should take into account
        # TODO - A_TO_C angle differs by +/- 5 degrees
        #bond_lengths = { "N_TO_A" : 1.4615, "PRO_N_TO_A" : 1.353, "A_TO_C" : 1.53, "C_TO_N" : 1.325 }
        self.bond_lengths = { "N_TO_A" : 1.4615,  "A_TO_C" : 1.53, "C_TO_N" : 1.325 }
        self.bond_angles = { "A_TO_C" : math.radians(109), "C_TO_N" : math.radians(115), "N_TO_A" : math.radians(121) }
        self.bond_order = ["C_TO_N", "N_TO_A", "A_TO_C"]

    def _next_data(self, key):
        ''' Loop over our bond_angles and bond_lengths '''
        ff = itertools.cycle(self.bond_order)
        for item in ff:
            if item == key:
                next_key = next(ff)
                break
        return (self.bond_angles[next_key], self.bond_lengths[next_key], next_key)

    def _place_atom(self, atom_a, atom_b, atom_c, bond_angle, torsion_angle, bond_length) :
        ''' Given the three previous atoms, the required angles and the bond
        lengths, place the next atom. Angles are in radians, lengths in angstroms.''' 
        # TODO - convert to sn-NeRF
        ab = np.subtract(atom_b, atom_a)
        bc = np.subtract(atom_c, atom_b)
        bcn = bc / np.linalg.norm(bc)
        R = bond_length

        # numpy is row major
        d = np.array([-R * math.cos(bond_angle),
        R * math.cos(torsion_angle) * math.sin(bond_angle),
        R * math.sin(torsion_angle) * math.sin(bond_angle)])

        n = np.cross(ab,bcn)
        n = n / np.linalg.norm(n)
        nbc = np.cross(n,bcn)

        m = np.array([ 
        [bcn[0],nbc[0],n[0]],
        [bcn[1],nbc[1],n[1]],
        [bcn[2],nbc[2],n[2]]])

        d = m.dot(d)
        d = d + atom_c
        return d

    def compute_positions(self, torsions):
        ''' Call this function with a set of torsions (including omega) in degrees.
        Following bond_order, they are consumed in the order psi, omega, phi, repeating.'''
        atoms = [[0, -1.355, 0], [0, 0, 0], [1.4466, 0.4981, 0]]
        torsions = list(map(math.radians, torsions))
        key = "C_TO_N"
        angle = self.bond_angles[key]
        length = self.bond_lengths[key]

        for torsion in torsions:
            atoms.append(self._place_atom(atoms[-3], atoms[-2], atoms[-1], angle, torsion, length))
            (angle, length, key) = self._next_data(key)
        return atoms

if __name__ == "__main__":
    nerf = NeRF()
    print("3NH7_1 - using real omega")
    torsions = [ 142.951668191667, 173.2,-147.449854444109, 137.593755455898, -176.98,
                -110.137784727015, 138.084240732612, 162.28,-101.068226849313, -96.1690297398444, 167.88,
                -78.7796836206707, -44.3733790929788, 175.88,-136.836113196726, 164.182984866024, -172.22,
                -63.909882696529, 143.817250526837, 168.89,-144.50345668635, 158.70503596547, 175.87,
                -96.842536650294, 103.724939588454, -172.34,-85.7345901579845, -18.1379473766538, -172.98,
                -150.084356709565]

    atoms0 = nerf.compute_positions(torsions)
    print(len(atoms0))
    for atom in atoms0:
        print(atom)

3NH7_1 - using real omega
33
[0, -1.355, 0]
[0, 0, 0]
[1.4466, 0.4981, 0]
[ 1.66402739  1.58662823 -0.72350302]
[ 2.96102383  2.2598307  -0.69940133]
[ 2.74560278  3.76296751 -0.88668057]
[ 3.47980754  4.53616952 -0.10008854]
[ 3.3316362   5.98996057 -0.12287197]
[ 4.58355452  6.60851816 -0.74816152]
[ 4.35459731  7.60463118 -1.59134741]
[ 5.41907018  8.53327984 -1.96616928]
[ 5.25395647  9.82842507 -1.16852507]
[ 4.60531723 10.79606033 -1.79985236]
[ 4.1430318  11.98367674 -1.08442124]
[ 2.85997035 11.63986009 -0.32518253]
[ 1.97381949 10.93510234 -1.01342535]
[ 0.74171301 10.45861241 -0.38824281]
[ 0.49119583  9.01228285 -0.81983514]
[-0.39369909  8.35404832 -0.08545561]
[-0.60984724  6.92076923 -0.27246707]
[-1.16458277  6.67987328 -1.67786297]
[-0.73789141  5.57186797 -2.26598176]
[-1.3578372   5.08654901 -3.49728797]
[-1.37242548  3.5567375  -3.47821136]
[-2.26650212  2.9952417  -4.27882014]
[-2.33374215  1.5432591  -4.43116038]
[-1.5556844   1.13733803 -5.68445773]
[-0.36472043  

NeRF isn't perfect; it makes some assumptions that don't always hold up. The distances given at the top do vary, though they have been verified by a few experiments. Adding in this variation might be something I implement in the future, but for now, NeRF recreates the majority of structures quite well.

## Next steps with our nets

With a representation of the chain as a series of angles, we can build several different kinds of networks that, when given a series of amino acids, produce a set of numbers ranging from -1 to 1. Recombining the sine and cosine, converting to degrees and applying NeRF, we get a structure that is very close to the original.
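
The recombining step might look something like the sketch below. It is an illustration under stated assumptions, not the code from the post: the `output_to_torsions` name, the layout of four values per residue as (sin phi, cos phi, sin psi, cos psi), the absence of padding, and the fixed omega of 180 degrees are all assumed. The resulting list follows the psi, omega, phi order that the NeRF class above consumes.

```python
import numpy as np

def output_to_torsions(net_output, omega=180.0):
    """Turn a flat vector of (sin, cos) pairs into a torsion list in degrees.

    Assumed layout: four values per residue,
    (sin phi, cos phi, sin psi, cos psi).
    """
    per_residue = np.asarray(net_output, dtype=float).reshape(-1, 4)
    phi = np.degrees(np.arctan2(per_residue[:, 0], per_residue[:, 1]))
    psi = np.degrees(np.arctan2(per_residue[:, 2], per_residue[:, 3]))

    torsions = []
    # Each step along the chain uses psi of this residue, omega of the
    # peptide bond, then phi of the next residue
    for i in range(len(per_residue) - 1):
        torsions += [psi[i], omega, phi[i + 1]]
    return torsions

# Build a fake network output from known (phi, psi) angles for two residues
angles = [(-60.0, 142.95), (-147.45, 137.59)]
net_output = []
for phi_deg, psi_deg in angles:
    net_output += [np.sin(np.radians(phi_deg)), np.cos(np.radians(phi_deg)),
                   np.sin(np.radians(psi_deg)), np.cos(np.radians(psi_deg))]

torsions = output_to_torsions(net_output)
print(np.round(torsions, 2))
```

The first residue's phi and the last residue's psi go unused here, since each needs a neighbouring residue that is not part of the loop. The list could then be handed to `compute_positions`.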

In the next blog post, I'll look at the different kinds of networks, which work best and what pitfalls we need to avoid.