# Dihedral Angle Calculator

This notebook covers all the steps needed to parse the inputs from the ProteinNet dataset (link), append physicochemical descriptors from AAIndex, and calculate each protein's dihedral angles and contact map (a distogram in future versions). This is a work in progress, and much of what is present here will change over the next few months. For instance, this block currently loads only a single CASP record. To execute it, follow the steps below.
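
For orientation, a dihedral (torsion) angle is defined by four consecutive points, such as four backbone atoms. The sketch below uses the standard `atan2` formulation in pure Python. It is only an illustration and not the code used by this notebook's modules; the function name `dihedral` and the coordinates in the usage note are made up for the example.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians defined by four 3D points."""
    b0 = _sub(p1, p0)  # first bond vector
    b1 = _sub(p2, p1)  # central bond vector (rotation axis)
    b2 = _sub(p3, p2)  # last bond vector
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    b1_len = math.sqrt(_dot(b1, b1))
    # atan2 keeps the sign and stays stable near 0 and pi
    y = b1_len * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)
```

With the central bond along the z axis, `dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))` describes a cis arrangement and evaluates to 0, while moving the last point to `(-1, 0, 1)` gives a trans arrangement with magnitude π.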

#### Installation of Libraries

The original TensorFlow 2 Docker image does not come with several of the libraries used throughout this notebook, so the pip commands may break outside this container or on other versions of the available TensorFlow Docker images.
Even though this script can run on any computer that meets the requirements (Tensorflow 1.15 CUDA), be careful when running it outside a container, since I could not verify compatibility with other systems.

**The container version and name are:**
- TF2.1.0
- tensorflow/tensorflow:latest-gpu-py3-jupyter

**Note that some of the libraries used by the imported modules may differ. All the needed libraries are downloaded below and listed on GitHub.**

##### Image run: tensorflow/tensorflow:1.15.0-gpu-py3-jupyter
docker pull tensorflow/tensorflow:1.15.0-gpu-py3-jupyter

To run the container, one can also define an alias, like so:

alias docker_tf='docker run -v /LOCAL/VOLUME/:/tf/CONTAINER_VOLUME -p 8888:8888 --rm --runtime=nvidia -it tensorflow/tensorflow:latest-gpu-py3-jupyter'

### How to Run this notebook:
1. First, download one of the ProteinNet TXT datasets (this notebook was tested using CASP7 at 50% thinning);
2. Make sure you have the following packages installed:
>> - Python 3 <br>
>> - TensorFlow <br>
>> - scikit-learn <br>
>> - Matplotlib <br>
>> - tqdm <br>
>> - regex <br>
>> (alternatively, install Docker and pull/run the container mentioned above) <br>
3. Execute the cells in sequence.
>> 1. Read each cell's description before running it. Some cells load files produced in previous steps, so you can skip the cells that generate those files and keep exploring.

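Before executing the cells, it can help to confirm that the packages from step 2 are importable. The snippet below is a small sketch that uses only the standard library. The import names in `REQUIRED` are assumptions based on the list above (for example, scikit-learn is imported as `sklearn`), and `missing_packages` is a hypothetical helper, not part of this project.

```python
import importlib.util

# Import names assumed for the packages listed in step 2
# (scikit-learn is imported as `sklearn`).
REQUIRED = ["tensorflow", "sklearn", "matplotlib", "tqdm", "regex"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

The check only looks packages up and does not import them, so it stays fast even when TensorFlow is installed.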

### Global and Control Variables
<br>
The variables listed and declared here control the entire process. Each one is described in a comment following its declaration. The parameters written here **are the defaults**. Don't change them unless you know exactly what you are doing. For instance, changing _p_number_ could cause a buffer overflow on your computer, and you will end up being mad.

In [1]:
import subprocess
import sys
import os
from os import listdir
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow.keras as keras
import model2
import tensorflow as tf
from generators import AngleDataGenerator
from Utils import Utils as utils
# Suppress TF warnings
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

In [3]:
a = utils.load_obj('dists_0001.pkl', 'train_70/train_70_xy/')

In [5]:
a[0].shape

(263, 263)
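
The square shape (263 × 263) is consistent with a pairwise distance matrix for a single protein, although the contents of the pickle are not shown here. If that reading is right, a contact map can be derived by thresholding the distances. The sketch below assumes the distances are in ångströms and uses an 8.0 cutoff, which is a common convention in contact prediction rather than a value taken from this notebook. `contact_map` is a hypothetical helper and the toy distances are made up.

```python
def contact_map(dist, cutoff=8.0):
    """Binary contact map: 1 where the pairwise distance is below `cutoff`.

    `dist` is any square, row-iterable matrix (nested lists or a 2D array).
    The cutoff must use the same unit as the distances (assumed: angstroms).
    """
    return [[1 if d < cutoff else 0 for d in row] for row in dist]

# Toy 3-residue example with made-up distances
toy = [[0.0, 3.8, 9.5],
       [3.8, 0.0, 6.1],
       [9.5, 6.1, 0.0]]
print(contact_map(toy))
# [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
```

For a NumPy array such as `a[0]`, the comparison `(a[0] < 8.0).astype(int)` gives the same result in a single vectorized step.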

### List all training files

In [2]:
data_dir = 'latest_train_ind2'
all_files = listdir(data_dir)

### Map all filenames to a given y

In [3]:
# len(all_files)//4 is used because the folder holds 4 different data types per sample.
# TODO: replace this stopgap with a proper filter on the file names.
sequence_dih_map = { 'x_{:04d}.npy'.format(i+1):'y_{:04d}.npy'.format(i+1) for i in np.arange(len(all_files)//4)}
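
One possible way to avoid the hard-coded division by 4 is to build the map from the file names themselves. The sketch below assumes the `x_NNNN.npy` / `y_NNNN.npy` naming used above and keeps only inputs whose target file is actually present. `build_xy_map` is a hypothetical helper, not something provided by `Utils`.

```python
import re

def build_xy_map(file_names):
    """Map each 'x_NNNN.npy' to its 'y_NNNN.npy', keeping only complete pairs."""
    names = set(file_names)
    mapping = {}
    for name in sorted(names):
        match = re.fullmatch(r"x_(\d+)\.npy", name)
        if match is None:
            continue  # not an input file (a target or another data type)
        target = "y_{}.npy".format(match.group(1))
        if target in names:
            mapping[name] = target
    return mapping
```

In this notebook it could be called as `sequence_dih_map = build_xy_map(all_files)`, which would keep working if the number of data types in the folder changes.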

### Prepares and compiles the model

In [4]:
mod = model2.AnglePredictor()
model = mod.build_model('tanh','bilstm',[500,46])
model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])
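
Accuracy is a classification metric, so it is hard to interpret when the target is a continuous angle. If the targets are angles in radians (an assumption, since the target encoding is not shown in this notebook), a periodic error that treats -179° and 179° as close may be more informative. The pure-Python sketch below illustrates the idea; the function names are hypothetical and the code is not wired into Keras.

```python
import math

def angular_abs_error(pred, true):
    """Absolute difference between two angles in radians, wrapped to [0, pi]."""
    diff = (pred - true + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

def mean_angular_error(preds, trues):
    """Mean wrapped absolute error over two equal-length, non-empty sequences."""
    errors = [angular_abs_error(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)
```

For example, `angular_abs_error(math.radians(179), math.radians(-179))` is about 2° in radians, whereas a plain absolute difference would report 358°.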

### Preparing training data

In [5]:
# A rough 2/3 - 1/3 split of the data into training and validation sets
train_ids = dict(list(sequence_dih_map.items())[0:200])
valid_ids = dict(list(sequence_dih_map.items())[200:300])
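
The same split can be expressed with a fraction instead of fixed indices, which keeps the 2/3 - 1/3 ratio if the number of files changes. This is a minimal sketch: `split_ids` and its parameters are hypothetical names, and like the cell above it splits by position without shuffling.

```python
def split_ids(mapping, train_fraction=2 / 3, limit=None):
    """Split a dict into (train, valid) dicts by position, without shuffling."""
    items = list(mapping.items())
    if limit is not None:
        items = items[:limit]  # optionally use only the first `limit` samples
    cut = round(len(items) * train_fraction)
    return dict(items[:cut]), dict(items[cut:])
```

Calling `train_ids, valid_ids = split_ids(sequence_dih_map, limit=300)` reproduces the 200 / 100 split used above when at least 300 samples are available.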

In [6]:
# Data Generators specification
train_gen = AngleDataGenerator(train_ids)
valid_gen = AngleDataGenerator(valid_ids)


In [7]:
a = np.load('testing_data/x_0001.npy')

In [8]:
class SaveTestPredCallback(tf.keras.callbacks.Callback):
    """Runs prediction on the test input after each epoch and saves the result
    as 'teste_pred_<epoch>.npy' in the working directory."""
    def __init__(self, x_to_pred=None):
        super().__init__()
        self.x_p = x_to_pred

    def on_epoch_end(self, epoch, logs=None):
        print('\nlalalalalalala')
        epoch_pred = self.model.predict(self.x_p)
        np.save('teste_pred_{}.npy'.format(epoch), epoch_pred)
        print(epoch_pred.shape)

    def on_train_batch_end(self, batch, logs=None):
        # Debug marker printed after every training batch
        print('\nluululululuulul')

mc = SaveTestPredCallback(a)

### Model Training

In [9]:
history = model.fit(x=train_gen,
                    epochs=2,
                    callbacks=[mc],
                    validation_data=valid_gen,
                    use_multiprocessing=True,
                    workers=6)

Train for 6 steps, validate for 3 steps
Epoch 1/2

luululululuulul
1/6 [====>.........................] - ETA: 2:33 - loss: 0.2916 - accuracy: 0.9070
luululululuulul
luululululuulul
luululululuulul
luululululuulul
luululululuulul

lalalalalalala
(1, 500, 2)
Epoch 2/2

luululululuulul
1/6 [====>.........................] - ETA: 3s - loss: 0.2212 - accuracy: 0.2421
luululululuulul
luululululuulul
luululululuulul
luululululuulul
luululululuulul

lalalalalalala
(1, 500, 2)


In [11]:
history.history

{'loss': [0.293262558678786, 0.22531992693742117],
 'accuracy': [0.7011979, 0.6],
 'val_loss': [0.24976701041062674, 0.24439901610215506],
 'val_accuracy': [0.2400625, 0.2400625]}

In [None]:
# Collect the training statistics recorded by Keras
train_losses = history.history['loss']
train_acc = history.history['accuracy']
val_losses = history.history['val_loss']
val_acc = history.history['val_accuracy']

# Save statistics
utils.save_batch({'tl':train_losses, 'ta': train_acc, 'vl': val_losses, 'va':val_acc},'',1)

''
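
Once the statistics are collected, a quick way to read them is to find the epoch with the lowest validation loss. The sketch below works on a plain dictionary shaped like `history.history`. `best_epoch` is a hypothetical helper, and the numbers are copied from the output shown earlier in this notebook.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return (epoch_index, value) of the lowest value recorded for `key`."""
    values = history_dict[key]
    index = min(range(len(values)), key=values.__getitem__)
    return index, values[index]

# Values copied from the `history.history` output shown earlier
hist = {
    "loss": [0.293262558678786, 0.22531992693742117],
    "val_loss": [0.24976701041062674, 0.24439901610215506],
}
print(best_epoch(hist))
# (1, 0.24439901610215506)
```

The index is zero-based, so index 1 corresponds to the line Keras prints as `Epoch 2/2`.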

In [6]:
import numpy as np

In [8]:
from random import randint

In [39]:
randint(1,6)

3

In [41]:
randint(1,6)

3