# NPPLS Data Preparation Pipeline

This notebook covers all the steps needed to parse the inputs from the ProteinNet dataset (link), append physicochemical descriptors from AAIndex, and calculate each protein's Z-matrix, dihedral angles and distogram (currently $C_\alpha$ only; extending it to $C_\beta$ is planned but depends on answers from ProteinNet's creator). This is a work in progress, and much of what is here will change over the next few months.
<br>**This is the version that counts!!**
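
As a rough illustration of the parsing step, the sketch below reads records from a ProteinNet-style text file. It assumes that each record is a run of bracketed section headers (such as `[ID]`, `[PRIMARY]`, `[MASK]`), each followed by its content lines, with a blank line between records. The function name `iter_records` and the sample data are hypothetical and are not part of `datapreplib`.

```python
import io

def iter_records(handle):
    """Yield one dict per record, mapping section name -> list of content lines."""
    record, section = {}, None
    for raw in handle:
        line = raw.strip()
        if not line:
            # A blank line closes the current record (assumed separator).
            if record:
                yield record
            record, section = {}, None
        elif line.startswith('[') and line.endswith(']'):
            # A bracketed header opens a new section inside the record.
            section = line[1:-1]
            record[section] = []
        elif section is not None:
            record[section].append(line)
    if record:
        # Last record when the file does not end with a blank line.
        yield record

# Tiny made-up sample, only to exercise the parser.
sample = io.StringIO(
    "[ID]\nfake_1\n[PRIMARY]\nACD\n[MASK]\n+++\n\n"
    "[ID]\nfake_2\n[PRIMARY]\nGG\n[MASK]\n+-\n"
)
records = list(iter_records(sample))
print(len(records), records[0]['PRIMARY'])
```

Streaming records one at a time keeps memory use flat, which matters for the larger training files.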

#### Installation of Libraries

The original TensorFlow 2 Docker image does not ship with several of the libraries used throughout this notebook, so the pip installation below may break outside this container or on other versions of the TensorFlow Docker images.
Even though this script should run on any computer that meets the requirements (Tensorflow 2.1.0 with CUDA), be careful when running it outside a container, since I could not verify compatibility with other systems.

**The container version and name are:**
- TF 2.2.0
- tensorflow/tensorflow:latest-gpu-py3-jupyter

**Note that some of the libraries used by the imported modules may differ from these; all required libraries are installed below and listed on GitHub.**

##### Image run: tensorflow/tensorflow:1.15.0-gpu-py3-jupyter
docker pull tensorflow/tensorflow:1.15.0-gpu-py3-jupyter

To run the container, you can also define an alias, like so:

alias docker_tf='docker run -v /LOCAL/VOLUME/:/tf/CONTAINER_VOLUME -p 8888:8888 --rm --runtime=nvidia -it tensorflow/tensorflow:latest-gpu-py3-jupyter'

### How to Run this notebook:
1. First, download one of the ProteinNet text datasets (this notebook was tested with CASP7 at the 50 thinning);
2. Make sure you have the following packages installed:
>> - Python 3 <br>
>> - Tensorflow <br>
>> - Scikit Learn <br>
>> - Matplotlib <br>
>> - tqdm <br>
>> - regex <br>
>> (you can install Docker and pull/run the container mentioned above) <br>
3. Execute the cells in sequence.
>> 1. Read each cell's description before running it. Some cells load files produced in earlier steps, so you can skip those steps and keep exploring.
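
Before running the cells, the package list above can be checked with a small helper like the one below. This is only a sketch: the helper name `missing_modules` is hypothetical, and the list uses import names rather than pip names (for example, `sklearn` for Scikit Learn).

```python
import importlib.util

# Import names (not pip names) of the packages listed above.
required = ['tensorflow', 'sklearn', 'matplotlib', 'tqdm', 'regex']

def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(required)
print('Missing packages:', missing if missing else 'none')
```

`find_spec` only locates a module without importing it, so the check is fast and does not initialize TensorFlow.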


### Global and Control Variables
<br>
The variables declared here control the entire process; each one is described in a comment following its declaration. The parameters written here **are the defaults**. Don't change them unless you know exactly what you are doing. For instance, changing _p_number_ could cause a buffer overflow on your computer, and you will end up mad.

In [1]:
import importlib
import subprocess
import sys
import os
from os import listdir
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# Function for installation of libraries via pip
def install_pkg(package):
    subprocess.call([sys.executable, "-m", "pip", "install", package])

pkgs = ['tqdm','scipy']

for package in pkgs:
    try:
        # import_module takes the module name as a string; a plain
        # `import package` would look for a module literally named "package"
        importlib.import_module(package)
    except ImportError:
        print('Trying to install Package: {}'.format(package))
        install_pkg(package)

# Import sub-block
# Takes care of already installed libraries on the container
import tensorflow as tf
from tqdm import tqdm_notebook as tqdm
from Utils import Utils as utils

Trying to install Package: tqdm
Trying to install Package: scipy


### DataPrep Class

This class bundles the methods used in the data preparation pipeline. The first step is to import the library. Here we import it as data_prep_lib out of plain laziness. One day I will really change the name of this class. God knows I will.

In [2]:
#import data_prep_test2 as data_prep_lib
import datapreplib as data_prep_lib

## Preparing DIH data

**Dihedral Angle files Prep**<br>
Outputs:<br>
1- Sequence and descriptors;<br>
2- $\phi$ and $\psi$ of each protein;
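
For reference, the backbone angles can be computed from atomic coordinates as sketched below. This is a generic NumPy illustration, not the code used inside `datapreplib`: the function names and the `(L, 3)` array layout for the N, $C_\alpha$ and C atoms are assumptions. $\phi_i$ is taken over C(i-1), N(i), CA(i), C(i) and $\psi_i$ over N(i), CA(i), C(i), N(i+1).

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four 3D points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

def backbone_phi_psi(n, ca, c):
    """phi/psi per residue from (L, 3) arrays of N, CA and C coordinates.

    phi is undefined for the first residue and psi for the last one,
    so those entries are left as NaN.
    """
    length = len(ca)
    phi = np.full(length, np.nan)
    psi = np.full(length, np.nan)
    for i in range(length):
        if i > 0:
            phi[i] = dihedral(c[i - 1], n[i], ca[i], c[i])
        if i < length - 1:
            psi[i] = dihedral(n[i], ca[i], c[i], n[i + 1])
    return phi, psi

# Random coordinates, only to show the call and the output shapes.
rng = np.random.default_rng(0)
n, ca, c = (rng.normal(size=(5, 3)) for _ in range(3))
phi, psi = backbone_phi_psi(n, ca, c)
print(phi.shape, psi.shape)
```

Using `arctan2` instead of `arccos` keeps the sign of the angle and avoids numerical trouble near 0 and $\pi$.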

In [None]:
# output files location
data_files = {'train':['/tf/fernando/storage/model1/Model1_pipeline/casp7/training_70','train_70_dih'],
              'valid':['/tf/fernando/storage/model1/Model1_pipeline/casp7/validation','valid_dih'],
              'test':['/tf/fernando/storage/model1/Model1_pipeline/casp7/testing','testing_data_dih']}


In [None]:
# Preparation Routine
for k in data_files.keys():
    print('Preparing file * {} * | storing at * {} *'.format(data_files[k][0],data_files[k][1]))
    data_prep = data_prep_lib.Data_Prep_Pipeline(input_file=data_files[k][0])
    data_prep.prep_data(mode='dih',save_dir=data_files[k][1])

## Preparing Dist files

**Distance file preparation**<br>
Outputs:<br>
1- Sequence and descriptors; <br>
2- Inter-$C_\alpha$ distance matrices
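
A minimal sketch of how an inter-$C_\alpha$ distance matrix, and a distogram derived from it, can be computed with NumPy. The function name `ca_distance_matrix`, the toy coordinates and the bin edges are illustrative assumptions; the binning actually used by the pipeline is not stated here.

```python
import numpy as np

def ca_distance_matrix(ca):
    """Pairwise Euclidean distances between the rows of an (L, 3) array."""
    diff = ca[:, None, :] - ca[None, :, :]    # shape (L, L, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (L, L)

# Toy coordinates for three residues.
ca = np.array([[0.0, 0.0, 0.0],
               [3.0, 4.0, 0.0],
               [3.0, 4.0, 12.0]])
dist = ca_distance_matrix(ca)

# Hypothetical bin edges: each distance is replaced by its bin index.
edges = np.array([4.0, 8.0, 12.0])
bins = np.digitize(dist, edges)
print(dist)
print(bins)
```

Broadcasting builds the full `(L, L, 3)` difference tensor in one step, which is simple but uses memory quadratic in the sequence length.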

In [None]:
# Destination of the individual protein files
data_files = {'train':['/tf/fernando/storage/model1/Model1_pipeline/casp7/training_70','train_70_dist'],
              'valid':['/tf/fernando/storage/model1/Model1_pipeline/casp7/validation','valid_dist'],
              'test':['/tf/fernando/storage/model1/Model1_pipeline/casp7/testing','testing_data_dist']}


In [None]:
# Preparation Routine
for k in data_files.keys():
    print('Preparing file * {} * | storing at * {} *'.format(data_files[k][0],data_files[k][1]))
    data_prep = data_prep_lib.Data_Prep_Pipeline(input_file=data_files[k][0])
    data_prep.prep_data(mode='dist',save_dir=data_files[k][1])

## Preparing ZMat files

**Z-Matrix preparation**<br>
Outputs:<br>
1- Sequence and descriptors;<br>
2- Z-matrix representations of the proteins
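
A Z-matrix describes each atom by internal coordinates: the distance to the previous atom, the angle with the two previous atoms, and the torsion with the three previous atoms. The sketch below builds one for a simple chain of points. It is a generic illustration under that definition; the function names and the `(length, angle, torsion)` row layout are assumptions, and the format written by `datapreplib` may differ.

```python
import numpy as np

def bond_angle(a, b, c):
    """Angle (radians) at b formed by the points a-b-c."""
    u, v = a - b, c - b
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_t, -1.0, 1.0))

def torsion(a, b, c, d):
    """Signed torsion angle (radians) around the b-c bond."""
    b0, b1, b2 = a - b, c - b, d - c
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))

def chain_to_zmatrix(coords):
    """Rows of (bond length, bond angle, torsion); undefined entries are NaN."""
    zmat = np.full((len(coords), 3), np.nan)
    for i in range(1, len(coords)):
        zmat[i, 0] = np.linalg.norm(coords[i] - coords[i - 1])
        if i >= 2:
            zmat[i, 1] = bond_angle(coords[i - 2], coords[i - 1], coords[i])
        if i >= 3:
            zmat[i, 2] = torsion(coords[i - 3], coords[i - 2],
                                 coords[i - 1], coords[i])
    return zmat

# Four points with unit bonds and right angles, only to exercise the code.
coords = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 0.0, 1.0]])
zmat = chain_to_zmatrix(coords)
print(zmat)
```

The first three rows are only partially defined, which is why they carry NaN entries instead of zeros.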

In [None]:
data_files = {'train':['/tf/fernando/storage/model1/Model1_pipeline/casp7/training_70','train_70_zmat2']}#,
#              'valid':['/tf/fernando/storage/model1/Model1_pipeline/casp7/validation','valid_zmat'],
#              'test':['/tf/fernando/storage/model1/Model1_pipeline/casp7/testing','testing_data_zmat']}


In [3]:
# testing stuff, will be deleted soon (24/10/2020)
'''
data_files = {'train':['/tf/fernando/storage/model1/Model1_pipeline/casp7/training_70','train_tert_final'],
              'valid':['/tf/fernando/storage/model1/Model1_pipeline/casp7/validation','valid_tert_final'],
              'test':['/tf/fernando/storage/model1/Model1_pipeline/casp7/testing','testing_data_tert_final']}
'''

In [4]:
for k in data_files.keys():
    print('Preparing file * {} * | storing at * {} *'.format(data_files[k][0],data_files[k][1]))
    data_prep = data_prep_lib.Data_Prep_Pipeline(input_file=data_files[k][0])
    data_prep.prep_data(mode='zmat',save_dir=data_files[k][1])

Preparing file * /tf/fernando/storage/model1/Model1_pipeline/casp7/training_70 * | storing at * train_tert_final *
# Done Processing data.
Preparing file * /tf/fernando/storage/model1/Model1_pipeline/casp7/validation * | storing at * valid_tert_final *
# Done Processing data.
Preparing file * /tf/fernando/storage/model1/Model1_pipeline/casp7/testing * | storing at * testing_data_tert_final *
# Done Processing data.
