%%markdown
# [1. Implementation of the GOR Method](#Implementation-of-GOR)

## [2. GOR Training](#GOR-training)

Training set:
* Set of proteins with known structure
* For each id we have:
    * One file containing the primary sequence
    * One file containing the secondary structure
  
![](imgs/1.png)

Need to count:
* Number of times we observe residue R in conformation S, divided by N, the total number of residues &rarr; joint probability P(R,S) ~ f(R,S) = (# R,S)/N 
* Number of times we observe residue R, divided by N &rarr; marginal probability of residue R, P(R) ~ (# R)/N
* Number of occurrences of conformation S, divided by N &rarr; marginal probability of conformation S, P(S) ~ (# S)/N

Observed frequencies in the training set (TS) are used for **estimating**/approximating these probabilities.

Example &rarr; training set containing just 2 sequences (for simplicity). 

![](imgs/2.png)

Defining a table to store counts
* Rows: counts corresponding to joint frequency of R and given SS:
    * \#R, H
    * \#R, E
    * \#R, C
* \# R &rarr; overall frequency of residue type R		

![](imgs/3.png)

Defining a small table that stores the frequencies of helix (H), strand (E) and coil (C).
![](imgs/4.png)

* Each matrix is initialized with zeroes
* Scanning each position &rarr; starting from index 0:
    * read R and S &rarr; increment the field of the table that corresponds to this pair of values.
    
In our example the window size is just 1!
![](imgs/5.png)

* We scan each sequence, updating the values in the column of the observed residue R
    * updating the counts of 
        * \#R, H
        * \#R, E
        * \#R, C
    * and the counts of total H, E and C in the smaller table.

* To transform the above **counts** into **probabilities**:
    * **Divide** each count by the total length of all sequences used in the training
    
&rarr; in our case we divide by 78:

In [26]:
sequence_1 = 'EYFTLQIRGRERFEMFRELNEALELKDAQAG'
ss_1 = 'CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC'

sequence_2 = 'KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC'
ss_2 = 'CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC'

len(sequence_1+sequence_2)

78
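
The counting and normalization steps above can be sketched in a few lines. This is only a minimal sketch for window size 1 that reuses the two example sequences; the helper name `count_pairs` and the use of `collections.Counter` are illustrative assumptions, not part of the original script.

```python
from collections import Counter

sequence_1 = 'EYFTLQIRGRERFEMFRELNEALELKDAQAG'
ss_1 = 'CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC'
sequence_2 = 'KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC'
ss_2 = 'CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC'


def count_pairs(pairs):
    """Count (R, S) pairs, residues R and conformations S (hypothetical helper)."""
    joint, residues, states = Counter(), Counter(), Counter()
    for seq, ss in pairs:
        for r, s in zip(seq, ss):      # one (residue, conformation) pair per position
            joint[(r, s)] += 1
            residues[r] += 1
            states[s] += 1
    return joint, residues, states


joint, residues, states = count_pairs([(sequence_1, ss_1), (sequence_2, ss_2)])
n_total = sum(residues.values())       # N, the total number of residues

# Observed frequencies are used as estimates of the probabilities.
p_joint = {pair: c / n_total for pair, c in joint.items()}    # P(R,S)
p_res = {r: c / n_total for r, c in residues.items()}         # P(R)
p_ss = {s: c / n_total for s, c in states.items()}            # P(S)

print(n_total, p_ss)
```

Keeping the three counters separate mirrors the two tables described above: the large table holds the joint and per-residue counts, the small one the totals of H, E and C.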

![](imgs/6.png)

## [3. GOR Prediction](#GOR-prediction)

* The GOR model is used for predicting the SS of unseen protein sequences
* Each residue position of a query sequence is analyzed
* For each residue R, the information function $I(S;R)$ is evaluated for every conformation S
    * The function $S^* = \arg\max_S I(S;R)$ finds the highest-scoring predicted conformation "$S^*$" of the residue R
        * $I$ is a log ratio of probabilities, and the conformation $S^*$ that maximizes it is our predicted conformation
        
![](imgs/7.png)       


Given any sequence:

* >NewSequence
* GLKRR

* Each residue R is looked up in the table
    * the probabilities for
        * R, H
        * R, E
        * R, C
    * are extracted and used in the function $I$
* The conformation with the highest value is our predicted conformation $S^*$

* Here is an example for residue NewSequence[0] = G:



![](imgs/8.png)

![](imgs/9.png)

The maximum is C thus it is predicted that residue G has the conformation C.
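
The lookup and the argmax can be sketched as follows. This is a minimal sketch, assuming the usual definition $I(S;R) = \log \frac{P(R,S)}{P(R)P(S)}$; the probabilities are made-up numbers for illustration (they are not the values from the figures), and the function names `information` and `predict` are hypothetical.

```python
import math

# Made-up probabilities for residue G, for illustration only.
p_joint = {('G', 'H'): 0.01, ('G', 'E'): 0.01, ('G', 'C'): 0.04}   # P(R,S)
p_res = {'G': 0.06}                                                 # P(R)
p_ss = {'H': 0.35, 'E': 0.25, 'C': 0.40}                            # P(S)


def information(r, s):
    """I(S;R) = log( P(R,S) / (P(R) * P(S)) ); -inf if the pair was never observed."""
    joint = p_joint.get((r, s), 0.0)
    if joint == 0.0:
        return float('-inf')
    return math.log(joint / (p_res[r] * p_ss[s]))


def predict(r):
    """Return the conformation S* with the highest information value for residue r."""
    return max('HEC', key=lambda s: information(r, s))


print(predict('G'))
```

With these made-up numbers only C has a positive information value, so `predict('G')` returns `'C'`. Returning `-inf` for unobserved pairs is one possible choice; pseudocounts would be another.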

## [4. Using Windows of Flanking Residues](#Windows) 

* We extend the information function over a 'window' of residues
* Symmetric windows are centered at a given residue position
* Central residue is indexed as $R_0$, which is assigned the conformation $S^*$
    * Residues to the left of $R_0$ hold negative indices down to $-d$
    * Residues to the right of $R_0$ hold positive indices up to $d$
    
* The information function is updated as follows:
![](imgs/10.png)

[p1 45:00]

* The formula requires us to estimate terms involving all w residues of the window:
    * Exponential number of possible configurations &rarr; computationally too expensive 
        * we have $20^w$ possibilities!
    * Need for a very large DB to estimate reliable distributions
* Simplification:
    * **Assumption of statistical independence**: a simplifying assumption about the contribution of the sequence context to the central residue conformation.
    * Residues $R_{-d}, \dots , R_d$ are treated as statistically independent
    
![](imgs/11.png)

## [5. Windows Based GOR](#sliding-window) 

* That way we can factorize the joint probability of the full context into the product of the marginal probabilities of the residues in the context.
* Joint probability == product of all the marginal probabilities

&rarr; Keep in mind that residues are NOT actually independent along the sequence; this is an approximation. 

* By using the 
    * chain rule
    * independence assumption 
    * and the fact that the log of a product is the sum of the logs

* $I$ can be rewritten as: 

![](imgs/12.png)
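
For reference, the three steps listed above can be written out as follows. This is a sketch of the standard derivation under the independence assumption; the notation in the figure may differ in detail. Here $P(R_k, S)$ is read as the probability of finding residue $R_k$ at window offset $k$ when the central residue is in conformation $S$.

$$
\begin{aligned}
I(S; R_{-d}, \dots, R_d)
&= \log \frac{P(S, R_{-d}, \dots, R_d)}{P(S)\, P(R_{-d}, \dots, R_d)}
 = \log \frac{P(R_{-d}, \dots, R_d \mid S)}{P(R_{-d}, \dots, R_d)} \\
&\approx \log \prod_{k=-d}^{d} \frac{P(R_k \mid S)}{P(R_k)}
 = \sum_{k=-d}^{d} \log \frac{P(R_k, S)}{P(R_k)\, P(S)}
 = \sum_{k=-d}^{d} I(S; R_k)
\end{aligned}
$$

The first line uses the chain rule, the approximation is the independence assumption, and the last step turns the log of the product into a sum of logs.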

* As shown in the last line above, the joint information function can be written as a sum of individual information functions
* The different residues in the window are taken into consideration
    * Each residue in the window makes an individual contribution to the calculation of the joint $I$ function
    

* What to do when the window falls off the sequence at the beginning and the end:
    * Initialize the scanning position at an index for which the window is full, e.g. for window size 17 set the first $R_0$ at index 8 of the sequence string
    * Or add zeros to the undefined regions of **partial windows**
        * First $R_0$ is at index 0
        * The positions outside the sequence make no contribution
        
* GOR is a linear model
* Near the ends of the sequence the sliding window approach influences the accuracy of the prediction in a negative way
    * The first and last few residues are affected more than residues in the middle of the sequence
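
The window scan with zero contribution from partial windows can be sketched like this. It is a minimal sketch under stated assumptions: the function name `predict_windowed`, the dictionary layout `info[(S, k, R)]` and the toy information values are all hypothetical and only serve the illustration.

```python
def predict_windowed(seq, info, d, states='HEC'):
    """Predict one conformation per residue with a window of size 2*d + 1.

    info maps (S, k, R) -> information value I(S; R_k) for window offset k
    (hypothetical layout). Offsets outside the sequence contribute zero.
    """
    prediction = []
    for i in range(len(seq)):                  # every residue gets to be R_0
        scores = dict.fromkeys(states, 0.0)
        for k in range(-d, d + 1):
            j = i + k
            if j < 0 or j >= len(seq):         # partial window: no contribution
                continue
            for s in states:
                scores[s] += info.get((s, k, seq[j]), 0.0)
        prediction.append(max(states, key=lambda s: scores[s]))
    return ''.join(prediction)


# Toy values, for illustration only: A favours helix, G favours coil.
toy_info = {}
for k in (-1, 0, 1):
    toy_info[('H', k, 'A')] = 1.0
    toy_info[('C', k, 'G')] = 1.0

print(predict_windowed('AAAGGG', toy_info, d=1))
```

Because the score is a plain sum of per-position terms, skipping an offset is the same as adding zero for it. With the toy values the call above yields `HHHCCC`.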
   
 
### Now our Parameters are:
* $P(R,S)$ the joint probability of observing residue R in conformation S

### gor_train.py
Trains the GOR model from a user-defined training set and stores all trained parameters (= the GOR model) in an output file

# 1. I need to make several matrices --> np.arrays


first field f[0,0] 
should contain the index of the fields

1. First test on one sequence.
2. Then test on a simplified profile as input
     * profile should have 5 sequences ---> lines
     
3. Remember that for coil:
    * Training files contain '-' 
    * Blindset contains 'C'  
        * Handle both with an if.
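
One way to handle the two coil symbols is to normalize the secondary structure string once, right after reading it. A minimal sketch, where the helper name `normalize_ss` is an assumption and not part of the original script:

```python
def normalize_ss(ss_string):
    """Map the coil symbol '-' to 'C' so both file types use H, E and C (hypothetical helper)."""
    return ss_string.replace('-', 'C')


print(normalize_ss('---HHHH-EE'))
```

After this step the training files and the blind set share the same three-letter alphabet, so the counting code does not need a special case for coil.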


In [20]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import argparse

# d1g2ya_.dssp
# d1g2ya_.fasta from training files.

'''Make function: Takes fasta and dssp sequence as parameters as input.'''



# win_size = 3 # has to be an odd number pass through argparse later
# num_rows = win_size*4 #adapts to desired window size

# R_tot
# SS_tot



def make_frequency_array():
    '''
    Takes no arguments. Returns an array of zeros with one row
    (window size 1) and 20 columns, one per standard amino acid.
    The number of rows will have to follow win_size once windows
    are used.
    '''
    array = np.zeros((1,20))#, dtype= 'float64')
    return array
    
R_H = make_frequency_array()           # generating arrays holding the counts of residue in conformation X --> R_X
R_E = make_frequency_array()
R_C = make_frequency_array()
R_count = make_frequency_array()       # generating array holding the total residue count
SS_count = make_frequency_array()      # generating array holding the total secondary structure count


# ---> at some point: transform 'counts'   into **probabilities** by dividing by total number of residues
# Which is the number of lines of the sequ pro_file.
# profile_matrix = np.loadtxt("/Users/ila/Downloads/test.txt")
# print(profile_matrix)


In [2]:
# ss_array = np.zeros((2,4)) # maybe change it to 1, 3
def make_frequency_df(zeroarray):
    header_col = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
#     row_names = ['#R,H', '#R,E', '#R,C', '#R']
#     freq_array = make_frequency_array(win_size)
    freq_df = pd.DataFrame(data = zeroarray,  columns=header_col)
    return freq_df

# generating dataframes holding the counts of residue in conformation X --> R_X
df_R_H = make_frequency_df(R_H)           
df_R_E = make_frequency_df(R_E)
df_R_C = make_frequency_df(R_C)
df_R_count=make_frequency_df(R_count)
# df_SS_count = make_frequency_df(SS_count)     

tot_H = 0
tot_E = 0
tot_C = 0

def read_clean_lines(infile1):
    ''' Reads all lines from a file. Returns string of second line. The '\n' is stripped.'''
    with open(infile1, 'r') as rfile:
        newline_list = rfile.readlines()
        cleanstring = newline_list[1].rstrip()
        return cleanstring

aa_string = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/fasta/d1g2ya_.fasta')
print(aa_string)

ss_string = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/d1g2ya_.dssp')
print(ss_string)


MVSKLSQLQTEMLAALLESGLSKEALIQALG
---HHHHHHHHHHHHHHH----HHHHHHHH-


In [3]:
df_R_H 
df_R_E 
df_R_C 

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
# Increments the counts for each residue R according to its conformation S.
for aa, ss in zip(aa_string, ss_string):
    df_R_count[aa] += 1         # total count of residue R
    if ss == 'H':
        df_R_H[aa] += 1
    elif ss == 'E':
        df_R_E[aa] += 1
    else:                       # ss == '-' or ss == 'C'; any other symbol is also counted as coil
        df_R_C[aa] += 1

# check sum of residue R in HEC conformation.

In [5]:
df_R_H

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,4.0,0.0,0.0,0.0,0.0,3.0,3.0,0.0,0.0,1.0,7.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [6]:
df_R_E

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
df_R_C

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0


In [8]:
print("Residue count: \n")
df_R_count

Residue count: 



Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,4.0,0.0,0.0,0.0,0.0,3.0,3.0,2.0,0.0,1.0,8.0,2.0,2.0,0.0,0.0,4.0,1.0,0.0,0.0,1.0
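
The check sum mentioned in the comment above could look like this. It is a minimal, self-contained sketch that repeats the counting with `collections.Counter` instead of the DataFrames, so the counter names are assumptions.

```python
from collections import Counter

aa_string = 'MVSKLSQLQTEMLAALLESGLSKEALIQALG'
ss_string = '---HHHHHHHHHHHHHHH----HHHHHHHH-'

r_count = Counter()                                # total count per residue
r_h, r_e, r_c = Counter(), Counter(), Counter()    # counts per conformation
for aa, ss in zip(aa_string, ss_string):
    r_count[aa] += 1
    if ss == 'H':
        r_h[aa] += 1
    elif ss == 'E':
        r_e[aa] += 1
    else:                                          # '-' or 'C' is counted as coil
        r_c[aa] += 1

# For every residue the three conformation counts must add up to the total.
for aa in r_count:
    assert r_h[aa] + r_e[aa] + r_c[aa] == r_count[aa], aa
print('check sum passed for', len(r_count), 'residue types')
```

With the DataFrames from the cells above, the same check could probably be written as `(df_R_H + df_R_E + df_R_C).equals(df_R_count)`.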


In [None]:
# row_names = ['#R,H', '#R,E', '#R,C', '#R']
# # new = -1 0 1

# win_adapted_row_names = []
# for i in range(3):
#     win_adapted_row_names += row_names[i]+str(i)
    

# print(win_adapted_row_names)

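win_size = 9                      # window size, has to be an odd number
negwin = -(win_size // 2)         # leftmost offset: -4
poswin = win_size // 2            # rightmost offset: 4
indexes = list(range(negwin, poswin + 1))   # offsets -4 ... 4 relative to R_0

print(indexes)    
# print(negwin)
# print(poswin)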

MVSKLSQLQTEMLAALLESGLSKEALIQALG
---HHHHHHHHHHHHHHH----HHHHHHHH-


['MVSKLSQLQTEMLAALLESGLSKEALIQALG', '---HHHHHHHHHHHHHHH----HHHHHHHH-']
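
A window of size 9 spans the offsets -4 to 4 around the central residue. Counting with such a window could be sketched as below; the function name `count_windows` and the dictionary layout `counts[(S, k, R)]` are assumptions, chosen only to keep the example independent of the array shapes used in this notebook.

```python
from collections import defaultdict


def count_windows(seq, ss, win_size):
    """Count residues at each window offset, given the conformation of the central residue.

    Returns counts[(S, k, R)]: how often residue R sits at offset k from a
    central residue in conformation S (hypothetical layout).
    """
    d = win_size // 2
    counts = defaultdict(int)
    for i, s in enumerate(ss):                 # position i holds the central residue R_0
        if s == '-':                           # training files mark coil with '-'
            s = 'C'
        for k in range(-d, d + 1):
            j = i + k
            if 0 <= j < len(seq):              # skip offsets outside the sequence
                counts[(s, k, seq[j])] += 1
    return counts


counts = count_windows('MVSKLSQ', '---HHHH', win_size=3)
print(counts[('H', 0, 'K')], counts[('C', 1, 'K')])
```

Offset 0 reproduces the window-size-1 counts, while the other offsets record which residues flank a central residue in a given conformation.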

In [20]:
import os
z= os.listdir('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/fasta/')
z.sort()
print(z)

['4ywn.fasta', '6g3z.fasta']


In [22]:
list1 = ['physics', 'Biology', 'chemistry', 'maths']
x = list1.sort()   # list.sort() sorts in place and returns None
print ("list now : ", x)

list now :  None
