%%markdown
# [1. Implementation of the GOR Method](#Implementation-of-GOR)

## [2. GOR Training](#GOR-training)

Training set:
* Set of proteins with known structure
* For each id we have:
    * One file containing the primary sequence
    * One file containing the secondary structure
  
![](imgs/1.png)

Need to count:
* Number of times we observe residue R in conformation S, divided by N, the total number of residues &rarr; joint probability P(R,S) ~ f(R,S) = (# R,S)/N
* Number of times we observe residue R, divided by the total number N of residues &rarr; marginal probability of residue R
* Number of occurrences of conformation S, divided by the total number N of residues &rarr; marginal probability of observing S

Observed frequencies in the training set (TS) are used for **estimating**/approximating these probabilities.

Example &rarr; training set containing just 2 sequences (for simplicity). 

![](imgs/2.png)

Defining a table to store counts:
* Rows: counts of residue R observed in each secondary structure (SS) class:
    * \#R, H
    * \#R, E
    * \#R, C
* \# R &rarr; overall count of residue type R

![](imgs/3.png)

Defining a small table that stores the frequencies of helix (H), strand (E) and coil (C).
![](imgs/4.png)

* Each matrix is initialized with zeroes
* Scanning each position &rarr; starting from index 0:
    * reading R and S &rarr; updating the field Pij according to the values.
    
In our example the window size is just 1!
![](imgs/5.png)

* We scan each sequence, updating the values in the column of the observed residue:
    * the counts of
        * \#R, H
        * \#R, E
        * \#R, C
    * and the total counts of H, E and C in the smaller table.
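
The counting step described above can be sketched in Python as follows. This is a minimal sketch for window size 1; the function name `count_gor`, the dictionary layout, and the toy input are illustrative assumptions, not part of the original notes.

```python
from collections import defaultdict

def count_gor(pairs):
    """Count residue/conformation co-occurrences for a window of size 1.

    `pairs` is an iterable of (sequence, secondary_structure) strings.
    Assumes the secondary structure uses only the letters H, E and C.
    Returns (joint, residue_totals, ss_totals), all holding raw counts.
    """
    joint = defaultdict(lambda: {"H": 0, "E": 0, "C": 0})  # #R,S
    residue_totals = defaultdict(int)                      # #R
    ss_totals = {"H": 0, "E": 0, "C": 0}                   # #S
    for sequence, ss in pairs:
        # Scan each position, starting from index 0
        for residue, conformation in zip(sequence, ss):
            joint[residue][conformation] += 1
            residue_totals[residue] += 1
            ss_totals[conformation] += 1
    return joint, residue_totals, ss_totals

# Toy input (hypothetical, not the training set shown in the figures)
joint, residue_totals, ss_totals = count_gor([("GAG", "CHC"), ("AG", "HE")])
print(dict(joint))
print(dict(residue_totals), ss_totals)
```

All tables start at zero and are updated once per position, mirroring the scan described in the list above.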

* To transform the above **counts** (absolute frequencies) into **probabilities**:
    * **Divide** each count by the total length of all sequences used in the training
    
&rarr; in our case we divide by 78:

    sequence_1 = 'EYFTLQIRGRERFEMFRELNEALELKDAQAG'
    ss_1 = 'CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC'

    sequence_2 = 'KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC'
    ss_2 = 'CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC'

    # Total number of residues in the training set
    len(sequence_1 + sequence_2)  # 78

![](imgs/6.png)
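
The normalization can be sketched as below, using the two example sequences. The helper name `to_probabilities` and the use of `collections.Counter` are assumptions made for illustration; the notes themselves only state that every count is divided by the total number of residues.

```python
from collections import Counter

sequence_1 = 'EYFTLQIRGRERFEMFRELNEALELKDAQAG'
ss_1 = 'CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC'
sequence_2 = 'KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC'
ss_2 = 'CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC'

def to_probabilities(counts, total):
    """Divide every raw count by the total number of residues."""
    return {key: value / total for key, value in counts.items()}

# Pair each residue with its conformation, one protein at a time
pairs = list(zip(sequence_1, ss_1)) + list(zip(sequence_2, ss_2))
total = len(pairs)  # N, the total number of residues

p_joint = to_probabilities(Counter(pairs), total)                    # P(R,S)
p_residue = to_probabilities(Counter(r for r, _ in pairs), total)    # P(R)
p_ss = to_probabilities(Counter(s for _, s in pairs), total)         # P(S)

print(total)
print(p_ss)
```

Because every table is divided by the same N, each of the three distributions sums to 1.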

## [3. GOR Prediction](#GOR-prediction)

* The GOR model is used for predicting SS on unseen protein sequences
* Each residue position of a query sequence is analyzed
* For each residue, the conformation with the highest value of the information function is chosen:
    * The function $S^* = \arg\max_S I(S;R)$ finds the highest-scoring predicted conformation $S^*$ of the residue R
        * The conformation $S^*$ that maximizes the information function $I$ (a log ratio of probabilities) is our predicted conformation
        
![](imgs/7.png)       


Given any sequence:

* >NewSequence
* GLKRR

* Each residue R is located in the table
    * The probabilities for
        * \#R, H
        * \#R, E
        * \#R, C
    * are extracted and used in the function $I$
* The conformation with the highest value is our predicted conformation $S^*$

* Here is an example for residue NewSequence[0] = G:



![](imgs/8.png)

![](imgs/9.png)

The maximum is obtained for C, so residue G is predicted to have conformation C.
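
The prediction step for window size 1 can be sketched as follows. The sketch assumes the standard form of the information function, $I(S;R) = \log \frac{P(R,S)}{P(R)P(S)}$; the function names and the toy probabilities are hypothetical and are not the values shown in the figures.

```python
import math

def information(p_joint, p_residue, p_ss, residue, conformation):
    """I(S;R) = log( P(R,S) / (P(R) * P(S)) ); -inf if the pair was never observed."""
    joint = p_joint.get((residue, conformation), 0.0)
    if joint == 0.0:
        return float("-inf")
    return math.log(joint / (p_residue[residue] * p_ss[conformation]))

def predict(sequence, p_joint, p_residue, p_ss):
    """Return the highest-scoring conformation S* for every residue."""
    return "".join(
        max(p_ss, key=lambda s: information(p_joint, p_residue, p_ss, r, s))
        for r in sequence
    )

# Toy probabilities (hypothetical values, for illustration only)
p_joint = {("G", "C"): 0.3, ("G", "H"): 0.1, ("L", "H"): 0.4, ("L", "C"): 0.2}
p_residue = {"G": 0.4, "L": 0.6}
p_ss = {"H": 0.5, "C": 0.5}

print(predict("GLLG", p_joint, p_residue, p_ss))
```

With these toy values G scores higher for C and L scores higher for H. A residue type that never occurs in the training data scores $-\infty$ for every conformation, so a real implementation would need a fallback for that case.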

## [4. Using Windows of Flanking Residues](#Windows) 

* We extend the information function over a 'window' of residues
* Symmetric windows are centered at a given residue position
* The central residue is indexed as $R_0$ and is assigned the conformation $S^*$
    * Residues to the left of $R_0$ hold negative indices down to $-d$
    * Residues to the right of $R_0$ hold positive indices up to $d$
    
* The information function is updated as follows:
![](imgs/10.png)


* The formula requires us to estimate terms involving all $w$ residues in the window:
    * Exponential number of possible configurations &rarr; computationally too expensive
        * we have $20^w$ possibilities!
    * Need for a very large DB to estimate reliable distributions
* Simplification:
    * **Assumption of statistical independence**: each residue in the sequence context is assumed to contribute to the central residue conformation independently of the others.
    * Residues $R_{-d}, \ldots, R_d$ are treated as statistically independent
    
![](imgs/11.png)
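
To get a feeling for the numbers, the short calculation below compares the two cases. It assumes a window of 17 residues (the example size used later in these notes), 20 amino acid types and 3 conformations; the variable names are illustrative.

```python
window = 17          # window size, assumed here as an example
alphabet = 20        # standard amino acid types
conformations = 3    # H, E, C

# Without the independence assumption: one entry per possible window content
full_contexts = alphabet ** window

# With the independence assumption: one P(R_k, S) per window position,
# residue type and conformation
independent_parameters = window * alphabet * conformations

print(f"{full_contexts:.2e} possible windows")
print(f"{independent_parameters} joint probabilities to estimate")
```

The first number is on the order of $10^{22}$, far more than any database could cover, while the second is about a thousand.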

## [5. Windows Based GOR](#sliding-window) 

* That way we can factorize the joint probability of the full context into the product of the marginal probabilities of the residues in the context.
* Joint probability == product of all the marginal probabilities

&rarr; Keep in mind that in reality residues are NOT independent along the sequence; this is a simplifying assumption.

* By using
    * the chain rule
    * the independence assumption
    * and the fact that the log of a product is the sum of the logs

* $I$ can be rewritten as:

![](imgs/12.png)

* As shown in the last line above, the information function over the window can be written as a sum of individual information functions
* Each term takes a different residue in the window into consideration
    * Resulting in an individual contribution of each residue in the window to the calculation of the joint $I$ function
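
Written out, the steps listed above give the following derivation. This is a sketch in standard GOR notation and may differ in layout from the figure; $P(R_k \mid S)$ denotes the probability of observing residue $R_k$ at window offset $k$ given that the central residue has conformation $S$, and the $\approx$ marks where the independence assumption is applied.

$$
\begin{aligned}
I(S; R_{-d}, \ldots, R_d) &= \log \frac{P(S, R_{-d}, \ldots, R_d)}{P(S)\, P(R_{-d}, \ldots, R_d)} \\
&= \log \frac{P(R_{-d}, \ldots, R_d \mid S)}{P(R_{-d}, \ldots, R_d)} \\
&\approx \log \prod_{k=-d}^{d} \frac{P(R_k \mid S)}{P(R_k)} \\
&= \sum_{k=-d}^{d} \log \frac{P(R_k, S)}{P(R_k)\, P(S)} = \sum_{k=-d}^{d} I(S; R_k)
\end{aligned}
$$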
    

* What to do when the window falls outside the sequence at the beginning and the end:
    * Initialize the scanning position at an index for which the window is full, e.g. for window size 17, set the first $R_0$ on index 8 of the sequence string
    * Or add zeros to the undefined regions of **partial windows**
        * The first $R_0$ is on index 0
        * Positions outside the sequence do not contribute to the score
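
The second option (partial windows with zero contribution) can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference implementation: the function names, the count-based data layout and the small `pseudocount` (used only to avoid taking the log of zero) are choices made here and are not described in the notes.

```python
import math
from collections import defaultdict

STATES = "HEC"

def train(pairs, d):
    """Count (offset k, residue R_k, central conformation S) over all windows.

    Window positions that fall outside the sequence are skipped, so partial
    windows simply contribute fewer counts.
    """
    joint = defaultdict(float)    # (k, R, S) -> count
    residue = defaultdict(float)  # (k, R)    -> count
    ss = defaultdict(float)       # S         -> count
    for sequence, structure in pairs:
        for i, s in enumerate(structure[:len(sequence)]):
            ss[s] += 1
            for k in range(-d, d + 1):
                if 0 <= i + k < len(sequence):
                    joint[(k, sequence[i + k], s)] += 1
                    residue[(k, sequence[i + k])] += 1
    return joint, residue, ss

def predict(sequence, model, d, pseudocount=1e-6):
    """Sum the per-offset information values and take the argmax per residue."""
    joint, residue, ss = model
    total = sum(ss.values())
    states = [s for s in STATES if ss.get(s, 0) > 0]
    prediction = []
    for i in range(len(sequence)):
        scores = dict.fromkeys(states, 0.0)
        for k in range(-d, d + 1):
            if not 0 <= i + k < len(sequence):
                continue  # outside the sequence: contributes zero
            r = sequence[i + k]
            if residue.get((k, r), 0) == 0:
                continue  # residue never seen at this offset: no information
            for s in states:
                # Count-based estimate of P(R_k,S) / (P(R_k) * P(S))
                ratio = (joint.get((k, r, s), 0) + pseudocount) * total / (residue[(k, r)] * ss[s])
                scores[s] += math.log(ratio)
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)

# Toy training set (hypothetical) with a window of 3 residues (d = 1)
model = train([("AAAGGG", "HHHCCC")], d=1)
print(predict("AAGG", model, d=1))
```

Here the first $R_0$ sits on index 0, and the missing left-hand neighbours are skipped rather than scored.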
        
* GOR is a linear model
* The sliding-window approach reduces prediction accuracy near the sequence ends, where windows are incomplete
    * The first few residues are affected more than residues in the middle of the sequence
   
 
### Now our Parameters are:
* $P(R,S)$: the joint probability of observing residue R in conformation S