%%markdown
# [1. Implementation of the GOR Method](#Implementation-of-GOR)

## [2. GOR Training](#GOR-training)

Training set:
* Set of proteins with known structure
* For each id we have:
    * One file containing the primary sequence
    * One file containing the secondary structure
  
![](imgs/1.png)

Need to count:
* Number of times we observe residue R in conformation S, divided by the total number of residues N &rarr; joint probability P(R,S) ~ f(R,S) = (# R,S)/N
* Number of times we observe residue R, divided by N &rarr; marginal probability of residue R
* Number of occurrences of conformation S, divided by N &rarr; marginal probability of observing S

Observed frequencies in the training set (TS) are used for **estimating**/approximating these probabilities.

Example &rarr; training set containing just 2 sequences (for simplicity). 

![](imgs/2.png)

Defining a table to store counts
* Rows: counts corresponding to joint frequency of R and given SS:
    * \#R, H
    * \#R, E
    * \#R, C
* \# R &rarr; overall frequency of residue type R		

![](imgs/3.png)

Defining a small table that stores the frequencies of helix (H), strand (E) and coil (C).
![](imgs/4.png)

* Each matrix is initialized with zeroes
* Scanning each position &rarr; starting from index 0:
    * reading R and S &rarr; updating the field Pij according to the values.
    
In our example the window size is just 1!
![](imgs/5.png)

* We scan each sequence updating the values in the column
    * updating the counts of 
        * \#R, H
        * \#R, E
        * \#R, C
        
and the counts of total H, E and C in the smaller table.

* To transform the above **counts** into **probabilities**
    * **Divide** each count by the total length of all sequences used in the training
    
&rarr; in our case we divide by 78:

In [26]:
sequence_1 = 'EYFTLQIRGRERFEMFRELNEALELKDAQAG'
ss_1 = 'CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC'

sequence_2 = 'KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC'
ss_2 = 'CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC'

len(sequence_1+sequence_2)

78

![](imgs/6.png)
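The counting and normalisation steps above can be sketched in a few lines. This is only a minimal illustration for window size 1; the function names `count_frequencies` and `to_probabilities` are made up for this note, and the code assumes that each sequence and its structure string have the same length.

```python
from collections import Counter

def count_frequencies(sequences, structures):
    """Count #(R,S), #R and #S over paired sequence / structure strings."""
    joint, residues, conformations = Counter(), Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        for r, s in zip(seq, ss):          # one residue and its conformation
            joint[(r, s)] += 1
            residues[r] += 1
            conformations[s] += 1
    return joint, residues, conformations

def to_probabilities(counts):
    """Divide every count by the total N to estimate probabilities."""
    n_total = sum(counts.values())
    return {key: n / n_total for key, n in counts.items()}

# The two toy training sequences from the example above
sequences = ["EYFTLQIRGRERFEMFRELNEALELKDAQAG",
             "KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC"]
structures = ["CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC",
              "CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC"]

joint, residues, conformations = count_frequencies(sequences, structures)
p_joint = to_probabilities(joint)          # estimates of P(R,S)
p_residue = to_probabilities(residues)     # estimates of P(R)
p_ss = to_probabilities(conformations)     # estimates of P(S)

print(sum(residues.values()))              # total number of residues N
print(p_joint.get(("G", "C"), 0.0))        # estimated P(G, C)
```

Using `Counter` avoids initialising the tables with zeros by hand; pairs that were never observed simply have no entry.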

## [3. GOR Prediction](#GOR-prediction)

* The GOR model is used for predicting SS on unseen protein sequences
* Each residue position of a query sequence is analyzed
* For each residue R, the conformation with the highest value of the information function is chosen:
    * The function $S^* = \arg\max_S I(S;R)$ finds the highest scoring predicted conformation "$S^*$" of the residue R
        * The conformation $S^*$ which maximizes the information function $I$ (a log ratio of probabilities) is our predicted conformation
        
![](imgs/7.png)       


Given any sequence:

* >NewSequence
* GLKRR

* Each residue R is located in the table
    * the probabilities for
        * \#R, H
        * \#R, E
        * \#R, C
    * are extracted and used in the function $I$
* The conformation with the highest value is our predicted conformation $S^*$

* Here is an example for residue NewSequence[0] = G:



![](imgs/8.png)

![](imgs/9.png)

The maximum is obtained for C, thus residue G is predicted to have the conformation C.
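The prediction step for a single residue can be sketched as below. The formula lives in the images above, so this sketch assumes the usual form $I(S;R) = \log \frac{P(R,S)}{P(R)P(S)}$, built from the three probabilities listed in the training section. All function names are hypothetical, and pairs that never occurred in training are scored as minus infinity instead of using pseudocounts.

```python
import math
from collections import Counter

def train_gor_single(sequences, structures):
    """Estimate P(R,S), P(R) and P(S) from paired strings (window size 1)."""
    joint, res, conf = Counter(), Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        for r, s in zip(seq, ss):
            joint[(r, s)] += 1
            res[r] += 1
            conf[s] += 1
    n = sum(res.values())
    return ({k: v / n for k, v in joint.items()},
            {k: v / n for k, v in res.items()},
            {k: v / n for k, v in conf.items()})

def information(p_joint, p_res, p_conf, residue, conformation):
    """Assumed form: I(S;R) = log(P(R,S) / (P(R) * P(S))); -inf if never observed."""
    p_rs = p_joint.get((residue, conformation), 0.0)
    if p_rs == 0.0:
        return float("-inf")
    return math.log(p_rs / (p_res[residue] * p_conf[conformation]))

def predict_residue(p_joint, p_res, p_conf, residue):
    """Return the conformation S* with the highest I(S;R) for one residue."""
    return max(p_conf, key=lambda s: information(p_joint, p_res, p_conf, residue, s))

model = train_gor_single(
    ["EYFTLQIRGRERFEMFRELNEALELKDAQAG",
     "KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC"],
    ["CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC",
     "CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC"])

# Predict every residue of the query sequence independently
print("".join(predict_residue(*model, r) for r in "GLKRR"))
```

A residue that was never seen in training gets minus infinity for every conformation, so `max` just returns the first conformation; a real implementation would need pseudocounts or a fallback.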

## [4. Using Windows of Flanking Residues](#Windows) 

* We extend the information function over a 'window' of residues
* Symmetric windows are centered at a given residue position
* The central residue is indexed as $R_0$, which is assigned the conformation $S^*$
    * Residues to the left of $R_0$ hold negative indices down to $-d$
    * Residues to the right of $R_0$ hold positive indices up to $d$
    
* The information function is updated as follows:
![](imgs/10.png)

[p1 45:00]

* The formula requires us to evaluate terms involving w residues:
    * Exponential number of possible configurations &rarr; computationally too expensive
        * we have $20^w$ possibilities!
    * Need for a very large DB to estimate reliable distributions
* Simplification:
    * **Assumption of statistical independence**: the contribution of each residue in the sequence context to the conformation of the central residue is treated separately.
    * Residues $R_{-d}, \dots , R_d$ are treated as statistically independent
    
![](imgs/11.png)

## [5. Windows Based GOR](#sliding-window) 

* That way we can factorize the joint probability of the full context into the product of the marginal probabilities of the residues in the context.
* Joint probability == product of all the marginal probabilities

&rarr; Keep in mind that residues are NOT really independent along the sequence; this is a simplifying assumption.

* By using the 
    * chain rule
    * independence assumption 
    * and taking the log of the product, which is the sum of the logs

* $I$ can be rewritten as: 

![](imgs/12.png)

* As shown in the last line above, the information function of the whole window can be written as a sum of individual information functions
* The different residues in the window are taken into consideration
    * Each residue in the window makes an individual contribution to the calculation of the joint $I$ function
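
Written out, the steps listed above look roughly like this. It is a sketch of the standard derivation under the independence assumption, not a transcription of the slide; $P(R_k, S)$ is understood as the probability of seeing residue $R_k$ at window position $k$ when the central residue is in conformation $S$.

$$
\begin{aligned}
I(S; R_{-d},\dots,R_d)
 &= \log\frac{P(S, R_{-d},\dots,R_d)}{P(S)\,P(R_{-d},\dots,R_d)}
  = \log\frac{P(R_{-d},\dots,R_d \mid S)}{P(R_{-d},\dots,R_d)} \\
 &\approx \log\prod_{k=-d}^{d}\frac{P(R_k \mid S)}{P(R_k)}
  = \sum_{k=-d}^{d}\log\frac{P(R_k, S)}{P(R_k)\,P(S)}
  = \sum_{k=-d}^{d} I(S; R_k)
\end{aligned}
$$

The approximation sign marks the point where the independence assumption is used, both for the residues given $S$ and for the residues on their own.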
    

* What to do with windows falling off the sequence at the beginning and the end:
    * Initialize the scanning position at an index for which the window is full, e.g. with window size 17, set the first $R_0$ at index 8 of the sequence string
    * Or add zeros to the undefined regions of **partial windows**
        * The first $R_0$ is at index 0
        * The missing positions of partial windows give no contribution
        
* GOR is a linear model
* With zero padding, the partial windows at the termini carry less information, which lowers the accuracy of the prediction there
    * The first and last few residues are affected more than residues in the middle of the sequence
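
The zero-padding option and the linear sum over the window can be sketched together. This is an illustration only: `one_hot` and `predict_with_window` are hypothetical helpers, and `info` is assumed to hold one `(window_size, 20)` array of information values $I(S; R_k)$ per conformation, as produced by some training step not shown here.

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # same column order as the profile header

def one_hot(sequence):
    """Encode a sequence as an (L, 20) array; unknown residues (e.g. X) stay all-zero."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for i, residue in enumerate(sequence):
        j = AMINO_ACIDS.find(residue)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded

def predict_with_window(sequence, info, window_size):
    """Sum per-position information values over a zero-padded sliding window."""
    d = window_size // 2
    # d zero rows before and after, so that every residue gets a full window
    padded = np.pad(one_hot(sequence), ((d, d), (0, 0)), mode="constant")
    labels = sorted(info)
    prediction = []
    for i in range(len(sequence)):
        window = padded[i:i + window_size]                    # rows -d..d around residue i
        scores = [np.sum(window * info[s]) for s in labels]   # linear score per conformation
        prediction.append(labels[int(np.argmax(scores))])
    return "".join(prediction)

# Toy information values, made up for the demo: A favours H and G favours C
# at the central position (row 1 of a window of size 3)
toy_info = {s: np.zeros((3, 20)) for s in "HEC"}
toy_info["H"][1, AMINO_ACIDS.index("A")] = 1.0
toy_info["C"][1, AMINO_ACIDS.index("G")] = 1.0
toy_info["E"] -= 1.0
print(predict_with_window("AAGG", toy_info, 3))
```

Because the padded rows are all zero, they add nothing to the sum, which is exactly the "no contribution from partial windows" behaviour described above.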
   
 
### Now our Parameters are:
* $P(R,S)$: the joint probability of observing residue R in conformation S

### gor_train.py
Trains the GOR model on a user-defined training set and stores all trained parameters (= the GOR model) in an output file

# 1. I need to make several matrices --> np.arrays


The first field f[0,0] 
should contain the index of the fields

1. First test on one sequence.
1. Then testing on a simplified profile as input
     * profile should have 5 sequences ---> lines
     
3. Remember that for coil:
    * Training files contain '-' 
    * Blindset contains 'C'  
        * Make an if.
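
The "if" for the coil symbol could be a small helper like this one. The name `normalize_ss` is made up here; it maps everything that is not `H` or `E` to `C`, so training files (`-`) and blind set files (`C`) end up with the same alphabet.

```python
def normalize_ss(ss_string):
    """Map every symbol that is not 'H' or 'E' (e.g. '-' or 'C') to 'C'."""
    return "".join(s if s in ("H", "E") else "C" for s in ss_string)

print(normalize_ss("---HHHHHHHHHHHHHHH----HHHHHHHH-"))
```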


In [20]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import argparse

# d1g2ya_.dssp
# d1g2ya_.fasta from training files.

'''Make function: Takes fasta and dssp sequence as parameters as input.'''



# win_size = 3 # has to be an odd number pass through argparse later
# num_rows = win_size*4 #adapts to desired window size

# R_tot
# SS_tot



def make_frequency_array():
    '''
    Makes a zero-initialized array with a single row (window size 1)
    and 20 columns, one per standard amino acid.
    Returns the array.
    '''
    array = np.zeros((1,20))   # dtype float64 is the default
    return array
    
R_H = make_frequency_array()           # generating arrays holding the counts of residue in conformation X --> R_X
R_E = make_frequency_array()
R_C = make_frequency_array()
R_count = make_frequency_array()       # generating array holding the total residue count
SS_count = make_frequency_array()      # generating array holding the total secondary structure count


# ---> at some point: transform 'counts'   into **probabilities** by dividing by total number of residues
# Which is the number of lines of the sequ pro_file.
# profile_matrix = np.loadtxt("/Users/ila/Downloads/test.txt")
# print(profile_matrix)


In [2]:
# ss_array = np.zeros((2,4)) # maybe change it to 1, 3
def make_frequency_df(zeroarray):
    header_col = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
#     row_names = ['#R,H', '#R,E', '#R,C', '#R']
#     freq_array = make_frequency_array(win_size)
    freq_df = pd.DataFrame(data = zeroarray,  columns=header_col)
    return freq_df

# generating dataframes holding the counts of residue in conformation X --> R_X
df_R_H = make_frequency_df(R_H)           
df_R_E = make_frequency_df(R_E)
df_R_C = make_frequency_df(R_C)
df_R_count=make_frequency_df(R_count)
# df_SS_count = make_frequency_df(SS_count)     

tot_H = 0
tot_E = 0
tot_C = 0

def read_clean_lines(infile1):
    ''' Reads all lines from a file. Returns string of second line. The '\n' is stripped.'''
    with open(infile1, 'r') as rfile:
        newline_list = rfile.readlines()
        cleanstring = newline_list[1].rstrip()
        return cleanstring

aa_string = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/fasta/d1g2ya_.fasta')
print(aa_string)

ss_string = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/d1g2ya_.dssp')
print(ss_string)


MVSKLSQLQTEMLAALLESGLSKEALIQALG
---HHHHHHHHHHHHHHH----HHHHHHHH-


In [3]:
df_R_H 
df_R_E 
df_R_C 

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
'''Increments corresponding positions according to R and H in each field.'''
l = len(aa_string)
for i in range(l):
    ss = ss_string[i]
    aa = aa_string[i]
    df_R_count[aa]+=1     # Increment each df
    if ss == 'H':
        df_R_H[aa]+=1
    elif ss == 'E':
        df_R_E[aa]+=1
    else:                       # so if ss == '-' or ss == 'C' or even if i got some X:
        df_R_C[aa]+=1

# check sum of residue R in HEC conformation.

In [5]:
df_R_H

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,4.0,0.0,0.0,0.0,0.0,3.0,3.0,0.0,0.0,1.0,7.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [6]:
df_R_E

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
df_R_C

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0


In [8]:
print("Residue count: \n")
df_R_count

Residue count: 



Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,4.0,0.0,0.0,0.0,0.0,3.0,3.0,2.0,0.0,1.0,8.0,2.0,2.0,0.0,0.0,4.0,1.0,0.0,0.0,1.0
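
The check mentioned in the comment above (counts in H, E and C must add up to the overall residue count) can be sketched like this. The helper `count_by_conformation` is hypothetical and assumes that the sequence contains only the 20 standard residues.

```python
import numpy as np
import pandas as pd

AMINO_ACIDS = list("ARNDCQEGHILKMFPSTWYV")

def count_by_conformation(aa_string, ss_string):
    """Return one-row count DataFrames for H, E, C and the overall residue count."""
    counts = {name: pd.DataFrame(np.zeros((1, 20)), columns=AMINO_ACIDS)
              for name in ("H", "E", "C", "total")}
    for aa, ss in zip(aa_string, ss_string):
        key = ss if ss in ("H", "E") else "C"    # '-' and anything else counts as coil
        counts[key].loc[0, aa] += 1
        counts["total"].loc[0, aa] += 1
    return counts

counts = count_by_conformation("MVSKLSQLQTEMLAALLESGLSKEALIQALG",
                               "---HHHHHHHHHHHHHHH----HHHHHHHH-")
summed = counts["H"] + counts["E"] + counts["C"]
print(summed.equals(counts["total"]))    # True when the three tables add up
```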


In [None]:
# row_names = ['#R,H', '#R,E', '#R,C', '#R']
# # new = -1 0 1

# win_adapted_row_names = []
# for i in range(3):
#     win_adapted_row_names += row_names[i]+str(i)
    

# print(win_adapted_row_names)

negwin = (9//2)*-1
poswin = 9//2
indexes = []
for i in range(negwin, poswin+1):   # -4 ... 4
    indexes.append(i)

print(indexes)    
# print(negwin)
# print(poswin)

### sort() returns None type

&rarr; apply to list directly and return list instead

In [39]:
import os
z= os.listdir('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/fasta/')
a = z.sort()
print('cannot return "a" because it is', a)
print(z)


cannot return "a" because it is None
['4ywn.fasta', '6g3z.fasta']




In [22]:
list1 = ['physics', 'Biology', 'chemistry', 'maths']
x = list1.sort()          # sort() works in place and returns None
print ("list now : ", x)

list now :  None
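
A short side-by-side of the two ways to sort, as a reminder: `list.sort()` changes the list itself and returns `None`, while the built-in `sorted()` leaves the input alone and returns a new list.

```python
names = ['6g3z.fasta', '4ywn.fasta']

in_place = names.sort()        # sorts names itself, returns None
print(in_place)                # None
print(names)                   # ['4ywn.fasta', '6g3z.fasta']

ordered = sorted(['physics', 'Biology', 'chemistry', 'maths'])   # returns a new list
print(ordered)                 # uppercase letters sort before lowercase ones
```

So either call `z.sort()` and keep using `z`, or write `a = sorted(z)`.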


Found X in my fasta sequences along the way:
   * checking if the 'X' sequences have a corresponding profile

In [3]:
# Sequences containing X:
!cat ~/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/findX_ids_only

6l77
6kko
5xyf
5v0m
5uc0
6dhx
5ir2
4y0o
5c5z
6mdw
7bvv
5t2x
4zey
5wd6
6iqo
5wd8 


In [7]:
#Adding '.profile' in vim
!cat ~/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/findX_ids_only

6l77.profile
6kko.profile
5xyf.profile
5v0m.profile
5uc0.profile
6dhx.profile
5ir2.profile
4y0o.profile
5c5z.profile
6mdw.profile
7bvv.profile
5t2x.profile
4zey.profile
5wd6.profile
6iqo.profile
5wd8.profile 


In [10]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [14]:
!ls  /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/ > sequprofile_list

In [15]:
all_profiles = lines_list('./sequprofile_list')

['4uiq.profile', '4y0l.profile', '4y0o.profile', '4yte.profile', '4ywn.profile', '4zc4.profile', '4zey.profile', '4zkp.profile', '4zlr.profile', '5a88.profile', '5abr.profile', '5anp.profile', '5aun.profile', '5av5.profile', '5azw.profile', '5b71.profile', '5bn2.profile', '5bp5.profile', '5bpk.profile', '5bpu.profile', '5bxq.profile', '5c5z.profile', '5c8a.profile', '5ceg.profile', '5ctd.profile', '5d16.profile', '5d6t.profile', '5d71.profile', '5dcf.profile', '5dd8.profile', '5dg6.profile', '5dq0.profile', '5eiv.profile', '5f1s.profile', '5f2a.profile', '5fb9.profile', '5ffl.profile', '5fq0.profile', '5ghl.profile', '5gke.profile', '5gna.profile', '5hjf.profile', '5ht8.profile', '5ib0.profile', '5ii0.profile', '5ir2.profile', '5jsn.profile', '5jwo.profile', '5kqa.profile', '5kwv.profile', '5ldd.profile', '5ltf.profile', '5m9o.profile', '5mc9.profile', '5mmh.profile', '5n07.profile', '5nl9.profile', '5t2y.profile', '5u39.profile', '5u4u.profile', '5u5n.profile', '5u7e.profile', '5uiv.p

In [29]:
def lines_list(file1):
    '''Reads a file and returns its lines as a list, with trailing whitespace stripped.'''
    cleanlines = []
    with open(file1, 'r') as rfile:
        for line in rfile:
            cleanlines.append(line.rstrip())
    return cleanlines

In [34]:
path = '../../all_data/blindset/findX_ids_only'
X_seqs = lines_list(path)    
all_profiles = lines_list('./sequprofile_list')

print("X ", len(X_seqs))
print('all ',len(all_profiles))
intersec = set(X_seqs) & set(all_profiles)
print('Matching',len(intersec))

print(intersec)

X  16
all  127
Matching 10
{'4zey.profile', '4y0o.profile', '5c5z.profile', '5wd6.profile', '5ir2.profile', '7bvv.profile', '6iqo.profile', '6mdw.profile', '5xyf.profile', '6l77.profile'}


### So I assume I have only 117 blind sequences...
No, I can keep them, because only the "matching residues" of the 20 aa list are considered when filling the GOR matrices.
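
To see which positions would be ignored, the residues can be split into standard and non-standard ones. This is just a sketch; `split_known_unknown` is a made-up helper and the example sequence is invented.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def split_known_unknown(sequence):
    """Return (positions of standard residues, positions of everything else)."""
    known = [i for i, r in enumerate(sequence) if r in AMINO_ACIDS]
    unknown = [i for i, r in enumerate(sequence) if r not in AMINO_ACIDS]
    return known, unknown

known, unknown = split_known_unknown("MKXLG")
print(known)      # positions that can fill the GOR matrices
print(unknown)    # positions of X (or other symbols) that are skipped
```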

In [40]:
cat /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile

0.55	0.45	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.0	0.0	0.45	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	
0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	
0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	0.0	0.0	0.55	
1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.0	0.0	0.45	0.0	0.0	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.45	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	

In [73]:
import sys
import os
import pandas as pd
import numpy as np

def read_clean_lines(infile1):
    ''' Reads all lines from a file. Returns them as a list of strings; the '\n' is not stripped here.'''
    with open(infile1, 'r') as rfile:
        newline_list = rfile.readlines()
#         cleanstring = newline_list[1].rstrip()
        return newline_list

In [128]:
list1 = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile')
list1

cleanlines = []
floats = []
# for i in list1:

### Better to loadtxt directly into np array:

In [140]:
profile_arr = np.loadtxt("/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile", usecols=range(0,20), dtype=np.float64)
profile_arr
profile_arr.shape

rows, cols = profile_arr.shape

print(profile_arr[0])
print(profile_arr[1])



# for i in range(8):
#     for j in range(19):
#         print(profile_arr[i][j])


[0.55 0.45 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]
[0.   0.   0.45 0.   0.   0.   0.55 0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]


### You can add entire rows of np arrays;

In [138]:
sum_of_2rows = profile_arr[0]+profile_arr[1]
sum_of_2rows

array([0.55, 0.45, 0.45, 0.  , 0.  , 0.  , 0.55, 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

In [137]:
header_col = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
nospace = ""
for i in header_col:
#     print(i, end='')
    nospace+=i
nospace

num_li = []
for j in range(len(header_col)):
    num_li.append(j)
    
print(num_li)    

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]


### Need to check that numcols == numcols no

### Need to check that len(dssp_string) == number of rows in profile
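
Both checks could live in one small function. This is a sketch with a hypothetical name `check_profile`; the expected column count of 20 is an assumption taken from the profile header used in these notes.

```python
import numpy as np

def check_profile(profile_arr, ss_string, n_cols=20):
    """Raise ValueError if the profile does not fit the structure string."""
    if profile_arr.ndim != 2 or profile_arr.shape[1] != n_cols:
        raise ValueError(f"expected {n_cols} columns, got shape {profile_arr.shape}")
    if profile_arr.shape[0] != len(ss_string):
        raise ValueError(f"profile has {profile_arr.shape[0]} rows but the "
                         f"structure string has {len(ss_string)} symbols")

check_profile(np.zeros((9, 20)), "CEEEEECCC")   # passes silently
```

Raising an exception instead of printing makes it harder to miss a mismatched file when looping over the whole training set.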

In [152]:
#!/anaconda3/bin/python
import sys
import os
import glob
import pandas as pd
import numpy as np
import argparse 

'''For the first try: Make function: Takes fasta and dssp sequence as parameters as input.'''
# d1g2ya_.dssp
# d1g2ya_.fasta from training files.

win_size = 3 # has to be an odd number pass through argparse later

def make_zero_array(window_size):
    '''
    Takes the window size as argument.
    Makes an array that has as many rows as the win_size; 
    the number of columns (20) is the number of standard 
    amino acids. Returns the array.
    '''
    array = np.zeros((window_size,20)) # dtype 'float64' is already the default - not necessary to specify
    return array
    
R_H = make_zero_array(win_size)           # generating arrays holding the counts of residue in conformation X --> R_X
R_E = make_zero_array(win_size)
R_C = make_zero_array(win_size)
R_count = make_zero_array(win_size)       # generating array holding the total residue count
SS_count = make_zero_array(win_size)      # generating array holding the total secondary structure count

def make_frequency_df(zeroarray):
    '''
    Makes a dataframe from the zero array to better visualize what's going on.
    This enables us to index columns by residue name.
    '''
    header_col = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
    # row_names = ['R_0'] #['#R,H', '#R,E', '#R,C', '#R'] # want to implement using -1 0 1 according to window....
#     freq_array = make_zero_array(win_size)
    freq_df = pd.DataFrame(data = zeroarray,  columns=header_col) # index=row_names
    return freq_df

''' generating dataframes holding the counts of residue in conformation X --> R_X'''
df_R_H = make_frequency_df(R_H)           
df_R_E = make_frequency_df(R_E)
df_R_C = make_frequency_df(R_C)
df_R_count=make_frequency_df(R_count)
# df_SS_count = make_frequency_df(SS_count)     

''' generating a smaller dataframe holding the total counts of the conformations'''
ss_array = np.zeros((1,3)) # array holding the total number of residues in H, E or C
df_all_SS = pd.DataFrame(data=ss_array, columns=['H', 'E', 'C'], index= ['#S'])
# print("HEC df")
# print(df_all_SS)

def read_clean_lines(infile1):
    ''' Reads all lines from a file. Returns string of second line. The '\n' is stripped.'''
    with open(infile1, 'r') as rfile:
        newline_list = rfile.readlines()
        cleanstring = newline_list[1].rstrip()
        return cleanstring

def read_profile_into_array(infile2):
    '''
    Takes a seq profile file as input. Reads it into an np.array.
    'NaN' column (index 20) is excluded. Returns the array
    '''
    profile_array = np.loadtxt(infile2, usecols=range(0,20), dtype=np.float64) # need to indicate range to get rid of last col containing 'nan'
    return profile_array

def dict_from_2_lists(list1, list2):
    '''
    Makes dictionary from 20 numbers list and all header of profile.
    keys: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] 
    values: ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
    used to access correct column in GOR arrays.
    '''
    keys = list1
    values = list2
    index_dict = dict(zip(keys, values))
    return index_dict

indexes = dict_from_2_lists([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'])

def train_gor(profile, ssfile, RH, RE, RC, total_R, total_SS):
    '''
    Takes as input: (1) np.array of seq profile (2) ss from dssp file (3) dataframes comprising the gor model 
    (RH, RE, RC, total_R and total_SS). Increments corresponding positions according to R in given conformation 
    in each field. Returns the trained GOR model.
    '''
    # aa_string = read_clean_lines(profile)
    profile_arr = read_profile_into_array(profile)          # np.array
    print(profile_arr)

    ss_string = read_clean_lines(ssfile)
    print(ss_string)                                        

    rows, cols = profile_arr.shape                          # len(ss_string) must be == number of rows 

    if len(ss_string) != rows:
        print('Error: ', ssfile, 'length not the same as in profile! ')
        return RH, RE, RC, total_R, total_SS

    for i in range(rows):
        ss = ss_string[i]                   # type of structure at index i
        profile_row = profile_arr[i]              # Profile row at index i
        total_R += profile_row                    # Incrementing each df by adding one row of the profile
        
        if ss == 'H':
            RH += profile_row
            total_SS[ss] += 1

        elif ss == 'E':
            RE += profile_row
            total_SS[ss] += 1

        else:                               
            RC += profile_row
            total_SS['C'] += 1                # If not H or E --> its assigned to 'C' compatible with training and blind files.
    return RH, RE, RC, total_R, total_SS

In [153]:
df_R_H
print(indexes)

{0: 'A', 1: 'R', 2: 'N', 3: 'D', 4: 'C', 5: 'Q', 6: 'E', 7: 'G', 8: 'H', 9: 'I', 10: 'L', 11: 'K', 12: 'M', 13: 'F', 14: 'P', 15: 'S', 16: 'T', 17: 'W', 18: 'Y', 19: 'V'}


In [158]:
RH, RE, RC, total_R, total_SS = train_gor("/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile", "/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/dssp/test_ss.dssp", df_R_H, df_R_E, df_R_C, df_R_count, df_all_SS)
# make_frequency_df(RH)
df_R_H = make_frequency_df(RH)           
df_R_E = make_frequency_df(RE)
df_R_C = make_frequency_df(RC)
df_R_H

[[0.55 0.45 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.45 0.   0.   0.   0.55 0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   0.55 0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.45]
 [0.55 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.45]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.45 0.   0.   0.55]
 [1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.45 0.   0.   0.   0.   0.   0.55 0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.55 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.45 0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.45 0.   0.   0.   0.   0.   0.   0.   0.55 0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]]
CEEEEECCC


Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [151]:
df_R_E.loc[[0]]+profile_arr[0]

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.55,0.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### About the axis in np.arrays

Found [here](https://www.sharpsightlabs.com/blog/numpy-axes-explained/); exercises [here](https://machinelearningmastery.com/numpy-axis-for-rows-and-columns/)

Whenever we need to do 
* column-wise **or**
* row-wise

operations, we can't use the intuitive row and column indexing but have to use 

    axis 
    
instead!    

Operations such as the sum can be performed 
* Column wise using
    * ```axis=0```
* Row wise using
    * ```axis=1```
* Apply operation to the entire array
    * ```axis=None```
    
Example:    

In [160]:
import numpy as np

In [161]:
#Defining data as a list of lists
data = [[1,2,3], [4,5,6]]

In [167]:
# Converting to np array
data = np.asarray(data)
print(data)

[[1 2 3]
 [4 5 6]]


In [262]:
# Get shape of the array = tuple
axis = data.shape
print(data.shape)

(2, 3)



In [263]:
axis[1]

3

With shape we can see that
* shape[0] (2) is the number of rows
* while shape[1] (3) is the number of columns

In [169]:
data.shape[0]

2

In [170]:
data.shape[1]

3

Accessing first row and first column:

In [174]:
data[0,0]

1

Accessing first row and **all** columns

In [177]:
data[0,:]

array([1, 2, 3])

### Sum of entire array:

In [179]:
print(data)

[[1 2 3]
 [4 5 6]]


In [181]:
result = data.sum(axis=None)
print(result)

21


### Summing data by column

In [187]:
col_sum = data.sum(axis=0)
print(col_sum)

[5 7 9]




## Note that this doesn't work

In [217]:
top = np.asarray([-1,-1,-1])
bottom = np.asarray([1,1,1])

print("Shape: ", np.shape(top))

overhang = np.concatenate((top, data, bottom), axis=0)
print(overhang)

Shape:  (3,)


ValueError: all the input arrays must have same number of dimensions

### But this works:

In [216]:
top = np.asarray([[-1,-1,-1]])
bottom = np.asarray([[1,1,1]])

print("Shape: ", np.shape(top))

overhang = np.concatenate((top, data, bottom), axis=0)
print(overhang)


Shape:  (1, 3)
[[-1 -1 -1]
 [ 1  2  3]
 [ 4  5  6]
 [ 1  1  1]]
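
Two alternatives that avoid the double brackets, shown as a small sketch on the same `data` array: `np.vstack` turns 1-D inputs into rows by itself, and `np.pad` adds constant rows directly, which is all that is needed for zero padding a profile.

```python
import numpy as np

data = np.asarray([[1, 2, 3], [4, 5, 6]])

# np.vstack promotes 1-D inputs to rows, so the extra brackets are not needed
stacked = np.vstack(([-1, -1, -1], data, [1, 1, 1]))
print(stacked)

# np.pad adds rows of a constant value; here one zero row above and one below
padded = np.pad(data, ((1, 1), (0, 0)), mode="constant", constant_values=0)
print(padded)
```

The first tuple in the pad width is (rows before, rows after), the second one (columns before, columns after).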


## Concatenating side by side (axis=1)

In [218]:
single1 = np.asarray([[0,0,0], [0,0,0]])
single2 = np.asarray([[1,1,1], [1,1,1]])
middle = np.asarray([[2,2,2], [3,3,3], [4,4,4]])
test1 = np.concatenate((single1,data,single2), axis=1) 
test1

array([[0, 0, 0, 1, 2, 3, 1, 1, 1],
       [0, 0, 0, 4, 5, 6, 1, 1, 1]])

In [219]:
7//2

3

In [269]:
ss_count_matrix 


array([['H', 38, 0],
       ['E', 148, 0],
       ['-', 0, 0],
       ['TOT', 186, 0]], dtype=object)

In [274]:
ss_count_matrix[1:3]

array([['E', 148, 0],
       ['-', 0, 0]], dtype=object)

In [253]:
def test_matrices(dssp_file, profile_file, H_matrix, E_matrix, C_matrix, aa_freq_matrix, ss_count_matrix, window):
    '''
    Adds the window matrix around each position to the matrix of its conformation.
    window is the (odd) window size; all matrices are updated in place.
    '''

    #open the current dssp file and obtain the ss sequence
    with open(dssp_file, "r") as dssp_opened:
        for line in dssp_opened:
            if line[0] == ">":
                continue
            else:
                dssp_seq = line.rstrip()


    #load the current sequence profile and initiate the window matrix with padding before and after the profile
    pad = (int(window)//2)
    padding_matrix = np.zeros((pad, 20))
    profile_matrix = np.loadtxt(profile_file, dtype= 'float64')
    padded_profile = np.concatenate((padding_matrix,profile_matrix,padding_matrix), axis = 0 )
    
#iterate over the dssp sequence and add the current window matrix to the corresponding matrices 
#     c = -1
    
    for i in range(len(dssp_seq)):    # len dssp_seq == number of lines in profile_file!!!
#         c += 1            # do I need to take his indexing? No better way to do this?
        
#         why does he divide by 100 at every step
        window_matrix = np.divide(padded_profile[i:(i+window)],100)
        
        ss = dssp_seq[i]           
                   
        if ss == "H":
            np.add(H_matrix, window_matrix, out = H_matrix)
            np.add(aa_freq_matrix, window_matrix, out = aa_freq_matrix)
            ss_count_matrix[0][1] += 1
            ss_count_matrix[3][1] += 1
            
        elif ss == "E":
            np.add(E_matrix, window_matrix, out = E_matrix)
            np.add(aa_freq_matrix, window_matrix, out = aa_freq_matrix)
            ss_count_matrix[1][1] += 1
            ss_count_matrix[3][1] += 1

        elif ss == "-":
            np.add(C_matrix, window_matrix, out = C_matrix)
            np.add(aa_freq_matrix, window_matrix, out = aa_freq_matrix)
            ss_count_matrix[2][1] += 1
            ss_count_matrix[3][1] += 1
            
        print("**")
        print("Here i", i, "Here window matrix", window_matrix)

    return

In [228]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [255]:
x = np.arange(5)

x

array([0, 1, 2, 3, 4])

In [256]:
np.true_divide(x, 4)


array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [278]:
st='CEEECHHHH'
for i in st:
    print(i)

C
E
E
E
C
H
H
H
H


In [282]:
li = list(st)
type(li)

list

In [279]:
st.count('C')     # call count on the string to search; the symbol to count is the argument

2

In [285]:
from collections import Counter
cnt = Counter(li)

In [286]:
common = cnt.most_common()

In [287]:
common

[('H', 4), ('E', 3), ('C', 2)]

In [288]:
my_arr = np.asarray([[0,1,1],[1,2,2],[2,3,3]])
my_arr

array([[0, 1, 1],
       [1, 2, 2],
       [2, 3, 3]])

In [289]:
print(np.sum(my_arr, axis=0))

[3 6 6]


In [290]:
print(np.sum(my_arr, axis=1))

[2 5 8]


In [296]:
# with open("/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/4ywn.profile") as profile:
profile1 = np.loadtxt('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/seqprofile_training/d1dcea2.profile', dtype=np.float64)    
print(np.sum(profile1, axis=1))

[1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   0.99 1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.01 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.  ]
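
Since a few rows sum to 0.99 or 1.01 because of rounding, an exact comparison with 1 would fail. A tolerant check could look like this sketch; the function name and the tolerance of 0.02 are assumptions, and the small array is made-up demo data rather than a real profile.

```python
import numpy as np

def rows_sum_to_one(profile, tol=0.02):
    """Boolean array: True where a row sum is within tol of 1."""
    return np.isclose(profile.sum(axis=1), 1.0, atol=tol, rtol=0.0)

demo = np.array([[0.55, 0.45, 0.0],
                 [0.50, 0.49, 0.0],     # rounding: sums to 0.99
                 [0.00, 0.00, 0.0]])    # an all-zero row would be flagged
print(rows_sum_to_one(demo))
```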


In [305]:
import pandas as pd


total_R = np.zeros((3, 20))
total_Rdf = pd.DataFrame(data = total_R) # index=row_names
total_Rdf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
cur_window_arr = np.ones((3,20))
cur_window_arr

In [303]:
total_R.shape

(3, 20)

In [309]:
total_Rdf += cur_window_arr 
total_Rdf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
2,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0


### Running on test set

In [323]:

!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles" -s"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/dssp" -w=17




R_H
           A         R         N         D         C         Q         E         G         H         I         L         K         M         F         P         S         T         W         Y         V
0   0.039188  0.015568  0.018886  0.021369  0.004664  0.012274  0.027703  0.031624  0.007053  0.024153  0.037517  0.017981  0.007355  0.013387  0.019722  0.026195  0.022761  0.001810  0.012575  0.027030
1   0.039258  0.016148  0.018144  0.023968  0.003503  0.013573  0.030534  0.033782  0.008747  0.020812  0.037425  0.021044  0.006450  0.010951  0.019490  0.026079  0.022274  0.001879  0.012413  0.024501
2   0.040371  0.015429  0.018028  0.023480  0.003689  0.014362  0.031137  0.029582  0.009234  0.020162  0.039466  0.021601  0.007053  0.012042  0.017216  0.025476  0.023852  0.002088  0.014988  0.023968
3   0.039095  0.016729  0.018770  0.024107  0.003364  0.014617  0.030580  0.026729  0.009350  0.022761  0.040325  0.020603  0.008283  0.014292  0.015406  0.025824  0.025592  0.002135

# Running on *Training Set*

In [None]:

!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset//profiles" -s"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/dssp" -w=17
