%%markdown
# [1. Implementation of the GOR Method](#Implementation-of-GOR)

## [2. GOR Training](#GOR-training)

Training set:
* Set of proteins with known structure
* For each id we have:
    * One file containing the primary sequence
    * One file containing the secondary structure
  
![](imgs/1.png)

Need to count:
* Number of times we observe residue R in conformation S, divided by N, the total number of residues &rarr; joint probability P(R,S) ~ f(R,S) = (# R,S)/N 
* Number of times we observe residue R, divided by N &rarr; marginal probability of residue R, P(R) ~ (# R)/N
* Number of occurrences of conformation S, divided by N &rarr; marginal probability of observing S, P(S) ~ (# S)/N

Observed frequencies in the training set (TS) are used for **estimating**/approximating these probabilities.

Example &rarr; training set containing just 2 sequences (for simplicity). 

![](imgs/2.png)

Defining a table to store counts
* Rows: counts corresponding to joint frequency of R and given SS:
    * \#R, H
    * \#R, E
    * \#R, C
* \# R &rarr; overall frequency of residue type R		

![](imgs/3.png)

Defining a small table that stores the frequencies of helix (H), strand (E) and coil (C).
![](imgs/4.png)

* Each matrix is initialized with zeroes
* Scanning each position &rarr; starting from index 0:
    * reading R and S &rarr; updating the field Pij according to the values.
    
In our example the window size is just 1!
![](imgs/5.png)

* We scan each sequence updating the values in the column
    * updating the counts of 
        * \#R, H
        * \#R, E
        * \#R, C
        
and the counts of total H, E and C in the smaller table.

* To transform the above **counts** into **probabilities**:
    * **Divide** each count by the total length of all sequences used in the training
    
&rarr; in our case we divide by 78:

In [26]:
sequence_1 = 'EYFTLQIRGRERFEMFRELNEALELKDAQAG'
ss_1 = 'CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC'

sequence_2 = 'KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC'
ss_2 = 'CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC'

len(sequence_1+sequence_2)

78
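The counting and normalization steps above can be sketched for the two example sequences as follows. This is only a minimal illustration with window size 1; the names `joint_counts`, `residue_counts`, `ss_counts` and the `p_*` dictionaries are my own and are not part of the final script.

```python
from collections import Counter

sequence_1 = 'EYFTLQIRGRERFEMFRELNEALELKDAQAG'
ss_1 = 'CCCCCCCCCHHHHHHHHHHHHHHHHHHHHCC'
sequence_2 = 'KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC'
ss_2 = 'CEEEEECCCCCCCCCCHHHHHHHHHHCCCCCEEEECCCCCEEEEEEC'

joint_counts = Counter()     # (# R,S)
residue_counts = Counter()   # (# R)
ss_counts = Counter()        # (# S)

# Scan every position of every training sequence and update the three tables
for seq, ss in [(sequence_1, ss_1), (sequence_2, ss_2)]:
    for r, s in zip(seq, ss):
        joint_counts[(r, s)] += 1
        residue_counts[r] += 1
        ss_counts[s] += 1

# N: total number of scanned residue positions
n_total = sum(residue_counts.values())

# Counts -> estimated probabilities
p_joint = {key: count / n_total for key, count in joint_counts.items()}
p_residue = {r: count / n_total for r, count in residue_counts.items()}
p_ss = {s: count / n_total for s, count in ss_counts.items()}

print(n_total)
print(p_ss)
```

Using plain dictionaries keeps the sketch short; the notebook itself stores the same numbers in arrays and dataframes.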

![](imgs/6.png)

## [3. GOR Prediction](#GOR-prediction)

* The GOR model is used for predicting the SS of unseen protein sequences
* Each residue position of a query sequence is analyzed
* For each residue R:
    * The function $S^* = \arg\max_S I(S;R)$ finds the highest-scoring conformation $S^*$ of the residue R
        * The conformation $S^*$ that maximizes the information function $I$ (a log ratio of probabilities) is our predicted conformation
        
![](imgs/7.png)       


Given any sequence:

* >NewSequence
* GLKRR

* Each residue R is located in the table
    * the probabilities for
        * \#R, H
        * \#R, E
        * \#R, C
* Are extracted and used in the function $I$
* The conformation with the highest value is our predicted conformation $S^*$

* Here an example of residue NewSequence[0] = G:



![](imgs/8.png)

![](imgs/9.png)

The maximum is C thus it is predicted that residue G has the conformation C.
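The prediction step for a single residue can be sketched as below. I am assuming the standard GOR definition $I(S;R) = \log \frac{P(R,S)}{P(R)P(S)}$ for the formula shown in the image; the probabilities are toy numbers invented for illustration (not the values from the tables above), and `information` / `predict_residue` are hypothetical helper names.

```python
import math

def information(p_joint_rs, p_r, p_s):
    """I(S;R) = log( P(R,S) / (P(R) * P(S)) )."""
    return math.log(p_joint_rs / (p_r * p_s))

def predict_residue(residue, p_joint, p_residue, p_ss):
    """Return the conformation with the highest I(S;R) and all scores."""
    scores = {}
    for s in p_ss:
        p_rs = p_joint.get((residue, s), 0.0)
        if p_rs > 0:
            scores[s] = information(p_rs, p_residue[residue], p_ss[s])
        else:
            # an unseen (R,S) pair can never win the argmax
            scores[s] = float("-inf")
    best = max(scores, key=scores.get)
    return best, scores

# Toy numbers, invented for illustration only
p_joint = {("G", "H"): 0.01, ("G", "E"): 0.01, ("G", "C"): 0.04}
p_residue = {"G": 0.06}
p_ss = {"H": 0.4, "E": 0.2, "C": 0.4}

best, scores = predict_residue("G", p_joint, p_residue, p_ss)
print(best, scores)
```

A positive score means R is seen in S more often than expected by chance, a negative score less often.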

## [4. Using Windows of Flanking Residues](#Windows) 

* We extend the information function over a 'window' of residues
* Symmetric windows are centered at a given residue position
* The central residue is indexed as $R_0$ and is assigned the conformation $S^*$
    * Residues to the left of $R_0$ hold negative indices down to $-d$
    * Residues to the right of $R_0$ hold positive indices up to $d$
    
* The information function is updated as follows:
![](imgs/10.png)

[p1 45:00]

* The formula requires us to solve terms involving w residues:
    * Exponential number of possible configurations &rarr; computationally too expensive 
        * we have $20^w$ possibilities!
    * Need for a very large DB to estimate reliable distributions
* Simplification:
    * **Assumption of statistical independence**: a simplifying assumption about how the sequence context contributes to the conformation of the central residue.
    * Residues $R_{-d}, \dots , R_d$ are treated as statistically independent
    
![](imgs/11.png)

## [5. Windows Based GOR](#sliding-window) 

* That way we can factorize the joint probability of the full context into the product of the marginal probabilities of the residues in the context.
* Joint probability == product of all the marginal probabilities

&rarr; Keep in mind that residues are NOT independent along the sequence. 

* By using the 
    * chain rule
    * independence assumption 
    * and the fact that the log of a product is the sum of the logs

* $I$ can be rewritten as: 

![](imgs/12.png)

* As shown in the last line above, the joint information function can be written as a sum of individual information functions
* Taking the different residues in the window into consideration
    * Resulting in individual contributions of each residue in the window to the calculation of the joint $I$ function
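The sum over window positions can be sketched as follows. This assumes the per-position term has the usual form $\log \frac{P(R_k,S)}{P(R_k)P(S)}$ (the exact formula is in the image above); the parameter layout (one row per window position, one column per residue) and the function name `window_information` are my own choices, and the parameters are toy values.

```python
import numpy as np

AA = "ARNDCQEGHILKMFPSTWYV"   # column order of the 20 residues

def window_information(seq, pos, joint, marg_r, p_s):
    """
    Sum the individual contributions I(S; R_k) over the window centred at pos.
    joint:  (w, 20) array with P(R_k, S) for one conformation S
    marg_r: (w, 20) array with P(R_k)
    p_s:    scalar P(S)
    Window positions that fall outside the sequence contribute nothing.
    """
    w = joint.shape[0]
    d = w // 2
    total = 0.0
    for k in range(-d, d + 1):
        j = pos + k
        if j < 0 or j >= len(seq):
            continue
        col = AA.index(seq[j])
        total += np.log(joint[k + d, col] / (marg_r[k + d, col] * p_s))
    return total

# Toy parameters: every residue is independent of H, except 'A', which is
# twice as likely in H as expected by chance at every window position
w = 3
marg_r = np.full((w, 20), 1 / 20)
p_h = 0.5
joint_h = marg_r * p_h
joint_h[:, AA.index("A")] *= 2

full = window_information("AAA", 1, joint_h, marg_r, p_h)   # full window
edge = window_information("AAA", 0, joint_h, marg_r, p_h)   # partial window
print(full, edge)
```

To predict, the same sum would be evaluated for H, E and C, and the conformation with the largest value taken as $S^*$.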
    

* What to do when the window falls outside the sequence at the beginning and the end:
    * Initialize the scanning position at an index for which the window is full, e.g. for window size 17, set the first $R_0$ on index 8 of the sequence string
    * Add zeros to the undefined regions of **partial windows**
        * First $R_0$ is on index 0
        * You don't have any contribution from partial windows
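The zero-padding option can be sketched with `np.pad`, which by default adds constant zeros. The generator name `padded_windows` and the toy profile are assumptions for illustration only.

```python
import numpy as np

def padded_windows(profile, window):
    """Yield one (window, n_cols) slice per residue; rows outside the sequence are zero."""
    pad = window // 2
    # add `pad` zero rows above and below, no extra columns
    padded = np.pad(profile, ((pad, pad), (0, 0)))
    for i in range(profile.shape[0]):
        yield padded[i:i + window]

# Toy profile with 4 residues: row i is filled with the value i + 1
toy_profile = np.arange(1, 5, dtype=float).reshape(4, 1) * np.ones((4, 20))
windows = list(padded_windows(toy_profile, window=3))

print(windows[0][:, 0])   # window centred on the first residue
print(windows[3][:, 0])   # window centred on the last residue
```

With this layout the first $R_0$ sits on index 0 and the missing neighbours simply add zero to the counts.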
        
* GOR is a linear model
* The sliding-window approach lowers the prediction accuracy near the sequence ends, where the windows are only partially filled
    * The first (and last) few residues are affected more than residues in the middle of the sequence
   
 
### Now our Parameters are:
* $P(R,S)$, the joint probability of observing residue R in conformation S

### gor_train.py
Trains the GOR model on a user-defined training set and stores all trained parameters (= the GOR model) in an output file

# 1. I need to make several matrices --> np.arrays


first field f[0,0] 
should contain the index of the fields

1. First test on one sequence.
1. Then testing on a simplified profile as input
     * profile should have 5 sequences ---> lines
     
3. Remember that for coil:
    * Training files contain '-' 
    * Blindset contains 'C'  
        * Make an if.
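One way to handle the two coil symbols is to normalize the secondary structure string before counting. A minimal sketch; `normalize_ss` is a hypothetical helper name, and mapping everything that is not H or E to C is my assumption.

```python
def normalize_ss(ss_string):
    """Map every symbol that is not 'H' or 'E' (e.g. '-' or 'C') to 'C'."""
    return "".join(s if s in ("H", "E") else "C" for s in ss_string)

print(normalize_ss("---HHHH-EE"))   # training-set style with '-'
print(normalize_ss("CEEEEECCC"))    # blind-set style with 'C'
```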


In [20]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import argparse

# d1g2ya_.dssp
# d1g2ya_.fasta from training files.

'''Make function: Takes fasta and dssp sequence as parameters as input.'''



# win_size = 3 # has to be an odd number pass through argparse later
# num_rows = win_size*4 #adapts to desired window size

# R_tot
# SS_tot



def make_frequency_array():
    '''
    Takes no arguments (the window size is fixed to 1 here).
    Makes a zero array with a single row and 20 columns, 
    one per standard amino acid. Returns the array.
    '''
    array = np.zeros((1,20))#, dtype= 'float64')
    return array
    
R_H = make_frequency_array()           # generating arrays holding the counts of residue in conformation X --> R_X
R_E = make_frequency_array()
R_C = make_frequency_array()
R_count = make_frequency_array()       # generating array holding the total residue count
SS_count = make_frequency_array()      # generating array holding the total secondary structure count


# ---> at some point: transform 'counts'   into **probabilities** by dividing by total number of residues
# Which is the number of lines of the sequ pro_file.
# profile_matrix = np.loadtxt("/Users/ila/Downloads/test.txt")
# print(profile_matrix)


In [2]:
# ss_array = np.zeros((2,4)) # maybe change it to 1, 3
def make_frequency_df(zeroarray):
    header_col = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
#     row_names = ['#R,H', '#R,E', '#R,C', '#R']
#     freq_array = make_frequency_array(win_size)
    freq_df = pd.DataFrame(data = zeroarray,  columns=header_col)
    return freq_df

# generating dataframes holding the counts of residue in conformation X --> R_X
df_R_H = make_frequency_df(R_H)           
df_R_E = make_frequency_df(R_E)
df_R_C = make_frequency_df(R_C)
df_R_count=make_frequency_df(R_count)
# df_SS_count = make_frequency_df(SS_count)     

tot_H = 0
tot_E = 0
tot_C = 0

def read_clean_lines(infile1):
    ''' Reads all lines from a file. Returns string of second line. The '\n' is stripped.'''
    with open(infile1, 'r') as rfile:
        newline_list = rfile.readlines()
        cleanstring = newline_list[1].rstrip()
        return cleanstring

aa_string = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/fasta/d1g2ya_.fasta')
print(aa_string)

ss_string = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/d1g2ya_.dssp')
print(ss_string)


MVSKLSQLQTEMLAALLESGLSKEALIQALG
---HHHHHHHHHHHHHHH----HHHHHHHH-


In [3]:
df_R_H 
df_R_E 
df_R_C 

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
'''Increments corresponding positions according to R and H in each field.'''
l = len(aa_string)
for i in range(l):
    ss = ss_string[i]
    aa = aa_string[i]
    df_R_count[aa]+=1     # Increment each df
    if ss == 'H':
        df_R_H[aa]+=1
    elif ss == 'E':
        df_R_E[aa]+=1
    else:                       # so if ss == '-' or ss == 'C' or even if i got some X:
        df_R_C[aa]+=1

# check sum of residue R in HEC conformation.

In [5]:
df_R_H

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,4.0,0.0,0.0,0.0,0.0,3.0,3.0,0.0,0.0,1.0,7.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [6]:
df_R_E

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
df_R_C

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0


In [8]:
print("Residue count: \n")
df_R_count

Residue count: 



Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,4.0,0.0,0.0,0.0,0.0,3.0,3.0,2.0,0.0,1.0,8.0,2.0,2.0,0.0,0.0,4.0,1.0,0.0,0.0,1.0


In [None]:
# row_names = ['#R,H', '#R,E', '#R,C', '#R']
# # new = -1 0 1

# win_adapted_row_names = []
# for i in range(3):
#     win_adapted_row_names += row_names[i]+str(i)
    

# print(win_adapted_row_names)

win_size = 9
negwin = -(win_size // 2)
poswin = win_size // 2
indexes = list(range(negwin, poswin + 1))   # window offsets from -4 to 4

print(indexes)    
# print(negwin)
# print(poswin)

### sort() returns None type

&rarr; apply to list directly and return list instead

In [39]:
import os
z= os.listdir('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/fasta/')
a = z.sort()
print('cannot use "a" because it is', a)
print(z)


cannot use "a" because it is None
['4ywn.fasta', '6g3z.fasta']
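The difference between the in-place `list.sort()` and the built-in `sorted()` in one small sketch (the file list is a stand-in for the `os.listdir` output above):

```python
files = ['6g3z.fasta', '4ywn.fasta']   # stand-in for os.listdir(...) output

new_list = sorted(files)      # returns a new sorted list, files is unchanged
in_place = files.sort()       # sorts files itself and returns None

print(new_list)
print(in_place)
print(files)
```

So inside a function either call `.sort()` and return the list itself, or return `sorted(...)` directly.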


Found X in my fasta sequences along the way:
   * checking whether the 'X' sequences have a corresponding profile

In [3]:
# Sequences containing X:
!cat ~/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/findX_ids_only

6l77
6kko
5xyf
5v0m
5uc0
6dhx
5ir2
4y0o
5c5z
6mdw
7bvv
5t2x
4zey
5wd6
6iqo
5wd8 


In [7]:
#Adding '.profile' in vim
!cat ~/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/findX_ids_only

6l77.profile
6kko.profile
5xyf.profile
5v0m.profile
5uc0.profile
6dhx.profile
5ir2.profile
4y0o.profile
5c5z.profile
6mdw.profile
7bvv.profile
5t2x.profile
4zey.profile
5wd6.profile
6iqo.profile
5wd8.profile 


In [10]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [14]:
!ls  /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/ > sequprofile_list

In [15]:
all_profiles = lines_list('./sequprofile_list')

['4uiq.profile', '4y0l.profile', '4y0o.profile', '4yte.profile', '4ywn.profile', '4zc4.profile', '4zey.profile', '4zkp.profile', '4zlr.profile', '5a88.profile', '5abr.profile', '5anp.profile', '5aun.profile', '5av5.profile', '5azw.profile', '5b71.profile', '5bn2.profile', '5bp5.profile', '5bpk.profile', '5bpu.profile', '5bxq.profile', '5c5z.profile', '5c8a.profile', '5ceg.profile', '5ctd.profile', '5d16.profile', '5d6t.profile', '5d71.profile', '5dcf.profile', '5dd8.profile', '5dg6.profile', '5dq0.profile', '5eiv.profile', '5f1s.profile', '5f2a.profile', '5fb9.profile', '5ffl.profile', '5fq0.profile', '5ghl.profile', '5gke.profile', '5gna.profile', '5hjf.profile', '5ht8.profile', '5ib0.profile', '5ii0.profile', '5ir2.profile', '5jsn.profile', '5jwo.profile', '5kqa.profile', '5kwv.profile', '5ldd.profile', '5ltf.profile', '5m9o.profile', '5mc9.profile', '5mmh.profile', '5n07.profile', '5nl9.profile', '5t2y.profile', '5u39.profile', '5u4u.profile', '5u5n.profile', '5u7e.profile', '5uiv.p

In [29]:
def lines_list(file1):
    cleanlines = []
    with open (file1, 'r') as rfile:
        lines_list = rfile.readlines()
        for line in lines_list:
            nonewline = line.rstrip()
            cleanlines.append(nonewline)
    return cleanlines  

In [34]:
path = '../../all_data/blindset/findX_ids_only'
X_seqs = lines_list(path)    
all_profiles = lines_list('./sequprofile_list')

print("X ", len(X_seqs))
print('all ',len(all_profiles))
intersec = set(X_seqs) & set(all_profiles)
print('Matching',len(intersec))

print(intersec)

X  16
all  127
Matching 10
{'4zey.profile', '4y0o.profile', '5c5z.profile', '5wd6.profile', '5ir2.profile', '7bvv.profile', '6iqo.profile', '6mdw.profile', '5xyf.profile', '6l77.profile'}
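Instead of appending '.profile' to the IDs by hand, the extension can be stripped from the file names before comparing. A sketch with a small illustrative subset of the IDs and file names listed above; the variable names are my own.

```python
import os

x_ids = ['6l77', '6kko', '4y0o']                                    # subset of the X-containing IDs
profile_files = ['4y0o.profile', '4ywn.profile', '6l77.profile']    # subset of the profile file names

# '4y0o.profile' -> '4y0o'
profile_ids = {os.path.splitext(name)[0] for name in profile_files}

matching = sorted(set(x_ids) & profile_ids)   # X sequences that have a profile
missing = sorted(set(x_ids) - profile_ids)    # X sequences without a profile

print(matching)
print(missing)
```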


### So I assume I have only 117 blind sequences...
No, I can keep them, because only the "matching residues" from the list of 20 aa are considered when filling the GOR matrices.

In [40]:
cat /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile

0.55	0.45	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.0	0.0	0.45	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	
0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	
0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	0.0	0.0	0.55	
1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.0	0.0	0.45	0.0	0.0	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.45	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
0.45	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.55	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	

In [73]:
import sys
import os
import pandas as pd
import numpy as np

def read_clean_lines(infile1):
    ''' Reads all lines from a file. Returns them as a list of strings (the '\n' is not stripped).'''
    with open(infile1, 'r') as rfile:
        newline_list = rfile.readlines()
#         cleanstring = newline_list[1].rstrip()
        return newline_list

In [128]:
list1 = read_clean_lines('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile')
list1

cleanlines = []
floats = []
# for i in list1:

### Better to load the file directly into an np.array with loadtxt:

In [140]:
profile_arr = np.loadtxt("/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile", usecols=range(0,20), dtype=np.float64)
profile_arr
profile_arr.shape

rows, cols = profile_arr.shape

print(profile_arr[0])
print(profile_arr[1])


[0.55 0.45 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]
[0.   0.   0.45 0.   0.   0.   0.55 0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.  ]


### You can add entire rows of np arrays:

In [138]:
sum_of_2rows = profile_arr[0]+profile_arr[1]
sum_of_2rows

array([0.55, 0.45, 0.45, 0.  , 0.  , 0.  , 0.55, 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

In [137]:
header_col = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
nospace = ""
for i in header_col:    # was 'for i in a:', but 'a' is None after the sort() experiment above
#     print(i, end='')
    nospace+=i
nospace

num_li = []
for j in range(len(header_col)):
    num_li.append(j)
    
print(num_li)    

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]


### Need to check that numcols == numcols no

### Need to check that len(dssp_string) == number of rows in profile
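Both checks could be done up front in a small helper before any counting starts. A sketch under my own assumptions: the name `check_profile_matches_ss` is hypothetical, and raising a `ValueError` is just one possible way to react.

```python
import numpy as np

def check_profile_matches_ss(profile_arr, ss_string, n_cols=20):
    """Raise ValueError if the profile shape does not fit the ss string."""
    rows, cols = profile_arr.shape
    if cols != n_cols:
        raise ValueError(f"profile has {cols} columns, expected {n_cols}")
    if rows != len(ss_string):
        raise ValueError(f"profile has {rows} rows but ss string has length {len(ss_string)}")

# 9 residues, 20 columns: passes silently
check_profile_matches_ss(np.zeros((9, 20)), "CEEEEECCC")

# one residue too few in the ss string: raises
try:
    check_profile_matches_ss(np.zeros((9, 20)), "CEEEEECC")
except ValueError as err:
    message = str(err)
    print(message)
```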

In [152]:
#!/anaconda3/bin/python
import sys
import os
import glob
import pandas as pd
import numpy as np
import argparse 

'''For the first try: Make function: Takes fasta and dssp sequence as parameters as input.'''
# d1g2ya_.dssp
# d1g2ya_.fasta from training files.

win_size = 3 # has to be an odd number pass through argparse later

def make_zero_array(window_size):
    '''
    Takes the window size as argument.
    Makes a zero array with as many rows as the win_size; 
    the number of columns (20) is the number of standard 
    amino acids. Returns the array.
    '''
    array = np.zeros((window_size,20)) # dtype='float64' is already the default, no need to specify it
    return array
    
R_H = make_zero_array(win_size)           # generating arrays holding the counts of residue in conformation X --> R_X
R_E = make_zero_array(win_size)
R_C = make_zero_array(win_size)
R_count = make_zero_array(win_size)       # generating array holding the total residue count
SS_count = make_zero_array(win_size)      # generating array holding the total secondary structure count

def make_frequency_df(zeroarray):
    '''
    Makes a dataframe from a zero array to better visualize what is going on.
    That enables us to index columns by residue name.
    '''
    header_col = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
    # row_names = ['R_0'] #['#R,H', '#R,E', '#R,C', '#R'] # want to implement using -1 0 1 according to window....
#     freq_array = make_zero_array(win_size)
    freq_df = pd.DataFrame(data = zeroarray,  columns=header_col) # index=row_names
    return freq_df

''' generating dataframes holding the counts of residue in conformation X --> R_X'''
df_R_H = make_frequency_df(R_H)           
df_R_E = make_frequency_df(R_E)
df_R_C = make_frequency_df(R_C)
df_R_count=make_frequency_df(R_count)
# df_SS_count = make_frequency_df(SS_count)     

''' generating a smaller dataframe holding the total counts of the conformations'''
ss_array = np.zeros((1,3)) # making array holding total n of R in H, E or C
df_all_SS = pd.DataFrame(data=ss_array, columns=['H', 'E', 'C'], index= ['#S'])
# print("HEC df")
# print(df_all_SS)

def read_clean_lines(infile1):
    ''' Reads all lines from a file. Returns string of second line. The '\n' is stripped.'''
    with open(infile1, 'r') as rfile:
        newline_list = rfile.readlines()
        cleanstring = newline_list[1].rstrip()
        return cleanstring

def read_profile_into_array(infile2):
    '''
    Takes a seq profile file as input. Reads it into an np.array.
    'NaN' column (index 20) is excluded. Returns the array
    '''
    profile_array = np.loadtxt(infile2, usecols=range(0,20), dtype=np.float64) # need to indicate range to get rid of last col containing 'nan'
    return profile_array

def dict_from_2_lists(list1, list2):
    '''
    Makes dictionary from 20 numbers list and all header of profile.
    keys: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] 
    values: ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
    used to access correct column in GOR arrays.
    '''
    keys = list1
    values = list2
    index_dict = dict(zip(keys, values))
    return index_dict

indexes = dict_from_2_lists([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'])

def train_gor(profile, ssfile, RH, RE, RC, total_R, total_SS):
    '''
    Takes as input: (1) np.array of seq profile (2) ss from dssp file (3) dataframes comprising the gor model 
    (RH, RE, RC, total_R and total_SS). Increments corresponding positions according to R in given conformation 
    in each field. Returns the trained GOR model.
    '''
    # aa_string = read_clean_lines(profile)
    profile_arr = read_profile_into_array(profile)          # np.array
    print(profile_arr)

    ss_string = read_clean_lines(ssfile)
    print(ss_string)                                        

    rows, cols = profile_arr.shape                          # len(ss_string) must be == number of rows 

    if len(ss_string) != rows:
        print('Error: ', ssfile, 'length not the same as in profile! ')
        return RH, RE, RC, total_R, total_SS

    for i in range(rows):
        ss = ss_string[i]                   # type of structure at index i
        profile_row = profile_arr[i]              # Profile row at index i
        total_R += profile_row                    # Incrementing each df by adding one row of the profile
        
        if ss == 'H':
            RH += profile_row
            total_SS[ss] += 1

        elif ss == 'E':
            RE += profile_row
            total_SS[ss] += 1

        else:                               
            RC += profile_row
            total_SS['C'] += 1                # If not H or E --> its assigned to 'C' compatible with training and blind files.
    return RH, RE, RC, total_R, total_SS

In [153]:
df_R_H
print(indexes)

{0: 'A', 1: 'R', 2: 'N', 3: 'D', 4: 'C', 5: 'Q', 6: 'E', 7: 'G', 8: 'H', 9: 'I', 10: 'L', 11: 'K', 12: 'M', 13: 'F', 14: 'P', 15: 'S', 16: 'T', 17: 'W', 18: 'Y', 19: 'V'}


In [158]:
RH, RE, RC, total_R, total_SS = train_gor("/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile", "/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/dssp/test_ss.dssp", df_R_H, df_R_E, df_R_C, df_R_count, df_all_SS)
# make_frequency_df(RH)
df_R_H = make_frequency_df(RH)           
df_R_E = make_frequency_df(RE)
df_R_C = make_frequency_df(RC)
df_R_H

[[0.55 0.45 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.45 0.   0.   0.   0.55 0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   0.55 0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.45]
 [0.55 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.45]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.45 0.   0.   0.55]
 [1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.45 0.   0.   0.   0.   0.   0.55 0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.55 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.45 0.   0.
  0.   0.   0.   0.   0.   0.  ]
 [0.45 0.   0.   0.   0.   0.   0.   0.   0.55 0.   0.   0.   0.   0.
  0.   0.   0.   0.   0.   0.  ]]
CEEEEECCC


Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [151]:
df_R_E.loc[[0]]+profile_arr[0]

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.55,0.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### About the axis in np.arrays

Found [here](https://www.sharpsightlabs.com/blog/numpy-axes-explained/); exercises [here](https://machinelearningmastery.com/numpy-axis-for-rows-and-columns/)

Whenever we need to do 
* Column-wise **or**
* Row-wise

operations, we can't rely on the intuitive row and column indexing; we have to use 

    axis 
    
instead!    

Operations such as the sum can be performed 
* Column wise using
    * ```axis=0```
* Row wise using
    * ```axis=1```
* Apply operation to the entire array
    * ```axis=None```
    
Example:    

In [160]:
import numpy as np

In [161]:
#Defining data as a list of lists
data = [[1,2,3], [4,5,6]]

In [167]:
# Converting to np array
data = np.asarray(data)
print(data)

[[1 2 3]
 [4 5 6]]


In [262]:
# Get shape of the array = tuple
axis = data.shape
print(data.shape)

(2, 3)


In [263]:
axis[1]

3

With shape we can see that
* shape[0] (here 2) is the number of rows 
* while shape[1] (here 3) is the number of columns

In [169]:
data.shape[0]

2

In [170]:
data.shape[1]

3

Accessing first row and first column:

In [174]:
data[0,0]

1

Accessing first row and **all** columns

In [177]:
data[0,:]

array([1, 2, 3])

### Sum of entire array:

In [179]:
print(data)

[[1 2 3]
 [4 5 6]]


In [181]:
result = data.sum(axis=None)
print(result)

21


### Summing data by column

In [187]:
col_sum = data.sum(axis=0)
print(col_sum)

[5 7 9]




## Note that this doesn't work

In [217]:
top = np.asarray([-1,-1,-1])
bottom = np.asarray([1,1,1])

print("Shape: ", np.shape(top))

overhang = np.concatenate((top, data, bottom), axis=0)
print(overhang)

Shape:  (3,)


ValueError: all the input arrays must have same number of dimensions

### But this works:

In [216]:
top = np.asarray([[-1,-1,-1]])
bottom = np.asarray([[1,1,1]])

print("Shape: ", np.shape(top))

overhang = np.concatenate((top, data, bottom), axis=0)
print(overhang)


Shape:  (1, 3)
[[-1 -1 -1]
 [ 1  2  3]
 [ 4  5  6]
 [ 1  1  1]]
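If the row already exists as a 1-D array, it does not have to be retyped with double brackets. A small sketch of three equivalent ways to get it into shape:

```python
import numpy as np

data = np.asarray([[1, 2, 3], [4, 5, 6]])
top = np.asarray([-1, -1, -1])            # shape (3,), 1-D

top_2d = top[np.newaxis, :]               # shape (1, 3)
also_2d = top.reshape(1, -1)              # same result via reshape

stacked = np.concatenate((top_2d, data), axis=0)
stacked_v = np.vstack((top, data))        # vstack turns 1-D inputs into rows itself

print(top_2d.shape)
print(stacked)
```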


## Concatenate side by side (along the columns, axis=1)

In [218]:
single1 = np.asarray([[0,0,0], [0,0,0]])
single2 = np.asarray([[1,1,1], [1,1,1]])
middle = np.asarray([[2,2,2], [3,3,3], [4,4,4]])
test1 = np.concatenate((single1,data,single2), axis=1) 
test1

array([[0, 0, 0, 1, 2, 3, 1, 1, 1],
       [0, 0, 0, 4, 5, 6, 1, 1, 1]])

In [219]:
7//2

3

In [269]:
ss_count_matrix 


array([['H', 38, 0],
       ['E', 148, 0],
       ['-', 0, 0],
       ['TOT', 186, 0]], dtype=object)

In [274]:
ss_count_matrix[1:3]

array([['E', 148, 0],
       ['-', 0, 0]], dtype=object)

In [253]:
def test_matrices(dssp_file, profile_file, H_matrix, E_matrix, C_matrix, aa_freq_matrix, ss_count_matrix, window=3):
    # NOTE: 'window' (odd window size) was an undefined global in the first draft; it is passed in explicitly here.

    # open the current dssp file and obtain the ss sequence
    with open(dssp_file, "r") as dssp_opened:
        for line in dssp_opened:
            if line.startswith(">") or not line.strip():   # skip header and empty lines
                continue
            else:
                dssp_seq = line.rstrip()

    # load the current sequence profile and pad it with zero rows before and after the profile
    pad = int(window)//2
    padding_matrix = np.zeros((pad, 20))
    profile_matrix = np.loadtxt(profile_file, dtype='float64')
    padded_profile = np.concatenate((padding_matrix, profile_matrix, padding_matrix), axis=0)

    # iterate over the dssp sequence and add the current window matrix to the corresponding matrices
    for i in range(len(dssp_seq)):    # len(dssp_seq) == number of lines in profile_file!!!

        # open question: why does the reference code divide by 100 at every step?
        window_matrix = np.divide(padded_profile[i:i+window], 100)

        ss = dssp_seq[i]

        if ss == "H":
            np.add(H_matrix, window_matrix, out = H_matrix)
            np.add(aa_freq_matrix, window_matrix, out = aa_freq_matrix)
            ss_count_matrix[0][1] += 1
            ss_count_matrix[3][1] += 1

        elif ss == "E":
            np.add(E_matrix, window_matrix, out = E_matrix)
            np.add(aa_freq_matrix, window_matrix, out = aa_freq_matrix)
            ss_count_matrix[1][1] += 1
            ss_count_matrix[3][1] += 1

        elif ss == "-":
            np.add(C_matrix, window_matrix, out = C_matrix)
            np.add(aa_freq_matrix, window_matrix, out = aa_freq_matrix)
            ss_count_matrix[2][1] += 1
            ss_count_matrix[3][1] += 1

        print("**")
        print("Here i", i, "Here window matrix", window_matrix)

    return

In [228]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [255]:
x = np.arange(5)

x

array([0, 1, 2, 3, 4])

In [256]:
np.true_divide(x, 4)


array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [278]:
st='CEEECHHHH'
for i in st:
    print(i)

C
E
E
E
C
H
H
H
H


### Sum each column in array:

In [288]:
my_arr = np.asarray([[0,1,1],[1,2,2],[2,3,3]])
my_arr

array([[0, 1, 1],
       [1, 2, 2],
       [2, 3, 3]])

In [289]:
print(np.sum(my_arr, axis=0))

[3 6 6]


### Sum each row in array

In [290]:
print(np.sum(my_arr, axis=1))

[2 5 8]


In [296]:
# with open("/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/4ywn.profile") as profile:
profile1 = np.loadtxt('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/seqprofile_training/d1dcea2.profile', dtype=np.float64)    
print(np.sum(profile1, axis=1))

[1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   0.99 1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.01 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.
 1.   1.   1.   1.   1.   1.   1.   1.   1.   1.   1.  ]


In [305]:
import pandas as pd


total_R = np.zeros((3, 20))
total_Rdf = pd.DataFrame(data = total_R) # index=row_names
total_Rdf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
cur_window_arr = np.ones((3,20))
cur_window_arr

In [303]:
total_R.shape

(3, 20)

In [309]:
total_Rdf += cur_window_arr 
total_Rdf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
2,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0


In [327]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

# Testing GOR Training Script on Minimal Example:

Used the first 4 columns of the 4ywn files (dssp and profile).
Note that the matrix contains zero values in all columns after column 4!

I calculated the algorithm by hand on a piece of paper.

The input files looked as follows:

In [325]:
!cat /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/tiny_test/d/tiny4ywn.dssp

>4ywn_A
CEEECHHHH

In [326]:
!cat /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/tiny_test/p/tiny4ywn.profile

0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.06	0.0	0.0	0.5 	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.24	0.08	0.0	0.0 	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.15	0.07	0.0	0.03	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.25	0.02	0.03	0.22	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.0	0.0	0.0	0.0 	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.0	0.72	0.02	0.03	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.05	0.04	0.08	0.28	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
0.79	0.0	0.0	0.0 	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0


## Manual Model for Comparison:

I transferred the computation to a spreadsheet as it looks tidier:

![manual GOR](imgs/sheet.png)

Sheet is [here](https://docs.google.com/spreadsheets/d/1wA5H5ZJFLEsr6_o92mE6oa1UeGyt0lNIjqa28POvPgE/edit?usp=sharing)
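The manual computation can also be cross-checked with a few lines of NumPy. This is only a sketch of the counting idea (zero-padded windows of size 3, counts divided by the sequence length of 9), not the actual `gor_training.py`; only the four non-zero columns A, R, N and D of the tiny profile are typed in.

```python
import numpy as np

ss = "CEEECHHHH"

# First four columns (A, R, N, D) of tiny4ywn.profile; all other columns are zero
first4 = [[0.0,  0.0,  0.0,  0.0],
          [0.06, 0.0,  0.0,  0.5],
          [0.24, 0.08, 0.0,  0.0],
          [0.15, 0.07, 0.0,  0.03],
          [0.25, 0.02, 0.03, 0.22],
          [0.0,  0.0,  0.0,  0.0],
          [0.0,  0.72, 0.02, 0.03],
          [0.05, 0.04, 0.08, 0.28],
          [0.79, 0.0,  0.0,  0.0]]
profile = np.zeros((9, 20))
profile[:, :4] = first4

window = 3
pad = window // 2
padded = np.pad(profile, ((pad, pad), (0, 0)))   # zero rows before and after

# One (window, 20) count matrix per conformation
counts = {s: np.zeros((window, 20)) for s in "HEC"}
for i, s in enumerate(ss):
    counts[s] += padded[i:i + window]

# Counts -> probabilities: divide by the total number of residues
probs = {s: c / len(ss) for s, c in counts.items()}

print(np.round(probs["H"][:, :4], 6))
print(np.round(probs["E"][:, :4], 6))
```

The printed A, R, N and D columns can be compared row by row with the R_H and R_E tables from the test run.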

## Output of Testrun:

shown below

Note that only columns A R N and D are non-zero!

Tiny test successful!

In [96]:

!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -s='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/tiny_test/d' -p='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/tiny_test/p' -w=3




R_H
          A         R         N         D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
0  0.033333  0.086667  0.014444  0.058889  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1  0.093333  0.084444  0.011111  0.034444  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2  0.093333  0.084444  0.011111  0.034444  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0


R_E
          A         R         N         D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
0  0.033333  0.008889  0.000000  0.055556  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1  0.050000  0.016667  0.000000  0.058889  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2  0.071111  0.018889  0.003333  0.027778  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0


R_C
      

### Created 5 CSVs, one per DataFrame

In [97]:
!ls

Steps_GOR_method.ipynb           gor_training_output_SS.csv
gor_training_out_C.csv           imgs
gor_training_out_E.csv           sequprofile_list
gor_training_out_H.csv           unnamed0.csv
gor_training_out_marg_prob_R.csv


In [2]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

df_H = pd.read_csv('gor_training_out_H.csv', index_col=[0]) # index_col indicates that first column is the index
df_E = pd.read_csv('gor_training_out_E.csv', index_col=[0])
df_C = pd.read_csv('gor_training_out_C.csv', index_col=[0])
df_marg_p_R = pd.read_csv('gor_training_out_marg_prob_R.csv', index_col=[0])

SS_df = pd.read_csv('gor_training_output_SS.csv', index_col=[0])


In [5]:
import numpy as np

In [30]:
df_E

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.033333,0.008889,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.05,0.016667,0.0,0.058889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.071111,0.018889,0.003333,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [64]:
SS_df

Unnamed: 0,H,E,C,tot
#S,0.444444,0.333333,0.222222,1.0


In [63]:
df_E.values.sum()

0.34444444444444444

In [61]:
n=np.log(0.3)
m=np.log(0.44)
o=np.log(1.4)

alles = n-(m+o)
alles

-0.7194644888773187

In [85]:
arr1 = np.zeros((2,3))
arr1


array([[0., 0., 0.],
       [0., 0., 0.]])

In [98]:
array1 = np.random.rand(3,20)


# array2 = np.random.rand(2,3)
# dfa = pd.DataFrame(array2)
giac = array1 * df_H  # element-wise product; zero columns of df_H stay zero
giac

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.003019,0.037182,0.002746,0.057375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.029333,0.010193,0.006056,0.031042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.033679,0.012564,0.009638,0.012117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [95]:
print(array1.shape[1])
array1.shape

3


(2, 3)

Zeroes should not be a problem given the size of input later on!

In [18]:
a = np.log(df_H)-np.log(df_E)
a




Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.0,2.277267,inf,0.058269,,,,,,,,,,,,,,,,
1,0.624154,1.622683,inf,-0.536305,,,,,,,,,,,,,,,,
2,0.271934,1.49752,1.203973,0.215111,,,,,,,,,,,,,,,,


In [12]:
ana= df_H/df_E
ana

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,1.0,9.75,inf,1.06,,,,,,,,,,,,,,,,
1,1.866667,5.066667,inf,0.584906,,,,,,,,,,,,,,,,
2,1.3125,4.470588,3.333333,1.24,,,,,,,,,,,,,,,,
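
The `inf` and empty (NaN) entries above come from zero probabilities: a positive value divided by 0 gives `inf`, and 0/0 gives NaN. A minimal sketch of one way to keep a log ratio finite is shown here. Setting undefined entries to 0.0 (a neutral contribution) is an assumption made for illustration, not what the training or prediction scripts are known to do; the arrays `h` and `e` are just the first four values of row 0 of the tiny example, with the last entry set to zero in both.

```python
import numpy as np

# Example values: A, R, N from row 0 of R_H and R_E, plus one all-zero column.
h = np.array([[0.033333, 0.086667, 0.014444, 0.0]])
e = np.array([[0.033333, 0.008889, 0.0, 0.0]])

# Suppress the divide-by-zero / invalid warnings raised by log(0).
with np.errstate(divide="ignore", invalid="ignore"):
    log_ratio = np.log(h) - np.log(e)

# Keep the log ratio only where both probabilities are positive;
# everywhere else fall back to 0.0 (assumed neutral value).
valid = (h > 0) & (e > 0)
safe_log_ratio = np.where(valid, log_ratio, 0.0)

print(safe_log_ratio)
```

An alternative with the same goal would be adding a small pseudocount before taking the logarithm.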


In [135]:
a = SS_df['E'].values[0]
a


0.3333333333333333

In [139]:
df_E/(a*2)

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.05,0.013333,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.075,0.025,0.0,0.088333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.106667,0.028333,0.005,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Running on *Training Set*



In [333]:
!ls /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/seqprofile_training/ | wc -l
!ls /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/ | wc -l

    1260
    1260


**Both sets contain the same number of files &rarr; everything should work just fine.
Let's run the script on all 1260 ids.**

In [104]:

!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/seqprofile_training/" -s"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/" -w=17 -o='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/'


--- 14.257370781898498 minutes ---


In [10]:
!ls /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/

gor_training_out_C.csv           gor_training_out_marg_prob_R.csv
gor_training_out_E.csv           gor_training_output_SS.csv
gor_training_out_H.csv           win_size.txt


In [112]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

Reading matrix from csv

In [126]:
import numpy as np
# To ensure that none of the outputs are truncated with '...'
np.set_printoptions(threshold=np.inf)
import pandas as pd
# To ensure that none of the outputs are truncated with '...'
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

H = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_H.csv', index_col=[0])
E = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_E.csv', index_col=[0])
C = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_C.csv', index_col=[0])
R = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_marg_prob_R.csv', index_col=[0])

SS = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_output_SS.csv', index_col=[0])
# profile_arr = np.loadtxt("/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/test_data/profiles/test_profile", usecols=range(0,20), dtype=np.float64)

In [None]:
import os, sys, glob

In [135]:
def listdir_nohidden(path):
    '''
    To ignore hidden files from os.listdir.
    Returns only 'nonhidden files' from directory
    --> the glob pattern * matches all files that are NOT hidden
    '''
    return glob.glob(os.path.join(path, '*'))

listdir_nohidden('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/')

['/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_marg_prob_R.csv',
 '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_H.csv',
 '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_C.csv',
 '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_E.csv',
 '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_output_SS.csv']

In [231]:
display(H)
display(SS)

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.028754,0.017102,0.013535,0.019515,0.004146,0.013761,0.02511,0.021407,0.007526,0.018959,0.03218,0.019663,0.007632,0.012545,0.014265,0.020864,0.017838,0.003864,0.010187,0.021384
1,0.02955,0.017285,0.01357,0.01971,0.004163,0.014001,0.025805,0.021055,0.007421,0.018998,0.032526,0.019747,0.007695,0.012581,0.013915,0.02098,0.017904,0.003887,0.010192,0.02139
2,0.030365,0.017361,0.01347,0.019852,0.00423,0.014275,0.026298,0.020455,0.007393,0.01933,0.033343,0.019885,0.007809,0.012687,0.013487,0.02092,0.017894,0.003923,0.010111,0.021364
3,0.03128,0.017711,0.0134,0.01989,0.004216,0.014621,0.026984,0.019793,0.007308,0.019616,0.034164,0.020205,0.007921,0.012582,0.012993,0.020715,0.017742,0.003931,0.010053,0.021302
4,0.032149,0.018008,0.013332,0.019899,0.00409,0.015062,0.027841,0.01868,0.007239,0.019948,0.034934,0.02059,0.008034,0.012527,0.012392,0.020558,0.017568,0.003982,0.010054,0.021312
5,0.033588,0.01823,0.012978,0.019631,0.004094,0.015281,0.02833,0.017266,0.007187,0.020512,0.03638,0.020783,0.008251,0.0128,0.011463,0.020091,0.017194,0.004117,0.010179,0.021381
6,0.03503,0.018607,0.01265,0.019264,0.004092,0.015527,0.02884,0.015378,0.006949,0.020967,0.038521,0.021133,0.008519,0.012987,0.010322,0.019653,0.016628,0.00431,0.010249,0.021239
7,0.036131,0.019466,0.012449,0.018836,0.00403,0.016009,0.029742,0.013953,0.00691,0.021096,0.039117,0.022171,0.008355,0.012909,0.008907,0.019422,0.016279,0.004381,0.010265,0.021077
8,0.037766,0.020316,0.011195,0.016121,0.004018,0.016658,0.030587,0.012337,0.006944,0.021896,0.041744,0.023299,0.008817,0.013544,0.007327,0.017279,0.014817,0.004491,0.010674,0.021746
9,0.037672,0.020439,0.01231,0.016067,0.004263,0.016932,0.030024,0.014807,0.007269,0.021368,0.041631,0.023409,0.008916,0.013373,0.004551,0.017749,0.014613,0.004276,0.010627,0.021121


Unnamed: 0,H,E,C,tot
#S,0.355109,0.221427,0.423465,1.0


In [12]:
H = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_H.csv', index_col=[0])
E = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_E.csv', index_col=[0])
C = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_C.csv', index_col=[0])
R = pd.read_csv('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/gor_training_out_marg_prob_R.csv', index_col=[0])

display(H)

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.031541,0.01876,0.014847,0.021406,0.004548,0.015094,0.027544,0.023482,0.008255,0.020796,0.035299,0.021568,0.008371,0.01376,0.015647,0.022886,0.019566,0.004238,0.011174,0.023456
1,0.032212,0.018842,0.014792,0.021486,0.004537,0.015262,0.028129,0.022952,0.008089,0.02071,0.035455,0.021525,0.008388,0.013714,0.015169,0.02287,0.019517,0.004237,0.01111,0.023317
2,0.032896,0.018808,0.014593,0.021507,0.004583,0.015465,0.02849,0.022161,0.008009,0.020941,0.036123,0.021543,0.00846,0.013745,0.014611,0.022664,0.019386,0.00425,0.010954,0.023145
3,0.033682,0.019071,0.014429,0.021417,0.00454,0.015744,0.029056,0.021313,0.007869,0.021122,0.036787,0.021757,0.00853,0.013548,0.013991,0.022306,0.019104,0.004233,0.010825,0.022938
4,0.034409,0.019274,0.014269,0.021298,0.004377,0.016121,0.029799,0.019993,0.007748,0.02135,0.03739,0.022038,0.008599,0.013408,0.013263,0.022003,0.018803,0.004262,0.010761,0.02281
5,0.035736,0.019396,0.013807,0.020886,0.004355,0.016258,0.030142,0.01837,0.007647,0.021824,0.038706,0.022112,0.008779,0.013618,0.012196,0.021376,0.018294,0.00438,0.01083,0.022748
6,0.037054,0.019682,0.013381,0.020377,0.004329,0.016424,0.030507,0.016267,0.00735,0.022178,0.040747,0.022355,0.009011,0.013738,0.010918,0.020788,0.017589,0.004559,0.010841,0.022466
7,0.038005,0.020475,0.013094,0.019813,0.004239,0.016839,0.031284,0.014677,0.007268,0.02219,0.041145,0.023321,0.008788,0.013578,0.009369,0.02043,0.017123,0.004608,0.010797,0.02217
8,0.039516,0.021258,0.011713,0.016868,0.004205,0.01743,0.032005,0.012909,0.007266,0.022911,0.043678,0.024379,0.009226,0.014172,0.007666,0.01808,0.015504,0.0047,0.011169,0.022753
9,0.03962,0.021496,0.012947,0.016898,0.004484,0.017808,0.031576,0.015573,0.007645,0.022473,0.043784,0.02462,0.009377,0.014065,0.004786,0.018667,0.015369,0.004498,0.011177,0.022214


# Converting to probabilities:

**Q:** When calculating the frequencies for the different amino acids and conformations, is the total I divide by different for each window position, or is it always the same?

**A:** In principle it should be different, because the positions away from the centre of the window (-1, +1, -d, +d, ...) are observed fewer times than the central one. To get proper probabilities you should divide by the number of times you actually see a -k or a +k position. To be sure you are getting 'proper probabilities', the best thing to do is to divide by the sum of the counts in the corresponding row of the overall (marginal) #R matrix.
For a window size of 3 you observe position R-1 and position R+1 (n-1) times, where n is the total number of residues used in the training. So n is the number you obtain when summing the #R_0 row (the total number of residues). If you sum #R_-1 you will get (n-1).
Thus, to get a proper normalization, take the final #R (marginal) matrix and sum each row #R_d. Each row will yield a different value. These values are then used to normalize their corresponding rows → R_ss_-1/sum(#R_-1).

**Q:** Ah, so I cannot use the total count for normalizing each window position?

**A:** No! If you have many sequences, this would be something like n - (number of sequences * d), so it is a bit awkward to compute directly. The easiest way is to scan the entire training set and compute all the counts R_H, R_E, R_C, #R, #SS. Then look at the #R matrix and sum each row to obtain the n specific to position d within the window. Each particular n of #R_d must be used to normalize the corresponding position in all the matrices.

Prof's remark, later on: Even if you use the overall N, you are computing a log-odds and the denominator is always the same ⇒ the result does not change by much. But to obtain proper probabilities you should divide by the window-position-specific sum n_d. E.g. n_0 = sum of #R_0 ⇒ divide all H_0, E_0, C_0 rows by n_0. Or n_7 = sum of #R_7 ⇒ divide all H_7, E_7, C_7 rows by n_7!

* Thus I have to recalculate the probabilities using the correct **n_d**
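
To make the position dependence of n_d concrete, here is a small sketch that counts how often each window offset is actually observed. It assumes plain sequences where every residue counts once (with profiles, the row sums of the #R matrix play this role instead). The function name `observations_per_offset` and the two sequence lengths (31 and 47) are illustrative assumptions.

```python
import numpy as np


def observations_per_offset(lengths, window):
    """Number of times each window offset d is observed over all sequences."""
    half = window // 2
    offsets = np.arange(-half, half + 1)           # e.g. [-1, 0, 1] for window 3
    lengths = np.asarray(lengths)
    # A sequence of length L offers L - |d| slots at offset d (never fewer than 0).
    slots = lengths[None, :] - np.abs(offsets)[:, None]
    return np.clip(slots, 0, None).sum(axis=1)


n_d = observations_per_offset([31, 47], window=3)
print(n_d)  # [76 78 76]
```

The central position is seen n = 78 times, while each neighbouring position loses one observation per sequence, which is why every row needs its own normalization constant.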

In [212]:

n_series = R.sum(axis=1)
n_series

0     0.911652
1     0.917366
2     0.923045
3     0.928692
4     0.934314
5     0.939895
6     0.945370
7     0.950700
8     0.955707
9     0.950822
10    0.945537
11    0.940083
12    0.934533
13    0.928915
14    0.923266
15    0.917551
16    0.911830
dtype: float64

In [226]:
n_d = pd.DataFrame(data=n_series, columns=['sum = n_d'])
n_d

Unnamed: 0,sum = n_d
0,0.911652
1,0.917366
2,0.923045
3,0.928692
4,0.934314
5,0.939895
6,0.94537
7,0.9507
8,0.955707
9,0.950822


In [232]:
# names H E C
display(H)

Unnamed: 0,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
0,0.028754,0.017102,0.013535,0.019515,0.004146,0.013761,0.02511,0.021407,0.007526,0.018959,0.03218,0.019663,0.007632,0.012545,0.014265,0.020864,0.017838,0.003864,0.010187,0.021384
1,0.02955,0.017285,0.01357,0.01971,0.004163,0.014001,0.025805,0.021055,0.007421,0.018998,0.032526,0.019747,0.007695,0.012581,0.013915,0.02098,0.017904,0.003887,0.010192,0.02139
2,0.030365,0.017361,0.01347,0.019852,0.00423,0.014275,0.026298,0.020455,0.007393,0.01933,0.033343,0.019885,0.007809,0.012687,0.013487,0.02092,0.017894,0.003923,0.010111,0.021364
3,0.03128,0.017711,0.0134,0.01989,0.004216,0.014621,0.026984,0.019793,0.007308,0.019616,0.034164,0.020205,0.007921,0.012582,0.012993,0.020715,0.017742,0.003931,0.010053,0.021302
4,0.032149,0.018008,0.013332,0.019899,0.00409,0.015062,0.027841,0.01868,0.007239,0.019948,0.034934,0.02059,0.008034,0.012527,0.012392,0.020558,0.017568,0.003982,0.010054,0.021312
5,0.033588,0.01823,0.012978,0.019631,0.004094,0.015281,0.02833,0.017266,0.007187,0.020512,0.03638,0.020783,0.008251,0.0128,0.011463,0.020091,0.017194,0.004117,0.010179,0.021381
6,0.03503,0.018607,0.01265,0.019264,0.004092,0.015527,0.02884,0.015378,0.006949,0.020967,0.038521,0.021133,0.008519,0.012987,0.010322,0.019653,0.016628,0.00431,0.010249,0.021239
7,0.036131,0.019466,0.012449,0.018836,0.00403,0.016009,0.029742,0.013953,0.00691,0.021096,0.039117,0.022171,0.008355,0.012909,0.008907,0.019422,0.016279,0.004381,0.010265,0.021077
8,0.037766,0.020316,0.011195,0.016121,0.004018,0.016658,0.030587,0.012337,0.006944,0.021896,0.041744,0.023299,0.008817,0.013544,0.007327,0.017279,0.014817,0.004491,0.010674,0.021746
9,0.037672,0.020439,0.01231,0.016067,0.004263,0.016932,0.030024,0.014807,0.007269,0.021368,0.041631,0.023409,0.008916,0.013373,0.004551,0.017749,0.014613,0.004276,0.010627,0.021121


In [4]:

# for i in range(n_d.shape[0]):
#       prob_H = (H.loc[i]/n_d.loc[i])

import pandas as pd
import numpy as np

data1 = {"a":[1.,3.,5.,2.],
         "b":[4.,8.,3.,7.],
         "c":[5.,45.,67.,34]}
data2 = {"a":[4., 2., 11, 5]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2) 

display(df1)
display(df2)

wrong_div = df1.div(df2, axis='columns')
display(wrong_div)

also_wrong = df1/df2
display(also_wrong)

# or you can use df1/df2.values[0,:]

Unnamed: 0,a,b,c
0,1.0,4.0,5.0
1,3.0,8.0,45.0
2,5.0,3.0,67.0
3,2.0,7.0,34.0


Unnamed: 0,a
0,4.0
1,2.0
2,11.0
3,5.0


Unnamed: 0,a,b,c
0,0.25,,
1,1.5,,
2,0.454545,,
3,0.4,,


Unnamed: 0,a,b,c
0,0.25,,
1,1.5,,
2,0.454545,,
3,0.4,,
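
Both attempts produce NaN in columns `b` and `c` because pandas aligns two DataFrames on their column labels, and `df2` only has a column `a`. A minimal sketch of a row-wise division follows: pass the divisor as a Series and set `axis=0`, so that it is matched on the row index instead. The variable name `row_wise` is made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1., 3., 5., 2.],
                    "b": [4., 8., 3., 7.],
                    "c": [5., 45., 67., 34.]})
df2 = pd.DataFrame({"a": [4., 2., 11., 5.]})

# Divide every row of df1 by the value of df2['a'] with the same index label.
row_wise = df1.div(df2["a"], axis=0)
print(row_wise)
```

Applied to the training output, the same pattern would divide each window-position row of a matrix by its own n_d.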


In [263]:
display(df1, df2)

Unnamed: 0,a,b,c
0,1.0,4.0,5.0
1,3.0,8.0,45.0
2,5.0,3.0,67.0
3,2.0,7.0,34.0


Unnamed: 0,a
0,4.0
1,2.0
2,11.0
3,5.0


In [8]:
# for i in range(df1.shape[0]):
# #     print(df2.loc[i])
#     print(df1.loc[i]/df2.loc[i])
    
# display(df1)    

df3 = np.divide(df1, df2)
df3.to_csv('test.csv')

## Running gor_train.py

In [9]:
# Re-ran --> probabilities had to be recalculated
!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/seqprofile_training/" -s"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/" -w=17 -o='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/'


--- 14.416290330886842 minutes ---


## Rerun after improving speed by changing dataframes to np.arrays

In [51]:

!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/seqprofile_training/" -s"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/" -w=17 -o='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out_improved/'


--- 4.130448051293691 minutes ---


In [5]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR'

In [158]:
win_arr = np.array([17])

In [161]:
np.savetxt('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR/win_size.txt',win_arr)
a = np.loadtxt('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/documentation_notebooks/GOR/win_size.txt', dtype=np.int32)

In [184]:
!mv win_size.txt /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/
!ls /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/

gor_training_out_C.csv           gor_training_out_marg_prob_R.csv
gor_training_out_E.csv           gor_training_output_SS.csv
gor_training_out_H.csv           win_size.txt


Finding a way to get the basename:
* throwing away the path

In [186]:

ID = os.path.basename('/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/4uiq.profile')[:-8]
ID

'4uiq'
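
Slicing with `[:-8]` only works as long as the extension is exactly `.profile`. A slightly more general sketch uses `os.path.splitext`, which splits off whatever extension the file has; the variable names `seq_id` and `ext` are chosen here for illustration.

```python
import os

path = '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/4uiq.profile'

# basename drops the directories, splitext separates name and extension.
seq_id, ext = os.path.splitext(os.path.basename(path))
print(seq_id, ext)
```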

# Running gor_predict.py on Blind Set

For testing purposes...
All outputs are saved to:

    '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_predict_out'

In [190]:

!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_prediction.py -m='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out' -q='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/' -o='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_predict_out'


--- 1.7891995310783386 minutes ---


In [14]:
# rerun after fixing probabilities of HEC and R
!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_prediction.py -m='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out' -q='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/' -o='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_predict_out'


--- 1.5586898803710938 minutes ---


# Re-running gor_predict.py on Blind Set

Modified script: replaced dataframes with np.arrays wherever useful

### As arrays seem to be processed much faster than pandas dataframes:

I've modified the code to use np.arrays wherever possible.
Rerunning to test if the performance improves as expected.

In [53]:
!chmod u+x /Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_predict_arr.py

In [60]:

!/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_predict_arr.py -m='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out_improved/' -q='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/blindset/seqprofile_blind/' -o='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_pred_out_improved/'


--- 2.1680989265441895 seconds ---


## Checking if prediction remained the same:

Result: no diff

In [61]:
!diff -q /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_predict_out/ /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_pred_out_improved/

In [191]:
!ls '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_predict_out'

4uiq.dssp 5bn2.dssp 5eiv.dssp 5kqa.dssp 5uni.dssp 5zry.dssp 6ia7.dssp 6ovi.dssp
4y0l.dssp 5bp5.dssp 5f1s.dssp 5kwv.dssp 5v2i.dssp 6aoz.dssp 6iqo.dssp 6q7n.dssp
4y0o.dssp 5bpk.dssp 5f2a.dssp 5ldd.dssp 5vog.dssp 6dew.dssp 6isu.dssp 6q8j.dssp
4yte.dssp 5bpu.dssp 5fb9.dssp 5ltf.dssp 5wd6.dssp 6dn4.dssp 6j0y.dssp 6r5w.dssp
4ywn.dssp 5bxq.dssp 5ffl.dssp 5m9o.dssp 5wnw.dssp 6ei6.dssp 6k6l.dssp 6r82.dssp
4zc4.dssp 5c5z.dssp 5fq0.dssp 5mc9.dssp 5woq.dssp 6exx.dssp 6k7q.dssp 6slk.dssp
4zey.dssp 5c8a.dssp 5ghl.dssp 5mmh.dssp 5wuj.dssp 6fsf.dssp 6kok.dssp 6t7o.dssp
4zkp.dssp 5ceg.dssp 5gke.dssp 5n07.dssp 5x4b.dssp 6fwt.dssp 6l77.dssp 6usc.dssp
4zlr.dssp 5ctd.dssp 5gna.dssp 5nl9.dssp 5xga.dssp 6g3z.dssp 6ltz.dssp 6vci.dssp
5a88.dssp 5d16.dssp 5hjf.dssp 5t2y.dssp 5xks.dssp 6gbi.dssp 6md3.dssp 6vk4.dssp
5abr.dssp 5d6t.dssp 5ht8.dssp 5u39.dssp 5xvk.dssp 6gw6.dssp 6mdw.dssp 6wk3.dssp
5anp.dssp 5d71.dssp 5ib0.dssp 5u4u.dssp 5xyf.dssp 6h9e.dssp 6mlx.dssp 6yj1.dssp
5aun.dssp 5dcf.dssp 5ii0.dssp 5u5n.dssp 

In [192]:
!ls '/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_predict_out' |wc -l

     127


In [42]:
a = np.asarray([[5, 2, 3], [10, 4, 6]])



In [43]:
b =  np.sum(a, axis=1)
b

array([10, 20])

In [45]:
b.shape
type(b)

numpy.ndarray

In [49]:
d = b.reshape(2,1)
d

array([[10],
       [20]])

In [50]:
a/d

array([[0.5, 0.2, 0.3],
       [0.5, 0.2, 0.3]])

In [29]:
c = np.sum(a, axis=1)
c

array([ 6, 15])

In [39]:
c.shape
type(b)

numpy.ndarray

In [35]:
d = c.reshape(1,2)
d

array([[ 6, 15]])

In [33]:
d = np.divide(a, c.T)

ValueError: operands could not be broadcast together with shapes (2,3) (2,) 

In [41]:
a/(d.T)


array([[0.16666667, 0.33333333, 0.5       ],
       [0.26666667, 0.33333333, 0.4       ]])
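
The reshape step in these experiments can be avoided by asking NumPy to keep the summed axis. A minimal sketch, assuming the 2x3 example array with values 1 to 6 that yields the row sums 6 and 15:

```python
import numpy as np

a = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# keepdims=True returns shape (2, 1) instead of (2,),
# so the division broadcasts across the columns of each row.
row_sums = a.sum(axis=1, keepdims=True)
normalized = a / row_sums

print(row_sums.shape)
print(normalized)
```

A plain `(2,)` vector fails here because broadcasting compares shapes from the last axis backwards, so it is matched against the 3 columns rather than the 2 rows.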

In [8]:
for i in {0..4}
do
    # Vars that remain the same in each iteration:
    p="\"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset\""
    s="\"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp\""
    ipath="\"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/test_cv_ids/set${i}_clean\""
    o="\"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/train${i}\""
    echo /Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="$p" -s="$s" -i="$ipath" -w="17" -o="$o"
done

/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset" -s="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp" -i="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/test_cv_ids/set0_clean" -w=17 -o="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/train0"
/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset" -s="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp" -i="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/test_cv_ids/set1_clean" -w=17 -o="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out/train1"
/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset" -s="/Users/ila/01-Unibo/02_L

In [7]:
for i in {0..4}
do
head -5 /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/cv_ids_clean/set${i}_clean > /Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/test_cv_ids/train${i}
done


In [None]:

# !/Users/ila/01-Unibo/02_Lab2/files_lab2_project/scripts/gor_training.py -p="/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/seqprofile_training/" -s"/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/trainingset/dssp/" -w=17 -o='/Users/ila/01-Unibo/02_Lab2/files_lab2_project/all_data/outputs/gor_train_out_improved/'
