# Assignment 1 by Lennart Tuijnder (VUB)

All the exploratory work and fiddling around can be found at: https://github.com/ltuijnder/BioInformatics_Assignment1
This notebook summarises all the important analysis and describes the code.

# Part 1: GOR algorithm

Let's start by importing the data, introducing along the way the concepts used in the analysis.

## Importing the data: 

### Create a dataframe
We have been given two input data files, `dssp_info.txt` and `stride_info.txt`, in which each row is one amino acid entry of a given protein.

The data file is imported with the `read_csv` function of the `pandas` module, which returns a `pandas.DataFrame` object, basically a fancy Excel table. We can tell the dataframe to group the data on the protein column (`PDB_code`), so that we can easily loop over the data protein by protein.

We should also group the data by `PDB_chain_code`, since our data sets contain proteins that are present multiple times (4 times "1n7s" and 2 times "1wmh") but with a different chain code attached to them. One protein can fold in different ways and hence form different secondary structures (SS), so for our analysis with GOR we treat them as different proteins.

Now that we have grouped our dataframe, we can loop over each individual protein and access its specific information.
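
As a minimal sketch of this grouping step, the toy table below uses the same five column names as the input files. All values in it are made up for illustration, and the residue and chain labels are assumptions about the file contents.

```python
import pandas as pd

# Toy rows in the same five-column layout as the input files (values are made up).
toy = pd.DataFrame(
    {
        "PDB_code": ["1abc", "1abc", "1abc", "1abc", "2xyz"],
        "PDB_chain_code": ["A", "A", "B", "B", "A"],
        "PDB_seq_code": [1, 2, 1, 2, 1],
        "residue_name": ["ALA", "GLY", "ALA", "SER", "LEU"],
        "SS": ["H", "H", "C", "E", "C"],
    }
)

# Grouping on both columns keeps chains A and B of "1abc" apart.
# sort=False preserves the order in which the proteins appear in the table.
chains = {}
for (pdb_code, chain_code), rows in toy.groupby(["PDB_code", "PDB_chain_code"], sort=False):
    chains[pdb_code + "_" + chain_code] = rows["residue_name"].tolist()

print(chains)
```

Grouping on `PDB_code` alone would have merged the two chains of "1abc" into a single sequence, which is exactly what the second grouping key avoids.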

### Converting residues to numbers

Next, we turn our attention to the residues of the proteins. These will be converted to numbers.
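
The `createDataset` cell applies a lookup called `aminoToNumber` with `.map(...).fillna(-1)`. The sketch below illustrates that pattern only. The mapping `amino_to_index` and its alphabetical one-letter ordering are assumptions made for this example, and the notebook's own table may be ordered or keyed differently.

```python
import pandas as pd

# Hypothetical mapping: the 20 standard amino acids in alphabetical order of
# their one-letter codes. The notebook's own aminoToNumber may differ.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"
amino_to_index = {letter: i for i, letter in enumerate(STANDARD)}

residues = pd.Series(list("ACXYW"))  # "X" is not a standard residue

# Unknown keys become NaN under .map, so fillna(-1) flags them explicitly.
encoded = residues.map(amino_to_index).fillna(-1).astype(int)

print(encoded.tolist())  # [0, 1, -1, 19, 18]
```

Using -1 for unmatched residues keeps the valid codes in the range 0 to 19, so they can be used directly as array indices later on.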

In [26]:
import numpy as np
import pandas as pd

# aminoToNumber and SStoNumber (residue -> int, SS letter -> int) are assumed to be defined in an earlier cell.

def createDataset(dataSet = "dssp", protFamily = None):
    # MODIFIED sequence 1fsg in stride data set!
    columns = ["PDB_code", "PDB_chain_code", "PDB_seq_code", "residue_name", "SS"]
    df = pd.read_csv("inputData/"+dataSet+"_info.txt", sep="\t", header=None, names=columns)
    # Residues that do not match (the non-standard amino acids) are set to -1.
    df['residue_name'] = pd.to_numeric(df['residue_name'].map(aminoToNumber).fillna(-1), downcast='signed')
    df["SS"] = pd.to_numeric(df["SS"].map(SStoNumber), downcast="signed")
    
    if protFamily is not None:
        # Evaluate here the protein family specific dataset.
        pass
    
    grouped = df.groupby(["PDB_code","PDB_chain_code"], sort=False)

    proteinDict = {}
    for (proteinCode, chainCode), proteinData in grouped:
        rPosition = proteinData["PDB_seq_code"].values
        skipped = np.diff(rPosition) - 1  # number of residues missing between consecutive rows
        hasSkipped = np.any(skipped > 0)  # only forward jumps in the numbering count as gaps
        
        sequence = proteinData["residue_name"].values
        SS = proteinData["SS"].values
        
        if hasSkipped:
            indexesSkipped = np.where(skipped > 0)[0]
        
            # Pad every gap with -1's. Collect the pieces in lists first and concatenate
            # once at the end, so the indices into the original arrays stay valid.
            newSequence = []
            newSS = []
            
            previousIndex = 0
            for index in indexesSkipped:
                # index+1 because the residue at `index` still belongs to the section before the gap.
                newSequence.append(sequence[previousIndex:index+1])
                newSS.append(SS[previousIndex:index+1])
                
                # Add at most 8 fillers per gap (8 is the half-width of the GOR window).
                numberFill = min(skipped[index], 8)
                # np.full with the original dtype keeps the arrays integer; -np.ones would make them float.
                newSequence.append(np.full(numberFill, -1, dtype=sequence.dtype))
                newSS.append(np.full(numberFill, -1, dtype=SS.dtype))
                previousIndex = index+1
                
            newSequence.append(sequence[previousIndex:])
            newSS.append(SS[previousIndex:])
            
            sequence = np.concatenate(newSequence)
            SS = np.concatenate(newSS)
            
        proteinDict[proteinCode+"_"+chainCode] = (sequence, SS)
    
    return proteinDict
#dsspDictionary = createDataset()
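
To make the gap handling in `createDataset` easier to follow, here is a small stand-alone sketch of the same idea on toy arrays. The helper `pad_gaps` and its parameters are hypothetical names introduced for this illustration; they are not part of the notebook.

```python
import numpy as np

def pad_gaps(positions, values, max_fill=8, filler=-1):
    """Insert `filler` wherever `positions` jumps by more than 1 (hypothetical helper)."""
    positions = np.asarray(positions)
    values = np.asarray(values)
    missing = np.diff(positions) - 1  # residues missing after each row
    pieces = []
    start = 0
    for index in np.flatnonzero(missing > 0):
        # Section up to and including the row just before the gap.
        pieces.append(values[start:index + 1])
        # Fillers for the gap itself, capped at max_fill.
        pieces.append(np.full(min(missing[index], max_fill), filler, dtype=values.dtype))
        start = index + 1
    pieces.append(values[start:])  # remainder after the last gap
    return np.concatenate(pieces)

# Positions 3-4 are missing (gap of 2); positions 7-16 are missing (gap of 10, capped at 8).
padded = pad_gaps([1, 2, 5, 6, 17], [10, 11, 12, 13, 14])
print(padded.tolist())
```

Capping the filler at 8 is enough to keep residues on either side of a long gap out of each other's window, since the window used further on reaches at most 8 positions away.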

## The GOR algorithm and the frequency table

The GOR algorithm is based on information theory and therefore works with probability distributions. With GOR 3 we evaluate the probability of finding a certain residue $r_{j+m}$ at $m \in \{-8, \ldots, +8\}$ steps away from a position $j$ that has SS $s_j$ and residue $r_j$.

These distributions will be approximated with frequencies or counts (in this text, frequencies and counts are used interchangeably). For the GOR 3 algorithm we need to be able to count the following frequency configuration: $$ F_{s_j,m,r_{j+m},r_j}$$
which represents the number of times that position $j$ had SS $s_j$ and residue $r_j$ while position $j+m$ had residue $r_{j+m}$. All probability distributions that need to be approximated can be stored in this giant four-dimensional matrix, which has the following size: $$ (F_{s_j,m,r_{j+m},r_j}).\mathrm{shape}=(3,17,20,20)$$
The first 3 is because the SS at position $j$ can be Helix (H), Sheet (E) or Coil (C). There are 17 different $m$ values: 8 to the left of $j$ (the negative $m$ values), 8 to the right (the positive $m$ values) and $m=0$, which gives 17 configurations.
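
A minimal sketch of how such a table could be filled is given below. It assumes residues are coded 0 to 19, SS states 0 to 2, and -1 marks anything to be skipped. The function name `count_frequencies` and the index order `F[s_j, m + 8, r_{j+m}, r_j]` are choices made for this example, not necessarily those of the actual implementation.

```python
import numpy as np

N_SS, WINDOW, N_AMINO = 3, 8, 20

def count_frequencies(proteins):
    """Fill F[s_j, m + 8, r_{j+m}, r_j] from a dict of (sequence, SS) integer arrays.

    Assumes residues are coded 0..19 and SS 0..2, with -1 marking entries to skip.
    """
    F = np.zeros((N_SS, 2 * WINDOW + 1, N_AMINO, N_AMINO), dtype=np.int64)
    for sequence, ss in proteins.values():
        sequence = np.asarray(sequence, dtype=int)
        ss = np.asarray(ss, dtype=int)
        for j in range(len(sequence)):
            if sequence[j] < 0 or ss[j] < 0:
                continue  # unknown residue or gap filler at the centre position
            for m in range(-WINDOW, WINDOW + 1):
                k = j + m
                # Skip window positions outside the chain or holding a -1.
                if 0 <= k < len(sequence) and sequence[k] >= 0:
                    F[ss[j], m + WINDOW, sequence[k], sequence[j]] += 1
    return F

# One made-up chain of four positions, the third being a gap filler.
toy = {"demo_A": (np.array([0, 1, -1, 2]), np.array([0, 0, -1, 2]))}
F = count_frequencies(toy)
print(F.shape)  # (3, 17, 20, 20)
```

The offset `m + 8` shifts the window index from the range -8 to +8 into the array range 0 to 16, so $m=0$ sits at index 8.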