# Obtaining protein descriptors
***

The protein descriptors are calculated from two sources of information:
1. The protein sequence
2. DSSP data

**Contents**

1. [Protein sequence](#Protein-sequence)
    1. [Getting the sequences](#Getting-the-sequences)
    1. [Calculating descriptors from the sequence](#Calculating-descriptors-from-the-sequence)
        1. [Sequence functions](#Sequence-functions)
1. [DSSP data](#DSSP-data)
    1. [Getting DSSP data](#Getting-DSSP-data)
    1. [Processing DSSP files](#Processing-DSSP-files)

## Protein sequence

### Getting the sequences
The protein sequences could be extracted from the `.pdb` files with the BioPandas Python library. Instead, they were taken from the FASTA file published by the Protein Data Bank, which contains every protein sequence in the database (link to file: https://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt). FASTA is a text-based file format that stores nucleotide or amino acid sequences in single-letter codes.

A function iterates over the lines of the file and returns a Python dictionary that contains only the proteins present in the training dataset. For every protein entry, the dictionary holds the following fields:
* The PDB ID (`ID`)
* Name of the protein (`name`)
* Length of the sequence - includes all the chains in the protein (`length`)
* Number of chains in the protein (`n_chains`)
* Average number of amino acids per chain (`aa/chain`)
* The single letter code protein sequence as a string (`sequence`)
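
For orientation, each record in `pdb_seqres.txt` is a header line followed by a sequence line. The sketch below shows how one header might be split into an ID, a chain label, and a name. The sample header is illustrative, and its layout (ID and chain, `mol:`, `length:`, then two spaces before the name) is an assumption about the public file rather than something stated in this notebook. The helper name `split_header` is hypothetical.

```python
def split_header(header):
    """Split one pdb_seqres-style header line into (pdb_id, chain, name).

    Assumes the layout '>ID_CHAIN mol:TYPE length:N  NAME', with two
    spaces separating the metadata from the molecule name.
    """
    header = header.strip()
    pdb_id = header[1:5]                 # four-character PDB ID after '>'
    chain = header[6:].split()[0]        # chain label follows the underscore
    name = header[header.find('  ') + 2:].strip()
    return pdb_id, chain, name


# Illustrative header, not copied from the training data
sample = '>10gs_A mol:protein length:209  GLUTATHIONE S-TRANSFERASE P1-1'
print(split_header(sample))
```

Chains of the same protein share the first four characters of the header, which is what allows them to be grouped under one PDB ID.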

In [1]:
def parse_seq(sequence_file, valid_ids_list):
    '''runs through the lines of a FASTA file, checks the ID of every entry 
    against a given list, and returns a dictionary that contains the
    information for every protein in the list given (data_dict)'''

    data_dict = {'ID': [], 'name': [], 'length': [], 'n_chains': [], 'aa/chain': [], 'sequence': []}
    valid_ids = set(valid_ids_list)  # a set makes the membership test fast

    def store(id_, name, chain_count, seq):
        '''appends one finished protein entry to data_dict'''
        data_dict['ID'].append(id_)
        data_dict['name'].append(name)
        data_dict['n_chains'].append(chain_count)
        data_dict['sequence'].append(seq)
        data_dict['length'].append(len(seq))
        data_dict['aa/chain'].append(round(len(seq)/chain_count, 2))

    current_id = None  # ID of the protein currently being assembled
    name = ''
    chain_count = 0
    seq = ''
    keep = False       # True while reading the lines of an entry in valid_ids

    with open(sequence_file, 'r') as seq_list:
        for line in seq_list:
            if line[0] == '>':
                id_ = line[1:5]
                keep = id_ in valid_ids
                if not keep:
                    continue  # header of an entry that is not wanted
                if id_ == current_id:
                    chain_count += 1  # another chain of the same protein
                else:
                    if current_id is not None:
                        store(current_id, name, chain_count, seq)
                    current_id = id_
                    name = line[(line.find('  ')+2):].strip()
                    chain_count = 1
                    seq = ''
            elif keep:
                # sequence lines of unwanted entries are skipped, so they
                # are never appended to the sequence of the previous protein
                seq += line.strip()

    if current_id is not None:
        store(current_id, name, chain_count, seq)

    return data_dict
            

In [2]:
import pandas as pd

# contains one column ('ID') with all the PDB IDs that made it into the training set
valid_ids_df = pd.read_csv("C:\\Users\\Ieremita Emanuel\\Desktop\\CS_project\\valid_ids.txt")
valid_ids = valid_ids_df['ID'].to_list() # making it into a list

# calling the parse_seq() function on the data
seq_dict = parse_seq("C:\\Users\\Ieremita Emanuel\\Desktop\\CS_project\\pdb_seqres.txt", valid_ids)

In [3]:
seq_dict_df = pd.DataFrame(seq_dict) # making the dictionary into a DataFrame

In [4]:
seq_dict_df

|      | ID   | name                           | length | n_chains | aa/chain | sequence |
|------|------|--------------------------------|--------|----------|----------|----------|
| 0    | 10gs | GLUTATHIONE S-TRANSFERASE P1-1 | 418    | 2        | 209.0    | PPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKAS... |
| 1    | 13gs | GLUTATHIONE S-TRANSFERASE      | 420    | 2        | 210.0    | MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKA... |
| 2    | 16pk | 3-PHOSPHOGLYCERATE KINASE      | 415    | 1        | 415.0    | EKKSINECDLKGKKVLIRVDFNVPVKNGKITNDYRIRSALPTLKKV... |
| 3    | 184l | T4 LYSOZYME                    | 164    | 1        | 164.0    | MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL... |
| 4    | 185l | T4 LYSOZYME                    | 164    | 1        | 164.0    | MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL... |
| ...  | ...  | ...                            | ...    | ...      | ...      | ... |
| 9602 | 8cpa | CARBOXYPEPTIDASE A             | 307    | 1        | 307.0    | ARSTNTFNYATYHTLDEIYDFMDLLVAQHPELVSKLQIGRSYEGRP... |
| 9603 | 8gpb | GLYCOGEN PHOSPHORYLASE B       | 842    | 1        | 842.0    | SRPLSDQEKRKQISVRGLAGVENVTELKKNFNRHLHFTLVKDRNVA... |
| 9604 | 966c | MMP-1                          | 157    | 1        | 157.0    | RWEQTHLTYRIENYTPDLPRADVDHAIEKAFQLWSNVTPLTFTKVS... |
| 9605 | 9hvp | HIV-1 PROTEASE                 | 198    | 2        | 99.0     | PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKM... |


In [5]:
# The DataFrame is saved to a .csv file
seq_dict_df.to_csv("C:\\Users\\Ieremita Emanuel\\Desktop\\CS_project\\CSVs\\sequences.csv", index=False)

### Calculating descriptors from the sequence

A series of functions was written to calculate the following descriptors for the proteins from their sequence:
* Molecular weight of protein (`prot_MW()`)
* Sum of the solvent accessible surface area (SASA) of amino acids (might be different from actual SASA of the protein) (`tot_aa_SASA()`)
* Total Van Der Waals volume of the amino acids in the sequence (`tot_VWvol()`)
* The counts of all amino acids in the sequence (`count_aa()`)
* The percentage of all amino acids in the sequence (`percent_aa()`)
* Percentages of amino acids with various properties, such as polar, apolar, and acidic (more details in Descriptions.pdf) (`percents()`)
* Total topological polar surface area (TPSA) of the amino acids in the sequence (`tot_aa_TPSA()`)
* Polar and different types of apolar surface area (`pol_SA()`)
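
Several of these descriptors follow one pattern: look up a per-residue value and add it up over the sequence, with a fallback value for residues outside the 20 standard amino acids. A minimal sketch of that pattern follows. The helper `sum_property` is hypothetical, and the table is a three-residue subset of the Van der Waals volumes used in this notebook, chosen only to keep the example short.

```python
def sum_property(seq, table, default):
    """Sum a per-residue property over a sequence.

    Residues missing from `table` (for example non-standard amino
    acids) contribute `default` instead.
    """
    return sum(table.get(residue, default) for residue in seq.upper())


# Van der Waals volumes (A**3) for three residues only
vw_subset = {'A': 67, 'G': 48, 'W': 163}

# 'x' is not in the table, so it contributes the default of 109
total = sum_property('agwx', vw_subset, default=109)
print(total)  # 67 + 48 + 163 + 109
```

Using `dict.get` with a default gives the same result as a `try`/`except KeyError` block around the lookup.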

#### Sequence functions

In [6]:
def prot_MW(seq):
    '''calculates molecular weight (MW) from sequence, returns MW in kiloDaltons (kDa)'''
    
    sequence = seq.upper()
    MW = 0
    aa = {'A': 89.0935,
        'R': 174.2017, 
        'N': 132.1184, 
        'D': 133.1032, 
        'C': 121.159,  
        'E': 147.1299,
        'Q': 146.1451, 
        'G': 75.0669, 
        'H': 155.1552, 
        'I': 131.1736, 
        'L': 131.1736, 
        'K': 146.1882, 
        'M': 149.2124, 
        'F': 165.19, 
        'P': 115.131, 
        'S': 105.093, 
        'T': 119.1197, 
        'W': 204.2262, 
        'Y': 181.1894, 
        'V': 117.1469}
    
    for a in sequence:
        try:
            MW += aa[a]
        except KeyError:
            MW += 136.9008 # if nonstandard aa, add average wt of aa
    return round(MW/1000, 2)
    
    
def tot_aa_SASA(seq):
    '''calculates the sum of the solvent accessible surface area (SASA)
    of the amino acids in the sequence, returns it in A**2 (angstrom squared). 
    Source of data in 'aa' dictionary:
    https://www.researchgate.net/figure/Average-SASA-values-for-amino-acids_tbl5_24032510'''
    
    sequence = seq.upper()
    SASA = 0
    
    # aa SASA
    aa = {'A': 209.02,
        'R': 335.73, 
        'N': 259.85, 
        'D': 257.99, 
        'C': 240.50,  
        'E': 285.03,
        'Q': 286.75, 
        'G': 185.15, 
        'H': 290.04, 
        'I': 273.46, 
        'L': 278.44, 
        'K': 303.43, 
        'M': 291.52, 
        'F': 311.30, 
        'P': 235.41, 
        'S': 223.04, 
        'T': 243.55, 
        'W': 350.68, 
        'Y': 328.82, 
        'V': 250.09}
    
    for a in sequence:
        try:
            SASA+=aa[a]
        except KeyError:
            SASA+=271.99 # if non-standard aa, add average value
    return round(SASA, 2)
        
        
def tot_VWvol(seq):
    '''calculates the van der waals volume (VWV) of the 
    amino acids in the sequence, returns it in A**3 (angstrom cubed). 
    Source of data in 'aa' dictionary:
    http://proteinsandproteomics.org/content/free/tables_1/table08.pdf'''
    
    sequence = seq.upper()
    VWV = 0
    aa = {'A': 67,
        'R': 148, 
        'N': 96, 
        'D': 91, 
        'C': 86,  
        'E': 109,
        'Q': 114, 
        'G': 48, 
        'H': 118, 
        'I': 124, 
        'L': 124, 
        'K': 135, 
        'M': 124, 
        'F': 135, 
        'P': 90, 
        'S': 73, 
        'T': 93, 
        'W': 163, 
        'Y': 141, 
        'V': 105}
    
    for a in sequence:
        try:
            VWV+=aa[a]
        except KeyError:
            VWV+=109 # if non-standard aa, add average value
    return VWV
        
        
def count_aa(seq):
    '''calculates the number of every aa in the sequence, 
    does not count non-standard aa. Returns a dict of the values ({aa: val})'''
    
    sequence = seq.upper()
    aa = {'nA': 0,
        'nR': 0, 
        'nN': 0, 
        'nD': 0, 
        'nC': 0,  
        'nE': 0,
        'nQ': 0, 
        'nG': 0, 
        'nH': 0, 
        'nI': 0, 
        'nL': 0, 
        'nK': 0,
        'nM': 0, 
        'nF': 0, 
        'nP': 0, 
        'nS': 0, 
        'nT': 0, 
        'nW': 0, 
        'nY': 0, 
        'nV': 0}
    
    for a in sequence:
        try:
            aa['n'+a]+=1
        except KeyError:
            continue # if non-standard aa, don't count
    return aa
    
def percent_aa(seq):
    '''calculates the percentage of every aa in the sequence, 
    does not count non-standard aa. Returns a dict of the values ({aa: val})'''
    
    sequence = seq.upper()
    counts = count_aa(sequence)
    seq_len = len(seq)
    aa = {'%A': 0,
        '%R': 0, 
        '%N': 0, 
        '%D': 0, 
        '%C': 0,  
        '%E': 0,
        '%Q': 0, 
        '%G': 0, 
        '%H': 0, 
        '%I': 0, 
        '%L': 0, 
        '%K': 0,
        '%M': 0, 
        '%F': 0, 
        '%P': 0, 
        '%S': 0, 
        '%T': 0, 
        '%W': 0, 
        '%Y': 0, 
        '%V': 0}
    for an in counts.keys():
        a = an[-1]
        aa['%'+a]=round((counts[an]/seq_len)*100, 2)
    return aa
    
    
def percents(seq):
    '''calculates percentages of different types of amino acids'''
    
    perc = percent_aa(seq)
    apol_aa = ['A', 'V', 'P', 'L', 'I', 'M']
    pol_unchar = ['S', 'T', 'C', 'N', 'Q']
    pol_ch = ['D', 'E', 'K', 'R', 'H']
    pol_ch_basic = ['K', 'R', 'H']
    pol_ch_acid = ['D', 'E']
    aromatic = ['F', 'W', 'Y']
    aliphatic = ['A', 'G', 'H', 'I', 'L', 'P', 'V']
    hydrox = ['S', 'T']
    amidic = ['N', 'Q']
    w_S = ['C', 'M']

    pol = {'%polar': 0, # actually the % of *apolar* aa; the mislabelled key
                        # persisted until the final creation of the training data
          '%pol_unch': 0, # % of polar uncharged aa
          '%pol_ch': 0, # % of polar charged aa
          '%pol_ch_basic': 0, # % of polar charged basic aa
          '%pol_ch_acid': 0, # % of polar charged acidic aa
          '%aromatic': 0, # % of aromatic aa
          '%aliphatic': 0, # % of aliphatic aa
          '%hydrox': 0, # % of hydroxylic aa
          '%amidic': 0, # % of amidic aa
          '%w_S': 0} # % of aa that contain a sulphur atom
    
    for a in apol_aa:
        pol['%polar']+=perc['%'+a]
    for a in pol_unchar:
        pol['%pol_unch']+=perc['%'+a]
    for a in pol_ch:
        pol['%pol_ch']+=perc['%'+a]
    for a in pol_ch_basic:
        pol['%pol_ch_basic']+=perc['%'+a]
    for a in pol_ch_acid:
        pol['%pol_ch_acid']+=perc['%'+a]
    for a in aromatic:
        pol['%aromatic']+=perc['%'+a]
    for a in aliphatic:
        pol['%aliphatic']+=perc['%'+a]
    for a in hydrox:
        pol['%hydrox']+=perc['%'+a]
    for a in amidic:
        pol['%amidic']+=perc['%'+a]
    for a in w_S:
        pol['%w_S']+=perc['%'+a]
    
    pol.update((x, round(y, 3)) for x, y in pol.items())
    return pol
        
        
def tot_aa_TPSA(seq):
    '''calculates the total topological polar surface area (TPSA)
    of all amino acids. Returns TPSA in A**2 (angstrom squared). 
    Source of data in 'aa' dictionary: PubChem - https://pubchem.ncbi.nlm.nih.gov/'''
    
    sequence = seq.upper()
    TPSA = 0
    aa = {'A': 63.3,
        'R': 128, 
        'N': 106, 
        'D': 101, 
        'C': 64.3,  
        'E': 101,
        'Q': 106, 
        'G': 63.3, 
        'H': 92, 
        'I': 63.3, 
        'L': 63.3, 
        'K': 89.3, 
        'M': 88.6, 
        'F': 63.3, 
        'P': 49.3, 
        'S': 83.6, 
        'T': 83.6, 
        'W': 79.1, 
        'Y': 83.6, 
        'V': 63.3}
    
    for a in sequence:
        try:
            TPSA+=aa[a]
        except KeyError:
            TPSA+=81.76 # if non-standard aa, add average PSA
    return round(TPSA, 3)
    
    
def pol_SA(seq):
    '''calculates polar and apolar surface area. Returns dict with values
    for each area type in A**2 (angstrom squared) ({area: val})'''

    sequence = seq.upper()
    count = count_aa(sequence)
    
    pol = {'apolar_SA': 0, 
          'pol_unch_SA': 0, 
          'pol_ch_SA': 0, 
          'pol_ch_basic_SA': 0, 
          'pol_ch_acid_SA': 0}
    
    apol_aa = ['A', 'V', 'P', 'L', 'I', 'M']
    pol_unchar = ['S', 'T', 'C', 'N', 'Q']
    pol_ch = ['D', 'E', 'K', 'R', 'H']
    pol_ch_basic = ['K', 'R', 'H']
    pol_ch_acid = ['D', 'E']
    
    # aa SASA
    aa = {'A': 209.02,
        'R': 335.73, 
        'N': 259.85, 
        'D': 257.99, 
        'C': 240.50,  
        'E': 285.03,
        'Q': 286.75, 
        'G': 185.15, 
        'H': 290.04, 
        'I': 273.46, 
        'L': 278.44, 
        'K': 303.43, 
        'M': 291.52, 
        'F': 311.30, 
        'P': 235.41, 
        'S': 223.04, 
        'T': 243.55, 
        'W': 350.68, 
        'Y': 328.82, 
        'V': 250.09}
    
    for a in apol_aa:
        pol['apolar_SA']+=count['n'+a]*aa[a]
    for a in pol_unchar:
        pol['pol_unch_SA']+=count['n'+a]*aa[a]
    for a in pol_ch:
        pol['pol_ch_SA']+=count['n'+a]*aa[a]
    for a in pol_ch_basic:
        pol['pol_ch_basic_SA']+=count['n'+a]*aa[a]
    for a in pol_ch_acid:
        pol['pol_ch_acid_SA']+=count['n'+a]*aa[a]
    
    pol.update((x, round(y, 3)) for x, y in pol.items())
    return pol
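
How these functions were applied to the sequence table is not shown in this section. One possible way, sketched below, is to merge the scalar and dictionary outputs into one flat record per protein. The stand-in functions `demo_length` and `demo_counts` and the helper `describe` are hypothetical placeholders for the functions above, used so that the example runs on its own.

```python
def demo_length(seq):
    """Stand-in for a function that returns a single number."""
    return len(seq)


def demo_counts(seq):
    """Stand-in for a function that returns a dict of named values."""
    upper = seq.upper()
    return {'nA': upper.count('A'), 'nG': upper.count('G')}


def describe(pdb_id, seq, scalar_funcs, dict_funcs):
    """Build one flat record: scalar results are stored under the
    function name, dict results are merged in key by key."""
    record = {'ID': pdb_id}
    for func in scalar_funcs:
        record[func.__name__] = func(seq)
    for func in dict_funcs:
        record.update(func(seq))
    return record


sequences = {'demo1': 'AAGT', 'demo2': 'GGGA'}  # made-up IDs and sequences
rows = [describe(i, s, [demo_length], [demo_counts]) for i, s in sequences.items()]
print(rows[0])
```

A list of such records can be passed to `pd.DataFrame(rows)` to obtain one row per protein and one column per descriptor.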

## DSSP data

The DSSP data is retrieved from a web server with the use of an API (https://www3.cmbi.umcn.nl/xssp/api/), by querying the PDB IDs from the `valid_ids` list.

Most of the code below is taken from the example code provided with the API (https://www3.cmbi.umcn.nl/xssp/api/examples).

### Getting DSSP data

In [7]:
"""
This example client takes a PDB ID and sends it to the REST service, which retrieves
the DSSP data. The DSSP data is then returned to the caller.
"""

import json
import requests
import time
import os

def get_dssp(pdb_id):
    # Read the pdb id into a variable
    data = {'data': pdb_id}
    rest_url = 'https://www3.cmbi.umcn.nl/xssp/'
    
    # Send a request to the server to retrieve the dssp data from the pdb id.
    # If an error occurs, an exception is raised and the program exits. If the
    # request is successful, the id of the job running on the server is
    # returned.
    url_create = '{}api/create/pdb_id/dssp/'.format(rest_url)
    r = requests.post(url_create, data=data)
    r.raise_for_status()

    job_id = json.loads(r.text)['id']

    
    # Loop until the job running on the server has finished, either successfully
    # or due to an error.
    ready = False
    while not ready:
        url_status = '{}api/status/pdb_id/dssp/{}/'.format(rest_url, job_id)
        r = requests.get(url_status)
        r.raise_for_status()

        status = json.loads(r.text)['status']
  
        # If the status equals SUCCESS, exit out of the loop by changing the
        # condition ready. This causes the code to drop into the `else` block
        # below.
        #
        # If the status equals either FAILURE or REVOKED, an exception is raised
        # containing the error message. The program exits.
        #
        # Otherwise, wait for five seconds and start at the beginning of the
        # loop again.
        if status == 'SUCCESS':
            ready = True
        elif status in ['FAILURE', 'REVOKED']:
            raise Exception(json.loads(r.text)['message'])
        else:
            time.sleep(5)
    else:
        # Requests the result of the job. If an error occurs an exception is
        # raised and the program exits. If the request is successful, the result
        # is returned.
        url_result = '{}api/result/pdb_id/dssp/{}/'.format(rest_url, job_id)
        r = requests.get(url_result)
        r.raise_for_status()
        result = json.loads(r.text)['result']

        # Return the result to the caller.
        return result
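
The polling loop above keeps waiting for as long as the job reports an unfinished status. A minimal sketch of the same loop with an upper time limit follows. `poll_until_done` and its `get_status` argument are hypothetical, the final status strings are the ones checked in the client above, and `'PENDING'` and `'STARTED'` are placeholder values for an unfinished job. The status source is passed in as a function so that the sketch runs without contacting the server.

```python
import time


def poll_until_done(get_status, interval=5.0, timeout=600.0):
    """Call get_status() until it returns 'SUCCESS'.

    Raises RuntimeError on 'FAILURE' or 'REVOKED', and TimeoutError if
    the job has not finished within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == 'SUCCESS':
            return status
        if status in ('FAILURE', 'REVOKED'):
            raise RuntimeError('job ended with status ' + status)
        if time.monotonic() >= deadline:
            raise TimeoutError('job did not finish in time')
        time.sleep(interval)


# Simulated job that succeeds on the third status check
fake_statuses = iter(['PENDING', 'STARTED', 'SUCCESS'])
print(poll_until_done(lambda: next(fake_statuses), interval=0.01, timeout=1.0))
```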

The `get_dssp()` function above retrieves the DSSP data for a PDB ID. The data is then written to files by the code below:

In [8]:
def write_dssp(dssp, ID):
    with open("C:\\Users\\Ieremita Emanuel\\Desktop\\CS_project\\DSSPs\\{}.dssp".format(ID), 'w') as file_:
        file_.write(dssp)

for ID in valid_ids:
    write_dssp(get_dssp(ID), ID)
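
Because the loop above stops at the first request that raises an exception, a long download may have to be restarted. One way to make it restartable is sketched below, assuming that skipping files that already exist is acceptable. `download_all`, `fetch`, and `out_dir` are hypothetical names, and `fake_fetch` stands in for `get_dssp()` so that the example runs offline.

```python
import os
import tempfile


def download_all(ids, fetch, out_dir):
    """Write fetch(id) to '<out_dir>/<id>.dssp' for every ID.

    IDs whose file already exists are skipped; IDs whose fetch raises
    are collected and returned instead of aborting the whole run.
    """
    failed = []
    for pdb_id in ids:
        path = os.path.join(out_dir, '{}.dssp'.format(pdb_id))
        if os.path.exists(path):
            continue
        try:
            text = fetch(pdb_id)
        except Exception as err:  # broad on purpose: any failure is recorded
            failed.append((pdb_id, str(err)))
            continue
        with open(path, 'w') as file_:
            file_.write(text)
    return failed


def fake_fetch(pdb_id):
    """Offline stand-in for get_dssp(); fails for one made-up ID."""
    if pdb_id == 'bad0':
        raise ValueError('no data')
    return 'DSSP text for ' + pdb_id


with tempfile.TemporaryDirectory() as tmp:
    failures = download_all(['10gs', 'bad0', '13gs'], fake_fetch, tmp)
    written = sorted(os.listdir(tmp))

print(written)
print(failures)
```

The returned list of failed IDs can be retried later or excluded from the training set.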

### Processing DSSP files

Once the DSSP data had been downloaded and written to files, the files were processed to extract the useful information.

The following code goes systematically through the DSSP files and creates a dictionary that is then converted to a Pandas DataFrame.

In [9]:
def parse_dssps(dssp_directory):
    '''This function takes in a path to a directory with dssp files and builds a 
    dictionary containing all the values extracted from their respective files 
    (see 'dssp_dict' dictionary keys for details, and descriptions.pdf
    for explanations of each entry)'''
    
    # dictionary with all the descriptors and fields that will be
    # used in training 
    dssp_dict = {
    'ID': [], 
    'tot_SS_bri': [], 
    'intrachain_SS_bri': [], 
    'interchain_SS_bri': [], 
    'asa': [], 
    'tot_O(I)>H-N(J)_H_bonds': [], 
    'O(I)>H-N(J)_H_bonds%': [], 
    'parallel_bri_H_bonds': [], 
    'parallel_bri_H_bonds%': [], 
    'antiparallel_bri_H_bonds': [], 
    'antiparallel_bri_H_bonds%': [], 
    'O(I)>H-N(I-5)_H_bonds': [] ,
    'O(I)>H-N(I-5)_H_bonds%': [], 
    'O(I)>H-N(I-4)_H_bonds': [], 
    'O(I)>H-N(I-4)_H_bonds%': [], 
    'O(I)>H-N(I-3)_H_bonds': [], 
    'O(I)>H-N(I-3)_H_bonds%': [], 
    'O(I)>H-N(I-2)_H_bonds': [], 
    'O(I)>H-N(I-2)_H_bonds%': [], 
    'O(I)>H-N(I-1)_H_bonds': [], 
    'O(I)>H-N(I-1)_H_bonds%': [], 
    'O(I)>H-N(I+0)_H_bonds': [], 
    'O(I)>H-N(I+0)_H_bonds%': [], 
    'O(I)>H-N(I+1)_H_bonds': [], 
    'O(I)>H-N(I+1)_H_bonds%': [], 
    'O(I)>H-N(I+2)_H_bonds': [], 
    'O(I)>H-N(I+2)_H_bonds%': [], 
    'O(I)>H-N(I+3)_H_bonds': [], 
    'O(I)>H-N(I+3)_H_bonds%': [], 
    'O(I)>H-N(I+4)_H_bonds': [], 
    'O(I)>H-N(I+4)_H_bonds%': [], 
    'O(I)>H-N(I+5)_H_bonds': [], 
    'O(I)>H-N(I+5)_H_bonds%': []
    }

    # iterating through the files in the given directory
    for filename in os.listdir(dssp_directory):
        
        file_ = os.path.join(dssp_directory, filename)
        with open(file_, 'r') as f:
            
            # keeping track of line count with a variable 'c'
            c = 0
            for l in f:
                c+=1
                sp = l.split()  # splitting each line into its elements
                
                # depending on what line the iteration is on, different 
                # elements of the line will be added to the dictionary fields
                if c == 3:
                    dssp_dict['ID'].append(l.split()[-2].strip().lower())
                elif c == 7:
                    dssp_dict['tot_SS_bri'].append(int(sp[2]))
                    dssp_dict['intrachain_SS_bri'].append(int(sp[3]))
                    dssp_dict['interchain_SS_bri'].append(int(sp[4]))
                elif c == 8:
                    dssp_dict['asa'].append(float(l.split()[0]))
                elif c == 9:
                    dssp_dict['tot_O(I)>H-N(J)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(J)_H_bonds%'].append(float(sp[1]))
                elif c == 10:
                    dssp_dict['parallel_bri_H_bonds'].append(int(sp[0]))
                    dssp_dict['parallel_bri_H_bonds%'].append(float(sp[1]))
                elif c == 11:
                    dssp_dict['antiparallel_bri_H_bonds'].append(int(sp[0]))
                    dssp_dict['antiparallel_bri_H_bonds%'].append(float(sp[1]))
                elif c == 12:
                    dssp_dict['O(I)>H-N(I-5)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I-5)_H_bonds%'].append(float(sp[1]))
                elif c == 13:
                    dssp_dict['O(I)>H-N(I-4)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I-4)_H_bonds%'].append(float(sp[1]))
                elif c == 14:
                    dssp_dict['O(I)>H-N(I-3)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I-3)_H_bonds%'].append(float(sp[1]))
                elif c == 15:
                    dssp_dict['O(I)>H-N(I-2)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I-2)_H_bonds%'].append(float(sp[1]))
                elif c == 16:
                    dssp_dict['O(I)>H-N(I-1)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I-1)_H_bonds%'].append(float(sp[1]))
                elif c == 17:
                    dssp_dict['O(I)>H-N(I+0)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I+0)_H_bonds%'].append(float(sp[1]))
                elif c == 18:
                    dssp_dict['O(I)>H-N(I+1)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I+1)_H_bonds%'].append(float(sp[1]))
                elif c == 19:
                    dssp_dict['O(I)>H-N(I+2)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I+2)_H_bonds%'].append(float(sp[1]))
                elif c == 20:
                    dssp_dict['O(I)>H-N(I+3)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I+3)_H_bonds%'].append(float(sp[1]))
                elif c == 21:
                    dssp_dict['O(I)>H-N(I+4)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I+4)_H_bonds%'].append(float(sp[1]))
                elif c == 22:
                    dssp_dict['O(I)>H-N(I+5)_H_bonds'].append(int(sp[0]))
                    dssp_dict['O(I)>H-N(I+5)_H_bonds%'].append(float(sp[1]))
    return dssp_dict
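
The long `elif` chain above maps fixed header line numbers to dictionary keys. The same mapping can be written as a lookup table, as in the sketch below, which covers only the hydrogen-bond lines 9 to 22. `HBOND_LINES` and `parse_hbond_lines` are hypothetical names, and the sample header is made up for illustration, with placeholder text standing in for the lines that are not parsed here.

```python
# Line number -> (count key, percentage key) for header lines 9 to 11
HBOND_LINES = {
    9: ('tot_O(I)>H-N(J)_H_bonds', 'O(I)>H-N(J)_H_bonds%'),
    10: ('parallel_bri_H_bonds', 'parallel_bri_H_bonds%'),
    11: ('antiparallel_bri_H_bonds', 'antiparallel_bri_H_bonds%'),
}
# Lines 12 to 22 cover the offsets I-5 ... I+5
for line_no, offset in zip(range(12, 23), range(-5, 6)):
    stem = 'O(I)>H-N(I{:+d})_H_bonds'.format(offset)
    HBOND_LINES[line_no] = (stem, stem + '%')


def parse_hbond_lines(lines):
    """Return {key: value} for the hydrogen-bond lines of one DSSP header.

    `lines` is an iterable of header lines; numbering starts at 1.
    """
    record = {}
    for line_no, line in enumerate(lines, start=1):
        if line_no in HBOND_LINES:
            count_key, percent_key = HBOND_LINES[line_no]
            fields = line.split()
            record[count_key] = int(fields[0])
            record[percent_key] = float(fields[1])
    return record


# Made-up header: eight placeholder lines, then lines 9 to 22 with invented numbers
sample = ['placeholder'] * 8 + ['{} {}.5 TEXT'.format(n, n) for n in range(14)]
result = parse_hbond_lines(sample)
print(result['tot_O(I)>H-N(J)_H_bonds'], result['O(I)>H-N(I+5)_H_bonds%'])
```

The `{:+d}` format always prints the sign, which reproduces key names such as `O(I)>H-N(I+0)_H_bonds` used above.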

A dictionary (`dssp_dict`) is built and converted to a Pandas DataFrame (`dssp_df`), which is saved as a `.csv` file for later use in assembling the final training dataset.

In [10]:
dssp_dict = parse_dssps("C:\\Users\\Ieremita Emanuel\\Desktop\\CS_project\\DSSPs\\")

dssp_df = pd.DataFrame.from_dict(dssp_dict)
dssp_df.to_csv("C:\\Users\\Ieremita Emanuel\\Desktop\\CS_project\\CSVs\\dssp_df.csv", index=False)