This Jupyter Notebook shows the construction of the DisProt_PDB dataset, which is used for disorder-order classification of protein regions in the companion Jupyter Notebook. The dataset consists of disordered regions from the <a href = "https://disprot.org/">DisProt database</a> and globular (i.e. ordered) domains from the <a href = "https://www.rcsb.org/search">PDB database</a>.


Importing the Python packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# DisProt dataset

Disordered regions are taken from the latest DisProt release at the time of writing (2022_12), downloaded from this <a href="https://disprot.org/download">link</a> with the 'Aspect' drop-down field set to the 'Structural State (IDPO)' option.

In [2]:
df_disprot = pd.read_csv('data/DisProt_release_2022_12_structural_state_aspect.tsv', sep='\t') 

In [3]:
df_disprot.head()

,acc,name,organism,ncbi_taxon_id,disprot_id,region_id,start,end,term_namespace,term,term_name,ec,ec_name,reference,region_sequence,confidence,obsolete
0,P03265,DNA-binding protein,Human adenovirus C serotype 5,28285,DP00003,DP00003r002,294,334,Structural state,IDPO:00076,disorder,ECO:0006220,X-ray crystallography-based structural model w...,pmid:8632448,EHVIEMDVTSENGQRALKEQSSKAKIVKNRWGRNVVQISNT,,
1,P03265,DNA-binding protein,Human adenovirus C serotype 5,28285,DP00003,DP00003r004,454,464,Structural state,IDPO:00076,disorder,ECO:0006220,X-ray crystallography-based structural model w...,pmid:8632448,VYRNSRAQGGG,,
2,P49913,Cathelicidin antimicrobial peptide,Homo sapiens,9606,DP00004,DP00004r001,134,170,Structural state,IDPO:00076,disorder,ECO:0006206,near-UV circular dichroism evidence used in ma...,pmid:9452503,LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES,,
3,P03045,Antitermination protein N,Escherichia phage lambda,10710,DP00005,DP00005r001,1,107,Structural state,IDPO:00076,disorder,ECO:0006165,nuclear magnetic resonance spectroscopy eviden...,pmid:9659923,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,,
4,P03045,Antitermination protein N,Escherichia phage lambda,10710,DP00005,DP00005r004,1,107,Structural state,IDPO:00076,disorder,ECO:0006210,small-angle X-ray scattering evidence used in ...,pmid:21936008,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,,


In [4]:
df_disprot.shape

(5568, 17)

All possible structural states listed in the <code>'term_name'</code> column:

In [5]:
set(df_disprot['term_name'])

{'disorder', 'molten globule', 'order', 'pre-molten globule'}

Distribution of regions in different structural states:

In [6]:
df_disprot['term_name'].value_counts()

disorder              5404
pre-molten globule      72
molten globule          49
order                   43
Name: term_name, dtype: int64

<img src="assets/classification-of-order-among-proteins.png" width=600 align="left">

Molten globule states are compact but have less specific configurations than the folded state. The pre-molten globule state is an intermediate between the molten globule and the unfolded state. Accordingly, we assign the molten globule state to the 'order' class and the pre-molten globule state to the 'disorder' class.

In [7]:
df_disprot['term_name'] = df_disprot['term_name'].apply(lambda x : 'order' if (x == 'order' or x == 'molten globule') else 'disorder')
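
The lambda above sends every label other than 'order' and 'molten globule' to 'disorder', including any label a future DisProt release might introduce. As a minimal sketch of a stricter alternative, the mapping can be spelled out as a dictionary so that unexpected labels raise an error. The names `STATE_TO_CLASS` and `to_binary_class` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping that mirrors the class assignment described above.
STATE_TO_CLASS = {
    'disorder': 'disorder',
    'pre-molten globule': 'disorder',
    'molten globule': 'order',
    'order': 'order',
}

def to_binary_class(states: pd.Series) -> pd.Series:
    """Map structural states to 'order'/'disorder'; fail on unknown labels."""
    mapped = states.map(STATE_TO_CLASS)
    unknown = states[mapped.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped structural states: {list(unknown)}")
    return mapped

# Toy input covering the four states seen in the 'term_name' column.
toy = pd.Series(['disorder', 'molten globule', 'pre-molten globule', 'order'])
binary = to_binary_class(toy)
print(binary.tolist())
```

On the four labels present in this release the result should match the lambda; the difference only shows up for labels outside the dictionary.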

Class distribution:

In [8]:
df_disprot['term_name'].value_counts()

disorder    5476
order         92
Name: term_name, dtype: int64
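
To make the imbalance between the two classes explicit, the counts printed above can be turned into shares. The sketch below rebuilds the counts by hand instead of reading the TSV file, so it runs on its own; on the real DataFrame, `df_disprot['term_name'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the output of the cell above.
counts = pd.Series({'disorder': 5476, 'order': 92})

# Share of each class among all DisProt regions.
shares = counts / counts.sum()
print(shares.round(4))
```

The 'order' class makes up well under 2% of the DisProt regions, which is why ordered examples are added from PDB later in the notebook.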

Checking for duplicates according to <code>region_id</code> (if there are none, the number of unique ids equals the number of rows, 5568):

In [10]:
len(set(df_disprot['region_id']))

5568
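
The same check can be written with pandas' own helpers, which also make it easy to list the offending rows when duplicates do exist. This is a small sketch on made-up ids, not on the DisProt data.

```python
import pandas as pd

# Toy table with one repeated id ('r2'); the values are illustrative only.
toy = pd.DataFrame({
    'region_id': ['r1', 'r2', 'r2', 'r3'],
    'region_sequence': ['AAA', 'CCC', 'CCC', 'GGG'],
})

# True only when every id occurs exactly once.
ids_unique = toy['region_id'].is_unique

# All rows whose id occurs more than once (keep=False marks every copy).
repeated = toy[toy['region_id'].duplicated(keep=False)]
print(ids_unique)
print(repeated)
```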

Extracting columns with data relevant to us:

In [11]:
df_disprot = df_disprot[['region_id', 'region_sequence', 'term_name']]

In [12]:
df_disprot.head()

,region_id,region_sequence,term_name
0,DP00003r002,EHVIEMDVTSENGQRALKEQSSKAKIVKNRWGRNVVQISNT,disorder
1,DP00003r004,VYRNSRAQGGG,disorder
2,DP00004r001,LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES,disorder
3,DP00005r001,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,disorder
4,DP00005r004,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,disorder


In [13]:
df_disprot.set_index('region_id', inplace=True)

In [14]:
df_disprot.head()

region_id,region_sequence,term_name
DP00003r002,EHVIEMDVTSENGQRALKEQSSKAKIVKNRWGRNVVQISNT,disorder
DP00003r004,VYRNSRAQGGG,disorder
DP00004r001,LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES,disorder
DP00005r001,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,disorder
DP00005r004,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,disorder


In [15]:
df_disprot.shape

(5568, 2)

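Removing duplicate <code>region_sequence</code> entries (e.g. regions with ids 'DP00005r001' and 'DP00005r004'). Two rows count as duplicates when both the sequence and the class match, and the last occurrence is kept.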

In [16]:
len(set(zip(df_disprot['region_sequence'].values, df_disprot['term_name'].values)))

4299

In [17]:
df_disprot = df_disprot.drop_duplicates(subset=['region_sequence', 'term_name'], keep='last')
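
A toy example may help to see what `keep='last'` and the two-column `subset` do. The ids and sequences below are invented for illustration. Note that the same sequence can survive twice if it carries both class labels; whether this happens in the real data is not checked here.

```python
import pandas as pd

# Invented regions: the first two rows share sequence and class.
toy = pd.DataFrame(
    {'region_sequence': ['MDAQ', 'MDAQ', 'VYRN', 'MDAQ'],
     'term_name': ['disorder', 'disorder', 'disorder', 'order']},
    index=pd.Index(['X1r001', 'X1r004', 'X2r001', 'X1r009'], name='region_id'),
)

# Same call as in the cell above: duplicates are (sequence, class) pairs.
deduped = toy.drop_duplicates(subset=['region_sequence', 'term_name'], keep='last')
print(deduped.index.tolist())

# Sequences that remain under more than one class label.
conflicting = deduped[deduped['region_sequence'].duplicated(keep=False)]
print(conflicting)
```

Here 'X1r001' is dropped in favour of 'X1r004', which explains why the later region id is the one that appears in the deduplicated table.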

In [18]:
df_disprot.head()

region_id,region_sequence,term_name
DP00003r002,EHVIEMDVTSENGQRALKEQSSKAKIVKNRWGRNVVQISNT,disorder
DP00003r004,VYRNSRAQGGG,disorder
DP00004r001,LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES,disorder
DP00005r017,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,disorder
DP00006r014,GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGF...,disorder


In [19]:
df_disprot.shape

(4299, 2)

# PDB dataset

PDB entries are obtained from the <a href="http://dunbrack.fccc.edu/pisces/PISCES_ChooseInputPage.php">PISCES web server</a> with its default culling parameters (sequence identity cutoff, X-ray resolution range, etc.).

In [20]:
file = open('data/cullpdb_pc25.0_res0.0-2.0_len40-10000_R0.25_Xray_d2023_03_21_chains8444.fasta.txt')

In [21]:
lines = file.readlines()
file.close()  # release the file handle once all lines are in memory

In [22]:
lines

['>5D8VA 92116E4FD2C44E0A 83 XRAY  0.480  0.072  0.078 NACO.noDsdr.noBrk High-potential iron-sulfur protein <HIP_THETI(1-83)> [Thermochromatium tepidum]\n',
 'AAPANAVTADDPTAIALKYNQDATKSERVAAARPGLPPEEQHCANCQFMQANVGEGDWKGCQLFPGKLINVNGWCASWTL\n',
 'KAG\n',
 '\n',
 '>3NIRA 919E68AF159EF722 46 XRAY  0.480  0.127 NA NACO.noDsdr.noBrk Crambin <CRAM_CRAAB(1-46)> [Crambe hispanica]\n',
 'TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN\n',
 '\n',
 '>5NW3A 7181BBB4A3E8B1A8 54 XRAY  0.590  0.135  0.146 NACO.noDsdr.noBrk Rubredoxin <RUBR_PYRFU(1-54)> [Pyrococcus furiosus]\n',
 'MAKWVCKICGYIYDEDAGDPDNGISPGTKFEELPDDWVCPICGAPKSEFEKLED\n',
 '\n',
 '>1UCSA 154C6BE62D913192 64 XRAY  0.620  0.139  0.155 NACO.noDsdr.noBrk Ice-structuring protein RD1 <ANP1_LYCDA(1-64)> [Lycodichthys dearborni]\n',
 'NKASVVANQLIPINTALTLIMMKAEVVTPMGIPAEEIPKLVGMQVNRAVPLGTTLMPDMVKNYE\n',
 '\n',
 '>3X2MA D693A38CE0E0BC40 180 XRAY  0.640  0.122  0.129 NACO.noDsdr.noBrk Endoglucanase V-like protein <B3Y002_PHACH(27-206)> [Phaneroch

Parsing the FASTA format file:

In [23]:
pdb_entries_dict = {}

sequence = ""
entry_id = None

for line in lines:
    if line.startswith(">"):
        #putting the previous result in the dictionary
        if entry_id is not None:
            pdb_entries_dict[entry_id] = sequence
            
        #recording the entry_id of the next PDB entry 
        entry_id = line.split(' ')[0][1:]
        sequence = ""
    else:     
        sequence += line.strip()
        
#for the last selected entry (since for it we will not enter the for-loop again and insert it into the dictionary)
pdb_entries_dict[entry_id] = sequence
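
The parsing loop above can also be wrapped in a function, which makes it easy to try on a few hand-written lines. The sketch below follows the same logic but is not the notebook's own code: `parse_fasta` is a hypothetical name, the headers and sequences are invented, and two small guards are added (empty input returns an empty dictionary, and a header without an identifier raises an error).

```python
def parse_fasta(lines):
    """Return {entry_id: sequence} for an iterable of FASTA-formatted lines."""
    entries = {}
    entry_id = None
    chunks = []  # sequence pieces of the entry currently being read
    for raw in lines:
        line = raw.strip()
        if line.startswith('>'):
            # A new header closes the previous entry, if there was one.
            if entry_id is not None:
                entries[entry_id] = ''.join(chunks)
            fields = line[1:].split()
            if not fields:
                raise ValueError('FASTA header without an identifier')
            entry_id = fields[0]
            chunks = []
        elif line and entry_id is not None:
            chunks.append(line)
    # The last entry has no following header, so store it after the loop.
    if entry_id is not None:
        entries[entry_id] = ''.join(chunks)
    return entries

# Invented lines shaped like the PISCES output shown above.
toy_lines = [
    '>1ABCA 0000000000000000 7 XRAY  1.000  0.100  0.120 toy header\n',
    'MKTA\n',
    'YIA\n',
    '\n',
    '>2XYZB 1111111111111111 4 XRAY  1.500  0.150 NA toy header\n',
    'GSHM\n',
]
parsed = parse_fasta(toy_lines)
print(parsed)
```

Sequences split over several lines are joined, and blank separator lines are ignored, as in the original loop.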

In [24]:
pdb_entries_dict

{'5D8VA': 'AAPANAVTADDPTAIALKYNQDATKSERVAAARPGLPPEEQHCANCQFMQANVGEGDWKGCQLFPGKLINVNGWCASWTLKAG',
 '3NIRA': 'TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN',
 '5NW3A': 'MAKWVCKICGYIYDEDAGDPDNGISPGTKFEELPDDWVCPICGAPKSEFEKLED',
 '1UCSA': 'NKASVVANQLIPINTALTLIMMKAEVVTPMGIPAEEIPKLVGMQVNRAVPLGTTLMPDMVKNYE',
 '3X2MA': 'ATGGYVQQATGQASFTMYSGCGSPACGKAASGFTAAINQLAFGSAPGLGAGDACGRCFALTGNHDPYSPNYTGPFGQTIVVKVTDLCPVQGNQEFCGQTTSNPTNQHGMPFHFDICEDTGGSAKFFPSGHGALTGTFTEVSCSQWSGSDGGQLWNGACLSGETAPNWPSTACGNKGTAPS',
 '2VB1A': 'KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL',
 '1US0A': 'MASRILLNNGAKMPILGLGTWKSPPGQVTEAVKVAIDVGYRHIDCAHVYQNENEVGVAIQEKLREQVVKREELFIVSKLWCTYHEKGLVKGACQKTLSDLKLDYLDLYLIHWPTGFKPGKEFFPLDESGNVVPSDTNILDTWAAMEELVDEGLVKAIGISNFNHLQVEMILNKPGLKYKPAVNQIECHPYLTQEKLIQYCQSKGIVVTAYSPLGSPDRPWAKPEDPSLLEDPRIKAIAAKHNKTTAQVLIRFPMQRNLVVIPKSVTPERIAENFKVFDFELSSQDMTTLLSYNRNWRVCALLSCTSHKDYPFHEEF',
 '6E6OA': 'DACEQAAIQCVESACESLC

Number of extracted PDB entries:

In [25]:
len(pdb_entries_dict)

8444
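
The count agrees with the `chains8444` suffix in the file name, which presumably records the number of chains. Because a dictionary silently overwrites repeated keys, one more check is to count how often each identifier occurs among the header lines. The helper below is a sketch with a hypothetical name, shown on invented lines; on the real data it would be called with `lines`.

```python
from collections import Counter

def repeated_fasta_ids(lines):
    """Return {entry_id: count} for identifiers that occur in more than one header."""
    ids = [line[1:].split()[0] for line in lines
           if line.startswith('>') and line[1:].strip()]
    return {entry_id: n for entry_id, n in Counter(ids).items() if n > 1}

# Invented lines in which '1ABCA' appears in two headers.
toy_lines = ['>1ABCA x\n', 'MKTA\n', '>2XYZB y\n', 'GSHM\n', '>1ABCA z\n', 'AAAA\n']
repeats = repeated_fasta_ids(toy_lines)
print(repeats)
```

An empty result means that no entry was overwritten while filling the dictionary.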

Transferring data from the dictionary to a <code>pandas</code> <code>DataFrame</code> object:

In [26]:
dict_for_df = {'entry_id' : list(pdb_entries_dict.keys()), 'entry_sequence' : list(pdb_entries_dict.values())}

In [27]:
df_pdb = pd.DataFrame.from_dict(dict_for_df)

In [28]:
df_pdb.head()

,entry_id,entry_sequence
0,5D8VA,AAPANAVTADDPTAIALKYNQDATKSERVAAARPGLPPEEQHCANC...
1,3NIRA,TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN
2,5NW3A,MAKWVCKICGYIYDEDAGDPDNGISPGTKFEELPDDWVCPICGAPK...
3,1UCSA,NKASVVANQLIPINTALTLIMMKAEVVTPMGIPAEEIPKLVGMQVN...
4,3X2MA,ATGGYVQQATGQASFTMYSGCGSPACGKAASGFTAAINQLAFGSAP...


Each PDB entry is assigned to the <code>'order'</code> class.

In [29]:
df_pdb['class'] = 'order'

In [30]:
df_pdb.head()

,entry_id,entry_sequence,class
0,5D8VA,AAPANAVTADDPTAIALKYNQDATKSERVAAARPGLPPEEQHCANC...,order
1,3NIRA,TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN,order
2,5NW3A,MAKWVCKICGYIYDEDAGDPDNGISPGTKFEELPDDWVCPICGAPK...,order
3,1UCSA,NKASVVANQLIPINTALTLIMMKAEVVTPMGIPAEEIPKLVGMQVN...,order
4,3X2MA,ATGGYVQQATGQASFTMYSGCGSPACGKAASGFTAAINQLAFGSAP...,order


In [31]:
df_pdb.set_index(['entry_id'], inplace=True)
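
The steps from the dictionary to the indexed DataFrame can also be condensed by going through a `pandas` `Series`, whose index is taken from the dictionary keys. This is an alternative sketch on an invented two-entry dictionary, not a replacement for the cells above.

```python
import pandas as pd

# Invented entries standing in for pdb_entries_dict.
toy_entries = {'1ABCA': 'MKTAYIA', '2XYZB': 'GSHM'}

df_toy = (
    pd.Series(toy_entries, name='entry_sequence')  # keys become the index
    .rename_axis('entry_id')                       # name the index
    .to_frame()                                    # one-column DataFrame
)
df_toy['class'] = 'order'  # same constant label as for df_pdb
print(df_toy)
```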

In [32]:
df_pdb.head()

entry_id,entry_sequence,class
5D8VA,AAPANAVTADDPTAIALKYNQDATKSERVAAARPGLPPEEQHCANC...,order
3NIRA,TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN,order
5NW3A,MAKWVCKICGYIYDEDAGDPDNGISPGTKFEELPDDWVCPICGAPK...,order
1UCSA,NKASVVANQLIPINTALTLIMMKAEVVTPMGIPAEEIPKLVGMQVN...,order
3X2MA,ATGGYVQQATGQASFTMYSGCGSPACGKAASGFTAAINQLAFGSAP...,order


# Concatenation of DisProt and PDB datasets

We merge these two datasets (DataFrames <code>df_disprot</code> and <code>df_pdb</code>) and save the result to the CSV file <code>DisProt_PDB_dataset.csv</code>.

In [33]:
df_disprot.columns

Index(['region_sequence', 'term_name'], dtype='object')

In [34]:
df_pdb.columns

Index(['entry_sequence', 'class'], dtype='object')

Renaming the columns in order to be able to merge the two <code>DataFrames</code>:

In [35]:
df_disprot.rename(columns={"region_sequence" : "sequence", "term_name" : "class"}, inplace=True)

In [36]:
df_pdb.rename(columns={"entry_sequence" : "sequence"}, inplace=True)

Concatenation of dataframes:

In [37]:
df_merged = pd.concat([df_disprot, df_pdb])
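
DisProt region ids and PDB chain ids follow different naming patterns, so index clashes between the two tables are unlikely. If an explicit guarantee is wanted, `pd.concat` accepts `verify_integrity=True`, which raises a `ValueError` when the resulting index contains duplicates. The sketch below demonstrates this on two invented one-row tables that deliberately share an id.

```python
import pandas as pd

# Two invented tables that share the index value 'id1'.
a = pd.DataFrame({'sequence': ['AAA'], 'class': ['disorder']}, index=['id1'])
b = pd.DataFrame({'sequence': ['CCC'], 'class': ['order']}, index=['id1'])

try:
    pd.concat([a, b], verify_integrity=True)
    overlap_detected = False
except ValueError:
    # Raised because 'id1' would appear twice in the concatenated index.
    overlap_detected = True
print(overlap_detected)

# With distinct ids the call succeeds and simply stacks the rows.
c = b.rename(index={'id1': 'id2'})
stacked = pd.concat([a, c], verify_integrity=True)
print(stacked.index.tolist())
```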

In [38]:
df_merged.index.name = "ID"

In [39]:
df_merged

ID,sequence,class
DP00003r002,EHVIEMDVTSENGQRALKEQSSKAKIVKNRWGRNVVQISNT,disorder
DP00003r004,VYRNSRAQGGG,disorder
DP00004r001,LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES,disorder
DP00005r017,MDAQTRRRERRAEKQAQWKAANPLLVGVSAKPVNRPILSLNRKPKS...,disorder
DP00006r014,GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGF...,disorder
...,...,...
7BY3B,MTSGSALFDTNILIDLFSGRREAKQALEAWPPQNAISLITWMEVMV...,order
1K32A,KTFKLHEMHGLCMPNLLLNPDIHGDRIIFVCCDDLWEHDLKSGSTR...,order
2ZKZA,MTVFVDHKIEYMSLEDDAELLKTMAHPMRLKIVNELYKHKALNVTQ...,order
3RNVA,DPVLSEIKMPTKHHLVIGNSGDPKYIDLPEIEENKMYIAKEGYCYI...,order


Saving data from merged <code>DataFrame</code> to file: 

In [40]:
df_merged.to_csv('data/DisProt_PDB_dataset.csv')
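
When the file is read back, for example in the classification notebook, it may be worth guarding against one `read_csv` default: strings such as `NA` are converted to missing values, and `NA` is also a valid two-residue sequence (Asn-Ala). The sketch below shows a round trip through an in-memory buffer with `keep_default_na=False`; the rows are invented and the real file is not touched.

```python
import io
import pandas as pd

# Invented rows; the second sequence is the literal string 'NA'.
toy = pd.DataFrame(
    {'sequence': ['VYRN', 'NA'], 'class': ['disorder', 'order']},
    index=pd.Index(['X1r001', '1ABCA'], name='ID'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# keep_default_na=False stops 'NA' from being parsed as a missing value.
restored = pd.read_csv(buffer, index_col='ID', keep_default_na=False)
print(restored)
```

Whether the real dataset contains such a sequence is not established here; the option only removes the possibility of a silent conversion.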