### Short notebook for parsing/reading ODiNPred disorder data
###### Last updated 2021-05-31
The following short code snippet shows how one could predict IDRs using per-residue disorder scores from another prediction tool. We're using [ODiNPred](https://st-protein.chem.au.dk/odinpred) as our example because it's GREAT, but other predictors are out there and could be used in the same way!

If you have any questions about this code please shoot Alex an email :-).


NOTE that because it's Memorial Day, the function `predict_disorder_domains_from_external_scores()` is not on the main PyPI version of metapredict, but you can still get the current working version, where it CAN be found, by running:

    pip install metapredict@git+https://github.com/holehouse-lab/metapredict.git
    
Which will install the current version [from our GitHub repository](https://github.com/idptools/metapredict).

In [1]:
import metapredict as meta

odinpred_file = 'DisorderPredictionssp_P04637_P53_HUMANC.txt'

# read the file into Python
with open(odinpred_file, 'r') as fh:
    content = fh.readlines()

# this line extracts the 4th (idx=3) column from the file, discarding the
# first (1) line to build a list of disorder scores
disorder = [float(x.strip().split()[3]) for x in content[1:]]

idrs = meta.predict_disorder_domains_from_external_scores(disorder)
print(idrs[1])

[[0, 103], [290, 327], [349, 393]]


The output above should look something like:

    [[0, 103], [290, 327], [349, 393]]

i.e. this uses Python indexing to say there is an IDR between index positions 0 and 103, a second between 290 and 327, and another between 349 and 393. NOTE that these boundaries DEPEND on the settings passed to `predict_disorder_domains_from_external_scores()`. By default the 'disorder' threshold is set to 0.5 but, depending on your disorder scores, this might not make any sense.
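When comparing these boundaries with the literature, it can help to convert them to 1-based residue numbers. The sketch below assumes the boundaries are start-inclusive and end-exclusive (as in Python slicing), which fits the fact that human p53 is 393 residues long and the last boundary ends at 393. The helper `to_residue_numbers()` is a hypothetical name made up for this example and is not part of metapredict.

```python
def to_residue_numbers(boundaries):
    """Convert 0-based, end-exclusive boundaries to 1-based, inclusive residue ranges.

    ASSUMPTION: each entry starts with [start, end, ...] using Python slice
    conventions. Any extra elements (e.g. a sequence string) are ignored.
    """
    ranges = []
    for entry in boundaries:
        start, end = entry[0], entry[1]
        # start shifts by one; the exclusive end already equals the last residue number
        ranges.append((start + 1, end))
    return ranges


# boundaries copied from the output above
idrs = [[0, 103], [290, 327], [349, 393]]
residue_ranges = to_residue_numbers(idrs)

for first, last in residue_ranges:
    print(f"residues {first}-{last} ({last - first + 1} residues)")
```

With this convention the first IDR corresponds to residues 1-103 rather than 0-103.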

Re-running with a higher threshold:

In [2]:
idrs = meta.predict_disorder_domains_from_external_scores(disorder, disorder_threshold=0.8)
print(idrs[1])

[[0, 99], [294, 393]]


Reveals a different set of IDR boundaries:

    [[0, 99], [294, 393]]


Which of these two scenarios is right? In general, my recommendation at this stage is to see what OTHER extant data exist on this protein - i.e., have people obtained structural information? Perhaps there's a region which is predicted to be borderline because it folds on binding a partner (e.g. DNA, RNA, protein, ion *etc.*). In any case, for protein-specific analysis a combination of disorder prediction plus primary literature is generally the best approach to build a high-confidence map of which regions are disordered *in the context you care about*.
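One way to decide where to focus that literature search is to list the positions where the two threshold settings disagree. The following is a minimal sketch in plain Python, using the two boundary lists printed above and assuming start-inclusive, end-exclusive indexing. The helpers `covered_positions()` and `to_intervals()` are hypothetical names for this example, not metapredict functions.

```python
def covered_positions(boundaries):
    """Return the set of 0-based positions covered by [start, end) boundaries."""
    positions = set()
    for start, end in boundaries:
        positions.update(range(start, end))
    return positions


def to_intervals(positions):
    """Collapse a set of positions back into sorted [start, end) intervals."""
    intervals = []
    for pos in sorted(positions):
        if intervals and pos == intervals[-1][1]:
            # position directly extends the current interval
            intervals[-1][1] = pos + 1
        else:
            intervals.append([pos, pos + 1])
    return intervals


# boundaries copied from the two outputs above
idrs_050 = [[0, 103], [290, 327], [349, 393]]   # default threshold (0.5)
idrs_080 = [[0, 99], [294, 393]]                # disorder_threshold=0.8

at_050 = covered_positions(idrs_050)
at_080 = covered_positions(idrs_080)

# regions called disordered under one setting but not the other
only_050 = to_intervals(at_050 - at_080)
only_080 = to_intervals(at_080 - at_050)

print("IDR only at 0.5:", only_050)
print("IDR only at 0.8:", only_080)
```

The intervals that appear in only one of the two lists are the borderline regions where existing structural or binding data are most likely to settle the question.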

#### Getting sequence information as well
If you can provide the amino acid sequence, the function `predict_disorder_domains_from_external_scores()` will also give you the IDR sequences by extracting them based on the identified boundaries. For example:

In [3]:
import metapredict as meta
odinpred_file = 'DisorderPredictionssp_P04637_P53_HUMANC.txt'

# read the file into Python
with open(odinpred_file, 'r') as fh:
    content = fh.readlines()

# this line extracts the 4th (idx=3) column from the file, discarding the
# first (1) line to build a list of disorder scores
disorder = [float(x.strip().split()[3]) for x in content[1:]]

# now we select the first (idx=0) column to extract the amino acid, and finally
# include a "".join() call to combine the list of residues into a string
local_sequence = "".join([x.strip().split()[0] for x in content[1:]])

idrs = meta.predict_disorder_domains_from_external_scores(disorder, sequence=local_sequence)
print(idrs[1])

[[0, 103, 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTY'], [290, 327, 'KKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEY'], [349, 393, 'LKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD']]


The above output should now look something like:

    [[0, 103, 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTY'], [290, 327, 'KKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEY'], [349, 393, 'LKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD']]


Where the protein sequences define the amino acids encompassed by the disordered region defined by the index positions.
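Since the file-reading steps are repeated between cells, they could be wrapped in a small helper. The sketch below is one possible way to do this. The function name `parse_score_lines()` is hypothetical, and the sample lines are made-up toy data, not real ODiNPred output. The only layout assumptions taken from the cells above are a single header line, the amino acid in the first column (idx=0), and the disorder score in the fourth column (idx=3).

```python
def parse_score_lines(lines, residue_col=0, score_col=3, n_header=1):
    """Return (scores, sequence) from whitespace-delimited per-residue lines.

    ASSUMPTION: column positions and header length match the file used in
    the cells above; adjust them for other predictors.
    """
    scores = []
    residues = []
    for line in lines[n_header:]:
        fields = line.split()
        if not fields:
            # skip blank lines (e.g. a trailing newline at the end of the file)
            continue
        residues.append(fields[residue_col])
        scores.append(float(fields[score_col]))
    return scores, "".join(residues)


# toy input for illustration only; the header names and the middle columns are invented
example_lines = [
    "residue colA colB disorder\n",
    "M 0 0 0.91\n",
    "E 0 0 0.85\n",
    "E 0 0 0.12\n",
    "\n",
]

scores, sequence = parse_score_lines(example_lines)

# one score per residue is required for the boundaries to line up with the sequence
assert len(scores) == len(sequence)
print(scores, sequence)
```

With a real file, the `lines` argument would be the `content` list returned by `fh.readlines()`, and the two return values could be passed on as `disorder` and `sequence=local_sequence`.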