# metapredict: A deep-learning based predictor of consensus disorder in proteins

This Jupyter Notebook contains examples to be used for *metapredict*. This notebook was updated for version 1.2 of metapredict. For more information on metapredict, please see: https://metapredict.readthedocs.io/en/latest/

# Setting up the notebook

In [None]:
# Make graphs show up properly
%matplotlib inline

from metapredict import meta
import os

# Predicting disorder

### Predicting Disorder From a Sequence

In [None]:
# Example sequence is hnRNPA1 UniprotID P09651
hnRNPA1 = 'MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGGGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF'

In [None]:
# print predicted disorder for hnRNPA1
print(meta.predict_disorder(hnRNPA1))

NOTE: You do not need to assign the sequence to a variable first. You can pass the sequence directly as a string to the meta.predict_disorder() function.

In [None]:
# directly input sequence...
print(meta.predict_disorder('MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGGGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF'))

### Predicting disorder using a Uniprot ID

metapredict allows disorder predictions by inputting the Uniprot ID rather than the sequence. The Uniprot ID for p53 is P04637. Let's give that a go.

In [None]:
print(meta.predict_disorder_uniprot('P04637'))
# NOTE: This function can take a little bit of time due to having to fetch the sequence from Uniprot.

### Calculating percent disorder

If you just need the percent disorder of a sequence, metapredict can do that using the percent_disorder() function.

In [None]:
# We will use the hnRNPA1 sequence we set before for this.
# Like meta.predict_disorder(), you can also just input a sequence.
print(meta.percent_disorder(hnRNPA1))

The percent_disorder() function uses a default cutoff of 0.3 to decide whether a residue counts as disordered. You can raise this cutoff to make the classification of a residue as disordered stricter.

In [None]:
print("Cutoff value of 0.5 gives percent disorder of {}".format(meta.percent_disorder(hnRNPA1, cutoff=0.5)))
print("Cutoff value of 0.75 gives percent disorder of {}".format(meta.percent_disorder(hnRNPA1, cutoff=0.75)))
print("Cutoff value of 1.0 gives percent disorder of {}".format(meta.percent_disorder(hnRNPA1, cutoff=1.0)))
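
To make the effect of the cutoff concrete, here is a minimal sketch of how a percentage could be derived from a list of per-residue scores. The helper `percent_above_cutoff` and the scores are hypothetical and are not part of metapredict. Whether the library compares with `>=` or `>` is an assumption here.

```python
def percent_above_cutoff(scores, cutoff=0.3):
    """Percentage of per-residue scores at or above the cutoff (hypothetical helper)."""
    if not scores:
        return 0.0
    n_disordered = sum(1 for s in scores if s >= cutoff)
    return 100.0 * n_disordered / len(scores)


# Made-up scores for illustration only; real scores come from meta.predict_disorder()
toy_scores = [0.05, 0.20, 0.35, 0.55, 0.80, 0.95, 0.40, 0.10]

for cutoff in (0.3, 0.5, 0.75):
    print("cutoff", cutoff, "->", percent_above_cutoff(toy_scores, cutoff))
```

Raising the cutoff can only keep the percentage the same or lower it, since fewer residues pass a stricter threshold.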

### Predicting disorder using a FASTA formatted file

metapredict can predict disorder for every sequence in a FASTA file. This means you can download an entire proteome (in FASTA format) and predict disorder for all of its sequences with a single command. To keep this example short, we will just look at 2 sequences.

In [None]:
print(meta.predict_disorder_fasta('Tau_and_p53.fasta'))

When predicting disorder from a FASTA file, metapredict returns the values as a dictionary, which makes it easy to retrieve the disorder scores for an individual sequence.

In [None]:
fasta_disorder=meta.predict_disorder_fasta('Tau_and_p53.fasta')

In [None]:
print(fasta_disorder['TAU'])

In [None]:
print(fasta_disorder['p53'])
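
The dictionary keys used above correspond to FASTA headers. As a rough illustration of that mapping, the sketch below parses FASTA-formatted text into a `{header: sequence}` dictionary using only the standard library. The function `read_fasta_text` and the short sequences are hypothetical. This is not how metapredict reads files internally.

```python
def read_fasta_text(text):
    """Parse FASTA-formatted text into a {header: sequence} dict (hypothetical helper)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append them to the current record
            records[header] += line
    return records


# Made-up, truncated sequences for illustration only
toy_fasta = ">TAU\nMAEP\nRQEF\n>p53\nMEEPQ\n"
toy_records = read_fasta_text(toy_fasta)
print(list(toy_records.keys()))
```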

You can also save the output from the predict_disorder_fasta() function by specifying `output_file=` and then specifying the file path and the file name. This writes to the relative/absolute location, using the current working directory as default.

In [None]:
meta.predict_disorder_fasta('Tau_and_p53.fasta', output_file='Tau_and_p53_predictions.csv')

You can also save the `.csv` to a different directory by including that directory in the `output_file` path.

In [None]:
# set the file path to the current directory followed by the example_output_path
meta.predict_disorder_fasta('Tau_and_p53.fasta', output_file="example_output_path/Tau_and_p53_predictions")

### Predicting disorder domains

metapredict can predict 'ordered' and disordered domains in a sequence. This can be done by inputting a sequence or by using a Uniprot ID. We have found that using this function is slightly more accurate than a simple binary classification that checks whether each residue is above or below the cutoff value.

In [None]:
disorder_domains = meta.predict_disorder_domains("MKAPSNGFLPSSNEGEKKPINSQLWHACAGPLVSLPPVGSLVVYFPQGHSEQVAASMQKQTDFIPNYPNLPSKLICLLHS")

The output from `predict_disorder_domains()` is a 4-element list with the following components:

* 0: The raw per-residue disorder scores from 0 to 1
* 1: The smoothed per-residue disorder score used for boundary identification (may extend above and below 0 and 1)
* 2: A list of the IDRs 
* 3: A list of the folded domains

The IDRs and folded domains combined should equal the full sequence.

The IDRs and folded domains are each returned as a list of lists, where each sublist has three elements:

* 0: Domain start position (0-indexed)
* 1: Domain end position (0-indexed)
* 2: Domain sequence

Note that the start and end positions are Python-indexed (as opposed to human indexed) so one can do `sequence[start:end]` and get the same IDR back.

As an example:

In [None]:
print('IDR(s) shown below')
disorder_domains[2]

In [None]:
print('Folded domain(s) shown below')
disorder_domains[3]

In [None]:
testing_domains_func = meta.predict_disorder_domains("MKAPSNGFLPSSNEGEKKPINSQLWHACAGPLVSLPPVGSLVVYFPQGHSEQVAASMQKQTDFIPNYPNLPSKLICLLHS")

In [None]:
# if we want the raw disorder scores for the sequence, we simply take item 0 of the returned list
raw_dis_scores=testing_domains_func[0]
print(raw_dis_scores)

In [None]:
# if we want the 'smoothed' disorder scores, which are used for defining the domains, we call the 1 item
smoothed_dis_scores = testing_domains_func[1]
print(smoothed_dis_scores)

In [None]:
# if we want the list of IDRs, we call the 2 item
IDRs = testing_domains_func[2]
print(IDRs)

In [None]:
# in IDRs, the first two items in the list are the coordinates for the IDR and the third item is the sequence
# this sequence only has 1 IDR, so let's grab that from the list of IDRs.
IDR1 = IDRs[0]
# now we can grab the coordinates and the sequence from the IDR.
start = IDR1[0]
end = IDR1[1]
local_idr_sequence = IDR1[2]
print("The coordinates for this IDR are {} and {}. The sequence of the IDR is {}".format(start, end, local_idr_sequence))
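
Because every domain entry has the form `[start, end, sequence]` with Python-style coordinates, the IDRs and folded domains together should tile the full sequence. The sketch below checks that property on made-up data. The helper `domains_tile_sequence` and the boundaries are hypothetical and are not real metapredict output.

```python
def domains_tile_sequence(sequence, idrs, folded):
    """Check that [start, end, seq] entries cover the sequence with no gaps or overlaps."""
    entries = sorted(idrs + folded, key=lambda d: d[0])
    position = 0
    for start, end, domain_seq in entries:
        # Each domain must begin where the previous one ended,
        # and slicing with its coordinates must return its sequence
        if start != position or sequence[start:end] != domain_seq:
            return False
        position = end
    return position == len(sequence)


# Made-up boundaries for illustration only
toy_sequence = "MKAPSNGFLPSSNEGEKKPI"
toy_idrs = [[0, 8, "MKAPSNGF"]]
toy_folded = [[8, 20, "LPSSNEGEKKPI"]]

print(domains_tile_sequence(toy_sequence, toy_idrs, toy_folded))
```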

#### Additional arguments

With the predict_disorder_domains() function, you can specify various parameters, including the disorder cutoff value (default is 0.42), the minimum IDR size (default is 12), the minimum size of a folded domain (default is 50), and the gap closure size (default is 10). Information on these parameters can be found below:

**disorder_threshold : float**
        Value that defines what 'disordered' is based on the metapredict disorder score. The higher the value the more stringent the cutoff. Default = 0.42

**minimum_IDR_size : int**
        Defines the smallest possible IDR. This is a hard limit - i.e. we CANNOT get IDRs smaller than this. Default = 12.

**minimum_folded_domain : int** 
        Defines where we expect the limit of small folded domains to be. This is NOT a hard limit and functions to modulate the removal of large gaps (i.e. gaps less than this size are treated less strictly). Note that, in addition, gaps < 35 are evaluated with a threshold of 0.35\*disorder_threshold and gaps < 20 are evaluated with a threshold of 0.25\*disorder_threshold. These two length scales were chosen because coiled-coil regions (which are IDRs in isolation) often show up with reduced apparent disorder within IDRs, but can be as short as 20-30 residues. This value is used based on the idea that it allows a 'shortest reasonable' folded domain to be identified. Default = 50.

**gap_closure : int**
        Defines the largest gap that would be 'closed'. Gaps here refer to a scenario in which you have two groups of disordered residues separated by a 'gap' of non-disordered residues. In general, large gap sizes will favour larger contiguous IDRs. It's worth noting that gap_closure becomes relevant only when minimum_region_size becomes very small (i.e. < 5), because gaps really emerge only when the disorder profile is "noisy"; once the profile is smoothed, gaps are increasingly rare. Default = 10.

In [None]:
# example specifying all parameters
print(meta.predict_disorder_domains(hnRNPA1, disorder_threshold=0.3, minimum_IDR_size=15, minimum_folded_domain=60, gap_closure=12))
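
To build intuition for how a threshold, a minimum IDR size, and gap closure interact, here is a deliberately simplified toy. It is not metapredict's algorithm: it skips the smoothing step and the folded-domain logic described above, and the function names `runs` and `toy_idr_calls` are hypothetical.

```python
from itertools import groupby


def runs(mask):
    """Return [value, start, end] for each run of equal values (end is exclusive)."""
    out = []
    pos = 0
    for value, group in groupby(mask):
        length = len(list(group))
        out.append([value, pos, pos + length])
        pos += length
    return out


def toy_idr_calls(scores, threshold=0.42, min_idr=12, gap_closure=10):
    """Toy IDR caller: threshold, close short interior gaps, drop short IDRs."""
    mask = [s >= threshold for s in scores]
    segments = runs(mask)
    for i, (value, start, end) in enumerate(segments):
        # An interior ordered run is always flanked by disordered runs on both sides
        interior = 0 < i < len(segments) - 1
        if not value and interior and (end - start) <= gap_closure:
            for j in range(start, end):
                mask[j] = True
    return [[s, e] for v, s, e in runs(mask) if v and (e - s) >= min_idr]


# Made-up score profile: two disordered stretches split by a 5-residue gap,
# then a long ordered stretch, then a short disordered tail
toy_profile = [0.9] * 15 + [0.1] * 5 + [0.9] * 15 + [0.1] * 30 + [0.9] * 5

print(toy_idr_calls(toy_profile, gap_closure=10))
print(toy_idr_calls(toy_profile, gap_closure=0))
```

With gap closure enabled, the 5-residue gap is bridged and the two stretches merge into one IDR; with it disabled, they are reported separately. In both cases the 5-residue tail is dropped because it is shorter than the minimum IDR size.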

### Predicting disorder domains from a Uniprot ID

Similar to being able to predict disorder scores from a Uniprot ID, you can also predict disorder domains using a Uniprot ID.

In [None]:
print(meta.predict_disorder_domains_uniprot('P04637'))

# Graphing Disorder

metapredict contains substantial functionality for graphing disorder, making it easy to quickly visualise which parts of your sequence of interest are disordered. In addition, we added some functionality so you can customize your graph in a few *nifty* ways.

### Graphing disorder from a sequence

Similar to meta.predict_disorder(), the *meta.graph_disorder()* function generates graphs directly from a sequence.

In [None]:
# graph disorder using previously defined hnRNPA1 sequence
meta.graph_disorder(hnRNPA1)

The graph_disorder() function has a few arguments that allow you to customize the graph. Here are a few examples:

### Adding the name of the protein to the title of the graph

In [None]:
meta.graph_disorder(hnRNPA1, title="hnRNPA1")

### Shading regions of the graph

In [None]:
meta.graph_disorder(hnRNPA1, shaded_regions=[[1, 20],[73, 103], [175, 373]])

### Specifying color of the shaded regions of the graph

In [None]:
meta.graph_disorder(hnRNPA1, shaded_regions=[[1, 20],[73, 103], [175, 373]], shaded_region_color="orange")
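
The `shaded_regions` argument takes a list of `[start, end]` pairs, so domain entries of the form `[start, end, sequence]` can be reduced to that shape. The sketch below does this with a hypothetical helper, `domains_to_regions`. Whether the plot expects 0-based or 1-based positions is not stated here, so the `offset` parameter is an assumption you would need to check.

```python
def domains_to_regions(domains, offset=0):
    """Reduce [start, end, seq] entries to [start, end] pairs (hypothetical helper)."""
    return [[start + offset, end + offset] for start, end, _seq in domains]


# Made-up domain entries for illustration only
toy_domains = [[0, 5, "MSKSE"], [10, 14, "EQLR"]]

print(domains_to_regions(toy_domains))
print(domains_to_regions(toy_domains, offset=1))
```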

### Saving the output graph

Metapredict also makes it easy for you to save graphs directly from Python. Simply set `output_file=` with the file path and the name.

In [None]:
meta.graph_disorder(hnRNPA1, output_file = "example_output_path/hnRNPA1_disorder.png")

### Specify the DPI of the output

You can also specify the DPI of the generated graph. The higher the DPI, the higher the resolution of the graph.

In [None]:
meta.graph_disorder(hnRNPA1, DPI=600)

### Change output filetype
You can also change the output filetype simply by changing the file extension - e.g. generate a PDF instead:


In [None]:
meta.graph_disorder(hnRNPA1, output_file = "example_output_path/hnRNPA1_disorder.pdf")

## Graph disorder from a .fasta file

Similar to being able to predict disorder from a FASTA file, you can also generate graphs from a FASTA file from Python. If no `output_file` is defined all the sequences in the FASTA file render in the notebook.

In [None]:
meta.graph_disorder_fasta('Tau_and_p53.fasta')

By default the `graph_disorder_fasta()` function will specify the title of the graph as the FASTA header. 

### Saving graphs generated from a FASTA file

metapredict makes it easy for you to generate a large number of graphs from any fasta file using the `graph_disorder_fasta()` function. By default, it will save the graphs to your current directory, but you can specify the output directory as well. Here are a few examples:

In [None]:
# save the output to a specific folder
output_path = "my_cool_graphs"
meta.graph_disorder_fasta("Tau_and_p53.fasta", output_dir=output_path)

By default, the file names generated will be (up to) the first 14 alphanumeric characters of the FASTA header.

### Avoiding overwriting in output files
You may have sequences with almost identical FASTA headers (or headers that are identical in the first 14 characters). To avoid these overwriting each other, the `meta.graph_disorder_fasta()` function comes with an `indexed_filenames=` parameter which, if set to `True`, gives each output file a leading integer (starting at 1 and monotonically increasing), guaranteeing uniqueness.

In [None]:
output_path = "my_cool_graphs"
meta.graph_disorder_fasta("Tau_and_p53.fasta", output_dir=output_path, indexed_filenames=True)
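
To see why truncated headers can collide and how a leading index avoids it, here is a small sketch. The exact naming scheme metapredict uses is not shown in this notebook, so the function `toy_filename`, the `index_stem` format, and the headers are all hypothetical.

```python
def toy_filename(header, index=None, max_chars=14, extension=".png"):
    """Build a file name from the leading alphanumeric characters of a header."""
    stem = "".join(ch for ch in header if ch.isalnum())[:max_chars]
    if index is not None:
        # A leading integer keeps names unique even when the stems match
        stem = f"{index}_{stem}"
    return stem + extension


# Made-up headers that share their first 14 alphanumeric characters
toy_headers = [
    "sp|P04637|P53_HUMAN isoform 1",
    "sp|P04637|P53_HUMAN isoform 2",
]

plain = [toy_filename(h) for h in toy_headers]
indexed = [toy_filename(h, index=i) for i, h in enumerate(toy_headers, start=1)]

print(plain)
print(indexed)
```

Without the index both headers map to the same name, so the second graph would overwrite the first; with the index the names differ.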

### Generating graphs from a Uniprot ID

Similar to being able to predict disorder of a sequence using the Uniprot ID, you can also generate graphs using a Uniprot ID. 

In [None]:
meta.graph_disorder_uniprot('P04637')

# For full documentation of metapredict, please see:

http://metapredict.readthedocs.io

# For access to the code for metapredict, please see:

https://github.com/idptools/metapredict

# For predicting disorder using our server, please see:

https://metapredict.net