# Metapredict examples

In [1]:
# Make graphs show up properly
%matplotlib inline

## Predicting Disorder

In [2]:
import metapredict
from metapredict import meta

### Predicting Disorder From a Sequence

In [3]:
# Example sequence is hnRNPA1 UniprotID P09651
hnRNPA1 = 'MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGGGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF'

In [4]:
# print predicted disorder for hnRNPA1
print(meta.predict_disorder(hnRNPA1))

[1, 1, 1, 1, 0.965, 0.895, 0.828, 0.729, 0.631, 0.615, 0.537, 0.448, 0.473, 0.373, 0.374, 0.346, 0.226, 0.169, 0.199, 0.195, 0.177, 0.205, 0.17, 0.24, 0.215, 0.213, 0.226, 0.247, 0.285, 0.24, 0.228, 0.256, 0.269, 0.187, 0.185, 0.205, 0.149, 0.136, 0.135, 0.104, 0.088, 0.105, 0.093, 0.121, 0.215, 0.297, 0.354, 0.327, 0.306, 0.265, 0.274, 0.283, 0.29, 0.312, 0.323, 0.343, 0.253, 0.199, 0.13, 0.042, 0.01, 0.028, 0.033, 0.021, 0.063, 0.092, 0.085, 0.12, 0.184, 0.202, 0.252, 0.286, 0.354, 0.372, 0.393, 0.41, 0.38, 0.403, 0.377, 0.437, 0.426, 0.418, 0.35, 0.385, 0.447, 0.444, 0.411, 0.446, 0.454, 0.482, 0.562, 0.563, 0.559, 0.557, 0.564, 0.523, 0.496, 0.524, 0.517, 0.486, 0.448, 0.389, 0.314, 0.239, 0.197, 0.214, 0.136, 0.098, 0.15, 0.163, 0.137, 0.152, 0.234, 0.198, 0.172, 0.16, 0.169, 0.187, 0.203, 0.208, 0.164, 0.113, 0.141, 0.109, 0.04, 0.006, 0.009, 0.012, 0, 0, 0, 0, 0, 0.029, 0.069, 0.163, 0.268, 0.326, 0.314, 0.297, 0.291, 0.283, 0.279, 0.262, 0.257, 0.261, 0.285, 0.189, 0.127, 0.097

NOTE: You do not need to assign the sequence to a variable before predicting disorder. You can pass the sequence directly as a string to the meta.predict_disorder() function.

In [5]:
# directly input sequence...
print(meta.predict_disorder('MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGGGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF'))

[1, 1, 1, 1, 0.965, 0.895, 0.828, 0.729, 0.631, 0.615, 0.537, 0.448, 0.473, 0.373, 0.374, 0.346, 0.226, 0.169, 0.199, 0.195, 0.177, 0.205, 0.17, 0.24, 0.215, 0.213, 0.226, 0.247, 0.285, 0.24, 0.228, 0.256, 0.269, 0.187, 0.185, 0.205, 0.149, 0.136, 0.135, 0.104, 0.088, 0.105, 0.093, 0.121, 0.215, 0.297, 0.354, 0.327, 0.306, 0.265, 0.274, 0.283, 0.29, 0.312, 0.323, 0.343, 0.253, 0.199, 0.13, 0.042, 0.01, 0.028, 0.033, 0.021, 0.063, 0.092, 0.085, 0.12, 0.184, 0.202, 0.252, 0.286, 0.354, 0.372, 0.393, 0.41, 0.38, 0.403, 0.377, 0.437, 0.426, 0.418, 0.35, 0.385, 0.447, 0.444, 0.411, 0.446, 0.454, 0.482, 0.562, 0.563, 0.559, 0.557, 0.564, 0.523, 0.496, 0.524, 0.517, 0.486, 0.448, 0.389, 0.314, 0.239, 0.197, 0.214, 0.136, 0.098, 0.15, 0.163, 0.137, 0.152, 0.234, 0.198, 0.172, 0.16, 0.169, 0.187, 0.203, 0.208, 0.164, 0.113, 0.141, 0.109, 0.04, 0.006, 0.009, 0.012, 0, 0, 0, 0, 0, 0.029, 0.069, 0.163, 0.268, 0.326, 0.314, 0.297, 0.291, 0.283, 0.279, 0.262, 0.257, 0.261, 0.285, 0.189, 0.127, 0.097

### Predicting disorder using a Uniprot ID

metapredict allows disorder predictions by inputting the Uniprot ID rather than the sequence. The Uniprot ID for p53 is P04637. Let's give that a go.

In [6]:
print(meta.predict_disorder_uniprot('P04637'))
# NOTE: This function can take a little bit of time due to having to fetch the sequence from Uniprot -
# the time depends on your internet connectivity!

[1, 1, 1, 1, 1, 1, 0.993, 0.982, 0.882, 0.86, 0.824, 0.794, 0.719, 0.668, 0.634, 0.562, 0.513, 0.498, 0.457, 0.4, 0.404, 0.377, 0.271, 0.266, 0.246, 0.294, 0.337, 0.317, 0.337, 0.356, 0.367, 0.402, 0.417, 0.449, 0.428, 0.487, 0.482, 0.484, 0.478, 0.45, 0.465, 0.46, 0.407, 0.398, 0.406, 0.376, 0.385, 0.373, 0.392, 0.374, 0.437, 0.478, 0.467, 0.534, 0.504, 0.521, 0.549, 0.617, 0.696, 0.755, 0.791, 0.812, 0.807, 0.803, 0.793, 0.762, 0.759, 0.764, 0.771, 0.764, 0.764, 0.741, 0.747, 0.761, 0.76, 0.766, 0.765, 0.768, 0.782, 0.777, 0.765, 0.74, 0.718, 0.738, 0.728, 0.694, 0.68, 0.641, 0.627, 0.565, 0.572, 0.58, 0.527, 0.585, 0.567, 0.541, 0.5, 0.503, 0.456, 0.454, 0.455, 0.45, 0.405, 0.361, 0.352, 0.35, 0.352, 0.269, 0.247, 0.18, 0.149, 0.131, 0.115, 0.106, 0.148, 0.144, 0.152, 0.16, 0.173, 0.187, 0.219, 0.204, 0.165, 0.119, 0.071, 0.062, 0.024, 0.05, 0.084, 0.054, 0.013, 0.031, 0.027, 0, 0, 0, 0, 0, 0, 0, 0.032, 0.01, 0.058, 0.071, 0.116, 0.234, 0.396, 0.409, 0.368, 0.346, 0.313, 0.304, 0.30

### Calculating percent disorder

If you just need the percent disorder of a sequence, metapredict can do that using the percent_disorder() function.

In [None]:
# We will use the hnRNPA1 sequence we set before for this.
# Like meta.predict_disorder(), you can also just input a sequence.
print(meta.percent_disorder(hnRNPA1))

By default, the percent_disorder() function considers a residue disordered when its score reaches a cutoff of 0.3. You can raise this cutoff to make the classification of a residue as disordered stricter.

In [None]:
print("Cutoff value of 0.5 gives percent disorder of {}".format(meta.percent_disorder(hnRNPA1, cutoff=0.5)))
print("Cutoff value of 0.75 gives percent disorder of {}".format(meta.percent_disorder(hnRNPA1, cutoff=0.75)))
print("Cutoff value of 1.0 gives percent disorder of {}".format(meta.percent_disorder(hnRNPA1, cutoff=1.0)))
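To make the effect of the cutoff concrete, here is a minimal sketch of the underlying arithmetic on a short, made-up list of scores. The helper `percent_above_cutoff` is hypothetical and not part of metapredict. Treating a score equal to the cutoff as disordered is an assumption, and metapredict's own comparison and rounding may differ.

```python
def percent_above_cutoff(scores, cutoff=0.3):
    """Return the percentage of scores at or above the cutoff (hypothetical helper)."""
    if not scores:
        return 0.0
    disordered = sum(1 for score in scores if score >= cutoff)
    return 100.0 * disordered / len(scores)


# Made-up scores for illustration only, not real metapredict output.
example_scores = [0.9, 0.8, 0.45, 0.35, 0.2, 0.1, 0.05, 0.6]

# A stricter (higher) cutoff can only keep the percentage the same or lower it.
for cutoff in (0.3, 0.5, 0.75):
    print(cutoff, percent_above_cutoff(example_scores, cutoff))
```

With these example scores the percentage falls as the cutoff rises, which is the same trend you should expect from the cell above.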

### Predicting disorder using a FASTA formatted file

Metapredict allows users to predict disorder for all sequences in a FASTA file. This means you can download an entire proteome (in FASTA format) and predict disorder for every sequence in that proteome with a single command. To keep the output short, in this example we will just look at 2 sequences.

In [None]:
print(meta.predict_disorder_fasta('Tau_and_p53.fasta'))

When predicting disorder from a FASTA file, metapredict returns the values as a dictionary to make it easy to retrieve individual disorder scores.

In [None]:
fasta_disorder=meta.predict_disorder_fasta('Tau_and_p53.fasta')

In [None]:
print(fasta_disorder['TAU'])

In [None]:
print(fasta_disorder['p53'])
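Because the result is a plain dictionary, you can loop over it to summarise every protein at once. The sketch below uses a hypothetical stand-in dictionary with made-up scores so that it runs without the FASTA file; `example_fasta_disorder` and `mean_disorder` are illustrative names, not part of metapredict.

```python
# Hypothetical stand-in for the dictionary returned by predict_disorder_fasta().
# The scores are made up and far shorter than real output.
example_fasta_disorder = {
    "TAU": [0.9, 0.8, 0.7, 0.6],
    "p53": [0.2, 0.4, 0.6, 0.8],
}


def mean_disorder(scores):
    """Average disorder score for one protein (hypothetical helper)."""
    return sum(scores) / len(scores)


# Build a header -> mean score summary, rounded for readability.
summary = {
    header: round(mean_disorder(scores), 3)
    for header, scores in example_fasta_disorder.items()
}

for header, mean_score in summary.items():
    print(f"{header}: mean disorder {mean_score}")
```

The same loop should work on the real `fasta_disorder` dictionary, assuming each value is a list of per-residue scores as shown above.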

You can also save the output from the predict_disorder_fasta() function by setting *save=True*. By default, this will save the output to your current directory as **predicted_disorder_values.csv**; however, you can specify the name by setting *output_name="my_awesome_predictions.csv"*.

In [None]:
meta.predict_disorder_fasta('Tau_and_p53.fasta', save=True, output_name='Tau_and_p53_predictions.csv')

You can also specify where the .csv is saved by setting *output_path="/Users/exampleUser/Desktop/ExampleFolder"*.

In [None]:
# get the current file path
import os
path="{}/example_output_path/".format(os.getcwd())
meta.predict_disorder_fasta('Tau_and_p53.fasta', save=True, output_path=path, output_name="Tau_and_p53_prediction.csv")
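The cell above assumes the target folder already exists. Whether metapredict creates a missing folder for you is not stated here, so a cautious approach is to create it yourself first. The sketch below uses only the standard library and a temporary directory as a stand-in for your working directory.

```python
import os
import tempfile

# A temporary directory stands in for your working directory in this sketch.
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "example_output_path")

# exist_ok=True makes this safe to run even if the folder already exists.
os.makedirs(output_dir, exist_ok=True)

# Full path the CSV would have if saved into this folder.
output_file = os.path.join(output_dir, "Tau_and_p53_prediction.csv")
print(output_file)
```

Using `os.path.join` rather than string formatting also avoids hard-coding the path separator.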

### Predicting disorder domains

metapredict can predict 'ordered' and disordered domains in a sequence. This can be done by inputting a sequence or by using a Uniprot ID. We have found that using this function is slightly more accurate than a simple binary classification based on whether each residue is above or below the cutoff value.

In [None]:
print(meta.predict_disorder_domains("MKAPSNGFLPSSNEGEKKPINSQLWHACAGPLVSLPPVGSLVVYFPQGHSEQVAASMQKQTDFIPNYPNLPSKLICLLHS"))

The formatting can look a little confusing, so let's walk through it quickly. The returned tuple is broken down as follows:

0. The raw disorder scores from 0 to 1, where 1 is the highest probability that a residue is disordered.
1. The smoothed disorder scores used for boundary identification.
2. A list of elements where each element is a list in which items 0 and 1 define the IDR location and item 2 gives the actual sequence.
3. A list of elements where each element is a list in which items 0 and 1 define the folded domain location and item 2 gives the actual sequence.

In [None]:
testing_domains_func=meta.predict_disorder_domains("MKAPSNGFLPSSNEGEKKPINSQLWHACAGPLVSLPPVGSLVVYFPQGHSEQVAASMQKQTDFIPNYPNLPSKLICLLHS")

In [None]:
# if we want the raw disorder scores for the sequence, we simply call the 0 item in the tuple
raw_dis_scores=testing_domains_func[0]
print(raw_dis_scores)

In [None]:
# if we want the 'smoothed' disorder scores, which are used for defining the domains, we call the 1 item
smoothed_dis_scores = testing_domains_func[1]
print(smoothed_dis_scores)

In [None]:
# if we want the list of IDRs, we call the 2 item
IDRs = testing_domains_func[2]
print(IDRs)

In [None]:
# in IDRs, the first two items in the list are the coordinates for the IDR and the third item is the sequence
# this sequence only has 1 IDR, so let's grab that from the list of IDRs.
IDR1 = IDRs[0]
# now we can grab the coordinates and the sequence from the IDR.
Coordinate1 = IDR1[0]
Coordinate2 = IDR1[1]
IDRSequence = IDR1[2]
print("The coordinates for this IDR are {} and {}. The sequence of the IDR is {}".format(Coordinate1, Coordinate2, IDRSequence))
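If a sequence has several IDRs or folded domains, tuple unpacking and a loop are tidier than indexing each item by hand. The sketch below uses a hypothetical stand-in tuple that mirrors the layout described above; every value in it, including the coordinates, is made up for illustration and does not imply any particular coordinate convention.

```python
# Hypothetical stand-in mirroring the tuple layout described above:
# (raw scores, smoothed scores, IDR entries, folded-domain entries).
example_result = (
    [0.9, 0.8, 0.2, 0.1],
    [0.85, 0.7, 0.3, 0.15],
    [[0, 2, "MK"]],
    [[2, 4, "AP"]],
)

# Unpack the four items in one step instead of indexing [0] to [3].
raw_scores, smoothed_scores, idrs, folded_domains = example_result

# Each entry is [coordinate 1, coordinate 2, sequence].
for start, end, sequence in idrs:
    print(f"IDR {start}-{end}: {sequence} ({len(sequence)} residues)")

for start, end, sequence in folded_domains:
    print(f"Folded domain {start}-{end}: {sequence} ({len(sequence)} residues)")
```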

#### Additional arguments

With the predict_disorder_domains() function, you can specify various parameters including the cutoff value for disorder (*default is 0.42*), the minimum IDR size (*default is 12*), the minimum size of the folded domain (*default is 50*), and the gap closure size (*default is 10*). Information on the various parameters can be found below:

**disorder_threshold : float**
        Value that defines what 'disordered' is based on the metapredict disorder score. The higher the value the more stringent the cutoff. Default = 0.42

**minimum_IDR_size : int**
        Defines the smallest possible IDR. This is a hard limit - i.e. we CANNOT get IDRs smaller than this. Default = 12.

**minimum_folded_domain : int** 
        Defines where we expect the limit of small folded domains to be. This is NOT a hard limit and functions to modulate the removal of large gaps (i.e. gaps less than this size are treated less strictly). Note that, in addition, gaps < 35 are evaluated with a threshold of 0.35*disorder_threshold and gaps < 20 are evaluated with a threshold of 0.25*disorder_threshold. These two length scales were chosen because coiled-coil regions (which are IDRs in isolation) often show up with reduced apparent disorder within IDRs, but can be as short as 20-30 residues. The folded_domain_threshold is used based on the idea that it allows a 'shortest reasonable' folded domain to be identified. Default = 50.

**gap_closure : int**
        Defines the largest gap that would be 'closed'. Gaps here refer to a scenario in which you have two groups of disordered residues separated by a 'gap' of un-disordered residues. In general, large gap sizes will favour larger contiguous IDRs. It's worth noting that gap_closure becomes relevant only when minimum_region_size becomes very small (i.e. < 5), because gaps really only emerge when the smoothed disorder fit is "noisy", and once the profile is smoothed gaps are increasingly rare. Default = 10.

In [None]:
# example specifying all parameters
print(meta.predict_disorder_domains(hnRNPA1, disorder_threshold=0.3, minimum_IDR_size=15, minimum_folded_domain=60, gap_closure=12))
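To illustrate what 'closing' a gap means, here is a deliberately simplified sketch that works on a list of True/False flags (True = disordered). It is not metapredict's actual algorithm: `close_gaps` is a hypothetical helper, and closing a gap whose length exactly equals `gap_closure` is an assumption.

```python
def close_gaps(is_disordered, gap_closure=10):
    """Fill ordered runs that are flanked by disorder and no longer than gap_closure."""
    closed = list(is_disordered)
    n = len(closed)
    i = 0
    while i < n:
        if closed[i]:
            i += 1
            continue
        # Find the end of this run of ordered (False) residues.
        j = i
        while j < n and not closed[j]:
            j += 1
        gap_length = j - i
        # A gap only counts if disordered residues sit on both sides of it.
        flanked = i > 0 and j < n
        if flanked and gap_length <= gap_closure:
            for k in range(i, j):
                closed[k] = True
        i = j
    return closed


# Made-up flags: a 2-residue gap, a 4-residue gap, and one trailing ordered residue.
example_mask = [True, True, False, False, True,
                False, False, False, False, True, False]

print(close_gaps(example_mask, gap_closure=2))
print(close_gaps(example_mask, gap_closure=4))
```

With `gap_closure=2` only the short gap is filled, while `gap_closure=4` merges everything into one contiguous disordered stretch except the trailing residue, which is not flanked on both sides. This mirrors the statement above that larger gap sizes favour larger contiguous IDRs.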

### Predicting disorder domains from a Uniprot ID

Similar to being able to predict disorder scores from a Uniprot ID, you can also predict disorder domains using a Uniprot ID.

In [None]:
# the function lives in the meta module imported at the top of the notebook
print(meta.predict_disorder_domains_uniprot('P04637'))

## Graphing Disorder

Metapredict contains substantial functionality for graphing disorder to make it easy to quickly visualise which parts of your sequence of interest are disordered. In addition, we added some functionality so you can customize your graph in a few *nifty* ways.

### Graphing disorder from a sequence

Similar to meta.predict_disorder(), metapredict can generate graphs directly from a sequence using the *meta.graph_disorder()* function.

In [None]:
# graph disorder using previously defined hnRNPA1 sequence
meta.graph_disorder(hnRNPA1)

In [None]:
# in the same way as the predict_disorder() function, you can also directly
# input the amino acid sequence as a string for graphing.
meta.graph_disorder('MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGGGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF')

The graph_disorder() function has a few arguments that allow you to customize the graph. Here are a few examples:

### Adding the name of the protein to the title of the graph

In [None]:
meta.graph_disorder(hnRNPA1, name="hnRNPA1")

### Changing the lines on the graph

In [None]:
meta.graph_disorder(hnRNPA1, line_intervals=[0, 0.5])

### Saving the output graph

Metapredict also makes it easy for you to save graphs directly from Python. Simply set save=True and output="name of my graph.png" and the graph will save to your current directory (whatever folder this Jupyter notebook is in). You can also specify the path, for example output="/Users/thisUser/Desktop/MyCoolGraphs/myGraph.png",
where the last part of the path is the name of your output graph.

In [None]:
meta.graph_disorder(hnRNPA1, save=True, output = "hnRNPA1_disorder.png")

### Specify the DPI of the output

Lastly, you can also specify the DPI of the generated graph. The higher the DPI, the higher the resolution of the graph.

In [None]:
meta.graph_disorder(hnRNPA1, DPI=600)

### Graphing disorder from a FASTA file

Similar to being able to predict disorder from a FASTA file, you can also generate graphs from a FASTA file from Python. 

In [None]:
meta.graph_disorder_fasta('Tau_and_p53.fasta', save=False)

By default, the graph_disorder_fasta() function uses the FASTA header as the title of each graph. Setting save=False displays the graphs immediately instead of saving them. **Warning:** if you have a large FASTA file and you set save=False, you will have to close each graph individually. I do not recommend setting save=False for large .fasta files.

### Saving graphs generated from a FASTA file

metapredict makes it easy for you to generate a large number of graphs from any fasta file using the graph_disorder_fasta() function. By default, it will save the graphs to your current directory, but you can specify the output path as well. Here are a few examples:

In [None]:
# just save the graphs of the sequences in the FASTA file to the current directory
meta.graph_disorder_fasta("Tau_and_p53.fasta")

In [None]:
# save the output to a specific folder
import os
path="{}/my_cool_graphs".format(os.getcwd())
meta.graph_disorder_fasta("Tau_and_p53.fasta", output_path=path)

By default, the files are named using the FASTA header. However, depending on how you download the FASTA file, the header may contain characters that some operating systems cannot use in file names. To bypass this, you can set remove_characters=True.
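To see why some headers are a problem as file names, here is a small sketch of the general idea of cleaning a header. It is not metapredict's implementation of remove_characters: `safe_filename` is a hypothetical helper, and the set of characters it keeps is an assumption.

```python
import re


def safe_filename(header, replacement="_"):
    """Replace anything other than letters, digits, '.', '-' and '_' (hypothetical helper)."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", replacement, header)
    # Drop replacement characters left dangling at either end.
    return cleaned.strip(replacement)


# Example header containing '|' and spaces, which are awkward in file names.
example_header = "sp|P04637|P53_HUMAN Cellular tumor antigen p53"
print(safe_filename(example_header))
# prints: sp_P04637_P53_HUMAN_Cellular_tumor_antigen_p53
```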

### Generating graphs from a Uniprot ID

Similar to being able to predict disorder of a sequence using the Uniprot ID, you can also generate graphs using a Uniprot ID. 

In [None]:
meta.graph_disorder_uniprot('P04637')

When using a Uniprot ID to generate a graph, all of the same functionality as meta.graph_disorder() is available. You can set the title of the graph with *name="myCoolProtein"*, save the graph with *save=True*, change the resolution with *DPI=300* (DPI can be set to numbers other than 300), and change the dashed lines on the graph with *line_intervals=[0.25, 0.5, 0.75]* (you can use any floats between 0 and 1 in the list). You can set the output name of the graph with *output="my_cool_graph.png"*, or set both the location and the name with *output="/Users/ThisUser/Desktop/MyCoolGraphsFolder/ThisProtein.png"*.

# For full documentation of metapredict, please see:

http://metapredict.readthedocs.io

# For access to the code for metapredict, please see:

https://github.com/idptools/metapredict