# Batch disorder predictions using metapredict

<a target="_blank" href="https://colab.research.google.com/github/idptools/metapredict/blob/batch_mode/colab/metapredict_colab.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


#### *Version 1.3* (updated 2023-05-27)
Updated to use batch mode.


#### *Version 1.2* (updated 2022-08-30)
This notebook provides a simple interface for performing batch disorder predictions using metapredict V2.

For more information on how metapredict works [please see our preprint]().


## TL;DR
Upload a FASTA file, get a CSV file with per-residue disorder scores for the sequences. There is no limit on the number of sequences that can be submitted, but in general Google Colab notebooks can crash...

## Known issues:
Some anti-tracking tools and other ad-blocking plugins can prevent the notebook from working.

Known errors include:

* `TypeError: google.colab._files is undefined`

To diagnose this, we suggest opening the notebook in an Incognito window; note that you'll still need to sign in.

## More info
More details at the end of this page!

In [1]:
#@title Download metapredict
#@markdown Press play to download metapredict.
import time
start = time.time()

# install then import metapredict
!pip install git+https://git@github.com/idptools/metapredict.git@batch_mode --quiet;

# included for good measure but metapredict should have this
# as a dependency!
!pip install protfasta --quiet
import metapredict as meta
# get stuff for getting files and what not
from google.colab import files

# import other goodies
import re
import os
from random import randint
import protfasta
from datetime import datetime
import time
import numpy as np
end = time.time()
print(f'Packages installed and ready to go (setup took {np.round(end-start,2)} seconds)!')



  Preparing metadata (setup.py) ... done
Packages installed and ready to go (setup took 19.64 seconds)!


In [3]:
#@title Choose a `.fasta` file to make predictions.
#@markdown Press the play button then choose the .fasta file containing sequences you'd like to predict disorder for. Your browser will download the disorder prediction results!
#@markdown The file will download as `disorder_scores_<date_and_time>.csv`.
prediction_mode = 'disorder scores' #@param ["disorder scores", "disordered regions"]
# start timing
start = time.time()

# upload and save
uploaded = files.upload()
print('Uploading sequences...')

# get filename
try: 
    # this ENSURES we overwrite an existing
    # file if it was there before...
    fn = list(uploaded.keys())[0]  
    with open(fn,'wb') as fh:
      fh.write(uploaded[fn]) 
except Exception:
    raise Exception('No file uploaded')
  
# read sequences
try:
    input_seqs = protfasta.read_fasta(fn, expect_unique_header=False, return_list=True, invalid_sequence_action='convert' )
except Exception as e:
    print('ERROR: An exception occurred when parsing your FASTA file.\n\nSorry about that! Please make sure your FASTA file is an appropriately formatted\nFASTA file. The error message below may help, but if not please report this\nerror on the metapredict issue tracker:\n\nhttps://github.com/idptools/metapredict/issues ')  
    raise Exception(e)

print(f'Read in FASTA file and found {len(input_seqs)} sequences')


# if we get here assume we've read things in OK...

# get datetime string for output file - this helps avoid overwriting 
# and tells people when they generated the file!
now = datetime.now()
now_string = now.strftime("%d_%m_%Y_%H_%M_%S")

# build idx
in_dict = {}
idx2name = {}
for i in range(len(input_seqs)):

    idx2name[i] = input_seqs[i][0]
    in_dict[i]  = input_seqs[i][1]
    
    
print('Predicting disorder ...')    

# if we just want disorder scores
if prediction_mode == 'disorder scores':
    out = meta.predict_disorder_batch(in_dict)

# if we want whole IDRs
elif prediction_mode == 'disordered regions':
    out = meta.predict_disorder_batch(in_dict, return_domains=True)

# build disorder_out dictionary 
outstring = f'disorder_scores_{now_string}.csv'

if prediction_mode == 'disorder scores':
    n_res = 0
    with open(outstring,'w') as fh:

        for idx in out:
            disorder = out[idx][1]
            name = idx2name[idx]
            seq = out[idx][0]
            
            n_res = n_res + len(seq)
            
            # update so no commas in name
            name = name.replace(',',';')

            # convert to a comma-separated string
            disorder_string = ", ".join([str(i) for i in disorder])

            # write a line with 4 columns 
            fh.write(f"{idx}, {name}, {seq}, {disorder_string}\n")


elif prediction_mode == 'disordered regions':

    n_res = 0
    idr_index = 0
    with open(outstring,'w') as fh:

        for idx in out:

            name = idx2name[idx].replace(',',';')

            entry = out[idx]
            for i in range(len(entry.disordered_domains)):
                # use IDR-specific names so the timing variable `start` is not overwritten
                idr_start = entry.disordered_domain_boundaries[i][0]+1
                idr_end = entry.disordered_domain_boundaries[i][1]+1
                seq = entry.disordered_domains[i]

                fh.write(f"{idr_index}, {name}, {idr_start}, {idr_end}, {seq}\n")

                idr_index = idr_index + 1



end = time.time()
n_res = np.sum([len(x[1]) for x in input_seqs])
n_seqs = len(input_seqs)

r_per_second = np.round((end - start)/n_res,7)
s_per_second = np.round((end - start)/n_seqs,3)

print('\nPerformance statistics:')
print('----------------------------------')
print(f'Execution time was {time.strftime("%H:%M:%S", time.gmtime(end-start))} (hr:min:sec)')
print(f'{r_per_second} seconds per residue')
print(f'{s_per_second} seconds per sequence')


# finally prompt the output file
files.download(outstring)
print('Done!')




Saving ok.fasta to ok (2).fasta
Uploading sequences...
Read in FASTA file and found 226 sequences
Predicting disorder scores...


100%|██████████| 8/8 [00:05<00:00,  1.42it/s]


Performance statistics:
----------------------------------
Execution time was 00:00:11 (hr:min:sec)
0.0001141 seconds per residue
0.05 seconds per sequence





Done!


# Documentation

## Format of input
The input file must be a correctly formatted FASTA file. This means each sequence is defined by a header (which starts with a `>` character), followed on the NEXT line(s) by a valid amino acid sequence.

We do not require FASTA headers to be unique, and invalid amino acids will be converted to standard amino acids as well as possible, using the following conversion convention:

    B   -> N
    U   -> C
    X   -> G
    Z   -> Q
    " " -> <empty string> (i.e. a whitespace character)
    *   -> <empty string>
    -   -> <empty string>

Under the hood, we're using [protfasta](https://protfasta.readthedocs.io/en/latest/) to parse the FASTA file. If this fails, an error message should print, and if things really seem to go wrong [please raise an issue on our metapredict issue tracker](https://github.com/idptools/metapredict).
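
The conversion table above can be sketched in plain Python. This is only an illustration of the mapping and makes no claim about how protfasta implements it internally; the helper name `convert_nonstandard` is hypothetical.

```python
# Illustrative only: mirrors the conversion table above.
# This is NOT protfasta's own code, and `convert_nonstandard` is a made-up name.
CONVERSION = str.maketrans({
    "B": "N",
    "U": "C",
    "X": "G",
    "Z": "Q",
    " ": None,  # whitespace is removed
    "*": None,  # stop-codon markers are removed
    "-": None,  # gap characters are removed
})


def convert_nonstandard(sequence):
    """Return `sequence` with non-standard characters converted or removed."""
    return sequence.translate(CONVERSION)


print(convert_nonstandard("MKB-XZU* A"))
```

Because converted sequences can differ from what was uploaded, the sequence written to the output file is the converted one.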

## Format of output file
The output file is a *bona fide* CSV (comma-separated values) file.

Each sequence has its own dedicated line, and each line has the following format:

1. **Index** (starting at 0). This is a unique number which will be incremented by one. Note that if any sequences fail you'll see a jump in the index, because failed sequences are skipped.

2. **Header** this is the FASTA header used

3. **Sequence** the actual sequence used in the prediction. Note we say "actual" because it may have been converted if an invalid amino acid was found on parsing (see above).

4. **Disorder values** column 4 onwards holds the per-residue disorder scores, so the number of columns varies with sequence length.

### NOTE 
Because we want to guarantee that the output file is a true CSV file, any commas in your FASTA headers will be replaced with a semi-colon. You have been warned.

The order of output sequence predictions is guaranteed to match the input order, and individually invalid sequences will be skipped.
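
As a rough sketch of how the disorder-score output could be read back in, the snippet below splits each line on commas and treats everything from column 4 onwards as scores. The function names and the example line are hypothetical and only illustrate the column layout described above; they are not part of metapredict.

```python
# Minimal sketch for reading the per-residue disorder CSV described above.
# Function names and the example line are hypothetical, for illustration only.

def parse_disorder_line(line):
    """Split one output line into (index, header, sequence, scores)."""
    fields = [field.strip() for field in line.strip().split(",")]
    index = int(fields[0])
    header = fields[1]      # commas in the original header were replaced by ';'
    sequence = fields[2]
    scores = [float(value) for value in fields[3:]]
    return index, header, sequence, scores


def read_disorder_csv(filename):
    """Return a list of parsed records, skipping blank lines."""
    records = []
    with open(filename) as fh:
        for line in fh:
            if line.strip():
                records.append(parse_disorder_line(line))
    return records


# made-up example line in the documented four-part format
example = "0, my protein; isoform 2, MKT, 0.91, 0.85, 0.4\n"
index, header, sequence, scores = parse_disorder_line(example)
print(index, header, sequence, scores)
```

Because each row has one score per residue, rows have different lengths, which is why a simple line-by-line split is used here rather than a fixed-width table reader.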

### Help
If things go wrong, please don't hesitate to [raise an issue on GitHub](https://github.com/idptools/metapredict).

## Changelog
* 2023-05-27: Updated to use batch mode (v1.3)
* 2022-08-30: Added performance statistics (v1.2)
* 2022-08-26: The metapredict batch notebook is live!