<a href="https://colab.research.google.com/github/holehouse-lab/ALBATROSS-colab/blob/main/idrome_constructor/idrome_constructor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>IDRome constructor</h1>

* Version 3.0 (released November 5th 2024)
* Version 1.1 (fix for metapredict_api update in SHEPHARD, January 31st 2024)
* Version 1.0 (initial release, May 27th 2023)

This notebook enables a complete IDR-ome annotation to be generated from an input FASTA file.

Specifically, by uploading a FASTA file this notebook will:

1. Predict all IDRs
2. Calculate sequence properties for each IDR
3. Predict ensemble properties using ALBATROSS
4. Return a CSV file with all this information for easy exploration.


### Input:
The only input file required is a valid FASTA file where each sequence has a unique FASTA header. If the FASTA file was obtained from UniProt, then checking the `uniprot_input` box will enable the UniProt ID to be excised into its own column. Note that commas in FASTA headers are replaced by ';' so that a bona fide CSV file can be generated without commas messing up column definitions.

### Output:
Once the notebook is complete, a CSV file called `IDRome_all.csv` will be downloaded.
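
The notebook writes a space after every comma, so a parser should skip that leading whitespace when reading the file back in. The sketch below uses Python's standard `csv` module; the two-row excerpt and the helper name `read_idrome` are made up for illustration, and only a few of the real columns are shown.

```python
import csv
import io

# Hypothetical excerpt in the notebook's output layout (values invented,
# most columns omitted for brevity).
example_csv = (
    "IDR ID, IDR start, IDR end, IDR len, FCR, sequence\n"
    "P12345_1_8, 1, 8, 8, 0.25, MSEKSGSG\n"
    "P12345_40_45, 40, 45, 6, 0.5, KEKEGS\n"
)

def read_idrome(handle):
    """Return the rows of an IDRome CSV as a list of dicts (hypothetical helper)."""
    # skipinitialspace=True drops the space that follows each comma,
    # so column names and values come back without leading whitespace.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return list(reader)

rows = read_idrome(io.StringIO(example_csv))

# Values are read as strings, so convert before numeric comparisons.
long_idrs = [row["IDR ID"] for row in rows if int(row["IDR len"]) >= 8]
```

For the downloaded file, the same helper could be called as `read_idrome(open('IDRome_all.csv', newline=''))`.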

### Performance:
If GPU credits are available, the human proteome takes ~1 minute. Without GPU credits, the human proteome takes roughly 6-7 minutes.

In [None]:
#@title Setup
#@markdown Run this cell to set up the notebook. This only needs to be done once, and multiple FASTA files can be analyzed using the cell below once the setup has been run.
!pip install git+https://git@github.com/holehouse-lab/shephard.git --quiet;
!pip install git+https://git@github.com/idptools/sparrow.git --quiet;
!pip install git+https://git@github.com/idptools/metapredict.git --quiet;

from google.colab import files
import io
import protfasta

from sparrow import Protein
from shephard.apis import fasta, uniprot
from shephard.apis import metapredict_api
from sparrow.predictors import batch_predict
import numpy as np
import os

In [None]:
#@title Run predictions
#@markdown <h1>Input options</h1>

#@markdown <br>
# define the function that will be called when the form is submitted

#@markdown <h3>UniProt-generated FASTA file</h3>
#@markdown If `uniprot_input` is selected then this analysis assumes
#@markdown the passed fasta file was generated by UniProt and will parse
#@markdown out the UniProt ID into its own column. If not, the whole FASTA header
#@markdown will be used as the unique ID

uniprot_input = True #@param {type:"boolean"}
#@markdown <br>

#@markdown <h3>Ensemble properties</h3>
#@markdown By default all possible ensemble properties are predicted, although if you
#@markdown don't want specific ones to be predicted this can be adjusted.
radius_of_gyration = True #@param {type:"boolean"}
end_to_end_distance = True #@param {type:"boolean"}
asphericity = True #@param {type:"boolean"}
scaling_exponent = True #@param {type:"boolean"}
prefactor = True #@param {type:"boolean"}
#@markdown ### Select metapredict version
metapredict_version = "v3" #@param ["v1", "v2","v3"]



# upload FASTA file
uploaded_data = files.upload()
uploaded_fasta = list(uploaded_data.keys())[0]

# read in FASTA file
if uniprot_input:
  prot = uniprot.uniprot_fasta_to_proteome(uploaded_fasta, invalid_sequence_action='convert-ignore')
else:
  prot = fasta.fasta_to_proteome(uploaded_fasta, use_header_as_unique_ID=True, invalid_sequence_action='convert-ignore')

# predict IDRs...
print('Predicting IDRs...',end='')
metapredict_api.annotate_proteome_with_disordered_domains(prot, version=metapredict_version, device='cuda')
print('Disorder prediction done')

data = {}
for d in prot.domains:
  name = f"{d.protein.unique_ID}_{d.start}_{d.end}"

  name = name.replace(',',';')
  data[name] = d.sequence



## ------------------------------------------
##
## RUN PREDICTIONS
##

if radius_of_gyration:
  print('Predicting radii of gyration')
  rg = batch_predict.batch_predict(data, network='scaled_rg')

if end_to_end_distance:
  print('Predicting end-to-end distance(s)')
  re = batch_predict.batch_predict(data, network='scaled_re')

if asphericity:
  print('Predicting asphericities')
  asph = batch_predict.batch_predict(data, network='asphericity')

if scaling_exponent:
  print('Predicting scaling exponents')
  nu = batch_predict.batch_predict(data, network='scaling_exponent')

if prefactor:
  print('Predicting scaling prefactors')
  pref = batch_predict.batch_predict(data, network='prefactor')


## ------------------------------------------
outname = 'IDRome_all.csv'
try:
  os.remove(outname)
except Exception:
  pass

fh = open(outname,'w')

out_string = ''
out_string += "IDR ID, "
out_string += "FASTA header, "

if uniprot_input:
  out_string += "UniProtID, "

out_string += "IDR start, "
out_string += "IDR end, "
out_string += "IDR len, "

if radius_of_gyration:
  out_string += "Rg (A), "

if end_to_end_distance:
  out_string += "Re (A), "

if asphericity:
  out_string += "asphericity, "

if scaling_exponent:
  out_string += "scaling_exponent, "

if prefactor:
  out_string += "prefactor, "

out_string += "FCR, "
out_string += "NCPR, "
out_string += "kappa, "
out_string += "fract_negative, "
out_string += "fract_positive, "
out_string += "fract_aro, "
out_string += "fract_pro, "
out_string += "fract_polar, "
out_string += "fract_ali, "
out_string += "sequence\n"

fh.write(out_string)


for d in prot.domains:
  name = f"{d.protein.unique_ID}_{d.start}_{d.end}"
  name = name.replace(',',';')
  out_string = ''
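  # sanity check: the ID must be comma-free or the CSV columns would shift
  if name.find(',') > -1:
    raise Exception(f'IDR ID still contains a comma: {name}')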

  fasta_header = d.protein.name
  fasta_header = fasta_header.replace(',',';')

  out_string += f"{name}, "
  out_string += f"{fasta_header}, "

  if uniprot_input:
    out_string += f"{d.protein.unique_ID}, "

  out_string += f"{d.start}, "
  out_string += f"{d.end}, "
  out_string += f"{len(d.sequence)}, "

  if radius_of_gyration:
    out_string += f"{rg[name][1]:.2f}, "

  if end_to_end_distance:
    out_string += f"{re[name][1]:.2f}, "

  if asphericity:
    out_string += f"{asph[name][1]:.3f}, "

  if scaling_exponent:
    out_string += f"{nu[name][1]:.3f}, "

  if prefactor:
    out_string += f"{pref[name][1]:.3f}, "

  local_protein = Protein(d.sequence)
  out_string += f"{round(local_protein.FCR,3)}, "
  out_string += f"{round(local_protein.NCPR,3)}, "
  out_string += f"{round(local_protein.kappa,3)}, "
  out_string += f"{round(local_protein.fraction_negative,3)}, "
  out_string += f"{round(local_protein.fraction_positive,3)}, "
  out_string += f"{round(local_protein.fraction_aromatic,3)}, "
  out_string += f"{round(local_protein.fraction_proline,3)}, "
  out_string += f"{round(local_protein.fraction_polar,3)}, "
  out_string += f"{round(local_protein.fraction_aliphatic,3)}, "
  out_string += f"{d.sequence}\n"
  fh.write(out_string)


fh.close()

files.download(outname)




# Documentation


### FASTA input

The input file must be a correctly formatted FASTA file. This means each sequence is defined by a header line (which starts with a > character), followed on the next line(s) by a valid amino acid sequence.

We also require FASTA headers to be unique.
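
A quick way to check the uniqueness requirement before uploading is to count the header lines. This is a minimal sketch using only the standard library; the function name `duplicate_fasta_headers` is hypothetical and is not part of the notebook or of protfasta.

```python
from collections import Counter

def duplicate_fasta_headers(fasta_text):
    """Return headers that occur more than once, in order of first appearance."""
    # A header is any line starting with '>'; strip the '>' and whitespace.
    headers = [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]
    counts = Counter(headers)
    return [header for header in counts if counts[header] > 1]

# Small made-up example in which 'seq1' appears twice.
example = ">seq1\nMKT\nAAG\n>seq2\nMSS\n>seq1\nGGS\n"
duplicates = duplicate_fasta_headers(example)
```

An empty list means every header is unique.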

Invalid amino acids will be converted to standard amino acids, as far as possible, using the following convention:

* "B"   -> N
* "U"   -> C
* "X"   -> G
* "Z"   -> Q
* " "   -> \<empty string> (i.e. whitespace characters are removed)
* "\*"   -> \<empty string>
* "\-"   -> \<empty string>
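
The convention above can be illustrated with a small translation table. This is only a sketch of the mapping as listed here, not the code the notebook's FASTA parser actually runs; the names `CONVERSION` and `convert_sequence` are hypothetical.

```python
# Mapping copied from the list above; characters mapped to None are deleted.
CONVERSION = str.maketrans({
    "B": "N",
    "U": "C",
    "X": "G",
    "Z": "Q",
    " ": None,
    "*": None,
    "-": None,
})

def convert_sequence(sequence):
    """Apply the documented conversion convention to one sequence."""
    return sequence.translate(CONVERSION)

converted = convert_sequence("MB-XU Z*")
```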


### Note
Because we want to guarantee that the output file is a true CSV file, any commas in your FASTA headers will be replaced with semicolons. You have been warned.
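
The sketch below mirrors how the prediction cell builds each IDR ID and cleans each header. The notebook does this inline; the helper names `make_idr_id` and `sanitize_header` are hypothetical, and the example header is made up.

```python
def sanitize_header(header):
    """Replace commas so the header cannot split into extra CSV columns."""
    return header.replace(",", ";")

def make_idr_id(unique_id, start, end):
    """Build an IDR ID of the form <unique ID>_<start>_<end>, comma-free."""
    return f"{unique_id}_{start}_{end}".replace(",", ";")

# Made-up header containing a comma
clean_header = sanitize_header("example protein, isoform 2")
idr_id = make_idr_id("P12345", 1, 93)
```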

The order of output sequence predictions is guaranteed to match the input order, and individually invalid sequences will be skipped.

### Help
If things go wrong, please don't hesitate to [raise an issue on GitHub](https://github.com/holehouse-lab/ALBATROSS-colab)

