## Build pLCD SHEPHARD domains
##### Last updated 2022-09-01
This notebook provides code for building the polar-rich low-complexity domains (pLCDs) used for the analysis in Figure 2.

Note that this code requires the Python package [sparrow](https://github.com/idptools/sparrow). sparrow is in active development, and we'd encourage you to avoid integrating it into your standard workflow; however, we have provided it so that all analysis in the paper can be unambiguously reproduced with ease.

In [50]:
from shephard.apis import uniprot
from shephard import interfaces
from sparrow import Protein

In [51]:
# Name of a FASTA file from UniProt. The example here uses the cleaned human proteome
# (i.e., the human proteome restricted to proteins that contain no non-standard amino
# acids), but this could be any FASTA file generated from UniProt (e.g., the mouse proteome).
filename = '../../shprd_data/human_proteome_validated.fasta'
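The notebook does not show how the validated FASTA file was produced. As a rough illustration of what "lacking non-standard amino acids" could mean in practice, here is a minimal sketch that keeps only sequences built entirely from the 20 standard residues. The helper names and the toy records are hypothetical and are not part of SHEPHARD, sparrow, or the original pipeline.

```python
# Assumption: "standard" means the 20 canonical amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_standard_sequence(sequence):
    """Return True if the sequence is non-empty and fully standard."""
    return len(sequence) > 0 and set(sequence.upper()) <= STANDARD_AMINO_ACIDS


def filter_standard(records):
    """Keep only (header, sequence) pairs whose sequence is fully standard."""
    return [(header, seq) for header, seq in records if is_standard_sequence(seq)]


# Hypothetical toy records, not taken from the proteome file
toy_records = [
    ("TOY1_standard", "MKTAYIAKQR"),
    ("TOY2_selenocysteine", "MKTUYIAKQR"),  # U = selenocysteine
    ("TOY3_unknown", "MKTXYIAKQR"),         # X = unknown residue
]

kept = filter_standard(toy_records)
print([header for header, _ in kept])
```

Any FASTA file filtered in this way should be usable in the cells that follow, provided it keeps UniProt-style headers.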

In [52]:
# read in FASTA file from uniprot
human_proteome = uniprot.uniprot_fasta_to_proteome(filename)

In [53]:
for p in human_proteome:
    s = Protein(p.sequence)
    
    # get pLCDs from the sequence
    b = s.low_complexity_domains(mode='holt',
                                 residue_selector='QSGNTP',
                                 max_interruption=5,
                                 minimum_length=50,
                                 fractional_threshold=0.5)
    
    # register each pLCD (if any) as a domain; d[1] and d[2] hold the region's start and end
    for d in b:
        p.add_domain(d[1]+1, d[2], 'QSGNTP_LCD')
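The `+1` applied to the start position suggests that the boundaries returned by sparrow are 0-based and end-exclusive, while the domain positions are 1-based and inclusive. That reading is an inference from the code above, not something confirmed here. Under that assumption, the sketch below shows the coordinate conversion on a toy sequence, together with a simple check of what fraction of a region is made up of the selected residues. Both helper functions are hypothetical, and the fraction check is not sparrow's domain-finding algorithm.

```python
def to_one_based_inclusive(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based inclusive."""
    return start + 1, end


def selector_fraction(region, residue_selector="QSGNTP"):
    """Fraction of residues in `region` that belong to `residue_selector`."""
    if not region:
        return 0.0
    selected = sum(1 for aa in region if aa in residue_selector)
    return selected / len(region)


# Hypothetical toy sequence: a polar-rich stretch flanked by other residues
sequence = "MKLV" + "QSGNTPQSGA" + "WWWW"

# Assumed 0-based, end-exclusive boundaries of the polar-rich stretch
start, end = 4, 14
region = sequence[start:end]

first, last = to_one_based_inclusive(start, end)

# The same residues are recovered from the 1-based inclusive coordinates
assert sequence[first - 1:last] == region

print(first, last, selector_fraction(region))
```

A check like this can be run on a handful of annotated domains to confirm that the stored boundaries cover the intended residues and that each region meets the `fractional_threshold` used above.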

    
    
    

In [54]:
print(f"Found {len(human_proteome.get_domains_by_type('QSGNTP_LCD'))} pLCDs in the human proteome")

Found 5138 pLCDs in the human proteome


In [55]:
interfaces.si_domains.write_domains(human_proteome, 'shprd_domains_human_QSGNTP_LCD.tsv')
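For a quick look at the written file without loading SHEPHARD, the sketch below parses domain lines in plain Python. It assumes the file is tab-separated with the columns unique ID, start, end and domain type, optionally followed by `key:value` attribute fields. This layout is an assumption about the output format and should be checked against the actual file. The parser function and the example lines are hypothetical.

```python
import io


def parse_domain_line(line):
    """Parse one tab-separated domain line into a dict.

    Assumed column order: unique ID, start, end, domain type,
    then zero or more key:value attribute fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"Expected at least 4 columns, got {len(fields)}")

    attributes = {}
    for item in fields[4:]:
        if not item:
            continue
        key, _, value = item.partition(":")
        attributes[key] = value

    return {
        "unique_id": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "domain_type": fields[3],
        "attributes": attributes,
    }


# Hypothetical example content standing in for the written file
example = io.StringIO(
    "TOY0001\t5\t60\tQSGNTP_LCD\n"
    "TOY0002\t101\t180\tQSGNTP_LCD\tnote:toy\n"
)

domains = [parse_domain_line(line) for line in example if line.strip()]

# Domain lengths, treating start and end as 1-based inclusive positions
lengths = [d["end"] - d["start"] + 1 for d in domains]
print(lengths)
```

To run this on the real output, replace the `io.StringIO` object with `open('shprd_domains_human_QSGNTP_LCD.tsv')`.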