## Proteins for exploration
#### Last updated 2022-08-07
This notebook takes the output of the `ana_polar_LCD_stickerspacer_vM.ipynb` notebook and writes out the hits and their UniProt IDs for further analysis and exploration.

Note that generating the HTML pages in this notebook requires [sparrow](https://github.com/idptools/sparrow). As mentioned earlier, sparrow is public, but we suggest avoiding it for now because it is in active development. If you do want to use sparrow, please reach out to [Alex](https://www.holehouselab.com/team) so he can advise on the best way to avoid issues.

In [1]:
import shephard
from shephard.apis import uniprot, fasta
from shephard.interfaces import si_domains, si_tracks



In [2]:
filename = '../shprd_data/human_proteome_validated.fasta'
human_prot = uniprot.uniprot_fasta_to_proteome(filename)

print('Reading in data...')

# load the annotated polar LCD domains
si_domains.add_domains_from_file(human_prot, '../generated_data/shprd_domains_polar_LCDs_annotated.tsv')


Reading in data...


### Extract out proteins with LCDs
The code below collects the unique IDs of proteins that contain at least one domain of each LCD type (aromatic, aliphatic, or charged polar LCDs).

In [3]:
UIDs_with_aromatic_polar_lcds = []
UIDs_with_aliphatic_polar_lcds = []
UIDs_with_charged_polar_lcds = []

for protein in human_prot:
    for d in protein.domains:

        if d.attribute('aromatic_polar_lcds',safe=False):
            UIDs_with_aromatic_polar_lcds.append(protein.unique_ID)
            
        if d.attribute('aliphatic_polar_lcds',safe=False):
            UIDs_with_aliphatic_polar_lcds.append(protein.unique_ID)

        if d.attribute('charged_polar_lcds',safe=False):
            UIDs_with_charged_polar_lcds.append(protein.unique_ID)

            
UIDs_with_aromatic_polar_lcds = list(set(UIDs_with_aromatic_polar_lcds))
UIDs_with_aliphatic_polar_lcds = list(set(UIDs_with_aliphatic_polar_lcds))
UIDs_with_charged_polar_lcds = list(set(UIDs_with_charged_polar_lcds))
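
The collect-then-deduplicate pattern above could also be written with sets from the start. The sketch below is illustrative only: `FakeDomain` and `FakeProtein` are hypothetical stand-ins (not shephard classes) so that it runs without the proteome, and the behaviour of `attribute(name, safe=False)` returning `None` for a missing attribute is an assumption mirrored from how the cell above uses it.

```python
from collections import defaultdict

LCD_TYPES = ['aromatic_polar_lcds', 'aliphatic_polar_lcds', 'charged_polar_lcds']


class FakeDomain:
    """Hypothetical stand-in for a shephard domain (illustration only)."""

    def __init__(self, attributes):
        self._attributes = attributes

    def attribute(self, name, safe=True):
        # Assumed behaviour: with safe=False a missing attribute yields None
        if name not in self._attributes and safe:
            raise KeyError(name)
        return self._attributes.get(name)


class FakeProtein:
    """Hypothetical stand-in for a shephard protein (illustration only)."""

    def __init__(self, unique_ID, domains):
        self.unique_ID = unique_ID
        self.domains = domains


def collect_uids(proteins, lcd_types=LCD_TYPES):
    """Return {lcd_type: sorted unique IDs with at least one such domain}."""
    hits = defaultdict(set)
    for protein in proteins:
        for d in protein.domains:
            for lcd_type in lcd_types:
                if d.attribute(lcd_type, safe=False):
                    hits[lcd_type].add(protein.unique_ID)
    return {t: sorted(hits[t]) for t in lcd_types}


proteins = [
    # two aromatic LCDs in one protein should still give a single ID
    FakeProtein('P00001', [FakeDomain({'aromatic_polar_lcds': 'True'}),
                           FakeDomain({'aromatic_polar_lcds': 'True'})]),
    FakeProtein('P00002', [FakeDomain({'charged_polar_lcds': 'True'})]),
]
uids = collect_uids(proteins)
print(uids['aromatic_polar_lcds'])
```

Sorting the IDs is a design choice rather than a requirement: `list(set(...))` can return strings in a different order on each run, so a sorted list keeps the output files stable between runs.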

### Generate output data
The cells below write text files containing either the UniProt IDs alone (one per line) or the UniProt IDs together with the protein names. These files can be used by other analysis pipelines.

In [4]:
with open('../generated_data/uniprot_ids_aromatic_polar_lcds.txt', 'w') as fh:
    for uid in UIDs_with_aromatic_polar_lcds:
        fh.write(f"{uid}\n")

with open('../generated_data/uniprot_ids_aliphatic_polar_lcds.txt', 'w') as fh:
    for uid in UIDs_with_aliphatic_polar_lcds:
        fh.write(f"{uid}\n")

with open('../generated_data/uniprot_ids_charged_polar_lcds.txt', 'w') as fh:
    for uid in UIDs_with_charged_polar_lcds:
        fh.write(f"{uid}\n")
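
As a rough sketch of how a downstream pipeline might consume these one-ID-per-line files, the example below writes a small file and reads it back. The helper names `write_ids` and `read_ids`, the file name, and the example IDs are hypothetical; a temporary directory is used so nothing under `../generated_data/` is touched.

```python
import os
import tempfile


def write_ids(path, uids):
    """Write one ID per line, matching the format used in the cell above."""
    with open(path, 'w') as fh:
        for uid in uids:
            fh.write(f"{uid}\n")


def read_ids(path):
    """Read IDs back, skipping blank lines and stripping whitespace."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example_ids.txt')
    write_ids(path, ['P00001', 'P00002'])
    recovered = read_ids(path)

print(recovered)
```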
        

In [5]:
with open('../generated_data/uniprot_ids_aromatic_polar_lcds_with_name.txt', 'w') as fh:
    for uid in UIDs_with_aromatic_polar_lcds:
        fh.write(f"{uid}, {human_prot.protein(uid).name}, \n")

with open('../generated_data/uniprot_ids_aliphatic_polar_lcds_with_name.txt', 'w') as fh:
    for uid in UIDs_with_aliphatic_polar_lcds:
        fh.write(f"{uid}, {human_prot.protein(uid).name}, \n")

with open('../generated_data/uniprot_ids_charged_polar_lcds_with_name.txt', 'w') as fh:
    for uid in UIDs_with_charged_polar_lcds:
        fh.write(f"{uid}, {human_prot.protein(uid).name}, \n")
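
Protein names may themselves contain commas, which would make the comma-separated lines written above ambiguous to parse. One possible alternative, sketched here with Python's standard `csv` module, quotes such fields automatically. Note that this is not the format the cell above produces (there is no trailing comma), so it would only suit pipelines written to read it; the function name `write_ids_with_names` and the example records are made up for illustration.

```python
import csv
import io


def write_ids_with_names(fh, records):
    """Write (uid, name) pairs as CSV rows; csv handles quoting of commas."""
    writer = csv.writer(fh)
    for uid, name in records:
        writer.writerow([uid, name])


records = [
    ('P00001', 'Example protein, mitochondrial'),  # name containing a comma
    ('P00002', 'Another example protein'),
]

# An in-memory buffer stands in for a real output file
buffer = io.StringIO()
write_ids_with_names(buffer, records)

# Round-trip: the comma inside the first name survives intact
buffer.seek(0)
recovered = [tuple(row) for row in csv.reader(buffer)]
print(recovered)
```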
        

### Write HTML summary data
The cell below will generate HTML pages showing the polar-rich low complexity domains highlighted using standard amino acid coloring.

In [8]:
WRITE_OUT_HTML = True

if WRITE_OUT_HTML:
    

    from sparrow import Protein
    head = '<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8">\n<title>{n}</title>\n</head>\n<body>\n\n'
    
    
    for n in ['aromatic_polar_lcds','charged_polar_lcds','aliphatic_polar_lcds']:
        
        fh = open(f'html_out/{n}.html','w')
        
        # write header, filling the page title with the LCD type
        fh.write(head.format(n=n))
        idx = 0
        for protein in human_prot:
            for d in protein.domains:

                if d.attribute(n,safe=False):
                    
                    fh.write(f'<p>Entry {idx}: <a href="https://www.uniprot.org/uniprotkb/{protein.unique_ID}/">{protein.name}</a>\n</p>')
                    fh.write(f'<p>Domain boundaries: {d.start} - {d.end} \n</p>')

                    # render the domain sequence as colored HTML via sparrow
                    seq_html = Protein(d.sequence).show_sequence(return_raw_string=True)
                    fh.write(seq_html + "\n")
                    idx += 1
                    
        # write footer
        fh.write('</body>\n</html>\n\n')
        
        fh.close()
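
The cell above opens each file by hand and assumes the `html_out/` directory already exists. A minimal sketch of a more defensive variant is shown below. It does not use sparrow or shephard: the `write_html_page` helper and the example entry are hypothetical, and the pre-rendered `seq_html` string stands in for the output of `show_sequence`.

```python
import html
import os
import tempfile

HEAD = ('<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8">\n'
        '<title>{title}</title>\n</head>\n<body>\n\n')
FOOT = '</body>\n</html>\n'


def write_html_page(out_dir, title, entries):
    """Write one page; entries are (uid, name, start, end, seq_html) tuples."""
    os.makedirs(out_dir, exist_ok=True)  # create the output directory if needed
    path = os.path.join(out_dir, f'{title}.html')
    with open(path, 'w') as fh:  # file is closed even if writing fails
        fh.write(HEAD.format(title=html.escape(title)))
        for idx, (uid, name, start, end, seq_html) in enumerate(entries):
            # escape the name so characters like < or & cannot break the markup
            fh.write(f'<p>Entry {idx}: '
                     f'<a href="https://www.uniprot.org/uniprotkb/{uid}/">'
                     f'{html.escape(name)}</a></p>\n')
            fh.write(f'<p>Domain boundaries: {start} - {end}</p>\n')
            fh.write(seq_html + '\n')
        fh.write(FOOT)
    return path


with tempfile.TemporaryDirectory() as tmpdir:
    page = write_html_page(
        os.path.join(tmpdir, 'html_out'),
        'aromatic_polar_lcds',
        [('P00001', 'Example <protein>', 10, 42, '<p>QQSYQQ</p>')],
    )
    with open(page) as fh:
        content = fh.read()

print(content)
```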
                     
