## About
This script does not strictly need to be run, but running it generates the following files in `output_data/`:

* `conserved_subsequences_disorder_thresh_0.3.csv`
* `conserved_subsequences_disorder_thresh_0.4.csv`
* `conserved_subsequences_disorder_thresh_0.5.csv`
* `conserved_subsequences_disorder_thresh_0.6.csv`
* `conserved_subsequences_disorder_thresh_0.7.csv`
* `SHPRD_conserved_subsequences_disorder_thresh_0.3.tsv`
* `SHPRD_conserved_subsequences_disorder_thresh_0.4.tsv`
* `SHPRD_conserved_subsequences_disorder_thresh_0.5.tsv`
* `SHPRD_conserved_subsequences_disorder_thresh_0.6.tsv`
* `SHPRD_conserved_subsequences_disorder_thresh_0.7.tsv`
* `YSN2UID.pkl`


## Running this notebook
This notebook cannot be run without the Holehouse Lab (HHL) internal package `yeastevo`, which organizes yeast sequence information from version 7 (2012...) of the [YGOB](http://ygob.ucd.ie/). If you want or need the underlying data, feel free to reach out to Alex. Honestly, this analysis was first done around 2017 and we would probably use a different set of sequences if we were doing it today, so we are not hiding the underlying code/package. In fact, it would probably make sense to re-run everything with version 8 (2022) of the YGOB.

In [26]:
try:
    from yeastevo import Pillars
    fungi_matrix = Pillars()
except ModuleNotFoundError:
    print('This script requires the internal HHL package yeastevo to run to generate the underlying data') 


Reading in all proteomes....
... DONE!


In [19]:
import numpy as np
import matplotlib

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
	
# Set such that PDF fonts export in a manner that they
# are editable in illustrator/affinity
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['ps.fonttype'] = 42

# set to define axes linewidths
matplotlib.rcParams['axes.linewidth'] = 0.5

# this defines some prefactors so inline figures look nice
# on a retina macbook. These can be commented out without any
# issue and are solely aesthetic.
%matplotlib inline
%config InlineBackend.figure_format='retina'

# UPDATE 2020-12-31 (my preferred font is Avenir...)
font = {'family' : 'avenir',
        'weight' : 'normal'}

matplotlib.rc('font', **font)
from scipy.signal import savgol_filter
import pickle

from shephard import interfaces
from shephard.apis import fasta



In [5]:
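def domain_boundaries(all_positions):
    """
    Collapse a collection of (possibly repeated) integer positions into a
    list of [start, end] pairs, one per run of consecutive positions.
    """
    tmp = sorted(set(all_positions))

    # no positions means no regions
    if len(tmp) == 0:
        return []

    all_boundaries = []
    start = tmp[0]

    for i in range(len(tmp) - 1):

        # a gap larger than 1 closes the current run and opens a new one
        if tmp[i+1] - tmp[i] > 1:
            all_boundaries.append([start, tmp[i]])
            start = tmp[i+1]

    # close the final run
    all_boundaries.append([start, tmp[-1]])

    return all_boundaries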
    

In [9]:
conservation_thresh=0.65

for disorder_thresh in [0.3,0.4,0.5,0.6,0.7]:
    print(f'On disorder threshold {disorder_thresh}')

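    # re-read the proteome on every pass so that conserved_idr_region domains
    # added for the previous threshold do not carry over
    yeast_proteome = fasta.fasta_to_proteome('../figure_1/data/yeast_sequence_dataset.fasta',use_header_as_unique_ID=True)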
    interfaces.si_tracks.add_tracks_from_file(yeast_proteome,'../figure_1/data/conservation_scores_SHPRD.tsv', mode='values')
    interfaces.si_tracks.add_tracks_from_file(yeast_proteome,'../figure_1/data/disorder_scores_SHPRD.tsv', mode='values')
    interfaces.si_domains.add_domains_from_file(yeast_proteome,'../figure_1/data/idrs_shephard.tsv')



    frags = []

    for protein in yeast_proteome:
        for domain in protein.domains:
            
            
            s = domain.sequence
            c = domain.get_track_values('conservation')
            # when the IDR begins at the first residue of the protein, skip that residue
            if domain.start == 1:
                idx_start = 1
            else:
                idx_start = 0
            
            for i in range(idx_start, len(s)):
                s_val = s[i]
                c_val = c[i]
                protein_position = domain.start + i
                if s_val in ['Y','F','W','L','I','M','V']:
                    if c_val > conservation_thresh:
                        (frag, start, end) = protein.get_sequence_context(protein_position,6, return_indices=True)
                        cons = np.median(protein.get_track_values('conservation',start,end))
                        dis = np.median(protein.get_track_values('disorder',start,end))
                        if dis > disorder_thresh:
                        
                            YSN = protein.name.split('_')[0]
                            YSN_FULL = protein.name
                            UID = fungi_matrix.convert_YNS_to_uniprot(YSN)
                        
                            frags.append([YSN, UID, cons, start, end, frag,YSN_FULL])
                            

    # next, for each frag, explicitly build up the list of positions found in conserved sites
    conserved_sites = {}
    for f in frags:
    
        if f[6] not in conserved_sites:
            conserved_sites[f[6]] = []
    
        v = list(range(f[3], f[4]+1))
        conserved_sites[f[6]].extend(v)
        
    # now, for each protein with conserved residues, extract the start/end positions
    # of each contiguous stretch and add them as conserved_idr_region domains
    for ysn in conserved_sites:
        local_conserved_start_end = domain_boundaries(conserved_sites[ysn])
    
        for X in local_conserved_start_end:
            yeast_proteome.protein(ysn).add_domain(X[0],X[1], 'conserved_idr_region',safe=False)
            
            
            
    out_data = []
    for protein in yeast_proteome:
        for domain in protein.domains:
            if domain.domain_type == 'conserved_idr_region':
            
                YSN = protein.name.split('_')[0]                            
                UID = fungi_matrix.convert_YNS_to_uniprot(YSN)
            
                out_data.append([YSN, UID, len(domain), domain.start, domain.end, domain.sequence])
                
                
    with open(f'output_data/conserved_subsequences_disorder_thresh_{disorder_thresh}.csv','w') as fh:
        for l in out_data:
            fh.write(f'{l[0]}, {l[1]}, {l[2]}, {l[3]}, {l[4]}, {l[5]}\n')
            
            
    interfaces.si_domains.write_domains(yeast_proteome, f'output_data/SHPRD_conserved_subsequences_disorder_thresh_{disorder_thresh}.tsv', domain_types=['conserved_idr_region'])
    
    
    

        


On disorder threshold 0.3
On disorder threshold 0.4
On disorder threshold 0.5
On disorder threshold 0.6
On disorder threshold 0.7
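
The selection step inside the loop above can be illustrated without `shephard` or `yeastevo`. The sketch below is a simplified stand-in, not the code used for the analysis: the function name `conserved_windows` and the toy data are hypothetical, positions are 0-based rather than 1-based, and it assumes the window around each residue is clipped at the sequence termini. It also ignores the restriction to annotated IDRs and the skipping of the first residue.

```python
import numpy as np

HYDROPHOBIC = set('YFWLIMV')

def conserved_windows(seq, conservation, disorder,
                      conservation_thresh=0.65, disorder_thresh=0.5, flank=6):
    """
    Return (start, end) windows (0-based, inclusive) centred on conserved
    hydrophobic residues whose surrounding window has a median disorder
    score above disorder_thresh.
    """
    windows = []
    for i, residue in enumerate(seq):
        # only conserved hydrophobic/aromatic residues seed a window
        if residue not in HYDROPHOBIC or conservation[i] <= conservation_thresh:
            continue

        # assumption: the window is clipped at the sequence termini
        start = max(0, i - flank)
        end = min(len(seq) - 1, i + flank)

        if np.median(disorder[start:end + 1]) > disorder_thresh:
            windows.append((start, end))
    return windows

# toy data, made up purely for illustration
seq = 'S' * 8 + 'L' + 'S' * 11
conservation = np.full(len(seq), 0.2)
conservation[8] = 0.9              # the single leucine is conserved
disorder = np.full(len(seq), 0.8)  # uniformly disordered

print(conserved_windows(seq, conservation, disorder))
```

Overlapping windows from neighbouring residues are what `domain_boundaries` later merges into contiguous `conserved_idr_region` domains.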


In [21]:
YSN2UID = {}
for protein in yeast_proteome:
    YSN = protein.name.split('_')[0]            
    YSN2UID[YSN] = fungi_matrix.convert_YNS_to_uniprot(YSN)
        

In [25]:
with open("output_data/YSN2UID.pkl", "wb") as f:
    pickle.dump(YSN2UID, f, protocol=pickle.HIGHEST_PROTOCOL)
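
For downstream use, the outputs can be read back with the standard library alone. The CSV files are written without a header row and with a comma followed by a space between fields, so the sketch below uses `skipinitialspace=True`. The column labels, the helper name `read_conserved_csv`, and the example identifiers are chosen here for illustration and are not part of the files.

```python
import csv
import io
import pickle

# labels chosen for illustration; the CSV files have no header row
COLUMNS = ['YSN', 'UID', 'length', 'start', 'end', 'sequence']

def read_conserved_csv(handle):
    """Parse rows laid out as 'YSN, UID, length, start, end, sequence'."""
    rows = []
    # skipinitialspace drops the space that follows each comma
    for fields in csv.reader(handle, skipinitialspace=True):
        row = dict(zip(COLUMNS, fields))
        for key in ('length', 'start', 'end'):
            row[key] = int(row[key])
        rows.append(row)
    return rows

# made-up line in the same layout as the files written above
example = io.StringIO('YXX000W, P00000, 13, 10, 22, SSSSSSLSSSSSS\n')
rows = read_conserved_csv(example)
print(rows[0]['start'], rows[0]['end'])

# round-trip a toy mapping the same way YSN2UID.pkl is written
blob = pickle.dumps({'YXX000W': 'P00000'}, protocol=pickle.HIGHEST_PROTOCOL)
mapping = pickle.loads(blob)
print(mapping['YXX000W'])
```

With the real files, pass an open handle instead, for example `open('output_data/conserved_subsequences_disorder_thresh_0.5.csv', newline='')`, and load the mapping with `pickle.load` on `output_data/YSN2UID.pkl` opened in `'rb'` mode.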