## About
This notebook reads a CSV version of the SGD-based protein annotations built by the [Fried lab](https://www.friedlab.com/) and writes it out as a SHEPHARD Protein Attributes file. Ideally we would do more sanity checking here and cross-reference against the protein sequences provided in the `sequence` column, but for the purposes of this analysis we simply extract a few key parameters to ensure that our definitions of (1) abundance, (2) number of domains, and (3) disorder are consistent.

In [None]:
class Entry:
    # simple container whose instance attributes are created from an input
    # list of field names, each initialized to None
    def __init__(self, fields):
        for field in fields:            
            setattr(self, field, None)


In [43]:
with open('sjf_yeast.csv','r',encoding='utf-8-sig') as fh:
    content = fh.readlines()
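
The cells in this notebook split each line on `,`, which only works if no field contains a quoted comma. As a hedged alternative, here is a minimal sketch using the standard-library `csv` module, assuming the file uses ordinary CSV quoting. The two-line sample and its column names are hypothetical placeholders, not taken from `sjf_yeast.csv`.

```python
import csv
import io

# Hypothetical sample; the real column names come from the input file's header
sample = 'UniProt ID,Description\nP12345,"kinase, putative"\n'

# csv.reader honors quoting, so the quoted comma stays inside one field
rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]

# naive splitting breaks the quoted field into two pieces
naive = sample.splitlines()[1].split(',')
```

With the real file, the same reader could be applied to the open file handle instead of the `io.StringIO` wrapper.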

In [25]:
# get raw field names from the first line of the file, which holds the
# column names. NB: pandas could do all of this, but here we parse by hand.
# Splitting on ',' assumes that no field contains a quoted comma.
_fields = content[0].strip().split(',')

# sanitize the field names so they can be used as Python attribute names
fields = []
for n in _fields:
    n = n.replace(' ','_')
    n = n.replace('-','_')
    n = n.replace('%','Percentage')
    n = n.replace('(','')
    n = n.replace(')','')
    n = n.replace('Standard_Name_/_Gene','Gene')
    fields.append(n)

# build mapping from column index to field name
idx2field = dict(enumerate(fields))
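
To make the effect of the replacements above concrete, the same steps can be wrapped in a small function and tried on a few names. This is only a sketch: `sanitize_field` is a hypothetical helper name, and the `% Disordered (predicted)` column is an invented example rather than a column known to be in the file.

```python
def sanitize_field(name):
    """Apply the same ordered replacements used in the cell above."""
    name = name.replace(' ', '_')
    name = name.replace('-', '_')
    name = name.replace('%', 'Percentage')
    name = name.replace('(', '')
    name = name.replace(')', '')
    # must run after the space replacement, which produces the underscores
    name = name.replace('Standard_Name_/_Gene', 'Gene')
    return name

examples = {
    'Standard Name / Gene': sanitize_field('Standard Name / Gene'),
    'Median Abundance': sanitize_field('Median Abundance'),
    # invented column name, for illustration only
    '% Disordered (predicted)': sanitize_field('% Disordered (predicted)'),
}
```

The order matters: the `Standard_Name_/_Gene` rule only matches once spaces have been turned into underscores.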

In [40]:
with open('shprd_fried_yeast.tsv','w') as fh:
    for line in content[1:]:
        
        sline = line.strip().split(',')

        # initialize an entry for this line
        entry = Entry(fields)    
        for idx in range(len(sline)):
            f = idx2field[idx]
            setattr(entry, f, sline[idx])
    
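        # skip rows without a 6-character UniProt accession; note that this
        # also drops any 10-character accessions
        uid = entry.UniProt_ID
        if uid is None or len(uid) != 6:
            continue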

        fh.write(f'{uid}\t')
        for f in fields:

            ## NB; could add more per-protein sanity check lines here
            # ............................................................
            
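            # sanity check abundance: values that cannot be parsed as an
            # integer (e.g. empty or non-numeric entries) are written as -1
            if f == 'Median_Abundance':
                try:
                    v = int(getattr(entry,f))
                except (ValueError, TypeError):
                    v = -1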

                # write after sanity checking done
                fh.write(f'{f}:{v}\t')

            ## base case write
            else:
                fh.write(f'{f}:{getattr(entry,f)}\t')

            

        fh.write('\n')                 
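
A quick way to spot-check the output is to parse a written line back into a dictionary. The sketch below assumes the layout produced by the cell above: a unique ID, then tab-separated `key:value` pairs, with a trailing tab before the newline. `parse_attribute_line` is a hypothetical helper and the sample line is made up.

```python
def parse_attribute_line(line):
    """Split one line of the written file into (uid, {key: value})."""
    # drop the newline, then discard the empty piece left by the trailing tab
    parts = [p for p in line.rstrip('\n').split('\t') if p]
    uid, pairs = parts[0], parts[1:]
    # split on the first colon only, since values may themselves contain colons
    attributes = dict(p.split(':', 1) for p in pairs)
    return uid, attributes

# made-up line in the layout written above
uid, attrs = parse_attribute_line('P12345\tGene:ABC1\tMedian_Abundance:-1\t\n')
```

Running this over every line of `shprd_fried_yeast.tsv` and counting entries with `Median_Abundance` equal to `'-1'` would show how many abundance values failed the integer check.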
