## About
This notebook prepares the input files needed for proteome-wide analysis.

1. Generates the file `ysn_only.txt`, which can be mapped to UniProt IDs using gene name as the input (restricted to S. cerevisiae).
2. Generates the file `ysn2uniprot


In [2]:
import pickle

### Build the YSN list 
This first cell builds a text file containing the Yeast Systematic Names (YSNs) of all S. cerevisiae genes. It is included for completeness, but there is no need to re-run it.

In [3]:
USE_HHL_TOOLS = True

if USE_HHL_TOOLS:

    WRITE_OUT_YSN = True

    from yeastevo import Pillars


    # build and read in the sequence information
    fungi_matrix = Pillars()

    # get the set of IDs and read in aligned sequences
    all_valid_IDs = fungi_matrix.all_aligned_scerevisiae_YSNs()

    if WRITE_OUT_YSN:
        with open('generated_data/ysn_only.txt','w') as fh:
            for k in all_valid_IDs:
                fh.write(f"{k}\n")




Reading in all proteomes....
... DONE!
Reading in 5436 separate FASTA files - this may take 30-40 seconds...
... DONE!


## Build YSN to UniProt mapping
The code here generates a comprehensive mapping of each of the YSN identifiers to:

1. UniProt ID
2. Protein name
3. List of Gene names (many yeast genes have multiple names).

This cell depends indirectly on the `ysn_only.txt` file: that file must be generated first and then passed to the UniProt ID mapping service, which produces the `idmapping_2024_08_28.tsv` file parsed by the cell below.

In [4]:
with open('input_data/idmapping_2024_08_28.tsv', 'r') as fh:
    raw_uniprot_mapping = fh.readlines()

In [12]:
ysn2info = {}
for line in raw_uniprot_mapping[1:]:
    sline = line.strip().split('\t')

    ysn = sline[0]
    uid = sline[1]
    name = sline[2]
    gene_names = sline[3].split()

    if ysn in ysn2info:
        # raising a bare string is a TypeError in Python 3; raise a real exception
        raise ValueError(f'Error - found duplicate YSN: {ysn}')
    ysn2info[ysn] = [uid, name, gene_names]
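
A more defensive variant of the parsing loop above could look like the sketch below. It assumes the same four-column layout the cell relies on (YSN, UniProt ID, protein name, whitespace-separated gene names). The function name `parse_idmapping`, the header text, and the example rows are hypothetical placeholders, not real UniProt entries.

```python
import io


def parse_idmapping(handle):
    """Parse a 4-column TSV (YSN, UniProt ID, protein name, gene names).

    Hypothetical helper: the column order is assumed from the notebook cell.
    """
    mapping = {}
    next(handle, None)  # skip the header line
    for lineno, line in enumerate(handle, start=2):
        if not line.strip():
            continue  # ignore blank lines
        # strip only the newline so empty trailing columns are preserved
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            raise ValueError(
                f'line {lineno}: expected 4 columns, found {len(fields)}'
            )
        ysn, uid, name = fields[0], fields[1], fields[2]
        gene_names = fields[3].split()
        if ysn in mapping:
            raise ValueError(f'line {lineno}: duplicate YSN {ysn}')
        mapping[ysn] = [uid, name, gene_names]
    return mapping


# Placeholder rows for illustration only (not real identifiers)
example = io.StringIO(
    'From\tEntry\tProtein names\tGene Names\n'
    'YXX001W\tP00001\tExample protein 1\tABC1 YXX001W\n'
    'YXX002C\tP00002\tExample protein 2\tDEF2 XYZ9 YXX002C\n'
)
parsed = parse_idmapping(example)
print(parsed['YXX002C'])
```

Reporting the line number in each error makes it easier to find a malformed or duplicated row in the downloaded file.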

In [13]:
# generate PKL file for easy computational use
with open('generated_data/ysn2uniprot.pkl', 'wb') as file:
    pickle.dump(ysn2info, file)

# generate text version for easy reading
with open('generated_data/ysn2uniprot.tsv', 'w') as file:
    # header columns are separated by tabs only, matching the data rows
    file.write('YSN\tUniProtID\tName\tGene Names\n')
    for ysn, (uid, name, gene_names) in ysn2info.items():
        # single quotes inside the f-string keep this valid on Python < 3.12
        file.write(f"{ysn}\t{uid}\t{name}\t{', '.join(gene_names)}\n")
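
As a usage sketch, the pickle written above can be loaded back and inverted into a gene-name lookup. The example below round-trips a small placeholder dictionary through a temporary directory so that it runs on its own. The entries and the `gene2ysn` name are hypothetical; a real run would open `generated_data/ysn2uniprot.pkl` instead.

```python
import os
import pickle
import tempfile

# Placeholder data with the same structure as ysn2info:
# YSN -> [UniProt ID, protein name, list of gene names]
example_ysn2info = {
    'YXX001W': ['P00001', 'Example protein 1', ['ABC1', 'YXX001W']],
    'YXX002C': ['P00002', 'Example protein 2', ['DEF2', 'XYZ9', 'YXX002C']],
}

with tempfile.TemporaryDirectory() as tmpdir:
    pkl_path = os.path.join(tmpdir, 'ysn2uniprot.pkl')

    # write the mapping the same way the notebook cell does
    with open(pkl_path, 'wb') as fh:
        pickle.dump(example_ysn2info, fh)

    # read it back
    with open(pkl_path, 'rb') as fh:
        loaded = pickle.load(fh)

# Build a reverse index: gene name -> list of YSNs that carry that name
gene2ysn = {}
for ysn, (uid, name, gene_names) in loaded.items():
    for gene in gene_names:
        gene2ysn.setdefault(gene, []).append(ysn)

print(gene2ysn['XYZ9'])
```

The reverse index stores a list per gene name because nothing in the mapping guarantees that a gene name belongs to only one YSN.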