# README
This is a step-by-step walk-through of assigning MS-identified spectra to proteins.

# Preparing Reference & Search Data
The following notebooks convert the raw input files into the working reference files. Unless a step notes otherwise, they depend only on the input files being present and can be run in any order.
* [Convert Peptide Search CSV to trimmed TSV](src/Preparation/CSV%20to%20TSV.ipynb)

This requires the xlsx MS search output to have been converted to CSV first. It removes most of the data and adds Peptide UIDs and a few categories (such as IFN status for GBM and B721 HLA Allele). The output should be a TSV file of the MS-identified peptides with the following columns.

- filename: Unique identifier to map the peptide match to its MS/MS spectrum
- numPSMsObserved: The number of spectra identified which match the peptide
- score: The quality of the MS/MS match
- rank2Score: The quality of the second-best MS/MS match
- deltaForwardReverseScore: The difference between the score and the best decoy score, used to estimate FDR
- percent_scored_peak_intensity: The % of the spectrum peaks explained by the match
- backbone_cleavage_score: The number of cleavages detected in the match
- sequence: The peptide sequence the spectrum is assigned to
- sequenceMulti: All of the possible peptide sequences the spectrum could be assigned to
- retantionTimeMin: The retention time of the spectrum
- Peptide:UID: The generated unique identifier for the peptide match
- sequenceList: A list of all possible sequence matches with modifications removed
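
As a rough illustration of the trimming step, the sketch below keeps a subset of columns and appends a generated `Peptide:UID`. It is not the notebook's code: the function name `trim_csv_to_tsv`, the `PEP` prefix, and the zero-padded UID format are assumptions made for this example, and only four of the columns are kept to keep it short.

```python
import csv
import io

# Subset of the output columns listed above (shortened for the example).
KEEP = ["filename", "numPSMsObserved", "score", "sequence"]


def trim_csv_to_tsv(csv_text, keep=KEEP, uid_prefix="PEP"):
    """Keep the selected columns and add a generated Peptide:UID column.

    The UID format (prefix + zero-padded row number) is hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(keep + ["Peptide:UID"])
    for i, row in enumerate(reader, start=1):
        writer.writerow([row[c] for c in keep] + [f"{uid_prefix}{i:06d}"])
    return out.getvalue()
```

Any columns not named in `keep` are dropped, which mirrors the "remove most of the data" behaviour described above.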

* [Merge TPM Files](src/Preparation/Parse%20TPM.ipynb)

This merges the per-sample TPM tables into unified tables, trimming most of the columns in the process. It takes in all of the TPM tables produced by the rpf TPM script and generates a merged table with the TPM and purity per sample, plus the mean and standard deviation of the TPM and purity across all samples.
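
A minimal sketch of the merge, assuming each per-sample table has already been reduced to a `{protein_id: tpm}` mapping. The function name `merge_tpm`, the output key names, the use of the sample standard deviation, and the choice to skip samples where a protein is absent are all assumptions for illustration. Purity could be handled the same way.

```python
from statistics import mean, stdev


def merge_tpm(per_sample):
    """Merge {sample: {protein_id: tpm}} into {protein_id: row}.

    Each row holds the per-sample TPM values plus their mean and
    standard deviation across the samples that report the protein.
    """
    proteins = sorted({p for table in per_sample.values() for p in table})
    merged = {}
    for protein in proteins:
        values = {s: t[protein] for s, t in per_sample.items() if protein in t}
        tpms = list(values.values())
        row = {f"tpm_{sample}": v for sample, v in values.items()}
        row["tpm_mean"] = mean(tpms)
        # stdev needs at least two values; report 0.0 for a single sample.
        row["tpm_sd"] = stdev(tpms) if len(tpms) > 1 else 0.0
        merged[protein] = row
    return merged
```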

* [Parse Fasta Reference](src/Preparation/Parse%20References.ipynb)

Reads in the protein fasta files generated by Karl for the MS searches and converts them to standardized tables (with the protein source annotated and a Protein UID added). This should generate a table with the following columns.

- Protein:UID: The unique identifier assigned to each searched fasta entry
- header: The fasta name for each protein searched
- category: The source of the protein, used to prioritize assignments
- sequence: The protein sequence
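
The conversion from fasta to the four-column table can be sketched as follows. This is an illustrative parser, not the notebook's implementation: the function name `parse_fasta`, the `PROT` prefix, and the UID format are assumptions, and the `category` is passed in by the caller (for example, one value per input file).

```python
def parse_fasta(text, category, uid_prefix="PROT"):
    """Return one dict per fasta entry with the four columns listed above."""
    entries = []  # list of (header, [sequence chunks])
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new entry starts; the header is everything after '>'.
            entries.append((line[1:], []))
        elif entries:
            # Sequence lines may be wrapped, so collect and join them later.
            entries[-1][1].append(line)
    return [
        {
            "Protein:UID": f"{uid_prefix}{i:06d}",  # hypothetical UID format
            "header": header,
            "category": category,
            "sequence": "".join(chunks),
        }
        for i, (header, chunks) in enumerate(entries, start=1)
    ]
```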

* [Generate UCSC ORF_ID Map](src/Preparation/Generate%20UCSC%20%26%20ORF_ID%20Map.ipynb)

Generates the Pan Sample & Variant references with the appropriate TPM and Purity values per protein. Must be run after the Fasta references are parsed. This generates a map table whose first column lists all of the UCSC protein names and whose second column gives the nuORFdb ORF_ID corresponding to that protein (if predicted).

* [Add ORF Types](src/Preparation/Add%20ORF%20Type.ipynb)

Adds ORF Type information to the reference files. Must be run after the Fasta references are parsed and the UCSC ORF map is generated. This appends the orfType column to the reference files, which are used to prioritize Canonical ORFs over nuORFs.

# Mapping & Assigning Peptides
## Mapping
Using the generated reference files, create maps between all of the found peptide spectra and the matching protein reference.
* [Mapping Peptides to Proteins](src/Mapping/Map%20Peptides.ipynb)

This is followed by merging the map files with their respective references. This gives merged tables which will be modified & then condensed into the final mapped peptide table.
* [Merge Maps](src/Mapping/Merge%20Maps.ipynb)
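
For intuition, the mapping step can be pictured as below. This sketch assumes a peptide maps to a protein when any of its modification-free sequences (the `sequenceList` entries) occurs as an exact substring of the protein sequence; the notebooks may use a different matching rule, and `map_peptides` is a hypothetical name.

```python
def map_peptides(peptides, proteins):
    """Map peptide UIDs to protein UIDs by exact substring match.

    peptides: {peptide_uid: [bare sequences]}
    proteins: {protein_uid: protein sequence}
    Returns a sorted list of (peptide_uid, protein_uid) pairs.
    """
    pairs = set()
    for pep_uid, sequences in peptides.items():
        for prot_uid, prot_seq in proteins.items():
            if any(seq in prot_seq for seq in sequences):
                pairs.add((pep_uid, prot_uid))
    return sorted(pairs)
```

The nested loop is quadratic and only meant to show the idea; a real search over a large reference would likely need an index.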

## Assigning
A linear series of decisions was made to assign found peptides to the fasta reference. All protein matches for each peptide were considered together.
* If any of the matched proteins were Contaminants, the peptide was discarded.
* If any of the proteins were annotated as Canonical, only Canonical matches were kept.
* The match with the highest TPM value was retained (with NaN values treated as 0).
* Proteins in the default UCSC reference were kept over equivalent nuORF proteins.
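
The decision series above can be sketched as a single function applied to all matches of one peptide. The field names (`contaminant`, `canonical`, `tpm`, `ucsc`) and the function name `assign_peptide` are hypothetical; the actual notebook columns may differ, and any tie left after the last rule is resolved here by simply taking the first remaining match.

```python
import math


def assign_peptide(matches):
    """Pick one protein match for a peptide, or None if it is discarded.

    matches: list of dicts with hypothetical keys
        'protein', 'contaminant' (bool), 'canonical' (bool),
        'tpm' (float, may be NaN), 'ucsc' (bool).
    """
    if not matches:
        return None
    # 1. Any contaminant match discards the peptide.
    if any(m["contaminant"] for m in matches):
        return None
    # 2. If any match is Canonical, keep only Canonical matches.
    canonical = [m for m in matches if m["canonical"]]
    candidates = canonical or matches

    # 3. Keep the highest TPM, treating NaN as 0.
    def tpm(m):
        return 0.0 if math.isnan(m["tpm"]) else m["tpm"]

    best = max(tpm(m) for m in candidates)
    candidates = [m for m in candidates if tpm(m) == best]
    # 4. Prefer proteins from the default UCSC reference.
    ucsc = [m for m in candidates if m["ucsc"]]
    candidates = ucsc or candidates
    return candidates[0]
```

Applying the rules in this fixed order means a Canonical match with a low TPM still wins over a nuORF match with a high TPM, which follows from the second rule being evaluated before the third.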

This method was applied to all searches which used the Pan Sample reference. For the __Database Comparison__ searches, peptides were partitioned into __Annotated__ or __Unannotated__ based on whether they were uniquely found in the generated reference or not.

* [Assign Pan Sample Peptides](src/Mapping/Assign%20Peptides.ipynb)
* [Assign Database Comparison Peptides](src/Mapping/Assign%20Database%20Comparison.ipynb)

This is followed by filtering the MHC-I IP searches to lower the FDR to 1%.

* [Filter MHC-I Peptides](src/Mapping/Filter%20FDR.ipynb)
* [Filter DBC Peptides](src/Mapping/DBC%20Filter.ipynb)