# About

[Triqler](https://pubmed.ncbi.nlm.nih.gov/30482846/) is novel software for protein quantification and differential protein identification. It uses probabilistic graphical models to generate posterior distributions for fold changes between treatment groups, highlighting uncertainty rather than hiding it. Conventional (frequentist) methods use filters and imputation to control the error rate and often ignore certain error sources. This project aims to benchmark Triqler against MaxQuant (a commonly used tool for protein quantification).

For this purpose, a data set with 10 samples containing mixtures of Arabidopsis thaliana, Caenorhabditis elegans and Homo sapiens proteins is used. The concentration levels are known. In theory, the results from Triqler should be more representative of the true protein quantities, since no filters or imputation methods are used. However, previous attempts at showing this ([here](https://patruong.github.io/bayesProtQuant/)) have shown that the choice of imputation method can severely impact the results obtained by MaxQuant, making it look either much worse or much better by giving it an unfair disadvantage or advantage. One important aspect of this research is therefore how to make a fair comparison of Triqler and MaxQuant. Sub-questions relating to this aspect are "How do we make a fair imputation if we need to impute values?" and "How do we visualize the comparison in a meaningful and comprehensible way?".

## Problem
Triqler is novel software that uses a Bayesian model for protein quantification. Bayesian modeling has not yet been shown to be better than existing methods for protein quantification, but Triqler handles errors in multiple steps in a more theoretically sound way than the most commonly used protein quantification pipelines, which suggests that it may perform better. A benchmark of Triqler is therefore needed to demonstrate its performance.

(Specifically, it is interesting to benchmark against a DIA proteomics pipeline (such as directDIA by Biognosys), as DIA gives rise to more false positives, more false negatives and more complex spectra.)

## Preliminary Research Question
The research aims to answer the following question:
- Is Triqler a better alternative for protein quantification than existing methods?

This question will be answered by investigating the following sub-questions:
- How do we benchmark Triqler, a Bayesian model, against existing methods, such as Spectronaut?
- Which performance metrics are relevant for the benchmark?
- How do we show the comparison in a fair manner?


## Limitations 
This project aims to benchmark Triqler against Spectronaut protein quantification.

## Previous Studies

### Protein quantification using mass spectrometry
Label-free mass spectrometry (MS) is an increasingly important tool for researchers in the life sciences ([Zhu et al. 2009](https://www.hindawi.com/journals/bmri/2010/840518/)). In the last decade, developments in MS methodologies have made it possible to use MS for accurate protein quantification ([Ong et al. 2005](https://www.nature.com/articles/nchembio736)). Various software packages have been developed for processing raw MS data into quantified protein abundances ([Mueller et al. 2008](https://pubmed.ncbi.nlm.nih.gov/18173218/)), each with a unique set of algorithms for the different tasks of the MS data processing workflow. A systematic review of these software packages has been conducted by [Välikangas et al. 2017](https://academic.oup.com/bib/article/19/6/1344/3859191).


### Benchmarking of protein quantification.
[Mueller et al. 2008](https://pubmed.ncbi.nlm.nih.gov/18173218/)

[Välikangas et al. 2017](https://academic.oup.com/bib/article/19/6/1344/3859191)  

### Missing values and Bayesian missing data. 
Missing values are a frequent problem in life science studies. Incomplete data can cause a substantial amount of bias and seriously compromise inferences from studies if they are not handled appropriately ([Jakobsen et al. 2017](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0442-1)). Besides the common single imputation methods (such as min, max, mean and median imputation), which cause bias ([Dziura et al. 2013](https://pubmed.ncbi.nlm.nih.gov/24058309/)), various methods have been developed to tackle this problem, such as imputation models ([Barnard et al. 1999](https://journals.sagepub.com/doi/10.1177/096228029900800103)) and multiple imputation methods ([Jørgensen et al. 2014](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0111964)). The mechanism causing the data to be missing is another aspect that should, if possible, be considered when handling missing values ([Jakobsen et al. 2017](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0442-1#ref-CR19)). Bayesian methods have received much attention in the literature for handling missing values in various contexts, with or without taking different missingness mechanisms into account ([Ma et al. 2018](https://www.researchgate.net/publication/324510645_Bayesian_methods_for_dealing_with_missing_data_problems#pfe)). With Bayesian models, the missingness can be modelled from observed quantities, using the known data to marginalize over the missing values. Handling missing values with Bayesian models is therefore very natural and requires no imputation, as in [The et al. 2018](https://www.biorxiv.org/content/10.1101/357285v2). However, an incorrectly specified missing data model also causes problems for Bayesian models ([Mason J. 2010](https://spiral.imperial.ac.uk/handle/10044/1/5498)), and correctly specifying the missing data model greatly improves the overall model fit ([Mason et al. 2010](http://eprints.ncrm.ac.uk/1691/1/InsightsSubmitted.pdf)). In [The et al. 2018](https://www.biorxiv.org/content/10.1101/357285v2), a probability distribution is assigned over the possible values of the missing value, and this distribution is marginalized over when inferring the protein's quantity. The result is a protein quantity that incorporates this uncertainty, which manifests as a larger variance than for proteins without missing values.
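
The difference between single imputation and marginalizing over a missing value can be illustrated with a small Monte Carlo sketch. This is not Triqler's actual model: the normal distribution for the missing value, its parameters (`prior_mean`, `prior_sd`), the function name and the example intensities are all assumptions made for illustration only.

```python
import numpy as np


def mean_with_marginalized_missing(observed, n_missing, prior_mean, prior_sd,
                                   n_draws=10_000, seed=0):
    """Monte Carlo distribution of a group mean when the missing replicates
    are drawn from an assumed normal distribution instead of imputed once."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    # One row per draw, one column per missing replicate.
    missing_draws = rng.normal(prior_mean, prior_sd, size=(n_draws, n_missing))
    total = observed.sum() + missing_draws.sum(axis=1)
    return total / (observed.size + n_missing)


# Hypothetical log2 intensities for four observed replicates; a fifth is missing.
observed_log2 = [20.1, 19.8, 20.3, 20.0]

# Marginalization: the group mean becomes a distribution with its own spread.
draws = mean_with_marginalized_missing(observed_log2, n_missing=1,
                                       prior_mean=18.0, prior_sd=1.5)

# Single imputation: the missing value is replaced by one number (here the
# minimum), so the uncertainty about the missing value is lost.
single_imputed_mean = np.mean(observed_log2 + [min(observed_log2)])

print(f"single imputation mean: {single_imputed_mean:.2f}")
print(f"marginalized mean: {draws.mean():.2f} +/- {draws.std():.2f}")
```

The spread of `draws` comes entirely from the uncertainty about the missing replicate, which is the extra variance that a single imputed value cannot express.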


### Missing values in proteomics.

## Data
The data for this benchmark study is provided by Biognosys. The data set is generated using (?). It consists of ten samples of mixtures of various ratios of proteins from C. elegans, H. sapiens and A. thaliana. There are five replicates for each sample.


| Species     | S01 | S02  | S03   | S04    | S05   | S06    | S07   | S08   | S09   | S10 |
|-------------|-----|------|-------|--------|-------|--------|-------|-------|-------|-----|
| A. thaliana | 0.5 | 0.5  | 0.5   | 0.5    | 0.5   | 0.5    | 0.5   | 0.5   | 0.5   | 0.5 |
| C. elegans  | 0.5 | 0.25 | 0.125 | 0.0625 | 0.031 | 0.0155 | 0.008 | 0.004 | 0.002 | 0   |
| H. sapiens  | 0   | 0.25 | 0.375 | 0.4375 | 0.469 | 0.4845 | 0.492 | 0.496 | 0.498 | 0.5 |

<center><strong>Table 1</strong>. Protein ratios of the ten samples {S01, S02, ..., S10}.</center>
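
Because the mixing ratios are known, the expected fold change between any two samples can be computed directly from Table 1 and used as a reference when comparing methods. A minimal sketch follows; the dictionary layout and the function name `expected_log2_fc` are illustrative choices, not part of any existing pipeline.

```python
import math

# Mixing ratios copied from Table 1 (one value per sample S01..S10).
samples = [f"S{i:02d}" for i in range(1, 11)]
ratios = {
    "A. thaliana": [0.5] * 10,
    "C. elegans": [0.5, 0.25, 0.125, 0.0625, 0.031,
                   0.0155, 0.008, 0.004, 0.002, 0.0],
    "H. sapiens": [0.0, 0.25, 0.375, 0.4375, 0.469,
                   0.4845, 0.492, 0.496, 0.498, 0.5],
}


def expected_log2_fc(species, sample_a, sample_b):
    """Expected log2 fold change of `species` between two samples.
    Returns None when the species is absent from either sample,
    because the ratio is then undefined."""
    a = ratios[species][samples.index(sample_a)]
    b = ratios[species][samples.index(sample_b)]
    if a == 0 or b == 0:
        return None
    return math.log2(a / b)


print(expected_log2_fc("C. elegans", "S02", "S03"))
print(expected_log2_fc("A. thaliana", "S01", "S10"))
print(expected_log2_fc("H. sapiens", "S01", "S02"))
```

A. thaliana is constant across all samples, so its expected log2 fold change is zero for every sample pair, while S01 and S10 each lack one species entirely.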

The peak areas of the MS2 intensities for each protein are given, as well as protein quantities based on Spectronaut's protein quantification method. These protein quantities are based on the top three most intense peptides and the reproducibility of identification (search scores). Very low intensity MS2 peaks are set to the value 1.0. These very low intensity peaks arise from small noise peaks and local normalization effects produced by Spectronaut. In both cases, they are to be considered noise, and they arise because the dynamic ranges of MS1 and MS2 are not the same.
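
As a rough illustration of a top-three style quantity, the sketch below drops the placeholder value 1.0 before summarizing the most intense peptides. It does not reproduce Spectronaut's method: it ignores search scores, and both the use of the mean as the summary and the decision to treat 1.0 as missing are assumptions. The function name and the example intensities are hypothetical.

```python
import numpy as np

NOISE_PLACEHOLDER = 1.0  # value assigned to very low intensity MS2 peaks


def top_n_quantity(peptide_intensities, n=3):
    """Mean of the n most intense peptides of one protein, ignoring
    placeholder and NaN values. Returns NaN if no usable peptide remains."""
    x = np.asarray(peptide_intensities, dtype=float)
    x = x[~np.isnan(x)]
    x = x[x != NOISE_PLACEHOLDER]  # assumption: treat placeholder peaks as missing
    if x.size == 0:
        return float("nan")
    top = np.sort(x)[::-1][:n]  # sort descending, keep the n largest
    return float(top.mean())


# Hypothetical peptide intensities for one protein in one run.
print(top_n_quantity([1200.0, 1.0, 5400.0, 800.0, 3100.0]))
# A protein whose peptides are all placeholders has no usable quantity.
print(top_n_quantity([1.0, 1.0]))
```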




## Method

MS2 peak intensities and protein quantities from Spectronaut are provided by Biognosys. The MS2 peaks, search scores and protein identifications are used by Triqler to perform another protein quantification. The results from Triqler and Spectronaut will be compared.

Currently, the following analyses could be interesting to perform:

- Number of proteins quantified by each method, including:
  - Number of proteins quantified for each sample.
  - Number of proteins quantified for each species.
  - Number of proteins quantified for each sample and species.
  - These counts should be made without missing value imputation for Spectronaut.
- Performing exploratory data analysis, including:
  - Exploring the distribution of protein quantities.
    - Box plot.
    - Distribution plot.
    - Scatter plot (think about what we see here?).
- Analyzing the effect of missing values on protein quantities for Spectronaut:
  - Trying different imputation methods.
  - Exploring distributions using different imputation methods.
  - Checking the coefficient of variation to see the stability of the imputation methods.
- Analyzing protein quantity variation to see the stability of each method. This is done by:
  - Visualizing protein quantities for a random subsample of proteins for each method.
  - Performing differential expression analysis between samples for each species.
  - Checking the coefficient of variation.
- Data thresholding and cleaning:
  - Removing proteins for Spectronaut that have more than x replicates missing.
  - Removing proteins for Spectronaut and Triqler that have a high coefficient of variation.
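
Several of the steps above (counting quantified proteins, checking the coefficient of variation, and thresholding on missing replicates) can be sketched as follows. The protein names, quantities and the thresholds `max_missing=2` and `max_cv=0.5` are hypothetical placeholders; suitable values for x and for a "high" coefficient of variation still have to be chosen.

```python
import numpy as np


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, over observed
    (non-NaN) replicates. Returns NaN with fewer than two observations."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    if x.size < 2 or x.mean() == 0:
        return float("nan")
    return float(x.std(ddof=1) / x.mean())


def keep_protein(values, max_missing, max_cv):
    """Keep a protein only if it has at most `max_missing` missing
    replicates and a coefficient of variation of at most `max_cv`."""
    x = np.asarray(values, dtype=float)
    if int(np.isnan(x).sum()) > max_missing:
        return False
    # A NaN coefficient of variation compares as False, so the protein is dropped.
    return bool(coefficient_of_variation(x) <= max_cv)


# Hypothetical quantities: five replicates of one sample per protein.
quantities = {
    "P1": [100.0, 110.0, 90.0, 105.0, 95.0],
    "P2": [50.0, np.nan, np.nan, np.nan, 55.0],
    "P3": [10.0, 200.0, 15.0, 400.0, 12.0],
}

# A protein counts as quantified if at least one replicate has a value.
n_quantified = sum(1 for v in quantities.values() if not np.all(np.isnan(v)))
kept = [p for p, v in quantities.items()
        if keep_protein(v, max_missing=2, max_cv=0.5)]

print("quantified proteins:", n_quantified)
print("kept after thresholding:", kept)
```

Here `P2` is removed for having too many missing replicates and `P3` for its high variation, while all three still count as quantified before cleaning.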

Notes for future:
(Also, since missing value imputation can have a 
   
Missing not at random and missing at random...)

Something about the missing value type in the description of our data?
