vignettes/Intromi4p.Rmd

---
title: "Multiple imputation for proteomics"
shorttitle: "mi4p: multiple imputation for proteomics"
author: 
- Marie Chion, Université de Strasbourg et CNRS, IRMA, labex IRMIA, LSMBO, IPHC, marie.chion@protonmail.fr
- Christine Carapito, Université de Strasbourg et CNRS, LSMBO, IPHC, ccarapito@unistra.fr
- Frédéric Bertrand, Université de Strasbourg et CNRS, IRMA, labex IRMIA, Université de technologie de Troyes, LIST3N, frederic.bertrand@utt.fr
date: "`r Sys.Date()`"
output:
  prettydoc::html_pretty:
    theme: cayman
    highlight: github
vignette: >
  %\VignetteIndexEntry{Multiple imputation for proteomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<table align="center">
        <tr>
            <td><img src="../man/figures/logo.png" align="center" width="200" style="margin:0 0 0 100px;"/></td>
        </tr>
</table>

```{r setup, include = FALSE}
LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE")
#LOCAL=FALSE
knitr::opts_chunk$set(purl = LOCAL)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=5
)
```

# mi4p: multiple imputation for proteomics

This repository contains the R code and package for the _mi4p_ methodology (Multiple Imputation for Proteomics), proposed by Marie Chion, Christine Carapito and Frédéric Bertrand (2021) in *Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics*, [https://arxiv.org/abs/2108.07086](https://arxiv.org/abs/2108.07086). 

The following material is available on the Github repository of the package [https://github.com/mariechion/mi4p/](https://github.com/mariechion/mi4p/).

1. The `Functions` folder contains all the functions used for the workflow.  

2. The `Simulation-1`, `Simulation-2` and `Simulation-3` folders contain all the R scripts and data used to conduct simulated experiments and evaluate our methodology. 

3. The  `Arabidopsis_UPS` and `Yeast_UPS` folders contain all the R scripts and data used to challenge our methodology on real proteomics datasets. Raw experimental data were deposited with the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD003841 and PXD027800.

<!--You can insert a graphic here, just modify that syntax  ![Example Graphs.](IntroPatterns_files/figure-html/Fresults-1.png) -->


This website and these examples were created by M. Chion, C. Carapito and F. Bertrand.

## Installation

You can install the released version of mi4p from [CRAN](https://CRAN.R-project.org) with:

```r
install.packages("mi4p")
```

You can install the development version of mi4p from [github](https://github.com) with:

```{r, eval = FALSE}
devtools::install_github("mariechion/mi4p")
```

## Examples

### First section

```{r}
library(mi4p)
```

```{r}
set.seed(4619)
datasim <- protdatasim()
str(datasim)
```

```{r}
attr(datasim, "metadata")
```

## AMPUTATION
<!-- Use `eval=LOCAL` to prevent testing too long computation during CRAN checks.-->
```{r, cache=TRUE, eval=LOCAL}
MV1pct.NA.data <- MVgen(dataset = datasim[,-1], prop_NA = 0.01)
MV1pct.NA.data
```

## IMPUTATION
```{r, cache=TRUE, eval=LOCAL}
MV1pct.impMLE <- multi.impute(data = MV1pct.NA.data, conditions = attr(datasim,"metadata")$Condition, method = "MLE", parallel = FALSE)
```

## ESTIMATION
```{r, cache=TRUE, eval=LOCAL}
print(paste(Sys.time(), "Dataset", 1, "out of", 1))
MV1pct.impMLE.VarRubin.Mat <- rubin2.all(data = MV1pct.impMLE, metacond = attr(datasim, "metadata")$Condition) 
```

## PROJECTION
```{r, cache=TRUE, eval=LOCAL}
print(paste("Dataset", 1, "out of",1, Sys.time()))
MV1pct.impMLE.VarRubin.S2 <- as.numeric(lapply(MV1pct.impMLE.VarRubin.Mat, function(aaa){
    DesMat = mi4p::make.design(attr(datasim, "metadata"))
    return(max(diag(aaa)%*%t(DesMat)%*%DesMat))
  }))
```

## MODERATED T-TEST
```{r, cache=TRUE, eval=LOCAL}
MV1pct.impMLE.mi4limma.res <- mi4limma(qData = apply(MV1pct.impMLE,1:2,mean), 
                 sTab = attr(datasim, "metadata"), 
                 VarRubin = sqrt(MV1pct.impMLE.VarRubin.S2))
MV1pct.impMLE.mi4limma.res
(simplify2array(MV1pct.impMLE.mi4limma.res)$P_Value.A_vs_B_pval)[1:10]

(simplify2array(MV1pct.impMLE.mi4limma.res)$P_Value.A_vs_B_pval)[11:200]<=0.05
```

True positive rate
```{r, cache=TRUE, eval=LOCAL}
sum((simplify2array(MV1pct.impMLE.mi4limma.res)$P_Value.A_vs_B_pval)[1:10]<=0.05)/10
```

False positive rate
```{r, cache=TRUE, eval=LOCAL}
sum((simplify2array(MV1pct.impMLE.mi4limma.res)$P_Value.A_vs_B_pval)[11:200]<=0.05)/190
```


```{r, cache=TRUE, eval=LOCAL}
MV1pct.impMLE.dapar.res <-limmaCompleteTest.mod(qData = apply(MV1pct.impMLE,1:2,mean), sTab = attr(datasim, "metadata"))
MV1pct.impMLE.dapar.res
```

Simulate a list of 100 datasets.
```{r}
set.seed(4619)
norm.200.m100.sd1.vs.m200.sd1.list <- lapply(1:100, protdatasim)
metadata <- attr(norm.200.m100.sd1.vs.m200.sd1.list[[1]],"metadata")
```

100 datasets with parallel comuting support. Quite long to run even with parallel computing support.
```{r, eval=FALSE}
library(foreach)
doParallel::registerDoParallel(cores=NULL)
requireNamespace("foreach",quietly = TRUE)
```

## AMPUTATION
```{r, eval=FALSE}
MV1pct.NA.data <- foreach::foreach(iforeach =  norm.200.m100.sd1.vs.m200.sd1.list,
                          .errorhandling = 'stop', .verbose = T) %dopar% 
  MVgen(dataset = iforeach[,-1], prop_NA = 0.01)
```

## IMPUTATION
```{r, eval=FALSE}
MV1pct.impMLE <- foreach::foreach(iforeach =  MV1pct.NA.data,
                         .errorhandling = 'stop', .verbose = F) %dopar% 
  multi.impute(data = iforeach, conditions = metadata$Condition, 
               method = "MLE", parallel  = F)
```

## ESTIMATION
```{r, eval=FALSE}
MV1pct.impMLE.VarRubin.Mat <- lapply(1:length(MV1pct.impMLE), function(index){
  print(paste(Sys.time(), "Dataset", index, "out of", length(MV1pct.impMLE)))
  rubin2.all(data = MV1pct.impMLE[[index]], metacond = metadata$Condition) 
})
```

## PROJECTION
```{r, eval=FALSE}
MV1pct.impMLE.VarRubin.S2 <- lapply(1:length(MV1pct.impMLE.VarRubin.Mat), function(id.dataset){
  print(paste("Dataset", id.dataset, "out of",length(MV1pct.impMLE.VarRubin.Mat), Sys.time()))
  as.numeric(lapply(MV1pct.impMLE.VarRubin.Mat[[id.dataset]], function(aaa){
    DesMat = mi4p::make.design(metadata)
    return(max(diag(aaa)%*%t(DesMat)%*%DesMat))
  }))
})
```

## MODERATED T-TEST
```{r, eval=FALSE}
MV1pct.impMLE.mi4limma.res <- foreach(iforeach =  1:100,  .errorhandling = 'stop', .verbose = T) %dopar%
  mi4limma(qData = apply(MV1pct.impMLE[[iforeach]],1:2,mean), 
                 sTab = metadata, 
                 VarRubin = sqrt(MV1pct.impMLE.VarRubin.S2[[iforeach]]))

MV1pct.impMLE.dapar.res <- foreach(iforeach =  1:100,  .errorhandling = 'stop', .verbose = T) %dopar%
  limmaCompleteTest.mod(qData = apply(MV1pct.impMLE[[iforeach]],1:2,mean), 
                        sTab = metadata)
```