# ML notebooks to process the outputs of the [nf-core/metaboigniter](https://github.com/nf-core/metaboigniter) pipeline

---

As explained in the [metaboigniter output documentation](https://github.com/nf-core/metaboigniter/blob/master/docs/output.md), the output folder contains three tabular files: one for the __peak table__, one for the __variable metadata__ (including mz, RT, adduct, isotope, and identification information for each mass trace) and one for the __sample metadata__ (original file names for each sample and additional information provided by the phenotype file).

In the following notebooks, we use the peak table to identify the compounds that best separate the two sample groups (Liver Cancer vs. Case Control).

Then, we use the two other text files (variable and sample metadata) to identify these meaningful variables.

---
### 1-clean_peakTable.ipynb

This notebook takes the file `peaktablePOSout_pos_metfrag.txt` from metaboigniter results as input.

Initially, the txt file looks like this :

| dataMatrix      | EPIC_Liver_Cancer_NR160809_007_41_LivCan_153_007.mzML | EPIC_Liver_Cancer_NR160809_008_41_LivCan_154_008.mzML | ... |
| :-------------- | :-----------:| :----------: | :---: |
| variable_3      | 19.7617... | 19.7352... | ... |
| variable_5      | 14.5368... | 15.1933... | ... |
| variable_6      | 22.1855... | 20.8314... | ... |
| ...                   | ...              | ...              | ... |

It contains variables in rows and samples in columns.

For further analysis, we prefer to have variables (i.e. compounds) in columns and samples in rows. In this notebook, we also add two columns :
- SampleID : LivCan_XXX, taken from the end of the file name
- Groups : Incident or Non-case

At the end of this notebook, our peak table looks like this :

| SampleID               | Groups     | variable_3  | variable_5 | ... |
| :--------------------- | :----------: | :----------: | :----------: | :---: |
| LivCan_153.mzML | Incident    | 19.7617... | 14.5368... | ... |
| LivCan_154.mzML | Non-case | 19.7352... | 15.1933... | ... |
| ...                            | ...              | ...              | ...              | ... |

The output of this notebook is the csv file `peakTable_HILIC_POS.csv`.

_NB : More metadata could also be added as columns after the __Groups__ column._
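
The reshaping described above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the intensity values are shortened placeholders, the regular expression used for `SampleID` and the `groups` lookup are assumptions, and the real group labels may come from another source.

```python
import io
import pandas as pd

# Tiny stand-in for the metaboigniter peak table (tab-separated, variables in rows).
# Intensities are shortened placeholder values.
raw = io.StringIO(
    "dataMatrix\tEPIC_Liver_Cancer_NR160809_007_41_LivCan_153_007.mzML\t"
    "EPIC_Liver_Cancer_NR160809_008_41_LivCan_154_008.mzML\n"
    "variable_3\t19.76\t19.73\n"
    "variable_5\t14.53\t15.19\n"
)
peaks = pd.read_csv(raw, sep="\t", index_col="dataMatrix")

# Transpose: samples become rows, variables become columns.
table = peaks.T
table.columns.name = None

# Assumption: the sample ID is the "LivCan_<number>" token of the file name.
sample_id = table.index.str.extract(r"(LivCan_\d+)", expand=False)

# Assumption: group labels come from a lookup keyed by sample ID (hypothetical mapping).
groups = {"LivCan_153": "Incident", "LivCan_154": "Non-case"}

table.insert(0, "SampleID", sample_id)
table.insert(1, "Groups", table["SampleID"].map(groups))
table = table.reset_index(drop=True)

# The result could then be written with table.to_csv("peakTable_HILIC_POS.csv", index=False).
print(table)
```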

---

### 2-explore_data.ipynb

This notebook takes the previously created file `peakTable_HILIC_POS.csv` as input.

The objective of this notebook is to explore the peak table cleaned in the previous notebook (`1-clean_peakTable.ipynb`) with a few visualisations of :
- target (here the Sample Group, i.e. Incident vs. Non-case)
- missing values
- outliers
- ...

#### Basic checklist :

__Form analysis__ :
- __target__ : Groups
- __shape (rows & columns)__ : 186 rows (samples) x 558 columns (556 compounds)
- __features types__ :
    - qualitative : 2 (Groups, SampleID)
    - quantitative : 556 (compounds)
- __missing values__ :
    - a compound can be present in every sample (0% missing values), in most of them, or in just a few
    - the highest missing-value rate for a single variable is 49.5%, i.e. this variable is absent from 49.5% of the samples
    - (samples seem to be more easily separated when there are not too many missing values, as one would expect)
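
As an illustration of the form analysis above, the share of missing values per compound and the class balance can be computed with pandas. The table below is a made-up miniature (four invented samples, two compounds), so the numbers are not those of the real data set.

```python
import numpy as np
import pandas as pd

# Made-up miniature peak table; column names mimic the cleaned table.
df = pd.DataFrame({
    "SampleID": ["S1", "S2", "S3", "S4"],  # placeholder IDs
    "Groups": ["Incident", "Non-case", "Incident", "Non-case"],
    "variable_3": [19.8, 19.7, 20.1, 19.9],
    "variable_5": [14.5, np.nan, 15.0, np.nan],
})

# Keep only the compound columns.
features = df.drop(columns=["SampleID", "Groups"])

# Percentage of missing values per compound, highest first.
missing_pct = features.isna().mean().mul(100).sort_values(ascending=False)

# Number of samples in each group (target balance).
class_counts = df["Groups"].value_counts()

print(df.shape)
print(missing_pct)
print(class_counts)
```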

__Content analysis__ :
- __target visualisation__ :
    - ratio 1:1 (93 Cancer - 93 Healthy)
- __feature visualisation__ :
    - on the first 10 compounds, most of them follow a normal distribution
    - some of them follow a bimodal (double normal) distribution
    - hypothesis : one mode for each class (Healthy vs Cancer)
- __relation features/target__ :
    - on the first 10 compounds, we don't see a clear difference in intensity between the Cancer and Healthy samples --> the previous hypothesis is rejected for these compounds, but it may still hold for others
- __relation features/features__ : strong correlations between some of the features --> dimensionality reduction is needed for further analysis
- __t-test__ : a rough approximation, but a first t-test gives a first view of potentially important variables

This t-test is a rough approximation because it treats the feature intensities as independent, whereas we know that they interact strongly. We can still observe that, for a few compounds, there is a significant difference in intensity between the cancer and healthy samples.
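
A per-compound t-test of this kind can be sketched with `scipy.stats.ttest_ind`. The data below are simulated, and the choice of Welch's variant (`equal_var=False`) is an assumption, not necessarily what the notebook uses.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data: variable_a differs between groups, variable_b does not.
rng = np.random.default_rng(0)
n = 20
df = pd.DataFrame({
    "Groups": ["Incident"] * n + ["Non-case"] * n,
    "variable_a": np.concatenate([rng.normal(22, 1, n), rng.normal(18, 1, n)]),
    "variable_b": rng.normal(15, 1, 2 * n),
})

features = df.drop(columns="Groups")
incident = features[df["Groups"] == "Incident"]
non_case = features[df["Groups"] == "Non-case"]

# One independent two-sample test per column; missing values are skipped.
t_stat, p_val = stats.ttest_ind(
    incident, non_case, axis=0, equal_var=False, nan_policy="omit"
)

# Rank compounds by p-value, smallest first.
results = pd.DataFrame(
    {"t": np.asarray(t_stat), "p_value": np.asarray(p_val)},
    index=features.columns,
).sort_values("p_value")

print(results)
```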

---

### 3-missing_value_imputation.ipynb

This notebook takes the csv file `peakTable_HILIC_POS.csv` (created in the notebook `1-clean_peakTable.ipynb`) as input.

The purpose of this notebook is to use different methods to fill the missing values in our peak table :

- __Univariate__ feature imputation :
    - __zero__ (or __one__ or any other constant value to avoid further analytical problems)
    - __mean__
    - __median__
    - __mode__ (most frequent)
    - __minimum__
    - __half minimum__
- __Multivariate__ feature imputation :
    - __MICE__ (inspired by the R `mice` package)
- __KNN imputation__

These methods come from the scikit-learn documentation : [cf. scikit-learn doc](https://scikit-learn.org/stable/modules/impute.html#marking-imputed-values)

One type of imputation algorithm is __univariate__ : it imputes values in the i-th feature dimension __using only non-missing values in that feature dimension__ (e.g. `impute.SimpleImputer`). By contrast, __multivariate__ imputation algorithms __use the entire set of available feature dimensions__ to estimate the missing values (e.g. `impute.IterativeImputer`).
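
A minimal sketch of some of the methods listed above, on a made-up two-compound table. `SimpleImputer` and `KNNImputer` are scikit-learn classes; the half-minimum rule has no built-in `SimpleImputer` strategy, so it is written here with plain pandas as one possible approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up intensities with one missing value per compound.
X = pd.DataFrame({
    "variable_3": [19.8, 19.7, np.nan, 19.9],
    "variable_5": [14.0, np.nan, 15.0, 16.0],
})

# Univariate: each column is filled from its own observed values only.
X_median = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(X), columns=X.columns
)
X_zero = pd.DataFrame(
    SimpleImputer(strategy="constant", fill_value=0).fit_transform(X),
    columns=X.columns,
)

# Half minimum: replace missing values with half of the column minimum.
X_half_min = X.fillna(X.min() / 2)

# KNN: fill each gap from the most similar samples (here 2 neighbours).
X_knn = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(X), columns=X.columns
)

print(X_median)
print(X_half_min)
print(X_knn)
```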


The scikit-learn `IterativeImputer` is still experimental, so we also use the R `mice` package directly ([documentation](https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/mice)) in the separate R notebook `3.2-missing_value_imputation_MICE` (in this directory).

Here is a link where the [MICE algorithm is explained](https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html).

The MICE (Multivariate Imputation by Chained Equations) algorithm is a multivariate method to impute missing values. Each variable with missing values is imputed using a separate model fitted on the other variables in the dataset. Iterations should be run until convergence appears to have been reached.
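
On the Python side, the chained-equations idea can be sketched with scikit-learn's `IterativeImputer`, which has to be enabled explicitly because it is experimental. The data are simulated, and the parameters (`max_iter=10`, `random_state=0`) are illustrative choices. With its default settings this class returns a single imputation, so it is an approximation of what the R package does, not an equivalent.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (unlocks IterativeImputer)
from sklearn.impute import IterativeImputer

# Simulated table: the second feature is strongly related to the first one.
rng = np.random.default_rng(0)
base = rng.normal(20, 1, size=50)
X = np.column_stack([
    base,
    2 * base + rng.normal(0, 0.1, size=50),
    rng.normal(15, 1, size=50),
])

# Remove every fifth value of the second feature.
X_missing = X.copy()
X_missing[::5, 1] = np.nan

# Each feature with missing values is modelled from the other features,
# and the process is repeated for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Largest absolute error on the values that were removed.
max_error = np.abs(X_imputed[::5, 1] - X[::5, 1]).max()
print(max_error)
```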


For each of these methods, we can save the imputed peak table as a new csv file.

---

### 4-normalisation_scaling_pipeline.ipynb

This notebook takes as input a peak table imputed with any of the previous methods, as long as it has no NAs left.

The purpose of this notebook is to use different methods to normalise/scale the data in our peak table.

The function `normPeakTable` takes a peak table and a list of methods to normalise the peak table. The available methods are :
- log10 : base-10 logarithm
- std : standard scaler
- min-max normalisation
    - minmax : across features
    - minmax_rows : across samples
- scale to unit norm (vector length). If $x$ is a vector of length $n$, the normalised vector is $y=x/z$, where $z$ is defined as follows according to the chosen norm :
    - norm_l1 : with l1 norm $\rightarrow z = \| x\|_1 = \sum_{i=1}^n |x_i|$
    - norm_l2 : with l2 norm $\rightarrow z = \| x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$
    

Several normalisation/scaling methods can be applied to the same peak table. Pass the list of methods as a parameter, in the order in which they should be applied. For example, with `normPeakTable(X_KNN, ['std', 'norm_l2'])`, a standard scaler is applied first, followed by l2-norm scaling.

The normalised/scaled peak tables (output of the function `normPeakTable`) can be saved as csv files.
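
The chaining behaviour can be illustrated with a hypothetical re-implementation. `norm_peak_table_sketch` below is not the notebook's `normPeakTable`. It assumes that `std` and `minmax` work per feature (column), and that `minmax_rows`, `norm_l1` and `norm_l2` work per sample (row). The input values are made up.

```python
import numpy as np
import pandas as pd


def norm_peak_table_sketch(X, methods):
    """Apply the named transformations to X in the given order (hypothetical sketch)."""
    steps = {
        "log10": lambda d: np.log10(d),
        # Per feature (column): zero mean, unit variance.
        "std": lambda d: (d - d.mean()) / d.std(ddof=0),
        # Per feature (column): rescale to [0, 1].
        "minmax": lambda d: (d - d.min()) / (d.max() - d.min()),
        # Per sample (row): rescale to [0, 1].
        "minmax_rows": lambda d: d.sub(d.min(axis=1), axis=0).div(
            d.max(axis=1) - d.min(axis=1), axis=0
        ),
        # Per sample (row): divide by the l1 or l2 norm of the row.
        "norm_l1": lambda d: d.div(d.abs().sum(axis=1), axis=0),
        "norm_l2": lambda d: d.div(np.sqrt((d ** 2).sum(axis=1)), axis=0),
    }
    out = X.astype(float)
    for name in methods:
        if name not in steps:
            raise ValueError(f"unknown method: {name}")
        out = steps[name](out)
    return out


# Made-up table: two samples (rows), two compounds (columns).
X = pd.DataFrame({"variable_3": [3.0, 6.0], "variable_5": [4.0, 8.0]})

only_l2 = norm_peak_table_sketch(X, ["norm_l2"])
std_then_l2 = norm_peak_table_sketch(X, ["std", "norm_l2"])

print(only_l2)
print(std_then_l2)
```

In this sketch the order matters: applying `norm_l2` alone gives both rows the same direction, whereas standardising first makes the two rows point in opposite directions.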

---