# Data cleaning
---

Metabolites are small molecules involved in biochemical processes within living organisms. They are the end products of metabolic pathways or intermediates in the conversion of one molecule to another, and they play essential roles in cellular functions, including energy production, signaling, growth, and maintenance of cellular structures.

Metabolomics is the comprehensive analysis of the metabolites present in a biological system, and metabolomic data are the resulting measurements: the identification, quantification, and characterization of metabolites within a sample or set of samples. Metabolomic studies aim to provide a snapshot of the metabolic state of an organism, tissue, or cell under specific conditions, such as disease, drug treatment, or environmental exposure.

Metabolomic data sets are typically complex and high-dimensional, with many variables (metabolites) measured across multiple samples, so analyzing and interpreting them requires sophisticated statistical and computational approaches. With such analyses, researchers can identify biomarkers associated with specific conditions, uncover metabolic pathways or networks relevant to diseases, study drug metabolism, and gain insight into the biochemical mechanisms underlying physiological or pathological processes and the impact of external factors on organisms.

In [2]:
using DataFrames, Arrow
using LinearAlgebra, Distributions, Statistics, HypothesisTests
using UMAP, Distances, StatsBase, TSne, Optim
using Plots, StatsPlots 

In [4]:
df = DataFrame(Arrow.Table("C:/Users/nicol/Documents/Datasets/ST001828/covariates.arrow"));
btpos = DataFrame(Arrow.Table("C:/Users/nicol/Documents/Datasets/ST001828/btpos.arrow"));
btpos_c = DataFrame(Arrow.Table("C:/Users/nicol/Documents/Datasets/ST001828/btpos_code.arrow"));

In [63]:
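# Count missing values per metabolite (row) across the sample columns 2:101
tmp = sum(ismissing.(Matrix(btpos[:,2:101])), dims = 2)
# Pair each metabolite name with its number of missing values
tmp1 = DataFrame(metabolite = Array(string.(btpos.Metabolite)),
                 NAs = tmp[:,1]);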

In [68]:
# Select the metabolites with <= 25% missing data.
# Columns 2:101 are 100 samples, so a count of 25 equals 25%.
valid_met = tmp1[tmp1.NAs .<= 25,"metabolite"];
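
The count threshold of 25 matches 25% only because columns 2:101 hold exactly 100 samples. As a minimal sketch, assuming the same layout as `btpos` (one row per metabolite, one column per sample), the filter can be written on the fraction of missing values so it does not depend on the number of samples. The `toy` table, its column names, and its values are made up for illustration.

```julia
using DataFrames

# Toy stand-in for `btpos`: one row per metabolite, one column per sample.
# All names and values here are hypothetical.
toy = DataFrame(
    Metabolite = ["m1", "m2", "m3"],
    s1 = [1.0, missing, 3.0],
    s2 = [2.0, missing, missing],
    s3 = [1.5, 0.7, 2.2],
    s4 = [1.1, missing, 2.9],
)

# Every column except the metabolite identifier is treated as a sample
sample_cols = names(toy, Not(:Metabolite))
Xtoy = Matrix(toy[:, sample_cols])

# Fraction of missing values per metabolite (row)
frac_missing = vec(sum(ismissing.(Xtoy), dims = 2)) ./ length(sample_cols)

# Keep metabolites with at most 25% missing values
toy_valid = toy.Metabolite[frac_missing .<= 0.25]
```

Here `m2` is dropped because three of its four values are missing, while `m3` sits exactly on the 25% boundary and is kept.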

## Imputation
---
KNN or RandomForest to impute missing data → median?
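
As a point of comparison for the options in the note above, here is a minimal sketch of the simplest one, median imputation. It assumes rows are metabolites and columns are samples, and it replaces each missing value with the median of the observed values for that metabolite. This is not KNN or random forest imputation. The function name `impute_row_median` and the matrix `Ximp_toy` are hypothetical.

```julia
using Statistics

# Made-up intensity matrix: rows are metabolites, columns are samples.
Ximp_toy = [1.0     missing 3.0;
            missing 4.0     6.0]

# Replace each missing entry with the median of the observed values in its row.
function impute_row_median(X::AbstractMatrix)
    out = Matrix{Float64}(undef, size(X))
    for i in axes(X, 1)
        row = X[i, :]
        med = median(skipmissing(row))   # throws if every value in the row is missing
        out[i, :] = coalesce.(row, med)  # keep observed values, fill the gaps
    end
    return out
end

X_imputed = impute_row_median(Ximp_toy)
```

Filtering out metabolites with many missing values first, as in the previous step, keeps each row median based on a reasonable number of observed samples.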