# evaluomeR - optimal $k$ analysis

In [1]:
library("evaluomeR")

options(scipen=10)

Loading required package: SummarizedExperiment
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

# Table of contents
* [Dataset](#dataset)
* [Analysis per metric](#single)
    * [Stability](#single_stab)
    * [Quality](#single_qual)
    * [Optimal K value](#single_optimal)
* [Analysis for all the metrics](#all)
    * [Stability](#all_stab)
    * [Quality](#all_qual)
    * [Optimal K value](#all_optimal)

# Dataset <a class="anchor" id="dataset"></a>

In [8]:
seed = 13606
set.seed(seed)

agro_df = read.csv(file.path(getwd(), "data", "agro.csv"), header=TRUE, stringsAsFactors=FALSE)
head(agro_df)

Description,ANOnto,AROnto,CBOOnto,CBOOnto2,CROnto,DITOnto,INROnto,LCOMOnto,NACOnto,NOCOnto,NOMOnto,POnto,PROnto,RFCOnto,RROnto,TMOnto,TMOnto2,WMCOnto,WMCOnto2
ADO,0.0,3.9503849,0.9991446,0.9991446,0.9957228,3,0.9991446,1.999142,1.0,292.0,2.9632164,0.9957228,0.7478411,3.962361,0.252158895,0.0,0.0,1.999142,1.0
AEO,0.9298246,0.5438596,0.9824561,0.9824561,0.0,5,0.9824561,2.357143,1.0,3.733333,0.9824561,0.5789474,0.5,1.9649123,0.5,0.0,0.0,2.357143,1.0
AFO,0.75,0.0,0.875,0.875,3998.875,3,0.875,1.333333,1.0,3.5,2275.75,0.25,0.9996157,2276.625,0.000384341,0.0,0.0,1.333333,1.0
AGRO,0.9907407,3.1018519,1.0694444,1.0694444,0.3634259,16,1.0694444,7.695971,1.052174,2.287129,1.2037037,1.0555556,0.5295316,2.2731481,0.470468432,0.06264501,2.148148,9.134783,1.186957
AGRORDF,1.2362637,0.0,1.0659341,1.0659341,0.0,6,1.032967,2.467532,1.077465,4.7,0.5879121,0.8571429,0.3627119,1.6538462,0.637288136,0.0718232,2.0,2.676056,1.084507
ANAEETHES,0.0,0.0,0.6666667,0.6666667,1107.666667,2,0.6666667,1.0,1.0,2.0,0.0,0.0,0.0,0.6666667,1.0,0.0,0.0,1.0,1.0


# Analysis per metric <a class="anchor" id="single"></a>
This section demonstrates how to conduct an optimal $k$ analysis for each metric (feature) in an input dataset. We iterate over the range $k \in [3,6]$, excluding $k=2$ to avoid binary classifications. The CBI *kmeans* serves as the default clustering method.

In [9]:
k.range=c(3,6)
cbi="kmeans"
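
`k.range` stores only the two endpoints of the interval. As a minimal sketch in base R (the name `k.values` is introduced here for illustration and is not part of `evaluomeR`), the individual $k$ values covered by that interval can be listed as follows:

```r
# Endpoints of the k interval, as defined in the cell above.
k.range <- c(3, 6)

# Expand the endpoints into every integer k in the interval.
# `k.values` is a hypothetical name used only for this illustration.
k.values <- seq(k.range[1], k.range[2])
print(k.values)
```

These four values match the `Calculation of k = ...` lines printed by the analysis cells.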

### Stability <a class="anchor" id="single_stab"></a>
First, we compute the stabilities for the given range of $k$ with the `stabilityRange` method. Because `stabilityRange` returns an [ExperimentList](https://rdrr.io/bioc/MultiAssayExperiment/man/ExperimentList.html) object, we convert its output into a dataframe with the `standardizeStabilityData` method.

In [10]:
stab_range = stabilityRange(data=agro_df, k.range=k.range, 
                            bs=100,
                            cbi=cbi)
stab = standardizeStabilityData(stab_range)


Data loaded.
Number of rows: 78
Number of columns: 20


Processing metric: ANOnto(1)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: AROnto(2)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: CBOOnto(3)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: CBOOnto2(4)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: CROnto(5)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: DITOnto(6)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: INROnto(7)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: LCOMOnto(8)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6


Stabilities for each metric and for each $k$ value can be seen in the `stab` dataframe:

In [11]:
stab

Metric,k_3,k_4,k_5,k_6
ANOnto,0.8708931,0.7841959,0.7296689,0.75099
AROnto,0.9538625,0.80226,0.8061481,0.7299783
CBOOnto,0.8458175,0.7740625,0.7687425,0.7984101
CBOOnto2,0.8458175,0.7740625,0.7687425,0.7984101
CROnto,0.6771065,0.621021,0.6142683,0.7003332
DITOnto,0.9577282,0.6616042,0.6735508,0.6662409
INROnto,0.8881516,0.7434687,0.7773876,0.8161579
LCOMOnto,0.7587954,0.6595053,0.8119282,0.742445
NACOnto,0.8386325,0.7071669,0.8002024,0.8274214
NOCOnto,0.8909113,0.9233513,0.8334101,0.8407038
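
To read the table, it helps to find, for each metric, the $k$ with the highest stability. The following is a minimal base R sketch on three rows copied from the output above; `stab_example` and `most_stable_k` are illustrative names, and the layout of the real `stab` object (row names versus a metric column) is an assumption here.

```r
# A few rows copied from the stability table above (rows = metrics, columns = k).
stab_example <- data.frame(
  k_3 = c(0.8708931, 0.7587954, 0.8909113),
  k_4 = c(0.7841959, 0.6595053, 0.9233513),
  k_5 = c(0.7296689, 0.8119282, 0.8334101),
  k_6 = c(0.7509900, 0.7424450, 0.8407038),
  row.names = c("ANOnto", "LCOMOnto", "NOCOnto")
)

# For each metric (row), pick the column holding the highest stability
# and strip the "k_" prefix to recover the numeric k.
best_col <- colnames(stab_example)[apply(stab_example, 1, which.max)]
most_stable_k <- setNames(as.integer(sub("^k_", "", best_col)),
                          rownames(stab_example))
print(most_stable_k)
```

For these three rows the most stable values are $k=3$ (ANOnto), $k=5$ (LCOMOnto), and $k=4$ (NOCOnto).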


### Quality <a class="anchor" id="single_qual"></a>
The goodness (quality) analysis follows the same pattern as the stability analysis: the `qualityRange` method returns an [ExperimentList](https://rdrr.io/bioc/MultiAssayExperiment/man/ExperimentList.html), and the `standardizeQualityData` method transforms it into a dataframe.

In [13]:
qual_range = qualityRange(data=agro_df, k.range=k.range, 
                            cbi=cbi)
qual = standardizeQualityData(qual_range)


Data loaded.
Number of rows: 78
Number of columns: 20


Processing metric: ANOnto(1)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: AROnto(2)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: CBOOnto(3)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: CBOOnto2(4)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: CROnto(5)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: DITOnto(6)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: INROnto(7)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: LCOMOnto(8)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6


Qualities for each metric and for each $k$ value can be seen in the `qual` dataframe:

In [14]:
qual

Metric,k_3,k_4,k_5,k_6
ANOnto,0.7152389,0.6387411,0.6808635,0.6765904
AROnto,0.896646,0.8767875,0.8710151,0.870818
CBOOnto,0.7988265,0.620285,0.6473309,0.6854217
CBOOnto2,0.7988265,0.620285,0.6473309,0.6854217
CROnto,0.9680079,0.9529486,0.9482417,0.8346625
DITOnto,0.6993418,0.6905454,0.6873396,0.6297655
INROnto,0.8115823,0.7898743,0.6626684,0.682638
LCOMOnto,0.6586388,0.6236585,0.593865,0.5267501
NACOnto,0.7939808,0.7795742,0.783855,0.7953857
NOCOnto,0.6785191,0.6616432,0.6106745,0.5431912
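
This notebook does not state how the quality score is computed. A common goodness measure for a clustering is the average silhouette width, and the sketch below assumes that interpretation purely for illustration. It uses base `kmeans` and `silhouette()` from the `cluster` package on synthetic data, not `evaluomeR` itself, so the numbers are unrelated to the table above.

```r
library(cluster)  # provides silhouette()

set.seed(13606)
# Synthetic one-dimensional metric with three well-separated groups
# (illustrative data, not taken from agro.csv).
x <- c(rnorm(20, mean = 0,  sd = 0.3),
       rnorm(20, mean = 5,  sd = 0.3),
       rnorm(20, mean = 10, sd = 0.3))

# Average silhouette width for each k in the range used in this notebook.
avg_sil <- sapply(3:6, function(k) {
  km  <- kmeans(x, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, dist(x))
  mean(sil[, "sil_width"])
})
names(avg_sil) <- paste0("k_", 3:6)
print(round(avg_sil, 3))
```

Silhouette widths lie in $[-1, 1]$, with values near 1 indicating compact, well-separated clusters.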


### Optimal K value <a class="anchor" id="single_optimal"></a>
In this section we show how to compute the optimal $k$ value of a dataset **per metric**.

In [20]:
k_opt = getOptimalKValue(stab_range, qual_range, k.range= k.range)
optimal_k = as.numeric(k_opt$Global_optimal_k)

Processing metric: all_metrics

	Maximum stability and quality values matches the same K value: '3'



In the following table, the `Stability_max_k` column shows the $k$ at which each metric was most stable, the `Quality_max_k` column shows the $k$ with the highest goodness (quality), and the `Global_optimal_k` column shows the overall $k$ that `evaluomeR` selects for the metric.

In [6]:
k_opt

Metric,Stability_max_k,Stability_max_k_stab,Stability_max_k_qual,Quality_max_k,Quality_max_k_stab,Quality_max_k_qual,Global_optimal_k
ANOnto,3,0.8708931,0.7152389,3,0.8708931,0.7152389,3
AROnto,3,0.9538625,0.896646,3,0.9538625,0.896646,3
CBOOnto,3,0.8458175,0.7988265,3,0.8458175,0.7988265,3
CBOOnto2,3,0.8458175,0.7988265,3,0.8458175,0.7988265,3
CROnto,8,0.7162224,0.8539272,3,0.6771065,0.9680079,3
DITOnto,3,0.9577282,0.6993418,3,0.9577282,0.6993418,3
INROnto,3,0.8881516,0.8115823,3,0.8881516,0.8115823,3
LCOMOnto,5,0.8119282,0.593865,3,0.7587954,0.6586388,3
NACOnto,3,0.8386325,0.7939808,10,0.7901522,0.8304728,10
NOCOnto,4,0.9233513,0.6616432,3,0.8909113,0.6785191,3
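
A quick way to inspect this table is to list the metrics for which the most stable $k$ and the highest-quality $k$ disagree. The sketch below does this in base R on a few rows copied from the output above; `k_opt_example` is an illustrative stand-in for the real `k_opt` object.

```r
# Selected rows copied from the k_opt table above.
k_opt_example <- data.frame(
  Metric           = c("ANOnto", "CROnto", "LCOMOnto", "NACOnto", "NOCOnto"),
  Stability_max_k  = c(3, 8, 5, 3, 4),
  Quality_max_k    = c(3, 3, 3, 10, 3),
  Global_optimal_k = c(3, 3, 3, 10, 3),
  stringsAsFactors = FALSE
)

# Metrics whose most stable k and highest-quality k disagree.
disagree <- k_opt_example[k_opt_example$Stability_max_k !=
                            k_opt_example$Quality_max_k, ]
print(disagree$Metric)

# In these rows the reported global k coincides with Quality_max_k.
# This is an observation about the table, not a statement of the package's rule.
print(all(disagree$Global_optimal_k == disagree$Quality_max_k))
```

For the rows shown, every disagreement is resolved in favour of `Quality_max_k`; whether that holds in general depends on the decision logic inside `getOptimalKValue`, which is not described here.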


# Analysis for all the metrics <a class="anchor" id="all"></a>
This section outlines how to determine a single optimal $k$ value for the dataset as a whole, rather than per metric. Once again we consider the range $k \in [3, 6]$, excluding $k = 2$ to avoid binary classifications. The CBI *kmeans* method is used as the default clustering algorithm.

### Stability <a class="anchor" id="all_stab"></a>
Here we set the parameter `all_metrics=TRUE` in order to consider the stability of all the metrics as a whole.

In [17]:
stab_range = stabilityRange(data=agro_df, k.range=k.range,
                            all_metrics=TRUE,
                            bs=100,
                            cbi=cbi)
stab = standardizeStabilityData(stab_range)
stab


Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6


Metric,k_3,k_4,k_5,k_6
all_metrics,0.6858984,0.6734864,0.6597495,0.6645691


The table above does not list the individual metrics; instead, it shows a single combined variable named `all_metrics`, which represents the overall stability of the metrics across the dataset.

### Quality <a class="anchor" id="all_qual"></a>
As in the [Stability](#all_stab) section, the parameter `all_metrics=TRUE` must be set in order to consider all the metrics together.

In [18]:
qual_range = qualityRange(data=agro_df, k.range=k.range,
                          all_metrics=TRUE,
                          cbi=cbi)
qual = standardizeQualityData(qual_range)
qual


Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6


Metric,k_3,k_4,k_5,k_6
all_metrics,0.9652177,0.9491769,0.9192709,0.8646877


### Optimal K value <a class="anchor" id="all_optimal"></a>
In this section we show how to compute the optimal $k$ value considering all the metrics from **the whole dataset**.

In [22]:
k_opt = getOptimalKValue(stab_range, qual_range, k.range= k.range)
optimal_k = as.numeric(k_opt$Global_optimal_k)
k_opt

Processing metric: all_metrics

	Maximum stability and quality values matches the same K value: '3'



Metric,Stability_max_k,Stability_max_k_stab,Stability_max_k_qual,Quality_max_k,Quality_max_k_stab,Quality_max_k_qual,Global_optimal_k
all_metrics,3,0.6858984,0.9652177,3,0.6858984,0.9652177,3


In the previous table, according to the column `Global_optimal_k`, `evaluomeR` considered that the optimal $k$ value for the given dataset is $k=3$.
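
Once an optimal $k$ is known, it can be passed to a clustering routine. The following is a minimal sketch with base `kmeans`, using three metric columns for the six ontologies printed by `head(agro_df)` earlier. The subset `agro_head` is an illustrative stand-in for the full dataset, and plain `kmeans` may differ from the clustering `evaluomeR` runs internally.

```r
# Three metric columns for the six ontologies shown by head(agro_df) above.
agro_head <- data.frame(
  ANOnto  = c(0.0, 0.9298246, 0.75, 0.9907407, 1.2362637, 0.0),
  AROnto  = c(3.9503849, 0.5438596, 0.0, 3.1018519, 0.0, 0.0),
  CBOOnto = c(0.9991446, 0.9824561, 0.875, 1.0694444, 1.0659341, 0.6666667),
  row.names = c("ADO", "AEO", "AFO", "AGRO", "AGRORDF", "ANAEETHES")
)

optimal_k <- 3  # value reported in Global_optimal_k above

set.seed(13606)
# Plain base-R k-means with the chosen k (an assumption for illustration).
km <- kmeans(agro_head, centers = optimal_k, nstart = 10)
print(km$cluster)
```

With the full `agro_df`, the non-numeric `Description` column would need to be dropped or moved to the row names before calling `kmeans`.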