# evaluomeR - RSKC - metric relevancy

In [1]:
library("ISLR") 
library("sparcl")
library("evaluomeR")


options(scipen=10)

Loading required package: SummarizedExperiment
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which

# Table of contents
* [Dataset](#dataset)
* [evaluomeR](#evaluomeR)
    * [Optimal K value](#optimal_k)
    * [Figuring out the L<sub>1</sub> upper boundary](#l1_boundry)
    * [Figuring out the best alpha](#alpha)
* [Metrics relevancy](#metrics_relevancy)
    * [Relevancy table](#relevancy_table)

# Dataset <a class="anchor" id="dataset"></a>
We use the NCI60 dataset from the `ISLR` package. For testing purposes we keep only the first 500 columns: the `Description` label column plus the first 499 variables.

In [2]:
seed = 13606
set.seed(seed)

nci60 = as.data.frame(NCI60$data)
# Create a "Description" column holding the sample labels and move it to the front
nci60["labels"] = rownames(nci60)
nci60 = nci60[ , c("labels", names(nci60)[names(nci60) != "labels"])]
nci60["labels"] = NCI60$labs
colnames(nci60)[colnames(nci60) == 'labels'] <- 'Description'
# Keep Description plus the first 499 variables
nci60 = nci60[1:500]
head(nci60)

| | Description | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V1 | CNS | 0.3 | 1.18 | 0.55 | 1.14 | -0.265 | -0.07 | 0.35 | -0.315 | -0.45 | ... | -0.43 | -0.035 | 0.1 | -0.285 | -0.14 | 0.01999023 | 0.37 | -0.38 | -0.3725 | -0.3200195 |
| V2 | CNS | 0.679961 | 1.289961 | 0.169961 | 0.379961 | 0.464961 | 0.579961 | 0.699961 | 0.724961 | -0.04003899 | ... | -0.330039 | -0.605039 | -0.580039 | -0.985039 | -0.550039 | 0.4199512 | 0.129961 | -0.09003899 | 0.03746101 | 0.0 |
| V3 | CNS | 0.94 | -0.04 | -0.17 | -0.04 | -0.605 | 0.0 | 0.09 | 0.645 | 0.43 | ... | 0.23 | -0.775 | -0.85 | -0.665 | -0.86 | 0.2399902 | -1.19 | -0.84 | -0.5125 | -0.8900195 |
| V4 | RENAL | 0.28 | -0.31 | 0.68 | -0.81 | 0.625 | -1.387779e-17 | 0.17 | 0.245 | 0.02 | ... | -0.18 | 0.385 | -0.68 | -0.115 | -0.66 | 0.1299902 | -0.6 | -0.52 | -0.3225 | -0.2600195 |
| V5 | BREAST | 0.485 | -0.465 | 0.395 | 0.905 | 0.2 | -0.005 | 0.085 | 0.11 | 0.235 | ... | -0.195 | -0.15 | -0.755 | -0.72 | -0.355 | -1.31500977 | -0.975 | -0.815 | -0.6775 | -1.3450195 |
| V6 | CNS | 0.31 | -0.03 | -0.1 | -0.46 | -0.205 | -0.54 | -0.64 | -0.585 | -0.77 | ... | -0.67 | -0.515 | -0.14 | -0.215 | -0.14 | 0.3099902 | -0.06 | -0.57 | -0.5425 | -0.5500195 |


# evaluomeR <a class="anchor" id="evaluomeR"></a>
Analysis with *evaluomeR*

## Optimal K value <a class="anchor" id="optimal_k"></a>
We calculate the optimal $k$ value with the *kmeans* CBI over the whole dataset (`all_metrics=TRUE`). The analysis considers the $k$ range [3,6]; $k=2$ is excluded to avoid binary classifications.

In [3]:
k.range=c(3,6)
cbi = "kmeans"

# Stability indexes over the k range (100 bootstrap replicates)
stab_range = stabilityRange(data=nci60, k.range=k.range, 
                            bs=100, seed=seed,
                            all_metrics=TRUE,
                            cbi=cbi)
stab = standardizeStabilityData(stab_range)

# Quality indexes over the same k range
qual_range = qualityRange(data=nci60, k.range=k.range, 
                            all_metrics=TRUE, seed=seed,
                            cbi=cbi)
qual = standardizeQualityData(qual_range)

# Optimal k, combining stability and quality
k_opt = getOptimalKValue(stab_range, qual_range, k.range= k.range)
optimal_k = as.numeric(k_opt$Global_optimal_k)


Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: all_metrics

	Both Ks do not have a stable classification: '4', '3'

	Using '3' since it provides higher silhouette width



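Neither $k=3$ nor $k=4$ yields a stable classification, so *evaluomeR* falls back on the silhouette width, which is higher for $k=3$. We therefore use $k=3$ as the optimal $k$ value.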
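
The log above reports that the choice between $k=3$ and $k=4$ was settled by silhouette width. As a rough, standalone illustration of that criterion (not *evaluomeR*'s internal code), the sketch below compares the average silhouette width of `kmeans` partitions for each $k$ in [3,6]. It assumes the `cluster` package is installed and runs on synthetic data; `toy`, `avg_sil` and `best_k` are illustrative names.

```r
library(cluster)  # provides silhouette()

set.seed(13606)

# Synthetic data (assumption): three separated groups of 20 points in 5 dimensions
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 5),
             matrix(rnorm(100, mean = 4), ncol = 5),
             matrix(rnorm(100, mean = 8), ncol = 5))
toy_dist <- dist(toy)

# Average silhouette width of the kmeans partition for each k in the range
avg_sil <- sapply(3:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, toy_dist)[, "sil_width"])
})
names(avg_sil) <- 3:6

# The k with the highest average silhouette width
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(avg_sil)
print(best_k)
```

A higher average silhouette width indicates more compact, better separated clusters, which is why it is a natural tie-breaker when no candidate $k$ is stable.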

## Figuring out the L<sub>1</sub> upper boundary <a class="anchor" id="l1_boundry"></a>
The algorithm for tuning the L<sub>1</sub> parameter, among others, is provided by the `sparcl` R package.
Since the global optimal $k$ for the dataset is $k=3$, we can now run the permutations that determine the L<sub>1</sub> boundary with the method `KMeansSparseCluster.permute` from `sparcl`.
Our tool, *evaluomeR*, offers a wrapper method, `getRSKCL1Boundry`, that determines the L<sub>1</sub> bound automatically.

Note: $1 < L_{1} \leq \sqrt{\text{num.variables}}$.
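
For readers who want to see the underlying `sparcl` call, here is a minimal sketch of `KMeansSparseCluster.permute` on synthetic data. It does not reproduce what `getRSKCL1Boundry` does internally. The data, the candidate grid `candidate_bounds` and the small `nperms` value are assumptions chosen to keep the example fast.

```r
library(sparcl)

set.seed(13606)

# Synthetic data (assumption): 60 samples, 20 variables, 3 groups that
# differ only in the first 5 variables
p <- 20
toy <- matrix(rnorm(60 * p), ncol = p)
toy[21:40, 1:5] <- toy[21:40, 1:5] + 3
toy[41:60, 1:5] <- toy[41:60, 1:5] - 3

# Candidate L1 bounds inside the admissible range (1, sqrt(p)]
candidate_bounds <- seq(1.1, sqrt(p), length.out = 8)

# Permutation-based selection of the bound via the gap statistic
perm <- KMeansSparseCluster.permute(toy, K = 3,
                                    wbounds = candidate_bounds,
                                    nperms = 5, silent = TRUE)

# Candidate bound with the highest gap statistic
print(perm$bestw)
```

In the notebook, the wrapper reports the value it found and then uses its floor as the L<sub>1</sub> bound.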

In [4]:
L1 = getRSKCL1Boundry(nci60, k=optimal_k, seed=seed)

Computing best L1 boundry with 'sparcl::KMeansSparseCluster.permute'
Best L1 found is: 9.01320962196161, using floor: 9


The best L<sub>1</sub> upper boundary for $k=3$ is $L_{1}=9$.
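
As a quick sanity check, the reported value can be compared against the admissible range from the note above. The sketch assumes that the number of variables is 499 (the 500 columns minus `Description`, as reported in the processing log); the variable names are illustrative.

```r
# Number of clustering variables: 500 columns minus the Description column
num_variables <- 499

# Admissible range for the L1 bound: 1 < L1 <= sqrt(num_variables)
l1_upper_limit <- sqrt(num_variables)

# Value reported by the wrapper, and the floor actually used
l1_found <- 9.01320962196161
l1_used <- floor(l1_found)

in_range <- l1_used > 1 && l1_used <= l1_upper_limit
print(l1_upper_limit)
print(in_range)
```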

## Figuring out the best alpha <a class="anchor" id="alpha"></a>
*evaluomeR* also offers a method, `getRSKCAlpha`, that automatically computes the alpha trimming parameter.

In [5]:
alpha = getRSKCAlpha(nci60, k=optimal_k, L1=L1, seed)

Running stability and quality indexes with alpha=0

Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3

Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3
Running stability and quality indexes with alpha=0.05

Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3

Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3
Running stability and quality indexes with alpha=0.1

Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3

Data loaded.
Number of rows: 64
Number of columns: 500


Processing all metrics, 'merge', in dataframe (499)
	Calculation of k = 3
Running stability and quality indexes with alpha=0
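
To give a feel for what the candidate alpha values in the log mean, the sketch below assumes that alpha is the proportion of cases trimmed, as in the `RSKC` package, and translates each candidate into an approximate number of trimmed rows for the 64 samples. The rounding rule (`floor`) is an assumption made for illustration and may differ from the actual implementation.

```r
n_rows <- 64                       # rows reported in the log above
alphas <- seq(0, 0.1, by = 0.05)   # candidate values seen in the log

# Approximate number of trimmed cases per alpha (assumption: rounded down)
approx_trimmed <- floor(alphas * n_rows)

alpha_grid <- data.frame(alpha = alphas, approx_trimmed = approx_trimmed)
print(alpha_grid)
```

With only 64 samples, even small alpha values remove a noticeable share of the data.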

# Metrics relevancy <a class="anchor" id="metrics_relevancy"></a>
To compute the metrics relevancy table, we first need the optimal $k$ value of the dataset and the L<sub>1</sub> boundary.

## Relevancy table <a class="anchor" id="relevancy_table"></a>
From the previous analysis we know that the optimal $k$ value is $k=3$ and that $L_{1}=9$. With these values we have everything we need to get the relevancy table of the metrics.

**Note**: The first column, `Description`, must be removed before calling `evaluomeR::getMetricsRelevancy`, which is why we use `nci60_metrics` instead of the `nci60` dataframe.

In [7]:
nci60_metrics = nci60
nci60_metrics["Description"] = NULL
nci60_relevancy = getMetricsRelevancy(nci60_metrics, alpha=alpha, k=optimal_k, L1=L1, seed=seed)
relevancy_table = nci60_relevancy$relevancy
head(relevancy_table, 10)

[1] "Alpha set as: 0.05"
[1] "L1 set as: 9"


| | metric | weight |
|---|---|---|
| 256 | 256 | 0.4507365 |
| 257 | 257 | 0.3533068 |
| 252 | 252 | 0.3330739 |
| 243 | 243 | 0.217398 |
| 248 | 248 | 0.2101845 |
| 196 | 196 | 0.2008401 |
| 286 | 286 | 0.1969767 |
| 251 | 251 | 0.1754905 |
| 267 | 267 | 0.1570889 |
| 281 | 281 | 0.1553637 |
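
One possible way to use the table is to keep only the metrics whose weight exceeds a cut-off. The sketch below rebuilds the ten rows shown above as a standalone data frame and applies a hypothetical threshold of 0.2. Both the threshold and the names `relevancy_top10`, `weight_cutoff` and `selected` are assumptions for illustration.

```r
# The ten rows printed above, rebuilt as a standalone data frame
relevancy_top10 <- data.frame(
  metric = c("256", "257", "252", "243", "248",
             "196", "286", "251", "267", "281"),
  weight = c(0.4507365, 0.3533068, 0.3330739, 0.217398, 0.2101845,
             0.2008401, 0.1969767, 0.1754905, 0.1570889, 0.1553637),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off: keep metrics with weight above 0.2
weight_cutoff <- 0.2
selected <- subset(relevancy_top10, weight > weight_cutoff)

print(selected$metric)
```

If the `metric` values match the column names of `nci60_metrics`, an expression such as `nci60_metrics[, selected$metric]` would then restrict the dataset to the selected variables.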
