# evaluomeR - RSKC - metric relevancy

In [38]:
library("evaluomeR")
library("sparcl")
#library("psych")
#library("scales")
#library("caret")

options(scipen=10)

# Table of contents
* [Dataset](#dataset)
* [evaluomeR CBIs](#cbis)
* [Metrics relevancy](#metrics_relevancy)
    * [Optimal K value](#optimal_k)
    * [Figuring out the L<sub>1</sub> upper bound](#l1_boundry)
    * [Trimmed cases](#rskc_trimmed)
    * [Relevancy table](#rskc_relevancy_table)
* [References](#references)

# Dataset <a class="anchor" id="dataset"></a>

In [23]:
seed = 13606
set.seed(seed)

agro_df = read.csv(paste0(getwd(), "/","data/agro.csv"), header=TRUE, stringsAsFactors=FALSE)
head(agro_df)

Description,ANOnto,AROnto,CBOOnto,CBOOnto2,CROnto,DITOnto,INROnto,LCOMOnto,NACOnto,NOCOnto,NOMOnto,POnto,PROnto,RFCOnto,RROnto,TMOnto,TMOnto2,WMCOnto,WMCOnto2
ADO,0.0,3.9503849,0.9991446,0.9991446,0.9957228,3,0.9991446,1.999142,1.0,292.0,2.9632164,0.9957228,0.7478411,3.962361,0.252158895,0.0,0.0,1.999142,1.0
AEO,0.9298246,0.5438596,0.9824561,0.9824561,0.0,5,0.9824561,2.357143,1.0,3.733333,0.9824561,0.5789474,0.5,1.9649123,0.5,0.0,0.0,2.357143,1.0
AFO,0.75,0.0,0.875,0.875,3998.875,3,0.875,1.333333,1.0,3.5,2275.75,0.25,0.9996157,2276.625,0.000384341,0.0,0.0,1.333333,1.0
AGRO,0.9907407,3.1018519,1.0694444,1.0694444,0.3634259,16,1.0694444,7.695971,1.052174,2.287129,1.2037037,1.0555556,0.5295316,2.2731481,0.470468432,0.06264501,2.148148,9.134783,1.186957
AGRORDF,1.2362637,0.0,1.0659341,1.0659341,0.0,6,1.032967,2.467532,1.077465,4.7,0.5879121,0.8571429,0.3627119,1.6538462,0.637288136,0.0718232,2.0,2.676056,1.084507
ANAEETHES,0.0,0.0,0.6666667,0.6666667,1107.666667,2,0.6666667,1.0,1.0,2.0,0.0,0.0,0.0,0.6666667,1.0,0.0,0.0,1.0,1.0


# evaluomeR CBIs <a class="anchor" id="cbis"></a>
Available clusterboot interfaces (CBIs), i.e. clustering methods, in evaluomeR:

In [21]:
cat(paste(shQuote(evaluomeRSupportedCBI(), type="cmd"), collapse=", "))

"kmeans", "clara", "clara_pam", "hclust", "pamk", "pamk_pam", "rskc"

# Metrics relevancy <a class="anchor" id="metrics_relevancy"></a>
We need the optimal K value of the dataset and the L<sub>1</sub> upper bound in order to compute the metric relevancy table.

## Optimal K value <a class="anchor" id="optimal_k"></a>
We use *evaluomeR* with the *kmeans* CBI to determine the optimal $k$ value. The analysis considers $k$ in the range [3,6]; $k=2$ is excluded to avoid a binary classification.

In [45]:
k.range=c(3,6)
cbi = "kmeans"

stab_range = stabilityRange(data=agro_df, k.range=k.range, 
                            bs=100, seed=seed,
                            all_metrics=TRUE,
                            cbi=cbi)
stab = standardizeStabilityData(stab_range)

# Qual
qual_range = qualityRange(data=agro_df, k.range=k.range, 
                            all_metrics=TRUE, seed=seed,
                            cbi=cbi)
qual = standardizeQualityData(qual_range)

# K opt
k_opt = getOptimalKValue(stab_range, qual_range, k.range= k.range)
optimal_k = as.numeric(k_opt$Global_optimal_k)


Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6
Processing metric: all_metrics

	Maximum stability and quality values matches the same K value: '3'



We are going to use $k=3$ as the optimal K value.
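
As a rough illustration of the agreement check reported in the log above, the sketch below picks the $k$ with the highest score from a stability table and a quality table and tests whether both point to the same value. The scores and the helper `pick_k` are made up for illustration; this is not a description of how `getOptimalKValue` works internally.

```r
# Hypothetical per-k scores (NOT taken from the run above).
stab_scores <- c(k_3 = 0.80, k_4 = 0.71, k_5 = 0.66, k_6 = 0.69)
qual_scores <- c(k_3 = 0.95, k_4 = 0.93, k_5 = 0.90, k_6 = 0.41)

# Hypothetical helper: return the k whose score is highest,
# parsed from column names such as "k_3".
pick_k <- function(scores) {
  best <- names(scores)[which.max(scores)]
  as.integer(sub("^k_", "", best))
}

k_stab <- pick_k(stab_scores)
k_qual <- pick_k(qual_scores)
same_k <- k_stab == k_qual

cat("stability k:", k_stab, "| quality k:", k_qual, "| agree:", same_k, "\n")
```

When the two criteria disagree, some tie-breaking rule is needed; the sketch deliberately leaves that case out.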

## Figuring out the L<sub>1</sub> upper bound <a class="anchor" id="l1_boundry"></a>

The 'sparcl' R package provides an algorithm for tuning the L<sub>1</sub> parameter, among others.
Since the global optimal $k$ for this dataset is $k=3$, we can now run the permutation approach of 'KMeansSparseCluster.permute' from 'sparcl' to select the L<sub>1</sub> upper bound.

Note: 1 $<$ L<sub>1</sub> $\leq$ $\sqrt{num.variables}$.
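
The range in the note can be checked numerically. The sketch below assumes the sparse k-means convention in which the metric weights are non-negative and have Euclidean (L<sub>2</sub>) norm at most 1. Under that assumption, a unit-length weight vector has L<sub>1</sub> norm 1 when all weight sits on one metric and $\sqrt{p}$ when all $p$ metrics are weighted equally. The helper names are illustrative only.

```r
p <- 19  # numeric metrics in the dataset (20 columns minus 'Description')

l1_norm <- function(w) sum(abs(w))
l2_norm <- function(w) sqrt(sum(w^2))

# Sparsest unit-length weight vector: all weight on a single metric.
w_sparse <- c(1, rep(0, p - 1))
# Densest unit-length weight vector: equal weight on every metric.
w_dense <- rep(1 / sqrt(p), p)

cat("sparse: L2 =", l2_norm(w_sparse), " L1 =", l1_norm(w_sparse), "\n")
cat("dense:  L2 =", l2_norm(w_dense), " L1 =", l1_norm(w_dense), "\n")
cat("sqrt(p) =", sqrt(p), "\n")

# Candidate grid with the same end points as the one used in the permutation search.
wbounds <- seq(2, sqrt(p), length.out = 30)
cat("grid from", min(wbounds), "to", max(wbounds), "\n")
```

A bound close to 1 therefore forces the weights onto very few metrics, while a bound of $\sqrt{p}$ does not restrict them at all.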

In [50]:
data = agro_df[-1] # Removing 'Description' column as it is not numeric
wbounds = seq(2,sqrt(ncol(data)), len=30)
km.perm <- sparcl::KMeansSparseCluster.permute(data, K=optimal_k, wbounds=wbounds, nperms=5, silent=TRUE)
L1 = km.perm$bestw
cat(paste0("Best L1 for k=", optimal_k, " is: ", L1))

Best L1 for k=3 is: 2

The best L<sub>1</sub> upper bound for $k=3$ is 2.

In [6]:
# Container for one RSKC run: standardized stability and quality tables,
# their means, and the alpha used.
rskc_run <- function(stab, qual, alpha) {
  structure(list(stab = stab, mean_stab = mean(as.double(stab)),
                 qual = qual, mean_qual = mean(as.double(qual)),
                 alpha = alpha), class = "rskc_run")
}
run_list = list()

In [7]:
alpha_values = seq(0, 0.25, 0.05)
index = 1
# Run the stability and quality analyses with the "rskc" CBI for each alpha
# (proportion of cases to trim) and store the standardized results.
for (alpha in alpha_values) {
    stab = stabilityRange(data=agro_df, k.range=k.range, 
                                bs=100, seed=seed,
                                all_metrics=TRUE,
                                cbi="rskc", L1=2, alpha=alpha)
    stab_table = standardizeStabilityData(stab)
    
    qual = qualityRange(data=agro_df, k.range=k.range, 
                                seed=seed,
                                all_metrics=TRUE,
                                cbi="rskc", L1=2, alpha=alpha)
    qual_table = standardizeQualityData(qual)
    
    run_list[[index]] <- rskc_run(stab = stab_table, qual = qual_table, alpha=alpha)
    index = index + 1
}


Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 2
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 2
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 2


alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
Step (a), (a2) and (b) are repeated over the maximum number of iterations. The algorithm might not converge.alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
Step (a), (a2) and (b) are repeated over the maximum number of iterations. The algorithm might not converge.alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
alpha is too large for n 
Step (a), (a2) and (b) are repeated over the maximum number of iterations. The alg

	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 2
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 2
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 2
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number of columns: 20


Processing all metrics, 'merge', in dataframe (19)
	Calculation of k = 2
	Calculation of k = 3
	Calculation of k = 4
	Calculation of k = 5
	Calculation of k = 6

Data loaded.
Number of rows: 78
Number 

In [8]:
# Return the alpha values of the runs with the highest mean stability
# and the highest mean quality.
getBestRun <- function(run_list) {
    best_stab = -Inf
    best_qual = -Inf
    best_alpha_stab = 0
    best_alpha_qual = 0
    for (run in run_list) {
        if (run$mean_stab > best_stab) {
            best_stab = run$mean_stab
            best_alpha_stab = run$alpha
        }
        if (run$mean_qual > best_qual) {
            best_qual = run$mean_qual
            best_alpha_qual = run$alpha
        }
    }

    return(
        list("best_alpha_stab" = best_alpha_stab,
                "best_alpha_qual" = best_alpha_qual)
          )
}
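
The selection step can also be written without an explicit loop. The sketch below is a minimal, self-contained variant that runs on toy data: the mean scores are made up, the helper `best_alpha_by` is hypothetical, and only the field names (`alpha`, `mean_stab`, `mean_qual`) mirror the `rskc_run` objects built earlier.

```r
# Toy runs with made-up mean scores; only the field names mirror 'rskc_run'.
toy_runs <- list(
  list(alpha = 0.00, mean_stab = 0.75, mean_qual = 0.78),
  list(alpha = 0.05, mean_stab = 0.81, mean_qual = 0.74),
  list(alpha = 0.10, mean_stab = 0.79, mean_qual = 0.83)
)

# Hypothetical helper: alpha of the run that maximises the given field.
best_alpha_by <- function(runs, field) {
  scores <- vapply(runs, function(run) run[[field]], numeric(1))
  runs[[which.max(scores)]]$alpha
}

best_alpha_stab <- best_alpha_by(toy_runs, "mean_stab")
best_alpha_qual <- best_alpha_by(toy_runs, "mean_qual")

cat("best alpha (stability):", best_alpha_stab, "\n")
cat("best alpha (quality):  ", best_alpha_qual, "\n")
```

On ties, `which.max` returns the first maximum, which matches the behaviour of a loop that only updates on a strictly greater score.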

In [9]:
getBestRun(run_list)

In [10]:
run_list[[1]]

$stab
                 k_2      k_3       k_4       k_5       k_6
all_metrics 0.922938 0.763779 0.6835521 0.6574005 0.7228951

$mean_stab
[1] 0.750113

$qual
                  k_2       k_3       k_4       k_5        k_6
all_metrics 0.9765144 0.9652177 0.9491769 0.9192709 0.09758856

$mean_qual
[1] 0.7815537

$alpha
[1] 0

attr(,"class")
[1] "rskc_run"

In [20]:
#RSKC(data, 3, alpha=0, L1 = 2)
# Relevancy of metrics according to the optimal K, L1 and alpha
agroRelevancy = getMetricsRelevancy(agro_df[-1], alpha=0, k=3, L1=2, seed=seed)
agroRelevancyMetrics = agroRelevancy$relevancy
agroRelevancyMetrics

[1] "Alpha set as: 0"
[1] "L1 set as: 2"


row,metric,weight
5,CROnto,0.9999948528396668
11,NOMOnto,0.0022688311929322
14,RFCOnto,0.0022686337687014
10,NOCOnto,9.3242941e-08
18,WMCOnto,6.04502189e-08
6,DITOnto,4.79189945e-08
8,LCOMOnto,1.06510398e-08
2,AROnto,1.0003509e-09
17,TMOnto2,9.451035e-10
12,POnto,3.933273e-10
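
One way to read the table is to look at the norms of the weight vector and at each metric's share of the squared weight. The sketch below uses only the ten rows shown above; the `threshold` is an arbitrary cut-off chosen for illustration and is not something evaluomeR or RSKC prescribes.

```r
# Weights copied from the relevancy table above (the ten rows shown).
weights <- c(
  CROnto   = 0.9999948528396668,
  NOMOnto  = 0.0022688311929322,
  RFCOnto  = 0.0022686337687014,
  NOCOnto  = 9.3242941e-08,
  WMCOnto  = 6.04502189e-08,
  DITOnto  = 4.79189945e-08,
  LCOMOnto = 1.06510398e-08,
  AROnto   = 1.0003509e-09,
  TMOnto2  = 9.451035e-10,
  POnto    = 3.933273e-10
)

l1_norm <- sum(abs(weights))          # to compare against the L1 bound used (2)
l2_norm <- sqrt(sum(weights^2))       # Euclidean length of the weight vector
share   <- weights^2 / sum(weights^2) # share of squared weight per metric

threshold <- 1e-3  # arbitrary cut-off, for illustration only
relevant <- names(weights)[weights > threshold]

cat("L1 norm:", l1_norm, "\n")
cat("L2 norm:", l2_norm, "\n")
cat("metrics above threshold:", paste(relevant, collapse = ", "), "\n")
print(round(share[relevant], 6))
```

For these values almost all of the squared weight falls on `CROnto`, so the clustering with this configuration is driven largely by that single metric.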
