# NCI60 use case

In [None]:
library("ISLR") 
library("evaluomeR")
library("dplyr")
library("caret")
library("MLmetrics")
library("plotly")
library("reshape2")

options(scipen=10)

In [None]:
packageVersion("evaluomeR")

# Table of contents
* [Dataset](#dataset)
    * [Removing highly correlated metrics](#correlated)
    * [Top 200](#top)
* [evaluomeR - optimal $k$ analysis](#evaluomer)
    * [Stability plotting](#evaluomeR_stab_plot)
    * [Quality plotting](#evaluomeR_qual_plot)
* [PCA](#pca)
* [Sensitivity](#sensitivity)
* [CER](#cer)

# Dataset <a class="anchor" id="dataset"></a>

In [None]:
nci60 = as.data.frame(NCI60$data)
head(nci60)

There are 14 types of classes within the dataset: **CNS**, **RENAL**, **BREAST**, **NSCLC**, **UNKNOWN**, **OVARIAN**, **MELANOMA**, **PROSTATE**, **LEUKEMIA**, **K562B-repro**, **K562A-repro**, **COLON**, **MCF7A-repro** and **MCF7D-repro**:

In [None]:
as.vector(unlist(unique(NCI60$labs)))

Here, we prepare the NCI60 dataset for the analysis:

- We add a column named `Description` containing the class (category) of each row
- Due to their small class sizes, we remove the two prostate cell lines and the unknown cell line (the "PROSTATE" and "UNKNOWN" entries, respectively).

In [None]:
nci60["labels"] = rownames(nci60)
nci60 = nci60[ , c("labels", names(nci60)[names(nci60) != "labels"])]
nci60["labels"] = NCI60$labs
colnames(nci60)[colnames(nci60) == 'labels'] <- 'Description'
nci60 = nci60[!grepl("UNKNOWN", nci60$Description),] # Remove UNKNOWN
nci60 = nci60[!grepl("PROSTATE", nci60$Description),] # Remove PROSTATE
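
The two `grepl` filters above match by pattern. As a minimal sketch, the same kind of filtering can also be written with exact matching via `%in%`, which only drops labels that are identical to the excluded ones. The vector `labs_toy` is made up for illustration and is not taken from NCI60.

```r
# Hypothetical label vector (not the real NCI60 labels)
labs_toy = c("CNS", "PROSTATE", "UNKNOWN", "RENAL", "PROSTATE")
excluded = c("UNKNOWN", "PROSTATE")

# Exact matching: keep only labels that are not in the excluded set
keep = !(labs_toy %in% excluded)
labs_kept = labs_toy[keep]

# Class sizes after filtering
print(table(labs_kept))
```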

## Removing highly correlated metrics <a class="anchor" id="correlated"></a>
We address multicollinearity by identifying and removing perfectly correlated metrics (absolute correlation of 1) from our dataset. First, we exclude the first column of `nci60`, which holds `Description`. We then compute the correlation matrix `R` of the remaining data using the `cor` function. To pinpoint the metrics that exhibit perfect correlation (correlation coefficient of 1 or -1), we use the `findCorrelation` function from the `caret` package with a cutoff of 1. With `names=TRUE`, this function returns the names of the variables it flags for removal, if any.

In [None]:
data = nci60[-1]
R = cor(data)
head(R)

In [None]:
cor_metrics = findCorrelation(R, cutoff = 1, verbose = FALSE, names=TRUE)
length(cor_metrics)

Finally, we use `length(cor_metrics)` to count the flagged metrics. As this number is 0, we conclude that no metrics need to be removed.
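
Since the outcome depends on how the cutoff is compared inside `findCorrelation`, a direct check on the correlation matrix can serve as a cross-check. The following is a minimal base R sketch on made-up data; the data frame `toy` and the tolerance `tol` are assumptions for illustration, not part of the original analysis.

```r
# Toy data: column c is an exact linear function of column a (hypothetical example)
set.seed(1)
toy = data.frame(a = rnorm(20), b = rnorm(20))
toy$c = 2 * toy$a + 1

R_toy = cor(toy)

# Keep only the upper triangle so that each pair is reported once
R_upper = R_toy
R_upper[lower.tri(R_upper, diag = TRUE)] = NA

# Assumed tolerance to absorb floating-point error around |r| = 1
tol = 1e-8
perfect = which(abs(R_upper) >= 1 - tol, arr.ind = TRUE)

perfect_pairs = data.frame(var1 = rownames(R_toy)[perfect[, "row"]],
                           var2 = colnames(R_toy)[perfect[, "col"]])
print(perfect_pairs)
```

Applied to the notebook's own matrix `R`, an empty `perfect_pairs` would agree with the result above.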

## Top 200 <a class="anchor" id="top"></a>

We now keep only the 200 metrics with the greatest variance, as these metrics have the most significant impact on clustering.

In [None]:
variance = sort(sapply(nci60[-1], var), decreasing = TRUE)  # Sorted gene variance
nci60_var = as.data.frame(variance)
nci60_var["Description"] = rownames(nci60_var)

In [None]:
top_number = 200
top_rows = nci60_var[c(1:top_number), ]
head(top_rows)

In [None]:
# Names of the 200 highest-variance genes
top_row_list = top_rows$Description
top_nci60 = nci60[, top_row_list]
# Rows are unchanged, so the class labels align one-to-one
top_nci60["Description"] = nci60$Description
top_nci60 = top_nci60[ , c("Description", names(top_nci60)[names(top_nci60) != "Description"])] 

The data frame `top_nci60` contains the 200 genes (metrics) with the highest variance.

In [None]:
head(top_nci60)
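
The variance filter can be illustrated on a small made-up dataset. This is only a sketch: the data frame `toy_genes` and its column names are hypothetical, and we keep the top 2 columns instead of 200.

```r
# Hypothetical expression-like data with clearly different spreads per column
set.seed(1)
toy_genes = data.frame(g1 = rnorm(30, sd = 1),
                       g2 = rnorm(30, sd = 5),
                       g3 = rnorm(30, sd = 0.1),
                       g4 = rnorm(30, sd = 3))

# Per-column variance, sorted from highest to lowest
gene_var = sort(sapply(toy_genes, var), decreasing = TRUE)

# Keep the two columns with the highest variance
top_genes = names(gene_var)[1:2]
toy_top = toy_genes[, top_genes, drop = FALSE]
print(top_genes)
```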

# evaluomeR - optimal $k$ analysis <a class="anchor" id="evaluomer"></a>
In this section, evaluomeR performs an optimal $k$ analysis. First, stabilities and qualities are calculated considering all the metrics in `top_nci60` at once (`all_metrics=TRUE`). The $k$ range is $k \in [3,10]$ and the clustering method is `kmeans`.

In [None]:
seed = 13606
k.range=c(3,10)
cbi = "kmeans"

Stability calculation with $k \in [3,10]$ and `kmeans`:

In [None]:
stab_range = stabilityRange(data=top_nci60, k.range=k.range, 
                            bs=100, seed=seed,
                            all_metrics=TRUE,
                            cbi=cbi)
stab = standardizeStabilityData(stab_range)
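
To give an intuition for what a bootstrap-based stability score measures, the sketch below resamples a toy dataset, re-runs `kmeans`, and averages the best Jaccard overlap between the reference clusters and the resampled ones. It is an illustration under our own assumptions and not the implementation used by `stabilityRange`; the helper names `jaccard` and `bootstrap_stability` are hypothetical.

```r
set.seed(13606)

# Two well-separated toy groups (hypothetical data)
x = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
          matrix(rnorm(40, mean = 6), ncol = 2))

# Jaccard similarity between two sets of row indices
jaccard = function(a, b) length(intersect(a, b)) / length(union(a, b))

bootstrap_stability = function(x, k, bs = 20) {
  # Reference clustering on the full data
  ref = kmeans(x, centers = k, nstart = 10)$cluster
  scores = replicate(bs, {
    # Bootstrap resample, reduced to its distinct rows
    idx = unique(sample(nrow(x), replace = TRUE))
    boot = kmeans(x[idx, , drop = FALSE], centers = k, nstart = 10)$cluster
    # For each reference cluster, take its best match among the resampled clusters
    mean(sapply(seq_len(k), function(cl) {
      ref_members = intersect(which(ref == cl), idx)
      max(sapply(seq_len(k), function(bc) jaccard(ref_members, idx[boot == bc])))
    }))
  })
  mean(scores)
}

stab_toy = bootstrap_stability(x, k = 2)
print(stab_toy)
```

Values close to 1 indicate that the same groups are recovered across resamples.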

## Stability plotting <a class="anchor" id="evaluomeR_stab_plot"></a>

Stability plot

In [None]:
rownames(stab) = c("stab_kmeans")
stab$Metric = rownames(stab)
stab$Method = "kmeans"
stab_melt = melt(stab, id.vars = c("Metric", "Method"))

In [None]:
# Color
plot_ly(stab_melt, x=~variable, y=~value, colors = "Set1", color= ~Method,
              type = 'scatter', mode = 'dot') %>%
  layout(
    title = paste0('Stability k in [', k.range[1], ",", k.range[2],']'),
    xaxis = list(title = 'k', range=c(0, 7)),
    yaxis = list(title = 'Stability', range=c(0,1)),
    shapes = list(
        list(type = "rect", fillcolor = "green", line = list(color = "green"), opacity = 0.1,
            y0 = 0.85, y1 = 1, x0 = 0, x1 = 7, layer="below"),
        list(type = "rect", fillcolor = "blue", line = list(color = "blue"), opacity = 0.1,
            y0 = 0.75, y1 = 0.85, x0 = 0, x1 = 7, layer="below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.1,
            y0 = 0.6, y1 = 0.75, x0 = 0, x1 = 7, layer="below"),
        list(type = "rect", fillcolor = "red", line = list(color = "red"), opacity = 0.1,
            y0 = 0, y1 = 0.6, x0 = 0, x1 = 7, layer="below")
    )
  )

In [None]:
# Grayscale
grayscale_colors <- c("black")

plot_ly(stab_melt, x = ~variable, y = ~value, colors = grayscale_colors, color = ~Method,
              type = 'scatter', mode = 'dot') %>%
  layout(
    title = paste0('Stability k in [', k.range[1], ",", k.range[2],']'),
    xaxis = list(title = 'k', range = c(0, 7)),
    yaxis = list(title = 'Stability', range = c(0, 1)),
    shapes = list(
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.05,
            y0 = 0.85, y1 = 1, x0 = 0, x1 = 7, layer = "below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.15,
            y0 = 0.75, y1 = 0.85, x0 = 0, x1 = 7, layer = "below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.30,
            y0 = 0.6, y1 = 0.75, x0 = 0, x1 = 7, layer = "below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.40,
            y0 = 0, y1 = 0.6, x0 = 0, x1 = 7, layer = "below")
    )
  )
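
The shaded bands in the plots above split the stability axis at 0.6, 0.75 and 0.85. As a small sketch, the same thresholds can be applied to numeric values with `cut`. The helper `stability_band` is hypothetical, and its labels simply reuse the fill colors of the color plot.

```r
# Assign each stability value to the band it falls in.
# Band labels reuse the fill colors from the color plot (red, gray, blue, green).
stability_band = function(value) {
  cut(value,
      breaks = c(0, 0.6, 0.75, 0.85, 1),
      labels = c("red", "gray", "blue", "green"),
      include.lowest = TRUE)  # a value exactly on a threshold goes to the lower band
}

bands = stability_band(c(0.5, 0.7, 0.8, 0.9))
print(bands)
```

Assuming `stab_melt$value` is numeric, `stability_band(stab_melt$value)` would give the band for each $k$.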

Quality calculation with $k \in [3,10]$ and `kmeans`.

In [None]:
qual_range = qualityRange(data=top_nci60, k.range=k.range, 
                            seed=seed,
                            all_metrics=TRUE,
                            cbi=cbi)
qual = standardizeQualityData(qual_range)
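
As an illustration of a clustering quality score, the sketch below computes the average silhouette width of a `kmeans` partition on toy data with the `cluster` package. We assume here that a silhouette-style measure is a reasonable stand-in for the quality concept; the sketch is not the implementation of `qualityRange`, and the data are made up.

```r
library(cluster)

set.seed(13606)

# Two well-separated toy groups (hypothetical data)
x = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
          matrix(rnorm(40, mean = 6), ncol = 2))

km = kmeans(x, centers = 2, nstart = 10)

# Silhouette width per observation, from the cluster labels and pairwise distances
sil = silhouette(km$cluster, dist(x))

# Average silhouette width over all observations
avg_sil = mean(sil[, "sil_width"])
print(avg_sil)
```

Silhouette widths range from -1 to 1, with higher values indicating more compact and better separated clusters.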

## Quality plotting <a class="anchor" id="evaluomeR_qual_plot"></a>

Quality plot

In [None]:
rownames(qual) = c("qual_kmeans")
qual$Metric = rownames(qual)
qual$Method = "kmeans"
qual_melt = melt(qual, id.vars = c("Metric", "Method"))

In [None]:
# Color
plot_ly(qual_melt, x=~variable, y=~value, colors = "Set1", color= ~Method,
              type = 'scatter', mode = 'dot') %>%
  layout(
    title = paste0('Quality k in [', k.range[1], ",", k.range[2],']'),
    xaxis = list(title = 'k', range=c(0, 7)),
    yaxis = list(title = 'Quality', range=c(0,1)),
    shapes = list(
        list(type = "rect", fillcolor = "green", line = list(color = "green"), opacity = 0.1,
            y0 = 0.85, y1 = 1, x0 = 0, x1 = 7, layer="below"),
        list(type = "rect", fillcolor = "blue", line = list(color = "blue"), opacity = 0.1,
            y0 = 0.75, y1 = 0.85, x0 = 0, x1 = 7, layer="below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.1,
            y0 = 0.6, y1 = 0.75, x0 = 0, x1 = 7, layer="below"),
        list(type = "rect", fillcolor = "red", line = list(color = "red"), opacity = 0.1,
            y0 = 0, y1 = 0.6, x0 = 0, x1 = 7, layer="below")
    )
  )

In [None]:
# Grayscale
grayscale_colors <- c("black")

plot_ly(qual_melt, x = ~variable, y = ~value, colors = grayscale_colors, color = ~Method,
              type = 'scatter', mode = 'dot') %>%
  layout(
    title = paste0('Quality k in [', k.range[1], ",", k.range[2],']'),
    xaxis = list(title = 'k', range = c(0, 7)),
    yaxis = list(title = 'Quality', range = c(0, 1)),
    shapes = list(
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.05,
            y0 = 0.85, y1 = 1, x0 = 0, x1 = 7, layer = "below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.15,
            y0 = 0.75, y1 = 0.85, x0 = 0, x1 = 7, layer = "below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.30,
            y0 = 0.6, y1 = 0.75, x0 = 0, x1 = 7, layer = "below"),
        list(type = "rect", fillcolor = "gray", line = list(color = "gray"), opacity = 0.40,
            y0 = 0, y1 = 0.6, x0 = 0, x1 = 7, layer = "below")
    )
  )

Determining the optimal $k$ given the stabilities and qualities in `stab_range` and `qual_range` objects:

In [None]:
k_opt = getOptimalKValue(stab_range, qual_range, k.range= k.range)
optimal_k = k_opt$Global_optimal_k
optimal_k_str = paste0("k_", optimal_k)
print(paste0("Optimal k: ", optimal_k))

In [None]:
print(paste0("Stabilities and qualities per k with '", cbi, "' as clustering method"))
stab
qual
print(paste0("Stability in k=", optimal_k,": ", stab[optimal_k_str]))
print(paste0("Quality in k=", optimal_k,": ", qual[optimal_k_str]))

# Clusters

In [None]:
# Internal method used to group individuals per cluster
individuals_per_cluster = function(qualityResult) {
  qual_df = as.data.frame(assay(qualityResult))


  cluster_pos_str = as.character(unlist(qual_df["Cluster_position"]))
  cluster_labels_str = as.character(unlist(qual_df["Cluster_labels"]))

  cluster_pos = as.list(strsplit(cluster_pos_str, ",")[[1]])
  cluster_labels = as.list(strsplit(cluster_labels_str, ",")[[1]])

  individuals_in_cluster = as.data.frame(cbind(cluster_labels, cluster_pos))
  colnames(individuals_in_cluster) = c("Individual", "InCluster")

  return(individuals_in_cluster)
}

In [None]:
cluster_individuals = individuals_per_cluster(assay(qual_range[optimal_k_str]))
for (cluster_i in 1:optimal_k) {
    ind_in_cluster = paste(unlist(cluster_individuals[cluster_individuals$InCluster == cluster_i, ]["Individual"]), collapse = ",")
    print(paste("Cluster", cluster_i, ":", ind_in_cluster))
    print("---")
}
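
The helper above relies on splitting two comma-separated strings and pairing their elements. The following sketch shows that step in isolation on made-up strings; `labels_str` and `pos_str` are hypothetical stand-ins for the `Cluster_labels` and `Cluster_position` fields.

```r
# Hypothetical comma-separated fields, in the same order for both strings
labels_str = "A,B,C,D"
pos_str = "1,2,1,2"

# Split both strings and pair each individual with its cluster number
toy_clusters = data.frame(
  Individual = strsplit(labels_str, ",")[[1]],
  InCluster = as.integer(strsplit(pos_str, ",")[[1]]),
  stringsAsFactors = FALSE
)

# Group the individuals by cluster
members = split(toy_clusters$Individual, toy_clusters$InCluster)
print(members)
```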

# PCA <a class="anchor" id="pca"></a>
We employ Principal Component Analysis (PCA) as a dimensionality reduction technique to facilitate the visualization of clusters within our dataset. PCA allows us to transform the original high-dimensional data into a lower-dimensional space while preserving as much of the variability as possible.

In [None]:
top_nci60["inCluster"] = as.numeric(cluster_individuals$InCluster)
pca_matrix = top_nci60 %>% select(-Description, -inCluster)
pca_result <- prcomp(pca_matrix, scale. = TRUE)
pca_df <- data.frame(pca_result$x)
pca_df$Cluster <- as.factor(top_nci60$inCluster)
pca_df$Individual <- top_nci60$Description
head(pca_df)

In [None]:
# Plot of the clusters, color
ggplot2::ggplot(pca_df, ggplot2::aes(x = PC1, y = PC2, color = Cluster, label = Individual)) +
  ggplot2::geom_point(size = 3) +
  ggplot2::geom_text(vjust = 1, hjust = 1) +
  ggplot2::labs(title = "PCA of Features",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  ggplot2::theme_minimal()

In [None]:
# Plot of the clusters, grayscale
ggplot(pca_df, aes(x = PC1, y = PC2, shape = Cluster, color = Cluster, label = Individual)) +
  geom_point(size = 3) + # Point size
  geom_text(vjust = 1, hjust = 1, color = "black") + # Black color for text
  labs(title = "PCA of Features",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal() +
  scale_shape_manual(values = c(16, 17, 15, 18, 19, 20, 21, 22, 23, 24)) + # Different point styles
  scale_color_manual(values = c("gray20", "gray35", "gray50", "gray65", "gray80", 
                                "black", "gray30", "gray45", "gray60", "gray75")) + # Different grayscale colors
  theme(legend.position = "right")
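
The plots above show only the first two components, so it helps to know how much of the variability they retain. The sketch below computes the explained variance ratio from a `prcomp` result, using the built-in `USArrests` dataset as a stand-in (an assumption for illustration; it is unrelated to NCI60).

```r
# PCA on a built-in dataset, scaled as in the analysis above
pca_toy = prcomp(USArrests, scale. = TRUE)

# Variance explained by each component: squared standard deviations, normalized
explained = pca_toy$sdev^2 / sum(pca_toy$sdev^2)

# Cumulative share of variance retained by the first components
cumulative = cumsum(explained)
print(round(cumulative, 3))
```

The same two lines applied to `pca_result$sdev` would give the share of variance captured by PC1 and PC2 in the plots.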

# Sensitivity <a class="anchor" id="sensitivity"></a>
In this section, we evaluate the sensitivity of our clustering using the `MLmetrics::Sensitivity` function. Sensitivity, or the true positive rate, is the fraction of instances of a class that are correctly identified, TP / (TP + FN). A high sensitivity therefore means few false negatives, that is, few individuals assigned to a cluster other than the one mapped to their class.

In [None]:
top_nci60["Class"] = top_nci60["Description"]
head(top_nci60)[, c("Description", "Class")]

In [None]:
# KMEANS
# k=8
level_mapping <- c("NSCLC" = 1, "CNS" = 2, "BREAST" = 3,
                 "MCF7A-repro" = 3, "MCF7D-repro" = 3, "RENAL" = 4, 
                 "LEUKEMIA" = 5, "K562B-repro" = 5, "K562A-repro" = 5, 
                 "MELANOMA" = 6,  "COLON" = 7, "OVARIAN" = 8
                  )
map_strings_to_numbers <- function(strings) {
    return(as.numeric(level_mapping[strings]))
}
# Map categories with cluster number
top_nci60["Class_n"] = lapply(top_nci60["Class"], map_strings_to_numbers)
# Table of prediction vs actual classification
head(top_nci60)[, c("Description", "Class", "inCluster", "Class_n")]

In [None]:
# Getting a vector of prediction vs actual classification
actual = as.factor(as.vector(unlist(top_nci60["Class_n"])))
predicted <- factor(as.vector(unlist(top_nci60["inCluster"])))

print("actual")
actual
print("predicted")
predicted

In [None]:
sens <- MLmetrics::Sensitivity(y_pred = predicted, y_true = actual)
sens = format(round(sens*100, 2), nsmall = 2)
print(paste0("Sensitivity: ", sens, "%"))
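
With more than two classes, it can be informative to look at sensitivity per class as well. As we understand it, `MLmetrics::Sensitivity` reports the value for a single positive class (by default the first level), so the sketch below derives per-class values directly from a confusion table in base R. The vectors `actual_toy` and `predicted_toy` are made up and are not results from this analysis.

```r
# Hypothetical gold standard and cluster assignments, already mapped to the same labels
actual_toy = factor(c(1, 1, 1, 2, 2, 3, 3, 3), levels = 1:3)
predicted_toy = factor(c(1, 1, 2, 2, 2, 3, 1, 3), levels = 1:3)

# Confusion table: rows are actual classes, columns are predicted clusters
conf = table(actual = actual_toy, predicted = predicted_toy)

# Sensitivity per class: correctly assigned / all members of that class
per_class_sens = diag(conf) / rowSums(conf)

# Unweighted mean over classes (macro average)
macro_sens = mean(per_class_sens)
print(round(per_class_sens, 3))
print(round(macro_sens, 3))
```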

# CER <a class="anchor" id="cer"></a>
To assess the overall accuracy of our clustering, we compute the Classification Error Rate (CER) of the cluster assignments against the gold standard classification. CER represents the proportion of misclassified instances, thus providing a clear measure of how well the clustering assigns individuals to the correct clusters.

In [None]:
cer <- CER(predicted, actual)
cer = format(round(cer*100, 2), nsmall = 2)
print(paste0("CER: ", cer, "%"))
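
The definition stated above, the proportion of misclassified instances, can be written out in one line. The sketch below follows that stated definition on made-up vectors; it is not the implementation of the `CER` function called above, whose exact formula may differ.

```r
# Hypothetical gold standard and cluster assignments, already mapped to the same labels
actual_toy = c(1, 1, 1, 2, 2, 3, 3, 3)
predicted_toy = c(1, 1, 2, 2, 2, 3, 1, 3)

# Proportion of individuals whose assigned cluster differs from their class
error_rate_toy = mean(predicted_toy != actual_toy)
print(paste0("Error rate: ", format(round(error_rate_toy * 100, 2), nsmall = 2), "%"))
```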