# From distances to clouds

The R half of the workflow takes care of the distance matrices and generates the rest of the files that we need for the visualization itself:

- The coordinates to plot the tokens and context words
- The distances between models
- The coordinates to plot the models
- Selection of medoids
- HDBSCAN clustering

**Note:**

I tested everything on the original code as an R script, but only some things with the code as an R package. If anything doesn't work, please let me know (with an [issue](https://github.com/montesmariana/semcloud/issues/new/choose) or by mail).

In [1]:
#devtools::install_github("montesmariana/semcloud")

Downloading GitHub repo montesmariana/semcloud@HEAD

✔  checking for file ‘/tmp/RtmpQgR9MS/remotes8a973752abac/montesmariana-semcloud-a965e48/DESCRIPTION’
─  preparing ‘semcloud’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘semcloud_0.0.0.9000.tar.gz’

Installing package into ‘/home/mariana/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)

In [2]:
library(tidyverse)
library(semcloud)

── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [3]:
input_dir <- "../output/tokens" # where the data is stored
cw_dir <- "../output/cws"
output_dir <- "../github/" # where the data will go

In [4]:
lemmas <- setdiff(dir(input_dir), c('destroy', 'eatdrink'))

## Token coordinates

To compute the token coordinates, we first need to decide which solutions we want: whether to run nMDS and, if we run t-SNE, which perplexity values to use.
We might even want to run UMAP (not available yet in this code).

Although in the end I mostly looked at t-SNE with a perplexity of 30, I will show how to set things up for several options.

In [None]:
# This list works for a loop in the function below and should then be stored as a json file
# in the github directory of each lemma, to tell the visualization what is being used
solutions_old <- list("mds" = ".mds")
for (perp in c(10, 20, 30, 50)) {
    solutions_old[[paste0("tsne", perp)]] = paste0(".tsne.", perp)
}
solutions_old

In [5]:
# For demonstration purposes we only use t-SNE with perplexity 30; the procedure is the same with more solutions
solutions <- list("tsne30" = ".tsne.30")
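
The same kind of named list can also be built without a loop. This is a minimal sketch in base R; the helper name `make_solutions` is hypothetical and not part of `semcloud`. With its default arguments it should reproduce `solutions_old`.

```r
# Hypothetical helper (not part of semcloud): build the named list of
# solutions from an nMDS flag and a vector of t-SNE perplexities.
make_solutions <- function(perplexities = c(10, 20, 30, 50), mds = TRUE) {
    tsne <- setNames(as.list(paste0(".tsne.", perplexities)),
                     paste0("tsne", perplexities))
    if (mds) c(list(mds = ".mds"), tsne) else tsne
}

str(make_solutions())           # nMDS plus four t-SNE solutions
str(make_solutions(30, FALSE))  # only t-SNE with perplexity 30
```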

In [9]:
# If we have many lemmas, we can run this in a loop:
suffix <- ".ttmx.dist.pac"
for (lemma in lemmas) {
    models_file <- file.path(output_dir, lemma, paste0(lemma, '.models.tsv'))
    files_list <- paste0(read_tsv(models_file, col_types = cols())$`_model`, suffix)
    write(rjson::toJSON(solutions), file.path(output_dir, lemma, paste0(lemma, ".solutions.json")))    
    getClouds(file.path(input_dir, lemma), file.path(output_dir, lemma),
          files_list, lemma, solutions)
}
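
For intuition about what a t-SNE solution such as `tsne30` involves, here is a minimal sketch on toy data. It assumes the `Rtsne` package and does not reproduce the internals of `getClouds()`; the toy tokens and the column names are made up for illustration.

```r
library(Rtsne)

set.seed(42)
# Toy stand-in for a token-by-token distance matrix (100 fake tokens)
toy_vectors <- matrix(rnorm(100 * 10), nrow = 100,
                      dimnames = list(paste0("token", 1:100), NULL))
toy_dist <- dist(toy_vectors)

# is_distance = TRUE tells Rtsne that the input already contains distances.
# Rtsne refuses perplexities that are too large for the number of items
# (3 * perplexity must stay below the number of tokens minus 1).
fit <- Rtsne(toy_dist, dims = 2, perplexity = 30, is_distance = TRUE)

coords <- data.frame(
    token = rownames(toy_vectors),
    x = fit$Y[, 1],
    y = fit$Y[, 2]
)
head(coords)
```

The perplexity constraint is one reason to keep several perplexity values in `solutions` only when the lemma has enough tokens for all of them.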



## Context words coordinates
For the context words, the workflow is exactly the same as for the tokens. The difference is that the files are saved as `.csv` (because for some reason R cannot read them when they are `.wwmx...`), so `getClouds()` is called with `type = 'focdists'` and uses the `focdistsFromCsv()` function.
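
As an illustration of reading a square distance matrix from a CSV file, here is a minimal sketch in base R. It works on a temporary file with made-up context words; it is not the implementation of `focdistsFromCsv()`, and the real files may be laid out differently.

```r
# Write a tiny symmetric distance matrix to a temporary CSV file
cws <- c("cw_a", "cw_b", "cw_c")  # made-up context words
m <- matrix(c(0,   0.2, 0.7,
              0.2, 0,   0.5,
              0.7, 0.5, 0), nrow = 3, dimnames = list(cws, cws))
tmp <- tempfile(fileext = ".csv")
write.csv(m, tmp)

# Read it back: the first column holds the row names
m2 <- as.matrix(read.csv(tmp, row.names = 1, check.names = FALSE))
d <- as.dist(m2)
d
```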

In [5]:
# If we have many lemmas, we can run this in a loop:
suffix <- ".wwmx.dist.csv"
for (lemma in lemmas) {
    models_file <- file.path(output_dir, lemma, paste0(lemma, '.models.tsv'))
    files_list <- paste0(read_tsv(models_file, col_types = cols())$`_model`, suffix)
    getClouds(file.path(cw_dir, lemma), file.path(output_dir, lemma),
          files_list, lemma, solutions, type = 'focdists')
}

ERROR: Error in getClouds(file.path(cw_dir, lemma), file.path(output_dir, lemma), : object 'solutions' not found


## Model distances and coordinates

The function below loads the `[lemma].models.tsv` file in the `output_dir` and modifies it by appending the coordinates from an nMDS on the distances between the models. By default, it computes "euclidean" distances on the transformed matrices, but the distance function can be changed with the `fun` argument and the transformation can be turned off with the `transformed` argument. It returns some data for a register, which I tend to combine across lemmas and store as `euclidean_register.tsv` to tell the index of the visualization which lemmas to offer.
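
To make the idea concrete, here is a minimal sketch of computing euclidean distances between models and projecting them with nMDS. It assumes `vegan::metaMDS()` for the nMDS and uses random toy matrices. The transformation step and everything else `compLemma()` does are left out, and the model names and column names are made up.

```r
library(vegan)

set.seed(1)
n_tokens <- 20
model_names <- paste0("model", 1:12)  # made-up model names

# Each toy model is a token-by-token distance matrix over the same tokens;
# its lower triangle is kept as one long vector per model.
model_vectors <- t(sapply(model_names, function(m) {
    as.vector(dist(matrix(rnorm(n_tokens * 5), nrow = n_tokens)))
}))

# Euclidean distances between models
model_dist <- dist(model_vectors, method = "euclidean")

# Non-metric MDS in two dimensions; trace = 0 silences the run-by-run log
nmds <- metaMDS(model_dist, k = 2, trace = 0)
model_coords <- data.frame(
    `_model` = attr(model_dist, "Labels"),
    model.x = nmds$points[, 1],
    model.y = nmds$points[, 2],
    check.names = FALSE
)
nmds$stress
head(model_coords)
```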

Under the hood, it also stores the distance matrix as `[lemma].models.dist.tsv`. If the file already exists, it loads it instead of recomputing the distances.
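
The load-or-compute pattern described here can be sketched as follows. The function name `cached_dist` and the file layout (one column of row names followed by the square matrix) are assumptions for illustration, not the package's actual code.

```r
# Hypothetical helper: reuse a stored distance matrix when the file exists,
# otherwise compute it with `compute()` and store it for next time.
cached_dist <- function(path, compute) {
    if (file.exists(path)) {
        m <- as.matrix(read.delim(path, row.names = 1, check.names = FALSE))
        return(as.dist(m))
    }
    d <- compute()
    m <- as.matrix(d)
    write.table(data.frame(`_model` = rownames(m), m, check.names = FALSE),
                path, sep = "\t", quote = FALSE, row.names = FALSE)
    d
}

toy <- matrix(c(1, 2, 4, 8), ncol = 1, dimnames = list(paste0("m", 1:4), NULL))
path <- tempfile(fileext = ".tsv")
first <- cached_dist(path, function() dist(toy))                      # computes and saves
second <- cached_dist(path, function() stop("should not be called"))  # loads from file
all.equal(as.vector(first), as.vector(second))
```

One consequence of this pattern is that a stale file is silently reused, so the stored matrix has to be deleted by hand whenever the underlying models change.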

In [6]:
# running on multiple lemmas
reg <- map_dfr(lemmas, ~compLemma(.x, file.path(input_dir, .x), file.path(output_dir, .x)))
write_tsv(reg, file.path(output_dir, "euclidean_register.tsv"))

Run 0 stress 0.1421042 
Run 1 stress 0.1625272 
Run 2 stress 0.1717681 
Run 3 stress 0.1651881 
Run 4 stress 0.1706972 
Run 5 stress 0.1545158 
Run 6 stress 0.1908481 
Run 7 stress 0.183644 
Run 8 stress 0.2226687 
Run 9 stress 0.1853079 
Run 10 stress 0.1681442 
Run 11 stress 0.1900465 
Run 12 stress 0.1904087 
Run 13 stress 0.1841634 
Run 14 stress 0.1715881 
Run 15 stress 0.1653993 
Run 16 stress 0.1421042 
... Procrustes: rmse 4.371464e-05  max resid 0.000259483 
... Similar to previous best
Run 17 stress 0.1421041 
... New best solution
... Procrustes: rmse 2.828685e-05  max resid 0.0001829588 
... Similar to previous best
Run 18 stress 0.1651976 
Run 19 stress 0.1981996 
Run 20 stress 0.1668781 
*** Solution reached
[1] 0.1421041
[1] "Models saved"
[1] "Matrix created."
[1] "Distance matrix saved in ../github//diskwalificeren/diskwalificeren.models.dist.tsv"

## Medoids

The medoids are simply calculated with `cluster::pam` and some basic information is stored in a `[lemma].medoids.tsv` file. The only important column for the visualization is `medoids`.

In [7]:
k <- 8 # number of medoids per lemma
for (lemma in lemmas) {
    # Load the distances between models and turn them into a `dist` object
    distmtx <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".models.dist.tsv")),
        col_types = cols()) %>% 
        matricizeCloud() %>% as.dist()
    pam_data <- cluster::pam(distmtx, k = k)
    # One row per cluster, with the name and index of its medoid
    medoid_data <- pam_data$clusinfo %>% as_tibble() %>% mutate(medoids = pam_data$medoids, medoid_i = seq_len(k))
    write_tsv(medoid_data, file.path(output_dir, lemma, paste0(lemma, ".medoids.tsv")))
    # Add the cluster and the medoid of each model to the models file
    models_file <- file.path(output_dir, lemma, paste0(lemma, ".models.tsv"))
    read_tsv(models_file, show_col_types = FALSE, lazy = FALSE) %>% 
        mutate(
            pam_cluster = pam_data$clustering[`_model`], # add pam-cluster number
            medoid = pam_data$medoids[pam_cluster] # add name of medoid
        ) %>% 
        write_tsv(models_file)
}
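
For reference, this is what the relevant pieces of a `cluster::pam()` result look like on toy data. The model names are made up, and `k = 3` is used only to keep the example small.

```r
library(cluster)

set.seed(7)
toy <- matrix(rnorm(30 * 2), nrow = 30,
              dimnames = list(paste0("model", 1:30), NULL))
pam_data <- pam(dist(toy), k = 3)

pam_data$medoids           # names of the three medoid models
head(pam_data$clustering)  # named vector: model name -> cluster number
pam_data$clusinfo          # size, max_diss, av_diss, diameter, separation

# Name of the medoid that represents each model
medoid_of <- pam_data$medoids[pam_data$clustering]
table(medoid_of)
```

Because `clustering` is a named vector, it can be indexed by model name, which is what the `mutate()` call on the models file relies on.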

## HDBSCAN

I've mostly computed HDBSCAN for the medoids, but it could certainly be computed for all models. HDBSCAN information, from clustering to membership probabilities or eps, *could* in principle be included in NephoVis. I haven't done it because the result varies per model: each token would have about 200 columns for each piece of information (or 8 if only the medoids are used, which is still a lot), and that is hard to incorporate into the tool.

Instead, I work with an RDS file with a list of models per lemma, and each model object includes:

- coordinates: the coordinates from t-SNE with perplexity 30, next to other variables in the "variables" dataframe like, in my case, "senses", as well as the tailored list of context words. We add the token-wise HDBSCAN info here
- cws: distribution of first-order context words across HDBSCAN clusters and their t-SNE coordinates if available
- (optionally) the normal HDBSCAN plot
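
The nested structure described above can be sketched as a plain R list. All names and values here are placeholders; the real objects are produced by `summarizeHDBSCAN()` and may contain more or differently named fields.

```r
# Placeholder for one model: token coordinates and context words
toy_model <- function() {
    list(
        coordinates = data.frame(token = c("t1", "t2"),
                                 x = c(0.1, -0.3), y = c(1.2, 0.4),
                                 cluster = c(1, 0)),
        cws = data.frame(cw = c("cw_a", "cw_b"), cluster = c(1, 1))
    )
}

lemmas_toy <- c("lemma_a", "lemma_b")
models_toy <- c("model_1", "model_2")

# One entry per lemma, each holding one entry per model
hdbscan_data <- lapply(setNames(lemmas_toy, lemmas_toy), function(lemma) {
    lapply(setNames(models_toy, models_toy), function(model) toy_model())
})

path <- tempfile(fileext = ".rds")
saveRDS(hdbscan_data, path)
reloaded <- readRDS(path)

# Access pattern: lemma, then model, then element
reloaded[["lemma_a"]][["model_2"]]$coordinates
```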

I've typically stored this data somewhere else, not really on the visualization GitHub, but it could totally go there. It is more useful for the ShinyApp though.

**Note**
The ShinyApp currently assumes that this file has a "cw_coords" element with the coordinates for the context words; it needs to be updated.

In [43]:
map(setNames(lemmas, lemmas), function(lemma){
    models <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".medoids.tsv")), show_col_types = FALSE)$medoids
    map(setNames(models, models),
           summarizeHDBSCAN, lemma = lemma,
           input_dir = file.path(input_dir, lemma),
           output_dir = file.path(output_dir, lemma))
}) %>% 
write_rds(file.path(output_dir, "mariana_hdbscan_new.rds"))
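
As a rough illustration of the token-wise HDBSCAN information mentioned above, here is a sketch with `dbscan::hdbscan()` on toy data. Whether `summarizeHDBSCAN()` calls it with these settings is an assumption; `minPts = 8` is an arbitrary choice for the example, and the column names are made up.

```r
library(dbscan)

set.seed(3)
# Two toy blobs of "tokens" plus some scattered points
toy <- rbind(
    matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
    matrix(runif(20, min = -2, max = 5), ncol = 2)
)
h <- hdbscan(dist(toy), minPts = 8)

token_info <- data.frame(
    cluster = h$cluster,           # 0 marks noise
    membprob = h$membership_prob,  # membership probability per token
    outlier = h$outlier_scores     # outlier score per token
)
table(token_info$cluster)
```

Each model yields its own version of these columns, which is why adding them for every model to a single token table quickly becomes unwieldy.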

### What to do with HDBSCAN

The code to classify the clouds into types will be added here later.

## Final steps

For the visualization tool, we need to add a file that lists all the files in the directory of a lemma, to help it manage the available data.

In [44]:
library(stringr)
library(readr)
library(purrr)
library(rjson)
cleanFname <- function(str){
    sections <- str_split(str, "\\.")[[1]]
    paste(sections[-c(1, length(sections))], collapse = "")
}

In [50]:
for (lemma in lemmas){
    files_list <- setdiff(dir(file.path(output_dir, lemma)), "paths.json")
    names(files_list) <- map_chr(files_list, cleanFname)
    write(toJSON(files_list), file.path(output_dir, lemma, "paths.json"))
}
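
To see what ends up as the keys of `paths.json`, here is the same logic as `cleanFname()` applied to a few file names. The version below uses base R's `strsplit()` instead of `stringr::str_split()` so that it runs on its own; the function name `clean_fname` is made up, and the file names are examples that follow the naming used in this notebook.

```r
clean_fname <- function(str) {
    sections <- strsplit(str, ".", fixed = TRUE)[[1]]
    # Drop the lemma (first section) and the extension (last section)
    paste(sections[-c(1, length(sections))], collapse = "")
}

files <- c("hoopvol.models.tsv", "hoopvol.models.dist.tsv", "hoopvol.solutions.json")
keys <- vapply(files, clean_fname, character(1), USE.NAMES = FALSE)
setNames(files, keys)
# the keys are "models", "modelsdist" and "solutions"
```

The middle sections are joined without a separator, so `models.dist` becomes the key `modelsdist`.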