# From distances to clouds

The R half of the workflow takes care of the distance matrices and generates the rest of the files that we need for the visualization itself:

- The coordinates to plot the tokens and context words
- The distances between models
- The coordinates to plot the models
- Selection of medoids
- HDBSCAN clustering

**Note:**

I tested everything with the original code as an R script, but only parts of it with the code as an R package. If anything doesn't work, please let me know (with an [issue](https://github.com/montesmariana/semcloud/issues/new/choose) or by mail).

In [1]:
devtools::install_github("montesmariana/semcloud")

Downloading GitHub repo montesmariana/semcloud@HEAD

✔  checking for file ‘/tmp/Rtmpfmf2I0/remotesb76855e1c276/montesmariana-semcloud-40a6188/DESCRIPTION’
─  preparing ‘semcloud’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘semcloud_0.0.0.9000.tar.gz’

Installing package into ‘/home/mariana/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)


In [4]:
library(tidyverse)
library(semcloud)

── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [2]:
input_dir <- "../output/tokens" # where the data is stored
cw_dir <- "../output/cws"
output_dir <- "../github/" # where the data will go

## Token coordinates

In order to compute the token coordinates, we first need to decide which solutions we want,
that is, whether we are going to run nMDS and, if we run t-SNE, which perplexities we are interested in.
We might even want to run UMAP (not available yet in this code).

While in the end I mostly looked at t-SNE with a perplexity of 30, I will show how to set things up with more options.

In [6]:
# This list works for a loop in the function below and should then be stored as a json file
# in the github directory of each lemma, to tell the visualization what is being used
solutions_old <- list("mds" = ".mds")
for (perp in c(10, 20, 30, 50)) {
    solutions_old[[paste0("tsne", perp)]] <- paste0(".tsne.", perp)
}
solutions_old

In [7]:
# For this demonstration we will only use t-SNE with perplexity 30; the procedure is the same with more solutions
solutions <- list("tsne30" = ".tsne.30")

In [8]:
lemma <- "destroy" # your lemma
suffix <- ".ttmx.dist.pac"
models_file <- file.path(output_dir, lemma, paste0(lemma, '.models.tsv'))
files_list <- paste0(read_tsv(models_file, col_types = cols())$`_model`, suffix)
write(rjson::toJSON(solutions), file.path(output_dir, lemma, paste0(lemma, ".solutions.json")))
file.exists(file.path(input_dir, lemma, files_list[1]))

The `semcloud::getClouds()` function wraps the full workflow:

1. It sets up one empty dataframe per item in `solutions`; each of them will be stored in a `[lemma].[solution].tsv` file.

2. For each file in `files_list`:

    2.1 It extracts the model name

    2.2 It loads the file with `semcloud::tokensFromPac()`
    
    2.3 If `logrank = TRUE` (the default), it applies the transformation
    
    2.4 It applies the corresponding algorithm and extracts the coordinates
    
    2.5 It appends the coordinates as columns preceded by the name of the model to the corresponding dataframe
   
In addition, if "mds" is one of the algorithms, it will *return* a list with the stress values.
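
The steps above can be mimicked on toy data. This is only a sketch of the loop's logic, not the code inside `getClouds()`: the model names and distance matrices are made up, the log-rank transformation is skipped, and base R's `cmdscale()` (classical MDS) stands in for nMDS and t-SNE to avoid extra dependencies.

```r
# Toy stand-in for the loop described above; NOT the semcloud implementation.
set.seed(1)
token_ids <- paste0("token", 1:6)

# Hypothetical models: each one is a symmetric token-by-token distance matrix
toy_models <- lapply(c(modelA = 1, modelB = 2), function(i) {
  m <- as.matrix(dist(matrix(rnorm(6 * 4), nrow = 6)))
  dimnames(m) <- list(token_ids, token_ids)
  m
})

# Step 1: one "empty" data frame that only holds the token ids
coords <- data.frame(`_id` = token_ids, check.names = FALSE)

for (model in names(toy_models)) {      # step 2.1: model name
  d <- toy_models[[model]]              # step 2.2: "load" the distances
  # step 2.3 (log-rank transformation) is skipped in this sketch
  xy <- cmdscale(as.dist(d), k = 2)     # step 2.4: 2D coordinates (classical MDS)
  # step 2.5: append the coordinates as columns prefixed by the model name
  coords[[paste0(model, ".x")]] <- xy[token_ids, 1]
  coords[[paste0(model, ".y")]] <- xy[token_ids, 2]
}

colnames(coords)
```

The result has one row per token and two columns per model, which is the shape of the `[lemma].[solution].tsv` files.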

In [7]:
getClouds(file.path(input_dir, lemma), file.path(output_dir, lemma),
          files_list, lemma, solutions)



In [8]:
read_tsv(file.path(output_dir, lemma, paste0(lemma, ".tsne.30.tsv")),
        col_types = cols())

_id,destroy.nobound5-5lex.PPMIweight.LENGTHFOC.SOCPOSnav.x,destroy.nobound5-5lex.PPMIweight.LENGTHFOC.SOCPOSnav.y,destroy.nobound5-5lex.PPMIweight.LENGTHFOC.SOCPOSall.x,destroy.nobound5-5lex.PPMIweight.LENGTHFOC.SOCPOSall.y,destroy.nobound5-5lex.PPMIweight.LENGTH5000.SOCPOSnav.x,destroy.nobound5-5lex.PPMIweight.LENGTH5000.SOCPOSnav.y,destroy.nobound5-5lex.PPMIweight.LENGTH5000.SOCPOSall.x,destroy.nobound5-5lex.PPMIweight.LENGTH5000.SOCPOSall.y,destroy.nobound5-5lex.PPMIselection.LENGTHFOC.SOCPOSnav.x,⋯,destroy.LEMMAPATH2.PPMIselection.LENGTH5000.SOCPOSall.x,destroy.LEMMAPATH2.PPMIselection.LENGTH5000.SOCPOSall.y,destroy.LEMMAPATH2.PPMIno.LENGTHFOC.SOCPOSnav.x,destroy.LEMMAPATH2.PPMIno.LENGTHFOC.SOCPOSnav.y,destroy.LEMMAPATH2.PPMIno.LENGTHFOC.SOCPOSall.x,destroy.LEMMAPATH2.PPMIno.LENGTHFOC.SOCPOSall.y,destroy.LEMMAPATH2.PPMIno.LENGTH5000.SOCPOSnav.x,destroy.LEMMAPATH2.PPMIno.LENGTH5000.SOCPOSnav.y,destroy.LEMMAPATH2.PPMIno.LENGTH5000.SOCPOSall.x,destroy.LEMMAPATH2.PPMIno.LENGTH5000.SOCPOSall.y
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
verniel/verb/volkskrant_19990517_122/16,4.5103586,-13.563040,15.523726,-5.57195978,-2.1339912,8.45883375,-7.6833499,5.813211,-5.9074627,⋯,19.7423844,13.6412159,-4.465949,15.0480808,-8.460450,-18.7470916,-11.9683826,8.60219248,15.344079,-14.2524173
verniel/verb/het_nieuwsblad_20030830_01_369/154,19.3327374,-31.172588,13.140263,-25.33798017,20.9769407,31.89025556,10.0587641,37.621520,21.6792727,⋯,16.9336546,9.6183536,-9.304723,14.3698826,-15.072294,-18.1496642,-15.3910805,7.90511293,11.678149,-11.5082566
verniel/verb/parool_20040309_16/329,6.4711766,10.039919,1.285111,0.07739147,2.3022332,14.24235263,3.6930121,11.971478,1.1009828,⋯,-2.8861642,19.4406992,-23.369125,3.3884856,-21.960284,-14.6298730,-11.9056482,-19.02173945,2.574085,-28.4551199
verniel/verb/het_laatste_nieuws_20000306_01_201/52,10.3221772,-26.462641,12.980446,-15.97108131,5.6491243,19.77761667,9.5433983,24.707643,15.1519018,⋯,12.2578615,18.1967712,-16.753299,22.4663665,-18.454857,-24.8827052,-19.1415944,2.17279014,8.096723,-17.7693043
verniel/verb/het_laatste_nieuws_20041112_01_230/104,5.8366891,11.836600,-2.426130,0.70845915,3.0882856,-4.20312327,8.4971297,-6.026609,1.4728773,⋯,-18.8165576,18.0633348,-11.947114,-0.5966733,-4.051231,-0.4587108,-0.9243321,-34.16728462,-21.654089,0.4275975
verniel/verb/het_laatste_nieuws_20001219_01_282/59,20.0071655,-29.411883,14.631492,-30.99748811,19.1906157,24.41962093,-6.2361716,21.303664,16.4316494,⋯,17.3541587,18.3462981,-23.546286,26.4834946,-20.925695,-24.8573772,-24.6485177,5.23275936,3.934130,-17.0497189
verniel/verb/parool_20010412_58/69,3.4781955,17.484237,-7.816589,0.90975546,1.8247112,-11.95862438,10.8026153,-14.610629,-2.9555374,⋯,-20.9505372,19.1523026,-16.936964,-14.2134990,-6.297340,-16.0238694,11.9147595,-30.06665159,-10.959444,-20.6475566
verniel/verb/het_nieuwsblad_20031105_01_340/198,11.8003622,-37.140226,6.336671,-24.03425472,31.2143513,26.73777482,14.5511306,40.960605,23.9330209,⋯,16.0902389,15.3451155,-22.534068,28.6565701,-20.380007,-26.9694092,-18.8347031,0.19495360,7.597424,-21.2066294
verniel/verb/parool_20001229_14/17,19.8900194,-32.695587,15.296789,-26.00720552,23.4260454,34.98021575,6.8536828,40.043036,20.5052605,⋯,13.3594413,17.6768266,-15.933811,24.3122004,-17.873249,-26.3309408,-16.0100635,-0.01796889,11.710458,-22.1443794
verniel/verb/het_laatste_nieuws_20040320_01_388/70,-4.7441414,-34.932504,32.936057,-18.12818766,10.1041535,25.99584947,-2.2243332,29.753445,9.6420902,⋯,5.6806095,17.8796340,-28.965622,5.2460546,-27.409439,-15.6423908,-10.9341254,-1.68406376,12.157457,-15.4045734


In [None]:
# If we have many lemmas, we could run this in a loop:
# suffix <- ".ttmx.dist.pac"
# for (lemma in lemmas) {
#     models_file <- file.path(output_dir, lemma, paste0(lemma, '.models.tsv'))
#     files_list <- paste0(read_tsv(models_file, col_types = cols())$`_model`, suffix)
#     write(rjson::toJSON(solutions), file.path(output_dir, lemma, paste0(lemma, ".solutions.json")))    
#     getClouds(file.path(input_dir, lemma), file.path(output_dir, lemma),
#           files_list, lemma, solutions)
# }

## Context words coordinates
For the context words, the workflow is the same as for the tokens. The difference is that the files are saved as `.csv` (because for some reason R cannot read them when they are `.wwmx...`), so `getClouds()` is called with `type = "focdists"` and reads them with the `focdistsFromCsv()` function.

In [27]:
suffix <- ".wwmx.dist.csv"
files_list <- paste0(read_tsv(models_file, col_types = cols())$`_model`, suffix)

getClouds(file.path(cw_dir, lemma), file.path(output_dir, lemma),
          files_list, lemma, solutions, type = "focdists")



In [None]:
# If we have many lemmas, we could run this in a loop:
# suffix <- ".wwmx.dist.csv"
# for (lemma in lemmas) {
#     models_file <- file.path(output_dir, lemma, paste0(lemma, '.models.tsv'))
#     files_list <- paste0(read_tsv(models_file, col_types = cols())$`_model`, suffix)
#     getClouds(file.path(cw_dir, lemma), file.path(output_dir, lemma),
#           files_list, lemma, solutions, type = "focdists")
# }

## Model distances and coordinates

The function below loads the `[lemma].models.tsv` file in `output_dir` and modifies it by appending the coordinates from an nMDS on the distances between the models. By default, it computes "euclidean" distances on the transformed matrices, but the distance function can be changed with the `fun` argument, and the transformation can be turned off with the `transformed` argument. It returns some data for a register, which I tend to combine across lemmas and store as `euclidean_register.tsv` to tell the index of the visualization which lemmas to offer.

Under the hood, it also stores the distance matrix as `[lemma].models.dist.tsv`. If the file already exists, it loads it instead of recomputing the distances.
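
The "load it if it exists, otherwise compute and store it" behaviour can be sketched in base R. The helper `cached_dist()` and the toy data are hypothetical and only illustrate the caching pattern; they are not how `compLemma()` is written.

```r
# Hypothetical helper, not part of semcloud: compute a distance matrix once
# and reload it from a TSV file on later calls.
cached_dist <- function(path, compute) {
  if (file.exists(path)) {
    # Reuse the stored matrix instead of recomputing it
    return(as.matrix(read.delim(path, row.names = 1, check.names = FALSE)))
  }
  d <- as.matrix(compute())
  write.table(d, path, sep = "\t", quote = FALSE, col.names = NA)
  d
}

# Toy "models": each row stands for one model's flattened matrix
set.seed(1)
toy <- matrix(rnorm(12), nrow = 3, dimnames = list(paste0("model", 1:3), NULL))

n_computed <- 0  # counts how often the distances are actually computed
compute <- function() {
  n_computed <<- n_computed + 1
  dist(toy, method = "euclidean")
}

path <- tempfile(fileext = ".tsv")
first <- cached_dist(path, compute)   # computes the distances and writes the file
second <- cached_dist(path, compute)  # reads the file; compute() is not called again
n_computed
```

One consequence of this pattern is that a stale file is reused silently: if the models change, the stored distance file has to be deleted before rerunning.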

In [9]:
reg <- compLemma(lemma, file.path(input_dir, lemma), file.path(output_dir, lemma))

Run 0 stress 0.2096644 
Run 1 stress 0.2128179 
Run 2 stress 0.2180829 
Run 3 stress 0.2324361 
Run 4 stress 0.216838 
Run 5 stress 0.215678 
Run 6 stress 0.2217743 
Run 7 stress 0.2360372 
Run 8 stress 0.2172706 
Run 9 stress 0.2155969 
Run 10 stress 0.2472629 
Run 11 stress 0.2128561 
Run 12 stress 0.2133979 
Run 13 stress 0.2168053 
Run 14 stress 0.2204092 
Run 15 stress 0.2418459 
Run 16 stress 0.2404527 
Run 17 stress 0.2138515 
Run 18 stress 0.2234314 
Run 19 stress 0.2141747 
Run 20 stress 0.2223146 
Run 21 stress 0.2563006 
Run 22 stress 0.2226901 
Run 23 stress 0.2174322 
Run 24 stress 0.2182067 
Run 25 stress 0.2377647 
Run 26 stress 0.2162604 
Run 27 stress 0.2151332 
Run 28 stress 0.2541734 
Run 29 stress 0.2427642 
Run 30 stress 0.2163842 
*** No convergence -- monoMDS stopping criteria:
    16: stress ratio > sratmax
    14: scale factor of the gradient < sfgrmin
[1] 0.2096644


ERROR: Error in stats::setNames(., .data, c("_model", "model.x", "model.y")): unused argument (c("_model", "model.x", "model.y"))


In [36]:
reg # other information could be added eventually, like range of number of tokens, or part-of-speech

type,models,stress,date
<chr>,<int>,<dbl>,<date>
destroy,204,0.2097,2021-08-26


In [None]:
# running on multiple lemmas
# reg <- map_dfr(lemmas, ~compLemma(.x, file.path(input_dir, .x), file.path(output_dir, .x)))
# write_tsv(reg, file.path(output_dir, "euclidean_register.tsv"))

## Medoids

The medoids are simply calculated with `cluster::pam` and some basic information is stored in a `[lemma].medoids.tsv` file. The only important column for the visualization is `medoids`.

In [10]:
distmtx <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".models.dist.tsv")),
        col_types = cols()) %>% 
matricizeCloud() %>% as.dist
pam_data <- cluster::pam(distmtx, k = 8)

In [11]:
medoid_data <- pam_data$clusinfo %>% as_tibble() %>% mutate(medoids = pam_data$medoids, medoid_i = seq(8))
write_tsv(medoid_data, file.path(output_dir, lemma, paste0(lemma, ".medoids.tsv")))
medoid_data

size,max_diss,av_diss,diameter,separation,medoids,medoid_i
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<int>
33,3.626162,2.430053,4.136572,1.0001767,destroy.LEMMAPATHweight.PPMIselection.LENGTHFOC.SOCPOSall,1
29,3.726072,2.550195,4.15972,1.0001767,destroy.LEMMAREL2.PPMIselection.LENGTHFOC.SOCPOSall,2
20,3.430102,2.620707,4.230553,1.8067634,destroy.bound5-5all.PPMIno.LENGTHFOC.SOCPOSall,3
21,3.68101,2.335795,4.175897,0.7574642,destroy.bound10-10lex.PPMIselection.LENGTHFOC.SOCPOSall,4
25,4.000107,2.528835,4.495695,0.7574642,destroy.nobound10-10all.PPMIweight.LENGTHFOC.SOCPOSall,5
17,3.320142,2.493338,4.062809,1.1197924,destroy.nobound3-3all.PPMIselection.LENGTH5000.SOCPOSnav,6
31,2.979151,2.220707,4.093892,1.1197924,destroy.bound3-3lex.PPMIselection.LENGTHFOC.SOCPOSall,7
28,3.292718,2.301664,4.081487,1.8669176,destroy.bound5-5lex.PPMIselection.LENGTHFOC.SOCPOSall,8


In [12]:
# we can also add clustering data to the models summary
models_file <- file.path(output_dir, lemma, paste0(lemma, ".models.tsv"))
read_tsv(models_file, col_types = cols()) %>% 
    mutate(
        pam_cluster = pam_data$clustering[`_model`], # add pam-cluster number
        medoid = pam_data$medoids[pam_cluster] # add name of medoid
    ) %>% 
    write_tsv(models_file)
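
The two lookups in the `mutate()` call rely on how `cluster::pam()` labels its output when it receives a `dist` object: `clustering` is a vector named after the observations, and `medoids` holds the medoid labels in cluster order. A small sketch on made-up points (the model names are hypothetical):

```r
library(cluster)  # recommended package, shipped with standard R installations

set.seed(42)
# Two well-separated groups of 10 points each
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
rownames(pts) <- paste0("model", 1:20)  # hypothetical model names

fit <- pam(dist(pts), k = 2)

# Model names in an arbitrary order, as they might appear in a models table
model_names <- c("model15", "model3", "model20")
pam_cluster <- fit$clustering[model_names]  # lookup by name, not by position
medoid <- fit$medoids[pam_cluster]          # the cluster number indexes the medoid vector

data.frame(model = model_names, pam_cluster = unname(pam_cluster), medoid = medoid)
```

Because the lookup goes by name, the order of the rows in the models file does not need to match the order of the distance matrix.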

In [None]:
# On a loop
# for (lemma in lemmas) {
#     distmtx <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".models.dist.tsv")),
#         col_types = cols()) %>% 
#     matricizeCloud() %>% as.dist
#     pam_data <- cluster::pam(distmtx, k = 8)
#     medoid_data <- pam_data$clusinfo %>% as_tibble() %>% mutate(medoids = pam_data$medoids, medoid_i = seq(8))
#     write_tsv(medoid_data, file.path(output_dir, lemma, paste0(lemma, ".medoids.tsv")))
#     models_file <- file.path(output_dir, lemma, paste0(lemma, ".models.tsv"))
#     read_tsv(models_file, col_types = cols()) %>% 
#         mutate(
#             pam_cluster = pam_data$clustering[`_model`], # add pam-cluster number
#             medoid = pam_data$medoids[pam_cluster] # add name of medoid
#         ) %>% 
#         write_tsv(models_file)
# }

## HDBSCAN

I've mostly computed HDBSCAN on the medoids, but it could certainly be computed for all models. HDBSCAN information, from clustering to membership probabilities or eps, *could* in principle be included in NephoVis, but I haven't done it because the result varies per model: each token would need one column per model for each of these variables, that is, about 200 columns each (or 8 with only the medoids, which is still a lot), and that is hard to incorporate into the tool.

Instead, I work with an RDS file with a list of models per lemma, and each model object includes:

- `coords`: the coordinates from t-SNE with perplexity 30, next to the other variables in the "variables" dataframe (in my case, "senses") and the tailored list of context words. The token-wise HDBSCAN information is added here
- `cws`: distribution of first-order context words across HDBSCAN clusters and their t-SNE coordinates if available
- (optionally) the normal HDBSCAN plot
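
As a rough picture of that nesting (lemma, then model, then components), here is a miniature version built by hand. All names and values are invented for illustration; the real objects come from `summarizeHDBSCAN()` and contain more columns.

```r
# Hypothetical miniature of the nested structure: lemma > model > components
toy_model <- list(
  coords = data.frame(
    `_id` = c("token1", "token2"),
    model.x = c(0.5, -1.2),
    model.y = c(2.0, 0.3),
    clusters = factor(c(0, 1)),  # token-wise HDBSCAN cluster
    membprob = c(0, 0.8),        # token-wise membership probability
    check.names = FALSE
  ),
  cws = data.frame(cw = "word/noun", cluster = "1", TP = 1)
)
hdbscan_data <- list(lemmaA = list(modelA = toy_model))

path <- tempfile(fileext = ".rds")
saveRDS(hdbscan_data, path)  # readr::write_rds() is a wrapper around this base function
restored <- readRDS(path)

# Drill down: lemma, then model, then component
restored$lemmaA$modelA$coords
```

Keeping one list per lemma means that a single file can serve several lemmas, and each model can be retrieved by name.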

In [9]:
# You could run it on all the models or just the medoids
# models <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".models.tsv")), col_types = cols())$`_model` # all models
models <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".medoids.tsv")), col_types = cols())$medoids # only medoids
models

In [10]:
res <- map(setNames(models, models),
           summarizeHDBSCAN, lemma = lemma,
           input_dir = file.path(input_dir, lemma),
           output_dir = file.path(output_dir, lemma))
write_rds(list("destroy" = res), file.path(output_dir, "hdbscan.rds"))

In [12]:
names(res)

In [13]:
res[[1]]$coords

_id,model.x,model.y,cws,clusters,membprob,eps
<chr>,<dbl>,<dbl>,<list>,<fct>,<dbl>,<dbl>
verniel/verb/volkskrant_19990517_122/16,2.8988544,-22.462862,"een/det , Picasso/name , ben/verb , Stedelijk/name , het/det , schilderij/noun , ernstig/adj , zondag_middag/noun, van/prep , Museum/name , mes/noun",0,0.000000000,3.887423
verniel/verb/het_nieuwsblad_20030830_01_369/154,-1.6853307,-19.225090,"de/det , afsluiting/noun, waarbij/pp , en/vg , van/prep , poort/noun",0,0.000000000,3.186496
verniel/verb/parool_20040309_16/329,-16.6766356,-12.206250,"toen/comp , de/det , word/verb , stort_in/verb, vorige/adj , deels/adv , woensdag/noun",0,0.000000000,2.897787
verniel/verb/het_laatste_nieuws_20000306_01_201/52,-5.6305850,-29.800029,"een/det , de/det , 58-jarige/adj , elektriciteit_voorziening/noun, vrijdag_morgen/noun , heb/verb , bulldozer/noun , paar/noun , en/vg , van/prep , woning/noun , auto/noun",4,0.000000000,2.826265
verniel/verb/het_laatste_nieuws_20041112_01_230/104,-9.9646824,-3.892588,"Brugge/name, de/det , word/verb , waardoor/pp, stad/noun",5,0.003507536,3.135761
verniel/verb/het_laatste_nieuws_20001219_01_282/59,-7.3809673,-40.659280,"ben/verb , betrap/verb , kerst_verlichting/noun",0,0.000000000,3.544540
verniel/verb/parool_20010412_58/69,3.2812238,-21.938489,"Stedelijk/name , het/det , schilderij/noun, word/verb , of/vg , 1998/noun , van/prep , Museum/name",0,0.000000000,3.884736
verniel/verb/het_nieuwsblad_20031105_01_340/198,-6.3526612,-38.054513,"de/det , camera/noun , Werchter/name, onbekend/adj",3,0.007459628,2.767327
verniel/verb/parool_20001229_14/17,-1.6465688,-36.451938,"een/det , geparkeerd/adj, de/det , Mercedes/name , zwaar/adj , heb/verb",3,0.094472841,2.524723
verniel/verb/het_laatste_nieuws_20040320_01_388/70,-19.9197007,-18.306144,"een/det , bij_gebouw/noun , brand/noun , de/det , schrijnwerkerij/noun, zwaar/adj , ontsta/verb , van/prep",4,0.215657534,2.216760


In [14]:
res[[2]]$cws %>% filter(cluster == 1)

cw,TP,recall,precision,Fscore,cluster,model.x,model.y
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>
ben/verb,8,1.0,0.1509434,0.2622951,1,-32.982993,18.18119
practisch/adj,1,0.125,1.0,0.2222222,1,-11.542693,14.97918
daar/adv,1,0.125,0.5,0.2,1,12.159312,-20.77912
stad/noun,1,0.125,0.3333333,0.1818182,1,6.602981,-23.25507
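
The scores in this table are consistent with the usual definitions, taking the tokens of the cluster as the target set: recall is the share of the cluster's tokens that co-occur with the context word, precision is the share of the context word's tokens that fall in the cluster, and the F-score is their harmonic mean. The sketch below reproduces the first row; the cluster size (8) and the overall frequency of `ben/verb` (53) are inferred from the values in the table, not read from the data, and `cw_scores()` is a hypothetical helper.

```r
# Hypothetical helper: scores of one context word for one cluster
cw_scores <- function(tp, cluster_size, cw_freq) {
  recall <- tp / cluster_size   # share of the cluster covered by the context word
  precision <- tp / cw_freq     # share of the context word's tokens inside the cluster
  fscore <- 2 * precision * recall / (precision + recall)
  c(recall = recall, precision = precision, Fscore = fscore)
}

# ben/verb: 8 true positives; cluster size and frequency are inferred values
round(cw_scores(tp = 8, cluster_size = 8, cw_freq = 53), 7)
#    recall precision    Fscore
# 1.0000000 0.1509434 0.2622951
```

The same function with `tp = 1, cluster_size = 8, cw_freq = 1` gives the values of the `practisch/adj` row.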


I've typically stored this data somewhere else rather than in the visualization's GitHub repository, but it could go there too. It is more useful for the ShinyApp, though.

In [None]:
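# In a loop across lemmas
# map(setNames(lemmas, lemmas), function(lemma){
#     # the models (here, the medoids) have to be read for each lemma
#     models <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".medoids.tsv")), col_types = cols())$medoids
#     map(setNames(models, models),
#            summarizeHDBSCAN, lemma = lemma,
#            input_dir = file.path(input_dir, lemma),
#            output_dir = file.path(output_dir, lemma))
# }) %>% 
# write_rds(file.path(output_dir, "hdbscan.rds"))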

### What to do with HDBSCAN

Here I will add the code to classify the clouds into types, but later.

## Final steps

For the visualization tool, we need to add a file that lists all the files in the directory of a lemma, to help it manage the available data.

In [18]:
library(stringr)
library(readr)
library(purrr)
library(rjson)
cleanFname <- function(str){
    sections <- str_split(str, "\\.")[[1]]
    paste(sections[-c(1, length(sections))], collapse = "")
}
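
To see what `cleanFname()` does to typical file names, here is a base-R equivalent applied to a few examples. The name `clean_fname_base` and the example file names are only for illustration.

```r
# Base-R equivalent of cleanFname() (clean_fname_base is a hypothetical name)
clean_fname_base <- function(str) {
  sections <- strsplit(str, ".", fixed = TRUE)[[1]]
  # Drop the first section (the lemma) and the last one (the extension),
  # then glue the remaining sections together without a separator
  paste(sections[-c(1, length(sections))], collapse = "")
}

examples <- c("destroy.tsne.30.tsv", "destroy.models.dist.tsv", "paths.json")
keys <- vapply(examples, clean_fname_base, character(1), USE.NAMES = FALSE)
setNames(examples, keys)
```

The keys come out as `tsne30`, `modelsdist` and an empty string. The first one matches the names used in the `solutions` list. The empty key shows that a file without a lemma prefix, such as `paths.json` itself when the directory is listed a second time, ends up without a usable name.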

In [2]:
output_dir <- "../github/"
lemma <- "destroy"

In [16]:
files_list <- dir(file.path(output_dir, lemma))
names(files_list) <- map_chr(files_list, cleanFname)
files_list

In [20]:
write(toJSON(files_list), file.path(output_dir, lemma, "paths.json"))
fromJSON(file = file.path(output_dir, lemma, "paths.json"))

In [None]:
# In a loop across lemmas
# for (lemma in lemmas){
#     files_list <- dir(file.path(output_dir, lemma))
#     names(files_list) <- map_chr(files_list, cleanFname)
#     write(toJSON(files_list), file.path(output_dir, lemma, "paths.json"))
# }