# Advanced Dimension Reduction with UMAP and t-SNE - AcqVA Aurora workshop



In [None]:
knitr::include_graphics("https://slcladal.github.io/images/acqvalab.png")



## Introduction

In this AcqVA Aurora workshop, we will focus on advanced dimension reduction methods and on creating interactive notebooks. Both topics are widely applicable: the first helps you find and display structure in complex numerical data, and the second makes your work more transparent and reproducible, which can be used, for example, to give reviewers a better understanding of your analyses.

The first part of the workshop will focus on dimension reduction using UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding).

### What to do before the workshop{-}

To get the most out of this workshop, you will need some (basic) R skills and (basic) knowledge of how to work with R, RStudio, R Projects, and R Notebooks. If you have little or no experience with these, or if you need to refresh your skills, please carefully read (or, ideally, work through) these tutorials:

* [Getting started with R](https://slcladal.github.io/intror.html)

* [Handling tables in R](https://slcladal.github.io/table.html)

Before attending the workshop, you need to install the following packages in RStudio:

* `ggplot2`  (for general data visualization)

* `here` (for easy pathing)

* `tidyverse` (for data processing)

* `vcd` (for mosaic plots)

* `likert` (for visualizing Likert data)

* `knitr` (for knitting R Notebooks)

* `markdown` (for rendering Rmd files)

* `rmarkdown`  (for R Markdown formatting)

* `installr` (for updating R)

* `gridExtra` (for multiple plots in one window)



You can update R and install these packages by clicking on `packages` on the top of the lower right pane in RStudio or you can execute the following code in the lower left, `Console` pane of RStudio. 


In [None]:
# update R
#install.packages("installr")
#library(installr)
#updateR()
# install required packages
install.packages(c("ggplot2", "here", "tidyverse", "vcd", "likert", "knitr",
                   "markdown", "rmarkdown", "gridExtra", "lme4", "sjPlot"),
                 dependencies = TRUE)


**It is really important that you have knowledge of R and RStudio and that you have installed the packages before the workshop so that we do not have to deal with technical issues too much.**

You can follow this workshop in different ways: (1) you can sit back and watch it like a lecture, or (2) you can take a more active role. That said, the workshop is clearly intended to be practical: I show something, you then do it on your computer, and we have exercises where you can try out what you have just learned. Choose the option that suits you best and go with it.

### Session preparation{-}

Here is what we have planned to cover in this workshop:

Tuesday, January 16, 10am-12pm

* Introduction

* Session preparation

* Basics of UMAP and t-SNE 

* Case study

Tuesday, January 25, 1-3pm

* Why interactive notebooks are useful

* Limitations

* Creating an interactive notebook

* Wrap-up



### Getting started{-}

If you choose option 2 (the active role), you need to set up your R session and prepare your R project at the very beginning of the workshop.

For everything to work, please do the following:

* Create a folder for this workshop somewhere on your computer, e.g. called *AcqVA_UMAP_WS*

* In that folder, create two subfolders called *data* and *images*

* Open RStudio, go to File > New Project > Existing Directory (Browse to project folder) > Create (and hit Enter)

This will then create an R Project in the project folder.


## UMAP and t-SNE

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that is commonly used for visualizing high-dimensional data in lower-dimensional spaces. It is particularly popular in machine learning and data analysis for tasks such as clustering, visualization, and feature engineering.

In linguistics it can be used to find and visualize groups of similar words, observations or participants.

Here's a brief overview of UMAP and a comparison with PCA (Principal Component Analysis):

#### Key aspects of UMAP{-}

1. UMAP is a **nonlinear** dimensionality reduction technique, meaning it can capture complex relationships in the data that linear methods such as PCA may struggle with. Linear dimensionality reduction methods assume that the relationships between the variables are linear, in other words, that the data can be represented as a linear combination of its features. When using a linear method, the data is projected onto a lower-dimensional subspace via a linear transformation of the original space. UMAP and t-SNE do not assume that the relationships between variables are linear, which means that the projection onto a lower-dimensional space is not linear either. Instead, both methods focus on preserving the neighborhood structure of the data (t-SNE mainly the local structure, UMAP both local and global structure).

2. UMAP adheres to the **Preservation of Neighborhoods** and aims to maintain the local structure of the data points. Points that are close in the high-dimensional space should remain close in the lower-dimensional representation.

3. UMAP is not restricted to a fixed number of target dimensions; it supports **Variable Dimensionality**. This means that with UMAP (and t-SNE) you can freely choose the dimensionality of the resulting embedding, for example two dimensions for plotting or more dimensions for further processing.

4. UMAP is generally considered more **robust** to noise and outliers than some linear dimensionality reduction techniques, making it suitable for a wide range of data types. Also, while PCA works well when the majority of the variability is explained by the first few components, UMAP and t-SNE can better handle data where the variability is spread out over more dimensions or components (or their equivalents).

5. Unlike some other nonlinear techniques, UMAP is designed to **preserve structure** in the data not only locally but also globally, providing a more comprehensive representation that aims to retain the relationships between data points. In contrast, other dimensionality reduction methods aim to capture variability or variance while compromising relative similarity.
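
To make the contrast with a linear method concrete, here is a minimal sketch that runs PCA on the built-in `iris` measurements using base R's `prcomp()`. It illustrates that PCA scores are simply the centered data multiplied by a rotation matrix (a linear transformation) and shows how the explained variance is distributed over the components. The object names (`X`, `pca`, `manual_scores`, `var_explained`) are illustrative and not part of the workshop materials.

```r
# numeric iris measurements (4 features, 150 flowers)
X <- as.matrix(iris[, 1:4])
# PCA: centers the data and finds orthogonal directions of maximal variance
pca <- prcomp(X, center = TRUE, scale. = FALSE)
# the scores are a linear transformation: centered data %*% rotation matrix
manual_scores <- scale(X, center = TRUE, scale = FALSE) %*% pca$rotation
all.equal(unname(manual_scores), unname(pca$x))
# proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```

UMAP and t-SNE have no such rotation matrix: the low-dimensional positions are found by iterative optimization rather than by a single matrix multiplication.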

#### Comparison between UMAP and PCA{-}

UMAP is **more flexible** in terms of capturing nonlinear relationships and allowing for variable dimensionality.

UMAP is **designed to better preserve** both local and global structures in the data, making it more suitable for certain types of high-dimensional data, especially when the relationships are intricate and nonlinear.

UMAP can be **computationally expensive** for large datasets, while PCA is generally more computationally efficient.

UMAP and PCA serve different purposes and are suitable for different scenarios. UMAP is a powerful tool for visualizing high-dimensional data, especially when nonlinear relationships need to be preserved, while PCA is a computationally efficient technique for capturing the principal components and explaining variance in a linear manner.

#### How does UMAP work?{-}

The basic idea of UMAP is that a higher-dimensional graph is mapped onto a lower-dimensional graph. During this projection, the relationships between individual points (local) as well as between groups of points (global) are retained as well as possible: observations and clusters of observations that are close together in high-dimensional space should also be close together in low-dimensional space.

UMAP starts by creating a distance matrix that contains the distance between each data point and all other data points.
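
As a rough illustration of this first step, the sketch below computes a full pairwise distance matrix for the `iris` measurements with base R's `dist()`. Euclidean distance is an assumption made for this example, and the object names are illustrative.

```r
# numeric iris measurements
features <- iris[, 1:4]
# pairwise Euclidean distances between all 150 flowers
dist_mat <- as.matrix(dist(features, method = "euclidean"))
# the matrix is square and symmetric, with zeros on the diagonal
dim(dist_mat)
# distances among the first three flowers
dist_mat[1:3, 1:3]
```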

Then, a similarity score is calculated for each point relative to its neighbors. The similarity score is calculated as $e^{-(d - \rho) / \sigma}$, where $d$ is the raw distance to a neighbor, $\rho$ is the distance to the nearest neighbor, and $\sigma$ controls the width of the curve. The similarity scores depend on the number of neighbors (k) that is specified (this number includes the point itself!): starting from the distances, UMAP fits a curve (or area) around each point so that the similarity scores of its neighbors sum to log<sub>2</sub>(k). This means that the curve or area differs for each point.

For example, assume we have a cluster of points A, B, and C, we set k = 3, the distance between A and B is 0.5, and the distance between A and C is 2.4. Since log<sub>2</sub>(3) is approximately 1.6, the curve around A is drawn so that it has a height of 1 above point B and approximately 0.6 above point C. Because each point gets its own curve, the similarity scores are not symmetric (with a single shared curve, some points would not have any neighbors in their allocated area): the score of B relative to C can be (and probably is) different from the score of C relative to B. Because UMAP adapts the curve or area for each data point, every point is guaranteed to have exactly k neighbors in its area of influence, and all sums of similarities equal log<sub>2</sub>(k) (thus rendering the distribution uniform in a mathematical sense). All other points do not matter and receive a similarity score of 0. This is important because it reduces the number of values for each data point from all data points (in the distance matrix) to k similarity scores.
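
The A, B, C example can be reproduced with a few lines of base R. This is a minimal sketch, not the internal code of any UMAP implementation: it uses `uniroot()` to search for the curve width (called `sigma` here) at which the similarity scores of B and C sum to log<sub>2</sub>(3). The function name `sim_scores` and the search interval are assumptions made for illustration.

```r
# distances from point A to its neighbors B and C (values from the text)
d <- c(B = 0.5, C = 2.4)
k <- 3             # number of neighbors, counting A itself
rho <- min(d)      # distance to the nearest neighbor
target <- log2(k)  # required sum of the similarity scores
# similarity scores for a given curve width sigma
sim_scores <- function(sigma) exp(-(d - rho) / sigma)
# find the sigma for which the scores sum to log2(k)
sigma <- uniroot(function(s) sum(sim_scores(s)) - target,
                 interval = c(1e-3, 100), tol = 1e-10)$root
# scores for B and C relative to A
round(sim_scores(sigma), 2)
```

The nearest neighbor B always receives a score of exactly 1 (because its distance minus `rho` is 0), and C receives the remainder of roughly 0.6.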

In a next step, the similarity scores are made symmetric again by combining the two directed scores of each pair (similar to averaging them) whenever they differ. For example, the score of B relative to C could be 0.6 while the score of C relative to B could be 1.0. This would then result in a similarity score of 0.8 for both B relative to C and C relative to B.
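
The averaging described above can be sketched with a small matrix of directed scores. The matrix `S` and its layout (rows as the point of reference, columns as the neighbor) are illustrative assumptions. Note that this follows the simplified averaging description in the text; the original UMAP algorithm combines the two directed scores with a fuzzy set union (a + b - ab) rather than a plain mean.

```r
# directed similarity scores: rows = point of reference, columns = neighbor
S <- matrix(0, nrow = 2, ncol = 2,
            dimnames = list(c("B", "C"), c("B", "C")))
S["B", "C"] <- 0.6  # score between B and C, seen from B
S["C", "B"] <- 1.0  # score between B and C, seen from C
# symmetrize by averaging each pair of directed scores
S_sym <- (S + t(S)) / 2
S_sym
```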

Now, UMAP generates an initial low-dimensional graph using spectral embedding. This graph will, however, not capture the global structure well. To adapt the initial graph, UMAP iteratively selects random pairs of points, where the probability of a pair being chosen is based on its similarity score (pairs with higher similarity are more likely to be chosen). Then, UMAP moves one point closer to the other; which of the two points is moved is chosen at random. Once it is determined which point is moved, UMAP selects a point from another cluster and moves the point in question further away from the foreign-cluster point and closer to the same-cluster point.

The question now is how far the point should be moved. To determine this, UMAP calculates low-dimensional similarity scores between the point to be moved and the same-cluster point as well as between the point to be moved and the foreign-cluster point. These similarity scores are based on a fixed curve that resembles a t-distribution. The aim of moving the point is to maximize the low-dimensional similarity score for similar, same-cluster points and to minimize the score for dissimilar, foreign-cluster points.
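
To give a feel for such a fixed curve, the sketch below evaluates a t-distribution-like kernel of the form 1 / (1 + a * d^(2b)) at a few distances. The function name `low_dim_sim` and the choice a = b = 1 are assumptions for illustration; actual implementations derive these shape parameters from their settings.

```r
# low-dimensional similarity curve; a = b = 1 gives the kernel 1 / (1 + d^2)
# (a and b are illustrative assumptions, not fitted values)
low_dim_sim <- function(d, a = 1, b = 1) 1 / (1 + a * d^(2 * b))
# similarity decreases smoothly as the low-dimensional distance grows
d <- c(0, 0.5, 1, 2, 4)
data.frame(distance = d, similarity = round(low_dim_sim(d), 3))
```

Unlike the high-dimensional curves, this curve is the same for every point, which is why it is described as fixed.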

This is done iteratively until we arrive at a final low-dimensional graph. The structure of the final graph is determined to a large extent by the number of nearest neighbors (k) one chooses: a lower k will lead to smaller clusters and preservation of local structure while higher k will lead to larger clusters and preservation of global structure. 
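
The effect of k can be explored by running UMAP twice with different `n_neighbors` values. The following is a minimal sketch on the `iris` measurements using the `umap` package and base graphics; the values 5 and 50 and the object names are arbitrary choices for illustration.

```r
library(umap)
# numeric iris measurements
features <- iris[, 1:4]
# small k: emphasis on local structure
umap_k5 <- umap(features, n_neighbors = 5, random_state = 15)
# large k: emphasis on global structure
umap_k50 <- umap(features, n_neighbors = 50, random_state = 15)
# compare the two embeddings side by side
op <- par(mfrow = c(1, 2))
plot(umap_k5$layout, col = iris$Species, pch = 19,
     main = "n_neighbors = 5", xlab = "V1", ylab = "V2")
plot(umap_k50$layout, col = iris$Species, pch = 19,
     main = "n_neighbors = 50", xlab = "V1", ylab = "V2")
par(op)
```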

#### What are differences between UMAP and t-SNE?{-}

UMAP and t-SNE are very similar but differ in some respects. For instance, UMAP is substantially faster than t-SNE. In addition, UMAP always starts with the same low-dimensional graph (as it uses spectral embedding to generate the initial low-dimensional graph), while t-SNE generates a different random initial low-dimensional graph for each run, which can lead to very different final low-dimensional graphs for each run! UMAP is also deemed to strike a better balance between retaining local and global structure than t-SNE (see the projection of a mammoth in the pair-code blog post on this issue [here](https://pair-code.github.io/understanding-umap/)). Another difference is that t-SNE moves **each** point a little bit during each iteration, while UMAP only moves one point (or a small number of points) per iteration.
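
The dependence of t-SNE on its random starting configuration can be checked by fixing R's random seed before each run. This is a minimal sketch that assumes the `tsne` package draws its initial configuration from R's random number generator; the helper `run_tsne` is hypothetical and `max_iter = 200` is chosen only to keep the example fast.

```r
library(tsne)
# numeric iris measurements
features <- as.matrix(iris[, 1:4])
# helper (hypothetical name): run t-SNE after fixing the random seed
run_tsne <- function(seed) {
  set.seed(seed)
  tsne(features, initial_dims = 3, k = 2, perplexity = 15, max_iter = 200)
}
run_a <- run_tsne(1)
run_b <- run_tsne(1)  # same seed as run_a
run_c <- run_tsne(2)  # different seed
# do the two runs with the same seed agree?
identical(run_a, run_b)
# do runs with different seeds agree?
isTRUE(all.equal(run_a, run_c))
```

Under this assumption, runs with the same seed should produce the same embedding, while runs with different seeds should differ.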

### Case study{-}

We start with a very simple example and then move on to a more realistic example. The initial easy study simply aims at showing you the relevant functions.

install packages


In [None]:
install.packages("here")   # for defining paths
install.packages("readxl") # for loading data
install.packages("dplyr")  # for data processing
install.packages("stringr")# for data processing
install.packages("umap")   # for UMAP
install.packages("tsne")   # for t-SNE
install.packages("ggplot2")# for data visualization
install.packages("devtools")# for interactive notebooks


activate packages



In [None]:
library(here)
library(readxl)
library(dplyr)
library(stringr)
library(umap)
library(tsne)
library(ggplot2)


#### Example 1{-}

load example data

> This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.


In [None]:
data("iris")
# inspect 
head(iris); str(iris)


process data



In [None]:
# extract features
features <- iris %>%
  dplyr::select(-Species)
# extract labels
label <- iris %>%
  dplyr::select(Species)
# inspect
str(features); str(label)


#### Implement UMAP{-}

implement umap


In [None]:
umap.res <- umap(features, # the data set containing the features 
                 n_neighbors = 15, # number of nearest neighbors
                 n_components = 2, # number of target dimensions 
                 random_state = 15 # seed for random number generation used during umap
                 ) 
# inspect
str(umap.res)


extract information



In [None]:
visdat <- umap.res[["layout"]] %>%
  as.data.frame() %>%
  dplyr::mutate(label)
# inspect
str(visdat)


visualize umap results



In [None]:
visdat %>%
  ggplot(aes(x = V1, y = V2, color = Species)) +
  geom_point() +
  theme_bw()


#### Implement t-SNE{-}



In [None]:
set.seed(15) # t-SNE starts from a random configuration; fix the seed for reproducibility
tsne.res <- tsne(features, # the data set containing the features 
                 initial_dims = 3, # number of dimensions to use in reduction method
                 k = 2, # number of target dimensions (dimensions of the resulting embedding)
                 perplexity = 15 # roughly the effective number of nearest neighbors
                 ) %>%
  as.data.frame() %>%
  dplyr::mutate(label)
# inspect
str(tsne.res)


visualize t-SNE results



In [None]:
tsne.res %>%
  ggplot(aes(x = V1, y = V2, color = Species)) +
  geom_point() +
  theme_bw()
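
To compare the two embeddings directly, both plots can be shown next to each other with `gridExtra::grid.arrange()`. The sketch below recomputes both embeddings so that it can be run on its own; the helper `plot_embedding` and the object names are illustrative and not part of the workshop code.

```r
library(umap)
library(tsne)
library(ggplot2)
library(gridExtra)
# numeric iris measurements
features <- iris[, 1:4]
# helper (hypothetical name): turn an embedding matrix into a ggplot
plot_embedding <- function(layout, title) {
  df <- as.data.frame(layout)
  names(df) <- c("V1", "V2")
  df$Species <- iris$Species
  ggplot(df, aes(x = V1, y = V2, color = Species)) +
    geom_point() +
    theme_bw() +
    ggtitle(title)
}
# compute both embeddings with the settings used above
umap_layout <- umap(features, n_neighbors = 15, random_state = 15)$layout
set.seed(15)
tsne_layout <- tsne(features, initial_dims = 3, k = 2, perplexity = 15)
# build the two plots and show them side by side
p_umap <- plot_embedding(umap_layout, "UMAP")
p_tsne <- plot_embedding(tsne_layout, "t-SNE")
grid.arrange(p_umap, p_tsne, ncol = 2)
```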


#### Example 2{-}

load data


In [None]:
bdat <- readxl::read_xlsx(here::here("data", "SPR_trimmed_Martin.xlsx"))
# inspect
head(bdat); str(bdat)


extract a selection



In [None]:
# extract features
bfeatures <- bdat[, c(1, 17, 21:50)] %>%
  # remove columns
  dplyr::select(-AoA_ENG,-SRP_ENG,-proportion_daily_ENG_use, -average_proportion_use_across_contexts_NOR) %>%
  # remove duplicates
  dplyr::distinct() %>%
  # remove rows with NAs
  dplyr::filter(complete.cases(.)) %>%
  # clean Language group
  dplyr::mutate(Language_group = stringr::str_remove_all(Language_group, " ")) %>%
  dplyr::mutate(Language_group = stringr::str_remove_all(Language_group, "\\&")) 
# extract labels
blabel <- paste0(bfeatures$Language_group, "_",
                 bfeatures$Participant_Private_ID) 
# remove label column from features
bfeatures <- bfeatures %>%
  dplyr::select(-Participant_Private_ID, -Language_group) %>%
  # convert characters to numbers
  dplyr::mutate_if(is.character, as.numeric)
# inspect
head(bfeatures); str(bfeatures)


implement umap



In [None]:
umap.bres <- umap(bfeatures, 
                  n_neighbors = 10, 
                  n_components = 2, 
                  random_state = 15
                  ) 
umap.bres <- umap.bres[["layout"]] %>%
  as.data.frame() %>%
  dplyr::mutate(Participant = blabel,
                LanguageGroup = stringr::str_remove_all(blabel, "_.*"))
# inspect
str(umap.bres)


In [None]:
umap.bres %>%
  ggplot(aes(x = V1, y = V2, color = LanguageGroup)) +
  geom_point() +
  theme_bw()


## Outro



In [None]:
sessionInfo()

