# Efficient dimension reduction with UMAP
Soham Joshi and Mehrab Singh Gill
```{r, include=FALSE}
# this prevents package loading messages from appearing in the rendered document
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
```{r, echo=FALSE}
# install.packages("umap")   # uncomment on first run
library(tidyverse)   # loads ggplot2, dplyr, stringr, forcats, ...
library(plotly)
library(umap)
```
## Brief Introduction to UMAP
Most existing dimensionality reduction techniques do not work very efficiently when a large number of dimensions is involved; in such cases they tend to be slow and inefficient. This is where UMAP (Uniform Manifold Approximation and Projection) comes into the picture, as it works well even when very many dimensions are involved.
UMAP is a dimensionality reduction technique that uses ideas from topological data analysis and manifold mapping to project higher-dimensional data into lower dimensions. It can be used for dimensionality reduction, unsupervised clustering, and metric learning, and it tends to be much faster and more efficient than comparable techniques. It is effective because it takes both the local and the global structure of the data into consideration while projecting it into two dimensions; it does not find principal components.
Unlike PCA (Principal Component Analysis), which is a linear approach, UMAP uses k nearest neighbours on an n-dimensional manifold to find the points closest to a given point based on the topology of the data. UMAP essentially captures the relations between points in the higher-dimensional space and then reproduces them in the lower-dimensional plot.
UMAP is available in R through the "umap" package, an R implementation of the original Python package. While working on this project, we gained hands-on experience with the "umap" package and saw how efficient, fast, and easy it can be to visualize high-dimensional data with UMAP. We aim to explore many more datasets on which to define and demonstrate this technique, as well as to work toward a better and faster package for visualization of data in R.
As part of our project, we will now demonstrate the use of the "umap" package, input tuning, and its properties by working on two datasets, namely MNIST and iris.
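To make the contrast with PCA concrete, here is a minimal sketch (not evaluated here) comparing the two calls on the built-in iris measurements; both return a two-column set of coordinates, but only UMAP's are a nonlinear embedding:
```{r, eval=FALSE}
# Illustrative sketch: linear PCA projection vs. nonlinear UMAP embedding.
# prcomp() is base R; umap() comes from the "umap" package loaded above.
pca_proj  <- prcomp(iris[, 1:4], scale. = TRUE)$x[, 1:2]  # first two principal components
umap_proj <- umap(iris[, 1:4])$layout                     # 2-D UMAP coordinates
```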
## Experiments on the MNIST dataset
Let's start with a high-dimensional dataset that can be projected to 2 dimensions and then discerned as clusters. The MNIST handwritten digits dataset is a collection of images of handwritten numbers, flattened and stored as 784-dimensional vectors.
```{r}
df2<- read.csv("https://datahub.io/machine-learning/mnist_784/r/mnist_784.csv")
```
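As a quick sanity check (a sketch, not evaluated by default), a single flattened row can be reshaped back into a 28 x 28 pixel image; MNIST pixels are conventionally stored row-major, so the orientation may need flipping for display:
```{r, eval=FALSE}
# Sketch: reshape the first flattened row back into a 28 x 28 pixel matrix
# and draw it in grayscale; the index reversal puts the digit upright.
digit <- matrix(unlist(df2[1, 1:784]), nrow = 28, byrow = TRUE)
image(t(digit)[, 28:1], col = gray.colors(256), axes = FALSE)
```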
Since the MNIST dataset contains the digits 0-9 across 70,000 rows, processing all of it is computationally very expensive in R. Hence, we will consider only the first 4 digits, i.e. {0, 1, 2, 3}, and all entries corresponding to them. This leaves us with 28,911 rows of 784 dimensions each.
We also separate the labels from the actual input data passed to umap.
```{r}
df2 <- df2[df2$class < 4, ]   # keep only the digits 0-3
mnist_labels = df2[, "class"]
mnist_data = df2[, 1:784]     # pixel columns only
```
The function call to umap is very straightforward in the default case.
```{r}
mnist.umap = umap(mnist_data)
mnist.umap
```
Printing the object gives the following summary:
1. It is a umap embedding of 28,911 items (our data points) in 2 dimensions; the umap algorithm has thus reduced our data from 784 dimensions to 2.
2. layout: a numeric matrix holding the values of the 2 new dimensions. These values can be treated as X and Y coordinates and form the projection that UMAP has created.
3. data: the input data we fed into the algorithm.
4. knn: an object with two components, indexes and distances, stored from the smooth k-nearest-neighbor part of the UMAP algorithm.
5. config: the input parameters used by UMAP. In this particular case we have not tweaked any, but we will do so shortly.
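Each of these components can be accessed directly with $; a quick sketch (not evaluated here), using only the component names shown in the summary above:
```{r, eval=FALSE}
# Sketch: peek at the pieces of the umap object described above.
head(mnist.umap$layout)        # the 2-D coordinates, one row per item
dim(mnist.umap$data)           # the original 784-dimensional input
str(mnist.umap$knn)            # nearest-neighbor indexes and distances
mnist.umap$config$n_neighbors  # one of the tunable input parameters
```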
If we now plot the embedding created by UMAP, we get the following result:
```{r}
df <- data.frame(mnist.umap$layout[, 1], mnist.umap$layout[, 2], mnist_labels)
colnames(df) <- c("X", "Y", "Label")
ggplot(df, aes(x = X, y = Y)) + geom_point()
```
We can clearly see that there are 4 distinct clusters formed in the embedded data. We can visually assess the accuracy of this clustering by coloring the points according to their true labels:
```{r}
df <- data.frame(mnist.umap$layout[, 1], mnist.umap$layout[, 2], mnist_labels)
colnames(df) <- c("X", "Y", "Label")
ggplot(df, aes(x = X, y = Y, color = Label)) + geom_point()
```
With the exception of a few outliers, we see that UMAP accurately separates the clusters corresponding to the digits {0,1,2,3}.
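Beyond the visual check, a rough numeric check is possible. The sketch below (not evaluated here) clusters the 2-D embedding with k-means, which is not part of UMAP itself, and cross-tabulates the clusters against the true digits; each cluster being dominated by a single digit would confirm the separation:
```{r, eval=FALSE}
# Sketch: cluster the embedding with k-means and compare to the true labels.
set.seed(42)
km <- kmeans(mnist.umap$layout, centers = 4, nstart = 10)
table(cluster = km$cluster, digit = mnist_labels)
```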
For our base case the default values of UMAP worked out well, but in some scenarios that may not be the case. Parameter tuning is thus very important for UMAP, and we can change the input parameters to umap using the config object.
We can check the default values of UMAP as follows:
```{r}
custom.config = umap.defaults
custom.config
```
## Parameters and Additional Insights
We will be focusing on the 3 most important parameters in UMAP:
1. n_neighbors: controls the number of neighbors in KNN. Since the number of neighbors changes our notion of which points are similar to a given point on the manifold, this value can have a visible effect on the output.
2. min_dist: ranges from 0.0 to 1.0 and controls how tightly points are packed in the final projection. If the data is packed too tightly, it is harder to split clusters, while very loose packing can lose the shapes of the cluster formations. It is therefore important to choose a good value for min_dist.
3. n_components: does not change how the embedding is computed, but sets the target dimensionality of the reduction. By default n_components = 2, i.e. UMAP reduces the data to 2 dimensions, but in many cases data can be viewed better in 3 dimensions; this parameter can be used for such cases (see the 3-D visualisation at the end).
To demonstrate the effect of the first two parameters, let us consider some arbitrary values:
### min_dist = 0.99 and n_neighbors = 4
```{r}
custom.config = umap.defaults # Set of configurations
custom.config$min_dist = 0.99 # change min_dist and n_neighbors
custom.config$n_neighbors = 4
mnist.umap.config = umap(mnist_data, config=custom.config)
```
The resultant projection is shown below:
```{r}
dfc <- data.frame(mnist.umap.config$layout[, 1], mnist.umap.config$layout[, 2], mnist_labels)
colnames(dfc) <- c("X", "Y", "Label")
ggplot(dfc, aes(x = X, y = Y)) + geom_point(alpha = 0.5)
```
As we can see, no separate clusters have formed. This makes sense: we considered a low number of neighbors and packed the points loosely, i.e. we looked at the data mostly from a local-structure perspective and did not bind the points together.
If we now change n_neighbors to a higher number, say n_neighbors = 10:
### min_dist = 0.99 and n_neighbors = 10
```{r}
custom.config = umap.defaults # Set of configurations
custom.config$min_dist = 0.99 # change min_dist and n_neighbors
custom.config$n_neighbors = 10
mnist.umap.config2 = umap(mnist_data, config=custom.config)
dfc <- data.frame(mnist.umap.config2$layout[, 1], mnist.umap.config2$layout[, 2], mnist_labels)
colnames(dfc) <- c("X", "Y", "Label")
ggplot(dfc, aes(x = X, y = Y)) + geom_point(alpha = 0.5)
```
This has a good effect: as we include more neighbors, more clusters are formed. However, the boundaries are very close to each other, since our min_dist value is still high. UMAP still does a good job of separating the clusters:
```{r}
dfc <- data.frame(mnist.umap.config2$layout[, 1], mnist.umap.config2$layout[, 2], mnist_labels)
colnames(dfc) <- c("X", "Y", "Label")
ggplot(dfc, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.5)
```
Instead of varying n_neighbors, if we lower min_dist to, say, 0.5, the points pack more tightly, but with n_neighbors still at 4 the clusters are not discernible. A low n_neighbors value is thus not a good idea when we have a large dataset.
### min_dist = 0.5 and n_neighbors = 4
```{r}
custom.config = umap.defaults # Set of configurations
custom.config$min_dist = 0.5 # change min_dist and n_neighbors
custom.config$n_neighbors = 4
mnist.umap.config3 = umap(mnist_data, config=custom.config)
```
```{r}
dfc <- data.frame(mnist.umap.config3$layout[, 1], mnist.umap.config3$layout[, 2], mnist_labels)
colnames(dfc) <- c("X", "Y", "Label")
ggplot(dfc, aes(x = X, y = Y)) + geom_point(alpha = 0.5)
```
If we color the points by label and check the output, we notice that the groups are actually separate, but there is no clear boundary between them.
```{r}
dfc <- data.frame(mnist.umap.config3$layout[, 1], mnist.umap.config3$layout[, 2], mnist_labels)
colnames(dfc) <- c("X", "Y", "Label")
ggplot(dfc, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.5)
```
## Iris Dataset
```{r}
iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
iris.labels = iris[, "Species"]
```
The function call to umap in the default case:
```{r}
iris.umap = umap(iris.data)
iris.umap
```
This is a umap embedding of 150 items (our data points) in 2 dimensions; the umap algorithm has thus reduced our data from 4 dimensions to 2.
If we now plot the embedding created by UMAP, we get the following result:
```{r}
df_iris <- data.frame(iris.umap$layout[, 1], iris.umap$layout[, 2], iris.labels)
colnames(df_iris) <- c("X", "Y", "Label")
ggplot(df_iris, aes(x = X, y = Y)) + geom_point()
```
We see that there are 2 distinct clusters formed in the embedded data. We will now visually assess the accuracy of this clustering by coloring the points according to their true labels:
```{r}
df_iris_1 <- data.frame(iris.umap$layout[, 1], iris.umap$layout[, 2], iris.labels)
colnames(df_iris_1) <- c("X", "Y", "Label")
ggplot(df_iris_1, aes(x = X, y = Y, color = Label)) + geom_point()
```
On coloring the points, we in fact realize that there are 3 labelled groups and not 2.
The default values of UMAP thus did not work out well in this case, so we turn to parameter tuning.
As discussed before, we can check the default values of UMAP as follows:
```{r}
custom.config = umap.defaults
custom.config
```
## Analysis of Iris using UMAP
To demonstrate the effect of n_neighbors and min_dist, let us consider arbitrary values:
### min_dist = 0.10 and n_neighbors = 20
```{r}
custom.config = umap.defaults # Set of configurations
custom.config$min_dist = 0.10 # change min_dist and n_neighbors
custom.config$n_neighbors = 20
iris.umap.config = umap(iris.data, config=custom.config)
```
The resultant projection is shown below:
```{r}
dfc_iris <- data.frame(iris.umap.config$layout[, 1], iris.umap.config$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y)) + geom_point(alpha = 0.5)
```
If we color the same plot by label:
```{r}
dfc_iris <- data.frame(iris.umap.config$layout[, 1], iris.umap.config$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.5)
```
With an increase in the number of neighbors, we observe closer packing of similar clusters. More neighbors implies merging of clusters (observe versicolor and virginica), although some clusters are still not well connected; for a low number of neighbours, by contrast, the data is over-clustered.
Say we now increase the min_dist parameter to 0.5:
### min_dist = 0.5 and n_neighbors = 20
```{r}
custom.config = umap.defaults # Set of configurations
custom.config$min_dist = 0.5 # change min_dist
custom.config$n_neighbors = 20
iris.umap.config2 = umap(iris.data, config=custom.config)
dfc_iris <- data.frame(iris.umap.config2$layout[, 1], iris.umap.config2$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y)) + geom_point(alpha = 0.5)
```
If we color the points by label:
```{r}
dfc_iris <- data.frame(iris.umap.config2$layout[, 1], iris.umap.config2$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.5)
```
A finer separation of the data has been achieved now (observe virginica and versicolor). By reducing the density of the packing, the number of points that might be wrongly classified is reduced. The downside, however, is that weak inter-cluster connections can be misrepresented as separate clusters, which in fact might even loosen up the third, tightly classified cluster (setosa).
### High min_dist and low n_neighbors
```{r}
custom.config = umap.defaults # Set of configurations
custom.config$min_dist = 0.9 # change min_dist and n_neighbors
custom.config$n_neighbors = 3
iris.umap.config3 = umap(iris.data, config=custom.config)
dfc_iris <- data.frame(iris.umap.config3$layout[, 1], iris.umap.config3$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y)) + geom_point(alpha = 0.5)
```
If we color the points by label:
```{r}
dfc_iris <- data.frame(iris.umap.config3$layout[, 1], iris.umap.config3$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.5)
```
Two different labels start merging, and there are a few outliers with respect to the setosa label here as well. We observe an overlap of clusters in this case.
### Low min_dist and high n_neighbors
```{r}
custom.config = umap.defaults # Set of configurations
custom.config$min_dist = 0.15 # change min_dist and n_neighbors
custom.config$n_neighbors = 35
iris.umap.config4 = umap(iris.data, config=custom.config)
dfc_iris <- data.frame(iris.umap.config4$layout[, 1], iris.umap.config4$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y)) + geom_point(alpha = 0.5)
```
If we color the points by label:
```{r}
dfc_iris <- data.frame(iris.umap.config4$layout[, 1], iris.umap.config4$layout[, 2], iris.labels)
colnames(dfc_iris) <- c("X", "Y", "Label")
ggplot(dfc_iris, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.5)
```
Here, too, we observe an overlap of clusters.
### 3D Visualisation using UMAP
We can always choose the number of dimensions we need for the projection. While we usually plot data in 2 dimensions to visualize it on a flat plot, we can also create a 3-dimensional visualization by projecting onto 3 components. This is done by changing the n_components parameter to 3 and then plotting with plotly (for the 3-D plot).
```{r}
custom.config = umap.defaults     # Set of configurations
custom.config$n_components = 3    # change n_components to 3 for a 3-D embedding
iris.umap.config5 = umap(iris.data, config=custom.config)
dfc <- data.frame(iris.umap.config5$layout[, 1], iris.umap.config5$layout[, 2], iris.umap.config5$layout[, 3], iris.labels)
colnames(dfc) <- c("X", "Y", "Z", "Label")
plot_ly(dfc, x = ~X, y = ~Y, z = ~Z, color = ~Label, type = "scatter3d", mode = "markers")
```
## References
1. Uniform Manifold Approximation and Projection in R: https://cran.r-project.org/package=umap
2. Uniform Manifold Approximation and Projection for Dimension Reduction: https://umap-learn.readthedocs.io/en/latest/
3. MNIST dataset: https://datahub.io/machine-learning/mnist_784#resource-mnist_784