---
title: Interface Design and Topic Models
layout: post
output:
md_document:
preserve_yaml: true
---
_The Stanford Dissertation Browser case study_
[Code](https://github.com/krisrs1128/stat679_code/blob/main/notes/week12-3.Rmd), [Recording](https://mediaspace.wisc.edu/media/Week%2012%20-%203%3A%20Interface%20Design%20and%20Topic%20Models/1_jocz55hl)
```{r, echo = FALSE}
library(knitr)
opts_knit$set(base.dir = "/", base.url = "/")
opts_chunk$set(
warning = FALSE,
message = FALSE,
fig.path = "stat679_notes/assets/week12-3/"
)
```
1. For most of the course, we have focused on visualizing the data directly. For
the last few weeks, though, we have visualized the results of an initial
modeling step, like fitting PCA or topic models. These kinds of models can help
in creating abstractions, and reasoning effectively through abstraction is often
a source of discovery.
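For concreteness, here is a minimal sketch of that two-step pattern, using `prcomp` on a built-in dataset rather than anything from this week's reading: the model lifts the raw columns into principal component scores, and the plot shows the scores rather than the raw data.
```{r, eval = FALSE}
library(ggplot2)

# step 1: model -- lift the raw columns into principal component scores
pca_fit <- prcomp(USArrests, scale. = TRUE)
scores <- data.frame(pca_fit$x, state = rownames(USArrests))

# step 2: visualize the abstraction (the scores), not the raw data
ggplot(scores, aes(PC1, PC2, label = state)) +
  geom_text(size = 2.5)
```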
1. The ideal is to apply modeling to lift concrete datasets into more meaningful
abstractions, which can then be visualized to support high-level reasoning.
Reality, however, is rarely as simple as that. In practice, models can go awry,
visualizations can be misleading, and results can be misunderstood. Indeed, this
complexity is one of the reasons data science training takes years.
1. The authors of this week’s reading suggest a mnemonic: “interpret but
verify.” This captures two complementary principles:
* Interpret: Models and visualizations should be designed so that users can draw inferences that are relevant to their understanding of a domain.
* Verify: It is the designer’s responsibility to ensure that these inferences are accurate.
#### Stanford Dissertation Browser
1. The authors ground their discussion in a specific design project:
the Stanford Dissertation Browser. For this project, university leadership
wanted to discover ways to promote effective, interdisciplinary research. Topic
models helped transform the raw dissertation text into higher levels of
abstraction: research themes were reflected in topics, and the interdisciplinary
reach of certain PhD theses was reflected in their mixed memberships.
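To make the idea of mixed memberships concrete, here is a rough sketch of how document-topic proportions can be extracted from a fitted topic model. The `AssociatedPress` corpus is only a stand-in for the dissertation abstracts, and the choice of `k = 10` is arbitrary.
```{r, eval = FALSE}
library(topicmodels)
library(tidytext)

data("AssociatedPress")  # stand-in corpus; dissertation abstracts would play this role

# fit a topic model; every document receives a mixed membership over topics
fit <- LDA(AssociatedPress, k = 10, control = list(seed = 479))

# gamma gives the estimated proportion of each document devoted to each topic;
# for a thesis, weight spread across several topics suggests interdisciplinary reach
memberships <- tidy(fit, matrix = "gamma")
head(memberships)
```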
1. In several initial implementations of their interface, it was easy to draw
incorrect inferences. For example, in some dimensionality-reduction outputs, it
seemed that petroleum engineering had become closer to biology over time, but
this turned out to be an artifact of the reduction and was not visible when
using the original distances. In another example, words that are common in
neurobiology (voltage, current, …) caused theses from that department to appear
very close to electrical engineering, when they really had more in common with
those in other biology departments. A minimal check against this kind of
distortion is sketched after the figure below.
<p align="center">
```{r, echo = FALSE, out.width = 400}
include_graphics("stat679_notes/assets/week12-3/petroleum_case.png")
```
</p>
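One way to guard against such artifacts is to compare pairwise distances before and after the reduction. The sketch below uses simulated topic proportions and classical MDS; it only illustrates the check and is not the authors' actual procedure.
```{r, eval = FALSE}
# simulated department-level topic proportions, standing in for the real data
theta <- matrix(runif(20 * 10), nrow = 20)
theta <- theta / rowSums(theta)

d_original <- dist(theta)                 # distances in the original topic space
embedding <- cmdscale(d_original, k = 2)  # a 2D reduction (classical MDS)
d_reduced <- dist(embedding)              # distances after the reduction

# pairs that look much closer in the embedding than in the original space
# are candidates for the kind of artifact described above
plot(d_original, d_reduced, xlab = "original distance", ylab = "embedded distance")
abline(0, 1, col = "red")
```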
1. The authors carefully tracked and manually verified inferences made by a
group of test users. Based on this qualitative evaluation, they were able to
refine their modeling and visualization approach. For example, they used a
supervised variant of topic models, and they replaced their original
dimensionality-reduction scatterplot with a visualization of topic-derived
inter-department similarities; a sketch of one way such similarities could be
computed appears after the figure below.
<p align="center">
```{r, echo = FALSE, out.width = 400}
include_graphics("stat679_notes/assets/week12-3/radial_view.png")
```
</p>
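As a sketch of how topic-derived inter-department similarities could be computed, suppose we have thesis-level topic proportions and a department label for each thesis. Both are simulated here, since the real quantities come from the project's own model.
```{r, eval = FALSE}
# simulated inputs: thesis-by-topic proportions and a department per thesis
gamma <- matrix(runif(200 * 15), nrow = 200)
gamma <- gamma / rowSums(gamma)
dept <- sample(paste0("dept_", 1:10), 200, replace = TRUE)

# average topic profile per department
profiles <- apply(gamma, 2, function(topic) tapply(topic, dept, mean))

# cosine similarity between department profiles -- the kind of quantity a
# radial, department-centered view could be built around
norms <- sqrt(rowSums(profiles ^ 2))
similarity <- (profiles %*% t(profiles)) / outer(norms, norms)
round(similarity[1:5, 1:5], 2)
```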
#### Overall Strategy
1. To implement the “interpret but verify” idea, the authors recommend,
* Align the analysis, visualization, and models along appropriate “units of
analysis.”
* Verify that the modeling results in abstractions / concepts that are
relevant to the analysis, and modify accordingly.
* Progressively disclose data to support reasoning at multiple levels of
abstraction.
1. The units of analysis in the dissertation browser were departments and
theses. In general, these are the “entities, relationships, and concepts” of
interest, and they can often be linked to existing metadata.
1. Verification can be guided both by quantitative metrics (e.g., test error)
and by qualitative evaluation (e.g., user studies). Possible model revisions
include refitting parameters, adding labels, modifying the model structure, or
simply overriding the model with user-provided values.
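For a quantitative check, one common choice with topic models is held-out perplexity, where lower values indicate better generalization to unseen documents. The sketch below again uses `AssociatedPress` as a stand-in corpus and an arbitrary grid of topic counts.
```{r, eval = FALSE}
library(topicmodels)
data("AssociatedPress")

# hold out some documents, fit on the rest, and score the held-out set
train <- AssociatedPress[1:1500, ]
test <- AssociatedPress[1501:2000, ]

fits <- lapply(c(5, 10, 20), function(k) LDA(train, k = k, control = list(seed = 479)))
sapply(fits, perplexity, newdata = test)  # one number per candidate k
```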
1. Progressive disclosure allows the user to go up and down the [ladder of
abstraction](http://worrydream.com/LadderOfAbstraction/). It makes it possible to navigate large-scale data while supporting
verification through specific examples.
1. In the dissertation browser, this is implemented through semantic zooming
(from departmental to dissertation views) and linked highlighting (revealing the
dissertation abstract on mouseover); a toy analogue of the hover interaction is
sketched after the screenshot below.
<p align="center">
```{r, echo = FALSE, out.width = 800}
include_graphics("stat679_notes/assets/week12-3/dissertation_browser_overview.png")
```
</p>
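The real browser is a bespoke web interface, but the hover idea can be mimicked in R. This sketch uses simulated thesis coordinates and `plotly::ggplotly` so that mousing over a point reveals its (fake) abstract.
```{r, eval = FALSE}
library(ggplot2)
library(plotly)

# simulated thesis-level coordinates and metadata; in the real browser these
# would come from the topic model and the dissertation records
theses <- data.frame(
  x = rnorm(100),
  y = rnorm(100),
  dept = sample(c("biology", "electrical engineering", "statistics"), 100, replace = TRUE),
  abstract = paste("Abstract of thesis", 1:100)
)

# hovering reveals the abstract: overview first, detail on demand
p <- ggplot(theses, aes(x, y, col = dept, text = abstract)) +
  geom_point()
ggplotly(p, tooltip = "text")
```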
1. The question of how to effectively weave together model building and visual
design is still actively explored in research today. If you enjoyed reading
about this project, you may enjoy other papers on visualization for topic
models or, more generally, for machine learning [[1](https://visxai.io/),
[2](https://distill.pub/),
[3](https://www.sciencedirect.com/science/article/pii/S2468502X17300086)].