-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.idyll
396 lines (287 loc) · 34.8 KB
/
index.idyll
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
[meta title:"PAC Learning; Or: Why We Should (and Shouldn't) Trust Machine Learning" description:"An illustration of PAC Learning and why ML models can fail in deployment." /]
[Header
fullWidth:true
title:"PAC Learning"
subtitle:"Or: Why We Should (and Shouldn't) Trust Machine Learning"
author: "Dylan Cashman, Assistant Professor of Computer Science, Brandeis University"
authorurl:"https://cs.brandeis.edu/~dylan"
date:"October 18, 2023"
background:"#222222"
color:"#ffffff"
/]
// author:"Dylan Cashman"
// authorurl:"https://www.eecs.tufts.edu/~dcashm01/"
[var name:"currgamestate" value:"NA" /]
[var name:"currgamemsg" value:"Welcome to the Four Germans Game!" /]
[Fixed]
[PacGameContainer
parentcurrgamemsg: currgamemsg
gamestate: currgamestate
/]
[/Fixed]
## Under Review
This paper is under review on the experimental track of the [Journal of Visualization and Interaction](https://www.journalovi.org/).
## Abstract
In this interactive article, we present an interactive game that represents the types of tasks solved by machine learning algorithms. We use this game to motivate the definition of Probably Approximately Correct (PAC) learning, illustrating a proof of PAC learnability for Empirical Risk Minimization (ERM). Then, we show three types of vulnerabilities of ERM that often occur in applied machine learning - domain mismatch, dependencies in data, and an incorrect model class. We conclude by arguing for the need for visualization to identify these issues in a modeling dataset.
## Introduction
[var name:"beginning" value:0 /]
[GameChanger
longtext: "Welcome to the Four Germans Game!"
shorttext: "Try it out!"
gamestate: "beginning"
progressvar: beginning
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `beginning++` scrolloffset:300
showhelp: true
/]
Let's play a game. Your task is to define the concept of "men of medium height and weight", and every second or so, I will give you examples. A green dot means that example is part of the concept, and a red dot means it is not part of the concept. Try the game out a few times, hitting the TEST! button when you're satisfied with your guess, and then using the [InlineRefresh /] button to start a new game.
I first heard of this game from Tufts University professor emeritus Anselm Blumer, who along with his coauthors Ehrenfeucht, Haussler, and Warmuth used this setting to demonstrate the concept of learnability [link text: "in their article in 1989" url: "https://dl.acm.org/citation.cfm?id=76371" /]. That paper became known as the "Four Germans Paper", and in honor of their work, I call this game the *Four Germans Game*. This game is a useful abstraction that can help illustrate the task that machine learning algorithms have to solve. Machine learning models see some training data generated by an underlying phenomenon (i.e. the hidden box), and based on that limited information must fit a model that approximates the underlying phenomonen. A successful algorithm will pick a concept with low error on its training data, and will hopefully have similarly low error when it is tested (here error defined as the total area that is only inside one box, i.e. false positives (in our proposed concept, not in the ground truth concept), or false negatives (in ground truth concept, not in our proposed concept)). A lot can go wrong, however, even in this simple game:
- What if the training data is not representative of the actual phenomenon?
- What if there is not enough training data?
- What if the training data is noisy?
- What if the wrong model is used (i.e. you guess a rectangular region, but the actual ground truth is a circle)?
This game can be used to illustrate many concepts of one of the underlying theories of machine learning: [link text: "probably approximately correct (PAC) learning" url: "https://en.wikipedia.org/wiki/Probably_approximately_correct_learning#:~:text=In%20computational%20learning%20theory%2C%20probably,in%201984%20by%20Leslie%20Valiant." /], proposed by Leslie Valiant in 1984. PAC learning is a framework to help understand why we are able to trust machine learning algorithms. It also provides some insight into why machine learning models can make nonsensical and erroneous precictions. In this article, we'll illustrate the basic assumptions of most machine learning algorithms, and demonstrate why fully-automated approaches will never be able to be error-free if these assumptions are validated.
### Why this is important
Machine Learning algorithms are becoming ubiquitous tools in the age of big data to aid in decision making and data analysis.
They are actively being deployed in scenarios where their impact directly effects humans,including
[link text:"self-driving cars" url:"https://arxiv.org/abs/1901.04407" /],
[link text: "patient triaging software at hospitals" url:"https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205836" /], and even [link text: "crime prediction software" url: "https://link.springer.com/chapter/10.1007/978-3-030-14680-1_40" /], a la [link text: "Minority Report" url: "https://www.imdb.com/title/tt0181689/" /]. Automation from deep learning models such as [link text: "facial recognition technology" url: "https://www.gao.gov/products/gao-22-106100" /], [link text: "image and scene classification" url: "https://www.sciencedirect.com/science/article/pii/S0386111219301566" /], and [link text: "text generation through large language models" url: "https://openai.com/blog/chatgpt" /] are rapidly eclipsing levels of human competency and changing the way we live our lives.
When they are successful, these models can result in new industries and capabilities that were science fiction only years ago. But when they fail, they can have drastic consequences for the humans effected by them. Consider a scenario where making a mistake can have drastic consequences: for example, a model might be used to prescribe a medication that has drastic side effects if applied to the wrong patient. Would you feel comfortable using a machine learning model to decide on that medication?
When we deploy machine learning algorithms out in the wild, how do we *know* that they will work at their intended goal? How do we know that the probability of their error is small *enough*?
The answer, of course, is that we don't know for sure the models will work as intended.
We train models on an available dataset, and then hope that the model's performance on a portion of that dataset is representative
on however that model will be used out in the wild. This can fail spectacularly, though: [link text: "A recent project called called Gender Shades by Joy Buolamwini et al." url: "http://gendershades.org/" /] at the Massachusetts Institute of Technology discovered that commercial gender classification systems have less than one percent error on lighter-skinned males, and up to 35% error on darker-skinned females, presumably because they were trained on datasets that were biased towards light-skinned males. Because the training sets were imbalanced and not representative of the general population, the tool performed much worse in practice than we can imagine its creators assumed it would perform.
It is important for the general public to understand that machine learning algorithms make mistakes. And it is *crucial* for decision makers to understand *why* machine learning models can perform worse when deployed in the real world, so that they can make informed decisions about deploying these models in high-risk environments. In hindsight, it might seem obvious that the training data of facial recognition algorithms could lead to a bias in a deployed system. PAC learning, however, will provide us foresight to anticipate these types of risks before building a model. And this article will illustrate what that means.
## Four Germans Game: Find My Rectangle
[var name:"no_free_lunch" value:0 /]
[GameChanger
longtext: "Now, try to guess with no labeled points"
shorttext: "No Free Lunch"
gamestate: "no_free_lunch"
progressvar: no_free_lunch
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `no_free_lunch++` scrolloffset:300
/]
Let's return to the game, and consider some machine learning concepts. First, try to play the game again (in the smaller version to the right here), but without getting any training data.
...
Not very fun, or interesting, right?
This is an illustration of there being [link text:"No Free Lunch theorems for optimization" url:"https://ieeexplore.ieee.org/document/585893" /]. In the absence of any information, we can do no better than random chance, and thus we can make no statements about how "correct" we are. You might have 10% error, you might have 90% error, you might fit the rectangle perfectly and have 0 error.
[SectionDivider /]
[var name:"ten_samples" value:0 /]
[GameChanger
longtext: "How would you guess with 10 labeled points?"
shorttext: "A Little Lunch, as a Treat"
gamestate: "ten_samples"
progressvar: ten_samples
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `ten_samples++` scrolloffset:300
/]
However, suppose we are given some information. Here, we are given 10 points, with 5 inside the target rectangle and 5 outside. If you guess a rectangle, can you make any guarantees about how close you are to the ground truth?
Consider the various strategies that you might use to minimize your error. Do you tightly wrap the green examples, or do you leave some space around them to allow for data you haven't seen yet? Which strategy generally works better? Which strategy works better in the unlucky case, where the sampled data doesn't provide much information due to bad luck?
Up to this point, you have been playing the machine learning algorithm. You have all the magnificent faculties of the human mind, letting you change and adapt your strategy as you encounter new examples. But machine learning algorithms tend to be much simpler, and typically more stuck-in-their ways, because they have to be well-defined.
[SectionDivider /]
We present three simple machine learning algorithms to solve this problem.
[var name:"tightest_fit" value:0 /]
[GameChanger
longtext: "This algorithm picks the tightest rectangle."
shorttext: "Tightest Fit"
gamestate: "tightest_fit"
progressvar: tightest_fit
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `tightest_fit++` scrolloffset:300
/]
1. **Tightest Fit**: In this algorithm, the tightest possible rectangle is chosen that still contains all of the positive examples seen so far. This algorithm is the smallest possible region that is still consistent with the samples seen. It is related to the principle of [link text: "Risk Minimization" url: "https://papers.nips.cc/paper/506-principles-of-risk-minimization-for-learning-theory" /], which is a strategy applicable to many machine learning problems. This will under-estimate the rectangular region.
[SectionDivider /]
[var name:"loosest_fit" value:0 /]
[GameChanger
longtext: "This algorithm picks the loosest rectangle"
shorttext: "Loosest Fit"
gamestate: "loosest_fit"
progressvar: loosest_fit
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `loosest_fit++` scrolloffset:300
/]
2. **Loosest Fit**: In contrast to the Tightest Fit algorithm, this algorithm chooses the loosest possible rectangle. This rectangle will be internal to all negative examples, and will contain all positive examples. This will over-estimate the rectangular region.
[SectionDivider /]
[var name:"max_margin" value:0 /]
[GameChanger
longtext: "This algorithm maximizes the margin between in and out"
shorttext: "Maximum Margin"
gamestate: "max_margin"
progressvar: max_margin
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `max_margin++` scrolloffset:300
/]
3. **Maximum Margin**: This is a midpoint between the two previous algorithms. It chooses a rectangle that is as far as possible from both the positive and negative examples, while containing all examples.
How do we determine if any of these algorithms are good? Or reliable? What are the properties and assumptions that are necessary to make these judgements?
[SectionDivider /]
## PAC-learning
When we use classical statistical methods, such as logistic regression, to analyze data, we can *trust* the model works because formalisms such as the p value and confidence intervals give us bounds on the possible errors of the model. The rectangle game proposed by Blumer et. al. that is featured in this post is designed to demonstrate how such formalisms can be defined for machine learning techniques.
Computational Learning Theorists attempt to find theoretically sound definitions of concepts found in machine learning,
such as training data, validation accuracy, and modelling. These definitions are then used to prove bounds on various
metrics like error and runtime. The paradigm of **probably approximately correct learning**, or **PAC learning**, has the
explicit goal of determining under what conditions a machine learning algorithm will most likely perform about as well as it does
on a training set, when deployed in the wild.
We provide the following treatment based on chapter 1 from [link text: "Kearns and Vazirani" url: "https://mitpress.mit.edu/books/introduction-computational-learning-theory" /], and we direct the reader to that text for a more thorough discussion. Loosely, a concept is PAC-learnable if there exists a machine learning algorithm such that, after viewing a sufficient amount of labeled samples, we can prove a bound on the error (let's say that error is smaller than some epsilon [Equation]\epsilon[/Equation]), assuming we aren't too unlucky with which samples we receive (so, with probability at least [Equation](1 - \delta)[/Equation]). This type of analysis mirrors analysis done using p values for logistic regression or mathematical definitions of limits, and allows us to bound error on machine learning models. We define it formally below, then prove that our game is PAC-learnable. The assumptions in the definition and its proof lead us to a better understanding of potential vulnerabilities of machine learning algorithms, which will be demonstrated below.
[var name:"pause_definitions" value:0 /]
[GameChanger
longtext: "Pause For Definitions"
shorttext: "Pause The Game"
gamestate: "pause_definitions"
progressvar: pause_definitions
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `pause_definitions++` scrolloffset:300
/]
[FullWidth]
[hr /]
[div className: 'callout']
[div className: 'callout-definition']
## Definition
([link text: "Adapted from Wikipedia" url: "https://en.wikipedia.org/wiki/Probably_approximately_correct_learning" /])
Let [Equation]X[/Equation] be a data space. This could be as simple as our 2-D space, or it could be the space of all images or the space of all snippets of text. A *concept* is a subset [Equation]c \in X[/Equation] of the data space - in our case a rectangular region. A *concept class* [Equation]C[/Equation] is the collection of all concepts over our data space - in our case, all possible rectangular regions in the 2-D space.
Let [Equation]EX(c, D)[/Equation] be a procedure that draws an example, [Equation]x[/Equation], using a probability distribution [Equation]D[/Equation] and gives the correct label [Equation]c(x)[/Equation]. In our case, [Equation]EX(c, D)[/Equation] is the process that happens when you hit the [InlinePlay/] button. It draws a random location in the 2-D space using [Equation]D[/Equation](the uniform distribution), then provides a label for that location as green (if it is part of the rectangular region we want to guess), or red (if it is outside that region).
An algorithm [Equation]A[/Equation] is a **PAC learning algorithm** for [Equation]C[/Equation] (equivalently, the concept space [Equation]C[/Equation] is **PAC learnable** if the following is true: For any [Equation]\epsilon[/Equation] and [Equation]\delta[/Equation] that we choose between 0 and 1 (i.e. some small fraction of a number), there exists a finite sample size [Equation]p[/Equation] such that if [Equation]EX(c, D)[/Equation] draws and labels [Equation]p[/Equation] examples, the algorithm [Equation]A[/Equation] will output a hypothesis concept, [Equation]h \in C[/Equation], that will have less than [Equation]\epsilon[/Equation] error, with probability [Equation]1-\delta[/Equation].
[/div]
In other words, a type of concept (which could be rectangular regions on a plane, or the set of all pictures of cats and dogs within a group of images - the thing we are trying to see in the data space) is considered learnable if a learning algorithm exists that can usually reach a level of guaranteed performance if it sees enough training data. In the definition, choosing [Equation]\epsilon[/Equation] chooses how "good" an algorithm we want. If [Equation]\epsilon = 0.05[/Equation], then we are saying a "good" algorithm will have less than 5% error, however we want to define error. The [Equation]\delta[/Equation] instead compensates for bad luck. If [Equation]\delta = 0.1[/Equation], then we are saying that we will be able to produce a "good" algorithm at least 90% of the time. The other 10% of the time, we might be unable to produce a "good" enough algorithm because of getting unlucky with a sample size. In general, the number of samples increases as the amount of error ([Equation]\epsilon[/Equation]) and bad luck ([Equation]\delta[/Equation]) we tolerate goes down. These two parameters in the definition explain the redundancy in the term: probably ([Equation]\delta[/Equation]) approximately ([Equation]\epsilon[/Equation]) correct.
It's useful to consider when a concept is *not* PAC learnable. It would seem pretty universal that the more samples a learning algorithm sees, the more confident it would get about its predictions, and the better those predictions would be. But in a lot of problems, you hit a plateau. If you try to fit a very simple model, like a logistic regression classifier, to a very complicated concept, like classifying photos of dogs, you probably can't guarantee any arbitrary level of performance no matter how many samples you see.
### Why is this important?
The most popular learning paradigm used in machine learning use cases is [link text: "*Empirical Risk Minimization*" url: "https://en.wikipedia.org/wiki/Empirical_risk_minimization" /], which says that when we want to minimize error (and thus risk) when putting a machine learning model out in the world, we should first tune it to minimize error on the *empirical* data that we have, i.e. the training data. By minimizing the *empirical risk*, we are hopefully minimizing the *actual risk* the model will try to avoid when it is used out in the world.
This only makes sense, though, if there is some theoretical basis for why seeing labeled trainign data would guarantee performance on unseen data. If a concept class wasn't PAC learnable, then no matter how good our performance on a training set was, we wouldn't be able to make any guarantees on an algorithm's performance out in the world.
Understanding how to prove PAC learnability can help us see what stipulations about a machine learning use case need to hold in order to trust empirical risk minimization. The definition of PAC learnability can also provide hints into how applied machine learning can go wrong.
In this article, we will use our game to demonstrate first how a learning algorithm can be proven to be a PAC learning algorithm. Then, we will break down different ways where PAC learning can break down due to issues with the data or the complexity of the concept that we are trying to model.
[/div]
[/FullWidth]
[SectionDivider extraspace: true /]
[var name:"pac_learning_1" value:0 /]
[GameChanger
longtext: "This algorithm picks the tightest rectangle."
shorttext: "Tightest Fit"
gamestate: "pac_learning_1"
progressvar: pac_learning_1
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `pac_learning_1++` scrolloffset:300
/]
We can use the PAC learning paradigm to analyze our algorithms. In particular, we can look at the tightest fit algorithm, and try to bound its error, where error is defined as the probability that our algorithm's chosen region will get the label incorrect. We would like to prove that the error is less than [Equation]\epsilon[/Equation], with probability at least [Equation]1-\delta[/Equation]. We do this by showing that, under the uniform distribution that gives us our labeled points, we have a pretty good chance of having low error.
[SectionDivider /]
[var name:"pac_learning_2" value:0 /]
[GameChanger
longtext: "The tightest-fit rectangle is always contained in the target rectangle."
shorttext: "Tightest Fit Errors"
gamestate: "pac_learning_2"
progressvar: pac_learning_2
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `pac_learning_2++` scrolloffset:300
/]
First, we note that the tightest-fit rectangle is always contained in the target rectangle. We can express the total error region as four strips: the strips above, below, to the left, and to the right of the tightest-fit rectangle. This is the only region where our algorithm will label incorrectly. To proceed, we determine the probability that the size of each region is [Equation]\leq \frac{\epsilon}{4}[/Equation], so that the total error is less than [Equation]\epsilon[/Equation]. This will give us a relationship between [Equation]\epsilon[/Equation] and [Equation]\delta[/Equation] in terms of the nubmer of samples drawn, [Equation]M[/Equation].
[var name:"pac_learning_3" value:0 /]
[GameChanger
longtext: "Consider just the error on the top strip, T'."
shorttext: "Error from Top Strip"
gamestate: "pac_learning_3"
progressvar: pac_learning_3
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `pac_learning_3++` scrolloffset:300
/]
Consider the top strip, which we'll call T'. Then, consider the strip T that starts from the top of the proposal region, and sweeps down in height until it has area [Equation]\frac{\epsilon}{4}[/Equation]. **If T' is contained in T, then it has area less than [Equation]\frac{\epsilon}{4}[/Equation], and we are done**. By our algorithm, we know that our T' can only be greater in area than T if we haven't seen any samples in T, since if we had, then our proposal region would contain that point. The probability that M independent draws from the distribution (i.e. new points that we see) all miss the region T (and thus that the area is less than [Equation]\frac{\epsilon}{4}[/Equation]) is exactly [Equation](1 - \frac{\epsilon}{4})^M[/Equation].
[var name:"pac_learning_4" value:0 /]
[GameChanger
longtext: "Total error from four strips is less than epsilon."
shorttext: "Error from All Strips"
gamestate: "pac_learning_4"
progressvar: pac_learning_4
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `pac_learning_4++` scrolloffset:300
/]
The same analysis holds for the other three regions, so by the union bound, the error is at most [Equation]4(1 - \frac{\epsilon}{4})^M[/Equation]. Then, we choose M to satisfy [Equation]4(1 - \frac{\epsilon}{4})^M \leq \delta[/Equation], i.e. the probability that the area is **not** bounded by [Equation]\epsilon[/Equation] is smaller than [Equation]\delta[/Equation], so the probability that it **is** bounded by [Equation]\epsilon[/Equation] is greater than [Equation]1 - \delta[/Equation].
The final step of the proof is to determine how many samples we'd need - solving for M. Using the inequality [Equation](1 - x) \leq e^{-x}[/Equation], we can simplify the bound to [Equation]4e^{\epsilon M / 4} \leq \delta[/Equation], and solve for [Equation]M \geq (4/\epsilon)ln(4/\delta)[/Equation]. This is a very common strategy for establishing bounds in these types of proofs. The key takeaway is that M is not infinity. Since M is a real, finite number, we've demonstrated that **the problem is PAC-learnable**, i.e. we have bounded the error in terms of the number of samples seen.
[SectionDivider /]
## Assuming the Worst
As the old saying goes: Don't Assume. It makes an ass out of you and me. And our proof involved many assumptions that may not be so realistic, if we imagine proving something similar for a more applied case.
While the proof contains some complicated logic, and our conclusion can be hard to interpret, it was actually a significant result. For this very simple problem (much simpler than the types of problems machine learning is being used to solve in the real world), we were able to show that the error on unseen examples would be bounded.
But not so fast - there's a major issue that we didn't mention. The proof we went through, as well as the PAC-learning paradigm in general, relies on several assumptions.
### Independent and Identically Distributed Data
[var name:"iid" value:0 /]
[GameChanger
longtext: "What if the data is not independent and identically distributed?"
shorttext: "Not IID data"
gamestate: "iid"
progressvar: iid
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `iid++` scrolloffset:300
/]
That the sampled data in independent and identically distributed (typically referred to as i.i.d). In real world data, this may not be the case; if the algorithm is being trained on streamed data, it typically cannot be considered independent draws. This assumption was used in our probability bounds.
In this version of the game, the i.i.d. assumption is violated by placing some linear drift on the data. As we get new points, they move linearly in a certain direction.
[SectionDivider /]
### Training-Testing Mismatch
[var name:"train_test_mismatch" value:0 /]
[GameChanger
longtext: "Here, the training data comes from a different distribution than the testing."
shorttext: "Train Test Mismatch"
gamestate: "train_test_mismatch"
progressvar: train_test_mismatch
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `train_test_mismatch++` scrolloffset:300
/]
We assume (or at least count on) that the data that we test our algorithm on is from the same distribution that we train our algorithm on. This assumption is frequently incorrect in practice, and can result in unexpected behavior by the machine learning algorithm. Consider a computer vision algorithm for a self-driving car that only trained in the Pacific Northwestern United States, and then was deployed in a desert climate; the behavior of the computer vision algorithm in training might have no relationship with its behavior in testing. This would correspond to the testing samples being labeled in a different region than during training.
In this version of the game, the training distribution and the testing distribution are independent of one another. This means that the true labels (green points) that you get in training may not be representative of where the true labels would appear in the testing phase. It makes it impossible to guarantee any performance: it's back to the No Free Lunch situation described earlier.
[SectionDivider /]
### Incorrect Model Class
[var name:"incorrect_model_class" value:0 /]
[GameChanger
longtext: "What if our Model Class is Wrong?"
shorttext: "Ellipses, not Rectangles"
gamestate: "incorrect_model_class"
progressvar: incorrect_model_class
parentcurrgamestate: currgamestate
parentcurrgamemsg: currgamemsg
onEnterView: `incorrect_model_class++` scrolloffset:300
/]
Our proof assumed that we already knew that the type of region we were looking for was a rectangle. But in practice, we rarely know what kind of machine learning model will match the phenomenon being modeled in the data. Suppose, instead of a rectangular region, the region we were looking for turned out to be an ellipse, or a parallellogram, or even an arbitrary, amorphous disconnected region. A geometric proof would then be much harder, or even impossible.
In this version of the game, the actual region we are trying to guess is an ellipse, but we can only choose a rectangle. Play it a few times and notice that no matter how good your guess is, you will always have some nonzero lower bound on your testing error.
[SectionDivider /]
There are other considerations for failure scenarios for machine learning algorithms. Varshney presented [link text:"an interesting treatment" url: "https://www.liebertpub.com/doi/full/10.1089/big.2016.0051"/] in 2017. In practice, there is active research into proving PAC learnability for more complex learning spaces than in this article ([link text: "PAC Learnability of Node Functions in Networked Dynamical Systems by Adiga et al. in ICML from 2019" url: "https://proceedings.mlr.press/v97/adiga19a" /], or [link text: "A Theory of PAC Learnability under Transformation Invariances by Shao et al in NeurIPS from 2022" url: "https://proceedings.neurips.cc/paper_files/paper/2022/hash/5a829e299ebc1c1615ddb09e98fb6ce8-Abstract-Conference.html" /]). However, it is likely that PAC learnability wouldn't hold in real world cases because we typically can't *prove* (or *know*) that the assumptions are going to be held. But the paradigm of PAC learnability illustrates what types of problems we have to look out for in applied machine learning.
## An Argument for Visualization in Applied Machine Learning
The game played in this blog post was hopefully a relevant illustration of some of the topics introduced. If it's helpful, it may be because it is a visual explanation. In general, visualizing the right properties of the data and the algorithm can help indicate to a data scientist whether any of the necessary assumptions are broken. The data scientist can then address them by cleaning or preprocessing the data, or choosing a different machine learning algorithm.
In this blog post, we looked at an incredibly simple machine learning problem, and the algorithms we considered were easily explainable in a sentence or two. Even for this simple problem, it was difficult to prove any limitations on error. Machine learning is typically used to model much more complex problem domains, with much more complex data and intricate, high-dimensional processes. In these cases, it is very unlikely that error bounds are proveable, and even if they are, it is unlikely that the assumptions are upheld.
Instead, a human in the loop, armed with appropriate visualizations and analytical tools, can act as a safeguard against the most endemic cases. This additional line of defense is more and more necessary as machine learning models are deployed in scenarios that directly affect humans. More discussion on this topiccan be found in chapters 1 and 7 of my thesis, [link text: "Bridging the Human-Machine Gap in Applied Machine Learning with Visual Analytics" url: "https://www.cs.brandeis.edu/~dylan/public/docs/dylan_cashman_thesis_human_machine_gap.pdf" /], which offers additional perspective on the role of visual analytics systems in empirical risk minimization.
[SectionDivider /]
## Conclusion
In this interactive article, we used the Four Germans Game to illustrate the concept of probably approximately correct (PAC) learning. We then demonstrated how several assumptions are used in learning theory to build theoretical bounds on performance for machine learning models in the empirical risk minimization learning paradigm. We presented some examples, using the game, of how these assumptions might be broken.
## Special Thanks
Thank you to Anselm Blumer and the reviewers from VisXAI.
## References
- Blumer, Anselm, et al. "Learnability and the Vapnik-Chervonenkis dimension." Journal of the ACM (JACM) 36.4 (1989): 929-965.
- Kearns, Michael J., and Umesh Vazirani. An introduction to computational learning theory. MIT press, 1994.
- Feng, Alice, et al. "The myth of the impartial machine." The Parametric Press (2019).
- Vapnik, Vladimir. "Principles of risk minimization for learning theory." Advances in neural information processing systems 4 (1991).
- Varshney, Kush R., and Homa Alemzadeh. "On the safety of machine learning: Cyber-physical systems, decision sciences, and data products." Big data 5.3 (2017): 246-255.
- Valiant, Leslie. "Probably approximately correct: nature's algorithms for learning and prospering in a complex world." (2013).
- Adiga, Abhijin, et al. "PAC learnability of node functions in networked dynamical systems." International Conference on Machine Learning. PMLR, 2019.
- Shao, Han, Omar Montasser, and Avrim Blum. "A theory of pac learnability under transformation invariances." Advances in Neural Information Processing Systems 35 (2022): 13989-14001.
- Badue, Claudine, et al. "Self-driving cars: A survey." Expert Systems with Applications 165 (2021): 113816.
- Kwon, Joon-myoung, et al. "Validation of deep-learning-based triage and acuity score using a large national dataset." PloS one 13.10 (2018): e0205836.
- Wu, Shaobing, et al. "Crime prediction using data mining and machine learning." The 8th International Conference on Computer Engineering and Networks (CENet2018). Springer International Publishing, 2020.
- Fujiyoshi, Hironobu, Tsubasa Hirakawa, and Takayoshi Yamashita. "Deep learning-based image recognition for autonomous driving." IATSS research 43.4 (2019): 244-252.
- Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on fairness, accountability and transparency. PMLR, 2018.
- Wolpert, David H., and William G. Macready. "No free lunch theorems for optimization." IEEE transactions on evolutionary computation 1.1 (1997): 67-82.
## Research Material Statements
No datasets are used in this interactive article. The data visualized within the game is either randomly generated or hardcoded in. You can view the source code for this article on its [link text: "Github repository" url: "https://github.com/dylancashman/PAC-learning-game" /]. Please see the software license below.
## About The Author
Dylan Cashman is an assistant professor in the Michtom School of Computer Science at Brandeis University in Waltham, MA. He previously worked as a Senior Expert in Data Science and Advanced Visual Analytics within the Data Science and Artificial Intelligence division of Novartis Pharmaceuticals in Cambridge, MA. Dylan holds a PhD in Computer Science from Tufts University and an Sc. B in Mathematics from Brown University. His research interests include the development and evaluation of visual affordances that improve usability of artificial intelligence models and data science processes. His research has won best paper awards at the Eurovis conference, the Symposium on Visualization for Data Science (VDS), and the Workshop on Visual Analytics for Deep Learning at IEEEVIS.
## License
This work is licensed under a [link text: "Creative Commons Attribution-ShareAlike 4.0 International License" url: "https://creativecommons.org/licenses/by-sa/4.0/" /].
## Conflict of Interest
The author declares that there are no competing interests.