-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
/
association-rule.html
498 lines (457 loc) · 21.4 KB
/
association-rule.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
---
layout: layout.njk
permalink: "{{ page.filePathStem }}.html"
---
{% include "toc.njk" %}
<div class="col-md-9 col-md-pull-3">
<h1 id="association-rule-top" class="title">Association Rule Mining</h1>
<p>Association rule learning is a popular and well researched method for
discovering interesting relations between variables in large databases.
Let I = {i<sub>1</sub>, i<sub>2</sub>,..., i<sub>n</sub>} be a set of n
binary attributes called items. Let D = {t<sub>1</sub>, t<sub>2</sub>,..., t<sub>m</sub>}
be a set of transactions called the database. Each transaction in D has a
unique transaction ID and contains a subset of the items in I.
An association rule is defined as an implication of the form X ⇒ Y
where X, Y ⊆ I and X ∩ Y = Ø. The item sets X and Y are called
antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS)
of the rule, respectively. The support supp(X) of an item set X is defined as
the proportion of transactions in the database which contain the item set.
Note that the support of an association rule X ⇒ Y is supp(X ∪ Y).
The confidence of a rule is defined conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
Confidence can be interpreted as an estimate of the probability P(Y | X),
the probability of finding the RHS of the rule in transactions under the
condition that these transactions also contain the LHS.</p>
<p>For example, the rule {onions, potatoes} ⇒ {burger} found in the sales
data of a supermarket would indicate that if a customer buys onions and
potatoes together, he or she is likely to also buy burger. Such information
can be used as the basis for decisions about marketing activities such as
promotional pricing or product placements.</p>
<p>Association rules are usually required to satisfy a user-specified minimum
support and a user-specified minimum confidence at the same time. Association
rule generation is usually split up into two separate steps:</p>
<ul>
<li>First, minimum support is applied to find all frequent item sets
in a database (i.e. frequent item set mining).
<li> Second, these frequent item sets and the minimum confidence constraint
are used to form rules.
</ul>
<h2 id="fim">Frequent Itemset Mining</h2>
<p>Finding all frequent item sets in a database is difficult since it involves
searching all possible item sets (item combinations). The set of possible
item sets is the power set over I (the set of items) and has size 2<sup>n</sup> - 1
(excluding the empty set which is not a valid item set). Although the size
of the power set grows exponentially in the number of items n in I, efficient
search is possible using the downward-closure property of support
(also called anti-monotonicity) which guarantees that for a frequent item set
also all its subsets are frequent and thus for an infrequent item set, all
its supersets must be infrequent.</p>
<p>In practice, we may only consider the frequent item set that has the maximum
number of items bypassing all the sub item sets. An item set is maximal
frequent if none of its immediate supersets is frequent.</p>
<p>For a maximal frequent item set, even though we know that all the sub item
sets are frequent, we don't know the actual support of those sub item sets,
which are very important to find the association rules within the item sets.
If the final goal is association rule mining, we would like to discover
closed frequent item sets. An item set is closed if none of its immediate
supersets has the same support as the item set.</p>
<p>Some well known algorithms of frequent item set mining are Apriori,
Eclat and FP-Growth. Apriori is the best-known algorithm to mine association
rules. It uses a breadth-first search strategy to counting the support of
item sets and uses a candidate generation function which exploits the downward
closure property of support. Eclat is a depth-first search algorithm using
set intersection.</p>
<p>FP-growth (frequent pattern growth) algorithm employs an extended prefix-tree (FP-tree) structure to
store the database in a compressed form. The FP-growth algorithm is
currently one of the fastest approaches to discover frequent item sets.
FP-growth adopts a divide-and-conquer approach to decompose both the mining
tasks and the databases. It uses a pattern fragment growth method to avoid
the costly process of candidate generation and testing used by Apriori.</p>
<p>The basic idea of the FP-growth algorithm can be described as a
recursive elimination scheme: in a preprocessing step delete
all items from the transactions that are not frequent individually,
i.e., do not appear in a user-specified minimum
number of transactions. Then select all transactions that
contain the least frequent item (least frequent among those
that are frequent) and delete this item from them. Recurse
to process the obtained reduced (also known as projected)
database, remembering that the item sets found in the recursion
share the deleted item as a prefix. On return, remove
the processed item from the database of all transactions
and start over, i.e., process the second frequent item etc. In
these processing steps the prefix tree, which is enhanced by
links between the branches, is exploited to quickly find the
transactions containing a given item and also to remove this
item from the transactions after it has been processed.</p>
<p>When the input itemsets are already in memory, the below methods can be used.
The parameter <code>itemsets</code> is the item set database,
where each row is a item set, which may have different length.
The item identifiers have to be in [0, n), where n is the number of items.
Item set should NOT contain duplicated items. Note that it is reordered after the call.
The parameter <code>minSupport</code> is the required minimum support of item sets in terms
of frequency. The return can be the list of frequent item sets. Often the output
is too big to be stored in memory. In that case, the user may provide
a <code>PrintStream</code> or file path to save the output. The return number is
the number of discovered frequent item sets.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_1" data-toggle="tab">Scala</a></li>
<li><a href="#java_1" data-toggle="tab">Java</a></li>
<li><a href="#kotlin_1" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def fpgrowth(minSupport: Int, itemsets: Array[Array[Int]]): Stream[ItemSet]
</code></pre>
</div>
</div>
<div class="tab-pane" id="java_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class FPTree {
public static FPTree of(int minSupport, int[][] itemsets);
public static FPTree of(double minSupport, int[][] itemsets);
}
public class FPGrowth {
public static Stream<ItemSet> apply(FPTree tree);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun fpgrowth(minSupport: Int, itemsets: Array<IntArray>): Stream<ItemSet>
</code></pre>
</div>
</div>
</div>
<p>In practice, even the raw input is often too big to fit into the memory. To conquer this
challenge, the below methods scan the input file twice.
We first scan the database to obtains the frequency of
single items. Then we scan the data again to construct the FP-Tree, which
is a compressed form of data.
In this way, we don't need load the whole database into the main memory.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_2" data-toggle="tab">Scala</a></li>
<li><a href="#java_2" data-toggle="tab">Java</a></li>
<li><a href="#kotlin_2" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def fptree(minSupport: Int, supplier: Supplier[Stream[Array[Int]]]): FPTree
def fpgrowth(tree: FPTree): Stream[ItemSet]
</code></pre>
</div>
</div>
<div class="tab-pane" id="java_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class FPTree {
public static FPTree of(int minSupport, Supplier<Stream<int[]>> supplier);
public static FPTree of(double minSupport, Supplier<Stream<int[]>> supplier);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun fptree(minSupport: Int, supplier: Supplier<Stream<IntArray>>): FPTree
fun fpgrowth(tree: FPTree): Stream<ItemSet>
</code></pre>
</div>
</div>
</div>
<p>In the below example, we apply FP-Growth to a toy data.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_3" data-toggle="tab">Scala</a></li>
<li><a href="#java_3" data-toggle="tab">Java</a></li>
<li><a href="#kotlin_3" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> val itemsets = Array(
Array(1, 3),
Array(2),
Array(4),
Array(2, 3, 4),
Array(2, 3),
Array(2, 3),
Array(1, 2, 3, 4),
Array(1, 3),
Array(1, 2, 3),
Array(1, 2, 3)
)
smile> fpgrowth(3, itemsets).forEach { itemset => println(itemset) }
4 (3)
1 (5)
2 1 (3)
3 2 1 (3)
3 1 (5)
2 (7)
3 2 (6)
3 (8)
</code></pre>
</div>
</div>
<div class="tab-pane" id="java_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
jshell> import smile.association.*
jshell> int[][] itemsets = {
...> {1, 3},
...> {2},
...> {4},
...> {2, 3, 4},
...> {2, 3},
...> {2, 3},
...> {1, 2, 3, 4},
...> {1, 3},
...> {1, 2, 3},
...> {1, 2, 3}
...> }
itemsets ==> int[10][] { int[2] { 1, 3 }, int[1] { 2 }, int[1] ... 3 }, int[3] { 1, 2, 3 } }
jshell> var tree = FPTree.of(3, itemsets)
tree ==> smile.association.FPTree@2a7b6f69
jshell> FPGrowth.apply(tree).forEach(itemset -> System.out.println(itemset))
4 (3)
1 (5)
2 1 (3)
3 2 1 (3)
3 1 (5)
2 (7)
3 2 (6)
3 (8)
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
>>> val itemsets = arrayOf(
intArrayOf(1, 3),
intArrayOf(2),
intArrayOf(4),
intArrayOf(2, 3, 4),
intArrayOf(2, 3),
intArrayOf(2, 3),
intArrayOf(1, 2, 3, 4),
intArrayOf(1, 3),
intArrayOf(1, 2, 3),
intArrayOf(1, 2, 3)
)
>>> fpgrowth(3, itemsets).forEach { itemset -> println(itemset) }
4 (3)
1 (5)
2 1 (3)
3 2 1 (3)
3 1 (5)
2 (7)
3 2 (6)
3 (8)
</code></pre>
</div>
</div>
</div>
<p>Each row is a frequent item set, ending with the frequency in parenthesis.
The sample data directory <code>data/transaction</code> contains a couple of
large benchmark datasets.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_4" data-toggle="tab">Scala</a></li>
<li><a href="#java_4" data-toggle="tab">Java</a></li>
<li><a href="#kotlin_4" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> val tree = fptree(1000, () => {
smile.util.Paths.getTestDataLines("transaction/kosarak.dat").map { line =>
line.split(" ").map(_.toInt).toArray
}
})
smile> fpgrowth(tree).limit(10).forEach { itemset => println(itemset) }
5634 (1000)
3805 (1001)
3376 (1001)
2279 (1001)
6333 (1002)
243 (1002)
808 (1003)
3875 (1004)
2265 (1004)
996 (1004)
</code></pre>
</div>
</div>
<div class="tab-pane" id="java_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
jshell> var tree = FPTree.of(1000, () -> {
...> try {
...> return smile.util.Paths.getTestDataLines("transaction/kosarak.dat").
...> map(line -> Arrays.stream(line.split(" ")).
...> mapToInt(Integer::valueOf).
...> toArray()
...> );
...> } catch (IOException ex) {
...> ex.printStackTrace();
...> }
...>
...> return Stream.empty();
...> })
tree ==> smile.association.FPTree@3dddbe65
jshell> FPGrowth.apply(tree).limit(10).forEach(itemset -> System.out.println(itemset))
5634 (1000)
3805 (1001)
3376 (1001)
2279 (1001)
6333 (1002)
243 (1002)
808 (1003)
3875 (1004)
2265 (1004)
996 (1004)
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
>>> class Parser : Supplier<Stream<IntArray>> {
override fun get(): Stream<IntArray> {
return smile.util.Paths.getTestDataLines("transaction/kosarak.dat").map { line ->
line.split(" ").map({ w -> w.toInt() }).toIntArray()
}
}
}
>>> val tree = fptree(1000, Parser())
>>> fpgrowth(tree).limit(10).forEach { itemset -> println(itemset) }
5634 (1000)
3805 (1001)
3376 (1001)
2279 (1001)
6333 (1002)
243 (1002)
808 (1003)
3875 (1004)
2265 (1004)
996 (1004)
</code></pre>
</div>
</div>
</div>
<p>The data <code>kosarak.data</code> has 990,002 transactions.
With minimum support 1000, our implementation generates 711,424
frequent item sets in about 30 seconds on a modern laptop.</p>
<h2 id="arm">Association Rules</h2>
<p>After mining frequent itemsets, it is straightforward to
form association rules that meet minimum confidence constraint.
Our implementation generates association rules in a storage efficient way
by employing the total support tree that is a kind of compressed set enumeration tree.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_5" data-toggle="tab">Scala</a></li>
<li><a href="#java_5" data-toggle="tab">Java</a></li>
<li><a href="#kotlin_5" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def arm(minSupport: Int, confidence: Double, itemsets: Array[Array[Int]]): Stream[AssociationRule]
def arm(confidence: Double, tree: FPTree): Stream[AssociationRule]
</code></pre>
</div>
</div>
<div class="tab-pane" id="java_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class ARM {
public static Stream<AssociationRule> apply(double confidence, FPTree tree);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
def arm(minSupport: Int, confidence: Double, itemsets: Array<IntArray>): Stream<AssociationRule>
def arm(confidence: Double, tree: FPTree): Stream<AssociationRule>
</code></pre>
</div>
</div>
</div>
<p>The API is similar to <code>fpgrowth</code> except the additional parameter <code>confidence</code>.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#scala_6" data-toggle="tab">Scala</a></li>
<li><a href="#java_6" data-toggle="tab">Java</a></li>
<li><a href="#kotlin_6" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="scala_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> arm(0.6, tree).limit(10).forEach { rule => println(rule) }
(11) => (6) support = 32.73% confidence = 89.00% lift = 1.47 leverage = 0.1039
(3, 11) => (6) support = 14.51% confidence = 89.09% lift = 1.47 leverage = 0.0462
(1) => (6) support = 13.34% confidence = 66.89% lift = 1.10 leverage = 0.0123
(3, 1) => (6) support = 5.84% confidence = 68.28% lift = 1.12 leverage = 0.0064
(6, 1) => (11) support = 8.70% confidence = 65.17% lift = 1.77 leverage = 0.0379
(11, 1) => (6) support = 8.70% confidence = 93.70% lift = 1.54 leverage = 0.0306
(6, 3, 1) => (11) support = 3.81% confidence = 65.30% lift = 1.78 leverage = 0.0167
(3, 11, 1) => (6) support = 3.81% confidence = 93.73% lift = 1.54 leverage = 0.0134
(218) => (6) support = 7.85% confidence = 87.67% lift = 1.44 leverage = 0.0241
(3, 218) => (6) support = 3.43% confidence = 87.86% lift = 1.45 leverage = 0.0106
</code></pre>
</div>
</div>
<div class="tab-pane" id="java_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
jshell> ARM.apply(0.6, tree).limit(10).forEach(rule -> System.out.println(rule))
(11) => (6) support = 32.73% confidence = 89.00% lift = 1.47 leverage = 0.1039
(3, 11) => (6) support = 14.51% confidence = 89.09% lift = 1.47 leverage = 0.0462
(1) => (6) support = 13.34% confidence = 66.89% lift = 1.10 leverage = 0.0123
(3, 1) => (6) support = 5.84% confidence = 68.28% lift = 1.12 leverage = 0.0064
(6, 1) => (11) support = 8.70% confidence = 65.17% lift = 1.77 leverage = 0.0379
(11, 1) => (6) support = 8.70% confidence = 93.70% lift = 1.54 leverage = 0.0306
(6, 3, 1) => (11) support = 3.81% confidence = 65.30% lift = 1.78 leverage = 0.0167
(3, 11, 1) => (6) support = 3.81% confidence = 93.73% lift = 1.54 leverage = 0.0134
(218) => (6) support = 7.85% confidence = 87.67% lift = 1.44 leverage = 0.0241
(3, 218) => (6) support = 3.43% confidence = 87.86% lift = 1.45 leverage = 0.0106
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
>>> arm(0.6, tree).limit(10).forEach { rule -> println(rule) }
(11) => (6) support = 32.73% confidence = 89.00% lift = 1.47 leverage = 0.1039
(3, 11) => (6) support = 14.51% confidence = 89.09% lift = 1.47 leverage = 0.0462
(1) => (6) support = 13.34% confidence = 66.89% lift = 1.10 leverage = 0.0123
(3, 1) => (6) support = 5.84% confidence = 68.28% lift = 1.12 leverage = 0.0064
(6, 1) => (11) support = 8.70% confidence = 65.17% lift = 1.77 leverage = 0.0379
(11, 1) => (6) support = 8.70% confidence = 93.70% lift = 1.54 leverage = 0.0306
(6, 3, 1) => (11) support = 3.81% confidence = 65.30% lift = 1.78 leverage = 0.0167
(3, 11, 1) => (6) support = 3.81% confidence = 93.73% lift = 1.54 leverage = 0.0134
(218) => (6) support = 7.85% confidence = 87.67% lift = 1.44 leverage = 0.0241
(3, 218) => (6) support = 3.43% confidence = 87.86% lift = 1.45 leverage = 0.0106
</code></pre>
</div>
</div>
</div>
<div id="btnv">
<span class="btn-arrow-left">← </span>
<a class="btn-prev-text" href="vector-quantization.html" title="Previous Section: Vector Quantization"><span>Vector Quantization</span></a>
<a class="btn-next-text" href="mds.html" title="Next Section: Multi-Dimensional Scaling"><span>Multi-Dimensional Scaling</span></a>
<span class="btn-arrow-right"> →</span>
</div>
</div>
<script type="text/javascript">
$('#toc').toc({exclude: 'h1, h5, h6', context: '', autoId: true, numerate: false});
</script>