-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
/
emst.txt
148 lines (117 loc) · 5.09 KB
/
emst.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
/*!
@file emst.txt
@author Bill March
@brief Tutorial for the Euclidean Minimum Spanning Tree algorithm.
@page emst_tutorial EMST Tutorial
@section intro_emsttut Introduction
The Euclidean Minimum Spanning Tree problem is widely used in machine learning
and data mining applications. Given a set \f$S\f$ of points in \f$\mathbf{R}^d\f$,
our task is to compute lowest weight spanning tree in the complete graph on \f$S\f$
with edge weights given by the Euclidean distance between points.
Among other applications, the EMST can be used to compute hierarchical clusterings
of data. A <em>single-linkage clustering</em> can be obtained from the EMST by deleting
all edges longer than a given cluster length. This technique is also referred to as a <em>Friends-of-Friends</em> clustering in the astronomy literature.
mlpack includes an implementation of <b>Dual-Tree Boruvka</b> which uses
\f$kd\f$-trees by default; this is the empirically and theoretically fastest
EMST algorithm. In addition, the implementation supports the use of different
trees via templates. For more details, see the following paper:
@code
@inproceedings{march2010fast,
title={Fast {E}uclidean minimum spanning tree: algorithm, analysis, and
applications},
author={March, William B. and Ram, Parikshit and Gray, Alexander G.},
booktitle={Proceedings of the 16th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD '10)},
pages={603--612},
year={2010},
organization={ACM}
}
@endcode
\b mlpack provides:
- a \ref cli_emsttut "simple command-line executable" to compute the EMST of a given data set
- a \ref dtb_emsttut "simple C++ interface" to compute the EMST
@section toc_emsttut Table of Contents
A list of all the sections this tutorial contains.
- \ref intro_emsttut
- \ref toc_emsttut
- \ref cli_emsttut
- \ref dtb_emsttut
- \ref further_doc_emsttut
@section cli_emsttut Command-Line 'EMST'
The \c mlpack_emst executable in \b mlpack will compute the EMST of a given set
of points and store the resulting edge list to a file.
The output file contains an edge list representation of the MST in an
\f$n-1 \times 3 \f$ matrix, where the first and second columns are labels of
points and the third column is the edge weight. The edges are sorted in order
of increasing weight.
Below are several examples of simple usage (and the resultant output). The
\c -v option is used so that verbose output is given. Further documentation on
each individual option can be found by typing
@code
$ mlpack_emst --help
@endcode
@code
$ mlpack_emst --input_file=dataset.csv --output_file=edge_list.csv -v
[INFO ] Reading in data.
[INFO ] Loading 'dataset.csv' as CSV data.
[INFO ] Data read, building tree.
[INFO ] Tree built, running algorithm.
[INFO ] 4 edges found so far.
[INFO ] 5 edges found so far.
[INFO ] Total spanning tree length: 1002.45
[INFO ] Saving CSV data to 'edge_list.csv'.
[INFO ]
[INFO ] Execution parameters:
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_file: dataset.csv
[INFO ] leaf_size: 1
[INFO ] naive: false
[INFO ] output_file: edge_list.csv
[INFO ] verbose: true
[INFO ]
[INFO ] Program timers:
[INFO ] emst/mst_computation: 0.000179s
[INFO ] emst/tree_building: 0.000061s
[INFO ] total_time: 0.052641s
@endcode
The code performs at most \f$\log N\f$ iterations for \f$N\f$ data points. It will print an update on the number of MST edges found after each iteration.
Convenient program timers are given for different parts of the calculation at
the bottom of the output, as well as the parameters the simulation was run with.
@code
$ cat dataset.csv
0, 0
1, 1
3, 3
0.5, 0
1000, 0
1001, 0
$ cat edge_list.csv
0.0000000000e+00,3.0000000000e+00,5.0000000000e-01
4.0000000000e+00,5.0000000000e+00,1.0000000000e+00
1.0000000000e+00,3.0000000000e+00,1.1180339887e+00
1.0000000000e+00,2.0000000000e+00,2.8284271247e+00
2.0000000000e+00,4.0000000000e+00,9.9700451353e+02
@endcode
The input points are labeled 0-5. The output tells us that the MST connects
point 0 to point 3, point 4 to point 5, point 1 to point 3, point 1 to point 2,
and point 2 to point 4, with the corresponding edge weights given in the third
column. The total length of the MST is also given in the verbose output.
Note that it is also possible to compute the EMST using a naive (\f$O(N^2)\f$)
algorithm for timing and comparison purposes, using the \c --naive option.
@section dtb_emsttut The 'DualTreeBoruvka' class
The 'DualTreeBoruvka' class contains our implementation of the Dual-Tree Boruvka
algorithm.
The class has two constructors: the first takes the data set, constructs the
tree (where the type of tree constructed is the TreeType template parameter),
and computes the MST. The second takes data set and an already constructed
tree.
The class provides one method that performs the MST computation:
@code
void ComputeMST(const arma::mat& results);
@endcode
This method stores the computed MST in the matrix results in the format given above.
@section further_doc_emsttut Further documentation
For further documentation on the DualTreeBoruvka class, consult the
\ref mlpack::emst::DualTreeBoruvka "complete API documentation".
*/