-
Notifications
You must be signed in to change notification settings - Fork 62
/
dataloadbal.cc
304 lines (252 loc) · 11.5 KB
/
dataloadbal.cc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
/*
This file is part of MADNESS.
Copyright (C) 2007,2010 Oak Ridge National Laboratory
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
For more information please contact:
Robert J. Harrison
Oak Ridge National Laboratory
One Bethel Valley Road
P.O. Box 2008, MS-6367
email: harrisonrj@ornl.gov
tel: 865-241-3937
fax: 865-572-0680
$Id$
*/
/*!
\file examples/dataloadbal.cc
\brief Illustrates how to use static data/load balancing of functions
\defgroup loadbaleg Data and load balancing
\ingroup examples
The source is <a href=https://github.com/m-a-d-n-e-s-s/madness/blob/master/src/examples/dataloadbal.cc>here</a>.
This is one of the more computationally demanding examples - either run it on
50 or more nodes of jaguar or reduce the number of functions employed
(value of \c NFUNC in the source).
\par Points of interest
- Using functors to incorporate state into functions to be compressed
- Using special points in a functor to provide hints to adaptive refinement algorithm
- Operating on multiple functions in a vector
- Heuristic analysis of function trees to generate a new mapping of data to processor
- Installation of the new process map with automatic redistribution of data
- Timing to estimate benefit of load balancing
\par Background
Poor distribution of work (load imbalance) is the largest reason for
inefficient parallel execution within MADNESS. Poor data distribution
(data imbalance) contributes to load imbalance and also leads to
out-of-memory problems due to one or more processes having too much
data. Thus, we are interested in uniform distribution of both
work and data.
Many operations in MADNESS are entirely data driven (i.e.,
computation occurs in the processor that "owns" the data) since
there is insufficient work to justify moving data between processes
(e.g., computing the inner product between functions). However, a
few expensive operations can have work shipped to other processors.
There are presently three load balancing mechanisms within MADNESS
- static and driven by the distribution of data,
- dynamic via random assignment of work, and
- dynamic via work stealing (currently only in prototype).
Until the work stealing becomes production quality we must exploit
the first two forms. The random work assignment is controlled by
options in the FunctionDefaults class.
- FunctionDefaults::set_apply_randomize(bool) controls the use of
randomization in applying integral (convolution) operators. It is
typically beneficial when computing to medium/high precision.
- FunctionDefaults::set_project_randomize(bool) constrols the use
of randomization in projecting from an analytic form (i.e., C++)
into the discontinuous spectral element basis. It is typically
beneficial unless there is already a good static data distribution.
Since these options are straightforward to enable, this example
focusses on static data redistribution.
The process map (an instance of WorldDCPmapInterface) controls
mapping of data to processors and it is actually quite easy to write
your own (e.g., see WorldDCDefaultPmap or LevelPmap) that ensure
uniform data distribution. However, you also seek to incorporate
estimates of the computational cost into the distribution. The
class LBDeuxPmap (deux since it is the second such class) in
mra/lbdeux.h does this by examining the functions you request and
using provided weights to estimate the computational cost.
Communication costs are proportional to the number of broken links
in the tree. Since some operations work in the scaling function
basis, some in the wavelet basis, and some in non-standard form,
there is an element of empiricism in getting best performance from
most algorithms.
\par Implementation
The example times how long it takes to perform the following operations
- constructing 4000 Gaussian functions with random exponent and origin,
- truncating them,
- differentiating them, and
- applying the Coulomb Green's function to them.
The process map (data distribution) is then modified using the LBDeux
heuristic and the operations repeated.
\par Results
\verbatim
/tmp/work/harrison % aprun -n 50 -d 12 -cc none ./dataloadbal
Runtime initialized with 10 threads in the pool and affinity 1 0 2
Before load balancing
project 2.57 truncate 9.18 differentiate 11.80 convolve 12.57 balance 0.00
project 2.74 truncate 9.08 differentiate 11.33 convolve 13.07 balance 0.00
project 2.80 truncate 8.84 differentiate 10.76 convolve 12.66 balance 2.19
After load balancing
project 2.96 truncate 2.56 differentiate 3.28 convolve 9.81 balance 0.00
project 3.05 truncate 2.61 differentiate 3.46 convolve 9.24 balance 0.00
project 3.01 truncate 2.60 differentiate 3.25 convolve 9.30 balance 0.00
\endverbatim
*/
#include <madness/mra/mra.h>
#include <madness/mra/operator.h>
#include <madness/mra/vmra.h>
#include <madness/mra/lbdeux.h>
#include <madness/constants.h>
using namespace madness;
static const int NFUNC = 4;
// A class that behaves like a function to compute a Gaussian of given origin and exponent
class Gaussian : public FunctionFunctorInterface<double,3> {
public:
const coord_3d center;
const double exponent;
const double coefficient;
std::vector<coord_3d> specialpt;
Gaussian(const coord_3d& center, double exponent, double coefficient)
: center(center), exponent(exponent), coefficient(coefficient), specialpt(1)
{
specialpt[0][0] = center[0];
specialpt[0][1] = center[1];
specialpt[0][2] = center[2];
}
// MADNESS will call this interface
double operator()(const coord_3d& x) const {
double sum = 0.0;
for (int i=0; i<3; i++) {
double xx = center[i]-x[i];
sum += xx*xx;
};
return coefficient*exp(-exponent*sum);
}
// By default, adaptive projection into the spectral element basis
// starts uniformly distributed at the initial level. However, if
// a function is "spiky" it may be necessary to project at a finer
// level but doing this uniformly is expensive. This method
// enables us to tell MADNESS about points/areas needing deep
// refinement (the default is no special points).
std::vector<coord_3d> special_points() const {
return specialpt;
}
};
// Makes a new square-normalized Gaussian functor with random origin and exponent
real_functor_3d random_gaussian() {
const double expntmin=1e-1;
const double expntmax=1e4;
const real_tensor& cell = FunctionDefaults<3>::get_cell();
coord_3d origin;
for (int i=0; i<3; i++) {
origin[i] = RandomValue<double>()*(cell(i,1)-cell(i,0)) + cell(i,0);
}
double lo = log(expntmin);
double hi = log(expntmax);
double expnt = exp(RandomValue<double>()*(hi-lo) + lo);
double coeff = pow(2.0*expnt/constants::pi,0.75);
return real_functor_3d(new Gaussian(origin,expnt,coeff));
}
// This structure is used to estimate the cost of computing on a block of coefficients
// The constructor saves an estimate of the relative cost of computing on
// leaf (scaling function) or interior (wavelet) coefficients. Unless you
// are doing nearly exclusively just one type of operation the speed is
// not that sensitive to the choice, so the default is usually OK.
//
// The operator() method is invoked for each block of coefficients
// (function node) to estimate the cost. Since pretty much everything
// involves work at the top of the tree we give levels 0 and 1 a 100x
// multiplier to avoid them being a bottleneck.
struct LBCost {
double leaf_value;
double parent_value;
LBCost(double leaf_value=1.0, double parent_value=1.0)
: leaf_value(leaf_value)
, parent_value(parent_value)
{}
double operator()(const Key<3>& key, const FunctionNode<double,3>& node) const {
if (key.level() <= 1) {
return 100.0*(leaf_value+parent_value);
}
else if (node.is_leaf()) {
return leaf_value;
}
else {
return parent_value;
}
}
};
void test(World& world, bool doloadbal=false) {
double start;
vector_real_function_3d f(NFUNC);
default_random_generator.setstate(99); // Ensure all processes have the same state
// By default a sychronization (fence) is performed after projecting a new function
// but if you are projecting many functions this is inefficient. Here we turn
// off the fence when projecting, and do it manually just once.
start = wall_time();
for (int i=0; i<NFUNC; i++)
f[i] = real_factory_3d(world).functor(random_gaussian()).nofence();
world.gop.fence();
double projection = wall_time() - start;
start = wall_time();
truncate(world, f);
double truncation = wall_time() - start;
start = wall_time();
Derivative<double,3> Dx(world,0);
apply(world, Dx, f); // Computes vector of derivatives and discards result
double differentiation = wall_time() - start;
// Make the Coulomb operator
real_convolution_3d op = CoulombOperator(world, 1e-2, 1e-6);
start = wall_time();
apply(world, op, f); // Applies Coulomb GF and discards result
double convolution = wall_time() - start;
start = wall_time();
if (doloadbal) {
LoadBalanceDeux<3> lb(world);
for (int i=0; i<NFUNC; i++)
lb.add_tree(f[i], LBCost(2.0,1.0));
// Calling redistribute() installs the new process map and
// redistributes all functions using the old process map.
// This is almost always what you want. However, since we
// know that we are about to throw away our functions (f) we
// could simply call set_pmap() that installs the new map but
// does not redistribute.
FunctionDefaults<3>::redistribute(world, lb.load_balance(2.0,false));
}
double loadbal = wall_time() - start;
if (world.rank() == 0) printf("project %.2f truncate %.2f differentiate %.2f convolve %.2f balance %.2f\n",
projection, truncation, differentiation, convolution, loadbal);
}
int main(int argc, char** argv) {
// Initialize the parallel programming environment
initialize(argc,argv);
World world(SafeMPI::COMM_WORLD);
// Load info for MADNESS numerical routines
startup(world,argc,argv);
// Set/override defaults
FunctionDefaults<3>::set_cubic_cell(-20,20);
FunctionDefaults<3>::set_apply_randomize(false);
FunctionDefaults<3>::set_project_randomize(false);
FunctionDefaults<3>::set_truncate_on_project(true);
// First three without data redistribution
if (world.rank() == 0) print("Before load balancing");
test(world);
test(world);
test(world, true);
// At end of last test data was redistributed, repeat again three times
if (world.rank() == 0) print("After load balancing");
test(world);
test(world);
test(world);
finalize();
return 0;
}