This is the regression test suite for PLFS.
Description:
The regression suite pulls the needed sources into its environment and runs
a series of tests against that configuration. It is designed to be self-
reliant: when running tests it should not call on anything outside its own
directory structure. This way the state of the regression environment can be
understood simply by looking at what is currently in the regression directory
structure and log files.
A regression run has four phases:
- build phase: all dependencies are retrieved and built, if necessary.
- configuration and directory check: check the required configuration files
of all packages and verify that the plfs-related directories can be created.
- test submittal: tests are submitted. If no tests are submitted (either by
configuration or because of errors), this phase becomes the final phase for a
regression run. If specified, a summary email will be sent.
- test checking: if at least one test is run, this last stage is entered.
Each test that was run will have its results checked and logged. If
specified, a summary email will be sent.
Information on writing tests for the regression suite is located in:
tests/README.
Dependencies:
- Whatever compiler is desired should be set up either through login
scripts or in the regression suite's config file (see below).
- The following packages/programs are required:
- Python >= 2.5
- PLFS: can be either in the form of source or an already built version.
PLFS source is used in at least one test, so it is required even when
providing an already built version of PLFS. When given source, the
regression suite will build PLFS.
- MPI (for running parallel tests): can be either in the form of a source
tarball or an MPI installation that has already been patched for PLFS. If a
tarball is given, MPI will be built.
NOTE: if MPI is to be built, make sure the correct versions of the autotools
(m4, autoconf, automake, libtool) are already in your PATH and
LD_LIBRARY_PATH; otherwise, the build of MPI may fail.
NOTE: at this time, only Open MPI tarballs are supported.
- fs_test (for running specific I/O patterns; many tests use it): can be
in the form of source or an already compiled version. It is available in
the fs-test project (https://github.com/fs-test). If source is given,
fs_test will be compiled against the regression suite's PLFS and MPI.
- experiment_management (for creating test commands tailored to the system
being run on): available in the fs-test project
(https://github.com/fs-test). This package does not require compilation.
- job scheduler (for submitting and tracking parallel tests): the regression
suite relies on the command 'checkjob' to determine whether submitted jobs
have completed. Additionally, the following line is expected to be in
checkjob's output:
Status <status>
The regression suite will look for the following values for <status>:
Completed or Removed. If the job scheduler on the system does not provide a
checkjob command with that output format, the regression suite will be unable
to determine when jobs complete and will error out. The MOAB job scheduler is
known to provide this functionality. (A sketch of this kind of check follows
this item.)
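As an illustration only (the regression suite's actual check lives in its
Python scripts), a shell sketch of the parsing described above, assuming the
job id is in $JOBID:
    # Illustrative sketch; assumes checkjob output contains a line of the
    # form "Status <status>" as described above.
    status=$(checkjob "$JOBID" | awk '/Status/ {print $2}')
    case "$status" in
        Completed|Removed) echo "job $JOBID has finished" ;;
        *)                 echo "job $JOBID is still queued or running" ;;
    esac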
- A working ~/.plfsrc file consistent with the version of plfs that is to be
tested. It is not necessary that all of the directories for the PLFS mount
points and backends exist; the regression suite will check for these
directories and attempt to create them if they are missing. The backends
should be located on a file system that is shared between all nodes (compute
nodes as well as the node that will run run_plfs_regression.sh). Make sure
that the user has write permissions for all paths. Mount points and backends
that the user does not have permission to create, but does have write
permission on, are usable only if all of the required directories already
exist. In this way, a system configuration of PLFS can be tested. Tests
within the regression suite will use these mount points, experiment_management
input (see below) and run-time information to construct the file targets used
during each test. Without the applicable experiment_management setting,
targets will be placed directly into the mount point. (A schematic plfsrc
follows this item.)
Note about the check for directories and plfs rc files:
The checks for plfs-related directories and rc files happen after the build
phase. If the build phase was successful but any of these checks fail, it is
not necessary to rerun the build phase. Instead, use run_plfs_regression.sh's
'--nobuild' option to skip the build phase. The checks will still be done
before the test submittal phase.
Fuse tests that run on compute nodes will make sure the mount_point
specified in the plfsrc is present. This makes it possible to use a
location such as /var/tmp without users having to log in to each compute
node individually to make sure the mount point exists. This check for the
mount_point's existence is actually done by any fuse test, regardless of
where the test runs, provided the test was written correctly.
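For orientation only, a minimal plfsrc sketch using hypothetical paths; the
exact keywords and layout depend on the PLFS version being tested, so consult
that version's plfsrc documentation rather than copying this verbatim:
    # Hypothetical example; adjust keywords and paths for the PLFS version
    # under test. The backends path should be on a shared file system.
    mount_point /mnt/plfs
    backends /scratch1/<user>/plfs_store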
- Regarding experiment_management, an rc file must be present and must define
experiment_management's runcommand, ppn, and outdir, as individual tests will
not pass these on the command line when invoking experiment_management. In
addition, msub should be defined to include "-j oe", as many of the tests
expect to find only one output file. (A schematic rc file follows this item.)
Note: outdir should be a relative path for now. It is expected that this
directory will be placed in the test's directory, and absolute paths don't
guarantee this.
Note about the check for experiment_management rc files:
Like the check for plfs rc files, the check for experiment_management rc
files is done after the build stage. If the build phase was successful but
the check on experiment_management rc files fails, it is not necessary to
rerun the build phase; use '--nobuild' to skip the build stage. The check of
the rc files will be done before the test submittal stage.
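The exact rc file name and syntax are defined by experiment_management
itself; the following is only a schematic of the settings named above, with
hypothetical values, shown as simple key/value lines:
    # Schematic only; consult experiment_management's documentation for the
    # real rc file name and syntax.
    runcommand mpirun
    ppn 16
    outdir output                          # keep this relative (see note above)
    msub -j oe
    rs_mnt_append_path regression_targets  # optional (see below)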
- A config file in the Regression directory. A sample is provided as
config.sample. Copy this file to config (in the same directory) and edit
config accordingly. All information about where to find dependencies is
entered in that file. It is heavily commented in an attempt to explain all
the different options available.
- MY_MPI_HOST environment variable set as in the experiment_management
framework. This variable needs to be set in such a way that it is defined
on compute nodes. Set it in login scripts or in the env_customize.sh
script (covered in the Quick Start section).
- MPI_CC environment variable defined as the command used to compile parallel
programs, for example mpicc (generic MPI implementations) or cc (Cray).
(A sketch of setting both variables follows this item.)
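A minimal sketch of setting both variables in a login script; the values
here are hypothetical:
    # Hypothetical values; use the host name and compiler wrapper
    # appropriate for your system.
    export MY_MPI_HOST=mycluster
    export MPI_CC=mpicc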
- If email capabilities are wanted, /bin/mail must be available.
- If Open MPI is to be built by the regression suite, a platform file must
be used when compiling Open MPI to make it aware of the location of the
PLFS library. A sample platform file is provided as
openmpi_platform_file.sample. It contains the minimum settings necessary
to make Open MPI aware of the PLFS library. It can be copied to
openmpi_platform_file and modified, but the REPLACE_PLFS_* placeholders must
remain so that the regression suite can accurately generate a usable
platform file. The regression suite will first check for the existence of
openmpi_platform_file. If it is not present, the regression suite will
check for openmpi_platform_file.sample. This way, if a custom platform file
is not needed, users can do nothing and the sample will be used.
- If putting test target files directly into the PLFS mount point is not
desired, specify rs_mnt_append_path in experiment_management's rc file.
Assign it a path relative to the mount point. Tests will append this path to
the path of the mount point when constructing paths for test file targets:
/<mount_point>/<rs_mnt_append_path>/<test file target>
The regression suite will make sure the directory specified by
rs_mnt_append_path exists on all backends, which ensures that it exists
as /<mount_point>/<rs_mnt_append_path>.
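For example (hypothetical values), with a mount point of /mnt/plfs and
rs_mnt_append_path set to regression_targets, a test target named testfile
would be placed at:
    /mnt/plfs/regression_targets/testfile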
Usage:
./run_plfs_regression.sh [OPTIONS ...]
Running run_plfs_regression.sh without options will start a regression run
with a configuration consistent with what is set up in the config file and
default values. Please see config.sample for descriptions of all of the options.
Use run_plfs_regression.sh's '-h|--help' switch to list the options that can be
used and their effect on the parameters in the config file. Parameters set on
the command line override those set in the config file.
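Some example invocations, using options that appear elsewhere in this README:
    ./run_plfs_regression.sh                    # full run driven by the config file
    ./run_plfs_regression.sh --notests          # build and check only; submit no tests
    ./run_plfs_regression.sh --nobuild          # skip the build phase
    ./run_plfs_regression.sh --nobuild --testtypes=2,3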
Quick Start/Sanity Check:
1) Start run_plfs_regression.sh with the --notests command line switch after
setting up the config file. Deal with any issues in building. Logs are
located in the logs directory.
2) Run a simple test: tests/write_read_no_error or tests/adio_write_read.
Both of these tests are simple write and read tests.
Change directory into either of the above and run ./reg_test.py. A single
job should be submitted. When the job has finished, run ./check_results.py.
Deal with any issues in getting the test to run; this may require doing
step 1 again.
If there are environment issues, consider creating
tests/utils/env_customize.sh to deal with them. This file will be sourced
by tests/utils/rs_env_init.sh, which is the main script that makes sure
the environment is set up to use the regression suite's executables and
libraries. tests/utils/env_customize.sh will be sourced by
run_plfs_regression.sh and by tests running on compute nodes, so it may be
necessary to put conditionals in env_customize.sh to make sure the proper
hosts all get the right environment. env_customize.sh must be written in
bash or sh.
As an example, suppose a library is found by default on a machine's frontend,
but not on the compute nodes. env_customize.sh could modify LD_LIBRARY_PATH
based on compute node names so that the library is found when it comes time
to run binaries on the compute nodes. (A sketch of such a script follows this
list.)
3) If the above worked, it should be safe to run the whole regression suite
(e.g. run_plfs_regression.sh --nobuild --testtypes=2,3).
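A minimal env_customize.sh sketch for the scenario described in step 2; the
host name pattern and library path are hypothetical, so adjust them for the
system at hand (the file is sourced, so no shebang is needed):
    # Hypothetical tests/utils/env_customize.sh: extend LD_LIBRARY_PATH only
    # on compute nodes, whose host names are assumed to look like nid00123.
    case "$(hostname)" in
        nid*)
            LD_LIBRARY_PATH=/opt/somelib/lib:$LD_LIBRARY_PATH
            export LD_LIBRARY_PATH
            ;;
    esac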
Files and directories in the regression suite:
- There should only ever be two entry points into the regression suite:
run_plfs_regression.sh and check_tests.py. As such, there are only two exit
points for the regression suite: those same files. That said, the regression
scripts have been built with modularity in mind, so it should be possible to
run each script separately if need be.
- config.sample: sample config file. A copy of this file, named config and
edited as needed, should be in the Regression directory.
- run_plfs_regression.sh: The main regression script. It builds the needed
packages from source or links to already existing binaries and libraries. If
there are any issues with these steps, run_plfs_regression.sh will end the
regression run and send an email notification to the addresses specified in
the config file. If all needed packages are present, this script will call
submit_tests.py to submit tests and then pass control to check_tests.py.
- submit_tests.py: python script that submits the tests located in the
tests directory. It uses a control file to know which tests to submit.
submit_tests.py is also where the number of different test types is defined.
To add more types of tests, edit the main and parse_args methods in this
file.
- check_tests.py: python script that waits for tests to finish and then
reports on the results. It is designed to be run on its own. That is,
run_plfs_regression.sh passes control to check_tests.py and then exits, and
check_tests.py finishes up the regression on its own. It can also be
started at any time to re-check tests, and it can be instructed to wait for
jobs to complete or to check the results of tests immediately. At this time,
check_tests.py requires a file that lists what tests were submitted. See
below for more information.
- restart_check_tests.sh: Script that can be run through cron to find out if
check_tests.py should be restarted. Since check_tests.py is supposed to wait
for tests to complete but has no process to report to
(run_plfs_regression.sh has already exited), it is possible that the
check_tests.py process dies before the whole regression run is complete.
This script makes sure check_tests.py finishes at some point and sends
a report. (A sample crontab entry follows this item.)
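A hypothetical crontab entry for this; the installation path and the interval
are assumptions, and you should check whether restart_check_tests.sh expects
any arguments on your system:
    # run restart_check_tests.sh every 30 minutes
    */30 * * * * /path/to/Regression/restart_check_tests.sh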
- plfs_build.sh: builds plfs and installs it into the installation directory
(given in the config file).
- openmpi_build.sh: builds openmpi and installs it. This script does not check
for the correct version of the autotools; Open MPI's autogen.sh checks for that.
- fs_test_build.sh: builds the fs_test source.
- src_cp.sh: copies source from one location into another. This script can be
used to copy any set of code into the regression environment (plfs, fs_test,
etc.).
- src: Directory that will contain all source needed by the regression suite.
Each package's code will be copied into its own directory here. The
directory names are hardcoded in run_plfs_regression.sh so that it can
always expect to find code in certain places.
- inst: Directory that packages will be installed to. Each package will
install to its own directory within inst. This is where tests should expect
to find all binaries and libraries that the regression suite provides.
NOTE: if already existing executables and libraries are given to the
regression suite, then this directory or its sub-directories will contain
symbolic links to the locations given in the configuration. This is to
ensure that tests always know where to find dependencies.
- logs: Directory that will be used to hold log files.
- tests: Directory that will contain individual test directories. Each
directory is considered a test. Lists of tests to run are kept in text
files in the tests directory.
- A lock file will be created by run_plfs_regression.sh and can be removed by
run_plfs_regression.sh and check_results.py. It will also house job ids for
submitted jobs that can be checked by check_results.py. The actual file name
is set in the config file.
NOTE: submit_tests.py will also create this file if it doesn't exist, but
this functionality is there only to allow submit_tests.py to be run on its
own outside of the other regression scripts. A message is printed out if
submit_tests.py creates the file so that the behaviour can be tracked.
- check_tests.py requires a file that contains a python dictionary
detailing what tests were submitted by submit_tests.py. The name of this
file can be specified in the config file. The dictionary holds
information on tests that were successfully submitted as well as tests that
were not. All of this information is necessary in order for check_tests.py
to create a suitable summary.
Notes:
- The regression suite will rely on the MPI implementation's wrapper scripts
(such as mpicc, mpif90, etc.). This way the regression suite doesn't have to
worry about MPI library names and getting the linking right. As at least one
MPI implementation doesn't make sure its own MPI libraries are properly
linked into the executable (no use of rpath), the regression suite will
set LD_LIBRARY_PATH to include a path to the MPI libraries.
- For fs_test compilation, the regression suite will overwrite MPI_LD and
MPI_INC with the proper flags to compile fs_test against the regression
suite's PLFS and MPI. This change does not affect the user's environment,
only the environment set up by the regression suite. These
environment variables should be available if individual tests need to
compile supporting programs.
- It is possible for the regression suite to get itself into a state where it
cannot run. This is usually due to not being able to work with the needed
job id file and the python dictionary file of what jobs to check. These
are specified in the config file as id_file and dict_file, respectively.
When these files cannot be found when needed or are not usable, the
regression suite will create a DO_NOT_RUN_REGRESSION file. The presence of
this file will keep future instances of the regression suite from running.
At the time this file is created, an error should be printed describing what
went wrong with which file. If email is enabled, an email with the same
information should be sent. User intervention is required to get the
regression suite back into working order. (A hypothetical recovery step
follows this item.)
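Assuming the DO_NOT_RUN_REGRESSION file is created in the regression suite's
top-level directory (an assumption; check the error message and logs for its
actual location), recovery after fixing the reported problem might look like:
    # hypothetical path; remove the flag file only after the problem with
    # id_file/dict_file has been resolved
    rm /path/to/Regression/DO_NOT_RUN_REGRESSION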
Issues/To do:
- check_tests.py for now requires a file that contains a python dictionary of
the tests that submit_tests.py submitted. This was done so that information
that submit_tests.py gathers while submitting tests can be passed to
check_tests.py (basically, submitted vs. unable to submit at this time).
However, this means that it is not simple to rerun
check_tests.py after a regression run is completed, since the dictionary file
is removed. Is this important? Maybe submit_tests.py could leave
that additional information in a file inside the test's directory, which the
test itself does nothing with (it would be a regression suite file, not a test
file). This way, check_tests.py could go through the same control file that
submit_tests.py went through to figure out what tests to check, then look
for that file and act accordingly. However, what if the control file
changes? This is probably the original reason why the dictionary file method
is used.
Maybe add an option to check_tests.py to not remove the dictionary file? Or,
better yet, never remove it from check_tests.py and have
submit_tests.py rewrite it each regression instance. Would this cause
confusion about which regression instance spawned the file?
- This regression suite doesn't run on the Cray XE6 machines (possibly any
Cray). There are two reasons for this:
1) The regression suite relies on an ssh approach when mounting the needed
PLFS mount points; ssh to the Cray compute nodes is not possible at this
time. All interaction with the compute nodes has to happen through aprun.
2) Given that we have to use aprun to interact with the compute nodes, it
would be nice if we could mount PLFS using aprun. However, when aprun
exits, all user-generated processes are killed, which includes the PLFS
daemon. This brings about the situation where fuse thinks PLFS is still
running when it really isn't. In order to make the mount point usable
again, it has to be unmounted through aprun. Basically, anything that has
to be done to a PLFS mount point that the regression suite mounts has to
be done within a single aprun command.
Given the above reasons, there are no plans to implement the regression
suite on the Cray machines. To do so, tests would have to be
completely rewritten to remove all parallel run commands (aprun, mpirun,
etc.) from the scripts that get submitted, and each script would be run
via a single parallel run command. Given that a lot of the generated
scripts are in a shell language and shells have no concept of MPI, it
would be impossible to implement some tests in this fashion unless a
different language is used. For example, write_read_error relies on a single
rank running a program that changes a single byte in a PLFS file. If the
generated script were run on all nodes, every rank would try to modify the
file, since the generated script is in bash. The only way to solve this would
be to change the language of the generated script to one that is MPI-aware.