WIP: Lustre ADIO driver patch for Progressive File Layout feature #3290
Don't merge this yet! I am just creating the pull request to host conversation/feedback as we refine it.
Force-pushed from b7cfa67 to 830b24e.
Emoly Liu contributed some backwards-compatibility routines. Now the Lustre driver builds with Lustre as old as 1.8 (maybe older, but none of us tested that).
Force-pushed from 450b30e to 4a07a15.
Trying to remember all the bugs we've patched in the old code over the last decade.
Some ROMIO tests are failing: i_noncontig, noncontig, noncontig_coll, and noncontig_coll2. I suspect one bug is triggering all four failures. Most of the tests fail because they read back data they did not expect, but one test fails differently:
I wonder if this is related to fcntl locking?
Probably not: Lustre in my VM is mounted with flock support:
This change does not introduce the error after all. Something broke between 54d944 and HEAD. git-bisecting now.
@liy106 It looks like the offending commit dates back a few years. I pushed a fix to this branch, but 'noncontig_coll' is still failing. Confirmed that the test does not fail with HEAD.
The test as written does three open/write/read/close cycles, for different combinations of "noncontiguous in memory" and "noncontiguous in file" (there are plenty of other tests that exercise the contiguous-in-memory, contiguous-in-file case). If you strip down noncontig_coll to just one open/write/read/close cycle, there are no errors. I wonder what stale state is getting carried over between close and the subsequent open?
To make the Lustre ADIO driver work correctly with the PFL feature, this
patch makes the following changes:
- Use llapi_layout_* interfaces to get/set the striping information
-- use llapi_layout_file_create() instead of ioctl() to set striping
information when creating a file;
-- use llapi_layout_xxx_set/get() instead of ioctl() to set/get
striping information;
-- since O_LOV_DELAY_CREATE is set with O_CREAT by default in
llapi_layout_file_open(), the related code to open a file in
ADIOI_LUSTRE_open() is changed.
- Set composite layout by hint:
-- add hint "romio_lustre_comp_layout" to specify composite layout
in the following 3 formats:
--- YAML template file, e.g. /a/b/layout.yaml
--- option string, similar to "lfs setstripe" command, e.g. "-E
4M -c 2 -S 512K -E 8M -c 4 -S 1M -E -1 -S 256K"
--- Lustre source file, e.g. /mnt/lustre/compfile, which means
creating the file with the same layout as this Lustre file.
--- If "romio_lustre_comp_layout" is enabled, hints
"striping_factor", "striping_unit" and
"romio_lustre_start_iodevice" will be ignored.
-- add file ad_lustre_layout.c to parse the composite layout
options;
-- add files ad_lustre_cyaml.c and ad_lustre_cyaml.h to parse the YAML
template layout file.
- Improve the I/O redistribution algorithm for the PFL feature as follows:
-- for stripe-contiguous pattern:
--- use the LCM (least common multiple) of the different component
stripe counts to calculate the number of available cb nodes
--- use the last component stripe size as the common stripe
size, because different MPI procs will write different
components and it's hard to predict which component will
have most impact on performance. (This can be improved.)
-- for file-contiguous: use the last component stripe size as the
common stripe size, same as above.
- Fix some issues:
-- set fd->hints->cb_nodes in ADIOI_LUSTRE_WriteStridedColl(),
otherwise the final avail_cb_nodes is always 1;
-- since there is no mapping/initialization for ranklist[], just
use the rank directly, otherwise it will get a wrong rank number;
-- add LDEBUG() to print debug information;
-- remove striping information setting in ADIOI_LUSTRE_SetInfo()
since these values can be set/gotten easily by
llapi_layout_xxx_set/get().
-- add "AC_SEARCH_LIBS(llapi_layout_comp_use, lustreapi,
[AC_DEFINE(HAVE_LUSTRE_COMP_LAYOUT_SUPPORT, 1,...)], ...)" to
romio/configure.ac to check if Lustre version installed supports
PFL feature.
-- add "AC_CHECK_HEADERS(yaml.h, ...)" to romio/configure.ac to
check if libyaml and libyaml-devel are installed for YAML
template file support.
-- add "AM_CONDITIONAL([LUSTRE_YAML]...)" to romio/configure.ac and
add "if LUSTRE_YAML" to ad_lustre/Makefile.am to tell if YAML
libraries are needed to build.
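The stripe-contiguous rule in the list above (size the aggregator set from the LCM of the component stripe counts) can be sketched as follows. This is only an illustration under stated assumptions, not the patch's code: the helper names `gcd_u64`, `lcm_u64`, and `avail_cb_nodes_for` are hypothetical, and capping the result at the number of processes and skipping zero stripe counts are assumptions made for the example.

```c
#include <stdint.h>

/* Greatest common divisor (Euclid's algorithm). */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple; returns 0 if either argument is 0. */
static uint64_t lcm_u64(uint64_t a, uint64_t b)
{
    if (a == 0 || b == 0)
        return 0;
    return (a / gcd_u64(a, b)) * b;
}

/* Hypothetical helper: LCM of all component stripe counts, capped at
 * nprocs (assumption: there cannot be more aggregators than processes).
 * Components with a stripe count of 0 are skipped (assumption). */
static uint64_t avail_cb_nodes_for(const uint64_t *stripe_counts, int ncomp,
                                   uint64_t nprocs)
{
    uint64_t l = 1;
    for (int i = 0; i < ncomp; i++) {
        if (stripe_counts[i] == 0)
            continue;
        l = lcm_u64(l, stripe_counts[i]);
        if (l >= nprocs)
            return nprocs;      /* stop early, this also limits overflow */
    }
    return l;
}
```

For a three-component layout with stripe counts 2, 4, and 1, the LCM is 4, so a job with 16 processes would use 4 aggregators under these assumptions, while a job with 3 processes would be capped at 3.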
Implement layout routines in such a way that lustre-1.8 can still work.
'error' should only have a non-zero value if the Lustre I/O routines actually encounter an error. If no i/o was carried out (e.g. because 'count' was zero), a '-1' error is not sensible.
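The rule stated in this commit message can be illustrated with a small wrapper around POSIX `write()`. This is a sketch under assumptions, not the ROMIO routine: the name `write_checked` and its signature are hypothetical. The point is that the error value starts at zero and only changes when an I/O call actually fails, so a zero-length request reports success.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical wrapper: writes `count` bytes, returns the number written.
 * *error stays 0 unless write() itself fails. */
static ssize_t write_checked(int fd, const void *buf, size_t count, int *error)
{
    size_t done = 0;

    *error = 0;                 /* count == 0 never enters the loop: success */
    while (done < count) {
        ssize_t n = write(fd, (const char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            *error = errno;     /* a real I/O error occurred */
            return -1;
        }
        if (n == 0)
            break;              /* no progress; avoid spinning forever */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

With this shape, a call with `count` equal to zero returns 0 with `*error == 0` even on an invalid descriptor, because no I/O was attempted.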
…he fly ROMIO data structures are not well encapsulated. Way back in file open we parsed the list of I/O aggregators and set up a "ranklist" array that was 'cb_nodes' big. If we adjust cb_nodes on the fly after that, we're going to overrun the array when we later go to figure out which aggregator we should talk to.
let user figure out the detected layout information via info object
If I set no hints, this test fails, but if I set the hint
@@ -120,6 +120,7 @@ void ADIOI_LUSTRE_WriteStridedColl(ADIO_File fd, const void *buf, int count,
ADIO_Offset *lustre_offsets0, *lustre_offsets, *count_sizes = NULL;

MPI_Comm_size(fd->comm, &nprocs);
fd->hints->cb_nodes = nprocs;
@liy106 here's the problem! Welcome to ROMIO, where nothing is adequately documented. There is an fd->hints->ranklist[] array allocated to be cb_nodes big. By increasing cb_nodes here, you will overrun that array elsewhere.
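The hazard described in this comment, and one way to avoid it, can be sketched with a stand-in structure. Everything here is hypothetical: `struct hints_sketch`, `hints_init`, and `hints_resize` are illustrative names, not ROMIO types or functions, and filling new slots with an identity mapping is a placeholder assumption. The idea is simply that the count and the array sized by it are changed together.

```c
#include <stdlib.h>

/* Stand-in for the two fields under discussion (not the ROMIO struct). */
struct hints_sketch {
    int cb_nodes;   /* number of aggregators */
    int *ranklist;  /* cb_nodes entries: aggregator index -> rank */
};

/* Allocate ranklist to match cb_nodes, as done once at open time. */
static int hints_init(struct hints_sketch *h, int cb_nodes)
{
    if (cb_nodes <= 0)
        return -1;
    h->ranklist = malloc((size_t)cb_nodes * sizeof(int));
    if (h->ranklist == NULL)
        return -1;
    h->cb_nodes = cb_nodes;
    for (int i = 0; i < cb_nodes; i++)
        h->ranklist[i] = i;
    return 0;
}

/* Change cb_nodes and ranklist together, so later indexing up to
 * cb_nodes never runs past the allocation. */
static int hints_resize(struct hints_sketch *h, int new_cb_nodes)
{
    if (new_cb_nodes <= 0)
        return -1;
    int *tmp = realloc(h->ranklist, (size_t)new_cb_nodes * sizeof(int));
    if (tmp == NULL)
        return -1;              /* old array and old count remain valid */
    for (int i = h->cb_nodes; i < new_cb_nodes; i++)
        tmp[i] = i;             /* placeholder mapping (assumption) */
    h->ranklist = tmp;
    h->cb_nodes = new_cb_nodes;
    return 0;
}
```

Assigning `cb_nodes` directly, as the hunk above does, skips the reallocation step, which is why later lookups can index past the end of the array.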
@@ -0,0 +1,160 @@
/*
 * LGPL HEADER START
Check with your Cray friends, but I am pretty sure our vendor partners will flip out if we try to include anything LGPL.
Force-pushed from efa7f92 to 80409f2.
uncertain why we were not caching this value...
OK! With a few changes, all of the ROMIO tests pass if you don't set any hints. I tried to exercise the progressive layout path but was not certain how to enable that feature.
Inactive PR. Closing it for now, till someone finds the time to work on it.