
WIP: Lustre ADIO driver patch for Progressive File Layout feature #3290

Closed
wants to merge 7 commits into from

Conversation

roblatham00
Contributor

To make the Lustre ADIO driver work correctly with the PFL (Progressive
File Layout) feature, this patch makes the following changes:

  • Use llapi_layout_* interfaces to get/set the striping information
    -- use llapi_layout_file_create() instead of ioctl() to set
    striping information when creating a file;
    -- use llapi_layout_xxx_set/get() instead of ioctl() to set/get
    striping information;
    -- since O_LOV_DELAY_CREATE is set along with O_CREAT by default
    in llapi_layout_file_open(), the code that opens a file in
    ADIOI_LUSTRE_open() is changed accordingly.
  • Set composite layout by hint:
    -- add hint "romio_lustre_comp_layout" to specify composite
    layout in the following 3 formats:
    --- YAML template file, e.g. /a/b/layout.yaml
    --- option string, similar to "lfs setstripe" command, e.g.
    "-E 4M -c 2 -S 512K -E 8M -c 4 -S 1M -E -1 -S 256K"
    --- lustre source file, e.g. /mnt/lustre/compfile, meaning the
    new file is created with the same layout as that Lustre file.
    --- If "romio_lustre_comp_layout" is enabled, hints
    "striping_factor", "striping_unit" and
    "romio_lustre_start_iodevice" will be ignored.
    -- add file ad_lustre_layout.c to parse the composite layout
    options;
    -- add files ad_lustre_cyaml.c and ad_lustre_cyaml.h to parse the
    YAML template layout file.
  • Improve the following I/O redistribution algorithm with PFL
    feature:
    -- for stripe-contiguous pattern:
    --- use the LCM (least common multiple) of the different
    component stripe counts to calculate the number of available
    cb nodes
    --- use the last component's stripe size as the common stripe
    size, because different MPI processes will write different
    components and it is hard to predict which component will
    have the most impact on performance. (This can be improved.)
    -- for file-contiguous: use the last component stripe size as the
    common stripe size, same as above.
  • Fix some issues:
    -- set fn->hints->cb_nodes in ADIOI_LUSTRE_WriteStridedColl(),
    otherwise the final avail_cb_nodes is always 1;
    -- since there is no mapping/initialization for ranklist[],
    use the rank directly; otherwise a wrong rank number is
    obtained;
    -- add LDEBUG() to print debug information;
    -- remove the striping-information setting in ADIOI_LUSTRE_SetInfo(),
    since these values can easily be set/retrieved via
    llapi_layout_xxx_set/get().
    -- add "AC_SEARCH_LIBS(llapi_layout_comp_use, lustreapi,
    [AC_DEFINE(HAVE_LUSTRE_COMP_LAYOUT_SUPPORT, 1,...)], ...)" to
    romio/configure.ac to check whether the installed Lustre
    version supports the PFL feature.
    -- add "AC_CHECK_HEADERS(yaml.h, ...)" to romio/configure.ac to
    check if libyaml and libyaml-devel are installed for YAML
    template file support.

@roblatham00
Contributor Author

Don't merge this yet! I am just creating the pull request to host conversation/feedback as we refine it.

@roblatham00
Contributor Author

Emoly Liu contributed some backwards-compatibility routines. Now the Lustre driver builds with Lustre as old as 1.8 (maybe older, but none of us tested that).

@roblatham00
Contributor Author

Trying to remember all the bugs we've patched in the old code over the last decade.

  • I tested MPI_MODE_EXCL -- that's still working, both with and without layout hints. That's to be expected, since hint processing was moved into the open path and this patch doesn't change that.
  • In 2010 a bogus value for the stripe count could confuse LL_IOC_LOV_GETSTRIPE; the solution then was to use calloc, so I've changed your use of malloc. While there, I also changed it to use ADIOI_Calloc, which adds a bit of memory logging in our debug builds.
  • fixed a few uninitialized/unused-variable warnings

@roblatham00
Contributor Author

Some ROMIO tests in src/mpi/romio/tests are failing on my virtual machine (CentOS 7 tracking lustre-master, with a simple two-OST volume set up via llmount.sh from lustre-tests).

Failing tests are i_noncontig, noncontig, noncontig_coll, noncontig_coll2. I suspect one bug is triggering these four failing tests.

Most of the tests fail by reading back data they did not expect, but one test fails differently:

**** Testing noncontig_coll.c ****
Error in ADIOI_Calc_aggregator(): rank_index(12) >= fd->hints->cb_nodes (2) fd_size=20000 off=13816909816972
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

@roblatham00
Contributor Author

I wonder if this is related to fcntl locking?

@roblatham00
Contributor Author

Probably not: Lustre in my VM is mounted with flock support:
10.0.2.15@tcp:/lustre on /mnt/lustre type lustre (rw,seclabel,flock,user_xattr,lazystatfs)

@roblatham00
Contributor Author

This change does not introduce the error after all. Something broke between 54d944 and HEAD. git-bisecting now.

@roblatham00
Contributor Author

@liy106 It looks like the offending commit went in a few years ago. I pushed a fix to this branch, but 'noncontig_coll' is still failing. Confirmed that the test does not fail with HEAD.

@roblatham00
Contributor Author

The test as written does three open/write/read/close cycles, for different combinations of "noncontiguous in memory" and "noncontiguous in file" (there are plenty of other tests that exercise the contiguous-in-memory, contiguous-in-file case). If you strip down noncontig_coll to just one open/write/read/close cycle, there are no errors. I wonder what stale state is getting carried over between close and the subsequent open?

Emoly Liu and others added 6 commits October 11, 2018 00:03
(The first commit message repeats the PR description above, adding one item:)
    -- add "AM_CONDITIONAL([LUSTRE_YAML]...)" to romio/configure.ac and
       add "if LUSTRE_YAML" to ad_lustre/Makefile.am to tell whether the
       YAML libraries are needed at build time.
Implement layout routines in such a way that lustre-1.8 can still work.
'error' should only have a non-zero value if the Lustre I/O routines
actually encounter an error.  If no I/O was carried out (e.g. because
'count' was zero), a '-1' error is not sensible.
…he fly

ROMIO data structures are not well encapsulated.  Way back in file open
we parsed the list of I/O aggregators and set up a "ranklist" array that
was 'cb_nodes' big.  If we adjust cb_nodes on the fly after that, we're
going to overrun the array when we later go to figure out which
aggregator we should talk to.
Let the user discover the detected layout information via the info object.
@roblatham00
Contributor Author

If I set no hints, this test fails, but if I set the hint cb_config_list to *:*, the test passes. That hint says "make every node an I/O aggregator" and also increases the size of the fd->hints->ranklist array. I am trying to figure out what Lustre's read path is doing differently from the UFS (plain Unix) read path; they share a lot of the same code in the read case....

@@ -120,6 +120,7 @@ void ADIOI_LUSTRE_WriteStridedColl(ADIO_File fd, const void *buf, int count,
ADIO_Offset *lustre_offsets0, *lustre_offsets, *count_sizes = NULL;

MPI_Comm_size(fd->comm, &nprocs);
fd->hints->cb_nodes = nprocs;
Contributor Author

@liy106 here's the problem! Welcome to ROMIO, where nothing is adequately documented. There is an fd->hints->ranklist[] array allocated to be cb_nodes entries long. By increasing cb_nodes here, you will overrun that array elsewhere.

@@ -0,0 +1,160 @@
/*
* LGPL HEADER START
Contributor Author

Check with your Cray friends, but I am pretty sure our vendor partners will flip out if we try to include anything LGPL.

uncertain why we were not caching this value...
@roblatham00
Contributor Author

OK! With a few changes, all of the ROMIO tests pass if you don't set any hints. I tried to exercise the progressive-layout path but wasn't certain how to enable that feature.

@pavanbalaji pavanbalaji added this to Done in MPI I/O Jan 15, 2019
@pavanbalaji pavanbalaji moved this from Done to To do in MPI I/O Jan 15, 2019
@pavanbalaji
Contributor

Inactive PR. Closing it for now, till someone finds the time to work on it.

@pavanbalaji pavanbalaji closed this Jun 3, 2020
@roblatham00 roblatham00 self-assigned this Jul 18, 2022
@roblatham00 roblatham00 added romio anything having to do with our MPI-IO implementation still-interested Closed but will revisit or reopen labels Jul 18, 2022
Labels
do-not-merge romio anything having to do with our MPI-IO implementation still-interested Closed but will revisit or reopen WIP
Projects
No open projects
MPI I/O
  
To do

3 participants