
romio: make contig_access_count MPI_Count instead of int #6928

Merged · 12 commits · May 15, 2024

Conversation

hzhou
Contributor

@hzhou hzhou commented Feb 27, 2024

Pull Request Description

It is possible to have more than INT_MAX noncontiguous segments, so contig_access_count needs to be typed as MPI_Count to prevent overflow.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou
Contributor Author

hzhou commented Feb 27, 2024

test:mpich/ch3/tcp
test:mpich/ch4/ofi

@wkliao
Contributor

wkliao commented Mar 1, 2024

Can ROMIO now handle a single request of size > INT_MAX made from a process rank?

@hzhou
Contributor Author

hzhou commented Mar 1, 2024

Can ROMIO now handle a single request of size > INT_MAX made from a process rank?

It is still in testing, but it should -- at least that's the goal :)

@roblatham00
Contributor

@wkliao ROMIO has been able to handle large sizes for a few years now. Large counts are proving a bit harder. Requires lots of changes all over ROMIO, but I think I'm close. I'll push my changes to this branch.

Do you have a test case we should try out?

@wkliao
Contributor

wkliao commented Mar 1, 2024

Most of my tests are through PnetCDF. Because of its use of type conversion and nonblocking APIs, PnetCDF calls MPI-IO APIs with MPI_BYTE datatype for buffers.

Both MPI-IO and PnetCDF need to be modernized to add large count APIs.

@roblatham00
Contributor

I just pushed a bunch of changes, but for review only. I'm trying to figure out how ADIOI_Heap_merge gets a bogus memory address with small requests, and why large counts time out after 30 minutes without actually doing any I/O.

@wkliao
Contributor

wkliao commented Mar 2, 2024

Below is the test program with the assertion message when running 2 processes.

Assertion failed in file ../../../../mpich/src/mpi/romio/adio/common/ad_write_coll.c at line 898: (curr_to_proc[p] + len) == (unsigned) ((ADIO_Offset) curr_to_proc[p] + len)
% cat large_dtype.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <mpi.h>

#define ERROR(fname) \
    if (err != MPI_SUCCESS) { \
        int errorStringLen; \
        char errorString[MPI_MAX_ERROR_STRING]; \
        MPI_Error_string(err, errorString, &errorStringLen); \
        printf("Error at line %d when calling %s: %s\n",__LINE__,fname,errorString); \
    }

#define LEN 2048
#define NVARS 1100

/*----< main() >------------------------------------------------------------*/
int main(int argc, char **argv)
{
    char *filename;
    size_t i, buf_len;
    int err, rank, verbose=1, nprocs, psize[2], gsize[2], count[2], start[2];
    char *buf;
    MPI_File     fh;
    MPI_Datatype subType, filetype, buftype;
    MPI_Status   status;
    MPI_Offset fsize;
    int array_of_blocklengths[NVARS];
    MPI_Aint array_of_displacements[NVARS];
    MPI_Datatype array_of_types[NVARS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (argc != 2) {
        if (!rank) printf("Usage: %s filename\n",argv[0]);
        MPI_Finalize();
        exit(1);
    }
    filename = argv[1];

    /* Creates a division of processors in a cartesian grid */
    psize[0] = psize[1] = 0;
    err = MPI_Dims_create(nprocs, 2, psize);
    ERROR("MPI_Dims_create");

    /* each 2D variable is of size gsizes[0] x gsizes[1] bytes */
    gsize[0] = LEN * psize[0];
    gsize[1] = LEN * psize[1];

    /* set subarray offset and length */
    start[0] = LEN * (rank / psize[1]);
    start[1] = LEN * (rank % psize[1]);
    count[0] = LEN - 1;   /* -1 to create holes */
    count[1] = LEN - 1;

    fsize = (MPI_Offset)gsize[0] * gsize[1] * NVARS - (LEN+1);
    if (verbose) {
        buf_len = (size_t)NVARS * (LEN-1) * (LEN-1);
        if (rank == 0) {
            printf("nprocs=%d NVARS=%d LEN=%d\n", nprocs, NVARS, LEN);
            printf("Expecting file size=%lld bytes (%.1f MB, %.1f GB)\n",
                   fsize, (float)fsize/1048576,(float)fsize/1073741824);
            printf("Each global variable is of size %d bytes (%.1f MB)\n",
                   gsize[0]*gsize[1],(float)gsize[0]*gsize[1]/1048576);
            printf("Each process writes %zd bytes (%.1f MB, %.1f GB)\n",
                   buf_len,(float)buf_len/1048576,(float)buf_len/1073741824);
        }
        printf("rank %3d: gsize=%4d %4d start=%4d %4d count=%4d %4d\n", rank,
               gsize[0],gsize[1],start[0],start[1],count[0],count[1]);
    }


    /* create 2D subarray datatype for fileview */
    err = MPI_Type_create_subarray(2, gsize, count, start, MPI_ORDER_C, MPI_BYTE, &subType);
    ERROR("MPI_Type_create_subarray");
    err = MPI_Type_commit(&subType);
    ERROR("MPI_Type_commit");

    /* create a filetype by concatenating NVARS subType */
    for (i=0; i<NVARS; i++) {
        array_of_blocklengths[i] = 1;
        array_of_displacements[i] = gsize[0]*gsize[1]*i;
        array_of_types[i] = subType;
    }
    err = MPI_Type_create_struct(NVARS, array_of_blocklengths,
                                        array_of_displacements,
                                        array_of_types,
                                        &filetype);
    ERROR("MPI_Type_create_struct");
    err = MPI_Type_commit(&filetype);
    ERROR("MPI_Type_commit");
    err = MPI_Type_free(&subType);
    ERROR("MPI_Type_free");

    /* Create local buffer datatype: each 2D variable is of size LEN x LEN */
    gsize[0] = LEN;
    gsize[1] = LEN;
    start[0] = 0;
    start[1] = 0;
    count[0] = LEN-1;  /* -1 to create holes */
    count[1] = LEN-1;

    err = MPI_Type_create_subarray(2, gsize, count, start, MPI_ORDER_C, MPI_BYTE, &subType);
    ERROR("MPI_Type_create_subarray");
    err = MPI_Type_commit(&subType);
    ERROR("MPI_Type_commit");

    /* concatenate NVARS subType into a buftype */
    for (i=0; i<NVARS; i++) {
        array_of_blocklengths[i] = 1;
        array_of_displacements[i] = LEN*LEN*i;
        array_of_types[i] = subType;
    }

    /* create a buftype by concatenating NVARS subTypes */
    err = MPI_Type_create_struct(NVARS, array_of_blocklengths,
                                        array_of_displacements,
                                        array_of_types,
                                        &buftype);
    ERROR("MPI_Type_create_struct");
    err = MPI_Type_commit(&buftype);
    ERROR("MPI_Type_commit");
    err = MPI_Type_free(&subType);
    ERROR("MPI_Type_free");

    /* allocate a local buffer */
    buf_len = (size_t)NVARS * LEN * LEN;
    buf = (char*) malloc(buf_len);
    for (i=0; i<buf_len; i++) buf[i] = (char)rank;

    /* open to create a file */
    err = MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    ERROR("MPI_File_open");

    /* set the file view */
    err = MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
    ERROR("MPI_File_set_view");

    /* MPI collective write */
    err = MPI_File_write_all(fh, buf, 1, buftype, &status);
    ERROR("MPI_File_write_all");

    MPI_File_close(&fh);
    free(buf);

    err = MPI_Type_free(&filetype);
    ERROR("MPI_Type_free");
    err = MPI_Type_free(&buftype);
    ERROR("MPI_Type_free");

    MPI_Finalize();
    return 0;
}

@roblatham00
Contributor

Wei-keng, I always love to see new test cases: can I add this to test/mpi/io?

@wkliao
Contributor

wkliao commented Mar 8, 2024

Wei-keng, I always love to see new test cases: can I add this to test/mpi/io?

Sure.

@wkliao
Contributor

wkliao commented Mar 8, 2024

FYI, to test reads I added a call to MPI_File_read_all right after MPI_File_write_all in the test program (and changed the file open mode to MPI_MODE_RDWR):

    /* MPI collective read */
    err = MPI_File_read_all(fh, buf, 1, buftype, &status);
    ERROR("MPI_File_read_all");

Then, I got this error when running just one MPI process.

Assertion failed in file ../../../../mpich/src/mpi/romio/adio/common/ad_read_str.c
at line 379: ((ADIO_Offset) num + size) == (unsigned) (num + size)

@raffenet
Contributor

Forgive me for parachuting in, but I'd like to get this PR merged and backported to 4.2.x for a release by the end of the month. I added a commit to hopefully address the issue currently blocking @wkliao. Could you take a look?

@raffenet
Contributor

test:mpich/ch4/most

@wkliao
Contributor

wkliao commented Mar 16, 2024

Hi, @raffenet

I tested your patch on a UFS file system using the above test program.
It passed when running one MPI process, but failed with two.
The same error message appeared:

Assertion failed in file ../../../../mpich/src/mpi/romio/adio/common/ad_write_coll.c
at line 898: (curr_to_proc[p] + len) == (unsigned) ((ADIO_Offset) curr_to_proc[p] + len)

I can see your patch fixed file ad_read_str.c and ad_write_str.c, but the error
came from ad_write_coll.c.

In addition, other ADIO drivers also implement their own ReadStrided/WriteStrided,
and the same goes for the collective subroutines WriteStridedColl/ReadStridedColl.
They also need to be fixed.

@wkliao
Contributor

wkliao commented Mar 17, 2024

Please also update /mpich/test/mpi/io/Makefile.am to add the new test program large_count.c, so it can run during CI.

@raffenet
Contributor

Thanks for testing. I also enabled -Wshorten-64-to-32 on my laptop and found a bunch more issues that need fixing. Will keep plugging at it when I have time.

@roblatham00
Contributor

roblatham00 commented Mar 18, 2024 via email

@roblatham00
Contributor

I can make ROMIO happy about this large count but not MPICH itself:

#0  0x00002b871f16a387 in raise () from /lib64/libc.so.6
#1  0x00002b871f16ba78 in abort () from /lib64/libc.so.6
#2  0x00002b870936444b in MPID_Abort.cold () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#3  0x00002b870a40f598 in MPIR_Handle_fatal_error () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#4  0x00002b870a40ffe7 in MPIR_Err_return_comm () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#5  0x00002b8709a648ea in internal_Waitall () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#6  0x00002b871959ada6 in ADIOI_W_Exchange_data () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#7  0x00002b8719598076 in ADIOI_Exch_and_write () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#8  0x00002b871959530c in ADIOI_GEN_WriteStridedColl () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#9  0x00002b87194d18e4 in MPIOI_File_write_all () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#10 0x00002b87194d12a7 in PMPI_File_write_all () from /home/robl/soft/mpich-master/lib/libmpi.so.0
#11 0x00000000004014d8 in main ()

(guess I need to rebuild with debug info)

@roblatham00
Contributor

(gdb) where
#0  0x00002b871f16a387 in raise () from /lib64/libc.so.6
#1  0x00002b871f16ba78 in abort () from /lib64/libc.so.6
#2  0x00002b870936444b in MPID_Abort (comm=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, comm@entry=0x0, mpi_errno=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, mpi_errno@entry=0,
    exit_code=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, exit_code@entry=1010488083, error_msg=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
    error_msg@entry=0x7fff21c09330 "Fatal error in internal_Waitall: Invalid MPI_Request, error stack:\ninternal_Waitall(125): MPI_Waitall(count=4, array_of_requests=0x2051230, array_of_statuses=0x1) failed\ninternal_Waitall(49).: The sup"...)
    at ../src/mpid/ch4/src/ch4_globals.c:120
#3  0x00002b870a40f598 in MPIR_Handle_fatal_error (comm_ptr=comm_ptr@entry=0x0, fcname=fcname@entry=0x2b8719602480 <__func__.0> "internal_Waitall", errcode=1010488083) at ../src/mpi/errhan/errutil.c:747
#4  0x00002b870a40ffe7 in MPIR_Err_return_comm (comm_ptr=comm_ptr@entry=0x0, fcname=fcname@entry=0x2b8719602480 <__func__.0> "internal_Waitall", errcode=<optimized out>) at ../src/mpi/errhan/errutil.c:278
#5  0x00002b8709a648ea in internal_Waitall (count=<optimized out>, array_of_requests=<optimized out>, array_of_statuses=<optimized out>) at ../src/binding/c/request/waitall.c:129
#6  0x00002b871959ada6 in ADIOI_W_Exchange_data (fd=0x2042890, buf=0x2b872588d010, write_buf=0x2b883888e0b0 "", flat_buf=0x2050e40, offset_list=0x2b883988f0b0, len_list=0x2b883a9bce58, send_size=0x20517a8, recv_size=0x20517b0, off=4294967296, size=16777216, count=0x2051798,
    start_pos=0x20517d8, partial_recv=0x20517a0, sent_to_proc=0x20517b8, nprocs=2, myrank=0, buftype_is_contig=0, contig_access_count=2251700, min_st_offset=0, fd_size=4613733376, fd_start=0x2050ae0, fd_end=0x2050af0, others_req=0x20516a0, send_buf_idx=0x20517c0,
    curr_to_proc=0x20517c8, done_to_proc=0x20517d0, hole=0x7fff21c0aaf4, iter=256, buftype_extent=4613734400, buf_idx=0x2042ee0, error_code=0x7fff21c0adec) at ../../../../src/mpi/romio/adio/common/ad_write_coll.c:750
#7  0x00002b8719598076 in ADIOI_Exch_and_write (fd=0x2042890, buf=0x2b872588d010, datatype=-872415232, nprocs=2, myrank=0, others_req=0x20516a0, offset_list=0x2b883988f0b0, len_list=0x2b883a9bce58, contig_access_count=2251700, min_st_offset=0, fd_size=4613733376,
    fd_start=0x2050ae0, fd_end=0x2050af0, buf_idx=0x2042ee0, error_code=0x7fff21c0adec) at ../../../../src/mpi/romio/adio/common/ad_write_coll.c:461
#8  0x00002b871959530c in ADIOI_GEN_WriteStridedColl (fd=0x2042890, buf=0x2b872588d010, count=1, datatype=-872415232, file_ptr_type=101, offset=0, status=0x7fff21c0af50, error_code=0x7fff21c0adec) at ../../../../src/mpi/romio/adio/common/ad_write_coll.c:186
#9  0x00002b87194d18e4 in MPIOI_File_write_all (fh=0x2042890, offset=0, file_ptr_type=101, buf=0x2b872588d010, count=1, datatype=-872415232, myname=0x2b871e16a450 <myname> "MPI_FILE_WRITE_ALL", status=0x7fff21c0af50) at ../../../../src/mpi/romio/mpi-io/write_all.c:166
#10 0x00002b87194d12a7 in PMPI_File_write_all (fh=0x2042890, buf=0x2b872588d010, count=1, datatype=-872415232, status=0x7fff21c0af50) at ../../../../src/mpi/romio/mpi-io/write_all.c:69
#11 0x00000000004014d8 in main (argc=<optimized out>, argv=<optimized out>) at /home/robl/src/mpich/test/mpi/io/large_count.c:143

@roblatham00
Contributor

still in progress but pushed some more changes to the branch

@roblatham00
Contributor

With a debug, no-optimization build:

Abort(1007866643) on node 0: Fatal error in internal_Waitall: Invalid MPI_Request, error stack:
internal_Waitall(125): MPI_Waitall(count=4, array_of_requests=0x27673d0, array_of_statuses=0x1) failed
internal_Waitall(49).: The supplied request in array element 2 was invalid (kind=0)

@roblatham00
Contributor

hm. -fsanitize=undefined -fsanitize=address doesn't uncover anything

@roblatham00
Contributor

Pushed another round of commits that seem to make the big-count IOR workload happy. Still investigating Wei-keng's test case.

@roblatham00
Contributor

Going back to "invalid request"... I've added return-code checking to every isend and irecv in that path and still get The supplied request in array element 2 was invalid (kind=15).

Other than checking the return value of MPI_Isend and MPI_Irecv, how can I confirm the returned request is valid? I'd like to assert/error/dump-core then, not at waitall time.

I could change to MPI_Send and MPI_Recv, but I am pretty sure that will deadlock.

@raffenet
Contributor

test:mpich/warnings

@roblatham00 roblatham00 force-pushed the 2402_romio_count branch 2 times, most recently from da803b6 to 0b59b09 on May 13, 2024 20:51
@hzhou
Contributor Author

hzhou commented May 13, 2024

test:mpich/ch3/most
test:mpich/ch4/most

@@ -30,6 +30,8 @@ noinst_PROGRAMS = \
external32_derived_dtype \
tst_fileview \
zero_count \
large_count \
large_dtype \
Contributor Author

How big (in terms of file space and memory it needs) are these two tests?

Contributor

I don't know the memory footprint, but the output file size is 18 GB when running on 2 MPI processes.

Contributor Author

Just realized that neither test is added to the testlist, so they won't run during CI testing. I guess that is fine for now.

Contributor

/usr/bin/time says large_count needs ~26 MiB (26276 KiB) of memory (max RSS) -- that seems low.

Contributor Author

Makes sense. That is how the chunking algorithm is supposed to work, right?

Contributor Author

Wait, it allocates a local buffer of size ~4GB, so ~26MB does seem low

@hzhou
Contributor Author

hzhou commented May 14, 2024

The rd_end failure is a regression from this PR:

not ok 2487 - ./io/rd_end 2
  ---
  Directory: ./io
  File: rd_end
  Num-procs: 2
  Timeout: 180
  Date: "Mon May 13 17:33:52 2024"
  ...
## Test output (expected 'No Errors'):
## 0: count was 0; expected 5
##  Found 1 errors

ERROR("MPI_Type_free");

/* allocate a local buffer */
buf_len = (size_t) NVARS *LEN * LEN;
Contributor Author

So, 1100 * 2048 * 2048 -> ~ 4GB?

@wkliao
Contributor

wkliao commented May 14, 2024

The user buffer malloc total size per process is 8 GB in large_dtype.c.
Googling the keyword "memory footprint" suggested the following commands:

/usr/bin/time -v command

and

/usr/bin/time -l command

The former is for Linux and the latter for macOS.

The value shown in "maximum resident set size" is the max footprint.

@wkliao
Contributor

wkliao commented May 14, 2024

/usr/bin/time -v mpiexec -n 2 large_dtype -f dummy

	Elapsed (wall clock) time (h:mm:ss or m:ss): 35:33.03
	Maximum resident set size (kbytes): 9285480

@roblatham00
Contributor

test:mpich/ch3/most
test:mpich/ch4/most

roblatham00 and others added 12 commits May 14, 2024 15:08

  • It is potentially possible to have more than INT_MAX number of noncontig segments, thus contig_access_count need be typed as MPI_Count in order to prevent overflow.
  • While the collective buffering routine will chunk up large requests, there is still a preliminary "what is everyone doing" step. Those exchanges might require large counts.
  • Derived from a pnetcdf-generated workload.
  • These routines stored length values in int and unsigned variable types in some places. Update them to use MPI_Aint, which matches the eventual API call.
  • Huge patch set touching almost all of romio, but should be much fewer places where we store potentially large values in an int. Passes '-fsanitize=undefined' and also reduces '-Wshorten-64-to-32' warnings.
@roblatham00
Contributor

test:mpich/ch3/most
test:mpich/ch4/most

@roblatham00
Contributor

roblatham00 commented May 15, 2024

The failing tests are mostly OS X failing to find the right autoconf version. However, I don't know how to resolve https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3351/jenkins_configure=debug,label=ubuntu22.04_review/testReport/junit/(root)/io/02482_____io_i_noncontig_coll2_4__/ -- that's a segfault in i_noncontig2, but I cannot reproduce it on Improv's UCX stack (clang-16 with sanitizers) and I cannot rule out a bug in UCX.

@hzhou
Contributor Author

hzhou commented May 15, 2024

The failing tests are mostly OS X failing to find the right autoconf version. However, I don't know how to resolve https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3351/jenkins_configure=debug,label=ubuntu22.04_review/testReport/junit/(root)/io/02482_____io_i_noncontig_coll2_4__/ -- that's a segfault in i_noncontig2, but I cannot reproduce it on Improv's UCX stack (clang-16 with sanitizers) and I cannot rule out a bug in UCX.

Let's ignore the OS X tests for now. Most of the ch4 failures are expected. The OFI failures are due to known performance issues in the sockets provider. The io/i_noncontig2 failures seem to be related to a UCX shared-memory issue: when the server is overloaded, UCX and some OFI providers are known not to release shared memory promptly, resulting in out-of-space errors.

I think this PR is good to merge now.

@hzhou hzhou merged commit 5861d90 into pmodels:main May 15, 2024
4 of 8 checks passed
@hzhou hzhou deleted the 2402_romio_count branch May 15, 2024 14:34