Skip to content

Conversation

yanagibashi
Copy link
Contributor

Before this change, the reference counters opal_util_initialized
and opal_initialized were incremented at the beginning of the
opal_init_util and the opal_init functions respectively.
In other words, they were incremented before fully initialized.

This causes the following program to abort by SIGFPE if
--enable-timing is enabled on configure.

// need -lm option on link

int main(int argc, char *argv[])
{
    // raise SIGFPE on division-by-zero
    feenableexcept(FE_DIVBYZERO);
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

The logic of the SIGFPE is:

  1. MPI_Init calls opal_init through ompi_rte_init.
  2. opal_init changes the value of opal_initialized to 1.
  3. opal_init calls opal_init_util.
  4. opal_init_util calls opal_timing_ts_func through
    OPAL_TIMING_ENV_INIT, and opal_timing_ts_func returns
    get_ts_cycle instead of get_ts_gettimeofday because
    opal_initialized to 1.
    (This is the problem)
  5. opal_init_util calls get_ts_cycle through
    OPAL_TIMING_ENV_INIT.
  6. get_ts_cycle executes
    opal_timer_base_get_cycles()) / opal_timer_base_get_freq()
    and it raises SIGFPE (division-by-zero) because the OPAL TIMER
    framework is not initialized yet and opal_timer_base_get_freq
    returns 0.

This commit changes the increment timing of opal_util_initialized
and opal_initialized to the end of opal_init_util and the
opal_init functions respectively.

Signed-off-by: Tsubasa Yanagibashi fj2505dt@aa.jp.fujitsu.com

Before this change, the reference counters `opal_util_initialized`
and `opal_initialized` were incremented at the beginning of the
`opal_init_util` and the `opal_init` functions respectively.
In other words, they were incremented before fully initialized.

This causes the following program to abort by SIGFPE if
`--enable-timing` is enabled on `configure`.

```c
// need -lm option on link

int main(int argc, char *argv[])
{
    // raise SIGFPE on division-by-zero
    feenableexcept(FE_DIVBYZERO);
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}
```

The logic of the SIGFPE is:

1. `MPI_Init` calls `opal_init` through `ompi_rte_init`.
2. `opal_init` changes the value of `opal_initialized` to 1.
3. `opal_init` calls `opal_init_util`.
4. `opal_init_util` calls `opal_timing_ts_func` through
   `OPAL_TIMING_ENV_INIT`, and `opal_timing_ts_func` returns
   `get_ts_cycle` instead of `get_ts_gettimeofday` because
   `opal_initialized` to 1.
   (This is the problem)
5. `opal_init_util` calls `get_ts_cycle` through
   `OPAL_TIMING_ENV_INIT`.
6. `get_ts_cycle` executes
   `opal_timer_base_get_cycles()) / opal_timer_base_get_freq()`
   and it raises SIGFPE (division-by-zero) because the OPAL TIMER
   framework is not initialized yet and `opal_timer_base_get_freq`
   returns 0.

This commit changes the increment timing of `opal_util_initialized`
and `opal_initialized` to the end of `opal_init_util` and the
`opal_init` functions respectively.

Signed-off-by: Tsubasa Yanagibashi <fj2505dt@aa.jp.fujitsu.com>
@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@kawashima-fj
Copy link
Member

ok to test

Copy link
Member

@kawashima-fj kawashima-fj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

This fix is not thread-safe. However, the original code was not thread-safe, too. Someone may improve it in the future.

orte_init in v4.0 (and possibly v3.x?) has similar code. ORTE is gone in master and orte_init in v4.0 does not seem to have a real problem. So we don't need to fix orte_init.

@kawashima-fj kawashima-fj merged commit a9b1629 into open-mpi:master Feb 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants