-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up nhc_hw_gather_data function for KNL #47
Conversation
Reduces number of access to /proc/cpuinfo in the nhc_hw_gather_data function. For processors like Intel Phi KNL with 256+ threads, the function was taking over 40 seconds to return. Running on a normal file, the function should only take about 4 seconds. Pre-filtering with grep shortens the function runtime further to 1.5s on a 1.4GHz (64 core, 256 thread) KNL.
This one is great!
|
Hi |
This branch/changeset extensively refactors `lbnl_hw.nhc` as a possible solution/improvement for the `/proc` file I/O problem (e.g., - Split `nhc_hw_gather_data()` into distinct functions per section, allowing more fine-grained control over which `procfs` and/or `sysfs` files actually need to be read and parsed in the first place; - Convert the path+filename for each file being parsed into a configuration variable that users can customize as needed; - Alter the way NHC is pulling data from each file, by using a single `read` invocation, to avoid the [`lseek()`/rebuild problems described](#43 (comment)) by @mattmix in #43. - Add support for using a cached copy of the original file (instead of reading directly from `/proc` or `/sys`) to avoid the resulting poor performance; and - Allow for the aforementioned configuration variables to specify a process substitution expression (see `bash(1)` for details) rather than a path+filename. This, in turn, permits the user to, for example, `grep` out unused lines to minimize parsing time. (For a specific example and rationale, see [this comment](#30 (comment)) from @NateCrawford with relevant performance comparisons.) - Improve the unit tests for this module; now that the hardware checks are capable of reading from a user-defined location, the unit tests feed auto-generated data into the checks via the `/dev/stdin` "file" (really a shell variable). Not only does this verify the new "user-specified custom data source" code, it also expands the "code coverage" of the unit tests. The old tests just directly assigned the test data to the right NHC-internal variables; the new tests cover the parsing code as well, not just the checks themselves. These changes are intended to fix #30, #39, #43, #47, and #118 as well as some older LANL-internal issues with [Trinity](https://www.lanl.gov/projects/trinity/specifications.php) (our Haswell/KNL-based, nineteen-thousand-node HPE/Cray XC40). And with respect to Trinity, I would be remiss were I to fail to express my sincere thanks to @grahamvh, my colleague at @lanl and one of the main sysadmins for that system, who helped me immensely in brainstorming, devising potential solutions, testing, and providing critical feedback en route toward finally getting this problem licked! Feedback on this approach is much appreciated!
Fixed via #121 instead; closing. NHC goes to great lengths to avoid spawning subshells or other processes wherever and whenever possible. The culprit turned out to be the way the kernel and Bash handle |
Reduces number of access to /proc/cpuinfo in the nhc_hw_gather_data function. For processors like Intel Phi KNL with 256+ threads, the function was taking over 40 seconds to return. Running on a normal file, the function should only take about 4 seconds. Pre-filtering with grep shortens the function runtime further to 1.5s on a 1.4GHz (64 core, 256 thread) KNL.