Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-17313 Fix potential smartjoin assert if failing over to local #9771

Merged
merged 1 commit into from Mar 29, 2017

Conversation

@jakesmith
Copy link
Member

commented Mar 29, 2017

When failing over to local smartjoin, if memory pressure causes a
OOM callback twice after successfully gathering the whole RHS
without spilling, then an assert was hit whilst reading the
gathered rhs rows.

Signed-off-by: Jake Smith jake.smith@lexisnexisrisk.com

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browers
    • The component(s) render as expected

Testing:

To expose the bug, I artificially provoked a double OOM callback at appropriate point.
For reference that was by adding the 2 "TEST*" lines below

                    if (marker.init(rhsRows, rightRowManager)) // May fail if insufficient memory available
                    {
                        // NB: If marker.init() returned false, it will have called the MM callbacks and have setup hasFailedOverToLocal() already
                        success = rhs.resize(rhsRows, SPILL_PRIORITY_LOW); // NB: Could OOM, handled by exception handler
                        clearAllNonLocalRows("TEST1", true);
                        clearAllNonLocalRows("TEST2", true);
                    }
@hpcc-jirabot

This comment has been minimized.

Copy link

commented Mar 29, 2017

@jakesmith jakesmith force-pushed the jakesmith:hpcc-17313 branch from 65f5fdc to 1aa5c65 Mar 29, 2017
When failing over to local smartjoin, if memory pressure causes a
OOM callback twice after successfully gathering the whole RHS
without spilling, then an assert was hit whilst reading the
gathered rhs rows.

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
@jakesmith jakesmith force-pushed the jakesmith:hpcc-17313 branch from 1aa5c65 to 05d53f5 Mar 29, 2017
@@ -554,6 +554,7 @@ inline StringBuffer& operator << (StringBuffer& s, const TValue& value)

extern jlib_decl void decodeCppEscapeSequence(StringBuffer & out, const char * in, bool errorIfInvalid);
extern jlib_decl bool strToBool(const char * text);
inline const char *boolToStr(bool b) { return b ? "true" : "false"; }

This comment has been minimized.

Copy link
@jakesmith

jakesmith Mar 29, 2017

Author Member

@ghalliday - auxiliary util. method addition, could put in separate JIRA/PR if you want..

@jakesmith

This comment has been minimized.

Copy link
Member Author

commented Mar 29, 2017

@ghalliday - please review

@HPCCSmoketest

This comment has been minimized.

Copy link
Contributor

commented Mar 29, 2017

Automated Smoketest:
Sha: 05d53f5
Build: success
ECL Watch: Rebuilding Site

errors warnings build time
0 65 92.315 seconds

Install hpccsystems-platform-community_6.3.0-trunk0.el7.x86_64.rpm
HPCC Start: OK

Unit tests result:

Test total passed failed errors timeout
unittest 93 93 0 0 0
wutoolTest(Dali) 19 19 0 0 0
wutoolTest(Cassandra) 19 19 0 0 0

Regression test result:

phase total pass fail
setup (hthor) 9 9 0
setup (thor) 9 9 0
setup (roxie) 9 9 0
test (hthor) 720 719 1
test (thor) 615 615 0
test (roxie) 745 745 0

HPCC Stop: OK
HPCC Uninstall: OK

@AttilaVamos

This comment has been minimized.

Copy link
Contributor

commented Mar 29, 2017

It failed on hthor because the persist_refresh.ecl doesn't parallel query safe (HPCC-17208).

@ghalliday ghalliday merged commit f1f2b22 into hpcc-systems:candidate-6.4.0 Mar 29, 2017
@richardkchapman

This comment has been minimized.

Copy link
Member

commented Apr 4, 2017

Cherry-picked to 6.2.12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.