HPCC-19906 Fix race during finalization of spilling container. #11320

Merged: 2 commits, merged on Jun 27, 2018


@jakesmith (Member) commented Jun 15, 2018

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Testing:

Manually reproduced the race conditions by using gdb to halt threads.

@jakesmith
Member Author

@ghalliday - the only functional change (the fix itself) is in the last commit; the other commits improve the tracing, tidy up some virtuals and add overrides.

Please review and let me know when to squash.

@ghalliday
Member

@jakesmith the changes look sensible. Any idea why the thor loop tests are failing?

@jakesmith
Member Author

No, I will investigate.

@jakesmith
Member Author

@ghalliday - fixed the cause of the loop failures: the locking scope wasn't long enough in some cases (there used to be a crit block in getStream).

Please review.
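A minimal sketch of why the locking scope matters during finalization, assuming a simplified model: one thread finalizes the container while a memory-manager callback may concurrently try to spill it. The class and method names (SpillableContainer, spill, getStream) are hypothetical and this is not the HPCC code; the point is only that the hand-off in getStream holds the lock for its whole duration, so a spill cannot interleave between inspecting the in-memory rows and marking the container finalized.

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a spillable row container; this is
// not the HPCC CSpillable/row-collector API. Rows are plain ints here.
class SpillableContainer
{
    std::mutex crit;               // guards every member below
    std::vector<int> rows;         // rows still held in memory
    std::size_t spilledCount = 0;  // rows "written to disk" by spill()
    bool finalized = false;        // true once getStream() owns the rows

public:
    void add(int row)
    {
        std::lock_guard<std::mutex> guard(crit);
        rows.push_back(row);
    }

    // Called asynchronously (e.g. by a memory-manager callback) when memory
    // is short. Returns true if any rows were released.
    bool spill()
    {
        std::lock_guard<std::mutex> guard(crit);
        if (finalized || rows.empty())
            return false;              // the stream already owns the rows
        spilledCount += rows.size();   // pretend the rows were written out
        rows.clear();
        return true;
    }

    // Finalization: the lock is held for the whole hand-off, so spill() cannot
    // run between taking the in-memory rows and marking the container
    // finalized. Returns the number of rows that had been spilled.
    std::size_t getStream(std::vector<int> &out)
    {
        std::lock_guard<std::mutex> guard(crit);
        finalized = true;
        out = std::move(rows);
        rows.clear();                  // reset the moved-from vector
        return spilledCount;
    }
};
```

If the lock in getStream covered only part of the hand-off (for example, just the read of rows), a spill arriving in the gap could free rows that the stream was about to return.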

```cpp
totalRows += spillableRows.numCommitted();
if (iCompare)
{
    // Option(rcflag_noAllInMemSort) - avoid sorting allMemRows
```
Member

Not connected with this PR, but I don't think this option is ever used (I was trying to work out why it would be used...)

Member Author

No, I can see it's not. Although it's not directly related, I am going to remove the associated code now.

```
@@ -165,6 +165,7 @@ class CSpillable : public CSimpleInterfaceOf<roxiemem::IBufferedRowCallback>
unsigned spillPriority = SPILL_PRIORITY_DISABLE;
IThorRowInterfaces *rowIf = nullptr;
roxiemem::IRowManager *rowManager = nullptr;
StringAttr tracingPrefix;
```
Member

It would be more efficient if this didn't involve a heap allocation. I suspect it is a tiny proportion of the time spent in these classes, but I am slightly concerned about the overhead for small numbers of rows in child queries.

Member Author

Agreed, it's a concern, but the bigger issue is that if these objects in general are repeatedly recreated on the heap for small row sets in child queries, that is wasteful and inefficient.
I've tried in the past to avoid that, but there is likely plenty of scope for further optimizations that minimize reallocation of reusable utility objects like this.
I think the approach should be, where possible, to create an object such as CSpillableStream once and reuse it on the next iteration of the activity in a child query. The overhead of, for example, this StringAttr allocation then becomes a one-time cost and is trivial.
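A minimal sketch of the reuse approach suggested above, assuming an activity that runs once per child-query iteration. SpillableHelper and ChildQueryActivity are hypothetical names (not HPCC classes), and std::string stands in for StringAttr. The helper, including its tracing-prefix string, is allocated on the first iteration only; later iterations call reset(), which clears per-iteration state but keeps existing allocations.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical helper standing in for a reusable utility object such as
// CSpillableStream; not the HPCC class. std::string stands in for StringAttr.
class SpillableHelper
{
    std::string tracingPrefix;  // allocated once, in the constructor
    std::vector<int> buffer;    // keeps its capacity across reset()

public:
    explicit SpillableHelper(const std::string &prefix) : tracingPrefix(prefix) {}

    void add(int row) { buffer.push_back(row); }
    std::size_t size() const { return buffer.size(); }
    const std::string &prefix() const { return tracingPrefix; }

    // Clear per-iteration state but keep existing allocations for reuse.
    void reset() { buffer.clear(); }
};

// Hypothetical activity that is executed once per child-query iteration.
class ChildQueryActivity
{
    std::unique_ptr<SpillableHelper> helper;  // created on first use, then reused
    unsigned creations = 0;                   // how many helpers were allocated

public:
    std::size_t runIteration(const std::vector<int> &input)
    {
        if (!helper)
        {
            helper = std::make_unique<SpillableHelper>("activity[1]");
            ++creations;
        }
        else
            helper->reset();
        for (int row : input)
            helper->add(row);
        return helper->size();
    }

    unsigned numCreations() const { return creations; }
};
```

With this pattern, the cost of constructing the helper (and its prefix string) is paid once per activity instance rather than once per iteration, which is what makes it negligible for child queries over small row sets.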

@ghalliday
Member

@jakesmith I am not sure I would spot any big problems, but the logic looks sensible. I have a couple of minor comments, but it could be merged as-is.

@jakesmith
Member Author

@ghalliday - added a commit to remove RowCollectorOptionFlags / rcflag_noAllInMemSort.

Let me know and I'll squash the commits.

@jakesmith
Member Author

@AttilaVamos - there are no changes to hthor or any base classes/code that hthor uses in this PR, so the failures would appear to be unrelated.

Do you know whether another merged PR is implicated?

@jakesmith
Member Author

@ghalliday - anything recently merged that may be causing these mismatches?

@AttilaVamos
Contributor

I don't know of any.

@AttilaVamos (Contributor) commented Jun 20, 2018

I ran your PR on my Ubuntu-based smoketest. I also fetched and built the latest 6.4.22 and ran the regression tests on it to check that it is healthy.

@jakesmith
Member Author

@ghalliday - the smoketest issues were my fault: the last commit caused them, and they are now rectified.

Please review/let me know when to squash.

@ghalliday
Member

@jakesmith I think it looks ok - please squash

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
@jakesmith
Member Author

@ghalliday - squashed.

@HPCCSmoketest
Contributor

Automated Smoketest: ✅
OS: centos 7.4.1708 (Linux 3.10.0-327.28.3.el7.x86_64)
Sha: ec6694b
Build: success
Install hpccsystems-platform-community_6.4.22-closedown0.el7.x86_64.rpm
HPCC Start: OK

Unit tests result:

| Test | Total | Passed | Failed | Errors | Timeout |
|---|---|---|---|---|---|
| unittest | 97 | 97 | 0 | 0 | 0 |
| wutoolTest(Dali) | 19 | 19 | 0 | 0 | 0 |
| wutoolTest(Cassandra) | 19 | 19 | 0 | 0 | 0 |

Regression test result:

| Phase | Total | Pass | Fail |
|---|---|---|---|
| setup (hthor) | 11 | 11 | 0 |
| setup (thor) | 11 | 11 | 0 |
| setup (roxie) | 11 | 11 | 0 |
| test (hthor) | 736 | 736 | 0 |
| test (thor) | 631 | 631 | 0 |
| test (roxie) | 765 | 765 | 0 |

HPCC Stop: OK
HPCC Uninstall: OK
Time stats:

| Prep time | Build time | Package time | Install time | Start time | Test time | Stop time | Summary |
|---|---|---|---|---|---|---|---|
| 39 sec (00:00:39) | 147 sec (00:02:27) | 40 sec (00:00:40) | 6 sec (00:00:06) | 52 sec (00:00:52) | 1116 sec (00:18:36) | 39 sec (00:00:39) | 1439 sec (00:23:59) |

@ghalliday merged commit dd1b45a into hpcc-systems:candidate-6.4.22 on Jun 27, 2018
5 participants