HPCC-20493 Allow connect and read timeout to be configured in ESP #12392

Merged
merged 1 commit into from
Apr 18, 2019

Conversation

shamser
Contributor

@shamser shamser commented Apr 4, 2019

Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexis.co.uk>

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Testing:

@hpcc-jirabot

@shamser
Contributor Author

shamser commented Apr 4, 2019

TODO: still working on providing the test cases for this.

@shamser shamser force-pushed the issue20493 branch 3 times, most recently from 3674095 to 49b24b5 Compare April 4, 2019 15:34
@shamser
Contributor Author

shamser commented Apr 8, 2019

@afishbeck Please can you review these changes relating to being able to set the timeouts for ESP services. By the way, I've tested this manually by (1) creating a multi-node thor, (2) spraying to this thor, (3) shutting down the remote thor node's dafilesrv, and (4) executing WUInfo, DFUQuery and DFUInfo ESP queries, then comparing how long it takes to time out with and without the new settings. (I'm inclined not to produce test cases because it seems to be quite a lot of work for not much benefit.)

@@ -325,6 +326,10 @@ CEspConfig::CEspConfig(IProperties* inputs, IPropertyTree* envpt, IPropertyTree*
    if (m_cfg->getProp("@daliServers", daliservers))
        initDali(daliservers.str()); //won't init if detached

    const unsigned dafilesrvConnectTimeout = m_cfg->getPropInt("@dafilesrvConnectTimeout", 2000);
Member

How were the defaults decided on? Will a change in default behavior catch the users off guard? Looks like old behavior was the equivalent of 0's here?

Contributor Author

We discussed the default timeout period last Monday. @jakesmith suggested that a shorter timeout period would be sensible (it resolves issues such as https://track.hpccsystems.com/browse/HPCC-19002) and, if it were to cause problems, the timeout period could be extended with the configuration option.

I agree that it would cause problems if, say, a 30 second timeout is necessary and the request eventually returns a result. But does any remote file access require this length of time? Maybe a 2 second timeout is too short?

@jakesmith ?

Member

@jakesmith jakesmith Apr 11, 2019

My suggestion was that the global default (not just in esp) should be much shorter than it is now.
( Currently the defaults are 100 seconds and 3 retries. )
And it should be globally configurable (e.g. via environment.conf), so that if needed / if problematic, it can be raised.

I think I had suggested changing it to 10 seconds (3 retries may be still ok).

Once that is done, whilst having a separate configuration in Esp is okay, it becomes less relevant.

There is also the further improvement of implementing a better mechanism via a timed blacklist, so clients don't keep retrying the same endpoint and repeatedly timing out, which cumulatively can be significant even if each timeout is short. (There is an implementation like this in thorsoapcall, which should be looked at to see whether it can be reused.)
@shamser - was a JIRA opened for the blacklist implementation?
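
For illustration, the timed blacklist idea described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `TimedBlacklist` and its method names are hypothetical and are not taken from thorsoapcall or any other HPCC code, the endpoint is represented as a plain string, and the class is not thread-safe. The clock value is passed in by the caller so that the behaviour can be checked without sleeping.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical sketch of a timed blacklist (not HPCC code, not thread-safe).
// An endpoint that has just timed out is skipped until its hold time expires,
// so callers do not pay the full timeout again on every request.
class TimedBlacklist
{
public:
    using Clock = std::chrono::steady_clock;

    explicit TimedBlacklist(std::chrono::seconds hold) : holdTime(hold)
    {
        assert(hold.count() >= 0); // a negative hold time makes no sense
    }

    // Record that a connect or read to this endpoint timed out at 'now'.
    void noteFailure(const std::string &endpoint, Clock::time_point now)
    {
        expiry[endpoint] = now + holdTime;
    }

    // Record a success, so the endpoint is usable again immediately.
    void noteSuccess(const std::string &endpoint)
    {
        expiry.erase(endpoint);
    }

    // True while the endpoint should be skipped without attempting to connect.
    bool isBlacklisted(const std::string &endpoint, Clock::time_point now)
    {
        auto it = expiry.find(endpoint);
        if (it == expiry.end())
            return false;
        if (now >= it->second)
        {
            expiry.erase(it); // hold time elapsed: allow a fresh attempt
            return false;
        }
        return true;
    }

private:
    std::chrono::seconds holdTime;
    std::unordered_map<std::string, Clock::time_point> expiry;
};
```

A caller would check `isBlacklisted()` before connecting, call `noteFailure()` when a connect or read times out, and call `noteSuccess()` after a successful request. With a hold time of, say, 30 seconds (an arbitrary value chosen for this example), only the first request in each 30 second window would wait for the full timeout.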

Contributor Author

@jakesmith I have just created the blackout jira: https://track.hpccsystems.com/browse/HPCC-21927. @afishbeck Do you think it should be set to '0' or is 10 seconds ok?

Member

@jakesmith jakesmith Apr 11, 2019

> I have just created the blackout jira

@shamser - blacklist?

Member

> Do you think it should be set to '0' or is 10 seconds ok?

@shamser - are you asking re. this esp custom timeout, or the global timeout I was talking about above?
Has a JIRA been opened to change the global timeout?

If the global timeout is set as suggested (10 secs), then I'd leave the ESP timeout unset by default, so that it falls back to the global timeout, but they (ops) can override it if they want to.
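
The fallback rule suggested here (an unset ESP value inherits the global timeout, an explicit ESP value overrides it) could be sketched as below. This is a hedged illustration rather than the code in this PR: `TimeoutSettings` and `resolveEspTimeouts` are hypothetical names, values are assumed to be in seconds, and `std::optional` (C++17) stands in for "attribute not present in the configuration".

```cpp
#include <optional>

// Hypothetical sketch, not HPCC code: resolves ESP timeouts against a global default.
struct TimeoutSettings
{
    unsigned connectSecs;
    unsigned readSecs;
};

// An unset ESP value (std::nullopt) inherits the global value;
// a set ESP value overrides it.
TimeoutSettings resolveEspTimeouts(std::optional<unsigned> espConnectSecs,
                                   std::optional<unsigned> espReadSecs,
                                   const TimeoutSettings &global)
{
    TimeoutSettings resolved;
    resolved.connectSecs = espConnectSecs.value_or(global.connectSecs);
    resolved.readSecs = espReadSecs.value_or(global.readSecs);
    return resolved;
}
```

With a global setting of 10 seconds for both values, calling `resolveEspTimeouts(std::nullopt, std::nullopt, global)` leaves both at 10, while passing an explicit connect value changes only the connect timeout.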

Contributor Author

@jakesmith I'm talking about the ESP custom and default timeouts. This one is designed to address a number of reported issues relating to ESP services (making remote file requests) taking too long to time out. As for the other aspects related to timeouts, I feel that they may not be immediately pressing as no one has complained (yet), and we can merge this one in the meantime, until we have agreed on the more comprehensive approach for timings. By the way, there was a bug in the previous commit, which I have now fixed and pushed. (I've also set the default timeout period to 10 seconds.)

Member

@afishbeck afishbeck left a comment

@shamser one question.


<xs:attribute name="dafilesrvConnectTimeout" type="xs:nonNegativeInteger"
hpcc:displayName="Remote file connection timeout in seconds"
hpcc:presentValue="2"
Member

this should be 10 now ?

hpcc:tooltip="Remote file access connection timeout in seconds"/>
<xs:attribute name="dafilesrvReadTimeout" type="xs:nonNegativeInteger"
hpcc:displayName="Remote file read timeout in seconds"
hpcc:presentValue="2"
Member

this should be 10 now ?

@jakesmith
Member

@shamser - 1 comment. Please go ahead and squash after changing.

Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexis.co.uk>
@shamser
Contributor Author

shamser commented Apr 17, 2019

@jakesmith Squashed. Please can you do a final review.

@HPCCSmoketest
Contributor

Automated Smoketest: ✅
OS: centos 7.4.1708 (Linux 3.10.0-327.28.3.el7.x86_64)
Sha: ea8c114
Build: success
Install HPCC Platform
HPCC Start: OK

Unit tests result:

| Test | total | passed | failed | errors | timeout | elapsed |
|---|---|---|---|---|---|---|
| unittest | 111 | 111 | 0 | 0 | 0 | 35 sec |
| wutoolTest(Dali) | 19 | 19 | 0 | 0 | 0 | 2 sec |
| wutoolTest(Cassandra) | 19 | 19 | 0 | 0 | 0 | 7 sec |

Regression test result:

| phase | total | pass | fail | elapsed |
|---|---|---|---|---|
| setup (hthor) | 11 | 11 | 0 | 23 sec (00:00:23) |
| setup (thor) | 11 | 11 | 0 | 44 sec (00:00:44) |
| setup (roxie) | 11 | 11 | 0 | 27 sec (00:00:27) |
| test (hthor) | 827 | 827 | 0 | 196 sec (00:03:16) |
| test (thor) | 752 | 752 | 0 | 608 sec (00:10:08) |
| test (roxie) | 903 | 903 | 0 | 223 sec (00:03:43) |

HPCC Stop: OK
Time stats:

| Prep time | Build time | Package time | Install time | Start time | Test time | Stop time | Summary |
|---|---|---|---|---|---|---|---|
| 29 sec (00:00:29) | 209 sec (00:03:29) | 0 sec (00:00:00) | 3 sec (00:00:03) | 19 sec (00:00:19) | 1319 sec (00:21:59) | 21 sec (00:00:21) | 1600 sec (00:26:40) |

@jakesmith
Member

Looks good.

@richardkchapman - ready to merge.

@richardkchapman
Member

@JamesDeFabia @shamser Is there a documentation impact?

@shamser
Contributor Author

shamser commented Apr 18, 2019

> @JamesDeFabia @shamser Is there a documentation impact?

@JamesDeFabia @richardkchapman I have created a documentation Jira: https://track.hpccsystems.com/browse/HPCC-21974. This change affects current behaviour. How do we get this documented in the Redbook?

@richardkchapman
Member

I have added a tag

@richardkchapman richardkchapman merged commit e656939 into hpcc-systems:master Apr 18, 2019