Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-14535 Fetch all available ESPs and verify ESP is active #9642

Conversation

rpastrana
Copy link
Member

@rpastrana rpastrana commented Feb 28, 2017

  • Fetches all available wsfs instances
  • Verifies given wsfs is accessible

Signed-off-by: Rodrigo Pastrana rodrigo.pastrana@lexisnexis.com

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browers
    • The component(s) render as expected

Testing:

@hpcc-jirabot
Copy link

@rpastrana
Copy link
Member Author

@wangkx please review

Copy link
Member

@wangkx wangkx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments.

{
StringBuffer key;
CriticalBlock b(espURLcrit);
espUrlsLastHeartBeat.setValue(key.set(url).str(), aliveTime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like that url will not be null. Why not use url directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to persist url.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this answer. The copy into key does not make any sense to me unless you are trying to translate nullptr to "" (but it's an inefficient way to do that...)

server.addServiceUrl(espurl);

Owned<IClientEchoDateTime> req = server.createEchoDateTimeRequest();
Owned<IClientEchoDateTimeResponse> result = server.EchoDateTime(req);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If FileSpray is not running, will EchoDateTime return null or throw exception? How to handle it?

@rpastrana
Copy link
Member Author

@wangkx please review.

Owned<IClientEchoDateTimeResponse> result = server.EchoDateTime(req);
try
{
Owned<IClientEchoDateTime> req = server.createEchoDateTimeRequest();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rpastrana Wrong indent.

@rpastrana rpastrana force-pushed the HPCC-14535-fileservicesplugin branch from b927709 to b906db3 Compare March 2, 2017 20:30
@rpastrana
Copy link
Member Author

@wangkx please review again

@wangkx
Copy link
Member

wangkx commented Mar 2, 2017

@rpastrana looks fine now.

@rpastrana
Copy link
Member Author

@richardkchapman please merge

const char * key = ((const char *)iter.query().getKey());
time_t lastAlive = *(espUrlsLastHeartBeat.getValue(key));

if (now.getSimple() - lastAlive < 60 * 60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You check once an hour? That seems a little infrequent...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Polling intervals are always difficult, the right answer is usually to make it configurable, however based on your overall recommendation, this issue becomes moot.


try
{
Owned<IClientEchoDateTime> req = server.createEchoDateTimeRequest();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does this take to time out if the server is NOT available? And is that a problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice it's in the low seconds, but the fact we can't control the connect|read, etc timeouts on the generated ESP clients is an issue I'm looking into. I want to expose connection level configuration options.

@richardkchapman
Copy link
Member

@rpastrana A couple of questions, but also a more general point. I don't really understand why there is a need to be tracking availability / timestamps etc. The bug we are trying to fix is that it doesn't fail over when the server it previously connected to goes away... seems like all we need to do is (a) when first connecting, loop through all listed servers until we find one that responds and (b) repeat that process any time the one we talked to before doesn't respond.

What am I missing?

@rpastrana
Copy link
Member Author

@richardkchapman well, the concept was to avoid pinging the ESP service as much as possible while maintaining a "reasonably" spaced heartbeat test.

@rpastrana rpastrana force-pushed the HPCC-14535-fileservicesplugin branch 2 times, most recently from 0072524 to 7230017 Compare March 3, 2017 21:36
@rpastrana
Copy link
Member Author

@richardkchapman please review.

{
PROGLOG("Could not contact WsFS: '%s'",espurl);
}
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should say return false

if (result->getResult())
return true;
}
catch(...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will leak any IException that was thrown...

const char * currentUrl = availableWsFS.item(index);
if (contactWsFS(currentUrl, wu))
return currentUrl;
//else remove it from the list?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should NOT remove from the list - if a user has no rights on a given ESP it does not mean that other users don't (and this list is static so shared by all users)


const char * wsfsurl = getNextAliveWsFSURL(wu);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a shame to start from beginning of list each time rather than last successful connection (though saving last successful connection per user might be a bit of a pain).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardkchapman Would it have to be per user? If they can't connect because of permissions I would think they couldn't connect to any instance of ws_fs.. since I think ws_fs access control is only checked at the feature level?

Therefore I think the list would be the same for all users based on the environment, and if the service isn't accessible because of permissions then the permissions should fail for that user on all instances (unless they aren't all secure... which would be inconsistent and bad).

@richardkchapman
Copy link
Member

@rpastrana Some comments

@rpastrana rpastrana force-pushed the HPCC-14535-fileservicesplugin branch from 7230017 to 02a9f50 Compare March 6, 2017 17:31
@rpastrana
Copy link
Member Author

@richardkchapman changes based on latest review squashed in.

@rpastrana
Copy link
Member Author

@richardkchapman new Jira based on client request timeout: https://track.hpccsystems.com/browse/HPCC-17178

@richardkchapman
Copy link
Member

@afishbeck @wangkx please re-review

{
StringBuffer user, password, token;
wu->getSecurityToken(StringBufferAdaptor(token));
extractToken(token.str(), wu->queryWuid(), StringBufferAdaptor(user), StringBufferAdaptor(password));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardkchapman we have really got to find a better way than storing user credentials in an easily extracted manner within workunits. Especially if we ever start using credentials that are valid for services outside of HPCC functionality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I commented on that issue. Hopefully we can get rid of these entirely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It seems the ECL Plugin logic shouldn't have to deal with credentials at all, either the environment executing the logic (eclide, etc) should provide the connection logic as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have proposed using Public/Private key files and OpenSSL to manage password encryption. See https://track.hpccsystems.com/browse/HPCC-17172

Copy link
Member

@afishbeck afishbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments.

Owned<IPropertyTreeIterator> iter3 = iter2->query().getElements("AuthenticateFeature");
ForEach(*iter3) {
ForEach(*iter3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we are looking at authenticated features. The EspBinding with service="EclWatch" represents a collection of services including ws_fs. Looking for both that and for a standalone EspBinding with service="ws_fs" should be enough. We don't really care how authentication is configured here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy code, looks to be based on the assumption that looking for eclwatch alone wasn't guaranteed to always contain ws_fs???
Looking for service="ws_fs" will always yield false until the day somebody decides to split wsfs from eclwatch...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EclWatch will always contain ws_fs unless we make a major (very unlikely) change in the future. All services can be run on their own, although as you imply it may not be likely for this one.

The way authorization features ended up being done in the environment.xml is a bit weird, but the authorization isn't the best indicator of the service type, and even worse, another service may decide it needs to check ws_fs permissions even though it isn't the ws_fs service. But again the setup config mechanism is weird.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on all these points, and your suggestion is the obvious cleanest implementation. But no matter which way we go, we're making assumptions (WsFS will always be part of ecwatch, wsfs auth always associated with wsfs). It does't seem worthwhile to change the legacy implementation, but I'll go ahead a change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above suggestion forces a few more string compares, but saves an entire set of iterations over Auth entries, definitely worthwhile change.

iter3->query().getProp("@service",wsfsurl.clear())&&
(strcmp(wsfsurl.str(),"ws_fs")==0)){
if (iter2->query().getProp("@protocol",wsfsurl.clear())) {
wsfsurl.append("://");
StringBuffer espname;
if (iter1->query().getProp("@name",espname)) {
StringBuffer espinst;
if (iter1->query().getProp("Instance[1]/@computer",espinst)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're handling the case of multiple ESP processes where the first is down. But what about the case of a single ESP process with multiple instances where the first INSTANCE is down?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point


const char * wsfsurl = getNextAliveWsFSURL(wu);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardkchapman Would it have to be per user? If they can't connect because of permissions I would think they couldn't connect to any instance of ws_fs.. since I think ws_fs access control is only checked at the feature level?

Therefore I think the list would be the same for all users based on the environment, and if the service isn't accessible because of permissions then the permissions should fail for that user on all instances (unless they aren't all secure... which would be inconsistent and bad).

@richardkchapman
Copy link
Member

@rpastrana I think this is back with you at present

- Catches exceptions in WSFS method invocation
- Creates list of all configured WsFS services
- Attempts to contact WsFS before providing that target URL
- Cleans up error messages

Signed-off-by: Rodrigo Pastrana <rodrigo.pastrana@lexisnexis.com>
- Fetches all available wsfs instances
- Verifies given wsfs is accessible

Signed-off-by: Rodrigo Pastrana <rodrigo.pastrana@lexisnexis.com>
@rpastrana rpastrana force-pushed the HPCC-14535-fileservicesplugin branch from f61d62b to 561edf9 Compare March 16, 2017 21:09

if (env.get())
{
StringBuffer wsFSUrl;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracted from the inner loops to avoid re-initialization of the SB objects.

@rpastrana
Copy link
Member Author

@afishbeck please review.

@HPCCSmoketest
Copy link
Contributor

Automated Smoketest: ✅
Sha: 561edf9
Build: success
ECL Watch: Rebuilding Site

errors warnings build time
0 65 94.267 seconds

Install hpccsystems-platform-community_6.3.0-trunk0.el7.x86_64.rpm
HPCC Start: OK

Unit tests result:

Test total passed failed errors timeout
unittest 93 93 0 0 0
wutoolTest(Dali) 19 19 0 0 0
wutoolTest(Cassandra) 19 19 0 0 0

HPCC Stop: OK
HPCC Uninstall: OK

Copy link
Member

@afishbeck afishbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rpastrana looks better.

Still wish we could cache the list.

Owned<IPropertyTreeIterator> iter3 = iter2->query().getElements("AuthenticateFeature");
ForEach(*iter3)
espBindingIter->query().getProp("@service",wsFSUrl.clear());
if (wsFSUrl.length() && (stricmp(wsFSUrl.str(),"EclWatch")==0 || stricmp(wsFSUrl.str(),"ws_fs")==0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but I find strieq() easier to read.

Also this is why occasionally a more generic string name is good... if its being used for multiple purposes. This name is no longer self describing.

@rpastrana
Copy link
Member Author

@richardkchapman please merge.

@richardkchapman richardkchapman merged commit a1b2370 into hpcc-systems:candidate-6.4.0 Mar 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants