New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-14535 Fetch all available ESPs and verify ESP is active #9642
HPCC-14535 Fetch all available ESPs and verify ESP is active #9642
Conversation
https://track.hpccsystems.com/browse/HPCC-14535 |
@wangkx please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comments.
{ | ||
StringBuffer key; | ||
CriticalBlock b(espURLcrit); | ||
espUrlsLastHeartBeat.setValue(key.set(url).str(), aliveTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like that url will not be null. Why not use url directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to persist url.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this answer. The copy into key does not make any sense to me unless you are trying to translate nullptr to "" (but it's an inefficient way to do that...)
server.addServiceUrl(espurl); | ||
|
||
Owned<IClientEchoDateTime> req = server.createEchoDateTimeRequest(); | ||
Owned<IClientEchoDateTimeResponse> result = server.EchoDateTime(req); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If FileSpray is not running, will EchoDateTime return null or throw exception? How to handle it?
@wangkx please review. |
Owned<IClientEchoDateTimeResponse> result = server.EchoDateTime(req); | ||
try | ||
{ | ||
Owned<IClientEchoDateTime> req = server.createEchoDateTimeRequest(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rpastrana Wrong indent.
b927709
to
b906db3
Compare
@wangkx please review again |
@rpastrana looks fine now. |
@richardkchapman please merge |
const char * key = ((const char *)iter.query().getKey()); | ||
time_t lastAlive = *(espUrlsLastHeartBeat.getValue(key)); | ||
|
||
if (now.getSimple() - lastAlive < 60 * 60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You check once an hour? That seems a little infrequent...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Polling intervals are always difficult, the right answer is usually to make it configurable, however based on your overall recommendation, this issue becomes moot.
|
||
try | ||
{ | ||
Owned<IClientEchoDateTime> req = server.createEchoDateTimeRequest(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long does this take to time out if the server is NOT available? And is that a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice it's in the low seconds, but the fact we can't control the connect|read, etc timeouts on the generated ESP clients is an issue I'm looking into. I want to expose connection level configuration options.
@rpastrana A couple of questions, but also a more general point. I don't really understand why there is a need to be tracking availability / timestamps etc. The bug we are trying to fix is that it doesn't fail over when the server it previously connected to goes away... seems like all we need to do is (a) when first connecting, loop through all listed servers until we find one that responds and (b) repeat that process any time the one we talked to before doesn't respond. What am I missing? |
@richardkchapman well, the concept was to avoid pinging the ESP service as much as possible while maintaining a "reasonably" spaced heartbeat test. |
0072524
to
7230017
Compare
@richardkchapman please review. |
{ | ||
PROGLOG("Could not contact WsFS: '%s'",espurl); | ||
} | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should say return false
if (result->getResult()) | ||
return true; | ||
} | ||
catch(...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will leak any IException that was thrown...
const char * currentUrl = availableWsFS.item(index); | ||
if (contactWsFS(currentUrl, wu)) | ||
return currentUrl; | ||
//else remove it from the list? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should NOT remove from the list - if a user has no rights on a given ESP it does not mean that other users don't (and this list is static so shared by all users)
|
||
const char * wsfsurl = getNextAliveWsFSURL(wu); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit of a shame to start from beginning of list each time rather than last successful connection (though saving last successful connection per user might be a bit of a pain).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardkchapman Would it have to be per user? If they can't connect because of permissions I would think they couldn't connect to any instance of ws_fs.. since I think ws_fs access control is only checked at the feature level?
Therefore I think the list would be the same for all users based on the environment, and if the service isn't accessible because of permissions then the permissions should fail for that user on all instances (unless they aren't all secure... which would be inconsistent and bad).
@rpastrana Some comments |
7230017
to
02a9f50
Compare
@richardkchapman changes based on latest review squashed in. |
@richardkchapman new Jira based on client request timeout: https://track.hpccsystems.com/browse/HPCC-17178 |
@afishbeck @wangkx please re-review |
{ | ||
StringBuffer user, password, token; | ||
wu->getSecurityToken(StringBufferAdaptor(token)); | ||
extractToken(token.str(), wu->queryWuid(), StringBufferAdaptor(user), StringBufferAdaptor(password)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardkchapman we have really got to find a better way than storing user credentials in an easily extracted manner within workunits. Especially if we ever start using credentials that are valid for services outside of HPCC functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @RussWhitehead is looking into that matter https://track.hpccsystems.com/browse/HPCC-17172
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I commented on that issue. Hopefully we can get rid of these entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. It seems the ECL Plugin logic shouldn't have to deal with credentials at all, either the environment executing the logic (eclide, etc) should provide the connection logic as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have proposed using Public/Private key files and OpenSSL to manage password encryption. See https://track.hpccsystems.com/browse/HPCC-17172
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments.
Owned<IPropertyTreeIterator> iter3 = iter2->query().getElements("AuthenticateFeature"); | ||
ForEach(*iter3) { | ||
ForEach(*iter3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we are looking at authenticated features. The EspBinding with service="EclWatch" represents a collection of services including ws_fs. Looking for both that and for a standalone EspBinding with service="ws_fs" should be enough. We don't really care how authentication is configured here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Legacy code, looks to be based on the assumption that looking for eclwatch alone wasn't guaranteed to always contain ws_fs???
Looking for service="ws_fs" will always yield false until the day somebody decides to split wsfs from eclwatch...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EclWatch will always contain ws_fs unless we make a major (very unlikely) change in the future. All services can be run on their own, although as you imply it may not be likely for this one.
The way authorization features ended up being done in the environment.xml is a bit weird, but the authorization isn't the best indicator of the service type, and even worse, another service may decide it needs to check ws_fs permissions even though it isn't the ws_fs service. But again the setup config mechanism is weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on all these points, and your suggestion is the obvious cleanest implementation. But no matter which way we go, we're making assumptions (WsFS will always be part of ecwatch, wsfs auth always associated with wsfs). It does't seem worthwhile to change the legacy implementation, but I'll go ahead a change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above suggestion forces a few more string compares, but saves an entire set of iterations over Auth entries, definitely worthwhile change.
iter3->query().getProp("@service",wsfsurl.clear())&& | ||
(strcmp(wsfsurl.str(),"ws_fs")==0)){ | ||
if (iter2->query().getProp("@protocol",wsfsurl.clear())) { | ||
wsfsurl.append("://"); | ||
StringBuffer espname; | ||
if (iter1->query().getProp("@name",espname)) { | ||
StringBuffer espinst; | ||
if (iter1->query().getProp("Instance[1]/@computer",espinst)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're handling the case of multiple ESP processes where the first is down. But what about the case of a single ESP process with multiple instances where the first INSTANCE is down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point
|
||
const char * wsfsurl = getNextAliveWsFSURL(wu); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardkchapman Would it have to be per user? If they can't connect because of permissions I would think they couldn't connect to any instance of ws_fs.. since I think ws_fs access control is only checked at the feature level?
Therefore I think the list would be the same for all users based on the environment, and if the service isn't accessible because of permissions then the permissions should fail for that user on all instances (unless they aren't all secure... which would be inconsistent and bad).
@rpastrana I think this is back with you at present |
- Catches exceptions in WSFS method invocation - Creates list of all configured WsFS services - Attempts to contact WsFS before providing that target URL - Cleans up error messages Signed-off-by: Rodrigo Pastrana <rodrigo.pastrana@lexisnexis.com>
- Fetches all available wsfs instances - Verifies given wsfs is accessible Signed-off-by: Rodrigo Pastrana <rodrigo.pastrana@lexisnexis.com>
f61d62b
to
561edf9
Compare
|
||
if (env.get()) | ||
{ | ||
StringBuffer wsFSUrl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extracted from the inner loops to avoid re-initialization of the SB objects.
@afishbeck please review. |
Automated Smoketest: ✅
Install hpccsystems-platform-community_6.3.0-trunk0.el7.x86_64.rpm Unit tests result:
HPCC Stop: OK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rpastrana looks better.
Still wish we could cache the list.
Owned<IPropertyTreeIterator> iter3 = iter2->query().getElements("AuthenticateFeature"); | ||
ForEach(*iter3) | ||
espBindingIter->query().getProp("@service",wsFSUrl.clear()); | ||
if (wsFSUrl.length() && (stricmp(wsFSUrl.str(),"EclWatch")==0 || stricmp(wsFSUrl.str(),"ws_fs")==0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but I find strieq() easier to read.
Also this is why occasionally a more generic string name is good... if its being used for multiple purposes. This name is no longer self describing.
@richardkchapman please merge. |
Signed-off-by: Rodrigo Pastrana rodrigo.pastrana@lexisnexis.com
Type of change:
Checklist:
Testing: