New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-14396 REFRESH(boolean) option on persist #7981
Conversation
https://track.hpccsystems.com/browse/HPCC-14396 |
@jamienoss Please can you code review? |
Automated Smoketest Build: success |
@@ -2502,6 +2502,8 @@ bool EclAgent::checkPersistUptoDate(IRuntimeWorkflowItem & item, const char * lo | |||
errText.appendf("Rebuilding PERSIST('%s'): Saved CRC isn't present", logicalName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really sure what happens in this case or how this even occurs but am wondering how this is included in the semantics of REFRESH(false)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume if the CRC is missing the underlying file can still be used. CRC missing caused the file to be rebuilt because it was depending on the CRC to work out if it has changed - when Refresh=false, then it's not necessary to execute that check. @dabayliss can you comment please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which case the conditionals should swap position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamienoss do you mean swap if (!item.queryPersistRefresh()) with if (savedEclCRC != eclCRC) conditional? If I do that and CRC is missing, then the file will be re-built - I'm not sure that is what is being asked for. @jakesmith please can you comment on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!isResult(lfn, (unsigned)-2))
errText.appendf("Building PERSIST('%s'): It hasn't been calculated before", logicalName);
else if (!isResult(crcName, (unsigned)-2))
errText.appendf("Rebuilding PERSIST('%s'): Saved CRC isn't present", logicalName);
else if (isFile && !fileExists(logicalName))
errText.appendf("Rebuilding PERSIST('%s'): Persistent file does not exist", logicalName);
else if (!item.queryPersistRefresh())
return true;
else
So for REFRESH(false) you should still need to check that the file actually exists i.e. else if (isFile && !fileExists(logicalName))
should still precede else if (!item.queryPersistRefresh())
(as it does) but I don't think you want to rebuild if the crc isn't present and the file is found by logicalname and actually exists on disk. As you pointed out, the crc is wanted so as to check that it is the same as the one requested, if not rebuild i.e. refresh which seems to me exactly what REFRESH(false) is asking it not to do, no? So a direct swap, as I previously suggested, isn't complete - I think you want to move
else if (!isResult(crcName, (unsigned)-2))
errText.appendf("Rebuilding PERSIST('%s'): Saved CRC isn't present", logicalName);
to after the refresh conditional, i.e. to the last elseif before the else. I think I'm saying that the order looked incorrect in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamienoss I got it now. I didn't look at the code carefully enough. Thanks. @jakesmith can you comment on jamie's question - any problems in using the file if CRC is not present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not directly related to this PR but the above constances (-2) should use the IWUResult constants defined in workunit.hpp, i.e. ResultSequencePersist in place of -2.
(The Roxie implementation of checkPersistUptoDate does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why the CRC would be missing, but I agree with @jamienoss, if REFRESH(false) is used, then must still check if file is present, but should ignore missing CRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: any change here should also be made in Roxie's implementation of checkPersistUptoDate() in ccdcontext.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding @jakesmith comment on IWUResults constants. This will be done in this jira ticket: https://track.hpccsystems.com/browse/HPCC-14601
@shamser back to you. |
@shamser - added comments around Jamie's point about missing CRC. I've also posted some question to the JIRA as have some more fundamental questions regarding proceeding after mismatch of code crc.. |
@jamienoss Please can you do a final review. |
Automated Smoketest Build: success
HPCC Stop: OK |
@jakesmith David responded that a nice to have would be: 1) A WARNING if the format of a persist behind my current persist,false has changed might be nice (eg warn me if Universe has a new format with my sandbox refresh as false) Should we implement this suggest nice to have? |
I think we should do something And/or bury the option to continue past format matches into a #option somewhere. If the platforms always generated a clean exception on format mismatch on read as David says in 2), then it is not obvious to me why you would not check the format at persist build time and throw an error there OR rebuild. Was hoping @ghalliday would comment in the JIRA.. |
@@ -1741,6 +1742,14 @@ persistOpt | |||
: fewMany | |||
| expireAttr | |||
| clusterAttr | |||
| REFRESH '(' TOK_TRUE ')' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be REFRESH '(' expression ')'
and then code to ensure it is boolean and constant in the production. Otherwise
doRefresh := true;
...REFRESH(doRefresh);
will not be supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, I should have spotted that!
A couple of minor comments on the pull request. However I share Jake's unease with the feature. I think we should discuss when @richardkchapman gets back. A few comments:
If I think of the keyword as CACHED instead of PERSIST then it seems more reasonable. |
ds := DATASET(DestFile, rec, CSV(HEADING(1), SEPARATOR(','), QUOTE([]))); | ||
|
||
#IF (persistRefresh=true) | ||
CountriesDS := ds:PERSIST('~REGRESS::PersistRefresh', SINGLE, REFRESH(true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REFRESH(persistRefresh) should be supported with no #IF condition
@shamser Back to you to address Gavin's comments - also needs rebasing for a clean merge. I have discussed the concerns about the design/desirability of the feature with Gavin and I think we are both happy with leaving it as it is - so if you can address the PR issues I will merge. |
Automated Smoketest Build: success
HPCC Stop: OK |
@jakesmith @jamienoss would you review the latest commit before I rebase? |
Automated Smoketest Build: success
HPCC Stop: OK |
@@ -383,6 +386,7 @@ class CCloneWorkflowItem : public CInterface, implements IRuntimeWorkflowItem | |||
persistWfid = other->queryPersistWfid(); | |||
scheduledWfid = other->queryScheduledWfid(); | |||
persistCopies = other->queryPersistCopies(); | |||
persistRefresh = other->queryPersistRefresh(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not new, but none of the members of CCloneWorkflowItem are initialized in constructor, which would be cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll initialize persistRefresh in constructor and the others in separate ticket later.
@shamser - minor comments, but I think it would be better to simplify test by using inline datasets, rather than introducing new text files which you 1st spray. |
import Std; | ||
import Std.File AS FileServices; | ||
|
||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. Imports removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just meant the alone "//" but perhaps this is also a reply to Attila's comment.
DestFile := '~regress::europe.txt'; | ||
ESPportIP := 'http://127.0.0.1:8010/FileSpray'; | ||
|
||
spray := FileServices.SprayVariable(OriginalTextFilesIp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use inline ds as Jake suggested.
@shamser think it would be good to have a few syntax tests in ecl/regress for testing the new syntax. The tests in this dir aren't run on any engine, just compiled. |
InputFile := OriginalTextFilesOsPath + '/download/europe1.txt'; | ||
#ELSE | ||
InputFile := OriginalTextFilesOsPath + '/download/europe2.txt'; | ||
#END |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghalliday & @jakesmith would it be a good idea to convert all the #ifs to if to make sure that all the workflow and eclagent code works as it should do - I'm not really familiar enough with how persists work for this to be a code gen vs runtime issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...or maybe we need two test files with both? I feel unsure, considering how many times I hear that persists cause issues with graph flow, eval, & exec etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamienoss the regression programs are not just compiled. They are executed too and the results a compared against the expected results... not sure what you mean by saying they are "just compiled".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume that this response is in relation to - "think it would be good to have a few syntax tests in ecl/regress for testing the new syntax. The tests in this dir aren't run on any engine, just compiled."
rather than where you/github has placed it.
The ecl examples/tests in HPCC-Platform/testing/regress/ecl are both yes, but the ones in HPCC-Platform/ecl/regress are not, they are only compiled (at least by intention). I meant that perhaps tests should be added here also testing the new syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re.
would it be a good idea to convert all the #ifs to if to make sure that all the workflow and eclagent code works as it should do - I'm not really familiar enough with how persists work for this to be a code gen vs runtime issue?
Should use standard if instead of #IF where possible, not necessary afaict in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that is necessary. We only need tests in one of the two places - and this is testing the general version of the syntax. What would adding another test cover that this one doesn't? The only example i can think of is error cases - e.g. a non constant argument.
@richardkchapman Ready to merge. |
Automated Smoketest Build: success
HPCC Stop: OK |
|
||
persistRefresh := #IFDEFINED(root.persistRefresh, true); | ||
|
||
#IF (persistRefresh=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be a standard 'if' vs #if
New REFRESH option for Persist. Usage: 1) PERSIST('test',REFRESH(TRUE)) - same as current persist behaviour 2) PERSIST('test',REFRESH(FALSE)) - only rebuild if file missing (ignore any file date or content changes) Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexis.co.uk>
@richardkchapman - looks ok to merge to me. |
Automated Smoketest Build: success
HPCC Stop: OK |
HPCC-14396 REFRESH(boolean) option on persist Reviewed-By: Jake Smith <jake.smith@lexisnexis.com> Reviewed-By: Jamie Noss <james.noss@lexisnexis.com> Reviewed-By: Gavin Halliday <gavin.halliday@lexisnexis.com> Reviewed-By: Richard Chapman <rchapman@hpccsystems.com>
New REFRESH option for Persist.
Usage:
PERSIST('test',REFRESH(TRUE))
- same as current persist behaviour
PERSIST('test',REFRESH(FALSE))
- only rebuild if file missing (ignore any file date or content changes)