Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-14396 REFRESH(boolean) option on persist #7981

Merged
merged 1 commit into from Dec 16, 2015

Conversation

shamser
Copy link
Contributor

@shamser shamser commented Nov 26, 2015

New REFRESH option for Persist.

Usage:

  1. PERSIST('test',REFRESH(TRUE))
    - same as current persist behaviour

  2. PERSIST('test',REFRESH(FALSE))
    - only rebuild if file missing (ignore any file date or content changes)

@hpcc-jirabot
Copy link

@shamser
Copy link
Contributor Author

shamser commented Nov 26, 2015

@jamienoss Please can you code review?

@HPCCSmoketest
Copy link
Contributor

Automated Smoketest Build: success
Install hpccsystems-platform-community-with-plugins_6.1.0-trunk0.el6.x86_64.rpm
HPCC Start: OK
HPCC Stop: OK
HPCC Uninstall: OK

@@ -2502,6 +2502,8 @@ bool EclAgent::checkPersistUptoDate(IRuntimeWorkflowItem & item, const char * lo
errText.appendf("Rebuilding PERSIST('%s'): Saved CRC isn't present", logicalName);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure what happens in this case or how this even occurs but am wondering how this is included in the semantics of REFRESH(false)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume if the CRC is missing the underlying file can still be used. CRC missing caused the file to be rebuilt because it was depending on the CRC to work out if it has changed - when Refresh=false, then it's not necessary to execute that check. @dabayliss can you comment please?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case the conditionals should swap position.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamienoss do you mean swap if (!item.queryPersistRefresh()) with if (savedEclCRC != eclCRC) conditional? If I do that and CRC is missing, then the file will be re-built - I'm not sure that is what is being asked for. @jakesmith please can you comment on this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 if (!isResult(lfn, (unsigned)-2))
         errText.appendf("Building PERSIST('%s'): It hasn't been calculated before", logicalName);
 else if (!isResult(crcName, (unsigned)-2))
         errText.appendf("Rebuilding PERSIST('%s'): Saved CRC isn't present", logicalName);
 else if (isFile && !fileExists(logicalName))
         errText.appendf("Rebuilding PERSIST('%s'): Persistent file does not exist", logicalName);
 else if (!item.queryPersistRefresh())
        return true;
 else

So for REFRESH(false) you should still need to check that the file actually exists i.e. else if (isFile && !fileExists(logicalName)) should still precede else if (!item.queryPersistRefresh()) (as it does) but I don't think you want to rebuild if the crc isn't present and the file is found by logicalname and actually exists on disk. As you pointed out, the crc is wanted so as to check that it is the same as the one requested, if not rebuild i.e. refresh which seems to me exactly what REFRESH(false) is asking it not to do, no? So a direct swap, as I previously suggested, isn't complete - I think you want to move

 else if (!isResult(crcName, (unsigned)-2))
         errText.appendf("Rebuilding PERSIST('%s'): Saved CRC isn't present", logicalName);

to after the refresh conditional, i.e. to the last elseif before the else. I think I'm saying that the order looked incorrect in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamienoss I got it now. I didn't look at the code carefully enough. Thanks. @jakesmith can you comment on jamie's question - any problems in using the file if CRC is not present?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to this PR but the above constances (-2) should use the IWUResult constants defined in workunit.hpp, i.e. ResultSequencePersist in place of -2.
(The Roxie implementation of checkPersistUptoDate does)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why the CRC would be missing, but I agree with @jamienoss, if REFRESH(false) is used, then must still check if file is present, but should ignore missing CRC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB: any change here should also be made in Roxie's implementation of checkPersistUptoDate() in ccdcontext.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding @jakesmith comment on IWUResults constants. This will be done in this jira ticket: https://track.hpccsystems.com/browse/HPCC-14601

@jamienoss
Copy link
Collaborator

@shamser back to you.
Looks good, though I think you'll need to add a test to ecl/regress that tests the syntax/compiler.

@jakesmith
Copy link
Member

@shamser - added comments around Jamie's point about missing CRC.

I've also posted some question to the JIRA as have some more fundamental questions regarding proceeding after mismatch of code crc..

@shamser
Copy link
Contributor Author

shamser commented Dec 2, 2015

@jamienoss Please can you do a final review.

@HPCCSmoketest
Copy link
Contributor

Automated Smoketest Build: success
Install hpccsystems-platform-community-with-plugins_6.1.0-trunk0.el6.x86_64.rpm
HPCC Start: OK

setup hthor thor roxie
setup Pass Pass Pass
setup_fetch Pass Pass Pass
setupdict Pass Pass Pass
setupsearchindex Pass Pass Pass
setupsq Pass Pass Pass
setupwordindex Pass Pass Pass
setupxml Pass Pass Pass
test hthor thor roxie
persist_refresh (persistRefresh=false, deleteFiles=true) Pass Pass Pass
persist_refresh (persistRefresh=true, deleteFiles=false) Pass Pass Pass

HPCC Stop: OK
HPCC Uninstall: OK

@shamser
Copy link
Contributor Author

shamser commented Dec 2, 2015

@jakesmith David responded that a nice to have would be: 1) A WARNING if the format of a persist behind my current persist,false has changed might be nice (eg warn me if Universe has a new format with my sandbox refresh as false)
2) A 'clean' error if I try to USE a persist whose format has changed (eg error if Universe has a new format and my sandbox refresh is true)

Should we implement this suggest nice to have?

@jakesmith
Copy link
Member

I think we should do something And/or bury the option to continue past format matches into a #option somewhere.
Otherwise I suspect it's going to lead to headaches down the line from people abusing REFRESH(false).

If the platforms always generated a clean exception on format mismatch on read as David says in 2), then it is not obvious to me why you would not check the format at persist build time and throw an error there OR rebuild.

Was hoping @ghalliday would comment in the JIRA..

@@ -1741,6 +1742,14 @@ persistOpt
: fewMany
| expireAttr
| clusterAttr
| REFRESH '(' TOK_TRUE ')'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be REFRESH '(' expression ')'
and then code to ensure it is boolean and constant in the production. Otherwise

doRefresh := true;
...REFRESH(doRefresh);

will not be supported.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, I should have spotted that!

@ghalliday
Copy link
Member

A couple of minor comments on the pull request. However I share Jake's unease with the feature. I think we should discuss when @richardkchapman gets back. A few comments:

  • If we implement this should there be two parameters to allow you to control whether it is rebuilt (i) if data changes (ii) if code changes (if SINGLE option also supplied)?
  • Would OUTPUT(..., MISSING) provide the similar functionality? I.e., only output the file if it does not exist.
  • I suspect David wants it to be used in conjunction with ,SINGLE.
  • We should have a test that checks that an error is reported if the format changes

If I think of the keyword as CACHED instead of PERSIST then it seems more reasonable.

ds := DATASET(DestFile, rec, CSV(HEADING(1), SEPARATOR(','), QUOTE([])));

#IF (persistRefresh=true)
CountriesDS := ds:PERSIST('~REGRESS::PersistRefresh', SINGLE, REFRESH(true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REFRESH(persistRefresh) should be supported with no #IF condition

@richardkchapman
Copy link
Member

@shamser Back to you to address Gavin's comments - also needs rebasing for a clean merge.

I have discussed the concerns about the design/desirability of the feature with Gavin and I think we are both happy with leaving it as it is - so if you can address the PR issues I will merge.

@HPCCSmoketest
Copy link
Contributor

Automated Smoketest Build: success
Install hpccsystems-platform-community-with-plugins_6.1.0-trunk0.el6.x86_64.rpm
HPCC Start: OK

setup hthor thor roxie
setup Pass Pass Pass
setup_fetch Pass Pass Pass
setupdict Pass Pass Pass
setupsearchindex Pass Pass Pass
setupsq Pass Pass Pass
setupwordindex Pass Pass Pass
setupxml Pass Pass Pass
test hthor thor roxie
persist_refresh (persistRefresh=false) Pass Pass Pass
persist_refresh (persistRefresh=true) Pass Pass Pass

HPCC Stop: OK
HPCC Uninstall: OK

@shamser
Copy link
Contributor Author

shamser commented Dec 15, 2015

@jakesmith @jamienoss would you review the latest commit before I rebase?

@HPCCSmoketest
Copy link
Contributor

Automated Smoketest Build: success
Install hpccsystems-platform-community-with-plugins_6.1.0-trunk0.el6.x86_64.rpm
HPCC Start: OK

setup hthor thor roxie
setup Pass Pass Pass
setup_fetch Pass Pass Pass
setupdict Pass Pass Pass
setupsearchindex Pass Pass Pass
setupsq Pass Pass Pass
setupwordindex Pass Pass Pass
setupxml Pass Pass Pass
test hthor thor roxie
persist_refresh (persistRefresh=false) Pass Pass Pass
persist_refresh (persistRefresh=true) Pass Pass Pass

HPCC Stop: OK
HPCC Uninstall: OK

@@ -383,6 +386,7 @@ class CCloneWorkflowItem : public CInterface, implements IRuntimeWorkflowItem
persistWfid = other->queryPersistWfid();
scheduledWfid = other->queryScheduledWfid();
persistCopies = other->queryPersistCopies();
persistRefresh = other->queryPersistRefresh();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not new, but none of the members of CCloneWorkflowItem are initialized in constructor, which would be cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll initialize persistRefresh in constructor and the others in separate ticket later.

@jakesmith
Copy link
Member

@shamser - minor comments, but I think it would be better to simplify test by using inline datasets, rather than introducing new text files which you 1st spray.

import Std;
import Std.File AS FileServices;

//
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Imports removed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just meant the alone "//" but perhaps this is also a reply to Attila's comment.

DestFile := '~regress::europe.txt';
ESPportIP := 'http://127.0.0.1:8010/FileSpray';

spray := FileServices.SprayVariable(OriginalTextFilesIp,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use inline ds as Jake suggested.

@jamienoss
Copy link
Collaborator

@shamser think it would be good to have a few syntax tests in ecl/regress for testing the new syntax. The tests in this dir aren't run on any engine, just compiled.

InputFile := OriginalTextFilesOsPath + '/download/europe1.txt';
#ELSE
InputFile := OriginalTextFilesOsPath + '/download/europe2.txt';
#END
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday & @jakesmith would it be a good idea to convert all the #ifs to if to make sure that all the workflow and eclagent code works as it should do - I'm not really familiar enough with how persists work for this to be a code gen vs runtime issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or maybe we need two test files with both? I feel unsure, considering how many times I hear that persists cause issues with graph flow, eval, & exec etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamienoss the regression programs are not just compiled. They are executed too and the results a compared against the expected results... not sure what you mean by saying they are "just compiled".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume that this response is in relation to - "think it would be good to have a few syntax tests in ecl/regress for testing the new syntax. The tests in this dir aren't run on any engine, just compiled."
rather than where you/github has placed it.

The ecl examples/tests in HPCC-Platform/testing/regress/ecl are both yes, but the ones in HPCC-Platform/ecl/regress are not, they are only compiled (at least by intention). I meant that perhaps tests should be added here also testing the new syntax

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re.

would it be a good idea to convert all the #ifs to if to make sure that all the workflow and eclagent code works as it should do - I'm not really familiar enough with how persists work for this to be a code gen vs runtime issue?

Should use standard if instead of #IF where possible, not necessary afaict in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is necessary. We only need tests in one of the two places - and this is testing the general version of the syntax. What would adding another test cover that this one doesn't? The only example i can think of is error cases - e.g. a non constant argument.

@shamser
Copy link
Contributor Author

shamser commented Dec 16, 2015

@richardkchapman Ready to merge.

@HPCCSmoketest
Copy link
Contributor

Automated Smoketest Build: success
Install hpccsystems-platform-community-with-plugins_6.1.0-trunk0.el6.x86_64.rpm
HPCC Start: OK

setup hthor thor roxie
setup Pass Pass Pass
setup_fetch Pass Pass Pass
setupdict Pass Pass Pass
setupsearchindex Pass Pass Pass
setupsq Pass Pass Pass
setupwordindex Pass Pass Pass
setupxml Pass Pass Pass
test hthor thor roxie
persist_refresh (persistRefresh=false) Pass Pass Pass
persist_refresh (persistRefresh=true) Pass Pass Pass

HPCC Stop: OK
HPCC Uninstall: OK


persistRefresh := #IFDEFINED(root.persistRefresh, true);

#IF (persistRefresh=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be a standard 'if' vs #if

New REFRESH option for Persist.

Usage:
  1) PERSIST('test',REFRESH(TRUE))
	- same as current persist behaviour

  2) PERSIST('test',REFRESH(FALSE))
	- only rebuild if file missing (ignore any file date or content changes)

Signed-off-by: Shamser Ahmed <shamser.ahmed@lexisnexis.co.uk>
@jakesmith
Copy link
Member

@richardkchapman - looks ok to merge to me.

@HPCCSmoketest
Copy link
Contributor

Automated Smoketest Build: success
Install hpccsystems-platform-community-with-plugins_6.1.0-trunk0.el6.x86_64.rpm
HPCC Start: OK

setup hthor thor roxie
setup Pass Pass Pass
setup_fetch Pass Pass Pass
setupdict Pass Pass Pass
setupsearchindex Pass Pass Pass
setupsq Pass Pass Pass
setupwordindex Pass Pass Pass
setupxml Pass Pass Pass
test hthor thor roxie
persist_refresh (persistRefresh=false) Pass Pass Pass
persist_refresh (persistRefresh=true) Pass Pass Pass

HPCC Stop: OK
HPCC Uninstall: OK

richardkchapman added a commit that referenced this pull request Dec 16, 2015
HPCC-14396 REFRESH(boolean) option on persist

Reviewed-By: Jake Smith <jake.smith@lexisnexis.com>
Reviewed-By: Jamie Noss <james.noss@lexisnexis.com>
Reviewed-By: Gavin Halliday <gavin.halliday@lexisnexis.com>
Reviewed-By: Richard Chapman <rchapman@hpccsystems.com>
@richardkchapman richardkchapman merged commit dde96a3 into hpcc-systems:master Dec 16, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants