HPCC-27204 Support planes with multiple devices (striping) #15782

jakesmith
Member

@jakesmith jakesmith commented Feb 16, 2022

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@jakesmith
Member Author

@ghalliday - please review, leaving a draft for the moment, because I need to do some more testing.

@jakesmith jakesmith force-pushed the HPCC-27204-numDevices-support branch 5 times, most recently from c2705da to ff6eabc Compare February 17, 2022 12:34
Member

@ghalliday ghalliday left a comment

@jakesmith a few comments from an initial scan. Main comment is that a logical file/file descriptor should have a baseDevice field - generally set to a hash of the logical filename - which is added to the part number before modulus. (Then the hash code is localised to creating the file, rather than the readers.)

unsigned numStripedDevices = queryPartDiskMapping(cn).numStripedDevices;
unsigned stripeNum = 0;
if (numStripedDevices>1)
stripeNum = (i%numStripedDevices)+1;

Member

more: also hash of logical filename/base hash stored in the file descriptor?

Member Author

I deliberately left this functionality out of this draft so far (there is a comment about it somewhere), to prove that the rest worked first.

mspec.flags &= ~CPDMSF_striped;
#else
// Bare-metal can have multiple devices per plane (e.g. data + mirror), but it doesn't stripe across them
mspec.numStripedDevices = 1;

Member

That is possibly a flaw. May want to revisit bare metal at some point.

Member Author

Agreed. Treating bare metal's existing multi-device planes as striped would break it, but we could add a switch so that new plane types are flagged as striped.
As you say, that is one for the future.

@@ -326,7 +328,8 @@ extern da_decl StringBuffer &makePhysicalPartName(
  unsigned replicateLevel, // uses replication directory
  DFD_OS os, // os must be specified if no dir specified
  const char *diroverride, // override default directory
- bool dirPerPart); // generate a subdirectory per part
+ bool dirPerPart, // generate a subdirectory per part
+ unsigned stripeNum); // strip number

Member

trivial: indent of comment and "strip"->"stripe"

Member Author

will fix in next commit

@jakesmith jakesmith force-pushed the HPCC-27204-numDevices-support branch 2 times, most recently from 189ab8d to 6cc22dc Compare February 22, 2022 10:40
@jakesmith jakesmith marked this pull request as ready for review February 22, 2022 12:30
@AttilaVamos
Contributor

@jakesmith The error with the failed spray_dir_test.ecl is:

<Exception><Code>1410</Code><Source>Roxie</Source><Message>Could not resolve filename regress::roxie::W20220222-125949-4::spray_test</Message></Exception>

Member

@ghalliday ghalliday left a comment

@jakesmith looks good. a few relatively minor comments/questions

@@ -546,6 +546,7 @@ class IndexWriteSlaveActivity : public ProcessSlaveActivity, public ILookAheadSt
getPartFilename(*tlkDesc, l, path, true);
if (0 == l)
{
ensureDirectoryForFile(path.str());

Member

efficiency: I will open a separate PR, but in most cases it would be better to try to copy the file first and only create the directory if that fails. This is not an issue for local file systems, but it matters more for remote ones.

Member Author

OK, I think we should revisit this and other similar places under that new JIRA then.

makePhysicalPartName(logicalName.get(), 0, 0, dir, 0, DFD_OSdefault, prefix, false, false);

StringBuffer fullPath;
makePhysicalPartName(logicalName.get(), 1, 1, fullPath, 0, DFD_OSdefault, prefix, false, 0);

Member

Is passing 0 for stripe correct?

@@ -2399,7 +2400,10 @@ void ClusterWriteHandler::getPhysicalName(StringBuffer & name, const char * clus
  {
  Owned<IStoragePlane> plane = getDataStoragePlane(cluster, false);
  const char * prefix = plane ? plane->queryPrefix() : nullptr;
- makePhysicalPartName(logicalName.get(), 1, 1, name, 0, DFD_OSdefault, prefix, false);
+ unsigned stripeNum = 0;
+ if (plane)

Member

if plane && numDevices > 1? Or calcStripeNumber should return 0 if numDevices = 1.

Member Author

Yes, I'll add a check to calcStripeNumber and simplify the calls.

lfnHash = attr->getPropInt("@lfnHash");
else if (tracename.length())
{
lfnHash = hashc((const unsigned char *)tracename.str(), tracename.length(), 0);

Member

minor: It would be better encapsulated if the hash calculation were in a separately named function, so that the actual hash function is isolated.

Member Author

will change.

@@ -2336,7 +2428,7 @@ IFileDescriptor *createFileDescriptor(const char *lname,IGroup *grp,IPropertyTre
  width = grp->ordinality();
  StringBuffer s;
  for (unsigned i=0;i<width;i++) {
- makePhysicalPartName(lname, i+1, width, s.clear(), 0, os, nullptr, false);
+ makePhysicalPartName(lname, i+1, width, s.clear(), 0, os, nullptr, false, 0);

Member

Is 0 for the stripe correct? Would be worth having a comment why. (added later?)

Member Author

I don't think it is, but this version of createFileDescriptor is only called in one context in dfurun.cpp to do with keydiff.
However, a more sensible/normalized version should be called which would create an IFileDescriptor based on plane/striping.
I'll change and remove this defunct version.

#endif
StringBuffer descPath;
makePhysicalPartName(lfn.get(), 0, 0, descPath, 0, DFD_OSdefault, dir, false, false);

Member

The last parameter should be a stripe number rather than a boolean. Is it worth having a special function for this, since similar code occurs elsewhere? (I assume it gets the base directory.)

Member Author

Yes, I'll introduce a helper for clarity, but have it call common code.

extern da_decl void addStripeDirectory(StringBuffer &out, const char *directory, const char *planeName, unsigned partNum, unsigned lfnHash, unsigned numStripes);
inline unsigned calcStripeNumber(unsigned partNum, unsigned lfnHash, unsigned numStripes)
{
return ((partNum+lfnHash)%numStripes)+1;

Member

suggestion: return 0 if numStripes <= 1? (to simplify calling code)

Member Author

yep, will change to that.

Member Author

@jakesmith jakesmith left a comment

@ghalliday - please review latest commit

Member

@ghalliday ghalliday left a comment

Last changes look good and cleaned up the code a bit.

@jakesmith
Member Author

@richardkchapman - want me to squash?

@richardkchapman
Member

@jakesmith Yes please

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
@jakesmith jakesmith force-pushed the HPCC-27204-numDevices-support branch from a7f944f to 743e752 Compare March 1, 2022 13:11
@jakesmith
Member Author

@richardkchapman - squashed

@richardkchapman richardkchapman merged commit a01d88b into hpcc-systems:candidate-8.6.x Mar 1, 2022
@HPCCSmoketest
Contributor

Automated Smoketest: ✅
OS: centos 7.6.1810 (Linux 3.10.0-957.1.3.el7.x86_64)
Host: ip-10-20-0-50.ca-central-1.compute.internal
GCC: gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Git: 2.9.5, CMake: 3.22.1, cUrl: 7.67.0, node.js: v16.13.1, npm: 8.1.2
Sha: 743e752
Containerized:False
Build: success
Milestone:Install hpccsystems-platform-community_8.6.7-closedown0.el7.x86_64.rpm
HPCC Start: OK

Unit tests result:

Test              total  passed  failed  errors  timeout  elaps
unittest            222     222       0       0        0  69 sec
wutoolTest(Dali)     19      19       0       0        0  1 sec

Regression test result:

phase          total  pass  fail  elaps
setup (hthor)      9     9     0  22 sec (00:00:22)
setup (thor)       9     9     0  39 sec (00:00:39)
setup (roxie)     20    20     0  19 sec (00:00:19)
test (hthor)     984   984     0  895 sec (00:14:55)
test (thor)      894   894     0  1076 sec (00:17:56)
test (roxie)    1054  1054     0  903 sec (00:15:03)

HPCC Stop: OK
HPCC Uninstall: OK
Time stats:

Prep time     18 sec (00:00:18)
Build time    450 sec (00:07:30)
Package time  111 sec (00:01:51)
Install time  21 sec (00:00:21)
Start time    16 sec (00:00:16)
Test time     1339 sec (00:22:19)
Stop time     18 sec (00:00:18)
Summary       1973 sec (00:32:53)

5 participants