Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp #3616

Merged
merged 12 commits into from Feb 8, 2023

Conversation

meethngala
Copy link
Contributor

@meethngala meethngala commented Dec 14, 2022

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

We want to support extended ACLs within for any type of file based distcp that is initiated. The previous of Hadoop i.e. 2.4 in Gobblin has this limitation since there wasn't APIs in the Hadoop FileSystem that we could leverage. Thus, as part of this PR, I built on top of our Hadoop upgrade version to 2.10 and provided the support for extended ACLs.

This PR now allows us to preserve ACLs as well during a distcp along with its previous other attributes which includes and not limited to: file permissions with sticky bit, owner, group, mod times, etc. We preserve these ACLs for the all the directories/paths that were created as part of the distcp operation on the destination. Having said that, the user can limit the ancestor directory up to which they want to preserve these attributes. All of the attribute preservation for the FileSystem is configurable including the current support for ACLs. Below are key changes made as part of this PR:

  • Here are some details about my PR, including screenshots (if applicable):
  • Added new preserve attributes for supporting extended ACLs
  • Extended OwnerAndPermissions to provide ACL support
  • Updated CopyableFile to include ACLs if the preserve attribute for ACL is set and also maintain the same for ancestors
  • Updated FileAwareInputStreamDataWriter to set ACLs
  • Added unit tests for preserving ACLs at both File Path level and all of the ancestors that were created as part of the copy

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
  • build successfully
  • Added unit tests to confirm the support for extended ACLs

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Dec 14, 2022

Codecov Report

Merging #3616 (d8b378d) into master (69f3f9c) will increase coverage by 0.00%.
The diff coverage is 68.42%.

@@            Coverage Diff            @@
##             master    #3616   +/-   ##
=========================================
  Coverage     46.58%   46.58%           
- Complexity    10672    10675    +3     
=========================================
  Files          2133     2133           
  Lines         83557    83570   +13     
  Branches       9290     9294    +4     
=========================================
+ Hits          38928    38935    +7     
- Misses        41068    41069    +1     
- Partials       3561     3566    +5     
Impacted Files Coverage Δ
...lin/data/management/copy/ManifestBasedDataset.java 51.78% <0.00%> (ø)
...che/gobblin/data/management/copy/CopyableFile.java 73.33% <42.85%> (-2.01%) ⬇️
...bblin/data/management/copy/OwnerAndPermission.java 30.43% <100.00%> (+10.43%) ⬆️
...bblin/data/management/copy/PreserveAttributes.java 97.36% <100.00%> (+0.07%) ⬆️
...nt/copy/writer/FileAwareInputStreamDataWriter.java 73.80% <100.00%> (+0.63%) ⬆️
...lin/elasticsearch/writer/FutureCallbackHolder.java 61.42% <0.00%> (-1.43%) ⬇️
...a/management/copy/publisher/CopyDataPublisher.java 72.25% <0.00%> (-1.30%) ⬇️
.../org/apache/gobblin/cluster/GobblinTaskRunner.java 62.22% <0.00%> (-0.31%) ⬇️
...anagement/copy/replication/ConfigBasedDataset.java 68.87% <0.00%> (ø)
...he/gobblin/source/PartitionAwareFileRetriever.java 55.55% <0.00%> (+7.40%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@meethngala meethngala force-pushed the support-acls-for-file-based-distcp branch from a176b0e to 48f6ff2 Compare December 15, 2022 05:08
@meethngala meethngala force-pushed the support-acls-for-file-based-distcp branch from 48f6ff2 to a7121d1 Compare January 10, 2023 17:19
Copy link
Contributor

@phet phet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking very close!

@meethngala meethngala force-pushed the support-acls-for-file-based-distcp branch from f92867a to 1274aab Compare February 3, 2023 02:23
Copy link
Contributor

@phet phet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...a couple last Qs

Path outputDir = writer.outputDir;
String[] splitExpectedOutputPath = expectedOutputPath.toString().split("output");
Path dstOutputPath = new Path(outputDir.toString().concat(splitExpectedOutputPath[1])).getParent();
Path stgFilePath = new Path("file:".concat(stagingDir.toString().concat("/file")));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think I've ever actually used the String.concat method before... what about the canonical +? is there some advantage, since:

new Path("file:" + stagingDir + "/file");

seems easier to read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.concat handles NPEs if any compared to + ... I guess that's the only difference.

Copy link
Contributor

@phet phet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just needs javadoc plus realignment so the expected, not actual/observed values drive the verification

@meethngala meethngala force-pushed the support-acls-for-file-based-distcp branch from 230367c to 5c86e22 Compare February 7, 2023 16:29
Copy link
Contributor

@phet phet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work on the test cases, meeth--they now verify the crux of the impl.

there still seems room to share impl between the two new test cases. ideally, the only source difference between the two would be the initialization of some different param values that lead to the changed response by the system. (plus maybe different or additional assertions).

that said, looks ok for now... we can always return to refactor, esp. if we later extend this suite. this looks ready to get in so we can start benefiting from it!

@meethngala meethngala force-pushed the support-acls-for-file-based-distcp branch from 5c86e22 to d8b378d Compare February 8, 2023 04:01
@Will-Lo Will-Lo merged commit fd8f6e3 into apache:master Feb 8, 2023
phet added a commit to phet/gobblin that referenced this pull request Feb 13, 2023
* upstream/master:
  [GOBBLIN-1774] Util for detecting non optional uniontypes Hive tables (apache#3632)
  [GOBBLIN-1773] Fix bugs in quota manager (apache#3636)
  [GOBBLIN-1782] Fix Merge State for Flow Pending Resume statuses (apache#3639)
  [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp (apache#3616)
  [GOBBLIN-1780] Refactor/rename YarnServiceIT to YarnServiceTest (apache#3637)
  [GOBBLIN-1778] Add house keeping thread in DagManager to periodically sync in memory state with mysql table (apache#3635)
  Register gauge metrics for change monitors (apache#3634)
phet added a commit to phet/gobblin that referenced this pull request Mar 24, 2023
* upstream/master:
  [GOBBLIN-1774] Util for detecting non optional uniontypes Hive tables (apache#3632)
  [GOBBLIN-1773] Fix bugs in quota manager (apache#3636)
  [GOBBLIN-1782] Fix Merge State for Flow Pending Resume statuses (apache#3639)
  [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp (apache#3616)
  [GOBBLIN-1780] Refactor/rename YarnServiceIT to YarnServiceTest (apache#3637)
  [GOBBLIN-1778] Add house keeping thread in DagManager to periodically sync in memory state with mysql table (apache#3635)
  Register gauge metrics for change monitors (apache#3634)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants