[vpj][server][da-vinci][controller] Allow RMD retrieval and parsing in TTL filter at Mapper stage #49
Merged
adamxchen merged 23 commits into linkedin:main on Nov 7, 2022
This commit introduces some refactoring changes and the implementation of RMD retrieval and parsing in the TTL filter.
RMD retrieval and parsing: When the repush TTL function is enabled, the VPJ calls the controller to retrieve all RMD schemas belonging to this store and stores them in a temp folder on HDFS. In the Mapper stage, when the TTL filter is initialized, the schemas are fetched and used to parse the RMD records. Based on the TTL config, each record is either left untouched or filtered out.
Refactoring:
1. Move some Controller/SSL related methods from the VPJ main class to a Utils class.
2. Move some RMD related methods from the da-vinci package to the common package as a new Utils class.
gaojieliu requested changes on Oct 20, 2022
gaojieliu requested changes on Oct 25, 2022
# Conflicts: # clients/venice-push-job/src/main/java/com/linkedin/venice/hadoop/VenicePushJob.java
This commit addresses some comments in the PR: 1) Keep the TTL policy abstraction but remove unused policies for now. 2) Let the TTL feature use the store-level rewind time, with a flag that customers must turn on explicitly. 3) Support chunking, but the mapper will filter non-chunked records only.
With this change, when the TTL config is present, both the mapper and the reducer create a TTL filter. The difference is that the mapper handles only non-chunked records whereas the reducer can handle both, though in practice the reducer will only handle chunked records if the non-chunked records have already been processed by the mapper.
# Conflicts: # clients/venice-push-job/src/test/java/com/linkedin/venice/hadoop/TestValidateSchemaAndBuildDictMapperOutputReader.java
gaojieliu requested changes on Nov 5, 2022
gaojieliu (Contributor) left a comment:
Looks good overall, and I left some minor comments
Contributor
Author
@gaojieliu Thanks for the review! Comments are addressed. Please take a look when you get time!
ZacAttack pushed a commit to ZacAttack/venice that referenced this pull request on Nov 16, 2022
…n TTL filter at Mapper stage (linkedin#49)
* Allow RMD retrieval and parsing in TTL filter at Mapper stage
This commit introduces some refactoring changes and the implementation of RMD retrieval and parsing in the TTL filter.
RMD retrieval and parsing: The TTL feature uses the store-level rewind time as the TTL and has a flag that customers must turn on explicitly. When the function is enabled, the VPJ calls the controller to retrieve all RMD schemas belonging to this store and stores them in a temp folder on HDFS. The mapper creates a TTL filter, and the reducer may create one too if chunking is enabled. The difference is that the mapper handles only non-chunked records whereas the reducer can handle both, though in practice the reducer will only handle chunked records if the non-chunked records have already been processed by the mapper. The filter uses the aforementioned temp folder to cache the RMD schemas and parses RMD records to get the timestamp information for the TTL logic.
Refactoring:
1. Move some Controller/SSL related methods from the VPJ main class to a Utils class.
2. Move some RMD related methods from the da-vinci package to the common package as a new Utils class.
Description
This commit introduces some refactoring changes and the implementation of RMD retrieval and parsing in the TTL filter.
RMD retrieval and parsing:
[vpj][server][da-vinci]
When the repush TTL function is enabled, the VPJ calls the controller to retrieve all RMD schemas belonging to this store and persists them in a temp folder on HDFS. The folder is unique for each run and is cleaned up when the push job finishes.
When the TTL config is present, both the mapper and the reducer create a TTL filter. When the TTL filter is initialized, the schemas are fetched and used to parse the RMD records. Based on the TTL config, each record is either left untouched or filtered out.
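The filter decision described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the RMD record has already been parsed into a single last-update timestamp in epoch milliseconds, that the TTL is given in seconds, and that the job start time serves as the reference clock. The class and method names (`TtlFilterSketch`, `shouldFilter`) are hypothetical and are not the classes added in this PR.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a TTL check; not the filter class added in this PR. */
final class TtlFilterSketch {
  // Records last updated before this instant are considered expired.
  private final long cutoffTimestampMs;

  /**
   * @param ttlSeconds     assumed TTL duration, in seconds
   * @param jobStartTimeMs assumed reference time, in epoch milliseconds
   */
  TtlFilterSketch(long ttlSeconds, long jobStartTimeMs) {
    this.cutoffTimestampMs = jobStartTimeMs - TimeUnit.SECONDS.toMillis(ttlSeconds);
  }

  /**
   * Returns true when the record should be dropped.
   * Assumption: a record exactly at the cutoff is kept.
   */
  boolean shouldFilter(long rmdTimestampMs) {
    return rmdTimestampMs < cutoffTimestampMs;
  }

  public static void main(String[] args) {
    TtlFilterSketch filter = new TtlFilterSketch(60, System.currentTimeMillis());
    // A record updated two minutes ago is older than the 60-second TTL.
    long twoMinutesAgo = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(2);
    System.out.println("filter stale record: " + filter.shouldFilter(twoMinutesAgo));
  }
}
```

Computing the cutoff once in the constructor keeps the per-record check to a single comparison, which matters when the check runs for every record in the Mapper.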
[controller]
The existing API did not include the RMD value schema ID, which is essential for differentiating the RMD schemas, so this PR adds it.
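To show why the value schema ID matters, here is a minimal sketch of a lookup in which an RMD schema is identified by two numbers together. It assumes the second number is an RMD version ID and that schemas are held as plain strings; the class `RmdSchemaIndexSketch` and its methods are hypothetical and do not reflect the controller API changed in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory index of RMD schemas; not part of the Venice API. */
final class RmdSchemaIndexSketch {
  // valueSchemaId -> (rmdVersionId -> schema string)
  private final Map<Integer, Map<Integer, String>> schemas = new HashMap<>();

  /** Registers a schema under the (valueSchemaId, rmdVersionId) pair. */
  void put(int valueSchemaId, int rmdVersionId, String schemaStr) {
    schemas.computeIfAbsent(valueSchemaId, k -> new HashMap<>()).put(rmdVersionId, schemaStr);
  }

  /** Looks up a schema; empty when either ID is unknown. */
  Optional<String> get(int valueSchemaId, int rmdVersionId) {
    Map<Integer, String> byVersion = schemas.get(valueSchemaId);
    return byVersion == null ? Optional.empty() : Optional.ofNullable(byVersion.get(rmdVersionId));
  }
}
```

Without the value schema ID in the response, two RMD schemas that share an RMD version ID would collide in an index like this one.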
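Refactoring:
1. Move some Controller/SSL related methods from the VPJ main class to a Utils class.
2. Move some RMD related methods from the da-vinci package to the common package as a new Utils class.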
Revision updated on Nov 2, 2022
This commit addresses some comments in the PR. (the diff)
The difference is that the mapper handles only non-chunked records whereas the reducer can handle both, though in practice the reducer will only handle chunked records if the non-chunked records have already been processed by the mapper.
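The division of work between the two stages can be summarized in a small sketch. It assumes a record carries a simple chunked/non-chunked flag; the `TtlStageSketch` class, its `Stage` enum, and both methods are hypothetical names used only for illustration.

```java
/** Hypothetical summary of which stage applies the TTL filter to which records. */
final class TtlStageSketch {
  enum Stage { MAPPER, REDUCER }

  /** Whether the TTL filter at the given stage is able to evaluate the record. */
  static boolean canHandle(Stage stage, boolean chunked) {
    // The mapper only understands non-chunked records; the reducer understands both.
    return stage == Stage.REDUCER || !chunked;
  }

  /** Which stage ends up evaluating the record when both filters are active. */
  static Stage handledBy(boolean chunked) {
    // Non-chunked records are already dealt with in the mapper, so only
    // chunked records are left for the reducer.
    return chunked ? Stage.REDUCER : Stage.MAPPER;
  }

  public static void main(String[] args) {
    System.out.println("chunked record handled by: " + handledBy(true));
    System.out.println("non-chunked record handled by: " + handledBy(false));
  }
}
```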
How was this PR tested?
internal CI.
JDK 11 - 6664/6665/6666
Added a bunch of new unit tests and integ tests. (See TestActiveActiveIngestion - testKIFRepushActiveActiveStore for repush with ttl)
Does this PR introduce any user-facing changes?
See #10
Notes
In terms of deployment order, the controller must be deployed before this VPJ change.