[vpj][server][da-vinci][controller] Allow RMD retrieval and parsing in TTL filter at Mapper stage#49

Merged
adamxchen merged 23 commits into linkedin:main from adamxchen:rmdretrieval
Nov 7, 2022

Conversation

Contributor

@adamxchen adamxchen commented Oct 13, 2022

Description

This commit introduces some refactoring changes and the implementation of RMD retrieval and parsing in the TTL filter.

RMD retrieval and parsing:

[vpj][server][da-vinci]

When the repush TTL function is enabled, the VPJ calls the controller to retrieve all RMD schemas belonging to this store and persists them in a temp folder on HDFS. The folder is unique for each run and is cleaned up when the push job finishes.
When the TTL config is present, both the mapper and the reducer create a TTL filter. When the TTL filter is initialized, the schemas are fetched so that the RMD records can be parsed. Based on the TTL config, each record is then either left as-is or filtered out.
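
To make the filtering decision concrete, here is a minimal sketch of the kind of check a TTL filter could perform once the timestamp has been parsed out of an RMD record. The class name `TtlFilterSketch`, its parameters, and the assumption of a single record-level timestamp in milliseconds are all hypothetical and only for illustration; this is not the filter class added in this PR.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a TTL check; names and shapes are assumptions, not Venice code. */
final class TtlFilterSketch {
  // Records whose RMD timestamp is strictly older than this cutoff are dropped.
  private final long cutoffTimestampMs;

  TtlFilterSketch(long jobStartTimeMs, long ttlSeconds) {
    this.cutoffTimestampMs = jobStartTimeMs - TimeUnit.SECONDS.toMillis(ttlSeconds);
  }

  /** Returns true when the record is older than the TTL and should be filtered out. */
  boolean shouldFilter(long rmdTimestampMs) {
    return rmdTimestampMs < cutoffTimestampMs;
  }
}
```

Computing the cutoff once in the constructor keeps the per-record check to a single comparison, which matters when the check runs for every record in a push.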

[controller]

The existing API response doesn't contain the RMD value schema ID, which is essential to differentiate the RMD schemas, so this PR also adds it.
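
As a rough illustration of why the value schema ID matters, the sketch below keys cached RMD schemas by both the value schema ID and an RMD version ID, so two RMD schemas that share a version ID but belong to different value schemas do not collide. `RmdSchemaCacheSketch`, its `Key` class, and the use of a plain `String` for the schema are assumptions made for this example and do not reflect the actual classes in the PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical cache of RMD schemas keyed by (value schema ID, RMD version ID). */
final class RmdSchemaCacheSketch {
  /** Composite key; both IDs are needed to identify one RMD schema. */
  static final class Key {
    final int valueSchemaId;
    final int rmdVersionId;

    Key(int valueSchemaId, int rmdVersionId) {
      this.valueSchemaId = valueSchemaId;
      this.rmdVersionId = rmdVersionId;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return valueSchemaId == other.valueSchemaId && rmdVersionId == other.rmdVersionId;
    }

    @Override
    public int hashCode() {
      return Objects.hash(valueSchemaId, rmdVersionId);
    }
  }

  private final Map<Key, String> schemas = new HashMap<>();

  void put(int valueSchemaId, int rmdVersionId, String schemaJson) {
    schemas.put(new Key(valueSchemaId, rmdVersionId), schemaJson);
  }

  /** Returns the cached schema, or null when no schema is registered for the pair. */
  String get(int valueSchemaId, int rmdVersionId) {
    return schemas.get(new Key(valueSchemaId, rmdVersionId));
  }
}
```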

Refactoring:

  1. Move some Controller/SSL related methods from the VPJ main class to a Utils class.
  2. Move some RMD related methods from the da-vinci package to the common package as a new Utils class.

Revision updated on Nov 2, 2022

This revision addresses some comments in the PR. (the diff)

  1. Keep the TTL policy abstraction but remove unused policies for now.
  2. Let the TTL feature use the store-level rewind time, and add a flag for customers to turn it on explicitly.
  3. Support chunking, but the mapper will filter non-chunked records only.
  4. When the TTL config is present, both the mapper and the reducer will create a TTL filter.
    The difference is that the mapper only handles non-chunked records whereas the reducer can handle both, though in practice the reducer only processes chunked records if the non-chunked records have already been processed by the mapper.
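
The split between the two stages can be sketched roughly as follows. The `Stage` enum and `shouldEvaluate` helper are hypothetical names used only to illustrate the rule described above; the real filter wiring in the PR may look different.

```java
/** Hypothetical sketch of which records each stage's TTL filter evaluates. */
final class ChunkAwareTtlSketch {
  enum Stage {
    MAPPER, REDUCER
  }

  /** Returns true when a TTL filter running at the given stage evaluates the record. */
  static boolean shouldEvaluate(Stage stage, boolean isChunked) {
    if (stage == Stage.MAPPER) {
      // The mapper only sees individual chunks, so it leaves chunked records alone.
      return !isChunked;
    }
    // The reducer can handle both kinds; in practice only chunked records still
    // need the check once the mapper has processed the non-chunked ones.
    return true;
  }
}
```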

How was this PR tested?

internal CI.
JDK 11 - 6664/6665/6666

Added a bunch of new unit tests and integration tests. (See TestActiveActiveIngestion - testKIFRepushActiveActiveStore for repush with TTL.)

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

See #10

Notes

In terms of deployment order, the controller has to be deployed before this VPJ change.

This commit introduces some refactoring changes and the implementation of RMD retrieval and parsing in the TTL filter.

RMD retrieval and parsing:
When the repush TTL function is enabled, the VPJ calls the controller to retrieve all RMD schemas belonging to this store and stores them in a temp folder on HDFS.
In the Mapper stage, when the TTL filter is initialized, the schemas are fetched so that the RMD records can be parsed. Based on the TTL config, each record is then either left as-is or filtered out.

Refactoring:
1. Move some Controller/SSL related methods from the VPJ main class to a Utils class.
2. Move some RMD related methods from the da-vinci package to the common package as a new Utils class.
Comment thread clients/venice-push-job/src/main/java/com/linkedin/venice/hadoop/FilterChain.java Outdated
# Conflicts:
#	clients/venice-push-job/src/main/java/com/linkedin/venice/hadoop/VenicePushJob.java
This commit addresses some comments in the PR.
1) Keep the TTL policy abstraction but remove unused policies for now.
2) Let the TTL feature use the store-level rewind time, and add a flag for customers to turn it on explicitly.
3) Support chunking, but the mapper will filter non-chunked records only.
With this change, when the TTL config is present, both the mapper and the reducer will create a TTL filter.
The difference is that the mapper only handles non-chunked records whereas the reducer can handle both, though in practice the reducer only processes chunked records if the non-chunked records have already been processed by the mapper.
# Conflicts:
#	clients/venice-push-job/src/test/java/com/linkedin/venice/hadoop/TestValidateSchemaAndBuildDictMapperOutputReader.java
Contributor

@gaojieliu gaojieliu left a comment

Looks good overall, and I left some minor comments

@adamxchen
Contributor Author

@gaojieliu Thanks for the review! Comments are addressed. Please take a look when you get time!

Contributor

@gaojieliu gaojieliu left a comment

Thanks a lot!

@adamxchen adamxchen merged commit e0e8542 into linkedin:main Nov 7, 2022
ZacAttack pushed a commit to ZacAttack/venice that referenced this pull request Nov 16, 2022
…n TTL filter at Mapper stage (linkedin#49)

* Allow RMD retrieval and parsing in TTL filter at Mapper stage

This commit introduces some refactoring changes and the implementation of RMD retrieval and parsing in the TTL filter.

RMD retrieval and parsing:
The TTL feature uses the store-level rewind time as the TTL, and there is a flag for customers to turn it on explicitly.
When the function is enabled, the VPJ calls the controller to retrieve all RMD schemas belonging to this store and stores them in a temp folder on HDFS. The mapper will create a TTL filter, and the reducer may create one too if chunking is enabled.
The difference is that the mapper only handles non-chunked records whereas the reducer can handle both, though in practice the reducer only processes chunked records if the non-chunked records have already been processed by the mapper.
This filter uses the aforementioned temp folder to cache RMD schemas and parses RMD records to get the timestamp information for the TTL logic.

Refactoring:
1. Move some Controller/SSL related methods from the VPJ main class to a Utils class.
2. Move some RMD related methods from the da-vinci package to the common package as a new Utils class.
@adamxchen adamxchen deleted the rmdretrieval branch December 14, 2022 23:16
3 participants