
Implement sub-chunking for offloads #192

Open
nj1973 opened this issue May 31, 2024 · 0 comments

Comments


nj1973 commented May 31, 2024

Sometimes we encounter very large non-partitioned tables or very large single partitions, e.g. > 10TB.

At the moment the segment (i.e. top-level table or partition) is the smallest unit of chunking we support. If we cannot offload that smallest unit of work in a single pass (e.g. due to ORA-1555) then our only option is to keep increasing parallelism and hope the offload completes before we run out of time.

It has been suggested that we should attempt to break the single segment, be it a table or partition, down into multiple transport jobs.

Example 1:

  • Partition P2015 is in a table partitioned by date column TXN_DATE and is 20TB in size
  • The table is also sub-partitioned
  • Offload could loop through the subpartitions adding them to the staging area one at a time, each split by ROWID ranges for parallelism
  • Only after ALL subpartitions have been appended to the bucket does the Offload continue, that way we still get the atomicity we desire
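
The loop in Example 1 could look something like the sketch below. This is purely illustrative; the helper names (`list_subpartitions`, `transport_subpartition`, etc.) are hypothetical stand-ins, not existing Offload APIs, and the stubs here just record what would be staged.

```python
# Illustrative sketch of Example 1 (all names are hypothetical, not real
# Offload APIs). A 20TB partition is staged subpartition-by-subpartition;
# the final load step runs only after EVERY subpartition is staged, which
# preserves the atomicity of the overall offload.

def list_subpartitions(partition):
    """Stand-in for reading subpartition names from the data dictionary."""
    return [f"{partition}_SP{i}" for i in range(1, 4)]

def transport_subpartition(subpartition, staging_bucket):
    """Stand-in for one transport job, itself split by ROWID ranges."""
    staging_bucket.append(subpartition)

def offload_partition(partition, staging_bucket):
    subpartitions = list_subpartitions(partition)
    for sp in subpartitions:
        transport_subpartition(sp, staging_bucket)
    # Only now, with ALL subpartitions appended to the bucket, does the
    # offload continue to the final (atomic) load step.
    return len(staging_bucket) == len(subpartitions)

staging = []
print(offload_partition("P2015", staging))  # True once all 3 are staged
print(staging)
```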

Example 2:

  • Partition P2015 is in a table partitioned by date column TXN_DATE and is 20TB in size; there are NO subpartitions
  • Offload detects that P2015 > MAX_OFFLOAD_CHUNK_SIZE
  • Offload detects that the partition key TXN_DATE (or perhaps some other column) is not a single value but has a range of values
  • Offload identifies n points between the min/max TXN_DATE
  • Offload loops through the ranges adding data to the staging area for each, each transport would be split natively by Spark on the range of values
  • Only after ALL ranges have been appended to the bucket does the Offload continue, that way we still get the atomicity we desire
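
The range-splitting step in Example 2 could be sketched as below. Again this is a hypothetical illustration, not the real Offload implementation: it just shows one way to pick n boundaries between min/max TXN_DATE so that each contiguous range becomes its own transport job (e.g. a Spark read predicated on `TXN_DATE >= lower AND TXN_DATE < upper`).

```python
from datetime import date, timedelta

# Illustrative sketch of Example 2 (hypothetical, not real Offload code):
# split one oversized partition into n contiguous TXN_DATE ranges so each
# range can be staged as an independent transport job.

def date_ranges(min_date, max_date, n):
    """Return n contiguous [lower, upper) ranges covering min_date..max_date."""
    total_days = (max_date - min_date).days + 1
    step = total_days // n
    bounds = [min_date + timedelta(days=i * step) for i in range(n)]
    bounds.append(max_date + timedelta(days=1))  # exclusive upper bound
    return list(zip(bounds[:-1], bounds[1:]))

ranges = date_ranges(date(2015, 1, 1), date(2015, 12, 31), 4)
for lower, upper in ranges:
    # Each range would become one transport job, e.g. a Spark predicate:
    #   TXN_DATE >= lower AND TXN_DATE < upper
    print(lower, upper)
```

Only after all n ranges have been appended to the bucket would the offload move on, matching the atomicity requirement in the bullet above.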