
Implement sub-chunking for offloads #192

Open
nj1973 opened this issue May 31, 2024 · 0 comments

Comments


nj1973 commented May 31, 2024

Sometimes we encounter very large non-partitioned tables or very large single partitions, e.g. > 10TB.

At the moment the segment (i.e. top-level table or partition) is the smallest unit of chunking we support. If we cannot offload that smallest unit of work in a single pass (e.g. due to ORA-1555) then our only option is to keep increasing parallelism and hope the offload completes before we run out of time.

It has been suggested that we should attempt to break the single segment, be it a table or partition, down into multiple transport jobs.

Example 1:

  • Partition P2015 is in a table partitioned by date column TXN_DATE and is 20TB in size
  • The table is also sub-partitioned
  • Offload could loop through the subpartitions adding them to the staging area one at a time, each split by ROWID ranges for parallelism
  • Only after ALL subpartitions have been appended to the bucket does the Offload continue, that way we still get the atomicity we desire
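
The loop in Example 1 could look something like the sketch below. This is purely illustrative; the helper names (`list_subpartitions`, `transport_subpartition`, etc.) are hypothetical stand-ins, not existing Offload APIs, and the stubs here just record what would be staged.

```python
# Illustrative sketch of Example 1 (all names are hypothetical, not real
# Offload APIs). A 20TB partition is staged subpartition-by-subpartition;
# the final load step runs only after EVERY subpartition is staged, which
# preserves the atomicity of the overall offload.

def list_subpartitions(partition):
    """Stand-in for reading subpartition names from the data dictionary."""
    return [f"{partition}_SP{i}" for i in range(1, 4)]

def transport_subpartition(subpartition, staging_bucket):
    """Stand-in for one transport job, itself split by ROWID ranges."""
    staging_bucket.append(subpartition)

def offload_partition(partition, staging_bucket):
    subpartitions = list_subpartitions(partition)
    for sp in subpartitions:
        transport_subpartition(sp, staging_bucket)
    # Only now, with ALL subpartitions appended to the bucket, does the
    # offload continue to the final (atomic) load step.
    return len(staging_bucket) == len(subpartitions)

staging = []
print(offload_partition("P2015", staging))  # True once all 3 are staged
print(staging)
```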

Example 2:

  • Partition P2015 is in a table partitioned by date column TXN_DATE and is 20TB in size; there are NO subpartitions
  • Offload detects that P2015 > MAX_OFFLOAD_CHUNK_SIZE
  • Offload detects that the partition key TXN_DATE (or perhaps some other column) is not a single value but has a range of values
  • Offload identifies n points between the min/max TXN_DATE
  • Offload loops through the ranges adding data to the staging area for each, each transport would be split natively by Spark on the range of values
  • Only after ALL ranges have been appended to the bucket does the Offload continue, that way we still get the atomicity we desire
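
The range-splitting step in Example 2 could be sketched as below. Again this is a hypothetical illustration, not the real Offload implementation: it just shows one way to pick n boundaries between min/max TXN_DATE so that each contiguous range becomes its own transport job (e.g. a Spark read predicated on `TXN_DATE >= lower AND TXN_DATE < upper`).

```python
from datetime import date, timedelta

# Illustrative sketch of Example 2 (hypothetical, not real Offload code):
# split one oversized partition into n contiguous TXN_DATE ranges so each
# range can be staged as an independent transport job.

def date_ranges(min_date, max_date, n):
    """Return n contiguous [lower, upper) ranges covering min_date..max_date."""
    total_days = (max_date - min_date).days + 1
    step = total_days // n
    bounds = [min_date + timedelta(days=i * step) for i in range(n)]
    bounds.append(max_date + timedelta(days=1))  # exclusive upper bound
    return list(zip(bounds[:-1], bounds[1:]))

ranges = date_ranges(date(2015, 1, 1), date(2015, 12, 31), 4)
for lower, upper in ranges:
    # Each range would become one transport job, e.g. a Spark predicate:
    #   TXN_DATE >= lower AND TXN_DATE < upper
    print(lower, upper)
```

Only after all n ranges have been appended to the bucket would the offload move on, matching the atomicity requirement in the bullet above.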