Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for page-level filter pushdown (indexes) #102

Closed
jorgecarleitao opened this issue Mar 17, 2022 · 0 comments · Fixed by #114
Closed

Added support for page-level filter pushdown (indexes) #102

jorgecarleitao opened this issue Mar 17, 2022 · 0 comments · Fixed by #114
Assignees
Labels
feature A new feature

Comments

@jorgecarleitao
Copy link
Owner

jorgecarleitao commented Mar 17, 2022

With #100 merged, we can now work on using indexes to support page-level filter pushdown via indexes.

This issue describe my initial ideas about the topic:

Design using column indexes

Say we have two columns in a row group, c1 and c2, with the following page structure:

c1: [----][----][--][--] (p11, p12, p13, p14)
c2: [----][--][----][--] (p21, p22, p23, p24)

and that we have a filter over c1 that selects it as follows:

selected:        ----      --
c1      : [----][----][--][--] (p11, p12, p13, p14)
c2      : [----][--][----][--] (p21, p22, p23, p24)

the goal is to iterate over c1 and c2 so that we select rows from them accordingly:

            I_1      I_2
           ----      --
c1: [----][----][--][--] (p12, p14)
c2: [----][--][----][--] (p22, p23, p24)

For this, 3 steps come to mind:

  1. compute the row intervals I_j = (start, len)
  2. for each column, compute the set of pages that fill all intervals
  3. pass pages (and their ranges) to consumers to deserialize to their favorite in-memory format

For c1 these are full pages; for c2 these are slices of pages (slices of p22,p23,p24).

Implementation

Because ColumnIndex has encoded data, we need a trait object Index to describe the
different physical layouts, just like we do for statistics, so that consumers do not need to worry about de-serializing the values prior to using them:

pub trait Index: Send + Sync + std::fmt::Debug {
    fn as_any(&self) -> &dyn Any;

    fn physical_type(&self) -> &PhysicalType;
}

/// The index of a page, containing the min and max values of the page.
pub struct PageIndex<T> {
    /// The minimum value in the page. It is None when all values are null
    pub min: Option<T>,
    /// The maximum value in the page. It is None when all values are null
    pub max: Option<T>,
    /// The number of null values in the page
    pub null_count: Option<i64>,
}

/// An index of a column of [`NativeType`] physical representation
pub struct NativeIndex<T: NativeType> {
    pub indexes: Vec<PageIndex<T>>,
    pub boundary_order: BoundaryOrder,
}

impl<T: NativeType> Index for NativeIndex<T> {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn physical_type(&self) -> &PhysicalType {
        &T::TYPE
    }
}

/// An index of a column of bytes physical type
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ByteIndex {
    pub indexes: Vec<PageIndex<Vec<u8>>>,
    pub boundary_order: BoundaryOrder,
}

...

For step 1, we need to compute the set of intervals that selects the relevant rows from a set of pages:

pub fn compute_rows<'a, T>(
    index: &'a [PageIndex<T>],
    locations: &[PageLocation],
    num_rows: u64,
    selector: &dyn Fn(&'a PageIndex<T>) -> bool,
) -> Result<Vec<Interval>, ParquetError>;

where:

  • PageLocation contains the location of the page in the file (in bytes)
  • Interval contain the start and length (in number of rows) of the elements
    to retrieve from said page. The invariant is start + length <= page.num_rows().

(multiple-column selectors are computed by the overlapping of each column's interval).

For step 2, we need a function to select pages from any column chunk:

pub fn select_pages(
    intervals: &[Interval],
    locations: &[PageLocation],
    num_rows: u64,
) -> Result<Vec<FilteredPage>, ParquetError>;

where FilteredPage is something like

/// An enum describing a page that was either selected in a filter pushdown or skipped
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FilteredPage {
    Select {
        /// Location of the page in the file in bytes
        start: u64,
        /// the length of the page in bytes
        length: usize,
        /// Location of rows to select in the page
        rows_offset: usize,
        rows_length: usize,
    },
    Skip {
        /// Location of the page in the file
        start: u64,
        /// the length of the page in bytes
        length: usize,
        /// number of rows that are skip by skipping this page
        num_rows: usize,
    },
}

With this struct, we have everything we need to implement a reader that skips pages that are not needed. When pages are selected, the rows_offset and rows_length are passed along so that consumers know how which part of the page they must decode.

Consumers (of pages) that apply this filter must be able to decode pages with an offset (Rust's .skip) and length (Rust's .take).

@jorgecarleitao jorgecarleitao added the feature A new feature label Mar 17, 2022
@jorgecarleitao jorgecarleitao self-assigned this Mar 17, 2022
@jorgecarleitao jorgecarleitao changed the title Add support for page-level filter pushdown (indexes) Added support for page-level filter pushdown (indexes) Apr 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature A new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant