Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metainfo of the inverted index may overflow, causing the index to become unavailable #3991

Open
zhongzc opened this issue May 20, 2024 · 3 comments
Assignees
Labels
C-bug Category Bugs

Comments

@zhongzc
Copy link
Contributor

zhongzc commented May 20, 2024

What type of bug is this?

Unexpected error

What subsystems are affected?

Storage Engine

Problem Report

Below is the definition of InvertedIndexMeta:
https://github.com/GreptimeTeam/greptime-proto/blob/a11db14b8502f55ca5348917fd18e6fcf140f55e/proto/greptime/v1/index/inverted_index.proto#L41-L68

As can be seen, both relative_fst_offset and fst_size are defined as uint32.

    // The byte offset of the Finite State Transducer (FST) data relative to the `base_offset`.
    uint32 relative_fst_offset = 4;

    // The size in bytes of the FST data.
    uint32 fst_size = 5;

According to the layout of the inverted index:

null_bitmap bitmap₀ bitmap₁ bitmap₂ ... bitmapₙ fst

This implies that if the total size of the bitmaps exceeds 4.3GB, the relative_fst_offset will overflow, causing the FST to be located at an incorrect position when read.

Let's examine the possibility that the total bitmap size exceeds 4.3GB:

  1. Suppose there are 100k distinct values in the column, each occupying an average of 43KB of bitmap size.
  2. Since the bitmap is implemented with the simplest bitvec, 43KB actually represents a maximum of 344k segments, which is 352 million rows.

This is achievable, so this issue urgently needs to be addressed.

Some possible solutions are:

  1. Optimize bitmap implementation, compress the bitmap, or adopt a roaring bitmap.
  2. Introduce additional levels of indexing, SST -> Row Group -> Segment, to control the number of indexing targets.

Note: Simply adjusting relative_fst_offset to uint64 is insufficient because the Value that the FST maps to represents the position and size of the bitmap, which also uses u32. Therefore, the positioning of the bitmap will also be affected by the total size of the bitmap.

@zhongzc zhongzc added the C-bug Category Bugs label May 20, 2024
@zhongzc zhongzc changed the title Meta of the inverted index overflows, causing the index to become unavailable Metainfo of the inverted index may overflow, causing the index to become unavailable May 20, 2024
@waynexia
Copy link
Member

it proves we made a data intensive application

2 looks good to me. 1 also, but I'm not sure about the complexity.

@evenyag
Copy link
Contributor

evenyag commented May 20, 2024

In another aspect, maybe we should restrict the total row number of an SST file. We don't want to generate a very large parquet file and @v0y4g3r is working on this.

We may also skip building the inverted index for a column if the number of distinct is too large but I'm not sure what the appropriate threshold is.

Introduce additional levels of indexing, SST -> Row Group -> Segment, to control the number of indexing targets.

Maybe we can build an inverted index per row group since the number of rows inside a row group won't be too large. We need to enlarge the row group size (rows in a row group) to reduce the number of row groups per file.

@killme2008
Copy link
Contributor

@zhongzc Do we have a plan to fix it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category Bugs
Projects
None yet
Development

No branches or pull requests

4 participants