Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance: binlog primary key turn off dict encoding #34358

Merged
merged 1 commit into from
Jul 17, 2024

Conversation

shaoting-huang
Copy link
Contributor

@shaoting-huang shaoting-huang commented Jul 2, 2024

issue: #34357

Go Parquet uses dictionary encoding by default, and it will fall back to plain encoding if the dictionary size exceeds the dictionary size page limit. Users can specify custom fallback encoding by using parquet.WithEncoding(ENCODING_METHOD) in writer properties. However, Go Parquet fallbacks to plain encoding rather than custom encoding method users provide. Therefore, this patch only turns off dictionary encoding for the primary key.

With a 5 million auto ID primary key benchmark, the parquet file size improves from 13.93 MB to 8.36 MB when dictionary encoding is turned off, reducing primary key storage space by 40%.

@sre-ci-robot sre-ci-robot added the size/XL Denotes a PR that changes 500-999 lines. label Jul 2, 2024
Copy link
Contributor

mergify bot commented Jul 2, 2024

@shaoting-huang

Invalid PR Title Format Detected

Your PR submission does not adhere to our required standards. To ensure clarity and consistency, please meet the following criteria:

  1. Title Format: The PR title must begin with one of these prefixes:
  • feat: for introducing a new feature.
  • fix: for bug fixes.
  • enhance: for improvements to existing functionality.
  • test: for add tests to existing functionality.
  • doc: for modifying documentation.
  • auto: for the pull request from bot.
  1. Description Requirement: The PR must include a non-empty description, detailing the changes and their impact.

Required Title Structure:

[Type]: [Description of the PR]

Where Type is one of feat, fix, enhance, test or doc.

Example:

enhance: improve search performance significantly 

Please review and update your PR to comply with these guidelines.

Copy link
Contributor

mergify bot commented Jul 2, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

1 similar comment
Copy link
Contributor

mergify bot commented Jul 3, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@shaoting-huang shaoting-huang changed the title binlog writer fallback encoding enhance: binlog writer fallback encoding Jul 3, 2024
@mergify mergify bot added kind/enhancement Issues or changes related to enhancement and removed do-not-merge/invalid-pr-format labels Jul 3, 2024
Copy link
Contributor

mergify bot commented Jul 3, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 3, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@shaoting-huang shaoting-huang force-pushed the parquet_encoding branch 4 times, most recently from 3409c87 to 0ac6853 Compare July 4, 2024 08:30
Copy link
Contributor

mergify bot commented Jul 4, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@shaoting-huang shaoting-huang force-pushed the parquet_encoding branch 3 times, most recently from f0ffaf8 to 8537cab Compare July 8, 2024 11:49
@mergify mergify bot added the ci-passed label Jul 8, 2024
Copy link

codecov bot commented Jul 8, 2024

Codecov Report

Attention: Patch coverage is 97.80220% with 2 lines in your changes missing coverage. Please review.

Project coverage is 84.43%. Comparing base (3306bc2) to head (c58fd88).
Report is 6 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #34358      +/-   ##
==========================================
+ Coverage   80.76%   84.43%   +3.67%     
==========================================
  Files        1161      891     -270     
  Lines      141015   116725   -24290     
==========================================
- Hits       113892    98559   -15333     
+ Misses      22750    13801    -8949     
+ Partials     4373     4365       -8     
Files Coverage Δ
internal/storage/binlog_writer.go 85.62% <100.00%> (+0.62%) ⬆️
internal/storage/event_writer.go 84.15% <100.00%> (+0.42%) ⬆️
internal/storage/payload_writer.go 92.85% <100.00%> (+0.04%) ⬆️
internal/storage/serde.go 94.35% <100.00%> (+0.18%) ⬆️
internal/storage/serde_events.go 77.01% <100.00%> (ø)
internal/storage/utils.go 87.86% <100.00%> (+0.08%) ⬆️
internal/storage/data_codec.go 80.46% <80.00%> (+0.36%) ⬆️

... and 295 files with indirect coverage changes

Copy link
Contributor

mergify bot commented Jul 9, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 10, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@mergify mergify bot added the ci-passed label Jul 10, 2024
@shaoting-huang
Copy link
Contributor Author

/run-cpu-e2e

@shaoting-huang shaoting-huang changed the title enhance: binlog writer fallback encoding enhance: binlog primary key encoding Jul 12, 2024
Copy link
Contributor

mergify bot commented Jul 12, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented Jul 12, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@shaoting-huang shaoting-huang force-pushed the parquet_encoding branch 3 times, most recently from ea914b5 to 3d52fe7 Compare July 15, 2024 09:01
@shaoting-huang shaoting-huang changed the title enhance: binlog primary key encoding enhance: binlog primary key turn off dict encoding Jul 15, 2024
Copy link
Contributor

mergify bot commented Jul 15, 2024

@shaoting-huang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

internal/storage/payload_writer_test.go Show resolved Hide resolved
internal/storage/serde.go Outdated Show resolved Hide resolved
internal/storage/serde.go Outdated Show resolved Hide resolved
@tedxu
Copy link
Collaborator

tedxu commented Jul 16, 2024

/lgtm

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>

add ut

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
@tedxu
Copy link
Collaborator

tedxu commented Jul 17, 2024

/lgtm

@czs007
Copy link
Contributor

czs007 commented Jul 17, 2024

/approve

@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: czs007, shaoting-huang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit 88b373b into milvus-io:master Jul 17, 2024
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement lgtm size/XL Denotes a PR that changes 500-999 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants