Disabling split by varchar #1414

Merged
merged 1 commit into from
Feb 22, 2024

Conversation

davidducos
Member

I decided that we are going to disable chunking by varchar columns.
It was not implemented correctly.
It was a good exercise, but it is not mature enough.

We need to develop a better understanding of the use cases where it will be useful, for instance when binary or utf8mb4 is used, and of the best approach to splitting the chunks.

@davidducos davidducos added this to the Release 0.16.1-1 milestone Feb 22, 2024
@davidducos davidducos closed this Feb 22, 2024
@davidducos davidducos reopened this Feb 22, 2024
@davidducos davidducos merged commit 530c689 into master Feb 22, 2024
31 of 35 checks passed
@midenok
Collaborator

midenok commented Mar 5, 2024

Any evidence that it doesn't work correctly?

@davidducos
Member Author

@midenok,
when you mix digits with upper- and lowercase characters, you might end up exporting the same chunks at different moments. This was not detected before due to checksum bugs.
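
To make the failure mode concrete, here is a minimal sketch of how it can happen. It is not mydumper's code: the keys, the chunk boundaries, and the `export` helper are all made up for illustration, and a case-insensitive collation is modelled simply as comparing lowercased ASCII strings. The assumption is that chunk boundaries are picked by walking the raw byte range (digits, then uppercase, then lowercase), while the server evaluates the range predicate under a case-insensitive collation.

```python
from collections import Counter

# Hypothetical key values mixing digits, uppercase and lowercase.
keys = ["7up", "Apple", "apple", "Mango", "mango", "Zebra", "zebra"]

# Hypothetical half-open chunk ranges [lo, hi), disjoint in byte order.
byte_chunks = [("0", "A"), ("A", "N"), ("N", "a"), ("a", "n"), ("n", "{")]


def export(chunks, fold):
    """Return every key selected by each chunk, comparing values through `fold`."""
    out = []
    for lo, hi in chunks:
        out.extend(k for k in keys if fold(lo) <= fold(k) < fold(hi))
    return out


# Binary comparison: the ranges are disjoint, every row is exported once.
binary = Counter(export(byte_chunks, lambda s: s.encode()))

# Simplified case-insensitive comparison: ("A", "N") and ("a", "n") now
# describe the same range, so their rows are exported twice.
ci = Counter(export(byte_chunks, str.lower))
duplicated = sorted(k for k, n in ci.items() if n > 1)

print("binary:", dict(binary))
print("case-insensitive duplicates:", duplicated)
```

Under the binary comparison all seven keys come out exactly once. Under the case-insensitive model the same boundaries select "Apple", "apple", "Mango" and "mango" in two different chunks, while the `("N", "a")` chunk becomes empty.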

@midenok
Collaborator

midenok commented Mar 6, 2024

What is the consequence of this? Duplicate PK?

@davidducos
Member Author

Hi @midenok, yes, data was being exported twice (or more), and in some scenarios it was not being exported at all due to invalid characters.
I checked the mysqlsh strategy and I don't like it, as it does a SELECT MIN,MAX LIMIT N, which might take about as long as executing the SELECT that extracts the data. I think that we can do better, but we need to take the character set and collation into consideration so that we don't leave gaps, and keep it performant.
The problem to solve is how we determine the min and max when we create a new chunk, taking into consideration the estimated number of rows to export.
Taking into account that we have utf8mb4 now, we need to plan a good strategy... my first implementation was good for understanding this and I learned a lot, but it was not going to scale and it didn't work in all use cases.
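
As a rough illustration of the "determine the min and max under the column's collation" problem, the sketch below lets the database itself pick each chunk's upper bound, so boundaries are always compared with the same collation that the export query uses. This is only one possible direction, not what mydumper or mysqlsh implement. It uses SQLite's `NOCASE` collation as a stand-in for a case-insensitive MySQL collation, and the table `t`, column `k`, and the helpers `chunk_bounds` and `export_chunk` are hypothetical.

```python
import sqlite3


def chunk_bounds(conn, rows_per_chunk):
    """Yield (lo, hi): lo is exclusive, hi is inclusive, None means unbounded."""
    lo = None
    while True:
        if lo is None:
            cur = conn.execute(
                "SELECT k FROM t ORDER BY k LIMIT 1 OFFSET ?",
                (rows_per_chunk - 1,),
            )
        else:
            cur = conn.execute(
                "SELECT k FROM t WHERE k > ? ORDER BY k LIMIT 1 OFFSET ?",
                (lo, rows_per_chunk - 1),
            )
        row = cur.fetchone()
        hi = row[0] if row else None  # no probe row left: last chunk is open-ended
        yield lo, hi
        if hi is None:
            return
        lo = hi


def export_chunk(conn, lo, hi):
    """Select the rows of one chunk using the column's own collation."""
    clauses, params = [], []
    if lo is not None:
        clauses.append("k > ?")
        params.append(lo)
    if hi is not None:
        clauses.append("k <= ?")
        params.append(hi)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return [r[0] for r in conn.execute("SELECT k FROM t" + where, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT COLLATE NOCASE)")
keys = ["7up", "Apple", "apple", "Mango", "mango", "Zebra", "zebra"]
conn.executemany("INSERT INTO t VALUES (?)", [(k,) for k in keys])

exported = [
    k for lo, hi in chunk_bounds(conn, 3) for k in export_chunk(conn, lo, hi)
]
print(sorted(exported) == sorted(keys))
```

Because values that compare equal under the collation always land on the same side of a boundary, no row is exported twice or skipped, although a chunk may hold more than `rows_per_chunk` rows. The trade-off is the one raised above: each boundary probe has to walk roughly `rows_per_chunk` entries, which may cost about as much as the export query itself.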

@davidducos davidducos deleted the disabling_split_by_varchar branch March 12, 2024 15:46