
Lightning: Performance Regression on 6.2.0 Compared with 5.3.3 on Parquet Data Source with Strings #38351

Closed
dsdashun opened this issue Oct 10, 2022 · 5 comments · Fixed by #38391
Labels
affects-5.4, affects-6.0, affects-6.1, affects-6.2, affects-6.3, component/lightning, severity/major, type/bug

Comments

@dsdashun
Contributor

dsdashun commented Oct 10, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Import a moderately sized data set with two different versions of Lightning: 1) 6.2.0; 2) 5.3.3.
The conditions to reproduce:

  • The data format of the source is parquet
  • The data set to import contains some string columns
  • The table schema should set the encoding to 'binary'

2. What did you expect to see? (Required)

  • The import time on the newer version of Lightning is roughly the same as on the older version.

3. What did you see instead (Required)

  • It takes longer to finish the import on 6.2.0 than on 5.3.3.
  • From the logs, the KV encoding time is much longer on 6.2.0, while the KV import time is roughly the same as on 5.3.3.

4. What is your TiDB version? (Required)

6.2.0

@dsdashun added the type/bug label Oct 10, 2022
@dsdashun
Contributor Author

/component lightning

@ti-chi-bot added the component/lightning label Oct 10, 2022
@dsdashun
Contributor Author

I've captured a profile of 6.2.0. The bottleneck lies in converting each string datum value from the data source into a normalized string datum value. This logic was first introduced in 5.4.0; here is the PR. Originally, string values did not go through any normalization. After this PR, each string datum checks the charset and collation to normalize the value.

By default, when creating a string column in a table without specifying a collation or charset, the collation is empty and the charset is 'binary'. In this case, when the normalization logic tries to look up the collation, it cannot find it, so it builds an error object with a stack trace and returns it. The upper-level caller catches the error, logs a warning, ignores the error, and falls back to returning the raw string as the datum value.
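
For context, here is a minimal sketch of that fallback path, assuming TiDB 6.x import paths. This is not the actual TiDB source; `normalizedString` is a hypothetical stand-in for the normalization step, but the control flow matches the description: an unknown collation produces an error carrying a stack trace, a warning is logged, and the raw string is kept.

```go
package sketch

import (
	"github.com/pingcap/log"
	"github.com/pingcap/tidb/parser/charset"
	"github.com/pingcap/tidb/types"
	"go.uber.org/zap"
)

// normalizedString is a hypothetical stand-in for the normalization step
// described above. With an empty collation on the datum, the lookup fails,
// so the error object and the warning log are produced once per value.
func normalizedString(d *types.Datum) string {
	if _, err := charset.GetCollationByName(d.Collation()); err != nil {
		// The error carries a stack trace; the caller logs a warning,
		// ignores the error, and falls back to the raw string.
		log.Warn("unknown collation, falling back to the raw string", zap.Error(err))
		return d.GetString()
	}
	// ... normalize the value according to the collation (omitted) ...
	return d.GetString()
}
```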

So compared to 5.3.3, each string value normalization generates an extra error object with a stack trace and writes an extra warning log. Surprisingly, the cumulative time spent generating errors with stack traces is huge, and that is the bottleneck.
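
To make that cost concrete, here is a small, self-contained Go benchmark sketch (not TiDB code) contrasting a pre-built sentinel error with an error type that records a stack trace when constructed, which roughly approximates what happens once per string value on the regressed path.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

// A plain sentinel error: no allocation or stack walk per use.
var errUnknownCollation = errors.New("unknown collation")

var sink error // prevents the compiler from eliding the loops

// stackError mimics an error type that records the call stack at creation,
// similar in spirit to github.com/pingcap/errors used inside TiDB.
type stackError struct {
	msg string
	pcs [32]uintptr
}

func (e *stackError) Error() string { return e.msg }

func newStackError(msg string) error {
	e := &stackError{msg: msg}
	runtime.Callers(2, e.pcs[:]) // walking the stack is the expensive part
	return e
}

func main() {
	const iters = 1_000_000

	start := time.Now()
	for i := 0; i < iters; i++ {
		sink = errUnknownCollation
	}
	fmt.Println("sentinel error:   ", time.Since(start))

	start = time.Now()
	for i := 0; i < iters; i++ {
		sink = newStackError("unknown collation") // allocates and records a stack trace every time
	}
	fmt.Println("stack-trace error:", time.Since(start))
	_ = sink
}
```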

@ti-chi-bot added the may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, may-affects-6.2, and may-affects-6.3 labels Oct 10, 2022
@dsdashun
Contributor Author

I've taken a deeper look at why the collation settings are not correct. It appears to be related to the parquet data source: when reading parquet data and setting the string datum, the collation is passed as an empty string.

Then, when the logic comes here to normalize the string datum:

coll, err := charset.GetCollationByName(d.Collation())

the empty string does not match any entry in the collation name map, so an error with a stack trace is returned.

Currently, all other data formats (CSV / SQL) pass "utf8mb4_bin" as the collation value. So the solution is to change all the SetString() calls in the parquet parser to use "utf8mb4_bin" as the collation; a simplified sketch follows.
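
A simplified sketch of that change, assuming the types.Datum API of TiDB 6.x (this is not the actual Lightning parser code; the function name is illustrative):

```go
package parquet

import "github.com/pingcap/tidb/types"

// setStringDatum illustrates the intended fix in the parquet parser:
// pass "utf8mb4_bin" instead of an empty collation when setting string
// datums, so the later charset.GetCollationByName lookup succeeds.
func setStringDatum(d *types.Datum, val string) {
	// before (regressed behavior): d.SetString(val, "")
	d.SetString(val, "utf8mb4_bin")
}
```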

@dsdashun changed the title from "Lightning: Performance Regression on 6.2.0 Compared with 5.3.3" to "Lightning: Performance Regression on 6.2.0 Compared with 5.3.3 on Parquet Data Source with Strings" Oct 11, 2022
@dsdashun
Contributor Author

After taking a further look at the code: to trigger this performance regression, the encoding of some of the target table's string columns must be set to 'binary'.

@lance6716
Contributor

Please check which versions this bug affects, and update the issue tags accordingly.

@dsdashun removed the may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, may-affects-6.2, and may-affects-6.3 labels Oct 13, 2022