
Lightning: Performance Regression on 6.2.0 Compared with 5.3.3 on Parquet Data Source with Strings #38351

Closed
dsdashun opened this issue Oct 10, 2022 · 5 comments · Fixed by #38391
Labels
affects-5.4, affects-6.0, affects-6.1, affects-6.2, affects-6.3, component/lightning, severity/major, type/bug

Comments

@dsdashun
Contributor

dsdashun commented Oct 10, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Import a moderately sized data set with two different versions of Lightning: 1) 6.2.0; 2) 5.3.3.
The conditions to reproduce:

  • The data format of the source is parquet
  • The data set to import contains some string columns
  • The table schema should set the encoding to 'binary'

2. What did you expect to see? (Required)

  • The import time on the newer version of Lightning is roughly the same as on the older version.

3. What did you see instead (Required)

  • It takes longer to finish the import on 6.2.0 than on 5.3.3.
  • From the logs, the KV encoding time is much longer on 6.2.0, while the KV import time is roughly the same as on 5.3.3.

4. What is your TiDB version? (Required)

6.2.0

@dsdashun added the type/bug label Oct 10, 2022
@dsdashun
Contributor Author

/component lightning

@ti-chi-bot added the component/lightning label Oct 10, 2022
@dsdashun
Contributor Author

I've captured a profile of 6.2.0. The bottleneck lies in converting each string datum value from the data source into a normalized string datum value. This logic was first introduced in 5.4.0; here is the PR. Originally, string values did not go through any normalization. After this PR, each string datum checks the charset and collation to normalize the value.

By default, when creating a string column in a table without specifying a collation or charset, the collation is empty and the charset is 'binary'. In this case, when the normalization logic tries to look up the collation, it cannot find it, so it builds an error object with a stack trace and returns it. The upper-level caller catches the error, logs a warning, ignores the error, and falls back to returning the raw string as the datum value.
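
For context, here is a minimal sketch of that fallback path, assuming TiDB 6.x import paths. This is not the actual TiDB source; `normalizedString` is a hypothetical stand-in for the normalization step, but the control flow matches the description: an unknown collation produces an error carrying a stack trace, a warning is logged, and the raw string is kept.

```go
package sketch

import (
	"github.com/pingcap/log"
	"github.com/pingcap/tidb/parser/charset"
	"github.com/pingcap/tidb/types"
	"go.uber.org/zap"
)

// normalizedString is a hypothetical stand-in for the normalization step
// described above. With an empty collation on the datum, the lookup fails,
// so the error object and the warning log are produced once per value.
func normalizedString(d *types.Datum) string {
	if _, err := charset.GetCollationByName(d.Collation()); err != nil {
		// The error carries a stack trace; the caller logs a warning,
		// ignores the error, and falls back to the raw string.
		log.Warn("unknown collation, falling back to the raw string", zap.Error(err))
		return d.GetString()
	}
	// ... normalize the value according to the collation (omitted) ...
	return d.GetString()
}
```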

So compared to 5.3.3, each string value normalization generates an extra error object with a stack trace and writes an extra warning log. Surprisingly, the cumulative time spent generating errors with stack traces is huge, and that is the bottleneck.
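
To make that cost concrete, here is a small, self-contained Go benchmark sketch (not TiDB code) contrasting a pre-built sentinel error with an error type that records a stack trace when constructed, which roughly approximates what happens once per string value on the regressed path.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

// A plain sentinel error: no allocation or stack walk per use.
var errUnknownCollation = errors.New("unknown collation")

var sink error // prevents the compiler from eliding the loops

// stackError mimics an error type that records the call stack at creation,
// similar in spirit to github.com/pingcap/errors used inside TiDB.
type stackError struct {
	msg string
	pcs [32]uintptr
}

func (e *stackError) Error() string { return e.msg }

func newStackError(msg string) error {
	e := &stackError{msg: msg}
	runtime.Callers(2, e.pcs[:]) // walking the stack is the expensive part
	return e
}

func main() {
	const iters = 1_000_000

	start := time.Now()
	for i := 0; i < iters; i++ {
		sink = errUnknownCollation
	}
	fmt.Println("sentinel error:   ", time.Since(start))

	start = time.Now()
	for i := 0; i < iters; i++ {
		sink = newStackError("unknown collation") // allocates and records a stack trace every time
	}
	fmt.Println("stack-trace error:", time.Since(start))
	_ = sink
}
```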

@ti-chi-bot added the may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, may-affects-6.2, and may-affects-6.3 labels Oct 10, 2022
@dsdashun
Contributor Author

I've taken a deeper look at why the collation settings are not correct. It appears to be related to the parquet data source: when reading parquet data and setting the string datum, the collation is passed as an empty string.

Then, when the logic comes here to normalize the string datum:

coll, err := charset.GetCollationByName(d.Collation())

the empty string does not match any entry in the collation name map, so an error with a stack trace is returned.

Currently, all other data formats (CSV / SQL) pass "utf8mb4_bin" as the collation value. So the solution is to change all the SetString() calls in the parquet parser to use "utf8mb4_bin" as the collation; a simplified sketch follows.
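
A simplified sketch of that change, assuming the types.Datum API of TiDB 6.x (this is not the actual Lightning parser code; the function name is illustrative):

```go
package parquet

import "github.com/pingcap/tidb/types"

// setStringDatum illustrates the intended fix in the parquet parser:
// pass "utf8mb4_bin" instead of an empty collation when setting string
// datums, so the later charset.GetCollationByName lookup succeeds.
func setStringDatum(d *types.Datum, val string) {
	// before (regressed behavior): d.SetString(val, "")
	d.SetString(val, "utf8mb4_bin")
}
```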

@dsdashun changed the title from "Lightning: Performance Regression on 6.2.0 Compared with 5.3.3" to "Lightning: Performance Regression on 6.2.0 Compared with 5.3.3 on Parquet Data Source with Strings" Oct 11, 2022
@dsdashun
Contributor Author

After taking a further look at the code: to trigger this performance regression, the encoding of some of the target table's string columns must be set to 'binary'.

@lance6716
Contributor

Please check which versions this bug affects, and update the issue tags accordingly.

@dsdashun removed the may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, may-affects-6.2, and may-affects-6.3 labels Oct 13, 2022