Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ranger: fix prefix index when charset is UTF-8 #7194

Merged
merged 9 commits into from Jul 31, 2018

Conversation

@birdstorm
Copy link
Member

commented Jul 30, 2018

What have you changed? (mandatory)

Fix #7115, this PR will fix prefix index when charset is UTF-8. Previously the index was cut by bytes rather than characters.

What is the type of the changes? (mandatory)

  • Bug fix (non-breaking change which fixes an issue)

How has this PR been tested? (mandatory)

Unit Test

Does this PR affect documentation (docs/docs-cn) update? (mandatory)

NO

Does this PR affect tidb-ansible update? (mandatory)

NO

Does this PR need to be added to the release notes? (mandatory)

release note:
fix prefix index, when the charset is utf8 or utf8mb4, truncate it from runes rather than bytes.

Refer to a related PR or issue link (optional)

Benchmark result if necessary (optional)

Add a few positive/negative examples (optional)

@birdstorm birdstorm requested review from winkyao and winoros Jul 30, 2018
@birdstorm birdstorm referenced this pull request Jul 30, 2018
@shenli shenli added the type/bug-fix label Jul 30, 2018
if length != types.UnspecifiedLength && length < len(v.GetBytes()) {
v.SetBytes(v.GetBytes()[:length])
// In case of UTF8, prefix should be cut by characters rather than bytes
if v.Kind() == types.KindString || v.Kind() == types.KindBytes {

This comment has been minimized.

Copy link
@shenli

shenli Jul 30, 2018

Member

For other types, should we consider length?

This comment has been minimized.

Copy link
@birdstorm

birdstorm Jul 31, 2018

Author Member

I doubt if it is possible to have prefix index on other types...

This comment has been minimized.

Copy link
@birdstorm

birdstorm Jul 31, 2018

Author Member

For string columns, indexes can be created that use only the leading part of column values, using col_name(length) syntax to specify an index prefix length:

  • Prefixes can be specified for CHAR, VARCHAR, BINARY, and VARBINARY key parts.

  • Prefixes must be specified for BLOB and TEXT key parts. Additionally, BLOB and TEXT columns can be indexed only for InnoDB, MyISAM, and BLACKHOLE tables.

"github.com/pingcap/tidb/util/codec"
"unicode/utf8"

This comment has been minimized.

Copy link
@shenli

shenli Jul 30, 2018

Member

Move this to the upper group.

This comment has been minimized.

Copy link
@winkyao

winkyao Jul 31, 2018

Member

Use gofmt .

@@ -178,7 +178,7 @@ func (s *testPlanSuite) TestDAGPlanBuilderSimpleCase(c *C) {
// Test index filter condition push down.
{
sql: "select * from t use index(e_d_c_str_prefix) where t.c_str = 'abcdefghijk' and t.d_str = 'd' and t.e_str = 'e'",
best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\",\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",
best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"abcdefghij\",\"e\" \"d\" \"abcdefghij\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",

This comment has been minimized.

Copy link
@shenli

shenli Jul 30, 2018

Member

Why is this changed?

This comment has been minimized.

Copy link
@birdstorm

birdstorm Jul 31, 2018

Author Member

You see, now prefix index is set by string when charset is UTF-8 rather than bytes.

@shenli

This comment has been minimized.

Copy link
Member

commented Jul 30, 2018

/run-all-tests

Copy link
Member

left a comment

LGTM

@shenli

This comment has been minimized.

Copy link
Member

commented Jul 31, 2018

@birdstorm Please fix the CI.

Copy link
Member

left a comment

Please address comments.

@zhexuany

This comment has been minimized.

Copy link
Member

commented Jul 31, 2018

/run-all-tests

@winoros

This comment has been minimized.

Copy link
Member

commented Jul 31, 2018

/run-unit-test

Copy link
Member

left a comment

lgtm

// In case of UTF8, prefix should be cut by characters rather than bytes
if v.Kind() == types.KindString || v.Kind() == types.KindBytes {
colCharset := tp.Charset
if colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4 {

This comment has been minimized.

Copy link
@zz-jason

zz-jason Jul 31, 2018

Member

how about:

colValue := v.GetBytes()
isUTF8Charset := colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4
if isUTF8Charset && length != types.UnspecifiedLength && utf8.RuneCount(colValue) > length {
	...
} else if length != types.UnspecifiedLength && len(colValue) > length {
	v.SetBytes(colValue[:length])
}

This comment has been minimized.

Copy link
@birdstorm

birdstorm Jul 31, 2018

Author Member

@zz-jason Actually I just copied the same logic from

func (c *index) truncateIndexValuesIfNeeded(indexedValues []types.Datum) []types.Datum {

I left it the same because it seems hard to decide where to put the reusable code. Should both logic be changed at the moment or I could move on to another PR to solve this?

This comment has been minimized.

Copy link
@zz-jason

zz-jason Jul 31, 2018

Member

Uh, I think we can change both logic here in this PR.

This comment has been minimized.

Copy link
@birdstorm

birdstorm Jul 31, 2018

Author Member

Done. In addition, isUTF8Charset must be on the outer if-statement to maintain the correct logic.

Copy link
Member

left a comment

LGTM

Copy link
Member

left a comment

LGTM

@shenli

This comment has been minimized.

Copy link
Member

commented Jul 31, 2018

@zz-jason Should we cherry-pick this PR to the release-2.0?

@winoros winoros changed the title Fix prefix index when charset is UTF-8 ranger: fix prefix index when charset is UTF-8 Jul 31, 2018
@zz-jason

This comment has been minimized.

Copy link
Member

commented Jul 31, 2018

Yes, and I think this PR needs a release note.

@birdstorm

This comment has been minimized.

Copy link
Member Author

commented Jul 31, 2018

@zz-jason yes, I have it in issue description. BTW, please note that this PR might affect speed of prefix index because it has an extra Rune than before (and it might be pretty SLOW).

Copy link
Member

left a comment

LGTM

@zhexuany zhexuany merged commit 42bba99 into master Jul 31, 2018
4 checks passed
4 checks passed
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
jenkins-ci-tidb/build Jenkins job succeeded.
Details
license/cla Contributor License Agreement is signed.
Details
@zhexuany zhexuany deleted the birdstorm/fix_utf8_prefix branch Jul 31, 2018
birdstorm added a commit to birdstorm/tidb that referenced this pull request Aug 1, 2018
birdstorm added a commit to birdstorm/tidb that referenced this pull request Aug 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.