lightning: support importing timestamp from Hive parquet #37685
/component lightning
@@ -446,6 +449,13 @@ func setDatumByString(d *types.Datum, v string, meta *parquet.SchemaElement) {
	if meta.LogicalType != nil && meta.LogicalType.DECIMAL != nil {
		v = binaryToDecimalStr([]byte(v), int(meta.LogicalType.DECIMAL.Scale))
	}
	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len([]byte(v)) == 12 {
Suggested change:
-	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len([]byte(v)) == 12 {
+	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len(v) == 12 {
Or maybe use 96/8 to replace 12 😂
What is the value of LogicalType when this branch is entered?
No logical type is set (nil).
	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len([]byte(v)) == 12 {
		ts := int96ToTime([]byte(v))
		ts = ts.UTC()
		tsStr := ts.Format("2006-01-02 15:04:05.999999Z")
why do we choose this layout?
Copied from tidb/br/pkg/lightning/mydump/parquet_parser.go, lines 512 to 513 in ec9e43e:
	timeStr := formatTime(v, logicalType.TIMESTAMP.Unit, "2006-01-02 15:04:05.999999",
		"2006-01-02 15:04:05.999999Z", logicalType.TIMESTAMP.IsAdjustedToUTC)
This layout is used when setting a UTC timestamp.
	return time.Unix(sec, int64(nsec))
}

func int96ToTime(parquetDate []byte) time.Time {
Can you reference the parquet timestamp encoding in a comment?
maybe this is enough
@@ -26,6 +27,8 @@ const (
	// if a parquet if small than this threshold, parquet will load the whole file in a byte slice to
	// optimize the read performance
	smallParquetFileThreshold = 256 * 1024 * 1024
	jan011970 = 2440588
what's the meaning of 2440588
The decoding of the INT96 timestamp refers to a parquet library called parquet-go. Will paste some comments here tomorrow.
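For context on the constant: 2440588 is the Julian day number of 1970-01-01 in the Gregorian calendar, which is why it serves as the offset between Julian days and the Unix epoch. The sketch below checks this with a commonly cited integer formula (often attributed to Fliegel and Van Flandern). The function name `julianDayNumber` is hypothetical and is not part of the PR or of parquet-go.

```go
package main

import "fmt"

// julianDayNumber returns the Julian day number of a Gregorian calendar date.
// It relies on Go's integer division truncating toward zero.
func julianDayNumber(year, month, day int) int {
	a := (month - 14) / 12 // -1 for January and February, 0 otherwise
	return 1461*(year+4800+a)/4 +
		367*(month-2-12*a)/12 -
		3*((year+4900+a)/100)/4 +
		day - 32075
}

func main() {
	fmt.Println(julianDayNumber(1970, 1, 1)) // prints 2440588
}
```

Subtracting this value from a stored Julian day therefore yields whole days since the Unix epoch.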

func jdToTime(jd int32, nsec int64) time.Time {
	sec := int64(jd-jan011970) * secPerDay
	return time.Unix(sec, int64(nsec))
Suggested change:
-	return time.Unix(sec, int64(nsec))
+	return time.Unix(sec, nsec)
Also, can you add unit tests for nsec outside [0, 999999999]? (See the doc comment of time.Unix.)
Will take a look at this later.
rest lgtm
@@ -446,6 +451,11 @@ func setDatumByString(d *types.Datum, v string, meta *parquet.SchemaElement) {
	if meta.LogicalType != nil && meta.LogicalType.DECIMAL != nil {
		v = binaryToDecimalStr([]byte(v), int(meta.LogicalType.DECIMAL.Scale))
	}
	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len([]byte(v)) == 96/8 {
Suggested change:
-	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len([]byte(v)) == 96/8 {
+	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len(v) == 96/8 {
	if meta.Type != nil && *meta.Type == parquet.Type_INT96 && len([]byte(v)) == 96/8 {
		ts := int96ToTime([]byte(v))
		ts = ts.UTC()
		v = ts.Format("2006-01-02 15:04:05.999999Z")
this variable can be extracted as a constant
Co-authored-by: lance6716 <lance6716@gmail.com>
/run-integration-br-test
/run-all-tests
/run-mysql-test
/merge
This pull request has been accepted and is ready to merge. Commit hash: 63dfb26
TiDB MergeCI notify✅ Well Done! New fixed [1] after this pr merged.
What problem does this PR solve?
Issue Number: close #37536
Problem Summary:
What is changed and how it works?
Timestamps exported from Hive are encoded as INT96, which neither parquet-go nor Go itself supports natively. Lightning therefore stores these values as raw byte strings, and because no timestamp is extracted from those bytes, the imported result is wrong.
This PR identifies INT96 timestamps and decodes the actual time from them.