Support and test mutation log queries at intermediate timestamps #1380
Conversation
Mostly nits, plus one "pleeeease".
Ok, sooo... I've removed all concepts of what "quantum" the storage layer has from tests, or anywhere else outside of the storage implementation itself.

In order to have consistent behavior for the HighWatermarks test across storage implementations, I've settled on returning a timestamp "just beyond" the primary key of the highest item.

It is now the storage layer's job to round timestamp inputs up to whatever quantum the storage layer itself decides to use. This preserves all the necessary semantics and (very helpfully) we now have tests that verify this behavior.
```go
type LogsReadWriter interface {
	sequencer.LogsReader

	// Send submits the whole group of mutations atomically to a log.
	// TODO(gbelvin): Create a batch-level object to make it clear that this is a batch of updates.
```
pavelkalinnikov (Contributor), Nov 9, 2019

Is this TODO worth it? When a parameter is a list, it's already clear that it's a batch API.
gdbelvin (Author, Collaborator), Nov 11, 2019

I think the idea was to create a distinct type to hold batches of mutations. This would make the ReadLogs API much clearer, since we could return []Batches rather than having the semi-awkward current situation of returning more than batchSize mutations.

@mhutchinson -- your thoughts?
mhutchinson (Contributor), Nov 12, 2019

If a batch is a first-class object then it seems subtle and indirect that it is constructed by passing in a bunch of EntryUpdates at the same time. This feels like it should be a slice at a minimum.

For the purposes of just this method I wouldn't go beyond turning the varargs mutation into a slice, but an API shouldn't be designed a method at a time, so I'd need to re-review the whole API and future plans to make a call on the situation.
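As a strawman for the batch-object idea being discussed, a first-class batch could look something like the sketch below. `EntryUpdate`, `Batch`, and `Send` are all illustrative placeholders, not the project's real types:

```go
package main

import "fmt"

// EntryUpdate is a stand-in for the project's mutation message.
type EntryUpdate struct {
	Key, Value string
}

// Batch holds a group of mutations as a first-class object, so the
// API takes one explicit batch instead of variadic EntryUpdates.
type Batch struct {
	Updates []EntryUpdate
}

// Send pretends to submit the whole batch atomically to a log.
func Send(logID int64, b Batch) error {
	fmt.Printf("log %d: sending %d updates atomically\n", logID, len(b.Updates))
	return nil
}

func main() {
	b := Batch{Updates: []EntryUpdate{
		{Key: "k1", Value: "v1"},
		{Key: "k2", Value: "v2"},
	}}
	if err := Send(1, b); err != nil {
		panic(err)
	}
}
```

Callers then construct the batch explicitly, which also gives ReadLogs a natural []Batch return shape.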
```diff
@@ -173,21 +176,25 @@ func (m *Mutations) HighWatermark(ctx context.Context, directoryID string, logID
 		ORDER BY Q.Time ASC
 		LIMIT ?
 	) AS T1`,
-		directoryID, logID, start, batchSize).
+		directoryID, logID, startQuery, batchSize).
```
pavelkalinnikov (Contributor), Nov 9, 2019
I'm surprised the SQL query takes time.Time as a parameter. My understanding was that it was a microseconds int.

Have you considered not using the DATETIME type? There are many details on how it works which, together with details on how time.Time works, make me nervous about missing some unexpected corner cases like leap years/seconds or timezones, or keeping in mind this implicit "microsecond quantum" quirk.

Is the MySQL DATETIME type one of the reasons you decided to use time.Time? I wonder if we could avoid this complexity as follows:

- Use the BIGINT MySQL type for clarity
- Use a time.Now().UnixNano() timestamp when writing
- Do not convert it back to time.Time when reading, just continue using the int
- Use a Watermark type to wrap this int in a type-safe manner
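The type-safe wrapper suggested in the last bullet might look like this minimal sketch. `Watermark`, `WatermarkNow`, and `Before` are assumptions for illustration, not the repository's API:

```go
package main

import (
	"fmt"
	"time"
)

// Watermark wraps a raw nanosecond count so it cannot be silently
// confused with other int64 values. Illustrative only.
type Watermark int64

// WatermarkNow captures the current time as a Watermark.
func WatermarkNow() Watermark { return Watermark(time.Now().UnixNano()) }

// Before reports whether w is strictly earlier than other.
func (w Watermark) Before(other Watermark) bool { return w < other }

func main() {
	a := Watermark(100)
	b := WatermarkNow()
	fmt.Println(a.Before(b)) // true for any wall clock after the epoch
}
```

The int stays an int end to end; the type system, rather than DATETIME round-trips, enforces correct usage.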
gdbelvin (Author, Collaborator), Nov 11, 2019
The big motivation for all these changes was managing the discrepancy between the MySQL implementation (which used to be BIGINT with UnixNanos) and Spanner, which uses microseconds. These were two very different kinds of "ints", which caused their own problems.

- Trying to abstract away the time precision of the storage layer is how we got to time.Time.
- Migrating to time.Time has also produced a whole new suite of tests to ensure that storage implementations implement these APIs correctly, and has removed quite a few sneaky modifications of these watermarks in the code.
- In addition to abstracting away the time precision of the storage layer, I've also tried to match the precision of MySQL with Spanner (hence the use of DATETIME).
I think your concerns are very legitimate and worth discussing further.
An interesting side effect of this PR is that it requires watermarks to have nanosecond precision.
codecov bot commented Nov 11, 2019
Codecov Report
```diff
@@           Coverage Diff            @@
##           master    #1380    +/-  ##
==========================================
+ Coverage   65.54%   65.61%   +0.06%
==========================================
  Files          52       52
  Lines        3959     3961       +2
==========================================
+ Hits         2595     2599       +4
+ Misses        969      964       -5
- Partials      395      398       +3
```

Continue to review the full report at Codecov.
@gdbelvin is this ready for review now, or are you planning on making the other changes regarding testing semantics for high watermark?
5f323f1 to 4d44f1e
Replace tests that wanted to predict the exact value of HighWatermark with tests that rely on the behavior of ReadLog to assert that the value returned is indeed correct. This allows different storage layers to use their own strategies for returning high watermarks. It also removes the requirement that HighWatermark values themselves be stored with nanosecond precision.
This also has the happy side effect of not imposing a particular time fidelity on the batch definition table. (Previous versions of this PR were going to force nanosecond fidelity.)
```go
if got := len(rows); got != tc.want {
	t.Fatalf("ReadLog(%v): len: %v, want %v", tc.limit, got, tc.want)
}
t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
```
mhutchinson (Contributor), Nov 12, 2019

I feel like I saw Pavel use an empty string instead of the fmt.Sprintf description, and the framework automatically generates a description?
gdbelvin (Author, Collaborator), Nov 12, 2019

Hmm, I suppose that's true. For tests with the same name, the framework will append _$i to the name. That feels a bit hacky. Here I preserve an explicit link between the test name and the row it's testing.
* master:
  - Reduce log spam (google#1382)
  - Support and test mutation log queries at intermediate timestamps (google#1380)
* master:
  - Define watermarks as micros (google#1384)
  - Library for converting time.Time to sequencer watermarks (google#1381)
  - Reduce log spam (google#1382)
  - Support and test mutation log queries at intermediate timestamps (google#1380)
  - In memory logs implementation (google#1375)
  - Fix generic comparisons on protobuf messages (google#1379)
  - Do pagination the right way (google#1378)
  - Move the responsibility to pick an input log from storage to the frontend (google#1376)
  - Init metrics for whole test file (google#1373)
  - Break layering violation by using native types (google#1374)
  - Switch Timestamp storage to mysql DATETIME (google#1369)
  - Use testdb in integration tests (google#1371)
gdbelvin commented Nov 7, 2019

Standardize on quantums of time.Microsecond