
feat: beginnings of a mysql backend #22

Merged
merged 1 commit into master from feat/18 on Sep 11, 2018
Conversation

@pjenvey (Member) commented Sep 10, 2018

w/ some initial calls and a test suite migrated from the sqlite version

  • prefers raw DQL (note: not DML) queries vs diesel's query builder for
    potential reuse for other backends (spanner)

  • TODO: further fleshing out of the types, likely wanting i64 or wrappers
    everywhere (as all spanner has is INT64) -- nor should the db layer be
    responsible for conversions from unsigned

Issue #18
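For illustration, here is a minimal sketch of the raw-query style the description prefers, assuming diesel 1.x with the mysql feature; the struct, function, and exact SQL are illustrative, not this PR's actual code:

#[macro_use]
extern crate diesel;

use diesel::mysql::MysqlConnection;
use diesel::sql_types::{BigInt, Integer, Text};
use diesel::{sql_query, QueryResult, RunQueryDsl};

#[derive(QueryableByName)]
struct Collection {
    #[sql_type = "Integer"]
    id: i32,
    #[sql_type = "BigInt"]
    modified: i64,
}

// The SQL is a plain string rather than query-builder calls, so the same
// statement could be reused nearly verbatim against another backend (spanner).
fn get_collection(conn: &MysqlConnection, name: &str) -> QueryResult<Option<Collection>> {
    let mut rows = sql_query("SELECT id, modified FROM collections WHERE name = ?")
        .bind::<Text, _>(name)
        .load::<Collection>(conn)?;
    Ok(rows.pop())
}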

CREATE TABLE `collections` (
`id` INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
`name` VARCHAR(32) UNIQUE NOT NULL,
`modified` BIGINT DEFAULT '0' NOT NULL
Member

What kind of modification would this undergo?

Contributor

I don't think we need a modified timestamp here, because any modification timestamp is going to be per-user rather than global.

CREATE TABLE `user_collections` (
`user_id` INT NOT NULL,
`collection_id` INT NOT NULL,
`modified` BIGINT NOT NULL,
Member

Again, I'm curious about what kind of modifications are being tracked here. I realize this is likely due to something the Go/Python version tracks, but it'd be nice to document either inline or near here why this field exists.

Contributor

This should be the last-modified timestamp of the collection as a whole, and should change whenever an item is added, modified or deleted in the collection. It will be greater-than-or-equal-to the MAX() of the modified column for that collection in the bso table, with the greater-than case being because an item was deleted.

FWIW, the python server has a bug here in its handling of deleted collections:

mozilla-services/server-syncstorage#62

Basically, it tries to calculate the last-modified time of the storage as a whole by doing SELECT MAX(modified) FROM user_collections WHERE uid = X. That's incorrect in the case of a deleted collection, which should cause the last-modified time of the storage as a whole to increase, but won't affect the last-modified time of any remaining collections.

It's clearly an edge-case, because we haven't bothered to actually fix it in the python version. But for greenfield code it's probably worth doing it right the first time. I suggest adding a deleted_at timestamp field to the user_collections table, so that we can explicitly track the deletion of collections. But that can be a follow-up issue if necessary.
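A hedged sketch of that suggestion follows; the deleted_at column and the query are assumptions on top of this PR's schema, not part of it:

#[macro_use]
extern crate diesel;

use diesel::mysql::MysqlConnection;
use diesel::sql_types::{BigInt, Integer};
use diesel::{sql_query, QueryResult, RunQueryDsl};

#[derive(QueryableByName)]
struct LastModified {
    #[sql_type = "BigInt"]
    last_modified: i64,
}

// Storage-level last-modified that still advances when a collection is
// deleted: keep a tombstone row (deleted_at) instead of removing it, then
// take the max over both live and deleted collections.
fn storage_last_modified(conn: &MysqlConnection, user_id: i32) -> QueryResult<i64> {
    let row: LastModified = sql_query(
        "SELECT COALESCE(MAX(GREATEST(modified, COALESCE(deleted_at, 0))), 0) \
         AS last_modified FROM user_collections WHERE user_id = ?",
    )
    .bind::<Integer, _>(user_id)
    .get_result(conn)?;
    Ok(row.last_modified)
}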

Member Author

#23

}
}

impl Fail for DbError {
Member

Nice!

src/db/mysql/test.rs: two outdated review threads (resolved)
@rfk (Contributor) left a comment

This looks like a great start!

Architecturally, I'm a little worried about the way that the DB is taking transactions and querying the current time from behind its API boundary. This could lead to subtle edge-cases under high concurrency, such as two concurrent PUTs inserting items with the same timestamp into the same collection, which could cause the two clients to never sync down each other's changes.

Ideally, each HTTP request would be processed in a single logical transaction and at a single logical instant in time. (It might help to think of the modified integers here not as timestamps, but as opaque version numbers, with a new version number being generated for each change to the collection).

I wonder if there's a way to help enforce that at the DB trait API level here.
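One hedged sketch of enforcing it at the trait level (all names here are illustrative; only DbError comes from this PR):

// The HTTP layer opens a transaction pinned to a single timestamp/version,
// and every db call for that request goes through the transaction, so the
// timestamp is generated once, outside the db layer.
pub trait Db {
    type Tx: DbTransaction;
    fn begin(&self, timestamp: i64) -> Result<Self::Tx, DbError>;
}

pub trait DbTransaction {
    // The one "logical instant" for the whole request; every write uses it.
    fn timestamp(&self) -> i64;
    fn commit(self) -> Result<(), DbError>;
    fn rollback(self) -> Result<(), DbError>;
}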

@@ -0,0 +1,70 @@
CREATE DATABASE IF NOT EXISTS `syncstorage` /*!40100 DEFAULT CHARACTER SET latin1 */;
Contributor

The BSO payload can in theory be arbitrary unicode, since it's JSON. Should we set an explicit encoding like utf8mb4 to guard against any weirdness in unicode handling?
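For reference, a sketch of what the suggested explicit encoding might look like; the collation choice is an assumption, not something raised in this thread:

// Migration DDL carried as a raw statement. utf8mb4 covers the full unicode
// range that JSON payloads may contain; the collation is an assumption.
const CREATE_DB: &str =
    "CREATE DATABASE IF NOT EXISTS `syncstorage` \
     DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";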

`sortindex` INT DEFAULT NULL,

`payload` MEDIUMTEXT NOT NULL,
`payload_size` INT DEFAULT '0' NOT NULL,
Contributor

This payload_size is a separate column to (in theory) make it quicker to calculate total size of items stored in a collection. IIRC we don't ever do that in practice, so it may be worth considering whether that optimization still makes sense for us here.

Member Author

payload_size does appear to be utilized in the info/quota and info/collection_usage calls

Contributor

It might be interesting to compare SELECT SUM(payload_size) vs SELECT SUM(LENGTH(payload)) for this purpose to see whether having it as a separate column really provides much value in practice.

To be clear, I don't have any particular objection to keeping it, just wondering whether the extra complexity pays for itself or not.
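The two alternatives under comparison, written out as raw statements (a sketch using the table and column names from this PR's schema):

// Cached size column vs computing the size from the payload itself.
const SUM_CACHED: &str =
    "SELECT SUM(payload_size) FROM bso WHERE user_id = ? AND collection_id = ?";
const SUM_COMPUTED: &str =
    "SELECT SUM(LENGTH(payload)) FROM bso WHERE user_id = ? AND collection_id = ?";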

CREATE TABLE `batches` (
`user_id` INT NOT NULL,
`collection_id` INT NOT NULL,
`id` varchar(64) NOT NULL,
Contributor

The "id" here is the batch id, right? It may be worth naming it "batch_id" or similar to avoid confusion.

}

impl MysqlDb {
pub fn get_collection_id_sync(
Contributor

It's not obvious to me why some of these are suffixed with _sync and some are not; what's the significance?

Member Author

The higher-level db interface (trait) supplies async calls, so the MysqlDb impl of that trait calls these sync methods via tokio's blocking wrapper.
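A minimal sketch of that pattern, assuming futures 0.1 and tokio-threadpool (the async model in use at the time); the Clone bound, argument types, and DbError::internal are assumptions, not this PR's code:

use futures::{future, Future};
use tokio_threadpool::blocking;

impl MysqlDb {
    pub fn get_collection_id(&self, name: String) -> impl Future<Item = i32, Error = DbError> {
        let db = self.clone();
        future::poll_fn(move || {
            // Run the synchronous diesel call on tokio's blocking thread pool.
            blocking(|| db.get_collection_id_sync(&name))
                .map_err(|_| DbError::internal("blocking thread pool shut down"))
        })
        // blocking() hands back our Result<i32, DbError>; flatten it.
        .and_then(|result| result)
    }
}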

}
*/

// XXX: consider mysql ON DUPLICATE KEY UPDATE?
Contributor

Please do :-)
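A sketch of the suggested upsert as a raw statement (the column list is abridged from the schema above; whether every column should be overwritten on conflict would need checking against the Sync API's update semantics):

const PUT_BSO: &str =
    "INSERT INTO bso (user_id, collection_id, id, sortindex, payload, modified) \
     VALUES (?, ?, ?, ?, ?, ?) \
     ON DUPLICATE KEY UPDATE sortindex = VALUES(sortindex), \
     payload = VALUES(payload), modified = VALUES(modified)";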

.execute(&self.conn)?;
} else {
let payload = bso.payload.as_ref().map(Deref::deref).unwrap_or_default();
let sortindex = bso.sortindex.unwrap_or_default();
Contributor

It's not obvious to me whether this defaults sortindex to NULL or to 0; I believe NULL is the correct behaviour.

Member Author

Good catch, this is leftover from the port from Go, which instead defaulted to 0.
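A sketch of the fix, mirroring the fragment above:

// Keep the Option: diesel binds None as SQL NULL for nullable columns,
// instead of the ported Go behaviour of defaulting a missing sortindex to 0.
let sortindex: Option<i32> = bso.sortindex;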


// fetch an extra row to detect if there are more rows that
// match the query conditions
query = query.limit(if limit >= 0 { limit + 1 } else { limit });
Contributor

The docs aren't clear on what happens when limit < 0, does it default to "no limit"?
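A sketch of one way to make the intent explicit, rather than relying on diesel/MySQL behaviour for a negative LIMIT (a fragment mirroring the code above):

if limit >= 0 {
    // fetch an extra row to detect if there are more rows that match
    query = query.limit(limit + 1);
}
// otherwise apply no LIMIT clause at all, i.e. "no limit"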

assert_eq!(bso.payload, payload);
assert_eq!(bso.sortindex, Some(sortindex));
// XXX: go version assumes ttl was updated here?
//assert_eq!(bso.expiry, modified + ttl);
Contributor

FWIW I wouldn't expect the ttl to be updated unless you explicitly sent a new TTL in the update.

let mut query = bso::table
//.select(bso::table::all_columns())
.select((bso::id, bso::modified, bso::payload, bso::sortindex, bso::expiry))
.filter(bso::user_id.eq(user_id as i32)) // XXX:
Member

I assume the // XXX is about maybe doing a checked conversion with ? so we can ensure we catch casting errors?
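A sketch of what the XXX might become; note that TryFrom for integers was not yet stable when this PR landed, and the DbError constructor here is an assumption:

use std::convert::TryFrom;

// Checked conversion rather than `as`: an out-of-range id surfaces as an
// error instead of silently truncating.
fn user_id_param(user_id: u64) -> Result<i32, DbError> {
    i32::try_from(user_id).map_err(|_| DbError::internal("user_id out of range"))
}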

@bbangert previously approved these changes Sep 11, 2018
@pjenvey (Member Author) commented Sep 11, 2018

Will address further things later (likely switching the get_bsos limit to u64 and the table encoding; the batches table/architecture is still up in the air).

I'm already intending to have transactions work like you suggest, @rfk -- a lot like the python version: one transaction started per handler request (TODO: a transaction() call added to the Db trait), with all db calls taking place within it.

The modified timestamp will likely follow the same pattern.

@pjenvey force-pushed the feat/18 branch 2 times, most recently from 5722817 to 64d8a48 on September 11, 2018 at 23:28
@bbangert merged commit fbd314b into master on Sep 11, 2018
@bbangert deleted the feat/18 branch on September 11, 2018 at 23:53