Reinventing the wheel with custom serialization? #29

Closed · straight-shoota opened this issue Mar 6, 2019 · 4 comments
straight-shoota commented Mar 6, 2019

In order to store job data in the queue, it needs to be serialized and deserialized for execution.

I'm wondering why this shard rolls its own proprietary serialization mechanism when there are widely supported general-purpose solutions to this problem. It makes things unnecessarily difficult for users, because you need to implement custom serialization methods for anything but the most basic data types.

In my opinion, using a standard format like JSON for this would provide great benefits. There is practically no downside I can think of besides the cost of switching (which is not very high).
JSON is a universal data format, and with Crystal's JSON implementation it is really easy to provide bindings for custom data types, which are then not limited to use with mosquito. It is string based, so instead of storing a hash in redis, mosquito would store a string; I don't think that makes much of a difference. The current implementation also has some issues, because job data and task metadata are stored in the same hash, which will cause problems if a task uses e.g. enqueue_time as a parameter.

A huge benefit besides usability is the reduction of complexity. Serializing and deserializing JSON is all provided by Crystal's stdlib. The entire params macro could be reduced or potentially even completely removed in favour of using plain properties/ivars.
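As a rough illustration (the type and field names below are made up, not anything from mosquito), a custom data type only needs to include JSON::Serializable from the stdlib to round-trip through a string:

require "json"

# Hypothetical value object used as a job parameter.
struct ShardRef
  include JSON::Serializable

  getter name : String
  getter version : String

  def initialize(@name : String, @version : String)
  end
end

ref = ShardRef.new("kemal", "0.25.1")
json = ref.to_json       # => {"name":"kemal","version":"0.25.1"}
ShardRef.from_json(json) # => ShardRef(@name="kemal", @version="0.25.1")

And that serialization is reusable anywhere, not just inside a job definition.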

robacarp (Collaborator) commented Mar 7, 2019

@straight-shoota yep, and I agree. The serialization system is both redundant and complicated. You probably shouldn't be using the mosquito custom serializers. Ideally you shouldn't have to.

Most importantly, mosquito should not be used as a canonical data store. You shouldn't be storing full objects as parameters in your jobs -- whatever your real data store is, send the keys to your objects into a job instead of trying to serialize the entire object. The first task in your job should be to read any objects from the data store. Doing it this way accomplishes three things: it keeps the mosquito data footprint low, prevents stale objects (and thus decreases the probability of a distributed race condition), and maintains the single responsibility principle (let your ORM do the ORM-ing, and mosquito can handle the job scheduling). The current, unrefined method is left as such to be a kind of syntactic salt. The feature is there, if you really want it, but you probably shouldn't be using it. It's not a hard rule; it's just the choice I made. There is certainly some iteration to be done even within that paradigm.
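Roughly, the pattern looks like this (a sketch from memory of the README's params/perform DSL -- the exact macro syntax may differ, and User/Mailer are hypothetical stand-ins for your real data store and mailer):

class SendWelcomeEmailJob < Mosquito::QueuedJob
  # Store only the key, never the serialized object.
  params(user_id : Int64)

  def perform
    # First step of the job: re-read the object from the canonical store.
    user = User.find(user_id)
    Mailer.welcome(user).deliver
  end
end

# Enqueue with just the id:
SendWelcomeEmailJob.new(user_id: user.id).enqueue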

Secondly, as you mention, mosquito architecturally gets in the way here. The metadata and job parameters are currently stored in the same redis hash. If it's used with a simple set of parameters as recommended above, this isn't a big problem, but even the recently implemented throttling code exposes some of the ways this is not going to be a scalable paradigm. Storing JSON in redis instead of using redis' built-in hashes feels like reinventing an altogether different wheel, but I digress.

An early iteration of mosquito just serialized and stored all ivars, and it was fine, but it wasn't flexible. There are plenty of uses for ivars which are explicitly not serialized, and blacklisting is less intuitive than whitelisting.

Would you mind posting or linking to some job code? I'd love to see some of the ways it's being used and document specific situations where the current framework is getting in the way.

straight-shoota (Author) commented

Yes, of course, job arguments shouldn't store any state. But job configuration can include custom data types (such as enums or value objects) which are built from simple values yet have no default serialization handlers. Setting up JSON serialization for such types is trivial and reusable.

Storing JSON in redis instead of using redis' built-in hashes feels like reinventing an altogether different wheel

That's actually a very common pattern and considered good practice, because JSON is much more flexible than redis hashes. Similar applications use this method; Sidekiq, for example, stores jobs as JSON.

Redis hashes are better when you need to access single hash keys, but that's not an issue for a job queue because it always reads and writes the entire job config. In fact it might even be faster to use JSON serialization.

As an example, a serialized mosquito task in JSON could look like this:

{
  "enqueue_time": "...",
  "type": "...",
  "retry_count": "...",
  "job_data": {
    // serialized Job instance
  }
}

You wouldn't have to worry about how jobs are serialized. The Job type just needs to be able to serialize to JSON, which is usually trivial to accomplish with stdlib's tools.

Some examples are the jobs for shardsdb. They have a slim layer of indirection, because I was experimenting with different job processors and didn't want to adapt the individual jobs for each API, so I put taskmaster in between. But for the job definition this is completely irrelevant. Taskmaster::Job only includes JSON::Serializable. And that makes it really dead simple: you don't need to specify anything in a job implementation. No param declarations. It's just a plain Crystal type with an initializer receiving some ivars and automatic JSON serialization. ImportShard uses an ivar with a custom data type, but doesn't require any special serialization. It just works.
It's also very comfortable to write specs for these jobs: I'm simply using a mock adapter which collects all created jobs and can easily validate the job data (example). I don't need a redis server for this, and I don't even need to deserialize the jobs; JSON is just universally usable.
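To sketch the spec setup (heavily simplified, not the actual shardsdb code):

require "json"
require "spec"

# Simplified stand-in for a job; it only needs JSON::Serializable.
struct ImportShard
  include JSON::Serializable

  getter name : String

  def initialize(@name : String)
  end
end

# Mock adapter: instead of talking to redis it just collects serialized jobs.
class MockAdapter
  getter jobs = [] of String

  def enqueue(job)
    jobs << job.to_json
  end
end

describe MockAdapter do
  it "records enqueued jobs" do
    adapter = MockAdapter.new
    adapter.enqueue(ImportShard.new("kemal"))

    adapter.jobs.size.should eq 1
    adapter.jobs.first.should contain %("name":"kemal")
  end
end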

robacarp (Collaborator) commented Mar 7, 2019

I think storing the job metadata as a JSON hash is a nonstarter... functions like hincrby are atomic when you use redis hashes, and wildly prone to distributed race conditions when you have to read, increment, and write the whole object. Mosquito's viability as a distributed megaprocessor is still far from realization, but this would make it difficult to accomplish. For example, imagine sending 10k emails across a dozen servers while being forced to use a 3rd party API with a hard rate limit. Rate limiting is on my mind because it's the most recent feature addition, but the ability to ask redis to atomically alter a member of a hash is not worth compromising.
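To make the concern concrete, here's a rough sketch using the crystal-redis shard (key and field names are invented):

require "json"
require "redis"

redis = Redis.new

# Redis hash: the increment is a single atomic command,
# safe with any number of concurrent workers.
redis.hincrby("mosquito:task:123", "retry_count", 1)

# JSON blob: every worker has to read, modify, and rewrite the whole
# object -- two workers interleaving here can lose an increment.
raw = redis.get("mosquito:task:456") || "{}"
task = JSON.parse(raw).as_h
count = task["retry_count"]?.try(&.as_i64) || 0_i64
task["retry_count"] = JSON::Any.new(count + 1)
redis.set("mosquito:task:456", task.to_json)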

Regardless, for the purposes of any user of the toolchain, the job metadata storage mechanism should be irrelevant anyway. Dividing the two into separate places in redis is a prerequisite for that.

I agree that the task parameters should be more flexible and that serialization code is outside the bounds of mosquito. I believe I wandered a bit far during the implementation. Honestly, my immediate temptation is to just punt on serialization altogether and only accept string parameters, but some experimentation with the newer JSON stuff is worth a shot.

straight-shoota (Author) commented

Sure, job config and task metadata are independent of each other. Task metadata can be stored as a redis hash, and one such metadata value could simply be a JSON string containing the job config.
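Something along these lines (a rough sketch against the crystal-redis shard, with invented key and field names):

require "json"
require "redis"

# Hypothetical job config that serializes itself.
struct ExampleJob
  include JSON::Serializable

  getter user_id : Int64

  def initialize(@user_id : Int64)
  end
end

redis = Redis.new
key = "mosquito:task:123"

# Metadata stays in a plain redis hash, so atomic operations keep working...
redis.hset(key, "enqueue_time", Time.utc.to_unix.to_s)
redis.hincrby(key, "retry_count", 1)

# ...while the job config is a single JSON-encoded field.
redis.hset(key, "job_config", ExampleJob.new(42_i64).to_json)
job = ExampleJob.from_json(redis.hget(key, "job_config").to_s)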

As a side note, I don't really follow your argument for using redis hashes. Maybe it's because I don't know your future plans. But Sidekiq uses JSON serialization, and that seems to work pretty well with thousands of job executions per second.

But whatever, metadata storage is an implementation detail (and easy to change in either direction, should the need arise). I care mostly about making job serialization easy for the user.
