add import logging functionality #225

Merged 6 commits into FeatureBaseDB:master on Apr 5, 2019

Conversation

@jaffee (Member) commented Apr 3, 2019

This adds a new client option to enable logging of all import requests. It gob
encodes the index, path, shard, and request body to a file for all requests. The
index and shard are needed so that the code replaying the requests can send them
to a cluster of a different size and still figure out which nodes a request
needs to go to.

I just realized that we're currently writing out every request to every node,
but when we read back in, we're sending each request we read in to every node
that should receive it. This is a bit of a problem if replication is > 1. I'll think about how best to address that, but this is the basic idea.
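
For illustration, here is a minimal sketch of the logged record described above; the type and function names are my assumptions, not the PR's actual identifiers:

import (
    "encoding/gob"
    "io"
)

// importRecord is a hypothetical stand-in for one logged request: index
// and shard make replay routing possible, while path and body reproduce
// the original HTTP request.
type importRecord struct {
    Index string
    Path  string
    Shard uint64
    Body  []byte
}

// importLogger gob-encodes records onto a single underlying stream.
type importLogger struct {
    enc *gob.Encoder
}

func newImportLogger(w io.Writer) *importLogger {
    return &importLogger{enc: gob.NewEncoder(w)}
}

// logRequest appends one record to the log.
func (l *importLogger) logRequest(rec importRecord) error {
    return l.enc.Encode(rec)
}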

jaffee requested a review from yuce on April 3, 2019 at 17:27
@yuce (Contributor) commented Apr 4, 2019

I still believe the best approach to recording requests is to do it externally, but we could add "hooks" to the library which call user code on certain events, e.g. when a request is about to happen. The user would be responsible for dealing with the request data they receive; they could save it to a file or post it somewhere else. That would allow us to write a utility that handles the recording/replaying logic without needing to update the library.

@jaffee (Member, Author) commented Apr 4, 2019

Unfortunately, due to the design of our API, we need information which isn't included in the request parameters, only in the payload. A proxy can't easily decode the payload because (a) it needs access to the proto definitions, which is kind of annoying, and (b) our import endpoint accepts two different types of protobuf message, and Pilosa only knows how to decode them because it can look up the type of the field being imported into.

To replay the requests into a different cluster, we need to know the index and shard for the request so we can look up which node(s) it will actually be sent to.

So basically, we could do this stuff from a proxy, but it'd be pretty annoying, and we'd need to write a separate replay program that duplicates a bunch of code already in go-pilosa.

Also, the point of this is for users to be able to record their import workloads so that they can send them to us and we can recreate them and reproduce bugs or profile things. I'd much rather tell someone to enable a client option than to start another service and configure it as a proxy between their import and Pilosa.

@yuce (Contributor) commented Apr 4, 2019

We would have the same number of proxies as Pilosa nodes. So, e.g., the client would post some data to node A (depending on the shard), which would be intercepted by proxy A, saved, and forwarded to the real Pilosa node A. Neither the Pilosa nodes nor the client library would have to know there is a proxy in the middle, and user code wouldn't have to be modified. This approach is also client-library agnostic, so we wouldn't have to update the other client libraries, and it would save requests for everything, not just imports. The replay program would just receive a list of (URL, headers, request data) tuples and post them to Pilosa.
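
A hedged sketch of what that saved record might look like; the type is mine, not an existing API:

import "net/http"

// recordedRequest is a hypothetical record for the proxy approach: the
// intercepted request is saved verbatim and replayed as-is, so the
// proxy needs no knowledge of Pilosa's payload format or shard routing.
type recordedRequest struct {
    URL     string
    Headers http.Header
    Body    []byte
}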

Anyway, since we don't want to go with the proxy approach, I think at the least we should make this customizable in the library. Since we don't know what environment the client code runs in (e.g., it could very well be a serverless function), the user should be able to choose where and how to save/post the request data.

A possible way of defining hooks could be:

func onHttpRequest(url string, headers []Header, data []byte) {
    // save/post the url, data, etc...
}

client, _ := pilosa.NewClient(..., OptEventHandler("http-request", onHttpRequest))

We could supply predefined HTTP event handlers in a separate package, something like:

client, _ := pilosa.NewClient(..., OptEventHandler("http-request", events.DefaultHttpRequestHandler))

The replay logic seems entirely movable to a separate utility. The library currently has the HttpRequest function, which can be used to post the replay data. It doesn't allow choosing a specific host, but it could be made to do so (one client per node plus a manual server address), or we could add a similar function which makes the request to a specified host.

jaffee added commits:

- Previously, we were creating the protobuf messages for imports and then marshalling them to bytes on a node-by-node basis. The data, however, was the same for each node, so with replication > 1 this duplicated a lot of work. By marshalling ahead of time, we improve performance and (the original motivation) can log the import just once instead of once per node. Pilosa's API has some idiosyncrasies: import-roaring will forward the request on to all shard owners, while import won't.
- More flexible, and we end up with less code.
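
A minimal sketch of the marshal-once pattern that commit describes; the helper signatures are assumptions:

// sendToReplicas marshals the import payload once and reuses the bytes
// for every replica node, instead of re-marshalling per node.
func sendToReplicas(nodes []string, marshal func() ([]byte, error),
    post func(addr string, body []byte) error) error {

    body, err := marshal() // one marshal, shared by all replicas
    if err != nil {
        return err
    }
    for _, addr := range nodes {
        if err := post(addr, body); err != nil {
            return err
        }
    }
    return nil
}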
@jaffee (Member, Author) commented Apr 4, 2019

OK, I've made the import logger take an io.Writer instead of creating a file, so that should make it pretty flexible.
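
For example, callers could then direct the log anywhere an io.Writer can go. A sketch, assuming the usual os, log, and go-pilosa imports; OptImportLogWriter is a hypothetical name standing in for the option this PR adds:

f, err := os.Create("imports.gob")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

client, err := pilosa.NewClient("localhost:10101",
    pilosa.OptImportLogWriter(f)) // hypothetical option name
if err != nil {
    log.Fatal(err)
}
// ... run imports as usual; each request is now logged to f.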

I don't want to plug in at the HTTP level, because then the cluster we're replaying into would need to be the same size and have the same config as the one we recorded from. Recording the requests at that level doesn't give us enough info to know where they need to be sent in a different cluster with (potentially) a different number of nodes or a different replication setting.

The replay logic can't easily be moved to a separate utility because it needs to be able to use fragmentNodes.
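
A sketch of why replay is coupled to the library: each logged record must be re-routed through the shard-to-node mapping of the target cluster. Here nodeLookup is a hypothetical stand-in for fragmentNodes, and importRecord is as in the earlier sketch:

import (
    "encoding/gob"
    "io"
)

type importRecord struct {
    Index string
    Path  string
    Shard uint64
    Body  []byte
}

// nodeLookup stands in for go-pilosa's internal fragmentNodes: it maps
// (index, shard) to the addresses of the nodes owning that fragment in
// the target cluster.
type nodeLookup func(index string, shard uint64) ([]string, error)

// replay decodes logged records and posts each one to every node that
// owns its fragment in the target cluster.
func replay(r io.Reader, lookup nodeLookup,
    post func(addr, path string, body []byte) error) error {

    dec := gob.NewDecoder(r)
    for {
        var rec importRecord
        if err := dec.Decode(&rec); err == io.EOF {
            return nil // end of log
        } else if err != nil {
            return err
        }
        addrs, err := lookup(rec.Index, rec.Shard)
        if err != nil {
            return err
        }
        for _, addr := range addrs {
            if err := post(addr, rec.Path, rec.Body); err != nil {
                return err
            }
        }
    }
}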

@jaffee (Member, Author) commented Apr 5, 2019

@yuce I've unexported, or marked as experimental, all the new API in this PR.

@yuce (Contributor) commented Apr 5, 2019

@jaffee Thanks! Is the PR good to review?

jaffee changed the title from "WIP: add import logging functionality" to "add import logging functionality" on Apr 5, 2019
@jaffee (Member, Author) commented Apr 5, 2019

yep sorry—removed WIP

@yuce (Contributor) left a review comment

Just two issues. Also, the coverage dropped a bit; if it isn't too hard to add tests covering those paths, it would be great to do so.

Two review threads on client.go (outdated, resolved)
jaffee merged commit c27ef05 into FeatureBaseDB:master on Apr 5, 2019