Conversation
This adds a new client option to enable logging of all import requests. It gob encodes the index, path, shard, and request body to a file for all requests. The index and shard are needed so that the code replaying the requests can send them to a cluster of a different size and still figure out which nodes a request needs to go to. I just realized that we're currently writing out every request to every node, but when we read back in, we're sending each request we read in to every node that should receive it. This is a bit of a problem if replication is > 1.
I still believe the best approach to recording requests is doing it externally, but we can add "hooks" to the library which call user code on some event, like a request that is about to happen. The user would be responsible for handling, e.g., the request data it receives: they can save it to a file or post it somewhere else. That would allow us to write a utility to handle the recording/replaying logic without needing to update the library.
unfortunately, due to the design of our API, we need information which isn't included in the request parameters, but only in the payload. A proxy cannot easily decode the payload because (a) it needs access to the proto definitions, which is kind of annoying, and (b) our import endpoint accepts two different types of protobuf message, and Pilosa only knows how to decode them because it can look up the type of the field being imported into. To replay the requests into a different cluster, we need to know the index and shard for the request so we can look up which node(s) it will actually be sent to. So basically, we could do this stuff from a proxy, but it'd be pretty annoying, and we'd need to write a separate replay program that duplicated a bunch of code which is already in go-pilosa. Also, the point of this is for users to be able to record their import workloads so that they can send them to us and we can recreate them to reproduce bugs or profile things. I'd much rather tell someone to enable a client option than to start another service and configure it as a proxy between their import and Pilosa.
We would have the same number of proxies as the number of Pilosa nodes. So e.g., the client would post some data to node A (depending on the shard), which would be intercepted by proxy A, saved, and forwarded to the real Pilosa node A. Neither the Pilosa nodes nor the client library would have to know there is a proxy in the middle. User code doesn't have to be modified. Also, this is client-library agnostic, so we won't have to update other client libraries, and it would help save requests for everything, not just imports. The replay program would just receive a list of …

Anyway, since we don't want to go with the proxy approach, I think at the least we should make it customizable in the library. Since we don't know what environment the client code runs in (e.g., it could very well be a serverless function), the user should be able to choose where and how to save/post the request data. A possible way of defining hooks could be:

```go
func onHttpRequest(url string, headers []Header, data []byte) {
	// save/post the url, data, etc...
}

client, _ := pilosa.NewClient(..., OptEventHandler("http-request", onHttpRequest))
```

We could supply predefined HTTP event handlers in a separate package, something like:

```go
client, _ := pilosa.NewClient(..., OptEventHandler("http-request", events.DefaultHttpRequestHandler))
```

The replay logic seems to be entirely movable to a separate utility. Currently the library has the …
previously, we were creating the protobuf messages for imports and then marshalling them to bytes on a node-by-node basis. The data, however, was the same for each node, so in the case of replication > 1 this was duplicating a lot of work. By doing it ahead of time, we can improve performance, and (the original motivation) we can log the import just once instead of once per node.
Pilosa's API has some idiosyncrasies such that import-roaring will forward the request on to all shard owners while import won't.
more flexible, and we end up with less code
OK, I've made the import logger take an io.Writer instead of creating a file, so that should make it pretty flexible. I don't want to plug in at the HTTP level, because then the cluster we're replaying into would need to be the same size and have the same config as the one we're recording. Recording the requests at that level doesn't give us enough info to know where those requests need to be sent in a different cluster with (potentially) a different number of nodes, or a different replication setting. The replay logic can't easily be moved to a separate utility because it needs to be able to use fragmentNodes.
@yuce I've unexported or marked as experimental all the new API in this PR
@jaffee Thanks! Is the PR good to review? |
yep sorry—removed WIP |
Just 2 issues. Also the coverage dropped a bit. If it isn't too hard to add tests to cover those paths, it would be great to do so.
I'll think about how best to address that, but this is the basic idea.