add import logging functionality #225

Merged 6 commits into FeatureBaseDB:master on Apr 5, 2019

Conversation

@jaffee (Member) commented Apr 3, 2019

This adds a new client option to enable logging of all import requests. It gob
encodes the index, path, shard, and request body to a file for all requests. The
index and shard are needed so that the code replaying the requests can send them
to a cluster of a different size and still figure out which nodes a request
needs to go to.

I just realized that we're currently writing out every request to every node,
but when we read back in, we're sending each request we read in to every node
that should receive it. This is a bit of a problem if replication is > 1. I'll think about how best to address that, but this is the basic idea.
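
For illustration, here is a minimal sketch of the logged record described above; the type and function names are my assumptions, not the PR's actual identifiers:

import (
    "encoding/gob"
    "io"
)

// importRecord is a hypothetical stand-in for one logged request: index
// and shard make replay routing possible, while path and body reproduce
// the original HTTP request.
type importRecord struct {
    Index string
    Path  string
    Shard uint64
    Body  []byte
}

// importLogger gob-encodes records onto a single underlying stream.
type importLogger struct {
    enc *gob.Encoder
}

func newImportLogger(w io.Writer) *importLogger {
    return &importLogger{enc: gob.NewEncoder(w)}
}

// logRequest appends one record to the log.
func (l *importLogger) logRequest(rec importRecord) error {
    return l.enc.Encode(rec)
}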

jaffee requested a review from yuce on April 3, 2019 at 17:27
@yuce (Contributor) commented Apr 4, 2019

I still believe the best approach to recording requests is to do it externally, but we could add "hooks" to the library which call user code on certain events, e.g. when a request is about to happen. The user would be responsible for dealing with the request data they receive; they could save it to a file or post it somewhere else. That would allow us to write a utility that handles the recording/replaying logic without needing to update the library.

@jaffee (Member, Author) commented Apr 4, 2019

Unfortunately, due to the design of our API, we need information which isn't included in the request parameters, only in the payload. A proxy can't easily decode the payload because (a) it needs access to the proto definitions, which is kind of annoying, and (b) our import endpoint accepts two different types of protobuf message, and Pilosa only knows how to decode them because it can look up the type of the field being imported into.

To replay the requests into a different cluster, we need to know the index and shard for the request so we can look up which node(s) it will actually be sent to.

So basically, we could do this stuff from a proxy, but it'd be pretty annoying, and we'd need to write a separate replay program that duplicates a bunch of code already in go-pilosa.

Also, the point of this is for users to be able to record their import workloads so that they can send them to us and we can recreate them and reproduce bugs or profile things. I'd much rather tell someone to enable a client option than to start another service and configure it as a proxy between their import and Pilosa.

@yuce (Contributor) commented Apr 4, 2019

We would have the same number of proxies as Pilosa nodes. So, e.g., the client would post some data to node A (depending on the shard), which would be intercepted by proxy A, saved, and forwarded to the real Pilosa node A. Neither the Pilosa nodes nor the client library would have to know there is a proxy in the middle, and user code wouldn't have to be modified. This approach is also client-library agnostic, so we wouldn't have to update the other client libraries, and it would save requests for everything, not just imports. The replay program would just receive a list of (URL, headers, request data) tuples and post them to Pilosa.
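
A hedged sketch of what that saved record might look like; the type is mine, not an existing API:

import "net/http"

// recordedRequest is a hypothetical record for the proxy approach: the
// intercepted request is saved verbatim and replayed as-is, so the
// proxy needs no knowledge of Pilosa's payload format or shard routing.
type recordedRequest struct {
    URL     string
    Headers http.Header
    Body    []byte
}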

Anyway, since we don't want to go with the proxy approach, I think at the least we should make this customizable in the library. Since we don't know what environment the client code runs in (e.g., it could very well be a serverless function), the user should be able to choose where and how to save/post the request data.

A possible way of defining hooks could be:

func onHttpRequest(url string, headers []Header, data []byte) {
    // save/post the url, data, etc...
}

client, _ := pilosa.NewClient(..., OptEventHandler("http-request", onHttpRequest))

We could supply predefined HTTP event handlers in a separate package, something like:

client, _ := pilosa.NewClient(..., OptEventHandler("http-request", events.DefaultHttpRequestHandler))

The replay logic seems entirely movable to a separate utility. The library currently has the HttpRequest function, which can be used to post the replay data. It doesn't allow choosing a specific host, but it could be made to do so (one client per node plus a manual server address), or we could add a similar function which makes the request to a specified host.

jaffee added commits:

- Previously, we were creating the protobuf messages for imports and then marshalling them to bytes on a node-by-node basis. The data, however, was the same for each node, so with replication > 1 this duplicated a lot of work. By marshalling ahead of time, we improve performance and (the original motivation) can log the import just once instead of once per node. Pilosa's API has some idiosyncrasies: import-roaring will forward the request on to all shard owners, while import won't.
- More flexible, and we end up with less code.
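
A minimal sketch of the marshal-once pattern that commit describes; the helper signatures are assumptions:

// sendToReplicas marshals the import payload once and reuses the bytes
// for every replica node, instead of re-marshalling per node.
func sendToReplicas(nodes []string, marshal func() ([]byte, error),
    post func(addr string, body []byte) error) error {

    body, err := marshal() // one marshal, shared by all replicas
    if err != nil {
        return err
    }
    for _, addr := range nodes {
        if err := post(addr, body); err != nil {
            return err
        }
    }
    return nil
}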
@jaffee (Member, Author) commented Apr 4, 2019

OK, I've made the import logger take an io.Writer instead of creating a file, so that should make it pretty flexible.
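
For example, callers could then direct the log anywhere an io.Writer can go. A sketch, assuming the usual os, log, and go-pilosa imports; OptImportLogWriter is a hypothetical name standing in for the option this PR adds:

f, err := os.Create("imports.gob")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

client, err := pilosa.NewClient("localhost:10101",
    pilosa.OptImportLogWriter(f)) // hypothetical option name
if err != nil {
    log.Fatal(err)
}
// ... run imports as usual; each request is now logged to f.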

I don't want to plug in at the HTTP level, because then the cluster we're replaying into would need to be the same size and have the same config as the one we recorded from. Recording the requests at that level doesn't give us enough info to know where they need to be sent in a different cluster with (potentially) a different number of nodes or a different replication setting.

The replay logic can't easily be moved to a separate utility because it needs to be able to use fragmentNodes.
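
A sketch of why replay is coupled to the library: each logged record must be re-routed through the shard-to-node mapping of the target cluster. Here nodeLookup is a hypothetical stand-in for fragmentNodes, and importRecord is as in the earlier sketch:

import (
    "encoding/gob"
    "io"
)

type importRecord struct {
    Index string
    Path  string
    Shard uint64
    Body  []byte
}

// nodeLookup stands in for go-pilosa's internal fragmentNodes: it maps
// (index, shard) to the addresses of the nodes owning that fragment in
// the target cluster.
type nodeLookup func(index string, shard uint64) ([]string, error)

// replay decodes logged records and posts each one to every node that
// owns its fragment in the target cluster.
func replay(r io.Reader, lookup nodeLookup,
    post func(addr, path string, body []byte) error) error {

    dec := gob.NewDecoder(r)
    for {
        var rec importRecord
        if err := dec.Decode(&rec); err == io.EOF {
            return nil // end of log
        } else if err != nil {
            return err
        }
        addrs, err := lookup(rec.Index, rec.Shard)
        if err != nil {
            return err
        }
        for _, addr := range addrs {
            if err := post(addr, rec.Path, rec.Body); err != nil {
                return err
            }
        }
    }
}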

@jaffee (Member, Author) commented Apr 5, 2019

@yuce I've unexported, or marked as experimental, all the new API in this PR.

@yuce (Contributor) commented Apr 5, 2019

@jaffee Thanks! Is the PR good to review?

jaffee changed the title from "WIP: add import logging functionality" to "add import logging functionality" on Apr 5, 2019
@jaffee (Member, Author) commented Apr 5, 2019

yep sorry—removed WIP

@yuce (Contributor) left a review comment

Just two issues. Also, the coverage dropped a bit; if it isn't too hard to add tests covering those paths, it would be great to do so.

Two review threads on client.go (outdated, resolved)
jaffee merged commit c27ef05 into FeatureBaseDB:master on Apr 5, 2019