Serialization format #3075

cortesi opened this Issue Apr 24, 2018 · 11 comments

cortesi commented Apr 24, 2018

This is a discussion ticket for the next steps in our serialization format that will be worked on by @madt1m in the coming months. This is part of the GSoC project, but anyone should feel free to contribute to the discussion.

Aim

Our current serialization mechanism is substandard in a number of ways:

  • It's a flat, append-only sequence, which means that it doesn't handle streams (HTTP2/websockets/TCP), and we can't (easily) build indexes over it. This is why we currently aggregate all flows in memory for our interactive tools, which is obviously not ideal.
  • Our serialisation mechanism is spread throughout our core objects in bits and pieces. It's very hard to know exactly what's included and what the structure is at a glance.
  • We use our own netstrings-like serialization format (tnetstrings). It's served us well, but having a totally custom format is not ideal for interoperability.
  • We want to be able to annotate flows, modify them in situ and perform other operations that mutate flow state. All of this is impossible or very hard with an append-only format.

This ticket roughly outlines an approach that fixes all of these issues. It's a bit of a strawman, hoping to provoke others into filling things out more completely.

Outline

I propose that we shift the on-disk format to SQLite. SQLite is built in to Python, extremely robust, very fast and operates on all the platforms we care about. I can't think of a better choice for the mechanics of putting data on disk.

Now, we need to consider what the database format for flows would be. If we're using SQLite, one immediate thought might be to decompose our core data types entirely, and store flows in normalised tables with columns for each constituent value. I feel this would be slow, complicated and error-prone. Instead, I propose that we treat SQLite as an indexable key-value store, where flows and flow components are indexed by ID and stored as blobs. For the format of these blobs, I propose protobufs, which gives us fast (?), well-defined serialization and automatic interoperability with a huge range of other tools and languages.

Below, I include some entirely untested, utterly unreliable notes on what the storage format might look like - please don't trust the details. Consider a core storage table like this:

CREATE TABLE flows (
    id integer PRIMARY KEY, -- row id, generated by sqlite
    mid text NOT NULL, -- message id, generated by mitmproxy
    kind text NOT NULL,
    data BLOB NOT NULL
);

CREATE INDEX message_id ON flows (mid, kind); -- maybe - we'll have to play with this

Let's suppose that we have a client connection protobuf like this:

message ClientConnect {
  string address = 1;
  ... etc
}

And a request protobuf like this:

message HTTPRequest {
  string client_connection_id = 1;
  string method = 2;
  string scheme = 3;
  ... etc
}

The lifecycle of an HTTP flow progresses (roughly, ignoring details like header events) through client_connect -> request -> server_connect -> response -> server_disconnect -> client_disconnect. When we receive the client_connect event, we serialize it and make an entry with (mid = message ID, kind = "clientconnect", data = protobuf serialized data). When we receive the request, we make a request entry with the client connection ID set to the appropriate value, and so on. In this way, we can stream data into the database, keeping the minimum needed in memory. For things like websockets, where there can be an arbitrary number of message events, each message is simply added to the database on the fly. When a flow is requested from the store, we can now reassemble it by selecting and deserializing the matching components from the database.
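
To make the streaming idea concrete, here is a minimal, untested sketch of the write/read path, using plain sqlite3 and treating the serialized protobuf as opaque bytes. The store_event/load_flow names and the uuid-based message IDs are purely illustrative assumptions, not a proposed API:

import sqlite3
import uuid

def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS flows (
            id INTEGER PRIMARY KEY,   -- row id, generated by sqlite
            mid TEXT NOT NULL,        -- message id, generated by mitmproxy
            kind TEXT NOT NULL,
            data BLOB NOT NULL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS message_id ON flows (mid, kind)")
    return conn

def store_event(conn: sqlite3.Connection, mid: str, kind: str, blob: bytes) -> None:
    # One row per lifecycle event; nothing needs to be held back in memory.
    conn.execute("INSERT INTO flows (mid, kind, data) VALUES (?, ?, ?)", (mid, kind, blob))
    conn.commit()

def load_flow(conn: sqlite3.Connection, mid: str) -> list:
    # Reassemble a flow by selecting all of its components in insertion order.
    return conn.execute(
        "SELECT kind, data FROM flows WHERE mid = ? ORDER BY id", (mid,)
    ).fetchall()

conn = open_store(":memory:")
mid = str(uuid.uuid4())
store_event(conn, mid, "clientconnect", b"<serialized ClientConnect>")
store_event(conn, mid, "request", b"<serialized HTTPRequest>")
print(load_flow(conn, mid))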

There are a few embellishments to add here:

  • We would like to add an annotations mechanism that allows addons and our own tools to add custom data to flows. This might be a column in the main table, or a separate table.
  • We let users specify a flow order for interactive tools (currently method, size, time or url). We'll have to handle these either by maintaining an order index in memory, or by adding matching order columns to the database.

The View

I've been working towards this change for a long time, and it's not a coincidence that the API that the persistent store needs to implement is already mostly encapsulated in the view addon. There's some more work to be done to ensure that the addon is completely encapsulated (mitmproxy console currently uses some parts of the view directly, not through commands), but after that, the task of actually bolting the new persistence layer into mitmproxy will consist of implementing the view API, and then swapping out the current in-memory view for an on-disk view.

Rough implementation plan

  • Performance testing The pure-Python protobuf implementation has performance issues, and we can't use the compiled C++ implementation without immense headache. We should implement some fraction of the outline above for a single message - say HTTP responses. We should then test that the protobuf serialization speed for large messages is adequate, and that insertion of large blobs into the database is fast enough (see the benchmark sketch after this list). We'd like to make sure that we're speedy for messages up to at least a few megabytes in size. Ideally, the rate at which we can serialise and store messages should not be significantly less than our core performance of about 80 flows per second.
  • Serialization addon The next step is to create an addon that hooks into the lifecycle event mechanisms, and performs serialization to disk on the fly. The addon should also have a mechanism that lets us load a list of flows into the state and retrieve them again. This will let us test two-way compatibility between the new mechanism and the status quo.
  • View addon After that, we can clean up the current View API to make sure it's encapsulated, and implement it on top of the persistent state database.
  • Deploy Finally, we'll remove the current in-memory view and replace it with the on-disk view. There are many details to consider here, including automatic conversion of old data formats to the new format.
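
As a starting point for the performance testing step, a rough benchmark harness might look like the following. This is entirely illustrative: the benchmark function and the stand-in payloads are assumptions, and serialize could be the generated protobuf message's SerializeToString or tnetstring.dumps, so both can be compared on the same data.

import sqlite3
import time

def benchmark(serialize, payloads, db_path=":memory:"):
    # Time serialization and insertion separately, for payloads of a few MB each.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flows "
        "(id INTEGER PRIMARY KEY, mid TEXT, kind TEXT, data BLOB)"
    )

    start = time.perf_counter()
    blobs = [serialize(p) for p in payloads]
    serialize_secs = time.perf_counter() - start

    start = time.perf_counter()
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO flows (mid, kind, data) VALUES (?, ?, ?)",
            [(str(i), "httpresponse", b) for i, b in enumerate(blobs)],
        )
    insert_secs = time.perf_counter() - start

    n = len(payloads)
    # Target: not significantly below our ~80 flows/s core performance.
    print(f"serialize: {n / serialize_secs:.1f} msg/s, insert: {n / insert_secs:.1f} msg/s")

benchmark(bytes, [b"x" * 2_000_000 for _ in range(50)])  # stand-in serializer and payloads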

Longer term

There are many repercussions of the new storage format that will need to be explored. For instance, it also gives us an easy way to add a persistent session to our tools. This might store a copy of changed options on exit, so that a session with all its settings intact can be resumed fluently.

madt1m commented Apr 24, 2018

I would say that the major performance hits here will come from:

  • Multiple disk accesses for many small messages (website spidering?)
  • Serialization of large messages

Therefore, I plan to test both behaviours with the pure-Python protobuf implementation. The idea is to quickly develop an addon to catch events, and separately measure serialization/deserialization and INSERT/SELECT execution times.

I am also wondering whether performance would improve if we moved towards a buffered, async write - keeping up to a given number of flows in memory, and writing them out afterwards. I guess this wouldn't affect serialization performance, but it could reduce execution time for writes to the db. This is just guessing, though; I would actually prefer keeping the mechanism on-the-fly.
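
For reference, a buffered variant could be as small as this sketch (the BufferedWriter name, the batch size and the flush-on-threshold policy are arbitrary assumptions, and it expects the flows table from the proposal above):

import sqlite3

class BufferedWriter:
    # Keep up to `limit` serialized events in memory and flush them in one transaction.
    def __init__(self, conn: sqlite3.Connection, limit: int = 100):
        self.conn = conn
        self.limit = limit
        self.buffer = []  # list of (mid, kind, blob) tuples

    def add(self, mid: str, kind: str, blob: bytes) -> None:
        self.buffer.append((mid, kind, blob))
        if len(self.buffer) >= self.limit:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        with self.conn:  # a single transaction per batch amortizes the fsync cost
            self.conn.executemany(
                "INSERT INTO flows (mid, kind, data) VALUES (?, ?, ?)", self.buffer
            )
        self.buffer.clear()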

About ordering: isn't that some sort of hard-coded ordering? I mean, since we're not yet reading the blobs, we need to define the ordering attributes explicitly, and add some data to keep track of the ordering. Moving the ordering process to the controller (view?) would make it more modular, and easier to adapt when the ordering attributes/logic change. Let me know what you think about this.

We could expose an interface to the serialization process. Adding annotations (custom data type) through a set of specific commands invoked in custom addons (or script) seems to me the way to go.

mhils commented Apr 24, 2018

We would like to add an annotations mechanism that allows addons and our own tools to add custom data to flows. This might be a column in the main table, or a separate table.

Maybe we can get away with just kind=set_annotation entries. :-)
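
A hypothetical sketch of that, reusing the flows table from the proposal above. The annotation payload is JSON purely for illustration (it could just as well be another protobuf), and annotate/annotations are made-up names:

import json
import sqlite3

conn = sqlite3.connect("flows.db")  # assumes the flows table from above exists

def annotate(mid: str, name: str, value) -> None:
    # An annotation is just one more row sharing the flow's mid.
    payload = json.dumps({"name": name, "value": value}).encode()
    with conn:
        conn.execute(
            "INSERT INTO flows (mid, kind, data) VALUES (?, 'set_annotation', ?)",
            (mid, payload),
        )

def annotations(mid: str) -> list:
    rows = conn.execute(
        "SELECT data FROM flows WHERE mid = ? AND kind = 'set_annotation'", (mid,)
    ).fetchall()
    return [json.loads(data) for (data,) in rows]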

Therefore, I plan to test both behaviours with the native python compiler for protobufs. The idea is to quickly develop an addon to catch events, and test separately serialization/deserialization, INSERT/SELECT execution time.

This sounds good to me. I think it would be quite useful to also compare against "tnetstrings in sqlite" and see how that looks in comparison. So we can maybe compare to_proto_string(client_conn.get_state()) with tnetstring.dumps(client_conn.get_state()).

I also took a quick look at sqlite with JSON1 extension, but extensions are apparently not supported on e.g. macOS without major tricks.

cortesi commented Apr 24, 2018

Here's a library that crossed my radar the other day, which might be useful, since we now have an async core:

https://github.com/jreese/aiosqlite

I haven't dug into it to see if it's completely sane, but the general idea of making DB writes/reads async might be useful.
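
For what it's worth, an async write path with aiosqlite would look roughly like this (untested sketch, assuming the flows table from the proposal above):

import aiosqlite

async def store_event(path: str, mid: str, kind: str, blob: bytes) -> None:
    # aiosqlite runs the actual sqlite calls in a worker thread, so the event loop isn't blocked.
    async with aiosqlite.connect(path) as db:
        await db.execute(
            "INSERT INTO flows (mid, kind, data) VALUES (?, ?, ?)", (mid, kind, blob)
        )
        await db.commit()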

cortesi commented Apr 24, 2018

@madt1m Regarding ordering, let's ponder it. So imagine we have a save file with 100k flows, and we fire up mitmproxy console on it. The console app now requests the first 30 flows under a particular ordering for display. Here are the options as I see them:

  • Have the orderings stored in the database with indexes over them, and use select with order by and limit/offset to retrieve the data. This is simple and should be performant.
  • Build an ordering index in-memory. On startup, we iterate over all flows to build all the ordering indexes. Whenever a flow is added, we make the appropriate modification to the in-memory indexes. When N records are requested, we look up the ID in the ordering index, and select by ID from the database.

I think both of these work. The advantage of the second is that it doesn't touch the database and we don't need to make schema modifications to support different indexes. The advantage of the first is that it's much simpler and has fewer moving parts. I'm divided, but on reflection lean a bit toward the second.
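
A sketch of what the second option might look like, with one index per ordering (the OrderingIndex name, key_func and the paging interface are just illustrative):

import bisect

class OrderingIndex:
    # In-memory index for one ordering (e.g. method, size, time or url).
    def __init__(self, key_func):
        self.key_func = key_func   # e.g. lambda flow: flow.request.timestamp_start
        self._entries = []         # kept sorted as (key, mid) pairs

    def add(self, flow, mid: str) -> None:
        bisect.insort(self._entries, (self.key_func(flow), mid))

    def page(self, offset: int, count: int) -> list:
        # Return the mids for rows [offset, offset + count); the caller then
        # selects just those rows from sqlite and deserializes only what's displayed.
        return [mid for _, mid in self._entries[offset:offset + count]]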

cortesi commented Apr 24, 2018

I'll also say a bit about what kinds of annotations we want to support, because there are a few different types, and it might be worth considering their structure up front.

  • We currently support flow marking as a built-in. I feel this should just be an annotation. However, this is an annotation of a specific kind - a tag. We want to be able to select all flows with a given tag, and operate on them en masse. We also want to be able to treat all flows with a specific tag as a view (as we do when limiting a view to marked), and iterate/page through it.
  • I would like to also support letting addons attach arbitrary data to flows in an annotation. The data is opaque to us, and is meaningful only to the associated addon. This will let us do things like allow vulnerability scanners to annotate flows, and then use the annotations to generate reports, etc.
  • Finally, I would really love to allow users to add notes and other text annotations to flows. I really want this for myself. This would be a built-in "notes" addon, which would then be invoked as commands from our interactive tools.

Kriechi commented Apr 25, 2018

How would we handle future schema updates?
With the current proposal we have two things to worry about: sqlite table layout, and protobuf interface descriptions.

With the current file format we make use of step-by-step converter functions to migrate the data structures - which I think is a nice feature that we want to keep!

madt1m commented Apr 25, 2018

Build an ordering index in-memory. On startup, we iterate over all flows to build all the ordering indexes. Whenever a flow is added, we make the appropriate modification to the in-memory indexes. When N records are requested, we look up the ID in the ordering index, and select by ID from the database.

I agree with this. I suppose we want to leave the majority of the deserialization logic out of the DB structures.

How would we handle future schema updates?
With the current proposal we have two things to worry about: sqlite table layout, and protobuf interface descriptions.

All updates to the data structures will only affect the blobs. Which other kinds of updates are there? Both the schema layout and the protobuf description will have to be designed with some care up front. Anyway, the protobuf interface leaves enough room to ensure compatibility in the future, using field tags. I will spend many headaches on it, for sure :)

Kriechi commented Apr 25, 2018

@madt1m I'm talking about something like we have here: https://github.com/mitmproxy/mitmproxy/blob/5546f0a05ec21db986f0639cee9e89452ba68642/mitmproxy/io/compat.py

When we use sqlite + protobuf + in-memory objects we have multiple layers of parsing & converting to do. Just keep that in mind - there will be updates and changes in the future, and we need to design the new serialization format so we can handle them without losing data.

cortesi commented Apr 26, 2018

@Kriechi We'll need to think carefully about how we want to do schema modifications as the SQL schema evolves. We can do it in-place, or we can guarantee that we'll read data in a backwards compatible way, but write to a newly created session file. I lean towards the latter.

For the blobs themselves, I think the conversion will be pretty much what we have now.
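
As a sketch of the "read old, write new" idea, sqlite's user_version pragma could carry the schema version (the version numbers and the open_session helper are assumptions, not a design):

import sqlite3

SCHEMA_VERSION = 1  # bumped whenever the table layout changes

def open_session(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    if version == 0:
        # Fresh file: create the current schema and stamp it with the version.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS flows "
            "(id INTEGER PRIMARY KEY, mid TEXT NOT NULL, kind TEXT NOT NULL, data BLOB NOT NULL)"
        )
        conn.execute("PRAGMA user_version = %d" % SCHEMA_VERSION)
    elif version < SCHEMA_VERSION:
        # Older session file: read it in a backwards-compatible way, and copy
        # the flows into a newly created file with the current schema.
        raise NotImplementedError("convert into a new session file here")
    return conn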

madt1m commented Apr 26, 2018

So, let's say that I am a user who wants to use mitmproxy interactively. I generate flows which are serialized to disk on the fly, and the view addon takes on the job of retrieving, ordering, and showing my flows. On top of that, there's this nice addon (say, notes) which allows me to attach text to every flow, so that e.g. two days later I can filter on the text I added.

Now, let's say a vulnerability scanner is playing around on the internet, using both scapy and mitmproxy to test weird behaviour and trying to obtain free pizza deliveries on Just Eat. As pointed out before, it could just attach some custom, opaque blob of data to rows in the DB.

Some questions emerge here, about SQL schema:

First of all, should the text for notes be placed into a fixed text column in the schema? Or would it just be a pre-cooked example of the aforementioned "custom, opaque blob of data"?

And, concerning that, how would it work? Designing, for example, the schema with some "custom_annotation" column that can be used by developers? What if we want to support multiple kinds of annotation on the same row? Or maybe creating another table entirely for annotation blobs?

madt1m commented Apr 26, 2018

I would sleep better if I knew that the schema layout for captures is mostly fixed, with a clear interface for backwards compatibility across future updates. The "read old, write new" approach which @cortesi pointed out above sounds good to me in that sense.

cortesi added the RFC label May 13, 2018
