Protobuf - Benchmark Script #3256

Merged
merged 2 commits into mitmproxy:master from madt1m:serialization-benchmark on Jul 23, 2018

Conversation

@madt1m
Member

madt1m commented Jul 22, 2018

A cleaner, updated addon to test the performance of the protobuf module.

Right now, the addon:

  • Continuously generates flows with configurable content size;
  • Periodically gathers the generated flows in batches of N=150;
  • Measures the time to dump each batch and store it in the DB;
  • Writes stats to a file (in the form num_dumps / time_spent) and exits.
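The loop described above can be sketched with plain asyncio. This is illustrative, not the addon's actual code: the byte payload stands in for a flow, and the `join` stands in for serialize-and-store.

```python
import asyncio
import time

FLUSH_RATE = 150      # batch size described above

async def producer(queue: asyncio.Queue) -> None:
    # Continuously generate stand-in "flows" into the queue.
    while True:
        await queue.put(b"A" * 100)
        await asyncio.sleep(0)        # yield to the consumer

async def consumer(queue: asyncio.Queue, rounds: int) -> list:
    # Periodically gather up to FLUSH_RATE flows and time the "dump".
    stats = []
    for _ in range(rounds):
        batch = []
        while not queue.empty() and len(batch) < FLUSH_RATE:
            batch.append(queue.get_nowait())
        start = time.perf_counter()
        _ = b"".join(batch)           # stand-in for serialize + DB store
        stats.append((len(batch), time.perf_counter() - start))
        await asyncio.sleep(0)        # the real addon sleeps a fixed period
    return stats

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    prod = asyncio.create_task(producer(queue))
    await asyncio.sleep(0.01)         # let the queue fill a little
    stats = await consumer(queue, rounds=3)
    prod.cancel()
    return stats

stats = asyncio.run(main())
```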

Any ideas you'd like to see realized here to improve this profiling? Or, conversely, anything in the benchmark process that you feel is incorrect and makes it unreliable?
I felt it was necessary to clean this up and use it, since the protobuf interface has grown more complex, and I had to make sure performance stayed on par.

[screenshot: serbenchmark1]

[screenshot: serbenchmark2]

@madt1m madt1m force-pushed the madt1m:serialization-benchmark branch from 92fa460 to 8ab82ad Jul 22, 2018

@cortesi

Member

cortesi commented Jul 22, 2018

A few comments/questions:

  • Does this obsolete #3126? If so, let's close that PR.
  • This looks legit, and the performance is great. There's one aspect of this that we need to fix/think about. At the moment, the flow stream source generates many, many more flows than we actually consume. This is because we wake up, read 150 flows, write them, and then sleep for another three-second interval, so the queue backs up indefinitely. This raises a few questions for me. First, we should probably adjust the benchmark to check the queue length and never let the queue grow beyond, say, 3x the flush rate. Second, in production we will need to think of a similar back-pressure mechanism - that is, cap the number of "hot" flows we tolerate, and pause flow processing in some way when we exceed that. That's for the future, though.
@madt1m

Member Author

madt1m commented Jul 22, 2018

Does this obsolete #3126? If so, let's close that PR.

Definitely. I'll close it.

The growing queue issue definitely raises some questions for the future. I'll fix it here as you suggested, but there's something more I'm concerned about, if the on-the-fly capture method for Sessions ends up similar to this: the hardcoded flush period.
It just doesn't feel right to establish a priori a period at which to "wake" the flushing method, but without a sleep on the writer there would be a ping-pong between the producer putting a flow in the queue and the consumer immediately flushing it, losing the optimization of writing flows in batches.
One way I see to "solve" this is via async writes to the DB. This way, while a group of flows is being stored in the DB, there would be a convenient time window for the streamer (or any flow capture method) to fill the queue as much as possible. No idle time, and probably good throughput.

I could start with testing the aiosqlite module right here in the benchmark script, actually.
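As a stand-in for aiosqlite, the overlap idea can be sketched with stdlib sqlite3 pushed off the event loop via asyncio.to_thread; the schema and names are hypothetical, not the addon's actual DB layout.

```python
import asyncio
import os
import sqlite3
import tempfile

def dump_batch(db_path: str, batch: list) -> None:
    # Blocking sqlite3 write, executed off the event loop in a thread.
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS flows (content BLOB)")
        conn.executemany("INSERT INTO flows (content) VALUES (?)",
                         [(f,) for f in batch])

async def main() -> int:
    db_path = os.path.join(tempfile.mkdtemp(), "bench.sqlite")
    batch = [b"A" * 100 for _ in range(150)]
    # While the write runs in a worker thread, the event loop stays free:
    # this is the window in which a producer could keep filling the queue.
    write = asyncio.create_task(asyncio.to_thread(dump_batch, db_path, batch))
    while not write.done():           # stand-in for concurrent producing
        await asyncio.sleep(0)
    await write
    with sqlite3.connect(db_path) as conn:
        return conn.execute("SELECT COUNT(*) FROM flows").fetchone()[0]

rows = asyncio.run(main())
```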

count = 1
f = await self.queue.get()
self.hot_flows.append(f)
while not self.queue.empty() and count < self._flush_rate:


@cortesi

cortesi Jul 23, 2018

Member

The queue.empty check here is not necessary. The get_nowait below will raise when the queue is empty anyway.
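The simplification suggested here might look like the following; `gather_batch` and the flush rate are illustrative, not the addon's actual code.

```python
import asyncio

async def gather_batch(queue: asyncio.Queue, flush_rate: int) -> list:
    # Wait for the first flow, then drain without checking queue.empty():
    # get_nowait() raises QueueEmpty on its own when nothing is left.
    batch = [await queue.get()]
    while len(batch) < flush_rate:
        try:
            batch.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            break
    return batch

async def main() -> int:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(40):
        queue.put_nowait(i)
    return len(await gather_batch(queue, flush_rate=150))

n = asyncio.run(main())
```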

async def stats(self):
    while True:
        await asyncio.sleep(1.0)
        if self._fflushes == 21:


@cortesi

cortesi Jul 23, 2018

Member

Ideally, I'd like this to be configurable. It might make more sense to configure the number of flows written, though, rather than the number of flushes.
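A sketch of stopping on a configured flow count rather than a hardcoded flush count, as suggested above; the `state` dict and names are illustrative, not mitmproxy's option machinery.

```python
import asyncio

async def stats(state: dict, target_flows: int) -> int:
    # Exit once the configured number of flows has been written,
    # instead of comparing against a hardcoded flush count.
    while state["flows_written"] < target_flows:
        await asyncio.sleep(0)        # the real addon sleeps 1.0s
    return state["flows_written"]

async def main() -> int:
    state = {"flows_written": 0}

    async def writer() -> None:
        for _ in range(5):            # five flushes of 150 flows each
            state["flows_written"] += 150
            await asyncio.sleep(0)

    task = asyncio.create_task(writer())
    done = await stats(state, target_flows=600)
    await task
    return done

n = asyncio.run(main())
```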

str,
"/tmp/stats",
"Destination for the stats result file"
)


@cortesi

cortesi Jul 23, 2018

Member

I'd suggest that this should be optional, with stats dumped to screen by default. I'll be running this by hand and saying cat /tmp/stats quite a lot. ;) I also prefer not to have the filename hardcoded here.
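One possible shape for this; the `write_stats` signature and option handling are hypothetical, not mitmproxy's option API.

```python
import os
import sys
import tempfile

def write_stats(stats, path: str = "") -> None:
    # Default to the screen; only write a file when a path is configured.
    out = open(path, "w") if path else sys.stdout
    try:
        for num_dumps, time_spent in stats:
            out.write(f"{num_dumps} dumps in {time_spent:.3f}s\n")
    finally:
        if path:
            out.close()

# Demo: with a path configured, stats land in the file instead of stdout.
demo_path = os.path.join(tempfile.mkdtemp(), "stats")
write_stats([(21, 1.234)], demo_path)
content = open(demo_path).read()
```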

self.tf = tflow.tflow()
self.tf.request.content = b'A' * ctx.options.testflow_size
ctx.log(f"With content size: {len(self.tf.request.content)} B")
self.dbh = db.DBHandler("/tmp/temp.sqlite", mode='write')


@cortesi

cortesi Jul 23, 2018

Member

This is not portable. Let's use the tempfile module to generate a temp directory or file for us.
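A minimal sketch of the tempfile-based replacement; the table schema is illustrative.

```python
import os
import sqlite3
import tempfile

# Instead of hardcoding "/tmp/temp.sqlite" (which does not exist on
# Windows), let tempfile pick a portable location.
tmpdir = tempfile.mkdtemp(prefix="mitmproxy-bench-")
db_path = os.path.join(tmpdir, "temp.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE flows (content BLOB)")  # illustrative schema
conn.close()
```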

@cortesi cortesi merged commit 16e0799 into mitmproxy:master Jul 23, 2018

4 checks passed

codecov/patch Coverage not affected when comparing ec092fd...e727446
codecov/project 88.8% remains the same compared to ec092fd
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed