
ETS and Snapshots #34

Closed
martinsumner opened this issue Feb 28, 2017 · 16 comments

@martinsumner
Owner

martinsumner commented Feb 28, 2017

The Ledger Cache is kept small (up to about 2000 entries) because, when a clone is required, the cache is passed through ets:to_list so that a snapshot can exist in the clone.

For short-lived 2i queries this is probably a poor trade-off. It would probably be better to run the query against the ETS table and push the results downstream to the clone, rather than push the whole cache.
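As a rough sketch of the two approaches (hedged - the table handling and the match function below are illustrative only, not the actual leveled code):

```erlang
%% Hedged sketch only - not the leveled implementation. The match function
%% and accumulator shape are assumptions used for illustration.
-module(ledger_cache_sketch).
-export([snapshot_whole_cache/1, query_in_place/3]).

%% Current approach (as described above): copy every entry so the clone
%% owns an immutable snapshot of the whole cache.
snapshot_whole_cache(Tab) ->
    ets:tab2list(Tab).

%% Alternative: run the 2i query against the ETS table in the Bookie and
%% push only the matching results downstream to the clone.
query_in_place(Tab, MatchFun, Acc0) ->
    ets:foldl(
        fun(Entry, Acc) ->
            case MatchFun(Entry) of
                true  -> [Entry | Acc];
                false -> Acc
            end
        end,
        Acc0,
        Tab).
```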

@DeadZen
Contributor

DeadZen commented Feb 28, 2017

What about a generated module? It could hold the entries and some query functions to scan them, and the module could be deleted when the cache expires.
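Roughly what I mean, as a sketch only (the module name and the entries/0 function are made up for illustration, and this is not how goldrush does it): compile the cache entries into a throwaway module as a literal term, then purge the module when the cache expires.

```erlang
%% Hedged sketch of the generated-module idea; names are hypothetical.
-module(cache_modgen).
-export([compile_cache/2, delete_cache/1]).

%% Build, compile and load a module ModName exporting entries/0, which
%% returns the cached Entries as a literal term.
compile_cache(ModName, Entries) ->
    Forms =
        [{attribute, 1, module, ModName},
         {attribute, 2, export, [{entries, 0}]},
         {function, 3, entries, 0,
          [{clause, 3, [], [], [erl_parse:abstract(Entries)]}]}],
    {ok, ModName, Bin} = compile:forms(Forms),
    {module, ModName} = code:load_binary(ModName, atom_to_list(ModName), Bin),
    ok.

%% Remove the module once the cache expires.
delete_cache(ModName) ->
    code:purge(ModName),
    code:delete(ModName).
```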

@martinsumner
Owner Author

Do you have any pointers to an example of doing that sort of thing? I'd like to try and understand the idea better.

@DeadZen
Contributor

DeadZen commented Feb 28, 2017

Sure, I have a library that does part of it: http://github.com/DeadZen/goldrush
It should not be too hard to make it store and scan data internally; I would have to make a PoC.

@DeadZen
Contributor

DeadZen commented Feb 28, 2017

I'll work on it now, can't do anything else with AWS down atm .. ;p

@martinsumner
Owner Author

OK. I'm still a bit unsure about what you're suggesting, so I'm intrigued.

Just to be clear: the snapshot of the cache is taken for the 2i query so that the query will provide a view of the database at the point in time the query was initiated. I don't need to stream into the snapshot any new events that arrive after initiation of the query.

The Ledger Cache is an ets ordered set, and it gets flushed occasionally (after it has pushed itself to the Penciller). So by extracting from the set at the point of the query, the Bookie is free to flush the set, without any concern that there may be a clone pointing at the ets table expecting it to still be there.

The extraction at the moment is done by doing a tab2list, and this takes about 1ms. However, I will have N/3 nodes doing this concurrently - so if there are a lot of 2i queries, all those 1ms CPU hits will add up. All I was intending to do was to iterate over the ets ordered set to extract only the results relevant to the query at the snapshot, on the basis that a few calls to ets:next will be cheaper than an ets:tab2list.
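Something like this is what I have in mind - a sketch only, with the key handling simplified compared to the real ledger keys:

```erlang
%% Hedged sketch: extract only the entries in [StartKey, EndKey] from an
%% ordered_set table, instead of copying the whole table with ets:tab2list/1.
-module(range_extract).
-export([range_to_list/3]).

range_to_list(Tab, StartKey, EndKey) ->
    %% Find the first key at or after StartKey.
    FirstKey =
        case ets:lookup(Tab, StartKey) of
            [] -> ets:next(Tab, StartKey);
            _  -> StartKey
        end,
    walk(Tab, FirstKey, EndKey, []).

walk(_Tab, '$end_of_table', _EndKey, Acc) ->
    lists:reverse(Acc);
walk(_Tab, Key, EndKey, Acc) when Key > EndKey ->
    lists:reverse(Acc);
walk(Tab, Key, EndKey, Acc) ->
    [Entry] = ets:lookup(Tab, Key),
    walk(Tab, ets:next(Tab, Key), EndKey, [Entry | Acc]).
```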

@DeadZen
Contributor

DeadZen commented Feb 28, 2017

OK, I was actually going to ask whether that was the case with needing to stream new events.
That's good, for sure.

@martinsumner
Owner Author

#39

Will close this for now. Hopefully over the next few days I will have some proper basho_bench results to compare with leveldb.

@martinsumner
Owner Author

In single-threaded tests this does OK, but less so when running multiple parallel queries.

There is some possible concern about the cost of starting a query - i.e. what is the cost of a query on a near-empty database with limited results to return? Is the overhead of snapshotting too big?

@martinsumner reopened this Mar 6, 2017
@DeadZen
Contributor

DeadZen commented Mar 11, 2017

How do I reproduce this test?

@martinsumner
Owner Author

There's a basho_bench script I'm building for 2i. It is a work in progress at the moment. I will upload it tomorrow evening.

@martinsumner
Owner Author

Here's the current state of the script (sorry, I forgot about posting this):

https://github.com/martinsumner/basho_bench/blob/mas-nhsload/src/basho_bench_driver_riakc_pb.erl

I'm going to do some more work on this Thursday.

There is a bunch of optimisation work I've undertaken which is all merged into this branch:

https://github.com/martinsumner/leveled/tree/mas-pushmem-i46/src

This is really about improving the performance, not under query load, but under the update-related load when adding more indexes.

The big differences come from this commit:

2b0ec1d#diff-25d902c24283ab8cfbac54dfa101ad31

There is a bit of a convoluted history around what made the difference. There are two parts: firstly, allowing slots that contain index entries to have twice as many keys (which reduces the pace of merge activity in the tree); and secondly, implementing that in a way which reduces needless object copying by the beam.

Part of the issue is that I'd made a rookie mistake in assuming that all large objects are passed as references, when in fact only large binaries are. So there has been a general history of performance-related issues from message-passing large objects between processes in leveled.
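For reference, a minimal illustration of that distinction (a toy example, not leveled code): only large binaries (over 64 bytes) live off-heap and are passed between processes by reference; any other large term is deep-copied on every message send.

```erlang
%% Toy illustration: large binaries are shared by reference between
%% processes, but other large terms are deep-copied into the receiver's
%% heap on every message send.
-module(copy_demo).
-export([run/0]).

run() ->
    BigList = lists:seq(1, 100000),               %% a large non-binary term
    BigBin  = binary:copy(<<0>>, 1024 * 1024),    %% a large (off-heap) binary
    Pid = spawn(fun() -> sink() end),
    Pid ! {list, BigList},  %% the whole list is copied into Pid's heap
    Pid ! {bin, BigBin},    %% only a small reference to the binary is copied
    ok.

sink() ->
    receive
        _Msg -> sink()
    end.
```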

Prior to this, running a 6-hour test with multiple 2i index updates for every object led to the beam occupying 40% of RAM by the end of the test. Now, with the same test, the beam ends up with just 10% of RAM - with all that saved memory being used by the page cache, and presumably CPU time saved in garbage collection.

@martinsumner
Owner Author

I've rolled through a number of ideas to try and improve this, and they all hit the buffers at some stage.

One thing to note is that although the manifest is small, it must be copied very aggressively for 2i queries. When volume testing at 200 2i queries per second, there are nearly 5000 manifest copies being generated every second. Each individual version of the manifest will be copied thousands of times.

Although the manifest is small, this seems wasteful. Instead, a separate worker should be started for the first snapshot for any given process, and subsequent snapshots for that manifest should use the same worker.
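A rough sketch of the idea (the registry shape and message protocol are illustrative only, not a design): keep a registry keyed on the manifest version, start one worker for the first snapshot of that version, and point subsequent snapshots at the existing worker rather than taking another copy.

```erlang
%% Hedged sketch: reuse one snapshot worker per manifest version instead of
%% copying the manifest for every query. Names and messages are hypothetical.
-module(snapshot_registry).
-export([new/0, worker_for/3]).

new() -> #{}.

worker_for(ManifestSQN, Manifest, Registry) ->
    case maps:find(ManifestSQN, Registry) of
        {ok, Pid} ->
            {Pid, Registry};                          %% reuse existing worker
        error ->
            Pid = spawn(fun() -> worker_loop(Manifest) end),
            {Pid, maps:put(ManifestSQN, Pid, Registry)}
    end.

worker_loop(Manifest) ->
    receive
        {query, From, Ref, QueryFun} ->
            From ! {Ref, QueryFun(Manifest)},   %% run the query against the
            worker_loop(Manifest);              %% shared manifest copy
        stop ->
            ok
    end.
```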

@martinsumner
Owner Author

martinsumner commented Mar 24, 2017

The change to share manifests across snapshots didn't make a huge difference the first time I tested it... but I was busy rolling a lot of changes at the time.

I will re-test this. I'm also considering a broader overhaul of actor organisation within the Penciller. Perhaps L0 and the manifest should live under a dedicated actor for handling those things, and all queries (including mainstream non-clone queries) would be handled by runners which only have the level-zero index. So there would be a permanent runner (using runner as another actor name, as in the Bookie's runner) for non-snapshot requests, and runner clones for query requests. The actor holding the L0 state would have to support snapshot requests and handle multiple versions (of both L0 and the manifest).

@martinsumner
Owner Author

Tried to have a background accumulator of the penciller memory which could be used in queries ... but this didn't help.

https://github.com/martinsumner/leveled/tree/mas-pmemacc-i34/src

A quick volume test with 2i showed a significant reduction in throughput when compared to master.

@martinsumner
Owner Author

I'm of the opinion that I should defer work on this until we can test Leveled end-to-end with Riak in OTP 20.0 (and perhaps 20.0 ++ the I/O scheduler changes). Time might be a healer.

Actual testing hasn't highlighted this as a major issue. Also, despite lots of thinking, I'm devoid of any ideas to resolve it that don't have nasty side effects.

I'm moving this to the back-burner for now

@martinsumner
Owner Author

When testing in OTP 20/21 there appear to be significant speed-ups, especially when running ct tests.

In the target environment (NHS/Riak) things are ultimately slowed by disk - so enhancements here don't appear to be relevant. Also 2i query response times show no issues.

Won't fix - for now.
