
Performance

Sybil's performance can be broken into three parts: record ingestion, record collation (digestion), and querying. Sybil is fastest at querying, but ingestion is by no means slow. Below is an example schema and timings.

Example Uptime Data

Schema

  • time: high cardinality int
  • weight: low cardinality int
  • random int: low cardinality int
  • ping: low cardinality int
  • host name: low cardinality string
  • status code: low cardinality string
  • groups: low cardinality strings inside a set column

Row Ingestion

To ingest 10,000 records with the above schema (generating and writing 10 records at a time, with a 10ms pause between batches) takes approximately 90 seconds on an SSD and keeps my laptop's 4 CPUs at about 10 - 20% load. That translates to a comfortable ingest rate of about 400K events per hour on commodity hardware.

If 400K events per hour seems like a small number, consider that each event has about 10 fields (some are unlisted above because they don't carry semantic meaning).
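For reference, a minimal sketch of the kind of load generator described above: it emits newline-delimited JSON records in batches of 10 with a 10ms pause between batches. Only the fields listed in the schema are modeled, the JSON key names and sample values (host names, status codes, groups) are my own assumptions, and piping the output into something like `sybil ingest -table uptime` is an assumption about the CLI that may differ by version.

```go
package main

import (
	"encoding/json"
	"math/rand"
	"os"
	"time"
)

// record mirrors the example uptime schema above; key names are assumed.
type record struct {
	Time       int64    `json:"time"`
	Weight     int      `json:"weight"`
	RandomInt  int      `json:"random_int"`
	Ping       int      `json:"ping"`
	HostName   string   `json:"host_name"`
	StatusCode string   `json:"status_code"`
	Groups     []string `json:"groups"`
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	hosts := []string{"web-1", "web-2", "db-1"}
	statuses := []string{"200", "500", "timeout"}

	// 1,000 batches of 10 records = 10,000 records total.
	for batch := 0; batch < 1000; batch++ {
		for i := 0; i < 10; i++ {
			r := record{
				Time:       time.Now().Unix(),
				Weight:     1 + rand.Intn(10),
				RandomInt:  rand.Intn(100),
				Ping:       rand.Intn(500),
				HostName:   hosts[rand.Intn(len(hosts))],
				StatusCode: statuses[rand.Intn(len(statuses))],
				Groups:     []string{"uptime", "monitoring"},
			}
			if err := enc.Encode(&r); err != nil {
				panic(err)
			}
		}
		time.Sleep(10 * time.Millisecond) // pause between batches
	}
}
```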

Row Digestion

Reading, digesting, and collating the 1,000 row-store files generated above (10 records each) into a half-filled block (one already containing 37K of 65K records) takes ~500ms, with the time varying based on how full the block currently is. Filling a complete 65K-record block takes about 600 - 800ms, depending on how many columns are involved (in this example, 10 columns, mostly low cardinality).
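Digestion is a separate pass over the accumulated row-store files; a minimal sketch of driving it on a schedule is below. The `sybil digest -table` invocation is an assumption about the CLI (check your installed version), and the 10-second interval is arbitrary.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	// Periodically collate accumulated row-store files into column-store blocks.
	for range time.Tick(10 * time.Second) {
		out, err := exec.Command("sybil", "digest", "-table", "uptime").CombinedOutput()
		if err != nil {
			log.Printf("digest failed: %v (%s)", err, out)
			continue
		}
		log.Printf("digest finished: %s", out)
	}
}
```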

Querying

As expected of a column store, query times depend on how many columns have to be loaded off disk. By default, Sybil skips columns that are not relevant to the current query and does not bother loading them off disk. For example, a query that groups by host on the above data takes less time than one that groups by host and status, because the status column now has to be read off disk as well.
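This is not Sybil's actual code, but a toy illustration of the idea: only the columns a query references (group-bys, aggregates, filters) ever need to be read, so grouping by host and status touches one more column than grouping by host alone.

```go
package main

import "fmt"

type query struct {
	GroupBy    []string
	Aggregates []string // columns being aggregated, e.g. "ping"
	Filters    []string // columns referenced in filters, e.g. "time"
}

// columnsToLoad returns the set of columns the query actually touches;
// everything else can stay on disk.
func columnsToLoad(q query) map[string]bool {
	cols := map[string]bool{}
	for _, list := range [][]string{q.GroupBy, q.Aggregates, q.Filters} {
		for _, c := range list {
			cols[c] = true
		}
	}
	return cols
}

func main() {
	byHost := query{GroupBy: []string{"host"}, Aggregates: []string{"ping"}}
	byHostStatus := query{GroupBy: []string{"host", "status"}, Aggregates: []string{"ping"}}

	fmt.Println(columnsToLoad(byHost))       // host, ping
	fmt.Println(columnsToLoad(byHostStatus)) // host, status, ping (one more column read)
}
```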

Sybil also tries to skip blocks that don't match the query filters: for example, it will not load any blocks off disk that fall outside the query time window (if one is supplied), which translates into great savings in query time as the database grows. However, in order to skip a block, Sybil does have to load its metadata off disk. Once there are hundreds of blocks (or millions of samples), that can add an extra 50 - 100ms per query just to read block metadata.
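Again as a toy sketch rather than Sybil's actual code: if each block's metadata records the min and max time it contains, a block whose range doesn't overlap the query window never needs its column data loaded, though its (small) metadata still has to be read to make that decision.

```go
package main

import "fmt"

// blockInfo is the per-block metadata needed to decide whether to skip it.
type blockInfo struct {
	Name             string
	MinTime, MaxTime int64
}

// overlaps reports whether a block could contain records in [from, to].
func overlaps(b blockInfo, from, to int64) bool {
	return b.MaxTime >= from && b.MinTime <= to
}

func main() {
	blocks := []blockInfo{
		{"block_0001", 1000, 1999},
		{"block_0002", 2000, 2999},
		{"block_0003", 3000, 3999},
	}

	from, to := int64(2500), int64(3500)
	for _, b := range blocks {
		if overlaps(b, from, to) {
			fmt.Println("load", b.Name)
		} else {
			fmt.Println("skip", b.Name) // column data never leaves disk
		}
	}
}
```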

Once the blocks to read off disk are chosen, loading and aggregating them takes up the main portion of the query time. For example, a group by host over 9 million samples on my laptop takes about 900ms; adding another string column to the group by can increase that to 1,400ms. As you can see, query time grows with how much data is being loaded off disk.
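To make the cost concrete, here is a toy columnar group-by (not Sybil's implementation): each group-by column is another slice to decode and a wider key to build per record, which is why adding a second string column to the group by lengthens the query.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Column-oriented layout: one slice per column, aligned by record index.
	host := []string{"web-1", "web-1", "db-1", "db-1"}
	status := []string{"200", "500", "200", "200"}
	ping := []int{12, 40, 7, 9}

	counts := map[string]int{}
	sums := map[string]int{}
	for i := range ping {
		// Group by host, status: every extra group column widens the key.
		key := strings.Join([]string{host[i], status[i]}, "|")
		counts[key]++
		sums[key] += ping[i]
	}
	for k, n := range counts {
		fmt.Printf("%s: avg ping %.1f over %d records\n", k, float64(sums[k])/float64(n), n)
	}
}
```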

For comparison, MongoDB can take upwards of 10 - 30 seconds to scan those 9 million documents and perform the same aggregations.
