Add advanced cache documentation and belady approximator to hitrates
- belady is useful to get *some* sort of semi-realistic expectation of
  a cache, as the maximum hit rate is only somewhat realistic as cache
  sizes get close to the number of unique entries
- caches have been busting my balls and I'd assume the average user
  doesn't have the time and inclination to bother, so some guidance is
  useful
- as caching is generally a CPU/memory tradeoff, while ``hitrates``
  provides a cache overhead estimation, giving users a better grasp of
  the implementation details and of where the overhead comes from is
  useful
- plus I regularly re-wonder and re-research and re-discover the size
  complexity of various collections so this gives me the opportunity
  to actually write it down for once
masklinn committed Mar 26, 2024
1 parent a0b5cec commit e380448
Showing 3 changed files with 442 additions and 11 deletions.
372 changes: 372 additions & 0 deletions doc/advanced/caches.rst
@@ -0,0 +1,372 @@
=========
On Caches
=========

Evaluating Caches
=================

UA-Parser tries to provide a somewhat decent cache by default, but
cache algorithms react differently to traffic patterns, and setups can
have different amounts of space to dedicate to cache overhead.

Thus, ua-parser also provides some tooling to evaluate cache fitness,
in the form of two built-in command-line scripts. Both scripts take a
mandatory *sample file*, which should be a representative sample of
your real-world traffic (no sorting, no deduplicating, ...) so the
evaluation reflects actual usage.

``python -mua_parser hitrates``
-------------------------------

As its name indicates, the ``hitrates`` script allows measuring the
hit rates of ua-parser's available caches by simulating cache use at
various sizes on the sample file. It also provides the memory overhead
of each cache implementation at those sizes, both in total and per
entry.

.. warning::

The cache overhead does not include the size of the cached entries
themselves, which is generally 500~700 bytes for a complete entry
(all three domains matched).

``hitrates`` also includes Bélády's MIN (aka OPT) algorithm for
reference. MIN is not a practical cache as it requires knowledge of
the future, but it provides the theoretical upper bound at a given
cache size (very theoretical, practical cache algorithms tend to be
way behind until cache sizes close in on the total number of unique
values in the dataset).
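
For illustration, here is a minimal (and unoptimised) sketch of what
such a MIN simulation can look like; it is not the script's actual
implementation, just the idea: on a miss with a full cache, evict the
entry whose next use is furthest in the future (or never comes)::

    def belady_hit_rate(trace, size):
        # illustrative helper, not part of ua-parser's API
        # for every position, find the position of the next access to
        # the same value (infinity if it never recurs)
        next_use = [float("inf")] * len(trace)
        last_seen = {}
        for i in range(len(trace) - 1, -1, -1):
            next_use[i] = last_seen.get(trace[i], float("inf"))
            last_seen[trace[i]] = i

        cache = {}  # value -> position of its next access
        hits = 0
        for i, key in enumerate(trace):
            if key in cache:
                hits += 1
            elif len(cache) >= size:
                # evict the entry whose reuse is furthest in the future
                del cache[max(cache, key=cache.__getitem__)]
            cache[key] = next_use[i]
        return hits / len(trace)

Fed the lines of a sample file and a cache size, this returns the kind
of hit-rate ceiling ``hitrates`` reports for that size.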

``hitrates`` has the advantage of being very cheap as it only
exercises the caches themselves and barely looks at the data.

``python -mua_parser bench``
----------------------------

``bench`` is much more expensive in both CPU and wallclock as it
actually runs the base resolvers, combined with various caches of
various sizes. For usability, it can report its data (the average
parse time per input entry) in both human-readable text with one
result per line and CSV with resolver configurations as the columns
and cache sizes as the rows.

``hitrates`` is generally sufficient: for a given base resolver,
performance tends to more or less follow hit rates, as a cache hit is
close to free compared to a cache miss. This is especially true for
the basic resolver, whose misses tend to be very expensive. ``bench``
is mostly useful to validate or tie-break decisions based on
``hitrates``, and allows creating nice graphs in your spreadsheet
software of choice.

Cache Algorithms
================

[S3-FIFO]_
----------

[S3-FIFO]_ is a novel fifo-based cache algorithm. It might seem odd
to pick it as the default rather than a "tried and true" LRU_, but the
principles are interesting and on our sample it shows very good
performance for an acceptable implementation complexity.

Advantages
''''''''''

- excellent hit rates
- thread-safe on hits
- excellent handling of one-hit wonders (entries unique to the data
  set) and rarely repeated entries (entries which recur with a lot of
  separation)
- flexible implementation

Drawbacks
'''''''''

- O(n) eviction
- somewhat demanding on memory, especially at small sizes

Space
'''''

An S3Fifo of size n is composed of:

- one :ref:`dict` of size 1.9*n
- three :ref:`deque` of sizes 0.1 * n, 0.9 * n, and 0.9 * n

[SIEVE]_
--------

[SIEVE]_ is another novel fifo-based algorithm; a cousin of S3Fifo,
it works on a somewhat different principle. It has good performance
and a more straightforward implementation than S3, but it is strongly
wedded to linked lists as it needs to remove entries from the middle
of the fifo (whereas S3 uses strict fifos).
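
To give an idea of the mechanism, here is a simplified sketch of the
paper's algorithm (not ua-parser's implementation): entries are
inserted at the head of the fifo, a "hand" sweeps from the tail
towards the head, clearing the visited flag of entries it passes, and
the first unvisited entry it finds, possibly in the middle of the
list, is the one evicted::

    class Node:
        __slots__ = ("key", "value", "visited", "prev", "next")

        def __init__(self, key, value):
            self.key, self.value = key, value
            self.visited = False
            self.prev = self.next = None  # prev points towards the head (newer)

    class Sieve:
        def __init__(self, maxsize):
            self.maxsize = maxsize
            self.index = {}               # key -> Node
            self.head = self.tail = None  # head = newest entry, tail = oldest
            self.hand = None              # where the eviction sweep resumes

        def get(self, key):
            node = self.index[key]        # raises KeyError on a miss
            node.visited = True           # a hit only flips a flag
            return node.value

        def put(self, key, value):
            # assumes the key is not already cached (plain miss-path insert)
            if len(self.index) >= self.maxsize:
                self._evict()
            node = Node(key, value)
            node.next = self.head         # insert at the head of the fifo
            if self.head is not None:
                self.head.prev = node
            self.head = node
            self.tail = self.tail or node
            self.index[key] = node

        def _evict(self):
            node = self.hand or self.tail
            # sweep from the tail towards the head, sparing visited
            # entries and clearing their flag, wrapping around if needed
            while node.visited:
                node.visited = False
                node = node.prev or self.tail
            self.hand = node.prev
            # unlink the victim, which may sit in the middle of the list
            if node.prev is not None:
                node.prev.next = node.next
            else:
                self.head = node.next
            if node.next is not None:
                node.next.prev = node.prev
            else:
                self.tail = node.prev
            del self.index[node.key]

Because eviction needs to unlink an arbitrary node, a plain deque is
not usable, which is why the space analysis below counts one node
object per entry.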

Advantages
''''''''''

- good hit rates
- thread-safe on hits
- memory efficient

Drawbacks
'''''''''

- O(n) eviction

Space
'''''

A SIEVE of size n is composed of:

- a :ref:`dict` of size n
- a linked list with n :ref:`nodes of 4 pointers each <class>`

LRU
---

The grandpappy of non-trivial cache eviction, it's mostly included as
a safety in case users encounter workloads for which the fifo-based
algorithms completely fall over (do report them, I'm sure the authors
would be interested).
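
The entire algorithm fits in a few lines on top of
:class:`~collections.OrderedDict`; a minimal sketch, not ua-parser's
actual class::

    from collections import OrderedDict

    class Lru:
        def __init__(self, maxsize):
            self.maxsize = maxsize
            self.entries = OrderedDict()

        def get(self, key):
            # a hit moves the entry back to the "young" end, which mutates
            # the dict: this is why LRU hits need to be synchronised
            self.entries.move_to_end(key)
            return self.entries[key]

        def put(self, key, value):
            self.entries[key] = value
            self.entries.move_to_end(key)
            if len(self.entries) > self.maxsize:
                # evict from the "old" end, in O(1)
                self.entries.popitem(last=False)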

Advantages
''''''''''

- basically built into the Python stdlib (via
  :class:`~collections.OrderedDict`)
- O(1) eviction
- nobody ever got evicted for using an LRU

Drawbacks
'''''''''

- must be synchronised on hit: entries are moved
- poor hit rates

Space
'''''

An LRU of size n is composed of:

- an :ref:`ordered dict <odict>` of size n

Memory analysis of Python objects
=================================

Measurements are as of Python 3.11, on a 64-bit platform. The
information given is the overhead of the object itself, not the data
it stores, e.g. if an object stores strings the sizes of the strings
are not included in the calculations.

.. _class:

``class``
---------

With ``__slots__``, a Python object is 32 bytes + 8 bytes for each
member. An additional 8 bytes is necessary for weakref support
(slotted objects in UA-Parser don't have weakref support).

Without ``__slots__``, a Python object is 48 bytes plus an instance
:ref:`dict`.

.. note:: The instance dict is normally key-sharing, which is not
included in the analysis, see :pep:`412`.

.. _dict:

``dict``
--------

Python's ``dict`` is a relatively standard hash map, but it has a bit
of a twist in that it stores the *entries* in a dense array, which
only needs to be sized up to the dict's load factor, while the sparse
array used for hash lookups (which needs to be sized to match
capacity) only holds indexes into the dense array. This also allows
the *size* of the indices to only be as large as needed to index into
the dense array, so for small dicts the sparse array is an array of
bytes (8 bits).

*However*, because the dense array of entries is used as a stack
(only the last entry can be replaced), a dict which "churns" (entries
get added and removed without the size changing) while close to the
next break-point would need to be compacted frequently, leading to
poor performance.

As a result, although a dictionary being created or added to will
just be the next size up, a dict with a lot of churn will be two sizes
up to limit the amount of compaction necessary, e.g. 10000 entries
would fit in ``2**14`` (capacity 16384, for a usable size of 10922)
but the dict may be sized up to ``2**15`` (capacity 32768, for a
usable size of 21845).
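
Those break points are easy to check, ``USABLE_FRACTION`` being 2/3 of
the capacity (see the note further down)::

    >>> (2 * 2**14) // 3  # usable entries at capacity 16384
    10922
    >>> (2 * 2**15) // 3  # usable entries at capacity 32768
    21845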

Python dicts also have a concept of *key kinds* which influences parts
of the layout. As of 3.12 there are 3 kinds called
``DICT_KEYS_GENERAL``, ``DICT_KEYS_UNICODE``, and ``DICT_KEYS_SPLIT``.
This is relevant here because UA-Parser caches are keyed on strings,
which means they should always use the ``DICT_KEYS_UNICODE`` kind.

In the ``DICT_KEYS_GENERAL`` layout, each entry of the dense array has
to store three pointer-sized items: a pointer to the key, a pointer to
the value, and a cached version of the key hash. However since strings
memoize their hash internally, the ``DICT_KEYS_UNICODE`` layout
retrieves the hash value from the key itself when needed and can save
8 bytes per entry.

Thus the space necessary for a dict is:

- the standard 4 pointers object header (``prev``, ``next``, and type
pointers, and reference count)
- ``ma_used``, 8 bytes, the number of entries
- ``ma_version_tag``, 8 bytes, deprecated
- ``ma_keys``, a pointer to the dict entries
- ``ma_values``, a pointer to the split values in ``DICT_KEYS_SPLIT``
layout (not relevant for UA-Parser)

The dict entries then are:

- ``dk_refcnt``, an 8-byte refcount (used for the ``DICT_KEYS_SPLIT``
layout)
- ``dk_log2_size``, 1 byte, the total capacity of the hash map, as a
power of two
- ``dk_log2_index_bytes``, 1 byte, the size of the sparse indexes
array in bytes, as a power of two; it essentially memoizes the log2
size of the sparse indexes array by incrementing ``dk_log2_size`` by
3 if above 32, 2 if above 16, and 1 if above 8

.. note::

This means the dict bumps up the indexes array a bit early to
avoid having to resize again within a ``dk_log2_size``, e.g. at
171 elements the dict will move to size 9 (total capacity 512,
usable capacity 341) and the index size will immediately get
bumped to 10 even though it can still fit ~80 additional items
with a u8 index.

- ``dk_kind``, 1 byte, the key kind explained above
- ``dk_version``, 4 bytes, used for some internal optimisations of
CPython
- ``dk_usable``, 8 bytes, the number of usable entries in the dense array
- ``dk_nentries``, 8 bytes, the number of used entries in the dense
array, this can't be computed from ``dk_usable`` and
``dk_log2_size`` because ??? from the mention of ``DKIX_DUMMY`` I
assume it's because ``dk_usable`` is used to know when the dict
needs to be compacted or resized, and because python uses open
addressing and leaves tombstone (``DKIX_DUMMY``) in the sparse array
they matter for collision performances, and thus load calculations
- ``dk_indices``, the sparse array of size
  ``1<<dk_log2_index_bytes``
- ``dk_entries``, the dense array of size
``USABLE_FRACTION(1<<dk_log2_size) * 16``

.. note:: ``USABLE_FRACTION`` is 2/3

Thus the space formula for dicts -- in the context of string-indexed
caches -- is::

32 + 32 + 32
+ 2**(ceil(log2(n)) + 1) * ceil(log256(n))
+ floor(2/3 * 2**(ceil(log2(n)) + 1)) * 16
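
Or transcribed to Python (an approximation for the string-keyed,
churn-heavy case described above, with the constant 96 covering the
three 32-byte headers; the helper is illustrative, not part of
ua-parser)::

    from math import ceil, log, log2

    def dict_overhead(n):
        capacity = 2 ** (ceil(log2(n)) + 1)      # "two sizes up" for churn
        index_width = max(1, ceil(log(n, 256)))  # bytes per sparse index
        usable = (2 * capacity) // 3             # USABLE_FRACTION of capacity
        return 96 + capacity * index_width + usable * 16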

.. _odict:

``collections.OrderedDict``
---------------------------

While CPython has a pure-Python ``OrderedDict``, it's not actually
used; instead a native implementation with a native doubly linked list
and a bespoke secondary hashmap is used, leading to a much denser
collection than is achievable in Python. The broad strokes are similar
though:

- a regular ``dict`` links keys to values
- a secondary hashmap links keys to *nodes* of the linked list,
allowing reordering entries easily

The secondary hashmap is only composed of a dense array of nodes,
using the internal details of the dict in order to handle lookups in
the sparse array and collision resolution. Unlike ``dict`` however
it's sized to the dict's capacity rather than ``USABLE_FRACTION``
thereof.

The entire layout is:

- a full dict object (see above), inline
- pointers to the first and last nodes of the doubly linked list
- a pointer to the array of nodes
- ``od_fast_nodes_size``, 8 bytes, which is used to see if the
underlying dict has been resized
- ``od_resize_sentinel``, 8 bytes (a pointer to the dict entries
  object), which is *also* used to see if the underlying dict has been
  resized
- ``od_state``, 8 bytes, to check for concurrent mutations during
iteration
- ``od_inst_dict``, 8 bytes, used to provide a fake ``__dict__`` and
  better imitate the pure-Python version
- ``od_weakreflist``, 8 bytes, weakref support

And each node in the linked list is 4 pointers: previous, next, key,
and hash.

.. note::

The hash is (likely) there to speed up lookups, since going from an
odict node to a dict entry requires a full lookup; such a lookup is
what happens during iteration, except it uses a regular
``PyDict_GetItem`` instead of a low-level lookup, why?

So the ordereddict space requirement formula is::

dict(n) + 64 + 8 * 2**(ceil(log2(n)) + 1) + 32 * n

Because the fast-nodes array matches the dict's capacity it is, like
the dict's, double what's strictly required, in order to amortise
churn.
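
Transcribed to Python, reusing ``dict_overhead`` from the previous
section::

    def ordereddict_overhead(n):
        capacity = 2 ** (ceil(log2(n)) + 1)  # matches the inner dict's capacity
        return (
            dict_overhead(n)  # the full dict object, inline
            + 64              # the OrderedDict-specific fields
            + 8 * capacity    # the fast-nodes array
            + 32 * n          # one 4-pointer node per entry
        )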

.. _deque:

``collections.deque``
---------------------

Deque is an unrolled doubly linked list of order 64, that is every
node of the linked list stores 64 items, plus two pointers for the
previous and next links. Note that the deque always allocates a block
upfront (nb: why not allocate on use?).

The deque metadata (excluding the blocks) is 232 bytes:

- the standard 32-byte object header (next pointer, previous pointer,
  refcount, and type pointer)
- the ``ob_size`` of a VAR_OBJ, apparently used to store the number of
  items as the deque does not track its blocks' sizes
- pointers to the left and right blocks
- offsets into the left and right blocks (as they may only be
partially filled)
- ``state``, a mutation counter used to track mutations during
iteration
- ``maxlen``, in case the deque is length-bounded
- ``numfreeblocks``, the actual size of the freelist
- ``freelist``, 16 pointers to already allocated available blocks
- ``weakreflist``, the weakref support pointer

So the deque space requirement formula is::

232 + max(1, ceil(n / 64)) * 66 * 8
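
Transcribed to Python, and combined with the helpers above, this is
enough to roughly estimate the overhead of the caches described
earlier (a rough estimate only: the cache objects' own handful of
attributes and, as noted, the cached values themselves are ignored)::

    def deque_overhead(n):
        return 232 + max(1, ceil(n / 64)) * 66 * 8

    def s3fifo_overhead(n):
        # a dict sized to 1.9n plus the small, main, and ghost fifos
        return (
            dict_overhead(ceil(1.9 * n))
            + deque_overhead(ceil(0.1 * n))
            + 2 * deque_overhead(ceil(0.9 * n))
        )

    def sieve_overhead(n):
        # a dict of size n plus n slotted node objects of 4 members each
        # (32 + 4 * 8 = 64 bytes, per the ``class`` section above)
        return dict_overhead(n) + n * 64

    def lru_overhead(n):
        return ordereddict_overhead(n)

For instance ``s3fifo_overhead(5000) / 5000`` should land in the same
ballpark as the per-entry overhead ``hitrates`` reports for that size.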

:func:`~functools.lru_cache`
----------------------------

While not strictly relevant to ua-parser, it should be noted that
:func:`~functools.lru_cache` is *not* built on
:class:`~collections.OrderedDict`, it has its own native
implementation which uses a single dict and a different bespoke doubly
linked list with larger nodes (9 pointers).

.. [S3-FIFO] Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, Rashmi
Vinayak. 2023. FIFO queues are all you need for cache eviction.
SOSP '23. https://dl.acm.org/doi/10.1145/3600006.3613147
.. [SIEVE] Yazhuo Zhang, Juncheng Yang, Yao Yue, Ymir Vigfusson,
K. V. Rashmi. 2023. SIEVE is Simpler than LRU: an Efficient
Turn-Key Eviction Algorithm for Web Caches. NSDI '24.
https://junchengyang.com/publication/nsdi24-SIEVE.pdf
2 changes: 2 additions & 0 deletions doc/index.rst
@@ -9,9 +9,11 @@ For more detailed insight and advanced uses, see the :doc:`api` and
:doc:`guides`.

.. toctree::
:maxdepth: 2
:caption: Contents:

installation
quickstart
guides
api
advanced/caches