Add the original text of the HyperLogLog-related documentation
huangzworks committed Jul 22, 2014
1 parent 930e3ac commit 5e41428
Showing 3 changed files with 135 additions and 0 deletions.
32 changes: 32 additions & 0 deletions hyperloglog/pfadd.rst
@@ -1,2 +1,34 @@
.. _pfadd:

PFADD
===========

**PFADD key element [element ...]**

Adds all the element arguments to the HyperLogLog data structure stored at the key specified as the first argument.

As a side effect of this command the HyperLogLog internals may be updated to reflect a different estimation of the number of unique items added so far (the cardinality of the set).

If the approximated cardinality estimated by the HyperLogLog changed after executing the command, PFADD returns 1; otherwise it returns 0. The command automatically creates an empty HyperLogLog structure (that is, a Redis String of a specified length and with a given encoding) if the specified key does not exist.

Calling the command with only the key name and no elements is valid: no operation is performed if the key already exists, and the data structure is simply created if the key does not exist (in the latter case 1 is returned).
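The return-value rule can be illustrated with a toy model, written here in Python rather than taken from Redis. It is only a sketch under stated assumptions: the 14-bit register index, the SHA-1 based stand-in hash, and the helper names ``_hash64`` and ``pfadd`` are all hypothetical, and key creation is not modelled.

```python
import hashlib

M_BITS = 14          # assumption: 2**14 = 16384 registers
M = 1 << M_BITS


def _hash64(element: str) -> int:
    # Stand-in 64-bit hash for illustration; Redis uses its own hash function.
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:8], "little")


def pfadd(registers: list, *elements: str) -> int:
    """Toy PFADD: return 1 if at least one register was altered, else 0."""
    changed = 0
    for element in elements:
        h = _hash64(element)
        index = h & (M - 1)      # low 14 bits select the register
        rest = h >> M_BITS       # remaining 50 bits feed the rank
        rank = 1                 # 1-based position of the first set bit
        while rank <= 50 and not (rest & 1):
            rest >>= 1
            rank += 1
        if rank > registers[index]:
            registers[index] = rank
            changed = 1
    return changed


registers = [0] * M
print(pfadd(registers, "a", "b", "c"))   # 1: empty registers were raised
print(pfadd(registers, "a"))             # 0: same element, same register and rank
```

Adding an element that is already present can never alter a register in this model, which is why repeated elements report 0.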

For an introduction to the HyperLogLog data structure, see the PFCOUNT command page.

**Available since:**
>= 2.8.9

**Time complexity:**
O(1) for each element added.

**Return value:**
Integer reply, specifically:
1 if at least one internal HyperLogLog register was altered, 0 otherwise.

::

    redis> PFADD hll a b c d e f g
    (integer) 1

    redis> PFCOUNT hll
    (integer) 7
72 changes: 72 additions & 0 deletions hyperloglog/pfcount.rst
@@ -1,2 +1,74 @@
.. _pfcount:

PFCOUNT
=============

**PFCOUNT key [key ...]**

When called with a single key, returns the approximated cardinality computed by the HyperLogLog data structure stored at the specified key, which is 0 if the key does not exist.

When called with multiple keys, returns the approximated cardinality of the union of the HyperLogLogs passed, by internally merging the HyperLogLogs stored at the provided keys into a temporary HyperLogLog.

The HyperLogLog data structure can be used to count unique elements in a set using just a small, constant amount of memory, specifically 12 KB for every HyperLogLog (plus a few bytes for the key itself).

The returned cardinality of the observed set is not exact, but approximated with a standard error of 0.81%.
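The 0.81% figure is consistent with the usual HyperLogLog error bound of roughly 1.04 / sqrt(m), where m is the number of registers. The short check below assumes m = 16384, the register count of the dense representation; the function name ``standard_error`` is only illustrative.

```python
import math

REGISTERS = 16384  # assumed register count (dense representation)


def standard_error(m: int) -> float:
    # Usual HyperLogLog bound: relative standard error of about 1.04 / sqrt(m).
    return 1.04 / math.sqrt(m)


print(f"{standard_error(REGISTERS):.4%}")  # 0.8125%
```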

For example, to count all the unique search queries performed in a day, a program needs to call PFADD every time a query is processed. The estimated number of unique queries can be retrieved with PFCOUNT at any time.

Note: as a side effect of calling this command, the HyperLogLog may be modified, since the last 8 bytes of its header encode the latest computed cardinality for caching purposes. So PFCOUNT is technically a write command.

**Available since:**
>= 2.8.9

**Time complexity:**
O(1) with a very small average constant time when called with a single key. O(N), with N being the number of keys and with much bigger constant times, when called with multiple keys.

**Return value:**
Integer reply, specifically:
The approximated number of unique elements observed via PFADD.

::

    redis> PFADD hll foo bar zap
    (integer) 1

    redis> PFADD hll zap zap zap
    (integer) 0

    redis> PFADD hll foo bar
    (integer) 0

    redis> PFCOUNT hll
    (integer) 3

    redis> PFADD some-other-hll 1 2 3
    (integer) 1

    redis> PFCOUNT hll some-other-hll
    (integer) 6


Performance
---------------

When PFCOUNT is called with a single key, performance is excellent even if, in theory, the constant time needed to process a dense HyperLogLog is high. This is possible because PFCOUNT caches the previously computed cardinality, which rarely changes because most PFADD operations will not update any register. Hundreds of operations per second are possible.

When PFCOUNT is called with multiple keys, an on-the-fly merge of the HyperLogLogs is performed, which is slow. Moreover, the cardinality of the union can't be cached, so when used with multiple keys PFCOUNT may take a time on the order of a millisecond and should not be overused.

The user should keep in mind that single-key and multiple-key executions of this command are semantically different and have different performance characteristics.


HyperLogLog representation
-------------------------------

Redis HyperLogLogs use one of two representations: a sparse representation suitable for HLLs counting a small number of elements (resulting in a small number of registers set to a non-zero value), and a dense representation suitable for higher cardinalities. Redis automatically switches from the sparse to the dense representation when needed.

The sparse representation uses a run-length encoding optimized to efficiently store a large number of registers set to zero. The dense representation is a Redis string of 12288 bytes that stores 16384 6-bit counters. The two representations are needed because using 12k (the memory requirement of the dense representation) to encode just a few registers for smaller cardinalities is extremely suboptimal.
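The 12288-byte figure follows from 16384 registers of 6 bits each (16384 * 6 / 8). A minimal sketch of how 6-bit counters can be packed into such a buffer is shown below. The least-significant-bit-first layout and the helper names ``get_register`` and ``set_register`` are assumptions made for illustration; the authoritative encoding is the one specified in hyperloglog.c.

```python
REGISTERS = 16384
BITS = 6
DENSE_BYTES = REGISTERS * BITS // 8   # 12288 bytes


def _window(buf: bytearray, byte: int) -> int:
    # 16-bit window so a register straddling a byte boundary is fully covered.
    high = buf[byte + 1] if byte + 1 < len(buf) else 0
    return buf[byte] | (high << 8)


def get_register(buf: bytearray, n: int) -> int:
    byte, shift = divmod(n * BITS, 8)
    return (_window(buf, byte) >> shift) & 0x3F


def set_register(buf: bytearray, n: int, value: int) -> None:
    byte, shift = divmod(n * BITS, 8)
    word = _window(buf, byte)
    # Clear the 6 target bits, then write the new value in their place.
    word = (word & ~(0x3F << shift)) | ((value & 0x3F) << shift)
    buf[byte] = word & 0xFF
    if byte + 1 < len(buf):
        buf[byte + 1] = (word >> 8) & 0xFF


dense = bytearray(DENSE_BYTES)
set_register(dense, 5, 42)
print(DENSE_BYTES, get_register(dense, 5))   # 12288 42
```

Packing registers back to back is what keeps the dense form at 12 KB; storing one byte per register would need 16 KB instead.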

Both representations are prefixed with a 16-byte header that includes a magic value, an encoding / version field, and the cached cardinality estimation, stored in little endian format (the most significant bit is 1 if the estimation is invalid because the HyperLogLog was updated after the cardinality was computed).
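As a rough illustration, the header could be decoded as follows. The exact field order used here (a 4-byte magic, a 1-byte encoding, 3 unused bytes, then an 8-byte little endian cardinality) is an assumption that is not stated on this page and should be checked against hyperloglog.c; ``parse_header`` is a hypothetical helper, and the ``HYLL`` magic in the usage lines is likewise assumed.

```python
import struct

# Assumed layout: magic (4 bytes), encoding (1), unused (3), cached cardinality (8).
HEADER = struct.Struct("<4sB3xQ")
INVALID_BIT = 1 << 63   # most significant bit marks a stale cached estimate


def parse_header(raw: bytes) -> dict:
    magic, encoding, card = HEADER.unpack_from(raw)
    return {
        "magic": magic,
        "encoding": encoding,
        "cache_valid": not (card & INVALID_BIT),
        "cached_cardinality": card & ~INVALID_BIT,
    }


sample = HEADER.pack(b"HYLL", 0, 7)       # hand-built header, not real server data
print(HEADER.size, parse_header(sample))
```

When ``cache_valid`` is false, a reader would have to recompute the estimate from the registers instead of trusting the stored value.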

The HyperLogLog, being a Redis string, can be retrieved with GET and restored with SET. Calling the PFADD, PFCOUNT or PFMERGE commands with a corrupted HyperLogLog is never a problem: the command may return random values, but the stability of the server is not affected. Most of the time, when a sparse representation is corrupted, the server recognizes the corruption and returns an error.

The representation is neutral with respect to processor word size and endianness, so the same representation is used by 32-bit and 64-bit processors, whether big endian or little endian.

More details about the Redis HyperLogLog implementation can be found in this blog post. The source code of the implementation in the hyperloglog.c file is also easy to read and understand, and includes a full specification for the exact encoding used for the sparse and dense representations.
31 changes: 31 additions & 0 deletions hyperloglog/pfmerge.rst
@@ -1,2 +1,33 @@
.. _pfmerge:

PFMERGE
===============

**PFMERGE destkey sourcekey [sourcekey ...]**

Merges multiple HyperLogLog values into a single value that approximates the cardinality of the union of the sets observed by the source HyperLogLog structures.

The computed merged HyperLogLog is stored at the destination key, which is created if it does not exist (defaulting to an empty HyperLogLog).
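Conceptually, merging HyperLogLogs amounts to taking the maximum of each pair of corresponding registers, which is the standard HyperLogLog union. The sketch below shows that idea on plain Python lists; ``pfmerge`` here is a hypothetical helper, not the Redis implementation, and it ignores the sparse and dense encodings.

```python
def pfmerge(*sources: list) -> list:
    """Toy merge: register-wise maximum of equally sized register arrays."""
    if not sources:
        return []
    size = len(sources[0])
    if any(len(s) != size for s in sources):
        raise ValueError("all HyperLogLogs must have the same number of registers")
    return [max(column) for column in zip(*sources)]


print(pfmerge([0, 3, 1], [2, 1, 1]))   # [2, 3, 1]
```

Because the maximum is idempotent, elements present in several sources are not counted twice in the merged estimate.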

**Available since:**
>= 2.8.9

**Time complexity:**
O(N) to merge N HyperLogLogs, but with high constant times.

**Return value:**
Simple string reply: the command just returns OK.

::

    redis> PFADD hll1 foo bar zap a
    (integer) 1

    redis> PFADD hll2 a b c foo
    (integer) 1

    redis> PFMERGE hll3 hll1 hll2
    OK

    redis> PFCOUNT hll3
    (integer) 6
