Skip to content
Lukas Fittl edited this page Nov 3, 2022 · 50 revisions

Rationale

This document exists to enable implementation of a fingerprint method for a given parsetree that is compatible across different libraries (in different languages) that build upon libpg_query. Initially this is aimed to synchronize fingerprinting of pg_query (Ruby) and pg_query.go (Go).

The goal of fingerprinting is to allow you to identify similar queries that are different only because of the specific object that is being queried for (i.e. different user ids/etc), or because of formatting.

It is inspired by what Postgres' pg_stat_statements extension does in order to generate a queryid that is used to group similar queries together for statistics purposes.

Different from pg_stat_statements, we generate the fingerprint based on the raw parsetree, instead of the post-analysis parsetree. This means that we treat a query that is run on different servers as the same, even if the relation OIDs are different (pg_stat_statements would see different relation oids as different queries).

Version 0 (Ruby only, deprecated)

This is what pg_query (Ruby) used to do for fingerprinting. Its not portable and not advised for future usage.

Version 1 (based on PostgreSQL 9.5)

  • V1 Fingerprints are represented as 0x01 + SHA-1 sum, total size in bytes is 21, although fingerprints are usually returned as null-terminated hexstrings (43 bytes)
  • In general, all relevant tokens are collected into an array, and then a hash of all is computed using SHA-1
  • Nodes in the AST are traversed top-down
  • All nodes in a multi-statement AST are processed (so SELECT 1; SELECT 2; parsed as one statement will give you one fingerprint)
  • Node type is written first, in Postgres source convention (e.g. A_Const)
  • Then, in normal cases, going by alphabetic order of the field names:
    • In some cases, the field is not written at all:
      • nil values
      • 0 value for integer/byte values
      • false values for booleans
      • empty lists
      • empty strings
    • Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
    • Otherwise, first the field name is written (as it appears in the Postgres source in underscore_notation), followed by the field's value as one or more strings:
      • Value nodes will be output with their type and fields, e.g. a Float would be ["Float", "str", "0.5"] (not just "0.5")
      • "true" for booleans that are true
      • "123" for integer values
      • "0" for an enum value
      • Enums are written as their integer values
      • Node values are traversed immediately (so their tokens go first, fields that follow go afterwards)
  • Special cases
    • List nodes (including a simple array that contains nodes)
      • If parent field name is fromClause, targetList, cols or rexpr:
        • Fingerprint each node, and store the resulting tokens in an array
        • Ignore duplicate nodes (based on equivalent fingerprint tokens)
        • Sort nodes by its tokens (simple string comparison)
        • Write sorted tokens into the result hash
      • Otherwise just fingerprint each node in original order
    • Node types to be ignored
      • A_Const
      • Alias
      • ParamRef
      • SetToDefault
      • Value nodes of type IntList, OidList or Null
    • Attributes to be ignored
      • location (for all nodes)
      • ResTarget.name (if parent field name is targetList)
      • PrepareStmt.name
      • ExecuteStmt.name
      • DeallocateStmt.name
      • TransactionStmt.options
      • TransactionStmt.gid

Version 1.1 (based on PostgreSQL 9.5)

  • Attributes to be ignored: ResTarget.name should only be ignored if parent node is a SelectStmt and parent field name is targetList
    • In version 1 this was a problem with UpdateStmt in particular, where the name is significant

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Version 1.2 (based on PostgreSQL 9.5)

  • Attributes to be ignored:
    • DeclareCursorStmt.portalname
    • FetchStmt.portalname
    • ClosePortalStmt.portalname

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Version 1.3 (based on PostgreSQL 9.5)

  • Attributes to be ignored:
    • RangeVar.relname (if node also has RangeVar.relpersistence = "t")
  • Special cases: List nodes where parent field name is valuesLists
    • Follow same logic described for fromClause/targetList/cols/rexpr

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Version 2.0 (based on PostgreSQL 10)

  • Same fingerprinting rules as 1.3, but in addition:
  • Attributes to be ignored:
    • RawStmt.stmt_len
    • RawStmt.stmt_location

Version 3.0 (based on PostgreSQL 13)

Note: Please note that whilst version 3.0 is inspired by prior versions, the complete (updated) fingerprinting rules are repeated for this version. When implementing based on the new Protobuf parse tree format in libpg_query, take care of field names - they need to match Postgres source, which is represented in the Protobuf descriptor's json_name field names - the Protobuf field names themselves are often subtly different in terms of capitalization.

When the text below refers to a "Node", it references the C level struct in the Postgres parse tree. In the libpg_query JSON format this is represented as a JSON object, and in the libpg_query Protobuf format this is represented as a message.

  • πŸ†• V3 Fingerprints are represented as a 64 bit (8 byte) integer, commonly shown as a 16 character hex value when displayed
  • πŸ†• V3 Fingerprints are based on the xxHash XXH3 64-bit hash function, with the number 3 provided as a seed
  • Nodes in the parse tree are traversed top-down
  • πŸ†• Ignore any nodes more than 100 nodes deep (start counting depth with 0 at the top level, incrementing by 1 when going into a child node through a field - subsequent fields on a node do not increment the depth counter, nor do other nodes at the same level, e.g. in a list - if depth >= 100, ignore that node and any child nodes)
  • All nodes in a multi-statement AST are processed (so SELECT 1; SELECT 2; parsed as one statement will give you one fingerprint)
  • For each node, field names are handled in alphabetic order:
    • In some cases, the field is not written at all:
      • nil values
      • 0 value for integer/byte values
      • false values for booleans
      • empty lists
      • empty strings
    • Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
    • Otherwise, first the field name is written (as it appears in the Postgres source), followed by the field's value as one or more strings:
      • Value nodes will be output with their type and fields, e.g. a Float would be ["Float", "str", "0.5"] (not just "0.5")
      • "true" for booleans that are true
      • "123" for integer values
      • πŸ†• "ENUM_VALUE" for an enum value (as they show in the Postgres source)
      • Node values are traversed immediately (so the child struct goes first, fields that follow in the parent go afterwards)
  • Special cases
    • πŸ†• Node generic node type
      • Output the actual node type name (as represented in the Postgres source or libpg_query JSON output, e.g. SelectStmt), before handling other fields -- except, if the node type is List, in that case do not output the node type name
    • πŸ†• List node (note the JSON format often represents these as simple arrays)
      • If parent field name is fromClause, targetList, cols, rexpr, valuesLists or πŸ†• args:
        • Fingerprint each node, and store the resulting tokens, πŸ†• together with each node's hash value in an array
        • πŸ†• Ignore duplicate nodes (based on the node hash value being equivalent)
        • πŸ†• Sort the overall array by each node's hash value (ascending)
        • Write sorted tokens into the result hash
      • Otherwise just traverse each node in original order (and add fields continuously to the hash)
    • RangeVar node
      • relname:
        • Ignore completely if node has RangeVar.relpersistence = "t"
        • πŸ†• Otherwise, remove any two or more subsequent digits (s/\d{2,}//g) and write out the resulting value
    • πŸ†• A_Expr node
      • If kind is AEXPR_OP_ANY, write AEXPR_OP kind instead
      • If kind is AEXPR_IN, write AEXPR_OP kind instead
    • Node types to be ignored
      • A_Const
      • Alias
      • ParamRef
      • SetToDefault
      • Value nodes of type IntList, OidList or Null
      • πŸ†• TypeCast, if the arg is a node of type A_Const or ParamRef
    • Attributes to be ignored
      • location, for all nodes
      • ResTarget.name, if parent node is a SelectStmt and parent field name is targetList
      • PrepareStmt.name
      • ExecuteStmt.name
      • DeallocateStmt.name
      • TransactionStmt.options
      • TransactionStmt.gid
      • πŸ†• TransactionStmt.savepoint_name
      • DeclareCursorStmt.portalname
      • FetchStmt.portalname
      • ClosePortalStmt.portalname
      • RawStmt.stmt_len
      • RawStmt.stmt_location

Version 3.1 (based on PostgreSQL 13)

  • Same fingerprinting rules as 3.0, but in addition:
  • Consider lists that have a single NIL element as being present for determining whether the parent field should be included in the fingerprint. In particular, this matters for fingerprinting "SELECT DISTINCT" without columns specified.
  • Attributes to be ignored:
    • CreateFunctionStmt.options (and thus the function definition itself)
    • FunctionParameter.name (a function is deemed unique based on the function names and parameter types, not the parameter names)
    • DoStmt.args (this effectively groups all DO statements together, avoiding bloating query tracking with anonymous blocks that'd all be considered unique)
    • ListenStmt.conditionname, UnlistenStmt.conditionname and NotifyStmt.conditionname (this effectively groups all LISTEN, UNLISTEN and NOTIFY statements together - since channels are often made unique to notify about different per-session channel names, this can easily bloat query tracking)