Skip to content

Commit

Permalink
Enhance nbtree index tuple deletion.
Browse files Browse the repository at this point in the history
Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions.  This is "bottom-up
deletion".  Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page.  This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718b).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates.  It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_.  Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index.  The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs.  That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing.  This increases the number of index tuples deleted
significantly in many cases.  The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples.  We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages.  That is, it almost never fails
to delete at least a few extra index tuples.  It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples.  It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples.  It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly).  Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes.  Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade.  The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
  • Loading branch information
petergeoghegan committed Jan 13, 2021
1 parent 9dc718b commit d168b66
Show file tree
Hide file tree
Showing 19 changed files with 2,112 additions and 442 deletions.
152 changes: 124 additions & 28 deletions doc/src/sgml/btree.sgml
Original file line number Diff line number Diff line change
Expand Up @@ -629,6 +629,109 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</sect2>

<sect2 id="btree-deletion">
<title>Bottom-up index deletion</title>
<para>
B-Tree indexes are not directly aware that under MVCC, there might
be multiple extant versions of the same logical table row; to an
index, each tuple is an independent object that needs its own index
entry. <quote>Version churn</quote> tuples may sometimes
accumulate and adversely affect query latency and throughput. This
typically occurs with <command>UPDATE</command>-heavy workloads
where most individual updates cannot apply the
<acronym>HOT</acronym> optimization. Changing the value of only
one column covered by one index during an <command>UPDATE</command>
<emphasis>always</emphasis> necessitates a new set of index tuples
&mdash; one for <emphasis>each and every</emphasis> index on the
table. Note in particular that this includes indexes that were not
<quote>logically modified</quote> by the <command>UPDATE</command>.
All indexes will need a successor physical index tuple that points
to the latest version in the table. Each new tuple within each
index will generally need to coexist with the original
<quote>updated</quote> tuple for a short period of time (typically
until shortly after the <command>UPDATE</command> transaction
commits).
</para>
<para>
B-Tree indexes incrementally delete version churn index tuples by
performing <firstterm>bottom-up index deletion</firstterm> passes.
Each deletion pass is triggered in reaction to an anticipated
<quote>version churn page split</quote>. This only happens with
indexes that are not logically modified by
<command>UPDATE</command> statements, where concentrated build up
of obsolete versions in particular pages would occur otherwise. A
page split will usually be avoided, though it's possible that
certain implementation-level heuristics will fail to identify and
delete even one garbage index tuple (in which case a page split or
deduplication pass resolves the issue of an incoming new tuple not
fitting on a leaf page). The worst case number of versions that
any index scan must traverse (for any single logical row) is an
important contributor to overall system responsiveness and
throughput. A bottom-up index deletion pass targets suspected
garbage tuples in a single leaf page based on
<emphasis>qualitative</emphasis> distinctions involving logical
rows and versions. This contrasts with the <quote>top-down</quote>
index cleanup performed by autovacuum workers, which is triggered
when certain <emphasis>quantitative</emphasis> table-level
thresholds are exceeded (see <xref linkend="autovacuum"/>).
</para>
<note>
<para>
Not all deletion operations that are performed within B-Tree
indexes are bottom-up deletion operations. There is a distinct
category of index tuple deletion: <firstterm>simple index tuple
deletion</firstterm>. This is a deferred maintenance operation
that deletes index tuples that are known to be safe to delete
(those whose item identifier's <literal>LP_DEAD</literal> bit is
already set). Like bottom-up index deletion, simple index
deletion takes place at the point that a page split is anticipated
as a way of avoiding the split.
</para>
<para>
Simple deletion is opportunistic in the sense that it can only
take place when recent index scans set the
<literal>LP_DEAD</literal> bits of affected items in passing.
Prior to <productname>PostgreSQL</productname> 14, the only
category of B-Tree deletion was simple deletion. The main
differences between it and bottom-up deletion are that only the
former is opportunistically driven by the activity of passing
index scans, while only the latter specifically targets version
churn from <command>UPDATE</command>s that do not logically modify
indexed columns.
</para>
</note>
<para>
Bottom-up index deletion performs the vast majority of all garbage
index tuple cleanup for particular indexes with certain workloads.
This is expected with any B-Tree index that is subject to
significant version churn from <command>UPDATE</command>s that
rarely or never logically modify the columns that the index covers.
The average and worst case number of versions per logical row can
be kept low purely through targeted incremental deletion passes.
It's quite possible that the on-disk size of certain indexes will
never increase by even one single page/block despite
<emphasis>constant</emphasis> version churn from
<command>UPDATE</command>s. Even then, an exhaustive <quote>clean
sweep</quote> by a <command>VACUUM</command> operation (typically
run in an autovacuum worker process) will eventually be required as
a part of <emphasis>collective</emphasis> cleanup of the table and
each of its indexes.
</para>
<para>
Unlike <command>VACUUM</command>, bottom-up index deletion does not
provide any strong guarantees about how old the oldest garbage
index tuple may be. No index can be permitted to retain
<quote>floating garbage</quote> index tuples that became dead prior
to a conservative cutoff point shared by the table and all of its
indexes collectively. This fundamental table-level invariant makes
it safe to recycle table <acronym>TID</acronym>s. This is how it
is possible for distinct logical rows to reuse the same table
<acronym>TID</acronym> over time (though this can never happen with
two logical rows whose lifetimes span the same
<command>VACUUM</command> cycle).
</para>
</sect2>

<sect2 id="btree-deduplication">
<title>Deduplication</title>
<para>
Expand Down Expand Up @@ -666,15 +769,17 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</note>
<para>
The deduplication process occurs lazily, when a new item is
inserted that cannot fit on an existing leaf page. This prevents
(or at least delays) leaf page splits. Unlike GIN posting list
tuples, B-Tree posting list tuples do not need to expand every time
a new duplicate is inserted; they are merely an alternative
physical representation of the original logical contents of the
leaf page. This design prioritizes consistent performance with
mixed read-write workloads. Most client applications will at least
see a moderate performance benefit from using deduplication.
Deduplication is enabled by default.
inserted that cannot fit on an existing leaf page, though only when
index tuple deletion could not free sufficient space for the new
item (typically deletion is briefly considered and then skipped
over). Unlike GIN posting list tuples, B-Tree posting list tuples
do not need to expand every time a new duplicate is inserted; they
are merely an alternative physical representation of the original
logical contents of the leaf page. This design prioritizes
consistent performance with mixed read-write workloads. Most
client applications will at least see a moderate performance
benefit from using deduplication. Deduplication is enabled by
default.
</para>
<para>
<command>CREATE INDEX</command> and <command>REINDEX</command>
Expand Down Expand Up @@ -702,25 +807,16 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
deduplication isn't usually helpful.
</para>
<para>
B-Tree indexes are not directly aware that under MVCC, there might
be multiple extant versions of the same logical table row; to an
index, each tuple is an independent object that needs its own index
entry. <quote>Version duplicates</quote> may sometimes accumulate
and adversely affect query latency and throughput. This typically
occurs with <command>UPDATE</command>-heavy workloads where most
individual updates cannot apply the <acronym>HOT</acronym>
optimization (often because at least one indexed column gets
modified, necessitating a new set of index tuple versions &mdash;
one new tuple for <emphasis>each and every</emphasis> index). In
effect, B-Tree deduplication ameliorates index bloat caused by
version churn. Note that even the tuples from a unique index are
not necessarily <emphasis>physically</emphasis> unique when stored
on disk due to version churn. The deduplication optimization is
selectively applied within unique indexes. It targets those pages
that appear to have version duplicates. The high level goal is to
give <command>VACUUM</command> more time to run before an
<quote>unnecessary</quote> page split caused by version churn can
take place.
It is sometimes possible for unique indexes (as well as unique
constraints) to use deduplication. This allows leaf pages to
temporarily <quote>absorb</quote> extra version churn duplicates.
Deduplication in unique indexes augments bottom-up index deletion,
especially in cases where a long-running transactions holds a
snapshot that blocks garbage collection. The goal is to buy time
for the bottom-up index deletion strategy to become effective
again. Delaying page splits until a single long-running
transaction naturally goes away can allow a bottom-up deletion pass
to succeed where an earlier deletion pass failed.
</para>
<tip>
<para>
Expand Down
36 changes: 29 additions & 7 deletions doc/src/sgml/ref/create_index.sgml
Original file line number Diff line number Diff line change
Expand Up @@ -386,17 +386,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
<para>
The fillfactor for an index is a percentage that determines how full
the index method will try to pack index pages. For B-trees, leaf pages
are filled to this percentage during initial index build, and also
are filled to this percentage during initial index builds, and also
when extending the index at the right (adding new largest key values).
If pages
subsequently become completely full, they will be split, leading to
gradual degradation in the index's efficiency. B-trees use a default
fragmentation of the on-disk index structure. B-trees use a default
fillfactor of 90, but any integer value from 10 to 100 can be selected.
If the table is static then fillfactor 100 is best to minimize the
index's physical size, but for heavily updated tables a smaller
fillfactor is better to minimize the need for page splits. The
other index methods use fillfactor in different but roughly analogous
ways; the default fillfactor varies between methods.
</para>
<para>
B-tree indexes on tables where many inserts and/or updates are
anticipated can benefit from lower fillfactor settings at
<command>CREATE INDEX</command> time (following bulk loading into the
table). Values in the range of 50 - 90 can usefully <quote>smooth
out</quote> the <emphasis>rate</emphasis> of page splits during the
early life of the B-tree index (lowering fillfactor like this may even
lower the absolute number of page splits, though this effect is highly
workload dependent). The B-tree bottom-up index deletion technique
described in <xref linkend="btree-deletion"/> is dependent on having
some <quote>extra</quote> space on pages to store <quote>extra</quote>
tuple versions, and so can be affected by fillfactor (though the effect
is usually not significant).
</para>
<para>
In other specific cases it might be useful to increase fillfactor to
100 at <command>CREATE INDEX</command> time as a way of maximizing
space utilization. You should only consider this when you are
completely sure that the table is static (i.e. that it will never be
affected by either inserts or updates). A fillfactor setting of 100
otherwise risks <emphasis>harming</emphasis> performance: even a few
updates or inserts will cause a sudden flood of page splits.
</para>
<para>
The other index methods use fillfactor in different but roughly
analogous ways; the default fillfactor varies between methods.
</para>
</listitem>
</varlistentry>
Expand Down
Loading

0 comments on commit d168b66

Please sign in to comment.