
Merge with PostgreSQL 9.5 (up to almost 9.6devel is stamped) #8509

Merged: 1,867 commits on Aug 29, 2019

Conversation

@pengzhout (Contributor)

Notable upstream changes:

* Allow INSERTs that would generate constraint conflicts to be turned
  into UPDATEs or ignored. The syntax is INSERT ... ON CONFLICT DO
  NOTHING/UPDATE. This is the Postgres implementation of the popular
  UPSERT command. Append-Only tables and ORCA are not yet supported.

* Add row-level security control so tables can have row security
  policies that restrict, on a per-user basis, which rows can be
  returned by normal queries or inserted, updated, or deleted by data
  modification commands.

* Add Block Range Indexes (BRIN), which store only summary data
  (such as minimum and maximum values) for ranges of heap blocks.
  Append-Only tables and ORCA are not yet supported.

* Substantial performance improvements: faster sorting of varchar,
  text, and numeric fields via "abbreviated" keys; better lock
  scalability, particularly addressing problems when running on
  systems with multiple CPU sockets; and improved performance of
  hash joins and bitmap index scans, among others.

* Allow setting multiple target columns in an UPDATE from the result of
  a single sub-SELECT. This is accomplished using the syntax UPDATE tab
  SET (col1, col2, ...) = (SELECT ...).

* Add SELECT option TABLESAMPLE to return a subset of a table.
  Append-Only tables and ORCA are not yet supported.

* Simplify the WAL record format, allowing external tools to more
  easily track which blocks are modified.
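As a quick illustration, the new upstream syntax from the list above
can be sketched as follows (table and column names are hypothetical):

```sql
-- Hypothetical heap table (Append-Only tables are not yet supported)
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric, owner text);

-- UPSERT: turn a conflicting INSERT into an UPDATE (or DO NOTHING)
INSERT INTO accounts (id, balance, owner)
VALUES (1, 100, 'alice')
ON CONFLICT (id) DO UPDATE SET balance = EXCLUDED.balance;

-- Set multiple target columns from a single sub-SELECT
UPDATE accounts
SET (balance, owner) = (SELECT 0, 'escrow')
WHERE id = 1;

-- TABLESAMPLE: read roughly 10% of the table's blocks
SELECT * FROM accounts TABLESAMPLE SYSTEM (10);

-- BRIN index storing only per-block-range summary data
CREATE INDEX ON accounts USING brin (id);
```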

Notable GPDB changes:

* GROUPING SETS is no longer MPP-aware and is not supported by hashagg.

  GPDB's own implementation of GROUPING SETS had a number of issues,
  ranging from assertion failures to incorrect results. Now that PG 9.5
  has introduced its own implementation of GROUPING SETS, CUBE and
  ROLLUP, we decided to replace ours with the PG 9.5 one. As discussed in
  https://groups.google.com/a/greenplum.org/d/topic/gpdb-dev/z9Ww2lU5_fE/discussion
  it's not MPP-aware and needs a gather node to bring all data to the
  QD node and perform the grouping there; it also only supports
  sorted aggregates for now. Follow-up commits will address those two
  problems.
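  For reference, the PG 9.5 syntax being adopted looks like this
  (illustrative table; note that, as described above, the plan
  currently gathers all rows to the QD and uses sorted aggregation):

```sql
-- One scan, several groupings; columns not grouped in a set are NULL
SELECT region, product, sum(amount)
FROM sales
GROUP BY GROUPING SETS ((region), (product), ());
```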

Co-authored-by: Asim R P <apraveen@pivotal.io>
Co-authored-by: Chris Hajas <chajas@pivotal.io>
Co-authored-by: David Kimura <dkimura@pivotal.io>
Co-authored-by: Georgios Kokolatos <gkokolatos@pivotal.io>
Co-authored-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
Co-authored-by: Jinbao Chen <jinchen@pivotal.io>
Co-authored-by: Jimmy Yih <jyih@pivotal.io>
Co-authored-by: Pengzhou Tang <ptang@pivotal.io>

simonat2ndQuadrant and others added 30 commits May 15, 2015 14:37
Add a TABLESAMPLE clause to SELECT statements that allows
users to specify random BERNOULLI sampling or block-level
SYSTEM sampling. The implementation allows extensible
sampling functions to be written, using a standard API.
The basic version follows the SQL standard exactly. Usable
concrete use cases for the sampling API follow in later
commits.

Petr Jelinek

Reviewed by Michael Paquier and Simon Riggs
Our previous code for GB18030 <-> UTF8 conversion only covered Unicode code
points up to U+FFFF, but the actual spec defines conversions for all code
points up to U+10FFFF.  That would be rather impractical as a lookup table,
but fortunately there is a simple algorithmic conversion between the
additional code points and the equivalent GB18030 byte patterns.  Make use
of the just-added callback facility in LocalToUtf/UtfToLocal to perform the
additional conversions.

Having created the infrastructure to do that, we can use the same code to
map certain linearly-related subranges of the Unicode space below U+FFFF,
allowing removal of the corresponding lookup table entries.  This more
than halves the lookup table size, which is a substantial savings;
utf8_and_gb18030.so drops from nearly a megabyte to about half that.

In support of doing that, replace ISO10646-GB18030.TXT with the data file
gb-18030-2000.xml (retrieved from
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ )
in which these subranges have been deleted from the simple lookup entries.

Per bug greenplum-db#12845 from Arjen Nienhuis.  The conversion code added here is
based on his proposed patch, though I whacked it around rather heavily.
Contrib module implementing a tablesample method
that allows you to limit the sample by a hard row
limit.

Petr Jelinek

Reviewed by Michael Paquier, Amit Kapila and
Simon Riggs
Contrib module implementing a tablesample method
that allows you to limit the sample by a hard time
limit.

Petr Jelinek

Reviewed by Michael Paquier, Amit Kapila and
Simon Riggs
Per compiler warnings.
For upcoming BRIN opclasses, it's convenient to have strategy numbers
defined in a single place.  Since there's nothing appropriate, create
it.  The StrategyNumber typedef now lives there, as well as existing
strategy numbers for B-trees (from skey.h) and R-tree-and-friends (from
gist.h).  skey.h is forced to include stratnum.h because of the
StrategyNumber typedef, but gist.h is not; extensions that currently
rely on gist.h for rtree strategy numbers might need to add a new
include.

A few .c files can stop including skey.h and/or gist.h, which is a nice
side benefit.

Per discussion:
https://www.postgresql.org/message-id/20150514232132.GZ2523@alvh.no-ip.org

Authored by Emre Hasegeli and Álvaro.

(It's not clear to me why bootscanner.l has any #include lines at all.)
Add a bit of coverage of high code points.

Arjen Nienhuis
This lets BRIN be used with R-Tree-like indexing strategies.

Also provided are operator classes for range types, box and inet/cidr.
The infrastructure provided here should be sufficient to create operator
classes for similar datatypes; for instance, opclasses for PostGIS
geometries should be doable, though we didn't try to implement one.

(A box/point opclass was also submitted, but we ripped it out before
commit because the handling of floating point comparisons in existing
code is inconsistent and would generate corrupt indexes.)

Author: Emre Hasegeli.  Cosmetic changes by me
Review: Andreas Karlsson
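A minimal sketch of what this enables, assuming the built-in 9.5
opclass names (range_inclusion_ops is the default for range types,
so it can be omitted):

```sql
-- BRIN index on a range column, using the R-Tree-like inclusion strategy
CREATE TABLE reservations (room int, during tsrange);
CREATE INDEX ON reservations USING brin (during);

-- Explicit inclusion opclass for inet
CREATE TABLE connections (client inet);
CREATE INDEX ON connections USING brin (client inet_inclusion_ops);
```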
DST law changes in Egypt, Mongolia, Palestine.
Historical corrections for Canada and Chile.
Revised zone abbreviation for America/Adak (HST/HDT not HAST/HADT).
This SQL-standard functionality allows aggregating data by several
different GROUP BY clauses at once. Each grouping set returns rows
with the columns grouped by in other sets set to NULL.

This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.

The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.

The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.

Instead of, as in earlier versions of the patch, representing the
chain of sort and aggregation steps as full-blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage.  The optimizer still builds a Sort/Agg node to describe each
phase, but they're not part of the plan tree; instead they're additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting.  The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains), and easily allows
skipping the sort step if the underlying data is sorted by other means.

A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, neither has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.

Needs a catversion bump because stored rules may change.

Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
    Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
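The precedence hack described above plays out like this (hypothetical
table t, with the cube extension installed):

```sql
-- CUBE here is the grouping-sets feature, not the cube() function
SELECT a, b, sum(c) FROM t GROUP BY CUBE (a, b);

-- To group by the extension's cube() function, quote the name
SELECT sum(c) FROM t GROUP BY "cube"(a);
```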
It's not very portable.  Per buildfarm.
This patch causes pg_upgrade to error out during its check phase if:

(1) template0 is marked connectable
or
(2) any other database is marked non-connectable

This is done because, in the first case, pg_upgrade would fail because
the pg_dumpall --globals restore would fail, and in the second case, the
database would not be restored, leading to data loss.

Report by Matt Landry (1), Stephen Frost (2)

Backpatch through 9.0
Previously, this prevented promoted standby servers from being upgraded
because of a missing WAL history file.  (Timeline 1 doesn't need a
history file, and we don't copy WAL files anyway.)

Report by Christian Echerer(?), Alexey Klyukin

Backpatch through 9.0
<float.h> is required for isinf() on some platforms.  Per buildfarm.
I don't think "respectfully" is what was meant here ...
As usual, the release notes for older branches will be made by cutting
these down, but put them up for community review first.
Dmitriy Olshevskiy
Clean up the Makefile, per Michael Paquier.

Classify REINDEX as we do in core, use '1.0' for the version, per Fujii.
hlinnaka and others added 16 commits June 28, 2015 22:30
… ERROR.

Seems like cheap insurance for WAL bugs. A spurious call to
XLogBeginInsert() in itself would be fairly harmless, but if there is any
data registered and the insertion is not completed/cancelled properly, there
is a risk that the data ends up in a wrong WAL record.

Per Jeff Janes's suggestion.
Oops. I could swear I built the docs before pushing, but I guess not..
When archive recovery and restartpoints were initially introduced,
checkpoint_segments was ignored on the grounds that the files restored from
archive don't consume any space in the recovery server. That was changed in
later releases, but even then it was arguably a feature rather than a bug,
as performing restartpoints as often as checkpoints during normal operation
might be excessive, but you might nevertheless not want to waste a lot of
space for pre-allocated WAL by setting checkpoint_segments to a high value.
But now that we have separate min_wal_size and max_wal_size settings, you
can bound WAL usage with max_wal_size, and still avoid consuming excessive
space usage by setting min_wal_size to a lower value, so that argument is
moot.

There are still some issues with actually limiting the space usage to
max_wal_size: restartpoints in recovery can only start after seeing the
checkpoint record, while a checkpoint starts flushing buffers as soon as
the redo-pointer is set. Restartpoints are paced to happen at the same
leisurely speed, determined by checkpoint_completion_target, as checkpoints,
but because they are started later, max_wal_size can be exceeded by up to
one checkpoint cycle's worth of WAL, depending on
checkpoint_completion_target. But that seems better than not trying at all,
and max_wal_size is a soft limit anyway.

The documentation already claimed that max_wal_size is obeyed in recovery,
so this just fixes the behaviour to match the docs. However, add some
weasel-words there to mention that max_wal_size may well be exceeded by
some amount in recovery.
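In postgresql.conf terms, the trade-off described above looks roughly
like this (values are illustrative, not recommendations):

```
max_wal_size = 2GB      # soft limit; recovery may exceed it by up to
                        # one checkpoint cycle's worth of WAL
min_wal_size = 256MB    # keep this much WAL pre-allocated
checkpoint_completion_target = 0.5
```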
As first committed, this view reported on the file contents as they were
at the last SIGHUP event.  That's not as useful as reporting on the current
contents, and what's more, it didn't work right on Windows unless the
current session had serviced at least one SIGHUP.  Therefore, arrange to
re-read the files when pg_show_all_settings() is called.  This requires
only minor refactoring so that we can pass changeVal = false to
set_config_option() so that it won't actually apply any changes locally.

In addition, add error reporting so that errors that would prevent the
configuration files from being loaded, or would prevent individual settings
from being applied, are visible directly in the view.  This makes the view
usable for pre-testing whether edits made in the config files will have the
desired effect, before one actually issues a SIGHUP.

I also added an "applied" column so that it's easy to identify entries that
are superseded by later entries; this was the main use-case for the original
design, but it seemed unnecessarily hard to use for that.

Also fix a 9.4.1 regression that allowed multiple entries for a
PGC_POSTMASTER variable to cause bogus complaints in the postmaster log.
(The issue here was that commit bf007a2 unintentionally reverted
3e3f659, which suppressed any duplicate entries within
ParseConfigFp.  However, since the original coding of the pg_file_settings
view depended on such suppression *not* happening, we couldn't have fixed
this issue now without first doing something with pg_file_settings.
Now we suppress duplicates by marking them "ignored" within
ProcessConfigFileInternal, which doesn't hide them in the view.)

Lesser changes include:

Drive the view directly off the ConfigVariable list, instead of making a
basically-equivalent second copy of the data.  There's no longer any need
to hang onto the data permanently, anyway.

Convert show_all_file_settings() to do its work in one call and return a
tuplestore; this avoids risks associated with assuming that the GUC state
will hold still over the course of query execution.  (I think there were
probably latent bugs here, though you might need something like a cursor
on the view to expose them.)

Arrange to run SIGHUP processing in a short-lived memory context, to
forestall process-lifespan memory leaks.  (There is one known leak in this
code, in ProcessConfigDirectory; it seems minor enough to not be worth
back-patching a specific fix for.)

Remove mistaken assignment to ConfigFileLineno that caused line counting
after an include_dir directive to be completely wrong.

Add missed failure check in AlterSystemSetConfigFile().  We don't really
expect ParseConfigFp() to fail, but that's not an excuse for not checking.
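With those changes, pre-testing config-file edits before issuing a
SIGHUP reduces to a query like the following (column names per the
reworked view):

```sql
-- Entries that failed to parse/apply, or are superseded by later ones
SELECT sourcefile, sourceline, name, setting, applied, error
FROM pg_file_settings
WHERE NOT applied OR error IS NOT NULL;
```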
Yeah, I know, pretty anal-retentive of me.  But we oughta find some
way to automate this for the .y and .l files.
Source-Git-URL: git://git.postgresql.org/git/pgtranslation/messages.git
Source-Git-Hash: fb7e72f46cfafa1b5bfe4564d9686d63a1e6383f
_Asm_sched_fence() is just a compiler barrier, not a memory barrier. But
spinlock release on IA64 needs, at the very least, release
semantics. Use a full barrier instead.

This might be the cause for the occasional failures on buildfarm member
anole.

Discussion: 20150629101108.GB17640@alap3.anarazel.de
Avoid memory leak from incorrect choice of how to free a StringInfo
(resetStringInfo doesn't do it).  Now that pg_split_opts doesn't scribble
on the optstr, mark that as "const" for clarity.  Attach the commentary in
protocol.sgml to the right place, and add documentation about the
user-visible effects of this change on postgres' -o option and libpq's
PGOPTIONS option.
Minor corrections and clarifications.  Notably, for stuff that got moved
out of contrib, make sure it's documented somewhere other than "Additional
Modules".

I'm sure these need more work, but that's all I have time for today.
…record.

I broke this with my WAL format refactoring patch. Before that, the metapage
was read from disk, and modified in-place regardless of the LSN. That was
always a bit silly, as there's no need to read the old page version from
disk when we're overwriting it anyway. So that was changed in 9.5, but
I failed to add a GinInitPage call to initialize the page headers correctly.
Usually you wouldn't notice, because the metapage is already in the page
cache and is not zeroed.

One way to reproduce this is to perform a VACUUM on an already vacuumed
table (so that the vacuum has no real work to do), immediately after a
checkpoint, and then perform an immediate shutdown. After recovery, the
page headers of the metapage will be incorrectly all-zeroes.

Reported by Jeff Janes
Without this, we might access memory that's already been freed, or
leak memory if in the C locale.

Peter Geoghegan
After calling XLogInitBufferForRedo(), the page might be all-zeros if it was
not in page cache already. btree_xlog_unlink_page initialized the page
correctly, but it called PageGetSpecialPointer before initializing it, which
would lead to a corrupt page at WAL replay, if the unlinked page is not in
page cache.

Backpatch to 9.4, the bug came with the rewrite of B-tree page deletion.
This seems useful to catch errors of the sort I just fixed, where
PageGetSpecialPointer is called before initializing the page.
Coverity rightly gripes that it's silly to have a test here when
the adjacent ExecEvalExpr() would choke on a NULL expression pointer.

Petr Jelinek
Apparently, this is needed in some Solaris versions.

Author: Oskari Saarenmaa
@pengzhout pengzhout closed this Aug 27, 2019
@pengzhout pengzhout reopened this Aug 27, 2019
@@ -0,0 +1,292 @@
resource_types:
- name: gcs
type: docker-image
Contributor:

gpdb_opensource_release.yml should not be added by this PR. It was moved from master to another repo by commit 8490aaa.

Contributor Author:

Thanks, will remove it.

@hlinnaka (Contributor)

Looks good at a quick glance. Thanks to everyone involved!

I'm glad we got the new WAL record format now. That simplifies pg_rewind a lot, and also removed the annoying diff we've been carrying in all our redo functions, where the arguments were slightly different from upstream's.

@ashwinstar (Contributor) commented Aug 27, 2019 via email

@yydzero commented Aug 28, 2019

Great achievement!

@pengzhout (Contributor Author)

I'm glad we got the new WAL record format now. That simplifies pg_rewind a lot, and also removed the annoying diff we've been carrying in all our redo functions, where the arguments were slightly different from upstream's.

Agreed, that's a very impressive change in the 9.5 merge, thanks for the great work.

@guofengrichard (Contributor)

I'm most excited that we aligned the implementation of grouping sets with upstream. That helps us get rid of lots of issues and makes the future life much easier.

@soumyadeep2007 (Contributor)

Do we need to address this issue before we merge the 9_5 merge PR into gpdb master?
greenplum-db/gpdb-postgres-merge#47

@pengzhout pengzhout merged commit 8ae22d1 into greenplum-db:master Aug 29, 2019
@pengzhout (Contributor Author)

Merged. Thanks for the reviews.
