Merge branch 'master' of github.com:zalando/patroni into feature/quorum-commit
CyberDem0n committed Jul 13, 2023
2 parents 2dafb37 + a4d29eb commit b0d8b21
Showing 33 changed files with 425 additions and 97 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -173,4 +173,4 @@ jobs:

- uses: jakebailey/pyright-action@v1
with:
version: 1.1.316
version: 1.1.317
44 changes: 44 additions & 0 deletions docs/releases.rst
@@ -3,6 +3,50 @@
Release notes
=============

Version 3.0.4
-------------

**New features**

- Make the replication status of standby nodes visible (Alexander Kukushkin)

For PostgreSQL 9.6+, Patroni reports the replication state as ``streaming`` when the standby is streaming from another node, or ``in archive recovery`` when there is no replication connection and ``restore_command`` is set. The state is visible in the ``member`` keys in DCS, in the REST API, and in the ``patronictl list`` output.
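
As an illustration, the new field can be read directly from a member's REST API. A minimal sketch using only the Python standard library, assuming a local Patroni instance listening on ``127.0.0.1:8008``::

    import json
    from urllib.request import urlopen

    # Query the local Patroni REST API (address and port are example values).
    with urlopen('http://127.0.0.1:8008/patroni') as resp:
        status = json.loads(resp.read().decode('utf-8'))

    # On PostgreSQL 9.6+ standbys the payload carries the new field:
    # 'streaming' or 'in archive recovery'; it is absent on the primary.
    print(status.get('replication_state'))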


**Improvements**

- Improved error messages with Etcd v3 (Alexander Kukushkin)

When an Etcd v3 cluster wasn't accessible, Patroni was reporting that it couldn't access the ``/v2`` endpoints.

- Use quorum reads in ``patronictl`` when possible (Alexander Kukushkin)

Etcd or Consul clusters could be degraded to read-only, but from the ``patronictl`` point of view everything looked fine. Now it fails with an error.

- Prevent splitbrain from duplicate names in configuration (Mark Pekala)

On startup, Patroni checks whether a node with the same name is already registered in DCS and tries to query its REST API. If the REST API is accessible, Patroni exits with an error. This helps protect against human error.

- Start Postgres not in recovery if it crashed while Patroni is running (Alexander Kukushkin)

It may reduce recovery time and helps avoid unnecessary timeline increments.


**Bugfixes**

- REST API SSL certificates were not reloaded upon receiving a SIGHUP (Israel Barth Rubio)

The regression was introduced in 3.0.3.

- Fixed integer GUCs validation for parameters like ``max_connections`` (Feike Steenbergen)

Patroni didn't accept quoted numeric values. The regression was introduced in 3.0.3.

- Fixed an issue with ``synchronous_mode`` (Alexander Kukushkin)

Execute ``txid_current()`` with ``synchronous_commit=off`` so it doesn't accidentally wait for absent synchronous standbys when ``synchronous_mode_strict`` is enabled.
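
For reference, the behaviour this relies on can be reproduced outside Patroni with a few lines of psycopg2 (a hedged sketch, not Patroni's actual call path; the connection string is a placeholder)::

    import psycopg2

    # Placeholder DSN; adjust for your environment.
    conn = psycopg2.connect('dbname=postgres user=postgres host=127.0.0.1')
    conn.autocommit = True
    with conn.cursor() as cur:
        # With synchronous_commit disabled for this session, the call below
        # does not wait for (possibly absent) synchronous standbys.
        cur.execute("SET synchronous_commit TO 'off'")
        cur.execute('SELECT pg_catalog.txid_current()')
        print(cur.fetchone()[0])
    conn.close()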


Version 3.0.3
-------------

10 changes: 8 additions & 2 deletions docs/rest_api.rst
@@ -142,10 +142,10 @@ Retrieve the Patroni metrics in Prometheus format through the ``GET /metrics`` e
# HELP patroni_replica Value is 1 if this node is a replica, 0 otherwise.
# TYPE patroni_replica gauge
patroni_replica{scope="batman"} 0
# HELP patroni_sync_standby Value is 1 if this node is a sync standby, 0 otherwise.
# HELP patroni_sync_standby Value is 1 if this node is a sync standby replica, 0 otherwise.
# TYPE patroni_sync_standby gauge
patroni_sync_standby{scope="batman"} 0
# HELP patroni_quorum_standby Value is 1 if this node is a quorum standby, 0 otherwise.
# HELP patroni_quorum_standby Value is 1 if this node is a quorum standby replica, 0 otherwise.
# TYPE patroni_quorum_standby gauge
patroni_quorum_standby{scope="batman"} 0
# HELP patroni_xlog_received_location Current location of the received Postgres transaction log, 0 if this node is not a replica.
@@ -160,6 +160,12 @@ Retrieve the Patroni metrics in Prometheus format through the ``GET /metrics`` e
# HELP patroni_xlog_paused Value is 1 if the Postgres xlog is paused, 0 otherwise.
# TYPE patroni_xlog_paused gauge
patroni_xlog_paused{scope="batman"} 0
# HELP patroni_postgres_streaming Value is 1 if Postgres is streaming, 0 otherwise.
# TYPE patroni_postgres_streaming gauge
patroni_postgres_streaming{scope="batman"} 1
# HELP patroni_postgres_in_archive_recovery Value is 1 if Postgres is replicating from archive, 0 otherwise.
# TYPE patroni_postgres_in_archive_recovery gauge
patroni_postgres_in_archive_recovery{scope="batman"} 0
# HELP patroni_postgres_server_version Version of Postgres (if running), 0 otherwise.
# TYPE patroni_postgres_server_version gauge
patroni_postgres_server_version {scope="batman"} 140004
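
A quick way to eyeball the two new gauges without a full Prometheus setup is to scrape the endpoint directly. This is an illustrative sketch; the address is an assumption and the gauge names come from the diff above::

    from urllib.request import urlopen

    # Fetch the Prometheus-format metrics from a local Patroni instance.
    with urlopen('http://127.0.0.1:8008/metrics') as resp:
        text = resp.read().decode('utf-8')

    for line in text.splitlines():
        # Keep only the replication-state gauges introduced in this change.
        if line.startswith(('patroni_postgres_streaming',
                            'patroni_postgres_in_archive_recovery')):
            print(line)
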
11 changes: 5 additions & 6 deletions features/basic_replication.feature
@@ -72,14 +72,13 @@ Feature: basic replication
Then table bar is present on postgres1 after 20 seconds
And Response on GET http://127.0.0.1:8010/config contains master_start_timeout after 10 seconds

Scenario: check immediate failover when master_start_timeout=0
Given I kill postmaster on postgres2
Then postgres1 is a leader after 10 seconds
And postgres1 role is the primary after 10 seconds

Scenario: check rejoin of the former primary with pg_rewind
Given I add the table splitbrain to postgres0
And I start postgres0
Then postgres0 role is the secondary after 20 seconds
When I add the table buz to postgres1
When I add the table buz to postgres2
Then table buz is present on postgres0 after 20 seconds

Scenario: check graceful rejection when two nodes have the same name
Given I start duplicate postgres0 on port 8011
Then there is a "Can't start; there is already a node named 'postgres0' running" CRITICAL in the dup-postgres0 patroni log
2 changes: 2 additions & 0 deletions features/citus.feature
@@ -16,6 +16,7 @@ Feature: citus
Scenario: coordinator failover updates pg_dist_node
Given I run patronictl.py failover batman --group 0 --candidate postgres1 --force
Then postgres1 role is the primary after 10 seconds
And "members/postgres0" key in a group 0 in DCS has state=running after 15 seconds
And replication works from postgres1 to postgres0 after 15 seconds
And postgres1 is registered in the postgres2 as the primary in group 0 after 5 seconds
And "sync" key in a group 0 in DCS has sync_standby=postgres0 after 15 seconds
@@ -31,6 +32,7 @@ Feature: citus
When I run patronictl.py switchover batman --group 1 --force
Then I receive a response returncode 0
And postgres3 role is the primary after 10 seconds
And "members/postgres2" key in a group 1 in DCS has state=running after 15 seconds
And replication works from postgres3 to postgres2 after 15 seconds
And postgres3 is registered in the postgres0 as the primary in group 1 after 5 seconds
And "sync" key in a group 1 in DCS has sync_standby=postgres2 after 15 seconds
12 changes: 9 additions & 3 deletions features/environment.py
@@ -52,10 +52,9 @@ def start(self, max_wait_limit=5):
self._log = open(os.path.join(self._output_dir, self._name + '.log'), 'a')
self._handle = self._start()

assert self._has_started(), "Process {0} is not running after being started".format(self._name)

max_wait_limit *= self._context.timeout_multiplier
for _ in range(max_wait_limit):
assert self._has_started(), "Process {0} is not running after being started".format(self._name)
if self._is_accessible():
break
time.sleep(1)
@@ -344,6 +343,13 @@ def backup(self, dest=os.path.join('data', 'basebackup')):
'--datadir=' + os.path.join(self._work_directory, dest),
'--dbname=' + self.backup_source])

def read_patroni_log(self, level):
try:
with open(str(os.path.join(self._output_dir or '', self._name + ".log"))) as f:
return [line for line in f.readlines() if line[24:24 + len(level)] == level]
except IOError:
return []


class ProcessHang(object):

@@ -827,7 +833,7 @@ def start(self, name, max_wait_limit=40, custom_config=None):

def __getattr__(self, func):
if func not in ['stop', 'query', 'write_label', 'read_label', 'check_role_has_changed_to',
'add_tag_to_config', 'get_watchdog', 'patroni_hang', 'backup']:
'add_tag_to_config', 'get_watchdog', 'patroni_hang', 'backup', 'read_patroni_log']:
raise AttributeError("PatroniPoolController instance has no attribute '{0}'".format(func))

def wrapper(name, *args, **kwargs):
12 changes: 6 additions & 6 deletions features/patroni_api.feature
@@ -35,21 +35,21 @@ Scenario: check local configuration reload
Then I receive a response code 202

Scenario: check dynamic configuration change via DCS
Given I run patronictl.py edit-config -s 'ttl=10' -p 'max_connections=101' --force batman
Then I receive a response returncode 0
And I receive a response output "+ttl: 10"
Given I issue a PATCH request to http://127.0.0.1:8008/config with {"ttl": 20, "postgresql": {"parameters": {"max_connections": "101"}}}
Then I receive a response code 200
And Response on GET http://127.0.0.1:8008/patroni contains pending_restart after 11 seconds
When I issue a GET request to http://127.0.0.1:8008/config
Then I receive a response code 200
And I receive a response ttl 10
And I receive a response ttl 20
When I issue a GET request to http://127.0.0.1:8008/patroni
Then I receive a response code 200
And I receive a response tags {'new_tag': 'new_value'}
And I sleep for 4 seconds

Scenario: check the scheduled restart
Given I issue a PATCH request to http://127.0.0.1:8008/config with {"postgresql": {"parameters": {"superuser_reserved_connections": "6"}}}
Then I receive a response code 200
Given I run patronictl.py edit-config -p 'superuser_reserved_connections=6' --force batman
Then I receive a response returncode 0
And I receive a response output "+ superuser_reserved_connections: 6"
And Response on GET http://127.0.0.1:8008/patroni contains pending_restart after 5 seconds
Given I issue a scheduled restart at http://127.0.0.1:8008 in 5 seconds with {"role": "replica"}
Then I receive a response code 202
24 changes: 24 additions & 0 deletions features/recovery.feature
@@ -0,0 +1,24 @@
Feature: recovery
We want to check that crashed postgres is started back

Scenario: check that timeline is not incremented when primary is started after crash
Given I start postgres0
Then postgres0 is a leader after 10 seconds
And there is a non empty initialize key in DCS after 15 seconds
When I start postgres1
And I add the table foo to postgres0
Then table foo is present on postgres1 after 20 seconds
When I kill postmaster on postgres0
Then postgres0 role is the primary after 10 seconds
When I issue a GET request to http://127.0.0.1:8008/
Then I receive a response code 200
And I receive a response role master
And I receive a response timeline 1

Scenario: check immediate failover when master_start_timeout=0
Given I issue a PATCH request to http://127.0.0.1:8008/config with {"master_start_timeout": 0}
Then I receive a response code 200
And Response on GET http://127.0.0.1:8008/config contains master_start_timeout after 10 seconds
When I kill postmaster on postgres0
Then postgres1 is a leader after 10 seconds
And postgres1 role is the primary after 10 seconds
10 changes: 10 additions & 0 deletions features/standby_cluster.feature
@@ -13,6 +13,10 @@ Feature: standby cluster
When I start postgres0
Then "members/postgres0" key in DCS has state=running after 10 seconds
And replication works from postgres1 to postgres0 after 15 seconds
When I issue a GET request to http://127.0.0.1:8008/patroni
Then I receive a response code 200
And I receive a response replication_state streaming
And "members/postgres0" key in DCS has replication_state=streaming after 10 seconds

@slot-advance
Scenario: check permanent logical slots are synced to the replica
@@ -34,6 +38,9 @@ Feature: standby cluster
Then postgres1 is a leader of batman1 after 10 seconds
When I add the table foo to postgres0
Then table foo is present on postgres1 after 20 seconds
When I issue a GET request to http://127.0.0.1:8009/patroni
Then I receive a response code 200
And I receive a response replication_state streaming
And I sleep for 3 seconds
When I issue a GET request to http://127.0.0.1:8009/primary
Then I receive a response code 503
@@ -44,6 +51,9 @@ Feature: standby cluster
When I start postgres2 in a cluster batman1
Then postgres2 role is the replica after 24 seconds
And table foo is present on postgres2 after 20 seconds
When I issue a GET request to http://127.0.0.1:8010/patroni
Then I receive a response code 200
And I receive a response replication_state streaming
And postgres1 does not have a logical replication slot named test_logical

Scenario: check failover
23 changes: 23 additions & 0 deletions features/steps/basic_replication.py
@@ -9,6 +9,22 @@ def start_patroni(context, name):
return context.pctl.start(name)


@step('I start duplicate {name:w} on port {port:d}')
def start_duplicate_patroni(context, name, port):
config = {
"name": name,
"restapi": {
"listen": "127.0.0.1:{0}".format(port)
}
}
try:
context.pctl.start('dup-' + name, custom_config=config)
assert False, "Process was expected to fail"
except AssertionError as e:
assert 'is not running after being started' in str(e),\
"No error was raised by duplicate start of {0} ".format(name)


@step('I shut down {name:w}')
def stop_patroni(context, name):
return context.pctl.stop(name, timeout=60)
@@ -90,3 +106,10 @@ def replication_works(context, primary, replica, time_limit):
When I add the table test_{0} to {1}
Then table test_{0} is present on {2} after {3} seconds
""".format(int(time()), primary, replica, time_limit))


@then('there is a "{message}" {level:w} in the {node} patroni log')
def check_patroni_log(context, message, level, node):
messages_of_level = context.pctl.read_patroni_log(node, level)
assert any(message in line for line in messages_of_level),\
"There was no {0} {1} in the {2} patroni log".format(message, level, node)
22 changes: 21 additions & 1 deletion patroni/__main__.py
@@ -30,12 +30,15 @@ def __init__(self, config: 'Config') -> None:

self.version = __version__
self.dcs = get_dcs(self.config)
self.request = PatroniRequest(self.config, True)

self.ensure_unique_name()

self.watchdog = Watchdog(self.config)
self.load_dynamic_configuration()

self.postgresql = Postgresql(self.config['postgresql'])
self.api = RestApiServer(self, self.config['restapi'])
self.request = PatroniRequest(self.config, True)
self.ha = Ha(self)

self.tags = self.get_tags()
Expand All @@ -60,6 +63,23 @@ def load_dynamic_configuration(self) -> None:
logger.warning('Can not get cluster from dcs')
time.sleep(5)

def ensure_unique_name(self) -> None:
"""A helper method to prevent splitbrain from operator naming error."""
from patroni.dcs import Member

cluster = self.dcs.get_cluster()
if not cluster:
return
member = cluster.get_member(self.config['name'], False)
if not isinstance(member, Member):
return
try:
_ = self.request(member, endpoint="/liveness")
logger.fatal("Can't start; there is already a node named '%s' running", self.config['name'])
sys.exit(1)
except Exception:
return

def get_tags(self) -> Dict[str, Any]:
return {tag: value for tag, value in self.config.get('tags', {}).items()
if tag not in ('clonefrom', 'nofailover', 'noloadbalance', 'nosync') or value}
25 changes: 22 additions & 3 deletions patroni/api.py
@@ -542,6 +542,18 @@ def do_GET_metrics(self) -> None:
metrics.append("patroni_xlog_paused{0} {1}"
.format(scope_label, int(postgres.get('xlog', {}).get('paused', False) is True)))

if postgres.get('server_version', 0) >= 90600:
metrics.append("# HELP patroni_postgres_streaming Value is 1 if Postgres is streaming, 0 otherwise.")
metrics.append("# TYPE patroni_postgres_streaming gauge")
metrics.append("patroni_postgres_streaming{0} {1}"
.format(scope_label, int(postgres.get('replication_state') == 'streaming')))

metrics.append("# HELP patroni_postgres_in_archive_recovery Value is 1"
" if Postgres is replicating from archive, 0 otherwise.")
metrics.append("# TYPE patroni_postgres_in_archive_recovery gauge")
metrics.append("patroni_postgres_in_archive_recovery{0} {1}"
.format(scope_label, int(postgres.get('replication_state') == 'in archive recovery')))

metrics.append("# HELP patroni_postgres_server_version Version of Postgres (if running), 0 otherwise.")
metrics.append("# TYPE patroni_postgres_server_version gauge")
metrics.append("patroni_postgres_server_version {0} {1}".format(scope_label, postgres.get('server_version', 0)))
@@ -1159,8 +1171,11 @@ def get_postgresql_status(self, retry: bool = False) -> Dict[str, Any]:

if postgresql.state not in ('running', 'restarting', 'starting'):
raise RetryFailedError('')
replication_state = ('(pg_catalog.pg_stat_get_wal_receiver()).status'
if postgresql.major_version >= 90600 else 'NULL') + ", " +\
("pg_catalog.current_setting('restore_command')" if postgresql.major_version >= 120000 else "NULL")
stmt = ("SELECT " + postgresql.POSTMASTER_START_TIME + ", " + postgresql.TL_LSN + ","
" pg_catalog.pg_last_xact_replay_timestamp(),"
" pg_catalog.pg_last_xact_replay_timestamp(), " + replication_state + ","
" pg_catalog.array_to_json(pg_catalog.array_agg(pg_catalog.row_to_json(ri))) "
"FROM (SELECT (SELECT rolname FROM pg_catalog.pg_authid WHERE oid = usesysid) AS usename,"
" application_name, client_addr, w.state, sync_state, sync_priority"
@@ -1196,8 +1211,12 @@ def get_postgresql_status(self, retry: bool = False) -> Dict[str, Any]:
if not cluster or cluster.is_unlocked() or not cluster.leader else cluster.leader.timeline
result['timeline'] = postgresql.replica_cached_timeline(leader_timeline)

if row[7]:
result['replication'] = row[7]
replication_state = postgresql.replication_state_from_parameters(row[1] > 0, row[7], row[8])
if replication_state:
result['replication_state'] = replication_state

if row[9]:
result['replication'] = row[9]

except (psycopg.Error, RetryFailedError, PostgresConnectionException):
state = postgresql.state
3 changes: 2 additions & 1 deletion patroni/ctl.py
@@ -1490,7 +1490,8 @@ def output_members(obj: Dict[str, Any], cluster: Cluster, name: str,
* ``Role``: ``Leader``, ``Standby Leader``, ``Sync Standby`` or ``Replica``;
* ``State``: ``stopping``, ``stopped``, ``stop failed``, ``crashed``, ``running``, ``starting``,
``start failed``, ``restarting``, ``restart failed``, ``initializing new cluster``, ``initdb failed``,
``running custom bootstrap script``, ``custom bootstrap failed``, or ``creating replica``, and so on;
``running custom bootstrap script``, ``custom bootstrap failed``, ``creating replica``, ``streaming``,
``in archive recovery``, and so on;
* ``TL``: current timeline in Postgres;
``Lag in MB``: replication lag.
