Remove logging of monitor connection string with password #512

rheaton · 2020-11-16T18:08:21Z

In the cases when the connection string is parsable, we scrub and remove the password prior to logging in.
In the cases where the connection string is un-parsable, we have removed it from the log messages.

Additionally, this PR adds (C Unit tests)[https://libcheck.github.io/check/] that can be run via make test_unit

If we decide that we like these unit tests, we should integrate them into the travis builds.

Fixes #466

DimCitus

Thanks for working on this PR! We have some implementation details to change, as per my comments. The most important is that you removed the connection string entirely from some error messages, leaving our users in a position where they won't be able to understand and fix their situations.

Other than that, we don't need too free stack allocated memory buffers, those buffer initial value should be { 0 }, and I think we can do better with the “scrubbing” API. I think for the sake of user sanity, showing either stars or unicode bullets (•) would be best, so that they know we did parse and use their password and just don't want to expose their secrets in the clear in the logs.

About unicode bullets, unfortunately I am not sure if we can use that freely. Many systems didn't get the memo from 1991 yet. Same with IPv6. Let's not get started on SQL:2016.

DimCitus · 2020-11-17T10:31:08Z

src/bin/pg_autoctl/cli_create_drop_node.c

 		{
 			log_fatal("Failed to determine monitor hostname when parsing "
-					  "Postgres URI \"%s\"", config->monitor_pguri);
+					  "Postgres URI");


Why would we remove the connection string in the error message? How is the user going to figure out what's happening there in a way they can fix it?

I removed it here because I assumed that it was un-parsable. Do you think in the cases of un-parsable connection strings we should log it, even if it may reveal private information?

If we failed to parse the monitor Postgres URI, there is nothing that can work in the deployment, and we want the user to fix that. How can they fix the situation without us showing what's wrong?

So while I have some level of sympathy to hiding private information in the connection string, my personal opinion is that allowing easy understanding of what's wrong in order to fix it takes priority. We don't know if the URI contains sensitive information: we failed to parse it. Let's at least show what we failed to parse, hopefully making it obvious what is to be fixed.

Would a possible compromise be to print the mangled URI if and only if the logging level is debug or verbose? Most folks don't run in that configuration in production, and that's the place we feel that inapprioriately logging credentials is most sensitive. This way, if someone really needs to see what pg_auto_failover is tripping over, they can re-run in verbose mode, pick up the error, fix any mangling and also rotate credentials at that time?

I feel that this may be the right way to go, in that the biggest concern is credentials being silently printed to logs without the operator realizing. This way, we only print sensitive information to the log when we know that someone is explicitly choosing to look at the log.

A good first step would be a version of this PR that omits the controversial points, and then we can open a new PR for what's left in your audit.

I'll take a look and make the changes, I've been in the process of moving, so I haven't been very available lately. :)

DimCitus · 2020-11-17T10:32:19Z

src/bin/pg_autoctl/cli_do_tmux.c

+char pg_autoctl_argv0[MAXPGPATH];
+char pg_autoctl_program[MAXPGPATH];
+


I understand that with producing a new binary file we might need to move things in a shared space. Why cli_do_tmux.c though? This at least requires some comments about it, and I guess a better choice of a common module...

DimCitus · 2020-11-17T10:33:12Z

src/bin/pg_autoctl/cli_enable_disable.c

-			"The monitor SSL setup is ready and your current "
-			"connection string is \"%s\", you might need to update it",
-			config->monitor_pguri);
+			"The monitor SSL setup is ready, you might need to update it");


Again, why are we removing the connection string entirely here? If the user has to update the monitor URI in the configuration file, which URI should they write in there?

I will add it back for this one :)

Ah wait, this one is actually kind of surprising to me -- were we not able to parse the string? What are we actually doing here (I would expect it to return false).

parse_pguri_info_key_vals might have other bugs than a failure in PQconninfoParse...

DimCitus · 2020-11-17T10:33:49Z

src/bin/pg_autoctl/cli_enable_disable.c

 	}

-
+	char scrubbedConnectionString[MAXCONNINFO];


This should have a default value of { 0 }.

char scrubbedConnectionString[MAXCONNINFO] = { 0 };

DimCitus · 2020-11-17T10:34:41Z

src/bin/pg_autoctl/cli_enable_disable.c


-
+	char scrubbedConnectionString[MAXCONNINFO];
+	parse_and_scrub_connection_string(newPgURI, scrubbedConnectionString);


This API returns a boolean to signal if something went wrong, let's preferably use it, or if you believe we can bypass checking the result, add a (void) prefix before the function call to make that intend clear to readers.

In the case where we can't parse/scrub the connection string, what do you think would be the best option? Maybe print "Un-parsable connection string" instead?

Either the user is impacted by the error and we need to help them in their journey to setup our software correctly, or they are not impacted and should not be distracted by our internal limits. Are the users impacted by this failure here? How badly?

DimCitus · 2020-11-17T10:37:46Z

src/bin/pg_autoctl/keeper_pg_init.c

 	log_debug("pg exists: %s", postgresInstanceExists ? "yes" : "no");
 	log_debug("pg is primary: %s", postgresInstanceIsPrimary ? "yes" : "no");
-
+	PQfreemem(scrubbedConnectionString);


Again, no need.

DimCitus · 2020-11-17T10:38:31Z

src/bin/pg_autoctl/main.c

-char pg_autoctl_argv0[MAXPGPATH];
-char pg_autoctl_program[MAXPGPATH];
-int pgconnect_timeout = 2;      /* see also POSTGRES_CONNECT_TIMEOUT */
-
-char *ps_buffer;                /* will point to argv area */
-size_t ps_buffer_size;          /* space determined at run time */
-size_t last_status_len;         /* use to minimize length of clobber */
-
-Semaphore log_semaphore;        /* allows inter-process locking */
-


Another way to solve that problem would be to copy/paste the same definitions in the other main file that you introduce, in order to minimise changes?

DimCitus · 2020-11-17T10:39:17Z

src/bin/pg_autoctl/parsing.c

 	if (conninfo == NULL)
 	{
-		log_error("Failed to parse pguri \"%s\": %s", pguri, errmsg);
+		log_error("Failed to parse pguri: %s", errmsg);


So we failed to parse a connection string. Which connection string? How is the user going to be able to fix their setup now, if we refrain from showing them any clue?

Along with my comment above -- do you think it's worth printing it here, even if it may reveal a secret to the logs? I was torn.

How will the user fix the Postgres URI that we fail to parse if they don't know which URI it is? The only caller to that function is in cli_enable_disable in update_monitor_connection_string, which means the configuration file might be “corrupted” somehow. Yes, I think it's worth printing what we got.

DimCitus · 2020-11-17T10:52:09Z

src/bin/pg_autoctl/parsing.c

+/*
+ * remove_password_from_parameters is a non-exported function that will remove the keyword
+ * and value if the keyword matches "password" for scrubbing a connection string in logs.
+ */
+void
+remove_password_from_parameters(KeyVal *parameters)


I think there is a better way to approach this: what about introducing a KeyVal *secrets to the buildPostgresURIfromPieces API? When building the connection string it is then easy to replace secret parameters with a sequence of stars, or even a unicode bullet • (see https://en.wikipedia.org/wiki/Bullet_(typography)).

The KeyVal *secrets in buildPostgresURIfromPieces would take as input the keyword names that are secrets and implement the computation of their associated “bullets” value in the secrets->values[index] and use that in the final URI.

Hmm maybe we can use the existing overrides option in parse_pguri_info_key_vals as well -- let me try it.

Are there other secret parameters other than password?

Hmm looks like overrides get set even if the original parameters didn't have the value -- I'll try your idea :)

Failing test:

test_parse_and_scrub_connection_string_with_no_password_in_it:0: Assertion 'scrubbedPguri == "postgres://admin@monitor.local:9999/testdb?"' failed: scrubbedPguri == "postgres://admin@monitor.local:9999/testdb?password=****", "postgres://admin@monitor.local:9999/testdb?" == "postgres://admin@monitor.local:9999/testdb?"

It doesn't look like you did publish the new code that is failing the test, and I don't know if the test should be updated or the code fixed, here?

We implemented an update that uses overrides.

DimCitus · 2020-11-17T10:54:41Z

src/bin/pg_autoctl/parsing.c

+	parseUri = parse_pguri_info_key_vals(pguri, &overrides, &uriParams);
+	if (!parseUri)


Style issue: we would usually just test the return value from calling the function:

if (!parse_pguri_info_key_vals(pguri, &overrides, &uriParams)) { log_error(...); return false; }

For debugging purposes, it might be better to have the return value in a variable that we can inspect, in that case the variable name that we usually would prefer is success.

We updated to the inlined style.

rheaton · 2021-01-12T00:15:38Z

@DimCitus Hi Dimitri! I think we've addressed all of your concerns for this PR, but I am getting some flakiness on one of the python tests.

test_basic_operation.test_019_run_secondary is giving me the following issue:

======================================================================
ERROR: test_basic_operation.test_019_run_secondary
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/python/3.7.6/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/rheaton/pg_auto_failover/tests/test_basic_operation.py", line 226, in test_019_run_secondary
    assert node2.has_needed_replication_slots()
  File "/home/travis/build/rheaton/pg_auto_failover/tests/pgautofailover_utils.py", line 1480, in has_needed_replication_slots
    current_slots = self.list_replication_slot_names()
  File "/home/travis/build/rheaton/pg_auto_failover/tests/pgautofailover_utils.py", line 1460, in list_replication_slot_names
    raise e
  File "/home/travis/build/rheaton/pg_auto_failover/tests/pgautofailover_utils.py", line 1456, in list_replication_slot_names
    result = self.run_sql_query(query)
  File "/home/travis/build/rheaton/pg_auto_failover/tests/pgautofailover_utils.py", line 362, in run_sql_query
    with psycopg2.connect(self.connection_string()) as conn:
  File "/opt/python/3.7.6/lib/python3.7/site-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  sorry, too many clients already

Do you know of connections that might be left open? The test passes when I run test_basic_operation in isolation, but fails when in suite (make run-test TEST=single or make run-test). Any ideas of where I should start hunting around for this? I'm not seeing anything intuitively that would have trigger this failure in this PR.

DimCitus · 2021-01-14T15:09:40Z

Hi folks, @rheaton,

I just did a review. There are still some small problems, that we could fix very easily (log messages, variable default assignment styles, etc). That said I would like to first review/finish/merge the password editing parts, and then in a separate PR think about the added unit tests. Can you please split the work there in two?

rheaton · 2021-01-25T19:05:43Z

@DimCitus I will split it into another PR -- based off this branch. We did use the unit-tests to TDD some of the code and only added the missing ones for characterization tests to understand how things were working. I did not attempt to backfill all unit-tests at once.

- Tests added for hapostgres#512 - Some Characterization tests added

rheaton · 2021-01-26T00:44:29Z

@DimCitus I've pulled out the unit tests (and I'll deal with that PR once we are through this one 🙂). I'm still getting the nose test failure from test 019. Same issue as before. Any idea what could have been added that would introduce the "too many connections" issue? #512 (comment)

Additionally, let's get those other small changes in!

Thank you,
Rachel

rheaton · 2021-04-13T20:53:38Z

@DimCitus I've made updates to this PR to make it as slim as possible. There is some more work to be done to completely close out the issue, but implementation probably should be discussed.
In the situation where there could be a failure to get the scrubbed uri, I've logged a message slightly differently. In most cases, the function won't return false.
Let me know if you need anything else updated. Thanks!

DimCitus

Looks pretty good to me, I think with a pass of make indent and the Semaphore declaration clean-up we should be good to merge!

DimCitus · 2021-04-19T13:11:07Z

src/bin/pg_autoctl/lock_utils.h

 	int semId;
 } Semaphore;

+Semaphore log_semaphore;        /* allows inter-process locking */


Why is this change part of this PR?

DimCitus · 2021-04-19T13:11:27Z

src/bin/pg_autoctl/main.c

-Semaphore log_semaphore;        /* allows inter-process locking */
-
-


Same, I don't think this PR should change that.

rheaton · 2021-04-19T18:32:05Z

@DimCitus I'm not seeing any changes with make indent and I've pushed up the removal of the unnecessary change

DimCitus

Lots of whitespace changes that I am not comfortable with, still. The most important are around the static SQL string definitions, but also white lines that have no reason to be added or removed. It might be just a setup or tooling problem? you might have to use a different editor to finish up the PR unfortunately...

DimCitus · 2021-04-20T08:49:51Z

src/bin/pg_autoctl/keeper_pg_init.c

+	if (!parse_and_scrub_connection_string(config->monitor_pguri,
+										   scrubbedConnectionString))
+	{
+		log_error("Monitor connection string is unparseable");


Better written in the same style as the other errors messages, such as: "Failed to parse the monitor connection string".

DimCitus · 2021-04-20T08:50:06Z

src/bin/pg_autoctl/parsing.c

+	/* */
+	/* Look for a populated password connection parameter */
+	/* */


Strange formatting.

src/bin/pg_autoctl/parsing.c

DimCitus · 2021-04-20T08:50:59Z

src/bin/pg_autoctl/parsing.c

+		if (
+			strcmp(option->keyword, "password") == 0 &&
+			option->val != NULL &&
+			strcmp(option->val, "") != 0)


Maybe we should use IS_EMPTY_STRING_BUFFER here?

DimCitus · 2021-04-20T08:51:44Z

src/bin/pg_autoctl/pgsql.c

 #define ERRCODE_OBJECT_IN_USE "55006"

 static char * ConnectionTypeToString(ConnectionType connectionType);
+


Why does this PR changes the white lines here, and in the following lines in this file?

DimCitus · 2021-04-20T08:52:20Z

src/bin/pg_autoctl/pgsql.c

-	char *sqlTemplate =
-		/*
-		 * We could simplify the writing of this query, but we prefer that it
-		 * looks as much as possible like the query used in
-		 * pgsql_replication_slot_maintain() so that we can maintain both
-		 * easily.
-		 */
-		"WITH nodes(slot_name, lsn) as ("
-		" SELECT '" REPLICATION_SLOT_NAME_DEFAULT "_' || id, lsn"
-		"   FROM (%s) as sb(id, lsn) "
-		"), \n"
-		"dropped as ("
-		" SELECT slot_name, pg_drop_replication_slot(slot_name) "
-		"   FROM pg_replication_slots pgrs LEFT JOIN nodes USING(slot_name) "
-		"  WHERE nodes.slot_name IS NULL "
-		"    AND (   slot_name ~ '" REPLICATION_SLOT_NAME_PATTERN "' "
-		"         OR slot_name ~ '" REPLICATION_SLOT_NAME_DEFAULT "' )"
-		"    AND not active"
-		"    AND slot_type = 'physical'"
-		"), \n"
-		"created as ("
-		"SELECT c.slot_name, c.lsn "
-		"  FROM nodes LEFT JOIN pg_replication_slots pgrs USING(slot_name), "
-		"       LATERAL pg_create_physical_replication_slot(slot_name, true) c"
-		" WHERE pgrs.slot_name IS NULL "
-		") \n"
-		"SELECT 'create', slot_name, lsn FROM created "
-		" union all "
-		"SELECT 'drop', slot_name, NULL::pg_lsn FROM dropped";
+    char *sqlTemplate =
+            /*
+             * We could simplify the writing of this query, but we prefer that it
+             * looks as much as possible like the query used in
+             * pgsql_replication_slot_maintain() so that we can maintain both
+             * easily.
+             */
+            "WITH nodes(slot_name, lsn) as ("
+            " SELECT '" REPLICATION_SLOT_NAME_DEFAULT "_' || id, lsn"
+            "   FROM (%s) as sb(id, lsn) "
+            "), \n"
+            "dropped as ("
+            " SELECT slot_name, pg_drop_replication_slot(slot_name) "
+            "   FROM pg_replication_slots pgrs LEFT JOIN nodes USING(slot_name) "
+            "  WHERE nodes.slot_name IS NULL "
+            "    AND (   slot_name ~ '" REPLICATION_SLOT_NAME_PATTERN "' "
+            "         OR slot_name ~ '" REPLICATION_SLOT_NAME_DEFAULT "' )"
+            "    AND not active"
+            "    AND slot_type = 'physical'"
+            "), \n"
+            "created as ("
+            "SELECT c.slot_name, c.lsn "
+            "  FROM nodes LEFT JOIN pg_replication_slots pgrs USING(slot_name), "
+            "       LATERAL pg_create_physical_replication_slot(slot_name, true) c"
+            " WHERE pgrs.slot_name IS NULL "
+            ") \n"
+            "SELECT 'create', slot_name, lsn FROM created "
+            " union all "
+            "SELECT 'drop', slot_name, NULL::pg_lsn FROM dropped";