Skip to content

Conversation

@DinoBektesevic
Copy link
Contributor

As was suggested to me by @r-owen and @timj the draft PR for S3Datastore and PostgreSqlRegistry together was too large and clunky to be reasonable. That Draft PR #147 is now closed. Since all the comments I received about the PostgreSqlRegistry implementation were down the line "please remove it" I am opening a new draft PR request separate to that of S3Datastore.

For the most part the inclusion of a PostgreSqlRegistry seems equivalent/analogous to OracleRegistry with the exception of authorization. Specifically there are 2 authorizations that need to happen, one is to the AWS service, in the form of security policies and groups (f.e. if your IP is not white-listed outside source you will be denied access to the DB, or if your IAM is not permitted access etc..) and the second one is authorizing to the database itself. The latter is currently done by reading the credentials from ~/.rds/credentials file and then substituting the correct username and password according to the nick given in the connection string.

A PosixDatastore with an RDS backed PostgresSqlRegistry test was left in because I have noticed certain slowness when running the tests that I have not seen for Sqlite registry. I am willing to make arrangements for access to people that are willing to try it out. Additionally, it outlines the neccessary events that would have to occur if actual PostgreSqlRegistry would be tested (I hope).

Comments and suggestions welcomed.

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for rearranging things such that it's clear what's needed for postgres support. My main comment is that I don't think we are ready to accept the test code. daf_butler tests do not yet require access to a remote database in order to run and I see no code here that skips the tests if RDS is not available. We do not test with Oracle in daf_butler either. I think this means that we need separate integration tests that can be set up to run with different database backends but that should not be included here.

I'm happy to merge the first non-test code once we understand how credentials in external text files should be handled.

parsed = urlparse.urlparse(constr)
if parsed.username and (parsed.password is None):
localconf = configparser.ConfigParser()
localconf.read(os.path.expanduser('~/.rds/credentials'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktlim probably has an opinion on the contents of this credentials file but I'm fine with this for now. We have historically been using ~/.lsst/db-auth.paf for database access but I don't know if this is something we wish to continue with. For example should the credentials be in ~/.rds/ and ~/.oracle or be in a single file inside a ~/.lsst or ~/.butler directory? Also, why are postgres credentials in an .rds directory? If this is RDS specific then why is this class named after postgres?

cc/ @MichelleGower

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for the two review comments all others relate to test_butler code. As I mentioned in the PR - that code exists only as a quick way for anyone else that perhaps wants to take the PostgreSqlRegistry for a spin and because I thought it was a shame to completely "forget" about the steps I needed to take in test set-up. Considering the comment:

... happy to merge the first non-test code once we understand how credentials in external text files should be handled ...

I wonder wouldn't it be better to just totally remove the piece of test-code in a new commit where I sort out the authorization issues? This would leave the code in history (at least on my end) and simultaneously get rid of most of the complaints in this Draft-PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want an environment variable that you can set which points to the config file more than you want to put it in a well defined place. This sucks from a dev/prod database perspective if wanting to run code against both instances from the same home filesystem because you're going to be doing a bunch of symlink gymnastics. You can still have a default, which should probably be under ~/.lsst, as it's not a Postgres config file (e.g. .pgpass file)

A postgres-specific solution would be to use postgres norms via env vars or a .pgpass file:
https://www.postgresql.org/docs/9.3/libpq-envars.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read up on all of the suggestions made here and on the Slack channel yesterday. To summarize, the following suggestions were made:

  • Postgres env vars
    • not recommended by postgres manual as they say on some systems anyone can see any env variable.
  • .paf, .conf file in ~/.rds, ~/.postgres, ~/.lsst or ~/.butler directory
    • the way its done now
    • I have a feeling (from @ktlim and @timj) that its .paf file using something built in the stack like the DBAuth code would be most preferred
    • but I would still like some clarification on where exactly should I put this: ~/.lsst or ~/.butler? ~/.rdsor ~/.postgres seem to be not as popular
  • .pgpass file
    • I can set the PGPASSFILE env var to point to ~/.lsst or ~/.butler directories so it works well with the file (above) approach
    • the format of entries in the file is given as hostname:port:database:username:password so it makes no sense to keep the current connection string format intact. I could change the current connection string to postgresql://user:pass@hostname/dbname and read the port and the password for the matching username, hostname and dbname from .pgpass file. This is also good as it simplifies the connection string.
    • but at some point if I want to connect to the DB using SQLAlchemy I still have to go there, read the file as plaintext and make substitutions in the connection string. So on face value it doesn't seem any different than the other file based approaches.
    • The difference is only visible if from terminal one invokes a clean :~$ psql command as that is when postgres will natively go there and read the file if no user and pass env vars are set. If there's only one entry in the file it will connect to that directly otherwise --username --host and --dbname have to be given.
    • I'm not counting the env var as a win because that's easy to implement and depends on exactly the same questions as .pgpass does: when and where is the variable exported and to which location should it point to?
  • vault + confd
    • This requires a running vault server, which is simple enough I guess. I'm just confused about the general approach I guess: should there be one central vault, is everybody supposed to run their own, where is the vault in the AWS then, how do tests work... A lot of things I know not a lot about.
    • Then the templatefiles need to be created by confd, which I guess could be a part of a startup script somewhere (where and when?)
    • but then again, every time I want to connect using SQLAlchemy I would have to run something like:
      ROOT_TOKEN=$(vault read -field id auth/token/lookup-self)
      
      confd -onetime -backend vault -node http://127.0.0.1:8200 \
            -auth-type token -auth-token $ROOT_TOKEN
      
      and would again have to read the files and make string substitutions to SQLAlchemy as was the case in all the options so far. Nobody can read the values in the files, even if they could access them, unless they luck out and look at them post token-substitutions? I guess it does get me more flexibility, especially if I run in deamon mode that will update the files if the back-end password changes.

The dot files are plain text, but they are not public in the sense that they need to be in 600 or 400 permission set. All of the answers seem to boil down to reading the dotfile and making plain string substitutions. It was my understanding that the main problem was reading/keeping usernames and passwords floating about in plain text around the code. So I guess I'm confused where the security comes in to play in any of the suggestions, short of relying on OS privileges? (or maybe this is "the security", I don't know a lot about this stuff so someone needs to point it out to me)

The passwords are read prior to creating the connection string and passed on to SQLAlchemy as soon as possible (which then hides them again). This was the reasoning behind duplicating all the read-password code. If I write a function that will read and return username and password isn't that more dangerous than just doing it in place like it is? This is what, I though, I needed to avoid.

I think I prefer the .pgpass file with PGPASSFILE, by default, pointing to one of the ~/.lsst or ~/.butler directories by which is read by the LSST's DBAuth code approach the best. It gets me everything regular dotfile approach does but also integrates well with native psql commands. While we're at it, we can change the location of the AWS credentials file to be in the same default location as the rest of them by setting AWS_SHARED_CREDENTIALS_FILE. Any thoughts, comments and answers very welcomed! (especially, I think, input from @brianv0 might be very illuminating to me)

Thank you all for your help!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with a postgres file in a ~/.lsst directory. It's really important that you don't duplicate all the code for handling the password reading in multiple places. The more places you have it the more places where a mistake can be made. There is nothing wrong I can see with using function calls to obtain information and then inserting the answers into your connection string. You should check that the file has the correct restricted permissions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the work is focused on this comment. I have implemented a ConnectionStringBuilder class in lsst.daf.butler.core.registry. It constructs the connection string as generally as possible from the given config and db-auth.paf file. All tests pass (including the RDSButlerTestCase which was updated in previous commit but removed from the latest one).

I think I might have over-extended the level of changes you requested but I wanted to avoid repeat of the experience with URI handling from before. I have followed the SQLAlchemy manual that cites RFC-1738 and all the default values are set equal to those SQLAlchemy Dialect part of manual states. This should hopefully enable the construction of maximally general connection strings so all cases should be covered.

I was not 100% sure how Oracle connection string has been supplied but I suspect that they use conection strings without user and pass and then leave Oracle to work out the security. So I left that as an keyword option in ConnectionStringBuilder.fromConfig.

I have left questions in form of comments in the ConnectionStringBuilder class that I can remove once I hear if you like this or not.

@timj
Copy link
Member

timj commented May 28, 2019

Removing the test code in a separate commit is fine with me, although it might be worth deferring that removal until we have sorted out the code duplication so that you can still test what you have whilst you are developing.

@DinoBektesevic DinoBektesevic force-pushed the u/DinoBektesevic/rds_registry branch 4 times, most recently from 3b9653a to c2ccee4 Compare July 12, 2019 22:32
@timj timj requested a review from andy-slac July 26, 2019 21:10
Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By @timj invite I looked at the code and left a bunch of comments. I tried to get some context for the proposed changes but could not gather much (is there a ticket for this in JIRA?) One thing that I do not quite like is an addition of overrides for URL pieces in registry config, I think this adds too much complexity to the code with unknown (to me) benefits. Could we still stick with URL-only connection configuration? As @brianv0 mentioned in one of the comments DbAuth use is problematic with its hard-coded location of config file, it would be better to have more flexibility there (I think ticket for daf_persistence is the right thing to do). And I think that inheritance for new registry class needs to be changed.


class ConnectionStringBuilder:
"""Builds a connection string by parsing (in order of precedence):
I) RegistryConfig for explicitly defined keys,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is a new addition to the configuration? Why do we need explicit keys, I thought that URL-style connection string is already flexible enough for anything that might be needed. I think having more than one way to specify things adds unnecessary complexity, would be better to keep things simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know there's a strict requirement that the db string be an URL and it was not obvious that this is the case from conversations or other PRs. I wanted more flexibility. The current changes are more in line with what you said.

def fromConfig(cls, config, withPass=True):
conKwargs = cls._dictFromRegistryConf(config)
validationKeys = list(cls.requiredKeys)
constr = "{dialect}+{driver}://{username}:{password}@{host}:{port}{dbname}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think port number should be optional in configuration, and I believe conKwargs['port'] will be None in that case, would it be rendered as None on output too?

Copy link
Contributor Author

@DinoBektesevic DinoBektesevic Jul 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially, DBAuth matches on host and port so in some cases port becomes mandatory. I don't believe in the previous case it would have been rendered as "None" but as "". This was used in cases where dbname was the path to a file as then it would, usually, get shifted leftwards so file-targeting DB connection strings would work too.

In any case, not important in the reworked version. Added a test that expects the correct result when no port, no username and no password are provided for good measure.


regConf['host'] = 'some.host.com:1234/dbname'
with self.assertRaises(ValueError):
ConnectionStringBuilder.fromConfig(regConf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add test case when port number is not specified in configuration (for non-sqlite case)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that but I'm not sure what the usecase is. Will username and password be required in this non-sqlite case? How would I use the DBAuth then? Does this still apply?

@DinoBektesevic DinoBektesevic force-pushed the u/DinoBektesevic/rds_registry branch from c2ccee4 to 8cb727f Compare July 30, 2019 05:26
@timj
Copy link
Member

timj commented Aug 5, 2019

@ktlim has agreed to look at this pull request.

@DinoBektesevic DinoBektesevic force-pushed the u/DinoBektesevic/rds_registry branch from f6c14c6 to 82b9723 Compare August 6, 2019 01:31
@DinoBektesevic DinoBektesevic marked this pull request as ready for review August 13, 2019 22:04
@DinoBektesevic DinoBektesevic force-pushed the u/DinoBektesevic/rds_registry branch from 234a56f to 409f3e2 Compare August 13, 2019 22:05
@DinoBektesevic DinoBektesevic changed the title PostgreSqlRegistry Add support for PostgreSQL Registry Aug 13, 2019
@DinoBektesevic
Copy link
Contributor Author

At PCW 2019 DM Hack session it was agreed upon, between @ktlim, @andy-slac and me, that correct registry class should be resolved based on the dialect part of the connection string.

TO DO: resolve correct Registry class to use directly from connection string.

@timj
Copy link
Member

timj commented Aug 14, 2019

My main concern with deriving the class from the connection string is that we still allow for the mapping from connection string to class to be a configuration option.

cc/ @TallJimbo

@TallJimbo
Copy link
Member

My main concern with deriving the class from the connection string is that we still allow for the mapping from connection string to class to be a configuration option.

Was the second clause of that sentence supposed to be slightly different? What else would the connection string be other than a configuration option? (Sorry if that's a silly question; I've thus far avoided getting involved in the connection/authentication configuration discussions.)

@DinoBektesevic
Copy link
Contributor Author

DinoBektesevic commented Aug 14, 2019

What else would the connection string be other than a configuration option? (Sorry if that's a silly question; I've thus far avoided getting involved in the connection/authentication configuration discussions.)

Right now the config looks like:

registry:
  cls: lsst.daf.butler.registries.sqliteRegistry.SqliteRegistry
  db: 'sqlite:///:memory:'

the idea is that we kick the cls out and infer the correct cls from the db string, which is possible, but not quite how configs are currently used in the stack. (If I understood what was said to me correctly). Namely in Butler.makeRepo:

datastoreClass = doImport(full["datastore", "cls"])
datastoreClass.setConfigRoot(BUTLER_ROOT_TAG, config, full, overwrite=forceConfigRoot)
registryClass = doImport(full["registry", "cls"])
registryClass.setConfigRoot(BUTLER_ROOT_TAG, config, full, overwrite=forceConfigRoot)

@timj
Copy link
Member

timj commented Aug 14, 2019

@TallJimbo my concern is that I'm not sure I want python code to be the place where something says "if this looks like a sqlite connection string then import the sqlite registry". Maybe everyone else does want that? It means that no-one could specify their own override sqlite registry implementation but maybe we don't want that? Deriving the class from the connection string does have the advantage that you aren't ever going to specify the oracle registry with a postgres connection string.

I was suggesting we have an extra yaml configuration section something like:

registryMap:
  sqlite: lsst.daf.butler.registry.sqliteRegistry
  postgresql: lsst.daf.butler...
  oracle: lsst.daf.butler...

Then in theory someone could experiment with different classes. It does mean that the connection string parser can only support a limited set of keys in registryMap.

@TallJimbo
Copy link
Member

That works for me. Another approach that I think would provide the same flexibility would be to retain cls, but make it optional and look in the connection string only if it is not provided.

@timj
Copy link
Member

timj commented Aug 14, 2019

@TallJimbo allowing cls for overrides sounds good to me, but still needs the registryMap concept to exist in the YAML if you want to avoid hard-coding the mapping in the python.

@TallJimbo
Copy link
Member

I was thinking it'd be fine to hard-code the mapping if you had a way to override it, but I guess that's not quite as flexible (presumably the need to change this means there are some databases that can't use cls for some reason?).

@DinoBektesevic
Copy link
Contributor Author

DinoBektesevic commented Aug 14, 2019

(presumably the need to change this means there are some databases that can't use cls for some reason?).

Any class importable with doImport would work. I thought that it was to avoid information duplication such as pointing to a wrong class while using a wrong connection string. Just in case I got it wrong, could @ktlim or @andy-slac elaborate on the reasons further?

EDIT: is this really the best place for this discussion or should it move to slack?

@andy-slac
Copy link
Contributor

The concern is indeed to avoid duplication and keep things consistent. URL string should be enough to identify everything that code needs, I don't think we'll have different implementations for the same backend (ideally we would need only one SQL implementation with SQLAlchemy handling all backend peculiarities). A agree that for developers it may be needed to override the class name for particular backend. I think Tim's map is probably cleanest way to do that (and cls override would not be needed if we have a map), what worries me is that configuration becomes more and more involved and if defaults are specifies in the config file and not in the code then it effectively means that we depend on configuration in the same way as the code. I'd probably like more if we had defaults in the code (maybe as a default configuration rather than hard-coded if ... else ...). So if defaults are in the code then having cls override may be easier to manage than a map. Sorry if this is not super-decisive, I don't know an answer that can satisfy all possible use cases.

@DinoBektesevic DinoBektesevic force-pushed the u/DinoBektesevic/rds_registry branch from 409f3e2 to 328a794 Compare August 14, 2019 17:16
@TallJimbo
Copy link
Member

Sorry to derail this a bit by asking everyone to catch me up on the motivation.

I certainly don't have terribly strong opinions about this, but I think the simplest thing that does everything we might want is @timj's addendum (adding a mapping from connection substring to registry class to the config). I think we can revisit whether we need even that later; I may have a bit more clarity in the next month or so on how much we will actually be able to collapse the Registry subclasses down by relying more on SQLAlchemy and avoiding database-specific syntax in our use of it.

@DinoBektesevic
Copy link
Contributor Author

So I added automatic resolution of correct Registry class based on the dialect found in the connection string. This is still an early commit, waiting for the dbAuth to get finalized before connection string works as agreed upon, but it could be useful to open this matter for early discussion.

Specifically, is there a better way to read the registryMap.yaml config file? Should the registryMap even be a key in registryMap.yaml config file?

@timj
Copy link
Member

timj commented Aug 15, 2019

Regarding registryMap. How about we put it in config/registry.yaml. Replacing:

registry:
  cls: lsst.daf.butler.registries.sqliteRegistry.SqliteRegistry
  ...

with

registry:
   clsMap: 
      sqlite: lsst.daf.butler.registries.sqliteRegistry.SqliteRegistry
      oracle: lsst.daf.butler.registries.sqliteRegistry.OracleRegistry

@DinoBektesevic
Copy link
Contributor Author

@timj Sorry, got distracted and forgot about this response. I'm not sure how that would work in the global scheme of things since the configs in the daf_butler/config are considered defaults so when no registry is explicitly given in the "custom" butler.yaml then the registry is inherited from there.

See the full config in Butler.makeRepo. I don't know if that's what you intended, or the intention was to just add clsMap key to the default registry.yaml config?

@DinoBektesevic DinoBektesevic force-pushed the u/DinoBektesevic/rds_registry branch from 63e02eb to 09c817a Compare August 29, 2019 21:42
@DinoBektesevic
Copy link
Contributor Author

DinoBektesevic commented Aug 29, 2019

I changed the rebase so the odd changes don't appear anymore. I'm not sure if this was the right way of doing it but nothing else seems broken?

# becomes the database, but we are not treating 'localhost' explicitly
host = conStr.host
if conStr.host is None:
host = str(conStr.database)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still dislike this. If there is no host, we shouldn't set one, especially not to the database name. I'd just return conStr here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a relic from before when I used to let the db string be just the host or a path and the cls key to set the dialect. If the db was a path then it would become the database name, but if it was an url, the host. Since I was treating the host as the target location previously I had to make that switch. Since we kicked out cls and I discovered URL (thanks to Andy) I believe it's not needed now at all so I removed it.

This affects the username and password setting issue I asked you about. I don't want to return the connection string here because there are single file databases that support username and passwords (LiteDB, RaptorDB, sqlite has an extension that supports usernames and passwords). I don't know/think that they would be ever used but it's been made clear to me that I should try to be as general as possible (but also that it's ok not to do it for sqlite).

So I left it as is, but I take it, from the early return, that you're not fine with not using DbAuth when username+password are given in db connection string or as explicit keys in the config file for single file dbs?

Copy link
Contributor

@ktlim ktlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

Code was separated from a Draft-PR which contained both S3Datastore
and PostgreSqlRegistry into a self-standing PostgreSqlRegistry code
only.
- transitioned connection strings to
  ConnectionStringBuilder
- added tests for ConnectionStringBuilder
- added class attributes for desired default
  values of dialect and driver

Removed RDSButlerTestCase.
Rebased onto master.
Rebased onto master.
Removed RDS tests and supporting files.
ConnectionSttring classes.

Removed RegistryConfig 'cls' key from default configs.
The registry class is determined through the 'db' string except
in the cases when the 'cls' class is explicitly stated.
This includes ignoring the cls key in registry setConfigRoot.
Omit explicitly setting cls values.
Remove ConnectionStringClass. Reduce code duplication by
moving the dialect and correct registry class doImport
statements to Registry. Fixed tests.
Correct class resolution moved to Registry. Removed that logic
from Butler.makeRepo and Registry.fromConfig.

ConnectionStringFactory is now used only internaly in the Registry.
Removed the connection string handling from the registry subclasses.

Removed the ConnectionString classes. ConnectionStringFactory now
produces an SQLAlchemy URL instances. Added checks to avoid
reading username and password when not required.

Fixed OracleRegistry misshandling of config keys in
OracleRegistry.setConfigRoot.

Changed the create_engine calls on all registry subclasses to
reflect the changes.

Docs.
ConnectionStringFactory.fromConfig is now a classmethod.
Fixes to Registry where this makes a difference.

Fix to a doc string return value.

Added missing comment about import having to be in a
method.
Moved the getDialect, getRegistryClass and connectionString into
RegistryConfig and made them instance methods instead.

Moved RegistryConfig into its own module.

Circular import error is now handled in the ConnectionStringFactory
instead of in multiple places in RegistryConfig.

Fixes to ConnectionStringFactory (removal of useless code).
@DinoBektesevic DinoBektesevic force-pushed the u/DinoBektesevic/rds_registry branch from e54e567 to b184a7c Compare August 30, 2019 19:29
@timj
Copy link
Member

timj commented Aug 30, 2019

@andy-slac you have a "change requested" review on this PR. I think the changes have been done so would you like to update your review?

@andy-slac
Copy link
Contributor

I looked at it quickly again, looks better indeed. Approved.

@DinoBektesevic
Copy link
Contributor Author

Thank you all for the reviews and all the help! Sorry if I was a bother.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants