-
Notifications
You must be signed in to change notification settings - Fork 17
Add support for PostgreSQL Registry #161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for PostgreSQL Registry #161
Conversation
timj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for rearranging things such that it's clear what's needed for postgres support. My main comment is that I don't think we are ready to accept the test code. daf_butler tests do not yet require access to a remote database in order to run and I see no code here that skips the tests if RDS is not available. We do not test with Oracle in daf_butler either. I think this means that we need separate integration tests that can be set up to run with different database backends but that should not be included here.
I'm happy to merge the first non-test code once we understand how credentials in external text files should be handled.
| parsed = urlparse.urlparse(constr) | ||
| if parsed.username and (parsed.password is None): | ||
| localconf = configparser.ConfigParser() | ||
| localconf.read(os.path.expanduser('~/.rds/credentials')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ktlim probably has an opinion on the contents of this credentials file but I'm fine with this for now. We have historically been using ~/.lsst/db-auth.paf for database access but I don't know if this is something we wish to continue with. For example should the credentials be in ~/.rds/ and ~/.oracle or be in a single file inside a ~/.lsst or ~/.butler directory? Also, why are postgres credentials in an .rds directory? If this is RDS specific then why is this class named after postgres?
cc/ @MichelleGower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except for the two review comments all others relate to test_butler code. As I mentioned in the PR - that code exists only as a quick way for anyone else that perhaps wants to take the PostgreSqlRegistry for a spin and because I thought it was a shame to completely "forget" about the steps I needed to take in test set-up. Considering the comment:
... happy to merge the first non-test code once we understand how credentials in external text files should be handled ...
I wonder wouldn't it be better to just totally remove the piece of test-code in a new commit where I sort out the authorization issues? This would leave the code in history (at least on my end) and simultaneously get rid of most of the complaints in this Draft-PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably want an environment variable that you can set which points to the config file more than you want to put it in a well defined place. This sucks from a dev/prod database perspective if wanting to run code against both instances from the same home filesystem because you're going to be doing a bunch of symlink gymnastics. You can still have a default, which should probably be under ~/.lsst, as it's not a Postgres config file (e.g. .pgpass file)
A postgres-specific solution would be to use postgres norms via env vars or a .pgpass file:
https://www.postgresql.org/docs/9.3/libpq-envars.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read up on all of the suggestions made here and on the Slack channel yesterday. To summarize, the following suggestions were made:
- Postgres env vars
- not recommended by postgres manual as they say on some systems anyone can see any env variable.
.paf,.conffile in~/.rds,~/.postgres,~/.lsstor~/.butlerdirectory- the way its done now
- I have a feeling (from @ktlim and @timj) that its
.paffile using something built in the stack like theDBAuthcode would be most preferred - but I would still like some clarification on where exactly should I put this:
~/.lsstor~/.butler?~/.rdsor~/.postgresseem to be not as popular
.pgpassfile- I can set the
PGPASSFILEenv var to point to~/.lsstor~/.butlerdirectories so it works well with the file (above) approach - the format of entries in the file is given as
hostname:port:database:username:passwordso it makes no sense to keep the current connection string format intact. I could change the current connection string topostgresql://user:pass@hostname/dbnameand read the port and the password for the matching username, hostname and dbname from.pgpassfile. This is also good as it simplifies the connection string. - but at some point if I want to connect to the DB using SQLAlchemy I still have to go there, read the file as plaintext and make substitutions in the connection string. So on face value it doesn't seem any different than the other file based approaches.
- The difference is only visible if from terminal one invokes a clean
:~$ psqlcommand as that is when postgres will natively go there and read the file if no user and pass env vars are set. If there's only one entry in the file it will connect to that directly otherwise--username--hostand--dbnamehave to be given. - I'm not counting the env var as a win because that's easy to implement and depends on exactly the same questions as
.pgpassdoes: when and where is the variable exported and to which location should it point to?
- I can set the
- vault + confd
- This requires a running
vaultserver, which is simple enough I guess. I'm just confused about the general approach I guess: should there be one central vault, is everybody supposed to run their own, where is the vault in the AWS then, how do tests work... A lot of things I know not a lot about. - Then the templatefiles need to be created by
confd, which I guess could be a part of a startup script somewhere (where and when?) - but then again, every time I want to connect using SQLAlchemy I would have to run something like:
and would again have to read the files and make string substitutions to SQLAlchemy as was the case in all the options so far. Nobody can read the values in the files, even if they could access them, unless they luck out and look at them post token-substitutions? I guess it does get me more flexibility, especially if I run in deamon mode that will update the files if the back-end password changes.
ROOT_TOKEN=$(vault read -field id auth/token/lookup-self) confd -onetime -backend vault -node http://127.0.0.1:8200 \ -auth-type token -auth-token $ROOT_TOKEN
- This requires a running
The dot files are plain text, but they are not public in the sense that they need to be in 600 or 400 permission set. All of the answers seem to boil down to reading the dotfile and making plain string substitutions. It was my understanding that the main problem was reading/keeping usernames and passwords floating about in plain text around the code. So I guess I'm confused where the security comes in to play in any of the suggestions, short of relying on OS privileges? (or maybe this is "the security", I don't know a lot about this stuff so someone needs to point it out to me)
The passwords are read prior to creating the connection string and passed on to SQLAlchemy as soon as possible (which then hides them again). This was the reasoning behind duplicating all the read-password code. If I write a function that will read and return username and password isn't that more dangerous than just doing it in place like it is? This is what, I though, I needed to avoid.
I think I prefer the .pgpass file with PGPASSFILE, by default, pointing to one of the ~/.lsst or ~/.butler directories by which is read by the LSST's DBAuth code approach the best. It gets me everything regular dotfile approach does but also integrates well with native psql commands. While we're at it, we can change the location of the AWS credentials file to be in the same default location as the rest of them by setting AWS_SHARED_CREDENTIALS_FILE. Any thoughts, comments and answers very welcomed! (especially, I think, input from @brianv0 might be very illuminating to me)
Thank you all for your help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with a postgres file in a ~/.lsst directory. It's really important that you don't duplicate all the code for handling the password reading in multiple places. The more places you have it the more places where a mistake can be made. There is nothing wrong I can see with using function calls to obtain information and then inserting the answers into your connection string. You should check that the file has the correct restricted permissions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the work is focused on this comment. I have implemented a ConnectionStringBuilder class in lsst.daf.butler.core.registry. It constructs the connection string as generally as possible from the given config and db-auth.paf file. All tests pass (including the RDSButlerTestCase which was updated in previous commit but removed from the latest one).
I think I might have over-extended the level of changes you requested but I wanted to avoid repeat of the experience with URI handling from before. I have followed the SQLAlchemy manual that cites RFC-1738 and all the default values are set equal to those SQLAlchemy Dialect part of manual states. This should hopefully enable the construction of maximally general connection strings so all cases should be covered.
I was not 100% sure how Oracle connection string has been supplied but I suspect that they use conection strings without user and pass and then leave Oracle to work out the security. So I left that as an keyword option in ConnectionStringBuilder.fromConfig.
I have left questions in form of comments in the ConnectionStringBuilder class that I can remove once I hear if you like this or not.
|
Removing the test code in a separate commit is fine with me, although it might be worth deferring that removal until we have sorted out the code duplication so that you can still test what you have whilst you are developing. |
3b9653a to
c2ccee4
Compare
andy-slac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By @timj invite I looked at the code and left a bunch of comments. I tried to get some context for the proposed changes but could not gather much (is there a ticket for this in JIRA?) One thing that I do not quite like is an addition of overrides for URL pieces in registry config, I think this adds too much complexity to the code with unknown (to me) benefits. Could we still stick with URL-only connection configuration? As @brianv0 mentioned in one of the comments DbAuth use is problematic with its hard-coded location of config file, it would be better to have more flexibility there (I think ticket for daf_persistence is the right thing to do). And I think that inheritance for new registry class needs to be changed.
|
|
||
| class ConnectionStringBuilder: | ||
| """Builds a connection string by parsing (in order of precedence): | ||
| I) RegistryConfig for explicitly defined keys, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this is a new addition to the configuration? Why do we need explicit keys, I thought that URL-style connection string is already flexible enough for anything that might be needed. I think having more than one way to specify things adds unnecessary complexity, would be better to keep things simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know there's a strict requirement that the db string be an URL and it was not obvious that this is the case from conversations or other PRs. I wanted more flexibility. The current changes are more in line with what you said.
| def fromConfig(cls, config, withPass=True): | ||
| conKwargs = cls._dictFromRegistryConf(config) | ||
| validationKeys = list(cls.requiredKeys) | ||
| constr = "{dialect}+{driver}://{username}:{password}@{host}:{port}{dbname}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think port number should be optional in configuration, and I believe conKwargs['port'] will be None in that case, would it be rendered as None on output too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partially, DBAuth matches on host and port so in some cases port becomes mandatory. I don't believe in the previous case it would have been rendered as "None" but as "". This was used in cases where dbname was the path to a file as then it would, usually, get shifted leftwards so file-targeting DB connection strings would work too.
In any case, not important in the reworked version. Added a test that expects the correct result when no port, no username and no password are provided for good measure.
tests/test_connectionString.py
Outdated
|
|
||
| regConf['host'] = 'some.host.com:1234/dbname' | ||
| with self.assertRaises(ValueError): | ||
| ConnectionStringBuilder.fromConfig(regConf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add test case when port number is not specified in configuration (for non-sqlite case)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that but I'm not sure what the usecase is. Will username and password be required in this non-sqlite case? How would I use the DBAuth then? Does this still apply?
c2ccee4 to
8cb727f
Compare
|
@ktlim has agreed to look at this pull request. |
f6c14c6 to
82b9723
Compare
234a56f to
409f3e2
Compare
|
At PCW 2019 DM Hack session it was agreed upon, between @ktlim, @andy-slac and me, that correct registry class should be resolved based on the dialect part of the connection string. TO DO: resolve correct Registry class to use directly from connection string. |
|
My main concern with deriving the class from the connection string is that we still allow for the mapping from connection string to class to be a configuration option. cc/ @TallJimbo |
Was the second clause of that sentence supposed to be slightly different? What else would the connection string be other than a configuration option? (Sorry if that's a silly question; I've thus far avoided getting involved in the connection/authentication configuration discussions.) |
Right now the config looks like: the idea is that we kick the |
|
@TallJimbo my concern is that I'm not sure I want python code to be the place where something says "if this looks like a sqlite connection string then import the sqlite registry". Maybe everyone else does want that? It means that no-one could specify their own override sqlite registry implementation but maybe we don't want that? Deriving the class from the connection string does have the advantage that you aren't ever going to specify the oracle registry with a postgres connection string. I was suggesting we have an extra yaml configuration section something like: Then in theory someone could experiment with different classes. It does mean that the connection string parser can only support a limited set of keys in registryMap. |
|
That works for me. Another approach that I think would provide the same flexibility would be to retain |
|
@TallJimbo allowing |
|
I was thinking it'd be fine to hard-code the mapping if you had a way to override it, but I guess that's not quite as flexible (presumably the need to change this means there are some databases that can't use |
Any class importable with EDIT: is this really the best place for this discussion or should it move to slack? |
|
The concern is indeed to avoid duplication and keep things consistent. URL string should be enough to identify everything that code needs, I don't think we'll have different implementations for the same backend (ideally we would need only one SQL implementation with SQLAlchemy handling all backend peculiarities). A agree that for developers it may be needed to override the class name for particular backend. I think Tim's map is probably cleanest way to do that (and |
409f3e2 to
328a794
Compare
|
Sorry to derail this a bit by asking everyone to catch me up on the motivation. I certainly don't have terribly strong opinions about this, but I think the simplest thing that does everything we might want is @timj's addendum (adding a mapping from connection substring to registry class to the config). I think we can revisit whether we need even that later; I may have a bit more clarity in the next month or so on how much we will actually be able to collapse the Registry subclasses down by relying more on SQLAlchemy and avoiding database-specific syntax in our use of it. |
|
So I added automatic resolution of correct Registry class based on the dialect found in the connection string. This is still an early commit, waiting for the dbAuth to get finalized before connection string works as agreed upon, but it could be useful to open this matter for early discussion. Specifically, is there a better way to read the |
|
Regarding registryMap. How about we put it in config/registry.yaml. Replacing: registry:
cls: lsst.daf.butler.registries.sqliteRegistry.SqliteRegistry
...with registry:
clsMap:
sqlite: lsst.daf.butler.registries.sqliteRegistry.SqliteRegistry
oracle: lsst.daf.butler.registries.sqliteRegistry.OracleRegistry |
|
@timj Sorry, got distracted and forgot about this response. I'm not sure how that would work in the global scheme of things since the configs in the See the |
63e02eb to
09c817a
Compare
|
I changed the rebase so the odd changes don't appear anymore. I'm not sure if this was the right way of doing it but nothing else seems broken? |
| # becomes the database, but we are not treating 'localhost' explicitly | ||
| host = conStr.host | ||
| if conStr.host is None: | ||
| host = str(conStr.database) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still dislike this. If there is no host, we shouldn't set one, especially not to the database name. I'd just return conStr here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a relic from before when I used to let the db string be just the host or a path and the cls key to set the dialect. If the db was a path then it would become the database name, but if it was an url, the host. Since I was treating the host as the target location previously I had to make that switch. Since we kicked out cls and I discovered URL (thanks to Andy) I believe it's not needed now at all so I removed it.
This affects the username and password setting issue I asked you about. I don't want to return the connection string here because there are single file databases that support username and passwords (LiteDB, RaptorDB, sqlite has an extension that supports usernames and passwords). I don't know/think that they would be ever used but it's been made clear to me that I should try to be as general as possible (but also that it's ok not to do it for sqlite).
So I left it as is, but I take it, from the early return, that you're not fine with not using DbAuth when username+password are given in db connection string or as explicit keys in the config file for single file dbs?
ktlim
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me.
Code was separated from a Draft-PR which contained both S3Datastore and PostgreSqlRegistry into a self-standing PostgreSqlRegistry code only.
- transitioned connection strings to ConnectionStringBuilder - added tests for ConnectionStringBuilder - added class attributes for desired default values of dialect and driver Removed RDSButlerTestCase. Rebased onto master.
Rebased onto master. Removed RDS tests and supporting files.
ConnectionSttring classes. Removed RegistryConfig 'cls' key from default configs. The registry class is determined through the 'db' string except in the cases when the 'cls' class is explicitly stated. This includes ignoring the cls key in registry setConfigRoot. Omit explicitly setting cls values.
Remove ConnectionStringClass. Reduce code duplication by moving the dialect and correct registry class doImport statements to Registry. Fixed tests.
Correct class resolution moved to Registry. Removed that logic from Butler.makeRepo and Registry.fromConfig. ConnectionStringFactory is now used only internaly in the Registry. Removed the connection string handling from the registry subclasses. Removed the ConnectionString classes. ConnectionStringFactory now produces an SQLAlchemy URL instances. Added checks to avoid reading username and password when not required. Fixed OracleRegistry misshandling of config keys in OracleRegistry.setConfigRoot. Changed the create_engine calls on all registry subclasses to reflect the changes. Docs.
ConnectionStringFactory.fromConfig is now a classmethod. Fixes to Registry where this makes a difference. Fix to a doc string return value. Added missing comment about import having to be in a method.
Moved the getDialect, getRegistryClass and connectionString into RegistryConfig and made them instance methods instead. Moved RegistryConfig into its own module. Circular import error is now handled in the ConnectionStringFactory instead of in multiple places in RegistryConfig. Fixes to ConnectionStringFactory (removal of useless code).
e54e567 to
b184a7c
Compare
|
@andy-slac you have a "change requested" review on this PR. I think the changes have been done so would you like to update your review? |
|
I looked at it quickly again, looks better indeed. Approved. |
|
Thank you all for the reviews and all the help! Sorry if I was a bother. |
As was suggested to me by @r-owen and @timj the draft PR for S3Datastore and PostgreSqlRegistry together was too large and clunky to be reasonable. That Draft PR #147 is now closed. Since all the comments I received about the
PostgreSqlRegistryimplementation were down the line "please remove it" I am opening a new draft PR request separate to that ofS3Datastore.For the most part the inclusion of a
PostgreSqlRegistryseems equivalent/analogous toOracleRegistrywith the exception of authorization. Specifically there are 2 authorizations that need to happen, one is to the AWS service, in the form of security policies and groups (f.e. if your IP is not white-listed outside source you will be denied access to the DB, or if your IAM is not permitted access etc..) and the second one is authorizing to the database itself. The latter is currently done by reading the credentials from~/.rds/credentialsfile and then substituting the correct username and password according to the nick given in the connection string.A
PosixDatastorewith an RDS backedPostgresSqlRegistrytest was left in because I have noticed certain slowness when running the tests that I have not seen for Sqlite registry. I am willing to make arrangements for access to people that are willing to try it out. Additionally, it outlines the neccessary events that would have to occur if actualPostgreSqlRegistrywould be tested (I hope).Comments and suggestions welcomed.