DM-12920: Prototype Butler configuration #4

Closed
wants to merge 10 commits

Conversation

pschella (Collaborator)

No description provided.

@timj (Member) left a comment:

I have quite a lot of comments on config.py but is that because the code came from daf_persistence? It's very inconsistent. I would have preferred some way to indicate in the commit history that this file was written by someone else before you modified it.

"""Config implements a datatype that is used by Butler for configuration parameters.
It is essentially a dict with key/value pairs, including nested dicts (as values). In fact, it can be
initialized with a dict. The only caveat is that keys may NOT contain dots ('.'). This is explained next:
Config extends the dict api so that hierarchical values may be accessed with dot-delimited notiation.
timj (Member):

"notiation" -> "notation"

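For context, the dot-delimited access the quoted docstring describes amounts to the following (hypothetical usage, assuming only the Config semantics quoted above):

    config = Config({"datastore": {"root": "/repo"}})
    assert config["datastore.root"] == "/repo"    # nested get via a dotted key
    config["datastore.formats.default"] = "yaml"  # nested set via a dotted key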

def __init__(self, other=None):
"""Initialize the Config. Other can be used to initialize the Config in a variety of ways:
other (string) Treated as a path to a config file on disk. Must end with '.paf' or '.yaml'.
timj (Member):

are you really supporting paf?

pschella (Collaborator Author):

No, removed.

# if other is a string, assume it is a file path.
self.__initFromFile(other)
else:
# if the config specified by other could not be loaded raise a runtime error.
timj (Member):

This comment is not strictly correct. This error is raised if other was not recognized, not if it could not be loaded.

use: pdb> print myConfigObject.pprint()
:return: a prettyprint formatted string representing the config
"""
import pprint
timj (Member):

this is a standard module so are we putting it here because it will slow down imports and is rarely used?

pschella (Collaborator Author):

I don't know (since I didn't write this), but I am moving all imports up.


def __initFromFile(self, path):
"""Load a file from path. If path is a list, will pick one to use, according to order specified
by extensionPreference.
timj (Member):

Lists are not supported and extensionPreference is not an argument.

cur[keys[-1]] = value
self.update(d)
data = self.data
keys = name.split('.')
timj (Member):

can you split the name once on entry to the method rather than doing it twice?
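The requested change would look roughly like this (a sketch against the quoted fragment; creating intermediate dicts with setdefault is my assumption, since that part isn't shown):

    def __setitem__(self, name, value):
        # Split the dotted name once on entry and reuse the pieces.
        keys = name.split('.')
        d = self.data
        for k in keys[:-1]:
            d = d.setdefault(k, {})
        d[keys[-1]] = value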

def __contains__(self, key):
d = self.data
keys = key.split('.')
for k in keys[0:-1]:
timj (Member):

Same indexing comment here.

(default), the fileName argument is relative to the installation
directory.
"""
basePath = lsst.utils.getPackageDir(productName)
timj (Member):

getPackageDir throws lsst.pex.exceptions.NotFoundError if the product cannot be found, so the trap on the next line is irrelevant.

pschella (Collaborator Author):

Actually just removed the whole function for now, pending discussion about default configuration locations.

@param productName (string) The name of the product that the default config is installed as part of
@param fileName (string) The name of the config file. Can also include a path to the file relative to
the directory where the product is installed.
@param relativePath (string) The relative path from the directior where the product is installed to
timj (Member):

"directior"


metadata.create_all(self.engine)

@staticmethod
def fromConfig(config):
from lsst.daf.persistence import doImport
timj (Member):

I'm not really a fan of these deferred imports as it hides the dependencies.

pschella (Collaborator Author):

So I moved the import of doImport to the top. Of course it is still doing a deferred import, since that is the whole point of doImport. But I assume that is not what you were objecting to.

timj (Member):

Yes, it was the from I wanted moved up top.
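For readers unfamiliar with it, doImport resolves a dotted string to the named Python object at runtime. A rough functional equivalent (not the actual daf_persistence implementation):

    import importlib

    def do_import(dotted_name):
        # "package.module.ClassName" -> import the module, fetch the attribute.
        module_name, _, attr = dotted_name.rpartition(".")
        return getattr(importlib.import_module(module_name), attr)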

@pschella (Collaborator Author):

Yes, that code is verbatim (minus a couple of deletions) from daf_persistence. I don't know how to preserve cross-package commit history. I made it a separate commit, and could add some more comments but am open to suggestions.

@timj (Member) commented Dec 14, 2017:

What I am asking for is that you don't own the commit that just copied the file across. Use the --author option to git commit. If Nate wrote it all make him the author. If it has many authors use a generic author name. This will make it more obvious where the code came from and indicate that you didn't write it. The commits to fix it up for this usage would then be yours.
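For example: `git commit --author="Nate <nate@example.org>" -m "Import Config from daf_persistence"` (the author string and message here are illustrative).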

@TallJimbo (Member):

This policy has been ignored basically since its inception, but it does provide a workflow that addresses this concern:

https://developer.lsst.io/processes/transferring_code.html

@pschella (Collaborator Author):

Ok, I followed the procedure outlined there (except for the steps at the origin since this is a copy). Hope it is clear now.

@timj (Member) commented Jan 25, 2018:

Merged in commit 32c6a8e

@timj timj closed this Jan 25, 2018
@ktlim ktlim deleted the tickets/DM-12920 branch August 25, 2018 06:49
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request May 16, 2019
 - added JSON from and to bytes methods.
 - fixed all the flake8 errors.
 - wrote tests for s3utils.
 - rewrote parsedPathToUriElements to make more sense.
   Added examples, better documentation, split on root
   capability
 - refactored how read/write to file methods work for
   formatters. No more serialization code duplication.
 - Fix in ingest S3Datastore functionality and path-uri
   changes made it possible to kick out the duplicate
   path removing code from S3LocationFactory.
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request May 17, 2019
hsinfang pushed a commit that referenced this pull request Jun 19, 2019
hsinfang pushed a commit that referenced this pull request Jun 19, 2019
Hardcoded Butler.get from S3 storage for the simplest of Datasets with no Dimensions.

Changes
-------
Added path-parsing utilities, which figure out whether a path is an S3 URI or a
filesystem path, to the utilities module.

ButlerConfig checks at instantiation, when it is not being instantiated from
another ButlerConfig instance, whether the given path is an S3 URI. If it is, the
butler yaml configuration file is downloaded and a ButlerConfig is instantiated
from that. The download location is hard-coded and won't work on a different
machine.
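A scheme check of this kind can be as simple as the following sketch (illustrative, not necessarily the code in this commit):

    from urllib.parse import urlparse

    def is_s3_uri(path):
        """Return True for S3 URIs such as 's3://bucket/key'."""
        return urlparse(path).scheme == "s3"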

Edited the fromConfig instantiation method of Registry. If a registry is
instantiated from a general Config class, it checks whether the path to the db
is an S3 URI or a filesystem path. If it's an S3 URI, it will download the
database locally to a hardcoded path that will only work on my machine. The
remainder of instantiation is then executed as normal from the downloaded file.
The RegistryConfig is updated to reflect the local downloaded path.

S3Datastore has a different config than the default PosixDatastore one. It
manually appends the s3:// to the default datastore templates. I do not know up
front if this is really required or not.

Initialization of the S3DataStore is different compared to PosixDatastore.
Attributes service, bucket and rootpath are added, identifying the 's3://'
prefix, the bucket name, and the path to the root of the repository,
respectively. Attributes session and client provide access, through the boto3
API, to upload and download functionality.
The file-exists and filesize checks were changed to query the S3 API. Currently
this is performed by making a listing request to the S3 Bucket. This is
allegedly the fastest approach, but it is meant for searching for a file name
among many. It is also 12.5 times as expensive as directly making a request for
the object and catching the exception. I do not know if there are data that can
have multiple different paths returned. I suspect not, so this might not be
worth it.
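For concreteness, the two strategies being compared look roughly like this with boto3 (bucket and key names are illustrative):

    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client("s3")

    def exists_via_listing(bucket, key):
        # List objects with the key as prefix and look for an exact match.
        response = client.list_objects_v2(Bucket=bucket, Prefix=key)
        return any(obj["Key"] == key for obj in response.get("Contents", []))

    def exists_via_head(bucket, key):
        # Ask for the object's header directly and catch the error instead.
        try:
            client.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            return False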
There is some functionality that will create a bucket if one doesn't exist in
the S3DataStore init method, but this is just showcase code - it doesn't work or
make sense, because Registry has to be instantiated before Datastore is and we
need that yaml config file to exist - so the bucket must exist too. The get
method was not particularly drastically changed, because the mechanism was
pushed to formatters.

LocationFactory used in the DataStores was changed to return S3Location if it
was instantiated from an S3 URI.

The S3Location keeps the repository root, the bucket name and the full
original bucket URI stored as attributes, so it's possible to figure out the
bucket path from which to download the file in the Formatters.

The S3FitsExposureFormatter readFull method uses the S3Location to download
the exposure to a baked-in path that will, again, only work for me. The
mapped DatasetType is then used to instantiate from that file and return an
appropriate in-memory (hopefully) object. All downloaded Datasets are given
the same name to avoid clutter, but this should be done differently anyhow,
so the hack will do for now.

Tentative working version of a cloud-specialized Butler.

Supports a local Registry and an S3 backed DataStore, or an RDS database
service Registry and an S3 backed datastore. Tests are not yet mocked!

CHANGES
-------

S3Datastore:
    - Call to ingest at the end of butler.put now takes in a location.uri
    - In ingest fixed several wrong if statements that would lead to errors being
      raised in correct scenarios.
    - Liberally added comments to help me remember.
    - TODO: Fix the checksum generation, fix the patch for parsePath2Uri.

schema.yaml:
    - Increased maximum allowed length of instrument name.
      The default instrument name length was too short (8), which would raise
      errors on Instrument insert in PostgreSQL. Apparently, Oracle was only
      tested on 'HSC' and SQLite doesn't care, so the short name went unnoticed.
    - Increased the length attribute on all instances of "name: instrument"
      entries.
    - Is there a better way to define global length values than manually changing
      a lot of entries?

Butler:
    - Refactored the makeRepo code into two functions.
      Previously this was one classmethod with a lot of ifs. It should be easier
      to see what steps need to happen to create a new repo in both
      cases now. This is a temporary fix, as indicated by the comments in the
      code, until someone tells me how I should solve the issue: 2 Butlers,
      1 dynamically constructed Butler, multiple classmethods etc...
    - Removed any and all hard-coded download locations.
    - Added a check if the boto3 import was successful, but I don't know if this
      is the correct style to do that in. I left a comment.
    - Liberal application of comments.
    - TODO: BUTLER_ROOT_TAG is still being ignored in the case of S3 DataStore
      and an in-memory Registry.

ButlerConfig:
    - Removed the hardcoded local paths where I used to download yaml config files.
      They are now downloaded as a byte string and then loaded with yaml. No need for
      tmpfiles.
    - There is a case to be made about splitting ButlerConfig into different ones,
      since there could be a proliferation of if-else statements in its __init__.

Registry:
    - Changed instantiation logic: previously I would create a local SqlRegistry
      and then upload it to the S3 Bucket. This was deemed a useless idea so it
      was scratched.
      The current paradigm is to have an S3 backed DataStore and an in-memory
      SQLite, or an S3 backed DataStore and an RDS backed Registry. Since
      SqlAlchemy is capable of always creating a new in-memory DB (no name
      clashes etc.) and considering that the RDS service must be created and
      exist, we have no need for checking if the Registry exists or creating
      persistent local SQLRegistries and uploading them.

Utils:
    - Removed some unnecessary comments.
    - Tried fixing a mistake in parsePath2Uri. It seems like 's3://bucket/root'
      and 's3://bucket/root/' are parsed differently by the function. This is
      a mistake. I am sure it's fixable here, but I confused myself, so I
      patched it in S3Datastore. The problem is that it is impossible to
      discern whether URIs like 's3://bucket/root' are pointing to a directory
      or a file (see the snippet after this list).
    - TODO: Revisit the URI parsing issue.
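The ambiguity mentioned above, demonstrated with the standard library; the trailing slash is the only syntactic hint that 'root' is a directory:

    from urllib.parse import urlparse

    for uri in ("s3://bucket/root", "s3://bucket/root/"):
        parts = urlparse(uri)
        print(parts.netloc, repr(parts.path))  # bucket '/root' vs bucket '/root/'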

views.py:
    - added a PostgreSQL compiler extension to create views.
      In PostgreSQL, views are created via 'CREATE OR REPLACE VIEW'.

FileFormatter:
    - Code refactoring: _assembleDataset method now contains most of the
      duplicated code that existed in read and readBytes

YamlFormatter:
    - I think it was only the order of methods that changed, and some code
      beautification

oracleRegistry.py:
    - In experimentation I kept the code here, since PostgreSQL seems to
      share a lot of similarities with Oracle. That code has now been
      excised and moved to postgresqlRegistry.py

PostgreSqlRegistry:
    - Wrote a new class to separate the previously mixed OracleRegistry
      and PostgreSqlRegistry.
    - Added code that will read the `~/.rds/credentials` file, where it
      expects to find a nickname under which RDS username and password credentials
      are stored. The credentials are read at SqlAlchemy engine creation time, so
      they should not be visible anywhere in the code. The connection string can
      now be given as:
          `dialect+driver://NICKNAME@host:port/database`
      as long as the (case-sensitive) NICKNAME exists in `~/.rds/credentials`
      and is given in the following format (see the sketch after this list):
           [NICKNAME]
               username = username
               password = password
    - The class eventually ended up being an exact copy of Oracle again, because
      it's required for the username and password to be read as close to engine
      creation as possible, so there we go and here we are.
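A sketch of reading such a credentials file with the standard library, assuming a plain INI layout (names are illustrative):

    import configparser
    import os

    def read_rds_credentials(nickname):
        # The credentials file is INI-style, so configparser can read it.
        parser = configparser.ConfigParser()
        parser.read(os.path.expanduser("~/.rds/credentials"))
        section = parser[nickname]
        return section["username"], section["password"]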

butler-s3store.yaml:
    - Now includes an sqlRegistry.yaml configuration file. This configures the
      test_butler.py S3DatastoreButlerTestCase to use an S3 backed DataStore
      and an in-memory SQLite Registry. Practical for S3Datastore testing.

sqlRegistry.yaml:
    - Targets the registry to an in-memory SQLite registry

butler-s3rds.yaml:
    - Now includes an rdsRegistry.yaml configuration file. This configures the
      test_butler.py S3RdsButlerTestCase to use an S3 backed DataStore and an
      (existing) RDS backed Registry.

rdsRegistry.yaml:
    - Targets the registry to an (existing) RDS database. The connection
      string used by default is:
      'postgresql://DEFAULT@gen3registry.cejsinjimzit.us-west-2.rds.amazonaws.com:5432/gen3registry'.
      This means that an RDS identifier with the name gen3registry must exist
      (the first 'gen3registry') and in it a **Schema that is on the
      search_path must exist and that Schema must be owned by the username
      under the DEFAULT nickname**. This is very important.

test_butler.py:
    - Home to 2 new tests: S3DatastoreButlerTestCase and S3RdsButlerTestCase.
      Both tests passed at the time of this commit.
    - S3DatastoreButlerTestCase will use the butler-s3store.yaml to connect to
      a Bucket to which it will authorize by looking at `~/.aws/credentials`
      to find the aws_access_key_id and aws_secret_access_key. The name of the
      bucket to which it will connect is set by the S3DatastoreButlerTestCase
      bucketName class variable. The permRoot class variable sets the root
      directory in that Bucket only when useTempRoot is False.
      The Registry is an in-memory SQLite registry. This is very practical for
      testing the S3Datastore. This test seems mockable, I just haven't
      succeeded at it yet.
    - S3RdsButlerTestCase will use the butler-s3rds.yaml file to connect to a
      Bucket to which it will authorize by looking at `~/.aws/credentials`
      expecting to find the aws `access_key` and `secret_access_key`. The name
      of the bucket is set by S3RdsButlerTestCase bucketName class variable.
      The permRoot class variable is only used when the useTempRoot variable is
      False.
      The Registry is an RDS service identified by a "generalized" connection
      string given in rdsRegistry.yaml configuration file in the test
      directory. The DEFAULT represents a nickname defined in the
      `~/.rds/credentials` file under which the username and password of a
      user with sufficient privileges (enough to create and drop databases)
      is expected. The tests are conducted by creating many DBs, each of
      which is assigned to a particular Registry instance tied to a
      particular test.
      This test seems absolutely impossible to mock.

test_butlerfits.py:
    - New S3DatastoreButlerTestCase added. The test is an S3 backed
      DataStore using a local in-memory Registry. Obviously, it extends
      trivially to the S3RdsButlerTestCase since only the S3Datastore is
      really being tested here - but no such test case exists because I
      suspect the way the setUp and tearDown is performed in that case
      will cause a lot of consternation so I'll defer that till a
      later time when I know what people want/expect.

Added mocking for S3.

Resolving review fixes #1

General:
- updated to incorporate recent master branch changes
- Removed NO_BOTO and replaced imports with a nicer boto3=None style

YAML config files:
- Fixed S3-related default config files to use lowercase table names
- Refactored the formatters section from Posix and S3 default
  yaml files into a formatters.yaml

Butler:
- Renamed parsePath2Uri --> parsePathToUriElements
- Removed example makeRepo functions:
    - Uploading is now handled from within Config class
    - makeButler is once again a method of Butler class
- Removed del and close methods

ButlerConfig:
- Removed the code that downloaded the butler.yaml file in init

Config:
- Added a dumpToS3 method that uploads a new config file to Bucket
- Added an initFromS3File method
- Modified initFromFile method to check whether file is a local one or S3 one

Location:
- Removed unnecessary comments
- Fixed awkward newline mistakes

S3Location:
- Removed unnecessary comments, corrected all references from Location to
  S3Location in the docstrings

utils.py:
- Moved all S3-related utility functionality into s3utils.py
- Added more docstrings, removed stale comments and elaborated on unclear ones

parsePathToUriElements:
- refactored the if-nastiness into something more legible and correct

test_butler:
- Moved the testPutTemplates into the generic Butler test class as a
  not-tested-for method
- Added tested-for versions of that method to both PosixDatastoreButlerTestCase
  and S3DatastoreButlerTestCase
- Added a more generic checkFileExists functionality that discerns between S3
  and POSIX files
- removed a lot of stale comments
- improved on the way S3DatastoreButlerTestCase does tear-down
- Added mocked no-op functionality and test skipping for the case when boto3
  is not available.

Resolving review fixes #2

- refactored Formatters, read/write/File is now implemented over
  read/write/Bytes.
- Removed all things RDS related from the commit.
- refactored test_butler and test_butlerFits

Review fixes #3

- converted the S3 datastore records to lowercase
- rebased onto the most recent changes and survived the merge
- fixed formatters.yaml (again)
- removed my own fix for config overwrites and am currently using
  the overwrite keyword functionality from a recent ticket.

Review fixes #4.

 - added JSON from and to bytes methods.
 - fixed all the flake8 errors.
 - wrote tests for s3utils.
 - rewrote parsedPathToUriElements to make more sense.
   Added examples, better documentation, split on root
   capability
 - refactored how read/write to file methods work for
   formatters. No more serialization code duplication.
 - Fix in ingest S3Datastore functionality and path-uri
   changes made it possible to kick out the duplicate
   path removing code from S3LocationFactory.

Fixing Formatters read/write functionality.

All Formatters now have both read/write and to/from/Bytes methods.
Those specific formatter implementations that also implement the
_from/_to/Bytes method will default to directly downloading the
bytes to memory. The rest will be downloaded/uploaded to/from
a temporary file.
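A minimal sketch of that dispatch (class and method names are illustrative, not the actual daf_butler Formatter API):

    import os
    import tempfile

    class Formatter:
        extension = ".dat"

        def read(self, path):
            # Ordinary file-based read path.
            with open(path, "rb") as fd:
                return self._parse(fd.read())

        def _parse(self, raw):
            return raw  # trivial payload for the sketch

        def fromBytes(self, data):
            if hasattr(self, "_fromBytes"):
                # Specialized formatters deserialize straight from memory.
                return self._fromBytes(data)
            # Everyone else takes a round trip through a temporary file.
            with tempfile.NamedTemporaryFile(suffix=self.extension,
                                             delete=False) as fd:
                fd.write(data)
            try:
                return self.read(fd.name)
            finally:
                os.remove(fd.name)

    class JsonFormatter(Formatter):
        extension = ".json"

        def _fromBytes(self, data):
            import json
            return json.loads(data)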

Checks for whether a file exists in S3Datastore were changed,
since now we definitely incur a GET charge for a Key's header
every time, so there is no need to duplicate the checks with
s3CheckFileExists calls.
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jun 26, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jul 12, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jul 29, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jul 29, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Aug 6, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Aug 6, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Aug 6, 2019