DM-12920: Prototype Butler configuration #4

Closed
wants to merge 10 commits

Conversation

pschella (Collaborator)

No description provided.

@timj (Member) left a comment:

I have quite a lot of comments on config.py but is that because the code came from daf_persistence? It's very inconsistent. I would have preferred some way to indicate in the commit history that this file was written by someone else before you modified it.

"""Config implements a datatype that is used by Butler for configuration parameters.
It is essentially a dict with key/value pairs, including nested dicts (as values). In fact, it can be
initialized with a dict. The only caveat is that keys may NOT contain dots ('.'). This is explained next:
Config extends the dict api so that hierarchical values may be accessed with dot-delimited notiation.
timj (Member):

"notiation" -> "notation"

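For context, the dot-delimited access the quoted docstring describes amounts to the following (hypothetical usage, assuming only the Config semantics quoted above):

    config = Config({"datastore": {"root": "/repo"}})
    assert config["datastore.root"] == "/repo"    # nested get via a dotted key
    config["datastore.formats.default"] = "yaml"  # nested set via a dotted key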

def __init__(self, other=None):
"""Initialize the Config. Other can be used to initialize the Config in a variety of ways:
other (string) Treated as a path to a config file on disk. Must end with '.paf' or '.yaml'.
timj (Member):

are you really supporting paf?

pschella (Collaborator Author):

No, removed.

# if other is a string, assume it is a file path.
self.__initFromFile(other)
else:
# if the config specified by other could not be loaded raise a runtime error.
timj (Member):

This comment is not strictly correct. This error is raised if other was not recognized, not if it could not be loaded.

use: pdb> print myConfigObject.pprint()
:return: a prettyprint formatted string representing the config
"""
import pprint
timj (Member):

this is a standard module so are we putting it here because it will slow down imports and is rarely used?

pschella (Collaborator Author):

I don't know (since I didn't write this), but I am moving all imports up.


def __initFromFile(self, path):
"""Load a file from path. If path is a list, will pick one to use, according to order specified
by extensionPreference.
timj (Member):

Lists are not supported and extensionPreference is not an argument.

cur[keys[-1]] = value
self.update(d)
data = self.data
keys = name.split('.')
timj (Member):

can you split the name once on entry to the method rather than doing it twice?
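The requested change would look roughly like this (a sketch against the quoted fragment; creating intermediate dicts with setdefault is my assumption, since that part isn't shown):

    def __setitem__(self, name, value):
        # Split the dotted name once on entry and reuse the pieces.
        keys = name.split('.')
        d = self.data
        for k in keys[:-1]:
            d = d.setdefault(k, {})
        d[keys[-1]] = value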

def __contains__(self, key):
d = self.data
keys = key.split('.')
for k in keys[0:-1]:
timj (Member):

Same indexing comment here.

(default), the fileName argument is relative to the installation
directory.
"""
basePath = lsst.utils.getPackageDir(productName)
timj (Member):

getPackageDir throws lsst.pex.exceptions.NotFoundError if the product cannot be found, so the trap on the next line is irrelevant.

pschella (Collaborator Author):

Actually just removed the whole function for now, pending discussion about default configuration locations.

@param productName (string) The name of the product that the default config is installed as part of
@param fileName (string) The name of the config file. Can also include a path to the file relative to
the directory where the product is installed.
@param relativePath (string) The relative path from the directior where the product is installed to
timj (Member):

"directior"


metadata.create_all(self.engine)

@staticmethod
def fromConfig(config):
from lsst.daf.persistence import doImport
timj (Member):

I'm not really a fan of these deferred imports as it hides the dependencies.

pschella (Collaborator Author):

So I moved the import of doImport to the top. Of course it is still doing a deferred import, since that is the whole point of doImport. But I assume that is not what you were objecting to.

timj (Member):

Yes, it was the from I wanted moved up top.
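For readers unfamiliar with it, doImport resolves a dotted string to the named Python object at runtime. A rough functional equivalent (not the actual daf_persistence implementation):

    import importlib

    def do_import(dotted_name):
        # "package.module.ClassName" -> import the module, fetch the attribute.
        module_name, _, attr = dotted_name.rpartition(".")
        return getattr(importlib.import_module(module_name), attr)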

@pschella (Collaborator Author):

Yes, that code is verbatim (minus a couple of deletions) from daf_persistence. I don't know how to preserve cross-package commit history. I made it a separate commit, and could add some more comments but am open to suggestions.

@timj (Member) commented Dec 14, 2017:

What I am asking for is that you don't own the commit that just copied the file across. Use the --author option to git commit. If Nate wrote it all make him the author. If it has many authors use a generic author name. This will make it more obvious where the code came from and indicate that you didn't write it. The commits to fix it up for this usage would then be yours.
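For example: `git commit --author="Nate <nate@example.org>" -m "Import Config from daf_persistence"` (the author string and message here are illustrative).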

@TallJimbo (Member):

This policy has been ignored basically since its inception, but it does provide a workflow that addresses this concern:

https://developer.lsst.io/processes/transferring_code.html

@pschella (Collaborator Author):

Ok, I followed the procedure outlined there (except for the steps at the origin since this is a copy). Hope it is clear now.

@timj (Member) commented Jan 25, 2018:

Merged in commit 32c6a8e

@timj timj closed this Jan 25, 2018
@ktlim ktlim deleted the tickets/DM-12920 branch August 25, 2018 06:49
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request May 16, 2019
 - added JSON from and to bytes methods.
 - fixed all the flake8 errors.
 - wrote tests for s3utils.
 - rewrote parsedPathToUriElements to make more sense.
   Added examples, better documentation, split on root
   capability
 - refactored how read/write to file methods work for
   formatters. No more serialization code duplication.
 - Fix in ingest S3Datastore functionality and path-uri
   changes made it possible to kick out the duplicate
   path removing code from S3LocationFactory.
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request May 17, 2019
hsinfang pushed a commit that referenced this pull request Jun 19, 2019
hsinfang pushed a commit that referenced this pull request Jun 19, 2019
Hardcoded Butler.get from S3 storage for the simplest of Datasets with no Dimensions.

Changes
-------
Added path-parsing utilities, which figure out whether a path is an S3 URI or a
filesystem path, to the utilities module.

ButlerConfig checks at instantiation, when it is not being instantiated from
another ButlerConfig instance, whether the given path is an S3 URI. If it is, the
butler yaml configuration file is downloaded and a ButlerConfig is instantiated
from that. The download location is hard-coded and won't work on a different
machine.
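A scheme check of this kind can be as simple as the following sketch (illustrative, not necessarily the code in this commit):

    from urllib.parse import urlparse

    def is_s3_uri(path):
        """Return True for S3 URIs such as 's3://bucket/key'."""
        return urlparse(path).scheme == "s3"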

Edited the fromConfig instantiation method of Registry. If a registry is
instantiated from a general Config class, it checks whether the path to the db
is an S3 URI or a filesystem path. If it's an S3 URI, it will download the
database locally to a hardcoded path that will only work on my machine. The
remainder of instantiation is then executed as normal from the downloaded file.
The RegistryConfig is updated to reflect the local downloaded path.

S3Datastore has a different config than the default PosixDatastore one. It
manually appends the s3:// to the default datastore templates. I do not know up
front if this is really required or not.

Initialization of the S3DataStore is different compared to PosixDatastore.
Attributes service, bucket and rootpath are added, identifying the 's3://'
prefix, the bucket name, and the path to the root of the repository,
respectively. Attributes session and client provide access, through the boto3
API, to upload and download functionality.
The file-exists and filesize checks were changed to query the S3 API. Currently
this is performed by making a listing request to the S3 Bucket. This is
allegedly the fastest approach, but it is meant for searching for a file name
among many. It is also 12.5 times as expensive as directly making a request for
the object and catching the exception. I do not know if there are data that can
have multiple different paths returned. I suspect not, so this might not be
worth it.
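For concreteness, the two strategies being compared look roughly like this with boto3 (bucket and key names are illustrative):

    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client("s3")

    def exists_via_listing(bucket, key):
        # List objects with the key as prefix and look for an exact match.
        response = client.list_objects_v2(Bucket=bucket, Prefix=key)
        return any(obj["Key"] == key for obj in response.get("Contents", []))

    def exists_via_head(bucket, key):
        # Ask for the object's header directly and catch the error instead.
        try:
            client.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            return False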
There is some functionality that will create a bucket if one doesn't exist in
the S3DataStore init method, but this is just showcase code - it doesn't work or
make sense, because Registry has to be instantiated before Datastore is and we
need that yaml config file to exist - so the bucket must exist too. The get
method was not particularly drastically changed, because the mechanism was
pushed to formatters.

LocationFactory used in the DataStores was changed to return S3Location if it
was instantiated from an S3 URI.

The S3Location keeps the repository root, the bucket name and the full
original bucket URI stored as attributes, so it's possible to figure out the
bucket path from which to download the file in the Formatters.

The S3FitsExposureFormatter readFull method uses the S3Location to download
the exposure to a baked-in path that will, again, only work for me. The
mapped DatasetType is then used to instantiate from that file and return an
appropriate in-memory (hopefully) object. All downloaded Datasets are given
the same name to avoid clutter, but this should be done differently anyhow,
so the hack will do for now.

Tentative working version of a cloud-specialized Butler.

Supports a local Registry and an S3 backed DataStore, or an RDS database
service Registry and an S3 backed datastore. Tests are not yet mocked!

CHANGES
-------

S3Datastore:
    - Call to ingest at the end of butler.put now takes in a location.uri
    - In ingest fixed several wrong if statements that would lead to errors being
      raised in correct scenarios.
    - Liberally added comments to help me remember.
    - TODO: Fix the checksum generation, fix the patch for parsePath2Uri.

schema.yaml:
    - Increased maximum allowed length of instrument name.
      The default instrument name length was too short (8), which would raise
      errors on Instrument insert in PostgreSQL. Apparently, Oracle was only
      tested on 'HSC' and SQLite doesn't care, so the short name went unnoticed.
    - Increased the length attribute on all instances of "name: instrument"
      entries.
    - Is there a better way to define global length values than manually changing
      a lot of entries?

Butler:
    - Refactored the makeRepo code into two functions.
      Previously this was one classmethod with a lot of ifs. It should be easier
      to see what steps need to happen to create a new repo in both
      cases now. This is a temporary fix, as indicated by the comments in the
      code, until someone tells me how I should solve the issue: 2 Butlers,
      1 dynamically constructed Butler, multiple classmethods etc...
    - Removed any and all hard-coded download locations.
    - Added a check if the boto3 import was successful, but I don't know if this
      is the correct style to do that in. I left a comment.
    - Liberal application of comments.
    - TODO: BUTLER_ROOT_TAG is still being ignored in the case of S3 DataStore
      and an in-memory Registry.

ButlerConfig:
    - Removed the hardcoded local paths where I used to download yaml config files.
      They are now downloaded as a byte string and then loaded with yaml. No need for
      tmpfiles.
    - There is a case to be made about splitting ButlerConfig into different ones,
      since there could be a proliferation of if-else statements in its __init__.

Registry:
    - Changed instantiation logic: previously I would create a local SqlRegistry
      and then upload it to the S3 Bucket. This was deemed a useless idea so it
      was scratched.
      The current paradigm is to have an S3 backed DataStore and an in-memory
      SQLite, or an S3 backed DataStore and an RDS backed Registry. Since
      SqlAlchemy is capable of always creating a new in-memory DB (no name
      clashes etc.) and considering that the RDS service must be created and
      exist, we have no need for checking if the Registry exists or creating
      persistent local SQLRegistries and uploading them.

Utils:
    - Removed some unnecessary comments.
    - Tried fixing a mistake in parsePath2Uri. It seems like 's3://bucket/root'
      and 's3://bucket/root/' are parsed differently by the function. This is
      a mistake. I am sure it's fixable here, but I confused myself, so I
      patched it in S3Datastore. The problem is that it is impossible to
      discern whether URIs like 's3://bucket/root' are pointing to a directory
      or a file (see the snippet after this list).
    - TODO: Revisit the URI parsing issue.
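The ambiguity mentioned above, demonstrated with the standard library; the trailing slash is the only syntactic hint that 'root' is a directory:

    from urllib.parse import urlparse

    for uri in ("s3://bucket/root", "s3://bucket/root/"):
        parts = urlparse(uri)
        print(parts.netloc, repr(parts.path))  # bucket '/root' vs bucket '/root/'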

views.py:
    - added a PostgreSQL compiler extension to create views.
      In PostgreSQL, views are created via 'CREATE OR REPLACE VIEW'.

FileFormatter:
    - Code refactoring: _assembleDataset method now contains most of the
      duplicated code that existed in read and readBytes

YamlFormatter:
    - I think it was only the order of methods that changed, and some code
      beautification

oracleRegistry.py:
    - In experimentation I kept the code here, since PostgreSQL seems to
      share a lot of similarities with Oracle. That code has now been
      excised and moved to postgresqlRegistry.py

PostgreSqlRegistry:
    - Wrote a new class to separate the previously mixed OracleRegistry
      and PostgreSqlRegistry.
    - Added code that will read the `~/.rds/credentials` file, where it
      expects to find a nickname under which RDS username and password credentials
      are stored. The credentials are read at SqlAlchemy engine creation time, so
      they should not be visible anywhere in the code. The connection string can
      now be given as:
          `dialect+driver://NICKNAME@host:port/database`
      as long as the (case-sensitive) NICKNAME exists in `~/.rds/credentials`
      and is given in the following format (see the sketch after this list):
           [NICKNAME]
               username = username
               password = password
    - The class eventually ended up being an exact copy of Oracle again, because
      it's required for the username and password to be read as close to engine
      creation as possible, so there we go and here we are.
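A sketch of reading such a credentials file with the standard library, assuming a plain INI layout (names are illustrative):

    import configparser
    import os

    def read_rds_credentials(nickname):
        # The credentials file is INI-style, so configparser can read it.
        parser = configparser.ConfigParser()
        parser.read(os.path.expanduser("~/.rds/credentials"))
        section = parser[nickname]
        return section["username"], section["password"]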

butler-s3store.yaml:
    - Now includes an sqlRegistry.yaml configuration file. This configures the
      test_butler.py S3DatastoreButlerTestCase to use an S3 backed DataStore
      and an in-memory SQLite Registry. Practical for S3Datastore testing.

sqlRegistry.yaml:
    - Targets the registry to an in-memory SQLite registry

butler-s3rds.yaml:
    - Now includes an rdsRegistry.yaml configuration file. This configures the
      test_butler.py S3RdsButlerTestCase to use an S3 backed DataStore and an
      (existing) RDS backed Registry.

rdsRegistry.yaml:
    - Targets the registry to an (existing) RDS database. The connection
      string used by default is:
      'postgresql://DEFAULT@gen3registry.cejsinjimzit.us-west-2.rds.amazonaws.com:5432/gen3registry'.
      This means that an RDS identifier with the name gen3registry must exist
      (the first 'gen3registry') and in it a **Schema that is on the
      search_path must exist and that Schema must be owned by the username
      under the DEFAULT nickname**. This is very important.

test_butler.py:
    - Home to 2 new tests: S3DatastoreButlerTestCase and S3RdsButlerTestCase.
      Both tests passed at the time of this commit.
    - S3DatastoreButlerTestCase will use the butler-s3store.yaml to connect to
      a Bucket to which it will authorize by looking at `~/.aws/credentials`
      to find the aws_access_key_id and aws_secret_access_key. The name of the
      bucket to which it will connect is set by the S3DatastoreButlerTestCase
      bucketName class variable. The permRoot class variable sets the root
      directory in that Bucket only when useTempRoot is False.
      The Registry is an in-memory SQLite registry. This is very practical for
      testing the S3Datastore. This test seems mockable, I just haven't
      succeeded at it yet.
    - S3RdsButlerTestCase will use the butler-s3rds.yaml file to connect to a
      Bucket to which it will authorize by looking at `~/.aws/credentials`
      expecting to find the aws `access_key` and `secret_access_key`. The name
      of the bucket is set by S3RdsButlerTestCase bucketName class variable.
      The permRoot class variable is only used when the useTempRoot variable is
      False.
      The Registry is an RDS service identified by a "generalized" connection
      string given in rdsRegistry.yaml configuration file in the test
      directory. The DEFAULT represents a nickname defined in the
      `~/.rds/credentials` file under which the username and password of a
      user with sufficient privileges (enough to create and drop databases)
      is expected. The tests are conducted by creating many DBs, each of
      which is assigned to a particular Registry instance tied to a
      particular test.
      This test seems absolutely impossible to mock.

test_butlerfits.py:
    - New S3DatastoreButlerTestCase added. The test is an S3 backed
      DataStore using a local in-memory Registry. Obviously, it extends
      trivially to the S3RdsButlerTestCase since only the S3Datastore is
      really being tested here - but no such test case exists because I
      suspect the way the setUp and tearDown is performed in that case
      will cause a lot of consternation so I'll defer that till a
      later time when I know what people want/expect.

Added mocking for S3.

Resolving review fixes #1

General:
- updated to incorporate recent master branch changes
- Removed NO_BOTO and replaced imports with a nicer boto3=None style

YAML config files:
- Fixed S3-related default config files to use lowercase table names
- Refactored the formatters section from Posix and S3 default
  yaml files into a formatters.yaml

Butler:
- Renamed parsePath2Uri --> parsePathToUriElements
- Removed example makeRepo functions:
    - Uploading is now handled from within Config class
    - makeButler is once again a method of Butler class
- Removed del and close methods

ButlerConfig:
- Removed the code that downloaded the butler.yaml file in init

Config:
- Added a dumpToS3 method that uploads a new config file to Bucket
- Added an initFromS3File method
- Modified initFromFile method to check whether file is a local one or S3 one

Location:
- Removed unnecessary comments
- Fixed awkward newline mistakes

S3Location:
- Removed unnecessary comments, corrected all references from Location to
  S3Location in the docstrings

utils.py:
- Moved all S3-related utility functionality into s3utils.py
- Added more docstrings, removed stale comments and elaborated on unclear ones

parsePathToUriElements:
- refactored the if-nastiness into something more legible and correct

test_butler:
- Moved the testPutTemplates into the generic Butler test class as a
  not-tested-for method
- Added tested-for versions of that method to both PosixDatastoreButlerTestCase
  and S3DatastoreButlerTestCase
- Added a more generic checkFileExists functionality that discerns between S3
  and POSIX files
- removed a lot of stale comments
- improved on the way S3DatastoreButlerTestCase does tear-down
- Added mocked no-op functionality and test skipping for the case when boto3
  is not available.

Resolving review fixes #2

- refactored Formatters, read/write/File is now implemented over
  read/write/Bytes.
- Removed all things RDS related from the commit.
- refactored test_butler and test_butlerFits

Review fixes #3

- converted the S3 datastore records to lowercase
- rebased onto the most recent changes and survived the merge
- fixed formatters.yaml (again)
- removed my own fix for config overwrites and am currently using
  the overwrite keyword functionality from a recent ticket.

Review fixes #4.

 - added JSON from and to bytes methods.
 - fixed all the flake8 errors.
 - wrote tests for s3utils.
 - rewrote parsedPathToUriElements to make more sense.
   Added examples, better documentation, split on root
   capability
 - refactored how read/write to file methods work for
   formatters. No more serialization code duplication.
 - Fix in ingest S3Datastore functionality and path-uri
   changes made it possible to kick out the duplicate
   path removing code from S3LocationFactory.

Fixing Formatters read/write functionality.

All Formatters now have both read/write and to/from/Bytes methods.
Those specific formatter implementations that also implement the
_from/_to/Bytes method will default to directly downloading the
bytes to memory. The rest will be downloaded/uploaded to/from
a temporary file.
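A minimal sketch of that dispatch (class and method names are illustrative, not the actual daf_butler Formatter API):

    import os
    import tempfile

    class Formatter:
        extension = ".dat"

        def read(self, path):
            # Ordinary file-based read path.
            with open(path, "rb") as fd:
                return self._parse(fd.read())

        def _parse(self, raw):
            return raw  # trivial payload for the sketch

        def fromBytes(self, data):
            if hasattr(self, "_fromBytes"):
                # Specialized formatters deserialize straight from memory.
                return self._fromBytes(data)
            # Everyone else takes a round trip through a temporary file.
            with tempfile.NamedTemporaryFile(suffix=self.extension,
                                             delete=False) as fd:
                fd.write(data)
            try:
                return self.read(fd.name)
            finally:
                os.remove(fd.name)

    class JsonFormatter(Formatter):
        extension = ".json"

        def _fromBytes(self, data):
            import json
            return json.loads(data)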

Checks for whether a file exists in S3Datastore were changed,
since now we definitely incur a GET charge for a Key's header
every time, so there is no need to duplicate the checks with
s3CheckFileExists calls.
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jun 26, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jul 12, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jul 29, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Jul 29, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Aug 6, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Aug 6, 2019
DinoBektesevic added a commit to DinoBektesevic/daf_butler that referenced this pull request Aug 6, 2019