Add support for S3 Datastore #159

Closed

Conversation

@DinoBektesevic (Collaborator) commented:

As @r-owen and @timj suggested to me, the draft PR covering S3Datastore and PostgreSqlRegistry together was too large and clunky to be reasonable. Draft PR #147 is now closed and replaced by two PRs, separating S3Datastore and PostgreSqlRegistry.

Changes

Based on review comments on the draft PR, the following has changed:

  • Reworked how the toBytes/fromBytes methods work:
    • Added toBytes and fromBytes functionality to JsonFormatter.
    • Changed the toFile and fromFile methods to internally use toBytes/fromBytes on all formatters that support them.
    • S3Datastore now downloads all files as bytes and attempts to read them directly; if that fails, it stores the bytes in a temporary file and reads from there (see the sketch after this list).
  • Refactored the formatters section of the datastore YAML config files into its own file that is included by the datastore config files.
  • Removed everything RDS-related from this PR.
  • Made the Config class aware of S3, and removed the work-around code in butlerConfig and Butler.makeRepo.
  • Refactored Butler.makeRepo back to (almost) the way it was; it is now pretty again.
  • Removed Butler.computeChecksum, as I never got it to work.
  • Moved all S3-related utilities from daf.butler.core.utils to daf.butler.core.s3utils.
  • Added tests for s3utils.
  • Fixed the de-duplication hacks in S3Location by fixing parsePathToUriElements.
  • General housekeeping (better names; formatters are now their own YAML file included where needed; removed old unnecessary comments; fixed docstrings and added more; new default and test config files for S3; better style for protecting boto3 imports; mocks and skips for tests when moto or boto3 is missing...).
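
A minimal sketch of the bytes-first read path mentioned above, assuming a hypothetical formatter object with fromBytes/read methods (the actual S3Datastore code differs in detail):

import tempfile

import boto3

def read_object(client, bucket, key, formatter):
    """Bytes-first read with a temporary-file fallback (illustrative)."""
    # Download the whole object into memory.
    data = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    try:
        # Formatters that support it deserialize directly from bytes.
        return formatter.fromBytes(data)
    except NotImplementedError:
        # Otherwise spill the bytes to a temporary file and read from disk.
        with tempfile.NamedTemporaryFile() as tmp:
            tmp.write(data)
            tmp.flush()
            return formatter.read(tmp.name)

# client = boto3.client("s3")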

Unresolved/Other

  • The <butlerRoot> tag is still not supported for the current S3Datastore. From what I can see, the main effect of the butler relocation code is that the sqlite:///:memory: registry location is replaced by sqlite:///<butlerRoot>/gen3.sqlite3, which makes no practical sense when working with S3Datastore: we cannot talk to an SQLite DB in a bucket, while, for S3Datastore tests at least, an ephemeral Registry does make sense. This works for cases where the db string in the registry YAML config can be specified directly and won't be changed, so I'm deferring this issue for later.
  • I did not clean up S3Datastore.ingest as requested. Several things here are still slightly unclear to me, so I haven't yet polished the large block of if-else code, full of path and string manipulation, that figures out what type of ingest needs to happen and how. It is on my todo list.
  • boto3.client.delete_object(bucket_name, object_name) is just bad. Deleting an object from a bucket deletes only its current version (S3 versions files when an upload overwrites them), so on a versioned bucket a delete can simply revert to an older version of the file. That is not too bad, since almost no files are allowed to be rewritten. The real problem is the still-unresolved issue "Boto3 delete_objects response does not match documented response" (boto/boto3#507). In testing, the only failure I can get out of delete_object is by giving it a non-existent bucket; in all other cases I just get an HTTP 200 OK response, even when the files don't exist (see the sketch after this list).
  • The comment on S3Datastore.remove was that I need to find the location of the file before I can remove it anyway, so while at it, why not check for errors. The method is now an exact duplicate of PosixDatastore's except for the remove call itself.
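
The delete_object behaviour described above, as a sketch (bucket and key names are placeholders):

import boto3
from botocore.exceptions import ClientError

client = boto3.client("s3")

# Deleting a key that does not exist still returns a successful
# response; boto3 raises nothing.
response = client.delete_object(Bucket="existing-bucket", Key="no/such/key")
print(response["ResponseMetadata"]["HTTPStatusCode"])  # a 2xx status

# Only a missing bucket actually fails.
try:
    client.delete_object(Bucket="no-such-bucket", Key="irrelevant")
except ClientError as err:
    print(err.response["Error"]["Code"])  # e.g. "NoSuchBucket"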

The rest of the comments were either fixed, or enough has changed, I believe, that they are no longer relevant. Thanks to all the input I got, I think this is a much better iteration of S3Datastore than what was in the original PR, so many thanks to everyone who contributed.

@DinoBektesevic DinoBektesevic changed the title U/dino bektesevic/object store S3Datastore May 17, 2019
@timj timj changed the title S3Datastore Add support for S3 Datastore May 30, 2019
@timj (Member) left a comment:

This is much much better. Thank you for working on this. My main comments are associated with URI handling. I will try to help you out by augmenting #167 such that you can simplify your code here.


# format the downloaded bytes into appropriate object directly, or via
# tempfile (when formatter does not support to/from/Bytes)
try:
@timj (Member):

We can fix this later, but I'm worried that we are setting ourselves up to read in multiple gigabytes of data, then decide that we can't actually convert from bytes, and then write it all to disk. If we knew that fromBytes was not going to work, we could do an efficient incremental write to disk and then call the formatter to read from disk (which might result in only a subset of the file being read if we wanted a component), with much less memory overhead.

@DinoBektesevic (Collaborator, Author):

I don't know how big of a problem this would actually be: the fix seems really easy, and currently the only Formatters supporting direct reads from bytes are YAML, JSON, and Pickle. I don't see how those would be gigabytes in size.

On the other hand, I agree that there is a lot of room for optimization here, especially when downloading files to disk, because boto3 offers a variety of options for very fast large-file downloads. For example, there is the overarching configuration in TransferConfig, which sets chunk sizes, the number of concurrent download threads, etc., when using boto3's TransferManager.

And of course the same applies to uploads, which would ideally be batched together after a certain number of new files are created (or some such) and then uploaded with multipart_upload, as the speedups seem significant.

This was discussed in the AWS POC meetings, where it was decided to table the issue until the POC is done, and then to experiment with whether uploads should be optimized from within boto3 or via condor uploads/downloads, and at which data sizes one approach beats the other, if at all.

But a very good point to make for sure.

EDIT: A point I forgot to make: the solution is trivial because we know the file size in advance. s3CheckFileExists returns True/False for file existence plus the file's size if it does exist; that information is in response['ContentLength'], which you can see a couple of lines above.
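
For illustration, a sketch of such a size probe via a HEAD request (hypothetical helper; the PR's s3CheckFileExists differs in signature):

import boto3
from botocore.exceptions import ClientError

def object_size(client, bucket, key):
    """Return the object's size in bytes, or None if it does not exist."""
    try:
        return client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    except ClientError:
        return None

# A datastore could compare this size against a threshold before choosing
# between fromBytes in memory and an incremental download to disk.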

@timj (Member):

But we could support FITS files as bytes, so who knows what could happen. It's up to the formatter, which is outside the control of the datastore.

@timj (Member) left a comment:

Here is an early review of the new code. I've not covered everything, but I'm about to go out for the day on vacation, so I want to send something.

@timj (Member) left a comment:

Thanks for doing the requested fixes. My main concern is still the URI mangling. There is also a huge amount of duplication between PosixDatastore and S3Datastore that I have to think about at some point.

return rndstr + '/'

def setUp(self):
config = Config(self.configFile)
@timj (Member):

Shouldn't this be a ButlerConfig? (So that the doubled-up datastore.datastore entries are removed.)

@DinoBektesevic (Collaborator, Author):

Not sure how to make this a ButlerConfig, since I need to call makeRepo, the first line of which is

if isinstance(config, (ButlerConfig, ConfigSubset)):
    raise ValueError("makeRepo must be passed a regular Config without defaults applied.")

I guess I could try recasting it as a plain Config when I make the call?

@timj (Member):

Leave it for now. I'll have a think. We shouldn't be doing datastore.datastore because it's confusing.


@@ -894,7 +894,7 @@ def updateParameters(configType, config, full, toUpdate=None, toCopy=None, overw
         for key in toCopy:
             if key in localConfig and not overwrite:
                 log.debug("Not overriding key '%s' from defaults in config %s",
-                          key, localConfig.__class__.__name__)
+                          key, value, localConfig.__class__.__name__)
@timj (Member):

The log message has two placeholders, so you cannot pass three parameters in the call here (i.e., I think you were right the first time, or you need to edit the log message).
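
A sketch of the pitfall (key, value, and localConfig stand in for the variables in updateParameters):

import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)
key, value, localConfig = "k", "v", {}

# Two %s placeholders, two arguments: formats cleanly.
log.debug("Not overriding key '%s' from defaults in config %s",
          key, localConfig.__class__.__name__)

# Three arguments against two placeholders fails to format at emission
# time ("not all arguments converted..."); logging reports the error
# rather than crashing the program.
log.debug("Not overriding key '%s' from defaults in config %s",
          key, value, localConfig.__class__.__name__)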

@DinoBektesevic (Collaborator, Author):

I removed value here. I can make a different PR if you want the changes made immediately, or just make them yourself if that's easier and faster.

@timj (Member) commented Jul 25, 2019:

There's a lot to disentangle in the URI discussion, and this is likely not the best venue for it at this point. The important point, though, is that you cannot ever represent a relative file path using a file URI. This is forbidden by RFC-8089. When we started on this a few months back, there was a big discussion about why urlparse wouldn't round-trip a relative path, and I submitted a bunch of GitHub issues about this for other Python URI parsers. They all came back to me with the same point about relative URIs.

ButlerURI was written specifically to deal with the case where you want a scheme-less file system path to be usable alongside file or s3 URI schemes. By "scheme-less" I really do mean "no scheme specified". You cannot convert a scheme-less relative path to a file scheme without also turning it into an absolute path (recall there was a review comment about that: if people are requesting that the path be turned into an absolute path, then a case could be made for converting a scheme-less URI to a file-scheme URI). For a scheme-less URI it makes sense that netloc is None: it is telling you that you have a local path. If you have a file URI, we assume POSIX semantics for paths (that seems to be consistent with the examples), but for scheme-less URIs we assume os.path semantics. That's why ButlerURI goes through many hoops to ensure we consistently use posixpath and os.path.
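
As a quick illustration with Python's standard parser (a sketch, not code from this PR):

from urllib.parse import urlparse

# A scheme-less relative path survives parsing untouched:
urlparse("relative/file.ext")
# ParseResult(scheme='', netloc='', path='relative/file.ext', ...)

# Written as a file URI, "relative" becomes the authority (netloc),
# not part of the path; file URI paths are absolute:
urlparse("file://relative/file.ext")
# ParseResult(scheme='file', netloc='relative', path='/file.ext', ...)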

I'm not sure I understand the issue with relativeToNetloc. That method is specifically designed to turn an absolute path into a path relative to the root, so by definition it would strip the leading / as you require. It does this regardless of what netloc is set to and should only deal with the path component. What am I missing?

It might make life simpler for you if ButlerUri had an exists method that would either look on local filesystem or use your s3 utility routine. It would have to always use a separate code path for schemeless and file schemes.

Does this help?

@DinoBektesevic (Collaborator, Author) commented Jul 30, 2019:

(I couldn't figure out how to reply to your comment, so I'm making a new one.)

I have reverted the ButlerURI changes as you requested. The thing that confused me was having a relativeToNetloc when netloc is None; the method did not seem to do what it says on the tin. Adding the netloc in seemed closer to the described behaviour for URIs and made the following difference:

original/current

(lsst-scipipe-1172c30) [dinob@boom lsstspark]$ python testNetloc.py 
scheme  | netloc          | relativeToNetloc               | path                           | expected                                 | uri
file    |                 | rootDir/absolute/file.ext      | /rootDir/absolute/file.ext     | /rootDir/absolute/file.ext               | file:///rootDir/absolute/file.ext
file    |                 | rootDir/absolute/file.ext      | /rootDir/absolute/file.ext     | /rootDir/absolute/file.ext               | file:///rootDir/absolute/file.ext
file    |                 | rootDir/absolute/file.ext      | /rootDir/absolute/file.ext     | /rootDir/absolute/file.ext               | file:///rootDir/absolute/file.ext
file    |                 | rootDir/absolute/              | /rootDir/absolute/             | /rootDir/absolute/                       | file:///rootDir/absolute/
file    |                 | tmp/relative/file.ext          | /tmp/relative/file.ext         | /tmp/relative/file.ext                   | file:///tmp/relative/file.ext
file    |                 | tmp/relative/directory/        | /tmp/relative/directory/       | /tmp/relative/directory/                 | file:///tmp/relative/directory/
file    | relative        | file.ext                       | /file.ext                      | /file.ext                                | file://relative/file.ext
file    |                 | absolute/directory/            | /absolute/directory/           | /absolute/directory/                     | file:///absolute/directory/
        |                 | tmp/relative/file.ext          | /tmp/relative/file.ext         | /tmp/relative/file.ext                   | /tmp/relative/file.ext
        |                 | relative/file.ext              | relative/file.ext              | relative/file.ext                        | relative/file.ext
s3      | bucketname      | rootDir/relative/file.ext      | /rootDir/relative/file.ext     | /rootDir/relative/file.ext               | s3://bucketname/rootDir/relative/file.ext
file    |                 | home/dinob/relative/file.ext   | /home/dinob/relative/file.ext  | /home/dinob/relative/file.ext            | file:///home/dinob/relative/file.ext
file    |                 | home/dinob/relative/file.ext   | /home/dinob/relative/file.ext  | /home/dinob/relative/file.ext            | file:///home/dinob/relative/file.ext
        |                 | tmp/relative/file.ext          | /tmp/relative/file.ext         | /tmp/relative/file.ext                   | /tmp/relative/file.ext
        |                 | relative/file.ext              | relative/file.ext              | relative/file.ext                        |

proposed

file    | localhost       | rootDir/absolute/file.ext      | /rootDir/absolute/file.ext     | /rootDir/absolute/file.ext               | file://localhost/rootDir/absolute/file.ext
file    | localhost       | rootDir/absolute/file.ext      | /rootDir/absolute/file.ext     | /rootDir/absolute/file.ext               | file://localhost/rootDir/absolute/file.ext
file    | localhost       | rootDir/absolute/file.ext      | /rootDir/absolute/file.ext     | /rootDir/absolute/file.ext               | file://localhost/rootDir/absolute/file.ext
file    | localhost       | rootDir/absolute/              | /rootDir/absolute/             | /rootDir/absolute/                       | file://localhost/rootDir/absolute/
file    | localhost       | tmp/relative/file.ext          | /tmp/relative/file.ext         | /tmp/relative/file.ext                   | file://localhost/tmp/relative/file.ext
file    | localhost       | tmp/relative/directory/        | /tmp/relative/directory/       | /tmp/relative/directory/                 | file://localhost/tmp/relative/directory/
file    | relative        | file.ext                       | /file.ext                      | /file.ext                                | file://relative/file.ext
file    | localhost       | absolute/directory/            | /absolute/directory/           | /absolute/directory/                     | file://localhost/absolute/directory/
file    | localhost       | tmp/relative/file.ext          | /tmp/relative/file.ext         | /tmp/relative/file.ext                   | file://localhost/tmp/relative/file.ext
        |                 | relative/file.ext              | relative/file.ext              | relative/file.ext                        | relative/file.ext
s3      | bucketname      | rootDir/relative/file.ext      | /rootDir/relative/file.ext     | /rootDir/relative/file.ext               | s3://bucketname/rootDir/relative/file.ext
file    | localhost       | home/dinob/relative/file.ext   | /home/dinob/relative/file.ext  | /home/dinob/relative/file.ext            | file://localhost/home/dinob/relative/file.ext
file    | localhost       | home/dinob/relative/file.ext   | /home/dinob/relative/file.ext  | /home/dinob/relative/file.ext            | file://localhost/home/dinob/relative/file.ext
file    | localhost       | tmp/relative/file.ext          | /tmp/relative/file.ext         | /tmp/relative/file.ext                   | file://localhost/tmp/relative/file.ext
        |                 | relative/file.ext              | relative/file.ext              | relative/file.ext                        | relative/file.ext

which does exactly what you describe:

"The important point though is that you can not ever represent a relative file path using a file URI."
"You cannot convert a scheme-less relative path to a file scheme without also turning it into an absolute path."

The three cases where a relative path was provided are exactly the three that were either not parsed as URIs or parsed correctly (correct in the sense an imaginary user would have intended, and in the sense of the rules). The rest were promoted to fully qualified URIs.

In any case, I changed it back as requested and am dropping further discussion. I kept the relativeToNetloc method, and yes, the way it's implemented it will work regardless of whether netloc exists or not. Is this acceptable?

The other changes were: adding support for !include statements in YAML configs when the !include points to another file in the same bucket; replacing all single quotes with double quotes; adding some missing tests and skips; and rebasing on top of the newest master.
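
For reference, the usual shape of a PyYAML !include hook (a local-file sketch only; the actual loader additionally has to resolve includes living in the same bucket):

import yaml

class IncludeLoader(yaml.SafeLoader):
    """SafeLoader subclass so the !include constructor stays local to it."""

def _include(loader, node):
    # Parse the referenced file in place of the !include node.
    with open(loader.construct_scalar(node)) as fh:
        return yaml.load(fh, IncludeLoader)

IncludeLoader.add_constructor("!include", _include)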

@timj (Member) commented Jul 30, 2019:

Re relativeToNetloc: how about we call it relativeToPathRoot?

@timj (Member) left a comment:

I've had a quick look at the URI stuff and it's looking better now. I've made some more comments, since I'm still not entirely sure how you are handling scheme-less relative paths. You should consider adding some tests for relativeToPathRoot.

@timj (Member) left a comment:

If you deal with the scheme-less vs. file vs. other comments I just made, I'm happy to approve this change. Any further issues I can clean up later. I wonder if we should rename the branch to tickets/DM-13361 (which I think I can assign to you, and which would better reflect your work on this). This would require a new PR, of course.

One minor complication is that #176 just merged, and that changed the Formatter API a little. This will cause you some merge conflicts with your read/write-bytes API. Sorry about that. Let me know if you need help disentangling things.

@DinoBektesevic (Collaborator, Author) commented Aug 6, 2019:

The Formatters were not a big deal. I saw the PR and was ready for the changes; it actually helped me notice how out of date the S3Datastore was and fix a bunch of mistakes. I should have paid more attention to that.
One thing I have not gotten an answer on is the pytypes in the method signatures for formatters; I did not notice that being addressed in #176, so I have again left them as is.

When I split up Draft PR #147, the advice I got from @r-owen was that "PRs are cheap". If you want, I can close this PR.

EDIT: also, if you look back through all the review comments, there are some I have left open; even though they say "Outdated", there are still comments/answers/pings that could be made under them to the appropriate people, whom I assume you would know.

@timj (Member) commented Aug 6, 2019:

@DinoBektesevic - I think it's ready for you to push directly to the daf_butler repo on a ticket branch. That would allow us to run tests on Jenkins to make sure nothing is broken. Yes, that means a new PR. I've added you as a collaborator so you can push directly.

… Dimensions.

Changes
-------
Added path-parsing utilities, which figure out whether a path is an S3 URI or a filesystem path, to the utilities module.

ButlerConfig now checks at instantiation, when it is not being instantiated from another ButlerConfig instance, whether the given path is an S3 URI. If it is, the butler YAML configuration file is downloaded and a ButlerConfig is instantiated from it. The download location is hard-coded and won't work on a different machine.

Edited the fromConfig instantiation method of Registry. If a registry is instantiated from a general Config class, it checks whether the path to the db is an S3 URI or a filesystem path. If it is an S3 URI, it downloads the database locally to a hardcoded path that will only work on my machine. The remainder of the instantiation then runs as normal from the downloaded file. The RegistryConfig is updated to reflect the local downloaded path.

S3Datastore has a different config than the default PosixDatastore one. It manually appends s3:// to the default datastore templates. I do not know up front whether this is really required.

Initialization of the S3Datastore differs from PosixDatastore. Attributes service, bucket, and rootpath are added, identifying the 's3://' scheme, the bucket name, and the path to the root of the repository, respectively. Attributes session and client provide access, through the boto3 API, to upload and download functionality.
The file-exists and file-size checks were changed to query the S3 API. Currently this is done by making a listing request to the S3 bucket. This is allegedly the fastest approach, but for searching for a file name among many; it is also 12.5 times as expensive as directly requesting the object and catching the exception. I do not know whether there are data that can return multiple different paths; I suspect not, so this might not be worth it.
There is some functionality in the S3Datastore init method that will create a bucket if one doesn't exist, but this is just showcase code: it doesn't work or make sense, because Registry has to be instantiated before Datastore and we need the YAML config file to exist, so the bucket must exist too. The get method was not particularly drastically changed, because the mechanism was pushed into the formatters.

LocationFactory, used in the Datastores, was changed to return an S3Location if it was instantiated from an S3 URI.

The S3Location keeps the repository root, the bucket name, and the full original bucket URI stored as attributes, so that the Formatters can figure out the bucket path from which to download a file.

The S3FitsExposureFormatter readFull method uses the S3Location to download the exposure to a baked-in path that will, again, only work for me. The mapped DatasetType is then used to instantiate from that file and return the appropriate in-memory (hopefully) object. All downloaded Datasets are given the same name to avoid clutter; this should be done differently anyhow, so the hack will do for now.
Supports a local Registry with an S3-backed DataStore, or an RDS database service Registry with an S3-backed DataStore. Tests are not yet mocked!

CHANGES
-------

S3Datastore:
    - The call to ingest at the end of butler.put now takes a location.uri.
    - In ingest, fixed several wrong if statements that would lead to errors
      being raised in correct scenarios.
    - Liberally added comments to help me remember.
    - TODO: Fix the checksum generation; fix the patch for parsePath2Uri.

schema.yaml:
    - Increased the maximum allowed length of the instrument name.
      The default instrument name length was too short (8), which would raise
      errors on Instrument insert in PostgreSQL. Apparently Oracle was only
      tested on 'HSC', and SQLite doesn't care, so the short name went
      unnoticed.
    - Increased the length attribute on all instances of "name: instrument"
      entries.
    - Is there a better way to define global length values than manually
      changing a lot of entries?

Butler:
    - Refactored the makeRepo code into two functions.
      Previously this was one classmethod with a lot of ifs. It should now be
      easier to see which steps need to happen to create a new repo in both
      cases. This is a temporary fix, as indicated by the comments in the
      code, until someone tells me how I should solve the issue: 2 Butlers,
      1 dynamically constructed Butler, multiple classmethods, etc.
    - Removed any and all hard-coded download locations.
    - Added a check that the boto3 import was successful, but I don't know
      whether this is the correct style; I left a comment.
    - Liberal application of comments.
    - TODO: BUTLER_ROOT_TAG is still being ignored in the case of an S3
      Datastore and an in-memory Registry.

ButlerConfig:
    - Removed the hardcoded local paths I used to download YAML config files
      to. They are now downloaded as a bytestring and then loaded with yaml;
      no need for tmpfiles.
    - There is a case to be made for splitting ButlerConfig into different
      classes, since there could be a proliferation of if-else statements in
      its __init__.

Registry:
    - Changed the instantiation logic: previously I would create a local
      SqlRegistry and then upload it to the S3 bucket. This was deemed a
      useless idea, so it was scrapped.
      The current paradigm is an S3-backed Datastore with an in-memory
      SQLite Registry, or an S3-backed Datastore with an RDS-backed Registry.
      Since SQLAlchemy can always create a new in-memory DB (no name clashes,
      etc.), and considering that the RDS service must already exist, there
      is no need to check whether the Registry exists or to create persistent
      local SqlRegistries and upload them.

Utils:
    - Removed some unnecessary comments.
    - Tried fixing a mistake in parsePath2Uri: 's3://bucket/root' and
      's3://bucket/root/' are parsed differently by the function, which is a
      mistake. I am sure it is fixable here, but I confused myself, so I
      patched it in S3Datastore. The underlying problem is that it is
      impossible to discern whether URIs like 's3://bucket/root' point to a
      directory or a file.
    - TODO: Revisit the URI parsing issue.

views.py:
    - Added a PostgreSQL compiler extension to create views.
      In PostgreSQL, views are created via 'CREATE OR REPLACE VIEW'.

FileFormatter:
    - Code refactoring: the _assembleDataset method now contains most of the
      code that was duplicated between read and readBytes.

YamlFormatter:
    - I think only the order of methods changed, plus some code
      beautification.

oracleRegistry.py:
    - During experimentation I kept the code here, since PostgreSQL shares a
      lot of similarities with Oracle. That code has now been excised and
      moved to postgresqlRegistry.py.

PostgreSqlRegistry:
    - Wrote a new class to separate the previously mixed OracleRegistry and
      PostgreSqlRegistry.
    - Wrote additional code that reads a `~/.rds/credentials` file, where it
      expects to find a nickname under which the RDS username and password
      credentials are stored. The credentials are read at SQLAlchemy engine
      creation time, so they should not be visible anywhere in the code. The
      connection string can now be given as:
          `dialect+driver://NICKNAME@host:port/database`
      as long as the (case-sensitive) NICKNAME exists in `~/.rds/credentials`
      in the following format (see the sketch after this entry):
           [NICKNAME]
               username = username
               password = password
    - The class eventually ended up being an exact copy of Oracle again,
      because the username and password must be read as close to engine
      creation as possible, so there we go and here we are.
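
Since the credentials file is INI-style, the lookup could look like this sketch (hypothetical helper, not the actual PostgreSqlRegistry code):

import configparser
import os

def read_rds_credentials(nickname):
    """Return the (username, password) stored under [nickname]."""
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser("~/.rds/credentials"))
    section = parser[nickname]  # section names are case-sensitive
    return section["username"], section["password"]

# Intended to be called as close to engine creation as possible, when
# expanding a "dialect+driver://NICKNAME@host:port/database" string.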

butler-s3store.yaml:
    - Now includes an sqlRegistry.yaml configuration file. This configures
      the test_butler.py S3DatastoreButlerTestCase to use an S3-backed
      Datastore and an in-memory SQLite Registry; practical for S3Datastore
      testing.

sqlRegistry.yaml:
    - Targets the registry at an in-memory SQLite registry.

butler-s3rds.yaml:
    - Now includes an rdsRegistry.yaml configuration file. This configures
      the test_butler.py S3RdsButlerTestCase to use an S3-backed Datastore
      and an (existing) RDS-backed Registry.

rdsRegistry.yaml:
    - Targets the registry at an (existing) RDS database. The connection
      string used by default is:
'postgresql://DEFAULT@gen3registry.cejsinjimzit.us-west-2.rds.amazonaws.com:5432/gen3registry'.
      This means that an RDS identifier named gen3registry must exist (the
      first 'gen3registry'), and in it a **Schema that is on the search_path
      must exist, and that Schema must be owned by the username under the
      DEFAULT nickname**. This is very important.

test_butler.py:
    - Home to 2 new tests: S3DatastoreButlerTestCase and S3RdsButlerTestCase.
      Both tests passed at the time of this commit.
    - S3DatastoreButlerTestCase uses butler-s3store.yaml to connect to a
      bucket, authorizing by looking in `~/.aws/credentials` for the
      aws_access_key_id and aws_secret_access_key. The name of the bucket it
      connects to is set by the S3DatastoreButlerTestCase bucketName class
      variable. The permRoot class variable sets the root directory in that
      bucket, but only when useTempRoot is False.
      The Registry is an in-memory SQLite registry, which is very practical
      for testing the S3Datastore. This test seems mockable; I just haven't
      succeeded at it yet.
    - S3RdsButlerTestCase uses the butler-s3rds.yaml file to connect to a
      bucket, authorizing by looking in `~/.aws/credentials` for the aws
      `access_key` and `secret_access_key`. The name of the bucket is set by
      the S3RdsButlerTestCase bucketName class variable. The permRoot class
      variable is only used when useTempRoot is False.
      The Registry is an RDS service identified by a "generalized" connection
      string given in the rdsRegistry.yaml configuration file in the test
      directory. The DEFAULT represents a nickname, defined in the
      `~/.rds/credentials` file, under which the username and password of a
      user with sufficient privileges (enough to create and drop databases)
      are expected. The tests are conducted by creating many DBs, each
      assigned to a particular Registry instance tied to a particular test.
      This test seems absolutely impossible to mock.

test_butlerfits.py:
    - New S3DatastoreButlerTestCase added. The test is an S3-backed
      Datastore using a local in-memory Registry. Obviously, this extends
      trivially to the S3RdsButlerTestCase, since only the S3Datastore is
      really being tested here; but no such test case exists, because I
      suspect the way the setUp and tearDown are performed in that case
      will cause a lot of consternation, so I'll defer it until I know
      what people want/expect.
General:
- Updated to incorporate recent master branch changes.
- Removed NO_BOTO and replaced the imports with a nicer boto3 = None style.

YAML config files:
- Fixed the S3-related default config files to use lowercase table names.
- Refactored the formatters section from the Posix and S3 default yaml
  files into a formatters.yaml.

Butler:
- Renamed parsePath2Uri --> parsePathToUriElements.
- Removed the example makeRepo functions:
    - Uploading is now handled from within the Config class.
    - makeButler is once again a method of the Butler class.
- Removed the del and close methods.

ButlerConfig:
- Removed the code that downloaded butler.yaml file in init

Config:
- Added a dumpToS3 method that uploads a new config file to a bucket.
- Added an initFromS3File method.
- Modified the initFromFile method to check whether a file is local or on S3.

Location:
- Removed unnecessary comments
- Fixed awkward newline mistakes

S3Location:
- Removed unnecessary comments, corrected all references from Location to
  S3Location in the docstrings

utils.py:
- Moved all S3-related utility functionality into s3utils.py.
- Added more docstrings, removed stale comments, and elaborated on unclear
  ones.

parsePathToUriElements:
- refactored the if-nastiness into something more legible and correct

test_butler:
- Moved testPutTemplates into the generic Butler test class as a
  not-tested-for method.
- Added tested-for versions of that method to both PosixDatastoreButlerTestCase
  and S3DatastoreButlerTestCase.
- Added a more generic checkFileExists functionality that discerns between S3
  and POSIX files.
- Removed a lot of stale comments.
- Improved the way S3DatastoreButlerTestCase does tear-down.
- Added mocked no-op functionality and test skipping for the case when boto3
  does not exist.
- Refactored the Formatters; read/write/File is now implemented over
  read/write/Bytes.
- Removed all things RDS-related from the commit.
- Refactored test_butler and test_butlerFits.
- Converted the S3 datastore records to lowercase.
- Rebased onto the most recent changes and survived the merge.
- Fixed formatters.yaml (again).
- Removed my own fix for config overwrites; now using the overwrite keyword
  functionality from a recent ticket.
- Added JSON from- and to-bytes methods.
- Fixed all the flake8 errors.
- Wrote tests for s3utils.
- Rewrote parsePathToUriElements to make more sense. Added examples, better
  documentation, and split-on-root capability.
- Refactored how the read/write-to-file methods work for formatters. No more
  serialization code duplication.
- Fixes to the S3Datastore ingest functionality; together with the path-URI
  changes, this made it possible to kick out the duplicate-path-removing code
  from S3LocationFactory.
All Formatters now have both read/write and to/from Bytes methods. Those
formatter implementations that also implement the _from/_to Bytes methods
default to downloading the bytes directly into memory; the rest are
downloaded/uploaded via a temporary file.

The file-existence checks in S3Datastore were changed: since we now
definitely incur a GET charge for a key's header every time, there is no
need to duplicate the checks with s3CheckFileExists calls.
Hardcoded Butler.get from S3 storage for simplest of Datasets with no
- Rebased to include the newest changes to daf_butler (namely, ButlerURI).
- Removed parsePathToUriElements, S3Location, and S3LocationFactory, and
  replaced all path handling with ButlerURI and its test cases.
- Added transaction checks back into S3Datastore. Unsure of proper usage.
- Added more path manipulation methods/properties to LocationFactory
  and Location:
  * bucketName returns the name of the bucket in the Location/
    LocationFactory.
  * Location has a pathInBucket property that converts POSIX-style paths to
    S3-style paths. The main difference concerns leading and trailing
    separators: S3 interprets `/path/`, `/path`, `path/`, and `path` as
    different keys, even though some of them are equivalent on a
    POSIX-compliant system. So what would be `/path/to/file.ext` on a POSIX
    system reads `path/to/file.ext` on S3, with the bucket referenced
    separately through boto3 (see the sketch after this changelog entry).
- For saving a Config as a file, moved all the URI-handling logic into Config
  and out of Butler.makeRepo; the only logic left there is root-directory
  creation.
  * The call to save the config as a file at the butler root directory now
    goes through dumpToUri, which resolves the appropriate backend method to
    call.
- Improved(?) the proposed scheme for checking whether we are dealing with a
  file or a directory in the absence of a trailing path separator.
  * Noted some differences between the generality of code requested in the
    review for writing to files and the inits of the config classes. For the
    inits there is a 2+ year old commit limiting Config files to `yaml` files
    only; the review, however, implied that `dumpToFile` on the Config class
    should be file-format independent. So for the `dumpTo*` methods, to check
    whether we have a dir or a file, I only inspect whether the path ends in
    a 'something.something' style. Since in the inits I can count on files
    being `yaml` type with `yaml` extensions, I use simplified logic there to
    determine whether we have a dir or a file. It is possible to generalize
    the inits to other file types, as long as they have an extension.
  * I assume we do not want to force users to be mindful of trailing
    separators.
- Now raising errors on unrecognized schemes in all `if scheme ==` patterns.
- Closed the StreamIO in Config.dumpToFile.
- Fixed up the Formatters to return bytes instead of strings; the fromBytes
  methods now expect bytes as well. JSON and YAML were the main culprits;
  fixed their docs. At this point I am fairly confident I had simply
  overwritten these fixes by accident when rewinding changes on rebase,
  because I have done that before, twice.
- Added a different way to check whether files exist: cheaper, but possibly
  slower. From boto/botocore#1248 it is my understanding that this should no
  longer be an issue, but newer boto3 versions are slow to hit package
  managers.
- `__initFromS3File` is now renamed `__initFromS3YamlFile`.
- `__initFromS3YamlFile` now uses the `__initFromYaml` method to make the
  format dependency explicitly clear.
- Changes to method docs.
- `dumpToS3` renamed to `dumpToS3File` to better match the naming of existing
  functionality.
- In `dumpToS3File` and `ButlerConfig` it is assumed that files must have
  extensions, for the cases where it is not possible to resolve whether a
  string points to a dir or a file.
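
A sketch of the pathInBucket conversion described in the Location item above (illustrative only; the real property lives on Location):

def path_in_bucket(path):
    """Convert a POSIX-style absolute path to an S3 key.

    S3 treats '/path/', '/path', 'path/' and 'path' as distinct keys;
    keys carry no leading separator and the bucket is passed to boto3
    separately, so only the leading '/' is stripped here.
    """
    return path.lstrip("/")

path_in_bucket("/path/to/file.ext")  # -> "path/to/file.ext"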
Changes
-------

- Added an `ospath` property to ButlerURI that localizes the posix-like
  uri.path.
- Updated the docstrings in the ButlerURI class to correctly describe what
  the properties/methods do.
- File vs. dir is now resolved by checking whether "." is present in the
  path, for both Config and ButlerConfig. If not a dir, the path is no longer
  forced to be the top-level directory before updating the file name.
- Fixed various badly merged Formatter docstrings and file IO.
- Removed the file-extension update call in the 'to/from'Bytes methods in
  Formatters.
- Restored the `super().setConfigRoot` call in SqliteRegistry.
- Removed an extra boto3-present check from test_butler.py.
- Fixed badly formatted docstrings in s3utils.
- Renamed the `cheap` kwarg to `slow` in `s3CheckFileExists`.
Changes
-------
- Added relativeToNetloc to ButlerURI, since we already force netloc to
  always exist. This removes some uri.path.lstrip calls.
- Fixed the s3utils docstrings.
- Added a Raises section to the fileFormatter docstrings.
Reverted Location to not autofill 'localhost' and 'file'.
Reverted the Location tests to match.

Removed value from the log message; it was a mistake after all.

Replaced all single quotes with double quotes in my PR.
Schemeless URIs are handled more smoothly in ButlerURI.

Replaced s3CheckFileExistsGET and s3CheckFileExistsLIST with a single
s3CheckFileExists function, as botocore is now up to date on conda and on
pip, so GET/HEAD requests no longer make a performance difference.
Live testing shows that an HTTP 403 response can be returned for HEAD
requests when the file does not exist and the user does not have
s3:ListBucket permissions. Added a more informative error message.
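
A sketch of the resulting check (the real s3CheckFileExists also reports the file size and differs in signature):

import boto3
from botocore.exceptions import ClientError

def s3_check_file_exists(client, bucket, key):
    """Existence check via a HEAD request (illustrative)."""
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        code = err.response["ResponseMetadata"]["HTTPStatusCode"]
        if code == 404:
            return False
        if code == 403:
            # Without s3:ListBucket permission, S3 can answer 403 even for
            # missing keys, so say so instead of reporting "absent".
            raise PermissionError(
                f"Forbidden HEAD request for {bucket}/{key}; a 403 can "
                "also mean the key does not exist when the user lacks "
                "s3:ListBucket permission."
            ) from err
        raise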

Changed the s3utils function signatures to look more alike.

Changed an instance of an os.path join to a posixpath join in test_butler,
to be more consistent with what is expected from the URIs used.

Corrected Config.__initFromS3Yaml to use a URL instead of a path, and
Config.__initFromYaml to use ospath.
Updated the formatters to use fileDescriptors as attributes.

Updated S3Datastore to match the new formatters. Fixed mistakes in the
S3Datastore ingest functionality. Removed hardcoded path manipulations.

Fixes to the S3Datastore put functionality; missing constraints checks added
in. Prettification and correction of comments in the get functionality.
Additional checks added. Fixes to the getUri functionality; somehow it seems
it never got updated when the ButlerURI class was written.

Implemented review requests in ButlerURI.