Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Adds data stores and references #37

Open
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@MSeal
Copy link
Member

commented Apr 1, 2019

This is still a WIP, more testing and internal API contracts need to be evaluated

In the glue api there is now a store and data_path optional arguments. These allow for specifying the path one wishes to use in saving data, as well as the storage mechanism for that data (default "notebook"). In most cases to support remote saves the caller simply has to add data_path="s3://mybucket/my/path/to/data.json to the glue call to save data to that remote path. When recalling the named data scrap the library will automatically know how to fetch and translate the data saved at the referenced path.

Internally, In place of just specifying an encoder, there's now an encoder_name and a store_name ("json", "notebook") associated with data managers. Data managers can implement encoding or storing, or both capabilities. One can implement ("text", None) and (None, "s3") independently. This lets simple encoders return string or bytes that the stores can save while allowing for more complex store and recall mechanism to encode and store at the same time (e.g. dataframe to multi-file s3 parquet via ("arrow", "s3")).

To facilitate the contract changes there is now a scrapbook v2 schema. Loading v1 or v2 data is transparent to the library user.

@MSeal MSeal requested review from rgbkrk, mpacer and willingc Apr 1, 2019

@mpacer mpacer referenced this pull request Apr 1, 2019

Open

Dev Docs #38

@@ -1,5 +1,6 @@
pandas
six
retry

This comment has been minimized.

Copy link
@willingc

willingc Apr 5, 2019

Member

Looks like this hasn't been updated in 3 years. Is this something that we really need as a dependency? If so, is there a better maintained alternative.

This comment has been minimized.

Copy link
@MSeal

MSeal Apr 5, 2019

Author Member

I could change it over to tenacity. I think I left this here by accident while I had GCS code in my dev, so might not need it in this repo at all.

This comment has been minimized.

Copy link
@willingc

willingc Apr 5, 2019

Member

Cool. Let's double check and remove if possible.

@MSeal MSeal requested a review from captainsafia Apr 7, 2019

@todo

This comment has been minimized.

Copy link

commented Apr 7, 2019

Handle missing depedency as papermill does

# TODO: Handle missing depedency as papermill does
class S3(object): pass
from .scraps import scrap_to_payload
from .exceptions import ScrapbookMissingEncoder, ScrapbookMissingStore


This comment was generated by todo based on a TODO comment in 3d24c67 in #37. cc @MSeal.

for manager in reversed(list(self.values())):
# Only check for managers here that can both encode and store
if not (manager.encoder_name and manager.store_name):

This comment has been minimized.

Copy link
@captainsafia

captainsafia Apr 7, 2019

Member

I found this function a little hard to read. I feel like it would be easier to read if this if-statement was inverted and the logic inside the try statement was placed inside it. Something like:

if manager.encoder_name and manager.store_name:
    blah

Assuming I understand that the intent of this function is to run through the list of valid managers and retrieve the ones that work with the scrap passed as a parameter.

This comment has been minimized.

Copy link
@MSeal

MSeal Apr 8, 2019

Author Member

Good point. I'll take a close look at this section next pass

if scrap.encoder is not None:
return self.get((scrap.encoder, None))

for encoder in reversed(list(self.values())):

This comment has been minimized.

Copy link
@captainsafia

captainsafia Apr 7, 2019

Member

Curious: is there a reason you reverse the list of encoders here and elsewhere? Does it have something to do with the properties of an OrderedDict?

This comment has been minimized.

Copy link
@MSeal

MSeal Apr 8, 2019

Author Member

Yes. You can't prefix to OrderedDicts easily, which we need to be able to do, it requires rebuilding the whole structure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.