Simplify manifest and dataset serialisation and add local metadata storage #402

Merged (59 commits) on Apr 6, 2022

Conversation

cortadocodes
Member

@cortadocodes commented Mar 25, 2022

Summary

Simplify manifest and dataset serialisation so that the metadata of a manifest's datasets and of a dataset's files is gathered dynamically at instantiation rather than read from hard-coded values in the serialisation. To make this work on both cloud and local platforms, metadata can now be stored locally. This also required tightening the definition of a dataset so that each dataset is either cloud-based or local, but not both.

Contents (#402)

IMPORTANT: There are 5 breaking changes.

New features

  • Store local datafile metadata locally in a .octue JSON file in the same directory
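
To make the new feature concrete, here is a minimal sketch of keeping per-file metadata in a `.octue` JSON file beside the files it describes. Only the file name comes from this PR; the function names and the JSON layout (a `datafiles` mapping keyed by file name) are assumptions for illustration and are not taken from the octue codebase.

```python
import json
import os
import tempfile

METADATA_FILENAME = ".octue"  # file name from this PR; the JSON layout below is assumed


def save_datafile_metadata(file_path, metadata):
    """Merge one file's metadata into the `.octue` file in the same directory."""
    directory = os.path.dirname(os.path.abspath(file_path))
    metadata_path = os.path.join(directory, METADATA_FILENAME)

    try:
        with open(metadata_path) as f:
            stored = json.load(f)
    except FileNotFoundError:
        stored = {}

    # Assumed layout: {"datafiles": {"<file name>": {...metadata...}}}
    stored.setdefault("datafiles", {})[os.path.basename(file_path)] = metadata

    with open(metadata_path, "w") as f:
        json.dump(stored, f, indent=2)

    return metadata_path


def load_datafile_metadata(file_path):
    """Return the stored metadata for one file, or an empty dict if none exists."""
    directory = os.path.dirname(os.path.abspath(file_path))
    metadata_path = os.path.join(directory, METADATA_FILENAME)

    try:
        with open(metadata_path) as f:
            stored = json.load(f)
    except FileNotFoundError:
        return {}

    return stored.get("datafiles", {}).get(os.path.basename(file_path), {})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "data.csv")
        save_datafile_metadata(path, {"labels": ["raw"]})
        print(load_datafile_metadata(path))
```

Keeping the metadata in one file per directory means the data files themselves are never modified, and a missing `.octue` file simply reads as "no metadata yet".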

Enhancements

Dataset:

  • BREAKING CHANGE: Change Dataset serialisation to only include the paths to its files and its own metadata
  • BREAKING CHANGE: Store local dataset metadata locally in a .octue JSON file in its root directory instead of datafile_metadata.json
  • Allow datasets to be instantiated from iterables of paths to files
  • Remove Pathable mixin
  • Unify Dataset.path and Dataset.cloud_path
  • Raise error if trying to download files from a local dataset
  • BREAKING CHANGE: Remove confusing way of adding files to datasets
  • BREAKING CHANGE: Only allow one file to be added to a dataset at a time in Dataset.add
  • Allow specification of path within dataset when adding a file to a dataset
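
The Dataset changes above can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the octue `Dataset` itself: `SketchDataset`, its method signatures, and the use of a `gs://` prefix to recognise a cloud path are all hypothetical.

```python
import os


class SketchDataset:
    """Hypothetical stand-in for a dataset that is either local or cloud-based, never both."""

    def __init__(self, path, files=()):
        self.path = path  # one path attribute covers both local and cloud locations
        self.files = {}  # maps path within the dataset -> source file path

        # Any iterable of file paths is accepted.
        for file_path in files:
            self.add(file_path)

    @property
    def exists_in_cloud(self):
        # Assumption: cloud paths are recognised by a "gs://" prefix.
        return self.path.startswith("gs://")

    def add(self, file_path, path_in_dataset=None):
        """Add exactly one file, optionally at a chosen path inside the dataset."""
        if not isinstance(file_path, str):
            raise TypeError("Only one file path can be added at a time.")

        key = path_in_dataset or os.path.basename(file_path)
        self.files[key] = file_path

    def download(self, local_directory):
        """Return the local target paths; the actual transfer is omitted in this sketch."""
        if not self.exists_in_cloud:
            raise ValueError(f"{self.path!r} is a local dataset, so there is nothing to download.")

        return [os.path.join(local_directory, key) for key in self.files]


if __name__ == "__main__":
    dataset = SketchDataset("my-dataset", files=(name for name in ["a.csv", "b.csv"]))
    dataset.add("c.csv", path_in_dataset="nested/c.csv")
    print(sorted(dataset.files))
```

Deriving the local-or-cloud distinction from the single `path` attribute is what makes the "either cloud-based or local, but not both" rule easy to enforce in this sketch.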

Manifest:

  • BREAKING CHANGE: Change Manifest serialisation to only include the paths to its datasets and its own metadata
  • Remove Pathable mixin
  • Remove path attribute
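
As an illustration of the slimmer Manifest serialisation, the sketch below serialises only the manifest's own metadata and the paths to its datasets, then rebuilds the datasets from those paths on loading. The field names (`id`, `datasets`) and both function names are assumptions made for this example, not the real octue schema.

```python
import json


def manifest_to_primitive(manifest_id, dataset_paths):
    """Build a minimal manifest serialisation: its own metadata plus dataset paths only."""
    return {"id": manifest_id, "datasets": dict(dataset_paths)}


def manifest_from_primitive(primitive, load_dataset):
    """Rebuild the datasets by loading each one from its path at instantiation time."""
    return {name: load_dataset(path) for name, path in primitive["datasets"].items()}


if __name__ == "__main__":
    primitive = manifest_to_primitive("abc", {"met_mast": "gs://bucket/met_mast"})
    print(json.dumps(primitive, indent=2))

    # `load_dataset` is any callable that turns a path into a dataset object;
    # a trivial placeholder is used here.
    print(manifest_from_primitive(primitive, load_dataset=lambda path: {"path": path}))
```

Because no dataset or file metadata is embedded, the serialised manifest cannot drift out of date with the metadata stored alongside the data.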

Other:

  • Use Serialisable.to_primitive as the basis for Serialisable.serialise to simplify and speed up conversion to primitives
  • Use Manifest.to_primitive in Analysis.finalise
  • Update remaining JSON metaschema references to use the latest metaschema
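
The first item above, basing `serialise` on `to_primitive`, can be sketched as a small mixin. Only the two method names come from this PR; the `_SERIALISE_FIELDS` attribute, the class names, and the method bodies are assumptions for illustration.

```python
import json


class SerialisableSketch:
    """Hypothetical mixin: `serialise` is a thin JSON wrapper around `to_primitive`."""

    _SERIALISE_FIELDS = ()  # assumed convention: names of the attributes to include

    def to_primitive(self):
        """Return a dict of JSON-compatible primitives without encoding to a string."""
        return {name: getattr(self, name) for name in self._SERIALISE_FIELDS}

    def serialise(self, **kwargs):
        """Encode the primitive form as a JSON string."""
        return json.dumps(self.to_primitive(), **kwargs)


class ExampleResource(SerialisableSketch):
    _SERIALISE_FIELDS = ("name", "path")

    def __init__(self, name, path):
        self.name = name
        self.path = path


if __name__ == "__main__":
    resource = ExampleResource(name="my-dataset", path="gs://bucket/my-dataset")
    print(resource.to_primitive())
    print(resource.serialise(indent=2))
```

Callers that only need a dict, as `Analysis.finalise` does with `Manifest.to_primitive`, can then skip the encode-to-string and decode-again round trip.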

Fixes

  • Stop setting arbitrary kwargs as attributes in Dataset and Manifest
  • Ignore .octue files when constructing datasets
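
The second fix can be illustrated with a directory walk that skips the metadata file, so the metadata store is never mistaken for a data file. This is a sketch only: `list_dataset_files` is a hypothetical helper, not a function from the octue codebase.

```python
import os
import tempfile


def list_dataset_files(directory, metadata_filename=".octue"):
    """Return the sorted paths of all files under `directory`, skipping metadata files."""
    paths = []

    for root, _, filenames in os.walk(directory):
        for filename in filenames:
            if filename == metadata_filename:
                continue  # the metadata store is not part of the dataset's contents

            paths.append(os.path.join(root, filename))

    return sorted(paths)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        for filename in ("a.csv", "b.csv", ".octue"):
            with open(os.path.join(directory, filename), "w") as f:
                f.write("{}")

        print([os.path.basename(path) for path in list_dataset_files(directory)])
```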

Dependencies

  • Raise the minimum supported Python version from 3.7 to 3.7.1
  • Use pandas=^1.3 to avoid numpy array size errors
  • Use twined=0.3.0

Refactoring

  • Reorder and rename methods in Datafile, Dataset, and Manifest
  • Move dataset files tag checking from twined into Runner._validate_dataset_file_tags
  • Move datafile instantiation in Dataset into separate method

@cortadocodes linked an issue on Mar 28, 2022 that may be closed by this pull request
@cortadocodes changed the title from "Unify dataset path and cloud path" to "Simplify manifest and dataset serialisation and add local metadata storage" on Mar 29, 2022
@cortadocodes merged commit 2e2c1d1 into main on Apr 6, 2022
@cortadocodes deleted the enhancement/unify-dataset-path-and-cloud-path branch on April 6, 2022 09:45