Merged
54 commits
053bfce
IMP: Allow tags to contain capitals and forward slashes
cortadocodes May 3, 2021
7ff84cb
FIX: Disallow slashes at start or end of tag
cortadocodes May 3, 2021
89b0fa9
REF: Pull metadata gathering for Datafile into method
cortadocodes May 3, 2021
0fa1ef5
Merge pull request #153 from octue/refactor/group-datafile-metadata-v…
thclark May 3, 2021
a4a7a0a
IMP: Use Google Cloud Storage custom_time for storing Datafile.timestamp
cortadocodes May 3, 2021
a2218e1
REF: Get datetime objects directly from blob
cortadocodes May 3, 2021
136a19c
IMP: Add Datafile.posix_timestamp property
cortadocodes May 3, 2021
774564f
IMP: Allow datetime and posix timestamps for Datafiles
cortadocodes May 3, 2021
aed9369
FIX: Stop serialising GCS metadata as JSON
cortadocodes May 3, 2021
8b5822f
FIX: Allow there to be no timestamp when persisting as GCS custom_time
cortadocodes May 3, 2021
9a9c1b1
REF: Remove hash_value from Datafile GCS metadata; stop hashing metadata
cortadocodes May 4, 2021
bb6d3c9
IMP: Raise error if crc32c metadata missing from cloud Datafile
cortadocodes May 4, 2021
451eb08
FIX: Use empty string hash value if crc32c GCS metadata is missing
cortadocodes May 4, 2021
57796d8
IMP: Remove hash value from serialisations
cortadocodes May 4, 2021
e8d8b4f
IMP: Allow Datafile to remember where in the cloud it came from
cortadocodes May 4, 2021
c3fb7b6
IMP: Add Datafile.update_metadata method
cortadocodes May 4, 2021
08081cb
Merge pull request #155 from octue/fix/ensure-gcs-metadata-strings-do…
thclark May 5, 2021
d8abeeb
Merge pull request #152 from octue/feature/allow-tags-to-have-capital…
thclark May 5, 2021
31865ce
Merge branch 'release/0.1.17' into refactor/group-datafile-metadata-v…
cortadocodes May 5, 2021
1c7e845
MRG: Merge remote-tracking branch 'origin/release/0.1.17' into refact…
cortadocodes May 5, 2021
442963b
TST: Test Datafile.posix_timestamp
cortadocodes May 5, 2021
ee8cc7f
MRG: Merge pull request #154 from octue/refactor/group-datafile-metad…
cortadocodes May 5, 2021
76990de
MRG: Merge remote-tracking branch 'origin/release/0.1.17' into refact…
cortadocodes May 5, 2021
2976f18
IMP: Remove ability to set custom hash value when using Datafile.from…
cortadocodes May 5, 2021
a54f64a
MRG: Merge remote-tracking branch 'origin/release/0.1.17' into refact…
cortadocodes May 5, 2021
06b6782
REF: Make setting hash value of cloud datafiles simpler
cortadocodes May 5, 2021
4720602
MRG: Merge pull request #156 from octue/refactor/remove-hash-value-fr…
cortadocodes May 5, 2021
c6b6d7e
MRG: Merge branch 'release/0.1.17' into refactor/consolidate-cloud-da…
cortadocodes May 5, 2021
521f6c4
IMP: Raise error if implicit cloud location is missing from Datafile
cortadocodes May 5, 2021
7446c8b
IMP: Add Datafile._store_cloud_location method and use in cloud methods
cortadocodes May 5, 2021
a98bbc0
TST: Test Datafile cloud functions with/without implicit cloud locations
cortadocodes May 5, 2021
02819d5
TST: Factor out cloud datafile creation in datafile tests
cortadocodes May 5, 2021
5d3cfd7
IMP: Avoid re-uploading Datafile file or metadata if they haven't cha…
cortadocodes May 5, 2021
9570807
REF: Simplify output of GoogleCloudStorageClient.get_metadata
cortadocodes May 5, 2021
7a9d9cd
FIX: Add missing dictionary subscription
cortadocodes May 5, 2021
debb757
TST: Ensure file cache doesn't leak between tests
cortadocodes May 5, 2021
fdcd715
IMP: Allow Datafile to be used as a context manager for cloud changes
cortadocodes May 5, 2021
d27ec46
FIX: Get empty dict if custom metadata empty
cortadocodes May 5, 2021
265a658
TST: Test Datafile can be used as context manager for local changes
cortadocodes May 5, 2021
d91cc80
IMP: Allow option to not update cloud metadata in Datafile cloud methods
cortadocodes May 5, 2021
a81c3a9
FIX: Propagate __exit__ exception parameters
cortadocodes May 5, 2021
4509c47
REV: Revert to using timestamp metadata field instead of custom_time
cortadocodes May 7, 2021
f6e89d8
MRG: Merge pull request #160 from octue/fix/revert-custom-time-datafi…
cortadocodes May 7, 2021
b2aba06
OPS: Increase version number
cortadocodes May 7, 2021
d727669
MRG: Merge remote-tracking branch 'origin/release/0.1.17' into refact…
cortadocodes May 7, 2021
8b819b9
REF: Rename context manager inside Datafile
cortadocodes May 7, 2021
2bc160a
REF: Move DatafileContextManager out of Datafile and make private
cortadocodes May 7, 2021
33396e9
DOC: Add docstring to _DatafileContextManager
cortadocodes May 7, 2021
68be5dd
IMP: Use hash of local file if cloud datafile's file has been downloaded
cortadocodes May 7, 2021
aa34562
DOC: Add more docstrings to datafile module
cortadocodes May 7, 2021
f7d3c93
DOC: Add documentation on Datafile usages
cortadocodes May 7, 2021
cfb4022
DOC: Update datafile documentation with image and correction
cortadocodes May 7, 2021
759de13
TST: Add Hashable immutable hash test
cortadocodes May 7, 2021
6ad67e9
MRG: Merge pull request #157 from octue/refactor/consolidate-cloud-da…
cortadocodes May 7, 2021
131 changes: 131 additions & 0 deletions docs/source/datafile.rst
@@ -12,3 +12,134 @@ the following main attributes:
- ``sequence`` - a sequence number of this file within its cluster (if sequences are appropriate)
- ``tags`` - a space-separated string or iterable of tags relevant to this file
- ``timestamp`` - a POSIX timestamp associated with the file, in seconds since the Unix epoch; typically when the file was created, but it can mark any time point relevant to the data


-----
Usage
-----

``Datafile`` can be used functionally or as a context manager. When used as a context manager, it is analogous to the
built-in ``open`` function. On exiting the ``with`` block, the datafile is closed locally and, if it is a cloud
datafile, the cloud object is updated with any data or metadata changes.
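The close-and-update behaviour described above can be sketched as a minimal context manager. This is an illustrative stand-in, not octue's actual implementation; the ``local_path``, ``is_in_cloud``, and ``to_cloud`` names on the datafile object are assumptions made for the sketch:

```python
class DatafileContextManagerSketch:
    """Illustrative sketch: open a datafile's local file and, on exit, close
    it and push any changes to the cloud for cloud-backed datafiles."""

    def __init__(self, datafile, mode="r"):
        self.datafile = datafile
        self.mode = mode
        self._file = None

    def __enter__(self):
        self._file = open(self.datafile.local_path, self.mode)
        # Yield both the datafile and the open file handle.
        return self.datafile, self._file

    def __exit__(self, exc_type, exc_value, traceback):
        self._file.close()
        if self.datafile.is_in_cloud:
            self.datafile.to_cloud()  # update cloud data/metadata on exit
        return False  # don't swallow exceptions; let them propagate
```

Returning ``False`` from ``__exit__`` matches the "Propagate ``__exit__`` exception parameters" fix in the commit log: any exception raised inside the ``with`` block is re-raised after cleanup.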


.. image:: images/datafile_use_cases.png


Example A
---------
**Scenario:** Download a cloud object, calculate Octue metadata from its contents, and add the new metadata to the cloud object

**Starting point:** Object in cloud with or without Octue metadata

**Goal:** Object in cloud with updated metadata

.. code-block:: python

from octue.resources import Datafile


project_name = "my-project"
bucket_name = "my-bucket"
datafile_path = "path/to/data.csv"

with Datafile.from_cloud(project_name, bucket_name, datafile_path, mode="r") as (datafile, f):
data = f.read()
new_metadata = metadata_calculating_function(data)

datafile.timestamp = new_metadata["timestamp"]
datafile.cluster = new_metadata["cluster"]
datafile.sequence = new_metadata["sequence"]
datafile.tags = new_metadata["tags"]


Example B
---------
**Scenario:** Add or update Octue metadata on an existing cloud object *without downloading its content*

**Starting point:** A cloud object with or without Octue metadata

**Goal:** Object in cloud with updated metadata

.. code-block:: python

from datetime import datetime
from octue.resources import Datafile


project_name = "my-project"
bucket_name = "my-bucket"
datafile_path = "path/to/data.csv"

datafile = Datafile.from_cloud(project_name, bucket_name, datafile_path)

datafile.timestamp = datetime.now()
datafile.cluster = 0
datafile.sequence = 3
datafile.tags = {"manufacturer:Vestas", "output:1MW"}

datafile.to_cloud() # Or, datafile.update_cloud_metadata()


Example C
---------
**Scenario:** Read in the contents and Octue metadata of an existing cloud object without intending to update it in the cloud

**Starting point:** A cloud object with Octue metadata

**Goal:** Cloud object data (contents) and metadata held locally in local variables

.. code-block:: python

from octue.resources import Datafile


project_name = "my-project"
bucket_name = "my-bucket"
datafile_path = "path/to/data.csv"

datafile = Datafile.from_cloud(project_name, bucket_name, datafile_path)

with datafile.open("r") as f:
data = f.read()

metadata = datafile.metadata()


Example D
---------
**Scenario:** Create a new cloud object from local data, adding Octue metadata

**Starting point:** A local file (or content data in a local variable), with Octue metadata held in local variables

**Goal:** A new object in the cloud with data and Octue metadata

For creating new data in a new local file:

.. code-block:: python

from octue.resources import Datafile


sequence = 2
tags = {"cleaned:True", "type:linear"}


with Datafile(path="path/to/local/file.dat", timestamp=None, sequence=sequence, tags=tags, mode="w") as (datafile, f):
f.write("This is some cleaned data.")

datafile.to_cloud(project_name="my-project", bucket_name="my-bucket", path_in_bucket="path/to/data.dat")


For existing data in an existing local file:

.. code-block:: python

from octue.resources import Datafile


sequence = 2
tags = {"cleaned:True", "type:linear"}

datafile = Datafile(path="path/to/local/file.dat", timestamp=None, sequence=sequence, tags=tags)
datafile.to_cloud(project_name="my-project", bucket_name="my-bucket", path_in_bucket="path/to/data.dat")
2 changes: 1 addition & 1 deletion docs/source/deploying_services.rst
@@ -15,7 +15,7 @@ Automated deployment with Octue means:

All you need to enable automated deployments are the following files in your repository root:

* A ``requirements.txt`` file that includes ``octue>=0.1.16`` and the rest of your service's dependencies
* A ``requirements.txt`` file that includes ``octue>=0.1.17`` and the rest of your service's dependencies
* A ``twine.json`` file
* A ``deployment_configuration.json`` file (optional)

Binary file added docs/source/images/datafile_use_cases.png
52 changes: 32 additions & 20 deletions octue/cloud/storage/client.py
@@ -1,5 +1,4 @@
import base64
import json
import logging
from google.cloud import storage
from google.cloud.storage.constants import _DEFAULT_TIMEOUT
@@ -28,6 +27,7 @@ def __init__(self, project_name, credentials=OCTUE_MANAGED_CREDENTIALS):
credentials = credentials

self.client = storage.Client(project=project_name, credentials=credentials)
self.project_name = project_name

def create_bucket(self, name, location=None, allow_existing=False, timeout=_DEFAULT_TIMEOUT):
"""Create a new bucket. If the bucket already exists, and `allow_existing` is `True`, do nothing; if it is
@@ -83,6 +83,17 @@ def upload_from_string(self, string, bucket_name, path_in_bucket, metadata=None,
self._update_metadata(blob, metadata)
logger.info("Uploaded data to Google Cloud at %r.", blob.public_url)

def update_metadata(self, bucket_name, path_in_bucket, metadata):
"""Update the metadata for the given cloud file.

:param str bucket_name:
:param str path_in_bucket:
:param dict metadata:
:return None:
"""
blob = self._blob(bucket_name, path_in_bucket)
self._update_metadata(blob, metadata)

def download_to_file(self, bucket_name, path_in_bucket, local_path, timeout=_DEFAULT_TIMEOUT):
"""Download a file from a Google Cloud bucket at gs://<bucket_name>/<path_in_bucket> to a local file.

@@ -118,12 +129,23 @@ def get_metadata(self, bucket_name, path_in_bucket, timeout=_DEFAULT_TIMEOUT):
:return dict:
"""
bucket = self.client.get_bucket(bucket_or_name=bucket_name)
metadata = bucket.get_blob(blob_name=self._strip_leading_slash(path_in_bucket), timeout=timeout)._properties

if metadata.get("metadata") is not None:
metadata["metadata"] = {key: json.loads(value) for key, value in metadata["metadata"].items()}

return metadata
blob = bucket.get_blob(blob_name=self._strip_leading_slash(path_in_bucket), timeout=timeout)

if blob is None:
return None

return {
"custom_metadata": blob.metadata or {},
"crc32c": blob.crc32c,
"size": blob.size,
"updated": blob.updated,
"time_created": blob.time_created,
"time_deleted": blob.time_deleted,
"custom_time": blob.custom_time,
"project_name": self.project_name,
"bucket_name": bucket_name,
"path_in_bucket": path_in_bucket,
}
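The new return shape can be reproduced with a small standalone helper. The field names below come straight from the diff; the helper itself and the stub blob used to exercise it are illustrative:

```python
def build_blob_metadata(blob, project_name, bucket_name, path_in_bucket):
    """Mirror of the dictionary built in get_metadata above: return None for
    a missing blob, otherwise collect standard and custom metadata fields."""
    if blob is None:
        return None

    return {
        "custom_metadata": blob.metadata or {},  # no longer JSON-decoded
        "crc32c": blob.crc32c,
        "size": blob.size,
        "updated": blob.updated,
        "time_created": blob.time_created,
        "time_deleted": blob.time_deleted,
        "custom_time": blob.custom_time,
        "project_name": project_name,
        "bucket_name": bucket_name,
        "path_in_bucket": path_in_bucket,
    }
```

Note the behavioural change for callers: a missing blob now yields ``None`` rather than raising an ``AttributeError`` from ``._properties``, so callers must check for ``None`` before subscripting.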

def delete(self, bucket_name, path_in_bucket, timeout=_DEFAULT_TIMEOUT):
"""Delete the given file from the given bucket.
@@ -189,16 +211,6 @@ def _update_metadata(self, blob, metadata):
:param dict metadata:
:return None:
"""
blob.metadata = self._encode_metadata(metadata or {})
blob.patch()

def _encode_metadata(self, metadata):
"""Encode metadata as a dictionary of JSON strings.

:param dict metadata:
:return dict:
"""
if not isinstance(metadata, dict):
raise TypeError(f"Metadata for Google Cloud storage should be a dictionary; received {metadata!r}")

return {key: json.dumps(value) for key, value in metadata.items()}
if metadata is not None:
blob.metadata = metadata
blob.patch()
6 changes: 6 additions & 0 deletions octue/exceptions.py
@@ -82,3 +82,9 @@ class AttributeConflict(OctueSDKException):

class MissingServiceID(OctueSDKException):
"""Raise when a specific ID for a service is expected to be provided, but is missing or None."""


class CloudLocationNotSpecified(OctueSDKException):
"""Raise when attempting to interact with a cloud resource implicitly but the implicit details of its location are
missing.
"""
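A minimal sketch of how a guard using this exception might look, per the "Raise error if implicit cloud location is missing" commit. The helper name and call pattern are assumptions for illustration, not octue's actual code:

```python
class CloudLocationNotSpecified(Exception):
    """Stand-in for octue.exceptions.CloudLocationNotSpecified (illustrative)."""


def resolve_cloud_location(stored_location, project_name=None, bucket_name=None, path_in_bucket=None):
    """Prefer explicitly-given location details; otherwise fall back to a
    previously-stored location, raising if neither is available."""
    if project_name and bucket_name and path_in_bucket:
        return project_name, bucket_name, path_in_bucket

    if stored_location is None:
        raise CloudLocationNotSpecified(
            "No cloud location was given and none was stored from a previous cloud interaction."
        )

    return stored_location
```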
26 changes: 15 additions & 11 deletions octue/mixins/hashable.py
@@ -1,14 +1,18 @@
import base64
import collections.abc
import datetime
from google_crc32c import Checksum


EMPTY_STRING_HASH_VALUE = "AAAAAA=="

_HASH_PREPARATION_FUNCTIONS = {
str: lambda attribute: attribute,
int: str,
float: str,
type(None): lambda attribute: "None",
dict: lambda attribute: str(sorted(attribute.items())),
datetime.datetime: str,
}
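The dispatch table above normalises heterogeneous attribute values to strings before hashing. A runnable sketch of the idea, using the stdlib ``zlib.crc32`` as a stand-in for the CRC32C checksum used in the real code:

```python
import base64
import datetime
import zlib

PREPARATION_FUNCTIONS = {
    str: lambda attribute: attribute,
    int: str,
    float: str,
    type(None): lambda attribute: "None",
    dict: lambda attribute: str(sorted(attribute.items())),
    datetime.datetime: str,
}


def calculate_hash(attributes):
    """Normalise each value to a string and feed them through a rolling
    checksum in sorted-key order, so the result is deterministic."""
    checksum = 0
    for name in sorted(attributes):
        prepared = PREPARATION_FUNCTIONS[type(attributes[name])](attributes[name])
        checksum = zlib.crc32(prepared.encode(), checksum)
    return base64.b64encode(checksum.to_bytes(4, "big")).decode()
```

With no attributes the checksum stays at zero, which base64-encodes to ``AAAAAA==``, the same shape as the ``EMPTY_STRING_HASH_VALUE`` constant above (both CRC32 and CRC32C are 32-bit checksums that are zero for empty input).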


@@ -17,8 +21,9 @@ class Hashable:
_ATTRIBUTES_TO_HASH = None
_HASH_TYPE = "CRC32C"

def __init__(self, hash_value=None, *args, **kwargs):
self._hash_value = hash_value
def __init__(self, immutable_hash_value=None, *args, **kwargs):
self._immutable_hash_value = immutable_hash_value
self._ATTRIBUTES_TO_HASH = self._ATTRIBUTES_TO_HASH or []
super().__init__(*args, **kwargs)

@classmethod
@@ -35,14 +40,10 @@ class Holder(cls):
@property
def hash_value(self):
"""Get the hash of the instance."""
if self._hash_value:
return self._hash_value

if not self._ATTRIBUTES_TO_HASH:
return None
if self._immutable_hash_value is None:
return self._calculate_hash()

self._hash_value = self._calculate_hash()
return self._hash_value
return self._immutable_hash_value

@hash_value.setter
def hash_value(self, value):
@@ -51,14 +52,17 @@ def hash_value(self, value):
:param str value:
:return None:
"""
self._hash_value = value
if self._immutable_hash_value is not None:
raise ValueError(f"The hash of {self!r} is immutable - hash_value cannot be set.")

self._immutable_hash_value = value

def reset_hash(self):
"""Reset the hash value to the calculated hash (rather than whatever value has been set).

:return None:
"""
self._hash_value = self._calculate_hash()
self._immutable_hash_value = None

def _calculate_hash(self, hash_=None):
"""Calculate the hash of the sorted attributes in self._ATTRIBUTES_TO_HASH. If hash_ is not None and is
51 changes: 30 additions & 21 deletions octue/mixins/identifiable.py
@@ -25,27 +25,7 @@ def __init__(self, *args, id=None, name=None, **kwargs):
"""Constructor for Identifiable class"""
self._name = name
super().__init__(*args, **kwargs)

# Store a boolean record of whether this object was created with a previously-existing uuid or was created new.
self._created = True if id is None else False

if isinstance(id, uuid.UUID):
# If it's a uuid, stringify it
id = str(id)

elif isinstance(id, str):
# If it's a string (or something similar which can be converted to UUID) check it's valid
try:
id = str(uuid.UUID(id))
except ValueError:
raise InvalidInputException(f"Value of id '{id}' is not a valid uuid string or instance of class UUID")

elif id is not None:
raise InvalidInputException(
f"Value of id '{id}' must be a valid uuid string, an instance of class UUID or None"
)

self._id = id or gen_uuid()
self._set_id(id)

def __str__(self):
return f"{self.__class__.__name__} {self._id}"
@@ -60,3 +40,32 @@ def id(self):
@property
def name(self):
return self._name

def _set_id(self, value):
"""Set the ID to the given value.

:param str|uuid.UUID|None value:
:return None:
"""
# Store a boolean record of whether this object was created with a previously-existing uuid or was created new.
self._created = True if value is None else False

if isinstance(value, uuid.UUID):
# If it's a uuid, stringify it
value = str(value)

elif isinstance(value, str):
# If it's a string (or something similar which can be converted to UUID) check it's valid
try:
value = str(uuid.UUID(value))
except ValueError:
raise InvalidInputException(
f"Value of id '{value}' is not a valid uuid string or instance of class UUID"
)

elif value is not None:
raise InvalidInputException(
f"Value of id '{value}' must be a valid uuid string, an instance of class UUID or None"
)

self._id = value or gen_uuid()
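The validation logic factored into ``_set_id`` can be exercised standalone. The function below is an illustrative mirror of the method's branches (returning the ``created`` flag instead of setting instance attributes), not the method itself:

```python
import uuid


def normalise_id(value):
    """Return (id_string, created), where `created` records whether a fresh
    uuid was generated because no pre-existing id was supplied."""
    if value is None:
        return str(uuid.uuid4()), True  # created new

    if isinstance(value, uuid.UUID):
        return str(value), False  # stringify a UUID instance

    if isinstance(value, str):
        try:
            return str(uuid.UUID(value)), False  # validate the string
        except ValueError:
            raise ValueError(
                f"Value of id '{value}' is not a valid uuid string or instance of class UUID"
            )

    raise ValueError(
        f"Value of id '{value}' must be a valid uuid string, an instance of class UUID or None"
    )
```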