Merged
87 commits
cd32c92
CLN: Remove filters field from manifest strand in twines
cortadocodes May 10, 2021
0c579ae
IMP: Disallow more than one colon in tags
cortadocodes May 12, 2021
71d5096
REV: Revert "IMP: Disallow more than one colon in tags"
cortadocodes May 12, 2021
551f97d
MRG: Merge remote-tracking branch 'origin/main' into feature/tag-temp…
cortadocodes May 12, 2021
c5a69b9
DEP: Use new version of twined
cortadocodes May 12, 2021
fa853cf
IMP: Convert tags to labels
cortadocodes May 13, 2021
ae8f0ef
REF: Change filtering syntax to filter_name=value
cortadocodes May 13, 2021
dbc9460
IMP: Add ability to filter by nested attributes/dicts
cortadocodes May 17, 2021
4cc19eb
IMP: Add FilterDict
cortadocodes May 17, 2021
f7d8fea
IMP: Add TagDict and use in Taggable
cortadocodes May 17, 2021
5aad839
IMP: Make Datafiles and Datasets taggable again
cortadocodes May 17, 2021
3e98232
IMP: Stop logging in Serialisable; always exclude logger field in Ser…
cortadocodes May 17, 2021
31f6d22
FIX: Remove Serialisable mixin from LabelSet
cortadocodes May 17, 2021
26d40d7
DEP: Use correct twined branch
cortadocodes May 17, 2021
8299563
IMP: Allow tags to be added to Taggable as kwargs
cortadocodes May 17, 2021
5556be9
TST: Test Taggable
cortadocodes May 17, 2021
b1ca0c9
TST: Test setting items on TagDict
cortadocodes May 17, 2021
0f34a00
TST: Test chaining filters on FilterDict
cortadocodes May 17, 2021
1dd629e
TST: Test filters for a TagDict
cortadocodes May 17, 2021
b7f79bc
FIX: Add tags parameter back to Datafile and Dataset constructors
cortadocodes May 17, 2021
e43ea63
FIX: Initialise superclass in Taggable mixin
cortadocodes May 17, 2021
ea66c28
CLN: Remove extra parameters from Label
cortadocodes May 17, 2021
515fe70
FIX: Add serialise method to TagDict
cortadocodes May 17, 2021
ce39fb9
FIX: Deserialise TagDicts properly in Datafile.from_cloud
cortadocodes May 17, 2021
a2c5d00
TST: Fix Taggable test
cortadocodes May 17, 2021
1d2dcff
CLN: Simplify test method
cortadocodes May 17, 2021
580d432
FIX: Fix Dataset tags parameter
cortadocodes May 17, 2021
d92de3d
CLN: Remove unused _FILTERSET_ATTRIBUTE class variables
cortadocodes May 17, 2021
29e92fe
REF: Base LabelSet on FilterSet
cortadocodes May 17, 2021
73b32b0
REF: Base Label on str
cortadocodes May 17, 2021
4a16676
IMP: Allow FilterDicts to be ordered by their values
cortadocodes May 17, 2021
b6742be
IMP: Allow ignoring of filterables without filtered-for attribute
cortadocodes May 17, 2021
187ac5e
IMP: Allow multiple filters in filter containers' filter methods
cortadocodes May 17, 2021
8f35468
REF: Use lambda for filter instead of def function
cortadocodes May 17, 2021
4c75cb0
IMP: Add Dataset.get_file_by_tag
cortadocodes May 17, 2021
65be685
CLN: Remvove unnecessary class variable; use more pythonic method ove…
cortadocodes May 17, 2021
7da5b64
CLN: Remove commented-out code
cortadocodes May 17, 2021
51b5be4
DEP: Use latest GCS emulator
cortadocodes May 18, 2021
9ed62c7
FIX: Handle timestamps from cloud with/without timezone information
cortadocodes May 18, 2021
d53f518
IMP: Limit allowed tag name and label patterns
cortadocodes May 18, 2021
75c183a
IMP: Raise error if non-Filterables are put into filter containers
cortadocodes May 18, 2021
0f826b4
IMP: Store tags in separate custom metadata fields on GCS
cortadocodes May 18, 2021
b0b879d
DOC: Fix incorrect/outdated information in docs
cortadocodes May 18, 2021
a2750a8
REF: Slightly simplify Taggable and Labelable
cortadocodes May 18, 2021
84f0a03
FIX: Make Analysis taggable again
cortadocodes May 18, 2021
0a570b3
DOC: Update templates with labels/tags
cortadocodes May 18, 2021
2aa0ac3
REF: Simplify Datafile.metadata method
cortadocodes May 21, 2021
ebc11b5
TST: Test that datafile tags are stored as separate pieces of custom …
cortadocodes May 21, 2021
19e0864
REV: Remove Dataset.get_file_by_tag method
cortadocodes May 21, 2021
de967dd
DOC: Update docstrings and error messages
cortadocodes May 21, 2021
d04f579
REV: Unbase TagDict from FilterDict
cortadocodes May 21, 2021
dad9ecf
REV: Unbase Label from Filterable and LabelSet from FilterSet
cortadocodes May 21, 2021
7d025ab
TST: Remove unneeded base class from test class
cortadocodes May 21, 2021
fbc058a
TST: Remove unnecessary casting to set
cortadocodes May 21, 2021
ab893a2
TST: Add wrongly-removed test back in
cortadocodes May 21, 2021
1cd5d77
DOC: Fix error string
cortadocodes May 21, 2021
bf6dec7
TST: Test failing of filtering Filterables with differing attributes;…
cortadocodes May 21, 2021
ad7afa5
TST: Simplify label tests
cortadocodes May 21, 2021
9b0725a
TST: Add tags to datasets in manifest tests
cortadocodes May 21, 2021
7e16fe8
TST: Improve and simplify some more tests
cortadocodes May 21, 2021
19c1891
TST: Test uncovered areas
cortadocodes May 21, 2021
00912a1
IMP: Use new format for manifests' datasets in twine.json files
cortadocodes May 21, 2021
bfccca7
IMP: Support non-English characters in case-insensitive filtering
cortadocodes May 25, 2021
05e6d5d
REF: Base filter containers on new FilterContainer abstract class
cortadocodes May 25, 2021
0446498
IMP: Return items when ordering FilterDict rather than just values
cortadocodes May 25, 2021
2b56d44
DOC: Update filter containers documentation
cortadocodes May 25, 2021
55035fc
DOC: Update other documentation
cortadocodes May 25, 2021
9e7d012
MRG: Merge remote-tracking branch 'origin/release/0.1.19' into featur…
cortadocodes May 25, 2021
d0b48dc
CLN: Remove unnecessary pass statements
cortadocodes May 25, 2021
1189a39
IMP: Add octue SDK version to datafile metadata
cortadocodes Jun 2, 2021
a2c3d98
IMP: Add `one` method to filter containers
cortadocodes Jun 2, 2021
e160ed7
TST: Update tests
cortadocodes Jun 2, 2021
6ad724e
REF: Move filter and order methods into FilterContainer
cortadocodes Jun 2, 2021
4cae5da
MRG: Merge remote-tracking branch 'origin/release/0.1.19' into featur…
cortadocodes Jun 2, 2021
b7b920e
IMP: JSON-encode cloud storage custom metadata
cortadocodes Jun 2, 2021
18912ea
REV: Store tags in tags field of cloud metadata again
cortadocodes Jun 2, 2021
97c5aab
REF: Rename GoogleCloudStorageClient methods; update docstrings
cortadocodes Jun 2, 2021
d1f26c9
DOC: Update filter container docstrings
cortadocodes Jun 2, 2021
af3a2ac
FIX: Allow ordering by nested attributes in other FilterContainers
cortadocodes Jun 2, 2021
cd90140
REF: Refactor Dataset.get_file_by_label
cortadocodes Jun 2, 2021
87e0f2b
IMP: Allow UserStrings to be JSON-encoded by default
cortadocodes Jun 2, 2021
52355b2
IMP: Add set serialisation to en/decoders
cortadocodes Jun 2, 2021
1793d1e
REF: Remove unnecessary methods from LabelSet
cortadocodes Jun 2, 2021
e5ba7d8
DOC: Document label module
cortadocodes Jun 2, 2021
8c5bdfa
REF: Remove method from TagDict; document methods
cortadocodes Jun 2, 2021
66b4924
FIX: Restore required method
cortadocodes Jun 2, 2021
484ff4d
REF: Rename add_labels method and add `add` method to Label
cortadocodes Jun 2, 2021
17 changes: 2 additions & 15 deletions docs/source/analysis_objects.rst
@@ -27,18 +27,5 @@ your app can always be verified. These hashes exist on the following attributes:
- ``configuration_values_hash``
- ``configuration_manifest_hash``

If an input or configuration attribute is ``None``, so will its hash attribute be. For ``Manifests``, some metadata
about the ``Datafiles`` and ``Datasets`` within them, and about the ``Manifest`` itself, is included when calculating
the hash:

- For a ``Datafile``, the content of its on-disk file is hashed, along with the following metadata:

- ``name``
- ``cluster``
- ``sequence``
- ``timestamp``
- ``tags``

- For a ``Dataset``, the hashes of its ``Datafiles`` are included, along with its ``tags``.

- For a ``Manifest``, the hashes of its ``Datasets`` are included, along with its ``keys``.
If a strand is ``None``, so will its corresponding hash attribute be. The hash of a datafile is the hash of
its file, while the hash of a manifest or dataset is the cumulative hash of the files it refers to.
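
To illustrate the idea of a cumulative hash, here is a conceptual sketch in plain Python (an illustration of the concept only, not the SDK's actual hashing scheme):

.. code-block:: python

    import hashlib

    def file_hash(path):
        """Hash the contents of a single file."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def cumulative_hash(member_hashes):
        """Combine member hashes into one value, as a dataset or manifest hash conceptually does."""
        combined = hashlib.sha256()
        for member_hash in sorted(member_hashes):  # Sort so that file order doesn't affect the result.
            combined.update(member_hash.encode())
        return combined.hexdigest()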
4 changes: 2 additions & 2 deletions docs/source/child_services.rst
@@ -104,13 +104,13 @@ The children field must also be present in the ``twine.json`` file:
"key": "wind_speed",
"purpose": "A service that returns the average wind speed for a given latitude and longitude.",
"notes": "Some notes.",
"filters": "tags:wind_speed"
"filters": "labels:wind_speed"
},
{
"key": "elevation",
"purpose": "A service that returns the elevation for a given latitude and longitude.",
"notes": "Some notes.",
"filters": "tags:elevation"
"filters": "labels:elevation"
}
],
...
2 changes: 1 addition & 1 deletion docs/source/cloud_storage.rst
@@ -12,7 +12,7 @@ in Octue SDK, please join the discussion `in this issue. <https://github.com/oct
Data container classes
----------------------
All of the data container classes in the SDK have a ``to_cloud`` and a ``from_cloud`` method, which handles their
upload/download to/from the cloud, including all relevant metadata from the instance (e.g. tags, ID). Data integrity is
upload/download to/from the cloud, including all relevant metadata from the instance (e.g. labels, ID). Data integrity is
checked before and after upload and download to ensure any data corruption is avoided.
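
For example, a minimal round trip with a ``Datafile`` (a sketch based on the method signatures shown on the :doc:`Datafile <datafile>` page; the import path is an assumption):

.. code-block:: python

    from octue.resources import Datafile  # Import path assumed.

    # Upload a datafile, including its metadata.
    datafile = Datafile(path="path/to/local/file.dat")
    datafile.to_cloud(project_name="my-project", bucket_name="my-bucket", path_in_bucket="path/to/data.dat")

    # Retrieve it again, metadata included.
    with Datafile.from_cloud("my-project", "my-bucket", "path/to/data.dat", mode="r") as (datafile, f):
        data = f.read()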

Datafile
6 changes: 3 additions & 3 deletions docs/source/cloud_storage_advanced_usage.rst
@@ -26,14 +26,14 @@ to any of these methods.
local_path=<path/to/file>,
bucket_name=<bucket-name>,
path_in_bucket=<path/to/file/in/bucket>,
metadata={"tags": ["blah", "glah", "jah"], "cleaned": True, "id": 3}
metadata={"id": 3, "labels": ["blah", "glah", "jah"], "cleaned": True, "colour": "blue"}
Contributor: isn't the id a uuid?

Member Author: It is for Identifiables but this is just metadata for any given file (this isn't a method of an Identifiable)

)

storage_client.upload_from_string(
string='[{"height": 99, "width": 72}, {"height": 12, "width": 103}]',
bucket_name=<bucket-name>,
path_in_bucket=<path/to/file/in/bucket>,
metadata={"tags": ["dimensions"], "cleaned": True, "id": 96}
metadata={"id": 96, "labels": ["dimensions"], "cleaned": True, "colour": "red", "size": "small"}
)

**Downloading**
@@ -61,7 +61,7 @@ to any of these methods.
bucket_name=<bucket-name>,
path_in_bucket=<path/to/file/in/bucket>,
)
>>> {"tags": ["dimensions"], "cleaned": True, "id": 96}
>>> {"id": 96, "labels": ["dimensions"], "cleaned": True, "colour": "red", "size": "small"}


**Deleting**
19 changes: 12 additions & 7 deletions docs/source/datafile.rst
@@ -10,7 +10,8 @@ the following main attributes:
- ``path`` - the path of this file, which may include folders or subfolders, within the dataset.
- ``cluster`` - the integer cluster of files, within a dataset, to which this belongs (default 0)
- ``sequence`` - a sequence number of this file within its cluster (if sequences are appropriate)
- ``tags`` - a space-separated string or iterable of tags relevant to this file
- ``tags`` - key-value pairs of metadata relevant to this file
- ``labels`` - a space-separated string or iterable of labels relevant to this file
- ``timestamp`` - a posix timestamp associated with the file, in seconds since epoch, typically when it was created but could relate to a relevant time point for the data


@@ -43,14 +44,15 @@ Example A
bucket_name = "my-bucket",
datafile_path = "path/to/data.csv"

with Datafile.from_cloud(project_name, bucket_name, datafile_path, mode="r") as datafile, f:
with Datafile.from_cloud(project_name, bucket_name, datafile_path, mode="r") as (datafile, f):
data = f.read()
new_metadata = metadata_calculating_function(data)

datafile.timestamp = new_metadata["timestamp"]
datafile.cluster = new_metadata["cluster"]
datafile.sequence = new_metadata["sequence"]
datafile.tags = new_metadata["tags"]
datafile.labels = new_metadata["labels"]


Example B
@@ -76,7 +78,8 @@ Example B
datafile.timestamp = datetime.now()
datafile.cluster = 0
datafile.sequence = 3
datafile.tags = {"manufacturer:Vestas", "output:1MW"}
datafile.tags = {"manufacturer": "Vestas", "output": "1MW"}
datafile.labels = {"new"}

datafile.to_cloud() # Or, datafile.update_cloud_metadata()

@@ -122,10 +125,11 @@ For creating new data in a new local file:


sequence = 2
tags = {"cleaned:True", "type:linear"}
tags = {"cleaned": True, "type": "linear"}
labels = {"Vestas"}


with Datafile(path="path/to/local/file.dat", sequence=sequence, tags=tags, mode="w") as datafile, f:
with Datafile(path="path/to/local/file.dat", sequence=sequence, tags=tags, labels=labels, mode="w") as (datafile, f):
f.write("This is some cleaned data.")

datafile.to_cloud(project_name="my-project", bucket_name="my-bucket", path_in_bucket="path/to/data.dat")
@@ -139,7 +143,8 @@ For existing data in an existing local file:


sequence = 2
tags = {"cleaned:True", "type:linear"}
tags = {"cleaned": True, "type": "linear"}
labels = {"Vestas"}

datafile = Datafile(path="path/to/local/file.dat", sequence=sequence, tags=tags)
datafile = Datafile(path="path/to/local/file.dat", sequence=sequence, tags=tags, labels=labels)
datafile.to_cloud(project_name="my-project", bucket_name="my-bucket", path_in_bucket="path/to/data.dat")
20 changes: 12 additions & 8 deletions docs/source/dataset.rst
@@ -8,9 +8,10 @@ A ``Dataset`` contains any number of ``Datafiles`` along with the following meta

- ``name``
- ``tags``
- ``labels``

The files are stored in a ``FilterSet``, meaning they can be easily filtered according to any attribute of the
:doc:`Datafile <datafile>` instances it contains.
:doc:`Datafile <datafile>` instances contained.


--------------------------------
@@ -23,23 +24,26 @@ You can filter a ``Dataset``'s files as follows:

dataset = Dataset(
files=[
Datafile(timestamp=time.time(), path="path-within-dataset/my_file.csv", tags="one a:2 b:3 all"),
Datafile(timestamp=time.time(), path="path-within-dataset/your_file.txt", tags="two a:2 b:3 all"),
Datafile(timestamp=time.time(), path="path-within-dataset/another_file.csv", tags="three all"),
Datafile(path="path-within-dataset/my_file.csv", labels=["one", "a", "b" "all"]),
Datafile(path="path-within-dataset/your_file.txt", labels=["two", "a", "b", "all"),
Datafile(path="path-within-dataset/another_file.csv", labels=["three", "all"]),
]
)

dataset.files.filter(filter_name="name__ends_with", filter_value=".csv")
dataset.files.filter(name__ends_with=".csv")
>>> <FilterSet({<Datafile('my_file.csv')>, <Datafile('another_file.csv')>})>

dataset.files.filter("tags__contains", filter_value="a:2")
dataset.files.filter(labels__contains="a")
>>> <FilterSet({<Datafile('my_file.csv')>, <Datafile('your_file.txt')>})>

You can also chain filters indefinitely:
You can also chain filters indefinitely, or specify them all at the same time:

.. code-block:: python

dataset.files.filter(filter_name="name__ends_with", filter_value=".csv").filter("tags__contains", filter_value="a:2")
dataset.files.filter(name__ends_with=".csv").filter(labels__contains="a")
>>> <FilterSet({<Datafile('my_file.csv')>})>

dataset.files.filter(name__ends_with=".csv", labels__contains="a")
>>> <FilterSet({<Datafile('my_file.csv')>})>

Find out more about ``FilterSets`` :doc:`here <filter_containers>`, including all the possible filters available for each type of object stored on
82 changes: 59 additions & 23 deletions docs/source/filter_containers.rst
@@ -4,43 +4,61 @@
Filter containers
=================

A filter container is just a regular python container that has some extra methods for filtering or ordering its
A filter container is just a regular python container that has some extra methods for filtering and ordering its
elements. It has the same interface (i.e. attributes and methods) as the primitive python type it inherits from, with
these extra methods:

- ``filter``
- ``order_by``

There are two types of filter containers currently implemented:
There are three types of filter containers currently implemented:

- ``FilterSet``
- ``FilterList``
- ``FilterDict``

``FilterSets`` are currently used in:
``FilterSets`` are currently used in ``Dataset.files`` to store ``Datafiles`` and make them filterable, which is useful
for dealing with a large number of files, while ``FilterList`` is returned when ordering any filter container.

- ``Dataset.files`` to store ``Datafiles``
- ``TagSet.tags`` to store ``Tags``

You can see filtering in action on the files of a ``Dataset`` :doc:`here <dataset>`.
You can see an example of filtering of a ``Dataset``'s files :doc:`here <dataset>`.


---------
Filtering
---------

Filters are named as ``"<name_of_attribute_to_check>__<filter_action>"``, and any attribute of a member of the
``FilterSet`` whose type or interface is supported can be filtered.
Key points:

* Any attribute of a member of a filter container whose type or interface is supported can be used when filtering
* Filters are named as ``"<name_of_attribute_to_check>__<filter_action>"``
* Multiple filters can be specified at once for chained filtering
* ``<name_of_attribute_to_check>`` can be a single attribute name or a double-underscore-separated string of nested attribute names
* Nested attribute names work for real attributes as well as dictionary keys (in any combination and to any depth)

.. code-block:: python

filter_set = FilterSet(
{Datafile(timestamp=time.time(), path="my_file.csv"), Datafile(timestamp=time.time(), path="your_file.txt"), Datafile(timestamp=time.time(), path="another_file.csv")}
{
Datafile(path="my_file.csv", cluster=0, tags={"manufacturer": "Vestas"}),
Datafile(path="your_file.txt", cluster=1, tags={"manufacturer": "Vergnet"}),
Datafile(path="another_file.csv", cluster=2, tags={"manufacturer": "Enercon"})
}
)

filter_set.filter(filter_name="name__ends_with", filter_value=".csv")
# Single filter, non-nested attribute.
filter_set.filter(name__ends_with=".csv")
>>> <FilterSet({<Datafile('my_file.csv')>, <Datafile('another_file.csv')>})>

The following filters are implemented for the following types:
# Two filters, non-nested attributes.
filter_set.filter(name__ends_with=".csv", cluster__gt=1)
>>> <FilterSet({<Datafile('another_file.csv')>})>

# Single filter, nested attribute.
filter_set.filter(tags__manufacturer__starts_with="V")
>>> <FilterSet({<Datafile('my_file.csv')>, <Datafile('your_file.txt')>})>


These filters are currently available for the following types:

- ``bool``:

@@ -73,19 +91,20 @@ The following filters are implemented for the following types:
* ``is``
* ``is_not``

- ``TagSet``:
- ``LabelSet``:

* ``is``
* ``is_not``
* ``equals``
* ``not_equals``
* ``any_tag_contains``
* ``not_any_tag_contains``
* ``any_tag_starts_with``
* ``not_any_tag_starts_with``
* ``any_tag_ends_with``
* ``not_any_tag_ends_with``

* ``contains``
* ``not_contains``
* ``any_label_contains``
* ``not_any_label_contains``
* ``any_label_starts_with``
* ``not_any_label_starts_with``
* ``any_label_ends_with``
* ``not_any_label_ends_with``
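
For example, using the labelled datafiles from the :doc:`dataset <dataset>` page (a sketch; the output shown is illustrative):

.. code-block:: python

    dataset.files.filter(labels__any_label_starts_with="t")
    >>> <FilterSet({<Datafile('your_file.txt')>, <Datafile('another_file.csv')>})>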


Additionally, these filters are defined for the following *interfaces* (duck-types):
@@ -118,14 +137,31 @@ list of filters.
--------
Ordering
--------
As sets are inherently orderless, ordering a ``FilterSet`` results in a new ``FilterList``, which has the same extra
methods and behaviour as a ``FilterSet``, but is based on the ``list`` type instead - meaning it can be ordered and
indexed etc. A ``FilterSet`` or ``FilterList`` can be ordered by any of the attributes of its members:
As sets and dictionaries are inherently orderless, ordering any filter container results in a new ``FilterList``, which
has the same methods and behaviour but is based on ``list`` instead, meaning it can be ordered and indexed etc. A
filter container can be ordered by any of the attributes of its members:

.. code-block:: python

filter_set.order_by("name")
>>> <FilterList([<Datafile('another_file.csv')>, <Datafile('my_file.csv')>, <Datafile('your_file.txt')>])>

filter_set.order_by("cluster")
>>> <FilterList([<Datafile('my_file.csv')>, <Datafile('your_file.txt')>, <Datafile('another_file.csv')>])>

The ordering can also be carried out in reverse (i.e. descending order) by passing ``reverse=True`` as a second argument
to the ``order_by`` method.
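
For example, reusing the ``filter_set`` above:

.. code-block:: python

    filter_set.order_by("name", reverse=True)
    >>> <FilterList([<Datafile('your_file.txt')>, <Datafile('my_file.csv')>, <Datafile('another_file.csv')>])>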


--------------
``FilterDict``
--------------
The keys of a ``FilterDict`` can be anything, but each value must be a ``Filterable``. Hence, a ``FilterDict`` is
filtered and ordered by its values' attributes; when ordering, its items (key-value tuples) are returned in a
``FilterList``.
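
For example (a sketch; the import paths are assumptions, and ``cluster`` is just an illustrative attribute from the :doc:`Datafile <datafile>` page):

.. code-block:: python

    from octue.resources import Datafile  # Import paths assumed.
    from octue.resources.filter_containers import FilterDict

    file_map = FilterDict({
        "raw": Datafile(path="raw.csv", cluster=0),
        "clean": Datafile(path="clean.csv", cluster=1),
    })

    # Filtering acts on the values' attributes; matching keys stay with their values.
    file_map.filter(cluster__gt=0)

    # Ordering returns the items as (key, value) tuples in a FilterList.
    file_map.order_by("cluster")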

-----------------------
Using for your own data
-----------------------
If using filter containers for your own data, all the members must inherit from ``octue.mixins.filterable.Filterable``
to be filterable and orderable.
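
A minimal sketch (the ``FilterSet`` import path is an assumption):

.. code-block:: python

    from octue.mixins.filterable import Filterable
    from octue.resources.filter_containers import FilterSet  # Import path assumed.

    class Reading(Filterable):
        """A simple filterable wrapper around a numeric value."""

        def __init__(self, value):
            self.value = value

    readings = FilterSet({Reading(1), Reading(5), Reading(3)})
    readings.filter(value__gt=2)  # FilterSet of the readings greater than 2.
    readings.order_by("value")  # FilterList of the readings in ascending order of value.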