.. _manifest_strands:

======================
Manifest-based Strands
======================

Twins frequently operate on files containing some kind of data. These files need to be made accessible to the code
running in the twin so that their contents can be read and processed. Conversely, a twin might produce an output
dataset which must be understood by its users.

The ``configuration_manifest``, ``input_manifest`` and ``output_manifest`` strands describe what kinds of datasets (and
associated files) are required or produced.
.. NOTE::

   Files are always contained in datasets, even if there's only one file. This keeps nitty-gritty file
   metadata separate from the more meaningful, higher-level metadata, like what a dataset is for.
.. tabs::

   .. group-tab:: Configuration Manifest Strand

      This strand describes datasets/files that are required at startup of the twin/service. They typically contain a
      resource that the twin might use across many analyses.

      For example, a twin might predict failure for a particular component, given an image. It will require a trained
      ML model (saved as a ``*.pickle`` or ``*.json`` file). While many thousands of predictions might be made over the
      period that the twin is deployed, all predictions are made using this version of the model, so the model file is
      supplied at startup.

   .. group-tab:: Input Manifest Strand

      These files are made available for the twin to run a particular analysis with. Each analysis will likely have
      different input datasets.

      For example, a twin might be passed a dataset of LiDAR ``*.scn`` files and be expected to compute atmospheric flow
      properties as a timeseries (which might be returned in the :ref:`output values <values_based_strands>` for onward
      processing and storage).

   .. group-tab:: Output Manifest Strand

      These files are created by the twin during an analysis, then tagged and stored as datasets for some onward purpose.

      This strand is not used for sourcing data; it enables users or other services to understand appropriate search
      terms to retrieve the datasets produced.
.. _describing_manifests:

Describing Manifests
====================

Manifest-based strands are a **description of what files are needed**, NOT a list of specific files or datasets. This is
a tricky but important concept, since services should be reusable and applicable to a range of similar datasets. The
purpose of the manifest strands is to help a wider system supply datafiles to digital twins.

The manifest strands therefore use **tagging**: they contain a ``filters`` field, which should be valid
`Apache Lucene <https://lucene.apache.org/>`_ search syntax. This is a powerful syntax whose tagging features allow
us to specify incredibly broad, or extremely narrow, searches (even down to a known unique result). See the tabs below
for examples.
.. NOTE::

   Tagging syntax is extremely powerful. Below, you'll see how it enables a digital twin to specify things like:

   *"OK, I need this digital twin to always have access to a model file for a particular system, containing trained model data."*

   *"Uh, so I need an ordered sequence of files, that are CSV files from a meteorological mast."*

   This allows **twined** to check that the input files contain what is needed, enables quick and easy
   extraction of subgroups or particular sequences of files within a dataset, and enables management systems
   to map candidate datasets to twins that might be used to process them.
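The kind of tag-based subgrouping and sequencing described above can be sketched in plain Python. This is an illustrative sketch only, not the **twined** API; the comma-separated ``tags`` string format (including ``sequence:N`` tags) is an assumption taken from the example manifest later on this page:

```python
# Illustrative sketch only - not the twined API. Assumes each file record
# carries a comma-separated "tags" string, as in the example manifest below.

def parse_tags(tag_string):
    """Split a comma-separated tag string into a set of stripped tags."""
    return {tag.strip() for tag in tag_string.split(",")}

def files_with_tag(files, required_tag):
    """Select the subgroup of files whose tags include required_tag."""
    return [f for f in files if required_tag in parse_tags(f["tags"])]

def ordered_sequence(files):
    """Order files by their 'sequence:N' tag, dropping untagged files."""
    def sequence_number(file_record):
        for tag in parse_tags(file_record["tags"]):
            if tag.startswith("sequence:"):
                return int(tag.split(":", 1)[1])
        return -1
    return sorted(
        (f for f in files if sequence_number(f) >= 0), key=sequence_number
    )

files = [
    {"name": "Lidar - 11 to 18 Dec.csv", "tags": "lidar, sequence:2"},
    {"name": "Lidar - 4 to 10 Dec.csv", "tags": "lidar, sequence:1"},
    {"name": "Lidar report.pdf", "tags": "report"},
]

lidar_files = files_with_tag(files, "lidar")  # the two CSV files
ordered = ordered_sequence(lidar_files)       # sequence 1 first, then 2
```

A real system would evaluate the Lucene ``filters`` expression instead of a single hard-coded tag, but the principle of selecting and ordering files by their tags is the same.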
.. tabs::

   .. group-tab:: Configuration Manifest Strand

      Here we construct an extremely tight filter, which connects this digital twin to
      datasets from a specific system.

      .. accordion::

         .. accordion-row:: Show twine containing this strand

            .. literalinclude:: ../../examples/damage_classifier_service/twine.json
               :language: javascript

         .. accordion-row:: Show a matching file manifest

            .. literalinclude:: ../../examples/damage_classifier_service/data/configuration_manifest.json
               :language: javascript

   .. group-tab:: Input Manifest Strand

      Here we specify that two datasets (and all or some of their associated files) are
      required, for a service that cross-checks meteorological mast data and power output data for a wind farm.

      .. accordion::

         .. accordion-row:: Show twine containing this strand

            .. literalinclude:: ../../examples/met_mast_scada_service/strands/input_manifest_filters.json
               :language: javascript

         .. accordion-row:: Show a matching file manifest

            .. literalinclude:: ../../examples/met_mast_scada_service/data/input_manifest.json
               :language: javascript

   .. group-tab:: Output Manifest Strand

      Here the twin produces output figure files (with a ``*.fig`` extension) containing figures that enable a visual
      check of the correlation between met mast and SCADA data.

      .. accordion::

         .. accordion-row:: Show twine containing this strand

            .. literalinclude:: ../../examples/met_mast_scada_service/strands/output_manifest_filters.json
               :language: javascript

         .. accordion-row:: Show a matching file manifest

            .. literalinclude:: ../../examples/met_mast_scada_service/data/output_manifest.json
               :language: javascript
..
   TODO - clean up or remove this section
.. _how_filtering_works:

How Filtering Works
===================

It's the job of **twined** to make sure of two things:

1. that the *twine* file itself is valid, and
2. that the data supplied to each strand matches what the *twine* describes.

**File data (input, output)**

Files are not streamed directly to the digital twin (this would require extreme bandwidth in whatever system is
orchestrating all the twins). Instead, files should be made available on the local storage system, i.e. a volume
mounted to whatever container or VM the digital twin runs in.

Groups of files are described by a ``manifest``, where a manifest is (in essence) a catalogue of the files in a
dataset.

A digital twin might receive multiple manifests if it uses multiple datasets. For example, it could use a 3D
point cloud LiDAR dataset and a meteorological dataset.
.. code-block:: javascript

   {
       "manifests": [
           {
               "type": "dataset",
               "id": "3c15c2ba-6a32-87e0-11e9-3baa66a632fe", // UUID of the manifest
               "files": [
                   {
                       "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86", // UUID of that file
                       "sha1": "askjnkdfoisdnfkjnkjsnd", // for quality control to check correctness of file contents
                       "name": "Lidar - 4 to 10 Dec.csv",
                       "path": "local/file/path/to/folder/containing/it/",
                       "type": "csv",
                       "metadata": {
                       },
                       "size_bytes": 59684813,
                       "tags": "lidar, helpful, information, like, sequence:1" // Searchable, parsable and filterable
                   },
                   {
                       "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e87",
                       "name": "Lidar - 11 to 18 Dec.csv",
                       "path": "local/file/path/to/folder/containing/it/",
                       "type": "csv",
                       "metadata": {
                       },
                       "size_bytes": 59684813,
                       "tags": "lidar, helpful, information, like, sequence:2" // Searchable, parsable and filterable
                   },
                   {
                       "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e88",
                       "name": "Lidar report.pdf",
                       "path": "local/file/path/to/folder/containing/it/",
                       "type": "pdf",
                       "metadata": {
                       },
                       "size_bytes": 484813,
                       "tags": "report" // Searchable, parsable and filterable
                   }
               ]
           },
           {
               // ... another dataset manifest ...
           }
       ]
   }
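As a minimal sketch of how a twin might consume such a structure (illustrative only, not the **twined** API), the files in each manifest can be located on the mounted volume by joining each record's ``path`` and ``name`` fields:

```python
import json
import os

# Illustrative sketch only - not the twined API. Resolves local file
# locations from a manifest structure like the one above (trimmed to
# one file for brevity).
manifest_document = json.loads("""
{
    "manifests": [
        {
            "type": "dataset",
            "id": "3c15c2ba-6a32-87e0-11e9-3baa66a632fe",
            "files": [
                {
                    "name": "Lidar - 4 to 10 Dec.csv",
                    "path": "local/file/path/to/folder/containing/it/",
                    "type": "csv",
                    "size_bytes": 59684813
                }
            ]
        }
    ]
}
""")

resolved_paths = []
for manifest in manifest_document["manifests"]:
    for file_entry in manifest["files"]:
        # A real twin would open this path on the mounted volume here.
        resolved_paths.append(os.path.join(file_entry["path"], file_entry["name"]))
```

Note that the ``//`` comments in the example above make it annotated JavaScript rather than strict JSON, so they are stripped here before parsing.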