VODML Mapping vs ModelInstanceInVot #23

Open
lmichel opened this issue Mar 24, 2021 · 27 comments
Labels
documentation Improvements or additions to documentation

Comments

@lmichel
Collaborator

lmichel commented Mar 24, 2021

Both proposals (VODML Mapping, ModelInstanceInVot) have similar structures. There is nevertheless a major difference, which is justified here.

  • Fig 1 shows a dataset made of a metadata block on top of the data block, mapped with VODML Mapping
    • Data are mapped in the TEMPLATE block that maps data for one row.
    • Metadata are located in the GLOBALS block along with the coordinate frames.
    • Issues
      • The parser has to browse 2 different mapping blocks to retrieve all the components of the dataset
      • The parser has no easy way to determine which GLOBALS elements are part of the dataset mapping and which are not.
  • Fig 3 shows the same dataset mapped with ModelInstanceInVot
    • Both data and metadata are mapped within the TABLE_TEMPLATE blocks
    • The data mapping is enclosed in a TABLE_ROW_TEMPLATE block
    • Gains
      • All elements related to the dataset mapping are located in a single block (TABLE_TEMPLATE), which makes the parser's job easier.
      • All elements not related to any dataset are located in the GLOBALS block
      • We can have several TABLE_ROW_TEMPLATE blocks within a single TABLE_TEMPLATE. This tells the parser to iterate several times over the same data table, e.g. to extract data subsets (selection by filter in time-series/gaia_multiband).
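
To make the parser-side difference concrete, here is a minimal Python sketch. The two annotation snippets are hypothetical simplifications written for this comment (the element names follow the discussion, the dmtype values and the tableref attribute on TEMPLATES are assumptions), not excerpts from either specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation blocks; not taken from either spec.
MAPPING_STYLE = """
<VODML>
  <GLOBALS>
    <INSTANCE ID="frame" dmtype="coords:SpaceFrame"/>
    <INSTANCE ID="dataset_meta" dmtype="ds:Dataset"/>
  </GLOBALS>
  <TEMPLATES tableref="t1">
    <INSTANCE dmtype="ts:Point"/>
  </TEMPLATES>
</VODML>
"""

MIVOT_STYLE = """
<MODEL_INSTANCE>
  <GLOBALS>
    <INSTANCE ID="frame" dmtype="coords:SpaceFrame"/>
  </GLOBALS>
  <TABLE_TEMPLATE tableref="t1">
    <INSTANCE dmtype="ds:Dataset"/>
    <TABLE_ROW_TEMPLATE>
      <INSTANCE dmtype="ts:Point"/>
    </TABLE_ROW_TEMPLATE>
  </TABLE_TEMPLATE>
</MODEL_INSTANCE>
"""


def dataset_types_single_block(xml_text, tableref):
    """Everything mapped for one table sits inside its TABLE_TEMPLATE."""
    root = ET.fromstring(xml_text)
    block = root.find(f"TABLE_TEMPLATE[@tableref='{tableref}']")
    return [inst.get("dmtype") for inst in block.iter("INSTANCE")]


def candidate_types_two_blocks(xml_text, tableref):
    """The parser reads GLOBALS and TEMPLATES; every GLOBALS instance is
    only a candidate, since nothing marks it as part of the dataset."""
    root = ET.fromstring(xml_text)
    global_types = [i.get("dmtype") for i in root.findall("GLOBALS/INSTANCE")]
    row_types = [
        i.get("dmtype")
        for i in root.findall(f"TEMPLATES[@tableref='{tableref}']/INSTANCE")
    ]
    return global_types, row_types
```

In the second function the caller still has to decide which of the returned GLOBALS instances belong to the dataset, which is the issue described in the list above.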

Screenshot 2021-03-24 at 14 01 19

The advantage of the ModelInstanceInVot mapping structure is even more obvious for multi-table VOTables.

  • Fig 2 shows the mapping structure for a VOTable containing 2 tables, each one with its own mapping (e.g. source + detections).
    • The GLOBALS block now contains the metadata of the 2 datasets in addition to the coordinate frames.
    • Issues
      • Same as before
      • The parser has no easy way to identify which GLOBALS element is part of dataset 1 or 2
  • Fig 4 shows the same datasets mapped with ModelInstanceInVot
    • Each table is mapped in a specific TABLE_TEMPLATE.
    • Gains
      • Same as before
      • No more risk of mixing up the metadata of the two datasets.

Screenshot 2021-03-24 at 14 01 29

@lmichel lmichel added the documentation Improvements or additions to documentation label Mar 24, 2021
@mcdittmar
Collaborator

I'm taking a closer look to compare the annotation schemes today.

I notice that the preview PDF in the ModelInstanceInVot repository is not current (2020-08-18).
The one under doc (2020-09-15) appears to match the current serializations.

Is that correct?

@lmichel
Collaborator Author

lmichel commented Mar 26, 2021 via email

@Bonnarel
Contributor

Good clarification Laurent, but I think there is a typo in your text above. TABLE_RAW_MAPPING should read "TABLE_ROW_TEMPLATE", shouldn't it?

@lmichel
Collaborator Author

lmichel commented Mar 31, 2021

right, fixed

@mcdittmar
Collaborator

I wrote this up some days ago... forgot to push the green button!
I spent the day studying the ModelInstanceInVot syntax so that I could, hopefully, provide a more informed response.

General comment:
One of the main reasons I'm using the Mapping syntax, is that it was designed/vetted against a set of requirements/cases covering the expectations on any syntax. To my knowledge, these requirements on our Annotation syntax have not changed, so I'm reviewing this scheme against my understanding of the Mapping syntax requirements. It is not quite fair to evaluate one document based on the requirements of another, but in this case, we are trying to determine what is gained/lost by the various options.

MODEL_INSTANCE:
Your comments seem to come from the perspective that the annotation represents "the dataset". The Model_Instance description follows that view in that you can provide for no more than 1 'name' and 'uri' for "the mapped model.". It is further supported by INSTANCE with dmrole="root", which "can be used to tell the parser whic class has to be instanciated first.". If the annotation contains more than one "root" instance (eg: a MANGO and a Cube/TimeSeries), the MODEL_INSTANCE name and uri attributes have little meaning.

It is a specific goal of our Annotation Syntax to 'Annotate files to multiple models' (Mapping doc; Section 2.3.2). This can be either different products ( MANGO, Cube ) or different versions of the same model. I think this is a feature we want to preserve, and I am expecting the case in Issue #29 does just this?

GLOBALS/TEMPLATES in the Mapping syntax:
GLOBALS holds 'direct instances' (COLUMN is not allowed under GLOBALS) so that each instance may be given an ID which can be referenced by other instances

  • There can be more than one GLOBALS (where ModelInstanceInVot can have only 1).
    • This allows you to bundle GLOBALS into logical sets.
      • 1 for Coordinate Systems
      • 1 for Photometry Filters
      • 1 for each DatasetMetadata instance needed

TEMPLATES hold all instances which are created per row of the associated TABLE, so looks equivalent to the ModelInstanceInVot TABLE_ROW_TEMPLATE but at a higher level

  • There can be more than one TEMPLATES as well. In my usage, I think this has only come into play when there is more than one TABLE

So, in your first pair of diagrams.. you cannot consider the 1 TABLE to represent 1 Dataset, or even 1 'root' object.
Instead, what the Mapping syntax does is separate the elements into sections where parsers must act differently to generate the instances

  • GLOBALS - simpler, single direct instance generated from LITERAL or CONSTANT/PARAM and
  • TEMPLATES - where it must manage the TABLE data and possibly contend with KEYS to generate multiple instances

TABLE_MAPPING..(TABLE_TEMPLATE in diagrams?):
This element confuses me.
The ModelInstanceInVot document states that the annotation is independent of the data structure

  • Section 2.2: Requirements: "mapping structure must be independent of the data structure"
  • Section 3.1: Mapping Block Structure: "The mapping construction rules are independant from any model or data layout"

But this element appears to tie very directly with the VOTable serialization structure

  • Section 3.1: "There is one <TABLE_MAPPING> per mapped <TABLE>"
  • Section 3.2.3: "contains the mapping statements of the data contained in one TABLE"
  • Section 3.2.3: "A TABLE cannot be referenced by more than one TABLE_MAPPING"

How does this accommodate an INSTANCE whose content is distributed among multiple TABLEs?
This is another specific goal of the Annotation Syntax, "It is often the case that two or more files or tables represent different pieces of information regarding the same astronomical sources of objects." (Mapping doc; Section 2.3.3). So, the data required to create any given INSTANCE may be spread among different TABLEs, even for the GLOBALS.

In the second pair of diagrams.. there can certainly be INSTANCES in GLOBALS which contain elements from multiple TABLES.

Summary:
The ModelInstanceInVot examples above may look and behave more cleanly, but if I understand the descriptions properly, they fail to satisfy some basic requirements of the Annotation Syntax.

  1. We may include annotation to more than one model (different models, different versions of one model).
  2. The Annotation must be independent of the serialization structure. (ie: cannot isolate per TABLE)

Additional Comments:

  • TABLE_MAPPING tableref attribute: "first resolved against the TABLE identifier (@id), then against the table name".
    • there are some workshop examples where the TABLE has no ID (and I've needed to add one).
    • several also had no ID on FIELDs, which I found surprising
    • this fallback in the spec may help accommodate these cases without modifying the original TABLE content.
  • FILTER/JOIN/GROUPBY
    • These all seem to be some action related to using the PRIMARYKEY/FOREIGNKEY of the Mapping syntax.
    • I don't understand GROUPBY, but I rather like FILTER/JOIN as a more intuitive element (since I'm not a database person).
    • I'd like to focus some time on cases which use these
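
For the tableref fallback quoted above ("first resolved against the TABLE identifier (@id), then against the table name"), a minimal Python sketch could look like the following. It assumes a simplified, namespace-free VOTable fragment and a hypothetical helper name `resolve_tableref`; real VOTables carry an XML namespace that the lookups would need to account for.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment written for this illustration.
RESOURCE = """
<RESOURCE>
  <TABLE name="results"><FIELD name="flux" datatype="double"/></TABLE>
  <TABLE ID="results" name="other"><FIELD name="mag" datatype="double"/></TABLE>
  <TABLE name="detections"/>
</RESOURCE>
"""


def resolve_tableref(resource, tableref):
    """Return the TABLE matching tableref: ID first, then name, else None."""
    tables = resource.findall("TABLE")
    for table in tables:
        if table.get("ID") == tableref:
            return table
    for table in tables:
        if table.get("name") == tableref:
            return table
    return None


resource = ET.fromstring(RESOURCE)
```

With this ordering, `resolve_tableref(resource, "results")` returns the table whose ID is "results" even though an earlier table has that name, and a table without any ID (here "detections") can still be reached by name.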

@lmichel
Collaborator Author

lmichel commented Apr 1, 2021

Long post, so multiple answers.

Mapping use case

The use-case sections of the two documents differ, so your comparison is tenuous.
The ModelInstanceInVot use case is simple:

  • You have a model and you have a VOTable: ModelInstanceInVot lets clients build model instances populated with VOTable data.
  • Once you have this instance, implementing your use cases is straightforward

In addition to this, we have a couple of requirements

  • The mapping must be designed in a way that facilitates the annotation process
  • This must work with most of the VOTables one can find in the wild. If you come with a VOTable where your Cube data are randomly spread over multiple tables, ModelInstanceInVot will run into trouble

To facilitate both annotation and parsing tasks, ModelInstanceInVot is based on a drawer approach.

  • One drawer for the global objects
  • One drawer for each model you want to map (only one at the moment, see below)
    So the annotator knows where to put things and the client knows where to look. If it extracts one {TABLE_TEMPLATE}, it knows it gets everything it needs to build the instance mapped there.

@lmichel
Collaborator Author

lmichel commented Apr 1, 2021

MODEL_INSTANCE:

Recently ModelInstanceInVot was downgraded to support the annotation of only one model. The {MODEL} blocks have been removed.
The reasons for this were:

  • The lack of use cases
  • Simplification of the syntax, to help people get to grips with the cases

We could easily move back to a multi-model version:

  • Restoring the {MODEL} block
  • Allowing multiple {TABLE_TEMPLATE} for one {TABLE}

The root role is just an indication given by the annotator.
Starting the parsing at the root node is not mandatory. My own code ignores it, by the way.
I have to admit that this feature brings more confusion than help.

@lmichel
Collaborator Author

lmichel commented Apr 1, 2021

GLOBALS

The ModelInstanceInVot {GLOBALS} block can hold as many {INSTANCE} elements as needed.
This is the drawer for the global objects.
{GLOBALS}/{INSTANCE} have no specific role.
They must have an ID to be referenceable from objects within {TABLE_MAPPING} blocks.

I would say that your {TEMPLATE} matches my {TABLE_ROW_TEMPLATE}. I added one level, {TABLE_MAPPING}, to encompass both metadata and data.
Beyond the benefits of the drawer structure, this makes it possible to support models with multiple data collections. For instance, gaiamultiband has 3 data arrays, one per filter, each populated by a specific {TABLE_ROW_TEMPLATE}:

{TABLE_MAPPING}
    metadata
    {TABLE_ROW_TEMPLATE} for filter B    
    {TABLE_ROW_TEMPLATE} for filter RP    
    {TABLE_ROW_TEMPLATE} for filter BG
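
As a rough illustration of what several {TABLE_ROW_TEMPLATE} blocks over one table mean for a client, here is a minimal Python sketch. The rows, the role names and the dictionary form of the row templates are all hypothetical stand-ins; the filter labels are the ones written above.

```python
# Hypothetical table rows: one table mixing measurements from 3 filters.
ROWS = [
    {"filter": "B", "time": 1.0, "mag": 12.3},
    {"filter": "RP", "time": 1.0, "mag": 11.9},
    {"filter": "BG", "time": 1.0, "mag": 12.1},
    {"filter": "B", "time": 2.0, "mag": 12.4},
]

# Hypothetical stand-ins for three TABLE_ROW_TEMPLATE blocks, each with
# its own row selection.
ROW_TEMPLATES = [
    {"role": "points_B", "column": "filter", "value": "B"},
    {"role": "points_RP", "column": "filter", "value": "RP"},
    {"role": "points_BG", "column": "filter", "value": "BG"},
]


def apply_row_templates(rows, row_templates):
    """Iterate over the same rows once per row template, keeping only the
    rows that match that template's selection."""
    collections = {}
    for template in row_templates:
        collections[template["role"]] = [
            {"time": row["time"], "mag": row["mag"]}
            for row in rows
            if row[template["column"]] == template["value"]
        ]
    return collections


collections = apply_row_templates(ROWS, ROW_TEMPLATES)
```

Each row template yields its own data array, so the single table ends up feeding three collections of the same model instance.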

@lmichel
Collaborator Author

lmichel commented Apr 1, 2021

TABLE_MAPPING..(TABLE_TEMPLATE in diagrams?):

The annotation is independent of the data structure in the sense that the hierarchy of the mapping elements (INSTANCE, ATTRIBUTE, COLLECTION) does not depend on the way the data are arranged in the table.
If you rename columns, change their order, or move values from FIELDs to PARAMs (factorisation: Vizier often does it), that hierarchy does not change; only the ATTRIBUTE@ref attribute has to be modified.

This feature allows annotators to work with pre-defined blocks, which is a very important requirement.

  • This allows e.g. Vizier to demonstrate the capacity to generate annotated VOTable on the fly.
  • This opens a window to annotate TAP responses on the fly.

If you have multiple tables, let's say source + detections:

  • You map sources in a {TABLE_MAPPING} related to the source table
  • You map detections in a {TABLE_MAPPING} related to the detection table
  • You set a JOIN statement at the right place of the {TABLE_ROW_TEMPLATE} of the source {TABLE_MAPPING}

If you have meta-data spread over multiple tables (any use case?) , you can use the @dmref/@id pattern
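
A minimal Python sketch of the source + detections case follows. It assumes the JOIN is an equality match on a shared key column, which is an assumption made for illustration; the thread does not spell out the JOIN syntax or semantics, and the table contents and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical contents of the two tables.
SOURCES = [
    {"source_id": "a", "ra": 10.0},
    {"source_id": "b", "ra": 20.0},
]
DETECTIONS = [
    {"source_id": "a", "flux": 1.0},
    {"source_id": "b", "flux": 5.0},
    {"source_id": "a", "flux": 1.2},
]


def map_sources_with_join(sources, detections, key):
    """Build one instance per source row and attach the detection rows
    sharing the same key value (assumed JOIN behaviour)."""
    by_key = defaultdict(list)
    for detection in detections:
        by_key[detection[key]].append({"flux": detection["flux"]})
    return [
        {"id": source[key], "ra": source["ra"], "detections": by_key.get(source[key], [])}
        for source in sources
    ]


instances = map_sources_with_join(SOURCES, DETECTIONS, "source_id")
```

Each table keeps its own mapping block; only the join statement in the source row template crosses from one table to the other.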

@lmichel
Collaborator Author

lmichel commented Apr 1, 2021

Summary

  1. Answered above
  2. I do not see which dependency you are talking about
  • One mapped model => one {TABLE_MAPPING}

Additional Comments:

You are right to say that ModelInstanceInVot should also support a {TABLE} with neither a name nor an ID. A default behavior must be defined (e.g. no name -> take the first).

GROUPBY

There is an example here

ex: {GROUPBY}, allows to retrieve per-source data from tables containing a mix of source data

@mcdittmar
Collaborator

GROUPBY

There is an example here

ex: {GROUPBY}, allows to retrieve per-source data from tables containing a mix of source data

Without something consuming it, it's hard for me to see what comes out of it. It looks like a SORT method, producing a long collection of instances rearranged so that the instances with common 'filter' criteria are together. I'm not sure what the benefit of that is.

The ZTF example you mentioned appears to create:

  • a collection of MangoObject (now 'Source') grouped by identifier
    • so we have multiple Source instances all id=a, then all id=b, then all id=c, etc
  • each Source has an associated VOModelInstance which is a 'tsd:tsdata' (TimeSeries)
    • the tsdata instance contains a collection of tsd:Point, but I'm not sure how many Points are in the Collection
    • The GroupBy says the top level Source is instantiated from each row of the table, so there should be only 1 Point in each tsdata instance... generated by the row creating the top level Source instance.

So each Source has associated TimeSeries containing 1 Point?

The document example uses role:test.lightcurve generating a series of Collections of 'test:photometric.point', first points with "R" filter, then points with "G" filter, then points with "V" filter. I would expect in such an example to create 3 instances of LightCurve, one with the "R" points, one with the "G" points, and one with the "V" points.. but there is no such Instance.

If I'm interpreting this correctly, the only effect GroupBy has is to sort the records. This isn't a model feature, but a client side operation. eg: give me all your Source records in this region, sorted by source_id. The last clause is a client request which has nothing to do with the Source model.

Bottom line: if this is all correct, and GroupBy is ~= Sort, then it is a feature outside the scope of "Mapping the data to data models" which the Annotation is supposed to support.

@Bonnarel
Contributor

Bonnarel commented Apr 7, 2021 via email

@msdemlei
Contributor

msdemlei commented Apr 7, 2021 via email

@lmichel
Collaborator Author

lmichel commented Apr 7, 2021

You might be worried to see such datasets (ZTF) circulating, but you have to admit that they do exist.

This is one of the examples we committed to work on 2 years ago in the frame of the TDIG.
The ModelInstanceInVot proposal was initiated by this work.
This is a use case we worked on, and there is no plan or ambition to extend the mapping syntax to any sort of QL feature.

The role of GROUP_BY is just to tell the parser how to extract per-source data from a table containing data related to multiple sources.
I believe that Vizier contains similar catalogs (@gilleslandais can confirm?).

In this case, using a model + annotation allows to do something which is not possible otherwise.
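
To show the intended effect, here is a minimal Python sketch of one reading of the grouping: rows of a mixed table are gathered into one instance per source, each holding its own collection of points. The rows, column names and function name are hypothetical, and this is not the normative behaviour of the element.

```python
# Hypothetical table mixing rows that belong to several sources.
ROWS = [
    {"source_id": "a", "time": 1.0, "mag": 12.3},
    {"source_id": "b", "time": 1.0, "mag": 15.0},
    {"source_id": "a", "time": 2.0, "mag": 12.4},
    {"source_id": "b", "time": 2.0, "mag": 15.1},
    {"source_id": "a", "time": 3.0, "mag": 12.2},
]


def group_by_source(rows, key):
    """Return one instance per distinct key value, each with the points
    found in the rows carrying that value (first-seen order)."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(
            {"time": row["time"], "mag": row["mag"]}
        )
    return [{"id": source_id, "points": points} for source_id, points in grouped.items()]


per_source = group_by_source(ROWS, "source_id")
```

The result here is two instances with 3 and 2 points, rather than five single-point instances in sorted order, which is the distinction at stake in the exchange above.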

@gilleslandais
Collaborator

I confirm - it is common in VizieR to have in the same table different objects which have been observed several times

examples:

@lmichel
Collaborator Author

lmichel commented Apr 23, 2021

Why does VODML Mapping make a distinction between COLUMN, CONSTANT and LITERAL while ModelInstanceInVot doesn't?

Let's imagine a table where all data are related to the same source, identified by source_id.

There are 3 possible configurations:
1- source_id is missing: it must be added by the mapping
2- source_id is set as a parameter.

<PARAM name="source_id" value="MY_SOURCE" />

3- source_id is set as a field. In this case, the id will be repeated in each table row

.....
<FIELD name="source_id" .... />
....
<TR><TD>"MY_SOURCE" </TD>....</TR>
<TR><TD>"MY_SOURCE" </TD>....</TR>
...

All 3 options are valid and can be encountered.
Each proposal has its own approach to handling them.

VODML Mapping proposes a different statement for each of these situations.

1- Missing value

<ATTRIBUTE dmrole="model:Source.id">
       <LITERAL value="MY_SOURCE" dmtype="ivoa:string"/>
</ATTRIBUTE>

2- Param value

<ATTRIBUTE dmrole="model:Source.id">
       <CONSTANT ref="source_id" dmtype="ivoa:string"/>
</ATTRIBUTE>

3- Field value

<ATTRIBUTE dmrole="model:Source.id">
       <COLUMN ref="source_id" dmtype="ivoa:string"/>
</ATTRIBUTE>

ModelInstanceInVot proposes a single statement for all of these situations.

<ATTRIBUTE  dmtype="ivoa:string" ref="source_id" value="MY_SOURCE"/>

The processing rules are given by the spec:
1- Search first for a FIELD identified by source_id.
2- If no FIELD, search for a PARAM identified by source_id.
3- If no PARAM, take the value.

Proceeding this way makes the annotation components easier to use, and hence facilitates the annotation process.
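
The three processing rules can be sketched in Python as follows. This is a minimal illustration, assuming a simplified namespace-free table and that "identified by" means a match on either the ID or the name attribute; the helper name `resolve_attribute` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free table: source_id is a PARAM, time is a FIELD.
TABLE = """
<TABLE>
  <PARAM name="source_id" datatype="char" arraysize="*" value="MY_SOURCE"/>
  <FIELD name="time" datatype="double"/>
  <DATA><TABLEDATA>
    <TR><TD>1.0</TD></TR>
    <TR><TD>2.0</TD></TR>
  </TABLEDATA></DATA>
</TABLE>
"""


def resolve_attribute(table, row, ref, value=None):
    """Resolve an ATTRIBUTE: FIELD first, then PARAM, then the literal value."""
    if ref is not None:
        # 1- a FIELD identified by ref: take the cell of the current row
        for index, field in enumerate(table.findall("FIELD")):
            if ref in (field.get("ID"), field.get("name")):
                return row[index]
        # 2- a PARAM identified by ref: take its value attribute
        for param in table.findall("PARAM"):
            if ref in (param.get("ID"), param.get("name")):
                return param.get("value")
    # 3- neither found: fall back to the value given in the annotation
    return value


table = ET.fromstring(TABLE)
rows = [[td.text for td in tr] for tr in table.iter("TR")]
```

Here `resolve_attribute(table, rows[0], "source_id", "MY_SOURCE")` is served by the PARAM; if the provider moved source_id into a FIELD, the same call would read the row cell instead, with no change to the annotation.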

@mcdittmar
Collaborator

Why does VODML Mapping make a distinction between FIELD, PARAM and LITERAL and ModemInstanceInVot doesn't.

There is certainly some consolidation which can happen with these elements.

Correction: VODML Mapping uses

  • LITERAL for in-line value
  • CONSTANT for reference to PARAM
  • COLUMN for reference to FIELD

It distinguishes them, so that there is only 1 action for each case..
Whether this is more or less convenient than having 1 annotation node is the debate.

One point to consider:

  • GLOBALS section is for elements which should generate ONLY 1 INSTANCE; so they can be assigned an ID and be referenced.
    • It would be a problem for any attribute under here to resolve to a FIELD
    • so, the Mapping schema does not allow COLUMN under GLOBALS.
  • TEMPLATES section is for elements where there is 1 instance per table row.
    • here all 3 forms can be used LITERAL|CONSTANT|COLUMN
    • for LITERAL and CONSTANT, the value is repeated for each instance.
      • this means providers can have compact serialization for constant information
      • It has been mentioned that if a provider changes serialization from FIELD to PARAM (or vice-versa), the annotation changes. This is true, but the resulting instances do not.
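
A minimal Python sketch of the "1 action per element" idea, and of the last point about unchanged instances, is given below. The dictionaries standing in for PARAMs, rows and annotation attributes are hypothetical, as are the function names.

```python
def resolve(kind, value, params, row=None):
    """One action per annotation element."""
    if kind == "LITERAL":
        return value            # in-line value
    if kind == "CONSTANT":
        return params[value]    # reference to a PARAM
    if kind == "COLUMN":
        if row is None:
            raise ValueError("COLUMN is not allowed under GLOBALS")
        return row[value]       # reference to a FIELD, read from the row
    raise ValueError(f"unknown element {kind}")


def build_global(attributes, params):
    """GLOBALS: exactly one instance, no row context."""
    return {role: resolve(kind, value, params) for role, (kind, value) in attributes.items()}


def build_per_row(attributes, params, rows):
    """TEMPLATES: one instance per row; LITERAL/CONSTANT values repeat."""
    return [
        {role: resolve(kind, value, params, row) for role, (kind, value) in attributes.items()}
        for row in rows
    ]


# Serialization 1: source_id as a PARAM, annotated with CONSTANT.
as_param = build_per_row(
    {"model:Source.id": ("CONSTANT", "source_id"), "model:Point.time": ("COLUMN", "time")},
    {"source_id": "MY_SOURCE"},
    [{"time": 1.0}, {"time": 2.0}],
)
# Serialization 2: source_id as a FIELD, annotated with COLUMN.
as_field = build_per_row(
    {"model:Source.id": ("COLUMN", "source_id"), "model:Point.time": ("COLUMN", "time")},
    {},
    [{"source_id": "MY_SOURCE", "time": 1.0}, {"source_id": "MY_SOURCE", "time": 2.0}],
)
```

The two annotations differ, but `as_param` and `as_field` hold the same instances.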

@lmichel
Collaborator Author

lmichel commented Apr 26, 2021

[FIELD, PARAM, LITERAL] vs [COLUMN, CONSTANT, LITERAL] fixed in my post

From my perspective, the scope of the mapping XML schema is to validate the syntax, nothing more.
This is similar to a programming language syntax that tells you how to declare a variable but says nothing about the scope of that variable. The compiler first validates the syntax and then applies building rules that are part of the language definition but not stated in the grammar.

With ModelInstanceInVot, the client must resolve data references found in GLOBALS only against the PARAMs of that VOTable. This rule is not written in the XML schema but in the spec (or it should be).

The ModelInstanceInVot approach likely prevents automatic processors such as style sheets from consuming annotated VOTables. But, based on my knowledge of the data, I'm pretty sure this will not be possible in the foreseeable future, whatever the mapping syntax. Clients will always have to go back and forth between data and mapping to complete their job, and thus apply rules that go beyond the purpose of the XML schema.

Considering this, the gain of ModelInstanceInVot in terms of both flexibility and compactness is very valuable.

@msdemlei
Contributor

msdemlei commented Apr 26, 2021 via email

@msdemlei
Contributor

msdemlei commented Apr 26, 2021 via email

@mcdittmar
Collaborator

mcdittmar commented May 3, 2021 via email

@lmichel
Collaborator Author

lmichel commented May 5, 2021

As pointed out with good reason by @msdemlei, repeating FIELD elements in the annotation is useless and possibly confusing.
This has nevertheless been done in the MANGO serializations for @ucd and @description because there was no other way to do it.
To avoid this, the annotation syntax must be able to refer to these FIELD elements.
This is a major enhancement to bring to ModelInstanceInVot.
It could look like this: <SC_FIELD ucd="column_ID"/> or <SC_FIELD desc="column_ID"/>

@Bonnarel
Contributor

Bonnarel commented May 5, 2021 via email

@lmichel
Collaborator Author

lmichel commented May 5, 2021

I mean getting attributes (@ucd) or children (<DESC>) of <FIELD> elements

@mcdittmar
Collaborator

As pointed with good reasons by @msdemlei , repeating FIELD elements into the annotation is useless and possibly confusing.
This has been done however in the MANGO serializations for @ucd and @description because there was no other way to do it.
To avoid this, the annotation syntax must be able to refer to these FIELD elements.
This is major enhancement to bring to ModelInstanceVot.
This could look like this <SC_FIELD ucd="column_ID"/> or <SC_FIELD desc="column_ID"/>

I'm not sure I follow.. 
You're referring specifically to the cases where you have a modeled attribute (here mango:Parameter.ucd) whose value comes from

  1.  manually providing it.. you need this option
  2.  pulling from a VOTable FIELD/PARAM element

The VODML Mapping syntax certainly does not support this, and is similar to the problem where you cannot annotate individual components of a VOTable array. I think that annotating into the sub-structure of the VOTable elements was considered out of scope.

How would this integrate with the ATTRIBUTE element, which above you declare has the benefit of having only a single form?

<INSTANCE dmrole="mango:MangoObject.parameters" dmtype="mango:Parameter">
    <ATTRIBUTE dmrole="mango:Parameter.ucd" dmtype="ivoa:string" value="pos.eq;meta.main"/>
</INSTANCE>

would this become

<INSTANCE dmrole="mango:MangoObject.parameters" dmtype="mango:Parameter">
    <ATTRIBUTE dmrole="mango:Parameter.ucd" dmtype="ivoa:string" >
        <SC_FIELD ucd="_pos_ra"/>
    </ATTRIBUTE>
</INSTANCE>
<FIELD ID="_pos_ra" name="RAJ2000" ucd="pos.eq.ra;meta.main" ref="J2000_2000.000" datatype="char" arraysize="12" unit="&quot;h:m:s&quot;">
    <DESCRIPTION>Right Ascension for the Equinox=J2000.0 and Epoch=J2000.0, on the system of FK5</DESCRIPTION>
</FIELD>

To indicate that the value comes from the 'ucd' element of the FIELD with ID "_pos_ra".
Note: in this case, that may not be appropriate because there is no single FIELD to match with this position parameter.
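
As a sketch of how a client might resolve such a reference, assuming SC_FIELD behaves as proposed (an attribute named ucd or desc whose value is the ID of a FIELD), something like the following Python would do. SC_FIELD is only a suggestion in this thread, so the element, its attributes and the helper name are hypothetical; the FIELD is the one shown above, without namespace.

```python
import xml.etree.ElementTree as ET

TABLE = """
<TABLE>
  <FIELD ID="_pos_ra" name="RAJ2000" ucd="pos.eq.ra;meta.main" ref="J2000_2000.000" datatype="char" arraysize="12" unit="&quot;h:m:s&quot;">
    <DESCRIPTION>Right Ascension for the Equinox=J2000.0 and Epoch=J2000.0, on the system of FK5</DESCRIPTION>
  </FIELD>
</TABLE>
"""


def resolve_sc_field(sc_field, table):
    """Return the ucd attribute or DESCRIPTION text of the referenced FIELD."""
    fields = {field.get("ID"): field for field in table.iter("FIELD")}
    if sc_field.get("ucd") is not None:
        return fields[sc_field.get("ucd")].get("ucd")
    if sc_field.get("desc") is not None:
        return fields[sc_field.get("desc")].findtext("DESCRIPTION")
    raise ValueError("SC_FIELD needs a ucd or desc attribute")


table = ET.fromstring(TABLE)
ucd = resolve_sc_field(ET.fromstring('<SC_FIELD ucd="_pos_ra"/>'), table)
description = resolve_sc_field(ET.fromstring('<SC_FIELD desc="_pos_ra"/>'), table)
```

Note that the value pulled from the FIELD is "pos.eq.ra;meta.main", not the "pos.eq;meta.main" of the hand-written attribute, which illustrates the caveat in the note above.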

We've talked about consolidating the Mapping COLUMN|CONSTANT|LITERAL elements to a single form, but I'm less convinced about absorbing these into the element which assigns the 'role'. I like the fact that with the Mapping syntax, the annotation for <INSTANCE dmtype="meas:Position">... does not change depending on whether or not it is a member of a parent object. (having the same names for elements in both syntaxes is a bit confusing eh?!)

@lmichel
Collaborator Author

lmichel commented May 6, 2021

The problem is well laid out, but the answer needs more thought.

@glemson
Contributor

glemson commented May 17, 2021

Very late in the discussion, but I do not understand the original comment and I don't think it describes the VO-DML mapping proposal correctly. Even if a change were desired, including it in the original approach would constitute a minor update. No reason to create a different proposal.

6 participants