mechanism to include a list or set of values in a string column #25

Open
pdowler opened this issue Jun 28, 2020 · 10 comments
Labels
needs_discussion More discussion needed before fixing

Comments

@pdowler
Contributor

pdowler commented Jun 28, 2020

There seem to be many cases where one wants to put multiple values into a character field, eg:

  • multiple words in the semantics column (DataLink)
  • keywords: a set of keywords that get stored in the database and output (via TAP) as a single column (used in CAOM)

The columns are currently datatype="char" arraysize="*". Since words tend to be different lengths :-) the multi-dimensional array notation in VOTable isn't really usable (the first dimension has to be fixed), but with word lists both the length of the words and the number of words are variable.

@pdowler
Contributor Author

pdowler commented Jun 28, 2020

Ad-hoc/practical solutions usually involve defining a delimiter that isn't used/allowed in any of the values.

Could VOTable feasibly define a standard delimiter that works across all table serialisations? Or would it have to be a mechanism where the writer specifies the delimiter as column (FIELD) or cell (TD) metadata?

@pdowler
Contributor Author

pdowler commented Jun 28, 2020

If there were a common delimiter, DALI could define an xtype, e.g. xtype="word-list", and specify the standard delimiter.

What about a dynamic xtype, e.g. xtype="word-list-|" where | is the delimiter? That is DALI defines "word-list-{delim}".

Given the meaning of xtype as telling the client about some structure in the value- "this is a char* you can interpret as a word-list" - this would be consistent with other use of xtype. Of course, it would only work in cases where a delimiter is known to be safe for all possible values (recall the problem with specifying a null and a streaming output in BINARY).
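
To make the dynamic xtype idea concrete, here is a minimal sketch of how a client might read the delimiter out of an xtype of the form "word-list-{delim}" and split the cell value. The class and method names (WordListXtype, delimiterOf, parse) are hypothetical and not part of DALI, VOTable, or any existing library; treating an empty cell as an empty list is also an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an existing DALI/VOTable API.
class WordListXtype {
    static final String PREFIX = "word-list-";

    // Extract the delimiter from an xtype such as "word-list-|".
    static String delimiterOf(String xtype) {
        if (xtype == null || !xtype.startsWith(PREFIX)
                || xtype.length() == PREFIX.length()) {
            throw new IllegalArgumentException("not a word-list xtype: " + xtype);
        }
        return xtype.substring(PREFIX.length());
    }

    // Split a char* cell value on the delimiter named in the xtype.
    static List<String> parse(String xtype, String value) {
        String delim = delimiterOf(xtype);
        if (value.isEmpty()) {
            // Assumption: an empty cell means "no words", not one empty word.
            return Collections.emptyList();
        }
        // Pattern.quote makes "|" a literal rather than regex alternation;
        // the -1 limit keeps trailing empty items instead of dropping them.
        return Arrays.asList(value.split(Pattern.quote(delim), -1));
    }
}
```

For example, WordListXtype.parse("word-list-|", "dwarf galaxy|quasar") yields the two items "dwarf galaxy" and "quasar". A client that does not recognise the xtype can still treat the value as a plain string.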

@mbtaylor
Member

I support this idea, I think xtype is a good way to address this long-running annoyance. How about using a newline character (\n) as the (fixed) delimiter? Although it is possible to think of cases where you would want to encode variable-length arrays of strings containing newlines, it would be a small minority of the cases where variable-length-string arrays are required, and it wouldn't really be well described by the designation "word list".

@pdowler
Contributor Author

pdowler commented Jan 27, 2021

Prior art: ObsCore + TAP outputs the pol_states column as a list of strings delimited by |

In general I have found | to be a good delimiter that doesn't seem to collide with other uses (and so doesn't require escaping and such). For example, CAOM has several fields named keywords that allow a list of keywords, and some use cases require phrases (dwarf galaxy), so space was out. We wanted something clearer than \t or \n in TAP query results, so we chose | there as well (and in the model, | is reserved so keywords cannot contain the | char; never had a complaint).
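
A reserved delimiter only works if the writer refuses values that contain it. The sketch below shows that check on the writing side; it is not CAOM code, and the class and method names (PhraseList, encode) are hypothetical.

```java
import java.util.List;

// Hypothetical writer-side helper for illustration; not taken from CAOM.
class PhraseList {
    static final char DELIM = '|';

    // Join phrases into one char* cell value, rejecting any phrase that
    // contains the reserved delimiter so that no escaping is ever needed.
    static String encode(List<String> phrases) {
        for (String p : phrases) {
            if (p.indexOf(DELIM) >= 0) {
                throw new IllegalArgumentException(
                        "phrase contains reserved '" + DELIM + "': " + p);
            }
        }
        return String.join(String.valueOf(DELIM), phrases);
    }
}
```

With this, the phrases "dwarf galaxy" and "M31" are written as dwarf galaxy|M31, while a phrase such as "a|b" is rejected at write time rather than silently splitting into two items on read.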

@pdowler
Contributor Author

pdowler commented Jan 27, 2021

More general problem: WD-DALI-1.2 includes an xtype="multipolygon" which has VOTable metadata:

datatype="double" arraysize="*" xtype="multipolygon"

(or float). A multipolygon is {polygon} {separator} {polygon} ..., so we need a delimiter in the double array to separate the component polygons. Component polygons have 6 or more numbers, and there are 1 or more component polygons in a multipolygon. Since polygon is supported as a single double (or float) array, this use case is also a variable-length array that is variable in both dimensions (just like list-of-string). We currently specify NaN value(s) as a separator because they are valid double values and easily parseable... but it seems like a problem that might come up again, so if VOTable supported a little more, we would have to say less about parsing MultiPolygon in DALI and more of the parsing would be done by generic code.

For example, a generic VOTable parser could do

List<double[]> arrays = parser.parse(columnValue);

A parser or application that knew what multipolygon was could use that and convert the raw arrays into a multipolygon object, as could some other structure that was encoded as multiple arrays of numbers.

Side notes: the NaN option is usable for double and float but not for fixed point datatypes.
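
The generic parsing step could look roughly like the sketch below, which splits a flat double array into sub-arrays at NaN separators. The class and method names (NaNDelimitedArrays, split) are hypothetical, and treating a run of consecutive NaN values as a single separator is an assumption, not something DALI specifies here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical generic splitter for illustration; not an existing VOTable API.
class NaNDelimitedArrays {
    // Split a flat array into sub-arrays at NaN separators.
    // Assumption: one or more consecutive NaN values count as one separator,
    // so leading, trailing, or doubled NaNs never produce empty sub-arrays.
    static List<double[]> split(double[] values) {
        List<double[]> out = new ArrayList<>();
        int start = -1; // index where the current sub-array began, or -1
        for (int i = 0; i <= values.length; i++) {
            boolean atSeparator = (i == values.length) || Double.isNaN(values[i]);
            if (atSeparator) {
                if (start >= 0) {
                    out.add(Arrays.copyOfRange(values, start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        return out;
    }
}
```

Code that knows what a multipolygon is can then check that each sub-array has an even length of at least 6 and build polygon objects from them, while the splitting itself stays independent of the xtype.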

@pdowler
Contributor Author

pdowler commented Jan 27, 2021

Just to get the wheels turning a little....

What about something like allowing arraysize="*x*" with either a fixed delimiter (per datatype: | for char, NaN for double and float, ??? for int) and/or a settable `delim="|"` attribute on FIELD (and PARAM)? Old VOTable parsers would certainly not expect or know how to deal with that arraysize value... so that's something to consider.

@pdowler
Contributor Author

pdowler commented Nov 18, 2022

Since we are coming around to a final WD-DALI-1.2 and an RFC, I would like to resolve this. My current inclination is to go with a pure xtype solution in DALI or as custom xtypes, which would have something like:

xtype="words" : space-delimited list of words
xtype="phrases" : |-delimited list of phrases
...
xtype="multipolygon" : NaN-delimited list of (simple) polygons

Of course: each of the above to be discussed individually over in DALI when use cases (usually TAP) are presented.

I have no concrete use case for list of int[] and when I try to stretch and make one up (multipolygon in pixel coordinates) I remember that in FITS pixel coordinates are also floating point... so will ignore.

The only thing that could happen in VOTable might be to define the "words" and "phrases" xtypes here instead of DALI (debatable).

@msdemlei
Contributor

msdemlei commented Nov 21, 2022 via email

@mbtaylor
Member

FWIW, TOPCAT already understands multi-polygons marked as xtype="polygon" with pairs of NaN as delimiter. I don't recall if any data providers are supplying such values though.

@pdowler
Contributor Author

pdowler commented Nov 21, 2022

My only point here is that I am no longer seeking a VOTable solution to this, hence

Of course: each of the above to be discussed individually over in DALI when use cases (usually TAP) are presented.

and I didn't even say I would actually bring this to DALI either:

xtype solution in DALI or as custom xtypes

That was my only point here.

@tomdonaldson tomdonaldson added the needs_discussion More discussion needed before fixing label Feb 17, 2023