usefulness of "for" properties in the controlled vocabulary module #55

DavidFatDavidF · 2023-11-27T12:33:13Z

The tags have multiple "for" properties, e.g., forHeadwords. Do we have restrictions on how these may be combined, e.g., can an inflectedFormTag apply to headwords, translations, or languages? Would it not make sense to combine these into a single property with values, e.g., for=headwords instead of forHeadwords=true?

michmech · 2023-12-11T20:57:23Z

Yes, in principle the for... properties could be combined into a single multi-value property, so that e.g. instead of forHeadwords=true and forTranslations=true we would have for=headwords|translations. But I think the way we have it now is easier to implement using e.g. XML attributes or database table columns.
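To make the trade-off concrete, here is a minimal Python sketch of collapsing the boolean flags into one multi-value string. The helper name `combine_for_flags` and the `LEGACY_FLAGS` mapping are hypothetical and only cover the two flags named above; the "|" separator simply follows the for=headwords|translations example.

```python
# Hypothetical mapping from the existing boolean flags to multi-value tokens.
# Only the two flags mentioned in this comment are covered.
LEGACY_FLAGS = {
    "forHeadwords": "headwords",
    "forTranslations": "translations",
}


def combine_for_flags(tag: dict) -> str:
    """Collapse boolean for... flags into a single "|"-separated string."""
    values = [value for flag, value in LEGACY_FLAGS.items() if tag.get(flag)]
    return "|".join(values)


print(combine_for_flags({"forHeadwords": True, "forTranslations": True}))
# prints: headwords|translations
```

A tag with no flags set yields an empty string, which a consumer would presumably treat the same as a missing property.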

As for restrictions on how they may be combined, there are (mostly!) no restrictions and all combinations are valid. There are some exceptions and yes I agree we should state them explicitly. For example, under inflectedFormTag, the forLanguage property only makes sense if forTranslations=true.

DavidFatDavidF · 2023-12-12T12:05:30Z

Decision Dec 12:
@michmech to evaluate impact of @jmccrae proposal by next time
Overall we believe that a general for property is more extensible

jmccrae · 2023-12-12T12:07:18Z

Suggested Implementation:

for property can have the following values separated by spaces

headwords
translations
language:<lang_code>
pos:<part_of_speech>
collocates
etymology
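For illustration, a consumer of such a space-separated value would need a small application-level parser. The sketch below is an assumption about how that could look in Python, not part of the proposal; the function name `parse_for` is hypothetical.

```python
from typing import List, Optional, Tuple


def parse_for(value: str) -> List[Tuple[str, Optional[str]]]:
    """Split a space-separated for string into (constraint, detail) pairs."""
    pairs = []
    for token in value.split():
        # "language:cs" -> ("language", "cs"); "headwords" -> ("headwords", None)
        name, sep, detail = token.partition(":")
        pairs.append((name, detail if sep else None))
    return pairs


print(parse_for("translations language:cs"))
# prints: [('translations', None), ('language', 'cs')]
```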

jmccrae · 2023-12-12T12:11:11Z

Accepting a suggestion as above would also effectively fix #60

michmech · 2023-12-20T08:18:54Z

OK, so here is how it could work.

A new model-level `for` property

At model-level, instances of the following types would be allowed to have a property called for:

partOfSpeechTag
inflectedFormTag
labelTag
transcriptionSchemeTag

The property would be optional and repeatable (zero or more).

The property would define, inside each lexicographic resource, how the tag is allowed to be used. Examples:

partOfSpeechTag tag=“noun” for=“headwords” would mean that headwords can be labelled with the part of speech “noun” (but translations can’t)
partOfSpeechTag tag=“nounMasculineAnimate” for=“translations language:cs” would mean that translations in Czech can be labelled with the part of speech “masculine animate noun” (but translations in other languages can’t)
inflectedFormTag tag=“plural” for=“partOfSpeech:noun” would mean that anything (headword or translation) which has the part-of-speech label “noun” can have an inflected form labelled as “plural” (while e.g. verbs can’t)
labelTag tag=“ulsterDialect” for=“translations language:ga” would mean that translations in Irish can be labelled as “Ulster dialect” while translations in other languages can’t
inflectedFormTag tag=“pluralGenitive” for=“translations language:ga inflectedFormTag:plural” would mean that Irish translations which have a plural inflected form can also have a plural genitive inflected form

Intended semantics of `for`

The intended semantics of the for property would be as follows:

when a tag (such as labelTag tag=“informal”) doesn’t have a for property, then there are no constraints: the tag (e.g. here informal) can be used anywhere (e.g. here, as the value of a label anywhere)
when a tag does have a for property (such as labelTag tag=“ulsterDialect” for=“translations language:ga”), then the tag can only be used in contexts that comply with the conjunction (logical “and”) of the constraints (e.g. here, ulsterDialect can be used as the value of a label when the label is a child of a translation and the translation has the language tag ga).
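These semantics can be sketched in a few lines of Python. This is only an illustration under assumptions: the context is modelled as a set of fact strings in the same notation as the for values, and the names `context_facts` and `is_allowed` are hypothetical.

```python
from typing import Optional, Set


def context_facts(location: str, language: Optional[str] = None) -> Set[str]:
    """Describe where a tag is being used, e.g. on a translation in Irish."""
    facts = {location}
    if language:
        facts.add(f"language:{language}")
    return facts


def is_allowed(for_value: Optional[str], facts: Set[str]) -> bool:
    """No for property means no constraints; otherwise every constraint must hold."""
    if not for_value:
        return True
    return all(token in facts for token in for_value.split())


print(is_allowed("translations language:ga", context_facts("translations", "ga")))  # prints: True
print(is_allowed("translations language:ga", context_facts("translations", "cs")))  # prints: False
print(is_allowed(None, context_facts("headwords")))                                 # prints: True
```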

Predefined allowed values of `for`

The values (headwords, language:cs etc.) would be instances of a new type called tagConstraint. This type would be defined in the Controlled Values module. The Controlled Values module would list a handful of predefined values, which implementors would be free to extend (= to add their own values). The predefined values would be:

headwords
translations
etymology
collocates
language combined with one of the language codes defined by a translationLanguage instance in the same lexicographic resource
partOfSpeech combined with one of the part-of-speech tags defined by a partOfSpeechTag instance in the same lexicographic resource
inflectedForm combined with one of the inflected-form tags defined by an inflectedFormTag instance in the same lexicographic resource
label combined with one of the label tags defined by a labelTag instance in the same lexicographic resource

Extending the allowed values of `for`

Example of a custom tagConstraint value that implementors might want to create:

tagConstraint constraint="headwordsBeginningWithB"
partOfSpeechTag tag="crazyNoun" for="headwordsBeginningWithB"

This means that the part-of-speech tag crazyNoun can only apply to nouns that begin with “b”.

Enforcing the constraints

It would be up to the implementor to decide whether and how they want to enforce the constraints defined by the for properties: as business rules, as occasional “quality assurance” checks, or not at all (= only as a style guide for human lexicographers). DMLex would only provide a formalism for expressing the constraints, not for enforcing them.

Serializations

At serialization-level, I am not in favour of lumping stuff into space-separated/colon-separated strings such as "translations language:cs". I’m in favour of analyzing everything fully and explicitly. I propose this.

JSON

{
  "partOfSpeechTags": [{
    "tag": "nounMasculineAnimate",
    "for": ["translations", {"language": "cs"}],
    "sameAs": [...]
  }, ...]
}

XML

<partOfSpeechTag tag="nounMasculineAnimate">
  <for constraint="translations"/>
  <for constraint="language" detail="cs"/>
</partOfSpeechTag>

jmccrae · 2023-12-20T09:05:15Z

I agree with the proposal other than that I prefer a space-separated string for serialization

michmech · 2023-12-20T09:50:16Z

Here are my arguments for not being in favour of spaces-and-colons-separated strings.

Argument 1: Consistency with rest of DMLex

A spaces-and-colons-separated string here would be inconsistent with the approach taken everywhere else in the DMLex serializations. For example, we never do stuff like this:

<entry labels="informal archaic">
  ...
</entry>

and instead we always do stuff like this:

<entry>
  <label tag="informal"/>
  <label tag="archaic"/>
  ...
</entry>

Argument 2: It’s not the JSON/XML way

JSON and XML parsers cannot “see” the structure inherent in these strings.

To process a string like “translations language:cs” in an application you have to do your own application-level parsing: split by space, then iterate, then split each by colon. Why bother if your JSON or XML parser can deliver these to you already parsed?

Yes, writing your parsing routine for these things can be a trivial one-liner if you’re processing e.g. dictionary entries one by one. But it can become a nuisance if you want to do some kind of bulk processing, like “give me all tag types that are ‘for’ translations but do not have a language specified”. Doing this in e.g. an XSL stylesheet is straightforward if the XML object model can “see” the individual ‘for’ values (= my way) but not if not (= John’s way).
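As a rough illustration of that bulk query against the explicit serialization, here is a Python sketch using the standard library's ElementTree. The sample document, including the wrapper element and the tag names, is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample data using the proposed <for> child elements.
XML = """<lexicographicResource>
  <labelTag tag="informal">
    <for constraint="translations"/>
  </labelTag>
  <labelTag tag="ulsterDialect">
    <for constraint="translations"/>
    <for constraint="language" detail="ga"/>
  </labelTag>
  <labelTag tag="archaic"/>
</lexicographicResource>"""

root = ET.fromstring(XML)

# Tags that are "for" translations but have no language constraint.
matches = [
    tag.get("tag")
    for tag in root
    if tag.find("for[@constraint='translations']") is not None
    and tag.find("for[@constraint='language']") is None
]
print(matches)  # prints: ['informal']
```

No string splitting is needed here because each constraint is already a separate node in the parsed tree.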

jmccrae · 2023-12-20T10:19:28Z

We would have to introduce for as a new object type in the model.

I am also a little concerned about the inconsistency in the JSON serialization with both strings and objects in the same array, this often creates issues with the parsing, as you have to check the type first.

The single string proposal is easily processed with XSLT, e.g.

<xsl:if test="contains(for, 'translations')"/>

michmech · 2023-12-20T11:23:56Z

1.

We would have to introduce for as a new object type in the model.

No we wouldn’t. We would need to introduce a new <for> element in the XML serialization, but that’s only in the serialization, not in the model.

We have done something similar once or twice already, such as the <text> element (inside <example>) or the <member> element (inside <relation>). These XML elements do not correspond to any object types in the model, they just implement certain properties of other object types from the model.

2.

I am also a little concerned about the inconsistency in the JSON serialization with both strings and objects in the same array, this often creates issues with the parsing, as you have to check the type first.

True, it is a little frowned-upon in the JavaScript/JSON universe to have arrays with mixtures of different types inside them. Perhaps something like this would be better:

"for": [
   {"constraint": "translations"},
   {"constraint": "language", "detail": "cs"}
]

Bonus: makes it very similar to my proposed XML serialization. Drawback: a bit wordy (but not more than the XML serialization).
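The difference for a consumer can be sketched in Python as follows. Both reader functions are hypothetical; the point is only that the mixed array needs a type check on every item while the uniform array does not.

```python
import json

mixed = json.loads('["translations", {"language": "cs"}]')
uniform = json.loads(
    '[{"constraint": "translations"}, {"constraint": "language", "detail": "cs"}]'
)


def read_mixed(items):
    """Strings and objects share one array, so each item's type is checked first."""
    pairs = []
    for item in items:
        if isinstance(item, str):
            pairs.append((item, None))
        else:
            pairs.extend(item.items())
    return pairs


def read_uniform(items):
    """Every item is an object with the same keys, so no type check is needed."""
    return [(item["constraint"], item.get("detail")) for item in items]


print(read_mixed(mixed))      # prints: [('translations', None), ('language', 'cs')]
print(read_uniform(uniform))  # prints: [('translations', None), ('language', 'cs')]
```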

3.

The single string proposal is easily processed with XSLT, e.g.
<xsl:if test="contains(for, 'translations')"/>

That’s true, but not bullet-proof. The XPath function contains(...) does plain substring matching, so it can result in unexpected behaviour:

<labelTag tag="informal" for="translations lang:pt"/>
<labelTag tag="Rio dialect" for="translations lang:pt-BR"/>

<xsl:if test="contains(for, 'lang:pt')"/> returns two results. There’s no bullet-proof way to make it return only the first one (or correct me if I’m wrong).
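The same pitfall is easy to reproduce outside XSLT. The Python sketch below contrasts plain substring matching with matching on whole space-separated tokens, using the two example tags from above. For what it is worth, a commonly used XPath 1.0 workaround is to pad both sides with spaces, along the lines of contains(concat(' ', normalize-space(@for), ' '), ' lang:pt '), though that is arguably more fragile than having the values already parsed.

```python
# The two example tags from above, keyed by tag name.
tags = {
    "informal": "translations lang:pt",
    "Rio dialect": "translations lang:pt-BR",
}

# Plain substring matching, like XPath contains(...): "lang:pt-BR" also matches.
substring_hits = [name for name, value in tags.items() if "lang:pt" in value]

# Matching on whole space-separated tokens: only the exact value matches.
token_hits = [name for name, value in tags.items() if "lang:pt" in value.split()]

print(substring_hits)  # prints: ['informal', 'Rio dialect']
print(token_hits)      # prints: ['informal']
```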

Additionally, there might be performance bottlenecks during bulk processing. The XPath processor would typically have to do the substring matching on-the-spot during each run instead of being able to rely on an already parsed object model.

So, I’m not convinced, I still prefer the fully explicit serializations.

DavidFatDavidF · 2023-12-20T12:29:47Z

20th December 2023
We choose 2. as the JSON serialization
Impact on the abstract data model:
"for" will become an object with properties in the abstract data model
@michmech to implement by January 3, 2024

michmech · 2024-01-04T15:29:49Z

Having thought about it a bit more, I’m afraid the scheme we have agreed on is not expressive enough. It is unable to express that, for example, something is only allowed on translations in languages X, Y and Z but not others.

To express such things, it seems to me that, after all, we have no choice but to go with something much like John’s original proposal where the for property is just a string of text, and this string consists of a privately-interpreted notation that could look maybe like this:

partOfSpeechTag
  tag="nounMasculine"
  for="headword OR (translation AND (language:cs OR language:de))"

This would mean that the part-of-speech tag nounMasculine is allowed to be used to annotate something which is either a headword or a translation in Czech or German.
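To give a feel for what interpreting such a notation involves, here is a minimal Python sketch of an evaluator. Everything in it is an assumption made for illustration: the grammar (atoms, AND, OR, parentheses, with AND binding tighter than OR), the function names, and the idea of describing the usage context as a set of atomic terms.

```python
import re
from typing import List, Set


def tokenize(expression: str) -> List[str]:
    """Split into parentheses and whitespace-delimited words."""
    return re.findall(r"\(|\)|[^\s()]+", expression)


def evaluate(expression: str, facts: Set[str]) -> bool:
    """Evaluate an AND/OR expression; an atom is true if it is in facts."""
    tokens = tokenize(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return token

    def parse_or() -> bool:
        result = parse_and()
        while peek() == "OR":
            take()
            right = parse_and()  # always parse the right side, then combine
            result = result or right
        return result

    def parse_and() -> bool:
        result = parse_atom()
        while peek() == "AND":
            take()
            right = parse_atom()
            result = result and right
        return result

    def parse_atom() -> bool:
        token = take()
        if token == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        if token in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {token!r}")
        return token in facts

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


RULE = "headword OR (translation AND (language:cs OR language:de))"
print(evaluate(RULE, {"headword"}))                    # prints: True
print(evaluate(RULE, {"translation", "language:de"}))  # prints: True
print(evaluate(RULE, {"translation", "language:ga"}))  # prints: False
```

Custom atomic terms such as headwordBeginningWithB would need no parser changes under this scheme; the implementor would only have to decide when to put them into the set of facts.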

What we’d lose: the ability to model the constraints explicitly in the formalisms that we have serializations for (JSON, XML...). Instead, we are embedding a private notation as a string.

What we’d gain: full expressivity; the ability to express all and any combinations of constraints, even crazy ones (if we allow implementors to extend the notation with their own atomic terms e.g. headwordBeginningWithB).

Another option is to leave the contents of the for property completely underspecified at the model level, and leave it for the serialization. In the XML serialization it could be an XPath expression:

<partOfSpeechTag
  tag="nounMasculine"
  for="entry or headwordTranslation[@langCode='cs' or @langCode='de']"
/>

What would it be for other serializations? I hear XPath 3.1 can query JSON too, so maybe that. Or some other JSON query language, there are a couple of them around.

Also, in my experience, most dictionary schemas in existence today don’t bother modelling these constraints at all (and so they allow e.g. adding “plurals” to verbs, or “past tense” forms to nouns). So, whatever DMLex comes up with will either be a big step forward or completely ignored by implementors anyway.

DavidFatDavidF · 2024-01-05T08:52:20Z

Decision taken on 5th Jan 2024:
The data model will be simplified to 1 private string (open for implementers to use their own preferred syntax within the string)

michmech · 2024-01-11T07:39:02Z

Implemented. We now have a single optional string-valued for property where previously we had multiple for... properties (forLanguage, forPartOfSpeech etc.).

This issue will close automatically when pull request #77 is merged.

DavidFatDavidF assigned michmech Nov 27, 2023

DavidFatDavidF added this to the 1st public review milestone Nov 27, 2023

michmech changed the title ~~usefulness of "for" proreprties in the controlled vocabulary module~~ usefulness of "for" properties in the controlled vocabulary module Dec 11, 2023

DavidFatDavidF added the module affects only modules label Dec 12, 2023

DavidFatDavidF added enhancement New feature or request help wanted Extra attention is needed labels Dec 12, 2023

DavidFatDavidF mentioned this issue Dec 20, 2023

Inconsistent serializations of Controlled vocabularies module's "inflectedFormTag" #60

Closed

jmccrae mentioned this issue Jan 1, 2024

Update examples based on work with the converter #74

Merged

DavidFatDavidF added wontfix This will not be worked on and removed wontfix This will not be worked on labels Jan 5, 2024

michmech mentioned this issue Jan 11, 2024

replace "for..." properties with one "for" property #77

Merged

DavidFatDavidF closed this as completed in #77 Jan 12, 2024
