usefulness of "for" properties in the controlled vocabulary module #55

Closed
DavidFatDavidF opened this issue Nov 27, 2023 · 13 comments · Fixed by #77
Labels
enhancement New feature or request help wanted Extra attention is needed module affects only modules

Comments

@DavidFatDavidF
Contributor

The tags have multiple "for" properties, e.g., forHeadwords. Do we have restrictions on how these may be combined, e.g., can an inflectedFormTag apply to headwords or translations or languages? Would it not make sense to combine these into a single property with values, e.g., instead of forHeadwords=true have for=headwords?

@DavidFatDavidF DavidFatDavidF added this to the 1st public review milestone Nov 27, 2023
@michmech michmech changed the title usefulness of "for" proreprties in the controlled vocabulary module usefulness of "for" properties in the controlled vocabulary module Dec 11, 2023
@michmech
Contributor

Yes, in principle the for... properties could be combined into a single multi-value property, so that e.g. instead of forHeadwords=true and forTranslations=true we would have for=headwords|translations. But I think the way we have it now is easier to implement using eg. XML attributes or database table columns.

As for restrictions on how they may be combined, there are (mostly!) no restrictions and all combinations are valid. There are some exceptions and yes I agree we should state them explicitly. For example, under inflectedFormTag, the forLanguage property only makes sense if forTranslations=true.

@DavidFatDavidF DavidFatDavidF added the module affects only modules label Dec 12, 2023
@DavidFatDavidF
Contributor Author

Decision Dec 12:
@michmech to evaluate impact of @jmccrae proposal by next time
Overall we believe that a general for property is more extensible

@DavidFatDavidF DavidFatDavidF added enhancement New feature or request help wanted Extra attention is needed labels Dec 12, 2023
@jmccrae
Contributor

jmccrae commented Dec 12, 2023

Suggested Implementation:

for property can have the following values separated by spaces

  • headwords
  • translations
  • language:<lang_code>
  • pos:<part_of_speech>
  • collocates
  • etymology
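
A minimal sketch of how such a space-separated value could be read, assuming Python and a hypothetical helper named parse_for (neither is part of the proposal):

```python
from typing import List, Optional, Tuple

def parse_for(value: str) -> List[Tuple[str, Optional[str]]]:
    """Split a space-separated `for` string into (constraint, detail) pairs."""
    pairs = []
    for token in value.split():
        # partition() splits on the first colon only; sep is "" when no colon is present
        name, sep, detail = token.partition(":")
        pairs.append((name, detail if sep else None))
    return pairs

print(parse_for("translations language:cs"))
# [('translations', None), ('language', 'cs')]
```

Values without a colon (headwords, translations, collocates, etymology) come back with no detail, while language:<lang_code> and pos:<part_of_speech> carry one.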

@jmccrae
Contributor

jmccrae commented Dec 12, 2023

Accepting a suggestion as above would also effectively fix #60

@michmech
Contributor

michmech commented Dec 20, 2023

OK, so here is how it could work.

A new model-level for property

At model-level, instances of the following types would be allowed to have a property called for:

  • partOfSpeechTag
  • inflectedFormTag
  • labelTag
  • transcriptionSchemeTag

The property would be optional and repeatable (zero or more).

The property would define, inside each lexicographic resource, how the tag is allowed to be used. Examples:

  • partOfSpeechTag tag=“noun” for=“headwords” would mean that headwords can be labelled with the part of speech “noun” (but translations can’t)

  • partOfSpeechTag tag=“nounMasculineAnimate” for=“translations language:cs” would mean that translations in Czech can be labelled with the part of speech “masculine animate noun” (but translations in other languages can’t)

  • inflectedFormTag tag=“plural” for=“partOfSpeech:noun” would mean that anything (headword or translation) which has the part-of-speech label “noun” can have an inflected form labelled as “plural” (while e.g. verbs can’t)

  • labelTag tag=“ulsterDialect” for=“translations language:ga” would mean that translations in Irish can be labelled as “Ulster dialect” while translations in other languages can’t

  • inflectedFormTag tag=“pluralGenitive” for=“translations language:ga inflectedFormTag:plural” would mean that Irish translations which have a plural inflected form can also have a plural genitive inflected form

Intended semantics of for

The intended semantics of the for property would be as follows:

  • when a tag (such as labelTag tag=“informal”) doesn’t have a for property, then there are no constraints: the tag (e.g. here informal) can be used anywhere (e.g. here, as the value of a label anywhere)

  • when a tag does have a for property (such as labelTag tag=“ulsterDialect” for=“translations language:ga”), then the tag can only be used in contexts that comply with the conjunction (logical “and”) of the constraints (e.g. here, ulsterDialect can be used as the value of a label when the label is a child of a translation and the translation has the language tag ga).
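
The two rules above can be sketched as a small check. This is only an illustration in Python: the function name tag_allowed, the (name, detail) pair representation and the shape of the context dictionary are all assumptions, not anything DMLex defines.

```python
def tag_allowed(constraints, context):
    """Return True when every constraint holds in `context` (logical "and").

    `constraints` is a list of (name, detail) pairs; an empty list stands for
    a tag without a `for` property, which may be used anywhere.
    `context` is a hypothetical dict describing where the tag is being used.
    """
    for name, detail in constraints:
        if name in ("headwords", "translations"):
            if context.get("parent") != name:
                return False
        elif name == "language":
            if context.get("language") != detail:
                return False
        else:
            # Constraint kinds not handled here count as unsatisfied in this sketch.
            return False
    return True

ulster = [("translations", None), ("language", "ga")]
print(tag_allowed(ulster, {"parent": "translations", "language": "ga"}))  # True
print(tag_allowed(ulster, {"parent": "translations", "language": "cs"}))  # False
print(tag_allowed([], {"parent": "headwords"}))                           # True
```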

Predefined allowed values of for

The values (headwords, language:cs etc.) would be instances of a new type called tagConstraint. This type would be defined in the Controlled Values module. The Controlled Values module would list a handful of predefined values, which implementors would be free to extend (= to add their own values). The predefined values would be:

  • headwords

  • translations

  • etymology

  • collocates

  • language combined with one of the language codes defined by a translationLanguage instance in the same lexicographic resource

  • partOfSpeech combined with one of the part-of-speech tags defined by a partOfSpeechTag instance in the same lexicographic resource

  • inflectedForm combined with one of the inflected-form tags defined by an inflectedFormTag instance in the same lexicographic resource

  • label combined with one of the label tags defined by a labelTag instance in the same lexicographic resource

Extending the allowed values of for

Example of a custom tagConstraint value that implementors might want to create:

tagConstraint constraint="headwordsBeginningWithB"
partOfSpeechTag tag="crazyNoun" for="headwordsBeginningWithB"

This means that the part-of-speech tag crazyNoun can only apply to nouns that begin with “b”.

Enforcing the constraints

It would be up to the implementor to decide whether and how they want to enforce the constraints defined by the for properties: as business rules, as occasional “quality assurance” checks, or not at all (= only as a style guide for human lexicographers). DMLex would only provide a formalism for expressing the constraints, not for enforcing them.

Serializations

At serialization-level, I am not in favour of lumping stuff into space-separated/colon-separated strings such as "translations language:cs". I’m in favour of analyzing everything fully and explicitly. I propose this.

JSON

{
  "partOfSpeechTags": [{
    "tag": "nounMasculineAnimate",
    "for": ["translations", {"language": "cs"}],
    "sameAs": [...]
  }, ...]
}

XML

<partOfSpeechTag tag="nounMasculineAnimate">
  <for constraint="translations"/>
  <for constraint="language" detail="cs"/>
</partOfSpeechTag>

@jmccrae
Contributor

jmccrae commented Dec 20, 2023

I agree with the proposal other than that I prefer a space-separated string for serialization

@michmech
Contributor

Here are my arguments for not being in favour of spaces-and-colons-separated strings.

Argument 1: Consistency with rest of DMLex

A spaces-and-colons-separated string here would be inconsistent with the approach taken everywhere else in the DMLex serializations. For example, we never do stuff like this:

<entry labels="informal archaic">
  ...
</entry>

and instead we always do stuff like this:

<entry>
  <label tag="informal"/>
  <label tag="archaic"/>
  ...
</entry>

Argument 2: It’s not the JSON/XML way

JSON and XML parsers cannot “see” the structure inherent in these strings.

To process a string like “translations language:cs” in an application you have to do your own application-level parsing: split by space, then iterate, then split each by colon. Why bother if your JSON or XML parser can deliver these to you already parsed?

Yes, writing your own parsing routine for these things can be a trivial one-liner if you’re processing e.g. dictionary entries one by one. But it can become a nuisance if you want to do some kind of bulk processing, like “give me all tag types that are ‘for’ translations but do not have a language specified”. Doing this in e.g. an XSL stylesheet is straightforward if the XML object model can “see” the individual ‘for’ values (= my way) but not if it cannot (= John’s way).
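
To illustrate the bulk query mentioned above against the explicit <for> serialization, here is a rough sketch in Python using the standard library's ElementTree. The <tags> wrapper element, the sample data and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a made-up <tags> wrapper around tags in the explicit <for> form.
SAMPLE = """
<tags>
  <labelTag tag="informal">
    <for constraint="translations"/>
  </labelTag>
  <labelTag tag="ulsterDialect">
    <for constraint="translations"/>
    <for constraint="language" detail="ga"/>
  </labelTag>
  <labelTag tag="archaic">
    <for constraint="headwords"/>
  </labelTag>
</tags>
"""

def translation_tags_without_language(xml_text):
    """List tags that are 'for' translations but have no language constraint."""
    root = ET.fromstring(xml_text)
    result = []
    for tag in root:
        # The XML parser already delivers each constraint as its own element.
        constraints = {f.get("constraint") for f in tag.findall("for")}
        if "translations" in constraints and "language" not in constraints:
            result.append(tag.get("tag"))
    return result

print(translation_tags_without_language(SAMPLE))  # ['informal']
```

No string splitting is needed here because each constraint arrives as a separate element with its own attributes.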

@jmccrae
Contributor

jmccrae commented Dec 20, 2023

We would have to introduce for as a new object type in the model.

I am also a little concerned about the inconsistency in the JSON serialization with both strings and objects in the same array, this often creates issues with the parsing, as you have to check the type first.

The single string proposal is easily processed with XSLT, e.g.

<xsl:if test="contains(for, 'translations')"/>

@michmech
Contributor

1.

“We would have to introduce for as a new object type in the model.”

No we wouldn’t. We would need to introduce a new <for> element in the XML serialization, but that’s only in the serialization, not in the model.

We have done something similar once or twice already, such as the <text> element (inside <example>) or the <member> element (inside <relation>). These XML elements do not correspond to any object types in the model, they just implement certain properties of other object types from the model.

2.

“I am also a little concerned about the inconsistency in the JSON serialization with both strings and objects in the same array, this often creates issues with the parsing, as you have to check the type first.”

True, it is a little frowned-upon in the JavaScript/JSON universe to have arrays with mixtures of different types inside them. Perhaps something like this would be better:

"for": [
   {"constraint": "translations"},
   {"constraint": "language", "detail": "cs"}
]

Bonus: makes it very similar to my proposed XML serialization. Drawback: a bit wordy (but not more than the XML serialization).
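
As a rough illustration of why uniform objects are easier to consume, the following Python sketch reads such a "for" array with the standard json module. The surrounding object with a "tag" key is assumed for the example.

```python
import json

# Hypothetical tag object using the all-objects form of "for".
doc = json.loads("""
{
  "tag": "nounMasculineAnimate",
  "for": [
    {"constraint": "translations"},
    {"constraint": "language", "detail": "cs"}
  ]
}
""")

# Every item is an object, so no type check is needed before reading it.
pairs = [(item["constraint"], item.get("detail")) for item in doc["for"]]
print(pairs)  # [('translations', None), ('language', 'cs')]
```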

3.

“The single string proposal is easily processed with XSLT, e.g.
<xsl:if test="contains(for, 'translations')"/>”

That’s true, but not bullet-proof. The XPath function contains(...) does plain substring matching, so it can result in unexpected behaviour:

<labelTag tag="informal" for="translations lang:pt"/>
<labelTag tag="Rio dialect" for="translations lang:pt-BR"/>

<xsl:if test="contains(for, 'lang:pt')"/> returns two results. There’s no bullet-proof way to make it return only the first one (or correct me if I’m wrong).
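
The same pitfall can be reproduced outside XPath. The Python sketch below (function names are made up for the illustration) contrasts plain substring matching with whole-token matching; note that the second variant only works because the string is first split into tokens, which is exactly the application-level parsing discussed earlier.

```python
def has_constraint_substring(for_value, wanted):
    """Plain substring test, comparable to XPath contains()."""
    return wanted in for_value

def has_constraint_token(for_value, wanted):
    """Whole-token test: split on whitespace first, then compare tokens."""
    return wanted in for_value.split()

tags = {
    "informal": "translations lang:pt",
    "Rio dialect": "translations lang:pt-BR",
}

print([t for t, f in tags.items() if has_constraint_substring(f, "lang:pt")])
# ['informal', 'Rio dialect']
print([t for t, f in tags.items() if has_constraint_token(f, "lang:pt")])
# ['informal']
```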

Additionally, there might be performance bottlenecks during bulk processing. The XPath processor would typically have to do the substring matching on-the-spot during each run instead of being able to rely on an already parsed object model.

So, I’m not convinced, I still prefer the fully explicit serializations.

@DavidFatDavidF
Contributor Author

20th December 2023
We choose 2. as the JSON serialization
Impact on the abstract data model:
"for" will become an object with properties in the abstract data model
@michmech to implement by January 3, 2024

@michmech
Contributor

michmech commented Jan 4, 2024

Having thought about it a bit more, I’m afraid the scheme we have agreed on is not expressive enough. It is unable to express that, for example, something is only allowed on translations in languages X, Y and Z but not others.

To express such things, it seems to me that, after all, we have no choice but to go with something much like John’s original proposal where the for property is just a string of text, and this string consists of a privately-interpreted notation that could look maybe like this:

partOfSpeechTag
  tag="nounMasculine"
  for="headword OR (translation AND (language:cs OR language:de))"

This would mean that the part-of-speech tag nounMasculine is allowed to be used to annotate something which is either a headword or a translation in Czech or German.
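
To show that a notation like this is cheap to interpret, here is a minimal evaluator sketch in Python. The grammar is an assumption made for the example (atoms joined by AND / OR, parentheses for grouping, AND binding tighter than OR), and evaluate_for and the set of "facts" describing the annotated item are hypothetical names.

```python
import re

def evaluate_for(expression, facts):
    """Evaluate a `for` expression against a set of atomic facts.

    Assumed grammar (not defined by DMLex): atoms joined by AND / OR,
    with parentheses; AND binds tighter than OR.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_atom()
        while peek() == "AND":
            take()
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        if peek() is None:
            raise ValueError("unexpected end of expression")
        token = take()
        if token == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        if token in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {token!r}")
        # An atom is true when it is among the facts known about the item.
        return token in facts

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result

RULE = "headword OR (translation AND (language:cs OR language:de))"
print(evaluate_for(RULE, {"headword"}))                    # True
print(evaluate_for(RULE, {"translation", "language:de"}))  # True
print(evaluate_for(RULE, {"translation", "language:ga"}))  # False
```

Custom atomic terms such as headwordBeginningWithB would fit this sketch unchanged, as long as the caller supplies them as facts.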

What we’d lose: the ability to model the constraints explicitly in the formalisms that we have serializations for (JSON, XML...). Instead, we are embedding a private notation as a string.

What we’d gain: full expressivity; the ability to express any and all combinations of constraints, even crazy ones (if we allow implementors to extend the notation with their own atomic terms e.g. headwordBeginningWithB).

Another option is to leave the contents of the for property completely underspecified at the model level, and leave it for the serialization. In the XML serialization it could be an XPath expression:

<partOfSpeechTag
  tag="nounMasculine"
  for="entry or headwordTranslation[@langCode='cs' or @langCode='de']"
/>

What would it be for other serializations? I hear XPath 3.1 can query JSON too, so maybe that. Or some other JSON query language, there are a couple of them around.

Also, in my experience, most dictionary schemas in existence today don’t bother modelling these constraints at all (and so they allow e.g. adding “plurals” to verbs, or “past tense” forms to nouns). So, whatever DMLex comes up with will either be a big step forward or completely ignored by implementors anyway.

@DavidFatDavidF DavidFatDavidF added wontfix This will not be worked on and removed wontfix This will not be worked on labels Jan 5, 2024
@DavidFatDavidF
Contributor Author

Decision taken on 5th Jan 2024:
The data model will be simplified to 1 private string (open for implementers to use their own preferred syntax within the string)

@michmech
Contributor

Implemented. We now have a single optional string-valued for property where previously we had multiple for... properties (forLanguage, forPartOfSpeech etc.).

This issue will close automatically when pull request #77 is merged.
