Breaking changes #43

goodmami · 2021-02-08T06:18:51Z

This issue is meant to collect the changes we would like to make to WN-LMF but have not because doing so would break backward compatibility. When we get to a 2.0 version we have a chance for some simplification and belt-tightening, so it would be a shame if we miss some and have to wait for the next major version.

For better discussion, these issues could be broken up into separate issues (maybe with an appropriate label or milestone to group them?).

Deferred Changes

These are changes we would have made in WN-LMF 1.1 if backwards compatibility were not an issue.

- Remove <SyntacticBehaviour> from <LexicalEntry>; it became a child of <Lexicon>
- Remove the senses attribute from <SyntacticBehaviour>; these associations are handled by the subcat attribute on <Sense> elements
- Make the id attribute on <SyntacticBehaviour> required
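
To make the three deferred changes concrete, here is a rough Python sketch of how an existing document might be migrated. It is only an illustration, not part of any tool: the function name, the `frame-N` id scheme, and the sample lexicon are hypothetical, the `subcategorizationFrame` attribute name is taken from my reading of the DTD, and the rule that a frame without `senses` applies to every sense of its entry is an assumption.

```python
import xml.etree.ElementTree as ET


def hoist_syntactic_behaviours(lexicon: ET.Element) -> None:
    """Move <SyntacticBehaviour> from <LexicalEntry> up to <Lexicon>,
    replacing its `senses` attribute with `subcat` on each <Sense>."""
    senses = {s.get("id"): s for s in lexicon.iter("Sense")}
    counter = 0
    for entry in lexicon.findall("LexicalEntry"):
        for sb in entry.findall("SyntacticBehaviour"):
            entry.remove(sb)
            if sb.get("id") is None:
                counter += 1
                sb.set("id", f"frame-{counter}")  # hypothetical id scheme
            # Drop the old attribute and remember which senses it named.
            sense_ids = sb.attrib.pop("senses", "").split()
            if not sense_ids:
                # Assumption: no `senses` means "all senses of this entry".
                sense_ids = [s.get("id") for s in entry.findall("Sense")]
            for sense_id in sense_ids:
                sense = senses[sense_id]
                subcat = sense.get("subcat", "").split()
                sense.set("subcat", " ".join(subcat + [sb.get("id")]))
            # Appended last, assuming frames follow the other <Lexicon> children.
            lexicon.append(sb)


# Hypothetical sample data, for illustration only.
doc = ET.fromstring("""
<Lexicon id="ex">
  <LexicalEntry id="ex-give-v">
    <Lemma writtenForm="give" partOfSpeech="v" />
    <Sense id="ex-give-1" synset="ex-1-v" />
    <Sense id="ex-give-2" synset="ex-2-v" />
    <SyntacticBehaviour subcategorizationFrame="Somebody ----s something"
                        senses="ex-give-1" />
  </LexicalEntry>
</Lexicon>
""")
hoist_syntactic_behaviours(doc)
```

The sketch does not merge identical frames that appear under several entries; a real migration would probably want to deduplicate them by `subcategorizationFrame`.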

Proposed Changes

These are new changes that we might consider.

- ~~Remove <Tag>~~ (edit: in the comments below, a case is made for other uses of <Tag>)

Original text:

The use case presented in Bond et al. 2020 ("Some Issues with Building a Multilingual Wordnet") seems more elegantly handled by the script attribute on <Lemma> and <Form>:
```
<Lemma writtenForm="头发" partOfSpeech="n" script="Hans" />
<Form writtenForm="頭髮" script="Hant" />
<Form writtenForm="tóufa" script="Latn-pinyin" />
<Form writtenForm="tou2fa5" script="Latn-pinyin-x-numeric" />
<Form writtenForm="toufa" script="Latn-pinyin-x-simple" />
```
Above, if script were limited to ISO 15924 script names, then all three pinyin variants would be just "Latn", so I used BCP-47-like tags minus the language and region subtags. The "pinyin" variant and the private-use subtags "numeric" and "simple" can be used to distinguish them.
- Remove <Count>? (see comments below)
- Remove <ILIDefinition>? (see comments below)
- Remove (apparently) unused attributes?
  - sourceSense on <Definition>
  - lexicalized on <Sense> and <Synset>
  - status on anything with metadata
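
For the script values used in the original <Tag> text above (e.g., "Latn-pinyin-x-numeric"), a consumer would need to split the BCP-47-like string into its parts. A minimal sketch of that, assuming the value is always a script code, then optional variants, then optional private-use subtags after "x"; the names `ScriptTag` and `parse_script_tag` are made up for illustration:

```python
from typing import NamedTuple, Tuple


class ScriptTag(NamedTuple):
    script: str                # ISO 15924 code, e.g. "Latn"
    variants: Tuple[str, ...]  # e.g. ("pinyin",)
    private: Tuple[str, ...]   # subtags after "-x-", e.g. ("numeric",)


def parse_script_tag(tag: str) -> ScriptTag:
    """Split a BCP-47-like script value (no language or region subtags)."""
    head, _, private = tag.partition("-x-")
    script, *variants = head.split("-")
    private_subtags = tuple(private.split("-")) if private else ()
    return ScriptTag(script, tuple(variants), private_subtags)


for value in ("Hans", "Latn-pinyin", "Latn-pinyin-x-numeric"):
    print(value, parse_script_tag(value))
```

With this, the three pinyin forms share script "Latn" and variant "pinyin" and differ only in the private-use part.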


fcbond · 2021-02-08T07:14:07Z

Hi, I was thinking of using Tag much more broadly, for example to show roots in Malay, irregular (broken) plurals in Arabic, voweled and vowelless variants in Hebrew and so forth. So I don't think it can be replaced by just script.


-- Francis Bond <http://www3.ntu.edu.sg/home/fcbond/> Division of Linguistics and Multilingual Studies Nanyang Technological University

lmorgadodacosta · 2021-02-08T07:27:10Z

I agree with Francis. I would very much like to keep <Tag> to store flexible annotations on Lemmas and Forms. These won't be meaningful for OMW (as it reads the LMF) but they can be displayed as a list of tag-values. Also, if projects keep using <Tag> as a flexible layer to store information, OMW can better understand which special "tags" could be embedded in the DTD as 'officially supported', with an agreed-upon format/meaning.

goodmami · 2021-02-08T07:52:31Z

@fcbond, @lmorgadodacosta thanks for the context. I haven't seen tags used at all aside from in the paper, so if there's a good and active use case (except the "script" one, for which I stand by my previous statement) then it makes sense to leave it in. For instance, I've been wondering how to distinguish various lemmas+forms in EWN, like stimulus/stimuli. Could be with <Tag>:

      <Lemma partOfSpeech="n" writtenForm="stimulus" />
      <Form writtenForm="stimuli">
        <Tag category="number">PL</Tag>
      </Form>
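
To show how a consumer might read such tags back, here is a small Python sketch using only the standard library. It is an illustration, not how Wn or OMW actually handle <Tag>; the wrapping <LexicalEntry>, its id, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# The snippet above, wrapped in a hypothetical <LexicalEntry>.
ENTRY = """
<LexicalEntry id="example-stimulus-n">
  <Lemma partOfSpeech="n" writtenForm="stimulus" />
  <Form writtenForm="stimuli">
    <Tag category="number">PL</Tag>
  </Form>
</LexicalEntry>
"""


def forms_with_tags(entry: ET.Element) -> dict:
    """Map each written form to its list of (category, value) tags."""
    result = {}
    for element in entry:
        if element.tag in ("Lemma", "Form"):
            result[element.get("writtenForm")] = [
                (tag.get("category"), tag.text)
                for tag in element.findall("Tag")
            ]
    return result


tags = forms_with_tags(ET.fromstring(ENTRY))
print(tags)
```

Because the tags come back as plain (category, value) pairs, an application can display them as a list without knowing what each category means.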

Relatedly, I've been wondering which elements of WN-LMF are meant for modeling a language's wordnet and which are for peripheral annotation tasks or processes. For instance, <Count> doesn't really model something true about a language, but something that can be computed for some corpora, so why is this part of WN-LMF? And <ILIDefinition> is only used when a wordnet is the vehicle by which new ILI candidates are proposed, otherwise those definitions are included with the ILI resource, so it seems like there could be another channel for proposing candidates (e.g., by creating issues at https://github.com/globalwordnet/cili/).

fcbond · 2021-02-08T09:08:38Z

Hi, I think frequency information is a part of knowledge of language. Any corpus count is only an imperfect sample, but I would rather make available what we have when we have it. For the ILI, I think we tried to strike a balance between pure modelling and general usefulness. We only want candidates that come with a wordnet, and packaging them together makes this easier to manage.


goodmami · 2021-02-09T03:00:14Z

> I think frequency information is a part of knowledge of language. Any corpus count is only an imperfect sample, but I would rather make available what we have when we have it.

Sorry, I think my "something true" comment wasn't accurate. I was trying to draw a line between "gold", human-added information and the automatically computed information. I think the line is even blurrier because those computed counts are, I think, from human annotations.

So you have this information and you'd like to make it available. That's great, but I still think it would be better as a separate resource, similar to how the information-content (IC) data files are distributed separately. It's also easier that way to track where the counts came from, e.g., in a file called ntumc-pwn-3.0-counts.tsv instead of having a dc:source="NTUMC" attribute on every <Count> element in the XML file.
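
As a sketch of what consuming such a separate counts file could look like, assuming a two-column, tab-separated layout with a header row (the column names and sense ids below are invented; no such file format has been agreed on):

```python
import csv
import io


def load_counts(lines) -> dict:
    """Read `sense<TAB>count` rows (after a header) into a dict."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # skip the header row
    counts = {}
    for sense_id, count in reader:
        # Sum repeated rows so several annotation batches can be concatenated.
        counts[sense_id] = counts.get(sense_id, 0) + int(count)
    return counts


# Hypothetical contents standing in for a file like ntumc-pwn-3.0-counts.tsv.
sample = io.StringIO(
    "sense\tcount\n"
    "example-sense-1\t12\n"
    "example-sense-2\t3\n"
    "example-sense-1\t4\n"
)
counts = load_counts(sample)
print(counts)
```

The provenance then lives in the file name (or a sidecar), rather than being repeated on every element.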

Also, practically, I have not seen any wordnets distributed with this information (I suspect you use it internally for annotation projects), and trying to model it properly in Wn complicates the database schema and code. I guess I'm arguing for a worse-is-better approach.

> We only want candidates that come with a wordnet, and packaging them together makes this easier to manage.

My position here is essentially the same as my last argument regarding schema/code complexity. It seems like the format has been refitted with a feature that's only relevant for CILI's development and not for modeling a wordnet. A proposed ILI with ili="in" must be special-cased: it's not the case that all synsets with ili="in" are interlingually aligned; an <ILIDefinition> on a synset whose ili attribute is not "in" should probably be ignored, as those definitions come from CILI; and so on. I think it would be better to propose new ILIs by declaring the synset they belong to, such as in a TSV file (examples from EWN 2020):

synset	definition
ewn-05698967-n	the barrier preventing Blacks from participating in various activities with whites
ewn-05822120-n	(plural) something that reminds you of someone or something
...

Furthermore, we cannot express in a DTD that <ILIDefinition> is required when the ili attribute has value "in" and is forbidden (?) otherwise. It's just not a good fit.
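
Since the DTD cannot state that constraint, it would have to be checked in code. A minimal sketch of such a check, treating "forbidden otherwise" as a reportable problem even though that is still an open question; the function name and the sample synset ids are hypothetical:

```python
import xml.etree.ElementTree as ET


def check_ili_definitions(lexicon: ET.Element) -> list:
    """Report synsets that break the rule: <ILIDefinition> iff ili="in"."""
    problems = []
    for synset in lexicon.iter("Synset"):
        synset_id = synset.get("id")
        has_definition = synset.find("ILIDefinition") is not None
        proposed = synset.get("ili") == "in"
        if proposed and not has_definition:
            problems.append(f"{synset_id}: ili is 'in' but <ILIDefinition> is missing")
        elif has_definition and not proposed:
            problems.append(f"{synset_id}: <ILIDefinition> present but ili is not 'in'")
    return problems


# Hypothetical sample data, for illustration only.
sample = ET.fromstring("""
<Lexicon id="ex">
  <Synset id="ex-1-n" ili="in">
    <ILIDefinition>a newly proposed concept</ILIDefinition>
  </Synset>
  <Synset id="ex-2-n" ili="in" />
  <Synset id="ex-3-n" ili="i12345">
    <ILIDefinition>a stray definition</ILIDefinition>
  </Synset>
</Lexicon>
""")
problems = check_ili_definitions(sample)
for problem in problems:
    print(problem)
```

Every consumer of the format needs some version of this special-casing, which is the complexity I'm arguing against.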

goodmami · 2021-02-09T03:32:41Z

Also note that I've updated the original issue text. I added some attributes as candidates for removal. I understand that they had some original purpose but I don't see evidence of their use, so it's worth discussing whether they can be removed. Generally, though, these attributes are relatively simple to model in the database and they can just not appear in the XML when unused, but they can still cause surprises (e.g., see here).

goodmami mentioned this issue Jul 1, 2021

DTD 1.1 #52

Open

goodmami mentioned this issue May 24, 2022

If you create an entry with an ILIDefinition, but ill.id='' you lose the definition goodmami/wn#166

Closed
