Breaking changes #43

Open
goodmami opened this issue Feb 8, 2021 · 6 comments

Comments

@goodmami
Member

goodmami commented Feb 8, 2021

This issue is meant to collect the changes we would like to make to WN-LMF but have not made because they would break backward compatibility. When we get to version 2.0 we will have a chance to simplify and tighten the format, so it would be a shame to miss some of these changes and have to wait for the next major version.

For better discussion, these items could be split into separate issues (maybe with an appropriate label or milestone to group them?).

Deferred Changes

These are changes we would have made in WN-LMF 1.1 if backwards compatibility were not an issue.

  • Remove <SyntacticBehaviour> from <LexicalEntry>; it became a child of <Lexicon>
  • Remove the senses attribute from <SyntacticBehaviour>; these associations are handled by the subcat attribute on <Sense> elements
  • Make the id attribute on <SyntacticBehaviour> required
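
A minimal sketch of how the deferred layout could be read, assuming the arrangement described in the list above: <SyntacticBehaviour> elements sit directly under <Lexicon> with a required id, and each <Sense> refers to them through a space-separated subcat attribute. The sample document, its IDs, its frame strings, and the helper name frames_by_sense are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; IDs and frames are illustrative, not from a real wordnet.
SAMPLE = """
<Lexicon id="ex" label="Example" language="en">
  <LexicalEntry id="ex-sleep-v">
    <Lemma writtenForm="sleep" partOfSpeech="v" />
    <Sense id="ex-sleep-v-1" synset="ex-1-v" subcat="intrans" />
  </LexicalEntry>
  <LexicalEntry id="ex-eat-v">
    <Lemma writtenForm="eat" partOfSpeech="v" />
    <Sense id="ex-eat-v-1" synset="ex-2-v" subcat="intrans trans" />
  </LexicalEntry>
  <SyntacticBehaviour id="intrans" subcategorizationFrame="Somebody ----s" />
  <SyntacticBehaviour id="trans" subcategorizationFrame="Somebody ----s something" />
</Lexicon>
"""


def frames_by_sense(lexicon):
    """Map each sense ID to the frames named by its subcat attribute."""
    # Only direct children of <Lexicon> are considered, matching the deferred change.
    frames = {
        sb.get("id"): sb.get("subcategorizationFrame")
        for sb in lexicon.findall("SyntacticBehaviour")
    }
    result = {}
    for sense in lexicon.iter("Sense"):
        ids = (sense.get("subcat") or "").split()
        result[sense.get("id")] = [frames[i] for i in ids]
    return result


SENSE_FRAMES = frames_by_sense(ET.fromstring(SAMPLE))
print(SENSE_FRAMES)
```

With the frames defined once per lexicon, the senses attribute on <SyntacticBehaviour> has nothing left to do, which is the motivation for removing it.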

Proposed Changes

These are new changes that we might consider:

  • Remove <Tag> (edit: in the comments below, a case is made for other uses of <Tag>)

    The use case presented in Bond et al. 2020 ("Some Issues with Building a Multilingual Wordnet") seems more elegantly handled by the script attribute on <Lemma> and <Form>:

    <Lemma writtenForm="头发" partOfSpeech="n" script="Hans" />
    <Form writtenForm="頭髮" script="Hant" />
    <Form writtenForm="tóufa" script="Latn-pinyin" />
    <Form writtenForm="tou2fa5" script="Latn-pinyin-x-numeric" />
    <Form writtenForm="toufa" script="Latn-pinyin-x-simple" />

    Above, if script were limited to ISO 15924 script codes, then all three pinyin variants would be just "Latn", so I used BCP-47-like tags minus the language and region subtags. The "pinyin" variant and the private-use subtags "numeric" and "simple" can be used to distinguish them.

  • Remove <Count>? (see comments below)

  • Remove <ILIDefinition>? (see comments below)

  • Remove (apparently) unused attributes?

    • sourceSense on <Definition>
    • lexicalized on <Sense> and <Synset>
    • status on anything with metadata
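
Returning to the script values in the <Tag> item above, here is a small sketch of how a consumer might take such a value apart. It assumes the shape used in those examples: a script code first, then optional variant subtags, then optional private-use subtags after an "x" singleton. The function name split_script_tag is hypothetical, and this is not a full BCP 47 parser.

```python
def split_script_tag(tag):
    """Split a BCP-47-like script value into (script, variants, private_use)."""
    parts = tag.split("-")
    script = parts[0]
    rest = parts[1:]
    if "x" in rest:
        # Everything after the "x" singleton is treated as private use.
        x_index = rest.index("x")
        variants = rest[:x_index]
        private_use = rest[x_index + 1:]
    else:
        variants, private_use = rest, []
    return script, variants, private_use


for value in ["Hans", "Hant", "Latn-pinyin", "Latn-pinyin-x-numeric", "Latn-pinyin-x-simple"]:
    print(value, split_script_tag(value))
```

A tool that only cares about the writing system can compare the first element, while one that needs to tell the pinyin spellings apart can look at the remaining two.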
@fcbond
Member

fcbond commented Feb 8, 2021 via email

@lmorgadodacosta

lmorgadodacosta commented Feb 8, 2021 via email

@goodmami
Member Author

goodmami commented Feb 8, 2021

@fcbond, @lmorgadodacosta thanks for the context. I haven't seen tags used at all aside from in the paper, so if there's a good and active use case (except the "script" one, for which I stand by my previous statement) then it makes sense to leave it in. For instance, I've been wondering how to distinguish various lemmas+forms in EWN, like stimulus/stimuli. Could be with <Tag>:

      <Lemma partOfSpeech="n" writtenForm="stimulus" />
      <Form writtenForm="stimuli">
        <Tag category="number">PL</Tag>
      </Form>
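
As a rough sketch of how such tags could be consumed, the snippet below selects the forms of an entry that carry a given tag. The wrapping <LexicalEntry>, its id, and the helper name forms_with_tag are assumptions added for illustration; the <Lemma>, <Form>, and <Tag> lines are the ones shown above.

```python
import xml.etree.ElementTree as ET

# The LexicalEntry wrapper and its id are hypothetical.
ENTRY = """
<LexicalEntry id="example-stimulus-n">
  <Lemma partOfSpeech="n" writtenForm="stimulus" />
  <Form writtenForm="stimuli">
    <Tag category="number">PL</Tag>
  </Form>
</LexicalEntry>
"""


def forms_with_tag(entry, category, value):
    """Return the written forms that have a <Tag> with this category and text."""
    return [
        form.get("writtenForm")
        for form in entry.findall("Form")
        if any(
            tag.get("category") == category and (tag.text or "").strip() == value
            for tag in form.findall("Tag")
        )
    ]


PLURALS = forms_with_tag(ET.fromstring(ENTRY), "number", "PL")
print(PLURALS)
```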

Relatedly, I've been wondering which elements of WN-LMF are meant for modeling a language's wordnet and which are for peripheral annotation tasks or processes. For instance, <Count> doesn't really model something true about a language, but something that can be computed for some corpora, so why is this part of WN-LMF? And <ILIDefinition> is only used when a wordnet is the vehicle by which new ILI candidates are proposed, otherwise those definitions are included with the ILI resource, so it seems like there could be another channel for proposing candidates (e.g., by creating issues at https://github.com/globalwordnet/cili/).

@fcbond
Member

fcbond commented Feb 8, 2021 via email

@goodmami
Member Author

goodmami commented Feb 9, 2021

> I think frequency information is a part of knowledge of language. Any corpus count is only an imperfect sample, but I would rather make available what we have when we have it.

Sorry, I think my "something true" comment wasn't accurate. I was trying to draw a line between "gold", human-added information and the automatically computed information. I think the line is even blurrier because those computed counts are, I think, from human annotations.

So you have this information and you'd like to make it available. That's great, but I still think it would be better as a separate resource, similar to how the information-content (IC) data files are distributed separately. It's also easier that way to track where the counts came from, e.g., in a file called ntumc-pwn-3.0-counts.tsv instead of having a dc:source="NTUMC" attribute on every <Count> element in the XML file.

Also, practically, I have not seen any wordnets distributed with this information (I suspect you use it internally for annotation projects), and trying to model it properly in Wn complicates the database schema and code. I guess I'm arguing for a worse-is-better approach.

> We only want candidates that come with a wordnet, and packaging them together makes this easier to manage.

My position here is essentially the same as my last argument regarding schema/code complexity. It seems like the format has been retrofitted with a feature that's only relevant for CILI's development and not for modeling a wordnet. A proposed ILI with ili="in" must be special-cased in two ways: it is not the case that all synsets with ili="in" are interlingually aligned, and an <ILIDefinition> on a synset whose ili attribute is not "in" should probably be ignored, since those definitions come from CILI. I think it would be better to propose new ILIs by declaring the synset they belong to, such as in a TSV file (examples from EWN 2020):

synset	definition
ewn-05698967-n	the barrier preventing Blacks from participating in various activities with whites
ewn-05822120-n	(plural) something that reminds you of someone or something
...
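
A file like this is cheap to consume. The sketch below reads the two example rows above with Python's csv module; the two-column, tab-separated layout with a header row is taken from that listing, while the function name load_proposals and the choice of a dict as the result are assumptions.

```python
import csv
import io

# The two rows are the EWN 2020 examples quoted above.
PROPOSALS_TSV = (
    "synset\tdefinition\n"
    "ewn-05698967-n\tthe barrier preventing Blacks from participating"
    " in various activities with whites\n"
    "ewn-05822120-n\t(plural) something that reminds you of someone or something\n"
)


def load_proposals(handle):
    """Map each proposing synset ID to its proposed ILI definition."""
    # QUOTE_NONE keeps any quotation marks in a definition as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["synset"]: row["definition"] for row in reader}


PROPOSALS = load_proposals(io.StringIO(PROPOSALS_TSV))
for synset_id, definition in PROPOSALS.items():
    print(synset_id, "->", definition)
```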

Furthermore, we cannot express in a DTD that <ILIDefinition> is required when the ili attribute has value "in" and is forbidden (?) otherwise. It's just not a good fit.
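
Because the DTD cannot state this condition, any tool that wants to enforce it needs a separate pass. The following is a sketch of such a check, not an existing validator: it assumes the strict reading in which <ILIDefinition> is required when ili="in" and disallowed otherwise. The sample lexicon, the ILI identifier "i12345", and the function name check_ili_definitions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon; "i12345" stands in for any existing ILI identifier.
SAMPLE = """
<Lexicon id="ex">
  <Synset id="ex-1-n" ili="in">
    <Definition>a proposed concept</Definition>
    <ILIDefinition>a proposed concept, defined for CILI</ILIDefinition>
  </Synset>
  <Synset id="ex-2-n" ili="in">
    <Definition>a proposal lacking its ILI definition</Definition>
  </Synset>
  <Synset id="ex-3-n" ili="i12345">
    <Definition>an aligned concept</Definition>
    <ILIDefinition>not expected under the strict reading</ILIDefinition>
  </Synset>
</Lexicon>
"""


def check_ili_definitions(lexicon):
    """Return (synset_id, problem) pairs for the rule a DTD cannot express."""
    problems = []
    for synset in lexicon.iter("Synset"):
        proposed = synset.get("ili") == "in"
        has_definition = synset.find("ILIDefinition") is not None
        if proposed and not has_definition:
            problems.append((synset.get("id"), "missing ILIDefinition"))
        elif has_definition and not proposed:
            # Assumption: treated as an error here; a tool could also just ignore it.
            problems.append((synset.get("id"), "unexpected ILIDefinition"))
    return problems


PROBLEMS = check_ili_definitions(ET.fromstring(SAMPLE))
print(PROBLEMS)
```

Every consumer of the format would have to carry a check like this, which is part of the complexity cost argued above.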

@goodmami
Member Author

goodmami commented Feb 9, 2021

Also note that I've updated the original issue text. I added some attributes as candidates for removal. I understand that they had some original purpose but I don't see evidence of their use, so it's worth discussing whether they can be removed. Generally, though, these attributes are relatively simple to model in the database and they can just not appear in the XML when unused, but they can still cause surprises (e.g., see here).
