Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework release scripts for v1.4 #16

Merged
merged 30 commits into from Nov 5, 2021
Merged

Rework release scripts for v1.4 #16

merged 30 commits into from Nov 5, 2021

Conversation

goodmami
Copy link
Collaborator

@goodmami goodmami commented Oct 27, 2021

This changes this quite a bit for building the wordnets.

  1. There's an index.toml file which contains metadata (language, citation, etc.) for each wordnet, as well as pointers to the source (.tab) files.
  2. The script scripts/tsv2lmf.py is no longer callable directly.
  3. The scripts/ directory is now a Python package so relative imports work.
  4. Need to install dependencies (pip install -r --requirements.txt) and get required files.
  5. Call the conversion script with python -m scripts.build --version=XYZ [id...]
  6. Built wordnets go in build/omw-XYZ/... (not tracked)
  7. Compressed packages go in release/... (not tracked)
  8. Several files have been removed (scripts/ili-map.tab, scripts/WN-LMF.dtd, wns/omw-citations.tab) because they are unnecessary or can be retrieved during a build.
  9. Resulting wordnets follow WN-LMF-1.1.dtd

There are still some things to do with the GitHub Actions workflow:

  • Figure out how building wn30 and wn31 with scripts/wndb2lmf fits
  • download required files
  • build the wordnets with the new scripts
  • package the build for release

@goodmami goodmami requested a review from fcbond October 27, 2021 22:36
@goodmami
Copy link
Collaborator Author

goodmami commented Oct 29, 2021

The latest changes address 3 of 4 of the tasks listed above. There are some new ones, though:

@fcbond
Copy link
Contributor

fcbond commented Oct 30, 2021

I think the only remaining issues are the two above and the availability of wn 0.8.1

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 2, 2021

Wn 0.8.1 had a packaging issue so I released Wn 0.8.2, however it looks like I might not have caught some LMF export issues, so I'm investigating before releasing 0.8.3.

Regarding WordNet 3.0, my validation script found some other issues caused by issues in the source data (shown below, abbreviated):

  1. Nine redundant relations

    $ grep 'benignancy' index.noun 
    benignancy n 1 4 ! @ = + 1 0 04840981  
    $ grep '^04840981' data.noun 
    04840981 07 n 03 benignity 0 benignancy 0 ... + 01372773 a 0201 + 01372773 a 0201 ... + 01372773 a 0101 ...
    $ #                                           ^ here            ^ same                ^ different source sense
  2. One self-loop

    $ grep 'unicycle' index.noun 
    unicycle n 1 2 @ + 1 0 04509417  
    $ grep '^04509417' data.noun 
    04509417 06 n 02 unicycle 0 monocycle 0 ... + 04509417 n 0101 ...
    $ #                                         ^ points to self
  3. Many, many relations without the reverse relations, mostly pertainym, also is_entailed_by, is_caused_by, and also. This should probably be ignored, as that's how the data was designed.

I can fix (1) by simply dropping the redundant relations. (2) may require a manual fix, as we did for inhibit.

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 2, 2021

(2) may require a manual fix, as we did for inhibit.

Or maybe we just drop that one, too. That synset already has derivation links to unicyclist and unicycle (v).

@fcbond
Copy link
Contributor

fcbond commented Nov 2, 2021 via email

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 2, 2021

Ok sounds good. Also, WordNet 3.1 has the same problems, plus it also has about 40 synsets with the same ILI as another synset. E.g.:

Synset ID ILI
omw-en31-00767587-n i39442
omw-en31-00767761-n i39442
omw-en31-00780744-n i39502
omw-en31-00781071-n i39502
... ...

@fcbond
Copy link
Contributor

fcbond commented Nov 2, 2021 via email

@jmccrae
Copy link

jmccrae commented Nov 2, 2021

It seems that this is an error in the alignment script that was used to generate the WN 3.1 mapping.

It seems that some of the synsets that were new to PWN 3.1 have in some cases been assigned a key in that file that is incorrect. Instead there should be no assigned key. I will fix this in the CILI repo

@jmccrae
Copy link

jmccrae commented Nov 2, 2021

Actually, if you look at Turtle version of the mapping you can see what the solution is:

ili:i39442 skos:closeMatch pwn31:100767587-n . # crime, offense, criminal_offense, criminal_offence, offence, law-breaking
ili:i39442 owl:sameAs pwn31:100767761-n . # crime, offense, criminal_offense, criminal_offence, offence, law-breaking

In all cases, one is marked as a close match and one as an exact match

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 2, 2021

I think the solution is to remove the mappings in the .tab file for all skos:closeMatch links? I've opened globalwordnet/cili#13 to discuss.

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 3, 2021

Maybe check both for duplicate
relations (and discard) and self loops (and discard)?

Returning to the above issue, I just compared the NLTK's wordnet files and the one's we're using, and they are currently identical (after we fixed the inhibit loop). That is, the NLTK also has the unicycle loop and redundant edges:

>>> from nltk.corpus import wordnet as wn
>>> wn.lemmas('unicycle', pos='n')
[Lemma('unicycle.n.01.unicycle')]
>>> wn.lemmas('unicycle', pos='n')[0].derivationally_related_forms()
[Lemma('unicyclist.n.01.unicyclist'), Lemma('unicycle.n.01.unicycle'), Lemma('unicycle.v.01.unicycle')]
>>> wn.lemmas('benignancy')[0].derivationally_related_forms()
[Lemma('benign.s.03.benign'), Lemma('benign.s.03.benign')]

Considering this, I think we should refrain from fixing further things, even if it's tempting. That's what PWN 3.1 and OEWN are for. We could document the known issues, though. An exception could be made if the data, due to those errors, is no longer loadable in Wn.

@fcbond
Copy link
Contributor

fcbond commented Nov 3, 2021 via email

It points to the omw-data project page, which is not the URL for the
resource, so it can't be used to automatically download the
dependency.
Currently only README and LICENSE without extensions since we don't
have README.md, LICENSE.txt, etc. in the data. It is trivial to enable
these, though. Also, citation.bib is copied if present.
This way all XML file names are the same as their directories. The
predictability makes other things easier.
Build script now requires a version argument.

The release workflow should work either when the release is created or
on a manual dispatch.
@goodmami
Copy link
Collaborator Author

goodmami commented Nov 4, 2021

@fcbond I did a bit more work on this; see the commit messages above. I think most things are done now, but the new workflow is untested. I added a "workflow_dispatch" event, so you can click a "Run workflow" button on GitHub to start the action. This might be helpful if there's a failure or we otherwise need to rerun the build. It still starts automatically when a release is created.

Anything else to do?

The omw-en and omw-en31 lexicons are now generated from WordNet
sources directly, so the stored xml.xz files are no longer
needed. This commit also repurposes, and renames, those files'
directories for storing additional files related to the wordnets, such
as the LICENSE, README.md, and citation.bib files. This commit also
updates and expands the READMEs a bit.
@goodmami
Copy link
Collaborator Author

goodmami commented Nov 4, 2021

And a bit more work. I pushed the 0.8.3 version of Wn so that dependency is bumped here. I also removed the old pwn files and updated their READMEs, which are now copied into the built omw-en* directories.

I realize that we're not building the index.toml file for Wn anymore. But otherwise I think this is done.

@goodmami goodmami marked this pull request as ready for review November 4, 2021 05:50
@fcbond
Copy link
Contributor

fcbond commented Nov 4, 2021 via email

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 4, 2021

Regarding not bundling omw-en31, that's a good idea. Maybe this is as easy as adding --exclude="omw-$VERSION/omw-en31/" to the tar command in .github/workflows/release.yml?

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 4, 2021

Regarding index.toml, adding it to build.sh make senses. I'll think about it.

@fcbond
Copy link
Contributor

fcbond commented Nov 4, 2021 via email

The workflow just calls the scripts. This makes it easier to test
things out locally.
@goodmami
Copy link
Collaborator Author

goodmami commented Nov 4, 2021

Maybe I forgot to push

Yeah, I didn't see it. No worries, I pushed a commit which does this. It also moves most shell commands into local scripts so you can test things locally. Just be careful with the --publish option on package.sh, as it actually uploads things, but in practice I think it would fail unless you had the GITHUB_TOKEN environment variable appropriately set.

@goodmami
Copy link
Collaborator Author

goodmami commented Nov 4, 2021

Oh, and the index.toml file is now being built in the package.sh script.

@goodmami goodmami merged commit 3393edf into main Nov 5, 2021
@goodmami goodmami deleted the omw-1.4 branch November 5, 2021 03:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants