Evaluation script that unpacks lextag into remaining STREUSLE columns #41

Closed
nschneid opened this issue Jun 20, 2019 · 2 comments · Fixed by #46

@nschneid
Contributor

Re: #40, we need a script that takes lextags (full tags, one per token) output by a system and parses them to extract MWE groupings.

Lextags are the 19th and final column in the .conllulex format. Columns 1-10 are UD. Columns 11-18 can be filled in based on UD+lextags.

@nschneid nschneid self-assigned this Jun 20, 2019
@nschneid
Contributor Author

Input: .conllulex format, except that columns 11-18 are left completely blank (empty strings, not underscores).
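To make the expected input concrete, here is a minimal sketch of reading one such row. It assumes rows are tab-separated, as in CoNLL-U; `split_row` is a hypothetical helper name and not part of the existing scripts.

```python
from typing import List, Tuple


def split_row(line: str) -> Tuple[List[str], str]:
    """Split one token row into (UD columns 1-10, lextag from column 19).

    Assumes tab-separated fields; raises if columns 11-18 are not empty.
    """
    # Strip only the newline: stripping all whitespace could eat blank fields.
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 19:
        raise ValueError(f"expected 19 columns, got {len(cols)}")
    # cols[10:18] are columns 11-18; any non-empty string (even "_") is rejected.
    if any(cols[10:18]):
        raise ValueError("columns 11-18 must be completely blank")
    return cols[:10], cols[18]
```

A row whose columns 11-18 contain underscores is rejected here on purpose, so that gold files are not mistaken for system output.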

I think the easiest way to implement this will be to adapt streuseval.py so that instead of VERIFYING that lextags are consistent with columns 11-18, it parses lextags and then populates columns 11-18 in JSON.

Specifically, it needs to:

  • parse each lextag into mwetag + lexcat + supersenses
  • parse mwetag sequences into links
  • form strong and weak groups (token sets) out of links
  • number the groups (first strong, then weak) and the tokens within the groups
  • look up lemmas for the tokens in each group
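The first three steps could be sketched roughly as follows. This is not the code on the branch. It assumes lextags look like `mwetag-lexcat-ss|ss2` (for example `B-V.VPC.full-v.motion`, `I_`, `O-P-p.Circumstance|p.Locus`), that MWE tags are drawn from `O B I_ I~` plus lowercase variants inside gaps, and that groups are ordered by their first token. The names `parse_lextag`, `mwe_links`, and `form_groups` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple

MWE_TAGS = {"O", "B", "I_", "I~", "o", "b", "i_", "i~"}  # assumed tag inventory


class Lex(NamedTuple):
    mwetag: str
    lexcat: Optional[str]
    ss: Optional[str]
    ss2: Optional[str]


def parse_lextag(lextag: str) -> Lex:
    """Split a full tag into MWE tag, lexcat, and up to two supersenses."""
    mwetag, _, rest = lextag.partition("-")
    lexcat, _, senses = rest.partition("-")
    ss, _, ss2 = senses.partition("|")
    return Lex(mwetag, lexcat or None, ss or None, ss2 or None)


Link = Tuple[int, int, str]  # (earlier token, later token, "_" strong / "~" weak)


def mwe_links(mwetags: List[str]) -> List[Link]:
    """Link every I/i tag to the previous token at the same (gap) level."""
    links: List[Link] = []
    last = {True: None, False: None}  # latest offset outside / inside a gap
    for offset, tag in enumerate(mwetags, start=1):
        if tag not in MWE_TAGS:
            raise ValueError(f"unknown MWE tag {tag!r}")
        upper = tag[0].isupper()
        if tag[0] in "Ii":
            prev = last[upper]
            if prev is None or mwetags[prev - 1][0] in "Oo":
                raise ValueError(f"token {offset}: {tag!r} has nothing to attach to")
            links.append((prev, offset, tag[1]))
        last[upper] = offset
        if upper:
            last[False] = None  # an uppercase tag means any gap has closed
    return links


def _chains(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Merge (earlier, later) pairs into token lists, sorted by first token."""
    by_token = {}
    for a, b in pairs:
        group = by_token.setdefault(a, [a])
        group.append(b)
        by_token[b] = group
    return sorted({id(g): g for g in by_token.values()}.values())


def form_groups(links: List[Link]) -> Tuple[List[List[int]], List[List[int]]]:
    """Strong groups follow only '_' links; weak groups follow all links
    but are kept only if they contain at least one '~' link."""
    strong = _chains((a, b) for a, b, s in links if s == "_")
    weak_ends = {b for _, b, s in links if s == "~"}
    weak = [g for g in _chains((a, b) for a, b, _ in links) if weak_ends.intersection(g)]
    return strong, weak
```

For the tag sequence `B I_ I~ I_`, this yields strong groups `[1, 2]` and `[3, 4]` and one weak group `[1, 2, 3, 4]`. Token offsets are 1-based to match the ID column.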

If we want the output as .conllulex, converting JSON to .conllulex could be a separate script.
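Whichever output format is used, the numbering and lemma-lookup steps from the list above might look like the sketch below. It reads "first strong, then weak" as a single shared counter and assumes a group's lemma is the space-joined lemmas of its tokens. The function name and the `"smwe"`/`"wmwe"` keys are hypothetical labels for illustration, not the script's actual JSON schema.

```python
from typing import Dict, List, Tuple


def number_groups(
    strong: List[List[int]],
    weak: List[List[int]],
    lemmas: List[str],
) -> Tuple[Dict[int, Dict[str, str]], Dict[int, str]]:
    """Assign "group:position" labels per token and a joined lemma per group.

    strong/weak hold 1-based token offsets; lemmas[i] belongs to token i + 1.
    """
    labels: Dict[int, Dict[str, str]] = {}
    group_lemmas: Dict[int, str] = {}
    # One shared counter: strong groups are numbered first, then weak ones.
    numbered = [("smwe", g) for g in strong] + [("wmwe", g) for g in weak]
    for group_no, (kind, group) in enumerate(numbered, start=1):
        group_lemmas[group_no] = " ".join(lemmas[t - 1] for t in group)
        for position, token in enumerate(group, start=1):
            labels.setdefault(token, {})[kind] = f"{group_no}:{position}"
    return labels, group_lemmas
```

Tokens that belong to no group simply do not appear in `labels`, so a writer can fall back to whatever placeholder the target format uses.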

nschneid added a commit that referenced this issue Jun 21, 2019
…rse_mwe_links(), which can be imported separately for lextag unpacking (#41)
nschneid added a commit that referenced this issue Jun 21, 2019
…ession annotations from sequence of lextags (#41)
@nschneid
Contributor Author

@danielhers I believe I have this working on the lextag-unpack branch. When reconstructing from the gold lextags I can't 100% match the original data file due to an arbitrary numbering issue (#42), but the streuseval score of the original vs. reconstructed is 100%, so there should not be any errors in the reconstruction. Hopefully this means the script is bug-free.
