Contracted tokens in lexers #51

Open
LoicGrobol opened this issue Jul 24, 2021 · 0 comments
Labels
  • enhancement: New feature or request
  • model-breaking: Merging this PR or fixing this issue requires retraining the current models
  • perf-change: Merging this PR or fixing this issue might change the performances of new models

Comments

@LoicGrobol
Collaborator

So far we pass UD syntactic words to our lexers: if a sentence contains a contracted token such as “du”, the lexers see “de” and “le” instead. Pretrained embeddings (contextual or not), however, were trained on surface tokens, so the representations they produce for these syntactic words might be somewhat off.
We can hope that the impact is small, especially since we fine-tune everything, but there might be a better way?
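
To make the mismatch concrete, here is a minimal sketch, assuming CoNLL-U input, of how the syntactic words and the surface tokens of a sentence could be read side by side. In CoNLL-U, a contracted token sits on a range-ID line (e.g. `3-4` for “du”) that is followed by the lines of its syntactic words. The function name `read_words_and_surface` and the alignment list it returns are hypothetical and are not part of the current code base.

```python
def read_words_and_surface(conllu_lines):
    """Return (syntactic_words, surface_tokens, word_to_surface) for one sentence.

    word_to_surface[k] is the index of the surface token that the k-th
    syntactic word comes from.
    """
    words, surface, word_to_surface = [], [], []
    span_end = 0  # ID of the last word covered by the latest multiword token
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        word_id, form = cols[0], cols[1]
        if "." in word_id:
            continue  # empty nodes (e.g. "5.1") are ignored in this sketch
        if "-" in word_id:
            # Multiword token line: its FORM is the surface (contracted) token
            span_end = int(word_id.split("-")[1])
            surface.append(form)
            continue
        if int(word_id) > span_end:
            # Ordinary word: it is its own surface token
            surface.append(form)
        words.append(form)
        word_to_surface.append(len(surface) - 1)
    return words, surface, word_to_surface


# Hypothetical example sentence, with unused columns left as "_"
sentence = [
    "1\tJe" + "\t_" * 8,
    "2\treviens" + "\t_" * 8,
    "3-4\tdu" + "\t_" * 8,
    "3\tde" + "\t_" * 8,
    "4\tle" + "\t_" * 8,
    "5\tmarché" + "\t_" * 8,
]
words, surface, word_to_surface = read_words_and_surface(sentence)
print(words)            # ['Je', 'reviens', 'de', 'le', 'marché']
print(surface)          # ['Je', 'reviens', 'du', 'marché']
print(word_to_surface)  # [0, 1, 2, 2, 3]
```

The lexers currently receive the first list, while pretrained embeddings have mostly seen sequences like the second one.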

Possible strategies:

  • Stay as-is: the lexer sees and encodes the syntactic words.
  • Duplicate: the lexer sees the surface (contracted) token, and all the syntactic words it expands to are represented by the same vector, the one computed for that token.
  • Duplicate with contraction flag: like duplicate, but with an added flag signalling that the syntactic words come from a single contracted token, for instance by summing their vectors with a trainable vector, or with a different trainable vector for every position (or for every form? but that would be really impractical).
  • ???
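
The “duplicate” and “duplicate with contraction flag” strategies above could be sketched as follows. This is only an illustration under stated assumptions: vectors are plain Python lists rather than tensors, the flag vectors are fixed numbers standing in for trainable parameters, and the names `duplicate`, `duplicate_with_flag` and `word_to_surface` (the index of the surface token each syntactic word comes from) are hypothetical.

```python
from collections import Counter


def duplicate(surface_vectors, word_to_surface):
    """'Duplicate': every syntactic word gets a copy of its surface token's vector."""
    return [list(surface_vectors[i]) for i in word_to_surface]


def duplicate_with_flag(surface_vectors, word_to_surface, flag_vectors):
    """'Duplicate with contraction flag'.

    flag_vectors[k] is added to the k-th syntactic word of a contracted token.
    Passing a single flag vector gives the "one shared vector" variant.
    Words that are not part of a contraction are left unchanged.
    """
    counts = Counter(word_to_surface)  # number of words per surface token
    out, previous, position = [], None, 0
    for i in word_to_surface:
        # Position of this word inside its surface token (0 for the first one)
        position = position + 1 if i == previous else 0
        previous = i
        vec = list(surface_vectors[i])
        if counts[i] > 1:  # the surface token is a contraction
            flag = flag_vectors[min(position, len(flag_vectors) - 1)]
            vec = [v + f for v, f in zip(vec, flag)]
        out.append(vec)
    return out


# Toy 2-dimensional vectors for the surface tokens "Je reviens du marché"
surface_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
word_to_surface = [0, 1, 2, 2, 3]  # "de" and "le" both come from "du"
flags = [[0.5, 0.0], [0.0, 0.5]]   # stand-ins for trainable per-position vectors

print(duplicate(surface_vectors, word_to_surface))
# [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 2.0], [3.0, 3.0]]
print(duplicate_with_flag(surface_vectors, word_to_surface, flags))
# [[1.0, 0.0], [0.0, 1.0], [2.5, 2.0], [2.0, 2.5], [3.0, 3.0]]
```

With plain duplication, “de” and “le” are indistinguishable to the downstream layers; the per-position flag is what lets them differ.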
@LoicGrobol added the enhancement, perf-change and model-breaking labels Jul 24, 2021
@LoicGrobol self-assigned this Jul 24, 2021