Contracted tokens in lexers #51

LoicGrobol · 2021-07-24T15:47:47Z

So far we pass UD syntactic words to our lexers: if a tree contains e.g. “du”, the lexers see “de” and “le”. But pretrained embeddings (contextual or not) were trained on surface tokens, so the representations they produce for these syntactic words might be somewhat off.
We can hope that the impact is small, especially since we fine-tune everything, but there might be a better way?

Possible strategies:

Stay as-is: the lexer sees and encodes the syntactic words.
Duplicate: the lexer sees the surface (contracted) token, and each of its syntactic words is represented by that token's vector.
Duplicate with contraction flag: like duplicate, but with a flag signalling that the syntactic words come from a single contracted token, for instance by summing their representations with a trainable vector, or with a different trainable vector for each position (or for each form? but that would be really impractical).
???
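
To make the “duplicate with contraction flag” option concrete, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the input format (one `(surface_form, syntactic_words)` pair per surface token), the names `align_duplicate` and `encode`, and the toy vectors. It is not the project's lexer API. In a real lexer the surface vectors would come from the pretrained embeddings and the flag vectors would be trainable parameters.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical input format: one (surface_form, syntactic_words) pair per
# surface token, mirroring how UD groups syntactic words under a contraction.
Sentence = Sequence[Tuple[str, Sequence[str]]]


def align_duplicate(sentence: Sentence) -> List[Tuple[str, str, int]]:
    """Return one (syntactic word, surface form, position flag) per syntactic word.

    The flag is 0 for a word that is its own surface token and k (1-based)
    for the k-th syntactic word of a contracted token.
    """
    aligned = []
    for surface, words in sentence:
        if len(words) == 1:
            aligned.append((words[0], surface, 0))
        else:
            for k, word in enumerate(words, start=1):
                aligned.append((word, surface, k))
    return aligned


def encode(
    aligned: Sequence[Tuple[str, str, int]],
    surface_vectors: Dict[str, List[float]],
    flag_vectors: Dict[int, List[float]],
) -> List[List[float]]:
    """Sum the surface token's vector with the vector of the position flag."""
    return [
        [s + f for s, f in zip(surface_vectors[surface], flag_vectors[flag])]
        for _, surface, flag in aligned
    ]


sentence = [
    ("Il", ["Il"]),
    ("vient", ["vient"]),
    ("du", ["de", "le"]),
    ("marché", ["marché"]),
]
aligned = align_duplicate(sentence)

# Toy stand-ins: real surface vectors would come from the pretrained model,
# and real flag vectors would be learned during training.
surface_vectors = {
    "Il": [1.0, 0.0],
    "vient": [0.0, 2.0],
    "du": [0.0, 1.0],
    "marché": [3.0, 0.0],
}
flag_vectors = {0: [0.0, 0.0], 1: [0.5, 0.0], 2: [0.0, 0.5]}
encoded = encode(aligned, surface_vectors, flag_vectors)
```

Here “de” and “le” both start from the vector of “du” and differ only by their position flag. Dropping the flag vectors gives the plain “duplicate” option, and using a single shared vector for all non-zero flags gives the position-independent variant.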

LoicGrobol added the enhancement (New feature or request), perf-change (Merging this PR or fixing this issue might change the performances of new models), and model-breaking (Merging this PR or fixing this issue requires retraining the current models) labels Jul 24, 2021

LoicGrobol self-assigned this Jul 24, 2021
