Contracted tokens in lexers #51
Labels:
- enhancement: new feature or request
- model-breaking: merging this PR or fixing this issue requires retraining the current models
- perf-change: merging this PR or fixing this issue might change the performance of new models
So far we pass UD syntactic words to our lexers: if a tree contains e.g. the contraction “du”, the lexers see “de” and “le”. However, pretrained embeddings (contextual or not) were trained on surface tokens, so their representations of these syntactic words might be somewhat off.
We can hope that this does not have too much impact, especially since we fine-tune everything, but there might be a better way?
Possible strategies:
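One possibility (a sketch under assumptions, not necessarily the right fix): feed the embedder the surface tokens recovered from the UD multi-word token ranges, and keep a mapping back to the syntactic words so the lexer outputs can still be aligned with the tree. Below is a minimal illustration assuming the `conllu` PyPI package; the helper `surface_tokens` is hypothetical and not part of this codebase.

```python
# Minimal sketch, assuming the `conllu` PyPI package.
# `surface_tokens` is a hypothetical helper, not part of this codebase.
import conllu

# "Je parle du chat": "du" is a multi-word token contracting "de" + "le".
ROWS = [
    "1 Je je PRON _ _ 2 nsubj _ _",
    "2 parle parler VERB _ _ 0 root _ _",
    "3-4 du _ _ _ _ _ _ _ _",
    "3 de de ADP _ _ 5 case _ _",
    "4 le le DET _ _ 5 det _ _",
    "5 chat chat NOUN _ _ 2 obl _ _",
]
SAMPLE = "\n".join("\t".join(r.split()) for r in ROWS) + "\n\n"


def surface_tokens(sentence):
    """Yield (surface_form, syntactic_word_ids) pairs.

    Multi-word tokens such as "du" stay a single surface form that
    covers the ids of the syntactic words it contracts ("de", "le").
    """
    covered_until = 0
    for tok in sentence:
        tid = tok["id"]
        if isinstance(tid, tuple) and tid[1] == "-":
            # Multi-word token line, e.g. id == (3, "-", 4).
            first, _, last = tid
            covered_until = last
            yield tok["form"], list(range(first, last + 1))
        elif isinstance(tid, int) and tid > covered_until:
            # Plain word that is not covered by a multi-word token.
            yield tok["form"], [tid]


sentence = conllu.parse(SAMPLE)[0]
for form, word_ids in surface_tokens(sentence):
    print(form, word_ids)
# Je [1]
# parle [2]
# du [3, 4]
# chat [5]
```

With this mapping, the embedder would see the surface sequence it was pretrained on (“Je parle du chat”), and the representation of “du” could then be redistributed to “de” and “le” (e.g. copied or averaged) before being handed to the parser; whether that beats the current word-level lookup is an open question.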