Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are add and both synonym ? #478

Closed
PonteIneptique opened this issue Mar 23, 2023 · 11 comments · Fixed by #488
Closed

Are add and both synonym ? #478

PonteIneptique opened this issue Mar 23, 2023 · 11 comments · Fixed by #488

Comments

@PonteIneptique
Copy link
Contributor

From #476

Just a quick note: I am still quite unsatisfied on the naming scheme of codec merging. Specifically, add and both are quite synonym to me (add adds the new characters, both uses both set of character [while it actually only uses the new dataset vocabulary]). While it might be a breaking change, I would very much recommend moving to add and replace or something like this.
To do something clean regarding both changing to something such as replace, we could allow both argument with a DeprecationWarning and map to replace

@dstoekl replied

I support this and would support also the possibility to train excluding new characters. Something that would be clear and not break previous behavior could be:
old_only
new_only (corresponds to today's both)
old_and_new (corresponds to today's add)

@PonteIneptique
Copy link
Contributor Author

PonteIneptique commented Mar 23, 2023

If we move to a new system with three options:

  • --resize new only use the new dataset. Other option: fit
  • --resize original only the original model vocabulary. Other option: old, keep, none
  • --resize both use both the original model vocabulary and the new dataset compiled one. Other option: merge (?), add (current one)

@PonteIneptique
Copy link
Contributor Author

The issue with the last comment, and specifically reusing both, is that we can't use a Deprecation phase, because both would need to mean two things. Hence merge or something else.

@PonteIneptique
Copy link
Contributor Author

If we have something that makes everyone happy, I'll be happy to propose a PR once @mittagessen approves the idea :)

@mittagessen
Copy link
Owner

No disagreement from my side. The semantics of the current values are not exactly self-explanatory. It's probably necessary to change the codec encoding log warnings to warnings module warnings so they can be filtered and won't flood the terminal in original/old mode.

@PonteIneptique
Copy link
Contributor Author

I created a small poll here: https://www.rcv123.org/ballot/njbu5X1xBMjZJykUYXJDJB

@PonteIneptique
Copy link
Contributor Author

PonteIneptique commented Mar 23, 2023

The poll only takes into account the current system, but depending on the results the third one (keep, original, old, none) can be decided when implemented.

@PonteIneptique
Copy link
Contributor Author

People who did not vote are invited to participate, because it's a tie :D

@mittagessen
Copy link
Owner

If there's a still a tie, I'll let my cat flip a coin to decide.

@PonteIneptique
Copy link
Contributor Author

@PonteIneptique
Copy link
Contributor Author

I'll draft a PR in the coming days

@PonteIneptique
Copy link
Contributor Author

I am late on the PR but it will come...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants