Add ability to prioritize/constrain tokenize results #60

Closed
chrisjrn opened this issue Jan 18, 2018 · 3 comments

@chrisjrn

Our application uses OpenKoreanTextProcessorJava.tokenize to segment Korean text for word-by-word translation into another language.

At the moment, OKT returns the tokenization that best matches words in its dictionary. Sometimes we have no translation for those words, even though a finer-grained tokenization would yield tokens that we can translate.

As an example, tokenize returns a single token for 평창올림픽, but (평창, 올림픽) is also a valid tokenization. We don't have a translation available for 평창올림픽.

In this case, we'd like to be able to rule out 평창올림픽 as a valid token. This could be done by:

  • Adding the ability to give a set of words a penalty in TokenizerProfile
  • Adding the ability to remove words from the dictionary
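
To make the two proposals concrete, here is a minimal, self-contained sketch of how a penalty or a dictionary removal could change the chosen segmentation. It is a toy cost-minimizing segmenter, not OKT's actual scoring; the class `ToySegmenter` and its methods `addWord`, `penalize`, and `removeWord` are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: this is NOT how OpenKoreanText scores candidates.
final class ToySegmenter {
    // Hypothetical dictionary: word -> penalty (0 means no penalty).
    private final Map<String, Integer> penalties = new HashMap<>();

    void addWord(String word) { penalties.put(word, 0); }

    void removeWord(String word) { penalties.remove(word); }

    void penalize(String word, int penalty) {
        if (penalties.containsKey(word)) penalties.put(word, penalty);
    }

    // Returns the segmentation with the lowest total cost, where each token
    // costs 1 plus its penalty. Returns an empty list if the text is empty
    // or cannot be covered by dictionary words.
    List<String> segment(String text) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i] = lowest cost to cover text[0, i)
        int[] prev = new int[n + 1]; // prev[i] = start index of the last token
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) continue;
                Integer penalty = penalties.get(text.substring(start, end));
                if (penalty == null) continue; // not a dictionary word
                int cost = best[start] + 1 + penalty;
                if (cost < best[end]) {
                    best[end] = cost;
                    prev[end] = start;
                }
            }
        }
        List<String> tokens = new ArrayList<>();
        if (best[n] == Integer.MAX_VALUE) return tokens;
        // Walk the back-pointers from the end of the text to the start.
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(0, text.substring(prev[end], end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToySegmenter segmenter = new ToySegmenter();
        segmenter.addWord("평창");
        segmenter.addWord("올림픽");
        segmenter.addWord("평창올림픽");
        System.out.println(segmenter.segment("평창올림픽")); // [평창올림픽]

        // Proposal 1: penalize the compound so the split becomes cheaper.
        segmenter.penalize("평창올림픽", 5);
        System.out.println(segmenter.segment("평창올림픽")); // [평창, 올림픽]

        // Proposal 2: remove the compound from the dictionary entirely.
        segmenter.removeWord("평창올림픽");
        System.out.println(segmenter.segment("평창올림픽")); // [평창, 올림픽]
    }
}
```

In this sketch a penalty only makes the compound less attractive, so it can still be chosen when no cheaper split exists, whereas removal rules it out unconditionally.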
@chrisjrn changed the title from "Add ability to prioritize/constrain tokenize results based" to "Add ability to prioritize/constrain tokenize results" on Jan 18, 2018
@hohyon-ryu
Member

Hi @chrisjrn, thank you for the suggestions. I think they are great ideas. I will add those features within a week.

@hohyon-ryu
Member

@chrisjrn I've merged the second request:

  • Adding the ability to remove words from the dictionary

For the first idea, it would be helpful if you could provide an example of the TokenizerProfile that you want to use. I think your second idea will cover most use cases, though.

@hohyon-ryu
Member

Published in 2.2.0.
