Our application uses OpenKoreanTextProcessorJava.tokenize to segment Korean text for word-by-word translation into another language.
At the moment, OKT returns the tokenization that best matches words in its dictionary. Sometimes we have no translation for those words, even though a finer-grained tokenization of the same text would yield tokens that we can translate.
As an example, tokenize returns a single token for 평창올림픽, but (평창, 올림픽) is also a valid tokenization. We don't have a translation available for 평창올림픽.
In this case, we'd like to be able to rule out 평창올림픽 as a valid token. This could be done by:
Adding the ability to give a set of words a penalty in TokenizerProfile
Adding the ability to remove words from the dictionary
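To make the requested behavior concrete, here is a minimal, self-contained sketch of a dictionary-based segmenter that accepts a set of blocked words. It does not use or reproduce OKT's tokenizer or TokenizerProfile; the class name `ConstrainedSegmenter`, the `segment` method, and the cost constants are all hypothetical and exist only for illustration. It assumes the input consists of characters in the Basic Multilingual Plane, which holds for Hangul syllables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: NOT part of OKT and not how OKT scores tokens.
final class ConstrainedSegmenter {
    // Assumed costs: each dictionary word costs 1, so fewer (longer) words win.
    private static final int WORD_COST = 1;
    // Single characters outside the dictionary are allowed as a costly fallback.
    private static final int UNKNOWN_CHAR_COST = 10;

    static List<String> segment(String text, Set<String> dictionary, Set<String> blocked) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i] = minimal cost to segment text[0, i)
        int[] prev = new int[n + 1]; // prev[i] = start index of the last token ending at i
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue;
                }
                String piece = text.substring(start, end);
                int cost;
                if (dictionary.contains(piece) && !blocked.contains(piece)) {
                    cost = WORD_COST;
                } else if (end - start == 1) {
                    cost = UNKNOWN_CHAR_COST;
                } else {
                    continue; // blocked or unknown multi-character piece
                }
                if (best[start] + cost < best[end]) {
                    best[end] = best[start] + cost;
                    prev[end] = start;
                }
            }
        }

        // Walk back from the end to recover the chosen tokens.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(text.substring(prev[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("평창", "올림픽", "평창올림픽"));
        Set<String> blocked = new HashSet<>(Collections.singletonList("평창올림픽"));
        System.out.println(segment("평창올림픽", dictionary, Collections.<String>emptySet()));
        System.out.println(segment("평창올림픽", dictionary, blocked));
    }
}
```

With an empty blocked set, the sketch keeps 평창올림픽 as one token; with 평창올림픽 blocked, it falls back to (평창, 올림픽). Removing a word from the dictionary has the same effect here as blocking it, while a per-word penalty would correspond to raising that word's cost instead of excluding it outright.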
chrisjrn changed the title from "Add ability to prioritize/constrain tokenize results based" to "Add ability to prioritize/constrain tokenize results" on Jan 18, 2018
Adding the ability to remove words from the dictionary
For the first idea, it would be helpful if you could provide me with an example of the TokenizerProfile that you would want to use. I think your second idea will cover most use cases, though.