Cutlet creates additional spaces in some words written in Latin alphabet #21

Closed
Lili1228 opened this issue Jan 1, 2021 · 12 comments


Lili1228 commented Jan 1, 2021

I don't know whether this is cutlet's fault or the fault of one of its dependencies, but I'm reporting it here.

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> katsu = cutlet.Cutlet()
>>> text = '私は Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch にい ま す'
>>> katsu.romaji(text)
'Watakushi wa L l a n f a i r p w l l g w y n g y l l g o g e r y c h w y r n d robwllllantysiliogogogoch ni ima su'
Owner

polm commented Jan 1, 2021

This is an issue with the way the underlying library, MeCab, works. All words need costs, and lower-cost words are preferred. If a word is not in the dictionary, MeCab calculates its cost based on the length and the type of character (Latin, hiragana, kanji, etc.). Because costs are calculated for the sequence as a whole, at certain lengths it's cheaper to break the sequence up than to treat it as a single word.

In my tests with the latest version of unidic-lite this only happens if the unknown input is more than 25 characters. Are you actually processing the names of long Welsh towns or is there something else you're doing where this comes up a lot? It may be possible to change this behavior by modifying the dictionary settings but it seems like it's not an issue for normal usage.

Author

Lili1228 commented Jan 1, 2021

That's the only case I've found so far, and I assume the only other way I'd trigger it is by inputting random garbage (which can sometimes happen). I don't think it's worth changing the default for everyone because of that, but can you tell me how to change that setting?

Owner

polm commented Jan 2, 2021

On investigation, this isn't actually due to cost calculations. It happens because MeCab has a hard cap on the length of unknown words. You can change this value by passing -M [number] to the Tagger, so for you the fix would look like this:

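import cutlet
import fugashi

cut = cutlet.Cutlet()
# -M sets MeCab's maximum length for unknown words
tagger = fugashi.Tagger('-M 100')
# replace cutlet's default tagger with the configured one
cut.tagger = tagger
# now you can get unknown words up to length 101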

The maximum length specification seems to have an off-by-one-error, so it's actually one longer than the number you specify.

Let me know if that fixes it for you.
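
As a hedged sketch of how the `-M` value might be chosen instead of hard-coded: the helpers below scan the input for its longest run of ASCII letters and build the argument string from that. `longest_latin_run`, `tagger_args_for`, and `make_cutlet` are hypothetical names, not part of cutlet or fugashi; the only library calls used are the `cutlet.Cutlet()`, `fugashi.Tagger(...)`, and `tagger` assignment already shown in this comment. The `floor` of 100 and the one-character margin are assumptions, and the sketch assumes the unknown text is plain ASCII letters.

```python
import re


def longest_latin_run(text: str) -> int:
    """Length of the longest unbroken run of ASCII letters in text."""
    runs = re.findall(r"[A-Za-z]+", text)
    return max((len(run) for run in runs), default=0)


def tagger_args_for(text: str, floor: int = 100) -> str:
    """Build a MeCab argument string whose -M value covers the longest run.

    `floor` is an arbitrary assumed minimum (100 mirrors the value used in
    this thread), not a MeCab default. One extra character of margin is
    added so the result does not depend on the off-by-one behaviour.
    """
    return f"-M {max(longest_latin_run(text) + 1, floor)}"


def make_cutlet(text: str):
    """Return a Cutlet whose tagger uses the -M value computed for text."""
    # Imported lazily so the helpers above work without these packages.
    import cutlet
    import fugashi

    cut = cutlet.Cutlet()
    cut.tagger = fugashi.Tagger(tagger_args_for(text))
    return cut
```

Calling `make_cutlet(text).romaji(text)` would then use the computed limit; whether building a new tagger per input is acceptable depends on how often long unknown words actually appear.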

Author

Lili1228 commented Jan 2, 2021

It fixed my problem, thank you!

Lili1228 closed this as completed Jan 2, 2021

Author

Lili1228 commented Jan 5, 2021

Reopening because while the previous thing was rather insignificant and fixable, this is definitely bad:

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald ' s"

I use Cutlet combined with Text-to-Speech and no English voice synthesizer is able to pronounce this correctly with spacing like that.

Lili1228 reopened this Jan 5, 2021

Owner

polm commented Jan 6, 2021

Ah, good catch, I'll look at fixing that.

polm closed this as completed in 35ca54e Jan 8, 2021

Owner

polm commented Jan 8, 2021

This should be fixed in the latest version, please confirm if it works for you.

I also looked at handling quotes in general, for sentences like "It's 'delicious.'", but that ended up being much more complicated, partly because MeCab sticks punctuation together. Since cutlet isn't really designed to take already-English input like that anyway, I'm treating it as out of scope.

Author

Lili1228 commented Jan 9, 2021

It works only if it's the only word:

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald's"
>>> x.romaji("text McDonald's text")
"Text McDonald ' s text"
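
For a text-to-speech pipeline stuck on a version that produces this output, one possible stopgap is to rejoin the possessive after romanization. This is a hypothetical post-processing sketch (`rejoin_possessive` is an invented name), not how cutlet itself addresses the problem, and it only handles the `' s` pattern shown in the session above.

```python
import re

# A word character, then the detached " ' s" ending at a word boundary.
_SPLIT_POSSESSIVE = re.compile(r"(?<=\w) ' s\b")


def rejoin_possessive(romaji: str) -> str:
    """Collapse "word ' s" back into "word's" in romanized output."""
    return _SPLIT_POSSESSIVE.sub("'s", romaji)
```

The word-boundary check keeps the pattern from touching a quote that merely precedes a word starting with "s".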

Owner

polm commented Jan 9, 2021

Ah, that's embarrassing, but I think I found the issue. Should be fixed in master, I'll release a test alpha tomorrow.

Owner

polm commented Jan 10, 2021

Should be fixed in alpha now, please confirm.

pip install cutlet==0.1.17a2

@Lili1228
Author

Works well, thank you!

Owner

polm commented Jan 11, 2021

Great, thanks for the confirmation, I'll make a release shortly.
