Cutlet creates additional spaces in some words written in Latin alphabet #21

Lili1228 · 2021-01-01T12:41:03Z

I don't know if this is cutlet's fault or the fault of one of its dependencies, but I'm reporting it here.

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> katsu = cutlet.Cutlet()
>>> text = '私は Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch にい ま す'
>>> katsu.romaji(text)
'Watakushi wa L l a n f a i r p w l l g w y n g y l l g o g e r y c h w y r n d robwllllantysiliogogogoch ni ima su'

polm · 2021-01-01T16:19:59Z

This is an issue with the way the underlying library, MeCab, works. All words need costs, and lower-cost words are preferred. If a word is not in the dictionary, MeCab calculates its cost based on the length and the type of character (Latin, hiragana, kanji, etc.). Because costs are calculated for the sequence as a whole, at certain lengths it's cheaper to break the sequence up than to treat it as a single word.

In my tests with the latest version of unidic-lite this only happens if the unknown input is more than 25 characters. Are you actually processing the names of long Welsh towns or is there something else you're doing where this comes up a lot? It may be possible to change this behavior by modifying the dictionary settings but it seems like it's not an issue for normal usage.

Lili1228 · 2021-01-01T18:00:19Z

That's the only case I've found so far, and I assume the only other times I'd trigger it would be if I input random garbage (which can sometimes happen). While I don't think it's worth changing this for everyone, can you tell me how to change that setting?

polm · 2021-01-02T05:57:37Z

On investigation, this isn't actually due to cost calculations. It happens because MeCab has a hard cap on the length of unknown words. You can change this value by passing -M [number] to the Tagger, so for you the fix would look like this:

import cutlet
import fugashi

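cut = cutlet.Cutlet()
# -M sets MeCab's cap on the length of unknown words
tagger = fugashi.Tagger('-M 100')
# swap cutlet's default tagger for the reconfigured one
cut.tagger = tagger
# now you can get unknown words up to length 101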

The maximum length specification seems to have an off-by-one-error, so it's actually one longer than the number you specify.

Let me know if that fixes it for you.
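
If the input length isn't known in advance, one option is to derive the value for -M from the text itself. The following is only a sketch: `longest_latin_run` and `max_grouping_arg` are hypothetical helpers (not part of cutlet, fugashi, or MeCab), "Latin" is assumed to mean ASCII letters only, and the floor of 24 is assumed to be MeCab's default, which would match the 25-character limit observed above given the off-by-one.

```python
import re

# Assumption: only ASCII letters count as "Latin" here. MeCab's real
# character categories come from the dictionary and may differ.
LATIN_RUN = re.compile(r"[A-Za-z]+")


def longest_latin_run(text: str) -> int:
    """Return the length of the longest unbroken run of ASCII letters."""
    return max((len(m.group()) for m in LATIN_RUN.finditer(text)), default=0)


def max_grouping_arg(text: str, floor: int = 24) -> str:
    """Build a '-M N' string large enough for the longest Latin run.

    `floor` keeps N from dropping below a baseline; 24 is an assumed
    default, not something confirmed in this thread.
    """
    return f"-M {max(floor, longest_latin_run(text))}"


name = "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"
text = f"私は {name} にい ま す"
arg = max_grouping_arg(text)  # a string of the form '-M <number>'
```

The resulting string could then be passed to fugashi.Tagger in place of the hard-coded '-M 100' above. Because the actual limit is reportedly one longer than the number given, using the run length itself leaves one character of slack.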

Lili1228 · 2021-01-02T09:39:53Z

It fixed my problem, thank you!

Lili1228 · 2021-01-05T22:00:50Z

Reopening because while the previous thing was rather insignificant and fixable, this is definitely bad:

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald ' s"

I use Cutlet combined with Text-to-Speech and no English voice synthesizer is able to pronounce this correctly with spacing like that.

polm · 2021-01-06T10:03:44Z

Ah, good catch, I'll look at fixing that.

polm · 2021-01-08T07:30:32Z

This should be fixed in the latest version, please confirm if it works for you.

I also looked at handling quotes in general, for sentences like "It's 'delicious.'", but that ended up being much more complicated, partly because MeCab sticks punctuation together. Since cutlet isn't really designed to take already-English input like that anyway, I'm treating it as out of scope.

Lili1228 · 2021-01-09T14:51:10Z

It works only if it's the only word:

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald's"
>>> x.romaji("text McDonald's text")
"Text McDonald ' s text"

polm · 2021-01-09T16:07:39Z

Ah, that's embarrassing, but I think I found the issue. Should be fixed in master, I'll release a test alpha tomorrow.

polm · 2021-01-10T06:01:58Z

Should be fixed in alpha now, please confirm.

pip install cutlet==0.1.17a2

Lili1228 · 2021-01-10T15:43:28Z

Works well, thank you!

polm · 2021-01-11T06:25:21Z

Great, thanks for the confirmation, I'll make a release shortly.

Lili1228 closed this as completed Jan 2, 2021

Lili1228 reopened this Jan 5, 2021

polm closed this as completed in 35ca54e Jan 8, 2021

polm mentioned this issue Jan 26, 2021

Max Grouping Size off-by-one error taku910/mecab#64

Open
