Operation failed with UTF8 character #17
Hi, thanks for sending a detailed bug report.
So the issue with your code is that `sigma_star` is defined in terms of
bytes, but the transducers in `ones_map` use UTF-8 codepoints. Because of
this, `sigma_star` cannot actually accept 'ộ'.
You can go one of two ways. Either you can list codepoints like ộ when
constructing `sigma_star`, or you can use the default byte token type.
A recommended in-line assertion test:
    assert pynini.matches('ộ', sigma_star)
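To make the mismatch concrete, here is a minimal sketch in plain Python (no pynini calls) of how the same string is split into arc labels under the two token types. The helper names `byte_labels` and `codepoint_labels` are made up for illustration, and the sketch assumes that byte mode puts one UTF-8 byte on each arc while `token_type='utf8'` puts one Unicode codepoint on each arc.

```python
def byte_labels(text):
    """Labels under the assumed byte token type: one label per UTF-8 byte."""
    return list(text.encode("utf8"))


def codepoint_labels(text):
    """Labels under the assumed utf8 token type: one label per codepoint."""
    return [ord(ch) for ch in text]


word = "một"
print(byte_labels(word))       # [109, 225, 187, 153, 116]
print(codepoint_labels(word))  # [109, 7897, 116]

# Every byte label is below 256, but the codepoint for 'ộ' is not.
print(all(label < 256 for label in byte_labels(word)))       # True
print(any(label > 255 for label in codepoint_labels(word)))  # True
```

Under these assumptions, a `sigma_star` compiled in byte mode only ever carries labels below 256, so it has no arc that could match the label 7897 that a codepoint-mode transducer emits for 'ộ'.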
On Tue, Nov 12, 2019 at 3:30 AM Lê Thành wrote:
I'm learning pynini to map number characters to syllables, but I always get "Operation
failed" when the fst2 in a transducer contains the "ộ" character, even though I
passed token_type='utf8' to both transducer and stringify.
Here is my code:
    import pynini

    ones_map = pynini.union(
        pynini.transducer("1", "một", token_type='utf8'),
        pynini.transducer("2", "hai", token_type='utf8'),
        pynini.transducer("3", "ba", token_type='utf8'),
    )
    chars = [chr(i) for i in range(1, 91)] + [r"\[", r"\\", r"\]"] + [chr(i) for i in range(94, 256)]
    sigma_star = pynini.union(*chars).closure()
    numbers = pynini.union("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
    num_norm = (pynini.cdrewrite(ones_map, "", "", sigma_star))

    def normalize(string):
        return pynini.compose(string.strip(), num_norm).stringify(token_type='utf8')

    print(normalize("1"))  # Operation failed
    print(normalize("2"))  # Success, output "hai"
Thank you for pointing that out to me. My quick fix is:
    chars = [chr(i) for i in range(1, 91)] + [r"\[", r"\\", r"\]"] + [chr(i) for i in range(94, 256)]
    chars += [bytes(i, "utf8") for i in "aáàạãảăắằặẵẳâấầậẫẩbcdđeéèẹẽẻêếềệễểghiíìịĩỉklmnoóòọõỏôốồộỗổơớờợỡởpqrstuúùụũủưứừựữửvxyýỳỵỹỷfjzw"]
    chars = set(chars)
    sigma_star = pynini.union(*chars).closure()
and also remove all
Once again, thank you for your awesome library.
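For anyone wondering why adding the multi-byte entries helps, here is a rough model in plain Python, not pynini itself. It assumes that in byte mode each Python str passed to `pynini.union` is compiled as its UTF-8 byte sequence, so `union(*chars).closure()` accepts a byte string exactly when it can be cut into pieces that are all alphabet entries. The function `accepts` and the variable names are hypothetical, and the "fixed" alphabet adds only a few of the Vietnamese letters from the list above.

```python
def accepts(alphabet, data):
    """Return True if `data` can be segmented into entries of `alphabet`.

    This models acceptance by the closure of a union of byte strings.
    """
    reachable = [False] * (len(data) + 1)
    reachable[0] = True  # the empty prefix is always accepted by a closure
    for start in range(len(data)):
        if not reachable[start]:
            continue
        for symbol in alphabet:
            if data.startswith(symbol, start):
                reachable[start + len(symbol)] = True
    return reachable[len(data)]


# Original alphabet: chr(1)..chr(255), each taken as its UTF-8 bytes (assumption).
original = {chr(i).encode("utf8") for i in range(1, 256)}
# Quick fix: also add the UTF-8 bytes of some Vietnamese letters.
fixed = original | {ch.encode("utf8") for ch in "ộđư"}

print(accepts(original, "hai".encode("utf8")))  # True
print(accepts(original, "một".encode("utf8")))  # False
print(accepts(fixed, "một".encode("utf8")))     # True
```

In this model the original alphabet has no entry that starts with the first byte of 'ộ', so "một" is rejected until the three-byte sequence for 'ộ' is added as its own entry.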
Glad it works! Peace.
On Tue, Nov 12, 2019 at 12:52 PM Lê Thành wrote:
Closed #17.