How to support Unicode/UTF-8, such as Chinese? #208

Closed
iamsk opened this issue Aug 9, 2018 · 6 comments

Comments

@iamsk

iamsk commented Aug 9, 2018

Sample code like this:

from lark import Lark

parser = Lark('''start: WORD "," WORD "!"
            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''', parser='lalr')
print(parser.parse("Hello,世界!"))

I've been trying for a long time without success. Can anyone help with this? Thanks!

@ray-linn

ray-linn commented Aug 9, 2018

The letters that make up WORD are defined as only "A".."Z" and "a".."z".

@erezsh
Member

erezsh commented Aug 9, 2018

Right now, WORD is defined for English only. Perhaps that needs to change.
Meanwhile, you can use the following definition:

WORD: /[^\W\d_]+/
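
To see why this pattern covers Chinese, here is a minimal sketch that uses Python's built-in re module rather than Lark itself. It assumes Python 3, where \w is Unicode-aware for str patterns. The sample text is the one from the question, and the second string is made up for illustration.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an
# underscore", which leaves letters. In Python 3, \w is Unicode-aware for
# str patterns, so CJK characters count as letters too.
LETTERS = re.compile(r"[^\W\d_]+")

# The comma and the exclamation mark are non-word characters, so they
# separate the two runs of letters.
words = LETTERS.findall("Hello,世界!")
print(words)

# Digits and underscores are excluded, so they also split runs of letters.
mixed = LETTERS.findall("abc_123_中文")
print(mixed)
```

The same character class also matches accented Latin letters, Cyrillic, and so on, so it is broader than a fixed Chinese range.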

@ray-linn

ray-linn commented Aug 10, 2018

I tried to extract Unicode CHS (Simplified Chinese) letters with /u"[\u4e00-\u9fa5]"/. However, it looks like this rule does not work:

LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
SIMCH_LETTER: /u"[\u4e00-\u9fa5]"/
LETTER: UCASE_LETTER | LCASE_LETTER | SIMCH_LETTER
WORD: LETTER+

I think something in load_grammar.py may need to be changed.

@iamsk
Author

iamsk commented Aug 10, 2018

@ray-linn ok, thanks, I think this will help.

@ray-linn

ray-linn commented Aug 10, 2018

I found out how to write the rule for Unicode in grammer.g. Here is the example:

LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

and it outputs:

Tree(start, [Token(WORD, 'Hello'), Token(WORD, '世界')])

@ruiqurm

ruiqurm commented Dec 4, 2021

Inside a character class, the u and the quote marks are literal characters, so /[u"\u4e00-\u9fa5"]/ also matches u and ". You can use /[\u4e00-\u9fa5]/ instead.
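
A small check of the difference between the two character classes, sketched with Python's re module only. It assumes Lark's regexp terminals follow Python's regular expression syntax, and the test characters are chosen for illustration.

```python
import re

# Inside [...], the u and the double quotes are ordinary members of the set.
with_quotes = re.compile(r'[u"\u4e00-\u9fa5]')
# Only the range U+4E00..U+9FA5.
range_only = re.compile(r'[\u4e00-\u9fa5]')

# For each character: (matched by the first class, matched by the second).
results = {
    ch: (bool(with_quotes.fullmatch(ch)), bool(range_only.fullmatch(ch)))
    for ch in ['世', '"', 'u', 'a']
}

for ch, (first, second) in results.items():
    print(repr(ch), first, second)
```

Both classes accept the Chinese character, but only the first also accepts a stray u or double quote, which is why the stripped-down range is the safer terminal definition.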
