How to support Unicode/UTF-8 text, such as Chinese? #208

iamsk · 2018-08-09T06:57:53Z

Here is my sample code:

from lark import Lark

parser = Lark('''start: WORD "," WORD "!"
            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''', parser='lalr')
print(parser.parse("Hello,世界!"))

I've been trying for a long time without success. Can anyone help with this? Thanks!

ray-linn · 2018-08-09T08:21:20Z

The letters that make up WORD are defined as "A-Z" or "a-z" only.

erezsh · 2018-08-09T08:50:56Z

Right now, WORD is defined for English only. Perhaps that needs to change.
Meanwhile, you can use the following definition:

WORD: /[^\W\d_]+/
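
A minimal sketch of what this pattern matches, using only Python's standard `re` module. It assumes Lark hands the text between the slashes to `re` (its default regex engine) and runs on Python 3, where `\W` and `\d` are Unicode-aware for `str` patterns. The variable names are illustrative and not part of Lark.

```python
import re

# Same pattern as the WORD terminal above: one or more characters that are
# word characters (\w) but neither decimal digits (\d) nor the underscore.
WORD = re.compile(r"[^\W\d_]+")

# The comma and the exclamation mark are not word characters, so they
# separate the two runs of letters.
words = WORD.findall("Hello,世界!")
print(words)

# Digits and underscores are excluded, so they split the match as well.
pieces = WORD.findall("a1_b")
print(pieces)
```

The double negation is a way to say "letters in any script" without listing code point ranges, which is why it covers Chinese text as well as English.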

ray-linn · 2018-08-10T01:40:52Z

I could extract Unicode CHS letters with /u"[\u4e00-\u9fa5]"/. However, it looks like this rule does not work:

LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
SIMCH_LETTER: /u"[\u4e00-\u9fa5]"/
LETTER: UCASE_LETTER | LCASE_LETTER | SIMCH_LETTER
WORD: LETTER+

I think something in load_grammar.py may need to be changed.
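
One likely reason the SIMCH_LETTER rule fails, sketched with Python's standard `re` module and assuming Lark passes the text between the slashes to `re` unchanged: the `u"` and the closing `"` are not Python string syntax here, they are literal characters the regex must match. The names below are illustrative only.

```python
import re

# The regex body of SIMCH_LETTER, exactly as written between the slashes.
# It means: a literal 'u', a literal '"', one CJK character, a literal '"'.
quoted = re.compile(r'u"[\u4e00-\u9fa5]"')

# A bare Chinese character does not satisfy the pattern...
bare_matches = quoted.fullmatch("世") is not None
# ...but the four-character string u"世" does.
wrapped_matches = quoted.fullmatch('u"世"') is not None

print(bare_matches, wrapped_matches)
```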

iamsk · 2018-08-10T04:28:15Z

@ray-linn ok, thanks, I think this will help.

ray-linn · 2018-08-10T09:35:51Z

I found out how to write the Unicode rule in grammer.g. Here is the example:

LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

and it outputs:

Tree(start, [Token(WORD, 'Hello'), Token(WORD, '世界')])

ruiqurm · 2021-12-04T06:15:41Z

> I found out how to write the Unicode rule in grammer.g. Here is the example:
>
> LCASE_LETTER: "a".."z"
> UCASE_LETTER: "A".."Z"
> CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
> LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
> WORD: LETTER+
>
> and it outputs:
>
> Tree(start, [Token(WORD, 'Hello'), Token(WORD, '世界')])

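Inside a character class, the u and the quote marks are ordinary members, so /[u"\u4e00-\u9fa5"]/ also matches the letter u and the double-quote character. You can use /[\u4e00-\u9fa5]/ instead.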
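
The difference between the two character classes can be checked with Python's standard `re` module. This is a small sketch that assumes the patterns reach `re` exactly as written between the slashes; the variable names are illustrative.

```python
import re

# Character class from the quoted grammar: 'u', '"', and the CJK range.
with_quotes = re.compile(r'[u"\u4e00-\u9fa5]')
# Suggested replacement: the CJK range only.
cjk_only = re.compile(r'[\u4e00-\u9fa5]')

samples = ["世", "u", '"', "a"]

# For each sample character, record whether each pattern accepts it.
results = {
    ch: (with_quotes.fullmatch(ch) is not None, cjk_only.fullmatch(ch) is not None)
    for ch in samples
}

for ch, (loose, strict) in results.items():
    print(repr(ch), "with quotes:", loose, "| range only:", strict)
```

The first class accepts `u` and `"` as single-character LETTER tokens, which the grammar above never notices because UCASE_LETTER and LCASE_LETTER already cover plain letters and the test input contains no quote characters.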

iamsk closed this as completed Aug 11, 2018

geographika mentioned this issue Jan 12, 2020

Unquoted Unicode strings cause parsing errors geographika/mappyfile#96

Closed
