Parsing Latin1 characters #158

Closed
tfgordon opened this issue Oct 31, 2023 · 4 comments
Comments

@tfgordon

tfgordon commented Oct 31, 2023

I need to parse Latin-1 characters that are not ASCII. My parser worked with version 4.0.2, but I now have to use a newer version of petitparser because of a dependency of the pdf Flutter package, which I also need.

Here's a simplified code snippet which fails:

    // letter() extended with Latin-1 characters to cover most Western European languages
    final Parser extChar = letter() | char('ä');

I've also tried using pattern, like this:

    final Parser extChar = letter() | pattern("À-ÿ");

This also fails.

Would it be easiest to extend letter() to cover all Latin 1 alphabetic characters?
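
For reference, a minimal sketch of what "Latin-1 alphabetic" could mean at the code-unit level, written in plain Dart without PetitParser. The function name isLatin1Letter is hypothetical and not part of any library. It assumes the same À-ÿ range as the pattern above, and it excludes the two non-letters in that range (× and ÷). Other Latin-1 letters outside that range, such as µ, are deliberately left out to keep the sketch short.

```dart
/// Hypothetical helper: returns true if [codeUnit] is an ASCII letter or a
/// letter in the Latin-1 range U+00C0..U+00FF.
bool isLatin1Letter(int codeUnit) {
  final isAsciiLetter = (codeUnit >= 0x41 && codeUnit <= 0x5A) ||
      (codeUnit >= 0x61 && codeUnit <= 0x7A);
  final isSupplementLetter = codeUnit >= 0xC0 &&
      codeUnit <= 0xFF &&
      codeUnit != 0xD7 && // '×' multiplication sign is not a letter
      codeUnit != 0xF7; // '÷' division sign is not a letter
  return isAsciiLetter || isSupplementLetter;
}

void main() {
  // Print the classification of a few sample characters.
  for (final ch in ['a', 'Z', 'ä', 'ÿ', '×', '1']) {
    print('$ch -> ${isLatin1Letter(ch.codeUnitAt(0))}');
  }
}
```

Note that a plain pattern("À-ÿ") would also accept × and ÷, which may or may not matter for a given grammar.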

@renggli
Member

renggli commented Oct 31, 2023

Note that PetitParser has never supported any encoding other than the standard UTF-16 code units of a Dart String. I recommend that you convert your input to a Dart String before parsing, for example using the built-in Latin1Codec.
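
A minimal sketch of that conversion, assuming the input arrives as raw ISO-8859-1 bytes. It uses the latin1 constant from dart:convert, which is an instance of Latin1Codec. The helper name decodeLatin1 and the sample bytes are made up for illustration.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes raw ISO-8859-1 bytes into a Dart String.
String decodeLatin1(List<int> bytes) => latin1.decode(bytes);

void main() {
  // 'G', 'r', 0xFC ('ü'), 0xDF ('ß'), 'e' as Latin-1 bytes.
  final bytes = [0x47, 0x72, 0xFC, 0xDF, 0x65];
  final text = decodeLatin1(bytes);
  print(text); // Grüße
  // The resulting String can be handed to a parser as usual.
}
```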

I am not aware of any change to how characters are read in a long time. Could you provide a short reproducible test case that passes with PetitParser 4.0.2 but fails with a newer version?

I agree that the built-in predicates such as letter() are simplistic. It would be great to have built-in support for Unicode character properties. Happy to discuss a possible implementation.

renggli added a commit that referenced this issue Oct 31, 2023
@tfgordon
Author

Thanks for your quick response. I now think the problem is not with PetitParser, but was caused by a change in the way I store files, which I made so that the app can be deployed as a web app. I am now using the Hive NoSQL database. Printing the output from the database before I parse it with PetitParser shows that (some?) non-ASCII characters are corrupted. The characters returned are not ones handled by the grammar, so I get a parse error. I will see if this can be fixed and hope that it solves the parsing problem as well.

@tfgordon
Author

I'd like to be able to help you with extending the letter() implementation, but I'm afraid that's over my head.

@tfgordon
Author

tfgordon commented Oct 31, 2023

I found the problem. Hive encodes strings using UTF-8. I just needed to convert them into UTF-16 and everything works as it should.
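
The thread does not show the actual fix, so the following is only a sketch of the general failure mode, using dart:convert and no Hive API. It assumes the stored data is a sequence of UTF-8 bytes and shows how decoding those bytes one byte per character produces garbage, while utf8.decode yields the expected Dart String. The helper name repairMojibake is hypothetical.

```dart
import 'dart:convert';

/// Hypothetical helper: repairs text whose UTF-8 bytes were mistakenly
/// read as one character per byte. Throws if [garbled] contains a
/// character above U+00FF or if the bytes are not valid UTF-8.
String repairMojibake(String garbled) => utf8.decode(latin1.encode(garbled));

void main() {
  final stored = utf8.encode('ä'); // two bytes: 0xC3, 0xA4

  final wrong = latin1.decode(stored); // one character per byte
  final right = utf8.decode(stored); // proper UTF-8 decoding

  print(wrong.codeUnits); // [195, 164]
  print(right.codeUnits); // [228]
  print(repairMojibake(wrong) == right); // true
}
```

Printing codeUnits rather than the string itself makes this kind of corruption easy to spot, because a single expected character shows up as two code units.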

@renggli renggli closed this as completed Jan 9, 2024