Feature requests: feedback from parsing tweets #370

Open
sayrer opened this issue Feb 15, 2019 · 3 comments

Comments

@sayrer

sayrer commented Feb 15, 2019

I wrote a tweet parser with Pest here: https://github.com/sayrer/twitter-text/blob/master/parser/src/twitter_text.pest

It was a good experience, but two small things would have helped a lot.

  1. Support for a `parse()` method that takes a character iterator rather than a string. This would let me NFC-normalize the input text without allocating an extra string.

  2. Optional support for more detailed character offsets in `Pair` (UTF-16 and UTF-32). Finding these offsets currently requires iterating over the input string with `str::char_indices` after parsing, but I bet Pest could provide them.
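For reference, the post-parse workaround described in item 2 can be sketched as a single pass over `str::char_indices`. This is only an illustration using the standard library: the `Offsets` struct and `convert_offsets` function are hypothetical names, not part of Pest, and the sketch assumes the byte offsets (for example, the start and end of each matched span) have already been collected and sorted in ascending order.

```rust
/// One input position expressed in three encodings.
/// (Hypothetical type for illustration; not a Pest API.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Offsets {
    /// Offset in bytes (UTF-8 code units).
    pub utf8: usize,
    /// Offset in UTF-16 code units.
    pub utf16: usize,
    /// Offset in Unicode scalar values (Rust `char`s).
    pub utf32: usize,
}

/// Converts ascending byte offsets into UTF-16 and UTF-32 offsets in one
/// pass over the input. Offsets that do not fall on a char boundary, or
/// that are out of order, are silently dropped from the result.
pub fn convert_offsets(input: &str, byte_offsets: &[usize]) -> Vec<Offsets> {
    let mut out = Vec::with_capacity(byte_offsets.len());
    let mut targets = byte_offsets.iter().copied().peekable();
    let mut utf16 = 0;
    let mut utf32 = 0;

    for (byte_idx, ch) in input.char_indices() {
        // Emit every requested offset that lands exactly on this char.
        while let Some(&t) = targets.peek() {
            if t > byte_idx {
                break;
            }
            if t == byte_idx {
                out.push(Offsets { utf8: t, utf16, utf32 });
            }
            targets.next();
        }
        utf16 += ch.len_utf16();
        utf32 += 1;
    }

    // An offset equal to the input length marks the end of the input.
    for t in targets {
        if t == input.len() {
            out.push(Offsets { utf8: t, utf16, utf32 });
        }
    }
    out
}
```

Because the input is walked only once, the cost is linear in the input length plus the number of offsets, rather than one scan per offset.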

@sayrer
Author

sayrer commented Feb 15, 2019

Also, a more difficult request: compile long sequences of literal choices to tries. At the bottom of the tweet grammar, I did this manually for TLDs. I haven't tested this yet, but I did look at the generated code, and it seemed to be called for.
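To make the request concrete, here is a minimal sketch of matching a set of literals with a byte-wise trie instead of trying each alternative in turn. It is not what Pest generates; `Trie`, `from_literals`, and `longest_prefix` are hypothetical names, and the sketch uses longest-match semantics, which only agrees with an ordered choice when longer literals are listed before their prefixes.

```rust
use std::collections::HashMap;

/// A byte-wise trie over a fixed set of literals.
/// (Hypothetical type for illustration; not generated by Pest.)
#[derive(Default)]
pub struct Trie {
    children: HashMap<u8, Trie>,
    /// True if the path from the root to this node spells a full literal.
    terminal: bool,
}

impl Trie {
    /// Builds a trie containing every literal in `literals`.
    pub fn from_literals(literals: &[&str]) -> Self {
        let mut root = Trie::default();
        for lit in literals {
            let mut node = &mut root;
            for &b in lit.as_bytes() {
                node = node.children.entry(b).or_default();
            }
            node.terminal = true;
        }
        root
    }

    /// Returns the byte length of the longest literal that is a prefix of
    /// `input`, or `None` if no literal matches at the start of `input`.
    pub fn longest_prefix(&self, input: &str) -> Option<usize> {
        let mut node = self;
        let mut best = if node.terminal { Some(0) } else { None };
        for (i, b) in input.bytes().enumerate() {
            match node.children.get(&b) {
                Some(next) => {
                    node = next;
                    if node.terminal {
                        best = Some(i + 1);
                    }
                }
                None => break,
            }
        }
        best
    }
}
```

Each input byte is inspected at most once, so the match cost depends on the length of the matched text rather than on the number of alternatives.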

@CAD97
Contributor

CAD97 commented Feb 15, 2019

The way that pest is currently set up, the returned parse tree borrows the original input string, so a streaming API isn't really possible. The string has to be collected in either case, so making this externally obvious seems ideal.

It might be possible to support a streaming API in the future with pest 3.0 or otherwise, but streaming introduces a lot of issues. As I understand it, pest is optimized for full-file processing, as you see in a programming language.
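A rough illustration of why the input has to be collected first: when the nodes of a parse result hold slices of the input, the whole input must exist as one contiguous string that outlives them. The `Node` type and both functions below are hypothetical stand-ins for illustration, not pest's actual types.

```rust
/// A parse node that borrows its text from the input.
/// (Hypothetical stand-in for a borrowed parse tree; not a pest type.)
#[derive(Debug, PartialEq)]
pub struct Node<'i> {
    /// Slice of the original input covered by this node.
    pub text: &'i str,
    /// Byte offset of the node within the input.
    pub start: usize,
}

/// Toy "parser" that splits the input into whitespace-separated words.
/// Every returned node borrows from `input`, so `input` must outlive them.
pub fn parse_words<'i>(input: &'i str) -> Vec<Node<'i>> {
    let mut nodes = Vec::new();
    let mut start = None;
    for (i, ch) in input.char_indices() {
        match (ch.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(s)) => {
                nodes.push(Node { text: &input[s..i], start: s });
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        nodes.push(Node { text: &input[s..], start: s });
    }
    nodes
}

/// A character iterator has to be collected into an owned `String`
/// before a borrowing parser can run on it.
pub fn collect_input<I: Iterator<Item = char>>(chars: I) -> String {
    chars.collect()
}
```

With this shape, a caller holding a character iterator (for example, the output of a normalization step) would call `collect_input` and then pass a reference to the resulting `String` to the parser, which makes the allocation explicit.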

As for the literals, I believe the intent is to utilize logos's lexing plumbing superpowers, which will give us O(1) bytewise lexing for "free" so long as we can get ordered-choice semantics instead of longest-match.
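The difference between ordered-choice and longest-match semantics mentioned above can be shown with two small functions over a list of literal alternatives. This is a plain standard-library sketch; `ordered_choice` and `longest_match` are hypothetical names and do not correspond to any pest or logos API.

```rust
/// Ordered choice (PEG semantics): the first alternative, in listed order,
/// that matches at the start of `input` wins.
pub fn ordered_choice<'a>(alternatives: &[&'a str], input: &str) -> Option<&'a str> {
    alternatives
        .iter()
        .copied()
        .find(|alt| input.starts_with(*alt))
}

/// Longest match (typical lexer semantics): among all alternatives that
/// match at the start of `input`, the longest one wins.
pub fn longest_match<'a>(alternatives: &[&'a str], input: &str) -> Option<&'a str> {
    alternatives
        .iter()
        .copied()
        .filter(|alt| input.starts_with(*alt))
        .max_by_key(|alt| alt.len())
}
```

For the alternatives `["co", "com"]` and the input `"com"`, `ordered_choice` returns `"co"` while `longest_match` returns `"com"`, so a lexer built around longest match cannot replace an ordered choice without accounting for the order of the alternatives.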

@sayrer
Author

sayrer commented Feb 15, 2019

Thanks for the reply. I agree logos should take care of my trie request (excellent!). It's also understandable that streaming might take a while or never happen.

But what about getting UTF-16 and UTF-32 offsets?


3 participants