Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Unicode #135

Merged
merged 21 commits into from
Jun 8, 2021
Merged

Fix Unicode #135

merged 21 commits into from
Jun 8, 2021

Conversation

kirawi
Copy link
Member

@kirawi kirawi commented Jun 6, 2021

Fix #123, should probably still include #129 as a fallback. Remember how I said there shouldn't be any regressions in #121? Well, I figured out my "fix" for #88 broke b on unicode especially, but now I guess that's fixed here since it properly detects unicode?

@kirawi
Copy link
Member Author

kirawi commented Jun 6, 2021

Oops, accidentally included old commits.

@kirawi
Copy link
Member Author

kirawi commented Jun 6, 2021

Updated. Is the order of the match a problem?

@kirawi
Copy link
Member Author

kirawi commented Jun 6, 2021

Also, I just found out that it doesn't fix the unicode problem regex problem exactly, I'll look into it tomorrow.

@archseer
Copy link
Member

archseer commented Jun 6, 2021

I think it looks good, but what is '_' categorized as now?

@kirawi
Copy link
Member Author

kirawi commented Jun 6, 2021

Added a test, and it's still recognized as a word. The categories can be seen here, https://www.compart.com/en/unicode/category. I'll look into the aforementioned regression tomorrow, or possibly later depending on my schedule.

| GeneralCategory::OpenPunctuation
| GeneralCategory::InitialPunctuation
| GeneralCategory::FinalPunctuation => Category::Punctuation,
_ => unreachable!("unknown '{}' character category", ch),
Copy link
Contributor

@pickfire pickfire Jun 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail for other cases, like + it will get a panic. We have to cover all the cases for this, or even just put it under unknown category.

@kirawi
Copy link
Member Author

kirawi commented Jun 6, 2021

Might be worth having a separate test containing the equivalent of, "the fox jumped over the... Etc.", though I have no idea what that would be for Unicode. Maybe foreign equivalents? The goal of it would be to check for any panics. I guess we should just get one character from each category.

@pickfire
Copy link
Contributor

pickfire commented Jun 6, 2021

Might be worth having a separate test containing the equivalent of, "the fox jumped over the... Etc.", though I have no idea what that would be for Unicode. Maybe foreign equivalents? The goal of it would be to check for any panics. I guess we should just get one character from each category.

If you want that, fuzzing will be easier since it spams random characters, then we run our code through the fuzzer.

@archseer
Copy link
Member

archseer commented Jun 6, 2021

Property testing: https://github.com/BurntSushi/quickcheck

@kirawi kirawi marked this pull request as draft June 6, 2021 18:57
CBenoit added a commit to CBenoit/helix that referenced this pull request Jun 6, 2021
`is_ascii_punctuation` will only work for ASCII punctuations, and when
we have unicode punctuation (or other) we jump into the `unreachable`.
This patch fallback into categorizing everything in this branch as
`Unknown`.

Fixes helix-editor#123

helix-editor#135: add better support for
unicode categories.
@CBenoit
Copy link
Member

CBenoit commented Jun 6, 2021

I changed a bit #129 to return a Unknown variant instead. Your changes should be enough to handle most useful unicodes.

archseer pushed a commit that referenced this pull request Jun 7, 2021
`is_ascii_punctuation` will only work for ASCII punctuations, and when
we have unicode punctuation (or other) we jump into the `unreachable`.
This patch fallback into categorizing everything in this branch as
`Unknown`.

Fixes #123

#135: add better support for
unicode categories.
@kirawi
Copy link
Member Author

kirawi commented Jun 7, 2021

I changed a bit #129 to return a Unknown variant instead. Your changes should be enough to handle most useful unicodes.

That's perfect, I was a little worried about how I could create tests for this PR.

@pickfire
Copy link
Contributor

pickfire commented Jun 7, 2021

@kirawi Can you please rebase and add it for ctrl-w in words.rs as well? And the tests should not have weird behavior like not deleting the unicode punctuations.

@kirawi kirawi marked this pull request as ready for review June 7, 2021 20:12
@kirawi
Copy link
Member Author

kirawi commented Jun 7, 2021

#88 was not fixed, that issue should be reopened. I still have no idea what is causing the bug.

@hovsater
Copy link
Contributor

hovsater commented Jun 7, 2021

@kirawi actually the test case in #88 is fixed. I guess what I'm seeing now could count as a new issue. I'll create one!

@kirawi
Copy link
Member Author

kirawi commented Jun 7, 2021

@kirawi actually the test case in #88 is fixed. I guess what I'm seeing now could count as a new issue. I'll create one!

Are you sure? I noticed that it chokes if there is more than one unicode, i.e.

key
ささ
key
key

@hovsater
Copy link
Contributor

hovsater commented Jun 7, 2021

Ah, interesting. Never tried that. Perhaps that's why it chokes on the new input. Well, I'll add a comment to the original issue tomorrow. Getting late here.

@archseer archseer requested a review from pickfire June 8, 2021 03:19
Comment on lines +182 to 199
#[inline]
pub(crate) fn is_punctuation(ch: char) -> bool {
use unicode_general_category::{get_general_category, GeneralCategory};

matches!(
get_general_category(ch),
GeneralCategory::OtherPunctuation
| GeneralCategory::OpenPunctuation
| GeneralCategory::ClosePunctuation
| GeneralCategory::InitialPunctuation
| GeneralCategory::FinalPunctuation
| GeneralCategory::ConnectorPunctuation
| GeneralCategory::DashPunctuation
| GeneralCategory::MathSymbol
| GeneralCategory::CurrencySymbol
| GeneralCategory::ModifierSymbol
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move these pub(crate) and categories stuff to category.rs or something. Maybe in another PR.

@@ -231,6 +249,7 @@ where
// need to +1 so that prev() includes current char
let mut chars = slice.chars_at(*pos + 1);

#[allow(clippy::while_let_on_iterator)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a false positive, maybe report it to clippy?

Copy link
Member Author

@kirawi kirawi Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly, but the next line operates on next(), and I didn't know what to do with that so I decided to suppress the warning for now.

Copy link
Contributor

@pickfire pickfire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Merging now as it has gotten quite long but we need some follow up on the category thing. Will test it tonight.

Oh wait, can't merge due to conflicts. Can you please fix the conflicts and address the issues?

@archseer
Copy link
Member

archseer commented Jun 8, 2021

Squash and merge works

@archseer archseer merged commit b873fb9 into helix-editor:master Jun 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

thread 'main' panicked at 'internal error: entered unreachable code', helix-core/src/movement.rs:203:9
5 participants