Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add C-based isAscii :: Text -> Bool #497

Merged
merged 12 commits into from
Feb 9, 2023
Merged

Conversation

raehik
Copy link
Contributor

@raehik raehik commented Jan 26, 2023

Now that Texts are represented internally as a UTF-8 bytestring, we can provide a fast isAscii :: Text -> Bool that inspects the bytestring directly, rather than Char by Char. Such a function can come in handy in serialization libraries, and can't easily be written by an end user. Plus, the C snippet already exists, so there's not much to do.

@raehik
Copy link
Contributor Author

raehik commented Jan 26, 2023

I am almost certain my FFI code is buggy. I don't well understand pinned/unpinned memory. But I'm unable to test on my machine (build issues).

src/Data/Text.hs Outdated Show resolved Hide resolved
src/Data/Text.hs Outdated Show resolved Hide resolved
src/Data/Text.hs Outdated Show resolved Hide resolved
Copy link
Contributor

@phadej phadej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and you probably want to add tests.

@raehik
Copy link
Contributor Author

raehik commented Jan 27, 2023

Fixed my build issues, will do. Thanks a ton for the help with the fiddly bits.

@raehik raehik marked this pull request as ready for review January 27, 2023 17:41
@raehik
Copy link
Contributor Author

raehik commented Jan 27, 2023

Fixed an error in the C code (end of text data is at arr+off+len, not arr+len), added lazy version, added property tests for both. Ready for review/squish/merge.

@raehik raehik requested a review from phadej January 27, 2023 17:44
@raehik
Copy link
Contributor Author

raehik commented Jan 27, 2023

Something to note: Data.Text.isAscii clashes with Data.Char.isAscii. There were already a couple of clashing functions (found toUpper, toLower on a glance), so I wager that it shouldn't be considered an issue.

@raehik raehik changed the title add Data.Text.isAscii add fast Data.Text.isAscii Jan 27, 2023
@raehik raehik changed the title add fast Data.Text.isAscii add C-based isAscii :: Text -> Bool Jan 27, 2023
cbits/is_ascii.c Outdated Show resolved Hide resolved
Copy link
Contributor

@Lysxia Lysxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good to me!

Another pair of eyes (@phadej?) to go over the FFI bits again would be nice.
There's the @since TODOs, I think this can make it into 2.0.2.

@Boarders
Copy link

Boarders commented Feb 4, 2023

Looks good to me, would be nice to have a test for the edge cases to make sure the offset is working as expected i.e. make sure this fails on text which fails to be ascii on the boundary with non-zero offset (looks correct, but it is useful to have such tests regardless).

src/Data/Text.hs Outdated Show resolved Hide resolved
@raehik
Copy link
Contributor Author

raehik commented Feb 8, 2023

Looks good to me, would be nice to have a test for the edge cases to make sure the offset is working as expected i.e. make sure this fails on text which fails to be ascii on the boundary with non-zero offset (looks correct, but it is useful to have such tests regardless).

OK. I added a single test for that:

isAscii_border :: IO ()
isAscii_border = do
    -- ASCII prefix ends at position 3 (from 0)
    let text  = T.pack "123一二三"
        text' = case text of T.Text arr off _len -> T.Text arr (off+1) 3
    assertBool "UTF-8 string with ASCII prefix ending at last position incorrectly detected as ASCII" $ not $ T.isAscii text'

We're really testing the behaviour of _hs_text_is_ascii rather than isAscii or the C wrapper, which do very little (a non-zero offset means very little to the C call). What do you think @Boarders ?

src/Data/Text/Lazy.hs Outdated Show resolved Hide resolved
@Bodigrim Bodigrim merged commit a9bda5b into haskell:master Feb 9, 2023
@Bodigrim
Copy link
Contributor

Bodigrim commented Feb 9, 2023

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants