Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Text.IO.Utf8 module #503

Merged
merged 10 commits into from
Mar 11, 2023
Merged

Add Text.IO.Utf8 module #503

merged 10 commits into from
Mar 11, 2023

Conversation

oberblastmeister
Copy link
Contributor

Solves #472

@Lysxia
Copy link
Contributor

Lysxia commented Feb 13, 2023

The title of issue #472 mentions using ShortByteString for this. What's the status of that?

If I understand correctly, this is about variants of Data.Text.IO.readFile, etc. that don't depend on the locale and which may avoid a copy compared to fmap decodeUtf8 . Data.ByteString.readFile (although the current implementation doesn't do that). Any other benefit I'm missing? In any case that sounds reasonable to me.

Wouldn't it be better to add to Data.Text.IO instead of creating yet another module?

@haskell/text any objections?

@oberblastmeister
Copy link
Contributor Author

Yeah, the intention is to make it faster in the future once I finish haskell/bytestring#547.

I generally like using qualified imports instead of adding suffixes to everything, so that's why I put it in a separate module. I'm fine with putting it in Data.Text.IO also.

@Bodigrim
Copy link
Contributor

I'd mildly prefer to fold it into Data.Text.IO, but I'm fine either way.

Shall we provide the same set of functions for lazy Text as well?

Copy link

@parsonsmatt parsonsmatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks overall good - especially if we can get a zero-copy via ShortByteString!

Is there any intent to deprecate the existing readFile, possibly pointing it to a named function readFileWithSystemLocale? That would describe the most common footgun and allow people to migrate either to the existing behavior, or to this new behavior.

src/Data/Text/IO/Utf8.hs Show resolved Hide resolved
src/Data/Text/IO/Utf8.hs Show resolved Hide resolved
src/Data/Text/IO/Utf8.hs Outdated Show resolved Hide resolved
src/Data/Text/IO/Utf8.hs Show resolved Hide resolved
@oberblastmeister
Copy link
Contributor Author

How should I group the documentation in the Data.Text.IO module?

Copy link
Contributor

@Lysxia Lysxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it was just the three file functions it seemed odd to have a dedicated module, but now that I see it's the whole Text.IO API that's affected and being duplicated, it makes more sense to have a new module Data.Text.IO.Utf8. That way, users who actually don't want to be locale dependent can just do a search-replace on the module name.

@oberblastmeister
Copy link
Contributor Author

@Lysxia @Bodigrim Is there anything else I need to do here?

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, only couple remarks.

src/Data/Text/IO/Utf8.hs Outdated Show resolved Hide resolved
src/Data/Text/IO/Utf8.hs Outdated Show resolved Hide resolved
Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Any chance to write some tests?

@Bodigrim Bodigrim requested a review from Lysxia March 3, 2023 23:16
@oberblastmeister
Copy link
Contributor Author

It doesn't seem like Data.Text.IO has any tests? Also, the code really just builds on stuff from bytestring

@Bodigrim
Copy link
Contributor

Bodigrim commented Mar 4, 2023

There are some basic IO tests in a very strange location:

testGroup "input-output" [
testProperty "t_write_read" t_write_read,
testProperty "tl_write_read" tl_write_read,
testProperty "t_write_read_line" t_write_read_line,
testProperty "tl_write_read_line" tl_write_read_line

@oberblastmeister
Copy link
Contributor Author

@Bodigrim I added tests

@Lysxia
Copy link
Contributor

Lysxia commented Mar 7, 2023

There's a danger with the lack of newline conversion. Windows users are going to be very confused when they read their files on their system and get an unexpected number of characters. It's useful to distinguish two issues here:

  1. We can read UTF-8 text faster because it's the encoding used internally in Text
  2. Often we just know that a file is in UTF-8, so having a shorthand for that case would be useful

We can resolve (1) without changing the API, by adding a special case in the existing IO functions. The matter of newline conversions adds some complexity; currently Data.Text.IO.hGetContents does it unconditionally, diverging from System.IO.

And (2) turns out to not be accurate because of newline conversion. UTF-8 still leaves open the question of how to encode newlines. I think that, either way, Utf8.readFile and the rest would be a footgun because half of the users will assume it works the other way.

@oberblastmeister
Copy link
Contributor Author

Can't we just document that it doesn't convert newlines? The user can convert newlines explicitly if they want, and we can add functions for that.

How would we add a special case for the existing functions?

@Lysxia
Copy link
Contributor

Lysxia commented Mar 9, 2023

I don't think those functions are really that conventional that they pass the Fairbairn threshold. Even if we document the behavior, it's really not obvious what specific circumstances warrant it. The standard encoding is set by the platform and the locale. As far as I can tell, it's only legitimate to ignore that standard when you downloaded a file from a Unix system, or you're talking to another local application which is ignoring the standard. IMO The niche bit of convenience of including Utf8.readFile in text is not worth the risk that it gets used by default for the wrong reasons, unnecessarily crippling portability.

How would we add a special case for the existing functions?

The encoding is determined by the Handle, in the field haCodec. You can compare its value with utf8 and do something different if they're equal.

@Bodigrim
Copy link
Contributor

Bodigrim commented Mar 9, 2023

In the modern environment locale-dependent IO is a larger risk than reading UTF-8 by default. You don't really want to lose all data just because someone accidentally changed system locale to ASCII.

E. g., even GHC itself always expects UTF-8 whatever locale.

We actually do warn about this issue already:

text/src/Data/Text/IO.hs

Lines 70 to 78 in ff4af4c

-- Beware that this function (similarly to 'Prelude.readFile') is locale-dependent.
-- Unexpected system locale may cause your application to read corrupted data or
-- throw runtime exceptions about "invalid argument (invalid byte sequence)"
-- or "invalid argument (invalid character)". This is also slow, because GHC
-- first converts an entire input to UTF-32, which is afterwards converted to UTF-8.
--
-- If your data is UTF-8,
-- using 'Data.Text.Encoding.decodeUtf8' '.' 'Data.ByteString.readFile'
-- is a much faster and safer alternative.

See also https://www.snoyman.com/blog/2016/12/beware-of-readfile/


I've defined this very set of functions in private projects more than once, so I'm keen to have them available from text directly. They are not replacing existing ones, anyone wishing to be locale-dependent can continue to do so.

@Bodigrim
Copy link
Contributor

Bodigrim commented Mar 9, 2023

FWIW bytestring has been ignoring locale-dependent line-ending conversion for decades now.

@Lysxia
Copy link
Contributor

Lysxia commented Mar 9, 2023

The blogpost makes a good point. Maybe I underestimated how many system do assume UTF-8. Consider myself overruled. Do mention the lack of newline conversion then.

It makes sense for bytestring to ignore all of that because it's not necessarily dealing with text.

@Bodigrim Bodigrim merged commit 2538102 into haskell:master Mar 11, 2023
@Bodigrim
Copy link
Contributor

Thanks, @oberblastmeister!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants