Add Text.IO.Utf8 module #503

oberblastmeister · 2023-02-10T13:53:27Z

Solves #472

Lysxia · 2023-02-13T10:55:18Z

The title of issue #472 mentions using ShortByteString for this. What's the status of that?

If I understand correctly, this is about variants of Data.Text.IO.readFile, etc. that don't depend on the locale and which may avoid a copy compared to fmap decodeUtf8 . Data.ByteString.readFile (although the current implementation doesn't do that). Any other benefit I'm missing? In any case that sounds reasonable to me.
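For reference, the baseline composition mentioned above can be sketched as follows. This is only an illustration: `readFileUtf8` is a hypothetical name, not the function added by this PR, and the sketch uses `decodeUtf8'` so that invalid input is reported as a value instead of an exception.

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text as T
import           Data.Text.Encoding (decodeUtf8')
import           Data.Text.Encoding.Error (UnicodeException)

-- | Hypothetical helper (not part of text): read a file as UTF-8,
-- ignoring the system locale. The bytes are read first and decoded
-- afterwards, which is where the extra copy comes from.
readFileUtf8 :: FilePath -> IO (Either UnicodeException Text)
readFileUtf8 = fmap decodeUtf8' . B.readFile

main :: IO ()
main = do
  let path = "utf8-read-demo.txt"  -- scratch file used only by this demo
  -- The bytes of "h\233\n" (h, e-acute, newline) encoded as UTF-8.
  B.writeFile path (B.pack [0x68, 0xC3, 0xA9, 0x0A])
  result <- readFileUtf8 path
  case result of
    Left err  -> putStrLn ("invalid UTF-8: " ++ show err)
    Right txt -> putStrLn ("decoded " ++ show (T.length txt) ++ " characters")
```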

Wouldn't it be better to add to Data.Text.IO instead of creating yet another module?

@haskell/text any objections?

oberblastmeister · 2023-02-13T15:53:07Z

Yeah, the intention is to make it faster in the future once I finish haskell/bytestring#547.

I generally like using qualified imports instead of adding suffixes to everything, so that's why I put it in a separate module. I'm fine with putting it in Data.Text.IO also.

Bodigrim · 2023-02-13T23:16:27Z

I'd mildly prefer to fold it into Data.Text.IO, but I'm fine either way.

Shall we provide the same set of functions for lazy Text as well?

parsonsmatt

This change looks overall good - especially if we can get a zero-copy via ShortByteString!

Is there any intent to deprecate the existing readFile, possibly pointing it to a named function readFileWithSystemLocale? That would describe the most common footgun and allow people to migrate either to the existing behavior, or to this new behavior.

src/Data/Text/IO/Utf8.hs

oberblastmeister · 2023-02-15T15:05:21Z

How should I group the documentation in the Data.Text.IO module?

Lysxia

When it was just the three file functions it seemed odd to have a dedicated module, but now that I see it's the whole Text.IO API that's affected and being duplicated, it makes more sense to have a new module Data.Text.IO.Utf8. That way, users who actually don't want to be locale dependent can just do a search-replace on the module name.

This reverts commit 306f7fa.

oberblastmeister · 2023-02-28T21:54:07Z

@Lysxia @Bodigrim Is there anything else I need to do here?

Bodigrim

Overall looks good to me, only a couple of remarks.

src/Data/Text/IO/Utf8.hs

Bodigrim

LGTM. Any chance to write some tests?

oberblastmeister · 2023-03-04T23:11:58Z

It doesn't seem like Data.Text.IO has any tests? Also, the code really just builds on stuff from bytestring.

Bodigrim · 2023-03-04T23:36:55Z

There are some basic IO tests in a very strange location:

text/tests/Tests/Properties/LowLevel.hs

Lines 141 to 145 in ff4af4c

    
    testGroup "input-output" [
      testProperty "t_write_read" t_write_read,
      testProperty "tl_write_read" tl_write_read,
      testProperty "t_write_read_line" t_write_read_line,
      testProperty "tl_write_read_line" tl_write_read_line

oberblastmeister · 2023-03-06T23:53:29Z

@Bodigrim I added tests

Lysxia · 2023-03-07T17:34:50Z

There's a danger with the lack of newline conversion. Windows users are going to be very confused when they read their files on their system and get an unexpected number of characters. It's useful to distinguish two issues here:

1. We can read UTF-8 text faster because it's the encoding used internally in Text
2. Often we just know that a file is in UTF-8, so having a shorthand for that case would be useful

We can resolve (1) without changing the API, by adding a special case in the existing IO functions. The matter of newline conversions adds some complexity; currently Data.Text.IO.hGetContents does it unconditionally, diverging from System.IO.

And (2) turns out to not be accurate because of newline conversion. UTF-8 still leaves open the question of how to encode newlines. I think that, either way, Utf8.readFile and the rest would be a footgun because half of the users will assume it works the other way.

oberblastmeister · 2023-03-07T20:20:25Z

Can't we just document that it doesn't convert newlines? The user can convert newlines explicitly if they want, and we can add functions for that.
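As a minimal sketch of what explicit conversion could look like on the user's side, assuming the text has already been read without newline translation: `crlfToLf` is a hypothetical helper name, not something this PR adds.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical helper: turn Windows line endings (CRLF) into LF.
-- A lone '\r' that is not followed by '\n' is left untouched.
crlfToLf :: Text -> Text
crlfToLf = T.replace "\r\n" "\n"

main :: IO ()
main = do
  let raw = "one\r\ntwo\r\n" :: Text
  print (T.length raw)             -- 10: each line ending counts as two characters
  print (T.length (crlfToLf raw))  -- 8
  print (T.lines raw)              -- ["one\r","two\r"]
  print (T.lines (crlfToLf raw))   -- ["one","two"]
```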

How would we add a special case for the existing functions?

Lysxia · 2023-03-09T00:27:30Z

I don't think those functions are really that conventional that they pass the Fairbairn threshold. Even if we document the behavior, it's really not obvious what specific circumstances warrant it. The standard encoding is set by the platform and the locale. As far as I can tell, it's only legitimate to ignore that standard when you downloaded a file from a Unix system, or you're talking to another local application which is ignoring the standard. IMO the niche bit of convenience of including Utf8.readFile in text is not worth the risk that it gets used by default for the wrong reasons, unnecessarily crippling portability.

> How would we add a special case for the existing functions?

The encoding is determined by the Handle, in the field haCodec. You can compare its value with utf8 and do something different if they're equal.
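One way such a check might look from outside the library is sketched below. It is an assumption-laden illustration, not the approach taken in text: it reads the handle's encoding through the public `System.IO.hGetEncoding` and, because `TextEncoding` has no `Eq` instance, compares the encoding's name instead of the value itself. `isUtf8Handle` is a hypothetical name.

```haskell
import System.IO

-- | Hypothetical check: is this handle currently set to decode UTF-8?
-- 'hGetEncoding' returns 'Nothing' for handles in binary mode.
-- The comparison relies on 'show' for 'TextEncoding' yielding the
-- encoding's name, which is "UTF-8" for 'utf8'.
isUtf8Handle :: Handle -> IO Bool
isUtf8Handle h = do
  menc <- hGetEncoding h
  pure (fmap show menc == Just "UTF-8")

main :: IO ()
main = do
  let path = "codec-demo.txt"  -- scratch file used only by this demo
  writeFile path "demo"
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    isUtf8Handle h >>= print   -- a fast path could be taken here
    hSetEncoding h latin1
    isUtf8Handle h >>= print   -- fall back to the generic decoder
    hSetBinaryMode h True
    isUtf8Handle h >>= print   -- binary handles carry no encoding
```

A real fast path would still have to deal with the handle's newline mode, which is the complication raised earlier in this thread.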

Bodigrim · 2023-03-09T01:27:17Z

In the modern environment locale-dependent IO is a larger risk than reading UTF-8 by default. You don't really want to lose all data just because someone accidentally changed system locale to ASCII.

E.g., even GHC itself always expects UTF-8 whatever locale.
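The corruption risk can be reproduced without touching the system locale. The sketch below is an illustration only: it forces Latin-1 on a handle to stand in for an unexpected locale, and `readLineWith` is a hypothetical helper.

```haskell
import qualified Data.ByteString as B
import           System.IO

-- | Hypothetical helper: read the first line of a file using the
-- given encoding instead of the one derived from the locale.
readLineWith :: TextEncoding -> FilePath -> IO String
readLineWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    hGetLine h

main :: IO ()
main = do
  let path = "locale-demo.txt"  -- scratch file used only by this demo
  -- U+00E9 (e-acute) encoded as UTF-8 is the two bytes C3 A9.
  B.writeFile path (B.pack [0xC3, 0xA9])
  wrong <- readLineWith latin1 path  -- stands in for a mismatched locale
  right <- readLineWith utf8 path
  print (map fromEnum wrong)  -- [195,169]: two unrelated characters
  print (map fromEnum right)  -- [233]: the intended character
```

With a stricter encoding such as ASCII the same read would fail with an exception instead of silently producing the wrong characters.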

We actually do warn about this issue already:

text/src/Data/Text/IO.hs

Lines 70 to 78 in ff4af4c

    
    -- Beware that this function (similarly to 'Prelude.readFile') is locale-dependent.
    -- Unexpected system locale may cause your application to read corrupted data or
    -- throw runtime exceptions about "invalid argument (invalid byte sequence)"
    -- or "invalid argument (invalid character)". This is also slow, because GHC
    -- first converts an entire input to UTF-32, which is afterwards converted to UTF-8.
    --
    -- If your data is UTF-8,
    -- using 'Data.Text.Encoding.decodeUtf8' '.' 'Data.ByteString.readFile'
    -- is a much faster and safer alternative.

See also https://www.snoyman.com/blog/2016/12/beware-of-readfile/

I've defined this very set of functions in private projects more than once, so I'm keen to have them available from text directly. They are not replacing existing ones, anyone wishing to be locale-dependent can continue to do so.
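For readers who have not written such helpers themselves, a minimal sketch of the write side is shown below. The names `writeFileUtf8` and `hPutStrUtf8` are hypothetical and the definitions are not claimed to match the module added in this PR; they only show the usual pattern of encoding to a `ByteString` and doing binary IO, which is independent of the locale and performs no newline conversion.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import           Data.Text (Text)
import           Data.Text.Encoding (decodeUtf8, encodeUtf8)
import           System.IO (Handle, IOMode (AppendMode), withFile)

-- | Hypothetical helper: write a 'Text' to a file as UTF-8.
writeFileUtf8 :: FilePath -> Text -> IO ()
writeFileUtf8 path = B.writeFile path . encodeUtf8

-- | Hypothetical helper: write a 'Text' to a handle as UTF-8,
-- bypassing whatever encoding the handle is configured with.
hPutStrUtf8 :: Handle -> Text -> IO ()
hPutStrUtf8 h = B.hPutStr h . encodeUtf8

main :: IO ()
main = do
  let path   = "utf8-write-demo.txt"  -- scratch file used only by this demo
      first  = "na\239ve\n" :: Text   -- contains U+00EF
      second = "caf\233\n"  :: Text   -- contains U+00E9
  writeFileUtf8 path first
  withFile path AppendMode $ \h -> hPutStrUtf8 h second
  -- 'decodeUtf8' throws on invalid input; the file was just written
  -- as valid UTF-8, so that is acceptable for this demo.
  roundTripped <- decodeUtf8 <$> B.readFile path
  print (roundTripped == first <> second)
```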

Bodigrim · 2023-03-09T01:28:16Z

FWIW bytestring has been ignoring locale-dependent line-ending conversion for decades now.

Lysxia · 2023-03-09T13:05:16Z

The blogpost makes a good point. Maybe I underestimated how many systems assume UTF-8. Consider myself overruled. Do mention the lack of newline conversion then.

It makes sense for bytestring to ignore all of that because it's not necessarily dealing with text.

Bodigrim · 2023-03-11T11:02:42Z

Thanks, @oberblastmeister!

Add Text.IO.Utf8 module

2ec84e2

Lysxia added the API Addition (PVP minor) label Feb 13, 2023

parsonsmatt reviewed Feb 14, 2023

oberblastmeister added 3 commits February 15, 2023 09:48

add more functions

a868f4b

make decoding strict in interact

66dab03

move back into Data.Text.IO

306f7fa

Lysxia reviewed Feb 21, 2023

Revert "move back into Data.Text.IO"

7e9d7e4

This reverts commit 306f7fa.

Bodigrim reviewed Mar 3, 2023

oberblastmeister added 2 commits March 2, 2023 20:46

remove unnecessary documentation

5e0bd99

add module to cabal file and add note

acec5dc

Bodigrim approved these changes Mar 3, 2023

Bodigrim requested a review from Lysxia March 3, 2023 23:16

add tests

a59e97e

oberblastmeister added 2 commits March 10, 2023 13:54

fix import

e29b635

add docs

23c69de

Lysxia approved these changes Mar 11, 2023

Bodigrim merged commit 2538102 into haskell:master Mar 11, 2023

Bodigrim mentioned this pull request Apr 3, 2023

IO operations for ShortByteString haskell/bytestring#547

Open

gromakovsky mentioned this pull request Sep 25, 2023

Delete Data.Text.IO.Utf8 module serokell/haskell-with-utf8#24

Open

BebeSparkelSparkel mentioned this pull request Apr 28, 2024

Integrate UTF-8 hPutStr to standard hPutStr #589

Merged
