Incorrect result for `(== 1) . length . filter (== ',')` depending on -O level #197

hvr · 2017-09-07T12:49:07Z

@raichoo reports a nasty bug related to text (it's still unclear what exactly is going on here, although I suspect text's RULE framework to be the culprit -- I hope so at least, as the alternative would mean that several GHC releases out there may generate wrong code):

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE BangPatterns #-}
module Main where

import Prelude hiding (length, filter)
import Data.Text

currencyParser :: Text -> Bool
currencyParser x = cond == 1
  where
    cond = length fltr
    fltr = filter (== ',') x

main :: IO ()
main = print (currencyParser "0,00")

when compiling this with -O0 or running in GHCi, the proper True output is generated. However, when compiling with -O1 the wrong False output is generated.

The text was updated successfully, but these errors were encountered:

raichoo · 2017-09-07T13:00:42Z

Ah, you were faster than me ^^.

raichoo · 2017-09-07T13:37:13Z

Removing this RULE from Data.Text. Fixes the issue:

{-# RULES
"TEXT ==N/length -> compareLength/==EQ" [~1] forall t n.
    eqInt (length t) n = compareLength t n == EQ
  #-}

EDIT: posted the wrong RULE.

raichoo · 2017-09-07T16:33:40Z

If you are hit by this, I'm currently using the following workaround: strictly evaluate fltr using either seq or BangPatterns.

raichoo · 2017-09-07T19:32:28Z

So, I've narrowed this down a little more. Looking at Core. Please tell me if I'm barking up the wrong tree here.

I'm currently working with this code:

wat :: Bool
wat = length (filter (== ',') "0,00") == 1

This is the code that gets emitted with optimizations turned on.

wat
wat =
  case unpackCString# "0,00"#
  of _ { Text dt_a6KV dt1_a6KW dt2_a6KX ->
  case tagToEnum# (<# dt2_a6KX 1#) of _ {
    False ->
      let {
        len_s6W8
        len_s6W8 = uncheckedIShiftRA# dt2_a6KX 1# } in
      case tagToEnum# (># len_s6W8 1#) of _ {
        False ->
          let {
            $j_a6LS
            $j_a6LS =
              \ _ ->
                let {
                  end_a6KU
                  end_a6KU = +# dt1_a6KW dt2_a6KX } in
                letrec {
                  $wloop_cmp_s6Zq
                  $wloop_cmp_s6Zq =
                    \ ww_s6Zk ww1_s6Zo ->
                      case tagToEnum# (>=# ww1_s6Zo end_a6KU) of _ {
                        False ->
                          case indexWord16Array# dt_a6KV ww1_s6Zo of r#_a6L7 { __DEFAULT ->
                          case tagToEnum# (geWord# r#_a6L7 55296##) of _ {
                            False ->
                              case chr# (word2Int# r#_a6L7) of _ {
                                __DEFAULT -> $wloop_cmp_s6Zq ww_s6Zk (+# ww1_s6Zo 1#);
                                ','# ->
                                  case tagToEnum# (># ww_s6Zk 1#) of _ {
                                    False -> $wloop_cmp_s6Zq (+# ww_s6Zk 1#) (+# ww1_s6Zo 1#);
                                    True -> GT
                                  }
                              };
                            True ->
                              case tagToEnum# (leWord# r#_a6L7 56319##) of _ {
                                False ->
                                  case chr# (word2Int# r#_a6L7) of _ {
                                    __DEFAULT -> $wloop_cmp_s6Zq ww_s6Zk (+# ww1_s6Zo 1#);
                                    ','# ->
                                      case tagToEnum# (># ww_s6Zk 1#) of _ {
                                        False -> $wloop_cmp_s6Zq (+# ww_s6Zk 1#) (+# ww1_s6Zo 1#);
                                        True -> GT
                                      }
                                  };
                                True ->
                                  case indexWord16Array# dt_a6KV (+# ww1_s6Zo 1#)
                                  of r#1_a6Li { __DEFAULT ->
                                  case chr#
                                         (+#
                                            (+#
                                               (uncheckedIShiftL#
                                                  (-# (word2Int# r#_a6L7) 55296#) 10#)
                                               (-# (word2Int# r#1_a6Li) 56320#))
                                            65536#)
                                  of _ {
                                    __DEFAULT -> $wloop_cmp_s6Zq ww_s6Zk (+# ww1_s6Zo 2#);
                                    ','# ->
                                      case tagToEnum# (># ww_s6Zk 1#) of _ {
                                        False -> $wloop_cmp_s6Zq (+# ww_s6Zk 1#) (+# ww1_s6Zo 2#);
                                        True -> GT
                                      }
                                  }
                                  }
                              }
                          }
                          };
                        True ->
                          case tagToEnum# (<# ww_s6Zk 1#) of _ {
                            False ->
                              case ww_s6Zk of _ {
                                __DEFAULT -> GT;
                                1# -> EQ
                              };
                            True -> LT
                          }
                      }; } in
                $wloop_cmp_s6Zq 0# dt1_a6KW } in
          case len_s6W8 of _ {
            __DEFAULT ->
              case $j_a6LS void# of _ {
                __DEFAULT -> False;
                EQ -> True
              };
            1# ->
              case dt2_a6KX of _ {
                __DEFAULT ->
                  case $j_a6LS void# of _ {
                    __DEFAULT -> False;
                    EQ -> True
                  };
                1# -> True
              }
          };
        True -> False
      };
    True -> False
  }
  }

What I found interesting was the length check at the beginning:

  of _ { Text dt_a6KV dt1_a6KW dt2_a6KX ->
  case tagToEnum# (<# dt2_a6KX 1#) of _ {
    False ->
      let {
        len_s6W8
        len_s6W8 = uncheckedIShiftRA# dt2_a6KX 1# } in
      case tagToEnum# (># len_s6W8 1#) of _ {

I've written a little piece of code the emulate this check:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text
import Data.Text.Internal
import GHC.Exts

import GHC.Int

main :: IO ()
main = do
  print (showText ("0,00" :: Text))
  print (tagToEnum# ((>#) (uncheckedIShiftRA64# 4# 1#) 1#) :: Bool)

Turns out it yields True. The corresponding True branch however yields False directly (see end of Core output)

raichoo · 2017-09-07T19:57:18Z

This looks a lot like it. I've tried wat with different string lengths (adding and removing "0") with less than 4 characters my test yields True once it reaches 4 it yieds False.

Looks like this piece of code forces me into the false True branch once the string is longer than 3 character.

        len_s6W8 = uncheckedIShiftRA# dt2_a6KX 1#

Where dt2_a6KX is the length of the string.

raichoo · 2017-09-07T21:44:24Z

I took a look at the following function and added some tracing:

compareLengthI :: Integral a => Stream Char -> a -> Ordering
compareLengthI (Stream next s0 len) n =
    case compareSize (trace ("len: " ++ show len) len) (fromIntegral n) of
      Just o  -> trace ("n: " ++ show (fromIntegral n)) (trace ("o: " ++ show o) )o
      Nothing -> trace ("n: " ++ show (fromIntegral n)) (loop_cmp 0 s0)
    where
      loop_cmp !z s  = case next s of
                         Done       -> compare z n
                         Skip    s' -> loop_cmp z s'
                         Yield _ s' | z > n     -> GT
                                    | otherwise -> loop_cmp (z + 1) s'

The output I get is rather interesting

len: Between 2 4
n: 1
o: GT
False

This looks like it's comparing the length of "0,00" (which has a length of Between 2 4) rather than the filtered string to 1 which yields GT which is not EQ which is required by
this RULE:

{-# RULES
"TEXT ==N/length -> compareLength/==EQ" [~1] forall t n.
    eqInt (length t) n = compareLength t n == EQ
  #-}

raichoo · 2017-09-07T21:58:45Z

Rewriting wat using seq gives me the right answer (like I mentioned above):

wat :: Bool
wat = x `seq` length x == 1
  where
    x = filter (== ',') "1,00"

Which yield the following correct output:

len: Between 0 1
n: 1
True

The optimized version is comparing the length of the unfiltered string to 1 if the filter expression has not been forced before.

raichoo · 2017-09-08T09:55:24Z

Looking at core2core output. If my assumptions are correct, this is the last version of the code that is correct, after that we it seems to be checking the lenght of the original string.

wat
wat =
  case compareLengthI
         $fIntegralInt
         (filter
            (\ ds_d6kf ->
               case ds_d6kf of _ { C# x_a6u0 ->
               case x_a6u0 of _ {
                 __DEFAULT -> False;
                 ','# -> True
               }
               })
            (stream (unpackCString# "0,00"#)))
         (I# 1#)
  of _ {
    __DEFAULT -> False;
    EQ -> True
  }

This happens quite early in the process. I've disabled the "TEXT literal" RULE (which also fires quite early, hence my suspicion that it's somewhat involved in the process) and that seems to fix the behavior as well, as well as hopefully not breaking stream fusion as my other attempts might.

bgamari · 2017-09-08T16:09:10Z

Fix coming thanks to @nomeata.

raichoo · 2017-09-08T16:51:10Z

@nomeata Can I read up somewhere on how you debugged this? :)

nomeata · 2017-09-08T21:05:34Z

Well, after seeing in the Core that streams are involved, my first hypothesis was that compareLengthI was simply buggy when counting streams with Skip, which filter produces. But the code of compareLengthI looked fine (although one > could be a >= for efficiency, I believe).

In the code I saw that streams cache information about their length (upper bound and lower bounds, if present). And that made me think: I wonder if these caches are always up-to-date. And that hit it on the spot: The filter function claimed that the output had the same length as the input. Which is obviously true.

There are quite a few more functions where the len field is not properly updated…

nomeata · 2017-09-08T21:11:33Z

Althought the bug might well be in compareLengthI, as this comment says that the len field is only a hint:

-- The length hint in a Stream has two roles.  If its value is zero,
-- we trust it, and treat the stream as empty.  Otherwise, we treat it
-- as a hint: it should usually be accurate, so we use it when
-- unstreaming to decide what size array to allocate.  However, the
-- unstreaming functions must be able to cope with the hint being too
-- small or too large.
--
-- The size hint tries to track the UTF-16 code points in a stream,
-- but often counts the number of characters instead.  It can easily
-- undercount if, for instance, a transformed stream contains astral
-- plane characters (those above 0x10000).

Although I suspect that this comment reflects the state before the Size data type had explicit upper and lower bounds, and that nowadays it is meant to be correct. The text authors will now.

(Judging from the code I cannot exploit this bug to override arbitrary memory…)

nomeata · 2017-09-17T01:23:28Z

@bgamari, you started fixing this bug, didn’t you? Are you going to submit a patch? (Don’t forget the other functions that don’t update the len field accordingly.)

bgamari · 2017-09-17T04:11:43Z

Joachim Breitner <notifications@github.com> writes:

@bgamari, you started fixing this bug, didn’t you? Are you going to submit a patch? (Don’t forget the other functions that don’t update the `len` field accordingly.)

Indeed I did; it's been on my list of things to finish up but other things have been getting in the way. I'll be hiking tomorrow, but I'll try to get to it on Monday.

nomeata · 2017-09-17T04:53:32Z

Ok, no hurry from my side, I just wanted to make sure it is on someone’s list.

bgamari · 2017-09-29T23:13:45Z

I'm beginning to think that it is the compareLengthI change in a66cbb7 that is at fault here. Afterall, even assuming that we are careful about ensuring that the Size hint is always correct, the compareLengthI implementation assumes that the Size is measuring code points. However, it is not; it is measuring UTF-16 code units.This is because the hint was originally intended to serve as an initial buffer size for unstream (and therefore could be approximate), not act as a hard bound.

If we really want the compareLengthI optimization I believe we would want to track length in code points in addition to length in code units.

hvr · 2017-09-30T10:38:04Z

@bgamari here's something I should have done right away; bisecting through hackage releases: The result appears to be that the sample program above reports the correct result with text-1.1.0.1 but fails with text-1.1.1.0 (NB: you need GHC 7.8.4 or older to build/run those versions). Consequently, something in

1.1.0.1...1.1.1.0

likely caused the regression. However, the commit ~~a02670d~~ you mention predates 1.1.0.1

EDIT: so it was rather a66cbb7 (motivated by #76 & #75).... and so this regression was introduced by @ekmett it seems

bgamari · 2017-09-30T13:27:04Z

Oh dear, sorry, I cited the wrong commit; I meant a66cbb7. I have updated the original comment as well.

ekmett · 2017-09-30T18:33:20Z

My vague recollection of the original intent was to get a conservative LT answer based on a bound of how bad the expansion could be. We should be able to salvage something similar.

bgamari · 2017-09-30T18:41:19Z

Alright, fair enough; that should be doable

There were size hint issues throughout the fusion implementation. See haskell#197.

This fixes a variety of size hint bugs in text's fusion framework. These issues fell broadly into two classes, * Code point/code unit confusion * Inappropriate bounds It seems the most of the latter were introduced when the Size type was extended to track both upper and lower bounds in f4fc30c. These could manifest in a variety of issues similar to haskell#197.

Fix usage of size hints which resulted in serious bugs such as operations like `(== 1) . length . filter (== ',')` (see #197) giving wrong results.

hvr · 2017-12-17T18:01:23Z

This was fixed by #200