Space leak in Data.Aeson.Parser #37

Closed
hvr opened this issue Nov 3, 2011 · 6 comments · Fixed by #40

@hvr
Member

hvr commented Nov 3, 2011

This issue mostly serves to document a known problem and a possible fix/workaround.

The current parser implementation in Data.Aeson.Parser creates many thunks for records and arrays during parsing. Those thunks cause measurable overhead in heap usage and performance, already visible for some of the JSON documents in the Aeson benchmark suite. For larger JSON documents in the MiB range the overhead becomes substantial, leaving Aeson a few times slower than Python's parser.

The benchmark suite itself doesn't evaluate those thunks unless the following diff is applied:

diff --git a/benchmarks/AesonParse.hs b/benchmarks/AesonParse.hs
index a05ed5f..9e74e56 100644
--- a/benchmarks/AesonParse.hs
+++ b/benchmarks/AesonParse.hs
@@ -24,7 +24,9 @@ main = do
           let refill = B.hGet h 16384
           result <- parseWith refill json =<< refill
           case result of
-            Done _ r -> loop (good+1) bad
+            Done _ r -> do
+                        evaluate $ rnf r 
+                        loop (good+1) bad
             _        -> loop good (bad+1)
     (good, _) <- loop 0 0
     delta <- flip diffUTCTime start `fmap` getCurrentTime
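
For the benchmark diff to compile, `evaluate` comes from Control.Exception and `rnf` from Control.DeepSeq. Assuming AesonParse.hs doesn't already import them (and that Value has an NFData instance), it would also need:

import Control.DeepSeq (rnf)        -- rnf :: NFData a => a -> ()
import Control.Exception (evaluate)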

The following modification removes the space leak in Data.Aeson.Parser:

diff --git a/Data/Aeson/Parser.hs b/Data/Aeson/Parser.hs
index ba7b135..8bbe07e 100644
--- a/Data/Aeson/Parser.hs
+++ b/Data/Aeson/Parser.hs
@@ -55,19 +55,23 @@ object_ = {-# SCC "object_" #-} do
         b <- char ':' *> skipSpace *> value
         return (a,b)
   vals <- ((pair <* skipSpace) `sepBy` (char ',' *> skipSpace)) <* char '}'
-  return . Object $ Map.fromList vals
+  return $! Object $! Map.fromList vals

 array_ :: Parser Value
 array_ = {-# SCC "array_" #-} do
   skipSpace
   vals <- ((value <* skipSpace) `sepBy` (char ',' *> skipSpace)) <* char ']'
-  return . Array $ Vector.fromList vals
+  return $! Array $! Vector.fromList vals

 -- | Parse any JSON value.  Use 'json' in preference to this function
 -- if you are parsing data from an untrusted source.
 value :: Parser Value
-value = most <|> (Number <$> number)
+value = most <|> fallback
  where
+  fallback = do
+    n <- number
+    return $! Number $! n
+
   most = do
     c <- satisfy (`B8.elem` "{[\"ftn")
     case c of
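
For context on why the `$!`s matter: `return . Object $ Map.fromList vals` is just `return (Object (Map.fromList vals))`, so the parser hands back an unevaluated thunk and the Map is only built once somebody inspects the result. `($!)` instead evaluates its argument to weak head normal form before returning. A minimal standalone sketch of the difference (not aeson code):

lazyReturn, whnfReturn :: Monad m => Int -> m Int
lazyReturn n = return (n + 1)   -- hands back an unevaluated "n + 1" thunk
whnfReturn n = return $! n + 1  -- evaluates n + 1 to WHNF before returning

-- Note that ($!) only reaches weak head normal form: in the patch above,
-- the Object constructor and the Map's root get forced, but the values
-- stored inside may still be thunks (hence the rnf in the benchmark diff).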

The strictified parser gives the following benchmark results:

-- lazy aeson-0.3.2.12 w/ 'evaluate $ rnf r' benchmark
$ ./AesonParse 700 json-data/jp100.json 
json-data/jp100.json:
  700 good, 3.459478s
  202 per second

-- strictified aeson-0.3.2.12 w/ 'evaluate $ rnf r' benchmark
$ ./AesonParse 700 json-data/jp100.json 
json-data/jp100.json:
  700 good, 3.176462s
  220 per second

-- Python 2.7
$ ./parse.py 700 json-data/jp100.json 
json-data/jp100.json:
  1.49782395363
@basvandijk
Member

Hi Herbert,

I was wondering, have you also tried strictifying and unpacking the fields of the constructors of Value as in:

diff --git a/Data/Aeson/Types/Internal.hs b/Data/Aeson/Types/Internal.hs
index 67946e9..14653d5 100644
--- a/Data/Aeson/Types/Internal.hs
+++ b/Data/Aeson/Types/Internal.hs
@@ -207,10 +207,10 @@ type Object = Map Text Value
 type Array = Vector Value

 -- | A JSON value represented as a Haskell value.
-data Value = Object Object
-           | Array Array
-           | String Text
-           | Number Number
+data Value = Object !Object
+           | Array  {-# UNPACK #-} !Array
+           | String {-# UNPACK #-} !Text
+           | Number !Number
            | Bool !Bool
            | Null
              deriving (Eq, Show, Typeable, Data)

(Of course, unpacking fields does not always improve performance: unpacked fields sometimes have to be reboxed, which actually hurts performance. This has to be benchmarked.)
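
To make the reboxing caveat concrete, here is a minimal sketch with a hypothetical type (not aeson code). With an UNPACK'd strict field the payload is stored inline in the constructor, but handing it to a lazy or polymorphic context forces GHC to allocate a fresh box for it again:

data P = P {-# UNPACK #-} !Int   -- the Int is stored inline, no pointer

-- Fine: strict arithmetic can use the inlined Int directly.
addP :: P -> Int -> Int
addP (P n) m = n + m

-- Reboxing: storing the field in another lazy constructor means GHC must
-- allocate a fresh Int box for it first.
toMaybe :: P -> Maybe Int
toMaybe (P n) = Just n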

@hvr
Member Author

hvr commented Nov 3, 2011

Good point. I faintly remember having tried that, but I don't remember the results anymore. Unboxing the Text value didn't seem to make much sense at the time, as those kinds of values are often used as-is in the data model the JSON gets parsed into. However, I have to redo the benchmarks with the current code base anyway, and then I'll also measure the effect of your proposed unboxing. :-)

@bos closed this as completed in dab1302 Nov 17, 2011
@bos reopened this Nov 17, 2011
@bos
Collaborator

bos commented Nov 17, 2011

I have not found that @basvandijk's change actually fixes this problem.

It does make a difference to the amount of memory allocated, but it's small.

I ran with a 182KB sample JSON file that @hvr gave me a few months ago.

./AesonParse 100 anim.json +RTS -s

Lazy fields ran in 4.30 seconds.

 5,850,752,984 bytes allocated in the heap
 2,053,429,552 bytes copied during GC
     7,026,728 bytes maximum residency (177 sample(s))
       261,096 bytes maximum slop
            21 MB total memory in use (0 MB lost due to fragmentation)

%GC     time      64.5%  (65.1% elapsed)

Alloc rate    4,041,652,667 bytes per MUT second

Productivity  35.5% of total user, 33.7% of total elapsed

Strict fields ran in 4.16 seconds.

 5,807,525,768 bytes allocated in the heap
 1,924,700,040 bytes copied during GC
     6,465,672 bytes maximum residency (179 sample(s))
       244,960 bytes maximum slop
            20 MB total memory in use (0 MB lost due to fragmentation)

%GC     time      61.3%  (62.1% elapsed)

Alloc rate    3,807,971,406 bytes per MUT second

Productivity  38.7% of total user, 36.7% of total elapsed

@bos
Collaborator

bos commented Nov 17, 2011

I should take a moment to describe why I have resisted making AesonParse fully evaluate the result of a parse, and why the individual parsing components do not fully evaluate their results before returning them.

When we're parsing, we don't know which fields in a Value the consumer of the result will actually need. If strictly evaluating all of those fields were free, that would be great, but in my measurements the kind of change that @hvr suggests at the top of this issue has always slowed parsing down by 30% to 50% when we need only a few of the fields.

Granted, the downside of the current somewhat-lazy approach is that it causes many thunks to be allocated. If the consumer really does need all the fields, then we've failed to avoid any work; instead, we've caused ourselves extra work and intermediate allocation due to the thunking.

I have not yet been able to come up with a satisfactory one-size-fits-all solution. I am quite fond of the additional performance we get by deferring complete evaluation, but I am unhappy about the additional allocation we perform when the deferred evaluation is not needed.

The closest I can think of to a good solution is to have two parsing modules, one that's fully strict and one that's partly lazy. If you have any better ideas, please let me know :-)
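
A minimal sketch of how the two modules could share one grammar (the `strictly` combinator and the naming are hypothetical, not an existing aeson API):

import Control.DeepSeq (NFData, rnf)

-- Wrap any parser so its result is deeply evaluated before being
-- returned, leaving no thunks behind.
strictly :: (Monad p, NFData a) => p a -> p a
strictly p = do
  x <- p
  rnf x `seq` return x

-- The partly-lazy module would export `value` as-is; the strict module
-- could export `strictly value` under the same name.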

@hvr
Member Author

hvr commented Nov 17, 2011

> The closest I can think of to a good solution is to have two parsing modules, one that's fully strict and one that's partly lazy. If you have any better ideas, please let me know :-)

IMHO, that's a legitimate solution, i.e. letting the user choose if there really isn't a one-size-fits-all parser :-)

I'm wondering what percentage of the JSON parse tree (if that can be roughly estimated at all) actually has to be consumed for the strict parser to become more efficient than the lazy one.
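
One way to estimate that crossover empirically (a hypothetical harness, written against the Map-based Object shown in the diff above): after parsing, deep-force only the first fraction p of each node's children and time the loop for varying p:

import Data.Aeson.Types (Value (..))
import qualified Data.Map as Map
import qualified Data.Vector as V

-- Deep-force roughly a fraction p of the parse tree, leaving the rest lazy.
forceFraction :: Double -> Value -> ()
forceFraction p = go
  where
    go (Object m) = forcePrefix (Map.elems m)
    go (Array xs) = forcePrefix (V.toList xs)
    go v          = v `seq` ()
    -- force the first p-fraction of the children, leave the rest as thunks
    forcePrefix xs =
      let k = ceiling (p * fromIntegral (length xs))
      in  foldr (\x r -> go x `seq` r) () (take k xs)

Plugged into the benchmark loop as `evaluate $ forceFraction p r`, sweeping p from 0 to 1 would show where the strict parser overtakes the lazy one.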

hvr pushed a commit to hvr/aeson that referenced this issue Nov 30, 2011
This changeset builds on top of
dab1302, and just strictifies the
monadic `return`s used in the attoparsec `Parser` monad to avoid
leaving thunks behind during parsing.

This addresses issue haskell#37
@bos
Collaborator

bos commented Dec 1, 2011

This is now fixed.

@bos closed this as completed Dec 1, 2011
tolysz pushed a commit to tolysz/aeson that referenced this issue May 18, 2015
This improves the running time of the AesonParse benchmarks.
I don't think there's a use-case for lazy fields.
Fixes haskell#37
tolysz pushed a commit to tolysz/aeson that referenced this issue May 18, 2015
This changeset builds on top of
dab1302, and just strictifies the
monadic `return`s used in the attoparsec `Parser` monad to avoid
leaving thunks behind during parsing.

This addresses issue haskell#37