Space leak in Data.Aeson.Parser #37

Closed
hvr opened this issue Nov 3, 2011 · 6 comments · Fixed by #40

@hvr
Member

hvr commented Nov 3, 2011

This issue mostly serves to document a known problem and a possible fix/workaround.

The current parser implementation in Data.Aeson.Parser creates many thunks for records and arrays during parsing. Those thunks cause measurable overhead in heap usage and performance, already visible for some of the JSON documents in the Aeson benchmark suite. For larger JSON documents in the MiB range the overhead becomes substantial, leaving Aeson a few times slower than Python's parser.

The benchmark suite itself doesn't evaluate those thunks unless the following diff is applied:

diff --git a/benchmarks/AesonParse.hs b/benchmarks/AesonParse.hs
index a05ed5f..9e74e56 100644
--- a/benchmarks/AesonParse.hs
+++ b/benchmarks/AesonParse.hs
@@ -24,7 +24,9 @@ main = do
           let refill = B.hGet h 16384
           result <- parseWith refill json =<< refill
           case result of
-            Done _ r -> loop (good+1) bad
+            Done _ r -> do
+                        evaluate $ rnf r 
+                        loop (good+1) bad
             _        -> loop good (bad+1)
     (good, _) <- loop 0 0
     delta <- flip diffUTCTime start `fmap` getCurrentTime
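
For the benchmark diff to compile, `evaluate` comes from Control.Exception and `rnf` from Control.DeepSeq. Assuming AesonParse.hs doesn't already import them (and that Value has an NFData instance), it would also need:

import Control.DeepSeq (rnf)        -- rnf :: NFData a => a -> ()
import Control.Exception (evaluate)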

The following modification removes the space leak in Data.Aeson.Parser:

diff --git a/Data/Aeson/Parser.hs b/Data/Aeson/Parser.hs
index ba7b135..8bbe07e 100644
--- a/Data/Aeson/Parser.hs
+++ b/Data/Aeson/Parser.hs
@@ -55,19 +55,23 @@ object_ = {-# SCC "object_" #-} do
         b <- char ':' *> skipSpace *> value
         return (a,b)
   vals <- ((pair <* skipSpace) `sepBy` (char ',' *> skipSpace)) <* char '}'
-  return . Object $ Map.fromList vals
+  return $! Object $! Map.fromList vals

 array_ :: Parser Value
 array_ = {-# SCC "array_" #-} do
   skipSpace
   vals <- ((value <* skipSpace) `sepBy` (char ',' *> skipSpace)) <* char ']'
-  return . Array $ Vector.fromList vals
+  return $! Array $! Vector.fromList vals

 -- | Parse any JSON value.  Use 'json' in preference to this function
 -- if you are parsing data from an untrusted source.
 value :: Parser Value
-value = most <|> (Number <$> number)
+value = most <|> fallback
  where
+  fallback = do
+    n <- number
+    return $! Number $! n
+
   most = do
     c <- satisfy (`B8.elem` "{[\"ftn")
     case c of
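
For context on why the `$!`s matter: `return . Object $ Map.fromList vals` is just `return (Object (Map.fromList vals))`, so the parser hands back an unevaluated thunk and the Map is only built once somebody inspects the result. `($!)` instead evaluates its argument to weak head normal form before returning. A minimal standalone sketch of the difference (not aeson code):

lazyReturn, whnfReturn :: Monad m => Int -> m Int
lazyReturn n = return (n + 1)   -- hands back an unevaluated "n + 1" thunk
whnfReturn n = return $! n + 1  -- evaluates n + 1 to WHNF before returning

-- Note that ($!) only reaches weak head normal form: in the patch above,
-- the Object constructor and the Map's root get forced, but the values
-- stored inside may still be thunks (hence the rnf in the benchmark diff).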

The strictified parser gives the following benchmark results:

-- lazy aeson-0.3.2.12 w/ 'evaluate $ rnf r' benchmark
$ ./AesonParse 700 json-data/jp100.json 
json-data/jp100.json:
  700 good, 3.459478s
  202 per second

-- strictified aeson-0.3.2.12 w/ 'evaluate $ rnf r' benchmark
$ ./AesonParse 700 json-data/jp100.json 
json-data/jp100.json:
  700 good, 3.176462s
  220 per second

-- Python 2.7
$ ./parse.py 700 json-data/jp100.json 
json-data/jp100.json:
  1.49782395363
@basvandijk
Member

Hi Herbert,

I was wondering, have you also tried strictifying and unpacking the fields of the constructors of Value as in:

diff --git a/Data/Aeson/Types/Internal.hs b/Data/Aeson/Types/Internal.hs
index 67946e9..14653d5 100644
--- a/Data/Aeson/Types/Internal.hs
+++ b/Data/Aeson/Types/Internal.hs
@@ -207,10 +207,10 @@ type Object = Map Text Value
 type Array = Vector Value

 -- | A JSON value represented as a Haskell value.
-data Value = Object Object
-           | Array Array
-           | String Text
-           | Number Number
+data Value = Object !Object
+           | Array  {-# UNPACK #-} !Array
+           | String {-# UNPACK #-} !Text
+           | Number !Number
            | Bool !Bool
            | Null
              deriving (Eq, Show, Typeable, Data)

(Of course, unpacking fields does not always improve performance: unpacked fields sometimes have to be reboxed, which actually hurts performance. This has to be benchmarked.)
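
To make the reboxing caveat concrete, here is a minimal sketch with a hypothetical type (not aeson code). With an UNPACK'd strict field the payload is stored inline in the constructor, but handing it to a lazy or polymorphic context forces GHC to allocate a fresh box for it again:

data P = P {-# UNPACK #-} !Int   -- the Int is stored inline, no pointer

-- Fine: strict arithmetic can use the inlined Int directly.
addP :: P -> Int -> Int
addP (P n) m = n + m

-- Reboxing: storing the field in another lazy constructor means GHC must
-- allocate a fresh Int box for it first.
toMaybe :: P -> Maybe Int
toMaybe (P n) = Just n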

@hvr
Member Author

hvr commented Nov 3, 2011

Good point. I faintly remember having tried that, but I don't remember the results anymore. Unboxing the Text value didn't seem to make much sense at the time, as those kinds of values are often used as-is in the data model the JSON gets parsed into. However, I have to redo the benchmarks with the current code base anyway, and then I'll also measure the effect of your proposed unboxing. :-)

@bos closed this as completed in dab1302 Nov 17, 2011
@bos reopened this Nov 17, 2011
@bos
Collaborator

bos commented Nov 17, 2011

I have not found that @basvandijk's change actually fixes this problem.

It does make a difference to the amount of memory allocated, but it's small.

I ran with a 182KB sample JSON file that @hvr gave me a few months ago.

./AesonParse 100 anim.json +RTS -s

Lazy fields ran in 4.30 seconds.

 5,850,752,984 bytes allocated in the heap
 2,053,429,552 bytes copied during GC
     7,026,728 bytes maximum residency (177 sample(s))
       261,096 bytes maximum slop
            21 MB total memory in use (0 MB lost due to fragmentation)

%GC     time      64.5%  (65.1% elapsed)

Alloc rate    4,041,652,667 bytes per MUT second

Productivity  35.5% of total user, 33.7% of total elapsed

Strict fields ran in 4.16 seconds.

 5,807,525,768 bytes allocated in the heap
 1,924,700,040 bytes copied during GC
     6,465,672 bytes maximum residency (179 sample(s))
       244,960 bytes maximum slop
            20 MB total memory in use (0 MB lost due to fragmentation)

%GC     time      61.3%  (62.1% elapsed)

Alloc rate    3,807,971,406 bytes per MUT second

Productivity  38.7% of total user, 36.7% of total elapsed

@bos
Collaborator

bos commented Nov 17, 2011

I should take a moment to describe why I have resisted making AesonParse fully evaluate the result of a parse, and why the individual parsing components do not fully evaluate their results before returning them.

When we're parsing, we don't know which fields in a Value the consumer of the result will actually need. If strictly evaluating all of those fields were free, that would be great, but in my measurements the kind of change that @hvr suggests at the top of this issue has always slowed parsing down by 30% to 50% when we need only a few of the fields.

Granted, the downside of the current somewhat-lazy approach is that it causes many thunks to be allocated. If the consumer really does need all the fields, then we've failed to avoid any work; instead, we've caused ourselves extra work and intermediate allocation due to the thunking.

I have not yet been able to come up with a satisfactory one-size-fits-all solution. I am quite fond of the additional performance we get by deferring complete evaluation, but I am unhappy about the additional allocation we perform when the deferred evaluation is not needed.

The closest I can think of to a good solution is to have two parsing modules, one that's fully strict and one that's partly lazy. If you have any better ideas, please let me know :-)
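
A minimal sketch of how the two modules could share one grammar (the `strictly` combinator and the naming are hypothetical, not an existing aeson API):

import Control.DeepSeq (NFData, rnf)

-- Wrap any parser so its result is deeply evaluated before being
-- returned, leaving no thunks behind.
strictly :: (Monad p, NFData a) => p a -> p a
strictly p = do
  x <- p
  rnf x `seq` return x

-- The partly-lazy module would export `value` as-is; the strict module
-- could export `strictly value` under the same name.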

@hvr
Member Author

hvr commented Nov 17, 2011

> The closest I can think of to a good solution is to have two parsing modules, one that's fully strict and one that's partly lazy. If you have any better ideas, please let me know :-)

IMHO, that's a legitimate solution, i.e. letting the user choose if there really isn't a one-size-fits-all parser :-)

I'm wondering what percentage of the JSON parse tree (if that can be roughly estimated at all) actually has to be consumed for the strict parser to become more efficient than the lazy one.
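
One way to estimate that crossover empirically (a hypothetical harness, written against the Map-based Object shown in the diff above): after parsing, deep-force only the first fraction p of each node's children and time the loop for varying p:

import Data.Aeson.Types (Value (..))
import qualified Data.Map as Map
import qualified Data.Vector as V

-- Deep-force roughly a fraction p of the parse tree, leaving the rest lazy.
forceFraction :: Double -> Value -> ()
forceFraction p = go
  where
    go (Object m) = forcePrefix (Map.elems m)
    go (Array xs) = forcePrefix (V.toList xs)
    go v          = v `seq` ()
    -- force the first p-fraction of the children, leave the rest as thunks
    forcePrefix xs =
      let k = ceiling (p * fromIntegral (length xs))
      in  foldr (\x r -> go x `seq` r) () (take k xs)

Plugged into the benchmark loop as `evaluate $ forceFraction p r`, sweeping p from 0 to 1 would show where the strict parser overtakes the lazy one.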

hvr pushed a commit to hvr/aeson that referenced this issue Nov 30, 2011
This changeset builds on top of
dab1302, and just strictifies the
monadic `return`s used in the attoparsec `Parser` monad to avoid
leaving thunks behind during parsing.

This addresses issue haskell#37
@bos
Collaborator

bos commented Dec 1, 2011

This is now fixed.

@bos closed this as completed Dec 1, 2011
tolysz pushed a commit to tolysz/aeson that referenced this issue May 18, 2015
This improves the running time of the AesonParse benchmarks.
I don't think there's a use-case for lazy fields.
Fixes haskell#37
tolysz pushed a commit to tolysz/aeson that referenced this issue May 18, 2015
This changeset builds on top of
dab1302, and just strictifies the
monadic `return`s used in the attoparsec `Parser` monad to avoid
leaving thunks behind during parsing.

This addresses issue haskell#37