
Reduce memory allocations and enhance performance #90

Merged
merged 1 commit on Jan 21, 2021

Conversation

gettalong
Member

I ran a benchmark using Prawn together with a TrueType font handled by
ttfunk which allocated a staggering 107 million objects (compared to 1.6
million objects for the same benchmark using HexaPDF).

A five-minute investigation revealed two spots that, when optimized, reduced the
number to about 32 million objects.

  • TTFunk::Subset::CodePage#from_unicode:

    Don't create an array and use #pack to convert a codepoint to a
    character, just add the codepoint directly, saving an array allocation.

    Furthermore, the created string can be modified using #encode! since
    it is thrown away anyway.

  • TTFunk::SubsetCollection#use:

    The inner loop is called many times. By using a while loop instead of
    an iterator with a block we avoid allocating and calling the block.
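The first change can be sketched roughly like this (a hedged illustration based on the description above, not the actual ttfunk source; method names and the ISO-8859-1 target encoding are assumptions):

```ruby
# Illustrative sketch of the CodePage#from_unicode change described
# above. Method names and the ISO-8859-1 target are assumptions for
# demonstration, not the real ttfunk code.

# Before: wraps the codepoint in an Array just to call #pack, then
# allocates another String via the non-destructive #encode.
def from_unicode_via_pack(codepoint)
  [codepoint].pack('U*').encode('ISO-8859-1')
end

# After: append the codepoint to a string directly (String#<< accepts
# an Integer codepoint), and convert in place with #encode! since the
# intermediate string is thrown away anyway.
def from_unicode_direct(codepoint)
  (+'' << codepoint).encode!('ISO-8859-1')
end
```

Both variants produce the same bytes; the second simply skips the intermediate Array and the extra String.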

@pointlessone
Member

Thank you for your contribution.

I'm not quite sure object allocations are a good metric to optimize for. How are performance and memory usage affected?

@gettalong
Member Author

All created objects must eventually be collected by the garbage collector. The more short-lived objects are created, the more work the garbage collector has to do. So reducing object allocations leads to better performance, although the gains may not be as large as for other performance-related changes.

In this case we are reducing a huge amount of allocations which will be performance relevant:

|-------------|-----------|----------|------------|-------------|
| Version     | Benchmark |     Time |     Memory |   File size |
|-------------|-----------|----------|------------|-------------|
| prawn 2.1.0 | 10x       |  2.824ms |  71.804KiB |   5.861.065 |
| prawn 2.2.0 | 10x       |  2.935ms |  76.452KiB |   6.170.089 |
| prawn 2.3.0 | 10x       |  3.658ms |  74.956KiB |   6.170.089 |
| prawn       | 10x       |  4.758ms |  74.928KiB |   6.170.089 |
| prawn-dev   | 10x       |  4.804ms |  74.924KiB |   6.170.089 |
|-------------|-----------|----------|------------|-------------|
| prawn 2.1.0 | 10x ttf   |  8.129ms |  74.512KiB |   5.868.049 |
| prawn 2.2.0 | 10x ttf   | 17.205ms |  76.976KiB |   6.177.034 |
| prawn 2.3.0 | 10x ttf   | 18.104ms |  77.664KiB |   6.177.032 |
| prawn       | 10x ttf   | 19.017ms |  77.864KiB |   6.177.032 |
| prawn-dev   | 10x ttf   | 14.207ms |  77.712KiB |   6.177.032 |
|-------------|-----------|----------|------------|-------------|

As you can see, the TTF version of the benchmark is still much slower than the one using built-in PDF fonts. But in comparison to prawn 2.1 - 2.4, we are much faster when applying this patch.

Max memory usage is not affected because the garbage collector can and does remove the short-lived objects quickly.
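Claims like this are easy to check directly: `GC.stat(:total_allocated_objects)` is cumulative, so the difference around a block counts the objects that block allocated. A minimal sketch (the iteration count and codepoint are chosen arbitrarily for illustration):

```ruby
# Count objects allocated while running a block. GC is disabled during
# measurement so counts are not perturbed by a collection mid-run.
def allocations
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

# Array + #pack allocates at least two objects per iteration (the
# Array literal and the packed String); appending the codepoint to a
# fresh String allocates only the String.
via_pack = allocations { 10_000.times { [0xE9].pack('U*') } }
direct   = allocations { 10_000.times { String.new << 0xE9 } }

puts "pack: #{via_pack}, direct: #{direct}"
```

The exact numbers vary by Ruby version, but the pack variant should consistently allocate more objects than the direct append.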

lib/ttfunk/subset/code_page.rb (review comments resolved)
@@ -17,12 +17,16 @@ def [](subset)
def use(characters)
characters.each do |char|
covered = false
@subsets.each_with_index do |subset, _i|
Member

Why are we even using each_with_index if we never use the index anywhere? The while loop is fine, but can we get the same allocation wins with a plain ol' each?

Member Author

As far as I know, the reason for the allocation is the block itself. So if we used each instead of each_with_index, there would still be an allocation.

Furthermore, there is a non-negligible performance hit when using a block instead of a plain while loop, due to the block invocation. So apart from saving the allocation, the while loop is also faster, which is especially visible in such an often-called inner loop.
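The shape of the rewrite is roughly the following (a hedged sketch, not the actual SubsetCollection#use code; the real method also tracks which subset covered the character, and the names here are illustrative):

```ruby
# Iterator version: the block is entered once per subset for every
# character, which adds up in a hot inner loop.
def covered_by_any_with_each?(subsets, char)
  covered = false
  subsets.each do |subset|
    covered = true if subset.include?(char)
  end
  covered
end

# while version: same logic with manual indexing, no block invocation
# per element (and an early exit as a bonus).
def covered_by_any_with_while?(subsets, char)
  i = 0
  while i < subsets.length
    return true if subsets[i].include?(char)
    i += 1
  end
  false
end
```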

Member

That's super interesting, I wonder why blocks cause an allocation? I do vaguely remember hearing something about that at a RailsConf one time though.

Member Author

Yes, there was a talk some time ago that went through (nearly) all the Enumerable methods to show their time/space complexity - I also don't remember its title. However, there is a great talk by Jeremy Evans about optimizations in Roda/Sequel: http://code.jeremyevans.net/presentations/rubykaigi2019/index.html#83 (this links directly to the "avoid proc allocation" slide).

I ran a benchmark using Prawn together with a TrueType font handled by
ttfunk which allocated a staggering 107 million objects (compared to 1.6
million objects for the same benchmark using HexaPDF).

A five-minute investigation revealed two spots that, when optimized, reduced the
number to about 5 million objects.

* TTFunk::Subset::CodePage#from_unicode:

  Don't create an array and use #pack to convert a codepoint to a
  character, just add the codepoint directly, saving an array allocation.

  Furthermore, the created string can be modified using #encode! since
  it is thrown away anyway.

  Last, since the mapping is static use an internal cache for the
  mapping.

* TTFunk::SubsetCollection#use:

  The inner loop is called many times. By using a while loop instead of
  an iterator with a block we avoid allocating and calling the block.
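The caching point from the commit message might look something like this in practice (purely illustrative; the actual cache in ttfunk may be structured differently, and the module and method names here are assumptions):

```ruby
# Since the Unicode -> code page mapping is static for a given
# encoding, each conversion can be computed once and reused. This is
# an assumed sketch, not the real ttfunk implementation.
module CodePageMappingCache
  CACHE = {} # encoding name => { codepoint => encoded String or nil }

  def self.from_unicode(encoding, codepoint)
    mapping = (CACHE[encoding] ||= {})
    return mapping[codepoint] if mapping.key?(codepoint)

    mapping[codepoint] =
      begin
        (+'' << codepoint).encode(encoding)
      rescue Encoding::UndefinedConversionError
        nil # codepoint has no representation in this code page
      end
  end
end
```

Repeated lookups for the same codepoint then return the cached string instead of allocating a new one each time.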
@gettalong
Member Author

@camertron @pointlessone Together with the other two pull requests, and after applying the caching fix as per @camertron, total allocations are down to around 4.4 million objects. This 96% reduction in allocated objects led to a 64% runtime decrease for the benchmark in question (HexaPDF raw_text benchmark, 10x with TrueType font), from 19.8 seconds down to 7.2 seconds.
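For reference, the quoted percentages follow directly from the raw figures above:

```ruby
# Sanity-checking the quoted figures: 107M -> 4.4M allocations,
# 19.8s -> 7.2s runtime.
alloc_reduction  = (1 - 4.4 / 107.0) * 100 # ~95.9%, i.e. the "96%"
runtime_decrease = (1 - 7.2 / 19.8) * 100  # ~63.6%, i.e. the "64%"
```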
