Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize out divisions in shaping at upem size with hb-ot-font #1801

Closed
behdad opened this issue Jun 27, 2019 · 9 comments
Closed

Optimize out divisions in shaping at upem size with hb-ot-font #1801

behdad opened this issue Jun 27, 2019 · 9 comments
Assignees

Comments

@behdad
Copy link
Member

behdad commented Jun 27, 2019

There's a lot that can be optimized, both in the main scaling as well as parent scaling in hb-font.cc. We should measure.

@behdad behdad self-assigned this Jun 27, 2019
@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

I measured this a bit.. It's easy enough to optimize hb_ot_get_glyph_h_advances() but the speedup seems to be in the 1% range at best (if I disable liga / kern).

@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

Actually I take that back. The instruction count changes in the 1% range. But by not stalling the pipeline (removing division), instructions-per-cycle goes up from 3.00 to 3.22. That suggests ~5% speedup might be measurable.

@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

For the record. Current function is:

static void
hb_ot_get_glyph_h_advances (hb_font_t* font, void* font_data,
                            unsigned count,
                            const hb_codepoint_t *first_glyph,
                            unsigned glyph_stride,
                            hb_position_t *first_advance,
                            unsigned advance_stride,
                            void *user_data HB_UNUSED)
{
  const hb_ot_face_t *ot_face = (const hb_ot_face_t *) font_data;
  const OT::hmtx_accelerator_t &hmtx = *ot_face->hmtx;

  for (unsigned int i = 0; i < count; i++)
  {
    *first_advance = font->em_scale_x (hmtx.get_advance (*first_glyph, font));
    first_glyph = &StructAtOffsetUnaligned<hb_codepoint_t> (first_glyph, glyph_stride);
    first_advance = &StructAtOffsetUnaligned<hb_position_t> (first_advance, advance_stride);
  }
}

where em_scale_x (v) expands to em_scale (v, font->x_scale), and em_scale is:

  hb_position_t em_scale (int16_t v, int scale) 
  { 
    int upem = face->get_upem (); 
    int64_t scaled = v * (int64_t) scale; 
    scaled += scaled >= 0 ? upem/2 : -upem/2; /* Round. */ 
    return (hb_position_t) (scaled / upem); 
  } 

The get_upem() part obviously can be moved out of the loop. We annotate get_upem() with HB_PURE_FUNC. Not sure if that's enough for the compiler to move it out of the loop after inlining. Unlikely.

Rounding, maybe do without branch?

Finally, I don't know what the code size tradeoff would be, but at least when __OPTIMIZE_SIZE__ is not defined, we can have three loop bodies for:

  • The general case, with the multiplication and division,
  • Change division to right-shift if upem happens to be power-of-two.
  • No multiplication/division if scale happens to be equal to upem.

@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

Rounding, maybe do without branch?

Tried it. Didn't seem to matter.

@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

Another possibility is to move the division outside the loop, by first transforming em_scale to (ignore rounding):

  hb_position_t em_scale (int16_t v, int scale) 
  { 
    unsigned upem = face->get_upem (); 
    int64_t mult = ((int64_t) scale << 16) / (int) upem; 
 
    return (hb_position_t) ((v * mult) >> 16); 
  } 

Compiler doesn't seem to figure out it can move the division outside the loop, so needs to be helped out.

@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

Humm. Actually, we can make the font cache x_mult / y_mult. Let me try that.

@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

    *first_advance = font->em_scale_x (hmtx.get_advance (*first_glyph, font));

If font doesn't have variations set, we can call the version of get_advance without passing font in:

    unsigned int get_advance (hb_codepoint_t  glyph,
                              hb_font_t      *font) const
    {
      unsigned int advance = get_advance (glyph);
      if (likely (glyph < num_metrics))
      {
        advance += (font->num_coords ? var_table->get_advance_var (glyph, font->coords, font->num_coords) : 0); // TODO Optimize?!
      }
      return advance;
    }

But then we are approaching combinatorial explosion on the different branches.

@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

Okay! Completely removed division from scaling!

behdad added a commit that referenced this issue Jul 5, 2019
Part of #1801

The assumption that compiler optimizes "upem/2" to a shift only
works if upem is unsigned...  Anyway, spoon-feed the compiler.
behdad added a commit that referenced this issue Jul 5, 2019
@behdad behdad closed this as completed in f18ea1d Jul 5, 2019
@behdad
Copy link
Member Author

behdad commented Jul 5, 2019

cc @drott

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant