Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

95+% perf improvement #4

Merged
merged 1 commit into from Feb 15, 2023
Merged

Conversation

todd-richmond
Copy link
Contributor

This improves performance by 95+ percent on certain complex messages. The sample I have from a customer with large # of destination addresses took 9.5 minutes to parse through Email::Outlook::Message which relies on extracting content with Ole::Storage_lite. After these changes, that times drops to about 3-4 seconds. I'm filing a PR for a few changes in the other module as well
Multiple changes

  1. use const to ensure constants are inlined
  2. first instead of grep to short-circuit large lists
  3. inline functions only called one time that duplicated work
  4. highly optimized seek method as it was the worst cpu burn. Break before 2nd offset computation when block has been read completely
  5. pre-sort and cache block offsets as the linear scan regressed with #of small data blocks
  6. optimize pack+unpack
  7. use hash instead of array to skip work blocks - same linearity issue

@jmcnamara
Copy link
Owner

Thanks. That looks like good work.

Does this introduce a minimum perl version?

Also, these changes are quite substantial. If I merge and release them, do you agree to fix any issues introduced by the changes.

@todd-richmond
Copy link
Contributor Author

todd-richmond commented May 16, 2022

no minimum version that I can think of as the only module I added was first instead of grep and if you want, you can skip that part since originally I was using it in more places, but full-inline and logic change was required

I'll definitely fix any items anyone finds as we need this module to work well via Email::Outlook::Message. I diffed the output from that module after the 99.5% speedup and the output was the same, but obviously don't have a large test suite

If you have any suggestions, I'm find with those too. For example use const wasn't a huge win after I did all the other work if you are concerned about that chunk. the real win was pre-caching the block lookups and using a hash check instead of array to skip work

@jmcnamara
Copy link
Owner

no minimum version that I can think of as the only module I added was first instead of grep and if you want, you can skip that part since originally I was using it in more places, but full-inline and logic change was required

OLE::Storage_Lite is very much a legacy module and probably still needs to run on legacy perls. At the same time List::Util has been core since perl v5.7.3.

For example use const wasn't a huge win

constant.pm has been core since 5.004 so that should be okay as well.

I think I'll just merge the changes and if things break in the wild we can respin a fix.

I'll definitely fix any items anyone finds as we need this module to work well via Email::Outlook::Message.

I inherited OLE::Storage_Lite from the original author mainly because it was a requirement for Spreadsheet::WriteExcel and because I had submitted some fixes. If you are interested in taking over maintainership let me know.

In the meantime I'll merge this up and release it in the next few days.

@todd-richmond
Copy link
Contributor Author

any timeline to merge? Would be nice to get outside feedback on the changes. We've been running it on production systems for 6 months now and haven't seen further resource issues since

@jmcnamara
Copy link
Owner

Apologies, this dropped off the bottom of my todo list. I'll merge and release as soon as I can.

@jmcnamara jmcnamara merged commit be7a714 into jmcnamara:master Feb 15, 2023
@todd-richmond
Copy link
Contributor Author

thx! I'll delete custom patches once it hits CPAN

@jmcnamara
Copy link
Owner

It is on its way to CPAN. Thanks for the patch and apologies for the delay.

@maaarghk
Copy link

it broke spreadsheet::parseexcel :(

@todd-richmond
Copy link
Contributor Author

can you send a link to an example doc and a description of what is broken

@maaarghk
Copy link

cpanm Spreadsheet::ParseExcel

t/00_basic.t ................... ok

#   Failed test 'Spreadsheet::ParseExcel::Workbook created'
#   at t/01_parse.t line 42.
#          got: 'HASH'
#     expected: 'Spreadsheet::ParseExcel::Workbook'

#   Failed test 'current sheet is 1'
#   at t/01_parse.t line 52.
#          got: undef
#     expected: '1'

#   Failed test at t/01_parse.t line 53.
#          got: undef
#     expected: '1'

remainder of tests fail

Perl v 5.30

@todd-richmond
Copy link
Contributor Author

confirmed - debugging now, but have to quit soon so will probably take until tomorrow

@maaarghk
Copy link

it's all good, downgraded locally, if you're on your employer's dime right now then perhaps it would be a nice gift to the community at large if you take your time and add tests rather than update some specific incantations, as i note the library does not really have any : )

@todd-richmond
Copy link
Contributor Author

PR submitted - whiffed the conversion from grep () to first {}. Spreadsheet-ParseExcel tests pass again

@jmcnamara
Copy link
Owner

@todd-richmond Thanks. I'll test and repackage as soon as possible.

@jmcnamara
Copy link
Owner

v0.22 with the fix has been sent to CPAN.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request May 20, 2023
0.22: not documented

0.21 Fri Feb 17 2020

    + Fix for Spreadsheet::ParseExcel break due to the previous change in
      v0.20.  jmcnamara/ole-storage-lite#4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants