Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htmlToText/convert functions much slower than fromString #265

Closed
oanacapatana opened this issue Dec 5, 2022 · 5 comments
Closed

htmlToText/convert functions much slower than fromString #265

oanacapatana opened this issue Dec 5, 2022 · 5 comments

Comments

@oanacapatana
Copy link

Updating from 5.1.1 to 9.0.0 makes the html to text conversion 4 times slower.
In 5.1.1 was using fromString function which is now deprecated.
In 9.0.0 have tried with both convert and htmlToText functions and they are both 4 times slower than previous fromString method.

The method call from code is:
data = htmlToText.fromString(data, {
wordwrap: false,
noLinkBrackets: true,
preserveNewlines: true,
unorderedListItemPrefix: "*",
format: {
anchor: function (elem, fn, options) {
var h = fn(elem.children, options);
return "[" + h + "](" + elem.attribs.href + ")";
},
unorderedList: function (elem, fn, options) {
return formatUnorderedList(elem, fn, options, false);
},
text: function (elem, options) {
if (elem.next && elem.data) {
//Return the element that has an email, and add a special string to avoid removing the < >
return elem.next.name && elem.next.name.indexOf("@") !== -1 ?
elem.data + " <" + elem.next.name + DNR + ">" :
elem.data;
} else {
return elem.data;

                }
            }
        }
    });
@KillyMXI
Copy link
Member

KillyMXI commented Dec 5, 2022

Thank you for this report. This is the first time I've got feedback on package performance.

It has nothing to do with the function name. Had I preserved fromString as an exported name, it would've performed the same.

Could you please perform your measurements with versions 5.1.1, 6.0.0, 7.1.1, 8.2.1, 9.0.0 and provide exact numbers on the same test data? That would help me to figure out what caused the slowdown for you.

Please use empty options or simple { wordwrap: false } in your tests. Other options in your example had changed between versions and may affect the result.

@oanacapatana
Copy link
Author

I have made now the tests. We use this library when exporting data to excel, so I have a sample of 12120 strings that get processed. Please find below the results:
5.1.1: 640 ms
6.0.0: 4669 ms
7.1.1: 4672 ms
8.2.1: 46120 ms
9.0.0: 21550 ms

Note: I used on { wordwrap: false } as you suggested

@KillyMXI
Copy link
Member

KillyMXI commented Dec 5, 2022

Thank you.

Could you please repeat the test for versions 8.2.1 and 9.0.0 with const compiledConvert = compile({ wordwrap: false }); outside the loop and stringData = compiledConvert(html) in the loop over your samples?

const compiledConvert = compile({ wordwrap: false });
for (const sample of samples) {
  const stringData = compiledConvert(sample.htmlData);
  // ...
}

instead of

for (const sample of samples) {
  const stringData = convert(sample.htmlData, { wordwrap: false });
  // ...
}

@oanacapatana
Copy link
Author

It works now, thank you :)
I get for both 8.2.1 and 9.0.0 around 800 ms

I will close this issue now

@KillyMXI
Copy link
Member

KillyMXI commented Dec 5, 2022

Awesome. I'm glad to hear it works so well after all.

After introducing selectors, html-to-text got very flexible, but processing all options got really expensive, that's why I introduced the compile function.

Because it got so much smarter compared to version 5.1.1, getting this close in performance is a really good achievement, I think.

Thank you for working with me on this question, now I have these valuable measurements from a real word use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants