Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor the Document.findAll() function #3537

Merged
merged 8 commits into from
Apr 20, 2021

Conversation

peterbe
Copy link
Contributor

@peterbe peterbe commented Apr 16, 2021

Here's how I tested the performance gain:
Just before deleting/replacing the old findAll function, I renamed it to findAll0 just so I can make this benchmark.
Then I run this: https://gist.github.com/peterbe/318bab87834580c384b763da08919d25
First with and then without CONTENT_TRANSLATED_ROOT set.

WITH:

OPTIONS: {}
findAll : 20.151s
findAll2: 1.592s

OPTIONS: { folderSearch: 'web/html/element/a' }
findAll : 14.341s
findAll2: 1.592s

OPTIONS: { locales: Map(1) { 'fr' => true } }
findAll : 1.592s
findAll2: 1.578s

OPTIONS: {
  files: Set(2) {
    '/Users/peterbe/dev/MOZILLA/MDN/content/files/en-us/glossary/argument/index.html',
    'en-us/webassembly/text_format_to_wasm/index.html'
  }
}
findAll : 15.243s
findAll2: 1.577s

WITHOUT:

OPTIONS: {}
findAll : 3.249s
findAll2: 329.45ms

OPTIONS: { folderSearch: 'web/html/element/a' }
findAll : 3.153s
findAll2: 317.934ms

OPTIONS: { locales: Map(1) { 'fr' => true } }
findAll : 0.975ms
findAll2: 337.742ms

OPTIONS: {
  files: Set(2) {
    '/Users/peterbe/dev/MOZILLA/MDN/content/files/en-us/glossary/argument/index.html',
    'en-us/webassembly/text_format_to_wasm/index.html'
  }
}
findAll : 3.158s
findAll2: 322.591ms

(note that the benchmark makes sure that the number of found files is equal every time)

So basically, the new function is 10x faster.

@peterbe
Copy link
Contributor Author

peterbe commented Apr 16, 2021

One way of testing is to trigger the gatherTranslations() inside content/translations.js a bunch of times. It does findAll() but also read() for each and every file.

const { translationsOf } = require("./content/translations");
translationsOf({ slug: "whatever", locale: "whatever" });

If I run this over and over I get the following outputs (thanks to this):

Time to gather all translations: 7.931s
Time to gather all translations: 8.317s
Time to gather all translations: 8.400s
Time to gather all translations: 8.249s
Time to gather all translations: 8.513s

And if I run the same exact thing on main I get:

Time to gather all translations: 28.359s
Time to gather all translations: 21.793s
Time to gather all translations: 21.114s
Time to gather all translations: 20.681s
Time to gather all translations: 21.651s

@peterbe
Copy link
Contributor Author

peterbe commented Apr 16, 2021

Now, the question is, does it matter? The only times we run the whole entire yarn build is when we run the Prod builds as a cron' job every 24h. In that case, it's a machine and no human having to sit and wait. But it also happens every single time we, as a dev team, run yarn build locally to test something. I've always felt that sucked a bit because there's a bit of a delay before it even starts. Half a minute just to gather all the translations every time isn't a lot when the whole build takes 1.5h. But this new findAll will make the yarn build after too because it now doesn't need to try to figure out which root a file path came from.

Also, this slower findAll() was used every time there's a content PR. The yarn build part is often one of the slowest parts of the "PR Test" workflow (when the yarn caches are hot). So every little helps there to make the yarn build faster.

@peterbe peterbe marked this pull request as ready for review April 16, 2021 19:35
@peterbe peterbe requested a review from Gregoor April 16, 2021 19:35
Copy link
Contributor

@Gregoor Gregoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love those numbers Peter. Here are some musing from reading it


const Document = require("./document");

describe("test Document.findAll()", () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
describe("test Document.findAll()", () => {
describe("Document.findAll()", () => {

describe("test Document.findAll()", () => {
it("should always return files that exist", () => {
const filePaths = [...Document.findAll().iter({ pathOnly: true })];
expect(filePaths.every((value) => fs.existsSync(value))).toBeTruthy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just testing whether fdir works? We are never adding to what it finds, so this should be given. But I'm cool with testing it nonetheless, 3 lines for making sure the lib is working as expected is still a good use of space in my book

.filter((filePath) => {
// Note! Due to a bug in fdir, any accidental errors throw here are
// swallowed!
// See https://github.com/thecodrr/fdir/issues/56
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we got a reply over in that issue. If withError is not what we want, we might want to nest all of this in our own try/catch so that we can at least get some logs for it.

@@ -197,32 +198,70 @@ function unarchive(document, move) {
return created;
}

const read = memoize((folder) => {
const read = memoize((folderOrFilePath) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this PR should not be limited to changing findAll()? From what I understand this changes read()'s API, don't we have callers that are passing folders in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I was only looking at the if-branch. Still wondering though whether we can reap the findAll-speed improvements without changing fn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right!
But... may I ask to make an exception to that general rule here? Pleeeease.

The argument is that the two most important uses of this code is when we do prod builds which is a big yarn build, and when we do PR CI which builds the specified files (based on the git diff).
In both those cases it's actually smart to pass in a full absolute file path to read() because you don't need to loop through the roots to figure out where it came from.
But yes, what could happen is that the yarn build causes a call of read('/home/path/to/absolute/content/files/en-us/foo/index.html') first and then later some KS sidebar or some other code internally triggers a call to read('/en-us/foo') which is the same result but a different memoization key. But I would argue that the cache is not important enough to protect against that. The read() is so fast and because of the LRU (max 2,000 entries) you'll end up calling the read() with the same key many times anyway and not benefitting from the memoization.

So I violated a bit by changing the findAll() and the read() and that's not great. (But hopefully my explanation and justification can make up for that).
However, perhaps the correct API is that read() should always and only expect a filePath as the one and only argument. And if you only have a folder (e.g. fr/web/html/element/video) you should be using some other readByFolder() wrapper that can turn fr/web/html/element/video into /full/path/to/translated-content/files/fr/web/html/element/video/index.html before it goes into read().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't even consider the bits about the cache. Maybe we should then as part of this factor out the finding of the correct path so that the cache still gets hit? Or do you prefer getting this in first? It seems worthwhile to me to reduce the responsibilities of read a little

content/document.js Outdated Show resolved Hide resolved
@peterbe peterbe requested a review from Gregoor April 19, 2021 14:50
Copy link
Contributor

@Gregoor Gregoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented on the last outstanding bit, it's your call. Choose your own adventure ⚔️

@@ -197,32 +198,70 @@ function unarchive(document, move) {
return created;
}

const read = memoize((folder) => {
const read = memoize((folderOrFilePath) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't even consider the bits about the cache. Maybe we should then as part of this factor out the finding of the correct path so that the cache still gets hit? Or do you prefer getting this in first? It seems worthwhile to me to reduce the responsibilities of read a little

@peterbe
Copy link
Contributor Author

peterbe commented Apr 20, 2021

@Gregoor I researched it. A lot.
First of all, using only the en-us content, I measured how many times read() was invoked: 57,014 times
That's to build 11,770 pages. So it's 20% cache hits on the memoization. :(
How long did it take?

Built 11,770 pages in 8.6 minutes, at a rate of 22.7 documents per second.

Then I switch to this branch in this PR and measure again. Now, because the read() is called with different parameters depending on the CLI vs. things like fixing broken links and stuff. The read() function was invoked: 71,406 times.
So now the cache hit ratio was down to 16%.
BUT the total time turned out to be exactly the same!

Built 11,770 pages in 8.6 minutes, at a rate of 22.7 documents per second.

In other words, the read() is sufficiently fast that it doesn't really matter if it gets slightly less hits.

@peterbe
Copy link
Contributor Author

peterbe commented Apr 20, 2021

I still think we should split it up so that read() expects a full path. And calling it with a "folder" would be something like readByFolder() instead. BUt I'd rather not tackle right here right now because it would make the PR even bigger.

@peterbe peterbe merged commit 44c583f into mdn:main Apr 20, 2021
@peterbe peterbe deleted the refactor-the-documentfindall-function branch April 20, 2021 12:21
peterbe added a commit to peterbe/yari that referenced this pull request Jun 1, 2021
* refactor the Document.findAll() function

* more refactoring

* cleaning up

* adding a very simply unit test

* adding .withErrors() to avoid error swallowing

* feedbacked
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants