refactor the Document.findAll() function #3537

peterbe · 2021-04-16T13:12:20Z

Here's how I tested the performance gain:
Just before deleting/replacing the old findAll function, I renamed it to findAll0 just so I can make this benchmark.
Then I run this: https://gist.github.com/peterbe/318bab87834580c384b763da08919d25
First with and then without CONTENT_TRANSLATED_ROOT set.

WITH:

OPTIONS: {}
findAll : 20.151s
findAll2: 1.592s

OPTIONS: { folderSearch: 'web/html/element/a' }
findAll : 14.341s
findAll2: 1.592s

OPTIONS: { locales: Map(1) { 'fr' => true } }
findAll : 1.592s
findAll2: 1.578s

OPTIONS: {
  files: Set(2) {
    '/Users/peterbe/dev/MOZILLA/MDN/content/files/en-us/glossary/argument/index.html',
    'en-us/webassembly/text_format_to_wasm/index.html'
  }
}
findAll : 15.243s
findAll2: 1.577s

WITHOUT:

OPTIONS: {}
findAll : 3.249s
findAll2: 329.45ms

OPTIONS: { folderSearch: 'web/html/element/a' }
findAll : 3.153s
findAll2: 317.934ms

OPTIONS: { locales: Map(1) { 'fr' => true } }
findAll : 0.975ms
findAll2: 337.742ms

OPTIONS: {
  files: Set(2) {
    '/Users/peterbe/dev/MOZILLA/MDN/content/files/en-us/glossary/argument/index.html',
    'en-us/webassembly/text_format_to_wasm/index.html'
  }
}
findAll : 3.158s
findAll2: 322.591ms

(note that the benchmark makes sure that the number of found files is equal every time)

So basically, the new function is 10x faster.

peterbe · 2021-04-16T18:56:41Z

One way of testing is to trigger the gatherTranslations() inside content/translations.js a bunch of times. It does findAll() but also read() for each and every file.

const { translationsOf } = require("./content/translations");
translationsOf({ slug: "whatever", locale: "whatever" });

If I run this over and over I get the following outputs (thanks to this):

Time to gather all translations: 7.931s
Time to gather all translations: 8.317s
Time to gather all translations: 8.400s
Time to gather all translations: 8.249s
Time to gather all translations: 8.513s

And if I run the same exact thing on main I get:

Time to gather all translations: 28.359s
Time to gather all translations: 21.793s
Time to gather all translations: 21.114s
Time to gather all translations: 20.681s
Time to gather all translations: 21.651s

peterbe · 2021-04-16T19:35:15Z

Now, the question is, does it matter? The only times we run the whole entire yarn build is when we run the Prod builds as a cron' job every 24h. In that case, it's a machine and no human having to sit and wait. But it also happens every single time we, as a dev team, run yarn build locally to test something. I've always felt that sucked a bit because there's a bit of a delay before it even starts. Half a minute just to gather all the translations every time isn't a lot when the whole build takes 1.5h. But this new findAll will make the yarn build after too because it now doesn't need to try to figure out which root a file path came from.

Also, this slower findAll() was used every time there's a content PR. The yarn build part is often one of the slowest parts of the "PR Test" workflow (when the yarn caches are hot). So every little helps there to make the yarn build faster.

Gregoor

I love those numbers Peter. Here are some musing from reading it

Gregoor · 2021-04-19T09:16:51Z

content/document.test.js

+
+const Document = require("./document");
+
+describe("test Document.findAll()", () => {


Suggested change

describe("test Document.findAll()", () => {

describe("Document.findAll()", () => {

Gregoor · 2021-04-19T09:18:49Z

content/document.test.js

+describe("test Document.findAll()", () => {
+  it("should always return files that exist", () => {
+    const filePaths = [...Document.findAll().iter({ pathOnly: true })];
+    expect(filePaths.every((value) => fs.existsSync(value))).toBeTruthy();


Isn't this just testing whether fdir works? We are never adding to what it finds, so this should be given. But I'm cool with testing it nonetheless, 3 lines for making sure the lib is working as expected is still a good use of space in my book

Gregoor · 2021-04-19T09:20:57Z

content/document.js

+      .filter((filePath) => {
+        // Note! Due to a bug in fdir, any accidental errors throw here are
+        // swallowed!
+        // See https://github.com/thecodrr/fdir/issues/56


Looks like we got a reply over in that issue. If withError is not what we want, we might want to nest all of this in our own try/catch so that we can at least get some logs for it.

Gregoor · 2021-04-19T09:24:31Z

content/document.js

@@ -197,32 +198,70 @@ function unarchive(document, move) {
  return created;
 }

-const read = memoize((folder) => {
+const read = memoize((folderOrFilePath) => {


Wondering if this PR should not be limited to changing findAll()? From what I understand this changes read()'s API, don't we have callers that are passing folders in?

Ah I see, I was only looking at the if-branch. Still wondering though whether we can reap the findAll-speed improvements without changing fn

You're absolutely right!
But... may I ask to make an exception to that general rule here? Pleeeease.

The argument is that the two most important uses of this code is when we do prod builds which is a big yarn build, and when we do PR CI which builds the specified files (based on the git diff).
In both those cases it's actually smart to pass in a full absolute file path to read() because you don't need to loop through the roots to figure out where it came from.
But yes, what could happen is that the yarn build causes a call of read('/home/path/to/absolute/content/files/en-us/foo/index.html') first and then later some KS sidebar or some other code internally triggers a call to read('/en-us/foo') which is the same result but a different memoization key. But I would argue that the cache is not important enough to protect against that. The read() is so fast and because of the LRU (max 2,000 entries) you'll end up calling the read() with the same key many times anyway and not benefitting from the memoization.

So I violated a bit by changing the findAll() and the read() and that's not great. (But hopefully my explanation and justification can make up for that).
However, perhaps the correct API is that read() should always and only expect a filePath as the one and only argument. And if you only have a folder (e.g. fr/web/html/element/video) you should be using some other readByFolder() wrapper that can turn fr/web/html/element/video into /full/path/to/translated-content/files/fr/web/html/element/video/index.html before it goes into read().

Oh I didn't even consider the bits about the cache. Maybe we should then as part of this factor out the finding of the correct path so that the cache still gets hit? Or do you prefer getting this in first? It seems worthwhile to me to reduce the responsibilities of read a little

content/document.js

Gregoor

Commented on the last outstanding bit, it's your call. Choose your own adventure ⚔️

Gregoor · 2021-04-20T09:35:42Z

content/document.js

@@ -197,32 +198,70 @@ function unarchive(document, move) {
  return created;
 }

-const read = memoize((folder) => {
+const read = memoize((folderOrFilePath) => {


Oh I didn't even consider the bits about the cache. Maybe we should then as part of this factor out the finding of the correct path so that the cache still gets hit? Or do you prefer getting this in first? It seems worthwhile to me to reduce the responsibilities of read a little

peterbe · 2021-04-20T12:18:04Z

@Gregoor I researched it. A lot.
First of all, using only the en-us content, I measured how many times read() was invoked: 57,014 times
That's to build 11,770 pages. So it's 20% cache hits on the memoization. :(
How long did it take?

Built 11,770 pages in 8.6 minutes, at a rate of 22.7 documents per second.

Then I switch to this branch in this PR and measure again. Now, because the read() is called with different parameters depending on the CLI vs. things like fixing broken links and stuff. The read() function was invoked: 71,406 times.
So now the cache hit ratio was down to 16%.
BUT the total time turned out to be exactly the same!

Built 11,770 pages in 8.6 minutes, at a rate of 22.7 documents per second.

In other words, the read() is sufficiently fast that it doesn't really matter if it gets slightly less hits.

peterbe · 2021-04-20T12:20:19Z

I still think we should split it up so that read() expects a full path. And calling it with a "folder" would be something like readByFolder() instead. BUt I'd rather not tackle right here right now because it would make the PR even bigger.

* refactor the Document.findAll() function * more refactoring * cleaning up * adding a very simply unit test * adding .withErrors() to avoid error swallowing * feedbacked

peterbe added 3 commits April 16, 2021 09:12

refactor the Document.findAll() function

a09a96b

Merge branch 'main' into refactor-the-documentfindall-function

e8dd579

more refactoring

c4700d8

peterbe added 2 commits April 16, 2021 15:03

cleaning up

a6caf60

adding a very simply unit test

2cc5803

peterbe marked this pull request as ready for review April 16, 2021 19:35

peterbe requested a review from Gregoor April 16, 2021 19:35

This was referenced Apr 16, 2021

A big THANK YOU is in order thecodrr/fdir#57

Closed

use fdir for server search-index gathering #3541

Merged

Gregoor reviewed Apr 19, 2021

View reviewed changes

peterbe added 3 commits April 19, 2021 10:18

Merge branch 'main' into refactor-the-documentfindall-function

5b0b2a5

adding .withErrors() to avoid error swallowing

17f4991

feedbacked

035ff19

peterbe requested a review from Gregoor April 19, 2021 14:50

Gregoor approved these changes Apr 20, 2021

View reviewed changes

peterbe merged commit 44c583f into mdn:main Apr 20, 2021

peterbe deleted the refactor-the-documentfindall-function branch April 20, 2021 12:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor the Document.findAll() function #3537

refactor the Document.findAll() function #3537

peterbe commented Apr 16, 2021 •

edited

Loading

peterbe commented Apr 16, 2021

peterbe commented Apr 16, 2021

Gregoor left a comment

Gregoor Apr 19, 2021

Gregoor Apr 19, 2021

Gregoor Apr 19, 2021

Gregoor Apr 19, 2021

Gregoor Apr 19, 2021

peterbe Apr 19, 2021

Gregoor Apr 20, 2021

Gregoor left a comment

Gregoor Apr 20, 2021

peterbe commented Apr 20, 2021

peterbe commented Apr 20, 2021


		const Document = require("./document");

		describe("test Document.findAll()", () => {

	describe("test Document.findAll()", () => {
	describe("Document.findAll()", () => {

refactor the Document.findAll() function #3537

refactor the Document.findAll() function #3537

Conversation

peterbe commented Apr 16, 2021 • edited Loading

peterbe commented Apr 16, 2021

peterbe commented Apr 16, 2021

Gregoor left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Gregoor left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peterbe commented Apr 20, 2021

peterbe commented Apr 20, 2021

peterbe commented Apr 16, 2021 •

edited

Loading