Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update-bcd: rewrite update function logic #801

Merged
merged 45 commits into from
Nov 24, 2023

Conversation

mzgoddard
Copy link
Contributor

@mzgoddard mzgoddard commented Oct 31, 2023

Summary

Rewrite update-bcd's update function into a series of smaller composable functions and generators.

Differences from mdn/browser-compat-data#20209

  • Replace monad-style internal type with flatter representation.
  • Document Update types in typescript interfaces.
  • Inline small functions into update composition.
  • Isolate system complexity to expand and compose.
  • Add debug stack for tracking steps applied in a loop iteration and value returned by them.

TODO

  • Bug in provide_entry: currently can't match paths to Identifiers in bcd files. (Every iteration is filtered at filter_entryExists at the moment.)
  • Confirm current output is produced.
  • Resolve eslint(no-unused-vars). (Seems to be a bug in eslint. The vars its complaining about are used.)
  • Log updates to a log file. update is called multiple times, maybe pass a function to update's options to emit logs? In main the function would concat an array and serialize it to a log file at the end? Or it could append each change alone to a log file?

Related Issues

mdn/browser-compat-data#19868
mdn/browser-compat-data#20209

queengooborg and others added 23 commits October 23, 2023 23:18
Generated by release.js.
* update-bcd: fix getSupportMatrix overrides param type

* fixup! update-bcd: fix getSupportMatrix overrides param type

---------

Co-authored-by: Queen Vinyl Da.i'gyu-Kazotetsu <vinyldarkscratch@gmail.com>
This was causing issues with passing arguments into the scripts.
Generated by release.js.
@foolip
Copy link
Member

foolip commented Oct 31, 2023

@mzgoddard there are conflicts here, can you resolve?

@mzgoddard
Copy link
Contributor Author

Areas for Improvement

Here are some areas that I think could be better in this PR.

  • Log severity: Some reasons an update is skipped are produced frequently. Many configurable filters like the --path filter do this. Currently severity can be marked with reason(..., {quiet: true}). quiet in this case means do not log to terminal. Instead of quiet should it be something like the common log levels: debug, info, log, warn, error? Is there a level that is not written to a log file?
  • Throw Reasons: Maybe a better way to structure filter or skip would be to throw reason's return value and catch in expand.
  • Reason function: Many reasons use UpdateLog values outside whats being tested. Maybe reason can take a function that is called with UpdateState like filter's current third argument.
  • Emit debug stack flag: I think the debug object may be useful for debugging and perhaps should be retained. In which case maybe it should be included in the object log if a flag is set. It might be useful for deeper analysis

Possible Follow-up Work

  • Start with loop over entries in-place of loop over supportMatrix. update is called with each bcd file, iterate Identifier and then search for corresponding path in supportMatrix.
  • Compose getSupportMatrix: The logs in this function look like they would be useful to be logged for other analysis. Maybe the composed functions could be integrated in update.
  • Version handling: It's hard to exhaustively consider possible version values. Replace method calls (.includes and .replace), binary operations, and compareVersions calls on version_added, version_removed, etc. At this point I think it would be functions like compareVersions but that they'd do some invariants or casting as needed first to make sure the different version types are handled correctly.
  • Filter branching: The flow of update is currently continue after first reason. A future change may want to introduce for example supporting inferredStatements.length > 1 while still supporting inferredStatements.length === 1`. This might be done by "catching" the previous reason.
  • Rewrite inferSupportStatements: inferSupportStatements is hard to understand. Its probably better at this time to leave it as is.

@mzgoddard
Copy link
Contributor Author

I think this PR is ready for review.

I fixed bugs and differences I had incidentally added. I've tried to keep the branching logic in the existing version of the script to make it easier to compare. I think this might also make it more readable by collecting the logic. Many else if branches in the existing script have implied tests that are more explicitly stated here inverted by a !(/** old if branch */).

I stroke through the fourth todo. I think we can add it as a follow up.

As part of this I've implemented the reason function improvement I mentioned earlier. Kind of like the throw reason one all of the reasons are returned and are produced by an earlier version's filter helper's separate third parameter.

provide and provideShared now require a value to be returned. An oversight I had made was returning undefined or void from a function overriding a existing value. This was how one bug had been caused with provideStatements. The current provide and providedShared uses aren't affected by this since there are not multiple calls on the same property but if there are future cases of this I think this should help warn about that.

provideStatements can return undefined or void but only because it doesn't in the positive case expect the value for that key (statements) but a tuple of the value and a reason. I'm not sure about this specific shape for the return type. I like returning a reason with the value. I hope it would be useful in logging why a change to bcd is made. I am thinking it would be nice to return a declarative object about how to change the update's state so you can combine the changes for statements and reason. But I figure since it may be good to require both, there is for now one type to be returned. Not using a more complicated return type and the corresponding logic to support it makes this api simpler.

For the reasons returned in provideStatements calls I added a skip property to the Reason type. Most of the statements produced do not stop work in the existing script. That is what this property is for. I'm not sure about the name, or if it should be in Reason or UpdateLog, or if it should be represented as a boolean.


As an aside I made a try at the first further work section, iterating Identifier instead of iterating supportMatrix.entries(): bocoup@9a202ed. I think its interesting as an example for showing how we can make changes on top of this work. Also this change would probably be an important part of logging to a file. With the current order there can be 80,000 some updates considered for each bcd json file.

@mzgoddard mzgoddard marked this pull request as ready for review November 14, 2023 06:48
@foolip foolip requested a review from Elchi3 November 14, 2023 16:13
@mzgoddard
Copy link
Contributor Author

I've cleaned up some tests for the expand, provide, and related helpers.

bocoup@4d38696

I can make a separate PR or push them here if that's better. Don't want to interfere with anyone already reviewing it with a big change like that. (Most of the change is in the test file. There is a small change to compose to generalize its types.)

Copy link
Member

@Elchi3 Elchi3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I run update-bcd with this PR, it creates the same updates to BCD as it does on main. So that's good. Tests pass as well and it's great to see additional tests for the new helpers. Happy to have them in a follow-up.

I noticed the update-bcd in this PR is slower than the one on main. Is that due to the extra logging/helpers that are introduced here or is some iteration changed somewhere that slows it down?

I'm not the original author of update-bcd and I find reviewing this PR quite challenging personally. That said, I'm happy to merge this in given it will help us to use the logs this produces programmatically for our project.

How can I use the log object? Maybe that's done in bocoup@9a202ed? Would it be good to provide a new log output format (json) that I can get from the command line? npm run update-bcd --log=json or is there another way you use the Log?

Sorry, if I'm asking obvious things here. I will try to read into the code some more but update-bcd really is quite complex to me.

@Elchi3
Copy link
Member

Elchi3 commented Nov 15, 2023

Also cc'ing @queengooborg and @teoli2003 who I hope could help with the review here.

@foolip
Copy link
Member

foolip commented Nov 15, 2023

I'd be happy to review this, but wasn't sure if I should jump in because I won't be a primary maintainer for the collector now that it's moved into OWD. Please let me know if my review would help / be enough to land this.

Regarding the logging, the plan is to do logging to a file or similar as a follow-up to this.

@Elchi3
Copy link
Member

Elchi3 commented Nov 15, 2023

Thanks @foolip! I think your review would definitely help!

From the OWD side, I consider @queengooborg the main maintainer of this project but I think we are all acting as Peers here, so stuff can land once a Peer approves a PR. Adding a governance doc to this repo is still on my to-do list. Sorry for being slow with that.

@mzgoddard
Copy link
Contributor Author

If I run update-bcd with this PR, it creates the same updates to BCD as it does on main. So that's good. Tests pass as well and it's great to see additional tests for the new helpers. Happy to have them in a follow-up.

👍

I noticed the update-bcd in this PR is slower than the one on main. Is that due to the extra logging/helpers that are introduced here or is some iteration changed somewhere that slows it down?

Yes. It's much slower. I profiled it at one point to see if there were simple changes I might make to speed it up. Its spending a lot of time doing a few things. It spends most of its time in the generators produced by expand. As well its creating a lot of objects to manage the state. A good bit of the time is spent collecting those values too. I think there are ways we could improve this and keep it in expand but I don't think it'd be a simple system.

Outside of expand I think a simple change we can make for performance is something I mentioned for possible future work, iterating Identifier instead of supportMatrix like with bocoup@9a202ed. Most of the iterations through update are over path values that are not in the given bcd file.

I've opted to not do that change at this point in a desire to make this easier to review by having comparable order of logic with the current version.

I'm not the original author of update-bcd and I find reviewing this PR quite challenging personally. That said, I'm happy to merge this in given it will help us to use the logs this produces programmatically for our project.

Yeah. The change of indent on the if/else blocks makes it hard to review.

How can I use the log object? Maybe that's done in bocoup/mdn-bcd-collector@9a202ed? Would it be good to provide a new log output format (json) that I can get from the command line? npm run update-bcd --log=json or is there another way you use the Log?

As @foolip already replied we'll set this up in a follow up. I think that'll be good so we can shake out details like how they are logged in their own PR.

Sorry, if I'm asking obvious things here. I will try to read into the code some more but update-bcd really is quite complex to me.

No, I think they're good questions.

@Elchi3 Elchi3 requested a review from foolip November 20, 2023 11:58
@foolip
Copy link
Member

foolip commented Nov 21, 2023

@mzgoddard can you add a bit of documentation explaining the basic data flow? In particular I'd like to understand the roles of compose , UpdateState, provide, and the provide-prefixed methods. It looks like none of the functions composed with compose have a documented return type, but they do return stuff. Is state communicated only through the returned values, or are UpdateState objects passed around and mutated? If there are hard rules here, documenting them and asserting what's possible would help make it clearer.

It sounds like the performance problem has something to do with the number of temporary objects being generated, and perhaps it will be clear where the unnecessary work is happening given a description of the data flow.

@ChrisC
Copy link
Contributor

ChrisC commented Nov 21, 2023

Is state communicated only through the returned values, or are UpdateState objects passed around and mutated?

@foolip You're correct that a state object is passed around and updated throughout the composed generator functions. This state object is of the UpdateInternal type (which extends UpdateState & ultimately UpdateLog).

The UpdateInternal object contains data from each entry iteration, including the entry path and related browser info, existing support statement(s), & any new modifications to the statements, along with reasons for making those modifications. Additionally, it contains a "shared" object that houses all the data that we're working off of to determine whether or not to update, including the test result Support Matrix and the last version of the BCD. It also contains a "debug" object, which attempts to build a stack trace of the changes to the entry data.

I do think it will be helpful for all of us to have a more detailed, step-by-step summary/visualization of the exact transformations that happen to UpdateInternal, but here's a high-level summary first:

  1. First, we build the shared data object in the first set of provideShared functions.
  2. Then, we collect the existing support statements in the following series of provide...Statements functions,
  3. Finally, we make decisions about whether or not to update those statements in the last series of persist... functions.

It sounds like the performance problem has something to do with the number of temporary objects being generated, and perhaps it will be clear where the unnecessary work is happening given a description of the data flow.

Yes, I can work on a more detailed diagram of the data flow, and add a few more test assertions to better document what's happening (especially in the persist... stages of the data flow), but first, I'm going to work on getting Z's suggested change into this PR for the performance improvement (where we iterate through the BCD data instead of the Support Matrix ). Some follow-up testing we did today shows that this change actually gives the whole script a significant net performance gain (and not a smaller net loss as we previously reported)!

@ChrisC
Copy link
Contributor

ChrisC commented Nov 22, 2023

@foolip @Elchi3 confirmed that this recent iterator change reduces the overall execution time of the update script by about ~2x. From my last local run, execution time on the main branch is at 12.2 seconds vs. 5.91 seconds on this branch.

@foolip
Copy link
Member

foolip commented Nov 24, 2023

Thank you @ChrisC!

I have also tested the script before and after these changes. It seems faster, although I didn't measure by how much. Very nice!

I've also confirmed that npm run update-bcd (with no arguments) make the same changes with and without these changes. While that's not a guarantee that there are no differences for all possible states of mdn-bcd-results and BCD, it certainly increases confidence.

I have reviewed the code, and while I think it would benefit from some more high-level documentation, I also think this PR can be merged so that additional changes can be more incremental.

@foolip foolip merged commit a70d5c6 into openwebdocs:main Nov 24, 2023
2 checks passed
@Elchi3 Elchi3 mentioned this pull request Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants