Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add attributes to control header and external source inclusion #44

Merged

Conversation

dave-hagedorn
Copy link
Contributor

emit_headers - controls whether header files are included in the generated compile_commands.json
emit_externals - same, but for external sources or headers - bazel external workspace sources, etc.

Both default to True - the existing behaviour.

Some tools like ccls don't seem to work well with headers in the compile_commands.json, and some large
projects can generate very large compile_commands.json databases when including all external files.

@dave-hagedorn
Copy link
Contributor Author

Awesome project, BTW. Really helpful to me at work and for personal projects.

@cpsauer
Copy link
Contributor

cpsauer commented Apr 18, 2022

Hi, @dave-hagedorn! I'm delighted to hear you're finding this tool useful. Thanks for a high-quality contribution!

I'm super down to merge in the option to turn off header finding. It makes sense that tools would want the just set of compile commands you'd run for a full build and wouldn't necessarily need headers to work around clangd/clangd#123. Thanks for adding.

I'm also down with an option to exclude external workspaces, though I'd imagine the project would need to be really big to need it with headers off, right? Is that the case you're finding yourself in?

Thanks also for fixing the trailing whitespace that's crept in. Sorry about that. If you know of a good, automated (GitHub bot?) solution to auto-remove trailing whitespace that gets pushed, I'd love to know.

A few last things to get us over the finish line:
(1) For excluding external workspaces, I want to check in on the heuristic with the or clauses. In particular, I'm wondering about treating all files in bazel-out as external, since I'd think generated files from the local workspace would also live inside bazel-out. (If excluding all generated files is intentional, please lmk. It does make sense that you might want to exclude all generated files if you were concerned about compilation database size and wanted to exclude all files that are rarely browsed. But then the parameter should probably have a different name to match.)
Assuming excluding all generated source files from the local workspace is not intentional: One alternative might be to directly look at the target that generated each action and filter out targets in external workspaces. I think the target generating the action is supplied in the aquery output. (Try running an aquery command like the tool would, e.g, bazel aquery "mnemonic('CppCompile',deps(//...))" --include_artifacts=false | less, and search for Target: . Looks to me like the external targets all start with @ in that output and the local ones do not. I'm imaging something similar holds in the aquery jsonproto output piped into the tool. Just add --output=jsonproto, or explore it in the python code.)
Another (even better) alternative might be excluding external workspaces directly in the aquery invocation, saving parsing work. Aquery docs, inheriting from query. I think what you want is the except keyword, inherited from query. Would you be down to experiment with that?
(2) It might simplify things to do template expansion right where the variables are used, rather than having to pass the variables down to the functions that use them. Feel free to spread the template expansions out in the file.
(3) I think the refresh_compile_commands macro/wrapper might be setting the defaults the other way.
(4) Since we're doing refresh_compile_commands docs in the docstring the top of the file, would you be down to add documentation there instead--and maybe to the README if you think that's appropriate? I can also help with this.
(5) Thoughts on include_headers and exclude_external_workspaces as alternative parameter names?

Again, thanks so much! I really appreciate your going back and making the tool even better than you found it.
Chris

P.S. I'm curious what leads you to prefer ccls over clangd!

@cpsauer
Copy link
Contributor

cpsauer commented Apr 18, 2022

I also want to double check on one more thing: The primary problem you're running into is that the tools you're using don't want the headers, and that compile_commands.json is too big--and not that the tool is too slow, right?

@dave-hagedorn
Copy link
Contributor Author

Hi Chris,

Thanks for all the info and fast response.

(1) - Good point. I forgot about generated files. I'll try your aquery inspection technique. If that turns into a rabbit hole I would just NOT exclude bazel-out/* but still exclude absoulte paths - does that work for you?

(2) - My personal preference here is to leave the script inputs in one place and propagate them throughout the code, as this makes running the script standalone for testing/dev easier. But if you have a preference here I'll defer to that, just let me know

(3) - So it is - good catch :)

(4) - Sure, can do. Do you want me to remove the doc attributes as well?

(5) - I originally had include_headers, and thought it might get confusing. "include" gets used in a lot of contexts (include dirs, etc.). But again I'll defer to you. I do like exclude_external_workspaces. What about exclude_headers (defaulting to False)?

To your last question - yup. Your tool is fast regardless, but the time it takes clangd to process a 10k+ line compile_commands.json file is prohibitive. I've also had trouble with ccls trying to consume headers in compile_commands.json.

I use both ccls and clangd. ccls indexes a lot faster and has CodeLens indicators for cross-references. Both of which are really useful to me when exploring large code bases (WebRTC comes to mind). I've used clangd for the same but the indexing took over an hour vs ~10min. and I'm right clicking all the time to get the "find references" item in the context menu.

On the other hand I've found clangd seems to have now caught up and surpassed ccls in terms of code completion especially on C++20 code, and is more forgiving of incorrect syntax. ccls sometimes fails to complete the line you're typing until it's well formed, whereas clangd doesn't seem to have this limitation. I also really like the inlay hints in clangd 14

@dave-hagedorn
Copy link
Contributor Author

Fun fact - aquery text output has a Target: attribute for each entry, and you can check if it has the @ prefix.
jsonproto output does NOT have this attribute :)

@dave-hagedorn
Copy link
Contributor Author

Fun fact - aquery text output has a Target: attribute for each entry, and you can check if it has the @ prefix. jsonproto output does NOT have this attribute :)

Digging some more I think I can link the targetID for an action with its target in the json proto

@cpsauer
Copy link
Contributor

cpsauer commented Apr 19, 2022

Thanks for being great and responsive yourself :)

Things are looking great--looks like you've resolved most of the above.
[Re (5), definitely agree; I like your names better. Thanks on (4) and (3).]

To answer the questions around (1):

  • Looks like you made it through the jsonproto ID matching. (A bit gross, I know. Sometimes there are things missing from the proto output, like, say, the action's progress message.)
    • Did you get a chance to investigate the aquery except option? (See "even better", above.) I know this one's already done, but that might be, well, "even better", direct & simple, and even more robust. (Bazel could reasonably choose to (someday) express local workspaces with @ or external with //external, for example.) Conversely, if you found that that didn't work, I'd love to know.
  • Along the way, looks like the case of external headers came up...and absolute paths. I think this one is might end up a little trickier than it appears.
    • On Windows, there are lots of unnecessarily absolute paths floating around in Bazel as I recall--and that might break things. I'm away from my Windows desktop right now, but I could check tomorrow evening? I think the risk of problems is high enough that it's worth a check.
    • You've probably thought more than I have about how these two options should interact and about what people's goals would be in using them. It's not obvious to me whether the the external toggle should apply to headers or just actions.
      • For example, (thinking as I write), I can imagine that people might be primarily motivated to exclude_external_workspaces because indexing is too slow and they have decided that they never need to browse or think about the implementations of dependencies (compile actions), only the interfaces (headers).
      • But if they need exclude_headers = false to satisfy clangd, then excluding entries for those external headers might hit the case that causes the most pain. We'll have excluded the source file near those external headers, so there won't be a nearby compile_command entry for clangd to infer, so clangd will choose something more or less at random, maybe even in the wrong language.
      • All that said, if they only ever CTRL/CMD click into those headers, clangd should infer the compile command from that and avoid the problem.
    • ^ Is it clear to you what we should do here, or what would most motivate usage of these toggles? (By default, I'd probably agree with you that we should exclude external headers, assuming we can get it working on Windows. And then I'd use it for a while and see if I run into issues.)
    • Related: When it's clear what would motivate people to use a feature, I think it's really valuable that we put it in the docs. Like for example: "Using CCLS or another tool that doesn't need compile commands entries for headers?" (I can also add this later.)

On (2): I still kinda think we should directly templatize and scope things to where the settings are used. But if you're finding that running standalone is a big boost to your development productivity, then we should lean into that. Has it been useful to run independently? I'd previously been thinking that the template wasn't really that runnable without Bazel.

One minor thought: Okay if in refresh_compile_commands we have exclude_headers and exclude_external_workspaces default to False rather than None, just so it's totally obvious to readers how to read things?

Finally, thank you so much for the color on clangd vs ccls. That's really helpful and good for me to know. Sounds like you're still seeing a big indexing speed win vs clangd14, contrary to MaskRay/ccls#880? If you're curious about how I'd thought about it previously, search "CCLS" in https://github.com/hedronvision/bazel-compile-commands-extractor/blob/main/ImplementationReadme.md. (And when I hear back from you, I'll update it to include what I've learned from you.) Also, I bet the clangd folks would love to hear your feedback--both on the codelens and on indexing speed--if you'd be willing to file an issue over there. They're responsive and super great in my experience.

-CS

@cpsauer
Copy link
Contributor

cpsauer commented Apr 22, 2022

@dave-hagedorn, I checked on the Windows absolute path sub-case while I was with that desktop....

And msvc does indeed emit absolute paths for all the headers. But if we do decide want to exclude entries for external headers if exclude_external_workspaces is turned on, then I don't think that should be too hard to handle properly.

As an outline, instead of removing all absolute paths, we'd include them if they were in the Bazel workspace (gleaned from the environment variable--see the set working directory code) but not external. Definitely worth delegating the logic to a library function, though, since windows paths are a bit trickier with, e.g. / and \ both valid.

@cpsauer
Copy link
Contributor

cpsauer commented Apr 26, 2022

(Merged in the latest changes in from master.)

@cpsauer
Copy link
Contributor

cpsauer commented Apr 26, 2022

@dave-hagedorn, I know this has been a bit more involved, but could I still get your help getting it over the finish line at some point?

@cpsauer cpsauer self-requested a review April 27, 2022 00:51
@cpsauer cpsauer self-assigned this Apr 27, 2022
@cpsauer
Copy link
Contributor

cpsauer commented May 4, 2022

@dave-hagedorn?

@tndhagedorn
Copy link

Hi @cpsauer . Sorry - I got sidetracked and now travelling for work. I will be able to pick this up again next week.
If you'd like I can close and reopen the PR in the meantime so it's not hanging around. Let me know.

Some updates - I did try subtracting external targets via query, but nothing I could find worked. - @//... - @... - //external/... etc.... So I think the exclusion needs to be done using of the @ prefix on each target name.

Regarding your other suggestions, no problem I can make the changes.

Nice catch on the MSVC paths. Thanks for checking this. Makes sense - can normalize the paths and filter out anything not in the workspace.

Regarding your point about exclude_external_workspaces=True and exclude_headers=False - I think this is OK.
compile_commands.json will still contain entries for the headers that are being developed in the workspace, and code completion will work fine in those. I sometimes ctrl+click into an external header but rarely (if ever) navigate beyond that header, and never edit its code. This works well for me whether I use ccls or clangd.

This is driven by my context/experience, but I find that tradeoff OK, vs indexing a possibly large set (at least in my experience) of external headers. If a workspace pulls in say grpc, boost, abseil, etc... plus the system headers - you can still get a very large compile_commands.json full of headers you rarely (in my opinion) navigate or edit.

Another alternative is to instead add a regex type attribute that would allow more fine-grained control of which files to exclude, although I think external/not-external is a simpler and common enough use case.

@cpsauer
Copy link
Contributor

cpsauer commented May 10, 2022

Sweet! Thanks for replying. Hope the trip is going well.

(Definitely no need to temporarily close--just wanted to check in and make sure I hadn't driven you away by being too exacting. Thanks for being a good sport about all this and a great contributor!)

Re aquery: Your great jsonproto traversal it is, then! Thanks so much for experimenting.
[Bummer that they got rid of the old //external/... syntax. I thought I'd figured out a clever solution in, e.g., bazel aquery 'deps(//:whatever) intersect //...', but it turns out that doesn't work either, since the //... doesn't pick up configuration that might be necessary.]

Re exclude_external_workspaces=True and exclude_headers=False and whether that should exclude external headers: I think the pain here comes if you close and reopen the editor or otherwise open the external header any way besides ctrl+click. Then things will get messy if the wrong command is inferred, especially if clangd deduces the wrong language, as it often does. For example, I'd definitely be tempted to set exclude_external_workspaces=True on big projects with lots of (transitive) dependencies, but not if it'd often break browsing the external headers I'm including by removing their entries. Does that make sense?
[Some other angles: Perhaps this should be a third setting? exclude_external_headers? Or perhaps, external headers are always most of the bloat, and you'd only flip exclude_external_workspaces if compile_commands.json were too big anyway, so you'd always want external headers gone? Anyway, that's what's still unresolved in my head.]

Thanks for being great and thoughtful about everything, Windows included!

@dave-hagedorn
Copy link
Contributor Author

Hi @cpsauer , back and able to wrap this up.

Here's my suggestion for headers - let's decouple sources from headers, and be explicit about how to exclude both. I think this is also what you are saying

For sources - there are two kinds: in workspace and external workspace
For headers - there are three kinds: in workspace, external workspace, and what I'll call system headers (/usr/include, /Users//Library/..., )

With that I'm thinking attributes:

exclude_external_sources: bool - if True, omit any .c, .cc, ... files from external workspaces. This is easy by parsing aquery output (as discussed above)

exclude_headers: string - one of all, external, system, external_and_system.
The rules here are pretty simple:

  • external: If the header is relative and starts with external, or is an abolute path inside BUILD_WORKSPACE_DIRECTORY/external (covers MSVC and abs paths)
  • system: if the header is NOT relative, and not an absolute path inside BUILD_WORKSPACE_DIRECTORY

With this, a user can:

  • exclude external sources - if they don't care about browsing any external sources
  • exclude all headers - makes tooling like ccls happy
  • keep all headers - makes tools like clangd happy, user can browse all headers
  • keep only workpace headers by setting "exclude_headers" to "external_and_system" - fast indexing and user can browse their own headers
  • keep only workspace and external by setting "exclude_headers" to"system" - somewhat faster indexing and user can browse their own headers and all their deps' headers. Maybe I don't need to browse all my systems' libc++ headers - that's what cppreference is for - but I still want to browse the abseil or whatever headers
  • keep only workspace and system headers by setting "exclude_headers" to "external" - can't really think of a case for this TBH, but this completes the set of options

I think this covers all possible use cases.
If that's too verbose/unnecessary, your idea is very similar, in which case I'd just have

  • exclude_external_sources
  • exclude_external_headers - would cover both external deps and system

I favour the first form. Just looking at one of my projects, I see 368 external headers and 131 system headers. Having the ability to omit indexing the system headers would speed up indexing quite a bit.

@dave-hagedorn dave-hagedorn force-pushed the option-to-exclude-headers-and-externals branch from 3880182 to 6c45b7c Compare May 17, 2022 18:32
emit_headers - controls whether header files are included in the generated compile_commands.json
emit_externals - same, but for external sources or headers - bazel external workspace sources, etc.

Both default to True - the existing behaviour.

Some tools like ccls don't work well with headers in the compile_commands.json, and some large
projects can generate very large compile_commands.json databases when including all external files.
Rename attributes - start with exclude_
Rework external exclusion algorithm - use action targets and look for  prefix
Tweak how python script passes options around
Remove extra whitespace
Tweak  macro - default args are set to None, rely on inner rule's attribute's default values
Changed attributes to exclude_external_sources and exclude_headers
@dave-hagedorn dave-hagedorn force-pushed the option-to-exclude-headers-and-externals branch from 6c45b7c to 5a768e1 Compare May 17, 2022 21:03
Copy link
Contributor

@cpsauer cpsauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, Dave! Thanks for getting back so thoughtfully.

Right there with you on decoupling. I think your names and framing are clean and great. Thanks for getting us to better thinking.


On the interface, my one concern is whether we'll be able to reliably distinguish system headers from other external headers. As an example, on some platforms, Bazel links to system/builtin headers via //external. (I know it does this on Android, at least, and I'd guess, maybe LLVM toolchains.) We could try to carefully tease these apart, but I think it might be hard to distinguish from the local_repository case more generally.

Proposed Solution: If we don't see separating system headers from external ones as a major use case, perhaps exclude_headers should have be limited to three tiers for now: "all", "external", and None, with system headers treated as external headers. Thoughts?

[Skip these next bullets unless you disagree and reading more might save a round trip.]

  • I do agree that major libraries, especially those from the OS, are more likely to have docs that obviate the need to view their headers. But the line is a bit blurry, and it feels like "all", "external", and None have clearer use cases.
  • Similarly, super granular control, like letting users specify which external workspaces to exclude, seems probably excessive if the goal is just quicker browsing. CPU time is so cheap compared to human time.
  • I'm hoping clangd will at some point traverse the include graph on its own, obviating the need to find headers at all. I also really like doing things right. So I'm always torn about how much complexity we should pack into the header finding code. It's required lots already.
  • Note to self: Excluding external headers was indeed a lot more subtle than it first appeared. (See also review comment about generated headers.)

Could I also ask you to take a polishing pass through the implementation thereafter?
I'm seeing some some minor things--typos, an unneeded include, etc.--but also some bigger things--like generated headers or running the (slow) header search when excluding headers. I'll mark some quickly that I see, but I'm more generally hoping I could ask you to give things a double check.

Then, assuming you've tested and everything, let's merge! Thanks for bearing with me and getting these cases handled really right.
Chris

P.S. Thanks for solving the rebasing--and sorry if the bad one was from me.

refresh.template.py Outdated Show resolved Hide resolved
refresh.template.py Outdated Show resolved Hide resolved
refresh.template.py Outdated Show resolved Hide resolved
refresh.template.py Outdated Show resolved Hide resolved
refresh.template.py Outdated Show resolved Hide resolved
refresh.template.py Outdated Show resolved Hide resolved
refresh.template.py Outdated Show resolved Hide resolved
refresh_compile_commands.bzl Outdated Show resolved Hide resolved
refresh_compile_commands.bzl Outdated Show resolved Hide resolved
refresh_compile_commands.bzl Show resolved Hide resolved
Change exclude_headers options to just all and external
Optimize some header eclusion cases (thanks @cpsauer)
Update docs
@dave-hagedorn
Copy link
Contributor Author

Hi @cpsauer, I think I've taken into accout most (all?) of your recent suggestions.

I did update README.md, feel free to amend this as needed.

At this point I don't have a lot more time to devote to this PR. Any glaring fixes I'm happy to take on but for any refinements, can I ask that you help get this across the finish line? I'm also fine to polish in a follow-up PR. Let me know if this works for you, thanks!

if not file_exists:
if not _get_files.has_logged_missing_file_error: # Just log once; subsequent messages wouldn't add anything.
_get_files.has_logged_missing_file_error = True
print(f"""\033[0;33m>>> A source file you compile doesn't (yet) exist: {source_path}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried the new version of this PR: I think this should be source_file here.

I got the following error:

File "/home/user/.cache/bazel/_bazel_q456457/1ccf5f150b64de76858888618e3183bd/execroot/ddad/bazel-out/k8-fastbuild/bin/refresh_compile_commands.runfiles/ddad/refresh_compile_commands.py", line 302, in _get_files
    Continuing gracefully...\033[0m""",  file=sys.stderr)
NameError: name 'source_path' is not defined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alexander-born. Fixed.

@cpsauer cpsauer merged commit 93dc21a into hedronvision:main May 28, 2022
@cpsauer
Copy link
Contributor

cpsauer commented May 28, 2022

Wahoo! Merged! Thanks @dave-hagedorn for some great new features--and breadth of support. Appreciate your bearing with, even though filtering external headers turned out to be substantially tricker than expected.

[And please, if it looks like I messed something up in my attempt to go over things and make any improvements I could find, say something!]

@cpsauer
Copy link
Contributor

cpsauer commented May 28, 2022

And sorry for the slowness. Doing my best--just a little overloaded here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants