Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we support inputs filter in bazel aquery? #93

Open
xinzhengzhang opened this issue Dec 4, 2022 · 25 comments
Open

Can we support inputs filter in bazel aquery? #93

xinzhengzhang opened this issue Dec 4, 2022 · 25 comments

Comments

@xinzhengzhang
Copy link

Bazel's aquery functions can combine many interesting usage.
In my case, I am writing an extension integrating with compile_commands based on file granularity and I need a file filter from the root target.

I have done a commit xinzhengzhang@56e4147 since the target configuration is a dictionary now I have not consider the use case of multiple targets.
Back to the issue, do we need to support this feature? If there is any suggestion how to design the interface? I'd be happy to implement it.

refresh_compile_commands(
  name = "refresh_compile_commands",
  targets = {
    "//path:target": ""
  },
  # This string will vary depending on the file opened by the extension
  input_filter = "a/b/c/file.c",
  tags = ["manual"],
)
@cpsauer
Copy link
Contributor

cpsauer commented Dec 5, 2022

Hey, @xinzhengzhang. Interesting. Yeah, there's a lot of flexibility to aquery that we're wrapping without exposing.

Could I ask for your "goal backtrace"--that is, what goal you're trying to accomplish and how? I think that'll help us get to the best solution.
(Is it broader than the swift-only use case from the PR?)

My thoughts offhand, without knowing more details: Having a single file filter regex parameter seems reasonable if it's needed, since people can just | together regexes, but there some subtitles: inputs() matches all actions that (might) use that input, I believe, so users who put some system header might be surprised to see everything match the filter instead of just outputting entries for files that match the regex. We could resolve this by matching at the beginning with inputs() for performance and then again at the end during output for consistency with expectations. Or we could just say that it should only be used for source (non-header files). Maybe a better interface would be a list of globe (converted to a regex?) Another possibility would be to have an advanced interface where we let people control the filter with aquery syntax. This all kinda depends on the needs we're trying to serve here.

I'm tagging also @alexander-born, just in case it's relevant to the "current file" use case he'd been describing in another issue.

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Dec 5, 2022

In bilibili we have a monorepo organized by bazel with tens of millions of lines of code and it is impossible to extract all target(//...) to the root project. I have tried that a 4GB compile_commands is too hard for parsing not matter source kit-lsp or clangd.

So I plan to implement a plugin based on file-level. It will observer event of text editor and auto generate file-based compile_commands.json. Of course it will sacrifice some features like find references but it doesn't matter in my scene, because it is also completely unavailable in Xcode...

For source file(.swift .m .mm) we can find the corresponding target which guide how the file should be compiled.
For header file(.h) we can find the top level target depends on this header and extractor them and the header is also unpacked by the way. (It will not work if we are depend c library using module, so I disabled header extractor by default in my plugin)

I have done a very very early version for internal use https://github.com/xinzhengzhang/bis

@cpsauer
Copy link
Contributor

cpsauer commented Dec 8, 2022

I see, so the problem is an overwhelmingly large codebase/compile_commands.json. 4GB without headers is definitely overwhelmingly huge, but do I think it's kinda entertaining that it sounds like this tool managed to generate it. I presume Xcode isn't giving you references, since it, like sourcekit-lsp, is overwhelmed.

Just to make sure you're avoiding a trap here: Presumably the downstream swift, cc, objc_libraries require inherited configuration from the binary rules used to build them--and the *_library rules therefore aren't configured correctly if built or aquery'd on their own. They'd be configured for the host (macOS) instead of the target (iOS, for example). Therefore it wouldn't work to just find the target that uses a given source or header file; instead you'd have to find a root target that transitively depends on the library and get the commands by aquery'ing the deps of that target. (Maybe that's what you mean by "top-level" target--but I want to make sure.)

I assume it wouldn't work to just resolve this by having people specify the top-level targets they're working on in refresh_compile_commands? Assuming it's not one monolithic target, that'd narrow things down considerably. (Maybe this is what you're trying to automate.)

I also want to make sure there isn't some easy way to slim down compile_commands.json without making you do all this work. This is already with header extraction turned off, on a focused list of targets in `refresh_compile_commands", right? Could I ask you to also check that the overwhelming json size isn't just because of duplicate listings of files? That is, go into the 4gb file and make sure that each source file is listed only at most a few times (once per architecture), rather than like a 1000 times, or something that would lead to easily avoidable large file size.

@xinzhengzhang
Copy link
Author

Thanks for the reminder and I didn't step on this pit and there are two reason why I need a input filter

  1. Since compile-commands-extractor is based on bazel aquery I can not set any cfg to match the target configuration. ios_application using his own configuration like
    load("@build_bazel_rules_apple//apple/internal:transition_support.bzl", "transition_support")
    
    _ios_application = rule(
        cfg = transition_support.apple_rule_transition,
        attrs = {
            "deps", attr.label_list(
                cfg = apple_common.multi_arch_split
            ),
            "_allowlist_function_transition": attr.label(
                default = "@bazel_tools//tools/allowlists/function_transition_allowlist",
            ),
        }
    )
    
  2. In my project, many R&D are used to make a library containing hundreds or even thousands of sources file and no header file is actually still annoying when developing. It takes miniutes to finished the extractor (assume they have not compiled). The goal of my plugin is to allow open files to be indexed at the seconds, so I need to filter to a certain file

The 4gb file contains the header file, because the habit of OC is one source file and one header file, removing the header file can greatly speed up the extractor but the size is still about 2gb.

There are indeed some useless things such as protoc codegen tools, but there is no large-scale repetition

@cpsauer
Copy link
Contributor

cpsauer commented Dec 9, 2022

Makes sense! To rehash: Sounds like you understand the trap; are finding that you want entries for headers; and want to use this to cut down processing and header search to something you can do in real time. I just wish I could help find an easier solution for you.

Two last other, divergent ideas, in case they're useful: There's some chance you might be able to hand write (or generate) a single clangd config file that would cover all files well enough. And I assume you've already tried the exclude_external_sources = True and exclude_headers = "external" families of options, as quick ways of trying to filter things down. (I'm not sure if you've already considered using the set subtraction syntax in aquery, like deps(:a) - deps(:b), but IIRC, that won't work correctly here because configuration issue means that the actions will be different.)

If those two ideas don't seem better--and you think this is of general use and can be done well, I'd be open to a good PR!
But, as above, there's some trickiness, and I do think that doing this well enough to not confuse users might be harder than it seems. I think the biggest of these difficulties is that accidentally including headers in the pattern will lead to a large number of unintentional matches, which is likely to confuse users. Specifying anything but a single source file would therefore require or-ing together all possible source extensions, which people are unlikely to do right. How do you think we should resolve that? One option would be to take in a list of packages and transform them into a regex? I know you don't directly need this to handle your single-source case, but it might unify the header case, and I think we'd probably need to have a good interface to make this not confusing and generally useful enough to merge in.

[One other side thought: It occurs to me that you might want to be able to pass targets as runtime flags instead of needing to modify a build file every time. I'd be open to that, too, if this works out.]

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Dec 10, 2022

Seems that how the design looks, it will confuse users. How about we just open aquery like this way?

# AqueryProvider
AqueryInfo = provider(
    fields= {
        "aquery_string": "aquery passed to extractor"
    }
)

# rule: aquery_provider
attr = {
    "labels_to_flags": attr.string_dict(mandatory = True),
    "exclude_swift": attr.bool,
} 

return [AqueryInfo(aquery_string="mnemonic('(Objc|Cpp)Compile',deps(//xxxx:yy))")]

# rule: _refresh_compile_commands
attrs = {
    "aquerys": attr.label_list(
        mandatory = True,
        providers = [AqueryInfo] 
    )
    "exclude_external_sources": attr.bool(default = False),
    "exclude_headers": attr.string(values = ["all", "external", ""]),
},

# rule(macros): refresh_compile_commands
def refresh_compile_commands(labels_to_flags, aquerys, ...):

And about the runtime flag if we open aquery directly like above do you mind if I remove

additional_flags = shlex.split(flags) + sys.argv[1:]

instead of using params like -- --aquery "//xxx:y --features=" "//xxx:yy"

@cpsauer
Copy link
Contributor

cpsauer commented Dec 14, 2022

Hmm, that seems more complicated to me than just using an aquery string_dict, though maybe there's something I'm missing?

@cpsauer
Copy link
Contributor

cpsauer commented Dec 14, 2022

What'd you think of the other proposal to keep things simple and correct for users? That is, something like:
source_filter_packages = ["//foo", "//bar/..."]

I assume neither of the other workarounds did it for you?

@xinzhengzhang
Copy link
Author

So will we make aquery like this?

bazel aquery "mnemonic('(Objc|Cpp|Swift)Compile', filter('(//foo|//bar)', deps(//:target)))"

@cpsauer
Copy link
Contributor

cpsauer commented Dec 17, 2022

Roughly! Yeah, so the idea is to have an interface that solves your problem, while also being useful to others. I think this package list interface might do that--output should be small enough that I'm hoping it'll work for you, and more generally, it'll make it easy for others to filter down overwhelmingly large codebases, while not being too hard to implement

(I think you mean that as a rough quick example, but just to make sure we're on the same page: It'd need to be inputs(), match the whole path, end with an or group of source extensions to avoid the problem where headers match, handle recursive packages with ..., and deal with packages ending with /. For that example, something like bazel aquery "mnemonic('(Objc|Cpp|Swift)Compile', inputs('(//foo/[^/]*|//bar/.*)(\\.cc|<etc.>)', deps(//:target)))" )

@cpsauer
Copy link
Contributor

cpsauer commented Dec 17, 2022

[I'll check quickly to make sure that windows aquery doesn't need backslash separators. Assume we just need to think about forward slash unless I get back to you on that.]

@xinzhengzhang
Copy link
Author

Roughly! Yeah, so the idea is to have an interface that solves your problem, while also being useful to others. I think this package list interface might do that--output should be small enough that I'm hoping it'll work for you, and more generally, it'll make it easy for others to filter down overwhelmingly large codebases, while not being too hard to implement

(I think you mean that as a rough quick example, but just to make sure we're on the same page: It'd need to be inputs(), match the whole path, end with an or group of source extensions to avoid the problem where headers match, handle recursive packages with ..., and deal with packages ending with /. For that example, something like bazel aquery "mnemonic('(Objc|Cpp|Swift)Compile', inputs('(//foo/[^/]*|//bar/.*)(\\.cc|<etc.>)', deps(//:target)))" )

There is a little confused here.
Since the header file should not appeared in Objc|Cpp|Swift actions, why do we still need inputs() to match the whole path?

@cpsauer
Copy link
Contributor

cpsauer commented Dec 17, 2022

I'm not quite understanding what you're asking. But doesn't inputs work by filtering actions to just those that have one input whose full path matches the regex?

And I think headers are listed as inputs to the action by Bazel--the problem for users is that a header will also be an input to all rules that depend on it, potentially leading to way more matches than the one expected.

Also, I checked the Windows thing and we're good to go: Bazel emits forward slashes there, too.

[It's past midnight my time, so I'm going to sleep soon, but I'll be back.]

@xinzhengzhang
Copy link
Author

These two ways will have the same result under package-based intercae like source_filter_packages = ["//foo", "//bar/..."] but there are two differences. 1. No need to exhaustively list the extensions of source files 2. Maybe a little bit better in performance (and I haven't encountered performance problems)

  1. bazel aquery "mnemonic('(Objc|Cpp|Swift)Compile', filter('(//foo|//bar)', deps(//:target)))"
actions = target.transitiveTargets    // Suppose there are 1000 targets
    .filter { $0.label.match('//foo/.*') }  // We matched 1000 times and filter 10 targets
    .filter { $0.mnemonic.match('CppCompile') } // Matched 10 times and filter 5 targets
    .flatMap { $0.actions } // 5 targets
  1. bazel aquery "mnemonic('(Objc|Cpp|Swift)Compile', inputs('(//foo/[^/]|//bar/.)(\.cc|<etc.>)', deps(//:target)))"
actions = target.transitiveTargets   // Suppose there are 1000 targets
    .filter { // Each target may have 100 inputs matched 1000 * 100 times and filter 10 targets
        $0.inputs.contains { input in
            input.match('//foo/.*/\.cc')
        }
    }
   .filter { $0.mnemonic.match('CppCompile') } // Matched 10 times and filter 5 targets
   .flatMap { $0.actions } // 5 targets

I think unless we need to filter things other than package information, such as which compiler used, whether to use modules, or directly filter in the query regardless of semantics or performance.

@cpsauer
Copy link
Contributor

cpsauer commented Dec 18, 2022

Yep--you're totally right. Sorry I pushed us towards the wrong solution, and thanks for pushing back. Please proceed if that interface would work for you, too!

(Backtrace: I hadn't realized there was indeed already a target-based filter function one could apply to aquery.)

I (probably) should have built the external source filtering this way! If it's easy to fix my mistake when you're implementing this, I'd love it. If not, I can do that after.

@xinzhengzhang
Copy link
Author

My bad I just gave one line of code without explaining more context.

No problem, I will submit a pr to implement source_filter_packages and modify the implementation of external source based on the above discussion

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Dec 22, 2022

Hi @cpsauer
I have submit #97 for source_filter_packages

In my own project, the speed is still not ideal without excluding the header. Because there are too many sources in some targets. But it's okay, this is our own problem, I will add additional file-level filtering based on packages filter in my fork :)

And there is another problem about filtering external source, we can use syntax like ^// to filter all external packages but if we need to join these two feature the way I can think will be

bazel aquery "mnemonic('(Objc|Cpp|Swift)Compile', filter('(//foo|//bar)', filter('^//', deps(//:target))))"

Do you have any other ideas?

@cpsauer
Copy link
Contributor

cpsauer commented Dec 23, 2022

Hey, @xinzhengzhang! Thanks for your good efforts and code.

I'm bummed to hear that it doesn't end up solving things efficiently enough for you. The goal had been to get something into the mainline that both solved things for you and was useful to folks more broadly.

Given the current direction doesn't solve things for you, I'm going to propose we change course. How about if we instead add a runtime flag that lets you query a the command for just a single file, getting you exactly what you want. Something like --file=A/B/foo.cpp? Sorry this one's taken a bit of a discovery process. Sometimes things take exploration, especially with a medium as malleable as code.

Some notes:

  • I'm proposing a runtime flag for speed and ease of use. It'd save retemplating (and time holding bazel lock--you can't set output base or you'll miss generated files)). But also you might also want to run it in parallel or without having to edit a BUILD file that'd require additional analysis phases.
  • I'd propose we only allow the --file= to simplify parsing. Though if you anticipate wanting to batch files (probably not?) then we could instead just do a --files=comma,separated,list? Using positional arguments or --files foo bar is going to get painful, given we also pass through arguments to Bazel, hence the thought of just accepting the = form.
  • I'd propose we that the output be updating compile_commands.json in place, removing previous entries for the file(s) in question and replacing them, and thus handling the incremental update logic here, too, rather than pushing it to the user.
    • One issue that's going to come up is whether we should list all the ways each source file is compiled or just one. All would be consistent with the current behavior, but maybe there's a good reason to change. I haven't thought that one through yet.
  • Headers remain tricky. Consider the case where you've got mixed languages, like an Objective-C compile action at the top level target and a C++ header up the chain. Even if the top level target could technically access the C++ header with Bazel (but doesn't because that would lead to a build error), we wouldn't want to use the top level command for the header. I imagine you might have already hit this case with the plugin. Any ideas on how to handle it well?
  • The more that I think about it, your very-large-project use case seems likely to come up more often and across editors given that Bazel is often used on very large code bases. The benefit of having this logic here is that we could reuse it across those editor plugins. It's also helpful even on smaller codebases, helping automatically helping keep things up to date.

What do you think?

@cpsauer
Copy link
Contributor

cpsauer commented Dec 23, 2022

[This is less important, but I should say also that the previous approach's implementation doesn't work as cleanly as one might have thought. Some examples: We're actually matching targets, not packages, so //foo:... doesn't actually match all packages underneath, but rather all targets within the //foo package with a name at least three characters long. //foo behaves like one might have expected //foo/... to, //foo/... doesn't match targets in //foo, etc. These are probably all fixable, but still, the other isn't a slam dunk merge as is.]

@xinzhengzhang
Copy link
Author

"I'd propose we only allow the --file= to simplify parsing. Though if you anticipate wanting to batch files (probably not?) then we could instead just do a --files=comma,separated,list? Using positional arguments or --files foo bar is going to get painful, given we also pass through arguments to Bazel, hence the thought of just accepting the = form."

Yes, I totally agree too and in my case `--file` is enough.

"I'd propose we that the output be updating compile_commands.json in place, removing previous entries for the file(s) in question and replacing them, and thus handling the incremental update logic here, too, rather than pushing it to the user.
One issue that's going to come up is whether we should list all the ways each source file is compiled or just one. All would be consistent with the current behavior, but maybe there's a good reason to change. I haven't thought that one through yet."

Since there can only be one `output_base` if we remaining the multiple compilation commands of one source can lsp works?  for example `-fmodule-map-file=bazel-out/<path>/xxx.module.modulemap` in compile commands
And in my plugin, it is true that the previous one is merged with the newly parsed (Overriding with newly value if same file path)

"Headers remain tricky. Consider the case where you've got mixed languages, like an Objective-C compile action at the top level target and a C++ header up the chain. Even if the top level target could technically access the C++ header with Bazel (but doesn't because that would lead to a build error), we wouldn't want to use the top level command for the header. I imagine you might have already hit this case with the plugin. Any ideas on how to handle it well?"

Good question... extractor by header file is directly rejected in my plugin. And we can extractor it by opening a source file that depends on it.
How about make the params modified to `--source_file`?

@cpsauer
Copy link
Contributor

cpsauer commented Dec 23, 2022

we can use syntax like `^//

That's the best I can think of, too! Fixed in a426081 Thanks for alerting me to filter and better ways of doing this :)

@cpsauer
Copy link
Contributor

cpsauer commented Dec 23, 2022

Since there can only be one output_base if we remaining the multiple compilation commands of one source can lsp works? for example -fmodule-map-file=bazel-out/<path>/xxx.module.modulemap in compile commands
And in my plugin, it is true that the previous one is merged with the newly parsed (Overriding with newly value if same file path)

Ah, so the usual case where there are multiple compile commands for a file is when it's being recompiled for multiple OS's or architectures. This means that there are different generated directories under bazel-out. You certainly raise an interesting issue, though. I hadn't thought about that case.

Good question... extractor by header file is directly rejected in my plugin. And we can extractor it by opening a source file that depends on it.
How about make the params modified to --source_file?

I think some trickiness remains: How do we automatically find a source file that depends on a given header? Presumably you don't want to make the user manually find a source file that depends upon it... Hmm. One (not great) solution would be to input filter to all actions that could include the header, search the source files (maybe in path-closeness order) for one that contains the filename of the header, and verify that it's the right source by extracting the header. Any better ideas? Maybe there's an (a)query filtering mechanism for actions that directly depend on the header?

Update: Oh, yeah! How about using attr to at least find the target that directly uses the header in srcs or hdrs?

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Dec 23, 2022

How about using attr to at least find the target that directly uses the header in srcs or hdrs?

it is good! Very simple and effective I will implemente it like this

@xinzhengzhang xinzhengzhang mentioned this issue Dec 23, 2022
8 tasks
@xinzhengzhang
Copy link
Author

#99

@cpsauer
Copy link
Contributor

cpsauer commented Dec 24, 2022

(Will continue discussion over there!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants