Bug 1828223 - Implement heuristic resolution of CXXDependentScopeMemberExpr #628
Just as a timeline update, I was at a team meet-up last week and should be able to get to this this week, although it might be another day or two before the jet lag is out of my system.
Investigation of Changes using gfx/2d/Matrix.h

Thanks for doing the dev run, this makes it much easier to dig into the impacts of the changes! Given your links to https://searchfox.org/mozilla-central/source/gfx/2d/Matrix.h from https://bugzilla.mozilla.org/show_bug.cgi?id=1781178 I'm using that as a basis for comparison by doing (roughly):

```sh
$ curl https://searchfox.org/mozilla-central/raw-analysis/gfx/2d/Matrix.h > trunk-Matrix.h
$ curl https://dev.searchfox.org/mozilla-central/raw-analysis/gfx/2d/Matrix.h > dev-Matrix.h
$ diff -U3 trunk-Matrix.h dev-Matrix.h
```

There are a lot of changes, so I'm just going to broadly categorize them. I should also note the files aren't particularly sorted as-is, so it may be useful to pass them through

New Uses show up!

Woooo! This is really fantastic!

Redundant target records with varying contextsym values, example of IsIdentity

On trunk, this use of IsIdentity is not linked:

```cpp
if (aMatrix.IsIdentity()) {
  return aStream << "[ I ]";
}
```

We now get the use with the following new target records:

```
{"loc":"00078:16-26","target":1,"kind":"use","pretty":"mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","context":"mozilla::gfx::operator<<","contextsym":"_ZN7mozilla3gfxlsERNSt6__ndk113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"}
{"loc":"00078:16-26","target":1,"kind":"use","pretty":"mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","context":"mozilla::gfx::operator<<","contextsym":"_ZN7mozilla3gfxlsERSoRKNS0_10BaseMatrixIT_EE"}
{"loc":"00078:16-26","target":1,"kind":"use","pretty":"mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","context":"mozilla::gfx::operator<<","contextsym":"_ZN7mozilla3gfxlsERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"}
```

and the following new source record:

```
{"loc":"00078:16-26","source":1,"syntax":"function,use","pretty":"function mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","type":"_Bool (void) const"}
```

The difference between the target records is their contextsym; they are otherwise the same. Specifically, we have:
No normalization happens here because target records are de-duplicated on their overall hash only. This raises the question of whether the crossref database somehow consolidates these. We do not expect to see (and in fact do not see) redundant search results because

If we do a sorch lookup by adding a bash function like the following:

```sh
dev-sorch-raw () {
  QUERY=$1
  curl -H 'Accept: application/json' "https://dev.searchfox.org/mozilla-central/sorch?q=${QUERY}&path=" | jq '.'
}
```

and then do

```json
{
  "path": "gfx/2d/Matrix.h",
  "path_kind": "Normal",
  "lines": [
    {
      "lno": 78,
      "bounds": [12, 22],
      "line": "if (aMatrix.IsIdentity()) {",
      "context": "mozilla::gfx::operator<<",
      "contextsym": "_ZN7mozilla3gfxlsERNSt6__ndk113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"
    },
    {
      "lno": 78,
      "bounds": [12, 22],
      "line": "if (aMatrix.IsIdentity()) {",
      "context": "mozilla::gfx::operator<<",
      "contextsym": "_ZN7mozilla3gfxlsERSoRKNS0_10BaseMatrixIT_EE"
    },
    {
      "lno": 78,
      "bounds": [12, 22],
      "line": "if (aMatrix.IsIdentity()) {",
      "context": "mozilla::gfx::operator<<",
      "contextsym": "_ZN7mozilla3gfxlsERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"
    }
  ]
},
```

Because the contextsym is used for magic diagrams, it's probably appropriate to enhance the data model and merge logic to support the contextsym being a list of symbols which are merged with a similar set of rules to source record merging. In particular, it probably makes sense to merge if and only if the loc, kind, pretty, sym, and context are all the same.

Varying target records whose symbols vary but their contextsyms do not

A slight variation on the above is for

```
{"loc":"00768:38-41","source":1,"syntax":"function,use","pretty":"function std::numeric_limits::max","sym":"_ZNSt14numeric_limits3maxEv,_ZNSt3__114numeric_limits3maxEv,_ZNSt6__ndk114numeric_limits3maxEv","type":"type (void) noexcept"}
{"loc":"00768:38-41","target":1,"kind":"use","pretty":"std::numeric_limits::max","sym":"_ZNSt14numeric_limits3maxEv","context":"mozilla::gfx::Matrix4x4Typed::ProjectRectBounds","contextsym":"_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"}
{"loc":"00768:38-41","target":1,"kind":"use","pretty":"std::numeric_limits::max","sym":"_ZNSt3__114numeric_limits3maxEv","context":"mozilla::gfx::Matrix4x4Typed::ProjectRectBounds","contextsym":"_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"}
{"loc":"00768:38-41","target":1,"kind":"use","pretty":"std::numeric_limits::max","sym":"_ZNSt6__ndk114numeric_limits3maxEv","context":"mozilla::gfx::Matrix4x4Typed::ProjectRectBounds","contextsym":"_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"}
```

We can see here that we get 3 different symbols which are unified in the source record so that all 3 symbols are listed. For the target records, although the symbols vary, the contextsyms are all the same:
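Going back to the previous section's proposal, the contextsym merging rule could be sketched roughly like this (hypothetical record shape and helper, not searchfox's actual data model): records that agree on loc, kind, pretty, sym, and context get their contextsyms unioned into a list.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record shape, invented for illustration.
struct TargetRecord {
  std::string loc, kind, pretty, sym, context;
  std::set<std::string> contextsyms;  // would become a merged list
};

// Merge records that agree on everything except contextsym, per the
// "merge if and only if loc, kind, pretty, sym, and context are all
// the same" rule, unioning their contextsyms.
std::vector<TargetRecord> MergeByContextsym(
    const std::vector<TargetRecord>& recs) {
  std::map<std::tuple<std::string, std::string, std::string, std::string,
                      std::string>,
           TargetRecord>
      merged;
  for (const auto& r : recs) {
    auto key = std::make_tuple(r.loc, r.kind, r.pretty, r.sym, r.context);
    auto [it, inserted] = merged.emplace(key, r);
    if (!inserted) {
      it->second.contextsyms.insert(r.contextsyms.begin(),
                                    r.contextsyms.end());
    }
  }
  std::vector<TargetRecord> out;
  for (auto& [key, rec] : merged) {
    out.push_back(rec);
  }
  return out;
}
```

Under this rule the three IsIdentity records above would collapse to one record carrying all three contextsyms, while the numeric_limits::max records (differing syms) would stay separate.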
Consistent with what I said in the previous section, I believe this is an example of a case where it would not be appropriate to merge the target records because the symbols don't match, at least at this time. For now the varying symbols mean the target records will go in different crossref entries and then get unified by

Conceptually speaking, we don't gain much from having the distinct symbols, but there also isn't much harm because we have merged them in the source record, and so it's just a thing our logic ends up dealing with and could automatically explain to people, etc.

We get a bunch of noise in structured records because of "anonymous struct at PATH" and includes via source dir and objdir public header symlinks

For the structured record for sym

The latter is of course the symlink created by the build process for the public include path and the former is the actual source location. The situation here is weird because we end up with 1 copy of the structured record for each path for each platform, so our symmetry-breaking logic ends up non-deterministic and we inverted which got to be the canonical variation and which got to be the non-canonical variation. This is largely unrelated to the changes happening here, so this should just end up as a future work enhancement, quite possibly to do something slightly more intelligent in terms of creating the pretty identifier for anonymous namespaces and structs.
Just a heads up, I did two deploys to dev.searchfox.org, one on Monday with this patch, and a second one on Wednesday for this patch + an additional patch to use clangd's full HeuristicResolver implementation (that's on this branch). In the second one a few additional uses are resolved (though not that many, because I haven't yet added new uses of HeuristicResolver in MozsearchIndexer that are needed to make some of the other test cases in bug 1781178 work).
Heuristics and/or Concrete
Aside: It would be great to have a specific test case that could exercise the permutations being discussed. I understand there to be broadly 3 use-cases here:
For the first two cases, it does seem like it would definitely be preferable for our source code rendering code, which operates in terms of source records and jumpref lookups, to be able to use concrete results if available and not end up broadening that to the overload set. As discussed in my previous comment, we do have a merge stage for the multi-platform mozilla-central case, but there's nothing stopping us from always running merge logic even in single-platform cases to simplify things. I could imagine we could do something so that we annotate the overload set source records we emit so that they will get replaced by concrete records when present for the same token with the same pretty. For now I'll call this a "weak source record", but there are potentially more clever things we could do; more on that a little below.

Our clang indexer can't / shouldn't need global awareness

I expect our clang indexer plugin won't always have enough information to make these determinations itself[1], so the best course of action is likely to go with both, but eventually make sure that we emit sufficient metadata that we can distinguish between the different qualities of "use". The most straightforward approach to this is for merging (which only happens once all C++ indexing is complete) to have the concrete source records clobber the weak overload set source records, but we could handle things later on in subsequent passes as well if we need the crossref process's global awareness of everything. While adding something like a "weak" property could work, there are also options where we could further refine the semantics of the overload set "use". In Bug 1419018 :kats suggested distinguishing between lhs/rhs for uses, and in Bug 1733217 I dug into that a little more, proposing an additional "usage" field.
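The "weak source record" clobbering described above could look roughly like this (hypothetical record shape and helper, invented for illustration; not searchfox's actual analysis format):

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical source-record shape, invented for illustration.
struct SourceRecord {
  std::string loc;     // token location
  std::string pretty;  // human-readable identifier
  std::string syms;    // comma-separated symbol list
  bool weak = false;   // true for heuristic overload-set results
};

// During merge, a concrete record clobbers any weak record emitted for
// the same token with the same pretty; weak records with no concrete
// counterpart survive.
std::vector<SourceRecord> MergeWeakRecords(
    const std::vector<SourceRecord>& recs) {
  std::set<std::pair<std::string, std::string>> concrete;
  for (const auto& r : recs) {
    if (!r.weak) {
      concrete.insert({r.loc, r.pretty});
    }
  }
  std::vector<SourceRecord> out;
  for (const auto& r : recs) {
    if (!r.weak || !concrete.count({r.loc, r.pretty})) {
      out.push_back(r);
    }
  }
  return out;
}
```

This keeps the clang plugin free of global awareness: each translation unit emits records naively, and the single merge pass (which already sees everything) resolves weak-versus-concrete.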
We could also introduce a new "kind" besides "use", like "potential use", which currently would result in an additional section in the "search" results output, but I think that would probably be a surprising/unpleasant UX since the "use" kind tends to be quite voluminous and people might never get down to "potential use".

That said, similar to your prior enhancement of #613, I think it would make sense to land this as-is and then figure out how best to address the "weak" or "usage" use-cases. What's in this patch is already strictly an amazing step forward for searchfox; the only alternative users have right now is fulltext search, which is going to have an order of magnitude more false positives.

1: Technically, because of the way we do locking on the analysis files, the clang indexer could actually try to reason about the existing records from other ingestions of the file, but I don't think that we would want to do that because:
HeuristicResolver
I have no problem if you want to copy the classes over now or in the future when you make your next change. I see the files already have per-file license headers and their Apache 2.0 license is compatible with mozsearch's MPL2. They also look nicely standalone, so it's not like we'd spend a lot of time trying to disentangle them from other clangd code. One thing we might do is put them in a "from-clangd" subdirectory with a README that explains what they are. This could help make it clear to anyone wanting to make changes that they're touching a file that has an upstream, without us having to add "HEY DON'T MODIFY THIS FILE" comments every time we copy over a new revision from clangd. But again, no need to do that now. I definitely don't think we want to reinvent any wheels clangd has invented, or to make it harder to upstream enhancements made in searchfox to clangd.
As noted in my comments, I think this is amazing as-is and can probably already land (and have the MozsearchIndexer.cpp changes propagated to m-c). I'm happy to continue discussing potential next steps as it relates to weak source records or "usages", etc. here or on bugzilla. I did mention some potential spin-offs, but there's no need for you to file anything you're not personally interested in/planning on implementing. (Like, in particular, the contextsym merging isn't something you need to do or file.)
Thanks for all your work on searchfox! This is very exciting!
Here are a couple of (somewhat contrived) examples to illustrate the two cases:
An example here might be:

```cpp
template <typename Surface>
struct DrawingContext {
  void draw(Rectangle);
  void draw(Triangle);
  void draw(Circle);
};

template <typename Surface, typename Shape>
void foo(DrawingContext<Surface>& d, Shape& s) {
  d.draw(s);
}
```

Let's suppose that, for whatever reason, the indexer has only seen instantiations of
An example here might be:

```cpp
template <typename OutputStream>
struct Serializer {
  void serialize(std::string);
  void serialize(int);
  void serialize(URL);
  // bunch of other overloads
  void serialize(FontMetrics);
};

template <typename OutputStream>
void serializeFont(Serializer<OutputStream> s, Font f) {
  // ...
  s.serialize(f.GetMetrics());
}
```

This time, all instantiations use

(Why aren't the heuristics smart enough to figure out the target will always be

Not sure if you just wanted to see the test cases to illustrate the tradeoff, or if you meant adding them to the test suite -- I'm happy to do the latter if you'd like.
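For concreteness, here's a self-contained variant of the Serializer example (with stub types invented just so it compiles, and serialize returning a tag so the chosen overload is observable); every instantiation of serializeFont ends up calling only the FontMetrics overload:

```cpp
#include <string>

// Stub types, invented just so this sketch compiles.
struct URL {};
struct FontMetrics {};
struct Font {
  FontMetrics GetMetrics() { return {}; }
};

template <typename OutputStream>
struct Serializer {
  const char* serialize(std::string) { return "string"; }
  const char* serialize(int) { return "int"; }
  const char* serialize(URL) { return "URL"; }
  // bunch of other overloads
  const char* serialize(FontMetrics) { return "FontMetrics"; }
};

template <typename OutputStream>
const char* serializeFont(Serializer<OutputStream> s, Font f) {
  // Dependent call: heuristic resolution sees the whole overload set,
  // but every concrete instantiation resolves to serialize(FontMetrics).
  return s.serialize(f.GetMetrics());
}
```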
Yeah, a "weak" annotation of some sort makes a lot of sense to me. Eclipse CDT uses something similar for e.g. find-references results where it's less sure that a use is definitely referencing the entity you're searching for, and annotates those as a "potential result" in the results list. I'll file a follow-up for exploring this further.
Sounds good! I'll leave this patch as-is, and expand
Thanks for the specific example, this really enhances my ability to discuss this aspect of template indexing! I agree that the heuristic result's overload set is fundamentally interesting in this case. I think the concrete uses and the ability to distinguish them would also be useful, both at the point of the use for context menu purposes (source record) and in the search results (target records). Re: CDT and "potential results", we also see something similar from GitHub and Sourcegraph, although those are more about the indexer; the terms "precise" and "search based" in particular are used to distinguish between "we used something that understands the full semantics of the language" versus "we used tree-sitter so it's not perfect". As we potentially overhaul our understanding of JS from our current "we assume all JS we ever see is loaded into a single global" to using the more precise scip-typescript in some cases (or all cases, but with a pile of hacks), it could very much make sense for searchfox to grow a concept of "symbol confidence". Specifically, if implemented, we'd:
We don't need to implement this right now, but I always like to have a working destination in mind, and I may try noodling around with it in my hobby time, as it seems quite useful to have an incremental approach for improving our JS implementation rather than an all-or-nothing transition.
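To sketch what that "symbol confidence" annotation might look like (names invented for illustration, not an actual searchfox schema):

```cpp
#include <string>

// Invented names for illustration; not an actual searchfox schema.
enum class SymbolConfidence {
  Precise,      // full language-semantic understanding (e.g. clang)
  Heuristic,    // e.g. template overload-set resolution
  SearchBased,  // e.g. tree-sitter / "all JS shares one global" style
};

struct SymbolUse {
  std::string sym;
  SymbolConfidence confidence;
};

// When several records cover the same token, prefer the more
// trustworthy one: lower enumerator value wins.
SymbolUse PreferMoreConfident(const SymbolUse& a, const SymbolUse& b) {
  return a.confidence <= b.confidence ? a : b;
}
```

The same ordering could drive UI decisions too, e.g. only annotating results below a given confidence as "potential".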
I think the proposal above could still work for this.
Yeah, that seems like a lot of work. I think it's very reasonable to say that ugly/complex code is going to result in an ugly/complex experience. That said, I think reasonable things searchfox could help do in situations like this are:
I think both are useful, but I think it would be particularly useful to add them to the test suite, as this makes it easier to dig into what's currently happening. Also, it means in the future people can just link to the in-(mozsearch-)tree examples rather than having to potentially find this thread or sketch them up from whole cloth. Also it helps make sure details are filled in. To do some quick investigation myself, I tried to make your shapes example more concrete in this

```cpp
class GenericSurface {};
class Rectangle {};
class Triangle {};
class Circle {};

template <typename Surface> struct DrawingContext {
  void draw(Rectangle);
  void draw(Triangle);
  void draw(Circle);
};

template <typename Surface, typename Shape>
void foo(DrawingContext<Surface> &d, Shape &s) {
  d.draw(s);
}

int main(void) {
  GenericSurface surface;
  DrawingContext<GenericSurface> context;
  Circle circle;
  foo(context, circle);
}
```
I filed bug 1833552 for distinguishing between heuristic and concrete results in the context of templates. Please feel free to transmogrify it into a more general issue about symbol confidence, or to make it depend on such an issue etc.
I filed bug 1833548 about further improvements in this space. I think there is some low-hanging fruit here (simple heuristics like arity) that can be tackled without solving the more general problem of overload resolution with dependent arguments.
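For example, the arity heuristic could be as simple as the following sketch (hypothetical candidate shape, invented for illustration; not the bug's actual plan): drop overloads that cannot possibly accept the observed number of arguments.

```cpp
#include <vector>

// Hypothetical shape for one member of a dependent call's overload set.
struct Candidate {
  const void* decl;            // stand-in for the clang declaration
  unsigned numParams;          // total parameters
  unsigned numRequiredParams;  // parameters without default arguments
};

// Keep only overloads whose parameter counts are compatible with the
// number of arguments at the call site.
std::vector<Candidate> FilterByArity(const std::vector<Candidate>& set,
                                     unsigned numArgs) {
  std::vector<Candidate> out;
  for (const auto& c : set) {
    if (numArgs >= c.numRequiredParams && numArgs <= c.numParams) {
      out.push_back(c);
    }
  }
  return out;
}
```

This doesn't attempt overload resolution with dependent argument types; it just prunes obviously non-viable candidates, which is the kind of low-hanging fruit the bug describes.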
Will do, I filed bug 1833529 for this.
Posting a draft fix for bug 1828223 for feedback.
I would appreciate feedback on two points in particular:
Currently, in `IndexConsumer::VisitCXXDependentScopeMemberExpr()`, I both (1) produce a heuristic result (that's what's new in this patch), and (2) record the location for possibly producing concrete results when visiting instantiations (if we have any) during the `AnalyzeDependent` phase (what the code was doing before).

I'm wondering, should we add some logic to only produce the heuristic result if we didn't end up producing concrete results from the `AnalyzeDependent` phase?

I feel like I could go either way on this. On the one hand, the indexer might only know about a subset of the instantiations actually occurring in the project, and the concrete results might therefore miss something that the heuristic results catch. On the other hand, if e.g. we're working with an overload set, it's possible that all instantiations end up using only a subset of the overloads, and the heuristic results would add noise by providing all overloads. In other words, I don't think either approach is categorically more accurate than the other.
The `HeuristicResolver` facility added in this patch is a slimmed-down version of clangd's (currently much more featureful) facility of the same name. My current approach is to port over pieces of this facility to Searchfox as we need them to make specific testcases work. However, one could imagine a different approach of just... copying the class from clangd wholesale, as a piece of code which has been battle-tested in a different project? (It doesn't have any dependencies on anything else in clangd, only public clang APIs.) Let me know if you have a preference here, I'm happy to switch approaches.
(Granted, even better would be to upstream `HeuristicResolver` to `libClangTooling` or somesuch and expose it as a public clang API, and have both clangd and Searchfox consume it from there. I'd like to explore this, but I'd prefer not to block our progress on Searchfox enhancements on this LLVM change, because LLVM patches can be slow to be reviewed (and then we'd need to adopt the new LLVM version in which it appears, and so on).)