Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1828223 - Implement heuristic resolution of CXXDependentScopeMemberExpr #628

Merged
merged 2 commits into from May 12, 2023

Conversation

theres-waldo
Copy link
Contributor

Posting a draft fix for bug 1828223 for feedback.

I would appreciate feedback on two points in particular:

  1. Currently, in IndexConsumer::VisitCXXDependentScopeMemberExpr(), I both (1) produce a heuristic result (that's what's new in this patch), and (2) record the location for possibly producing concrete results when visiting instantiations (if we have any) during the AnalyzeDependent phase (what the code was doing before).

    I'm wondering, should we add some logic to only produce the heuristic result if we didn't end up producing concrete results from the AnalyzeDependent phase?

    I feel like I could go either way on this. On the one hand, the indexer might only know about a subset of the instantiations actually occurring in the project, and the concrete results might therefore miss something that the heuristic results catch. On the other hand, if e.g. we're working with an overload set, it's possible that all instantiations end up using only a subset of the overloads, and the heuristic results would add noise by providing all overloads. In other words, I don't think either approach is categorically more accurate than the other.

  2. The HeuristicResolver facility added in this patch is a slimmed-down version of clangd's (currently much more featureful) facility of the same name.

    My current approach is to port over pieces of this facility to Searchfox as we need them to make specific testcases work.

    However, one could imagine a different approach of just... copying the class from clangd wholesale, as a piece of code which has been battle-tested in a different project? (It doesn't have any dependencies on anything else in clangd, only public clang APIs.)

    Let me know if you have a preference here, I'm happy to switch approaches.

    (Granted, even better would be to upstream HeuristicResolver to libClangTooling or somesuch and expose it as a public clang API, and have both clangd and Searchfox consume it from there. I'd like to explore this, but I'd prefer not to block our progress on Searchfox enhancements on this LLVM change, because LLVM patches can be slow to be reviewed (and then we'd need to adopt the new LLVM version in which it appears and so on.))

@theres-waldo theres-waldo changed the title [WIP] Bug 1818223 - Implement heuristic resolution of CXXDependentScopeMemberExpr [WIP] Bug 1828223 - Implement heuristic resolution of CXXDependentScopeMemberExpr Apr 25, 2023
@asutherland
Copy link
Contributor

Just as a timeline update, I was at a team meet-up last week and should be able to get to this this week, although it might be another day or two before the jet lag is out of my system.

@asutherland
Copy link
Contributor

asutherland commented May 6, 2023

Investigation of Changes using gfx/2d/Matrix.h

Thanks for doing the dev run, this makes it much easier to dig into the impacts of the changes! Given your links to https://searchfox.org/mozilla-central/source/gfx/2d/Matrix.h from https://bugzilla.mozilla.org/show_bug.cgi?id=1781178 I'm using that as a basis for comparison by doing (roughly):

$ curl https://searchfox.org/mozilla-central/raw-analysis/gfx/2d/Matrix.h > trunk-Matrix.h
$ curl https://dev.searchfox.org/mozilla-central/raw-analysis/gfx/2d/Matrix.h > dev-Matrix.h
$ diff -U3 trunk-Matrix.h dev-Matrix.h         

There are a lot of changes, so I'm just going to broadly categorize them. I should also note the files aren't particularly sorted as-is, so it may be useful to pass them through sort as well.

New Uses show up!

Woooo! This is really fantastic!

Redundant target records with varying contextsym values, example of IsIdentity

On trunk, this use of IsIdentity is not linked:

if (aMatrix.IsIdentity()) {
  return aStream << "[ I ]";
}

We now get the use with the following new target records:

{"loc":"00078:16-26","target":1,"kind":"use","pretty":"mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","context":"mozilla::gfx::operator<<","contextsym":"_ZN7mozilla3gfxlsERNSt6__ndk113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"}
{"loc":"00078:16-26","target":1,"kind":"use","pretty":"mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","context":"mozilla::gfx::operator<<","contextsym":"_ZN7mozilla3gfxlsERSoRKNS0_10BaseMatrixIT_EE"}
{"loc":"00078:16-26","target":1,"kind":"use","pretty":"mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","context":"mozilla::gfx::operator<<","contextsym":"_ZN7mozilla3gfxlsERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"}

and the following new source record:

{"loc":"00078:16-26","source":1,"syntax":"function,use","pretty":"function mozilla::gfx::BaseMatrix::IsIdentity","sym":"_ZNK7mozilla3gfx10BaseMatrix10IsIdentityEv","type":"_Bool (void) const"}

The difference between the target records is their contextsym, they otherwise are the same. Specifically, we have:

  • _ZN7mozilla3gfxlsERNSt6__ndk113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE
  • _ZN7mozilla3gfxlsERSoRKNS0_10BaseMatrixIT_EE
  • _ZN7mozilla3gfxlsERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE

No normalization happens here because target records are de-duplicated on their overall hash only. This raises the question of whether the crossref database somehow consolidates these. We do not expect to see (and in fact do not see) redundant search results because router.py explicitly only allows a single result per line; this is primarily intended to suppress text search results from being displayed when we already showed semantic results, but also could be doing something here.

If we do a sorch lookup by adding a bash function like the following:

dev-sorch-raw () {
  QUERY=$1
  curl -H 'Accept: application/json' "https://dev.searchfox.org/mozilla-central/sorch?q=${QUERY}&path=" | jq '.'
}

and then do dev-sorch-raw mozilla::gfx::BaseMatrix::IsIdentity we do in fact see the redundant lines in our uses path hit list:

            {
              "path": "gfx/2d/Matrix.h",
              "path_kind": "Normal",
              "lines": [
                {
                  "lno": 78,
                  "bounds": [
                    12,
                    22
                  ],
                  "line": "if (aMatrix.IsIdentity()) {",
                  "context": "mozilla::gfx::operator<<",
                  "contextsym": "_ZN7mozilla3gfxlsERNSt6__ndk113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"
                },
                {
                  "lno": 78,
                  "bounds": [
                    12,
                    22
                  ],
                  "line": "if (aMatrix.IsIdentity()) {",
                  "context": "mozilla::gfx::operator<<",
                  "contextsym": "_ZN7mozilla3gfxlsERSoRKNS0_10BaseMatrixIT_EE"
                },
                {
                  "lno": 78,
                  "bounds": [
                    12,
                    22
                  ],
                  "line": "if (aMatrix.IsIdentity()) {",
                  "context": "mozilla::gfx::operator<<",
                  "contextsym": "_ZN7mozilla3gfxlsERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEERKNS0_10BaseMatrixIT_EE"
                }
              ]
            },

Because the contextsym is used for magic diagrams, it's probably appropriate to enhance the data-model and merge logic to support the contextsym being a list of symbols which are merged with a similar set of rules to source record merging. In particular, it probably makes sense to merge only iff the loc, kind, pretty, sym, and context are all the same.

Varying target records whose symbols vary but their contextsyms do not:

A slight variation on the above is for std::numeric_limits::max where we get new records:

{"loc":"00768:38-41","source":1,"syntax":"function,use","pretty":"function std::numeric_limits::max","sym":"_ZNSt14numeric_limits3maxEv,_ZNSt3__114numeric_limits3maxEv,_ZNSt6__ndk114numeric_limits3maxEv","type":"type (void) noexcept"}
{"loc":"00768:38-41","target":1,"kind":"use","pretty":"std::numeric_limits::max","sym":"_ZNSt14numeric_limits3maxEv","context":"mozilla::gfx::Matrix4x4Typed::ProjectRectBounds","contextsym":"_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"}
{"loc":"00768:38-41","target":1,"kind":"use","pretty":"std::numeric_limits::max","sym":"_ZNSt3__114numeric_limits3maxEv","context":"mozilla::gfx::Matrix4x4Typed::ProjectRectBounds","contextsym":"_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"}
{"loc":"00768:38-41","target":1,"kind":"use","pretty":"std::numeric_limits::max","sym":"_ZNSt6__ndk114numeric_limits3maxEv","context":"mozilla::gfx::Matrix4x4Typed::ProjectRectBounds","contextsym":"_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"}

We can see here that we experience 3 different symbols which get unified in the source record so that all 3 symbols are listed. For the target records, although the symbols vary, the contextsyms are all the same:

  • "_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"
  • "_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"
  • "_ZNK7mozilla3gfx14Matrix4x4Typed17ProjectRectBoundsERKNS0_9RectTypedIT_TL0__EERKNS2_IT0_S4_EE"

Consistent with what I said in the previous section, I believe this is an example of a case where it would not be appropriate to merge the target records because the symbols don't match. At least at this time. For now the varying symbols mean the target records will go in different crossref entries and then get unified by router.py in a similar fashion to the previous section. Which I guess is an argument that addressing the issue raised in the previous section is not a particularly high priority since there will also be cases where we can't merge things anyways.

Conceptually speaking, we don't gain much from having the distinct symbols, but there also isn't much harm because we have merged them in the source record and so it's just a thing our logic ends up dealing with and could automatically explain to people, etc.

We get a bunch of noise in structured records because of "anonymous struct at PATH" and includes via source dir and objdir public header symlinks

For the structured record for sym "T_1e0231fb6037e57" with pretty "mozilla::gfx::Matrix5x4::(anonymous)::(anonymous)" at loc "01889:4-10" in Matrix.h, we see a diff because we end up seeing 2 variations for "anonymous struct":

  • "mozilla::gfx::Matrix5x4::(anonymous)::(anonymous)::(anonymous struct at /builds/worker/checkouts/gecko/gfx/2d/Matrix.h:1889:5)"
  • "mozilla::gfx::Matrix5x4::(anonymous)::(anonymous)::(anonymous struct at /builds/worker/workspace/obj-build/dist/include/mozilla/gfx/Matrix.h:1889:5)"

The latter is of course the symlink created by the build process for the public include path and the former is the actual source location. The situation here is weird because we end up with 1 copy of the structured record for each path for each platform, so our symmetry breaking logic ends up non-deterministic and we inverted which got to be the canonical variation and which got to be the non-canonical variation.

This is largely unrelated to the changes happening here, so this should just end up as a future work enhancement, quite possibly to do something slightly more intelligent in terms of creating the pretty identifier for anonymous namespaces and structs.

@theres-waldo
Copy link
Contributor Author

theres-waldo commented May 7, 2023

Thanks for doing the dev run, this makes it much easier to dig into the impacts of the changes!

Just a heads up, I did two deploys to dev.searchfox.org, one on Monday with this patch, and a second one on Wednesday for this patch + an additional patch to use clangd's full HeuristicResolver implementation (that's on this branch). In the second one a few additional uses are resolved (though not that many, because I haven't yet added new uses of HeuristicResolver in MozsearchIndexer that are needed to make some of the other test cases in bug 1781178 work).

@asutherland
Copy link
Contributor

Heuristics and/or Concrete

1. Currently, in `IndexConsumer::VisitCXXDependentScopeMemberExpr()`, I both (1) produce a heuristic result (that's what's new in this patch), **and** (2) record the location for possibly producing concrete results when visiting instantiations (if we have any) during the `AnalyzeDependent` phase (what the code was doing before).
   I'm wondering, should we add some logic to **only** produce the heuristic result if we didn't end up producing concrete results from the `AnalyzeDependent` phase?
   I feel like I could go either way on this. On the one hand, the indexer might only know about a subset of the instantiations actually occurring in the project, and the concrete results might therefore miss something that the heuristic results catch. On the other hand, if e.g. we're working with an overload set, it's possible that all instantiations end up using only a subset of the overloads, and the heuristic results would add noise by providing all overloads. In other words, I don't think either approach is _categorically_ more accurate than the other.

Aside: It would be great to have a specific test case that could exercise the permutations being discussed here.

I understand there to be broadly 3 use-cases here:

  1. The source record use case where the user is looking at the source listing for a use and would really like a "go to def" option to go to the exact implementation and it's potentially a lot of legwork for the user to figure out which implementation to go to if there are multiple "go to def" options or if there is no "go to def" option and they end up having to use the "search" choice and end up having to pick implementations from there.
  2. The source record use case where the user is looking at the source listing for a definition and would really like to be able to get a "search" endpoint listing of the callers of that exact definition.
    • From the source listing, they might then appreciate if the listing included something like the current approach for C++ virtual method overrides where when looking at the specific symbol for a subclass's override, they can click on a link to go to the superclass's symbol for the virtual method, which then lists all of the overrides of that method.
  3. The "search" use-case where a user types in a pretty identifier name. This situation is notable because we expect the user to get a soup of all of the symbols relating to the overload set here because searchfox's "pretty" identifiers never include any ability to distinguish any further than that. So anything we do will tend to be good enough for this situation.

For the first two cases, it does seem like it would definitely be preferable for our source code rendering code which operates in terms of source records and jumpref lookups to be able to use concrete results if available and not end up broadening that to the overload set.

As discussed in my previous comment, we do have a merge stage for the multi-platform mozilla-central case, but there's nothing stopping us from always running merge logic even in single-platform cases to simplify things. I could imagine we could do something so that we annotate the overload set source records we emit so that they will get replaced by concrete records when present for the same token with the same pretty. For now I'll call this a "weak source record", but there are potentially more clever things we could do; more a little below.

Our clang indexer can't / shouldn't need global awareness

I expect our clang indexer plugin won't always have enough information to make only determinations itself[1], so the best course of action is likely to go with and, but eventually make sure that we emit sufficient metadata that we can distinguish between the different qualities of "use". The most straightforward approach to this is for merging (which only happens once all C++ indexing is complete) to have the concrete source records clobber the weak overload set source records, but we could handle things later on in subsequent passes as well if we need the crossref's process global awareness of everything.

While adding something like a "weak" property could work, there are also options where we could further refine the semantics of the overload set "use". In Bug 1419018 :kats suggested distinguishing between lhs/rhs for uses and in Bug 1733217 I dug into that a little more proposing an additional "usage" field. We could also introduce a new "kind" besides "use" like "potential use" which currently would result in an additional section in the "search" results output, but I think that would probably be a surprising/unpleasant UX since the "use" kind tends to be quite voluminous and people might never get down to "potential use".

That said, similar to your prior enhancement of #613 I think it would make sense to land this as-is and then figure out how to best address the "weak" or "usage" use-cases. What's in this patch is already strictly an amazing step forward for searchfox; the only alternative users have right now is fulltext search which is going to have order of magnitude more false positives.

1: Technically, because of the way we do locking on the analysis files, the clang indexer could actually try and reason about the existing records from other ingestions of the file, but I don't think that we would want to do that because:

  • The clang indexer never actually knows when it's going to be the last one to touch that file.
  • I think a major performance optimization we could undertake for indexing is to get rid of the file locking and instead have all the clang indexers send their output over a (unix) socket to a rust daemon that can handle de-duplication on the fly. The current approach of locking the file, loading in the current contents, sorting/merging, then writing the file back out is likely the reason that a build that normally takes ~12 minutes instead takes ~54 minutes.

HeuristicResolver

2. The `HeuristicResolver` facility added in this patch is a slimmed-down version of clangd's (currently much more featureful) facility [of the same name](https://searchfox.org/llvm/rev/be9c91843bab5bb46574c27836bfcd9ad6fc9ef5/clang-tools-extra/clangd/HeuristicResolver.h#45).
   My current approach is to port over pieces of this facility to Searchfox as we need them to make specific testcases work.
   However, one could imagine a different approach of just... copying the class from clangd wholesale, as a piece of code which has been battle-tested in a different project? (It doesn't have any dependencies on anything else in clangd, only public clang APIs.)
   Let me know if you have a preference here, I'm happy to switch approaches.
   (Granted, even better would be to upstream `HeuristicResolver` to `libClangTooling` or somesuch and expose it as a public clang API, and have both clangd and Searchfox consume it from there. I'd like to explore this, but I'd prefer not to block our progress on Searchfox enhancements on this LLVM change, because LLVM patches can be slow to be reviewed (and then we'd need to adopt the new LLVM version in which it appears and so on.))

I have no problem if you want to copy the classes over now or in the future when you make your next change. I see the files already have per-file license headers and their Apache 2.0 license is compatible with mozsearch's MPL2. They also look nicely standalone, so it's not like we'd spend a lot of time trying to disentangle them from other clangd code. One thing we might do is put them in a "from-clangd" subdirectory with a README that explains what they are. This could help make it clear to anyone wanting to make changes that they're touching a file that has an upstream without us having to add "HEY DON'T MODIFY THIS FILE" comments every time we copy over a new revision from clangd. But again, no need to do that now.

I definitely don't think we want to reinvent any wheels clangd has invented, or to make it harder to upstream enhancements made in searchfox to clangd.

Copy link
Contributor

@asutherland asutherland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in my comments, I think this is amazing as-is and can probably already land (and have the MozsearchIndexer.cpp changes propagated to m-c). I'm happy to continue discussing potential next steps as it relates to weak source records or "usages", etc. here or on bugzilla. I did mention some potential spin-offs, but there's no need for you to file anything you're not personally interested in/planning on implementing. (Like, in particular, the contextsym merging isn't something you need to do or file.)

Thanks for all your work on searchfox! This is very exciting!

@theres-waldo
Copy link
Contributor Author

Heuristics and/or Concrete

  1. Currently, in IndexConsumer::VisitCXXDependentScopeMemberExpr(), I both (1) produce a heuristic result (that's what's new in this patch), and (2) record the location for possibly producing concrete results when visiting instantiations (if we have any) during the AnalyzeDependent phase (what the code was doing before).
    I'm wondering, should we add some logic to only produce the heuristic result if we didn't end up producing concrete results from the AnalyzeDependent phase?
    I feel like I could go either way on this. On the one hand, the indexer might only know about a subset of the instantiations actually occurring in the project, and the concrete results might therefore miss something that the heuristic results catch. On the other hand, if e.g. we're working with an overload set, it's possible that all instantiations end up using only a subset of the overloads, and the heuristic results would add noise by providing all overloads. In other words, I don't think either approach is categorically more accurate than the other.

Aside: It would be great to have a specific test case that could exercise the permutations being discussed here.

Here are a couple of (somewhat contrived) examples to illustrate the two cases:

On the one hand, the indexer might only know about a subset of the instantiations actually occurring in the project, and the concrete results might therefore miss something that the heuristic results catch.

An example here might be:

template <typename Surface>
struct DrawingContext {
  void draw(Rectangle);
  void draw(Triangle);
  void draw(Circle);
};

template <typename Surface, typename Shape>
void foo(DrawingContext<Surface>& d, Shape& s) {
  d.draw(s);
}

Let's suppose that, for whatever reason, the indexer has only seen instantiations of foo with Shape = Circle (could be due to imperfections in the indexer implementation in identifying and merging callers from other translation units, or maybe callers with other shapes just haven't been written yet). The concrete results would only include draw(Circle), while the heuristic results would include all three draw() overloads. In this case the heuristic results are better because conceptually any of the overloads is a valid target.

On the other hand, if e.g. we're working with an overload set, it's possible that all instantiations end up using only a subset of the overloads, and the heuristic results would add noise by providing all overloads.

An example here might be:

template <typename OutputStream>
struct Serializer {
  void serialize(std::string);
  void serialize(int);
  void serialize(URL);
  // bunch of other overloads
  void serialize(FontMetrics);
};

template <typename OutputStream>
void serializeFont(Serializer<OutputStream> s, Font f) {
  // ...
  s.serialize(f.GetMetrics());
}

This time, all instantiations use serialize(FontMetrics), and the concrete results will only include that, but the heuristic results will include all overloads most of which are irrelevant.

(Why aren't the heuristics smart enough to figure out the target will always be serialize(FontMetrics), you ask? The answer is that currently we don't attempt to perform overload resolution prior to instantiation. That's something we can consider adding in the future, especially in cases like this where the arguments themselves are not dependent. However, you can easily imagine a slight variation where, say, Font has a template parameter too (which serializeFont() is also parameterized on), so now the arguments are dependent, which is not something the C++ overload resolution algorithm handles (and in particular trying to run clang's implementation of the overload resolution algorithm on dependent inputs will quickly run into asserts and such). We could probably devise and implement a heuristic overload resolution algorithm that handles dependent inputs and gives the expected results in cases like this, but now we're in "small research project" territory.)


Not sure if you just wanted to see the test cases to illustrate the tradeoff, or if you meant adding them to the test suite -- I'm happy to do the latter if you'd like.

@theres-waldo
Copy link
Contributor Author

I could imagine we could do something so that we annotate the overload set source records we emit so that they will get replaced by concrete records when present for the same token with the same pretty. For now I'll call this a "weak source record", but there are potentially more clever things we could do; more a little below.

[...]

Yeah, a "weak" annotation of some sort makes a lot of sense to me. Eclipse CDT uses something similar for e.g. find-references results where it's less sure that a use is definitely referencing the entity you're searching for, and annotates those as a "potential result" in the results list.

I'll file a follow-up for exploring this further.

@theres-waldo
Copy link
Contributor Author

  1. The HeuristicResolver facility added in this patch is a slimmed-down version of clangd's (currently much more featureful) facility of the same name.
    My current approach is to port over pieces of this facility to Searchfox as we need them to make specific testcases work.
    However, one could imagine a different approach of just... copying the class from clangd wholesale, as a piece of code which has been battle-tested in a different project? (It doesn't have any dependencies on anything else in clangd, only public clang APIs.)
    Let me know if you have a preference here, I'm happy to switch approaches.

I have no problem if you want to copy the classes over now or in the future when you make your next change. I see the files already have per-file license headers and their Apache 2.0 license is compatible with mozsearch's MPL2. They also look nicely standalone, so it's not like we'd spend a lot of time trying to disentangle them from other clangd code. One thing we might do is put them in a "from-clangd" subdirectory with a README that explains what they are. This could help make it clear to anyone wanting to make changes that they're touching a file that has an upstream without us having to add "HEY DON'T MODIFY THIS FILE" comments every time we copy over a new revision from clangd. But again, no need to do that now.

I definitely don't think we want to reinvent any wheels clangd has invented, or to make it harder to upstream enhancements made in searchfox to clangd.

Sounds good! I'll leave this patch as-is, and expand HeuristicResolver to be a full copy of clangd's in my next patch.

@theres-waldo theres-waldo changed the title [WIP] Bug 1828223 - Implement heuristic resolution of CXXDependentScopeMemberExpr Bug 1828223 - Implement heuristic resolution of CXXDependentScopeMemberExpr May 12, 2023
@theres-waldo theres-waldo merged commit 7ad0e59 into mozsearch:master May 12, 2023
@theres-waldo theres-waldo deleted the bug1828223 branch May 12, 2023 23:34
@asutherland
Copy link
Contributor

On the one hand, the indexer might only know about a subset of the instantiations actually occurring in the project, and the concrete results might therefore miss something that the heuristic results catch.

An example here might be:

template <typename Surface>
struct DrawingContext {
  void draw(Rectangle);
  void draw(Triangle);
  void draw(Circle);
};

template <typename Surface, typename Shape>
void foo(DrawingContext<Surface>& d, Shape& s) {
  d.draw(s);
}

Let's suppose that, for whatever reason, the indexer has only seen instantiations of foo with Shape = Circle (could be due to imperfections in the indexer implementation in identifying and merging callers from other translation units, or maybe callers with other shapes just haven't been written yet). The concrete results would only include draw(Circle), while the heuristic results would include all three draw() overloads. In this case the heuristic results are better because conceptually any of the overloads is a valid target.

Thanks for the specific example, this really enhances my ability to discuss this aspect of template indexing!

I agree that the heuristic result's overload set is fundamentally interesting in this case. I think the concrete uses and the ability to distinguish them would also be useful, both at the point of the use for context menu purposes (source record) and in the search results (target records).

Re: CDT and "potential results", we also see something similar from GitHub and Sourcegraph, although those are more about the indexer; the terms "precise" and "search based" in particular are used to distinguish between "we used something that understands the full semantics of the language" versus "we used tree-sitter so it's not perfect". As we potentially overhaul our understanding of JS from our current "we assume all JS we ever see is loaded into a single global" to using the more precise scip-typescript in some cases (or all cases, but with a pile of hacks), it could very much make sense for searchfox to grow a concept of "symbol confidence".

Specifically, if implemented, we'd:

  • Establish a total ordering for symbol confidence along these lines (which includes cases you may not care about but I'm including for my own future reference) from most confident to least confident:
    • concrete: Definitely saw this (use) happen. Most things in C++ would be concrete and we could default to this.
      • For JS via scip-typescript, exact cross-references generated from that could be concrete, but we'd add distinct confidences for when we add references beyond what scip-typescript provides.
    • template-overload: This is part of the template overload set. Going very specific here to start with because I think this is conceptually unique to templates for C++ so far, at least compared to virtual method overrides where we explicitly do have a symbol with a declaration location. (Well, non-templated overloads are similar, but arguably if we want to make that overload set more explicit, it should still be distinct from the template overload set for UI purposes.)
    • js-global: We added/upgraded this edge because of high confidence about files being loaded into the same global, even though scip-typescript didn't know it. For example, using the machinery I proposed in https://bugzilla.mozilla.org/show_bug.cgi?id=1775130.
    • repo-linked: We're using a heuristic defined in the repository definition to add this edge (potentially replacing some other edge), like an edge from nsIRunnable subclass constructors to their run methods.
    • js-symbolic: Symbolic execution was used to glue js-global or higher confidence things together.
    • js-soup-qualified: This is a guess based on our classic JS analyzer's modeling of everything existing in a single global for qualified symbols. So for var Singleton = { aMethod() {} }, this would correspond to the current Singleton#aMethod symbol. This works well with singletons because they will use the fully qualified name which reduces collisions.
    • js-soup-flat: The flattened symbol model of the classic JS analyzer modeling, corresponding to #aMethod from the prior example. This works better than the qualified variant for classes where the type isn't known (which is all the time for the classic JS analyzer).
  • Add a confidence (maybe shorted to con?) attribute on source records that would be a parallel array to sym with each confidence entry being associated with the sym array entry at the same index.
  • Add a confidence/con attribute on target records that is singular because symbols in target records are scalar.
  • Source record merging would choose higher confidence values over lower confidence values when merging. There would only ever be a single confidence value for a given symbol in a source record.
  • Target records would be subject to merging and would similarly favor higher confidence values over lower confidence values.
  • The output-file.rs/format.rs machinery would propagate the confidence values through to a data-confidences attribute on the HTML (so the context menu logic could consult it).
  • The context menu would, whenever it sees a token with confidences including "template-overload":
    • Prefix any "go to def" variations with "concrete" or "potential" as appropriate.
      • It might also be necessary to have some overload protection here as it's possible for both the concrete and potential sets to be quite large. It likely would make sense to not show any "go to def"s if the concrete set is large (>5?), or to show the concrete and suppress the "potential"s if the concrete set is reasonable but the potential set is large (>5). Numbers could potentially be added to the "Search" variations to clarify the number of symbols in the overload sets.
    • Build a "search" menu variation for both "concrete" and "potential"(?) sets whenever it sees that there is a "template-overload" confidence involved. The potential set would include the concrete confidence symbols as well, as those would have replaced the "template-overload" confidences.
  • The "search" endpoint as exposed by router.py could further tuple the qualtype for uses so that "template-overload" would be separate from the normal (concrete) uses.
  • The "query" endpoint's faceting could expose the confidence directly as one of its mechanisms for faceting and avoid needing to separate the concrete uses from potential uses.

We don't need to implement this right now, but I always like to have a working destination in mind, and I may try noodling around with it in my hobby time, as it seems quite useful to have an incremental approach for improving our JS implementation incrementally rather than an all-or-nothing transition.

On the other hand, if e.g. we're working with an overload set, it's possible that all instantiations end up using only a subset of the overloads, and the heuristic results would add noise by providing all overloads.

An example here might be:

template <typename OutputStream>
struct Serializer {
  void serialize(std::string);
  void serialize(int);
  void serialize(URL);
  // bunch of other overloads
  void serialize(FontMetrics);
};

template <typename OutputStream>
void serializeFont(Serializer<OutputStream> s, Font f) {
  // ...
  s.serialize(f.GetMetrics());
}

This time, all instantiations use serialize(FontMetrics), and the concrete results will only include that, but the heuristic results will include all overloads most of which are irrelevant.

I think the proposal above could still work for this.

(Why aren't the heuristics smart enough to figure out the target will always be serialize(FontMetrics), you ask? The answer is that currently we don't attempt to perform overload resolution prior to instantiation. That's something we can consider adding in the future, especially in cases like this where the arguments themselves are not dependent. However, you can easily imagine a slight variation where, say, Font has a template parameter too (which serializeFont() is also parameterized on), so now the arguments are dependent, which is not something the C++ overload resolution algorithm handles (and in particular trying to run clang's implementation of the overload resolution algorithm on dependent inputs will quickly run into asserts and such). We could probably devise and implement a heuristic overload resolution algorithm that handles dependent inputs and gives the expected results in cases like this, but now we're in "small research project" territory.)

Yeah, that seems like a lot of work. I think it's very reasonable to say that ugly/complex code is going to result in an ugly/complex experience. That said, I think reasonable things searchfox could help do in situations like this are:

  • Provide ways to visualize / facet / filter things.
  • Letting humans provide superseding graph edges through magic comments or repository definitions which can skip over situations like this where the static control flow graph goes through what effectively is a super node and everything else is dynamic. For example, the strict control flow graph for our IPC calls would be a mess where everything is connected to everything else. But we explicitly introduce a magic link from SendFoo to RecvFoo which hides the complex stuff under the hood.

Not sure if you just wanted to see the test cases to illustrate the tradeoff, or if you meant adding them to the test suite -- I'm happy to do the latter if you'd like.

I think both are useful, but I think it would be particularly useful to add them to the test suite, as this makes it easier to dig in to what's currently happening. Also, it means in the future people can just link to the in-(mozsearch-)tree examples rather than have to potentially find this thread or sketch them up from whole cloth. Also it helps make sure details are filled in.

To do some quick investigation myself, I tried to make your shapes example more concrete in this template_shapes.cpp (which you should feel free to further iterate on or ignore, etc.):

class GenericSurface {};

class Rectangle {};
class Triangle {};
class Circle {};

template <typename Surface> struct DrawingContext {
  void draw(Rectangle);
  void draw(Triangle);
  void draw(Circle);
};

template <typename Surface, typename Shape>
void foo(DrawingContext<Surface> &d, Shape &s) {
  d.draw(s);
}

int main(void) {
  GenericSurface surface;
  DrawingContext<GenericSurface> context;
  Circle circle;

  foo(context, circle);
}

@theres-waldo
Copy link
Contributor Author

it could very much make sense for searchfox to grow a concept of "symbol confidence"

I filed bug 1833552 for distinguishing between heuristic and concrete results in the context of templates. Please feel free to transmogrify it into a more general issue about symbol confidence, or to make it depend on such an issue etc.

currently we don't attempt to perform overload resolution prior to instantiation. That's something we can consider adding in the future

I filed bug 1833548 about further improvements in this space. I think there is some low-hanging fruit here (simple heuristics like arity) that can be tackled without solving the more general problem of overload resolution with dependent arguments.

I think it would be particularly useful to add [an example illustrating the behaviour for heuristic vs. concrete results] to the test suite

Will do, I filed bug 1833529 for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants