Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Umbrella] Overly eager assumption chains - the "if(x) if(y)" problem. #61669

Open
haoNoQ opened this issue Mar 24, 2023 · 2 comments
Open

[Umbrella] Overly eager assumption chains - the "if(x) if(y)" problem. #61669

haoNoQ opened this issue Mar 24, 2023 · 2 comments
Assignees

Comments

@haoNoQ
Copy link
Collaborator

haoNoQ commented Mar 24, 2023

This keeps coming up so I guess I'll make a reusable writeup about this very important fundamental issue. It's probably the single most fundamental problem in the static analyzer, that keeps hurting us in various ways. For example, I claim that problems like "invalidation caused us to explore an infeasible path" aren't actually caused by invalidation; they're caused by this issue instead. Without this issue we'd have some real chance of having zero false positives.

The rough idea is that forking the analysis state to explore different branches is often unjustified. In particular, just because we've encountered N consecutive if-statements, this doesn't mean that all 2^N possible paths through them are feasible. The problem becomes very apparent at N > 1 but even at N = 1 it could be quite problematic.

The analyzer simulates the source code line by line, and every time it encounters a branch (such as if-statement), if it doesn't know which branch is taken, it forks the analysis to continue simulation of the two branches independently. The two "forks" will never talk to each other again; they emit warnings independently. They know that there was a fork in their history, but they don't know and don't care if the other branch has been explored at all or not, let alone what was the outcome.

The main motivating example for such behavior looks roughly like this:

int foo(int x) {
  if (x != 0) {
    // do irrelevant stuff
  }
  return 123 / x;
}

This code is obviously problematic: it checks the value for zero for some reason, but the check is no longer in effect when the actual division happens, so there's a clear unmitigated risk of division by zero. More precisely, either the program divides by zero, or the check for zero is dead code; one way or another, the code doesn't make sense. The analyzer discovers the problem by first forking the analysis at x != 0 to explore both branches, and then noticing that one of the forks proceeds to divide by literal zero. The warning is then presented to the user as a two-step process: "Assuming x != 0 is false, there's division by zero". If the check is dead code then the statement becomes a "vacuous truth" but it's a useful warning nonetheless because the code still doesn't make sense. Also note that in the absence of the check the warning disappears, as the analyzer doesn't find any path on which division by literal zero happens, which is also the desired behavior.

Unfortunately, this approach doesn't scale beyond this one toy example. Eg., the following division by zero is likely to be a false positive simply because it might be part of foo()'s contract that either x or y should be non-null:

  void foo(int x, int y) {
    if (x) { ... }
    if (!y) {
      return 123 / x;
    }
  }

In this simple example, if the warning is indeed false, it can be suppressed with an assertion, which isn't too bad: the assertion clarifies the function's contract for the reader (be it a human or a tool) and adds a runtime check that can catch the problem early and avoid dealing with undefined behavior.

  void foo(int x, int y) {
    if (x) { ... }
    if (!y) {
      assert(x);
      return 123 / x;
    }
  }

But in many other cases such assertions may turn out to be really weird and obscure:

  • Instead of x and y being parameters, they could be value of the same variable before and after invalidation, where invalidation is overly conservative and in reality the variable won't change. This is what causes invalidations to introduce weird paths that we have to explore, and these false positives look much uglier than the parameter case.
  • Even if there's just one if-statement, if the check is in an "inlined" function call (into which we've stepped in the middle of the analysis), the check may not constitute dead code just because the value can't be zero on the current call site (could be for other call sites). This situation is covered by after-the-fact false positive suppressions, so-called "inlined defensive checks" suppressions.
  • If the loop is there in the program, doesn't mean it can or cannot be executed a certain number of times (most notably, 0 times). But we have to assume a few concrete possibilities, even if exact bounds are unknown. This is like an if-statement with unknown number of branches, where picking every individual branch is theoretically incorrect. But if we give up here, we can't analyze any program that has loops.

One of the practical approaches to reducing such false positives may be to explore the set of equivalent reports as a whole and see if having each of the equivalent reports be false implies that there is dead code in the program. If it does, than the positive is true (for the purposes of this issue) and the report may be adjusted to tell the user that certain assumptions are in fact irrelevant. If it doesn't, then the positive is likely to be false in the sense. Note that this approach doesn't require us to have all-path knowledge; the set of path we explored is enough to estimate our confidence.

But generally, even if we implement such suppressions, we're very likely to end up with a completely different tool, that's not necessarily more practical than the existing tool. As pointed out above, in many cases these paths aren't even infeasible, and even when infeasible, the warnings are often are still desired. You can't really eliminate these false positives without throwing out a lot of useful reports. So we'll probably always be in the business of trying out various practical tradeoffs and not solving this problem in its entirety.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 24, 2023

@llvm/issue-subscribers-clang-static-analyzer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants