[Umbrella] Overly eager assumption chains - the "if(x) if(y)" problem. #61669

haoNoQ · 2023-03-24T01:40:57Z

This keeps coming up so I guess I'll make a reusable writeup about this very important fundamental issue. It's probably the single most fundamental problem in the static analyzer, that keeps hurting us in various ways. For example, I claim that problems like "invalidation caused us to explore an infeasible path" aren't actually caused by invalidation; they're caused by this issue instead. Without this issue we'd have some real chance of having zero false positives.

The rough idea is that forking the analysis state to explore different branches is often unjustified. In particular, just because we've encountered N consecutive if-statements, this doesn't mean that all 2^N possible paths through them are feasible. The problem becomes very apparent at N > 1 but even at N = 1 it could be quite problematic.

The analyzer simulates the source code line by line, and every time it encounters a branch (such as if-statement), if it doesn't know which branch is taken, it forks the analysis to continue simulation of the two branches independently. The two "forks" will never talk to each other again; they emit warnings independently. They know that there was a fork in their history, but they don't know and don't care if the other branch has been explored at all or not, let alone what was the outcome.

The main motivating example for such behavior looks roughly like this:

int foo(int x) {
  if (x != 0) {
    // do irrelevant stuff
  }
  return 123 / x;
}

This code is obviously problematic: it checks the value for zero for some reason, but the check is no longer in effect when the actual division happens, so there's a clear unmitigated risk of division by zero. More precisely, either the program divides by zero, or the check for zero is dead code; one way or another, the code doesn't make sense. The analyzer discovers the problem by first forking the analysis at x != 0 to explore both branches, and then noticing that one of the forks proceeds to divide by literal zero. The warning is then presented to the user as a two-step process: "Assuming x != 0 is false, there's division by zero". If the check is dead code then the statement becomes a "vacuous truth" but it's a useful warning nonetheless because the code still doesn't make sense. Also note that in the absence of the check the warning disappears, as the analyzer doesn't find any path on which division by literal zero happens, which is also the desired behavior.

Unfortunately, this approach doesn't scale beyond this one toy example. Eg., the following division by zero is likely to be a false positive simply because it might be part of foo()'s contract that either x or y should be non-null:

  void foo(int x, int y) {
    if (x) { ... }
    if (!y) {
      return 123 / x;
    }
  }

In this simple example, if the warning is indeed false, it can be suppressed with an assertion, which isn't too bad: the assertion clarifies the function's contract for the reader (be it a human or a tool) and adds a runtime check that can catch the problem early and avoid dealing with undefined behavior.

  void foo(int x, int y) {
    if (x) { ... }
    if (!y) {
      assert(x);
      return 123 / x;
    }
  }

But in many other cases such assertions may turn out to be really weird and obscure:

Instead of x and y being parameters, they could be value of the same variable before and after invalidation, where invalidation is overly conservative and in reality the variable won't change. This is what causes invalidations to introduce weird paths that we have to explore, and these false positives look much uglier than the parameter case.
Even if there's just one if-statement, if the check is in an "inlined" function call (into which we've stepped in the middle of the analysis), the check may not constitute dead code just because the value can't be zero on the current call site (could be for other call sites). This situation is covered by after-the-fact false positive suppressions, so-called "inlined defensive checks" suppressions.
If the loop is there in the program, doesn't mean it can or cannot be executed a certain number of times (most notably, 0 times). But we have to assume a few concrete possibilities, even if exact bounds are unknown. This is like an if-statement with unknown number of branches, where picking every individual branch is theoretically incorrect. But if we give up here, we can't analyze any program that has loops.

One of the practical approaches to reducing such false positives may be to explore the set of equivalent reports as a whole and see if having each of the equivalent reports be false implies that there is dead code in the program. If it does, than the positive is true (for the purposes of this issue) and the report may be adjusted to tell the user that certain assumptions are in fact irrelevant. If it doesn't, then the positive is likely to be false in the sense. Note that this approach doesn't require us to have all-path knowledge; the set of path we explored is enough to estimate our confidence.

But generally, even if we implement such suppressions, we're very likely to end up with a completely different tool, that's not necessarily more practical than the existing tool. As pointed out above, in many cases these paths aren't even infeasible, and even when infeasible, the warnings are often are still desired. You can't really eliminate these false positives without throwing out a lot of useful reports. So we'll probably always be in the business of trying out various practical tradeoffs and not solving this problem in its entirety.

The text was updated successfully, but these errors were encountered:

llvmbot · 2023-03-24T01:41:15Z

@llvm/issue-subscribers-clang-static-analyzer

haoNoQ · 2023-03-24T02:03:02Z

Recent conversations:

https://discourse.llvm.org/t/gsoc-2023-proposal-improve-and-stabilize-the-static-analyzer-s-taint-analysis-checks/68858/4

https://discourse.llvm.org/t/clang-static-analyzer-makes-wrong-assumptions-whether-a-loop-body-is-entered-even-with-sufficient-context/69229

This problem relates to taint analysis:

https://discourse.llvm.org/t/is-there-any-way-to-identify-potential-execution-paths-that-call-panic-from-public-functions/68771/4

haoNoQ added the clang:static analyzer label Mar 24, 2023

haoNoQ self-assigned this Mar 24, 2023

haoNoQ mentioned this issue Apr 4, 2023

[clang static analyzer] core.NullDereference false positive with *p #61884

Open

Geoffrey1014 mentioned this issue Apr 9, 2023

CSA evaluates (255UL == b)==true to be FALSE with the fact that 255UL == b #60042

Open

haoNoQ mentioned this issue Jul 5, 2023

The clang static analyzer cannot generate path-sensitive error reports after enabling the ipa-always-inline-size parameter #63681

Open

haoNoQ mentioned this issue Sep 27, 2023

False positive clang-analyzer-core.DivideZero with std::span #64584

Open

This was referenced May 2, 2024

[clang][static analyzer] false positive cplusplus.NewDelete ("use after free") w/ reference counting #90498

Closed

[analyzer] MallocChecker: Recognize std::atomics in smart pointer suppression. #90918

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Umbrella] Overly eager assumption chains - the "if(x) if(y)" problem. #61669

[Umbrella] Overly eager assumption chains - the "if(x) if(y)" problem. #61669

haoNoQ commented Mar 24, 2023 •

edited by VoltrexKeyva

llvmbot commented Mar 24, 2023

haoNoQ commented Mar 24, 2023

[Umbrella] Overly eager assumption chains - the "if(x) if(y)" problem. #61669

[Umbrella] Overly eager assumption chains - the "if(x) if(y)" problem. #61669

Comments

haoNoQ commented Mar 24, 2023 • edited by VoltrexKeyva

llvmbot commented Mar 24, 2023

haoNoQ commented Mar 24, 2023

haoNoQ commented Mar 24, 2023 •

edited by VoltrexKeyva