[clang][dataflow] Strengthen pointer comparison. #75170

martinboehme · 2023-12-12T12:22:09Z

Instead of comparing the identity of the PointerValues, compare the
underlying StorageLocations.
If the StorageLocations are the same, return a definite "true" as the
result of the comparison. Before, if the PointerValues were different, we
would return an atom, even if the storage locations themselves were the same.
If the StorageLocations are different, return an atom (as before). Pointers
that have different storage locations may still alias, so we can't return a
definite "false" in this case.

The application-level gains from this are relatively modest. For the Crubit
nullability check running on an internal codebase, this change reduces the
number of functions on which the SAT solver times out from 223 to 221; the
number of "pointer expression not modeled" errors reduces from 3815 to 3778.

Still, it seems that the gain in precision is generally worthwhile.

@Xazax-hun inspired me to think about this with his
comments
on a different PR.

llvmbot · 2023-12-12T12:22:39Z

@llvm/pr-subscribers-clang-analysis

@llvm/pr-subscribers-clang

Author: None (martinboehme)

Changes

Instead of comparing the identity of the PointerValues, compare the
underlying StorageLocations.
If the StorageLocations are different, return a definite "false" as the
result of the comparison. Before, if the PointerValues were different, the
best we could do was to return an atom (because the StorageLocations might
still be the same).

On an internal codebase, this change reduces SAT solver timeouts by over 20% and
"maximum iterations reached" errors by over 50%. In addition, it obviously
improves the precision of the analysis.

@Xazax-hun inspired me to think about this with his
comments
on a different PR.

The one thing where the new code currently does the wrong thing is when
comparing the addresses of different members of a union. By the language
standard, all members of a union should have the same address, but we currently
model them with different StorageLocations, and so with this change, we will
return false when comparing the addreses.

I propose that this is acceptable because is unlikely to affect the behavior of
real-world code in meaningful ways.

With this change, the test TransferTest.DifferentReferenceLocInJoin started to
fail because the code under test no longer set up the desired state where a
variable of reference type is mapped to two different storage locations in
environments being joined. Rather than trying to modify this test to set up the
test condition again, I have chosen to replace the test with an equivalent
test in DataflowEnvironmentTest.cpp that sets up the test condition directly;
because this test is more direct, it will also be less brittle in the face of
future changes.

Full diff: https://github.com/llvm/llvm-project/pull/75170.diff

3 Files Affected:

(modified) clang/lib/Analysis/FlowSensitive/Transfer.cpp (+5)
(modified) clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp (+42)
(modified) clang/unittests/Analysis/FlowSensitive/TransferTest.cpp (+60-44)

diff --git a/clang/lib/Analysis/FlowSensitive/Transfer.cpp b/clang/lib/Analysis/FlowSensitive/Transfer.cpp
index bbf5f12359bc70..55db15e03b39cb 100644
--- a/clang/lib/Analysis/FlowSensitive/Transfer.cpp
+++ b/clang/lib/Analysis/FlowSensitive/Transfer.cpp
@@ -60,6 +60,11 @@ static BoolValue &evaluateBooleanEquality(const Expr &LHS, const Expr &RHS,
     if (auto *RHSBool = dyn_cast_or_null<BoolValue>(RHSValue))
       return Env.makeIff(*LHSBool, *RHSBool);
 
+  if (auto *LHSPtr = dyn_cast_or_null<PointerValue>(LHSValue))
+    if (auto *RHSPtr = dyn_cast_or_null<PointerValue>(RHSValue))
+      return Env.getBoolLiteralValue(&LHSPtr->getPointeeLoc() ==
+                                     &RHSPtr->getPointeeLoc());
+
   return Env.makeAtomicBoolValue();
 }
 
diff --git a/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp b/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp
index 003434a58b1075..de18f7de02c12d 100644
--- a/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp
+++ b/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp
@@ -24,6 +24,7 @@ namespace {
 
 using namespace clang;
 using namespace dataflow;
+using ::clang::dataflow::test::findValueDecl;
 using ::clang::dataflow::test::getFieldValue;
 using ::testing::Contains;
 using ::testing::IsNull;
@@ -152,6 +153,47 @@ TEST_F(EnvironmentTest, JoinRecords) {
   }
 }
 
+TEST_F(EnvironmentTest, DifferentReferenceLocInJoin) {
+  // This tests the case where the storage location for a reference-type
+  // variable is different for two states being joined. We used to believe this
+  // could not happen and therefore had an assertion disallowing this; this test
+  // exists to demonstrate that we can handle this condition without a failing
+  // assertion. See also the discussion here:
+  // https://discourse.llvm.org/t/70086/6
+
+  using namespace ast_matchers;
+
+  std::string Code = R"cc(
+    void f(int &ref) {}
+  )cc";
+
+  auto Unit =
+      tooling::buildASTFromCodeWithArgs(Code, {"-fsyntax-only", "-std=c++11"});
+  auto &Context = Unit->getASTContext();
+
+  ASSERT_EQ(Context.getDiagnostics().getClient()->getNumErrors(), 0U);
+
+  const ValueDecl *Ref = findValueDecl(Context, "ref");
+
+  Environment Env1(DAContext);
+  StorageLocation &Loc1 = Env1.createStorageLocation(Context.IntTy);
+  Env1.setStorageLocation(*Ref, Loc1);
+
+  Environment Env2(DAContext);
+  StorageLocation &Loc2 = Env2.createStorageLocation(Context.IntTy);
+  Env2.setStorageLocation(*Ref, Loc2);
+
+  EXPECT_NE(&Loc1, &Loc2);
+
+  Environment::ValueModel Model;
+  Environment EnvJoined = Environment::join(Env1, Env2, Model);
+
+  // Joining environments with different storage locations for the same
+  // declaration results in the declaration being removed from the joined
+  // environment.
+  EXPECT_EQ(EnvJoined.getStorageLocation(*Ref), nullptr);
+}
+
 TEST_F(EnvironmentTest, InitGlobalVarsFun) {
   using namespace ast_matchers;
 
diff --git a/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp b/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
index 8da55953a32986..a4b47f25f36908 100644
--- a/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
+++ b/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
@@ -3893,6 +3893,66 @@ TEST(TransferTest, BooleanInequality) {
       });
 }
 
+TEST(TransferTest, PointerEquality) {
+  std::string Code = R"(
+    void target() {
+      int i = 0;
+      int i_other = 0;
+      int *p1 = &i;
+      int *p2 = &i;
+      int *p_other = &i_other;
+      int *null = nullptr;
+
+      bool p1_eq_p1 = (p1 == p1);
+      bool p1_eq_p2 = (p1 == p2);
+      bool p1_eq_p_other = (p1 == p_other);
+
+      bool p1_eq_null = (p1 == null);
+      bool p1_eq_nullptr = (p1 == nullptr);
+      bool null_eq_nullptr = (null == nullptr);
+      bool nullptr_eq_nullptr = (nullptr == nullptr);
+
+      // We won't duplicate all of the tests above with `!=`, as we know that
+      // the implementation simply negates the result of the `==` comparison.
+      // Instaed, just spot-check one case.
+      bool p1_ne_p_other = (p1 != p_other);
+
+      (void)0; // [[p]]
+    }
+  )";
+  runDataflow(
+      Code,
+      [](const llvm::StringMap<DataflowAnalysisState<NoopLattice>> &Results,
+         ASTContext &ASTCtx) {
+        const Environment &Env = getEnvironmentAtAnnotation(Results, "p");
+
+        // Check the we have indeed set things up so that `p1` and `p2` have
+        // different pointer values.
+        EXPECT_NE(&getValueForDecl<PointerValue>(ASTCtx, Env, "p1"),
+                  &getValueForDecl<PointerValue>(ASTCtx, Env, "p2"));
+
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_p1"),
+                  &Env.getBoolLiteralValue(true));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_p2"),
+                  &Env.getBoolLiteralValue(true));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_p_other"),
+                  &Env.getBoolLiteralValue(false));
+
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_null"),
+                  &Env.getBoolLiteralValue(false));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_nullptr"),
+                  &Env.getBoolLiteralValue(false));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "null_eq_nullptr"),
+                  &Env.getBoolLiteralValue(true));
+        EXPECT_EQ(
+            &getValueForDecl<BoolValue>(ASTCtx, Env, "nullptr_eq_nullptr"),
+            &Env.getBoolLiteralValue(true));
+
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_ne_p_other"),
+                  &Env.getBoolLiteralValue(true));
+      });
+}
+
 TEST(TransferTest, IntegerLiteralEquality) {
   std::string Code = R"(
     void target() {
@@ -6315,48 +6375,4 @@ TEST(TransferTest, LambdaCaptureThis) {
       });
 }
 
-TEST(TransferTest, DifferentReferenceLocInJoin) {
-  // This test triggers a case where the storage location for a reference-type
-  // variable is different for two states being joined. We used to believe this
-  // could not happen and therefore had an assertion disallowing this; this test
-  // exists to demonstrate that we can handle this condition without a failing
-  // assertion. See also the discussion here:
-  // https://discourse.llvm.org/t/70086/6
-  std::string Code = R"(
-    namespace std {
-      template <class T> struct initializer_list {
-        const T* begin();
-        const T* end();
-      };
-    }
-
-    void target(char* p, char* end) {
-      while (p != end) {
-        if (*p == ' ') {
-          p++;
-          continue;
-        }
-
-        auto && range = {1, 2};
-        for (auto b = range.begin(), e = range.end(); b != e; ++b) {
-        }
-        (void)0;
-        // [[p]]
-      }
-    }
-  )";
-  runDataflow(
-      Code,
-      [](const llvm::StringMap<DataflowAnalysisState<NoopLattice>> &Results,
-         ASTContext &ASTCtx) {
-        const Environment &Env = getEnvironmentAtAnnotation(Results, "p");
-
-        // Joining environments with different storage locations for the same
-        // declaration results in the declaration being removed from the joined
-        // environment.
-        const ValueDecl *VD = findValueDecl(ASTCtx, "range");
-        ASSERT_EQ(Env.getStorageLocation(*VD), nullptr);
-      });
-}
-
 } // namespace

Xazax-hun

This is a massive improvement! Thanks! Looks great to me, I have some not too important comments inline.

Xazax-hun · 2023-12-12T18:20:30Z

clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp

@@ -152,6 +153,47 @@ TEST_F(EnvironmentTest, JoinRecords) {
  }
 }

+TEST_F(EnvironmentTest, DifferentReferenceLocInJoin) {


While running analysis on C++ code is indirect, it has the advantage of documenting what code can trigger such behavior. I think having direct tests is fine, I just wanted to make this decision explicit.

Agree. The original test case was minimized from a failing assertion, but it would likely be fiddly to try and recreate a test case that triggers this again. Testing directly also has the benefit of being less brittle, so I think it's an acceptable tradeoff in this case.

Xazax-hun · 2023-12-12T18:21:24Z

clang/unittests/Analysis/FlowSensitive/TransferTest.cpp

@@ -3893,6 +3893,66 @@ TEST(TransferTest, BooleanInequality) {
      });
 }

+TEST(TransferTest, PointerEquality) {


I think it would be great to add a test case with union members to document the limitation.

Good point -- done.

Xazax-hun · 2023-12-12T18:46:30Z

An alternative approach would be to do hash consing. Since pointer values are completely determined by the pointee (for some definition of the pointee, e.g., we might have void* and int* pointers pointing to the same storage location), we could have a single representation for all of the equal pointer values and continue to rely on identity. This can also be beneficial for memory consumption, but we have to pay for that by making value creation a bit more expensive (and the pointer values needs to be immutable).

That being said, this would be a bit of a departure from the current architecture, so I think even if we want to do this, it should be a separate PR.

martinboehme · 2023-12-13T11:00:48Z

An alternative approach would be to do hash consing. Since pointer values are completely determined by the pointee (for some definition of the pointee, e.g., we might have void* and int* pointers pointing to the same storage location), we could have a single representation for all of the equal pointer values and continue to rely on identity. This can also be beneficial for memory consumption, but we have to pay for that by making value creation a bit more expensive (and the pointer values needs to be immutable).

That being said, this would be a bit of a departure from the current architecture, so I think even if we want to do this, it should be a separate PR.

Thanks, these are good points.

I think we may want to revisit multiple aspects of pointer values and storage locations at some point. For example, storage locations always have a type, and so storage locations of different types necessarily compare different. For example:

struct Base {};
struct Derived : public Base {};
Derived d;
Derived *p_derived = &d;
Base *p_base = d;

// `p_derived == p_base` should hold, but the framework assumes `p_derived != p_base`.

It's probably a good thing for storage locations to have types (because we want to be able to model the dynamic type of an object), but this probably means that we want to model addresses separately from storage locations. This would then also help in other cases, such as union members and pointer interconvertibility. But all of this would be a major undertaking, I think.

Xazax-hun · 2023-12-13T17:41:59Z

It's probably a good thing for storage locations to have types (because we want to be able to model the dynamic type of an object), but this probably means that we want to model addresses separately from storage locations.

Our knowledge of the dynamic type might change during the analysis. Consider the following snippet:

void f(Base* b)
{
  if (dynamic_cast<Derived*>(b)) {
    // Here, we know that the dynamic type is (a subclass of) Derived
  }
}

So, we either need to mutate storage locations, or we need to store the dynamic types on the side. The Clang Static Analyzer (CSA) is doing the latter. While CSA currently has types associated with the storage locations, we are considering undoing this decision and make storage locations only represented by sizes and offsets. We plan to always rely on the AST to determine the static types, and use a data structure on the side to track the dynamic type information.

For some frameworks having typed storage locations might be a good idea, it is not obvious to me where is the sweet spot and it will need careful consideration. (And as far as I understand, we are not even sure whether we want to switch to a different representation like access paths in the future.)

But all of this would be a major undertaking, I think.

Absolutely agree. The Clang Static Analyzer still has problems with type casts in some cases, it is a really challenging problem. That being said, the CSA is already immensely useful even with the partial modeling it has, so this is probably one of the less pressing problems to solve in the near/medium term.

martinboehme · 2023-12-14T13:39:17Z

Heads up: I'm holding off on merging this patch because it makes some tests in the Crubit nullability analysis fail. I'm still investigating exactly why this is.

martinboehme · 2024-01-24T12:55:53Z

Finally getting back to this.

The reason this was causing tests to fail is really pretty obvious in hindsight: Different StorageLocations may alias, so if the StorageLocations for two PointerValues are different, we can't assume that the pointers must compare unequal.

Here's an example:

void f(int *p1, int *p2) {
  if (p1 == p2) {
    do_something();
  }
}

The framework initializes p1 and p2 with different storage locations, so the current state of this PR causes the comparison p1 == p2 to be evaluated to a false literal, which in turn causes the do_something() block to be considered dead code. But this is nonsense: p1 and p2 may alias, so it's definitely possible that do_something() will be executed.

In summary, if the storage locations for two pointers are the same, we can return a true literal for the comparison, but if the storage locations are different, we need to return an atom.

I'll change the PR accordingly and will let you know once it's ready for another look.

@Xazax-hun

- Instead of comparing the identity of the `PointerValue`s, compare the underlying `StorageLocation`s. - If the `StorageLocation`s are the same, return a definite "true" as the result of the comparison. Before, if the `PointerValue`s were different, we would return an atom, even if the storage locations themselves were the same. - If the `StorageLocation`s are different, return an atom (as before). Pointers that have different storage locations may still alias, so we can't return a definite "false" in this case. The application-level gains from this are relatively modest. For the Crubit nullability check running on an internal codebase, this change reduces the number of functions on which the SAT solver times out from 223 to 221; the number of "pointer expression not modeled" errors reduces from 3815 to 3778. Still, it seems that the gain in precision is generally worthwhile. @Xazax-hun inspired me to think about this with his [comments](llvm#73860 (review)) on a different PR.

martinboehme · 2024-05-06T10:09:22Z

I'll change the PR accordingly and will let you know once it's ready for another look.

Sorry, this completely fell off my radar. I've finally revisited this now, and the PR is ready for review again. PTAL.

Xazax-hun · 2024-05-06T15:26:01Z

In summary, if the storage locations for two pointers are the same, we can return a true literal for the comparison, but if the storage locations are different, we need to return an atom.

I am wondering, if it would make sense to "fix up" the state when we discover aliasing? I.e., "merging" the two storage locations. Consider:

void f(MyBoolPair* a, MyBoolPair *b) {
  if (a == b && a->first && !b->second)
  {
    // Here, we should know: a->first == b->first == true and a->second == b->second == false
  } 
}

martinboehme · 2024-05-07T07:34:29Z

In summary, if the storage locations for two pointers are the same, we can return a true literal for the comparison, but if the storage locations are different, we need to return an atom.

I am wondering, if it would make sense to "fix up" the state when we discover aliasing? I.e., "merging" the two storage locations. Consider:
void f(MyBoolPair* a, MyBoolPair *b) {
  if (a == b && a->first && !b->second)
  {
    // Here, we should know: a->first == b->first == true and a->second == b->second == false
  } 
}

This sounds great -- but I think getting this to work wouldn't be trivial. At least, I don't yet have any good ideas on how to make it work.

I think we'd have to do something like this:

When we evaluate a == b, associate this expression with an atom, then remember that this atom means that the storage locations for the two pointers are the same.
When we enter a new block, if we can prove from the flow condition that the atom mentioned in step 1 is true, remember that the storage locations for a and b must alias.
When we evaluate b->first -- I guess we iterate over all of the storage locations that must alias with the storage location for b, and if, for at least one of them, we can prove that b->first is true, we know it must be true? (I'm not super-conversant in alias analysis -- would have to read up more about this.)

Would require some more thought and definitely quite a bit of effort to implement, I think.

martinboehme · 2024-05-07T08:05:18Z

CI failure (DataFlowSanitizer-x86_64 :: release_shadow_space.c) looks unrelated.

See the TODOs that have now been fixed. The dataflow framework now considers pointers with the same storage location to compare equal (see upstream [#75170](llvm/llvm-project#75170), and this allows us to directly compare pointers in the tests. PiperOrigin-RevId: 635791000 Change-Id: I1351d76eee76668b2a3be0efbce829c8c65874ee

martinboehme requested a review from ymand December 12, 2023 12:22

llvmbot added clang Clang issues not falling into any other category clang:dataflow Clang Dataflow Analysis framework - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html clang:analysis labels Dec 12, 2023

martinboehme requested a review from Xazax-hun December 12, 2023 12:22

Xazax-hun approved these changes Dec 12, 2023

View reviewed changes

martinboehme force-pushed the piper_export_cl_588011092 branch from 270172e to dd4aa8b Compare December 13, 2023 10:52

ymand approved these changes Dec 20, 2023

View reviewed changes

martinboehme force-pushed the piper_export_cl_588011092 branch from dd4aa8b to 0ec8426 Compare May 6, 2024 10:06

ymand approved these changes May 6, 2024

View reviewed changes

Xazax-hun approved these changes May 6, 2024

View reviewed changes

martinboehme merged commit f3fbd21 into llvm:main May 7, 2024
3 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[clang][dataflow] Strengthen pointer comparison. #75170

[clang][dataflow] Strengthen pointer comparison. #75170

martinboehme commented Dec 12, 2023 •

edited

llvmbot commented Dec 12, 2023 •

edited

Xazax-hun left a comment

Xazax-hun Dec 12, 2023

martinboehme Dec 13, 2023

Xazax-hun Dec 12, 2023

martinboehme Dec 13, 2023

Xazax-hun commented Dec 12, 2023

martinboehme commented Dec 13, 2023

Xazax-hun commented Dec 13, 2023 •

edited

martinboehme commented Dec 14, 2023

martinboehme commented Jan 24, 2024

martinboehme commented May 6, 2024

Xazax-hun commented May 6, 2024

martinboehme commented May 7, 2024

martinboehme commented May 7, 2024

[clang][dataflow] Strengthen pointer comparison. #75170

[clang][dataflow] Strengthen pointer comparison. #75170

Conversation

martinboehme commented Dec 12, 2023 • edited

llvmbot commented Dec 12, 2023 • edited

Xazax-hun left a comment

Choose a reason for hiding this comment

Xazax-hun Dec 12, 2023

Choose a reason for hiding this comment

martinboehme Dec 13, 2023

Choose a reason for hiding this comment

Xazax-hun Dec 12, 2023

Choose a reason for hiding this comment

martinboehme Dec 13, 2023

Choose a reason for hiding this comment

Xazax-hun commented Dec 12, 2023

martinboehme commented Dec 13, 2023

Xazax-hun commented Dec 13, 2023 • edited

martinboehme commented Dec 14, 2023

martinboehme commented Jan 24, 2024

martinboehme commented May 6, 2024

Xazax-hun commented May 6, 2024

martinboehme commented May 7, 2024

martinboehme commented May 7, 2024

martinboehme commented Dec 12, 2023 •

edited

llvmbot commented Dec 12, 2023 •

edited

Xazax-hun commented Dec 13, 2023 •

edited