Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[clang][dataflow] Strengthen pointer comparison. #75170

Merged
merged 1 commit into from
May 7, 2024

Conversation

martinboehme
Copy link
Contributor

@martinboehme martinboehme commented Dec 12, 2023

  • Instead of comparing the identity of the PointerValues, compare the
    underlying StorageLocations.

  • If the StorageLocations are the same, return a definite "true" as the
    result of the comparison. Before, if the PointerValues were different, we
    would return an atom, even if the storage locations themselves were the same.

  • If the StorageLocations are different, return an atom (as before). Pointers
    that have different storage locations may still alias, so we can't return a
    definite "false" in this case.

The application-level gains from this are relatively modest. For the Crubit
nullability check running on an internal codebase, this change reduces the
number of functions on which the SAT solver times out from 223 to 221; the
number of "pointer expression not modeled" errors reduces from 3815 to 3778.

Still, it seems that the gain in precision is generally worthwhile.

@Xazax-hun inspired me to think about this with his
comments
on a different PR.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:dataflow Clang Dataflow Analysis framework - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html clang:analysis labels Dec 12, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Dec 12, 2023

@llvm/pr-subscribers-clang-analysis

@llvm/pr-subscribers-clang

Author: None (martinboehme)

Changes
  • Instead of comparing the identity of the PointerValues, compare the
    underlying StorageLocations.

  • If the StorageLocations are different, return a definite "false" as the
    result of the comparison. Before, if the PointerValues were different, the
    best we could do was to return an atom (because the StorageLocations might
    still be the same).

On an internal codebase, this change reduces SAT solver timeouts by over 20% and
"maximum iterations reached" errors by over 50%. In addition, it obviously
improves the precision of the analysis.

@Xazax-hun inspired me to think about this with his
comments
on a different PR.

The one thing where the new code currently does the wrong thing is when
comparing the addresses of different members of a union. By the language
standard, all members of a union should have the same address, but we currently
model them with different StorageLocations, and so with this change, we will
return false when comparing the addreses.

I propose that this is acceptable because is unlikely to affect the behavior of
real-world code in meaningful ways.

With this change, the test TransferTest.DifferentReferenceLocInJoin started to
fail because the code under test no longer set up the desired state where a
variable of reference type is mapped to two different storage locations in
environments being joined. Rather than trying to modify this test to set up the
test condition again, I have chosen to replace the test with an equivalent
test in DataflowEnvironmentTest.cpp that sets up the test condition directly;
because this test is more direct, it will also be less brittle in the face of
future changes.


Full diff: https://github.com/llvm/llvm-project/pull/75170.diff

3 Files Affected:

  • (modified) clang/lib/Analysis/FlowSensitive/Transfer.cpp (+5)
  • (modified) clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp (+42)
  • (modified) clang/unittests/Analysis/FlowSensitive/TransferTest.cpp (+60-44)
diff --git a/clang/lib/Analysis/FlowSensitive/Transfer.cpp b/clang/lib/Analysis/FlowSensitive/Transfer.cpp
index bbf5f12359bc70..55db15e03b39cb 100644
--- a/clang/lib/Analysis/FlowSensitive/Transfer.cpp
+++ b/clang/lib/Analysis/FlowSensitive/Transfer.cpp
@@ -60,6 +60,11 @@ static BoolValue &evaluateBooleanEquality(const Expr &LHS, const Expr &RHS,
     if (auto *RHSBool = dyn_cast_or_null<BoolValue>(RHSValue))
       return Env.makeIff(*LHSBool, *RHSBool);
 
+  if (auto *LHSPtr = dyn_cast_or_null<PointerValue>(LHSValue))
+    if (auto *RHSPtr = dyn_cast_or_null<PointerValue>(RHSValue))
+      return Env.getBoolLiteralValue(&LHSPtr->getPointeeLoc() ==
+                                     &RHSPtr->getPointeeLoc());
+
   return Env.makeAtomicBoolValue();
 }
 
diff --git a/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp b/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp
index 003434a58b1075..de18f7de02c12d 100644
--- a/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp
+++ b/clang/unittests/Analysis/FlowSensitive/DataflowEnvironmentTest.cpp
@@ -24,6 +24,7 @@ namespace {
 
 using namespace clang;
 using namespace dataflow;
+using ::clang::dataflow::test::findValueDecl;
 using ::clang::dataflow::test::getFieldValue;
 using ::testing::Contains;
 using ::testing::IsNull;
@@ -152,6 +153,47 @@ TEST_F(EnvironmentTest, JoinRecords) {
   }
 }
 
+TEST_F(EnvironmentTest, DifferentReferenceLocInJoin) {
+  // This tests the case where the storage location for a reference-type
+  // variable is different for two states being joined. We used to believe this
+  // could not happen and therefore had an assertion disallowing this; this test
+  // exists to demonstrate that we can handle this condition without a failing
+  // assertion. See also the discussion here:
+  // https://discourse.llvm.org/t/70086/6
+
+  using namespace ast_matchers;
+
+  std::string Code = R"cc(
+    void f(int &ref) {}
+  )cc";
+
+  auto Unit =
+      tooling::buildASTFromCodeWithArgs(Code, {"-fsyntax-only", "-std=c++11"});
+  auto &Context = Unit->getASTContext();
+
+  ASSERT_EQ(Context.getDiagnostics().getClient()->getNumErrors(), 0U);
+
+  const ValueDecl *Ref = findValueDecl(Context, "ref");
+
+  Environment Env1(DAContext);
+  StorageLocation &Loc1 = Env1.createStorageLocation(Context.IntTy);
+  Env1.setStorageLocation(*Ref, Loc1);
+
+  Environment Env2(DAContext);
+  StorageLocation &Loc2 = Env2.createStorageLocation(Context.IntTy);
+  Env2.setStorageLocation(*Ref, Loc2);
+
+  EXPECT_NE(&Loc1, &Loc2);
+
+  Environment::ValueModel Model;
+  Environment EnvJoined = Environment::join(Env1, Env2, Model);
+
+  // Joining environments with different storage locations for the same
+  // declaration results in the declaration being removed from the joined
+  // environment.
+  EXPECT_EQ(EnvJoined.getStorageLocation(*Ref), nullptr);
+}
+
 TEST_F(EnvironmentTest, InitGlobalVarsFun) {
   using namespace ast_matchers;
 
diff --git a/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp b/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
index 8da55953a32986..a4b47f25f36908 100644
--- a/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
+++ b/clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
@@ -3893,6 +3893,66 @@ TEST(TransferTest, BooleanInequality) {
       });
 }
 
+TEST(TransferTest, PointerEquality) {
+  std::string Code = R"(
+    void target() {
+      int i = 0;
+      int i_other = 0;
+      int *p1 = &i;
+      int *p2 = &i;
+      int *p_other = &i_other;
+      int *null = nullptr;
+
+      bool p1_eq_p1 = (p1 == p1);
+      bool p1_eq_p2 = (p1 == p2);
+      bool p1_eq_p_other = (p1 == p_other);
+
+      bool p1_eq_null = (p1 == null);
+      bool p1_eq_nullptr = (p1 == nullptr);
+      bool null_eq_nullptr = (null == nullptr);
+      bool nullptr_eq_nullptr = (nullptr == nullptr);
+
+      // We won't duplicate all of the tests above with `!=`, as we know that
+      // the implementation simply negates the result of the `==` comparison.
+      // Instaed, just spot-check one case.
+      bool p1_ne_p_other = (p1 != p_other);
+
+      (void)0; // [[p]]
+    }
+  )";
+  runDataflow(
+      Code,
+      [](const llvm::StringMap<DataflowAnalysisState<NoopLattice>> &Results,
+         ASTContext &ASTCtx) {
+        const Environment &Env = getEnvironmentAtAnnotation(Results, "p");
+
+        // Check the we have indeed set things up so that `p1` and `p2` have
+        // different pointer values.
+        EXPECT_NE(&getValueForDecl<PointerValue>(ASTCtx, Env, "p1"),
+                  &getValueForDecl<PointerValue>(ASTCtx, Env, "p2"));
+
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_p1"),
+                  &Env.getBoolLiteralValue(true));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_p2"),
+                  &Env.getBoolLiteralValue(true));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_p_other"),
+                  &Env.getBoolLiteralValue(false));
+
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_null"),
+                  &Env.getBoolLiteralValue(false));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_eq_nullptr"),
+                  &Env.getBoolLiteralValue(false));
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "null_eq_nullptr"),
+                  &Env.getBoolLiteralValue(true));
+        EXPECT_EQ(
+            &getValueForDecl<BoolValue>(ASTCtx, Env, "nullptr_eq_nullptr"),
+            &Env.getBoolLiteralValue(true));
+
+        EXPECT_EQ(&getValueForDecl<BoolValue>(ASTCtx, Env, "p1_ne_p_other"),
+                  &Env.getBoolLiteralValue(true));
+      });
+}
+
 TEST(TransferTest, IntegerLiteralEquality) {
   std::string Code = R"(
     void target() {
@@ -6315,48 +6375,4 @@ TEST(TransferTest, LambdaCaptureThis) {
       });
 }
 
-TEST(TransferTest, DifferentReferenceLocInJoin) {
-  // This test triggers a case where the storage location for a reference-type
-  // variable is different for two states being joined. We used to believe this
-  // could not happen and therefore had an assertion disallowing this; this test
-  // exists to demonstrate that we can handle this condition without a failing
-  // assertion. See also the discussion here:
-  // https://discourse.llvm.org/t/70086/6
-  std::string Code = R"(
-    namespace std {
-      template <class T> struct initializer_list {
-        const T* begin();
-        const T* end();
-      };
-    }
-
-    void target(char* p, char* end) {
-      while (p != end) {
-        if (*p == ' ') {
-          p++;
-          continue;
-        }
-
-        auto && range = {1, 2};
-        for (auto b = range.begin(), e = range.end(); b != e; ++b) {
-        }
-        (void)0;
-        // [[p]]
-      }
-    }
-  )";
-  runDataflow(
-      Code,
-      [](const llvm::StringMap<DataflowAnalysisState<NoopLattice>> &Results,
-         ASTContext &ASTCtx) {
-        const Environment &Env = getEnvironmentAtAnnotation(Results, "p");
-
-        // Joining environments with different storage locations for the same
-        // declaration results in the declaration being removed from the joined
-        // environment.
-        const ValueDecl *VD = findValueDecl(ASTCtx, "range");
-        ASSERT_EQ(Env.getStorageLocation(*VD), nullptr);
-      });
-}
-
 } // namespace

Copy link
Collaborator

@Xazax-hun Xazax-hun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a massive improvement! Thanks! Looks great to me, I have some not too important comments inline.

@@ -152,6 +153,47 @@ TEST_F(EnvironmentTest, JoinRecords) {
}
}

TEST_F(EnvironmentTest, DifferentReferenceLocInJoin) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While running analysis on C++ code is indirect, it has the advantage of documenting what code can trigger such behavior. I think having direct tests is fine, I just wanted to make this decision explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. The original test case was minimized from a failing assertion, but it would likely be fiddly to try and recreate a test case that triggers this again. Testing directly also has the benefit of being less brittle, so I think it's an acceptable tradeoff in this case.

@@ -3893,6 +3893,66 @@ TEST(TransferTest, BooleanInequality) {
});
}

TEST(TransferTest, PointerEquality) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great to add a test case with union members to document the limitation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point -- done.

@Xazax-hun
Copy link
Collaborator

An alternative approach would be to do hash consing. Since pointer values are completely determined by the pointee (for some definition of the pointee, e.g., we might have void* and int* pointers pointing to the same storage location), we could have a single representation for all of the equal pointer values and continue to rely on identity. This can also be beneficial for memory consumption, but we have to pay for that by making value creation a bit more expensive (and the pointer values needs to be immutable).

That being said, this would be a bit of a departure from the current architecture, so I think even if we want to do this, it should be a separate PR.

@martinboehme
Copy link
Contributor Author

An alternative approach would be to do hash consing. Since pointer values are completely determined by the pointee (for some definition of the pointee, e.g., we might have void* and int* pointers pointing to the same storage location), we could have a single representation for all of the equal pointer values and continue to rely on identity. This can also be beneficial for memory consumption, but we have to pay for that by making value creation a bit more expensive (and the pointer values needs to be immutable).

That being said, this would be a bit of a departure from the current architecture, so I think even if we want to do this, it should be a separate PR.

Thanks, these are good points.

I think we may want to revisit multiple aspects of pointer values and storage locations at some point. For example, storage locations always have a type, and so storage locations of different types necessarily compare different. For example:

struct Base {};
struct Derived : public Base {};
Derived d;
Derived *p_derived = &d;
Base *p_base = d;

// `p_derived == p_base` should hold, but the framework assumes `p_derived != p_base`.

It's probably a good thing for storage locations to have types (because we want to be able to model the dynamic type of an object), but this probably means that we want to model addresses separately from storage locations. This would then also help in other cases, such as union members and pointer interconvertibility. But all of this would be a major undertaking, I think.

@Xazax-hun
Copy link
Collaborator

Xazax-hun commented Dec 13, 2023

It's probably a good thing for storage locations to have types (because we want to be able to model the dynamic type of an object), but this probably means that we want to model addresses separately from storage locations.

Our knowledge of the dynamic type might change during the analysis. Consider the following snippet:

void f(Base* b)
{
  if (dynamic_cast<Derived*>(b)) {
    // Here, we know that the dynamic type is (a subclass of) Derived
  }
}

So, we either need to mutate storage locations, or we need to store the dynamic types on the side. The Clang Static Analyzer (CSA) is doing the latter. While CSA currently has types associated with the storage locations, we are considering undoing this decision and make storage locations only represented by sizes and offsets. We plan to always rely on the AST to determine the static types, and use a data structure on the side to track the dynamic type information.

For some frameworks having typed storage locations might be a good idea, it is not obvious to me where is the sweet spot and it will need careful consideration. (And as far as I understand, we are not even sure whether we want to switch to a different representation like access paths in the future.)

But all of this would be a major undertaking, I think.

Absolutely agree. The Clang Static Analyzer still has problems with type casts in some cases, it is a really challenging problem. That being said, the CSA is already immensely useful even with the partial modeling it has, so this is probably one of the less pressing problems to solve in the near/medium term.

@martinboehme
Copy link
Contributor Author

Heads up: I'm holding off on merging this patch because it makes some tests in the Crubit nullability analysis fail. I'm still investigating exactly why this is.

@martinboehme
Copy link
Contributor Author

Finally getting back to this.

The reason this was causing tests to fail is really pretty obvious in hindsight: Different StorageLocations may alias, so if the StorageLocations for two PointerValues are different, we can't assume that the pointers must compare unequal.

Here's an example:

void f(int *p1, int *p2) {
  if (p1 == p2) {
    do_something();
  }
}

The framework initializes p1 and p2 with different storage locations, so the current state of this PR causes the comparison p1 == p2 to be evaluated to a false literal, which in turn causes the do_something() block to be considered dead code. But this is nonsense: p1 and p2 may alias, so it's definitely possible that do_something() will be executed.

In summary, if the storage locations for two pointers are the same, we can return a true literal for the comparison, but if the storage locations are different, we need to return an atom.

I'll change the PR accordingly and will let you know once it's ready for another look.

-  Instead of comparing the identity of the `PointerValue`s, compare the
   underlying `StorageLocation`s.

-  If the `StorageLocation`s are the same, return a definite "true" as the
   result of the comparison. Before, if the `PointerValue`s were different, we
   would return an atom, even if the storage locations themselves were the same.

-  If the `StorageLocation`s are different, return an atom (as before). Pointers
   that have different storage locations may still alias, so we can't return a
   definite "false" in this case.

The application-level gains from this are relatively modest. For the Crubit
nullability check running on an internal codebase, this change reduces the
number of functions on which the SAT solver times out from 223 to 221; the
number of "pointer expression not modeled" errors reduces from 3815 to 3778.

Still, it seems that the gain in precision is generally worthwhile.

@Xazax-hun inspired me to think about this with his
[comments](llvm#73860 (review))
on a different PR.
@martinboehme
Copy link
Contributor Author

I'll change the PR accordingly and will let you know once it's ready for another look.

Sorry, this completely fell off my radar. I've finally revisited this now, and the PR is ready for review again. PTAL.

@Xazax-hun
Copy link
Collaborator

In summary, if the storage locations for two pointers are the same, we can return a true literal for the comparison, but if the storage locations are different, we need to return an atom.

I am wondering, if it would make sense to "fix up" the state when we discover aliasing? I.e., "merging" the two storage locations. Consider:

void f(MyBoolPair* a, MyBoolPair *b) {
  if (a == b && a->first && !b->second)
  {
    // Here, we should know: a->first == b->first == true and a->second == b->second == false
  } 
}

@martinboehme
Copy link
Contributor Author

In summary, if the storage locations for two pointers are the same, we can return a true literal for the comparison, but if the storage locations are different, we need to return an atom.

I am wondering, if it would make sense to "fix up" the state when we discover aliasing? I.e., "merging" the two storage locations. Consider:

void f(MyBoolPair* a, MyBoolPair *b) {
  if (a == b && a->first && !b->second)
  {
    // Here, we should know: a->first == b->first == true and a->second == b->second == false
  } 
}

This sounds great -- but I think getting this to work wouldn't be trivial. At least, I don't yet have any good ideas on how to make it work.

I think we'd have to do something like this:

  1. When we evaluate a == b, associate this expression with an atom, then remember that this atom means that the storage locations for the two pointers are the same.

  2. When we enter a new block, if we can prove from the flow condition that the atom mentioned in step 1 is true, remember that the storage locations for a and b must alias.

  3. When we evaluate b->first -- I guess we iterate over all of the storage locations that must alias with the storage location for b, and if, for at least one of them, we can prove that b->first is true, we know it must be true? (I'm not super-conversant in alias analysis -- would have to read up more about this.)

Would require some more thought and definitely quite a bit of effort to implement, I think.

@martinboehme
Copy link
Contributor Author

CI failure (DataFlowSanitizer-x86_64 :: release_shadow_space.c) looks unrelated.

@martinboehme martinboehme merged commit f3fbd21 into llvm:main May 7, 2024
3 of 4 checks passed
copybara-service bot pushed a commit to google/crubit that referenced this pull request May 21, 2024
See the TODOs that have now been fixed. The dataflow framework now considers
pointers with the same storage location to compare equal (see upstream
[#75170](llvm/llvm-project#75170), and this allows us to
directly compare pointers in the tests.

PiperOrigin-RevId: 635791000
Change-Id: I1351d76eee76668b2a3be0efbce829c8c65874ee
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:analysis clang:dataflow Clang Dataflow Analysis framework - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants