Skip to content

[analyzer] Normalize sub-array indices in RegionStore initializer res…#200044

Open
gamesh411 wants to merge 3 commits into
llvm:mainfrom
gamesh411:fix/regionstore-normalize-subarray-indices
Open

[analyzer] Normalize sub-array indices in RegionStore initializer res…#200044
gamesh411 wants to merge 3 commits into
llvm:mainfrom
gamesh411:fix/regionstore-normalize-subarray-indices

Conversation

@gamesh411
Copy link
Copy Markdown
Contributor

@gamesh411 gamesh411 commented May 27, 2026

…olution

After #198346, alpha.unix.cstring.UninitializedRead reports a false positive when a pointer into a fully-initialized const multidimensional array is advanced past an inner dimension boundary and used as a source argument to memcpy. The root cause is in convertOffsetsFromSvalToUnsigneds in RegionStore, which returned UndefinedVal for any element index exceeding its sub-array extent, conflating pointer arithmetic legality with memory initializedness.

This patch separates the two concerns. The RegionStore now normalizes indices that overflow an inner dimension by carrying into the outer dimension via divmod, the same way arr[0][5] in int arr[4][3] denotes the same memory as arr[1][2]. UndefinedVal is returned only when the computed flat offset exceeds the total array allocation. Whether cross-subobject pointer arithmetic constitutes undefined behavior per C/C++ standards is a separate concern for individual checkers to diagnose. No existing checker flags sub-array boundary crossing as UB, verified both before and after #198346.

Fixes #199271

…olution

After llvm#198346, alpha.unix.cstring.UninitializedRead reports a false
positive when a pointer into a fully-initialized const multidimensional
array is advanced past an inner dimension boundary and used as a source
argument to memcpy. The root cause is in convertOffsetsFromSvalToUnsigneds
in RegionStore, which returned UndefinedVal for any element index exceeding
its sub-array extent, conflating pointer arithmetic legality with memory
initializedness.

This patch separates the two concerns. The RegionStore now normalizes
indices that overflow an inner dimension by carrying into the outer
dimension via divmod, the same way arr[0][5] in int arr[4][3] denotes
the same memory as arr[1][2]. UndefinedVal is returned only when the
computed flat offset exceeds the total array allocation. Whether
cross-subobject pointer arithmetic constitutes undefined behavior per
C/C++ standards is a separate concern for individual checkers to
diagnose. No existing checker flags sub-array boundary crossing as UB,
verified both before and after llvm#198346.

Fixes llvm#199271
@llvmorg-github-actions
Copy link
Copy Markdown

llvmorg-github-actions Bot commented May 28, 2026

@llvm/pr-subscribers-clang-static-analyzer-1

@llvm/pr-subscribers-clang

Author: Endre Fülöp (gamesh411)

Changes

…olution

After #198346, alpha.unix.cstring.UninitializedRead reports a false positive when a pointer into a fully-initialized const multidimensional array is advanced past an inner dimension boundary and used as a source argument to memcpy. The root cause is in convertOffsetsFromSvalToUnsigneds in RegionStore, which returned UndefinedVal for any element index exceeding its sub-array extent, conflating pointer arithmetic legality with memory initializedness.

This patch separates the two concerns. The RegionStore now normalizes indices that overflow an inner dimension by carrying into the outer dimension via divmod, the same way arr[0][5] in int arr[4][3] denotes the same memory as arr[1][2]. UndefinedVal is returned only when the computed flat offset exceeds the total array allocation. Whether cross-subobject pointer arithmetic constitutes undefined behavior per C/C++ standards is a separate concern for individual checkers to diagnose. No existing checker flags sub-array boundary crossing as UB, verified both before and after #198346.

Fixes #199271


Full diff: https://github.com/llvm/llvm-project/pull/200044.diff

4 Files Affected:

  • (modified) clang/lib/StaticAnalyzer/Core/RegionStore.cpp (+39-48)
  • (modified) clang/test/Analysis/bstring_UninitRead.c (+7)
  • (modified) clang/test/Analysis/initialization.c (+2-2)
  • (modified) clang/test/Analysis/initialization.cpp (+86-3)
diff --git a/clang/lib/StaticAnalyzer/Core/RegionStore.cpp b/clang/lib/StaticAnalyzer/Core/RegionStore.cpp
index 6a426331c93c3..7f747b0b1eceb 100644
--- a/clang/lib/StaticAnalyzer/Core/RegionStore.cpp
+++ b/clang/lib/StaticAnalyzer/Core/RegionStore.cpp
@@ -1804,65 +1804,56 @@ getElementRegionOffsetsWithBase(const ElementRegion *ER) {
 
 /// This is a helper function for `getConstantValFromConstArrayInitializer`.
 ///
-/// Convert array of offsets from `SVal` to `uint64_t` in consideration of
-/// respective array extents.
-/// \param SrcOffsets [in]   The array of offsets of type `SVal` in reversed
-///   order (expectedly received from `getElementRegionOffsetsWithBase`).
-/// \param ArrayExtents [in] The array of extents.
-/// \param DstOffsets [out]  The array of offsets of type `uint64_t`.
+/// Flatten per-dimension SVal offsets into a linear index, bounds-check
+/// against the total allocation, and decompose back into per-dimension
+/// uint64_t indices.
+///
+/// \param SrcOffsets [in]   Per-dimension offsets in reversed order
+///   (as received from `getElementRegionOffsetsWithBase`).
+/// \param ArrayExtents [in] Extents per dimension (outer to inner).
+/// \param DstOffsets [out]  Normalized per-dimension indices.
 /// \returns:
-/// - `std::nullopt` for successful convertion.
-/// - `UndefinedVal` or `UnknownVal` otherwise. It's expected that this SVal
-///   will be returned as a suitable value of the access operation.
-///   which should be returned as a correct
+/// - `std::nullopt` on success.
+/// - `UndefinedVal` if the flat offset is out of bounds.
+/// - `UnknownVal` if any index is symbolic.
 ///
 /// \example:
 ///   const int arr[10][20][30] = {}; // ArrayExtents { 10, 20, 30 }
-///   int x1 = arr[4][5][6]; // SrcOffsets { NonLoc(6), NonLoc(5), NonLoc(4) }
-///                          // DstOffsets { 4, 5, 6 }
-///                          // returns std::nullopt
-///   int x2 = arr[42][5][-6]; // returns UndefinedVal
-///   int x3 = arr[4][5][x2];  // returns UnknownVal
+///   int x1 = arr[4][5][6]; // DstOffsets { 4, 5, 6 }, returns std::nullopt
+///   int x2 = arr[0][0][35]; // DstOffsets { 0, 1, 5 }, returns std::nullopt
+///   int x3 = arr[0][25][-5]; // DstOffsets { 1, 4, 25 }, returns std::nullopt
+///   int x4 = arr[10][0][0]; // returns UndefinedVal (flat offset >= total)
+///   int x5 = arr[4][5][x1]; // returns UnknownVal
 static std::optional<SVal>
 convertOffsetsFromSvalToUnsigneds(const SmallVector<SVal, 2> &SrcOffsets,
                                   const SmallVector<uint64_t, 2> ArrayExtents,
                                   SmallVector<uint64_t, 2> &DstOffsets) {
-  // Check offsets for being out of bounds.
-  // C++20 [expr.add] 7.6.6.4 (excerpt):
-  //   If P points to an array element i of an array object x with n
-  //   elements, where i < 0 or i > n, the behavior is undefined.
-  //   Dereferencing is not allowed on the "one past the last
-  //   element", when i == n.
-  // Example:
-  //  const int arr[3][2] = {{1, 2}, {3, 4}};
-  //  arr[0][0];  // 1
-  //  arr[0][1];  // 2
-  //  arr[0][2];  // UB
-  //  arr[1][0];  // 3
-  //  arr[1][1];  // 4
-  //  arr[1][-1]; // UB
-  //  arr[2][0];  // 0
-  //  arr[2][1];  // 0
-  //  arr[-2][0]; // UB
-  DstOffsets.resize(SrcOffsets.size());
+  // Flatten to a linear offset so that both positive overflow and negative
+  // indices across sub-array boundaries resolve to the correct element.
+  int64_t FlatOffset = 0;
   auto ExtentIt = ArrayExtents.begin();
-  auto OffsetIt = DstOffsets.begin();
-  // Reverse `SValOffsets` to make it consistent with `ArrayExtents`.
   for (SVal V : llvm::reverse(SrcOffsets)) {
-    if (auto CI = V.getAs<nonloc::ConcreteInt>()) {
-      // When offset is out of array's bounds, result is UB.
-      const llvm::APSInt &Offset = CI->getValue();
-      if (Offset.isNegative() || Offset.uge(*(ExtentIt++)))
-        return UndefinedVal();
-      // Store index in a reversive order.
-      *(OffsetIt++) = Offset.getZExtValue();
-      continue;
-    }
-    // Symbolic index presented. Return Unknown value.
-    // FIXME: We also need to take ElementRegions with symbolic indexes into
-    // account.
-    return UnknownVal();
+    auto CI = V.getAs<nonloc::ConcreteInt>();
+    if (!CI)
+      return UnknownVal();
+    FlatOffset = FlatOffset * static_cast<int64_t>(*(ExtentIt++)) +
+                 CI->getValue()->getExtValue();
   }
+
+  int64_t TotalSize = 1;
+  for (uint64_t E : ArrayExtents)
+    TotalSize *= static_cast<int64_t>(E);
+
+  if (FlatOffset < 0 || FlatOffset >= TotalSize)
+    return UndefinedVal();
+
+  DstOffsets.resize(ArrayExtents.size());
+  uint64_t Remaining = static_cast<uint64_t>(FlatOffset);
+  for (int I = DstOffsets.size() - 1; I >= 0; --I) {
+    DstOffsets[I] = Remaining % ArrayExtents[I];
+    Remaining /= ArrayExtents[I];
+  }
+
   return std::nullopt;
 }
 
diff --git a/clang/test/Analysis/bstring_UninitRead.c b/clang/test/Analysis/bstring_UninitRead.c
index 8c79f73569015..25ff4d82f441b 100644
--- a/clang/test/Analysis/bstring_UninitRead.c
+++ b/clang/test/Analysis/bstring_UninitRead.c
@@ -152,3 +152,10 @@ void ff_mov_lang_to_iso639() {
   memcpy(ff_mov_lang_to_iso639_to,
          mov_mdhd_language_map[ff_mov_lang_to_iso639_code], 4);
 }
+
+void memcpy_multidim_crossing_subarray(void *dst) {
+  const long b[4][1200] = {};
+  const long *f = &b[0][0];
+  f = f + 2001; // crosses into b[1][801]
+  memcpy(dst, f, sizeof(long)); // no-warning
+}
diff --git a/clang/test/Analysis/initialization.c b/clang/test/Analysis/initialization.c
index 32bdf733bcb98..fb75fb9739759 100644
--- a/clang/test/Analysis/initialization.c
+++ b/clang/test/Analysis/initialization.c
@@ -73,10 +73,10 @@ void glob_arr_index3(void) {
 
 void negative_index(void) {
   int x = 2, y = -2;
-  clang_analyzer_eval(glob_arr2[x][y] == 5); // expected-warning{{UNDEFINED}}
+  clang_analyzer_eval(glob_arr2[x][y] == 5); // expected-warning{{TRUE}}
   x = 3;
   y = -3;
-  clang_analyzer_eval(glob_arr2[x][y] == 7); // expected-warning{{UNDEFINED}}
+  clang_analyzer_eval(glob_arr2[x][y] == 7); // expected-warning{{TRUE}}
 }
 
 void glob_invalid_index3(void) {
diff --git a/clang/test/Analysis/initialization.cpp b/clang/test/Analysis/initialization.cpp
index d064dd6172d63..100cd89431e94 100644
--- a/clang/test/Analysis/initialization.cpp
+++ b/clang/test/Analysis/initialization.cpp
@@ -87,9 +87,9 @@ void glob_ptr_index2() {
   int const *ptr = glob_arr5[1];
   clang_analyzer_eval(ptr[0] == 3); // expected-warning{{TRUE}}
   clang_analyzer_eval(ptr[1] == 4); // expected-warning{{TRUE}}
-  clang_analyzer_eval(ptr[2] == 5); // expected-warning{{UNDEFINED}}
-  clang_analyzer_eval(ptr[3] == 0); // expected-warning{{UNDEFINED}}
-  clang_analyzer_eval(ptr[4] == 0); // expected-warning{{UNDEFINED}}
+  clang_analyzer_eval(ptr[2] == 5); // expected-warning{{TRUE}}
+  clang_analyzer_eval(ptr[3] == 0); // expected-warning{{TRUE}}
+  clang_analyzer_eval(ptr[4] == 0); // expected-warning{{TRUE}}
 }
 
 void glob_invalid_index5() {
@@ -256,3 +256,86 @@ void initlistWithinInitlist() {
   // no-crash
   clang_analyzer_dump(glob[0]); // expected-warning-re {{reg_${{[0-9]+}}<enum E Element{glob,0 S64b,enum E}>}}
 }
+
+const int glob_arr_cross[4][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
+void glob_array_cross_subarray_boundary() {
+  const int *p = &glob_arr_cross[0][0];
+  clang_analyzer_eval(p[4] == 5);   // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[7] == 8);   // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[11] == 12); // expected-warning{{TRUE}}
+}
+
+const int glob_arr_sparse[4][3] = {{1, 2}, {0, 0, 7}};
+void glob_array_sparse_cross_boundary() {
+  const int *p = &glob_arr_sparse[0][0];
+  clang_analyzer_eval(p[0] == 1); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[1] == 2); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[2] == 0); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[3] == 0); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[5] == 7); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[6] == 0); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[11] == 0); // expected-warning{{TRUE}}
+}
+
+const int glob_arr_3d[2][3][4] = {
+    {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}},
+    {{13, 14, 15, 16}, {17, 18, 19, 20}, {21, 22, 23, 24}}};
+void glob_array_3d_cross_boundary() {
+  const int *p = &glob_arr_3d[0][0][0];
+  clang_analyzer_eval(p[0] == 1);   // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[5] == 6);   // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[11] == 12); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[12] == 13); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[23] == 24); // expected-warning{{TRUE}}
+}
+
+// Limitation: type-punned access through incompatible pointer type.
+const int glob_arr_pun[2][3] = {{1, 2, 3}, {4, 5, 6}};
+void glob_array_type_pun_unresolved() {
+  const unsigned *p = (const unsigned *)&glob_arr_pun[0][0];
+  clang_analyzer_dump(p[0]); // expected-warning-re{{reg_${{[0-9]+}}<unsigned int Element{glob_arr_pun,0 S64b,unsigned int}>}}
+}
+
+const int glob_arr_recast[4][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
+void glob_array_recast_resolves_against_declared_type() {
+  const int (*q)[6] = (const int (*)[6])&glob_arr_recast[0][0];
+  clang_analyzer_eval(q[0][4] == 5); // expected-warning{{TRUE}}
+  clang_analyzer_eval(q[1][0] == 4); // expected-warning{{TRUE}}
+}
+
+void glob_array_write_then_read_cross_boundary() {
+  int arr[4][3];
+  arr[0][0] = 10;
+  arr[1][1] = 42;
+  arr[2][2] = 99;
+  arr[3][2] = 77;
+
+  int *p = &arr[0][0];
+  clang_analyzer_eval(p[0] == 10); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[4] == 42); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[8] == 99); // expected-warning{{TRUE}}
+  clang_analyzer_eval(p[11] == 77); // expected-warning{{TRUE}}
+}
+
+void glob_array_write_then_read_before_start() {
+  int arr[4][3];
+  arr[0][0] = 10;
+  int *p = &arr[0][0];
+  int x = p[-1]; // expected-warning{{Assigned value is uninitialized}}
+  (void)x;
+}
+
+// Negative inner index from a non-zero outer base lands within the allocation.
+const int glob_arr_neg[4][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
+void glob_array_negative_inner_index() {
+  const int *p = &glob_arr_neg[2][0];
+  clang_analyzer_eval(p[-1] == 6); // expected-warning{{TRUE}}
+}
+
+void glob_array_write_then_read_past_end() {
+  int arr[4][3];
+  arr[3][2] = 77;
+  int *p = &arr[0][0];
+  int x = p[12]; // expected-warning{{Assigned value is uninitialized}}
+  (void)x;
+}

@gamesh411
Copy link
Copy Markdown
Contributor Author

My idea is to move the encoding of UB thanks to cross-subobject indexing out from the RegionStore and make it a concern of the checkers (and anyway, the encoding of this information is not exploited as of now, as the cross-subobject UB is not reported in any way before or after this change, so we are not losing anything by removing this responsibility/capability from the RegionStore).

Reasons for putting this change in the RegionStore and not in the checker code itself:

  1. While technically UB, it seems to be a widely used pattern in real-world C codebases to depend on the flat layout of a multidimensional declaration being there and available.
  2. It would be beneficial to model memory in the RegionStore without the added complication of determining whether a particular way of accessing that region is allowed by the language.

Copy link
Copy Markdown
Contributor

@NagyDonat NagyDonat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, my only remark is about comment clarity.

The root cause is in convertOffsetsFromSvalToUnsigneds in RegionStore, which returned UndefinedVal for any element index exceeding its sub-array extent, conflating pointer arithmetic legality with memory initializedness.

I completely agree with this diagnosis.

Whether cross-subobject pointer arithmetic constitutes undefined behavior per C/C++ standards is a separate concern for individual checkers to diagnose. No existing checker flags sub-array boundary crossing as UB, verified both before and after #198346.

In fact I'm planning to change the behavior of security.ArrayBound to let it report situations like referencing arr[1][5] in an array declared as int arr[4][3] -- because it is technically UB and the array bound checker should be able to report this.

However if the array bound checker does not report it directly (because it is disabled or the situation is more complex and some heuristic silences the warning) then the RegionStore should not report it indirectly (by returning an Undefined value which can trigger reports from various "use of undefined value" checkers).


More generally, I think UndefinedVal should be reserved for representing the contents of memory that should not be read (e.g uninitialized local, area returned by malloc, perhaps a moved-from object etc.) and we should never use it as the result of operations that trigger undefined behavior (e.g. out-of-bounds access, bad bitwise shift, zero division).

When the analyzer thinks that an undefined operation happens (on a certain execution path), it should immediately produce a bug report (which is focused on the root cause, can be cleanly suppressed or disabled etc.) instead of returning UndefinedVal (which will probably trigger some use-of-undefined checker later, and can be hard to understand or suppress, but may as well go unnoticed).

Returning UndefinedVal as a "second line of defense" is a very counter-productive approach: if the execution path wasn't terminated by an error outright, then the analyzer should admit that it doesn't understand the situation and use UnknownVal.

Comment thread clang/lib/StaticAnalyzer/Core/RegionStore.cpp
@earnol
Copy link
Copy Markdown
Contributor

earnol commented May 28, 2026

I would like to suggest to add multidimensional test into lit list:

const long b[1][4][1200][1] = {0};
void *memcpy(void *, const void *, unsigned long);
extern void *dst[];
void e() {
  const long *f = &b[0][0][0][0];
  for (unsigned long i = 0; i < 2; i++) {
    memcpy(dst[i], f, sizeof(long));
    f = f + 2001;
  }
}

To test situation of skip across multiple dimensions.

@steakhal
Copy link
Copy Markdown
Contributor

What releases are affected?
Should we backport anything?

@gamesh411
Copy link
Copy Markdown
Contributor Author

gamesh411 commented May 28, 2026

What releases are affected? Should we backport anything?

If you mean #198346, AFAIK, it is not part of any release yet, as it was merged last week. (unless I'm unaware of a branch off that happened for the next release). I have checked that LLVM 22.1.6 does not contain it.

@earnol
Copy link
Copy Markdown
Contributor

earnol commented May 29, 2026

LGTM as of now.

Copy link
Copy Markdown
Contributor

@NagyDonat NagyDonat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clang:static analyzer clang Clang issues not falling into any other category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[analyzer] False positive alpha.unix.cstring.UninitializedRead on pointer arithmetic across sub-array boundary

4 participants