Skip to content

Conversation

@jkorous-apple
Copy link
Owner

Introduce EntityId and EntityIdTable to provide efficient, lightweight handles for working with EntityNames in the Scalable Static Analysis Framework (SSAF).

Introduces two key components:

  • EntityId: Lightweight opaque handle representing an entity in an EntityIdTable
  • EntityIdTable: Entity name interning table that maps unique EntityNames to EntityIds

The interning table ensures each EntityName maps to exactly one EntityId, providing fast equality comparisons and lookups. EntityIds are index-based and remain stable for the lifetime of their table. This enables efficient entity tracking and comparison operations needed for whole-program analysis.

Comment on lines +27 to +28
/// EntityIds are created by EntityIdTable. Equality and ordering comparisons
/// are well-defined for EntityIds created by the same EntityIdTable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, Entities also have their ordering. If I recall they are based of USR, Suffix, Namespace.
My point is that if I have EntityName A and B, and they are ordered A < B.
Does that also imply that EntityId of A and B are also ordered the same way?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't consider this. Suppose the ordering is different, do you see that as a problem?

Comment on lines +25 to +28
class EntityIdTable {
std::set<EntityName> Entities;

std::vector<const EntityName*> IdToEntity;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table owns these entities in the Entities std::set, which is a stable associative container.
Could we just hand out pointers to the elements instead of keeping a separate index -> pointer mapping in the form of IdToEntity vector?

This is an interesting question because an EntityId is basically as large as a pointer (due to the size_t member). Pointers also have natural EQ and LT operators, and their semantics are clear.

Copy link
Owner Author

@jkorous-apple jkorous-apple Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to not surface const EntityName* to the users as that can be misleading and it's hard to communicate that only entities from a given table can be compared against each other. But to avoid that createEntityId could just cast the pointer to size_t.

Another issue is deserialization. With the current design, if we take serialized EntityTable and deserialize entities in the same order they ware added, we keep the IDs valid and won't have to update references to EntityId-s in the summaries.

Comment on lines 46 to 53
EXPECT_TRUE(Id1 < Id2 || Id2 < Id1);
EXPECT_TRUE(Id1 < Id3 || Id3 < Id1);
EXPECT_TRUE(Id2 < Id3 || Id3 < Id2);

// Transitivity: if a < b and b < c, then a < c
if (Id1 < Id2 && Id2 < Id3) {
EXPECT_TRUE(Id1 < Id3);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If feels arbitrary that we didn't want to express a specific ordering among Id1, Id2, Id3 - (except that they are inequality to each other), and here we suddenly only check transitivity if a very specific ordering applies, such as: Id1 < Id2 < Id3.

I'd suggest either pin a specific ordering in both cases or keep it relaxed too.

Comment on lines +36 to +42
void EntityIdTable::forEach(std::function<void(const EntityName&, EntityId)> Callback) const {
for (size_t Index = 0; Index < IdToEntity.size(); ++Index) {
EntityId EId(Index);
const EntityName& Name = *IdToEntity[Index];
Callback(Name, EId);
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we had a test for this API, while the rest was thoroughly tested.

jkorous-apple pushed a commit that referenced this pull request Dec 8, 2025
… errors (llvm#169989)

We can see the following while running clang-repl in C mode 
```
anutosh491@vv-nuc:/build/anutosh491/llvm-project/build/bin$ ./clang-repl --Xcc=-x --Xcc=c --Xcc=-std=c23
clang-repl> printf("hi\n");
In file included from <<< inputs >>>:1:
input_line_1:1:1: error: call to undeclared library function 'printf' with type 'int (const char *, ...)'; ISO C99 and
      later do not support implicit function declarations [-Wimplicit-function-declaration]
    1 | printf("hi\n");
      | ^
input_line_1:1:1: note: include the header <stdio.h> or explicitly provide a declaration for 'printf'
error: Parsing failed.
clang-repl> #include <stdio.h>
hi
```

In debug mode while dumping the generated Module, i see this 
```
clang-repl> printf("hi\n");
In file included from <<< inputs >>>:1:
input_line_1:1:1: error: call to undeclared library function 'printf' with type 'int (const char *, ...)'; ISO C99 and
      later do not support implicit function declarations [-Wimplicit-function-declaration]
    1 | printf("hi\n");
      | ^
input_line_1:1:1: note: include the header <stdio.h> or explicitly provide a declaration for 'printf'
error: Parsing failed.
clang-repl> #include <stdio.h>

=== compile-ptu 1 ===
[TU=0x55556cfbf830, M=0x55556cfc13a0 (incr_module_1)]
[LLVM IR]
; ModuleID = 'incr_module_1'
source_filename = "incr_module_1"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@.str = private unnamed_addr constant [4 x i8] c"hi\0A\00", align 1
@llvm.global_ctors = appending global [1 x { i32, ptr, ptr }] [{ i32, ptr, ptr } { i32 65535, ptr @_GLOBAL__sub_I_incr_module_1, ptr null }]

define internal void @__stmts__0() #0 {
entry:
  %call = call i32 (ptr, ...) @printf(ptr noundef @.str)
  ret void
}

declare i32 @printf(ptr noundef, ...) #1

; Function Attrs: noinline nounwind uwtable
define internal void @_GLOBAL__sub_I_incr_module_1() swiftlang#2 section ".text.startup" {
entry:
  call void @__stmts__0()
  ret void
}

attributes #0 = { "min-legal-vector-width"="0" }
attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes swiftlang#2 = { noinline nounwind uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 8, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"clang version 22.0.0git (https://github.com/anutosh491/llvm-project.git 81ad8fb)"}
=== end compile-ptu ===

execute-ptu 1: [TU=0x55556cfbf830, M=0x55556cfc13a0 (incr_module_1)]
hi
```

Basically I see that CodeGen emits IR for a cell before we know whether
DiagnosticsEngine has an error. For C code like `printf("hi\n");`
without <stdio.h>, Sema emits a diagnostic but still produces a
"codegen-able" `TopLevelStmt`, so the `printf` call is IR-generated into
the current module.

Previously, when `Diags.hasErrorOccurred()` was true, we only cleaned up
the PTU AST and left the CodeGen module untouched. The next successful
cell then called `GenModule()`, which returned that same module (now
also containing the next cell’s IR), causing side effects from the
failed cell (e.g. printf)
…dling

Introduce EntityId and EntityIdTable to provide efficient, lightweight handles
for working with EntityNames in the Scalable Static Analysis Framework (SSAF).

Introduces two key components:
- EntityId: Lightweight opaque handle representing an entity in an EntityIdTable
- EntityIdTable: Entity name interning table that maps unique EntityNames to EntityIds

The interning table ensures each EntityName maps to exactly one EntityId,
providing fast equality comparisons and lookups. EntityIds are index-based
and remain stable for the lifetime of their table. This enables efficient
entity tracking and comparison operations needed for whole-program analysis.
@jkorous-apple
Copy link
Owner Author

I didn't realize I can't retarget this PR to the upstream repository.
Let's continue here:
llvm#171660

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants