-
Notifications
You must be signed in to change notification settings - Fork 0
[clang][ssaf] Add EntityId and EntityIdTable for efficient entity handling #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /// EntityIds are created by EntityIdTable. Equality and ordering comparisons | ||
| /// are well-defined for EntityIds created by the same EntityIdTable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, Entities also have their ordering. If I recall they are based of USR, Suffix, Namespace.
My point is that if I have EntityName A and B, and they are ordered A < B.
Does that also imply that EntityId of A and B are also ordered the same way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't consider this. Suppose the ordering is different, do you see that as a problem?
| class EntityIdTable { | ||
| std::set<EntityName> Entities; | ||
|
|
||
| std::vector<const EntityName*> IdToEntity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table owns these entities in the Entities std::set, which is a stable associative container.
Could we just hand out pointers to the elements instead of keeping a separate index -> pointer mapping in the form of IdToEntity vector?
This is an interesting question because an EntityId is basically as large as a pointer (due to the size_t member). Pointers also have natural EQ and LT operators, and their semantics are clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to not surface const EntityName* to the users as that can be misleading and it's hard to communicate that only entities from a given table can be compared against each other. But to avoid that createEntityId could just cast the pointer to size_t.
Another issue is deserialization. With the current design, if we take serialized EntityTable and deserialize entities in the same order they ware added, we keep the IDs valid and won't have to update references to EntityId-s in the summaries.
| EXPECT_TRUE(Id1 < Id2 || Id2 < Id1); | ||
| EXPECT_TRUE(Id1 < Id3 || Id3 < Id1); | ||
| EXPECT_TRUE(Id2 < Id3 || Id3 < Id2); | ||
|
|
||
| // Transitivity: if a < b and b < c, then a < c | ||
| if (Id1 < Id2 && Id2 < Id3) { | ||
| EXPECT_TRUE(Id1 < Id3); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If feels arbitrary that we didn't want to express a specific ordering among Id1, Id2, Id3 - (except that they are inequality to each other), and here we suddenly only check transitivity if a very specific ordering applies, such as: Id1 < Id2 < Id3.
I'd suggest either pin a specific ordering in both cases or keep it relaxed too.
| void EntityIdTable::forEach(std::function<void(const EntityName&, EntityId)> Callback) const { | ||
| for (size_t Index = 0; Index < IdToEntity.size(); ++Index) { | ||
| EntityId EId(Index); | ||
| const EntityName& Name = *IdToEntity[Index]; | ||
| Callback(Name, EId); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we had a test for this API, while the rest was thoroughly tested.
… errors (llvm#169989) We can see the following while running clang-repl in C mode ``` anutosh491@vv-nuc:/build/anutosh491/llvm-project/build/bin$ ./clang-repl --Xcc=-x --Xcc=c --Xcc=-std=c23 clang-repl> printf("hi\n"); In file included from <<< inputs >>>:1: input_line_1:1:1: error: call to undeclared library function 'printf' with type 'int (const char *, ...)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 1 | printf("hi\n"); | ^ input_line_1:1:1: note: include the header <stdio.h> or explicitly provide a declaration for 'printf' error: Parsing failed. clang-repl> #include <stdio.h> hi ``` In debug mode while dumping the generated Module, i see this ``` clang-repl> printf("hi\n"); In file included from <<< inputs >>>:1: input_line_1:1:1: error: call to undeclared library function 'printf' with type 'int (const char *, ...)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 1 | printf("hi\n"); | ^ input_line_1:1:1: note: include the header <stdio.h> or explicitly provide a declaration for 'printf' error: Parsing failed. clang-repl> #include <stdio.h> === compile-ptu 1 === [TU=0x55556cfbf830, M=0x55556cfc13a0 (incr_module_1)] [LLVM IR] ; ModuleID = 'incr_module_1' source_filename = "incr_module_1" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" @.str = private unnamed_addr constant [4 x i8] c"hi\0A\00", align 1 @llvm.global_ctors = appending global [1 x { i32, ptr, ptr }] [{ i32, ptr, ptr } { i32 65535, ptr @_GLOBAL__sub_I_incr_module_1, ptr null }] define internal void @__stmts__0() #0 { entry: %call = call i32 (ptr, ...) @printf(ptr noundef @.str) ret void } declare i32 @printf(ptr noundef, ...) #1 ; Function Attrs: noinline nounwind uwtable define internal void @_GLOBAL__sub_I_incr_module_1() swiftlang#2 section ".text.startup" { entry: call void @__stmts__0() ret void } attributes #0 = { "min-legal-vector-width"="0" } attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" } attributes swiftlang#2 = { noinline nounwind uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" } !llvm.module.flags = !{!0, !1, !2, !3, !4} !llvm.ident = !{!5} !0 = !{i32 1, !"wchar_size", i32 4} !1 = !{i32 8, !"PIC Level", i32 2} !2 = !{i32 7, !"PIE Level", i32 2} !3 = !{i32 7, !"uwtable", i32 2} !4 = !{i32 7, !"frame-pointer", i32 2} !5 = !{!"clang version 22.0.0git (https://github.com/anutosh491/llvm-project.git 81ad8fb)"} === end compile-ptu === execute-ptu 1: [TU=0x55556cfbf830, M=0x55556cfc13a0 (incr_module_1)] hi ``` Basically I see that CodeGen emits IR for a cell before we know whether DiagnosticsEngine has an error. For C code like `printf("hi\n");` without <stdio.h>, Sema emits a diagnostic but still produces a "codegen-able" `TopLevelStmt`, so the `printf` call is IR-generated into the current module. Previously, when `Diags.hasErrorOccurred()` was true, we only cleaned up the PTU AST and left the CodeGen module untouched. The next successful cell then called `GenModule()`, which returned that same module (now also containing the next cell’s IR), causing side effects from the failed cell (e.g. printf)
…dling Introduce EntityId and EntityIdTable to provide efficient, lightweight handles for working with EntityNames in the Scalable Static Analysis Framework (SSAF). Introduces two key components: - EntityId: Lightweight opaque handle representing an entity in an EntityIdTable - EntityIdTable: Entity name interning table that maps unique EntityNames to EntityIds The interning table ensures each EntityName maps to exactly one EntityId, providing fast equality comparisons and lookups. EntityIds are index-based and remain stable for the lifetime of their table. This enables efficient entity tracking and comparison operations needed for whole-program analysis.
|
I didn't realize I can't retarget this PR to the upstream repository. |
Introduce EntityId and EntityIdTable to provide efficient, lightweight handles for working with EntityNames in the Scalable Static Analysis Framework (SSAF).
Introduces two key components:
The interning table ensures each EntityName maps to exactly one EntityId, providing fast equality comparisons and lookups. EntityIds are index-based and remain stable for the lifetime of their table. This enables efficient entity tracking and comparison operations needed for whole-program analysis.