Commit 6d8714b

[Clang][CIR][Doc] Document CIR code duplication plans (#166457)
This adds a document describing known problems with code duplication in the CIR codegen implementation, strategies to mitigate the risks caused by that code duplication, and a general long-term plan for minimizing the problem.
1 parent ee77c58 commit 6d8714b

2 files changed: +246 -1 lines changed

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
================================
ClangIR Code Duplication Roadmap
================================

.. contents::
   :local:

Introduction
============

This document describes the general approach to code duplication in the ClangIR
code generation implementation. It acknowledges specific problems with the
current implementation, discusses strategies for mitigating the risk inherent in
the current approach, and describes a general long-term plan for addressing the
issue.

Background
==========

The ClangIR code generation is very closely modeled after Clang's LLVM IR code
generation, and we intend for the CIR produced to eventually be semantically
equivalent to the LLVM IR produced when not going through ClangIR. However, we
acknowledge that as the ClangIR implementation is under development, there will
be differences in semantics, both because we have not yet implemented all
features of the classic codegen and because the CIR dialect is still evolving
and does not yet have a way to represent all of the necessary semantics.

We have chosen to model the ClangIR code generation directly after the classic
codegen, to the point of following identical code structure, using similar
names, and often duplicating the logic, because this seemed to be the most
certain path to producing equivalent results. Having such nearly identical code
allows for direct comparison between the CIR codegen and the LLVM IR codegen to
find what is missing or incorrect in the CIR implementation.

However, we recognize that this is not a sustainable permanent solution. As
bugs are fixed and new features are added to the classic codegen, keeping the
analogous CIR code up to date will be a purely manual process.

Long term, we need a more sustainable approach.

Current Strategy
================

Practical considerations require that we make steady progress towards a working
implementation of ClangIR. This necessity is directly opposed to the goal of
minimizing code duplication.

For this reason, we have decided to accept a large amount of code duplication
in the short term, even with the explicit understanding that this is producing
a significant amount of technical debt as the project progresses.

As the CIR implementation is developed, we often note small pieces of code that
could be shared with the classic codegen if they were moved to a different part
of the source, such as a shared utility class in some directory available to
both codegen implementations, or into a related AST class. It is left to the
discretion of the developer and reviewers to decide whether such refactoring
should be done during the CIR development, or if it is sufficient to leave a
comment in the code indicating this as an opportunity for future improvement.
Because much of the current code is likely to change when the long-term code
sharing strategy is complete, we will lean towards only implementing
refactorings that make sense independent of the code sharing problem.

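As a rough illustration of the kind of small refactoring described above, the
sketch below moves an IR-independent calculation into a helper that two
emitters share, while the IR-specific spelling stays in each emitter. All names
here (``paddingBytesBefore``, ``ClassicPaddingEmitter``, ``CIRPaddingEmitter``)
are hypothetical and do not correspond to actual Clang classes. The type
strings are only meant to suggest LLVM IR and CIR syntax.

```cpp
#include <cstdint>
#include <string>

// Hypothetical shared helper: depends only on layout numbers, not on any IR.
// Returns the number of padding bytes needed to round `offset` up to `align`.
// `align` must be a non-zero power of two.
inline std::uint64_t paddingBytesBefore(std::uint64_t offset,
                                        std::uint64_t align) {
  std::uint64_t aligned = (offset + align - 1) & ~(align - 1);
  return aligned - offset;
}

// Stand-in for the classic codegen side: spells the padding as an
// LLVM-IR-style array type string.
struct ClassicPaddingEmitter {
  std::string emit(std::uint64_t offset, std::uint64_t align) const {
    return "[" + std::to_string(paddingBytesBefore(offset, align)) + " x i8]";
  }
};

// Stand-in for the CIR side: same shared calculation, CIR-style spelling.
struct CIRPaddingEmitter {
  std::string emit(std::uint64_t offset, std::uint64_t align) const {
    return "!cir.array<!u8i x " +
           std::to_string(paddingBytesBefore(offset, align)) + ">";
  }
};
```

The calculation lives in one place and makes sense on its own merits, so a
refactoring of this shape would not have to be undone if the long-term sharing
strategy takes a different direction.
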
We have discussed various ways that major classes such as CGCXXABI/CIRGenCXXABI
could be refactored to allow parts of their implementation to be shared today
through inheritance and templated base classes. However, this may prove to be
wasted effort when the permanent solution is developed. Also, deferring this
kind of intertwined implementation prevents introducing cross-dependencies that
would make it more difficult to remove one IR code generation implementation
without degrading the quality of the other. Therefore, we have decided that it
is better to accept significant amounts of code duplication now, and defer
this type of refactoring until it is clear what the permanent solution will be.

Mitigation Through Testing
==========================

The most important tactic that we are using to mitigate the risk of CIR diverging
from classic codegen is to incorporate two sets of LLVM IR checks in the CIR
codegen LIT tests. One set checks the LLVM IR that is produced by first
generating CIR and then lowering that to LLVM IR. Another set checks the LLVM IR
that is produced directly by the classic codegen.

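A test following this pattern might look like the sketch below. The ``RUN``
lines and the ``CIR``, ``LLVM``, and ``OGCG`` check prefixes follow the
convention commonly seen in the CIR codegen tests, but the target triple and
the deliberately loose check patterns are illustrative assumptions and have not
been verified against actual compiler output.

```cpp
// Path 1: emit CIR and check it directly.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
// RUN: FileCheck --input-file=%t.cir %s --check-prefix=CIR
//
// Path 2: emit CIR, lower it to LLVM IR, and check the result.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-llvm %s -o %t-cir.ll
// RUN: FileCheck --input-file=%t-cir.ll %s --check-prefix=LLVM
//
// Path 3: emit LLVM IR through the classic codegen and check it.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o %t.ll
// RUN: FileCheck --input-file=%t.ll %s --check-prefix=OGCG

int add(int a, int b) { return a + b; }

// Illustrative patterns only; real tests usually check much more detail.
// CIR: cir.func{{.*}} @_Z3addii
//
// LLVM: define{{.*}} i32 @_Z3addii
// LLVM: add nsw i32
//
// OGCG: define{{.*}} i32 @_Z3addii
// OGCG: add nsw i32
```

Keeping the ``LLVM`` and ``OGCG`` checks side by side in one file makes it easy
to see where the two paths differ, and a change in classic codegen output will
fail the ``OGCG`` checks even when the CIR path is untouched.
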
At the time that tests are created, we compare the LLVM IR output from these two
paths to verify (manually) that any meaningful differences between them are the
result of known missing features in the current CIR implementation. Whenever
possible, differences are corrected in the same PR in which the test is added,
updating the CIR implementation as it is being developed.

However, these tests serve a second purpose. They also serve as sentinels to
alert us to changes in the classic codegen behavior that will need to be
accounted for in the CIR implementation. While we appreciate any help from
developers contributing to classic codegen, our current expectation is that it
will be the responsibility of the ClangIR contributors to update the CIR
implementation when these tests fail.

As the CIR implementation gets closer to the goal of IR that is semantically
equivalent to the LLVM IR produced by the classic codegen, we would like to
enhance the CIR tests to perform some automatic verification of the equivalence
of the generated LLVM IR, perhaps using a combination of tools such as
``opt -passes=normalize`` and Alive2.

Eventually, we would like to be able to run all existing classic codegen tests
using the CIR path as well.

Other Considerations
====================

The close modeling of CIR after classic codegen has also meant that the CIR
dialect often represents language details at a much lower level than it ideally
should.

In the interest of having a complete working implementation of ClangIR as soon
as is practical, we have chosen to follow the classic codegen implementation
closely in the initial implementation, raising the representation in the CIR
dialect to a higher level only when there is a clear and immediate benefit to
doing so.

Over time, we expect to progressively raise the CIR representation to a higher
level and remove low-level details, including ABI-specific handling, from the
dialect. (See the "Long Term Vision" section below for more details.) Having
a working implementation in place makes it easier to verify that the
high-level representation and subsequent lowering are correct.

Mixing With Other Dialects
==========================

Mixing of dialects is a central design feature of MLIR. The CIR dialect is
currently more self-contained than most dialects, but even now we generate
the ACC (OpenACC) dialect in combination with CIR, and when support for OpenMP
and CUDA is added, similar mixing will occur.

We also expect CIR to be at least partially lowered to other dialects during
the optimization phase to enable features such as data dependence analysis, even
if we will eventually be lowering it to LLVM IR.

Therefore, any plan for generating LLVM IR from CIR must be integrated with the
general MLIR lowering design, which typically involves lowering to the LLVM
dialect, which is then transformed to LLVM IR.

Other Consumers of CIR and MLIR
===============================

We must also consider that we will not always be lowering CIR to LLVM IR. CIR,
usually mixed with other dialects, will also be directed to offload targets
and other code generators through interfaces that are opaque to Clang, such as
SPIR-V and MLIR core dialects. We must still produce semantically correct CIR
for these consumers.

Long Term Vision
================

As the CIR implementation matures, we will eliminate target-specific handling
from the high-level CIR generated by Clang. The high-level CIR will then be
progressively lowered to a form that is closer to LLVM IR, including a pass
that inserts ABI-specific handling, potentially representing the target-specific
details in another dialect. More complex transformations, such as library-aware
idiom recognition or advanced loop representations, may occur later in the
compilation pipeline through additional passes, which can be controlled by
specific compiler flags.

As we raise CIR to this higher-level implementation, there will naturally be
less code duplication, and less need to have the same logic repeated in the
CIR generation.

We will continue to use the same basic design and structure for CIR code
generation, with classes like CIRGenModule and CIRGenFunction that serve the
same purpose as their counterparts in classic codegen, but the handling there
will be more closely tied to core semantics and therefore less likely to require
frequent changes to stay in sync with classic codegen.

As the handling of low-level details is moved to later lowering phases, we will
need to move away from the current tight coupling of the CIR and classic codegen
implementations. As this happens, we will look for ways that this handling can
be moved to new classes that are specifically designed to be shared among
clients that are targeting different IR substrates. That is, rather than trying
to overlay reuse onto the existing implementations, we will replace relevant
parts of the existing implementation, piece by piece, as appropriate, with new
implementations that perform the same function but with a more general design.

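One possible shape for such a substrate-independent class is an algorithm
written once against a small builder interface. The sketch below is only an
assumption about what that could look like, not a description of any planned
Clang API: ``expandComplexMul``, ``EvalBuilder``, and ``TraceBuilder`` are
hypothetical names, and the expansion is the textbook formula without the
NaN and infinity handling that a real implementation needs.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical substrate-independent algorithm: expands
// (a + bi)(c + di) = (ac - bd) + (ad + bc)i
// using only the operations that the Builder provides.
template <typename Builder>
std::pair<typename Builder::Value, typename Builder::Value>
expandComplexMul(Builder &builder, const typename Builder::Value &a,
                 const typename Builder::Value &b,
                 const typename Builder::Value &c,
                 const typename Builder::Value &d) {
  auto ac = builder.mul(a, c);
  auto bd = builder.mul(b, d);
  auto ad = builder.mul(a, d);
  auto bc = builder.mul(b, c);
  auto real = builder.sub(ac, bd);
  auto imag = builder.add(ad, bc);
  return {real, imag};
}

// Substrate 1 (stand-in): evaluates each operation immediately.
struct EvalBuilder {
  using Value = double;
  Value mul(Value x, Value y) { return x * y; }
  Value add(Value x, Value y) { return x + y; }
  Value sub(Value x, Value y) { return x - y; }
};

// Substrate 2 (stand-in): records each operation as text, the way an IR
// builder would append instructions.
struct TraceBuilder {
  using Value = std::string;
  std::vector<std::string> ops;

  Value emit(const char *opcode, const Value &x, const Value &y) {
    Value result = "%" + std::to_string(ops.size());
    ops.push_back(result + " = " + opcode + " " + x + ", " + y);
    return result;
  }
  Value mul(const Value &x, const Value &y) { return emit("mul", x, y); }
  Value add(const Value &x, const Value &y) { return emit("add", x, y); }
  Value sub(const Value &x, const Value &y) { return emit("sub", x, y); }
};
```

The expansion logic appears exactly once. Each client supplies only the
operations for its own substrate, so a fix to the algorithm reaches every
client without manual porting.
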
Example: C Calling Convention Handling
======================================

C calling convention handling is an example of a general purpose redesign that
is already underway. This was started independently of CIR, but it will be
directly useful for lowering from a high-level call representation in CIR to a
representation that includes the target- and calling convention-specific details
of function signatures, parameter type coercion, and so on.

The current CIR implementation duplicates most of the classic codegen's
function call handling, but it omits several pieces that handle type
coercion. This leads to an implementation that has all of the complexity of the
classic codegen without actually achieving the goals of that complexity. It will
be a significant improvement to the CIR implementation to simplify the function
call handling in such a way that it generates a high-level representation of the
call, while preserving all information that will be needed to lower the call to
an ABI-compliant representation in a later phase of compilation.

This provides a clear example where trying to refactor the classic codegen in
some way to be reused by CIR would have been counterproductive. The classic
codegen implementation was tightly coupled with Clang's LLVM IR generation. The
implementation is being completely redesigned to allow general reuse, not just by
CIR, but also by other front ends.

The CIR calling convention lowering will make use of the general purpose C
calling convention library that is being created, but it should do so through
an MLIR transform pass built on top of that library, one that is general enough
to be used by other dialects, such as FIR, that also need the same calling
convention handling.

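The separation described in this section can be sketched in a few lines. In the
toy model below, a high-level call keeps only source-level argument
descriptions, and a later lowering step asks an IR-independent classifier how
each argument should be passed. Every name (``AbiType``, ``ArgKind``,
``classifyArgument``, ``HighLevelCall``, ``lowerCall``) is hypothetical, and the
classification rule is a made-up simplification that does not match any real
ABI or the interface of the library under development.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical IR-independent type description consumed by the ABI library.
struct AbiType {
  bool isAggregate;
  std::uint64_t sizeInBytes;
};

// How an argument is passed once ABI details are applied.
enum class ArgKind { Direct, Extend, Indirect };

// Toy rule (an assumption for illustration, not a real ABI): large aggregates
// go through memory, narrow scalars are extended, everything else is direct.
inline ArgKind classifyArgument(const AbiType &type) {
  if (type.isAggregate && type.sizeInBytes > 16)
    return ArgKind::Indirect;
  if (!type.isAggregate && type.sizeInBytes < 4)
    return ArgKind::Extend;
  return ArgKind::Direct;
}

// High-level call: no coercion has happened yet, but all information needed
// to perform it later is preserved.
struct HighLevelCall {
  std::string callee;
  std::vector<AbiType> args;
};

// Result of the later, ABI-aware lowering step.
struct LoweredCall {
  std::string callee;
  std::vector<ArgKind> argKinds;
};

// Stand-in for a lowering pass: any client that can describe its arguments as
// AbiType values could reuse classifyArgument the same way.
inline LoweredCall lowerCall(const HighLevelCall &call) {
  LoweredCall lowered{call.callee, {}};
  for (const AbiType &type : call.args)
    lowered.argKinds.push_back(classifyArgument(type));
  return lowered;
}
```

Because the classifier never sees an IR type, the code that generates the
high-level call stays free of ABI logic, and the lowering step is the only
place that depends on the target.
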
Significant Areas For Improvement
=================================

The following list enumerates some of the areas where significant restructuring
of the code is needed to enable better code sharing between CIR and classic
codegen. Each of these areas is relatively self-contained in the codegen
implementation, making the path to a shared implementation relatively clear.

- Constant expression evaluation
- Complex multiplication and division expansion
- Builtin function handling
- Exception Handling and C++ Cleanups
- Inline assembly handling
- C++ ABI Handling

  - VTable generation
  - Virtual function calls
  - Constructor and destructor arguments
  - Dynamic casts
  - Base class address calculation
  - Type descriptors
  - Array new and delete

Pervasive Low-Level Issues
==========================

This section lists some of the features where a non-trivial amount of code
is shared between CIR and classic codegen, but the handling of the feature
is distributed across the codegen implementation, making it more difficult
to design an abstraction that can easily be shared.

- Global variable and function linkage
- Alignment management
- Debug information
- TBAA handling
- Sanitizer integration
- Lifetime markers

clang/docs/index.rst

Lines changed: 1 addition & 1 deletion
@@ -121,7 +121,7 @@ Design Documents
    ControlFlowIntegrityDesign
    HardwareAssistedAddressSanitizerDesign.rst
    ConstantInterpreter
-
+   ClangIRCodeDuplication
 
 Indices and tables
 ==================
