diff --git a/clang/docs/ClangIRCodeDuplication.rst b/clang/docs/ClangIRCodeDuplication.rst new file mode 100644 index 0000000000000..77df17370fd7b --- /dev/null +++ b/clang/docs/ClangIRCodeDuplication.rst @@ -0,0 +1,236 @@ +================================ +ClangIR Code Duplication Roadmap +================================ + +.. contents:: + :local: + +Introduction +============ + +This document describes the general approach to code duplication in the ClangIR +code generation implementation. It acknowledges specific problems with the +current implementation, discusses strategies for mitigating the risk inherent in +the current approach, and describes a general long-term plan for addressing the +issue. + +Background +========== + +The ClangIR code generation is very closely modeled after Clang's LLVM IR code +generation, and we intend for the CIR produced to eventually be semantically +equivalent to the LLVM IR produced when not going through ClangIR. However, we +acknowledge that as the ClangIR implementation is under development, there will +be differences in semantics, both because we have not yet implemented all +features of the classic codegen and because the CIR dialect is still evolving +and does not yet have a way to represent all of the necessary semantics. + +We have chosen to model the ClangIR code generation directly after the classic +codegen, to the point of following identical code structure, using similar names +and often duplicating the logic because this seemed to be the most certain path +to producing equivalent results. Having such nearly identical code allows for +direct comparison between the CIR codegen and the LLVM IR codegen to find what +is missing or incorrect in the CIR implementation. + +However, we recognize that this is not a sustainable permanent solution. As +bugs are fixed and new features are added to the classic codegen, the process of +keeping the analogous CIR code up to date will be a purely manual process. + +Long term, we need a more sustainable approach. + +Current Strategy +================ + +Practical considerations require that we make steady progress towards a working +implementation of ClangIR. This necessity is directly opposed to the goal of +minimizing code duplication. + +For this reason, we have decided to accept a large amount of code duplication +in the short term, even with the explicit understanding that this is producing +a significant amount of technical debt as the project progresses. + +As the CIR implementation is developed, we often note small pieces of code that +could be shared with the classic codegen if they were moved to a different part +of the source, such as a shared utility class in some directory available to +both codegen implementations or by moving the function into a related AST class. +It is left to the discretion of the developer and reviewers to decide whether +such refactoring should be done during the CIR development, or if it is +sufficient to leave a comment in the code indicating this as an opportunity for +future improvement. Because much of the current code is likely to change when +the long term code sharing strategy is complete, we will lean towards only +implementing refactorings that make sense independent of the code sharing +problem. + +We have discussed various ways that major classes such as CGCXXABI/CIRGenCXXABI +could be refactored to allow parts of there implementation to be shared today +through inheritence and templated base classes. However, this may prove to be +wasted effort when the permanent solution is developed, so we have decided that +it is better to accept significant amounts of code duplication now, and defer +this type of refactoring until it is clear what the permanent solution will be. + +Mitigation Through Testing +========================== + +The most important tactic that we are using to mitigate the risk of CIR diverging +from classic codegen is to incorporate two sets of LLVM IR checks in the CIR +codegen LIT tests. One set checks the LLVM IR that is produced by first +generating CIR and then lowering that to LLVM IR. Another set checks the LLVM IR +that is produced directly by the classic codegen. + +At the time that tests are created, we compare the LLVM IR output from these two +paths to verify (manually) that any meaningful differences between them are the +result of known missing features in the current CIR implementation. Whenever +possible, differences are corrected in the same PR that the test is being added, +updating the CIR implementation as it is being developed. + +However, these tests serve a second purpose. They also serve as sentinels to +alert us to changes in the classic codegen behavior that will need to be +accounted for in the CIR implementation. While we appreciate any help from +developers contributing to classic codegen, our current expectation is that it +will be the responsibility of the ClangIR contributors to update the CIR +implementation when these tests fail. + +As the CIR implementation gets closer to the goal of IR that is semantically +equivalent to the LLVM IR produced by the classic codegen, we would like to +enhance the CIR tests to perform some automatic verification of the equivalence +of the generated LLVM IR, perhaps using a tool such as Alive2. + +Eventually, we would like to be able to run all existing classic codegen tests +using the CIR path as well. + +Other Considerations +==================== + +The close modeling of CIR after classic codegen has also meant that the CIR +dialect often represents language details at a much lower level than it ideally +should. + +In the interest of having a complete working implementation of ClangIR as soon +as is practical, we have chosen to take the approach of following the classic +codegen implementation closely in the initial implementation and only raising +the representation in the CIR dialect to a higher level when there is a clear +and immediate benefit to doing so. + +Over time, we expect to progressively raise the CIR representation to a higher +level and remove low level details, including ABI-specific handling from the +dialect. However, having a working implementation in place makes it easier to +verify that the high level representation and subsequent lowering are correct. + +Mixing With Other Dialects +========================== + +Mixing of dialects is a central design feature of MLIR. The CIR dialect is +currently more self-contained than most dialects, but even now we generate +the ACC (OpenACCC) dialect in combination with CIR, and when support for OpenMP +and CUDA are added, similar mixing will occur. + +We also expect CIR to be at least partially lowered to other dialects during +the optimization phase to enable features such as data dependence analysis, even +if we will eventually be lowering it to LLVM IR. + +Therefore, any plan for generating LLVM IR from CIR must be integrated with the +general MLIR lowering design, which typically involves lowering to the LLVM +dialect, which is then transformed to LLVM IR. + +Other Consumers of CIR and MLIR +=============================== + +We must also consider that we will not always be lowering CIR to LLVM IR. CIR, +usually mixed with other dialects, will also be directed to offload targets +and other code generators through interfaces that are opaque to Clang. We must +still produce semantically correct CIR for these consumers. + +Long Term Vision +================ + +As the CIR implementation matures, we will eliminate target-specific handling +from the high-level CIR generated by Clang. The high-level CIR will then be +progressively lowered to a form that is closer to LLVM IR, including a pass +that inserts ABI-specific handling, potentially representing the target-specific +details in another dialect. + +As we raise CIR to this higher level implementation, there will naturally be +less code duplication, and less need to have the same logic repeated in the +CIR generation. + +We will continue to use that same basic design and structure for CIR code +generation, with classes like CIRGenModule and CIRGenFunction that serve the +same purpose as their counterparts in classic codegen, but the handling there +will be more closely tied to core semantics and therefore less likely to require +frequent changes to stay in sync with classic codegen. + +As the handling of low-level details is moved to later lowering phases, we will +need to move away from the current tight coupling of the CIR and classic codegen +implementations. As this happens, we will look for ways that this handling can +be moved to new classes that are specifically designed to be shared among +clients that are targeting different IR substrates. That is, rather than trying +to overlay reuse onto the existing implementations, we will replace relevant +parts of the existing implementation, piece by piece, as appropriate, with new +implementations that perform the same function but with a more general design. + +Example: C Calling Convention Handling +====================================== + +C calling convention handling is an example of a general purpose redesign that +is already underway. This was started independently of CIR, but it will be +directly useful for lowering from high-level call representation in CIR to a +representation that includes the target- and calling convention-specific details +of function signatures, parameter type coercion, and so on. + +The current CIR implementation duplicates most of the classic codegen handling +for function call handling, but it omits several pieces that handle type +coercion. This leads to an implementation that has all of the complexity of the +class codegen without actually achieving the goals of that complexity. It will +be a significant improvement to the CIR implementation to simplify the function +call handling in such a way that it generates a high-level representation of the +call, while preserving all information that will be needed to lower the call to +an ABI-compliant representation in a later phase of compilation. + +This provides a clear example where trying to refactor the classic codegen in +some way to be reused by CIR would have been counterproductive. The classic +codegen implementation was tightly coupled with Clang's LLVM IR generation. The +implementation is being completely redesigned to allow general reuse, not just by +CIR, but also by other front ends. + +The CIR calling convention lowering will make use of the general purpose C +calling convention library that is being created, but it should create an MLIR +transform pass on top of that library that is general enough to be used by other +dialects, such as FIR, that also need the same calling convention handling. + +Significant Areas For Improvement +================================= + +The following list enumerates some of the areas where significant restructuring +of the code is needed to enable better code sharing between CIR and classic +codegen. Each of these areas is relatively self-contained in the codegen +implementation, making the path to a shared implementation relatively clear. + +- Constant expression evaluation +- Complex multiplication and division expansion +- Builtin function handling +- Exception Handling and C++ Cleanups +- Inline assembly handling +- C++ ABI Handling + + - VTable generation + - Virtual function calls + - Constructor and destructor arguments + - Dynamic casts + - Base class address calculation + - Type descriptors + - Array new and delete + +Pervasive Low-Level Issues +========================== + +This section lists some of the features where a non-trivial amount of code +is shared between CIR and classic codegen, but the handling of the feature +is distributed across the codegen implementation, making it more difficult +to design an abstraction that can easily be shared. + +- Global variable and function linkage +- Alignment management +- Debug information +- TBAA handling +- Sanitizer integration +- Lifetime markers