Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISubprogram has declaration pointing to itself #59241

Closed
apolloww opened this issue Nov 29, 2022 · 31 comments
Closed

DISubprogram has declaration pointing to itself #59241

apolloww opened this issue Nov 29, 2022 · 31 comments

Comments

@apolloww
Copy link
Contributor

apolloww commented Nov 29, 2022

Repro source:

class Request final {
  public:
    Request(){}
};

class Response final {
  public:
    Response(){}
};

class LoggingAllowedFields {
  public:
    template <typename T>
    static const int loggingFieldsForType();

    template <>
    const int loggingFieldsForType<Request>();

    template <>
    const int loggingFieldsForType<Response>();
};


template <typename PAYLOAD>
int makeLogData(const PAYLOAD& payload) {
  return LoggingAllowedFields::loggingFieldsForType<PAYLOAD>();
}

Response *log(Request* request) {
  makeLogData(*request);
  Response *response = new Response{};
  makeLogData(*response);
  return response;
}

run the command:

clang-16 -o repro.o -fopenmp -c -g2 -O1 repro.cc

stack overflow inside constructCallSiteEntryDIEs from asm printer pass when constructing callsite
callee LoggingAllowedFields::loggingFieldsForType<Request>()
caller makeLogData<Request>(Request const&).

with the follow debug info:

!54 = !DISubprogram(name: "loggingFieldsForType<Request>", linkageName: "_ZN20LoggingAllowedFields20loggingFieldsForTypeI7RequestEEKiv", scope: !55, file: ! 1, line: 17, type: !58, scopeLine: 17, flags: DIFlagPublic | DIFlagPrototyped | DIFlagStaticMember, spFlags: DISPFlagOptimized, templateParams: !64)
!55 = distinct !DICompositeType(tag: DW_TAG_class_type, name: "LoggingAllowedFields", file: !1, line: 11, size: 8, flags: DIFlagTypePassByValue, elements: ! 56, identifier: "_ZTS20LoggingAllowedFields")
!56 = !{!54, !57}
!57 = distinct !DISubprogram(name: "loggingFieldsForType<Response>", linkageName: "_ZN20LoggingAllowedFields20loggingFieldsForTypeI8ResponseEEKiv", scope: ! 55, file: !1, line: 20, type: !58, flags: DIFlagPrototyped, spFlags: DISPFlagOptimized, templateParams: !62, declaration: !57, retainedNodes: !61)

Notice that DISubprogram loggingFieldsForType<Response>'s declaration field points to itself, which leads to infinite call stack.

The self-reference DISubprogram exists in the unoptimized IRs straight from clang codegen, so it is reproducible with the following cmd as well,

clang-16 -o repro.ll -fopenmp -g2 -O1 repro.cc -Xclang -disable-llvm-passes -emit-llvm -S

In order to reproduce, I need all of -fopenmp, -O1( or use a -O level >= 1) and -g2:

  • -fopenmp affects which declarations go into CodeGenModule::DeferredDeclsToEmit and their orders. The source has nothing to do with openmp, but this flag is applied universally in our internal code repo. As explained below, I think certain order of the declarations causes the issue.
  • -O level seems to affect debug info verbosity. If I give -O0, less debug info is generated and the issue is gone.
  • -g2 implies -debug-info-kind=constructor. If I give -debug-info-kind=limited, the issue is gone.

Below is what I've found so far after digging into clang codegen.
When -fopenmp is given, the order of function codegen is

  1. codegen function log(Request*): create temporary DICompositeType for Response and store in TypeCache. It has DIFlagFwdDecl flag.

  2. codegen function int makeLogData<Request>(Request const&): the type definition of loggingFieldsForType is created. This in turn creates a DISubprogram for loggingFieldsForType<Response>(), and uses the type created in previous step for its templateParams field. The type is stored in the DISubprograms hashmap using the templateParams as part of key.

  3. codegen function for the constructor Response::Response(): because of ctor homing, it creates the complete type for Response and replaces the previously created temporary node with a distinct DICompositeType node.

  4. codegen function int makeLogData<Response>(Response const&)

    1. searching the TypeCache for type Response now returns the distinct node.
    2. searching the DISubprograms hashmap for loggingFieldsForType<Response>() does not return anything because the key value has been updated.
    3. A new DISubprogram for loggingFieldsForType<Response>() is created.

If -fopenmp is not given, the order of function codegen changes, now #3 Response::Response(), the constructor is generated before #2, so that in #4, the search will hit in the hashmap.

I am not sure if openmp is to blame here, it looks to me that it exposes an issue in debug info generation when ctor homing is used.

cc: @dwblaikie @ayermolo

@llvmbot
Copy link
Collaborator

llvmbot commented Nov 29, 2022

@llvm/issue-subscribers-debuginfo

@llvmbot
Copy link
Collaborator

llvmbot commented Nov 29, 2022

@llvm/issue-subscribers-openmp

@dwblaikie
Copy link
Collaborator

if you out-of-line the functions (rather than defining them within the class - though maybe still marking them as inline) can you still reproduce it? Perhaps then you can reorder them and reproduce the failure even without openmp? (might require other techniques to move when the ctors are instantiated while leaving their written definitions where they are currently..)

@apolloww
Copy link
Contributor Author

apolloww commented Nov 29, 2022

Thanks for the suggestion. If I change the source as follows:

class Request final {
  public:
    Request();
};

class Response final {
  public:
    Response();
};


class LoggingAllowedFields {
  public:
    template <typename T>
    static const int loggingFieldsForType();

    template <>
    const int loggingFieldsForType<Request>();

    template <>
    const int loggingFieldsForType<Response>();
};

int makeLogDataRequest(const Request& payload);
int makeLogDataResponse(const Response& payload);

Request::Request() {}

Response *log(Request* request) {
  makeLogDataRequest(*request);
  Response *response = new Response{};
  makeLogDataResponse(*response);
  return response;
}

int makeLogDataRequest(const Request& payload) {
  return LoggingAllowedFields::loggingFieldsForType<Request>();
}

Response::Response() {}

int makeLogDataResponse(const Response& payload) {
  return LoggingAllowedFields::loggingFieldsForType<Response>();
}

and build it with

clang-16 -o repro.o -c -g2 -O1 repro.cc

It seems to crash at the same place.

@dwblaikie
Copy link
Collaborator

Yeah, looks like the IR is invalid - if this is compiled with -emit-llvm, it bails out:

function declaration may only have a unique !dbg attachment
ptr @_ZN20LoggingAllowedFields20loggingFieldsForTypeI8ResponseEEKiv
fatal error: error in backend: Broken module found, compilation aborted!

& I think it might be relatively recent - so your best bet is probably to bisect it & see what revision introduced the problem and followup in post-commit review there.

@apolloww
Copy link
Contributor Author

apolloww commented Nov 30, 2022

We found this issue when trying out clang-15 internally, and it also repro on clang-12 after back-ported https://reviews.llvm.org/D106084 (it produces smaller object files) or just force -debug-info-kind=constructor, so I suspect this is not new.

cc: @amykhuang

@dwblaikie
Copy link
Collaborator

Ah, yeah, sorry, I'd sync'd back 1000 revisions and thought it didn't reproduce, but I messed that up somehow. So, yeah, probably somewhat inherent to ctor homing. But that means it can probably be reproduced with other homing strategies too...

let's see

struct Response;

struct LoggingAllowedFields {
    template <typename T>
    static const int loggingFieldsForType();

    template <>
    const int loggingFieldsForType<int>();

    template <>
    const int loggingFieldsForType<Response>();
};

void makeLogDataRequest() {
  LoggingAllowedFields::loggingFieldsForType<int>();
}

struct Response {
    void func();
};
void Response::func() { }

int makeLogDataResponse() {
  return LoggingAllowedFields::loggingFieldsForType<Response>();
}

Seems to reproduce without ctor homing?

@dwblaikie
Copy link
Collaborator

Looks like this happened around Clang 10? https://godbolt.org/z/5zejTh9s3

@dwblaikie
Copy link
Collaborator

Got it down a bit further & still crashing: https://godbolt.org/z/nzj7v16xa

struct t2;
struct t1 {
    template <typename T>
    static const int f1();
    template <>
    const int f1<t2>();
};
t1 v1;
struct t2 { };
t2 v2;
void f2() {
   t1::f1<t2>();
}

@apolloww
Copy link
Contributor Author

apolloww commented Dec 1, 2022

It seems it compiles fine on clang-9 but fails on clnag-10 and up. I'll do some bisecting.

@apolloww
Copy link
Contributor Author

apolloww commented Dec 1, 2022

bisected to https://reviews.llvm.org/D69743

@dwblaikie
Copy link
Collaborator

oh, yeah, the call site stuff adding in declarations for entities we hadn't previously tried to produce declarations for did trip over a few things - not too surprised that there are still latent issues like this. @vedantk any chance you could take a look at this?

@vedantk
Copy link
Collaborator

vedantk commented Dec 2, 2022

The trouble starts immediately after debug info gen for the call t1::f1<t2>() in CGDebugInfo::EmitFuncDeclForCallSite. I locally added a verifyModule() call at the end of CGDebugInfo::EmitFunctionDecl and could confirm the "function declaration may only have a unique !dbg attachment" issue manifests after the call to DBuilder.finalizeSubprogram(SP);.

Looking at the history, it seems this fits into a known pattern of bugs that historically were dealt with by adding escape hatches (ad hoc) for "bad" kinds of callees (statics, inlines):

  // If there is no DISubprogram attached to the function being called,
  // create the one describing the function in order to have complete
  // call site debug info.
  if (!CalleeDecl->isStatic() && !CalleeDecl->isInlined())
    EmitFunctionDecl(CalleeDecl, CalleeDecl->getLocation(), CalleeType, Func);

It's been years now since I've had time to work on debug info in particular or llvm in general, and I haven't kept up with how the project has evolved, so I'm scratching my head about what to do here. Istm that the easy/lazy stopgap fix would be to add yet another escape hatch (maybe if CalleeDecl->getTemplateKind() is some known "bad" type?). That doesn't seem ideal, though, as it's just papering over the deeper issue of the finalizeSubprogram step being unable to create a uniqued DISubprogram: that's an issue I have even less background knowledge to lean on, but maybe @adrian-prantl has some pointers?

My role at Apple has changed, so unfortunately I'm unlikely to be able to take a deeper look at this any time soon..

@apolloww
Copy link
Contributor Author

apolloww commented Dec 2, 2022

From

  if (!CalleeDecl->isStatic() && !CalleeDecl->isInlined())
    EmitFunctionDecl(CalleeDecl, CalleeDecl->getLocation(), CalleeType, Func);

no DISubprogram should be generated for static decl. t1::f1<t2> is a static member method, so why does it pass the check?

@dwblaikie
Copy link
Collaborator

No worries @vedantk - thanks for the context you've provided.

From

  if (!CalleeDecl->isStatic() && !CalleeDecl->isInlined())
    EmitFunctionDecl(CalleeDecl, CalleeDecl->getLocation(), CalleeType, Func);

no DISubprogram should be generated for static decl. t1::f1<t2> is a static member method, so why does it pass the check?

FunctionDecl::isStatic tests whether the function has static storage duration (is it file-local), which is different from whether it's a static member of a class. (it might be both (if the class is in an anonymous namespace - then it has static storage duration and it's a static member of a class) or could be one or the other, or neither)

@apolloww
Copy link
Contributor Author

apolloww commented Dec 2, 2022

Thanks for the explanation. Just noticed it returns StorageClass.

@shiltian shiltian added clang Clang issues not falling into any other category and removed openmp labels Jan 13, 2023
@apolloww
Copy link
Contributor Author

As @vedantk suggested, we can tighten the constraint here, like !CalleeDecl->isFunctionTemplateSpecialization(), to exclude such case from attaching DI node to function decl. But this also regresses good cases, like t1::f1<t3>() below.

struct t2;
struct t3;
struct t1 {
  template <typename T>
    static const int f1();
  template <>
    const int f1<t2>();
  template <>
    const int f1<t3>();
};

struct t3 { };
t3 v3;

t1 v1;
struct t2 { };
t2 v2;
void f2() {
  t1::f1<t2>();  // bad
  t1::f1<t3>();  // good
}

I think the thing goes wrong with decl at callsitet1::f1<t2>() is that type t2 is defined after type t1 is constructed, and that creates a temporary node for t2.

!10 = !DISubprogram(name: "f1<t2>", linkageName: "_ZN2t12f1I2t2EEKiv", scope: !8, file: !3, line: 7, type: !11, scopeLine: 7, flags: DIFlagPrototyped | DIFlagStaticMember, spFlags: DISPFlagOptimized, templateParams: !15)
!15 = !{!16}
!16 = !DITemplateTypeParameter(name: "T", type: !17)
!17 = <temporary!> !DICompositeType(tag: DW_TAG_structure_type, name: "t2", file: !3, line: 1, flags: DIFlagFwdDecl | DIFlagNonTrivial, identifier: "_ZTS2t2")

At the callsite, a new DISubprogram is created and its declaration points to the previously created one.

declare !dbg !36 noundef i32 @_ZN2t12f1I2t2EEKiv() #1

!23 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "t2", file: !3, line: 16, size: 8, flags: DIFlagTypePassByValue, elements: !5, identifier: "_ZTS2t2")

!36 = !DISubprogram(name: "f1<t2>", linkageName: "_ZN2t12f1I2t2EEKiv", scope: !8, file: !3, line: 7, type: !11, flags: DIFlagPrototyped, spFlags: DISPFlagOptimized, templateParams: !37, declaration: !10, retainedNodes: !5)
!37 = !{!38}
!38 = !DITemplateTypeParameter(name: "T", type: !23)

This creates a self-reference cycle when the temporary node is replaced and RAUW later during CGDebugInfo::finalize(), and the node becomes distinct.

void MDNode::handleChangedOperand(void *Ref, Metadata *New) {
  ... ...
  // Drop uniquing for self-reference cycles and deleted constants.
  if (New == this || (!New && Old && isa<ConstantAsMetadata>(Old))) {
    if (!isResolved())
      resolve();
    storeDistinctInContext();
    return;
  }
  ... ...
}

I wonder if there is a way to break the cycle. Also what does it mean by having a declaration (!36) points to another declaration (!10)? Can we simply make the declaration field as nullptr if the DISubprogram is a declaration itself?

@apolloww
Copy link
Contributor Author

apolloww commented Feb 1, 2023

This is currently blocking our internal toolchain upgrade. I am OK with skipping templated functions at this moment even though it is a strong condition.

@dwblaikie, do you have any suggestions?

@dwblaikie
Copy link
Collaborator

@aprantl @JDevlieghere sounds like Apple originally implemented the call site stuff? Is it something you've still got interest in/time to look at a good fix for this bug? Otherwise it's likely we'll end up with a fairly coarse-grained fix, disabling call site info for calls to function template instantiations or something similar?

@felipepiovezan
Copy link
Contributor

Sorry for the delay, I have something I need to finish first, but I'd like to have a look at this soon!

@felipepiovezan
Copy link
Contributor

felipepiovezan commented Feb 9, 2023

Also what does it mean by having a declaration (!36) points to another declaration (!10)?

I've been trying to answer this question and the docs don't seem very clear on whether this is supposed to be allowed or not. I'll experiment forbidding this and checking what breaks.

In the meantime, I also tracked where a declaration to another declaration is created, and it is in this function:

clang::CodeGen::CGDebugInfo::EmitFunctionDecl(...) {
   4211   llvm::DISubprogram::DISPFlags SPFlags = llvm::DISubprogram::SPFlagZero;
   4212   if (CGM.getLangOpts().Optimize)
   4213     SPFlags |= llvm::DISubprogram::SPFlagOptimized;
   4214
   4215   llvm::DINodeArray Annotations = CollectBTFDeclTagAnnotations(D);
   4216   llvm::DISubroutineType *STy = getOrCreateFunctionType(D, FnType, Unit);
-> 4217   llvm::DISubprogram *SP =
   4218       DBuilder.createFunction(FDContext, Name, LinkageName, Unit, LineNo, STy,
   4219                               ScopeLine, Flags, SPFlags, TParamsArray.get(),
   4220                               getFunctionDeclaration(D), nullptr, Annotations);
}

Note that SPFlags never sets IsDefinition, and yet we call getFunctionDeclaration(D). If that call returns non-null for D (which it does in our problematic example), then we create a declaration pointing to another declaration.

Turns out the call to getFunctionDeclaration is very very old. I've gone as far as 2013 to track any slightly related changes to that line: 18cfbc5
I don't think this patch is changing anything relevant for what we are investigating though.

@apolloww
Copy link
Contributor Author

apolloww commented Feb 9, 2023

Yeah, I think this is where the decl references decl happens. I wonder if we could create a new subprogram for decl only if there is none exists (Does this follow the expectation of subprogram for decl should always be unique?). Even though we'd use a subprogram with incomplete type, in the example, the one with only a fwd decl of t2 would be picked, but during finalization, I feel it should be updated to the complete type.

@felipepiovezan
Copy link
Contributor

felipepiovezan commented Feb 9, 2023

Also, none of the check-clang tests fail if we replace getFunctionDeclaration(D) with nullptr...

@dwblaikie
Copy link
Collaborator

Also, none of the check-clang tests fail if we replace getFunctionDeclaration(D) with nullptr...

Seems worth a go, then. Though probably worth doing some manual testing of surrounding cases would be worthwhile. (try call site debug info with a plain function declaration, with a member function declared/defined before or after the call, etc)

@felipepiovezan
Copy link
Contributor

felipepiovezan commented Feb 10, 2023

Seems worth a go, then. Though probably worth doing some manual testing of surrounding cases would be worthwhile. (try call site debug info with a plain function declaration, with a member function declared/defined before or after the call, etc)

I'll give those a try! Also compiling Clang itself before/after the patch and seeing if there is any difference whatsoever in debug information.

@dwblaikie
Copy link
Collaborator

I'll give those a try! Also compiling Clang itself before/after the patch and seeing if there is any difference whatsoever in debug information.

An optimized build/otherwise with call site debug info enabled?

@felipepiovezan
Copy link
Contributor

I'll give those a try! Also compiling Clang itself before/after the patch and seeing if there is any difference whatsoever in debug information.

An optimized build/otherwise with call site debug info enabled?

Oh, good point! I had forgotten that the whole issue started with call site debug info, which requires optimizations.
Going to check both to be safe!

@felipepiovezan
Copy link
Contributor

felipepiovezan commented Feb 13, 2023

Good news, using the proposed patch to rebuild clang, I checked the dwarfdump of libLLVMSupport: the dwarf is identical (ignoring the attributes decl_file/producer/comp_dir).

I'll open a patch in phab

@felipepiovezan
Copy link
Contributor

https://reviews.llvm.org/D143921

@apolloww I'm not sure what your username is in Phab

@apolloww
Copy link
Contributor Author

Subscribed. And thanks for amending the doc as well.

felipepiovezan added a commit that referenced this issue Feb 20, 2023
The function `CGDebugInfo::EmitFunctionDecl` is supposed to create a
declaration -- never a _definition_ -- of a subprogram. This is made
evident by the fact that the SPFlags never have the "Declaration" bit
set by that function.

However, when `EmitFunctionDecl` calls `DIBuilder::createFunction`, it
still tries to fill the "Declaration" argument by passing it the result
of `getFunctionDeclaration(D)`. This will query an internal cache of
previously created declarations and, for most code paths, we return
nullptr; all is good.

However, as reported in [0], there are pathological cases in which we
attempt to recreate a declaration, so the cache query succeeds,
resulting in a subprogram declaration whose declaration field points to
another declaration. Through a series of RAUWs, the declaration field
ends up pointing to the SP itself. Self-referential MDNodes can't be
`unique`, which causes the verifier to fail (declarations must be
`unique`).

We can argue that the caller should check the cache first, but this is
not a correctness issue (declarations are `unique` anyway). The bug is
that `CGDebugInfo::EmitFunctionDecl` should always pass `nullptr` to the
declaration argument of `DIBuilder::createFunction`, expressing the fact
that declarations don't point to other declarations. AFAICT this is not
something for which any reasonable meaning exists.

This seems a lot like a copy-paste mistake that has survived for ~10
years, since other places in this file have the exact same call almost
token-by-token.

I've tested this by compiling LLVMSupport with and without the patch, O2
and O0, and comparing the dwarfdump of the lib. The dumps are identical
modulo the attributes decl_file/producer/comp_dir.

[0]: #59241

Differential Revision: https://reviews.llvm.org/D143921
felipepiovezan added a commit to felipepiovezan/llvm-project that referenced this issue Feb 20, 2023
The function `CGDebugInfo::EmitFunctionDecl` is supposed to create a
declaration -- never a _definition_ -- of a subprogram. This is made
evident by the fact that the SPFlags never have the "Declaration" bit
set by that function.

However, when `EmitFunctionDecl` calls `DIBuilder::createFunction`, it
still tries to fill the "Declaration" argument by passing it the result
of `getFunctionDeclaration(D)`. This will query an internal cache of
previously created declarations and, for most code paths, we return
nullptr; all is good.

However, as reported in [0], there are pathological cases in which we
attempt to recreate a declaration, so the cache query succeeds,
resulting in a subprogram declaration whose declaration field points to
another declaration. Through a series of RAUWs, the declaration field
ends up pointing to the SP itself. Self-referential MDNodes can't be
`unique`, which causes the verifier to fail (declarations must be
`unique`).

We can argue that the caller should check the cache first, but this is
not a correctness issue (declarations are `unique` anyway). The bug is
that `CGDebugInfo::EmitFunctionDecl` should always pass `nullptr` to the
declaration argument of `DIBuilder::createFunction`, expressing the fact
that declarations don't point to other declarations. AFAICT this is not
something for which any reasonable meaning exists.

This seems a lot like a copy-paste mistake that has survived for ~10
years, since other places in this file have the exact same call almost
token-by-token.

I've tested this by compiling LLVMSupport with and without the patch, O2
and O0, and comparing the dwarfdump of the lib. The dumps are identical
modulo the attributes decl_file/producer/comp_dir.

[0]: llvm#59241

Differential Revision: https://reviews.llvm.org/D143921

(cherry picked from commit 997dc7e)
CarlosAlbertoEnciso pushed a commit to SNSystems/llvm-debuginfo-analyzer that referenced this issue Feb 21, 2023
The function `CGDebugInfo::EmitFunctionDecl` is supposed to create a
declaration -- never a _definition_ -- of a subprogram. This is made
evident by the fact that the SPFlags never have the "Declaration" bit
set by that function.

However, when `EmitFunctionDecl` calls `DIBuilder::createFunction`, it
still tries to fill the "Declaration" argument by passing it the result
of `getFunctionDeclaration(D)`. This will query an internal cache of
previously created declarations and, for most code paths, we return
nullptr; all is good.

However, as reported in [0], there are pathological cases in which we
attempt to recreate a declaration, so the cache query succeeds,
resulting in a subprogram declaration whose declaration field points to
another declaration. Through a series of RAUWs, the declaration field
ends up pointing to the SP itself. Self-referential MDNodes can't be
`unique`, which causes the verifier to fail (declarations must be
`unique`).

We can argue that the caller should check the cache first, but this is
not a correctness issue (declarations are `unique` anyway). The bug is
that `CGDebugInfo::EmitFunctionDecl` should always pass `nullptr` to the
declaration argument of `DIBuilder::createFunction`, expressing the fact
that declarations don't point to other declarations. AFAICT this is not
something for which any reasonable meaning exists.

This seems a lot like a copy-paste mistake that has survived for ~10
years, since other places in this file have the exact same call almost
token-by-token.

I've tested this by compiling LLVMSupport with and without the patch, O2
and O0, and comparing the dwarfdump of the lib. The dumps are identical
modulo the attributes decl_file/producer/comp_dir.

[0]: llvm/llvm-project#59241

Differential Revision: https://reviews.llvm.org/D143921
@apolloww apolloww closed this as completed Mar 9, 2023
@EugeneZelenko EugeneZelenko added clang:codegen and removed clang Clang issues not falling into any other category labels Mar 9, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Mar 9, 2023

@llvm/issue-subscribers-clang-codegen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants