New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bytecode from iree-import-tflite
causes debug info to be dropped in translateModuleToLLVMIR
#16781
Comments
If at the very start of the If I dump the parent I think at this point we have narrowed this bug down to something about bitcode. In the bad case, the module as loaded from bitcode is corrupted in a way that somehow causes debug info to be dropped in |
Confirmed with this experiment: If I process the bad input file with Whereas as noted earlier, without If I process first without |
iree-import-tflite
causes debug info to be dropped in translateModuleToLLVMIR
Fascinating, I would have gotten lost the moment the bug failed to express as text |
Fully traced through BytecodeReader, here is the divergence: Call stack at point where things diverge:
Looking at frame Since it's in generated code I can't link it easily so here's a paste. This is at static Attribute readAttribute(MLIRContext* context, DialectBytecodeReader &reader) {
uint64_t kind;
if (failed(reader.readVarInt(kind)))
return Attribute();
switch (kind) {
case 0:
return readArrayAttr(context, reader);
case 1:
return readDictionaryAttr(context, reader);
case 2:
return readStringAttr(context, reader);
case 3:
return readStringAttrWithType(context, reader);
case 4:
return readFlatSymbolRefAttr(context, reader);
case 5:
return readSymbolRefAttr(context, reader);
case 6:
return readTypeAttr(context, reader);
case 7:
return readUnitAttr(context, reader);
case 8:
return readIntegerAttr(context, reader);
case 9:
return readFloatAttr(context, reader);
case 10:
return readCallSiteLoc(context, reader);
case 11:
return readFileLineColLoc(context, reader);
case 12:
return readFusedLoc(context, reader);
case 13:
return readFusedLocWithMetadata(context, reader);
case 14:
return readNameLoc(context, reader);
case 15:
return readUnknownLoc(context, reader);
case 16:
return readDenseResourceElementsAttr(context, reader);
case 17:
return readDenseArrayAttr(context, reader);
case 18:
return readDenseIntOrFPElementsAttr(context, reader);
case 19:
return readDenseStringElementsAttr(context, reader);
case 20:
return readSparseElementsAttr(context, reader);
case 21:
return readDistinctAttr(context, reader);
default:
reader.emitError() << "unknown attribute code: " << kind;
return Attribute();
}
return Attribute();
} So:
Interpretation:
What's exactly the bug here? If simply round-tripping via text IR is enough to fix those locations, maybe that means that they should have been good locations in the first place, so the bug is |
Can you give a sample value for one of those? Does it reference the |
Again the good bytecode file is obtained from the bad bytecode file here by two successive |
Here are some hypotheses:
We should also decouple the existence of source locations from whatever other debug information may or may not be included. |
Indeed:
Things I considered:
What I now think we should do:Remember that I need this in the first place to have debug info in ukernels, and the motivation for this is having a good Tracy source view. Now, as of #16757 , users who care about having a good source view are users who pass
|
The decoupling there was intentional. Tracing showing snapshotted sources produced by I think the best paths forward are some mix of
Whether that is connected with |
We need the upstream bug fixed. There's no reason debug info should fail if there's an UnknownLoc. I think there's a bit too much focus on workarounds that dictate how features behave in ways that should be orthogonal. |
I meant something like Scott alludes to in the last sentence of his above comment:
Emphasis on "if they are missing". The situation here is the locations are UnknownLoc, so there is a need to add actual locations to get a useful source view. This is likely a preexisting problem: I bet that if we try to tracy those tflite models with UnknownLoc, source view isn't showing anything. My point was, this doesn't even feel like an "upstream bug" - there is no source location because there is no source, and even that may not be a bug given that TFLite is a binary format. So like Scott suggests in the above pasted sentence, I was thinking that for high enough debugLevel (I was suggesting debugLevel >= 3) we could actually rewrite UnknownLoc locations with the snapshotted source locations. I'm OK with leaving alone any existing actual locations (anything else than UnknownLoc). If you still see an "upstream bug" here that needs fixing, I would like to understand what that is. I can think of two potential candidates but I'm way out of my depth in either:
|
After posting the previous comment I thought: Wait, either I'm missing something, or #16757 does not actually fix #15699. I am at current top-of-tree (ee32fc7). I did an experiment on a OPT-1.3b model from shark_tank, so this is a PyTorch model not TFLite. The command lines are in this gist: https://gist.github.com/bjacob/75717f6335603628d09c58313aa0d47d . It's totally self contained with a link to download the imported MLIR from. I just added the usual Tracy-ing flags, and ran two experiments:
I thought that "#16757 fixes #15699" meant that Screenshots:
|
For Tracy, you should prefer the new |
Source locations and debug information should be separable. We don't want to have a tight coupling between those concepts. If one frontend like TFLite doesn't include one type of metadata, that shouldn't stop us from including other kinds of metadata. |
Ah, do I need to patch some PR? Trying
I tried the suggested one but it needs a value?
I see, that sounds great. I had assumed that
I need a lecture on this topic. I just don't really understand at the moment. |
Let me confirm how the new pass is working... it might be actually be added automatically with debug level >= 3.
Depends on what version/commit you're building at :p. You do need Ben's PR / 15d9039 |
Yep, as mentioned above I am at current top-of-tree (ee32fc7) which is past that (I checked). |
the extra sources are embedded with debug-level>=3 What source locations are passed to LLVM and what source files we embed in the executables are orthogonal. It is useful to be able to point at our new captured sources, but that should be a separate flag. We could have an iree-hal-executable-source-stage= flag that does the location rewrite to a captured stage by name. This way you could have the debug info reference tensor linalg, buffer linag, LLVM dialect, etc. If we added that flag we could remove the implicit source location overwriting of dump-sources, as that was always a hack. But we want that to be optional and off by default: to users (not internal devs) the source locations should always point back to their source programs by default. And if because of a bug that means this fails to work we need to fix that bug. So:
|
I see. What I'm seeing here is that with both TFLite and Pytorch sources, often we just don't have source locations out of the import process. So tracy source view doesn't show anything by default. Could we in that case and under debug-level>=3 set source locations to the embedded executable sources (which are tensor-linalg) ? Way more useful than no source at all. Also a technicality - I get your explanation of "What source locations are passed to LLVM and what source files we embed in the executables are orthogonal", but then I don't understand how that works. If the source files in the embedded executable sources are orthogonal to the source locations passed to LLVM, how is the mapping done between locations in the final object code and locations in the embedded executable sources ? |
iree-import-tflite
causes debug info to be dropped in translateModuleToLLVMIR
iree-import-tflite
and torch-mlir
) causes debug info to be dropped in translateModuleToLLVMIR
I think I'm missing something - are you saying there are zero locations from tflite and pytorch? Or just some missing locations? The pass that captures the locations (which runs several times, each time with a compilation stage name) can set the flag to replace the source locations to point at the captured source - that's exactly what dump-sources does today - the only difference is in one we update the locations and in the other we don't. |
Aah so to actually answer that question I had to patch the above-mentioned generated And you're right actually the "bad" imported bytecode has only one In the below table, "bad" means exhibits the pathologies discussed here, no source view etc.
Now I tried running the same experiment on that OPT-1.3 model but realized that it is text IR to begin with so it just doesn't go through that bytecode-reader code, and I have no idea why it exhibits the pathology. |
iree-import-tflite
and torch-mlir
) causes debug info to be dropped in translateModuleToLLVMIR
iree-import-tflite
causes debug info to be dropped in translateModuleToLLVMIR
These are the NameLoc values in this PoseNet model: |
At this point I think I agree that it seems really unnecessarily rigid of the current DebugTranslation.cpp code to just give up on this perfectly good debug info just because of one |
Note though, at this point we are trying to solve 2 problems, and a fix at that level would only solve 1 not 2:
|
I tried again fixing So it's just ignoring the name string in the NameLoc and walking straight to the ChildLoc, ending with an UnknownLoc, returning null. |
Note: the round tripped via text is actually bad. It loses all traces to the original Tflite file and references an intermediate file. It accidentally works. The bug is how namedloc is being treated, well and indeed mixing two concerns together (but one definitely can have a namedloc and debug info and that should work). |
This is to continue the investigation in #15756, which is being reverted now.
Summary:
#15756 should have been a 1-line PR, simply adding
-gline-tables-only
to the clang flags that we use to compile ukernels to bitcode. All it does is generate some debug info in the compiled ukernels bitcode.This runs into a bug discussed on this comment thread: https://github.com/openxla/iree/pull/15756/files#r1411583898
iree-import-tflite
is run throughiree-opt
with no flags...mlir-opt
, or even just dumping the IR immediately after loading the input file inmlir-opt
...--iree-hal-dump-executable-sources-to
(https://github.com/openxla/iree/pull/15756/files#r1518324689). The reason is that that causes debug info to be reintroduced after the bug had already caused it to be dropped..codegen.bc
actually does have some limited debug info (on each function, but not on the ops inside each function). But the debug info somehow gets dropped inside ofmlir::translateModuleToLLVMIR
(https://github.com/openxla/iree/pull/15756/files#r1518324689). I haven't been able to debug exactly where, but that's the closest i've been able to get.The text was updated successfully, but these errors were encountered: