Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dealing with abnormal selection pattern #26

Open
pfalcon opened this issue Jan 22, 2018 · 18 comments
Open

Dealing with abnormal selection pattern #26

pfalcon opened this issue Jan 22, 2018 · 18 comments

Comments

@pfalcon
Copy link
Owner

pfalcon commented Jan 22, 2018

Reference: https://academic.oup.com/comjnl/article-pdf/20/1/45/975771/200045.pdf

Abnormal selection pattern is an if/if-else pattern, where branch(es) are not dominated by the pattern header. In other words, it's an if/if-else, where something else jumps directly into "true" or "false" branch, besides the "if" condition checking code itself.

In the real code, this pattern occurs quite often, because it's pretty low-hanging optimization for code like:

if (cond1) goto l1
...
l1:
$r0 = 0
return
...
if (cond2) goto l2
...
l2:
$r0 = 0
return

It's only natural to transform it to:

if (cond1) goto l2
...
if (cond2) goto l2
...
l2:
$r0 = 0
return
@pfalcon
Copy link
Owner Author

pfalcon commented Jan 22, 2018

Copied from #13 (comment):

Currently, SABl doesn't test for abnormal selection thoroughly, which leads to incorrect decompilation, e.g. https://github.com/pfalcon/ScratchABlock/blob/master/tests/decomp/_xtos_set_interrupt_handler_arg/_xtos_set_interrupt_handler_arg.lst.exp.clean

Generally, there's a gap with true structuredness and C-structuredness, which is more expressive. It may seem that it applies to loops, but it turns out it applies even to if too. "True" structuredness is when each structure element has one entry and one exit, up to a complete function. C-structuredness allow multiple exits (but not entries).

For example, following nice-looking C if is only nice-looking thanks to 2 returns:

if (cond1) {
  // do sth
  if (cond2) {
    // do sth
    return 1;
  }
}
return 0;

That however contains abnormal selection. Converted to fully structured shape, it's more verbose and less nicely looking.

So, the point is that SABl so far follows "fully structured" way, which means boring splitting nodes to resolve abnormal selection. But to get nice concise C-structuredness, that will later need to be reverted ;-I.

Or maybe not, maybe we can detect ladder-style abnormal selection (when one branch is dominated by header, and another - not) and apply multiple-return transformation. It's hard to decide what to do when and even harder to actually do and prove that it's right.

So, apparently stick to fully structured way for now. At least that makes it MISRA-compliant decompiler, LOL.

@maximumspatium
Copy link
Contributor

Below a couple of unsorted thoughts on the topic:

From what I've read so far, the current approach to CFG structuring is searching for predefined node configurations and to transform the corresponding CFG regions to high-level structures. If no match has been found, node splitting is performed. Selecting nodes to be split seems to be not trivial. Several algorithms have been proposed in the past.

Notably, two papers propose some different approaches to CFG structuring:

  • No More Gotos claims to be able to structure arbitrary CFGs in a "pattern-independent" way, that is, without searching the CFG for predefined control structures. The proposed algorithm searches for regions satisfying special criteria instead. Abnormal selections will be converted to the "single entry, single exit" form by means of so-called semantic-preserving transformations that can be understand as high-level node splitting.
  • Structuring 2-way Branches in Binary Executables proposes an interesting method for structuring 2-way branches based on the observation that high-level structures will result in a (complex) 2-way branches patterns that can be captured by compound branch and cascade branch subgraphs. To cope with real-world programs, the authors propose to structure CFGs based on these subgraph types derived from the underlying CFG.

Anyway, CFGs of real-world programs often don't satisfy the restriction "single entry, single exit". My recent decompilation target (pure C) has shown that anything more complex than a single loop or a couple of ifs requires some kind of handling of abnormal selection paths. The question is what method should be implemented.

I noticed that the common case of abnormal selection path is caused by return statements inside a control structure. We may start by investigating such cases first. I'll try to hack a program for searching for abnormal selections and to sample their CFGs...

@maximumspatium
Copy link
Contributor

maximumspatium commented Nov 7, 2018

Abnormal selection pattern is an if/if-else pattern, where branch(es) are not dominated by the pattern header. In other words, it's an if/if-else, where something else jumps directly into "true" or "false" branch, besides the "if" condition checking code itself.

The above mentioned definition seems to be valid for a subset of abnormal selection paths unless I'm missing something. What's about return statements inside of if/else:

if (a == 1) {
    return -1;
} else {
    process();
}

The return statement in the code above will be transformed either to an immediate return node (i.e. node without successors) or to a jump to another return node. Both of them introduce abnormal selection paths.

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 7, 2018

The above mentioned definition seems to be valid for a subset of abnormal selection paths unless I'm missing something.

So, as you know, I didn't look into this for quite a while. So, it's a chance to check if years of previous reading left any mark, or I'm as tabula-rasa as before, and be shamed for talking nonsense.

So, my answer is no, the definition above is necessary and sufficient definition of abnormal selection (even more specifically, there're 2 definitions, and they're equivalent).

if (a == 1) {
    return -1;
} else {
    process();
}

My response is that nothing's abnormal here, the graph is well-structured.

The return statement in the code above will be transformed either to an immediate return node (i.e. node without successors) or to a jump to another return node.

So, we (well, algorithms) work on normalized graphs, where there's a single entry/single exit indeed. This is semantically equivalent to "immediate return node (i.e. node without successors)". But when it's formulated on single-exit graph, it becomes clear that nothing abnormal is there.

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 7, 2018

Abnormal selection pattern is an if/if-else pattern, where branch(es) are not dominated by the pattern header. In other words, it's an if/if-else, where something else jumps directly into "true" or "false" branch, besides the "if" condition checking code itself.

So, this definitions leads 2 to ways to detect abnormal selection nodes:

  1. We indeed can walk "if" nodes and check that their successors are dominated by the if nodes.
  2. Can try to cut the need for dominators computation, and analyze just edges instead. Out-edges from "if" node should be labeled with a condition (including special "anything else" condition encoded as None). While normal edges should not have any labels. Then, can iterate over all nodes, checking if in-edges is a mix of labeled and unlabeled ones.

Does algo 2 work as expected? Not quite, and we found a bug in the original "equivalent" definition too. Consider a do-while node. So, there's header node, and we normally arrive to it via normal flow, unlabeled edge. Then later we have a condition node which loops back to header. As it's condition, it has labeled edges.

So, need to adjust 2nd part of definition:

"In other words, it's an if/if-else, which has forward-edges into its "false"/"true" branches, and something else jumps directly into "true" or "false" branch either. Back-edges are loops and are ok."

Formulating initial definition in terms of dominance helped us catch that bug - indeed, dominance is quite related to DFS numbering, and thus criterion for being a back-edge.

So, comparing these 2 algos with the current code of match_abnormal_sel, shows how naive the latter is ;-). And as it warns, it's context-dependent, while algos above should be context-free, and thus move the entire structuring system to be "finite Church-Rosser". Actually, we can just apply abnormal selection detection in the very beginning and fix up the graph once and for all.

Action items: implement both algos, compare the results, be ready to eat my hat ;-).

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 9, 2018

Currently, SABl doesn't test for abnormal selection thoroughly, which leads to incorrect decompilation, e.g. https://github.com/pfalcon/ScratchABlock/blob/master/tests/decomp/_xtos_set_interrupt_handler_arg/_xtos_set_interrupt_handler_arg.lst.exp.clean

Anyway, for now we need to solve this specific issue, so in 5709440 I pushed solution I drafted yet then in January.

@maximumspatium
Copy link
Contributor

Action items: implement both algos, compare the results, be ready to eat my hat ;-).

I'll give it a try...

@maximumspatium
Copy link
Contributor

Algo 1 (with dominators computation) is sketched in daf0f4f

Out-edges from "if" node should be labeled with a condition (including special "anything else" condition encoded as None). While normal edges should not have any labels.

How can I distinguish "normal" edges from "anything else" condition edges in an unprocessed CFG? It seems that only IFs "true" edges have the "cond" property attached while "false" edges don't have "cond".

@maximumspatium
Copy link
Contributor

maximumspatium commented Nov 12, 2018

For your _xtos_set_interrupt_handler_arg example, a better structuring strategy seems to emit multiply return statements:

void _xtos_set_interrupt_handler_arg()
{
  if ((i32)$a2_0 < 0 || (i32)$a2_0 > 14 || *(u8*)($a2_0 + xtos_int_level_table) >= 0x3)
    return 0;
  $a6 = *(u32*)(_xtos_interrupt_table + -($a2_0 * 8) + 0x70);
  if ($a3_0 == 0) {
    *(u32*)(_xtos_interrupt_table + -($a2_0 * 8) + 0x70) = _xtos_unhandled_interrupt;
    *(u32*)(_xtos_interrupt_table + -($a2_0 * 8) + 0x74) = $a2_0;
  } else {
    *(u32*)(_xtos_interrupt_table + -($a2_0 * 8) + 0x70) = $a3_0;
    *(u32*)(_xtos_interrupt_table + -($a2_0 * 8) + 0x74) = $a4_0;
  }
  return ($a6 != _xtos_unhandled_interrupt ? *(u32*)(_xtos_interrupt_table + -($a2_0 * 8) + 0x70) : 0);
}

That's what I'd do as human being. The two conditional assignments to $a2 have been converted to the ternary operator; the identical return statements in the both if/else branches have been moved out and collapsed by means of common subexpression elimination. This produces a much better code.

The funny part is that code now closely resembles its original 😃

What transformation in SABl funnels all return nodes to a single one? Can that be reverted later to produce a nicer output?

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 12, 2018

How can I distinguish "normal" edges from "anything else" condition edges in an unprocessed CFG? It seems that only IFs "true" edges have the "cond" property attached while "false" edges don't have "cond".

Really? That sucks, because I thought they have "cond", just set to None. Looking at parser.py, what you says seems to be true. Well, could add a pass to make it like that, then of course would need to make sure it doesn't break anything. (It shouldn't, because "cond" should be queried as .get("cond"). But then would need to grep for cases where it isn't.)

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 12, 2018

a better structuring strategy seems to emit multiply return statements:

Sure, that's quite obvious. (Except should be configurable. Or specifically, should prefer multiple return's to goto's, but should allow to prefer multiple returns or single exit otherwise.)

What transformation in SABl funnels all return nodes to a single one?

So, I did my part - different transformations are now properly (almost!) categorized to allow to look up them easily. Now your part to eyeball them to know which exist, and know answers to questions like above.

Can that be reverted later to produce a nicer output?

Yep, it needs to be. Just need to figure out when (also mostly obvious) and how (match them afterwards). But before that, needs to refactor the stuff, which I'm doing. And fix various goofs, which are many, e.g. 8ad0352. Whole DFS numbering is hosed (and not only in my code, in the literature many obvious-to-mention things are skipped which gives an impression of "well-kept secret" to them). (Of course another explanation is that it's so trivial that everyone understands it reading "between the lines", except for stupid me, lol.)

@maximumspatium
Copy link
Contributor

Whole DFS numbering is hosed (and not only in my code, in the literature many obvious-to-mention things are skipped which gives an impression of "well-kept secret" to them).

Oops! I'll take a look in my books tonight. I'm eager to see how it was implemented elsewhere. Some books provide detailed algos as pseudo code...

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 12, 2018

Algos is not everything. Overall concept description (i.e. how different concepts mix and match (and don't mix)) is what important. Ok, if you're eager to look in books, here's quiz for you: what's common (and uncommon) between DFS number and preorder and postorder traversals?

@maximumspatium
Copy link
Contributor

what's common (and uncommon) between DFS number and preorder and postorder traversals?

What do you mean with "DFS number"? Iterating over the graph nodes and assigning an increasing unique number to each node when it's first visited?

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 13, 2018

Well, I would mostly mean the same what others mean (google up "dfs number"?), modulo misunderstandings and goofs on my side. But the latter largely depends on how others define it, and not on its own, but in the constellation of related concepts. Again, the question is "How DFS number and preorder and postorder traversals of a graph relate, if in any way at all?". I don't see that question be clearly answered in majority of texts. Take as example 1997 Muchnick - in his presentation those are 3 notions, without clear definition of their exact relation.

four graph-theoretic concepts that are important to several of
the algorithms we use below

the number assigned to each node in a depth-first search is the node’s depth-first number

the second and third notions we need are two traversals of the nodes of a rooted, directed graph and the orders they induce in the set of nodes of the graph

preorder traversal

postorder traversal

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 13, 2018

Anyway, quiz time is over, fix time is on. The whole story is here: 14b4344

@maximumspatium
Copy link
Contributor

Well, I would mostly mean the same what others mean (google up "dfs number"?)

G@0gle search doesn't reveal anything useful or even related to graphs, hence the question. Moreover, I don't remember seeing the term "dfs number" anywhere else except some tutorials on DFS where each graph node has a number assigned to it, mainly for a better visualization and referencing.

If "DFS number is the order in which the vertices are explored" (a definition found somewhere in the web), then the answer to the question "How DFS number and preorder and postorder traversals of a graph relate, if in any way at all?" is simple: DFS number is just another name for pre-order.

Regarding the preorder / postorder traversals of a digraph, Muchnick's book gives two definitions:

preorder traversal of an acyclic digraph is a traversal in which each node is processed before its descendants while postorder traversal of the same graph is in which each node is processed after its descendants.

While this theoretical definition looks fine and dandy, I've learned elsewhere that there are at least two major problems with DFS traversals over digraphs:

  • there is no clear order in which DFS will explore graph edges so >1 DFS traversal of the same graph can exist. I remember seeing a graph in an online tutorial where some other node visiting order causes node's children to be visited before their parent during the pre-order traversal. This makes pre-order traversal alone practically useless. In contrast, post-order and reverse post-order traversals are useful in many algorithms.
  • the above mentioned book as well as many others speak solely about acyclic digraphs. Processing of cyclic digraphs often isn't mentioned at all because cycles represent another huge obstacle to overcome.

@pfalcon
Copy link
Owner Author

pfalcon commented Nov 14, 2018

the answer to the question "How DFS number and preorder and postorder traversals of a graph relate, if in any way at all?" is simple: DFS number is just another name for pre-order.

Well, once you know the right answer (and the right answer went in the commit linked), it's always easy to lay any theory behind it ;-). The matter is again, to be able to navigate the ocean of concepts and be able to to tell how some of them relate and how some differ. And it's easy to mix up things. And so far I didn't meet someone who can unmix up them in "real time" (i.e. max in days' worth of time after being made, not months' or years').

node's children to be visited before their parent during the pre-order traversal.

A note from someone who's sharpening his eye: they never use terms like "children" for arbitrary graph. There, we have "predecessors" and "successors". "Children"/"descendants" are reserved for tree-like structures (e.g. spanning tree of a graph).

This makes pre-order traversal alone practically useless. In contrast, post-order and reverse post-order traversals are useful in many algorithms.

Good note. That's of course why SABl so far had only postorder numbers. That's also why I initially wanted to rush to add them on detecting the goof discussed, but decided against (again, so far), to avoid sweeping changes on the tests. It's of course used too (e.g. edge classification), and will be added on real demand.

Sweeping changes was always a major concern to me. I now try to downprioritize it - there're so many things to change, many bugs to fix, and even their number of goofs to facepalm thru. And the testsuite is at something of it now, so refactoring is easy. (And so, more sweeping changes to come.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants