-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement goto in C #30
Comments
|
I wrote a blog post a couple months ago on the subject of how to handle http://jamey.thesharps.us/2016/05/how-to-eliminate-goto-statements.html This is not particularly easy to do, and I think it's going to be relatively low on my personal priorities, but if somebody else wants to tackle that challenge then I'd be happy to help! |
|
For anyone who is interested, this google search leads to a nice paper on the subject. I think I may give this a go at some point in the near future (as in the coming week or so). My approach will probably be to make a separate module |
|
Hi, I'm interested in tackling this, but I'm not sure how to start. I've written node splitting code in Python before (https://github.com/Storyyeller/Krakatau), but I don't have any experience with Haskell. As an example of C code where this would be useful, the library miniz.c uses unstructured switch statements and gotos. |
|
@harpocrates: You might consider doing it on the Rust AST rather than the C AST, because Rust's labeled loops let you translate goto statements that move outward, either forward or backward, in ways that C doesn't have syntax to express. But Pascal did, so here's a 1985 paper on the topic: "Eliminating go to's while Preserving Program Structure", Lyle Ramshaw @Storyyeller: Awesome! I'd suggest starting by building a data structure to represent the control-flow graph. Maybe take a look at the That said, I suspect you could make a pretty big contribution here just by writing up pseudocode as comments in this issue! Among the questions I have: In addition to keeping track of the control-flow graph, we also need to keep track of which variables are in scope for each basic-block, and ensure that those variables are still in scope wherever we wind up putting the basic blocks in the output. But ideally each variable would be in scope in exactly the basic blocks in which it was visible in the original program, because if we extend a variable's lifetime that could lead to either name-shadowing bugs or, eventually, triggering Drops at the wrong time. Is that possible, and what kind of data structure do we need to track that information? |
|
The pseudocode is pretty simple - find all strongly connected components - for each one, check if more than one node has incoming edges from nodes not in the scc. If so, chose one node as head and duplicate the other node in the scc, with one copy that is dominated by the head, and one copy that is not reachable from head. Then recurse into the processed nodes, finding all sccs when backedges to head are excluded and so on. As for variables, in Krakatau I dodged that issue by using SSA form and giving every basic block its own copy of all live variables. I suppose that's not an option when you're translating from C if you want to preserve some semblance of the original variable definitions. Then again, I'm not sure how goto interacts with variable scoping in the first place. |
|
P.S. C code doesn't have destructors, so drops aren't really an issue. |
|
Hmm, you're right, drops aren't an issue. I was thinking of a hypothetical future where some idiom-transformer pass replaces calls to You're right, I don't really want to go to SSA form since I want the generated Rust to look as much like the input C as possible. I don't know how goto/switch are supposed to interact with variable scoping either, and I can't find anything in C99 that gives me any hints. Hopefully somebody will come along and tell us. 😁 My guess is that initializers are evaluated when control visits the declaration, and that variables have indeterminate value when control exits the block in which they're declared. So if you By the way, I'd be tempted to begin with an implementation that rejects control-flow graphs which are not reducible, which would mean we never have to split nodes. One of the papers I cited in the blog post above reported that, across a range of open source projects, functions almost always had reducible control flow despite using plenty of My impression from the various papers I've read on this topic is that our first challenge is going to be to recognize a reasonable variety of control-flow patterns from an unstructured control-flow graph. e.g., "this subgraph looks like an |
|
miniz appears to be using irreducible control flow, so that's one example As for the rest, structuring is relatively easy if you don't care how ugly On Mon, Jul 11, 2016 at 8:26 PM, Jamey Sharp notifications@github.com
|
|
Thanks to @gsauthof for giving a solid citation to "C99 6.2.4 (5), especially the first and the last 2 sentences":
So stack-allocated variables are supposed to have storage for retaining their value as soon as the block in which they're declared is entered, and they lose their value only when that block is exited. If there's an initializer, then the variable initialization happens every time control visits the variable's declaration. Here's an example. int sum(int count)
{
goto a;
b:
--count;
goto d;
a: ;
int x = 0;
goto d;
c:
return x;
d:
if(count <= 0)
goto c;
goto e;
e:
x += count;
goto b;
}I'm impressed at how well GCC does on this, actually. Not so impressed with clang to be honest... So I think the right algorithm is something like this:
It's OK if the block contains other basic-blocks that weren't part of the variable's lifetime. C doesn't guarantee when the variable will lose its value, only when it will retain it. But this means that we may need to rename some variables that weren't in the same scope in C but get moved to the same scope in the structured Rust. In the common case where the function was already fully structured, this algorithm keeps each variable declaration in the same block it originally appeared in, because that is the smallest block which contains all the same code as the original. It would be nice to also preserve where the declaration is within the block, if it's nested in between other statements or something; maybe making the initialization explicit is enough to let us put the declaration in the same spot. Finally: it would be nice to inform the Rust compiler that a variable should be once again treated as not-yet-initialized, when a declaration with no initializer gets executed, so we can get warnings about uninitialized variables. |
|
I don't think there's much hope of preserving source level details in a On Tue, Jul 12, 2016 at 1:13 AM, Jamey Sharp notifications@github.com
|
|
On further thought, I think the only reasonable way to handle variable scopes is to hoist all (non-VLA) variables to function scope, renaming them to avoid name clashes if necessary. It's a bit ugly, but I don't see any other way to handle cases like the below At least the fact that C has no destructors means that increasing the scope of variables has no effect on behavior. You just have to rename them to avoid clashes when there are multiple variables of the same name, as above. |
|
Good example! The sketch of an algorithm I gave above does cover this case. It'll only produce one block, because there aren't any conditional branches in the CFG. If there's only one block, then of course both variables go into that block. And I did note that "we may need to rename some variables that weren't in the same scope in C but get moved to the same scope in the structured Rust." I would like to see your example translate to basically the same code that this one does, where the rule is that variable declarations are placed as late as possible while still preceding all their uses. #include <stdio.h>
int main(int argc, char *argv[]) {
int x1 = 4;
printf("x = %d, &x = %p\n", x1, (void*)&x1);
const char *x2;
x2 = "Bar!";
printf("x = %s, &x = %p\n", x2, (void*)&x2);
x1 = 32;
printf("x = %d, &x = %p\n", x1, (void*)&x1);
x2 = "Bar 3!";
printf("x = %s, &x = %p\n", x2, (void*)&x2);
x1 = -3;
printf("x = %d, &x = %p\n", x1, (void*)&x1);
return 0;
}Here are straw-man data types for control flow graphs, which I'm writing down in hopes it will clarify some of my thoughts. A function's control-flow graph is a collection of basic blocks, where each one is assigned a unique block ID, along with the block ID of the function's entry point. type BlockID = Integer
data CFG = CFG
{ entryBlock :: BlockID
, cfgBlocks :: [(BlockID, BasicBlock)]
}A basic block consists of zero or more statements, which run in order. Then the The basic block also references the list of local variables which are alive within this basic block. Rust data BasicBlock = BasicBlock
{ inScope :: [Rust.Var]
, blockBody :: [Statement]
, conditionalBranches :: [ConditionalBranch]
, elseBranch :: Terminator
}A statement is either a Rust expression, or a special marker that a variable is to be considered not-initialized after this point. Such a variable can be initialized with an assignment expression in a subsequent statement. data Statement
= Statement Rust.Expr
| Deinitialize Rust.VarA conditional branch evaluates a boolean-valued Rust expression, and if it evaluates to true, then branches to the given basic block. (As an alternative, data ConditionalBranch = ConditionalBranch
{ condition :: Rust.Expr
, branchTarget :: BlockID
}If we fall off the end of a basic block, we must either return from the function (making this an exit block) or branch to another basic block. data Terminator
= Goto BlockID
| Return (Maybe Rust.Expr)Each local C declaration would translate to
I wrote code that resembled some of this for a previous project, which may be helpful to refer to. It was specifically to transform a function with Python-style https://github.com/GaloisInc/ivory/blob/master/ivory/src/Ivory/Language/Coroutine.hs It's probably also worth comparing LLVM's CFG representation and Rust's MIR representation. I hope that was helpful...? |
|
Is there any reason to have the deinitialize statements? Why not just On Tue, Jul 12, 2016 at 8:26 PM, Jamey Sharp notifications@github.com
|
|
I started trying to understand the code, and there are some things I am confused by. Do you mind if I ask questions here, or is there a better place? For example, consider It is my understanding that this creates a linked list. That seems like it would be inefficient. Why not use a balanced tree instead? |
|
Sorry, @Storyyeller, I lost track of this thread! I have several answers to your question about linked-list performance. I kind of wish Corrode had a mailing list for these kinds of questions, but I'm not fond of the usual list hosting options so I guess we'll stick with with using GitHub issues. I'd probably suggest opening a new issue for discussing questions about the implementation, though; this isn't really related to the question of how to handle goto. Most important answer: I'm not concerned about efficiency in Corrode right now, and I won't worry about it until somebody's trying to translate a source file that's taking, I don't know, 30 seconds or more? I do try to avoid doing stupidly wasteful things, but in Corrode, given a choice between writing something simple or writing something that runs fast, I will pick the simple thing. 😄 I don't expect translating C to Rust to be a routine part of anyone's day, so if this kind of one-off activity is a little slow sometimes, that's better than making the code harder to understand for people who are trying to work on it. I feel like Corrode is faster than either a C or Rust compiler on the same source, anyway, which would be fair since it doesn't do any optimization or very much type-checking or linting. All that said, I did actually consider the performance implications of using a list for the environment instead of The trade-off, of course, is in symbol lookup, where lists are O(n) and maps are O(log(n)). But these are worst-case times. If the most-used symbols are within log(n) of the top of the environment, then the list still wins—and that seems like a plausible enough distribution for C environments that I suspected the gains in symbol insertion would dominate. That was partly based on experience with another project where I'd found that Data.Set was measurably slower than a list for a similar kind of use case. My final reason is that in Haskell (and several other functional languages), when you're not sure what type to use, you use a list. There are more built-in functions for working with lists than with anything else in these languages so it's the easiest choice. Generally, in these languages specifically, you should be prepared to offer some justification for why you aren't using a list. So, to recap: Using a list was the easy and idiomatic choice; I have an argument I found persuasive as to why it might be the higher-performance choice; and even if it isn't, I don't really care because there's no evidence yet that this particular choice is causing performance problems. Sorry that was a long answer; I thought it was an interesting question and I may have gone overboard. I hope it helped? Regarding my comments about deinitialize statements, I'm not sure what more I can say about the rationale:
|
|
I've pushed a first cut at generating control-flow graphs to a new
My implementation is not up to my preferred standard (it has zero documentation, for one thing) but I wanted to get something out for folks to play with. If you're interested in implementing One tricky thing: Since a You'll need to store a mapping from We'll still need to work on transforming the CFG to Rust control flow structures (I wrote some generic transformations already) and then ensuring that variables are declared appropriately (the current CFG code just drops variable declarations entirely). But those tasks are nicely separate from handling |
|
I concluded that there are too many issues with automatic C to Rust conversion and that manual rewrites are a better path forward, so I am unlikely to contribute to Corrode. Sorry. |
|
In case anyone's interested in working on |
The text was updated successfully, but these errors were encountered: