implement goto in C #30

Closed
meagon opened this Issue Jul 10, 2016 · 18 comments

meagon commented Jul 10, 2016

extern int printf (const char *format, ...);
extern int getchar(void);

int main(void)
{
    int n=0;
    printf("input a string :\n");
loop: if(getchar()!='\n')
      {
          n++;
          goto loop;
      }
      printf("length is %d\n",n);
}

corrode: ("goto.c": line 10): Corrode doesn't handle this yet:
    loop:
        if (getchar() != '\n')
        {
            n++;
            goto loop;
        }

Owner

jameysharp commented Jul 10, 2016

I wrote a blog post a couple months ago on the subject of how to handle goto:

http://jamey.thesharps.us/2016/05/how-to-eliminate-goto-statements.html

This is not particularly easy to do, and I think it's going to be relatively low on my personal priorities, but if somebody else wants to tackle that challenge then I'd be happy to help!

harpocrates commented Jul 11, 2016

For anyone who is interested, this Google search leads to a nice paper on the subject.

I think I may give this a go at some point in the near future (as in the coming week or so). My approach will probably be to make a separate module Language.C.Analysis.Goto that will preprocess the C source itself. That way, testing/using this will be pretty distinct from the rest of corrode.

Storyyeller commented Jul 11, 2016

Hi, I'm interested in tackling this, but I'm not sure how to start. I've written node splitting code in Python before (https://github.com/Storyyeller/Krakatau), but I don't have any experience with Haskell.

As an example of C code where this would be useful, the library miniz.c uses unstructured switch statements and gotos.

jameysharp commented Jul 11, 2016

@harpocrates: You might consider doing it on the Rust AST rather than the C AST, because Rust's labeled loops let you translate goto statements that move outward, either forward or backward, in ways that C doesn't have syntax to express. But Pascal did, so here's a 1985 paper on the topic:

"Eliminating go to's while Preserving Program Structure", Lyle Ramshaw
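As a rough illustration of the labeled-loop point, here is how the backward goto from the program at the top of this issue could be written by hand in Rust. This is a sketch, not Corrode output: the name line_length and the byte-slice input (standing in for repeated getchar() calls) are assumptions made to keep it self-contained, and unlike the C original it stops at end of input.

```rust
/// Counts the bytes before the first newline, mirroring the `goto loop`
/// program from the top of this issue.
fn line_length(input: &[u8]) -> usize {
    let mut bytes = input.iter();
    let mut n = 0;
    // `'top` plays the role of the C label `loop:`.
    'top: loop {
        if bytes.next().map_or(false, |&b| b != b'\n') {
            n += 1;
            continue 'top; // the backward `goto loop;`
        }
        break 'top; // fall through to the code after the `if`
    }
    n
}
```

Calling line_length(b"hello\nworld") walks the first five bytes and stops at the newline. A later clean-up pass could presumably turn this into a plain while loop.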

@Storyyeller: Awesome! I'd suggest starting by building a data structure to represent the control-flow graph. Maybe take a look at the Language.Rust.AST module for examples of how to declare data types in Haskell.

That said, I suspect you could make a pretty big contribution here just by writing up pseudocode as comments in this issue!

Among the questions I have: In addition to keeping track of the control-flow graph, we also need to keep track of which variables are in scope for each basic-block, and ensure that those variables are still in scope wherever we wind up putting the basic blocks in the output. But ideally each variable would be in scope in exactly the basic blocks in which it was visible in the original program, because if we extend a variable's lifetime that could lead to either name-shadowing bugs or, eventually, triggering Drops at the wrong time. Is that possible, and what kind of data structure do we need to track that information?

Storyyeller commented Jul 11, 2016

The pseudocode is pretty simple:

  • find all strongly connected components (SCCs);
  • for each one, check whether more than one node has incoming edges from nodes outside the SCC;
  • if so, choose one node as the head and duplicate the other nodes in the SCC, with one copy that is dominated by the head and one copy that is not reachable from the head;
  • then recurse into the processed nodes, finding all SCCs when back edges to the head are excluded, and so on.
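The detection half of that pseudocode (finding components with more than one entry) can be sketched as follows. This is an illustration only: the adjacency-list Graph type and the function names are hypothetical, it assumes every block is reachable from the entry, and it uses a simple reachability test per node instead of a linear-time SCC algorithm such as Tarjan's. The node-splitting step itself is not shown.

```rust
use std::collections::BTreeSet;

/// Adjacency list: `succ[i]` lists the successors of block `i`.
type Graph = Vec<Vec<usize>>;

/// Blocks reachable from `start` by following one or more edges.
fn reachable(succ: &Graph, start: usize) -> BTreeSet<usize> {
    let mut seen = BTreeSet::new();
    let mut stack = succ[start].clone();
    while let Some(n) = stack.pop() {
        if seen.insert(n) {
            stack.extend(succ[n].iter().copied());
        }
    }
    seen
}

/// For each cyclic strongly connected component, returns the set of blocks
/// that are entered from outside it. A set with more than one element marks
/// a component that would need node splitting.
fn scc_entries(succ: &Graph, entry: usize) -> Vec<BTreeSet<usize>> {
    let reach: Vec<BTreeSet<usize>> =
        (0..succ.len()).map(|n| reachable(succ, n)).collect();
    let mut assigned = vec![false; succ.len()];
    let mut result = Vec::new();
    for head in 0..succ.len() {
        // Only blocks that can reach themselves lie on a cycle.
        if assigned[head] || !reach[head].contains(&head) {
            continue;
        }
        // The component of `head`: everything it reaches that reaches it back.
        let scc: BTreeSet<usize> = reach[head]
            .iter()
            .copied()
            .filter(|&m| reach[m].contains(&head))
            .collect();
        for &m in &scc {
            assigned[m] = true;
        }
        let mut entries = BTreeSet::new();
        // The function entry counts as being entered from outside.
        if scc.contains(&entry) {
            entries.insert(entry);
        }
        for (from, targets) in succ.iter().enumerate() {
            if scc.contains(&from) {
                continue;
            }
            for &to in targets {
                if scc.contains(&to) {
                    entries.insert(to);
                }
            }
        }
        result.push(entries);
    }
    result
}
```

For an ordinary loop (0 -> 1 -> 2 -> 1, 2 -> 3) the single component {1, 2} has one entry, block 1. For the classic irreducible shape (0 -> 1, 0 -> 2, 1 -> 2, 2 -> 1) both 1 and 2 are entries.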

As for variables, in Krakatau I dodged that issue by using SSA form and giving every basic block its own copy of all live variables. I suppose that's not an option when you're translating from C if you want to preserve some semblance of the original variable definitions. Then again, I'm not sure how goto interacts with variable scoping in the first place.

Storyyeller commented Jul 11, 2016

P.S. C code doesn't have destructors, so drops aren't really an issue.

jameysharp commented Jul 12, 2016

Hmm, you're right, drops aren't an issue. I was thinking of a hypothetical future where some idiom-transformer pass replaces calls to free and other destructor-like functions with implicit drops. But such a transformation would have to fix up block scopes so the drops happen at the right time anyway, so anything we do now doesn't make that any worse. Never mind!

You're right, I don't really want to go to SSA form since I want the generated Rust to look as much like the input C as possible.

I don't know how goto/switch are supposed to interact with variable scoping either, and I can't find anything in C99 that gives me any hints. Hopefully somebody will come along and tell us. 😁 My guess is that initializers are evaluated when control visits the declaration, and that variables have indeterminate value when control exits the block in which they're declared. So if you goto out of a block and then back into it, none of its variables are initialized any more. Or so I'd expect.

By the way, I'd be tempted to begin with an implementation that rejects control-flow graphs which are not reducible, which would mean we never have to split nodes. One of the papers I cited in the blog post above reported that, across a range of open source projects, functions almost always had reducible control flow despite using plenty of goto statements.

My impression from the various papers I've read on this topic is that our first challenge is going to be to recognize a reasonable variety of control-flow patterns from an unstructured control-flow graph. e.g., "this subgraph looks like an if statement; that subgraph looks like a while loop."

Storyyeller commented Jul 12, 2016

miniz appears to be using irreducible control flow, so that's one example where this is useful.

As for the rest, structuring is relatively easy if you don't care how ugly the resulting code is. Once you have a reducible graph, you can just turn every node with backedges into an infinite loop, every conditional node into an if statement, and use labeled breaks for forward jumps. Obviously, to make the results reasonable, you'll probably want to do some additional sugarings. For example, convert loop { if x { break } ... } to while !x { ... }.
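To make the sugaring step concrete, here is a small hand-written sketch of the same function before and after that rewrite. The function (counting decimal digits) and both names are made up for illustration; the thing to note is that the loop condition is the negation of the break condition.

```rust
/// Naive structuring: the node with a back edge becomes an infinite `loop`
/// and the exit edge becomes a `break`.
fn digits_naive(mut n: u32) -> u32 {
    let mut count = 0;
    loop {
        if n == 0 {
            break;
        }
        n /= 10;
        count += 1;
    }
    count
}

/// After sugaring: `loop { if c { break } body }` becomes `while !c { body }`.
fn digits_sugared(mut n: u32) -> u32 {
    let mut count = 0;
    // `n != 0` is the negated break condition from the naive version.
    while n != 0 {
        n /= 10;
        count += 1;
    }
    count
}
```

Both versions return the same result for every input, which is the property any such clean-up pass has to preserve.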

jameysharp commented Jul 12, 2016

Thanks to @gsauthof for giving a solid citation (https://twitter.com/golfmikesierra/status/752746518923321344) to "C99 6.2.4 (5), especially the first and the last 2 sentences":

For such an object ..., its lifetime extends from entry into the block with which it is associated until execution of that block ends in any way. ... If an initialization is specified for the object, it is performed each time the declaration is reached in the execution of the block; otherwise, the value becomes indeterminate each time the declaration is reached.

So stack-allocated variables are supposed to have storage for retaining their value as soon as the block in which they're declared is entered, and they lose their value only when that block is exited. If there's an initializer, then the variable initialization happens every time control visits the variable's declaration.

Here's an example.

int sum(int count)
{
  goto a;

b:
  --count;
  goto d;

a: ;
  int x = 0;
  goto d;

c:
  return x;

d:
  if(count <= 0)
    goto c;
  goto e;

e:
  x += count;
  goto b;
}

I'm impressed at how well GCC does on this, actually. Not so impressed with clang to be honest...

Compiler Explorer link: https://godbolt.org/g/yNrYlZ

So I think the right algorithm is something like this:

  • translate local-variable initialization as an assignment statement, so we remember exactly what point in the control flow to perform initialization at;
  • associate each variable with a set of basic-blocks that came from anywhere inside the compound statement where the variable was declared;
  • restructure the control-flow graph to use properly nested loops and if statements and all that;
  • and declare each variable in the smallest generated block that contains all of the basic-blocks which were part of the variable's lifetime.

It's OK if the block contains other basic-blocks that weren't part of the variable's lifetime. C doesn't guarantee when the variable will lose its value, only when it will retain it. But this means that we may need to rename some variables that weren't in the same scope in C but get moved to the same scope in the structured Rust.
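Applied by hand to the sum example above, those four steps might produce something like the following Rust. This is only a sketch of the intended result, not output from Corrode, and the i32 types are an assumption about how int would be translated.

```rust
fn sum(mut count: i32) -> i32 {
    // Declared in the smallest block that covers every basic block in which
    // the C variable `x` was in scope (here, the whole function body).
    let mut x: i32;
    // Label a: the initializer becomes an ordinary assignment.
    x = 0;
    // Label d: the loop test, with the `goto c` exit folded into the condition.
    while count > 0 {
        x += count; // label e
        count -= 1; // label b
    }
    x // label c
}
```

For example, sum(4) adds 4 + 3 + 2 + 1, and any non-positive count falls straight through to the return.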

In the common case where the function was already fully structured, this algorithm keeps each variable declaration in the same block it originally appeared in, because that is the smallest block which contains all the same code as the original. It would be nice to also preserve where the declaration is within the block, if it's nested in between other statements or something; maybe making the initialization explicit is enough to let us put the declaration in the same spot.

Finally: it would be nice to inform the Rust compiler that a variable should be once again treated as not-yet-initialized, when a declaration with no initializer gets executed, so we can get warnings about uninitialized variables. std::mem::uninitialized (https://doc.rust-lang.org/std/mem/fn.uninitialized.html) is kind of the opposite, since that pretends a variable has been initialized when it actually hasn't been. I don't know how to do this.

Storyyeller commented Jul 12, 2016

I don't think there's much hope of preserving source level details in a
transformation like this. Perhaps it would be better to have an option to
only run the transformation when gotos or unstructured switch statements
are present in a function.

Storyyeller commented Jul 13, 2016

On further thought, I think the only reasonable way to handle variable scopes is to hoist all (non-VLA) variables to function scope, renaming them to avoid name clashes if necessary. It's a bit ugly, but I don't see any other way to handle cases like the one below:


#include <stdio.h>

int main(int argc, char *argv[]) {
  {
    int x = 4;

    printf("x = %d, &x = %p\n", x, (void*)&x);
    goto bar;

    bar2:;
    x = 32;
    printf("x = %d, &x = %p\n", x, (void*)&x);
    goto bar3;

    bar4:;
    x = -3;
    printf("x = %d, &x = %p\n", x, (void*)&x);
  }
  return 0;

  {
    const char *x;

    bar:;
    x = "Bar!";
    printf("x = %s, &x = %p\n", x, (void*)&x);
    goto bar2;

    bar3:;
    x = "Bar 3!";
    printf("x = %s, &x = %p\n", x, (void*)&x);
    goto bar4;
  }
}

At least the fact that C has no destructors means that increasing the scope of variables has no effect on behavior. You just have to rename them to avoid clashes when there are multiple variables of the same name, as above.

jameysharp commented Jul 13, 2016

Good example!

The sketch of an algorithm I gave above does cover this case. It'll only produce one block, because there aren't any conditional branches in the CFG. If there's only one block, then of course both variables go into that block. And I did note that "we may need to rename some variables that weren't in the same scope in C but get moved to the same scope in the structured Rust."

I would like to see your example translate to basically the same code that this one does, where the rule is that variable declarations are placed as late as possible while still preceding all their uses.

#include <stdio.h>

int main(int argc, char *argv[]) {
    int x1 = 4;
    printf("x = %d, &x = %p\n", x1, (void*)&x1);

    const char *x2;
    x2 = "Bar!";
    printf("x = %s, &x = %p\n", x2, (void*)&x2);

    x1 = 32;
    printf("x = %d, &x = %p\n", x1, (void*)&x1);

    x2 = "Bar 3!";
    printf("x = %s, &x = %p\n", x2, (void*)&x2);

    x1 = -3;
    printf("x = %d, &x = %p\n", x1, (void*)&x1);

    return 0;
}
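For comparison, a hand-written Rust counterpart of that straightened-out C might look like the sketch below. It is not Corrode output: the function name trace is hypothetical, the lines are collected into a Vec instead of being printed, and the address part of each printf is left out.

```rust
/// Replays the five prints from the example, in execution order.
fn trace() -> Vec<String> {
    let mut out = Vec::new();

    let mut x1: i32 = 4;
    out.push(format!("x = {}", x1));

    // Declared without an initializer, as late as possible before first use.
    let mut x2: &str;
    x2 = "Bar!";
    out.push(format!("x = {}", x2));

    x1 = 32;
    out.push(format!("x = {}", x1));

    x2 = "Bar 3!";
    out.push(format!("x = {}", x2));

    x1 = -3;
    out.push(format!("x = {}", x1));

    out
}
```

The two C variables named x end up in one Rust scope, so they are renamed x1 and x2 exactly as in the C version above.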

Here are straw-man data types for control flow graphs, which I'm writing down in hopes it will clarify some of my thoughts.

A function's control-flow graph is a collection of basic blocks, where each one is assigned a unique block ID, along with the block ID of the function's entry point.

type BlockID = Integer
data CFG = CFG
    { entryBlock :: BlockID
    , cfgBlocks :: [(BlockID, BasicBlock)]
    }

A basic block consists of zero or more statements, which run in order. Then the conditionalBranches are tried in order, and the elseBranch is used if none of them match.

The basic block also references the list of local variables which are alive within this basic block. Rust let-bindings are not given explicitly within a basic block. Instead, a variable's binding is inserted into the smallest block which contains all basic blocks that had that variable in scope, and is placed immediately before the first statement which contains a reference to that variable.

data BasicBlock = BasicBlock
    { inScope :: [Rust.Var]
    , blockBody :: [Statement]
    , conditionalBranches :: [ConditionalBranch]
    , elseBranch :: Terminator
    }

A statement is either a Rust expression, or a special marker that a variable is to be considered not-initialized after this point. Such a variable can be initialized with an assignment expression in a subsequent statement.

data Statement
    = Statement Rust.Expr
    | Deinitialize Rust.Var

A conditional branch evaluates a boolean-valued Rust expression, and if it evaluates to true, then branches to the given basic block. (As an alternative, branchTarget could be a Terminator instead of a BlockID.)

data ConditionalBranch = ConditionalBranch
    { condition :: Rust.Expr
    , branchTarget :: BlockID
    }

If we fall off the end of a basic block, we must either return from the function (making this an exit block) or branch to another basic block.

data Terminator
    = Goto BlockID
    | Return (Maybe Rust.Expr)
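To check that these types hang together, here is a rough Rust rendering of the same shapes with a tiny interpreter, run on the sum example from earlier in the thread. Everything beyond the field names is an assumption for illustration: Statement and Condition are minimal stand-ins for Rust.Expr, variables live in a HashMap, and inScope and Deinitialize are left out.

```rust
use std::collections::HashMap;

type BlockId = usize;
type Env = HashMap<&'static str, i64>;

// Hypothetical stand-ins for Rust.Expr, just rich enough to encode `sum`.
#[derive(Clone, Copy)]
enum Statement {
    Set(&'static str, i64),             // var = constant
    AddVar(&'static str, &'static str), // dst += src
    Decrement(&'static str),            // var -= 1
}

#[derive(Clone, Copy)]
enum Condition {
    AtMostZero(&'static str), // var <= 0
}

enum Terminator {
    Goto(BlockId),
    Return(&'static str), // return the value of a variable
}

struct BasicBlock {
    block_body: Vec<Statement>,
    conditional_branches: Vec<(Condition, BlockId)>,
    else_branch: Terminator,
}

struct Cfg {
    entry_block: BlockId,
    cfg_blocks: HashMap<BlockId, BasicBlock>,
}

/// Runs the body, tries each conditional branch in order, and falls back to
/// the else branch if none of them matches.
fn run(cfg: &Cfg, env: &mut Env) -> i64 {
    let mut current = cfg.entry_block;
    loop {
        let block = &cfg.cfg_blocks[&current];
        for &stmt in &block.block_body {
            match stmt {
                Statement::Set(var, value) => {
                    env.insert(var, value);
                }
                Statement::AddVar(dst, src) => {
                    let value = env[src];
                    *env.get_mut(dst).unwrap() += value;
                }
                Statement::Decrement(var) => *env.get_mut(var).unwrap() -= 1,
            }
        }
        let taken = block
            .conditional_branches
            .iter()
            .find(|&&(cond, _)| match cond {
                Condition::AtMostZero(var) => env[var] <= 0,
            });
        current = match (taken, &block.else_branch) {
            (Some(&(_, target)), _) => target,
            (None, &Terminator::Goto(target)) => target,
            (None, &Terminator::Return(var)) => return env[var],
        };
    }
}

/// The `sum` example as a graph: 0 = a, 1 = d, 2 = c, 3 = e followed by b.
fn sum_cfg() -> Cfg {
    let mut cfg_blocks = HashMap::new();
    cfg_blocks.insert(0, BasicBlock {
        block_body: vec![Statement::Set("x", 0)],
        conditional_branches: vec![],
        else_branch: Terminator::Goto(1),
    });
    cfg_blocks.insert(1, BasicBlock {
        block_body: vec![],
        conditional_branches: vec![(Condition::AtMostZero("count"), 2)],
        else_branch: Terminator::Goto(3),
    });
    cfg_blocks.insert(2, BasicBlock {
        block_body: vec![],
        conditional_branches: vec![],
        else_branch: Terminator::Return("x"),
    });
    cfg_blocks.insert(3, BasicBlock {
        block_body: vec![Statement::AddVar("x", "count"), Statement::Decrement("count")],
        conditional_branches: vec![],
        else_branch: Terminator::Goto(1),
    });
    Cfg { entry_block: 0, cfg_blocks }
}
```

With count bound to 4 in the environment, run(&sum_cfg(), &mut env) visits block 3 four times and returns through block 2.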

Each local C declaration would translate to Statement (Rust.Assign ...) if it has an initializer, or to Deinitialize otherwise. That serves two purposes:

  1. It preserves information about how variable declarations were interleaved with the rest of the code. If we follow the rule I proposed above that variable declarations are placed as late as possible, then code that was already structured will be unchanged by this transformation, aside perhaps from variable renaming.
  2. It expresses C99's rule that a variable becomes re-initialized every time its initializer is reached, or becomes indeterminate if it does not have an initializer.

I wrote code that resembled some of this for a previous project, which may be helpful to refer to. It was specifically to transform a function with Python-style yield statements into pure C, so some of what it does isn't relevant, but it had to operate on a control-flow graph with a basic-block representation much like this one.

https://github.com/GaloisInc/ivory/blob/master/ivory/src/Ivory/Language/Coroutine.hs

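It's probably also worth comparing LLVM's CFG representation (http://llvm.org/docs/LangRef.html#functionstructure) and Rust's MIR representation (https://manishearth.github.io/rust-internals-docs/rustc/mir/repr/struct.BasicBlockData.html).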

I hope that was helpful...?

Storyyeller commented Jul 13, 2016

Is there any reason to have the deinitialize statements? Why not just
record which variables were originally in scope for each basic block?

Storyyeller commented Jul 21, 2016

I started trying to understand the code, and there are some things I am confused by. Do you mind if I ask questions here, or is there a better place?

For example, consider
type Environment = [(IdentKind, (Rust.Mutable, CType))]

It is my understanding that this creates a linked list. That seems like it would be inefficient. Why not use a balanced tree instead?

jameysharp commented Jul 31, 2016

Sorry, @Storyyeller, I lost track of this thread!

I have several answers to your question about linked-list performance. I kind of wish Corrode had a mailing list for these kinds of questions, but I'm not fond of the usual list hosting options so I guess we'll stick with using GitHub issues. I'd probably suggest opening a new issue for discussing questions about the implementation, though; this isn't really related to the question of how to handle goto.

Most important answer: I'm not concerned about efficiency in Corrode right now, and I won't worry about it until somebody's trying to translate a source file that's taking, I don't know, 30 seconds or more?

I do try to avoid doing stupidly wasteful things, but in Corrode, given a choice between writing something simple or writing something that runs fast, I will pick the simple thing. 😄 I don't expect translating C to Rust to be a routine part of anyone's day, so if this kind of one-off activity is a little slow sometimes, that's better than making the code harder to understand for people who are trying to work on it.

I feel like Corrode is faster than either a C or Rust compiler on the same source, anyway, which would be fair since it doesn't do any optimization or very much type-checking or linting.

All that said, I did actually consider the performance implications of using a list for the environment instead of Data.Map, which is Haskell's standard balanced tree. Although I haven't measured it, this may be one of the few cases where a linked list can be faster than another data structure. When we leave a scope, we have to remove all the declarations that were local to that scope from the environment. Because both data structures are immutable and "persistent", we can do this by saving a reference to the old environment and restoring it on leaving the scope, so scope exit is O(1), aside from later GC costs. But with a list, adding symbols is O(1) in both time and space, while with a map, it's O(log(n)) in both time and space, which also adds to GC pressure. So for those activities, the list is the clear winner.

The trade-off, of course, is in symbol lookup, where lists are O(n) and maps are O(log(n)). But these are worst-case times. If the most-used symbols are within log(n) of the top of the environment, then the list still wins—and that seems like a plausible enough distribution for C environments that I suspected the gains in symbol insertion would dominate. That was partly based on experience with another project where I'd found that Data.Set was measurably slower than a list for a similar kind of use case.
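The same trade-off can be sketched outside Haskell. The following Rust model of a persistent list environment is purely illustrative (the Env type, bind, and lookup are hypothetical names, and Corrode's real environment stores different data): binding allocates one cell, leaving a scope means going back to the handle you held before, and lookup walks from the most recent binding.

```rust
use std::rc::Rc;

/// A persistent (immutable, shared-tail) symbol environment.
enum Env {
    Empty,
    Cons(&'static str, i32, Rc<Env>),
}

/// O(1): allocates one cell that points at the existing environment.
fn bind(env: &Rc<Env>, name: &'static str, value: i32) -> Rc<Env> {
    Rc::new(Env::Cons(name, value, Rc::clone(env)))
}

/// O(n) in the worst case, but recently bound names are found first.
fn lookup(env: &Env, name: &str) -> Option<i32> {
    let mut current = env;
    loop {
        match current {
            Env::Empty => return None,
            Env::Cons(bound, value, rest) => {
                if *bound == name {
                    return Some(*value);
                }
                current = &**rest;
            }
        }
    }
}
```

If outer binds x and inner is built from outer with a shadowing x, looking up x through inner finds the inner value, while outer is untouched and can simply be used again once the inner scope ends.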

My final reason is that in Haskell (and several other functional languages), when you're not sure what type to use, you use a list. There are more built-in functions for working with lists than with anything else in these languages so it's the easiest choice. Generally, in these languages specifically, you should be prepared to offer some justification for why you aren't using a list.

So, to recap: Using a list was the easy and idiomatic choice; I have an argument I found persuasive as to why it might be the higher-performance choice; and even if it isn't, I don't really care because there's no evidence yet that this particular choice is causing performance problems.

Sorry that was a long answer; I thought it was an interesting question and I may have gone overboard. I hope it helped?

Regarding my comments about deinitialize statements, I'm not sure what more I can say about the rationale:

That serves two purposes:

  • It preserves information about how variable declarations were interleaved with the rest of the code. If we follow the rule I proposed above that variable declarations are placed as late as possible, then code that was already structured will be unchanged by this transformation, aside perhaps from variable renaming.
  • It expresses C99's rule that a variable becomes re-initialized every time its initializer is reached, or becomes indeterminate if it does not have an initializer.

jameysharp commented Oct 28, 2016

I've pushed a first cut at generating control-flow graphs to a new cfg branch. Commit 5fbe21c:

  • demonstrates constructing a CFG from the language-c AST,
  • demonstrates pattern-matching and rewriting that CFG for simplifications and other transformations,
  • and provides corrode-cfg, a command-line tool to inspect the CFG produced for a given C source file.

My implementation is not up to my preferred standard (it has zero documentation, for one thing) but I wanted to get something out for folks to play with.

If you're interested in implementing goto, you could try extending walkStatement in src/Language/Rust/Corrode/CFG/C.hs to handle the CLabel and CGoto AST constructors. Just constructing a correct control-flow graph for these statements is a reasonable challenge by itself, so I'd suggest doing only that much as a single pull request.

One tricky thing: Since a goto statement can jump either forward or backward within a function, we could either encounter a label before any goto that uses it, or encounter the goto before the label. Fortunately, you can call newLabel from CFG.hs before you know what statements should go in the new basic block. So the first time you encounter the identifier of a label, whether that's in CLabel or CGoto, you can allocate a basic block Label for it.

You'll need to store a mapping from Ident to Label in a State monad so you can keep track of whether you're seeing an identifier for the first time. I recommend using the combined RWS monad, similar to what src/Language/Rust/Corrode/C.md does, replacing the Reader monad I'm using now. (You won't need the Writer monad for this, but handling switch statements will need it; see #65.)
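The "allocate on first sight" bookkeeping is language-independent, so here is a small sketch of it in Rust. It is not the Haskell code from the cfg branch: LabelMap and label_for are hypothetical names, plain integers stand in for basic-block Labels, and a mutable struct plays the role of the State monad.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the Ident-to-Label state described above.
#[derive(Default)]
struct LabelMap {
    next_label: usize,
    labels: HashMap<String, usize>,
}

impl LabelMap {
    /// Returns the basic-block label for a C label name, allocating one the
    /// first time the name is seen, whether at a `goto` or at the label itself.
    fn label_for(&mut self, ident: &str) -> usize {
        if let Some(&label) = self.labels.get(ident) {
            return label;
        }
        let label = self.next_label;
        self.next_label += 1;
        self.labels.insert(ident.to_string(), label);
        label
    }
}
```

A forward goto done; therefore gets the same label as the later done: statement, regardless of which of the two the walk reaches first.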

We'll still need to work on transforming the CFG to Rust control flow structures (I wrote some generic transformations already) and then ensuring that variables are declared appropriately (the current CFG code just drops variable declarations entirely). But those tasks are nicely separate from handling goto or switch.

Storyyeller commented Oct 29, 2016

I concluded that there are too many issues with automatic C to Rust conversion and that manual rewrites are a better path forward, so I am unlikely to contribute to Corrode. Sorry.

jameysharp commented Nov 5, 2016

In case anyone's interested in working on goto support: In commit a834b82 on the cfg branch, I've integrated the control-flow graph logic into the main C.md, instead of the separate CFG/C.hs module I had it in before. You should be able to add support for labels and CGoto statements to interpretStatement.