New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A Formal C Memory Model Supporting Integer-Pointer Casts #30

Open
nikomatsakis opened this Issue Sep 21, 2016 · 17 comments

Comments

Projects
None yet
10 participants
@nikomatsakis
Owner

nikomatsakis commented Sep 21, 2016

Links: PDF, ACM

This is a memory model targeting C. The C specification itself has very loose rules that don't support a lot of common things people do in C programs, primarily around using pointers as integers (e.g., bitmasking, etc).

The key ideas of the model are as follows:

  • When you allocate memory, it is initially assigned a logical identifier and (conceptually) has no physical address (yet).
  • Once a pointer is cast to an integer, the memory block it points at is assigned a physical address.
  • Once a memory block has a physical address, then it is much less optimizable. For example, if you cast from an integer i to a pointer p, you have to assume that accesses through p could affect any block with a physical address, but you know they can't affect anything that doesn't have a physical address.

Some examples from the paper:

Keeping local variables private (supported)

Consider this question:

int f() {
  int a = 0;
  g(); // Can `g()` observe/affect value of `a`?
  return a;
}

In this model, the answer is no: the allocation of a is private, and at no point did the code do something like (int) &a to convert it into a concrete block.

Keeping local allocations private (supported)

Consider this question:

p = malloc();
*p = 123;
bar(); // Can this affect `*p`?
a = *p;
hash_put(h, p, a);

In this model, the answer is no: the allocation of p is private, and at no point did the code do something like (int) p to convert it into a concrete block.

The paper calls this "ownership transfer". I've avoided this term because it's so different from what Rust means by this term. It's more like "retaining" or "respecting" ownership (of p).

The paper says that clang -O2 does this kind of optimization, but not gcc -O2, and claims therefore that it may not be that important.

Dead Cast Elimination (unsupported in the model)

Consider this question (not from paper):

foo(int *p) {
    int x = (int) p; // Can we drop this unused value?
    println("%d", *p);
}

In this model, we cannot eliminate (int) p even if the result is not used because it has a side-effect of (potentially) assigning p a concrete value.

The paper addresses this by doing this optimization later in the pipeline. Basically at the point where we drop this model and convert to something more concrete.

Example limitation

Here is an example of an optimization prohibited by this approach.

p = malloc(1);
*p = 123;
b = (int) p;
bar();
a = *p;
hash_put(h, b, a); // Can we optimize this to `hash_put(h, b, 123)`?

The answer is no: p was made "concrete", and hence may be affected by bar(). Interestingly, a slight reordering would enable the optimization:

p = malloc(1);
*p = 123;
bar(); // `p` is not concrete **yet**...
b = (int) p;
a = *p;
hash_put(h, b, a); // ...hence no opportunity between cast and here to modify `*p`
@arielb1

This comment has been minimized.

Show comment
Hide comment
@arielb1

arielb1 Sep 21, 2016

Collaborator

My notes:

First, keep in mind that most of the examples are equally valid when alloca (i.e. a local) is used instead of malloc, and therefore are important even if we don't care about optimizing malloc-heavy code.

Second, in Rust we probably want to have the physical addresses only be legal for limited periods of time. In fact, that basically results in a variant of the ACA model where all mallocs are asserted.

The paper says that clang -O2 does this kind of optimization, but not gcc -O2, and claims therefore that it may not be that important.

It might not be that important for performance, but LLVM will happily do the optimization for us, so it has to be a part of our memory model.

Collaborator

arielb1 commented Sep 21, 2016

My notes:

First, keep in mind that most of the examples are equally valid when alloca (i.e. a local) is used instead of malloc, and therefore are important even if we don't care about optimizing malloc-heavy code.

Second, in Rust we probably want to have the physical addresses only be legal for limited periods of time. In fact, that basically results in a variant of the ACA model where all mallocs are asserted.

The paper says that clang -O2 does this kind of optimization, but not gcc -O2, and claims therefore that it may not be that important.

It might not be that important for performance, but LLVM will happily do the optimization for us, so it has to be a part of our memory model.

@Ericson2314

This comment has been minimized.

Show comment
Hide comment
@Ericson2314

Ericson2314 Sep 21, 2016

I need to read the paper more closely, but I wonder whether it was considered to do dataflow analysis on cast(ed) pointers---i.e. once a pointer is casted, only functions that get the cast-result integer or the original pointer are allowed/assumed to make arbitrary reads/writes to the block in which the pointer points.

(On the Rust side of things, I've similar thought about doing dataflow analysis on "poisoned" unsafe values.)

Ericson2314 commented Sep 21, 2016

I need to read the paper more closely, but I wonder whether it was considered to do dataflow analysis on cast(ed) pointers---i.e. once a pointer is casted, only functions that get the cast-result integer or the original pointer are allowed/assumed to make arbitrary reads/writes to the block in which the pointer points.

(On the Rust side of things, I've similar thought about doing dataflow analysis on "poisoned" unsafe values.)

@arielb1

This comment has been minimized.

Show comment
Hide comment
@arielb1

arielb1 Sep 22, 2016

Collaborator

Memory models don't do dataflow. Also, any attempt at interesting analysis runs into the problem that optimizers will optimize x-x to 0.

Collaborator

arielb1 commented Sep 22, 2016

Memory models don't do dataflow. Also, any attempt at interesting analysis runs into the problem that optimizers will optimize x-x to 0.

@jeehoonkang

This comment has been minimized.

Show comment
Hide comment
@jeehoonkang

jeehoonkang Sep 22, 2016

Hello, I am an author of the paper, and would like to participate in this discussion. I am really glad that the Rust community is interested in the paper!

  • @Ericson2314 @arielb1 I would like to add my two cents on the dataflow analysis. The main reason we didn't do dataflow analysis is its complexity: it is really hard to track the leakage of pointer-to-integer casting information. Even, we have to consider not only explicit but also implicit leakage. For e.g. consider the following code (assume a pointer is 32bit):

      int *p = malloc(1<<31 + 1);
      uintptr_t pi = (uintptr_t) p;
      *((int *)(1 <<31)) = 42; // undefined behavior?
    

    In this example, the information that 1 << 31 is a valid pointer is implicitly leaked to the store, even if the store is not directly related to the malloc and the pointer-to-integer cast. This example is really contrived and impractical, but it shows that it will be somewhat complicated to define the precise meaning of information leakage.

  • @arielb1 May I ask what is the ACA model? Google didn't give me useful information..

  • I would like to understand why the Rust community is interested in the casts between integers and pointers: isn't it (mem::transmute) contained in the unsafe Rust?

  • It is not directly related to this issue ;-) but I would like to advertise that recently my colleagues and I wrote a paper (under submission) on C/C++ relaxed-memory concurrency model: http://sf.snu.ac.kr/promise-concurrency/ Hope it also be interesting to the Rust community!

jeehoonkang commented Sep 22, 2016

Hello, I am an author of the paper, and would like to participate in this discussion. I am really glad that the Rust community is interested in the paper!

  • @Ericson2314 @arielb1 I would like to add my two cents on the dataflow analysis. The main reason we didn't do dataflow analysis is its complexity: it is really hard to track the leakage of pointer-to-integer casting information. Even, we have to consider not only explicit but also implicit leakage. For e.g. consider the following code (assume a pointer is 32bit):

      int *p = malloc(1<<31 + 1);
      uintptr_t pi = (uintptr_t) p;
      *((int *)(1 <<31)) = 42; // undefined behavior?
    

    In this example, the information that 1 << 31 is a valid pointer is implicitly leaked to the store, even if the store is not directly related to the malloc and the pointer-to-integer cast. This example is really contrived and impractical, but it shows that it will be somewhat complicated to define the precise meaning of information leakage.

  • @arielb1 May I ask what is the ACA model? Google didn't give me useful information..

  • I would like to understand why the Rust community is interested in the casts between integers and pointers: isn't it (mem::transmute) contained in the unsafe Rust?

  • It is not directly related to this issue ;-) but I would like to advertise that recently my colleagues and I wrote a paper (under submission) on C/C++ relaxed-memory concurrency model: http://sf.snu.ac.kr/promise-concurrency/ Hope it also be interesting to the Rust community!

@arielb1

This comment has been minimized.

Show comment
Hide comment
@arielb1

arielb1 Sep 22, 2016

Collaborator

Thanks for coming to talk to us @jeehoonkang. We'll take a look at the relaxed-memory paper, but currently we have enough trouble getting the single-threaded semantics right. @orilahav's previous taming-acquire-release paper was excellent, so I hope this one is similarly good.

First, the ACA and capability models are our currently very-informal-and-incomplete candidate models (see #26 and #28). It would be nice if someone in the know came to look and them.

I don't like a "dataflow"-style model for the same reason as you. However, C has its odd "counterfactual" definition for restrict - a pointer expression E is said to be based on object P if ... modifying P ... would change the value of E - which is quite vague (did someone come up with some less-vague version?), and LLVM has its own set of vague noalias semantics, which we have to conform with.

I would like to understand why the Rust community is interested in the casts between integers and pointers: isn't it (mem::transmute) contained in the unsafe Rust?

That's the entire point of the memory model: If you keep to hereditarily-safe Rust, you could just have the "fully symbolic" memory model. Correctly-written unsafe Rust is supposed to be UB-free, and we are trying to find a good enough definition of UB that programmers, sanitizers and optimizers can agree on.

Also, in Rust, the widely-used raw pointers are pretty much interconvertible with integers (at the very least, int<->raw-ptr casts are safe and therefore must never be UB). The ACA model's idea is to handle raw pointers just like integers, which works moderately well.

If we can tame LLVM (it's not much use having a memory model that your optimizer misoptimizes by design!), I think that we can have something that handles raw pointers and integers separately.

Collaborator

arielb1 commented Sep 22, 2016

Thanks for coming to talk to us @jeehoonkang. We'll take a look at the relaxed-memory paper, but currently we have enough trouble getting the single-threaded semantics right. @orilahav's previous taming-acquire-release paper was excellent, so I hope this one is similarly good.

First, the ACA and capability models are our currently very-informal-and-incomplete candidate models (see #26 and #28). It would be nice if someone in the know came to look and them.

I don't like a "dataflow"-style model for the same reason as you. However, C has its odd "counterfactual" definition for restrict - a pointer expression E is said to be based on object P if ... modifying P ... would change the value of E - which is quite vague (did someone come up with some less-vague version?), and LLVM has its own set of vague noalias semantics, which we have to conform with.

I would like to understand why the Rust community is interested in the casts between integers and pointers: isn't it (mem::transmute) contained in the unsafe Rust?

That's the entire point of the memory model: If you keep to hereditarily-safe Rust, you could just have the "fully symbolic" memory model. Correctly-written unsafe Rust is supposed to be UB-free, and we are trying to find a good enough definition of UB that programmers, sanitizers and optimizers can agree on.

Also, in Rust, the widely-used raw pointers are pretty much interconvertible with integers (at the very least, int<->raw-ptr casts are safe and therefore must never be UB). The ACA model's idea is to handle raw pointers just like integers, which works moderately well.

If we can tame LLVM (it's not much use having a memory model that your optimizer misoptimizes by design!), I think that we can have something that handles raw pointers and integers separately.

@RalfJung

This comment has been minimized.

Show comment
Hide comment
@RalfJung

RalfJung Sep 22, 2016

Collaborator

I feel like we have to treat raw pointers and integers separately, because LLVM does. Yes, you can cast them, but for example LLVM feels free to reorder accesses to pointers it deems "noaliasing" even if, after casting them to integers, they compare equal.

Collaborator

RalfJung commented Sep 22, 2016

I feel like we have to treat raw pointers and integers separately, because LLVM does. Yes, you can cast them, but for example LLVM feels free to reorder accesses to pointers it deems "noaliasing" even if, after casting them to integers, they compare equal.

@jugglerchris

This comment has been minimized.

Show comment
Hide comment
@jugglerchris

jugglerchris Sep 22, 2016

I'm a bit of an outsider, but I think it would be a shame if the rules for writing unsafe code ended up being compromised by the current LLVM (which I consider an implementation detail) at the expense of ease of understanding (for "unsafe" programmers) or possible future optimisations (for future Rust implementations, or updated LLVM).

jugglerchris commented Sep 22, 2016

I'm a bit of an outsider, but I think it would be a shame if the rules for writing unsafe code ended up being compromised by the current LLVM (which I consider an implementation detail) at the expense of ease of understanding (for "unsafe" programmers) or possible future optimisations (for future Rust implementations, or updated LLVM).

@gilhur

This comment has been minimized.

Show comment
Hide comment
@gilhur

gilhur Sep 22, 2016

@arielb1 We have a formal model for "noalis" in LLVM that can be very well explained in our pointer-integer casting model (as long as our understanding of no alias is correct). I or Jeehoon will get back with more details tmr because it is midnight in Korea :)

gilhur commented Sep 22, 2016

@arielb1 We have a formal model for "noalis" in LLVM that can be very well explained in our pointer-integer casting model (as long as our understanding of no alias is correct). I or Jeehoon will get back with more details tmr because it is midnight in Korea :)

@arielb1

This comment has been minimized.

Show comment
Hide comment
@arielb1

arielb1 Sep 22, 2016

Collaborator

I feel like we have to treat raw pointers and integers separately, because LLVM does. Yes, you can cast them, but for example LLVM feels free to reorder accesses to pointers it deems "noaliasing" even if, after casting them to integers, they compare equal.

You can compare pointers even without casting them to integers.

Collaborator

arielb1 commented Sep 22, 2016

I feel like we have to treat raw pointers and integers separately, because LLVM does. Yes, you can cast them, but for example LLVM feels free to reorder accesses to pointers it deems "noaliasing" even if, after casting them to integers, they compare equal.

You can compare pointers even without casting them to integers.

@arielb1

This comment has been minimized.

Show comment
Hide comment
@arielb1

arielb1 Sep 22, 2016

Collaborator

I'm a bit of an outsider, but I think it would be a shame if the rules for writing unsafe code ended up being compromised by the current LLVM (which I consider an implementation detail) at the expense of ease of understanding (for "unsafe" programmers) or possible future optimisations (for future Rust implementations, or updated LLVM).

Our memory model does need to be compatible with LLVM's or it won't be worth much.

Collaborator

arielb1 commented Sep 22, 2016

I'm a bit of an outsider, but I think it would be a shame if the rules for writing unsafe code ended up being compromised by the current LLVM (which I consider an implementation detail) at the expense of ease of understanding (for "unsafe" programmers) or possible future optimisations (for future Rust implementations, or updated LLVM).

Our memory model does need to be compatible with LLVM's or it won't be worth much.

@jugglerchris

This comment has been minimized.

Show comment
Hide comment
@jugglerchris

jugglerchris Sep 22, 2016

Our memory model does need to be compatible with LLVM's or it won't be worth much.

Yes, of course. Are changes/enhancements to LLVM to support a better overall Rust memory model an option that would be considered? I understand that resources are limited for such work.

jugglerchris commented Sep 22, 2016

Our memory model does need to be compatible with LLVM's or it won't be worth much.

Yes, of course. Are changes/enhancements to LLVM to support a better overall Rust memory model an option that would be considered? I understand that resources are limited for such work.

@sanxiyn

This comment has been minimized.

Show comment
Hide comment
@sanxiyn

sanxiyn Sep 28, 2016

@gilhur Can you tell us more about this formal model for noalias in LLVM?

sanxiyn commented Sep 28, 2016

@gilhur Can you tell us more about this formal model for noalias in LLVM?

@nikomatsakis

This comment has been minimized.

Show comment
Hide comment
@nikomatsakis

nikomatsakis Sep 29, 2016

Owner

Just FYI, I took a stab at summarizing the "promising semantics" paper in #32, though I don't think I did it justice. =) Still digesting.

Owner

nikomatsakis commented Sep 29, 2016

Just FYI, I took a stab at summarizing the "promising semantics" paper in #32, though I don't think I did it justice. =) Still digesting.

@burdges

This comment has been minimized.

Show comment
Hide comment
@burdges

burdges Sep 30, 2016

Interesting SOPS'13 paper on analyzing optimization-unstable code : Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior (SOSP'13) by Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Armando Solar-Lezama.

burdges commented Sep 30, 2016

Interesting SOPS'13 paper on analyzing optimization-unstable code : Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior (SOSP'13) by Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Armando Solar-Lezama.

@gilhur

This comment has been minimized.

Show comment
Hide comment
@gilhur

gilhur Oct 4, 2016

@sanxiyn The formal model for noalias in LLVM essentially requires the notion of logical memory, which is the underlying model for CompCert and also for our ptr-int casting model. I am a bit hesitant to explain the idea here because we haven't published it yet. I will write a short tech report about it and give a link to it here. But please wait until the PLDI deadline in Nov because I am too busy until then.

gilhur commented Oct 4, 2016

@sanxiyn The formal model for noalias in LLVM essentially requires the notion of logical memory, which is the underlying model for CompCert and also for our ptr-int casting model. I am a bit hesitant to explain the idea here because we haven't published it yet. I will write a short tech report about it and give a link to it here. But please wait until the PLDI deadline in Nov because I am too busy until then.

@Zoxc

This comment has been minimized.

Show comment
Hide comment
@Zoxc

Zoxc Jul 23, 2017

@gilhur I would like some information about your model for noalias/pointer aliasing in LLVM, especially for things relating to pointer equality and pointer comparisons.

Zoxc commented Jul 23, 2017

@gilhur I would like some information about your model for noalias/pointer aliasing in LLVM, especially for things relating to pointer equality and pointer comparisons.

@gilhur

This comment has been minimized.

Show comment
Hide comment
@gilhur

gilhur Jul 28, 2017

@Zoxc We are currently working on developing an LLVM model for handling pointers including casting between pointers and integers, and pointer equality/comparison. We also update the LLVM compiler accordingly including the alias analysis module. Probably we can make some written documents available within a few months. I will notify you here when it is out.

gilhur commented Jul 28, 2017

@Zoxc We are currently working on developing an LLVM model for handling pointers including casting between pointers and integers, and pointer equality/comparison. We also update the LLVM compiler accordingly including the alias analysis module. Probably we can make some written documents available within a few months. I will notify you here when it is out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment