DO CONCURRENT might be broken #62

Open
certik opened this issue Nov 1, 2019 · 86 comments
Labels
Clause 11 Standard Clause 11: Execution control

Comments

@certik
Member

certik commented Nov 1, 2019

@klausler reported in #60 (comment):

DO CONCURRENT is fundamentally broken. It only guarantees to the compiler that the iterations of the loop can be run in any serial order. Its default localization rule (any variable read in an iteration will see the most recent value written in the same iteration, if any) prevents straightforward parallel execution.

Let's discuss that here. @klausler, can you work with @gklimowicz to fix that? Gary has some proposals regarding "do concurrent". Or is there no way to fix this issue?

@certik certik mentioned this issue Nov 1, 2019
@klausler

klausler commented Nov 1, 2019

See https://j3-fortran.org/doc/year/19/19-134.txt , which describes the problem and suggests a solution. The Committee "deferred" this problem and hasn't wanted to discuss it since.

@certik
Member Author

certik commented Nov 4, 2019

@klausler thanks a lot for submitting the proposal! Thanks also for commenting under the other issues. I am really sorry the committee didn't discuss this. Did they provide any feedback at all?

@klausler

klausler commented Nov 5, 2019

I received no official response.

@certik
Member Author

certik commented Nov 5, 2019

@klausler thank you. As a member of the committee, I apologize. This is unacceptable to me and I am trying to convince the committee that we need to consider every proposal that gets officially submitted (even if for just 5 to 10 minutes). In fact I was at the February 2019 meeting, but I don't recall what happened to your paper, as that was my first meeting and I was just trying to figure out how the committee works. Now that we have this GitHub repository, I plan to track every technical comment the committee makes in issues.

If it makes you feel any better, the committee didn't consider my proposal either (in #1). I know it happened to others too. I feel it's very inefficient, because had the committee provided feedback to you, you could have submitted a better paper for the October 2019 meeting, and so on, and we could have had this feature in a much more "ready" shape.

@sblionel, here is an example of a proposal that I suggest the committee spend 5 to 10 minutes at plenary to discuss, after which I volunteer to summarize the feedback in this issue. I still think this would be the most efficient approach. But I would be fine with the alternative that multiple committee members provide feedback here directly in the issue and the committee does not officially consider any such proposals until later, as that would still be an improvement and it would move us in the right direction.

@klausler

klausler commented Nov 5, 2019

My problem, as an implementor, is that Fortran users expect from its name that DO CONCURRENT is sufficient to guarantee to the compiler that loop iterations can be safely executed in parallel; but the standard only suffices to guarantee that the iterations can be executed in arbitrary serial order. What are my options? (1) just parallelize it anyway; (2) provide an option or directive or extension to force parallelization in the face of unresolvable localizations; (3) emit a warning about unresolvable localizations and recommend the use of OpenMP directives instead? People expect DO CONCURRENT to just mean "do concurrently", and it needs to be fixed or replaced so that it does.

@certik
Member Author

certik commented Nov 5, 2019

@klausler I agree. Let's discuss it further. Your proposal has an example:

SUBROUTINE FOO(N, A, B, T, K, L)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: N, K(N), L(N)
  REAL, INTENT(IN) :: A(N)
  REAL, INTENT(OUT) :: B(N)
  REAL, INTENT(INOUT) :: T(N)
  INTEGER :: J
  DO CONCURRENT (J=1:N)
    T(K(J)) = A(J)
    B(J) = T(L(J))
  END DO
END SUBROUTINE FOO

How would this be written if your proposal were accepted? Using the LOCAL specifier?

@klausler

klausler commented Nov 5, 2019

I should correct this case; I intend T() to be dead after the loop.

The validity of this loop depends on the values of K() and L(). Today, this loop is valid if the elements of K() are pairwise distinct from each other and from the noncorresponding elements of L() (obviously), and also in the case that K(J)==L(J) for all J. Even the extreme case of K(J)==L(J)==CONSTANT is fine.

I would restrict DO CONCURRENT to be valid only under the first condition. There wouldn't be recourse to a LOCAL clause. If you want the compiler to automatically localize a temporary variable, it would have to be an obvious whole variable, not an array element or pointer target or anything else whose identity could not be known until execution.

@certik
Member Author

certik commented Nov 5, 2019

@klausler let's write down the actual examples, it will be much easier for me (and I am sure others) to follow the arguments.

1. K = [1, 2, 3]
   L = [4, 5, 6]
2. K = [1, 2, 3]
   L = [1, 2, 3]
3. K = [1, 1, 1]
   L = [1, 1, 1]

Currently 1., 2., and 3. are allowed. You are proposing for 1. and 2. to be allowed, but 3. to be forbidden (I corrected this sentence based on the comment below). Given that this depends on the values of the arrays K and L, which are only known at runtime, how would the compiler enforce it?

Isn't the idea of do concurrent that the user is responsible for ensuring that the loops are independent, which in many cases (such as this one) is impossible to do for the compiler, but quite possible for the user, since he knows his problem?

@klausler

klausler commented Nov 5, 2019

@klausler let's write down the actual examples, it will be much easier for me (and I am sure others) to follow the arguments.

1. K = [1, 2, 3]
   L = [4, 5, 6]
2. K = [1, 2, 3]
   L = [1, 2, 3]
3. K = [1, 1, 1]
   L = [1, 1, 1]

Currently 1., 2., and 3. are allowed. You are proposing for 1. to be allowed, but 2. and 3. to be forbidden. Given that this depends on the values of the arrays K and L, which are only known at runtime, how would the compiler enforce it?

Isn't the idea of do concurrent that the user is responsible for ensuring that the loops are independent, which in many cases (such as this one) is impossible to do for the compiler, but quite possible for the user, since he knows his problem?

No, I'm proposing only that your third case be disallowed. And I don't know what the original idea behind DO CONCURRENT was, but I think that the users and implementors that I deal with would agree with your characterization of the obligations that it places on the compiler and programmer (basically the same as an OpenMP parallel loop). The bug in DO CONCURRENT is that the compiler has to allow for your third case and make it work, either by running the code serially or detecting it and handling it dynamically.

EDIT: Your second case is fine as written; change it to K==L==[1,1,2] and it would be a better example of the condition I had in mind.

@certik
Member Author

certik commented Nov 5, 2019

I corrected my comment above to match what you are proposing.

Regarding your last edit K==L==[1,1,2], let's call this a case 4. Isn't case 4 the same as case 3 that should be disallowed? I am still a bit confused.

@klausler

klausler commented Nov 5, 2019

When all of the elements of K() are pairwise distinct, and distinct from all of the noncorresponding elements of L(), the loop is safe to parallelize.

The problem is that DO CONCURRENT also permits elements of K() to match each other, so long as K(J)==L(J). (The case of all elements of K() and L() being the same value is a degenerate case of this second situation.)

So K==[1,2,3],L==[4,5,6] is fine; so are K==[1,2,3],L==[1,5,6] and K==[1,2,3],L==[1,2,3].

But K==[1,1,2],L==[1,1,2] should not be; nor should K==[1,1,1],L==[1,1,1], and these are the cases that I think should be invalidated.
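
The first condition above can be checked at run time with a brute-force scan. The following is only a sketch of that check: the module name `dc_check` and the function name `safe_to_parallelize` are hypothetical, not part of any compiler or of the 19-134 paper, and it assumes K and L have the same size.

```fortran
module dc_check
  implicit none
contains
  ! Hypothetical helper: .true. when every K(i) differs from every other
  ! K(j) and from every noncorresponding L(j), i.e. when no iteration of
  ! the FOO loop stores to an element of T() that another iteration
  ! stores to or loads from.
  pure logical function safe_to_parallelize(k, l) result(ok)
    integer, intent(in) :: k(:), l(:)   ! assumed to have equal size
    integer :: i, j
    ok = .true.
    do i = 1, size(k)
      do j = 1, size(k)
        if (i == j) cycle
        if (k(i) == k(j) .or. k(i) == l(j)) then
          ok = .false.
          return
        end if
      end do
    end do
  end function safe_to_parallelize
end module dc_check

program demo
  use dc_check
  implicit none
  ! The five index patterns discussed in this thread.
  print *, safe_to_parallelize([1, 2, 3], [4, 5, 6])   ! T
  print *, safe_to_parallelize([1, 2, 3], [1, 5, 6])   ! T
  print *, safe_to_parallelize([1, 2, 3], [1, 2, 3])   ! T
  print *, safe_to_parallelize([1, 1, 2], [1, 1, 2])   ! F
  print *, safe_to_parallelize([1, 1, 1], [1, 1, 1])   ! F
end program demo
```

The scan is quadratic in N, which illustrates why detecting the problematic cases dynamically is not free for a compiler.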

@certik
Member Author

certik commented Nov 5, 2019

Thank you. I think it is clear now which cases are allowed and which ones are not.

Let's get some feedback from other members of the committee.

@sblionel, do you know what the original idea behind do concurrent was and if what we are proposing here is aligned with it? See @klausler's previous comment: #62 (comment), and the rest of the discussion here.

If other members of the committee agree that this should be fixed, then the next step is to update the proposal. I am happy to help.

@sblionel
Member

sblionel commented Nov 5, 2019

This discussion puzzles me somewhat, as I don't agree with some of the assertions made. However, this is not my area of expertise and I would prefer to see opinions of committee members more versed in parallelism, such as Bill Long of Cray.

DO CONCURRENT was designed as a replacement for F95's FORALL as it was determined that FORALL's semantics made parallelism very difficult. The whole idea of DO CONCURRENT is that the iterations can be executed in any order and to any degree of parallelism, as the user promises there are no cross-iteration dependencies. There are already Fortran implementations that successfully parallelize DO CONCURRENT (Intel and probably Cray), so I don't really understand what the problem is. DO CONCURRENT was mainly modeled on OpenMP PARALLEL DO, especially with the F18 additions of locality clauses.

I see that Peter's paper got "deferred" at the February 2019 meeting and not taken up again. I'm not a good person to discuss this with.

@klausler

klausler commented Nov 5, 2019

I suspect that the language in what is now 11.1.7.5 para. 4 (first bullet) was intended to handle cases like:

DO CONCURRENT (J=1:N)
  TMP = ...
  ... = TMP
END DO

by requiring the processor to automatically localize the obvious temporary TMP. Unfortunately, the language is vague enough to also cover the cases that I've mentioned above where the "variable" in question can't be identified at compilation time. It doesn't make a difference if the iterations of the loop are executed in some arbitrary serial order, but it does matter when parallelizing (or failing to parallelize because of the requirement in 11.1.7.5 para. 4, first bullet).

(The specific language reads: If a variable has unspecified locality, • if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates; ...)
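
For the simple temporary case, Fortran 2018 also lets the programmer state the localization explicitly instead of relying on the rule quoted above. A minimal sketch, assuming a compiler that accepts F2018 locality specifiers; the program name, array names, and loop body are made up for illustration:

```fortran
program local_tmp
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n), tmp
  integer :: j
  a = [(real(j), j = 1, n)]
  ! LOCAL(tmp) makes tmp a separate construct entity in each iteration,
  ! so no iteration can observe a value of tmp stored by another one.
  do concurrent (j = 1:n) local(tmp)
    tmp = 2.0 * a(j)
    b(j) = tmp + 1.0
  end do
  print *, b
end program local_tmp
```

This works because `tmp` is a whole named variable. The same clause cannot name `T(K(J))`, whose identity is not known until execution.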

@klausler

klausler commented Dec 2, 2019

This discussion puzzles me somewhat, as I don't agree with some of the assertions made. However, this is not my area of expertise and I would prefer to see opinions of committee members more versed in parallelism, such as Bill Long of Cray.

DO CONCURRENT was designed as a replacement for F95's FORALL as it was determined that FORALL's semantics made parallelism very difficult. The whole idea of DO CONCURRENT is that the iterations can be executed in any order and to any degree of parallelism, as the user promises there are no cross-iteration dependencies. There are already Fortran implementations that successfully parallelize DO CONCURRENT (Intel and probably Cray), so I don't really understand what the problem is.

I will try to make this as clear as I possibly can, using the example that I presented earlier.

subroutine foo(N, A, B, T, K, L)
  implicit none
  integer, intent(in) :: N, K(N), L(N)
  real, intent(in) :: A(N)
  real, intent(out) :: B(N)
  real, intent(inout) :: T(N)
  integer :: J
  do concurrent (J=1:N)
    ! During execution, K(J) and L(J) are both always 1.  So the store
    ! and load to/from T() always affect T(1) in each iteration.  Since
    ! T(1) is defined in each iteration before it is referenced, this
    ! program conforms with F2008 and F2018.
    T(K(J)) = A(J)
    B(J) = T(L(J)) ! must be A(J) whenever K(J)==L(J)
  end do
end subroutine foo

This subroutine complies with all of the constraints and "shalls" in the Fortran 2018 standard that pertain to DO CONCURRENT. But it cannot be executed in parallel and produce correct results.

This is because the DO CONCURRENT construct, despite its name, imposes restrictions on the program that are sufficient to guarantee that the iterations of the loop may be run in any sequential order. The restrictions necessary to guarantee safe execution in arbitrary sequential order are not sufficient to guarantee safe execution in parallel.

When ifort parallelizes this loop (ifort -parallel), it will fail during execution when placed in a test harness. (Alternatively, when this subroutine appears in the same source file as the test harness, ifort will expand it inline and then refuse to parallelize it due to a correctly diagnosed data dependence.)
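
For reference, a harness along the following lines would exercise the K(J)==L(J)==1 case. This is only a sketch: the harness actually used is not shown in this thread, and the program name `harness`, the size `n = 1000`, and the PASS/FAIL messages are assumptions. To reproduce the separate-compilation scenario described above, `foo` would have to be moved into its own source file.

```fortran
program harness
  implicit none
  integer, parameter :: n = 1000
  integer :: k(n), l(n), j
  real :: a(n), b(n), t(n)
  k = 1                          ! every iteration stores to T(1) ...
  l = 1                          ! ... and then loads from T(1)
  a = [(real(j), j = 1, n)]
  t = 0.0
  call foo(n, a, b, t, k, l)
  ! Executed in any serial order, each iteration reads back the value
  ! it has just stored, so B must equal A element by element.
  if (all(b == a)) then
    print *, 'PASS'
  else
    print *, 'FAIL: ', count(b /= a), ' elements differ'
  end if
end program harness

subroutine foo(n, a, b, t, k, l)
  implicit none
  integer, intent(in) :: n
  integer, intent(in) :: k(n), l(n)
  real, intent(in) :: a(n)
  real, intent(out) :: b(n)
  real, intent(inout) :: t(n)
  integer :: j
  do concurrent (j = 1:n)
    t(k(j)) = a(j)
    b(j) = t(l(j))
  end do
end subroutine foo
```

A compiler that runs the loop serially prints PASS. A FAIL can only appear when iterations actually overlap in time, and even then not on every run.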

DO CONCURRENT was mainly modeled on OpenMP PARALLEL DO, especially with the F18 additions of locality clauses.

Except that OpenMP PARALLEL DO imposes stricter restrictions on the loop, so that it can be safely executed in parallel. DO CONCURRENT's restrictions are only strong enough to ensure safe serial execution in arbitrary order of iterations. Perhaps it was believed that these weaker restrictions would be easier to comply with, and that a sufficiently smart compiler could apply automatic localization and then safely parallelize. This turns out not to be the case, since a compiler can apply automatic localization only to variables that can be identified at compilation time.

I see that Peter's paper got "deferred" at the February 2019 meeting and not taken up again. I'm not a good person to discuss this with.

Why not?

@sblionel
Member

sblionel commented Dec 2, 2019

Why not?

Because I don't head J3, nor even the HPC subgroup under whose purview this would fall.

As I wrote above, I do not consider myself an expert on the parallel features - I have a general understanding but that is all. Bill Long of Cray is the person I consider most knowledgeable about parallelism, though there are others on the committee who seem to understand it well.

I will ask Bill to take a look at this thread and see if he wants to offer an opinion.

@wclodius2
Contributor

@klausler your paper is written more as a feature request (this is the language of the standard, this is how I interpret the language, this is how it should be changed) when I think an interpretation request (this is the language of the standard, this is how I interpret the language, is my interpretation correct for that language, is my interpretation what was intended) would get more prompt attention.

@klausler

@klausler your paper is written more as a feature request (this is the language of the standard, this is how I interpret the language, this is how it should be changed) when I think an interpretation request (this is the language of the standard, this is how I interpret the language, is my interpretation correct for that language, is my interpretation what was intended) would get more prompt attention.

It's been 16 months, so prompt attention is a lost cause at this point. As is J3. I've removed myself from membership on that committee.

@sblionel
Member

sblionel commented Jul 4, 2020

I raised this question again yesterday on the J3 email list. You can read the thread here. If I understand it correctly, these issues have been known for a while and are what prompted the addition in F2018 of the DEFAULT(NONE) locality specifier, as requiring the compiler to analyze the loop for possible sharing was difficult. The general opinion seems to be that explicitly specifying the locality of all variables is needed to enable parallelization, but that changing the F2008 behavior would break programs.

@klausler

klausler commented Jul 4, 2020

If anybody had tried to solve the problem of my specific example, they would have learned that the recently added locality clauses are not sufficient to its needs. They accept only the names of whole variables (and, less important, can't distinguish between a pointer and its target).

@klausler

klausler commented Jul 4, 2020

DO CONCURRENT's locality rules are broken even apart from parallelization concerns.

Is the following a conforming program under Fortran 2018?

PROGRAM ONE
  REAL :: A(2) = 0.
  INTEGER :: J
  DO CONCURRENT (J=1:2) SHARED(A)
    A(J) = A(J) + 1.
  END DO
  PRINT *, A
END

Quoting 11.1.7.5(3):
If a variable has SHARED locality, appearances of the variable within
the DO CONCURRENT construct refer to the variable in the innermost
executable construct or scoping unit that includes the DO CONCURRENT
construct. If it is defined or becomes undefined during any iteration,
it shall not be referenced, defined, or become undefined during any
other iteration.

A has SHARED locality but is both defined and referenced in all iterations.

If A had unspecified locality (no SHARED(A) locality specifier), then
paragraph (4) (first bullet point) would apply:
If a variable has unspecified locality, if it is referenced in an iteration
it shall either be previously defined during that iteration, or shall not be
defined or become undefined during any other iteration; if it is defined or
becomes undefined by more than one iteration it becomes undefined when the
loop terminates;

And the program would not be conforming, both for referencing A in each
iteration without a prior definition, and for using the undefined
variable after the termination of the loop.

Can the word "variable" possibly refer to the elements of A, rather than
to its entirety? Elsewhere in subclause 11.1.7, there are sites in
which the word "variable" clearly refers to whole arrays:

11.1.7.5(2), emphasis added
A variable that has LOCAL or LOCAL_INIT locality is a construct entity with
the same type, type parameters, and rank as the variable with the same
name in the innermost executable construct or scoping unit that includes
the DO CONCURRENT construct, and the outside variable is inaccessible by
that name within the construct. ... If it is not a pointer, it has the
same bounds as the outside variable.

11.1.7.5(4)
If a variable has unspecified locality, ... if it is noncontiguous and
is supplied as an actual argument corresponding to a contiguous
INTENT (IN OUT) dummy argument in an iteration, it shall either be
previously defined in that iteration or shall not be defined in any
other iteration;

@sblionel
Member

sblionel commented Jul 4, 2020

Peter, would you please ask this on the J3 email list, of which you're still a member? I think you'd get a more authoritative response there. But to answer the question, "Can the word variable possibly refer to the elements of A", the answer is yes. R902 defines variable as including designator and designator (R901) includes array-element.

I admit it can be a bit confusing where sometimes variable names are mentioned (and thus excluding array elements), but in the places where it just says variable, then an array element qualifies. In your example, the variable reference is A(J), not A, and hence I would say that your example conforms, since no element of A is referenced or defined in more than one iteration.

@klausler

klausler commented Jul 4, 2020

Thanks for the reminder.

@certik
Member Author

certik commented Jul 6, 2020

I just asked on the J3 mailinglist to clarify the main problem raised in this issue:

https://mailman.j3-fortran.org/pipermail/j3/2020-July/012241.html

@certik
Member Author

certik commented Jul 7, 2020

@klausler here is an answer by Malcolm:

https://mailman.j3-fortran.org/pipermail/j3/2020-July/012244.html

If I understand it correctly, he says that to get maximum performance, one has to do DEFAULT(SHARED). He provided some background and motivation why things are designed the way they are. He said "We discussed them at great length." I wish the discussion was archived somewhere, so that we don't need to repeat it. But at least we are discussing this now and it is archived now.

@klausler

klausler commented Jul 7, 2020

@klausler here is an answer by Malcolm:

https://mailman.j3-fortran.org/pipermail/j3/2020-July/012244.html

If I understand it correctly, he says that to get maximum performance, one has to do DEFAULT(SHARED). He provided some background and motivation why things are designed the way they are. He said "We discussed them at great length." I wish the discussion was archived somewhere, so that we don't need to repeat it. But at least we are discussing this now and it is archived now.

There is no DEFAULT(SHARED) in Fortran. It might be a good idea, but it's not in the language.

@sblionel
Member

sblionel commented Jul 7, 2020

Right - to me, DEFAULT(SHARED) is as bad as IMPLICIT. Much better to explicitly specify the localities of all the variables in the block.

@klausler

klausler commented Jul 7, 2020

The value of a DEFAULT(SHARED), if it actually were in the language, is that it would apply to variables that can't be named in an explicit locality specifier.

@certik
Member Author

certik commented Jul 7, 2020

@klausler I am still confused: with the current Fortran Standard and your example, are compilers required to put in (potentially) costly runtime checks, or is there a way to write it using explicit locality specifiers so that it can be parallelized efficiently?

@FortranFan
Member

@klausler wrote Nov. 17, 2020 7:01 PM EST:

.. Still not sure what we're going to do in the flang compilers -- there's good arguments for both standard conformance as well as for just doing the right thing.

c.f. https://mailman.j3-fortran.org/pipermail/j3/2020-July/012244.html where J3 "effectively" suggested something which is not yet in the standard.

So, is it possible for flang compilers to do both!?

That is, first have a standard-conforming implementation of DO CONCURRENT.

But then also consider an "Experimental" edition of flang compiler(s) that attempts to do the "right thing", perhaps via a "DEFAULT(SHARED)" or some suitable extension that is Fortrannic and which can then be proposed for Fortran 202Y as further improvement to be incorporated into the standard?

@klausler

@klausler wrote Nov. 17, 2020 7:01 PM EST:

.. Still not sure what we're going to do in the flang compilers -- there's good arguments for both standard conformance as well as for just doing the right thing.

c.f. https://mailman.j3-fortran.org/pipermail/j3/2020-July/012244.html where J3 "effectively" suggested something which is not yet in the standard.

So, is it possible for flang compilers to do both!?

That is, first have a standard-conforming implementation of DO CONCURRENT.

But then also consider an "Experimental" edition of flang compiler(s) that attempts to do the "right thing", perhaps via a "DEFAULT(SHARED)" or some suitable extension that is Fortrannic and which can then be proposed for Fortran 202Y as further improvement to be incorporated into the standard?

Anything is possible, but these are all just second-best alternatives to J3 just fixing the problems.

@wyphan

wyphan commented May 23, 2022

Hi @klausler, now that my Google Summer of Code project proposal for DO CONCURRENT support in GFortran has been accepted, I'd like to set up a meeting to discuss this. (I wish GitHub had a messaging feature...)

If you happen to be at the Fortran Discourse forums, please send me a PM with your NVIDIA email address so I can send you a calendar invite to the meeting I'm in the process of setting up with Jeff Larkin, Güray Özen, and the folks at Predictive Science who published arXiv:2110.10151 about the DO CONCURRENT implementation in NVIDIA nvfortran. Otherwise you can reach me at wileam [at] phan [dot] codes.

Of course, anyone in this thread who is interested is welcome too.

@klausler

klausler commented May 24, 2022

I have reviewed the document for LLVM Fortran that describes the problems with DO CONCURRENT in the standard language, and unfortunately nothing has changed on the language standards front since I wrote it in 2020. In particular, nothing has been done to fix DO CONCURRENT in the draft Fortran 202X standards (apart maybe from SIMPLE procedures). Perhaps somebody on WG5/J3 may take it up for Fortran 202Y but that is of course a long ways out.

@certik
Member Author

certik commented May 25, 2022

Perhaps somebody on WG5/J3 may take it up for Fortran 202Y but that is of course a long ways out.

@klausler you have the most knowledge on this particular issue (since you wrote the Flang document!). Do you think you could please submit proposals for 2Y to fix this? I'll help champion it and advocate for it, but it would take me much longer to write up than it would take for you, since you have thought about all the details here and what needs to be done.

@klausler

My 2019 paper, which was ignored by J3, remains my favored solution; recycle it if you like. It would be nice if the semantics of locality specifiers with regard to pointers and ASSOCIATE/SELECT TYPE names were also clarified, as being invalid for LOCAL().

@rouson

rouson commented May 25, 2022

A J3 mailing list discussion of this topic spanned 44 emails over 11 days in July 2020. I wrote email 42 attempting to crystallize the discussion into a practice that I could teach. My takeaway: it suffices for every do concurrent construct to include a default(none) statement obligating the programmer to declare every variable accessed inside the construct as shared, local, or local_init. Email 43 essentially concurred and compared default(none) to implicit none so I think of default(none) as just good code hygiene and consider the issue settled for my purposes. I would be satisfied with a compiler that parallelizes, vectorizes, or offloads do concurrent only when default(none) is present. I therefore don't consider do concurrent broken any more than I would consider all of Fortran broken because programmers are free to not use implicit none.
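
The practice described above can be sketched as follows. This is an illustration rather than code from the mailing list; the program and variable names are made up, and it assumes a compiler that accepts F2018 locality specifiers:

```fortran
program default_none_demo
  implicit none
  integer, parameter :: n = 8
  real :: x(n), y(n), scale, tmp
  integer :: i
  x = [(real(i), i = 1, n)]
  scale = 0.5
  ! DEFAULT(NONE) obliges every variable used in the block to appear in
  ! a locality specifier, much as IMPLICIT NONE obliges every name to
  ! be declared. The index i is a construct entity and is not listed.
  do concurrent (i = 1:n) default(none) &
      shared(x, y) local_init(scale) local(tmp)
    tmp = scale * x(i)       ! tmp is private to this iteration
    y(i) = tmp * tmp         ! each iteration defines a distinct y(i)
  end do
  print *, y
end program default_none_demo
```

Leaving `tmp` out of the specifier list should then be diagnosed at compile time instead of falling back to the unspecified-locality rule.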

If someone submits a Fortran 202Y proposal related to this issue, I suggest that it either involve backward-compatible changes to do concurrent or that a new feature be proposed that is syntactically similar to do concurrent but with semantics amenable to parallelization. The new feature could be do parallel and could resemble do concurrent but with different constraints.

@rouson

rouson commented May 25, 2022

@klausler could you share with us what course of action NVIDIA chose for offloading do concurrent? If, for example, the compiler allows me to pass a flag that toggles offloading off and on at compile time, I would be happy. I would always include default(none), which at least would force me to think through locality so that I have a good understanding of whether it's safe to offload.

@klausler

The three problems with DO CONCURRENT (...) DEFAULT(NONE) are that (1) you have to name every data item in every loop in a locality clause, (2) the F'2018 locality clauses only allow simple names and (3) the semantics of pointers in locality clauses are not defined.

A better (but not best) suggestion was to use DEFAULT(SHARED) but the problem with that is that it doesn't exist.

So I still recommend that the default implicit localization rule be changed (perhaps by syntax) to pertain only to names that could appear in an explicit LOCAL() clause. The fact that it was defined so that it applies to array elements and components is accidental; if fixed, it would work the way everybody assumes that it already does.

@rouson

rouson commented May 25, 2022

The three problems with DO CONCURRENT (...) DEFAULT(NONE) are that (1) you have to name every data item in every loop in a locality clause, (2) the F'2018 locality clauses only allow simple names and (3) the semantics of pointers in locality clauses are not defined.

This is helpful. I hope the NVIDIA representative(s) on J3 can champion your suggestions. I'd like to see these issues addressed but I would be a poor champion because I have little personal use for most of the features that break do concurrent. It seems that the problematic features mostly involve pointers (which I rarely use), indirect addressing (which I seldom use), or module variables (which I mostly avoid) so I wouldn't have many useful insights in any of the discussions that arise around a new proposal.

A better (but not best) suggestion was to use DEFAULT(SHARED) but the problem with that is that it doesn't exist.

A J3 member in the aforementioned mailing list discussion stated that default(shared) was considered by the committee and that it was decided that default(none) would be less error-prone. A reference was made to the complexities of OpenMP's default(shared). I don't know enough to have an opinion. I just want to say that the suggestion was considered and another path taken for reasons that were provided.

So I still recommend that the default implicit localization rule be changed (perhaps by syntax) to pertain only to names that could appear in an explicit LOCAL() clause. The fact that it was defined so that it applies to array elements and components is accidental; if fixed, it would work the way everybody assumes that it already does.

Maybe but I think the committee gives more weight to what the committee intended than to what users might incorrectly assume. I think the unstated goal is to make the standard clear, consistent, and useful. If it's unclear, add a note. If it's inconsistent, fix it. If it's clear and consistent but not useful, replace the feature much like do concurrent effectively replaces forall, but don't break existing code that relies upon the feature as long as that code has an unambiguous interpretation that is consistent with what those who wrote the standard intended. This is just my observation of how the committee operates and I'm ok with it. It's why I usually recommend that people attend several meetings before making a proposal. It can take some time to read the room and understand the dynamics and the implicit aims.

@klausler

klausler commented May 25, 2022

Maybe but I think the committee gives more weight to what the committee intended than to what users might incorrectly assume. I think the unstated goal is to make the standard clear, consistent, and useful. If it's unclear, add a note. If it's inconsistent, fix it. If it's clear and consistent but not useful, replace the feature much like do concurrent effectively replaces forall, but don't break existing code that relies upon the feature as long as that code has an unambiguous interpretation that is consistent with what those who wrote the standard intended. This is just my observation of how the committee operates and I'm ok with it. It's why I usually recommend that people attend several meetings before making a proposal. It can take some time to read the room and understand the dynamics and the implicit aims.

There's good precedent in a similar case of J3 doing the right thing to preserve the intent of a feature. PURE procedures have restrictions that prevent them from causing side effects. A hole was discovered by which one could sneak modification of a global variable into a PURE subprogram -- declare a derived type with an initialized pointer component outside the procedure, then within, declare a variable of that type, and then use its default initialized pointer component to write to the global variable.

The response, admirably, was to plug the hole. The change might invalidate existing code, but it was the right thing to do.

DO CONCURRENT's problems are similar. The definition of the default localization rule makes it possible to write a loop whose iterations cannot execute concurrently, even though all of the many documented restrictions in the standard are satisfied. That's the hole in the spec. Worse, a compiler can't always detect at compilation time whether the hole is being exploited, and may have to conservatively assume that it is. The hole is easy to plug. And like the hole in PURE procedures, it should be plugged.
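
A minimal sketch of the kind of loop described above, under assumptions that are not from the thread: the program name, array size, and values are hypothetical, and K is filled with 1 so that every iteration defines and then references T(1). Each iteration defines T(K(J)) before referencing it, so the loop appears to satisfy the unspecified-locality rule, yet an implementation that runs iterations in parallel while sharing T could let one iteration read a value stored by another.

```fortran
program dc_hole_sketch
  implicit none
  integer, parameter :: n = 8      ! hypothetical problem size
  integer :: k(n), j
  real :: a(n), b(n), t(n)

  k = 1                            ! every iteration targets T(1)
  a = [(real(j), j = 1, n)]
  t = 0.0

  do concurrent (j = 1:n)
    t(k(j)) = a(j)                 ! T(1) is defined in this iteration
    b(j) = t(k(j))                 ! the reference follows the definition in the same iteration
  end do

  ! T(1) is defined by more than one iteration, so it is undefined after
  ! the loop; only B is inspected here.
  if (any(b /= a)) error stop "b differs from a"
  print '(a)', "b matches a"
end program dc_hole_sketch
```

Executed serially in any iteration order, every B(J) equals A(J). Because K is only known at run time in general, a compiler cannot tell from the source whether T needs a per-iteration copy.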

@wyphan

wyphan commented May 25, 2022

@klausler This is exactly why I want to start a discussion. I plan to push for DEFAULT as a GNU extension and implement it accordingly during the GSoC. I'm inviting you, @jefflarkin, and @grypp from NVIDIA; Ron Caplan and Miko Stulajter from @predsci, because they published arXiv:2110.10151 as the first real-world use case for DO CONCURRENT; and my GSoC mentors Tobias Burnus and @tschwinge from GCC/Siemens. As I indicated before, other interested folks in this thread are welcome too. Tentatively, it will be held between June 6-10, the week after ISC '22. Hopefully, upon successful implementation as a GNU extension, the standards committee will consider including it in Fortran 202Y.

Edit: after a careful re-read of R1130 in J3/18-007r1 section 11.1.7.2, it turns out that DEFAULT(NONE) is included in Fortran 2018, but not DEFAULT(SHARED) or any other DEFAULT argument.
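
For reference, a minimal sketch of the Fortran 2018 locality specifiers discussed here, with DEFAULT(NONE) forcing every variable used in the block to be listed. The program and variable names are hypothetical, and the sketch assumes a compiler that implements Fortran 2018 locality specifiers.

```fortran
program dc_locality_sketch
  implicit none
  integer, parameter :: n = 8      ! hypothetical problem size
  integer :: j
  real :: a(n), b(n), t

  a = [(real(j), j = 1, n)]

  ! DEFAULT(NONE) requires an explicit locality for each variable in the block.
  do concurrent (j = 1:n) default(none) local(t) shared(a, b)
    t = 2.0 * a(j)                 ! LOCAL(t): each iteration has its own t
    b(j) = t + 1.0                 ! each iteration defines a distinct element of b
  end do

  if (any(b /= 2.0 * a + 1.0)) error stop "unexpected b"
  print '(a)', "ok"
end program dc_locality_sketch
```

With LOCAL(t) stated explicitly, the compiler does not have to infer whether t needs a per-iteration copy.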

@wyphan

wyphan commented May 25, 2022

Btw, the first thread from the discussion over the J3 mailing list that @rouson mentioned can be accessed here: https://mailman.j3-fortran.org/pipermail/j3/2020-July/012229.html

@klausler

Another WG5/J3 meeting has come and gone with no recorded action on fixing DO CONCURRENT. The HPC subgroup didn't even submit a report on their assigned F'202Y discussion items.

At this point, the best implementation option appears to me to be to ignore the broken standard and assume that the default localization rules apply only to variables that could have appeared in an explicit LOCAL clause. J3 has had three years to fix this and has done nothing.

@sblionel
Member

Your item is on the list of things being considered, and there was quite a bit of discussion, but no action at this time. There is quite a bit of disagreement on the matter, especially with some of the claims, but it is being taken seriously. You're correct that the HPC subgroup didn't submit a paper with initial comments, but there will be further discussion.

It's too bad that you chose to withdraw from the committee since you obviously have a passion for the issues.

@klausler

Your item is on the list of things being considered, and there was quite a bit of discussion, but no action at this time. There is quite a bit of disagreement on the matter, especially with some of the claims, but it is being taken seriously. You're correct that the HPC subgroup didn't submit a paper with initial comments, but there will be further discussion.

It's too bad that you chose to withdraw from the committee since you obviously have a passion for the issues.

I became convinced that J3's current process is incapable of producing quality work. The best that I can do is describe the bugs in the standard as I encounter them as an implementer, so that they're documented and you can fix them or not in the standard as you choose. It's not that different a situation from being a user of a buggy compiler -- one works around the bugs, but still reports them responsibly in the hope that something might be done about them before they affect other users.

@klausler

The plan for HPC features in Fortran 202Y (https://j3-fortran.org/doc/year/23/23-146.txt) omits any mention of fixing DO CONCURRENT.

@jeffhammond

jeffhammond commented Sep 12, 2023

I took another look at this, particularly Peter's examples.

The following is #62 (comment) except with the read from T removed.

subroutine foo(N, A, B, T, K, L)
  implicit none
  integer, intent(in) :: N, K(N), L(N)
  real, intent(in) :: A(N)
  real, intent(out) :: B(N)
  real, intent(inout) :: T(N)
  integer :: J
  do concurrent (J=1:N)
    ! During execution, K(J) is always 1.
    T(K(J)) = A(J)
  end do
end subroutine foo

As no locality is specified, we can refer to the following:

If a variable has unspecified locality, if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates;

Writing to T in multiple iterations violates this and causes T to become undefined. Unless all of the elements of K are unique, this program contains a data race.

If we add shared(T) to the code, the behavior changes from undefined to prohibited:

If it is defined or becomes undefined during any iteration, it shall not be referenced, defined, or become undefined during any other iteration.

Note that I am interpreting "variable" to mean the element of an array, not the whole array, even though it's unclear, because if I interpret "variable" as the whole array, it is impossible to use DO CONCURRENT with arrays.
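
A minimal sketch of the conforming case under this element-wise reading, with assumptions that are not from the thread: the program name, sizes, and values are hypothetical, K is chosen to be a permutation so that no element of T is defined by more than one iteration, and the compiler is assumed to support Fortran 2018 locality specifiers.

```fortran
program dc_shared_sketch
  implicit none
  integer, parameter :: n = 4      ! hypothetical problem size
  integer :: k(n), j
  real :: a(n), t(n)

  k = [4, 3, 2, 1]                 ! a permutation: all elements are unique
  a = [10.0, 20.0, 30.0, 40.0]
  t = 0.0

  do concurrent (j = 1:n) shared(t)
    t(k(j)) = a(j)                 ! each element of T is defined by exactly one iteration
  end do

  if (any(t /= [40.0, 30.0, 20.0, 10.0])) error stop "unexpected t"
  print '(a)', "ok"
end program dc_shared_sketch
```

If K contained repeated values, two iterations would define the same element of T, which is the case the SHARED rule quoted above prohibits.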

Can someone tell me what is wrong with my thinking?

@klausler

You deleted the reference to T(K(J)), but the interpretation of that reference is the whole point of the text that you quote.

From Fortran's perspective there are no such things as "data races" in DO CONCURRENT. It's not a parallel programming construct. T(1) is undefined after the loop ends (if there's more than one iteration) because of the definitions in multiple iterations, but within each iteration, it is well defined in the current standard. This is the problem in a nutshell.

@jeffhammond

jeffhammond commented Sep 12, 2023

My point is that there is a race on T in both your program and mine, and unless WG5 believes that race conditions are legal and defined, the interpretation of your program does not matter, because it has undefined behavior before it gets to the interesting part.

The solution is to make data races undefined behavior, to match every other programming model with concurrent loops, not to accept that data races are legal and well-defined and try to reason about the consequences of that.

@klausler

It is meaningless to talk about race conditions in serial code. DO CONCURRENT, despite its name, is defined as a serial construct. RYOS. F'202X 11.1.7.4.3 paragraph 3: "The block of a DO CONCURRENT construct is executed for every active combination of the index-name values. Each execution of the block is an iteration. The executions may occur in any order."

@jeffhammond

It was intended to allow parallel implementations. I am proceeding with the intent to make parallelism a reasonable implementation. Given that Fujitsu, Cray, Intel, and NVIDIA all implement DO CONCURRENT with parallelism in a wide range of cases, I believe that allowing data races was the mistake, not parallelism.

@klausler

Yes, that is entirely my point. DO CONCURRENT's default locality rules were badly defined and allow non-parallelizable data accesses to be written in conforming code.

@jeffhammond

jeffhammond commented Sep 12, 2023

removed

@klausler

Your example is clearly non-conforming, and should remain so. And it's not relevant to this particular issue.


9 participants