
8324890: C2 SuperWord: refactor out VLoop, make unrolling_analysis static, remove init/reset mechanism #17624

Closed
wants to merge 19 commits

Conversation


@eme64 eme64 commented Jan 30, 2024

Subtask of #16620
(The basic goal is to break SuperWord into different modules. This makes the code more maintainable and extensible. And eventually this allows some modules to be reused by other/new vectorizers.)

  1. Move the code shared between SuperWord::SLP_extract (where we do vectorization) and SuperWord::unrolling_analysis into a new class VLoop. This decouples unrolling_analysis from the SuperWord object, so we can make it static.
  2. So far, SuperWord was reused for all loops in a compilation, and then "reset" (with SuperWord::init) for every loop. This is a bit of a nasty pattern. I now make a new VLoop and a new SuperWord object per loop.
  3. Since we now make more SuperWord objects, we allocate the internal data structures more often. Therefore, I now pre-allocate/reserve sufficient space on initialization.
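The three steps above can be sketched as a standalone C++ pattern. This is an illustrative stand-in, not the actual HotSpot classes: `Loop`, `VLoopSketch`, and `SuperWordSketch` are hypothetical names, and the unrolling heuristic is a toy placeholder.

```cpp
#include <cassert>

// Per-loop state shared by unrolling analysis and vectorization.
// A fresh instance is created for each loop, replacing the old
// init()/reset() reuse pattern.
struct Loop {
    int  body_size;
    bool counted;
};

class VLoopSketch {
public:
    explicit VLoopSketch(const Loop* loop) : _loop(loop) {}

    // Precondition checks shared by both analyses.
    bool check_preconditions() const {
        return _loop != nullptr && _loop->counted && _loop->body_size > 0;
    }

    const Loop* loop() const { return _loop; }

private:
    const Loop* _loop;  // per-loop state, not reused across loops
};

class SuperWordSketch {
public:
    // Static: needs only the VLoopSketch, not a full SuperWord object,
    // so no SuperWord data structures are allocated for unrolling analysis.
    static int unrolling_analysis(const VLoopSketch& vloop) {
        assert(vloop.check_preconditions());
        // Toy heuristic standing in for the real analysis.
        return vloop.loop()->body_size < 8 ? 4 : 1;
    }
};
```

The caller first constructs the per-loop `VLoopSketch`, checks preconditions, and only then runs the static analysis — mirroring the decoupling described in step 1.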

Side-note about #17604 (integrated, no need to read any more):
I would like to remove the use of SuperWord::is_marked_reduction from SuperWord::unrolling_analysis. For starters: it is not clear what it was ever good for. Second: it requires us to do reduction marking/analysis before unrolling_analysis, and hence makes the reduction marking shared between unrolling_analysis and vectorization. I could move the reduction marking to VLoop now. But the _loop_reductions set would have to be put on an arena, and I would like to avoid creating an arena for the unrolling_analysis. Plus, it would just be nicer code to have reduction analysis together with body analysis, type analysis, etc., and all of them only in SLP_extract.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8324890: C2 SuperWord: refactor out VLoop, make unrolling_analysis static, remove init/reset mechanism (Sub-task - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17624/head:pull/17624
$ git checkout pull/17624

Update a local copy of the PR:
$ git checkout pull/17624
$ git pull https://git.openjdk.org/jdk.git pull/17624/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17624

View PR using the GUI difftool:
$ git pr show -t 17624

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17624.diff

Webrev

Link to Webrev Comment


bridgekeeper bot commented Jan 30, 2024

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title 8324890 8324890: C2 SuperWord: refactor out VLoop, make unrolling_analysis static, remove init/reset mechanism Jan 30, 2024

openjdk bot commented Jan 30, 2024

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jan 30, 2024
return false; // failure
}

const char* VLoop::check_preconditions_helper() {

Note: replaces most code from the old SuperWord::transform_loop

_num_work_vecs(0), // amount of vector work we have
_num_reductions(0) // amount of reduction work we have
{
}

//------------------------------transform_loop---------------------------
bool SuperWord::transform_loop(IdealLoopTree* lpt, bool do_optimization) {

Note: code moved to VLoop::check_preconditions_helper


Note: all the do_optimization parts are not part of preconditions, and hence they are kept in the new transform_loop

cl->set_pre_loop_end(pre_end);
}

init(); // initialize data structures

Note: this is the end of the "preconditions", and we used to set _early_exit = false inside init()


if (SuperWordReductions) {
mark_reductions();
}

Note: I now would like to move reduction marking to after precondition checking. Hence, I moved it to SLP_extract.

assert(pre_loop_end, "must be valid");
_pre_loop_end = pre_loop_end;
}


Note: This should have never been cached in the node itself, but only during autovectorization.

I moved it now into VLoop, which I also pass into VPointer, which has to access the pre-loop for independence checks.

@@ -231,14 +231,11 @@ class CountedLoopNode : public BaseCountedLoopNode {
// vector mapped unroll factor here
int _slp_maximum_unroll_factor;

// Cached CountedLoopEndNode of pre loop for main loops
CountedLoopEndNode* _pre_loop_end;

Note: this makes the node smaller, and does not cache something that may be invalid later. It was used only during SuperWord. Looks like a bad pattern.

_race_possible(false), // cases where SDMU is true
_early_return(true), // analysis evaluations routine
_do_vector_loop(phase->C->do_vector_loop()), // whether to do vectorization/simd style
_do_vector_loop(phase()->C->do_vector_loop()), // whether to do vectorization/simd style
_num_work_vecs(0), // amount of vector work we have
_num_reductions(0) // amount of reduction work we have

Note:
Before this change, we used to only create SuperWord once, and use it for all loops in the compilation. Now that I ripped out the "init" method, and avoid reusing SuperWord this way, we want to make sure we do not re-allocate too much.

For some data structures I now pre-allocate memory for the maximum size they may ever reach. This is to avoid re-allocation when they grow.
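A minimal sketch of this pre-allocation idea, with `std::vector` standing in for HotSpot's GrowableArray (`AnalysisBuffers` and its field are illustrative names, not the actual code):

```cpp
#include <vector>

// Since a fresh analysis object is now created per loop, its containers
// reserve up front the maximum size they can ever reach, so growth never
// triggers a re-allocation while the loop is being analyzed.
struct AnalysisBuffers {
    std::vector<int> node_info;

    explicit AnalysisBuffers(std::size_t max_nodes) {
        node_info.reserve(max_nodes);  // one allocation, no regrowth
    }
};
```

The container starts empty but with full capacity, so subsequent `push_back` calls up to `max_nodes` do not allocate.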

IdealLoopTree* lpt = vloop.lpt();
CountedLoopNode* cl = vloop.cl();
Node* cl_exit = vloop.cl_exit();
PhaseIdealLoop* phase = vloop.phase();

Note: Made it static, and instead of the SuperWord object, we now only have access to the VLoop object.


if (SuperWordReductions) {
mark_reductions();
}

Note: now we don't need to mark reductions for unrolling_analysis any more, and only for SLP_extract. We win!

VectorSet _loop_reductions; // Reduction nodes in the current loop
Node* _bb; // Current basic block

Note: always the same as cl

CountedLoopNode* head = pre_loop_end()->loopnode();
assert(head != nullptr, "must find head");
return head;
};

Note: before this patch, these two cache-accessors were in the CounterLoopEndNode.

sw.unrolling_analysis(_local_loop_unroll_factor);
VLoop vloop(this, true);
if (vloop.check_preconditions()) {
SuperWord::unrolling_analysis(vloop, _local_loop_unroll_factor);

Note: I made unrolling_analysis static, and only pass in vloop, not all the info in the SuperWord object.
Advantage: we don't need to allocate any SuperWord data-structures any more for unrolling_analysis.

if (C->do_superword() && C->has_loops() && !C->major_progress()) {
// SuperWord transform
SuperWord sw(this);
ResourceArea autovectorization_arena;

Note: this allows us to free up all the space used by SuperWord's internal data structures between every processed loop. Previously, all internal data structures were on the phase->C->comp_arena().
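The per-pass arena idea can be sketched as a scoped owner whose destructor releases everything when the pass ends. This is a toy stand-in using `std::unique_ptr` chunks, not HotSpot's actual ResourceArea:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Every allocation lives only as long as the arena object, so a pass over
// all loops frees its scratch memory on exit instead of growing a
// compilation-lifetime arena.
class ScopedArena {
public:
    char* alloc(std::size_t n) {
        _chunks.emplace_back(new char[n]);
        return _chunks.back().get();
    }
    std::size_t chunk_count() const { return _chunks.size(); }

private:
    std::vector<std::unique_ptr<char[]>> _chunks;  // freed by the destructor
};
```

Declaring the arena on the stack at the start of the pass (as the diff does with `ResourceArea autovectorization_arena;`) gives the "free everything between passes" behavior automatically.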


vnkozlov commented Feb 1, 2024

It would be nice to see the effect of these changes on C2 compilation time. You now have to create a SW object each time instead of only once.


eme64 commented Feb 2, 2024

@vnkozlov I created #17683 to add time measurement. Let me try it on a simple benchmark here.

It seems to me that there is no significant difference, the variance is higher than the difference.

Benchmark idea:

  • test001,002,003: single simple loops (with and without reductions)
  • test101: multiple simple loops: where we might see a difference if SuperWord is shared or not.
  • test201: single, large loop body. To test for large data structures in SuperWord.
  • test301: multiple large loop bodies. Here we should see the biggest difference, since they are large and SuperWord'ed right after each other, and so SuperWord is shared (before) and not shared (with patch).

Disabled turbo-boost for benchmark.

Interpretation / Implication / Speculation

In SuperWord, there are lots of nested loops. Certainly 2 and 3 deep, and even 4 deep, I think. Hence, a small overhead on memory can probably not really be measured. I'll look into lowering the algorithmic complexity in the future. This is especially important if we want to increase the loop body size.

test001,002,003
../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestAutoVectorizationTime::test0* -Xbatch -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=1000 TestAutoVectorizationTime.java
This patch:

Xbatch:
         IdealLoop:            25.508 s
           AutoVectorize:       5.927 s
Xcomp:
         IdealLoop:            12.860 s
           AutoVectorize:       2.736 s

Before:

Xbatch:
         IdealLoop:            25.414 s
           AutoVectorize:       6.171 s
Xcomp:
         IdealLoop:            12.836 s
           AutoVectorize:       2.824 s

test101
../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestAutoVectorizationTime::test1* -Xbatch -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=100 TestAutoVectorizationTime.java
This patch:

Xbatch:
         IdealLoop:            17.229 s
           AutoVectorize:       3.042 s
Xcomp:
         IdealLoop:            12.569 s
           AutoVectorize:       2.134 s

Before:

Xbatch:
         IdealLoop:            16.873 s
           AutoVectorize:       2.974 s
Xcomp:
         IdealLoop:            12.368 s
           AutoVectorize:       2.100 s

test201
../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestAutoVectorizationTime::test2* -Xbatch -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=100 TestAutoVectorizationTime.java
This patch:

Xbatch:
         IdealLoop:             4.292 s
           AutoVectorize:       2.669 s
Xcomp:
         IdealLoop:             3.812 s
           AutoVectorize:       2.573 s

Before:

Xbatch:
         IdealLoop:             4.211 s
           AutoVectorize:       2.648 s
Xcomp:
         IdealLoop:             3.905 s
           AutoVectorize:       2.683 s

test301
../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,TestAutoVectorizationTime::test3* -Xbatch -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=10 TestAutoVectorizationTime.java
This patch:

Xbatch:
         IdealLoop:            12.811 s
           AutoVectorize:       7.165 s
Xcomp:
         IdealLoop:             4.912 s
           AutoVectorize:       2.811 s

Before:

Xbatch:
         IdealLoop:            13.097 s
           AutoVectorize:       7.452 s
Xcomp:
         IdealLoop:             5.253 s
           AutoVectorize:       3.118 s

This is the benchmark:

public class TestAutoVectorizationTime {
    static int RANGE = 10_000;

    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        for (int i = 0; i < 10_000; i++) {
            test001(a, b);
            test002(a, b, i % 200 - 100);
            test003(a, b, i % 200 - 100);
            test101(a, b);
            test201(a, b);
            test301(a, b);
        }
    }

    static void test001(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i] + 100;
        }
    }

    static void test002(int[] a, int[] b, int s) {
        for (int i = 0; i < a.length-1; i++) {
            a[i] += b[i] + s;
        }
    }

    static int test003(int[] a, int[] b, int s) {
        int x = 0;
        for (int i = 0; i < a.length; i++) {
            x += a[i] * b[i] + s * a[i] + b[i] * 101;
        }
        return x;
    }

    static void test101(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 100; }
        for (int i = 0; i < a.length; i++) { b[i] = a[i] + 100; }
    }

    static void test201(int[] a, int[] b) {
        for (int i = 0; i < a.length/8-8; i+=1) {
          b[i*8+ 0] = a[i*8+ 0] + 100;
          b[i*8+ 1] = a[i*8+ 1] + 100;
          b[i*8+ 2] = a[i*8+ 2] + 100;
          b[i*8+ 3] = a[i*8+ 3] + 100;
          b[i*8+ 4] = a[i*8+ 4] + 100;
          b[i*8+ 5] = a[i*8+ 5] + 100;
          b[i*8+ 6] = a[i*8+ 6] + 100;
          b[i*8+ 7] = a[i*8+ 7] + 100;
	}
    }

    static void test301(int[] a, int[] b) {
        for (int i = 0; i < a.length/8-8; i+=1) { b[i*8+ 0] = a[i*8+ 0] + 100; b[i*8+ 1] = a[i*8+ 1] + 100; b[i*8+ 2] = a[i*8+ 2] + 100; b[i*8+ 3] = a[i*8+ 3] + 100; b[i*8+ 4] = a[i*8+ 4] + 100; b[i*8+ 5] = a[i*8+ 5] + 100; b[i*8+ 6] = a[i*8+ 6] + 100; b[i*8+ 7] = a[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { a[i*8+ 0] = b[i*8+ 0] + 100; a[i*8+ 1] = b[i*8+ 1] + 100; a[i*8+ 2] = b[i*8+ 2] + 100; a[i*8+ 3] = b[i*8+ 3] + 100; a[i*8+ 4] = b[i*8+ 4] + 100; a[i*8+ 5] = b[i*8+ 5] + 100; a[i*8+ 6] = b[i*8+ 6] + 100; a[i*8+ 7] = b[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { b[i*8+ 0] = a[i*8+ 0] + 100; b[i*8+ 1] = a[i*8+ 1] + 100; b[i*8+ 2] = a[i*8+ 2] + 100; b[i*8+ 3] = a[i*8+ 3] + 100; b[i*8+ 4] = a[i*8+ 4] + 100; b[i*8+ 5] = a[i*8+ 5] + 100; b[i*8+ 6] = a[i*8+ 6] + 100; b[i*8+ 7] = a[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { a[i*8+ 0] = b[i*8+ 0] + 100; a[i*8+ 1] = b[i*8+ 1] + 100; a[i*8+ 2] = b[i*8+ 2] + 100; a[i*8+ 3] = b[i*8+ 3] + 100; a[i*8+ 4] = b[i*8+ 4] + 100; a[i*8+ 5] = b[i*8+ 5] + 100; a[i*8+ 6] = b[i*8+ 6] + 100; a[i*8+ 7] = b[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { b[i*8+ 0] = a[i*8+ 0] + 100; b[i*8+ 1] = a[i*8+ 1] + 100; b[i*8+ 2] = a[i*8+ 2] + 100; b[i*8+ 3] = a[i*8+ 3] + 100; b[i*8+ 4] = a[i*8+ 4] + 100; b[i*8+ 5] = a[i*8+ 5] + 100; b[i*8+ 6] = a[i*8+ 6] + 100; b[i*8+ 7] = a[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { a[i*8+ 0] = b[i*8+ 0] + 100; a[i*8+ 1] = b[i*8+ 1] + 100; a[i*8+ 2] = b[i*8+ 2] + 100; a[i*8+ 3] = b[i*8+ 3] + 100; a[i*8+ 4] = b[i*8+ 4] + 100; a[i*8+ 5] = b[i*8+ 5] + 100; a[i*8+ 6] = b[i*8+ 6] + 100; a[i*8+ 7] = b[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { b[i*8+ 0] = a[i*8+ 0] + 100; b[i*8+ 1] = a[i*8+ 1] + 100; b[i*8+ 2] = a[i*8+ 2] + 100; b[i*8+ 3] = a[i*8+ 3] + 100; b[i*8+ 4] = a[i*8+ 4] + 100; b[i*8+ 5] = a[i*8+ 5] + 100; b[i*8+ 6] = a[i*8+ 6] + 100; b[i*8+ 7] = a[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { a[i*8+ 0] = b[i*8+ 0] + 100; a[i*8+ 1] = b[i*8+ 1] + 100; a[i*8+ 2] = b[i*8+ 2] + 100; a[i*8+ 3] = b[i*8+ 3] + 100; a[i*8+ 4] = b[i*8+ 4] + 100; a[i*8+ 5] = b[i*8+ 5] + 100; a[i*8+ 6] = b[i*8+ 6] + 100; a[i*8+ 7] = b[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { b[i*8+ 0] = a[i*8+ 0] + 100; b[i*8+ 1] = a[i*8+ 1] + 100; b[i*8+ 2] = a[i*8+ 2] + 100; b[i*8+ 3] = a[i*8+ 3] + 100; b[i*8+ 4] = a[i*8+ 4] + 100; b[i*8+ 5] = a[i*8+ 5] + 100; b[i*8+ 6] = a[i*8+ 6] + 100; b[i*8+ 7] = a[i*8+ 7] + 100; }
        for (int i = 0; i < a.length/8-8; i+=1) { a[i*8+ 0] = b[i*8+ 0] + 100; a[i*8+ 1] = b[i*8+ 1] + 100; a[i*8+ 2] = b[i*8+ 2] + 100; a[i*8+ 3] = b[i*8+ 3] + 100; a[i*8+ 4] = b[i*8+ 4] + 100; a[i*8+ 5] = b[i*8+ 5] + 100; a[i*8+ 6] = b[i*8+ 6] + 100; a[i*8+ 7] = b[i*8+ 7] + 100; }
    }
}


vnkozlov commented Feb 2, 2024

Thank you for running the timing testing. What about memory fragmentation? Will this code use the default chunks in Arena (they can be reused), or allocate a new chunk (malloc) each time, which may lead to fragmentation?


eme64 commented Feb 3, 2024

@vnkozlov is there a way to measure memory fragmentation? I don't know how to answer that question.
And is there really a difference to how it was done before? Before we just put everything on the comp_arena, and never recover the memory until that arena is given up. Now we have a new autovectorization_arena for each AutoVectorization pass over all loops. I guess there could be multiple such passes in a single compilation. Before this patch, this means that each such AutoVectorization pass creates its SuperWord object, and allocates memory on the comp_arena, and the memory usage of all these passes adds up. With this patch, we at least are able to give up the memory after every pass.

Of course this is only helpful if the malloc/free'd chunks can properly be reused.
I think that chunks can be properly reused: that is what the ChunkPool::take_from_pool/return_to_pool are for. They are called from ChunkPool::allocate_chunk/deallocate_chunk, if there is a pool for the requested chunk-size/length.

I did some rr debugging. And I see that the first (init_size) chunk is allocated from a pool. But subsequent chunks (grow) have "non-standard" lengths, and are malloc/free'd. The reason is that get_pool_for_size does an exact length comparison, and a random-length grow will hit one of the pre-defined lengths only with very low probability. I suppose that could eventually lead to fragmentation. We quickly get "non-standard" lengths when, for example, we pre-allocate a specific size that is not a power of 2.

I wonder if we could not round up the chunk-size to the next bigger size for which we have a pool. Of course this would mean we have some padding in the chunks, but if they are short-lived chunks then at least the whole memory can be reclaimed. @jdksjolen you have done some work on Arenas. Do you have any wisdom to offer here?

An alternative: we can put the autovectorization_arena at Compile. That way, the chunks are kept until the end of compilation, and can be reused between the different AutoVectorization passes/phases.
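The round-up idea floated above could look roughly like this. The pool sizes and function name are illustrative, not the actual values or code in arena.hpp:

```cpp
#include <cstddef>

// Round a requested chunk length up to the next pooled size, so the chunk
// can come from (and return to) a ChunkPool instead of being malloc/free'd.
// The trade-off is some padding inside the chunk.
static const std::size_t pool_sizes[] = {1024, 4096, 16384, 32768};

std::size_t round_up_to_pool_size(std::size_t requested) {
    for (std::size_t s : pool_sizes) {
        if (requested <= s) return s;  // smallest pooled size that fits
    }
    return requested;  // too large for any pool: direct malloc as before
}
```

With this, a grow of, say, 1500 bytes would be served by the 4096-byte pool rather than a one-off malloc with a "non-standard" length.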


vnkozlov commented Feb 3, 2024

And I see that the first (init_size) chunk is allocated from a pool. But subsequent chunks (grow) have "non-standard" length, and are malloc/free'd.

Yes, that was my concern.

There are chunks with different sizes: arena.hpp#L66. Are your allocation sizes > 32K, the Chunk::size "Default size of an Arena chunk"? Arena::grow() uses MAX2(ARENA_ALIGN(x), (size_t) Chunk::size);.
Which of the SuperWord allocations are big? Can we split them to fit into 32K?

I think this should not stop you from doing this refactoring. Yes, it will allow returning memory sooner, and it is up to the OS how it optimizes that. I read your offline discussion with Johan. He has an interesting suggestion for growable arrays (use C heap).


eme64 commented Feb 4, 2024

I think the only really big array is _bb_idx, which converts node->_idx to bb_idx, basically to have smaller indices so that we can then address into more compact arrays. So we need one array that can be as large as the node-count (that can be many K, hence often more than 32K bytes). Splitting it would work, I guess. But it would be artificial and a more complex solution. I guess we could have an array of arrays?? Except for that one big array, all other arrays can be of the size of the number of nodes in the loop body, which is much smaller.

Does using CHeap directly really improve the situation? That just means that we directly malloc, instead of through the Arena, right? This would not make a difference for memory fragmentation, or am I wrong?

I think it might be a good idea to have at least the large bb_idx array be allocated in Compile, and we can decide if that is through the comp_arena or directly with CHeap. That would at least mean that we only have "one unit of fragmentation" per compilation, rather than per SuperWord loop or pass over all loops. I would argue for using the Arena and not CHeap directly: the array will grow from a small size to potentially very large size, in exponential (doubling) steps. This means at the beginning we have a few rather small pieces, which could lead to fragmentation. Having them all in Arena Chunks would lower fragmentation, right? For the large memory segments it does not matter: CHeap will obviously malloc directly, and the Arena also since it is a large and non-standard size.
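The sparse-to-dense mapping that _bb_idx provides can be sketched as follows. `build_body_idx` is a hypothetical helper for illustration, not the actual HotSpot code:

```cpp
#include <vector>

// Map sparse global node indices (node->_idx, up to the compilation's node
// count) to dense per-loop-body indices, so that all other per-node side
// tables only need to be as large as the loop body.
std::vector<int> build_body_idx(const std::vector<int>& body_node_idxs,
                                int node_count) {
    std::vector<int> bb_idx(node_count, -1);  // -1: node not in this body
    for (int i = 0; i < static_cast<int>(body_node_idxs.size()); i++) {
        bb_idx[body_node_idxs[i]] = i;  // dense index into body-sized tables
    }
    return bb_idx;
}
```

Only this one array scales with the global node count; everything keyed by the dense index stays small, which is why sharing just this array across loops addresses most of the allocation cost.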

bool is_slp = true;
size_t ignored_size = lpt()->_body.size();
size_t ignored_size = lpt->_body.size();
int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size);

TODO: make local, add ResourceMark.


Note: I will do this in a future RFE.


eme64 commented Feb 5, 2024

@vnkozlov After extensive discussions with @jdksjolen, I now decided to create a VSharedData class, which has its own arena that holds the really large array(s), so that they can be shared between the different SuperWord / AutoVectorization passes over the loops. This means fragmentation for large arrays is now as low as before this change. All the smaller arrays and small allocations are ok if they are not shared, since they nicely fit in chunks anyway, and therefore we don't have to worry about fragmentation so much.

@eme64 eme64 requested a review from vnkozlov February 6, 2024 16:02
@rwestrel rwestrel left a comment


Looks reasonable to me.


openjdk bot commented Feb 8, 2024

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8324890: C2 SuperWord: refactor out VLoop, make unrolling_analysis static, remove init/reset mechanism

Reviewed-by: kvn, roland

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 57 new commits pushed to the master branch:

  • 6944537: 8325203: System.exit(0) kills the launched 3rd party application
  • 4368437: 8325264: two compiler/intrinsics/float16 tests fail after JDK-8324724
  • 4a3a38d: 8325517: Shenandoah: Reduce unnecessary includes from shenandoahControlThread.cpp
  • 40708ba: 8325563: Remove unused Space::is_in
  • 29d89d4: 8325551: Remove unused obj_is_alive and block_start in Space
  • 8ef918d: 8324646: Avoid Class.forName in SecureRandom constructor
  • 69b2674: 8324648: Avoid NoSuchMethodError when instantiating NativePRNG
  • 52d4976: 8325437: Safepoint polling in monitor deflation can cause massive logs
  • 8b70b8d: 8325440: Confusing error reported for octal literals with wrong digits
  • 5daf622: 8325309: Amend "Listeners and Threads" in AWTThreadIssues.html
  • ... and 47 more: https://git.openjdk.org/jdk/compare/b02599d22e0f424a08045b32b94549c272fe35a7...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 8, 2024
@vnkozlov vnkozlov left a comment


Few comments


// Shared data structures for all AutoVectorizations, to reduce allocations
// of large arrays.
VSharedData vshared;

So it is local for each build_and_optimize() call, and the space will be freed by the destructor.


Yes, exactly.
Before my change, we had a SuperWord object per build_and_optimize() that allocated all the data structures.
So the scope is now still the same.
Except that before it all went to the comp_arena.

So before, we might have allocated those data structures multiple times (once per build_and_optimize), and grown comp_arena each time. Now we put it to a dedicated arena, which is freed in the destructor. So the memory usage should be a little lower that way.


GrowableArray<int>& node_idx_to_loop_body_idx() {
// Since this is a shared resource, we clear before every individual use.
_node_idx_to_loop_body_idx.clear();

I think it should be an explicit VSharedData::clear() method called in auto_vectorize(). Otherwise, much later, someone will have a hard time finding the place where the space is cleared.


@vnkozlov done.
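The suggested pattern, with the reset made explicit at the call site, might look like this sketch (`VSharedDataSketch` is an illustrative stand-in, not the actual class):

```cpp
#include <vector>

// Instead of clearing the shared array inside its accessor, the shared-data
// object exposes an explicit clear() that the caller invokes once per loop,
// so the reset point is visible at the call site.
struct VSharedDataSketch {
    std::vector<int> node_idx_to_loop_body_idx;

    void clear() {  // called explicitly before each loop's analysis
        node_idx_to_loop_body_idx.clear();
    }
};
```

The caller would then write `vshared.clear();` at the start of each loop's vectorization attempt, rather than relying on a hidden reset inside an accessor.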

@eme64 eme64 requested a review from vnkozlov February 9, 2024 05:46
@vnkozlov vnkozlov left a comment


This looks good.


eme64 commented Feb 10, 2024

Thanks @vnkozlov @rwestrel for the review!
/integrate


openjdk bot commented Feb 10, 2024

Going to push as commit 232d136.
Since your change was applied there have been 65 commits pushed to the master branch:

  • 71d2dbd: 8325464: GCCause.java out of sync with gcCause.hpp
  • 6c7029f: 8318603: Parallelize sun/java2d/marlin/ClipShapeTest.java
  • e33d8a2: 8311076: RedefineClasses doesn't check for ConstantPool overflow
  • 6303c0e: 8325569: ProblemList gc/parallel/TestAlwaysPreTouchBehavior.java on linux
  • 3ebe6c1: 8319578: Few java/lang/instrument ignore test.java.opts and accept test.vm.opts only
  • d39b7ba: 8316460: 4 javax/management tests ignore VM flags
  • ac4607e: 8226919: attach in linux hangs due to permission denied accessing /proc/pid/root
  • b42b888: 8325038: runtime/cds/appcds/ProhibitedPackage.java can fail with UseLargePages
  • 6944537: 8325203: System.exit(0) kills the launched 3rd party application
  • 4368437: 8325264: two compiler/intrinsics/float16 tests fail after JDK-8324724
  • ... and 55 more: https://git.openjdk.org/jdk/compare/b02599d22e0f424a08045b32b94549c272fe35a7...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 10, 2024
@openjdk openjdk bot closed this Feb 10, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Feb 10, 2024

openjdk bot commented Feb 10, 2024

@eme64 Pushed as commit 232d136.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
