Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[interp] Move from stack based to fully local var based design #20663

Merged
merged 9 commits into from Jan 8, 2021

Conversation

BrzVlad
Copy link
Member

@BrzVlad BrzVlad commented Dec 14, 2020

Instead of having instructions that push and pop from the stack, every instruction has explicit dreg and sregs.

While the purpose of this PR is mainly to make it easier to implement more advanced optimization in the future, it also has noticeable performance implications. The code is simplified because we no longer need to update and save the SP. However, the code for each instruction is bloated due to the addition of explicit source and destination offsets. This is counteracted by the reduction of the total number of instructions, since ldloc/stloc and moves become redundant and they are mostly optimized away, even in this implementation state. Here are the total number of executed opcodes as part of running the corlib test suite with the interp https://gist.github.com/BrzVlad/d62f504930b75cba4b870e6dbd947e90.

@vargaz
Copy link
Contributor

vargaz commented Jan 5, 2021

@monojenkins build failed

/*
* The local associated with the value of this stack entry. Every time we push on
* the stack a new new local is created.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

@@ -3472,10 +3472,10 @@ interp_exec_method (InterpFrame *frame, ThreadContext *context, FrameClauseArgs
++ip;
MINT_IN_BREAK;
MINT_IN_CASE(MINT_DUP_VT) {
int i32 = READ32 (ip + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this break some code which used to work ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even before this change, some opcodes support 32bit sized value types while others only 16bit, so it's not like we really supported it. This change just doubles down on having all sizes fit into 16bit. If deemed necessary in the future, we can easily add some slow path opcodes, accepting the full size, similarly with the local offsets, which are also only 16bits

@vargaz
Copy link
Contributor

vargaz commented Jan 5, 2021

Looks fine in general.

Some opcodes support large vt, while others don't, making it useless. Completely remove support for it, so we have faster code for common scenarios. Later at some point, we could add slow opcodes for these corner cases (also for methods with a lot of locals).
In order to facilitate future optimizations and also reduce the total number of instructions (since ldloc/stloc/mov are mostly redundant with this change). Instead of having instructions pop and push to the stack, they now receive explicit offsets for dreg and sregs. The offset is relative to frame->stack. Call args/ret are handled through a single dreg (this is a special CallArgs dreg). As with other locals, this dreg will be resolved to an offset. However, in addition to the optional return value of the call, at this offset, the call args will reside one after the other (This means the called frame will have the stack base at this offset, and the arguments are directly part of its local space). A local that holds an arg value to a call will never be referenced again (as a sreg in another instruction), but its offset will be in the param area of the call, and we will never optimize it out, since the call keeps it alive.

Since instructions only operate on stackvals, all IL locals of primitive types now occupy sizeof (stackval), instead of their real size. This enables them to be used directly as the source / destination of an instruction. Currently, there are two types of locals : normal locals and execution stack locals. While normal locals occupy space on the stack on the first come first served basis, execution stack locals have a predefined stack offset (relative to the start of the execution stack that we used to have before, which is computed during the initial code generation, mapping directly to the IL stack). Their real offset is resolved only after all the normal locals have their offset determined. Execution stack locals will become normal locals in the future, once we have a local offset allocator that takes into account the liveness region of locals.
Also move the code over to transform.c so we can add more verbose dumping in the future for method/class/field tokens.
In the previous commit we made return opcode write directly the result at the bottom of the stack. Previously we used to leave the result at the bottom of the execution stack, and manually write it to the execution stack of the parent frame, if we had a parent.
Replace the previous cumbersome cprop with a new simpler and more generic pass. It iterates on the instructions of each basic block, one at a time, saving the definition for every destination local. If a local is used as a sreg and its definition is a MOV, then we can forward through the copy. The deadce pass will remove locals that are not referenced anymore.

We add INTERP_LOCAL_FLAG_CALL_ARGS flag to locals that represent call args, since deadce would normally remove them. We generate moves when transitioning from a basic block to another while having a stack state in fixup_newbb_stack_locals. This is serves the same purpose as a phi instruction, in the future we should probably change to SSA and further simplify parts of the implementation. The fact that we don't have our own local offset allocator for execution stack locals, means that we are not free to extend liveness of these type of locals, overcomplicating the cprop pass. For now.
newobj_reg_map holds the src->dst writes that are done inside MINT_NEWOBJ_FAST (through the memmove), even when inlining. Once we remove the redundant memmove we would no longer need this newobj_reg_map and cprop would work normally.
This replaces LDLOCA + LDFLD/STFLD to a simple MOV.
We use the address of this label during EH to resume to it. For whatever reason, with some compilers, if the label is followed by a void return, the address of the label is at the end of the method, so we can't resume to it because it points to foreign code. We return NULL to workaround this.
@BrzVlad BrzVlad force-pushed the feature-interp-local-machine branch from 3b7d7c3 to 365d2f9 Compare January 6, 2021 23:23
@BrzVlad BrzVlad merged commit ed7d3b5 into mono:master Jan 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants