
Feat: Add loop support to the optimise-relinearization pass#2758

Open
akashmadhu4 wants to merge 5 commits into google:main from akashmadhu4:fix/Opt_Relinearize

Conversation

@akashmadhu4
Contributor

Summary of the changes:

Previously, loops had to be unrolled before running the pass, which optimally inserts mgmt::RelinearizeOp's. Now loops can be processed without unrolling. If there is a nested loop, the pass treats the inner loop as a self-contained ILP problem, solves it to find its output degree, and uses this solved degree as a fixed constraint in the parent loop's ILP solver.

  • Inner loops are solved first.
  • Added a mechanism to pass the final solved yield degree of an inner loop up to the parent.
  • Refactored all 5 analysis walks to use WalkOrder::PreOrder together with WalkResult::skip(), which skips the body of an inner loop while processing the outer loop.
  • Added proper ILP constraints for the loop (yield degree <= iter_arg degree), which bound the yield degree.
  • Added a loop_accumulator.mlir test file containing 3 test cases: a standard ct-ct loop-carried multiplication requiring relinearization inside the loop; nested_loop_both_mul, testing nested loops where both the inner and outer loops perform ct-ct multiplication; and nested_loop_inner_ct_pt, verifying that an inner loop with only ct-pt multiplications is correctly skipped by the relin solver while the outer ct-ct loop is correctly targeted.
  • All 6+1 tests pass.
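The walk refactor's skip mechanism can be illustrated with a toy pre-order traversal (the names ToyOp/preorderWalk/outerWalkVisits below are illustrative stand-ins, not the real MLIR API):

```cpp
#include <string>
#include <vector>

// Toy stand-in for walk<WalkOrder::PreOrder> + WalkResult::skip()
// (illustrative only, not MLIR code).
enum class ToyWalkResult { Advance, Skip };

struct ToyOp {
  std::string name;
  bool isLoop = false;
  std::vector<ToyOp> body;
};

// Visits ops pre-order; returning Skip from the callback prevents descent
// into that op's body, so an outer solve never models an inner loop's ops.
template <typename Fn>
void preorderWalk(const ToyOp& op, Fn&& callback) {
  for (const ToyOp& child : op.body) {
    if (callback(child) == ToyWalkResult::Advance)
      preorderWalk(child, callback);
  }
}

// Returns the ops an outer-loop walk would visit: the inner loop op itself
// is seen (so its solved yield degree can be used), but its body is not.
inline std::vector<std::string> outerWalkVisits(const ToyOp& outer) {
  std::vector<std::string> visited;
  preorderWalk(outer, [&](const ToyOp& op) {
    visited.push_back(op.name);
    return op.isLoop ? ToyWalkResult::Skip : ToyWalkResult::Advance;
  });
  return visited;
}
```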

Notes / Open Questions for Discussion:

  • I constrained the initial iter_arg to degree 1, forcing the loop-carried constraint (yield degree <= iter_arg degree) in the solver to place a mgmt.Relinearize op inside the loop if the yield degree naturally grows to 2. If the initial degree weren't rigidly locked to 1, the relinearization might have been safely delayed until after the loop entirely.

The initial iter_arg was constrained to 1 because getDimension(iter_arg, solver) returns the maximum accumulated degree across all loop iterations:

// Constraints to initialize the key basis degree variables at the start of
// the computation.
for (auto& [value, var] : keyBasisVars) {
  if (llvm::isa<BlockArgument>(value)) {
    auto blockArg = llvm::cast<BlockArgument>(value);
    int constrainedDegree;
    // Loop iter_args are always assumed to have degree 1 since getDimension diverges
    if (isa<LoopLikeOpInterface>(blockArg.getOwner()->getParentOp())) {
      constrainedDegree = 1;
    } else {
      // If the dimension is 3, the key basis is [0, 1, 2] and the degree is 2.
      constrainedDegree = getDimension(value, solver).value_or(2) - 1;
    }
    model.AddLinearConstraint(var == constrainedDegree, "");
  }
}
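The forced-relinearization reasoning can be checked with simple key-basis degree arithmetic (a toy model, not the HEIR implementation): a ct-ct multiplication of degrees d1 and d2 yields degree d1 + d2, so an accumulator pinned to degree 1 that is multiplied by a fresh ciphertext reaches degree 2 and must be relinearized before the yield.

```cpp
// Toy key-basis degree model (illustrative, not HEIR code).
// ct-ct multiplication sums the operand degrees; relinearization brings
// any degree back to linear (1).
inline int ctCtMulDegree(int d1, int d2) { return d1 + d2; }
inline int relinearizeDegree(int /*d*/) { return 1; }

// With the iter_arg pinned to degree 1, one accumulator multiplication per
// iteration pushes the yield to degree 2, violating yield <= iter_arg;
// returns true iff an in-loop relinearization is required.
inline bool loopNeedsRelin(int iterArgDegree, int freshCtDegree) {
  return ctCtMulDegree(iterArgDegree, freshCtDegree) > iterArgDegree;
}
```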
  • Handling iter_args that were first analysed as non-secret but become secret via yield:
// Handle iter_args that become secret via yield
if (!argIsSecret) {
  if (auto loopOp = dyn_cast<LoopLikeOpInterface>(op)) {
    auto iterArgs = loopOp.getRegionIterArgs();
    auto it = llvm::find(iterArgs, arg);
    if (it != iterArgs.end()) {
      unsigned idx = std::distance(iterArgs.begin(), it);
      auto yieldedValues = loopOp.getYieldedValues();
      if (idx < yieldedValues.size()) {
        argIsSecret = isSecret(yieldedValues[idx], solver);
      }
    }
  }
}

Would it be better to do this handling in SecretnessAnalysis rather than while creating the variables?

Fixes #2600

@j2kun j2kun self-requested a review March 13, 2026 21:09
Collaborator

@j2kun j2kun left a comment


So I like this approach, and reviewing the PR gave me some ideas for making it even better. I will describe those ideas and, if you don't feel up for it, we can merge this PR in (modulo the one comment and linter failure) and move on.

The main reason I think it would eventually need improvement is that, in the looped linalg kernels I've been working on, specifically the baby-step giant-step kernel, there are necessary if/else statements in the loop body.

So one improvement I see is that this approach can be generalized beyond loop support to support any region-bearing op. The loopBoundaryDegrees could be generalized to be something like fixedResultDegrees that signals to the solver that, for any operation present in the map, the solver must hard-code its result degrees as a constraint. One could also enforce that, at any basic block boundary, the block args (both entering and exiting) must always have linear degree. Then you can walk<WalkOrder::PostOrder>([&] (Block *block) {...}) to go block by block, and populate the map after each solve as you do here. In this way, I think you can remove most of the branches that specialize to loops/yields except for the step that populates the map from the solver output.
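A minimal sketch of this block-by-block scheme (fixedResultDegrees comes from the suggestion above; everything else, including solveBlocksPostOrder, is hypothetical): solve blocks innermost-first, and let each enclosing solve read already-solved ops' result degrees from the map instead of modeling their bodies.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of innermost-first solving. Each "solve" here is a
// stub that enforces linear degree at the block boundary; the resulting
// map lets an enclosing solve hard-code those results as constants.
inline std::map<std::string, int> solveBlocksPostOrder(
    const std::vector<std::string>& postOrderBlocks) {
  std::map<std::string, int> fixedResultDegrees;
  for (const std::string& block : postOrderBlocks) {
    // Real code would run the per-block ILP here; by fiat, every value
    // crossing a block boundary is constrained to linear degree.
    fixedResultDegrees[block] = 1;
  }
  return fixedResultDegrees;
}
```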

Which brings me to my next improvement: use RegionBranchOpInterface to avoid having to know about the op type at all. Though the terminology around that interface always confuses me, in practice RegionBranchOpInterface allows you to take an op's result, and (via getPredecessorValues I believe) get the program points that forward control flow to the op result (or conversely, getSuccessorRegions). This would allow you to connect, say, all three of an iter_arg to its loop-yielded value to the corresponding op result without hard-coding anything about affine.for/scf.for or affine.yield/scf.yield.

But even better, it would allow you to use one code path to support all for loops and if statements (and scf.while!).

Even further, you could use this interface technique to make a single global ILP that handles all ops and nested regions in a single formulation. You would use the connection between the region branching points described above to create constraints that effectively say "the degree of an iter_arg == the degree of the init == the degree of the yielded operand == the degree of the op result", but in code you would just loop over predecessors/successors and agnostically add constraints making all of them equal. And that would allow the ILP to find a solution in which the relinearization is delayed across a region boundary.
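The "make them all equal" constraint structure can be sketched with a union-find over degree variables (illustrative only; the real formulation would add ILP equality constraints): init, iter_arg, yielded operand, and loop result all collapse to one shared degree variable.

```cpp
#include <numeric>
#include <vector>

// Toy degree-variable store (not HEIR code): makeEqual mimics adding an
// ILP constraint var_a == var_b for values connected across a region
// boundary via RegionBranchOpInterface predecessors/successors.
struct DegreeVars {
  std::vector<int> parent;
  explicit DegreeVars(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);
  }
  int find(int v) { return parent[v] == v ? v : parent[v] = find(parent[v]); }
  void makeEqual(int a, int b) { parent[find(a)] = find(b); }
  bool equal(int a, int b) { return find(a) == find(b); }
};
```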

All that said, I am still not sure a global ILP is worth it here. Lazy relinearization is not the most important optimization, IMO, and having one ILP per basic block would have better compile-time performance (i.e., HEIR runtime) and not sacrifice all that much latency because most of the optimization opportunity is inside a loop's body. So in that case, the use of the interface would mainly be to allow you to support any kind of nested region-holding op with control flow (in particular, scf.if) and use it to populate the loopBoundaryDegrees (/ fixedResultDegrees) map without having to switch over a list of supported op types.

Ok, after that huge wall of text, I will also answer your specific questions:

I constrained the initial iter_arg to degree 1

As mentioned in the comment, I think all iter_args should be forced to have linear degree. Partly because of my next answer...

Handling iter_args that were first analysed as non-secret but become secret via yield

The way loop support is handled before this pass in the pipeline is to (a) peel the first iteration of a loop when an iter_arg is initialized with a cleartext value, so that iter_args are always invariantly ciphertexts, and (b) the loop is partially unrolled. This means that you should be able to safely ignore secretness discrepancies in the iter_args.

It also adds some context to my thoughts above: since the loops are partially unrolled, you should assume this pass will have sufficiently large blocks to work with and opportunities to do lazy relinearization. This reduces the marginal benefit of deferring relinearization across blocks, and hence the benefit of a global solve vs a block-local solve.

});
})
.Case<affine::AffineYieldOp, scf::YieldOp>([&](auto op) {
// For loop yield ops, the degree returned must not exceed the degree
Collaborator


I actually think the degrees should be equal. In particular, when this is lowered to the CKKS scheme, having degrees that are not equal across iter_args will produce a type error.

I think it would be a good and simplifying assumption to enforce by fiat that all iter_args have a linear key basis.

@akashmadhu4
Contributor Author

Thank you for explaining your ideas @j2kun and for clarifying my questions. This was really helpful. I agree with you on extending the approach to generalise beyond loop support, especially in cases like the baby-step giant-step kernel you mentioned. I'll take a closer look at RegionBranchOpInterface, since it seems like a clean way to unify handling for loops, conditionals, and other control-flow constructs.

Regarding the global vs. block-level ILP, my understanding is that a global ILP could enable more optimal decisions (like delaying relinearization across blocks). However, as you mentioned, partial unrolling creates sufficiently large blocks, so most of the useful lazy relinearization opportunities can be captured locally. Therefore, a block-level ILP seems like a practical approach.

I’ll continue working on incorporating these ideas and follow up with updates before we merge this PR.



Development

Successfully merging this pull request may close these issues.

OptimizeRelinearization is incompatible with loops

2 participants