Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port remaining HECO evaluation artifacts and ensure they produce equivalently optimal results. #571

Open
j2kun opened this issue Mar 27, 2024 · 14 comments

Comments

@j2kun
Copy link
Collaborator

j2kun commented Mar 27, 2024

@AlexanderViand-Intel I may need your help here. I haven't figured out how to build HECO yet (everything succeeds in your repo instructions up to the linking step, then it fails with a bazillion linker errors to LLVM).

What I think I'd want is just the output MLIR from each of the evaluation artifacts, so that I can compare the rotation counts in them with what HEIR produces. I tried manually porting roberts-cross, and I think I see where HEIR is still lacking, but I want some confirmation.

copybara-service bot pushed a commit that referenced this issue Apr 2, 2024
This PR ports the gx_kernel evaluation artifact from HECO, and supports it by doing two things:

- Adding a canonicalize pass before insert-rotate but after full-loop-unroll, to ensure that ND tensors of constants are materialized to constants of tensors.
- Adding two patterns to splat constants into tensors during rotations insertion.

This leaves unanswered an important question: how should we detect and handle plaintext types that are not constants (say, function inputs or function inputs that are modified by some IR-internal ops). I will file a followup issue on that topic.

Part of #571

PiperOrigin-RevId: 621247689
copybara-service bot pushed a commit that referenced this issue Apr 2, 2024
This PR ports the gx_kernel evaluation artifact from HECO, and supports it by doing two things:

- Adding a canonicalize pass before insert-rotate but after full-loop-unroll, to ensure that ND tensors of constants are materialized to constants of tensors.
- Adding two patterns to splat constants into tensors during rotations insertion.

This leaves unanswered an important question: how should we detect and handle plaintext types that are not constants (say, function inputs or function inputs that are modified by some IR-internal ops). I will file a followup issue on that topic.

Part of #571

PiperOrigin-RevId: 621247689
copybara-service bot pushed a commit that referenced this issue Apr 2, 2024
This PR ports the gx_kernel evaluation artifact from HECO, and supports it by doing two things:

- Adding a canonicalize pass before insert-rotate but after full-loop-unroll, to ensure that ND tensors of constants are materialized to constants of tensors.
- Adding two patterns to splat constants into tensors during rotations insertion.

This leaves unanswered an important question: how should we detect and handle plaintext types that are not constants (say, function inputs or function inputs that are modified by some IR-internal ops). I will file a followup issue on that topic.

Note canonicalize slows the box_blur_64x64 test to a crawl (> 15m), so I converted it to "enormous" size so that it is skipped in CI.

Part of #571

PiperOrigin-RevId: 621247689
copybara-service bot pushed a commit that referenced this issue Apr 2, 2024
This PR ports the gx_kernel evaluation artifact from HECO, and supports it by doing two things:

- Adding a canonicalize pass before insert-rotate but after full-loop-unroll, to ensure that ND tensors of constants are materialized to constants of tensors.
- Adding two patterns to splat constants into tensors during rotations insertion.

This leaves unanswered an important question: how should we detect and handle plaintext types that are not constants (say, function inputs or function inputs that are modified by some IR-internal ops). I will file a followup issue on that topic.

Note canonicalize slows the box_blur_64x64 test to a crawl (> 15m), so I converted it to "enormous" size so that it is skipped in CI.

Part of #571

PiperOrigin-RevId: 621247689
@j2kun
Copy link
Collaborator Author

j2kun commented Apr 2, 2024

Last one appears to be upgrading the roberts cross 4x4 to 64x64 (though we know this will be very slow due to #589). Maybe in the mean time we can limit it to 32x32.

copybara-service bot pushed a commit that referenced this issue Apr 2, 2024
This PR ports the gx_kernel evaluation artifact from HECO, and supports it by doing two things:

- Adding a canonicalize pass before insert-rotate but after full-loop-unroll, to ensure that ND tensors of constants are materialized to constants of tensors.
- Adding two patterns to splat constants into tensors during rotations insertion.

This leaves unanswered an important question: how should we detect and handle plaintext types that are not constants (say, function inputs or function inputs that are modified by some IR-internal ops). I will file a followup issue on that topic.

Note canonicalize slows the box_blur_64x64 test to a crawl (> 15m), so I converted it to "enormous" size so that it is skipped in CI.

Part of #571

PiperOrigin-RevId: 621247689
copybara-service bot pushed a commit that referenced this issue Apr 2, 2024
This PR ports the gx_kernel evaluation artifact from HECO, and supports it by doing two things:

- Adding a canonicalize pass before insert-rotate but after full-loop-unroll, to ensure that ND tensors of constants are materialized to constants of tensors.
- Adding two patterns to splat constants into tensors during rotations insertion.

This leaves unanswered an important question: how should we detect and handle plaintext types that are not constants (say, function inputs or function inputs that are modified by some IR-internal ops). I will file a followup issue on that topic.

Note canonicalize slows the box_blur_64x64 test to a crawl (> 15m), so I converted it to "enormous" size so that it is skipped in CI.

Part of #571

PiperOrigin-RevId: 621321174
@AlexanderViand-Intel
Copy link
Collaborator

AlexanderViand-Intel commented Apr 4, 2024

I'm trying to get up to speed after RWC/Easter Holidays, and I saw that you mentioned compile times in the order of 50 minutes in #589, which sounds an order of magnitude worse than what HECO gets. In HECO, I did a few tricks to speed things up, such as moving one or two patterns out of normal canonicalization into their own pass, maybe we can do something similar here.

EDIT: It seems like the PR (#587) mentioned in that issue is somehow borked - there's no commits?

@AlexanderViand-Intel I may need your help here. I haven't figured out how to build HECO yet (everything succeeds in your repo instructions up to the linking step, then it fails with a bazillion linker errors to LLVM).

Mh, that's odd - the repo, and the artifact tag specifically, should "just" work, having been through the USENIX artifact evaluation process. What linker did you use? lld or something else?

What I think I'd want is just the output MLIR from each of the evaluation artifacts, so that I can compare the rotation counts in them with what HEIR produces. I tried manually porting roberts-cross, and I think I see where HEIR is still lacking, but I want some confirmation.

Sure, let me run those for you and figure out how best to post them.

@j2kun
Copy link
Collaborator Author

j2kun commented Apr 4, 2024

That is bizarre about copybara. Here's the commit: 4d84f04

@j2kun
Copy link
Collaborator Author

j2kun commented Apr 4, 2024

Yeah I was thinking I could extract the needed patterns (converting the cleartext weight tensor to individual constants would suffice), but haven't had the chance to figure out exactly which pattern does that in canonicalize.

@AlexanderViand-Intel
Copy link
Collaborator

AlexanderViand-Intel commented Apr 8, 2024

What I think I'd want is just the output MLIR from each of the evaluation artifacts, so that I can compare the rotation counts in them with what HEIR produces. I tried manually porting roberts-cross, and I think I see where HEIR is still lacking, but I want some confirmation.

Sure, let me run those for you and figure out how best to post them.

Here's that HECO output, I figured adding it to the repo as another tag would be easier than creating a bunch of gists: HECO@artifact_mlir_output

@j2kun
Copy link
Collaborator Author

j2kun commented Apr 8, 2024

Summarizing:

boxblur_64x64.mlir dotproduct_8.mlir gxkernel_64x64.mlir hammingdistance_4.mlir linearpolynomial_64.mlir quadraticpolynomial_64.mlir robertscross_64x64.mlir
HEIR 7 3 6 (8x8, not 64x64) 3 0 0 2 (4x4)
HECO 3 3 5 3 0 0 4

So remaining work is:

@AlexanderViand-Intel
Copy link
Collaborator

AlexanderViand-Intel commented Apr 9, 2024

  • Scrutinize roberts_cross to ensure that HEIR's "better" program is actually correct.

👀 I think HECO generates the same program as porcupine here (which should be optimal, so I'd be surprised to see something better) but let me check whether this is actually the case.

EDIT: I looked it up and Porcupine's solution is better, but it still needs 3 rotations, not 2.

@j2kun
Copy link
Collaborator Author

j2kun commented Apr 18, 2024

After discussion, we realized the reason HEIR seems to do worse on the box_blur example is because HECO's kernel is 2x2, while HEIR's is 3x3, so HEIR's should be equivalent to HECO on that example. robertscross_64x64 is the remaining example to inspect.

@j2kun
Copy link
Collaborator Author

j2kun commented Apr 26, 2024

Now that the first e2e runtime test is done (simple_sum_test.cpp), I'm looking at adding analogous tests for the rest of the HECO examples to bolster our confidence in their correctness.

@AlexanderViand-Intel: do you happen to have any input-output test pairs lying around from HECO? Just to save some time manually generating them.

@j2kun
Copy link
Collaborator Author

j2kun commented May 8, 2024

Roberts cross had a bug in the input MLIR that is fixed in #676

@AlexanderViand-Intel
Copy link
Collaborator

@AlexanderViand-Intel: do you happen to have any input-output test pairs lying around from HECO? Just to save some time manually generating them.

Sorry, everything from HECO would be SEAL based (e.g., Porcupine-based SEAL implementation), so I think what you ended up doing is the most effective way forward.

@AlexanderViand-Intel
Copy link
Collaborator

So remaining work is:

  • Improve box_blur

I think this is the last open item in this issue, right?
According to the box_blur_64x64.mlir test, HEIR still generates a solution with 7 rotations, some of which are rotations not of the input image, but of intermediate results.

@j2kun
Copy link
Collaborator Author

j2kun commented May 13, 2024

So remaining work is:

  • Improve box_blur

I think this is the last open item in this issue, right? According to the box_blur_64x64.mlir test, HEIR still generates a solution with 7 rotations, some of which are rotations not of the input image, but of intermediate results.

From #571 (comment), we agreed that it was 7 because the kernel is 3x3, whereas HECO did 2x2.

The gx_kernel still has one more rotation than HECO that I haven't explained, and it also doesn't have an end-to-end test yet.

@AlexanderViand-Intel
Copy link
Collaborator

AlexanderViand-Intel commented May 13, 2024

From #571 (comment), we agreed that it was 7 because the kernel is 3x3, whereas HECO did 2x2.

🤦 Sorry, I got confused by seeing the intermediate-value rotations, which is structurally different from what HECO generates. This is actually a good thing, though , as HEIR does in fact get 3 rotations in 2x2 kernel mode, but HECO for 3x3 requires (as I'd expect) 8 rotations (kernel_size - 1).

The gx_kernel still has one more rotation than HECO that I haven't explained, and it also doesn't have an end-to-end test yet.

I wouldn't be surprised if it turns out this is another bug in the HECO inputs 🙈 as it'd be odd for HEIR to successfully handle kernels of this style but not two such kernels in sequence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants