Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added max_kernel_size argument to core driver #1612

Closed
wants to merge 1 commit into from
Closed

added max_kernel_size argument to core driver #1612

wants to merge 1 commit into from

Conversation

lriesebos
Copy link
Contributor

ARTIQ Pull Request

Description of Changes

Add an option to the core driver to limit the maximum (static) kernel size before it is loaded to the core device.

I do not know if there is any interest in such a feature, feel free to close this PR if there is not.

Related Issue

A way to prevent loading kernels that are too large while waiting for #544.

Type of Changes

Type
✨ New feature

Steps (Choose relevant, delete irrelevant before submitting)

All Pull Requests

  • Use correct spelling and grammar.
  • Update RELEASE_NOTES.rst if there are noteworthy changes, especially if there are changes to existing APIs.
  • Close/update issues.
  • Check the copyright situation of your changes and sign off your patches (git commit --signoff, see copyright).

Code Changes

  • Run flake8 to check code style (follow PEP-8 style). flake8 has issues with parsing Migen/gateware code, ignore as necessary.
  • Test your changes or have someone test them. Mention what was tested and how.
  • Add and check docstrings and comments
  • Check, test, and update the unittests in /artiq/test/ or gateware simulations in /artiq/gateware/test

Documentation Changes

  • Check, test, and update the documentation in doc/. Build documentation (cd doc/manual/; make html) to ensure no errors.

Git Logistics

  • Split your contribution into logically separate changes (git rebase --interactive). Merge/squash/fixup commits that just fix or amend previous commits. Remove unintended changes & cleanup. See tutorial.
  • Write short & meaningful commit messages. Review each commit for messages (git show). Format:
    topic: description. < 50 characters total.
    
    Longer description. < 70 characters per line
    

Licensing

See copyright & licensing for more info.
ARTIQ files that do not contain a license header are copyrighted by M-Labs Limited and are licensed under LGPLv3+.

Signed-off-by: Leon Riesebos <leon.riesebos@duke.edu>
@sbourdeauducq
Copy link
Member

If you're hitting problems when loading large kernels, I think a solution would be to increase the memory allocated to the comms CPU. It is 4MB only for historical reasons, and we no longer support boards with such a limited amount of RAM.
https://github.com/m-labs/artiq/blob/master/artiq/firmware/runtime/runtime.ld#L9
https://github.com/m-labs/artiq/blob/master/artiq/firmware/ksupport/ksupport.ld

@dnadlinger
Copy link
Collaborator

dnadlinger commented Feb 13, 2021

Wouldn't it be more appropriate (less complex, less need for tuning parameters) to make the core device fail cleanly on kernels that are too large? (I assume the issue is that it just OOM-panics right now?)

@lriesebos
Copy link
Contributor Author

Increasing the memory allocated to the comms CPU sounds like a fair start. I do not think I am the right person to make that change, so I would have to ask one of you to do so.

A clean fail for too large kernels would definitely be a better solution than this one. What we currently experience with too large kernels is that the Kasli crashes in an unrecoverable way and we need to power cycle it. It would be great if that does not happen and an appropriate error message is shown instead, but again I am not the person who would be able to make that change.

Feel free to close this PR in favor of the two proposed solutions which seem more useful and robust.

@hartytp
Copy link
Collaborator

hartytp commented Feb 13, 2021

I assume there is a runtime panic at OOM? One of the things on my todo list is to make a controller that forwards UART messages to the main log. That would at least help debugging

@sbourdeauducq
Copy link
Member

What we currently experience with too large kernels is that the Kasli crashes in an unrecoverable way and we need to power cycle it.

You can use artiq_flash start to reboot it via JTAG.

@sbourdeauducq
Copy link
Member

sbourdeauducq commented Feb 14, 2021

Wouldn't it be more appropriate (less complex, less need for tuning parameters) to make the core device fail cleanly on kernels that are too large?

It's not so easy because Rust did not have a good way to handle memory allocation errors - it just panics.
There's some activity on this topic though: rust-lang/rust#66740 rust-lang/rust#66741 - another reason to upgrade Rust which is likely easier by switching to RISC-V.

@dnadlinger
Copy link
Collaborator

dnadlinger commented Feb 14, 2021

Should we then just pre-allocate a buffer to write the kernel into? This would effectively hard-code the limit into the kernel CPU instead (i.e. possibly require a manual update if the memory assignment is changed), but that still seems preferable to adding a heuristic limit in a completely different place (the core device driver).

We could also read the kernel into a container that supports failing allocations (by interfacing directly with the allocator) instead of Vec<u8>.

@dnadlinger
Copy link
Collaborator

Seems like https://doc.rust-lang.org/alloc/vec/struct.Vec.html#method.try_reserve (unstable) or try_vec! from fallible_collections would work. This would solve the DMA trace length issue too.

@lriesebos
Copy link
Contributor Author

just fyi, below you will find the uart log output when uploading a kernel that is too large.
networking goes down immediately and a reset through JTAG or a power cycle is required.
It seems that the limit is somewhere between 2.36 and 2.42 MB (not MiB).

[17:20:23.023149 0.013872] [   226.009177s]  INFO(runtime::session): idle kernel interrupted
[17:20:23.030550 0.007404] panic at runtime/main.rs:278:5heap view: BUSY 0x40147000 + 0xc + 0x24 -> 0x40147030
[17:20:23.037588 0.007039] BUSY 0x40147030 + 0xc + 0xc -> 0x40147048
[17:20:23.042088 0.004499] BUSY 0x40147048 + 0xc + 0xc -> 0x40147060
...
[17:20:23.263094 0.003842] BUSY 0x401baac4 + 0xc + 0x7c8 -> 0x401bb298
[17:20:23.267292 0.004199] IDLE 0x401bb298 + 0xc + 0x244d5c -> 0x0
[17:20:23.270651 0.003359]  === busy: 0x73aa0 idle: 0x245278 meta: 0x2e8 total: 0x2b9000
[17:20:23.275793 0.005141] 
[17:20:23.275939 0.000148] cannot allocate layout: Layout { size_: 2418892, align_: 1 }
[17:20:23.281644 0.005702] backtrace for software version 5.7146.ceaf6584;staq:
[17:20:23.297147 0.015504] 0x4004bd00
[17:20:23.313204 0.016056] 0x4002118c
[17:20:23.329197 0.015992] 0x4004bb04
[17:20:23.331130 0.001934] 0x4001b4ac
[17:20:23.345199 0.014069] 0x4001afb4
[17:20:23.347158 0.001960] 0x4003f4dc
[17:20:23.361171 0.014013] 0x40024b44
[17:20:23.362661 0.001490] 0x40024610
[17:20:23.378576 0.015914] halting.
[17:20:23.380499 0.001924] use `artiq_coremgmt config write -s panic_reset 1` to restart instead

sbourdeauducq added a commit that referenced this pull request Feb 21, 2021
@sbourdeauducq
Copy link
Member

92fd705

Should we then just pre-allocate a buffer to write the kernel into?

That may work, can you write a PR? I don't think more manual assignment of addresses would be required.

that still seems preferable to adding a heuristic limit in a completely different place (the core device driver).

Yes.

@lriesebos lriesebos deleted the feature/max_kernel_size branch May 27, 2021 16:05
@occheung occheung mentioned this pull request Oct 11, 2021
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants