
MirageOS on Xen: systematic crash with create_bounce_frame #731

Closed

vitcozzolino opened this issue Dec 13, 2016 · 21 comments


vitcozzolino commented Dec 13, 2016

Hi,
I'm running a unikernel on Xen that basically accesses a remote DB, fetches and computes some data, and sends out the result. Apparently, if I try to fetch and parse a JSON response larger than an empirically found threshold (details at the bottom of the post), the Xen PV unikernel just crashes, and this is what I see when running sudo xl dmesg:

(XEN) Pagetable walk from 00000000002c8ff8:
(XEN)  L4[0x000] = 0000001f7d2e6067 00000000000004e6
(XEN)  L3[0x000] = 0000001f7d2e7067 00000000000004e7
(XEN)  L2[0x001] = 0000001f7d2e9067 00000000000004e9
(XEN)  L1[0x0c8] = 001000080fec8025 00000000000002c8
(XEN) domain_crash_sync called from entry.S: fault at ffff82d080226237 create_bounce_frame+0xdf/0x13a
(XEN) Domain 8 (vcpu#0) crashed on cpu#10:
(XEN) ----[ Xen-4.6.0  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    10
(XEN) RIP:    e033:[<00000000002583d3>]
(XEN) RFLAGS: 0000000000010246   EM: 1   CONTEXT: pv guest (d8v0)
(XEN) rax: 0000000000000009   rbx: 0000000000000000   rcx: 000000000000003a
(XEN) rdx: 00000000002293e4   rsi: 0000000000000000   rdi: 00000000002c9098
(XEN) rbp: 00000000002c9098   rsp: 00000000002c9038   r8:  0000000000000009
(XEN) r9:  0000000000000003   r10: 0000000000000003   r11: 0000000000000000
(XEN) r12: 0000000000440d38   r13: 0000000000000000   r14: 00000018ca7b6000
(XEN) r15: 000000000000003b   cr0: 0000000080050033   cr4: 00000000001526e0
(XEN) cr3: 0000001f7d2e5000   cr2: 0000000000000019
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=00000000002c9038:
(XEN)    0000000000000bfc 6d5f666f5f74754f 0200007900000003 0000000000000bf8
(XEN)    00000000002c9070 fffffffffffffffd 0000000000000bfc 0000000000000000
(XEN)    00000000002c91a8 0000000000440d38 0000000000000000 00000000002589ff
(XEN)    000000000000003b 00000018ca7b6000 0000000000000000 0000000000440d38
(XEN)    00000000002c91a8 0000000000000000 0000000000000000 0000000000000003
(XEN)    0000000000000003 0000000000000009 0000000000000009 000000000000003a
(XEN)    00000000002293e4 0000000000000000 00000000002c91a8 ffffffffffffffff
(XEN)    00000000002583d3 000000010000e030 0000000000010046 00000000002c9148
(XEN)    000000000000e02b 0700000000000000 0000000000000bf8 00000000002c9168
(XEN)    ffffffff00000003 0000000000000bfc 6e756f665f746f4e 0600000000000064
(XEN)    0000000000000bf8 0000000000000000 00000000002c92b8 0000000000440d38
(XEN)    0000000000000000 00000000002589ff 000000000000003b 00000018ca7b6000
(XEN)    0000000000000000 0000000000440d38 00000000002c92b8 0000000000000000
(XEN)    0000000000000000 0000000000000003 0000000000000003 0000000000000009
(XEN)    0000000000000009 000000000000003a 00000000002293e4 0000000000000000
(XEN)    00000000002c92b8 ffffffffffffffff 00000000002583d3 000000010000e030
(XEN)    0000000000010046 00000000002c9258 000000000000e02b 00000000000013fc
(XEN)    656e696665646e55 7372756365725f64 75646f6d00000003 050000000000656c
(XEN)    00000000003ddbb8 00000000003de4f8 00000000003dfef8 0000000000000000
(XEN)    00000000002c93c8 0000000000440d38 0000000000000000 00000000002589ff

I've tried destroying and recreating the same unikernel multiple times, and I always get the same error. When running on Unix I don't run into this issue, even when fetching multiple MB of data.

By filling my code with logs, I figured out exactly where the unikernel stops: during JSON response parsing (I'm using the Yojson library):

let directExtraction rawJson =
  Log.info (fun f -> f "Initializing direct extraction");
  let open Yojson.Basic.Util in
  let json = Yojson.Basic.from_string rawJson in
  let result =
    [json] |> filter_member "results" |> flatten
           |> filter_member "series"  |> flatten
           |> filter_member "values"  |> flatten
  in
  List.map (fun item ->
      (* Turn the second element of each row into a string.
         Note: this match is not exhaustive (`Null, `List and `Assoc
         are unhandled). *)
      match item |> index 1 with
      | `String a -> a
      | `Float f -> string_of_float f
      | `Int i -> string_of_float (float_of_int i)
      | `Bool b -> string_of_bool b)
    result
  |> computeAverage >>= fun aver ->
  log_lwt ~inject:(fun f -> f "Result %f" aver)

I know my code probably isn't optimized or clean, but I'm quite shocked to see that my unikernel crashes once it has to extract roughly 2800 datapoints (that's more or less the threshold at which it crashes). The function computeAverage is never even called. If I run the same code on Unix I can parse and process up to 1M datapoints in less than a second.

I've also tried increasing the number of vCPUs and the memory (16 vCPUs and 4GB), but nothing changed.

I would like to add that this threshold changes depending on the host machine:

  • Machine A (Ubuntu 14.04, Xen 4.6.0, 32 cores, 128 GB RAM, 10 Gb network interface, Mirage 2.9.1, OCaml 4.02.3) -> threshold is around 107 KB
  • Machine B (Debian 8.5, Xen 4.4.1, 4 cores, 8 GB RAM, 1 Gb network interface, Mirage 2.9.1, OCaml 4.02.3) -> threshold is around 33 KB
@vitcozzolino vitcozzolino changed the title Xen: data allocation issue? MirageOS on Xen: systematic crash with create_bounce_frame Dec 13, 2016

vitcozzolino commented Dec 14, 2016

I've created a lightweight version of my original unikernel to make it easier to debug and figure out where the error comes from. I've created a Gist with all the required resources.

Just one suggestion: run make with warning suppression, otherwise you're going to see a wall of warnings about "unescaped end-of-line in string constant" from my inlined JSON string.


talex5 commented Dec 16, 2016

@vitcozzolino the upload seems to be incomplete (the JSON ends in the middle of an entry). If I close all the open values, it works for me though.

@vitcozzolino (Author)

@talex5 I've re-uploaded the JSON; can you try again now? Maybe something went wrong when I pasted the code.

@yomimono (Contributor)

I was able to reproduce this as given with OCaml 4.02.3 and the latest released versions of packages (obtained via mirage configure with Mirage 2.9.1 and the default opam repository): the unikernel crashed with very little console output and dmesg output like the one vitcozzolino posted. When I added some additional printf debugging calls to C.log, the unikernel completed successfully (and very quickly).


talex5 commented Dec 17, 2016

@vitcozzolino I think you're running out of stack space because List.map isn't tail recursive. Replacing the two calls to List.map with bigmap made it work for me:

let bigmap f xs =
  let rec aux acc = function
    | [] -> List.rev acc
    | x :: xs -> aux (f x :: acc) xs
  in
  aux [] xs
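This works because the recursive call to aux is in tail position, so it runs in constant stack space, and the accumulator is reversed once at the end. For reference, an equivalent tail-recursive map using only the standard library (List.rev_map is documented as tail-recursive; safe_map is just an illustrative name) would be:

let safe_map f xs = List.rev (List.rev_map f xs)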


vitcozzolino commented Dec 19, 2016

@talex5 Thanks a lot, now it works! Unfortunately I've bumped into another issue: I've stress-tested the unikernel a bit, and when I try to fetch too many rows I receive the following error:

Fatal error: out of memory.
Mirage exiting with status 2
Do_exit called!

On Unix this doesn't happen, and I don't think it's a coding error this time. Are we running out of stack memory again? I'm trying to figure out how to expand it.


talex5 commented Dec 19, 2016

Assuming your Xen unikernel really does have enough RAM, one known problem area to consider is GC accounting: memory allocated outside the OCaml heap isn't tracked by the GC, so try calling Gc.full_major () before the allocation that fails (see also mirage/io-page#38).


vitcozzolino commented Dec 19, 2016

@talex5 Sorry, I forgot to reconfigure the available RAM for my Unikernel. Now I can go up to 4M rows fetched from the DB and that's already enough for my measurements. Thanks a lot for helping :)


hannesm commented Dec 19, 2016

I'm curious where the stack size of Xen unikernels is configured, and whether it is big enough for us (mirage/ocaml-git#151 seems to be related) -- on Solo5 I have not run into the stack size problem (maybe we should use the same stack size for Solo5 and for Xen!?).

@vitcozzolino (Author)

I was actually asking myself the same question, but I wasn't able to find any references. I only found some OCaml-related information about how to change the stack size and how it works.

At the moment I'm still able to trigger out-of-memory errors with some specific Xen unikernel configurations. For example, if I try to parse and manipulate huge HTTP GET response bodies (110+ MB), I receive an out-of-memory error if the RAM allocated to my Xen unikernel is <= 2GB. I'm still running tests and measurements, so I still have to polish my findings.

Anyway, I would love to understand the correlation between the stack memory available to the MirageOS PV guest running on Xen and the amount of RAM I actually allocate in the .xl config file.


mato commented Dec 19, 2016

@hannesm The stack size is set in Mini-OS; upstream defines it here: https://github.com/mirage/mini-os/blob/master/include/x86/arch_limits.h#L17, used at https://github.com/mirage/mini-os/blob/master/include/mm.h#L44. The Mirage fork will be similar; I don't have a copy to hand.
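For intuition, a back-of-the-envelope sketch (the 16-pages-of-4-KiB figure is an assumption to verify against the linked header; it would put the x86 guest stack at 64 KiB for the whole unikernel):

(* Assumption: a page order of 4, i.e. 2^4 pages of 4 KiB each. *)
let page_size = 4096
let stack_pages = 1 lsl 4
let stack_bytes = page_size * stack_pages          (* 65_536 bytes = 64 KiB *)

(* A non-tail-recursive List.map pushes one stack frame per list
   element; at a few dozen bytes per frame, 64 KiB is exhausted after a
   couple of thousand elements, consistent with the ~2800-datapoint
   threshold reported above. *)
let () = Printf.printf "stack = %d KiB\n" (stack_bytes / 1024)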


hannesm commented Dec 19, 2016

@mato thx, this is valuable information (and should be in an FAQ somewhere on the MirageOS website, IMHO); related, regarding memory tuning, is Solo5/solo5#58 (comment).


talex5 commented Dec 19, 2016

(You also get a lot more stack space on ARM, a little over 1MB: https://github.com/talex5/mini-os/blob/444542b05e0f8a6220129b90f4697563d4bd0e1b/arch/arm/arm32.S#L151)


mato commented Dec 20, 2016

I just tried the siege -c10 -b -t30s http://10.0.0.2:8080/ test against a static_website unikernel compiled for Xen (OCaml 4.03.0, latest mirage-dev, Xen 4.8, identical hardware to that used for the Solo5 tests in Solo5/solo5#58) and had to bump the domU memory up to 512MB; even then I occasionally hit out-of-memory errors. It's possible that the Mini-OS memory allocator(s) are less efficient than those used in Solo5 (where the path for OCaml memory allocation goes straight to dlmalloc and there's no underlying page allocator involved).


hannesm commented Dec 20, 2016

@mato Solo5 also does io-page allocation differently (see mirage/io-page#38) -- how (and where) does Solo5 actually implement the io-page primitive caml_alloc_pages?


mato commented Dec 20, 2016 via email

@vitcozzolino (Author)

Hi, I've done some more tests to understand when the Fatal error: out of memory. error is triggered. I've created a MirageOS unikernel running on Xen with 2GB of RAM and 2 cores, and I've tried to fetch and compute an increasing number of rows from a DB. This is what happens, systematically:

  • 1M rows (~30MB) -> Success
  • 2M rows (~60MB) -> Success
  • 3M rows (~90MB) -> Success
  • 4M rows (ND) -> Failure (Fatal error: out of memory.)

I have been following the discussion, but honestly I would like some help understanding the correlation between the out-of-memory error and the RAM allocated to the PV guest. I'm fairly sure there is one, considering that if I bump the RAM to 4GB I can successfully complete the 4M-row request. Why do I need so much RAM to handle such a (proportionally) small request?
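(One contributing factor, as a hedged back-of-the-envelope rather than a confirmed diagnosis: the in-heap OCaml representation of parsed JSON is much larger than the raw bytes. Assuming a 64-bit system with 8-byte words:

(* Minimum heap cost of keeping one datapoint as a string in a list:
   - list cons cell:    3 words (header + head + tail) = 24 bytes
   - `String variant:   2 words (header + payload)     = 16 bytes
   - the string itself: 1 header word + padded data    = 16+ bytes
   i.e. >= ~56 bytes per datapoint, before counting Yojson's full
   intermediate tree, the temporary lists built by each |> stage, and
   the slack the GC keeps while growing the major heap. *)
let min_bytes_per_point = (3 + 2 + 2) * 8
let () = Printf.printf "~%d bytes/point minimum\n" min_bytes_per_point

So a few raw bytes of JSON per value can easily become an order of magnitude more live heap.)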


talex5 commented Jan 6, 2017

If you're getting out of memory (rather than a page fault) then presumably it's not stack space that's the problem now (so this is really a different issue). It's probably worth trying the original suggestions again on this new problem, i.e. call Gc.full_major () just before the crash to see if the problem is GC accounting (and look at mirage/io-page#38 if it is). Otherwise, see if you can make another small test case that doesn't need an actual database.
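A minimal sketch of that kind of probe (gc_probe is a hypothetical helper name; Gc.full_major and Gc.print_stat are standard library functions):

(* Force a complete major collection and dump heap statistics just
   before the allocation-heavy step. If the live heap is small but the
   unikernel still runs out of memory, GC accounting of externally
   allocated memory (e.g. io-pages) is a likely suspect. *)
let gc_probe label =
  Gc.full_major ();
  Printf.printf "[gc] %s\n" label;
  Gc.print_stat stdout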

@vitcozzolino (Author)

I've tried Gc.full_major (), but unfortunately nothing changed. I'm in the process of gathering a bit more data, but at the moment I can confirm that I can replicate essentially the same out-of-memory error with many combinations of RAM and amount of fetched data. For example, with 64 MB of RAM I can trigger the error by fetching roughly 3.5 MB of data.

I'll provide an updated gist and some more info as soon as possible.

@marqueswsm

Hello! Did you find a solution to this problem? I'm working with Mirage on Xen and I'm hitting the same "out of memory" error.


talex5 commented Aug 26, 2017

Closing this issue because the original create_bounce_frame problem was solved (non-tail-recursive map) and this probably isn't a good place for discussion of general out-of-memory problems. Please open a new issue if you have a new out-of-memory problem.

talex5 closed this as completed Aug 26, 2017