Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Better error reporting and logging #55

Closed
wants to merge 1 commit into from

Conversation

@talex5
Copy link
Contributor

@talex5 talex5 commented Mar 22, 2016

This branch makes some changes to allow better diagnostics from functoria unikernels:

  • It removes functoria's default error handling (which is to ignore the actual error value and fail with the name of an internal variable). Devices are now expected to handle errors themselves and report something sensible.
  • connect is replaced by connect_raw, which doesn't automatically force its arguments. This allows devices to control the order of initialisation. In particular, this allows a log reporter device to configure logging before starting the initialisation of the main unikernel.

See mirage/better-errors for the corresponding changes to mirage.

See https://github.com/talex5/canopy-data/blob/master/Posts/Errors.md for examples of the new errors and logging this makes possible.

@Drup
Copy link
Member

@Drup Drup commented Mar 22, 2016

All my comments from mirage/mirage#514 still apply. Making connect private doesn't make the API any better.
You introduce a bunch of complexity for one device (logging) that could be implemented with more orthogonal APIs.

As for errors, It's a good idea to handle them at the site of the device, instead of the site of the connect. I however don't think the solution is to make the device handle all the wiring himself.

If we assume that all connects returns an 'a result (which is pretty much the case, afaics) but we don't want to propagate the errors down the line, then it's better to have something specialized for that.
I think it's time to reboot mirage/mirage#383 and put pp_error and connect functions inside the DEVICE signature.

@talex5
Copy link
Contributor Author

@talex5 talex5 commented Mar 22, 2016

For devices that support a generic pp_error function and where there's nothing extra to print, you just use or_fail_pp, e.g.

method! private connect _ modname = function
  | [ eth ] ->
      Printf.sprintf "%s.connect %s" modname eth
      |> or_fail_pp "Ethernet connection failed" modname
  | _ -> failwith "The ethernet connect should receive exactly one argument."

There are several devices that never fail to connect (e.g. the clock devices) and it's nice that I was able to remove all the error handling code from those.

It's also somewhat convenient that we can support other APIs, at least during the transition.

Do you think you could modify the code to show how you're proposing to handle things? i.e. so it works the same way it does in this branch from the user's point of view (e.g. the clock device specified by the user in config.ml gets initialised, then the logging system is initialised, then the unikernel is initialised).

@Drup
Copy link
Member

@Drup Drup commented Mar 22, 2016

About errors: Your solution may looks more convenient, because you moved all the boilerplate about error handling inside the device definition (and provide some convenience functions ... in mirage). I agree it works, but I would prefer to go the opposite way, and force everyone to use a result type.

About logging: I'll try to find some time to make an implementation ...

@talex5
Copy link
Contributor Author

@talex5 talex5 commented Mar 26, 2016

It would be nice at least not to force people to add error reporting functions for things that can't fail. For example, after updating tcpip to only return result types for things that can actually fail, this is the diff to mirage-skeleton:

diff --git a/ethifv4/unikernel.ml b/ethifv4/unikernel.ml
index 953eb30..b2a6491 100644
--- a/ethifv4/unikernel.ml
+++ b/ethifv4/unikernel.ml
@@ -13,21 +13,14 @@ module Main (C: CONSOLE) (N: NETWORK) (Clock : V1.CLOCK) = struct
   module I = Ipv4.Make(E)(A)
   module U = Udp.Make(I)
   module T = Tcp.Flow.Make(I)(OS.Time)(Clock)(Random)
-  module D = Dhcp_clientv4.Make(C)(OS.Time)(Random)(U)
-
-  let or_error _c name fn t =
-    fn t
-    >>= function
-    | `Error _e -> Lwt.fail (Failure ("Error starting " ^ name))
-    | `Ok t -> Lwt.return t
+  module D = Dhcp_clientv4.Make(OS.Time)(Random)(U)

   let start c net _ =
-    or_error c "Ethif" E.connect net
-    >>= fun e ->
-    or_error c "Arpv4" A.connect e
-    >>= fun a ->
-    or_error c "Ipv4" (I.connect e) a
-    >>= fun i ->
+    E.connect net >>= function
+    | `Error e -> failwith (Format.asprintf "Can't connect ethernet: %a" E.pp_error e)
+    | `Ok e ->
+    A.connect e >>= fun a ->
+    I.connect e a >>= fun i ->

 I.set_ip i (Ipaddr.V4.of_string_exn "10.0.0.2")
 >>= fun () ->
@samoht
Copy link
Member

@samoht samoht commented Apr 4, 2016

@Drup any update on this? Would be great to be able to merge the logging and error reporting patches as they greatly improve the current state of the MirageOS world and they will bit rot very fast as they are quite large changes.

@talex5 talex5 force-pushed the talex5:better-errors branch from b6e7863 to 6625405 Apr 6, 2016
@talex5
Copy link
Contributor Author

@talex5 talex5 commented Apr 6, 2016

One possibility: since the new API (for logging) is backwards compatible (existing devices don't change - only the log device uses the new method), could we apply the patch now and then change to the new API when that's ready? Presumably, that would only involve changing a couple of lines in the log device.

The changes to error handling also shouldn't conflict with future changes, since we have to remove the current error handling in any case. Even if we want to use pp_error for all devices eventually, it might still make sense to use the current patch to allow a transition period so we don't have to release everything at once.

@samoht
Copy link
Member

@samoht samoht commented Apr 28, 2016

The log parts of that PR is now done differently, by adding new init hooks (#58)

Before, each user of an input checked whether it was an error and
aborted if so. However, the user of a value usually doesn't have the
information needed to display the error. Instead, errors should be
handled at the place where [connect] is called, using instance-specific
code.
@talex5 talex5 force-pushed the talex5:better-errors branch from b15ec32 to 907a410 Jul 13, 2016
@Drup
Copy link
Member

@Drup Drup commented Jul 21, 2016

I'm still of the opinion that moving error handling in mirage, instead of functoria doing it, is the wrong direction. You say it allows to simplify error handling for devices that don't error, why does it matter ? It's certainly not efficiency (this is startup code, aka cold, and flambda should be good at simplifying this), the prettyness of the generated code is unimportant and it makes the device code less regular (some need to return Ok, some don't).

IMHO, A better solution would be to assume (in functoria) that M.pp_error exists, and use it. That would mean making meta_connect emits | Error _e -> <module>.pp_error <device name> _e.

@talex5
Copy link
Contributor Author

@talex5 talex5 commented Jul 23, 2016

Well, here are the simplifications that were possible to mirage-skeleton at least:

talex5/mirage-skeleton@79744e6

Mirage/functoria is useful when you want to abstract over the platform, but there are plenty of times you want to instantiate things yourself.

Also, as a library author, it's quite annoying having to write pp_error functions for errors that don't exist (especially as OCaml doesn't allow empty variants).

@hannesm
Copy link
Member

@hannesm hannesm commented Oct 2, 2016

I'm closing this since #71 was merged (see mirage/mirage#602 for more details), all people involved in this issue seem to be happy with the solution

@hannesm hannesm closed this Oct 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.