New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Better error reporting and logging #55

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
4 participants
@talex5
Contributor

talex5 commented Mar 22, 2016

This branch makes some changes to allow better diagnostics from functoria unikernels:

  • It removes functoria's default error handling (which is to ignore the actual error value and fail with the name of an internal variable). Devices are now expected to handle errors themselves and report something sensible.
  • connect is replaced by connect_raw, which doesn't automatically force its arguments. This allows devices to control the order of initialisation. In particular, this allows a log reporter device to configure logging before starting the initialisation of the main unikernel.

See mirage/better-errors for the corresponding changes to mirage.

See https://github.com/talex5/canopy-data/blob/master/Posts/Errors.md for examples of the new errors and logging this makes possible.

@Drup

This comment has been minimized.

Show comment
Hide comment
@Drup

Drup Mar 22, 2016

Member

All my comments from mirage/mirage#514 still apply. Making connect private doesn't make the API any better.
You introduce a bunch of complexity for one device (logging) that could be implemented with more orthogonal APIs.

As for errors, It's a good idea to handle them at the site of the device, instead of the site of the connect. I however don't think the solution is to make the device handle all the wiring himself.

If we assume that all connects returns an 'a result (which is pretty much the case, afaics) but we don't want to propagate the errors down the line, then it's better to have something specialized for that.
I think it's time to reboot mirage/mirage#383 and put pp_error and connect functions inside the DEVICE signature.

Member

Drup commented Mar 22, 2016

All my comments from mirage/mirage#514 still apply. Making connect private doesn't make the API any better.
You introduce a bunch of complexity for one device (logging) that could be implemented with more orthogonal APIs.

As for errors, It's a good idea to handle them at the site of the device, instead of the site of the connect. I however don't think the solution is to make the device handle all the wiring himself.

If we assume that all connects returns an 'a result (which is pretty much the case, afaics) but we don't want to propagate the errors down the line, then it's better to have something specialized for that.
I think it's time to reboot mirage/mirage#383 and put pp_error and connect functions inside the DEVICE signature.

@talex5

This comment has been minimized.

Show comment
Hide comment
@talex5

talex5 Mar 22, 2016

Contributor

For devices that support a generic pp_error function and where there's nothing extra to print, you just use or_fail_pp, e.g.

method! private connect _ modname = function
  | [ eth ] ->
      Printf.sprintf "%s.connect %s" modname eth
      |> or_fail_pp "Ethernet connection failed" modname
  | _ -> failwith "The ethernet connect should receive exactly one argument."

There are several devices that never fail to connect (e.g. the clock devices) and it's nice that I was able to remove all the error handling code from those.

It's also somewhat convenient that we can support other APIs, at least during the transition.

Do you think you could modify the code to show how you're proposing to handle things? i.e. so it works the same way it does in this branch from the user's point of view (e.g. the clock device specified by the user in config.ml gets initialised, then the logging system is initialised, then the unikernel is initialised).

Contributor

talex5 commented Mar 22, 2016

For devices that support a generic pp_error function and where there's nothing extra to print, you just use or_fail_pp, e.g.

method! private connect _ modname = function
  | [ eth ] ->
      Printf.sprintf "%s.connect %s" modname eth
      |> or_fail_pp "Ethernet connection failed" modname
  | _ -> failwith "The ethernet connect should receive exactly one argument."

There are several devices that never fail to connect (e.g. the clock devices) and it's nice that I was able to remove all the error handling code from those.

It's also somewhat convenient that we can support other APIs, at least during the transition.

Do you think you could modify the code to show how you're proposing to handle things? i.e. so it works the same way it does in this branch from the user's point of view (e.g. the clock device specified by the user in config.ml gets initialised, then the logging system is initialised, then the unikernel is initialised).

@Drup

This comment has been minimized.

Show comment
Hide comment
@Drup

Drup Mar 22, 2016

Member

About errors: Your solution may looks more convenient, because you moved all the boilerplate about error handling inside the device definition (and provide some convenience functions ... in mirage). I agree it works, but I would prefer to go the opposite way, and force everyone to use a result type.

About logging: I'll try to find some time to make an implementation ...

Member

Drup commented Mar 22, 2016

About errors: Your solution may looks more convenient, because you moved all the boilerplate about error handling inside the device definition (and provide some convenience functions ... in mirage). I agree it works, but I would prefer to go the opposite way, and force everyone to use a result type.

About logging: I'll try to find some time to make an implementation ...

@talex5

This comment has been minimized.

Show comment
Hide comment
@talex5

talex5 Mar 26, 2016

Contributor

It would be nice at least not to force people to add error reporting functions for things that can't fail. For example, after updating tcpip to only return result types for things that can actually fail, this is the diff to mirage-skeleton:

diff --git a/ethifv4/unikernel.ml b/ethifv4/unikernel.ml
index 953eb30..b2a6491 100644
--- a/ethifv4/unikernel.ml
+++ b/ethifv4/unikernel.ml
@@ -13,21 +13,14 @@ module Main (C: CONSOLE) (N: NETWORK) (Clock : V1.CLOCK) = struct
   module I = Ipv4.Make(E)(A)
   module U = Udp.Make(I)
   module T = Tcp.Flow.Make(I)(OS.Time)(Clock)(Random)
-  module D = Dhcp_clientv4.Make(C)(OS.Time)(Random)(U)
-
-  let or_error _c name fn t =
-    fn t
-    >>= function
-    | `Error _e -> Lwt.fail (Failure ("Error starting " ^ name))
-    | `Ok t -> Lwt.return t
+  module D = Dhcp_clientv4.Make(OS.Time)(Random)(U)

   let start c net _ =
-    or_error c "Ethif" E.connect net
-    >>= fun e ->
-    or_error c "Arpv4" A.connect e
-    >>= fun a ->
-    or_error c "Ipv4" (I.connect e) a
-    >>= fun i ->
+    E.connect net >>= function
+    | `Error e -> failwith (Format.asprintf "Can't connect ethernet: %a" E.pp_error e)
+    | `Ok e ->
+    A.connect e >>= fun a ->
+    I.connect e a >>= fun i ->

 I.set_ip i (Ipaddr.V4.of_string_exn "10.0.0.2")
 >>= fun () ->
Contributor

talex5 commented Mar 26, 2016

It would be nice at least not to force people to add error reporting functions for things that can't fail. For example, after updating tcpip to only return result types for things that can actually fail, this is the diff to mirage-skeleton:

diff --git a/ethifv4/unikernel.ml b/ethifv4/unikernel.ml
index 953eb30..b2a6491 100644
--- a/ethifv4/unikernel.ml
+++ b/ethifv4/unikernel.ml
@@ -13,21 +13,14 @@ module Main (C: CONSOLE) (N: NETWORK) (Clock : V1.CLOCK) = struct
   module I = Ipv4.Make(E)(A)
   module U = Udp.Make(I)
   module T = Tcp.Flow.Make(I)(OS.Time)(Clock)(Random)
-  module D = Dhcp_clientv4.Make(C)(OS.Time)(Random)(U)
-
-  let or_error _c name fn t =
-    fn t
-    >>= function
-    | `Error _e -> Lwt.fail (Failure ("Error starting " ^ name))
-    | `Ok t -> Lwt.return t
+  module D = Dhcp_clientv4.Make(OS.Time)(Random)(U)

   let start c net _ =
-    or_error c "Ethif" E.connect net
-    >>= fun e ->
-    or_error c "Arpv4" A.connect e
-    >>= fun a ->
-    or_error c "Ipv4" (I.connect e) a
-    >>= fun i ->
+    E.connect net >>= function
+    | `Error e -> failwith (Format.asprintf "Can't connect ethernet: %a" E.pp_error e)
+    | `Ok e ->
+    A.connect e >>= fun a ->
+    I.connect e a >>= fun i ->

 I.set_ip i (Ipaddr.V4.of_string_exn "10.0.0.2")
 >>= fun () ->

@talex5 talex5 referenced this pull request Mar 27, 2016

Closed

[Tracking] Better errors #107

@samoht

This comment has been minimized.

Show comment
Hide comment
@samoht

samoht Apr 4, 2016

Member

@Drup any update on this? Would be great to be able to merge the logging and error reporting patches as they greatly improve the current state of the MirageOS world and they will bit rot very fast as they are quite large changes.

Member

samoht commented Apr 4, 2016

@Drup any update on this? Would be great to be able to merge the logging and error reporting patches as they greatly improve the current state of the MirageOS world and they will bit rot very fast as they are quite large changes.

@talex5

This comment has been minimized.

Show comment
Hide comment
@talex5

talex5 Apr 6, 2016

Contributor

One possibility: since the new API (for logging) is backwards compatible (existing devices don't change - only the log device uses the new method), could we apply the patch now and then change to the new API when that's ready? Presumably, that would only involve changing a couple of lines in the log device.

The changes to error handling also shouldn't conflict with future changes, since we have to remove the current error handling in any case. Even if we want to use pp_error for all devices eventually, it might still make sense to use the current patch to allow a transition period so we don't have to release everything at once.

Contributor

talex5 commented Apr 6, 2016

One possibility: since the new API (for logging) is backwards compatible (existing devices don't change - only the log device uses the new method), could we apply the patch now and then change to the new API when that's ready? Presumably, that would only involve changing a couple of lines in the log device.

The changes to error handling also shouldn't conflict with future changes, since we have to remove the current error handling in any case. Even if we want to use pp_error for all devices eventually, it might still make sense to use the current patch to allow a transition period so we don't have to release everything at once.

@samoht

This comment has been minimized.

Show comment
Hide comment
@samoht

samoht Apr 28, 2016

Member

The log parts of that PR is now done differently, by adding new init hooks (#58)

Member

samoht commented Apr 28, 2016

The log parts of that PR is now done differently, by adding new init hooks (#58)

Remove error handling from use-sites
Before, each user of an input checked whether it was an error and
aborted if so. However, the user of a value usually doesn't have the
information needed to display the error. Instead, errors should be
handled at the place where [connect] is called, using instance-specific
code.

@talex5 talex5 referenced this pull request Jul 14, 2016

Closed

[Tracking] Better errors #559

@Drup

This comment has been minimized.

Show comment
Hide comment
@Drup

Drup Jul 21, 2016

Member

I'm still of the opinion that moving error handling in mirage, instead of functoria doing it, is the wrong direction. You say it allows to simplify error handling for devices that don't error, why does it matter ? It's certainly not efficiency (this is startup code, aka cold, and flambda should be good at simplifying this), the prettyness of the generated code is unimportant and it makes the device code less regular (some need to return Ok, some don't).

IMHO, A better solution would be to assume (in functoria) that M.pp_error exists, and use it. That would mean making meta_connect emits | Error _e -> <module>.pp_error <device name> _e.

Member

Drup commented Jul 21, 2016

I'm still of the opinion that moving error handling in mirage, instead of functoria doing it, is the wrong direction. You say it allows to simplify error handling for devices that don't error, why does it matter ? It's certainly not efficiency (this is startup code, aka cold, and flambda should be good at simplifying this), the prettyness of the generated code is unimportant and it makes the device code less regular (some need to return Ok, some don't).

IMHO, A better solution would be to assume (in functoria) that M.pp_error exists, and use it. That would mean making meta_connect emits | Error _e -> <module>.pp_error <device name> _e.

@talex5

This comment has been minimized.

Show comment
Hide comment
@talex5

talex5 Jul 23, 2016

Contributor

Well, here are the simplifications that were possible to mirage-skeleton at least:

talex5/mirage-skeleton@79744e6

Mirage/functoria is useful when you want to abstract over the platform, but there are plenty of times you want to instantiate things yourself.

Also, as a library author, it's quite annoying having to write pp_error functions for errors that don't exist (especially as OCaml doesn't allow empty variants).

Contributor

talex5 commented Jul 23, 2016

Well, here are the simplifications that were possible to mirage-skeleton at least:

talex5/mirage-skeleton@79744e6

Mirage/functoria is useful when you want to abstract over the platform, but there are plenty of times you want to instantiate things yourself.

Also, as a library author, it's quite annoying having to write pp_error functions for errors that don't exist (especially as OCaml doesn't allow empty variants).

@hannesm

This comment has been minimized.

Show comment
Hide comment
@hannesm

hannesm Oct 2, 2016

Member

I'm closing this since #71 was merged (see mirage/mirage#602 for more details), all people involved in this issue seem to be happy with the solution

Member

hannesm commented Oct 2, 2016

I'm closing this since #71 was merged (see mirage/mirage#602 for more details), all people involved in this issue seem to be happy with the solution

@hannesm hannesm closed this Oct 2, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment