Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Json output for the compiler, part 1: structured metadata for error messages #9979

Open
wants to merge 17 commits into
base: trunk
Choose a base branch
from

Conversation

Octachron
Copy link
Member

This PR is the result of @muskangarg21's Outreachy internship.
The aim of this PR is to give a basic machine-readable output to avoid the need of parsing the compiler messages.
It adds a new -json flag to ocaml, ocamc, and ocamlopt that turns the compiler messages format to json.
For instance trying to compile

1 + ()

with

# ocamlc -json test.ml

yields with this PR:

{"error_report":
    [{"info": {"kind": "error"},
      "msg":
         {"location": {"file": "test.ml", "lines": [1, 1], "characters": [4, 6]},
          "content":
             "This expression has type unit but an expression was expected of type int"
         },
      "submsgs": []
     }
    ]
}

(the testsuite contains more involved examples).

Note that This PR doesn't alter the content of error messages, and only structures the associated metadata.

The current json scheme (an object with a single "error_report" field) has been chosen to be extensible.
In particular, @muskangarg21 has extended this output mode to all debugging output of the compiler (-dprofile, -dparsetree, -d....). Since this extension is mostly busy work, I have split it into a upcoming second PR.

@ulugbekna
Copy link
Contributor

Thank you for your PR and congrats to @muskangarg21!

This should be very helpful to get nicer errors in dune, merlin, and ocaml-lsp!

I can't comment on the implementation but wonder if we're planning to have more formal description of produced JSONs (e.g., what values can info.kind take apart from "error") and maybe have a design discussion about that format; for example, having

  ...
  "lines": [6, 8]
  "characters": [3, 18]
  ...

is a little ambiguous (does the first character number pertain to the first line number?) and perhaps could be better represented as

  ...
  "start_pos" : { "line" : 6, "character" : 3 },
  "end_pos": { "line" : 8, "character" : 18 },
  ...

Replacing fixed-sized arrays with a json object would also cut down on code such as | [a, b] -> ... | _ -> ... and make parsing easier.

Anyway, thank you!

@Octachron Octachron force-pushed the json-error-finalized branch 2 times, most recently from 243844b to 099526d Compare October 16, 2020 10:13
@xavierleroy
Copy link
Contributor

Before we bikeshed on the output format:

Just like there are DTDs to describe XML documents, I heard there are JSON schema to describe JSON documents.

Would it be useful to write down a schema for the JSON data that is produced here? I didn't see it in the PR but may have missed it.

@xavierleroy
Copy link
Contributor

Relevant link: https://json-schema.org/

@ulugbekna
Copy link
Contributor

well, the format I had in mind implied use of json schemas :-)

@Octachron
Copy link
Member Author

It would be probably useful to have a json schema for the output. However, it opens the door to few design questions: do we want a hand-written schema, or one automatically generated? There is also the question of the coupling between the schema and the json printer, do we want to test that the printer is valid according to the schema, or would we rather make the printer valid by construction using a typed representation of the schema?

@gasche
Copy link
Member

gasche commented Oct 19, 2020

What if we saw these questions as opportunities for incremental improvements, rather than choices that you want to make right now? Having a schema written down, even just for documentation, is better than not having one at all. Having a way to check validity (at runtime or compile-time) is better than not. (Among checking strategies, it's not clear that "typed representations" pull their complexity weight, so this is more of an exploration of tradeoffs.) Rather than the perfect/definitive solution, I think that you should go for the amount of effort you think is right for a first PR. (Maybe that's the current state, but then maybe not.)

@ulugbekna
Copy link
Contributor

ulugbekna commented Oct 19, 2020

My initial request was just to have some reference (this is a good example) that users of the compiler's json output can use and have some discussion on the json schema design choices, so that the incremental changes later don't snowball less desirable choices made at the beginning.

utils/misc.mli Outdated
(** Store json fragments for the json output mode *)
type json_fragments =
{
toplevel_keys : Json.t Stdlib.String.Map.t ref;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a mutable field instead?

@rgrinberg
Copy link
Member

@Octachron pointed out this PR to me when I asked about implementing ocaml/ocaml-lsp#258

This lsp issue requires me to distinguish between "unused value" and "deprecation" warnings.

Off the top of my head, I don't see how this PR helps lsp. In lsp, we already have direct access to Location.report values, so the problem isn't parsing the errors. The problem is that Location.report does not have enough (conveniently available) information to interpret these errors. What would immediately help me is amending report_kind to be:

type report_kind =
  | Report_error
  | Report_warning of string * Warnings.t
  | Report_warning_as_error of string * Warnings.t
  | Report_alert of string
  | Report_alert_as_error of string

Plus a similar change for alerts that would allow me to pick out the ones that correspond to deprecations.

Where this PR does help is for reporting build errors. It makes it possible for us to:

  • drop the regex hacks required to extract locations
  • allows us to add useful metadata when reporting such errors.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished a first pass of reviews. There is a fair number of questions and comments, because the diff itself is large. Overall this looks fairly good, and I must say that I am impressed with the breadth of changes to the compiler internals; this is a more ambitious change than I would have guessed. Thanks!

@@ -676,6 +676,11 @@ let mk_error_style f =
\ be set through the OCAML_ERROR_STYLE environment variable."
;;

let mk_json f =
"-json", Arg.Unit f,
" Print the compiler messages in json format"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(when supported)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably a good point currently.

| `String of string
| `Assoc of (string * t) list
| `List of t list
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that this was designed to be compatible with the polymorphic-variant representation of other JSON libraries. Maybe this could be maybe explicit with a comment pointing to the definition in those libraries, so that future evolutions of the datatype (if any) can be manually checked to preserve this compatibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a related note, I know Yojson is the most popular json library in OCaml, but I've come to appreciate how jsonaf handles numbers in json more. It's not a huge deal, but I'd prefer if the compiler followed this convention instead.

utils/misc.mli Outdated
type json_fragments =
{
toplevel_keys : Json.t Stdlib.String.Map.t ref;
error_key : Json.t list ref;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what error_key means here, maybe a comment (or a different name) would help. (toplevel_keys is clear enough)

utils/misc.mli Outdated
(** Json mode: we are building a json output, and only print to the
underlying formatter once the logging session ends. *)

val logf : string ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the key parameter could be a keyword?

utils/misc.ml Outdated Show resolved Hide resolved
Misc.Log.logf "status" log "Interrupted.@.";
Misc.Log.flush log;
false
| x -> Location.report_exception log x; Misc.Log.flush log; false) in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would want to factorize the three flush calls here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we could use a function Location.with_log : formatter -> (unit -> 'a) -> 'a)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have factorized the code with a with_log function as much as possible (there is at least one case when we want to avoid printing empty object needlessly in the toplevel).

toplevel/opttoploop.ml Outdated Show resolved Hide resolved
toplevel/opttoploop.ml Outdated Show resolved Hide resolved

let execute_phrase_with_log print_outcome log phr =
Fun.protect (fun () -> execute_phrase print_outcome log phr)
~finally:Warnings.reset_fatal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand what the reset_fatal business is doing here, but this might be due to an inconsistency between toploop and opttoploop (which I am reviewing first).

toplevel/opttoploop.mli Outdated Show resolved Hide resolved
@trefis
Copy link
Contributor

trefis commented Nov 25, 2020

I haven't looked at this yet, but as I said to @Octachron in private: please do not merge before I do.
This is likely to impact merlin, where we also make a json object out of everything, and have also modified to pipeline to intercept errors, etc.

And here is something I haven't said to @Octachron in private: I wish there had been more (public?) communication, in particular with merlin's authors, prior to this work being actually done/started; of course I might just have had my head buried in the sand, and missed the discussions when they happened, in which case: my bad.
Anyway, as things are, I'm afraid there will be a lot of duplication with merlin, as well as a lot of small differences sprinkled everywhere. And I fear it's going to be a pain for us to integrate with that change. (The other option being for us to bypass it entirely, and rely on our own json serializers, but that'd kind of make me sad).

@gasche
Copy link
Member

gasche commented Nov 25, 2020

Regarding previous communication: there was a first PR (#9538) on May 9th, 2020, and then another (#9758) on July 13th, 2020, and a discussion on caml-devel on the same day. (There was also an older issue, #7123 from 2016, where those PRs were reported.) This being said, I think that the onus is on downstream forkers to send us their stuff, rather than on compiler maintainers to pre-warn them of potential changes to the codebase.

@lpw25
Copy link
Contributor

lpw25 commented Nov 25, 2020

To be fair, @trefis's complaint is not the first I have heard about a lack of communication for this line of work. There doesn't seem to have been much discussion with the authors of tools that might consume this data (e.g. dune, ocaml-lsp, merlin, etc.). Perhaps we could get a description of some use cases that this work is intended to solve, and then check with the authors of the tools involved that this is what they actually need.

@trefis
Copy link
Contributor

trefis commented Nov 25, 2020

Thanks for all the bookkeeping @gasche !

I don't want to derail this conversation too much, so let's call this my last message in this PR about communication. (You're of course free to answer my answer if you'd like :p)

For the record: #9538 is about a separate tool, so I didn't feel particularly concerned, and didn't have anything to say about it.
#9758 was about the compiler, I did feel concern, and I did reply on that PR that some discussion would be nice, that merlin was already doing something similar, and that that something was also slightly different. On that same PR, on that same day, @Octachron said he already had a much wider patch on the compiler, which I guess this PR is extracted from. Before this PR, I do not believe that I'd seen any description of any part of that work (neither the format used as an output, nor the integration inside the compiler).

This being said, I think that the onus is on downstream forkers to send us their stuff, rather than on compiler maintainers to pre-warn them of potential changes to the codebase.

Interesting, I'm pretty sure I remember instances of compiler maintainers (myself included) being worried by the potential impact of such or such PR, building opam-repository with it, and sometimes sending patches to users packages.
Also, this isn't just an instance of "potential changes to the codebase", but an opportunity of reusing some existing code whose authors sadly hadn't had the time to upstream. And I'd be surprised if you, or the mentor of that internship could deny knowing about the existence of that particular project, and the fact that it changed the output of the compiler for a more regular format.

@Octachron
Copy link
Member Author

@trefis: Let me restate that indeed, the PR will not be merged without your review.

Concerning the content of the PR, one important point is that the PR and its second part is more wide than deep and mostly focuses on adding a JSON-structured metadata layer to all compiler messages.

I expect most of those messages to be completely orthogonal to Merlin's changes.
If the few intersection points need to be completely rewritten for the sake of interoperability with Merlin, so be it. I shall do the work. The pedagogical part of the internship has already been a success from my point of view.

Similarly, the precise format ought to be discussed. Nevertheless, I thought that for basic metadata that discussion could happen in this very PR or amended later. That was probably an error in term of communication. I am not completely sure what is the best way to fix this mistake? A "Ocaml and tools" meeting maybe?

@gasche
Copy link
Member

gasche commented Apr 14, 2021

We are still waiting for Merlin people to give input on this PR proposal.

@trefis
Copy link
Contributor

trefis commented Apr 15, 2021

Thanks for the ping, that had dropped of my stack.
I'm currently half-way in my review of PR 8516, so I'll finish that and then turn to this one. I'm not going to give an ETA for my review, because I'm necessarily going to miss it. But I'll try to hurry up :)

@trefis
Copy link
Contributor

trefis commented May 10, 2021

I just had a look, and as @Octachron said this (so far) completely orthogonal to what's been done inside Merlin.

Sorry for the delay!

@rgrinberg
Copy link
Member

I discussed the issue of editor integration & compiler errors with the dune team today, and we thought of another use case where structured errors can be useful. For errors displayed in editors like vscode, it's better to display error messages formatted with a much wider margin so that they fit into a single line whenever possible. Thus, it would be good if the Format information was part of the error structure.

@Octachron
Copy link
Member Author

@rgrinberg , I could add the formatter geometry to the data. However, I am not sure how often it will be different from the default value, since there is no easy to change this default when using the compiler executables.

@gasche
Copy link
Member

gasche commented Jan 18, 2023

I just had a look, and as @Octachron said this (so far) completely orthogonal to what's been done inside Merlin.

I was curious to know what has been done inside merlin, and asked @voodoos. Answer: raise_error:

https://github.com/ocaml/merlin/blob/d6943ba5a4d918482e3193f63fa5399086e6fe63/src/ocaml/typing/msupport.mli#L29-L37

The raise_error function is used to record that an error was raised on the side, which will later be turned into a message for the user, or a JSON fragment, etc. It may not interrupt the control flow (it often does not), its return type is unit instead of 'a, so users of the function must decide what to do after returning the error. For example:

https://github.com/ocaml/merlin/blob/f34484b7291a0ee9a4e7ad8165027247834afdec/src/ocaml/typing/typecore.ml#L4294-L4315

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants