Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a dedicated type to represent global identifiers in CMO files #12031

Merged

Conversation

shindere
Copy link
Contributor

Ths comes from #11996 and is kind of a continuation of #11997.

The idea here is to stop using the Ident.t type to represent identifiers
when stored in CMO files.

Ident.t describes four possible categories of identifiers: Local, Scoped,
Global and Predef. At the CMO stage, though, identifiers can be only of
categories Global or Predef.

This commit thus modifies the type used to represnetidentifiers in CMO files
to make it explicit to which
of these category they belong and then use only strings.

@gasche
Copy link
Member

gasche commented Feb 22, 2023

I am worried that the move from a domain-specific type Ident.t to a domain-agnostic type string is not a clear win in terms of code readability and maintainability. There are places in the compiler were we have been bitten by the fact of using types that are "too concrete", without a clear control on the way the values can be constructed and inspected, and this is making later refactoring delicate/painful. (For example, recently, the type of compilation unit names.)

Instead of the generic string type, could we have a specific global type, for example using the following?

type global = Global of string [@@unboxed]

bytecomp/bytepackager.ml Outdated Show resolved Hide resolved
bytecomp/bytepackager.ml Outdated Show resolved Hide resolved
@@ -233,9 +252,13 @@ let package_object_files ~ppf_dump files targetfile targetname coercion =
| { pm_kind = PM_intf } ->
required_globals
| { pm_kind = PM_impl { cu_required_globals; cu_reloc } } ->
let cu_required_globals =
List.map Ident.create_persistent cu_required_globals
in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change the type of cu_required_globals to use globals instead of identifiers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've a commit doing that in https://github.com/hhugo/ocaml/tree/represent-ids-with-standard-types-in-cmo-files.
It removes a bunch of Ident.create_persistent

@@ -421,6 +427,8 @@ let to_file outchan unit_name objfile ~required_globals code =
(p, pos_out outchan - p)
end else
(0, 0) in
let is_not_predef id = not (Ident.is_predef id) in
let required_globals = Ident.Set.filter is_not_predef required_globals in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add a new filtering step here?

@@ -41,17 +41,14 @@ module Bytecode = struct
let required =
List.filter
(fun id ->
not (Ident.is_predef id)
&& not (String.contains (Ident.name id) '.'))
not (String.contains id '.'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and there is no filtering done here anymore, I guess this is related?)

@shindere
Copy link
Contributor Author

shindere commented Feb 22, 2023 via email

@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch from 7f6b3ad to 7d45a28 Compare February 23, 2023 07:17
@shindere
Copy link
Contributor Author

shindere commented Feb 23, 2023 via email

@Octachron
Copy link
Member

Octachron commented Feb 23, 2023

Overall, I disagree with the changes proposed by this PR: using types from the standard library is not a good enough reason to lose type information from my point of view.

In particular, Ident.name is a non-injective function that should ideally used only for printing. I would

I think it is fine to use finer-grained types to represent idents in cmo, for instance:

type cmo_ident = Global of string | Predef of string

with a Ident.to_cmo: Ident.t -> cmo_ident function that fails whenever Ident.name would be losing information.

@shindere
Copy link
Contributor Author

shindere commented Feb 23, 2023 via email

@Octachron
Copy link
Member

Did you read the comment where I explained that, with such a
representation, it becomes possible to encode thingsthat make no sense,
semantically speaking?

Then the solution is to add more types and distinguish between predef idents and global idents rather than removing all type information all together. (Without knowing the use case precisely, you are describing one of the use case for GADTs: describing a type family where some functions work on the whole family of types whereas some only work on some member of this family.)

Since Ident.to_cmo_ident or Ident.to_cmo_predef translates Ident.t, it need to know the implementation of the type and thus it seem sensible for the function to live in Ident.t (like Ident.name) . Did you mean to ask where the cmo ident type should be defined? I would expect in its own module, probably in a new and separate library shared by dynlink and the rest of the compiler.

I agree that it makes sense to limit the dependency of dynlink on the rest of the compiler but that should done without making the rest of the compiler less maintainable.

@dra27
Copy link
Member

dra27 commented Feb 23, 2023

Did you mean to ask where the cmo ident type should be defined? I would expect in its own module, probably in a new and separate library shared by dynlink and the rest of the compiler.

Out of curiosity, why couldn't it go in Cmo_format (with the various other cmo-specific types?)

@Octachron
Copy link
Member

That could be a better location since Cmo_format is already a Dynlink dependency.

@shindere
Copy link
Contributor Author

shindere commented Feb 23, 2023 via email

@lthls
Copy link
Contributor

lthls commented Feb 23, 2023

Since I've been mentioned: we use a dedicated Compilation_unit type for global module identifiers on our tree. It doesn't cover predefined exceptions, but it's much easier to manipulate than string or Ident.t in particular when packs are involved.
If you're not going in this direction, then I would side with @Octachron and prefer a dedicated type for cmo globals rather than just string. I'm fine with having the distinction between predefined exceptions and global units in the reloc_info type, though as @gasche suggested this could be made simpler with a boolean flag than with two distinct constructors.

@gasche
Copy link
Member

gasche commented Feb 23, 2023

I don't see a problem with having a type of globals that can contain predefined globals, and then a setglobal instruction that is only valid on non-predef. (Avoiding this statically would require GADTs and the cost in complexity would be large.) In OCaml, a variable in the namespace of term variables x may be either an int or something else, and yet the grammar allows us to form the expression x + 1 whether x is an int or not. We don't insist that the grammar must statically rule these terms out by using different syntactic namespaces for integer variables and other variables. I think the situation is similar here.

@lukemaurer
Copy link

Since I've been mentioned: we use a dedicated Compilation_unit type for global module identifiers on our tree.

Worth noting we also ended up with both getglobal and getpredef as separate primitives. Globals and predefs are different enough (even if they end up mapped into the same space) to merit different types.

That said, a single Global.t covering both would still be an improvement over Ident.t. (I agree with others that string is the wrong type - it usually is!)

@gasche
Copy link
Member

gasche commented Feb 24, 2023

Well, if there is already an implementation of that in the flambda2 tree, should we consider upstreaming that specific part now? It would at least be interesting to compare with the current proposal by @shindere. Maybe the code is nicer, maybe it isn't, but if we can reduce the diff to flambda2 this is also a side-benefit to take into account.

@shindere
Copy link
Contributor Author

shindere commented Feb 24, 2023 via email

@shindere
Copy link
Contributor Author

shindere commented Feb 24, 2023 via email

@gasche
Copy link
Member

gasche commented Feb 25, 2023

No, not type global = string. I explained why we don't want to use string earlier, and Florian, Vincent and Luke concurred.

There are places in the compiler were we have been bitten by the fact of using types that are "too concrete", without a clear control on the way the values can be constructed and inspected, and this is making later refactoring delicate/painful. (For example, recently, the type of compilation unit names.)

In this specific case there is also the problem of losing information pointed out by @Octachron, we probably need to keep the distinction between Predef and Global in the data itself. (Or you have to explain why it is okay to lose it, but I haven't seen any such explanation in the PR so far.)

@lukemaurer
Copy link

Gabriel Scherer (2023/02/24 12:38 -0800):
In my understanding, the flambda2 Compilation_unit.t type could be used to type the cu_required_globals field, but flambda2 does not modify the reloc_info type as this patch does. One other thing I don't know is how much it would cost to pull in flambda2's Compilation_unit.t type. I can't be sure but I am guessing that a type dedicated to compilation units, if used everywhere where it makes sens, might be a somewhat invasive change. Not to say we shouldn't do it, but even if we do it, invasive or not invasive, it won't solve the thing this Pr tries to address.

Much as I'd be happy to see it land, the patch is indeed fairly invasive, especially at the .mli level. And yes, it's actually largely orthogonal to this patch (we wanted to disrupt the bytecode world as little as possible).

@Octachron
Copy link
Member

After some thinking during the week-end and a discussion with @xavierleroy this afternoon, I believe that my points of friction with this PR lay not so much with the use of string as a final type for the globals but really with the use of Ident.name and Ident.create_persistent.

I believe that the use of Ident.name in the context of this PR should be replaced by a Ident.to_global function (with an option return type) that would make it explicit that we are only storing Ident.ts that were already globals or predefs, in particular when we are currently using Ident.name in conjunction with Ident.global and Ident.is_predef.

Symmetrically, I agree with @gasche that many use of Ident.create_persistent looks like the changes from Ident.t to string was not propagated enough and that propagating further the change would create more natural code.

I still think that for me using a specific cmo_global type would be the easier way to spot and fix those friction points, but that's a personal preference and I am not opposed to the idea of keeping strings once the issues above are fixed. As a counter-opinion, @xavierleroy thinks that I am overengineering and that the PR looks fine at first glance.

P.S. : Somehow it seems that I didn't mention in my initial chain of comments, but I do think that this PR is a step in the right direction in term of simplification of cmo file.

@shindere
Copy link
Contributor Author

shindere commented Mar 1, 2023

While working on take #2 of this PR to take into account all the comments
(many thanks to all for the support, btw!, I discovered something that I
think needs to be clarified (apologies for not being able to use nice links
to the code).

On trunk, the compilation_unit type declared in
file_formats/cmo_format.mli has this field:

  cu_required_globals: Ident.t list; (* Compilation units whose initialization side effects must occur before this one. *)

As you all know, in theory Ident.t can theoretically be Local, Scoped,
Global or Predef.

In the current PR, though, I have made the assumption that it can only be a
global identifier. In particular, it didn't occur to me that it could be a
predef, I think because of the comment saying it's a compilation unit
whereas to me predefs are rather exceptions, builtin identifiers...

So in the current PR I used

  cu_required_globals: string list;

which does not let me encode whether a required global is a predef or not.

Now, I wanted to make sure, so I pushed the required-predef branch on
shindere/ocaml. Its first commit adds debugging to ocamlobjinfo to print,
for each required global, whether it's a predef or not. Its second commit
adds the required-predefs file which is the result of running the modified
ocamlobjinfo on all the .cmo files of the codebase.

This shows that yes, we do have required globals which are predefs and not
names of compilation units.

To give an insight, here is the output of the following ocmmand:

grep PREDEF=true required-predefs | uniq
	Assert_failure (PREDEF=true)
	Match_failure (PREDEF=true)
	Out_of_memory (PREDEF=true)
	Invalid_argument (PREDEF=true)
	Failure (PREDEF=true)
	Not_found (PREDEF=true)
	Sys_error (PREDEF=true)
	End_of_file (PREDEF=true)
	Division_by_zero (PREDEF=true)
	Stack_overflow (PREDEF=true)
	Sys_blocked_io (PREDEF=true)
	Assert_failure (PREDEF=true)
	Undefined_recursive_module (PREDEF=true)
	Assert_failure (PREDEF=true)

My question at this stage is simple: do we need to preserve this behaviour
or is it expected that such predefined identifiers are not part of the
list of required globals?

@Octachron
Copy link
Member

As far as I can see, there are two end-users of cu_required_globals : Bytelink.link through missing_globals and byte/dynlink.ml. Both of them are already filtering the predefs that were added by accident to cu_required_globals before using the contents of the field. Thus I would suggest to go forward and make sure that cu_required_globals never contains predefs.

@shindere
Copy link
Contributor Author

shindere commented Mar 1, 2023 via email

@shindere
Copy link
Contributor Author

shindere commented Mar 1, 2023 via email

@shindere
Copy link
Contributor Author

shindere commented Mar 1, 2023

Well, things must not be too broken because, after having restored
the output of ocamlobjinfo, the no-predefs-in-cu-required-globals
branch passes the testsuite.

@Octachron
Copy link
Member

Octachron commented Mar 2, 2023

The fd0fa25 commit looks like a good preliminary step on its own. It seems fine to make it the first commit in the PR as a prerequisite that can be reviewed on its own.

@gasche
Copy link
Member

gasche commented Mar 2, 2023

I agree, but (unlike many of the refactoring changes discussed here and previously) it actually requires someone who knows about this stuff. @xavierleroy, do you know why we sometimes have "Predef" identifiers added to the globals required by a compilation unit?

@Octachron
Copy link
Member

As far as I can see, it was #837 that introduced both the code path that might add Predef and the filter on such identifiers in Bytelink. @chambart , do you remember if there was any specific reason to not do the filtering on identifiers before hand?

@shindere
Copy link
Contributor Author

shindere commented Mar 29, 2023 via email

@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch from bce23d7 to 1dbd50f Compare March 29, 2023 17:32
@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch from 5e9db29 to 3ae4f79 Compare April 18, 2023 14:37
@shindere
Copy link
Contributor Author

Just pushed a version of the branch which, I think, addresses all the
comments except one, namely @Octachron's suggestion to further propagate
the types to the Lambda layer.

As can be seen on the
represent-ids-with-standard-types-in-cmo-files-with-type-propagation
branch of shindere/ocaml, I did try to implement this change but it
requires to modify many files not originally modified and I for instance
foudn myself modifying the middle_end (in other words, the native
commpiler).

The modifications I had to do looked sensible to me but huge, and perhaps
out of scope for this PR. Actually they felt substantial enough to deserve a
PR on their own.

@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch from 3ae4f79 to 0b1bd21 Compare April 18, 2023 15:09
Copy link
Member

@Octachron Octachron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the changes look good to me: it clarifies the meaning of "globals" in the bytecode compiler and remove impossible cases from the code logic. In particular in the bytecode linker and packager are nicely simplified by the change.

I am not approving yet due to a problematic assert in Emitcode.slot_for_setglobal.

Note that I have many small remarks on the API but those are more suggestions.

bytecomp/bytelink.ml Outdated Show resolved Hide resolved
bytecomp/bytelink.ml Show resolved Hide resolved
bytecomp/bytelink.ml Outdated Show resolved Hide resolved
bytecomp/bytepackager.ml Outdated Show resolved Hide resolved
bytecomp/bytepackager.ml Outdated Show resolved Hide resolved
@@ -48,7 +48,7 @@ module type S = sig
val fold_initial_units
: init:'a
-> f:('a
-> comp_unit:string
-> compunit:string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change.

tools/dumpobj.ml Outdated
@@ -169,7 +170,7 @@ let print_setglobal_name ic =
if n >= Array.length !globals || n < 0
then print_string "<global table overflow>"
else match !globals.(n) with
Global id -> print_string(Ident.name id)
| Glob glob -> print_string (Symtable.Global.string_of_global glob)
| _ -> print_string "???"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern could be made non-fragile, but that's seem quite orthogonal to this PR.

bytecomp/emitcode.ml Outdated Show resolved Hide resolved
let scan_reloc (rel, _) = match rel with
Reloc_primitive s -> primitives := String.Set.add s !primitives
| Reloc_literal _ | Reloc_getcompunit _ | Reloc_setcompunit _
| Reloc_getpredef _ -> ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is currently unrelated to the current PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like changes that make pattern-matching non-fragile, and I think it makes sense in the present PR if it changed the constructor at that type. I vote to keep the change.

toplevel/byte/topeval.ml Outdated Show resolved Hide resolved
@shindere
Copy link
Contributor Author

shindere commented Apr 28, 2023 via email

@Octachron
Copy link
Member

Compunit could be a module by itself (or later part of the LinkingCore module) . And since Predef is currently unused, I would propose to remove it and find a better location for it later.

@shindere
Copy link
Contributor Author

shindere commented Apr 28, 2023 via email

@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch 4 times, most recently from 3fa0252 to 24c643b Compare May 2, 2023 15:03
@shindere
Copy link
Contributor Author

shindere commented Jun 12, 2023 via email

@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch 3 times, most recently from 1ebfcc6 to 0bda36a Compare June 12, 2023 15:46
@gasche
Copy link
Member

gasche commented Jun 14, 2023

Status update: @Octachron wants to have another look at this PR this week and then we should be good to merge.

Copy link
Member

@Octachron Octachron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my nitpicking questions were answered, the PR is ready to merge. Thanks for all the hard refactoring and archeology work.

@shindere
Copy link
Contributor Author

shindere commented Jun 15, 2023 via email

@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch from 0bda36a to a235a38 Compare June 16, 2023 09:24
This commit introduces the dedicated `compunit` (resp. `predef`) types to
represent names of compilation units (resp. predefined exceptions) in
CMO files. This makes the CMO format independent of the type checker's
 `Ident.t` type which, although it is domain-specific, can represent
identifiers which can actually not occur in CMO files, namely the
Local and Scoped categories of identifiers.

This commit also adds a `Global` module to `Symtable` to represent the
fact that, for the time being, the symbol table contains both
compilation units and predefined exceptions.
@shindere shindere force-pushed the represent-ids-with-standard-types-in-cmo-files branch from a235a38 to ef4b5ef Compare June 16, 2023 09:40
@shindere shindere merged commit 27786a7 into ocaml:trunk Jun 16, 2023
10 of 11 checks passed
@shindere shindere deleted the represent-ids-with-standard-types-in-cmo-files branch June 16, 2023 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet