New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unified metadata for compilation files (or no more capitalize_ascii) #12389
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this is a very nice change, and the callsites are all improved. I am broadly in favor of merging this work.
In practice I read unit_info.mli
without looking at any of the other parts of the PR, and I found it fairly confusing. I think that if we want other people to work on that part of the compiler codebase, we should try to have (besides the nice refactoring) an interface that people can understand. I did a first round of feedback focusing on my various sources of confusion on unit_info.mli.
parsing/unit_info.mli
Outdated
|
||
(** [modname_from_source filename] is [modulize stem] where [stem] is the | ||
basename of the filename [file] stripped from all its extensions.*) | ||
val modname_from_source: string -> string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sort of functions would be easier to read with "silent" type abbreviations just for this module:
type path = string
type filepath = string
type dirpath = string
type modname = string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I wouldn't mind type modname = Modname of string [@@unboxed]
either, but we could start with a synonym to avoid caller-side changes.)
parsing/unit_info.mli
Outdated
|
||
(** [from_source filename] associates the module name [modname_from_source | ||
filename] to the source file [filename]. *) | ||
val from_source: string -> source_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docstrings are a bit confusing because the type is path -> source_file
, but the description is about module names. I wonder if a more abstract description would make sense, for example:
(** [from_source filename] is the unit information for the source file [filename]. *)
parsing/unit_info.mli
Outdated
|
||
(** [cmi_uncap u] finds in the load_path a name matching the module name | ||
[modname u]. *) | ||
val cmi_uncap: _ t -> target_only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would name this find_cmi_uncap
to keep in mind that there is a non-trivial lookup logic behind it (with a performance cost, etc.).
Some more vague comments:
Note: remarks (1) and (2) above are about misunderstandings on what we mean by specific names; providing examples in the beginning of the .mli documentation would reduce the opportunities for such misunderstandings. (We could still disagree on whether the names are the right ones, but at least the reader would quickly be sure of what they mean.) |
@Octachron and myself discussed this again yesterday evening. For reference here is my new understanding of the concepts in unit_info.mli:
I asked @Octachron whether we could differentiate "whole units" and "single files" with separate types:
That interface would be less polymorphic than the current one (where some helper functions can take any |
549e712
to
d626e7d
Compare
I have removed the unified GADTs and split it into an The other cases were not strictly necessary : this simplified version only loses the information that only artifact metadata derived from This required a few function duplication, but the resulting interface should be far easier to read. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a first of review, up to the file typing/typemod.ml
excluded.
I like the new API much better, thanks! Also, not to boast, but while reading the rest of the PR -- the many changes in the compiler codebase -- I realized that while the module takes care of a niche concern, it is actually used all over the place, so it is in fact rather important that the interface and function name be approachable.
parsing/unit_info.mli
Outdated
- the module name associated to the unit | ||
- the filename prefix (dirname + basename with all extensions stripped) | ||
for compilation artifacts | ||
- the source file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A concrete example would help the reader.
- the module name associated to the unit; for example "Mylib_Foo"
- the filename prefix (...); for example "_build/mylib/Mylib__Foo"
- the source file; for example "mylib/foo.ml"
(I think of compilation units as typically having two source files, the .ml and the .mli, so I am not sure what "the source file" means.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added an example. The source file refers to the lone input source file, since the compilation pipeline proceeds file-by-file.
parsing/unit_info.mli
Outdated
(** Metadata for a single compilation artifact: | ||
- the module name associated to the artifact | ||
- the filesystem path | ||
- the source file for compilation file if it exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metadata for a single compilation artifact, for example a .cmi or .cmx file:
- the module name associated to the artifact; for example "Foo"
- the filesystem path; for example "_build/src/foo.cmx"
- the source file if it is known; for example "src/foo.ml"
(again I am not sure what "the source file" means)
parsing/unit_info.mli
Outdated
val source_file: t -> string | ||
|
||
(** [artifact_source_file a] is the source file of [a] if it exists. *) | ||
val artifact_source_file: artifact -> string option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... at the risk of being nitpicky, I wonder how an Artifact
submodule would feel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added such submodule, since it helps a lot distinguish the field of the Artifact.t
and Unit_info.t
types.
parsing/unit_info.mli
Outdated
|
||
(** [normalized_cmi u] finds in the load_path a file matching the module name | ||
[modname u]. *) | ||
val normalized_cmi: t -> artifact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it raises Not_found, I think it should be named find_normalized_cmi
, and the exception should be documented.
parsing/unit_info.mli
Outdated
|
||
(** [artifact filename] reconstruct the module name | ||
[modname_from_source filename] associated to the artifact [filename]. *) | ||
val artifact: string -> artifact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reads weird because all other Unit_info.<noun>
function are used to project information out of the datatypes of this module. Maybe Unit_info.Artifact.make
or Unit_info.Artifact.from_path
?
driver/optcompile.ml
Outdated
|
||
|
||
let clambda i backend Typedtree.{structure; coercion; _} = | ||
let cmx = Unit_info.cmx i.target in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(meh again)
@@ -64,19 +65,20 @@ let flambda i backend Typedtree.{structure; coercion; _} = | |||
in | |||
Asmgen.compile_implementation | |||
~backend | |||
~prefixname:i.output_prefix | |||
~prefixname:(Unit_info.prefix i.target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each call to compile_implementation
needs to be modified. Maybe it could even be changed to take a unit_info
instead of a prefixname
, to make caller-side changes simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to drop the information which is no longer used in this case.
driver/optcompile.ml
Outdated
Asmgen.compile_implementation_linear i.output_prefix ~progname:i.source_file | ||
Compilenv.reset ?packname:!Clflags.for_package (Unit_info.modname i.target); | ||
Asmgen.compile_implementation_linear (Unit_info.prefix i.target) | ||
~progname:(Unit_info.source_file i.target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here as well it looks like a single i.target
argument would make a lot of sense for compile_implementation_linear
.
file_formats/cmt_format.ml
Outdated
@@ -164,19 +164,20 @@ let record_value_dependency vd1 vd2 = | |||
if vd1.Types.val_loc <> vd2.Types.val_loc then | |||
value_deps := (vd1, vd2) :: !value_deps | |||
|
|||
let save_cmt filename modname binary_annots sourcefile initial_env cmi shape = | |||
let save_cmt dest binary_annots initial_env cmi shape = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
target
or dest
?
typing/persistent_env.ml
Outdated
let read penv f modname filename = | ||
let read penv f a = | ||
let modname = Unit_info.artifact_modname a in | ||
let filename = Unit_info.filename a in | ||
snd (read_pers_struct penv f true modname filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect the argument change to be pushed into read_pers_struct
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with the final result. What do you want to do about the history?
(I think that it would make sense to squash, but you could also split into two commits, one that introduces unit_info and one with all the client-side changes at once.)
1eed6ef
to
f91ddec
Compare
I kept two commits one for introducing the new module, the second one for using it. While rebasing #11736 on top of this PR, I couldn't resist the temptation to cherry-pick one simplification: with the last commit, we no longer add cmi with invalid modname in the persistent environment. |
Unified metadata for compilation files (or no more capitalize_ascii) (cherry picked from commit c2b87d8)
Fix cmi lookup after ocaml#12389 (cherry picked from commit eab1105)
This PR is an alternative implementation of a slice of #11736 that is entirely focused on the handling of metadata (module name and source file currently) for compilation files (either compilation targets or source files) in order to avoid the proliferation of
String.{uncapitalize_ascii,capitalize_ascii}
within the compiler codebase.In brief, this PR makes sure that we define the transformations from compilation filenames to module names in one place, and the same for the reverse transform.
(However, dependencies made it easier to define the two functions in two different places).
After this PRs, the only calls to
String.capitalize_ascii
in the compiler happen in error message printers (typically to capitalize eitherfirst
orsecond
) and are only an obstacle to the internationalization of the compiler messages.The PR goes one step further and ensures that the module name associated to a compilation file is computed once by tracking file metadata across change of file extensions.
With this change, it is quite straightforward to check that the compiler only tries to derive filenames from module names when doing lookup for
cmo
,cmx
orcmi
file throughLoad_path
.This suggest to me that a good solution for the compilation artifact ambiguity (which files between
Foo.cmi
andfoo.cmi
should provide the moduleFoo
?) is to emit an error if the same directory contains distinct filename with the same normalization as suggested by @dbuenzli in #11736.