Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified metadata for compilation files (or no more capitalize_ascii) #12389

Merged
merged 4 commits into from Sep 8, 2023

Conversation

Octachron
Copy link
Member

This PR is an alternative implementation of a slice of #11736 that is entirely focused on the handling of metadata (module name and source file currently) for compilation files (either compilation targets or source files) in order to avoid the proliferation of String.{uncapitalize_ascii,capitalize_ascii} within the compiler codebase.

In brief, this PR makes sure that we define the transformations from compilation filenames to module names in one place, and the same for the reverse transform.
(However, dependencies made it easier to define the two functions in two different places).

After this PRs, the only calls to String.capitalize_ascii in the compiler happen in error message printers (typically to capitalize either first or second) and are only an obstacle to the internationalization of the compiler messages.

The PR goes one step further and ensures that the module name associated to a compilation file is computed once by tracking file metadata across change of file extensions.

With this change, it is quite straightforward to check that the compiler only tries to derive filenames from module names when doing lookup for cmo, cmx or cmi file through Load_path.

This suggest to me that a good solution for the compilation artifact ambiguity (which files between Foo.cmi and foo.cmi should provide the module Foo?) is to emit an error if the same directory contains distinct filename with the same normalization as suggested by @dbuenzli in #11736.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this is a very nice change, and the callsites are all improved. I am broadly in favor of merging this work.

In practice I read unit_info.mli without looking at any of the other parts of the PR, and I found it fairly confusing. I think that if we want other people to work on that part of the compiler codebase, we should try to have (besides the nice refactoring) an interface that people can understand. I did a first round of feedback focusing on my various sources of confusion on unit_info.mli.

parsing/unit_info.mli Outdated Show resolved Hide resolved
parsing/unit_info.mli Show resolved Hide resolved
parsing/unit_info.mli Outdated Show resolved Hide resolved
parsing/unit_info.mli Outdated Show resolved Hide resolved

(** [modname_from_source filename] is [modulize stem] where [stem] is the
basename of the filename [file] stripped from all its extensions.*)
val modname_from_source: string -> string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of functions would be easier to read with "silent" type abbreviations just for this module:

type path = string
type filepath = string
type dirpath = string
type modname = string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I wouldn't mind type modname = Modname of string [@@unboxed] either, but we could start with a synonym to avoid caller-side changes.)


(** [from_source filename] associates the module name [modname_from_source
filename] to the source file [filename]. *)
val from_source: string -> source_file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docstrings are a bit confusing because the type is path -> source_file, but the description is about module names. I wonder if a more abstract description would make sense, for example:

(** [from_source filename] is the unit information for the source file [filename]. *)

parsing/unit_info.ml Outdated Show resolved Hide resolved

(** [cmi_uncap u] finds in the load_path a name matching the module name
[modname u]. *)
val cmi_uncap: _ t -> target_only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would name this find_cmi_uncap to keep in mind that there is a non-trivial lookup logic behind it (with a performance cost, etc.).

parsing/unit_info.mli Outdated Show resolved Hide resolved
parsing/unit_info.mli Outdated Show resolved Hide resolved
@gasche
Copy link
Member

gasche commented Aug 5, 2023

Some more vague comments:

  1. I think that we have different things in mind about the name "target". For me a target is an output of the current compiler build action. I have the impression that what you call "target" are files that are built by some compiler invocation, the current one or a previous one. Those I call "build artifacts" or "object files".

  2. I am unsure what is called a "prefix" here. My current guess is that it denotes both a working directory and what you call a "stem", an extensionless basename fragment that is used to derive the module name. (In contrast, for opam or other package management tools the "prefix" is only a directory path.) I wonder if systems people have a more standard name for "stem", but I am fine with this choice.

  3. Unit_info.t pairs information about compiler objects (what path to write them to) and "provenance" information on the compiler inputs this information was derived from. Some operations of the API change the compiler object (eg. "move from a .cmo to the corresponding .cmi") and keep the same provenance. I wondered if the API or at least the implementation could be clarified by being explicit about this product structure: have a part of the data that only denotes a "compilation unit" (a module name, an optional source filename, a filesystem prefix), and a separate part that denotes one compiler object among the many that can be derived from it.

    (Without looking at the code in details, my intuition would be that the object-specific part does not really need to be stored along the unit information: functions that take a cmx_name as input, and currently get passed an Unit_info.cmx_name unit could be rewritten to take unit directly and ask for its cmx path at each use-site. The object-specific information, probably just an extension, needs to be passed only on functions that can work with several object file extensions in a generic way, and those are probably very rare and could take the extension as an extra argument separate from the unit info or something. But that is a more invasive change on the caller site, so why not have this intermediate form where we pair both together to stick closer to the previous path-only approach.)

Note: remarks (1) and (2) above are about misunderstandings on what we mean by specific names; providing examples in the beginning of the .mli documentation would reduce the opportunities for such misunderstandings. (We could still disagree on whether the names are the right ones, but at least the reader would quickly be sure of what they mean.)

@gasche
Copy link
Member

gasche commented Aug 29, 2023

@Octachron and myself discussed this again yesterday evening. For reference here is my new understanding of the concepts in unit_info.mli:

  • The type source_file represents (the filesystem paths to) a source compilation unit, that is basically a pair of a .ml file and a .mli file (or only some of them sometimes, I guess); I would call this source_unit or source_unit_info.
  • The type target_prefix represents (the filesystem paths to) an "compilation unit object" (or compiled compilation unit), that is a bunch of compiled object files (.cmi, possibly .cmt, .cmo or .cmx, etc.), along with information about its source compilation unit. I think we could call this a compilation_unit, or compilation_unit_info, or just t as it is the central type of the module.
  • The type 'a any_target represents (the filesystem path to) a single object file that is part of an object compilation unit, derived from a compilation unit object. If 'a is string (type target), then it knows of an existing a source compilation unit, if 'a is unit (type target_only) then it was derived from an object file and has no source compilation unit attached.

I asked @Octachron whether we could differentiate "whole units" and "single files" with separate types:

  • a type source_unit for source compilation units
  • a type 'a t for compiled compilation units that also carry a 'a: source_unit t has source information, unit t has only object information
  • a type 'a object or 'a artifact for a single object file derived from a 'a t

That interface would be less polymorphic than the current one (where some helper functions can take any (_, _) t as input, and would have to be duplicated to work on the different types), and it is not clear whether it would be a problem in practice or not. I think that @Octachron wants to hack on it a bit to see what would work and what would not.

@Octachron
Copy link
Member Author

I have removed the unified GADTs and split it into an Unit_info.t and Unit_info.artifact types.

The other cases were not strictly necessary : this simplified version only loses the information that only artifact metadata derived from Unit_info.t contains source file information.

This required a few function duplication, but the resulting interface should be far easier to read.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first of review, up to the file typing/typemod.ml excluded.

I like the new API much better, thanks! Also, not to boast, but while reading the rest of the PR -- the many changes in the compiler codebase -- I realized that while the module takes care of a niche concern, it is actually used all over the place, so it is in fact rather important that the interface and function name be approachable.

- the module name associated to the unit
- the filename prefix (dirname + basename with all extensions stripped)
for compilation artifacts
- the source file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A concrete example would help the reader.

- the module name associated to the unit; for example "Mylib_Foo"
- the filename prefix (...); for example "_build/mylib/Mylib__Foo"
- the source file; for example "mylib/foo.ml"

(I think of compilation units as typically having two source files, the .ml and the .mli, so I am not sure what "the source file" means.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added an example. The source file refers to the lone input source file, since the compilation pipeline proceeds file-by-file.

(** Metadata for a single compilation artifact:
- the module name associated to the artifact
- the filesystem path
- the source file for compilation file if it exists
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metadata for a single compilation artifact, for example a .cmi or .cmx file:
- the module name associated to the artifact; for example "Foo"
- the filesystem path; for example "_build/src/foo.cmx"
- the source file if it is known; for example "src/foo.ml"

(again I am not sure what "the source file" means)

val source_file: t -> string

(** [artifact_source_file a] is the source file of [a] if it exists. *)
val artifact_source_file: artifact -> string option
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... at the risk of being nitpicky, I wonder how an Artifact submodule would feel.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added such submodule, since it helps a lot distinguish the field of the Artifact.t and Unit_info.t types.


(** [normalized_cmi u] finds in the load_path a file matching the module name
[modname u]. *)
val normalized_cmi: t -> artifact
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it raises Not_found, I think it should be named find_normalized_cmi, and the exception should be documented.


(** [artifact filename] reconstruct the module name
[modname_from_source filename] associated to the artifact [filename]. *)
val artifact: string -> artifact
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads weird because all other Unit_info.<noun> function are used to project information out of the datatypes of this module. Maybe Unit_info.Artifact.make or Unit_info.Artifact.from_path?



let clambda i backend Typedtree.{structure; coercion; _} =
let cmx = Unit_info.cmx i.target in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(meh again)

@@ -64,19 +65,20 @@ let flambda i backend Typedtree.{structure; coercion; _} =
in
Asmgen.compile_implementation
~backend
~prefixname:i.output_prefix
~prefixname:(Unit_info.prefix i.target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each call to compile_implementation needs to be modified. Maybe it could even be changed to take a unit_info instead of a prefixname, to make caller-side changes simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to drop the information which is no longer used in this case.

Asmgen.compile_implementation_linear i.output_prefix ~progname:i.source_file
Compilenv.reset ?packname:!Clflags.for_package (Unit_info.modname i.target);
Asmgen.compile_implementation_linear (Unit_info.prefix i.target)
~progname:(Unit_info.source_file i.target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here as well it looks like a single i.target argument would make a lot of sense for compile_implementation_linear.

@@ -164,19 +164,20 @@ let record_value_dependency vd1 vd2 =
if vd1.Types.val_loc <> vd2.Types.val_loc then
value_deps := (vd1, vd2) :: !value_deps

let save_cmt filename modname binary_annots sourcefile initial_env cmi shape =
let save_cmt dest binary_annots initial_env cmi shape =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

target or dest?

let read penv f modname filename =
let read penv f a =
let modname = Unit_info.artifact_modname a in
let filename = Unit_info.filename a in
snd (read_pers_struct penv f true modname filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the argument change to be pushed into read_pers_struct.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with the final result. What do you want to do about the history?

(I think that it would make sense to squash, but you could also split into two commits, one that introduces unit_info and one with all the client-side changes at once.)

parsing/unit_info.mli Outdated Show resolved Hide resolved
@Octachron
Copy link
Member Author

I kept two commits one for introducing the new module, the second one for using it. While rebasing #11736 on top of this PR, I couldn't resist the temptation to cherry-pick one simplification: with the last commit, we no longer add cmi with invalid modname in the persistent environment.

@gasche gasche merged commit c2b87d8 into ocaml:trunk Sep 8, 2023
9 checks passed
gasche added a commit that referenced this pull request Sep 11, 2023
Octachron pushed a commit to Octachron/ocaml that referenced this pull request Feb 20, 2024
Unified metadata for compilation files (or no more capitalize_ascii)

(cherry picked from commit c2b87d8)
Octachron pushed a commit to Octachron/ocaml that referenced this pull request Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants