Unified metadata for compilation files (or no more capitalize_ascii) #12389

Octachron · 2023-07-18T16:00:37Z

This PR is an alternative implementation of a slice of #11736 that is entirely focused on the handling of metadata (module name and source file currently) for compilation files (either compilation targets or source files) in order to avoid the proliferation of String.{uncapitalize_ascii,capitalize_ascii} within the compiler codebase.

In brief, this PR makes sure that we define the transformations from compilation filenames to module names in one place, and the same for the reverse transform.
(However, dependencies made it easier to define the two functions in two different places).
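As a rough illustration of the two directions mentioned above, here is a minimal sketch. It is not the code from this PR: the names `stem`, `modname_of_path` and `normalized_stem` are hypothetical, and the sketch assumes plain ASCII capitalization.

```ocaml
(* Hypothetical sketch, not the compiler's implementation. *)

(* [stem path] is the basename of [path] stripped of all its extensions. *)
let stem path =
  let base = Filename.basename path in
  match String.index_opt base '.' with
  | None -> base
  | Some i -> String.sub base 0 i

(* Filename -> module name: capitalize the stem. *)
let modname_of_path path = String.capitalize_ascii (stem path)

(* Module name -> normalized (uncapitalized) file stem. *)
let normalized_stem modname = String.uncapitalize_ascii modname
```

Keeping both helpers next to each other makes it easy to see that every other call site only has to consume their results.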

After this PR, the only calls to String.capitalize_ascii in the compiler happen in error message printers (typically to capitalize either first or second) and are only an obstacle to the internationalization of the compiler messages.

The PR goes one step further and ensures that the module name associated to a compilation file is computed once by tracking file metadata across change of file extensions.

With this change, it is quite straightforward to check that the compiler only tries to derive filenames from module names when looking up cmo, cmx or cmi files through Load_path.

This suggests to me that a good solution for the compilation artifact ambiguity (which file between Foo.cmi and foo.cmi should provide the module Foo?) is to emit an error if the same directory contains distinct filenames with the same normalization, as suggested by @dbuenzli in #11736.
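A possible shape for such a check, as a sketch only: the function `ambiguous_files` is hypothetical, it works on an in-memory list of filenames rather than on a real directory, and it assumes that normalization is `String.uncapitalize_ascii`.

```ocaml
(* Hypothetical sketch: report filenames of one directory that collide
   after normalization. *)
module SMap = Map.Make (String)

let ambiguous_files (files : string list) : (string * string list) list =
  (* Group the distinct spellings under their normalized name. *)
  let add acc file =
    let key = String.uncapitalize_ascii file in
    let previous = Option.value ~default:[] (SMap.find_opt key acc) in
    if List.mem file previous then acc
    else SMap.add key (file :: previous) acc
  in
  let groups = List.fold_left add SMap.empty files in
  (* Keep only the groups with at least two distinct spellings. *)
  SMap.fold
    (fun key spellings acc ->
       match spellings with
       | _ :: _ :: _ -> (key, List.sort String.compare spellings) :: acc
       | _ -> acc)
    groups []
```

A caller could turn any non-empty result into an error or a warning, depending on how strict the check should be.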

gasche

In theory this is a very nice change, and the callsites are all improved. I am broadly in favor of merging this work.

In practice I read unit_info.mli without looking at any of the other parts of the PR, and I found it fairly confusing. I think that if we want other people to work on that part of the compiler codebase, we should try to have (besides the nice refactoring) an interface that people can understand. I did a first round of feedback focusing on my various sources of confusion on unit_info.mli.

parsing/unit_info.mli

gasche · 2023-08-04T20:50:32Z

parsing/unit_info.mli

+
+(** [modname_from_source filename] is [modulize stem] where [stem] is the
+    basename of the filename [file] stripped from all its extensions.*)
+val modname_from_source: string -> string


This sort of function would be easier to read with "silent" type abbreviations just for this module:

type path = string
type filepath = string
type dirpath = string
type modname = string

(I wouldn't mind type modname = Modname of string [@@unboxed] either, but we could start with a synonym to avoid caller-side changes.)
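To make the difference between the two options concrete, here is a small sketch. The module names `Names` and `Strict` and the helper `to_string` are hypothetical, and the body of `modname_from_source` is a simplified stand-in, not the PR's implementation.

```ocaml
(* Option 1: "silent" abbreviations. Signatures read better, callers are
   unaffected because [modname] is still [string]. *)
module type Names = sig
  type filepath = string
  type modname = string
  val modname_from_source : filepath -> modname
end

(* Option 2: an unboxed wrapper. Same runtime representation as a string,
   but callers must go through the constructor or an accessor. *)
module Strict : sig
  type modname = Modname of string [@@unboxed]
  val modname_from_source : string -> modname
  val to_string : modname -> string
end = struct
  type modname = Modname of string [@@unboxed]

  let modname_from_source path =
    let base = Filename.basename path in
    let stem =
      match String.index_opt base '.' with
      | None -> base
      | Some i -> String.sub base 0 i
    in
    Modname (String.capitalize_ascii stem)

  let to_string (Modname s) = s
end
```

With the second option, passing an arbitrary file path where a module name is expected becomes a type error, at the price of changes on the caller side.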

gasche · 2023-08-05T07:25:40Z

parsing/unit_info.mli

+
+(** [from_source filename] associates the module name [modname_from_source
+    filename] to the source file [filename]. *)
+val from_source: string -> source_file


These docstrings are a bit confusing because the type is path -> source_file, but the description is about module names. I wonder if a more abstract description would make sense, for example:

(** [from_source filename] is the unit information for the source file [filename]. *)

parsing/unit_info.ml

gasche · 2023-08-05T07:42:10Z

parsing/unit_info.mli

+
+(** [cmi_uncap u] finds in the load_path a name matching the module name
+    [modname u]. *)
+val cmi_uncap: _ t -> target_only


I would name this find_cmi_uncap to keep in mind that there is a non-trivial lookup logic behind it (with a performance cost, etc.).

parsing/unit_info.mli

gasche · 2023-08-05T08:11:53Z

Some more vague comments:

1. I think that we have different things in mind about the name "target". For me a target is an output of the current compiler build action. I have the impression that what you call "target" are files that are built by some compiler invocation, the current one or a previous one. Those I call "build artifacts" or "object files".
2. I am unsure what is called a "prefix" here. My current guess is that it denotes both a working directory and what you call a "stem", an extensionless basename fragment that is used to derive the module name. (In contrast, for opam or other package management tools the "prefix" is only a directory path.) I wonder if systems people have a more standard name for "stem", but I am fine with this choice.
3. Unit_info.t pairs information about compiler objects (what path to write them to) and "provenance" information on the compiler inputs this information was derived from. Some operations of the API change the compiler object (eg. "move from a .cmo to the corresponding .cmi") and keep the same provenance. I wondered if the API or at least the implementation could be clarified by being explicit about this product structure: have a part of the data that only denotes a "compilation unit" (a module name, an optional source filename, a filesystem prefix), and a separate part that denotes one compiler object among the many that can be derived from it.

(Without looking at the code in detail, my intuition would be that the object-specific part does not really need to be stored along the unit information: functions that take a cmx_name as input, and currently get passed a Unit_info.cmx_name unit, could be rewritten to take unit directly and ask for its cmx path at each use-site. The object-specific information, probably just an extension, needs to be passed only to functions that can work with several object file extensions in a generic way, and those are probably very rare and could take the extension as an extra argument separate from the unit info or something. But that is a more invasive change on the caller site, so why not have this intermediate form where we pair both together to stick closer to the previous path-only approach.)

Note: remarks (1) and (2) above are about misunderstandings on what we mean by specific names; providing examples in the beginning of the .mli documentation would reduce the opportunities for such misunderstandings. (We could still disagree on whether the names are the right ones, but at least the reader would quickly be sure of what they mean.)

gasche · 2023-08-29T09:13:45Z

@Octachron and myself discussed this again yesterday evening. For reference here is my new understanding of the concepts in unit_info.mli:

The type source_file represents (the filesystem paths to) a source compilation unit, that is basically a pair of a .ml file and a .mli file (or only some of them sometimes, I guess); I would call this source_unit or source_unit_info.
The type target_prefix represents (the filesystem paths to) a "compilation unit object" (or compiled compilation unit), that is a bunch of compiled object files (.cmi, possibly .cmt, .cmo or .cmx, etc.), along with information about its source compilation unit. I think we could call this a compilation_unit, or compilation_unit_info, or just t as it is the central type of the module.
The type 'a any_target represents (the filesystem path to) a single object file that is part of an object compilation unit, derived from a compilation unit object. If 'a is string (type target), then it knows of an existing source compilation unit; if 'a is unit (type target_only), then it was derived from an object file and has no source compilation unit attached.

I asked @Octachron whether we could differentiate "whole units" and "single files" with separate types:

a type source_unit for source compilation units
a type 'a t for compiled compilation units that also carry a 'a: source_unit t has source information, unit t has only object information
a type 'a object or 'a artifact for a single object file derived from a 'a t

That interface would be less polymorphic than the current one (where some helper functions can take any (_, _) t as input, and would have to be duplicated to work on the different types), and it is not clear whether it would be a problem in practice or not. I think that @Octachron wants to hack on it a bit to see what would work and what would not.
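For readers who want to picture the three-type proposal, here is a minimal sketch. All field names and the helpers `prefix_of`, `make`, `of_source`, `of_object`, `cmi` and `source_of` are hypothetical; the single-object type is called `artifact` because `object` is a reserved word in OCaml.

```ocaml
(* Hypothetical sketch of the proposed split, not the PR's interface. *)
type source_unit = { source_file : string }

(* ['a] is [source_unit] when the source is known, [unit] otherwise. *)
type 'a t = { modname : string; prefix : string; source : 'a }

(* One object file derived from a ['a t]. *)
type 'a artifact = { unit_info : 'a t; path : string }

(* dirname + basename with all extensions stripped *)
let prefix_of path =
  let base = Filename.basename path in
  let stem =
    match String.index_opt base '.' with
    | None -> base
    | Some i -> String.sub base 0 i
  in
  Filename.concat (Filename.dirname path) stem

let make source path =
  let prefix = prefix_of path in
  { modname = String.capitalize_ascii (Filename.basename prefix);
    prefix; source }

let of_source file : source_unit t = make { source_file = file } file
let of_object path : unit t = make () path

(* Works for both kinds of units, and preserves the provenance type. *)
let cmi (u : 'a t) : 'a artifact = { unit_info = u; path = u.prefix ^ ".cmi" }

(* Only accepted when the source is statically known. *)
let source_of (a : source_unit artifact) = a.unit_info.source.source_file
```

In this sketch `source_of (cmi (of_object "lib/bar.cmo"))` is rejected by the type checker, which is the information the less polymorphic design would keep track of.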

Octachron · 2023-08-30T15:12:00Z

I have removed the unified GADTs and split them into the Unit_info.t and Unit_info.artifact types.

The other cases were not strictly necessary: this simplified version only loses the information that only artifact metadata derived from Unit_info.t contains source file information.

This required a few function duplications, but the resulting interface should be far easier to read.

gasche

I did a first round of review, up to the file typing/typemod.ml excluded.

I like the new API much better, thanks! Also, not to boast, but while reading the rest of the PR -- the many changes in the compiler codebase -- I realized that while the module takes care of a niche concern, it is actually used all over the place, so it is in fact rather important that the interface and function names be approachable.

gasche · 2023-08-30T18:38:17Z

parsing/unit_info.mli

+    - the module name associated to the unit
+    - the filename prefix (dirname + basename with all extensions stripped)
+      for compilation artifacts
+    - the source file


A concrete example would help the reader.

- the module name associated to the unit; for example "Mylib__Foo"
- the filename prefix (...); for example "_build/mylib/Mylib__Foo"
- the source file; for example "mylib/foo.ml"

(I think of compilation units as typically having two source files, the .ml and the .mli, so I am not sure what "the source file" means.)

Octachron · 2023-09-07

I have added an example. The source file refers to the lone input source file, since the compilation pipeline proceeds file-by-file.

gasche · 2023-08-30T18:40:19Z

parsing/unit_info.mli

+(**  Metadata for a single compilation artifact:
+    - the module name associated to the artifact
+    - the filesystem path
+    - the source file for compilation file if it exists


Metadata for a single compilation artifact, for example a .cmi or .cmx file:
- the module name associated to the artifact; for example "Foo"
- the filesystem path; for example "_build/src/foo.cmx"
- the source file if it is known; for example "src/foo.ml"

(again I am not sure what "the source file" means)

gasche · 2023-08-30T18:46:12Z

parsing/unit_info.mli

+val source_file: t -> string
+
+(** [artifact_source_file a] is the source file of [a] if it exists. *)
+val artifact_source_file: artifact -> string option


... at the risk of being nitpicky, I wonder how an Artifact submodule would feel.

Octachron · 2023-09-07

I have added such a submodule, since it helps a lot to distinguish the fields of the Artifact.t and Unit_info.t types.
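A minimal sketch of how the accessors could be laid out with an Artifact submodule. The module name `Unit_info_sketch`, the record fields and the `make` constructor are hypothetical and only meant to show the namespacing, not the actual contents of unit_info.ml.

```ocaml
(* Hypothetical sketch of the unit / artifact split with a submodule. *)
module Unit_info_sketch = struct
  type t = { modname : string; prefix : string; source_file : string }

  let make ~modname ~prefix ~source_file = { modname; prefix; source_file }
  let modname u = u.modname
  let prefix u = u.prefix
  let source_file u = u.source_file

  module Artifact = struct
    type t = {
      modname : string;
      filename : string;
      source_file : string option; (* unknown for artifacts found on disk *)
    }

    let modname a = a.modname
    let filename a = a.filename
    let source_file a = a.source_file
  end

  (* Derive the .cmx artifact of a unit, keeping its provenance. *)
  let cmx (u : t) : Artifact.t =
    { Artifact.modname = u.modname;
      filename = u.prefix ^ ".cmx";
      source_file = Some u.source_file }
end
```

With this layout `Unit_info_sketch.source_file` returns a `string` while `Unit_info_sketch.Artifact.source_file` returns a `string option`, so the two accessors no longer need distinct prefixes such as `artifact_source_file`.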

gasche · 2023-08-30T18:54:30Z

parsing/unit_info.mli

+
+(** [normalized_cmi u] finds in the load_path a file matching the module name
+    [modname u]. *)
+val normalized_cmi: t -> artifact


If it raises Not_found, I think it should be named find_normalized_cmi, and the exception should be documented.
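To illustrate the naming and documentation convention being asked for, here is a toy version of such a lookup. It is a sketch under strong assumptions: the load path is modelled as an in-memory list of directories with their files rather than the compiler's Load_path, and the matching rule (compare uncapitalized names) is a simplification.

```ocaml
(* Hypothetical sketch with a toy, in-memory load path. *)

(** [find_normalized_cmi load_path modname] is the path of the first [.cmi]
    file in [load_path] whose normalized name matches [modname].
    @raise Not_found if no directory contains a matching file. *)
let find_normalized_cmi (load_path : (string * string list) list) modname =
  let wanted = String.uncapitalize_ascii modname ^ ".cmi" in
  let matches file = String.uncapitalize_ascii file = wanted in
  let rec search = function
    | [] -> raise Not_found
    | (dir, files) :: rest ->
      (match List.find_opt matches files with
       | Some file -> Filename.concat dir file
       | None -> search rest)
  in
  search load_path
```

The `find_` prefix signals both the search cost and the possible exception, which a bare `normalized_cmi` would hide.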

gasche · 2023-08-30T18:58:53Z

parsing/unit_info.mli

+
+(** [artifact filename] reconstruct the module name
+    [modname_from_source filename] associated to the artifact [filename]. *)
+val artifact: string -> artifact


This reads weird because all other Unit_info.<noun> functions are used to project information out of the datatypes of this module. Maybe Unit_info.Artifact.make or Unit_info.Artifact.from_path?

gasche · 2023-08-30T19:47:46Z

driver/optcompile.ml



 let clambda i backend Typedtree.{structure; coercion; _} =
+  let cmx = Unit_info.cmx i.target in


(meh again)

gasche · 2023-08-30T19:48:37Z

driver/optcompile.ml

@@ -64,19 +65,20 @@ let flambda i backend Typedtree.{structure; coercion; _} =
          in
          Asmgen.compile_implementation
            ~backend
-            ~prefixname:i.output_prefix
+            ~prefixname:(Unit_info.prefix i.target)


Each call to compile_implementation needs to be modified. Maybe it could even be changed to take a unit_info instead of a prefixname, to make caller-side changes simpler?

Octachron · 2023-09-07

I think it is better to drop the information which is no longer used in this case.

gasche · 2023-08-30T19:49:43Z

driver/optcompile.ml

-  Asmgen.compile_implementation_linear i.output_prefix ~progname:i.source_file
+  Compilenv.reset ?packname:!Clflags.for_package (Unit_info.modname i.target);
+  Asmgen.compile_implementation_linear (Unit_info.prefix i.target)
+    ~progname:(Unit_info.source_file i.target)


here as well it looks like a single i.target argument would make a lot of sense for compile_implementation_linear.

gasche · 2023-08-30T19:50:43Z

file_formats/cmt_format.ml

@@ -164,19 +164,20 @@ let record_value_dependency vd1 vd2 =
  if vd1.Types.val_loc <> vd2.Types.val_loc then
    value_deps := (vd1, vd2) :: !value_deps

-let save_cmt filename modname binary_annots sourcefile initial_env cmi shape =
+let save_cmt dest binary_annots initial_env cmi shape =


target or dest?

gasche · 2023-08-30T19:57:50Z

typing/persistent_env.ml

-let read penv f modname filename =
+let read penv f a =
+  let modname = Unit_info.artifact_modname a in
+  let filename = Unit_info.filename a in
  snd (read_pers_struct penv f true modname filename)


I would expect the argument change to be pushed into read_pers_struct.

gasche

I am happy with the final result. What do you want to do about the history?

(I think that it would make sense to squash, but you could also split into two commits, one that introduces unit_info and one with all the client-side changes at once.)

parsing/unit_info.mli

Octachron · 2023-09-08T12:53:17Z

I kept two commits: one for introducing the new module, the second one for using it. While rebasing #11736 on top of this PR, I couldn't resist the temptation to cherry-pick one simplification: with the last commit, we no longer add cmi files with an invalid modname to the persistent environment.

gasche reviewed Aug 5, 2023

gasche mentioned this pull request Aug 29, 2023

Newlines in quoted string literals mishandled on Windows/Cygwin #12502

Closed

Octachron force-pushed the unified_file_info branch from 549e712 to d626e7d Compare August 30, 2023 15:07

gasche reviewed Aug 30, 2023

gasche approved these changes Sep 7, 2023

Octachron added 4 commits September 8, 2023 10:37

Unit_info module: metadata for compilation units and artifacts

8058b56

use Unit_info everywhere

efa6276

update Changes

6ff06b4

Only add valid modname in the persistent env

f91ddec

Octachron force-pushed the unified_file_info branch from 1eed6ef to f91ddec Compare September 8, 2023 12:49

Octachron mentioned this pull request Sep 8, 2023

Modest support for Unicode letters in identifiers #11736

Closed

gasche merged commit c2b87d8 into ocaml:trunk Sep 8, 2023
9 checks passed

jmid mentioned this pull request Sep 11, 2023

Regression on trunk compiling QCheck with PR 12389 #12543

Closed

Octachron mentioned this pull request Sep 11, 2023

Fix cmi lookup after #12389 #12545

Merged

gasche added a commit that referenced this pull request Sep 11, 2023

Merge pull request #12545 from Octachron/fix_too_strict_cmi_lookup

eab1105

Fix cmi lookup after #12389

Octachron mentioned this pull request Sep 12, 2023

Warn on ambiguous library compilation artifacts #12550

Open

ccasin mentioned this pull request Oct 8, 2023

Add a -H flag, second attempt #12246

Merged

Octachron mentioned this pull request Oct 13, 2023

Modest support for Unicode letters in identifiers, take 2 #12664

Open

Octachron pushed a commit to Octachron/ocaml that referenced this pull request Feb 20, 2024

Merge pull request ocaml#12389 from Octachron/unified_file_info

5d42fc0

Unified metadata for compilation files (or no more capitalize_ascii) (cherry picked from commit c2b87d8)

Octachron pushed a commit to Octachron/ocaml that referenced this pull request Feb 20, 2024

Merge pull request ocaml#12545 from Octachron/fix_too_strict_cmi_lookup

a634b87

Fix cmi lookup after ocaml#12389 (cherry picked from commit eab1105)

gasche mentioned this pull request Feb 21, 2024

5.2.0~alpha1: ocamlc -pack changed the expected naming convention for the cmi files #12984

Closed

Octachron mentioned this pull request Feb 23, 2024

#12984: restore the filename computation for companion cmi #12987

Merged
