cabal init incorrectly writes non-ASCII file names in library:exposed-modules #6507

Open
Fyrbll opened this issue Jan 23, 2020 · 7 comments


@Fyrbll

Fyrbll commented Jan 23, 2020

Describe the bug
Consider a folder hierarchy that looks like

proglet/
|- Typeclass/
   |- Pokémon.hs

Running cabal init --interactive when the working directory is proglet generates the files necessary for cabal v2-build.

Running cabal v2-build, however, gives the unexpected error

Typeclass/Pokémon.hs:1:8: error:
    File name does not match module name:
    Saw: ‘Typeclass.Pokémon’
    Expected: ‘Typeclass.Pokémon’
  |
1 | module Typeclass.Pokémon where
  |        ^^^^^^^^^^^^^^^^^

To Reproduce

  1. Create folder proglet
  2. Create folder proglet/Typeclass
  3. Create file proglet/Typeclass/Pokémon.hs with content

module Typeclass.Pokémon where

x = 1

  4. Change the working directory to proglet
  5. Run cabal init --interactive and complete the prompts this way
  • Don't generate a "simple project with sensible defaults"
  • Build a library
  • Use any version of the Cabal specification
  • Any package name of your choosing
  • Any package version
  • Any license
  • Any author name
  • Any email
  • Any project URL
  • Any synopsis
  • Any category
  • No source directory
  • No test suite
  • Haskell2010
  • No "informative comments"
  6. Run cabal v2-build and observe the error

Expected behavior
I expected the build to succeed.

System information

  • Operating system: macOS 10.14
  • cabal version: 3.0.0.0
  • ghc version: 8.6.5

Additional context
It seems that the "e with acute accent" under "Saw" is the single precomposed code point U+00E9, while the "e with acute accent" under "Expected" is a combination of the normal letter "e" (U+0065) and a combining acute accent (U+0301).
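
A minimal sketch of why the two names fail to match, using only base. The names nfcName and nfdName are illustrative and not taken from cabal or GHC; the escapes \233 and \769 are the decimal forms of U+00E9 and U+0301. Haskell compares Strings code point by code point, so the two spellings are different values even though they render identically.

```haskell
-- Precomposed form: 'é' is the single code point U+00E9 (decimal 233).
nfcName :: String
nfcName = "Typeclass.Pok\233mon"

-- Decomposed form: 'e' (U+0065) followed by the combining acute
-- accent U+0301 (decimal 769).
nfdName :: String
nfdName = "Typeclass.Poke\769mon"

main :: IO ()
main = do
  -- String equality is code point by code point, so this prints False.
  print (nfcName == nfdName)
  -- The decomposed form is one code point longer: prints (17,18).
  print (length nfcName, length nfdName)
```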

Since the "e with acute accent" under "Expected" corresponds to the contents of the library:exposed-modules section of the .cabal file, I checked proglet.cabal (I have removed the comment lines below).

cabal-version:       2.4

name:                proglet
version:             0.1.0.0
license:             BSD-3-Clause
license-file:        LICENSE
author:              Fyrbll
maintainer:          unnecessary
extra-source-files:  CHANGELOG.md

library
  exposed-modules:     Typeclass.Pokémon
  build-depends:       base ^>=4.12.0.0
  default-language:    Haskell2010

I changed Typeclass.Pokémon in the file above to Typeclass.Pokémon, where the latter actually uses U+00E9, and then cabal build worked painlessly.

@phadej
Collaborator

phadej commented Jan 23, 2020

According to https://superuser.com/questions/999232/unicode-filenames-in-windows-vs-mac-os-x:

OS X uses UTF-8. Codepoints are encoded using between one and five bytes. OS X uses Unicode NFD (Normalization Form Canonical Decomposition).

This means that when a Unicode character such as "é" is used in a filename it will always be normalized by the system into a regular ASCII "e" followed by a Unicode combining acute accent, and will always take two codepoints.

NFD (note visible e, i.e. 65)

00000000  54 79 70 65 63 6c 61 73  73 2e 50 6f 6b 65 cc 81  |Typeclass.Poke..|
00000010  6d 6f 6e 0a                                       |mon.|
00000014

NFC (note no e)

00000000  54 79 70 65 63 6c 61 73  73 2e 50 6f 6b c3 a9 6d  |Typeclass.Pok..m|
00000010  6f 6e 0a                                          |on.|
00000013

I'm not 100% sure what happens, though. cabal init writes what it gets from the file system,
but GHC doesn't find a file based on it. Conversion through String shouldn't destroy this.
Finding where the normalization happens (and why) will help to solve the issue.
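
The byte sequences in the two dumps above can be reproduced with a small hand-written UTF-8 encoder. This is only a sketch: encodeChar and hexBytes are hypothetical helpers written for illustration (they do not reject surrogate code points), and they are not part of cabal or GHC.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Numeric (showHex)

-- Encode one code point as its UTF-8 byte values (1 to 4 bytes).
encodeChar :: Char -> [Int]
encodeChar c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Render a string as space-separated, two-digit lowercase hex bytes.
hexBytes :: String -> String
hexBytes = unwords . map hex2 . concatMap encodeChar
  where
    hex2 b = let s = showHex b "" in if length s < 2 then '0' : s else s

main :: IO ()
main = do
  putStrLn (hexBytes "Poke\769mon")  -- decomposed: 'e' then U+0301
  putStrLn (hexBytes "Pok\233mon")   -- precomposed: U+00E9
```

The first line contains 65 cc 81 and the second c3 a9, matching the NFD and NFC dumps respectively.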

@Fyrbll
Author

Fyrbll commented Jan 23, 2020

Thanks so much for the insight! I'll use this information to make whatever progress I can on my end, and if I learn exactly what's going on I'll post my findings here.

@Fyrbll
Author

Fyrbll commented Jan 27, 2020

Unfortunately, the Super User answer only applies to the HFS and HFS+ file systems, which were replaced by the APFS file system in macOS High Sierra (10.13) and above. I can confirm that my machine uses APFS, whose normalization rules can be found here. According to these, the system doesn't enforce a single form of Unicode normalization.

APFS accepts only valid UTF-8 encoded filenames for creation, and preserves both case and normalization of the filename on disk in all variants.
...
Being normalization-insensitive ensures that normalization variants of a filename cannot be created in the same directory, and that a filename can be found with any of its normalization variants.

I have reason to believe it's the program creating the file that controls how its name is normalized. On my system:

  • Emacs 26.1: Running M-x save-buffer, then entering the name éclair, results in an NFD normalized file name.
  • TextEdit: Opening a new file, then saving it using Command + S with the name éclair results in an NFD normalized file name.
  • Vim 8.0: Opening a new file, then saving it with :w éclair results in an NFC normalized file name.
  • Running > éclair or touch éclair with bash results in an NFC normalized file name.

(Note: I checked the above statements using xxd)
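
As an alternative to xxd, a name that has already been read into a Haskell String can be inspected directly. The sketch below assumes only base; codePoints and hasCombiningMark are hypothetical helpers, and checking for a nonspacing mark is just a rough heuristic for "this name looks decomposed", not a real normalization test.

```haskell
import Data.Char (GeneralCategory (NonSpacingMark), generalCategory, ord)
import Text.Printf (printf)

-- Show every code point of a name in U+XXXX notation.
codePoints :: String -> [String]
codePoints = map (printf "U+%04X" . ord)

-- Heuristic: a combining (nonspacing) mark such as U+0301 suggests
-- that the name is stored in a decomposed form.
hasCombiningMark :: String -> Bool
hasCombiningMark = any ((== NonSpacingMark) . generalCategory)

main :: IO ()
main = do
  print (codePoints "e\769clair")        -- decomposed spelling
  print (codePoints "\233clair")         -- precomposed spelling
  print (hasCombiningMark "e\769clair")  -- True
  print (hasCombiningMark "\233clair")   -- False
```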

With this information in mind, when I create the following file with Emacs and save it with the name Pokémon.hs using the save-buffer function, the file name is NFD normalized, whereas the text within the file (most importantly the name of the module) is NFC normalized.

module Pokémon where

x = 1

Note that when cabal init is populating the exposed-modules field, it doesn't venture into the Haskell files themselves to pull out module names. It looks at file names and directory names, trusting that the relative path to a file will match the name of the module declared within it, except with periods in place of slashes.

If I understand correctly, when cabal build, cabal new-build, or cabal v2-build is run, the name of the module is expected to match the name of the file exactly - which won't happen if the module's name and the file's name are normalized differently.
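
The mismatch described in the last two paragraphs can be sketched as follows. This is not the code in cabal-install or GHC: splitOn, moduleNameFromPath and matchesDeclared are hypothetical helpers that only mimic the "relative path with periods in place of slashes" rule, using base alone.

```haskell
import Data.List (intercalate, isSuffixOf)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Derive a module name from a relative path such as "A/B.hs",
-- looking only at the path and never at the file's contents.
moduleNameFromPath :: FilePath -> Maybe String
moduleNameFromPath path
  | ".hs" `isSuffixOf` path = Just (intercalate "." (splitOn '/' stem))
  | otherwise               = Nothing
  where
    stem = take (length path - 3) path  -- drop the ".hs" extension

-- Exact comparison between the path-derived name and the declared one.
matchesDeclared :: FilePath -> String -> Bool
matchesDeclared path declared = moduleNameFromPath path == Just declared

main :: IO ()
main = do
  -- File name stored decomposed, module declared precomposed: False.
  print (matchesDeclared "Typeclass/Poke\769mon.hs" "Typeclass.Pok\233mon")
  -- Both precomposed: True.
  print (matchesDeclared "Typeclass/Pok\233mon.hs" "Typeclass.Pok\233mon")
```

Normalizing both sides to the same form before comparing, as in the workaround below, would make the first check succeed as well.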

I managed to work around this problem for my local cabal by changing a definition in the where clause of the function scanForModulesIn, located in the module Distribution.Client.Init.Heuristics of the cabal-install project.

I changed

entries <- getDirectoryContents (projectRoot </> dir)

to

entries <- fmap (map (T.unpack . normalize NFKC . T.pack))
    (getDirectoryContents (projectRoot </> dir))

Above, unpack and pack are from Data.Text, while normalize is from Data.Text.Normalize in the unicode-transforms package.

@Fyrbll
Author

Fyrbll commented Apr 26, 2020

Since this has been marked as a bug now, if the fix above is acceptable (in its current form it adds unicode-transforms as a dependency) I can make a pull request.

@phadej
Collaborator

phadej commented Apr 26, 2020

I'm not sure that's a correct fix. I don't understand why getDirectoryContents gets differently normalized contents.

Without proper understanding, when we fix macOS we might break Windows or Linux, so this should be fixed properly.

Also there is: haskell/tar#6 so I suspect that may cause some problems too (or is the problem?)

emilypi mentioned this issue Mar 30, 2021

@jneira
Member

jneira commented Jul 17, 2021

@emilypi could this be resolved by #7344?

@emilypi
Member

emilypi commented Jul 17, 2021

@jneira this was explicitly left off that particular ticket, because we weren't sure if it was completely solved. However, if someone were to confirm that we did in fact fix this, I would be fine with saying it's closed. A regression test for this would be enough for me to make that call.
