A Scala compiler plugin that warns of/composes decomposed Unicode characters

ken1ma/unicode-compose


The Difficulty

  1. There are systems that do not implement Unicode equivalence, which may cause interoperability issues that are hard to fix.

    1. These issues are hard to diagnose because two different but canonically equivalent character sequences are indistinguishable on screen.
  2. XFS, the default file system in RHEL 7, is one such system.

    1. The following code throws NoSuchFileException on XFS, but completes successfully on macOS/Windows:

      import java.nio.file.{Paths, Files}
      
      object FileNameGa extends App {
      	val name1 = "\u304c.txt" // precomposed character ('\u304c'), rendered as が
      	val name2 = "\u304b\u3099.txt" // decomposed character ('\u304b', '\u3099') that looks identical on screen
      
      	// create a file with name1, and read the file with name2
      	Files.write(Paths.get(name1), Array[Byte](0))
      	Files.readAllBytes(Paths.get(name2)) // the file is found only if the system implements Unicode equivalence
      }
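The mismatch can also be reproduced without touching a file system. The following is a minimal sketch, assuming only `java.text.Normalizer` from the JDK; the object and method names (`EquivalenceCheck`, `canonicallyEqual`) are hypothetical and chosen for illustration. It shows that the two names differ as plain strings but compare equal once both are normalized to the same form.

```scala
import java.text.Normalizer
import java.text.Normalizer.Form

// Hypothetical helper object, not part of the plugin.
object EquivalenceCheck {
  val precomposed = "\u304c.txt"       // U+304C
  val decomposed  = "\u304b\u3099.txt" // U+304B followed by U+3099

  /** True when both strings normalize to the same NFC form,
    * i.e. they are canonically equivalent. */
  def canonicallyEqual(a: String, b: String): Boolean =
    Normalizer.normalize(a, Form.NFC) == Normalizer.normalize(b, Form.NFC)

  def main(args: Array[String]): Unit = {
    println(precomposed == decomposed)                 // false: plain comparison is per code unit
    println(canonicallyEqual(precomposed, decomposed)) // true: equal after normalization
  }
}
```

A system that does not implement Unicode equivalence behaves like the first comparison.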

A Remedy

  1. The problem can be largely avoided by not mixing precomposed and decomposed characters.

  2. Unicode defines, and Java supports, normalization forms that can solve the problem, but each of them introduces new obstacles.

    1. NFD (decomposed) is not easy to work with; the vast majority of programming/authoring tools produce precomposed characters.

      1. One common source of NFD strings is file names in the macOS Finder. If such a file name is copy-pasted into Vim, problems are likely, because text typed with the macOS input method is precomposed and the two forms end up mixed.
    2. NFC (precomposed) not only composes characters but also replaces some similar-looking characters (CJK Compatibility Ideographs) with CJK Unified Ideographs, whereas most people casually distinguish those similar characters, which carry subtly different connotations.
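The behavior of the two forms can be observed with `java.text.Normalizer`. This is a small sketch; the object and helper names (`NormalizationForms`, `nfc`, `nfd`, `codePoints`) are hypothetical, and U+F900 is used only as one example of a CJK Compatibility Ideograph that NFC replaces.

```scala
import java.text.Normalizer
import java.text.Normalizer.Form

// Hypothetical helper object for illustration only.
object NormalizationForms {
  def nfc(s: String): String = Normalizer.normalize(s, Form.NFC)
  def nfd(s: String): String = Normalizer.normalize(s, Form.NFD)

  /** Renders a string as a list of code points, e.g. "U+304B U+3099". */
  def codePoints(s: String): String =
    s.codePoints().toArray.map(cp => f"U+$cp%04X").mkString(" ")

  def main(args: Array[String]): Unit = {
    println(codePoints(nfd("\u304c")))       // U+304B U+3099
    println(codePoints(nfc("\u304b\u3099"))) // U+304C
    // NFC also rewrites a compatibility ideograph into a unified ideograph
    println(codePoints(nfc("\uf900")))       // U+8C48
  }
}
```

The last line is the side effect described above: the input contained no combining marks at all, yet NFC still changed it.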

Scala compiler plugin

The plugin warns of, or composes, decomposed Unicode characters.
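As a rough illustration of the idea, and not the plugin's actual implementation, the sketch below finds and composes only sequences made of a base character followed by combining marks, leaving everything else (including standalone CJK Compatibility Ideographs) untouched. The names `ComposeSketch`, `findDecomposed`, and `compose` are hypothetical, and restricting NFC to such sequences is an assumption made here to sidestep the NFC side effect described earlier.

```scala
import java.text.Normalizer
import java.util.regex.{Matcher, Pattern}

// Hypothetical sketch; the real plugin works on compiler trees, not raw strings.
object ComposeSketch {
  // A non-mark character followed by one or more combining marks (category M)
  private val cluster = Pattern.compile("\\P{M}\\p{M}+")

  /** Offsets of base+mark sequences that NFC would change (candidates for a warning). */
  def findDecomposed(s: String): List[Int] = {
    val m = cluster.matcher(s)
    val offsets = List.newBuilder[Int]
    while (m.find()) {
      if (!Normalizer.isNormalized(m.group(), Normalizer.Form.NFC)) offsets += m.start()
    }
    offsets.result()
  }

  /** Applies NFC only to base+mark sequences; all other text is copied as-is. */
  def compose(s: String): String = {
    val m = cluster.matcher(s)
    val sb = new StringBuffer
    while (m.find()) {
      val composed = Normalizer.normalize(m.group(), Normalizer.Form.NFC)
      m.appendReplacement(sb, Matcher.quoteReplacement(composed))
    }
    m.appendTail(sb)
    sb.toString
  }
}
```

For example, `ComposeSketch.compose("\u304b\u3099.txt")` yields `"\u304c.txt"`, while a string containing only `"\uf900"` is returned unchanged.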

Build Environment

  1. Java 1.8.0_201

  2. mill 0.3.6

    1. On macOS, Homebrew can be used for installation

       brew install mill
      

Commands

  1. Build and publish to ~/.ivy2/local

     mill plugin.publishLocal
    
  2. Publish to Maven Central

     mill plugin.publish --sonatypeCreds "user:pass" --release true
    
    1. macOS: install gnupg with brew
    2. mill-0.3.6: mill plugin.publish fails with os.SubprocessException: CommandResult 2 if gpg tries to prompt for the passphrase
      1. To avoid the failure, have gpg-agent supply the passphrase, e.g., by signing a file beforehand so that the passphrase is cached: LANG=en_US.UTF8 gpg -ab README.md

References

  1. https://docs.scala-lang.org/overviews/plugins/index.html
  2. https://typelevel.org/scala/docs/phases.html
  3. https://tama-san.com/unicode-nfc/
