UNIC: Unicode and Internationalization Crates for Rust
bors and eyeplum Merge #251
251: Add ucd/name_aliases r=behnam a=eyeplum

Recently I found myself needing to fetch name aliases for a given character and it was not implemented in rust-unic yet. So I implemented it and thought it might be good to merge it back.

**Changes**
- Parse http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt and expose the data via `NameAliasType` and `name_aliases_of()`
- In the `unic-inspector` app, if a character's Unicode name is not available, check whether it has name abbreviations and fall back to the first abbreviation if one is available; otherwise display `<none>`

**Notes**

All of the APIs return a vector of string references because the comments in `NameAliases.txt` mention:

> Parsers of this data file should take note that the same code point can (and does) occur more than once.
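
To illustrate why a vector is the natural return type, here is a minimal sketch that groups records by code point. It assumes the semicolon-separated `code_point;alias;type` record layout with `#` comments; `group_aliases` is a hypothetical helper written for this note, not the parser added in this change.

```rust
use std::collections::BTreeMap;

/// Groups `code_point;alias;type` records by code point.
/// Hypothetical helper for illustration; not part of the `unic` API.
fn group_aliases(data: &str) -> BTreeMap<u32, Vec<(&str, &str)>> {
    let mut map: BTreeMap<u32, Vec<(&str, &str)>> = BTreeMap::new();
    for line in data.lines() {
        // Drop trailing comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        if let (Some(cp), Some(alias), Some(kind)) =
            (fields.next(), fields.next(), fields.next())
        {
            // Code points are written in hexadecimal, without a prefix.
            if let Ok(cp) = u32::from_str_radix(cp, 16) {
                // The same code point may appear on several lines,
                // so every record is appended rather than overwritten.
                map.entry(cp).or_default().push((alias, kind));
            }
        }
    }
    map
}

fn main() {
    let sample = "\
# Sample records in the assumed NameAliases.txt layout
000A;LINE FEED;control
000A;NEW LINE;control
000A;LF;abbreviation
0010;DATA LINK ESCAPE;control
0010;DLE;abbreviation
";
    for (cp, entries) in &group_aliases(sample) {
        println!("U+{:04X}: {:?}", cp, entries);
    }
}
```

Because records for one code point are appended in file order, a caller can filter the resulting vector by alias type without losing any entry.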

Some `unic-inspector` examples after the change:

```
$ unic-inspector \U0010\U000A\U0041\U007F\U0098\U0530
 �  | U+0010 | DLE                    | Control 
   | U+000A | LF                     | Control 
 A | U+0041 | LATIN CAPITAL LETTER A | Uppercase_Letter 
 �  | U+007F | DEL                    | Control 
 �  | U+0098 | SOS                    | Control 
 ԰ | U+0530 | <none>                 | Unassigned 
```
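
The fallback order described under **Changes** (Unicode name, then the first abbreviation, then `<none>`) could be sketched roughly as follows. `display_name` is a hypothetical function for illustration only and is not claimed to match the actual `unic-inspector` code.

```rust
/// Picks the label shown for a character: its Unicode name if present,
/// otherwise its first name abbreviation, otherwise "<none>".
/// Hypothetical helper; the real app may be structured differently.
fn display_name<'a>(name: Option<&'a str>, abbreviations: &[&'a str]) -> &'a str {
    name.or_else(|| abbreviations.first().cloned())
        .unwrap_or("<none>")
}

fn main() {
    // A character with a Unicode name keeps that name.
    println!("{}", display_name(Some("LATIN CAPITAL LETTER A"), &[]));
    // A control character without a name falls back to an abbreviation.
    println!("{}", display_name(None, &["DLE"]));
    // An unassigned code point has neither.
    println!("{}", display_name(None, &[]));
}
```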

Didn't find a GitHub issue for name aliases but #127 might be related.



Co-authored-by: Yan Li <eyeplum@gmail.com>
Latest commit 6c6702a Jan 13, 2019
.github Add .github (empty) templates Sep 27, 2017
apps/cli [ucd/name_aliases] Improve name_aliases public APIs Jan 11, 2019
docs Merge #228 Aug 19, 2018
etc data: Delete and move remaining files to /external/ Jan 6, 2019
external external/unicode: Add submodules: ucd, idna, emoji Jan 6, 2019
gen [ucd/name_aliases] Improve name_aliases public APIs Jan 11, 2019
unic [ucd/name_aliases] Improve name_aliases public APIs Jan 11, 2019
.appveyor.yml
.bors.toml Add bors.toml Aug 10, 2017
.gitattributes Generated rs -> rsd "Rust Declarations" Sep 17, 2017
.gitignore Remove IDE specific ignore from gitignore Jul 19, 2017
.gitmodules external/unicode: Add submodules: ucd, idna, emoji Jan 6, 2019
.rustfmt.toml [rustfmt] Enable reorder_impl_items Aug 19, 2018
.travis.yml travis: Bump MIN_RUST_VERSION to 1.31.0 Jan 6, 2019
AUTHORS Add unic-segment component Oct 6, 2017
CODE_OF_CONDUCT.md [CODE_OF_CONDUCT] Use The Rust Code of Conduct Nov 11, 2017
CONTRIBUTING.md Merge #228 Aug 19, 2018
COPYRIGHT.md [COPYRIGHT.md] Update dates and organizations Jun 20, 2017
Cargo.toml data: Delete and move remaining files to /external/ Jan 6, 2019
LICENSE-APACHE Initial commit with UCD, Bidi, IDNA, and Normalization components Jun 20, 2017
LICENSE-MIT Initial commit with UCD, Bidi, IDNA, and Normalization components Jun 20, 2017
README.md travis: Bump MIN_RUST_VERSION to 1.31.0 Jan 6, 2019

README.md

UNIC: Unicode and Internationalization Crates for Rust


Travis AppVeyor Rust-1.31.0+ Unicode-10.0.0 Release Crates.io Documentation Gitter

https://github.com/open-i18n/rust-unic

UNIC is a project to develop components for the Rust programming language to provide high-quality and easy-to-use crates for Unicode and Internationalization data and algorithms. In other words, it's like ICU for Rust, written completely in Rust, mostly in safe mode, but also benefiting from performance gains of unsafe mode when possible.

Project Goal

The goal for UNIC is to provide access to all levels of Unicode and Internationalization functionalities, starting from Unicode character properties, to Unicode algorithms for processing text, and more advanced (locale-based) processes based on Unicode Common Locale Data Repository (CLDR).

Other standards and best practices, like IETF RFCs, are also implemented, as needed by Unicode/CLDR components or by common demand.

Project Status

At the moment UNIC is under heavy development: the API is updated frequently on the master branch, and there will be API breakage between each 0.x release. Please see open issues for planned changes.

We expect to have the 1.0 version released in 2018 and maintain a stable API afterwards, with possibly one or two API updates per year for the first couple of years.

Design Goals

  1. The primary goal of UNIC is to provide reliable functionality by way of an easy-to-use API. Therefore, newly added components may not be well optimized for performance, but will have enough tests to show conformance to the standard, and examples to show users how they can be used to address common needs.

  2. The next major goal for UNIC components is performance and low binary and memory footprints. In particular, optimizing runtime for ASCII and other common cases will encourage adoption without fear of slowing down regular development processes.

  3. Components are guaranteed, to the extent possible, to provide consistent data and algorithms. Cross-component tests are used to catch any inconsistency between implementations, without slowing down development processes.

Components and their Organization

UNIC Components have a hierarchical organization, starting from the unic root, containing the major components. Each major component, in turn, may host some minor components.

The APIs of major components are designed for the end users of the libraries, and are expected to be extensively documented and accompanied by code examples.

In contrast to major components, minor components act as providers of data and algorithms for the higher-level components. Their APIs are expected to be more performant, and may provide multiple ways of accessing the data.

The UNIC Super-Crate

The unic super-crate is a collection of all UNIC (major) components, providing an easy way of access to all functionalities, when all or many are needed, instead of importing components one-by-one. This crate ensures all components imported are compatible in algorithms and consistent data-wise.

Main code examples and cross-component integration tests are implemented under this crate.

Major Components

Applications

Code Organization: Combined Repository

Some of the reasons to have a combined repository for these components are:

  • Faster development. Implementing new Unicode/i18n components very often depends on other (lower-level) components, which in turn may need adjustments (exposing new API, fixing bugs, etc.) that can be developed, tested, and reviewed in fewer cycles and less time.

  • Implementation Integrity. Multiple dependencies on other components mean that the components need to, to some level, agree with each other. Many Unicode algorithms, composed from smaller ones, assume that all parts of the algorithm are using the same version of Unicode data. Violation of this assumption can cause inconsistencies and hard-to-catch bugs. In a combined repository, it's possible to reach better integrity during development, as well as with cross-component (integration) tests.

  • Pay for what you need. Small components (basic crates), which cross-depend only on what they need, allow users to only bring in what they consume in their project.

  • Shared bootstrapping. A considerable amount of the work of extending Unicode/i18n functionality depends on converting source Unicode/locale data into structured formats for the destination programming language. In a combined repository, it's easier to maintain these bootstrapping tools, expand coverage, and use better data structures for more efficiency.
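
As a rough illustration of the "Implementation Integrity" point, a cross-component check might compare the Unicode version each component was built against. Everything in this sketch is an assumption made for illustration: the local `UnicodeVersion` type, the `ucd_age` and `bidi` stand-in modules, and the `all_same_version` helper are not the actual UNIC components or tests.

```rust
/// Local stand-in for a Unicode version triple (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UnicodeVersion {
    pub major: u16,
    pub minor: u16,
    pub micro: u16,
}

// Hypothetical components, each recording the data version it was built from.
mod ucd_age {
    pub const UNICODE_VERSION: super::UnicodeVersion =
        super::UnicodeVersion { major: 10, minor: 0, micro: 0 };
}

mod bidi {
    pub const UNICODE_VERSION: super::UnicodeVersion =
        super::UnicodeVersion { major: 10, minor: 0, micro: 0 };
}

/// Returns true when every listed component uses the same Unicode version.
fn all_same_version(versions: &[UnicodeVersion]) -> bool {
    // Comparing each adjacent pair is enough to prove all entries are equal.
    versions.windows(2).all(|pair| pair[0] == pair[1])
}

fn main() {
    let versions = [ucd_age::UNICODE_VERSION, bidi::UNICODE_VERSION];
    // An integration test in a combined repository could assert this directly.
    assert!(all_same_version(&versions));
    println!("all components agree on {:?}", versions[0]);
}
```

In a combined repository a check like this can run on every change, which is harder to arrange when components live in separate repositories and are released independently.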

Documentation

How to Use UNIC

In Cargo.toml:

[dependencies]
unic = "0.8.0"  # This has Unicode 10.0.0 data and algorithms

And in main.rs:

extern crate unic;

use unic::ucd::common::is_alphanumeric;
use unic::bidi::BidiInfo;
use unic::normal::StrNormalForm;
use unic::segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};
use unic::ucd::normal::compose;
use unic::ucd::{is_cased, Age, BidiClass, CharAge, CharBidiClass, StrBidiClass, UnicodeVersion};

fn main() {

    // Age

    assert_eq!(Age::of('A').unwrap().actual(), UnicodeVersion { major: 1, minor: 1, micro: 0 });
    assert_eq!(Age::of('\u{A0000}'), None);
    assert_eq!(
        Age::of('\u{10FFFF}').unwrap().actual(),
        UnicodeVersion { major: 2, minor: 0, micro: 0 }
    );

    if let Some(age) = '🦊'.age() {
        assert_eq!(age.actual().major, 9);
        assert_eq!(age.actual().minor, 0);
        assert_eq!(age.actual().micro, 0);
    }

    // Bidi

    let text = concat![
        "א",
        "ב",
        "ג",
        "a",
        "b",
        "c",
    ];

    assert!(!text.has_bidi_explicit());
    assert!(text.has_rtl());
    assert!(text.has_ltr());

    assert_eq!(text.chars().nth(0).unwrap().bidi_class(), BidiClass::RightToLeft);
    assert!(!text.chars().nth(0).unwrap().is_ltr());
    assert!(text.chars().nth(0).unwrap().is_rtl());

    assert_eq!(text.chars().nth(3).unwrap().bidi_class(), BidiClass::LeftToRight);
    assert!(text.chars().nth(3).unwrap().is_ltr());
    assert!(!text.chars().nth(3).unwrap().is_rtl());

    let bidi_info = BidiInfo::new(text, None);
    assert_eq!(bidi_info.paragraphs.len(), 1);

    let para = &bidi_info.paragraphs[0];
    assert_eq!(para.level.number(), 1);
    assert_eq!(para.level.is_rtl(), true);

    let line = para.range.clone();
    let display = bidi_info.reorder_line(para, line);
    assert_eq!(
        display,
        concat![
            "a",
            "b",
            "c",
            "ג",
            "ב",
            "א",
        ]
    );

    // Case

    assert_eq!(is_cased('A'), true);
    assert_eq!(is_cased('א'), false);

    // Normalization

    assert_eq!(compose('A', '\u{030A}'), Some('Å'));

    let s = "ÅΩ";
    let c = s.nfc().collect::<String>();
    assert_eq!(c, "ÅΩ");

    // Segmentation

    assert_eq!(
        Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
        &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
    );

    assert_eq!(
        Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
        &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
    );

    assert_eq!(
        GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
        &[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
    );

    assert_eq!(
        Words::new(
            "The quick (\"brown\") fox can't jump 32.3 feet, right?",
            |s: &&str| s.chars().any(is_alphanumeric),
        ).collect::<Vec<&str>>(),
        &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
    );

    assert_eq!(
        WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
        &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
    );

    assert_eq!(
        WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
        &[
            (0, "Brr"),
            (3, ","),
            (4, " "),
            (5, "it's"),
            (9, " "),
            (10, "29.3"),
            (14, "°"),
            (16, "F"),
            (17, "!")
        ]
    );
}
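
The indices in the `GraphemeIndices` and `WordBoundIndices` assertions above are UTF-8 byte offsets, not character counts, which is why the offset jumps from 14 to 16 after the degree sign. The following standard-library-only sketch shows the same offsets without depending on UNIC; `byte_offsets` is a hypothetical helper introduced here for illustration.

```rust
/// Returns the UTF-8 byte offset at which each `char` of `text` starts.
/// Hypothetical helper; uses only the standard library.
fn byte_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}

fn main() {
    // Same text as the WordBoundIndices example; U+00B0 is the degree sign.
    let text = "Brr, it's 29.3\u{B0}F!";
    for (offset, ch) in byte_offsets(text) {
        // Non-ASCII characters occupy more than one byte in UTF-8,
        // so the offset of the following character skips ahead.
        println!("{:>2}: {:?} ({} byte(s))", offset, ch, ch.len_utf8());
    }
}
```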

You can find more examples under the examples and tests directories. (And more to be added as UNIC expands...)

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)

  • MIT license (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Code of Conduct

UNIC project follows The Rust Code of Conduct. You can find a copy of it in CODE_OF_CONDUCT.md or online at https://www.rust-lang.org/conduct.html.