Commit

Intl segmenter updates (#8402)
* Hydrated stub pages with metadata and structure; first drafts of constructor and supportedLocalesOf pages

* Segmenter examples (#4)

* make spanish_segmenter more... modern?

* add example and syntax to Segmenter#resolvedOptions

* add example and syntax to Segmenter#segment

* add information about segment data objects

* Hit the 80-20 point on Intl.Segmenter

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/constructor/index.md

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

* Apply suggestions from code review

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/constructor/index.md

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/constructor/index.md

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

* Fixed constructor structure

* Fixed constructor structure

* Apply suggestions from code review

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/segments/index.md

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/index.md

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

* Fixed main index link reference

* Fixed code block error

* Wrote the @@iterator page

* Apply suggestions from code review

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Rework tree structure per @Elchi3 comment

* Remove exotic whitespace/gremlin

* Add interactive examples (cf. mdn/interactive-examples#1987)

* Remove jsxref, fix links, normalize tags

* Taking review comments into account, improving examples

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/segment/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/segmenter/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segments/containing/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segments/@@iterator/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segmenter/supportedlocalesof/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Update files/en-us/web/javascript/reference/global_objects/intl/segments/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Remove interactive example due to Fx missing impl.

* sort methods alphabetically

* Update files/en-us/web/javascript/reference/global_objects/intl/segments/@@iterator/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* Favor const

* improve example while condition

* Update files/en-us/web/javascript/reference/global_objects/intl/segments/containing/index.md

Co-authored-by: wbamberg <will@bootbonnet.ca>

* DLify localeMatcher

* this one needs to be let

Co-authored-by: Ujjwal Sharma <ryzokuken@disroot.org>
Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Romulo Cintra <romulocintra@users.noreply.github.com>
Co-authored-by: wbamberg <will@bootbonnet.ca>
Co-authored-by: julieng <julien.gattelier@gmail.com>
Co-authored-by: SphinxKnight <SphinxKnight@users.noreply.github.com>
7 people committed Jan 10, 2022
1 parent 8ade1ab commit c95e770
Showing 12 changed files with 492 additions and 30 deletions.
@@ -1,24 +1,65 @@
---
title: Intl.Segmenter
slug: Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
tags:
- Internationalization
- Intl
- JavaScript
- Localization
- Reference
browser-compat: javascript.builtins.Intl.Segmenter
---
{{JSRef}}

The **`Intl.Segmenter`** object is a constructor for segmenters, objects that enable language sensitive string splitting.
The **`Intl.Segmenter`** object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string.

## Constructor

- [`Intl.Segmenter()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter)
- : Creates a new `Segmenter` object.
- : Creates a new `Intl.Segmenter` object.

## Static methods

- {{jsxref("Intl.Segmenter.supportedLocalesOf", "Intl.Segmenter.supportedLocalesOf()")}}
- [`Intl.Segmenter.supportedLocalesOf()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/supportedLocalesOf)
- : Returns an array containing those of the provided locales that are supported without having to fall back to the runtime's default locale.

## Instance methods

- {{jsxref("Intl.Segmenter.segment", "Intl.Segmenter.prototype.segment()")}}
- : Getter function that segments a string according to the locale and granularity of this {{jsxref("Global_Objects/Intl/Segmenter", "Intl.Segmenter")}} object.
- {{jsxref("Intl.Segmenter.resolvedOptions", "Intl.Segmenter.prototype.resolvedOptions()")}}
- : Returns a new object with properties reflecting the locale and granularity options computed during initialization of this {{jsxref("Global_Objects/Intl/Segmenter", "Intl.Segmenter")}} object.
- [`Intl.Segmenter.prototype.resolvedOptions()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/resolvedOptions)
- : Returns a new object with properties reflecting the locale and granularity options computed during initialization of this `Intl.Segmenter` object.
- [`Intl.Segmenter.prototype.segment()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/segment)
- : Returns a new iterable [`Segments`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segments) instance
representing the segments of a string according to the locale and granularity of this `Intl.Segmenter` instance.

## Examples

### Basic usage and difference from String.prototype.split()

Using [`String.prototype.split(" ")`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split) to segment text into words gives an incorrect result when the language of the text does not put whitespace between words (which is the case for Japanese, Chinese, Thai, Lao, Khmer, Myanmar, etc.).

```js example-bad
const str = "吾輩は猫である。名前はたぬき。";
console.table(str.split(" "));
// ['吾輩は猫である。名前はたぬき。']
// The two sentences are not correctly segmented.

```

```js example-good
const str = "吾輩は猫である。名前はたぬき。";
const segmenterJa = new Intl.Segmenter('ja-JP', { granularity: 'word' });

const segments = segmenterJa.segment(str);
console.table(Array.from(segments));
// [{segment: '吾輩', index: 0, input: '吾輩は猫である。名前はたぬき。', isWordLike: true},
// etc.
// ]
```
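
The Japanese string used above contains two sentences. As a rough sketch, it could be split into sentences by passing `granularity: 'sentence'` instead of `'word'`. The variable names here are illustrative, and the exact boundaries depend on the locale data shipped with the runtime.

```js
const text = "吾輩は猫である。名前はたぬき。";
const sentenceSegmenterJa = new Intl.Segmenter("ja-JP", { granularity: "sentence" });

// Each segment data object has a `segment` string; collect those into an array
const sentences = Array.from(sentenceSegmenterJa.segment(text), (s) => s.segment);
console.log(sentences);
// Expected on runtimes with full locale data: ['吾輩は猫である。', '名前はたぬき。']
```

Concatenating the segments always gives back the original string, whatever the granularity.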

## Specifications

{{Specifications}}

## Browser compatibility

{{Compat}}
@@ -1,7 +1,81 @@
---
title: Intl.Segmenter.prototype.resolvedOptions()
slug: Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/resolvedOptions
tags:
- Internationalization
- Intl
- JavaScript
- Localization
- Reference
browser-compat: javascript.builtins.Intl.Segmenter.resolvedOptions
---
{{JSRef}}

Returns a new object with properties reflecting the locale and granularity options computed during initialization of this [`Intl.Segmenter`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object.
The **`Intl.Segmenter.prototype.resolvedOptions()`** method returns a new object with properties reflecting the locale and granularity options computed during the initialization of this [`Intl.Segmenter`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object.

## Syntax

```js
resolvedOptions()
```

### Parameters

None.

### Return value

A new object with properties reflecting the locale and granularity options computed
during the initialization of the given [`Intl.Segmenter`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object.

## Description

The resulting object has the following properties:

- `locale`
- : The BCP 47 language tag for the locale actually used. If any Unicode extension
values were requested in the input BCP 47 language tag that led to this locale,
the key-value pairs that were requested and are supported for this locale are
included in `locale`.
- `granularity`
- : The value provided for this property in the `options` argument or filled
in as the default.

## Examples

### Basic usage

```js
const spanishSegmenter = new Intl.Segmenter("es", {granularity: "sentence"});
const options = spanishSegmenter.resolvedOptions();
console.log(options.locale); // "es"
console.log(options.granularity); // "sentence"
```

### Default granularity

```js
const spanishSegmenter = new Intl.Segmenter("es");
const options = spanishSegmenter.resolvedOptions();
console.log(options.locale); // "es"
console.log(options.granularity); // "grapheme"
```

### Fallback locale

```js
const banSegmenter = new Intl.Segmenter("ban");
const options = banSegmenter.resolvedOptions();
console.log(options.locale);
// "fr" on a runtime where the Balinese locale
// is not supported and French is the default locale
console.log(options.granularity); // "grapheme"
```
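
### Unicode extension keys

As a hedged illustration of how the `locale` property is computed: the collation (`co`) extension key is not used by segmentation, so a runtime that follows the specification is expected to drop it from the resolved locale. The variable name is illustrative.

```js
const germanSegmenter = new Intl.Segmenter("de-u-co-phonebk", { granularity: "word" });
const options = germanSegmenter.resolvedOptions();
console.log(options.locale);
// expected "de": the collation extension is not relevant to segmentation
console.log(options.granularity); // "word"
```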

## Specifications

{{Specifications}}

## Browser compatibility

{{Compat}}
@@ -1,7 +1,69 @@
---
title: Intl.Segmenter.prototype.segment()
slug: Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/segment
tags:
- Internationalization
- Intl
- JavaScript
- Localization
- Reference
browser-compat: javascript.builtins.Intl.Segmenter.segment
---
{{JSRef}}

Getter function that segments a string according to the locale and granularity of this [`Intl.Segmenter`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object.
The **`Intl.Segmenter.prototype.segment()`** method segments a string according to the locale and granularity of this [`Intl.Segmenter`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object.

## Syntax

```js
segment(input)
```

### Parameters

- `input`
- : The text to be segmented as a [`String`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String).

### Return value

A new iterable [`Segments`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segments) object containing the segments of the input string, using the segmenter's locale and granularity.

## Examples

```js
// Create a locale-specific word segmenter
const segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Use it to get an iterable over the segments of a string
const input = "Moi ? N'est-ce pas ?";
const segments = segmenter.segment(input);

// Iterate over the segments and log each one
for (const {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d]: «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}
// logs
// segment at code units [0, 3]: «Moi» (word-like)
// segment at code units [3, 4]: « »
// segment at code units [4, 5]: «?»
// segment at code units [5, 6]: « »
// segment at code units [6, 11]: «N'est» (word-like)
// segment at code units [11, 12]: «-»
// segment at code units [12, 14]: «ce» (word-like)
// segment at code units [14, 15]: « »
// segment at code units [15, 18]: «pas» (word-like)
// segment at code units [18, 19]: « »
// segment at code units [19, 20]: «?»
```
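
Because word segmentation also yields spaces and punctuation as segments, a common follow-up is to keep only the word-like ones. The sketch below does that with a hypothetical helper named `wordsOf` (not part of the API), and then uses the `containing()` method of the returned `Segments` object to look up the segment at a given code unit index. It uses English text and assumes the runtime ships English locale data.

```js
// Hypothetical helper: returns only the word-like segments of `text`
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) {
      words.push(segment);
    }
  }
  return words;
}

const words = wordsOf("Hello, world! How are you?", "en");
console.log(words);
// ['Hello', 'world', 'How', 'are', 'you']

// Look up the segment that contains the code unit at index 7
const englishSegments = new Intl.Segmenter("en", { granularity: "word" }).segment("Hello world");
const atIndex = englishSegments.containing(7);
console.log(atIndex.segment, atIndex.index);
// "world" 6
```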

## Specifications

{{Specifications}}

## Browser compatibility

{{Compat}}

This file was deleted.

@@ -0,0 +1,70 @@
---
title: Intl.Segmenter() constructor
slug: Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter
tags:
- Constructor
- Segmenter
- Internationalization
- Intl
- JavaScript
- Localization
- Reference
browser-compat: javascript.builtins.Intl.Segmenter.constructor
---

The **`Intl.Segmenter()`** constructor creates [`Intl.Segmenter`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) objects that enable locale-sensitive text segmentation.

## Syntax

```js
new Intl.Segmenter()
new Intl.Segmenter(locales)
new Intl.Segmenter(locales, options)
```

### Parameters

- `locales` {{ optional_inline }}
- : A string with a BCP 47 language tag, or an array of such strings. For the general form and interpretation of the `locales` argument, see the [`Intl`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl#locale_identification_and_negotiation) page.
- `options` {{ optional_inline }}
- : An object with some or all of the following properties:
- `granularity` {{ optional_inline }}
- : A string. Possible values are:
- `"grapheme"` (default)
- : Split the input into segments at grapheme cluster (user-perceived character) boundaries, as determined by the locale.
- `"word"`
- : Split the input into segments at word boundaries, as determined by the locale.
- `"sentence"`
- : Split the input into segments at sentence boundaries, as determined by the locale.
- `localeMatcher` {{ optional_inline }}
- : The locale matching algorithm to use. Possible values are:
- `"best fit"` (default)
- : The runtime may choose a possibly more suited locale than the result of the lookup algorithm.
- `"lookup"`
- : Use the [BCP 47 Lookup algorithm](https://datatracker.ietf.org/doc/html/rfc4647#section-3.4) to choose the locale from `locales`. For each locale in `locales`, the runtime returns the first supported locale, possibly removing restricting subtags of the provided locale tag to find one. For example, providing `"de-CH"` as `locales` may result in using `"de"` if `"de"` is supported but `"de-CH"` is not.
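
To make the three `granularity` values concrete, the following sketch counts the segments produced for the same English string at each granularity, and shows that an unsupported value is rejected. The helper `countSegments` is hypothetical, and the counts assume a runtime with standard English locale data.

```js
const text = "Hello world. Bye.";

// Hypothetical helper: counts the segments produced at a given granularity
function countSegments(granularity) {
  const segmenter = new Intl.Segmenter("en", { granularity });
  return [...segmenter.segment(text)].length;
}

console.log(countSegments("grapheme")); // 17
console.log(countSegments("word")); // 7 (words, spaces, and punctuation are all segments)
console.log(countSegments("sentence")); // 2

// An unsupported granularity value is rejected with a RangeError
let errorName;
try {
  new Intl.Segmenter("en", { granularity: "paragraph" });
} catch (e) {
  errorName = e.name;
}
console.log(errorName); // "RangeError"
```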


### Return value

A new [`Intl.Segmenter`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) instance.

## Examples

### Basic usage

The following example shows how to count the words in a Japanese string (where splitting the string using `String` methods would give an incorrect result).

```js
const text = "吾輩は猫である。名前はたぬき。";
const japaneseSegmenter = new Intl.Segmenter("ja-JP", {granularity: "word"});
console.log([...japaneseSegmenter.segment(text)].filter(segment => segment.isWordLike).length);
// logs 8 as the text is segmented as '吾輩'|'は'|'猫'|'で'|'ある'|'。'|'名前'|'は'|'たぬき'|'。'
```

## Specifications

{{Specifications}}

## Browser compatibility

{{Compat}}

This file was deleted.

This file was deleted.

This file was deleted.

@@ -1,7 +1,66 @@
---
title: Intl.Segmenter.supportedLocalesOf()
slug: Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/supportedLocalesOf
tags:
- Internationalization
- Intl
- JavaScript
- Localization
- Reference
browser-compat: javascript.builtins.Intl.Segmenter.supportedLocalesOf
---
{{JSRef}}

Returns an array containing those of the provided locales that are supported without having to fall back to the runtime's default locale.
The **`Intl.Segmenter.supportedLocalesOf()`** method returns an array containing those of the provided locales that are supported without having to fall back to the runtime's default locale.

## Syntax

```js
supportedLocalesOf(locales)
supportedLocalesOf(locales, options)
```

### Parameters

- `locales`
- : A string with a BCP 47 language tag, or an array of such strings. For the general
form of the `locales` argument, see the [`Intl`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl#locale_identification_and_negotiation) page.
- `options` {{optional_inline}}
- : An object that may have the following property:
- `localeMatcher`
- : The locale matching algorithm to use. Possible values are
"`lookup`" and "`best fit`"; the default is
"`best fit`". For information about this option, see the [`Intl`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl#locale_negotiation) page.

### Return value

An array of strings representing a subset of the given locale tags that are supported
in segmentation without having to fall back to the runtime's default locale.

## Examples

### Using supportedLocalesOf()

Assuming a runtime that supports Indonesian and German but not Balinese in
segmentation, `supportedLocalesOf` returns the Indonesian and German language
tags unchanged, even though `pinyin` collation is neither relevant to segmentation
nor used with Indonesian, and a specialized German for Indonesia is
unlikely to be supported. Note the specification of the "`lookup`"
algorithm here: a "`best fit`" matcher might decide that Indonesian is an
adequate match for Balinese since most Balinese speakers also understand Indonesian,
and therefore return the Balinese language tag as well.

```js
const locales = ['ban', 'id-u-co-pinyin', 'de-ID'];
const options = { localeMatcher: 'lookup' };
console.log(Intl.Segmenter.supportedLocalesOf(locales, options).join(', '));
// → "id-u-co-pinyin, de-ID"
```
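
One possible use of this method is to check a single locale before constructing a segmenter. The helper below, `isSegmentationLocaleSupported`, is a hypothetical name used only for illustration. The sketch also shows that a structurally invalid language tag is rejected rather than silently ignored.

```js
// Hypothetical helper: true if `tag` is supported for segmentation
// without falling back to the runtime's default locale
function isSegmentationLocaleSupported(tag) {
  const supported = Intl.Segmenter.supportedLocalesOf([tag], { localeMatcher: "lookup" });
  return supported.length > 0;
}

console.log(isSegmentationLocaleSupported("en-US"));
// true on runtimes that ship English locale data

// A structurally invalid language tag is rejected with a RangeError
let errorName;
try {
  Intl.Segmenter.supportedLocalesOf("not a tag");
} catch (e) {
  errorName = e.name;
}
console.log(errorName); // "RangeError"
```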

## Specifications

{{Specifications}}

## Browser compatibility

{{Compat}}
