Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/language: change of behavior for language matcher #24211

Open
LayneChris opened this issue Mar 2, 2018 · 13 comments
Open

x/text/language: change of behavior for language matcher #24211

LayneChris opened this issue Mar 2, 2018 · 13 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@LayneChris
Copy link

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.9.2

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

linux amd64

What did you do?

The golang.org/x/text package seemed to have changed in the behaviour of language matching with an update a few days ago:

package main

import (
    "fmt"

    "golang.org/x/text/language"
)

func main() {
    s := []language.Tag{language.MustParse("en"), language.MustParse("fr")}
    p := []language.Tag{language.MustParse("en-US"), language.MustParse("en")}

    l := language.NewMatcher(s)
    ll, _, _ := l.Match(p...)

    fmt.Println(ll)
}

What did you expect to see?

This used to print "en" but now prints "en-u-rg-uszzzz". This doesn't make sense because I only support "en" and "fr" so why is it returning something else? Switching the order of my preferred languages gives "en".

Is there a rhyme or reason why the change because I cannot understand? Defeats the purpose of a language "matcher" if it is going to return languages that I don't support.

If this is by design what is the best way to get just "en". Parent()? Base()? SomethingElse()?

What did you see instead?

"en-u-rg-uszzzz", a language I do not support.

@bradfitz bradfitz changed the title Change of behavior for language matcher x/text/language: change of behavior for language matcher Mar 2, 2018
@gopherbot gopherbot added this to the Unreleased milestone Mar 2, 2018
@bradfitz
Copy link
Contributor

bradfitz commented Mar 2, 2018

/cc @mpvl

@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 13, 2018
@basgys
Copy link

basgys commented Mar 14, 2018

This issue seems to occur only when the preferred language contains regional preferences.

Example

Supported:

  • en
  • en-GB
  • de-CH
  • fr-CH
  • en-US

Tests:

  • want en expect result en actual result en
  • want fr expect result fr-CH actual result fr-CH
  • want de expect result de-CH actual result de-CH
  • want en-GB expect result en-GB actual result en-GB
  • want fr-CH expect result fr-CH actual result fr-CH
  • want de-CH expect result de-CH actual result de-CH
  • want it expect result en actual result en
  • want it-CH expect result en actual result en-u-rg-chzzzz

Regression

This regression was introduced on this commit: golang/text@6008361#diff-c716fa1ccf70cc54ac8b513b999c84eb

Here:

if w.RegionID != tt.RegionID && w.RegionID != 0 {
	if w.RegionID != 0 && tt.RegionID != 0 && tt.RegionID.Contains(w.RegionID) {
		tt.RegionID = w.RegionID
		tt.RemakeString()
	} else if r := w.RegionID.String(); len(r) == 2 {
		// TODO: also filter macro and deprecated.
		tt, _ = tt.SetTypeForKey("rg", strings.ToLower(r)+"zzzz")
	}
}

Temporary workaround
Rollback golang.org/x/text to commit 2120f96286c5897163172b6f5ac2fd0921777714 cc @LayneChris

@nicksnyder
Copy link

nicksnyder commented Apr 4, 2018

I was just bitten by this today.

Here is a minimal repro (does not repro on play.golang.org):

package main

import "fmt"
import "golang.org/x/text/language"

func main() {
	m := language.NewMatcher([]language.Tag{language.English})
	tag, i, conf := m.Match(language.AmericanEnglish)
	fmt.Println(tag, i, conf) // en-u-rg-uszzzz 0 Exact
}

@nicksnyder
Copy link

nicksnyder commented Apr 4, 2018

I notice that the documentation says this:

Note that Tag that is returned by Match and MatchString may differ from any of the supported languages, as it may contain carried over settings from the user tags. This may be inconvenient when your application has some additional locale-specific data for your supported languages. Match and MatchString both return the index of the matched supported tag to simplify associating such data with the matched tag.

It seems like the intended behavior is to save the input array passed to the matcher and then use the index returned from Match.

It is annoying that I can't query the input array from the matcher itself though.

Example workaround

package main

import "fmt"
import "golang.org/x/text/language"

func main() {
	supported := []language.Tag{language.English}
	m := language.NewMatcher(supported)
	_, i, conf := m.Match(language.AmericanEnglish)
	fmt.Println(supported[i], i, conf) // en 0 Exact
}

@dnx2k
Copy link

dnx2k commented May 30, 2018

My temporary workaround:

langTag, _, _ := languageMatcher.Match(tags...)
langTagString := langTag.String()[0:2]

stapelberg added a commit to stapelberg/debiman that referenced this issue Aug 9, 2018
See golang/go#24211 for more details

This hit us for a zh_CN manpage with precisely one "en" option: the returned tag
is en-u-rg-chzzzz, not en.
@chenjie4255
Copy link

I was a bit surprised by this when I meet this problem in my production environment ,is that just a mistake? I was really loving golang, but this is really frustrated me, my every beliefs on golang have beed break.

it not something about changing the exists codes or improving my project's unit test, it never be.

@schmurfy
Copy link

I just got bitten by this, why is it still open after so long ?

@dradtke
Copy link

dradtke commented May 4, 2020

This seems like the most straightforward, least-hacky way to get the language out of the tag:

supported := []language.Tag{language.English}
m := language.NewMatcher(supported)
tag, _, _ := m.Match(language.AmericanEnglish)
base, _, _ := tag.Raw()
languageCode := base.ISO3() // eng

playground link

@svenwltr
Copy link

Experiencing this on a patch update is quite irritating.

We use this as a workaround:

strings.Split(tag.String(), "-")[0]

@razor-1
Copy link

razor-1 commented Apr 4, 2021

@sebastien-rosset
Copy link

It seems like the intended behavior is to save the input array passed to the matcher and then use the index returned from Match.
It is annoying that I can't query the input array from the matcher itself though.

Related to this, the NewMatcher documentation about the returned index is a bit convoluted:

The index returned by the Match method corresponds to the index of the matched tag in t, but is augmented with the Unicode extension ('u')of the corresponding preferred tag. This allows user locale options to be passed transparently.

The beginning of the sentence is simple (i.e. it is the index in slice t). However using the word "augmented" to qualify the meaning of the index is confusing. What does it mean to augment "the index of the matched tag"?

@josharian
Copy link
Contributor

@ianlancetaylor looking at https://dev.golang.org/owners, @mpvl is still listed as owner for x/text, but it seems like he's not particularly engaged in it at the moment. Is there another owner we can ping?

(This issue is really easy to hit and inconvenient to work around.)

@dominikh
Copy link
Member

IMHO this is working as intended. The "new" behavior provides strictly more information than the old behavior. If you want to know which tag of your list of supported tags was matched, use the returned index. If you want to provide better internationalization support, use the returned tag, which can specify extra information using BCP 47 Extension U.

Based on the original example, with the supported tag en and the preferred tag en-US, the result en-u-rg-uszzzz selects en, but with the regional override us that may affect collation, number/date formats, default currency, etc. This additional information is useful to, for example, golang.org/x/text/number.

If you are using the tag as a key, for example for localization, your best option is probably to use both tags: the one obtained via the index for the lookup, and the returned one for services from x/text/*. This is reflected in the existing documentation:

Note that Tag that is returned by Match and MatchString may differ from any of the supported languages, as it may contain carried over settings from the user tags. This may be inconvenient when your application has some additional locale-specific data for your supported languages. Match and MatchString both return the index of the matched supported tag to simplify associating such data with the matched tag.

https://go.dev/blog/matchlang goes into some of the details of this, too.

I believe that, as an alternative to using the index, it is safe to use language.Tag.Raw to assemble a new key that drops the u extension, as long as none of your supported tags included their own extensions. I don't think that Match will modify the supported tags in any ways other than setting the u extension (but see the below note on canonicalization.)

In summary: the returned tag is an accurate description of the user's wishes, reconciled with the supported languages. This tag should be used with APIs that themselves understand BCP-47 tags. The returned index refers to the input and is useful if you need to use it as raw data, for example as a key into a map.

As an aside, a lot of the code that takes the returned tag and slices and dices it (as included in many of the comments in this issue) is prone to being incorrect due to canonicalization. language.MustParse("in") will return the tag id, because language.MustParse canonicalizes deprecated subtags. Which means that mixing string keys and Tag keys in your data structures will not behave correctly. Either use canonicalized tags as keys, or index into the original slice of strings (or, much less preferred, use language.Raw.MustParse instead.) See https://pkg.go.dev/golang.org/x/text/language#hdr-Canonicalization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests