New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: go get fails on non-ASCII github packages #18660

Closed
mcandre opened this Issue Jan 14, 2017 · 14 comments

Comments

Projects
None yet
6 participants
@mcandre

mcandre commented Jan 14, 2017

$ go get -u github.com/mcandre/toys/go/испытание
package github.com/mcandre/toys/go/испытание: invalid github.com/ import path "github.com/mcandre/toys/go/испытание"

In vcs.go:

var vcsPaths = []*vcsPath{
        // Github                                                                                                                               
        {
                prefix: "github.com/",
                re:     `^(?P<root>github\.com/[A-Za-z0-9_.\-]+/[A-Za-z0-9_.\-]+)(/[A-Za-z0-9_.\-]+)*$`,
                vcs:    "git",
                repo:   "https://{root}",
                check:  noVCSSuffix,
        },

Could the regex be made more flexible?

@bradfitz bradfitz changed the title from go get fails on Cyrillic package names to cmd/go: go get fails on non-ASCII github packages Jan 14, 2017

@dmitshur

This comment has been minimized.

Show comment
Hide comment
@dmitshur

dmitshur Jan 18, 2017

Member

I'll add some additional information about this bug below.

This affects golang.org/x/tools/go/vcs package in addition to cmd/go.

I've made a smaller reproduce repo (@rsc is welcome to fork it to his collection if desired; consider my repo temporary):

https://github.com/shurcooL-test/go-get-issue-unicode

Note that this issue affects only import paths that are statically known (GitHub, Bitbucket, etc.). It's not an issue for vanity import paths.

E.g., here is an alternative vanity import path that works without issues:

dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание

You can go get, go install, go test, go doc it without issues:

$ go get -u dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
$ go install dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
$ go test dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
ok  	dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание	0.014s
$ go doc dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
package испытание // import "dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание"

Package испытание demonstrates Unicode capabilities in Go source code.

type Эксперимент struct{ ... }
    func Испытание() Эксперимент

You can also call vcs.RepoRootForImportPath("dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание", false) successfully on the vanity import path:

$ goexec 'vcs.RepoRootForImportPath("dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание", false)'
(*vcs.RepoRoot)(&vcs.RepoRoot{
	VCS: (*vcs.Cmd)(&vcs.Cmd{
		Name:        (string)("Git"),
		Cmd:         (string)("git"),
		CreateCmd:   (string)("clone {repo} {dir}"),
		DownloadCmd: (string)("pull --ff-only"),
		TagCmd: ([]vcs.TagCmd)([]vcs.TagCmd{
			(vcs.TagCmd)(vcs.TagCmd{
				Cmd:     (string)("show-ref"),
				Pattern: (string)("(?:tags|origin)/(\\S+)$"),
			}),
		}),
		TagLookupCmd: ([]vcs.TagCmd)([]vcs.TagCmd{
			(vcs.TagCmd)(vcs.TagCmd{
				Cmd:     (string)("show-ref tags/{tag} origin/{tag}"),
				Pattern: (string)("((?:tags|origin)/\\S+)$"),
			}),
		}),
		TagSyncCmd:     (string)("checkout {tag}"),
		TagSyncDefault: (string)("checkout master"),
		LogCmd:         (string)(""),
		Scheme: ([]string)([]string{
			(string)("git"),
			(string)("https"),
			(string)("http"),
			(string)("git+ssh"),
		}),
		PingCmd: (string)("ls-remote {scheme}://{repo}"),
	}),
	Repo: (string)("https://github.com/shurcooL-test/go-get-issue-unicode"),
	Root: (string)("dmitri.shuralyov.com/temp/go-get-issue-unicode"),
})
(interface{})(nil)

However, both go get and vcs.RepoRootForImportPath() fail on the "github.com/shurcooL-test/go-get-issue-unicode/испытание" import path, because the regexps in vcs.go only allow ASCII letters, not unicode ones.

Member

dmitshur commented Jan 18, 2017

I'll add some additional information about this bug below.

This affects golang.org/x/tools/go/vcs package in addition to cmd/go.

I've made a smaller reproduce repo (@rsc is welcome to fork it to his collection if desired; consider my repo temporary):

https://github.com/shurcooL-test/go-get-issue-unicode

Note that this issue affects only import paths that are statically known (GitHub, Bitbucket, etc.). It's not an issue for vanity import paths.

E.g., here is an alternative vanity import path that works without issues:

dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание

You can go get, go install, go test, go doc it without issues:

$ go get -u dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
$ go install dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
$ go test dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
ok  	dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание	0.014s
$ go doc dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
package испытание // import "dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание"

Package испытание demonstrates Unicode capabilities in Go source code.

type Эксперимент struct{ ... }
    func Испытание() Эксперимент

You can also call vcs.RepoRootForImportPath("dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание", false) successfully on the vanity import path:

$ goexec 'vcs.RepoRootForImportPath("dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание", false)'
(*vcs.RepoRoot)(&vcs.RepoRoot{
	VCS: (*vcs.Cmd)(&vcs.Cmd{
		Name:        (string)("Git"),
		Cmd:         (string)("git"),
		CreateCmd:   (string)("clone {repo} {dir}"),
		DownloadCmd: (string)("pull --ff-only"),
		TagCmd: ([]vcs.TagCmd)([]vcs.TagCmd{
			(vcs.TagCmd)(vcs.TagCmd{
				Cmd:     (string)("show-ref"),
				Pattern: (string)("(?:tags|origin)/(\\S+)$"),
			}),
		}),
		TagLookupCmd: ([]vcs.TagCmd)([]vcs.TagCmd{
			(vcs.TagCmd)(vcs.TagCmd{
				Cmd:     (string)("show-ref tags/{tag} origin/{tag}"),
				Pattern: (string)("((?:tags|origin)/\\S+)$"),
			}),
		}),
		TagSyncCmd:     (string)("checkout {tag}"),
		TagSyncDefault: (string)("checkout master"),
		LogCmd:         (string)(""),
		Scheme: ([]string)([]string{
			(string)("git"),
			(string)("https"),
			(string)("http"),
			(string)("git+ssh"),
		}),
		PingCmd: (string)("ls-remote {scheme}://{repo}"),
	}),
	Repo: (string)("https://github.com/shurcooL-test/go-get-issue-unicode"),
	Root: (string)("dmitri.shuralyov.com/temp/go-get-issue-unicode"),
})
(interface{})(nil)

However, both go get and vcs.RepoRootForImportPath() fail on the "github.com/shurcooL-test/go-get-issue-unicode/испытание" import path, because the regexps in vcs.go only allow ASCII letters, not unicode ones.

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Jan 18, 2017

Member

It's not easy to expand character range of unicode. go's identifier doesn't accept some unicode classes. For example, IDEOGRAPHIC SPACE is.

Member

mattn commented Jan 18, 2017

It's not easy to expand character range of unicode. go's identifier doesn't accept some unicode classes. For example, IDEOGRAPHIC SPACE is.

@bradfitz bradfitz added this to the Go1.9 milestone Jan 19, 2017

@bradfitz

This comment has been minimized.

Show comment
Hide comment
@bradfitz

bradfitz Jan 19, 2017

Member

@mattn, that's not a blocker for this bug. We can match on [^/]+.

Member

bradfitz commented Jan 19, 2017

@mattn, that's not a blocker for this bug. We can match on [^/]+.

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Jan 19, 2017

Member

@bradfitz please try this.

https://github.com/mattn/misc/blob/master/foo.go

the first import github.com/mattn/misc/世界 is okay to import. But second one github.com/mattn/misc/世 界 occur invalid import path. And I guess [^/]+ should not accept .., :, ;, &, % or something characters which can make another problems for each OSs.

Member

mattn commented Jan 19, 2017

@bradfitz please try this.

https://github.com/mattn/misc/blob/master/foo.go

the first import github.com/mattn/misc/世界 is okay to import. But second one github.com/mattn/misc/世 界 occur invalid import path. And I guess [^/]+ should not accept .., :, ;, &, % or something characters which can make another problems for each OSs.

@mattn

This comment has been minimized.

Show comment
Hide comment
@mattn

mattn Jan 19, 2017

Member

If you think go get doesn't have to matter the letters, and the error should be occured expectedly, i don't have strong opinion. So please fixes this.

Member

mattn commented Jan 19, 2017

If you think go get doesn't have to matter the letters, and the error should be occured expectedly, i don't have strong opinion. So please fixes this.

dmitshur added a commit to shurcooL/home that referenced this issue Feb 5, 2017

Add temporary vanity import path.
For golang/go#18660 and golang/gddo#468.

Will be deleted after those issues are resolved.
@gopherbot

This comment has been minimized.

Show comment
Hide comment

CL https://golang.org/cl/41750 mentions this issue.

gopherbot pushed a commit to golang/gddo that referenced this issue Apr 25, 2017

gosrc: Allow Unicode letters in import paths.
Background

The following is a valid vanity import path that works without issues:

	dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание

You can go get, go install, go test, go doc it without issues:

	$ go get -u dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
	$ go install dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
	$ go test dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
	ok  	dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание	0.014s
	$ go doc dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание
	package испытание // import "dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание"

	Package испытание demonstrates Unicode capabilities in Go source code.

	type Эксперимент struct{ ... }
	    func Испытание() Эксперимент

You can also call vcs.RepoRootForImportPath("dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание", false)
(from golang.org/x/tools/go/vcs) successfully on the vanity import path:

	$ goexec 'vcs.RepoRootForImportPath("dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание", false)'
	(*vcs.RepoRoot)(&vcs.RepoRoot{
		VCS: (*vcs.Cmd)(&vcs.Cmd{
			Name:        (string)("Git"),
			Cmd:         (string)("git"),
			CreateCmd:   (string)("clone {repo} {dir}"),
			DownloadCmd: (string)("pull --ff-only"),
			TagCmd: ([]vcs.TagCmd)([]vcs.TagCmd{
				(vcs.TagCmd)(vcs.TagCmd{
					Cmd:     (string)("show-ref"),
					Pattern: (string)("(?:tags|origin)/(\\S+)$"),
				}),
			}),
			TagLookupCmd: ([]vcs.TagCmd)([]vcs.TagCmd{
				(vcs.TagCmd)(vcs.TagCmd{
					Cmd:     (string)("show-ref tags/{tag} origin/{tag}"),
					Pattern: (string)("((?:tags|origin)/\\S+)$"),
				}),
			}),
			TagSyncCmd:     (string)("checkout {tag}"),
			TagSyncDefault: (string)("checkout master"),
			LogCmd:         (string)(""),
			Scheme: ([]string)([]string{
				(string)("git"),
				(string)("https"),
				(string)("http"),
				(string)("git+ssh"),
			}),
			PingCmd: (string)("ls-remote {scheme}://{repo}"),
		}),
		Repo: (string)("https://github.com/shurcooL-test/go-get-issue-unicode"),
		Root: (string)("dmitri.shuralyov.com/temp/go-get-issue-unicode"),
	})
	(interface{})(nil)

However, gosrc.IsValidRemotePath incorrectly reports false for the
"dmitri.shuralyov.com/temp/go-get-issue-unicode/испытание" import path.

Fix

gosrc.IsValidRemotePath reports false for such import paths because
validPathElement regexp only allows ASCII letters A-Za-z, not Unicode
ones.

This change fixes that by using a predefined character class, the
Unicode character property class \p{L} that describes the Unicode
characters that are letters.

Additionally, fix an issue where a query parameter value was not
correctly escaped when constructing a URL.

Fixes golang/gddo#468.
Updates golang/go#18660.

References

-	https://stackoverflow.com/questions/3617797/regex-to-match-only-letters
-	https://stackoverflow.com/questions/6005459/is-there-a-way-to-match-any-unicode-non-alphabetic-character
-	https://www.regular-expressions.info/unicode.html#prop

Change-Id: I48680749d827cbc63fefca2c21e9790009f20746
Reviewed-on: https://go-review.googlesource.com/41750
Reviewed-by: Chris Broadfoot <cbro@golang.org>
Reviewed-by: Tuo Shan <shantuo@google.com>
Reviewed-by: Francesc Campoy Flores <campoy@golang.org>
@dmitshur

This comment has been minimized.

Show comment
Hide comment
@dmitshur

dmitshur Apr 26, 2017

Member

I think I can apply a similar solution to fix this as I did for golang/gddo#468:

gosrc.IsValidRemotePath reports false for such import paths because
validPathElement regexp only allows ASCII letters A-Za-z, not Unicode
ones.

This change fixes that by using a predefined character class, the
Unicode character property class \p{L} that describes the Unicode
characters that are letters.

References

See commit golang/gddo@cdd60fa for full details.

The fix for cmd/go then becomes:

diff --git a/src/cmd/go/internal/get/vcs.go b/src/cmd/go/internal/get/vcs.go
index 7439cc8649..c72d52bc1b 100644
--- a/src/cmd/go/internal/get/vcs.go
+++ b/src/cmd/go/internal/get/vcs.go
@@ -851,7 +851,7 @@ var vcsPaths = []*vcsPath{
 	// Github
 	{
 		prefix: "github.com/",
-		re:     `^(?P<root>github\.com/[A-Za-z0-9_.\-]+/[A-Za-z0-9_.\-]+)(/[A-Za-z0-9_.\-]+)*$`,
+		re:     `^(?P<root>github\.com/[A-Za-z0-9_.\-]+/[A-Za-z0-9_.\-]+)(/[\p{L}0-9_.\-]+)*$`,
 		vcs:    "git",
 		repo:   "https://{root}",
 		check:  noVCSSuffix,

I tested it, and it works on the original bug report. It doesn't seem neccessary to allow Unicode for GitHub username and repository name, because from what I can tell, GitHub doesn't allow those:

image

image

So a Unicode character can only come up within a directory name of a GitHub repository.

I'd rather make the smallest possible change that'll resolve this issue (unless I'm advised otherwise).

If that sounds reasonable, I can send a CL that'll resolve this.

Member

dmitshur commented Apr 26, 2017

I think I can apply a similar solution to fix this as I did for golang/gddo#468:

gosrc.IsValidRemotePath reports false for such import paths because
validPathElement regexp only allows ASCII letters A-Za-z, not Unicode
ones.

This change fixes that by using a predefined character class, the
Unicode character property class \p{L} that describes the Unicode
characters that are letters.

References

See commit golang/gddo@cdd60fa for full details.

The fix for cmd/go then becomes:

diff --git a/src/cmd/go/internal/get/vcs.go b/src/cmd/go/internal/get/vcs.go
index 7439cc8649..c72d52bc1b 100644
--- a/src/cmd/go/internal/get/vcs.go
+++ b/src/cmd/go/internal/get/vcs.go
@@ -851,7 +851,7 @@ var vcsPaths = []*vcsPath{
 	// Github
 	{
 		prefix: "github.com/",
-		re:     `^(?P<root>github\.com/[A-Za-z0-9_.\-]+/[A-Za-z0-9_.\-]+)(/[A-Za-z0-9_.\-]+)*$`,
+		re:     `^(?P<root>github\.com/[A-Za-z0-9_.\-]+/[A-Za-z0-9_.\-]+)(/[\p{L}0-9_.\-]+)*$`,
 		vcs:    "git",
 		repo:   "https://{root}",
 		check:  noVCSSuffix,

I tested it, and it works on the original bug report. It doesn't seem neccessary to allow Unicode for GitHub username and repository name, because from what I can tell, GitHub doesn't allow those:

image

image

So a Unicode character can only come up within a directory name of a GitHub repository.

I'd rather make the smallest possible change that'll resolve this issue (unless I'm advised otherwise).

If that sounds reasonable, I can send a CL that'll resolve this.

@josharian

This comment has been minimized.

Show comment
Hide comment
@josharian

josharian Apr 26, 2017

Contributor

Possibly relevant: #20115

Contributor

josharian commented Apr 26, 2017

Possibly relevant: #20115

@bradfitz

This comment has been minimized.

Show comment
Hide comment
Member

bradfitz commented Apr 26, 2017

@shurcooL, SGTM.

@dmitshur

This comment has been minimized.

Show comment
Hide comment
@dmitshur

dmitshur Apr 26, 2017

Member

#20115 provides further support for the choice of not allowing Unicode for GitHub usernames and repository names needlessly. That way, only the directory names will be susceptible, but not the actual repository, which helps.

@bradfitz Ok, I'll send a CL. I'll work on adding/updating a test case for it.

Would it be possible/good idea for @rsc to create a copy of (or fork) my https://github.com/shurcooL-test/go-get-issue-unicode repository, so I could rely on it to be there and use it in my tests? I ask because most go get issue repos are already there. @rsc can rename the repository to go-get-issue-18660.

Member

dmitshur commented Apr 26, 2017

#20115 provides further support for the choice of not allowing Unicode for GitHub usernames and repository names needlessly. That way, only the directory names will be susceptible, but not the actual repository, which helps.

@bradfitz Ok, I'll send a CL. I'll work on adding/updating a test case for it.

Would it be possible/good idea for @rsc to create a copy of (or fork) my https://github.com/shurcooL-test/go-get-issue-unicode repository, so I could rely on it to be there and use it in my tests? I ask because most go get issue repos are already there. @rsc can rename the repository to go-get-issue-18660.

@dmitshur

This comment has been minimized.

Show comment
Hide comment
@dmitshur

dmitshur Apr 26, 2017

Member

Although, I'm not sure if this really needs a live go get test, I think some offline test cases for vcsPath path regexp matching could be sufficient.

Member

dmitshur commented Apr 26, 2017

Although, I'm not sure if this really needs a live go get test, I think some offline test cases for vcsPath path regexp matching could be sufficient.

@bradfitz

This comment has been minimized.

Show comment
Hide comment
@bradfitz

bradfitz Apr 26, 2017

Member

Self-contained non-network test only please. :)

Member

bradfitz commented Apr 26, 2017

Self-contained non-network test only please. :)

@gopherbot

This comment has been minimized.

Show comment
Hide comment

CL https://golang.org/cl/41822 mentions this issue.

@gopherbot

This comment has been minimized.

Show comment
Hide comment

CL https://golang.org/cl/42017 mentions this issue.

gopherbot pushed a commit to golang/tools that referenced this issue Apr 28, 2017

go/vcs: allow go get on github.com/ import paths with Unicode letters
Manually apply same change as CL 41822 did for cmd/go/internal/get,
but for golang.org/x/tools/go/vcs, to help keep them in sync.

Updates golang/go#18660.
Helps golang/go#11490.

Change-Id: I6c7759c073583dea771bc438b70f8c2eb7b5ebfb
Reviewed-on: https://go-review.googlesource.com/42017
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>

@gopherbot gopherbot closed this in 6511931 Apr 28, 2017

@golang golang locked and limited conversation to collaborators Apr 28, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.