proposal: runtime: expose current thread id or processor id #18590

Closed
funny-falcon opened this Issue Jan 10, 2017 · 68 comments

@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017

Often there is a need to reduce lock contention in algorithms.
The usual way to accomplish this is sharding, i.e. several separate locks protect non-intersecting parts of a greater state.
But then there is a need for a fast (and preferably non-locking) way to determine the shard id.
And it would be great if such a shard id further reduced lock contention.

  • The best way to do it is to expose the current cpu number. On Intel cpus this could be done with the CPUID instruction.
    CPUID returns an APIC-ID, and APIC-IDs are not consecutive, so the runtime should probably analyze the CPU topology to convert the APIC-ID to a consecutive cpu number. On the other hand, given that some architectures allow hot cpu swapping, it may be better to return the APIC-ID as is.
  • A simple way is to expose the thread id getg().m.id as runtime.Mid(). Because a program can contain a lot of threads (due to system calls or cgo calls), an additional fast random value is needed, so I propose to also expose runtime.fastrand() as runtime.Fastrand() (mixing it with a secure random key to hide the original fastrand value).

The "best way" could be implemented in some "golang/x/" package. The "simple way" needs at least a wrapper around "getg().m.id" implemented in "runtime", so that an external library can link against it (if there is a way to call a private "runtime" function from an external package).

@ianlancetaylor ianlancetaylor changed the title from Proposal: expose current thread id or even processor id to user-space to proposal: expose current thread id or even processor id to user-space Jan 10, 2017

@ianlancetaylor
Contributor

ianlancetaylor commented Jan 10, 2017
Historically we have always rejected goroutine-local information.

Your proposal is incomplete: you need to define exactly when the value is permitted to change, and you need to do so without overly constraining the runtime. Right now a g is permitted to move to a different m or p at any function call, which includes several implicit function calls inserted by the compiler for things like map operations. In order to address #10958 it is likely that in future releases a g will be permitted to move to a different m or p in any loop. It's not enough to define runtime.Mid without also providing a clear explanation of when it will return a different value.

The discussion in #17973 may be of interest.


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
The result of runtime.Mid() can change at any moment.
The value returned from Mid() doesn't eliminate the need for locking: since the number of threads can be much larger than the preallocated number of shards, locks are inevitable.
And since several threads can map to the same shard number, there is a need for runtime.Fastrand(): shards are bucketized, the bucket is chosen by Mid, and the shard within the bucket by Fastrand.
Mid() combined with Fastrand() is just a hint for reducing lock contention; it doesn't eliminate locks completely.
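A minimal sketch of this two-level selection, with midHint and fastrandHint as hypothetical stand-ins for the proposed runtime.Mid() and runtime.Fastrand() (both stubbed here with math/rand for illustration):

// Stand-ins for the proposed runtime.Mid() and runtime.Fastrand().
func midHint() int      { return rand.Int() }
func fastrandHint() int { return rand.Int() }

// pickShard chooses a bucket by the thread-id hint and a shard within
// the bucket by the fast random hint. A stale hint is harmless: the
// per-shard mutex is still acquired, so it only costs some contention.
func pickShard(numBuckets, shardsPerBucket int) int {
	bucket := midHint() % numBuckets
	within := fastrandHint() % shardsPerBucket
	return bucket*shardsPerBucket + within
}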


@mdempsky
Member

mdempsky commented Jan 10, 2017
And since several threads can map to the same shard number, there is a need for runtime.Fastrand(): shards are bucketized, the bucket is chosen by Mid, and the shard within the bucket by Fastrand.

If runtime.Mid changes (e.g., because the goroutine is now being run on a new OS thread), how do you make sure a goroutine continues to use the correct bucket?

If the answer is that each goroutine only calls runtime.Mid once, then why not just use math/rand.Int?


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
make sure a goroutine continues to use the correct bucket?

It is just a hint to select a bucket. The per-shard mutex will still be acquired, so there is no "fail" if runtime.Mid changes after the call. But most of the time it will not have changed, and this will help reduce mutex contention.

The problem with math/rand.Int is the single global mutex around that call. See my previous closed proposal: #18514

BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkMathRand-4             10000000               162 ns/op
BenchmarkMathRand-4             10000000               159 ns/op
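The benchmark source is not shown here, but the comparison is roughly of this shape (an assumed reconstruction: rand.Int63 on the package-level source takes a mutex on every call, while runtime.fastrand reads a few words of per-M state; the FastRand variant would call the proposed runtime.Fastrand() in the same parallel loop):

import (
	"math/rand"
	"testing"
)

func BenchmarkMathRand(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			rand.Int63() // global, mutex-protected source
		}
	})
}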
@quentinmit
Contributor

quentinmit commented Jan 10, 2017
Does syscall.Gettid not already give you a thread ID that roughly meets these criteria? Why does the runtime need to provide it?
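For reference, a sketch of that alternative (Linux-only; syscall.Gettid is a real system call on every invocation, which foreshadows the cost objection below):

import "syscall"

// shardByTid uses the OS thread id as a shard hint. Correct but not
// free: gettid has no vDSO fast path, so each call enters the kernel.
func shardByTid(numShards int) int {
	return syscall.Gettid() % numShards
}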


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
I bet syscall.Gettid is expensive. No?


@kostya-sh
Contributor

kostya-sh commented Jan 10, 2017

@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
Probably syscall.Gettid() is not so expensive on Linux, but it is much more expensive on other platforms (and probably doesn't exist on Windows).


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
@kostya-sh Here is an example of using runtime.Mid() and runtime.Fastrand(), based on your benchmark: https://gist.github.com/funny-falcon/d8adc39e777b79e56274a65abca29e78

The results are:

BenchmarkRandInt63_Global-4                     10000000               154 ns/op
BenchmarkRandInt63_Channel-4                    10000000               220 ns/op
BenchmarkRandInt63_Pool-4                       50000000                28.8 ns/op
BenchmarkSourceInt63_Pool-4                     50000000                34.9 ns/op
BenchmarkRandInt63_Source-4                     300000000                5.71 ns/op
BenchmarkRandInt63_ShardsGettid-4               50000000                29.2 ns/op
BenchmarkRandInt63_ShardsMid-4                  100000000               14.2 ns/op
BenchmarkRandInt63_ShardsFastrand-4             30000000                49.0 ns/op
BenchmarkRandInt63_ShardsMidFastrand-4          100000000               17.1 ns/op

It shows that using Mid() (and the combination with Fastrand()) can be almost twice as fast as a single Pool.


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
Looks like exposing getg().m.p.ptr().id gives even better results.


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
It looks like sync.Pool already uses the internally exposed getg().m.p.ptr().id (through runtime_procPin). So the effectiveness of sharding by P.id is known to the core team.

So why not expose it to regular Go users?

(Of course, runtime_procPin does more than simply return the current P.id.
But just returning P.id via runtime.Pid() would already be a great help.)


@bradfitz
Member

bradfitz commented Jan 10, 2017
So why not expose it to regular Go users?

Because Go considers simplicity a virtue. When there are trade-offs between simplicity and performance, Go chooses simplicity. You seem to have a different value trade-off, and it's obvious you prefer performance above all else. That's probably why many of your proposals to Go end up getting rejected. Please keep that in mind for future proposals.

We understand that things can always be faster, and we want things to be fast, but only if they are also simple and have clean APIs. We do not want to document and support and export all possible innards.


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
But such a thing as P.id is a very simple thing. Does anyone expect the core scheme of the Go runtime scheduler to change? Will P ever be removed?


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
That's probably why many of your proposals to Go end up getting rejected.

"just keep pushing" :-)


@bradfitz
Member

bradfitz commented Jan 10, 2017
Pushing the same ideas over and over is actually more likely to annoy people. Please try to understand the space of trade-offs that Go strives for.

But such a thing as P.id is a very simple thing.

That's not the point. We don't expose every "simple thing" just because it's simple. We then also have to document it, support it, test it, and keep it available forever. And we have to consider whether it's redundant with other APIs, whether it will be misused, etc.

Does anyone expect the core scheme of the Go runtime scheduler to change?

It has changed in the past.


@mdempsky
Member

mdempsky commented Jan 10, 2017
@funny-falcon My main impression from this thread is that it's not obvious the design space has been exhausted. E.g., you could already locklessly assign a unique ID to each worker yourself. Why have you ruled that out?
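For illustration, a minimal sketch of that suggestion: each worker goroutine claims an ID once at startup with an atomic counter and keeps it for its lifetime.

import "sync/atomic"

var nextWorkerID uint64 // package-level counter; IDs are dense and never reused

// newWorkerID returns a unique zero-based worker ID without any locks.
func newWorkerID() int {
	return int(atomic.AddUint64(&nextWorkerID, 1) - 1)
}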


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
@mdempsky the single lockless approach available in Go is atomic.AddUintxx, but it also doesn't scale with a lot of cpu cores because of cacheline contention.

Any "lock" or "lockless" technique will fail to scale without a hint about the current scheduling item. That is why one was used for sync.Pool, yes?

@jonhoo uses the CPUID instruction and parsing of '/proc/cpuinfo', but it works only on linux x86: https://github.com/jonhoo/drwmutex


@ianlancetaylor
Contributor

ianlancetaylor commented Jan 10, 2017
The Go runtime is happy to use per-P or per-M data in support of specific, simple, well-defined functionality that can, if necessary, be provided in other ways. If you look back at the history of sync.Pool, you will see a lot of discussion about exactly these issues: is sync.Pool simple enough, and well-defined enough, for people to use successfully?

To me your proposal seems very loosely defined. It's nearly indistinguishable from "give me a random number in the range 0 to GOMAXPROCS, and try to give me the same random number every time, but don't worry if you don't." That's not a well-defined API that people can use to write good code, because the expectations are not clear. Since the expectations are not clear, people will rely on the implementation, and that means that if we change the implementation the performance characteristics of people's code will change. That's very problematic for a feature that exists only for its effect on performance.


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
Pushing the same ideas over and over is actually more likely to annoy people.

I have pushed different ideas. Yes, I am a bit annoying. But someone still has to push strange ideas. At the least, there will be a documented rejection of the idea.

We don't expose every "simple thing" just because it's simple.

That is why this proposal is being discussed: to decide whether it is worthwhile or not.

It has changed in the past.

Given that the two most successful current green-thread runtimes (Go and Erlang) have a similar structure (at least, they both have per-core schedulers - P), another big change seems rather unlikely. I could be mistaken.


@davecheney
Contributor

davecheney commented Jan 10, 2017

@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
@davecheney My original use-case is still rpc. I just realized that runtime.Pid() is simply better.
The database for which I'm maintaining a connector may easily handle 1Mrps, and at this rate every possibility for improvement is worthwhile.

I have seen "per-cpu local info" requests at least several times during the discussion of this and the previous proposal (walking by the links), so it doesn't look like it is just my crazy need. And sharding by runtime.Pid() looks like a very close approximation, even if a Mutex is still involved (i.e. no pinning as for sync.Pool).


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
@ianlancetaylor

It's nearly indistinguishable from "give me a random number in the range 0 to GOMAXPROCS, and try to give me the same random number every time, but don't worry if you don't."

The API for "Pid()" is: "give me a number that approximates scheduling internals (a cpu number, or a scheduler P) such that two goroutines that call this method simultaneously (i.e. in parallel) will receive different numbers with very high probability; so that if I use it for choosing a mutex or other synchronization primitive, it will likely be uncontended. Preferably it should be bounded by an understandable small value (the number of cpu cores, or GOMAXPROCS)."


@davecheney
Contributor

davecheney commented Jan 10, 2017

@dsnet dsnet added this to the Proposal milestone Jan 10, 2017

@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
Currently I choose the shard to put a request into by atomic increment. But that means that even if a goroutine is never preempted, it will spread its requests across many shards. (A shard is a data structure that aggregates requests before the network writer writes them into the connection.)

With Pid(), cache locality would be much improved and lock contention decreased.

Another obvious use-case for Pid is a statistics collector: without such a hint, even atomic operations will fail to scale on a multi-core system. With such a hint, statistics will scale smoothly.
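A sketch of the workaround described above: round-robin shard selection via an atomic counter. This is exactly what spreads one goroutine's requests across shards even when it is never preempted.

import "sync/atomic"

var rr uint32 // shared round-robin cursor (itself a contended cache line)

// nextShard picks shards in round-robin order; unlike a scheduler
// hint, consecutive calls from one goroutine land on different shards.
func nextShard(numShards int) int {
	return int(atomic.AddUint32(&rr, 1)) % numShards
}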


@davecheney
Contributor

davecheney commented Jan 10, 2017
Another obvious use-case for Pid is a statistics collector: without such a hint, even atomic operations will fail to scale on a multi-core system. With such a hint, statistics will scale smoothly.

Please, can we talk about your specific problem, not other problems that other people may be able to solve with this feature.


@funny-falcon
Contributor

funny-falcon commented Jan 10, 2017
Let it be called not Pid but ProcHint():

// ProcHint gives a hint about the current scheduler item
// (for example, the cpu number, or the id of an internal scheduler).
// ProcHint is bounded: 0 <= ProcHint() < ProcHintMax().
// You may use it to reduce lock contention if you can allocate
// ProcHintMax independent mutexes or channels and
// then choose among them using ProcHint.
func ProcHint() int

// ProcHintMax gives the upper bound of ProcHint.
// Note: it can change at runtime, for example,
// if you change GOMAXPROCS or hot-swap a cpu.
func ProcHintMax() int
@minux
Member

minux commented Jan 11, 2017

@funny-falcon
Contributor

funny-falcon commented Jan 11, 2017
but hard to formally design and get accepted.

Hard doesn't mean impossible. It means more work has to be put in.

// ProcMaxHint returns the approximate number of scheduler items
// (for example, CPU cores or runtime scheduler Ps).
// It tends to be dense, i.e. schedulers are likely to have contiguous ids
// in the range 0 <= id < ProcMaxHint().
// Note: it can change at runtime, for example, if you change
// GOMAXPROCS, hot-swap a cpu, or call runtime.LockOSThread().
func ProcMaxHint() int

// ProcHint returns an approximate hint about the current scheduler id
// (for example, the cpu core number, or the id of an internal scheduler).
// ProcHint is likely to be in the range 0 <= id < ProcMaxHint() if ProcMaxHint
// has not changed (but note that ProcMaxHint can change at runtime).
// You may use it to reduce lock contention if you can allocate
// ProcMaxHint independent mutexes or channels and then choose among them using ProcHint.
//
//     type counterMap struct {
//         m sync.Mutex
//         c map[string]int
//         _pad [64]byte
//     }
//     type Stat []counterMap
//     func NewStat() Stat {
//         s := make(Stat, runtime.ProcMaxHint())
//         for i := range s {
//             s[i].c = make(map[string]int)
//         }
//         return s
//     }
//     func (s Stat) Increment(key string, val int) {
//         // since ProcMaxHint can change,
//         // ProcHint may be larger than len(s)
//         i := runtime.ProcHint() % len(s)
//         s[i].m.Lock()
//         s[i].c[key] += val
//         s[i].m.Unlock()
//     }
func ProcHint() int
@minux
Member

minux commented Jan 18, 2017

@funny-falcon
Contributor

funny-falcon commented Jan 18, 2017

  1. It is not really important in practice to have a hard guarantee that the hint is less than the max. It is mentioned in the functions' docs and clearly shown in the example that user code just has to be ready for the hint to be greater than a previously gathered maxhint.
    (In the same way, the code in sync.Pool just has to be ready for P.id to be greater than a previously gathered max.)
  2. There is no need to define "how dense". Users will use this API when lock contention is much more important to them than exact memory usage. It is enough for the user to know that it is "adequately dense".
    And by the way, there is a definition: "It tends to be dense, i.e. schedulers are likely to have contiguous ids in the range 0 <= id < ProcMaxHint()." "Tends... to have contiguous ids in the range" means it is usually quite close to 1, no? Probably it could be formulated more clearly; I just don't know English too well.
  3. There is no need for this guarantee. It is explicitly called a "hint" and an "approximate value", and the documentation explicitly says that it doesn't eliminate the need for synchronization. It just helps reduce lock contention, but doesn't guarantee to eliminate it.

@rsc rsc changed the title from proposal: expose current thread id or even processor id to user-space to proposal: runtime: expose current thread id or processor id Jan 23, 2017

@rsc
Contributor

rsc commented Jan 23, 2017

@aclements
Member

aclements commented Jan 23, 2017
Some of us on the Go team at Google have been kicking around a different idea for a year or so that I think would solve this without exposing implementation details. I don't think there's been any online discussion of this idea, so I'll sketch out a proposal and link it from here. (However, I'm writing a talk for Friday, so don't expect it right away. :)


@bcmills
Member

bcmills commented Jan 24, 2017
Just for kicks, I tried to see just how far I could push sync.Pool. It turns out the answer is "pretty far": a per-thread library using sync.Pool and runtime.SetFinalizer turns out hacky and a bit space-inefficient, but the API is pretty usable and the performance is better than I expected.

For the curious:
https://go-review.googlesource.com/#/c/35676/
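The linked change is more elaborate; below is a much-simplified reconstruction of the underlying trick (my sketch, not the code from the CL). sync.Pool's per-P caches tend to hand the same object back to the same P, giving approximate processor affinity, and a side slice keeps every shard reachable even after a GC clears the pool:

import "sync"

type shard struct {
	mu sync.Mutex
	n  int
}

var (
	allMu sync.Mutex
	all   []*shard // keeps every shard ever allocated reachable across GCs
	pool  = sync.Pool{New: func() interface{} {
		s := new(shard)
		allMu.Lock()
		all = append(all, s)
		allMu.Unlock()
		return s
	}}
)

// inc usually gets back the shard last used on the current P, so
// steady-state contention is low; a GC only resets the affinity.
func inc() {
	s := pool.Get().(*shard)
	s.mu.Lock()
	s.n++
	s.mu.Unlock()
	pool.Put(s)
}

// total visits every shard ever allocated, so no counts are lost.
func total() int {
	allMu.Lock()
	defer allMu.Unlock()
	t := 0
	for _, s := range all {
		s.mu.Lock()
		t += s.n
		s.mu.Unlock()
	}
	return t
}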


@funny-falcon
Contributor

funny-falcon commented Jan 25, 2017
@bcmills I thought about an "allocator style" api. I even started to write a comment with a definition of such an API. Unfortunately, I realized it would be tricky to use:

  • sometimes I need to lock all allocated "pieces";
  • to do that, I have to call Range() and lock the individual items one by one. But I also have to be sure no new items are being allocated at that moment (it is ok to return already allocated items);
  • this leads to a single mutex in or around the Allocator, which brings us back to lock contention.

So I decided not to send that comment (despite spending an hour writing it). I realized that an application/library knows more about its allocation and locking patterns than can be supposed ahead of time. So a "*Hint" api would be more convenient.

Probably I missed something. Maybe there is a super-lightweight synchronization trick around the number of allocated items.

For example, the Allocator may force preallocation for all current Ps, and not allocate new items without a call to "Reallocate()". (And probably there should be a method "WantsToReallocate()".)


@bcmills
Member

bcmills commented Jan 25, 2017
@funny-falcon

  • to do that, I have to call Range() and lock the individual items one by one. But I also have to be sure no new items are being allocated at that moment (it is ok to return already allocated items);
  • this leads to a single mutex in or around the Allocator, which brings us back to lock contention.

It does lead to a mutex in the Allocator, and that's fine in practice. You don't need to eliminate all contention: you only need to eliminate contention in the steady state.

In my experience, the calls that require Range tend to be infrequent enough that their impact on steady-state contention is minimal.

That said, you could use an approach similar to the one in change 33912 to avoid locking on the Range path entirely in the steady state, so that you only take the mutex when actually allocating and perhaps for a small number of calls thereafter (to reduce the overhead of reallocating the lock-free data structures).


@bcmills
Member

bcmills commented Jan 25, 2017
See also #8281


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
@bcmills, in my use case, I use the "shard number" as the low bits of the request number. So when an answer arrives, I can find the "shard" by those low bits.
But to use low bits, I have to know the maximum.
With my api, I would call ProcMaxHint() only once at startup, round it to the closest power of 2, and then mask ProcHint() to only the low bits:

   s.max = closestPower2(ProcMaxHint())
   s.shards = allocShards(s.max)
   /////////
   shardN = ProcHint() & (s.max-1)
   shard = s.shards[shardN]
   reqId = shard.nextReq()

It would be difficult to do this with the Allocator api.

Edit: fixed s/ProcMax/ProcHint/
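The closestPower2 helper is assumed above; one plausible reading (rounding down, so that the mask never exceeds ProcMaxHint) would be:

// closestPower2 rounds n down to the nearest power of two (minimum 1),
// so that ProcHint() & (closestPower2(n)-1) always stays in range.
func closestPower2(n int) int {
	p := 1
	for p*2 <= n {
		p *= 2
	}
	return p
}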


@bcmills
Member

bcmills commented Jan 26, 2017
@funny-falcon What is the underlying problem you're trying to solve with that sharding? There may be a clearer way to address it with an API other than the one you're proposing.


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
Note: I've fixed the misspelling in the previous example: s/ProcMax/ProcHint/

Fast RPC.
Sharding reduces lock contention for RPC when it works at 1Mrps (through a single connection using 4 cores).
A shard contains a byte buffer for serialized requests and a (custom) hash map of requests waiting for a response.
Currently I use AtomicAdd(1) for choosing a shard.
With ProcHint, lock contention would be lower and CPU-cache usage better. Probably it would help to scale past 4 cores and over 1Mrps.

A fixed number of shards is needed for response mapping.

Another possible usage is the collection of statistics. But that could be solved with the Allocator API.

Why is my api bad? It is low-level enough for efficient usage, but doesn't expose "internals". And it is useful enough for those who know what they want.


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
The Allocator could satisfy me if there were a way to construct it prefilled with a fixed size, and then get elements by id:

func NewFixedAllocator(alloc func(id int) interface{}) *FixedAllocator
func (a *FixedAllocator) Get() interface{}
// Size never changes.
func (a *FixedAllocator) Size() int
func (a *FixedAllocator) GetById(id int) interface{}

But still, a lower-level api would be much simpler and more useful.


@aclements
Member

aclements commented Jan 26, 2017
I've posted a rough "sharded value" proposal as #18802. The API operates along similar lines to @bcmills' Allocator (not surprising, since the sharded value proposal came out of discussions with him), but the terminology is different and I've posed some trade-offs in the API design space.


@aclements
Member

aclements commented Jan 26, 2017
@funny-falcon, perhaps I'm not understanding your use case, but why isn't a plain map the right answer for looking up requests awaiting response by shard?


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
@aclements, a map is much, much slower at this modification rate. It both locates and modifies values more slowly than a custom hash, and a single map has to be protected by a single mutex, which quickly becomes a bottleneck at a rate of 1Mrps.


@bcmills
Member

bcmills commented Jan 26, 2017
@funny-falcon

Sharding reduces lock contention for RPC when it works at 1Mrps (through a single connection using 4 cores).
A shard contains a byte buffer for serialized requests and a (custom) hash map of requests waiting for a response.
Currently I use AtomicAdd(1) for choosing a shard.

You're still describing an implementation, not a use-case. You're telling me that you want to calculate a shard, but not what you're using the sharding to actually do. (This looks like an XY problem.)

I'm assuming you're describing a server-side optimization. A simple server serving immutable data should generally read a request, compute the response, and send the response all on the same goroutine / thread, so how does the sharding come into play? I'm sure I'm missing something about your use-case, but I have no idea what.


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
No, I'm talking about a connector to an external database capable of serving > 1Mrps. And users actually want to send 1Mrps.

Each of the 1M requests per second is different. Each request has to be sent over the TCP connection and filled with its response.


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
I realized the fixed allocator api could be simpler:

// NewFixSharded preallocates all values by calling the alloc function, and returns a new FixSharded.
// FixSharded never changes its size, i.e. it never allocates a new value after construction.
func NewFixSharded(alloc func() interface{}) *FixSharded
// NewFixShardedN preallocates exactly n values by calling the alloc function, and returns a new FixSharded.
func NewFixShardedN(n int, alloc func() interface{}) *FixSharded
func (a *FixSharded) Get() interface{}
@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
Edit: FixSharded.Get is non-locking, so application code should provide synchronization around the value's usage.


@bcmills
Member

bcmills commented Jan 26, 2017
@funny-falcon

No, I'm talking about a connector to an external database capable of serving > 1Mrps.

I still don't know what that means. How about some pseudocode, or (better still) a link to some real code?


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
That is the current code:

It is capable of sending 1Mrps and waiting for 1Mrps responses through one connection using 4 cores. The database ( https://tarantool.org ) does not even use 100% of a single core at this request rate, so there is room for improvement.


@bcmills
Member

bcmills commented Jan 26, 2017

  • fetching the "request" from the hash map
  • the reader fetches/reads the response

Don't those steps introduce cross-thread contention anyway? (How are you keeping those calls colocated to the same thread as the putFuture call?)

At any rate: that's very complicated, heavily asynchronous code. I would argue that it's not idiomatic for a Go program, and we should be optimizing to make idiomatic code more efficient rather than speeding up complicated code.

A much simpler example (e.g. a self-contained benchmark with a similar request flow) would be helpful.


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
"Simple should be simple, and complex should be possible".

If you will not add possibilities for "complex cases", they still will be written in C++.


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017

  • fetching the "request" from the hash map
  • the reader fetches/reads the response

Don't those steps introduce cross-thread contention anyway?

Yes. But they are sharded by the incoming response id, and that path is much shorter than sending a request, because it doesn't involve deserialization of the response (while sending involves serialization of the request).

Because of the serialization into a reusable batch buffer, CPU alignment matters more when sending a request than when filling the request with response bytes.


@cznic
Contributor

cznic commented Jan 26, 2017
If you do not add possibilities for the "complex cases", they will still be written in C++.

A language trying too hard to be loved by everyone ends up loved by few.


@funny-falcon
Contributor

funny-falcon commented Jan 26, 2017
@cznic two simple low-level methods could not destroy love for Go. The impossibility of doing something efficiently can.

The number of those who will directly write code using the proposed methods will be small. And these people actually know what they want. For such people, low-level methods are usually better.

But the number of indirect users is much larger, because far more users will use the libraries written by the former group.


@cznic
Contributor

cznic commented Jan 26, 2017
The number of those who will directly write code using the proposed methods will be small.

I rest my case 😉


@bcmills
Member

bcmills commented Jan 26, 2017
@funny-falcon

Impossibility to do something efficiently can [destroy love for Go].

The example you've provided is so complex that it's very difficult to see whether your use-case is truly impossible to address efficiently (both with the standard library as it stands today and with the alternative proposals).

Which is to say: one of the great strengths of Go is that it is reasonably efficient for programs implemented in a direct, simple style. In many cases, attempts to "optimize" Go programs using idioms from other languages (e.g. futures and callbacks) end up fighting against optimizations in the runtime (escape analysis, goroutine scheduling, the transaction-oriented collector, etc.).

If you want to argue that something is necessary on efficiency grounds (and, let's be clear, the Go compiler and runtime have lots of room for improvement!), the first step is to describe the problem you're trying to solve as idiomatic code and show where the bottlenecks occur (e.g. using pprof profiles or similar).

Member

bcmills commented Jan 26, 2017

@funny-falcon

Impossibility to do something efficiently can [destroy love for Go].

The example you've provided is so complex that it's very difficult to see whether your use-case is truly impossible to address efficiently (both with the standard library as it stands today and with the alternative proposals).

Which is to say: one of the great strengths of Go is that it is reasonably efficient for programs implemented in a direct, simple style. In many cases, attempts to "optimize" Go programs using idioms from other languages (e.g. futures and callbacks) end up fighting against optimizations in the runtime (escape analysis, goroutine scheduling, the transaction-oriented collector, etc.).

If you want to argue that something is necessary on efficiency grounds (and, let's be clear, the Go compiler and runtime have lots of room for improvement!), the first step is to describe the problem you're trying to solve as idiomatic code and show where the bottlenecks occur (e.g. using pprof profiles or similar).

@funny-falcon

Contributor

funny-falcon commented Jan 26, 2017

About idiomatic Go code:

  • unsafe is not idiomatic, but many libraries have to use it for performance
  • sync and sync/atomic are not idiomatic, but many libraries have to use them for performance
  • more and more often, people use fasthttp instead of the idiomatic net/http. Why? Because it is fast.

People choose Go because it is simpler and faster. But it is faster not only because it is "compiled and statically typed", but also because it is possible to write fast libraries in it... and the use of sync, sync/atomic, and unsafe is a huge part of that possibility.

More ways to write fast libraries => more ways to write fast applications => more new users who want or need performance (hint: most Go users).

@funny-falcon

Contributor

funny-falcon commented Jan 27, 2017

@bcmills show me an RPC written in idiomatic Go that is capable of performing 1M requests per second. Then I will never write another silly proposal. I will just use/copy the code you show me.

@bcmills

Member

bcmills commented Jan 27, 2017

@funny-falcon There's no need to get hostile.

The point of asking for idiomatic code is not that it does, today, what you want it to do. The point is that it becomes much easier to see where and why it does not do what you want it to do, and the solution that enables your use-case is more likely to benefit the Go community as a whole rather than just one niche application.

And I disagree with what you wrote about unsafe, sync, and atomic. I would consider those — and even reflect — to be perfectly idiomatic in many situations. When I'm talking about "idiomatic Go code", I mean code with straightforward control flow, mostly synchronous function calls, mostly functional-style APIs, types with relatively few methods, methods with relatively few parameters, and so on.

The use of mostly synchronous functions is particularly important in that regard: synchronicity tends to produce good cache locality for server-style programs (with lots of hot "per-request" data, relatively little "background" data, and relatively few different kinds of tasks being performed), makes the behavior of the program clearer to the goroutine scheduler, and makes it a bit easier to thread parameters down through the call tree without introducing aliasing bugs (leading to, for example, the ability to allocate the "per thread" data at the top of a goroutine and pass it down through that goroutine rather than scattering map lookups throughout the code).
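As a concrete sketch of that last point (all names invented for the example), per-request state can be allocated once at the top of a goroutine and passed down the call tree as an ordinary parameter, with no shared map and no lock:

package main

import (
	"fmt"
	"strconv"
	"sync"
)

// perRequest holds hot, goroutine-local state for one request.
type perRequest struct {
	id  uint64
	buf []byte // reusable scratch buffer, never shared between goroutines
}

// handle allocates the per-goroutine state once and threads it downward.
func handle(id uint64) {
	st := &perRequest{id: id, buf: make([]byte, 0, 64)}
	process(st)
}

// process receives the state as a parameter instead of looking it up
// in some goroutine-id-keyed map.
func process(st *perRequest) {
	st.buf = append(st.buf[:0], "response for request "...)
	st.buf = strconv.AppendUint(st.buf, st.id, 10)
	fmt.Println(string(st.buf))
}

func main() {
	var wg sync.WaitGroup
	for i := uint64(0); i < 3; i++ {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			handle(id)
		}(i)
	}
	wg.Wait()
}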

@funny-falcon

Contributor

funny-falcon commented Jan 27, 2017

@bcmills I'm not even-tempered, but I'm not hostile.

@funny-falcon

Contributor

funny-falcon commented Jan 27, 2017

I'm closing this now. Maybe I'm mistaken somewhere.

@valyala

Contributor

valyala commented Jan 28, 2017

I'd just expose a single function in the runtime package:

// PID returns an id currently associated with the CPU core.
//
// The function may return distinct values for the same CPU core in the long run.
// There are no guarantees on the maximum and minimum id values.
func PID() uint32 {
	// p.id is an int32 inside the runtime, hence the conversion.
	return uint32(getg().m.p.ptr().id)
}

This functionality is at the core of the timers scalability CL: https://go-review.googlesource.com/#/c/34784/ .
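For context, a minimal sketch of how such a function might be used for a sharded counter. runtime.PID() is hypothetical (it was never merged), so it is stubbed out here to keep the sketch compilable. Because a goroutine can migrate to a different P at any time, the hint only reduces contention; the atomics keep the count correct regardless.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const cacheLine = 64

// shard is padded to a cache line to avoid false sharing between cores.
type shard struct {
	n uint64
	_ [cacheLine - 8]byte
}

// Counter is a sharded counter: increments usually land on the shard
// hinted by the current P, so they rarely contend on the same cache line.
type Counter struct {
	shards []shard
}

func NewCounter(numShards int) *Counter {
	return &Counter{shards: make([]shard, numShards)}
}

func (c *Counter) Inc() {
	i := int(runtimePID()) % len(c.shards)
	atomic.AddUint64(&c.shards[i].n, 1)
}

// Value sums all shards; it may race with concurrent increments,
// which is fine for a statistics counter.
func (c *Counter) Value() uint64 {
	var total uint64
	for i := range c.shards {
		total += atomic.LoadUint64(&c.shards[i].n)
	}
	return total
}

// runtimePID stands in for the proposed runtime.PID(), which does not
// exist; with this stub, all increments fall on shard 0.
func runtimePID() uint32 { return 0 }

func main() {
	c := NewCounter(8)
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Inc()
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.Value()) // 4000
}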

@funny-falcon

Contributor

funny-falcon commented Jan 28, 2017

@valyala, you did it! Thank you for the timers!

I'd just expose a single function in the runtime package:

That is what I named ProcHint, because the core team doesn't want it to be named like a hook into the internals.

I agree, a single function could already be enough.

@funny-falcon funny-falcon reopened this Jan 28, 2017

@funny-falcon

Contributor

funny-falcon commented Jan 28, 2017

@bradfitz @bcmills the technique used by @valyala to improve timers at https://go-review.googlesource.com/#/c/34784/ shows exactly why runtime.ProcHint() (or Pid()) is useful for high-performance multi-core programming.

I agree with @valyala that ProcMaxHint is most likely not needed. It is enough to have a fixed number of "shards".

@rsc

Contributor

rsc commented Feb 13, 2017

The valid uses of this function are far outweighed by the buggy invalid uses. There are of course uses of per-P data inside the runtime. That's not an argument for exposing P numbers outside the runtime. I would rather add a few higher-level APIs that are easier to use correctly, like the ShardedValue proposal, than add one low-level API that is very hard to use correctly.

@rsc rsc closed this Feb 13, 2017

@rsc rsc added Proposal-Declined and removed Proposal labels Feb 13, 2017

@happilymarrieddad

happilymarrieddad commented Feb 1, 2018

Hey guys,

syscall.Gettid() works pretty well (on Linux):

package main

import (
	"log"
	"runtime"
	"syscall"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU()) // the default since Go 1.5

	num := 20
	sem := make(chan int, num)

	for i := 0; i < num; i++ {
		go func() {
			// Gettid returns the id of the OS thread this goroutine
			// happens to be running on (Linux-only).
			log.Println(syscall.Gettid())

			sem <- 0
		}()
	}

	// Wait for all goroutines to finish.
	for i := 0; i < num; i++ {
		<-sem
	}
}


@funny-falcon

Contributor

funny-falcon commented Feb 1, 2018

@happilymarrieddad, no, it doesn't. There are only GOMAXPROCS schedulers that can be running at once (unless one uses runtime.LockOSThread), but there can be thousands of native threads (if some syscall or C library blocks for a long time).
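A minimal Linux-only sketch (not from the thread) illustrating this point: even with GOMAXPROCS=2, goroutines pinned with runtime.LockOSThread each get their own OS thread, so syscall.Gettid() observes far more distinct ids than there are Ps — thread ids cannot serve as a bounded shard index.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"syscall"
)

func main() {
	runtime.GOMAXPROCS(2)

	const n = 50
	var (
		mu      sync.Mutex
		tids    = map[int]struct{}{}
		release = make(chan struct{})
		wg      sync.WaitGroup
	)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runtime.LockOSThread() // dedicate an OS thread to this goroutine
			mu.Lock()
			tids[syscall.Gettid()] = struct{}{}
			mu.Unlock()
			<-release // hold the thread alive until everyone has reported
		}()
	}

	// Spin until every goroutine has recorded its thread id.
	for {
		mu.Lock()
		done := len(tids) == n
		mu.Unlock()
		if done {
			break
		}
		runtime.Gosched()
	}
	close(release)
	wg.Wait()

	// Prints 50 distinct ids despite GOMAXPROCS=2.
	fmt.Printf("GOMAXPROCS=2, distinct thread ids: %d\n", len(tids))
}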

@golang golang locked as resolved and limited conversation to collaborators Feb 1, 2018
