Socket connection errors #21

Closed
inancgumus opened this issue Aug 1, 2017 · 8 comments

@inancgumus

When sess.SetPoolLimit(50) is not used, many errors occur while mgo is under stress, e.g. 10,000 concurrent connections.

When I limit the pool, the errors go away.

I've created test-case source code for this problem below, so you can test it on your own machine easily.

Do you have any suggestions about this behaviour?

package main

import (
	"fmt"
	"time"

	mgo "github.com/globalsign/mgo"
	"github.com/globalsign/mgo/bson"
)

// TODO: put some records into db first:
//
// use testapi
// db.competitions.insert([
//   {game_id: 1, game_name: "foo"},
//   {game_id: 2, game_name: "bar"},
//   {game_id: 3, game_name: "jazz"}
// ])

// NOTE: you might want to increase this depending on your machine's power
//       mine is:
//         MacBook (Retina, 12-inch, Early 2015)
//         1.2 GHz Intel Core M
//         8 GB 1600 MHz DDR3
const ops = 10000

type m bson.M

func main() {
	sess, err := mgo.DialWithTimeout("localhost", time.Second)
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	// NOTE: without this setting there are many errors
	//       see the end of the file
	// setting pool limit prevents most of the timeouts
	// sess.SetPoolLimit(50)

	// sess.SetSocketTimeout(60 * time.Second)
	sess.SetMode(mgo.Monotonic, true)

	time.Sleep(time.Second)

	done := make(chan bool, ops)

	for i := 0; i < ops; i++ {
		go func() {
			defer func() { done <- true }()

			result, err := fetchSomething(sess)
			if err != nil {
				fmt.Printf("ERR: %s\n", err)
			}
			fmt.Printf("RESULT: %+v\n", result)
		}()
	}

	for i := 0; i < ops; i++ {
		<-done
	}
}

func fetchSomething(sess *mgo.Session) ([]m, error) {
	s := sess.Copy()
	defer s.Close()

	result := []m{}

	group := m{
		"$group": m{
			"_id": m{
				"id":   "$game_id",
				"name": "$game_name",
			},
		},
	}

	project := m{
		"$project": m{
			"_id":  "$_id.id",
			"name": "$_id.name",
		},
	}

	sort := m{
		"$sort": m{
			"_id": 1,
		},
	}

	err := col(s, "competitions").Pipe([]m{group, project, sort}).All(&result)
	if err != nil {
		return nil, err
	}
	return result, nil
}

// col is a helper for selecting a collection
func col(sess *mgo.Session, name string) *mgo.Collection {
	return sess.DB("testapi").C(name)
}

/*


ERRORS WITHOUT sess.SetPoolLimit(50)


$ go run socket_error.go
RESULT: [map[name:foo _id:1] map[_id:2 name:bar] map[_id:3 name:jazz]]
ERR: read tcp 127.0.0.1:52918->127.0.0.1:27017: read: connection reset by peer
ERR: write tcp 127.0.0.1:52084->127.0.0.1:27017: write: broken pipe
RESULT: []
RESULT: []
ERR: write tcp 127.0.0.1:53627->127.0.0.1:27017: write: broken pipe
ERR: write tcp 127.0.0.1:53627->127.0.0.1:27017: write: broken pipe
RESULT: []
ERR: write tcp 127.0.0.1:53627->127.0.0.1:27017: write: broken pipe
RESULT: []
ERR: write tcp 127.0.0.1:53627->127.0.0.1:27017: write: broken pipe
RESULT: []
ERR: write tcp 127.0.0.1:53627->127.0.0.1:27017: write: broken pipe
RESULT: []
ERR: write tcp 127.0.0.1:53627->127.0.0.1:27017: write: broken pipe
RESULT: []
RESULT: []
ERR: read tcp 127.0.0.1:53627->127.0.0.1:27017: read: connection reset by peer
RESULT: []
ERR: resetNonce: write tcp 127.0.0.1:53625->127.0.0.1:27017: write: broken pipe
RESULT: []
RESULT: [map[name:foo _id:1] map[_id:2 name:bar] map[_id:3 name:jazz]]
ERR: resetNonce: write tcp 127.0.0.1:54591->127.0.0.1:27017: write: broken pipe
RESULT: []
...
...
*/
@domodwyer

domodwyer commented Aug 2, 2017

Hi @inancgumus,

Have you tuned the mongod connection limits (either in the config or with --maxConns) and the ulimits? Can you paste the output of db.serverStatus().connections and ulimit -a on your test machine?

SetPoolLimit() isn't really supposed to be used as a concurrency limiter, and I'd argue it's not really the job of mgo to limit concurrency anyway - we use a concurrency throttler in our application code to limit the number of simultaneous calls to mongo and avoid SetPoolLimit() entirely. It works well, and I suggest you try the same.
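
For example, a minimal sketch of such a throttler (a generic Go pattern, nothing mgo-specific; throttle, the limit of 50, and the fetchSomething call are illustrative placeholders from your test case):

package main

import (
	"fmt"
	"sync"
)

// throttle runs fn n times across goroutines, but allows at most
// limit invocations to be in flight at any one time.
func throttle(n, limit int, fn func(i int)) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			fn(i)
		}(i)
	}
	wg.Wait()
}

func main() {
	throttle(10000, 50, func(i int) {
		// call e.g. fetchSomething(sess) from the test case here
		fmt.Println("op", i)
	})
}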

@inancgumus
Author

inancgumus commented Aug 4, 2017

Hi @domodwyer,

I didn't try --maxConns, but I set the ulimits to unlimited. Yeah, I had read that view on SetPoolLimit and I agree with it. For now, I'm just exploring what causes this and how to prevent it, other than by restructuring my code.

Did you try the above code on your machine? That way you might see the problem directly.


These are my config details:

ulimit -a:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4864
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited
> db.version()
3.4.4
> db.serverStatus().connections
{ "current" : 1, "available" : 399999, "totalCreated" : 18584 }
> db.runCommand( { getCmdLineOpts: 1 } )
{
	"argv" : [
		"/usr/local/opt/mongodb/bin/mongod",
		"--config",
		"/usr/local/etc/mongod.conf"
	],
	"parsed" : {
		"config" : "/usr/local/etc/mongod.conf",
		"net" : {
			"bindIp" : "127.0.0.1"
		},
		"operationProfiling" : {
			"mode" : "all"
		},
		"storage" : {
			"dbPath" : "/usr/local/var/mongodb"
		},
		"systemLog" : {
			"component" : {
				"query" : {
					"verbosity" : 5
				}
			},
			"destination" : "file",
			"logAppend" : true,
			"path" : "/usr/local/var/log/mongodb/mongo.log",
			"verbosity" : 5
		}
	},
	"ok" : 1
}

/usr/local/etc/mongod.conf:

systemLog:
  destination: file
  path: /usr/local/var/log/mongodb/mongo.log
  logAppend: true
storage:
  dbPath: /usr/local/var/mongodb
net:
  bindIp: 127.0.0.1
operationProfiling:
  mode: all
systemLog:
  verbosity: 5
  component:
    query:
      verbosity: 5

And these are the error messages in mongo log:

4079:2017-08-04T15:39:21.249+0300 D NETWORK  [conn20] SocketException: remote: 127.0.0.1:64246 error: 9001 socket exception [CLOSED] server [127.0.0.1:64246]
4080:2017-08-04T15:39:21.249+0300 D NETWORK  [conn19] SocketException: remote: 127.0.0.1:64243 error: 9001 socket exception [CLOSED] server [127.0.0.1:64243]
4081:2017-08-04T15:39:21.249+0300 D NETWORK  [conn24] SocketException: remote: 127.0.0.1:64245 error: 9001 socket exception [CLOSED] server [127.0.0.1:64245]
4082:2017-08-04T15:39:21.249+0300 D NETWORK  [conn23] SocketException: remote: 127.0.0.1:64244 error: 9001 socket exception [CLOSED] server [127.0.0.1:64244]
4083:2017-08-04T15:39:21.249+0300 D NETWORK  [conn18] SocketException: remote: 127.0.0.1:64242 error: 9001 socket exception [CLOSED] server [127.0.0.1:64242]
4084:2017-08-04T15:39:21.249+0300 D NETWORK  [conn17] SocketException: remote: 127.0.0.1:64241 error: 9001 socket exception [CLOSED] server [127.0.0.1:64241]
4085:2017-08-04T15:39:21.249+0300 D NETWORK  [conn16] SocketException: remote: 127.0.0.1:64240 error: 9001 socket exception [CLOSED] server [127.0.0.1:64240]
4086:2017-08-04T15:39:21.249+0300 D NETWORK  [conn15] SocketException: remote: 127.0.0.1:64239 error: 9001 socket exception [CLOSED] server [127.0.0.1:64239]
4087:2017-08-04T15:39:21.249+0300 D NETWORK  [conn14] SocketException: remote: 127.0.0.1:64238 error: 9001 socket exception [CLOSED] server [127.0.0.1:64238]
4088:2017-08-04T15:39:21.249+0300 D NETWORK  [conn13] SocketException: remote: 127.0.0.1:64237 error: 9001 socket exception [CLOSED] server [127.0.0.1:64237]
4089:2017-08-04T15:39:21.249+0300 D NETWORK  [conn12] SocketException: remote: 127.0.0.1:64236 error: 9001 socket exception [CLOSED] server [127.0.0.1:64236]
4090:2017-08-04T15:39:21.249+0300 D NETWORK  [conn11] SocketException: remote: 127.0.0.1:64235 error: 9001 socket exception [CLOSED] server [127.0.0.1:64235]
4091:2017-08-04T15:39:21.250+0300 D NETWORK  [conn10] SocketException: remote: 127.0.0.1:64234 error: 9001 socket exception [CLOSED] server [127.0.0.1:64234]
4092:2017-08-04T15:39:21.250+0300 D NETWORK  [conn9] SocketException: remote: 127.0.0.1:64233 error: 9001 socket exception [CLOSED] server [127.0.0.1:64233]
4093:2017-08-04T15:39:21.250+0300 D NETWORK  [conn6] SocketException: remote: 127.0.0.1:64229 error: 9001 socket exception [CLOSED] server [127.0.0.1:64229]
4094-2017-08-04T15:39:21.250+0300 I -        [conn444] end connection 127.0.0.1:64961 (480 connections now open)
4095:2017-08-04T15:39:21.250+0300 D NETWORK  [conn8] SocketException: remote: 127.0.0.1:64231 error: 9001 socket exception [CLOSED] server [127.0.0.1:64231]
4096:2017-08-04T15:39:21.250+0300 D NETWORK  [conn4] SocketException: remote: 127.0.0.1:64227 error: 9001 socket exception [CLOSED] server [127.0.0.1:64227]
4097:2017-08-04T15:39:21.250+0300 D NETWORK  [conn7] SocketException: remote: 127.0.0.1:64230 error: 9001 socket exception [CLOSED] server [127.0.0.1:64230]
4098:2017-08-04T15:39:21.250+0300 D NETWORK  [conn5] SocketException: remote: 127.0.0.1:64228 error: 9001 socket exception [CLOSED] server [127.0.0.1:64228]
4099:2017-08-04T15:39:21.250+0300 D NETWORK  [conn3] SocketException: remote: 127.0.0.1:64226 error: 9001 socket exception [CLOSED] server [127.0.0.1:64226]
4100:2017-08-04T15:39:21.250+0300 D NETWORK  [conn2] SocketException: remote: 127.0.0.1:64225 error: 9001 socket exception [CLOSED] server [127.0.0.1:64225]
4101:2017-08-04T15:39:21.250+0300 D NETWORK  [conn1] SocketException: remote: 127.0.0.1:64224 error: 9001 socket exception [CLOSED] server [127.0.0.1:64224]
4102-2017-08-04T15:39:21.250+0300 I -        [conn388] end connection 127.0.0.1:64960 (480 connections now open)
4103-2017-08-04T15:39:21.250+0300 I -        [conn422] end connection 127.0.0.1:64959 (480 connections now open)
4104-2017-08-04T15:39:21.250+0300 I -        [conn387] end connection 127.0.0.1:64958 (480 connections now open)
4105-2017-08-04T15:39:21.250+0300 I -        [conn443] end connection 127.0.0.1:64957 (480 connections now open)
4106-2017-08-04T15:39:21.250+0300 I -        [conn386] end connection 127.0.0.1:64956 (480 connections now open)
4107-2017-08-04T15:39:21.250+0300 I -        [conn384] end connection 127.0.0.1:64954 (480 connections now open)
4108-2017-08-04T15:39:21.250+0300 I -        [conn385] end connection 127.0.0.1:64955 (480 connections now open)
4109-2017-08-04T15:39:21.250+0300 I -        [conn383] end connection 127.0.0.1:64953 (480 connections now open)
4110-2017-08-04T15:39:21.250+0300 I -        [conn382] end connection 127.0.0.1:64952 (480 connections now open)
4111-2017-08-04T15:39:21.251+0300 I -        [conn442] end connection 127.0.0.1:64951 (480 connections now open)

@inancgumus
Author

inancgumus commented Aug 4, 2017

Now I tried the --maxConns option and the same errors are still there.

$ mongod --config /usr/local/etc/mongod.conf --maxConns=500000

.
.
.
ERR: read tcp 127.0.0.1:52547->127.0.0.1:27017: read: connection reset by peer
ERR: read tcp 127.0.0.1:52547->127.0.0.1:27017: read: connection reset by peer
.
.
.
ERR: read tcp 127.0.0.1:51524->127.0.0.1:27017: read: operation timed out
ERR: read tcp 127.0.0.1:51524->127.0.0.1:27017: read: operation timed out
ERR: read tcp 127.0.0.1:51525->127.0.0.1:27017: read: operation timed out
ERR: read tcp 127.0.0.1:51526->127.0.0.1:27017: read: operation timed out
ERR: read tcp 127.0.0.1:51521->127.0.0.1:27017: read: operation timed out
ERR: read tcp 127.0.0.1:51643->127.0.0.1:27017: read: operation timed out
.
.
.

@inancgumus
Author

And there are also new errors from mgo like this (weirdly, it doesn't always happen):

$ mongod --config /usr/local/etc/mongod.conf --maxConns=500000 --slowms=50
goroutine 11389 [IO wait]:
net.runtime_pollWait(0x3258b40, 0x72, 0x760)
	/usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc424d1fc68, 0x72, 0x1270060, 0x126e260)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc424d1fc68, 0xc420555f50, 0x24)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc424d1fc00, 0xc420555f50, 0x24, 0x24, 0x0, 0x1270060, 0x126e260)
	/usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc42000f6b0, 0xc420555f50, 0x24, 0x24, 0x0, 0x0, 0x0)
	/usr/local/opt/go/libexec/src/net/net.go:181 +0x70
github.com/globalsign/mgo.fill(0x1272100, 0xc42000f6b0, 0xc420555f50, 0x24, 0x24, 0x4, 0x0)
	/Users/inanc/dev/go/src/github.com/globalsign/mgo/socket.go:567 +0x53
github.com/globalsign/mgo.(*mongoSocket).readLoop(0xc4246f8870)
	/Users/inanc/dev/go/src/github.com/globalsign/mgo/socket.go:583 +0x107
created by github.com/globalsign/mgo.newSocket
	/Users/inanc/dev/go/src/github.com/globalsign/mgo/socket.go:196 +0x259

@domodwyer

Hi @inancgumus,

I ran your code and I can only reproduce this with operationProfiling.mode = all, which slows down operations significantly - turning it off will help performance in general, but it suggests your DB ops are slowish, and therefore are holding onto connections while they complete. The mgo driver will keep dialing new connections as it needs them, up to the limit set by SetPoolLimit(), which is 4096 by default.
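
For reference, these are the session knobs involved; both already appear (commented out) in the test case above, and the values here are just illustrative:

sess.SetPoolLimit(4096)                // cap on sockets per server; 4096 is the default
sess.SetSocketTimeout(1 * time.Minute) // per-operation read/write deadline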

It's not actually possible to have 500,000 connections as you'll exhaust your ephemeral ports, but your ulimit -a output above shows:

open files                      (-n) 4864

So you can't open more than 4864 sockets and files combined. When you hit this limit you'll get all sorts of network errors - try checking the number of connections you have open/waiting with netstat -an | grep tcp | wc -l, and try tweaking resource limits like kern.ipc.somaxconn - Google will help here.
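
For example, on macOS something like this (the values are illustrative, not recommendations):

$ netstat -an | grep tcp | wc -l           # count open/waiting TCP connections
$ ulimit -n 10240                          # raise the open-file limit for this shell
$ sudo sysctl -w kern.ipc.somaxconn=1024   # raise the kernel listen backlog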

This doesn't seem to be a mgo issue - tuning your network stack will help, but it really is best to limit concurrent ops in code. That said, feel free to reopen if you think it is a mgo issue!

Dom

@inancgumus
Author

inancgumus commented Aug 8, 2017

your DB ops are slowish, and therefore are holding onto connections while they complete

My DB ops are as in the code above and they're very simple, and the collection only has a few records. So I can't see what's wrong with them. Is there a problem in the code above?

I tried netstat and also lsof -n -i4TCP:27017 | wc -l to watch mongo connections. I also set kern.ipc.somaxconn and the open-files ulimit to 10240 etc., and then raised the pool with sess.SetPoolLimit(10000). I ran the code again while netstat was showing 80 conns, and I also limited the code to 500 ops. Whenever netstat gets near 1,000 conns, i/o timeout and conn reset by peer errors happen.

I don't yet see how this is unrelated to mongo or mgo. I'll investigate the same behaviour with other libraries, or even by reaching mongo directly.

@domodwyer

My DB ops are as in the code above and they're very simple, and the collection only has a few records. So I can't see what's wrong with them. Is there a problem in the code above?

The code seems fine, but operationProfiling.mode = all will slow down operations considerably - try running without it. DB ops can also be slowed down by maxing out disk I/O, CPU, or any other resource shared between your app, Mongo, and every other program on your machine - and on your laptop that's quite a few!

Running an aggregation (or in this case, loads of parallel aggregations) over all the records in your collection takes time proportional to the number of records, so each query gets slower as the collection grows. Each query holds onto a connection while it completes, meaning you'll need more and more connections to service requests. There might be a more efficient way to do this depending on your needs - one option is sketched below.
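
For example, since the competitions collection rarely changes, the aggregation result could be cached so many concurrent callers share one query. A minimal sketch, reusing the m type and fetchSomething helper from the test case above (the program already imports time; this also needs "sync"; the TTL is arbitrary):

var (
	cacheMu  sync.Mutex
	cached   []m
	cachedAt time.Time
)

const cacheTTL = 5 * time.Second // arbitrary; tune to your staleness tolerance

// fetchCachedSomething wraps fetchSomething with a tiny TTL cache so
// thousands of concurrent callers share one aggregation instead of
// holding one connection each. Holding the mutex across the refresh
// also prevents a stampede of identical queries.
func fetchCachedSomething(sess *mgo.Session) ([]m, error) {
	cacheMu.Lock()
	defer cacheMu.Unlock()

	if cached != nil && time.Since(cachedAt) < cacheTTL {
		return cached, nil // fresh enough: no mongo round-trip
	}

	result, err := fetchSomething(sess)
	if err != nil {
		return nil, err
	}
	cached, cachedAt = result, time.Now()
	return cached, nil
}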

Best of luck!

@inancgumus
Author

inancgumus commented Aug 8, 2017

Actually, all of this investigation started when I was stress testing my web API with vegeta. I wrote this small script to check whether the same errors happen, to isolate the problem from my API. Then I saw that it also happens with this simple code, so I thought it could be a mgo issue. Maybe it's because of Mongo itself; I don't know yet. Thanks for all your help.
