packets.go: read tcp [ip-address]: connection timed out #257

Closed
eduncan911 opened this Issue Jul 11, 2014 · 47 comments

Projects

None yet
@eduncan911

UPDATE: My resolution was to remove all "Idle" connections from the pool. See this comment:

#257 (comment)

I am currently experiencing a stalled or broken web app after an idle period of anywhere between 15 and 48 minutes. The most critical issue is described below:

  • Visit a URL, any url on the site, and load the page completely (as in, the page actually loads and the logs show a complete page has loaded).
  • Close browser, and wait.

A typical request is logged like this:

2014/07/13 15:29:54 INFO template rendering: index
2014/07/13 15:29:54 METRIC, URL: /, HANDLER TIME: 7.2339ms, CTX TIME: 5.0894ms, TOTAL TIME: 12.3258ms

After a long period of time (ranging from 15m to 48m), the system all of a sudden logs these lines below with no interaction - the web app has been idle this entire time:

[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 16:00:09 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 16:00:10 INFO template rendering: index
2014/07/13 16:00:10 METRIC, URL: /, HANDLER TIME: 8.8574ms, CTX TIME: 31m19.2606723s, TOTAL TIME: 31m19.2695329s

Notice the "TOTAL TIME" is 31 minutes and 19 seconds? Also, notice the MySql driver error that is logged at the same time?

There was no activity / no web request made. The web app was simply idle.

The most critical issue is what comes next after these log messages: _the very next web request stalls completely, never returning a response_:

user@govm1:~$ wget http://localhost
--2014-07-13 17:11:18--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Retrying.

--2014-07-13 17:26:19--  (try: 2)  http://localhost/
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html.4'

    [ <=>                                                                                  ] 6,310       --.-K/s   in 0.001s

2014-07-13 17:26:20 (9.61 MB/s) - `index.html.4' saved [6310]

And it sits idle, no response, for 15 minutes until wget times out.

Now, if I make a 2nd or 3rd request immediately after this one has stalled - and anytime while it is stalled - the go web app responds and returns a full page for those other requests. No issues. And then the cycle starts over from the last request I make and let sit idle.

After this 15m, you can guess exactly what is logged next:

[MySQL] 2014/07/13 17:26:57 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 17:26:57 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 17:26:57 INFO template rendering: index
2014/07/13 17:26:57 METRIC, URL: /, HANDLER TIME: 6.8938ms, CTX TIME: 15m39.1718434s, TOTAL TIME: 15m39.1787398s

Another 15m wait time.

I eliminated Windows Azure, the cluster VIP, and the firewall on the Linux VM running the go web app as the issue, because I ran wget http://localhost locally on the same box and got the same "stalled" request that never completes and never sends anything back.

There are a number of factors in my web app so I will try to outline them accordingly.

Using:

  • Go 1.3
  • go-sql-driver/mysql Version 1.2 (2014-06-03)
  • Ubuntu 12.04 LTS, ~June 2014 Updates
  • Windows Azure

Do note that the Linux box running MySql is a different Linux box from the one running the cluster of GoLang apps - and they are in separate dedicated Cloud Services. The MySql VM is a single VM, no clustering.

Here is some related code:

// global handler for our DB
var db *sql.DB

// CLI parameter
var dbdsn string

func init() {

    flag.StringVar(&dbdsn, "dbdsn", "root:root@tcp(localhost:3306)/prod?timeout=5s&tls=false&autocommit=true", "Specifies the MySql DSN connection.")
    flag.Parse()

    var err error
    db, err = sql.Open("mysql", dbdsn)
    if err != nil {
        log.Printf("ERROR in sql.Open(): %v", err)
    }

    //db.SetMaxIdleConns(5)

    // verify the DSN is set up properly
    err = db.Ping()
    if err != nil {
        panic("PANIC when pinging db: " + err.Error()) // proper error handling instead of panic in your app
    }
}

// **********
// * omitted is the Gorilla MUX router and http handler registrations
// **********

func ArticleHandler(w http.ResponseWriter, r *http.Request, c *Context) (err error) {

    m := NewArticle(c)
    id := c.Vars["id"]

    var pid int
    var title, body, excerpt, date, slug, fi, fv, region, region_slug string
    err = db.QueryRow(
        "SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, p.post_date, p.post_name, "+
            "(SELECT fpim.meta_value FROM wp_postmeta fpim WHERE fpim.meta_key = '_wp_attached_file' AND fpim.post_id = (SELECT fpim2.meta_value FROM wp_postmeta fpim2 WHERE fpim2.post_id = p.ID AND fpim2.meta_key = '_thumbnail_id' LIMIT 1) LIMIT 1) AS featured_image, "+
            "(SELECT fpim3.meta_value FROM wp_postmeta fpim3 WHERE fpim3.meta_key = 'fv_video' AND fpim3.post_id = p.ID LIMIT 1) AS featured_video, "+
            "t.name as region, t.slug as region_slug "+
            "FROM wp_posts p "+
            "JOIN wp_term_relationships tr ON tr.object_id=p.ID "+
            "JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id=tr.term_taxonomy_id "+
            "JOIN wp_terms t ON t.term_id=tt.term_id "+
            "WHERE p.post_name=? AND p.post_type='post' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP()"+
            "AND tr.object_id=p.ID AND tt.parent = (SELECT t3.term_id FROM wp_terms t3 WHERE t3.name=? LIMIT 1) LIMIT 1",
        id, RegionsParentCategory).
        Scan(&pid, &title, &body, &excerpt, &date, &slug, &fi, &fv, &region, &region_slug)

    if err != nil {
        if err == sql.ErrNoRows {

            // snipped code for redirects

            // article was not found
            return handleNotFound(w, r, c)

        } else {
            log.Printf("ERROR in .Scan(): %v", err)
        }
    } else {
        m.Region = Region{
            Name: region,
            Slug: region_slug,
        }
        m.Id = pid
        m.Title = title
        m.Body = template.HTML(body) // render the raw html
        m.Excerpt = excerpt
        m.Datetime = date
        m.Slug = slug
        m.FeaturedImageUrl = fi
        m.FeaturedVideoUrl = fv
    }

    web.RenderTemplate(w, "article", m)
    return
}

5 more DB queries, per request

In addition to this query, the "Context" you see being passed into the handler runs 4 to 6 additional SQL queries. Therefore, each "article" handler that loads runs about 5 to 7 SQL queries at minimum, using the exact same pattern and *db global variable you see above.

Timeouts / errors are always on the same DB query

Here's one of the "context" queries as a comparison:

rows2, err := db.Query(
    "SELECT p.post_title, p.post_name "+
        "FROM wp_posts p "+
        "WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
        "AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) "+
        "ORDER BY p.menu_order",
    FooterPagesParentNameSlug)
if err != nil {
    log.Printf("ERROR in AllPages .Query() : %v", err)
} else {
    defer rows2.Close()
    c.AllFooterPages = make([]FooterPage, 0)
    for rows2.Next() {
        var name, slug string
        err := rows2.Scan(&name, &slug)
        if err != nil {
            log.Printf("ERROR in AllPages row.Scan() : %v", err)
        } else {
            p := FooterPage{
                Page: Page{
                    Title: name,
                    Slug:  slug,
                },
            }
            c.AllFooterPages = append(c.AllFooterPages, p)
        }
    }
}

Nothing special there.

I call defer rows2.Close() only if there was no error. Perhaps that is part of the issue? Under load tests, this particular SQL query seems to log errors - either no response, or the mysql driver timing out.

Questions

Why am I getting request timeouts logged in excess of 15 to 30 minutes, from an idle site? That seems like a bug with the mysql driver I am using, possibly holding a connection open. But, the last http request was successful and returned a complete page + template.

I even have the Timeout set in the connection string, which is 5 seconds. Even if it is a problem with the mysql server, why the 15 minute timeout/request logged? Where did that request come from?

It still could be a MySql driver issue, blocking the request from completing - maybe being blocked by the MySql dedicated VM and an issue there. If that is the case, then how come nothing is logged? What is this random timeout of 15m to 48m? It is usually only 15m or 31m, but sometimes 48m is logged.

The near-15m multiples in the timeouts (15m, 31m and 48m, allowing for some padding in seconds) are very interesting.

Thanks in advance.

@eduncan911 eduncan911 changed the title from packets.go: read tcp [ip-address]: connection timed out to [MySQL] packets.go: read tcp [ip-address]: connection timed out Jul 11, 2014
@arnehormann
Member

Regarding your biggest question: read this part of our README and pay special attention to the link at the end.
If I guess right, setting that parameter may help you and do what you expect from timeout. What you want is probably one of the server side timeouts, just read through the linked MySQL documentation for the matching one.

Answer 1: No, you should only Open() and Close() the DB once.
It's a connection pool, not a single connection. You only need one per user and server.
Your main problem with the code above is to defer Close() in init - do not do that. Defer it in main.
The deferred function is called when the scope is left - putting it in init will not help you there as init is done and Close is called before main is executed.

Answer 2: I don't know what problems you have with Prepare, but I guess it's that it is connected with the opening and closing of the connection pool. The prepared statements are invalid when the connection pool they are prepared in is closed.

Please get back to us here if that didn't help you.

@eduncan911

Thanks Arne. I have updated the SO with more debugging. It is an issue with the go app and/or the mysql driver explicitly. Please take another look at the SO as I completely re-worded it just now.

Also, I do set timeout=5s in my connection string. You can see the default one in the above example, and in production I am running:

./wwwgo -dbdsn="user:pwd@tcp(#.#.#.#:3306)/production_db?timeout=5s&tls=false&autocommit=true"

With the same timeout of 5s.

But as I stated in the updated SO, there is an interval of 15m timeouts (15m, 31m and 48m) that is happening, stalling the entire web request.

I don't think this issue should be closed. It still may be the MySql VM issue in Azure; but even then, why is there a 15 minute timeout happening in the mysql driver/packets?

Notice the logged datetime below from the MySql driver, and then the 31 minute request logged:

[MySQL] 2014/07/13 16:59:29 packets.go:32: read tcp #.#.#.#:3306: connection timed out
[MySQL] 2014/07/13 16:59:29 packets.go:118: write tcp #.#.#.#:3306: broken pipe
2014/07/13 16:59:29 INFO template rendering: article
2014/07/13 16:59:29 METRIC, URL: /new-york-city/test-1, HANDLER TIME: 4.5973ms, CTX TIME: 31m13.9383742s, TOTAL TIME: 31m13.9429739s

@julienschmidt julienschmidt reopened this Jul 13, 2014
@julienschmidt julienschmidt changed the title from [MySQL] packets.go: read tcp [ip-address]: connection timed out to packets.go: read tcp [ip-address]: connection timed out Jul 13, 2014
@eduncan911

I've updated the original post with the exact details now after more debugging.

I've confirmed the MySql driver is the one blocking for 15 minutes.

Once I make 5 to 8 requests to the homepage, I then let the wwwgo app sit idle for at least 30m. I then attempt to make a wget request and it gets blocked on db.Query().

user@govm1:~$ wget http://localhost
--2014-07-14 00:59:13--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response...

And that's it. It sits idle for 15 minutes until wget errors with a timeout. Do note that my other 15-minute waits came from web browser requests, not wget.

I added additional logging and when the following line executes, it blocks the handler (web request) for 15 minutes. See my previous comment for those errors.

rows, err := db.Query(
    "SELECT t.name, t.slug "+
        "FROM wp_terms t "+
        "JOIN wp_term_taxonomy tt ON tt.term_id = t.term_id "+
        "WHERE tt.taxonomy = 'category' AND tt.parent = (SELECT t2.term_id FROM wp_terms t2 WHERE t2.name=? LIMIT 1)",
    RegionsParentCategory)

When that line executes, it blocks the method for 15 minutes.

Then, after waiting with no activity, the following is suddenly logged. It happens at one of the 15m intervals previously mentioned (15m, 31m or 48m):

[MySQL] 2014/07/14 01:14:52 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/14 01:14:52 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/14 01:14:52 INFO template rendering: index
2014/07/14 01:14:52 METRIC, END URL: /, HANDLER TIME: 7.9183ms, CTX TIME: 15m39.2565228s, TOTAL TIME: 15m39.2644567s

This is obviously a timeout in the MySql driver, perhaps in packets.go around line 32.

Again, this may be an issue with my MySql VM I have running causing a tcp issue. But even so, there is a 15 minute timeout happening in the Go application using this MySql driver. That should not be happening.

PS: note the datetimes logged above. There was no activity (no web requests) on the go app during this period. It just sat and waited, idle.

@arnehormann
Member

@eduncan911 do you use a custom Dialer?

I only ask because the timeout parameter is not used by the driver directly, it is handed through to net.Dialer and any errors are propagated to database/sql.
If you use a custom dialer, the timeout is only included if you add it yourself.

@eduncan911

Not that I know of. Not using the "net" package, other than HTTP.

Packages:

"fmt"
"database/sql"
"flag"
"github.com/go-sql-driver/mysql"
"github.com/gorilla/mux"
"github.com/justinas/nosurf"
"html/template"
"log"
"net/http"
"runtime"
"time"

And the app/main is pretty boiler plate standard website setup stuff.

var router = mux.NewRouter()

region := router.PathPrefix("/{region}/").Subrouter()
region.HandleFunc("/{id}", defaultHandler(ArticleHandler)).Name("article")

csrfHandler := nosurf.New(http.DefaultServeMux)
csrfHandler.ExemptRegexp("/assets(.*)")
csrfHandler.ExemptRegexp("/favicon.ico")
csrfHandler.SetFailureHandler(http.HandlerFunc(csrfFailedHandler))

The defaultHandler is basically a wrapper around a normal http handler, that creates a new context and passes it into the new handlers. That's all it is.

@eduncan911

So I had a theory... Since the MySql query stalls on execution after the application has been idle for a long period of time, perhaps Windows Azure does something to that TCP connection. Was there a persistent connection in the Go mysql driver that maybe Azure doesn't like across cloud services? (A cloud service is a collection of resources for scaling, e.g. all web servers in one cloud service to auto-scale up and down, all database servers in another cloud service, all backend job processing in another cloud service, etc.) Perhaps Azure didn't like a tcp connection that had been idle for so long?

So I set out to investigate... Also, SetMaxIdleConns() had me curious...

sudo netstat -tnp | grep wwwgo

Sure enough, this showed 1 single persistent connection from my wwwgo app to my mysql IP address on port 3306 - even after letting it sit for several hours without any interaction, it was still open and persistent.

This got me thinking... Maybe Azure didn't like long-running tcp connections with no activity. So I set these:

db.SetMaxIdleConns(0)
db.SetMaxOpenConns(500) // matches my MySql my.cnf file settings

Running netstat shows no persistent connection now, and I have not seen the original problem since! I cannot reproduce it any longer. Not only that, but load-testing the site no longer gives the too many files open error on a single-CPU instance, nor do I get the i/o socket error I got with the multi-core MAXCPUS setting. Cool.

So the issue seems to be with a long running persistent connection that never closes.

I know I know.. The next words out of anyone's mouth are, "Are you sure you are calling rows.Close(), and that you are iterating over the entire collection? That will hold the connection open if you don't!"

The answer is that I either do Scan() on single rows, or I follow this pattern for rows.Close() and iterating:

rows2, err := db.Query(
    "SELECT p.post_title, p.post_name "+
        "FROM wp_posts p "+
        "WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
        "AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) "+
        "ORDER BY p.menu_order",
    FooterPagesParentNameSlug)
if err != nil {
    log.Printf("ERROR in AllPages .Query() : %v", err)
} else {
    defer rows2.Close()
    c.AllFooterPages = make([]FooterPage, 0)
    for rows2.Next() {
        var name, slug string
        err := rows2.Scan(&name, &slug)
        if err != nil {
            log.Printf("ERROR in AllPages row.Scan() : %v", err)
        } else {
            p := FooterPage{
                Page: Page{
                    Title: name,
                    Slug:  slug,
                },
            }
            c.AllFooterPages = append(c.AllFooterPages, p)
        }
    }
}

Note that I call defer rows2.Close() only if there was no error. I have also guaranteed I am always iterating and always calling rows.Close() in this pattern. Some may say, "well, there still may be data in the rows and you should iterate over them." Well, during my load tests of 1000 concurrent users, off and on in 15m intervals several times today, there was only the 1 persistent connection open to mysql at all times. This tells me that rows.Close() is being called and that I am iterating over all rows - else, I would have more persistent connections open, right?

Shouldn't a long-running idle connection be dropped?

Entity Framework, for Microsoft SQL Server, doesn't do this "long running" idle connection pooling. It closes all queries.

If it is to remain open, shouldn't there be a timed "ping" across the wire to keep it active?

Similar to how Android tcp connections are handled in where you constantly send a ping to keep the connection alive.

@arnehormann
Member

It's great that you found the solution yourself. Congrats!

Still, this is out of scope for drivers.

SetMaxIdleConns and SetMaxOpenConns are database/sql functions. database/sql uses and wraps the database/sql/driver implementations provided by the driver. If the number of idle connections is more than 0, there will be open connections until the pool is closed.

The driver itself does not provide any pooling functionality, it is based on simple connections.
We do enable TCP keepalive by default.

Sending a ping-like query at regular intervals would require an additional goroutine per connection and synchronisation with any other query running at the same time. All of this would complicate maintenance of the driver and degrade its performance.

I also have a hunch Azure is misbehaving here.

IMO, we should not make any changes in the driver based on this issue.

@julienschmidt close if you agree, take over if you don't 😀

@eduncan911

When I ran netstat on the mysql VM, it did not show the connection - only the go app's VM showed a connection.

What I suspect is that the packets.go code, or the mysql driver code calling it, isn't detecting that the connection has been broken by the network - nor is it timing out. It continues to think there is a connection open, and attempts to use it an hour later.

This theory matches the exact symptoms I originally had: after a long idle, attempting to make a SQL query blocks for 15m. But during that block, even 5 seconds after it was initially blocked, I could make another query and it was OK. I suspect a 2nd connection was created in the pool, which worked fine.

Therefore two issues need to be addressed here:

  • failure to detect a broken idle TCP pipe/connection
  • a 15 minute timeout, waiting for that TCP pipe to return data when it is reused (e.g. a new SQL query is issued).

If neither of these are part of the driver but instead the underlying Go packages, then I am more than happy to close the issue and take these two issues to the go team.

But I believe we can agree that there are two known issues here that do need to be addressed.

@arnehormann
Member

@eduncan911 I think your theory is right. This is what's happening:

  • you issue the query. It is sent to the server.
  • something is broken - you wait 15 min for the reply.
  • packets.go code calls buffer.go and requests more data.
  • buffer.go fulfils this request by reading from an io.Reader - the connection to the server.
  • reading from the tcp connection fails and an error is returned
  • packets.go logs this and informs database/sql that the connection is stale by returning driver.ErrBadConn

The mystery of the second log line (packets.go:118) is the call to Close after logging the first error. We could make that a little smarter, but I don't think it's really a bug.

What is broken?
My new suspicion: running your queries takes more than 5 seconds, so they always time out unless they are in the query cache. That also explains why you see multiple queries - the first ones fail.
For the ones after, database/sql establishes a new connection and tries again (10 times by default) if a connection fails.

Can you please try what happens if you remove the timeout from the dsn?

@eduncan911

Interesting. I'll remove the timeout; but on average my pages load in 12ms overall, including all 7 SQL queries run synchronously (I was going to move to channels later). It is lightning fast.

Therefore, I don't think the 5-second theory is valid - unless it accounts for dozens of page requests back to back, so the connection stays open for longer than 5 seconds as it is used over and over again. But even that doesn't hold water, as I can 1) start the go app, 2) make 1 single request (which completes in 38ms, for the first request at the start of the go app), and 3) wait an hour. The pipe is broken, so after 1 hour when I attempt to make a 2nd request, the first SQL query is blocked in this state.

When you said:

  • you issue the query. It is sent to the server.
  • something is broken - you wait 15 min for the reply.
  • packets.go code calls buffer.go and requests more data.
  • buffer.go fulfils this request by reading from an io.Reader - the connection to the server.
  • reading from the tcp connection fails and an error is returned
  • packets.go logs this and informs database/sql that the connection is stale by returning driver.ErrBadConn

Let me clarify it below...

  • you make a single request, page loads fine and all queries run within 38ms.
  • you wait 1 hour with the MySql idle connection remaining open on tcp port 3306.
  • you attempt to issue a new query, but nothing is sent to the server because the tcp pipe of the idle connection is now broken - the VM running the go app thinks it is still open (as netstat shows), but the connection no longer exists on the mysql VM.
  • something is broken - you wait 15 min for a timeout
  • packets.go code calls buffer.go and requests more data.
  • buffer.go fulfils this request by reading from an io.Reader - the connection to the server.
  • reading from the tcp connection fails and an error is returned
  • packets.go logs this and informs database/sql that the connection is stale by returning driver.ErrBadConn

And if that ErrBadConn is indeed this error:

`[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out`

Then yes, that all sounds right.

@arnehormann
Member

Please add if err != nil { panic(err) } to your code wherever this bug happens.
I'd like to see the full stacktrace for the driver - to get a clue in which connection phase this happens. You may edit out your own stuff, but I need it for the driver - readPacket from packets.go is called by a lot of different functions...
It's weird that you fail to send a query but the error happens when a packet is read. Could be in the handshake phase when a new connection is established - but I can't see why that should fail due to a timeout.

@eduncan911

Will do. Let me schedule some downtime... 👍

@arnehormann
Member

cough of course you can also use debug.PrintStack and guard it with some kind of command line argument.
Probably better if you have to test in the live environment.

@arnehormann
Member

@eduncan911 small but relevant correction - the PrintStack must be in the driver, not in your code. Just put it below the log line in packets.go.

@pengfei-xue

is there any progress yet?

I got this error in our production server.

@arnehormann
Member

No progress on our side.
I don't get this error and don't have a reproducer, so I have to rely on users seeing the crashes.

If you also get this error, please edit your driver version, add PrintStack to packets.go as I described above.
When you get the stacktrace, add it in a comment.
I have to know where this happens before I can fix it.

@eduncan911

Sorry I haven't been able to "crash production" yet for an update to this issue. It's been fine since setting the SetMaxIdleConns() to zero, even under load tests across 3 go instances on 3 VMs and 1 MySql backend.

Quite surprised, actually, that MySql can take a beating like that on a single-core VM with a limit of 300 IOPS for the VHD - it got about 700 RPS with a heavy 5/6 queries per request, across 3 Go instances on 3 VMs. I attribute it to the 1.5 GB of memory the VM has, since they are all READ queries. And that's before I add indexes. I always maximize code performance and queries first, before moving to caching.

@pengfei-xue

Actually, you can reproduce this bug by setting mysql's wait_timeout = 2, waiting 2 seconds, and then issuing some sql commands.
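For reference, the server-side setting involved can be lowered like this (a repro-only config fragment; values are illustrative, and the default for both variables is 28800 seconds / 8 hours):

```sql
-- Drop idle connections after 2 seconds (for reproducing only).
SET GLOBAL wait_timeout = 2;
SET GLOBAL interactive_timeout = 2;
```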

the stack log is:

/home/pengfei/Codes/Go/src/github.com/go-sql-driver/mysql/packets.go:33 (0x4a6673)
com/go-sql-driver/mysql.(*mysqlConn).readPacket: debug.PrintStack()
/home/pengfei/Codes/Go/src/github.com/go-sql-driver/mysql/packets.go:660 (0x4a9f2a)
com/go-sql-driver/mysql.(*mysqlStmt).readPrepareResultPacket: data, err := stmt.mc.readPacket()
/home/pengfei/Codes/Go/src/github.com/go-sql-driver/mysql/connection.go:148 (0x4a39c4)
com/go-sql-driver/mysql.(*mysqlConn).Prepare: columnCount, err := stmt.readPrepareResultPacket()
/usr/lib/go/src/pkg/database/sql/sql.go:945 (0x42fbbc)
(*DB).queryConn: si, err := dc.ci.Prepare(query)
/usr/lib/go/src/pkg/database/sql/sql.go:913 (0x42f8dd)
(*DB).query: return db.queryConn(ci, ci.releaseConn, query, args)
/usr/lib/go/src/pkg/database/sql/sql.go:899 (0x42f75b)
(*DB).Query: rows, err = db.query(query, args)
/usr/lib/go/src/pkg/database/sql/sql.go:977 (0x430037)
(*DB).QueryRow: rows, err := db.Query(query, args...)
/home/pengfei/Codes/Go/src/oupeng.com/mop/dwz/dwz.go:175 (0x40240d)
getUrl: db.QueryRow("SELECT url FROM url WHERE id = ?", id).Scan(&url)
/home/pengfei/Codes/Go/src/oupeng.com/mop/dwz/dwz.go:60 (0x400dcc)
(*DwzHandler).ServeHTTP: url := getUrl(hash)
/home/pengfei/Codes/Go/src/oupeng.com/mop/dwz/middlewares/middlewares.go:23 (0x4b438f)
com/mop/dwz/middlewares.(*CrossDomain).ServeHTTP: c.handler.ServeHTTP(w, r)
/usr/lib/go/src/pkg/net/http/server.go:1597 (0x46204e)
serverHandler.ServeHTTP: handler.ServeHTTP(rw, req)
/usr/lib/go/src/pkg/net/http/server.go:1167 (0x460057)
(*conn).serve: serverHandler{c.server}.ServeHTTP(w, w.req)
/usr/lib/go/src/pkg/runtime/proc.c:1394 (0x419540)
goexit: runtime·goexit(void)
[MySQL] 2014/07/25 12:42:30 packets.go:34: EOF
[MySQL] 2014/07/25 12:42:30 packets.go:120: write tcp 127.0.0.1:3306: broken pipe
[MySQL] 2014/07/25 12:42:30 packets.go:34: EOF
[MySQL] 2014/07/25 12:42:30 packets.go:120: write tcp 127.0.0.1:3306: broken pipe

@pengfei-xue

mysql closes the connection after a period of inactivity.

@ernestoalejo

I can reproduce the "write part" of this issue consistently in a vagrant machine and in production:

[MySQL] 2014/08/27 19:39:32 packets.go:118: write unix /var/lib/mysql/mysql.sock: broken pipe
@eduncan911

I found it to be the state of the connection pool: the pool thinks there is still a connection open, but in fact it has been closed remotely by mysqld.

Scroll up for my entire story and debugging: I verified this with netstat where I saw the Go application having a connection to port 3306 on the remote server. But yet, the remote server running MySql no longer had any open pipes (after the timeout I noted above). I did verify that upon making the application active again, opening many mysql connections, that I saw both open pipes on both servers. But again, after some time, the remote mysqld server would drop the open "idle" connection (still not sure if it is mysqld, or Windows Azure doing it).

The root problem is that the net package, and/or this mysql driver, does not detect the dropped connection and thinks that there is still an open idle connection - there isn't. So upon the next attempt by the mysql driver to make a query, with this stale idle connection that doesn't exist, it creates a block and eventually times out after a long period of time. Any additional hits to the mysql driver for queries works fine, as the connection pool simply creates new connections going forward - after that 1 idle connection (e.g. pooling the connections, use the idle open one for the first 1, then create new connections). New connections are fine; it is that idle one that is the problem because it simply isn't connected.

I resolved my issues by getting rid of all idle connections. See my posts above for the history and resolutions. But in short:

db.SetMaxIdleConns(0) // this is the root problem! set it to 0 to remove all idle connections
db.SetMaxOpenConns(500) // or whatever is appropriate for your setup.
@ernestoalejo

It's mysqld, as I'm only using Ubuntu/CentOS and MySQL all in the same machine (both vagrant and production).

Getting rid of all idle connections is obviously a very good option (there's no appreciable performance hit in my response times). Anyway, I think we should see this through to the end and find out whether it's a problem in this library or an issue in the standard library that should be reported and fixed.

At the very least, if there's no way to solve this, it should be documented where appropriate that a pool of permanent connections cannot be used with this library.

Anyway thanks for writing the solution in this issue explicitly.

@arnehormann
Member

Anyone in this thread:
please change http://play.golang.org/p/RXeKHioTpo so it reliably triggers the problem and post the link to your version here when it does.
I have never hit this bug, even after investing quite some time in it.
Help me with a reproducer so this can finally get closed.

UPDATE if it only happens on a VPS or in a specific environment I'd like to have a look there. If so, I can send you my public key for access.

@eduncan911

@arnehormann Put a delay in for 15 or 20 minutes after the first query.

Then attempt 2 SQL queries after that 20 minutes, in two goroutines (GOMAXPROCS = 1, so you can somewhat control the order they run in). You want to wait for the amount of time it takes for mysql to drop that idle connection.

the first query after 20 minutes will eventually timeout/error after several minutes of waiting - the reason is that it attempts to use what the driver/netpipes thinks is an "idle" live-still-connected connection in the pool. but, mysql has dropped that connection remotely already and it doesn't exist.

the second query will execute almost instantly, and will return without error. you'll still be waiting for the first query to timeout over several minutes. the second query works fine because it did NOT use an idle connection; but instead, the pool created a new tcp connection for the query. this is why it works.

^- as long as you have set idle connections to 1, that is.

it may also require mysql to be installed a remote machine. perhaps mysql breaks the long-idle connection for remote connections only after a period of time (where perhaps long-idle connections locally are fine).

@arnehormann
Member

@eduncan911 thanks, but please don't describe it - change my program linked above so it reliably hangs on your own machine when you start it there.
And I'd be glad if that happens faster than after 15 minutes; @pengfei-xue indicated it can be done after 2 seconds.
I just want a fast feedback loop when I debug this. And I don't want to guess concerning the details of how to hit the bug. I'd like some code, please...

@julienschmidt julienschmidt added this to the v1.3 milestone Sep 26, 2014
@julienschmidt julienschmidt added the bug label Sep 26, 2014
@arnehormann
Member

This bug is a little hard to fix. Here's what I know so far.

The cause
The server drops the network connection (netconn) due to a timeout but the driver does not find out until it attempts the next Read or Write.
I don't know a way to sense it earlier without active probing - from which we wouldn't gain much.

What happens
When the netconn is read from or written to, the operation fails with connection reset by peer; attempting to Close the database connection (dbconn) then leads to the broken pipe from the network connection.
Enabling keepalives doesn't really help; they start pretty late.
The driver signals database/sql via driver.ErrBadConn, which retries the operation on a different dbconn up to 10 times (http://golang.org/src/pkg/database/sql/sql.go lines 815, 914).
database/sql only surfaces an error if it picks 10 dbconns with stale netconns in a row.

The fix?
I'd say this is on database/sql - it should retry db.maxOpen + 1 times instead of maxBadConnRetries (10) times.
If we try to reestablish the connection on our side, we reimplement part of the connection pool in database/sql.

@arnehormann
Member

I created a Go issue (8834).
Maybe we can include a workaround in the driver, but I don't see a sensible way to do it right now.
I dislike putting reconnection logic in the driver.

Where possible, call SetMaxOpenConns(n) with n <= 9 as a workaround - maybe even SetMaxIdleConns is sufficient.

It's also possible that this doesn't fix the issue here. If I identified the cause correctly, Prepare, Query, Exec or Scan should return driver.ErrBadConn. That still doesn't explain the hangs. Maybe the attempt to access a connection takes some time until it times out... but I can't reproduce that part of the bug yet.

Right now the only fix I'm reasonably sure of (and that I still dislike) is wrapping the network connection in a custom type, calling SetDeadline and resetting it on every call to Read and Write.

Thoughts?

@eduncan911

👍 anxiously awaiting word from Google. :)

@methane
Contributor
methane commented Feb 12, 2015

I've posted related issue golang/go#9851 .

@johto
johto commented Feb 13, 2015

(Happened to see this thread due to the Go ticket..)

This sounds like a plain, honest TCP timeout. I don't know much about MySQL or the driver, but what usually happens is this:

  1. The TCP connection is killed by some network device / server but it doesn't tell the other end about it. This leaves the TCP connection in ESTABLISHED on the client's end while its state is actually gone from the server.
  2. Driver calls net.Conn.Write to send a query over, which ultimately effects a write() system call. The data fits into the OS' buffers, and thus the call doesn't block.
  3. Now, with the data in its buffers, the operating system's TCP stack starts sending the data to the remote end.
  4. The driver calls net.Conn.Read. This will block, since there's nothing to read yet: the server hasn't replied (it hasn't even received the query, because there's no healthy connection).
  5. The TCP stack tries to send, re-send, re-send, etc. the data to the server. It will keep doing this for some specific amount of time (it was 15 minutes on our out-of-the-box Linux installation) before giving up, and then tells the client that the connection timed out.

The problem with setting a deadline is that for a lot of apps it is either going to be annoyingly low, risking failures on genuinely slow queries (for example when the database is under heavier load), or it will still take annoyingly long before the deadline is reached, thus not really solving the problem.

What's really worth looking into is why step number 1 happens in the first place, and fix that. TCP timeouts and/or reconfiguring the network settings/gear can usually get you quite far. For some apps, deadlines can be nice in addition, but I don't think they're usable as a general workaround.
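On the MySQL side, the server variable `wait_timeout` controls how long the server keeps an idle connection before dropping it, which is often the trigger for step 1. On the client OS side, the retransmit window from step 5 and the keepalive probing are controlled by kernel sysctls on Linux; a hedged example of tightening them (values are illustrative and must be tuned per environment):

```shell
# Assumption: illustrative Linux sysctls; tune for your environment.
# How long the kernel retransmits unacknowledged data before erroring
# out (the default of 15 retries is roughly 13-30 minutes):
sysctl -w net.ipv4.tcp_retries2=8
# Probe idle connections earlier so dead peers are noticed sooner:
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=5
```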

@mdu4rt3
mdu4rt3 commented Feb 16, 2015

+1, started running into this problem as well. It seems to just happen randomly when talking to a remote MySQL over the internet vs. the LAN.

@eduncan911

Just FYI, I resolved (bypassed) this issue by getting rid of the Idle connections.

See #257 (comment) above

@mdu4rt3
mdu4rt3 commented Feb 17, 2015

Yeah, I've been over this thread quite a bit. I tried the various workarounds but didn't have any luck. Thanks for the suggestions.

@arnehormann
Member

@johto don't you think TCP_NODELAY should prevent that? According to the docs, TCP_NODELAY is set by default in Go. And we don't turn it off...
Maybe it's an issue with custom dialers?

@johto
johto commented Feb 17, 2015

@johto don't you think TCP_NODELAY should prevent that?

No. TCP_NODELAY simply means that data won't be artificially delayed in the buffers. I don't see how that's relevant.

@arnehormann
Member

@johto yeah, you're right. I temporarily thought TCP_NODELAY would make the Write call in step 2 blocking.

@methane
Contributor
methane commented Feb 17, 2015

I've started implementing golang/go#9851.
Here is a rough draft.

If someone is suffering from this problem now, could you try db.SetMaxReuseTimeout(time.Minute)?

@methane
Contributor
methane commented Feb 17, 2015

I'm sorry, I forgot to implement the check in DB.conn(). I'll update the patch later.

@bradrydzewski bradrydzewski referenced this issue in drone/drone Mar 16, 2015
Closed

drone start before mysql #919

@derekperkins

👍 I think I'm experiencing this exact same issue on Google Cloud SQL, but my requests time out at 100 seconds, so I can't verify that they are lasting 15+ minutes. The symptoms appear to be the same, though; because the request is halted, I don't get a concrete error back.

@arnehormann
Member

I have an idea how to tackle this issue, but it's a little too brittle for my taste. I still don't see another way after a lot of hard thinking.

As I see it, the cause is the server cutting off the connection in a way that's not easily detectable by the client (the same problem as ripping out the network cable). To detect this faster and to let the server know we are still there, we need a MySQL-protocol-based keepalive on the same connection; TCP keepalive, with its overly long intervals, doesn't help.

Ironically, @xaprb recently published a blog post praising our driver for not using timeouts... but what I propose is different from what other connectors do.

My idea:
the time of the last successful Read can be used as a reliable liveness signal. If the time since the last Read exceeds a keepalive duration, we have to issue a lightweight query (CMD_PING is the most lightweight; Exec("DO 1") is a close second that's not a command packet) on the same MySQL connection, leading to a Read. If that does not succeed, reconnect or close the connection.

Steps to do this:
Add fields to mysqlConn: lastRead time.Time, claimed bool and a sync.Mutex. sync/atomic would be even better, but that probably won't do. Update lastRead whenever a Read happens.
Make the existing code threadsafe enough for this (that's easy to get wrong and hard to check) - add locking and updates of claimed and lastRead.
Introduce keepalive=<time.Duration interval> in the DSN.
If keepalive is set, spin up a goroutine when Open is called. It should have a time.Tick channel with that interval, looping as long as the connection is not closed. Inside the loop, continue if the connection is claimed or if lastRead + interval > time.Now(). Otherwise, claim the connection and do a full CMD_PING roundtrip - if that does not finish before the next Tick, reconnect or call Close.
Make sure that goroutine exits when the connection is closed.
Claiming is needed so we never ever do this while the sequence id is going to be incremented as a result of the last command. I don't think calling CMD_PING is OK if we are inside a still-open sequence...

Downsides:
Easy to get wrong, and has to be kept in mind for all bugfixes, feature updates and additions (exactly like #302).
Has a (low) performance impact even on connections without keepalive; the locking and lastRead updates are always required.
Introduces a small delay if the CMD_PING roundtrip is still active while a query is executed.
Adds network traffic (though that's negligible with a keepalive interval of 1 minute or so).
If we are in a sequence and the connection is claimed, the keepalive checks do not happen. Being slow in result processing can still lead to the error.

@arnehormann arnehormann reopened this Mar 18, 2015
@julienschmidt
Member

Ok TL;DR:

From #257 (comment)

  1. The TCP connection is killed by some network device / server but it doesn't tell the other end about it. This leaves the TCP connection in ESTABLISHED on the client's end while its state is actually gone from the server.
  2. Driver calls net.Conn.Write to send a query over, which ultimately effects a write() system call. The data fits into the OS' buffers, and thus the call doesn't block.
  3. Now, with the data in its buffers, the operating system's TCP stack starts sending the data to the remote end.
  4. The driver calls net.Conn.Read. This will block, since there's nothing to read yet: the server hasn't replied (it hasn't even received the query, because there's no healthy connection).
  5. The TCP stack tries to send, re-send, re-send, etc. the data to the server. It will keep doing this for some specific amount of time (it was 15 minutes on our out-of-the-box Linux installation) before giving up, and then tells the client that the connection timed out.

So what we have to avoid is writing to a connection which we think is still alive, but which the server has in reality already closed.

We could also do something driver-side like golang/go#9851, by returning ErrBadConn when time.Now() - conn.LastCmdTime > maxIdleTime

And regarding Arne's keepalive feature... Déjà-vu? 9d66799

@arnehormann
Member

No Déjà-vu. My first commit was 2 or 3 months later - and I'm not much of a git historian 😀

@netroby
netroby commented Dec 15, 2015

#394
If anyone is interested, this PR should help.

@eduncan911

👍

@guotie
guotie commented Jun 24, 2016

I have set SetMaxIdleConns(0), and there are no active connections, but after a long period the response is still slow, about 19 seconds.
