Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graphite fixes3 #644

Closed
wants to merge 18 commits into from
Closed

Graphite fixes3 #644

wants to merge 18 commits into from

Conversation

Dieterbe
Copy link
Contributor

this is an updated version of #488

  • did a rebase after the udp graphite stuff got merged
  • a few bugfixes (for the commit-batches code, the current code in master has no bugs that i know of)

i have to test again to make sure, but i think it still performs worse than the original code (just like the 2 previous graphite-fixes PR's)

@Dieterbe Dieterbe mentioned this pull request Jun 13, 2014
@Dieterbe
Copy link
Contributor Author

ok, forget about the articifial benchmarks.
i have real emperical proof that this code is better than the current code.

before 17:11 i was running stock 0.7.3, and trying to send 2k metrics/s in, but carbon-relay-ng had to drop most of it because influx couldn't handle it. at 17:11 i deployed this code live (this patchset applied on top of 0.7.3)
see below for the results.

at time of writing it's 17:25

carbon-relay-ng output:

17:06:09.134058 routing.go:115: write tcp 10.90.128.39:2003: connection reset by peer
17:07:09.236220 routing.go:115: write tcp 10.90.128.39:2003: broken pipe
17:08:09.288686 routing.go:115: write tcp 10.90.128.39:2003: broken pipe
17:09:09.340169 routing.go:115: write tcp 10.90.128.39:2003: broken pipe
17:10:09.399735 routing.go:115: write tcp 10.90.128.39:2003: broken pipe
17:11:07.842679 routing.go:70: dial tcp 10.90.128.39:2003: connection refused
# no more errors, it's now 17:25

pic that shows that before was almost all drops, and after, 100% working fine
http://i.imgur.com/vKQwkcB.png (the last datapoint is not a drop, it's null, renderer thing)

@Dieterbe
Copy link
Contributor Author

also fwiw, been running this code at vimeo for a while now, haven't found issues yet.

@jvshahid
Copy link
Contributor

Should this be reviewed and merged into master ?

@Dieterbe
Copy link
Contributor Author

Yes please. It should be good.
Sent from mobile

John Shahid notifications@github.com wrote:

Should this be reviewed and merged into master ?


Reply to this email directly or view it on GitHub.

Dieter Plaetinck added 2 commits June 19, 2014 11:28
break loop when channel closed, always
also log shutdown so we know it worked
@Dieterbe
Copy link
Contributor Author

just noticed a bug: it doesn't break the commitloop in all cases when self.writeSeries gets closed, so the shutdown doesn't propagate properly. testing a fix right now..

EDIT: fixed

Dieter Plaetinck added 2 commits June 19, 2014 14:01
(but not in the internal library, cause that's implicit)
@Dieterbe
Copy link
Contributor Author

Dieterbe commented Jul 7, 2014

note that the bug fixed by included commit 4f084d3 is also in the current master branch (i.e. in current master, connection is closed when it fails to process a line, which is not very good)

}
wg.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this loop ever break ? It looks like we will never get to the wg.wait() line. Did I miss something ?

@jvshahid
Copy link
Contributor

lgtm, other than the things noted in the comments

otherwise this condition would be used no more than once and it would
only flush based on the timer.
* break from TCP Accept() loop when connection closed, which was preventing shutdown to proceed
* make sure that UDP functionality doesn't write to writeSeries channel after it has been closed.
* clearer, more specific shutdown message

in particular:

 * self.writers allows us to make sure things writing to writeSeries are done
   (they do blocking calls to handleMessage()) whether udp or tcp
 * self.connClosed lets us break from the Accept() loop,
   see http://zhen.org/blog/graceful-shutdown-of-go-net-dot-listeners/ (quit channel)
 * shutdown channel is now allCommitted

things can get a little complicated, so here's a little schematic of how the functions and
their logic relate:
indent for a call out or important code within. everything shown as one nested tree

  server.go
      go ListenAndServe
          go committer
              reads from self.writeSeries until closed, then writes to self.allCommitted
          Serve
              for {
                  Accept, breaks if err + connClosed
                  self.writers.Add()
                  go handleClient
                      for {
                          handleMessage
                              reads until err and writes to self.writeSeries until read failed
                          reads until EOF, ignores other handleMessage errors
                      }
                      conn.Close()
                      self.writers.Done()
              }
              self.writers.Wait()
              close(self.writeSeries)
      Close()
          close(self.connClosed)
          self.conn.Close()
          wants confirmation on allCommitted channel; [timeout] returns within 5s
@Dieterbe
Copy link
Contributor Author

good to go. that travis build failure seems to be due to unrelated code.

@jvshahid jvshahid closed this in 856a94c Jul 28, 2014
@jvshahid jvshahid added this to the 0.8.0 milestone Jul 28, 2014
@Dieterbe Dieterbe deleted the graphite-fixes3 branch July 28, 2014 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants