This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Peers that crash on component/start will not reboot #437

Closed
lbradstreet opened this issue Dec 11, 2015 · 3 comments

@lbradstreet
Member

Found by Jepsen partitioning during a long-running test.

Here is what I think is happening: a peer fails, starts back up, and throws an exception during startup (see the exception below) while starting the log component. Note that the exception at the end of the trace is an error printing a future, which may be an aviso/pretty issue, but I don't think it is relevant.

Because of the exception during startup, peer-lifecycle falls through to the fatal calls in its catch block, the loop exits, and the peer drops off the cluster.

(defn ^{:no-doc true} peer-lifecycle [started-peer config shutdown-ch ack-ch]
  (try
    (loop [live @started-peer]
      (let [restart-ch (:restart-ch (:virtual-peer live))
            [v ch] (alts!! [shutdown-ch restart-ch] :priority true)]
        (cond (= ch shutdown-ch)
              (do (component/stop live)
                  (reset! started-peer nil)
                  (>!! ack-ch true))
              (= ch restart-ch)
              (do (component/stop live)
                  (Thread/sleep (or (:onyx.peer/retry-start-interval config) 2000))
                  (let [live (component/start live)]
                    (reset! started-peer live)
                    (recur live)))
              :else (throw (ex-info "Read from a channel with no response implementation" {})))))
    (catch Throwable e
      (fatal "Peer lifecycle threw an exception")
      (fatal e))))
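
To illustrate the failure mode in the function above: the call to component/start in the restart branch sits inside the single outer try, so one exception during startup ends the loop for good. Below is a minimal sketch of one way to keep retrying instead, assuming that retrying on any Exception is acceptable. The names start-with-retry and flaky-start are hypothetical, not Onyx functions, and this is not necessarily the approach taken in the eventual fix.

```clojure
;; Hypothetical helper, not part of Onyx: keeps calling start-fn until it
;; returns without throwing, sleeping interval-ms between attempts.
;; Catches Exception rather than Throwable so that JVM Errors still propagate.
(defn start-with-retry
  [start-fn interval-ms]
  (loop []
    ;; recur is not allowed inside catch, so capture the outcome as data first.
    (let [outcome (try
                    {:ok (start-fn)}
                    (catch Exception e
                      {:error e}))]
      (if (contains? outcome :ok)
        (:ok outcome)
        (do (Thread/sleep (long interval-ms))
            (recur))))))

;; Simulated startup that fails twice (standing in for a ConnectionLoss)
;; and succeeds on the third attempt.
(def attempts (atom 0))

(defn flaky-start []
  (if (< (swap! attempts inc) 3)
    (throw (ex-info "ConnectionLoss (simulated)" {:code -4}))
    :started))

(def result (start-with-retry flaky-start 10))
```

In peer-lifecycle, the equivalent change would be to wrap only the component/start call in the restart branch, so that shutdown-ch is still serviced by the surrounding loop. The sketch retries without a bound; a real version would probably want a maximum attempt count or a check for shutdown between attempts.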
15-Dec-11 10:16:11 n1 FATAL [onyx.system] -
                                                         java.lang.Thread.run              Thread.java:  745
                           java.util.concurrent.ThreadPoolExecutor$Worker.run  ThreadPoolExecutor.java:  617
                            java.util.concurrent.ThreadPoolExecutor.runWorker  ThreadPoolExecutor.java: 1142
                                          java.util.concurrent.FutureTask.run          FutureTask.java:  266
                                                                          ...
                                          clojure.core/binding-conveyor-fn/fn                 core.clj: 1916
                                                   onyx.api/start-peers/fn/fn                  api.clj:  298
                                                      onyx.api/peer-lifecycle                  api.clj:  271
                                                   onyx.system.OnyxPeer/start               system.clj:   86
                                                onyx.system/rethrow-component               system.clj:   53
                                                      onyx.system.OnyxPeer/fn               system.clj:   87
                               com.stuartsierra.component$start_system.invoke           component.cljc:  163
                                                                          ...
                            com.stuartsierra.component$update_system.doInvoke           component.cljc:  135
                                                          clojure.core/reduce                 core.clj: 6518
                                                                          ...
                    com.stuartsierra.component$update_system$fn__10382.invoke           component.cljc:  139
                                 com.stuartsierra.component$try_action.invoke           component.cljc:  117
                                                           clojure.core/apply                 core.clj:  632
                                                                          ...
                  com.stuartsierra.component$fn__10331$G__10324__10336.invoke           component.cljc:    5
                  com.stuartsierra.component$fn__10331$G__10325__10333.invoke           component.cljc:    5
                                           onyx.log.zookeeper.ZooKeeper/start            zookeeper.clj:   99
                                                                          ...
                                                      onyx.log.curator/create              curator.clj:   93
                  org.apache.curator.framework.imps.CreateBuilderImpl.forPath   CreateBuilderImpl.java:   44
                  org.apache.curator.framework.imps.CreateBuilderImpl.forPath   CreateBuilderImpl.java:  447
                  org.apache.curator.framework.imps.CreateBuilderImpl.forPath   CreateBuilderImpl.java:  467
org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground   CreateBuilderImpl.java:  477
         org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground   CreateBuilderImpl.java:  699
                                   org.apache.curator.RetryLoop.callWithRetry           RetryLoop.java:  107
                  org.apache.curator.framework.imps.CreateBuilderImpl$11.call   CreateBuilderImpl.java:  703
                  org.apache.curator.framework.imps.CreateBuilderImpl$11.call   CreateBuilderImpl.java:  720
                                        org.apache.zookeeper.ZooKeeper.create           ZooKeeper.java:  783
                                  org.apache.zookeeper.KeeperException.create     KeeperException.java:   51
                                  org.apache.zookeeper.KeeperException.create     KeeperException.java:   99
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /onyx
    code: -4
    path: "/onyx"
                                  clojure.lang.ExceptionInfo: Error in component :log in system onyx.system.OnyxPeer calling #'com.stuartsierra.component/start
                                  clojure.lang.ExceptionInfo: Error in component :log in system onyx.system.OnyxPeer calling #'com.stuartsierra.component/start
    data: {:reason :com.stuartsierra.component/component-function-threw-exception, :function #'com.stuartsierra.component/start, :system-key :log, :component #onyx.log.zookeeper.ZooKeeper{:config {:zookeeper/address "n1:2181,n2:2181,n3:2181,n4:2181,n5:2181", :onyx.peer/job-scheduler :onyx.job-scheduler/greedy, :onyx.messaging/impl :aeron, :onyx/id "JEPSENONYXID", :onyx.messaging/bind-addr "10.0.0.2", :onyx.messaging/peer-port 40200, :onyx.messaging.aeron/embedded-driver? true}, :monitoring #onyx.monitoring.no_op_monitoring.NoOpMonitoringAgent{}, :server nil, :conn #object[org.apache.curator.framework.imps.CuratorFrameworkImpl 0x15401712 "org.apache.curator.framework.imps.CuratorFrameworkImpl@15401712"], :prefix "JEPSENONYXID"}, :system #onyx.system.OnyxPeer{:monitoring #onyx.monitoring.no_op_monitoring.NoOpMonitoringAgent{}, :log #onyx.log.zookeeper.ZooKeeper{:config {:zookeeper/address "n1:2181,n2:2181,n3:2181,n4:2181,n5:2181", :onyx.peer/job-scheduler :onyx.job-scheduler/greedy, :onyx.messaging/impl :aeron, :onyx/id "JEPSENONYXID", :onyx.messaging/bind-addr "10.0.0.2", :onyx.messaging/peer-port 40200, :onyx.messaging.aeron/embedded-driver? true}, :monitoring #onyx.monitoring.no_op_monitoring.NoOpMonitoringAgent{}, :server nil, :conn #object[org.apache.curator.framework.imps.CuratorFrameworkImpl 0x15401712 "org.apache.curator.framework.imps.CuratorFrameworkImpl@15401712"], :prefix "JEPSENONYXID"}, :acking-daemon #onyx.messaging.acking_daemon.AckingDaemon{:opts {:zookeeper/address "n1:2181,n2:2181,n3:2181,n4:2181,n5:2181", :onyx.peer/job-scheduler :onyx.job-scheduler/greedy, :onyx.messaging/impl :aeron, :onyx/id "JEPSENONYXID", :onyx.messaging/bind-addr "10.0.0.2", :onyx.messaging/peer-port 40200, :onyx.messaging.aeron/embedded-driver? true}, :ack-state #object[clojure.lang.Atom 0x727d08fd {:status :ready, :val #onyx.messaging.acking_daemon.AckState{:state {}, :completed? false}}], :acking-ch #object[clojure.core.async.impl.channels.ManyToManyChannel 0x19b60ef3 "clojure.core.async.impl.channels.ManyToManyChannel@19b60ef3"], :completion-ch #object[clojure.core.async.impl.channels.ManyToManyChannel 0x42196a45 "clojure.core.async.impl.channels.ManyToManyChannel@42196a45"], :timeout-ch nil, :monitoring #onyx.monitoring.no_op_monitoring.NoOpMonitoringAgent{}, :log #onyx.log.zookeeper.ZooKeeper{:config {:zookeeper/address "n1:2181,n2:2181,n3:2181,n4:2181,n5:2181", :onyx.peer/job-scheduler :onyx.job-scheduler/greedy, :onyx.messaging/impl :aeron, :onyx/id "JEPSENONYXID", :onyx.messaging/bind-addr "10.0.0.2", :onyx.messaging/peer-port 40200, :onyx.messaging.aeron/embedded-driver? true}, :monitoring #onyx.monitoring.no_op_monitoring.NoOpMonitoringAgent{}, :server nil, :conn #object[org.apache.curator.framework.imps.CuratorFrameworkImpl 0x15401712 "org.apache.curator.framework.imps.CuratorFrameworkImpl@15401712"], :prefix "JEPSENONYXID"}, :ack-segments-fut #object[clojure.core$future_call$reify__6736 0x57f40d93 {:status :ready, :val nil}], :timeout-fut #object[clojure.core$future_call$reify__6736 0x4d54cd70 {:status :failed, :val #error {
           :cause nil
           :via
           [{:type java.util.concurrent.CancellationException
             :message nil
             :at [java.util.concurrent.FutureTask report "FutureTask.java" 121]}]
           :trace
           [[java.util.concurrent.FutureTask report "FutureTask.java" 121]
            [java.util.concurrent.FutureTask get "FutureTask.java" 192]
            [clojure.core$deref_future invoke "core.clj" 2186]
            [clojure.core$future_call$reify__6736 deref "core.clj" 6683]
            [clojure.core$deref invoke "core.clj" 2206]
            [clojure.core$deref_as_map$fn__5931 invoke "core_print.clj" 392]
            [clojure.core$deref_as_map invoke "core_print.clj" 392]
            [clojure.core$fn__5937 invoke "core_print.clj" 411]
            [clojure.lang.MultiFn invoke "MultiFn.java" 233]
            [clojure.core$pr_on invoke "core.clj" 3548]
            [clojure.core$print_map$fn__5861 invoke "core_print.clj" 212]
            [clojure.core$print_sequential invoke "core_print.clj" 54]
            [clojure.core$print_map invoke "core_print.clj" 209]
            [clojure.core$fn__5882 invoke "core_print.clj" 272]
            [clojure.lang.MultiFn invoke "MultiFn.java" 233]
            [clojure.core$pr_on invoke "core.clj" 3548]
            [clojure.core$print_map$fn__5861 invoke "core_print.clj" 212]
            [clojure.core$print_sequential invoke "core_print.clj" 54]
            [clojure.core$print_map invoke "core_print.clj" 209]
            [clojure.core$fn__5882 invoke "core_print.clj" 272]
            [clojure.lang.MultiFn invoke "MultiFn.java" 233]
            [clojure.core$pr_on invoke "core.clj" 3548]
            [clojure.core$print_map$fn__5861 invoke "core_print.clj" 212]
            [clojure.core$print_sequential invoke "core_print.clj" 54]
            [clojure.core$print_map invoke "core_print.clj" 209]
            [clojure.core$fn__5882 invoke "core_print.clj" 272]
            [clojure.lang.MultiFn invoke "MultiFn.java" 233]
            [clojure.core$pr_on invoke "core.clj" 3548]
            [clojure.core$print_map$fn__5861 invoke "core_print.clj" 212]
            [clojure.core$print_sequential invoke "core_print.clj" 54]
            [clojure.core$print_map invoke "core_print.clj" 209]
            [clojure.core$fn__5864 invoke "core_print.clj" 219]
            [clojure.lang.MultiFn invoke "MultiFn.java" 233]
            [clojure.core$pr_on invoke "core.clj" 3548]
            [clojure.core$pr invoke "core.clj" 3560]
            [clojure.lang.AFn applyToHelper "AFn.java" 154]
            [clojure.lang.RestFn applyTo "RestFn.java" 132]
            [clojure.core$apply invoke "core.clj" 630]
            [clojure.core$prn doInvoke "core.clj" 3593]
            [clojure.lang.RestFn invoke "RestFn.java" 408]
            [onyx.static.logging_configuration$fn__12371 invoke "logging_configuration.clj" 14]
            [clojure.lang.MultiFn invoke "MultiFn.java" 229]
            [clojure.pprint$write_out invoke "pprint_base.clj" 194]
            [clojure.pprint$write$fn__8132 invoke "pprint_base.clj" 234]
            [clojure.pprint$write doInvoke "pprint_base.clj" 233]
            [clojure.lang.RestFn invoke "RestFn.java" 559]
            [io.aviso.exception$format_property_value invoke "exception.clj" 349]
            [io.aviso.exception$write_exception$write_exception_stack__6514 invoke "exception.clj" 435]
            [io.aviso.exception$write_exception invoke "exception.clj" 442]
            [clojure.lang.AFn applyToHelper "AFn.java" 160]
            [clojure.lang.AFn applyTo "AFn.java" 144]
            [clojure.core$apply invoke "core.clj" 632]
            [io.aviso.writer$into_string doInvoke "writer.clj" 61]
            [clojure.lang.RestFn invoke "RestFn.java" 439]
@MichaelDrogalis
Contributor

Possible fix up.

@lbradstreet
Member Author

That was not a full fix. A complete fix is coming once everything looks better on the Jepsen side. Sorry for the lack of communication here.

MichaelDrogalis added a commit that referenced this issue Jan 14, 2016
Defensive Peer Joining Procedures, fixes #423, #453, #437
@lbradstreet
Member Author

Closed by #484
