Event Manager #3329

Closed
lavalamp opened this issue Jan 8, 2015 · 8 comments
Labels
area/introspection · kind/design · priority/awaiting-more-evidence · sig/api-machinery

Comments

@lavalamp (Member) commented Jan 8, 2015

Now that we have various components producing events, we need to start processing them. Our current policy of a 2 day TTL isn't going to scale.

So, I want us to build an event manager.

The basic sketch is that it reads events from the cluster, processes them, and archives them.

a. Reading events:

  1. It runs in a pod as part of your kubernetes cluster.
  2. It uses the "kubernetes" service to contact the master.
  3. It lists & then watches all events (see the sketch after this list).
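
A minimal sketch of (a.1)-(a.3), assuming today's client-go packages for the in-cluster config and event API (the package paths and helpers are assumptions, not part of this proposal):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// (a.1) + (a.2): running inside a pod, the in-cluster config resolves
	// the "kubernetes" service to reach the master.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// (a.3) list all existing events, across every namespace...
	list, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d events\n", len(list.Items))

	// ...then watch from the resource version the list returned.
	w, err := client.CoreV1().Events(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		fmt.Printf("%s: %v\n", ev.Type, ev.Object)
	}
}
```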

b. Process events

  1. Idea: compression. If you see N events identical except for timestamp, transform that into one event with N timestamps (see the sketch after this list).
  2. Alternatively, do the same but for sequences of events. Like if a pod is in a pull -> fail -> pull... loop, make a meta-event and start storing timestamps.
  3. Main purpose: Actively manage the events stored in the kubernetes system, with the goal of maximizing usefulness while storing only a sane number of events.
  4. Idea: provide scriptable/configurable hooks. For example, it would be great for an admin to make a policy here that deletes a pod after N failures to start. Or, up a level, to delete a replication controller that's making crashlooping pods. Or delete a pod if it fails to schedule after 30 minutes. Etc.
  5. Idea: provide scriptable/configurable alerts. As above, but email/page an admin/owner instead of deleting something.
  6. Idea: upon reading an event, update the object's status information to reflect the new information.
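
A minimal sketch of the compression in (b.1), assuming events that are "identical except for timestamp" can be keyed by object, reason, and message (that key choice is an assumption, not something settled here):

```go
package main

import (
	"fmt"
	"time"
)

// eventKey identifies events that are identical except for their timestamps.
// Exactly which fields belong in the key is an open design question.
type eventKey struct {
	Namespace string
	Object    string // e.g. "Pod/nginx-abc12"
	Reason    string
	Message   string
}

// compressedEvent is one logical event carrying every observed timestamp.
type compressedEvent struct {
	Count      int
	Timestamps []time.Time
}

// compressor folds a stream of raw events into compressed ones.
type compressor struct {
	seen map[eventKey]*compressedEvent
}

func newCompressor() *compressor {
	return &compressor{seen: make(map[eventKey]*compressedEvent)}
}

// Observe records one raw occurrence of the event identified by k.
func (c *compressor) Observe(k eventKey, at time.Time) {
	ce, ok := c.seen[k]
	if !ok {
		ce = &compressedEvent{}
		c.seen[k] = ce
	}
	ce.Count++
	ce.Timestamps = append(ce.Timestamps, at)
}

func main() {
	c := newCompressor()
	k := eventKey{Namespace: "default", Object: "Pod/nginx-abc12", Reason: "FailedPull", Message: "image not found"}
	for i := 0; i < 3; i++ {
		c.Observe(k, time.Now())
	}
	fmt.Printf("%q seen %d times\n", k.Reason, c.seen[k].Count)
}
```

The sequence idea in (b.2) would key on a short run of reasons instead of a single one, but the bookkeeping is otherwise the same.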

c. Archive events

  1. Store the events in a DB of some sort for offline analysis. Ideally the latency between an event getting added to k8s and getting stored in the DB is low (see the sketch below).
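
A minimal sketch of (c.1), assuming the destination is anything that accepts newline-delimited JSON (a real deployment would pick a queryable store; the record shape here is illustrative):

```go
package main

import (
	"bufio"
	"encoding/json"
	"os"
	"time"
)

// archivedEvent is the record persisted for offline analysis.
type archivedEvent struct {
	Namespace string    `json:"namespace"`
	Object    string    `json:"object"`
	Reason    string    `json:"reason"`
	Message   string    `json:"message"`
	Timestamp time.Time `json:"timestamp"`
}

// archiver streams events to a sink as JSON lines, one per event.
type archiver struct {
	w   *bufio.Writer
	enc *json.Encoder
}

func newArchiver(f *os.File) *archiver {
	w := bufio.NewWriter(f)
	return &archiver{w: w, enc: json.NewEncoder(w)}
}

func (a *archiver) Archive(e archivedEvent) error {
	if err := a.enc.Encode(e); err != nil {
		return err
	}
	// Flush per event to keep the k8s-to-DB latency low; batching would
	// trade that latency for throughput.
	return a.w.Flush()
}

func main() {
	a := newArchiver(os.Stdout)
	_ = a.Archive(archivedEvent{
		Namespace: "default",
		Object:    "Pod/nginx-abc12",
		Reason:    "FailedScheduling",
		Message:   "no nodes available",
		Timestamp: time.Now(),
	})
}
```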

d. Misc open questions

  1. Run one per namespace or one per cluster? I'd prefer the latter, but we'd have to be cognizant of not leaking information between namespaces.

We can also explore the idea of kubernetes writing events to a non-etcd (or separate etcd) DB in the first place. Events are the primary source of write load on etcd right now. It's good for the moment in that it's exposing some bugs in our use of etcd, but in the long term it's probably more efficient for us to use a different storage mechanism for events.

@davidopp (Member) commented Jan 8, 2015

How do we decide what information to put into PodStatus/NodeStatus and what information to put into discrete events? I believe there is some kind of homomorphism here because in the limit we could

  • embed events in the PodStatus/NodeStatus and remove Event as a discrete API type; or on the other extreme
  • remove PodStatus/NodeStatus as API types and make them virtual objects that the client constructs by observing events

Also, on the archiving question -- could/should we just archive the whole etcd transaction log? (I assume etcd is structured as some kind of log that contains every mutating operation, perhaps with some kind of periodic compaction) This could be useful for post-hoc debugging and would give us archiving of events for free (if you were only interested in events when reading back, you'd skip all the non-event mutations).

@lavalamp (Member, Author) commented Jan 8, 2015

How do we decide what information to put into PodStatus/NodeStatus and what information to put into discrete events?

XStatus should have the entire current state of X; an event about X tells you only one detail. I think there is room for both. I added b.6. above to capture the idea of updating status based on incoming events. It's not clear if that's actually the best thing to do, but it's worth talking about.

Also, on the archiving question -- could/should we just archive the whole etcd transaction log?

This is a fair point-- perhaps we should solve archival globally. But it would be good to store it in a searchable/queryable format.

@derekwaynecarr (Member) commented via email

I prefer that we solve archival globally, and if we use events as the first use case, that is fine for me.

@dchen1107 (Member)

#2298 too

We plan to move XStatus (at least PodStatus) computation to the Kubelet level, which means keeping XStatus as an API type is necessary.

@davidopp (Member) commented Jan 8, 2015

One possible model for Status vs. Events is the following (based on a similar system I worked on recently). I think this is similar to what you were saying.

A component that starts with no information can construct the entire state of the cluster by observing the current values of the Status objects in etcd. If it later becomes backlogged or goes down for a long time, it can come back up and reconstruct the state this way. On the other hand, Event lifetimes are bounded (either by an explicit TTL, or the compaction interval of the transaction log, or the retention policy of the system that archives the transaction log, or whatever). So Events should only be used to convey information that is not mission-critical (for example, you specifically would not want to delegate conversion of Events into Status to an event manager that runs on top of Kubernetes, since losing some Events would corrupt the representation of the cluster's state; having Kubelet and API server compute Status is safest).

This also provides a design guideline for components, which in the previous project we described as "edge-based" vs. "level-based"; the former rely on seeing every object transition, while the latter can figure out what to do based only on their observation of the current Statuses and whatever private state they have stashed away. Unfortunately, by the time we understood this distinction we were already building edge-based components, so we just pretended the store's transaction log would be archived long enough that even if an edge-based component was down (or fell behind) for a long time it would always be able to catch up. This isn't a good/safe assumption to make.
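
A hedged sketch of that distinction, with purely hypothetical Status and Event shapes: the level-based loop can rebuild its picture by re-listing current Statuses at any time, while the edge-based loop silently diverges if it misses transitions that have already aged out.

```go
package main

import "fmt"

// Status is the full current state of an object (hypothetical shape).
type Status struct {
	Name  string
	Ready bool
}

// Event is a single observed transition (hypothetical shape).
type Event struct {
	Name   string
	Reason string
}

// levelBasedSync decides what to do purely from the current Statuses.
// A component written this way can restart from nothing: it just re-lists.
func levelBasedSync(statuses []Status) {
	for _, s := range statuses {
		if !s.Ready {
			fmt.Printf("level-based: %s is not ready, act on it\n", s.Name)
		}
	}
}

// edgeBasedSync acts on each transition as it arrives. If the stream is
// interrupted for longer than events are retained, the missed edges are
// gone and this component's view of the world diverges.
func edgeBasedSync(events <-chan Event) {
	for e := range events {
		fmt.Printf("edge-based: %s -> %s\n", e.Name, e.Reason)
	}
}

func main() {
	levelBasedSync([]Status{{Name: "pod-a", Ready: false}, {Name: "pod-b", Ready: true}})

	ch := make(chan Event, 2)
	ch <- Event{Name: "pod-a", Reason: "FailedPull"}
	ch <- Event{Name: "pod-a", Reason: "BackOff"}
	close(ch)
	edgeBasedSync(ch)
}
```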

@goltermann added the kind/design label on Jan 14, 2015
@goltermann added the priority/backlog label on Jan 28, 2015
@dchen1107 (Member)

We extracted the work items required for the v1 release and filed them separately. Lowering the priority of this to P3 to unblock v1.

@dchen1107 added the priority/awaiting-more-evidence label and removed the priority/backlog label on Feb 3, 2015
@davidopp added the team/control-plane and sig/api-machinery labels and removed the team/master label on Aug 22, 2015
@saad-ali removed their assignment on Mar 17, 2016
@smarterclayton (Contributor)

This is highly theoretical and in practice isn't an urgent issue. Please re-open if you disagree.

@falenn commented Jun 1, 2017 via email
