TODOS #1

Closed
4 of 5 tasks
cakrit opened this issue Mar 7, 2019 · 18 comments

Comments

@cakrit
Contributor

cakrit commented Mar 7, 2019

@cakrit
Contributor Author

cakrit commented Mar 14, 2019

Items #2 and #4 will probably be a bit trickier.

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

@varyumin do you have any ideas about this one?

Starting the master takes a while and upgrading brings it down. Check if and how we can upgrade more gracefully.

I started reading about what happens if we increase replicas to 2 in the statefulset, but I'm not sure it's suitable in this case, or at least I'm not sure of the other config changes we'd need to make...

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

helm/helm#5451 opened for the fifth item.

@varyumin
Contributor

varyumin commented Mar 15, 2019

#2 Yes, I have ideas, and yes, we need to try setting replicas = 2. When we update the StatefulSet, Kubernetes takes down one old netdata-master pod (instance) and excludes it from the Service load balancing. The controller then brings up a new netdata-master pod (instance), checks its readinessProbe, and includes it in the Service load balancing once the probe passes. We have to set a correct readinessProbe. I will try replicas = 2 today; if it works out, I will send a PR.
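A minimal sketch of the relevant StatefulSet fields; the resource name, labels and image here are placeholders for illustration, not necessarily the chart's actual values:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: netdata-master            # placeholder name
spec:
  serviceName: netdata
  replicas: 2                     # keep one pod serving while the other is replaced
  updateStrategy:
    type: RollingUpdate           # pods are replaced one at a time, highest ordinal first
  selector:
    matchLabels:
      app: netdata-master
  template:
    metadata:
      labels:
        app: netdata-master
    spec:
      containers:
        - name: netdata
          image: netdata/netdata  # illustrative image reference
          ports:
            - containerPort: 19999
          # readinessProbe goes here -- see the probe snippet further down
```

With RollingUpdate and two replicas, the controller replaces one pod at a time and only moves on once the new pod reports ready.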

@varyumin
Contributor

#2 Do you know a URL for netdata that answers 200 if netdata is ready and healthy and 50x if unhealthy? I could use it to set the liveness and readiness probes.

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

#2 Do you know a URL for netdata that answers 200 if netdata is ready and healthy and 50x if unhealthy? I could use it to set the liveness and readiness probes.

I've never received 50x from netdata, but you could use http://[host]:19999/api/v1/info
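For reference, that would translate into container-level probes along these lines; 19999 is netdata's default port, and the delay/period values are guesses to be tuned:

```yaml
readinessProbe:                # gate Service traffic until the API answers
  httpGet:
    path: /api/v1/info
    port: 19999
  initialDelaySeconds: 30
  periodSeconds: 10
livenessProbe:                 # restart the container if the API stops answering
  httpGet:
    path: /api/v1/info
    port: 19999
  initialDelaySeconds: 60
  periodSeconds: 30
```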

@varyumin
Contributor

Thx. I will try today.

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

For the replica, the things we'll need to test are:

  • We can't run into a scenario where both pods try to write to the DB that's on the persistent volume (see the claim sketch below).
  • The slaves correctly point only to one master at any given time.

If we run into trouble, we could alternatively have both masters alive, with one replicating metrics to the other. It might be trickier. Let's see what you come up with first! :)
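On the first point, the claim's access mode alone doesn't guarantee a single writer. A sketch of the claim, with a placeholder name and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: netdata-master-db      # placeholder name
spec:
  accessModes:
    - ReadWriteOnce            # mountable read-write by a single *node* only;
                               # two pods scheduled onto that node could still
                               # both mount and write to it
  resources:
    requests:
      storage: 2Gi             # arbitrary size for illustration
```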

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

Fifth item done.

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

Based on the work on the configurations, packaging doesn't make sense at this point, so I removed the third item. Reasoning:
We have potentially complex configurations that will be simplified only with the addition of a feature in Netdata Cloud to apply configurations to slaves. We are also considering extensions to make it easy for netdata to use configuration management tools. I'm not sure how well these will align with Kubernetes, but I do know that many users will want to use a custom version of at least one of the 178+ config files we have. With several hundred config lines in the various config files, a single static Helm chart just doesn't cut it, even with automated generation.

As a result, I'm inclined to leave it as is, with the instruction to download it and let the users modify the statefulset, the daemonset and the configmap as they please. If there's an alternative I am not aware of, I'd appreciate the feedback.

Once we're confident we have a relatively stable chart, we'll of course try to get it in the official helm repo, but that's not what packaging was about.

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

@varyumin, regarding the second replica and handling the persistent volumes, I read a concerning article yesterday. What's your take on it? Could we run into trouble?

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

Regarding this:

See if and what needs to change so we can monitor applications on other pods (e.g. mysql)

We have two things we need to do, neither of which is related to the helm chart:

  • For collectors that get metrics via TCP, attempt autodiscovery of the IP/port and allow specification of endpoints using labels.
  • For collectors that get metrics by reading files, we need to see how people would install netdata as a sidecar container in the same pod.

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

@varyumin if you don't have time for the replicas, can you point me in the right direction so I can give it a shot? I plan to complete this work by next Wednesday.

@varyumin
Contributor

Hi @cakrit! I had some days off and was traveling.
About the second issue: I have thought about it a lot; maybe we need to change the StatefulSet to a Deployment and start using PersistentVolumes and PersistentVolumeClaims. Then we could use replicas = 2, because the volume would be shared, or we could push metrics to an MQ or a DB. Can netdata push metrics from a slave to an MQ or a DB?
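For illustration, that change would look roughly like the sketch below; the names, paths and claim reference are placeholders, and it leaves the shared-writer question open:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: netdata-master                    # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: netdata-master
  template:
    metadata:
      labels:
        app: netdata-master
    spec:
      containers:
        - name: netdata
          image: netdata/netdata          # illustrative image reference
          volumeMounts:
            - name: netdata-db
              mountPath: /var/cache/netdata     # netdata's default database/cache dir
      volumes:
        - name: netdata-db
          persistentVolumeClaim:
            claimName: netdata-master-db        # placeholder claim name
```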

@varyumin
Contributor

For collectors that get metrics by reading files, we need to see how people would install netdata as a sidecar container in the same pod.

If we want to use a sidecar, we first need to write a mutating webhook (https://medium.com/dowjones/how-did-that-sidecar-get-there-4dcd73f1a0a4) or a controller.
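For the manual (non-webhook) case, a hand-written sidecar would look roughly like this; the application name, image and log path are made up for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-with-netdata      # illustrative pod, not part of the chart
spec:
  containers:
    - name: myapp
      image: myapp:latest       # placeholder application image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/myapp
    - name: netdata             # sidecar sharing the pod's volumes
      image: netdata/netdata
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/myapp
          readOnly: true        # file-based collectors only need read access
  volumes:
    - name: app-logs
      emptyDir: {}              # shared scratch volume for the example
```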

@cakrit
Contributor Author

cakrit commented Mar 25, 2019

About the second issue: I have thought about it a lot; maybe we need to change the StatefulSet to a Deployment and start using PersistentVolumes and PersistentVolumeClaims.

From the documentation I understood that only StatefulSets use persistent volumes. Do you have a reference for using persistent volumes with DaemonSets?

Then we could use replicas = 2, because the volume would be shared, or we could push metrics to an MQ or a DB. Can netdata push metrics from a slave to an MQ or a DB?

No, this would be problematic. There's an assumption that each netdata instance maintains its own DB and is the only one writing to it. Our metrics DB is internal and we're building a new one to solve some issues we have.

It looks like it will be difficult and potentially dangerous. We could possibly define a preStart check to ensure that no other pod is using the DB, with a test I haven't thought of yet. But perhaps that's overkill. The thing is, we know that netdata itself does not require the time k8s takes to update the version and bring it back up. If a DaemonSet can indeed use a persistent volume, then perhaps we don't really need a second replica. The slaves are updated much faster.

Thanks for the sidecar info, I'll study it more later. We will first focus on autodiscovery of the available TCP endpoints (though that has little to do with the Helm chart itself).

@cakrit
Contributor Author

cakrit commented Mar 25, 2019

Oh, I don't know if you noticed, but I did put a preStop check on the master, to ensure it has shut down before the pod is removed, so that we know the metrics have been written to the persistent volume.
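For context, the general shape of such a hook is sketched below; this is illustrative, not the chart's exact command, and it assumes killall/pgrep are available in the image:

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # ask netdata to stop, then wait for the process to exit so the
          # DB on the persistent volume is flushed before the pod goes away
          killall netdata || true
          while pgrep netdata > /dev/null 2>&1; do
            sleep 1
          done
```

The pod's terminationGracePeriodSeconds also has to be long enough for the wait loop to finish, otherwise the kubelet kills the container regardless.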

@cakrit
Contributor Author

cakrit commented Apr 5, 2019

I'm leaving the timing issue and closing this one. Now that the slaves have a liveness probe, they also take ~1m to come down and up, so this time seems to be necessary. We can revisit in the future if needed.

@cakrit cakrit closed this as completed Apr 5, 2019