TODOS #1

Closed
4 of 5 tasks
cakrit opened this issue Mar 7, 2019 · 18 comments

Comments

@cakrit
Contributor

cakrit commented Mar 7, 2019

@cakrit
Contributor Author

cakrit commented Mar 14, 2019

Items #2 and #4 will probably be a bit trickier.

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

@varyumin do you have any ideas about this one?

Starting the master takes a while and upgrading brings it down. Check if and how we can upgrade more gracefully.

I started reading about what happens if we increase replicas to 2 in the statefulset, but I'm not sure it's suitable in this case, or at least I'm not sure of the other config changes we'd need to make...

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

helm/helm#5451 opened for the fifth item.

@varyumin
Contributor

varyumin commented Mar 15, 2019

#2 Yes, I have ideas, and yes, we need to try setting replicas = 2. When we update the StatefulSet, Kubernetes takes down one old netdata-master pod (instance) and excludes it from the Service load balancing. The controller then brings up a new netdata-master pod (instance), checks its readinessProbe, and includes it in the Service load balancing once the probe passes. We have to set a correct readinessProbe. I will try replicas = 2 today; if it works out, I will send a PR.
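A minimal sketch of the relevant StatefulSet fields; the resource name, labels and image here are placeholders for illustration, not necessarily the chart's actual values:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: netdata-master            # placeholder name
spec:
  serviceName: netdata
  replicas: 2                     # keep one pod serving while the other is replaced
  updateStrategy:
    type: RollingUpdate           # pods are replaced one at a time, highest ordinal first
  selector:
    matchLabels:
      app: netdata-master
  template:
    metadata:
      labels:
        app: netdata-master
    spec:
      containers:
        - name: netdata
          image: netdata/netdata  # illustrative image reference
          ports:
            - containerPort: 19999
          # readinessProbe goes here -- see the probe snippet further down
```

With RollingUpdate and two replicas, the controller replaces one pod at a time and only moves on once the new pod reports ready.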

@varyumin
Contributor

#2 Do you know a URL for netdata that answers 200 if netdata is ready and healthy and 50x if unhealthy? I could use it to set the liveness and readiness probes.

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

#2 Do you know a URL for netdata that answers 200 if netdata is ready and healthy and 50x if unhealthy? I could use it to set the liveness and readiness probes.

I've never received 50x from netdata, but you could use http://[host]:19999/api/v1/info
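For reference, that would translate into container-level probes along these lines; 19999 is netdata's default port, and the delay/period values are guesses to be tuned:

```yaml
readinessProbe:                # gate Service traffic until the API answers
  httpGet:
    path: /api/v1/info
    port: 19999
  initialDelaySeconds: 30
  periodSeconds: 10
livenessProbe:                 # restart the container if the API stops answering
  httpGet:
    path: /api/v1/info
    port: 19999
  initialDelaySeconds: 60
  periodSeconds: 30
```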

@varyumin
Contributor

Thx. I will try today.

@cakrit
Contributor Author

cakrit commented Mar 15, 2019

For the replica, the things we'll need to test are:

  • We can't run into a scenario where both pods try to write to the DB that's on the persistent volume (see the claim sketch below).
  • The slaves correctly point only to one master at any given time.

If we run into trouble, we could alternatively have both masters alive, with one replicating metrics to the other. It might be trickier. Let's see what you come up with first! :)
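On the first point, the claim's access mode alone doesn't guarantee a single writer. A sketch of the claim, with a placeholder name and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: netdata-master-db      # placeholder name
spec:
  accessModes:
    - ReadWriteOnce            # mountable read-write by a single *node* only;
                               # two pods scheduled onto that node could still
                               # both mount and write to it
  resources:
    requests:
      storage: 2Gi             # arbitrary size for illustration
```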

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

Fifth item done.

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

Based on the work on the configurations, packaging doesn't make sense at this point, so I removed the third item. Reasoning:
We have potentially complex configurations that will be simplified only with the addition of a feature in Netdata Cloud to apply configurations to slaves. We are also considering extensions to make it easy for netdata to use configuration management tools. I'm not sure how well these will align with Kubernetes, but I do know that many users will want to use a custom version of at least one of the 178+ config files we have. With several hundred config lines in the various config files, a single static Helm chart just doesn't cut it, even with automated generation.

As a result, I'm inclined to leave it as is, with the instruction to download it and let the users modify the statefulset, the daemonset and the configmap as they please. If there's an alternative I am not aware of, I'd appreciate the feedback.

Once we're confident we have a relatively stable chart, we'll of course try to get it in the official helm repo, but that's not what packaging was about.

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

@varyumin, regarding the second replica and handling the persistent volumes, I read a concerning article yesterday. What's your take on it? Could we run into trouble?

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

Regarding this:

See if and what needs to change so we can monitor applications on other pods (e.g. mysql)

We have two things we need to do, neither of which is related to the helm chart:

  • For collectors that get metrics via TCP, attempt autodiscovery of the IP/port and allow specification of endpoints using labels.
  • For collectors that get metrics by reading files, we need to see how people would install netdata as a sidecar container in the same pod.

@cakrit
Contributor Author

cakrit commented Mar 21, 2019

@varyumin if you don't have time for the replicas, can you point me in the right direction so I can give it a shot? I plan to complete this work by next Wednesday.

@varyumin
Contributor

Hi @cakrit! I had some days off and was traveling.
About the second issue: I have thought about it a lot; maybe we need to change the StatefulSet to a Deployment and start using PersistentVolumes and PersistentVolumeClaims. Then we could use replicas = 2, because the volume would be shared, or we could push metrics to an MQ or a DB. Can netdata push metrics from a slave to an MQ or a DB?
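For illustration, that change would look roughly like the sketch below; the names, paths and claim reference are placeholders, and it leaves the shared-writer question open:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: netdata-master                    # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: netdata-master
  template:
    metadata:
      labels:
        app: netdata-master
    spec:
      containers:
        - name: netdata
          image: netdata/netdata          # illustrative image reference
          volumeMounts:
            - name: netdata-db
              mountPath: /var/cache/netdata     # netdata's default database/cache dir
      volumes:
        - name: netdata-db
          persistentVolumeClaim:
            claimName: netdata-master-db        # placeholder claim name
```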

@varyumin
Contributor

For collectors that get metrics by reading files, we need to see how people would install netdata as a sidecar container in the same pod.

If we want to use a sidecar, we first need to write a mutating webhook (https://medium.com/dowjones/how-did-that-sidecar-get-there-4dcd73f1a0a4) or a controller.
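For the manual (non-webhook) case, a hand-written sidecar would look roughly like this; the application name, image and log path are made up for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-with-netdata      # illustrative pod, not part of the chart
spec:
  containers:
    - name: myapp
      image: myapp:latest       # placeholder application image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/myapp
    - name: netdata             # sidecar sharing the pod's volumes
      image: netdata/netdata
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/myapp
          readOnly: true        # file-based collectors only need read access
  volumes:
    - name: app-logs
      emptyDir: {}              # shared scratch volume for the example
```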

@cakrit
Contributor Author

cakrit commented Mar 25, 2019

About the second issue: I have thought about it a lot; maybe we need to change the StatefulSet to a Deployment and start using PersistentVolumes and PersistentVolumeClaims.

From the documentation I understood that only StatefulSets use persistent volumes. Do you have a reference for using persistent volumes with DaemonSets?

Then we could use replicas = 2, because the volume would be shared, or we could push metrics to an MQ or a DB. Can netdata push metrics from a slave to an MQ or a DB?

No, this would be problematic. There's an assumption that each netdata instance maintains its own DB and is the only one writing to it. Our metrics DB is internal and we're building a new one to solve some issues we have.

It looks like it will be difficult and potentially dangerous. We could possibly define a preStart check to ensure that no other pod is using the DB, with a test I haven't thought of yet. But perhaps that's overkill. The thing is, we know that netdata itself does not require the time k8s takes to update the version and bring it back up. If a DaemonSet can indeed use a persistent volume, then perhaps we don't really need a second replica. The slaves are updated much faster.

Thanks for the sidecar info, I'll study it more later. We will first focus on autodiscovery of the available TCP endpoints (though that has little to do with the Helm chart itself).

@cakrit
Contributor Author

cakrit commented Mar 25, 2019

Oh, I don't know if you noticed, but I did put a preStop check on the master, to ensure it has shut down before the pod is removed, so that we know the metrics have been written to the persistent volume.
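For context, the general shape of such a hook is sketched below; this is illustrative, not the chart's exact command, and it assumes killall/pgrep are available in the image:

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # ask netdata to stop, then wait for the process to exit so the
          # DB on the persistent volume is flushed before the pod goes away
          killall netdata || true
          while pgrep netdata > /dev/null 2>&1; do
            sleep 1
          done
```

The pod's terminationGracePeriodSeconds also has to be long enough for the wait loop to finish, otherwise the kubelet kills the container regardless.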

@cakrit
Contributor Author

cakrit commented Apr 5, 2019

I'm leaving the timing issue and closing this one. Now that the slaves have a liveness probe, they also take ~1m to come down and up, so this time seems to be necessary. We can revisit in the future if needed.

@cakrit cakrit closed this as completed Apr 5, 2019