This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Production ready? #25

Open
commarla opened this issue Jan 10, 2020 · 10 comments
Assignees
Labels
kind/question Issues relating to requests for information

Comments

@commarla
Contributor

Hi @jrasell

Do you think I can try Chemtrail on my Nomad production cluster? I really need to scale my ASG in line with Nomad.
Given my very good experience with Sherpa, I think I can trust Chemtrail!

I'm just asking whether there are any known issues or anything I should know before trying it.

Thanks,

@jrasell jrasell self-assigned this Jan 10, 2020
@jrasell jrasell added the kind/question Issues relating to requests for information label Jan 10, 2020
@jrasell
Owner

jrasell commented Jan 10, 2020

Hi @commarla. Short answer: yes, I believe Chemtrail is ready and safe to run in a production environment. That said, I would advise some caution due to the nature of the work it performs, which can have monetary impacts and a generally bigger impact than Sherpa.

If you have the ability, I would suggest building from master and using that version, which includes a no-op provider. This provider will not perform any actions against your cluster or cloud environment, but will just log its intentions for analysis. This should give you some initial insight into its behaviour before you allow it to perform actions.
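As a rough sketch of the idea (the interface name and method signatures below are illustrative, not Chemtrail's actual internal API), a no-op provider simply satisfies the scaling-provider interface while only logging what it would have done:

```go
package main

import (
	"fmt"
	"log"
)

// ScaleProvider is a hypothetical stand-in for Chemtrail's provider
// interface; the real interface and signatures may differ.
type ScaleProvider interface {
	ScaleOut(class string, count int) error
	ScaleIn(class string, count int) error
}

// NoopProvider logs scaling intentions instead of acting on the
// cluster or cloud environment.
type NoopProvider struct{}

func (n *NoopProvider) ScaleOut(class string, count int) error {
	log.Printf("noop: would scale out class %q by %d", class, count)
	return nil
}

func (n *NoopProvider) ScaleIn(class string, count int) error {
	log.Printf("noop: would scale in class %q by %d", class, count)
	return nil
}

func main() {
	var p ScaleProvider = &NoopProvider{}
	_ = p.ScaleOut("high-memory", 1) // logged, but no ASG call is made
	fmt.Println("no cluster changes were made")
}
```

Running with such a provider enabled lets you watch the intended scaling decisions in the logs before trusting it with real capacity changes.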

The one bigger item currently missing, which I am looking into this weekend, is the leadership locking mechanism. I hope to have something basic in before releasing v0.1.0, but Chemtrail has more complicated internal workings than Sherpa when performing scaling, and so would benefit from better state-machine modelling, which will take time to develop.

Thanks for the kind words on Sherpa, I hope Chemtrail treats you the same. I will leave this issue open, please add any questions you have on here and I'll be happy to help.

@commarla
Contributor Author

Hi @jrasell
Thanks for your answer. I just compiled master and ran it, but unfortunately I immediately got a panic:

❯ ./chemtrail server --provider-noop-enabled --log-level=DEBUG
4:45PM INF starting HTTP server addr=127.0.0.1:8000
4:45PM DBG setting up Nomad client
4:45PM DBG setting up Consul client
4:45PM DBG setting up in-memory storage backend
4:45PM DBG successfully setup notify log provider
4:45PM DBG setting up HTTP server routes
4:45PM DBG setting up HTTP server system routes
4:45PM DBG setting up HTTP server scale routes
4:45PM DBG setting up HTTP server policy routes
4:45PM INF mounting route endpoint GetSystemHealth method=GET path=/v1/system/health
4:45PM INF mounting route endpoint GetSystemMetrics method=GET path=/v1/system/metrics
4:45PM INF mounting route endpoint PostScaleIn method=POST path=/v1/scale/in/{client-class}
4:45PM INF mounting route endpoint PostScaleOut method=POST path=/v1/scale/out/{client-class}
4:45PM INF mounting route endpoint GetScaleStatus method=GET path=/v1/scale/status
4:45PM INF mounting route endpoint GetScaleStatusInfo method=GET path=/v1/scale/status/{id}
4:45PM INF mounting route endpoint GetPolicies method=GET path=/v1/policies
4:45PM INF mounting route endpoint GetPolicy method=GET path=/v1/policy/{client-class}
4:45PM INF mounting route endpoint PutPolicy method=PUT path=/v1/policy/{client-class}
4:45PM INF mounting route endpoint DeletePolicy method=DELETE path=/v1/policy/{client-class}
4:45PM INF HTTP server successfully listening addr=127.0.0.1:8000
4:45PM INF starting Chemtrail Nomad alloc watcher
4:45PM INF starting Chemtrail Nomad alloc update handler
4:45PM INF starting Chemtrail Nomad nodes watcher
4:45PM INF starting Chemtrail Nomad node update handler
4:45PM INF started scaling state garbage collector handler
4:45PM DBG nodes watcher last index has changed new=3515338 old=1
4:45PM DBG node modify index has changed is greater than last recorded new=3507873 node=cde0c78b-5879-bd82-38f6-3add487a4a58 old=0
4:45PM DBG node modify index has changed is greater than last recorded new=3506030 node=a240a424-374f-4d5e-bf11-f3f5e95bd5a8 old=0
4:45PM DBG received node update message to handle node-eligibility=eligible node-id=cde0c78b-5879-bd82-38f6-3add487a4a58 node-status=ready
4:45PM INF added node to Chemtrail internal state node-allocatable-cpu=9432 node-allocatable-memory=30768 node-class=app node-id=cde0c78b-5879-bd82-38f6-3add487a4a58
4:45PM DBG alloc watcher last index has changed new=3522309 old=1
4:45PM DBG alloc modify index has changed is greater than last recorded new=3522307 old=0
4:45PM DBG node modify index has changed is greater than last recorded new=3493446 node=44caffd1-6989-5a1b-42bb-a40d58e7d394 old=0
4:45PM DBG received node update message to handle node-eligibility=eligible node-id=a240a424-374f-4d5e-bf11-f3f5e95bd5a8 node-status=ready
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19431a1]

goroutine 58 [running]:
github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).getNodeAllocatableResources(...)
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:153
github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).handleNodeAvailableMessage(0xc000120000, 0xc000527c20)
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:96 +0x1d1
github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).handleClientMessage(0xc000120000, 0x19ea900, 0xc000527c20)
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:44 +0x1f2
created by github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).runNodeUpdateHandler
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:16 +0x98

I only tried it quickly; maybe I am missing a config option or something.

@jrasell
Owner

jrasell commented Jan 11, 2020

@commarla I have not seen this before; I'll take a look into it.

@jrasell
Owner

jrasell commented Jan 11, 2020

@commarla I am unable to reproduce this locally straight away, but I have had a quick look through the code to double-check what is going on. Would you be able to share the JSON returned by a call to /v1/node/a240a424-374f-4d5e-bf11-f3f5e95bd5a8 on your cluster? Chemtrail initially seems to process the cluster state as expected, but hits a problem when pulling the CPU resource numbers from this node.

@commarla
Contributor Author

commarla commented Jan 11, 2020

@jrasell I just called the API and saw that this node runs an old version of Nomad (0.8.6). Let me update all my nodes before annoying you any further.

On this environment I have at least 4 different versions of Nomad...

@jrasell
Owner

jrasell commented Jan 11, 2020

@commarla that makes sense; I was looking at the particular struct section of the Nomad API I use and noticed it was added late in 2018, so I thought it could be an older Nomad version. It would make sense to add some safety and fallback to Chemtrail to help avoid this in the future. I'll open a ticket and link it here.
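A minimal sketch of the kind of guard such a fallback implies, using an illustrative trimmed-down node struct rather than the real Nomad API types: on agents older than the version that introduced the resources struct, the field decodes to a nil pointer, which is exactly the dereference seen in the stack trace above.

```go
package main

import "fmt"

// NodeResources is an illustrative stand-in for the Nomad API struct
// that older (0.8.x) agents do not return, leaving the pointer nil.
type NodeResources struct {
	CPU int // allocatable CPU in MHz
}

// Node is a trimmed-down view of the node payload.
type Node struct {
	NodeResources *NodeResources
}

// allocatableCPU guards against the nil pointer, letting the caller
// skip nodes whose agents predate the field instead of panicking.
func allocatableCPU(n *Node) (int, bool) {
	if n.NodeResources == nil {
		return 0, false
	}
	return n.NodeResources.CPU, true
}

func main() {
	old := &Node{}                                          // e.g. a Nomad 0.8.6 agent
	cur := &Node{NodeResources: &NodeResources{CPU: 9432}}  // newer agent

	if cpu, ok := allocatableCPU(cur); ok {
		fmt.Println("cpu:", cpu)
	}
	if _, ok := allocatableCPU(old); !ok {
		fmt.Println("skipping node without NodeResources")
	}
}
```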

@commarla
Contributor Author

Hi @jrasell Correct me if I am wrong, but since your fix I can run Chemtrail with both Nomad 0.10.2 and Nomad 0.9.4, yet only 0.10.2 nodes seem to be taken into account in the check-resource-actual computation.

I'm running an ASG with 40 nodes, and while half of them were still on 0.9.4 I got a CPU actual of 2 percent; I knew that was wrong.
I just finished moving all nodes to 0.10.2 and now the CPU is around 60%, which is much more accurate.
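For illustration only (the figures and function below are made up, not Chemtrail's actual check-resource-actual code): if nodes on the older Nomad version are silently dropped from the tracked state, both their allocatable CPU and the work attributed to them vanish from the calculation, so the reported percentage can land far from the real fleet-wide figure:

```go
package main

import "fmt"

// utilizationPct returns allocated CPU as a percentage of the total
// allocatable CPU across the nodes currently tracked in state.
func utilizationPct(allocatedMHz int, nodeAllocatableMHz []int) float64 {
	total := 0
	for _, c := range nodeAllocatableMHz {
		total += c
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(allocatedMHz) / float64(total)
}

func main() {
	// Hypothetical fleet: 40 nodes of 9400 MHz each, 225600 MHz allocated.
	full := make([]int, 40)
	for i := range full {
		full[i] = 9400
	}
	fmt.Printf("all 40 nodes tracked: %.0f%%\n", utilizationPct(225600, full))

	// If only 20 nodes make it into state, and the allocations attributed
	// to them total just 3760 MHz, the reported figure collapses.
	fmt.Printf("20 nodes tracked:     %.0f%%\n", utilizationPct(3760, full[:20]))
}
```

With these illustrative numbers the first line reports 60% and the second 2%, matching the shape of the discrepancy described above.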

@jrasell
Owner

jrasell commented Feb 6, 2020

@commarla that seems odd; do you have any other information which could help track this down?

@commarla
Contributor Author

commarla commented Feb 7, 2020

Hi @jrasell

I don't have any other info to give you, but I am observing another issue which might be related.

At startup Chemtrail works fine: scaling in and out behaves as expected. But after a few days I have the feeling some nodes are forgotten by Chemtrail and not taken into account during the resource calculation.

I measure the node count, and during the night we can observe a scale-in, but after a few days the minimum instance count increases. It should be approximately the same every night.
It is as if Chemtrail forgot some nodes. I can correlate this with the CPU usage on my ASG, which keeps decreasing at night even though our load is no higher.

And to confirm my hunch: if I restart Chemtrail, I observe a huge repeated scale-in operation, as you can see:

Screenshot 2020-02-07 09 31 36

The red marks are Chemtrail restarts, and you can see the lows increase over time. After a restart the nightly instance count is correct again.

My setup is a bit complex: I am using spot instances (so I have a high number of new/dead Nomad agents) with mixed instance types ("c5.4xlarge", "c5.2xlarge", "c5.9xlarge", "m5.4xlarge", "r5.4xlarge").

@anthonymq

Hello @jrasell
I have the same issue as @commarla. On startup everything works fine, but after some time Chemtrail doesn't scale in clients even when the specified rule is matched.
After a restart everything works again.
I'm using it on a small cluster (2 to 7 t3.large instances).


3 participants