This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Production ready? #25

Open
commarla opened this issue Jan 10, 2020 · 10 comments
Assignees
Labels
kind/question Issues relating to requests for information

Comments

@commarla
Contributor

Hi @jrasell

Do you think I can try Chemtrail on my Nomad production cluster? I really need to scale my ASG in line with Nomad.
Given my very good experience with Sherpa, I think I can trust Chemtrail!

I'm just asking whether there are any known issues or anything I should know before trying it.

Thanks,

@jrasell jrasell self-assigned this Jan 10, 2020
@jrasell jrasell added the kind/question Issues relating to requests for information label Jan 10, 2020
@jrasell
Owner

jrasell commented Jan 10, 2020

Hi @commarla. Short answer: yes, I believe Chemtrail is ready and safe to run in a production environment. That said, I would advise some caution due to the nature of the work it performs, which can have monetary impacts and a generally bigger impact than Sherpa.

If you have the ability, I would suggest building from master and using that version, which includes a no-op provider. This provider will not perform any actions against your cluster or cloud environment, but will just log its intentions for analysis. This should give you some initial insight into its behaviour before you allow it to perform actions.
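As a rough sketch of the idea (the interface name and method signatures below are illustrative, not Chemtrail's actual internal API), a no-op provider simply satisfies the scaling-provider interface while only logging what it would have done:

```go
package main

import (
	"fmt"
	"log"
)

// ScaleProvider is a hypothetical stand-in for Chemtrail's provider
// interface; the real interface and signatures may differ.
type ScaleProvider interface {
	ScaleOut(class string, count int) error
	ScaleIn(class string, count int) error
}

// NoopProvider logs scaling intentions instead of acting on the
// cluster or cloud environment.
type NoopProvider struct{}

func (n *NoopProvider) ScaleOut(class string, count int) error {
	log.Printf("noop: would scale out class %q by %d", class, count)
	return nil
}

func (n *NoopProvider) ScaleIn(class string, count int) error {
	log.Printf("noop: would scale in class %q by %d", class, count)
	return nil
}

func main() {
	var p ScaleProvider = &NoopProvider{}
	_ = p.ScaleOut("high-memory", 1) // logged, but no ASG call is made
	fmt.Println("no cluster changes were made")
}
```

Running with such a provider enabled lets you watch the intended scaling decisions in the logs before trusting it with real capacity changes.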

The one bigger item currently missing, which I am looking into this weekend, is the leadership locking mechanism. I hope to have something basic in before releasing v0.1.0, but Chemtrail has more complicated internal workings than Sherpa when performing scaling, and so would benefit from better state-machine modelling, which will take time to develop.

Thanks for the kind words on Sherpa, I hope Chemtrail treats you the same. I will leave this issue open, please add any questions you have on here and I'll be happy to help.

@commarla
Contributor Author

Hi @jrasell
Thanks for your answer. I just compiled master and ran it, but unfortunately I immediately got a panic:

❯ ./chemtrail server --provider-noop-enabled --log-level=DEBUG
4:45PM INF starting HTTP server addr=127.0.0.1:8000
4:45PM DBG setting up Nomad client
4:45PM DBG setting up Consul client
4:45PM DBG setting up in-memory storage backend
4:45PM DBG successfully setup notify log provider
4:45PM DBG setting up HTTP server routes
4:45PM DBG setting up HTTP server system routes
4:45PM DBG setting up HTTP server scale routes
4:45PM DBG setting up HTTP server policy routes
4:45PM INF mounting route endpoint GetSystemHealth method=GET path=/v1/system/health
4:45PM INF mounting route endpoint GetSystemMetrics method=GET path=/v1/system/metrics
4:45PM INF mounting route endpoint PostScaleIn method=POST path=/v1/scale/in/{client-class}
4:45PM INF mounting route endpoint PostScaleOut method=POST path=/v1/scale/out/{client-class}
4:45PM INF mounting route endpoint GetScaleStatus method=GET path=/v1/scale/status
4:45PM INF mounting route endpoint GetScaleStatusInfo method=GET path=/v1/scale/status/{id}
4:45PM INF mounting route endpoint GetPolicies method=GET path=/v1/policies
4:45PM INF mounting route endpoint GetPolicy method=GET path=/v1/policy/{client-class}
4:45PM INF mounting route endpoint PutPolicy method=PUT path=/v1/policy/{client-class}
4:45PM INF mounting route endpoint DeletePolicy method=DELETE path=/v1/policy/{client-class}
4:45PM INF HTTP server successfully listening addr=127.0.0.1:8000
4:45PM INF starting Chemtrail Nomad alloc watcher
4:45PM INF starting Chemtrail Nomad alloc update handler
4:45PM INF starting Chemtrail Nomad nodes watcher
4:45PM INF starting Chemtrail Nomad node update handler
4:45PM INF started scaling state garbage collector handler
4:45PM DBG nodes watcher last index has changed new=3515338 old=1
4:45PM DBG node modify index has changed is greater than last recorded new=3507873 node=cde0c78b-5879-bd82-38f6-3add487a4a58 old=0
4:45PM DBG node modify index has changed is greater than last recorded new=3506030 node=a240a424-374f-4d5e-bf11-f3f5e95bd5a8 old=0
4:45PM DBG received node update message to handle node-eligibility=eligible node-id=cde0c78b-5879-bd82-38f6-3add487a4a58 node-status=ready
4:45PM INF added node to Chemtrail internal state node-allocatable-cpu=9432 node-allocatable-memory=30768 node-class=app node-id=cde0c78b-5879-bd82-38f6-3add487a4a58
4:45PM DBG alloc watcher last index has changed new=3522309 old=1
4:45PM DBG alloc modify index has changed is greater than last recorded new=3522307 old=0
4:45PM DBG node modify index has changed is greater than last recorded new=3493446 node=44caffd1-6989-5a1b-42bb-a40d58e7d394 old=0
4:45PM DBG received node update message to handle node-eligibility=eligible node-id=a240a424-374f-4d5e-bf11-f3f5e95bd5a8 node-status=ready
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19431a1]

goroutine 58 [running]:
github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).getNodeAllocatableResources(...)
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:153
github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).handleNodeAvailableMessage(0xc000120000, 0xc000527c20)
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:96 +0x1d1
github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).handleClientMessage(0xc000120000, 0x19ea900, 0xc000527c20)
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:44 +0x1f2
created by github.com/jrasell/chemtrail/pkg/scale/resource.(*updateHandler).runNodeUpdateHandler
        /Users/laurentcommarieu/src/iadvize/chemtrail/pkg/scale/resource/nodes.go:16 +0x98

I only tried it quickly; maybe I am missing a config option or something.

@jrasell
Owner

jrasell commented Jan 11, 2020

@commarla I have not seen this before; I'll take a look into it.

@jrasell
Owner

jrasell commented Jan 11, 2020

@commarla I am unable to reproduce this locally straight away, but I have had a quick look through the code to double-check what is going on. Would you be able to share the JSON returned by a call to /v1/node/a240a424-374f-4d5e-bf11-f3f5e95bd5a8 on your cluster? Chemtrail initially seems to process the cluster state as expected, but hits a problem when pulling the CPU resource numbers from this node.

@commarla
Contributor Author

commarla commented Jan 11, 2020

@jrasell I just called the API and saw that this node runs an old version of Nomad (0.8.6). Let me update all my nodes before annoying you any further.

On this environment I have at least 4 different versions of Nomad...

@jrasell
Owner

jrasell commented Jan 11, 2020

@commarla that makes sense; I was looking at the particular struct section of the Nomad API I use and noticed it was added late in 2018, so I thought it could be an older Nomad version. It would make sense to add some safety and fallback to Chemtrail to help avoid this in the future. I'll open a ticket and link it here.
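A minimal sketch of the kind of guard such a fallback implies, using an illustrative trimmed-down node struct rather than the real Nomad API types: on agents older than the version that introduced the resources struct, the field decodes to a nil pointer, which is exactly the dereference seen in the stack trace above.

```go
package main

import "fmt"

// NodeResources is an illustrative stand-in for the Nomad API struct
// that older (0.8.x) agents do not return, leaving the pointer nil.
type NodeResources struct {
	CPU int // allocatable CPU in MHz
}

// Node is a trimmed-down view of the node payload.
type Node struct {
	NodeResources *NodeResources
}

// allocatableCPU guards against the nil pointer, letting the caller
// skip nodes whose agents predate the field instead of panicking.
func allocatableCPU(n *Node) (int, bool) {
	if n.NodeResources == nil {
		return 0, false
	}
	return n.NodeResources.CPU, true
}

func main() {
	old := &Node{}                                          // e.g. a Nomad 0.8.6 agent
	cur := &Node{NodeResources: &NodeResources{CPU: 9432}}  // newer agent

	if cpu, ok := allocatableCPU(cur); ok {
		fmt.Println("cpu:", cpu)
	}
	if _, ok := allocatableCPU(old); !ok {
		fmt.Println("skipping node without NodeResources")
	}
}
```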

@commarla
Contributor Author

Hi @jrasell Correct me if I am wrong, but since your fix I can run Chemtrail with both Nomad 0.10.2 and Nomad 0.9.4, yet only 0.10.2 nodes seem to be taken into account in the check-resource-actual computation.

I'm running an ASG with 40 nodes, and while half of them were still on 0.9.4 I got a CPU actual of 2 percent; I knew that was wrong.
I just finished moving all nodes to 0.10.2 and now the CPU is around 60%, which is much more accurate.
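For illustration only (the figures and function below are made up, not Chemtrail's actual check-resource-actual code): if nodes on the older Nomad version are silently dropped from the tracked state, both their allocatable CPU and the work attributed to them vanish from the calculation, so the reported percentage can land far from the real fleet-wide figure:

```go
package main

import "fmt"

// utilizationPct returns allocated CPU as a percentage of the total
// allocatable CPU across the nodes currently tracked in state.
func utilizationPct(allocatedMHz int, nodeAllocatableMHz []int) float64 {
	total := 0
	for _, c := range nodeAllocatableMHz {
		total += c
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(allocatedMHz) / float64(total)
}

func main() {
	// Hypothetical fleet: 40 nodes of 9400 MHz each, 225600 MHz allocated.
	full := make([]int, 40)
	for i := range full {
		full[i] = 9400
	}
	fmt.Printf("all 40 nodes tracked: %.0f%%\n", utilizationPct(225600, full))

	// If only 20 nodes make it into state, and the allocations attributed
	// to them total just 3760 MHz, the reported figure collapses.
	fmt.Printf("20 nodes tracked:     %.0f%%\n", utilizationPct(3760, full[:20]))
}
```

With these illustrative numbers the first line reports 60% and the second 2%, matching the shape of the discrepancy described above.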

@jrasell
Owner

jrasell commented Feb 6, 2020

@commarla that seems odd; do you have any other information which could help track this down?

@commarla
Contributor Author

commarla commented Feb 7, 2020

Hi @jrasell

I don't have any other info to give you, but I am observing another issue which might be related.

At startup Chemtrail works fine: scaling in and out behaves as expected. But after a few days I have the feeling some nodes are forgotten by Chemtrail and not taken into account during the resource calculation.

I measure the node count, and during the night we can observe a scale-in, but after a few days the minimum instance count increases. It should be approximately the same every night.
It is as if Chemtrail forgot some nodes. I can correlate this with the CPU usage on my ASG, which keeps decreasing at night even though our load is no higher.

And to confirm my hunch: if I restart Chemtrail, I observe a huge repeated scale-in operation, as you can see:

Screenshot 2020-02-07 09 31 36

The red marks are Chemtrail restarts, and you can see the lows increase over time. After a restart the nightly instance count is correct again.

My setup is a bit complex: I am using spot instances (so I have a high number of new/dead Nomad agents) with mixed instance types ("c5.4xlarge", "c5.2xlarge", "c5.9xlarge", "m5.4xlarge", "r5.4xlarge").

@anthonymq

Hello @jrasell
I have the same issue as @commarla. On startup everything works fine, but after some time Chemtrail doesn't scale in clients even when the specified rule is matched.
After a restart everything works again.
I'm using it on a small cluster (2 to 7 t3.large instances).


3 participants