This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Server showing node as unusable #19

Closed
vidit-bhatia opened this issue Jan 16, 2018 · 14 comments

Comments

@vidit-bhatia

vidit-bhatia commented Jan 16, 2018

My server is showing the nodes as unusable. The server used to work just a few days ago. I tried logging on to the node as described in #9.

image

Nothing was mentioned other than an internal error. Is there any way I can start my server or fix these nodes?
{"Code":"InternalError","Message":"","Category":"InternalError","ExitCode":1,"Details":null}
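The payload above has an empty `Message` and a null `Details`, so the only usable signal is the code, category, and exit code. A minimal sketch of extracting just the populated fields, assuming the error always arrives as a JSON object shaped like the one shown (the helper name `summarize_node_error` is hypothetical, not part of any Batch AI API):

```python
import json

# Error payload copied verbatim from the report above.
RAW_ERROR = (
    '{"Code":"InternalError","Message":"","Category":"InternalError",'
    '"ExitCode":1,"Details":null}'
)


def summarize_node_error(raw: str) -> dict:
    """Return only the fields that carry information (drop empty strings and nulls)."""
    payload = json.loads(raw)
    return {key: value for key, value in payload.items() if value not in ("", None)}


summary = summarize_node_error(RAW_ERROR)
print(summary)
```

With nothing in `Message` or `Details`, the startup task logs on the node itself are likely the next place to look.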

@AlexanderYukhanov
Contributor

Thanks for reporting the problem. Can you please provide stdout.txt and stderr.txt from /mnt/batch/tasks/startup/ for investigation? You can solve the problem by resizing the cluster to 0 and back to 2.
az batchai cluster resize -n -g -t 0
az batchai cluster resize -n -g -t 1

Thanks,
Alex
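The resize round trip above can be scripted. This is a minimal sketch that only builds the two command lines, assuming `-n` takes the cluster name, `-g` the resource group, and `-t` the target node count (the flags appear without values in the comment above). The names `my-cluster` and `my-resource-group` and the helper `resize_commands` are hypothetical placeholders.

```python
import shlex
from typing import List


def resize_commands(cluster: str, resource_group: str, target: int) -> List[List[str]]:
    """Build the two resize invocations: scale down to 0, then back up to `target`."""
    base = [
        "az", "batchai", "cluster", "resize",
        "-n", cluster,          # assumed: cluster name
        "-g", resource_group,   # assumed: resource group
        "-t",                   # assumed: target node count follows
    ]
    return [base + ["0"], base + [str(target)]]


# Placeholder names; substitute your own cluster and resource group.
cmds = resize_commands("my-cluster", "my-resource-group", 1)
for cmd in cmds:
    print(shlex.join(cmd))
```

To actually run them, each list could be passed to `subprocess.run(cmd, check=True)`. Waiting for the cluster to reach 0 nodes before issuing the second command is left out of this sketch.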

@AlexanderYukhanov
Contributor

vidit-bhatia, I see you still have one node in the unusable state. You will probably want to delete it if you are not using it via ssh, because it is still allocated and is considered to be in use by you (so it will be included in the bill). You can set the min size for your cluster to 0 so that nodes are deleted when you are not using them.

@AlexanderYukhanov
Contributor

Please note that the system checks every 5 minutes whether it needs to resize the cluster. So it can take up to 5 minutes for BatchAI to start allocating nodes after you submit a job.

@AlexanderYukhanov
Contributor

vidit-bhatia, can you please recreate your cluster? The issue is that your cluster was created before the recent Ubuntu Meltdown patch and kernel update. Now, when your cluster tries to allocate nodes, it gets the new kernel but the old drivers.

@vidit-bhatia
Author

@AlekseiPolkovnikov I will look into it on Monday and see how that can be done, as the cluster is already being used by some people.

@AlexanderYukhanov
Contributor

We have implemented a workaround on our side so that nodes pick up the new drivers after a resize. So you may keep the cluster; just make sure that all your unusable nodes are removed.

@vidit-bhatia
Author

@AlexanderYukhanov It seems the workaround does not work.

@AlexanderYukhanov
Contributor

What is happening?

@vidit-bhatia
Author

vidit-bhatia commented Jan 22, 2018

image

Same as before: as soon as the node starts, it becomes unusable.

@AlexanderYukhanov
Contributor

Taking a look.

@AlexanderYukhanov
Contributor

Now it's a different issue: "Blob fuse mounting failed". Can you please check the account name, key, and container name?

@vidit-bhatia
Author

Looking into it

@vidit-bhatia
Author

@AlexanderYukhanov The Python API does not allow me to update the mount settings. Do I need to delete and recreate the server again?

@AlexanderYukhanov
Contributor

Yes, it's not possible to change the mount settings after the cluster has been created.
