Nexus container doesn't start causing Linux VM's to fail deployment #3642
Comments
@SvenAelterman this could be what you are hitting. |
@migldasilva thanks, something must have changed around the docker install upstream. Think we have a few people hitting this. |
@migldasilva did you try with the sleep and without restarting the service? |
I tried to replicate this earlier today and Nexus deployed successfully for me. The current theory is that the failure is transient; the next step is to try re-running the cloud-init scripts on a failed deployment - https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/cloud-init/#re-running-cloud-init-scripts |
As mentioned by @marrobi, our hope is that this was caused by a transient issue (either related to the update of the nexus container on docker hub, or related to docker's registry APIs). Local testing suggests that fresh deployments of the nexus server work fine. Let's keep an eye on this. I've just confirmed that re-running the cloud-init scripts as described in the docs that @marrobi shared has successfully run on a VM that was previously seeing this issue, and that the nexus server is now working as expected. |
I agree that the line Background: https://docs.docker.com/docker-hub/download-rate-limit/ It wouldn't be feasible to pull with authentication, of course. We might need to mirror the package in Azure somewhere. The way to find out would be to inspect the response header. |
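One way to test the rate-limit theory is to read the `ratelimit-limit` and `ratelimit-remaining` headers that Docker Hub returns for its documented `ratelimitpreview/test` repository. The following is a minimal sketch, assuming `curl` and `jq` are available on the VM; the function names are hypothetical and are not part of AzureTRE.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting Docker Hub pull rate limits.
# Assumes curl and jq are installed; nothing here comes from the AzureTRE repo.

# Reads HTTP response headers on stdin and prints the two rate-limit values.
parse_ratelimit_headers() {
  tr -d '\r' | awk -F': ' '
    tolower($1) == "ratelimit-limit"     { print "limit=" $2 }
    tolower($1) == "ratelimit-remaining" { print "remaining=" $2 }
  '
}

# Requests an anonymous token, then sends a HEAD request for the test manifest
# and prints the raw response headers.
fetch_ratelimit_headers() {
  local token
  token="$(curl -fsS 'https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull' | jq -r .token)"
  curl -fsS --head -H "Authorization: Bearer ${token}" \
    'https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest'
}
```

Running `fetch_ratelimit_headers | parse_ratelimit_headers` on the Nexus VM would show whether the anonymous quota for its outbound IP is exhausted. If the request itself fails with the same EOF error, the problem is more likely connectivity than rate limiting.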
Hmmm. Re-running cloud init didn't entirely fix my broken nexus. The service came up and the web UI was available, but a number of other things (for example I've tried to redeploy and am hitting various other issues (docker run is fine, but other setup steps are not). |
@marrobi I didn't try the sleep without restarting, only both actions together. The reason for not testing it is that I added a loop in the cloud-init script to wait until something in the VM was set correctly. This loop ran for around 5 minutes, and then I stopped it manually. |
OK. Hopefully this is the final update for today: I was wrong in saying that this issue was transient, and that I wasn't seeing it. I've spent much of today going through a cycle of:
Unfortunately, it's a bit of a dice roll as to whether you hit the issue or not. I've hit the issue more times than not. In terms of my comments on re-running the cloud-init stuff, it turns out this isn't always an option. The terraform within this bundle drops files that are used by cloud-init into the While adding a sleep into the script might well reduce the likelihood of this issue happening, I wonder if there are a number of other changes that are needed to make deployment and configuration of the Nexus service more reliable. These might include...
All of this needs more thought, but I wanted to brain-dump it here before I forget. |
As @martinpeck mentioned, adding only a sleep in cloud-init is not enough. On the other hand, at the current stage of the Nexus Shared Service, what would be the disadvantages of restarting the Docker daemon and sleeping? |
@migldasilva rather than sleeping, maybe we could have a script that tries to pull and, if that fails, restarts Docker and tries again? The sleep feels very much like a hack. I'd rather know what the underlying issue is, but I know that might take time. The VM is currently running 18.04, so maybe a first step would be to try a newer version of Ubuntu. |
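The pull-then-restart-then-retry idea can be sketched roughly as below. This is only an illustration of the approach, not the script that was later merged: the function name, the `PULL_CMD` and `RESTART_CMD` override variables, and the default attempt count and delay are all assumptions, and the default restart command assumes a systemd-managed Docker daemon.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper; names and defaults are illustrative only.

# pull_with_retry IMAGE [MAX_ATTEMPTS] [DELAY_SECONDS]
# Tries to pull IMAGE. After each failed attempt except the last, it restarts
# the Docker daemon and waits DELAY_SECONDS before trying again.
# PULL_CMD and RESTART_CMD may be overridden, for example in tests.
pull_with_retry() {
  local image="$1"
  local max_attempts="${2:-5}"
  local delay="${3:-30}"
  local attempt=1

  while (( attempt <= max_attempts )); do
    if ${PULL_CMD:-docker pull} "$image"; then
      echo "pull succeeded on attempt ${attempt}"
      return 0
    fi
    echo "pull failed on attempt ${attempt}/${max_attempts}" >&2
    if (( attempt < max_attempts )); then
      ${RESTART_CMD:-systemctl restart docker} || true
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A cloud-init step could then call something like `pull_with_retry sonatype/nexus3 10 30` before `docker run` (the image name here is an assumption). Capping the number of attempts keeps a permanently broken deployment from looping forever.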
I'm seeing the nexus/vm connection problem as well. I have added the sleep to the cloud-config-yaml but how do I roll that out?
I think I need to delete the one that's pending as well before deploying and creating the shared service again? Tried via the api but 'pending' is locking it. Can I delete in the Azure portal? I can't work out what the 'pending' resource actually is though? |
@marrobi Thanks for the comments. I do agree with you about avoiding "hacks". 😄 I could work on such a script. On the other hand, Ubuntu 20.04 presents the same problem. |
@bwiseman to resolve the operations stuck as pending you will need to go into cosmos, operations, and change the status of the operation to My suggestion if you aren't yet familiar with registering custom templates is to delete the nexus service, and redeploy, logon to the nexus VM and check cloud init has completed, and if it has failed run these commands: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/cloud-init/#re-running-cloud-init-scripts |
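Checking whether cloud-init completed on the Nexus VM can be scripted using the output of `cloud-init status`, which reports a line such as `status: done` or `status: error`. The helper names below are hypothetical, and this is only a rough sketch of the check, not part of the documented troubleshooting steps.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading `cloud-init status` output from stdin.

# Prints the value of the first "status:" line (for example done, running, error).
cloud_init_state() {
  awk -F': ' '$1 == "status" { print $2; exit }'
}

# Succeeds (exit code 0) only when the reported state is "error".
cloud_init_failed() {
  local state
  state="$(cloud_init_state)"
  [ "$state" = "error" ]
}
```

For example, `cloud-init status --long | cloud_init_failed && echo "re-run needed"` would flag a VM where the re-run steps from the linked page apply.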
I think if we had a more robust script that was being run by cloud-init that could help resolve some of these issues. Also taking account of maybe moving the files away from the |
It doesn't seem to be here... in Cosmos DB the 'pending' resource ID is listed in the Resources/Items tag as deploymentStatus -= updated, but there is nothing in the Operations section with that ID and nothing that I can find saying "pending". |
In the operations documents it will be the For reference the valid values are:
|
Or you could delete the certs service via the API. It's just the UI that's blocking the disable/delete operations. |
Yes, I tried via the API UI but it says it will not delete a pending resource.
barbara
|
Ah, sorry, I'm getting confused between two issues. :/ Anyway, look in Cosmos for any operations not completed for that resource. Let me know how you get on. |
In frustration I destroyed the TRE install and reinstalled (a good excuse to check my documentation). All seemed to go well and I'm back to just having the Nexus problem. Unfortunately, this time running the cloud-init scripts manually isn't fixing the issue of the Guacamole connection not working from a Linux VM. Trying to connect to the Nexus web page from a Windows VM doesn't work, and from the Nexus admin VM I see the error you're talking about. Is adding the sleep and redeploying Nexus worth trying?
|
Sorry to hear you're still having issues. As per @migldasilva's solution, adding:
To cloud-config.yaml and reregistering the template has solved the issue for him. It's odd as sometimes it just works. If you drop me an email to marrobi at microsoft.com I can maybe screen share to try and help you along if that would be useful? @migldasilva are you intending to do a PR? Thanks. |
@migldasilva are you intending to do this, otherwise I will do a PR with a script like:
|
While looking at @SvenAelterman's PR I noticed the firewall rules are added in the pipeline AFTER the VM is deployed. This could well be the source of the issue. Either way, the retry script will fix it. The reason the pipeline is that way round is that the firewall rules use output from the terraform. |
@marrobi I wasn't able to create a PR before going on vacation. As you mentioned in this comment, my approach used a script quite similar to that one. I have just pulled the newest code and saw that such a script was already merged (including a timeout option for limiting the number of retries). Thus, a new PR won't be required. Thanks. |
Describe the bug
When deploying Nexus, all the steps run as expected, but the Nexus container does not start.
After some debugging, I found the following in the /var/log/cloud-config-output.log file:
The most important line is:
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
I've been trying some solutions and tests found on the Internet. The most effective one was to edit the file templates/shared_services/sonatype-nexus-vm/terraform/cloud-config.yaml so that the runcmd section looks like this:
This means that the Docker daemon is restarted and a 1-minute delay is added before the Nexus container is started (lines 05 and 06).
Steps to reproduce