Nexus container doesn't start causing Linux VM's to fail deployment #3642

migldasilva · 2023-08-01T22:02:40Z

Describe the bug

When deploying Nexus, all the steps run as expected, but the Nexus container does not start.

After some debugging, I found the following in /var/log/cloud-config-output.log file:

// OMITTED MESSAGES RELATED TO PACKAGES INSTALLATION 

Processing triggers for mime-support (3.60ubuntu1) ...
Processing triggers for ureadahead (0.100.0-21) ...
Unable to find image 'sonatype/nexus3:latest' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
See 'docker run --help'.
Checking for Nexus admin password file...
ERROR - Timeout while waiting for nexus-data/admin.password to be created
[
  {
    "environmentName": "AzureCloud",
    "id": "88888888-6258-ABCD-XXXX-UUUKKKKKKKKK",
    "isDefault": true,
    "name": "N/A(tenant level account)",
    "state": "Enabled",
    "tenantId": "abababab-XXXX-YYYY-4444-1234567890ab",
    "user": {
      "assignedIdentityInfo": "MSIResource-/subscriptions/11111111-2222-333-4444-5555555555/resourceGroups/rg-miguel07dev/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-nexus-miguel07dev",
      "name": "userAssignedIdentity",
      "type": "servicePrincipal"
    }
  }
]
Getting cert and cert password from Keyvault...
Checking for nexus-data/keystores directory...
ERROR - Timeout while waiting for Nexus to create nexus-data/keystores
Checking for ./nexus_repos_config directory...
Found config file: /tmp/nexus_repos_config/almalinux_proxy_conf.json. Sending to Nexus...
Response received from Nexus: 000
Response received from Nexus: 000
Response received from Nexus: 000

// THIS MESSAGE REPEATED SEVERAL TIMES

The most important line is:

docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.

I've been trying some solutions and tests found on the Internet. The most effective one was to edit the file templates/shared_services/sonatype-nexus-vm/terraform/cloud-config.yaml so that the runcmd section looks like this:

01 runcmd:
02  - export DEBIAN_FRONTEND=noninteractive
03  # Give the Nexus process write permissions on the folder mounted as persistent volume
04  - chown -R 200 /etc/nexus-data
05  - systemctl restart docker.service
06  - sleep 60
07  # Run the nexus container with mapped volume for nexus config
08  - docker run -d -p 80:8081 -p 443:8443 -p 8083:8083 -v /etc/nexus-data:/nexus-data
09    --restart always
10    --name nexus
11    --log-driver local
12    sonatype/nexus3
13  # Reset the admin password of Nexus to the one created by TF and stored in KeyVault
14  - bash /tmp/reset_nexus_password.sh "${NEXUS_ADMIN_PASSWORD}"
15  # Invoke Nexus SSL configuration (which will also be ran as CRON daily to renew cert)
16  - bash /etc/cron.daily/configure_nexus_ssl.sh
17  # Configure Nexus repositories
18  - bash /tmp/configure_nexus_repos.sh "${NEXUS_ADMIN_PASSWORD}"

This restarts the Docker daemon and adds a one-minute delay before the Nexus container is started (lines 05 and 06).

Steps to reproduce

Deploy Nexus shared service
Check if Nexus container is running


marrobi · 2023-08-03T08:06:16Z

@SvenAelterman this could be what you are hitting.

marrobi · 2023-08-03T08:07:40Z

@migldasilva thanks, something must have changed around the docker install upstream. Think we have a few people hitting this.

marrobi · 2023-08-03T09:18:00Z

@migldasilva did you try with the sleep and without restarting the service?

marrobi · 2023-08-03T10:13:57Z

I tried to replicate earlier today and Nexus has deployed successfully for me.

The current theory is that the cause is transient, so the next step is to try re-running the cloud-init scripts on a failed deployment: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/cloud-init/#re-running-cloud-init-scripts

martinpeck · 2023-08-03T10:22:05Z

As mentioned by @marrobi, our hope is that this was caused by a transient issue (either related to the update of the nexus container on docker hub, or related to docker's registry APIs). Local testing suggests that fresh deployments of the nexus server work fine.

Let's keep an eye on this.

I've just confirmed that re-running the cloud-init scripts as described in the docs that @marrobi shared has successfully run on a VM that was previously seeing this issue, and that the nexus server is now working as expected.

SvenAelterman · 2023-08-03T11:33:55Z

I agree that the line docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF. seems to be key. I wonder if this is related to Docker Hub's throttling policies. This would explain the transient nature of the issue.

Background: https://docs.docker.com/docker-hub/download-rate-limit/

It wouldn't be feasible to pull with authentication, of course. We might need to mirror the package in Azure somewhere.

The way to find out would be to inspect the response header.
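One way to inspect those headers, as a rough sketch: the Docker documentation linked above describes requesting an anonymous pull token and then issuing a HEAD request against the ratelimitpreview/test repository, which returns ratelimit-limit and ratelimit-remaining headers. The function names below are hypothetical, and curl and jq are assumed to be installed on the VM.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical; requires curl and jq.

# Read HTTP response headers on stdin and print the remaining-pulls count
# from the ratelimit-remaining header (e.g. "76;w=21600" -> "76").
parse_ratelimit_remaining() {
    tr -d '\r' | awk -F': ' 'tolower($1) == "ratelimit-remaining" {
        split($2, parts, ";")
        print parts[1]
    }'
}

# Fetch an anonymous pull token, then HEAD the test manifest and print the
# response headers to stdout.
fetch_ratelimit_headers() {
    local token
    token=$(curl -fsS "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token) || return 1
    curl -sS --head -H "Authorization: Bearer ${token}" \
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"
}
```

Running `fetch_ratelimit_headers | parse_ratelimit_remaining` on the Nexus VM would print the remaining anonymous pulls for that VM's outbound IP. Empty output means the header was not present in the response, which by itself neither confirms nor rules out throttling.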

martinpeck · 2023-08-03T11:45:20Z

Hmmm. Re-running cloud init didn't entirely fix my broken nexus.

The service came up and the web UI was available, but a number of other things (for example pip install on VMs within workspaces) failed due to configuration/setup not having properly run.

I've tried to redeploy and am hitting various other issues (docker run is fine, but other setup steps are not).

migldasilva · 2023-08-03T16:28:47Z

@marrobi I didn't try the sleep without restarting, only both actions together. The reason for not testing it is that I had added a loop in the cloud-init script to wait until something in the VM was correctly set. This loop ran for around 5 minutes, and then I stopped it manually.

martinpeck · 2023-08-03T19:01:21Z

OK. Hopefully this is the final update for today:

I was wrong in saying that this issue was transient, and that I wasn't seeing it.

I've spent much of today going through a cycle of:

disable and then delete the Nexus shared service
deploy the Nexus shared service
observe it to be broken, and try to bring it back to life by re-running cloud-init
repeat

Unfortunately, it's a bit of a dice roll as to whether you hit the issue or not. I've hit the issue more times than not.

In terms of my comments on re-running the cloud-init stuff, it turns out this isn't always an option. The Terraform within this bundle drops files that are used by cloud-init into the /tmp folder. Ubuntu will clear this folder on reboot (and may clear it at other times), so re-running cloud-init after a reboot is then impossible. So, other than "delete it and try again", I've not found a sure-fire workaround for this issue.

While adding a sleep into the script might well reduce the likelihood of this issue happening, I wonder if there are a number of other changes that are needed to make deployment and configuration of the Nexus service more reliable. These might include...

remove the use of the file provisioner within Terraform. This is generally discouraged, and any failures in copying the files (in this case, the nexus_repos_config folder containing JSON files) are difficult to debug. TF doesn't manage the state of these files, which makes recovery harder too. Before I realised that the files were being cleaned up I was looking at the Resource Processor logs to determine if the nexus config had been copied to the VM, and the use of the file provisioner makes it very hard to tell.
when deploying setup scripts to the VM, avoid using the /tmp folder. This folder gets cleaned up periodically/at reboot, and this makes recovery of the service harder.
investigate ways of ensuring that errors in cloud-init are more visible. Whether that's a health check, or something else, there should be something obvious. This is especially true if the nexus service starts but the config setup fails. In this situation (which I had several times today while trying to recover a broken install) you have a fully deployed Nexus (with API/website) but there are no package libraries configured, and so the user experience within a workspace is still broken.
potentially find an entirely different way to deploy nexus into the Azure TRE

All of this needs more thought, but I wanted to brain-dump it here before I forget.
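As a rough illustration of making cloud-init errors more visible, the sketch below scans a cloud-init output log for the two failure patterns seen in this issue (the "ERROR - " lines emitted by the setup scripts and the Docker daemon error) and exits non-zero if any are found. The function name and the choice of patterns are assumptions for illustration, not part of the Azure TRE bundle; the log path is passed in rather than assumed.

```shell
#!/bin/bash
# Sketch only: report_cloud_init_errors is a hypothetical helper.

# usage: report_cloud_init_errors LOG_FILE
# Prints matching lines (with line numbers) and returns:
#   0 if no known failure pattern was found
#   1 if at least one failure line was found
#   2 if the log file cannot be read
report_cloud_init_errors() {
    local log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "Cannot read log file: $log_file" >&2
        return 2
    fi
    # grep exits 0 when it finds a match, so a match means failures exist.
    if grep -n -E '^ERROR - |Error response from daemon' "$log_file"; then
        return 1
    fi
    return 0
}
```

A check like this could run at the end of the cloud-init sequence or from a scheduled job, with the exit code feeding whatever health signal is chosen.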

migldasilva · 2023-08-04T08:35:27Z

As @martinpeck mentioned, adding only a sleep in cloud-init is not enough.

On the other hand, at the current stage of the Nexus shared service, what would be the disadvantages of restarting the Docker daemon and sleeping?

marrobi · 2023-08-04T09:35:00Z

@migldasilva rather than sleeping, maybe we could have a script that tries to pull and, if that fails, restarts Docker and tries again? The sleep feels very much like a hack.

I'd rather know what the underlying issue was, but know that might take time.

The VM is currently running 18.04, maybe a first step would be to try with a newer version of Ubuntu.

bwiseman · 2023-08-04T10:23:20Z

I'm seeing the Nexus/VM connection problem as well. I have added the sleep to cloud-config.yaml, but how do I roll that out?

Nexus shared service 'tre-shared-service-certs' is stuck in 'pending' so I can't delete that one.
Nexus shared service 'tre-shared-service-sonatype-nexus' is disabled so could be deleted

I think I need to delete the one that's pending as well before deploying and creating the shared service again? Tried via the api but 'pending' is locking it. Can I delete in the Azure portal? I can't work out what the 'pending' resource actually is though?

migldasilva · 2023-08-04T11:13:18Z

@marrobi Thanks for the comments. I do agree with you about avoiding "hacks". 😄

I could work on such a script. On the other hand, Ubuntu 20.04 presents the same problem.

marrobi · 2023-08-04T12:17:07Z

@bwiseman to resolve the operations stuck as pending you will need to go into cosmos, operations, and change the status of the operation to Failed, see: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/manually-editing-resources/

My suggestion if you aren't yet familiar with registering custom templates is to delete the nexus service, and redeploy, logon to the nexus VM and check cloud init has completed, and if it has failed run these commands: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/cloud-init/#re-running-cloud-init-scripts

marrobi · 2023-08-04T12:18:10Z

> @marrobi Thanks for the comments. I do agree with you about avoiding "hacks". 😄
>
> I could work on such script. On the other hand, Ubuntu 20.04 presents the same problem.

I think if we had a more robust script that was being run by cloud-init that could help resolve some of these issues. Also taking account of maybe moving the files away from the /tmp folder.

bwiseman · 2023-08-07T08:04:41Z

> @bwiseman to resolve the operations stuck as pending you will need to go into cosmos, operations, and change the status of the operation to Failed, see: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/manually-editing-resources/

It doesn't seem to be here... In Cosmos DB, the 'pending' resource ID is listed in the Resources/Items tag as deploymentStatus -= updated, but there is nothing in the Operations section with that ID, and nothing that I can find saying "pending".

marrobi · 2023-08-07T10:00:38Z

> > @bwiseman to resolve the operations stuck as pending you will need to go into cosmos, operations, and change the status of the operation to Failed, see: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/manually-editing-resources/
>
> It doesn't seem to be here... in CosmoDB the 'pending' resource ID is listed in the Resources/Items tag as deploymentStatus -= updated but there is nothing in the Operations section with that ID and nothing that I can find saying "pending"

In the operations documents, it will be the one whose resource_id matches the service ID. It might say awaiting_action; I imagine it is one of the last ones in the list of operation documents.

For reference the valid values are:

# Resource Action Status
RESOURCE_STATUS_AWAITING_ACTION = "awaiting_action"
RESOURCE_ACTION_STATUS_INVOKING = "invoking_action"
RESOURCE_ACTION_STATUS_SUCCEEDED = "action_succeeded"
RESOURCE_ACTION_STATUS_FAILED = "action_failed"

marrobi · 2023-08-07T10:09:21Z

Or you could delete the certs service via the API. It's just the UI that's blocking the disable/delete operations.

bwiseman · 2023-08-07T10:22:14Z

> Or you could delete the certs service via the API. It's just the UI that's blocking the disable/delete operations.

Yes, I tried via the API UI, but it says it will not delete a pending resource. Barbara

marrobi · 2023-08-07T10:27:30Z

Ok, just replicated:

If change any operations for the shared service with status awaiting_action to action_failed

marrobi · 2023-08-07T10:30:16Z

Ah, sorry, I'm getting confused between two issues. :/

Anyway, look in Cosmos for any operations not completed for that resource. Let me know how you get on.

bwiseman · 2023-08-08T14:27:39Z

In frustration I destroyed the TRE install and reinstalled (good excuse to check my documentation). All seemed to go well and I'm back to just having the Nexus problem. Unfortunately, this time running the cloud-init scripts manually isn't fixing the issue of the Guacamole connection not working from a Linux VM. Trying to connect to the Nexus web page from a Windows VM doesn't work, and from the Nexus admin VM I see the error you're talking about. Is adding the sleep and redeploying Nexus worth trying?

docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
See 'docker run --help'.
Checking for Nexus admin password file...
ERROR - Timeout while waiting for nexus-data/admin.password to be created
Setting up Nexus SSL...
Checking for nexus-data/keystores directory...
ERROR - Timeout while waiting for Nexus to create nexus-data/keystores
Checking for ./nexus_repos_config directory...

marrobi · 2023-08-08T15:00:29Z

Sorry to hear you're still having issues. As per @migldasilva's solution, adding:

systemctl restart docker.service
sleep 60

to cloud-config.yaml and re-registering the template has solved the issue for him. It's odd, as sometimes it just works.

If you drop me an email to marrobi at microsoft.com I can maybe screen share to try and help you along if that would be useful?

@migldasilva are you intending to do a PR? Thanks.

marrobi · 2023-08-11T10:20:57Z

@migldasilva are you intending to do this, otherwise I will do a PR with a script like:

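#!/bin/bash

# Keep pulling the Nexus image until the pull succeeds. After each failure,
# restart the Docker daemon and wait a minute before trying again.
until docker pull sonatype/nexus3; do
    echo "Failed to pull image, restarting Docker service"
    systemctl restart docker.service
    sleep 60
done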
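The loop above retries forever if the registry never becomes reachable. A bounded variant might look like the sketch below, which gives up after a fixed number of attempts so that cloud-init can report a failure instead of hanging. The function name, its parameters, and the restart_docker wrapper are hypothetical, and this is not necessarily how any merged script handles it.

```shell
#!/bin/bash
# Sketch only: retry_with_limit and restart_docker are hypothetical names.

# usage: retry_with_limit MAX_ATTEMPTS DELAY_SECONDS RECOVERY_CMD COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached. After each
# failed attempt except the last, runs RECOVERY_CMD (a command or function
# name taking no arguments) and sleeps DELAY_SECONDS.
retry_with_limit() {
    local max_attempts="$1" delay="$2" recovery="$3"
    shift 3
    local attempt=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed, recovering and retrying in ${delay}s" >&2
        "$recovery"
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Recovery step matching the workaround discussed in this issue.
restart_docker() {
    systemctl restart docker.service
}
```

For example, `retry_with_limit 10 60 restart_docker docker pull sonatype/nexus3` would try the pull up to ten times, restarting Docker and waiting a minute between attempts.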

marrobi · 2023-08-16T14:20:15Z

While looking at @SvenAelterman's PR I noticed the firewall rules are added in the pipeline AFTER the VM is deployed. This could well be the source of the issue. Either way the retry script will fix it.

The pipeline is ordered that way because the firewall rules use output from the Terraform.

migldasilva · 2023-08-21T08:01:15Z

@marrobi I wasn't able to create a PR before going on vacation. As you mentioned in this comment, my approach was to use a script quite similar to that one.

I have just pulled the newest code and saw that such a script has already been merged (including a timeout option to limit the number of retries). So there's no need to create a new PR.

Thanks.

migldasilva added the bug Something isn't working label Aug 1, 2023

marrobi added this to Azure TRE - Engineering Aug 3, 2023

marrobi moved this to Up Next in Azure TRE - Engineering Aug 3, 2023

marrobi pinned this issue Aug 3, 2023

marrobi changed the title ~~Nexus container doesn't start~~ Nexus container doesn't start causing Linux VM's to fail deployment Aug 3, 2023

marrobi unpinned this issue Aug 3, 2023

marrobi pinned this issue Aug 3, 2023

marrobi mentioned this issue Aug 3, 2023

Guacamole connections start to fail #3641

Closed

marrobi self-assigned this Aug 15, 2023

marrobi mentioned this issue Aug 15, 2023

Nexus: Add wait for docker, move scripts out of tmp, upgrade image. #3675

Closed

1 task

SvenAelterman mentioned this issue Aug 15, 2023

Update Nexus reliability #3676

Merged

SvenAelterman self-assigned this Aug 16, 2023

marrobi mentioned this issue Aug 16, 2023

Guacamole : 500 Internal Server Error #3679

Open

marrobi closed this as completed in #3676 Aug 18, 2023

github-project-automation bot moved this from Up Next to Done in Azure TRE - Engineering Aug 18, 2023

SvenAelterman unpinned this issue Aug 24, 2023
