Nexus container doesn't start causing Linux VM's to fail deployment #3642

Closed
migldasilva opened this issue Aug 1, 2023 · 26 comments · Fixed by #3676

@migldasilva
Contributor

Describe the bug

When deploying Nexus, all the steps run as expected, but the Nexus container does not start.

After some debugging, I found the following in the /var/log/cloud-config-output.log file:

// OMITTED MESSAGES RELATED TO PACKAGES INSTALLATION 

Processing triggers for mime-support (3.60ubuntu1) ...
Processing triggers for ureadahead (0.100.0-21) ...
Unable to find image 'sonatype/nexus3:latest' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
See 'docker run --help'.
Checking for Nexus admin password file...
ERROR - Timeout while waiting for nexus-data/admin.password to be created
[
  {
    "environmentName": "AzureCloud",
    "id": "88888888-6258-ABCD-XXXX-UUUKKKKKKKKK",
    "isDefault": true,
    "name": "N/A(tenant level account)",
    "state": "Enabled",
    "tenantId": "abababab-XXXX-YYYY-4444-1234567890ab",
    "user": {
      "assignedIdentityInfo": "MSIResource-/subscriptions/11111111-2222-333-4444-5555555555/resourceGroups/rg-miguel07dev/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-nexus-miguel07dev",
      "name": "userAssignedIdentity",
      "type": "servicePrincipal"
    }
  }
]
Getting cert and cert password from Keyvault...
Checking for nexus-data/keystores directory...
ERROR - Timeout while waiting for Nexus to create nexus-data/keystores
Checking for ./nexus_repos_config directory...
Found config file: /tmp/nexus_repos_config/almalinux_proxy_conf.json. Sending to Nexus...
Response received from Nexus: 000
Response received from Nexus: 000
Response received from Nexus: 000

// THIS MESSAGE REPEATED SEVERAL TIMES

The most important line is:

docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.

I've been trying some solutions and tests found on the Internet. The most effective one was to edit the file templates/shared_services/sonatype-nexus-vm/terraform/cloud-config.yaml so that the runcmd section looks like this:

01 runcmd:
02  - export DEBIAN_FRONTEND=noninteractive
03  # Give the Nexus process write permissions on the folder mounted as persistent volume
04  - chown -R 200 /etc/nexus-data
05  - systemctl restart docker.service
06  - sleep 60
07  # Run the nexus container with mapped volume for nexus config
08  - docker run -d -p 80:8081 -p 443:8443 -p 8083:8083 -v /etc/nexus-data:/nexus-data
09    --restart always
10    --name nexus
11    --log-driver local
12    sonatype/nexus3
13  # Reset the admin password of Nexus to the one created by TF and stored in KeyVault
14  - bash /tmp/reset_nexus_password.sh "${NEXUS_ADMIN_PASSWORD}"
15  # Invoke Nexus SSL configuration (which will also be ran as CRON daily to renew cert)
16  - bash /etc/cron.daily/configure_nexus_ssl.sh
17  # Configure Nexus repositories
18  - bash /tmp/configure_nexus_repos.sh "${NEXUS_ADMIN_PASSWORD}"

This restarts the Docker daemon and adds a one-minute delay before the Nexus container is started (lines 05 and 06).

Steps to reproduce

  1. Deploy Nexus shared service
  2. Check if Nexus container is running
@migldasilva migldasilva added the bug Something isn't working label Aug 1, 2023
@marrobi
Member

marrobi commented Aug 3, 2023

@SvenAelterman this could be what you are hitting.

@marrobi
Member

marrobi commented Aug 3, 2023

@migldasilva thanks, something must have changed around the docker install upstream. Think we have a few people hitting this.

@marrobi marrobi pinned this issue Aug 3, 2023
@marrobi marrobi changed the title Nexus container doesn't start Nexus container doesn't start causing Linux VM's to fail deployment Aug 3, 2023
@marrobi
Member

marrobi commented Aug 3, 2023

@migldasilva did you try with the sleep and without restarting the service?

@marrobi
Member

marrobi commented Aug 3, 2023

I tried to replicate earlier today and Nexus has deployed successfully for me.

The current theory is that the cause is transient. The next step is to try re-running the cloud-init scripts on a failed deployment - https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/cloud-init/#re-running-cloud-init-scripts

@martinpeck
Member

As mentioned by @marrobi, our hope is that this was caused by a transient issue (either related to the update of the nexus container on docker hub, or related to docker's registry APIs). Local testing suggests that fresh deployments of the nexus server work fine.

Let's keep an eye on this.

I've just confirmed that re-running the cloud-init scripts as described in the docs that @marrobi shared has successfully run on a VM that was previously seeing this issue, and that the nexus server is now working as expected.

@marrobi marrobi unpinned this issue Aug 3, 2023
@SvenAelterman
Collaborator

I agree that the line docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF. seems to be key. I wonder if this is related to Docker Hub's throttling policies. This would explain the transient nature of the issue.

Background: https://docs.docker.com/docker-hub/download-rate-limit/

It wouldn't be feasible to pull with authentication, of course. We might need to mirror the package in Azure somewhere.

The way to find out would be to inspect the response header.
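
A minimal sketch of how those headers could be inspected, assuming curl and jq are available on the VM. The token and manifest URLs for the `ratelimitpreview/test` repository are the ones described on the Docker Hub rate-limit page linked above; the function names are illustrative only and are not part of the Azure TRE code.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not taken from the TRE repo.

# Keep only the ratelimit-* lines from raw HTTP response headers on stdin,
# stripping carriage returns and lower-casing for easier comparison.
parse_ratelimit_headers() {
    tr -d '\r' | grep -i '^ratelimit-' | tr '[:upper:]' '[:lower:]'
}

# Request an anonymous pull token, then send a HEAD request for the test
# manifest and print the rate-limit headers returned by the registry.
check_dockerhub_ratelimit() {
    local token
    token=$(curl -fsS "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token) || return 1
    curl -sS --head -H "Authorization: Bearer ${token}" \
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" | parse_ratelimit_headers
}
```

Running `check_dockerhub_ratelimit` on the Nexus VM should print lines such as `ratelimit-limit` and `ratelimit-remaining` if the registry is reachable. If the request itself fails with the same EOF error, that would point at connectivity rather than throttling.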

@martinpeck
Member

Hmmm. Re-running cloud init didn't entirely fix my broken nexus.

The service came up and the web UI was available, but a number of other things (for example pip install on VMs within workspaces) failed due to configuration/setup not having properly run.

I've tried to redeploy and am hitting various other issues (docker run is fine, but other setup steps are not).

@migldasilva
Contributor Author

migldasilva commented Aug 3, 2023

@marrobi I didn't try sleep without restarting, only both actions together. The reason for not testing it is that I added a loop in the cloud-init script to wait until something in the VM was correctly set. This loop ran for around 5 minutes, and then I stopped it manually.

@martinpeck
Member

OK. Hopefully this is the final update for today:

I was wrong in saying that this issue was transient, and that I wasn't seeing it.

I've spent much of today going through a cycle of:

  • disable and then delete the Nexus shared service
  • deploy the Nexus shared service
  • observe it to be broken, and try and bring it back to life re-running cloud-init
  • repeat

Unfortunately, it's a bit of a dice roll as to whether you hit the issue or not. I've hit the issue more times than not.

In terms of my comments on re-running the cloud-init stuff, it turns out this isn't always an option. The terraform within this bundle drops files that are used by cloud-init into the /tmp folder. Ubuntu will clear this folder on reboot (and may clear it at other times) so re-running cloud-init after a reboot is then impossible. So, other than "delete it and try again" I've not found a sure-fire work around to this issue.

While adding a sleep into the script might well reduce the likelihood of this issue happening, I wonder if there are a number of other changes that are needed to make deployment and configuration of the Nexus service more reliable. These might include...

  • remove the use of the file provisioner within Terraform. This is generally discouraged, and any failures in copying the files (in this case, the nexus_repos_config folder containing JSON files) are difficult to debug. TF doesn't manage the state of these files, which makes recovery harder too. Before I realised that the files were being cleaned up I was looking at the Resource Processor logs to determine if the nexus config had been copied to the VM, and the use of the file provisioner makes it very hard to tell.
  • when deploying setup scripts to the VM, avoid using the /tmp folder. This folder gets cleaned up periodically/at reboot, and this makes recovery of the service harder.
  • investigate ways of ensuring that errors in cloud-init are more visible. Whether that's a health check, or something else, there should be something obvious. This is especially true if the nexus service starts, but the config setup fails. In this situation (which I had several times today while trying to recover a broken install) you have a fully deployed Nexus (with API/website) but there are no package libraries configured and so the user experience within a workspace is still broken.
  • potentially find an entirely different way to deploy nexus into the Azure TRE

All of this needs more thought, but I wanted to brain-dump it here before I forget.
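
As a rough illustration of the /tmp point, here is a small sketch that reports which setup files are no longer present before cloud-init is re-run. The function name is hypothetical; the paths in the usage note are the ones that appear in the runcmd section and log output quoted earlier in this issue.

```shell
#!/bin/bash
# Sketch only: report_missing_paths is an illustrative helper, not TRE code.

# Print each argument that does not exist on disk.
# Returns 0 if every path exists, 1 if at least one is missing.
report_missing_paths() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "MISSING: $path"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `report_missing_paths /tmp/reset_nexus_password.sh /tmp/configure_nexus_repos.sh /tmp/nexus_repos_config /etc/cron.daily/configure_nexus_ssl.sh` would show whether a reboot has already removed the files that the re-run depends on.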

@migldasilva
Contributor Author

migldasilva commented Aug 4, 2023

As @martinpeck mentioned, adding only a sleep in cloud-init is not enough.

On the other hand, at the current stage of the Nexus Shared Service, what would be the disadvantages of restarting the Docker daemon and sleeping?

@marrobi
Member

marrobi commented Aug 4, 2023

@migldasilva rather than sleeping, maybe we could have a script that tries to pull and, if that fails, restarts Docker and tries again? The sleep feels very much like a hack.

I'd rather know what the underlying issue was, but know that might take time.

The VM is currently running 18.04, maybe a first step would be to try with a newer version of Ubuntu.

@bwiseman

bwiseman commented Aug 4, 2023

I'm seeing the nexus/vm connection problem as well. I have added the sleep to cloud-config.yaml, but how do I roll that out?

  • Shared service 'tre-shared-service-certs' is stuck in 'pending' so I can't delete that one.
  • Nexus shared service 'tre-shared-service-sonatype-nexus' is disabled so could be deleted

I think I need to delete the one that's pending as well before deploying and creating the shared service again? Tried via the api but 'pending' is locking it. Can I delete in the Azure portal? I can't work out what the 'pending' resource actually is though?

@migldasilva
Contributor Author

@marrobi Thanks for the comments. I do agree with you about avoiding "hacks". 😄

I could work on such a script. On the other hand, Ubuntu 20.04 presents the same problem.

@marrobi
Member

marrobi commented Aug 4, 2023

@bwiseman to resolve the operations stuck as pending you will need to go into cosmos, operations, and change the status of the operation to Failed, see: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/manually-editing-resources/

My suggestion, if you aren't yet familiar with registering custom templates, is to delete the nexus service and redeploy, log on to the nexus VM and check whether cloud-init has completed, and if it has failed run these commands: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/cloud-init/#re-running-cloud-init-scripts

@marrobi
Member

marrobi commented Aug 4, 2023

> @marrobi Thanks for the comments. I do agree with you about avoiding "hacks". 😄
>
> I could work on such a script. On the other hand, Ubuntu 20.04 presents the same problem.

I think if we had a more robust script that was being run by cloud-init that could help resolve some of these issues. Also taking account of maybe moving the files away from the /tmp folder.

@bwiseman

bwiseman commented Aug 7, 2023

> @bwiseman to resolve the operations stuck as pending you will need to go into cosmos, operations, and change the status of the operation to Failed, see: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/manually-editing-resources/

It doesn't seem to be here... in Cosmos DB the 'pending' resource ID is listed in the Resources/Items tag as deploymentStatus -= updated, but there is nothing in the Operations section with that ID and nothing that I can find saying "pending".

@marrobi
Member

marrobi commented Aug 7, 2023

> > @bwiseman to resolve the operations stuck as pending you will need to go into cosmos, operations, and change the status of the operation to Failed, see: https://microsoft.github.io/AzureTRE/v0.12.0/troubleshooting-faq/manually-editing-resources/
>
> It doesn't seem to be here... in Cosmos DB the 'pending' resource ID is listed in the Resources/Items tag as deploymentStatus -= updated, but there is nothing in the Operations section with that ID and nothing that I can find saying "pending".

In the operations documents, look for the resource_id that matches the service ID. The status might say awaiting_action. I imagine it is one of the last ones in the list of operation documents.

For reference the valid values are:

# Resource Action Status
RESOURCE_STATUS_AWAITING_ACTION = "awaiting_action"
RESOURCE_ACTION_STATUS_INVOKING = "invoking_action"
RESOURCE_ACTION_STATUS_SUCCEEDED = "action_succeeded"
RESOURCE_ACTION_STATUS_FAILED = "action_failed"

@marrobi
Member

marrobi commented Aug 7, 2023

Or you could delete the certs service via the API. It's just the UI that's blocking the disable/delete operations.

@bwiseman

bwiseman commented Aug 7, 2023 via email

@marrobi
Member

marrobi commented Aug 7, 2023

OK, just replicated: [image]

If you change any operations for the shared service with status awaiting_action to action_failed

@marrobi
Member

marrobi commented Aug 7, 2023

Ah, sorry, I'm getting confused between two issues. :/

Anyway, look in Cosmos for any operations not completed for that resource. Let me know how you get on.

@bwiseman

bwiseman commented Aug 8, 2023

In frustration I destroyed the TRE install and reinstalled (good excuse to check my documentation). All seemed to go well and I'm back to just having the Nexus problem. Unfortunately, this time running the cloud-init scripts manually isn't fixing the issue of the Guacamole connection not working from a Linux VM. Trying to connect to the Nexus web page from a Windows VM doesn't work, and from the Nexus admin VM I see the error you're talking about. Is adding the sleep and redeploying Nexus worth trying?

docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
See 'docker run --help'.
Checking for Nexus admin password file...
ERROR - Timeout while waiting for nexus-data/admin.password to be created
Setting up Nexus SSL...
Checking for nexus-data/keystores directory...
ERROR - Timeout while waiting for Nexus to create nexus-data/keystores
Checking for ./nexus_repos_config directory...

@marrobi
Member

marrobi commented Aug 8, 2023

Sorry to hear you're still having issues. As per @migldasilva's solution, adding:

systemctl restart docker.service
sleep 60

to cloud-config.yaml and re-registering the template has solved the issue for him. It's odd, as sometimes it just works.

If you drop me an email to marrobi at microsoft.com I can maybe screen share to try and help you along if that would be useful?

@migldasilva are you intending to do a PR? Thanks.

@marrobi
Member

marrobi commented Aug 11, 2023

@migldasilva are you intending to do this, otherwise I will do a PR with a script like:

#!/bin/bash

while true; do
    docker pull sonatype/nexus3
    if [ $? -eq 0 ]; then
        break
    else
        echo "Failed to pull image, restarting Docker service"
        systemctl restart docker.service
        sleep 60
    fi
done
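
The loop above retries forever if the registry never becomes reachable. A possible variant that gives up after a fixed number of attempts is sketched below. It is an illustration only, not the script that was later merged; `retry_with_limit` and `pull_nexus_image` are hypothetical names, and the attempt count and delay in the usage note are arbitrary.

```shell
#!/bin/bash
# Sketch only: helper names are illustrative, not taken from the TRE repo.

# Run a command until it succeeds or the attempt limit is reached.
# Usage: retry_with_limit MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_with_limit() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "Attempt ${attempt} failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Pull the image; if the pull fails, restart Docker before the next attempt
# and report failure so the caller retries.
pull_nexus_image() {
    docker pull sonatype/nexus3 || { systemctl restart docker.service; return 1; }
}
```

Calling `retry_with_limit 10 60 pull_nexus_image` would try the pull up to ten times with a 60-second pause between attempts, and return non-zero if all of them fail so that the rest of the setup can stop early instead of timing out later.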

@marrobi
Member

marrobi commented Aug 16, 2023

While looking at @SvenAelterman's PR I noticed the firewall rules are added in the pipeline AFTER the VM is deployed. This could well be the source of the issue. Either way the retry script will fix it.

The pipeline is that way round because the firewall rules use output from the Terraform.

@migldasilva
Contributor Author

migldasilva commented Aug 21, 2023

@marrobi I wasn't able to create a PR before going on vacation. As you mentioned in this comment, my approach was to use a script quite similar to that one.

I have just pulled the newest code and saw that such a script has already been merged (including a timeout option for limiting the number of retries). Thus, a new PR won't be required.

Thanks.

@SvenAelterman SvenAelterman unpinned this issue Aug 24, 2023