
MGMT-11090: Enhancement Doc: Assisted boot-reporter service #4444

Conversation

nmagnezi (Contributor):

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2022
openshift-ci bot commented Sep 28, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 28, 2022
nmagnezi (Contributor, Author):

/assign @avishayt
/assign @tsorya
/assign @mkowalski
/assign @filanov


## Motivation

There are multiple occurrences of clusters that fail to install at the stage where
Contributor:

So the case here is when the host did boot successfully from the right disk, configured network, and pulled the ignition.
In this case we have the installer running on the bootstrap node that should collect the relevant logs from the rest of the nodes.
In case the problem happened in the bootstrap node, the assisted-installer-controller should collect the relevant logs.

omertuc (Contributor), Sep 28, 2022:

In case the problem happened in the bootstrap node, the assisted-installer-controller should collect the relevant logs.

The controller can't collect anything about the bootstrap node's post reboot early boot process

Contributor:

This case is mostly relevant for SNO; in case the controller is not started due to some issues like kube-api/kubelet etc.,
we will still get logs.

eranco74 (Contributor), Sep 28, 2022:

I agree that this service is mostly relevant for SNO and I think the enhancement motivation, background and goals should reflect that.

The controller can't collect anything about the bootstrap node's post-reboot early boot process
This is something we can solve with the current implementation (copy the private key to the controller).

On the other hand, if the motivation is to have a robust mechanism to get the information about the host boot status and the logs from the host regardless of the scenario (SNO, worker stuck in configuring, host changed IP) then this service makes sense.

Contributor:

When will it even be relevant:

  1. wrong boot order - won't help
  2. no networking post reboot (static networking was not applied correctly) - low chances that, despite that, the host will still have networking to reach the SaaS / hub cluster
  3. wrong DNS configuration - let's add better validations before triggering the installation
  4. what else?

Contributor:

Literally any error that prevents crio/kubelet from functioning properly (e.g. expired certs? the crio ipv4/ipv6 bug we had recently?)

Contributor:

i.e. crio and kubelet are super complex and prone to failure; we want to see why and how they fail even on SNO/bootstrap nodes

omertuc (Contributor), Sep 28, 2022:

Also we can't really tell why some installations fail when it's the bootstrap node post-reboot or a SNO.

This service is a best-effort attempt at maybe finding an answer (or maybe not, depending on the failure, which we know nothing about)

Contributor:

I discussed with Nir quite a bit about the motivation and I would like to write it explicitly here - any solution that involves running assisted-installer-controller is not a solution here.

Reasoning - in order to run assisted-installer-controller we need a working k8s cluster. In order to have a working k8s cluster we need a working kubelet. In order to have a working kubelet we need a working crio. Therefore, before booting a machine and getting to run assisted-installer-controller there is a bunch of components, none of which reports its status mid-way.

To have a robust solution we need something that is plugged in at the beginning of this dependency chain. Thus, we need something with no dependency on crio. Thus, our solution cannot be running a container. This takes assisted-installer-controller out of the game here.

We have been discussing improving assisted-installer-controller for ages now (and it's good, it needs improvements), but none of those discussions focuses on the problem we try to solve here - installation can fail because a component of RHCOS is dodgy. We need the ability to debug this, otherwise we will keep receiving tickets that are unsolvable without overseeing the installation in real time.

We don't need an ideal solution. We need a solution good enough that will tell us e.g. "machine booted with network connectivity but XYZ happened so kubelet never started". Omer summarizes it perfectly - we need best effort, not an ideal.

Contributor (Author):

Thank you all for the feedback.
I think a "best effort" is a fair description here.
I agree with the reasoning that we should not depend on crio or other things that are prone to fail the installation, leaving us as blind as we are today.

Therefore, this service needs to be basic and with little to no dependencies. Start via systemd as soon as boot happens and immediately try to report logs (and status).

I think this should cover both SNO and non-SNO cases; however, I am aware of some SNO-specific concerns @eranco74 has, and we will probably keep discussing that.

@romfreiman as for:

2. no networking post reboot (static networking was not applied correctly) - low chances that, despite that, the host will still have networking to reach the SaaS / hub cluster

as we have seen, today's numbers do show that some clusters are using DHCP alongside nmstate, so we might also get logs in those situations.
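
To make the "little to no dependencies" idea concrete, here is a minimal sketch of what such a reporter could look like, assuming a statically linked binary that needs only the standard library, `journalctl`, and the network. The config file path, endpoint paths, and the bearer-token auth are illustrative assumptions for this sketch, not the actual design.

```go
// Illustrative sketch only: the config path, endpoint paths, and auth handling
// below are assumptions made for this example, not the enhancement's design.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"time"
)

// Hypothetical connection info persisted on disk before the reboot
// (by the discovery agent, or patched into the pointer ignition).
type reporterConfig struct {
	ServiceURL string `json:"service_url"`
	ClusterID  string `json:"cluster_id"`
	HostID     string `json:"host_id"`
	Token      string `json:"token"`
}

func main() {
	raw, err := os.ReadFile("/etc/assisted/boot-reporter.json") // hypothetical location
	if err != nil {
		fmt.Fprintln(os.Stderr, "no reporter config, nothing to do:", err)
		return
	}
	var cfg reporterConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		fmt.Fprintln(os.Stderr, "bad reporter config:", err)
		return
	}
	client := &http.Client{Timeout: 30 * time.Second}

	// 1. Report the boot status (the proposed "Booted with local ignition" stage).
	//    The path is a placeholder for whatever progress endpoint is chosen.
	stage, _ := json.Marshal(map[string]string{"current_stage": "Booted with local ignition"})
	report(client, cfg, cfg.ServiceURL+"/placeholder/hosts/"+cfg.HostID+"/progress",
		"application/json", bytes.NewReader(stage))

	// 2. Upload the journal of the current boot. The only dependencies are
	//    journalctl and the network: no crio, no kubelet, no podman.
	//    logs_type=node-boot matches the log type eventually added by the
	//    follow-up commit referenced at the bottom of this thread.
	journal, err := exec.Command("journalctl", "-b", "--no-pager").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "journalctl failed:", err)
		return
	}
	report(client, cfg, cfg.ServiceURL+"/placeholder/clusters/"+cfg.ClusterID+
		"/logs?logs_type=node-boot&host_id="+cfg.HostID,
		"application/octet-stream", bytes.NewReader(journal))
}

func report(client *http.Client, cfg reporterConfig, url, contentType string, body *bytes.Reader) {
	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	req.Header.Set("Content-Type", contentType)
	req.Header.Set("Authorization", "Bearer "+cfg.Token) // auth scheme is an open question below
	if resp, err := client.Do(req); err != nil {
		fmt.Fprintln(os.Stderr, "report failed:", err)
	} else {
		resp.Body.Close()
	}
}
```

The same flow could equally be written as a short bash script around curl; the point of the sketch is only that nothing in it depends on crio, kubelet, or a container runtime.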



### Changes To Assisted Service
For cases as mentioned in `Motivation`, we need a new service (start via `systemd`) soon as possible after hosts
Contributor:

Suggested change
For cases as mentioned in `Motivation`, we need a new service (start via `systemd`) soon as possible after hosts
For cases as mentioned in `Motivation`, we need a new service (started via `systemd`) as soon as possible after hosts

Contributor (Author):

Done

For cases as mentioned in `Motivation`, we need a new service (start via `systemd`) soon as possible after hosts
are booting from the disk and contact `assisted-service` to:

1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
Contributor:

I don't like the new stage name too much because it kinda sounds like what Configuring already is. How about Reporting?

Contributor:

Configuring means that it pulled the ignition from the MCS.

Contributor:

Could we piggy-back here and maybe rethink renaming Configuring? I'm not sure we have a common agreement that everyone understands that Configuring means "ignition pulled from MCS but not necessarily applied correctly". Maybe something like following states: Configuring (ignition pulled) followed by Configuring (ignition applied)?

Contributor (Author):

@tsorya @avishayt, what do you think about the name?

are booting from the disk and contact `assisted-service` to:

1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
and `Waiting for control plane`. This stage refers to a boot with pointer ignition.
Contributor:

Waiting for control plane comes before Configuring, but this sentence seems to imply otherwise

Also waiting for control plane is not a universal stage, non bootstrap masters don't go through that stage so it's not a good example.

Maybe the sentence should read "between Configuring and Joined"

Also Configuring is not a guaranteed stage, hosts can go from rebooting to joined/done immediately when there's user managed networking and an API LoadBalancer

## Motivation

There are multiple occurrences of clusters that fail to install at the stage where
hosts manage to pull ignition but fail to get to a point where the `assisted-controller`
Contributor:

OK, after f2f discussion we have 2 cases here:

  1. The hosts manage to boot with the pointer ignition but didn't pull the ignition from the MCS, in this case, we have no idea what's going on and the new boot-reporter service should help (assuming the host network is configured and the service can report to the assisted-service).
  2. The host did pull the ignition from the MCS (host stage should be configuring) - in this case, the current implementation should suffice and we should get the relevant logs from the installer/controller.

Contributor:

I think one motivation for this service (in case we intend to use it for more than just reporting successful boot) is to align the SNO stages and logs gathering functionality with the multi-node stages.

Contributor:

@eranco74 not really, to tell the truth.
Configuring doesn't mean the host actually managed to apply the ignition, so this case will bring us logs even if SSH is not ready yet (ignition failed to be applied).

Contributor:

What scenario is this?
Why would it apply the new boot-reporter service but not the rest of the ignition payload?

Contributor:

Agree with Igal here. We had cases where we managed to pull the ignition but RHCOS failed to apply the ignition. E.g. using storage stanza that refers to a disk via non-existing name. That's syntax-wise a totally correct ignition override but RHCOS will refuse to apply it and will not continue the boot process.

Whether this service here would help is another question as we would need to design its dependency chain quite carefully.

Contributor (Author):

@eranco74
Given that the service is planned to run immediately after the boot process, on each host independently - I'm not sure I understand the specific case for SNO here.

I think an f2f meeting is needed here.
I have booked one for the coming Thursday. Let me know if that works.

Comment on lines 43 to 44
1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
and `Waiting for control plane`. This stage refers to a boot with pointer ignition.
Contributor:

Suggested change
1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
and `Waiting for control plane`. This stage refers to a boot with pointer ignition.
1. Change the installation stage to a new `Booted with local ignition` between `Rebooting` and `Configuring`. This stage refers to a boot with pointer ignition.

Contributor (Author):

@eranco74, you suggest:
Rebooting --> Booted with local ignition --> Configuring
@omertuc, you suggest:
Configuring --> Booted with local ignition --> Joined

@mkowalski made a comment about the need to rename Configuring (which I agree with), and if I recall correctly, it means that the MCS ignition got pulled.

If that is the case, I think @eranco74's suggestion makes sense here since we want this service to run and report logs when pointer ignition got applied (thus, we have networking).

and `Waiting for control plane`. This stage refers to a boot with pointer ignition.


2. Change the installation stage to a new `Booted with control plane ignition` Between
Contributor:

Unsure I understand this one.

  1. The Waiting for control plane stage applies only to the bootstrap node; this is the stage before the bootstrap node gets rebooted:
    Waiting for control plane -> Rebooting -> Configuring -> Joined -> Done
    And what this enhancement proposes is changing that to:
    Waiting for control plane -> Rebooting -> Waiting for control plane -> Rebooting -> Configuring -> Joined -> Done -> Configuring -> Joined -> Done

Which is what we have in 1.

Contributor (Author):

I was aiming for injecting this new stage right after the boot from MCS ignition was completed. From your comment, I understand it should be changed to:

  1. Change the installation stage to a new Booted with control plane ignition Between Configuring and Joined.

What do you think?
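
To make the two suggestions concrete, the combined ordering discussed in this thread would roughly be the following (a sketch only; the two new stage names are proposals from this discussion, not existing constants, and the existing stages are written as plain strings rather than the codebase's stage constants):

```go
package main

import "fmt"

// Sketch of the bootstrap-host stage ordering as proposed in this thread;
// the two new stage names are proposals, not existing constants in the codebase.
var proposedBootstrapStages = []string{
	"Waiting for control plane",
	"Rebooting",
	"Booted with local ignition",         // new: first boot (pointer ignition) reported
	"Configuring",                        // ignition pulled from the MCS
	"Booted with control plane ignition", // new: second boot (full ignition) reported
	"Joined",
	"Done",
}

func main() { fmt.Println(proposedBootstrapStages) }
```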

### Information Needed And Changes To The Discovery Agent

In order to communicate with `assisted-service`, the assisted boot reporter service needs to know:
1. The `assisted-service` IP address and port.
Contributor:

Suggested change
1. The `assisted-service` IP address and port.
1. The `assisted-service` URL.

Contributor (Author):

Done

The assisted boot reporter service should start via `systemd` and determine its operating stage. As mentioned above:

### First Boot (Pointer Ignition)
Expected stage to be read from assisted-service backend: `Configuring`.
Contributor:

Suggested change
Expected stage to be read from assisted-service backend: `Configuring`.
Expected stage to be read from assisted-service backend: `Rebooting`.

Since the host just started after it was rebooted.

  • We set the host to Configuring after it pulled the ignition from the MCS

omertuc (Contributor), Sep 28, 2022:

Since the host just started after it was rebooted

But by the time the new systemd service runs the host will already be Configuring (unless there's UMN API load balancer as we discussed when I was at the office)

Contributor:

Hmm, I think you are right.
If the host did pull the ignition from the MCS it's already configuring (so what's the point in adding the new stage??).
If the host didn't pull the ignition from the MCS (failed to reach the API or some other issue with the bootstrap control plane) it will fail during ignition and I doubt if the new boot reporter service will start in that case.

Contributor:

If the host did pull the ignition from the MCS it's already configuring (so what's the point in adding the new stage??).

It's still nice to see a new stage, because you know not only that it tried to pull the ignition but also that it began applying it, that our service was able to run, and that our service was able to communicate with the assisted-service; it tells you the networking is working great.

If the host didn't pull the ignition from the MCS (failed to reach the API or some other issue with the bootstrap control plane) it will fail during ignition and I doubt if the new boot reporter service will start in that case.

This is indeed a problem. I wish we could start the new service even when the ignition "remote merge" fails

Contributor (Author):

I'm changing it from Configuring to Rebooting so it will be on-par with the section above.
Also, indeed it does not catch all cases but as Omer said it does let you know that networking is up and running.

Contributor (Author):

Done.


### Open Questions

1. What should we use for implementation? Our options are:
Contributor:

I think we can use both bash and go (container):
For the boot reporting - use bash; a simple curl command should do the trick, and we don't rely on anything other than network access to the service.
For sending logs - I'd prefer we reuse the send-logs functionality already implemented in the assisted-installer-agent. But since this depends on other things (container runtime, access to the registry) I guess we can try to run the agent and, in case this method fails, fall back to a bash script as well.

Contributor:

I don't think we need to rely on pulling an image. The idea is to send logs no matter what; if we need to pull an image, it will break this idea.
I assume we need to use bash, it should be easy code to write and maintain.

Contributor:

I tend to agree, the main value of this service is the ability to get the logs from the node with minimal dependencies.
In that regard a bash script is the way to go.

omertuc (Contributor), Sep 28, 2022:

Why bash? It's not compiled (compilation tells you about mistakes long before you run the program), it's not typed (much easier to write bugs), it's a language riddled with horrible pitfalls, and I have no doubt we will be chasing bugs if we use it.

You can compile a static Go binary (to eliminate OS shared-library dependencies - we're not always sure which OS / RHCOS version we're running against) and place it in the ignition just as easily; then you can use the proper assisted Go API library and make structured API calls without relying on the curl CLI mess. Let's use a real language - we already use it and we share a lot of code with the installer / controller.

Contributor:

The problem with the Go library is providing it through the ignition.
If we decide to go with the Go library we will need to bring it by pulling an image (better to use one of the existing ones and not add a new one). In that case we need to understand that sometimes the image can't be pulled and we will not get the relevant data.

omertuc (Contributor), Sep 28, 2022:

The problem with the Go library is providing it through the ignition.

Why is that a problem?

If we decide to go with the Go library we will need to bring it by pulling an image

No images - just a static binary embedded directly inside the ignition.

Contributor:

For example, a statically built controller executable compressed with upx, with debugging information removed (-ldflags="-s -w"), weighs only 20M base64-encoded - and the controller does much more than what we want to build, so I bet we can make it even smaller if we try.

Contributor:

I have to disagree here with anything that says "container". We need to ship a standalone binary at most, because we want to catch cases where the container runtime on the host is broken. No matter whether we would need to pull a new image or just run something that we shipped, if someone thinks about running this service via e.g. podman run, my very first question will be - what are you going to do if the podman binary doesn't execute as you expected?

We had cases where e.g. c/storage was broken. You weren't able to run any container - not only could you not pull a container, you couldn't even run what you already had.

If we want to be robust, we should use the simplest stack possible. I agree with Omer on the superiority of statically built Go over bash. If we combine it with Nick's idea of just pointing towards the payload instead of embedding it all in the ignition, we will not even increase the ignition's size.

Contributor (Author):

+1 @mkowalski.
To download that script, won't that mean we will need to have an endpoint with no auth (or "empty auth")?
Won't that be a problem with the SaaS deployment?

Unless... we can place the pull secret inside the pointer ignition, which allows us to use agent auth.

carbonin (Member), Oct 3, 2022:

I think we're likely going to have a problem with either release engineering or our build chain if we try to use golang here.

I see a few options and issues with all of them:

  1. We compile these binaries from go source and commit them to our code base:
    • Generally poor form. Source repo should have source, not compiled code.
    • I think this will run afoul of some security guidelines that I can't find a reference to at the moment
    • We will need a build chain with all the architectures we support. (with power and Z on the way this could get complicated very quickly)
  2. We ship source code to the host and compile there
    • Needs go tooling and dependencies on the host
  3. Build an RPM with our changes
    • RPMs are kind of annoying to deal with
    • Need to deal with releng
    • Hard to install on coreos

I see the benefits that @omertuc is referencing, but practically I'm not sure how we would ship something like this to a coreos host without a container image.

1. Python is not included with RHCOS. We will need to install it and some third-party python libs it will probably need (e.g., python-requests).


2. How to handle pending user action state?
Contributor:

Pending user action might be the node's current stage instead of Rebooting (the node moves to pending user action in case the node failed to boot from the installation disk).
So I think we should handle it the same way as we handle Rebooting: if the boot-reporter service started, it means that the node booted from the right disk and we can move to Booted with control plane ignition.

Contributor (Author):

Is there a need to write anything to disk? We aim to upload logs that can be pulled in real time (journal, etc.).

`Waiting for control plane` and `Joined`.


3. For each of the above-mentioned, Read the current host stage from `assisted-service`, to determine if:
Contributor:

Unsure what is the added value of this service after the first boot.

Contributor:

I think it was about the self-cleanup, am I right? But maybe the service should be a one-shot, so that the very first instance of the service is responsible for collecting the logs, reporting them back to SaaS, and immediately killing itself afterwards? That way there will never be any sign of our service in any subsequent reboot.

Contributor (Author):

I had a discussion with @avishayt and @tsorya about this, in which we determined that we want to reflect each boot in its own stage.
This will:

  1. Allow more visibility into the host's progress.
  2. Allow the service to determine whether this is the first or second boot (and by that, activate the cleanup) by reading the host stage from assisted-service (see the sketch below).
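
A rough sketch of that second point, assuming the reporter can read its own host record from assisted-service; the stage names follow the proposals in this thread, and pending user action is treated like Rebooting, as suggested above:

```go
// Sketch: decide what to do based on the host stage read back from assisted-service.
// Stage names are the proposals from this thread, not existing constants.
package main

import "fmt"

func nextAction(currentStage string) string {
	switch currentStage {
	case "Rebooting", "Pending user action":
		// First boot from disk: the fact that we are running at all means the
		// host booted from the right disk.
		return "report: Booted with local ignition"
	case "Booted with local ignition", "Configuring":
		// Second boot (control-plane ignition applied).
		return "report: Booted with control plane ignition"
	case "Joined", "Done":
		// Installation finished; nothing to report, clean the service up.
		return "self-cleanup"
	default:
		return "best effort: upload logs only"
	}
}

func main() {
	for _, s := range []string{"Rebooting", "Configuring", "Done"} {
		fmt.Printf("%-12s -> %s\n", s, nextAction(s))
	}
}
```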

docs/enhancements/boot-status-reporter.md (resolved thread)

2. How to handle pending user action state?

3. Are there any SNO specific things to take into account?
Contributor:

Yes. In the case of SNO (there is no Configuring stage, the node installation is more cumbersome, and there is no installer/controller to report the progress or send logs) it makes sense for this service to keep reporting and sending logs until the assisted-installer-controller is running.
In a multi-node cluster installation it does not, and this service might override the logs gathered by the current logs-collection implementation.


It will need this information starting from the first boot, which is a pointer ignition-based boot.

For that, we need to modify the discovery agent to store this information on disk in a pre-defined location for the assisted boot reporter service to use.
Contributor:

Why not embed the files in the installed ignition?

Contributor:

I think that was the intention here, probably requires rephrasing.

Contributor (Author):

Still testing this. However, I had a chat with @carbonin about this, and he thinks we should avoid it.
Instead, he proposed having a script file downloaded via the assisted-service API (probably a bash script).
I'll complete the test anyhow, but mentioned this in case Nick wants to expand on it.

Member:

Yes, but my suggestion was just to avoid an incredibly large payload of base64 data in ignition.
Ignition itself also allows files to be created using regular http urls so I'm suggesting to download whatever script/executable thing we use with that format rather than embedding directly into the pointer ignition.

I also suggested bash because it's the most portable.
I don't think we want to get into the business of compiling go programs for every architecture we might soon support.

I think there's a lot of room for discussion on how to go about distributing whatever this service is, and a lot of it depends on how complicated it ends up being.
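
For illustration, the pointer-ignition entry described above could look roughly like the output of this sketch; the struct definitions are local simplifications rather than the real Ignition config types, and the download URL and install path are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the Ignition v3 config types, only to show the shape
// of a file entry that is downloaded over HTTP instead of embedded as base64.
type ignitionSketch struct {
	Ignition struct {
		Version string `json:"version"`
	} `json:"ignition"`
	Storage struct {
		Files []fileEntry `json:"files"`
	} `json:"storage"`
}

type fileEntry struct {
	Path     string `json:"path"`
	Mode     int    `json:"mode"`
	Contents struct {
		Source string `json:"source"`
	} `json:"contents"`
}

func main() {
	var cfg ignitionSketch
	cfg.Ignition.Version = "3.1.0"

	var reporter fileEntry
	reporter.Path = "/usr/local/bin/assisted-boot-reporter" // hypothetical install path
	reporter.Mode = 0755
	// Fetched at boot time instead of inlined, so the pointer ignition stays small.
	reporter.Contents.Source = "https://assisted-service.example.com/boot-reporter" // placeholder URL

	cfg.Storage.Files = append(cfg.Storage.Files, reporter)

	out, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(out))
}
```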


It will need this information starting from the first boot, which is a pointer ignition-based boot.

For that, we need to modify the discovery agent to store this information on disk in a pre-defined location for the assisted boot reporter service to use.
Contributor:

Not sure I get it right. Why is this needed in the discovery agent?
I think assisted-service should patch the pointer ignition with all the relevant values; no need for anything in the agent.

Contributor (Author):

Is it okay to place the pull secret in the pointer ignition?
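
For illustration only, this is the kind of small file that either the discovery agent or an assisted-service ignition patch would need to leave behind for the reporter; the path and field names are the same hypothetical ones used in the earlier sketch, and whether the token can be a pull-secret-derived agent token placed in the pointer ignition is exactly the open question raised here:

```go
package main

import (
	"encoding/json"
	"os"
)

// Same hypothetical config shape as in the earlier sketch; the auth token's
// origin (pull secret in the pointer ignition?) is the open question above.
type reporterConfig struct {
	ServiceURL string `json:"service_url"`
	ClusterID  string `json:"cluster_id"`
	HostID     string `json:"host_id"`
	Token      string `json:"token"`
}

func main() {
	cfg := reporterConfig{
		ServiceURL: "https://assisted-service.example.com", // placeholder
		ClusterID:  os.Getenv("CLUSTER_ID"),
		HostID:     os.Getenv("HOST_ID"),
		Token:      os.Getenv("AGENT_AUTH_TOKEN"),
	}
	data, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		panic(err)
	}
	// 0600: the file may carry a credential, so keep it root-readable only.
	if err := os.WriteFile("/etc/assisted/boot-reporter.json", data, 0o600); err != nil {
		panic(err)
	}
}
```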

nmagnezi force-pushed the MGMT-11090_post_boot_status_reporter branch from bbf13ba to 627eda7 on October 3, 2022 09:30
nmagnezi force-pushed the MGMT-11090_post_boot_status_reporter branch from 627eda7 to b6b9e4c on October 3, 2022 11:28
For cases as mentioned in `Motivation`, we need a new service (started via `systemd`) as soon as possible after hosts
are booting from the disk and contact `assisted-service` to:

1. Change the installation stage to a new `Booted with local ignition` between `Rebooting` and `Configuring`.
Contributor:

How can we identify that the host was already installed?
For example, if the host just reboots for some reason, the service will start again, no? What will it do?

nmagnezi mentioned this pull request Oct 24, 2022
openshift-merge-robot pushed a commit that referenced this pull request Jan 12, 2023
Per assisted-service/pull/4444[1] , add a new log type named: node-boot.

This change updates both V2UploadLogs and V2DownloadClusterLogs to accept the new log type.
In addition:
1. A new event was added: host_boot_logs_uploaded.
2. Defined download name: <cluster_name>_<host_role>_boot_<host_id>.tar.gz
3. Note that V2UploadLogs with log type node-boot will not update the host progress.

[1] #4444
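
For reference, a hedged sketch of how a client could exercise the new node-boot log type added by this commit; the V2UploadLogs path and the multipart form-field name are assumptions modeled on other upload endpoints and should be checked against swagger.yaml:

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Journal archive produced by the boot reporter (file name is illustrative).
	archive, err := os.ReadFile("node_boot_logs.tar.gz")
	if err != nil {
		panic(err)
	}

	// Build a multipart body; "upfile" is assumed to be the form field name,
	// matching other assisted-service upload endpoints (verify against swagger.yaml).
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, _ := w.CreateFormFile("upfile", "node_boot_logs.tar.gz")
	part.Write(archive)
	w.Close()

	// Endpoint path is an assumption modeled on the V2UploadLogs operation;
	// logs_type=node-boot is the new type introduced by this change.
	url := fmt.Sprintf(
		"https://assisted-service.example.com/api/assisted-install/v2/clusters/%s/logs?logs_type=node-boot&host_id=%s",
		os.Getenv("CLUSTER_ID"), os.Getenv("HOST_ID"))

	req, _ := http.NewRequest(http.MethodPost, url, &body)
	req.Header.Set("Content-Type", w.FormDataContentType())
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
}
```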
nmagnezi, eliorerz, and danielerez subsequently pushed commits to their forks (nmagnezi/assisted-service, eliorerz/assisted-service, danielerez/assisted-service) that referenced this pull request between January and October 2023.
Labels: approved, lgtm, size/L
10 participants