
MGMT-11090: Enhancement Doc: Assisted boot-reporter service #4444

Conversation

nmagnezi (Contributor):

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2022
openshift-ci bot commented Sep 28, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 28, 2022
nmagnezi (Contributor, Author):

/assign @avishayt
/assign @tsorya
/assign @mkowalski
/assign @filanov


## Motivation

There are multiple occurrences of clusters that fail to install at the stage where
Contributor:

So the case here is when the host did boot successfully from the right disk, configured network, and pulled the ignition.
In this case we have the installer running on the bootstrap node that should collect the relevant logs from the rest of the nodes.
In case the problem happened in the bootstrap node, the assisted-installer-controller should collect the relevant logs.

omertuc (Contributor), Sep 28, 2022:

In case the problem happened in the bootstrap node, the assisted-installer-controller should collect the relevant logs.

The controller can't collect anything about the bootstrap node's post reboot early boot process

Contributor:

This case is mostly relevant for SNO; in case the controller is not started due to some issues like kube-api/kubelet etc.,
we will still get logs.

eranco74 (Contributor), Sep 28, 2022:

I agree that this service is mostly relevant for SNO and I think the enhancement motivation, background and goals should reflect that.

The controller can't collect anything about the bootstrap node's post-reboot early boot process
This is something we can solve with the current implementation (copy the private key to the controller).

On the other hand, if the motivation is to have a robust mechanism to get the information about the host boot status and the logs from the host regardless of the scenario (SNO, worker stuck in configuring, host changed IP) then this service makes sense.

Contributor:

When will it even be relevant:

  1. wrong boot order - won't help
  2. no networking post reboot (static networking was not applied correctly) - low chances that, despite that, the host will still have networking to reach the SaaS / hub cluster
  3. wrong DNS configuration - let's add better validations before triggering the installation
  4. what else?

Contributor:

Literally any error that prevents crio/kubelet from functioning properly (e.g. expired certs? the crio ipv4/ipv6 bug we had recently?)

Contributor:

i.e. crio and kubelet are super complex and prone to failure; we want to see why and how they fail even on SNO/bootstrap nodes

omertuc (Contributor), Sep 28, 2022:

Also we can't really tell why some installations fail when it's the bootstrap node post-reboot or a SNO.

This service is a best-effort attempt at maybe finding an answer (or maybe not, depending on the failure, which we know nothing about)

Contributor:

I discussed with Nir quite a bit about the motivation and I would like to write it explicitly here - any solution that involves running assisted-installer-controller is not a solution here.

Reasoning - in order to run assisted-installer-controller we need a working k8s cluster. In order to have a working k8s cluster we need a working kubelet. In order to have a working kubelet we need a working crio. Therefore, before booting a machine and getting to run assisted-installer-controller there is a bunch of components, none of which reports its status mid-way.

To have a robust solution we need something that is plugged in at the beginning of this dependency chain. Thus, we need something with no dependency on crio. Thus, our solution cannot be running a container. This takes assisted-installer-controller out of the game here.

We have been discussing improving assisted-installer-controller for ages now (and it's good, it needs improvements), but none of those discussions focuses on the problem we try to solve here - installation can fail because a component of RHCOS is dodgy. We need the ability to debug this, otherwise we will keep receiving tickets that are unsolvable without overseeing the installation in real time.

We don't need an ideal solution. We need a solution good enough that will tell us e.g. "machine booted with network connectivity but XYZ happened so kubelet never started". Omer summarizes it perfectly - we need best effort, not an ideal.

Contributor (Author):

Thank you all for the feedback.
I think a "best effort" is a fair description here.
I agree with the reasoning that we should not depend on crio or other things that are prone to fail the installation, leaving us as blind as we are today.

Therefore, this service needs to be basic and with little to no dependencies. Start via systemd as soon as boot happens and immediately try to report logs (and status).

I think this should cover both SNO and non-SNO cases; however, I am aware of some SNO-specific concerns @eranco74 has, and we will probably keep discussing that.

@romfreiman as for:

2. no networking post reboot (static networking was not applied correctly) - low chances that, despite that, the host will still have networking to reach the SaaS / hub cluster

as we have seen, today's numbers do show that some clusters are using DHCP alongside nmstate, so we might also get logs in those situations.
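
To make the "little to no dependencies" idea concrete, here is a minimal sketch of what such a reporter could look like, assuming a statically linked binary that needs only the standard library, `journalctl`, and the network. The config file path, endpoint paths, and the bearer-token auth are illustrative assumptions for this sketch, not the actual design.

```go
// Illustrative sketch only: the config path, endpoint paths, and auth handling
// below are assumptions made for this example, not the enhancement's design.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"time"
)

// Hypothetical connection info persisted on disk before the reboot
// (by the discovery agent, or patched into the pointer ignition).
type reporterConfig struct {
	ServiceURL string `json:"service_url"`
	ClusterID  string `json:"cluster_id"`
	HostID     string `json:"host_id"`
	Token      string `json:"token"`
}

func main() {
	raw, err := os.ReadFile("/etc/assisted/boot-reporter.json") // hypothetical location
	if err != nil {
		fmt.Fprintln(os.Stderr, "no reporter config, nothing to do:", err)
		return
	}
	var cfg reporterConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		fmt.Fprintln(os.Stderr, "bad reporter config:", err)
		return
	}
	client := &http.Client{Timeout: 30 * time.Second}

	// 1. Report the boot status (the proposed "Booted with local ignition" stage).
	//    The path is a placeholder for whatever progress endpoint is chosen.
	stage, _ := json.Marshal(map[string]string{"current_stage": "Booted with local ignition"})
	report(client, cfg, cfg.ServiceURL+"/placeholder/hosts/"+cfg.HostID+"/progress",
		"application/json", bytes.NewReader(stage))

	// 2. Upload the journal of the current boot. The only dependencies are
	//    journalctl and the network: no crio, no kubelet, no podman.
	//    logs_type=node-boot matches the log type eventually added by the
	//    follow-up commit referenced at the bottom of this thread.
	journal, err := exec.Command("journalctl", "-b", "--no-pager").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "journalctl failed:", err)
		return
	}
	report(client, cfg, cfg.ServiceURL+"/placeholder/clusters/"+cfg.ClusterID+
		"/logs?logs_type=node-boot&host_id="+cfg.HostID,
		"application/octet-stream", bytes.NewReader(journal))
}

func report(client *http.Client, cfg reporterConfig, url, contentType string, body *bytes.Reader) {
	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	req.Header.Set("Content-Type", contentType)
	req.Header.Set("Authorization", "Bearer "+cfg.Token) // auth scheme is an open question below
	if resp, err := client.Do(req); err != nil {
		fmt.Fprintln(os.Stderr, "report failed:", err)
	} else {
		resp.Body.Close()
	}
}
```

The same flow could equally be written as a short bash script around curl; the point of the sketch is only that nothing in it depends on crio, kubelet, or a container runtime.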



### Changes To Assisted Service
For cases as mentioned in `Motivation`, we need a new service (start via `systemd`) soon as possible after hosts
Contributor:

Suggested change
For cases as mentioned in `Motivation`, we need a new service (start via `systemd`) soon as possible after hosts
For cases as mentioned in `Motivation`, we need a new service (started via `systemd`) as soon as possible after hosts

Contributor (Author):

Done

For cases as mentioned in `Motivation`, we need a new service (start via `systemd`) soon as possible after hosts
are booting from the disk and contact `assisted-service` to:

1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
Contributor:

I don't like the new stage name too much because it kinda sounds like what Configuring already is. How about Reporting?

Contributor:

Configuring means that it pulled the ignition from the MCS.

Contributor:

Could we piggy-back here and maybe rethink renaming Configuring? I'm not sure we have a common agreement that everyone understands that Configuring means "ignition pulled from MCS but not necessarily applied correctly". Maybe something like following states: Configuring (ignition pulled) followed by Configuring (ignition applied)?

Contributor (Author):

@tsorya @avishayt, what do you think about the name?

are booting from the disk and contact `assisted-service` to:

1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
and `Waiting for control plane`. This stage refers to a boot with pointer ignition.
Contributor:

Waiting for control plane comes before Configuring, but this sentence seems to imply otherwise

Also waiting for control plane is not a universal stage, non bootstrap masters don't go through that stage so it's not a good example.

Maybe the sentence should read "between Configuring and Joined"

Also Configuring is not a guaranteed stage, hosts can go from rebooting to joined/done immediately when there's user managed networking and an API LoadBalancer

## Motivation

There are multiple occurrences of clusters that fail to install at the stage where
hosts manage to pull ignition but fail to get to a point where the `assisted-controller`
Contributor:

OK, after f2f discussion we have 2 cases here:

  1. The hosts manage to boot with the pointer ignition but didn't pull the ignition from the MCS, in this case, we have no idea what's going on and the new boot-reporter service should help (assuming the host network is configured and the service can report to the assisted-service).
  2. The host did pull the ignition from the MCS (host stage should be configuring) - in this case, the current implementation should suffice and we should get the relevant logs from the installer/controller.

Contributor:

I think one motivation for this service (in case we intend to use it for more than just reporting successful boot) is to align the SNO stages and logs gathering functionality with the multi-node stages.

Contributor:

@eranco74 not really, to tell the truth.
Configuring doesn't mean the host actually managed to apply the ignition, so this case will bring us logs even if SSH is not ready yet (ignition failed to be applied).

Contributor:

What scenario is this?
Why would it apply the new boot-reporter service but not the rest of the ignition payload?

Contributor:

Agree with Igal here. We had cases where we managed to pull the ignition but RHCOS failed to apply the ignition. E.g. using storage stanza that refers to a disk via non-existing name. That's syntax-wise a totally correct ignition override but RHCOS will refuse to apply it and will not continue the boot process.

Whether this service here would help is another question as we would need to design its dependency chain quite carefully.

Contributor (Author):

@eranco74
Given that the service is planned to run immediately after the boot process, on each host independently - I'm not sure I understand the specific case for SNO here.

I think an f2f meeting is needed here.
I have booked one for the coming Thursday. Let me know if that works.

Comment on lines 43 to 44
1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
and `Waiting for control plane`. This stage refers to a boot with pointer ignition.
Contributor:

Suggested change
1. Change the installation stage to a new `Booted with local ignition` Between `Configuring`
and `Waiting for control plane`. This stage refers to a boot with pointer ignition.
1. Change the installation stage to a new `Booted with local ignition` between `Rebooting` and `Configuring`. This stage refers to a boot with pointer ignition.

Contributor (Author):

@eranco74, you suggest:
Rebooting --> Booted with local ignition --> Configuring
@omertuc, you suggest:
Configuring --> Booted with local ignition --> Joined

@mkowalski made a comment about the need to rename Configuring (which I agree with), and if I recall correctly, it means that the MCS ignition got pulled.

If that is the case, I think @eranco74's suggestion makes sense here since we want this service to run and report logs when pointer ignition got applied (thus, we have networking).

and `Waiting for control plane`. This stage refers to a boot with pointer ignition.


2. Change the installation stage to a new `Booted with control plane ignition` Between
Contributor:

Unsure I understand this one.

  1. The Waiting for control plane stage applies only to the bootstrap node; this is the stage before the bootstrap node gets rebooted:
    Waiting for control plane -> Rebooting -> Configuring -> Joined -> Done
    And what this enhancement proposes is changing that to:
    Waiting for control plane -> Rebooting -> Waiting for control plane -> Rebooting -> Configuring -> Joined -> Done -> Configuring -> Joined -> Done

Which is what we have in 1.

Contributor (Author):

I was aiming for injecting this new stage right after the boot from MCS ignition was completed. From your comment, I understand it should be changed to:

  1. Change the installation stage to a new Booted with control plane ignition Between Configuring and Joined.

What do you think?
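
To make the two suggestions concrete, the combined ordering discussed in this thread would roughly be the following (a sketch only; the two new stage names are proposals from this discussion, not existing constants, and the existing stages are written as plain strings rather than the codebase's stage constants):

```go
package main

import "fmt"

// Sketch of the bootstrap-host stage ordering as proposed in this thread;
// the two new stage names are proposals, not existing constants in the codebase.
var proposedBootstrapStages = []string{
	"Waiting for control plane",
	"Rebooting",
	"Booted with local ignition",         // new: first boot (pointer ignition) reported
	"Configuring",                        // ignition pulled from the MCS
	"Booted with control plane ignition", // new: second boot (full ignition) reported
	"Joined",
	"Done",
}

func main() { fmt.Println(proposedBootstrapStages) }
```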

### Information Needed And Changes To The Discovery Agent

In order to communicate with `assisted-service`, the assisted boot reporter service needs to know:
1. The `assisted-service` IP address and port.
Contributor:

Suggested change
1. The `assisted-service` IP address and port.
1. The `assisted-service` URL.

Contributor (Author):

Done

The assisted boot reporter service should start via `systemd` and determine its operating stage. As mentioned above:

### First Boot (Pointer Ignition)
Expected stage to be read from assisted-service backend: `Configuring`.
Contributor:

Suggested change
Expected stage to be read from assisted-service backend: `Configuring`.
Expected stage to be read from assisted-service backend: `Rebooting`.

Since the host just started after it was rebooted.

  • We set the host to Configuring after it pulled the ignition from the MCS

omertuc (Contributor), Sep 28, 2022:

Since the host just started after it was rebooted

But by the time the new systemd service runs the host will already be Configuring (unless there's UMN API load balancer as we discussed when I was at the office)

Contributor:

Hmm, I think you are right.
If the host did pull the ignition from the MCS it's already configuring (so what's the point in adding the new stage??).
If the host didn't pull the ignition from the MCS (failed to reach the API or some other issue with the bootstrap control plane) it will fail during ignition and I doubt if the new boot reporter service will start in that case.

Contributor:

If the host did pull the ignition from the MCS it's already configuring (so what's the point in adding the new stage??).

It's still nice to see a new stage, because you know not only that it tried to pull the ignition but also that it began applying it, that our service was able to run, and that our service was able to communicate with the assisted-service; it tells you the networking is working great.

If the host didn't pull the ignition from the MCS (failed to reach the API or some other issue with the bootstrap control plane) it will fail during ignition and I doubt if the new boot reporter service will start in that case.

This is indeed a problem. I wish we could start the new service even when the ignition "remote merge" fails

Contributor (Author):

I'm changing it from Configuring to Rebooting so it will be on-par with the section above.
Also, indeed it does not catch all cases but as Omer said it does let you know that networking is up and running.

Contributor (Author):

Done.


### Open Questions

1. What should we use for implementation? Our options are:
Contributor:

I think we can use both bash and go (container):
For the boot reporting - use bash; a simple curl command should do the trick, and we don't rely on anything other than network access to the service.
For sending logs - I'd prefer we reuse the send-logs functionality already implemented in the assisted-installer-agent. But since this depends on other things (container runtime, access to the registry) I guess we can try to run the agent and, in case this method fails, fall back to a bash script as well.

Contributor:

I don't think we need to rely on pulling an image. The idea is to send logs no matter what; if we need to pull an image, it will break this idea.
I assume we need to use bash, it should be easy code to write and maintain.

Contributor:

I tend to agree, the main value of this service is the ability to get the logs from the node with minimal dependencies.
In that regard a bash script is the way to go.

omertuc (Contributor), Sep 28, 2022:

Why bash? It's not compiled (compilation tells you about mistakes long before you run the program), it's not typed (much easier to write bugs), it's a language riddled with horrible pitfalls, and I have no doubt we will be chasing bugs if we use it.

You can compile a static Go binary (to eliminate OS shared-library dependencies - we're not always sure which OS / RHCOS version we're running against) and place it in the ignition just as easily; then you can use the proper assisted Go API library and make structured API calls without relying on the curl CLI mess. Let's use a real language - we already use it and we share a lot of code with the installer / controller.

Contributor:

The problem with the Go library is providing it through the ignition.
If we decide to go with the Go library we will need to bring it by pulling an image (better to use one of the existing ones and not add a new one). In that case we need to understand that sometimes the image can't be pulled and we will not get the relevant data.

omertuc (Contributor), Sep 28, 2022:

The problem with the Go library is providing it through the ignition.

Why is that a problem?

If we decide to go with the Go library we will need to bring it by pulling an image

No images - just a static binary embedded directly inside the ignition.

Contributor:

For example, a statically built controller executable compressed with upx, with debugging information removed (-ldflags="-s -w"), weighs only 20M base64-encoded - and the controller does much more than what we want to build, so I bet we can make it even smaller if we try.

Contributor:

I have to disagree here with anything that says "container". We need to ship a standalone binary at most, because we want to catch cases where the container runtime on the host is broken. No matter whether we would need to pull a new image or just run something that we shipped, if someone thinks about running this service via e.g. podman run, my very first question will be - what are you going to do if the podman binary doesn't execute as you expected?

We had cases where e.g. c/storage was broken. You weren't able to run any container - not only could you not pull a container, you couldn't even run what you already had.

If we want to be robust, we should use the simplest stack possible. I agree with Omer on the superiority of statically built Go over bash. If we combine it with Nick's idea of just pointing towards the payload instead of embedding it all in the ignition, we will not even increase the ignition's size.

Contributor (Author):

+1 @mkowalski.
To download that script, won't that mean we will need to have an endpoint with no auth (or "empty auth")?
Won't that be a problem with the SaaS deployment?

Unless... we can place the pull secret inside the pointer ignition, which allows us to use agent auth.

carbonin (Member), Oct 3, 2022:

I think we're likely going to have a problem with either release engineering or our build chain if we try to use golang here.

I see a few options and issues with all of them:

  1. We compile these binaries from go source and commit them to our code base:
    • Generally poor form. Source repo should have source, not compiled code.
    • I think this will run afoul of some security guidelines that I can't find a reference to at the moment
    • We will need a build chain with all the architectures we support. (with power and Z on the way this could get complicated very quickly)
  2. We ship source code to the host and compile there
    • Needs go tooling and dependencies on the host
  3. Build an RPM with our changes
    • RPMs are kind of annoying to deal with
    • Need to deal with releng
    • Hard to install on coreos

I see the benefits that @omertuc is referencing, but practically I'm not sure how we would ship something like this to a coreos host without a container image.

1. Python is not included with RHCOS. We will need to install it and some third-party python libs it will probably need (e.g., python-requests).


2. How to handle pending user action state?
Contributor:

Pending user action might be the node's current stage instead of Rebooting (the node moves to pending user action in case the node failed to boot from the installation disk).
So I think we should handle it the same way as we handle Rebooting: if the boot-reporter service started, it means that the node booted from the right disk and we can move to Booted with control plane ignition.

Contributor (Author):

Is there a need to write anything to disk? We aim to upload logs that can be pulled in real time (journal, etc.).

`Waiting for control plane` and `Joined`.


3. For each of the above-mentioned, Read the current host stage from `assisted-service`, to determine if:
Contributor:

Unsure what is the added value of this service after the first boot.

Contributor:

I think it was about the self-cleanup, am I right? But maybe the service should be a one-shot, so that the very first instance of the service is responsible for collecting the logs, reporting them back to SaaS, and immediately killing itself afterwards? That way there will never be any sign of our service in any subsequent reboot.

Contributor (Author):

I had a discussion with @avishayt and @tsorya about this, in which we determined that we want to reflect each boot in its own stage.
This will:

  1. Allow more visibility into the host's progress.
  2. Allow the service to determine whether this is the first or second boot (and by that, activate the cleanup) by reading the host stage from assisted-service (see the sketch below).
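
A rough sketch of that second point, assuming the reporter can read its own host record from assisted-service; the stage names follow the proposals in this thread, and pending user action is treated like Rebooting, as suggested above:

```go
// Sketch: decide what to do based on the host stage read back from assisted-service.
// Stage names are the proposals from this thread, not existing constants.
package main

import "fmt"

func nextAction(currentStage string) string {
	switch currentStage {
	case "Rebooting", "Pending user action":
		// First boot from disk: the fact that we are running at all means the
		// host booted from the right disk.
		return "report: Booted with local ignition"
	case "Booted with local ignition", "Configuring":
		// Second boot (control-plane ignition applied).
		return "report: Booted with control plane ignition"
	case "Joined", "Done":
		// Installation finished; nothing to report, clean the service up.
		return "self-cleanup"
	default:
		return "best effort: upload logs only"
	}
}

func main() {
	for _, s := range []string{"Rebooting", "Configuring", "Done"} {
		fmt.Printf("%-12s -> %s\n", s, nextAction(s))
	}
}
```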

docs/enhancements/boot-status-reporter.md (resolved thread)

2. How to handle pending user action state?

3. Are there any SNO specific things to take into account?
Contributor:

Yes. In the case of SNO (there is no Configuring stage, the node installation is more cumbersome, and there is no installer/controller to report the progress or send logs) it makes sense for this service to keep reporting and sending logs until the assisted-installer-controller is running.
In a multi-node cluster installation it does not, and this service might override the logs gathered by the current logs-collection implementation.


It will need this information starting from the first boot, which is a pointer ignition-based boot.

For that, we need to modify the discovery agent to store this information on disk in a pre-defined location for the assisted boot reporter service to use.
Contributor:

Why not embed the files in the installed ignition?

Contributor:

I think that was the intention here, probably requires rephrasing.

Contributor (Author):

Still testing this. However, I had a chat with @carbonin about this, and he thinks we should avoid it.
Instead, he proposed having a script file downloaded via the assisted-service API (probably a bash script).
I'll complete the test anyhow, but mentioned this in case Nick wants to expand on it.

Member:

Yes, but my suggestion was just to avoid an incredibly large payload of base64 data in ignition.
Ignition itself also allows files to be created using regular http urls so I'm suggesting to download whatever script/executable thing we use with that format rather than embedding directly into the pointer ignition.

I also suggested bash because it's the most portable.
I don't think we want to get into the business of compiling go programs for every architecture we might soon support.

I think there's a lot of room for discussion on how to go about distributing whatever this service is, and a lot of it depends on how complicated it ends up being.
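
For illustration, the pointer-ignition entry described above could look roughly like the output of this sketch; the struct definitions are local simplifications rather than the real Ignition config types, and the download URL and install path are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the Ignition v3 config types, only to show the shape
// of a file entry that is downloaded over HTTP instead of embedded as base64.
type ignitionSketch struct {
	Ignition struct {
		Version string `json:"version"`
	} `json:"ignition"`
	Storage struct {
		Files []fileEntry `json:"files"`
	} `json:"storage"`
}

type fileEntry struct {
	Path     string `json:"path"`
	Mode     int    `json:"mode"`
	Contents struct {
		Source string `json:"source"`
	} `json:"contents"`
}

func main() {
	var cfg ignitionSketch
	cfg.Ignition.Version = "3.1.0"

	var reporter fileEntry
	reporter.Path = "/usr/local/bin/assisted-boot-reporter" // hypothetical install path
	reporter.Mode = 0755
	// Fetched at boot time instead of inlined, so the pointer ignition stays small.
	reporter.Contents.Source = "https://assisted-service.example.com/boot-reporter" // placeholder URL

	cfg.Storage.Files = append(cfg.Storage.Files, reporter)

	out, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(out))
}
```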


It will need this information starting from the first boot, which is a pointer ignition-based boot.

For that, we need to modify the discovery agent to store this information on disk in a pre-defined location for the assisted boot reporter service to use.
Contributor:

Not sure I get it right. Why is this needed in the discovery agent?
I think assisted-service should patch the pointer ignition with all the relevant values; no need for anything in the agent.

Contributor (Author):

Is it okay to place the pull secret in the pointer ignition?
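
For illustration only, this is the kind of small file that either the discovery agent or an assisted-service ignition patch would need to leave behind for the reporter; the path and field names are the same hypothetical ones used in the earlier sketch, and whether the token can be a pull-secret-derived agent token placed in the pointer ignition is exactly the open question raised here:

```go
package main

import (
	"encoding/json"
	"os"
)

// Same hypothetical config shape as in the earlier sketch; the auth token's
// origin (pull secret in the pointer ignition?) is the open question above.
type reporterConfig struct {
	ServiceURL string `json:"service_url"`
	ClusterID  string `json:"cluster_id"`
	HostID     string `json:"host_id"`
	Token      string `json:"token"`
}

func main() {
	cfg := reporterConfig{
		ServiceURL: "https://assisted-service.example.com", // placeholder
		ClusterID:  os.Getenv("CLUSTER_ID"),
		HostID:     os.Getenv("HOST_ID"),
		Token:      os.Getenv("AGENT_AUTH_TOKEN"),
	}
	data, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		panic(err)
	}
	// 0600: the file may carry a credential, so keep it root-readable only.
	if err := os.WriteFile("/etc/assisted/boot-reporter.json", data, 0o600); err != nil {
		panic(err)
	}
}
```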

nmagnezi force-pushed the MGMT-11090_post_boot_status_reporter branch from bbf13ba to 627eda7 on October 3, 2022 09:30
nmagnezi force-pushed the MGMT-11090_post_boot_status_reporter branch from 627eda7 to b6b9e4c on October 3, 2022 11:28
For cases as mentioned in `Motivation`, we need a new service (started via `systemd`) as soon as possible after hosts
are booting from the disk and contact `assisted-service` to:

1. Change the installation stage to a new `Booted with local ignition` between `Rebooting` and `Configuring`.
Contributor:

How can we identify that the host was already installed?
For example, if the host just reboots for some reason, the service will start again, no? What will it do?

nmagnezi mentioned this pull request Oct 24, 2022
openshift-merge-robot pushed a commit that referenced this pull request Jan 12, 2023
Per assisted-service/pull/4444[1] , add a new log type named: node-boot.

This change updates both V2UploadLogs and V2DownloadClusterLogs to accept the new log type.
In addition:
1. A new event was added: host_boot_logs_uploaded.
2. Defined download name: <cluster_name>_<host_role>_boot_<host_id>.tar.gz
3. Note that V2UploadLogs with log type node-boot will not update the host progress.

[1] #4444
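
For reference, a hedged sketch of how a client could exercise the new node-boot log type added by this commit; the V2UploadLogs path and the multipart form-field name are assumptions modeled on other upload endpoints and should be checked against swagger.yaml:

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Journal archive produced by the boot reporter (file name is illustrative).
	archive, err := os.ReadFile("node_boot_logs.tar.gz")
	if err != nil {
		panic(err)
	}

	// Build a multipart body; "upfile" is assumed to be the form field name,
	// matching other assisted-service upload endpoints (verify against swagger.yaml).
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, _ := w.CreateFormFile("upfile", "node_boot_logs.tar.gz")
	part.Write(archive)
	w.Close()

	// Endpoint path is an assumption modeled on the V2UploadLogs operation;
	// logs_type=node-boot is the new type introduced by this change.
	url := fmt.Sprintf(
		"https://assisted-service.example.com/api/assisted-install/v2/clusters/%s/logs?logs_type=node-boot&host_id=%s",
		os.Getenv("CLUSTER_ID"), os.Getenv("HOST_ID"))

	req, _ := http.NewRequest(http.MethodPost, url, &body)
	req.Header.Set("Content-Type", w.FormDataContentType())
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("upload status:", resp.Status)
}
```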
nmagnezi, eliorerz, and danielerez subsequently pushed commits to their forks (nmagnezi/assisted-service, eliorerz/assisted-service, danielerez/assisted-service) that referenced this pull request between January and October 2023.
Labels: approved, lgtm, size/L
10 participants