Skip to content
This repository has been archived by the owner on May 6, 2020. It is now read-only.

Fix node local storage #72

Merged
merged 2 commits into from
Sep 27, 2019
Merged

Fix node local storage #72

merged 2 commits into from
Sep 27, 2019

Conversation

rata
Copy link
Contributor

@rata rata commented Sep 26, 2019

There are two commits, with messages explaining them in detail. I'll copy paste one, the most important one that actually fixes the issue, here:

The rkt pod is exposed with /mnt/ volume1 and in this PR:

    https://github.com/kinvolk/lokomotive-kubernetes/pull/48/files#diff-b6c7caf796cd86bdcdd936319b1793a1R152

the location where the RAID 0 array is mounted was changed from /mnt
to /mnt/<some-dir>. This breaks using local volumes in pods, as it
seems the mount is not visible for the rkt container or the kubelet
running in rkt.

The investigation of the root cause of this issue, to understand why
mounts inside mnt (like in /mnt/node-local-storage) can't be used by
pods currently and what needs to be changed for that, is left as a for
future patches and issue #73 was created to investigate this.

This patch just makes the mount location back to /mnt for all the cases
(setup_raid, setup_raid_*) so it works on all cases. However, this
imposes one limitation: setup_raid_hdd and setup_raid_ssd are mutually
exclusive now.

This limitation is not limiting something that was working in master, in
fact setup_raid_hdd and setup_raid_ssd (when setup_raid_ssd_fs is set)
were completely broken since they were merged and pods never wrote to
those volumes, for the very same reason issue #73 states: mounts insde
/mnt are not visible to pods.

Therefore, this patches fixes node local storage while making those
options exclusive (only one can be set), and this is not a big problem
as those options never really worked.

Some alternative fixes were considered, like changing the path that is
exposed to rkt container to be /mnt/node-local-storage or
/mnt/node-hdd-local-storage, according to what option was used
(setup_raid, setup_raid_hdd, and exposing both if both are set), but
that was messy without any good reason (it is better to tackle #73
before doing something that ofuscated). So, we decided for this more
simpler approach.

This patch is just a minimal fix, "a revert" to mount in /mnt/ again,
to make this work again on master.

As a side effect of this PR, too, another issue to reconsider if we need
so many setup_raid_* flags was created
(#74) and to even
consider a totally different approach than the current bash script
before it gets out of control: #75

Copy link
Contributor

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please create follow up issues as we discussed

Copy link
Contributor

@surajssd surajssd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@pothos pothos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSD and HDD RAID are currently not exclusive: https://github.com/kinvolk/lokomotive-kubernetes/blob/master/packet/flatcar-linux/kubernetes/workers/cl/worker.yaml.tmpl#L216 (I would remove the comment and turn the second if into an else-if.)

At least, please update the docs to say that they are exclusive.

Maybe we can we validate this in terraform and throw an error as early as possible?

@rata
Copy link
Contributor Author

rata commented Sep 27, 2019

@pothos oh, good point! It is exclusive with setup_raid but those others are not exclusive. Thanks for noticing!

I'll follow your recommendation and put and if-else there, document them as exclusive and create two issues: one to address the root cause and make sure we understand why mounts are not seen by the kubelet when done inside and another issue to either remove those options (not sure they serve any useful purpose, would like to double check that because the code is messy already) or fix them so both can be used at the same time (right not, sine they were added, it was not working as expected).

This RAID 0 array should be created and mounted before the kubelet
starts, as pods may need to use the node local storage to run on this
node.

This dependency was lost in this PR:

	#28

Before this PR, the RAID 0 array was created when IPXE booting and
therefore already existed when the node was rebooted after
faltcat-install. After this PR there is no IPXE boot so the RAID 0 array
is created in the same boot that the kubelet starts, but we missed to
add that dependency that was implicit before (as different systemd units
were executed on different boots, and therefore the array was always
created when the kubelet was about to start).

This PR is just a cleanup to have the dependency, but note that this
doesn't fix node local storage and the previous commit in this PR is
needed for that.
The rkt pod is exposed with /mnt/ volume[1] and in this PR:

	https://github.com/kinvolk/lokomotive-kubernetes/pull/48/files#diff-b6c7caf796cd86bdcdd936319b1793a1R152

the location where the RAID 0 array is mounted was changed from `/mnt`
to `/mnt/<some-dir>`. This breaks using local volumes in pods, as it
seems the mount is not visible for the rkt container or the kubelet
running in rkt.

The investigation of the root cause of this issue, to understand why
mounts inside mnt (like in `/mnt/node-local-storage`) can't be used by
pods currently and what needs to be changed for that, is left as a for
future patches and issue #73 was created to investigate this.

This patch just makes the mount location back to `/mnt` for all the cases
(`setup_raid`, `setup_raid_*`) so it works on all cases. However, this
imposes one limitation: `setup_raid_hdd` and `setup_raid_ssd` are mutually
exclusive now.

This limitation is not limiting something that was working in master, in
fact `setup_raid_hdd` and `setup_raid_ssd` (when `setup_raid_ssd_fs` is set)
were completely broken since they were merged and pods never wrote to
those volumes, for the very same reason issue #73 states: mounts insde
`/mnt` are not visible to pods.

Therefore, this patches fixes node local storage while making those
options exclusive (only one can be set), and this is not a big problem
as those options never really worked.

Some alternative fixes were considered, like changing the path that is
exposed to rkt container to be /mnt/node-local-storage or
/mnt/node-hdd-local-storage, according to what option was used
(setup_raid, setup_raid_hdd, and exposing both if both are set), but
that was messy without any good reason (it is better to tackle #73
before doing something that ofuscated). So, we decided for this, more
simpler, approach.

This patch is just a minimal fix, "a revert" to mount in `/mnt/` again,
to make this work again on master.

As a side effect of this PR, too, another issue to reconsider if we need
so many `setup_raid_*` flags was created
(#74) and to even
consider a totally different approach than the current bash script
before it gets out of control: #75

[1]: https://github.com/kinvolk/lokomotive-kubernetes/blob/d59d071a451f45ac61c2524b94a146a6cde60401/packet/flatcar-linux/kubernetes/workers/cl/worker.yaml.tmpl#L65-L66
@rata rata force-pushed the rata-suraj/fix-node-local-storage branch from 7c4b6bc to 60c0e22 Compare September 27, 2019 14:12
@rata rata requested a review from pothos September 27, 2019 14:12
@rata
Copy link
Contributor Author

rata commented Sep 27, 2019

@pothos PTAL :)

Will also update the PR description with these changes.

@rata rata requested a review from invidian September 27, 2019 14:21
Copy link
Contributor

@pothos pothos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this and hope it stays that way for some time but of course we can change it again later if needed. The other approach which forwards the subdirectories to the kubelet is maybe more generic (and the terraform template can make use of the variables to only set the option that is needed).

@rata rata merged commit 886b36b into master Sep 27, 2019
@rata rata deleted the rata-suraj/fix-node-local-storage branch September 27, 2019 14:31
if [ ${setup_raid_ssd} = true ]; then
create_data_raid "$${major_numbers}" 0 /dev/md/node-local-ssd-storage ${setup_raid_ssd_fs}
fi
elif [ ${setup_raid_hdd} = true ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure, why this change was introduced?

With this code now we can only set RAID of type HDD or SSD. This change has added another regression!

Copy link
Contributor Author

@rata rata Sep 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because mounting hdd and sdd both on /mnt is not possible, and the original patch was doing that (if you configure both to have an fs. This is explained in the updated PR description :-)

And those changes never worked (new setup_raid_* flags), they never wrote to those devices, as things mounted under /mnt are not visible to pods. I've created #74 to double check if we really need this and also created #73 to tackle the root cause and even proposed a fix there (if better fixes can't be found), see the comment that links to ba80265

However, as far as I understand those never worked and my reasoning was this: as those variables never worked, introduced for a few commits a backwards incompatible change and I'm not really sure we need them, let's just revert to the previous working situation and if we need to introduce them and do a backwards incompatible change, let's just do it. But it would be great if we can avoid using these, as the code is getting out of control (also created #75 to suggest a path to keep this under control) and introducing more bugs than solutions (as just created bugs but never really worked :-D).

If we later need to do the backwards incompatible change, let's do them (as we tried to do it when we first introduced these vars and change mounting from /mnt to /mnt/node-local-storage). But let's do them making sure everything works and, if possible, trying to make the change as simple as possible and not generate much tech debt :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants