Merge pull request #6916 from smarterclayton/static_masters
Automatic merge from submit-queue.

Make openshift-ansible use static pods to install the control plane, make nodes prefer bootstrapping

1. Nodes continue to be configured for bootstrapping (as today)
2. For bootstrap nodes, we write a generic bootstrap-node-config.yaml that contains static pod references and any bootstrap config, and then use that to start a child kubelet using `--write-flags` instead of launching the node ourselves.  If a node-config.yaml is laid down in `/etc/origin/node` it takes precedence.
3. For 3.10 we want dynamic node config from Kubernetes to pull down additional files, but there are functional gaps.  For now, the openshift-sdn container has a sidecar that syncs node config to disk and updates labels (the kubelet doesn't update labels, kubernetes/kubernetes#59314).
4. On the masters, if `openshift_master_bootstrap_enabled` is set, we generate the master-config.yaml and the etcd config, but we don't start etcd or the masters (no services are installed).
5. On the masters, we copy the static files into the correct pod-manifest-path (`/etc/origin/node/pods` or similar).
6. The kubelet at that point should automatically pick up the new static files and launch the components.
7. We wait for them to converge.
8. We install openshift-sdn as the first component, which allows nodes to go ready and start installing things.  There is a gap here where the masters are up, the nodes can bootstrap, but the nodes are not ready because no network plugin is installed.
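
Steps 5 through 7 above amount to dropping manifest files into the directory the kubelet watches and then polling until the components respond. A minimal shell sketch of that flow, assuming hypothetical helper names `install_static_pods` and `wait_until` (neither is part of openshift-ansible), might look like this:

```shell
#!/bin/sh
# Hypothetical helper: copy static pod manifests (*.yaml) from a staging
# directory into the kubelet's pod-manifest-path (steps 5 and 6).
install_static_pods() {
  local src_dir="$1"
  local manifest_dir="$2"
  local f
  mkdir -p "$manifest_dir"
  for f in "$src_dir"/*.yaml; do
    # Skip the unexpanded glob when the staging directory has no manifests.
    [ -e "$f" ] || continue
    cp "$f" "$manifest_dir/"
  done
}

# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between tries, until it succeeds (step 7).
wait_until() {
  local attempts="$1"
  local delay="$2"
  local i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A run could then be `install_static_pods ./static-pods /etc/origin/node/pods` followed by `wait_until 30 10 <health check command>`; the staging directory, retry counts, and health check are placeholders rather than values taken from this change.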

Challenges at this point:

* The master shims (`master-logs` and `master-restart`) need to deal with CRI-O and systemd.  Ideally this is a temporary shim until we remove systemd for these components and have `crictl` installed.
* We need to test failure modes of the static pods
* Testing

Things to explore further:

* Need to get all the images either using image streams or properly replaced in the static pods
* Need to look at upgrades and updates
* Disk locations become our API (`/var/lib/origin`, `/var/lib/etcd`) - how many customers have fiddled with this?
* May need to make the kubelet halt if it hasn't been able to get server/client certs within a bounded window (5m?), to ensure that autoheals happen (openshift/origin#18430)
* Have to figure out whether dynamic kubelet config is something we can rely on for 3.10 (@liggitt), and what gaps there are with dynamic reconfig
* client-ca.crt is not handled by bootstrapping or dynamic config.  This needs a solution unless we keep the openshift-sdn sidecar around
* The kubelet doesn't send `sd_notify` to systemd (kubernetes/kubernetes#59079)

@derekwaynecarr @sdodson @liggitt @deads2k this is the core of self-hosting.
openshift-merge-robot committed Mar 11, 2018
2 parents 570ea7d + c826c43 commit 4d51b56
Showing 84 changed files with 2,289 additions and 551 deletions.
16 changes: 13 additions & 3 deletions .papr.inventory
@@ -7,14 +7,24 @@ etcd
ansible_ssh_user=root
ansible_python_interpreter=/usr/bin/python3
openshift_deployment_type=origin
openshift_image_tag="{{ lookup('env', 'OPENSHIFT_IMAGE_TAG') }}"
openshift_master_default_subdomain="{{ lookup('env', 'RHCI_ocp_node1_IP') }}.xip.io"
openshift_check_min_host_disk_gb=1.5
openshift_check_min_host_memory_gb=1.9
osm_cluster_network_cidr=10.128.0.0/14
openshift_portal_net=172.30.0.0/16
osm_host_subnet_length=9

[all:vars]
# bootstrap configs
openshift_node_groups=[{"name":"node-config-master","labels":["node-role.kubernetes.io/master=true","node-role.kubernetes.io/infra=true"]},{"name":"node-config-node","labels":["node-role.kubernetes.io/compute=true"]}]
openshift_master_bootstrap_enabled=true
openshift_master_bootstrap_auto_approve=true
openshift_master_bootstrap_auto_approver_node_selector={"region":"infra"}
osm_controller_args={"experimental-cluster-signing-duration": ["20m"]}
openshift_node_bootstrap=true
openshift_hosted_infra_selector="node-role.kubernetes.io/infra=true"
osm_default_node_selector="node-role.kubernetes.io/compute=true"

[masters]
ocp-master

@@ -23,5 +33,5 @@ ocp-master

[nodes]
ocp-master openshift_schedulable=true
ocp-node1 openshift_node_labels="{'region':'infra'}"
ocp-node2 openshift_node_labels="{'region':'infra'}"
ocp-node1
ocp-node2
24 changes: 12 additions & 12 deletions .papr.sh
@@ -6,17 +6,16 @@ set -xeuo pipefail
# specific version which quickly becomes stale.

if [ -n "${PAPR_BRANCH:-}" ]; then
target_branch=$PAPR_BRANCH
target_branch=$PAPR_BRANCH
else
target_branch=$PAPR_PULL_TARGET_BRANCH
target_branch=$PAPR_PULL_TARGET_BRANCH
fi
if [[ "${target_branch}" =~ ^release- ]]; then
target_branch="${target_branch/release-/v}"
else
dnf install -y sed
target_branch="$( git describe | sed 's/^openshift-ansible-\([0-9]*\.[0-9]*\)\.[0-9]*-.*/v\1/' )"
fi

# this is a bit wasteful, though there's no easy way to say "only clone up to
# the first tag in the branch" -- ideally, PAPR could help with caching here
git clone --branch $target_branch --single-branch https://github.com/openshift/origin
export OPENSHIFT_IMAGE_TAG=$(git -C origin describe --abbrev=0)

echo "Targeting OpenShift Origin $OPENSHIFT_IMAGE_TAG"

pip install -r requirements.txt

@@ -32,10 +31,11 @@ upload_journals() {

trap upload_journals ERR

# make all nodes ready for bootstrapping
ansible-playbook -vvv -i .papr.inventory playbooks/openshift-node/private/image_prep.yml

# run the actual installer
# FIXME: override openshift_image_tag defined in the inventory until
# https://github.com/openshift/openshift-ansible/issues/4478 is fixed.
ansible-playbook -vvv -i .papr.inventory playbooks/deploy_cluster.yml -e "openshift_image_tag=$OPENSHIFT_IMAGE_TAG"
ansible-playbook -vvv -i .papr.inventory playbooks/deploy_cluster.yml -e "openshift_release=${target_release}"

### DISABLING TESTS FOR NOW, SEE:
### https://github.com/openshift/openshift-ansible/pull/6132
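
The branch-to-version mapping in the `.papr.sh` hunk above can be tried in isolation. The sketch below wraps it in a hypothetical function, `to_release` (not part of the script), and uses POSIX `case` and `${ref#release-}` in place of the script's bash-only `[[ =~ ]]` and `${var/release-/v}`; the sed expression is copied unchanged.

```shell
#!/bin/sh
# Hypothetical wrapper around the mapping used in .papr.sh:
#   release-X.Y                      -> vX.Y
#   openshift-ansible-X.Y.Z-<rest>   -> vX.Y   (a `git describe` string)
to_release() {
  local ref="$1"
  case "$ref" in
    release-*)
      # Strip the "release-" prefix and prepend "v".
      echo "v${ref#release-}"
      ;;
    *)
      # Keep only the major.minor part of the described tag.
      echo "$ref" | sed 's/^openshift-ansible-\([0-9]*\.[0-9]*\)\.[0-9]*-.*/v\1/'
      ;;
  esac
}

to_release release-3.9
to_release openshift-ansible-3.10.0-0.1.0-5-gabcdef0
```

The second input is a made-up `git describe` string used only to show the shape the sed expression expects.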
@@ -6,14 +6,23 @@
retries: 3
delay: 30

- name: Restart static master services
command: /usr/local/bin/master-restart "{{ item }}"
with_items:
- api
- controllers
- etcd
failed_when: false
when: openshift_is_containerized | bool

- name: Restart containerized services
service: name={{ item }} state=started
with_items:
- etcd_container
- openvswitch
- "{{ openshift_service_type }}-master-api"
- "{{ openshift_service_type }}-master-controllers"
- "{{ openshift_service_type }}-node"
- etcd_container
- openvswitch
- "{{ openshift_service_type }}-master-api"
- "{{ openshift_service_type }}-master-controllers"
- "{{ openshift_service_type }}-node"
failed_when: false
when: openshift_is_containerized | bool

@@ -4,14 +4,22 @@
- name: Stop containerized services
service: name={{ item }} state=stopped
with_items:
- "{{ openshift_service_type }}-master-api"
- "{{ openshift_service_type }}-master-controllers"
- "{{ openshift_service_type }}-node"
- etcd_container
- openvswitch
- "{{ openshift_service_type }}-master-api"
- "{{ openshift_service_type }}-master-controllers"
- "{{ openshift_service_type }}-node"
- etcd_container
- openvswitch
failed_when: false
when: openshift_is_containerized | bool

- name: Restart static master services
command: /usr/local/bin/master-restart "{{ item }}"
with_items:
- api
- controllers
- etcd
failed_when: false

- name: Check Docker image count
shell: "docker images -aq | wc -l"
register: docker_image_count
1 change: 0 additions & 1 deletion playbooks/init/base_packages.yml
@@ -35,7 +35,6 @@
- >
(openshift_use_system_containers | default(False)) | bool
or (openshift_use_etcd_system_container | default(False)) | bool
or (openshift_use_openvswitch_system_container | default(False)) | bool
or (openshift_use_node_system_container | default(False)) | bool
or (openshift_use_master_system_container | default(False)) | bool
register: result
3 changes: 3 additions & 0 deletions playbooks/openshift-master/private/additional_config.yml
@@ -18,6 +18,9 @@
etcd_urls: "{{ openshift.master.etcd_urls }}"
omc_cluster_hosts: "{{ groups.oo_masters | join(' ')}}"
roles:
# TODO: this is currently required in order to schedule pods onto the masters, but
# should be moved into components once nodes are using dynamic config
- role: openshift_sdn
- role: openshift_project_request_template
when: openshift_project_request_template_manage
- role: openshift_examples
18 changes: 18 additions & 0 deletions playbooks/openshift-master/private/config.yml
@@ -176,6 +176,18 @@
openshift_no_proxy_etcd_host_ips: "{{ hostvars | lib_utils_oo_select_keys(groups['oo_etcd_to_config'] | default([]))
| lib_utils_oo_collect('openshift.common.ip') | default([]) | join(',')
}}"
pre_tasks:
# This will be moved into the control plane role once openshift_master is removed
- name: Add static pod and systemd shim commands
import_role:
name: openshift_control_plane
tasks_from: static_shim
- name: Prepare the bootstrap node config on masters for self-hosting
import_role:
name: openshift_node_group
tasks_from: bootstrap
when: openshift_master_bootstrap_enabled | default(false) | bool

roles:
- role: openshift_master_facts
- role: openshift_clock
@@ -184,6 +196,8 @@
- role: openshift_builddefaults
- role: openshift_buildoverrides
- role: nickhammond.logrotate

# DEPRECATED: begin moving away from this
- role: openshift_master
openshift_master_ha: "{{ (groups.oo_masters | length > 1) | bool }}"
openshift_master_hosts: "{{ groups.oo_masters_to_config }}"
@@ -193,6 +207,10 @@
openshift_master_default_registry_value: "{{ hostvars[groups.oo_first_master.0].l_default_registry_value }}"
openshift_master_default_registry_value_api: "{{ hostvars[groups.oo_first_master.0].l_default_registry_value_api }}"
openshift_master_default_registry_value_controllers: "{{ hostvars[groups.oo_first_master.0].l_default_registry_value_controllers }}"
when: not ( openshift_master_bootstrap_enabled | default(false) | bool )

- role: openshift_control_plane
when: openshift_master_bootstrap_enabled | default(false) | bool
- role: tuned
- role: nuage_ca
when: openshift_use_nuage | default(false) | bool
17 changes: 6 additions & 11 deletions playbooks/openshift-master/private/scaleup.yml
@@ -15,19 +15,14 @@
yaml_key: 'kubernetesMasterConfig.masterCount'
yaml_value: "{{ openshift.master.master_count }}"
notify:
- restart master api
- restart master controllers
- restart master
handlers:
- name: restart master api
service: name={{ openshift_service_type }}-master-controllers state=restarted
- name: restart master
command: /usr/local/bin/master-restart "{{ item }}"
with_items:
- api
- controllers
notify: verify api server
# We retry the controllers because the API may not be 100% initialized yet.
- name: restart master controllers
command: "systemctl restart {{ openshift_service_type }}-master-controllers"
retries: 3
delay: 5
register: result
until: result.rc == 0
- name: verify api server
command: >
curl --silent --tlsv1.2
15 changes: 5 additions & 10 deletions playbooks/openshift-master/private/tasks/wire_aggregator.yml
@@ -191,16 +191,11 @@
#restart master serially here
- when: yedit_output.changed or (yedit_asset_config_output is defined and yedit_asset_config_output.changed)
block:
- name: restart master api
systemd: name={{ openshift_service_type }}-master-api state=restarted

# We retry the controllers because the API may not be 100% initialized yet.
- name: restart master controllers
command: "systemctl restart {{ openshift_service_type }}-master-controllers"
retries: 3
delay: 5
register: result
until: result.rc == 0
- name: restart master
command: /usr/local/bin/master-restart "{{ item }}"
with_items:
- api
- controllers

- name: Verify API Server
# Using curl here since the uri module requires python-httplib2 and
4 changes: 4 additions & 0 deletions playbooks/openshift-node/private/image_prep.yml
@@ -25,6 +25,10 @@
- import_role:
name: openshift_node
tasks_from: bootstrap.yml
- import_role:
name: openshift_node_group
tasks_from: bootstrap.yml


- name: Re-enable excluders
import_playbook: enable_excluders.yml
1 change: 0 additions & 1 deletion roles/container_runtime/tasks/package_docker.yml
@@ -8,7 +8,6 @@
- >
(openshift_use_system_containers | default(False)) | bool
or (openshift_use_etcd_system_container | default(False)) | bool
or (openshift_use_openvswitch_system_container | default(False)) | bool
or (openshift_use_node_system_container | default(False)) | bool
or (openshift_use_master_system_container | default(False)) | bool
2 changes: 1 addition & 1 deletion roles/etcd/defaults/main.yaml
@@ -80,7 +80,7 @@ etcd_listen_client_urls: "{{ etcd_url_scheme }}://{{ etcd_ip }}:{{ etcd_client_p
#etcd_peer: 127.0.0.1
etcdctlv2: "{{ r_etcd_common_etcdctl_command }} --cert-file {{ etcd_peer_cert_file }} --key-file {{ etcd_peer_key_file }} --ca-file {{ etcd_peer_ca_file }} -C https://{{ etcd_peer }}:{{ etcd_client_port }}"

etcd_service: "{{ 'etcd_container' if r_etcd_common_etcd_runtime == 'docker' else 'etcd' }}"
etcd_service: etcd
# Location of the service file is fixed and not meant to be changed
etcd_service_file: "/etc/systemd/system/{{ etcd_service }}.service"

40 changes: 40 additions & 0 deletions roles/etcd/files/etcd.yaml
@@ -0,0 +1,40 @@
kind: Pod
apiVersion: v1
metadata:
name: master-etcd
namespace: kube-system
labels:
openshift.io/control-plane: "true"
openshift.io/component: etcd
spec:
restartPolicy: Always
hostNetwork: true
containers:
- name: etcd
image: quay.io/coreos/etcd:v3.3
workingDir: /var/lib/etcd
command: ["/bin/sh", "-c"]
args:
- |
#!/bin/sh
set -o allexport
source /etc/etcd/etcd.conf
exec etcd
securityContext:
privileged: true
volumeMounts:
- mountPath: /etc/etcd/
name: master-config
readOnly: true
- mountPath: /var/lib/etcd/
name: master-data
livenessProbe:
tcpSocket:
port: 2379
volumes:
- name: master-config
hostPath:
path: /etc/etcd/
- name: master-data
hostPath:
path: /var/lib/etcd
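
The container command in `etcd.yaml` above relies on `set -o allexport` so that every assignment sourced from `/etc/etcd/etcd.conf` becomes an environment variable that the `exec`'d etcd process inherits. The following sketch shows that mechanism on its own; the temporary file and its values are stand-ins invented for the illustration, and a child `sh` takes the place of etcd.

```shell
#!/bin/sh
# Stand-in for /etc/etcd/etcd.conf: plain KEY=value lines, no `export`.
# The values are made up for this illustration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ETCD_NAME=master-1
ETCD_DATA_DIR=/var/lib/etcd
EOF

set -o allexport   # mark every variable assigned from here on for export
. "$conf"          # POSIX spelling of `source`
set +o allexport   # stop auto-exporting; variables already set stay exported

# A child process sees the values in its environment, as etcd would.
child_view=$(sh -c 'echo "$ETCD_NAME $ETCD_DATA_DIR"')
echo "$child_view"
rm -f "$conf"
```

Without `allexport`, the sourced assignments would stay local to the shell and the child process would see empty values.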
@@ -3,7 +3,9 @@
package:
name: "etcd{{ '-' + etcd_version if etcd_version is defined else '' }}"
state: present
when: not etcd_is_containerized | bool
when: not etcd_is_atomic | bool
delegate_to: "{{ etcd_ca_host }}"
run_once: true
register: result
until: result is succeeded

@@ -178,8 +180,8 @@
file:
path: "{{ item }}"
mode: 0600
owner: "{{ 'etcd' if not etcd_is_containerized | bool else omit }}"
group: "{{ 'etcd' if not etcd_is_containerized | bool else omit }}"
owner: "etcd"
group: "etcd"
when: etcd_url_scheme == 'https'
with_items:
- "{{ etcd_ca_file }}"
@@ -190,8 +192,8 @@
file:
path: "{{ item }}"
mode: 0600
owner: "{{ 'etcd' if not etcd_is_containerized | bool else omit }}"
group: "{{ 'etcd' if not etcd_is_containerized | bool else omit }}"
owner: "etcd"
group: "etcd"
when: etcd_peer_url_scheme == 'https'
with_items:
- "{{ etcd_peer_ca_file }}"
@@ -202,6 +204,6 @@
file:
path: "{{ etcd_conf_dir }}"
state: directory
owner: "{{ 'etcd' if not etcd_is_containerized | bool else omit }}"
group: "{{ 'etcd' if not etcd_is_containerized | bool else omit }}"
owner: "etcd"
group: "etcd"
mode: 0700