OKE - Ability to Set Node Taints #1504

Open
steve-gray opened this issue Dec 15, 2021 · 7 comments

@steve-gray

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

Ability to set node taints on OKE node pools, allowing workloads to be split across nodes by role. Today this kind of split is only possible by manually tainting nodes (onerous) or by using labels and anti-affinity scheduling rules on every workload to keep all other pods off the dedicated nodes.

New or Affected Resource(s)

oci_containerengine_node_pool

Potential Terraform Configuration

  taint {
    key    = "special"
    value  = "true"
    effect = "PREFER_NO_SCHEDULE"
  }

References

This has been present in the AWS and other cloud providers' Terraform modules for a while. Examples of prior art:
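For instance, here is a rough sketch of the AWS provider's aws_eks_node_group resource, which already accepts repeatable taint blocks (see the AWS provider documentation for the authoritative schema; everything outside the taint block is elided):

resource "aws_eks_node_group" "example" {
  # ... cluster_name, node_role_arn, subnet_ids, scaling_config, etc. ...

  # Repeatable block; effect is one of NO_SCHEDULE, NO_EXECUTE, PREFER_NO_SCHEDULE
  taint {
    key    = "special"
    value  = "true"
    effect = "PREFER_NO_SCHEDULE"
  }
}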

@bbenlazreg

Any updates on this?

@steve-gray
Author

As a workaround you can force this in by setting kubelet-extra-args in the cloud-init script, @bbenlazreg; that's what we're doing. It works well enough, but it means we now have to template the cloud-init script when creating OKE clusters, which isn't great.

@jkrajniak

@steve-gray could you provide a sample tf with this approach?

@manics

manics commented Feb 14, 2023

@jkrajniak try something like this

resource "oci_containerengine_node_pool" "pool1" {
  ...

  node_metadata = {
    # https://blogs.oracle.com/cloud-infrastructure/post/container-engine-for-kubernetes-custom-worker-node-startup-script-support
    # https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengusingcustomcloudinitscripts.htm
    user_data = base64encode(<<-EOT
      #!/bin/bash
      curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
      bash /var/run/oke-init.sh --kubelet-extra-args "--register-with-taints=${var.kubernetes-pool1-taint}"
      EOT
    )
  }

  ...
}
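The string interpolated into --register-with-taints uses kubelet's key=value:Effect syntax (multiple taints can be comma-separated), so the referenced variable could be declared roughly like this (the default shown is only an example):

variable "kubernetes-pool1-taint" {
  type        = string
  description = "Taint registered by kubelet at startup, in key=value:Effect form"
  # Example value only; pick a key and effect that match your workloads' tolerations.
  default     = "dedicated=pool1:NoSchedule"
}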

@winston0410

@manics I have checked the docs and tried to apply your snippet. The apply worked, but the taint does not appear on the node. Is your snippet still working for you?

@manics

manics commented May 23, 2023

@winston0410 I haven't tried it recently. The last time I deployed this was with version 4.101.0 of registry.terraform.io/hashicorp/oci

@tkellen

tkellen commented Apr 21, 2024

This does the trick currently:

node_metadata = {
    user_data = base64encode(<<-EOT
      #!/bin/bash
      export KUBELET_EXTRA_ARGS="--register-with-taints=node.wescaleout.cloud/routing=true:NoSchedule"
      curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode > /var/run/oke-init.sh
      bash /var/run/oke-init.sh
      EOT
    )
  }
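This appears to work because oke-install.sh only defaults KUBELET_EXTRA_ARGS when it is unset (KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS:-}" further down), so the exported value survives into the kubelet drop-in. If the taint needs to differ per node pool, the hard-coded string can come from a variable instead, e.g. (variable name is illustrative):

node_metadata = {
  user_data = base64encode(<<-EOT
    #!/bin/bash
    # Same approach, with the taint supplied per pool from Terraform.
    export KUBELET_EXTRA_ARGS="--register-with-taints=${var.pool1_taint}"
    curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode > /var/run/oke-init.sh
    bash /var/run/oke-init.sh
    EOT
  )
}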

The content of http://169.254.169.254/opc/v2/instance/metadata/oke_init_script:

#!/usr/bin/env bash
set -x
set -e
set -o pipefail

v1_sha_file="ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz.sha256"
v2_sha_file=$(echo "ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz" | sed 's/\.tgz/-v2\.tgz.sha256/')

verifyChecksum() {
    if grep 'PRETTY_NAME="Oracle Linux Server 8\..*"' /etc/os-release >/dev/null; then
    openssl dgst -sha256 -signature $1 -verify $2 $3
    else
        OPENSSL_FIPS=1 openssl dgst -sha256 -signature $1 -verify $2 $3
    fi
}

downloadAnsible() {
    curl -v --fail -L0 https://objectstorage.us-chicago-1.oraclecloud.com/n/odx-oke/b/tkw-cloud-init-prd-0/o/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz -o /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz &&
    curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_artifact_signing_key > /var/run/tkw/oke-signing.pub &&
    if curl -v --fail -L0 https://objectstorage.us-chicago-1.oraclecloud.com/n/odx-oke/b/tkw-cloud-init-prd-0/o/$v1_sha_file -o /var/run/tkw/$v1_sha_file && \
    verifyChecksum /var/run/tkw/$v1_sha_file /var/run/tkw/oke-signing.pub /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz; then
        echo "Ansible bundle is being signed with artifact-signing-key"
    elif curl -v --fail -L0 https://objectstorage.us-chicago-1.oraclecloud.com/n/odx-oke/b/tkw-cloud-init-prd-0/o/$v2_sha_file -o /var/run/tkw/$v2_sha_file && \
    verifyChecksum /var/run/tkw/$v2_sha_file /var/run/tkw/oke-signing.pub /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz; then
        echo "Ansible bundle is being signed with artifact-signing-key-v2"
    else
        return 1
    fi
}

exec &> >(tee -ia /var/run/oke-init.log)
if [ ! -f /var/log/cloud-init-output.log ]; then
    exec &> >(tee -ia /var/log/cloud-init-output.log)
fi
exec 2>&1

if [ -f /etc/.oke_init_complete ]; then
    echo "OKE provisioning already completed... Exiting..."
    exit 0
fi

if [ -f /etc/oke/oke-install.sh ]; then
    until bash -x "/etc/oke/oke-install.sh" "$@"
    do
        echo "oke-install failed...retrying in 5s..."
        sleep 5
    done
else
    mkdir -p /var/run/tkw
    rm -rf /var/run/tkw/*
    until downloadAnsible
    do
        echo "Failed to download TKW ansible bundle...retrying in 5s"
        rm -rf /var/run/tkw/*
        mkdir -p /var/run/tkw
        sleep 5
    done
    tar -xzvf /var/run/tkw/ansible-playbook-1.64.0-915fe38b72ac3137651a29aa3e28b5726022af63-41.tgz -C /var/run/tkw
    until bash -x "/var/run/tkw/bootstrap.sh" "$@"
    do
        echo "bootstrap failed...retrying in 5s..."
        sleep 5
    done
fi

The content of /etc/oke/oke-install.sh on Oracle-Linux-8.9-2024.01.26-0-OKE-1.28.2-679:

#!/bin/bash
set -xe
set -o pipefail

echo "$(date) Starting OKE bootstrap"

# Load necessary functions that will be used later in the script
source /etc/oke/oke-functions.sh

# Allow user to specify arguments through custom cloud-init
while [[ $# -gt 0 ]]; do
  key="$1"
  case "$key" in
    --kubelet-extra-args)
        export KUBELET_EXTRA_ARGS="$2"
        shift
        shift
        ;;
    --cluster-dns)
        export CLUSTER_DNS="$2"
        shift
        shift
        ;;
    --apiserver-endpoint)
        export APISERVER_ENDPOINT="$2"
        shift
        shift
        ;;
    --kubelet-ca-cert)
        export KUBELET_CA_CERT="$2"
        shift
        shift
        ;;
    *) # Ignore unsupported args
        shift
        ;;
  esac
done

export OKE_BOOTSTRAP_METRICS_FILE_PATH="/etc/oke/metric.py"

# Captures the start time of worker node bootstrapping. Avoid placing any code that can be considered part of node
# bootstrapping above this line.
bootstrap_start_time=$(time_in_ms)

KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS:-}"
CLUSTER_DNS="${CLUSTER_DNS:-}"
APISERVER_ENDPOINT="${APISERVER_ENDPOINT:-$(get_apiserver_host)}"
KUBELET_CA_CERT="${KUBELET_CA_CERT:-}"

# Location of proxymux config and drop-in for proxymux config
PROXYMUX_CONFIG_PATH="/etc/proxymux/config.yaml"
PROXYMUX_CERTS_SERVICE_D_PATH="/etc/systemd/system/proxymux-certs.service.d"
mkdir -p "${PROXYMUX_CERTS_SERVICE_D_PATH}"

# Execute NPWF/BYON-specific logic
if [[ -n "$(get_oke_pool_id)" ]]; then
  service_env_rule "PROXYMUX_ENDPOINT" "certs" > "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  env_rule "PROXYMUX_ARGS" "--config ${PROXYMUX_CONFIG_PATH}" >> "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  get_bootstrap_kubelet_conf >/etc/kubernetes/bootstrap-kubelet.conf
  get_kubelet_client_ca >/etc/kubernetes/ca.crt
elif [[ -n "$APISERVER_ENDPOINT" && -n "$KUBELET_CA_CERT" ]]; then
  # Use the bootstrap endpoint to allow the BYON node to attempt to join the cluster
  service_env_rule "PROXYMUX_ENDPOINT" "bootstrap" > "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  env_rule "PROXYMUX_ARGS" "--server-host ${APISERVER_ENDPOINT}" >> "${PROXYMUX_CERTS_SERVICE_D_PATH}"/10_cloud.conf
  echo "$KUBELET_CA_CERT" | base64 -d > /etc/kubernetes/ca.crt
  get_oke_k8version > /etc/oke/oke-k8s-version
else
  echo "--apiserver-endpoint and/or --kubelet-ca-cert args must be set"
  exit 1
fi

# Get the pause image for the given region/realm/k8s version and populate the crio config
REGION="$(get_region)"
REALM="$(get_realm)"
K8S_VERSION=$(get_oke_k8version | awk -F'v' '{print $NF}')
PAUSE_IMAGE=$(get_pause_image "$REGION" "$REALM" "$K8S_VERSION")
sed -i s,"PAUSE_IMAGE_PLACEHOLDER","$PAUSE_IMAGE",g /etc/crio/crio.conf
export K8S_VERSION PAUSE_IMAGE

# Get instance ocid for proxymux and kubelet configs
INSTANCE_ID="$(get_instance_id)"
export INSTANCE_ID

# Get info needed to populate the proxymux config
PROXYMUX_PORT="$(get_proxymux_port)"
TM_ID="$(get_oke_tm)"
SHORT_CLUSTER_ID="$(get_cluster_label)"
PRIVATE_NODE_IP=$(get_private_ip)
TENANCY_ID="$(get_tenancy_id)"
GPU=false
if [[ "$(get_shape)" == *"GPU"* ]]; then
  NET_INF=eno2
  GPU=true
else
  NET_INF=ens3
fi

# Populate the proxymux config
cat > $PROXYMUX_CONFIG_PATH << EOF

node-id: ${INSTANCE_ID}
net-inf: ${NET_INF}

server-addr: https://${APISERVER_ENDPOINT}:${PROXYMUX_PORT}
tm-id: ${TM_ID}
cluster-id: ${SHORT_CLUSTER_ID}
public-ip-address:
private-ip-address: ${PRIVATE_NODE_IP}
node-name: ${PRIVATE_NODE_IP}
cert-path: /var/lib/kubelet/pki
tenancy-id: ${TENANCY_ID}
bind-addr: 172.16.11.1:80
oci-realm: ${REALM}
EOF

# Get info needed to populate the kubelet config
KUBELET_CONFIG=/etc/kubernetes/kubelet-config.json

# Add kubelet args for ONSRs
IS_ONSR="$(get_oke_is_onsr)"
if [[ $IS_ONSR == "true" ]];then
  if (semantic_version_lt "$K8S_VERSION" "1.24.0");then
    echo "$(jq '. += {"streamingConnectionIdleTimeout": "5m", "featureGates": "DynamicKubeletConfig=false"}' $KUBELET_CONFIG)" > $KUBELET_CONFIG
  else
    echo "$(jq '. += {"streamingConnectionIdleTimeout": "5m"}' $KUBELET_CONFIG)" > $KUBELET_CONFIG
  fi
fi

# Get node labels placed by OKE, including user-specified initial node labels
NODE_LABELS="$(get_node_labels)"
export NODE_LABELS

# Get default kubelet args
KUBELET_DEFAULT_ARGS="$(get_kubelet_default_args)"
export KUBELET_DEFAULT_ARGS

MAX_PODS="$(get_max_pods)"
NATIVE_POD_NETWORKING="$(get_native_pod_networking)"
if [[ -n ${MAX_PODS} && -n ${NATIVE_POD_NETWORKING} ]]; then
  # Append max-pods to kubelet args from oke-max-pods if using OCI VCN IP Native CNI. Kubelet only cares about the last flag if flags repeat.
  KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS} --max-pods ${MAX_PODS}"
fi

IS_PREEMPTIBLE="$(get_is_preemptible)"
GPU_TAINT="nvidia.com/gpu=:NoSchedule"
PREEMPTIBLE_TAINT="oci.oraclecloud.com/oke-is-preemptible=:NoSchedule"
# Add taint for GPU nodes and preemptible instances. If customer specifies additional taints through kubelet-extra-args, then merge the taints
if [[ "$GPU" == "true" || "$IS_PREEMPTIBLE" == "true" ]]; then
  LOCAL_TAINT=""
  if [[ "$GPU" == "true" ]]; then
    LOCAL_TAINT="${GPU_TAINT}"
  fi
  if [[ "$IS_PREEMPTIBLE" == "true" ]]; then
    if [[ -n "$LOCAL_TAINT" ]]; then
      LOCAL_TAINT="${LOCAL_TAINT},${PREEMPTIBLE_TAINT}"
    else
      LOCAL_TAINT="${PREEMPTIBLE_TAINT}"
    fi
  fi
  NODE_TAINTS=$(echo "$KUBELET_EXTRA_ARGS" | { grep -o -E -- '--register-with-taints=[^ ]+' || true; })
  if [[ -n "$NODE_TAINTS" ]]; then
    NODE_TAINTS="${NODE_TAINTS},${LOCAL_TAINT}"
    KUBELET_EXTRA_ARGS=$(echo "$KUBELET_EXTRA_ARGS" | sed 's/--register-with-taints[^ ]\+//')
    KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS} ${NODE_TAINTS}"
  else
    KUBELET_DEFAULT_ARGS="${KUBELET_DEFAULT_ARGS} --register-with-taints=${LOCAL_TAINT}"
  fi
fi

# Path for kubelet drop-in files
KUBELET_SERVICE_D_PATH="/etc/systemd/system/kubelet.service.d"
mkdir -p "$KUBELET_SERVICE_D_PATH"

# Store default kubelet args and extra kubelet args in environment variables to be used by kubelet
service_env_rule "KUBELET_DEFAULT_ARGS" "$KUBELET_DEFAULT_ARGS" > "${KUBELET_SERVICE_D_PATH}"/kubelet-default-args.conf
service_env_rule "KUBELET_EXTRA_ARGS" "$KUBELET_EXTRA_ARGS" > "${KUBELET_SERVICE_D_PATH}"/kubelet-extra-args.conf

# Disable swap volumes
sed -i '/swap/ s/^\(.*\)$/# \1/g' /etc/fstab
swapoff -a

# Enable and restart necessary systemd services
daemon_reload
enable_and_restart "$(get_container_runtime_service)"
proxymux_certs_start_time=$(time_in_ms)
enable_and_restart 'proxymux-certs'
if [[ -n "${proxymux_certs_start_time}" ]]; then
    emit_elapsed_time_metric "oke.workerNode.softwareBootstrap.ProxymuxCertsStart.Time" "${proxymux_certs_start_time}"
fi

# Handle clusterDNS and providerID here. User-specified clusterDNS will have the highest priority. Otherwise, use clusterDNS from ansible
# args for NPWFs or proxymux endpoint for BYON. The proxymux endpoint will place clusterDNS in /etc/oke/oke-cluster-dns by default
if [[ -z "$CLUSTER_DNS" ]]; then
  CLUSTER_DNS_PATH="/etc/oke/oke-cluster-dns"
  export CLUSTER_DNS_PATH
  CLUSTER_DNS="$(get_cluster_dns)"
fi
export CLUSTER_DNS
echo "$(jq --arg CLUSTER_DNS "$CLUSTER_DNS" --arg INSTANCE_ID "$INSTANCE_ID" '. += {"clusterDNS": [$CLUSTER_DNS], "providerID": $INSTANCE_ID}' $KUBELET_CONFIG)" > ${KUBELET_CONFIG}

daemon_reload
enable_and_restart 'kubelet'
enable_and_restart 'systemd-journald'
if [[ "$GPU" == "true" && -f /etc/systemd/system/nvidia-modprobe.service ]]; then
  enable_and_restart 'nvidia-modprobe'
fi
if [[ "$GPU" == "true" && -f /etc/systemd/system/nvidia-persistenced.service ]]; then
  enable_and_restart 'nvidia-persistenced'
fi
enable_and_restart 'kubelet-monitor'
enable_and_restart 'kube-container-runtime-monitor'
sudo systemctl enable oke-node-startup-cmds

# Captures the end time of worker node bootstrapping. Avoid placing any code that can be considered part of node
# bootstrapping below this line.
if [[ -n "$bootstrap_start_time" ]]; then
    emit_elapsed_time_metric "oke.workerNode.softwareBootstrap.Time" "${bootstrap_start_time}"
fi

echo "$(date) Finished OKE bootstrap"
