DNS resolution on Azure Compute hosts running Ubuntu OS stops working once calico-vpp-node pods get up and running #688

Open
ivansharamok opened this issue Apr 3, 2024 · 6 comments
@ivansharamok

ivansharamok commented Apr 3, 2024

Environment

  • Calico/VPP version: tigera-operator v3.26.3 / Calico VPP v3.26.0 also tried tigera-operator v3.27.2 / Calico VPP v3.27.0
  • Kubernetes version: v1.28.8
  • Deployment type: kubeadm cluster on Azure Compute instances
  • Network configuration: Calico default with VXLAN enabled
  • Pod CIDR: 192.168.0.0/16
  • Service CIDR: 10.96.0.0/12
  • CRI: containerd 1.6.28 (docker is not installed)
  • OS: Ubuntu 22.04
  • kernel: Linux master 5.15.0-1042-azure #49-Ubuntu SMP Tue Jul 11 17:28:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue description
The calico-vpp-node pods somehow break DNS resolution on the hosts once those pods are fully initialized and running. The /etc/resolv.conf file on each host gets rewritten while the calico-vpp-node pod is running. DNS resolution from within the calico-vpp-node pods works fine; it is the host's DNS resolution that is affected, which prevents the remaining Calico VPP components from being configured correctly because some pods get stuck in the ImagePullBackOff state.
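
A quick way to confirm that it is the host's resolver configuration, rather than upstream connectivity, that breaks (a sketch assuming systemd-resolved on Ubuntu 22.04, dig from the dnsutils package, and Azure's DNS IP 168.63.129.16 quoted later in this issue):

resolvectl status eth0          # per-link DNS server and search domain as systemd-resolved sees them
resolvectl query google.com     # exercise systemd-resolved directly
dig @168.63.129.16 google.com   # bypass the stub resolver and query Azure's DNS directly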

To Reproduce
Steps to reproduce the behavior:

  • provision Azure Compute instances (e.g. control-plane1, worker1); Standard_D4s_v3 instances were used
  • deploy a kubeadm cluster (kubeadm v1.28.8 was used)
  • install Calico VPP; calico-vpp-nohuge.yaml was used
  • edit CALICOVPP_INTERFACES to use interfaceName: eth0 instead of the default eth1, as shown below:
  CALICOVPP_INTERFACES: |-
    {
      "maxPodIfSpec": {
        "rx": 10, "tx": 10, "rxqsz": 1024, "txqsz": 1024
      },
      "defaultPodIfSpec": {
        "rx": 1, "tx":1, "isl3": true
      },
      "vppHostTapSpec": {
        "rx": 1, "tx":1, "rxqsz": 1024, "txqsz": 1024, "isl3": false
      },
      "uplinkInterfaces": [
        {
          "interfaceName": "eth0",
          "vppDriver": "af_packet"
        }
      ]
    }
  • installation-default.yaml was edited as follows:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: VPP
    ipPools:
    - cidr: 192.168.0.0/16
      encapsulation: VXLAN

Expected behavior
Installation of Calico VPP should not disrupt the host's DNS resolution.

Additional context

  • the order of manifest installation
kubectl apply --server-side --force-conflicts -f tigera-operator.yaml
kubectl apply -f installation-default.yaml
kubectl apply -f calico-vpp-nohuge.yaml
  • while the calico-vpp-node pods are initializing, DNS resolution on the host works as expected. However, once the calico-vpp-dataplane/calico-vpp-node pods reach the Running state, DNS resolution stops working on the host and the /etc/resolv.conf file gets modified.
  • example of /etc/resolv.conf on the host before Calico VPP is installed
nameserver 127.0.0.53
options edns0 trust-ad
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
  • example of the /etc/resolv.conf on the host after the calico-vpp-node pod reaches the Running state
nameserver 127.0.0.53
options edns0 trust-ad
search .
  • example of the /etc/resolv.conf inside the calico-vpp-node pods
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
  • I have no issue getting a response when running curl google.com from within the calico-vpp-node pod, but the same query fails on the host with the message curl: (6) Could not resolve host: google.com
  • I noticed that Calico VPP seems to add the service CIDR to the routing table on the host. I'm not sure whether this has any impact on the host's DNS resolution, but the programming of that route seems to correlate with the time when DNS resolution on the host stops working.
  • example of programmed routes on the host before calico-vpp-node is up, or right after the pod is manually killed and before it comes back up
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
  • example of programmed routes on the host after the calico-vpp-node pod is up
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
10.96.0.0/12 via 172.10.1.254 dev eth0 proto static mtu 1440
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
192.168.0.0/16 via 172.10.1.254 dev eth0 proto static mtu 1440
  • one way I can get pods to pull the necessary images after the calico-vpp-node pods are up and running is to manually kill the calico-vpp-node pods and force-restart the pods that are failing to pull their images. Since it takes the calico-vpp-node pods a few moments to reach the Running state, the recycled workload pods usually get a chance to start pulling their images before DNS resolution breaks again.
  • a somewhat better workaround is to manually edit the /etc/resolv.conf file on the host to match the one I fetch from within the calico-vpp-node pods; DNS then keeps working until the calico-vpp-node pod is restarted, as a restart of that pod overwrites the /etc/resolv.conf file once again. Both workarounds are sketched after this list.
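
A minimal sketch of both workarounds, using values quoted earlier in this comment (the search domain and nameserver come from this cluster; the pod label is an assumption based on the default k8s-app=calico-vpp-node label in the manifest, not something shown in this thread):

# restore the host resolver config to match the one inside the calico-vpp-node pod
cat <<'EOF' | sudo tee /etc/resolv.conf
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
EOF

# or cycle the dataplane pods, then restart the stuck workloads while DNS briefly works
kubectl delete pod -n calico-vpp-dataplane -l k8s-app=calico-vpp-node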

I would like to understand what breaks DNS resolution on the hosts when the Calico VPP dataplane gets installed on the cluster.

@onong
Collaborator

onong commented Apr 4, 2024

Hi @ivansharamok, could you share the vpp-manager logs:

kubectl logs -n calico-vpp-dataplane calico-vpp-node-XYZ -c vpp

Also, any specific reason for using v3.26 instead of the latest v3.27? If possible, could you switch to v3.27?

@onong
Collaborator

onong commented Apr 4, 2024

Are the nodes using NetworkManager or systemd-networkd? Could you please share the appropriate logs (NetworkManager or systemd-networkd) from when this issue happens?

@ivansharamok
Author

ivansharamok commented Apr 4, 2024

Also, any specific reasons for using v3.26 instead of the latest v3.27? if possible, could you switch to v3.27?

I tried v3.27.0, but the calicovpp/install-whereabouts image wasn't published to Docker Hub, which prompted me to switch to v3.26.0. I see that it was published a few days ago. I'll give it a try and update this ticket.

@ivansharamok
Author

ivansharamok commented Apr 4, 2024

Installed Calico VPP v3.27.0. Hit the same issue. Below is the info collected from the cluster using Calico VPP v3.27.0.

Looks like Ubuntu 22.04 uses systemd-networkd by default.

# checking if NetworkManager is used
azureuser@master:~$ systemctl status NetworkManager
Unit NetworkManager.service could not be found.

azureuser@master:~$ systemctl status network-manager
Unit network-manager.service could not be found.

# checking if systemd-networkd is used
azureuser@master:~$ systemctl status /etc/network/interfaces
Unit etc-network-interfaces.mount could not be found.
azureuser@master:~$ systemctl status systemd-networkd
● systemd-networkd.service - Network Configuration
     Loaded: loaded (/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-04-04 17:10:06 UTC; 15min ago
TriggeredBy: ● systemd-networkd.socket
       Docs: man:systemd-networkd.service(8)
   Main PID: 7977 (systemd-network)
     Status: "Processing requests..."
      Tasks: 1 (limit: 19179)
     Memory: 1.3M
        CPU: 136ms
     CGroup: /system.slice/systemd-networkd.service
             └─7977 /lib/systemd/systemd-networkd

Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP

Here's the log for systemd-networkd (journalctl -u systemd-networkd).

Apr 04 16:32:09 master systemd[1]: Starting Network Configuration...
Apr 04 16:32:09 master systemd-networkd[539]: lo: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: lo: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: Enumeration completed
Apr 04 16:32:09 master systemd[1]: Started Network Configuration.
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: DHCPv4 address 172.10.1.5/24 via 172.10.1.1
Apr 04 16:32:11 master systemd-networkd[539]: eth0: Gained IPv6LL
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCP lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCPv6 lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 17:10:06 master systemd[1]: Stopping Network Configuration...
Apr 04 17:10:06 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:10:06 master systemd[1]: Stopped Network Configuration.
Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: Enumeration completed
Apr 04 17:10:06 master systemd[1]: Started Network Configuration.
Apr 04 17:10:07 master systemd-networkd[7977]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd[1]: Stopping Network Configuration...
Apr 04 17:29:43 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:29:43 master systemd[1]: Stopped Network Configuration.
Apr 04 17:29:43 master systemd[1]: Starting Network Configuration...
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd-networkd[17212]: Enumeration completed
Apr 04 17:29:43 master systemd[1]: Started Network Configuration.
  • in the log, the time Apr 04 17:10:06 corresponds to when I installed Calico VPP in my cluster
  • the time Apr 04 17:29:43 corresponds to a sudo systemctl restart systemd-networkd command, which I ran to see whether restarting the networking service would fix the problem. It didn't.

Logs for one of calico-vpp-node pods

time="2024-04-04T17:10:03Z" level=info msg="Version info\nImage tag                   : ab81a775fbdeba932888690c68ddf7e9f4bd8d2b\nVPP-dataplane version       : ab81a77 Release v3.27.0\nVPP Version                 : 24.02-rc0~8-g9db45f6ae\nBinapi-generator version    : v0.8.0\nVPP Base commit             : 06efd532e gerrit:34726/3 interface: add buffer stats api\n------------------ Cherry picked commits --------------------\ncapo: Calico Policies plugin\nacl: acl-plugin custom policies\ncnat: [WIP] no k8s maglev from pods\npbl: Port based balancer\ngerrit:40078/3 vnet: allow format deleted swifidx\ngerrit:40090/3 cnat: undo fib_entry_contribute_forwarding\ngerrit:39507/13 cnat: add flow hash config to cnat translation\ngerrit:34726/3 interface: add buffer stats api\n-------------------------------------------------------------\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SWAP_DRIVER="
time="2024-04-04T17:10:03Z" level=info msg="Config:SERVICE_PREFIX=[10.96.0.0/12]"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_GRACEFUL_SHUTDOWN_TIMEOUT=10s"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACES={\n  \"defaultPodIfSpec\": {\n    \"rx\": 1,\n    \"tx\": 1,\n    \"rxqsz\": 0,\n    \"txqsz\": 0,\n    \"isl3\": true,\n    \"rxMode\": 0\n  },\n  \"maxPodIfSpec\": {\n    \"rx\": 10,\n    \"tx\": 10,\n    \"rxqsz\": 1024,\n    \"txqsz\": 1024,\n    \"isl3\": null,\n    \"rxMode\": 0\n  },\n  \"vppHostTapSpec\": {\n    \"rx\": 1,\n    \"tx\": 1,\n    \"rxqsz\": 1024,\n    \"txqsz\": 1024,\n    \"isl3\": false,\n    \"rxMode\": 0\n  },\n  \"uplinkInterfaces\": [\n    {\n      \"rx\": 0,\n      \"tx\": 0,\n      \"rxqsz\": 0,\n      \"txqsz\": 0,\n      \"isl3\": null,\n      \"rxMode\": 0,\n      \"isMain\": false,\n      \"physicalNetworkName\": \"\",\n      \"interfaceName\": \"eth0\",\n      \"vppDriver\": \"af_packet\",\n      \"newDriver\": \"\",\n      \"annotations\": null,\n      \"mtu\": 0\n    }\n  ]\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_FEATURE_GATES={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC={\n  \"nbAsyncCryptoThreads\": 0,\n  \"extraAddresses\": 0\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INITIAL_CONFIG={\n  \"vppStartupSleepSeconds\": 1,\n  \"corePattern\": \"/var/lib/vpp/vppcore.%e.%p\",\n  \"extraAddrCount\": 0,\n  \"ifConfigSavePath\": \"\",\n  \"defaultGWs\": \"\",\n  \"redirectToHostRules\": null\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_TEMPLATE=unix {\n  nodaemon\n  full-coredump\n  cli-listen /var/run/vpp/cli.sock\n  pidfile /run/vpp/vpp.pid\n  exec /etc/vpp/startup.exec\n}\napi-trace { on }\ncpu {\n    workers 0\n}\nsocksvr {\n    socket-name /var/run/vpp/vpp-api.sock\n}\nplugins {\n    plugin default { enable }\n    plugin dpdk_plugin.so { disable }\n    plugin calico_plugin.so { enable }\n    plugin ping_plugin.so { disable }\n    plugin dispatch_trace_plugin.so { enable }\n}\nbuffers {\n  buffers-per-numa 131072\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_IF_READ=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:NODENAME=master"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_BGP_LOG_LEVEL=INFO"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_VPP_RUN=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_RUNNING=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SRV6={\n  \"localsidPool\": \"\",\n  \"policyPool\": \"\"\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_FORMAT="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INIT_SCRIPT_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_EXEC_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_DONE_OK=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_LEVEL=info"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_DEBUG={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_ERRORED=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC_IKEV2_PSK="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_NATIVE_DRIVER="
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="No pci device for interface eth0"
time="2024-04-04T17:10:03Z" level=info msg="-- Environment --"
time="2024-04-04T17:10:03Z" level=info msg="Hugepages            0"
time="2024-04-04T17:10:03Z" level=info msg="KernelVersion        5.15.0-1042"
time="2024-04-04T17:10:03Z" level=info msg="Drivers              map[uio_pci_generic:false vfio-pci:true]"
time="2024-04-04T17:10:03Z" level=info msg="initial iommu status N"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface Spec --"
time="2024-04-04T17:10:03Z" level=info msg="Interface Name:      eth0"
time="2024-04-04T17:10:03Z" level=info msg="Native Driver:       af_packet"
time="2024-04-04T17:10:03Z" level=info msg="New Drive Name:      "
time="2024-04-04T17:10:03Z" level=info msg="PHY target #Queues   rx:0 tx:0"
time="2024-04-04T17:10:03Z" level=info msg="Tap MTU:             0"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface config --"
time="2024-04-04T17:10:03Z" level=info msg="Node IP4:            172.10.1.5/24"
time="2024-04-04T17:10:03Z" level=info msg="Node IP6:            "
time="2024-04-04T17:10:03Z" level=info msg="PciId:               "
time="2024-04-04T17:10:03Z" level=info msg="Driver:              "
time="2024-04-04T17:10:03Z" level=info msg="Linux IF was up ?    true"
time="2024-04-04T17:10:03Z" level=info msg="Promisc was on ?     false"
time="2024-04-04T17:10:03Z" level=info msg="DoSwapDriver:        false"
time="2024-04-04T17:10:03Z" level=info msg="Mac:                 00:22:48:c0:5e:e6"
time="2024-04-04T17:10:03Z" level=info msg="Addresses:           [172.10.1.5/24 eth0,fe80::222:48ff:fec0:5ee6/64]"
time="2024-04-04T17:10:03Z" level=info msg="Routes:              [{Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, <Dst: nil (default), Ifindex: 2, Gw: 172.10.1.1, Src: 172.10.1.5, >]"
time="2024-04-04T17:10:03Z" level=info msg="PHY original #Queues rx:64 tx:64"
time="2024-04-04T17:10:03Z" level=info msg="MTU                  1500"
time="2024-04-04T17:10:03Z" level=info msg="isTunTap             false"
time="2024-04-04T17:10:03Z" level=info msg="isVeth               false"
time="2024-04-04T17:10:03Z" level=info msg="Running with uplink af_packet"
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="VPP started [PID 7918]"
vpp[7918]: clib_sysfs_prealloc_hugepages:236: pre-allocating 149 additional 2048K hugepages on numa node 0
vpp[7918]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)
time="2024-04-04T17:10:04Z" level=info msg="Waiting for VPP... [0/10]"
vpp[7918]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
vpp[7918]: tls_init_ca_chain:1086: Could not initialize TLS CA certificates
vpp[7918]: tls_openssl_init:1209: failed to initialize TLS CA chain
vpp[7918]: vat-plug/load: vat_plugin_register: idpf plugin not loaded...
vpp[7918]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
time="2024-04-04T17:10:06Z" level=info msg="Created AF_PACKET interface 1"
time="2024-04-04T17:10:06Z" level=info msg="tagging interface [1] with: main-eth0"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to uplink interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to uplink interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Creating Linux side interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to data interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Adding ND proxy for address fe80::222:48ff:fec0:5ee6"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address fe80::222:48ff:fec0:5ee6/64 to tap interface"
time="2024-04-04T17:10:06Z" level=warning msg="add addr fe80::222:48ff:fec0:5ee6/64 via vpp EEXIST, file exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="add route via vpp : {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} already exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: <nil> Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Using 172.10.1.254 as next hop for cluster IPv4 routes"
time="2024-04-04T17:10:06Z" level=info msg="Setting BGP nodeIP 172.10.1.5/24"
time="2024-04-04T17:10:06Z" level=info msg="Updating node, version = 1741, metaversion = 1741"
default_hook: using systemctl...
default_hook: system is using systemd-networkd; restarting...
time="2024-04-04T17:10:06Z" level=info msg="Received signal child exited, vpp index 1"
time="2024-04-04T17:10:06Z" level=info msg="Ignoring SIGCHLD for pid 0"
time="2024-04-04T17:10:06Z" level=info msg="Done with signal child exited"

@ivansharamok
Author

ivansharamok commented Apr 4, 2024

I just tried switching from Ubuntu 22.04 to CentOS 8, and I did not run into the DNS resolution issue on CentOS hosts. I noticed that CentOS uses NetworkManager by default. At this point I'm not sure what the exact root cause is, but it may be related to networking being managed by systemd-networkd, or to some other default network-management configuration bundled with Ubuntu.
On CentOS hosts, the /etc/resolv.conf file does not get edited when the calico-vpp-node pods come up.

@ivansharamok ivansharamok changed the title DNS resolution on Azure Compute hosts stops working once calico-vpp-node pods get up and running DNS resolution on Azure Compute hosts running Ubuntu stops working once calico-vpp-node pods get up and running Apr 4, 2024
@ivansharamok ivansharamok changed the title DNS resolution on Azure Compute hosts running Ubuntu stops working once calico-vpp-node pods get up and running DNS resolution on Azure Compute hosts running Ubuntu OS stops working once calico-vpp-node pods get up and running Apr 4, 2024
@onong
Collaborator

onong commented Apr 5, 2024

Thanks for the details, and sorry about the missing whereabouts image - tagging it somehow got missed during the release :)

What happens is that when calico-vpp-node starts, it takes over the uplink interface and replaces it with a tap. systemd-networkd does not like this disappearing act: it triggers a reset, which involves expiry of the DHCP lease (as can be seen in the logs) and, in some cases, also wipes out the DNS config.

We have faced this issue in the past, and usually a restart of systemd-networkd has done the trick. Somehow the restart doesn't seem to be effective in your case; that will require some further digging. But as a quick fix I can think of the following:

NetworkManager has a config option, dns=none, which tells it not to meddle with the DNS config at all, so the DNS config remains intact when calico-vpp-node starts running. If switching to NM is ok with you, you could try it.
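
For reference, this is the same edit the default vpp-manager hook applies on NetworkManager systems (visible in the hook script in the logs above), roughly:

# /etc/NetworkManager/NetworkManager.conf
[main]
dns=none

# then: sudo systemctl restart NetworkManager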

After the Azure instances are up and running, modify the netplan config to make the network configuration static instead of DHCP, and then run the kubeadm steps to install the cluster.
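
A netplan sketch of that change, reusing the addressing shown earlier in this issue (the file name is illustrative and the values must be adjusted per instance):

# /etc/netplan/50-static.yaml
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [172.10.1.4/24]
      routes:
        - to: default
          via: 172.10.1.1
      nameservers:
        addresses: [168.63.129.16]
        search: [abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net]

# then: sudo netplan apply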

Try the systemd-networkd option Unmanaged=true for the uplink interface. It seems similar to the NM dns=none, but I'm not really sure. Refer to this link: systemd/systemd#28626
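
A sketch of what such a drop-in might look like (untested here; the file name is illustrative, and the linked systemd issue discusses how Unmanaged= behaves):

# /etc/systemd/network/10-vpp-uplink.network
[Match]
Name=eth0

[Link]
Unmanaged=true

# then: sudo systemctl restart systemd-networkd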

@onong onong self-assigned this Apr 25, 2024