diff --git a/README.md b/README.md index 36e1a98fae..69431721cf 100644 --- a/README.md +++ b/README.md @@ -63,6 +63,8 @@ in github. This can also function similar to a mailing list if you subscribe to clicking "Watch" (or "Unwatch") and selecting "Custom" -> "Discussions" (or by selecting "All Activity" if you want to receive notifications about everything else too). +If you have issues with an eBPF program, start with the [Troubleshooting Guide](docs/TroubleshootingGuide.md). + ## Frequently Asked Questions ### 1. Is this a fork of eBPF? diff --git a/docs/Diagnostics.md b/docs/Diagnostics.md new file mode 100644 index 0000000000..b55878a7a5 --- /dev/null +++ b/docs/Diagnostics.md @@ -0,0 +1,90 @@ +This document contains information about diagnostic tools and outputs used for debugging and diagnosing eBPF issues. + +-------------------- + +- [WFP State](#wfp-state) +- [bpftool](#bpftool) +- [eBPF Diagnostic Traces](#ebpf-diagnostic-traces) + - [Trace Providers](#trace-providers) + - [Logman Trace Command](#logman-trace-command) + - [Decoding Traces](#decoding-traces) + - [Viewing Traces](#viewing-traces) + +-------------------- + +## WFP State + +netebpfext.sys uses the Windows Filtering Platform (WFP) to implement certain eBPF program types. Depending on the +program and attach type, different WFP objects are expected to be created. + +The following program types rely on WFP: +- BPF_PROG_TYPE_XDP +- BPF_PROG_TYPE_BIND +- BPF_PROG_TYPE_CGROUP_SOCK_ADDR +- BPF_PROG_TYPE_SOCK_OPS + +Use the command `netsh wfp show state` to produce a `wfpstate.xml`. This file shows the WFP state on the system, +including all WFP `sublayer`, `callout`, and `filter` objects. This can be used to determine if eBPF objects are +correctly configured or if there are other callout objects present that may interfere with eBPF behavior. + +-------------------- +## bpftool + +`bpftool.exe` can be used to show eBPF object state. 
This is useful when checking whether your eBPF program is loaded and +attached, and whether any maps it uses are properly configured. + +-------------------- + +## eBPF Diagnostic Traces + +For some issues, Event Trace Logs (ETL) are necessary to determine the root cause and resolve the issue. + +-------------------- + +### Trace Providers + +- `NetEbpfExtProvider` + - {f2f2ca01-ad02-4a07-9e90-95a2334f3692} + - This provider is part of the eBPF platform. This traces content from NetEbpfExt.sys. +- `EbpfForWindowsProvider` + - {394f321c-5cf4-404c-aa34-4df1428a7f9c} + - This provider is part of the eBPF platform. This traces content from ebpfCore.sys. +- `Microsoft.Windows.Networking.WFP.Callout` + - {00e7ee66-5b24-5c41-22cb-af98f63e2f90} + - This provider is part of the Windows OS. This traces content from WFP callout actions. + +-------------------- + +### Logman Trace Command + +You can use the following commands to collect traces. These commands use maximum verbosity: +``` +logman create trace "ebpf_diag_manual" -o C:\ebpf_trace.etl -f bincirc -max 1024 -ets +logman update trace "ebpf_diag_manual" -p "{f2f2ca01-ad02-4a07-9e90-95a2334f3692}" 0xffffffffffffffff 0xff -ets +logman update trace "ebpf_diag_manual" -p "{394f321c-5cf4-404c-aa34-4df1428a7f9c}" 0xffffffffffffffff 0xff -ets +logman update trace "ebpf_diag_manual" -p "{00e7ee66-5b24-5c41-22cb-af98f63e2f90}" 0xffffffffffffffff 0xff -ets + + + +logman stop "ebpf_diag_manual" -ets +``` + +-------------------- + +### Decoding Traces + +Once you have captured the `.etl` file with the above providers, you will need to decode the traces before viewing +them. + +One method for decoding traces is to use the `netsh` tool. The following command can be used for decoding: +``` +netsh trace convert +``` + +-------------------- + +### Viewing Traces +Once decoded, you can open the file with any text viewing tool. 
One option for viewing text files is: +https://textanalysistool.github.io/ + +-------------------- \ No newline at end of file diff --git a/docs/TroubleshootingGuide.md b/docs/TroubleshootingGuide.md new file mode 100644 index 0000000000..90c82fd90e --- /dev/null +++ b/docs/TroubleshootingGuide.md @@ -0,0 +1,550 @@ +This document contains a troubleshooting guide for issues related to eBPF. + +-------------------- + +# What Kind of Issue Are You Having? + +- [A specific eBPF program is failing verification](./debugging.md) +- [The eBPF program is not getting invoked](#troubleshooting-general-ebpf-program-issues) +- [A specific eBPF program is not behaving as expected](#troubleshooting-issues-related-to-a-specific-program-type) + +-------------------- + +# Troubleshooting General eBPF Program Issues + +If the eBPF program is not getting invoked at all, walk through the following steps to determine where the issue is and +resolve it: + +1. [Verify eBPF components are running](#verify-ebpf-components-are-running) +2. [Verify Windows Filtering Platform (WFP) objects are present](#verify-wfp-objects-are-present) +3. [Verify the eBPF Program is Configured Correctly](#verify-the-ebpf-program-is-configured-correctly) + +-------------------- + +## Verify eBPF Components Are Running + +Verify that the necessary services are running. 
Run the following commands: +``` +sc.exe queryex netebpfext +sc.exe queryex ebpfcore +``` +We expect to see the following output, notably that the service is in the **Running** state: +``` +SERVICE_NAME: ebpfcore + TYPE : 1 KERNEL_DRIVER + STATE : 4 RUNNING + (STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN) + WIN32_EXIT_CODE : 0 (0x0) + SERVICE_EXIT_CODE : 0 (0x0) + CHECKPOINT : 0x0 + WAIT_HINT : 0x0 + PID : 0 + FLAGS : +``` + +**Mitigation:** For each service that is not running, execute: +``` +sc.exe start netebpfext +sc.exe start ebpfcore +``` + +If the problem persists, obtain the `SERVICE_EXIT_CODE` and look at the +[eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces) for further diagnosis. + +-------------------- + +## Verify WFP objects are present +netebpfext.sys uses the WFP platform to implement certain eBPF program types. If you are observing issues with the eBPF +program not getting invoked at all, you should check if the necessary WFP objects are present. + +Depending on the program and attach type, different WFP objects are expected to be created. You can use the +[WFP state diagnostic file](./Diagnostics.md#wfp-state) to confirm that the necessary objects are present. + +There are a few different WFP object types. Depending on the program type, you should check for specific instances of +each WFP object. +- `sublayer` object. Depending on the program type, a different `sublayerKey` may be expected. Note that the `weight` + field may be different in the expected output than on your device, and it is not an issue if it is different. +- `callout` object. You should check that the `applicableLayer` of this object matches the expected output for the + program type. +- `filter` object. When looking for the expected `filters` check for the following: + - The `layerKey` matches the expected output. + - The `sublayerKey` matches the `sublayerKey` in the expected output. 
+ - The `filterType` has the same GUID as the `calloutKey` in the `callout` object. + +Note that the `calloutId` and `filterId` fields are NOT constant and are expected to change. Instead, use the +`calloutKey` and `filterKey` values to uniquely identify these objects. + +The below section details the specific expected WFP objects for each program type. + +**Mitigation**: If any of the expected objects are not present or incorrect, attempt mitigation by restarting both +`ebpfcore` and `netebpfext`: +``` +sc.exe stop ebpfcore +sc.exe stop netebpfext +sc.exe start ebpfcore +sc.exe start netebpfext +``` + +If the objects are still not present, check the [eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces) for +any errors. + +**Next Steps**: If you have verified that the WFP objects are present, but the eBPF program is still not getting +invoked, see [troubleshooting eBPF program issues](#troubleshooting-general-ebpf-program-issues). + +-------------------- + +### Expected WFP objects for the program type BPF_PROG_TYPE_CGROUP_SOCK_ADDR +The following are the expected `sublayer` objects for this program type: +``` + + {7c7b3fb9-3331-436a-98e1-b901df457fff} + + EBPF Sub-Layer + Sub-Layer for use by eBPF callouts + + + + + 8 + + + {98849e12-b07d-11ec-9a30-18602489beee} + + EBPF CGroup Connect V4 Sub-Layer + Sub-Layer for use by eBPF connect redirect callouts + + + + + 9 + + + {98849e13-b07d-11ec-9a30-18602489beee} + + EBPF CGroup Connect V6 Sub-Layer + Sub-Layer for use by eBPF connect redirect callouts + + + + + 10 + +``` + +For eBPF programs using the `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` program type and attached at the +`EBPF_ATTACH_TYPE_CGROUP_INET4_CONNECT` hook, we expect a `callout` and `filter` present in the following layers: +1. `FWPM_LAYER_ALE_CONNECT_REDIRECT_V4` +2. `FWPM_LAYER_ALE_CONNECT_REDIRECT_V6` +3. 
`FWPM_LAYER_ALE_AUTH_CONNECT_V4` + +In this scenario, the `FWPM_LAYER_ALE_CONNECT_REDIRECT_V6` layer objects are necessary due to the way the WFP stack +handles dual-stack sockets. + +This is the expected `callout` and `filter` at the `FWPM_LAYER_ALE_CONNECT_REDIRECT_V4` layer: +``` + + {98849e0f-b07d-11ec-9a30-18602489beee} + + ALE Connect Redirect eBPF Callout v4 + ALE Connect Redirect callout for eBPF + + + FWPM_CALLOUT_FLAG_REGISTERED + + + + FWPM_LAYER_ALE_CONNECT_REDIRECT_V4 + 300 + + + {d18b796a-2018-408e-af4a-ac1978b5a364} + + net eBPF sock_addr hook + net eBPF sock_addr hook WFP filter + + + + + FWPM_LAYER_ALE_CONNECT_REDIRECT_V4 + {7c7b3fb9-3331-436a-98e1-b901df457fff} + + FWP_EMPTY + + + + FWP_ACTION_CALLOUT_TERMINATING + {98849e0f-b07d-11ec-9a30-18602489beee} + + 18446603911448051536 + + 68591 + + FWP_UINT64 + 0 + + +``` + +This is the expected `callout` and `filter` at the `FWPM_LAYER_ALE_CONNECT_REDIRECT_V6` layer: +``` + + {98849e10-b07d-11ec-9a30-18602489beee} + + ALE Connect Redirect eBPF Callout v6 + ALE Connect Redirect callout for eBPF + + + FWPM_CALLOUT_FLAG_REGISTERED + + {ddb851f5-841a-4b77-8a46-bb7063e9f162} + + FWPM_LAYER_ALE_CONNECT_REDIRECT_V6 + 279 + + + {162acb09-0cd9-4b80-b7a7-bdd653cca03a} + + net eBPF sock_addr hook + net eBPF sock_addr hook WFP filter + + + {ddb851f5-841a-4b77-8a46-bb7063e9f162} + + FWPM_LAYER_ALE_CONNECT_REDIRECT_V6 + {98849e12-b07d-11ec-9a30-18602489beee} + + FWP_EMPTY + + + + FWP_ACTION_CALLOUT_TERMINATING + {98849e10-b07d-11ec-9a30-18602489beee} + + 18446624845314639248 + + 68246 + + FWP_UINT64 + 0 + + + +``` + +This is the expected `callout` and `filter` at the `FWPM_LAYER_ALE_AUTH_CONNECT_V4` layer: +``` + + {98849e0b-b07d-11ec-9a30-18602489beee} + + ALE Authorize Connect eBPF Callout v4 + ALE Authorize Connect callout for eBPF + + + FWPM_CALLOUT_FLAG_REGISTERED + + {ddb851f5-841a-4b77-8a46-bb7063e9f162} + + FWPM_LAYER_ALE_AUTH_CONNECT_V4 + 274 + + + {f202cbe9-da2b-41bc-8db0-b25a799531b5} + + net eBPF 
sock_addr hook + net eBPF sock_addr hook WFP filter + + + {ddb851f5-841a-4b77-8a46-bb7063e9f162} + + FWPM_LAYER_ALE_AUTH_CONNECT_V4 + {7c7b3fb9-3331-436a-98e1-b901df457fff} + + FWP_EMPTY + + + + FWP_ACTION_CALLOUT_TERMINATING + {98849e0b-b07d-11ec-9a30-18602489beee} + + 18446624845314639248 + + 68244 + + FWP_UINT64 + 0 + + +``` + +-------------------- + +## Verify the eBPF Program is Configured Correctly + +1. [Verify the eBPF program passes the verifier](./debugging.md) +2. [Verify the eBPF program is loaded](#verify-the-ebpf-program-is-loaded) +3. [Verify the eBPF program is attached](#verify-the-ebpf-program-is-attached) +4. [Resolve eBPF Program Load or Attach Failures](#ebpf-program-load-or-attach-failures) +5. [Verify eBPF maps are properly configured](#verify-ebpf-maps-are-properly-configured) + +-------------------- + +### Verify the eBPF Program is Loaded + +To check that the eBPF program is loaded, execute: +``` +bpftool.exe -p prog +``` +In this output, check that you see the expected eBPF program, looking at the `name` and `type`. Take note of the `id` +and `map_ids` for the next set of checks. + +Example Output: +``` +[{ + "id": 196867, + "type": "sock_addr", + "name": "authorize_connect4", + "map_ids": [66054,131331] +}] +``` + +-------------------- + +### Verify the eBPF Program is Attached + +To check that the eBPF program is attached, execute: +``` +bpftool.exe -p link +``` +In this output, check for an entry with the `prog_id` which matches the `id` from the above output, and confirm that +the `attach_type` is as expected. + +Example output: +``` +bpftool.exe -p link +[{ + "id": 262403, + "type": 2, + "prog_id": 196867, + "cgroup_id": 0, + "attach_type": "cgroup/connect4" +}] +``` + +-------------------- + +### Verify eBPF Maps are Properly Configured + +To check the map content, execute: +``` +bpftool.exe -p map show id +``` +In this output, use the `map_ids` from the above output. 
Map usage is up to the eBPF program developer, so you should +confirm that the `type` and `name` are as expected for the scenario. This example output is from invoking bpftool +for each map: +``` +{ + "id": 66054, + "type": "hash", + "name": "policy_map", + "flags": 0, + "bytes_key": 24, + "bytes_value": 24, + "max_entries": 10 +} + +{ + "id" : 131331, + "type" : "lru_hash", + "name" : "audit_map", + "flags" : 0, + "bytes_key" : 8, + "bytes_value" : 24, + "max_entries" : 1000 +} +``` + +Once you have confirmed that the expected maps are present, you can then dump the map entries and check that the values +are as expected. You will need the `map_ids` from above. Then, you can execute the following command: +``` +bpftool.exe map dump id +``` + +Example Output: +``` +key: +08 08 08 08 00 00 00 00 00 00 00 00 00 00 00 00 +1a 0a 00 00 06 00 00 00 +value: +7f 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 +15 b3 00 00 00 00 00 00 +Found 1 element +``` + +The map usage is up to the eBPF program developer. You should follow up with the developer to understand what +structures are used in the map and how you can use this output to verify that the map entries are populated correctly. + +-------------------- + +### eBPF Program Load or Attach Failures + +Once you have [identified that the program is not attached or loaded](#troubleshooting-general-ebpf-program-issues), +you should first confirm that the eBPF client has attempted to load and attach the program (i.e., there were no issues +within the eBPF client itself). If you have confirmed that the eBPF client has attempted to load/attach the program, +but it has failed, you can use the following to further debug your issue. + +The common flow for configuring an eBPF program is to first `open` the program, then `load` the program, and +finally, `attach` the program. 
For each of these operations, you can look for a trace statement within the +[eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces) which indicates failure: +- Open: Look for a trace with `ebpf_object_open` +- Load: Look for a trace with `ebpf_object_load` +- Attach: Look for a trace with `ebpf_program_attach_by_fd` + +There are a few classes of known issues: + +**eBPF Client Issues** + +There are certain errors that likely point to the eBPF client. These errors will be present in +[eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces): +- `ERROR_ACCESS_DENIED` or `STATUS_ACCESS_DENIED`. This means that the user-mode application is not running as admin or + localsystem. This points to an issue with the application. The resolution here is to run the user-mode application or + service as localsystem or admin. +- `ERROR_FILE_NOT_FOUND`. This indicates that the application tried to open an eBPF program with an invalid path. This + points to an issue within the application. The resolution is to change the path used by the application. + +**NMR Attach Failures** + +Another possibility is NMR attach failing. 
When this occurs, you may see error traces in +[eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces) such as: +``` +[0]0C38.0490::2023/05/10-13:48:19.502521000 [EbpfForWindowsProvider]{"ErrorMessage":"ebpf_program_create returned +error","Error":23,"meta":{"provider":"EbpfForWindowsProvider","event":"EbpfGenericError", +"time":"2023-05-10T20:48:19.5025210Z","cpu":0,"pid":3128,"tid":1168,"channel":11,"level":2,"keywords":"0x4"}} + +[1]0AE4.1B34::2023/05/10-13:54:44.309563500 [EbpfForWindowsProvider]{"ErrorMessage": +"_ebpf_extension_client_attach_provider returned error","Error":-1073741127,"meta":{"provider": +"EbpfForWindowsProvider","event":"EbpfGenericError","time":"2023-05-10T20:54:44.3095635Z","cpu":1,"pid":2788,"tid": +6964,"channel":11,"level":2,"keywords":"0x4"}} + +[1]0AE4.1B34::2023/05/10-13:54:44.309569900 [EbpfForWindowsProvider]{"ErrorMessage":"_ebpf_program_load_providers +returned error","Error":23,"meta":{"provider":"EbpfForWindowsProvider","event":"EbpfGenericError","time": +"2023-05-10T20:54:44.3095699Z","cpu":1,"pid":2788,"tid":6964,"channel":11,"level":2,"keywords":"0x4"}} +``` + +The first trace shows `ebpf_program_create` failed. Then, we see that `_ebpf_extension_client_attach_provider` fails, +indicating that this is an NMR failure. Furthermore, we see `_ebpf_program_load_providers` which shows that the NMR +provider load failed. + +**Mitigation**: If you observe NMR failures, you can attempt to restart `netebpfext` and `ebpfcore`: +``` +sc.exe stop ebpfcore +sc.exe stop netebpfext +sc.exe start ebpfcore +sc.exe start netebpfext +``` +Then, attempt to load the program again. If this continues to fail, you will need to look further in +[eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces). 
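
When looking further in the decoded traces, it can help to pull out only the events that carry an `Error` field. The following is a minimal Python sketch, not part of the eBPF for Windows tooling. It assumes each decoded event sits on a single line with the shape shown in the samples above (`[cpu]PID.TID::timestamp [Provider]{JSON}`); the helper names `find_error_events` and `as_unsigned32_hex` are hypothetical. Negative values such as `-1073741127` are rendered as unsigned 32-bit hex so they can be looked up as status codes; not every `Error` value is necessarily an NTSTATUS.

```python
import json
import re

# Assumed line shape, based on the sample traces above:
#   [cpu]PID.TID::timestamp [ProviderName]{...JSON payload...}
EVENT_RE = re.compile(
    r"^\[\d+\][0-9A-Fa-f]+\.[0-9A-Fa-f]+::(\S+)\s+\[([^\]]+)\](\{.*\})\s*$"
)


def as_unsigned32_hex(code):
    """Render a signed 32-bit code as hex, e.g. -1073741127 -> 0xC00002B9."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def find_error_events(lines):
    """Yield (timestamp, provider, message, code_hex) for events with an "Error" field."""
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # not an event line, or an event wrapped across lines
        timestamp, provider, payload = match.groups()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # payload is not valid JSON; skip it
        if "Error" in event:
            yield (
                timestamp,
                provider,
                event.get("ErrorMessage", ""),
                as_unsigned32_hex(event["Error"]),
            )


# Sample event taken from the traces above ("meta" omitted for brevity).
sample = (
    '[1]0AE4.1B34::2023/05/10-13:54:44.309563500 [EbpfForWindowsProvider]'
    '{"ErrorMessage":"_ebpf_extension_client_attach_provider returned error",'
    '"Error":-1073741127}'
)
for row in find_error_events([sample]):
    print(*row, sep="\t")
```

To scan a whole decoded trace, pass an open text file to `find_error_events` instead of the sample list.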
+ +-------------------- + +# Troubleshooting Issues Related to a Specific Program Type + +- [Program Type BPF\_PROG\_TYPE\_CGROUP\_SOCK\_ADDR Issues](#program-type-bpf_prog_type_cgroup_sock_addr-issues) + +-------------------- +## Program Type BPF_PROG_TYPE_CGROUP_SOCK_ADDR Issues + +The following are common issues with programs attached at the `BPF_CGROUP_INET4_CONNECT` or `BPF_CGROUP_INET6_CONNECT` +hook: +- [The eBPF program redirects traffic, but it is not working as expected.](#traffic-is-not-redirected-as-expected) + +-------------------- + +### Traffic Is Not Redirected As Expected + +If you are attaching your program at the `BPF_CGROUP_INET4_CONNECT` or `BPF_CGROUP_INET6_CONNECT` hooks, you can +redirect traffic to a different target IP address. Use the guidance below if the traffic is not getting redirected as +you expect. + +Ensure that you have [verified the program is configured correctly](#verify-the-ebpf-program-is-configured-correctly), +notably, checking that any expected map usage is correctly configured. + +Once you have confirmed that the program and any maps used are correctly configured, the next thing to look for is +whether or not the eBPF platform is performing the redirection. 
In the +[eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces), you should look for the following trace: +``` +[3]10A8.0A54::2023/04/28-10:31:41.312214200 [NetEbpfExtProvider]{"Message":"connect_redirect_classify", +"TransportEndpointHandle":463,"Protocol":6,"src_ip":"0.0.0.0","src_port":51346,"dst_ip":"8.8.8.8","dst_port":6666, +"redirected_ip":"127.0.0.1","redirected_port":5555,"Verdict":1,"meta":{"provider":"NetEbpfExtProvider","event": +"NetEbpfExtGenericMessage","time":"2023-04-28T17:31:41.3122142Z","cpu":3,"pid":4264,"tid":2644,"channel":11,"level":4, +"keywords":"0x20"}} +``` + +From this trace, you should look at the IP properties of the original connection (`src_ip`, `src_port`, `dst_ip`, and +`dst_port`) and also of the redirected remote address (`redirected_ip` and `redirected_port`). Note that the `src_ip` +value may be `0.0.0.0`, which is expected, as the source address may not be identified at the time of connect redirection. +There may be a few cases after looking for this trace: +1. This trace is present, but the IP properties are not as expected. In this case, please + [verify eBPF maps are properly configured](#verify-ebpf-maps-are-properly-configured). +2. This trace is present and has the expected IP properties, but traffic is still not reaching the proxy. Please + [check for interoperability issues with another WFP callout](#interoperability-issues-with-another-wfp-callout). +3. This trace is not present at all. First, check the [eBPF diagnostic traces](./Diagnostics.md#ebpf-diagnostic-traces) + to identify if there were any issues within the callout itself. If there are no errors in this codepath, + [check for interoperability issues with another WFP callout](#interoperability-issues-with-another-wfp-callout). + +-------------------- + +#### Interoperability Issues With Another WFP Callout + +Multiple WFP callouts at the connect redirect layer may cause unexpected results. This may surface as one of the +following symptoms: +1. 
The connection is not reaching the proxy. This can happen both when the eBPF callout is invoked and + when it is not. +2. The connection reaches the proxy, but does not reach the expected final destination. +3. Kernel crash + +To check if there is another WFP callout at the connect redirect layer, you should search in the +[WFP state diagnostic file](./Diagnostics.md#wfp-state) for the string `FWPM_LAYER_ALE_CONNECT_REDIRECT_V4` (or `V6`, +if applicable). Within this layer, you can look in the `callouts` section of the file. We expect to see only one +callout here, the eBPF callout. If you see more than one, then another WFP callout driver may be attempting to redirect +the same connections that your eBPF program is, which may affect the final connection. + +Sample output: +``` + + + {98849e0f-b07d-11ec-9a30-18602489beee} + + ALE Connect Redirect eBPF Callout v4 + ALE Connect Redirect callout for eBPF + + + FWPM_CALLOUT_FLAG_REGISTERED + + + + FWPM_LAYER_ALE_CONNECT_REDIRECT_V4 + 300 + + + {c2a93a3e-cff4-5339-be53-21365ba19f35} + + Another Connect Redirect callout + + + + FWPM_CALLOUT_FLAG_USES_PROVIDER_CONTEXT + FWPM_CALLOUT_FLAG_REGISTERED + + + + FWPM_LAYER_ALE_CONNECT_REDIRECT_V4 + 316 + + +``` + +**Mitigation:** If there are any issues observed and multiple WFP callouts are identified, it is recommended to +uninstall or disable the other WFP callouts. Note that the `name` field in the `wfpstate` output may differ from the +actual driver or product name. + +-------------------- \ No newline at end of file
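
The check for multiple callouts at the connect redirect layer can be partly automated. The following is a minimal Python sketch under stated assumptions: it treats any XML element with a direct `applicableLayer` child as a callout entry and reads its `calloutKey` and a nested `name` element. The field names come from the text above, but the exact nesting in a real `wfpstate.xml` is an assumption here, and the `SAMPLE` document and the `callouts_at_layer` helper are hypothetical illustrations.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the relevant part of wfpstate.xml.
# The nesting is an assumption; only the field names come from the guide.
SAMPLE = """
<wfpstate>
  <callouts>
    <item>
      <calloutKey>{98849e0f-b07d-11ec-9a30-18602489beee}</calloutKey>
      <displayData><name>ALE Connect Redirect eBPF Callout v4</name></displayData>
      <applicableLayer>FWPM_LAYER_ALE_CONNECT_REDIRECT_V4</applicableLayer>
    </item>
    <item>
      <calloutKey>{c2a93a3e-cff4-5339-be53-21365ba19f35}</calloutKey>
      <displayData><name>Another Connect Redirect callout</name></displayData>
      <applicableLayer>FWPM_LAYER_ALE_CONNECT_REDIRECT_V4</applicableLayer>
    </item>
  </callouts>
</wfpstate>
"""


def callouts_at_layer(root, layer_name):
    """Return (calloutKey, name) for each element whose applicableLayer is layer_name."""
    found = []
    for elem in root.iter():
        # Only elements with a direct applicableLayer child are considered.
        if (elem.findtext("applicableLayer") or "").strip() == layer_name:
            key = (elem.findtext("calloutKey") or "").strip()
            # Assumption: the display name is in a nested <name> element.
            name = (elem.findtext(".//name") or "").strip()
            found.append((key, name))
    return found


root = ET.fromstring(SAMPLE.strip())
matches = callouts_at_layer(root, "FWPM_LAYER_ALE_CONNECT_REDIRECT_V4")
for key, name in matches:
    print(key, name)
if len(matches) > 1:
    print("More than one callout at this layer; check for interoperability issues.")
```

For a real capture, replace `ET.fromstring(SAMPLE.strip())` with `ET.parse(path).getroot()`, where `path` points at the file produced by `netsh wfp show state`.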