Workflow failed - km_mt_stress_tests_restart_extension #3097

Closed
github-actions bot opened this issue Dec 1, 2023 · 23 comments · Fixed by #3148
Labels: bug (Something isn't working), ci/cd (Issue is specific to CI/CD), triaged (Discussed in a triage meeting)

Comments

@github-actions
Contributor

github-actions bot commented Dec 1, 2023

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

github-actions bot added the bug and ci/cd labels Dec 1, 2023
Contributor Author

github-actions bot commented Dec 3, 2023

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

@shankarseal
Collaborator

@dv-msft these are failing even after #3017 . The logs show the tests were canceled. Can you please take a look?

dahavey added the triaged label Dec 4, 2023
dahavey added this to the 2312 milestone Dec 4, 2023
Contributor Author

github-actions bot commented Dec 6, 2023

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

@dahavey dahavey modified the milestones: 2312, 2401 Dec 11, 2023
Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

@dv-msft
Collaborator

dv-msft commented Dec 27, 2023

First off, kernel-mode crash dumps seem to be generated inconsistently in CI/CD runs (at least for the kernel-mode mt stress tests). For example:

The https://github.com/microsoft/ebpf-for-windows/actions/runs/7327818979 run shows a failure for the debug version of the km_mt_stress_tests_restart_extension test but does not contain the traces or kernel mode dump.

OTOH, the prior https://github.com/microsoft/ebpf-for-windows/actions/runs/7320276834 run also shows the same failure but does contain the related Test-Logs-km_mt_stress_tests_restart_extension-x64-Debug artifact with a valid kernel mode dump.

(@rectified95: any idea why this is happening?)

Based on the available dump, the crash is caused by an explicit __fastfail bugcheck due to net_ebpf_extension_hook_client_enter_rundown failing to acquire rundown protection on the client context (an invalid EX_RUNDOWN_REF context?):

 FAULTING_SOURCE_LINE:  D:\a\ebpf-for-windows\ebpf-for-windows\netebpfext\net_ebpf_ext_bind.c

FAULTING_SOURCE_FILE:  D:\a\ebpf-for-windows\ebpf-for-windows\netebpfext\net_ebpf_ext_bind.c

FAULTING_SOURCE_LINE_NUMBER:  359

FAULTING_SOURCE_CODE:  
   355:         goto Exit;
   356:     }
   357: 
   358:     attached_client = (net_ebpf_extension_hook_client_t*)filter_context->client_context;
>  359:     ENTER_HOOK_CLIENT_RUNDOWN(attached_client);
   360: 
   361:     addr.sin_port = incoming_fixed_values->incomingValue[FWPS_FIELD_ALE_RESOURCE_RELEASE_V4_IP_LOCAL_PORT].value.uint16;
   362:     addr.sin_addr.S_un.S_addr =
   363:         incoming_fixed_values->incomingValue[FWPS_FIELD_ALE_RESOURCE_RELEASE_V4_IP_LOCAL_ADDRESS].value.uint32;
   364: 

This code should have been skipped, since we check whether the client is detached just above (client_detached is true in the data struct dumps below). This looks like an edge-case race between netebpfext and WFP.
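
To make the suspected window concrete, here is a minimal user-mode sketch of one interleaving that would match the description above: the detached check passes, a detach lands before rundown protection is acquired, and the acquire then fails. This is an assumption about how the race could occur, not a confirmed diagnosis. Every name in it (rundown_ref_t, hook_client_t, filter_context_t, classify, detach_client, the in_window hook) is a hypothetical stand-in built on C11 atomics, not the netebpfext code or the kernel rundown API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rundown protection; NOT the kernel API. */
#define RUNDOWN_ACTIVE ((uintptr_t)1)    /* bit 0: rundown has started */
#define RUNDOWN_COUNT_INC ((uintptr_t)2) /* one reference, stored above bit 0 */

typedef struct { atomic_uintptr_t count; } rundown_ref_t;
typedef struct { rundown_ref_t rundown; } hook_client_t;
typedef struct {
    hook_client_t* client_context;
    atomic_bool client_detached;
} filter_context_t;

typedef enum { SKIPPED_DETACHED, SKIPPED_RUNDOWN, INVOKED } classify_result_t;

/* Take a reference unless rundown has already started. */
static bool rundown_acquire(rundown_ref_t* ref)
{
    uintptr_t old = atomic_load(&ref->count);
    while ((old & RUNDOWN_ACTIVE) == 0) {
        /* On failure, 'old' is reloaded and the active bit is rechecked. */
        if (atomic_compare_exchange_weak(&ref->count, &old, old + RUNDOWN_COUNT_INC)) {
            return true;
        }
    }
    return false;
}

/* Drop a reference previously taken by rundown_acquire. */
static void rundown_release(rundown_ref_t* ref)
{
    uintptr_t prev = atomic_fetch_sub(&ref->count, RUNDOWN_COUNT_INC);
    assert(prev >= RUNDOWN_COUNT_INC);
    (void)prev;
}

/* Simplified detach: mark the filter context and start rundown (no waiting). */
static void detach_client(filter_context_t* fc)
{
    atomic_store(&fc->client_detached, true);
    atomic_fetch_or(&fc->client_context->rundown.count, RUNDOWN_ACTIVE);
}

static void filter_context_init(filter_context_t* fc, hook_client_t* client)
{
    atomic_init(&client->rundown.count, 0);
    fc->client_context = client;
    atomic_init(&fc->client_detached, false);
}

/* 'in_window' (may be NULL) runs between the detached check and the acquire,
 * so a test can place a detach exactly inside the racy window. */
static classify_result_t classify(filter_context_t* fc, void (*in_window)(filter_context_t*))
{
    if (atomic_load(&fc->client_detached)) {
        return SKIPPED_DETACHED;
    }
    if (in_window != NULL) {
        in_window(fc);
    }
    if (!rundown_acquire(&fc->client_context->rundown)) {
        return SKIPPED_RUNDOWN; /* reported to the caller instead of failing fast */
    }
    /* ... the attached program would be invoked here ... */
    rundown_release(&fc->client_context->rundown);
    return INVOKED;
}
```

Calling classify(&fc, detach_client) on a freshly initialized context returns SKIPPED_RUNDOWN: the check saw an attached client, but the acquire failed. The sketch returns a status at that point purely to keep the example testable; whether a failed acquire should be tolerated or treated as fatal in the real callout is a separate design question.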

Data struct dumps at the point of the crash:

1: kd> dx filter_context
filter_context                 : 0xffffa90a015c0fe0 [Type: _net_ebpf_extension_wfp_filter_context *]
    [+0x000] reference_count  : 2 [Type: long]
    [+0x008] client_context   : 0xffffa90a01514f90 [Type: _net_ebpf_extension_hook_client *]
    [+0x010] filter_ids       : 0x0 [Type: unsigned __int64 *]
    [+0x018] filter_ids_count : 0x0 [Type: unsigned int]
    [+0x01c ( 0: 0)] client_detached  : true [Type: bool]
    
1: kd> dx filter_context->client_context
filter_context->client_context                 : 0xffffa90a01514f90 [Type: _net_ebpf_extension_hook_client *]
    [+0x000] link             [Type: _LIST_ENTRY]
    [+0x010] nmr_binding_handle : 0xffffa909e27f4a9e [Type: void *]
    [+0x018] client_module_id : {2C159C2C-A2FF-11EE-A0E7-00155DA34A26} [Type: _GUID]
    [+0x028] client_binding_context : 0xffffa909e1342ed0 [Type: void *]
    [+0x030] client_data      : 0xffffa909e1342f88 [Type: _ebpf_extension_data *]
    [+0x038] invoke_program   : 0xfffff80559fced40 : ebpfcore!_ebpf_link_instance_invoke+0x0 [Type: ebpf_result (__cdecl*)(void *,void *,unsigned int *)]
    [+0x040] provider_data    : 0xffffa90a015c0fe0 [Type: void *]
    [+0x048] provider_context : 0xffffa90a01440f80 [Type: _net_ebpf_extension_hook_provider *]
    [+0x050] detach_work_item : 0xffffa90a01336fa0 [Type: _IO_WORKITEM *]
    [+0x058] rundown          [Type: _net_ebpf_ext_hook_client_rundown]
    
1: kd> dx -r2 filter_context->client_context->rundown
filter_context->client_context->rundown                 [Type: _net_ebpf_ext_hook_client_rundown]
    [+0x000] protection       [Type: _EX_RUNDOWN_REF]
        [+0x000] Count            : 0xfffffe0b75d648c1 [Type: unsigned __int64]
        [+0x000] Ptr              : 0xfffffe0b75d648c1 [Type: void *]
    [+0x008] rundown_occurred : false [Type: bool]
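
One detail in the dump above may be worth decoding: the Count value ends in ...c1, so its low bit is set. Assuming the commonly described EX_RUNDOWN_REF encoding (bit 0 as the rundown-active flag, with the remaining bits holding either a reference count or, while a waiter is blocked, a pointer to its wait block), this would suggest that rundown was already in progress on the client even though rundown_occurred was still false. The encoding is an assumption here, not something confirmed in this thread, and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the rundown-active flag in bit 0 of EX_RUNDOWN_REF.Count. */
#define RUNDOWN_ACTIVE_BIT UINT64_C(1)

/* True when bit 0 of the raw Count value is set. */
static bool rundown_is_active(uint64_t count)
{
    return (count & RUNDOWN_ACTIVE_BIT) != 0;
}

/* While no rundown is in progress, the bits above bit 0 hold the reference count. */
static uint64_t rundown_reference_count(uint64_t count)
{
    assert(!rundown_is_active(count));
    return count >> 1;
}

/* Once rundown is active with references outstanding, the remaining bits are
 * assumed to be the address of the waiter's wait block. */
static uint64_t rundown_wait_block(uint64_t count)
{
    assert(rundown_is_active(count));
    return count & ~RUNDOWN_ACTIVE_BIT;
}
```

Applied to the value from the dump, rundown_is_active(0xfffffe0b75d648c1) is true and rundown_wait_block returns 0xfffffe0b75d648c0, which falls in the same 0xfffffe0b... range as the stack addresses in the trace. Under the assumed encoding, that would be consistent with another thread waiting for rundown to complete at the time of the crash.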

Stack trace:

00 fffffe0b`75af3418 fffff805`5a7cc669     nt!KeBugCheckEx [minkernel\ntos\ke\amd64\procstat.asm @ 140] 
01 fffffe0b`75af3420 fffff805`5a7cca10     nt!KiBugCheckDispatch+0x69 [minkernel\ntos\ke\amd64\trap.asm @ 3724] 
02 fffffe0b`75af3560 fffff805`5a7cae08     nt!KiFastFailDispatch+0xd0 [minkernel\ntos\ke\amd64\trap.asm @ 3898] 
03 fffffe0b`75af3740 fffff805`59ba538a     nt!KiRaiseSecurityCheckFailure+0x308 [minkernel\ntos\ke\amd64\trap.asm @ 2551] 
04 fffffe0b`75af38d0 fffff805`5b569808     netebpfext!net_ebpf_ext_resource_release_classify+0x14a [D:\a\ebpf-for-windows\ebpf-for-windows\netebpfext\net_ebpf_ext_bind.c @ 359] 
05 (Inline Function) --------`--------     NETIO!ValidateClassifyOutFromCallout+0x160 [minio\netio\wfp\filterengine\notify.c @ 243] 
06 fffffe0b`75af3990 fffff805`5b56813a     NETIO!ProcessCallout+0x6e8 [minio\netio\wfp\filterengine\notify.c @ 587] 
07 (Inline Function) --------`--------     NETIO!ProcessFastCalloutClassify+0x41 [minio\netio\wfp\filterengine\classify.c @ 315] 
08 fffffe0b`75af3b20 fffff805`5d141700     NETIO!KfdClassify+0x25a [minio\netio\wfp\filterengine\classify.c @ 855] 
09 fffffe0b`75af3f30 fffff805`5d0878a9     tcpip!AleNotifyEndpointTeardown+0xb82e8 [minio\netio\wfp\sys\ale\ale_shim.c @ 471] 
0a (Inline Function) --------`--------     tcpip!WfpAleEndpointTeardownHandler+0x360 [minio\netio\wfp\sys\ale\endpoint.c @ 360] 
0b (Inline Function) --------`--------     tcpip!InetInspectCleanupEndpoint+0x37f [minio\netio\transport\common\inetinspect.c @ 2030] 
0c (Inline Function) --------`--------     tcpip!TcpCleanupEndpointWorkQueueRoutine+0x463 [minio\netio\transport\tcp\endpoint.c @ 881] 
0d (Inline Function) --------`--------     tcpip!TcpCleanupEndpoint+0x477 [minio\netio\transport\tcp\endpoint.c @ 797] 
0e (Inline Function) --------`--------     tcpip!TcpDereferenceEndpoint+0x492 [minio\netio\transport\tcp\endpoint.h @ 218] 
0f fffffe0b`75af43c0 fffff805`5d0873b9     tcpip!TcpCloseEndpointWorkQueueRoutine+0x4d9 [minio\netio\transport\tcp\endpoint.c @ 1032] 
10 (Inline Function) --------`--------     tcpip!TcpCloseEndpoint+0xc6 [minio\netio\transport\tcp\endpoint.c @ 969] 
11 fffffe0b`75af44c0 fffff805`5c0f2c70     tcpip!TcpTlEndpointCloseEndpoint+0xe9 [minio\netio\transport\tcp\provider.c @ 314] 
12 fffffe0b`75af4520 fffff805`5c0d35d4     afd!AfdTLCloseEndpoint+0x48 [minio\sockets\winsock2\wsp\afdsys\tlisup.c @ 1940] 
13 (Inline Function) --------`--------     afd!AfdDerefTLBaseEndpoint+0x18 [minio\sockets\winsock2\wsp\afdsys\afdprocs.h @ 3292] 
14 fffffe0b`75af4560 fffff805`5c0f3fa7     afd!AfdCloseTransportEndpoint+0x7c [minio\sockets\winsock2\wsp\afdsys\close.c @ 188] 
15 fffffe0b`75af4640 fffff805`5c0f39b6     afd!AfdCleanupCore+0x377 [minio\sockets\winsock2\wsp\afdsys\close.c @ 1068] 
16 (Inline Function) --------`--------     afd!AfdCleanup+0x2c [minio\sockets\winsock2\wsp\afdsys\close.c @ 296] 
17 fffffe0b`75af4790 fffff805`5a7779aa     afd!AfdDispatch+0x86 [minio\sockets\winsock2\wsp\afdsys\dispatch.c @ 150] 
18 fffffe0b`75af4800 fffff805`5aeafef9     nt!IopfCallDriver+0x56 [minkernel\ntos\io\iomgr\iomgr.h @ 3383] 
19 fffffe0b`75af4840 fffff805`5a822ca5     nt!IovCallDriver+0x275 [minkernel\ntos\io\iomgr\ioverifier.c @ 589] 
1a fffffe0b`75af4880 fffff805`5ab3da4e     nt!IofCallDriver+0x15db75 [minkernel\ntos\io\iomgr\iosubs.c @ 3153] 
1b fffffe0b`75af48c0 fffff805`5abb2429     nt!IopCloseFile+0x15e [minkernel\ntos\io\iomgr\objsup.c @ 613] 
1c (Inline Function) --------`--------     nt!ObpDecrementHandleCount+0x8a [minkernel\ntos\ob\obhandle.c @ 2896] 
1d fffffe0b`75af4950 fffff805`5abb6d1e     nt!ObCloseHandleTableEntry+0x229 [minkernel\ntos\ob\obclose.c @ 209] 
1e (Inline Function) --------`--------     nt!ObpCloseHandle+0xaf [minkernel\ntos\ob\obclose.c @ 353] 
1f fffffe0b`75af4a90 fffff805`5a7cc085     nt!NtClose+0xde [minkernel\ntos\ob\obclose.c @ 516] 
20 fffffe0b`75af4b00 00007ffe`4681e774     nt!KiSystemServiceCopyEnd+0x25 [minkernel\ntos\ke\amd64\trap.asm @ 3449] 
21 0000005e`859ff2c8 00000000`00000000     0x00007ffe`4681e774

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

github-actions bot commented Jan 1, 2024

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

github-actions bot commented Jan 3, 2024

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

github-actions bot commented Jan 4, 2024

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

github-actions bot commented Jan 5, 2024

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

github-actions bot commented Jan 6, 2024

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension

Contributor Author

github-actions bot commented Jan 8, 2024

Failed Run
Codebase
Test name - km_mt_stress_tests_restart_extension


4 participants