Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Regression]Subscriber-db is getting locked very frequently and AMF is not sending any response to UE #11088

Closed
NeelimaKalpana opened this issue Jan 7, 2022 · 12 comments
Assignees
Labels
product: 5g sa type: bug Something isn't working

Comments

@NeelimaKalpana
Copy link

Your Environment

  • Version: 1.7.0-1641497442-16e40f75
  • Affected Component: Access Gateway
  • Affected Subcomponent: AGW service
  • Deployment Environment: bare-metal (AGW)- VMBased

Describe the Issue

Subscriber-db is getting locked very frequently and AMF is not sending any response to UE

To Reproduce
Execute the entire Regression SUIT and we can see the subscriber-db is getting locked and when a Reg-Request is received from UE , AMF is not sending any response back as it is not getting any answer from subscriberdb .
Expected behavior
AMF should wait for the response from subscriberdb for a certain time duration and if there is no response it should send a Reg-Reject.

Notice the ISSUE in the following TC in Todays(6-1-2022) Regression Run
TC1a Hyper_SUCI_No_PDU_LoopTest
TC6c Concurrent GUTI Registration with 2 UE (Mix of Valid + Invalid UE Reg - no subdb)
TC1d Concurrent SUCI Registration with 2 UE (Mix of Valid + Invalid UE Reg) P0 (subscriber Lock)
TC1h Verify unsupported DNN config P1.7 (subscriber Lock)

Screenshots:

syslogs:
image.png

mme.log: No Answer received from subscriberdb.
Fail_Case:
image.png

General_Flow:https://app.zenhub.com/files/170803235/950d955c-39c8-4e1b-9c6c-4e541dd2cc93/download
image.png
Noticed above Logs hit at different points which caused the TC-Failure.

@moinuddin1980
Copy link
Contributor

In API handle_subs_authentication_info_ans
62 if (!status.ok()) {
63 std::cout << "get_subs_auth_info fails with code " << status.error_code()
64 << ", msg: " << status.error_message() << std::endl;
65 } else {
66 strncpy(amf_app_subs_auth_info_resp_p->imsi, imsi.c_str(), imsi_length);
67 amf_app_subs_auth_info_resp_p->imsi_length = imsi_length;

68 amf_app_subs_auth_info_resp_p->ue_id = ue_id;
69
70 magma5g::convert_proto_msg_to_itti_m5g_auth_info_ans(
71 response, amf_app_subs_auth_info_resp_p);
72 }

If the input parameter status is not ok, then the imsi value is not gettig populated in authentication response towards AMF.
When the authentication response is processed at AMF the imsi value is resulting in value 0.

275638 Sun Jan 09 12:46:42 2022 7FC0660ED700 INFO AMF-AP tasks/amf/amf_app_main.cpp :0062 Received msg from :[TASK_GRPC_SERVICE] id:[137] name:[AMF_APP_SUBS_AUTH_INFO_RESP]
275639 Sun Jan 09 12:46:42 2022 7FC0660ED700 DEBUG AMF-AP tasks/amf/nas_proc.cpp :0427 Handling imsi 0
275640 Sun Jan 09 12:46:42 2022 7FC0660ED700 ERROR NAS-AM tasks/amf/nas_proc.cpp :0436 That's embarrassing as we don't know this IMSI

@RahulKalsangra
Copy link
Contributor

#pr #11109 raised

@panyogesh
Copy link
Contributor

Reason for the issue seems to be due to SqLite operation done by 2 process :

  1. On one end subscriberdb will periodically poll list of subscribers from orc8r and update the SqLite database present locally. Before doing this a lock is taken to make sure database is in consistent state.
  2. Other end is AMF sending request to SubscriberDB to fetch the details of Subscriber and the associated keys. This operation also takes the corresponding locks.

So the most probable reason seems to be simultaneous access of db by both the processes. The periodic operation happens roughly around 15 minutes .

Additional debug prints like print_grpc_payload setting to true in subscriberdb.yml can help in getting additional info for debugging.

@arun-magma
Copy link

Could understand that there is possible clash between AMF fetch and Sub_db periodic Updates with Orc8r.
@panyogesh : Do we have plans to manage this lock & its lock period ? go sequential, interest of sub session Auth fetch takes More priority than periodic update, And Note this issue is found when subs load is more, here in our case 600UEs.

@NeelimaKalpana
Copy link
Author

Tried the following steps to narrow down the failure cases where subscriber-db is getting locked.

  1. Tried executing the entire suit on local servers.
  2. Tried loop run of some of the TCs . Executed for 50 and 100 loops upto 10 times.
  3. Since this issue is happening only in regression setup, tried replicating the same in our local setup the 140-server.
    Tried for looprun .
  4. Tried by modifying the subscriber information from dash board and executed the loop run .

In all the above cases couldn't not see any subscriber-db lock which is causing the regression TC failure.

@arun-magma
Copy link

Hi @ulaskozat @panyogesh @daljit-mahi :
Here is the another instance of Subscriber lock issue which resulted in AGW sending "Illegal UE" Error in Registration Reject.
Note: During the time of this error there is no manual operations are being carried out in sub_db on NMS or AGW.

[syslog, mmelog and 3gpp pcaps enclosed below]
https://app.zenhub.com/files/170803235/e65c5dc4-3757-4fe9-a6a5-5c7edc9e6eb4/download
https://app.zenhub.com/files/170803235/8df410bd-faf4-4f8e-81b4-a80d4043fcba/download
https://app.zenhub.com/files/170803235/9745c564-c510-4b51-81f9-285ba9febe29/download

[Snap of syslog] - enclosing full syslog also

[Cause-3 Illegal UE]
Screen Shot 2022-02-04 at 6.29.38 PM.png

[Syslog-Sub_db during AIR/AIA procedure finding a db lock]
Screen Shot 2022-02-04 at 6.28.58 PM.png

@panyogesh
Copy link
Contributor

@arun-magma : We are able to emulate the issue in our setup with some python sqlite commands. Still its not a fix but it handles the exception and also dumps the Process (name and PID) which is holding the lock. PR is #11438. Its under review once merge I will update.

@arun-magma
Copy link

arun-magma commented Feb 9, 2022

Hi @panyogesh, I hope this issue is reproducible during the SUCI Reg related loop test (need to go for 100 loops) as probability of occurrence is every 5min, will test & let you know.
@NeelimaKalpana @rupa455wavelabs: Did we found this sub_db lock issue during earlier loop tests ?

@uri200
Copy link
Contributor

uri200 commented Feb 10, 2022

is this the fix for this issue ?
#11438

@panyogesh
Copy link
Contributor

panyogesh commented Feb 11, 2022

#11438 is first part of the fix which will handle the exception and send the error code to AMF.
There is one more PR for re-trying authentication from AMF to be raised. Hopefully it will be raised for review today.

@RahulKalsangra
Copy link
Contributor

#11454 pr raised for this issue.

@NeelimaKalpana
Copy link
Author

Issue is not observed in the latest runs. So closing with appropriate logs.
https://app.zenhub.com/files/170803235/84628911-0eda-4e7c-97a5-8754e46d4e11/download

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
product: 5g sa type: bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants