Make RADIUS timeout value configurable #3185

Closed
1 task done
aLTeReGo-SWI opened this issue May 22, 2023 · 7 comments · Fixed by #3188
Labels
feature-request Request for new features to be added

Comments

@aLTeReGo-SWI

⚠️ Please verify that this feature request has NOT been suggested before.

  • I checked and didn't find similar feature request

🏷️ Feature Request Type

UI Feature

🔖 Feature description

The RADIUS monitor appears to have a hard-coded 2500 ms timeout, though it may instead be two 1-second timeouts and one 30-second timeout.

We have instances where RADIUS requests can take as much as 10 seconds to respond. It's not performant, but it isn't 'down' either. Making this value configurable would alleviate a lot of the false positives I'm seeing.

[Screenshot: 2023-05-22_10-09-53]

✔️ Solution

Add a new UI element to the monitor that allows the user to enter an integer timeout value.

❓ Alternatives

Increase the hard-coded timeout values. Not a good solution, but it is an alternative.

📝 Additional Context

No response

@aLTeReGo-SWI aLTeReGo-SWI added the feature-request Request for new features to be added label May 22, 2023
@CommanderStorm
Collaborator

CommanderStorm commented May 22, 2023

Could you explain further how such a high ping could happen?
For a user who has nothing to do with RADIUS: is such abnormally high latency expected behaviour?

@I71d0r

I71d0r commented May 23, 2023

@CommanderStorm for basic scenarios, RADIUS verifies access quickly using internal means.
However, RADIUS implementations also allow more advanced scenarios that verify identity against external services such as Active Directory, Okta, Google Workspace, etc. Typically such information is cached, but the cache may have expired or been invalidated on purpose.
This can cause latency spikes that are evaluated as failures, even though the requests would eventually succeed after a delay. To distinguish a sluggish service from one that is not working, fine-tuning the request timeout is essential to minimize false positives.

@CommanderStorm
Collaborator

So basically, the average latency you would expect is below the current value, right?

Is the use case you are describing not better solved via the Retries option?
Would what you describe make a good help text in the monitor setup to distinguish between Timeout and Retries?

@aLTeReGo-SWI
Author

@I71d0r is 100% spot on. While most RADIUS requests should take less than 2.5 seconds to complete, there are instances where it simply takes more time. It's not 'Down', as the response is eventually sent. Sometimes that takes as much as 10 seconds, but this is normal and expected behavior, even if it's not optimal.

That means you shouldn't receive an alert for something that is normal/expected behavior. That's what causes alert fatigue and causes people to ignore alerts because they're not confident they are accurate.

Retries, as I understand them, aren't going to solve the problem if the response is going to take 10 seconds to complete. What retries do is 'keep retrying X number of times, or until a response arrives within 2.5 seconds'. That's not the same thing as a configurable timeout value, especially for instances where the normal average response time is greater than 2.5 seconds. You could retry forever, and it might never complete in 2.5 seconds.
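
The point about retries can be made concrete with a small model. This is only a sketch, not Uptime Kuma's code: `probe` and its parameters are hypothetical names, and it assumes every attempt sees the same response latency.

```python
from __future__ import annotations


def probe(latency_s: float, timeout_s: float, retries: int) -> tuple[bool, int]:
    """Return (is_up, attempts_used) for a check with a fixed per-attempt timeout.

    Hypothetical model: every attempt takes `latency_s` seconds to answer,
    and an attempt fails if that exceeds `timeout_s`.
    """
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        if latency_s <= timeout_s:
            return True, attempt
    return False, retries + 1


# A 10-second response never fits a 2.5-second timeout, however many retries.
print(probe(latency_s=10.0, timeout_s=2.5, retries=5))   # (False, 6)
# A longer timeout succeeds on the first attempt.
print(probe(latency_s=10.0, timeout_s=15.0, retries=0))  # (True, 1)
```

Under this model, the retry count only changes how many attempts fail; only the timeout value decides whether a slow but valid response counts as up.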

@CommanderStorm
Collaborator

CommanderStorm commented May 25, 2023

@aLTeReGo-SWI please answer all my questions

So basically, the average latency you would expect for RADIUS is below the current value, right? (As in, latency > 2.5 s is the absolute exception?)
Is the use case you are describing (cache miss => long latency) not better solved via the Retries option?

Would what you describe make a good help text in the monitor setup to distinguish between Timeout and Retries?

@aLTeReGo-SWI
Author

@CommanderStorm Latency varies based on the request. If a request comes in that has cached data, the response is relatively quick: less than a second on average. Requests that are not cached take longer to be served, upwards of ~10 seconds.

Increasing the retries simply results in hammering the server with the same request, stacking these requests up in the queue and further delaying the response.
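
A rough way to see the stacking effect, under two assumptions that are not confirmed anywhere in this thread: each retry is sent as soon as the previous attempt times out, and the server keeps working on a request even after the client has given up on it. The function name and parameters are hypothetical.

```python
def outstanding_requests(service_time_s: float, timeout_s: float, retries: int) -> int:
    """Peak number of requests the server is still processing at once.

    Assumptions (not from Uptime Kuma): attempt k is sent at k * timeout_s,
    and the server spends service_time_s on each request even if the client
    has already timed out.
    """
    send_times = [k * timeout_s for k in range(retries + 1)]
    peak = 0
    for t in send_times:
        # Count requests sent at or before t that have not finished by t.
        in_flight = sum(1 for s in send_times if s <= t < s + service_time_s)
        peak = max(peak, in_flight)
    return peak


# 10 s of server work per request, 2.5 s client timeout, 3 retries.
print(outstanding_requests(service_time_s=10.0, timeout_s=2.5, retries=3))  # 4
```

With a timeout longer than the service time and no retries, the same model reports a single outstanding request.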

A 'retry' means: this was down, i.e. it exceeded the timeout value (currently 2.5 seconds), but it might come back, so try again.

A 'timeout' is how long should I wait for my request to be responded to before giving up, and retrying if a retry value is configured.
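
To show how the two settings compose, here is a minimal sketch using Python's `asyncio.wait_for`. It is not Uptime Kuma's implementation: `check_with_timeout` and `slow_server` are hypothetical names, and the delays are scaled down from seconds to fractions of a second so the example runs quickly.

```python
import asyncio


async def check_with_timeout(request, timeout_s: float, retries: int = 0) -> bool:
    """Await `request()` up to `retries + 1` times, each bounded by `timeout_s`.

    `request` is a hypothetical zero-argument coroutine function standing in
    for a RADIUS Access-Request.
    """
    for _ in range(retries + 1):
        try:
            await asyncio.wait_for(request(), timeout=timeout_s)
            return True  # answered within the timeout: the service is up
        except asyncio.TimeoutError:
            continue  # timed out: give up on this attempt, retry if allowed
    return False


async def slow_server() -> str:
    # Stand-in for an uncached lookup; scaled down from ~10 s to 0.2 s.
    await asyncio.sleep(0.2)
    return "Access-Accept"


# Timeout shorter than the response time: every attempt fails.
print(asyncio.run(check_with_timeout(slow_server, timeout_s=0.05, retries=2)))  # False
# Timeout longer than the response time: the first attempt succeeds.
print(asyncio.run(check_with_timeout(slow_server, timeout_s=1.0)))  # True
```

The timeout bounds a single attempt, while the retry count bounds how many attempts are made; a user-defined `timeout_s` is the only knob that lets a slow but healthy server pass.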

Also, I may be mistaken, but the little bits of yellow on my availability charts suggest that retries count against overall availability. An extended timeout should not count against availability if the request was serviced within the user-defined timeout period.

@CommanderStorm
Collaborator

CommanderStorm commented May 30, 2023

Linking a few PRs/Issues:

Current state of timeouts: #2142
Timeouts are generally tracked in #877

⇒ once #2142 and #3188 are merged, adding a timeout to RADIUS is easy
