Large /var/log/btmp file wrecks PAM performance and causes 2-3 seconds delay in su/sudo/ssh login. #270
Comments
|
I had exactly the same problem. A server that I could once ssh into in the blink of an eye took, after a couple of years, more than a minute (!!) to log in to. I tried fixing all sorts of the usual suspects - such as reverse DNS lookup of the ssh client machine, restarting all sorts of services and so on - but nothing helped. Then by chance I realized that I had a huge (multi-gigabyte) /var/log/btmp file. Its size didn't bother me (I had a terabyte of free space), but deleting this file fixed the ssh connection lag! Clearly, something in SSH's PAM usage has an algorithm which is linear in the size of the /var/log/btmp file - which is a disaster. This should either be fixed (e.g., just append to the file without reading all its existing content), or, if it can't be fixed, the code should refuse to work on a file beyond some size limit, perhaps logging a warning that the user should logrotate or outright delete the btmp file. Without such a limit and warning, most users wouldn't be able to guess that /var/log/btmp is responsible for their ssh delays. |
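To make the size-limit idea concrete, here is a minimal Python sketch. It is not how pam_lastlog works (that code is C and I have not checked it); the 64 MiB limit, the function name should_scan_btmp and the logger name are all made up for illustration.

```python
import logging
import os

# Hypothetical limit; a real module would presumably make this configurable.
MAX_BTMP_BYTES = 64 * 1024 * 1024


def should_scan_btmp(path, limit=MAX_BTMP_BYTES,
                     log=logging.getLogger("btmp-guard")):
    """Return True only if `path` exists and is small enough to scan.

    The check is a single stat() call, so its cost does not grow with
    the size of the file.
    """
    try:
        size = os.stat(path).st_size
    except FileNotFoundError:
        return False  # nothing to scan

    if size > limit:
        # Tell the admin why the failed-login count is being skipped.
        log.warning(
            "%s is %d bytes (limit %d); skipping failed-login count, "
            "consider rotating or truncating it",
            path, size, limit,
        )
        return False
    return True
```

A caller would run the check before reading any records, for example `if should_scan_btmp("/var/log/btmp"): ...`, so an oversized file costs one warning line instead of a full read.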
|
Does your PAM stack contain |
|
Yes, it appears it does. Now that I know what the problem is, I guess I could disable it. So I think this "showfailed" should either have an O(1) algorithm (e.g., append to a file), or if this is impossible, it should at least log a warning to the main system log that the file is too large - and perhaps even stop doing anything beyond a certain size. |
|
Please note that what bothered me was the lag in successful ssh connections that were longer than a minute - I am not talking about lag on unsuccessful logins. So I'm guessing that |
|
Since /var/log/btmp doesn't have to be sorted, |
I never set this option myself (I wasn't even aware it existed until yesterday) - it came pre-configured with my Linux distribution (CentOS 7 in this case - I didn't check what newer distributions do). So I think that if "showfailed" is something which Linux distributions tend to enable (without logrotate on btmp), it should have reasonable (and O(1)) performance.
Wow, thanks. Now I finally understand. I saw these "There were a gazillion failed login attempts since the last successful login" on successful logins, but didn't realize they came from slowly counting the entire btmp file. I assumed there was a counter somewhere. I think the existence of this user-visible message from pam_lastlog.so makes a solution even easier - |
|
By the way, apparently this bug has been known for at least 5 years before this issue was opened, and keeps being rediscovered over and over - see for example the following discussions which I found with Google that pointed to this problem between 2015 and 2020: https://serverfault.com/questions/691127/ssh-login-hangs-for-several-minutes (May 2015) |
|
At the same time, the internet is full of recipes for rotating wtmp and btmp files using logrotate,
This may be true, but how is a user supposed to know that the oversized btmp is causing the login delay, and look for those logrotate entries? If you google "slow ssh" or something like that, there are at least a dozen more likely explanations suggested, and none of them involve an oversized btmp or PAM. Clearly not all Linux distributions logrotate btmp by default (in my case, it was CentOS 7). Many people will find the oversized log file because they run out of disk space. But other people, like myself, have huge amounts of disk space, were never bothered by large log files and so didn't notice them. Nobody expects a large log file to cause symptoms like slow login - this is a bug in the login program (or, in this case, PAM), not an expected symptom of large log files. |
|
I found this after about a week of searching, thanks to more people having the same problem over time. This is not a bug, IMHO, it's a design flaw. If the only way to show "There were %d failed login attempts since the last successful login." is to scan the entire file piece by piece, then PAM should put a warning in the logs when the delay gets too high. And that should be based on "time taken to parse btmp" and not on the size of the file (since disk access and other factors can affect the overall time needed, which is really the key for the end-user: "how long does it take to log in to my system?"). FWIW my system (RHEL 8.6) does rotate btmp by default, but it does it on a one-month cycle with no size restriction. Any internet-facing system runs the risk of attackers trying to brute-force passwords by repeated login attempts... which will generate this problem in days (if not hours). I find it amazing that this is still an issue, but thank you to those who have done the research on this. |
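As a rough illustration of "warn based on time taken, not file size", here is a Python sketch of a full linear scan that also measures itself. It is not the pam_lastlog implementation. The record layout is an assumption (glibc's struct utmp on x86-64 Linux, 384 bytes per record), and the function name and the one-second threshold are invented for the example.

```python
import struct
import time

# Assumed layout: glibc struct utmp on x86-64 Linux (384 bytes per record).
# Fields: ut_type, ut_pid, ut_line, ut_id, ut_user, ut_host,
#         e_termination, e_exit, ut_session, tv_sec, tv_usec,
#         ut_addr_v6[4], unused.
UTMP = struct.Struct("<h2xi32s4s32s256shhiii4i20s")


def count_failures(path, user, since, warn_after=1.0):
    """Count records for `user` newer than `since` (epoch seconds).

    Every record is read, so the cost is linear in the file size.
    Returns (failures, elapsed_seconds, slow), where `slow` means the
    scan took longer than `warn_after` and deserves a log warning.
    """
    start = time.monotonic()
    failures = 0
    with open(path, "rb") as f:
        while True:
            raw = f.read(UTMP.size)
            if len(raw) < UTMP.size:
                break  # end of file (or a truncated trailing record)
            fields = UTMP.unpack(raw)
            name = fields[4].split(b"\0", 1)[0].decode(errors="replace")
            tv_sec = fields[9]
            if name == user and tv_sec > since:
                failures += 1
    elapsed = time.monotonic() - start
    return failures, elapsed, elapsed > warn_after
```

The point of returning `elapsed` is that the caller can log the actual delay the user experienced, whatever mix of file size and disk speed caused it.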
|
We could work around it by giving up reading the btmp entries after some configurable timeout. Of course, a proper solution would use something like the faillock module.
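A minimal Python sketch of that timeout workaround, for illustration only: the real change would live in the C module, and this is not taken from any actual patch. The record layout (384-byte glibc struct utmp on x86-64), the function name, the chunk size and the default timeout are all assumptions.

```python
import struct
import time

# Assumed layout: glibc struct utmp on x86-64 Linux (384 bytes per record).
UTMP = struct.Struct("<h2xi32s4s32s256shhiii4i20s")
CHUNK_RECORDS = 1024  # hypothetical: records read per deadline check


def count_failures_with_timeout(path, user, timeout=0.5):
    """Count btmp records for `user`, giving up after `timeout` seconds.

    Returns (failures, complete). `complete` is False when the deadline
    was hit first, in which case `failures` is only a lower bound.
    """
    deadline = time.monotonic() + timeout
    wanted = user.encode()
    failures = 0
    with open(path, "rb") as f:
        while True:
            if time.monotonic() >= deadline:
                return failures, False  # gave up; count is partial
            chunk = f.read(UTMP.size * CHUNK_RECORDS)
            # Ignore a truncated trailing record, if any.
            usable = len(chunk) - len(chunk) % UTMP.size
            if usable == 0:
                return failures, True  # reached end of file
            for fields in UTMP.iter_unpack(chunk[:usable]):
                if fields[4].rstrip(b"\0") == wanted:
                    failures += 1
```

When `complete` comes back False the caller could print something like "at least N failed login attempts" and log a warning that btmp needs rotating, so the login delay stays bounded by the timeout.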
First and foremost, my apologies if this is placed incorrectly.
I just spent a day figuring out why su to another user had an unexplainable 2-3 second delay. Similar delays applied to connecting to the server with SSH.
Example: su <someuser> -c "whoami" could take 2-3 seconds.
With strace -o trace.log su <someuser> -c "whoami" I was able to get more information.
After going back and forth a bit I found the following lines in the strace
and this continued for thousands of lines. After emptying the /var/log/btmp file, both sudo and SSH connections were blazing fast again.
Surely such operations should not need to load/read the whole file?
I'm running CentOS Linux release 8.2.2004 (Core)
Kernel 4.18.0-193.19.1.el8_2.x86_64.
No changes to default settings.
Of course I can provide more information if required.
Keywords: var log btmp, slow ssh, slow sudo, slow su. seconds delay with su to user.