Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large /var/log/btmp file wrecks PAM performance and causes 2-3 seconds delay in su/sudo/ssh login. #270

Open
PauluzzNL opened this issue Sep 23, 2020 · 11 comments

Comments

@PauluzzNL
Copy link

First and foremost, my apologies if this is placed incorrectly.

I just spend a day figuring out why su to another user had an unexplainable 2-3 seconds delay. Similar delays applied to connecting to the server with SSH.

Example:
su <someuser> -c "whoami" could take between 2-3 seconds.

With
strace -o trace.log su <someuser> -c "whoami I was able to get more information.

After going back and forth a bit I found the following lines in the strace

     0.000053 openat(AT_FDCWD, "/var/log/btmp", O_RDONLY) = 7
     0.000065 read(7, "\6\0\0\0T\355#\0ssh:notty\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 384) = 384
     0.000059 read(7, "\6\0\0\0T\355#\0ssh:notty\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 384) = 384
     0.000056 read(7, "\6\0\0\0k\355#\0ssh:notty\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 384) = 384
     0.000056 read(7, "\6\0\0\0n\355#\0ssh:notty\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 384) = 384

and this continued for thousands of lines. After emptying the /var/log/btmp file both sudo and an SSH connection was blazing fast again.

For such operations, it does not make sense that it is required to load/read the whole file in order to do this operation?

I'm running CentOS Linux release 8.2.2004 (Core)
Kernel 4.18.0-193.19.1.el8_2.x86_64.

No changes to default settings.

Of course I can provide more information if required.

Keywords: var log btmp, slow ssh, slow sudo, slow su. seconds delay with su to user.

@nyh
Copy link

nyh commented Aug 21, 2021

I had exactly the same problem. A server to which I could connect with ssh at a blink of an eye, after a couple of years it would take me more than a minute (!!) to ssh into. I tried fixing all sort of usual suspects - such as reverse DNS lookup of the ssh client machine, restarting all sorts of services and so on - but nothing helped. Then by chance I realized that I have a huge (multi-gigabyte) /var/log/btmp file. Its size didn't bother me (I had a terabyte of free space), but deleting this file fixed the ssh connection lag!

Clearly, something in SSH's PAM usage has some algorithm which is linear in the size of the /var/log/btmp file - which is a disaster. This should either be fixed (e.g., just append to the file without reading all its existing content), or, if it can't be fixed, the code should refuse to work on a file beyond some size limit, perhaps logging a warning that the user should logrotate or outright delete the btmp file. Without such a limit and warning, most users wouldn't be able to guess that /var/log/btmp is responsible for their ssh delays.

@ldv-alt
Copy link
Member

ldv-alt commented Aug 21, 2021

Does your PAM stack contains pam_lastlog.so with showfailed option?

@nyh
Copy link

nyh commented Aug 21, 2021

Yes, it appears it does:

% grep pam_lastlog.so /etc/pamd/*
/etc/pam.d/postlogin:session     [default=1]   pam_lastlog.so nowtmp showfailed
/etc/pam.d/postlogin:session     optional      pam_lastlog.so silent noupdate showfailed

Now that I know what the problem is, I guess I could disable pam_lastlog.so or showfailed, but the point is for more than a year I didn't know what my problem was... I saw ridiculously slow ssh login times, and had many guesses as to what might be the problem - it took me a year to figure out it is /var/log/btmp.

So I think this "showfailed" should either have an O(1) algorithm (e.g., append to a file), or if this is impossible, it should at least warn to the main system log that the file is too large - and perhaps even stop doing anything beyond a certain size.

@nyh
Copy link

nyh commented Aug 21, 2021

Please note that what bothered me was the lag in successful ssh connections that were longer than a minute - I am not talking about lag on unsuccessful logins. So I'm guessing that pam_lastlog.so reads /var/log/btmp even if it's not going to write to it, because the login didn't fail? The strace log by the original poster of this issue also shows this. Why would it even do that?

@ldv-alt
Copy link
Member

ldv-alt commented Aug 21, 2021

pam_lastlog.so doesn't use btmp at all unless showfailed option is specified.

Since /var/log/btmp doesn't have to be sorted, pam_lastlog.so in showfailed mode has to scan the whole btmp file just to print "Last failed login:" and "There were %d failed login attempts since the last successful login." messages.

@nyh
Copy link

nyh commented Aug 22, 2021

pam_lastlog.so doesn't use btmp at all unless showfailed option is specified.

I never put this option myself (I wasn't even aware it existed until yesterday) - it came pre-configured with my Linux distributions (Centos 7 in this case - I didn't check what newer distributions do). So I think that if "showfailed" is something which Linux distributions tend to enable (without logrotate on btmp), it should have reasonable (and O(1)) performance.

Since /var/log/btmp doesn't have to be sorted, pam_lastlog.so in showfailed mode has to scan the whole btmp file just to print "Last failed login:" and "There were %d failed login attempts since the last successful login." messages.

Wow, thanks. Now I finally understand. I saw these "There were a gazillion failed login attempts since the last successful login" on successful logins, but didn't realize they came from slowly counting the entire btmp file. I assumed there was a counter somewhere.

I think the existence of this user-visible message from pam_lastlog.so makes a solution even easier - pam_lastlog.so showfailed can, when it sees the btmp file is larger than some reasonable threshold (something that would delay a login by, say, 0.1 seconds), print a message saying "/var/log/btmp is too large, pam_lastog.so 'showfailed' option ignored: not showing number of failed login attempts". A user which won't like this message can clear or remove /var/log/btmp, or change the pam_lastlog.so configuration.

@nyh
Copy link

nyh commented Aug 22, 2021

By the way, apparently this bug has been known for at least 5 years before this issue was opened, and keeps being rediscovered over and over - see for example the following discussions which I found with Google that pointed to this problem between 2015 and 2020:

https://serverfault.com/questions/691127/ssh-login-hangs-for-several-minutes (May 2015)
https://serverfault.com/questions/792486/ssh-connection-takes-forever-to-initiate-stuck-at-pledge-network (July 2016)
https://forum.directadmin.com/threads/login-in-ssh-is-slow.57666/ (March 2019)
https://superuser.com/questions/166359/why-is-my-ssh-login-slow (April 2020)

@ldv-alt
Copy link
Member

ldv-alt commented Aug 22, 2021

At the same time, internet is full of recipes how to rotate wtmp and btmp files using logrotate,
and my OS provides logrotate entries for both of them by default.

@nyh
Copy link

nyh commented Aug 23, 2021

At the same time, internet is full of recipes how to rotate wtmp and btmp files using logrotate,
and my OS provides logrotate entries for both of them by default.

This may be true, but how is a user supposed to know that the oversized btmp is causing the login delay, and look for those logrotate entries? If you google "slow ssh" or something like that, there are at least a dozen more likely explanations suggested, not the oversized btmp and pam being involved.

Clearly not all Linux distribution logrotate btmp by default (in my case, it was Centos 7).

Many people will find the oversized log file because they run out of disk space. But other people like myself, have huge amounts of disk space, and were never bothered by large log files so didn't notice them. Nobody expects that a large log file causes symptoms like slow login - this is a bug in the login program (or, in this case, PAM), not an expected symptom of large log files.

@jlacros
Copy link

jlacros commented Aug 3, 2022

I found this in about a week of searching, thanks to more people having the same problem over time. This is not a bug, IMHO, it's a design flaw. If the only way to show "There were %d failed login attempts since the last successful login." is scan the entire file piece by piece then PAM should put a warning in the logs when the delay gets too high. And that should be "time taken to parse btmp" and not based on the size of the file (since disk access and other factors can affect the overall time needed, which is really the key for the end-user: "how long does it take to log in to my system?").

FWIW my system (RHEL 8.6) does rotate btmp by default, but it does it on a one-month cycle with no size restriction. Any internet-facing system runs the potential risk of hackers trying to force-break passwords by repeated login attempts... which will generate this problem in days (if not hours).

I find it amazing that this is still an issue, but thank you to those that have done the research on this.

@t8m
Copy link
Member

t8m commented Aug 15, 2022

We could workaround it by giving up reading the btmp entries after some configurable timeout. Of course proper solution would use something like the faillock module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants