Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LDAP request failed timed out #788

Open
nunohorta opened this issue Oct 21, 2015 · 19 comments
Open

LDAP request failed timed out #788

nunohorta opened this issue Oct 21, 2015 · 19 comments

Comments

@nunohorta
Copy link

I have ejabberd 15.07 running with LDAP authentication. This works fine for 95% of the times. Randomly a user is unable to authenticate. Looking at ejabberd.log the following log is displayed moments after it communicates with the server.

2015-10-21 14:20:47.227 [error] <0.7412.0>@eldap_pool:do_request:75 LDAP request failed: timed out 2015-10-21 14:20:47.227 [info] <0.7412.0>@ejabberd_c2s:wait_for_feature_request:756 ({socket_state,p1_tls,{tlssock,#Port<0.28105>,#Port<0.28107>},<0.7411.0>}) Accepted authentication for johndoe by ejabberd_auth_ldap from 31.112.96.168 2015-10-21 14:20:47.227 [debug] <0.7412.0>@ejabberd_c2s:send_text:1901 Send XML on stream = <<"<success xmlns='urn:ietf:params:xml:ns:xmpp-sasl'/>">> 2015-10-21 14:20:47.227 [debug] <0.7412.0>@ejabberd_socket:send:187 Error in p1_tls:send: {error,einval}

If I try again it will happen the same, over and over, while other users can log in just fine and the chat keeps working correctly. Waiting for longer periods, let's say 10 ou 20 minutes, or restart the server, will allow the user to log in again.
Is there anything that I should change on my side to fix this issue?

@zinid
Copy link
Contributor

zinid commented Oct 21, 2015

2015-10-21 14:20:47.227 [error] <0.7412.0>@eldap_pool:do_request:75 LDAP request failed: timed out

Did you check LDAP logs? Did you investigate the disconnections via tcpick or something? There is not enougth information for us to give any advice.

@nunohorta
Copy link
Author

I'll investigate those logs as soon as possible!

@nunohorta
Copy link
Author

sorry but I can´t find any relevant LDAP logs. I'll do a better investigation next time this occurs. It can be closed for now, thanks

@eLvErDe
Copy link

eLvErDe commented Nov 11, 2015

Hello,

I can confirm this behavior, especially under heavy load. The LDAP server is running on a different machine and there's a Cisco ASA between them.
I highly suspect ejabberd to have a permanent connection to LDAP, Cisco ASA reset it after a while, ejabberd is unable to detect the broken link.

I found this blog post saying nearly the same:
http://start.nwt.fhstp.ac.at/blog/?p=307

@eLvErDe
Copy link

eLvErDe commented Nov 11, 2015

@nunohorta
Copy link
Author

Yeah I have the same scenario where LDAP is running on a different server. It's not happening since I created the issue, but it did happen 3 ou 4 times before

@eLvErDe
Copy link

eLvErDe commented Nov 11, 2015

According to ejabberd code, I think it now requests for TCP keepalive when establishing LDAP connection:
https://github.com/processone/ejabberd/blob/master/src/eldap.erl#L1064

However, we may have a network setup issue here: Default Linux keepalive is 7200s (2 hours) and there's a Cisco ASA firewall between the ejabberd server and the LDAP server which seems to reset the connection after 1 hour of inactivity.

I chanded the server's keepalive parameters:

root@server:~# cat /etc/sysctl.d/98-local-keepalived.conf 
### Cisco ASA closes connection after 1hours inactivity
###
### ejabberd triggers keepalive but never closes and restarts LDAP
### connections pool
###
### Workaround by configuring keepalive shorter than Cisco timeout
# Start keepalive after 30min since last real packet
net.ipv4.tcp_keepalive_time = 1800
# Send keepalive packet every 1 minute
net.ipv4.tcp_keepalive_intvl    = 60
# Consider the connection being dead after 5 tries (5x1min)
net.ipv4.tcp_keepalive_probes   = 5

And I'm quite confident it'll fix the issue.

However, I don't think ejabberd behavior is sane. First I think it shouldn't expect for permanent connection. It should have a maximum lifetime setting and should handle the LDAP connection pool by closing too old connections and opening new ones. Second, it should be able to detect dead connection without the help of keepalive mechanism. What's the point requesting again and again and established connection that returns nothing ? Just close it, reopen and give another try.

I'm not a dev, so I may be saying bullshit.

@nunohorta
Copy link
Author

it's happening to me again...I'll try your solution to see if it fixes.

@eLvErDe
Copy link

eLvErDe commented Nov 13, 2015

Problem looks fixed for us with my workaround

@eLvErDe
Copy link

eLvErDe commented Dec 1, 2015

uh?
Closed? Why?

There's obviously something wrong in ejabberd behavior. It cannot expect lifetime connection with no ability to recover from a failure imho.

Regards, Adam.

@mremond mremond reopened this Dec 1, 2015
@nunohorta
Copy link
Author

Yeah it's still happening here(running 15.10). I have installed tcpick but I can't seem to get it working properly. If you guys could help me with tcpick so I can catch the logs that would be great. Every time this happens I have to restart the server and I simply cannot change what @eLvErDe has changed on this server. Update: I did the same changes as @eLvErDe but it keeps timing out.

@k-poole
Copy link

k-poole commented Jul 19, 2016

Yep same issue here. No Cisco in between though, just one VM running CentOS 7.2 x64 with eJabberd 14.07 from http://files.lucidsolutions.co.nz/linux/centos/ejabberd-el7/ejabberd-14.07-6.el7.centos.x86_64.rpm talking to another VM running MS SBS 2011 AD DS.

@btlogy
Copy link

btlogy commented Nov 27, 2017

It looks like I see this problem too, whole the way to 17.08
And as @eLvErDe pointed out: I do believe a client (LDAP or else) should never assume endless tcp session life time (especially when using a pool - see src/eldap_pool.erl).
From what I've collected, it is recommended to implement a periodical no-op or send an abandon messageId=0 to implement a keep-alive.
Unfortunately, my experience with Erlang does not allow me to propose a PR to implement this: :-(

@zavinator
Copy link

Any news on this? We've this problem several times a day :(
Centos 7, ejabberd 18.01

@Bundesferkel
Copy link

I request honestly a solution to this too...

Nearly all users experience an "invalid user / password", because the connection to both (two domaincontrollers) were interrupted. After a few retries (saved login-data) they're able to login (new connection from ejabberd to ldap).

Server: Debian
LDAP: Windows AD (2012 r2)
Clients: Gajim (Windows), Conversations (Android)

@stefanotoro
Copy link

Same problem for us. LDAP timeout even when low trafic.
Ubuntu 16.04.5 LTS
ejabberd 16.01-2
Active Directory 2012 R2
pfSense 2.3.4-p1
OS X clients with Apple Messages

@photoninger
Copy link

Today one of the three LDAP servers was not reachable, because it was upgraded. I assumed ejabberd would failover to the other two, but no, I get [error] <0.581.0>@eldap:connect_bind:1078 LDAP connection to ldap.server:636 failed: connection refused and (some) users cannot login.
What is the point in having a list of LDAP servers if ejabberd does no failover if one of them is unavailable?
ejabberd 18.06-1
debian stretch

@Neustradamus
Copy link
Contributor

Any news about it?

@Neustradamus
Copy link
Contributor

@rkujawa, @eLvErDe, @photoninger, @zavinator, @btlogy, @k-poole, @stefanotoro, @Bundesferkel: What do you think with current version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests