Long running php processes: LDAP timeout #4831

Closed
linuxrrze opened this Issue May 12, 2017 · 12 comments

linuxrrze commented May 12, 2017

Bug initially reported against nextant (nextant issue #175), but it may be caused by a more general user_ldap problem.

For long-running jobs, some LDAP servers (a Samba domain controller in my case) seem to kill the LDAP connection after some time. This connection is not re-established by nextcloud/user_ldap when it is needed again.

Possible explanation: during normal operation, most PHP processes of Nextcloud run with the Apache configuration and a relatively short runtime, typically 30s, which is most likely not enough to run into any LDAP timeouts. However, Nextcloud's cron.php (when started by the system cron) runs with the PHP CLI configuration and unlimited runtime; in this case a re-connect to LDAP may be required.

As apps are (most likely?) unaware of the actual user backend, the problem cannot be solved by the app developer, but needs to be handled in the user backend itself.
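To illustrate what such backend-side handling could look like, here is a minimal, language-agnostic sketch (in Python; DirectoryBackend, ConnectionLost and the connect() factory are entirely hypothetical names, not Nextcloud APIs): catch the connection-loss error, open a fresh connection, and retry the operation once.

```python
# Hypothetical sketch of reconnect handling inside a user backend.
# ConnectionLost, DirectoryBackend and connect() are invented names
# for illustration; they are not Nextcloud APIs.

class ConnectionLost(Exception):
    """Raised when the directory server drops an idle connection."""

class DirectoryBackend:
    def __init__(self, connect):
        self._connect = connect   # factory that opens a fresh connection
        self._conn = connect()

    def run(self, operation):
        """Run an operation, transparently reconnecting once on failure."""
        try:
            return operation(self._conn)
        except ConnectionLost:
            self._conn = self._connect()  # re-establish the dropped link
            return operation(self._conn)  # retry exactly once
```

An app calling run() never sees the dropped connection; a second consecutive failure still propagates, so a genuinely unreachable server is reported instead of being retried forever.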

Steps to reproduce

  1. Use a Samba domain controller as the LDAP server (Samba 4.6.3 in my case)
  2. Run a long-lasting PHP job (e.g. "nextant:index" with lots of files to index) with LDAP user authentication enabled

Expected behaviour

Job should finish normally.

Actual behaviour

The job breaks after indexing all files (which takes days to finish!).
On the next run, the indexing starts all over again ...

# sudo -u www-data php occ nextant:index
nextant v1.0.8

* Extracting files:

/Administrator/files            12/    12 [============================] 100%
/dani/files                      7/     7 [============================] 100%
/flo/files                       7/     7 [============================] 100%
/frank/files                     7/     7 [============================] 100%
/ijp/files                       7/     7 [============================] 100%
/marcel/files               145150/145150 [============================] 100%


                                    
  [OC\ServerNotAvailableException]
  Lost connection to LDAP server.

nextant:index [--output [OUTPUT]] [--debug] [--debugall [DEBUGALL]] [-k|--unlock] [-f|--force] [-u|--user [USER]] [-c|--background] [-b|--bookmarks] [-i|--files] [-x|--files_extract] [-r|--files_update]

Server configuration

Operating system:
Ubuntu 16.04

Web server:
Apache 2.4.18

Database:

Postgres 9.5

PHP version:

7.0

Nextcloud version: (see Nextcloud admin page)

11.0.3

Updated from an older Nextcloud/ownCloud or fresh install:

Updated

MorrisJobke commented May 12, 2017

@daita

daita commented May 12, 2017

I agree with the possible explanation; I already had some doubts when people reported issues with the nextant background process not ending while using LDAP.


linuxrrze commented May 13, 2017

I did some code research, and the cause of this seems to be the LDAP error handling in apps/user_ldap/lib/LDAP.php:

private function processLDAPError($resource)
...
} else if ($errorCode === -1) {
        throw new ServerNotAvailableException('Lost connection to LDAP server.');
...

An exception is thrown, but nobody is willing to handle it.
Trying to reconnect to the LDAP server when this exception occurs could do the trick.

The LDAP code at this point is a little hard to read and understand for someone not working with it on a daily basis. I had to match the numeric error codes like "-1" against the OpenLDAP header files (https://github.com/openldap/openldap/blob/master/include/ldap.h):

#define LDAP_SERVER_DOWN	(-1)

As a side note: it would be nice to have those constants mentioned in the source for better readability.
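As a sketch of that suggestion: mapping the raw libldap result codes to names makes the comparison self-documenting. The numeric values below are taken from OpenLDAP's ldap.h; the lookup helper itself is my illustration, not Nextcloud code.

```python
# Naming libldap result codes instead of comparing raw integers.
# The numeric values come from OpenLDAP's include/ldap.h; the
# describe() helper is only an illustration, not Nextcloud source.

LDAP_SUCCESS = 0x00
LDAP_SIZELIMIT_EXCEEDED = 0x04
LDAP_SERVER_DOWN = -1  # connection to the server was lost

_NAMES = {
    LDAP_SUCCESS: "LDAP_SUCCESS",
    LDAP_SIZELIMIT_EXCEEDED: "LDAP_SIZELIMIT_EXCEEDED",
    LDAP_SERVER_DOWN: "LDAP_SERVER_DOWN",
}

def describe(error_code):
    """Turn a numeric LDAP result code into a readable name."""
    return _NAMES.get(error_code, "unknown (%d)" % error_code)
```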

Member

blizzz commented May 15, 2017

in this case a re-connect to LDAP may be required.

This could potentially result in other bugs when paged results are being used: if those are restarted, it is possible to run into a timeout again, and the fun starts over.

Perhaps it makes more sense to set/request a higher timeout from the LDAP server when a background task is running (keyword: LDAP_OPT_NETWORK_TIMEOUT), if we can figure this out. It still might not help in the long run.

Alternatively, can the background job be split into smaller chunks?
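The chunking idea can be sketched generically (all names here are hypothetical, not Nextcloud's job API): by opening a fresh directory connection per batch, no single connection stays idle long enough to hit a server-side timeout.

```python
# Sketch of splitting a long job into batches, with a fresh directory
# connection per batch. open_connection and handle are hypothetical
# callables standing in for the real job machinery.

def process_in_chunks(items, handle, open_connection, chunk_size=500):
    processed = 0
    for start in range(0, len(items), chunk_size):
        conn = open_connection()            # fresh connection per batch
        try:
            for item in items[start:start + chunk_size]:
                handle(conn, item)
                processed += 1
        finally:
            conn.close()                    # never reuse an aged connection
    return processed
```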

linuxrrze commented May 16, 2017

The handling of long-running paged results may require special treatment, that's right.

I'm not sure whether splitting a background job into smaller chunks will help: as far as I observed, the LDAP connection during "nextant:index" is opened (and used?) at the very beginning, then a long period with no LDAP activity follows, and I guess that's what makes the LDAP server close the connection.

So splitting the background job into smaller chunks would only improve the situation if every chunk were processed by a new process (with a new LDAP connection that won't hit the timeout).

I also thought about changing some LDAP connection-related settings; however, most of the options I found seem to be OpenLDAP-specific (and therefore won't help with other LDAP implementations like my Samba DC LDAP):

/* private and experimental options */
/* OpenLDAP specific options */
...
#define LDAP_OPT_NETWORK_TIMEOUT	0x5005	/* socket level timeout */

...

/*
 * OpenLDAP per connection tcp-keepalive settings
 * (Linux only, ignored where unsupported)
 */
#define LDAP_OPT_X_KEEPALIVE_IDLE		0x6300
#define LDAP_OPT_X_KEEPALIVE_PROBES		0x6301
#define LDAP_OPT_X_KEEPALIVE_INTERVAL	0x6302

In the end, I still think re-connecting is the best solution, even if it may require some extra handling (e.g. for paged results).

How do other user backends handle situations like this?
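For what it's worth, the LDAP_OPT_X_KEEPALIVE_* options quoted above map to ordinary TCP keepalives on the underlying socket, so the same effect can be sketched at the socket level with the standard library. The idle, interval and probe-count knobs are Linux-specific, hence the hasattr guards; the default values are arbitrary picks for illustration.

```python
# Sketch: enabling TCP keepalives on a socket, the portable cousin of
# OpenLDAP's LDAP_OPT_X_KEEPALIVE_* options quoted above. The idle,
# interval and probe-count knobs are Linux-only, hence the guards.

import socket

def enable_keepalive(sock, idle=60, interval=30, probes=4):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Keepalives prevent middleboxes and idle-timeout firewalls from silently dropping the TCP connection, though a server that deliberately closes idle LDAP sessions will not be fooled by them.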

Member

daita commented May 16, 2017

I could totally catch the ServerNotAvailableException, but since I do not work directly on the LDAP connection (I think it is managed by the UserManager), how can I quickly reconnect to LDAP from my script?

Member

blizzz commented May 16, 2017

@daita you don't. Try setting $resource to null before https://github.com/nextcloud/server/blob/master/apps/user_ldap/lib/LDAP.php#L333. Not sure whether it works, and as stated above this might create more issues without further adjustments. For you it's probably okay, since you don't interact much with LDAP, as I interpret it.

linuxrrze commented May 16, 2017

I've been running a similar test overnight.

Instead of setting $resource to null, I replaced the exception with a simple debug warning in apps/user_ldap/lib/LDAP.php:

private function processLDAPError($resource)
...
} else if ($errorCode === -1) {
        # throw new ServerNotAvailableException('Lost connection to LDAP server.');
        \OCP\Util::writeLog('user_ldap',
                'LDAP error Lost connection to LDAP server. '.$errorMsg.' ('.
                $errorCode.') after calling '.
                $this->curFunc,
                \OCP\Util::DEBUG);

"occ nextant:index" now finishes without an (obvious) error.
However, this seems to be true only for the "Extracting files" section, which runs for all local and LDAP user accounts.
The "Updating files" and "Indexing bookmarks" sections are only executed for local user accounts (= Administrator). nextcloud.log reports "Connection lost" errors in these two sections, so I guess they started after the LDAP connection was lost.

# sudo -u www-data php occ nextant:index
nextant v1.0.8

* Extracting files:

/Administrator/files             7/     7 [============================] 100%
/dani/files                      1/     1 [============================] 100%
/flo/files                       7/     7 [============================] 100%
/frank/files                     7/     7 [============================] 100%
/ijp/files                       7/     7 [============================] 100%
/marcel/files                35132/ 35132 [============================] 100%
/martin/files                    7/     7 [============================] 100%
/moni/files                      7/     7 [============================] 100%
/sebbi/files                     7/     7 [============================] 100%

  33942 file(s) processed ; 0 orphan(s) removed
  33942 documents indexed ; 33942 fully extracted
  82 file(s) were not processed (failure)

* Updating files:

/Administrator/files             7/     7 [============================] 100%

  0 document(s) updated ; 0 failure(s)

* Indexing bookmarks:

/Administrator                   0/     0 [>---------------------------]   0%

  0 bookmark(s) processed ; 0 orphan(s) removed
  0 document indexed ; 0 fully extracted

Time spent: 00:10:26
Your index now contains 18 segments
Member

blizzz commented May 17, 2017

Instead of setting $resource to null I replaced the exception with a simple debug warning:

This would fail really, really hard in several other places. Been there, not pleasant. ;)

linuxrrze commented May 17, 2017

I'm totally aware of that.

That's why I hope for a clean solution ;-)

Member

blizzz commented May 24, 2017

Potential fix in #5104

@blizzz blizzz removed the 0. Needs triage label May 24, 2017

@blizzz blizzz added this to the Nextcloud 12.0.1 milestone May 24, 2017

Member

MorrisJobke commented Jul 24, 2017

backported to stable12 in #5210
