
pconnect: Operation now in progress #1881

Open
kanidjar opened this issue Nov 15, 2020 · 66 comments

@kanidjar

Hi,

We are using PHPRedis with Laravel. We push jobs into Redis and our workers pop jobs from it.

Our workers are long-running processes that are not supposed to be restarted.

99% of our jobs are popped successfully. However, we sometimes hit the following error when establishing the connection with pconnect:

[2020-11-13 12:52:41] production.ERROR: Operation now in progress {"exception":"[object] (RedisException(code: 0): Operation now in progress at /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php:135)
[stacktrace]
#0 /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php(135): Redis->pconnect('azure...', '6379', 0, NULL, 0, 0, Array)

I'm seeing this behaviour on

  • OS: Ubuntu-18.04.1
  • Redis: 4.0.14 on Azure (tcp)
  • PHP: 7.4.12
  • phpredis: 5.3.2

Steps to reproduce, backtrace or example script

Very hard to reproduce since it happens only a few times per day (we process around 50K jobs per day).

Any idea what could be going wrong, or how we might fix it?

@michael-grunder
Member

I've never seen that error before.

Generally EINPROGRESS happens in the context of a non-blocking socket, but PhpRedis doesn't do non-blocking sockets.
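For context, the textbook pattern that produces EINPROGRESS looks roughly like this (a minimal C sketch of a non-blocking connect, not phpredis code):

#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>

/* Minimal sketch, not phpredis code: with O_NONBLOCK set, connect() returns
 * immediately, and EINPROGRESS just means "wait for writability, then check
 * SO_ERROR" rather than a real failure. */
int connect_nonblocking(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 && errno != EINPROGRESS) {
        return -1; /* genuine failure */
    }
    return fd; /* EINPROGRESS: poll() for POLLOUT, then getsockopt(SO_ERROR) */
}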

@kanidjar
Author

Is there anything I can provide to help you investigate? We are still seeing this issue every day.

@kanidjar
Author

kanidjar commented Nov 27, 2020

Here is some information about the execution context:

  • This error seems to occur only with our workers
  • Our apps run on a Kubernetes cluster: a pod contains one container which runs one worker (PHP, a long-running process) and one web app (nginx + php-fpm).
  • We use the same Laravel configuration for our workers and our web apps (both use Redis persistent connections)
  • Thread-safety is disabled
  • We do not provide any persistent_id

@michael-grunder
Member

An exact call stack and the output for the Redis extension via either phpinfo() or php --ri redis would be useful.

If the error is coming from PhpRedis (and not a user-land PHP wrapper), it must be coming from php_stream_xport_create via the eptr argument, which we then pass back to the caller.

redis_sock->stream = php_stream_xport_create(host, host_len,

That said, I'm not sure how this is possible given that EINPROGRESS is a non-blocking error.
For reference: every occurrence in php-src involves non-blocking sockets.

@kanidjar
Author

Here it is:

/usr/local/etc/php/conf.d/docker-php-ext-redis.ini,
redis
redis.arrays.algorithm => no value => no value
redis.arrays.auth => no value => no value
redis.arrays.autorehash => 0 => 0
redis.arrays.connecttimeout => 0 => 0
redis.arrays.consistent => 0 => 0
redis.arrays.distributor => no value => no value
redis.arrays.functions => no value => no value
redis.arrays.hosts => no value => no value
redis.arrays.index => 0 => 0
redis.arrays.lazyconnect => 0 => 0
redis.arrays.names => no value => no value
redis.arrays.pconnect => 0 => 0
redis.arrays.previous => no value => no value
redis.arrays.readtimeout => 0 => 0
redis.arrays.retryinterval => 0 => 0
redis.clusters.auth => no value => no value
redis.clusters.cache_slots => 0 => 0
redis.clusters.persistent => 0 => 0
redis.clusters.read_timeout => 0 => 0
redis.clusters.seeds => no value => no value
redis.clusters.timeout => 0 => 0
redis.pconnect.connection_limit => 0 => 0
redis.pconnect.echo_check_liveness => 1 => 1
redis.pconnect.pool_pattern => no value => no value
redis.pconnect.pooling_enabled => 1 => 1
redis.session.lock_expire => 0 => 0
redis.session.lock_retries => 10 => 10
redis.session.lock_wait_time => 2000 => 2000
redis.session.locking_enabled => 0 => 0
$client->pconnect('xxx.redis.cache.windows.net', '6379', 0.0, null, 0, 0.0, [
    'auth' =>  [
      0 => "***"
    ]
]);

Exception:

[2020-11-27 23:02:05] production.ERROR: Operation now in progress {"exception":"[object] (RedisException(code: 0): Operation now in progress at /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php:135)
[stacktrace]
#0 /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php(135): Redis->pconnect('xxx...', '6379', 0, NULL, 0, 0, Array)
#1 /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php(84): Illuminate\\Redis\\Connectors\\PhpRedisConnector->establishConnection(Object(Redis), Array)
#2 /var/www/html/vendor/illuminate/support/helpers.php(263): Illuminate\\Redis\\Connectors\\PhpRedisConnector->Illuminate\\Redis\\Connectors\\{closure}(Object(Redis))
#3 /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php(105): tap(Object(Redis), Object(Closure))
#4 /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php(28): Illuminate\\Redis\\Connectors\\PhpRedisConnector->createClient(Array)
#5 /var/www/html/vendor/illuminate/redis/Connectors/PhpRedisConnector.php(32): Illuminate\\Redis\\Connectors\\PhpRedisConnector->Illuminate\\Redis\\Connectors\\{closure}()
#6 /var/www/html/vendor/illuminate/redis/RedisManager.php(110): Illuminate\\Redis\\Connectors\\PhpRedisConnector->connect(Array, Array)
#7 /var/www/html/vendor/illuminate/redis/RedisManager.php(90): Illuminate\\Redis\\RedisManager->resolve('default')
#8 /var/www/html/vendor/illuminate/cache/RedisStore.php(242): Illuminate\\Redis\\RedisManager->connection('default')
#9 /var/www/html/vendor/illuminate/cache/RedisStore.php(54): Illuminate\\Cache\\RedisStore->connection()
#10 /var/www/html/vendor/illuminate/cache/Repository.php(97): Illuminate\\Cache\\RedisStore->get('illuminate:queu...')
#11 /var/www/html/vendor/illuminate/queue/Worker.php(628): Illuminate\\Cache\\Repository->get('illuminate:queu...')
#12 /var/www/html/vendor/illuminate/queue/Worker.php(123): Illuminate\\Queue\\Worker->getTimestampOfLastQueueRestart()
#13 /var/www/html/vendor/illuminate/queue/Console/WorkCommand.php(116): Illuminate\\Queue\\Worker->daemon('redis', 'offers', Object(Illuminate\\Queue\\WorkerOptions))
#14 /var/www/html/vendor/illuminate/queue/Console/WorkCommand.php(100): Illuminate\\Queue\\Console\\WorkCommand->runWorker('redis', 'offers')
#15 /var/www/html/vendor/illuminate/container/BoundMethod.php(36): Illuminate\\Queue\\Console\\WorkCommand->handle()
#16 /var/www/html/vendor/illuminate/container/Util.php(40): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()
#17 /var/www/html/vendor/illuminate/container/BoundMethod.php(93): Illuminate\\Container\\Util::unwrapIfClosure(Object(Closure))
#18 /var/www/html/vendor/illuminate/container/BoundMethod.php(37): Illuminate\\Container\\BoundMethod::callBoundMethod(Object(Laravel\\Lumen\\Application), Array, Object(Closure))
#19 /var/www/html/vendor/illuminate/container/Container.php(610): Illuminate\\Container\\BoundMethod::call(Object(Laravel\\Lumen\\Application), Array, Array, NULL)
#20 /var/www/html/vendor/illuminate/console/Command.php(136): Illuminate\\Container\\Container->call(Array)
#21 /var/www/html/vendor/symfony/console/Command/Command.php(258): Illuminate\\Console\\Command->execute(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))
#22 /var/www/html/vendor/illuminate/console/Command.php(121): Symfony\\Component\\Console\\Command\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))
#23 /var/www/html/vendor/symfony/console/Application.php(920): Illuminate\\Console\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#24 /var/www/html/vendor/symfony/console/Application.php(266): Symfony\\Component\\Console\\Application->doRunCommand(Object(Illuminate\\Queue\\Console\\WorkCommand), Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#25 /var/www/html/vendor/symfony/console/Application.php(142): Symfony\\Component\\Console\\Application->doRun(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#26 /var/www/html/vendor/illuminate/console/Application.php(93): Symfony\\Component\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#27 /var/www/html/vendor/laravel/lumen-framework/src/Console/Kernel.php(116): Illuminate\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#28 /var/www/html/artisan(35): Laravel\\Lumen\\Console\\Kernel->handle(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#29 {main}
"} 

@kanidjar
Author

"CLIENT LIST" command (sample):

>> client list
id=5100145 addr=xxx:6916 fd=73 name= age=203888 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=eval numops=474345
id=5626765 addr=xxx:4101 fd=14 name= age=2022 idle=3 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=ping numops=3618
id=5174832 addr=xxx:8324 fd=46 name= age=175243 idle=2 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=eval numops=408559
id=5627315 addr=xxx:4103 fd=139 name= age=1802 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=ping numops=3252
id=5098507 addr=xxx:4363 fd=143 name= age=204490 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=eval numops=475136
id=5627662 addr=xxx:5186 fd=174 name= age=1678 idle=6 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=ping numops=3024
id=5625670 addr=xxx:2057 fd=89 name= age=2440 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=ping numops=4404
id=5097995 addr=xxx:2123 fd=166 name= age=204684 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=eval numops=476130
id=5622316 addr=xxx:2070 fd=161 name= age=3729 idle=9 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 ow=0 owmem=0 events=r cmd=ping numops=4476

@kanidjar
Author

I have done some investigation: there is no issue with PHP-FPM; the connections persist until the child process is killed.

The issue is with the PHP CLI, which our workers use when launched from the command line.

@yatsukhnenko
Member

@kanidjar try to use non-zero timeout values while connecting
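For example (a sketch only; host and credentials are placeholders, and the arguments follow the same order as the call above):

$redis = new Redis();
$redis->pconnect(
    'xxx.redis.cache.windows.net', // placeholder host, as in the call above
    6379,
    2.5,   // connect timeout in seconds (was 0.0, i.e. no timeout)
    null,  // persistent_id
    100,   // retry interval in milliseconds
    2.5,   // read timeout in seconds (was 0.0)
    ['auth' => ['***']]
);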

@Velizz

Velizz commented Dec 22, 2020

Did you manage to fix this issue?

@kanidjar
Author

Sorry, I forgot to give you feedback.

We are still having the issue with the PHP CLI. We tried setting a non-zero timeout value, with no effect.

We are still running under PHP 7.4.

@ericcomellas

Hi @kanidjar, nice to meet you. We have the same issue on our services: AWS infrastructure + php-fpm 7.1. We have the problem with Redis, but also with MySQL. We opened a ticket with AWS support, but no solution. Did you fix the problem on your side?

@Velizz

Velizz commented Feb 22, 2021

Hi @kanidjar, nice to meet you. We have the same issue on our services: AWS infrastructure + php-fpm 7.1. We have the problem with Redis, but also with MySQL. We opened a ticket with AWS support, but no solution. Did you fix the problem on your side?

I've fixed the issue by not using Redis for queues. Using SQS for queues eliminated this particular error, while Redis still handles sessions.

@cyrrill

cyrrill commented Mar 17, 2021

+1 seeing this issue, happens only a few times a day out of hundreds of thousands of jobs.

@Braunson

Braunson commented Jul 28, 2021

I am seeing this while running on Heroku, and it is starting to happen more frequently each day.

UPDATE (9/22/2021): Still happening, much more frequently now. Anyone have any solutions or ideas on this?

@dwightwatson

This has popped up out of nowhere for me and appears to be happening more frequently. Also on Heroku.

@dzuelke

dzuelke commented Oct 1, 2021

An exact call stack and the output for the Redis extension via either phpinfo() or php --ri redis would be useful.

If the error is coming from PhpRedis (and not a user-land PHP wrapper), it must be coming from php_stream_xport_create via the eptr argument, which we then pass back to the caller.

redis_sock->stream = php_stream_xport_create(host, host_len,

That said, I'm not sure how this is possible given that EINPROGRESS is a non-blocking error. For reference: every occurrence in php-src involves non-blocking sockets.

I've been investigating this a bit (no way of reproducing, so just walking through the code) and in the process came across someone claiming this may also happen with very low timeout values.

The Laravel Phpredis connector explicitly sets timeouts to 0.0, so I suspected that this code might occasionally be tripping over rare floating-point edge cases in redis_sock_connect(), for both the connect and read timeouts:

phpredis/library.c

Lines 2336 to 2339 in 7c7f2fa

tv.tv_sec = (time_t)redis_sock->timeout;
tv.tv_usec = (int)((redis_sock->timeout - tv.tv_sec) * 1000000);
if (tv.tv_sec != 0 || tv.tv_usec != 0) {
    tv_ptr = &tv;

However, I'd imagine that multiplications of 0.0 would generally be special cased... but who knows, maybe zend_parse_parameters() and friends occasionally give you a 0.000001 when handed 0.0?

Any other ideas, @michael-grunder? I haven't been able to dig into the Linux sources enough yet to determine if there are blocking socket cases where EINPROGRESS may still happen - my guess is that a lot of the code internally shares code paths for blocking and non-blocking sockets, so maybe it's a bug?
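As a self-contained check of the conversion quoted above (illustration only, not phpredis code), tv_ptr only becomes non-NULL when the double is genuinely non-zero:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

/* Illustration of the float-to-timeval conversion quoted above (not phpredis
 * code): an exact 0.0 leaves both fields at 0 and no timeout is passed, while
 * a tiny non-zero value such as 0.000001 produces a 1-microsecond timeout. */
int main(void)
{
    double timeouts[] = {0.0, 0.000001, 2.5};
    for (int i = 0; i < 3; i++) {
        struct timeval tv, *tv_ptr = NULL;
        tv.tv_sec  = (time_t)timeouts[i];
        tv.tv_usec = (int)((timeouts[i] - tv.tv_sec) * 1000000);
        if (tv.tv_sec != 0 || tv.tv_usec != 0) {
            tv_ptr = &tv;
        }
        printf("timeout=%.6f -> tv_sec=%ld tv_usec=%d (%s)\n",
               timeouts[i], (long)tv.tv_sec, (int)tv.tv_usec,
               tv_ptr ? "passed to connect" : "NULL, fully blocking");
    }
    return 0;
}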

@Tarasovych

Tarasovych commented Oct 14, 2021

Same issue with AWS ECS. We have ECS tasks connecting to redis on EC2.

socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 6

socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6

connect(6, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.0.2")}, 16) = 0

sendmmsg(6, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="m\30\1\0\0\1\0\0\0\0\0\0\20EC2_PUBLIC_DNS_HERE\teu"..., iov_len=66}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=66}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\301\"\1\0\0\1\0\0\0\0\0\0\20EC2_PUBLIC_DNS_HERE\teu"..., iov_len=66}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=66}], 2, MSG_NOSIGNAL) = 2

recvfrom(6, "m\30\201\200\0\1\0\1\0\0\0\0\20EC2_PUBLIC_DNS_HERE\teu"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.0.2")}, [28->16]) = 82

recvfrom(6, "\301\"\201\200\0\1\0\0\0\1\0\0\20EC2_PUBLIC_DNS_HERE\teu"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.0.2")}, [28->16]) = 150

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6

connect(6, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.1.0.205")}, 16) = -1 EINPROGRESS (Operation now in progress)

The client timeout was 5 seconds (the default). I have tried 10, with no success. We use the php:8.0.11-fpm-buster image.

@dzuelke

dzuelke commented Oct 14, 2021

Ohhhh that's interesting @Tarasovych, is that an strace of a failing call?

Do you have a reproduce case?

The problem there is that it opens a blocking IPv6 socket, but a non-blocking IPv4 socket.

I wonder if that's a bug somewhere in PHP or the OS, given how ext-redis definitely does not set the socket to non-blocking...

Maybe it's intermittent because sometimes, hostnames resolve as IPv6 first, and sometimes, as IPv4 first, and it opens a socket for both?

Can you share the hostname please?

@Tarasovych

Tarasovych commented Oct 14, 2021

is that an strace of a failing call

Yes.

Do you have a reproduce case?

No, unfortunately. It occurs only from time to time.

Maybe it's intermittent because sometimes, hostnames resolve as IPv6 first

The subnet used for AWS ECS doesn't have an IPv6 CIDR, and ECS tasks don't receive an IPv6 address on startup. However, the VPC does have an IPv6 CIDR, so the VPC subnet route tables have an IPv6 route.

Can you share the hostname please?

I can't share the real hostname; it's the public IPv4 DNS name of an EC2 instance, something like ec2-11-22-33-44.eu-west-1.compute.amazonaws.com.

@dzuelke

dzuelke commented Oct 14, 2021

Ah so you're running Redis yourself in EC2, right?

I meant the EC2 hostname. I have seen this happen in the wild, with the error being thrown for an IPv6 address connection attempt, so I think we're getting closer.

Can you dig -4 and dig -6 the EC2 hostname? Ideally from within an ECS instance.

@Tarasovych

Ah so you're running Redis yourself in EC2, right?

You're right.

Can you dig -4 and dig -6 the EC2 hostname? Ideally from within an ECS instance.

Yes, I'll take a look

@dzuelke

dzuelke commented Oct 14, 2021

Oh, never mind. It looks like the first two connects are just DNS queries (to 10.1.0.2). We're back to square one then.

It doesn't close the DNS socket, and re-connects to 10.1.0.205 using the same FD (return value 6), maybe that's related? I'm not familiar enough with low-level OS networking internals.

@yatsukhnenko
Member

@dzuelke yes, that is the reason for the EINPROGRESS error

@dzuelke

dzuelke commented Oct 14, 2021

Can you strace a successful connect for comparison and paste the same section please, @Tarasovych?

@dzuelke

dzuelke commented Oct 14, 2021

@dzuelke yes, that is the reason for the EINPROGRESS error

So, a bug in PHP then, @yatsukhnenko?

I've been digging through the PHP sources a bit to see where it does the name resolution during connects, but haven't found anything... or is it happening inside ext-redis?

The streams subsystem is royally confusing; I think setting a particular option on a socket actually causes a connect, if I'm understanding that right, so it's a bit of a mess to untangle.

@dzuelke

dzuelke commented Oct 14, 2021

(but why would all these socket() calls return the same file descriptor in the first place? even just initially for the AF_INET6 and AF_INET domains?)

@Tarasovych

(but why would all these socket() calls return the same file descriptor in the first place? even just initially for the AF_INET6 and AF_INET domains?)

Those traces I posted above are from different ECS containers, by the way. Each run is executed in a "clean" environment.

@Tarasovych

Tarasovych commented Oct 15, 2021

I've created a gist with a readable strace diff between successful and failed connections; take a look at the 2nd revision:
https://gist.github.com/Tarasovych/6b365bb78ccdf41779de548563e689c5/revisions

@arezk84

arezk84 commented Dec 8, 2021

Does anyone have any suggestions for this issue?

@ocind

ocind commented Dec 22, 2021

@kanidjar did you find a solution to this problem in the end?

@bjarn

bjarn commented Aug 18, 2023

The error still occurs, but way less often, possibly confirming that this has something to do with it, as "Operation now in progress" sounds like it's waiting on Redis to release something.

The "Operation now in progress" error comes from the connect() syscall, not from the Redis server.

The really weird thing is that socket() seems to return the same FD multiple times, if you look at some of the strace output further above. And then the connect to Redis occurs on a non-blocking socket, presumably one that is somehow still open from a prior DNS query.

Hm, in my case I use IP addresses to connect, instead of hostnames, so there shouldn't be a DNS lookup afaik.

This error has only happened on my servers at Vultr, not on Hetzner (though the Redis server is hosted at Hetzner too), so I might give DigitalOcean a shot for the next worker to see if that one throws this error too. Just to try it out.

@bjarn

bjarn commented Sep 5, 2023

Small update: even with DigitalOcean it happens. However, it appears to be latency-related.

I have a server in Sydney whilst the Redis server is in Germany. Every other location (London, Amsterdam and NYC) is fine. Latency to Sydney is the highest, so I believe this is what might cause the issue for me personally. Not sure about others.

@georaldc

georaldc commented Sep 5, 2023

Small update: even with DigitalOcean it happens. However, it appears to be latency-related.

I have a server in Sydney whilst the Redis server is in Germany. Every other location (London, Amsterdam and NYC) is fine. Latency to Sydney is the highest, so I believe this is what might cause the issue for me personally. Not sure about others.

I've seen "Operation now in progress" errors suddenly being returned after we moved from a locally installed instance of Redis Server to a cloud managed solution like Google memorystore. No changes in code other than connection configuration. The framework/client we use does not use phpredis though. It makes a connection via stream_socket_client(), which now (very intermittently. Probably once every other day) can return false with that error description.

@mgsmus

mgsmus commented Sep 20, 2023

We are using Laravel Horizon and AWS ElastiCache (cache.t4g.medium). The situation over the last 6 months has been as shown in the image. Despite trying numerous different configurations, I still haven't found a fix for it. We process an average of around 200,000 jobs per day, and we also use it for caching. I don't know what happened in the last 24 hours, but errors suddenly increased. Redis usage hasn't even reached 30%...

[image: redis]

@fwilliamconceicao

We are using Laravel Horizon, and AWS ElastiCache cache.t4g.medium. The situation in the last 6 months has been as shown in the image. Despite trying numerous different configurations, I still haven't found a fix for it. We process an average of around 200,000 jobs per day, and we also use it for caching. I don't know what happened in the last 24 hours, but suddenly errors increased. Redis usage hasn't even reached 30%...

[image: redis]

In my case, when we migrated from AWS to Azure the errors only appeared sometimes, and it was related to our infrastructure architecture plus a phpredis issue. Now it happens just a few times per week. Either way, we are now 100% sure that this is an issue with phpredis.

@yevhenlisovenko

Same issue here. Also Laravel Horizon + Redis :(

@georaldc

So in our case, we realized that the issue occurs whenever letsencrypt/certbot runs renew on an affected server. We have one such server with a cron job that runs this once a day at midnight and this is also the exact time I start seeing "operation now in progress" errors. It might be related to certbot restarting/reloading apache or something else, but it seems to affect the server's ability to make outside connections (eg. connect to a mysql server, or connect to redis) for a second or two.

@yevhenlisovenko

So in our case, we realized that the issue occurs whenever letsencrypt/certbot runs renew on an affected server. We have one such server with a cron job that runs this once a day at midnight and this is also the exact time I start seeing "operation now in progress" errors. It might be related to certbot restarting/reloading apache or something else, but it seems to affect the server's ability to make outside connections (eg. connect to a mysql server, or connect to redis) for a second or two.

We don't use Let's Encrypt on affected servers. Our theory is that this may be a memory issue. When the available memory is low on the server or pod, the issue occurs, but we are not entirely certain. If you have any ideas, please share your experiences.

P.S. We want to move to Predis in production to see if there will be any issues.

@dpacmittal

Small update: even with DigitalOcean it happens. However, it appears to be latency-related.

I have a server in Sydney whilst the Redis server is in Germany. Every other location (London, Amsterdam and NYC) is fine. Latency to Sydney is the highest, so I believe this is what might cause the issue for me personally. Not sure about others.

It appears to be a latency issue in my case. I'm connecting my worker to a Redis server hosted on another continent over a Tailscale network. My ping time to the server is 170 ms, and I'm consistently getting this error.

@sjdrew

sjdrew commented Jul 18, 2024

I am seeing this just on localhost (macOS) with a local Redis server and PHP 8.3. I have added a retry when I get that message, which gets around the issue for now.
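The workaround looks something like this (a sketch only; host, port and timeout are placeholders):

function connectWithRetry(int $attempts = 3): Redis
{
    // Workaround sketch: retry when pconnect throws the transient
    // "Operation now in progress" error; host/port/timeout are placeholders.
    $lastError = null;
    for ($i = 0; $i < $attempts; $i++) {
        try {
            $redis = new Redis();
            $redis->pconnect('127.0.0.1', 6379, 2.5);
            return $redis;
        } catch (RedisException $e) {
            if (strpos($e->getMessage(), 'Operation now in progress') === false) {
                throw $e; // only retry this specific error
            }
            $lastError = $e;
            usleep(100000); // back off 100 ms before the next attempt
        }
    }
    throw $lastError;
}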

@trevorgehman

We are also seeing this very intermittently in our Laravel Horizon workers. All we can see in logs is a small spike in commands per second:

[image: CleanShot 2024-08-28 at 15 14 04@2x]

@jeffdafoe

The bug is in PhpiredisSocketConnection.php. This line is wrong:

$selected = socket_select($selectable, $selectable, $null, $timeoutSecs, $timeoutUSecs);

You can't pass the same array twice; it needs to be a copy.
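In other words, something along these lines (a sketch of the suggested fix; the variable names follow the line quoted above):

// socket_select() takes its arrays by reference and rewrites them, so the
// read and write sets must be separate copies rather than one shared variable.
$read   = $selectable;
$write  = $selectable;
$except = null;
$selected = socket_select($read, $write, $except, $timeoutSecs, $timeoutUSecs);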

@dzuelke

dzuelke commented Oct 17, 2024

@jeffdafoe that can't be it because at least the OP is using the Redis extension, not Predis.

@jeffdafoe

PHPRedis is more complicated. If it's OK for the PHP streams calls to emit EINPROGRESS, then PHPRedis needs to handle that, as "EINPROGRESS is not an error". Predis wouldn't have that issue, as they do handle EINPROGRESS properly, but the issue is caused by a different bug.

It's not uncommon for people implementing sockets to make mistakes that only show up under certain conditions, usually when the sockets can't be handled as fast as their state changes. There's a lot of room for footguns in the underlying Unix implementation.

@dzuelke

dzuelke commented Oct 17, 2024

@jeffdafoe I think you're missing a crucial detail here: EINPROGRESS is supposed to only occur on asynchronous sockets, but the error is occurring on a synchronous socket.

@jeffdafoe

EINPROGRESS occurs on synchronous or async sockets, it's a byproduct of the socket connect process. Regardless, both PHPRedis and Predis use async sockets.

@dzuelke

dzuelke commented Oct 21, 2024

There is no documentation anywhere that indicates EINPROGRESS can happen on a blocking-socket connect(); do you have an example, @jeffdafoe?

And no, PHPRedis doesn't use non-blocking sockets (also see the very first reply on this ticket from @michael-grunder).

@jeffdafoe

jeffdafoe commented Oct 21, 2024

Non-blocking connect, then the socket is put in blocking mode for normal tx/rx. In any case, my earlier Predis evaluation was wrong, as that code is deprecated; it now uses sockets in a way much more similar to PHPRedis. I've encountered the issue using Predis, which is why I ended up here. In the end it may turn out to be a PHP bug; there's a ticket open, but it doesn't have much traction.

Your statement about EINPROGRESS only happening on non-blocking connects is correct. And that's the issue: you have to conclude that the socket generating that error was opened non-blocking and then switched to blocking. It's just a (painful) matter of figuring out where.

@michael-grunder
Member

Non-blocking connect, then the socket is put in blocking mode for normal tx/rx

This is right. PHP sets O_NONBLOCK on connect and then restores it to blocking since we're not requesting an asynchronous socket.

Sets non-block here
Restores blocking here

I see an early return (when err != EINPROGRESS), so maybe we're hitting this early return but PhpRedis is not correctly discarding the socket.

I'll take a look at the pconnect logic and see if there is a simple sanity check I can make
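The sequence being described is roughly this (a simplified C sketch, not the actual php-src code):

#include <fcntl.h>
#include <poll.h>
#include <errno.h>
#include <sys/socket.h>

/* Simplified sketch of the sequence described above (not the php-src code):
 * temporarily set O_NONBLOCK, connect, wait for writability, check SO_ERROR,
 * then restore the original blocking flags. */
static int blocking_connect_with_timeout(int fd, const struct sockaddr *addr,
                                         socklen_t len, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);      /* "sets non-block" */

    int rc = connect(fd, addr, len);
    if (rc == -1 && errno == EINPROGRESS) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) > 0) {
            int err = 0;
            socklen_t elen = sizeof(err);
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen);
            rc = err ? -1 : 0;                   /* SO_ERROR should supersede EINPROGRESS */
        } else {
            rc = -1;                             /* timeout or poll failure */
        }
    }

    fcntl(fd, F_SETFL, flags);                   /* "restores blocking" */
    return rc;
}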

@georaldc

georaldc commented Oct 21, 2024

Non-blocking connect, then the socket is put in blocking mode for normal tx/rx

This is right. PHP sets O_NONBLOCK on connect and then restores it to blocking since we're not requesting an asynchronous socket.

Sets non-block here Restores blocking here

I see an early return (when err != EINPROGRESS), so maybe we're hitting this early return but PhpRedis is not correctly discarding the socket.

I'll take a look at the pconnect logic and see if there is a simple sanity check I can make

I'm not familiar with PHP innards, but is this also the reason why I can get this error under the different circumstances I mentioned in this previous comment of mine?

I have been seeing the "Operation now in progress" error messages for years now but never really got around to determining what causes them due to how extremely intermittent they appear for me. I've seen it happen with code that:

  • does a blocking stream_socket_client() call
  • does an ftp_fput() call, which returns a "php_connect_nonb() failed: Operation now in progress (115)" error message
  • uses PDO for connecting to a mysql server (SQLSTATE[HY000] [2002] Operation now in progress)

No strace, though, but it makes me wonder if these are all timeout-related failures when PHP tries to establish network connections.

@michael-grunder
Member

I'm not familiar with PHP innards, but is this also the reason why I can get this error under the different circumstances I mentioned in this previous comment of mine?

Possibly, but I'd need to see it happen to be sure.

It could happen if it is an edge case where the socket is not restored to blocking after the connect loop.

@jeffdafoe

jeffdafoe commented Oct 21, 2024

I keep looking at php_network_connect_socket and trying to figure out what's preventing this:

  • connect() gets EINPROGRESS on a non-async socket
  • error remains EINPROGRESS, never gets cleared
  • php_pollfd_for gets called and the read is ready before the timeout, aka no error there
  • note that error still contains EINPROGRESS
  • We've now fallen down into the OK block
  • error STILL has EINPROGRESS, so the connect process returns an incorrect EINPROGRESS error. The previous poll didn't time out, so the socket is actually now connected, but the caller received an EINPROGRESS that they were not at all expecting, as they requested blocking sockets.

We might be able to mitigate this by calling php_pollfd_for again; if it's just a wrapper around select, it will return the same result as long as the socket that is ready to read hasn't been drained.
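Sketched out, the suspected flow looks like this (simplified C, deliberately omitting the SO_ERROR check so the described failure mode is visible; this is not the real php-src code):

#include <poll.h>
#include <errno.h>
#include <sys/socket.h>

/* Simplified sketch of the suspected flow (NOT the real php-src code, which
 * also checks SO_ERROR): the saved errno is never cleared after a successful
 * poll, so a connect that actually succeeded is still reported as EINPROGRESS. */
static int suspected_connect_flow(int fd, const struct sockaddr *addr,
                                  socklen_t len, int timeout_ms, int *error_out)
{
    int error = 0;

    if (connect(fd, addr, len) == -1) {
        error = errno;                         /* EINPROGRESS on the non-blocking connect */
        if (error != EINPROGRESS) {
            *error_out = error;
            return -1;                         /* genuine failure: early return */
        }
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) <= 0) {
            *error_out = ETIMEDOUT;
            return -1;                         /* timed out waiting for the connect */
        }
        /* poll succeeded: the socket is connected, but 'error' was never reset
         * and still holds EINPROGRESS here. */
    }

    *error_out = error;                        /* caller sees EINPROGRESS despite success */
    return 0;
}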

@michael-grunder
Member

Shouldn't the getsockopt clear the error when n > 0?

https://github.com/php/php-src/blob/65f885738defe8c723e3d1131c0eb007cb71866d/main/network.c#L365

@jeffdafoe

Yes, it sure looks like it would. Now I really have no idea how the EINPROGRESS is getting out of that function. I wonder whether or not the socket is connected when that happens.

bukka added a commit to php/php-src that referenced this issue Nov 29, 2024
When connecting to a socket, it is possible to get EINTR. In such a case,
there should be another attempt to connect if we are not over the
timeout. The timeout should be adjusted accordingly in that case.

This fixes phpredis/phpredis#1881

Closes GH-16606
charmitro pushed a commit to wasix-org/php that referenced this issue Mar 13, 2025
When connecting to a socket, it is possible to get EINTR. In such a case,
there should be another attempt to connect if we are not over the
timeout. The timeout should be adjusted accordingly in that case.

This fixes phpredis/phpredis#1881

Closes phpGH-16606
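The described fix amounts to a retry loop along these lines (an illustrative C sketch, not the actual patch):

#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Illustrative sketch of the described fix (not the actual php-src patch):
 * retry connect() when it is interrupted by a signal (EINTR), shrinking the
 * remaining timeout by however long the interrupted attempt took. */
static int connect_retry_on_eintr(int fd, const struct sockaddr *addr,
                                  socklen_t len, struct timeval *timeout)
{
    for (;;) {
        struct timeval start, end, elapsed;
        gettimeofday(&start, NULL);

        if (connect(fd, addr, len) == 0 || errno == EINPROGRESS) {
            return 0;                   /* connected, or in progress as usual */
        }
        if (errno != EINTR) {
            return -1;                  /* a real error: give up */
        }

        gettimeofday(&end, NULL);
        timersub(&end, &start, &elapsed);
        if (timeout != NULL) {
            timersub(timeout, &elapsed, timeout);
            if (timeout->tv_sec < 0 ||
                (timeout->tv_sec == 0 && timeout->tv_usec <= 0)) {
                errno = ETIMEDOUT;
                return -1;              /* over the timeout: stop retrying */
            }
        }
    }
}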