Connection problems with 6.0.2 (Broken pipe, Redis server went away...) #2437

potsky opened this issue Jan 21, 2024 · 2 comments

potsky commented Jan 21, 2024

Expected behaviour

We would like to use version 6.0.2 the same way we used 5.3.7, without connection problems.

Actual behaviour

After about 10 hours of operation, we start getting these kinds of errors when hundreds of writes are performed in a foreach loop on a single backend server:

  • Redis server went away
  • Redis::exec(): Send of 2504 bytes failed with errno=32 Broken pipe
  • Redis::hMset(): Send of 1431 bytes failed with errno=32 Broken pipe
  • ...
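
For context, a reconnect-and-retry wrapper along these lines would be one possible stopgap (a sketch only; the helper name and host are hypothetical, not our production code):

```php
<?php
// Sketch of a possible stopgap: retry a failed write once after
// reconnecting. The helper name and host/port are hypothetical.
function writeWithRetry(Redis $redis, callable $write)
{
    try {
        return $write($redis);
    } catch (RedisException $e) {
        // "Broken pipe" / "went away": drop the socket and reconnect once.
        $redis->close();
        $redis->connect('redis.internal', 6379);
        return $write($redis);
    }
}

// Usage inside the foreach loop of writes:
// writeWithRetry($redis, fn (Redis $r) => $r->hMSet("job:$id", $fields));
```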

We run about 30,000 jobs per day with Laravel Horizon, so we accumulated hundreds of thousands of exceptions with 6.0.2 in just a few days.

Found workaround

  1. Restarting our web servers temporarily resolves the problem.
  2. We rolled back to phpredis 5.3.7 on Scalingo PaaS and everything is fine.

Current investigations

  • There are no errors in the Redis server logs
  • We have checked the number of connections from our backend to the Redis cluster and we stay under 100, including cluster connections, which is far below the default limit of 10,000 clients (see the monitoring sketch after this list)
  • The Redis servers' metrics are fine: no memory leak, no CPU spikes, ...
  • This is not a network problem in the datacenter, given that everything works fine with 5.3.7
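
A minimal sketch of how the connection count could be sampled from the application side (plain phpredis; the host is a placeholder, not our real configuration):

```php
<?php
// Sketch: log the Redis-side connection count from the application so it
// can be correlated with the "Broken pipe" errors over time.
$redis = new Redis();
$redis->connect('redis.internal', 6379); // placeholder host

$clients = $redis->info('clients');
printf(
    "connected_clients=%d blocked_clients=%d\n",
    $clients['connected_clients'] ?? -1,
    $clients['blocked_clients'] ?? -1
);
```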

I'm seeing this behaviour on

  • Infrastructure: Scalingo PaaS
  • OS: Linux app-sirenergies-one-off-8185 5.4.0-121-generic #137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Redis: 7.2.3 (Image 7.2.3-1)
  • PHP: 8.2.14
  • phpredis: 6.0.2
  • Cluster configuration:
    (screenshot of the cluster configuration attached)

Steps to reproduce, backtrace or example script

I have no idea how to reproduce this behaviour because it is not systematic. For example, when we send SMS messages to our customers at 8 am and have rebooted the servers one hour earlier at 7 am, we have no problems at all. On the other hand, when the servers were last rebooted at 8 pm the previous day, they crash at 8 am the next morning, even though the code and the volume are exactly the same.
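
That pattern (fine shortly after a reboot, failing after a long idle night) makes me wonder about idle connections dying silently. A sketch of one thing to try, assuming persistent connections (host and timeout are placeholders):

```php
<?php
// Sketch: enable TCP keepalive so long-idle persistent connections get
// probed by the kernel instead of dying silently overnight.
// Host and timeout are placeholders, not the real configuration.
$redis = new Redis();
$redis->pconnect('redis.internal', 6379, 2.5);
$redis->setOption(Redis::OPT_TCP_KEEPALIVE, 1);
```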

phpinfo on 5.3.7

(screenshot of the 5.3.7 phpinfo output attached)

phpinfo on 6.0.2

(screenshot of the 6.0.2 phpinfo output attached)

Differences between the two:
  • The Redis sentinel version has changed from 0.1 to 1.0
  • redis.session.early_refresh is new in 6.0.2
  • The default value of redis.session.lock_retries has changed from 10 to 100
  • The default value of redis.session.lock_wait_time has changed from 2000 to 20000 (see the php.ini sketch after this list)
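
If the changed session defaults turn out to matter, one experiment (a sketch; the values simply restore the 5.3.7 defaults listed above) would be to pin them in php.ini:

```ini
; Sketch: pin the phpredis session settings back to their 5.3.7 defaults
; to rule them out (values taken from the phpinfo comparison above).
redis.session.lock_retries = 10
redis.session.lock_wait_time = 2000
; early_refresh is new in 6.x; 0 is assumed to be "off".
redis.session.early_refresh = 0
```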

I've checked

  • There is no similar issue from other users
  • The issue isn't fixed in the develop branch
michael-grunder (Member) commented

The broken pipe (errno 32) error often happens when sending huge payloads, such as massive MULTI/EXEC or pipeline blocks, which hit kernel limits.
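
If large payloads do turn out to be involved, chunking the pipelined writes is a common workaround (a sketch; the host, data shape, and batch size of 500 are arbitrary placeholders):

```php
<?php
// Sketch: flush pipelined writes in small batches so that no single
// send() exceeds kernel socket-buffer limits.
$redis = new Redis();
$redis->connect('redis.internal', 6379); // placeholder host

$items = []; // placeholder: map of hash key => field array
foreach (array_chunk($items, 500, true) as $chunk) {
    $pipe = $redis->multi(Redis::PIPELINE);
    foreach ($chunk as $key => $fields) {
        $pipe->hMSet($key, $fields);
    }
    $pipe->exec();
}
```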

However, your situation appears slightly different. The trick is going to be reproducing the behavior.

Laravel Horizon is a long-running job queue, right? Maybe we can simulate similar activity with runners that execute the same Redis commands at the same general volume. Perhaps PhpRedis 6.0.2 has a bug where we keep trying to use a socket even after it has failed.
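
Something like this soak-test runner might do it, assuming the commands and volume roughly match one of your jobs (host, key names, payload size, and rate are all placeholders):

```php
<?php
// Sketch of a soak-test runner: issue a steady stream of HMSET traffic
// for hours, the way a Horizon worker would, and log any RedisException
// so we can see when the socket first goes bad.
$redis = new Redis();
$redis->connect('redis.internal', 6379); // placeholder host

for ($i = 0; ; $i++) {
    try {
        $redis->hMSet("soak:$i", ['n' => $i, 'payload' => str_repeat('x', 1024)]);
        if ($i % 1000 === 0) {
            echo date('c'), " iteration $i OK\n";
        }
    } catch (RedisException $e) {
        echo date('c'), " iteration $i FAILED: {$e->getMessage()}\n";
        break;
    }
    usleep(10000); // ~100 writes/second
}
```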

Another option would be to run one of your jobs under something like rr. If you can replicate the problem that way, it would almost certainly pinpoint exactly what's going wrong. The debugging would need to happen on your side, though, because rr would record everything, including all of the payloads to and from Redis.

potsky commented Jan 22, 2024

Hi @michael-grunder,

Yep, Laravel Horizon is a long-running job queue tool, like Sidekiq for example. A dead socket that keeps being reused is an interesting idea, but how do we test it? Complicated, I think.
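
One crude idea (a sketch, and I'm not sure it exercises the same code path as a real network failure) would be to kill our own connection server-side and watch how phpredis 6.0.2 reacts:

```php
<?php
// Sketch: kill our own connection server-side (closer to a real network
// drop than calling close() ourselves) and watch whether subsequent
// commands reconnect or keep failing. Host is a placeholder.
$victim = new Redis();
$victim->connect('redis.internal', 6379);
$admin = new Redis();
$admin->connect('redis.internal', 6379);

$id = $victim->rawCommand('client', 'id');        // our connection's ID
$admin->rawCommand('client', 'kill', 'id', (string) $id);

for ($i = 1; $i <= 3; $i++) {
    try {
        var_dump($victim->ping());                // went away? reconnected?
    } catch (RedisException $e) {
        echo "attempt $i: {$e->getMessage()}\n";
    }
}
```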

Using rr is a very good idea. We need to check this with the Scalingo team:

  • what kind of VM they use, because it seems to me that rr does not work on all VM hosts
  • how to retrieve debug logs, because we run on ephemeral servers and the local disk is not accessible afterwards

To be continued...
