Horizon Orphan Process #178
Comments
I'm having the same issue. I woke up to my main server using all the RAM and 4 GB of swap. Running purge killed 78 processes. I'm running Horizon on Ubuntu 16.04 with PHP 7.1.9-1.
Hello everyone. Ideally you shouldn't have any rogue processes running Horizon, but for exceptions like the ones you shared we built the purge command for you to run on a regular basis. If your server is accumulating many rogue processes, this is a safe approach until we're able to investigate the issue more, so I recommend that you schedule the command to run every day. On the other hand, can you please share some information about your setup? I'm trying to understand what might cause rogue processes.
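For anyone wanting to set that up, here is a minimal sketch of such a schedule entry, assuming the default app/Console/Kernel.php and that the Laravel scheduler is already wired up via cron; adjust the frequency to taste:

<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    /**
     * Define the application's command schedule.
     */
    protected function schedule(Schedule $schedule)
    {
        // Kill any rogue Horizon worker processes once a day; bump this to
        // ->hourly() if orphans accumulate faster on your server.
        $schedule->command('horizon:purge')->daily();
    }
}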
Forge Daemon command:
Our Envoy deploy script uses horizon:terminate to kill the processes; the high number of orphan processes may be related to this. We also added a horizon:purge to the deployment script. @themsaid if you find it difficult to reproduce this bug, we could arrange SSH access to my VPS.
Sorry to hijack the thread, but I'm seeing weird behaviour with Horizon running old code after I've deployed fresh code. Wondering if it's related to the orphans. I've got my stuff set up through Envoyer/Forge. I've set up the following hook in Envoyer, and I currently have it running after deployment. For now, as a workaround, I have been manually restarting the daemon in Forge whenever I deploy, but this is annoying and defeats the purpose of having Envoyer in the first place. I also notice odd behaviour if I SSH into the server and run the commands manually. Any ideas @themsaid?
It's hard to track down why a process would go orphaned. You can technically run horizon:purge after every deploy if you wanted to.
Hi @taylorotwell :) I do already, as per the screenshot above, but they still seem to go orphaned all the time. I don't really know what that means or if it's a real issue though. But it not running my latest code after a deploy is more of a problem.
Well, this basically happens when you run horizon:terminate. After one deploy:
The interesting thing is what it displayed. After another deploy:
My job takes an average of …
@jkudish I'm experiencing similar things on my server. I've checked this and something seems off:
Hello everyone 😄 First off: great product! However, we deployed Horizon yesterday to production and it didn't take long until we realized we had orphaned processes :-/

Regarding the "always wrongly 1 orphan"

@dbpolito (et al.):
I can reproduce this: I always get shown a PID for which I cannot trace what process it belongs to. I did do some debugging, and I also used https://github.com/a2o/snoopy to log which processes are spawned; this particular PID never showed up. E.g. in one case I was reported an orphaned PID that I could not find anywhere.
I don't think it's a coincidence that this always happens. The PID is received from the call to $this->exec->run('pgrep -f horizon') as part of Horizon's process inspection. But even when I run that pgrep manually, the reported PID is nowhere to be found. OTOH I noticed that, because -f matches against the full command line rather than the process name, you can get false positives depending on what other processes on the system mention "horizon" in their command line (see the sketch after this comment).

Our orphaned process problem

Yesterday we realized we had two orphaned processes. Their characteristics:
We're running Ubuntu 14.04 LTS with:
Here's how the rogue processes appeared (I removed the surroundings):
I know these were the correct processes because I verified them manually.

When does re-parenting happen?

E.g. https://unix.stackexchange.com/a/152400/7924
So why can/does this happen in the case of supervisor/horizon/etc.? Some guesses:
Lots of guessing, as you can see. For now we will also add the purge command, but seeing that there's something off, it's clear a code fix would be desirable. I can assist with more information/debugging once I get new orphaned processes. PS: we're also using https://github.com/wa0x6e/Cake-Resque to drive background jobs from a CakePHP project. It has its own set of problems, but we never experienced this kind of orphaning, or at least not at this rate.
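To make the pgrep observation above concrete: pgrep -f matches against the full command line, so any process whose arguments contain the word "horizon" (including the php artisan horizon:purge invocation doing the looking) will appear in the output. A rough, purely illustrative sketch of that effect and a possible filter; this is not Horizon's actual code:

<?php

// List every PID whose full command line contains "horizon". Because -f
// matches the whole command line, this can also catch unrelated processes
// and the `php artisan horizon:purge` process itself, which would explain
// a constant false positive of exactly one "orphan".
$output = trim((string) shell_exec('pgrep -f horizon'));
$pids = $output === '' ? [] : explode("\n", $output);

// Excluding our own PID removes the self-match; a stricter pattern such as
// `pgrep -f 'artisan horizon'` would also skip other programs that merely
// mention the word "horizon" somewhere in their arguments.
$pids = array_filter($pids, function ($pid) {
    return (int) $pid !== getmypid();
});

print_r(array_values($pids));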
We are seeing the same weird issue where Horizon keeps running old code, even after using …
@elghobaty Try also running …
@elghobaty I have my deploy restarting …
I've noticed the same thing in our staging environment: orphaned Horizon workers, stale code, etc. I find that using horizon:terminate while supervisor is running seems to cause it, whereas if I stop the Horizon supervisor, then pull, then start it again, it's okay.
Should running 'purge' still let the workers finish their active jobs? I'm also seeing a lot of orphaned processes; running purge kills them cleanly. I'm using Deployer (so different release directories, with the active one symlinked to 'current') and Forge to manage the Daemon.
Same issue here! And good question @barryvdh: the processes should finish their jobs, but are they? This is Taylor's article about deployments with Horizon on Forge: https://medium.com/@taylorotwell/deploying-horizon-to-laravel-forge-fc9e01b74d84
I don't get this command
I have the same problem. I'm using a Daemon, and every time I deploy I need to kill the processes manually.
Yeah, I would like to figure out a root cause, or at least an easy way to recreate the issue so we could track it down. I don't have a good understanding of why it is happening, since all horizon:work processes are started from a supervisor, and it's unclear to me how the parent supervisor could die without also killing all its child processes.
Hi all, we have tagged 1.2.0.
I will try right now.
@marianvlad thanks
Trying now 🎉
If this doesn't fix it I'll probably be putting a nice little bounty on this 😄
Just a notice, the …
I ran …
@marianvlad these could be from before; run the purge command first.
Can't we just send the queue:restart signal from horizon:terminate? That seems to work reliably.
@barryvdh that could be an option. To be honest, I've forgotten why it didn't work like that in the first place, but Horizon changed a bit while I was in the process of writing it. We can wait until we get a bit more feedback on the 1.2.0 termination process and then consider other options if we're not seeing improvement.
@marianvlad did you make sure you started with zero horizon processes? Any more feedback?
Hey all, just wanted to note we have made some further tweaks in the latest 1.2.2 release, including using the queue:restart cache logic approach in addition to a couple of other fixes. Please let us know how this release works for you. We rely on your feedback ❤️
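For anyone unfamiliar with that approach: queue:restart does not send a signal at all; it stores a timestamp in the cache, and each long-running worker compares it with the value it saw at boot. A simplified sketch of the idea, not Horizon's exact implementation:

<?php

use Illuminate\Support\Facades\Cache;

// Deploy side: `queue:restart` stores "now" under a well-known cache key.
Cache::forever('illuminate:queue:restart', time());

// Worker side: remember the value seen at boot, then re-check it after
// every job. A changed value means a deploy happened, so exit gracefully
// and let Supervisor / the Horizon master spawn a fresh process on new code.
$bootTimestamp = Cache::get('illuminate:queue:restart');

while (true) {
    // ... pop and run the next job ...

    if (Cache::get('illuminate:queue:restart') !== $bootTimestamp) {
        exit(0);
    }
}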
I did multiple tests on 1.2.0; on a clean project everything worked perfectly, except in my real project. My setup looks like this:

config/queue.php
config/horizon.php
A sample of my job:
DeleteUploadedService and UploadService are just classes that use Symfony\Process to call a Docker container.
@marianvlad please upgrade to 1.2.2 and then report your findings.
Our team just tried 1.2.2 and we're still having the same issue with the workers being restarted; however, we observe differences between this and running Horizon normally via a terminal window.
If killing rogue/child processes sometimes works and sometimes doesn't, a solution would be to introduce a (worker) self-kill option:
This would introduce an extra lookup in Redis for each loop, but that is a small cost to pay for not having rogue processes. Also to consider is when the main horizon process is stopped but rogue processes continue to exist. I suppose an extra check here would be a heartbeat issued by the supervisor: if this heartbeat is older than a few seconds, workers can safely die.
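A rough sketch of that heartbeat idea, with entirely hypothetical key names and thresholds; this is not part of Horizon's API:

<?php

use Illuminate\Support\Facades\Cache;

// Supervisor/master side: refresh a heartbeat on every monitor tick.
// "horizon:master:heartbeat" is a made-up key used only for this sketch.
Cache::put('horizon:master:heartbeat', time(), 60); // TTL in your Laravel version's units

// Worker side: called once per loop iteration. If the master has not
// checked in recently, assume it died (or we were re-parented) and exit
// instead of lingering as an orphaned horizon:work process.
function shouldSelfTerminate(): bool
{
    $heartbeat = Cache::get('horizon:master:heartbeat');

    return $heartbeat === null || (time() - (int) $heartbeat) > 10;
}

if (shouldSelfTerminate()) {
    exit(0);
}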
Isn't that exactly what queue:restart did, and what is now done in the latest commits? See horizon/src/Console/TerminateCommand.php, line 48 in 62c60ef.
I should have checked the last commits, but I believe there's a small difference. This cache check is implemented in the queue:restart command, and if that command is not executed, rogue processes will not die. My thinking was to make this a bit more invasive, executed within the main loop cycle: it would link a worker process to a known token, and workers would die automatically if their parent no longer exists or the token changed.
@TruckersMP-Kat we would need way more information to diagnose anything.
@dragosperca are you currently using 1.2.2?
I am using Horizon on a project, and thought about using it for another, larger, mission-critical one. I don't experience the current issue; I was just reading all the issues and trying to understand failure points. My comment was just an idea to avoid rogue processes. Before Horizon, we were using the standard worker & supervisor setup and did a similar thing to what you implemented in 1.2.2, that is, a marker that, when set, would make workers die. Since supervisor was there to start them again if any process died, we had a "self-kill and revive" process. We implemented it in the main (worker) loop though. This marker would be set when new code was deployed to the server. Anyway, just want to say thank you for the great work you've been doing with Laravel and the ecosystem around it. We all love it!
OK, thanks. We're still looking for any feedback from people running Horizon 1.2.2.
On my test server this seems to be working correctly. I'll hopefully deploy it to staging/production tomorrow and see how it turns out. Does it matter from which version the terminate command is run, in a zero-downtime environment à la Envoyer (so different release directories)? I assume it's safe to just run it from the next release dir (the one about to become current), just before symlinking?
Is anyone having problems using the failed() method inside a job to do things like update something in the database?
@barryvdh I don't think it makes a difference. @marianvlad different topic, different issue please :) This issue is already large and it'd be easier if we just keep it focused on one matter.
Anyone? 😄
Sorry, soon; I hope to upgrade on Monday! We're hit by it pretty much constantly, every other day, but this week we gave 5.6 priority first ;-) (PS: thanks for the new logging infrastructure 👍)
Yesterday I updated Horizon to 1.2.2 on five workers that are doing the same kind of queued jobs. A minute ago I ran …
Just deployed …
@taylorotwell After …
@tomschlick do you see anything? @marianvlad thanks!
@taylorotwell just checked our logs from last week and we saw two instances of orphans.
Weirdly enough, two of the processes had the same process ID, even though the deploys took place 15 minutes apart. Horizon appeared to terminate them and restart correctly, so I'm not sure how that's possible 🤷♂️
So in this case 8781 and 8862 are the actual orphans; however, it seems that even the purge command didn't kill them, which could mean they're really stuck on a long-running job that the next loop hasn't gotten to yet. What's your timeout value?
Timeout value is …
I'm not able to reproduce this issue anymore... so it looks fixed to me... ❤️
As I created the ticket and it seems it got fixed, I'm closing this one... We can start new tickets and mention this one if necessary.
We are encountering strange issues. Sometimes one queue is stuck and does not process any jobs. Even when supervisor is stopped, there are still horizon processes running. Here is our config; the problem is only with that one queue:
I'm running Horizon on a latest Forge machine as documented: the daemon runs php artisan horizon, and on deployments I run php artisan horizon:terminate, but from time to time I need to manually run php artisan horizon:purge.

This is the output I just got a few hours after the last release:

I can confirm these are orphans by running htop in tree mode (press F5): I see these processes as root processes, not inside the master php artisan horizon process.

Also, every time I run purge, it ALWAYS wrongly sees 1 process as an orphan:

I haven't found a pattern yet for why this is happening.