Intermittent 500's #5
I've been able to reliably reproduce the error. I believe this happens when the proxy server doesn't successfully complete a request because of an error, e.g. the upstream request died. I reproduced this by forcing my upstream code to exit prematurely before receiving a response. Subsequent requests began to intermittently fail from there. It's almost as if one of the available Chrome instances is in a stuck and unavailable state.
@adamgotterer thanks for the report 👍 appreciated! I think we need to be sure that the browser is indeed in a bad state and is no longer acting in a responsive way. For this we will need the error (+ stack trace) which is causing the 500, or the steps to recreate the issue with Chrome verbose logging enabled (mentioned a bit further down).
Client dropped the connection: Chroxy automatically cleans up the Chrome tab resource on disconnect. But that doesn't explain Chrome going into a "bad state" - nor the 500s on new connection requests.
Let's see what Chrome is doing: if it is in a bad state, and Chrome is emitting a log message that we can match on - easy win.
Let's hope 🤞 Chrome is emitting a message, and it will just be a case of adding a condition to match on it and restarting Chrome. If not -> a new Chrome OS process healthcheck? Ultimately, we have ideas and we will be able to address the problems when we have more information from the verbose Chrome logging and/or steps to recreate locally, @adamgotterer 🙏. The truth is, …
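As a rough illustration of the verbose-logging idea above: a hypothetical launcher (not Chroxy's actual launch code) that spawns Chrome with its log routed back to the BEAM, so "bad state" messages could be matched on. `--enable-logging=stderr` and `--v=1` are standard Chrome switches; the module name and binary path are assumptions.

```elixir
defmodule VerboseChrome do
  # Assumption: adjust the path to your Chrome install.
  @chrome "/usr/bin/google-chrome"

  def start(debug_port) do
    Port.open({:spawn_executable, @chrome}, [
      :binary,
      :exit_status,
      :stderr_to_stdout,
      args: [
        "--headless",
        "--remote-debugging-port=#{debug_port}",
        "--enable-logging=stderr",  # send Chrome's log to stderr
        "--v=1"                     # verbose logging level
      ]
    ])
  end

  # Log lines arrive as {port, {:data, line}} messages and can be pattern
  # matched for crash indicators before deciding to restart Chrome.
end
```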
I was trying to set up Travis testing and ran into this bug: https://travis-ci.org/j-manu/chroxy/jobs/394488723. I have created a branch with verbose logging enabled and was able to recreate the bug here: https://travis-ci.org/j-manu/chroxy/jobs/394497915. I also encountered a timeout error here: https://travis-ci.org/j-manu/chroxy/jobs/394495151. Instead of having a Chrome process healthcheck, we could have an API endpoint which accepts a ws_url and recycles the Chrome instance associated with it.
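A rough sketch of the endpoint suggested here, assuming Plug and Jason are available; the route path, parameter name, and helper functions are hypothetical and not part of Chroxy's current API.

```elixir
defmodule RecycleEndpoint do
  use Plug.Router

  plug Plug.Parsers, parsers: [:json], json_decoder: Jason
  plug :match
  plug :dispatch

  # Accepts {"ws_url": "..."} and restarts the Chrome instance behind it.
  post "/api/v1/recycle" do
    ws_url = conn.body_params["ws_url"]

    case lookup_chrome_by_ws_url(ws_url) do
      {:ok, chrome_pid} ->
        restart_chrome(chrome_pid)
        send_resp(conn, 202, "recycling")

      :error ->
        send_resp(conn, 404, "unknown ws_url")
    end
  end

  # Placeholders: the real lookup/restart would be Chroxy-internal.
  defp lookup_chrome_by_ws_url(_ws_url), do: :error
  defp restart_chrome(_pid), do: :ok
end
```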
Not conclusive - but if I start only one Chrome process, I've been unable to trigger this bug. Otherwise running …
@j-manu yes, that could work. I would say though that a background process which continuously tests the Chrome browsers' health would do the same without the end user needing to worry about such things. I am open to your suggestion of an endpoint which would act as a "hint" that a ChromeServer is unresponsive. It opens up the issue of needing to maintain the websocket URLs as an identifier (simple enough), but also that there may be many working connections running against that browser which would be impacted every time an overzealous client script got an error response.
I didn't run into this kind of issue unless I was running in Docker, and trying to address that cost me a few weeks of the time I had to work on this full-time 🤕 - but I am still very interested in getting Chrome to run as stably as possible in such an environment, as it is what we target.
There could be a mode of operation implemented whereby browsers are spawned per connection (and terminated after) - maybe using something like poolboy to manage the access / number of browsers which can be spawned (via pool size) - but this is an extremely resource-intensive mode when running 1000s of concurrent connections to browsers versus tabs, and it also exposes the full browser API (which might not be a bad thing at all tbh, as far as features go).
When running 100s of concurrent tests against the same browser, you are trusting that the Chrome browser process is able to manage that number of websocket connections and schedule all the commands across each of the pages reliably, and I have little doubt this is not the most stable code in the world within Chrome - if it was …
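For the browser-per-connection mode described above, a minimal sketch of how poolboy could cap the number of spawned browsers; `BrowserWorker` and the pool name are hypothetical, not existing Chroxy modules.

```elixir
defmodule BrowserPool do
  @pool_name :browser_pool

  def poolboy_spec do
    :poolboy.child_spec(
      @pool_name,
      name: {:local, @pool_name},
      worker_module: BrowserWorker,  # hypothetical: owns exactly one Chrome
      size: 5,                       # "pool size": max concurrent browsers
      max_overflow: 0                # hard cap, no burst browsers
    )
  end

  # Check a browser out for the lifetime of a single client connection.
  def with_browser(fun, timeout \\ 30_000) do
    :poolboy.transaction(@pool_name, fun, timeout)
  end
end

# Usage (in an application supervisor):
#   Supervisor.start_link([BrowserPool.poolboy_spec()], strategy: :one_for_one)
```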
Yeah, a background process works too. Simpler for the client. Thinking aloud, I'm wondering if, instead of a background process, Chroxy could do a sanity check of the websocket connection before returning it to the API client. This will of course add a delay and hence could be optional.
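One way the optional pre-flight check could be approximated (an assumption, not current Chroxy behaviour) is to probe the owning Chrome instance's DevTools HTTP endpoint (`/json/version`) before handing the ws:// URL back; a full websocket handshake would be the stronger check, but this keeps the sketch short.

```elixir
defmodule ChromeHealth do
  # host/port are the Chrome instance's debug host/port, not the proxy's.
  def responsive?(host, port, timeout \\ 1_000) do
    # In a real app :inets would go in extra_applications instead.
    :inets.start()
    url = String.to_charlist("http://#{host}:#{port}/json/version")

    case :httpc.request(:get, {url, []}, [timeout: timeout], []) do
      {:ok, {{_, 200, _}, _headers, _body}} -> true
      _ -> false
    end
  end
end

# Usage idea: only return the ws:// URL from GET /api/v1/connection when
# ChromeHealth.responsive?/2 is true, otherwise pick another Chrome instance
# (adds latency, so it would be opt-in as suggested above).
```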
Btw, I'm able to reproduce this problem pretty reliably outside of Docker (on a Mac) by running …
I also haven't been able to reproduce this issue with a single running Chrome instance.
@adamgotterer @j-manu noted - a single instance with many tabs may be the workaround for stability, with multiple instances behind a load balancer. Can I ask if you are running in Docker? / Have you seen the 500 errors outside of it?
I'm running in Docker. @j-manu said he can also reproduce the problem outside of Docker. I'm concerned that running a single Chrome instance will end up having the same scaling characteristics as running Chrome without Chroxy at all. In my testing interfacing with Chrome's remote debugging directly, connections start to drop after a few contexts are created. Not sure of the exact number, but my non-scientific guesstimate puts it somewhere around 5-10.
@holsee If the issue only happens with multiple Chrome instances, then it is probably a problem in the Chroxy proxy server and not Chrome? I tried setting a different "user-data-dir" for each of the Chrome instances but that didn't help. Other than that, why would running multiple Chrome instances cause a problem?
@adamgotterer a concern I share. I am very busy with another project delivery at the minute, but I think the unit of work to pick up next will come out of this thread.
From these units of work, I think we will address the issue, or at least minimise the impact of Chrome becoming unstable. Another work item I would like to add is support for Firefox - but that is another thread that has yet to be started.
@j-manu Ah HAH! I think you might have hit the nail on the head here, buddy! I have a theory... (only a theory at this point 😬) but hear me out. Hypothesis:
If you trace the code from the …
RACE
I think this is the problem.
So, to clarify the theory behind this scenario:
- Client 1 - GET /connection
- Client 2 - GET /connection
- Client 2 connects first -> Chroxy instructs the internal TCP Listener to accept the connection on "ws://PROXY_HOST:PROXY_PORT/devtools/page/$CHROME_2_PAGE_ID_BAR"
As you can see, there would be no such issue with a single Chrome. This means we need to correctly route the incoming connections, based on the PAGE_ID, to the correct Chrome instance.
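An illustrative sketch of that routing idea (not Chroxy's actual code): extract the page id from the websocket upgrade request so the proxy can be matched to the Chrome instance that owns that page, rather than whichever Chrome handled the most recent /connection call.

```elixir
defmodule PagePath do
  # Extracts the page id from the first line of an HTTP upgrade request,
  # e.g. "GET /devtools/page/ABC123 HTTP/1.1".
  def page_id(request) when is_binary(request) do
    case Regex.run(~r{/devtools/page/([^\s]+)}, request) do
      [_, page_id] -> {:ok, page_id}
      nil -> {:error, :no_page_id}
    end
  end
end
```

With the page id in hand, the proxy can look up the owning Chrome; one way to keep that bookkeeping is sketched after the fix description below.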
The connection is accepted here: … The FIX (needs more thought, but): …
And might I add that this would also work well for cleaning up pages in Chrome for which a connection is never made, which is another feature on the backlog 👍
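A sketch of the bookkeeping the fix implies, under the assumption that page ids are recorded when /api/v1/connection creates a page and looked up when the proxied websocket connection arrives; the module, function names, and TTL are hypothetical. It also covers cleaning up pages that are never connected to.

```elixir
defmodule PageReservations do
  use GenServer

  @unclaimed_ttl :timer.seconds(30)  # assumption: tune to taste

  def start_link(_opts),
    do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  # Called when GET /api/v1/connection creates a page on a Chrome instance.
  def reserve(page_id, chrome),
    do: GenServer.call(__MODULE__, {:reserve, page_id, chrome})

  # Called by the proxy listener once it has the page id from the upgrade request.
  def claim(page_id),
    do: GenServer.call(__MODULE__, {:claim, page_id})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:reserve, page_id, chrome}, _from, state) do
    # If nobody connects to this page in time, treat it as abandoned.
    Process.send_after(self(), {:expire, page_id}, @unclaimed_ttl)
    {:reply, :ok, Map.put(state, page_id, chrome)}
  end

  def handle_call({:claim, page_id}, _from, state) do
    case Map.pop(state, page_id) do
      {nil, _} -> {:reply, {:error, :unknown_page}, state}
      {chrome, new_state} -> {:reply, {:ok, chrome}, new_state}
    end
  end

  @impl true
  def handle_info({:expire, page_id}, state) do
    case Map.pop(state, page_id) do
      {nil, _} ->
        # Already claimed - nothing to do.
        {:noreply, state}

      {_chrome, new_state} ->
        # Never claimed: here the owning Chrome would be asked to close
        # the now-orphaned page/tab.
        {:noreply, new_state}
    end
  end
end
```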
Phew! Yeah, a race condition is what I thought too, but I had assumed you were already routing based on a URL-to-process mapping and thought the storing/retrieving of it had some race condition. Btw, for splitting into worker and server nodes, how will the worker node register with the server node? Also, if Chrome crashes, how will the server node know? I think the current architecture works really well. The only hard part is getting it set up (install Erlang, install Elixir, mix compile). Is there a way to package this as a self-contained binary?
@holsee Once you have connection tracking, is it possible to eliminate the API step? i.e. the client connects to … This makes it easier to load balance multiple Chroxy instances because it will be a single connection. Currently it takes two connections (one to the API and one ws) and those can go to different instances. The only way to solve that, from what I see, is to ensure that a client machine always gets routed to the same Chroxy instance, but that means a single client machine can only use one Chroxy instance.
@j-manu possibly, but the client needs to connect to what it thinks is a Chrome remote debug instance (but is, in fact, a proxy) - so this might break the protocol and client compatibility.
@holsee But that's how it works now also? Here is how I envisioned it.
I'm not familiar with Elixir, but I guess you can differentiate between new and old connections, and this change doesn't require holding a mapping of connections -> chromeProxy for multiple Chrome instances to work.
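A hypothetical sketch of differentiating "new" versus "old" connections from the upgrade request alone, per the flow envisioned above; the `/new` path and return values are assumptions, not an existing Chroxy mode.

```elixir
defmodule ConnectionIntent do
  def route(request) when is_binary(request) do
    page_re = ~r{/devtools/page/([^\s]+)}

    cond do
      # "New" connection: client connected straight to ws://chroxy/new
      # (hypothetical path) - allocate a page on a Chrome instance now.
      String.contains?(request, " /new ") ->
        :new_page

      # "Old" style: the request names a page created earlier via the API.
      Regex.match?(page_re, request) ->
        [_, page_id] = Regex.run(page_re, request)
        {:existing_page, page_id}

      true ->
        {:error, :unrecognised_path}
    end
  end
end
```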
What are some Chrome use cases for targeting and reconnecting to the same session via the Chrome-provided WS URL?
@adamgotterer Are you asking for use cases, or are you saying that there are use cases for targeting and reconnecting to the same session?
I’m asking for use cases. I’ve personally never needed to access a session a second time or in parallel, so I’m curious what people have used it for.
I am working on the connection routing fix now (while baby sleeps 👶). If I don't get this done today I will pick it up on Wednesday, as I will have time while I travel. Please open other issues for ideas and bugs not related to the 500 issue as defined here, thanks.
@j-manu @adamgotterer I have a fix in place in pull request #10. I need to tidy it up and perform some testing on my end before merging. I also needed to do #9 first in order to reliably recreate the issue (i.e. each connection request now takes the next Chrome browser in the pool instead of a random selection). The unit test added in #9 recreates the issue reliably, and I can confirm that with #10 that test passes.
You can try the fixes out on …
@adamgotterer @j-manu version 0.4.0 has been pushed to hex.pm, which includes the fix for the issue detailed here. Closing now. Thanks for your reports and help tracking the issues down, much appreciated!
Thank you! Now I can deploy this to production :)
@j-manu version tag 0.4.1 for sanity 👍 let me know how you get on.
I'm seeing random and intermittent failures when trying to interface with Chroxy. I'm testing in an isolated environment with plenty of resources and currently have 8 available port ranges for Chrome.
When running one-off requests I will get a handful of 500 errors and then a handful of successful responses.
The log of the failed response is: