Gate One WebSocket causes reload which removes gateone.pid resulting in timeouts #497

Closed
rene00 opened this issue Feb 22, 2015 · 2 comments


rene00 commented Feb 22, 2015

I've seen instances where Gate One (commit 62af45) reloads and removes its pid file, which causes timeouts for subsequent connections.

I've got a box online now (Ubuntu 14.04) on which Gate One is timing out. I haven't restarted the process yet in case we want to debug this live. I've also seen this on CentOS 6.

The Ubuntu box uses upstart to manage Gate One (see /etc/init/gateone.conf). The hack in post-start that creates the pid file is there in case Gate One didn't create the pid file itself.

# status gateone
gateone stop/waiting
# ps axuww|grep gateone
root      3666  0.0  0.9 231536 16040 ?        Sl   Feb20   0:00 /opt/virtualenvs/gateone/bin/python /opt/virtualenvs/gateone/bin/gateone --pid_file=/var/run/gateone.pid
root     10105  0.0  0.0  10464   928 pts/3    S+   22:11   0:00 grep --color=auto gateone
# ls -la /var/run/gateone.pid
ls: cannot access /var/run/gateone.pid: No such file or directory

From /var/log/gateone/gateone.log, this is Gate One starting.

[W 150219 18:10:46 app_terminal:2714] dtach command not found.  dtach support has been disabled.
[I 150219 18:10:46 server:4047] Imported applications: Terminal
[I 150219 18:10:46 server:4189] Version: 1.2.0 (20140609214034)
[I 150219 18:10:46 server:4190] Tornado version 3.2.2
[I 150219 18:10:46 server:4210] Connections to this server will be allowed from the following origins: '.*'
[I 150219 18:10:46 server:3728] No authentication method configured. All users will be ANONYMOUS
[I 150219 18:10:46 server:3855] Loaded global plugins: control_alt_w.js, help.js
[I 150219 18:10:46 server:4328] Listening on http://*:8080/
[I 150219 18:10:46 server:4348] Process running with pid 19048

Everything is fine for a while, until this:

[E 150220 17:20:39 server:1873] Error/Unknown WebSocket action, terminal:get_terminals: None (/opt/virtualenvs/gateone/local/lib/python2.7/site-packages/gateone/applications/terminal/app_terminal.py line 732)
[I 150220 17:20:42 server:866] All user sessions have terminated.
[I 150220 17:20:42 server:876] The last idle session has timed out. Reloading...
[W 150220 17:20:42 app_terminal:2714] dtach command not found.  dtach support has been disabled.
[I 150220 17:20:42 server:4047] Imported applications: Terminal
[I 150220 17:20:42 server:4189] Version: 1.2.0 (20140609214034)
[I 150220 17:20:42 server:4190] Tornado version 3.2.2
[I 150220 17:20:42 server:4210] Connections to this server will be allowed from the following origins: '.*'
[I 150220 17:20:42 server:3728] No authentication method configured. All users will be ANONYMOUS
[I 150220 17:20:42 server:3855] Loaded global plugins: control_alt_w.js, help.js
[I 150220 17:20:42 server:4328] Listening on http://*:8080/
[E 150220 17:20:42 server:4357] Could not listen on 0.0.0.0:8080 (address:port is already in use by another application).
[E 150220 17:20:42 server:4371] Exception was: (98, 'Address already in use')
[I 150220 17:20:42 server:4378] Clearing cache_dir: /tmp/gateone_cache
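
For anyone reading the log above: the "(98, 'Address already in use')" exception is what a second bind to an address:port gives you while an earlier socket is still listening on it, which appears to be what the reload runs into. Here is a minimal sketch that reproduces the same class of error outside Gate One. It is Python 3 (Gate One here runs on 2.7), and `try_second_bind` is a hypothetical helper name, not anything from Gate One or Tornado.

```python
import errno
import socket


def try_second_bind(host="127.0.0.1"):
    """Bind a listening socket, then bind a second socket to the same port.

    Returns (errno_number, errno_name) for the failure of the second bind,
    or None if the second bind unexpectedly succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address:port while `first` is still listening on it.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno, errno.errorcode.get(exc.errno)
        return None
    finally:
        second.close()
        first.close()
```

Calling `try_second_bind()` should report `EADDRINUSE`; the numeric value is platform-specific (it is 98 on Linux, which matches the log).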

The user started using Gate One at 17:02 successfully.

I'm not sure what was done on the client end that triggered the WebSocket error.

I use nginx to reverse proxy connections into the Gate One box which only has an internal IP address. I run several Gate One boxes concurrently and would like to keep the reverse proxy architecture in place so that I don't have to provide each Gate One box an external IP address.

The relevant parts of the nginx config.

    location / {
        proxy_pass http://$node.$id.REDACTED:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

The nginx error logs.

2015/02/22 21:18:42 [error] 1206#0: *2207 upstream timed out (110: Connection timed out) while reading response header from upstream, client: XXX.XXX.XXX.XXX, server: ~^(node-.*)-(r-.*)\.REDACTED$, request: "GET /favicon.ico HTTP/1.1", upstream: "http://10.80.25.72:8080/favicon.ico", host: "REDACTED"

nginx terminates SSL and proxies to Gate One over plain HTTP.

The Gate One process appears to be stuck in a bad state after this. The following log entries recur in gateone.log up until 02/21 at 06:06:31.

# tail /var/log/gateone/gateone.log 
[I 150221 06:06:31 server:4210] Connections to this server will be allowed from the following origins: '.*'
[I 150221 06:06:31 server:3728] No authentication method configured. All users will be ANONYMOUS
[I 150221 06:06:31 server:3855] Loaded global plugins: control_alt_w.js, help.js
[I 150221 06:06:31 server:4328] Listening on http://*:8080/
[E 150221 06:06:31 server:4357] Could not listen on 0.0.0.0:8080 (address:port is already in use by another application).
[E 150221 06:06:31 server:4371] Exception was: (98, 'Address already in use')
[I 150221 06:06:31 server:4378] Clearing cache_dir: /tmp/gateone_cache
[I 150221 06:06:31 server:4381] pid file removed.
[W 150221 06:06:32 utils:836] Could not open pid_file (/var/run/gateone.pid).  You *may* have to kill gateone.py manually (probably not).
[W 150221 06:06:32 utils:836] Could not open pid_file (/var/run/gateone.pid).  You *may* have to kill gateone.py manually (probably not).

The current TCP state of play.

# ps axuww|grep gateone
root      3666  0.0  0.9 231536 16040 ?        Sl   Feb20   0:00 /opt/virtualenvs/gateone/bin/python /opt/virtualenvs/gateone/bin/gateone --pid_file=/var/run/gateone.pid
# lsof -n -i:8080
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python  3666 root    5u  IPv4  13117      0t0  TCP *:http-alt (LISTEN)
python  3666 root   10u  IPv4  13098      0t0  TCP 10.80.25.72:http-alt->10.80.31.253:46171 (CLOSE_WAIT)
python  3666 root   16u  IPv6  13118      0t0  TCP *:http-alt (LISTEN)

Note the connection in CLOSE_WAIT.
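
CLOSE_WAIT means the remote end has closed its side of the connection and the local process has not yet closed its own socket, so a connection that stays in that state suggests the process is no longer servicing it. A rough sketch for tallying states from lsof output like the above, so this can be checked from a script. `count_tcp_states` is a hypothetical helper, and it assumes the state is the trailing "(STATE)" token on each TCP row, as in the output shown here.

```python
def count_tcp_states(lsof_output):
    """Count TCP states (LISTEN, CLOSE_WAIT, ...) in `lsof -n -i:PORT` output."""
    counts = {}
    for line in lsof_output.splitlines():
        line = line.strip()
        # Only TCP rows that end in "(STATE)" carry a state; skip the header.
        if " TCP " not in line or not line.endswith(")"):
            continue
        state = line[line.rfind("(") + 1:-1]
        counts[state] = counts.get(state, 0) + 1
    return counts


# The lsof output from this report, used as sample input.
SAMPLE = """\
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python  3666 root    5u  IPv4  13117      0t0  TCP *:http-alt (LISTEN)
python  3666 root   10u  IPv4  13098      0t0  TCP 10.80.25.72:http-alt->10.80.31.253:46171 (CLOSE_WAIT)
python  3666 root   16u  IPv6  13118      0t0  TCP *:http-alt (LISTEN)
"""

states = count_tcp_states(SAMPLE)
```

For the sample, `states` comes out as two LISTEN entries and one CLOSE_WAIT entry.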

Any idea what is causing the WebSocket error? And is there a way to recover gracefully without requiring a human to restart the Gate One process?
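
As a stopgap for the recovery part, something like the following external health check could be run from cron or the init system. This is only a sketch under assumptions: `pid_file_status` and `port_answers` are hypothetical helpers, not part of Gate One, and the decision to restart is left to whatever calls them. The port check reads a response byte instead of only connecting, because in the state above the kernel still accepts TCP connections on 8080 while nothing answers (which is what nginx reports as a timeout reading the response header).

```python
import os
import socket


def pid_file_status(pid_file):
    """Return 'ok', 'missing', 'invalid' or 'stale' for a pid file."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return "stale"
    except PermissionError:
        pass  # the process exists but belongs to another user
    return "ok"


def port_answers(host, port, timeout=5.0):
    """Return True if something sends back at least one byte over HTTP in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            return bool(conn.recv(1))
    except OSError:  # refused, unreachable, or timed out
        return False
```

With the values from this report, a wrapper would call `pid_file_status("/var/run/gateone.pid")` and `port_answers("127.0.0.1", 8080)` and restart the service if the first is not "ok" or the second is False.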

My config files below.

# cat 10server.conf 
// This is Gate One's main settings file.
{
    // "gateone" server-wide settings fall under "*"
    "*": {
        "gateone": { // These settings apply to all of Gate One
            "address": "",
            "auth": "none",
            "api_timestamp_window": 30,
            "ca_certs": null,
            "cache_dir": "/tmp/gateone_cache",
            "certificate": "/etc/gateone/ssl/certificate.pem",
            "combine_css": "",
            "combine_css_container": "gateone",
            "combine_js": "",
            "cookie_secret": "REDACTED",
            "debug": true,
            "disable_ssl": true,
            "embedded": false,
            "enable_unix_socket": false,
            "gid": "0",
            "https_redirect": false,
            "js_init": "{showToolbar: false, showTitle: false}",
            "keyfile": "/etc/gateone/ssl/keyfile.pem",
            "locale": "en_US",
            "log_file_max_size": 100000000,
            "log_file_num_backups": 10,
            "log_file_prefix": "/var/log/gateone/gateone.log",
            "log_to_stderr": null,
            "logging": "info",
            "origins": [".*"],
            "pid_file": "/var/run/gateone.pid",
            "port": 8080,
            "session_dir": "/tmp/gateone",
            "session_timeout": 0,
            "syslog_facility": "daemon",
            "syslog_host": null,
            "uid": "0",
            "unix_socket_path": "/tmp/gateone.sock",
            "url_prefix": "/",
            "user_dir": "/var/lib/gateone/users",
            "user_logs_max_age": "30d"
        }
    }
}
# cat 50terminal.conf 
// This is Gate One's Terminal application settings file.
{
    // "*" means "apply to all users" or "default"
    "*": {
        "terminal": {
            "commands": {"LOGIN": "/usr/bin/sudo -u admin -i"},
            "default_command": "LOGIN",
            "dtach": false,
            "enabled_filetypes": "all",
            "environment_vars": {"TERM": "xterm-256color"},
            "max_terms": 6,
            "session_logging": false,
            "syslog_session_logging": false
        }
    }
}
rene00 changed the title from "Gate One reloading removes gateone.pid causing timeouts" to "Gate One WebSocket causes reload which removes gateone.pid resulting in timeouts" on Feb 22, 2015

liftoff (Owner) commented Mar 4, 2015

After some thought ("I thought I remember fixing something like this a while back...") I believe I know what's going on: there's a bug in the version of Gate One you're using that I corrected in commit 9334592:

commit 9334592411911aba35b2c387a1907beacba3deb7
Author: Dan McDougall <daniel.mcdougall@liftoffsoftware.com>
Date:   Sat Aug 23 21:04:37 2014 -0400

    core/server.py:  Removed the code that restarts Gate One after the last user logs out.  Turns out it messes up a lot of the time on a lot of platforms.  It just isn't worth it.

So if you upgrade to the latest code, this problem should go away. Alternatively, you could look at what I changed between f11d0e4 and 9334592, which was basically just removing a bunch of lines from server.py:

git diff f11d0e4574b7dbfb680f1113098d7694d877ab1a 9334592411911aba35b2c387a1907beacba3deb7

rene00 (Author) commented Mar 4, 2015

@liftoff thanks! I'll merge 9334592.

rene00 closed this as completed on Mar 4, 2015