fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update #2685

rhelmer · 2015-03-21T17:27:26Z

r? @phrawzty - I need to figure out how to get this associated with the bugs properly (it fixes several), so don't merge yet :) I think we should land them all together to minimize breakage for external users. The docs in particular aren't quite good enough yet, but the following all WFM:

- removes all config files except alembic and django local.py
  - local.py can go away when PR "fix bug 1123908 - use env vars for django config" #2684 lands
- adds systemd service files for all socorro services
- provides working nginx sample configs
- updates docs
- splits socorro initial setup out to a /usr/bin/setup-socorro.sh

I've been testing this out on our AMI and it seems to work! Now that some of the "socorro lite" work has landed, you can run collection+processing without any additional services running, except for consul which is now required for all Socorro services:

sudo yum install consul envconsul
consul agent -bootstrap-expect 1 -server -data-dir=./consul-data/

First you must configure the collector to use WSGI rather than the default built-in web.py server:

curl -X PUT -d 'socorro.webapi.servers.ApacheModWSGI' http://localhost:8500/v1/kv/socorro/collector/web_server__wsgi_server_class

(@twobraids we should really change the name of that config key to remove the ApacheMod part and just call it socorro.webapi.servers.WSGI)
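
Since the same kind of PUT shows up again for the middleware further down, here is a minimal sketch of a helper that builds the Consul KV URL from an app name and a key. `consul_kv_url` and `consul_kv_put` are hypothetical names (they are not part of Socorro or Consul), and the address default is simply the one used in the command above.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with Socorro or Consul.
CONSUL_ADDR="${CONSUL_ADDR:-http://localhost:8500}"

# Print the KV URL for one Socorro app ($1) and one config key ($2).
consul_kv_url() {
    printf '%s/v1/kv/socorro/%s/%s\n' "$CONSUL_ADDR" "$1" "$2"
}

# PUT a value ($3) into Consul's KV store for the given app and key.
consul_kv_put() {
    curl -X PUT -d "$3" "$(consul_kv_url "$1" "$2")"
}

# Usage (needs a running Consul agent):
# consul_kv_put collector web_server__wsgi_server_class socorro.webapi.servers.ApacheModWSGI
```

Keeping the URL construction separate from the curl call makes it easy to check what would be written before touching a live agent.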

Then just start the services:

sudo yum install socorro
sudo systemctl start socorro-collector
sudo systemctl start socorro-processor

This will store both raw and processed crashes to ~socorro/crashes, and will scan the filesystem rather than using a queue. Storing crashes to ES/S3/PG and enabling RabbitMQ are just a matter of setting the right keys/values in consul.

You should be able to submit crashes, and they should be processed successfully (both raw .json and .dump and processed .jsonz files are stored in ~socorro/crashes):

# from a Socorro checkout w/ activated virtualenv
socorro submitter -u http://crash-reports/submit -s testcrash/raw
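
To confirm that the submitted crash was actually stored and processed, something like the following could be used. This is only a sketch: `count_crash_files` is a hypothetical helper, and it assumes nothing about the directory layout under ~socorro/crashes beyond the three file extensions mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: count files with extension $2 anywhere under the
# crash storage directory $1 (for example ~socorro/crashes).
count_crash_files() {
    find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# Usage after submitting a test crash:
# count_crash_files ~socorro/crashes json    # raw crash metadata
# count_crash_files ~socorro/crashes dump    # raw minidumps
# count_crash_files ~socorro/crashes jsonz   # processed crashes
```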

For a distributed setup we don't want to share a filesystem, so you'll need to turn RabbitMQ and S3 on via consul.

The webapp has more dependencies before it'll work:

sudo /usr/pgsql-9.3/bin/postgresql93-setup initdb
# set local connections to "trust"
vi /var/lib/pgsql/9.3/data/pg_hba.conf 
sudo systemctl start postgresql-9.3
sudo systemctl start elasticsearch
sudo yum install memcached
sudo systemctl start memcached

PG and ES need to be set up. NOTE: this script assumes they are on localhost, and it should fail gracefully if that's not the case. It's also safe to re-run the script; it won't destroy anything that is already set up:

sudo setup-socorro.sh
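
The "safe to re-run" property boils down to a check-then-create pattern. The sketch below illustrates the idea and is not the actual contents of setup-socorro.sh: `ensure`, `breakpad_db_exists` and `create_breakpad_db` are hypothetical names, and `createdb` is used purely for illustration.

```shell
#!/bin/sh
# ensure CHECK CREATE: run the CREATE command only when CHECK fails.
ensure() {
    if "$1"; then
        echo "already set up, skipping: $2"
    else
        echo "setting up: $2"
        "$2"
    fi
}

# Hypothetical check/create pair for the breakpad database (run as root).
breakpad_db_exists() { su - postgres -c "psql breakpad -c ''" >/dev/null 2>&1; }
create_breakpad_db() { su - postgres -c "createdb breakpad"; }

# Usage:
# ensure breakpad_db_exists create_breakpad_db
```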

# configure middleware to use WSGI rather than default built-in web.py server
curl -X PUT -d 'socorro.webapi.servers.ApacheModWSGI' http://localhost:8500/v1/kv/socorro/middleware/web_server__wsgi_server_class

sudo systemctl start socorro-middleware
sudo systemctl start socorro-webapp

The RPM drops nginx sample configs into /etc/nginx/conf.d that listen on the vhosts crash-reports (collector), crash-stats (webapp), and socorro-middleware. The last one only listens on localhost, since it's not safe to expose and we don't want anyone to do so accidentally.

The webapp will give you 404s for the default WaterWolf unless you either use socorro setupdb's --fakedata option, or set up a new product via the admin UI per http://socorro.readthedocs.org/en/latest/configuring-socorro.html (maybe we should have the setup-socorro script do some/all of this?)

phrawzty · 2015-03-22T14:20:59Z

docs/production-install.rst

+Install the Socorro repository.
+::
+  sudo rpm -ivh https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.packages-public/el/7/noarch/socorro-public-repo-1-1.el7.centos.noarch.rpm
+
 Now you can actually install the packages:
 ::


Consider adding a sudo yum makecache step before install, just to be sure.

phrawzty · 2015-03-22T15:10:24Z

Concerning the systemd unit files:

Given the environmental requirements and the length of the ExecStart, I strongly suggest restructuring this (and the other systemd unit files) to use sysconfig. This could be done per-service or using a global socorro sysconfig. In either case, it allows us to set common config data in one spot, which reduces the number of files we have to maintain and helps to ensure consistency at runtime. Of the two, I prefer the global model (in this case), though I concede that it's the less pure of the two. 😇

Per-service example:

# socorro-collector.service (snipped for brevity)
EnvironmentFile=-/etc/sysconfig/socorro-collector
ExecStart=$CMD $CMD_OPTS

# sysconfig/socorro-collector
VENV="/data/socorro/socorro-virtualenv"
CMD="/usr/bin/envconsul"
CMD_OPTS="-once -upcase=false -prefix socorro/collector $VENV/bin/uwsgi -H $VENV -M --need-app -w wsgi.collector -s /var/run/uwsgi/socorro-collector.sock --chmod-socket=664 --uid=socorro --gid=nginx"

Global example:

# socorro-collector.service (snipped for brevity)
EnvironmentFile=-/etc/sysconfig/socorro
ExecStart=$ENVCONSUL_BIN $ENVCONSUL_OPTS -prefix socorro/collector $VENV/bin/uwsgi -H $VENV -M --need-app -w wsgi.collector -s $UWSGI_DIR/socorro-collector.sock --chmod-socket=$SOCKET_MODE --uid=$USER --gid=$GROUP

# sysconfig/socorro
VENV="/data/socorro/socorro-virtualenv"
ENVCONSUL_BIN="/usr/bin/envconsul"
ENVCONSUL_OPTS="-once -upcase=false"
UWSGI_DIR="/var/run/uwsgi"
USER="socorro"
GROUP="nginx"
SOCKET_MODE="664"

Note that these are strictly off the top of my head, so YMMV. 😁

Update: I hadn't yet read about the so-called "Emperor and Vassal" approach, which definitely has merit, and would play into the suggestions I've made above should we choose to roll that way. (cf. bug 1145487)

phrawzty · 2015-03-22T15:18:15Z

docs/production-install.rst

-::
-  sudo service httpd start
-  sudo chkconfig httpd on
+    rabbitmq-server elasticsearch httpd mod_wsgi memcached socorro


Are you sure about httpd and mod_wsgi here given that we're now using Nginx and uwsgi? Maybe s/httpd\smod_wsgi/nginx/; or so?
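
For what it's worth, here is that substitution applied to the package line from the diff, as a quick sketch. `swap_web_packages` is a hypothetical name, and whether nginx alone is the right replacement (for instance, whether uwsgi needs its own package) is an open assumption.

```shell
#!/bin/sh
# Replace the Apache packages with nginx in a package list read from stdin.
# [[:space:]] is the portable spelling of the \s used in the suggestion.
swap_web_packages() {
    sed 's/httpd[[:space:]]mod_wsgi/nginx/'
}

echo "rabbitmq-server elasticsearch httpd mod_wsgi memcached socorro" | swap_web_packages
# prints: rabbitmq-server elasticsearch nginx memcached socorro
```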

phrawzty · 2015-03-22T15:47:04Z

Concerning scripts/install.sh and scripts/package.sh (and yes, I've said this before), but at some point we'll want to replace FPM with proper packaging. We don't need to fix this today, but we should probably create a bug for it - could be good for an intern or junior?

Also, the name install.sh is sort of misleading. 😛

phrawzty · 2015-03-22T15:56:46Z

scripts/setup-socorro.sh

+fi
+
+# create ElasticSearch indexes
+echo "Creating ElasticSearch indexes"


Elasticsearch (below, too). 😄

rhelmer · 2015-03-23

Concerning scripts/install.sh and scripts/package.sh (and yes, I've said this before), but at some point we'll want to replace FPM with proper packaging. We don't need to fix this today, but we should probably create a bug for it - could be good for an intern or junior?

sgtm, but some things I'd note:

- we're going to start breaking up the repo into separate modules and therefore will need separate packages (planning to do socorro-collector first)
- the build system in the current socorro module is difficult to deal with; it'll be easier with the new split-up modules
- we're probably going to want Debian packages too (which was part of the impetus for fpm)
- the new split-up socorro modules are going to be proper Python packages, so we might be able to get away from distro-specific packaging (say we used something like https://github.com/progrium/buildstep and had generic treatment of "web" and "worker" classes of app)

phrawzty · 2015-03-22T16:24:21Z

Concerning scripts/setup-socorro.sh, this is a good first stab (functionality++), but as I mentioned above, we'll need to amplify it along with the install docs in order to make it more useful for non-all-on-one-box installs. A simple enough solution would be to split out each role-specific step (or steps) into functions which are then called as appropriate. Off the top of my head, an example:

#!/usr/bin/env bash

function help {
    echo "USAGE: ${0} <role>"
    echo "Valid roles are: postgres, webapp, whatever."
    exit 1
}

function validate {
    # No argument? That's a paddlin'.
    if [ -z "${1}" ]; then
        help
    fi

    # Invalid function? That's a paddlin'.
    # (Note: "help" and "validate" are functions too, so they pass this check.)
    if [ "$(type -t "${1}")" != "function" ]; then
        help
    fi
}

function postgres {
    echo "this is the postgres function"
}

function webapp {
    echo "this is the webapp function"
}

function whatever {
    echo "this is the whatever function"
}

# Aaaaand go!
validate "${1}"
echo "Initialising ${1}."
"${1}"
exit 0

phrawzty · 2015-03-22T16:29:18Z

scripts/setup-socorro.sh

+
+# create DB if it does not exist
+# TODO handle DB not on localhost - could use setupdb for this
+su - postgres -c "psql breakpad -c ''" > /var/log/socorro/setupdb.log 2>&1


Previously this stuff was run as part of the package install, and thus, implicitly as root; however, now that it has been split out into a separate script, su's should likely be run via sudo. Either that, or we mandate that the entire script be run via sudo.

rhelmer · 2015-03-23

I don't want to embed sudo in the script; that seems too surprising. I'd rather the script just check whether you're running as root and fail if not, then people can sudo or su as appropriate.
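
A root check along those lines could look like the sketch below. `require_root` is a hypothetical name, and the optional argument exists only so the check can be exercised without actually being root.

```shell
#!/bin/sh
# Fail unless the effective UID is 0. An explicit UID may be passed as $1
# for testing; by default the current user's UID is used.
require_root() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -ne 0 ]; then
        echo "This script must be run as root (try sudo or su)." >&2
        return 1
    fi
}

# Near the top of the setup script:
# require_root || exit 1
```

Failing early with a clear message leaves the choice between sudo and su to the person running the script.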

phrawzty · 2015-03-22T16:45:22Z

I've been testing this out on our AMI and it seems to work! Now that some of the "socorro lite" work has landed, you can run collection+processing without any additional services running, except for consul which is now required for all Socorro services:
sudo yum install consul envconsul
consul agent -bootstrap-expect 1 -server -data-dir=./consul-data/

I realise that this is contextual and explanatory, but we should be careful here going forward: the example above implies that each node is running a Consul server when in fact, each node needs to have (at least) a Consul agent.
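
To make the distinction concrete, here is a rough sketch of the two invocations. It only prints the command lines; `consul_agent_cmd` is a hypothetical helper, the server address is whatever you pass in, and the `-join` flag should be checked against the Consul version in use.

```shell
#!/bin/sh
# Print (do not run) a consul command line for a node's role.
# "server" mirrors the single-node bootstrap from the PR description;
# "client" is a plain agent that joins the server address given as $2.
consul_agent_cmd() {
    case "$1" in
        server) echo "consul agent -server -bootstrap-expect 1 -data-dir=./consul-data/" ;;
        client) echo "consul agent -data-dir=./consul-data/ -join=$2" ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Usage (10.0.0.10 is a made-up server address):
# consul_agent_cmd client 10.0.0.10
```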

Globally this speaks to a larger issue we have with regards to our packaging, install scripts, documentation, etc, where we can't decide whether everything is supposed to run on one node by default or not. I think we should strongly consider elevating the idea of "roles" throughout - it would allow us to more cleanly describe every element both programmatically and in documentation.

Thoughts ❓

rhelmer · 2015-03-23T14:27:19Z

Globally this speaks to a larger issue we have with regards to our packaging, install scripts, documentation, etc, where we can't decide whether everything is supposed to run on one node by default or not. I think we should strongly consider elevating the idea of "roles" throughout - it would allow us to more cleanly describe every element both programmatically and in documentation.

Yes, this is tricky. I have found that most people besides us who want to try out Socorro (and even run it in production) tend to have low volume, and want to run it on a single node or maybe just a few. I want to make initial setup as simple as we can, while making it clear how to distribute services if that's what you need - any ideas for how to expose this consistently in the docs would be great!

phrawzty · 2015-03-23T15:35:44Z

daad0be adopts the sysconfig model for systemd unit files (yay) - but where is the sysconfig file itself? We should at least include a functioning sample.

rhelmer · 2015-03-24T01:50:11Z

@phrawzty just testing this, and one thing we're missing - we need to install consul and envconsul on these nodes.

Should we just make these a dependency for the RPM?

rhelmer · 2015-03-24T01:52:29Z

@phrawzty also - this drops an nginx config fragment into /etc/nginx/conf.d/, is it kosher for the Socorro RPM to reload nginx?

rhelmer · 2015-03-24T02:41:59Z

@phrawzty btw systemd doesn't seem to be playing nice with /etc/sysconfig/socorro - it complains that WorkingDirectory is not an absolute path when set to $WORKING_DIR, and ignores it :/

Having similar problems running the actual command, haven't worked those out quite yet...

rhelmer · 2015-03-24T02:44:49Z

@phrawzty http://0pointer.de/blog/projects/on-etc-sysinit.html talks about "what's wrong with /etc/sysconfig (...) Why might it make sense to fade out use of these files in a systemd world"

rhelmer · 2015-03-24T03:06:30Z

@phrawzty so (as you are no doubt aware) Fedora recommends sysconfig, although Lennart obviously doesn't:
http://fedoraproject.org/wiki/Packaging%3aSystemd#EnvironmentFiles_and_support_for_.2Fetc.2Fsysconfig_files

I think the problem is that we can't use it for WorkingDirectory or for the first arg in ExecStart, I guess?
For now I reverted the whole thing, but we could put some configs in there I suppose, lmk what you think.

rhelmer · 2015-03-24T03:17:57Z

OK - tested all Socorro-specific roles and PR currently works OK - Socorro apps start up fine, but have no Consul agent to connect to, work on that is happening elsewhere.

rhelmer · 2015-03-24T03:20:58Z

Oh BTW - crontabber is the only remaining app that isn't running under envconsul. I think it's going to be complex enough that it needs its own PR (just as Django has its own) - from the perspective of the current PR we're doing everything necessary to support it.

phrawzty · 2015-03-24T11:58:31Z

@phrawzty just testing this, and one thing we're missing - we need to install consul and envconsul on these nodes. Should we just make these a dependency for the RPM?

PR 40 addresses a bunch of Consul-related stuff, including the addition of both consul (agent) and envconsul to the Base AMI. We should be careful about making them dependencies for the RPM since not everybody will be using our Yum repo - this is a policy decision more than a technical one. I don't have time to form an opinion on this right now, heh.

@phrawzty also - this drops an nginx config fragment into /etc/nginx/conf.d/, is it kosher for the Socorro RPM to reload nginx?

👍

phrawzty · 2015-03-24T12:19:51Z

I think the problem is that we can't use it for WorkingDirectory or for the first arg in ExecStart, I guess? For now I reverted the whole thing, but we could put some configs in there I suppose, lmk what you think.

This is a battle that pits a young tool with no established best practices against a legacy of behaviour that's never been particularly consistent. There are no winners. 😾 I don't understand why those variables aren't working (the ExecStart case is especially baffling), but given that I'm not intimately familiar with systemd, I'm more than willing to concede that there is a "good reason" that is unknown to us at this time.

In the interest of ⏩ let's keep the reverted (non-sysconfig) version for now. We can re-examine this issue later - in fact, we should, since it may help to explain how to better configure and run stuff under systemd in general.
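
For when we do re-examine it: as far as I recall, the systemd.service documentation says the program named in ExecStart may not be a variable, which would explain the ExecStart case. One possible workaround (a sketch only, not what this PR does) is a small wrapper script that reads the sysconfig file itself, so the unit file can point at a literal absolute path. The wrapper name and the `SOCORRO_USER` / `SOCORRO_GROUP` variable names are assumptions; the defaults are the values from the sysconfig examples earlier in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. installed as /usr/bin/socorro-collector-wrapper,
# with a unit file containing: ExecStart=/usr/bin/socorro-collector-wrapper

# Print the collector command line, taking overrides from the environment.
collector_cmd() {
    venv="${VENV:-/data/socorro/socorro-virtualenv}"
    printf '%s %s -prefix socorro/collector %s/bin/uwsgi -H %s -M --need-app -w wsgi.collector -s %s/socorro-collector.sock --chmod-socket=%s --uid=%s --gid=%s\n' \
        "${ENVCONSUL_BIN:-/usr/bin/envconsul}" \
        "${ENVCONSUL_OPTS:--once -upcase=false}" \
        "$venv" "$venv" \
        "${UWSGI_DIR:-/var/run/uwsgi}" \
        "${SOCKET_MODE:-664}" \
        "${SOCORRO_USER:-socorro}" \
        "${SOCORRO_GROUP:-nginx}"
}

# In the real wrapper (relies on word splitting of the printed command):
# [ -r /etc/sysconfig/socorro ] && . /etc/sysconfig/socorro
# exec $(collector_cmd)
```

The sketch uses SOCORRO_USER and SOCORRO_GROUP rather than USER and GROUP because USER is usually already set in a shell environment and would silently override the default.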

rhelmer · 2015-03-24T16:12:28Z

I would guess that systemd is parsing the config, seeing that the string $WORKING_DIR does not begin with /, and complaining that it is not an absolute path. It should really expand the variable at runtime (maybe they have a good reason for not doing this, idk).

rhelmer · 2015-03-24T16:13:25Z

We should be careful about making them dependencies for the RPM since not everybody will be using our Yum repo - this is a policy decision more than a technical one. I don't have time to form an opinion on this right now, heh.

Everybody using our docs should be using our repo. If they don't want to that's fine, but they need to figure it out.

rhelmer · 2015-03-24T16:14:10Z

Oh BTW - crontabber is the only remaining app that isn't running under envconsul. I think it's going to be complex enough that it needs its own PR (just as Django has its own) - from the perspective of the current PR we're doing everything necessary to support it.

I take this back - turns out it wasn't hard at all. I think I can get this too, I just need to adjust the crontabber.sh wrapper

rhelmer · 2015-03-25T02:44:05Z

@phrawzty OK tested this by activating all of the roles on a single node, and (given a correct config) we seem to work in "distributed mode" (using S3 and RabbitMQ)

I found a problem though - processor can't push crashes into ES, we get:

InvalidJsonResponseError: <Response [404]>

We've got ES 1.4 in AWS, so I tried swapping out the socorro.external.elasticsearch storage classes for the newer socorro.external.es and it seems to work.

rhelmer · 2015-03-25T03:16:40Z

OK I can:

- submit and process crashes
- log into webapp as superuser, add products+releases
- see processed crashes in the web UI

The only problem I've found is supersearch doesn't seem to work:

File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorro-master-py2.7.egg/socorro/external/es/supersearch.py", line 26, in __init__
self.config.elasticsearch
TypeError: object() takes no parameters

@phrawzty @AdrianGaudebert do you know if the middleware has been tested with the new ES classes ^?

Middleware is configured like this:

implementations.implementation_list=psql: socorro.external.postgresql, hbase: socorro.external.boto, es: socorro.external.es, fs: socorro.external.filesystem, http: socorro.external.http, rabbitmq:socorro.external.rabbitmq
implementations.service_overrides=CrashData: hbase, Correlations: http, CorrelationsSignatures: http, SuperSearch: es, Priorityjobs: rabbitmq, Search: es, Query: es

phrawzty · 2015-03-25T13:12:48Z

We've got ES 1.4 in AWS, so I tried swapping out the socorro.external.elasticsearch storage classes for the newer socorro.external.es and it seems to work.

Yep! The former is for <= 0.90, while the latter is for >= 1.0 .
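
As a quick sketch of that rule, a helper could pick the module from the Elasticsearch version string. `es_module_for_version` is a hypothetical name, and it only looks at the major version number.

```shell
#!/bin/sh
# Map an Elasticsearch version string to the Socorro module named in this
# thread: socorro.external.elasticsearch for 0.x, socorro.external.es for 1.0+.
es_module_for_version() {
    major="${1%%.*}"
    if [ "$major" -ge 1 ]; then
        echo "socorro.external.es"
    else
        echo "socorro.external.elasticsearch"
    fi
}

es_module_for_version 1.4
# prints: socorro.external.es
```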

adngdb · 2015-03-25T13:34:19Z

@rhelmer That should be a configuration problem. Make sure that resources.elasticsearch.elasticsearch_class is pointing to the connection context in the es module and not the elasticsearch module, and it should work better! :)
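
If this option ends up being set through Consul, the key name probably needs the double-underscore form seen in the keys at the top of this PR (web_server__wsgi_server_class). The sketch below does that conversion; `to_consul_key` is a hypothetical name, and the assumption that every dotted option maps this way is inferred from that one example.

```shell
#!/bin/sh
# Convert a dotted option name into the double-underscore form.
# Assumption: the convention seen in web_server__wsgi_server_class
# applies to all namespaced options.
to_consul_key() {
    printf '%s\n' "$1" | sed 's/\./__/g'
}

to_consul_key resources.elasticsearch.elasticsearch_class
# prints: resources__elasticsearch__elasticsearch_class
```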

rhelmer · 2015-03-25T15:02:12Z

That should be a configuration problem. Make sure that resources.elasticsearch.elasticsearch_class is pointing to the connection context in the es module and not the elasticsearch module, and it should work better! :)

Cool, thanks! I set this and it works, I saved it in the socorro-config repo.

rhelmer · 2015-03-25T17:56:06Z

@phrawzty ok this is all rebased down, I've done a lot of testing on it and I think it's good to go.

I am not thrilled with the docs, but I think they are good enough for the moment except for one thing - we need to provide an example of how to use consul, since that's what our RPM supports now.

One advantage we have now is that our infra is all public, I think what I should do is make the socorro-config repo public and link to that. We should also link to socorro-infra somewhere too (may not apply for people who are not running in AWS, or need to use different tools, but worth linking to at least.)

phrawzty · 2015-03-25T18:42:14Z

@phrawzty ok this is all rebased down, I've done a lot of testing on it and I think it's good to go.

I dared not believe this day would ever come. 😂

I am not thrilled with the docs, but I think they are good enough for the moment except for one thing - we need to provide an example of how to use consul, since that's what our RPM supports now.

We need to establish how we intend to use Consul before we can craft examples. Let's do some light policy work here and then we'll put in a new PR for the docs, OK?

I think what I should do is make the socorro-config repo public and link to that.

Sure, we just need to be really careful about the contents of that repo - for example, revealing the names of private buckets.

phrawzty · 2015-03-25T18:46:26Z

Excepting the talking points above (which don't prevent this from landing, imho), this PR is

`r+` 🚀 🎉 💯

phrawzty reviewed Mar 22, 2015
phrawzty changed the title ~~[DO NOT MERGE] Centos7 fpm update~~ [DO NOT MERGE] Centos7 major packaging / infra update Mar 22, 2015

rhelmer changed the title ~~[DO NOT MERGE] Centos7 major packaging / infra update~~ fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update Mar 23, 2015

rhelmer changed the title ~~fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update~~ [NEEDS REBASE] fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update Mar 23, 2015

rhelmer closed this Mar 24, 2015

rhelmer reopened this Mar 24, 2015

rhelmer changed the title ~~[NEEDS REBASE] fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update~~ [NEEDS REBASE] fixes bug 1106992, 1145487, 1145489, 1146984 - Centos7 major packaging / infra update Mar 24, 2015

rhelmer changed the title ~~[NEEDS REBASE] fixes bug 1106992, 1145487, 1145489, 1146984 - Centos7 major packaging / infra update~~ [NEEDS REBASE] fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update Mar 24, 2015

fix bug 1106992 - use nginx+uwsgi

a7d1638

rhelmer force-pushed the centos7-fpm-update branch from 102f1ae to cf29cbe Compare March 24, 2015 17:07

rhelmer changed the title ~~[NEEDS REBASE] fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update~~ fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update Mar 24, 2015

rhelmer force-pushed the centos7-fpm-update branch from 23e5419 to 59f93cb Compare March 24, 2015 22:11

fix bug 1146984 - build RPM for Centos 7

750fd73

rhelmer force-pushed the centos7-fpm-update branch from c7d000e to 8f75dd2 Compare March 25, 2015 16:15

fix bug 1145487 - add systemd service files

d9b56f8

rhelmer force-pushed the centos7-fpm-update branch from 48d7e43 to d9b56f8 Compare March 25, 2015 18:50

rhelmer added a commit that referenced this pull request Mar 25, 2015

Merge pull request #2685 from rhelmer/centos7-fpm-update

e3b2077

fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update

rhelmer merged commit e3b2077 into mozilla-services:master Mar 25, 2015
