fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update #2685

Merged
rhelmer merged 3 commits into mozilla-services:master on Mar 25, 2015

Contributor

@rhelmer rhelmer commented Mar 21, 2015

r? @phrawzty - I need to figure out how to get this associated with bugs properly (it fixes several), so don't merge yet :) I think we should land them all together to minimize breaking external users. The docs in particular aren't quite good enough yet, but the following all WFM:

  • removes all config files except alembic and django local.py
  • adds systemd service files for all socorro services
  • provides working nginx sample configs
  • updates docs
  • splits socorro initial setup out to a /usr/bin/setup-socorro.sh

I've been testing this out on our AMI and it seems to work! Now that some of the "socorro lite" work has landed, you can run collection+processing without any additional services running, except for consul which is now required for all Socorro services:

sudo yum install consul envconsul
consul agent -bootstrap-expect 1 -server -data-dir=./consul-data/

First you must configure the collector to use WSGI rather than the default built-in web.py server:

curl -X PUT -d 'socorro.webapi.servers.ApacheModWSGI' http://localhost:8500/v1/kv/socorro/collector/web_server__wsgi_server_class

(@twobraids we should really change the name of that config key to remove the ApacheMod part and just call it socorro.webapi.servers.WSGI)
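
If you end up setting more than a couple of these keys, a tiny wrapper saves typing. This is only a sketch: `consul_kv_url`, `consul_kv_put` and the `CONSUL_ADDR` variable are made-up names for illustration, and it assumes the same `socorro/<app>/<key>` layout as the curl call above.

```shell
# Hypothetical helpers, not part of Socorro or Consul.
# CONSUL_ADDR is an assumed variable; it defaults to the local agent used above.
CONSUL_ADDR="${CONSUL_ADDR:-http://localhost:8500}"

# Print the Consul KV URL for a Socorro app and config key.
consul_kv_url() {
    local app="$1" key="$2"
    printf '%s/v1/kv/socorro/%s/%s\n' "$CONSUL_ADDR" "$app" "$key"
}

# PUT a value under socorro/<app>/<key>; -f makes curl fail on HTTP errors.
consul_kv_put() {
    local app="$1" key="$2" value="$3"
    curl -fsS -X PUT -d "$value" "$(consul_kv_url "$app" "$key")"
}
```

With that in place, `consul_kv_put collector web_server__wsgi_server_class socorro.webapi.servers.ApacheModWSGI` should do the same thing as the curl command above.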

Then just start the services:

sudo yum install socorro
sudo systemctl start socorro-collector
sudo systemctl start socorro-processor

This will store both raw and processed crashes to ~socorro/crashes, and will scan the filesystem rather than using a queue. Storing crashes to ES/S3/PG and enabling RabbitMQ are just a matter of setting the right keys/values in consul.

You should be able to submit crashes, and they should be processed successfully (both raw .json and .dump and processed .jsonz files are stored in ~socorro/crashes):

# from a Socorro checkout w/ activated virtualenv
socorro submitter -u http://crash-reports/submit -s testcrash/raw
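
To check that a submitted crash made it all the way through, something like the following can count what landed on disk. It is a rough sketch: `count_crash_files` is a hypothetical helper, and it searches recursively because the directory layout under ~socorro/crashes isn't described here.

```shell
# Hypothetical helper: count crash files by extension under a crash directory.
# Searches recursively, since the exact directory layout is an unknown here.
count_crash_files() {
    local dir="$1" ext
    for ext in json dump jsonz; do
        printf '%s: %s\n' "$ext" "$(find "$dir" -type f -name "*.${ext}" | wc -l | tr -d ' ')"
    done
}
```

Run it as a user that can read the directory, e.g. `count_crash_files ~socorro/crashes`. After a successful submit and process you'd expect all three counts to be non-zero.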

For a distributed setup we don't want to share a filesystem, so you'll need to turn RabbitMQ and S3 on via consul.

The webapp has more dependencies before it'll work:

sudo /usr/pgsql-9.3/bin/postgresql93-setup initdb
# set local connections to "trust"
sudo vi /var/lib/pgsql/9.3/data/pg_hba.conf
sudo systemctl start postgresql-9.3
sudo systemctl start elasticsearch
sudo yum install memcached
sudo systemctl start memcached
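
Before moving on it's worth confirming the dependencies actually came up. A minimal sketch, assuming the unit names used above (`inactive_services` is a hypothetical helper name):

```shell
# Hypothetical helper: print any of the named units that systemd
# does not report as active.
inactive_services() {
    local svc
    for svc in "$@"; do
        if ! systemctl is-active --quiet "$svc"; then
            echo "$svc"
        fi
    done
}
```

`inactive_services postgresql-9.3 elasticsearch memcached` prints nothing when all three are running, and otherwise lists the ones that need a look.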

PG and ES need to be set up (NOTE this script assumes they are on localhost, it should fail gracefully if that's not the case. Also it's safe to re-run the script, it won't destroy anything already set up):

sudo setup-socorro.sh
# configure middleware to use WSGI rather than default built-in web.py server
curl -X PUT -d 'socorro.webapi.servers.ApacheModWSGI' http://localhost:8500/v1/kv/socorro/middleware/web_server__wsgi_server_class
sudo systemctl start socorro-middleware
sudo systemctl start socorro-webapp

The RPM drops nginx sample configs into /etc/nginx/conf.d that listen on the vhosts crash-reports (collector), crash-stats (webapp), and socorro-middleware (this one only listens on localhost since it's not safe, don't want anyone to accidentally expose it)

The webapp will give you 404s for the default WaterWolf unless you either use socorro setupdb's --fakedata option, or set up a new product via the admin UI per http://socorro.readthedocs.org/en/latest/configuring-socorro.html (maybe we should have the setup-socorro script do some/all of this?)

Install the Socorro repository.
::
sudo rpm -ivh https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.packages-public/el/7/noarch/socorro-public-repo-1-1.el7.centos.noarch.rpm

Now you can actually install the packages:
::
Contributor

Consider adding a sudo yum makecache step before install, just to be sure.

@phrawzty
Contributor

Concerning the systemd unit files:

Given the environmental requirements and the length of the ExecStart, I strongly suggest restructuring this (and the other systemd unit files) to use sysconfig. This could be done per-service or using a global socorro sysconfig. In either case, it allows us to set common config data in one spot, which reduces the number of files we have to maintain and helps to ensure consistency at runtime. Of the two, I prefer the global model (in this case), though I concede that it's the less pure of the two. 😇

Per-service example:

# socorro-collector.service (snipped for brevity)
EnvironmentFile=-/etc/sysconfig/socorro-collector
ExecStart=$CMD $CMD_OPTS
# sysconfig/socorro-collector
VENV="/data/socorro/socorro-virtualenv"
CMD="/usr/bin/envconsul"
CMD_OPTS="-once -upcase=false -prefix socorro/collector $VENV/bin/uwsgi -H $VENV -M --need-app -w wsgi.collector -s /var/run/uwsgi/socorro-collector.sock --chmod-socket=664 --uid=socorro --gid=nginx"

Global example:

# socorro-collector.service (snipped for brevity)
EnvironmentFile=-/etc/sysconfig/socorro
ExecStart=$ENVCONSUL_BIN $ENVCONSUL_OPTS -prefix socorro/collector $VENV/bin/uwsgi -H $VENV -M --need-app -w wsgi.collector -s $UWSGI_DIR/socorro-collector.sock --chmod-socket=$SOCKET_MODE --uid=$USER --gid=$GROUP
# sysconfig/socorro
VENV="/data/socorro/socorro-virtualenv"
ENVCONSUL_BIN="/usr/bin/envconsul"
ENVCONSUL_OPTS="-once -upcase=false"
UWSGI_DIR="/var/run/uwsgi"
USER="socorro"
GROUP="nginx"
SOCKET_MODE="664"

Note that these are strictly off the top of my head, so YMMV. 😁

Update: I hadn't yet read about the so-called "Emperor and Vassal" approach, which definitely has merit, and would play into the suggestions I've made above should we choose to roll that way. (cf. bug 1145487)
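
For the global model, a quick way to eyeball what the collector command line is meant to expand to is to source the sysconfig file in a subshell and print the result. This is a sketch only: `render_collector_cmd` is a hypothetical helper, and systemd's own parsing of `EnvironmentFile` is not identical to a shell's, so this shows the intended expansion rather than what systemd will actually do.

```shell
# Hypothetical helper: print the collector command that the global
# sysconfig example is intended to produce.
render_collector_cmd() {
    local sysconfig="$1"
    (
        # sysconfig files are plain KEY="value" lines, so a shell can source them
        . "$sysconfig"
        echo "$ENVCONSUL_BIN $ENVCONSUL_OPTS -prefix socorro/collector $VENV/bin/uwsgi -H $VENV -M --need-app -w wsgi.collector -s $UWSGI_DIR/socorro-collector.sock --chmod-socket=$SOCKET_MODE --uid=$USER --gid=$GROUP"
    )
}
```

Running `render_collector_cmd /etc/sysconfig/socorro` against the sample values above should print the same long command that is currently hard-coded in the unit file.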

::
sudo service httpd start
sudo chkconfig httpd on
rabbitmq-server elasticsearch httpd mod_wsgi memcached socorro
Contributor

Are you sure about httpd and mod_wsgi here given that we're now using Nginx and uwsgi? Maybe s/httpd\smod_wsgi/nginx/; or so?

@phrawzty
Contributor

Concerning scripts/install.sh and scripts/package.sh (and yes, I've said this before), but at some point we'll want to replace FPM with proper packaging. We don't need to fix this today, but we should probably create a bug for it - could be good for an intern or junior?

Also, the name install.sh is sort of misleading. 😛

fi

# create ElasticSearch indexes
echo "Creating ElasticSearch indexes"
Contributor

Elasticsearch (below, too). 😄

Contributor Author

Concerning scripts/install.sh and scripts/package.sh (and yes, I've said this before), but at some point we'll want to replace FPM with proper packaging. We don't need to fix this today, but we should probably create a bug for it - could be good for an intern or junior?

sgtm, but some things I'd note:

  • we're going to start breaking up the repo into separate modules and therefore will need separate packages (planning to do socorro-collector first)
    • the build system in the current socorro module is difficult to deal with, it'll be easier with the new split-up modules
  • we're probably going to want debian packages too (which was part of the impetus for fpm)
  • the new split-up socorro modules are going to be proper python packages, so we might be able to get away from having to do distro-specific packaging (say we used something like https://github.com/progrium/buildstep and had generic treatment of "web" and "worker" classes of app)

@phrawzty
Contributor

Concerning scripts/setup-socorro.sh, this is a good first stab (functionality++), but as I mentioned above, we'll need to amplify it along with the install docs in order to make it more useful for non-all-on-one-box installs. A simple enough solution would be to split out each role-specific step (or steps) into functions which are then called as appropriate. Off the top of my head, an example:

#!/usr/bin/env bash

function help {
    echo "USAGE: ${0} <role>"
    echo "Valid roles are: postgres, webapp, whatever."
    exit 1
}

function validate {
    # No argument? That's a paddlin'.
    if [ "x${1}" == "x" ]; then
        help
    fi

    # Invalid function? That's a paddlin'.
    if [ "$(type -t "${1}")" != "function" ]; then
        help
    fi
}

function postgres {
    echo "this is the postgres function"
}

function webapp {
    echo "this is the webapp function"
}

function whatever {
    echo "this is the whatever function"
}

# Aaaaand go!
validate "$1"
echo "Initialising ${1}."
"$1"
exit 0


# create DB if it does not exist
# TODO handle DB not on localhost - could use setupdb for this
su - postgres -c "psql breakpad -c ''" > /var/log/socorro/setupdb.log 2>&1
Contributor

Previously this stuff was run as part of the package install, and thus, implicitly as root; however, now that it has been split out into a separate script, su's should likely be run via sudo. Either that, or we mandate that the entire script be run via sudo.

Contributor Author

I don't want to embed sudo in the script, that seems too surprising - I'd rather the script just check if you're running as root and fail if not, then people can sudo or su as appropriate.
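
A root check along those lines could look roughly like this. It's a sketch, not the code in the PR, and `require_root` is a made-up name:

```shell
# Hypothetical guard: fail early unless the effective user is root.
require_root() {
    if [ "$(id -u)" -ne 0 ]; then
        echo "This script must be run as root (use sudo or su)." >&2
        return 1
    fi
}
```

Calling `require_root || exit 1` at the top of setup-socorro.sh would leave the choice of sudo or su to whoever runs it.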

@phrawzty phrawzty changed the title [DO NOT MERGE] Centos7 fpm update [DO NOT MERGE] Centos7 major packaging / infra update Mar 22, 2015
@phrawzty
Contributor

I've been testing this out on our AMI and it seems to work! Now that some of the "socorro lite" work has landed, you can run collection+processing without any additional services running, except for consul which is now required for all Socorro services:

sudo yum install consul envconsul
consul agent -bootstrap-expect 1 -server -data-dir=./consul-data/

I realise that this is contextual and explanatory, but we should be careful here going forward: the example above implies that each node is running a Consul server when in fact, each node needs to have (at least) a Consul agent.
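
To make the server/agent distinction concrete, here is a rough sketch of how the consul arguments might differ per node. `consul_agent_args` and its role names are hypothetical, the server branch just mirrors the bootstrap command quoted above, and the client branch assumes a Consul version that supports `-retry-join`.

```shell
# Hypothetical helper: print the consul arguments for a node role.
# "server" mirrors the single-node bootstrap command quoted above;
# "client" runs a plain agent that joins an existing server.
consul_agent_args() {
    local role="$1" data_dir="$2" server_addr="$3"
    case "$role" in
        server) echo "agent -server -bootstrap-expect 1 -data-dir=${data_dir}" ;;
        client) echo "agent -data-dir=${data_dir} -retry-join=${server_addr}" ;;
        *) echo "unknown role: ${role}" >&2; return 1 ;;
    esac
}
```

A non-server node would then run something like `consul $(consul_agent_args client /var/lib/consul <server-address>)`, where the data directory and server address are placeholders.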

Globally this speaks to a larger issue we have with regards to our packaging, install scripts, documentation, etc, where we can't decide whether everything is supposed to run on one node by default or not. I think we should strongly consider elevating the idea of "roles" throughout - it would allow us to more cleanly describe every element both programmatically and in documentation.

Thoughts ❓

@rhelmer
Contributor Author

rhelmer commented Mar 23, 2015

Globally this speaks to a larger issue we have with regards to our packaging, install scripts, documentation, etc, where we can't decide whether everything is supposed to run on one node by default or not. I think we should strongly consider elevating the idea of "roles" throughout - it would allow us to more cleanly describe every element both programmatically and in documentation.

Yes this is tricky, I have found that most people who want to try out Socorro (and even for production) besides us tend to have low volume, and want to run it on a single node or maybe just a few. I want to make initial setup as simple as we can, while making it clear how to distribute services if that's what you need - any ideas for how to expose this consistently in the docs would be great!

@rhelmer rhelmer changed the title [DO NOT MERGE] Centos7 major packaging / infra update fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update Mar 23, 2015
@phrawzty
Contributor

daad0be adopts the sysconfig model for systemd unit files (yay) - but where is the sysconfig file itself? We should at least include a functioning sample.

@rhelmer rhelmer changed the title fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update [NEEDS REBASE] fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update Mar 23, 2015
@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

@phrawzty just testing this, and one thing we're missing - we need to install consul and envconsul on these nodes.

Should we just make these a dependency for the RPM?

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

@phrawzty also - this drops an nginx config fragment into /etc/nginx/conf.d/, is it kosher for the Socorro RPM to reload nginx?

@rhelmer rhelmer closed this Mar 24, 2015
@rhelmer rhelmer reopened this Mar 24, 2015
@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

@phrawzty btw systemd doesn't seem to be playing nice with /etc/sysconfig/socorro - it complains that the WorkingDir is not an absolute path when set to $WORKING_DIR and ignores it :/

Having similar problems running the actual command, haven't worked those out quite yet...

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

@phrawzty http://0pointer.de/blog/projects/on-etc-sysinit.html talks about "what's wrong with /etc/sysconfig (...) Why might it make sense to fade out use of these files in a systemd world"

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

@phrawzty so (as you are no doubt aware) Fedora recommends sysconfig, although Lennart obviously doesn't:
http://fedoraproject.org/wiki/Packaging%3aSystemd#EnvironmentFiles_and_support_for_.2Fetc.2Fsysconfig_files

I think the problem is we can't use it for WorkingDir or the first arg in ExecStart I guess?
For now I reverted the whole thing, but we could put some configs in there I suppose, lmk what you think.

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

OK - tested all Socorro-specific roles and PR currently works OK - Socorro apps start up fine, but have no Consul agent to connect to, work on that is happening elsewhere.

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

Oh BTW - crontabber is the only remaining app that isn't running under envconsul, I think it's going to be complex enough that it needs its own PR (just as Django has its own) - from the perspective of the current PR we're doing everything necessary to support it.

@phrawzty
Contributor

@phrawzty just testing this, and one thing we're missing - we need to install consul and envconsul on these nodes. Should we just make these a dependency for the RPM?

PR 40 addresses a bunch of Consul-related stuff, including the addition of both consul (agent) and envconsul to the Base AMI. We should be careful about making them dependencies for the RPM since not everybody will be using our Yum repo - this is a policy decision more than a technical one. I don't have time to form an opinion on this right now, heh.

@phrawzty also - this drops an nginx config fragment into /etc/nginx/conf.d/, is it kosher for the Socorro RPM to reload nginx?

👍
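
If the RPM does reload nginx, it seems safer to test the config first so a bad fragment can't take down a running server. A minimal sketch of such a post-install step (`reload_nginx_if_valid` is a hypothetical name, and this is not taken from the PR):

```shell
# Hypothetical post-install step: reload nginx only if the new config parses.
reload_nginx_if_valid() {
    if nginx -t >/dev/null 2>&1; then
        systemctl reload nginx
    else
        echo "nginx config test failed; not reloading" >&2
        return 1
    fi
}
```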

@phrawzty
Contributor

I think the problem is we can't use it for WorkingDir or the first arg in ExecStart I guess? For now I reverted the whole thing, but we could put some configs in there I suppose, lmk what you think.

This is a battle that pits a young tool with no established best practices against a legacy of behaviour that's never been particularly consistent. There are no winners. 😾 I don't understand why those variables aren't working (the ExecStart case is especially baffling), but given that I'm not intimately familiar with systemd, I'm more than willing to concede that there is a "good reason" that is unknown to us at this time.

In the interest of ⏩ let's keep the reverted (non-sysconfig) version for now. We can re-examine this issue later - in fact, we should, since it may help to explain how to better configure and run stuff under systemd in general.

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

I would guess that the issue with systemd is it is parsing the config and seeing that the string $WORKING_DIR does not begin with / and so is not an absolute path and is complaining, it should really expand the variable at runtime (maybe they have a good reason for not doing this, idk)
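
As far as I know, systemd only substitutes environment variables in the arguments of an ExecStart line, not in the executable path and not in the working directory setting (spelled `WorkingDirectory=` in unit files). If that's right, one way around it, which is not what this PR does, is to keep those two values out of sysconfig and fill in literal absolute paths when the package is built. A sketch, where `render_unit` and the `@WORKING_DIR@` / `@COMMAND@` placeholders are made up for illustration:

```shell
# Hypothetical build-time step: substitute @WORKING_DIR@ and @COMMAND@ in a
# unit template so the installed unit contains literal absolute paths.
# Values containing '|' or '&' would need escaping; this sketch ignores that.
render_unit() {
    local template="$1" working_dir="$2" command="$3"
    sed -e "s|@WORKING_DIR@|${working_dir}|g" \
        -e "s|@COMMAND@|${command}|g" \
        "$template"
}
```

The remaining options (socket mode, user, group and so on) could still come from an `EnvironmentFile`, since they only appear as arguments.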

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

We should be careful about making them dependencies for the RPM since not everybody will be using our Yum repo - this is a policy decision more than a technical one. I don't have time to form an opinion on this right now, heh.

Everybody using our docs should be using our repo. If they don't want to that's fine, but they need to figure it out.

@rhelmer
Contributor Author

rhelmer commented Mar 24, 2015

Oh BTW - crontabber is the only remaining app that isn't running under envconsul, I think it's going to be complex enough that it needs its own PR (just as Django has its own) - from the perspective of the current PR we're doing everything necessary to support it.

I take this back - turns out it wasn't hard at all. I think I can get this too, I just need to adjust the crontabber.sh wrapper

@rhelmer rhelmer changed the title [NEEDS REBASE] fixes bug 1106992, 1145487, 1145489 - Centos7 major packaging / infra update [NEEDS REBASE] fixes bug 1106992, 1145487, 1145489, 1146984 - Centos7 major packaging / infra update Mar 24, 2015
@rhelmer rhelmer changed the title [NEEDS REBASE] fixes bug 1106992, 1145487, 1145489, 1146984 - Centos7 major packaging / infra update [NEEDS REBASE] fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update Mar 24, 2015
@rhelmer rhelmer changed the title [NEEDS REBASE] fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update Mar 24, 2015
@rhelmer
Copy link
Contributor Author

rhelmer commented Mar 25, 2015

@phrawzty OK tested this by activating all of the roles on a single node, and (given a correct config) we seem to work in "distributed mode" (using S3 and RabbitMQ)

I found a problem though - processor can't push crashes into ES, we get:

InvalidJsonResponseError: <Response [404]>

We've got ES 1.4 in AWS, so I tried swapping out the socorro.external.elasticsearch storage classes for the newer socorro.external.es and it seems to work.

@rhelmer
Contributor Author

rhelmer commented Mar 25, 2015

OK I can:

  • submit and process crashes
  • log into webapp as superuser, add products+releases
  • see processed crashes in the web UI

The only problem I've found is supersearch doesn't seem to work:

File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/socorro-master-py2.7.egg/socorro/external/es/supersearch.py", line 26, in __init__
self.config.elasticsearch
TypeError: object() takes no parameters

@phrawzty @AdrianGaudebert do you know if the middleware has been tested with the new ES classes ^?

Middleware is configured like this:

implementations.implementation_list=psql: socorro.external.postgresql, hbase: socorro.external.boto, es: socorro.external.es, fs: socorro.external.filesystem, http: socorro.external.http, rabbitmq:socorro.external.rabbitmq
implementations.service_overrides=CrashData: hbase, Correlations: http, CorrelationsSignatures: http, SuperSearch: es, Priorityjobs: rabbitmq, Search: es, Query: es

@phrawzty
Contributor

We've got ES 1.4 in AWS, so I tried swapping out the socorro.external.elasticsearch storage classes for the newer socorro.external.es and it seems to work.

Yep! The former is for <= 0.90, while the latter is for >= 1.0 .
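
That rule is simple enough to encode if a setup script ever needs to pick the module automatically. A sketch only: `es_module_for_version` is a hypothetical helper that takes the Elasticsearch version as a string and looks at the major version number.

```shell
# Hypothetical helper: map an Elasticsearch version string to the Socorro
# storage package, following the rule stated above
# (<= 0.90 uses socorro.external.elasticsearch, >= 1.0 uses socorro.external.es).
es_module_for_version() {
    local version="$1" major
    major="${version%%.*}"
    if [ "$major" -ge 1 ]; then
        echo "socorro.external.es"
    else
        echo "socorro.external.elasticsearch"
    fi
}
```

For the ES 1.4 cluster mentioned above, `es_module_for_version 1.4` prints `socorro.external.es`.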

@adngdb
Contributor

adngdb commented Mar 25, 2015

@rhelmer That should be a configuration problem. Make sure that resources.elasticsearch.elasticsearch_class is pointing to the connection context in the es module and not the elasticsearch module, and it should work better! :)

@rhelmer
Contributor Author

rhelmer commented Mar 25, 2015

That should be a configuration problem. Make sure that resources.elasticsearch.elasticsearch_class is pointing to the connection context in the es module and not the elasticsearch module, and it should work better! :)

Cool, thanks! I set this and it works, I saved it in the socorro-config repo.

@rhelmer
Contributor Author

rhelmer commented Mar 25, 2015

@phrawzty ok this is all rebased down, I've done a lot of testing on it and I think it's good to go.

I am not thrilled with the docs, but I think they are good enough for the moment except for one thing - we need to provide an example of how to use consul, since that's what our RPM supports now.

One advantage we have now is that our infra is all public, I think what I should do is make the socorro-config repo public and link to that. We should also link to socorro-infra somewhere too (may not apply for people who are not running in AWS, or need to use different tools, but worth linking to at least.)

@phrawzty
Contributor

@phrawzty ok this is all rebased down, I've done a lot of testing on it and I think it's good to go.

I dared not believe this day would ever come. 😂

I am not thrilled with the docs, but I think they are good enough for the moment except for one thing - we need to provide an example of how to use consul, since that's what our RPM supports now.

We need to establish how we intend to use Consul before we can craft examples. Let's do some light policy work here and then we'll put in a new PR for the docs, OK?

I think what I should do is make the socorro-config repo public and link to that.

Sure, we just need to be really careful about the contents of that repo - for example, revealing the names of private buckets.

@phrawzty
Contributor

Excepting the talking points above (which don't prevent this from landing, imho), this PR is

r+ 🚀 🎉 💯

rhelmer added a commit that referenced this pull request Mar 25, 2015
fixes bug 1106992, 1145487, 1146984 - Centos7 major packaging / infra update
@rhelmer rhelmer merged commit e3b2077 into mozilla-services:master Mar 25, 2015