authors | state
---|---
Trent Mick <trent.mick@joyent.com> | publish
The primary standalone IMGAPI deployments are https://images.joyent.com (a.k.a. images.jo) and https://updates.joyent.com (a.k.a. updates.jo). "Standalone" here means an IMGAPI instance that is not running as one of the core services of a Triton DataCenter.
M0 is implemented and updates.jo and images.jo are deployed using the newer images and process. An Operator Guide documents setup/reprovision/backup/restore processes for a standalone IMGAPI.
Remaining milestones are incomplete and not currently scheduled. M0 implements a significant part of M1 (backup, deploy, restore) and M3 (log file rotation and upload). Good enough for now.
Updates.jo and images.jo are pets: set up manually, updated in-place, it isn't easy to update their images, backups are a manually configured cronjob that isn't part of setup, there isn't a documented/scripted recovery procedure, it isn't currently possible to make a standalone IMGAPI HA, etc.
E.g., given a new vuln in some part of its dependency stack (e.g. stud), it isn't clear how to update it.
This RFD will attempt to converge on a plan, implementation and docs for standalone IMGAPI deployments.
IMGAPI source is in https://github.com/TritonDataCenter/sdc-imgapi.
IMGAPI is an API to serve Triton images. The server stores two main kinds of things: image manifests (JSON documents) and image files (large binary blobs). The code's name for the component that handles manifests is "the database" (code) and for image files, the "storage" (code). An IMGAPI server can be configured with one "database.type" and one or more "storage" mechanisms.
The core IMGAPI in a Triton DataCenter runs with a "moray" database (a HA-able key-value store) and for starters "local" storage (needed at least for bootstrapping). A "manta" storage can be added -- generally a requirement to enable custom image creation in the datacenter for durability of custom image data.
Standalone IMGAPI deployments use a "local" database (generally there isn't a Moray to use) and typically use a "local" and, optionally, a "manta" storage. "local" storage is just files on the local disk (no replication).
Auth and TLS also differ: An in-Triton DataCenter IMGAPI doesn't use auth (it only accepts connections on the admin interface) and uses HTTP. A standalone IMGAPI of course requires auth for endpoints that change data (CreateImage et al). Basic auth is/was supported, but is deprecated. HTTP Signature auth (the same as CloudAPI) is preferred. TLS termination is currently done via stud (stud -> haproxy -> imgapi) per these setup docs. The use of HAproxy was a copy from Manta's muskie and CloudAPI. Currently only a single IMGAPI process is behind the HAproxy.
The base requirements of a standalone IMGAPI plan (i.e. good enough that we are happy to live with for images.jo and updates.jo) are:
1. data (manifests, image files) is durable
2. redeployable: There is a simple documented procedure for getting on a newer/older/latest version of the software.
An open question for #1: Is a periodic backup of the manifests (to Manta) sufficient (e.g. manifests in the last N minutes since last backup could be lost)? If you say "no", does your opinion change if a more durable solution comes with future planned work for HA (e.g. durability could be achieved via an HA DB cluster)? I.e. "periodic backups are fine for now if we have a plan to do better later."
Eventually these may be promoted to requirements.
- TLS certs auto-renew (via letsencrypt)
- HA
- logs are rotated and uploaded to manta
- Monitoring
Hopes and dreams: It would be lovely to expose delegate datasets and reprovision via cloudapi so theoretically those could be used for easier/quicker standalone IMGAPI instance updates/deployments.
Issues:
The milestones below are nice and I'd still like to do them. However, priorities call, so I need to get images.jo and updates.jo on a modern base and updateable in the shorter term. That's what this milestone is about. It will:
- Update sdc-imgapi.git such that the resultant "imgapi" images can be used both for DC-mode IMGAPI instances and for standalone IMGAPI instances (like images.jo).
- Document how to setup and maintain a new standalone imgapi zone for images.jo or updates.jo.
- Include a number of improvements for standalone IMGAPI instances:
- log file rotation and upload
- HTTP signature authkeys can be added to Manta and are sync'd from there
- background hourly backup (and support for restore)
Basically this milestone stops short of HA IMGAPI and image manifests are only backed up to durable storage (Manta) hourly.
Issues:
- IMGAPI-571: re-do images.jo/updates.jo deployment to be able use stock 'imgapi' images
Milestone "M" sections are for proposed order of work done to pick off first the requirements, then the nice-to-haves.
Here I'm making the assumption that the answer to the open question in the "Requirements" section is "yes, periodic backups, with a documented/scripted restore is sufficient for starters."
Right now setting up an images.jo, for example, is very manual and not wholly documented. With M1 we'll fix that and integrate first class (i.e. scripted and documented) backup and restore.
The overall plan here is to deploy using the same "imgapi" images we build for core Triton IMGAPI zones. We'll add the additional software needed for standalone mode (HAproxy and stud for now), and add "standalone" boot scripts (parallel to the standard Triton "boot/{setup,configure}.sh" boot scripts) so that running a standalone IMGAPI is as simple as: (a) (re-)provision with that image and (b) possibly ssh in to provide secrets (key access to Manta account, TLS cert).
Metadata provided at provision-time tells the zone which mode to run in. In core Triton, a user-script that runs "/opt/smartdc/boot/setup.sh" is what triggers running in "dc" mode. For standalone mode we'll use a separate user-script or metadata key.
First, backup. A standalone-mode IMGAPI will regularly attempt to back up local data to Manta (whether that is a cronjob or background process in imgapi is TBD). `imgapiadm status` will report a warning if not configured for backup (i.e. no Manta config) or if backup is failing.
Specifics: A standalone IMGAPI's local data dir looks like this:
/data/imgapi/
    # The stuff we want to backup:
    manifests/...       # all the image manifests
    images/...          # image files for those with stor=local

    # The stuff we don't want to backup:
    etc/
        imgapi.config.json
        imgapi-$shortzonename-$datestamp.id_rsa{,.pub}
        cert.pem
    archive/...
    logs/...            # NYI
Given `manta.user=bob` and `manta.baseDir=myimages`, the Manta base dir is `/bob/stor/myimages`. The regular backup process will back up to `/bob/stor/myimages/backup/...`.
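To make the path derivation concrete, here is a minimal sketch. The function names are hypothetical (not part of imgapi); only the path layout comes from the text above.

```shell
#!/bin/sh
# Sketch: derive the Manta dirs used by a standalone IMGAPI from its config.
# Function names are hypothetical; the path layout is from the RFD text.

# manta_base_dir USER BASEDIR -> /USER/stor/BASEDIR
manta_base_dir() {
    printf '/%s/stor/%s\n' "$1" "$2"
}

# manta_backup_dir USER BASEDIR -> /USER/stor/BASEDIR/backup
manta_backup_dir() {
    printf '%s/backup\n' "$(manta_base_dir "$1" "$2")"
}

manta_base_dir bob myimages
manta_backup_dir bob myimages
```

With `manta.user=bob` and `manta.baseDir=myimages` the two calls print the base dir and the backup dir given above.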
Minimally a new IMGAPI zone deployment will look as follows. For now we'll use the 'imgapi' images in updates.joyent.com. So the operator needs to import one to the DC (say we want the latest one):
# (1)
img=$(updates-imgadm list -H -o uuid --latest name=imgapi)
sdc-imgadm import -S https://updates.joyent.com $img
# Give the 'bob' account access to it:
sdc-imgadm add-acl $img $(sdc-useradm get bob | json uuid)
Then create the zone (we are using metadata for configuration):
# (2a)
triton create -w --name myimages0 imgapi g4-highcpu-2G \
-m mode=public -m manta.user=bob -m manta.baseDir=myimages
Practically speaking we'll probably want to use a delegate dataset and reprovisioning, so we'll be going through VMAPI (I'll still attempt to make things work without requiring a delegate dataset):
# (2b)
cat <<EOP | sdc-vmadm create
{
    "alias": "myimages0",
    "owner_uuid": "$(sdc-useradm get sdc | json uuid)",
    "billing_id": "$(sdc-papi /packages | json -H -c "this.name=='g4-highcpu-2G'" 0.uuid)",
    "networks": [{"uuid": "$(sdc-napi /network_pools | json -H -c "this.name=='Joyent-SDC-Public'" 0.uuid)"}],
    "image_uuid": "$(sdc-imgadm list name=imgapi --latest -H -o uuid)",
    "brand": "joyent-minimal",
    "delegate_dataset": true,
    "customer_metadata": {
        "mode": "public",
        "manta": {
            "user": "bob",
            "baseDir": "myimages"
        }
    }
}
EOP
Every IMGAPI instance creates its own SSH key. This key won't be on the 'bob' account, so we'll have to add it.
# (3)
ssh root@$(triton ip myimages0) cat /data/imgapi/etc/imgapi.id_rsa.pub \
| triton -a bob key add -n 'imgapi-4dad5922-20160607' -f -
As long as there are no other IMGAPI instances running out of this Manta dir, we should be good: The instance will download the backups, and then become active.
However, if a current or earlier instance was running out of this Manta dir, then this IMGAPI instance will refuse to go active (see next section). We'll need to tell it to take over:
# (4)
imgapiadm state set active
If that works, then our new guy should now be active:
$ curl -ik https://$(triton ip myimages0)/ping | json -H state
active
For IMGAPI services where we have operator access to the DC (i.e. images.jo and updates.jo) we'll want to use delegate datasets and reprovision to get faster updates. The procedure for images.joyent.com then will be:
# (1)
img=$(updates-imgadm list -H -o uuid --latest name=imgapi)
sdc-imgadm import -S https://updates.joyent.com $img
sdc-imgadm add-acl $img $(sdc-useradm get sdc | json uuid)
# (2) Put the service in readonly and drain.
joyent-imgadm adm readonly # or something like this
# (3)
owner=$(sdc-useradm get sdc | json uuid)
inst=$(sdc-vmadm list owner_uuid=$owner alias=imagesjo0 -H -o uuid)
sdc-vmadm reprovision $inst $img
# Should be back soon.
joyent-imgadm ping
We don't have multi-instance coordination here. There needs to be a single IMGAPI instance running out of a given Manta dir (called the "primary"). How do we coordinate that? My plan is a lock file in Manta using PutObject's test/set semantics:
/bob/stor/myimages/run/primary.lock
If that exists and does not contain this instance's id, then the instance won't go active. TODO: prove I can do this.
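A rough sketch of the intended "first writer wins" behaviour, simulated on the local filesystem. This is not Manta code: the shell's noclobber option stands in for the conditional PutObject, and `try_acquire_primary` is a hypothetical name. Whether Manta can provide the same guarantee is exactly the TODO above.

```shell
#!/bin/sh
# Sketch of the primary.lock test/set, simulated with a local file.
# A real implementation would do a conditional PutObject against Manta
# (unproven, see the TODO); noclobber is only a stand-in for that here.

# try_acquire_primary LOCKFILE INSTANCE_ID
# Exits 0 if this instance created the lock or already holds it.
try_acquire_primary() {
    # With noclobber set, '>' fails if LOCKFILE already exists.
    if (set -o noclobber; printf '%s\n' "$2" > "$1") 2>/dev/null; then
        return 0
    fi
    # Lock exists: go active only if it names this instance.
    [ "$(cat "$1")" = "$2" ]
}

lockdir=$(mktemp -d)
lock="$lockdir/primary.lock"
try_acquire_primary "$lock" inst-a && echo "inst-a: may go active"
try_acquire_primary "$lock" inst-b || echo "inst-b: must not go active"
rm -rf "$lockdir"
```

Note that a stale lock naming another instance blocks the new one, which matches the "refuse to go active" behaviour described earlier.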
When an instance stops being active for whatever reason, the (now stale) 'primary.lock' is left in place. This tells us what instance last was active. If/when a new instance is to take over the active spot (see "Server States" section below), it knows from that stale lock that it needs to sync to the backup.
Further all instances will write a "runstate" JSON file to that dir:
/bob/stor/myimages/run/$(zonename).runstate
This will allow the `imgapiadm status` command in each IMGAPI zone to know about other standby instances.
This section describes the states an IMGAPI server goes through to become active (or not) and the valid transitions.
initializing
: A starting IMGAPI service starts in this state. In this state it will:
    - Set up its local data dir, "/data/imgapi/...". On failure, go to "initfail" state.
    - Set up its Manta data dir, if appropriate, "/$user/stor/$baseDir/...". On failure, go to "initfail" state.
    - If not already currently the primary instance, then sync to the backup local data in Manta. On failure, go to "initfail" state.
    - Attempt to become "active" (per the "primary.lock" described above). If able to grab the lock, go to "active" state. If not, go to "standby" state.

initfail
: There was an error in initialization. This'll be saved out so `imgapiadm status` can report the specific issue. E.g. the common one will be that the new SSH key for this instance isn't on the Manta account yet. All API endpoints will respond with 503 Service Unavailable when in this state. Restart the service to get out of this state (it'll attempt "initialization" again).

standby
: The server is a warm standby. Only "warm" because it may be slightly out of date if the active instance made changes since it last sync'd from the backup. In standby mode, the API will respond with 503 Service Unavailable for all endpoints. A manual `imgapiadm state set ...` is required to get out of this mode.

active
: The service is up, has the "primary.lock" and the API is responding.

readonly
: The service is up, but non-read endpoints will error with 503 Service Unavailable. The only way to this state is manually from either "active" or "standby" via `imgapiadm state set readonly`. This can be useful for switching primaries:
    - get B to standby
    - put A (the current primary) in readonly
    - wait for A to drain
    - make B the primary and add to DNS
    - remove A from DNS, drain, decommission
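The "initializing" steps can be read as a small decision function. Here is a minimal sketch, assuming each step's outcome is passed in as a yes/no flag; `init_next_state` and its flag-style interface are hypothetical, for illustration only.

```shell
#!/bin/sh
# Sketch of the "initializing" decision flow from the state descriptions.
# Inputs are yes/no flags for each step's outcome. The function name and
# interface are hypothetical, not part of imgapi.

# init_next_state LOCAL_OK MANTA_OK IS_PRIMARY SYNC_OK GOT_LOCK
init_next_state() {
    [ "$1" = yes ] || { echo initfail; return; }   # local data dir setup
    [ "$2" = yes ] || { echo initfail; return; }   # Manta data dir setup
    # Only an instance that is not already primary syncs from the backup.
    if [ "$3" != yes ] && [ "$4" != yes ]; then
        echo initfail
        return
    fi
    # Holding primary.lock decides between active and standby.
    if [ "$5" = yes ]; then echo active; else echo standby; fi
}

init_next_state yes yes no yes yes   # all steps ok, lock grabbed
init_next_state yes yes no yes no    # all steps ok, lock held elsewhere
init_next_state yes no  no yes yes   # Manta dir setup failed
```

Transitions out of "standby" and into "readonly" are manual (`imgapiadm state set ...`), so they are deliberately left out of this function.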
See about tooling/scripted setup for letsencrypt-based auto TLS cert renewal.
TODO: I need to reacquaint myself with the tooling here. Brian and I set up automatic renewal for mo.jo.
Also look at getting a standard A or higher rating on https://www.ssllabs.com/ssltest/analyze.html.
tl;dr: Standard /etc/logadm.conf entry(ies) for standalone mode. Rotate to "/data/imgapi/logs". Script to upload rotated logs to Manta "/$account/stor/$baseDir/logs/imgapi/YYYY/MM/DD/$instance.log" -- possibly likewise for stud and HAproxy logs.
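As an illustration of the proposed upload layout, here is a small sketch that builds the Manta destination path for one rotated log. `manta_log_path` is a hypothetical helper, and using the VM alias as `$instance` is an assumption.

```shell
#!/bin/sh
# Sketch: build the Manta destination for a rotated imgapi log, following
# the /$account/stor/$baseDir/logs/imgapi/YYYY/MM/DD/$instance.log layout.
# The helper name and the choice of instance identifier are assumptions.

# manta_log_path ACCOUNT BASEDIR INSTANCE YYYY-MM-DD
manta_log_path() {
    y=${4%%-*}       # YYYY
    rest=${4#*-}     # MM-DD
    m=${rest%%-*}    # MM
    d=${rest#*-}     # DD
    printf '/%s/stor/%s/logs/imgapi/%s/%s/%s/%s.log\n' \
        "$1" "$2" "$y" "$m" "$d" "$3"
}

manta_log_path bob myimages myimages0 2016-06-07
```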
Might be able to crib the Manta shell script for log file uploading.
Currently I believe ops has a pingdom (or similar) check or checks on, say, images.jo. But there is nothing first class.
TODO: Explore using Amon for this. If that is feasible, then could ship suggested Amon probes to use for a given instance.
Getting to HA means:
1. storage: All image files in manta (as opposed to local storage)
2. database: An HA story for the "database" of manifests
3. load balancing
4. shared config
Storage (#1) is easy: `imgapiadm status` would warn/error about image durability if there are image files using local storage. With IMGAPI-536 there is now AdminChangeImageStor to easily migrate image files from local to manta storage.
Load balancing (#3): For Joyent's services (images.jo, updates.jo) we could use the LB solution we use for other things. Another alternative is to use a CNAME to a CNS service record and call it good enough.
Shared config (#4): Some things, like updates.jo's configured "channels" would need to be common between all instances. Some ideas here would be to share some of this config in the Manta area with periodic checking for config changes. An alternative is something like consul for shared config. The poor man's solution is to just be careful and keep the instance configs in sync. :)
Database (#2): Trying to coordinate via a flat file database isn't tenable. My current thought is to look at using a RethinkDB cluster. We don't yet have RethinkDB in pkgsrc (rethinkdb/rethinkdb#4309). Lacking that, we'd probably run a separate set of RethinkDB instances in LX zones. So far we've had good experience with RethinkDB with thoth and sesat -- granted only with a single RethinkDB instance currently. Writing a RethinkDB backend for IMGAPI's database would not be difficult. We should also consider a Moray/Manatee cluster.
- imgapi-standalone-reprovision confirmation should be before the remote image is imported or changed
- ssh host key after reprov: Deal with this.

        $ triton ssh imagesjo0
        @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
        @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
        ...
- Prometheus usage:
    - /stats with the elements of imgapi-standalone-status
    - a prom server
- able to pass imgapi-standalone-status right after creation and key addition. Not sure how.
- letsencrypt support
- x-server-name: $zonename
  We are now setting hostname to other than $zonename, and `os.hostname()` is used for the x-server-name header from IMGAPI. Perhaps that should stick to being the
- quiesce for reprov:
    - quiesce active connections
    - to make a call whether to start the reprov and start quiescing, it would be nice to have an ETA on current connections. I.e. did we just start a huge download? Also useful for monitoring. Kang.
    - rotate logs and put logs for upload on delegate dataset. Are there space concerns there?
- quiesce notes: It would be nice to have imgapi-standalone-quiesce to run before destroying it. Short downtime update process would be:
    - create inst1 (CNS disabled)
    - run imgapi-standalone-quiesce on inst0:
        - wait for current write requests (reporting status on them)
        - put IMGAPI in read-only mode (ideally without a restart)
        - do an archive run
        - disable crontab entries
        - do a backup run
        - do a log rotation and log upload run
        - put a marker that this inst is quiesced: imgapi should refuse to come back if quiesced (perhaps -restore could bring it back)
    - run imgapi-standalone-restore on inst1
        - NOTE: part of restore should be to enable CNS
        - NOTE: part of restore should be to enable the crontab entries
        - NOTE: part of restore should be a first run of log rot (if can) and backup and log upload (i.e. get to clean *-status run)
    - take inst0 out of CNS
    - wait for inst0 to drop out of CNS
    - should now be able to delete inst0 whenever

  This is all too complex. If we could get an alternative to the 'local' database, then we could simplify it a lot.
- restify (to v4)
- manta-sync's bunyan to newer to get latest dtrace-provider
- imgapi.git: drop nopt, used in main.js. Move main.js over to lib/ dir.
- stud should run as 'nobody', ditto haproxy (or perhaps as the 'stud' user added by pkgsrc?)
- get 'imgapi' off the path (in $prefix/bin/imgapi), don't need it and want `imgapiadm` to be easy with tab-complete
- `imgapiadm status` report an error if setup status is failed
- RFD on logrotation v2 for triton:
    - drop : and - separators in the log filenames
    - 'upload' dir is just for the actual uploads. Then theoretically don't need the "retain_time" feature in hermes and uploading can be simpler -- e.g. a la the manta zones' upload script and what services not using hermes could use.
    - logadm to rotate to /var/log/triton/$name-*.log
    - post_command renames to /var/log/triton/upload/ (with option to ln instead of just mv, that would allow "retain time" to be per the '-C N' configuration on the logadm.conf line)
    - nodename: hostname vs zonename. Actually we are using "nodename" which is `uname -n`. Looks like we just don't set hostname on sdc core zones.
- regular building (and retention policy) for 'imgapi-standalone' images by the Joyent_SW user, so don't have to play these games. Do we want/need `triton build` for this?
- A monitoring story:
    - backup failures
    - API server perf/status