Commits on Jul 28, 2015
  1. @smerritt @alistairncoles

    functests: use assertIn and assertNotIn

    smerritt authored alistairncoles committed
    We have a bunch of assertions like
        self.assertTrue(resp.status in (200, 204))
    Sometimes we get smart about failure messages and have something like
        self.assertTrue(resp.status in (200, 204), resp.status)
    so we can see what the status was when it failed.
    Since we don't have to support Python 2.6 any more, we can use
    assertIn/assertNotIn and get nice failure messages for free.
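    The difference can be sketched with a stand-alone assertion (the test class here is hypothetical, not one of Swift's):

```python
import unittest

class DemoTest(unittest.TestCase):
    def runTest(self):
        self.assertIn(204, (200, 204))      # passes
        self.assertNotIn(404, (200, 204))   # passes
        try:
            self.assertIn(404, (200, 204))
        except AssertionError as err:
            # the failure message names the offending value for free,
            # no second argument needed
            self.failure_message = str(err)

t = DemoTest()
t.runTest()
```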
    Change-Id: I2d46c9969d41207a89e01017b4c2bc533c3d744f
Commits on Jul 27, 2015
  1. @smerritt

    Rename WsgiStringIO -> WsgiBytesIO.

    smerritt authored
    If we're going to have a subclass of BytesIO, having "StringIO" in its
    name is just asking for confusion.
    Change-Id: I695ab3105b1a02eb158dcf0399ae91888bc1c0ac
Commits on Jul 21, 2015
  1. @smerritt

    Add comment about ResumingGetter.used_source_etag

    smerritt authored
    This confused a couple developers and took about ten minutes to
    unravel in IRC; let's leave a clue for the next person.
    Change-Id: I356c8c7a44de23f02eaf68d23a39c9eb4c203ff1
Commits on Jul 2, 2015
  1. @smerritt

    Stop moving partitions unnecessarily when overload is on.

    smerritt authored
    When overload was on and in use, the ring builder was unnecessarily
    moving partitions. It would converge on a good solution and settle
    down eventually, but it moved more partitions than necessary along the
    way.
    There are three partition gatherers used in the ring builder:
    dev-removed, dispersion, and weight, in that order. The dev-removed
    gatherer will pick up all partitions on removed devices. The
    dispersion gatherer picks up replicas of partitions that are
    suboptimally dispersed. The weight gatherer picks up partitions on
    devices which are overweight.
    The weight gatherer was not overload-aware, so it would pick up
    partitions that did not need to move. Consider a device that would
    normally have 100 partitions assigned, assume we set overload to 0.1
    so that this device will hold up to 110 (10 extra) for the sake of
    dispersion, and assume the device actually has 104 partitions assigned
    to it. The correct behavior is to gather nothing from this device
    because it has fewer than the maximum. Prior to this commit, the
    weight gatherer would remove 4 partitions from this device; they would
    subsequently be reassigned by the overload-aware partition placer
    (_reassign_parts()). In a ring with multiple overloaded devices, the
    builder would pick up some partitions from each, shuffle them, and
    then assign them back to those same devices. Obviously, this just
    created extra replication work for no benefit.
    Now, the weight gatherer takes overload into account, and will no
    longer needlessly gather partitions.
    That's not all, though; on its own, the weight-gatherer change would
    worsen the behavior of a ring
    with more overload than necessary. Before, the ring would balance as
    best it could, using the minimal amount of overload. With the
    weight-gatherer change, the ring builder will stop gathering
    partitions once a device reaches its maximum-permissible assignment
    including overload.
    For example, imagine a 3-replica, 4-zone ring with overload=0.2 and
    zone weights:
      z1: 100
      z2: 60
      z3: 60
      z4: 60
    Since z1 has more than 1/3 of the weight, z2, z3, and z4 must take
    more than their fair share for the sake of dispersion.
    Now, turn up the weights some:
      z1: 100
      z2: 80
      z3: 80
      z4: 80
    Overload is no longer needed; this ring can balance. However, z2, z3,
    and z4 would end up keeping more than their fair share since (a) they
    already had extra due to earlier conditions, and (b) the weight
    gatherer won't pick up partitions from them since they're not
    overburdened once it takes overload into account.
    To fix this, we compute the minimum overload factor required for
    optimal dispersion and then use min(user-input-overload,
    minimum-overload) during rebalance. This way, we don't overload
    devices more than the user says, but if the user sets overload higher
    than necessary, we'll still give the best balance possible.
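    The min(user-input-overload, minimum-overload) rule can be sketched as a toy model (the function and its arguments are illustrative, not the builder's actual code):

```python
# Toy sketch of the rule described above; names are illustrative.
def effective_overload(user_overload, wanted, assigned):
    """Cap overload at what dispersion actually requires.

    wanted:   partitions each device would get from weight alone
    assigned: partitions dispersion forces onto each device
    """
    # smallest overload factor that legalizes every dispersion-driven
    # assignment; clamp at zero when no device needs extra
    required = max((a - w) / float(w) for w, a in zip(wanted, assigned))
    required = max(required, 0.0)
    return min(user_overload, required)

# One device needs 4% extra for dispersion, but the user allowed 20%:
# only 4% is used, so devices aren't overloaded more than necessary.
print(effective_overload(0.2, wanted=[100, 100], assigned=[104, 96]))
```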
    Change-Id: If5666ba654ee25da54f9144f3b78840273a49627
  2. @smerritt

    Add ring-builder analyzer.

    smerritt authored
    This is a tool to help developers quantify changes to the ring
    builder. It takes a scenario (JSON file) describing the builder's
    basic parameters (part_power, replicas, etc.) and a number of
    "rounds", where each round is a set of operations to perform on the
    builder. For each round, the operations are applied, and then the
    builder is rebalanced until it reaches a steady state.
    The idea is that a developer observes the ring builder behaving
    suboptimally, writes a scenario to reproduce the behavior, modifies
    the ring builder to fix it, and references the scenario with the
    commit so that others can see that things have improved.
    I decided to write this after writing my fourth or fifth hacky one-off
    script to reproduce some bad behavior in the ring builder.
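    As a rough illustration, a scenario file might look something like this; the keys and operation names below are made up for the example and are not the tool's documented schema:

```json
{
  "part_power": 16,
  "replicas": 3,
  "overload": 0.1,
  "rounds": [
    [["add", "r1z1-", 100],
     ["add", "r1z2-", 100]],
    [["set_weight", 0, 150]]
  ]
}
```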
    Change-Id: I114242748368f142304aab90a6d99c1337bced4c
Commits on Jun 18, 2015
  1. @smerritt

    Get better at closing WSGI iterables.

    smerritt authored
    PEP 333 (WSGI) says: "If the iterable returned by the application has
    a close() method, the server or gateway must call that method upon
    completion of the current request[.]"
    There's a bunch of places where we weren't doing that; some of them
    matter more than others. Calling .close() can prevent a connection
    leak in some cases. In others, it just provides a certain pedantic
    smugness. Either way, we should do what WSGI requires.
    Noteworthy goofs include:
      * If a client is downloading a large object and disconnects halfway
        through, a proxy -> obj connection may be leaked. In this case,
        the WSGI iterable is a SegmentedIterable, which lacked a close()
        method. Thus, when the WSGI server noticed the client disconnect,
        it had no way of telling the SegmentedIterable about it, and so
        the underlying iterable for the segment's data didn't get closed.
        Here, it seems likely (though unproven) that the object server
        would time out and kill the connection, or that a
        ChunkWriteTimeout would fire down in the proxy server, so the
        leaked connection would eventually go away. However, a flurry of
        client disconnects could leave a big pile of useless connections.
      * If a conditional request receives a 304 or 412, the underlying
        app_iter is not closed. This mostly affects conditional requests
        for large objects.
    The leaked connections were noticed by this patch's co-author, who
    made the changes to SegmentedIterable. Those changes helped, but did
    not completely fix, the issue. The rest of the patch is an attempt to
    plug the rest of the holes.
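    The pattern the patch enforces can be sketched with a hypothetical byte-counting wrapper that forwards close() to the iterable it wraps:

```python
class CountingIterable(object):
    """Wrap a WSGI app_iter and forward close() as PEP 333 requires.
    A sketch of the pattern, not Swift's actual code."""

    def __init__(self, app_iter):
        self.app_iter = app_iter
        self.bytes_sent = 0

    def __iter__(self):
        for chunk in self.app_iter:
            self.bytes_sent += len(chunk)
            yield chunk

    def close(self):
        # Without this, the wrapped iterable's close() is never called,
        # and any underlying connection may leak.
        if hasattr(self.app_iter, 'close'):
            self.app_iter.close()

class FakeAppIter(object):
    closed = False
    def __iter__(self):
        return iter([b'hello ', b'world'])
    def close(self):
        self.closed = True

inner = FakeAppIter()
wrapped = CountingIterable(inner)
body = b''.join(wrapped)
wrapped.close()   # the WSGI server calls this; it must reach `inner`
```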
    Co-Authored-By: Romain LE DISEZ <>
    Change-Id: I168e147aae7c1728e7e3fdabb7fba6f2d747d937
    Closes-Bug: #1466549
Commits on Jun 17, 2015
  1. @smerritt

    Use just IP, not port, when determining partition placement

    smerritt authored
    In the ring builder, we place partitions with maximum possible
    dispersion across tiers, where a "tier" is region, then zone, then
    IP/port, then device. Now, instead of IP/port, just use IP. The port
    wasn't really getting us anything; two different object servers on two
    different ports on one machine aren't separate failure
    domains. However, if someone has only a few machines and is using one
    object server on its own port per disk, then the ring builder would
    end up with every disk in its own IP/port tier, resulting in bad (with
    respect to durability) partition placement.
    For example: assume 1 region, 1 zone, 4 machines, 48 total disks (12
    per machine), and one object server (and hence one port) per
    disk. With the old behavior, partition replicas will all go in the one
    region, then the one zone, then pick one of 48 IP/port pairs, then
    pick the one disk therein. This gives the same result as randomly
    picking 3 disks (without replacement) to store data on; it completely
    ignores machine boundaries.
    With the new behavior, the replica placer will pick the one region,
    then the one zone, then one of 4 IPs, then one of 12 disks
    therein. This gives the optimal placement with respect to durability.
    The same applies to Ring.get_more_nodes().
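    The effect of the tier change can be sketched by counting distinct tiers in the example above (the device dicts here are illustrative):

```python
# 4 machines, 12 disks each, one object server (and port) per disk
devs = [
    {'region': 1, 'zone': 1, 'ip': '10.0.0.%d' % (n // 12),
     'port': 6000 + n, 'device': 'd%d' % n}
    for n in range(48)
]

# old tier key included the port; new tier key stops at the IP
old_tiers = set((d['region'], d['zone'], d['ip'], d['port']) for d in devs)
new_tiers = set((d['region'], d['zone'], d['ip']) for d in devs)

# 48 one-disk tiers before vs. 4 machine-sized tiers after
print(len(old_tiers), len(new_tiers))
```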
    Co-Authored-By: Kota Tsuyuzaki <>
    Change-Id: Ibbd740c51296b7e360845b5309d276d7383a3742
Commits on Jun 3, 2015
  1. @smerritt

    EC: support multiple ranges for GET requests

    smerritt authored
    This commit lets clients receive multipart/byteranges responses (see
    RFC 7233, Appendix A) for erasure-coded objects. Clients can already
    do this for replicated objects, so this brings EC closer to feature
    parity (ha!).
    GetOrHeadHandler got a base class extracted from it that treats an
    HTTP response as a sequence of byte-range responses. This way, it can
    continue to yield whole fragments, not just N-byte pieces of the raw
    HTTP response, since an N-byte piece of a multipart/byteranges
    response is pretty much useless.
    There are a couple of bonus fixes in here, too. For starters, download
    resuming now works on multipart/byteranges responses. Before, it only
    worked on 200 responses or 206 responses for a single byte
    range. Also, BufferedHTTPResponse grew a readline() method.
    Also, the MIME response for replicated objects got tightened up a
    little. Before, it had some leading and trailing CRLFs which, while
    allowed by RFC 7233, provide no benefit. Now, both replicated and EC
    multipart/byteranges avoid extraneous bytes. This let me re-use the
    Content-Length calculation in swob instead of having to either hack
    around it or add extraneous whitespace to match.
    Change-Id: I16fc65e0ec4e356706d327bdb02a3741e36330a0
Commits on May 29, 2015
  1. @smerritt @alistairncoles

    Remove simplejson from staticweb

    smerritt authored alistairncoles committed
    Since we're dropping Python 2.6 support, we can rely on stdlib's json
    and get rid of our dependency on simplejson.
    This lets us get rid of some redundant Unicode encoding. Before, we
    would take the container-listing response off the wire,
    JSON-deserialize it (str -> unicode), then pass each of several fields
    from each entry to get_valid_utf8_str(), which would encode it,
    (unicode -> str), decode it (str -> unicode), and then encode it again
    (unicode -> str) for good measure.
    The net effect was that each object's name would, in the proxy server,
    go str -> unicode -> str -> unicode -> str.
    By replacing simplejson with stdlib json, we get a guarantee that each
    container-listing entry's name, hash, content_type, and last_modified
    are unicodes, so we can stop worrying about them being valid UTF-8 or
    not. This takes an encode and decode out of the path, so we just have
    str -> unicode -> str. While it'd be ideal to avoid this, the first
    transform (str -> unicode) happens when we decode the
    container-listing response body (json.loads()), so there's no way out.
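    The guarantee relied on here is simply stdlib json's behavior of decoding every JSON string to text:

```python
import json

# a container-listing body as it comes off the wire (bytes, with a
# JSON escape in the object name)
listing_body = b'[{"name": "caf\\u00e9", "hash": "d41d8cd9"}]'

entries = json.loads(listing_body.decode('utf-8'))
name = entries[0]['name']
# `name` is guaranteed to be text (unicode), so no extra
# encode/decode round-trips are needed to validate it
print(name)
```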
    Change-Id: I00aedf952d691a809c23025b89131ea0f02b6431
Commits on May 28, 2015
  1. @smerritt

    Remove simplejson from tests

    smerritt authored
    Since we're dropping Python 2.6 support, we can rely on stdlib's json
    and get rid of our dependency on simplejson.
    This commit just takes simplejson out of the unit and functional
    tests. They still pass.
    Change-Id: I96f17df81fa5d265395a938b19213d2638682106
Commits on May 26, 2015
  1. @smerritt

    Remove simplejson from swift-recon

    smerritt authored
    Since we're dropping Python 2.6 support, we can rely on stdlib's json
    and get rid of our dependency on simplejson.
    All swift-recon was doing with json was decoding a JSON response (from
    the recon middleware) and printing it to the terminal. This still
    works just fine.
    Change-Id: I28cf25a7c2856f230d4642c62fb8bf9c4d37e9e5
  2. @smerritt

    EC: don't 503 on marginally-successful PUT

    smerritt authored
    On EC PUT in an M+K scheme, we require M+1 fragment archives to
    durably land on disk. If we get that, then we go ahead and ask the
    object servers to "commit" the object by writing out .durable
    files. We only require 2 of those.
    When we got exactly M+1 fragment archives on disk, and then one
    connection timed out while writing .durable files, we should still be
    okay (provided M is at least 3). However, we'd take our M > 2
    remaining successful responses and pass that off to best_response()
    with a quorum size of M+1, thus getting a 503 even though everything
    worked well enough.
    Now we pass 2 to best_response() to avoid that false negative.
    There was also a spot where we were getting the quorum size wrong. If
    we wrote out 3 fragment archives for a 2+1 policy, we were only
    requiring 2 successful backend PUTs. That's wrong; the right number is
    3, which is what the policy's .quorum() method says. There was a spot
    where the right number wasn't getting plumbed through, but it is now.
    Change-Id: Ic658a199e952558db329268f4d7b4009f47c6d03
    Co-Authored-By: Clay Gerrard <>
    Closes-Bug: 1452468
Commits on May 9, 2015
  1. @smerritt

    Remove workaround for old eventlet version

    smerritt authored
    Swift now requires eventlet >= 0.16.1, so we can get rid of this
    workaround for a bug in eventlet 0.9.16.
    Change-Id: I4a1200b9bd9266896a704a840fda0d1b720bc86d
Commits on May 4, 2015
  1. @smerritt

    Bump up a timeout in a test

    smerritt authored
    Got a slow crappy VM like I do? You might see this fail
    occasionally. Bump up the timeout a little to help it out.
    Change-Id: I8c0e5b99012830ea3525fa55b0811268db3da2a2
Commits on Apr 22, 2015
  1. @smerritt

    Make RingBuilders deep-copy-able

    smerritt authored
    We used to be able to deep-copy RingBuilder objects, but the addition
    of debug logging (8d3b3b2) broke that since you can't deep-copy a
    Python logger. This commit fixes that.
    Swift doesn't really deep-copy RingBuilders anywhere, but third-party
    code might.
    Change-Id: If8bdadd93d9980db3d8a093f32d76ca604de9301
  2. @smerritt

    Bulk upload: treat user xattrs as object metadata

    smerritt authored
    Currently, if you PUT a single object, then you can also associate
    metadata with it by putting it in the request headers, prefixed with
    "X-Object-Meta". However, if you're bulk-uploading objects, then you
    have no way to assign any metadata.
    The tar file format* allows for arbitrary UTF-8 key/value pairs to be
    associated with each file in an archive (as well as with the archive
    itself, but we don't care about that here). If a file has extended
    attributes, then tar will store those as key/value pairs.
    This commit makes bulk upload read those extended attributes, if
    present, and convert those to Swift object metadata. Attributes
    starting with "user.meta" are converted to object metadata, and
    "user.mime_type"** is converted to Content-Type.
    For example, if you have a file "":
        $ setfattr -n user.mime_type -v "application/python-setup"
        $ setfattr -n user.meta.lunch -v "burger and fries"
        $ setfattr -n user.meta.dinner -v "baked ziti"
        $ setfattr -n user.stuff -v "whee"
    This will get translated to headers:
        Content-Type: application/python-setup
        X-Object-Meta-Lunch: burger and fries
        X-Object-Meta-Dinner: baked ziti
    Swift will handle xattrs stored by both GNU and BSD tar***. Only
    xattrs user.mime_type and user.meta.* are processed; others are
    ignored.
    This brings bulk upload much closer to feature-parity with non-bulk upload.
    * The POSIX 1003.1-2001 (pax) format, at least. There are a few
      different, mutually-incompatible tar formats out there, because of
      course there are. This is the default format on GNU tar 1.27.1 or
      newer.
    *** Even with pax-format tarballs, different encoders store xattrs
        slightly differently; for example, GNU tar stores the xattr
        "user.rubberducky" as pax header "SCHILY.xattr.user.rubberducky",
        while BSD tar (which uses libarchive) stores it as
        "LIBARCHIVE.xattr.user.rubberducky". One might wonder if this is
        some programmer's attempt at job security.
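    A sketch of the translation from pax headers to Swift metadata (the function name and details are illustrative, not bulk upload's actual code):

```python
def xattr_headers(pax_headers):
    """Translate tar pax extended headers into Swift object metadata."""
    headers = {}
    for key, value in pax_headers.items():
        # GNU tar and BSD tar use different pax-header prefixes
        for prefix in ('SCHILY.xattr.', 'LIBARCHIVE.xattr.'):
            if key.startswith(prefix):
                key = key[len(prefix):]
                break
        else:
            continue  # not an xattr header at all
        if key == 'user.mime_type':
            headers['Content-Type'] = value
        elif key.startswith('user.meta.'):
            name = key[len('user.meta.'):].title()
            headers['X-Object-Meta-' + name] = value
        # anything else (e.g. user.stuff) is ignored
    return headers

print(xattr_headers({
    'SCHILY.xattr.user.mime_type': 'application/python-setup',
    'SCHILY.xattr.user.meta.lunch': 'burger and fries',
    'LIBARCHIVE.xattr.user.stuff': 'whee',
}))
```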
    Change-Id: I5e3ce87d31054f5239e86d47c45adbde2bb93640
Commits on Apr 20, 2015
  1. @smerritt

    SAIO instructions: ensure ~/bin exists before copying into it

    smerritt authored
    Change-Id: I16cd211b00b529ccc4b46f6b10497c32b6741896
Commits on Apr 16, 2015
  1. @smerritt

    Functional test for SLO PUT overwriting one of its own segments

    smerritt authored
    Change-Id: I4855816848f4fdb148d0b82735cf79bc68429617
Commits on Apr 14, 2015
  1. @smerritt @clayg

    Foundational support for PUT and GET of erasure-coded objects

    smerritt authored clayg committed
    This commit makes it possible to PUT an object into Swift and have it
    stored using erasure coding instead of replication, and also to GET
    the object back from Swift at a later time.
    This works by splitting the incoming object into a number of segments,
    erasure-coding each segment in turn to get fragments, then
    concatenating the fragments into fragment archives. Segments are 1 MiB
    in size, except the last, which is between 1 B and 1 MiB.
    |                             object data                            |
              |                        |                      |
              v                        v                      v
    +===================+    +===================+         +==============+
    |     segment 1     |    |     segment 2     |   ...   |   segment N  |
    +===================+    +===================+         +==============+
              |                       |
              |                       |
              v                       v
         /=========\             /=========\
         | pyeclib |             | pyeclib |         ...
         \=========/             \=========/
              |                       |
              |                       |
              +--> fragment A-1       +--> fragment A-2
              |                       |
              |                       |
              |                       |
              |                       |
              |                       |
              +--> fragment B-1       +--> fragment B-2
              |                       |
              |                       |
             ...                     ...
    Then, object server A gets the concatenation of fragment A-1, A-2,
    ..., A-N, so its .data file looks like this (called a "fragment archive"):
    |     fragment A-1     |     fragment A-2     |  ...  |  fragment A-N |
    Since this means that the object server never sees the object data as
    the client sent it, we have to do a few things to ensure data
    integrity.
    First, the proxy has to check the Etag if the client provided it; the
    object server can't do it since the object server doesn't see the raw
    data.
    Second, if the client does not provide an Etag, the proxy computes it
    and uses the MIME-PUT mechanism to provide it to the object servers
    after the object body. Otherwise, the object would not have an Etag at
    all.
    Third, the proxy computes the MD5 of each fragment archive and sends
    it to the object server using the MIME-PUT mechanism. With replicated
    objects, the proxy checks that the Etags from all the object servers
    match, and if they don't, returns a 500 to the client. This mitigates
    the risk of data corruption in one of the proxy --> object connections,
    and signals to the client when it happens. With EC objects, we can't
    use that same mechanism, so we must send the checksum with each
    fragment archive to get comparable protection.
    On the GET path, the inverse happens: the proxy connects to a bunch of
    object servers (M of them, for an M+K scheme), reads one fragment at a
    time from each fragment archive, decodes those fragments into a
    segment, and serves the segment to the client.
    When an object server dies partway through a GET response, any
    partially-fetched fragment is discarded, the resumption point is wound
    back to the nearest fragment boundary, and the GET is retried with the
    next object server.
    GET requests for a single byterange work; GET requests for multiple
    byteranges do not.
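    The segment/fragment-archive layout in the diagram can be sketched with a toy stand-in for pyeclib (the "encoder" below just slices each segment and is not a real erasure code):

```python
SEGMENT_SIZE = 4  # 1 MiB in real Swift; tiny here for illustration

def segments(obj):
    for i in range(0, len(obj), SEGMENT_SIZE):
        yield obj[i:i + SEGMENT_SIZE]

def encode(segment, nfrags):
    # stand-in for pyeclib: slice the segment into nfrags fragments
    size = (len(segment) + nfrags - 1) // nfrags
    return [segment[i:i + size] for i in range(0, len(segment), size)]

obj = b'0123456789abcdef'
frag_archives = None
for seg in segments(obj):
    frags = encode(seg, 2)
    if frag_archives is None:
        frag_archives = [b''] * len(frags)
    # object server N receives the concatenation of fragment N of
    # every segment: its "fragment archive"
    for n, frag in enumerate(frags):
        frag_archives[n] += frag
```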
    There are a number of things _not_ included in this commit. Some of
    them are listed here:
     * multi-range GET
     * deferred cleanup of old .data files
     * durability (daemon to reconstruct missing archives)
    Co-Authored-By: Alistair Coles <>
    Co-Authored-By: Thiago da Silva <>
    Co-Authored-By: John Dickinson <>
    Co-Authored-By: Clay Gerrard <>
    Co-Authored-By: Tushar Gohad <>
    Co-Authored-By: Paul Luse <>
    Co-Authored-By: Christian Schwede <>
    Co-Authored-By: Yuan Zhou <>
    Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
  2. @smerritt @clayg

    Allow sending object metadata after data

    smerritt authored clayg committed
    This lets the proxy server send object metadata to the object server
    after the object data. This is necessary for EC, as it allows us to
    compute the etag of the object in the proxy server and still store it
    with the object.
    The wire format is a multipart MIME document. For sanity during a
    rolling upgrade, the multipart MIME document is only sent to the
    object server if it indicates, via 100 Continue header, that it knows
    how to consume it.
    Example 1 (new proxy, new obj server):
       proxy: PUT /p/a/c/o
              X-Backend-Obj-Metadata-Footer: yes
         obj: 100 Continue
               X-Obj-Metadata-Footer: yes
       proxy: --MIMEmimeMIMEmime...
    Example 2 (new proxy, old obj server):
       proxy: PUT /p/a/c/o
              X-Backend-Obj-Metadata-Footer: yes
         obj: 100 Continue
       proxy: <obj body>
    Co-Authored-By: Alistair Coles <>
    Co-Authored-By: Thiago da Silva <>
    Co-Authored-By: John Dickinson <>
    Co-Authored-By: Clay Gerrard <>
    Co-Authored-By: Tushar Gohad <>
    Co-Authored-By: Paul Luse <>
    Co-Authored-By: Christian Schwede <>
    Co-Authored-By: Yuan Zhou <>
    Change-Id: Id38f7e93e3473f19ff88123ae0501000ed9b2e89
Commits on Mar 31, 2015
  1. @smerritt

    Add some debug output to the ring builder

    smerritt authored
    Sometimes, I get handed a builder file in a support ticket and a
    question of the form "why is the balance [not] doing $thing?". When
    that happens, I add a bunch of print statements to my local
    swift/common/ring/, figure things out, and then delete the
    print statements. This time, instead of deleting the print statements,
    I turned them into debug() calls and added a "--debug" flag to the
    rebalance command in hopes that someone else will find it useful.
    Change-Id: I697af90984fa5b314ddf570280b4585ba0ba363c
Commits on Mar 6, 2015
  1. @smerritt

    Small optimization to ring builder.

    smerritt authored
    We were already checking this condition a few lines up; no need to do
    it again.
    Change-Id: I066c635c8dfa3c3a1e9a944decae2f41e2c689c9
Commits on Feb 24, 2015
  1. @smerritt

    Clean up a couple deprecation warnings

    smerritt authored
    Change-Id: Ic293402702981cea124d0dc57e95341fda7eaf99
Commits on Feb 20, 2015
  1. @smerritt

    Make proxy_logging close the WSGI iterator

    smerritt authored
    PEP 333 says that the WSGI framework will call .close() on the
    iterator returned by a WSGI application once it's done, provided such
    a method exists. So, if our code wraps an iterator, then we have to
    call .close() on it once we're done with it. proxy_logging wasn't.
    Since WSGIContext gets it right, I looked at making proxy_logging use
    WSGIContext. However, WSGIContext is all about forcing the first chunk
    out of the iterator so that it can capture the final HTTP status and
    headers; it doesn't help if you want to look at every chunk.
    proxy_logging wants every chunk so it can count the bytes sent.
    This didn't hurt anything in Swift, but pconstantine was complaining
    in IRC that our failure to call .close() was goofing up some other
    middleware he had.
    Change-Id: Ic6ea0795ccef6cda2b5c6737697ef7d58eac9ab4
Commits on Feb 12, 2015
  1. @smerritt

    Fix account-reaper

    smerritt authored
    As part of commit efb39a5, the account reaper grew a bind_port
    attribute, but it wasn't being converted to int, so naturally "6002"
    != 6002, and it wouldn't reap anything.
    The bind_port was only used for determining the local devices. Rather
    than fix the code to call int(), this commit removes the need for
    bind_port entirely by skipping the port check. If your rings have IPs,
    this is the same behavior as pre-efb39a5, and if your rings have
    hostnames, this still works.
    Change-Id: I7bd18e9952f7b9e0d7ce2dce230ee54c5e23709a
Commits on Jan 23, 2015
  1. @smerritt

    Allow per-policy overrides in object replicator.

    smerritt authored
    The replicator already supports --devices and --partitions to restrict
    its operation to a subset of devices and partitions. However,
    operators don't always want to replicate a partition in all policies
    since different policies (usually) have different rings.
    For example, if I know that policy 0's partition 1234 has no
    replicas on primary nodes due to over-aggressive rebalancing, I really
    want to find a node where the partition is and make the replicator
    push it onto the primaries. However, if I haven't been messing with
    policy 1's ring, its partition 1234 is fine. With the existing
    replicator args, I get both or neither; this commit lets me get just
    the useful one.
    Change-Id: Ib1d58fdd228a6ee7865321e65d7c04a891fa5c49
Commits on Jan 22, 2015
  1. @smerritt

    Make ThreadPools deallocatable.

    smerritt authored
    Currently, a ThreadPool acquires resources that last until process
    exit. You can let the ThreadPool go out of scope, but that doesn't
    terminate the worker threads or close file descriptors or anything.
    This commit makes it so you can .terminate() a ThreadPool object and
    get its resources back. Also, after you call .terminate(), trying to
    use the ThreadPool raises an exception so you know you've goofed.
    I have some internal code that could really use this, plus it makes
    the unit test run not leak resources, which is nice.
    Change-Id: Ibf7c6dc14c14f379421a79afb6c90a5e64b235fa
  2. @smerritt

    Optimize replication of targeted devices/partitions.

    smerritt authored
    swift-object-replicator lets you specify --devices and --partitions to
    perform a single replication pass over just those devices and
    partitions. However, it still scans every device and every partition
    to build up a list of jobs to do, then throws away the jobs for the
    wrong devices and partitions. This isn't too bad with partitions since
    it only wastes some CPU, but with devices, it results in unnecessary
    disk IO.
    This commit pushes the device and partition filtering a little further
    down into collect_jobs to avoid wasted work.
    Change-Id: Ia711bfc5a86ed4a080d27e08fe923cb4cb92da43
Commits on Jan 12, 2015
  1. @smerritt

    Drop redundant check in SLO segment-size validation

    smerritt authored
    Change-Id: Idf459f37cd18c46421c2e7a1a0506e8f28da13b4
Commits on Jan 7, 2015
  1. @smerritt

    Add notion of overload to swift-ring-builder

    smerritt authored
    The ring builder's placement algorithm has two goals: first, to ensure
    that each partition has its replicas as far apart as possible, and
    second, to ensure that partitions are fairly distributed according to
    device weight. In many cases, it succeeds in both, but sometimes those
    goals conflict. When that happens, operators may want to relax the
    rules a little bit in order to reach a compromise solution.
    Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
    and using 3 replicas. The ring builder will place 1 replica of each
    partition on each node, as you'd expect.
    Now imagine that one disk fails in node C and is removed from the
    ring. The operator would probably be okay with remaining at 1 replica
    per node (unless their disks are really close to full), but to
    accomplish that, they have to multiply the weights of the other disks
    in node C by 20/19 to make C's total weight stay the same. Otherwise,
    the ring builder will move partitions around such that some partitions
    have replicas only on nodes A and B.
    If 14 more disks failed in node C, the operator would probably be okay
    with some data not living on C, as a 4x increase in storage
    requirements is likely to fill disks.
    This commit introduces the notion of "overload": how much extra
    partition space can be placed on each disk *over* what the weight
    alone would allow.
    For example, an overload of 0.1 means that a device can take up to 10%
    more partitions than its weight would imply in order to make the
    replica dispersion better.
    Overload only has an effect when replica-dispersion and device weights
    come into conflict.
    The overload is a single floating-point value for the builder
    file. Existing builders get an overload of 0.0, so there will be no
    behavior change on existing rings.
    In the example above, imagine the operator sets an overload of 0.112
    on his rings. If node C loses a drive, each other drive can take on up
    to 11.2% more data. Splitting the dead drive's partitions among the
    remaining 19 results in a 5.26% increase, so everything that was on
    node C stays on node C. If another disk dies, then we're up to an
    11.1% increase, and so everything still stays on node C. If a third
    disk dies, then we've reached the limits of the overload, so some
    partitions will begin to reside solely on nodes A and B.
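    Checking the arithmetic in that example:

```python
# 20 disks per node; disks die one at a time in node C; overload 0.112
overload = 0.112
for dead in (1, 2, 3):
    remaining = 20 - dead
    # surviving disks must absorb the dead disks' share to keep all
    # of node C's data on node C
    extra = 20.0 / remaining - 1
    # 1 dead -> 5.26% (ok), 2 dead -> 11.1% (ok), 3 dead -> over limit
    print(dead, round(extra * 100, 2), extra <= overload)
```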
    Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
Commits on Jan 6, 2015
  1. @smerritt

    Only move too-close-together replicas when they can spread out.

    smerritt authored
    Imagine a 3-zone ring, and consider a partition in that ring with
    replicas placed as follows:
    * replica 0 is on device A (zone 2)
    * replica 1 is on device B (zone 1)
    * replica 2 is on device C (zone 2)
    Further, imagine that there are zero parts_wanted in all of zone 3;
    that is, zone 3 is completely full. However, zones 1 and 2 each have
    at least one parts_wanted on at least one device.
    When the ring builder goes to gather replicas to move, it gathers
    replica 0 because there are three zones available, but the replicas
    are only in two of them. Then, it places replica 0 in zone 1 or 2
    somewhere because those are the only zones with parts_wanted. Notice
    that this does *not* do anything to spread the partition out better.
    Then, on the next rebalance, replica 0 gets picked up and moved
    (again) but doesn't improve its placement (again).
    If your builder has min_part_hours > 0 (and it should), then replicas
    1 and 2 cannot move at all. A coworker observed the bug because a
    customer had such a partition, and its replica 2 was on a zero-weight
    device. He thought it odd that a zero-weight device should still have
    one partition on it despite the ring having been rebalanced dozens of
    times.
    Even if you don't have zero-weight devices, having a bunch of
    partitions trade places on each rebalance isn't particularly good.
    Note that this only happens with an unbalanceable ring; if the ring
    *can* balance, the gathered partitions will swap places, but they will
    get spread across more zones, so they won't get gathered up again on
    the next rebalance.
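    The core idea of the fix can be sketched as follows (a hedged simplification, not the actual ring-builder code; the function and data shapes here are invented for illustration):

```python
# Sketch: only gather a suboptimally-dispersed replica when some zone that
# does NOT already hold the partition has room (parts_wanted > 0) to take it.

def should_gather(replica_zones, zone_parts_wanted):
    """replica_zones: zone id holding each replica of one partition.
    zone_parts_wanted: {zone_id: total parts_wanted across its devices}."""
    held = set(replica_zones)
    # Under-dispersed: fewer distinct zones used than zones available
    # (and fewer than the replica count, so spreading is even possible).
    underdispersed = (len(held) < len(zone_parts_wanted)
                      and len(held) < len(replica_zones))
    # The fix: moving only helps if an unused zone can actually accept it.
    has_room_elsewhere = any(wanted > 0
                             for zone, wanted in zone_parts_wanted.items()
                             if zone not in held)
    return underdispersed and has_room_elsewhere

# The commit's scenario: replicas in zones [2, 1, 2]; zone 3 is full.
print(should_gather([2, 1, 2], {1: 5, 2: 3, 3: 0}))  # False: moving can't help
print(should_gather([2, 1, 2], {1: 0, 2: 0, 3: 4}))  # True: zone 3 has room
```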
    Change-Id: I8f44f032caac25c44778a497dedf23f5cb61b6bb
    Closes-Bug: 1400083
Commits on Dec 12, 2014
  1. @smerritt

    Add tests for metadata on 304 and 412 responses

    smerritt authored
    Commit 1f67eb7 added support for If-[None-]Match on DLOs and SLOs. It
    also made the 304 and 412 responses have the Content-Type and
    X-Object-Meta-* headers from the object instead of just having the
    default error headers.
    Someone showed up in IRC today looking for this behavior, and was
    happy to learn it's in newer Swift versions than the one they were
    running. If we've got clients depending on this, we should have some
    unit tests to make sure we don't accidentally take it out again.

    Change-Id: If06149d13140148463004d426cb7ba4c5601404a
Commits on Dec 8, 2014
  1. @smerritt

    Improve object-replicator startup time.

    smerritt authored
    The object replicator checks each partition directory to ensure it's
    really a directory and not a zero-byte file. This was happening in
    collect_jobs(), which is the first thing that the object replicator
    does.
    The effect was that, at startup, the object-replicator process would
    list each "objects" or "objects-N" directory on each object device,
    then stat() every single thing in there. On devices with lots of
    partitions on them, this makes the replicator take a long time before
    it does anything useful.
    If you have a cluster with a too-high part_power plus some failing
    disks elsewhere, you can easily get thousands of partition directories
    on each disk. If you've got 36 disks per node, that turns into a very
    long wait for the object replicator to do anything. Worse yet, if you
    add in a configuration management system that pushes new rings every
    couple hours, the object replicator can spend the vast majority of its
    time collecting jobs, then only spend a short time doing useful work
    before the ring changes and it has to start all over again.
    This commit moves the stat() call (os.path.isfile) to the loop that
    processes jobs. In a complete pass, the total work done is about the
    same, but the replicator starts doing useful work much sooner.
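    The before/after shape looks roughly like this (a hedged sketch, not Swift's actual replicator code; the function names are made up for illustration):

```python
import os

def collect_jobs(device_paths):
    """After the change: just list partition names; no per-entry stat() here,
    so startup is fast even with thousands of partition directories."""
    jobs = []
    for dev in device_paths:
        for part in sorted(os.listdir(dev)):
            jobs.append(os.path.join(dev, part))
    return jobs

def partitions_to_replicate(jobs):
    """The isfile check now runs lazily, one job at a time, as jobs
    are processed rather than all up front."""
    for path in jobs:
        if os.path.isfile(path):  # zero-byte file masquerading as a partition
            continue
        yield path
```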
    Change-Id: I5ed4cd09dde514ec7d1e74afe35feaab0cf28a10
Commits on Dec 5, 2014
  1. @smerritt

    Speed up reading and writing xattrs for object metadata

    smerritt authored
    Object metadata is stored as a pickled hash: first the data is
    pickled, then split into strings of length <= 254, then stored in a
    series of extended attributes named "user.swift.metadata",
    "user.swift.metadata1", "user.swift.metadata2", and so forth.
    The choice of length 254 is odd, undocumented, and dates back to the
    initial commit of Swift. From talking to people, I believe this was an
    attempt to fit the first xattr in the inode, thus avoiding a
    seek. However, it doesn't work. XFS _either_ stores all the xattrs
    together in the inode (local), _or_ it spills them all to blocks
    located outside the inode (extents or btree). Using short xattrs
    actually hurts us here; by splitting into more pieces, we end up with
    more names to store, thus reducing the metadata size that'll fit in
    the inode.
    I did some benchmarking of read_metadata with various xattr sizes
    against an XFS filesystem on a spinning disk, no VMs involved.
     name | rank | runs |      mean |        sd | timesBaseline
    32768 |    1 | 2500 | 0.0001195 |  3.75e-05 |           1.0
    16384 |    2 | 2500 | 0.0001348 | 1.869e-05 | 1.12809122912
     8192 |    3 | 2500 | 0.0001604 | 2.708e-05 | 1.34210998858
     4096 |    4 | 2500 | 0.0002326 | 0.0004816 | 1.94623473988
     2048 |    5 | 2500 | 0.0003414 | 0.0001409 | 2.85674781189
     1024 |    6 | 2500 | 0.0005457 | 0.0001741 | 4.56648611635
      254 |    7 | 2500 |  0.001848 |  0.001663 | 15.4616067887
    Here, "name" is the chunk size for the pickled metadata. A total
    metadata size of around 31.5 KiB was used, so the "32768" runs
    represent storing everything in one single xattr, while the "254" runs
    represent things as they are without this change.
    Since bigger xattr chunks make things go faster, the new chunk size is
    64 KiB. That's the biggest xattr that XFS allows.
    Reading of metadata from existing files is unaffected; the
    read_metadata() function already handles xattrs of any size.
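    The chunking scheme described above can be sketched like so (a hedged illustration using an in-memory dict in place of real setxattr/getxattr calls; the attribute names and 64 KiB chunk size follow the commit, the rest is a stand-in):

```python
import pickle

XATTR_KEY = 'user.swift.metadata'
CHUNK_SIZE = 65536  # 64 KiB: the largest xattr XFS allows

def write_metadata(store, metadata, chunk_size=CHUNK_SIZE):
    """Pickle the metadata, then store it in chunks named
    user.swift.metadata, user.swift.metadata1, user.swift.metadata2, ..."""
    blob = pickle.dumps(metadata, protocol=2)
    for i, off in enumerate(range(0, len(blob), chunk_size)):
        key = XATTR_KEY + (str(i) if i else '')
        store[key] = blob[off:off + chunk_size]

def read_metadata(store):
    """Reassemble the pickle from however many chunks exist; this is why
    reading files written with the old 254-byte chunks still works."""
    blob, i = b'', 0
    while True:
        key = XATTR_KEY + (str(i) if i else '')
        if key not in store:
            break
        blob += store[key]
        i += 1
    return pickle.loads(blob)
```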
    On non-XFS filesystems, this is no worse than what came before:
    ext4 has a limit of one block (typically 4 KiB) for all xattrs (names
    and values) taken together [1], so this change slightly increases the
    amount of Swift metadata that can be stored on ext4.
    ZFS let me store an xattr with an 8 MiB value, so that's plenty. It'll
    probably go further, but I stopped there.
    Change-Id: Ie22db08ac0050eda693de4c30d4bc0d620e7f7d4
Commits on Nov 14, 2014
  1. @smerritt

    Make error limits survive a ring reload

    smerritt authored
    The proxy was storing the error count and last-error time in the
    ring's internal data, specifically in the device dictionaries. This
    works okay, but it means that whenever a ring changes, all the error
    stats reset.
    Now the error stats live in the proxy server object, so they survive a
    ring reload.
    Better yet, the error stats are now keyed off of the node's
    IP/port/device triple, so if you have the same device in two rings
    (like with multiple storage policies), then the error stats are
    combined. If the proxy server sees a 507 for an object request in
    policy X, then that will now result in that particular object disk
    being error-limited for requests in policies Y and Z as well.
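    The new bookkeeping can be sketched as follows (a hedged simplification, not the proxy's actual code; the class, limits, and field names are invented for illustration):

```python
import time

class ErrorLimiter:
    """Error stats keyed by (ip, port, device) and held on the server
    object, so they survive ring reloads and are shared across rings."""

    def __init__(self, suppression_limit=10, suppression_interval=60):
        self.stats = {}  # (ip, port, device) -> {'errors': n, 'last_error': ts}
        self.suppression_limit = suppression_limit
        self.suppression_interval = suppression_interval

    def _key(self, node):
        return (node['ip'], node['port'], node['device'])

    def record_error(self, node, now=None):
        now = time.time() if now is None else now
        entry = self.stats.setdefault(self._key(node),
                                      {'errors': 0, 'last_error': 0})
        entry['errors'] += 1
        entry['last_error'] = now

    def is_limited(self, node, now=None):
        now = time.time() if now is None else now
        entry = self.stats.get(self._key(node))
        if not entry:
            return False
        if now - entry['last_error'] > self.suppression_interval:
            return False  # errors have aged out
        return entry['errors'] > self.suppression_limit
```

    Because the key is the IP/port/device triple rather than a per-ring device dict, the same physical disk appearing in two policies' rings maps to one entry.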
    Change-Id: Icc72b68b99f37367bb16d43688e7e45327e3e022