Find file
Fetching contributors…
Cannot retrieve contributors at this time
287 lines (232 sloc) 13.3 KB

Riak 1.1.1 Release Notes

Major Features and Improvements for Riak

MapReduce Changes

This release includes fixes for systems with heavy MapReduce load.

Pipe maintains a limit on the number of active worker processes in the system. Workers are created on demand when the pipe executes - they are not reserved at pipe creation time.

Prior to 1.1.1 not all fittings (e.g. get object from riak, map value, reduce) were checking to make sure their output was being delivered to the next stage of the pipeline. This has been changed so that the pipe will error if processing errors or system limits are hit.

As a result of the change if the pipe is overloaded clients will see errors of this form: {error,<<"Error sending inputs: {error,worker_limit_reached}">>}

Clients can either decide to retry (preferably with a backoff/retries limit mechanism) or if the hardware has sufficient capacity the number of workers per vnode can be increased by setting worker_limit in etc/app.config.

{riak_pipe, [{worker_limit, #Workers}]}

Different numbers of workers are used based on the type of MapReduce query so tuning will be very workload dependent.

If you are using Javascript MapReduce you will also need to increase the number of Javascript virtual machines in the riak_kv section of app.config.

{map_js_vm_count,  #MapVMs},
{reduce_js_vm_count, #ReduceVMs}

Again, tuning will be workload dependent. If sufficient resources are available setting #MapVMs to the maximum number of JS MapReduce jobs being run in parallel will give best results and #ReduceVMs set to number of jobs / number of nodes plus a margin for overload.

Bugs Fixed

-Map-reduce jobs fail during rolling upgrade to 1.1 -Race Condition causes Javascript VMs to never be returned to VM Pool -Avoid {case_clause, ok} errors when vnode backend fails to start -Bitcask - Fix incorrect NIF error tuples

Riak 1.1.0 Release Notes

Major Features and Improvements for Riak

Riak Control

-Riak Control preview demo -Riak Control repository with documentation on how to get started


-Blog post announcing riaknostic -Riaknostic Homepage

Bitcask Improvements

  • CRC checks have been added to hintfiles for bitcask

LevelDB Improvements

  • Snappy (compression algorithm from Google) has been enabled
  • The compaction algorithm has been tuned to reduce delays due to compaction

Other backend changes

  • Multi-backend now supports 2i properly

Lager Improvements

  • Tracing support (see the README)
  • Term printing is ~4x faster and much more correct (compared to io:format)
  • Bitstring printing support was added

MapReduce Improvements

  • The MapReduce interface now supports requests with empty queries. This allows the 2i, list-keys, and search inputs to return matching keys to clients without needing to include a reduce_identity query phase.
  • MapReduce error messages have been improved. Most error cases should now return helpful information all the way to the client, while also producing less spam in Riak’s logs.

Riak Core Improvements

There has been numerous changes to riak_core which address issues with cluster scalability, and enable Riak to better handle large clusters and large rings. Specifically, the warnings about potential gossip overload and large ring sizes from the Riak 1.0 release notes are no longer valid. However, there are a few user visible configuration changes impacted by these improvements.

First, Riak has a new routing layer that determines how requests are sent to vnodes for servicing. In order to support mixed-version clusters during a rolling upgrade, Riak 1.1 starts in a legacy routing mode that is slower than the traditional behavior in prior releases. After upgrading or deploying a cluster that is composed solely of Riak 1.1 nodes, the legacy mode can be disabled by adding the following to the riak_core section of app.config:

{legacy_vnode_routing, false}

Second, the handoff_concurrency setting that can be enabled in the riak_core section of app.config has been changed to limit both incoming and outgoing handoff from a node. Prior to 1.1, this setting only limited the amount of outgoing handoff traffic from a node, and was unable to prevent a node from becoming overloaded with too much inbound traffic.

Finally, the gossip_interval setting that can be enabled in riak_core section of app.config has been removed, and replaced with a new gossip_limit setting. The gossip_limit setting takes the form {allowed_gossips, interval_in_ms}. For example, the default setting is {45, 10000} which limits gossip to 45 gossip messages every 10 seconds.

New Ownership Claim Algorithm

The new ring ownership claim algorithm introduced as an optional setting in the 1.0 release has been set as the default for 1.1.

The new claim algorithm significantly reduces the amount of ownerhip shuffling for clusters with more than N+2 nodes in them. Changes to the cluster membership code in 1.0 made the original claim algorithm fall back to just reordering nodes in sequence in more cases than it used to, causing massive handoff activity on larger clusters. Mixed clusters including pre-1.0 nodes will still use the original algorithm until all nodes are upgraded.

NOTE: The semantics of the new algorithm are different than the old algorithm when the number of nodes in the cluster is less than the target_n_val setting. When the number of nodes in your cluster is less than target_n_val, Riak will not guarantee that each of your replicas are on distinct physical nodes. By default, Riak is setup with a default replication value N=3 and a target_n_val of 4. As such, you must have at least a 4-node cluster in order to have your replicas placed on different physical nodes. If you want to deploy a 3-replica, 3-node cluster, you can override the target_n_val setting in the riak_core section of app.config:

{target_n_val, 3}

In practice, you should set target_n_val to match the largest replication value you plan to use across your entire cluster.

Riak KV Improvements

Listkeys Backpressure

Backpressure has been added to listkeys to prevent the node listing keys from being overwhelemed. The change has required a protocol change so that the key lister can limit the rate it receives data.

In mixed clusters where some of the nodes are < 1.1 please set listkeys_backpressure false in the riak_kv section of app.config until all nodes are upgraded.

{listkeys_backpressure, false}

Once all nodes are upgraded, set listkeys_backpressue to true in the riak_kv section of app.config

{listkeys_backpressure, true}

Running nodes can be upgraded without restarting by running this snippet from the riak console

application:set_env(riak_kv, listkeys_backpressure, true).

Fresh 1.1.0 and above installs default to using listkeys backpressure - adjust app.config if different behavior is desired.

Don’t drop post-commit errors on floor

In previous releases there is no easy way to determine if a post-commit hook is failing. In this release two counters have been added to riak-admin status that will indicate pre/post-commit hook failures. They are precommit_fail and postcommit_fail. By default the errors themselves are not logged. The thought is that a bad hook could cause unnecessary IO overload.

If the error needs to be discovered then Lager, Riak’s logging system, will allow you to dynamically change the logging level to debug on the function executing the hook. To do that you need to riak attach on one of the nodes and run the following.

{ok, Trace} = lager:trace_file("<path>/failing-postcommits", [{module, riak_kv_put_fsm}, {function, decode_postcommit}], debug).

This will output all post-commit errors to <path>/failing-postcommits. When you’ve got enough samples you can stop the trace like so.


Other Additions

Default small_vclock to be equal to big_vclock

If you are using bidirectional cluster replication and you have overridden the defaults for either of these then you should consider setting both to the same value.

The default value of small_vclock has been changed to be equal to big_vclock in order to delay or even prevent unnecessary sibling creation in a Riak deployment with bidirectional cluster replication. When you replicate a pruned vector clock the other cluster will think it isn’t a descendent, even though it is, and create a sibling. By raising small_vclock to match big_vclock you reduce the frequency of pruning and thus siblings. Combined with vnode vclocks, sibling creation, for this particular reason, may be entirely avoided since the number of entries will almost always stay below the threshold in a well behaved cluster (i.e. one not under constant node membership change or network partitions).

Known Issues

-Luwak has been deprecated in the 1.1 release -bz1160 - Bitcask fails to merge on corrupt file

Innostore on 32-bit systems

Depending on the partition count and the default stack size of the operating system, using Innostore as the backend on 32-bit systems can cause problems. A symptom of this problem would be a message similar to this on the console or in the Riak error log:

10:22:38.719 [error] Failed to start riak_kv_innostore_backend for index 1415829711164312202009819681693899175291684651008. Reason: {eagain,[{erlang,open_port,[{spawn,innostore_drv},[binary]]},{innostore,connect,0},{riak_kv_innostore_backend,start,2},{riak_kv_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}
10:22:38.871 [notice] "backend module failed to start."

The workaround for this problem is to reduce the stack size for the user Riak is run as (riak for most distributions). For example on Linux, the current stack size can be viewed using ulimit -s and can be altered by adding entries to the /etc/security/limits.conf file such as these:

riak             soft    stack           1024
riak             hard    stack           1024

Ownership Handoff Stall

It has been observed that 1.1.0 clusters can end up in a state with ownership handoff failing to complete. This state should not happen under normal circumstances, but has been observed in cases where Riak nodes crashed due to unrelated issues (eg. running out of disk space/memory) during cluster membership changes. The behavior can be identified by looking at the output of riak-admin ring_status and looking for transfers that are waiting on an empty set. As an example:

============================== Ownership Handoff ==============================
Owner:      riak@host1
Next Owner: riak@host2

Index: 123456789123456789123456789123456789123456789123
  Waiting on: []
  Complete:   [riak_kv_vnode,riak_pipe_vnode]

To fix this issue, copy/paste the following code sequence into an Erlang shell for each Owner node in riak-admin ring_status for which this case has been identified. The Erlang shell can be accessed with riak attach

fun() ->
  Node = node(),
  {_Claimant, _RingReady, _Down, _MarkedDown, Changes} =
  Stuck =
      [{Idx, Complete} || {{Owner, NextOwner}, Transfers} <- Changes,
                          {Idx, Waiting, Complete, Status} <- Transfers,
                          Owner =:= Node,
                          Waiting =:= [],
                          Status =:= awaiting],

  RT = fun(Ring, _) ->
           NewRing = lists:foldl(fun({Idx, Mods}, Ring1) ->
                         lists:foldl(fun(Mod, Ring2) ->
                             riak_core_ring:handoff_complete(Ring2, Idx, Mod)
                         end, Ring1, Mods)
                     end, Ring, Stuck),
           {new_ring, NewRing}

  case Stuck of
      [] ->
      _ ->
          riak_core_ring_manager:ring_trans(RT, []),

Bugs Fixed

-bz775 - Start-up script does not recreate /var/run/riak -bz1283 - erlang_js uses non-thread-safe driver function -bz1333 - Bitcask attempts to open backup/other files -Poolboy - Lots of potential bugs fixed, see detailed post by Andrew Thompson

Lager Specific Bugs Fixed

  • #26 - don’t make a crash log called ‘undefined’
  • #28 - R13A support
  • #29 - Don’t unnecessarily quote atoms
  • #31 - Better crash reports for proc_lib processes
  • #33 - Don’t assume supervisor children are named with atoms
  • #35 - Support printing bitstrings (binaries with trailing bits)
  • #37 - Don’t generate dynamic atoms