
\chapter{Riak}

\begin{wrapfigure}{l}{0.4\textwidth}
  \vspace{-75pt}
  \begin{center}
    \includegraphics[width=0.38\textwidth]{riak.png}
  \end{center}
  \vspace{-30pt}
\end{wrapfigure}
Riak is a NoSQL database implementing the principles from Amazon's Dynamo paper.
Riak has a pluggable backend for its core shard-partitioned storage, with the default storage backend being Bitcask as of the 0.12 release. Riak also has built-in MapReduce with native support for both JavaScript (using the SpiderMonkey runtime) and Erlang, while supporting a variety of additional language drivers such as Python, Java, PHP and Ruby.

\section{Short specification}

\begin{itemize}
  \item \textbf{Initial release:} 2009
  \item \textbf{Written in:} Erlang \& C, some JavaScript
  \item \textbf{Main point:} Fault tolerance
  \item \textbf{License:} Apache License 2.0
  \item \textbf{Protocol:} HTTP/REST or custom binary
  \item \textbf{Web site:} \href{http://wiki.basho.com/}{wiki.basho.com}
\end{itemize}

\section{Main features}

\begin{figure}[hb]
  \centering
  \includegraphics[width=1\textwidth]{riak_lamp.jpeg}
\end{figure}

\subsection{MapReduce}

Riak's MapReduce implementation accepts jobs written in either JavaScript or Erlang. JavaScript is the more suitable choice for ad hoc queries. Erlang code must be known to all physical nodes in the cluster before you can use it, but in exchange it comes with some performance benefits.
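
As a rough illustration of the ad hoc JavaScript style, the following sketch builds the JSON body of a MapReduce job in the form accepted by Riak's HTTP \texttt{/mapred} endpoint. The bucket name \texttt{orders} and the \texttt{total} field are hypothetical, and nothing is sent over the network here.

```python
import json


def build_mapreduce_job(bucket):
    """Return the JSON body of a JavaScript MapReduce job over one bucket."""
    # Map phase: ad hoc JavaScript source travels with the request itself.
    # The "total" field is a hypothetical attribute of the stored JSON objects.
    map_source = (
        "function(value) {"
        "  var doc = JSON.parse(value.values[0].data);"
        "  return [doc.total];"
        "}"
    )
    job = {
        "inputs": bucket,  # run over every key in the bucket
        "query": [
            {"map": {"language": "javascript", "source": map_source}},
            # Reduce phase: one of Riak's built-in JavaScript reducers.
            {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}},
        ],
    }
    return json.dumps(job)


body = build_mapreduce_job("orders")
print(body)
```

The resulting string would be sent as the body of a \texttt{POST} request with content type \texttt{application/json}. An Erlang phase would name a module and function instead of carrying source code, which is why that code has to be present on every node beforehand.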

\subsection{Links}

Linking objects is one of the neat additions of Riak over and above Dynamo. You can create logical trees or even graphs of objects. If you fancy object-oriented programming, this can be used as the equivalent of object associations.

Riak's HTTP API offers a simple way to fetch linked objects through an arbitrary number of links. When you request a single object, you attach one or more additional parameters to the URL, specifying the target bucket, the tag and whether you would like the linked object to be included in the response.
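
The URL scheme described above can be sketched as a small helper that assembles a link-walking path. Each step is a \texttt{bucket,tag,keep} triple, with an underscore acting as a wildcard. The bucket \texttt{people}, the key \texttt{alice} and the tag \texttt{friend} are made-up names for illustration only.

```python
from urllib.parse import quote


def link_walk_path(bucket, key, steps):
    """Build a link-walking path for Riak's HTTP API.

    steps is a list of (target_bucket, tag, keep) triples. None stands for
    the "_" wildcard; keep may be True, False or None.
    """
    def part(value):
        if value is None:
            return "_"                      # wildcard / default
        if isinstance(value, bool):
            return "1" if value else "0"    # include this step's objects or not
        return quote(str(value), safe="")

    path = "/riak/{}/{}".format(quote(bucket, safe=""), quote(key, safe=""))
    for target_bucket, tag, keep in steps:
        path += "/" + ",".join(part(v) for v in (target_bucket, tag, keep))
    return path


# Friends of alice, returned in the response.
print(link_walk_path("people", "alice", [("people", "friend", True)]))
# Friends of friends: skip the first hop, follow any link on the second.
print(link_walk_path("people", "alice", [(None, "friend", False), (None, None, None)]))
```

Chaining more triples onto the path follows more links, which is how a single request can walk through an arbitrary number of them.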

\subsection{Search}

Riak Search was heavily inspired by Apache Lucene, which is close to being the de facto standard for full-text search. The similarities lie mostly in the interface: Riak Search offers a Solr-like HTTP interface and a Lucene-style query syntax, but it does not support the full Lucene query set yet.
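
To give an impression of the Solr-like interface, the sketch below assembles a query URL against the \texttt{/solr/INDEX/select} path with a Lucene-style query string. The host, the index name \texttt{books} and the field names are assumptions made for the example, and the request is only built, not sent.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def search_url(host, index, query, rows=10):
    """Build a Solr-style select URL for a Riak Search index."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "http://{}/solr/{}/select?{}".format(host, quote(index, safe=""), params)


url = search_url("localhost:8098", "books", "title:riak AND author:basho")
print(url)

# Decoding the query string again shows the Lucene-style query unchanged.
print(parse_qs(urlsplit(url).query)["q"][0])
```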

\subsection{Secondary Indexes}

Secondary indexes (2i for short) are a very recent addition to Riak. They offer a much simpler way than Riak Search to do reverse lookups on data stored in Riak. Instead of having Riak analyze and tokenize the data in documents (JSON, XML, text), 2i relies on the application to provide the indexing data as key-value pairs when storing an object. Because 2i does no tokenization and does not allow partial matches the way full-text search does, it is much less computationally expensive.

Computation is instead pushed into the application layer, which tags objects stored in Riak with the data it wants to run queries on. Riak 2i can index integers and binary values like strings, and everything is stored in lowercase. So instead of indexing an object's content, you add some metadata to it, which Riak can then use for index lookups. Like Riak Search, 2i only returns keys for you to use. It doesn't do any object lookups, but you can feed the results into MapReduce to fetch the objects in the same step.
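
A minimal sketch of this tagging, assuming the HTTP API: index entries travel as \texttt{x-riak-index-} headers on the write, with a \texttt{\_bin} suffix for binary values and \texttt{\_int} for integers, and are later queried by exact value or by range. The bucket \texttt{users} and the fields \texttt{email} and \texttt{age} are hypothetical.

```python
def index_headers(indexes):
    """Turn {field: value} into the 2i headers sent along with a write."""
    headers = {}
    for field, value in indexes.items():
        # Integers go into an _int index, everything else into a _bin index.
        suffix = "_int" if isinstance(value, int) else "_bin"
        headers["x-riak-index-" + field.lower() + suffix] = str(value)
    return headers


def index_query_path(bucket, field, start, end=None):
    """Build the path for an exact-match (start only) or range (start, end) query."""
    suffix = "_int" if isinstance(start, int) else "_bin"
    path = "/buckets/{}/index/{}{}/{}".format(bucket, field.lower(), suffix, start)
    if end is not None:
        path += "/{}".format(end)
    return path


print(index_headers({"email": "alice@example.com", "age": 30}))
print(index_query_path("users", "email", "alice@example.com"))
print(index_query_path("users", "age", 18, 30))
```

The application decides which fields to tag. Riak never inspects the object body for this, which is where the lower cost compared to full-text search comes from.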

\subsection{Cluster with Failover}

Riak's default mode of operation is in a cluster. A Riak cluster is generally run on a set of well-connected physical hosts. Each host in the cluster runs one Riak node. Each Riak node runs a set of virtual nodes, or <<vnodes>>, that are each responsible for storing a separate portion of the key space. Nodes are not clones of each other, nor do they all participate in fulfilling every request. The extent to which data is replicated, and when, and with what merge strategy and failure model, is configurable at runtime.
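
The division of the key space among vnodes can be sketched with a simplified hash ring. This is not Riak's actual implementation: the ring size of 64 partitions, the round-robin assignment, the way the bucket and key are hashed, and the replication factor of 3 are all assumptions chosen to keep the example short.

```python
import hashlib

RING_SIZE = 2 ** 160      # size of the SHA-1 output space
NUM_PARTITIONS = 64       # assumed number of vnodes in the ring
N_VAL = 3                 # assumed number of replicas per object


def build_ring(nodes):
    """Assign partitions to physical nodes round-robin (index -> node name)."""
    return [nodes[i % len(nodes)] for i in range(NUM_PARTITIONS)]


def partition_for(bucket, key):
    """Hash bucket/key onto the ring and return the partition index."""
    digest = hashlib.sha1("{}/{}".format(bucket, key).encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    return position // (RING_SIZE // NUM_PARTITIONS)


def preference_list(ring, bucket, key, n_val=N_VAL):
    """Return the (partition, node) pairs that hold replicas of the object."""
    first = partition_for(bucket, key)
    return [
        ((first + i) % NUM_PARTITIONS, ring[(first + i) % NUM_PARTITIONS])
        for i in range(n_val)
    ]


ring = build_ring(["riak1", "riak2", "riak3", "riak4"])
for partition, node in preference_list(ring, "users", "alice"):
    print(partition, node)
```

Because consecutive partitions belong to different hosts, each replica of an object lands on a different physical node, and no single host has to take part in every request.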

\subsection{Eventual Consistency}

In a distributed and fault-tolerant environment like Riak, the unexpected is expected. Nodes may leave and join the cluster at any time, be it by accident (node failure, network partition, etc.) or on purpose, e.g. when a node is explicitly removed from a Riak cluster. Even with one or more nodes down, the cluster is still expected to accept writes and serve reads, and the system is expected to return the same data from all nodes eventually, including the failed ones once they have rejoined the cluster.
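
The behaviour described above can be illustrated with a toy simulation of three replicas, one of which misses a write while it is down and catches up when the value is read after it rejoins. The integer version counter and the repair-on-read strategy are deliberate simplifications and should not be read as Riak's actual mechanism. All class and function names are made up for this example.

```python
class Replica:
    """One copy of the data; 'up' models whether the node is reachable."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}  # key -> (version, value)


def write(replicas, key, value, version):
    """Store the value on every reachable replica; return the number of acks."""
    acked = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = (version, value)
            acked += 1
    return acked


def read_with_repair(replicas, key):
    """Return the newest reachable value and copy it to stale replicas."""
    reachable = [r for r in replicas if r.up]
    answers = [r.data[key] for r in reachable if key in r.data]
    if not answers:
        return None
    newest = max(answers, key=lambda answer: answer[0])
    for replica in reachable:
        if replica.data.get(key) != newest:
            replica.data[key] = newest  # bring the stale replica up to date
    return newest[1]


a, b, c = Replica("a"), Replica("b"), Replica("c")
cluster = [a, b, c]

write(cluster, "cart", ["book"], version=1)
c.up = False                                   # node c fails
print(write(cluster, "cart", ["book", "pen"], version=2))  # still accepted
c.up = True                                    # node c rejoins with stale data
print(c.data["cart"])
print(read_with_repair(cluster, "cart"))
print(c.data["cart"])                          # now matches the other replicas
```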

\section{Strengths}

If you want to design a large-scale ordering system like Amazon's, or in any situation where high availability is the primary concern, you should consider Riak. One of Riak's strengths lies in its focus on removing single points of failure in an attempt to support maximum uptime and to grow (or shrink) to meet changing demands. If you do not have complex data, Riak keeps things simple but still allows for some pretty sophisticated data diving should you need it.\cite{seven_databases}

\section{Weaknesses}

If you require simple querying, complex data structures, or a rigid schema, or if you have no need to scale horizontally with your servers, Riak is probably not your best choice. One of our major gripes about Riak is that it still lags in terms of an easy and robust ad hoc querying framework, although it is certainly on the right track. MapReduce provides fantastic and powerful functionality, but we'd like to see more built-in URL-based or other PUT query actions. The addition of indexing was a major step in the right direction and a concept we'd love to see expanded upon. Finally, if you don't want to write Erlang, you may see a few limitations using JavaScript, such as the unavailability of post-commit hooks or slower MapReduce execution.\cite{seven_databases}

\section{Tips}

\section{Use cases}

\begin{itemize}
  \item Session Storage
  \item User Data storage
  \item S3-like services
  \item Cloud infrastructure
  \item Scalable, low-latency storage for mobile apps
  \item Critical Data Storage
  \item Building Block for custom-built distributed systems
\end{itemize}