# Topic Proposal

## Introducing and Comparing NoSQL Database Management Systems

### Abstract

With data being generated and accumulated at an exponential rate from sources as varied as mobile devices, mechanical sensors, financial transactions and satellite imaging, data management becomes more challenging every day. The collected raw data come in a wide range of forms, and most of them are unstructured: images, video, audio and so on. Integrating and storing data is the first stage of the data processing pipeline and already an essential one, since the consistency, accessibility and speed of inserting and retrieving data can make a huge difference to the subsequent data manipulation tasks, especially in fields such as finance and social networks. It is therefore important to choose the right database for the data format, the data volume and the business focus at hand, for example whether velocity or consistency matters more from the client's perspective.

The most widely used database management systems are relational DBMSs, which are queried with SQL (e.g. MySQL, SQL Server), and NoSQL ("Not Only SQL") systems. Relational DBMSs are limited by their strong focus on consistency and reliability through ACID guarantees (Atomicity, Consistency, Isolation and Durability), which makes them difficult to scale; this is one of the main reasons why NoSQL is often considered a better fit than SQL. To balance availability and consistency, a NoSQL system can achieve higher availability by slightly sacrificing consistency.

Conversely, NoSQL can also reduce its level of availability to achieve higher consistency. However, there are still many problems that NoSQL databases cannot solve. As a result, providers have produced CDBMSs (Cloud Database Management Systems) with distinct data models to satisfy different requirements, although each of them still has its own imperfections.



By design, NoSQL databases and management systems are relation-less (or schema-less). They are not based on a single model (such as the relational model of RDBMSs); instead, each database adopts a different model depending on its target functionality.

There are a handful of different operational models and functioning systems for NoSQL databases.

### Goal

The goal of the proposed paper is to give a precise introduction to NoSQL DBMSs and to compare three of the systems most frequently used in industry today, Cassandra, HBase and MongoDB, mainly from four aspects: their mechanisms, important features, advantages and existing problems.

### Proposed Outline

#### NoSQL Data Models



    
   
- Key/Value: e.g. Redis, MemcacheDB, etc.
- Column: e.g. Cassandra, HBase, etc.
- Document: e.g. MongoDB, Couchbase, etc.
- Graph: e.g. OrientDB, Neo4J, etc.

The following systems are introduced:

- Cassandra
- MongoDB
- HBase

with respect to their:

- mechanism
- features
- advantages
- disadvantages


- Comparison

This section compares CDBMSs that share the same data model and discusses their advantages and disadvantages. The comparison builds on several features of NoSQL databases, such as partitioning and replication.


Cassandra vs. HBase

- Data model itself

Even though both Cassandra and HBase inherit their data model from BigTable, there are still some differences between them. In Cassandra, keyspaces are the containers of data; each keyspace holds many column families, which in turn contain columns. A set of columns is identified by a row key, and each row in a column family can have a different set of columns. In HBase, data is stored in tables made of rows and columns; columns belong to column families, and the columns of one family are stored together. Because of these differences, database designs for the two systems may differ. Cassandra queries perform best when they are made on the row key, whereas HBase favors queries on a column family. Cassandra is also geared more toward writes than reads, while HBase handles both writes and reads conveniently.
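
The nesting described above can be pictured with plain Python dictionaries. This is only a sketch: the column family, row keys and column names (`users`, `alice`, `contact`, and so on) are made up for illustration, and neither system stores its data as Python objects.

```python
# Hypothetical data; every name below is illustrative only.

# Cassandra: keyspace -> column family -> row key -> {column: value}
cassandra_keyspace = {
    "users": {  # column family
        "alice": {"email": "alice@example.com", "city": "Berlin"},
        "bob": {"email": "bob@example.com", "phone": "555-0100", "age": 41},
    }
}

# HBase: table -> row key -> column family -> {column qualifier: value}
hbase_table = {
    "alice": {
        "contact": {"email": "alice@example.com"},
        "profile": {"city": "Berlin"},
    },
    "bob": {
        "contact": {"email": "bob@example.com", "phone": "555-0100"},
        "profile": {"age": 41},
    },
}


def columns_of(row):
    """Return the sorted column names stored for one Cassandra row."""
    return sorted(row)


def read_family(table, family):
    """Collect one column family across all rows of an HBase-style table."""
    return {row_key: families.get(family, {}) for row_key, families in table.items()}


users = cassandra_keyspace["users"]
print(columns_of(users["alice"]))  # rows in one column family ...
print(columns_of(users["bob"]))    # ... may carry different columns
print(read_family(hbase_table, "contact"))
```

The two `columns_of` calls show that rows of the same column family need not share a column set, while `read_family` mimics a read that touches only one column family.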

- Architecture

Cassandra's distributed design is modeled on Amazon's Dynamo, so its architecture is peer-to-peer (P2P) and every node is equal. In contrast, HBase uses a master-slave architecture in which each HMaster manages many slave nodes (RegionServers) that serve the table regions. As a consequence, Cassandra achieves higher availability than HBase.

- Partitioning

Cassandra has two partitioning strategies: random partitioning and ordered partitioning. Random partitioning is the default and recommended strategy; as a result, scanning ranges of rows is complicated and queries on partial row keys are not permitted in Cassandra. In contrast, HBase supports only ordered partitioning, so HBase queries can be formulated with partial start and end row keys. In addition, because of the different partitioning strategies, HBase can support some simple aggregates, whereas Cassandra cannot.
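
A minimal sketch of why the partitioning strategy matters for scans, assuming a hypothetical three-node cluster and made-up row keys. MD5 is used only as a stand-in hash function, and `random_partition` and `prefix_scan` are illustrative helpers, not part of either system's API.

```python
import bisect
import hashlib

NODES = 3  # hypothetical cluster size


def random_partition(row_key: str) -> int:
    """Random partitioning: hash the row key to choose a node."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES


def prefix_scan(sorted_keys, prefix):
    """Ordered partitioning: a partial row key maps to one contiguous range."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


keys = ["user#001", "user#002", "order#17", "user#003", "order#42"]

# Random partitioning: keys sharing a prefix are scattered by their hash,
# so a prefix scan would have to ask every node.
placement = {key: random_partition(key) for key in keys}
print(placement)

# Ordered partitioning: keys are kept sorted, so a start/end row key
# selects a contiguous slice.
ordered = sorted(keys)
print(prefix_scan(ordered, "user#"))  # ['user#001', 'user#002', 'user#003']
```

With hashed placement the node of a key says nothing about its neighbors, which is why range scans and partial row keys are awkward; with sorted keys the same query is a single slice.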

- Replication

First, Cassandra replicates data on every write: a coordinator node captures the change and propagates it to the other nodes. Cassandra therefore relies on a high-speed, low-latency network connection. HBase, on the other hand, uses a more practical replication architecture: it captures changes in a log and puts them into a replication queue, from which the replication messages are propagated to other nodes. Second, in NoSQL databases replication is provided to achieve high availability. Cassandra uses a masterless, peer-to-peer replication strategy, which provides a high level of availability and durability, whereas HBase prefers a master-slave replication strategy, so a single point of failure has to be accepted in HBase.
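
The two propagation styles described above can be contrasted in a small in-memory sketch. The classes `Replica` and `LogShippingSource` and the function `coordinator_write` are hypothetical names chosen for illustration; real systems add acknowledgements, retries and failure handling that are left out here.

```python
from collections import deque


class Replica:
    """A node that simply stores key/value pairs."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


def coordinator_write(replicas, key, value):
    """Per-write replication: the coordinator forwards each write to every replica."""
    for replica in replicas:
        replica.apply(key, value)


class LogShippingSource:
    """Log-based replication: changes are queued and shipped to targets later."""

    def __init__(self):
        self.local = Replica("source")
        self.queue = deque()  # replication queue of change-log entries

    def write(self, key, value):
        self.local.apply(key, value)
        self.queue.append((key, value))

    def ship(self, targets):
        """Drain the queue, applying every entry to every target; return entries shipped."""
        shipped = 0
        while self.queue:
            key, value = self.queue.popleft()
            for target in targets:
                target.apply(key, value)
            shipped += 1
        return shipped


# Per-write propagation: all peers see the write immediately.
peers = [Replica("n1"), Replica("n2"), Replica("n3")]
coordinator_write(peers, "k", "v1")

# Log shipping: targets lag behind until the queue is shipped.
source = LogShippingSource()
targets = [Replica("t1"), Replica("t2")]
source.write("k", "v1")
source.write("k", "v2")
lagging = dict(targets[0].data)  # snapshot taken before shipping
shipped = source.ship(targets)
print(lagging, shipped, targets[0].data)
```

The snapshot taken before `ship` is empty, which illustrates the replication lag that a queue-based design accepts in exchange for not sending every write over the network at once.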

- Versioning

Both Cassandra and HBase use timestamps to version data and adopt the last-write-wins (LWW) approach. Both also write a tombstone marker when a deletion is requested. During compaction, LWW and the tombstones determine which data is kept, so only the newest version survives. Data versioning is an important concept for supporting concurrency.
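
Last-write-wins with tombstones can be sketched in a few lines. This is a simplified model under stated assumptions: timestamps are plain integers, `TOMBSTONE`, `write`, `delete` and `compact` are hypothetical names, and the grace period that real systems keep tombstones for is ignored.

```python
TOMBSTONE = object()  # marker stored in place of a value when a delete is requested


def write(cells, key, value, timestamp):
    """Append a new timestamped version instead of overwriting in place."""
    cells.setdefault(key, []).append((timestamp, value))


def delete(cells, key, timestamp):
    """A delete is just another write whose value is the tombstone marker."""
    write(cells, key, TOMBSTONE, timestamp)


def compact(cells):
    """Keep the newest version of each key; drop keys whose newest version is a tombstone."""
    result = {}
    for key, versions in cells.items():
        _, value = max(versions, key=lambda version: version[0])
        if value is not TOMBSTONE:
            result[key] = value
    return result


cells = {}
write(cells, "a", "old", timestamp=1)
write(cells, "a", "new", timestamp=3)
write(cells, "a", "late-arriving", timestamp=2)  # arrives last but carries an older timestamp
write(cells, "b", "x", timestamp=1)
delete(cells, "b", timestamp=5)

print(compact(cells))  # {'a': 'new'}
```

The write with timestamp 2 arrives last but loses to timestamp 3, and key `b` disappears because its newest version is a tombstone.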

- Achievable consistency levels

In terms of the CAP (Consistency, Availability, Partition Tolerance) theorem, Cassandra satisfies Availability and Partition Tolerance, while HBase satisfies Consistency and Partition Tolerance. To preserve availability, Cassandra has to trade off availability against consistency; through configuration it can provide either strong consistency or eventual consistency. HBase, however, provides only strong consistency and therefore has to sacrifice availability.
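
The configurable trade-off can be made concrete with the usual quorum rule: with N replicas, a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds N. The sketch below only checks this arithmetic; the function names are hypothetical and a replication factor of 3 is assumed for the example.

```python
def quorum(n_replicas: int) -> int:
    """Smallest majority of the replicas."""
    return n_replicas // 2 + 1


def is_strongly_consistent(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > n_replicas


N = 3  # assumed replication factor

print(is_strongly_consistent(N, quorum(N), quorum(N)))  # True: 2 + 2 > 3
print(is_strongly_consistent(N, 1, 1))                  # False: 1 + 1 <= 3, eventual consistency
print(is_strongly_consistent(N, N, 1))                  # True: write to all, read from one
```

Lower acknowledgement counts keep the system available when replicas are unreachable, at the price of reads that may return stale data.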

- Support of traditional database constraints

Traditional relational databases use several constraints to limit the kind of data that can be stored in a table, including NOT NULL, DEFAULT, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK and INDEX. In Cassandra, PRIMARY KEY is supported and INDEX is provided (as secondary indexes), although the use of INDEX is not recommended; FOREIGN KEY is not supported. As a consequence, Cassandra has to use many query-specific tables to serve read requests and accept the resulting data redundancy. HBase likewise has no FOREIGN KEY constraint, so references between rows must be maintained by the application, and it does not offer INDEX.

- Level of satisfaction of ACID

ACID stands for Atomicity, Consistency, Isolation and Durability. In Cassandra: first, all individual writes are atomic at the row level. Second, even though Cassandra can provide strict consistency, this has a different scope from Consistency in ACID. Third, Cassandra does not provide transaction isolation in the ACID sense. Finally, the commit log makes updates durable. In HBase: first, mutations are likewise atomic at the row level. Second, scan operations can provide a consistent view of the data. Third, writes are isolated by locking rows, while reads are not isolated. Finally, all visible or retrievable data is durable.

- Existence of any Single Points of Failure

Single points of failure are usually associated with the master-slave replication architecture. Cassandra has no single point of failure because it does not operate under a master-slave replication strategy. This brings many advantages for availability, but it also causes extra traffic between nodes. HBase, on the other hand, does have a single point of failure because it uses a master-slave architecture.


