# NoSQL

## Go Beyond Relational Model
**Pros of Relational DBs**
1. Simple, can capture nearly any business use case
2. Can integrate multiple applications via shared data store
3. Standard interface language SQL
4. ad-hoc queries across and within data aggregates
5. Fast, reliable, concurrent, consistent

**Cons of Relational DBs**
1. Object Relational (OR) impedance mismatch
2. not good with big data
3. not good with clustered/replicated servers

Adoption of NoSQL is driven by the cons of relational DBs (e.g. a lot of work to disassemble and reassemble the aggregate)

But "Polyglot persistence" => Relational will not go away

## Big Data

### Definition & 3Vs
Data that exist in very large volumes and many different varieties (data types) and that need to be processed at a very high velocity (speed)

- **Volume**-Much larger quantity of data than typical for relational DBs
- **Variety**-Lots of different data types and formats
- **Velocity**-data comes at very fast rate (e.g. mobile sensors, web click stream)

### Schema on write vs. schema on read
**Schema on Write**-preexisting data model, how traditional databases are designed (relational databases)


**Schema on Read**-data model determined later, depends on how you want to use it (XML, JSON); capture and store the data, and worry about how you want to use it later.
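
A minimal sketch of the difference, using only the Python standard library. The `reading` table, the sensor records, and `read_temperatures` are made-up names for illustration, not part of any particular product.

```python
import json
import sqlite3

# Schema on write: the table definition exists before any data arrives,
# and the database rejects rows that do not fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (sensor_id TEXT NOT NULL, temp_c REAL NOT NULL)")
conn.execute("INSERT INTO reading VALUES (?, ?)", ("s1", 21.5))
try:
    conn.execute("INSERT INTO reading VALUES (?, ?)", ("s2", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # a missing temperature violates the predefined schema

# Schema on read: capture raw records as they come, decide on structure later.
raw_store = [
    '{"sensor_id": "s1", "temp_c": 21.5}',
    '{"sensor_id": "s2", "humidity": 0.4}',
    '{"sensor_id": "s3", "temp_c": 19.0, "battery": "low"}',
]

def read_temperatures(lines):
    """Interpret stored records as temperature readings, skipping the rest."""
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            yield record["sensor_id"], record["temp_c"]

temps = list(read_temperatures(raw_store))
```

With schema on write the bad row never gets in; with schema on read everything is stored and the reader decides which records count as temperature readings.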

<center>
 <img src="./figures/week11-1.png" width = "400" alt="figure" align=center />
</center>


### Data Lake
A large integrated repository for internal and external data that does not follow a predefined schema.

Capture everything, dive in anywhere, flexible access


## NoSQL database

### Properties
1. Doesn't use relational model or SQL language
2. Runs well on distributed servers
3. Most are open-source
4. Built for the modern web
5. Schema-less (though there may be some implicit schema); supports schema on read
6. Not ACID compliant (can't guarantee instant transfer)
7. Eventually consistent

**Goal:** to improve programmer productivity (OR mismatch); to handle larger data volumes and throughput (big data)

<center>
 <img src="./figures/week11-4.png" width = "450" alt="figure" align=center />
</center>

Comparison Between NoSQL and Relational DBs

NoSQL DBs need to be available all the time. Facebook and Twitter need to ensure that users can access and post to their sites and apps all the time, even if the data is not consistent. For example, users expect to always see the number of likes on a post (or the retweet count on a tweet); they do not mind if that number is slightly outdated.

Relational DBs sacrifice availability for consistency. This means the DBs need to be consistent all the time for all the users. If there is potential for inconsistencies (such as two people booking the same seats to watch a movie), the database will sometimes be unavailable.

### Types of NoSQL databases

<center>
 <img src="./figures/week11-2.png" width = "400" alt="figure" align=center />
</center>

#### Key-value stores
**Key**=Primary key

**Value**=anything (number, array, image, JSON); The application is in charge of interpreting what it means (most flexible and least structured)

**Operations**-Put (store), Get, Update

**Examples:** Riak, Redis, Memcached, BerkeleyDB, HamsterDB, Amazon DynamoDB, Project Voldemort, Couchbase
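
A toy in-memory sketch of the three operations above. `KeyValueStore` and the `user:42` key are hypothetical names; real key-value stores add persistence, distribution, and expiry on top of this idea. The point is that the store only sees opaque bytes and the application decides what they mean.

```python
import json

class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key) -> bytes:
        return self._data[key]

    def update(self, key, value: bytes):
        if key not in self._data:
            raise KeyError(key)
        # The whole value is replaced; the store cannot change part of it.
        self._data[key] = value

store = KeyValueStore()
store.put("user:42", json.dumps({"name": "Ada", "visits": 1}).encode())

# The application, not the store, knows the bytes are JSON.
profile = json.loads(store.get("user:42"))
profile["visits"] += 1
store.update("user:42", json.dumps(profile).encode())
```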

#### Document databases
Similar to a key-value store, except that the document is examinable by the database, so its content can be queried and parts of it updated

**Document**=JSON file

**Examples:** MongoDB, CouchDB, Terrastore, OrientDB, RavenDB

MongoDB example
<center>
 <img src="./figures/week11-3.png" width = "400" alt="figure" align=center />
</center>
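
To contrast with a key-value store, here is a plain-Python sketch of a store that can look inside its documents. `DocumentCollection`, `find`, and `update_field` are illustrative names, not the API of MongoDB or any other product.

```python
class DocumentCollection:
    """Toy document store: documents are dicts the store can inspect."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = document

    def find(self, **criteria):
        # Query by content: return documents whose fields match all criteria.
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]

    def update_field(self, doc_id, field, value):
        # Update one part of a document without rewriting the rest.
        self._docs[doc_id][field] = value

orders = DocumentCollection()
orders.insert(1, {"customer": "Ada", "status": "open", "total": 30.0})
orders.insert(2, {"customer": "Bob", "status": "shipped", "total": 12.5})
orders.insert(3, {"customer": "Ada", "status": "shipped", "total": 8.0})

ada_orders = orders.find(customer="Ada")
orders.update_field(1, "status", "shipped")
shipped = orders.find(status="shipped")
```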

**Aggregate-oriented**

Key-value, document, and column-family stores are aggregate-oriented databases: they store a business object in its entirety.

**Pros**
1. entire aggregate of data is stored together (no need for transactions)
2. efficient storage on clusters/distributed databases

**Cons**
1. hard to analyse across subfields of aggregates (e.g. sum over products instead of orders)
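
The trade-off can be sketched with a dictionary of order aggregates (the order data and function names are made up). Reading one order is a single lookup, but a per-product total has to open every aggregate.

```python
from collections import defaultdict

# Each order is stored whole, line items included (one aggregate per key).
orders = {
    "order:1": {"customer": "Ada", "items": [
        {"product": "pen", "qty": 2, "price": 1.5},
        {"product": "notebook", "qty": 1, "price": 4.0},
    ]},
    "order:2": {"customer": "Bob", "items": [
        {"product": "pen", "qty": 10, "price": 1.5},
    ]},
}

def order_total(order_id):
    """Cheap: everything needed lives in one aggregate."""
    order = orders[order_id]
    return sum(item["qty"] * item["price"] for item in order["items"])

def revenue_by_product():
    """Expensive: must scan every order and dig into its line items."""
    totals = defaultdict(float)
    for order in orders.values():
        for item in order["items"]:
            totals[item["product"]] += item["qty"] * item["price"]
    return dict(totals)
```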


#### Column families
Columns rather than rows are stored together on disk. This is like automatic vertical partitioning.

Makes analysis faster, as less data is fetched

Related columns grouped together into families

**Examples:**  Cassandra, BigTable, HBase (Facebook, Netflix, Twitter)
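
A rough sketch of why storing columns together means less data is fetched for analysis. It models only the layout idea described above, with made-up data; real column-family stores such as those listed are considerably more involved.

```python
# Row-oriented layout: each record's fields are stored together.
rows = [
    {"id": 1, "name": "Ada", "city": "London", "age": 36},
    {"id": 2, "name": "Bob", "city": "Paris", "age": 41},
    {"id": 3, "name": "Cy", "city": "Oslo", "age": 29},
]

# Column-oriented layout: each field's values are stored together.
columns = {field: [row[field] for row in rows] for field in rows[0]}

def average_age_row_store(rows):
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # the whole row is fetched to get one field
        total += row["age"]
    return total / len(rows), values_read

def average_age_column_store(columns):
    ages = columns["age"]  # only the one column is fetched
    return sum(ages) / len(ages), len(ages)

row_avg, row_read = average_age_row_store(rows)
col_avg, col_read = average_age_column_store(columns)
```

Both functions compute the same average, but the row layout touches every field of every record while the column layout touches only the `age` values.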

#### Graph databases
A graph is a network of nodes and arcs. Graphs are difficult to program in a relational DB.

A graph DB stores entities and their relationships; Graph queries deduce knowledge from the graph

**Examples:** Neo4j (Airbnb, Microsoft), InfiniteGraph, OrientDB, FlockDB, TAO
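
As a small illustration of a graph query, the sketch below stores entities as nodes and relationships as arcs, then derives "friends of friends". The people, the `FRIEND` relation, and the helper names are invented; a real graph DB would express this in its own query language.

```python
# Entities (nodes with properties) and relationships (arcs).
nodes = {
    "ada": {"city": "London"},
    "bob": {"city": "Paris"},
    "cy": {"city": "Oslo"},
    "dee": {"city": "Rome"},
    "eve": {"city": "London"},
}
edges = [
    ("ada", "FRIEND", "bob"),
    ("bob", "FRIEND", "cy"),
    ("bob", "FRIEND", "dee"),
    ("ada", "FRIEND", "eve"),
    ("eve", "FRIEND", "cy"),
]

def neighbours(node, relation):
    """Nodes connected to `node` by `relation`, treating arcs as undirected."""
    result = set()
    for source, rel, target in edges:
        if rel != relation:
            continue
        if source == node:
            result.add(target)
        elif target == node:
            result.add(source)
    return result

def friends_of_friends(person):
    """People two FRIEND hops away who are not already direct friends."""
    direct = neighbours(person, "FRIEND")
    two_hops = set()
    for friend in direct:
        two_hops |= neighbours(friend, "FRIEND")
    return two_hops - direct - {person}

suggestions = friends_of_friends("ada")
```

The same query in a relational DB would need a self-join on a friendship table for every extra hop.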

#### Summary
1. Key-value stores<br/>
A simple pair of a key and an associated collection of values. Key is usually a string. The database has no knowledge of the structure or meaning of the values
2. Document stores<br/>
Like a key-value store, but a document goes further than an opaque value. The document is structured, so specific elements can be manipulated separately.
3. Column-family stores<br/>
Data is grouped in column groups/families for efficiency reasons.
4. Graph-oriented databases<br/>
Maintain information regarding the relationships between data items. Data items are nodes with properties.


## CAP theorem restated
Fowler's version: If you have a distributed database, when a partition occurs, you must then choose consistency OR availability. (Every NoSQL database is a distributed database; most NoSQL databases choose availability.)
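
The choice can be sketched with a toy two-replica store. `TwoNodeStore` and its `"CP"`/`"AP"` modes are hypothetical, and the partition is just a flag, so this only illustrates the decision, not a real replication protocol.

```python
class TwoNodeStore:
    """Toy store with two replicas and a switchable network partition."""

    def __init__(self, mode):
        self.mode = mode          # "CP": prefer consistency, "AP": prefer availability
        self.replicas = [0, 0]    # the value held by each replica
        self.partitioned = False

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: both replicas are updated together.
            self.replicas = [value, value]
        elif self.mode == "CP":
            # Refuse the write rather than let the replicas disagree.
            raise RuntimeError("unavailable: other replica unreachable")
        else:
            # Accept the write locally; the replicas now diverge.
            self.replicas[node] = value

    def read(self, node):
        return self.replicas[node]

cp = TwoNodeStore("CP")
cp.partitioned = True
try:
    cp.write(0, 5)
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

ap = TwoNodeStore("AP")
ap.partitioned = True
ap.write(0, 5)
ap_views = (ap.read(0), ap.read(1))
```

During the partition the CP store stays consistent but rejects the write, while the AP store answers every request but shows different values depending on which replica is asked.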

### ACID vs BASE
**ACID**-Atomic, Consistent, Isolated, Durable

**BASE**-Basically Available, Soft state, Eventual consistency

**Basically Available**: This constraint states that the system does guarantee the availability of the data; there will be a response to any request. But data may be in an inconsistent or changing state.

**Soft state**: The state of the system could change over time; even during times without input, there may be changes going on due to eventual consistency.

**Eventual Consistency**: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it needs to, sooner or later, but the system will continue to receive input and does not check the consistency of every transaction before it moves on to the next one.
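
A minimal sketch of these three properties, modelled on a replicated "likes" counter. `EventuallyConsistentCounter` and its queue of pending updates are assumptions made for illustration; real systems propagate updates over a network and must handle failures and conflicts.

```python
from collections import deque

class EventuallyConsistentCounter:
    """Toy replicated counter: writes apply locally, then propagate later."""

    def __init__(self, replica_count):
        self.replicas = [0] * replica_count
        self.pending = deque()  # (target replica, delta) not yet delivered

    def increment(self, node):
        # Basically available: the local replica always accepts the write.
        self.replicas[node] += 1
        for other in range(len(self.replicas)):
            if other != node:
                self.pending.append((other, 1))

    def read(self, node):
        # Always answers, possibly with a slightly outdated count.
        return self.replicas[node]

    def deliver_all(self):
        # Soft state: replicas keep changing even with no new input.
        while self.pending:
            target, delta = self.pending.popleft()
            self.replicas[target] += delta

likes = EventuallyConsistentCounter(3)
likes.increment(0)
likes.increment(1)
stale_view = [likes.read(i) for i in range(3)]

likes.deliver_all()
converged_view = [likes.read(i) for i in range(3)]
```

Before delivery the replicas disagree about the count; once input stops and the pending updates drain, every replica reports the same number.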

