# NoSQL for Dummies

<img src="http://geekandpoke.typepad.com/.a/6a00d8341d3df553ef0148c80ac6ef970c-pi" style="width: 400px;"/>

---
Review
---

<details><summary>
What are the "pros" of a SQL system?
</summary>
- Much of the world can be modeled relationally  
- Community knowledge and tools  
- SQL plays well with DataFrames
- Highly optimized because it has been around so long  
- Data structure and use guarantees
</details>
<br>
<br>
<details><summary>
What are the "cons" of a SQL system?
</summary>
- Only 1 way to represent the world 
- Not all of the world can be modeled relationally    
- SQL doesn't play well with other data types: sets, key-value pairs  
- Does not easily scale to distributed environments  
- SQL doesn't handle schema changes well  
</details>

---
When is data most valuable? When do you know the most about your data?
----

<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/data_value.png" style="width: 400px;"/>

----
When do you Schema?
-----

<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/old_school_stack.png" style="width: 400px;"/>

----
Schema-before-write -> Schema-on-write -> Schema-on-read -> Schema-on-use 
----

![](https://media.giphy.com/media/piupi6AXoUgTe/giphy.gif)

All relational database management systems (RDBMS) model data the same way with relational schemas.

NoSQL databases can model and store data in other schemas.
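
A minimal sketch of the difference between schema-on-write and schema-on-read, using only the Python standard library. The table name `users`, the field names, and the helper `read_age` are hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Schema-on-write: the table definition is enforced when data is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, age INTEGER NOT NULL)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("Ada", 36))

try:
    # Violates the NOT NULL constraint, so the write itself is rejected.
    conn.execute("INSERT INTO users VALUES (?, ?)", ("Grace", None))
    write_rejected = False
except sqlite3.IntegrityError:
    write_rejected = True

# Schema-on-read: store whatever arrives, interpret it when it is read.
raw_records = [
    '{"name": "Ada", "age": 36}',
    '{"name": "Grace"}',              # missing field is accepted on write
    '{"name": "Alan", "age": "41"}',  # wrong type is accepted on write
]

def read_age(raw):
    """Apply the schema at read time: return age as int, or None if absent."""
    record = json.loads(raw)
    age = record.get("age")
    return int(age) if age is not None else None

ages = [read_age(raw) for raw in raw_records]
print(write_rejected, ages)
```

With schema-on-read the cleanup work does not disappear; it moves from the writer to every reader.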

---
What makes a NoSQL database?
---

1. Schema-flexible

2. Doesn’t use (standard) SQL as its query language. NoSQL systems usually offer a more "primitive" query language.

For example, document databases allow hierarchical non-normalized objects to be retrieved directly.

<img src="http://blog.philipphauer.de/wp-content/uploads/2015/05/Match-OO-Document.png" style="width: 400px;"/>
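
To make "retrieved directly" concrete, here is a small sketch in plain Python (no real database involved). The customer/order data and both loader functions are made up for illustration: one reassembles an object from normalized rows, the other reads the same object stored whole as a document.

```python
# Normalized (relational-style) layout: one logical object spread over tables.
customers = {1: {"name": "Ada"}}
orders = [
    {"order_id": 10, "customer_id": 1, "item": "book"},
    {"order_id": 11, "customer_id": 1, "item": "pen"},
]

def load_customer_relational(customer_id):
    """Reassemble the object by joining rows in application code."""
    customer = dict(customers[customer_id])
    customer["orders"] = [
        {"order_id": o["order_id"], "item": o["item"]}
        for o in orders
        if o["customer_id"] == customer_id
    ]
    return customer

# Document layout: the same object stored whole under one key.
documents = {
    1: {
        "name": "Ada",
        "orders": [
            {"order_id": 10, "item": "book"},
            {"order_id": 11, "item": "pen"},
        ],
    }
}

def load_customer_document(customer_id):
    """A single lookup returns the nested object directly."""
    return documents[customer_id]

same = load_customer_relational(1) == load_customer_document(1)
print(same)
```

The trade-off is that the document layout duplicates data if the same order must also appear elsewhere, which is exactly what normalization avoids.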

---
What are the killer features of NoSQL?
---

- Schema-flexible (compared to normalized tabular data)
- Better scalability
    - The best systems sit on commodity hardware (compared to Vertica, SQL Server, ...)
    - Highly distributable, typically horizontal scale-out (compared to vertical scaling of SQL)
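
Horizontal scale-out usually means spreading keys across many machines. The sketch below shows the simplest possible version, hash-based sharding with a modulo, using in-memory dictionaries as stand-ins for machines. The node names and the `put`/`get` helpers are hypothetical; this is not how any particular product does it.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical commodity machines

def node_for(key, nodes=NODES):
    """Pick a node from a stable hash of the key (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Each "node" holds only its share of the data.
cluster = {node: {} for node in NODES}

def put(key, value):
    cluster[node_for(key)][key] = value

def get(key):
    return cluster[node_for(key)].get(key)

for i in range(1000):
    put(f"user:{i}", {"id": i})

sizes = {node: len(data) for node, data in cluster.items()}
print(sizes)
```

Plain modulo sharding moves most keys when a node is added or removed, which is why distributed stores commonly use consistent hashing or a similar scheme instead.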

<!--


----
Where is the catch?
---

Remember ACID
---

![](https://sqlserverhelpdotcom.files.wordpress.com/2014/05/image.png)


---
CAP theorem for distributed systems
---

<img src="http://gaboesquivel.com/images/2013/09/cap_venn.png" style="width: 400px;"/>

The CAP theorem states that it is impossible for a distributed computer system (e.g., a web service) to simultaneously provide all 3 of the following guarantees:

__Consistency__: All nodes see the same data at the same time

__Availability__: A guarantee that every request receives a response about whether it succeeded or failed

__Partition Tolerance__: The system continues to operate despite arbitrary partitioning due to network failures

Eric Brewer 

<img src="https://upload.wikimedia.org/wikipedia/commons/a/af/TNW_Con_EU15_-_Eric_Brewer_%28scientist%29-2.jpg" style="width: 400px;"/>

CAP was a conjecture by Eric Brewer in 2000

Original Paper:  
[Towards robust distributed systems](http://dl.acm.org/citation.cfm?id=343502)

Nice Update:  
[CAP 12 years later](http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed)

---
Pick any 2
---

A distributed system can satisfy any 2 of these guarantees at the same time but not all 3

<img src="https://www.ibm.com/developerworks/community/blogs/IMSupport/resource/BLOGS_UPLOADED_IMAGES/CAP_Theorem.jpg" style="width: 400px;"/>

Consistency + Availability
---

Examples:

- Single-site databases   
- Cluster databases

Traits:

- All nodes are always in contact
- 2-phase commit
- Cache validation protocols
- No partition without blocking


Consistency + Partition Tolerance
---

Examples:

- Distributed databases
- Distributed Locking  
- Majority protocols

Traits:

- Some data may not be accessible, but the rest is still consistent/accurate
- Pessimistic locking
- Make minority partitions unavailable
  

Availability + Partition Tolerance
----

Examples:

- Web caching
- DNS

Traits:

- System available under partitioning, but some data may be inaccurate
- conflict resolution   
- optimistic
- expiration/leases (can ~5ms)

---
Example Systems
----

<img src="images/visual_cap.png" style="width: 400px;"/>

---
Challenge Question
---

<details><summary>
Where does Apache Zookeeper belong?
</summary>
CP <br>
(Consensus protocol)
</details>

----
Eventual consistency
---

<img src="http://blog.sqlauthority.com/i/b/Eventual-Consistency.png" style="width: 400px;"/>

Most businesses want this. But is it truly necessary?

<details><summary>
Money is important, so banks must use transactions to keep money safe and consistent, right?
</summary>
__NO__. ATMs are able to dispense money when not networked, allowing people to withdraw money that is not in their accounts.
</details>

[Source](http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html)

__Man Withdrawing Money From An Atm In Russia:__
<img src="http://static.boredpanda.com/blog/wp-content/uploads/2016/02/man-withdraws-cash-from-atm-in-thailand-the-internet-responds-15__605.gif" style="width: 400px;"/>

__Man Withdrawing Cash From An Atm In Detroit:__
<img src="http://static.boredpanda.com/blog/wp-content/uploads/2016/02/man-withdraws-cash-from-atm-in-thailand-the-internet-responds-21__605.jpg" style="width: 400px;"/>

__College Student Withdrawing Cash From Atm In US:__
<img src="http://static.boredpanda.com/blog/wp-content/uploads/2016/02/man-withdraws-cash-from-atm-in-thailand-the-internet-responds-51__605.jpg" style="width: 400px;"/>

http://www.boredpanda.com/atm-cash-withdrawal-thailand-pattaya/

Eventual consistency is a key property of non-ACID systems

It means that if no further changes are made, eventually all nodes will be consistent

In itself eventual consistency is a very weak guarantee

> When is “eventually”? 

It doesn’t say!

In practice it means the system can be inconsistent at any time

Stronger guarantees are sometimes made with prediction and measurement: actual behaviour can be quantified, and in practice systems often appear strongly consistent

-->

---
NoSQL Flavors
---

<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/types.png" style="width: 150%;"/>

#### Key-Value


Key-value databases, as the name suggests, map keys to values. 

Every single item in the database is stored as an attribute name (or "key") together with its value.

The “key” is simply an identifier, and the value tends to be implemented as an opaque binary object that’s decoded by the database application, similar to the way an RDBMS deals with large data objects like images, sound files, or big chunks of unstructured text. Key-value databases enable quick access to data but the storage schema doesn’t embody relationships among the data.

By far the most popular!

Similar to a job interview: throw a hash map at it.

Big Players: Riak, Voldemort, Redis

![](http://media.charlesleifer.com/blog/photos/p1432653421.74.png)
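
The "throw a hash map at it" idea can be sketched in a few lines. This toy `KeyValueStore` class is an assumption for illustration, not the API of Riak, Voldemort, or Redis. It treats values as opaque bytes, so encoding and decoding are left to the application.

```python
import json

class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value_bytes):
        if not isinstance(value_bytes, bytes):
            raise TypeError("the store only accepts bytes")
        self._data[key] = value_bytes

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()

# The application, not the store, decides how values are encoded.
profile = {"name": "Ada", "langs": ["python", "sql"]}
store.put("user:1", json.dumps(profile).encode("utf-8"))

restored = json.loads(store.get("user:1").decode("utf-8"))
missing = store.get("user:2")
print(restored, missing)
```

Because the store never looks inside a value, a question like "which users know SQL?" requires scanning and decoding every value in application code.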

#### Document

Pair each key with a complex data structure known as a document.

Document databases (formerly known as document-oriented databases) store data in a format such as XML or JSON, which are referred to as self-describing formats because they include descriptive labels on what data is stored. Document databases are good for storing data that is hierarchical and nested, like books and other text-heavy content.
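
A small sketch of what "self-describing" buys you, using only the standard `json` module. The sample documents and the `lookup`/`find` helpers are illustrative assumptions, not MongoDB's query API. Note that the two documents have different fields, yet both can be stored and queried.

```python
import json

# Two documents in the same collection with different shapes.
raw_docs = [
    '{"_id": 1, "title": "Dune", "author": {"name": "Frank Herbert"}, "tags": ["sci-fi"]}',
    '{"_id": 2, "title": "SICP", "authors": ["Abelson", "Sussman"], "edition": 2}',
]
collection = [json.loads(raw) for raw in raw_docs]

_MISSING = object()  # sentinel: distinguishes "no such field" from a None value

def lookup(doc, path):
    """Follow a dotted path such as "author.name" into a nested document."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current

def find(docs, path, value):
    """Return the documents whose field at `path` equals `value`."""
    return [doc for doc in docs if lookup(doc, path) == value]

by_author = find(collection, "author.name", "Frank Herbert")
second_editions = find(collection, "edition", 2)
print([doc["title"] for doc in by_author], [doc["title"] for doc in second_editions])
```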

Big Player: MongoDB

![](http://s3.amazonaws.com/info-mongodb-com/_com_assets/media/mongodb-logo-rgb.jpeg)

#### Wide column

Store data together in columns, instead of rows

Wide column stores are somewhat similar to the relational model in that they store data in tables, but they are much more flexible. Relational databases don’t readily allow adding new columns on the fly (a new column is a schema change for the whole table), which is an important constraint for the sake of data integrity, but there are environments where adding columns on the fly is useful. Wide column databases are far more efficient for storing records that have different sets of columns.
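
The "different sets of columns per record" idea can be sketched as a mapping from row key to a sparse set of columns. The `WideColumnTable` class below is a toy assumption for illustration and says nothing about how Cassandra lays data out on disk.

```python
from collections import defaultdict

class WideColumnTable:
    """Toy wide-column table: row key -> {column name: value}.

    Only the columns a row actually has are stored, so rows can differ.
    """

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        return dict(self._rows.get(row_key, {}))

    def columns(self):
        """All column names that appear in at least one row."""
        return sorted({col for row in self._rows.values() for col in row})

table = WideColumnTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.com")
table.put("user:2", "name", "Grace")
table.put("user:2", "last_login", "2016-02-01")  # new column, no migration

print(table.get_row("user:2"), table.columns())
```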

Big Player: Cassandra

![](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Cassandra_logo.svg/1280px-Cassandra_logo.svg.png)

#### Graph

Store information about networks, such as social connections

Graph databases use a branch of mathematics known as graph theory. Graph theory represents entities as “vertexes” connected by “edges.” The edges show relationships between entities. Examples include airline route networks or “friend of a friend” relationships in social networks.
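
A "friend of a friend" query is easy to sketch with an adjacency list in plain Python. The names and the helper functions are made up for illustration; a graph database such as Neo4j would express this in its own query language rather than in code like this.

```python
from collections import defaultdict

# Adjacency list: each vertex (person) maps to the set of vertices it touches.
friends = defaultdict(set)

def add_friendship(a, b):
    """Add an undirected edge between two people."""
    friends[a].add(b)
    friends[b].add(a)

edges = [("ann", "bob"), ("bob", "cat"), ("bob", "dan"),
         ("ann", "dan"), ("cat", "eve")]
for a, b in edges:
    add_friendship(a, b)

def friends_of_friends(person):
    """People exactly two hops away: not the person, not a direct friend."""
    direct = friends.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= friends[friend]
    return two_hops - direct - {person}

print(friends_of_friends("ann"))
```

In a relational model the same question needs a self-join on a friendship table for every extra hop, which is where graph stores tend to pay off.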

Big Player: Neo4j

![](http://neo4j.com/wp-content/themes/neo4jweb/assets/images/neo4j-logo-2015.png)


Source:
https://www.oreilly.com/ideas/nosql-technologies-are-built-to-solve-business-problems-not-just-wrangle-big-data

---
Practical Downsides to NoSQL
----

<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/no_problem.jpg" style="width: 400px;"/>

1. Everyone knows SQL  
    few people know your specific NoSQL database  
    
2. Lack of validation  
    code will typically do anything the database lets it get away with (especially over time)  

3. No standards  
    you can’t easily switch databases  

4. Lack of maturity  
    lack of supporting tools, unpleasant surprises, ...  
    lack of answers on Stack Overflow  
 
5. Weak query languages  
    means you have to do more in code  
    may hurt performance  
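
Points 2 and 5 usually show up as extra application code. The sketch below assumes a schema-less store (here just a Python list) and hypothetical field names. It shows validation and a simple aggregation that a SQL database would handle with constraints and `AVG()`.

```python
REQUIRED_FIELDS = {"name": str, "age": int}  # hypothetical application schema

def validate(doc):
    """Return a list of problems; an empty list means the document is valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

users = []  # stand-in for a store that accepts anything

def insert_user(doc):
    """Validate in application code, because the store itself will not."""
    problems = validate(doc)
    if problems:
        raise ValueError("; ".join(problems))
    users.append(doc)

def average_age(docs):
    """Aggregation done in code instead of in the query language."""
    ages = [doc["age"] for doc in docs]
    return sum(ages) / len(ages) if ages else None

insert_user({"name": "Ada", "age": 36})
insert_user({"name": "Grace", "age": 40})
print(average_age(users))
```

Every code path that writes to the store has to call `validate`, otherwise bad documents slip in over time.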

Adopting NoSQL means playing a startup innovation "chip"

---
Summary
----
- There is a world beyond RDBMS and SQL.
- NoSQL is a loose collection of ideas and techniques to handle modern workloads
- There are 4 Types of NoSQL DBs
    1. Key-Value
    2. Document
    3. Wide column
    4. Graph
- You can have all types. But __always__ have an RDBMS hanging around.