<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 11: NoSQL and Performance Tradeoffs** 
_For when a centralized row store just won't work_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- 

### **Skills / Know how to ...**
- 

--------
## **LESSON 11 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/uRoW2sojmsE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

## **BIG PICTURE: Physical Design Alternatives**
In this lesson we will explore various alternatives to the traditional relational DBMS. Some of the alternatives will drop SQL itself, which would seem like heresy. However, there are actually some important use cases where one might not want to use a SQL-based solution. Even when SQL is the right decision there are plenty of options to choose from that fit some use cases better than others. 

In Lesson 5 we discussed four basic design tradeoffs:
- Minimizing Storage Space
- Maximizing Calculation Speed
- Maximizing Coherency
- Minimizing Data Corruption Risk

These tradeoffs exist regardless of the technology used. That's why they are *design* tradeoffs. However, there is more to a database system than design. Sometimes we have to broaden our scope to consider other alternatives. 

Here are four technical considerations that go beyond logical design:
- **Flexibility / Developer Experience**  
  How easy is it for developers to learn and use the technology? It may make sense to use technology that is closer to what your programmers use every day. 
- **Scalability / Performance Speed and Cost**  
  There is a natural tradeoff between speed and cost. Technology that works at Big Data Scale makes that tradeoff explicit. How much does each GB of storage cost? How about each GB/sec of query throughput? If we are willing to wait a little longer for each query can we save some money? 
- **Consistency / Timeliness**  
  Data is worthless if it is not available when you need it. If we don't need data to be instantaneously available, how long can we wait? If we need the data right now, then how tolerant are we of anomalies and other imperfections? If we use data replication to speed up access, do we need all copies to be 100% consistent? 
- **Technical Maturity / Technical Debt**  
  Database technology is always evolving, with new solutions coming out all the time. The newest technology may score well on the above considerations but also may come with bugs and other problems that need to be worked out. Meanwhile, older technology may be rock solid but also may limit your choices going forward. There is always the risk of being stuck with obsolete technology when your competition is beating you with something newer. 

We will start with the first two considerations, developer experience and performance, which are generally used as the rationale for NoSQL technologies that don't (necessarily) adhere to the relational database model. Then we will follow up with strategies that can further improve the speed and cost performance of any technology, provided you are able to make the right consistency and maturity tradeoffs.


---
## **NoSQL Databases**
The term "NoSQL" was first coined in the late 1990s and gained popularity among application programmers about a decade later. It refers to databases that do not rely on the relational model. That does not mean that SQL (or some close approximation) isn't used, but rather, that NoSQL systems extend beyond the traditional relational model. 

For this reason some interpret NoSQL as "Not only SQL" rather than as a complete absence of SQL. In fact, each of the NoSQL technologies surveyed here *could* be implemented in SQL, and we will try to use relational models to explain how each relates to SQL and how it extends beyond it. 

Why would we even need to go beyond SQL? 
- **Developer Experience (DX):** As we have discussed before, there is a natural impedance mismatch between a declarative language like SQL and a more imperative application development language like Python, Java, or JavaScript. NoSQL technologies remove much of the discomfort some programmers feel when using SQL.
- **Performance:** Selectively relaxing the rules of the relational model can sometimes bring speed and cost benefits that outweigh the integrity protections of the relational model. 

For each of the models below we will discuss how it differs from the standard relational model, potential DX or performance benefits, and the most common usage scenarios. 

### **Key-Value Stores**
Key-Value (KV) stores are most useful when a table is very sparse, like this section of the NBA PlayLog data set. 

![Sparse Table](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L11_Sparse_Table.png)

If most values in a table are blank, then why waste time and effort recording them in rows and columns? Instead, just store the data you actually have, tagged to suit however you may need to retrieve it. 

In relational terms, a KV store is just a single two-column table 
`kv_store(`**`key`**`, value)`

where
- **`key`** is a unique index
- `value` is the data to be stored

Usually the data type of the `value` is either encoded in the data itself (e.g., `i2'15'` or `s5'Steve'`), encoded in the key, or assumed to be text. 

> **Heads Up:** We have already seen an example in Lesson 5. The Entity-Attribute-Value model can, with the right modeling conventions, be seen as a kind of KV store. The key would be a composite of the entity and attribute. 

You may be wondering how we can replace a two-dimensional table of rows and columns with a one-dimensional KV store. We do it exactly as we would with composite indexes in the relational model, with the row and column encoded in the key. It's all in the patterns we use when constructing the keys.
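The two-column table above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the helper names `put` and `get` and the sample keys `name` and `count` are hypothetical, and every value is stored as text (the third option mentioned above).

```python
import sqlite3

# In-memory SQLite database. The table mirrors kv_store(key, value);
# "key" is quoted because KEY is a SQLite keyword.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE kv_store ("key" TEXT PRIMARY KEY, value TEXT)')

def put(key, value):
    """Insert the value under key, overwriting any existing entry (hypothetical helper)."""
    conn.execute(
        'INSERT OR REPLACE INTO kv_store ("key", value) VALUES (?, ?)',
        (key, str(value)),  # assumption: all values are stored as text
    )

def get(key, default=None):
    """Return the stored text for key, or default if the key is absent (hypothetical helper)."""
    row = conn.execute(
        'SELECT value FROM kv_store WHERE "key" = ?', (key,)
    ).fetchone()
    return row[0] if row else default

put("name", "Steve")   # illustrative keys, not from any real data set
put("count", 15)

print(get("name"))     # the stored string
print(get("count"))    # comes back as text, so the caller must convert it
print(get("missing"))  # absent keys fall back to the default
```

Because the value column is plain text, the reader of the data has to know (by convention) that `count` should be converted back to an integer.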

For example, the following could be used to record the `assist`, `away`, etc. columns as key-value pairs:

| **Key** | **Value**|
| --- | ---|
| 2:away | Derrick Favors |
| 2:home | Marc Gasol |
| 18:opponent | Kyle Lowry |
| 20:num | 1 |
| 21:num | 2 |
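As a minimal sketch of this key pattern, the Python below flattens a sparse table into `row:column` keys using an ordinary dictionary as the KV store. The row numbers and values mirror the table above; the function names `to_kv` and `lookup` and the exact set of columns are assumptions made for illustration.

```python
# Sparse rows keyed by row number; None marks a blank cell.
# Values mirror the example table; the column list is an assumption.
COLUMNS = ["assist", "away", "home", "opponent", "num"]
sparse_rows = {
    2: {"away": "Derrick Favors", "home": "Marc Gasol"},
    18: {"opponent": "Kyle Lowry"},
    20: {"num": 1},
    21: {"num": 2},
}

def to_kv(rows):
    """Keep only non-blank cells, stored under a composite 'row:column' key."""
    kv = {}
    for row_id, cells in rows.items():
        for column in COLUMNS:
            value = cells.get(column)  # None when the cell is blank
            if value is not None:
                kv[f"{row_id}:{column}"] = value
    return kv

def lookup(kv, row_id, column):
    """Rebuild the composite key to fetch one cell; blank cells return None."""
    return kv.get(f"{row_id}:{column}")

kv_store = to_kv(sparse_rows)
print(len(kv_store))                   # number of cells actually stored
print(lookup(kv_store, 2, "home"))     # a populated cell
print(lookup(kv_store, 2, "assist"))   # a blank cell was never stored
```

A dense version of the same four rows would need 20 cells; the KV version stores only the 5 that hold data.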

KV stores are commonly used for creating data caches for the web. Here, for example, is the data that Colab is keeping about *this page* while I have it open in my web browser: 
![](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L11_Colab_Local_Storage.png)

Similar caching technology is used by web servers to minimize reads from disk storage. If the same CSS is used on every page, then why read it from disk every time? The server will instead use a highly optimized KV store. 
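The caching idea can be sketched in a few lines of Python. This is a simplified read-through cache, not how any particular web server implements it: the class name `ReadThroughCache`, the stand-in `slow_read` function, and the path `/static/site.css` are all hypothetical.

```python
class ReadThroughCache:
    """Serve repeated requests from an in-memory KV store (a dict)."""

    def __init__(self, loader):
        self._loader = loader   # function called only on a cache miss
        self._store = {}        # key -> cached value
        self.misses = 0         # how many times we had to call the loader

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]

def slow_read(path):
    # Stand-in for an expensive disk read (assumption for illustration).
    return f"/* contents of {path} */"

cache = ReadThroughCache(slow_read)
for _ in range(3):
    css = cache.get("/static/site.css")  # same key requested three times

print(cache.misses)  # only the first request reached the loader
```

A production cache would also need an eviction policy and a way to invalidate stale entries, which this sketch leaves out.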

#### **Wide-Column KV Stores**
Wide-Column data stores extend the KV model to allow data storage as multiple columns:
- The keys are just like any other KV store.
- The columns are encoded like a SQL `STRUCT` (or JSON object), with each column having a name and a value. 

The columns for each key may vary, like this:

| **Key** | **Value**|
| --- | ---|
| 2 | {away:Derrick Favors, home: Marc Gasol} |
| 18 | {opponent:Kyle Lowry} |
| 20 | {num:1} |
| 21 | {num:2} |

The effect is to condense the already very space-efficient key-value model by eliminating overlapping keys. In this case we eliminated a row of data storage by sharing the key `2` for the `away` and the `home` columns. 
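The condensing step can be sketched as a regrouping of composite keys. The Python below assumes the `row:column` key pattern from the earlier KV example; the function name `to_wide_column` is hypothetical.

```python
from collections import defaultdict

# Flat key-value pairs using the 'row:column' key pattern.
kv_pairs = {
    "2:away": "Derrick Favors",
    "2:home": "Marc Gasol",
    "18:opponent": "Kyle Lowry",
    "20:num": 1,
    "21:num": 2,
}

def to_wide_column(kv):
    """Group values that share a row key into one dict of named columns."""
    wide = defaultdict(dict)
    for composite_key, value in kv.items():
        row_id, column = composite_key.split(":", 1)
        wide[int(row_id)][column] = value
    return dict(wide)

wide = to_wide_column(kv_pairs)
print(len(kv_pairs), "->", len(wide))  # entries before and after grouping
print(wide[2])                         # both columns now share key 2
```

Each row key may carry a different set of columns, which is what distinguishes this layout from a fixed relational schema.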

#### **Summary**

- Pros
  - very fast storage and retrieval (when applicable)
  - compact storage 
  - programmer-friendly
- Cons
  - schema by convention instead of rules
  - potential for schema drift as new key types proliferate
  - no FKs or analogous way to integrate data across keys
- When to use
  - With sparse or highly volatile data where the keys need to be flexible
  - As local data stores in application development
  - When caching data to speed up application performance
- Example Products
  - [Varnish](https://varnish-cache.org/)
  - Nginx
  - Squid
  - Memcached
  - Redis
  - AWS DynamoDB

### **Document Stores**
Document stores build on the KV model to construct arbitrarily complex (and data-rich) databases of *semi-structured* data. Semi-structured data has a schema (so we can interpret and process it), just like any other data model. However, in a document store each document (roughly equivalent to a table row) can have its own schema, structuring the data however it likes. We can then only know the schema *on read*, whereas for the relational model it is known *on write*. 
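To make "schema on read" concrete, here is a small Python sketch. The two documents are made up for illustration, and the function name `fields_on_read` is hypothetical. Nothing stops us from storing documents with different fields; we only discover each document's structure when we parse it.

```python
import json

# Two stored documents with different schemas (made-up sample data).
raw_documents = [
    '{"name": "Steve", "age": 15}',
    '{"name": "Ana", "emails": ["ana@example.com"], "address": {"city": "Fairfield"}}',
]

def fields_on_read(raw):
    """Parse one document and report the top-level fields it happens to have."""
    doc = json.loads(raw)
    return sorted(doc.keys())

for raw in raw_documents:
    doc = json.loads(raw)
    # The reader must tolerate missing fields, since no schema was enforced on write.
    age = doc.get("age", "unknown")
    print(doc["name"], fields_on_read(raw), age)
```

The defensive `doc.get(...)` call is the price of this flexibility: the application code, not the database, has to cope with fields that may or may not exist.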

The classic example of a document storage format is that used by Microsoft Word, which acts as a *container* for components (blocks of text, titles, images, tables, etc.). One can insert just about anything inside an MS Word file (even malware, unfortunately). MS Word will then compose and render the components as documents in real time so end users can read and edit the data contained inside. 

A more relevant example is JSON, which has become a ubiquitous format for transmitting data over the web. Like an MS Word document, a JSON object or list acts as a container, into which we can insert ... JSON objects and lists. The result is a tree structure, with objects and lists nested inside each other to some unspecified depth. 
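Since JSON is a tree, it is naturally processed with recursion. The sketch below measures how deeply the containers in a parsed JSON value are nested; the function name `nesting_depth` and the sample string are assumptions for illustration.

```python
import json

def nesting_depth(node):
    """Count the levels of nested containers (objects or lists) in a parsed JSON value."""
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return 0  # strings, numbers, booleans, and null are leaves
    # An empty container still counts as one level.
    return 1 + max((nesting_depth(child) for child in children), default=0)

sample = json.loads('{"cells": [{"source": ["line 1", "line 2"]}, {"source": []}]}')
print(nesting_depth(sample))  # object -> list -> object -> list
```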

Here, for example, is what a Colab notebook (this one, in fact) looks like when you open it in a text editor:
![Colab as JSON](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L11_Colab_As_JSON.png)

Yes, the `.ipynb` file format is really just highly structured JSON. Here's the same JSON in a nicer, pretty-printed format:
![Prettified JSON](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L11_Pretty_JSON.png)

A Colab Notebook is mostly a list of "cells" (look for it in the screenshot), where each cell has a `cell_type`, `metadata`, and `source`. The `source` is whatever we typed into the cell. 
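Because a notebook is just JSON, we can read it with Python's standard `json` module. The sketch below builds a tiny stand-in notebook in memory rather than opening a real `.ipynb` file; its contents and the function name `summarize_cells` are hypothetical, and only the `cells`, `cell_type`, `metadata`, and `source` fields described above are assumed.

```python
import json

# A tiny stand-in for the text of an .ipynb file (made-up contents).
notebook_text = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some text"]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hello')"]},
    ],
})

def summarize_cells(text):
    """Return a (cell_type, source_text) pair for each cell in the notebook."""
    notebook = json.loads(text)
    summary = []
    for cell in notebook["cells"]:
        source = cell["source"]
        if isinstance(source, list):   # source may be stored as a list of lines
            source = "".join(source)
        summary.append((cell["cell_type"], source))
    return summary

for cell_type, source in summarize_cells(notebook_text):
    print(cell_type, repr(source))
```

To try this on a real notebook, replace `notebook_text` with the contents of a downloaded `.ipynb` file.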

We will come back to JSON in the **Pro Tips** section later in this lesson. 

 


#### **Summary**
- Pros
  - little or no impedance mismatch for programmers, especially when using JavaScript
  - very compact, especially if the keys are terse
- Cons
  - same as KV stores; complex queries are especially difficult
  - schema on read complicates app design and development; potentially buggy
- When to use
  - For local storage or web transmission of data
  - For complex hierarchically-structured data, where documents are composed of nested components
  - When storing "objects" in relational databases
- Example Products
  - CouchDB
  - MongoDB
  - AWS DynamoDB
  - Google Cloud Firestore

### **Inverted Indexes**
### **Graph Databases**


---
## **Technology for Capacity and Performance**
### **Distributed Databases**
CAP
### **Row-stores vs Column-stores**

---
## **PRO TIPS: How to Work with JSON Data in SQL**






---
## **SQL AND BEYOND: Git as a Distributed DBMS**

---
## **Congratulations! You've made it to the end of Lesson 11.**

There is no Lesson 12. So maybe celebrate with your beverage of choice.  



## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `BUAN6510` folder so you can find it next time.