In [1]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

## NoSQL

### Relational Database

#### Relational Model
+ Database is a collection of relations
+ Each relation has attributes - each attribute is a column
+ Each relation has a collection of tuples

#### SQL (Structured Query Language)
- Manage data in a relational database
- Select rows from a relation satisfying a given condition

#### Advantages
+ Concurrency Control - ACID (Atomic, Consistent, Isolated and Durable) Transaction Management
    - **Atomicity**: An operation either succeeds or fails entirely. Updates to many rows spanning several tables are applied as a single operation
    - **Consistency**: Any given transaction must change the affected data only in allowed ways, so the database moves from one valid state to another
    - **Isolation**: Defines how and when the changes made by one operation become visible to others. Concurrent operations are isolated from each other so that they cannot see a partial update.
    - **Durability**: Once a transaction has been committed, its changes remain permanently.
+ Standard Model
    - The query languages of different relational databases are similar
    - Transaction operations work in a similar way as well
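
A minimal sketch of atomicity, assuming an in-memory SQLite database as a stand-in for a full RDBMS. The `account` table, the names and the `transfer` helper are hypothetical and only serve the illustration: a transfer that would violate a constraint is rolled back as a whole, so the credit that was already applied is undone too.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))

transfer(conn, "alice", "bob", 30)    # succeeds and is committed
transfer(conn, "alice", "bob", 500)   # fails the CHECK, whole transfer rolled back
print(balances(conn))                 # {'alice': 70, 'bob': 80}
```

The second transfer credits `bob` first and only then fails on the debit, yet `bob` keeps 80: no partial update survives.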

### Big Data & Relational Databases

- Data & Traffic increase, schema changes
- Solution:
    + Scale up (a bigger machine) or scale out (more machines)? A trade-off between cost and efficiency
    + Store data on a cluster, i.e. distribute the data across smaller databases
    + **Sharding**: A type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards.
    <img src="sharding.png" width="300pt">
    
    
==> **Reliability** issue: What if one of the databases dies? e.g. what if A dies?
- However, most relational databases are not designed to run on a cluster
- What if a table schema changes all the time?
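
The idea of sharding can be sketched as hash-based routing: each row goes to the shard selected by a hash of its key. This is only one possible scheme, and the shard names and the `shard_for` / `route` helpers are hypothetical.

```python
import hashlib

SHARDS = ["shard_a", "shard_b", "shard_c"]  # hypothetical shard names

def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key; the same key always maps to the same shard."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

def route(rows, key_field, shards=SHARDS):
    """Group rows (dicts) by the shard that their key maps to."""
    placement = {name: [] for name in shards}
    for row in rows:
        placement[shard_for(row[key_field], shards)].append(row)
    return placement

rows = [{"user_id": i, "name": f"user{i}"} for i in range(10)]
placement = route(rows, "user_id")
for name, members in placement.items():
    print(name, [r["user_id"] for r in members])
```

In this sketch every row lives on exactly one shard, which is the reliability issue raised above: if that shard dies, its rows are unavailable unless they are also replicated elsewhere.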

### JSON (JavaScript Object Notation)
- Open-standard file format that uses human-readable text to transmit data objects consisting of key-value pairs and array data types (or any other serializable value)
- Commonly used for browser-server communication, e.g. in REST APIs
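
As a small illustration, Python's standard `json` module maps JSON objects to dicts and JSON arrays to lists. The sample text below is made up for the example.

```python
import json

# Hypothetical JSON text with nested key-value pairs and an array.
text = '{"name": "Ann", "tags": ["a", "b"], "address": {"city": "San Francisco", "state": "CA"}}'

record = json.loads(text)            # JSON object -> dict, JSON array -> list
city = record["address"]["city"]     # nested values are reached by key

serialized = json.dumps(record)      # dict -> JSON text
round_trip = json.loads(serialized)  # parsing it again gives an equal dict
```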

### Example 1 - Parse and load “user_1.json” into the “users” table.

In [4]:
import psycopg2
import json

def connectdb(db_name, user_name):
    """Return a psycopg2 connection, or None if the connection fails."""
    try:
        return psycopg2.connect(dbname=db_name, user=user_name)
    except psycopg2.Error:
        print("Not able to connect to " + db_name)
        return None

def db_cursor(db_conn): 
    return db_conn.cursor()

def execute(db_cursor, query): 
    return db_cursor.execute(query)

def create_table(db_cursor, table_name, column_type_list):
    create_tbl_qry = "CREATE TABLE "+table_name+"("+column_type_list+")"
    execute(db_cursor, create_tbl_qry)

def insert_into_table(db_cursor, table_name, column_names, values):
    insert_qry = "INSERT INTO "+table_name+\
                "("+column_names+") VALUES ("+values+");"
    execute(db_cursor, insert_qry)

def get_value(data, key): 
    if (data.get(key)): return data[key]

In [None]:
psycopg2.connect("dbname='msan691' user='thykhuely' host='localhost' password=''")
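
The helpers above build SQL by string concatenation, which breaks on values containing quotes. A hedged sketch of the parse-and-load step with placeholders follows. Assumptions: an in-memory SQLite database stands in for Postgres so that the example runs without a server, and both the JSON content and the `users` columns are hypothetical because `user_1.json` is not shown. With psycopg2 the placeholder would be `%s` rather than `?`.

```python
import json
import sqlite3

# Hypothetical stand-in for the contents of user_1.json.
user_json = '{"id": 1, "name": "Ann", "email": null}'

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of Postgres
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)

def load_user(conn, raw_json, columns=("id", "name", "email", "phone")):
    """Parse one JSON object and insert it; keys missing in the JSON become NULL."""
    data = json.loads(raw_json)
    values = [data.get(col) for col in columns]
    placeholders = ", ".join("?" for _ in columns)
    # Column names come from the fixed tuple above, never from user input.
    conn.execute(
        f"INSERT INTO users ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )
    conn.commit()

load_user(conn, user_json)
rows = conn.execute("SELECT id, name, email, phone FROM users").fetchall()
print(rows)  # [(1, 'Ann', None, None)]
```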

### Why NoSQL

- Look for an alternative when the relational system in use no longer fits the application

- **Impedance Mismatch**
    + The relational model is different from in-memory data structures (objects)
    + Difficulties arise when an RDBMS is served by an application program written in an object-oriented programming language, particularly because objects/class definitions must be mapped to database tables defined by a relational schema.
    + NoSQL offers a better mapping to the application's in-memory data structures


- **Large volume of data**
    + Run large data on clusters of many smaller and cheaper machines.
    + Cheaper and more reliable
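
The impedance mismatch described above can be sketched with one nested in-memory object. To fit a relational schema it has to be split into rows for two tables, while a document store can keep it as a single unit. The object, the two target tables and the helper names are hypothetical.

```python
import json

# Hypothetical in-memory object, as an application would hold it.
order = {
    "order_id": 7,
    "customer": "Ann",
    "items": [{"name": "pen", "price": 2}, {"name": "ink", "price": 5}],
}

def to_relational_rows(order):
    """Split one nested object into rows for an orders table and an items table."""
    order_row = (order["order_id"], order["customer"])
    item_rows = [(order["order_id"], item["name"], item["price"])
                 for item in order["items"]]
    return order_row, item_rows

def to_document(order):
    """Keep the same object as one JSON document."""
    return json.dumps(order)

order_row, item_rows = to_relational_rows(order)
document = to_document(order)
```

Reading the order back from the relational form needs a join on `order_id`; the document form is loaded in one step with `json.loads(document)`.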

### NoSQL Features
- Accept schema-less data
- Non-relational data
- Open-source
- Trade off traditional consistency for other properties
- Run on clusters

- Postgres supports JSON data type. However, 
    + Even though it can take schema-less or non-structured data, it is **not designed for data distribution**
    + It does **not offer** any native mechanisms for data availability or **scaling** the database beyond a single server
    + Lacks mechanisms for automatic **failover and recovery** between database replicas
    + No native mechanisms to **partition (shard)** the database across a cluster of nodes.
    + Working with JSON data in Postgres is less natural: it relies on non-standard SQL extensions to query and manipulate JSON, and most tools do not support them


### NoSQL Database Types

| Aggregate-oriented | Relationship-oriented |
| -------------------|-----------------------|
| Key-value db | Graph db |
| Document db | |
| Columnar db | |

<img src="nosql-types.jpg" width="60%" height="50%">

### 1. Aggregate: Collection of related objects, treated as a unit
- For analyzing data, you might want to keep strongly related data together as a unit
    - E.g. a student's id, name, address, phone number
    - On a cluster, an aggregate is stored together on one node
    - A single aggregate is the unit of atomic updates
    - All of its attributes/columns are always together on one machine
    - In Postgres, everything has to be on the same machine; if there isn't enough space, the data is split by **columns** into tables that share a key field and can later be joined
    - Unlike in a relational database, the data can be split by **rows**/**aggregates** into chunks placed on different machines
- Use aggregates indexed by **key** for data lookup
- No ACID transactions that span multiple aggregates


- **Advantages**
    + Provides clearer semantics to consider by focusing on the aggregate unit used by applications
    + Better design choice for running on a cluster
    
    
- **Disadvantages**
    + Not easy to draw aggregate boundaries
        + e.g. where should the boundary go in --
        ```
        {name: 'Ann',
        address: {street:'123 Howard', city: 'San Francisco', state: 'CA'}}
        ```
    + Not the best choice when the goal of the data management/analysis is unclear
    + Does not support ACID transactions across aggregates
    
    

### Example 2 - Draw boundaries of an aggregate.

<img src="day5_ex2.png" width="80%">

1. Approach (1): 

    ```python
    # aggregate 1
    { customer_id, customer_name,
          billing_address:{ street, city, state, post_code } 
          }
    
    # aggregate2
    { order_id, customer_id,
        order_payment: { card_number, txn_id, billing_address },
        order_item: { name, price }
        }
    ``` 
    

2. Approach (2): 

```python
{ customer_id, customer_name, 
    order: { order_id, shipping_address, 
        item: { product_name, price },
        payment: { card_num, txn_id, billing_address } 
        }
}
``` 
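
To make the trade-off between the two approaches concrete, here is a hedged sketch with plain Python dicts playing the role of the store. All sample values and helper names are hypothetical, and only a few of the fields from the outlines above are filled in.

```python
# Approach (1): customers and orders are separate aggregates with their own keys.
customers = {"c1": {"customer_name": "Ann",
                    "billing_address": {"city": "San Francisco"}}}
orders = {
    "o1": {"customer_id": "c1", "order_item": [{"name": "pen", "price": 2}]},
    "o2": {"customer_id": "c1", "order_item": [{"name": "ink", "price": 5}]},
}

# Approach (2): one aggregate per customer, with the orders nested inside.
customers_nested = {
    "c1": {
        "customer_name": "Ann",
        "order": [
            {"order_id": "o1", "item": [{"product_name": "pen", "price": 2}]},
            {"order_id": "o2", "item": [{"product_name": "ink", "price": 5}]},
        ],
    }
}

def orders_of_customer_v1(customer_id):
    """Approach (1): needs a scan (or a secondary index) over the order aggregates."""
    return sorted(oid for oid, o in orders.items()
                  if o["customer_id"] == customer_id)

def orders_of_customer_v2(customer_id):
    """Approach (2): a single key lookup returns the customer with all orders."""
    return [o["order_id"] for o in customers_nested[customer_id]["order"]]

def find_order_v2(order_id):
    """Approach (2): finding one order by id means searching inside every customer."""
    for customer in customers_nested.values():
        for o in customer["order"]:
            if o["order_id"] == order_id:
                return o
    return None
```

Which boundary is better depends on the access pattern: approach (2) favours "show a customer with all orders", approach (1) favours "fetch or update one order by its id".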

- **Key-value Database**: 
    + Each aggregate has a key field (ID)
    + Values are opaque: the store does not know anything about their contents
    + A key lookup fetches the entire aggregate (i.e. all the fields in the aggregate)
    + Querying by key returns **everything**, i.e. every field that belongs to that key
    + In exchange, there is more freedom to store whatever you want as the value
    
    
- **Document Database**:
    + Each aggregate has a key field (ID)
    + It restricts the allowable structure & types
    + Access by key and also by the fields in the aggregate (can retrieve individual fields in the value)
    + Very similar to a key-value database
    + But values must follow a structured format of "key": "value" pairs -- less freedom than a key-value database
    + In return, a query can be based on the internal structure of the document
        + When querying, a document db can return specific fields only
        + e.g. `db.world_bank.find({'borrower': 'KENYA'})`
        + e.g. 
        ```
        db.world_bank.find(
            {'borrower': {$regex: 'KENYA'}},
            { borrower:1, url:1, _id: 0 })
        ```
        
- **Columnar Database**:
    + Optimized for cases where writes are rare but the same columns are read together across many rows
    + Writes tend to be slower, since the values of one row are spread over several column stores, but reading a column is very fast
    + Data can be fetched by column
    + Organize columns into column families as a unit of access
    + Two-level map structure:
        1. Row Identifier (Row Key): Choose the aggregate of interest
        2. Columns: Choose a particular column
    + By contrast, the beauty of a graph database (next section): if a new node comes in and its relationship to one of the other nodes is known, we can traverse from the new node to the other
    + e.g. Columnar Database: 
    ```
    {'name': ['john', 'andy'],
     'address': ['100 Howard', '400 Spear'] }
     ```  
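
The three aggregate-oriented access styles above can be contrasted in a toy sketch. Plain Python dicts stand in for the stores, and the helper names (`kv_get`, `doc_find`, `column_read`) are hypothetical, not the API of any real database.

```python
import json

people = [
    {"id": "p1", "name": "john", "address": "100 Howard"},
    {"id": "p2", "name": "andy", "address": "400 Spear"},
]

# Key-value: the store sees only opaque bytes, so a lookup returns the whole value.
kv_store = {p["id"]: json.dumps(p).encode("utf-8") for p in people}

def kv_get(key):
    return kv_store[key]  # the caller must decode it and pick fields itself

# Document: the store understands the structure, so it can filter and project.
doc_store = {p["id"]: p for p in people}

def doc_find(criteria, fields):
    """Return the requested fields of every document matching all criteria."""
    return [
        {f: d[f] for f in fields if f in d}
        for d in doc_store.values()
        if all(d.get(k) == v for k, v in criteria.items())
    ]

# Columnar: the values of one column are stored together, in row order.
column_store = {
    "name": [p["name"] for p in people],
    "address": [p["address"] for p in people],
}

def column_read(column):
    return column_store[column]

print(json.loads(kv_get("p1")))                  # whole aggregate
print(doc_find({"name": "andy"}, ["address"]))   # [{'address': '400 Spear'}]
print(column_read("name"))                       # ['john', 'andy']
```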

### 2. Relationship-Oriented Database

- Relational databases with a complex schema are hard to understand and query, and their data is hard to generalize and integrate
- Aggregate-oriented NoSQL has its own limit: atomicity is only supported within a single aggregate

- **Nodes** (objects) and **edges** (relationships) representation
- For data with complex relationships: 
    + The focus is on graph traversal more than on inserts
    + In an RDBMS, the same queries need many joins, which can cause poor performance
- Typically runs on a **single** server; it is **NOT** designed for distribution across clusters
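
A minimal sketch of the node/edge representation and of a traversal, assuming an adjacency-list dict as the graph. The node names, the relationship types and the `reachable` helper are hypothetical and do not follow the API of any particular graph database.

```python
from collections import deque

# Hypothetical graph: node -> list of (relationship, neighbour) edges.
graph = {
    "ann":   [("FRIEND", "bob"), ("WORKS_AT", "acme")],
    "bob":   [("FRIEND", "carol")],
    "carol": [],
    "acme":  [],
}

def reachable(graph, start, relationship):
    """Breadth-first traversal that follows only edges of one relationship type."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for rel, neighbour in graph.get(node, []):
            if rel == relationship and neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order

print(reachable(graph, "ann", "FRIEND"))   # ['bob', 'carol']

# A new node only needs one known relationship to join the existing graph.
graph["dave"] = [("FRIEND", "ann")]
print(reachable(graph, "dave", "FRIEND"))  # ['ann', 'bob', 'carol']
```

Each hop is a direct lookup of a node's edges, where a relational schema would need one more join per hop.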


### Examples

- **Key-value**: Redis, Riak, Berkeley DB
- **Document**: MongoDB, CouchDB, OrientDB, RavenDB
- **Column**: Cassandra, HBase, Amazon SimpleDB
- **Graph**: Neo4j, OrientDB, FlockDB

### Choice of DBMS

- **Polyglot Persistence**: 
    + Using multiple data storage technologies, chosen based on the way data is being used by individual applications
    + NoSQL databases do not replace relational databases

<img src="nosql-vs-sql1.png">

### Document Databases
- For storing and retrieving documents with self-contained schema -- including JSON and XML
- Schema of the data can differ across documents, but these documents can still belong to the same collection
    + An attribute present in one document may not exist in another
    + New attributes can be created without defining them in the existing documents
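
A small sketch of this flexibility, with a Python list of dicts standing in for a collection. The documents and the helpers `insert`, `find` and `all_fields` are hypothetical and only mimic the behaviour described above.

```python
collection = [
    {"_id": 1, "name": "Ann", "city": "SF"},
    {"_id": 2, "name": "Bob"},  # no "city" field, same collection
]

def insert(collection, document):
    """Add a document; no schema has to be declared or altered first."""
    collection.append(document)

def find(collection, field, value):
    """Return documents whose `field` equals `value`; documents without it are skipped."""
    return [d for d in collection if field in d and d[field] == value]

def all_fields(collection):
    """Union of the field names used by any document in the collection."""
    return sorted({key for d in collection for key in d})

# A new attribute appears in one document only; existing documents are untouched.
insert(collection, {"_id": 3, "name": "Cy", "noCats": 2})
```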

## MongoDB

#### Terminology 

| SQL (RDBMS) | MongoDB |
|-------------|---------|
| database | database |
| table | collection |
| row | document |
| column | field |

### Why MongoDB?
- Ease of use: No defined schema
- Easy scaling: MongoDB takes care of
    + Balancing data
    + Loading data across a cluster
    + Redistributing documents automatically
    + Routing user requests to the correct machines
- Many features
    + Creating, reading, updating, and deleting (CRUD) data.
    + Indexing: Supports secondary indexes, allowing fast queries.
    + Aggregation pipeline: Allows you to build complex aggregations from simple pieces.
    + Special collection types: Time-to-live collections (e.g. sessions, where data is deleted after a while) and fixed-size collections.
    + File storage: Stores large files and file metadata (GridFS).
- Supported drivers: C, C++, C#, Java, Node.js, Perl, PHP, Python, Ruby, Scala, Go and Erlang

### Mongo CRUD Operations

#### Create 

```
document1 = {"name": "David", "address": {"street": "100 Howard", "city": "SF", "state": "CA"}}
document2 = {"name": "Yannet", "address": {"street": "100 Howard", "city": "SF", "state": "CA"}}

db.collection_name.insert([document1, document2])
```

#### Retrieve 
```
db.friend.findOne({"name": {$regex: "David"}})
db.friend.find({"name": {$regex: "D"}})
```

#### Update

By default, `update` changes only the first document that matches. If **"multi"** is set to true, it updates every document that meets the query criteria.

```
// Replace the whole matching document (fields missing from new_value are dropped)
new_value = {"name": "David Guy Brizan", "address": {"city": "SF"}}
search_criteria = {"name": "David Guy Brizan"}
db.friend.update(search_criteria, new_value)

// Add a new field
db.friend.update({"name": "Diane"}, {$set:{"noCats": 1}})

// Remove the field - the value given does not matter
db.friend.update({"name": "Diane"}, {$unset:{"noCats": 1}})

// Increment - if the field does not exist, it starts from zero
db.friend.update({"name": "Diane"}, {$inc:{"noCats": 1}})

// Change the name of the field
db.friend.update({"name": "Diane"}, {$rename:{"noCats": "numCats"}})

// $min keeps the smaller value: numCats becomes 5 only if it is currently greater than 5
db.friend.update({"name": "Diane"}, {$min:{"numCats":5}})

// Update every matching document, not only the first
db.friend.update({"name": "Diane"}, {$set:{"numCats":5}}, {multi:true})

```
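
The semantics of the update operators can be tried out without a server. The following is a toy Python emulation on plain dicts, not MongoDB's implementation: `apply_update` is a hypothetical helper that handles top-level fields only and just the five operators used above.

```python
import copy

def apply_update(document, update):
    """Return a copy of `document` with a subset of MongoDB-style operators applied."""
    doc = copy.deepcopy(document)
    for field, value in update.get("$set", {}).items():
        doc[field] = value                       # add or overwrite the field
    for field in update.get("$unset", {}):
        doc.pop(field, None)                     # the given value is ignored
    for field, amount in update.get("$inc", {}).items():
        doc[field] = doc.get(field, 0) + amount  # a missing field starts from zero
    for old, new in update.get("$rename", {}).items():
        if old in doc:
            doc[new] = doc.pop(old)
    for field, value in update.get("$min", {}).items():
        if field not in doc or value < doc[field]:
            doc[field] = value                   # keep the smaller value
    return doc

diane = {"name": "Diane"}
diane = apply_update(diane, {"$inc": {"noCats": 1}})             # noCats: 1
diane = apply_update(diane, {"$rename": {"noCats": "numCats"}})  # numCats: 1
diane = apply_update(diane, {"$inc": {"numCats": 9}})            # numCats: 10
diane = apply_update(diane, {"$min": {"numCats": 5}})            # numCats: 5
print(diane)  # {'name': 'Diane', 'numCats': 5}
```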