# Lecture 7: Introduction to non-relational databases

## Announcements
- Quiz 1 has been fully graded, and overall, you’ve done a great job.
- There won’t be practice questions for Lectures 7 and 8 (apologies - got too busy).
- Quiz 1 results will be available on PrairieLearn under the Gradebook on Wednesday, so you can decide whether you’d like to book a review session at the CBTF.
- We will have a worksheet this week, with time at the end of class to work on it. One TA (Daniel/Jeremy) and I will be there to assist you.
- You all know about the policies regarding the quiz and autograding. Here are the Quiz 1 policies for your reference:
```{note}
Exam Instructions:
You must click Save on every question before the timer runs out or your work will not be saved.

All quiz questions in 513 are autograded. If your code has a syntax error, you'll receive a zero. Only copy the code needed to produce the required output.
For coding questions, partial credit is not awarded. But you will receive 25% of the points only if your code runs and is VERY close to the expected output, meaning it covers almost all requirements but misses just one item; otherwise, no points will be awarded.
```
- The good news is that students in the above category received 75% instead of 25% for the coding questions. We will keep this policy for Quiz 2 as well.

## Quiz 2 Preparation
- For the second quiz, there will be overlaps with topics covered in Quiz 1. Be prepared.
- Time will be a challenge, so make sure to practice extensively and have resources ready - an effective cheat sheet is a good strategy.
- We won’t provide a workspace for MongoDB. Since you won’t be able to test MQL, I won’t ask you to write MQL queries. However, you should be able to understand them.
- Since we are now in a bit more complex SQL world, we’ve updated the pgAdmin workspace to include auto-completion. Hopefully, you’ll find it useful.
- A practice quiz for Quiz 2 has also been released.

## Today's Theme

- Where are we with RDBMS?
- Understand the Advantages & Disadvantages of RDBMS
- What is NoSQL, and what are the different types of NoSQL?
- Introduction to Document databases.
- Understanding the basics of MQL

```{note}
- Wow… worksheets! They are due with Lab 4. We don’t have a separate repository created for them like the labs, so please download it from the student repo.
- The worksheet is worth just 1%. More details can be found in the worksheet `.ipynb` file.
```
## Where are we with RDBMS? (RDBMS check)

- We learned about the reasons why we need a relational database.
- We learned about DDL and some important constraints (like PK, FK, and other constraints that we can add).
- We spent a lot of time on DML and the clauses of a `SELECT` query:
  - SELECT
  - FROM 
  - WHERE (use LIKE, ILIKE, and other logical operators like AND, OR, NOT - remember the order of precedence)
  - DISTINCT
  - GROUP BY
  - HAVING
  - ORDER BY
  - LIMIT
- We learned about casting and where and when to use it.
- We learned about different types of functions (like aggregate functions, string functions, date functions, and other functions).
- We learned about different types of joins. (INNER, LEFT, RIGHT, FULL, CROSS)
- We learned about the use of aliases and how to use them. (AS)
- We learned about the use of IN, ANY, ALL, EXISTS, NOT EXISTS
- We learned about subqueries and correlated subqueries.
- We compared views, materialized views, tables, temp tables, and CTEs.
- We learned about CTEs and window functions.
- We learned about ACID properties and how to use transactions.


## Story Time (Justin Bieber problem)

```{margin}
<img src="img/justin.png">

Check out [this article](https://www.wired.com/2015/11/how-instagram-solved-its-justin-bieber-problem/) to know why I am here!!
```

We want the right tool for the job. So far, we’ve used RDBMSs appropriately: normalizing the data, adding indexes where needed, and denormalizing for analytical and dashboard purposes. Postgres, like other relational databases (MySQL, Oracle, SQL Server), can support very large operations. Instagram, IMDb, Skype, and others use Postgres for business-critical services. However, SQL databases are not suited to every workload. Here is what happened at Instagram, where Justin Bieber's popularity threatened to bring down the entire infrastructure.

Around 2013, Instagram already had a fairly large infrastructure built on Postgres. Everything was going well until Justin Bieber joined Instagram. He was one of the first high-profile people on the platform. Whenever he posted something, people around the globe liked and commented on it simultaneously, hitting the servers so fast that people couldn't see their likes or comments, or had to wait a long time for them to appear, as a result of [locking tables](https://www.postgresql.org/docs/9.1/explicit-locking.html). After extensive debugging, Instagram decided to move to a hybrid system with Apache Cassandra and Postgres.

The takeaway from this story is that even with a large Postgres infrastructure, no other technique solved the problem, so they came to rely on a hybrid solution (Postgres and Cassandra). Most data systems today are hybrid; you will experience this when you start working in industry.

## Advantages of RDBMS

- Uses a (semi) standard language (SQL) across all platforms.
- Uses ACID principles
    - Atomic Transactions
    - Consistent data representation
    - Isolated Transactions
    - Durable Data
- Highly optimized query planner (EXPLAIN ANALYZE)

## Disadvantages of RDBMS

- Write times and table locks reduce efficiency
    - Particularly when the volume of upserts (UPDATE & INSERT) is high
- Some data has complex models
- Some data may be “sparse” in a normalized format
- Documents & keys may not be suited to an RDBMS
- Scaling difficulties

## What is NoSQL?

```{margin}
<img src="img/NoSQL.jpg">

[Here](http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page) check out the real NoSQL from 1998.
```

NoSQL (Not (Only) SQL) has been around for quite a while. The name was first used in 1998 by [Carlo Strozzi](http://www.strozzi.it/users/carlo/vitae.html) for his lightweight relational database that did not follow the standard SQL interface, hence NoSQL. Later, in 2009, the term became popular as the hashtag #NoSQL for discussing advances in the non-relational database world. It's a trendy buzzword these days.

### Main Goals of NoSQL

- Reduce the dependence on fixed data schema.
- Reduce the dependence on a single point of truth.
- Scaling!

### NoSQL Strengths

```{margin}
I came up with [the BASE] acronym with my students in their office earlier that year. I agree it is contrived a bit, but so is "ACID" -- much more than people realize, so we figured it was good enough. ***- Eric Brewer***
```

- (semi) Schema-free
  - With no referential integrity and internal consistency, we can easily split data across servers
- Accepting Basic Availability:
  - You usually see all the tweets when you sort by “Latest”, but sometimes a refresh shows some messages you missed.
  - You are usually chatting in real-time, but sometimes messages pile up.
- NoSQL is BASE
  - Basically Available
  - Soft State
  - Eventual Consistency
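
To make "eventual consistency" concrete, here is a toy sketch in plain Python. It is not how any real database replicates data; the `Replica` class and the `write` and `sync` helpers are invented for illustration. A write lands on one replica first, so a read from another replica can briefly return stale data until replication catches up.

```python
class Replica:
    """A toy replica: a dict plus a queue of updates not yet applied."""

    def __init__(self):
        self.data = {}
        self.pending = []  # updates received but not yet applied

    def read(self, key):
        return self.data.get(key)


def write(primary, others, key, value):
    """Apply the write on one replica now; the others only queue it."""
    primary.data[key] = value
    for replica in others:
        replica.pending.append((key, value))


def sync(replica):
    """Apply all queued updates (the 'eventual' part)."""
    for key, value in replica.pending:
        replica.data[key] = value
    replica.pending.clear()


a, b = Replica(), Replica()
write(a, [b], "likes", 101)

stale = b.read("likes")   # replica b has not caught up yet
sync(b)
fresh = b.read("likes")   # both replicas now agree
print(stale, fresh)
```

The system stays available for reads the whole time; the price is that `b` may answer with an older state for a short while.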


### Different types of NoSQL databases.

#### Key–value store

In these kinds of databases, data is stored as key-value pairs. You can think of it as a giant dictionary object. The database contains a simple, unique string as the key, and that key points to a data value, which may be large. The data value can be stored as an integer, a string, or a complex structure. This type of database was popularized by the Amazon Dynamo [paper](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf).

E.g., Amazon DynamoDB, Redis, Memcached
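
The "giant dictionary" analogy can be sketched with an ordinary Python `dict`. This is only an illustration of the access pattern, not of any particular product; the `put` and `get` helpers and the key naming scheme are assumptions made up for this example.

```python
# A toy in-memory key-value store: a unique string key maps to an opaque value.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("user:42:name", "Ada")
put("user:42:cart", ["book", "pen"])   # the value can be a complex structure
put("page:home:hits", 1024)

cart = get("user:42:cart")
missing = get("user:99:name")          # unknown keys simply return the default
print(cart, missing)
```

Lookups by key are direct, but a question such as "which users have a pen in their cart?" would require scanning every value, because the store does not look inside the values.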

#### Document store

A type of key-value DB with an internal (searchable) structure. A “document” contains all relevant data and has a unique key.

E.g., Elasticsearch, MongoDB, AWS DocumentDB
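
As a rough sketch of what "internal (searchable) structure" means, the plain-Python example below keeps each document under a unique key but still lets us filter on fields inside the documents. The data and the `find_by_field` helper are hypothetical and only illustrate the idea.

```python
# Each document lives under a unique key, but its fields stay visible to queries.
documents = {
    1: {"name": "Ada", "city": "London", "langs": ["Python", "SQL"]},
    2: {"name": "Linus", "city": "Helsinki", "langs": ["C"]},
    3: {"name": "Grace", "city": "London"},  # no "langs" field: documents may differ
}

def find_by_field(docs, field, value):
    """Return the keys of documents whose `field` equals `value`."""
    return [key for key, doc in docs.items() if doc.get(field) == value]

londoners = find_by_field(documents, "city", "London")
print(londoners)
```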


#### Column based

Column-based databases are built on ideas from Google's Bigtable [paper](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf). Ordinary files and relational databases typically store data on disk row by row, whereas here data is stored column by column. The efficiency gain comes when a query needs only a few columns.

E.g., Bigtable, Cassandra, Amazon Redshift.
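
The difference between the two layouts can be sketched in plain Python. This is a simplified in-memory picture with made-up rows, not the on-disk format of any of the systems above.

```python
# The same three records in a row-oriented and a column-oriented layout.
rows = [
    {"id": 1, "title": "A", "year": 1996, "runtime": 173},
    {"id": 2, "title": "B", "year": 1997, "runtime": 194},
    {"id": 3, "title": "C", "year": 2000, "runtime": 90},
]

# Column layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: every whole record is visited even though one field is needed.
avg_runtime_rows = sum(row["runtime"] for row in rows) / len(rows)

# Column layout: only the "runtime" column is read.
runtime = columns["runtime"]
avg_runtime_cols = sum(runtime) / len(runtime)
print(avg_runtime_rows, avg_runtime_cols)
```

Both computations give the same answer; the column layout just touches far less data when the table is wide and the query needs only a few columns.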


#### Graph based

Objects are defined with properties focused on relationships between objects. More to follow...

E.g., Neo4j, OrientDB, AWS Neptune.

## MongoDB

MongoDB is based on **JSON-like documents** for data storage. It offers:

- Native replication and sharding
- Automatic scaling and load balancing
- Multi-language support
- Powerful query language

### JSON

- JSON is short for _JavaScript Object Notation_.

- JSON documents are simple containers where a string key is mapped to a value (e.g., a number, string, boolean, array, or another object).

```json
{
  "_id": 1,
  "name" : { "first" : "John", "last" : "Backus" },
  "contribs" : [ "Fortran", "ALGOL", "Backus-Naur Form", "FP" ],
  "awards" : [
    {
      "award" : "W.W. McDowell Award",
      "year" : 1967,
      "by" : "IEEE Computer Society"
    }, {
      "award" : "Draper Prize",
      "year" : 1993,
      "by" : "National Academy of Engineering"
    }
  ]
}
```
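
As a small illustration, Python's standard `json` module can parse a document like the one above into nested dictionaries and lists. The variable names here are arbitrary.

```python
import json

text = """
{
  "_id": 1,
  "name": {"first": "John", "last": "Backus"},
  "contribs": ["Fortran", "ALGOL", "Backus-Naur Form", "FP"],
  "awards": [
    {"award": "W.W. McDowell Award", "year": 1967, "by": "IEEE Computer Society"},
    {"award": "Draper Prize", "year": 1993, "by": "National Academy of Engineering"}
  ]
}
"""

doc = json.loads(text)                       # JSON object -> Python dict
last_name = doc["name"]["last"]              # nested object -> nested dict
first_contrib = doc["contribs"][0]           # JSON array -> Python list
award_years = [award["year"] for award in doc["awards"]]
print(last_name, first_contrib, award_years)
```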

### BSON

Although the JSON document may look great for storing data **_as is_**, it has a number of drawbacks:

- JSON is text, and parsing text is comparatively slow
- JSON’s format is readable but not space-efficient (a database concern)
- JSON supports only a few data types (for example, it has no native date or binary type)

For the above reasons, MongoDB stores data in BSON (Binary JSON) files, which address all of the above issues but still look like JSON when we work with them in MongoDB.

For an overview, see [here](https://www.mongodb.com/json-and-bson).
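
Two of these drawbacks can be seen with the Python standard library alone. The sketch below does not use BSON itself; it uses `struct` as a stand-in for "some fixed-width binary encoding" to compare sizes, and the sample values are made up.

```python
import json
import struct
from datetime import datetime

# 1) Plain JSON has no date type, so the standard encoder rejects a datetime.
try:
    json.dumps({"released": datetime(1997, 12, 19)})
    json_accepts_datetime = True
except TypeError:
    json_accepts_datetime = False

# 2) Text can be larger than a fixed-width binary encoding of the same number.
value = 1234567890.123456
text_size = len(json.dumps(value).encode("utf-8"))   # number written as characters
binary_size = len(struct.pack("<d", value))          # 8-byte IEEE 754 double
print(json_accepts_datetime, text_size, binary_size)
```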

### Collections

In MongoDB, a database consists of one or more **collections**, each containing multiple **documents**.

<img src="img/collection.png" width="600">

### Documents

<img src="img/document.png" width="600">

- Each document contains field-value pairs
- The field name `_id` acts as the primary key of each document and should, therefore, be unique in a collection
- MongoDB automatically assigns an `_id` value if not specified at the time of inserting a document
- MongoDB creates an index on the `_id` field by default
- The maximum size of a BSON document is about 16MB

## MongoDB Atlas

[MongoDB Atlas](https://docs.atlas.mongodb.com/) is a fully managed cloud database service that automates configuring, administering, and maintaining a database server for you. Basically, you specify what kind of server (CPU, RAM, number of nodes, location, etc.) you need, and MongoDB Atlas sets it up for you. Atlas hosts its database instances on Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

Most of these services are paid; however, Atlas also offers a basic database service that is **free** and best suited for learning and exploring. We'll use the free MongoDB Atlas clusters for our course. You can set up your own cluster [here](https://www.mongodb.com/cloud/atlas/register).

<<<<put a video on getting started and the commands that we can run using MongoDB shell and MongoDB Compass>>>>

### MongoDB interfaces
#### MongoDB shell (`mongosh`)
#### MongoDB Compass
#### MongoDB's Python driver (`pymongo`)

And finally, `pymongo` is the official Python driver for MongoDB. If you are using the course `conda` environment, this package is installed and ready to use in Jupyter Lab. You can take a look at `pymongo`'s documentation [here](https://pymongo.readthedocs.io/en/stable/tutorial.html).

```python
from pymongo import MongoClient
import json
import urllib.parse

# Load the Atlas credentials from a local JSON file
with open('data/credentials_mongodb.json') as f:
    login = json.load(f)

username = login['username']
# Percent-escape the password so reserved characters (e.g. '@', ':', '/') don't break the URL
password = urllib.parse.quote_plus(login['password'])
host = login['host']
url = "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(username, password, host)
```

```python
client = MongoClient(url)
```
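
The credentials must be percent-escaped before they go into the connection string, because characters such as `@`, `:`, and `/` have a special meaning in a URL. The standalone sketch below shows the effect using only the standard library; the password, user name, and host are made up for illustration.

```python
from urllib.parse import quote, quote_plus

raw_password = "p@ss:w/rd"             # made-up password with reserved characters

escaped = quote_plus(raw_password)      # escapes '@', ':' and '/'
partly_escaped = quote(raw_password)    # leaves '/' as-is by default (safe='/')

# Hypothetical user and host, for illustration only.
url = "mongodb+srv://{}:{}@{}/".format(
    "student", escaped, "cluster0.example.mongodb.net"
)
print(escaped, partly_escaped)
```

After escaping, the only `@` left in the URL is the one that separates the credentials from the host.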

## MongoDB query language (MQL)

<img src="img/nosql.png" width="400">

([image source](https://dataedo.com/cartoon/it-is-nosql))

```{admonition} See also ...
SQL to MongoDB mapping chart: https://docs.mongodb.com/manual/reference/sql-comparison/
```

As mentioned earlier, there is no standard query language among NoSQL DBMSs. This is because each NoSQL DBMS supports a different data model and obviously no one language can suit all data models.

MongoDB has its own query language, known as the MongoDB Query Language or MQL (we already saw CQL for Neo4j). I will walk you through the usage of MQL in the remainder of this lecture.

### Accessing databases

The lecture notes show different ways to access databases through different interfaces. Here, we will use only `pymongo`.

<<<Maybe video on use compass and mongosh>>>

**`pymongo`**:

```python
my_db = client['my_db']
my_db
```

```
Database(MongoClient(host=['ac-uvqmtkk-shard-00-02.rjjc4dz.mongodb.net:27017', 'ac-uvqmtkk-shard-00-00.rjjc4dz.mongodb.net:27017', 'ac-uvqmtkk-shard-00-01.rjjc4dz.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='atlas-5uzn7i-shard-0', tls=True), 'my_db')
```

Running the above cell just gives you some information about our connection to the server. We'll learn how to run queries on this connection in a bit. For now, let's see what databases we have:

```python
client.list_database_names()
```

```
['exam',
 'mds',
 'sample_airbnb',
 'sample_analytics',
 'sample_geospatial',
 'sample_guides',
 'sample_mflix',
 'sample_restaurants',
 'sample_supplies',
 'sample_training',
 'sample_weatherdata',
 'admin',
 'local']
```

### Accessing collections

```python
my_collection = my_db['sample_mflix']
my_collection
```

```
Collection(Database(MongoClient(host=['ac-uvqmtkk-shard-00-02.rjjc4dz.mongodb.net:27017', 'ac-uvqmtkk-shard-00-00.rjjc4dz.mongodb.net:27017', 'ac-uvqmtkk-shard-00-01.rjjc4dz.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='atlas-5uzn7i-shard-0', tls=True), 'my_db'), 'sample_mflix')
```

```python
client['sample_mflix'].list_collection_names()
```

```
['users', 'comments', 'sessions', 'embedded_movies', 'theaters', 'movies']
```

```{important}
Everything in MongoDB is a JSON-like document, even the queries themselves!
```

### `find`

```python
client['sample_mflix']['movies'].find( filter={'title': 'Titanic'} )
```

```
<pymongo.synchronous.cursor.Cursor at 0x11315ae10>
```

The above code doesn't display any documents because `find()` returns a cursor object, which behaves like a Python iterator: results are fetched only as we iterate over it. Let's return the first element of this cursor:

```python
next(client['sample_mflix']['movies'].find( {'title': 'Titanic'} ))
```

```
{'_id': ObjectId('573a139af29313caabcefb1d'),
 'plot': 'The story of the 1912 sinking of the largest luxury liner ever built, the tragedy that befell over two thousand of the rich and famous as well as of the poor and unknown passengers aboard the doomed ship.',
 'genres': ['Action', 'Drama', 'History'],
 'runtime': 173,
 'cast': ['Peter Gallagher',
  'George C. Scott',
  'Catherine Zeta-Jones',
  'Eva Marie Saint'],
 'poster': 'https://m.media-amazon.com/images/M/MV5BYWM0MDE3OWMtMzlhZC00YzMyLThiNjItNzFhNGVhYzQ1YWM5XkEyXkFqcGdeQXVyMTczNjQwOTY@._V1_SY1000_SX677_AL_.jpg',
 'title': 'Titanic',
 'fullplot': "The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl o
```

Or we can pass it to `list()` to materialize the generator entirely:

```python
list(
    client['sample_mflix']['movies'].find( {'title': 'Titanic'} )
)
```

```
[{'_id': ObjectId('573a139af29313caabcefb1d'),
  'plot': 'The story of the 1912 sinking of the largest luxury liner ever built, the tragedy that befell over two thousand of the rich and famous as well as of the poor and unknown passengers aboard the doomed ship.',
  'genres': ['Action', 'Drama', 'History'],
  'runtime': 173,
  'cast': ['Peter Gallagher',
   'George C. Scott',
   'Catherine Zeta-Jones',
   'Eva Marie Saint'],
  'poster': 'https://m.media-amazon.com/images/M/MV5BYWM0MDE3OWMtMzlhZC00YzMyLThiNjItNzFhNGVhYzQ1YWM5XkEyXkFqcGdeQXVyMTczNjQwOTY@._V1_SY1000_SX677_AL_.jpg',
  'title': 'Titanic',
  'fullplot': "The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent I
```

> **Note:** `.find( filter={} )` or `.find()` returns every document in the collection.
> There is also a `.find_one()` method in `pymongo`. It returns only one document regardless of how many match, following the order in which documents are stored on the physical disk.
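
The lazy behaviour of a cursor can be imitated with an ordinary Python generator. The sketch below does not touch MongoDB at all: `fake_find` and the tiny `movies` list are made up to show why `next()` gives one result, `list()` gives the rest, and a cursor that has been consumed yields nothing more.

```python
def fake_find(docs, title):
    """Yield matching documents one at a time, like iterating over a cursor."""
    for doc in docs:
        if doc["title"] == title:
            yield doc

movies = [
    {"title": "Titanic", "year": 1996},
    {"title": "Titanic", "year": 1997},
    {"title": "Civilization", "year": 1916},
]

cursor = fake_find(movies, "Titanic")
first = next(cursor)     # fetch a single result
rest = list(cursor)      # materialize whatever is left
again = list(cursor)     # an exhausted iterator yields nothing more
print(first, rest, again)
```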

### `projection`

```python
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'_id': 0, 'title': 1, 'year': 1}
    )
)
```

```
[{'title': 'Titanic', 'year': 1996}, {'year': 1997, 'title': 'Titanic'}]
```

> **Note:** In `pymongo`, you can use `True` instead of `1` and `False` instead of `0`.
>
> **Note:** In `pymongo`, we need to enclose all field names in single or double quotes (e.g. `'title'` not `title`); otherwise Python raises a `NameError` because it doesn't recognize those names. In `mongosh`, this is not necessary.
>
> The primary key field, `_id`, is always returned by default unless you explicitly exclude it using `{'_id': 0}` or `{'_id': False}`, as we did above. **This is the only scenario where we may mix `1`s and `0`s (or `True`s and `False`s) in a projection.**
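
To build intuition for these rules, here is a toy, pure-Python imitation of an inclusion projection on top-level fields. The `project` helper and the sample document are hypothetical; MongoDB's real projection supports much more (nested fields, exclusion-only projections, and so on).

```python
def project(doc, projection):
    """Toy inclusion projection: keep listed fields, and `_id` unless excluded."""
    out = {}
    # `_id` is kept by default; only an explicit 0/False removes it.
    if projection.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    for field, value in doc.items():
        if field != "_id" and projection.get(field):
            out[field] = value
    return out

movie = {"_id": "abc", "title": "Titanic", "year": 1996, "runtime": 173}

without_id = project(movie, {"_id": 0, "title": 1, "year": 1})
with_id = project(movie, {"title": True})
print(without_id, with_id)
```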

### `sort`

```python
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'_id': 0, 'title': 1, 'year': 1, 'runtime': 1},
        sort=[('runtime', 1), ('year', -1)]
    )
)
```

```
[{'runtime': 173, 'title': 'Titanic', 'year': 1996},
 {'year': 1997, 'title': 'Titanic', 'runtime': 194}]
```
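
In the sort specification, `1` means ascending and `-1` means descending, and later keys only break ties among earlier ones. The plain-Python sketch below mimics that behaviour on made-up rows (the third row is invented to force a tie on `runtime`); `sort_docs` is a hypothetical helper, not part of `pymongo`.

```python
def sort_docs(docs, spec):
    """Sort by several (field, direction) pairs; 1 = ascending, -1 = descending."""
    result = list(docs)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last gives the usual multi-key ordering.
    for field, direction in reversed(spec):
        result.sort(key=lambda d: d[field], reverse=(direction == -1))
    return result

toy_movies = [
    {"title": "Titanic", "year": 1997, "runtime": 194},
    {"title": "Titanic", "year": 1996, "runtime": 173},
    {"title": "Titanic", "year": 1953, "runtime": 173},  # made-up row
]

ordered = sort_docs(toy_movies, [("runtime", 1), ("year", -1)])
years = [doc["year"] for doc in ordered]
print(years)
```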

### `limit`

```python
list(
    client['sample_mflix']['movies'].find(
        projection={'title': 1, '_id': 0},
        limit=5
    )
)
```

```
[{'title': 'The Great Train Robbery'},
 {'title': 'A Corner in Wheat'},
 {'title': 'Civilization'},
 {'title': 'The Poor Little Rich Girl'},
 {'title': 'Wild and Woolly'}]
```

### `count_documents`

```python
client['sample_mflix']['movies'].count_documents(filter={'year': 2000})
```

```
581
```

### `skip`

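```python
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'title': 1, 'year': 1}
    )
)
```

```
[{'_id': ObjectId('573a139af29313caabcefb1d'),
  'title': 'Titanic',
  'year': 1996},
 {'_id': ObjectId('573a139af29313caabcf0d74'),
  'year': 1997,
  'title': 'Titanic'}]
```

Note that this cell does not pass `skip`, so both matching documents are returned. `find()` also accepts a `skip` argument (for example, `skip=1`), which omits that many documents from the start of the result.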
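
Combining `skip` and `limit` is a common way to page through results. The idea can be sketched without a database using `itertools.islice`; the `page` helper and the list of titles are made up for illustration, and `limit=None` here simply means "no limit" in this sketch.

```python
from itertools import islice

titles = ["A", "B", "C", "D", "E", "F", "G"]

def page(items, skip=0, limit=None):
    """Drop the first `skip` items, then return at most `limit` of the rest."""
    stop = None if limit is None else skip + limit
    return list(islice(items, skip, stop))

second_page = page(titles, skip=3, limit=3)   # items 4 to 6
tail = page(titles, skip=5)                   # everything after the first five
print(second_page, tail)
```

Without an explicit sort, the order of the underlying results is not something to rely on, so paging is usually combined with `sort`.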

### `distinct`

```python
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'title': 1, 'year': 1}
    )
    .distinct('title')
)
```

```
['Titanic']
```

## Can you?

- Use MQL to interact with MongoDB?

## Class activity
Worksheets time!!
```{note}
We don't have a separate repo created like the labs. So please grab it from the student repo.
```

- Practice MQL.

```python
client['sample_mflix']['movies'].count_documents(
    filter={
        'cast': {
            '$eq': 'Norman Kerry'
        },
        'languages': {'$exists': True},
        'tomatoes.viewer.meter': {'$gte': 5, '$ne': ''}
    }
)
```

```
0
```
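
To help with reading filters like the one above, here is a simplified pure-Python sketch of how such a filter can be evaluated against a single document. It covers only `$eq`, `$ne`, `$gte`, `$exists`, and dotted paths, and it does not reproduce MongoDB's full comparison rules; `get_path`, `check`, `matches`, and the sample documents are all invented for illustration.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'tomatoes.viewer.meter'; return (found, value)."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False, None
        current = current[part]
    return True, current


def check(found, value, op, arg):
    """Evaluate one operator against one field value."""
    if op == "$exists":
        return found == arg
    if op == "$eq":
        # For array fields, equality also matches any single element.
        return found and (value == arg or (isinstance(value, list) and arg in value))
    if op == "$ne":
        return not check(found, value, "$eq", arg)
    if op == "$gte":
        # This sketch compares numbers only.
        return found and isinstance(value, (int, float)) and value >= arg
    raise ValueError(f"operator not covered by this sketch: {op}")


def matches(doc, query):
    """True if the document satisfies every condition (conditions are ANDed)."""
    for path, condition in query.items():
        found, value = get_path(doc, path)
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # {'title': 'x'} is shorthand for $eq
        if not all(check(found, value, op, arg) for op, arg in condition.items()):
            return False
    return True


query = {
    "cast": {"$eq": "Norman Kerry"},
    "languages": {"$exists": True},
    "tomatoes.viewer.meter": {"$gte": 5, "$ne": ""},
}

docs = [  # made-up documents
    {"cast": ["Norman Kerry", "X"], "languages": ["English"],
     "tomatoes": {"viewer": {"meter": 60}}},
    {"cast": ["Norman Kerry"], "tomatoes": {"viewer": {"meter": 60}}},
    {"cast": ["Someone"], "languages": ["English"],
     "tomatoes": {"viewer": {"meter": ""}}},
]

count = sum(matches(doc, query) for doc in docs)
print(count)
```

Only the first made-up document passes all three conditions: the second has no `languages` field, and the third fails the `cast` condition.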