# <center>Big Data for Engineers &ndash; Exercises</center>
## <center>Spring 2020 &ndash; Week 11 &ndash; ETH Zurich</center>
## <center>MongoDB</center>

## Introduction

This exercise covers document stores. MongoDB was chosen as the representative document store for the practical part. Instructions are provided for deploying it through the Azure portal.

## 1. Document stores

A record in a document store is a *document*. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects. Documents are composed of field-value pairs and have the following structure:

![123](https://docs.mongodb.com/manual/_images/crud-annotated-mongodb-insertOne.bakedsvg.svg)

The values of fields may include other documents, arrays, and arrays of documents. A collection in MongoDB has a flexible schema: the documents in a collection do not all need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
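
As a rough illustration of this flexibility, the sketch below models three documents of one collection as plain Python dictionaries. The field names and values are made up for this example and are not part of the exercise dataset.

```python
import json

# Hypothetical documents that could live in the same collection.
# They share some fields, but not all, and "phone" holds different types.
people = [
    {"name": "Ada", "phone": "555-0100"},
    {"name": "Grace", "phone": ["555-0101", "555-0102"]},          # array value
    {"name": "Alan", "address": {"city": "London", "zip": "NW1"}},  # embedded document
]

# Collect the set of field names used by each document.
field_sets = [set(doc.keys()) for doc in people]

# Fields present in every document vs. fields present in only some of them.
common_fields = set.intersection(*field_sets)
all_fields = set.union(*field_sets)

print("common:", sorted(common_fields))
print("optional:", sorted(all_fields - common_fields))
print(json.dumps(people[2], indent=2))
```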



### 1.1 General Questions
1. What are advantages of document stores over relational databases?
2. Can the data in document stores be normalized? 
3. How does denormalization affect performance? 

### 1.2 True/False Questions
Say if the following statements are *true* or *false*.

1. Document stores expose only a key-value interface.
2. Different relationships between data can be represented by references and embedded documents.
3. MongoDB does not support schema validation.
4. MongoDB encodes documents in the XML format.
5. In document stores, you must determine and declare a table's schema before inserting data. 
6. MongoDB performance degrades when the number of documents increases. 
7. Document stores are column stores with flexible schema.
8. There are no joins in MongoDB.

## 2. MongoDB

In this part of the exercise, you will set up a MongoDB image using **Azure Container Instances (ACI)**. By using ACI, apps can be deployed without explicitly managing virtual machines. You can learn more about ACI [here](https://azure.microsoft.com/en-us/services/container-instances/#overview).

<font color='red' size='5'>**Important: please delete your container after finishing the exercise.**</font>

### 2.1 Install MongoDB

1. Open the [Azure portal](https://portal.azure.com) and click **"Create a resource"**. After searching for `container instances`, click **"Container Instances Microsoft"** and **"Create"**.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/container_instances.png" width="500">
1. In the "Basics" tab, select your subscription for this exercise, and create a new resource group.
1. Fill in the container name and region. You can select any region you prefer.
1. Select **"Docker Hub or other registry"** for "Image source", and type in `mongo` in the "Image" field. By default, Azure will use [Docker Hub](https://hub.docker.com/) as the container registry. Leave other settings as default.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/basics.png" width="500">
1. In the "Networking" tab, choose a DNS name for your container. Open **port 27017**, which is the default port that MongoDB listens on. Use TCP for the port.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/networking.png" width="500">
1. Change nothing in the "Advanced" and "Tags" tabs.
1. In the "Review" tab, review your resource settings and click "Create". The deployment should be finished in a couple of minutes. In fact, fast startup time is one of the benefits of using ACI!

### 2.2 Set up a test database

After the container is deployed, we need to connect to the container to create a database user.

1. Select the newly created container resource in the Azure portal, click **"Settings - Containers"**, then choose the **"Connect"** tab. Use `/bin/bash` as the startup command. Click **"Connect"**.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/containers_connect.png" width="700">
1. Start the MongoDB shell by running `mongo -shell`.
1. Select the `admin` database:
```
use admin
```
1. Then create a `root` user:
```
db.createUser(
    {
        user: "root", 
        pwd: "root", 
        roles:["root"]
    }
)
```
1. Log out from MongoDB shell:
```
exit
```
1. Now we are in the shell of the container. Download an example dataset:
```sh
apt update && apt install -y wget && wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
```
1. Import the dataset using `mongoimport`:
```
mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json
```
If the dataset is successfully imported, you should see something similar to this:
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/dataset_imported.png" width="100%">

### 2.3 Connect to the MongoDB server

We have finished setting up the database server. Next, we need to connect to the server using a `pymongo` client. First, install some packages:

In [None]:
!pip install pymongo==3.10.1
!pip install dnspython

Import some libraries:

In [None]:
from pymongo import MongoClient, errors
import dns
from pprint import pprint
import urllib
import json
import dateutil
from datetime import datetime, timezone, timedelta

To connect to MongoDB, we need to know the domain name of the host. In the resource console, click **"Overview"** to see the basic information of the container. Copy the host name from the **"FQDN"** field and paste it into the following cell. Execute the cell to connect to the database.

In [None]:
# global variables for MongoDB host (default port is 27017)
DOMAIN = 'mymongo.westeurope.azurecontainer.io' # Note: replace this with the FQDN of your own container!
PORT = 27017

# use a try-except block to catch connection errors
try:
    # instantiate a client; pymongo connects lazily, so this call alone does not contact the server
    client = MongoClient(
        host = [ str(DOMAIN) + ":" + str(PORT) ],
        serverSelectionTimeoutMS = 3000, # 3 second timeout
        username = "root",
        password = "root",
    )

    # force a round trip so that an unreachable server raises the error here
    client.server_info()

    db = client.test
    
except errors.ServerSelectionTimeoutError as err:
    # set the client to 'None' if the server cannot be reached
    client = None

    # report the pymongo.errors.ServerSelectionTimeoutError
    print("pymongo ERROR:", err)
    
db.restaurants
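
As an alternative to passing `host`, `username`, and `password` separately, `MongoClient` also accepts a single connection string. The sketch below only builds such a string with the standard library; `build_mongo_uri` is a hypothetical helper, not part of `pymongo`. Percent-encoding the credentials matters when they contain characters such as `@` or `:`.

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, domain, port=27017):
    """Build a mongodb:// connection string with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}/".format(
        quote_plus(user), quote_plus(password), domain, port
    )

# The domain below is the placeholder used in this exercise; replace it with your own FQDN.
uri = build_mongo_uri("root", "root", "mymongo.westeurope.azurecontainer.io")
print(uri)

# Special characters in a password are escaped so they cannot be confused with URI delimiters.
escaped_uri = build_mongo_uri("root", "p@ss:word", "localhost")
print(escaped_uri)
```

The resulting string could then be passed to the client, for example as `MongoClient(uri, serverSelectionTimeoutMS=3000)`.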

As a sanity check, we count the number of documents in the `restaurants` collection that we previously imported. It should match the number reported by `mongoimport`.

In [None]:
db.restaurants.count_documents({})

### 2.4  MongoDB CRUD operations

In this section, we will go through some commonly used CRUD (**C**reate, **R**ead, **U**pdate, **D**elete) operations in MongoDB.

In [None]:
# Create a new collection
scientists = db['scientists']

In [None]:
# Insert some documents.
# Note that documents can have nested structures, and the collection can be heterogeneous.
scientists.insert_one({
    "Name": {
        "First": "Albert",
        "Last": "Einstein"
    },
    "Theory": "Particle Physics"})
scientists.insert_one({
    "Name": {
        "First": "Kurt",
        "Last": "Gödel"
    },
    "Theory": "Incompleteness" })
scientists.insert_one({
    "Name": {
        "First": "Sheldon",
        "Last": "Cooper"
    }})

In [None]:
# Select all documents from the collection
scientists.find()

In [None]:
# As you can see, the find() method returns a Cursor object. Iterate over the cursor to access the individual documents
for doc in scientists.find():
    pprint(doc)

### Query Documents
For the ```db.collection.find()``` method, you can specify the following optional parts:
- a **query filter** to specify which documents to return,
- a **query projection** to specify which fields from the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network),
- a **cursor modifier** to impose limits, skips, and sort orders.

![query](https://docs.mongodb.com/manual/_images/crud-annotated-mongodb-find.bakedsvg.svg)
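
To make the roles of these three parts concrete, here is a pure-Python sketch that mimics them on a list of dictionaries. It is only a mental model under simplifying assumptions (equality filters and inclusion projections on top-level fields); `find_sketch` is a hypothetical helper and says nothing about how MongoDB evaluates queries internally.

```python
def find_sketch(docs, query=None, projection=None, sort_key=None, skip=0, limit=None):
    """Mimic filter -> sort -> skip -> limit -> projection on a list of dicts."""
    query = query or {}
    # Query filter: keep documents whose fields equal all requested values.
    result = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    # Cursor modifiers: sort first, then skip, then limit.
    if sort_key is not None:
        result = sorted(result, key=lambda d: d[sort_key])
    result = result[skip:]
    if limit is not None:
        result = result[:limit]
    # Query projection: keep only the requested fields.
    if projection is not None:
        result = [{k: d[k] for k in projection if k in d} for d in result]
    return result

# A few documents in the style of the `scientists` collection.
docs = [
    {"Name": "Turing", "Profession": "Computer Scientist"},
    {"Name": "Gauss", "Profession": "Mathematician"},
    {"Name": "Euler", "Profession": "Mathematician"},
    {"Name": "Thales", "Profession": "Mathematician"},
]

names = find_sketch(docs, {"Profession": "Mathematician"}, ["Name"],
                    sort_key="Name", skip=1, limit=1)
print(names)  # [{'Name': 'Gauss'}]
```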

In [None]:
# Using a query filter
for doc in db.scientists.find({"Theory": "Particle Physics"}):
    pprint(doc)

In [None]:
# Using a projection
for doc in db.scientists.find({"Theory": "Particle Physics"}, {"Name.Last": 1}):
    pprint(doc)

In [None]:
# Using a projection, with "_id" output disabled
for doc in db.scientists.find({"Theory": "Particle Physics"}, {"_id": 0, "Name.Last": 1}):
    pprint(doc)
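
The projection `"Name.Last"` uses dot notation to reach into an embedded document. A minimal sketch of how such a path can be resolved against a nested dictionary is given below; `get_path` is a hypothetical helper for illustration and ignores arrays, which MongoDB's dot notation also handles.

```python
def get_path(doc, path):
    """Follow a dotted path such as "Name.Last" through nested dictionaries."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # the path does not exist in this document
        current = current[key]
    return current

einstein = {"Name": {"First": "Albert", "Last": "Einstein"}, "Theory": "Particle Physics"}

print(get_path(einstein, "Name.Last"))    # Einstein
print(get_path(einstein, "Name.Middle"))  # None
```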

In [None]:
# Insert more documents
doc_list = [
    {"Name":"Einstein", "Profession":"Physicist"},
    {"Name":"Gödel", "Profession":"Mathematician"},
    {"Name":"Ramanujan", "Profession":"Mathematician"},
    {"Name":"Pythagoras", "Profession":"Mathematician"},
    {"Name":"Turing", "Profession":"Computer Scientist"},
    {"Name":"Church", "Profession":"Computer Scientist"},
    {"Name":"Nash", "Profession":"Economist"},
    {"Name":"Euler", "Profession":"Mathematician"},
    {"Name":"Bohm", "Profession":"Physicist"},
    {"Name":"Galileo", "Profession":"Astrophysicist"},
    {"Name":"Lagrange", "Profession":"Mathematician"},
    {"Name":"Gauss", "Profession":"Mathematician"},
    {"Name":"Thales", "Profession":"Mathematician"}
]
scientists.insert_many(doc_list)

In [None]:
# Using cursor modifiers
print("Using sort:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1):
    pprint(doc)
    
print("Using skip:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1).skip(1):
    pprint(doc)
    
print("Using limit:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1).skip(1).limit(3):
    pprint(doc)

In [None]:
# Updating documents

# Adding a new field:
scientists.update_many({"Name": "Einstein"}, {"$set": {"Century" : "20"}})
pprint(scientists.find_one({"Name": "Einstein"}))

# Changing the type of a field:
scientists.update_many({"Name": "Nash"}, {"$set": {"Profession" : ["Mathematician", "Economist"]}})
pprint(scientists.find_one({"Name": "Nash"}))

In [None]:
# Matching array elements
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1, "Profession": 1}).sort("Name", 1):
    pprint(doc)
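
The query above still returns Nash although his `Profession` is now an array. An equality condition on a field matches either the value itself or, if the field holds an array, any element of that array. The following sketch mimics this rule in plain Python; `matches_equality` is a hypothetical helper, not a `pymongo` function.

```python
def matches_equality(field_value, wanted):
    """Mimic {"field": wanted}: match the value itself or any array element."""
    if isinstance(field_value, list):
        return wanted in field_value or field_value == wanted
    return field_value == wanted

print(matches_equality("Mathematician", "Mathematician"))                 # True
print(matches_equality(["Mathematician", "Economist"], "Mathematician"))  # True
print(matches_equality(["Mathematician", "Economist"], "Physicist"))      # False
```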

In [None]:
# Delete documents
scientists.delete_one({"Profession": "Astrophysicist"})
scientists.count_documents({"Name": "Galileo"})

### `pymongo` vs MongoDB shell

In the lecture, we learnt how to write queries in the syntax of the MongoDB shell. The syntax is a bit different from the syntax of `pymongo`. Here are a few examples:

|     | MongoDB shell | `pymongo` | Note |
| --- | ------------- | --------- | ---- |
| Insert | `insert()` | `insert_one()` or `insert_many()` | `insert()` is also valid for `pymongo` but deprecated. |
| Update | `update()` | `update_one()` or `update_many()` | `update()` is also valid for `pymongo` but deprecated. |
| Delete | `remove()` | `delete_one()` or `delete_many()` | `remove()` is also valid for `pymongo` but deprecated. |
| Sort criterion | JSON document | list of `(key, direction)` pairs | |
| Naming convention | camelCase (e.g. `createIndex`) | snake_case (e.g. `create_index`) | |
| Count | `db.collection.find(filter).count()` | `db.collection.count_documents(filter)` | `count()` is also valid for `pymongo` but deprecated. |

It is not necessary to remember these differences, but you should understand the semantics of a query written in either `pymongo` or MongoDB shell syntax.
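
For instance, a sort criterion such as `{"borough": 1, "name": -1}` in shell syntax corresponds to a list of pairs in `pymongo`. The sketch below performs this conversion with plain Python; `to_pymongo_sort` is a hypothetical helper, and the two constants mirror the values of `pymongo.ASCENDING` and `pymongo.DESCENDING`.

```python
ASCENDING, DESCENDING = 1, -1  # same values as pymongo.ASCENDING / pymongo.DESCENDING

def to_pymongo_sort(shell_sort):
    """Turn a shell-style sort document into a list of (key, direction) pairs."""
    return [(key, direction) for key, direction in shell_sort.items()]

shell_sort = {"borough": ASCENDING, "name": DESCENDING}
pairs = to_pymongo_sort(shell_sort)
print(pairs)  # [('borough', 1), ('name', -1)]
```

The resulting list could be passed to a cursor, for example as `db.restaurants.find().sort(pairs)`.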

### 2.5 A larger dataset

Now it's time to play with a dataset of more realistic size! Try inserting a document into the ```restaurants``` collection. The inserted document also shows the structure of the documents in the collection.

In [None]:
from dateutil.parser import isoparse
db.restaurants.insert_one(
   {
      "address" : {
         "street" : "2 Avenue",
         "zipcode" : "10075",
         "building" : "1480",
         "coord" : [ -73.9557413, 40.7720266 ]
      },
      "borough" : "Manhattan",
      "cuisine" : "Italian",
      "grades" : [
         {
            "date" : isoparse("2014-10-01T00:00:00Z"),
            "grade" : "A",
            "score" : 11
         },
         {
            "date" : isoparse("2014-01-16T00:00:00Z"),
            "grade" : "A",
            "score" : 17
         }
      ],
      "name" : "Vella",
      "restaurant_id" : "41704620"
   }
)

In [None]:
# Query one document in a collection:
pprint(db.restaurants.find_one())

###  2.6 Questions
For this part of the exercise, we will use the `restaurants` collection. Write queries in MongoDB that return the following:

**1)** All restaurants in borough (a town) "Brooklyn" and cuisine (a style of cooking) "Hamburgers".

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

**2)** The number of restaurants in the borough "Brooklyn" and cuisine "Hamburgers".

In [None]:
# insert your query here:
db.restaurants.count_documents({})

**3)** All restaurants with zipcode 11225.

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

**4)** Names of restaurants with zipcode 11225 that have at least one grade "C".

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

**5)** Names of restaurants with zipcode 11225 that have as first grade "C" and as second grade "A".

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

**6)** Names and streets of restaurants that don't have an "A" grade.

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

**7)** All restaurants with a grade C and a score greater than 50 for that grade at the same time.

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

**8)** All restaurants with a grade C or a score greater than 50.

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

**9)** All restaurants that have only A grades.

In [None]:
# insert your query here:
cursor = db.restaurants.find()
pprint(cursor[0]) # print the first returned document

## 3. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. Such a collection scan can be highly inefficient because it requires MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

MongoDB supports single-field indexes as well as compound indexes that contain multiple fields. Which kind is appropriate depends on the queries that the index is meant to support.

By default, MongoDB creates the ```_id``` index, which is an ascending unique index on the ```_id``` field, for all collections when the collection is created. You cannot remove the index on the ```_id``` field.
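
The idea can be sketched in plain Python: an index is modelled as a sorted list of field values with the positions of the corresponding documents, and it is searched with binary search, whereas a collection scan inspects every document. This is a toy model with made-up documents; MongoDB's actual indexes are B-tree structures and are not implemented like this.

```python
from bisect import bisect_left, bisect_right

# Made-up documents for illustration.
docs = [
    {"name": "A", "borough": "Queens"},
    {"name": "B", "borough": "Brooklyn"},
    {"name": "C", "borough": "Bronx"},
    {"name": "D", "borough": "Brooklyn"},
    {"name": "E", "borough": "Manhattan"},
]

def collection_scan(docs, field, value):
    """Check every document; returns (matches, number of documents examined)."""
    matches = [d for d in docs if d.get(field) == value]
    return matches, len(docs)

def build_index(docs, field):
    """Return sorted field values and, in parallel, the positions of their documents."""
    entries = sorted((d[field], i) for i, d in enumerate(docs))
    keys = [k for k, _ in entries]
    positions = [p for _, p in entries]
    return keys, positions

def index_lookup(docs, index, value):
    """Binary-search the index, then fetch only the matching documents."""
    keys, positions = index
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    matches = [docs[p] for p in positions[lo:hi]]
    return matches, len(matches)

index = build_index(docs, "borough")
scan_matches, scan_examined = collection_scan(docs, "borough", "Brooklyn")
index_matches, index_examined = index_lookup(docs, index, "Brooklyn")
print(scan_examined)   # 5
print(index_examined)  # 2
```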

### Managing indexes in MongoDB

The ```explain()``` method provides information on the query plan. It returns a document that describes the stages and indexes used to answer the query. This can provide useful insight when optimizing a query. Example:

In [None]:
db.restaurants.find({"borough" : "Brooklyn"}).explain()

In `pymongo`, you can create an index by calling the `create_index()` method. For example, we can create an index for the `borough` field:

In [None]:
db.restaurants.create_index("borough")

Now, let's see how the query plan changes to use the newly created index:

In [None]:
db.restaurants.find({"borough" : "Brooklyn"}).explain()

The number of documents examined is indicated in the `docsExamined` field. The number drops significantly by using an index. In fact, in this example the number of documents examined is exactly the number of documents returned (`nReturned`).
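
To avoid reading the whole `explain()` output by eye, the stages can be walked programmatically. The sketch below uses a hand-written, heavily shortened dictionary shaped like the `executionStats.executionStages` part of the output; the numbers are invented for illustration, and real output contains many more fields. `summarize_stages` is a hypothetical helper.

```python
def summarize_stages(stage):
    """Return the stage names of a plan from the outermost to the innermost stage."""
    names = [stage["stage"]]
    # A stage may have a single child ("inputStage") or several ("inputStages").
    children = list(stage.get("inputStages", []))
    if "inputStage" in stage:
        children = [stage["inputStage"]] + children
    for child in children:
        names.extend(summarize_stages(child))
    return names

# Hand-written example; the values are invented, not measured.
example_stages = {
    "stage": "FETCH",
    "nReturned": 3,
    "docsExamined": 3,
    "inputStage": {"stage": "IXSCAN", "nReturned": 3, "keysExamined": 3},
}

print(summarize_stages(example_stages))         # ['FETCH', 'IXSCAN']
print(summarize_stages({"stage": "COLLSCAN"}))  # ['COLLSCAN']
```

With a live connection, the helper could be applied to `db.restaurants.find({"borough": "Brooklyn"}).explain()['executionStats']['executionStages']`.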

The index specification describes the kind of index for that field. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order. **Note that index direction only matters in a compound index.**
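
The reason is that MongoDB can traverse an index in either direction. A single-field index therefore serves both ascending and descending sorts, while a compound index can serve a multi-key sort only if the sort directions match the index key pattern or its exact inverse. The sketch below encodes this rule in simplified form (it assumes the sort keys must form a prefix of the index keys and ignores query filters); `index_supports_sort` is a hypothetical helper.

```python
def index_supports_sort(index_spec, sort_spec):
    """Simplified check: can the sort be answered by walking the index in one direction?"""
    if len(sort_spec) > len(index_spec):
        return False
    prefix = index_spec[:len(sort_spec)]
    # The sort must use the same keys, in the same order, as the index prefix.
    if [k for k, _ in prefix] != [k for k, _ in sort_spec]:
        return False
    # Either every direction matches (forward walk) or every direction is flipped (backward walk).
    same = all(d_idx == d_sort for (_, d_idx), (_, d_sort) in zip(prefix, sort_spec))
    inverse = all(d_idx == -d_sort for (_, d_idx), (_, d_sort) in zip(prefix, sort_spec))
    return same or inverse

index_spec = [("cuisine", -1), ("name", 1)]

print(index_supports_sort(index_spec, [("cuisine", -1), ("name", 1)]))  # True
print(index_supports_sort(index_spec, [("cuisine", 1), ("name", -1)]))  # True
print(index_supports_sort(index_spec, [("cuisine", 1), ("name", 1)]))   # False
print(index_supports_sort([("cuisine", 1)], [("cuisine", -1)]))         # True
```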

To remove all indexes, you can use ```db.collection.drop_indexes()```. Example:

In [None]:
print("Before drop_indexes():")
for index in db.restaurants.list_indexes():
    pprint(index)
print("Now we drop all indexes...")
db.restaurants.drop_indexes()
print("After drop_indexes():")
for index in db.restaurants.list_indexes():
    pprint(index)

To remove a specific index you can use ```db.collection.drop_index(index_name)```. Example:

In [None]:
print('Create some indexes first...')
db.restaurants.create_index([('cuisine', -1), ('borough', 1)])
index_name = db.restaurants.create_index('address.building')
print('\nNow we have these indexes:')
for index in db.restaurants.list_indexes():
    pprint(index)
    
print('\nThen drop_index()...')
db.restaurants.drop_index(index_name)
print('\nThe remaining indexes are:')
for index in db.restaurants.list_indexes():
    pprint(index)

### 3.1 Questions

**Please answer questions 1) and 2) in Moodle.**

**1)** Which queries will use the following index: 
```python
db.restaurants.create_index("cuisine")
```

A.  `db.restaurants.find({"address.street": "2 Avenue"})`  
B.  `db.restaurants.find({}, {"cuisine": 1})`  
C.  `db.restaurants.find({"borough": "Brooklyn"}, {"cuisine": 1})`  
D.  `db.restaurants.find({"cuisine": "Italian"})`

**2)** Which queries will use the following index: 
```python
db.restaurants.create_index([("borough", -1), ("address.street", -1)])
```

A.  `db.restaurants.find().sort([("borough", 1), ("address.street", -1)])`   
B.  `db.restaurants.find({"address.street": "2 Avenue"})`    
C.  `db.restaurants.find({"address.zipcode": "10075"}, {"address": 1})`    
D.  `db.restaurants.find({}, {"address": -1})`     

**3)** Write a command for creating an index on the `zipcode` field.

In [None]:
db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([])

# print all indexes
for index in db.restaurants.list_indexes():
    pprint(index)

**4)** Write an index to speed up the following query:
```python
    db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1 , "address.street": 1})
```

In [None]:
db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([])

# verify the query plan
print(db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1 , "address.street": 1})
      .explain()['executionStats']['executionStages'])

**5)** Write an index to speed up the following query:
```python
    db.restaurants.find({"grades.score" : {"$gt" : 50}, "grades.grade" : "C"})
```

In [None]:
db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([])

# verify the query plan
print(db.restaurants.find({"grades.score" : {"$gt" : 50}, "grades.grade" : "C"})
      .explain()['executionStats']['executionStages'])

<font color='red' size='5'>**Important: please delete your container after finishing the exercise.**</font>