# <center>Big Data for Engineers &ndash; Exercises</center>
## <center>Spring 2020 &ndash; Week 11 &ndash; ETH Zurich</center>
## <center>MongoDB</center>

## Introduction

This exercise covers document stores. MongoDB was chosen as a representative document store for the practical exercises. Instructions are provided for deploying it through the Azure portal.

## 1. Document stores

A record in a document store is a *document*. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects. Documents are composed of field-value pairs and have the following structure:

![123](https://docs.mongodb.com/manual/_images/crud-annotated-mongodb-insertOne.bakedsvg.svg)

The values of fields may include other documents, arrays, and arrays of documents. Collections in MongoDB have a flexible schema: the documents in a collection do not all need to have the same set of fields or structure, and a field shared by several documents may hold a different type of data in each of them.
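To make the flexible schema concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for the documents of one collection. No database is involved, and the documents themselves are made up for illustration.

```python
# Plain dicts stand in for the documents of one collection (illustrative data only).
collection = [
    {"name": "Ada", "age": 36},
    {"name": "Alan", "age": "41", "languages": ["English"]},  # "age" is a string here
    {"name": {"first": "Grace", "last": "Hopper"}},            # nested document, no "age"
]

# Collect, for each field, the set of value types that occur in the collection.
field_types = {}
for doc in collection:
    for field, value in doc.items():
        field_types.setdefault(field, set()).add(type(value).__name__)

for field in sorted(field_types):
    print(field, sorted(field_types[field]))
# age ['int', 'str']
# languages ['list']
# name ['dict', 'str']
```

The same field (`age`, `name`) appears with different types, and some documents omit fields entirely; a document store accepts all three in the same collection.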



### 1.1 General Questions
1. What are the advantages of document stores over relational databases?
2. Can the data in document stores be normalized? 
3. How does denormalization affect performance? 

###  Solution


1) Flexibility. Not every record needs to store the same properties. New properties can be added on the fly (Flexible schema).

2) Yes. References can be used for data normalization. 

<img src="https://docs.mongodb.com/manual/_images/data-model-normalized.bakedsvg.svg" style="width: 500px;"/>
<img src="https://docs.mongodb.com/manual/_images/data-model-denormalized.bakedsvg.svg" style="width: 500px;"/>


3)  All data for an object is stored in a single record. In general, it provides better performance for read operations (since expensive joins can be omitted), as well as the ability to request and retrieve related data in a single database operation. In addition, embedded data models make it possible to update related data in a single atomic write operation.
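The difference between the two data models can be sketched with plain Python dictionaries. The data and the helper names `city_normalized` and `city_embedded` are hypothetical and only serve to show how many lookups each model needs.

```python
# Normalized model: the address lives in its own "collection" and is referenced by id.
users_normalized = [{"_id": 1, "name": "Ada", "address_id": 10}]
addresses = {10: {"_id": 10, "city": "Zurich"}}

# Denormalized model: the address is embedded in the user document.
users_embedded = [{"_id": 1, "name": "Ada", "address": {"city": "Zurich"}}]

def city_normalized(user):
    # Two lookups: the user document, then the referenced address document.
    return addresses[user["address_id"]]["city"]

def city_embedded(user):
    # One lookup: everything needed is in the same document.
    return user["address"]["city"]

print(city_normalized(users_normalized[0]))  # Zurich
print(city_embedded(users_embedded[0]))      # Zurich
```

In a real deployment the second lookup of the normalized model is an additional query (or a join-like aggregation stage), which is where the read-performance difference comes from.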


### 1.2 True/False Questions
Say if the following statements are *true* or *false*.

1. Document stores expose only a key-value interface.
2. Different relationships between data can be represented by references and embedded documents.
3. MongoDB does not support schema validation.
4. MongoDB encodes documents in the XML format.
5. In document stores, you must determine and declare a table's schema before inserting data. 
6. MongoDB performance degrades when the number of documents increases. 
7. Document stores are column stores with flexible schema.
8. There are no joins in MongoDB.

###  Solution

1. (False) Document stores expose only a key-value interface. 
2. (True) Different relationships between data can be represented by references and embedded documents.
3. (False) MongoDB does not support schema validation.
4. (False) MongoDB encodes documents in the XML format. 
5. (False) In document stores, you must determine and declare a table's schema before inserting data. 
6. (True) MongoDB performance degrades when the number of documents increases. 
7. (False) Document stores are column stores with flexible schema.
8. (True) There are no joins in MongoDB. **Nonetheless, starting in version 3.2, MongoDB supports aggregations with the `$lookup` operator, which can perform a** ```LEFT OUTER JOIN```.
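As a rough illustration of the last answer, the sketch below shows what a `$lookup` stage looks like in `pymongo` syntax, together with a pure-Python imitation of its left-outer-join behaviour. The collection names `orders` and `customers`, the field names, and the helper `lookup` are assumptions made for this example; they are not part of the exercise dataset.

```python
# Aggregation pipeline with a $lookup stage (hypothetical collections and fields).
pipeline = [
    {"$lookup": {
        "from": "customers",          # collection to join with
        "localField": "customer_id",  # field in the input ("orders") documents
        "foreignField": "_id",        # field in the "customers" documents
        "as": "customer",             # name of the output array field
    }}
]
# With a live connection this would be run as: db.orders.aggregate(pipeline)

def lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Pure-Python imitation of the left outer join performed by $lookup."""
    joined = []
    for left in left_docs:
        matches = [r for r in right_docs
                   if r.get(foreign_field) == left.get(local_field)]
        # Every left document is kept; without a match the array is empty.
        joined.append({**left, as_field: matches})
    return joined

orders = [{"_id": 1, "customer_id": "a"}, {"_id": 2, "customer_id": "z"}]
customers = [{"_id": "a", "name": "Ada"}]
result = lookup(orders, customers, "customer_id", "_id", "customer")
for doc in result:
    print(doc)
```

The second order has no matching customer, yet it still appears in the result with an empty `customer` array, which is what makes the operation a left outer join.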

## 2. MongoDB

In this part of the exercise, you will set up a MongoDB image using **Azure Container Instances (ACI)**. By using ACI, apps can be deployed without explicitly managing virtual machines. You can learn more about ACI [here](https://azure.microsoft.com/en-us/services/container-instances/#overview).

<font color='red' size='5'>**Important: please delete your container after finishing the exercise.**</font>

### 2.1 Install MongoDB

1. Open the [Azure portal](https://portal.azure.com) and click **"Create a resource"**. After searching for `container instances`, click **"Container Instances Microsoft"** and **"Create"**.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/container_instances.png" width="500">
1. In the "Basics" tab, select your subscription for this exercise, and create a new resource group.
1. Fill in the container name and region. You can select any region you prefer.
1. Select **"Docker Hub or other registry"** for "Image source", and type in `mongo` in the "Image" field. By default, Azure will use [Docker Hub](https://hub.docker.com/) as the container registry. Leave other settings as default.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/basics.png" width="500">
1. In the "Networking" tab, choose a DNS name for your container. Open **port 27017** which is the default port that MongoDB listens to. Use TCP for the port.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/networking.png" width="500">
1. Change nothing on the "Advanced" and "Tags" tabs.
1. In the "Review" tab, review your resource settings and click "Create". The deployment should be finished in a couple of minutes. In fact, fast startup time is one of the benefits of using ACI!

### 2.2 Set up a test database

After the container is deployed, we need to connect to the container to create a database user.

1. Select the recently created container resource from Azure portal, click **"Settings - Containers"**, then choose the **"Connect"** tab. Use `/bin/bash` as start up command. Click **"Connect"**.
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/containers_connect.png" width="700">
1. Start the MongoDB shell by running `mongo -shell`.
1. Select the `admin` database:
```
use admin
```
1. Then create a `root` user:
```
db.createUser(
    {
        user: "root", 
        pwd: "root", 
        roles:["root"]
    }
)
```
1. Log out from MongoDB shell:
```
exit
```
1. Now we are in the shell of the container. Download an example dataset:
```sh
apt update && apt install wget && wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
```
1. Import the dataset using `mongoimport`:
```
mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json
```
If the dataset is successfully imported, you should see something similar to this:
<img src="https://bigdata2020exassets.blob.core.windows.net/bdfeex11/dataset_imported.png" width="100%">

### 2.3 Connect to the MongoDB server

We have finished setting up the database server. Next, we need to connect to the server using a `pymongo` client. First, install some packages:

In [2]:
!pip install pymongo==3.10.1
!pip install dnspython



Import some libraries:

In [3]:
from pymongo import MongoClient, errors
import dns
from pprint import pprint
import urllib
import json
import dateutil
from datetime import datetime, timezone, timedelta

In order to connect to MongoDB, we need to know the domain name of the host. In the resource console, click **"Overview"** to see the basic information of the container. Copy the host URL from the **"FQDN"** field and paste it in the following cell. Execute it to connect to the database.

In [4]:
# global variables for MongoDB host (default port is 27017)
DOMAIN = 'mymongo.westus.azurecontainer.io' # Note: this should be replaced by the URL of your own container!! 
PORT = 27017

# use a try-except block to catch connection errors
try:
    # instantiate a client (MongoClient connects lazily, so this does not contact the server yet)
    client = MongoClient(
        host = [ str(DOMAIN) + ":" + str(PORT) ],
        serverSelectionTimeoutMS = 3000, # 3 second timeout
        username = "root",
        password = "root",
    )

    # force a round trip to the server so that connection problems surface here
    client.server_info()

    db = client.test
    
except errors.ServerSelectionTimeoutError as err:
    # set the client to 'None' if exception
    client = None

    # catch pymongo.errors.ServerSelectionTimeoutError
    print ("pymongo ERROR:", err)
    
db.restaurants

Collection(Database(MongoClient(host=['mymongo.westus.azurecontainer.io:27017'], document_class=dict, tz_aware=False, connect=True, serverselectiontimeoutms=3000), 'test'), 'restaurants')
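As an alternative to passing the host and credentials as separate arguments, `MongoClient` also accepts a `mongodb://` connection URI. The sketch below builds such a URI; the helper name `build_mongo_uri` and the example password are made up for illustration. Credentials are percent-escaped with `urllib.parse.quote_plus` so that characters such as `@` or `:` do not break the URI.

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, host, port=27017):
    """Build a mongodb:// URI, percent-escaping the credentials."""
    return "mongodb://%s:%s@%s:%d" % (quote_plus(user), quote_plus(password), host, port)

# Hypothetical password containing characters that must be escaped.
uri = build_mongo_uri("root", "p@ss:word", "mymongo.westus.azurecontainer.io")
print(uri)
# mongodb://root:p%40ss%3Aword@mymongo.westus.azurecontainer.io:27017

# The URI could then be passed to the client, for example:
# client = MongoClient(uri, serverSelectionTimeoutMS=3000)
```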

As a sanity check, we count the number of documents in the `restaurants` collection that we previously imported. It should match the number reported by `mongoimport`.

In [None]:
db.restaurants.count_documents({})

25359

### 2.4  MongoDB CRUD operations

In this section, we will go through some commonly used CRUD (**C**reate, **R**ead, **U**pdate, **D**elete) operations in MongoDB.

In [None]:
# Create a new collection
scientists = db['scientists']

In [None]:
# Insert some documents.
# Note that documents can have nested structures, and the collection can be heterogeneous.
scientists.insert_one({
    "Name": {
        "First": "Albert",
        "Last": "Einstein"
    },
    "Theory": "Particle Physics"})
scientists.insert_one({
    "Name": {
        "First": "Kurt",
        "Last": "Gödel"
    },
    "Theory": "Incompleteness" })
scientists.insert_one({
    "Name": {
        "First": "Sheldon",
        "Last": "Cooper"
    }})

<pymongo.results.InsertOneResult at 0x7f71f62b3e60>

In [None]:
# Select all documents from the collection
scientists.find()

<pymongo.cursor.Cursor at 0x7f71f62f6f50>

In [None]:
# As you can see, the find() method returns a Cursor object. One must iterate over the Cursor object to access individual documents
for doc in scientists.find():
    pprint(doc)

{'Name': {'First': 'Albert', 'Last': 'Einstein'},
 'Theory': 'Particle Physics',
 '_id': ObjectId('60a4315a48d3bda027c3d328')}
{'Name': {'First': 'Kurt', 'Last': 'Gödel'},
 'Theory': 'Incompleteness',
 '_id': ObjectId('60a4315b48d3bda027c3d329')}
{'Name': {'First': 'Sheldon', 'Last': 'Cooper'},
 '_id': ObjectId('60a4315b48d3bda027c3d32a')}


### Query Documents
For the ```db.collection.find()``` method, you can specify the following optional arguments:
- a **query filter** to specify which documents to return,
- a **query projection** to specify which fields from the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network),
- optionally, a **cursor modifier** to impose limits, skips, and sort orders.

![query](https://docs.mongodb.com/manual/_images/crud-annotated-mongodb-find.bakedsvg.svg)

In [None]:
# Using a query filter
for doc in db.scientists.find({"Theory": "Particle Physics"}):
    pprint(doc)

{'Name': {'First': 'Albert', 'Last': 'Einstein'},
 'Theory': 'Particle Physics',
 '_id': ObjectId('60a4315a48d3bda027c3d328')}


In [None]:
# Using a projection
for doc in db.scientists.find({"Theory": "Particle Physics"}, {"Name.Last": 1}):
    pprint(doc)

{'Name': {'Last': 'Einstein'}, '_id': ObjectId('60a4315a48d3bda027c3d328')}


In [None]:
# Using a projection, with "_id" output disabled
for doc in db.scientists.find({"Theory": "Particle Physics"}, {"_id": 0, "Name.Last": 1}):
    pprint(doc)

{'Name': {'Last': 'Einstein'}}


In [None]:
# Insert more documents
doc_list = [
    {"Name":"Einstein", "Profession":"Physicist"},
    {"Name":"Gödel", "Profession":"Mathematician"},
    {"Name":"Ramanujan", "Profession":"Mathematician"},
    {"Name":"Pythagoras", "Profession":"Mathematician"},
    {"Name":"Turing", "Profession":"Computer Scientist"},
    {"Name":"Church", "Profession":"Computer Scientist"},
    {"Name":"Nash", "Profession":"Economist"},
    {"Name":"Euler", "Profession":"Mathematician"},
    {"Name":"Bohm", "Profession":"Physicist"},
    {"Name":"Galileo", "Profession":"Astrophysicist"},
    {"Name":"Lagrange", "Profession":"Mathematician"},
    {"Name":"Gauss", "Profession":"Mathematician"},
    {"Name":"Thales", "Profession":"Mathematician"}
]
scientists.insert_many(doc_list)

<pymongo.results.InsertManyResult at 0x7f71f6301d20>

In [None]:
# Using cursor modifiers
print("Using sort:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1):
    pprint(doc)
    
print("Using skip:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1).skip(1):
    pprint(doc)
    
print("Using limit:")
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1}).sort("Name", 1).skip(1).limit(3):
    pprint(doc)

Using sort:
{'Name': 'Euler'}
{'Name': 'Gauss'}
{'Name': 'Gödel'}
{'Name': 'Lagrange'}
{'Name': 'Pythagoras'}
{'Name': 'Ramanujan'}
{'Name': 'Thales'}
Using skip:
{'Name': 'Gauss'}
{'Name': 'Gödel'}
{'Name': 'Lagrange'}
{'Name': 'Pythagoras'}
{'Name': 'Ramanujan'}
{'Name': 'Thales'}
Using limit:
{'Name': 'Gauss'}
{'Name': 'Gödel'}
{'Name': 'Lagrange'}


In [None]:
# Updating documents

# Adding a new field:
scientists.update_many({"Name": "Einstein"}, {"$set": {"Century" : "20"}})
pprint(scientists.find_one({"Name": "Einstein"}))

# Changing the type of a field:
scientists.update_many({"Name": "Nash"}, {"$set": {"Profession" : ["Mathematician", "Economist"]}})
pprint(scientists.find_one({"Name": "Nash"}))

{'Century': '20',
 'Name': 'Einstein',
 'Profession': 'Physicist',
 '_id': ObjectId('60a4317248d3bda027c3d32b')}
{'Name': 'Nash',
 'Profession': ['Mathematician', 'Economist'],
 '_id': ObjectId('60a4317248d3bda027c3d331')}


In [None]:
# Matching array elements
for doc in scientists.find({"Profession": "Mathematician"}, {"_id": 0, "Name": 1, "Profession": 1}).sort("Name", 1):
    pprint(doc)

{'Name': 'Euler', 'Profession': 'Mathematician'}
{'Name': 'Gauss', 'Profession': 'Mathematician'}
{'Name': 'Gödel', 'Profession': 'Mathematician'}
{'Name': 'Lagrange', 'Profession': 'Mathematician'}
{'Name': 'Nash', 'Profession': ['Mathematician', 'Economist']}
{'Name': 'Pythagoras', 'Profession': 'Mathematician'}
{'Name': 'Ramanujan', 'Profession': 'Mathematician'}
{'Name': 'Thales', 'Profession': 'Mathematician'}
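The query above returns Nash even though his `Profession` is now an array. An equality filter on a field matches when the field value equals the operand, or when the field holds an array that contains the operand. A pure-Python imitation of that rule, with the hypothetical helper name `matches_equality`, might look as follows:

```python
def matches_equality(doc, field, value):
    """Imitate the filter {field: value}: equal scalar, or array containing the value."""
    actual = doc.get(field)
    if isinstance(actual, list):
        # An array matches if it contains the value (or is equal to it as a whole).
        return value in actual or actual == value
    return actual == value

euler = {"Name": "Euler", "Profession": "Mathematician"}
nash = {"Name": "Nash", "Profession": ["Mathematician", "Economist"]}
einstein = {"Name": "Einstein", "Profession": "Physicist"}

for doc in (euler, nash, einstein):
    print(doc["Name"], matches_equality(doc, "Profession", "Mathematician"))
# Euler True
# Nash True
# Einstein False
```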


In [None]:
# Delete documents
scientists.delete_one({"Profession": "Astrophysicist"})
scientists.count_documents({"Name": "Galileo"})

0

### `pymongo` vs MongoDB shell

In the lecture, we learnt how to write queries in the syntax of the MongoDB shell. The syntax is a bit different from the syntax of `pymongo`. Here are a few examples:

|     | MongoDB shell | `pymongo` | Note |
| --- | ------------- | --------- | ---- |
| Insert | `insert()` | `insert_one()` or `insert_many()` | `insert()` is also valid for `pymongo` but deprecated. |
| Update | `update()` | `update_one()` or `update_many()` | `update()` is also valid for `pymongo` but deprecated. |
| Delete | `remove()` | `delete_one()` or `delete_many()` | `remove()` is also valid for `pymongo` but deprecated. |
| Sort criterion | JSON document | list of `(key, direction)` pairs | |
| Naming convention | camelCase (e.g. `createIndex`) | snake_case (e.g. `create_index`) | |
| Count | `db.collection.find(filter).count()` | `db.collection.count_documents(filter)` | `count()` is also valid for `pymongo` but deprecated. |

It is not necessary to remember these differences, but you should understand the semantics of a query written in either `pymongo` or MongoDB shell syntax.
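The sort criterion is a difference that is easy to trip over, so here is a small sketch of it. The helper `shell_sort_to_pymongo` is a hypothetical name introduced only for this illustration.

```python
# Shell style:   db.restaurants.find().sort({borough: 1, "address.street": -1})
# pymongo style: db.restaurants.find().sort([("borough", 1), ("address.street", -1)])

def shell_sort_to_pymongo(sort_document):
    """Turn a shell-style sort document into pymongo's list of (key, direction) pairs."""
    # Dictionaries preserve insertion order in Python 3.7+, so the key order is kept.
    return list(sort_document.items())

print(shell_sort_to_pymongo({"borough": 1, "address.street": -1}))
# [('borough', 1), ('address.street', -1)]
```

The order of the keys matters for sorting, which is why `pymongo` asks for an explicitly ordered list of pairs.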

### 2.5 A larger dataset

Now it's time to play with a dataset of a more realistic size! Try inserting a document into the ```restaurants``` collection. The inserted document also shows the structure of the documents in the collection.

In [None]:
from dateutil.parser import isoparse
db.restaurants.insert_one(
   {
      "address" : {
         "street" : "2 Avenue",
         "zipcode" : "10075",
         "building" : "1480",
         "coord" : [ -73.9557413, 40.7720266 ]
      },
      "borough" : "Manhattan",
      "cuisine" : "Italian",
      "grades" : [
         {
            "date" : isoparse("2014-10-01T00:00:00Z"),
            "grade" : "A",
            "score" : 11
         },
         {
            "date" : isoparse("2014-01-16T00:00:00Z"),
            "grade" : "A",
            "score" : 17
         }
      ],
      "name" : "Vella",
      "restaurant_id" : "41704620"
   }
)

<pymongo.results.InsertOneResult at 0x7f71f5e45af0>

In [None]:
# Query one document in a collection:
pprint(db.restaurants.find_one())

{'_id': ObjectId('60a43126ce772f47e9a5f71a'),
 'address': {'building': '1007',
             'coord': [-73.856077, 40.848447],
             'street': 'Morris Park Ave',
             'zipcode': '10462'},
 'borough': 'Bronx',
 'cuisine': 'Bakery',
 'grades': [{'date': datetime.datetime(2014, 3, 3, 0, 0),
             'grade': 'A',
             'score': 2},
            {'date': datetime.datetime(2013, 9, 11, 0, 0),
             'grade': 'A',
             'score': 6},
            {'date': datetime.datetime(2013, 1, 24, 0, 0),
             'grade': 'A',
             'score': 10},
            {'date': datetime.datetime(2011, 11, 23, 0, 0),
             'grade': 'A',
             'score': 9},
            {'date': datetime.datetime(2011, 3, 10, 0, 0),
             'grade': 'B',
             'score': 14}],
 'name': 'Morris Park Bake Shop',
 'restaurant_id': '30075445'}


###  2.6 Questions
For this part of the exercise, we will use the `restaurants` collection. Write queries in MongoDB that return the following:

**1)** All restaurants in borough (a town) "Brooklyn" and cuisine (a style of cooking) "Hamburgers".

In [None]:
# insert your query here:
cursor = db.restaurants.find({"borough": "Brooklyn", "cuisine": "Hamburgers"})
pprint(cursor[0])

{'_id': ObjectId('60a43126ce772f47e9a5f71b'),
 'address': {'building': '469',
             'coord': [-73.961704, 40.662942],
             'street': 'Flatbush Avenue',
             'zipcode': '11225'},
 'borough': 'Brooklyn',
 'cuisine': 'Hamburgers',
 'grades': [{'date': datetime.datetime(2014, 12, 30, 0, 0),
             'grade': 'A',
             'score': 8},
            {'date': datetime.datetime(2014, 7, 1, 0, 0),
             'grade': 'B',
             'score': 23},
            {'date': datetime.datetime(2013, 4, 30, 0, 0),
             'grade': 'A',
             'score': 12},
            {'date': datetime.datetime(2012, 5, 8, 0, 0),
             'grade': 'A',
             'score': 12}],
 'name': "Wendy'S",
 'restaurant_id': '30112340'}


**2)** The number of restaurants in the borough "Brooklyn" and cuisine "Hamburgers".

In [None]:
# insert your query here:
db.restaurants.count_documents({"borough": "Brooklyn", "cuisine": "Hamburgers"})

102

**3)** All restaurants with zipcode 11225.

In [None]:
# insert your query here:
cursor = db.restaurants.find({"address.zipcode": "11225"})
pprint(cursor[0])

{'_id': ObjectId('60a43126ce772f47e9a5f71b'),
 'address': {'building': '469',
             'coord': [-73.961704, 40.662942],
             'street': 'Flatbush Avenue',
             'zipcode': '11225'},
 'borough': 'Brooklyn',
 'cuisine': 'Hamburgers',
 'grades': [{'date': datetime.datetime(2014, 12, 30, 0, 0),
             'grade': 'A',
             'score': 8},
            {'date': datetime.datetime(2014, 7, 1, 0, 0),
             'grade': 'B',
             'score': 23},
            {'date': datetime.datetime(2013, 4, 30, 0, 0),
             'grade': 'A',
             'score': 12},
            {'date': datetime.datetime(2012, 5, 8, 0, 0),
             'grade': 'A',
             'score': 12}],
 'name': "Wendy'S",
 'restaurant_id': '30112340'}


**4)** Names of restaurants with zipcode 11225 that have at least one grade "C".

In [None]:
# insert your query here:
cursor = db.restaurants.find({"address.zipcode": "11225", "grades.grade": "C"}, {"name": 1})
pprint(cursor[0])

{'_id': ObjectId('60a43126ce772f47e9a5fd7e'), 'name': "Vee'S Restaurant"}


**5)** Names of restaurants with zipcode 11225 that have as first grade "C" and as second grade "A".

In [None]:
# insert your query here:
cursor = db.restaurants.find({"address.zipcode": "11225", "grades.0.grade": "C", "grades.1.grade": "A"}, {"name": 1})
pprint(cursor[0])

{'_id': ObjectId('60a43128ce772f47e9a648a8'), 'name': 'Careta Bar & Restaurant'}


**6)** Names and streets of restaurants that don't have an "A" grade.

In [None]:
# insert your query here:
cursor = db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1, "address.street": 1})
pprint(cursor[0])

{'_id': ObjectId('60a43126ce772f47e9a5f8cf'),
 'address': {'street': 'Thompson Street'},
 'name': 'Tomoe Sushi'}


**7)** All restaurants with a grade C and a score greater than 50 for that grade at the same time.

In [None]:
# insert your query here:
cursor = db.restaurants.find({"grades": {"$elemMatch": {"grade": "C", "score": {"$gt": 50}}}})
pprint(cursor[0])

{'_id': ObjectId('60a43126ce772f47e9a5f726'),
 'address': {'building': '1269',
             'coord': [-73.871194, 40.6730975],
             'street': 'Sutter Avenue',
             'zipcode': '11208'},
 'borough': 'Brooklyn',
 'cuisine': 'Chinese',
 'grades': [{'date': datetime.datetime(2014, 9, 16, 0, 0),
             'grade': 'B',
             'score': 21},
            {'date': datetime.datetime(2013, 8, 28, 0, 0),
             'grade': 'A',
             'score': 7},
            {'date': datetime.datetime(2013, 4, 2, 0, 0),
             'grade': 'C',
             'score': 56},
            {'date': datetime.datetime(2012, 8, 15, 0, 0),
             'grade': 'B',
             'score': 27},
            {'date': datetime.datetime(2012, 3, 28, 0, 0),
             'grade': 'B',
             'score': 27}],
 'name': 'May May Kitchen',
 'restaurant_id': '40358429'}
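The phrase "at the same time" is what calls for `$elemMatch` in the query above: both conditions must hold for the same array element. Writing the conditions as two dotted paths would instead allow them to be satisfied by different elements. The following pure-Python sketch imitates both filters on a made-up document; the function names are hypothetical.

```python
def match_elem(doc):
    """Imitates {"grades": {"$elemMatch": {"grade": "C", "score": {"$gt": 50}}}}."""
    return any(g["grade"] == "C" and g["score"] > 50 for g in doc["grades"])

def match_dotted(doc):
    """Imitates {"grades.grade": "C", "grades.score": {"$gt": 50}}."""
    return (any(g["grade"] == "C" for g in doc["grades"])
            and any(g["score"] > 50 for g in doc["grades"]))

# Hypothetical restaurant: the "C" and the high score belong to different inspections.
doc = {"grades": [{"grade": "C", "score": 10}, {"grade": "A", "score": 60}]}
print(match_elem(doc), match_dotted(doc))
# False True
```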


**8)** All restaurants with a grade C or a score greater than 50.

In [None]:
# insert your query here:
cursor = db.restaurants.find({"$or": [{"grades.score": {"$gt": 50}}, {"grades.grade": "C"}]})
pprint(cursor[0])

{'_id': ObjectId('60a43126ce772f47e9a5f726'),
 'address': {'building': '1269',
             'coord': [-73.871194, 40.6730975],
             'street': 'Sutter Avenue',
             'zipcode': '11208'},
 'borough': 'Brooklyn',
 'cuisine': 'Chinese',
 'grades': [{'date': datetime.datetime(2014, 9, 16, 0, 0),
             'grade': 'B',
             'score': 21},
            {'date': datetime.datetime(2013, 8, 28, 0, 0),
             'grade': 'A',
             'score': 7},
            {'date': datetime.datetime(2013, 4, 2, 0, 0),
             'grade': 'C',
             'score': 56},
            {'date': datetime.datetime(2012, 8, 15, 0, 0),
             'grade': 'B',
             'score': 27},
            {'date': datetime.datetime(2012, 3, 28, 0, 0),
             'grade': 'B',
             'score': 27}],
 'name': 'May May Kitchen',
 'restaurant_id': '40358429'}


**9)** All restaurants that have only A grades.

In [None]:
# insert your query here:
cursor = db.restaurants.find({"grades": {"$not": {"$elemMatch": {"grade": {"$ne": "A"}}}}})
pprint(cursor[0])

{'_id': ObjectId('60a43126ce772f47e9a5f71c'),
 'address': {'building': '351',
             'coord': [-73.98513559999999, 40.7676919],
             'street': 'West   57 Street',
             'zipcode': '10019'},
 'borough': 'Manhattan',
 'cuisine': 'Irish',
 'grades': [{'date': datetime.datetime(2014, 9, 6, 0, 0),
             'grade': 'A',
             'score': 2},
            {'date': datetime.datetime(2013, 7, 22, 0, 0),
             'grade': 'A',
             'score': 11},
            {'date': datetime.datetime(2012, 7, 31, 0, 0),
             'grade': 'A',
             'score': 12},
            {'date': datetime.datetime(2011, 12, 29, 0, 0),
             'grade': 'A',
             'score': 12}],
 'name': 'Dj Reynolds Pub And Restaurant',
 'restaurant_id': '30191841'}


## 3. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. Such a collection scan can be highly inefficient, because it requires MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

MongoDB supports indexes that contain either a single field or multiple fields (compound indexes); which one is appropriate depends on the operations that the index has to support.

By default,  MongoDB creates the ```_id``` index, which is an ascending unique index on the ```_id``` field, for all collections when the collection is created. You cannot remove the index on the ```_id``` field.
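The idea behind an index can be sketched in a few lines of Python. This is only a toy model on made-up data: a sorted list with binary search stands in for the index, whereas MongoDB itself uses B-tree indexes. It nevertheless shows why far fewer entries have to be examined.

```python
import bisect

# Made-up documents; only the indexed field and the _id are shown.
docs = [
    {"_id": 0, "borough": "Bronx"},
    {"_id": 1, "borough": "Brooklyn"},
    {"_id": 2, "borough": "Manhattan"},
    {"_id": 3, "borough": "Brooklyn"},
    {"_id": 4, "borough": "Queens"},
]

def collection_scan(value):
    """Examine every document; returns (matching ids, documents examined)."""
    hits = [d["_id"] for d in docs if d["borough"] == value]
    return hits, len(docs)

# The toy "index": (field value, _id) pairs kept in sorted order.
index = sorted((d["borough"], d["_id"]) for d in docs)

def index_scan(value):
    """Binary-search for the first matching entry, then walk the matching run."""
    pos = bisect.bisect_left(index, (value,))
    hits = []
    while pos < len(index) and index[pos][0] == value:
        hits.append(index[pos][1])
        pos += 1
    return hits, len(hits)

print(collection_scan("Brooklyn"))  # ([1, 3], 5)
print(index_scan("Brooklyn"))       # ([1, 3], 2)
```

Both functions return the same documents, but the scan touches all five entries while the index lookup touches only the two matching ones.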

### Managing indexes in MongoDB

The ```explain()``` method provides information on the query plan. It returns a document that describes the stages and indexes used to answer the query. This may provide useful insight when attempting to optimize a query. Example:

In [None]:
db.restaurants.find({"borough" : "Brooklyn"}).explain()

{'executionStats': {'allPlansExecution': [],
  'executionStages': {'advanced': 6086,
   'direction': 'forward',
   'docsExamined': 25360,
   'executionTimeMillisEstimate': 1,
   'filter': {'borough': {'$eq': 'Brooklyn'}},
   'isEOF': 1,
   'nReturned': 6086,
   'needTime': 19275,
   'needYield': 0,
   'restoreState': 25,
   'saveState': 25,
   'stage': 'COLLSCAN',
   'works': 25362},
  'executionSuccess': True,
  'executionTimeMillis': 13,
  'nReturned': 6086,
  'totalDocsExamined': 25360,
  'totalKeysExamined': 0},
 'ok': 1.0,
 'queryPlanner': {'indexFilterSet': False,
  'namespace': 'test.restaurants',
  'parsedQuery': {'borough': {'$eq': 'Brooklyn'}},
  'plannerVersion': 1,
  'rejectedPlans': [],
  'winningPlan': {'direction': 'forward',
   'filter': {'borough': {'$eq': 'Brooklyn'}},
   'stage': 'COLLSCAN'}},
 'serverInfo': {'gitVersion': '72e66213c2c3eab37d9358d5e78ad7f5c1d0d0d7',
  'host': 'wk-caas-86a3f02b03ca49938922fb529b1cfe0b-32e01c0b27eaabfd720d0d',
  'port': 27017,
  'versi

In `pymongo`, you can create an index by calling the `create_index()` method. For example, we can create an index for the `borough` field:

In [14]:
db.restaurants.create_index("borough")

'borough_1'

Now, let's see how the query plan changes to use the newly created index:

In [15]:
db.restaurants.find({"borough" : "Brooklyn"}).explain()

{'executionStats': {'allPlansExecution': [],
  'executionStages': {'advanced': 6086,
   'alreadyHasObj': 0,
   'docsExamined': 6086,
   'executionTimeMillisEstimate': 1,
   'inputStage': {'advanced': 6086,
    'direction': 'forward',
    'dupsDropped': 0,
    'dupsTested': 0,
    'executionTimeMillisEstimate': 0,
    'indexBounds': {'borough': ['["Brooklyn", "Brooklyn"]']},
    'indexName': 'borough_1',
    'indexVersion': 2,
    'isEOF': 1,
    'isMultiKey': False,
    'isPartial': False,
    'isSparse': False,
    'isUnique': False,
    'keyPattern': {'borough': 1},
    'keysExamined': 6086,
    'multiKeyPaths': {'borough': []},
    'nReturned': 6086,
    'needTime': 0,
    'needYield': 0,
    'restoreState': 6,
    'saveState': 6,
    'seeks': 1,
    'stage': 'IXSCAN',
    'works': 6087},
   'isEOF': 1,
   'nReturned': 6086,
   'needTime': 0,
   'needYield': 0,
   'restoreState': 6,
   'saveState': 6,
   'stage': 'FETCH',
   'works': 6087},
  'executionSuccess': True,
  'executionTi

The number of documents examined is indicated in the `docsExamined` field. With the index, it drops from 25360 to 6086. In fact, in this example the number of documents examined is exactly the number of documents returned (`nReturned`).
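Because the full `explain()` output is long, it can help to extract only the stage names and the `docsExamined` values. The helper below, `summarize_plan`, is a hypothetical convenience function and not part of `pymongo`. It follows the `inputStage` chain only, so plans with several inputs (an `inputStages` list) are not handled by this sketch.

```python
def summarize_plan(stage):
    """Return the chain of stage names (outermost first) and the docsExamined values found."""
    names, examined = [], []
    while stage is not None:
        names.append(stage["stage"])
        if "docsExamined" in stage:
            examined.append(stage["docsExamined"])
        stage = stage.get("inputStage")
    return names, examined

# Abridged from the explain() output shown above.
stages = {
    "stage": "FETCH",
    "docsExamined": 6086,
    "nReturned": 6086,
    "inputStage": {"stage": "IXSCAN", "keysExamined": 6086, "indexName": "borough_1"},
}
print(summarize_plan(stages))
# (['FETCH', 'IXSCAN'], [6086])

# With a live connection, the input would come from:
# db.restaurants.find({"borough": "Brooklyn"}).explain()["executionStats"]["executionStages"]
```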

The index specification describes the kind of index for that field. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order. **Note that index direction only matters in a compound index.**
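To see why direction matters for compound indexes, consider which sort orders an index can deliver by being traversed forwards or backwards. The sketch below encodes a simplified version of the rule: the sort keys must form a prefix of the index keys, and their directions must either all match or all be inverted. The function `index_supports_sort` is a hypothetical helper, and the rule deliberately ignores refinements such as equality filters on leading index fields.

```python
def index_supports_sort(index_keys, sort_keys):
    """Simplified check: sort keys are a prefix of the index keys,
    with every direction identical or every direction inverted."""
    if len(sort_keys) > len(index_keys):
        return False
    prefix = index_keys[:len(sort_keys)]
    if [k for k, _ in prefix] != [k for k, _ in sort_keys]:
        return False
    same = all(d_i == d_s for (_, d_i), (_, d_s) in zip(prefix, sort_keys))
    inverted = all(d_i == -d_s for (_, d_i), (_, d_s) in zip(prefix, sort_keys))
    return same or inverted

index = [("borough", -1), ("address.street", -1)]
print(index_supports_sort(index, [("borough", -1), ("address.street", -1)]))  # True
print(index_supports_sort(index, [("borough", 1), ("address.street", 1)]))    # True
print(index_supports_sort(index, [("borough", 1), ("address.street", -1)]))   # False
print(index_supports_sort(index, [("borough", 1)]))                            # True
```

For a single-field sort the index can simply be read in reverse, so the direction is irrelevant; with two fields and mixed directions, neither traversal order produces the requested result.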

To remove all indexes, you can use ```db.collection.drop_indexes()```. Example:

In [None]:
print("Before drop_indexes():")
for index in db.restaurants.list_indexes():
    pprint(index)
print("Now we drop all indexes...")
db.restaurants.drop_indexes()
print("After drop_indexes():")
for index in db.restaurants.list_indexes():
    pprint(index)

Before drop_indexes():
SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])
SON([('v', 2), ('key', SON([('borough', 1)])), ('name', 'borough_1')])
Now we drop all indexes...
After drop_indexes():
SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])


To remove a specific index you can use ```db.collection.drop_index(index_name)```. Example:

In [None]:
print('Create some indexes first...')
db.restaurants.create_index([('cuisine', -1), ('borough', 1)]) 
index_name = db.restaurants.create_index('address.building')
print('\nNow we have these indexes:')
for index in db.restaurants.list_indexes():
    pprint(index)
    
print('\nThen drop_index()...')
db.restaurants.drop_index(index_name)
print('\nThe remaining indexes are:')
for index in db.restaurants.list_indexes():
    pprint(index)

Create some indexes first...

Now we have these indexes:
SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])
{'key': SON([('cuisine', -1), ('borough', 1)]),
 'name': 'cuisine_-1_borough_1',
 'v': 2}
{'key': SON([('address.building', 1)]),
 'name': 'address.building_1',
 'v': 2}

Then drop_index()...

The remaining indexes are:
SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])
{'key': SON([('cuisine', -1), ('borough', 1)]),
 'name': 'cuisine_-1_borough_1',
 'v': 2}


### 3.1 Questions

**Please answer questions 1) and 2) in Moodle.**

**1)** Which queries will use the following index: 
```python
db.restaurants.create_index("cuisine")
```

A.  `db.restaurants.find({"address.street": "2 Avenue"})`  
B.  `db.restaurants.find({}, {"cuisine": 1})`  
C.  `db.restaurants.find({"borough": "Brooklyn"}, {"cuisine": 1})`  
D.  `db.restaurants.find({"cuisine": "Italian"})`

**Solution**: Only query **D** would benefit from the index.

In [16]:
print('Creating index on "cuisine"...')
db.restaurants.drop_indexes()
db.restaurants.create_index("cuisine")

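print('\nQuery A:')
# note: this cell filters on "address.city" rather than "address.street"; neither field is covered by the index on "cuisine"
print(db.restaurants.find({"address.city": "Boston"}).explain()['executionStats']['executionStages'])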

print('\nQuery B:')
print(db.restaurants.find({}, {"cuisine": 1}).explain()['executionStats']['executionStages'])

print('\nQuery C:')
print(db.restaurants.find({"borough": "Brooklyn"}, {"cuisine": 1}).explain()['executionStats']['executionStages'])

print('\nQuery D:')
print(db.restaurants.find({"cuisine": "Italian"}).explain()['executionStats']['executionStages'])

Creating index on "cuisine"...

Query A:
{'stage': 'COLLSCAN', 'filter': {'address.city': {'$eq': 'Boston'}}, 'nReturned': 0, 'executionTimeMillisEstimate': 1, 'works': 25361, 'advanced': 0, 'needTime': 25360, 'needYield': 0, 'saveState': 25, 'restoreState': 25, 'isEOF': 1, 'direction': 'forward', 'docsExamined': 25359}

Query B:
{'stage': 'PROJECTION_SIMPLE', 'nReturned': 25359, 'executionTimeMillisEstimate': 1, 'works': 25361, 'advanced': 25359, 'needTime': 1, 'needYield': 0, 'saveState': 25, 'restoreState': 25, 'isEOF': 1, 'transformBy': {'cuisine': 1}, 'inputStage': {'stage': 'COLLSCAN', 'nReturned': 25359, 'executionTimeMillisEstimate': 1, 'works': 25361, 'advanced': 25359, 'needTime': 1, 'needYield': 0, 'saveState': 25, 'restoreState': 25, 'isEOF': 1, 'direction': 'forward', 'docsExamined': 25359}}

Query C:
{'stage': 'PROJECTION_SIMPLE', 'nReturned': 6086, 'executionTimeMillisEstimate': 2, 'works': 25361, 'advanced': 6086, 'needTime': 19274, 'needYield': 0, 'saveState': 25, 'res

**2)** Which queries will use the following index: 
```python
db.restaurants.create_index([("borough", -1), ("address.street", -1)])
```

A.  `db.restaurants.find().sort([("borough", 1), ("address.street", -1)])`   
B.  `db.restaurants.find({"address.street": "2 Avenue"})`    
C.  `db.restaurants.find({"address.zipcode": "10075"}, {"address": 1})`    
D.  `db.restaurants.find({}, {"address": -1})` 

**Solution**: No queries would benefit from the index.

In [17]:
print('Creating index on "borough" and "address.street" ...')
db.restaurants.drop_indexes()
db.restaurants.create_index([("borough", -1), ("address.street", -1)])

print('\nQuery A:')
print(db.restaurants.find().sort([("borough", 1), ("address.street", -1)]).explain()['executionStats']['executionStages'])

print('\nQuery B:')
print(db.restaurants.find({"address.street": "2 Avenue"}).explain()['executionStats']['executionStages'])

print('\nQuery C:')
print(db.restaurants.find({"address.zipcode": "10075"}, {"address": 1}).explain()['executionStats']['executionStages'])

print('\nQuery D:')
print(db.restaurants.find({}, {"address": -1}).explain()['executionStats']['executionStages'])

Creating index on "borough" and "address.street" ...

Query A:
{'stage': 'SORT', 'nReturned': 25359, 'executionTimeMillisEstimate': 54, 'works': 50721, 'advanced': 25359, 'needTime': 25361, 'needYield': 0, 'saveState': 51, 'restoreState': 51, 'isEOF': 1, 'sortPattern': {'borough': 1, 'address.street': -1}, 'memLimit': 104857600, 'type': 'simple', 'totalDataSizeSorted': 11029959, 'usedDisk': False, 'inputStage': {'stage': 'COLLSCAN', 'nReturned': 25359, 'executionTimeMillisEstimate': 0, 'works': 25361, 'advanced': 25359, 'needTime': 1, 'needYield': 0, 'saveState': 51, 'restoreState': 51, 'isEOF': 1, 'direction': 'forward', 'docsExamined': 25359}}

Query B:
{'stage': 'COLLSCAN', 'filter': {'address.street': {'$eq': '2 Avenue'}}, 'nReturned': 391, 'executionTimeMillisEstimate': 1, 'works': 25361, 'advanced': 391, 'needTime': 24969, 'needYield': 0, 'saveState': 25, 'restoreState': 25, 'isEOF': 1, 'direction': 'forward', 'docsExamined': 25359}

Query C:
{'stage': 'PROJECTION_SIMPLE', 'nRetu

**3)** Write a command for creating an index on the `zipcode` field (nested inside `address`).

**Solution:**

In [18]:
db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([("address.zipcode", 1)])

# print all indexes
for index in db.restaurants.list_indexes():
    pprint(index)

SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])
{'key': SON([('address.zipcode', 1)]),
 'name': 'address.zipcode_1',
 'v': 2}


**4)** Write an index to speed up the following query:
```python
db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1, "address.street": 1})
```

**Solution:**

In [19]:
db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([("grades.grade", 1)])

# verify the query plan
print(db.restaurants.find({"grades.grade": {"$ne": "A"}}, {"name": 1 , "address.street": 1})
      .explain()['executionStats']['executionStages'])

{'stage': 'PROJECTION_DEFAULT', 'nReturned': 1919, 'executionTimeMillisEstimate': 5, 'works': 14744, 'advanced': 1919, 'needTime': 12824, 'needYield': 0, 'saveState': 14, 'restoreState': 14, 'isEOF': 1, 'transformBy': {'name': 1, 'address.street': 1}, 'inputStage': {'stage': 'FETCH', 'filter': {'grades.grade': {'$not': {'$eq': 'A'}}}, 'nReturned': 1919, 'executionTimeMillisEstimate': 5, 'works': 14744, 'advanced': 1919, 'needTime': 12824, 'needYield': 0, 'saveState': 14, 'restoreState': 14, 'isEOF': 1, 'docsExamined': 11616, 'alreadyHasObj': 0, 'inputStage': {'stage': 'IXSCAN', 'nReturned': 11616, 'executionTimeMillisEstimate': 1, 'works': 14744, 'advanced': 11616, 'needTime': 3127, 'needYield': 0, 'saveState': 14, 'restoreState': 14, 'isEOF': 1, 'keyPattern': {'grades.grade': 1}, 'indexName': 'grades.grade_1', 'isMultiKey': True, 'multiKeyPaths': {'grades.grade': ['grades']}, 'isUnique': False, 'isSparse': False, 'isPartial': False, 'indexVersion': 2, 'direction': 'forward', 'indexBou
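
The plans printed by `explain()` are nested dictionaries, which are hard to read by eye. A minimal sketch of a helper that flattens them into a list of stage names is shown below. The names `plan_stages` and `uses_index` are hypothetical, and the helper only looks for `IXSCAN`, so other index-backed stages would not be detected. The example dictionaries are abbreviated versions of the plans in this exercise, keeping only the `stage` and `inputStage` keys.

```python
def plan_stages(plan):
    """Collect the stage names of an execution plan, outermost stage first."""
    stages = [plan["stage"]]
    # Most stages have one child under 'inputStage';
    # some have several children in a list under 'inputStages'.
    children = []
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    children.extend(plan.get("inputStages", []))
    for child in children:
        stages.extend(plan_stages(child))
    return stages


def uses_index(plan):
    """True if any stage in the plan is an index scan."""
    return "IXSCAN" in plan_stages(plan)


# Abbreviated plans (only the keys the helper reads are kept).
indexed_plan = {
    "stage": "PROJECTION_DEFAULT",
    "inputStage": {"stage": "FETCH", "inputStage": {"stage": "IXSCAN"}},
}
scan_plan = {"stage": "COLLSCAN"}

print(plan_stages(indexed_plan))  # ['PROJECTION_DEFAULT', 'FETCH', 'IXSCAN']
print(uses_index(indexed_plan))   # True
print(uses_index(scan_plan))      # False
```

With a live connection, the same helper could be applied to `cursor.explain()['executionStats']['executionStages']`, as in the cells of this exercise.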

**5)** Write an index to speed up the following query:
```python
db.restaurants.find({"grades.score": {"$gt": 50}, "grades.grade": "C"})
```

**Solution:**

In [20]:
db.restaurants.drop_indexes()

# write your code here:
db.restaurants.create_index([("grades.score", 1), ("grades.grade", 1)])

# verify the query plan
print(db.restaurants.find({"grades.score" : {"$gt" : 50}, "grades.grade" : "C"})
      .explain()['executionStats']['executionStages'])

{'stage': 'FETCH', 'filter': {'grades.grade': {'$eq': 'C'}}, 'nReturned': 315, 'executionTimeMillisEstimate': 0, 'works': 357, 'advanced': 315, 'needTime': 41, 'needYield': 0, 'saveState': 0, 'restoreState': 0, 'isEOF': 1, 'docsExamined': 349, 'alreadyHasObj': 0, 'inputStage': {'stage': 'IXSCAN', 'nReturned': 349, 'executionTimeMillisEstimate': 0, 'works': 357, 'advanced': 349, 'needTime': 7, 'needYield': 0, 'saveState': 0, 'restoreState': 0, 'isEOF': 1, 'keyPattern': {'grades.score': 1, 'grades.grade': 1}, 'indexName': 'grades.score_1_grades.grade_1', 'isMultiKey': True, 'multiKeyPaths': {'grades.score': ['grades'], 'grades.grade': ['grades']}, 'isUnique': False, 'isSparse': False, 'isPartial': False, 'indexVersion': 2, 'direction': 'forward', 'indexBounds': {'grades.score': ['(50, inf.0]'], 'grades.grade': ['[MinKey, MaxKey]']}, 'keysExamined': 356, 'seeks': 1, 'dupsTested': 356, 'dupsDropped': 7}}


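**Comment:** The index would not work for the following query, because `grades.grade` on its own is not a prefix of the compound index `(grades.score, grades.grade)`: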
```python
db.restaurants.find({"grades.grade" : "C"})
```
The query plan below confirms this with a full collection scan (`COLLSCAN`):

In [21]:
# verify the query plan
print(db.restaurants.find({"grades.grade" : "C"})
      .explain()['executionStats']['executionStages'])

{'stage': 'COLLSCAN', 'filter': {'grades.grade': {'$eq': 'C'}}, 'nReturned': 2708, 'executionTimeMillisEstimate': 5, 'works': 25361, 'advanced': 2708, 'needTime': 22652, 'needYield': 0, 'saveState': 25, 'restoreState': 25, 'isEOF': 1, 'direction': 'forward', 'docsExamined': 25359}
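
The prefix rule from the comment above can be written down as a small check. This is a deliberately simplified sketch under the assumption that an index helps a filter only when its first field appears in the filter. The function name `filter_can_use_index` is hypothetical, and the real query planner considers more than this.

```python
def filter_can_use_index(index_fields, filter_fields):
    """Simplified prefix rule (an assumption, not the real planner):
    the first field of the index must be constrained by the filter."""
    return bool(index_fields) and index_fields[0] in filter_fields


compound = ["grades.score", "grades.grade"]

# Query from exercise 5: constrains both fields, including the leading one.
print(filter_can_use_index(compound, {"grades.score", "grades.grade"}))  # True
# Query from the comment: only the second index field is constrained.
print(filter_can_use_index(compound, {"grades.grade"}))                  # False
# Query B from exercise 2 against the (borough, address.street) index.
print(filter_can_use_index(["borough", "address.street"], {"address.street"}))  # False
```

To serve the query on `grades.grade` alone, one option is a separate single-field index on `grades.grade`, as created in exercise 4.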


<font color='red' size='5'>**Important: please delete your container after finishing the exercise.**</font>