# Indexing in MongoDB

---
### Connecting to MongoDB using PyMongo
----

In [1]:
# Importing the required libraries
import pymongo

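import pprint as pp
# Stop pprint from sorting dictionary keys, so documents print in their stored field order
# (this overrides the `sorted` name that the pprint module looks up internally)
pp.sorted = lambda x, key=None: x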

----
### New York Airbnb Data

Contains information about accommodations listed on Airbnb in the New York City area.

Website - http://insideairbnb.com/

----

### Restore database to local server

---

In [2]:
# # Restore database to local server
# !mongorestore /home/avadmin/Desktop/Mongo/Indexing/nyc_airbnb/

---
### Connect to local server

---

In [3]:
# Connect to local server
client = pymongo.MongoClient("mongodb://localhost:27017/")

In [4]:
# Connect to database
db = client['nyc']

In [5]:
# List collections
db.list_collection_names()

['airbnb']

In [6]:
# Sample document - airbnb collection
pp.pprint(
    db.airbnb.find_one())

{'_id': ObjectId('60c21bf5b653d40e79b4a7d0'),
 'accom_id': 2595,
 'description': 'Skylit Midtown Castle',
 'host': {'id': 2845,
          'name': 'Jennifer',
          'listings_count': 3,
          'neighbourhood_list': ['Midtown', "Hell's Kitchen"]},
 'neighbourhood': {'name': 'Midtown', 'group': 'Manhattan'},
 'location': {'type': 'Point', 'coordinates': [-73.98559, 40.75356]},
 'room_type': 'Entire home/apt',
 'price': 150,
 'minimum_nights': 30,
 'reviews': {'number_of_reviews': 48,
             'last_review': datetime.datetime(2019, 11, 4, 0, 0),
             'reviews_per_month': 0.35},
 'availability_365': 365}


---
### About the data

- `accom_id` - Accommodation ID.

- `description` - Description of the accommodation.

- `host` - Sub-document about the host of the accommodation.
> - `id` - Host ID.
> - `name` - Host name.
> - `listings_count` - Number of listings.
> - `neighbourhood_list` - List of neighbourhoods where the host has an accommodation.

- `neighbourhood` - Sub-document with the `name` and `group` of the accommodation's neighbourhood.

- `location` - GeoJSON object holding geospatial data about the location of the accommodation.
> - `type` - Type of GeoJSON object.
> - `coordinates` - Object's coordinates as an array of longitude and latitude, in that order.

- `room_type` - Type of room offered by the accommodation.

- `price` - Cost of the accommodation.

- `minimum_nights` - Minimum nights for booking.

- `reviews` - Sub-document on reviews of the accommodation.
> - `number_of_reviews` - Total reviews for the accommodation.
> - `last_review` - Date of last review.
> - `reviews_per_month` - Float number representing reviews per month.

- `availability_365` - Number of days in the year the accommodation is available.


----

### Index information

[index_information](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.index_information) returns a dictionary with information about the collection's indexes, keyed by index name.

-----

In [7]:
# Check index information
db.airbnb.index_information()

{'_id_': {'v': 2, 'key': [('_id', 1)]}}

----
By default, every collection has one index, on the `_id` field.
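
As a small illustration of how to read this dictionary, the sketch below turns it into one readable line per index. The helper name `describe_indexes` is an assumption made for this example, not part of PyMongo, and the dictionary is copied from the output above so the snippet runs without a server.

```python
# Output of db.airbnb.index_information() as shown above, copied in as a
# plain dict so this sketch runs without a MongoDB server.
index_info = {'_id_': {'v': 2, 'key': [('_id', 1)]}}


def describe_indexes(index_info):
    """Return one line per index: its name, then each key with its sort direction.

    Hypothetical helper written for this notebook; not a PyMongo function.
    """
    directions = {1: 'ascending', -1: 'descending'}
    lines = []
    for name, spec in index_info.items():
        # 'key' is a list of (field, direction) pairs.
        keys = ', '.join(
            f"{field} ({directions.get(order, order)})"
            for field, order in spec['key']
        )
        lines.append(f"{name}: {keys}")
    return lines


for line in describe_indexes(index_info):
    print(line)  # _id_: _id (ascending)
```

With a live connection, the same helper could be called as `describe_indexes(db.airbnb.index_information())`.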


[explain()](https://docs.mongodb.com/manual/reference/explain-results/) returns the query plan and execution statistics for a query.

- [explain.queryPlanner](https://docs.mongodb.com/manual/reference/explain-results/#mongodb-data-explain.queryPlanner) - Information about the plan selected by the query optimiser (the winning plan), given the available indexes, along with any rejected plans.

- [explain.executionStats](https://docs.mongodb.com/manual/reference/explain-results/#executionstats) - Information about the execution of the winning query plan selected by the query optimiser.

- [explain.serverInfo](https://docs.mongodb.com/manual/reference/explain-results/#serverinfo) - Information about the MongoDB instance, such as host and port.

Query plans are returned as a tree of stages. Each stage passes its results to its parent stage. The leaf stages access the collection or its indexes, and the root stage is the final stage from which MongoDB derives the result set.
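
To make the tree structure concrete, here is a minimal sketch that walks a plan from the root stage down to its leaves. It assumes the explain format used in this notebook, where a stage holds its child under `inputStage` (or several children under `inputStages`). The helper name `stage_path`, the index name `accom_id_1` and the indexed plan itself are hypothetical, made up for illustration.

```python
def stage_path(plan):
    """Return the stage names of a plan tree, root first, then its descendants.

    Hypothetical helper for this notebook; not part of PyMongo.
    """
    stages = [plan['stage']]
    # A stage has either one child ('inputStage') or several ('inputStages').
    children = []
    if 'inputStage' in plan:
        children.append(plan['inputStage'])
    children.extend(plan.get('inputStages', []))
    for child in children:
        stages.extend(stage_path(child))
    return stages


# A collection scan is a single stage: the root is also the leaf.
collscan_plan = {'stage': 'COLLSCAN', 'direction': 'forward'}

# Hypothetical plan for an indexed query: the root FETCH stage reads the
# documents that its child IXSCAN stage located through an index.
indexed_plan = {'stage': 'FETCH',
                'inputStage': {'stage': 'IXSCAN', 'indexName': 'accom_id_1'}}

print(stage_path(collscan_plan))  # ['COLLSCAN']
print(stage_path(indexed_plan))   # ['FETCH', 'IXSCAN']
```

With a live connection, the winning plan would come from `db.airbnb.find().explain()['queryPlanner']['winningPlan']`.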

-----

In [8]:
# Explain query
pp.pprint(
            db.airbnb.find().explain()
)

{'queryPlanner': {'plannerVersion': 1,
                  'namespace': 'nyc.airbnb',
                  'indexFilterSet': False,
                  'parsedQuery': {},
                  'winningPlan': {'stage': 'COLLSCAN', 'direction': 'forward'},
                  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
                    'nReturned': 36905,
                    'executionTimeMillis': 12,
                    'totalKeysExamined': 0,
                    'totalDocsExamined': 36905,
                    'executionStages': {'stage': 'COLLSCAN',
                                        'nReturned': 36905,
                                        'executionTimeMillisEstimate': 1,
                                        'works': 36907,
                                        'advanced': 36905,
                                        'needTime': 1,
                                        'needYield': 0,
                                        'saveState': 36,
             

---
---

- [explain.executionStats.nReturned](https://docs.mongodb.com/manual/reference/explain-results/#mongodb-data-explain.executionStats.nReturned) - Number of documents that match the query condition.

- [explain.executionStats.executionTimeMillis](https://docs.mongodb.com/manual/reference/explain-results/#mongodb-data-explain.executionStats.executionTimeMillis) - Total time in milliseconds required for query plan selection and query execution.

- [explain.executionStats.totalKeysExamined](https://docs.mongodb.com/manual/reference/explain-results/#mongodb-data-explain.executionStats.totalKeysExamined) - Number of index entries scanned.

- [explain.executionStats.totalDocsExamined](https://docs.mongodb.com/manual/reference/explain-results/#mongodb-data-explain.executionStats.totalDocsExamined) - Number of documents examined during query execution. 

- [explain.executionStats.executionStages](https://docs.mongodb.com/manual/reference/explain-results/#mongodb-data-explain.executionStats.executionStages) - Details the completed execution of the winning plan as a tree of stages.

- [explain.executionStats.allPlansExecution](https://docs.mongodb.com/manual/reference/explain-results/#mongodb-data-explain.executionStats.allPlansExecution) - Contains partial execution information captured during the plan selection phase for both the winning and rejected plans. 

---
Look at `executionStats` specifically.

For example, here are the `executionStats` for a query that finds the document whose accommodation ID is `2595`.

---

In [9]:
# Explain query
pp.pprint(
            db.airbnb.find({'accom_id': 2595})\
                     .explain()['executionStats']
)

{'executionSuccess': True,
 'nReturned': 1,
 'executionTimeMillis': 38,
 'totalKeysExamined': 0,
 'totalDocsExamined': 36905,
 'executionStages': {'stage': 'COLLSCAN',
                     'filter': {'accom_id': {'$eq': 2595}},
                     'nReturned': 1,
                     'executionTimeMillisEstimate': 6,
                     'works': 36907,
                     'advanced': 1,
                     'needTime': 36905,
                     'needYield': 0,
                     'saveState': 36,
                     'restoreState': 36,
                     'isEOF': 1,
                     'direction': 'forward',
                     'docsExamined': 36905},
 'allPlansExecution': []}
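
The output above shows a full collection scan: all 36905 documents were examined to return a single match, and no index keys were used. The sketch below condenses `executionStats` into those few numbers. The helper name `summarize` and the derived `docs_per_result` field are assumptions made for this example; the input values are copied from the output above.

```python
# executionStats values copied from the output above, so this sketch runs
# without a MongoDB server.
stats = {'nReturned': 1,
         'executionTimeMillis': 38,
         'totalKeysExamined': 0,
         'totalDocsExamined': 36905,
         'executionStages': {'stage': 'COLLSCAN'}}


def summarize(stats):
    """Condense executionStats into the figures most useful for judging a query.

    Hypothetical helper for this notebook; not part of PyMongo.
    """
    returned = stats['nReturned']
    examined = stats['totalDocsExamined']
    return {
        'stage': stats['executionStages']['stage'],
        'returned': returned,
        'docs_examined': examined,
        'keys_examined': stats['totalKeysExamined'],
        # Documents read per document returned; a value close to 1 means
        # little wasted work. None when the query returned nothing.
        'docs_per_result': examined / returned if returned else None,
    }


print(summarize(stats))
```

A large `docs_per_result` together with a `COLLSCAN` root stage is a common sign that the query has no index to use.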


----
### Index storage statistics

In PyMongo, the [command](https://pymongo.readthedocs.io/en/stable/api/pymongo/database.html#pymongo.database.Database.command) method runs a specified database command.

Using the [collStats](https://docs.mongodb.com/manual/reference/command/collStats/#collstats) command in MongoDB, we can return a variety of storage statistics for a given collection.

----

In [10]:
# Storage statistics

db.command('collStats', 'airbnb')

{'ns': 'nyc.airbnb',
 'size': 18768524,
 'count': 36905,
 'avgObjSize': 508,
 'storageSize': 6164480,
 'freeStorageSize': 0,
 'capped': False,
 'wiredTiger': {'metadata': {'formatVersion': 1},
  'creationString': 'access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(enabled=false,file_metadata=,repair=false),internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldes

In [11]:
# Number of indexes on the collection

db.command('collStats', 'airbnb')['nindexes']

1

In [12]:
# Index name and size of existing indexes on the collection

db.command('collStats', 'airbnb')['indexSizes']

{'_id_': 344064}

In [13]:
# Total size of all indexes

db.command('collStats', 'airbnb')['totalIndexSize']

344064
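
To put these sizes in context, the following sketch compares the index size with the size of the data itself. The helper name `index_overhead` is an assumption made for this example; the values are copied from the `collStats` outputs above, and all sizes are in bytes.

```python
# Fields copied from the collStats outputs above; all sizes are in bytes.
coll_stats = {'size': 18768524,          # uncompressed size of all documents
              'count': 36905,
              'storageSize': 6164480,    # space allocated on disk
              'nindexes': 1,
              'indexSizes': {'_id_': 344064},
              'totalIndexSize': 344064}


def index_overhead(coll_stats):
    """Return per-index sizes in KiB and total index size as a share of data size.

    Hypothetical helper for this notebook; not part of PyMongo.
    """
    sizes_kib = {name: size / 1024
                 for name, size in coll_stats['indexSizes'].items()}
    share = coll_stats['totalIndexSize'] / coll_stats['size']
    return sizes_kib, share


sizes_kib, share = index_overhead(coll_stats)
print(sizes_kib)  # {'_id_': 336.0}
print(f"Indexes take up {share:.1%} of the uncompressed data size")
```

With a live connection, calling `db.command('collStats', 'airbnb')` once and passing the result to the helper avoids a separate server round trip for each field.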