MongoDB
===========



A ``MongoClient`` instance provides a connection to a MongoDB server. Each server can host multiple databases, which can be retrieved with ``connection.database_name``; each database can in turn contain multiple ``collections``, each holding different documents.

In [176]:
from dotenv import load_dotenv
import os
from pymongo import MongoClient, InsertOne, DeleteOne, DeleteMany, UpdateOne, UpdateMany
import pymongo
from decimal import Decimal
import json
from bson.objectid import ObjectId

In [177]:
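load_dotenv()  # read variables from a local .env file, if present
usr = os.getenv("mongo_usr")
pwd = os.getenv("mongo_pw")
# os.getenv returns None instead of raising, so check the values explicitly
if usr is None or pwd is None:
    raise SystemExit("mongo_usr and mongo_pw must be set in the environment")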

In [178]:
client = MongoClient(f'mongodb://{usr}:{pwd}@localhost:27017/')
client

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)
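
If the username or password contains characters such as ``@``, ``:`` or ``/``, interpolating them directly into the connection string produces an invalid URI. A minimal sketch of one way to handle this uses ``urllib.parse.quote_plus`` from the standard library. The helper name ``build_mongo_uri`` and its default host and port are assumptions made for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host="localhost", port=27017):
    """Return a mongodb:// URI with percent-encoded credentials.

    build_mongo_uri is a hypothetical helper, not part of PyMongo.
    """
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"


# Example with a password that contains reserved characters
uri = build_mongo_uri("app_user", "p@ss:word")
print(uri)
```

The resulting string can be passed to ``MongoClient`` in place of the f-string used above.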

Create a database called ``phonebook`` and a collection called ``people``

Once the database is retrieved, collections can be accessed as attributes of the database itself.

In PyMongo a MongoDB document is represented as a plain Python dictionary, so inserting a document is as simple as telling PyMongo to insert the dictionary into the collection. Each document can have its own structure and contain different data, and you are not required to declare any structure for the collection. Collections that do not exist yet are created automatically when the first document is inserted.

Insert an object in the database: ``data = {'name': 'Alessandro', 'phone': '+39123456789'}``

In [179]:
db = client['phonebook']
collection = db['people']

In [180]:
data = {'name': 'Alessandro', 'phone': '+39123456789'}
collection.insert_one(data)

InsertOneResult(ObjectId('669fc65c2a221cfbf59b8fb4'), acknowledged=True)

Inserted documents can be fetched back with the ``find`` and ``find_one`` methods of the collection. Both accept a query expression that filters the returned documents; omitting it retrieves all the documents (or, in the case of ``find_one``, the first document).

In [181]:
collection.find_one()

{'_id': ObjectId('669f77c64a3e41501a7d9df4'),
 'name': 'John',
 'phone': '+39123456789'}

Filters in MongoDB are themselves described as documents, so in PyMongo they are dictionaries too.
A filter can be specified in the form ``{'field': value}``.
By default, filtering is performed by *equality* comparison; this can be changed by specifying a query operator in place of the value.

Query operators by convention start with a ``$`` sign and can be specified as ``{'field': {'operator': value}}``.
The full list of query operators is available at http://docs.mongodb.org/manual/reference/operator/query/

To find a person whose object id is greater than ``53b30ff57ab71c051823b031``, use ``find_one`` with the ``$gt`` operator:

In [182]:
collection.find_one({"_id": {"$gt": ObjectId("53b30ff57ab71c051823b031")}})

{'_id': ObjectId('669f77c64a3e41501a7d9df4'),
 'name': 'John',
 'phone': '+39123456789'}
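
A few more filter documents in the same style, as a sketch rather than a complete reference. The helper names are hypothetical; ``$in`` and ``$regex`` are standard query operators, and the field names come from the ``people`` documents above.

```python
import re


def name_in(names):
    """Match documents whose name is any of the given values ($in)."""
    return {"name": {"$in": list(names)}}


def phone_starts_with(prefix):
    """Match documents whose phone begins with prefix (anchored $regex)."""
    return {"phone": {"$regex": "^" + re.escape(prefix)}}


def name_and_prefix(names, prefix):
    """Combine both conditions; keys in a single filter are ANDed together."""
    return {**name_in(names), **phone_starts_with(prefix)}


print(name_and_prefix(["Alessandro", "John"], "+39"))
```

Any of these dictionaries can be passed to ``collection.find`` or ``collection.find_one``.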

Updating Documents
---------------------

Documents in MongoDB are updated with the ``update_one`` or ``update_many`` method of the collection. Updating is a common source of confusion for new users: unlike an SQL ``UPDATE``, the second argument is not a set of new values but an update document built from update operators. PyMongo's ``update_one`` and ``update_many`` reject a plain document that contains no operators; replacing a whole document is done with the separate ``replace_one`` method.

What you usually want is the ``$set`` operator, which changes the listed fields of the existing document and leaves the rest untouched. Read docs: https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.update_one

In [183]:
# Find the document with name Alessandro and print it
print(collection.find_one({'name': {"$eq": "Alessandro"}}))

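# Update the name to John, then print the first document in the collection
# (find_one() without a filter is not guaranteed to return the document just updated)
collection.update_one({"name": "Alessandro"}, {"$set": {"name": "John"}})
collection.find_one()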

{'_id': ObjectId('669fc65c2a221cfbf59b8fb4'), 'name': 'Alessandro', 'phone': '+39123456789'}


{'_id': ObjectId('669f77c64a3e41501a7d9df4'),
 'name': 'John',
 'phone': '+39123456789'}
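
As a sketch of the difference between changing individual fields and swapping a whole document, the two helpers below wrap PyMongo's ``update_one`` (with ``$set``) and ``replace_one``. The helper names are hypothetical, and ``collection`` is assumed to be a PyMongo collection such as ``db['people']``.

```python
def rename_person(collection, old_name, new_name):
    """Change only the name field; all other fields are left untouched."""
    return collection.update_one(
        {"name": old_name},
        {"$set": {"name": new_name}},
    )


def replace_person(collection, name, new_document):
    """Swap the whole matched document for new_document (the _id is kept)."""
    return collection.replace_one({"name": name}, new_document)
```

For example, ``rename_person(collection, 'Alessandro', 'John')`` keeps the ``phone`` field, while ``replace_person(collection, 'John', {'name': 'John'})`` would leave a document with no ``phone`` field at all.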

SubDocuments
--------------

The real power of MongoDB shows when you use subdocuments.

As each MongoDB document is a JSON object (actually BSON, but that doesn't change much for the user), it can contain any data that is valid in JSON, including other documents and arrays. In many use cases this replaces "relations" between collections, and it is often considerably more efficient because all the data comes back in a single query instead of requiring multiple queries to retrieve related data.

For example, if you want to store a blog post in MongoDB you might store everything, including author data and tags, inside the blog post itself:

- Create a collection ``blog``
- Insert the following document:

```
{'title': 'MongoDB is great!',
 'author': {'name': 'Alessandro',
            'surname': 'Molina',
            'avatar': 'weblink'},
 'tags': ['mongodb', 'web', 'scaling']}
```

In [184]:
db['blog']

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'phonebook'), 'blog')

In [185]:
db.blog.insert_one({'title': 'MongoDB is great!',
                    'author': {'name': 'Alessandro',
                               'surname': 'Molina',
                               'avatar': 'weblink'},
                    'tags': ['mongodb', 'web', 'scaling']})

InsertOneResult(ObjectId('669fc65c2a221cfbf59b8fb5'), acknowledged=True)

In [186]:
db.blog.find_one({'title': 'MongoDB is great!'})

{'_id': ObjectId('669f7dc24a3e41501a7d9df5'),
 'title': 'MongoDB is great!',
 'author': {'name': 'Alessandro', 'surname': 'Molina', 'avatar': 'weblink'},
 'tags': ['mongodb', 'web', 'scaling']}
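
Fields inside a subdocument can be queried with dot notation, and array fields can be matched against several values at once. The sketch below illustrates both on the ``blog`` documents; the helper names are hypothetical and ``collection`` is assumed to be a PyMongo collection such as ``db.blog``.

```python
def posts_by_author_surname(collection, surname):
    """Find posts whose embedded author document has the given surname."""
    # Dot notation reaches into the 'author' subdocument
    return list(collection.find({"author.surname": surname}))


def posts_with_all_tags(collection, tags):
    """Find posts whose 'tags' array contains every tag in tags."""
    # $all requires each listed value to be present in the array
    return list(collection.find({"tags": {"$all": list(tags)}}))
```

For example, ``posts_by_author_surname(db.blog, 'Molina')`` and ``posts_with_all_tags(db.blog, ['mongodb', 'web'])`` would both be expected to return the post inserted above.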

In [187]:
list(db.blog.find({'tags': 'mongodb'}))

[{'_id': ObjectId('669f7dc24a3e41501a7d9df5'),
  'title': 'MongoDB is great!',
  'author': {'name': 'Alessandro', 'surname': 'Molina', 'avatar': 'weblink'},
  'tags': ['mongodb', 'web', 'scaling']},
 {'_id': ObjectId('669f80564a3e41501a7d9df8'),
  'title': 'MongoDB is great!',
  'author': {'name': 'Alessandro', 'surname': 'Molina', 'avatar': 'weblink'},
  'tags': ['mongodb', 'web', 'scaling']},
 {'_id': ObjectId('669f86884a3e41501a7d9dfb'),
  'title': 'MongoDB is great!',
  'author': {'name': 'Alessandro', 'surname': 'Molina', 'avatar': 'weblink'},
  'tags': ['mongodb', 'web', 'scaling']},
 {'_id': ObjectId('669f8b03af2b71acd37184f0'),
  'title': 'MongoDB is great!',
  'author': {'name': 'Alessandro', 'surname': 'Molina', 'avatar': 'weblink'},
  'tags': ['mongodb', 'web', 'scaling']},
 {'_id': ObjectId('669f9ae4efacb22c6ddb31c1'),
  'title': 'MongoDB is great!',
  'author': {'name': 'Alessandro', 'surname': 'Molina', 'avatar': 'weblink'},
  'tags': ['mongodb', 'web', 'scaling']},
 {'_i

Aggregation Pipeline
----------------------

The aggregation pipeline provided by the aggregation framework is a powerful MongoDB feature that lets you perform complex data analysis by passing the documents through a pipeline of operations.

MongoDB was created with the core philosophy that you store your documents according to the way you are going to read them. So to design your schema properly you need to know how you are going to use the documents. While this approach provides great performance benefits and fits web applications well, it might not always be feasible.

If you need to perform some kind of analysis that your documents are not optimized for, you can rely on the aggregation framework to create a pipeline that transforms them into a shape more practical for that analysis.

### How it works

The aggregation pipeline is a list of operations that are executed one after the other on the documents of the collection. The first operation is performed on all the documents, while each successive operation is performed on the result of the previous step.

Stages take advantage of **indexes** when they can; this is the case for a **match** or **sort** operator that appears at the beginning of the pipeline. All operators start with a ``$`` sign.

### Stage Operators


* **project**: Reshapes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.
* **match**: Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. **match** uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).
* **limit**: Passes the first n documents unmodified to the pipeline, where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).
* **skip**: Skips the first n documents, where n is the specified skip number, and passes the remaining documents unmodified to the pipeline. For each input document, outputs either zero documents (for the first n documents) or one document (after the first n documents).
* **unwind**: Deconstructs an array field from the input documents to output a document for each element. Each output document replaces the array with an element value. For each input document, outputs n documents, where n is the number of array elements and can be zero for an empty array.
* **group**: Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.
* **sort**: Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.
* **geoNear**: Returns an ordered stream of documents based on the proximity to a geospatial point. Incorporates the functionality of **match**, **sort**, and **limit** for geospatial data. The output documents include an additional distance field and can include a location identifier field.
* **out**: Writes the resulting documents of the aggregation pipeline to a collection. The ``$out`` stage must be the last stage in the pipeline.
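
The stages listed above can be chained, for example to count how often each tag is used across the ``blog`` posts inserted earlier. This is only a sketch: the function name ``tag_counts_pipeline`` and the ``posts`` output field are made up for illustration. The ``$match`` stage is placed first so that it could use an index on ``author.name`` if one existed.

```python
def tag_counts_pipeline(author_name, top_n=5):
    """Build a pipeline counting posts per tag for one author."""
    return [
        # Keep only this author's posts (dot notation into the subdocument)
        {"$match": {"author.name": author_name}},
        # Emit one document per element of the 'tags' array
        {"$unwind": "$tags"},
        # Count how many documents carry each tag
        {"$group": {"_id": "$tags", "posts": {"$sum": 1}}},
        # Most used tags first
        {"$sort": {"posts": -1}},
        {"$limit": top_n},
    ]


pipeline = tag_counts_pipeline("Alessandro")
print([next(iter(stage)) for stage in pipeline])
```

The pipeline would then be run with ``list(db.blog.aggregate(pipeline))``.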

#### Expression Operators

Each stage operator can work with one or more **expression operators**, which perform actions during that stage. For a list of expression operators see http://docs.mongodb.org/manual/reference/operator/aggregation/#expression-operators

### Pipeline Examples

Use the full listingsAndReviews JSON file (Google Drive).

In [188]:
if "pipe_example" in client.test.list_collection_names():
    client.test.pipe_example.drop()

qa = client.test.pipe_example

In [189]:
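result = []
with open('../data/listingsAndReviews.json', 'r', encoding='utf8') as f:
    # The file holds one JSON document per line
    for line in f:
        try:
            my_json = json.loads(line)
            # Flatten the extended-JSON decimal to its string value
            my_json['price'] = my_json['price']['$numberDecimal']
            result.append(InsertOne(my_json))
        except (json.JSONDecodeError, KeyError, TypeError) as e:
            # my_json may not exist if parsing failed, so report the raw line
            print(f"Error processing JSON: {e} [{line[:80]!r}]")

qa.bulk_write(result)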

BulkWriteResult({'writeErrors': [], 'writeConcernErrors': [], 'nInserted': 5555, 'nUpserted': 0, 'nMatched': 0, 'nModified': 0, 'nRemoved': 0, 'upserted': []}, acknowledged=True)

In [190]:
#Q1 Find the total number of listings in Sydney
# count_documents counts on the server instead of fetching every document
qa.count_documents({'address.market': 'Sydney'}), qa.count_documents({'address.government_area': 'Sydney'})

(609, 183)

In [191]:
#Q2 Show the 5 most popular markets (those with the largest number of properties)
pipeline = [
    {"$group": {
        "_id": "$address.market",
        "property_count": {"$sum": 1}
    }
    },
    {"$sort": {
        "property_count": -1  # -1 for descending order
    }
    },
    {"$limit": 5}
]
results = list(qa.aggregate(pipeline))
results

[{'_id': 'Istanbul', 'property_count': 660},
 {'_id': 'Montreal', 'property_count': 648},
 {'_id': 'Barcelona', 'property_count': 632},
 {'_id': 'Hong Kong', 'property_count': 619},
 {'_id': 'Sydney', 'property_count': 609}]

In [192]:
#Q3 Count of properties and average price per night for the most populated markets
pipeline = [
    {"$group": {
        "_id": "$address.market",
        "property_count": {"$sum": 1},
        "average_price": {"$avg": {"$round": {"$toDouble": "$price"}}}
    }
    },
    # Note: this $project keeps only _id and avg_price, so property_count
    # is no longer available to the $sort stage that follows
    {"$project": {
        'avg_price': {"$round": ["$average_price", 2]}}
     },
    {"$sort": {
        "property_count": -1
    }
    },
    {"$limit": 5}
]

results = list(qa.aggregate(pipeline))
results

[{'_id': 'Rio De Janeiro', 'avg_price': 525.81},
 {'_id': 'Other (International)', 'avg_price': 445.75},
 {'_id': 'Oahu', 'avg_price': 212.3},
 {'_id': 'Maui', 'avg_price': 286.59},
 {'_id': '', 'avg_price': 115.5}]
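
In the pipeline above, ``$project`` runs before ``$sort`` and keeps only ``avg_price``, so the sort has no ``property_count`` to order by. That is why the markets shown differ from the ones in Q2. One possible reordering, as a sketch, sorts and limits first and projects last while keeping the count. The function name ``market_summary_pipeline`` is hypothetical, and unlike the cell above this version averages the unrounded prices.

```python
def market_summary_pipeline(top_n=5):
    """Build a pipeline returning count and average price per market."""
    return [
        {"$group": {
            "_id": "$address.market",
            "property_count": {"$sum": 1},
            # price is stored as a string, so convert before averaging
            "average_price": {"$avg": {"$toDouble": "$price"}},
        }},
        # Sort while property_count is still part of the documents
        {"$sort": {"property_count": -1}},
        {"$limit": top_n},
        # Reshape last, keeping the count next to the rounded average
        {"$project": {
            "property_count": 1,
            "avg_price": {"$round": ["$average_price", 2]},
        }},
    ]


pipeline = market_summary_pipeline()
```

It would be run the same way as before, with ``list(qa.aggregate(pipeline))``.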

In [218]:
#Q4 Choose 2 amenities of your choice and check the review scores and the price of the listings with those amenities 
chosen_amenities = ["Wifi", "Cooking basics"]


query = {
    "amenities": {
        "$all": chosen_amenities
    }
}

projection = {
    "name": 1,
    "amenities": 1,
    "review_scores": 1,
    "price": 1
}

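# Query the listings collection (qa), not the phonebook 'people' collection.
# price was stored as a string, so this sort is lexicographic, not numeric.
listings = list(qa.find(query, projection).sort("price", -1))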



In [220]:
len(listings)

1590
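
One possible way to summarise review scores and price for the listings that have both amenities is an aggregation. The sketch below assumes that the overall rating lives in ``review_scores.review_scores_rating`` (an assumption about the dataset that is not confirmed above) and that ``price`` is stored as a string, as in the import step. The function name and output field names are hypothetical.

```python
def amenity_summary_pipeline(amenities):
    """Build a pipeline summarising listings that have all given amenities."""
    return [
        # Keep listings whose 'amenities' array contains every chosen value
        {"$match": {"amenities": {"$all": list(amenities)}}},
        {"$group": {
            "_id": None,  # a single group covering all matched listings
            "listings": {"$sum": 1},
            "avg_price": {"$avg": {"$toDouble": "$price"}},
            # Assumed field name; $avg skips documents where it is missing
            "avg_rating": {"$avg": "$review_scores.review_scores_rating"},
        }},
    ]


pipeline = amenity_summary_pipeline(["Wifi", "Cooking basics"])
```

Running ``list(qa.aggregate(pipeline))`` would be expected to return a single summary document for the matched listings.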