<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 3.1.4
# *Python with MongoDB*

## Introduction to PyMongo

For this lab you will first need to install two programs (this applies to both Windows and Mac users). On Windows, download and run the `msi` package for each one.

1) MongoDB Community Server from https://www.mongodb.com/try/download/community

2) MongoDB Command Line Database Tools from https://www.mongodb.com/try/download/database-tools

**Installation instructions for Windows users:**

[Install MongoDB Community Edition on Windows](https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-windows/)

**The following resources may assist Mac users:**

[Install MongoDB Community Edition on macOS](https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-os-x/)

[How to Install Latest MongoDB on macOS](https://www.youtube.com/watch?v=NLw7Tln6IeM)

[How to install HomeBrew (often this helps, if you are having issues with your setup)](https://www.youtube.com/watch?v=IWJKRmFLn-g)

In [7]:
!pip install pymongo



In [8]:
from IPython.display import display, HTML
import pymongo
import pandas as pd
from pymongo import MongoClient
print('PyMongo version ' + pymongo.__version__)

PyMongo version 4.8.0


**Start the mongod server (if it isn't already running):**

Windows:
1. Using Command Prompt, navigate to the folder containing `mongod.exe` (e.g. by typing `cd "C:\Program Files\MongoDB\Server\7.0\bin"`).
2. Execute `mongod` at the prompt.

Mac:
1. Run `brew services start mongodb-community@7.0`

In [10]:
# Creating a client object in our local machine
client = MongoClient('localhost', 27017)

In [11]:
print(client.list_database_names())

['admin', 'config', 'local']


In MongoDB, a **database** stores and manages collections of related data, similar to how you might organise files into folders on your computer.

Create a new database:

In [14]:
db = client.test

A **collection** in MongoDB is similar to a table in a relational database.
Collections store documents (records), which MongoDB holds as BSON, a binary representation of JSON.
For example, a `people` collection would contain documents about people, such as user profiles or contact information.
Each document within that collection represents an individual person or entity.

In [16]:
print(client.list_database_names())

['admin', 'config', 'local']


It is important to note that MongoDB is lazy: the database is not created until data has been written to it, which is why `test` does not appear in the list above.

Create a collection called "shoppers" (with object name `mycol`):

In [19]:
#ANSWER
mycol = db.shoppers

In [20]:
mycol

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'test'), 'shoppers')

Create a document (i.e. a dictionary) with two name:value items
("name" = "Paul", and "address" = "Mansfield Ave") and insert
it into the "shoppers" collection:

In [22]:
#ANSWER:
mydict = {"name":"Paul", "address":"Mansfield Ave"}
mydict

{'name': 'Paul', 'address': 'Mansfield Ave'}

In [23]:
x = mycol.insert_one(mydict)

In [24]:
x

InsertOneResult(ObjectId('66a82b7dc14a4900a5d11288'), acknowledged=True)

Now test for the existence of the database:

In [26]:
#ANSWER:
print(client.list_database_names())

['admin', 'config', 'local', 'test']


List all collections in the database:

In [28]:
#ANSWER
print(db.list_collection_names())

['shoppers']


Insert another record in the "shoppers" collection
("name" = "Rafa", "address" = "Holder Drive")
and return the value of the _id field:

In [30]:
mydict = {"name":"Rafa", "address":"Holder Drive"}
mydict

{'name': 'Rafa', 'address': 'Holder Drive'}

In [31]:
x = mycol.insert_one(mydict)

In [32]:
x

InsertOneResult(ObjectId('66a82b7dc14a4900a5d11289'), acknowledged=True)

Given the list of dicts below, insert multiple documents into
the collection using the insert_many() method:

In [34]:
mylist = [
  { "name": "Ashton", "address": "Axle St"},
  { "name": "Benjamin", "address": "Green Dr"},
  { "name": "Sally", "address": "Holly Blvd"},
  { "name": "Helen", "address": "Castor Prom"},
  { "name": "Craig", "address": "Parsons Way"},
  { "name": "Betty", "address": "Watters St"},
  { "name": "Aparna", "address": "Yonder Dr"},
  { "name": "Kent", "address": "Garrison St"},
  { "name": "Violet", "address": "Station St"},
  { "name": "Svetlana", "address": "Wayman Ave"}
]

In [35]:
x = mycol.insert_many(mylist)

Print a list of the _id values of the inserted documents:

In [37]:
print(x.inserted_ids)

[ObjectId('66a82b7dc14a4900a5d1128a'), ObjectId('66a82b7dc14a4900a5d1128b'), ObjectId('66a82b7dc14a4900a5d1128c'), ObjectId('66a82b7dc14a4900a5d1128d'), ObjectId('66a82b7dc14a4900a5d1128e'), ObjectId('66a82b7dc14a4900a5d1128f'), ObjectId('66a82b7dc14a4900a5d11290'), ObjectId('66a82b7dc14a4900a5d11291'), ObjectId('66a82b7dc14a4900a5d11292'), ObjectId('66a82b7dc14a4900a5d11293')]


Execute the next cell to insert a list of dicts with specified `_id`s:

In [39]:
mylist = [
  { "_id": 1, "name": "Paul", "address": "Mansfield Ave"},
  { "_id": 2, "name": "Rafa", "address": "Holder Drive"},
  { "_id": 3, "name": "Ashton", "address": "Axle St"},
  { "_id": 4, "name": "Benjamin", "address": "Green Dr"},
  { "_id": 5, "name": "Sally", "address": "Holly Blvd"},
  { "_id": 6, "name": "Helen", "address": "Castor Prom"},
  { "_id": 7, "name": "Craig", "address": "Parsons Way"},
  { "_id": 8, "name": "Betty", "address": "Watters St"},
  { "_id": 9, "name": "Aparna", "address": "Yonder Dr"},
  { "_id": 10, "name": "Kent", "address": "Garrison St"},
  { "_id": 11, "name": "Violet", "address": "Station St"},
  { "_id": 12, "name": "Svetlana", "address": "Wayman Ave"}
]
x = mycol.insert_many(mylist)
print(x.inserted_ids)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


Now try inserting a new dict with an existing `_id`:

In [41]:
# Warning: this code raises an error because a document with _id 12 already exists
x = mycol.insert_one({ "_id": 12, "name": "Lola", "address": "Prospect Dr"})

DuplicateKeyError: E11000 duplicate key error collection: test.shoppers index: _id_ dup key: { _id: 12 }, full error: {'index': 0, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: test.shoppers index: _id_ dup key: { _id: 12 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 12}}

So, if we want to manage `_id`s in code, we need to be careful!
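
One way to sidestep the duplicate-key error when managing `_id`s yourself is to upsert instead of insert. The sketch below assumes that overwriting an existing document with the same `_id` is acceptable; `save_shopper` is a hypothetical helper name, and `collection` stands for any PyMongo collection such as `mycol`.

```python
def save_shopper(collection, doc):
    """Insert doc, or overwrite the stored document that has the same _id.

    Returns True if an existing document was replaced, False if doc was new.
    """
    # With upsert=True, replace_one inserts doc when nothing matches the filter
    result = collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    # upserted_id is only set when a new document had to be inserted
    return result.upserted_id is None

# Usage (assumes the lab's collection object):
# save_shopper(mycol, {"_id": 12, "name": "Lola", "address": "Prospect Dr"})
```

Note that in this lab the call in the usage comment would overwrite the existing document with `_id` 12 rather than raise an error, so it only suits cases where that is the intended behaviour.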

This returns the first document in the collection:

In [49]:
x = mycol.find_one()
print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}


Do the same for the document containing "name" = "Ashton":

In [51]:
x = mycol.find_one({"name":"Ashton"})
print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}


This returns (and prints) all documents in the collection:

In [53]:
for x in mycol.find():
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128b'), 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128d'), 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128e'), 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Garrison St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}
{'_id': ObjectId('66a82b7dc

This returns only the name and address fields:

In [55]:
for x in mycol.find({},{ "_id": 0, "name": 1, "address": 1 }):
    print(x)

{'name': 'Paul', 'address': 'Mansfield Ave'}
{'name': 'Rafa', 'address': 'Holder Drive'}
{'name': 'Ashton', 'address': 'Axle St'}
{'name': 'Benjamin', 'address': 'Green Dr'}
{'name': 'Sally', 'address': 'Holly Blvd'}
{'name': 'Helen', 'address': 'Castor Prom'}
{'name': 'Craig', 'address': 'Parsons Way'}
{'name': 'Betty', 'address': 'Watters St'}
{'name': 'Aparna', 'address': 'Yonder Dr'}
{'name': 'Kent', 'address': 'Garrison St'}
{'name': 'Violet', 'address': 'Station St'}
{'name': 'Svetlana', 'address': 'Wayman Ave'}
{'name': 'Paul', 'address': 'Mansfield Ave'}
{'name': 'Rafa', 'address': 'Holder Drive'}
{'name': 'Ashton', 'address': 'Axle St'}
{'name': 'Benjamin', 'address': 'Green Dr'}
{'name': 'Sally', 'address': 'Holly Blvd'}
{'name': 'Helen', 'address': 'Castor Prom'}
{'name': 'Craig', 'address': 'Parsons Way'}
{'name': 'Betty', 'address': 'Watters St'}
{'name': 'Aparna', 'address': 'Yonder Dr'}
{'name': 'Kent', 'address': 'Garrison St'}
{'name': 'Violet', 'address': 'Station St'

Print only the `_id` and name fields:

In [57]:
#ANSWER
for x in mycol.find({},{ "_id": 1, "name": 1}):
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128b'), 'name': 'Benjamin'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128d'), 'name': 'Helen'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128e'), 'name': 'Craig'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty'}
{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana'}
{'_id': 1, 'name': 'Paul'}
{'_id': 2, 'name': 'Rafa'}
{'_id': 3, 'name': 'Ashton'}
{'_id': 4, 'name': 'Benjamin'}
{'_id': 5, 'name': 'Sally'}
{'_id': 6, 'name': 'Helen'}
{'_id': 7, 'name': 'Craig'}
{'_id': 8, 'name': 'Betty'}
{'_id': 9, '

So, we must explicitly use `"_id": 0` to exclude it, but for other fields we simply omit them from the dict argument.
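
The projection rule can be captured in a small helper. This is a minimal sketch; `make_projection` is a hypothetical name, not part of PyMongo. It relies on the fact that, apart from `_id`, a single projection cannot mix inclusions (1) and exclusions (0).

```python
def make_projection(fields, include_id=False):
    """Build a projection dict that returns only the given fields."""
    projection = {field: 1 for field in fields}
    if not include_id:
        # _id is returned by default, so it has to be switched off explicitly
        projection["_id"] = 0
    return projection

print(make_projection(["name", "address"]))

# Usage (assumes the lab's collection object):
# for x in mycol.find({}, make_projection(["name", "address"])):
#     print(x)
```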

To include field conditionals in a query, we use `$` operators. This finds addresses that sort after "S", i.e. those starting with "S" or a later letter:

In [59]:
myquery = { "address": { "$gt": "S" } }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': 8, 'name': 'Betty', 'address': 'Watters St'}
{'_id': 9, 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': 11, 'name': 'Violet', 'address': 'Station St'}
{'_id': 12, 'name': 'Svetlana', 'address': 'Wayman Ave'}
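
To see why "Station St" is matched by `$gt: "S"`, it may help to reproduce the comparison in plain Python. This sketch assumes the collection uses MongoDB's default string comparison (no collation), which for these ASCII strings behaves like Python's `>` operator.

```python
addresses = [
    "Mansfield Ave", "Holder Drive", "Axle St", "Green Dr",
    "Holly Blvd", "Castor Prom", "Parsons Way", "Watters St",
    "Yonder Dr", "Garrison St", "Station St", "Wayman Ave",
]

# Strings are compared character by character; a longer string that
# starts with "S" therefore sorts after the bare string "S"
greater_than_s = [a for a in addresses if a > "S"]
print(greater_than_s)

print("Station St" > "S")  # True: same first character, but longer
print("S" > "S")           # False: equal strings are not "greater than"
```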


Here are some more query operators (comparison, element, and logical):

            $gt, $gte, $eq, $in, $nin, $exists, $and, $or, $not
            
Experiment with these until you understand how to use them.

In [61]:
# $gte  Greater than or equal

myquery = { "name": { "$gte": "H" } }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128d'), 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Garrison St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': 1, 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': 2, 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': 5, 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': 6, 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': 10, 'name': 'Kent', 'address': 'Garrison St'}
{'_id': 11, 'name': 'Violet', 'address': 'Station St'}
{'_id': 12, 'name': 'Svetlana', 'address': 'Wayman 

In [63]:
# $eq  Equals

myquery = { "name": { "$eq": "Helen" } }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d1128d'), 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': 6, 'name': 'Helen', 'address': 'Castor Prom'}


In [65]:
# $in  Value of a field equals any value in a specified array

myquery = { "address": { "$in": ["Station St", "Holly Blvd"] } }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}
{'_id': 5, 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': 11, 'name': 'Violet', 'address': 'Station St'}


In [67]:
# $nin  Opposite of 'in', i.e. not in

myquery = { "address": { "$nin": ["Station St", "Holly Blvd"] } }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128b'), 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128d'), 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128e'), 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Garrison St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': 1, 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': 2, 'name': 'Rafa', 'address': 'Holder Drive'}
{'_i

In [69]:
# $and  Combines multiple conditions in a query

myquery = { "$and": [{ "name": "Rafa"}, { "address": "Holder Drive" }]}
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': 2, 'name': 'Rafa', 'address': 'Holder Drive'}


In [71]:
# $or  One value or another

myquery = { "$or": [{ "name": "Rafa"}, { "address": "Axle St" }]}
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}
{'_id': 2, 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': 3, 'name': 'Ashton', 'address': 'Axle St'}


In [73]:
# $ne  Not equal: selects docs whose field differs from the given value ($not, by contrast, negates an operator expression)

myquery = { "name": { "$ne": "Helen"} }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128b'), 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128e'), 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Garrison St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': 1, 'name': 'Paul'
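
The operator list above also mentions `$exists` and `$not`, which the cells do not exercise. The query dicts below are a sketch of how they might be written; `names_matching` is a hypothetical helper, and it assumes every matched document has a "name" field.

```python
# Documents that have an "address" field at all
has_address = {"address": {"$exists": True}}

# $not wraps another operator expression: names that do NOT start with "S"
not_starting_with_s = {"name": {"$not": {"$regex": "^S"}}}

# For a single value, $ne is the simpler choice
not_helen = {"name": {"$ne": "Helen"}}

def names_matching(collection, query):
    """Return the name of every document that matches query."""
    return [doc["name"] for doc in collection.find(query, {"_id": 0, "name": 1})]

# Usage (assumes the lab's collection object):
# print(names_matching(mycol, not_starting_with_s))
```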

Now find all docs with an address that begins with "W":  
(HINT: The value for "address" in the argument should be the regex-based dict { "$regex": "^W" }.)

In [75]:
#ANSWER:
myquery3 = { "address": { "$regex": "^W" } }
mydoc = mycol.find(myquery3)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': 8, 'name': 'Betty', 'address': 'Watters St'}
{'_id': 12, 'name': 'Svetlana', 'address': 'Wayman Ave'}
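
The `^` in the pattern anchors the match to the start of the string. The sketch below uses Python's `re` module on a few sample addresses to illustrate this; MongoDB uses its own regex engine, but for a simple pattern like this the behaviour should be the same.

```python
import re

addresses = ["Watters St", "Wayman Ave", "Parsons Way", "Holder Drive"]

# "^W" only matches a "W" at the very start, so "Parsons Way" is excluded
starts_with_w = [a for a in addresses if re.search("^W", a)]
print(starts_with_w)

# Case-insensitive variant; the query equivalent would be
# {"address": {"$regex": "^w", "$options": "i"}}
starts_with_w_any_case = [a for a in addresses if re.search("^w", a, re.IGNORECASE)]
print(starts_with_w_any_case)
```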


Sorting can be applied by invoking the sort() method after the find() method. Sort the collection by the name field:

In [77]:
#ANSWER:
mydoc = mycol.find().sort("name")
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': 9, 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}
{'_id': 3, 'name': 'Ashton', 'address': 'Axle St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128b'), 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': 4, 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': 8, 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128e'), 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': 7, 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128d'), 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': 6, 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Garrison St'}
{'_id': 10, 'name': 'Kent', 'address': 'Garrison St'}
{'_id': Ob

Now sort in reverse order (HINT: The sort() method takes an optional second parameter.)

In [79]:
#ANSWER
mydoc = mycol.find().sort("name", -1)
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}
{'_id': 11, 'name': 'Violet', 'address': 'Station St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': 12, 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': 5, 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': 2, 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': 1, 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Garrison St'}
{'_id': 10, 'name': 'Kent', 'address': 'Garrison St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128d'), 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': 6, 'name': 'Helen', 'address': 'Castor P
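
The two sort directions can be mirrored in plain Python, which may make the meaning of the second parameter clearer. This is only an illustration on three sample documents, not a replacement for sorting in the database.

```python
docs = [
    {"name": "Kent", "address": "Garrison St"},
    {"name": "Violet", "address": "Station St"},
    {"name": "Aparna", "address": "Yonder Dr"},
]

# Like find().sort("name"): ascending (1) is the default direction
ascending = sorted(docs, key=lambda d: d["name"])

# Like find().sort("name", -1): -1 means descending
descending = sorted(docs, key=lambda d: d["name"], reverse=True)

print([d["name"] for d in ascending])
print([d["name"] for d in descending])
```

In PyMongo, the constants `pymongo.ASCENDING` and `pymongo.DESCENDING` stand for 1 and -1, and `sort()` also accepts a list of (field, direction) pairs such as `[("name", 1), ("address", -1)]` when more than one sort key is needed.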

A single record can be deleted by specifying some criterion:

In [81]:
mycol.delete_one({ "address": "Castor Prom" })

DeleteResult({'n': 1, 'ok': 1.0}, acknowledged=True)

Now delete all docs with the integer `_id` values (1 to 12):

In [83]:
#ANSWER:
# Define a query that matches numeric _id values (docs keyed by an ObjectId are left alone)

myquery2 = {"_id": {"$type": "number","$lte":9999999999}}
results = mycol.find(myquery2)
for result in results:
    print(result)

{'_id': 1, 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': 2, 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': 3, 'name': 'Ashton', 'address': 'Axle St'}
{'_id': 4, 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': 5, 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': 6, 'name': 'Helen', 'address': 'Castor Prom'}
{'_id': 7, 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': 8, 'name': 'Betty', 'address': 'Watters St'}
{'_id': 9, 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': 10, 'name': 'Kent', 'address': 'Garrison St'}
{'_id': 11, 'name': 'Violet', 'address': 'Station St'}
{'_id': 12, 'name': 'Svetlana', 'address': 'Wayman Ave'}


In [85]:
result = mycol.delete_many(myquery2)

In [87]:
# Check result

mydoc = mycol.find().sort("name")
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128b'), 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128e'), 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Garrison St'}
{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}


This would delete all docs:
`x = mycol.delete_many({})`

This would remove the collection:
`mycol.drop()`

This would drop the database:
`client.drop_database('test')`

Change the first instance of "address" == "Garrison St" to "Somers Ave" using update_one().  
(HINT: The 1st parameter of update_one() is the criterion (query); the 2nd is dict specifying the field to change and its new value.)

In [89]:
#ANSWER:
# Define the query to find the first document with address "Garrison St"

myquery3 = {"address": "Garrison St"}

# Define the update operation to set address to "Somers Ave"
update_operation = {"$set": {"address": "Somers Ave"}}

# Use update_one to perform the update
result = mycol.update_one(myquery3, update_operation)
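
The object returned by `update_one()` reports whether anything was matched and changed. The helper below is a sketch (the name `change_address` is hypothetical) that wraps the same update and returns both counts.

```python
def change_address(collection, old_address, new_address):
    """Update the first document whose address equals old_address."""
    result = collection.update_one(
        {"address": old_address},
        {"$set": {"address": new_address}},
    )
    # matched_count: documents that met the filter (0 or 1 for update_one)
    # modified_count: documents whose content actually changed
    return result.matched_count, result.modified_count

# Usage (assumes the lab's collection object):
# print(change_address(mycol, "Garrison St", "Somers Ave"))
```

A return value of `(0, 0)` means no document matched the query, which is a quick way to spot a typo in the search value.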

The limit() method can be applied after the find() method to limit the number of docs returned. Show the first 5 docs:

In [91]:
#ANSWER:
# Check the update; this prints every doc, so chain .limit(5) onto the cursor to show only the first 5

mydoc = mycol.find().sort("name")
for x in mydoc:
    print(x)

{'_id': ObjectId('66a82b7dc14a4900a5d11290'), 'name': 'Aparna', 'address': 'Yonder Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128a'), 'name': 'Ashton', 'address': 'Axle St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128b'), 'name': 'Benjamin', 'address': 'Green Dr'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128f'), 'name': 'Betty', 'address': 'Watters St'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128e'), 'name': 'Craig', 'address': 'Parsons Way'}
{'_id': ObjectId('66a82b7dc14a4900a5d11291'), 'name': 'Kent', 'address': 'Somers Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11288'), 'name': 'Paul', 'address': 'Mansfield Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11289'), 'name': 'Rafa', 'address': 'Holder Drive'}
{'_id': ObjectId('66a82b7dc14a4900a5d1128c'), 'name': 'Sally', 'address': 'Holly Blvd'}
{'_id': ObjectId('66a82b7dc14a4900a5d11293'), 'name': 'Svetlana', 'address': 'Wayman Ave'}
{'_id': ObjectId('66a82b7dc14a4900a5d11292'), 'name': 'Violet', 'address': 'Station St'}
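
A sketch of the `limit()` exercise itself, assuming the docs should be taken in name order (`first_docs_by_name` is a hypothetical helper name):

```python
def first_docs_by_name(collection, n=5):
    """Return the first n documents when sorted by name."""
    # limit() is chained onto the cursor, just like sort()
    cursor = collection.find().sort("name").limit(n)
    return list(cursor)

# Usage (assumes the lab's collection object):
# for x in first_docs_by_name(mycol, 5):
#     print(x)
```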


## PyMongo for Data Science

MongoDB has many more features of interest to developers, but the main focus of a data scientist will be wrangling and munging the data. It may or may not be desirable to do all the data munging in Pandas; for a large, distributed database, it may be imperative to perform aggregation in MongoDB.

In [93]:
# Ref:  https://rsandstroem.github.io/MongoDBDemo.html

import pandas as pd
import numpy as np

The following steps create a database named "command_test" and populate it from a local JSON file called dummyData.json (you need to download this from the Google Classroom DATA folder) using the mongoimport program.

**Step 1**. Using the command prompt (Windows) or Terminal (Mac), change to the directory containing the mongoimport program (example: `cd "C:\Program Files\MongoDB\Tools\100\bin"`).

**Step 2**. After you have changed your directory, copy/paste the line below into the command prompt or terminal, after modifying "dummyData.json" to include the path on your system where the file is located:
<br>

      mongoimport --db command_test --collection people --drop --file "dummyData.json"       

- `--db` (short form `-d`): the database to import into; in our case, "command_test"
- `--collection` (short form `-c`): the collection to create within that database; in our case, "people"
- `--drop`: drops the collection, if it already exists, before importing
- `--file`: the path to the "dummyData.json" file that you downloaded from our Google Classroom DATA folder; you may need to modify "dummyData.json"

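**Step 3**. Copy/paste the line below into the command prompt or terminal. It is the same import written with the short option names `-d` and `-c`; because it has no `--drop`, running it after Step 2 adds every record a second time, which is why duplicate records appear in the outputs further down: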
<br>

    mongoimport -d command_test -c people --file "dummyData.json"


In [95]:
client = MongoClient('localhost', 27017) #connects to your local mongoDB

If the above is successful, running the next cell should produce three records corresponding to the youngest people.

In [97]:
db = client.command_test
collection = db.people
cursor = collection.find().sort('Age',pymongo.ASCENDING).limit(3)
for doc in cursor:
    print (doc)

{'_id': ObjectId('66a82d3c15d21f0f0ef7d1d9'), 'Name': 'Sawyer, Neve M.', 'Age': 18, 'Country': 'Serbia', 'Location': '-34.37446, 174.0838'}
{'_id': ObjectId('66a82d0e1d80b0e3cba335fc'), 'Name': 'Sawyer, Neve M.', 'Age': 18, 'Country': 'Serbia', 'Location': '-34.37446, 174.0838'}
{'_id': ObjectId('66a82d3c15d21f0f0ef7d1a9'), 'Name': 'Townsend, Cadman I.', 'Age': 19, 'Country': 'Somalia', 'Location': '-87.69188, -144.16138'}


Here is a small demonstration of the MongoDB aggregation framework. We want to create a table of the number of persons in each country and their average age. To do it we group by country. We extract the results from MongoDB aggregation into a pandas dataframe, and use the country as index.

In [99]:
pipeline = [
        {"$group": {"_id":"$Country",
             "AvgAge":{"$avg":"$Age"},
             "Count":{"$sum":1},
        }},
        {"$sort":{"Count":-1,"AvgAge":1}}
]
aggResult = collection.aggregate(pipeline) # returns a cursor

df1 = pd.DataFrame(list(aggResult)) # use list to turn the cursor to an array of documents
df1 = df1.set_index("_id")
df1.head()

_id,AvgAge,Count
China,46.25,8
Antarctica,46.333333,6
Guernsey,48.333333,6
Puerto Rico,26.5,4
Heard Island and Mcdonald Islands,29.0,4
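
To make clear what the `$group` and `$sort` stages compute, here is the same calculation in plain Python. The three sample records are hypothetical and are not taken from dummyData.json.

```python
from collections import defaultdict

# Hypothetical sample records with the same field names as the lab data
people = [
    {"Country": "China", "Age": 40},
    {"Country": "China", "Age": 50},
    {"Country": "Serbia", "Age": 18},
]

# $group with _id "$Country": collect the ages for each country
ages_by_country = defaultdict(list)
for person in people:
    ages_by_country[person["Country"]].append(person["Age"])

# $avg of "$Age" and $sum of 1 for each group
summary = [
    {"_id": country, "AvgAge": sum(ages) / len(ages), "Count": len(ages)}
    for country, ages in ages_by_country.items()
]

# $sort: Count descending (-1), then AvgAge ascending (1)
summary.sort(key=lambda row: (-row["Count"], row["AvgAge"]))
print(summary)
```

Doing this in the database rather than in Python means only the grouped summary travels over the network, which matters for large collections.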


For simple cases one can either query with `find()` (e.g. `collection.find({"Country": "China"})`) or use the `$match` stage in the aggregation framework, like this:

In [101]:
pipeline = [
        {"$match": {"Country":"China"}},
]
aggResult = collection.aggregate(pipeline)
df2 = pd.DataFrame(list(aggResult))
df2.head()

,_id,Name,Age,Country,Location
0,66a82d0e1d80b0e3cba335f5,"Byrd, Dante A.",43,China,"31.2, 121.5"
1,66a82d0e1d80b0e3cba335fe,"Carney, Tamekah I.",57,China,"45.75, 126.6333"
2,66a82d0e1d80b0e3cba3360b,"Mayer, Violet U.",53,China,"40, 95"
3,66a82d0e1d80b0e3cba3361e,"Holman, Hasad O.",32,China,"39.9127, 116.3833"
4,66a82d3c15d21f0f0ef7d1a7,"Holman, Hasad O.",32,China,"39.9127, 116.3833"


Now we can apply all the power of Python libraries to analyse and visualise the data. Here, we will use the folium package to plot markers for the locations of the people we just found in China (click on a marker to see their data):

In [103]:
# Install the folium package (needed the first time only):
import sys
!{sys.executable} -m pip install folium

Collecting folium
  Downloading folium-0.17.0-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting branca>=0.6.0 (from folium)
  Using cached branca-0.7.2-py3-none-any.whl.metadata (1.5 kB)
Downloading folium-0.17.0-py2.py3-none-any.whl (108 kB)
   ---------------------------------------- 0.0/108.4 kB ? eta -:--:--
   ---------------------- ----------------- 61.4/108.4 kB 3.4 MB/s eta 0:00:01
   ---------------------------------------- 108.4/108.4 kB 2.1 MB/s eta 0:00:00
Using cached branca-0.7.2-py3-none-any.whl (25 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.7.2 folium-0.17.0


In [105]:
import folium
print ('Folium version ' + folium.__version__)

world_map = folium.Map(location = [35, 100], zoom_start = 4)
for i in range(len(df2)):
    location = [float(loc) for loc in df2.Location[i].split(',')]
    folium.Marker(location = location, popup = df2.Name[i] + ', age:' + str(df2.Age[i])).add_to(world_map)

world_map

Folium version 0.17.0
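
The map cell converts each "Location" string into a pair of floats inline. A slightly more defensive version of that step could look like the sketch below; `parse_location` is a hypothetical helper, and it assumes every value has the form "lat, lon".

```python
def parse_location(text):
    """Turn a 'lat, lon' string into a [lat, lon] list of floats."""
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected 'lat, lon', got {text!r}")
    # float() ignores the space that follows the comma
    return [float(part) for part in parts]

print(parse_location("31.2, 121.5"))
```

Raising an error on malformed input makes a bad record visible immediately instead of placing a marker at the wrong spot.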


In [107]:
#Finally drop the databases created in the lab:
print(client.list_database_names())
client.drop_database('test')
client.drop_database('command_test')
print(client.list_database_names())

['admin', 'command_test', 'config', 'local', 'test']
['admin', 'config', 'local']


## HOMEWORK:


1. Read up on how to perform aggregation in MongoDB. Insert a duplicate record into the collection:
        mydict = {"name": "Benjamin", "address": "Green Dr"}
   Now write a command to find docs with a duplicate "name" field (using aggregation) and remove them.  
   Print the collection.

In [109]:
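# Insert the duplicate record
# Note: the same dict object is passed twice, so the second insert reuses the _id
# that PyMongo added to it during the first insert and fails with a duplicate key error
mydict = {"name": "Benjamin", "address": "Green Dr"}
collection.insert_many([mydict, mydict])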

BulkWriteError: batch op errors occurred, full error: {'writeErrors': [{'index': 1, 'code': 11000, 'errmsg': "E11000 duplicate key error collection: command_test.people index: _id_ dup key: { _id: ObjectId('66a82de0c14a4900a5d11295') }", 'keyPattern': {'_id': 1}, 'keyValue': {'_id': ObjectId('66a82de0c14a4900a5d11295')}, 'op': {'name': 'Benjamin', 'address': 'Green Dr', '_id': ObjectId('66a82de0c14a4900a5d11295')}}], 'writeConcernErrors': [], 'nInserted': 1, 'nUpserted': 0, 'nMatched': 0, 'nModified': 0, 'nRemoved': 0, 'upserted': []}
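
The error above arises because the driver adds an `_id` to each dict in place when it has none. The sketch below mimics that behaviour without a database (`assign_ids` is a hypothetical stand-in, and a random hex string stands in for an `ObjectId`) to show why separate copies avoid the clash.

```python
from uuid import uuid4

def assign_ids(docs):
    """Mimic the driver: give each dict an _id in place if it lacks one."""
    for doc in docs:
        if "_id" not in doc:
            doc["_id"] = uuid4().hex  # stand-in for an ObjectId
    return [doc["_id"] for doc in docs]

mydict = {"name": "Benjamin", "address": "Green Dr"}
same_object_ids = assign_ids([mydict, mydict])
print(same_object_ids[0] == same_object_ids[1])  # True: one dict, one _id

base = {"name": "Benjamin", "address": "Green Dr"}
copied_ids = assign_ids([dict(base), dict(base)])
print(copied_ids[0] == copied_ids[1])  # False: two dicts, two _ids
```

In the lab, passing copies, for example `collection.insert_many([dict(mydict), dict(mydict)])`, should therefore insert two distinct documents with the same name and address.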

In [117]:
# Find duplicates
pipeline = [
    {"$group": {
            "_id": "$name",
            "count": {"$sum": 1},
            "docs": {"$push": "$_id"}}},
    {"$match": {"count": {"$gt": 1}}}]

duplicates = list(collection.aggregate(pipeline))
print("Duplicate documents:")
for doc in duplicates:
    print(doc)

Duplicate documents:


In [119]:
# Remove duplicates
keep_ids = set()
for group in duplicates:
    ids = group["docs"]
    keep_ids.add(ids[0])
    collection.delete_many({"_id": {"$in": ids[1:]}})

print("Duplicates removed.")

Duplicates removed.


In [121]:
# Print remaining documents
remaining_docs = list(collection.find())
print("Remaining documents:")
for doc in remaining_docs:
    print(doc)

Remaining documents:
{'_id': ObjectId('66a82de0c14a4900a5d11295'), 'name': 'Benjamin', 'address': 'Green Dr'}


2. Read up on how to apply indexes in MongoDB. Create an index on the "name" and "address" fields in this collection.
   Print the indexes for the collection.

In [123]:
# Create index on the 'name' and 'address' fields
collection.create_index([("name", 1), ("address", 1)])

print("Index created on 'name' and 'address' fields.")

Index created on 'name' and 'address' fields.








In [125]:
# Print the indexes for the collection
indexes = collection.list_indexes()
print("Indexes for the collection:")
for index in indexes:
    print(index)

Indexes for the collection:
SON([('v', 2), ('key', SON([('_id', 1)])), ('name', '_id_')])
SON([('v', 2), ('key', SON([('name', 1), ('address', 1)])), ('name', 'name_1_address_1')])
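
The index name `name_1_address_1` in the output follows from the keys and directions passed to `create_index()`. The sketch below reproduces that naming pattern; `default_index_name` is a hypothetical helper, and the built-in `_id_` index is a special case it does not cover.

```python
def default_index_name(keys):
    """Join (field, direction) pairs the way the default index name is formed."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)

print(default_index_name([("name", 1), ("address", 1)]))
print(default_index_name([("name", 1), ("address", -1)]))
```

To choose a different name, `create_index()` accepts a `name` keyword argument, e.g. `collection.create_index([("name", 1), ("address", 1)], name="name_address")`.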




---

© 2024 Institute of Data



