_____

<table align="left" width=100%>
    <td>
        <div style="text-align: center;">
          <img src="./images/bar.png" alt="entidades financiadoras"/>
        </div>
    </td>
    <td>
        <p style="text-align: center; font-size:24px;"><b>Introduction to Data Science</b></p>
        <p style="text-align: center; font-size:18px;"><b>Master in Electrical and Computer Engineering</b></p>
        <p style="text-align: center; font-size:14px;"><b>Pedro Cardoso (pcardoso@ualg.pt)</b></p>
    </td>
</table>

_____

# PyMongo
The first step when working with PyMongo is to create a MongoClient to the running mongod instance.

Make sure you have a MongoDB instance running - see [https://www.mongodb.com/docs/manual/administration/install-community/](https://www.mongodb.com/docs/manual/administration/install-community/)

In [None]:
try:
    from pymongo import MongoClient
except ImportError:
    # install the package only if it is missing
    !pip install pymongo
    from pymongo import MongoClient

try:
    import psutil
except ImportError:
    !pip install psutil
    import psutil

Being a local server, you can create a client in several ways.

In [21]:
client = MongoClient()
# same as 
#  client = MongoClient('localhost', 27017)
# or 
#  client = MongoClient('mongodb://localhost:27017/')

client

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

## Databases 
A single MongoDB instance can support multiple independent databases. When working with PyMongo you access databases using **attribute style access** on MongoClient instances.

So, the next line will "connect" to the `sensorsDB` database (or create it, if it does not exist, once data is written to it).

This also means that you have to be very careful with the naming: a misspelled name does not raise an error, it silently refers to a different (new) database.

In [24]:
db = client.sensorsDB

## Collections 
A collection is a group of documents stored in MongoDB, and can be thought of as roughly the equivalent of a table in a relational database. Getting a collection in PyMongo works the same as getting a database.

In [27]:
sensors_location = db.sensors_locations

An important note about collections (and databases) in MongoDB is that they are created lazily - none of the above commands have actually performed any operations on the MongoDB server. Collections and databases are created when the first document is inserted into them.
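
The lazy creation described above can be checked by asking the server what it actually stores. The following is a minimal sketch: the helper name `collection_exists` and its parameters are illustrative (not part of PyMongo), while `list_database_names()` and `list_collection_names()` are the PyMongo calls it relies on.

```python
def collection_exists(client, db_name, coll_name):
    """Return True only if the server already stores the collection.

    `client` is expected to behave like a pymongo.MongoClient.
    The helper itself is hypothetical, written only for illustration.
    """
    # list_database_names() queries the server, so a database that was
    # only referenced (and never written to) is not listed.
    if db_name not in client.list_database_names():
        return False
    # Dictionary-style access is equivalent to attribute-style access.
    return coll_name in client[db_name].list_collection_names()


# Usage with the client created above (requires a running server):
# collection_exists(client, 'sensorsDB', 'sensors_locations')
```

Before the first insert the call is expected to return `False`, and `True` afterwards.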

# Insert documents

To **insert a document** into a collection we can use the `insert_one()` method.

One way is to create a document and pass it to the insert_one() method.

In [29]:
data = {
        'location_name': 'Prometheus Server', 
        'description' : 'Prometheus Server @ lab. 163 / ISE /UAlg',
        'sensor': [ 
                    {
                        'sensor_name' : 'cpu_sensor', 
                        'unit' : 'percent'
                    },
                    {
                        'sensor_name' : 'mem_sensor', 
                        'unit' : 'percent'
                    }
             ]
       }

The `insert_one()` method takes a document as its argument and returns an `InsertOneResult` instance, which holds the `_id` of the inserted document.

In [31]:
x = sensors_location.insert_one(data)
x

InsertOneResult(ObjectId('67b4fdc7e934e3bfa66c79e7'), acknowledged=True)

Further, we can get the `_id` of the inserted document. This is relevant when we want to use it later to update or delete the document or to relate it to other documents.

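The `_id` is a unique identifier for the document. If the document does not already contain an `_id` field, one is generated automatically (PyMongo adds it on the client side before sending the document to the server). The value is a 12-byte `ObjectId` built from the following components: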

- Timestamp: The first 4 bytes are a timestamp, representing the ObjectId's creation, measured in seconds since the Unix epoch. This provides the `ObjectId` with a natural order by time of creation.

- Random value: The next 5 bytes are a random value generated once per process, unique to the machine and process. This makes it very unlikely that ObjectIds generated on different machines or processes collide. In older versions of MongoDB, these 5 bytes were split into a 3-byte machine identifier (derived from the machine's hostname) and a 2-byte process ID.

- Counter: The last 3 bytes are a counter, starting with a random value. This counter increments with each new ObjectId generated. It helps ensure uniqueness for ObjectIds created in the same second, on the same machine, by the same process.

The combination of these components makes it extremely unlikely that two ObjectIds collide across different machines, processes, and moments in time. This system avoids the need for a more costly centralized ID generation scheme and makes it easy to generate IDs in a distributed environment, which is crucial for the scalability of MongoDB.

The 12-byte ObjectId format is compact and efficient, both in terms of storage space and performance. It also provides some level of timestamp-based sorting, which can be useful in certain applications.
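
The byte layout described above can be inspected with the standard library alone. This is a minimal sketch: the function name `objectid_parts` and the dictionary keys are illustrative choices, and the hexadecimal string is the `_id` shown in the output above. With PyMongo, the creation time should also be available through the `generation_time` attribute of an `ObjectId`.

```python
import datetime


def objectid_parts(hex_id):
    """Split a 24-character ObjectId hex string into its three parts."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError('an ObjectId has exactly 12 bytes')
    # First 4 bytes: seconds since the Unix epoch (big-endian).
    seconds = int.from_bytes(raw[:4], 'big')
    created = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
    return {
        'created': created,
        # The 5 bytes between the timestamp and the counter.
        'middle_bytes': raw[4:9].hex(),
        # Last 3 bytes: the incrementing counter.
        'counter': int.from_bytes(raw[9:], 'big'),
    }


parts = objectid_parts('67b4fdc7e934e3bfa66c79e7')
print(parts['created'])       # 2025-02-18 21:38:15+00:00
print(parts['middle_bytes'])  # e934e3bfa6
```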

In [33]:
location_id = x.inserted_id
location_id

ObjectId('67b4fdc7e934e3bfa66c79e7')

Let us see what is in the `sensors_locations` collection (referenced by the `sensors_location` variable). To list all documents in the collection we can use the `find()` method, which returns a cursor that can be used to iterate over the documents.

In [35]:
from pprint import pprint

for doc in sensors_location.find():
    pprint(doc)

{'_id': ObjectId('67b4fdc7e934e3bfa66c79e7'),
 'description': 'Prometheus Server @ lab. 163 / ISE /UAlg',
 'location_name': 'Prometheus Server',
 'sensor': [{'sensor_name': 'cpu_sensor', 'unit': 'percent'},
            {'sensor_name': 'mem_sensor', 'unit': 'percent'}]}


And now, in the `sensors_readings` collection, we can insert one document for each reading of the sensor.

Note: we'll use the `datetime` module to generate the timestamp and the `psutil` module to generate the value. The latter, psutil, is a package that provides access to many different system utilities, e.g., CPU usage, memory usage, disk usage, network usage, etc.

In [37]:
import datetime
import psutil

for _ in range(200):
    # create the document
    data = {
           'sensor' : {'location_id': location_id, 
                       'sensor_name' : 'cpu_sensor' 
                      },
            'value' : psutil.cpu_percent(interval=0.1),
            'units' : 'percent',
            'timestamp' : datetime.datetime.utcnow()
           }
    # send the document to the database
    res = db.sensors_readings.insert_one(data)
    print('.', end='')   

........................................................................................................................................................................................................

Let us store the last `_id` for later.

In [None]:
_id = res.inserted_id
_id

To list all inserted readings, we can again use the `find()` method and force the cursor to return all documents by calling the `list()` function.

In [None]:
list(db.sensors_readings.find())

We can also list the inserted readings, sorted by value and timestamp. The `sort()` method allows us to sort the results by one or more fields, in this case `value` and `timestamp`. It receives a list of tuples with the field name and the sort order.

In [None]:
list(
    db.sensors_readings.find().sort([
        ('value',-1),
        ('timestamp', -1)]
    )
)

Given the `ObjectId` (we stored it earlier), it is possible to get one specific document.

In [None]:
pprint(list(db.sensors_readings.find({'_id': _id})))

Or find all documents with a value greater than 50%:

In [None]:
query = {'value': {'$gt': 50}}
pprint(list(db.sensors_readings.find(query)))

Or all documents with a timestamp in the last 5 minutes:

In [None]:
query = {'timestamp': {'$gt': datetime.datetime.utcnow() - datetime.timedelta(minutes=5)}}

pprint(list(db.sensors_readings.find(query)))
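
The time-window filter above can be wrapped in a small helper so that it is not rebuilt by hand each time. This is a sketch under assumptions: the name `readings_since` and its `now` parameter are illustrative, and the timestamps are assumed to be naive UTC datetimes, as stored by the cells above.

```python
import datetime


def readings_since(minutes, now=None):
    """Build a filter for documents whose `timestamp` is newer than `minutes` ago."""
    if now is None:
        # Naive UTC datetime, matching how the readings were stored.
        now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
    return {'timestamp': {'$gt': now - datetime.timedelta(minutes=minutes)}}


# Several conditions in the same filter document are combined with a logical AND:
# recent readings whose value is also above 50%.
query = {**readings_since(5), 'value': {'$gt': 50}}
print(query)

# With a running server: pprint(list(db.sensors_readings.find(query)))
```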

## Embedding of information I
In this approach, a single document contains **multiple sensors with a single reading each**. The location information is also embedded in the document.

In [None]:
import datetime
import psutil

for _ in range(200):
    data = {
        'location_name': 'Prometheus Server', 
        'description' : 'Prometheus Server @ lab. 163 / ISE /UAlg',
        'sensors' : [ 
               {
                   'sensor_name' : 'mem_sensor', 
                   'value' : psutil.virtual_memory().percent,
                   'units' : 'percent'
               },
               {
                   'sensor_name' : 'cpu_sensor', 
                   'value' : psutil.cpu_percent(interval=0.1),
                   'units' : 'percent'
               }
           ],
        'timestamp' : datetime.datetime.utcnow()
    }
    db.sensors_readings.insert_one(data)
    print('.', end='')

Get the last inserted document. In this case we use the `limit(1)` method to limit the number of documents returned. Since the results are sorted by timestamp in descending order, the first one is the most recently inserted.

In [None]:
pprint(
    list(
        db.sensors_readings.find()\
            .sort([('timestamp', -1)])\
            .limit(1)
    )
)

## Embedding of information II
A single document contains multiple sensors and multiple readings per sensor.

In [None]:
data = {
    'location_name': 'Prometheus Server', 
    'description' : 'Prometheus Server @ lab. 163 / ISE /UAlg',
    'sensors' : [ 
           {
               'sensor_name' : 'mem_sensor', 
               'values' :[] ,
               'units' : 'percent'
           },
           {
               'sensor_name' : 'cpu_sensor', 
               'values' : [],
               'units' : 'percent'
           }
       ],
}

# get the readings id to later add values to the readings
readings_id = db.sensors_readings.insert_one(data).inserted_id
readings_id

However, in this implementation the **full document is uploaded each time a new reading is made**, because the document stored in the database must be updated with every new reading. To do this, we use the `update_one()` method, which takes two required arguments: the query (filter) and the update. Further optional arguments (e.g., `upsert`) are described in the documentation ([https://www.mongodb.com/docs/manual/reference/method/db.collection.update/](https://www.mongodb.com/docs/manual/reference/method/db.collection.update/)).

In [None]:
for _ in range(100):
    mem = psutil.virtual_memory().percent
    cpu = psutil.cpu_percent(interval=0.01)

    # update the data
    data['sensors'][0]['values'].append({'value': mem, 'timestamp' : datetime.datetime.utcnow()})
    data['sensors'][1]['values'].append({'value': cpu, 'timestamp' : datetime.datetime.utcnow()})
    # update the database, sending the full document again!!
    db.sensors_readings.update_one(
        {'_id': readings_id}, 
        {'$set': data}
    )
    
    print('.', end='')

The last reading is 

In [None]:
x = list(
    db.sensors_readings.find() \
        .sort([('_id', -1)]) \
        .limit(1)
)
x

To get a value from it, we can "navigate" the nested lists and dictionaries:

In [None]:
x[0]['sensors'][0]['values'][0]['value']

## Embedding of information III
As before, a single document contains multiple sensors and multiple readings. But now, only the fields that change are sent to the database.

In [None]:
import datetime
import psutil

data = {
    'location_name': 'Prometheus Server', 
    'description' : 'Prometheus Server @ lab. 163 / ISE /UAlg', 
    'sensors' : [ 
           {
               'sensor_name' : 'mem_sensor', 
               'values' : [],
               'units' : 'percent'
           },
           {
               'sensor_name' : 'cpu_sensor', 
               'values' : [],
               'units' : 'percent'
           }
       ]
}

readings_id = db.sensors_readings.insert_one(data).inserted_id

At this point, a first document has been inserted with no sensor values and its `_id` has been stored. In the following cell, each new reading is appended (pushed) to the corresponding array of that document.

In [None]:
for _ in range(200):
    mem = psutil.virtual_memory().percent
    cpu = psutil.cpu_percent(interval=0.1)
    
    # update the database, sending only the update
    db.sensors_readings.update_one(
        {'_id': readings_id}, 
        {
            '$push': {
                'sensors.0.values': {'value': mem, 'timestamp' : datetime.datetime.utcnow()},
                'sensors.1.values': {'value': cpu, 'timestamp' : datetime.datetime.utcnow()}        
            }
        }
    )    
    
    print('.', end='')
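
To make the difference between resending the full document with `$set` and sending only the new readings with `$push` more concrete, the sketch below compares the approximate size of both update documents as the number of stored readings grows. It makes two assumptions: the length of the JSON text is used only as a rough stand-in for the size actually sent to the server, and `READING` is a placeholder value rather than real sensor data. All helper names are illustrative.

```python
import json

# Placeholder reading; real documents hold psutil values and datetimes.
READING = {'value': 12.3, 'timestamp': '2025-01-01T00:00:00'}


def full_document_update(n_reads):
    """'$set' update that resends the whole document with n_reads per sensor."""
    values = [READING] * n_reads
    return {'$set': {
        'location_name': 'Prometheus Server',
        'sensors': [
            {'sensor_name': 'mem_sensor', 'values': values, 'units': 'percent'},
            {'sensor_name': 'cpu_sensor', 'values': values, 'units': 'percent'},
        ],
    }}


def push_update():
    """'$push' update that sends only the two new readings."""
    return {'$push': {
        'sensors.0.values': READING,
        'sensors.1.values': READING,
    }}


def approx_size(update):
    """Length of the JSON text, a rough proxy for the payload size."""
    return len(json.dumps(update))


for n in (1, 10, 100):
    print(n, approx_size(full_document_update(n)), approx_size(push_update()))
```

The `$set` payload grows with every stored reading, while the `$push` payload stays constant.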

The last reading is 

In [None]:
pprint(
    list(
        db.sensors_readings\
            .find()\
            .sort([('_id', -1)])\
            .limit(1)
    )
)

# Getting Documents
A single document can be retrieved with the `find_one()` method, which returns the first document in the collection that matches the query (or `None` if there is no match). The syntax is the same as for the `find()` method, i.e., `find_one({query}, {projection})`.

In [None]:
db.sensors_readings.find_one()

Find one reading from "Prometheus Server"

In [None]:
db.sensors_readings.find_one({'location_name':'Prometheus Server'})

Get the `ObjectId` of one reading in the `sensors_readings` collection:

In [None]:
obj_id = db.sensors_readings.find_one({'location_name':'Prometheus Server'})["_id"]
obj_id

Querying By ObjectId

In [None]:
from bson.objectid import ObjectId

db.sensors_readings.find_one({'_id': obj_id})  # obj_id was obtained in the previous cell

Do projections, i.e., select which fields to present

In [None]:
db.sensors_readings.find_one(
    {'_id': obj_id},
    {'sensors':1}
)
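
As a rough plain-Python picture of what an inclusion projection such as `{'sensors': 1}` does, the sketch below filters a dictionary in the same spirit. It is only an approximation written for illustration: it handles top-level fields only, the function name `apply_inclusion_projection` is hypothetical, and the real work is done by the MongoDB server. It does reflect one behaviour worth remembering: `_id` is returned unless it is explicitly excluded with `'_id': 0`.

```python
def apply_inclusion_projection(doc, projection):
    """Approximate a top-level inclusion projection on a plain dict."""
    # `_id` is included by default and dropped only when set to 0/False.
    include_id = bool(projection.get('_id', 1))
    wanted = {field for field, flag in projection.items()
              if flag and field != '_id'}
    return {key: value for key, value in doc.items()
            if key in wanted or (key == '_id' and include_id)}


example = {
    '_id': 1,
    'location_name': 'Prometheus Server',
    'sensors': [{'sensor_name': 'cpu_sensor'}],
}
print(apply_inclusion_projection(example, {'sensors': 1}))
print(apply_inclusion_projection(example, {'sensors': 1, '_id': 0}))
```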

# Bulk Insert
In addition to inserting a single document, we can also perform bulk insert operations, by passing a list as the first argument to insert_many(). This will insert each document in the list, sending only a single command to the server.

The result of insert_many() holds multiple ObjectId instances, one for each inserted document.

In [None]:
new_posts = [{
                'sensor': {'location_id': ObjectId('5a95821bdc936e0cfc7c7d96'),
                'sensor_name': 'cpu_sensor'},
                'timestamp': datetime.datetime.utcnow(),
                'units': 'percent',
                'value': 4.5
            },
             {
                'sensor': {'location_id': ObjectId('5a95821bdc936e0cfc7c7d96'),
                'sensor_name': 'cpu_sensor'},
                'timestamp': datetime.datetime.utcnow(),
                'units': 'percent',
                'value': 4.5
             }
            ]
result = db.sensors_readings.insert_many(new_posts)
result

And get the ids of the inserted documents:

In [None]:
result.inserted_ids
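
When many readings are inserted at once, the list passed to `insert_many()` can be built programmatically. The following is a minimal sketch: the helper `make_readings`, its parameters, and the `'demo-location'` identifier are illustrative assumptions. Each reading is a separate dictionary, which matters because PyMongo adds an `_id` field to every document it inserts.

```python
import datetime


def make_readings(location_id, sensor_name, values, units='percent'):
    """Build one reading document per value, ready for insert_many()."""
    # Naive UTC timestamp shared by the whole batch.
    now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
    return [
        {
            'sensor': {'location_id': location_id, 'sensor_name': sensor_name},
            'timestamp': now,
            'units': units,
            'value': value,
        }
        for value in values
    ]


batch = make_readings('demo-location', 'cpu_sensor', [4.5, 7.2, 3.1])
print(len(batch))

# With a running server:
# result = db.sensors_readings.insert_many(batch)
# result.inserted_ids
```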

# Querying for More Than One Document

To get more than a single document as the result of a query we use the find() method. find() returns a Cursor instance, which allows us to iterate over all matching documents. 

Note that, due to our experiments, the documents do not follow any type of schema. This is not recommended but is possible. So, fields like `timestamp` are found in every document, but in different "positions" in the document.

For example, we can iterate over every document in the `sensors_readings` collection:

In [None]:
for doc in db.sensors_readings.find():
    pprint(doc)

You can also limit the output and order it:

In [None]:
for doc in db.sensors_readings.find().sort([('_id',1)]).limit(2):
    pprint(doc)

## Counting
If we just want to know how many documents match a query we can perform a count_documents() operation instead of a full query.

In [None]:
db.sensors_readings.count_documents({})

It is also possible to count the number of documents in a collection satisfying a query

In [None]:
query = {'location_name':'Prometheus Server'}
db.sensors_readings.count_documents(query)

## Range Queries
MongoDB supports many different types of advanced queries.


In [None]:
last_inserted_doc_timestamp = db.sensors_readings.find().sort([('_id',-1)]).limit(1)[0]['timestamp']
last_inserted_doc_timestamp

In [None]:
for doc in db.sensors_readings.find({'timestamp': last_inserted_doc_timestamp}):
    print(doc)

As an example, let's perform a query where we limit results to readings inserted in the last 5 minutes:

In [None]:
date = datetime.datetime.utcnow() - datetime.timedelta(minutes=5)
query = {'timestamp': {'$gt': date}}
for doc in db.sensors_readings.find(query):
    print(doc)

All readings with a `value` lower than 10:

In [None]:
for doc in db.sensors_readings.find({'value': {'$lt': 10}}):
    print(doc)

All CPU readings with `value` lower than 10 in documents with the structure shown below.
[Be aware that, since we were experimenting, the documents may not all have the same structure.]

![./images/doc_example.png](./images/doc_example.png)

In [None]:
for doc in db.sensors_readings.find({'sensors.1.values.value': {'$lt': 10}}):
    print(doc)
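
Note that a query on an array field such as `sensors.1.values.value` selects whole documents: a document matches as soon as at least one element of the array satisfies the condition, and the returned document still contains all of its readings. A minimal sketch of extracting only the matching readings on the Python side follows; the helper name `low_readings` and the sample document are illustrative.

```python
def low_readings(doc, sensor_name, threshold):
    """Return the readings of one sensor whose value is below `threshold`."""
    for sensor in doc.get('sensors', []):
        if sensor.get('sensor_name') == sensor_name:
            # Documents without a `values` array simply yield no readings.
            return [reading for reading in sensor.get('values', [])
                    if reading['value'] < threshold]
    return []


sample = {
    'sensors': [
        {'sensor_name': 'mem_sensor', 'values': [{'value': 40.0}]},
        {'sensor_name': 'cpu_sensor',
         'values': [{'value': 3.5}, {'value': 25.0}, {'value': 9.9}]},
    ]
}
print(low_readings(sample, 'cpu_sensor', 10))
```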