<center>
    <h2>Online learning platform database - MongoDB</h2>
    <h3>Methodologies applied for loading the data and performing the queries</h3>
</center>

<h3>Preliminary operations: import csv files into MongoDB (<code>mongoimport</code> tool)</h3>

There are two main ways to import a csv file into a MongoDB instance. One is <a href = 'https://www.mongodb.com/docs/compass/master/import-export/'><i>MongoDB Compass</i></a>, the Graphical User Interface for MongoDB. The other is <a href = 'https://www.mongodb.com/docs/database-tools/mongoimport/#mongodb-binary-bin.mongoimport'><i>mongoimport</i></a>, one of the MongoDB Database Tools. A GUI makes the import very user-friendly, but I prefer a method based on a standard command-line tool. The <code>mongoimport</code> tool must be run from the system command line, not from the mongo shell, so the command must include the host and authentication information needed to connect to the DBMS.
<br>
<h4>
Syntax
</h4><br>
The syntax of <code>mongoimport</code> requires the information for connecting to the desired MongoDB server (host name and port default to <i>localhost:27017</i>), together with authentication details, database name and file details.
Connection and authentication details can be provided through a <i>connection string</i> or via options. I opt for the latter because I find it more explicit and readable.
<br>
<code>
    mongoimport [options] [connection-string] [file]
</code>
<h4>
Options
</h4><br>
- <b>host name</b><br>
The <code>--host</code> option indicates the hostname and port of the MongoDB instance we want to connect to. As noted above, these need not be declared if the defaults (<i>localhost:27017</i>) are suitable:
<br>
<code>
    mongoimport --host=localhost:27017
</code>
<br>
- <b>authentication</b><br>
If we specify a <code>--password</code> option (together with <code>--username</code>), we must also set the <code>--authenticationDatabase</code> option. These can all be given explicitly on the command line, although it is recommended that the password be stored in a configuration file or entered at the prompt. For the latter, it is sufficient to pass an empty string ('') to the <code>--password</code> option. Username and password need not be enclosed in quotes.<br>The <code>--authenticationDatabase</code> option specifies the <a href = 'https://www.mongodb.com/docs/manual/core/security-users/#std-label-user-authentication-database'>authentication database</a> where the given username was created (<i>admin</i>, in my case).
<br>
<code>
    mongoimport --host=localhost:27017 --username=root --password='' --authenticationDatabase=admin
</code>
<br>
- <b>database</b><br>
The <code>--db</code> option is used to declare the name of the database where we want to import the csv file. We can also use a <code>--collection</code> option to indicate the name of the collection within the previously declared database. If the <code>--collection</code> option is not used, <i>mongoimport</i> creates a collection within the declared database with the filename (without extension) as collection name.
<br>
<code>
    mongoimport --host=localhost:27017 --username=root --password='' --authenticationDatabase=admin 
        --db=dbB_MONGODB_test
</code>
<br>
- <b>file</b><br>
The file to import is specified together with a couple more options: <code>--type</code> is needed if the file is not <i>JSON</i> (the default type), <code>--headerline</code> indicates that the first row of the csv contains the header, and <code>--file</code> gives the path and filename.
<br>
<code>
    mongoimport --host=localhost:27017 --username=root --password='' --authenticationDatabase=admin 
        --db=dbB_MONGODB_test --type=csv --headerline --file=path/filename.csv
</code>
<br>
When the file is imported, <code>mongoimport</code> translates each row into a document, assigning to each value a key equal to the value of the corresponding column of the first row of the csv file. Since a csv file is flat, the resulting schema of the documents in the collection will be homogeneous.
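
The row-to-document mapping described above can be illustrated with a minimal Python sketch. It only mimics the idea with the standard <code>csv</code> module and is not what <code>mongoimport</code> runs internally; the two sample rows are hypothetical and only borrow field names from the collection queried later in this notebook. Note one difference: <code>csv.DictReader</code> keeps every value as a string, whereas the imported documents shown later hold numeric values such as <code>courseID</code> as numbers.

```python
import csv
import io

# Hypothetical sample data: a header line followed by two rows.
SAMPLE_CSV = "courseID,studentID,firstName\n192,2,Custodia\n192,38,Sarah\n"


def rows_to_documents(csv_text):
    """Map each CSV row to a dict whose keys come from the header line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


docs = rows_to_documents(SAMPLE_CSV)
for doc in docs:
    print(doc)
```

Since every row is read against the same header, all resulting dictionaries share the same keys, which is the homogeneous schema mentioned above.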
<br>
<h4>
Notes when running MongoDB from a Docker container
</h4><br>
When loading the csv file from the local machine, we must take into account a basic feature of Docker containers: they come with a file system of their own, so a DBMS running inside a container has access to that file system, not to the local machine's. Hence, the csv file must first be copied into the container via the <code>docker cp</code> command.
<br>
<code>
    docker cp path/mycsvfile.csv container:/path
</code>

As previously pointed out, I just copy the file into the container's root, without specifying a subfolder.<br>Another important consequence of running MongoDB within a Docker container is that <code>mongoimport</code> must still be executed from the system command line, not from the Mongo shell (<code>mongosh</code>). This means that we run the <code>mongoimport</code> command through <code>docker exec</code>, without first opening a shell in the container:
<br>
<code>
    docker exec <i>container</i> mongoimport --host=localhost:27017 --username=root --password='' 
        --authenticationDatabase=admin --db=dbB_MONGODB_test --type=csv --headerline --file='filename.csv'
</code>
<br>
If the import is successful, <code>mongoimport</code> prints a message reporting the number of documents successfully imported (this should equal the number of data rows of the csv file) and the number of those that failed to import, after which control returns to the shell from which <code>docker exec</code> was run. By running <i>mongosh</i> from the container's bash shell, or by opening <i>MongoDB Compass</i>, we can then query the newly created collection.
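
When several csv files have to be loaded, the <code>docker exec</code> command above can be assembled programmatically. The following is a minimal sketch, not a tested import routine: the function name and the container name <code>mongo1</code> are hypothetical, and the options simply mirror the ones used in the command above. It only builds and prints the command; passing the list to <code>subprocess.run()</code> would execute it.

```python
import shlex


def build_mongoimport_command(container, database, csv_file,
                              username="root", auth_db="admin",
                              host="localhost:27017"):
    """Return the docker exec / mongoimport command as an argument list."""
    return [
        "docker", "exec", container,
        "mongoimport",
        f"--host={host}",
        f"--username={username}",
        "--password=",  # empty value, equivalent to --password='' in the shell
        f"--authenticationDatabase={auth_db}",
        f"--db={database}",
        "--type=csv",
        "--headerline",
        f"--file={csv_file}",
    ]


# 'mongo1' is a placeholder container name.
cmd = build_mongoimport_command("mongo1", "dbB_MONGODB_test", "filename.csv")
print(shlex.join(cmd))
```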

<h3>Python - MongoDB interaction</h3>

Interaction between Python and a MongoDB DBMS requires the installation of a specific driver. It is convenient to use the driver officially developed by the MongoDB team, i.e. <i>PyMongo</i>. A list of MongoDB drivers for various programming languages is provided on the <a href = 'https://www.mongodb.com/docs/drivers/'>MongoDB Drivers</a> web page, together with the ones dedicated to Python (<a href = 'https://www.mongodb.com/docs/drivers/pymongo/'><i>PyMongo</i></a> and <a href = 'https://www.mongodb.com/docs/drivers/motor/'><i>Motor</i></a>).<br>Once installed, the driver can be imported into a Python environment in the usual way.

In [1]:
import pymongo

<h4>
Establishing a connection to MongoDB
</h4><br>
A <a href = 'https://pymongo.readthedocs.io/en/stable/api/pymongo/mongo_client.html#pymongo.mongo_client.MongoClient'>client object</a> can be created by instantiating the driver's <code>MongoClient</code> class. The default host is <i>localhost</i> on port <i>27017</i>, but these can also be specified explicitly, together with the authentication credentials.

In [2]:
client = pymongo.MongoClient(host = 'localhost', port = 27017, username = 'root', password = 'myPassword')
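
As an alternative to keyword arguments, <code>MongoClient</code> also accepts a connection string. The sketch below shows one way such a URI could be assembled; the helper name and the sample password are hypothetical. Percent-escaping the credentials with <code>quote_plus</code> matters when they contain characters such as <code>:</code> or <code>@</code>, and <code>authSource</code> plays the role of the authentication database.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host="localhost", port=27017,
                    auth_db="admin"):
    """Build a MongoDB connection string with percent-escaped credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb://{user}:{pwd}@{host}:{port}/?authSource={auth_db}"


# Placeholder credentials, chosen to show the escaping of ':' and '@'.
uri = build_mongo_uri("root", "my:P@ss")
print(uri)
```

The resulting string can then be passed as the first argument, e.g. <code>pymongo.MongoClient(uri)</code>.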

<h4>
Choosing a database
</h4><br>
The database of interest can be accessed (and assigned to a Python variable as a database object) as an attribute of the client object, or explicitly by passing the database name as a string to the client's <code>get_database()</code> method. The full list of databases available in the MongoDB server can be shown by means of the client's <code>list_database_names()</code> method, which requires no arguments (recall that the methods and attributes of an object can be displayed with the <code>dir()</code> function).

In [3]:
print('List of databases in the MongoDB server:\n')
dbCounter = 0
for dbs in client.list_database_names():
    dbCounter += 1
    print(dbCounter, dbs)

List of databases in the MongoDB server:

1 admin
2 bilancio_demografico
3 config
4 congress
5 dati_comuni
6 dbB_MONGODB_test
7 local
8 university
9 weather


In [4]:
#db = client.get_database('dbB_MONGODB_test')
db = client.dbB_MONGODB_test

<h4>
Choosing a collection
</h4><br>
We can equivalently access a collection within the database (and assign it to a variable as a collection object) as an attribute of a database object or via explicitly passing its name as a string to the database object's <code>get_collection()</code> method. A list of all the collections available within a database can be displayed with the <code>list_collection_names()</code> method of a database object. A collection object is provided with methods replicating the functions available within the MongoDB DML for performing <b>CRUD</b> operations.

In [5]:
print('List of collections within the database %s:\n' % db.name)
collCounter = 0
for coll in db.list_collection_names():
    collCounter +=1
    print(collCounter, coll)

List of collections within the database dbB_MONGODB_test:

1 largeDB
2 humongousDB
3 smallDB
4 mediumDB


In [6]:
#smallColl = db.get_collection('smallDB')
smallColl = db.smallDB
mediumColl = db.mediumDB
largeColl = db.largeDB
humongousColl = db.humongousDB

<h4>
Executing a query
</h4><br>
Applying <code>dir()</code> to the collection object lists the utilities available for interacting with a collection. Among others, we find <i>insert</i>, <i>delete</i>, <i>update</i>, <i>find</i> and <i>aggregate</i> utilities, whose syntax is equivalent to that of the MongoDB DML. In particular, the <code>find()</code> method accepts the filtering conditions and the projection options as two dictionaries (each within curly brackets), separated by a comma. The method returns a <a href = 'https://pymongo.readthedocs.io/en/stable/api/pymongo/cursor.html#pymongo.cursor.Cursor'>cursor object</a> that can be iterated to display the query results. Each result in the cursor is a single document, returned as a Python dictionary. Hence, to display the query results, we first iterate over the cursor and then access the values by their keys.

In [7]:
cursor = smallColl.find({'courseID' : 192}, {'_id' : 0, 'studentID' : 1, 'firstName' : 1, 'lastName' : 1})

In [8]:
for res in cursor:
    print(res['studentID'], res['firstName'], res['lastName'])

2 Custodia Hidalgo
2 Custodia Hidalgo
2 Custodia Hidalgo
[... the same line repeats; output truncated ...]
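
The output above repeats the same student, presumably because the flat collection holds one document per access record rather than one per student. If only distinct students are of interest, the duplicates could be collapsed on the client side. The following is a minimal sketch with a hypothetical helper and made-up sample documents; it accepts any iterable of dictionaries, so a PyMongo cursor could be passed in their place.

```python
def unique_students(documents):
    """Return distinct (studentID, firstName, lastName) tuples, in order of first appearance."""
    seen = set()
    unique = []
    for doc in documents:
        key = (doc['studentID'], doc['firstName'], doc['lastName'])
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Made-up documents standing in for the results of find().
sample = [
    {'studentID': 2, 'firstName': 'Custodia', 'lastName': 'Hidalgo'},
    {'studentID': 2, 'firstName': 'Custodia', 'lastName': 'Hidalgo'},
    {'studentID': 38, 'firstName': 'Sarah', 'lastName': 'Lara'},
]
students = unique_students(sample)
print(students)
```

Doing the same on the server side is the job of the <code>$group</code> stage of an aggregation.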

Retrieve a single document matching the query.

In [9]:
smallColl.find_one({"courseID" : 192},{})

{'_id': ObjectId('6503860bdf6d53be1766bfce'),
 'courseID': 192,
 'discipline': 'statistics',
 'courseName': 'Econometrics: Methods and Applications',
 'courseYear': 2022,
 'syllabus': 'http://learning_platform.com/econometricsmethodsandapplications/syllabus',
 'studentID': 2,
 'firstName': 'Custodia',
 'lastName': 'Hidalgo',
 'dateOfBirth': '1981-4-23',
 'genre': 'female',
 'country': 'Eritrea',
 'town': 'Alicante',
 'email': 'custodia.hidalgo@yahoo.com',
 'materialID': 13367,
 'unit': 'Unit 1',
 'materialType': 'lecture slides',
 'name': '[SLIDES] Exploring core analytical skills',
 'dimension': 3,
 'accessDate': '2023-01-05',
 'recordID': 2}

Create a cursor object storing an aggregate query.

In [10]:
cursor2 = smallColl.aggregate([{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'sID' : '$studentID', 'name' : '$firstName', 'surname' : '$lastName'}}}, {'$project' : {'studentID' : 1, 'firstName' : 1, 'lastName' : 1}}])

In [11]:
for doc in cursor2:
    print(doc)

{'_id': {'sID': 1948, 'name': 'Ingeborg', 'surname': 'Amundsen'}}
{'_id': {'sID': 664, 'name': 'Ledün', 'surname': 'Soylu'}}
{'_id': {'sID': 415, 'name': 'Vigilija', 'surname': 'Gaižauskas'}}
{'_id': {'sID': 177, 'name': 'Narciso', 'surname': 'Ferrán'}}
{'_id': {'sID': 38, 'name': 'Sarah', 'surname': 'Lara'}}
{'_id': {'sID': 320, 'name': 'Patrícia', 'surname': 'Leite'}}
{'_id': {'sID': 1237, 'name': 'Ana', 'surname': 'Narušis'}}
{'_id': {'sID': 447, 'name': 'Casandra', 'surname': 'Arenas'}}
{'_id': {'sID': 1650, 'name': 'Nath', 'surname': 'Nicolas'}}
{'_id': {'sID': 1653, 'name': 'Émile', 'surname': 'Nicolas'}}
{'_id': {'sID': 1825, 'name': 'Cathrine', 'surname': 'Lie'}}
{'_id': {'sID': 2, 'name': 'Custodia', 'surname': 'Hidalgo'}}
{'_id': {'sID': 1208, 'name': 'Arthur', 'surname': 'Laroche'}}
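
In the output above every document contains only the <code>_id</code> key: after a <code>$group</code> stage the original fields no longer exist, so a <code>$project</code> stage that refers to <code>studentID</code>, <code>firstName</code> and <code>lastName</code> has nothing to select. If flat documents are preferred, the projection can instead refer to the sub-fields of <code>_id</code>. The sketch below only builds such a pipeline; the function name is hypothetical and this is one possible reshaping, not the pipeline used in this notebook.

```python
def students_of_course_pipeline(course_id):
    """Build an aggregation pipeline returning one flat document per student."""
    return [
        # keep only the documents of the requested course
        {'$match': {'courseID': course_id}},
        # one group per distinct student
        {'$group': {'_id': {'sID': '$studentID',
                            'name': '$firstName',
                            'surname': '$lastName'}}},
        # lift the grouped values out of _id into top-level fields
        {'$project': {'_id': 0,
                      'studentID': '$_id.sID',
                      'firstName': '$_id.name',
                      'lastName': '$_id.surname'}},
    ]


pipeline = students_of_course_pipeline(192)
print(pipeline)
```

The pipeline could then be run with <code>smallColl.aggregate(pipeline)</code>.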


<h4>
Measuring and displaying the query execution time
</h4>
<h4>
- method 1: <code>time()</code>
</h4><br>
To measure the query execution time we can use the Python <a href = 'https://docs.python.org/3/library/time.html'><i>time</i></a> module and its <code>time()</code> function. The function returns the system time as a floating-point number of seconds, so the execution time can be measured with sub-second precision. It is sufficient to store the time before the query execution in one variable and the time after it in another: the difference between the two is the elapsed time (multiplied by 1000 below to express it in milliseconds). Note that this is a client-side measurement: the time the Python API needs to reach the MongoDB server and to receive the response is added to the query execution time at the DBMS level.

In [12]:
import time
start = time.time()
cursor1 = smallColl.find({'courseID' : 192}, {'_id' : 0, 'studentID' : 1, 'firstName' : 1, 'lastName' : 1})
end = time.time()
print((end - start) * 1000)

0.11706352233886719
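
One caveat: PyMongo cursors are lazy, i.e. <code>find()</code> only builds the cursor and the query is sent to the server when iteration starts. The very small value above therefore likely reflects cursor creation rather than query execution. A minimal sketch of a timing helper that also consumes the results follows; the helper name is hypothetical, and <code>fake_query</code> is a stand-in so that the example runs without a database. It uses <code>time.perf_counter()</code>, which is intended for measuring short intervals.

```python
import time


def time_query_ms(run_query):
    """Run a zero-argument callable and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = run_query()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def fake_query():
    """Stand-in for a real query: waits 10 ms and returns one document."""
    time.sleep(0.01)
    return [{'studentID': 2}]


docs, elapsed = time_query_ms(fake_query)
print(len(docs), 'documents in', elapsed, 'ms')
```

With a real collection the callable would fetch all documents, e.g. <code>time_query_ms(lambda: list(smallColl.find({'courseID' : 192})))</code>, so that the measured interval includes the round trip to the server.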


<h4>
- method 2: <code>explain()</code>
</h4><br>
Note, however, that the cursor object returned by <code>find()</code> gives access to a series of execution statistics via its <code>explain</code> method. This method returns a dictionary of dictionaries. One of the inner dictionaries is associated with the key '<i>executionStats</i>', and one of its keys is '<i>executionTimeMillis</i>', whose value is the query execution time in milliseconds.

In [13]:
cursor1.explain()['executionStats']['executionTimeMillis']

167

Notice, however, that <code>explain()</code> runs the query again with the same statement rather than reporting on the earlier execution, so the value it returns will differ from call to call.

Moreover, the <code>explain</code> method is not available for the cursor returned by aggregate queries. In this case the <code>db.command()</code> method can be used: the <i>explain</i> command is passed as a string, followed by a dictionary providing the collection name as a string, the aggregation pipeline (preferably stored in a variable) and a (possibly empty) <i>cursor</i> document.

In [14]:
aggregation_example = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'sID' : '$studentID', 'name' : '$firstName', 'surname' : '$lastName'}}}, {'$project' : {'studentID' : 1, 'firstName' : 1, 'lastName' : 1}}]

In [30]:
cursor3 = db.command('explain', {'aggregate' : 'smallDB', 'pipeline' : aggregation_example, 'cursor' : {}})

The object so constructed has a complex structure and stores a lot of information on the aggregation query.

In [65]:
cursor3

{'explainVersion': '2',
 'stages': [{'$cursor': {'queryPlanner': {'namespace': 'dbB_MONGODB_test.smallDB',
     'indexFilterSet': False,
     'parsedQuery': {'courseID': {'$eq': 192}},
     'queryHash': '4959F3A1',
     'planCacheKey': '4959F3A1',
     'maxIndexedOrSolutionsReached': False,
     'maxIndexedAndSolutionsReached': False,
     'maxScansToExplodeReached': False,
     'winningPlan': {'queryPlan': {'stage': 'GROUP',
       'planNodeId': 2,
       'inputStage': {'stage': 'COLLSCAN',
        'planNodeId': 1,
        'filter': {'courseID': {'$eq': 192}},
        'direction': 'forward'}},
      'slotBasedPlan': {'slots': '$$RESULT=s10 env: { s1 = TimeZoneDatabase(Etc/GMT...Asia/Baku) (timeZoneDB), s2 = Nothing (SEARCH_META), s3 = 1699176894679 (NOW) }',
       'stages': '[2] mkobj s10 [_id = s9] true false \n[2] project [s9 = newObj ("sID", s6, "name", s7, "surname", s8)] \n[2] group [s6, s7, s8] [] \n[2] project [s8 = getField (s4, "lastName")] \n[2] project [s7 = getField (s4, 

The information on the execution time can be accessed as follows:

In [68]:
cursor3['stages'][0]['$cursor']['executionStats']['executionTimeMillis']

247
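
Since the execution time sits at different depths in the two explain outputs shown above, a small helper can hide the difference. This is a minimal sketch that only covers the two layouts seen in this notebook; the function name is hypothetical, the sample dictionaries are reduced to the relevant keys, and other server versions may return other layouts.

```python
def execution_time_ms(explain_doc):
    """Extract executionTimeMillis from an explain document, or None if absent."""
    # layout returned by cursor.explain() for a find query
    if 'executionStats' in explain_doc:
        return explain_doc['executionStats']['executionTimeMillis']
    # layout returned by the explain command for an aggregation
    for stage in explain_doc.get('stages', []):
        cursor_stage = stage.get('$cursor')
        if cursor_stage and 'executionStats' in cursor_stage:
            return cursor_stage['executionStats']['executionTimeMillis']
    return None


# Reduced examples reusing the values shown above.
find_explain = {'executionStats': {'executionTimeMillis': 167}}
aggregate_explain = {'stages': [{'$cursor': {'executionStats': {'executionTimeMillis': 247}}}]}
print(execution_time_ms(find_explain), execution_time_ms(aggregate_explain))
```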

<h4>
- chosen method: <code>time()</code>
</h4><br>
Although the <i>explain</i> utility provides a dedicated way to retrieve the query execution time within the PyMongo driver, I find that consistency across all of the DBMSs used in the project requires application of a uniform method to retrieve query execution times. Hence, the first methodology will be used here.