<center>
    <h2>Online learning platform database - MongoDB</h2>
    <h3>Performing the queries and storing the query execution times</h3>
</center>

<h3>Python - MongoDB interaction</h3>

Before running the queries, we import the required modules (the <i>PyMongo</i> driver and the <i>time</i> and <i>csv</i> modules) and connect to the MongoDB instance running in Docker. It is also convenient to store the database and the collections of interest in Python variables.

In [1]:
# import modules
import pymongo            # MongoDB driver
import time               # time-related functions to register query execution times
import csv                # read and write csv files

# create client object
client = pymongo.MongoClient(host = 'localhost', port = 27017, username = 'root', password = 'root')

# assign database to Python variable (db)
db = client.get_database('dbB_MONGODB_test')

# assign collections to Python variables
smallColl = db.smallDB
mediumColl = db.mediumDB
largeColl = db.largeDB
humongousColl = db.humongousDB

<h4>
Measuring and displaying the query execution time
</h4><br>
To measure the query execution time we can use the Python <a href = 'https://docs.python.org/3/library/time.html'><i>time</i></a> module and its <code>time</code> function. The function returns the system time in seconds as a floating point number, so the elapsed time can be resolved to fractions of a second. It is sufficient to assign the time before the query execution to one variable and the time after it to another: the difference between the two measures the query execution time. This figure includes the time for the Python driver to reach the MongoDB server and for the result to return, which is added to the execution time at the DBMS level. Note also that PyMongo's <code>find</code> returns a lazy cursor and does not contact the server until the cursor is iterated, so the example in the next cell only times the creation of the cursor; <code>aggregate</code>, used for the actual measurements, sends the command to the server when it is called.

In [2]:
import time
start = time.time()
cursor1 = smallColl.find({'courseID' : 192}, {'_id' : 0, 'studentID' : 1, 'firstName' : 1, 'lastName' : 1})
end = time.time()
print((end - start) * 1000)

0.17881393432617188
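
A minimal sketch of a reusable timing helper follows. It assumes <code>time.perf_counter</code> (a monotonic, high-resolution clock from the standard library) in place of <code>time.time</code>; the names <code>timed_ms</code> and <code>fake_find</code> are hypothetical and not part of the original notebook. The stand-in generator only imitates the laziness of a cursor, to show why the results should be consumed inside the timed interval.

```python
import time

def timed_ms(run):
    """Call run() and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = run()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def fake_find(n):
    """Hypothetical stand-in for a lazy cursor: no work happens until iteration."""
    for i in range(n):
        yield {'studentID': i}

# Creating the generator does no work, so this times almost nothing.
lazy_cursor, lazy_ms = timed_ms(lambda: fake_find(1000))

# Wrapping the call in list() forces iteration inside the timed interval.
docs, full_ms = timed_ms(lambda: list(fake_find(1000)))
print(len(docs), round(full_ms, 5))
```

With a real collection the same idea would read <code>timed_ms(lambda: list(smallColl.find(...)))</code>, so that fetching the documents is part of the measurement.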


<h3>Query the collections</h3>

I create a dictionary of lists for each of the four collections. In these dictionaries the keys are the query names and the values are lists of the 31 query execution times: after each execution I append its time to the corresponding list. Since the execution times are required in milliseconds, before appending them I multiply them by 1000 and round them to five decimal places.
These actions (for each of the four queries on each of the four collections) follow a standard sequence of steps. Each step is encapsulated in a notebook cell (so each query is performed 31 times using three notebook cells), as follows:
 - step 1: define the query, perform it for the first time, record timestamps before and after the execution, print the query result;
 - step 2: compute the execution time of the first query execution and store it in the corresponding dictionary list;
 - step 3: [thirty times] execute the query while recording timestamps before and after, compute the execution time and store it in the corresponding dictionary list. With MongoDB I don't need to reset the cursor before repeating the query.
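
The three steps above can be sketched as a single function. This is only an illustration under stated assumptions: <code>benchmark</code> and <code>fake_query</code> are hypothetical names, and <code>fake_query</code> stands in for a call such as <code>smallColl.aggregate(small_mongo1)</code> so that the block runs without a MongoDB server.

```python
import time

def benchmark(run_query, repetitions=30):
    """Return the first execution time followed by `repetitions` repeat times (ms, 5 decimals)."""
    times = []
    for _ in range(1 + repetitions):   # first execution plus the repetitions
        before = time.time()
        run_query()
        after = time.time()
        times.append(round((after - before) * 1000, 5))
    return times

# Hypothetical stand-in for a real query call
def fake_query():
    return sum(range(10_000))

timings = {'query1': benchmark(fake_query)}
print(len(timings['query1']))   # prints 31
```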

For each dataset, after having performed the four queries, I finally compute the mean of the 30 query executions from step 3. Together with the time of the first execution, this mean is stored in a new dictionary specific to the dataset. Originally, I intended to use these four new dictionaries to save the query execution times to a csv file for constructing histograms. I later decided to save all 31 recorded execution times and pass them to Microsoft Excel for processing.
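
As a sketch of the summary described above (first execution time plus the mean of the remaining ones), the standard library's <code>statistics.fmean</code> could be used instead of a hand-written mean. The helper name <code>summarize</code> and the sample numbers are hypothetical; they are not measured values.

```python
from statistics import fmean

def summarize(times):
    """Return [first execution time, mean of the remaining times], rounded to 5 decimals."""
    first, rest = times[0], times[1:]
    return [first, round(fmean(rest), 5)]

# Hypothetical timings in milliseconds (not measured values)
sample = {'query1': [12.0, 2.0, 4.0, 3.0]}
summary = {name: summarize(times) for name, times in sample.items()}
print(summary)   # prints {'query1': [12.0, 3.0]}
```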

In [3]:
smallDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
mediumDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
largeDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
humongousDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}

In [4]:
# mean function
def mean(aList):
    n = len(aList)
    total = 0          # named 'total' to avoid shadowing the built-in sum()
    for value in aList:
        total += value
    return total / n

<h3>Collection with 250k documents</h3><br>
I start with the smallest collection.

<h4>Query 1</h4>

In [5]:
# step 1
small_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
small_query1 = smallColl.aggregate(small_mongo1)
after = time.time()

for result in small_query1:
    print(result['_id']['name'], result['_id']['surname'])

Arthur Laroche
Émile Nicolas
Patrícia Leite
Vigilija Gaižauskas
Ingeborg Amundsen
Sarah Lara
Casandra Arenas
Narciso Ferrán
Cathrine Lie
Custodia Hidalgo
Ana Narušis
Nath Nicolas
Ledün Soylu


In [6]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query1'].append(round(msec_duration, 5))

In [7]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [8]:
# step 1
small_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
small_query2 = smallColl.aggregate(small_mongo2)
after = time.time()

for result in small_query2:
    print(result['_id']['course'])

Econometrics: Methods and Applications
Exploratory Data Analysis
Introduction to Probability and Data with R
Python and Statistics for Financial Analysis
Basic Statistics
Introduction to Statistics
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Bayesian Statistics: From Concept to Data Analysis


In [9]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query2'].append(round(msec_duration, 5))

In [10]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [11]:
# step 1
small_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]

before = time.time()
small_query3 = smallColl.aggregate(small_mongo3)
after = time.time()

for result in small_query3:
    print(result['IDcount'])

838


In [12]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query3'].append(round(msec_duration, 5))

In [13]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [14]:
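# step 1
# note: after $group the documents only contain '_id', so the final $sort on 'lastName'
# does not order the results by surname; the key '_id.surname' would be needed for that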
small_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]

before = time.time()
small_query4 = smallColl.aggregate(small_mongo4)
after = time.time()

for result in small_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Raghav Sura from North Korea
Lynda Reynolds from Korea
Cathrine Lie from South Korea


In [15]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query4'].append(round(msec_duration, 5))

In [16]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query4'].append(round(msec_duration, 5))

In [17]:
smallDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in smallDict:
    smallDataset[key].append(smallDict[key][0])
    mean30 = mean(smallDict[key][1 : 31])
    smallDataset[key].append(round(mean30, 5))
smallDataset

{'query1': [1235.01515, 135.87454],
 'query2': [192.60073, 169.13517],
 'query3': [178.5481, 165.27492],
 'query4': [172.63603, 157.43516]}
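
The larger collections are queried with the same pipelines, so they could be defined once and reused. The following is a sketch under stated assumptions: <code>PIPELINES</code>, <code>run_all</code> and <code>FakeCollection</code> are hypothetical names, only the first two pipelines are shown, and <code>FakeCollection</code> merely stands in for a PyMongo collection so that the block runs without a server.

```python
PIPELINES = {
    'query1': [
        {'$match': {'courseID': 192}},
        {'$group': {'_id': {'name': '$firstName', 'surname': '$lastName'}}},
    ],
    'query2': [
        {'$match': {'discipline': 'statistics', 'courseYear': 2022}},
        {'$group': {'_id': {'ID': '$courseID', 'course': '$courseName'}}},
        {'$project': {'_id.course': 1}},
    ],
}

class FakeCollection:
    """Hypothetical stand-in exposing an aggregate(pipeline) method like a collection."""
    def __init__(self, name):
        self.name = name
        self.calls = []

    def aggregate(self, pipeline):
        self.calls.append(pipeline)   # record the pipeline instead of contacting a server
        return iter([])

def run_all(collections, pipelines):
    """Run every pipeline against every collection; return the (collection, query) pairs run."""
    done = []
    for coll_name, coll in collections.items():
        for query_name, pipeline in pipelines.items():
            coll.aggregate(pipeline)
            done.append((coll_name, query_name))
    return done

collections = {'small': FakeCollection('small'), 'medium': FakeCollection('medium')}
pairs = run_all(collections, PIPELINES)
print(pairs)
```

Keeping one definition per query would also guarantee that every collection is measured with exactly the same pipeline.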

<h3>Collection with 500k documents</h3><br>

<h4>Query 1</h4>

In [18]:
# step 1
medium_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
medium_query1 = mediumColl.aggregate(medium_mongo1)
after = time.time()

for result in medium_query1:
    print(result['_id']['name'], result['_id']['surname'])

Karl Christensen
Christl Henschel
Patrícia Leite
Débora Vaz
Yuvaan Dara
Cathrine Lie
Narciso Ferrán
Ana Narušis
Nath Nicolas
Émile Nicolas
Arthur Laroche
Nedas Naujokas
Vigilija Gaižauskas
Ingeborg Amundsen
Sarah Lara
Joris Kavaliauskas
Casandra Arenas
Custodia Hidalgo
Miguel Real
Ledün Soylu


In [19]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query1'].append(round(msec_duration, 5))

In [20]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [21]:
# step 1
medium_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
medium_query2 = mediumColl.aggregate(medium_mongo2)
after = time.time()

for result in medium_query2:
    print(result['_id']['course'])

Basic Statistics
Introduction to Statistics
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Bayesian Statistics: From Concept to Data Analysis
Econometrics: Methods and Applications
Exploratory Data Analysis
Introduction to Probability and Data with R
Python and Statistics for Financial Analysis


In [22]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query2'].append(round(msec_duration, 5))

In [23]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [24]:
# step 1
medium_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]

before = time.time()
medium_query3 = mediumColl.aggregate(medium_mongo3)
after = time.time()

for result in medium_query3:
    print(result['IDcount'])

1698


In [25]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query3'].append(round(msec_duration, 5))

In [26]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [27]:
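# step 1
# note: after $group the documents only contain '_id', so the final $sort on 'lastName'
# does not order the results by surname; the key '_id.surname' would be needed for that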
medium_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]

before = time.time()
medium_query4 = mediumColl.aggregate(medium_mongo4)
after = time.time()

for result in medium_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Lynda Reynolds from Korea
Ninthe Horrocks from Noord-Korea
Cathrine Lie from South Korea
Miguel Real from República de Corea
Raghav Sura from North Korea


In [28]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query4'].append(round(msec_duration, 5))

In [29]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query4'].append(round(msec_duration, 5))

In [30]:
mediumDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in mediumDict:
    mediumDataset[key].append(mediumDict[key][0])
    mean30 = mean(mediumDict[key][1 : 31])
    mediumDataset[key].append(round(mean30, 5))
mediumDataset

{'query1': [2975.96407, 270.70076],
 'query2': [338.65976, 330.85795],
 'query3': [307.88112, 306.66293],
 'query4': [367.66911, 294.13149]}

<h3>Collection with 750k documents</h3><br>

<h4>Query 1</h4>

In [31]:
# step 1
large_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
large_query1 = largeColl.aggregate(large_mongo1)
after = time.time()

for result in large_query1:
    print(result['_id']['name'], result['_id']['surname'])

Arthur Laroche
Nath Nicolas
Émile Nicolas
Ana Narušis
Narciso Ferrán
Cathrine Lie
Brian Thompson
Yuvaan Dara
Débora Vaz
Patrícia Leite
Dorita Abella
Christl Henschel
Collin Heerkens
Liliana Flaiano
Ledün Soylu
Karl Christensen
Custodia Hidalgo
Miguel Real
Urvi Dani
Casandra Arenas
Özkutlu Gül
Sarah Lara
Ingeborg Amundsen
Joris Kavaliauskas
Vigilija Gaižauskas
Nedas Naujokas


In [32]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query1'].append(round(msec_duration, 5))

In [33]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [34]:
# step 1
large_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
large_query2 = largeColl.aggregate(large_mongo2)
after = time.time()

for result in large_query2:
    print(result['_id']['course'])

Basic Statistics
Introduction to Statistics
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Bayesian Statistics: From Concept to Data Analysis
Econometrics: Methods and Applications
Exploratory Data Analysis
Introduction to Probability and Data with R
Python and Statistics for Financial Analysis


In [35]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query2'].append(round(msec_duration, 5))

In [36]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [37]:
# step 1
large_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]

before = time.time()
large_query3 = largeColl.aggregate(large_mongo3)
after = time.time()

for result in large_query3:
    print(result['IDcount'])

2628


In [38]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query3'].append(round(msec_duration, 5))

In [39]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [40]:
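# step 1
# note: after $group the documents only contain '_id', so the final $sort on 'lastName'
# does not order the results by surname; the key '_id.surname' would be needed for that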
large_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]

before = time.time()
large_query4 = largeColl.aggregate(large_mongo4)
after = time.time()

for result in large_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Tere Castells from República Popular Democrática de Corea
Lynda Reynolds from Korea
Ninthe Horrocks from Noord-Korea
Cathrine Lie from South Korea
Miguel Real from República de Corea
Raghav Sura from North Korea


In [41]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query4'].append(round(msec_duration, 5))

In [42]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query4'].append(round(msec_duration, 5))

In [43]:
largeDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in largeDict:
    largeDataset[key].append(largeDict[key][0])
    mean30 = mean(largeDict[key][1 : 31])
    largeDataset[key].append(round(mean30, 5))
largeDataset

{'query1': [5357.55205, 412.26209],
 'query2': [542.06991, 495.13317],
 'query3': [650.81215, 442.50081],
 'query4': [473.5539, 472.96095]}

<h3>Collection with 1M documents</h3><br>

<h4>Query 1</h4>

In [44]:
# step 1
humongous_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
humongous_query1 = humongousColl.aggregate(humongous_mongo1)
after = time.time()

for result in humongous_query1:
    print(result['_id']['name'], result['_id']['surname'])

Patrícia Leite
Dorita Abella
Christl Henschel
Eduardo Rezende
Giuseppina Scarfoglio
Débora Vaz
Torsten Schulz
Arthur Laroche
Émile Nicolas
Vigilija Gaižauskas
Kristen Webb
Nedas Naujokas
Mamen Teruel
Ingeborg Amundsen
David Miranda
Casandra Arenas
Özkutlu Gül
Ledün Soylu
Liliana Flaiano
Melania Savorgnan
Collin Heerkens
Brian Thompson
Yuvaan Dara
Ana Narušis
Cathrine Lie
Narciso Ferrán
Nath Nicolas
Sarah Lara
Joris Kavaliauskas
Shaan Raju
Custodia Hidalgo
Miguel Real
Urvi Dani
Karl Christensen
Finn Karlsen


In [45]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query1'].append(round(msec_duration, 5))

In [46]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [47]:
# step 1
humongous_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
humongous_query2 = humongousColl.aggregate(humongous_mongo2)
after = time.time()

for result in humongous_query2:
    print(result['_id']['course'])

Basic Statistics
Introduction to Statistics
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Bayesian Statistics: From Concept to Data Analysis
Econometrics: Methods and Applications
Exploratory Data Analysis
Introduction to Probability and Data with R
Python and Statistics for Financial Analysis


In [48]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query2'].append(round(msec_duration, 5))

In [49]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [50]:
# step 1
humongous_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]

before = time.time()
humongous_query3 = humongousColl.aggregate(humongous_mongo3)
after = time.time()

for result in humongous_query3:
    print(result['IDcount'])

3498


In [51]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query3'].append(round(msec_duration, 5))

In [52]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [53]:
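# step 1
# note: after $group the documents only contain '_id', so the final $sort on 'lastName'
# does not order the results by surname; the key '_id.surname' would be needed for that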
humongous_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]

before = time.time()
humongous_query4 = humongousColl.aggregate(humongous_mongo4)
after = time.time()

for result in humongous_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Raghav Sura from North Korea
Tere Castells from República Popular Democrática de Corea
Ninthe Horrocks from Noord-Korea
Miguel Real from República de Corea
Leila Gailys from Korea
Cathrine Lie from South Korea
Debra Shaw from Korea
Lynda Reynolds from Korea


In [54]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query4'].append(round(msec_duration, 5))

In [55]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query4'].append(round(msec_duration, 5))

In [56]:
humongousDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in humongousDict:
    humongousDataset[key].append(humongousDict[key][0])
    mean30 = mean(humongousDict[key][1 : 31])
    humongousDataset[key].append(round(mean30, 5))
humongousDataset

{'query1': [5710.1388, 527.2241],
 'query2': [693.3949, 679.81941],
 'query3': [621.69409, 599.94771],
 'query4': [664.572, 573.48735]}

In [57]:
with open('mongo_tests.csv', 'w', newline = '') as mongo_tests:
    writer = csv.writer(mongo_tests, delimiter = ',')
    keys = list(smallDict.keys())
    limit = len(smallDict['query1'])

    # header row with the query names
    writer.writerow(keys)
    # one labelled block per dataset: s = small, m = medium, l = large, h = humongous
    datasets = [('s', smallDict), ('m', mediumDict), ('l', largeDict), ('h', humongousDict)]
    for label, timesDict in datasets:
        writer.writerow([label])
        for i in range(0, limit):
            writer.writerow([timesDict[k][i] for k in keys])
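
An alternative layout, sketched here under stated assumptions, writes one row per measurement with explicit <code>dataset</code>, <code>query</code> and <code>run</code> columns, which can be easier to filter or pivot in a spreadsheet. The function name <code>write_long_csv</code>, the column names and the sample numbers are hypothetical; an in-memory buffer replaces the file so the block runs on its own.

```python
import csv
import io

def write_long_csv(handle, datasets):
    """Write one row per measurement: dataset, query, run index, milliseconds."""
    writer = csv.writer(handle)
    writer.writerow(['dataset', 'query', 'run', 'msec'])
    for dataset_name, queries in datasets.items():
        for query_name, times in queries.items():
            for run, msec in enumerate(times):
                writer.writerow([dataset_name, query_name, run, msec])

# Hypothetical timings in milliseconds (not measured values)
datasets = {'small': {'query1': [10.0, 2.5]}, 'medium': {'query1': [20.0, 5.0]}}
buffer = io.StringIO()
write_long_csv(buffer, datasets)
rows = buffer.getvalue().splitlines()
print(rows[0])     # prints dataset,query,run,msec
print(len(rows))   # prints 5
```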