<center>
    <h2>Online learning platform database - MongoDB</h2>
    <h3>Performing the queries and storing the query execution times</h3>
</center>

<h3>Python - MongoDB interaction</h3>

Before performing the queries, we import the required modules (the <i>PyMongo</i> driver and the <i>time</i> and <i>csv</i> modules) and establish a connection to the MongoDB instance running in Docker. It is also convenient to store the database and the collections of interest in Python variables.

In [3]:
# import modules
import pymongo            # MongoDB driver
import time               # time-related functions to register query execution times
import csv                # read and write csv files

# create client object
client = pymongo.MongoClient(host = 'localhost', port = 27017, username = 'root', password = 'root')

# assign database to Python variable (db)
db = client.get_database('dbB_MONGODB_test')

# assign collections to Python variables
smallColl = db.smallDB
mediumColl = db.mediumDB
largeColl = db.largeDB
humongousColl = db.humongousDB

<h4>
Measuring and displaying the query execution time
</h4><br>
To measure the query execution time we can use the <code>time</code> function of the Python <a href = 'https://docs.python.org/3/library/time.html'><i>time</i></a> module. The function returns the current system time as a floating-point number of seconds, so intervals much shorter than a second can be measured. It is sufficient to store the time before the query execution in one variable and the time after it in another: their difference is the elapsed time. This is the time observed from the client side, so it includes the round trip between the Python driver and the MongoDB server in addition to the execution time at the DBMS level. Note also that <code>find</code> returns a lazy cursor and sends nothing to the server until the cursor is iterated, so in the cell below the measured interval covers only the creation of the cursor object; <code>aggregate</code>, which is used for the actual tests, runs the command immediately and returns a cursor over the first batch of results.

In [60]:
import time
start = time.time()
cursor1 = smallColl.find({'courseID' : 192}, {'_id' : 0, 'studentID' : 1, 'firstName' : 1, 'lastName' : 1})
end = time.time()
print((end - start) * 1000)

0.1888275146484375
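
For short intervals, <code>time.perf_counter</code> is documented as the clock with the highest available resolution for measuring a short duration, whereas <code>time.time</code> follows the wall clock. The following is a minimal sketch of a timing helper built on it, not the method used for the measurements in this notebook; the name <code>timed_ms</code> and the stand-in workload are hypothetical.

```python
import time

def timed_ms(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    end = time.perf_counter()
    return result, (end - start) * 1000

# stand-in workload (hypothetical); with PyMongo the callable could be, e.g.,
# lambda: list(smallColl.find({'courseID': 192})), so that the cursor is consumed
result, elapsed = timed_ms(sum, range(100_000))
print(result, round(elapsed, 5))
```

Wrapping the call in a function keeps the two clock readings as close as possible to the operation being measured.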


<h3>Query the collections</h3>

I create a dictionary of lists for each of the four collections. In these dictionaries the keys are the query names and the values are lists that will hold the 31 execution times of each query: after every run I append the execution time of that run to the corresponding list. Since execution times are required in milliseconds, before appending them I multiply them by 1000 and round them to five decimal places.
These actions (for each of the four queries on each of the four collections) follow a standard sequence of steps. Each step is encapsulated in a notebook cell (so each query is performed 31 times using three notebook cells), as follows:
 - step 1: define the query, run it for the first time, take timestamps immediately before and after the execution, and print the query result;
 - step 2: compute the execution time of the first run and store it in the corresponding dictionary list;
 - step 3: [thirty times] run the query between two timestamps, compute the execution time and store it in the corresponding dictionary list. With MongoDB there is no need to reset a cursor before repeating the query.

For each dataset, after having performed the four queries, I compute the mean of the thirty execution times from step 3. Together with the time of the first execution, this mean is stored in a new dictionary, specific to the dataset. The four new dictionaries are finally saved to a csv file, to be used for constructing histograms.
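
The three steps could also be folded into one helper. The sketch below is only an illustration under that assumption: the function name <code>collect_times</code> and the stand-in workload are hypothetical, and the notebook itself keeps the steps in separate cells.

```python
import time

def collect_times(run_query, runs=31, decimals=5):
    """Call run_query `runs` times; return the execution times in milliseconds."""
    times = []
    for _ in range(runs):
        before = time.time()
        run_query()
        after = time.time()
        times.append(round((after - before) * 1000, decimals))
    return times

# stand-in for a real query (hypothetical); in this notebook the callable
# would be something like lambda: smallColl.aggregate(small_mongo1)
demo_times = collect_times(lambda: sorted(range(1000)))
print(len(demo_times))
```

The first element of the returned list corresponds to steps 1 and 2, the remaining thirty to step 3.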

In [15]:
smallDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
mediumDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
largeDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
humongousDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}

In [10]:
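# mean function
def mean(aList):
    # arithmetic mean of a non-empty list of numbers
    total = 0                 # named 'total' to avoid shadowing the built-in sum()
    for value in aList:
        total += value
    return total / len(aList)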

<h3>Collection with 250k documents</h3><br>
I start with the smallest collection.

<h4>Query 1</h4>

In [110]:
# step 1
small_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
small_query1 = smallColl.aggregate(small_mongo1)
after = time.time()

for result in small_query1:
    print(result['_id']['name'], result['_id']['surname'])

Custodia Hidalgo
Arthur Laroche
Sarah Lara
Narciso Ferrán
Patrícia Leite
Casandra Arenas
Cathrine Lie
Vigilija Gaižauskas
Ana Narušis
Ledün Soylu
Nath Nicolas
Émile Nicolas
Ingeborg Amundsen
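
To make explicit what the pipeline computes: <code>$match</code> keeps the documents of course 192, and <code>$group</code> with a compound <code>_id</code> returns one result per distinct (first name, last name) pair. The pure-Python sketch below mirrors these semantics on a few invented documents; the sample data is hypothetical and the code says nothing about how MongoDB executes the pipeline internally.

```python
# invented sample documents, shaped like those in the collection (hypothetical)
docs = [
    {'courseID': 192, 'firstName': 'Sarah', 'lastName': 'Lara'},
    {'courseID': 192, 'firstName': 'Sarah', 'lastName': 'Lara'},    # duplicate pair
    {'courseID': 192, 'firstName': 'Nath', 'lastName': 'Nicolas'},
    {'courseID': 7, 'firstName': 'Arthur', 'lastName': 'Laroche'},  # other course
]

# $match: keep only the documents of course 192
matched = [d for d in docs if d['courseID'] == 192]

# $group: one output document per distinct (firstName, lastName) pair
pairs = {(d['firstName'], d['lastName']) for d in matched}
grouped = [{'_id': {'name': name, 'surname': surname}} for name, surname in pairs]

for result in grouped:
    print(result['_id']['name'], result['_id']['surname'])
```

As with the real query, the order of the printed names is not defined, because neither a set nor <code>$group</code> guarantees one.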


In [17]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query1'].append(round(msec_duration, 5))

In [18]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [30]:
# step 1
small_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
small_query2 = smallColl.aggregate(small_mongo2)
after = time.time()

for result in small_query2:
    print(result['_id']['course'])

Foundations: Data, Data, Everywhere
Basic Statistics
Econometrics: Methods and Applications
Exploratory Data Analysis
Bayesian Statistics: From Concept to Data Analysis
Understanding Clinical Research: Behind the Statistics
Introduction to Statistics
Python and Statistics for Financial Analysis
Introduction to Probability and Data with R


In [31]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query2'].append(round(msec_duration, 5))

In [32]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [50]:
# step 1
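# note: in '$group' the value '_id' (without '$') is a constant string, so all matched documents fall into a single group
small_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]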

before = time.time()
small_query3 = smallColl.aggregate(small_mongo3)
after = time.time()

for result in small_query3:
    print(result['IDcount'])

838


In [51]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query3'].append(round(msec_duration, 5))

In [52]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [45]:
# step 1
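# note: after '$group' the documents contain only '_id', so the final '$sort' on 'lastName'
# refers to a field that no longer exists and does not order the output;
# sorting by surname would require the key '_id.surname'
small_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]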

before = time.time()
small_query4 = smallColl.aggregate(small_mongo4)
after = time.time()

for result in small_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Lynda Reynolds from Korea
Cathrine Lie from South Korea
Raghav Sura from North Korea


In [46]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query4'].append(round(msec_duration, 5))

In [47]:
# step 3
for i in range(0, 30):
    before = time.time()
    smallColl.aggregate(small_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query4'].append(round(msec_duration, 5))

In [105]:
smallDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in smallDict:
    smallDataset[key].append(smallDict[key][0])
    mean30 = mean(smallDict[key][1 : 31])
    smallDataset[key].append(round(mean30, 5))
smallDataset

{'query1': [192.01708, 140.77709],
 'query2': [312.48307, 194.26864],
 'query3': [266.86788, 195.87686],
 'query4': [208.28009, 155.97881]}
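
The same summary is computed for every dataset, so it could be written once as a function. The following is a minimal sketch under that assumption; the name <code>summarise</code> and the demo timings are hypothetical, and <code>statistics.mean</code> from the standard library stands in for the hand-written <code>mean</code> above.

```python
import statistics

def summarise(timesDict):
    """Map each query name to [first execution time, mean of the remaining ones]."""
    summary = {}
    for key, times in timesDict.items():
        first, rest = times[0], times[1:]
        summary[key] = [first, round(statistics.mean(rest), 5)]
    return summary

# invented timings in milliseconds (hypothetical, not measured values)
demo = {'query1': [10.0, 4.0, 6.0], 'query2': [8.0, 2.0, 2.0, 5.0]}
print(summarise(demo))
```

With lists of 31 values, <code>times[1:]</code> selects the same thirty elements as the slice <code>[1 : 31]</code> used in the cell above.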

<h3>Collection with 500k documents</h3><br>

<h4>Query 1</h4>

In [109]:
# step 1
medium_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
medium_query1 = mediumColl.aggregate(medium_mongo1)
after = time.time()

for result in medium_query1:
    print(result['_id']['name'], result['_id']['surname'])

Christl Henschel
Karl Christensen
Nath Nicolas
Joris Kavaliauskas
Yuvaan Dara
Débora Vaz
Casandra Arenas
Ingeborg Amundsen
Ana Narušis
Émile Nicolas
Ledün Soylu
Miguel Real
Arthur Laroche
Sarah Lara
Narciso Ferrán
Custodia Hidalgo
Patrícia Leite
Nedas Naujokas
Vigilija Gaižauskas
Cathrine Lie


In [55]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query1'].append(round(msec_duration, 5))

In [56]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [58]:
# step 1
medium_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
medium_query2 = mediumColl.aggregate(medium_mongo2)
after = time.time()

for result in medium_query2:
    print(result['_id']['course'])

Bayesian Statistics: From Concept to Data Analysis
Understanding Clinical Research: Behind the Statistics
Python and Statistics for Financial Analysis
Exploratory Data Analysis
Introduction to Statistics
Foundations: Data, Data, Everywhere
Introduction to Probability and Data with R
Basic Statistics
Econometrics: Methods and Applications


In [59]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query2'].append(round(msec_duration, 5))

In [60]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [62]:
# step 1
medium_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]

before = time.time()
medium_query3 = mediumColl.aggregate(medium_mongo3)
after = time.time()

for result in medium_query3:
    print(result['IDcount'])

1698


In [63]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query3'].append(round(msec_duration, 5))

In [64]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [66]:
# step 1
# note: the final '$sort' on 'lastName' has no effect after '$group' (the surname is now in '_id.surname')
medium_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]

before = time.time()
medium_query4 = mediumColl.aggregate(medium_mongo4)
after = time.time()

for result in medium_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Raghav Sura from North Korea
Cathrine Lie from South Korea
Lynda Reynolds from Korea
Miguel Real from República de Corea
Ninthe Horrocks from Noord-Korea


In [67]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query4'].append(round(msec_duration, 5))

In [68]:
# step 3
for i in range(0, 30):
    before = time.time()
    mediumColl.aggregate(medium_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query4'].append(round(msec_duration, 5))

In [104]:
mediumDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in mediumDict:
    mediumDataset[key].append(mediumDict[key][0])
    mean30 = mean(mediumDict[key][1 : 31])
    mediumDataset[key].append(round(mean30, 5))
mediumDataset

{'query1': [3829.1719, 290.41341],
 'query2': [459.23471, 389.29562],
 'query3': [444.53502, 436.27919],
 'query4': [355.16381, 298.19805]}

<h3>Collection with 750k documents</h3><br>

<h4>Query 1</h4>

In [70]:
# step 1
large_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
large_query1 = largeColl.aggregate(large_mongo1)
after = time.time()

for result in large_query1:
    print(result['_id']['name'], result['_id']['surname'])

Cathrine Lie
Karl Christensen
Christl Henschel
Vigilija Gaižauskas
Custodia Hidalgo
Nedas Naujokas
Patrícia Leite
Narciso Ferrán
Sarah Lara
Joris Kavaliauskas
Miguel Real
Arthur Laroche
Yuvaan Dara
Ledün Soylu
Urvi Dani
Özkutlu Gül
Dorita Abella
Liliana Flaiano
Ana Narušis
Émile Nicolas
Ingeborg Amundsen
Collin Heerkens
Casandra Arenas
Brian Thompson
Nath Nicolas
Débora Vaz


In [71]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query1'].append(round(msec_duration, 5))

In [72]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [74]:
# step 1
large_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
large_query2 = largeColl.aggregate(large_mongo2)
after = time.time()

for result in large_query2:
    print(result['_id']['course'])

Bayesian Statistics: From Concept to Data Analysis
Understanding Clinical Research: Behind the Statistics
Python and Statistics for Financial Analysis
Exploratory Data Analysis
Introduction to Statistics
Foundations: Data, Data, Everywhere
Introduction to Probability and Data with R
Basic Statistics
Econometrics: Methods and Applications


In [75]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query2'].append(round(msec_duration, 5))

In [76]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [78]:
# step 1
large_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]

before = time.time()
large_query3 = largeColl.aggregate(large_mongo3)
after = time.time()

for result in large_query3:
    print(result['IDcount'])

2628


In [79]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query3'].append(round(msec_duration, 5))

In [80]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [82]:
# step 1
# note: the final '$sort' on 'lastName' has no effect after '$group' (the surname is now in '_id.surname')
large_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]

before = time.time()
large_query4 = largeColl.aggregate(large_mongo4)
after = time.time()

for result in large_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Cathrine Lie from South Korea
Ninthe Horrocks from Noord-Korea
Miguel Real from República de Corea
Lynda Reynolds from Korea
Tere Castells from República Popular Democrática de Corea
Raghav Sura from North Korea


In [83]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query4'].append(round(msec_duration, 5))

In [84]:
# step 3
for i in range(0, 30):
    before = time.time()
    largeColl.aggregate(large_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query4'].append(round(msec_duration, 5))

In [103]:
largeDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in largeDict:
    largeDataset[key].append(largeDict[key][0])
    mean30 = mean(largeDict[key][1 : 31])
    largeDataset[key].append(round(mean30, 5))
largeDataset

{'query1': [9782.52077, 633.73331],
 'query2': [653.68295, 672.64531],
 'query3': [514.96506, 592.96764],
 'query4': [502.33507, 616.69949]}

<h3>Collection with 1M documents</h3><br>

<h4>Query 1</h4>

In [108]:
# step 1
humongous_mongo1 = [{'$match' : {'courseID' : 192}}, {'$group' : {'_id' : {'name' : '$firstName', 'surname' : '$lastName'}}}]

before = time.time()
humongous_query1 = humongousColl.aggregate(humongous_mongo1)
after = time.time()

for result in humongous_query1:
    print(result['_id']['name'], result['_id']['surname'])

Kristen Webb
Arthur Laroche
Eduardo Rezende
Nedas Naujokas
Custodia Hidalgo
David Miranda
Patrícia Leite
Narciso Ferrán
Cathrine Lie
Vigilija Gaižauskas
Sarah Lara
Yuvaan Dara
Melania Savorgnan
Brian Thompson
Collin Heerkens
Mamen Teruel
Ana Narušis
Shaan Raju
Ledün Soylu
Urvi Dani
Giuseppina Scarfoglio
Miguel Real
Finn Karlsen
Torsten Schulz
Özkutlu Gül
Dorita Abella
Christl Henschel
Joris Kavaliauskas
Nath Nicolas
Karl Christensen
Débora Vaz
Ingeborg Amundsen
Casandra Arenas
Liliana Flaiano
Émile Nicolas


In [88]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query1'].append(round(msec_duration, 5))

In [89]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo1)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [91]:
# step 1
humongous_mongo2 = [{'$match' : {'discipline' : 'statistics', 'courseYear' : 2022}}, {'$group' : {'_id' : {'ID' : '$courseID', 'course' : '$courseName'}}}, {'$project' : {'_id.course' : 1}}]

before = time.time()
humongous_query2 = humongousColl.aggregate(humongous_mongo2)
after = time.time()

for result in humongous_query2:
    print(result['_id']['course'])

Bayesian Statistics: From Concept to Data Analysis
Understanding Clinical Research: Behind the Statistics
Python and Statistics for Financial Analysis
Exploratory Data Analysis
Introduction to Statistics
Foundations: Data, Data, Everywhere
Introduction to Probability and Data with R
Basic Statistics
Econometrics: Methods and Applications


In [92]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query2'].append(round(msec_duration, 5))

In [93]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo2)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [95]:
# step 1
humongous_mongo3 = [{'$match' : {'discipline': 'maths', 'materialType' : 'lecture slides', 'email': {'$regex': '@gmail.com'}}}, {'$group' : {'_id': '_id', 'IDcount' : {'$count': {}}}}]

before = time.time()
humongous_query3 = humongousColl.aggregate(humongous_mongo3)
after = time.time()

for result in humongous_query3:
    print(result['IDcount'])

3498


In [96]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query3'].append(round(msec_duration, 5))

In [97]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo3)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [99]:
# step 1
# note: the final '$sort' on 'lastName' has no effect after '$group' (the surname is now in '_id.surname')
humongous_mongo4 = [{'$match' : {'discipline': 'psychology', 'country' : {'$regex': 'orea'}, 'dateOfBirth': {'$regex': '^1'}}}, {'$project' : {'_id': 0, 'firstName': 1, 'lastName': 1, 'country': 1}}, {'$group' : {'_id': {'name': '$firstName', 'surname': '$lastName', 'country': '$country'}}}, {'$sort' : {'lastName' : 1}}]

before = time.time()
humongous_query4 = humongousColl.aggregate(humongous_mongo4)
after = time.time()

for result in humongous_query4:
    print(result['_id']['name'], result['_id']['surname'], 'from', result['_id']['country'])

Raghav Sura from North Korea
Tere Castells from República Popular Democrática de Corea
Cathrine Lie from South Korea
Leila Gailys from Korea
Miguel Real from República de Corea
Ninthe Horrocks from Noord-Korea
Debra Shaw from Korea
Lynda Reynolds from Korea


In [100]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query4'].append(round(msec_duration, 5))

In [101]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongousColl.aggregate(humongous_mongo4)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query4'].append(round(msec_duration, 5))

In [106]:
humongousDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in humongousDict:
    humongousDataset[key].append(humongousDict[key][0])
    mean30 = mean(humongousDict[key][1 : 31])
    humongousDataset[key].append(round(mean30, 5))
humongousDataset

{'query1': [5834.81503, 731.05908],
 'query2': [747.70832, 981.6229],
 'query3': [826.83492, 806.35171],
 'query4': [737.1347, 830.114]}

In [107]:
# one entry per dataset: s = small, m = medium, l = large, h = humongous
datasets = {'s' : smallDataset, 'm' : mediumDataset, 'l' : largeDataset, 'h' : humongousDataset}

with open('mongo_tests.csv', 'w', newline = '') as mongo_tests:
    writer = csv.writer(mongo_tests, delimiter = ',')
    keys = list(smallDataset.keys())
    limit = len(smallDataset['query1'])

    writer.writerow(keys)                  # header row: query names
    for label, dataset in datasets.items():
        writer.writerow([label])           # label row identifying the dataset
        for i in range(0, limit):
            # row 0: first execution time; row 1: mean of the following 30
            writer.writerow([dataset[k][i] for k in keys])
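
When the histograms are built, the file has to be parsed back into one dictionary per dataset. The sketch below is one possible reader for the layout written by the cell above (a header row, then for each dataset a one-field label row followed by rows of values); the function name <code>read_tests</code> and the sample numbers are hypothetical.

```python
import csv
import io

def read_tests(csv_file):
    """Parse the csv layout into {label: {query name: [first, mean]}}."""
    reader = csv.reader(csv_file)
    keys = next(reader)                   # header row with the query names
    data, label = {}, None
    for row in reader:
        if len(row) == 1:                 # a one-field row is a dataset label
            label = row[0]
            data[label] = {k: [] for k in keys}
        else:                             # a row of values, one per query
            for k, value in zip(keys, row):
                data[label][k].append(float(value))
    return data

# small in-memory example with invented numbers (hypothetical)
sample = io.StringIO("query1,query2\ns\n10.0,8.0\n5.0,3.0\nm\n20.0,9.0\n6.0,4.0\n")
parsed = read_tests(sample)
print(parsed['s']['query1'])
```

For the real file, the function would be called on a handle opened with <code>open('mongo_tests.csv', newline = '')</code>.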