<center>
    <h2>Online learning platform database - MySQL</h2>
    <h3>Performing the queries and storing the queries execution time</h3>
</center>

<br/>
<h3>Python - MySQL interaction</h3>

Before performing the queries, I import the required modules (the MySQL <i>connector</i>, the <i>time</i> module and the <i>csv</i> module) and establish a connection with the running MySQL instance.

In [1]:
# import modules
import mysql.connector as connector   # MySQL driver
import time                           # time-related functions to register query execution times
import csv                            # read and write csv files

# create connection object
conn = connector.connect(host = '127.0.0.1', port = '3306', user = 'root', password = 'X4mPpd3V', database = 'dbB_MYSQL_test')

# create cursor object
cursor = conn.cursor()
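
As a hedged alternative to writing the credentials directly in the cell, the connection arguments can be collected in a dictionary and read from environment variables. This is only a sketch: the helper name `connection_settings` and the variable names (`MYSQL_HOST`, `MYSQL_PASSWORD`, and so on) are assumptions, not something the notebook defines. The defaults mirror the values used above, except for the password.

```python
import os


def connection_settings(env=os.environ):
    """Build keyword arguments for connector.connect() from environment variables.

    The variable names are hypothetical; the defaults mirror the values
    used in the cell above, except for the password.
    """
    return {
        'host': env.get('MYSQL_HOST', '127.0.0.1'),
        'port': int(env.get('MYSQL_PORT', '3306')),
        'user': env.get('MYSQL_USER', 'root'),
        'password': env.get('MYSQL_PASSWORD', ''),
        'database': env.get('MYSQL_DATABASE', 'dbB_MYSQL_test'),
    }


# a plain dictionary stands in for the real environment in this demonstration
settings = connection_settings({'MYSQL_PASSWORD': 'secret'})
print(settings['host'], settings['port'])   # 127.0.0.1 3306

# with a running MySQL instance (not executed here):
# conn = connector.connect(**connection_settings())
```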

<h3>Query the datasets</h3>

I create one dictionary of lists for each of the four datasets. In each dictionary the keys are the query names and the values are lists of 31 query execution times: after every execution I append its execution time to the corresponding list. Since execution times are required in milliseconds, I multiply them by 1000 and round them to five decimal places before appending them.
These actions (for each of the four queries on each of the four datasets) follow a standard sequence of steps. Each step is encapsulated in a notebook cell, so each query is performed 31 times by means of three notebook cells:
 - step 1: define the query, execute it for the first time while taking timestamps immediately before and after the execution, and print the query result;
 - step 2: compute the execution time of the first execution and store it in the corresponding dictionary list;
 - step 3: [thirty times] execute the query while taking timestamps before and after, compute the execution time, store it in the corresponding dictionary list, and reset the cursor so that the query can be repeated.
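
The three steps can be condensed into a single helper. What follows is a minimal sketch, not the cells actually used in this notebook. The function name `time_query`, the `runs` parameter and the miniature table are assumptions, and an in-memory SQLite database stands in for the MySQL connection so that the example runs without a server. It also uses `time.perf_counter()` instead of `time.time()`, and frees the cursor with `fetchall()` outside the timed region instead of calling `cursor.reset()`.

```python
import sqlite3   # stand-in for the MySQL driver (assumption, keeps the sketch self-contained)
import time


def time_query(cursor, sql, runs=31):
    """Execute `sql` `runs` times and return the execution times in milliseconds.

    Only cursor.execute() is timed, as in the steps above; the rows are
    fetched afterwards so the cursor is free for the next execution.
    """
    durations = []
    for _ in range(runs):
        before = time.perf_counter()
        cursor.execute(sql)
        after = time.perf_counter()
        cursor.fetchall()   # discard the rows, outside the timed region
        durations.append(round((after - before) * 1000, 5))
    return durations


# hypothetical miniature table, only there to have something to query
demo_conn = sqlite3.connect(':memory:')
demo_cursor = demo_conn.cursor()
demo_cursor.execute('CREATE TABLE smallDB (firstName TEXT, lastName TEXT, courseID INTEGER)')
demo_cursor.executemany('INSERT INTO smallDB VALUES (?, ?, ?)',
                        [('Ada', 'Example', 192), ('Bob', 'Sample', 192), ('Eve', 'Other', 7)])

demo_sql = 'SELECT DISTINCT firstName, lastName FROM smallDB WHERE courseID = 192'
timings = {'query1': time_query(demo_cursor, demo_sql)}
print(len(timings['query1']))   # 31 values: the first execution plus 30 repetitions
```

The first element of the returned list corresponds to steps 1 and 2, the remaining 30 to step 3.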

For each dataset, once the four queries have been performed, I compute the mean of the 30 execution times from step 3. This mean is stored, together with the execution time of the first execution, in a new dictionary specific to that dataset. The four new dictionaries are then saved to a single csv file, to be used for constructing histograms.
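
As a sketch of this summary step, assuming the 31 timings of a query are already in a list, the first execution time and the mean of the remaining 30 can be extracted as follows. The numbers are made up for illustration and are not measurements; `summarize` is a hypothetical helper name, and `statistics.mean` from the standard library is used here in place of a hand-written mean function.

```python
from statistics import mean


def summarize(timings):
    """Return [first execution time, mean of the remaining executions], in ms."""
    first, rest = timings[0], timings[1:]
    return [first, round(mean(rest), 5)]


# made-up timings: one slow first execution followed by 30 faster ones
example_timings = [200.0] + [100.0, 110.0, 120.0] * 10
summary = summarize(example_timings)
print(summary)   # [200.0, 110.0]
```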

In [2]:
smallDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
mediumDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
largeDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
humongousDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}

In [3]:
# mean function (used to compute mean execution time of the 30 grouped queries)
def mean(aList):
    # rely on the built-in sum() instead of shadowing it with a local variable
    return sum(aList) / len(aList)

<h3>Dataset with 250k records</h3>

I start with the smallest dataset.

<h4>Query1</h4>

In [4]:
# step 1
small_sql1 = 'SELECT DISTINCT firstName AS name, lastName AS surname FROM smallDB AS S WHERE courseID = 192'

before = time.time()
cursor.execute(small_sql1)
after = time.time()

small_query1 = cursor.fetchall()
for name, surname in small_query1:
    print(name, surname)

Custodia Hidalgo
Sarah Lara
Narciso Ferrán
Patrícia Leite
Vigilija Gaižauskas
Casandra Arenas
Ledün Soylu
Arthur Laroche
Ana Narušis
Nath Nicolas
Émile Nicolas
Cathrine Lie
Ingeborg Amundsen


In [5]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query1'].append(round(msec_duration, 5))

In [6]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(small_sql1)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query1'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query2</h4>

In [7]:
# step 1
small_sql2 = 'SELECT DISTINCT courseName AS name FROM smallDB WHERE discipline = \'statistics\' AND courseYear = 2022'

before = time.time()
cursor.execute(small_sql2)
after = time.time()

small_query2 = cursor.fetchall()
for course in small_query2:
    print(course[0])

Econometrics: Methods and Applications
Exploratory Data Analysis
Understanding Clinical Research: Behind the Statistics
Introduction to Probability and Data with R
Bayesian Statistics: From Concept to Data Analysis
Introduction to Statistics
Python and Statistics for Financial Analysis
Basic Statistics
Foundations: Data, Data, Everywhere


In [8]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query2'].append(round(msec_duration, 5))

In [9]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(small_sql2)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query2'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query3</h4>

In [10]:
# step 1
small_sql3 = 'SELECT COUNT(materialID) FROM smallDB WHERE materialType = \'lecture slides\' AND discipline = \'maths\' AND email LIKE \'%gmail.com\''

before = time.time()
cursor.execute(small_sql3)
after = time.time()

small_query3 = cursor.fetchall()
for count in small_query3[0]:
    print(count)

838


In [11]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query3'].append(round(msec_duration, 5))

In [12]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(small_sql3)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query3'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query4</h4>

In [13]:
# step 1
small_sql4 = 'SELECT DISTINCT firstName AS name, lastName as surname, country FROM smallDB WHERE discipline = \'psychology\' AND country LIKE \'%orea\' AND dateOfBirth LIKE \'1%\' AND courseYear = 2023 ORDER BY surname ASC;'

before = time.time()
cursor.execute(small_sql4)
after = time.time()

small_query4 = cursor.fetchall()
for name, surname, country in small_query4:
    print(name, surname, 'from', country)

Cathrine Lie from South Korea
Lynda Reynolds from Korea
Raghav Sura from North Korea


In [14]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query4'].append(round(msec_duration, 5))

In [15]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(small_sql4)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query4'].append(round(msec_duration, 5))
    cursor.reset()

I store the execution time of the first execution and the mean of the following 30 executions in a new dictionary.

In [16]:
smallDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in smallDict:
    smallDataset[key].append(smallDict[key][0])
    mean30 = mean(smallDict[key][1 : 31])
    smallDataset[key].append(round(mean30, 5))
smallDataset

{'query1': [3223.03796, 111.13884],
 'query2': [192.51108, 156.48786],
 'query3': [195.03403, 147.04111],
 'query4': [215.14106, 179.53944]}

<h3>Dataset with 500k records</h3>

<h4>Query1</h4>

In [26]:
# step 1
medium_sql1 = 'SELECT DISTINCT firstName AS name, lastName AS surname FROM mediumDB AS S WHERE courseID = 192'

before = time.time()
cursor.execute(medium_sql1)
after = time.time()

medium_query1 = cursor.fetchall()
for name, surname in medium_query1:
    print(name, surname)

Custodia Hidalgo
Sarah Lara
Narciso Ferrán
Patrícia Leite
Vigilija Gaižauskas
Casandra Arenas
Ledün Soylu
Arthur Laroche
Ana Narušis
Nath Nicolas
Émile Nicolas
Cathrine Lie
Ingeborg Amundsen
Nedas Naujokas
Christl Henschel
Miguel Real
Karl Christensen
Joris Kavaliauskas
Yuvaan Dara
Débora Vaz


In [27]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query1'].append(round(msec_duration, 5))

In [28]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(medium_sql1)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query1'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query2</h4>

In [29]:
# step 1
medium_sql2 = 'SELECT DISTINCT courseName AS name FROM mediumDB WHERE discipline = \'statistics\' AND courseYear = 2022'

before = time.time()
cursor.execute(medium_sql2)
after = time.time()

medium_query2 = cursor.fetchall()
for course in medium_query2:
    print(course[0])

Econometrics: Methods and Applications
Exploratory Data Analysis
Understanding Clinical Research: Behind the Statistics
Introduction to Probability and Data with R
Bayesian Statistics: From Concept to Data Analysis
Introduction to Statistics
Python and Statistics for Financial Analysis
Basic Statistics
Foundations: Data, Data, Everywhere


In [30]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query2'].append(round(msec_duration, 5))

In [31]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(medium_sql2)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query2'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query3</h4>

In [32]:
# step 1
medium_sql3 = 'SELECT COUNT(materialID) FROM mediumDB WHERE materialType = \'lecture slides\' AND discipline = \'maths\' AND email LIKE \'%gmail.com\''

before = time.time()
cursor.execute(medium_sql3)
after = time.time()

medium_query3 = cursor.fetchall()
for count in medium_query3[0]:
    print(count)

1698


In [33]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query3'].append(round(msec_duration, 5))

In [34]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(medium_sql3)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query3'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query4</h4>

In [35]:
# step 1
medium_sql4 = 'SELECT DISTINCT firstName AS name, lastName as surname, country FROM mediumDB WHERE discipline = \'psychology\' AND country LIKE \'%orea\' AND dateOfBirth LIKE \'1%\' AND courseYear = 2023 ORDER BY surname ASC;'

before = time.time()
cursor.execute(medium_sql4)
after = time.time()

medium_query4 = cursor.fetchall()
for name, surname, country in medium_query4:
    print(name, surname, 'from', country)

Ninthe Horrocks from Noord-Korea
Cathrine Lie from South Korea
Miguel Real from República de Corea
Lynda Reynolds from Korea
Raghav Sura from North Korea


In [36]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query4'].append(round(msec_duration, 5))

In [37]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(medium_sql4)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query4'].append(round(msec_duration, 5))
    cursor.reset()

In [38]:
mediumDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in mediumDict:
    mediumDataset[key].append(mediumDict[key][0])
    mean30 = mean(mediumDict[key][1 : 31])
    mediumDataset[key].append(round(mean30, 5))
mediumDataset

{'query1': [442.6651, 377.45869],
 'query2': [446.44189, 387.30586],
 'query3': [461.92908, 419.34764],
 'query4': [590.17587, 489.78967]}

<h3>Dataset with 750k records</h3>

<h4>Query1</h4>

In [39]:
# step 1
large_sql1 = 'SELECT DISTINCT firstName AS name, lastName AS surname FROM largeDB AS S WHERE courseID = 192'

before = time.time()
cursor.execute(large_sql1)
after = time.time()

large_query1 = cursor.fetchall()
for name, surname in large_query1:
    print(name, surname)

Custodia Hidalgo
Sarah Lara
Narciso Ferrán
Patrícia Leite
Vigilija Gaižauskas
Casandra Arenas
Ledün Soylu
Arthur Laroche
Ana Narušis
Nath Nicolas
Émile Nicolas
Cathrine Lie
Ingeborg Amundsen
Nedas Naujokas
Christl Henschel
Miguel Real
Karl Christensen
Joris Kavaliauskas
Yuvaan Dara
Débora Vaz
Urvi Dani
Collin Heerkens
Brian Thompson
Özkutlu Gül
Dorita Abella
Liliana Flaiano


In [40]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query1'].append(round(msec_duration, 5))

In [41]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(large_sql1)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query1'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query2</h4>

In [42]:
# step 1
large_sql2 = 'SELECT DISTINCT courseName AS name FROM largeDB WHERE discipline = \'statistics\' AND courseYear = 2022'

before = time.time()
cursor.execute(large_sql2)
after = time.time()

large_query2 = cursor.fetchall()
for course in large_query2:
    print(course[0])

Econometrics: Methods and Applications
Exploratory Data Analysis
Understanding Clinical Research: Behind the Statistics
Introduction to Probability and Data with R
Bayesian Statistics: From Concept to Data Analysis
Introduction to Statistics
Python and Statistics for Financial Analysis
Basic Statistics
Foundations: Data, Data, Everywhere


In [43]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query2'].append(round(msec_duration, 5))

In [44]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(large_sql2)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query2'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query3</h4>

In [45]:
# step 1
large_sql3 = 'SELECT COUNT(materialID) FROM largeDB WHERE materialType = \'lecture slides\' AND discipline = \'maths\' AND email LIKE \'%gmail.com\''

before = time.time()
cursor.execute(large_sql3)
after = time.time()

large_query3 = cursor.fetchall()
for count in large_query3[0]:
    print(count)

2628


In [46]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query3'].append(round(msec_duration, 5))

In [47]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(large_sql3)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query3'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query4</h4>

In [48]:
# step 1
large_sql4 = 'SELECT DISTINCT firstName AS name, lastName as surname, country FROM largeDB WHERE discipline = \'psychology\' AND country LIKE \'%orea\' AND dateOfBirth LIKE \'1%\' AND courseYear = 2023 ORDER BY surname ASC;'

before = time.time()
cursor.execute(large_sql4)
after = time.time()

large_query4 = cursor.fetchall()
for name, surname, country in large_query4:
    print(name, surname, 'from', country)

Tere Castells from República Popular Democrática de Corea
Ninthe Horrocks from Noord-Korea
Cathrine Lie from South Korea
Miguel Real from República de Corea
Lynda Reynolds from Korea
Raghav Sura from North Korea


In [49]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query4'].append(round(msec_duration, 5))

In [50]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(large_sql4)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query4'].append(round(msec_duration, 5))
    cursor.reset()

In [51]:
largeDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in largeDict:
    largeDataset[key].append(largeDict[key][0])
    mean30 = mean(largeDict[key][1 : 31])
    largeDataset[key].append(round(mean30, 5))
largeDataset

{'query1': [20249.22609, 549.37373],
 'query2': [665.6878, 995.14083],
 'query3': [684.79514, 618.8828],
 'query4': [799.21412, 779.94096]}

<h3>Dataset with 1m records</h3>

<h4>Query1</h4>

In [52]:
# step 1
humongous_sql1 = 'SELECT DISTINCT firstName AS name, lastName AS surname FROM humongousDB AS S WHERE courseID = 192'

before = time.time()
cursor.execute(humongous_sql1)
after = time.time()

humongous_query1 = cursor.fetchall()
for name, surname in humongous_query1:
    print(name, surname)

Custodia Hidalgo
Sarah Lara
Narciso Ferrán
Patrícia Leite
Vigilija Gaižauskas
Casandra Arenas
Ledün Soylu
Arthur Laroche
Ana Narušis
Nath Nicolas
Émile Nicolas
Cathrine Lie
Ingeborg Amundsen
Nedas Naujokas
Christl Henschel
Miguel Real
Karl Christensen
Joris Kavaliauskas
Yuvaan Dara
Débora Vaz
Urvi Dani
Collin Heerkens
Brian Thompson
Özkutlu Gül
Dorita Abella
Liliana Flaiano
Finn Karlsen
David Miranda
Torsten Schulz
Kristen Webb
Shaan Raju
Giuseppina Scarfoglio
Mamen Teruel
Eduardo Rezende
Melania Savorgnan


In [53]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query1'].append(round(msec_duration, 5))

In [54]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(humongous_sql1)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query1'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query2</h4>

In [55]:
# step 1
humongous_sql2 = 'SELECT DISTINCT courseName AS name FROM humongousDB WHERE discipline = \'statistics\' AND courseYear = 2022'

before = time.time()
cursor.execute(humongous_sql2)
after = time.time()

humongous_query2 = cursor.fetchall()
for course in humongous_query2:
    print(course[0])

Econometrics: Methods and Applications
Exploratory Data Analysis
Understanding Clinical Research: Behind the Statistics
Introduction to Probability and Data with R
Bayesian Statistics: From Concept to Data Analysis
Introduction to Statistics
Python and Statistics for Financial Analysis
Basic Statistics
Foundations: Data, Data, Everywhere


In [56]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query2'].append(round(msec_duration, 5))

In [57]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(humongous_sql2)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query2'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query3</h4>

In [58]:
# step 1
humongous_sql3 = 'SELECT COUNT(materialID) FROM humongousDB WHERE materialType = \'lecture slides\' AND discipline = \'maths\' AND email LIKE \'%gmail.com\''

before = time.time()
cursor.execute(humongous_sql3)
after = time.time()

humongous_query3 = cursor.fetchall()
for count in humongous_query3[0]:
    print(count)

3498


In [59]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query3'].append(round(msec_duration, 5))

In [60]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(humongous_sql3)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query3'].append(round(msec_duration, 5))
    cursor.reset()

<h4>Query4</h4>

In [61]:
# step 1
humongous_sql4 = 'SELECT DISTINCT firstName AS name, lastName as surname, country FROM humongousDB WHERE discipline = \'psychology\' AND country LIKE \'%orea\' AND dateOfBirth LIKE \'1%\' AND courseYear = 2023 ORDER BY surname ASC;'

before = time.time()
cursor.execute(humongous_sql4)
after = time.time()

humongous_query4 = cursor.fetchall()
for name, surname, country in humongous_query4:
    print(name, surname, 'from', country)

Tere Castells from República Popular Democrática de Corea
Leila Gailys from Korea
Ninthe Horrocks from Noord-Korea
Cathrine Lie from South Korea
Miguel Real from República de Corea
Lynda Reynolds from Korea
Raghav Sura from North Korea


In [62]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query4'].append(round(msec_duration, 5))

In [63]:
# step 3
for i in range(0, 30):
    before = time.time()
    cursor.execute(humongous_sql4)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query4'].append(round(msec_duration, 5))
    cursor.reset()

In [64]:
humongousDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in humongousDict:
    humongousDataset[key].append(humongousDict[key][0])
    mean30 = mean(humongousDict[key][1 : 31])
    humongousDataset[key].append(round(mean30, 5))
humongousDataset

{'query1': [13020.43223, 896.23044],
 'query2': [758.47006, 791.1298],
 'query3': [865.78679, 844.03125],
 'query4': [1314.27693, 1359.20997]}

In [65]:
with open('mysql_tests.csv', 'w', newline = '') as mysql_tests:
    writer = csv.writer(mysql_tests, delimiter = ',')
    keys = smallDataset.keys()
    limit = len(smallDataset['query1'])

    writer.writerow(keys)
    # each dataset is introduced by a one-letter label row:
    # s = small, m = medium, l = large, h = humongous
    datasets = (('s', smallDataset), ('m', mediumDataset), ('l', largeDataset), ('h', humongousDataset))
    for label, dataset in datasets:
        writer.writerow([label])
        for i in range(0, limit):
            writer.writerow(dataset[k][i] for k in keys)
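
To build the histograms, the file has to be read back. The sketch below parses the layout written above (a header row, then for each dataset a one-letter label row followed by the row of first execution times and the row of means). The function name `read_tests` is an assumption, and an in-memory string holding the first rows of the file, with the 250k-record values reported earlier, replaces the file on disk so the example is self-contained.

```python
import csv
import io


def read_tests(csv_file):
    """Parse the layout written above into {label: {query: [first, mean30]}}."""
    reader = csv.reader(csv_file)
    keys = next(reader)                # header row: query1 ... query4
    results, label = {}, None
    for row in reader:
        if len(row) == 1:              # one-letter label row: s, m, l or h
            label = row[0]
            results[label] = {k: [] for k in keys}
        else:                          # a row of timings, one value per query
            for k, value in zip(keys, row):
                results[label][k].append(float(value))
    return results


# the first rows of the file, using the 250k-record values reported above
sample = io.StringIO(
    'query1,query2,query3,query4\n'
    's\n'
    '3223.03796,192.51108,195.03403,215.14106\n'
    '111.13884,156.48786,147.04111,179.53944\n'
)
parsed = read_tests(sample)
print(parsed['s']['query1'])   # [3223.03796, 111.13884]
```

With the real file, the same function would be called as `read_tests(f)` inside `with open('mysql_tests.csv', newline = '') as f:`.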
