<center>
    <h2>Online learning platform database - Cassandra</h2>
    <h3>Performing the queries and storing the query execution times</h3>
</center>

<h3>Python - Cassandra interaction</h3>

Prior to performing the queries we import the required modules (the Cassandra Python driver and the <i>time</i> and <i>csv</i> modules), establish a connection with the Cassandra instance running in Docker and choose the keyspace on which we will perform the queries.

In [1]:
from cassandra.cluster import Cluster   # Cassandra Python driver
import time                             # time-related functions to register query execution times
import csv                              # read and write csv files

# instantiate a cluster
cluster = Cluster(['127.0.0.1'])

# create a session by connecting to the cluster
session = cluster.connect()

# associate a keyspace to the session
session.set_keyspace('dbb_cassandra_test')

<h3>Query the datasets</h3>

I create a dictionary of lists for each of the four datasets (tables). In these dictionaries the keys are the query names and the values are lists of the 31 query execution times: after each execution I append the execution time of the most recent query to the corresponding list. Since query execution times are required in milliseconds, I multiply them by 1000 and round them to five decimal places before appending them.
The actions summarized above (for each of the four queries on each of the four datasets) follow a standard succession of steps. Each step is encapsulated within a notebook cell (so each query is performed 31 times by using three notebook cells), as follows:
 - step 1: create index if needed, define the query, perform it for the first time, contextually create timestamps prior and after query execution, print query result;
 - step 2: compute execution time of the first query execution and store it within the corresponding dictionary list;
 - step 3: [thirty times] perform query execution while creating prior and following timestamps, compute execution time and store it within the corresponding dictionary list <u>, reset the cursor to allow repeating the query</u>.
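
The three steps above can be condensed into one helper. The following is only a minimal sketch and not the code used in this notebook: the names `time_query` and `fake_query` are hypothetical, `fake_query` is a stand-in for `session.execute(...)` plus the processing of its result, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring short intervals.

```python
import time

def time_query(run_query, repeats=30):
    """Run `run_query` once (first execution) and then `repeats` more times.

    Returns a list of repeats + 1 durations in milliseconds, rounded to 5 decimals.
    """
    durations = []
    for _ in range(repeats + 1):
        before = time.perf_counter()
        run_query()
        after = time.perf_counter()
        durations.append(round((after - before) * 1000, 5))
    return durations

# hypothetical stand-in for session.execute(...) plus result processing
def fake_query():
    return sum(range(10_000))

timings = time_query(fake_query)
print(len(timings))   # 31: the first execution plus the 30 repetitions
```

The first element of the returned list corresponds to steps 1 and 2, the remaining thirty to step 3.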

For each dataset, after having performed the four queries, I finally compute the mean of the query execution times from step 3. Together with the time of the first query execution, this mean value is stored in a new dictionary, specific to the dataset. Originally, I intended to use these four new dictionaries to save the query execution times into a csv file for constructing histograms. I later resolved to save all 31 recorded query execution times and pass them to Microsoft© Excel for processing.

In [2]:
smallDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
mediumDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
largeDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
humongousDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}

In [3]:
# mean function (used to compute the mean execution time of the 30 repeated queries)
def mean(aList):
    # use the built-in sum instead of shadowing it with a local variable
    return sum(aList) / len(aList)
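
The per-dataset summary described earlier (time of the first execution plus the mean of the repetitions) can be sketched as follows. This is an illustration under assumptions: `summarise` is a hypothetical helper, the timings are made up, and `statistics.fmean` from the standard library is used in place of the hand-written mean.

```python
from statistics import fmean

def summarise(timings):
    """Return [first execution time, mean of the remaining executions], rounded to 5 decimals."""
    first, rest = timings[0], timings[1:]
    return [first, round(fmean(rest), 5)]

# made-up timings in milliseconds: one slow first run followed by three faster ones
example = [100.0, 10.0, 20.0, 30.0]
print(summarise(example))   # [100.0, 20.0]
```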

<h3>Table with 250k records</h3><br>
I start with the smallest table.

<h4>Query 1</h4>
Note that, because of its focus on performance, Cassandra does not allow the <code>DISTINCT</code> keyword on columns that are not part of the partition key. In <i>Query 1</i>, however, we are only interested in the names of the students enrolled in the course with ID = 192, so there is no point in listing a student's name more than once. Each student enrolled in course 192 has accessed various learning materials of that course, hence the query result, without <code>DISTINCT</code>, will be a <i>bag</i> rather than a <i>set</i>. Duplicate values must therefore be handled in some way, in order to display each student only once. I have considered two possibilities to reach the desired result:

- the first one uses CQLSH and an external csv file. It requires creating, within the selected keyspace, a new table with three fields (student first name, student last name and course ID) and setting the first two as the primary key. The same three fields are then copied from the entire original table to a csv file and copied back from the csv file into the table having firstName and lastName as primary key. The query can then be run on the new table with the <code>DISTINCT</code> option:

    <code>
        CREATE TABLE query1_temp (firstName VARCHAR, lastName VARCHAR, courseID VARCHAR, PRIMARY KEY(firstName, lastName));
        COPY smallDB(firstName, lastName, courseID) TO 'path/query1_temp.csv' WITH HEADER = TRUE AND DELIMITER = ',';
        COPY query1_temp(firstName, lastName, courseID) FROM 'path/query1_temp.csv' WITH HEADER = TRUE AND DELIMITER = ',';
        SELECT DISTINCT firstName, lastName FROM query1_temp WHERE courseID = '192';
    </code>

- the second one simply performs the query on the original table without the <code>DISTINCT</code> keyword. The query result is then processed in a programming language to obtain the unique students enrolled in course 192. Python is suitable for this purpose, since its set objects do not allow duplicate elements.

Both methods affect the query execution time, if all the steps are taken into account. The second method seems to me the cleanest one and I will apply it for <i>Query 1</i>. In particular, I add a new substep to <i>step 1</i>: prior to recording timestamps I create a set (<i>resSet</i>) in which to store the unique values from the query result. After defining this object and creating the index required by the query, I record the first timestamp, run the query, store the rows of the query result in the <i>resSet</i> object and record the last timestamp. The query result can then be displayed.
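
The deduplication step can be illustrated without a database. This is a sketch with made-up data: the `Row` named tuple stands in for the rows returned by the driver, on the assumption that the driver's default row factory yields named tuples, which are hashable and compare by value.

```python
from collections import namedtuple

# stand-in for the rows returned by the driver; the field names mirror the query
Row = namedtuple('Row', ['firstname', 'lastname'])

# made-up result: the same student appears once per accessed learning material
rows = [
    Row('Ada', 'Rossi'),
    Row('Ada', 'Rossi'),
    Row('Luca', 'Bianchi'),
    Row('Ada', 'Rossi'),
]

unique_students = set()
for row in rows:
    unique_students.add(row)   # equal tuples hash alike, so duplicates collapse

print(len(rows), len(unique_students))   # 4 2
```

Because a set is unordered, the order in which the students are printed may differ from one run to the next.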

In [4]:
# step 1
resSet1 = set()
session.execute('CREATE INDEX IF NOT EXISTS query1Index ON smalldb(courseid);')
small_cassandra1 = 'SELECT firstName, lastName FROM smalldb WHERE courseid = \'192\';'

before = time.time()
small_query1 = session.execute(small_cassandra1)
for row in small_query1:
    resSet1.add(row)
after = time.time()

for element in resSet1:
    print(element[0], element[1])

Casandra Arenas
Ledün Soylu
Cathrine Lie
Custodia Hidalgo
Patrícia Leite
Arthur Laroche
Narciso Ferrán
Ingeborg Amundsen
Sarah Lara
Vigilija Gaižauskas
Nath Nicolas
Ana Narušis
Émile Nicolas


In [5]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query1'].append(round(msec_duration, 5))

In [6]:
# step 3
for i in range(0, 30):
    resSet1 = set()
    before = time.time()
    small_query1 = session.execute(small_cassandra1)
    for row in small_query1:
        resSet1.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>
<i>Query 2</i> requires a selection on more than one field value (the discipline must be 'statistics' and the year of the course must be '2022'). In this case two indices are created, but Cassandra still requires the <code>ALLOW FILTERING</code> clause, because a query restricting more than one indexed column may negatively impact performance. Here too we need unique values, hence I add the results to a Python set.

In [8]:
# step 1
resSet2 = set()
session.execute('CREATE INDEX IF NOT EXISTS query2Index1 ON smalldb(discipline);')
session.execute('CREATE INDEX IF NOT EXISTS query2Index2 ON smalldb(courseyear);')
small_cassandra2 = 'SELECT coursename FROM smalldb WHERE discipline = \'statistics\' AND courseyear = \'2022\' ALLOW FILTERING;'

before = time.time()
small_query2 = session.execute(small_cassandra2)
for row in small_query2:
    resSet2.add(row)
after = time.time()

for element in resSet2:
    print(element[0])

Basic Statistics
Exploratory Data Analysis
Bayesian Statistics: From Concept to Data Analysis
Python and Statistics for Financial Analysis
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Econometrics: Methods and Applications
Introduction to Probability and Data with R
Introduction to Statistics


In [9]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query2'].append(round(msec_duration, 5))

In [10]:
# step 3
for i in range(0, 30):
    before = time.time()
    small_query2 = session.execute(small_cassandra2)
    for row in small_query2:
        resSet2.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>
In this case the main difficulty was implementing an index that could behave similarly to the MySQL <code>LIKE</code> operator, matching patterns in strings. In Cassandra custom indices can be created, in particular the so-called <a href = 'https://cassandra.apache.org/doc/stable/cassandra/cql/SASI.html'>SASI indexes</a>, which can be set to three different modes: <code>PREFIX</code> (default), <code>CONTAINS</code> or <code>SPARSE</code>.<br>
<br>
The first one allows the use of a syntax such as the following:<br>
<code>SELECT <i>fieldNames</i> FROM <i>tableName</i> WHERE <i>fieldName</i> LIKE '<i>prefix</i>%';</code><br>
<br>
The second one allows matching either suffixes or strings contained in another string:
<br>
<code>SELECT <i>fieldNames</i> FROM <i>tableName</i> WHERE <i>fieldName</i> LIKE '%<i>contained</i>%';</code><br>
or
<br>
<code>SELECT <i>fieldNames</i> FROM <i>tableName</i> WHERE <i>fieldName</i> LIKE '%<i>suffix</i>';</code><br>
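
To make the three pattern shapes concrete, the sketch below emulates them on a plain Python string. It is an illustration only and says nothing about how Cassandra evaluates them: the function `like` is hypothetical, it interprets only a leading and/or trailing `%`, and the sample email is made up.

```python
def like(value, pattern):
    """Emulate the three LIKE shapes on a plain Python string.

    Only a leading and/or trailing '%' is interpreted; this is an
    illustration, not Cassandra's implementation.
    """
    starts = pattern.startswith('%')
    ends = pattern.endswith('%')
    core = pattern.strip('%')
    if starts and ends:
        return core in value            # '%contained%'
    if starts:
        return value.endswith(core)     # '%suffix'
    if ends:
        return value.startswith(core)   # 'prefix%'
    return value == core                # no wildcard: exact match

email = 'jane.doe@gmail.com'   # made-up value
print(like(email, 'jane%'), like(email, '%doe%'), like(email, '%gmail.com'), like(email, '%yahoo.com'))
# True True True False
```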

As for the <code>SPARSE</code> mode, I only found some details <a href = 'https://www.doanduyhai.com/blog/?p=2058'>here</a>. This mode is mainly designed for cases in which each indexed value matches very few rows.<br>
Based on this, I approached the definition of <i>Query 3</i> as per the following cell. I create two indices on the <i>discipline</i> and <i>material type</i> fields and a custom index on the <i>email</i> field. Then I build the <code>WHERE</code> clause on the three indices.

In [12]:
# set the indices
session.execute('CREATE INDEX IF NOT EXISTS query3Index1 ON smalldb(discipline);')
session.execute('CREATE INDEX IF NOT EXISTS query3Index2 ON smalldb(materialtype);')
session.execute('CREATE CUSTOM INDEX IF NOT EXISTS SASIquery3Index3 ON smalldb(email) USING \'org.apache.cassandra.index.sasi.SASIIndex\' WITH OPTIONS = {\'mode\': \'CONTAINS\', \'analyzer_class\': \'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer\', \'case_sensitive\': \'false\'};')

# define the CQL query
test = 'SELECT materialid FROM smalldb WHERE discipline = \'maths\' AND materialtype = \'lecture slides\' AND email LIKE \'%gmail.com\' ALLOW FILTERING;'

# run the query
smallqueryTest_query3 = session.execute(test)

ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures: UNKNOWN from /172.25.0.2:7000" info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 1, 'error_code_map': {'172.25.0.2': '0x0000'}}

A <i>ReadFailure error</i> is thrown and I believe it is associated with the use of a custom index together with two regular indices. In fact, executing two separate queries, one with the regular indices and one with the custom index, returns results.

In [13]:
from itertools import islice

test1 = 'SELECT materialid FROM smalldb WHERE discipline = \'maths\' AND materialtype = \'lecture slides\' ALLOW FILTERING;'
queryTest1 = session.execute(test1)
# iterate over the first ten rows instead of indexing the ResultSet
for row in islice(queryTest1, 10):
    print(row)

Row(materialid='21516')
Row(materialid='4163')
Row(materialid='21576')
Row(materialid='21308')
Row(materialid='4328')
Row(materialid='21329')
Row(materialid='21370')
Row(materialid='21490')
Row(materialid='21367')
Row(materialid='21727')



In [14]:
from itertools import islice

test2 = 'SELECT materialid FROM smalldb WHERE email LIKE \'%gmail.com\' ALLOW FILTERING;'
queryTest2 = session.execute(test2)
# iterate over the first ten rows instead of indexing the ResultSet
for row in islice(queryTest2, 10):
    print(row)


Row(materialid='17036')
Row(materialid='9497')
Row(materialid='11515')
Row(materialid='9388')
Row(materialid='19901')
Row(materialid='23612')
Row(materialid='8111')
Row(materialid='13673')
Row(materialid='11260')
Row(materialid='18429')


I then resolved to run a simplified query and process its result in Python to obtain the desired outcome. I get the rows where the discipline is <i>maths</i> and the learning material type is <i>lecture slides</i>, then I run a custom function on the query result. The function returns the domain of each email address, which lets me count only the rows whose domain matches the string '<i>gmail.com</i>'. In this case all the query results are of interest, since we want to know how many learning materials have been accessed by students, irrespective of possible repetitions in the accessed learning materials.

In [15]:
# define function taking an email string and returning a substring with the email domain
def findDomain(email):
    delimiter = '@'
    emailList = email.split(delimiter)
    return emailList[1]
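
A slightly more defensive variant is sketched below, in case some stored values are not well-formed addresses. It rests on assumptions not taken from the dataset: `find_domain_safe` is a hypothetical name, the sample addresses are made up, and lower-casing the domain is an extra choice (the function above compares the domain exactly as stored).

```python
def find_domain_safe(email):
    """Return the lower-cased part after the last '@', or None if there is no '@'."""
    local, sep, domain = email.rpartition('@')
    if not sep or not domain:
        return None
    return domain.lower()

# made-up addresses
samples = ['a.b@gmail.com', 'c@GMAIL.com', 'no-at-sign', 'd@yahoo.com']
gmail_count = sum(1 for e in samples if find_domain_safe(e) == 'gmail.com')
print(gmail_count)   # 2
```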

In [16]:
# step 1
small_cassandra3 = 'SELECT email FROM smalldb WHERE discipline = \'maths\' AND materialtype = \'lecture slides\' ALLOW FILTERING;'

before = time.time()
small_query3 = session.execute(small_cassandra3)
counter = 0
for row in small_query3:
    if findDomain(row.email) == 'gmail.com':
        counter += 1
    else:
        counter += 0
after = time.time()

print(counter)

837


In [17]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query3'].append(round(msec_duration, 5))

In [18]:
# step 3
for i in range(0, 30):
    before = time.time()
    small_query3 = session.execute(small_cassandra3)
    counter = 0
    for row in small_query3:
        if findDomain(row.email) == 'gmail.com':
            counter += 1
        else:
            counter += 0
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>
<i>Query 4</i> presents problems analogous to those of <i>Query 3</i>, since Korean countries occur in the table under multiple names (South Korea, North Korea, República de Corea, etc.). In this case, instead of trying to use a custom index, I preferred to exploit the <code>IN</code> operator, applying it to the complete list of names under which Korean countries occur. In this way, only students from a country in this limited set can be selected (among those enrolled in a course of the discipline '<i>psychology</i>'). The birthdate is simply another string value, and string manipulation or pattern searches would require a custom index together with other selection criteria, which would raise the problems already experienced. Choosing students born before the year 2000 within the query therefore seems a hard task which, given my configuration, is probably better achieved in Python. So I define a function that extracts the year from the <i>dateofbirth</i> field, and I check whether that year precedes 2000. I apply the function to each row of the query result and add only the matching rows to a Python set.

In [20]:
# define function taking a birthdate string in the format yyyy-mm-dd and returning a substring with the year
def findYear(dateofbirth):
    delimiter = '-'
    dateList = dateofbirth.split(delimiter)
    return int(dateList[0])
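
An alternative to splitting the string by hand is to let the standard library parse the date. This is only a sketch, assuming every value strictly follows the yyyy-mm-dd format; `born_before` is a hypothetical helper and `date.fromisoformat` raises a `ValueError` on malformed input.

```python
from datetime import date

def born_before(dateofbirth, year=2000):
    """Return True if an ISO 'yyyy-mm-dd' string falls before 1 January of `year`."""
    return date.fromisoformat(dateofbirth) < date(year, 1, 1)

print(born_before('1999-12-31'), born_before('2000-01-01'))   # True False
```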

In [21]:
# step 1
resSet4 = set()
session.execute('CREATE INDEX IF NOT EXISTS query4Index1 ON smalldb(discipline);')
small_cassandra4 = 'SELECT firstname, lastname, country, dateofbirth FROM smalldb WHERE discipline = \'psychology\' AND country IN (\'Korea\', \'República de Corea\', \'South Korea\', \'North Korea\', \'República Popular Democrática de Corea\', \'Sydkorea\', \'Noord-Korea\') ALLOW FILTERING;'

before = time.time()
small_query4 = session.execute(small_cassandra4)
for row in small_query4:
    if findYear(row.dateofbirth) < 2000:
        resSet4.add(row)
after = time.time()

for element in resSet4:
    print(element.firstname, element.lastname, element.country)

Raghav Sura North Korea
Cathrine Lie South Korea
Lynda Reynolds Korea


In [22]:
# step 2
msec_duration = (after - before) * 1000
smallDict['query4'].append(round(msec_duration, 5))

In [23]:
# step 3
for i in range(0, 30):
    before = time.time()
    small_query4 = session.execute(small_cassandra4)
    for row in small_query4:
        # apply the same year filter as in step 1, so that every run times the same work
        if findYear(row.dateofbirth) < 2000:
            resSet4.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    smallDict['query4'].append(round(msec_duration, 5))

In [25]:
smallDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in smallDict:
    smallDataset[key].append(smallDict[key][0])
    mean30 = mean(smallDict[key][1 : 31])
    smallDataset[key].append(round(mean30, 5))
smallDataset

{'query1': [4394.18602, 121.6824],
 'query2': [1478.45817, 334.82787],
 'query3': [317.37709, 270.71492],
 'query4': [915.18497, 610.58322]}

<h3>Table with 500k records</h3>

<h4>Query 1</h4>

In [28]:
# step 1
resSet1 = set()
session.execute('CREATE INDEX IF NOT EXISTS med_query1Index ON mediumdb(courseid);')
medium_cassandra1 = 'SELECT firstName, lastName FROM mediumdb WHERE courseid = \'192\';'

before = time.time()
medium_query1 = session.execute(medium_cassandra1)
for row in medium_query1:
    resSet1.add(row)
after = time.time()

for element in resSet1:
    print(element[0], element[1])

Ledün Soylu
Nedas Naujokas
Arthur Laroche
Narciso Ferrán
Miguel Real
Custodia Hidalgo
Patrícia Leite
Sarah Lara
Débora Vaz
Vigilija Gaižauskas
Ingeborg Amundsen
Christl Henschel
Nath Nicolas
Ana Narušis
Émile Nicolas
Yuvaan Dara
Joris Kavaliauskas
Casandra Arenas
Cathrine Lie
Karl Christensen


In [29]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query1'].append(round(msec_duration, 5))

In [30]:
# step 3
for i in range(0, 30):
    resSet1 = set()
    before = time.time()
    medium_query1 = session.execute(medium_cassandra1)
    for row in medium_query1:
        resSet1.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [33]:
# step 1
resSet2 = set()
session.execute('CREATE INDEX IF NOT EXISTS med_query2Index1 ON mediumdb(discipline);')
session.execute('CREATE INDEX IF NOT EXISTS med_query2Index2 ON mediumdb(courseyear);')
medium_cassandra2 = 'SELECT coursename FROM mediumdb WHERE discipline = \'statistics\' AND courseyear = \'2022\' ALLOW FILTERING;'

before = time.time()
medium_query2 = session.execute(medium_cassandra2)
for row in medium_query2:
    resSet2.add(row)
after = time.time()

for element in resSet2:
    print(element[0])

Basic Statistics
Exploratory Data Analysis
Bayesian Statistics: From Concept to Data Analysis
Python and Statistics for Financial Analysis
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Introduction to Probability and Data with R
Econometrics: Methods and Applications
Introduction to Statistics


In [34]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query2'].append(round(msec_duration, 5))

In [35]:
# step 3
for i in range(0, 30):
    before = time.time()
    medium_query2 = session.execute(medium_cassandra2)
    for row in medium_query2:
        resSet2.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [37]:
# step 1
medium_cassandra3 = 'SELECT email FROM mediumdb WHERE discipline = \'maths\' AND materialtype = \'lecture slides\' ALLOW FILTERING;'

before = time.time()
medium_query3 = session.execute(medium_cassandra3)
counter = 0
for row in medium_query3:
    if findDomain(row.email) == 'gmail.com':
        counter += 1
    else:
        counter += 0
after = time.time()

print(counter)

1686


In [38]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query3'].append(round(msec_duration, 5))

In [39]:
# step 3
for i in range(0, 30):
    before = time.time()
    medium_query3 = session.execute(medium_cassandra3)
    counter = 0
    for row in medium_query3:
        if findDomain(row.email) == 'gmail.com':
            counter += 1
        else:
            counter += 0
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [42]:
# step 1
resSet4 = set()
session.execute('CREATE INDEX IF NOT EXISTS med_query4Index1 ON mediumdb(discipline);')
medium_cassandra4 = 'SELECT firstname, lastname, country, dateofbirth FROM mediumdb WHERE discipline = \'psychology\' AND country IN (\'Korea\', \'República de Corea\', \'South Korea\', \'North Korea\', \'República Popular Democrática de Corea\', \'Sydkorea\', \'Noord-Korea\') ALLOW FILTERING;'

before = time.time()
medium_query4 = session.execute(medium_cassandra4)
for row in medium_query4:
    if findYear(row.dateofbirth) < 2000:
        resSet4.add(row)
after = time.time()

for element in resSet4:
    print(element.firstname, element.lastname, element.country)

Raghav Sura North Korea
Cathrine Lie South Korea
Ninthe Horrocks Noord-Korea
Miguel Real República de Corea
Lynda Reynolds Korea


In [43]:
# step 2
msec_duration = (after - before) * 1000
mediumDict['query4'].append(round(msec_duration, 5))

In [44]:
# step 3
for i in range(0, 30):
    before = time.time()
    medium_query4 = session.execute(medium_cassandra4)
    for row in medium_query4:
        # apply the same year filter as in step 1, so that every run times the same work
        if findYear(row.dateofbirth) < 2000:
            resSet4.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    mediumDict['query4'].append(round(msec_duration, 5))

In [46]:
mediumDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in mediumDict:
    mediumDataset[key].append(mediumDict[key][0])
    mean30 = mean(mediumDict[key][1 : 31])
    mediumDataset[key].append(round(mean30, 5))
mediumDataset

{'query1': [782.57918, 86.56153],
 'query2': [1269.67812, 732.44093],
 'query3': [619.02809, 512.1908],
 'query4': [1792.56797, 1621.39731]}

<h3>Table with 750k records</h3>

<h4>Query 1</h4>

In [49]:
# step 1
resSet1 = set()
session.execute('CREATE INDEX IF NOT EXISTS lg_query1Index ON largedb(courseid);')
large_cassandra1 = 'SELECT firstname, lastname FROM largedb WHERE courseid = \'192\';'

before = time.time()
large_query1 = session.execute(large_cassandra1)
for row in large_query1:
    resSet1.add(row)
after = time.time()

for element in resSet1:
    print(element[0], element[1])

Nedas Naujokas
Ledün Soylu
Brian Thompson
Arthur Laroche
Narciso Ferrán
Miguel Real
Custodia Hidalgo
Patrícia Leite
Urvi Dani
Sarah Lara
Débora Vaz
Vigilija Gaižauskas
Özkutlu Gül
Liliana Flaiano
Ingeborg Amundsen
Collin Heerkens
Christl Henschel
Nath Nicolas
Ana Narušis
Émile Nicolas
Dorita Abella
Yuvaan Dara
Joris Kavaliauskas
Casandra Arenas
Cathrine Lie
Karl Christensen


In [50]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query1'].append(round(msec_duration, 5))

In [51]:
# step 3
for i in range(0, 30):
    resSet1 = set()
    before = time.time()
    large_query1 = session.execute(large_cassandra1)
    for row in large_query1:
        resSet1.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [55]:
# step 1
resSet2 = set()
session.execute('CREATE INDEX IF NOT EXISTS lg_query2Index1 ON largedb(discipline);')
session.execute('CREATE INDEX IF NOT EXISTS lg_query2Index2 ON largedb(courseyear);')
large_cassandra2 = 'SELECT coursename FROM largedb WHERE discipline = \'statistics\' AND courseyear = \'2022\' ALLOW FILTERING;'

before = time.time()
large_query2 = session.execute(large_cassandra2)
for row in large_query2:
    resSet2.add(row)
after = time.time()

for element in resSet2:
    print(element[0])

Basic Statistics
Exploratory Data Analysis
Bayesian Statistics: From Concept to Data Analysis
Python and Statistics for Financial Analysis
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Econometrics: Methods and Applications
Introduction to Probability and Data with R
Introduction to Statistics


In [56]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query2'].append(round(msec_duration, 5))

In [57]:
# step 3
for i in range(0, 30):
    before = time.time()
    large_query2 = session.execute(large_cassandra2)
    for row in large_query2:
        resSet2.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [60]:
# step 1
large_cassandra3 = 'SELECT email FROM largedb WHERE discipline = \'maths\' AND materialtype = \'lecture slides\' ALLOW FILTERING;'

before = time.time()
large_query3 = session.execute(large_cassandra3)
counter = 0
for row in large_query3:
    if findDomain(row.email) == 'gmail.com':
        counter += 1
    else:
        counter += 0
after = time.time()

print(counter)

2615


In [61]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query3'].append(round(msec_duration, 5))

In [62]:
# step 3
for i in range(0, 30):
    before = time.time()
    large_query3 = session.execute(large_cassandra3)
    counter = 0
    for row in large_query3:
        if findDomain(row.email) == 'gmail.com':
            counter += 1
        else:
            counter += 0
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [64]:
# step 1
resSet4 = set()
session.execute('CREATE INDEX IF NOT EXISTS lg_query4Index1 ON largedb(discipline);')
large_cassandra4 = 'SELECT firstname, lastname, country, dateofbirth FROM largedb WHERE discipline = \'psychology\' AND country IN (\'Korea\', \'República de Corea\', \'South Korea\', \'North Korea\', \'República Popular Democrática de Corea\', \'Sydkorea\', \'Noord-Korea\') ALLOW FILTERING;'

before = time.time()
large_query4 = session.execute(large_cassandra4)
for row in large_query4:
    if findYear(row.dateofbirth) < 2000:
        resSet4.add(row)
after = time.time()

for element in resSet4:
    print(element.firstname, element.lastname, element.country)

Raghav Sura North Korea
Cathrine Lie South Korea
Ninthe Horrocks Noord-Korea
Tere Castells República Popular Democrática de Corea
Miguel Real República de Corea
Lynda Reynolds Korea


In [65]:
# step 2
msec_duration = (after - before) * 1000
largeDict['query4'].append(round(msec_duration, 5))

In [66]:
# step 3
for i in range(0, 30):
    before = time.time()
    large_query4 = session.execute(large_cassandra4)
    for row in large_query4:
        # apply the same year filter as in step 1, so that every run times the same work
        if findYear(row.dateofbirth) < 2000:
            resSet4.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    largeDict['query4'].append(round(msec_duration, 5))

In [68]:
largeDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in largeDict:
    largeDataset[key].append(largeDict[key][0])
    mean30 = mean(largeDict[key][1 : 31])
    largeDataset[key].append(round(mean30, 5))
largeDataset

{'query1': [140.74492, 106.26348],
 'query2': [1108.06799, 1003.05887],
 'query3': [1558.99978, 751.42794],
 'query4': [3008.16417, 2378.96704]}

<h3>Table with 1M records</h3>

<h4>Query 1</h4>

In [72]:
# step 1
resSet1 = set()
session.execute('CREATE INDEX IF NOT EXISTS hmg_query1Index ON humongousdb(courseid);')
humongous_cassandra1 = 'SELECT firstName, lastName FROM humongousdb WHERE courseid = \'192\';'

before = time.time()
humongous_query1 = session.execute(humongous_cassandra1)
for row in humongous_query1:
    resSet1.add(row)
after = time.time()

for element in resSet1:
    print(element[0], element[1])

Ledün Soylu
Nedas Naujokas
Brian Thompson
Torsten Schulz
Arthur Laroche
Narciso Ferrán
Giuseppina Scarfoglio
Miguel Real
Mamen Teruel
Kristen Webb
Custodia Hidalgo
Patrícia Leite
Urvi Dani
Sarah Lara
Débora Vaz
Vigilija Gaižauskas
David Miranda
Melania Savorgnan
Özkutlu Gül
Liliana Flaiano
Ingeborg Amundsen
Collin Heerkens
Christl Henschel
Nath Nicolas
Ana Narušis
Shaan Raju
Eduardo Rezende
Émile Nicolas
Dorita Abella
Yuvaan Dara
Joris Kavaliauskas
Casandra Arenas
Finn Karlsen
Cathrine Lie
Karl Christensen


In [73]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query1'].append(round(msec_duration, 5))

In [74]:
# step 3
for i in range(0, 30):
    resSet1 = set()
    before = time.time()
    humongous_query1 = session.execute(humongous_cassandra1)
    for row in humongous_query1:
        resSet1.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query1'].append(round(msec_duration, 5))

<h4>Query 2</h4>

In [79]:
# step 1
resSet2 = set()
session.execute('CREATE INDEX IF NOT EXISTS hmg_query2Index1 ON humongousdb(discipline);')
session.execute('CREATE INDEX IF NOT EXISTS hmg_query2Index2 ON humongousdb(courseyear);')
humongous_cassandra2 = 'SELECT coursename FROM humongousdb WHERE discipline = \'statistics\' AND courseyear = \'2022\' ALLOW FILTERING;'

before = time.time()
humongous_query2 = session.execute(humongous_cassandra2)
for row in humongous_query2:
    resSet2.add(row)
after = time.time()

for element in resSet2:
    print(element[0])

Basic Statistics
Exploratory Data Analysis
Bayesian Statistics: From Concept to Data Analysis
Python and Statistics for Financial Analysis
Foundations: Data, Data, Everywhere
Understanding Clinical Research: Behind the Statistics
Econometrics: Methods and Applications
Introduction to Probability and Data with R
Introduction to Statistics


In [80]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query2'].append(round(msec_duration, 5))

In [81]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongous_query2 = session.execute(humongous_cassandra2)
    for row in humongous_query2:
        resSet2.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query2'].append(round(msec_duration, 5))

<h4>Query 3</h4>

In [83]:
# step 1
humongous_cassandra3 = 'SELECT email FROM humongousdb WHERE discipline = \'maths\' AND materialtype = \'lecture slides\' ALLOW FILTERING;'

before = time.time()
humongous_query3 = session.execute(humongous_cassandra3)
counter = 0
for row in humongous_query3:
    if findDomain(row.email) == 'gmail.com':
        counter += 1
    else:
        counter += 0
after = time.time()

print(counter)

3510


In [84]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query3'].append(round(msec_duration, 5))

In [85]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongous_query3 = session.execute(humongous_cassandra3)
    counter = 0
    for row in humongous_query3:
        if findDomain(row.email) == 'gmail.com':
            counter += 1
        else:
            counter += 0
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query3'].append(round(msec_duration, 5))

<h4>Query 4</h4>

In [88]:
# step 1
resSet4 = set()
session.execute('CREATE INDEX IF NOT EXISTS hmg_query4Index1 ON humongousdb(discipline);')
humongous_cassandra4 = 'SELECT firstname, lastname, country, dateofbirth FROM humongousdb WHERE discipline = \'psychology\' AND country IN (\'Korea\', \'República de Corea\', \'South Korea\', \'North Korea\', \'República Popular Democrática de Corea\', \'Sydkorea\', \'Noord-Korea\') ALLOW FILTERING;'

before = time.time()
humongous_query4 = session.execute(humongous_cassandra4)
for row in humongous_query4:
    if findYear(row.dateofbirth) < 2000:
        resSet4.add(row)
after = time.time()

for element in resSet4:
    print(element.firstname, element.lastname, element.country)

Raghav Sura North Korea
Cathrine Lie South Korea
Ninthe Horrocks Noord-Korea
Tere Castells República Popular Democrática de Corea
Miguel Real República de Corea
Debra Shaw Korea
Leila Gailys Korea
Lynda Reynolds Korea


In [89]:
# step 2
msec_duration = (after - before) * 1000
humongousDict['query4'].append(round(msec_duration, 5))

In [90]:
# step 3
for i in range(0, 30):
    before = time.time()
    humongous_query4 = session.execute(humongous_cassandra4)
    for row in humongous_query4:
        # apply the same year filter as in step 1, so that every run times the same work
        if findYear(row.dateofbirth) < 2000:
            resSet4.add(row)
    after = time.time()
    msec_duration = (after - before) * 1000
    humongousDict['query4'].append(round(msec_duration, 5))

In [92]:
humongousDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in humongousDict:
    humongousDataset[key].append(humongousDict[key][0])
    mean30 = mean(humongousDict[key][1 : 31])
    humongousDataset[key].append(round(mean30, 5))
humongousDataset

{'query1': [234.81083, 151.13089],
 'query2': [1519.24586, 1418.84372],
 'query3': [1896.25597, 1120.20448],
 'query4': [4797.63794, 3076.35696]}

In [93]:
with open('cassandra_tests.csv', 'w', newline = '') as cassandra_tests:
    writer = csv.writer(cassandra_tests, delimiter = ',')
    keys = smallDict.keys()
    limit = len(smallDict['query1'])

    # header row with the query names
    writer.writerow(keys)
    # for each dataset: a one-letter label row (s = small, m = medium, l = large, h = humongous)
    # followed by one row per execution holding the four query times
    datasets = [('s', smallDict), ('m', mediumDict), ('l', largeDict), ('h', humongousDict)]
    for label, timings in datasets:
        writer.writerow([label])
        for i in range(0, limit):
            writer.writerow([timings[k][i] for k in keys])
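
Should the file need to be processed in Python rather than in a spreadsheet, it can be read back along these lines. This is a sketch under assumptions: `read_timings` is a hypothetical helper, the example file and its values are made up, and a label row is recognised purely by having a single cell, which only works when there is more than one query column.

```python
import csv
import os
import tempfile

def read_timings(path):
    """Parse the layout written above: header, then a one-letter label row before each block."""
    datasets = {}
    with open(path, newline='') as handle:
        reader = csv.reader(handle)
        header = next(reader)
        current = None
        for row in reader:
            if len(row) == 1:                 # label row such as ['s']
                current = datasets.setdefault(row[0], {name: [] for name in header})
            else:                             # one timing per query
                for name, value in zip(header, row):
                    current[name].append(float(value))
    return datasets

# build a tiny made-up file with the same layout (two queries, two runs each)
path = os.path.join(tempfile.mkdtemp(), 'example_tests.csv')
with open(path, 'w', newline='') as handle:
    writer = csv.writer(handle)
    writer.writerow(['query1', 'query2'])
    writer.writerow(['s'])
    writer.writerow([1.5, 2.5])
    writer.writerow([3.5, 4.5])
    writer.writerow(['m'])
    writer.writerow([10.0, 20.0])
    writer.writerow([30.0, 40.0])

parsed = read_timings(path)
print(parsed['s']['query1'], parsed['m']['query2'])   # [1.5, 3.5] [20.0, 40.0]
```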