<center>
<h2>Online learning platform database - Redis</h2>
</center>

<h3>Preliminary operations: import csv files into Redis</h3>

Because Redis is a <i>key-value store</i> rather than a classical DBMS, importing data is best done through a programming language API: we load the csv file into a suitable data structure, connect to a Redis instance and store the data in a Redis data type. For this reason, this section on Redis has a slightly different format from the previous ones: it starts with the csv import in Python, then introduces the Python driver for Redis and the ways to connect to a Redis instance from Python.

<h3>Importing csv files into Python</h3>

Importing a csv file into Python is best performed via the <code>csv</code> module of the Python standard library, which provides <code>csv.reader</code> to read csv files and <code>csv.writer</code> to write them. After opening the file, we can read it line by line and store the fields of each line in a list. The lines are then stored in a dictionary of lists, where each key is a line number (key 0 is the header row) and each value is the list of fields in that line. This procedure is wrapped in a function that can be called for each of the four datasets.

In [None]:
# PURPOSE: stores a csv file into a Python dictionary (dictionary keys are row numbers, values are rows as lists)
# arguments: a path (string), a csv filename including extension (string)
# RETURNS: a dictionary
import csv

def importCSVfile(pathName, csvFileName):
    dataDict = dict()
    with open(pathName + csvFileName, newline = '') as csvFile:
        reader = csv.reader(csvFile, delimiter = ',')
        # enumerate numbers the rows from 0, so key 0 holds the header row
        for key, line in enumerate(reader):
            dataDict[key] = line
    return dataDict

In [1]:
# call the importCSVfile function for the four differently sized datasets
path = '/Users/mau/OneDrive - unime.it/Learning/CdL Informatica/Anno II - Database/Module B/project/tables/'

# 250k rows dataset to dict
smallDB = importCSVfile(path, 'dataset250k.csv')

# 500k rows dataset to dict
mediumDB = importCSVfile(path, 'dataset500k.csv')

# 750k rows dataset to dict
largeDB = importCSVfile(path, 'dataset750k.csv')

# 1m rows dataset to dict
humongousDB = importCSVfile(path, 'dataset1m.csv')

In [2]:
print('Lengths of the four dictionaries, from the smallest to the largest:\n')
print('smallDB:', len(smallDB), 'mediumDB:', len(mediumDB), 'largeDB:', len(largeDB), 'humongousDB:', len(humongousDB))

Lengths of the four dictionaries, from the smallest to the largest:
250001 500001 750001 1000001
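
As a possible alternative to the dictionary of lists, the sketch below reads each csv row as a dictionary keyed by the column names, which is already close to the field-value layout of a Redis hash. The function name <code>importCSVasDicts</code> is an assumption made for illustration and is not used elsewhere in this notebook.

```python
import csv

# hypothetical helper: returns a list with one dictionary per csv data row,
# keyed by the column names found in the header row
def importCSVasDicts(filePath):
    with open(filePath, newline = '') as csvFile:
        # DictReader consumes the first line as the header
        return list(csv.DictReader(csvFile, delimiter = ','))
```

With this layout the header row is not stored as a separate entry, so the list length equals the number of data rows.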


<h3>Python - Redis interaction</h3>

Interaction between Python and a Redis key-value store requires the installation of a specific driver. A list of drivers for various programming languages is provided on the <a href = 'https://redis.io/resources/clients/'>Clients</a> page of the Redis website: <a href = 'https://redis-py.readthedocs.io/en/stable/index.html'>redis-py</a> is the driver developed by <i>Redis Inc.</i> for Python.<br>Once installed, the driver can be imported into a Python environment in the usual way.

In [1]:
import redis

<h4>
Establishing a connection to a Redis database
</h4><br>
We can connect to a Redis instance by simply assigning a <code>Redis()</code> object to a Python variable. By default, the driver connects to a local Redis instance on port 6379; host name and port can also be specified as arguments. By default, redis-py returns responses as bytes. To receive responses decoded as strings, we set the <code>decode_responses</code> argument to <code>True</code>.

In [2]:
myRedis = redis.Redis(host = 'localhost', port = 6379, decode_responses = True)

A Redis connection object exposes the methods of the <code>CoreCommands</code> class, which replicate the <a href = 'https://redis-py.readthedocs.io/en/stable/commands.html'>commands</a> available in <i>redis-cli</i>. Since Python is case sensitive, however, method names must be typed in the correct letter case (they are normally lowercase). The list of all available methods is accessible via the usual <code>dir(<i>redisObject</i>)</code> function.

<h4>
Store data into Redis hashes [ ! LONG PROCESSES FOLLOW ! ]
</h4><br>
<i>Hashes</i> are a Redis data type that associates keys with values. A hash has a name and a set of key-value pairs. In our case, the hash keys (fields) are the field (column) names contained in the first row of the csv file (header), while the values are the field values contained in the other csv rows. We create one hash per row, with hash names of the form <i>smallDataset:rownumber</i>, where the text before the colon identifies the dataset and the text after the colon is the row number. Thus, each row becomes a hash, and the hash keys are common across all hashes in the dataset. This is helpful because the hash keys can work as a schema for implementing queries. This procedure is stored in a function that takes a dictionary as argument, so it is sufficient to feed it the desired dataset (one of the four differently sized datasets stored in the four dictionaries above) to send the data to Redis (the process is quite long even for the 250k rows dataset).

In [None]:
# PURPOSE: stores Python dictionary key-value pairs to Redis hashes with a prefix
# arguments: a Redis instance, a dictionary(dict), a string we want to use as hash prefix
# (usually the hash string ends with a colon to separate prefix and row number)
# RETURNS: nothing
def sendToRedis(redisInstance, datasetDict, hashPrefix):
    for i in range(1, len(datasetDict)):
        for j in range(0, len(datasetDict[0])):
            redisInstance.hset(hashPrefix + str(i), datasetDict[0][j], datasetDict[i][j])
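
Since the loop above issues one <code>HSET</code> round trip per field, a possible way to shorten the loading time is to send one <code>hset</code> per row with the <code>mapping</code> argument and to batch the commands through a redis-py pipeline. The following is only a sketch under these assumptions: the function name <code>sendToRedisPipelined</code> and the <code>batchSize</code> parameter are hypothetical, and no timing has been measured for it here.

```python
# hypothetical variant of sendToRedis: one HSET per row, sent in batches through a pipeline
def sendToRedisPipelined(redisInstance, datasetDict, hashPrefix, batchSize = 1000):
    header = datasetDict[0]
    # transaction = False: commands are only batched, not wrapped in MULTI/EXEC
    pipe = redisInstance.pipeline(transaction = False)
    queued = 0
    for i in range(1, len(datasetDict)):
        # pair each column name with the corresponding value of row i
        pipe.hset(hashPrefix + str(i), mapping = dict(zip(header, datasetDict[i])))
        queued += 1
        if queued == batchSize:
            pipe.execute()
            queued = 0
    # send the commands left over from the last incomplete batch
    if queued > 0:
        pipe.execute()
```

It would be called like the original function, for example <code>sendToRedisPipelined(myRedis, smallDB, 'smallDataset:')</code>.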

To remove the hashes sent to Redis as explained above, we can reverse the process: we loop over the row numbers of the dataset dictionary (its keys range from 0 to 250k, 500k, etc.), rebuild each hash name as prefix + row number and apply the <code>delete</code> method to it. This removes the hashes one by one (a long process, like loading them). Again, we store the procedure in a function.

In [None]:
# PURPOSE: removes Redis hashes sent with a prefix from a Python dictionary
# arguments: a Redis instance, a dictionary(dict), the string that was used as hash prefix
# (usually the hash string ends with a colon to separate prefix and row number)
# RETURNS: nothing
def removeFromRedis(redisInstance, datasetDict, hashPrefix):
    for i in range(1, len(datasetDict)):
        # the hash name is the prefix followed by the row number, as in sendToRedis
        redisInstance.delete(hashPrefix + str(i))
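
When the original dictionary is no longer in memory, the hashes could also be removed by scanning the keyspace for the prefix. The sketch below relies on the <code>scan_iter</code> and <code>delete</code> methods of redis-py; the function name <code>removeByPrefix</code> and the <code>batchSize</code> parameter are assumptions, and the prefix is assumed not to contain glob characters such as <code>*</code> or <code>?</code>.

```python
# hypothetical helper: deletes every key whose name starts with hashPrefix
# RETURNS: the number of keys actually removed
def removeByPrefix(redisInstance, hashPrefix, batchSize = 1000):
    removed = 0
    batch = []
    # scan_iter walks the keyspace incrementally instead of blocking the server like KEYS
    for keyName in redisInstance.scan_iter(match = hashPrefix + '*', count = batchSize):
        batch.append(keyName)
        if len(batch) == batchSize:
            removed += redisInstance.delete(*batch)
            batch = []
    # delete the keys left over from the last incomplete batch
    if batch:
        removed += redisInstance.delete(*batch)
    return removed
```

Deleting in batches keeps the number of round trips low while still avoiding a single very large command.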

In [None]:
# THESE LINES START LONG PROCESSES, SO I COMMENT THEM OUT TO AVOID ACCIDENTAL USE
'''
# 250k keys dict to Redis hashes
sendToRedis(myRedis, smallDB, 'smallDataset:')

# 500k keys dict to Redis hashes
sendToRedis(myRedis, mediumDB, 'mediumDataset:')

# 750k keys dict to Redis hashes
sendToRedis(myRedis, largeDB, 'largeDataset:')

# 1m keys dict to Redis hashes
sendToRedis(myRedis, humongousDB, 'humongousDataset:')
'''

In [None]:
# optional delete hashes process
# removeFromRedis(myRedis, smallDB, 'smallDataset:')

We can consider each hash as a single document in a collection of documents where keys are common across them.

<h4>
Executing a query (<i>RediSearch</i>)
</h4><br>
Queries can be performed by using the <a href = 'https://docs.redis.com/latest/stack/search/'>RediSearch module</a>, which builds indices based on the provided schema.

- Creating an index <code>FT.CREATE</code><br>
  This is a very important step to take before performing a query. Creating an index lets us define the <code>SCHEMA</code> of the data for the purpose of querying, so index creation is strongly query-oriented: the schema is in fact the list of secondary indices that we base our queries on.<br>
<code>
    FT.CREATE <i>indexName</i> ON hash PREFIX 1 <i>prefixPattern</i> SCHEMA [<i>fieldName</i> [TYPE] [OPTIONS] ... ]
</code>
<br>
  As the example syntax above shows, we specify:
  
  - the name of the index we are creating (<i>indexName</i>);
  - the data type on which we are creating it (HASH or JSON supported);
  - the key prefix (all the keys sharing this prefix are indexed, which lets us treat them as a collection);
  - the schema, i.e. the fields (hash keys, to be more precise) we want to use as indices followed by the value type (TEXT, NUMERIC, TAG, ...) and its options (SORTABLE, ...).<br><br>

- Performing a query <code>FT.SEARCH</code><br>
  After having created an index we can use the secondary indices in the schema to select elements based on specific values. 
<br>
<code>
    FT.SEARCH <i>indexName</i> '@fieldName:fieldValue' RETURN [nr of projected fields] [<i>fieldNames</i>]
</code>
<br>
  As the example syntax shows, we specify:
  
  - the name of the index we want to use (<i>indexName</i>);
  - the selection criteria (<i>fieldName</i> introduced by an <i><b>at sign (@)</b></i> and <i>fieldValue</i> introduced by a <i><b>colon sign (:)</b></i>);
  - the projected fields (introduced by the <code>RETURN</code> keyword and their number).
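
To make the two syntaxes above more concrete, the following sketch assembles the tokens of an <code>FT.CREATE</code> and an <code>FT.SEARCH</code> command as Python lists. The helper names <code>buildCreateArgs</code> and <code>buildSearchArgs</code> are hypothetical, and the index name, prefix and field names are only illustrative values. Such a list could in principle be sent as a raw command with <code>myRedis.execute_command(*args)</code>.

```python
# hypothetical helpers that only build the command tokens, they do not contact Redis
def buildCreateArgs(indexName, prefix, fields):
    # fields: list of (fieldName, fieldType) pairs, e.g. ('courseID', 'TEXT')
    args = ['FT.CREATE', indexName, 'ON', 'HASH', 'PREFIX', '1', prefix, 'SCHEMA']
    for fieldName, fieldType in fields:
        args += [fieldName, fieldType]
    return args

def buildSearchArgs(indexName, queryString, returnFields):
    # RETURN is followed by the number of projected fields and then their names
    return ['FT.SEARCH', indexName, queryString, 'RETURN', str(len(returnFields))] + list(returnFields)

createArgs = buildCreateArgs('idx:exampleIdx', 'smallDataset:',
                             [('studentID', 'TEXT'), ('courseID', 'TEXT'), ('materialID', 'TEXT')])
searchArgs = buildSearchArgs('idx:exampleIdx', '@courseID:450', ['firstName', 'lastName'])
print(' '.join(createArgs))
print(' '.join(searchArgs))
```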

<h4>
Executing a query in redis-py
</h4><br>
Within a Redis connection object, the <code>ft</code> method creates a new object providing methods that replicate the <a href = 'https://redis-py.readthedocs.io/en/stable/redismodules.html#redisearch-commands'>RediSearch commands</a> introduced in the previous paragraph. Like the methods for creating Redis data types, their names are lowercase only. We can create one <i>RediSearch</i> object for each index we want to use to perform queries.<br> It is also advisable to import the needed dependencies before creating an index. <code>TextField</code>, <code>NumericField</code> and <code>TagField</code> specify the value type of the fields included in the schema. <code>IndexDefinition</code> is needed to specify the common prefix of the Redis keys that must be indexed. <code>Query</code> is useful for complex queries, because it lets us chain parameters: applying a parameter to a <code>Query()</code> object returns a query object, so a chained parameter is applied to the query object returned by the preceding one, and so on. The <code>aggregation</code> dependency is needed to perform aggregate queries by passing an <i>aggregation request</i> to the <code>aggregate</code> method of the index object. Finally, the <code>reducers</code> dependency provides functions that reduce the records of each group to a single value, such as <i>count</i>, <i>sum</i>, <i>min</i>, <i>max</i>, <i>average</i> ...

In [60]:
# import dependencies
from redis.commands.search.field import TextField, NumericField, TagField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
import redis.commands.search.aggregation as aggregations
import redis.commands.search.reducers as reducers

<h4>
Create an index
</h4><br>
An index can be created with the <code>create_index</code> method of a <i>RediSearch</i> object. We first obtain the <i>RediSearch</i> object by passing the index name to <code>ft</code> (otherwise the default name '<i>idx</i>' is used) and assign it to a Python variable. Then we pass the schema and the index definition to <code>create_index</code>; each field name in the schema must be wrapped in the class that specifies its value type (<code>TextField</code>, <code>NumericField</code>, <code>TagField</code>). The <code>info()</code> method of the index object is useful to retrieve index information.<br>Notice that if we need to redefine a previously created index, we must first drop it with the <code>dropindex</code> method.

In [5]:
# create RediSearch object
exampleRS = myRedis.ft('idx:exampleIdx')

# (if it exists drop and) create index object
#exampleRS.dropindex()
schema = (TextField('studentID'), TextField('courseID'), TextField('materialID'))
# prefix expects a list of key prefixes; the definition must be passed by keyword
index_definition = IndexDefinition(prefix = ['smallDataset:'], index_type = IndexType.HASH)
exampleRS.create_index(schema, definition = index_definition)
#exampleRS.info()

'OK'

<h4>
Run the related query
</h4><br>
A query depends on the previously created index and is performed via the <code>search</code> method of the index object. It is sufficient to pass a query string to the <code>search</code> method: either a plain value, which is searched in all the secondary indices of the schema, or a string of the form '<i>@fieldName:fieldValue</i>', which searches the value in a specific index.

In [6]:
exampleRS.search('450')

Result{508 total, docs: [Document {'id': 'smallDataset:180027', 'payload': None, 'courseID': '450', 'discipline': 'miscellaneous', 'courseName': 'Game Theory', 'courseYear': '2023', 'syllabus': 'http://learning_platform.com/gametheory/syllabus', 'studentID': '1439', 'firstName': 'Danilo', 'lastName': 'Barbosa', 'dateOfBirth': '1975-9-5', 'genre': 'male', 'country': 'Granada', 'town': 'Santos do Norte', 'email': 'danilo.barbosa@hotmail.com', 'materialID': '32031', 'unit': 'Unit 4', 'materialType': 'lecture slides', 'name': '[SLIDES] Cellular Respiration, Part 1 ', 'dimension': '9', 'accessDate': '2023-05-19'}, Document {'id': 'smallDataset:102996', 'payload': None, 'courseID': '450', 'discipline': 'miscellaneous', 'courseName': 'Game Theory', 'courseYear': '2023', 'syllabus': 'http://learning_platform.com/gametheory/syllabus', 'studentID': '834', 'firstName': 'Geir', 'lastName': 'Brekke', 'dateOfBirth': '1993-10-22', 'genre': 'male', 'country': 'Republic of the Congo', 'town': 'Trondber

The result of a query stores the <b>total number of matching documents in the <code>total</code> attribute</b>, the <b>query execution time (in milliseconds) in the <code>duration</code> attribute</b> and <b>the docs (hashes) returned by the query in the <code>docs</code> attribute.</b> Assigning the query result to a Python variable allows us to retrieve this information at any time.

In [7]:
exampleQuery = exampleRS.search('450')
print('The query execution time was: %f milliseconds, and the number of documents matching it is: %i\n' % (exampleQuery.duration, exampleQuery.total))
print('A sample of the courseID and student name for the query results is the following:\n')
counter = 0
for doc in exampleQuery.docs:
    counter += 1
    print(counter, doc['courseID'], doc['firstName'], doc['lastName'])

The query execution time was: 7.215023 milliseconds, and the number of documents matching it is: 508

A sample of the courseID and student name for the query results is the following:

1 450 Danilo Barbosa
2 450 Geir Brekke
3 450 Jeffrey Wheeler
4 450 Jeffrey Wheeler
5 450 Jordan Cole
6 450 Sebastião Sousa
7 450 Danilo Barbosa
8 450 Candelaria Alba
9 450 Jordan Cole
10 450 Jeffrey Wheeler


The above example shows that the value '<i>450</i>' is searched across all the secondary indices in the schema of the <i>exampleRS</i> index, so the matches may come from student IDs, course IDs or material IDs. In the printed results, however, the only field that can match the searched value is the course ID. In other cases the match may come from the student ID or the material ID, but we cannot tell which, because the only indexed field we print is the course ID. The above result also shows that Redis limits the number of documents it returns. This is controlled by the optional argument <code>LIMIT</code>, which sets the offset and the number of results returned. The default is 0 10, which returns 10 items starting from the first (0) result. The redis-cli syntax is simply:<br>
<code>
    FT.SEARCH <i>indexName</i> '<i>@fieldName:fieldValue</i>' LIMIT [first num] RETURN [nr of projected fields] [<i>fieldNames</i>]
</code><br>
To control this parameter in our Python environment we must build the query differently. Passing a string to the <code>search</code> method is not sufficient: we need a <code>Query</code> object, as illustrated below. A <code>Query</code> object is used for complex queries and lets us set parameters on the object itself; its parameters can be chained to adapt the query results to our needs. One of them is <code>paging(<i>first</i>, <i>num</i>)</code>, which replicates the effect of the <code>LIMIT</code> argument in redis-cli.

In [8]:
exampleQuery2 = exampleRS.search(Query('450').paging(0, 15))
counter2 = 0
for doc in exampleQuery2.docs:
    counter2 += 1
    print(counter2, doc['courseID'], doc['firstName'], doc['lastName'])

1 450 Danilo Barbosa
2 450 Geir Brekke
3 450 Jeffrey Wheeler
4 450 Jeffrey Wheeler
5 450 Jordan Cole
6 450 Sebastião Sousa
7 450 Danilo Barbosa
8 450 Candelaria Alba
9 450 Jordan Cole
10 450 Jeffrey Wheeler
11 313 Walentina Bohnbach
12 313 Walentina Bohnbach
13 450 Candelaria Alba
14 450 Sebastião Sousa
15 450 Sebastião Sousa
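
If all the matching documents are needed rather than a single page, the offsets to pass to <code>paging</code> can be computed from the <code>total</code> attribute of a first result. The sketch below only computes the (offset, num) pairs; the function name <code>pagingWindows</code> is an assumption and the calls to Redis are left as comments.

```python
# hypothetical helper: yields the (offset, num) pairs that cover `total` results
# in pages of at most `pageSize` documents
def pagingWindows(total, pageSize):
    for offset in range(0, total, pageSize):
        yield offset, min(pageSize, total - offset)

# possible usage with the objects defined above (not executed here):
# total = exampleRS.search(Query('450').paging(0, 0)).total
# for offset, num in pagingWindows(total, 200):
#     page = exampleRS.search(Query('450').paging(offset, num))

windows = list(pagingWindows(508, 200))
print(windows)
```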


If we know exactly which information we need from a query, we can project only the required fields instead of retrieving the entire documents, which saves memory.

In [9]:
exampleQuery3 = exampleRS.search(Query('@studentID:450').return_fields('firstName', 'lastName', 'courseID').paging(0, 15))
print('A sample of the courseID and student name for the query results is the following:\n')
counter3 = 0
for doc in exampleQuery3.docs:
    counter3 += 1
    print(counter3, doc['courseID'], doc['firstName'], doc['lastName'])

A sample of the courseID and student name for the query results is the following:

1 313 Walentina Bohnbach
2 313 Walentina Bohnbach
3 313 Walentina Bohnbach
4 313 Walentina Bohnbach
5 161 Walentina Bohnbach
6 161 Walentina Bohnbach
7 313 Walentina Bohnbach
8 161 Walentina Bohnbach
9 313 Walentina Bohnbach
10 313 Walentina Bohnbach
11 161 Walentina Bohnbach
12 161 Walentina Bohnbach
13 313 Walentina Bohnbach
14 313 Walentina Bohnbach
15 313 Walentina Bohnbach


To avoid duplicate results we must use a different type of query, an aggregate query. We use the <code>aggregation</code> dependency to build an aggregation request, which we pass to the <code>aggregate</code> method. We also need to create a new index, because the schema must include the fields we want to use in the aggregation.

In [11]:
#exampleRS2.dropindex()

# new RediSearch object
exampleRS2 = myRedis.ft('idx:exampleIdx2')

# new index (with new schema)
schema2 = (TextField('studentID'), TextField('firstName'), TextField('lastName'))
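# the index definition must be passed by keyword: the second positional parameter is a different option
exampleRS2.create_index(schema2, definition = index_definition)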

'OK'

In [13]:
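# aggregate request and query
# (a list, unlike a set, keeps the group-by fields in a fixed order, so the positions read below are stable)
aggRequest = aggregations.AggregateRequest('@studentID:450').group_by(['@lastName', '@firstName'])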
exampleQuery4 = exampleRS2.aggregate(aggRequest)
for res in exampleQuery4.rows:
    print(res[3], res[1])

Walentina Bohnbach
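
Reading aggregation results by position (<code>res[1]</code>, <code>res[3]</code>) is easy to get wrong. Assuming that each row is a flat list alternating field names and values, as the indices used above suggest, a row can be turned into a dictionary and read by field name. The helper name <code>rowToDict</code> is hypothetical, and the sample row is written by hand for illustration.

```python
# hypothetical helper: turns a flat [name1, value1, name2, value2, ...] row into a dictionary
def rowToDict(row):
    # even positions hold the field names, odd positions hold the values
    return dict(zip(row[0::2], row[1::2]))

# hand-written sample row with the layout assumed above
sampleRow = ['lastName', 'Bohnbach', 'firstName', 'Walentina']
student = rowToDict(sampleRow)
print(student['firstName'], student['lastName'])
```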


<h4>
Measuring and displaying the query execution time
</h4><br>
To display the query execution time we can use the <code>duration</code> attribute of a query result.

In [14]:
print('The query execution time was: %f milliseconds.' % exampleQuery3.duration)
print('The query execution time was: %f milliseconds.' % exampleQuery4.duration)

The query execution time was: 6.802797 milliseconds.


AttributeError: 'AggregateResult' object has no attribute 'duration'

However, this attribute is not available for aggregation results, so it is better to rely on the <code>time()</code> function of the Python <code>time</code> module to record the time before and after an operation and compute the difference. The unit of measure here is seconds, so we obtain the execution time in milliseconds by multiplying the difference by 1000. Note that this measures the whole round trip as seen from Python, not only the time spent inside Redis.

In [4]:
from time import time

In [16]:
# performing exampleQuery3 again:
startEx3 = time()
exampleRS.search(Query('@studentID:450').return_fields('firstName', 'lastName', 'courseID').paging(0, 15))
endEx3 = time()
timeEx3 = (endEx3 - startEx3) * 1000

# performing exampleQuery4 again:
startEx4 = time()
exampleRS2.aggregate(aggRequest)
endEx4 = time()
timeEx4 = (endEx4 - startEx4) * 1000

print('The query execution time for exampleQuery3 was: %f milliseconds.' % timeEx3)
print('The query execution time for exampleQuery4 was: %f milliseconds.' % timeEx4)

The query execution time for exampleQuery3 was: 4.856825 milliseconds.
The query execution time for exampleQuery4 was: 3.792048 milliseconds.
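
The start/end pattern above can be wrapped in a small helper so that it does not have to be repeated for every query. The sketch below uses <code>perf_counter()</code> from the same <code>time</code> module, a clock intended for measuring short intervals; the function name <code>timedCall</code> is an assumption, and the rest of this notebook keeps using <code>time()</code>.

```python
from time import perf_counter

# hypothetical helper: calls func(*args, **kwargs)
# RETURNS: a tuple with the call result and the elapsed time in milliseconds
def timedCall(func, *args, **kwargs):
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsedMs = (perf_counter() - start) * 1000
    return result, elapsedMs

# possible usage with the objects defined above (not executed here):
# result, msec = timedCall(exampleRS2.aggregate, aggRequest)
```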


<h3>Query the datasets</h3>

In [5]:
smallDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
mediumDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
largeDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
humongousDict = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}

In [6]:
# mean function
def mean(aList):
    # arithmetic mean of a non-empty list of numbers
    return sum(aList) / len(aList)

<h3>Dataset with 250k records</h3>

For each dataset I use a dictionary of lists where the keys are the query names and the values are the execution times of 31 query executions: after each execution, I append the execution time, measured with <code>time()</code>, to the list. I will consider the first execution and the mean value of the following 30 executions. Since the values are required in milliseconds, before appending them I multiply them by 1000 and round them to five decimal places.
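
The procedure just described (one first run followed by 30 repetitions) could also be collected in a single function, sketched below. The name <code>benchmarkQuery</code> and its parameters are assumptions; <code>runQuery</code> stands for any callable without arguments that executes the query, and <code>repetitions</code> must be at least 1.

```python
from statistics import mean
from time import time

# hypothetical helper: runs runQuery once plus `repetitions` more times
# RETURNS: a dictionary with the first timing, the mean of the following ones and all timings (ms)
def benchmarkQuery(runQuery, repetitions = 30):
    timings = []
    for _ in range(1 + repetitions):
        start = time()
        runQuery()
        end = time()
        # seconds to milliseconds, rounded to five decimal places
        timings.append(round((end - start) * 1000, 5))
    return {'first': timings[0], 'mean': round(mean(timings[1:]), 5), 'timings': timings}

# possible usage with the objects defined below (not executed here):
# summary = benchmarkQuery(lambda: small_redis1.aggregate(aggRequest1))
```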

<h4>Query1</h4>

In [8]:
# create RediSearch object
small_redis1 = myRedis.ft('small_index1')

# (drop and) create index
#small_redis1.dropindex()
schema1 = (TextField('courseID'), TextField('firstName'), TextField('lastName'))
index_definition = IndexDefinition(prefix = ['smallDataset:'], index_type = IndexType.HASH)
small_redis1.create_index(schema1, definition = index_definition)

'OK'

In [9]:
# perform and show query (and measure time)
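# a list, unlike a set, keeps the group-by fields in a fixed order, so res[1] is the last name and res[3] the first name
aggRequest1 = aggregations.AggregateRequest('@courseID:192').group_by(['@lastName', '@firstName'])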
start = time()
small_query1 = small_redis1.aggregate(aggRequest1)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in small_query1.rows:
    print(res[1], res[3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
smallDict['query1'].append(round(msec_duration, 5))

Query result:

Leite Patrícia
Nicolas Émile
Lie Cathrine
Gaižauskas Vigilija
Amundsen Ingeborg
Arenas Casandra
Soylu Ledün
Laroche Arthur
Nicolas Nath
Narušis Ana

Query execution time in milliseconds:
16.279220581054688


In [10]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    small_redis1.aggregate(aggRequest1)
    end = time()
    msec_duration = (end - start) * 1000
    smallDict['query1'].append(round(msec_duration, 5))

In [11]:
smallDict

{'query1': [16.27922,
  6.15597,
  6.5372,
  5.46217,
  4.84991,
  5.18417,
  6.09493,
  5.25498,
  5.41401,
  6.77299,
  5.94616,
  5.58805,
  5.88107,
  6.3262,
  4.92072,
  5.63312,
  6.00696,
  6.35505,
  5.70583,
  4.34709,
  3.95799,
  3.59416,
  3.90315,
  3.81494,
  4.03619,
  3.80397,
  3.97706,
  4.01115,
  4.143,
  3.65019,
  3.19791],
 'query2': [],
 'query3': [],
 'query4': []}

<h4>Query2</h4>

In [12]:
# create RediSearch object
small_redis2 = myRedis.ft('small_index2')

# (drop and) create index
#small_redis2.dropindex()
schema2 = (TextField('discipline'), TextField('courseYear'), TextField('courseName'))
index_definition = IndexDefinition(prefix = ['smallDataset:'], index_type = IndexType.HASH)
small_redis2.create_index(schema2, definition = index_definition)

'OK'

In [13]:
# perform and show query (and measure time)
aggRequest2 = aggregations.AggregateRequest('@discipline:statistics @courseYear:2022').group_by('@courseName')
start = time()
small_query2 = small_redis2.aggregate(aggRequest2)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in small_query2.rows:
    print(res[1])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
smallDict['query2'].append(round(msec_duration, 5))

Query result:

Introduction to Probability and Data with R
Basic Statistics
Bayesian Statistics: From Concept to Data Analysis
Python and Statistics for Financial Analysis
Understanding Clinical Research: Behind the Statistics
Econometrics: Methods and Applications
Exploratory Data Analysis
Foundations: Data, Data, Everywhere
Introduction to Statistics

Query execution time in milliseconds:
22.401094436645508


In [14]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    small_redis2.aggregate(aggRequest2)
    end = time()
    msec_duration = (end - start) * 1000
    smallDict['query2'].append(round(msec_duration, 5))

In [15]:
smallDict

{'query1': [16.27922,
  6.15597,
  6.5372,
  5.46217,
  4.84991,
  5.18417,
  6.09493,
  5.25498,
  5.41401,
  6.77299,
  5.94616,
  5.58805,
  5.88107,
  6.3262,
  4.92072,
  5.63312,
  6.00696,
  6.35505,
  5.70583,
  4.34709,
  3.95799,
  3.59416,
  3.90315,
  3.81494,
  4.03619,
  3.80397,
  3.97706,
  4.01115,
  4.143,
  3.65019,
  3.19791],
 'query2': [22.40109,
  19.95373,
  17.82703,
  18.59808,
  20.56193,
  23.2172,
  24.76096,
  18.04614,
  24.69683,
  28.87392,
  30.65395,
  21.71397,
  15.95473,
  14.38427,
  11.82222,
  11.84297,
  12.01987,
  12.20703,
  11.15012,
  11.43909,
  13.12399,
  12.89296,
  13.03506,
  14.3919,
  13.36718,
  13.376,
  15.75804,
  12.60495,
  12.17389,
  13.32998,
  13.23199],
 'query3': [],
 'query4': []}

<h4>Query3</h4>

In [16]:
# create RediSearch object
small_redis3 = myRedis.ft('small_index3')

# (drop and) create index
#small_redis3.dropindex()
schema3 = (TextField('materialType'), TagField('discipline'), TextField('email'), TextField('firstName'))
index_definition = IndexDefinition(prefix = ['smallDataset:'], index_type = IndexType.HASH)
small_redis3.create_index(schema3, definition = index_definition)

'OK'

In [17]:
# perform and show query (and measure time)
aggRequest3 = aggregations.AggregateRequest('@discipline:{maths} @materialType:\'lecture slides\' @email:*gmail.com').group_by('@discipline', reducers.count().alias('count'))
start = time()
small_query3 = small_redis3.aggregate(aggRequest3)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
print(small_query3.rows[0][3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
smallDict['query3'].append(round(msec_duration, 5))

Query result:

632

Query execution time in milliseconds:
14.388084411621094


In [18]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    small_query3 = small_redis3.aggregate(aggRequest3)
    end = time()
    msec_duration = (end - start) * 1000
    smallDict['query3'].append(round(msec_duration, 5))

In [19]:
smallDict

{'query1': [16.27922,
  6.15597,
  6.5372,
  5.46217,
  4.84991,
  5.18417,
  6.09493,
  5.25498,
  5.41401,
  6.77299,
  5.94616,
  5.58805,
  5.88107,
  6.3262,
  4.92072,
  5.63312,
  6.00696,
  6.35505,
  5.70583,
  4.34709,
  3.95799,
  3.59416,
  3.90315,
  3.81494,
  4.03619,
  3.80397,
  3.97706,
  4.01115,
  4.143,
  3.65019,
  3.19791],
 'query2': [22.40109,
  19.95373,
  17.82703,
  18.59808,
  20.56193,
  23.2172,
  24.76096,
  18.04614,
  24.69683,
  28.87392,
  30.65395,
  21.71397,
  15.95473,
  14.38427,
  11.82222,
  11.84297,
  12.01987,
  12.20703,
  11.15012,
  11.43909,
  13.12399,
  12.89296,
  13.03506,
  14.3919,
  13.36718,
  13.376,
  15.75804,
  12.60495,
  12.17389,
  13.32998,
  13.23199],
 'query3': [14.38808,
  14.73188,
  14.78505,
  14.94598,
  13.30996,
  13.36408,
  15.13815,
  12.89868,
  13.82518,
  18.49914,
  13.098,
  40.06886,
  19.76323,
  10.71787,
  10.88619,
  9.35912,
  7.98392,
  9.18984,
  8.02279,
  7.91383,
  9.0909,
  8.37517,
  8.1980

<h4>Query4</h4>

In [20]:
# create RediSearch object
small_redis4 = myRedis.ft('small_index4')

# (drop and) create index
#small_redis4.dropindex()
schema4 = (TagField('discipline'), TagField('courseYear'), TextField('country'), TextField('dateOfBirth'), TextField('firstName'), TextField('lastName', sortable = True))
index_definition = IndexDefinition(prefix = ['smallDataset:'], index_type = IndexType.HASH)
small_redis4.create_index(schema4, definition = index_definition)

'OK'

In [21]:
# perform and show query (and measure time)
# a list, unlike a set, keeps the group-by fields in a fixed order, so the positions read below are stable
aggRequest4 = aggregations.AggregateRequest('@discipline:{psychology} AND @courseYear:{2023} AND @country:*orea AND -@dateOfBirth:200*').group_by(['@lastName', '@dateOfBirth', '@firstName', '@country']).sort_by('@lastName')
start = time()
small_query4 = small_redis4.aggregate(aggRequest4)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in small_query4.rows:
    print(res[1], res[5], res[7], res[3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
smallDict['query4'].append(round(msec_duration, 5))

Query result:

lie Cathrine South Korea 1986-7-12
reynolds Lynda Korea 1989-7-21
sura Raghav North Korea 1973-11-27

Query execution time in milliseconds:
8.751869201660156


In [22]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    small_redis4.aggregate(aggRequest4)
    end = time()
    msec_duration = (end - start) * 1000
    smallDict['query4'].append(round(msec_duration, 5))

In [23]:
smallDict

{'query1': [16.27922,
  6.15597,
  6.5372,
  5.46217,
  4.84991,
  5.18417,
  6.09493,
  5.25498,
  5.41401,
  6.77299,
  5.94616,
  5.58805,
  5.88107,
  6.3262,
  4.92072,
  5.63312,
  6.00696,
  6.35505,
  5.70583,
  4.34709,
  3.95799,
  3.59416,
  3.90315,
  3.81494,
  4.03619,
  3.80397,
  3.97706,
  4.01115,
  4.143,
  3.65019,
  3.19791],
 'query2': [22.40109,
  19.95373,
  17.82703,
  18.59808,
  20.56193,
  23.2172,
  24.76096,
  18.04614,
  24.69683,
  28.87392,
  30.65395,
  21.71397,
  15.95473,
  14.38427,
  11.82222,
  11.84297,
  12.01987,
  12.20703,
  11.15012,
  11.43909,
  13.12399,
  12.89296,
  13.03506,
  14.3919,
  13.36718,
  13.376,
  15.75804,
  12.60495,
  12.17389,
  13.32998,
  13.23199],
 'query3': [14.38808,
  14.73188,
  14.78505,
  14.94598,
  13.30996,
  13.36408,
  15.13815,
  12.89868,
  13.82518,
  18.49914,
  13.098,
  40.06886,
  19.76323,
  10.71787,
  10.88619,
  9.35912,
  7.98392,
  9.18984,
  8.02279,
  7.91383,
  9.0909,
  8.37517,
  8.1980

In [24]:
smallDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in smallDict:
    smallDataset[key].append(smallDict[key][0])
    mean30 = mean(smallDict[key][1 : 31])
    smallDataset[key].append(round(mean30, 5))
smallDataset

{'query1': [16.27922, 5.01754],
 'query2': [22.40109, 16.567],
 'query3': [14.38808, 12.08061],
 'query4': [8.75187, 7.24271]}

<h3>Dataset with 500k records</h3>

<h4>Query1</h4>

In [82]:
# create RediSearch object
medium_redis1 = myRedis.ft('medium_index1')

# (drop and) create index
medium_redis1.dropindex()
schema1 = (TextField('courseID'), TextField('firstName'), TextField('lastName'))
index_definition = IndexDefinition(prefix = ['mediumDataset:'], index_type = IndexType.HASH)
medium_redis1.create_index(schema1, definition = index_definition)

'OK'

In [83]:
# perform and show query (and measure time)
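# a list, unlike a set, keeps the group-by fields in a fixed order, so res[1] is the last name and res[3] the first name
aggRequest1 = aggregations.AggregateRequest('@courseID:192').group_by(['@lastName', '@firstName'])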
start = time()
medium_query1 = medium_redis1.aggregate(aggRequest1)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in medium_query1.rows:
    print(res[1], res[3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
mediumDict['query1'].append(round(msec_duration, 5))

Query result:


Query execution time in milliseconds:
3.821134567260742


In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    medium_redis1.aggregate(aggRequest1)
    end = time()
    msec_duration = (end - start) * 1000
    mediumDict['query1'].append(round(msec_duration, 5))

In [38]:
mediumDict

{'query1': [7.02071, 10.00214, 7.14111],
 'query2': [],
 'query3': [],
 'query4': []}
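
The timing loop used above is the same for every query and dataset. A minimal sketch of how it could be factored into a reusable function follows. The name <code>time_calls</code> is hypothetical, and the sketch swaps <code>time()</code> for <code>time.perf_counter()</code>, a clock from the standard library intended for measuring short durations. The workload in the example is a stand-in so that the block runs without a Redis server.

```python
from time import perf_counter

def time_calls(run_query, repetitions=30):
    """Call run_query `repetitions` times and return each duration in milliseconds."""
    durations = []
    for _ in range(repetitions):
        start = perf_counter()
        run_query()
        end = perf_counter()
        durations.append(round((end - start) * 1000, 5))
    return durations

# stand-in workload (assumption): replace the lambda with the real query call
timings = time_calls(lambda: sum(range(1000)), repetitions=5)
print(len(timings))
```

In this notebook the call could look like <code>mediumDict['query1'].extend(time_calls(lambda: medium_redis1.aggregate(aggRequest1)))</code>.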

<h4>Query2</h4>

In [None]:
# create RediSearch object
medium_redis2 = myRedis.ft('medium_index2')

# (drop and) create index
#medium_redis2.dropindex()
schema2 = (TextField('discipline'), TextField('courseYear'), TextField('courseName'))
index_definition = IndexDefinition(prefix = ['mediumDataset:'], index_type = IndexType.HASH)
medium_redis2.create_index(schema2, definition = index_definition)

In [None]:
# perform and show query (and measure time)
aggRequest2 = aggregations.AggregateRequest('@discipline:statistics @courseYear:2022').group_by('@courseName')
start = time()
medium_query2 = medium_redis2.aggregate(aggRequest2)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in medium_query2.rows:
    print(res[1])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
mediumDict['query2'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    medium_redis2.aggregate(aggRequest2)
    end = time()
    msec_duration = (end - start) * 1000
    mediumDict['query2'].append(round(msec_duration, 5))

In [None]:
mediumDict

<h4>Query3</h4>

In [None]:
# create RediSearch object
medium_redis3 = myRedis.ft('medium_index3')

# (drop and) create index
#medium_redis3.dropindex()
schema3 = (TextField('materialType'), TagField('discipline'), TextField('email'), TextField('firstName'))
index_definition = IndexDefinition(prefix = ['mediumDataset:'], index_type = IndexType.HASH)
medium_redis3.create_index(schema3, definition = index_definition)

In [None]:
# perform and show query (and measure time)
aggRequest3 = aggregations.AggregateRequest('@discipline:{maths} @materialType:\'lecture slides\' @email:*gmail.com').group_by('@discipline', reducers.count().alias('count'))
start = time()
medium_query3 = medium_redis3.aggregate(aggRequest3)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
print(medium_query3.rows[0][3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
mediumDict['query3'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    medium_query3 = medium_redis3.aggregate(aggRequest3)
    end = time()
    msec_duration = (end - start) * 1000
    mediumDict['query3'].append(round(msec_duration, 5))

In [None]:
mediumDict

<h4>Query4</h4>

In [None]:
# create RediSearch object
medium_redis4 = myRedis.ft('medium_index4')

# (drop and) create index
#medium_redis4.dropindex()
schema4 = (TagField('discipline'), TagField('courseYear'), TextField('country'), TextField('dateOfBirth'), TextField('firstName'), TextField('lastName', sortable = True))
index_definition = IndexDefinition(prefix = ['mediumDataset:'], index_type = IndexType.HASH)
medium_redis4.create_index(schema4, definition = index_definition)

In [None]:
# perform and show query (and measure time)
# group_by receives an ordered list (a set has no fixed iteration order): firstName, lastName, country, dateOfBirth
aggRequest4 = aggregations.AggregateRequest('@discipline:{psychology} AND @courseYear:{2023} AND @country:*orea AND -@dateOfBirth:200*').group_by(['@firstName', '@lastName', '@country', '@dateOfBirth']).sort_by('@lastName')
start = time()
medium_query4 = medium_redis4.aggregate(aggRequest4)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in medium_query4.rows:
    # each row alternates field name and value, so the values sit at the odd indices
    print(res[1], res[3], res[5], res[7])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
mediumDict['query4'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    medium_redis4.aggregate(aggRequest4)
    end = time()
    msec_duration = (end - start) * 1000
    mediumDict['query4'].append(round(msec_duration, 5))

In [None]:
mediumDict

In [None]:
mediumDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in mediumDict:
    mediumDataset[key].append(mediumDict[key][0])
    mean30 = mean(mediumDict[key][1 : 31])
    mediumDataset[key].append(round(mean30, 5))
mediumDataset

<h3>Dataset with 750k records</h3>

<h4>Query1</h4>

In [None]:
# create RediSearch object
large_redis1 = myRedis.ft('large_index1')

# (drop and) create index
#large_redis1.dropindex()
schema1 = (TextField('courseID'), TextField('firstName'), TextField('lastName'))
index_definition = IndexDefinition(prefix = ['largeDataset:'], index_type = IndexType.HASH)
large_redis1.create_index(schema1, definition = index_definition)

In [None]:
# perform and show query (and measure time)
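# group_by receives an ordered list so the fields are always sent in the same order (a set has no fixed iteration order)
aggRequest1 = aggregations.AggregateRequest('@courseID:192').group_by(['@firstName', '@lastName'])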
start = time()
large_query1 = large_redis1.aggregate(aggRequest1)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in large_query1.rows:
    print(res[1], res[3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
largeDict['query1'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    large_redis1.aggregate(aggRequest1)
    end = time()
    msec_duration = (end - start) * 1000
    largeDict['query1'].append(round(msec_duration, 5))

In [None]:
largeDict

<h4>Query2</h4>

In [None]:
# create RediSearch object
large_redis2 = myRedis.ft('large_index2')

# (drop and) create index
#large_redis2.dropindex()
schema2 = (TextField('discipline'), TextField('courseYear'), TextField('courseName'))
index_definition = IndexDefinition(prefix = ['largeDataset:'], index_type = IndexType.HASH)
large_redis2.create_index(schema2, definition = index_definition)

In [None]:
# perform and show query (and measure time)
aggRequest2 = aggregations.AggregateRequest('@discipline:statistics @courseYear:2022').group_by('@courseName')
start = time()
large_query2 = large_redis2.aggregate(aggRequest2)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in large_query2.rows:
    print(res[1])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
largeDict['query2'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    large_redis2.aggregate(aggRequest2)
    end = time()
    msec_duration = (end - start) * 1000
    largeDict['query2'].append(round(msec_duration, 5))

In [None]:
largeDict

<h4>Query3</h4>

In [None]:
# create RediSearch object
large_redis3 = myRedis.ft('large_index3')

# (drop and) create index
#large_redis3.dropindex()
schema3 = (TextField('materialType'), TagField('discipline'), TextField('email'), TextField('firstName'))
index_definition = IndexDefinition(prefix = ['largeDataset:'], index_type = IndexType.HASH)
large_redis3.create_index(schema3, definition = index_definition)

In [None]:
# perform and show query (and measure time)
aggRequest3 = aggregations.AggregateRequest('@discipline:{maths} @materialType:\'lecture slides\' @email:*gmail.com').group_by('@discipline', reducers.count().alias('count'))
start = time()
large_query3 = large_redis3.aggregate(aggRequest3)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
print(large_query3.rows[0][3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
largeDict['query3'].append(round(msec_duration, 5))


In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    large_query3 = large_redis3.aggregate(aggRequest3)
    end = time()
    msec_duration = (end - start) * 1000
    largeDict['query3'].append(round(msec_duration, 5))

In [None]:
largeDict

<h4>Query4</h4>

In [None]:
# create RediSearch object
large_redis4 = myRedis.ft('large_index4')

# (drop and) create index
#large_redis4.dropindex()
schema4 = (TagField('discipline'), TagField('courseYear'), TextField('country'), TextField('dateOfBirth'), TextField('firstName'), TextField('lastName', sortable = True))
index_definition = IndexDefinition(prefix = ['largeDataset:'], index_type = IndexType.HASH)
large_redis4.create_index(schema4, definition = index_definition)

In [None]:
# perform and show query (and measure time)
# group_by receives an ordered list (a set has no fixed iteration order): firstName, lastName, country, dateOfBirth
aggRequest4 = aggregations.AggregateRequest('@discipline:{psychology} AND @courseYear:{2023} AND @country:*orea AND -@dateOfBirth:200*').group_by(['@firstName', '@lastName', '@country', '@dateOfBirth']).sort_by('@lastName')
start = time()
large_query4 = large_redis4.aggregate(aggRequest4)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in large_query4.rows:
    # each row alternates field name and value, so the values sit at the odd indices
    print(res[1], res[3], res[5], res[7])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
largeDict['query4'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    large_redis4.aggregate(aggRequest4)
    end = time()
    msec_duration = (end - start) * 1000
    largeDict['query4'].append(round(msec_duration, 5))

In [None]:
largeDict

In [None]:
largeDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in largeDict:
    largeDataset[key].append(largeDict[key][0])
    mean30 = mean(largeDict[key][1 : 31])
    largeDataset[key].append(round(mean30, 5))
largeDataset

<h3>Dataset with 1m records</h3>

<h4>Query1</h4>

In [None]:
# create RediSearch object
humongous_redis1 = myRedis.ft('humongous_index1')

# (drop and) create index
#humongous_redis1.dropindex()
schema1 = (TextField('courseID'), TextField('firstName'), TextField('lastName'))
index_definition = IndexDefinition(prefix = ['humongousDataset:'], index_type = IndexType.HASH)
humongous_redis1.create_index(schema1, definition = index_definition)

In [None]:
# perform and show query (and measure time)
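# group_by receives an ordered list so the fields are always sent in the same order (a set has no fixed iteration order)
aggRequest1 = aggregations.AggregateRequest('@courseID:192').group_by(['@firstName', '@lastName'])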
start = time()
humongous_query1 = humongous_redis1.aggregate(aggRequest1)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in humongous_query1.rows:
    print(res[1], res[3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
humongousDict['query1'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    humongous_redis1.aggregate(aggRequest1)
    end = time()
    msec_duration = (end - start) * 1000
    humongousDict['query1'].append(round(msec_duration, 5))

In [None]:
humongousDict

<h4>Query2</h4>

In [None]:
# create RediSearch object
humongous_redis2 = myRedis.ft('humongous_index2')

# (drop and) create index
#humongous_redis2.dropindex()
schema2 = (TextField('discipline'), TextField('courseYear'), TextField('courseName'))
index_definition = IndexDefinition(prefix = ['humongousDataset:'], index_type = IndexType.HASH)
humongous_redis2.create_index(schema2, definition = index_definition)

In [None]:
# perform and show query (and measure time)
aggRequest2 = aggregations.AggregateRequest('@discipline:statistics @courseYear:2022').group_by('@courseName')
start = time()
humongous_query2 = humongous_redis2.aggregate(aggRequest2)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in humongous_query2.rows:
    print(res[1])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
humongousDict['query2'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    humongous_redis2.aggregate(aggRequest2)
    end = time()
    msec_duration = (end - start) * 1000
    humongousDict['query2'].append(round(msec_duration, 5))

In [None]:
humongousDict

<h4>Query3</h4>

In [None]:
# create RediSearch object
humongous_redis3 = myRedis.ft('humongous_index3')

# (drop and) create index
#humongous_redis3.dropindex()
schema3 = (TextField('materialType'), TagField('discipline'), TextField('email'), TextField('firstName'))
index_definition = IndexDefinition(prefix = ['humongousDataset:'], index_type = IndexType.HASH)
humongous_redis3.create_index(schema3, definition = index_definition)

In [None]:
# perform and show query (and measure time)
aggRequest3 = aggregations.AggregateRequest('@discipline:{maths} @materialType:\'lecture slides\' @email:*gmail.com').group_by('@discipline', reducers.count().alias('count'))
start = time()
humongous_query3 = humongous_redis3.aggregate(aggRequest3)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
print(humongous_query3.rows[0][3])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
humongousDict['query3'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    humongous_query3 = humongous_redis3.aggregate(aggRequest3)
    end = time()
    msec_duration = (end - start) * 1000
    humongousDict['query3'].append(round(msec_duration, 5))

In [None]:
humongousDict

<h4>Query4</h4>

In [None]:
# create RediSearch object
humongous_redis4 = myRedis.ft('humongous_index4')

# (drop and) create index
#humongous_redis4.dropindex()
schema4 = (TagField('discipline'), TagField('courseYear'), TextField('country'), TextField('dateOfBirth'), TextField('firstName'), TextField('lastName', sortable = True))
index_definition = IndexDefinition(prefix = ['humongousDataset:'], index_type = IndexType.HASH)
humongous_redis4.create_index(schema4, definition = index_definition)

In [None]:
# perform and show query (and measure time)
# group_by receives an ordered list (a set has no fixed iteration order): firstName, lastName, country, dateOfBirth
aggRequest4 = aggregations.AggregateRequest('@discipline:{psychology} AND @courseYear:{2023} AND @country:*orea AND -@dateOfBirth:200*').group_by(['@firstName', '@lastName', '@country', '@dateOfBirth']).sort_by('@lastName')
start = time()
humongous_query4 = humongous_redis4.aggregate(aggRequest4)
end = time()
msec_duration = (end - start) * 1000

print('Query result:\n')
for res in humongous_query4.rows:
    # each row alternates field name and value, so the values sit at the odd indices
    print(res[1], res[3], res[5], res[7])
print('\nQuery execution time in milliseconds:\n' + str(msec_duration))
humongousDict['query4'].append(round(msec_duration, 5))

In [None]:
# perform query 30 more times
for i in range(0, 30):
    start = time()
    humongous_redis4.aggregate(aggRequest4)
    end = time()
    msec_duration = (end - start) * 1000
    humongousDict['query4'].append(round(msec_duration, 5))

In [None]:
humongousDict

In [None]:
humongousDataset = {'query1' : list(), 'query2' : list(), 'query3' : list(), 'query4' : list()}
for key in humongousDict:
    humongousDataset[key].append(humongousDict[key][0])
    mean30 = mean(humongousDict[key][1 : 31])
    humongousDataset[key].append(round(mean30, 5))
humongousDataset

In [None]:
import csv

# write the first-run time and the 30-run mean of each dataset, preceded by a one-letter label row
labelledDatasets = [('s', smallDataset),      # s stands for small dataset
                    ('m', mediumDataset),     # m stands for medium dataset
                    ('l', largeDataset),      # l stands for large dataset
                    ('h', humongousDataset)]  # h stands for humongous dataset
with open('redis_tests.csv', 'w', newline = '') as redis_tests:
    writer = csv.writer(redis_tests, delimiter = ',')
    keys = list(smallDataset.keys())
    limit = len(smallDataset['query1'])

    writer.writerow(keys)
    for label, dataset in labelledDatasets:
        writer.writerow([label])
        for i in range(0, limit):
            writer.writerow([dataset[k][i] for k in keys])
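
The file written above interleaves one-letter label rows with data rows, so a reader has to track which block a line belongs to. A sketch of an alternative "tidy" layout, with one row per dataset and measurement, is given below. The function name <code>write_tidy_results</code> and the column names <code>dataset</code>, <code>measurement</code>, <code>first_run</code> and <code>mean_30</code> are hypothetical. The example reuses the <code>smallDataset</code> values printed earlier in this notebook and writes to an in-memory buffer rather than to <code>redis_tests.csv</code>.

```python
import csv
import io

def write_tidy_results(out_file, results):
    """Write one CSV row per (dataset, measurement) pair.

    `results` maps a dataset label to a dict of per-query [first_run, mean_of_30] lists.
    """
    queries = sorted(next(iter(results.values())))
    writer = csv.writer(out_file, delimiter=',')
    writer.writerow(['dataset', 'measurement'] + queries)
    for label, summary in results.items():
        for position, measurement in enumerate(('first_run', 'mean_30')):
            writer.writerow([label, measurement] + [summary[q][position] for q in queries])

# values copied from the smallDataset output shown earlier in this notebook
small = {'query1': [16.27922, 5.01754], 'query2': [22.40109, 16.567],
         'query3': [14.38808, 12.08061], 'query4': [8.75187, 7.24271]}
buffer = io.StringIO()
write_tidy_results(buffer, {'small': small})
print(buffer.getvalue())
```

Passing a file opened with <code>open('redis_tests.csv', 'w', newline = '')</code> instead of the buffer, and a dictionary holding all four summaries, would produce a file that can be loaded directly by tabular tools.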
