<center>
    <h2>Online learning platform database - Redis</h2>
    <h3>Methodologies applied for loading the data and performing the queries</h3>
</center>

<h3>Preliminary operations: import csv files into Redis</h3>

Since Redis is a <i>key-value store</i> rather than a classical DBMS, importing data is best done through a programming language. This requires loading the csv file into a suitable data structure, then connecting to a Redis instance from the same language and storing the data in a Redis data type. Hence, this section on Redis has a slightly different format from the previous ones: it first describes how to manage csv files in Python, then introduces the Redis Python driver and the methods used to connect to a Redis instance from Python.

<h3>Importing csv files into Python</h3>

Importing a csv file into Python is best performed via the <code>csv</code> module of the Python standard library, which provides objects for reading (<code>csv.reader</code>) and writing (<code>csv.writer</code>) csv files. After opening a file we can read it line by line and store the fields of each line in a list. The lines are collected in a dictionary of lists, where each key is a line number (key 0 holds the header) and the associated list contains the fields of that line. This procedure is wrapped in a function that can be called for each of the four datasets.

In [1]:
#   PURPOSE:  stores a csv file into a Python dictionary (dictionary keys are row numbers, values are rows as lists)
# ARGUMENTS:  a path (string), a csv filename including extension (string)
#   RETURNS:  a dictionary (key 0 holds the header row)
def importCSVfile(pathName, csvFileName):
    import csv
    key = 0
    dictName = dict()
    with open(pathName + csvFileName, newline = '') as csvFile:
        reader = csv.reader(csvFile, delimiter = ',')
        for line in reader:
            dictName[key] = line
            key += 1
    return dictName
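
As a possible alternative to the positional lists used above, <code>csv.DictReader</code> from the same standard-library module pairs every field with its header name while reading. The sketch below is only an illustration: the function name <code>importCSVasDicts</code> and the tiny sample file are hypothetical, and it assumes that the first csv line is the header.

```python
import csv
import os
import tempfile

def importCSVasDicts(filePath):
    """Return the data rows of a csv file as a list of {header: value} dicts."""
    with open(filePath, newline='') as csvFile:
        # DictReader consumes the first line as the header automatically
        return list(csv.DictReader(csvFile))

# tiny hypothetical sample standing in for one of the four datasets
samplePath = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(samplePath, 'w', newline='') as f:
    f.write('studentID,courseID\n1439,450\n834,450\n')

sampleRows = importCSVasDicts(samplePath)
print(len(sampleRows), sampleRows[0]['courseID'])
```

With this layout the header row no longer occupies a position in the collection, so the number of items equals the number of data rows.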

In [2]:
# call the importCSVfile function for the four differently sized datasets
path = '/Users/mau/OneDrive - unime.it/Learning/CdL Informatica/Anno II - Database/Module B/project/tables/'

# 250k rows dataset to dict
smallDB = importCSVfile(path, 'dataset250k.csv')

# 500k rows dataset to dict
mediumDB = importCSVfile(path, 'dataset500k.csv')

# 750k rows dataset to dict
largeDB = importCSVfile(path, 'dataset750k.csv')

# 1m rows dataset to dict
humongousDB = importCSVfile(path, 'dataset1m.csv')

In [3]:
print('Lengths of the four dictionaries, from the smallest to the largest:\n')
print('smallDB:', len(smallDB), 'mediumDB:', len(mediumDB), 'largeDB:', len(largeDB), 'humongousDB:', len(humongousDB))

Lengths of the four dictionaries, from the smallest to the largest:

smallDB: 250001 mediumDB: 500001 largeDB: 750001 humongousDB: 1000001


<h3>Python - Redis interaction</h3>

Interaction between a Python program and a Redis key-value store requires the installation of a specific driver. The list of drivers for the various programming languages is provided in the <a href = 'https://redis.io/resources/clients/'>Clients</a> web page of the Redis website: <a href = 'https://redis-py.readthedocs.io/en/stable/index.html'>redis-py</a> is the driver developed by <i>Redis Inc.</i> for a Python programming environment.<br>Once the driver is installed, it can be imported into a Python environment in the usual way.

In [3]:
import redis

<h4>
Establishing a connection to a Redis database
</h4><br>
We can connect to a Redis instance by simply assigning a <code>Redis()</code> object to a Python variable. By default, the driver connects to a local Redis instance on port 6379; host name and port can also be specified as arguments. By default, redis-py returns responses as bytes. Setting the <code>decode_responses</code> argument to <code>True</code> makes it return them decoded as strings instead.

In [4]:
myRedis = redis.Redis(host = 'localhost', port = 6379, decode_responses = True)

I found it more convenient to use a different Redis instance for each dataset when performing the querying tests. This way each instance is dedicated to a single hash prefix and can be conceptually treated as an independent collection of documents. Alternatively, I could store all 2.5 million keys in a single Redis instance, and the hash prefix would identify the '<i>collection</i>' each document belongs to. For the chosen implementation, I create three more connections, each to a separate Redis instance listening on its own port: the first instance (<i>myRedis</i>) stores the smallest dataset, the second (<i>myRedis2</i>) the dataset with 500k records, the third (<i>myRedis3</i>) the dataset with 750k records and the fourth (<i>myRedis4</i>) the dataset with 1m records.

In [5]:
myRedis2 = redis.Redis(host = 'localhost', port = 6382, decode_responses = True)
myRedis3 = redis.Redis(host = 'localhost', port = 6383, decode_responses = True)
myRedis4 = redis.Redis(host = 'localhost', port = 6384, decode_responses = True)

A Redis connection implements a <code>CoreCommands</code> class which contains functions that can replicate all the <a href = 'https://redis-py.readthedocs.io/en/stable/commands.html'>commands</a> provided within the <i>redis-cli</i> API. Since Python is case sensitive, however, they must be typed in the correct letter case (they usually use lowercase letters). The list of all available methods is accessible via the usual <code>dir(<i>redisObject</i>)</code> function.

<h4>
Store data into Redis hashes [ ! LONG PROCESSES FOLLOW ! ]
</h4><br>
<i>Hashes</i> are a Redis data type that allows the association of keys and values. A hash object has a name and a list of key-value pairs (called fields and values in Redis terminology). In our case, the keys are the field (column) names contained in the first row of the csv file (header), while the values are the field values contained in the other csv rows. We create one hash per row with names of the form <i>prefix:rownumber</i>, where the text before the colon identifies the dataset and the text after the colon is the row number. Thus, each row becomes a hash, and the hash keys are common across all hashes in the dataset. This is helpful because the hash keys can work as a schema for implementing queries. The procedure is wrapped in a function that takes a dictionary as argument, so it is sufficient to feed it one of the four dictionaries above to have the data sent to Redis (the process is quite long even for the 250k rows dataset, since it issues one command per field of every row).

In [21]:
#   PURPOSE: stores Python dictionary key-value pairs to Redis hashes with a prefix
# ARGUMENTS: a Redis instance, a dictionary(dict), a string we want to use as hash prefix
#            (usually the hash string ends with a colon to separate prefix and row number)
#   RETURNS: nothing
def sendToRedis(redisInstance, datasetDict, hashPrefix):
    for i in range(1, len(datasetDict)):
        for j in range(0, len(datasetDict[0])):
            redisInstance.hset(hashPrefix + str(i), datasetDict[0][j], datasetDict[i][j])
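
A possible way to shorten the loading time is to send one <code>HSET</code> per row instead of one per field, and to queue the commands in a pipeline so that many of them travel in a single round trip. The following is a minimal sketch, not the procedure used for the tests: the function name <code>sendToRedisPipelined</code> and the <code>batchSize</code> parameter are hypothetical, while <code>pipeline</code>, <code>hset(..., mapping=...)</code> and <code>execute</code> are redis-py methods.

```python
def sendToRedisPipelined(redisInstance, datasetDict, hashPrefix, batchSize=1000):
    """Store each data row as one hash, sending the commands in batches."""
    header = datasetDict[0]                      # row 0 holds the field names
    pipe = redisInstance.pipeline(transaction=False)
    for i in range(1, len(datasetDict)):
        # one HSET per row: all fields are passed together through `mapping`
        pipe.hset(hashPrefix + str(i), mapping=dict(zip(header, datasetDict[i])))
        if i % batchSize == 0:
            pipe.execute()                       # flush the queued commands
    pipe.execute()                               # flush whatever is left
```

It takes the same arguments as <code>sendToRedis</code>, for example <code>sendToRedisPipelined(myRedis, smallDB, 'smallDataset:')</code>. The batch size of 1000 is an arbitrary starting point that would need tuning.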

If we want to remove the hashes sent to Redis as explained above, we can reverse the process: loop over the row numbers of the dataset dictionary (its keys range from 0 to 250k, 500k, etc.), rebuild each hash name as prefix + row number, and apply the <code>delete</code> method to it. This removes the hashes one by one (a long process, like loading them). Again, we store the procedure in a function.

In [22]:
#   PURPOSE: removes Redis hashes sent with a prefix from a Python dictionary
# ARGUMENTS: a Redis instance, a dictionary(dict), a string we want to use as hash prefix
#            (usually the hash string ends with a colon to separate prefix and row number)
#   RETURNS: nothing
def removeFromRedis(redisInstance, datasetDict, hashPrefix):
    for i in range(1, len(datasetDict)):
        redisInstance.delete(hashPrefix + str(i))
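
Since the <code>delete</code> method of redis-py accepts several key names in one call, the removal can also be done in chunks rather than key by key. The sketch below illustrates the idea under that assumption; the function name <code>removeFromRedisBatched</code> and the <code>batchSize</code> parameter are hypothetical.

```python
def removeFromRedisBatched(redisInstance, datasetDict, hashPrefix, batchSize=1000):
    """Delete the hashes created for a dataset, several keys per call."""
    rowNumbers = range(1, len(datasetDict))      # row 0 is the header, never stored
    for start in range(0, len(rowNumbers), batchSize):
        # rebuild the names of the next chunk of hashes
        keys = [hashPrefix + str(i) for i in rowNumbers[start:start + batchSize]]
        redisInstance.delete(*keys)
```

It is called like <code>removeFromRedis</code>, with the same prefix that was used when loading the data.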

In [None]:
# THESE LINES START LONG PROCESSES, SO I COMMENT THEM OUT TO AVOID ACCIDENTAL EXECUTION
'''
# 250k keys dict to Redis hashes (the prefix must match the one used in the index definition)
sendToRedis(myRedis, smallDB, 'smallDataset:')

# 500k keys dict to Redis hashes
sendToRedis(myRedis2, mediumDB, 'mediumDB:')

# 750k keys dict to Redis hashes
sendToRedis(myRedis3, largeDB, 'largeDB:')

# 1m keys dict to Redis hashes
sendToRedis(myRedis4, humongousDB, 'humongousDB:')
'''

In [19]:
# optional delete hashes process
# removeFromRedis(myRedis, mediumDB, 'mediumDB:')

We can consider each hash as a single document in a collection of documents where keys are common across them.

<h4>
Executing a query (<i>RediSearch</i>)
</h4><br>
Queries can be performed by using the <a href = 'https://docs.redis.com/latest/stack/search/'>RediSearch module</a>, which builds indices based on the provided schema.

- Creating an index <code>FT.CREATE</code><br>
  This is a very important step to take before performing a query. Creating an index defines the <code>SCHEMA</code> of the data for querying purposes, so index creation is strongly query-oriented: the schema is in fact the list of secondary indices that our queries are based on.<br>
<code>
    FT.CREATE <i>indexName</i> ON hash PREFIX 1 <i>prefixPattern</i> SCHEMA [<i>fieldName</i> [TYPE] [OPTIONS] ... ]
</code>
<br>
  As the example syntax above shows, we specify:
  
  - the name of the index we are creating (<i>indexName</i>);
  - the data type on which we are creating it (HASH or JSON supported);
  - the key prefix (a common pattern that groups many hashes together as a collection);
  - the schema, i.e. the fields (hash keys, to be more precise) we want to use as indices followed by the value type (TEXT, NUMERIC, TAG, ...) and its options (SORTABLE, ...).<br><br>

- Performing a query <code>FT.SEARCH</code><br>
  After having created an index we can use the secondary indices in the schema to select elements based on specific values. 
<br>
<code>
    FT.SEARCH <i>indexName</i> '@fieldName:fieldValue' RETURN [nr of projected fields] [<i>fieldNames</i>]
</code>
<br>
  As the example syntax shows, we specify:
  
  - the name of the index we want to use (<i>indexName</i>);
  - the selection criteria (<i>fieldName</i> introduced by an <i><b>at sign (@)</b></i> and <i>fieldValue</i> introduced by a <i><b>colon (:)</b></i>);
  - the projected fields (introduced by the <code>RETURN</code> keyword and their number).
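
To make the two syntaxes above concrete, the sketch below assembles the tokens of an <code>FT.CREATE</code> and an <code>FT.SEARCH</code> command as Python lists. It is only an illustration: the helper names <code>buildCreateArgs</code> and <code>buildSearchArgs</code> and the index name <code>idx:demo</code> are hypothetical, and no Redis server is contacted.

```python
def buildCreateArgs(indexName, prefix, schema):
    """Return the FT.CREATE tokens for a hash index; schema is a list of (fieldName, type)."""
    args = ['FT.CREATE', indexName, 'ON', 'HASH', 'PREFIX', 1, prefix, 'SCHEMA']
    for fieldName, fieldType in schema:
        args += [fieldName, fieldType]
    return args

def buildSearchArgs(indexName, fieldName, fieldValue, returnFields):
    """Return the FT.SEARCH tokens selecting on one field and projecting returnFields."""
    return ['FT.SEARCH', indexName, '@%s:%s' % (fieldName, fieldValue),
            'RETURN', len(returnFields), *returnFields]

createArgs = buildCreateArgs('idx:demo', 'smallDataset:',
                             [('studentID', 'TEXT'), ('courseID', 'TEXT')])
searchArgs = buildSearchArgs('idx:demo', 'studentID', '450', ['firstName', 'lastName'])
print(' '.join(str(a) for a in createArgs))
print(' '.join(str(a) for a in searchArgs))
```

The printed lines have the form that would be typed in redis-cli. From Python, the same token lists could presumably be sent through the generic <code>execute_command(*args)</code> method of a connection object, although the dedicated methods of redis-py are more convenient.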

<h4>
Executing a query in redis-py
</h4><br>
Within a Redis connection object, the <code>ft</code> method creates a new object providing methods that replicate the <a href = 'https://redis-py.readthedocs.io/en/stable/redismodules.html#redisearch-commands'>RediSearch commands</a> introduced in the previous paragraph. Like the methods for the core Redis commands, they are all lowercase. We can create as many <i>RediSearch</i> objects as the indices we want to use to perform queries.<br> It is also advisable to import the needed dependencies prior to index creation. <code>TextField</code>, <code>NumericField</code> and <code>TagField</code> specify the value type of the fields included in the schema. The <code>IndexDefinition</code> dependency is needed to specify the common prefix of the Redis keys that must be indexed. <code>Query</code> is useful for executing complex queries, since it allows specifying parameters that can be chained to one another: applying a parameter to a <code>Query()</code> object returns a query object, so a chained parameter is applied to the query object returned by the preceding one, and so on. The <code>aggregation</code> dependency is needed to perform aggregate queries by passing an <i>aggregation request</i> to the <code>aggregate</code> method of the index object. Finally, the <code>reducers</code> dependency provides functions that reduce each group of aggregation results to a single record, such as <i>count</i>, <i>sum</i>, <i>min</i>, <i>max</i>, <i>average</i> ...

In [20]:
# import dependencies
from redis.commands.search.field import TextField, NumericField, TagField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
import redis.commands.search.aggregation as aggregations
import redis.commands.search.reducers as reducers

<h4>
Create an index
</h4><br>
An index can be created with the <code>create_index</code> method of any <i>RediSearch</i> object. We first obtain the <i>RediSearch</i> object, optionally passing the index name as an argument (otherwise a default '<i>idx</i>' name will be used), and assign it to a Python variable. Then we pass to <code>create_index</code> the schema, where each field name is wrapped in the class that specifies its value type (<code>TextField</code>, <code>NumericField</code>, <code>TagField</code>), and the index definition, given through the <code>definition</code> keyword argument. The <code>info()</code> method of the index object is useful to retrieve index information.<br>Notice that if we need to redefine a previously created index, we must first drop it with the <code>dropindex</code> method.

In [5]:
# create RediSearch object
exampleRS = myRedis.ft('idx:exampleIdx')

# (if it exists drop and) create index object
#exampleRS.dropindex()
schema = (TextField('studentID'), TextField('courseID'), TextField('materialID'))
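# the prefix must be given as a list, and the definition must be passed by keyword
index_definition = IndexDefinition(prefix = ['smallDataset:'], index_type = IndexType.HASH)
exampleRS.create_index(schema, definition = index_definition)
#exampleRS.info()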

'OK'

<h4>
Run the related query
</h4><br>
A query depends on the previously created index and is performed via the <code>search</code> method of the index object. It is sufficient to pass a query string to the <code>search</code> method: either a plain value, which is searched in all the indexed fields of the schema, or a string of the form '<i>@fieldName:fieldValue</i>' if we want to look for the value in a specific field.

In [6]:
exampleRS.search('450')

Result{508 total, docs: [Document {'id': 'smallDataset:180027', 'payload': None, 'courseID': '450', 'discipline': 'miscellaneous', 'courseName': 'Game Theory', 'courseYear': '2023', 'syllabus': 'http://learning_platform.com/gametheory/syllabus', 'studentID': '1439', 'firstName': 'Danilo', 'lastName': 'Barbosa', 'dateOfBirth': '1975-9-5', 'genre': 'male', 'country': 'Granada', 'town': 'Santos do Norte', 'email': 'danilo.barbosa@hotmail.com', 'materialID': '32031', 'unit': 'Unit 4', 'materialType': 'lecture slides', 'name': '[SLIDES] Cellular Respiration, Part 1 ', 'dimension': '9', 'accessDate': '2023-05-19'}, Document {'id': 'smallDataset:102996', 'payload': None, 'courseID': '450', 'discipline': 'miscellaneous', 'courseName': 'Game Theory', 'courseYear': '2023', 'syllabus': 'http://learning_platform.com/gametheory/syllabus', 'studentID': '834', 'firstName': 'Geir', 'lastName': 'Brekke', 'dateOfBirth': '1993-10-22', 'genre': 'male', 'country': 'Republic of the Congo', 'town': 'Trondber

The result object returned by <code>search</code> stores the <b>total number of matching documents in the <code>total</code> attribute</b>, the <b>query execution time (in milliseconds) in the <code>duration</code> attribute</b> and <b>the docs (hashes) matching the selection criteria in the <code>docs</code> attribute.</b> Assigning a query result to a Python variable allows us to retrieve this information at any time.

In [7]:
exampleQuery = exampleRS.search('450')
print('The query execution time was: %f milliseconds, and the number of documents matching it is: %i\n' % (exampleQuery.duration, exampleQuery.total))
print('A sample of the courseID and student name for the query results is the following:\n')
counter = 0
for doc in exampleQuery.docs:
    counter += 1
    print(counter, doc['courseID'], doc['firstName'], doc['lastName'])

The query execution time was: 7.215023 milliseconds, and the number of documents matching it is: 508

A sample of the courseID and student name for the query results is the following:

1 450 Danilo Barbosa
2 450 Geir Brekke
3 450 Jeffrey Wheeler
4 450 Jeffrey Wheeler
5 450 Jordan Cole
6 450 Sebastião Sousa
7 450 Danilo Barbosa
8 450 Candelaria Alba
9 450 Jordan Cole
10 450 Jeffrey Wheeler
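
The loop above can be generalised into a small helper that extracts only the fields of interest from the <code>docs</code> attribute. This is a sketch under the assumption that each document exposes its fields as attributes; the helper name <code>docsToRows</code> is hypothetical, and a stand-in object replaces a real search result so that the example runs without a Redis server.

```python
from types import SimpleNamespace

def docsToRows(result, fieldNames):
    """Return one tuple per document holding only the requested fields."""
    # getattr with a default keeps the loop safe if a field is missing
    return [tuple(getattr(doc, name, None) for name in fieldNames)
            for doc in result.docs]

# stand-in for a search result (the ids are made up for the illustration)
fakeResult = SimpleNamespace(total=2, docs=[
    SimpleNamespace(id='smallDataset:1', courseID='450', firstName='Danilo', lastName='Barbosa'),
    SimpleNamespace(id='smallDataset:2', courseID='450', firstName='Geir', lastName='Brekke'),
])

rows = docsToRows(fakeResult, ['courseID', 'firstName', 'lastName'])
for number, row in enumerate(rows, start=1):
    print(number, *row)
```

With a real result the call would be <code>docsToRows(exampleQuery, ['courseID', 'firstName', 'lastName'])</code>.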


In the example above, the value '<i>450</i>' is searched across all the fields indexed in the schema of the <i>exampleRS</i> index, so a document matches whenever its student ID, course ID or material ID equals the requested value. In the printed results every match happens to be on the course ID; for other documents the match may be on the student ID or the material ID, and we could not tell which from this output, because the course ID is the only one of the three indexed fields that is printed. The output also shows that Redis limits the number of results it returns. This is controlled by the optional <code>LIMIT</code> argument, which sets the offset and the number of results returned. The default is 0 10, which returns 10 items starting from the first (0) result. The redis-cli syntax is simply:<br>
<code>
    FT.SEARCH <i>indexName</i> '<i>@fieldName:fieldValue</i>' LIMIT [first num] RETURN [nr of projected fields] [<i>fieldNames</i>]
</code><br>
To control this parameter in our Python environment we must implement the query differently. Passing a plain string to the <code>search</code> method is not sufficient: we need to use a <code>Query</code> object, as illustrated below. A <code>Query</code> object is meant for complex queries and lets us set parameters on the object itself; these parameters can be chained to adapt the query results to our needs. One of them is <code>paging(<i>first</i>, <i>num</i>)</code>, which replicates the effect of the <code>LIMIT</code> argument in redis-cli.

In [8]:
exampleQuery2 = exampleRS.search(Query('450').paging(0, 15))
counter2 = 0
for doc in exampleQuery2.docs:
    counter2 += 1
    print(counter2, doc['courseID'], doc['firstName'], doc['lastName'])

1 450 Danilo Barbosa
2 450 Geir Brekke
3 450 Jeffrey Wheeler
4 450 Jeffrey Wheeler
5 450 Jordan Cole
6 450 Sebastião Sousa
7 450 Danilo Barbosa
8 450 Candelaria Alba
9 450 Jordan Cole
10 450 Jeffrey Wheeler
11 313 Walentina Bohnbach
12 313 Walentina Bohnbach
13 450 Candelaria Alba
14 450 Sebastião Sousa
15 450 Sebastião Sousa
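
When all the matching documents are needed rather than a single page, the <code>(first, num)</code> pairs can be generated from the <code>total</code> attribute of a first result. The sketch below only computes the pairs; the helper name <code>pageOffsets</code> and the page size of 200 are hypothetical choices, while 508 is the total reported for the example query.

```python
def pageOffsets(total, pageSize):
    """Yield (first, num) pairs that cover `total` results in pages of `pageSize`."""
    for first in range(0, total, pageSize):
        # the last page may be shorter than the others
        yield first, min(pageSize, total - first)

pages = list(pageOffsets(508, 200))
print(pages)
```

Each pair could then be passed to <code>paging</code>, as in <code>exampleRS.search(Query('450').paging(first, num))</code>.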


If we know exactly which information we need from a query, we can project only the required fields instead of retrieving the entire documents, which saves memory.

In [9]:
exampleQuery3 = exampleRS.search(Query('@studentID:450').return_fields('firstName', 'lastName', 'courseID').paging(0, 15))
print('A sample of the courseID and student name for the query results is the following:\n')
counter3 = 0
for doc in exampleQuery3.docs:
    counter3 += 1
    print(counter3, doc['courseID'], doc['firstName'], doc['lastName'])

A sample of the courseID and student name for the query results is the following:

1 313 Walentina Bohnbach
2 313 Walentina Bohnbach
3 313 Walentina Bohnbach
4 313 Walentina Bohnbach
5 161 Walentina Bohnbach
6 161 Walentina Bohnbach
7 313 Walentina Bohnbach
8 161 Walentina Bohnbach
9 313 Walentina Bohnbach
10 313 Walentina Bohnbach
11 161 Walentina Bohnbach
12 161 Walentina Bohnbach
13 313 Walentina Bohnbach
14 313 Walentina Bohnbach
15 313 Walentina Bohnbach


To avoid duplicated results we must use a different type of query, an aggregate query. We use the <code>aggregation</code> dependency to build an aggregation request, which we then pass to the <code>aggregate</code> method. We also need to create a new index, because its schema must include the fields we want to group by.

In [11]:
#exampleRS2.dropindex()

# new RediSearch object
exampleRS2 = myRedis.ft('idx:exampleIdx2')

# new index (with new schema)
schema2 = (TextField('studentID'), TextField('firstName'), TextField('lastName'))
exampleRS2.create_index(schema2, definition = index_definition)

'OK'

In [13]:
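# aggregate request and query
# (group_by takes an ordered list, so the layout of each result row is predictable)
aggRequest = aggregations.AggregateRequest('@studentID:450').group_by(['@firstName', '@lastName'])
exampleQuery4 = exampleRS2.aggregate(aggRequest)
for res in exampleQuery4.rows:
    # each row is a flat list: ['firstName', value, 'lastName', value]
    print(res[1], res[3])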

Walentina Bohnbach
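
Accessing the values of an aggregate row by position is fragile, because it depends on the order in which the grouped fields come back. Assuming each row is a flat list that alternates field names and values (which is consistent with the positional indexing used above), a row can be converted into a dictionary and read by name. The helper name <code>rowToDict</code> is hypothetical and the sample row is a stand-in for a real result.

```python
def rowToDict(row):
    """Turn a flat [name1, value1, name2, value2, ...] list into a dict."""
    # even positions hold the names, odd positions hold the values
    return dict(zip(row[0::2], row[1::2]))

# stand-in row with the assumed layout
sampleRow = ['lastName', 'Bohnbach', 'firstName', 'Walentina']
person = rowToDict(sampleRow)
print(person['firstName'], person['lastName'])
```

With a real result, the loop would become <code>for res in exampleQuery4.rows: person = rowToDict(res)</code>.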


<h4>
Measuring and displaying the query execution time
</h4>
<h4>
- method 1: <code>time()</code>
</h4><br>
To measure the query execution time we can use the Python <a href = 'https://docs.python.org/3/library/time.html'><i>time</i></a> module and its <code>time()</code> function. The function returns the system time in seconds as a floating-point number, so intervals much shorter than a second can be measured. It is sufficient to store the time before the query execution in one variable and the time after it in another: their difference is the elapsed time. Note that this figure is measured on the client side, so it adds the time needed to send the request to the Redis server and to receive the response to the execution time inside the server. The unit of measure is seconds, so we obtain milliseconds by multiplying the difference by 1000.

In [None]:
from time import time

# performing exampleQuery3 again:
startEx3 = time()
exampleRS.search(Query('@studentID:450').return_fields('firstName', 'lastName', 'courseID').paging(0, 15))
endEx3 = time()
timeEx3 = (endEx3 - startEx3) * 1000

# performing exampleQuery4 again:
startEx4 = time()
exampleRS2.aggregate(aggRequest)
endEx4 = time()
timeEx4 = (endEx4 - startEx4) * 1000

print('The query execution time for exampleQuery3 was: %f milliseconds.' % timeEx3)
print('The query execution time for exampleQuery4 was: %f milliseconds.' % timeEx4)

<h4>
- method 2: <code>duration</code>
</h4><br>
To display the query execution time we can also use the <code>duration</code> attribute of a search result object. However, as the following execution shows, this attribute is not available for aggregation results.

In [14]:
print('The query execution time was: %f milliseconds.' % exampleQuery3.duration)
print('The query execution time was: %f milliseconds.' % exampleQuery4.duration)

The query execution time was: 6.802797 milliseconds.


AttributeError: 'AggregateResult' object has no attribute 'duration'

<h4>
- chosen method: <code>time()</code>
</h4><br>
Because of this limitation, and for consistency across the project, it is better to rely on the <code>time()</code> function of the Python <code>time</code> module to mark the time before and after operations are executed and compute their difference.
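
To avoid repeating the start/end bookkeeping for every query, the chosen method can be wrapped in a small helper. The following is a minimal sketch: the function name <code>timeCall</code> is hypothetical, and a stand-in workload replaces the query so that the example runs without a Redis server.

```python
from time import time

def timeCall(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed milliseconds)."""
    start = time()
    result = func(*args, **kwargs)
    elapsedMs = (time() - start) * 1000          # seconds to milliseconds
    return result, elapsedMs

# stand-in workload in place of a query
result, elapsedMs = timeCall(sum, range(1000000))
print('The call took %f milliseconds.' % elapsedMs)
```

A query would be timed with, for example, <code>timeCall(exampleRS2.aggregate, aggRequest)</code>. The standard library also offers <code>time.perf_counter()</code>, a clock intended for measuring short intervals, which could replace <code>time()</code> inside the helper without other changes.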