<center>
<h2>Online learning platform database - Redis</h2>
</center>

<h3>Preliminary operations: import csv files into Redis</h3>

Because Redis is a <i>key-value store</i> rather than a classical DBMS, importing data is best done through a programming language API. This means loading the csv file into a suitable data structure, then using the language's Redis driver to connect to a Redis instance and store the data in a Redis data type. Hence, this section on Redis has a slightly different format from the previous ones: it starts by loading the csv file in Python, then introduces the Python driver for Redis and the ways to connect to a Redis instance from Python.

<h3>Importing csv files into Python</h3>

Importing a csv file into Python is best performed via the <code>csv</code> module from the Python standard library. It provides objects to read (<code>csv.reader</code>) and write (<code>csv.writer</code>) csv files. After opening a file we can read it line by line and store the fields of each line in a list. In the following code fragment, I store the lines in a dictionary of lists, where each line number is a dictionary key and the associated list contains the fields of that csv line. The header row is stored under key 0.

In [3]:
import csv

# Folder that contains the csv tables
path = '/Users/mau/OneDrive - unime.it/Learning/CdL Informatica/Anno II - Database/Module B/project/tables/'

# Map each line number to the list of fields on that line (key 0 is the header row)
smallDict = dict()
with open(path + 'dataset250k.csv', newline = '') as csvSmall:
    reader = csv.reader(csvSmall, delimiter = ',')
    for key, line in enumerate(reader):
        smallDict[key] = line

In [9]:
len(smallDict)

250001
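
As an alternative sketch, <code>csv.DictReader</code> (also part of the standard library) pairs each value with its column name while reading. The three-column sample below is made up for illustration, and an in-memory string stands in for the real file.

```python
import csv
import io

# Hypothetical three-column sample standing in for the real csv file.
sample = io.StringIO(
    "studentID,courseID,materialID\n"
    "192,14,1019\n"
    "192,309,2044\n"
)

# DictReader consumes the header row and yields one dict per data row.
rows_by_number = {}
for number, row in enumerate(csv.DictReader(sample), start=1):
    rows_by_number[number] = row

print(len(rows_by_number))
print(rows_by_number[1]["courseID"])
```

With this approach the header is not stored as a row of its own, so the length of the dictionary equals the number of data rows.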

<h3>Python - Redis interaction</h3>

Interaction between Python and a Redis key-value store requires the installation of a specific driver. The list of drivers for the various programming languages is provided on the <a href = 'https://redis.io/resources/clients/'>Clients</a> page of the Redis website: <a href = 'https://redis-py.readthedocs.io/en/stable/index.html'>redis-py</a> is the driver developed by <i>Redis Inc.</i> for Python.<br>Once installed, the driver can be imported into a Python environment in the usual way.

In [1]:
import redis

<h4>
Establishing a connection to a Redis database
</h4><br>
We can connect to a Redis instance by simply assigning a <code>Redis()</code> object to a Python variable. By default, the driver connects to a local Redis instance on port 6379. Host name and port can also be specified as arguments. By default, redis-py returns responses as bytes. To receive responses decoded as strings, we set the <code>decode_responses</code> argument to <code>True</code>.

In [2]:
redis_dec = redis.Redis(host = 'localhost', port = 6379, decode_responses = True)

A Redis connection object inherits from the <code>CoreCommands</code> class, whose methods mirror the commands available in <i>redis-cli</i>. Since Python is case sensitive, however, the method names must be typed in the correct letter case (they are usually lowercase). The list of all available methods is accessible via the usual <code>dir(<i>redisObject</i>)</code> function.

<h4>
Store data into Redis hashes
</h4><br>
<i>Hashes</i> are a Redis data type that associates keys with values. A hash object has a name and a list of key-value pairs. In our case, the keys are the field (column) names contained in the first row of the csv file (header), while the values are the field values contained in the other csv rows. We can create one hash per row by using hash names of the form <i>smallDataset:rownumber</i>, where the text before the colon identifies the dataset and the text after the colon is the row number. Thus, each row becomes a hash, and the hash keys are common across all hashes in the dataset. This is helpful because the hash keys can work as a schema for implementing queries.

In [28]:
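header = smallDict[0]  # first csv row holds the field names
for i in range(1, len(smallDict)):
    for j in range(len(header)):
        # one hash per row: hash name 'smallDataset:<row number>', hash key = column name
        redis_dec.hset('smallDataset:' + str(i), header[j], smallDict[i][j])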

We can consider each hash as a single document in a collection of documents where keys are common across them.
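
The row-to-hash mapping can also be sketched without a running server. In the fragment below, the helper name <code>rows_to_hashes</code> and the miniature table are assumptions made for illustration. Each yielded pair holds a hash name and the complete field mapping of one row, which redis-py can store in a single call through the <code>mapping</code> argument of <code>hset</code>.

```python
def rows_to_hashes(table, prefix):
    """Yield (hash_name, field_mapping) pairs from a dict of csv rows.

    table[0] is assumed to hold the header; the other keys hold data rows.
    """
    header = table[0]
    for row_number in range(1, len(table)):
        yield prefix + str(row_number), dict(zip(header, table[row_number]))


# Hypothetical miniature version of smallDict.
tiny_table = {
    0: ["studentID", "courseID"],
    1: ["192", "14"],
    2: ["57", "309"],
}

hashes = list(rows_to_hashes(tiny_table, "smallDataset:"))
for name, mapping in hashes:
    print(name, mapping)
    # With a live connection: redis_dec.hset(name, mapping=mapping)
```

Sending one mapping per row reduces the number of calls from one per field to one per row, which may matter on a dataset of this size.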

<h4>
Executing a query (<i>RediSearch</i>)
</h4><br>
Queries can be performed by using the <a href = 'https://docs.redis.com/latest/stack/search/'>RediSearch module</a>, which builds indices based on the provided schema.

- Creating an index <code>FT.CREATE</code><br>
  This step must be taken before performing a query. Creating an index defines the schema of the data for the purpose of querying, so index creation is strongly query-oriented: the schema is in fact a list of secondary indices that we base our queries on.<br>
<code>
    FT.CREATE <i>indexName</i> ON hash PREFIX 1 <i>prefixPattern</i> SCHEMA [<i>fieldName</i> [TYPE] [OPTIONS] ... ]
</code>
<br>
  As the example syntax above shows, we specify:
  
  - the name of the index we are creating (<i>indexName</i>);
  - the data type on which we are creating it (HASH or JSON supported);
  - the key prefix (a pattern shared by the key names, which lets us group many hashes together as a collection);
  - the schema, i.e. the fields (hash keys, to be more precise) we want to use as indices followed by the value type (TEXT, NUMERIC, TAG, ...) and its options (SORTABLE, ...).<br><br>

- Performing a query <code>FT.SEARCH</code><br>
  After having created an index we can use the secondary indices in the schema to select elements based on specific values. 
<br>
<code>
    FT.SEARCH <i>indexName</i> '@fieldName:fieldValue' RETURN [nr of projected fields] [<i>fieldNames</i>]
</code>
<br>
  As the example syntax shows, we specify:
  
  - the name of the index we want to use (<i>indexName</i>);
  - the selection criteria (<i>fieldName</i> introduced by an <i><b>at sign (@)</b></i> and <i>fieldValue</i> introduced by a <i><b>colon (:)</b></i>);
  - the projected fields (introduced by the <code>RETURN</code> keyword and their number).
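
To make the two syntaxes concrete, the following sketch assembles the tokens of an <code>FT.CREATE</code> and an <code>FT.SEARCH</code> command as plain Python lists. The helper names <code>ft_create_args</code> and <code>ft_search_args</code> are hypothetical. The index, prefix and field names are those used in the rest of this section.

```python
def ft_create_args(index_name, prefix, fields):
    """Build the FT.CREATE tokens for a hash index.

    fields is a list of (field_name, field_type) pairs.
    """
    args = ["FT.CREATE", index_name, "ON", "HASH", "PREFIX", 1, prefix, "SCHEMA"]
    for field_name, field_type in fields:
        args += [field_name, field_type]
    return args


def ft_search_args(index_name, field_name, field_value, return_fields):
    """Build the FT.SEARCH tokens for a single-field selection with projection."""
    query = "@{}:{}".format(field_name, field_value)
    return ["FT.SEARCH", index_name, query, "RETURN", len(return_fields), *return_fields]


create = ft_create_args("query1", "smallDataset:", [("studentID", "TEXT"), ("courseID", "TEXT")])
search = ft_search_args("query1", "studentID", "192", ["firstName", "lastName"])
print(" ".join(str(token) for token in create))
print(" ".join(str(token) for token in search))
```

With a live connection, a token list of this kind could presumably be sent as is through the generic <code>execute_command(*args)</code> method of the connection object.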

<h4>
Executing a query in redis-py
</h4><br>
Within a Redis connection object, the <code>ft()</code> method returns a search object whose methods replicate the RediSearch commands introduced in the previous paragraph. Like the methods for the core Redis commands, their names are all lowercase.

<h4>
Create an index
</h4><br>
An index is created with the <code>create_index</code> method. We first pass the index name to <code>ft()</code> and assign the resulting index object to a Python variable. Then we pass the schema to <code>create_index</code>: each field name of the schema must be wrapped in the class that specifies its value type (<code>TextField</code>, <code>NumericField</code>, <code>TagField</code>, <code>...</code>). These classes must be imported prior to index creation. The <code>IndexDefinition</code> class is needed to specify the common prefix of the Redis keys that must be indexed.<br>If we need to redefine a previously created index, we must first drop it with the <code>dropindex</code> method. The other imported class, <code>Query</code>, is useful for the execution of complex queries, because it lets us specify parameters that can be chained to one another. Applying a parameter to a <code>Query()</code> object returns a query object, so a chained parameter is applied to the query object returned by the preceding one, and so on.

In [50]:
from redis.commands.search.field import TextField, NumericField, TagField
from redis.commands.search.indexDefinition import IndexDefinition
from redis.commands.search.query import Query

rs = redis_dec.ft('query1')
#rs.dropindex()  # uncomment to drop an existing 'query1' index before redefining it
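
Chaining works because each parameter method returns the object it was called on. The toy class below, <code>MiniQuery</code>, is a hypothetical stand-in written for this illustration and is not part of redis-py. Its method names mirror the <code>Query</code> parameters used later in this section.

```python
class MiniQuery:
    """Toy stand-in for a chainable query object (not the redis-py class)."""

    def __init__(self, query_string):
        self.query_string = query_string
        self.offset = 0
        self.num = 10
        self.fields = []

    def paging(self, offset, num):
        # Record the result window, then return the same object so calls can be chained.
        self.offset = offset
        self.num = num
        return self

    def return_fields(self, *fields):
        # Record the projected fields and return the same object again.
        self.fields = list(fields)
        return self


q = MiniQuery("@studentID:192").return_fields("firstName", "lastName").paging(0, 25)
print(q.query_string, q.fields, q.offset, q.num)
```

Since every call hands back the same object, the order of the chained parameters does not change the final settings in this sketch.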

After the index object has been created we can pass it the desired schema and index definition. The <code>info</code> method of the index object is useful to retrieve index information.

In [51]:
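# Index three hash keys as text fields
schema = (TextField('studentID'), TextField('courseID'), TextField('materialID'))
# The prefix must be given as a list of key prefixes
index_definition = IndexDefinition(prefix = ['smallDataset:'])
# Pass the definition by keyword: the second positional parameter of create_index is a different option
rs.create_index(schema, definition = index_definition)
#rs.info()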

'OK'

<h4>
Run the related query
</h4><br>
A query depends on the previously created index and is performed via the <code>search</code> method of the index object. It is sufficient to pass a query string to the <code>search</code> method. The query string is either a plain value, which is looked up in all the secondary indices, or a string of the form '<i>@fieldName:fieldValue</i>' if we want to look for the value in a specific <i>fieldName</i>.

In [7]:
rs.search('192')

Result{1109 total, docs: [Document {'id': 'smallDataset:22930', 'payload': None, 'courseID': '14', 'name': '[SLIDES] Anatomy of a CSS Rule', 'discipline': 'IT', 'studentID': '192', 'accessDate': '2022-12-20', 'dimension': '6', 'courseYear': '2022', 'firstName': 'Zaina', 'dateOfBirth': '1967-3-15', 'unit': 'Unit 2', 'courseName': 'Programming for Everybody (Getting Started with Python)', 'materialType': 'lecture slides', 'lastName': 'Madan', 'country': 'Argentina', 'syllabus': 'http://learning_platform.com/programmingforeverybodygettingstartedwithpython/syllabus', 'genre': 'female', 'materialID': '1019', 'town': 'Ambala', 'email': 'zaina.madan@gmail.com'}, Document {'id': 'smallDataset:23066', 'payload': None, 'courseID': '309', 'name': '[VIDEO] Extended Exercise: I Like Apples.', 'discipline': 'languages', 'studentID': '192', 'accessDate': '2023-08-18', 'dimension': '2', 'courseYear': '2023', 'firstName': 'Zaina', 'dateOfBirth': '1967-3-15', 'unit': 'Unit 5', 'courseName': 'English for

The result object returned by <code>search</code> stores the total number of documents found in the <code>total</code> attribute, the query execution time in the <code>duration</code> attribute and the returned docs (hashes) matching the selection criteria in the <code>docs</code> attribute. Assigning a query result to a Python variable allows us to retrieve this information at any time.

In [8]:
exampleQuery = rs.search('192')

In [9]:
print('The query execution time was: %f milliseconds, and the number of documents matching it is: %i' % (exampleQuery.duration, exampleQuery.total))

The query execution time was: 7.227898 milliseconds, and the number of documents matching it is: 1109


In [10]:
print('A sample of the courseID and student name for the query results is the following:\n')
counter = 0
for doc in exampleQuery.docs:
    counter += 1
    print(counter, doc['courseID'], doc['firstName'], doc['lastName'])

A sample of the courseID and student name for the query results is the following:

1 14 Zaina Madan
2 309 Zaina Madan
3 192 Sarah Lara
4 379 Zaina Madan
5 192 Patrícia Leite
6 192 Ana Narušis
7 192 Casandra Arenas
8 14 Zaina Madan
9 192 Custodia Hidalgo
10 192 Arthur Laroche


The above example shows that the value '<i>192</i>' is searched across all the secondary indices of our '<i>query1</i>' index, so the matches may come from student IDs, course IDs or material IDs. In the printed results, however, the only displayed field that can reveal the match is the course ID. Where the course ID differs from 192, the match must be in the student ID or in the material ID, but we cannot tell which because neither of them is printed. The above result also shows that Redis limits the number of results it returns. This is controlled by the optional argument <code>LIMIT</code>, which sets the offset and the number of results returned. The default is 0 10, which returns 10 items starting from the first (0) result. The redis-cli syntax is simply:<br>
<code>
    FT.SEARCH <i>indexName</i> '<i>@fieldName:fieldValue</i>' LIMIT [first num] RETURN [nr of projected fields] [<i>fieldNames</i>]
</code><br>
To control this parameter in our Python environment we must use a <code>Query</code> object. A <code>Query</code> object is used for complex queries and lets us specify parameters on the object itself. Its parameters can be chained to adapt the query results to our needs. One of them is <code>paging(<i>first</i>, <i>num</i>)</code>, which replicates the effect of the <code>LIMIT</code> argument in redis-cli.

In [11]:
exampleQuery2 = rs.search(Query('192').paging(0, 25))

In [12]:
print('The query execution time was: %f milliseconds, and the number of documents matching it is still: %i' % (exampleQuery2.duration, exampleQuery2.total))

The query execution time was: 10.906219 milliseconds, and the number of documents matching it is still: 1109


In [14]:
print('A larger sample of the courseID and student name for the query results is the following:\n')
counter2 = 0
for doc in exampleQuery2.docs:
    counter2 += 1
    print(counter2, doc['courseID'], doc['firstName'], doc['lastName'])

A larger sample of the courseID and student name for the query results is the following:

1 14 Zaina Madan
2 309 Zaina Madan
3 192 Sarah Lara
4 379 Zaina Madan
5 192 Patrícia Leite
6 192 Ana Narušis
7 192 Casandra Arenas
8 14 Zaina Madan
9 192 Custodia Hidalgo
10 192 Arthur Laroche
11 192 Narciso Ferrán
12 192 Casandra Arenas
13 192 Ledün Soylu
14 192 Émile Nicolas
15 192 Cathrine Lie
16 192 Casandra Arenas
17 379 Zaina Madan
18 192 Ledün Soylu
19 192 Vigilija Gaižauskas
20 379 Zaina Madan
21 192 Narciso Ferrán
22 192 Narciso Ferrán
23 379 Zaina Madan
24 192 Narciso Ferrán
25 192 Custodia Hidalgo
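
The window selected by <code>paging(<i>first</i>, <i>num</i>)</code> behaves like a list slice that starts at <i>first</i> and spans <i>num</i> items. The sketch below uses a plain list of integers as a stand-in for the 1109 matching documents, and a hypothetical helper, <code>page_windows</code>, to enumerate the windows needed to walk through all of them.

```python
def page_windows(total, page_size):
    """Return the (first, num) pairs that cover `total` results page by page."""
    return [(first, min(page_size, total - first)) for first in range(0, total, page_size)]


matches = list(range(1109))  # stand-in for the 1109 matching documents

first, num = 0, 25
window = matches[first:first + num]  # the same positions that paging(0, 25) selects
print(len(window))

windows = page_windows(len(matches), 25)
print(len(windows), windows[-1])
```

Each pair could be passed to <code>paging</code> in turn to retrieve the whole result set in batches of 25.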


If we know exactly which information we need from a query, we can project only the required fields instead of retrieving the entire documents, which saves memory.

In [43]:
exampleQuery3 = rs.search(Query('@studentID:192').return_fields('firstName', 'lastName', 'courseID').paging(0, 25))

In [44]:
print('The query execution time was: %f milliseconds, and the number of documents matching it is: %i' % (exampleQuery3.duration, exampleQuery3.total))

The query execution time was: 4.333019 milliseconds, and the number of documents matching it is: 222


In [45]:
print('A sample of the courseID and student name for the query results is the following:\n')
counter3 = 0
for doc in exampleQuery3.docs:
    counter3 += 1
    print(counter3, doc['courseID'], doc['firstName'], doc['lastName'])

A sample of the courseID and student name for the query results is the following:

1 14 Zaina Madan
2 309 Zaina Madan
3 379 Zaina Madan
4 14 Zaina Madan
5 379 Zaina Madan
6 379 Zaina Madan
7 379 Zaina Madan
8 234 Zaina Madan
9 379 Zaina Madan
10 379 Zaina Madan
11 309 Zaina Madan
12 379 Zaina Madan
13 379 Zaina Madan
14 379 Zaina Madan
15 14 Zaina Madan
16 309 Zaina Madan
17 14 Zaina Madan
18 309 Zaina Madan
19 309 Zaina Madan
20 234 Zaina Madan
21 234 Zaina Madan
22 379 Zaina Madan
23 14 Zaina Madan
24 309 Zaina Madan
25 14 Zaina Madan
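
The gap between the 1109 matches of the unqualified query '<i>192</i>' and the 222 matches of '<i>@studentID:192</i>' comes from where the value is looked up. The plain-Python sketch below mimics this on a few made-up records. The records and the helper names are illustrative assumptions, and exact string equality is used as a simplification of RediSearch text matching.

```python
INDEXED_FIELDS = ("studentID", "courseID", "materialID")

# Made-up records; only the field names come from the dataset.
records = [
    {"studentID": "192", "courseID": "14", "materialID": "1019"},
    {"studentID": "57", "courseID": "192", "materialID": "2044"},
    {"studentID": "88", "courseID": "309", "materialID": "192"},
    {"studentID": "88", "courseID": "309", "materialID": "7"},
]


def match_any_field(docs, value):
    """Mimic an unqualified query: the value may sit in any indexed field."""
    return [d for d in docs if any(d[f] == value for f in INDEXED_FIELDS)]


def match_one_field(docs, field, value):
    """Mimic '@field:value': only the named field is inspected."""
    return [d for d in docs if d[field] == value]


print(len(match_any_field(records, "192")))
print(len(match_one_field(records, "studentID", "192")))
```

The field-qualified lookup returns a subset of the unqualified one, which is consistent with the smaller total reported for the last query.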
