<center>
    <h2>Online learning platform database - Neo4j</h2>
    <h3>Methodologies applied for loading the data and performing the queries</h3>
</center>

<h3>Preliminary operations: import csv files into Neo4j (<code>LOAD CSV</code>)</h3>

A csv file can be imported into Neo4j via the <code>LOAD CSV</code> command, described in the <a href = 'https://neo4j.com/docs/cypher-manual/current/clauses/load-csv/'>Cypher manual</a> (<i>Cypher</i> is Neo4j’s declarative query language), although <a href = 'https://neo4j.com/docs/getting-started/data-import/csv-import/'>other methods</a> are better suited to large files.
<br>
<h4>
Syntax
</h4><br>
The file to be imported via the <code>LOAD CSV</code> command is preceded by a <code>file:///</code> prefix. This prefix points to the <i>import</i> directory in the <i>neo4j</i> home folder (for security reasons this directory is the <a href = 'https://neo4j.com/docs/operations-manual/5/configuration/file-locations/'>default root</a> for files that are imported via <code>LOAD CSV</code>). The csv file must be copied/moved to the <i>import</i> folder prior to importing.
<br>
<code>
    LOAD CSV FROM 'file:///mycsvfile.csv' ... ;
</code>
<br>
The <code>LOAD CSV</code> command loads data line by line: each line is treated as an array of strings and each field is an element of the array. Imported values are always read as strings, so if a different data type is needed within the Neo4j instance, they must be converted explicitly with Cypher functions such as <code>toInteger()</code> or <code>toFloat()</code>. Indexing the line array gives access to a specific field value. For example, if each line of the csv file contains the attributes of several entities, each entity can be created as a labeled node and its attributes stored as node properties. If the schema is known and rigid, the position of each attribute within the line is known in advance, so indexing the line array allows us to assign the right properties to each node we create.
<br>
<code>
    LOAD CSV FROM 'file:///mycsvfile.csv' AS line
        CREATE(:<i>node1</i> {<i>property1: line[0]</i>, <i>property2: line[1]</i>, ..., <i>property[i]: line[j]</i>}),
              (:<i>node2</i> {<i>property1: line[m]</i>, <i>property2: line[n]</i>, ..., <i>property[p]: line[q]</i>}),
              ... ;
</code>
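<br>
The "array of strings" behaviour can be illustrated outside Neo4j. The following is only a sketch in Python, not part of the import procedure: it uses the standard <code>csv</code> module as a stand-in for <code>LOAD CSV</code>, and the file content and the second sample row are hypothetical.

```python
import csv
import io

# Hypothetical csv content without a header row (stands in for mycsvfile.csv).
raw = "192,Zaina,Madan\n193,Ada,Lovelace\n"

# Like LOAD CSV, csv.reader yields each line as a list of strings.
rows = list(csv.reader(io.StringIO(raw)))

# Positional access mirrors line[0], line[1], ... in the Cypher statement.
first = rows[0]
student_id_text = first[0]    # still a string, exactly as read from the file
student_id = int(first[0])    # explicit conversion, comparable to toInteger(line[0])

print(first, type(student_id_text).__name__, student_id)
```

As in Cypher, nothing is converted unless we ask for it: the identifier stays a string until it is explicitly cast.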

<h4>
Options
</h4><br>
Options make it possible to account for the specific structure of different csv files. They are used to handle csv files with a header row or to specify a custom field delimiter (the default is a comma).
<br>
<br>
- <code><b>WITH HEADERS</b></code><br>
The presence of a header row in the csv file can improve the process of node and property creation. If this is the case, we can include the <code>WITH HEADERS</code> option in the <code>LOAD CSV</code> statement and field value access can be performed via field name rather than by indexing:
<br>
<code>
    LOAD CSV WITH HEADERS FROM 'file:///mycsvfile.csv' AS line
        CREATE(:<i>node1</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[i]: line.fieldName</i>}),
              (:<i>node2</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[p]: line.fieldName</i>}),
              ... ;
</code>
<br>
- <code><b>FIELDTERMINATOR</b></code><br>
Fields in the line array are separated by a comma by default, but this setting can be overridden by the <code>FIELDTERMINATOR</code> option:
<br>
<code>
    LOAD CSV WITH HEADERS FROM 'file:///mycsvfile.csv' AS line FIELDTERMINATOR ';'
        CREATE(:<i>node1</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[i]: line.fieldName</i>}),
              (:<i>node2</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[p]: line.fieldName</i>}),
              ... ;
</code>
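<br>
The combined effect of a header row and a custom delimiter can be sketched in Python with <code>csv.DictReader</code>. This is only an analogy to <code>WITH HEADERS</code> and <code>FIELDTERMINATOR</code>, assuming a hypothetical semicolon-separated file with three of the student fields.

```python
import csv
import io

# Hypothetical semicolon-separated content with a header row.
raw = "studentID;firstName;lastName\n192;Zaina;Madan\n"

# delimiter=";" plays the role of FIELDTERMINATOR ';',
# and DictReader uses the first row as field names, like WITH HEADERS.
lines = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

# Access by field name mirrors line.firstName in Cypher.
print(lines[0]["firstName"], lines[0]["lastName"])
```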
<h4>
Actual <code>LOAD CSV</code> statement implementation
</h4><br>
Beyond the syntax presented above, a further consideration concerns the actual implementation.<br>It is advisable to separate node creation and relationship creation into distinct statements: this way each load performs only one part of the import at a time and can move through large amounts of data quickly and efficiently, reducing heavy processing.<br>
The actual statements passed to Neo4j for importing the dataset with 250k rows are the following:
<br>
<h5>
Create student nodes:
</h5>
<code>
    LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line 
	MERGE(:student {studentID: line.studentID, firstName: line.firstName, lastName: line.lastName,
                    dob: line.dateOfBirth, genre: line.genre, country: line.country, town: line.town,
                    email: line.email});
</code>
<br>
<h5>
Create course nodes:
</h5>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MERGE(:course {courseID: line.courseID, discipline: line.discipline, courseName: line.courseName, 
                   courseYear: line.courseYear, syllabus: line.syllabus});
</code>
<br>
<h5>
Create learning material nodes:
</h5>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MERGE(:material {materialID: line.materialID, unit: line.unit, mType: line.materialType, mName: line.name,
                     dimension: line.dimension, accessDate: line.accessDate});
</code>
<br>
<h5>
Create relationship between student nodes and course nodes:
</h5>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MATCH(s:student {studentID: line.studentID, firstName: line.firstName, lastName: line.lastName, 
                     dob: line.dateOfBirth, genre: line.genre, country: line.country, town: line.town, 
                     email: line.email})
	MATCH(c:course {courseID: line.courseID, discipline: line.discipline, courseName: line.courseName, 
                    courseYear: line.courseYear, syllabus: line.syllabus})
	MERGE (s) -[:is_enrolled]-> (c);
</code>
<br>
<h5>
Create relationship between student nodes and learning material nodes:
</h5>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MATCH(s:student {studentID: line.studentID, firstName: line.firstName, lastName: line.lastName, 
                     dob: line.dateOfBirth, genre: line.genre, country: line.country, town: line.town, 
                     email: line.email})
	MATCH(m:material {materialID: line.materialID, unit: line.unit, mType: line.materialType, mName: line.name, 
                      dimension: line.dimension, accessDate: line.accessDate})
	MERGE (s) -[:studies] -> (m);
</code>
<br>
<h5>
Create relationship between course nodes and learning material nodes:
</h5>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MATCH(c:course {courseID: line.courseID, discipline: line.discipline, courseName: line.courseName, 
                    courseYear: line.courseYear, syllabus: line.syllabus})
	MATCH(m:material {materialID: line.materialID, unit: line.unit, mType: line.materialType, mName: line.name, 
                      dimension: line.dimension, accessDate: line.accessDate})
	MERGE (c) -[:uses] -> (m);
</code>
<br>
<h4>
A note on using the APOC plugin<br>
</h4>
When using the <code>LOAD CSV</code> method, if the csv file is large and a large number of nodes or relationships must be created, it is possible to run into memory allocation problems. A convenient way to address such problems is to call the <code>apoc.periodic.iterate</code> procedure from the APOC Neo4j plugin (some references <a href = 'https://neo4j.com/labs/apoc/5/installation/#docker'>here</a>). The plugin can be installed by setting the required environment variables in the Docker compose file (see the Neo4j Docker compose file used in the project). The procedure is called with the <code>CALL</code> command and requires three parameters:

- a first Cypher statement that retrieves the data (in my case, a statement with two <code>MATCH</code> commands to retrieve the required nodes and a <code>RETURN</code> command listing the node variables, plus any row value needed by the second statement);
- a second Cypher statement that tells the system what to do with the data (in my case, create the relationship between the retrieved nodes with the <code>MERGE</code> command);
- a final parameter (a map of key-value pairs) that sets various options such as the batch size, which determines how many batches are used to perform the procedure.

One example is the following:<br>
<code>
    CALL apoc.periodic.iterate(
    "LOAD CSV WITH HEADERS FROM 'file:///dataset500k.csv' AS line 
        MATCH(s:student {studentID: line.studentID, firstName: line.firstName, lastName: line.lastName, 
            dob: line.dateOfBirth, genre: line.genre, country: line.country, town: line.town, email: line.email}) 
        MATCH(m:material {materialID: line.materialID, unit: line.unit, mType: line.materialType, mName: line.name, 
            dimension: line.dimension}) 
        RETURN s, m, line.accessDate AS accessDate", 
    "MERGE (s) -[:studies {accessDate: accessDate}] -> (m)",
    {batchSize: 5000, parallel: false});
</code>
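<br>
To give an idea of what the batch size means, the sketch below splits a sequence of rows into groups of at most <code>batchSize</code> elements, which is conceptually how the rows returned by the first statement are handed to the second one. It is plain Python, not APOC code; the helper name <code>batches</code> and the row count of 12,000 are hypothetical.

```python
from itertools import islice
from math import ceil


def batches(rows, batch_size):
    """Yield successive lists of at most batch_size items taken from rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical example: 12,000 rows processed with batchSize 5000.
rows = range(12_000)
sizes = [len(batch) for batch in batches(rows, 5000)]
print(sizes)  # [5000, 5000, 2000]

# The number of batches is the row count divided by the batch size, rounded up.
assert len(sizes) == ceil(12_000 / 5000)
```

A smaller batch size means more, lighter steps; a larger one means fewer steps that each hold more data in memory.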
<br>
<h4>
Notes when running Neo4j from a Docker container
</h4><br>
When loading the csv file from the local machine, we must take into account that a Docker container has a file system of its own, so a DBMS running inside a container has access to that file system, not to the file system of the local machine. Hence, the csv file must be copied into the container with the <code>docker cp</code> command.
<br>
<code>
    docker cp path/mycsvfile.csv container:/path
</code>

For security reasons, files can only be imported from a default directory (the <i>import</i> directory). It is therefore advisable to use this directory as the <code>docker cp</code> destination instead of the container root (the setting can be changed by editing the config file, although this is not recommended). The command syntax is then:
<br>
<code>
    docker cp path/mycsvfile.csv container:/var/lib/neo4j/import
</code>

<h3>Python - Neo4j interaction</h3>

Interaction between a Python application and a Neo4j DBMS requires the installation of a specific driver. The list of drivers for various programming languages is provided in the <a href = 'https://neo4j.com/docs/getting-started/languages-guides/'>Connect to Neo4j</a> web page of the Neo4j website: the <a href = 'https://neo4j.com/docs/api/python-driver/current/'>Neo4j Python driver</a> is the official library for Python.<br>Once installed, the driver can be imported into a Python environment in the usual way. In particular, we need to import the <code>GraphDatabase</code> class and set up a driver instance.

In [13]:
from neo4j import GraphDatabase

<h4>
Establishing a connection to Neo4j
</h4><br>
To instantiate a driver object we pass the DBMS location ('neo4j://localhost:7687' or 'neo4j://127.0.0.1:7687', for a local machine) to the <code>uri</code> parameter of the <code>GraphDatabase.driver()</code> method and the authentication values ('neo4j', '<i>mypassword</i>') to the <code>auth</code> parameter. This does not actually establish a connection with the DBMS: it merely gives the driver object the information it needs to connect. The connection is established when a query is executed. Nonetheless, we can check whether the given parameters allow a connection to be established by calling the <code>verify_connectivity()</code> method of the driver object:
<br>
<code>
with GraphDatabase.driver(uri = 'neo4j://localhost:7687', auth = ('neo4j', '<i>mypassword</i>')) as driver:
    driver.verify_connectivity()
</code>

In [16]:
driver = GraphDatabase.driver(uri = 'neo4j://127.0.0.1:7687', auth = ('neo4j', 'myPassword'))

As was done for the interaction between Python and Redis, I consider it better to run four different Neo4j instances. When importing the student, course and learning material nodes, I use the <code>MERGE</code> command, which does not create a node if it already exists. If I imported the four csv files one after another into the same instance, I would end up storing only the largest, most comprehensive dataset.
Instead, I create three more Neo4j instances and import one csv file into each of the four instances.
Then I will perform the four queries on the four different Neo4j instances.
Using the <code>CREATE</code> command in place of <code>MERGE</code> would perhaps be worse, because nodes would be duplicated and I would not know exactly which copy is related to which other node.
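
The difference between the two commands can be mimicked with a small Python sketch. It is only an analogy, not how Neo4j works internally: the functions <code>create</code> and <code>merge</code>, the in-memory lists and the course identifiers are hypothetical stand-ins for the graph store and the data.

```python
def create(store, label, **props):
    """Always add a new node, as CREATE does."""
    store.append((label, props))


def merge(store, label, **props):
    """Add a node only if no node with the same label and properties exists,
    as MERGE does when the whole pattern is given."""
    node = (label, props)
    if node not in store:
        store.append(node)


# Two hypothetical csv rows describing the same student enrolled in two courses.
rows = [
    {"studentID": "192", "courseID": "c1"},
    {"studentID": "192", "courseID": "c2"},
]

created, merged = [], []
for row in rows:
    create(created, "student", studentID=row["studentID"])
    merge(merged, "student", studentID=row["studentID"])

print(len(created), len(merged))  # 2 1
```

With the create-like behaviour the student appears once per csv row, while the merge-like behaviour keeps a single node, which is also why importing a second, overlapping file into the same instance would add only the nodes that are not already there.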

In [None]:
driver2 = GraphDatabase.driver(uri = 'neo4j://127.0.0.1:7692', auth = ('neo4j', 'myPassword'))
driver3 = GraphDatabase.driver(uri = 'neo4j://127.0.0.1:7693', auth = ('neo4j', 'myPassword'))
driver4 = GraphDatabase.driver(uri = 'neo4j://127.0.0.1:7694', auth = ('neo4j', 'myPassword'))

<h4>
Executing a query
</h4><br>
Queries can be performed by passing Cypher statements to the <code>execute_query()</code> method of the driver object. The target database can optionally be passed through the <code>database_</code> keyword argument (note the trailing underscore: keyword arguments without it are treated as query parameters), but our Neo4j instance hosts a single user database, so there is no possible ambiguity on where the query is executed. The result is a named tuple whose fields are the <code>records</code> returned, the query <code>summary</code> and the returned <code>keys</code>. It is convenient to unpack them into Python variables for subsequent accesses, and since they are unpacked by position it is important to assign them in the shown order (records first, summary second, keys third).

In [29]:
query1_records, query1_summary, query1_keys = driver.execute_query('MATCH(s:student) WHERE s.studentID = \'192\' RETURN s.firstName, s.lastName, s.genre', database_ = 'neo4j')

<h4>
Displaying a query result
</h4><br>
The query result is stored in the <code>records</code> object. We can access each record returned by the query by looping over the <code>records</code> object and using the <code>data()</code> method available within each record in the loop. This displays each record as a dictionary.

In [30]:
for record in query1_records:
    print(record.data())

{'s.firstName': 'Zaina', 's.lastName': 'Madan', 's.genre': 'female'}


We can manipulate each record dictionary for cleaner printing purposes.

In [32]:
print('firstName', ' ' * (15 - len('firstName')), 'lastName', ' ' * (15 - len('lastName')) , 'genre')
for record in query1_records:
    print(record.data()['s.firstName'], ' ' * (15 - len(record.data()['s.firstName'])),  record.data()['s.lastName'], ' ' * (15 - len(record.data()['s.lastName'])), record.data()['s.genre'])

firstName        lastName         genre
Zaina            Madan            female
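
The same kind of aligned output can be produced with a small reusable helper. The sketch below assumes the rows are plain dictionaries such as those returned by <code>record.data()</code>; the helper name <code>format_table</code> and the column width are hypothetical choices, and the sample row is the one printed above.

```python
def format_table(rows, width=15):
    """Return the rows (a list of dictionaries sharing the same keys)
    as left-aligned text columns, with the keys as a header line."""
    if not rows:
        return ""
    headers = list(rows[0])
    lines = ["".join(h.ljust(width) for h in headers).rstrip()]
    for row in rows:
        lines.append("".join(str(row[h]).ljust(width) for h in headers).rstrip())
    return "\n".join(lines)


# Stand-in for [record.data() for record in query1_records].
rows = [{'s.firstName': 'Zaina', 's.lastName': 'Madan', 's.genre': 'female'}]
print(format_table(rows))
```

Using <code>str.ljust</code> avoids computing the padding by hand for every field.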


<h4>
Measuring and displaying the query execution time
</h4>
<h4>
- method 1: <code>time()</code>
</h4><br>
To display the query execution time we can use the <code>time()</code> function of the already introduced <i>time</i> Python module. We store the time in two different variables, the first one defined immediately before query execution, the second one defined immediately after query execution. Their difference is the elapsed time in seconds, which we multiply by 1000 to obtain milliseconds. We must recall that this value also includes the time required to connect to the DBMS and to return the result to the programming environment.

In [33]:
import time
before = time.time()
query1_records, query1_summary, query1_keys = driver.execute_query('MATCH(s:student) WHERE s.studentID = \'192\' RETURN s.firstName, s.lastName, s.genre', database_ = 'neo4j')
after = time.time()
mytime = (after - before) * 1000

In [36]:
print('Query execution time is:', mytime)

Query execution time is: 18.45407485961914


<h4>
- method 2: <code>summary</code>
</h4><br>
To display the query execution time we can also use the <i>summary</i> object previously stored in the <i>query1_summary</i> variable. One of its attributes is <code>result_available_after</code>, which reports, in milliseconds, the time after which the server had the result available.

In [35]:
print('Query execution time is:', query1_summary.result_available_after)

Query execution time is: 1


<h4>
- chosen method: <code>time()</code>
</h4><br>
To be consistent across the project, I deem it better to rely on the <code>time()</code> function of the Python <code>time</code> module to mark the time before and after operations are executed and compute their difference.
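
One possible way to keep the measurement uniform is to wrap it in a helper. The sketch below is an assumption, not the code used in the project: the name <code>timed_ms</code> is hypothetical, and it swaps <code>time.time()</code> for <code>time.perf_counter()</code>, a clock intended for measuring short intervals. A trivial function call stands in for the query.

```python
import time


def timed_ms(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


# Stand-in workload; in the notebook the callable would be driver.execute_query.
result, elapsed = timed_ms(sum, range(1000))
print('Result:', result)
print('Execution time (ms):', elapsed)
```

As with the manual approach, the measured value includes everything that happens inside the call, not only the work done by the server.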

In [37]:
mytime

18.45407485961914

