<center>
<h2>Online learning platform database - Neo4j</h2>
</center>

<h3>Preliminary operations: import csv files into Neo4j (<code>LOAD CSV</code>)</h3>

A csv file can be imported into Neo4j with the <code>LOAD CSV</code> command, described in the <a href = 'https://neo4j.com/docs/cypher-manual/current/clauses/load-csv/'>Cypher manual</a> (<i>Cypher</i> is Neo4j’s declarative query language), although <a href = 'https://neo4j.com/docs/getting-started/data-import/csv-import/'>other methods</a> are better suited to large files.
<br>
<h4>
Syntax
</h4><br>
The file to be imported via the <code>LOAD CSV</code> command is preceded by a <code>file:///</code> prefix. This prefix points to the <i>import</i> directory in the <i>neo4j</i> home folder (for security reasons this directory is the <a href = 'https://neo4j.com/docs/operations-manual/5/configuration/file-locations/'>default root</a> for files that are imported via <code>LOAD CSV</code>). The csv file must be copied/moved to the <i>import</i> folder prior to importing.
<br>
<code>
    LOAD CSV FROM 'file:///mycsvfile.csv' ... ;
</code>
<br>
The <code>LOAD CSV</code> command loads data line by line: each line is treated as an array of strings and each field is an element of that array. Imported values are always read as strings, so if we want a different data type in our Neo4j instance we must convert them explicitly with Cypher functions such as <code>toInteger()</code> or <code>toFloat()</code>. Indexing the line array gives access to a specific field value. For example, if each line of our csv file contains the attributes of several entities, the entities can be created as labeled nodes and their attributes stored as node properties. If the schema is known and rigid, we know which fields belong to which entity, so indexing the line array lets us assign each property to the right node.
<br>
<code>
    LOAD CSV FROM 'file:///mycsvfile.csv' AS line
        CREATE(:<i>node1</i> {<i>property1: line[0]</i>, <i>property2: line[1]</i>, ..., <i>property[i]: line[j]</i>}),
              (:<i>node2</i> {<i>property1: line[m]</i>, <i>property2: line[n]</i>, ..., <i>property[p]: line[q]</i>}),
              ... ;
</code>
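
As a rough Python analogy (this is not Neo4j code), the sketch below reads a small in-memory csv with the standard <code>csv</code> module. Each line arrives as a list of strings, fields are reached by index, and a numeric value has to be converted explicitly. The sample data and the field layout are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: studentID, firstName, courseYear (no header row).
sample = io.StringIO("192,Zaina,2\n305,Marco,1\n")

rows = []
for line in csv.reader(sample):
    # Every field is read as a string, similar to a LOAD CSV line array.
    student_id, first_name, course_year = line[0], line[1], line[2]
    rows.append({
        "sID": student_id,               # kept as a string
        "firstName": first_name,
        "courseYear": int(course_year),  # explicit conversion to an integer
    })

print(rows)
```

Here <code>int()</code> plays the part that a conversion function would play inside a Cypher statement.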

<h4>
Options
</h4><br>
Options let <code>LOAD CSV</code> adapt to the structure of different csv files, for example a file with a header row or one with a custom field delimiter (the default is a comma).
<br>
<br>
- <code><b>WITH HEADERS</b></code><br>
A header row in the csv file makes node and property creation easier to write and to read. When one is present, we can add the <code>WITH HEADERS</code> option to the <code>LOAD CSV</code> statement and access field values by field name rather than by index:
<br>
<code>
    LOAD CSV WITH HEADERS FROM 'file:///mycsvfile.csv' AS line
        CREATE(:<i>node1</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[i]: line.fieldName</i>}),
              (:<i>node2</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[p]: line.fieldName</i>}),
              ... ;
</code>
<br>
- <code><b>FIELDTERMINATOR</b></code><br>
Fields in the line array are separated by a comma by default, but this setting can be overridden by the <code>FIELDTERMINATOR</code> option:
<br>
<code>
    LOAD CSV WITH HEADERS FROM 'file:///mycsvfile.csv' AS line FIELDTERMINATOR ';'
        CREATE(:<i>node1</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[i]: line.fieldName</i>}),
              (:<i>node2</i> {<i>property1: line.fieldName</i>, <i>property2: line.fieldName</i>, ..., <i>property[p]: line.fieldName</i>}),
              ... ;
</code>
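
The two options have close counterparts in Python's standard <code>csv</code> module, which may help to picture what they do. In the sketch below (an analogy only, with a made-up semicolon-delimited sample), <code>csv.DictReader</code> takes field names from the first row and <code>delimiter</code> sets the field separator.

```python
import csv
import io

# Hypothetical semicolon-delimited file with a header row.
sample = io.StringIO("courseID;courseName\nC01;Databases\nC02;Statistics\n")

# DictReader uses the first row as field names (like WITH HEADERS);
# delimiter=';' plays the role of a custom FIELDTERMINATOR.
courses = [dict(line) for line in csv.DictReader(sample, delimiter=";")]

for line in courses:
    # Fields are accessed by name instead of by position.
    print(line["courseID"], line["courseName"])
```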
<h4>
Actual <code>LOAD CSV</code> statement implementation
</h4><br>
Building on the above, a few more considerations apply to the actual implementation.<br>First of all, it is advisable to create nodes and relationships in separate statements: this way each load performs only one part of the import at a time and can move through large amounts of data quickly and efficiently.<br>Moreover, we have to take into account the presence of null values in some of the learning material names. Null values are not stored by Neo4j (a property either exists or it doesn't, and in the latter case it doesn't need to be created). When importing a flat csv file, the presence of null values causes the import to fail, so it must be handled in some way. The option I chose is perhaps not the most faithful to the logic of a graph database, but it avoids dropping a property that is mandatory in my schema, and it works for this case: it applies the <code>coalesce()</code> function to the <i>materialName</i> field. <code>coalesce()</code> returns the first non-null value among its arguments, so it can supply a fallback default when a value is missing: I set this default to '<i>empty</i>'.
The actual statement passed to Neo4j for importing the dataset with 250k rows is then the following:
<br>
<code>
    LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line 
	MERGE(:student {sID: line.studentID, firstName: line.firstName, lastName: line.lastName, dob: line.dateOfBirth, 
                    genre: line.genre, country: line.country, town: line.town, email: line.email});
</code>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MERGE(:course {cID: line.courseID, discipline: line.discipline, courseName: line.courseName, 
                   courseYear: line.courseYear, syllabus: line.syllabus});
</code>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MERGE(:material {mID: line.materialID, unit: line.unit, mType: line.materialType, 
                     mName: coalesce(line.materialName, 'empty'), dimension: line.dimension, 
                     accessDate: line.accessDate});
</code>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MATCH(s:student {sID: line.studentID, firstName: line.firstName, lastName: line.lastName, 
                     dob: line.dateOfBirth, genre: line.genre, country: line.country, town: line.town, 
                     email: line.email})
	MATCH(c:course {cID: line.courseID, discipline: line.discipline, courseName: line.courseName, 
                    courseYear: line.courseYear, syllabus: line.syllabus})
	MERGE (s) -[:is_enrolled]-> (c);
</code>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MATCH(s:student {sID: line.studentID, firstName: line.firstName, lastName: line.lastName, 
                     dob: line.dateOfBirth, genre: line.genre, country: line.country, town: line.town, 
                     email: line.email})
	MATCH(m:material {mID: line.materialID, unit: line.unit, mType: line.materialType, 
                      mName: coalesce(line.materialName, 'empty'), dimension: line.dimension, 
                      accessDate: line.accessDate})
	MERGE (s) -[:studies] -> (m);
</code>
<code>
LOAD CSV WITH HEADERS FROM 'file:///dataset250k.csv' AS line
	MATCH(c:course {cID: line.courseID, discipline: line.discipline, courseName: line.courseName, 
                    courseYear: line.courseYear, syllabus: line.syllabus})
	MATCH(m:material {mID: line.materialID, unit: line.unit, mType: line.materialType, 
                      mName: coalesce(line.materialName, 'empty'), dimension: line.dimension, 
                      accessDate: line.accessDate})
	MERGE (c) -[:uses] -> (m);
</code>
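
To make the fallback logic concrete, here is a minimal Python sketch of what <code>coalesce()</code> does with a missing material name. The <code>coalesce</code> function below is a hypothetical Python stand-in written for illustration, not part of any Neo4j library, and the sample rows are made up.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None

# Hypothetical rows: the second one has no material name.
materials = [
    {"materialID": "M1", "materialName": "Intro slides"},
    {"materialID": "M2", "materialName": None},
]

# Same idea as coalesce(line.materialName, 'empty') in the statements above.
names = [coalesce(m["materialName"], "empty") for m in materials]
print(names)
```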
<br>
<h4>
Notes when running Neo4j from a Docker container
</h4><br>
When loading the csv file from the local machine, we must take into account that a Docker container comes with a file system of its own, so a DBMS running inside a container has access to that file system, not to the file system of the local machine. Hence, the csv file must be copied into the container via the usual <code>docker cp</code> command.
<br>
<code>
    docker cp path/mycsvfile.csv container:/path
</code>

For security reasons, Neo4j only allows external files to be imported from a default directory (the <i>import</i> directory). It is therefore advisable to use this directory as the <code>docker cp</code> destination instead of the container root (changing this setting requires editing the config file, which is not recommended). The command syntax would then be:
<br>
<code>
    docker cp path/mycsvfile.csv container:/var/lib/neo4j/import
</code>

<h3>Python - Neo4j interaction</h3>

Interaction between a Python program and a Neo4j DBMS requires the installation of a specific driver. The list of drivers for various programming languages is provided on the <a href = 'https://neo4j.com/docs/getting-started/languages-guides/'>Connect to Neo4j</a> page of the Neo4j website: the <a href = 'https://neo4j.com/docs/api/python-driver/current/'>Neo4j Python driver</a> is the official library for a Python programming environment.<br>Once installed, the driver can be imported into a Python environment in the usual way. In particular, we need to import the <code>GraphDatabase</code> class and set up a driver instance.

In [11]:
from neo4j import GraphDatabase

<h4>
Establishing a connection to Neo4j
</h4><br>
To instantiate a driver object we pass the DBMS location ('neo4j://localhost' or 'neo4j://127.0.0.1', for a local machine) to the <code>uri</code> parameter of the <code>GraphDatabase.driver()</code> method and the authentication values ('neo4j', '<i>mypassword</i>') to the <code>auth</code> argument. This does not actually establish a connection with the DBMS: it merely gives the driver object the information it needs to connect. The connection is established when a query is executed. Nonetheless, we can check whether a connection can be established with the given parameters by using the <code>verify_connectivity()</code> method of the driver object:
<br>
<code>
with GraphDatabase.driver(uri = 'neo4j://localhost', auth = ('neo4j', '<i>mypassword</i>')) as driver:
    driver.verify_connectivity()
</code>

In [23]:
driver = GraphDatabase.driver(uri = 'neo4j://127.0.0.1', auth = ('neo4j', 'X4mPpd3V'))

<h4>
Executing a query
</h4><br>
Queries can be performed by passing Cypher statements to the <code>execute_query()</code> method of the driver object. The target database can optionally be specified through the <code>database_</code> parameter (note the trailing underscore: keyword arguments without it are treated as query parameters), but our Neo4j instance hosts a single user database, so there is no possible ambiguity about where the query is executed. The method returns a result object that behaves like a named tuple with three fields: the <code>records</code> returned, the query <code>summary</code> and the returned <code>keys</code>. It is convenient to unpack them into Python variables for later use. Since unpacking is positional, they must be assigned in the order shown (records first, summary second, keys third).

In [92]:
query1_records, query1_summary, query1_keys = driver.execute_query('MATCH(s:student) WHERE s.sID = \'192\' RETURN s.firstName, s.lastName, s.genre', database_ = 'neo4j')
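
An alternative to embedding the student ID in the query string is to pass it as a query parameter. The sketch below assumes a 5.x Python driver, where <code>execute_query()</code> accepts the <code>parameters_</code> and <code>database_</code> keyword arguments. The <code>find_student</code> helper is a hypothetical name introduced here for illustration.

```python
def find_student(driver, student_id, database="neo4j"):
    """Return name and genre of one student as a list of dictionaries.

    Hypothetical helper: `driver` is expected to be a neo4j driver object.
    """
    query = (
        "MATCH (s:student) WHERE s.sID = $sID "
        "RETURN s.firstName, s.lastName, s.genre"
    )
    # parameters_ carries the value of $sID; database_ selects the database.
    records, summary, keys = driver.execute_query(
        query, parameters_={"sID": student_id}, database_=database
    )
    return [record.data() for record in records]
```

With the driver instantiated above, <code>find_student(driver, '192')</code> would run the same query as the cell above, without any quote escaping.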

<h4>
Displaying a query result
</h4><br>
The query result is stored in the <code>records</code> object. We can access each record returned by the query by looping over the <code>records</code> object and using the <code>data()</code> method available within each record in the loop. This displays each record as a dictionary.

In [106]:
for record in query1_records:
    print(record.data())

{'s.firstName': 'Zaina', 's.lastName': 'Madan', 's.genre': 'female'}


We can manipulate each record dictionary for cleaner printing purposes.

In [112]:
print('firstName', ' ' * (15 - len('firstName')), 'lastName', ' ' * (15 - len('lastName')) , 'genre')
for record in query1_records:
    print(record.data()['s.firstName'], ' ' * (15 - len(record.data()['s.firstName'])),  record.data()['s.lastName'], ' ' * (15 - len(record.data()['s.lastName'])), record.data()['s.genre'])

firstName        lastName         genre
Zaina            Madan            female
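
The same kind of fixed-width layout can be produced with <code>str.ljust()</code>, which avoids computing the padding by hand. This is only a sketch: <code>format_table</code> is a hypothetical helper, and it works on plain dictionaries shaped like the output of <code>record.data()</code>, so it can be tried without a database connection.

```python
def format_table(rows, columns, width=15):
    """Render a list of dicts as left-aligned, fixed-width text lines."""
    header = "".join(col.ljust(width) for col in columns).rstrip()
    lines = [header]
    for row in rows:
        # Pad every value to the same width, then drop trailing blanks.
        cells = "".join(str(row[col]).ljust(width) for col in columns)
        lines.append(cells.rstrip())
    return lines

# Dictionary shaped like the record displayed earlier.
rows = [{"s.firstName": "Zaina", "s.lastName": "Madan", "s.genre": "female"}]
table = format_table(rows, ["s.firstName", "s.lastName", "s.genre"])
print("\n".join(table))
```

With real query results, the list of rows could be built as <code>[record.data() for record in query1_records]</code>.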


<h4>
Measuring and displaying the query execution time
</h4><br>
To display the query execution time we can use the summary object previously stored in the <i>query1_summary</i> variable. One of the attributes of the <code>summary</code> object is <code>result_available_after</code>, which reports the time, in milliseconds, after which the server had the result available.

In [113]:
print('Query execution time is:', query1_summary.result_available_after)

Query execution time is: 1
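
The summary object also exposes <code>result_consumed_after</code>, the time in milliseconds after which the server had finished consuming the result. The sketch below formats both values. <code>describe_timing</code> is a hypothetical helper, and a <code>SimpleNamespace</code> with made-up timings stands in for a real summary so that the snippet runs without a database.

```python
from types import SimpleNamespace

def describe_timing(summary):
    """Format the server-side timings (in milliseconds) of a result summary."""
    return (
        f"available after {summary.result_available_after} ms, "
        f"consumed after {summary.result_consumed_after} ms"
    )

# Stand-in for a real summary object, with made-up timings.
fake_summary = SimpleNamespace(result_available_after=1, result_consumed_after=3)
message = describe_timing(fake_summary)
print(message)
```

With the query above, the call would be <code>describe_timing(query1_summary)</code>.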
