### i0u19a - Data Processing - KU Leuven

# Python Neo4j exercises

###### _Jan Aerts_

![license](https://licensebuttons.net/l/by/3.0/88x31.png)

Hello and welcome to the tutorial on data processing with **Neo4j**!
Before proceeding, make sure the neo4j server is running, e.g. with `docker run -d -p 7474:7474 jandot/neo4j-i0u19a`.

We'll be using Jupyter notebook again (you're looking at it) as a tool to walk you through a few examples. At the VDA-Lab, we like notebooks as a teaching tool because they allow you to experiment with code and data as you work your way through the document.

A few guidelines on the notebook itself:
* A notebook consists of *cells*, which are snippets of either text (markdown) or code (Python in this case).
* Cells can be executed by clicking the `[>]` "play" button, or by hitting shift-enter on the keyboard.
* You can navigate between cells either by clicking or by using the arrow buttons.

### Neo4j documentation

Check the general Neo4j documentation [here](http://neo4j.com/docs/stable/).

py2neo API documentation: [http://py2neo.org/2.0/essentials.html](http://py2neo.org/2.0/essentials.html)

# The data
The dataset consists of gene-gene and gene-disease interactions. Gene nodes have the label `:Gene`; disease nodes have the label `:Disease`. Gene-gene relationships are of type `:INTERACTS_WITH`; gene-disease relationships are of type `:AFFECTS`.

Node and relationship properties:
* gene nodes: geneId, name
* disease nodes: diseaseId, name
* gene-gene relationships: nr_proofs, proofs
* gene-disease relationships: score, associationType

# 1. Querying the data using the browser
The neo4j database has a good web interface that you can use to look at the data. It can be reached on port 7474: [http://192.168.99.100:7474](http://192.168.99.100:7474) (as always: change the IP if necessary). It looks like this:
![screenshot_browser](images/screenshot_browser.png)

Data can be queried using the CYPHER language. See [here](http://neo4j.com/developer/cypher-query-language/) for a tutorial. Some quick example queries:

* Get 5 nodes: `MATCH (n) RETURN n LIMIT 5;`
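* Count the occurrences of a node with 2 incoming links (every pair of incoming links is a separate match, so a node with many incoming links is counted more than once): `MATCH ()-->(m)<--() RETURN COUNT(m);`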

To get things working a bit faster, let's create some indexes on the name fields. Run the following commands in the browser query field:
* `CREATE INDEX ON :Gene(name);`
* `CREATE INDEX ON :Disease(name);`

### Exercises
* Fetch 3 diseases from the database.
* What is the number of diseases in the database?
* Find the number of paths between genes of length 2.
* Which diseases are directly affected by the gene CRISP3?
* What is the shortest path between the genes CRISP3 and ADAM22?
* What are the 3 gene nodes with the highest degree?
* Are there any gene-gene connections that might be due to indirect connections? In other words: which 2 genes are connected in 2 steps that are also connected in 1 step? That might mean that the direct connection is actually indirect.
* Are there any cliques (= fully-connected subgraphs) of exactly 4 nodes? Return 1.

# 2. Using the python API
The CYPHER language as used above is very useful for extracting specific patterns from a graph database. However, if we want to investigate bigger structures, for example to find out more about the overall structure in the data, it is often necessary to go for a more generic approach.

The python API for neo4j we use here is `py2neo`.

## 2.1 Connecting to the database

To connect to the database from python, we will first need to:
* Load the necessary modules
* Create a graph_db connection

The default user/password for neo4j is `neo4j/neo4j`, but the `jandot/neo4j-i0u19a` docker image is configured so that you don't need a username/password to access the data. This has been done by changing the neo4j configuration file:

```
sed -i.bak 's/dbms.security.auth_enabled=true/dbms.security.auth_enabled=false/' /var/lib/neo4j/conf/neo4j-server.properties
```

For a list of commands available for the py2neo API, see [here](http://py2neo.org/2.0/essentials.html).

So let's connect.

In [None]:
from py2neo import Graph, Node, Relationship
graph_db = Graph("http://192.168.99.100:7474/db/data/") # Change IP if necessary

## 2.2 Simple queries
We can still use CYPHER as a language when working in py2neo:

In [None]:
graph_db.cypher.execute("MATCH (n:Gene)-[r]->(m:Gene)-[s]->(o:Disease) RETURN n.name,m.name,o.name LIMIT 5;")

In [None]:
results = graph_db.cypher.execute("MATCH (n:Gene)-[r]->(m:Gene)-[s]->(o:Disease) RETURN n.name,m.name,o.name LIMIT 20;")
for result in results:
    print("Gene 1: ", result[0], "; Gene 2: ", result[1])

In [None]:
abcd1 = graph_db.find_one("Gene", "name", "ABCD1")
print(abcd1.properties['name'])

What are the 10 disease nodes with the highest degree? Return the link-count and the node name.

In [None]:
results = graph_db.cypher.execute(# COMPLETE THIS)
for result in results:
    print(#COMPLETE THIS)

## 2.3 Does this look like a random network?
Let's see how degree frequencies are distributed, and compare to a random network afterwards.

To calculate degree frequencies:

In [None]:
results = graph_db.cypher.execute("MATCH (n:Gene)-[r]->(m:Gene) RETURN n, count(r);")
counts = {}
for result in results:
    # COMPLETE THIS
counts
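
The counting idea can be tried out on made-up data first. The sketch below is only an illustration: `toy_rows` and its gene names are hypothetical stand-ins for the query results, and it assumes each row pairs a node with its number of relationships. It maps every degree to the number of nodes that have that degree.

```python
from collections import Counter

# Hypothetical stand-in for the query results: (node name, number of relationships)
toy_rows = [("geneA", 3), ("geneB", 1), ("geneC", 3),
            ("geneD", 2), ("geneE", 1), ("geneF", 1)]

# Map each degree to the number of nodes that have that degree
degree_frequencies = Counter(degree for _name, degree in toy_rows)

# Print the frequencies ordered by degree
for degree in sorted(degree_frequencies):
    print(degree, degree_frequencies[degree])
```

A `Counter` behaves like a dictionary, so it can be passed to the plotting code in the same way as a plain `dict`.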

Let's plot these using bokeh:

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

In [None]:
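a = sorted(counts.keys()) # Sort the x values so that the line is drawn from left to right
b = [counts[key] for key in a]
p = figure(title="Degree distribution of gene nodes", x_axis_label='degree', y_axis_label='number of nodes')
p.line(a, b, legend="Genes", line_width=2)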
show(p)

Let's generate a new graph with the same ratio of nodes to relationships, connected at random. Ideally, we'd use the same numbers as in the real dataset, but that would take too long to generate, so we'll take 1/10 of the size.

Steps:
1. find out the number of gene nodes and the number of relationships
2. create new nodes and generate random relationships between them
3. recreate the same plot as above
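
Before writing anything to the database, the steps above can be rehearsed in memory. This is a minimal sketch, not the notebook's solution: the function `random_network` and the sizes used are hypothetical, it assumes at least 2 nodes, and it allows the same pair of nodes to be connected more than once.

```python
import random
from collections import Counter

def random_network(nr_nodes, nr_relationships, seed=None):
    """Return a list of (source, target) pairs between randomly chosen nodes.

    Assumes nr_nodes >= 2; duplicate pairs are allowed, self-loops are not.
    """
    rng = random.Random(seed)
    edges = []
    while len(edges) < nr_relationships:
        source = rng.randrange(nr_nodes)
        target = rng.randrange(nr_nodes)
        if source != target:  # skip relationships from a node to itself
            edges.append((source, target))
    return edges

# Hypothetical sizes, chosen only to keep the example fast
edges = random_network(nr_nodes=100, nr_relationships=300, seed=42)

# Outgoing degree per node, then the frequency of each degree
out_degree = Counter(source for source, _target in edges)
degree_frequencies = Counter(out_degree.values())
print(sorted(degree_frequencies.items()))
```

Nodes without any outgoing relationship do not show up in `out_degree`, which mirrors a query that only matches nodes with at least one outgoing relationship.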

#### 1. Get the number of gene nodes, and the number of relationships.

In [None]:
nr_nodes = # COMPLETE THIS
print(nr_nodes)
nr_relationships = # COMPLETE THIS
print(nr_relationships)

#### 2. Create new nodes and random relationships
Tips:
* Give the nodes the label "RandomNode" so that you can easily filter them.
* Creating these nodes and relationships might take a while. To know how far in the process you are, open a neo4j browser (i.e. http://192.168.99.100:7474), and execute either `MATCH (n:RandomNode) RETURN COUNT(n);` or `MATCH (n:RandomNode)-[r]->(m) RETURN COUNT(r);`.
* Start with a **small** network first, to see if everything works as it should. For example, 1/100th of the real network. You can easily remove any tryouts with `MATCH (n:RandomNode) DETACH DELETE n;`.

In [None]:
from random import randint
n = graph_db.cypher.execute("MATCH (n:RandomNode) RETURN n LIMIT 1;")
if not n: # Only continue if there are no RandomNodes yet; otherwise delete the existing ones first (see tips above)
    nr_random_nodes = # COMPLETE THIS
    nr_random_relationships = # COMPLETE THIS
    for i in range(nr_random_nodes):
        # COMPLETE THIS: create nodes of type RandomNode; see py2neo for documentation on `create`
        # Give each node a number because you'll have to be able to refer to it in the next step

    for i in range(nr_random_relationships):
        # COMPLETE THIS: create relationship between random nodes of type RandomNode

#### 3. Now calculate everything and plot for these random data => are they the same?
Calculate the same counts as above, and create the same plot. You do not have to load the bokeh library anymore, because we already did that.

In [None]:
random_results = # COMPLETE THIS
random_counts = {}
for result in random_results:
    # COMPLETE THIS
random_counts

In [None]:
a = # COMPLETE THIS
b = # COMPLETE THIS
p = figure(title="Degree distribution of random nodes", x_axis_label='degree', y_axis_label='number of nodes')
p.line(a, b, legend="Random", line_width=2)
show(p)

## 2.4 Centralities
For many purposes, it is important to know how "central" a node is in a network. Of course, different definitions exist for centrality...

### A. Degree centrality
Same as node degree: how many relationships does each node have?

What are the 5 nodes with the highest degree centrality?

In [None]:
connected_nodes = # COMPLETE THIS
connected_nodes
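
As an illustration of degree centrality outside the database, here is a small sketch on a made-up edge list. The names in `toy_edges` are hypothetical, and each pair is treated as one undirected relationship that counts towards the degree of both of its ends.

```python
from collections import Counter

# Hypothetical undirected edge list; each pair is one relationship
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

# Every relationship adds 1 to the degree of both nodes it connects
degree = Counter()
for source, target in toy_edges:
    degree[source] += 1
    degree[target] += 1

# The nodes with the highest degree, as (node, degree) pairs
top_nodes = degree.most_common(2)
print(top_nodes)
```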

### B. Betweenness centrality
How many critical paths go through this node? Calculate all shortest paths between all pairs of nodes, and check how many of these go through your node of interest. Unfortunately, this is a very compute-intensive operation, so we'll create a small network of 15 nodes and 17 edges that looks like this:
![small network](images/small_network.png)

In [None]:
n = graph_db.cypher.execute("MATCH (n:NewNode) RETURN n LIMIT 1;")
links = [[0,1],[1,2],[1,3],[2,3],[2,4],[3,4],[4,5],[4,6],[6,7],[6,8],[8,9],[8,10],[8,11],[10,11],[4,12],[12,13],[12,14]]
if not n:
    newNodes = {}
    for i in range(0,15):
        newNodes[i] = Node("NewNode", number=i)
        graph_db.create(newNodes[i])
    for link in links:
        graph_db.create(Relationship(newNodes[link[0]], "CONNECTS_TO", newNodes[link[1]]))

Given this smaller dataset, let's calculate the betweenness centrality for each node.

Tip: to calculate the shortest path between 2 nodes, use something like:
```
MATCH p = shortestPath((a:NewNode {number: 0})-[*..500]-(b:NewNode {number: 14})) RETURN p
```
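
Since the same query is needed for many pairs of nodes, it can help to build the query string with a small helper. The sketch below is one possible way to do that with `str.format`; the names `QUERY_TEMPLATE` and `shortest_path_query` are hypothetical and not part of py2neo. Literal curly braces in the CYPHER pattern are doubled so that `format` leaves them alone.

```python
# Doubled braces ({{ and }}) produce literal braces in the formatted string
QUERY_TEMPLATE = (
    "MATCH p = shortestPath((a:NewNode {{number: {start}}})"
    "-[*..500]-(b:NewNode {{number: {end}}})) RETURN p"
)

def shortest_path_query(start, end):
    """Build the shortestPath query for two node numbers."""
    # int() guards against accidentally pasting arbitrary text into the query
    return QUERY_TEMPLATE.format(start=int(start), end=int(end))

print(shortest_path_query(0, 14))
```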

If we're going to run `shortestPath` queries, we need to find out what the output looks like exactly. So let's just calculate one, and dig in.

In [None]:
query = "MATCH p = shortestPath((a:NewNode {number: 3})-[*..500]-(b:NewNode {number: 11})) RETURN p"
result = graph_db.cypher.execute(query)
print("#### Result as a whole:")
print(result)
print("#### Getting the actual result without ID column:")
print(result[0])
print("#### Getting the path p:")
print(result[0].p)
print("#### What methods does this path object have?")
print(dir(result[0].p))
print("#### Getting the length of the path:")
print(result[0].p.size)
print("#### Getting a list of the nodes:")
print(result[0].p.nodes)
print("#### Getting the node numbers (the `number` property of each node):")
print(list(map(lambda x: x.properties['number'], result[0].p.nodes)))

Now let's count the actual number of times that each node is in the shortest path between any other two nodes.

In [None]:
template_query = "MATCH p = shortestPath((a:NewNode {number: X})-[*..500]-(b:NewNode {number: Y})) RETURN p"
node_counts = {}

## Initialize the counts to 0
for i in range(0,15):
    node_counts[i] = 0

## Fetch all shortest paths and count the number of times each node is mentioned
for i in range(0,15):
    for j in range(0,15):
        if i < j: # We don't want to fetch each shortest path twice (i.e. in both directions)
            query = template_query.replace('X',str(i)).replace('Y',str(j))
            # COMPLETE THIS: run the query, and count the total number of times a node is mentioned in a path
            # by adding to node_counts

node_counts
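
To check what the counts should roughly look like, the same idea can be sketched in plain Python, without the database. This is an assumption-laden illustration, not the intended solution: it reuses the `links` list from above, finds one shortest path per pair with a breadth-first search (the helper `one_shortest_path` is hypothetical), and counts every node on the path, including the two end points. A textbook betweenness centrality would consider all shortest paths and leave out the end points.

```python
from collections import deque

links = [[0,1],[1,2],[1,3],[2,3],[2,4],[3,4],[4,5],[4,6],[6,7],[6,8],[8,9],
         [8,10],[8,11],[10,11],[4,12],[12,13],[12,14]]

# Undirected adjacency lists for the 15 nodes
neighbours = {i: [] for i in range(15)}
for a, b in links:
    neighbours[a].append(b)
    neighbours[b].append(a)

def one_shortest_path(start, end):
    """Breadth-first search returning one shortest path as a list of nodes.

    Assumes that end is reachable from start (the small network is connected).
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for nxt in neighbours[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    # Walk back from end to start using the parent pointers
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

# Count how often each node appears on a shortest path between any two nodes
toy_counts = {i: 0 for i in range(15)}
for i in range(15):
    for j in range(i + 1, 15):
        for node in one_shortest_path(i, j):
            toy_counts[node] += 1

print(toy_counts)
```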

### C. Closeness centrality
This is a bit less stringent than betweenness centrality: how close is a node to the "middle" of the network? Calculate this by taking the average length of the shortest paths between this node and all other nodes.

In [None]:
template_query = "MATCH p = shortestPath((a:NewNode {number: X})-[*..500]-(b:NewNode {number: Y})) RETURN p"
path_lengths = {}
for i in range(0,15):
    path_lengths[i] = 0

for i in range(0,15):
    for j in range(0,15):
        if i < j:
            query = template_query.replace('X',str(i)).replace('Y',str(j))
            # COMPLETE THIS: how is this different from the betweenness centrality?

path_lengths
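
The averaging step can also be sketched in plain Python on the same small network. Again, this is only an illustration under assumptions: it reuses the `links` list from above, the helper `distances_from` is hypothetical, and distances are measured in number of relationships using a breadth-first search. The node with the lowest average distance is the most central one by this measure.

```python
from collections import deque

links = [[0,1],[1,2],[1,3],[2,3],[2,4],[3,4],[4,5],[4,6],[6,7],[6,8],[8,9],
         [8,10],[8,11],[10,11],[4,12],[12,13],[12,14]]

# Undirected adjacency lists for the 15 nodes
neighbours = {i: [] for i in range(15)}
for a, b in links:
    neighbours[a].append(b)
    neighbours[b].append(a)

def distances_from(start):
    """Breadth-first search: number of steps from start to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

# Average shortest path length from each node to all other nodes
average_distance = {}
for node in neighbours:
    distance = distances_from(node)
    others = [d for other, d in distance.items() if other != node]
    average_distance[node] = sum(others) / len(others)

most_central = min(average_distance, key=average_distance.get)
print(most_central, average_distance[most_central])
```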

## 2.5 Cliques
Above, we identified cliques with 4 nodes. It'd be more useful to search for those containing 7 or more, as suggested in the paper by Pradhan et al., [Cliques for the identification of gene signatures for colorectal cancer across population](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3524317/). Unfortunately, this will give us a `Java heap space` error with the current setup of the server.

So can we do this using the python API instead? One possible option might work like this:
* Find all nodes that have a degree of 6 (in a clique of 7, each node is connected to the 6 others).
* For each of these, fetch the related nodes, and store these in a dictionary:
```
{key: node_a, linked_nodes: [node_b, node_c, node_d, ...]}
{key: node_b, linked_nodes: [node_a, node_c, node_d, ...]}
...
```
* If we then take those arrays of linked nodes, add the key, and sort these, we will get something like this:
```
[node_a, node_b, node_c, node_d, ...] # based on neighbours of node_a
[node_a, node_b, node_c, node_d, ...] # based on neighbours of node_b
...
```
* If a certain combination appears 7 times, we have a clique of 7.

Of course, this is for nodes with **exactly** 6 relations. So for us to find a clique of 7 like that, the clique would have to be disconnected from the rest of the graph. To correct for that, we actually need to search for nodes with *6 or more* relations, and instead of just checking for identity between the resulting arrays we have to check for subsets: does a group of 7 nodes occur within the arrays of all 7 of its members?
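
The simple version of this idea (cliques that have no relationships to the rest of the graph) can be sketched in plain Python. This is a rough illustration under stated assumptions, not a general clique finder: the function `isolated_cliques` and the toy graph are hypothetical, and the sketch relies on the fact that a node in a clique of `clique_size` nodes has `clique_size - 1` neighbours inside that clique. A clique of 4 is used here to keep the example small.

```python
from collections import Counter
from itertools import combinations

def isolated_cliques(neighbours, clique_size):
    """Find cliques whose members have no relationships outside the clique."""
    groups = Counter()
    for node, linked in neighbours.items():
        # Only nodes with exactly clique_size - 1 neighbours can qualify
        if len(linked) == clique_size - 1:
            # Add the node itself to its neighbours and sort, so that all
            # members of the same clique produce an identical tuple
            groups[tuple(sorted(linked | {node}))] += 1
    # A group that was produced by every one of its members is a clique
    return [group for group, count in groups.items() if count == clique_size]

# Hypothetical toy graph: a, b, c and d form a clique of 4; e and f are a separate pair
toy_edges = list(combinations("abcd", 2)) + [("e", "f")]
neighbours = {}
for source, target in toy_edges:
    neighbours.setdefault(source, set()).add(target)
    neighbours.setdefault(target, set()).add(source)

print(isolated_cliques(neighbours, clique_size=4))
```

Extending this to cliques that are embedded in a larger graph would mean relaxing the degree test to "at least `clique_size - 1` neighbours" and replacing the identity check by a subset check, which is considerably more expensive.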