# Extracting general information about the dataset

This notebook uses the graph algorithms provided by Neo4j's Graph Data Science library (https://neo4j.com/docs/graph-data-science/1.3/algorithms/).

It includes the following algorithms:
* Clustering: Label propagation
* Clustering: Louvain algorithm
* Centrality: PageRank algorithm
* Centrality: Betweenness centrality
* Similarity: Node similarity

First, let us connect to the database:

In [1]:
from neo4j import GraphDatabase, basic_auth
from dotenv import load_dotenv
import os

load_dotenv()

neo4jUser = os.getenv("NEO4J_USER")
neo4jPwd = os.getenv("NEO4J_PASSWORD_DS")
neo4jUrl = os.getenv("NEO4j_BOLT_DS")

driver = GraphDatabase.driver(neo4jUrl,auth=basic_auth(neo4jUser, neo4jPwd))

## Initial preparation

As the graph visualization showed, we do not need the `ExclusionReason`, `ConsideredPaper`, or `Calculation` nodes. They are now deleted from the database.

**NB!** These nodes are deleted permanently. If you need the original database afterwards, rerun the script that generated it, or run these algorithms in a Sandbox instead.

In [17]:
with driver.session() as session:
    session.run("""
        MATCH (n) WHERE n:ConsideredPaper OR n:ExclusionReason OR n:Calculation
        DETACH DELETE n
    """)


Many of the algorithms are run on the entire graph, so we create that projection now:

In [18]:
with driver.session() as session:
    session.run("CALL gds.graph.create('all-nodes', '*', '*')")

ClientError: {code: Neo.ClientError.Procedure.ProcedureCallFailed} {message: Failed to invoke procedure `gds.graph.create`: Caused by: java.lang.IllegalArgumentException: A graph with name 'all-nodes' already exists.}
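
The `ClientError` above appears whenever the cell is rerun while the projection is still in the graph catalog. One way to make the cell safe to rerun is to check the catalog first. The following is a minimal sketch, not part of the original notebook: the helper name `ensure_projection` is hypothetical, and it assumes that `gds.graph.exists` is available in the installed GDS version and that `session` behaves like a `neo4j` driver session.

```python
def ensure_projection(session, graph_name="all-nodes"):
    """Create the named projection only if it is not already in the catalog.

    Returns True if the projection was created, False if it already existed.
    """
    record = session.run(
        "CALL gds.graph.exists($name) YIELD exists RETURN exists",
        name=graph_name,
    ).single()
    if record["exists"]:
        return False
    # Project every node label and every relationship type, as in the cell above.
    session.run("CALL gds.graph.create($name, '*', '*')", name=graph_name).consume()
    return True
```

With a live connection this would be called as `with driver.session() as session: ensure_projection(session)`.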

## Community detection Algorithms

Many graph representations, such as social networks, divide naturally into communities. Algorithms that discover such communities are called community detection algorithms. These algorithms can help uncover the structure of the graph and group tendencies. Communities are such that node members of a community have more relationships within the community than with nodes outside of that community. Examples of such algorithms are label propagation and the Louvain algorithm.

### Label Propagation

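Label propagation finds communities by spreading labels through the graph. Every node starts with its own unique label; in each iteration, a node adopts the label that is most common among its neighbors, so densely connected groups converge on a shared label.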
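
To make the iteration concrete, here is a minimal pure-Python sketch of label propagation on a toy undirected graph. It is not the GDS implementation: the function name `label_propagation`, the adjacency-dict input, the fixed node order, and the rule of breaking ties by the largest label are all assumptions made here so that the example is deterministic.

```python
from collections import Counter

def label_propagation(adjacency, max_iterations=5):
    """Toy label propagation on an undirected graph given as {node: [neighbors]}."""
    # Every node starts in its own community, labelled by its own id.
    labels = {node: node for node in adjacency}
    for _ in range(max_iterations):
        changed = False
        for node in sorted(adjacency):
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nb] for nb in adjacency[node])
            best = max(counts.values())
            # Assumption: ties are broken by the largest label (deterministic choice).
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # converged before reaching max_iterations
    return labels

# Two triangles (0-1-2 and 3-4-5) joined by the single relationship 2-3.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(two_triangles))  # {0: 2, 1: 2, 2: 2, 3: 5, 4: 5, 5: 5}
```

Each triangle ends up with its own label, which matches the intuition that a community has more relationships inside than outside.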

We run this algorithm here on the entire graph projection:

In [26]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.labelPropagation.stream(
            'all-nodes',
            {
              maxIterations: 5
            }
        ) 
        YIELD nodeId, communityId
        RETURN communityId, count(nodeId) AS size
        ORDER BY size DESC
        LIMIT 15
    """)
    
    for rec in res:
        label_prop_table.append([rec["communityId"], rec["size"]])

pd.DataFrame(label_prop_table, columns=["Community Id", "Size"])

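|   | Community Id | Size |
|---|--------------|------|
| 0 | 9651         | 1265 |
| 1 | 9652         | 729  |
| 2 | 10043        | 562  |
| 3 | 3827         | 356  |
| 4 | 3694         | 195  |
| 5 | 4392         | 189  |
| 6 | 3842         | 167  |
| 7 | 3835         | 117  |
| 8 | 4399         | 109  |
| 9 | 4378         | 69   |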


Next, we store the community id in the database:

In [23]:
with driver.session() as session:
    session.run("""
        CALL gds.labelPropagation.write(
            'all-nodes',
            {
              maxIterations: 5,
              writeProperty:'community'
            }
        )
    """)

Now, we can use this community id to investigate the largest communities.
One option is visualization, but that requires other tools, so for now we look at the labels that occur within the different groups and their counts.

In [None]:
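top_5 = label_prop_table[:5]
label_counts = []
with driver.session() as session:
    for community_id, _size in top_5:
        # Pass the community id as a query parameter instead of formatting it into the string.
        res = session.run("""
            MATCH (n {community: $community})
            RETURN labels(n) AS label, count(*) AS number
        """, community=community_id)

        for rec in res:
            label_counts.append([community_id, rec["label"], rec["number"]])

# Build the table from label_counts, not from label_prop_table.
pd.DataFrame(label_counts, columns=["Community Id", "Label", "Number"])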

### Louvain algorithm

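The Louvain algorithm is a more advanced method for finding communities in large graphs. It detects communities by maximizing modularity, a score that compares the density of relationships inside the communities with the density expected if the relationships were placed at random.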
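
To illustrate the quantity that Louvain tries to maximize, the sketch below computes the modularity of a given partition for a small undirected, unweighted graph. It only scores a partition and does not perform the Louvain optimization itself; the function name `modularity` and the edge-list input are assumptions for this example and are unrelated to the GDS API.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once (at least one edge).
    community: dict mapping node -> community id.
    """
    m = len(edges)
    internal = {}  # number of edges with both ends inside the community
    degree = {}    # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edge fraction - expected fraction).
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )

# Two triangles joined by one edge, scored for two candidate partitions.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}
print(round(modularity(edges, split), 4))   # 0.3571
print(round(modularity(edges, merged), 4))  # 0.0
```

Splitting the graph into its two triangles scores higher than putting every node in one community, which is the kind of improvement the algorithm searches for.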

In [None]:
import pandas as pd

louvain_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.louvain.stream('all-nodes')
        YIELD nodeId, communityId
        RETURN communityId AS louvainId, COUNT(DISTINCT nodeId) AS members
        ORDER BY members DESC
    """)
    
    for rec in res:
        # The query returns the column `members`, not `size`.
        louvain_table.append([rec["louvainId"], rec["members"]])

pd.DataFrame(louvain_table, columns=["Louvain Id", "Size"])

Is there any difference when intermediate communities are included?

In [None]:
import pandas as pd

louvain_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.louvain.stream(
          'all-nodes',
          {
            includeIntermediateCommunities: true
          }
        )
        YIELD nodeId, communityId, intermediateCommunityIds
        RETURN communityId, COUNT(DISTINCT nodeId) AS members, intermediateCommunityIds
        ORDER BY members DESC
    """)
    
    for rec in res:
        # Read the columns the query actually returns; rows are grouped by
        # communityId together with intermediateCommunityIds.
        louvain_table.append([rec["communityId"], rec["members"], rec["intermediateCommunityIds"]])

pd.DataFrame(louvain_table, columns=["Louvain Id", "Size", "Intermediate Community Ids"])

The result is the same.

Now, let's write the property to the graph for further investigation. The nodes receive the property `louvain`, which tells which community they are in.

In [None]:
with driver.session() as session:
    session.run("""
        CALL gds.louvain.write(
          'all-nodes',
          {
            writeProperty: 'louvain'
          }
        )
    """)

As with label propagation, we will find the labels that are included in the largest communities.
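
A minimal sketch of that lookup is given below. It is an illustration rather than a cell from the original analysis: the helper name `labels_per_community` is hypothetical, and it assumes that `session` behaves like a `neo4j` driver session and that the community ids of interest (for example the largest ones from the table above) are passed in by the caller.

```python
def labels_per_community(session, property_name, community_ids):
    """Count node labels inside each listed community.

    property_name is the node property written by the algorithm,
    for example 'louvain' or 'community'.
    """
    rows = []
    for community_id in community_ids:
        # n[$prop] reads the property whose name is passed as a parameter.
        result = session.run(
            """
            MATCH (n) WHERE n[$prop] = $community
            RETURN labels(n) AS label, count(*) AS number
            ORDER BY number DESC
            """,
            prop=property_name,
            community=community_id,
        )
        for record in result:
            rows.append([community_id, record["label"], record["number"]])
    return rows
```

The returned rows can be shown with `pd.DataFrame(rows, columns=["Louvain Id", "Label", "Number"])`.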

## Centrality algorithms

Centrality algorithms measure which nodes are the most influential and have an extensive impact on the graph.
There are multiple ways to measure the centrality of nodes: simpler approaches that only count the in- or out-degree of the nodes, and more advanced methods that take the dynamics of the connected nodes into account.

Here, we run the betweenness centrality algorithm and the PageRank algorithm.

### PageRank

PageRank is a centrality algorithm that takes the influence of all the nodes into account. In each iteration, a node divides its current score equally among the nodes it points to, so a node ranks highly when it receives relationships from other highly ranked nodes.
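
The score transfer can be sketched in a few lines of plain Python. This is an illustrative power iteration under stated assumptions, not the GDS implementation: the function name `pagerank`, the adjacency-dict input, and the handling of nodes without outgoing relationships are choices made for this example. The damping factor of 0.85 and 20 iterations are common defaults.

```python
def pagerank(out_links, damping=0.85, iterations=20):
    """Toy PageRank on a directed graph given as {node: [targets]}."""
    scores = {node: 1.0 for node in out_links}
    for _ in range(iterations):
        # Every node receives a base score of (1 - damping).
        new_scores = {node: 1.0 - damping for node in out_links}
        for node, targets in out_links.items():
            if not targets:
                continue  # assumption: nodes without outgoing links pass nothing on
            # The node's damped score is split equally among its targets.
            share = damping * scores[node] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores

# 'a' and 'b' both point to 'c'; 'c' points back to 'a'.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
for name, score in sorted(pagerank(toy_graph).items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

In this toy graph `c` ranks highest because it receives score from two nodes, `a` follows because it is fed by the highly ranked `c`, and `b` stays at the base score because nothing points to it.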


In [None]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('all-nodes')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, labels(gds.util.asNode(nodeId)) as label, score
        ORDER BY score DESC
        LIMIT 100
    """)
    
    for rec in res:
        label_prop_table.append([rec["name"], rec["label"], rec["score"]])

pd.DataFrame(label_prop_table, columns=["Name", "Label", "Page Rank Score"])

Now, we repeat the procedure for cell types, brain regions, and analyses specifically.

#### Cell types

In [None]:
import pandas as pd

# The 'all-nodes' projection created earlier is reused; creating it again
# fails with "A graph with name 'all-nodes' already exists".
cell_type_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('all-nodes')
        YIELD nodeId, score
        WITH gds.util.asNode(nodeId) AS node, score
        // labels() returns a list, so filter on the label directly.
        WHERE node:CellType
        RETURN node.name AS name, labels(node) AS label, score
        ORDER BY score DESC
        LIMIT 10
    """)
    
    for rec in res:
        cell_type_table.append([rec["name"], rec["label"], rec["score"]])

pd.DataFrame(cell_type_table, columns=["Name", "Label", "Page Rank Score"])

#### Brain regions

In [None]:
import pandas as pd

# The 'all-nodes' projection created earlier is reused; creating it again
# fails with "A graph with name 'all-nodes' already exists".
brain_region_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('all-nodes')
        YIELD nodeId, score
        WITH gds.util.asNode(nodeId) AS node, score
        // labels() returns a list, so filter on the label directly.
        WHERE node:BrainRegion
        RETURN node.name AS name, labels(node) AS label, score
        ORDER BY score DESC
        LIMIT 10
    """)
    
    for rec in res:
        brain_region_table.append([rec["name"], rec["label"], rec["score"]])

pd.DataFrame(brain_region_table, columns=["Name", "Label", "Page Rank Score"])

#### Analyses

In [None]:
import pandas as pd

# The 'all-nodes' projection created earlier is reused; creating it again
# fails with "A graph with name 'all-nodes' already exists".
analysis_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('all-nodes')
        YIELD nodeId, score
        WITH gds.util.asNode(nodeId) AS node, score
        // labels() returns a list, so filter on the label directly.
        WHERE node:Analysis
        RETURN node.name AS name, labels(node) AS label, score
        ORDER BY score DESC
        LIMIT 10
    """)
    
    for rec in res:
        analysis_table.append([rec["name"], rec["label"], rec["score"]])

pd.DataFrame(analysis_table, columns=["Name", "Label", "Page Rank Score"])

### Betweenness Centrality

Another centrality measure is the betweenness centrality. Instead of measuring the nodes' direct influence, it measures the nodes' influence in the graph's information flow. Betweenness centrality measures to what extent a node lies on the path between other nodes.

Neo4j's implementation uses Brandes' approximation of the betweenness centrality.
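
For intuition, the sketch below implements the exact form of Brandes' algorithm for unweighted graphs in plain Python. It is not Neo4j's code, which may differ in sampling and in how undirected relationships are counted; the function name `betweenness` and the adjacency-dict input are assumptions for this example. An undirected graph is given by listing each relationship in both directions, and every pair of nodes is then counted once per direction.

```python
from collections import deque

def betweenness(adjacency):
    """Exact betweenness centrality (Brandes) for an unweighted graph {node: [neighbors]}."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts the shortest paths from the source.
        order = []
        predecessors = {v: [] for v in adjacency}
        paths = {v: 0 for v in adjacency}
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes and accumulate the dependencies.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    return score

# Path a - b - c: only 'b' lies between other nodes.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(path))  # {'a': 0.0, 'b': 2.0, 'c': 0.0}
```

The middle node scores 2.0 because it lies on the shortest path from `a` to `c` and on the one from `c` to `a`.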

In [None]:
import pandas as pd

betweenness_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.betweenness.stream('all-nodes')
        YIELD nodeId, score
        WITH nodeId, gds.util.asNode(nodeId) AS node, score
        RETURN nodeId, node.name AS name, labels(node) AS label, score
        ORDER BY name ASC
    """)
    
    for rec in res:
        betweenness_table.append([rec["nodeId"], rec["name"], rec["label"], rec["score"]])

pd.DataFrame(betweenness_table, columns=["NodeId", "Name", "Label", "Betweenness Score"])

## Similarity algorithms

The final group of graph algorithms presented in this thesis is similarity algorithms. These algorithms measure the similarity of nodes by comparing node pairs.

### Node similarity 

An intuitive similarity algorithm is the node similarity algorithm. This algorithm compares two nodes based on their neighboring nodes. In this algorithm, nodes receive a high similarity score if they connect to many of the same nodes. Node similarity defines the similarity of node _i_ and _j_ as the number of neighbor nodes common for _i_ and _j_, divided by the number of distinct neighbor nodes of _i_ and _j_. This measure is called the **Jaccard coefficient**.
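
As a small worked example of the definition above, the Jaccard coefficient of two neighbor sets can be computed as follows. The function name `jaccard` and the neighbor names are made up for illustration; returning 0.0 for two nodes without neighbors is a choice made for this sketch.

```python
def jaccard(neighbors_i, neighbors_j):
    """Jaccard coefficient: shared neighbors divided by distinct neighbors."""
    a, b = set(neighbors_i), set(neighbors_j)
    union = a | b
    if not union:
        return 0.0  # assumption: two nodes without neighbors count as dissimilar
    return len(a & b) / len(union)

# Hypothetical neighbor sets: two shared neighbors out of four distinct ones.
print(jaccard({"x", "y", "z"}, {"y", "z", "w"}))  # 0.5
```

With `similarityCutoff: 0.5`, a pair like this one would sit exactly on the threshold.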

In [None]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.nodeSimilarity.stream(
          'all-nodes',
          {
            degreeCutoff: 3,
            similarityCutoff: 0.5,
            topN: 50,
            topK: 2
          }
        )
        YIELD node1, node2, similarity
        RETURN gds.util.asNode(node1).name as node1, labels(gds.util.asNode(node1)) as node1Label, gds.util.asNode(node2).name as node2, labels(gds.util.asNode(node2)) as node2Label, similarity
        ORDER BY similarity DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["node1"], rec["node1Label"], rec["node2"], rec["node2Label"], rec["similarity"]])

pd.DataFrame(label_prop_table, columns=["Node 1", "Node 1 Label", "Node 2", "Node 2 Label", "Similarity"])

### Similarity of analyses

We are also interested in similarities between *analyses* specifically.

First, we had a specific use case for the website: the notebook `/2. Extending the data with extracted information - GraphAnalysis/AnalysesSimilarity.ipynb` presents the similarity measure between analyses that is used on the website.
That measure is specific to analyses that match on cell type, region, and species. Consequently, analyses need a similarity score of 1.0 to be matched there.

However, we are also interested in whether there are any unexpected similarities between the analyses based on all the other methodologies connected to the analysis nodes. Those efforts are presented below.

The first step toward a direct relation between the analysis nodes is to add a few extra relationships. These relationships are necessary because the information is spread over a hierarchy of experiments, analyses, and the data types quantitations, distributions, and cell morphologies. When comparing the similarity of the analyses, values from all these nodes might be relevant.
In addition, the existing direct relationships are duplicated under the single type `NODE_SIMILARITY` so that the graph projection is easier to create.

In [None]:
with driver.session() as session:   
    session.run("""
        MATCH (n:Analysis)-->(r:Reporter)
        MERGE (n)-[:NODE_SIMILARITY]->(r)
    """)
    session.run("""
        MATCH (n:Analysis)-->(m:CellType)
        MERGE (n)-[:NODE_SIMILARITY]->(m)
    """)
    session.run("""
        MATCH (n:Analysis)-->(:DataType)-->(:RegionRecord)-->(b:BrainRegion)
        MERGE (n)-[:NODE_SIMILARITY]->(b)
    """)
    session.run("""
        MATCH (n:Analysis)-->(:DataType)-->(r:CellularRegion)
        MERGE (n)-[:NODE_SIMILARITY]->(r)
    """)
    session.run("""
        MATCH (n:Analysis)-->(:Experiment)-[:ANAESTHETIC]->(r:Solution)
        MERGE (n)-[:NODE_SIMILARITY]->(r)
    """)
    session.run("""
        MATCH (n:Analysis)-->(:Experiment)-[:PERFUSION_FIX_MEDIUM]->(r:Solution)
        MERGE (n)-[:NODE_SIMILARITY]->(r)
    """)
    session.run("""
        MATCH (n:Analysis)-->(:Experiment)-->(:Specimen)-->(s:Specie)
        MERGE (n)-[:NODE_SIMILARITY]->(s)
    """)
    session.run("""
        MATCH (n:Analysis)-->(:DataType)-->(s:Software)
        MERGE (n)-[:NODE_SIMILARITY]->(s)
    """)
    ## Microscopes?
    ## Treatment and stuff, look at graph
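
Once these relationships exist, one way to compare the analyses is to project only the `NODE_SIMILARITY` relationships and run node similarity on that projection. The following is a sketch under assumptions this notebook does not confirm: the helper name `analysis_similarity`, the projection name `analysis-similarity`, and the use of the `name` property on the compared nodes are hypothetical, and `session` is assumed to behave like a `neo4j` driver session.

```python
def analysis_similarity(session, graph_name="analysis-similarity", top_n=50):
    """Project the NODE_SIMILARITY relationships and stream node similarity."""
    # Keep every node label but only the relationship type created above.
    session.run(
        "CALL gds.graph.create($name, '*', 'NODE_SIMILARITY')",
        name=graph_name,
    ).consume()
    result = session.run(
        """
        CALL gds.nodeSimilarity.stream($name, {topN: $topN})
        YIELD node1, node2, similarity
        RETURN gds.util.asNode(node1).name AS analysis1,
               gds.util.asNode(node2).name AS analysis2,
               similarity
        ORDER BY similarity DESC
        """,
        name=graph_name,
        topN=top_n,
    )
    rows = [[r["analysis1"], r["analysis2"], r["similarity"]] for r in result]
    # Drop the temporary projection once the rows have been read.
    session.run("CALL gds.graph.drop($name)", name=graph_name).consume()
    return rows
```

Because only analysis nodes have outgoing `NODE_SIMILARITY` relationships, the compared pairs should be analyses; the rows can be displayed with `pd.DataFrame(rows, columns=["Analysis 1", "Analysis 2", "Similarity"])`.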

## Clean-up

Remove the graph projection from the database:

In [None]:
with driver.session() as session:
    session.run("CALL gds.graph.drop('all-nodes')")