# Extracting general information about the dataset

This notebook uses the graph algorithms provided by Neo4j's Graph Data Science library (https://neo4j.com/docs/graph-data-science/1.3/algorithms/).

It includes the following algorithms:
* Clustering: Label propagation
* Clustering: Louvain algorithm
* Centrality: PageRank algorithm
* Centrality: Betweenness centrality
* Similarity: Node similarity

First, let us connect to the database:

In [183]:
from neo4j import GraphDatabase, basic_auth
from dotenv import load_dotenv
import os

load_dotenv()

neo4jUser = os.getenv("NEO4J_USER")
neo4jPwd = os.getenv("NEO4J_PASSWORD_DS")
neo4jUrl = os.getenv("NEO4j_BOLT_DS")

driver = GraphDatabase.driver(neo4jUrl,auth=basic_auth(neo4jUser, neo4jPwd))

## Initial preparation

As discovered through the graph visualization, we do not need the ExclusionReason, ConsideredPaper and Calculation nodes of the database. These will now be deleted from the database.

**NB!** These nodes are deleted permanently. If you need the original database afterwards, rerun the script that generated it, or run these algorithms in a Sandbox instead.

In [184]:
with driver.session() as session:
    session.run("""
        MATCH (n) WHERE n:ConsideredPaper OR n:ExclusionReason OR n:Calculation
        DETACH DELETE n
    """)


Many of the algorithms are run on the entire graph, so we create that projection now:

In [185]:
with driver.session() as session:
    session.run("CALL gds.graph.create('all-nodes', '*', '*')")

## Community detection Algorithms

Many graph representations, such as social networks, divide naturally into communities. Algorithms that discover such communities are called community detection algorithms. These algorithms can help uncover the structure of the graph and group tendencies. Communities are such that node members of a community have more relationships within the community than with nodes outside of that community. Examples of such algorithms are label propagation and the Louvain algorithm.

### Label Propagation

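Label propagation finds communities by iteratively propagating labels through the graph. Every node starts with its own label and repeatedly adopts the label that is most common among its neighbors, so densely connected groups of nodes converge on a shared label.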
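
To make the propagation step concrete, here is a minimal sketch in plain Python on a hypothetical toy graph. It is a simplification and not the GDS implementation: the fixed visiting order and the tie-break (largest label wins) are assumptions made only to keep the example deterministic.

```python
from collections import Counter


def label_propagation(neighbors, max_iterations=5):
    """Toy label propagation on an undirected graph.

    neighbors: dict mapping each node to a list of adjacent nodes.
    Returns a dict mapping each node to its community label.
    """
    # Every node starts in its own community.
    labels = {node: node for node in neighbors}
    for _ in range(max_iterations):
        changed = False
        for node in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[node])
            if not counts:
                continue  # an isolated node keeps its own label
            best = max(counts.values())
            # Assumed tie-break: the largest of the most frequent labels.
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # converged before reaching max_iterations
    return labels


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
toy_graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
communities = label_propagation(toy_graph)
print(communities)
```

On this toy graph each triangle ends up sharing one label. A different visiting order or tie-break can give a different partition, which is why the result of label propagation should be read as one plausible grouping rather than a unique answer.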

We run this algorithm here on the entire graph projection:

In [186]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.labelPropagation.stream(
            'all-nodes',
            {
              maxIterations: 5
            }
        ) 
        YIELD nodeId, communityId
        RETURN communityId, count(nodeId) AS size
        ORDER BY size DESC
        LIMIT 15
    """)
    
    for rec in res:
        label_prop_table.append([rec["communityId"], rec["size"]])

pd.DataFrame(label_prop_table, columns=["Community Id", "Size"])

Unnamed: 0,Community Id,Size
0,9484,1191
1,9485,641
2,9510,419
3,1870,365
4,2423,189
5,1885,187
6,1737,177
7,2430,119
8,1878,116
9,2425,70


Next, we store the community ID in the database:

In [187]:
with driver.session() as session:
    session.run("""
        CALL gds.labelPropagation.write(
            'all-nodes',
            {
              maxIterations: 5,
              writeProperty:'community'
            }
        )
    """)

Now, we can use this community ID to investigate the largest communities.
Visualization is one option, but it requires other tools, so for now we look at the labels within the different communities and their counts.

In [188]:
top_5 = label_prop_table[:5]
label_counts = []
with driver.session() as session:
    for community_id, _size in top_5:
        # Pass the community ID as a query parameter instead of formatting it into the string
        res = session.run("""
            MATCH (n {community: $community})
            RETURN DISTINCT labels(n) AS label, count(*) AS number
        """, community=community_id)

        for rec in res:
            label_counts.append([rec["label"], rec["number"]])

# Show the label counts collected above (not the community sizes in label_prop_table)
pd.DataFrame(label_counts, columns=["label", "number"])


### Louvain algorithm

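The Louvain algorithm is a more advanced method for finding communities in large graphs. The algorithm detects communities by maximizing modularity, a score that compares the density of relationships inside communities with the density expected if the relationships were placed at random.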
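
As an illustration of the score that Louvain tries to maximize, the sketch below computes the modularity of a given partition of a small, hypothetical undirected graph. It only evaluates a partition; it does not perform the Louvain optimization itself, and the function name and toy data are assumptions for this example.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping each node to its community.
    """
    m = len(edges)
    internal = {}    # number of edges with both ends in the community
    degree_sum = {}  # sum of the degrees of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_community = {node: "A" for node in range(6)}

print(modularity(edges, two_communities))
print(modularity(edges, one_community))
```

Splitting the graph into its two triangles scores higher than placing all nodes in one community, which is the kind of comparison Louvain makes when deciding whether to move a node.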

In [189]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.louvain.stream('all-nodes')
        YIELD nodeId, communityId
        RETURN communityId AS louvainId, COUNT(DISTINCT nodeId) AS members
        ORDER BY members DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["louvainId"], rec["members"]])

pd.DataFrame(label_prop_table, columns=["Louvain Id", "Size"])

Unnamed: 0,Louvain Id,Size
0,6146,855
1,6168,459
2,6147,387
3,1479,192
4,561,125
...,...,...
1310,6308,1
1311,6310,1
1312,6311,1
1313,6312,1


Is there any difference when intermediate communities are included?

In [190]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.louvain.stream(
          'all-nodes',
          {
            includeIntermediateCommunities: true
          }
        )
        YIELD nodeId, communityId, intermediateCommunityIds
        RETURN communityId AS louvainId, COUNT(DISTINCT nodeId) AS members, intermediateCommunityIds ORDER BY members DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["louvainId"], rec["members"]])

pd.DataFrame(label_prop_table, columns=["Louvain Id", "Size"])

Unnamed: 0,Louvain Id,Size
0,6146,764
1,6168,551
2,6147,405
3,1479,193
4,1773,117
...,...,...
1337,6308,1
1338,6310,1
1339,6311,1
1340,6312,1


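Yes, the community sizes differ between the two runs. Note that the second query also returns `intermediateCommunityIds`, which Cypher then uses as a grouping key, so the two tables are not directly comparable. What causes the remaining difference is left open here.

Now, let's write the property to the graph for further investigation. The nodes receive the property `louvain`, which tells which community they are in, and `louvain-intermediate` for the intermediate communities.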

In [191]:
with driver.session() as session:
    session.run("""
        CALL gds.louvain.write(
          'all-nodes',
          {
            writeProperty: 'louvain'
          }
        )
    """)
    session.run("""
        CALL gds.louvain.write(
          'all-nodes',
          {
            includeIntermediateCommunities: true,
            writeProperty: 'louvain-intermediate'
          }
        )
    """)

As with label propagation, we can find the labels that are included in the largest communities.

## Centrality algorithms

Centrality algorithms measure which nodes are the most influential and have an extensive impact on the graph.
There are multiple ways to measure the centrality of nodes. There are more simplistic approaches, like only counting the in- or out-degree of the nodes, and more advanced methods that take the dynamics of the connected nodes into account.
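
The simplest of these measures, in- and out-degree, can be sketched in a few lines of plain Python. The edge list below is a hypothetical toy example and not data from the database.

```python
from collections import Counter


def degree_centrality(edges):
    """Return (in_degree, out_degree) counters for a list of directed edges."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree


# Hypothetical directed edges (source, target).
toy_edges = [
    ("analysis_1", "rat"),
    ("analysis_2", "rat"),
    ("analysis_3", "mouse"),
    ("rat", "nomenclature"),
]
in_degree, out_degree = degree_centrality(toy_edges)
print(in_degree.most_common(1))
```

Degree counts only look at a node's direct relationships. The algorithms in the rest of this section also take into account how well connected the neighboring nodes are.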

Here, we run the PageRank algorithm and the betweenness centrality algorithm.

### PageRank

PageRank is a centrality algorithm that takes the influence of all the nodes into account. In each iteration, every node passes its score on to the nodes it points to, split evenly across its outgoing relationships, so a node ranks highly when many nodes, or a few highly ranked nodes, point to it.
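
The iteration can be sketched in plain Python as follows. This is a minimal illustration that assumes the classic unnormalized formula, PR(v) = (1 - d) + d * sum(PR(u) / out_degree(u)) over all nodes u pointing to v, with damping factor d = 0.85. The node names are hypothetical and the code is not the GDS implementation.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Unnormalized PageRank on a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    for source, _ in edges:
        out_degree[source] += 1

    scores = {node: 1.0 - damping for node in nodes}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in nodes}
        for source, target in edges:
            # Each node splits its score evenly over its outgoing edges.
            incoming[target] += scores[source] / out_degree[source]
        scores = {
            node: (1.0 - damping) + damping * incoming[node]
            for node in nodes
        }
    return scores


# Hypothetical toy graph: two analyses point to an experiment,
# which in turn points to a species.
toy_edges = [
    ("analysis_1", "experiment"),
    ("analysis_2", "experiment"),
    ("experiment", "species"),
]
scores = pagerank(toy_edges)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.5f}")
```

Under this formula a node that nothing points to keeps the base score 1 - d = 0.15, while nodes at the end of long chains of references accumulate the highest scores.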


In [192]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('all-nodes')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, labels(gds.util.asNode(nodeId))[0] as label, score
        ORDER BY score DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["name"], rec["label"], rec["score"]])

pd.DataFrame(label_prop_table, columns=["Name", "Label", "Page Rank Score"])

Unnamed: 0,Name,Label,Page Rank Score
0,Rattus norvegicus,Specie,83.087979
1,Mus musculus,Specie,43.203104
2,Expression,CellPhenotypeCategory,37.024117
3,Custom_rat,Nomenclature,25.244152
4,Custom_mouse,Nomenclature,20.544201
...,...,...,...
6309,DB1X,Strain,0.150000
6310,C4A6/B,Substrain,0.150000
6311,C4A6/N,Substrain,0.150000
6312,SST-IRES-Cre/CAG-Lox-stop-Lox-H2B-GFP,TransgenicLine,0.150000


Let's store the result in a CSV and investigate the top 20 nodes:

In [193]:
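# `_` holds the output of the previous cell, i.e. the full PageRank DataFrame
all_page_rank = _
# Use forward slashes so that no backslash is read as an escape sequence
all_page_rank.to_csv("../Data/csvs/graphAnalysis/pageRank_all_nodes.csv")
all_page_rank.head(20)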

Unnamed: 0,Name,Label,Page Rank Score
0,Rattus norvegicus,Specie,83.087979
1,Mus musculus,Specie,43.203104
2,Expression,CellPhenotypeCategory,37.024117
3,Custom_rat,Nomenclature,25.244152
4,Custom_mouse,Nomenclature,20.544201
5,Caudoputamen,BrainRegion,19.811456
6,Paraformaldehyde,Solution,19.480766
7,Wistar,Strain,15.444185
8,Bright-field microscope,Microscope,14.666387
9,Caudoputamen,BrainRegion,14.214945


First, we see that the most influential nodes are the species. This is natural, as every analysis is connected to one of the two species. In addition, we have added relationships between brain region and species and between age category and species, which increases their influence further.

The third most influential node is the cell phenotype category Expression. This category means that the cell type is named after what it expresses. For example, the cell type "Tyrosine hydroxylase expressing" is a cell type that expresses tyrosine hydroxylase.

The next two are the nomenclatures Custom_rat and Custom_mouse. These are just unusual names for the nomenclatures used in this thesis: Waxholm (rat) and Allen mouse brain atlas (mouse). Again, it is natural that these are influential, as they are referenced by every brain region, which in turn is referenced by the analyses and by the species. Since the species are themselves very influential, they increase the influence of the nomenclatures.



Now, we follow the same procedure for cell types, brain regions and analyses specifically.

#### Cell types:

“Neuron” is the most connected cell type with “Tyrosine hydroxylase expressing” in second place

In [194]:
all_page_rank.loc[all_page_rank["Label"] == "CellType"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
31,Neuron,CellType,4.817682
58,Tyrosine hydroxylase expressing,CellType,3.111634
70,Medium spiny neuron,CellType,2.644263
79,Parvalbumin expressing,CellType,2.336629


With a very small margin, NeuN is the most connected cell phenotype

In [195]:
all_page_rank.loc[all_page_rank["Label"] == "CellPhenotype"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
100,NeuN,CellPhenotype,1.813725
124,Nissl,CellPhenotype,1.550272
136,VIP,CellPhenotype,1.489462
150,Tyrosine hydroxylase,CellPhenotype,1.384574


The most common object of interest is Neurons 

In [196]:
all_page_rank.loc[all_page_rank["Label"] == "NeuralStructure"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
28,Neurons,NeuralStructure,5.239823
407,Glia cell,NeuralStructure,0.539081
427,Dendritic spines,NeuralStructure,0.518981
522,Cells,NeuralStructure,0.443439


#### Brain regions

The most investigated brain region, for both species, is Caudoputamen.

In [197]:
all_page_rank.loc[all_page_rank["Label"] == "BrainRegion"].head(10)

Unnamed: 0,Name,Label,Page Rank Score
5,Caudoputamen,BrainRegion,19.811456
9,Caudoputamen,BrainRegion,14.214945
24,Pars compacta,BrainRegion,6.412738
26,Accumbens nucleus,BrainRegion,6.352405
45,Striatum,BrainRegion,3.642465
49,Substantia nigra compact part,BrainRegion,3.502658
62,Substantia nigra,BrainRegion,2.856483
76,Nucleus accumbens,BrainRegion,2.467706
87,Substantia nigra,BrainRegion,2.122685
137,Brainstem,BrainRegion,1.471857


#### Analyses

In [198]:
all_page_rank.loc[all_page_rank["Label"] == "Analysis"].head(10)

Unnamed: 0,Name,Label,Page Rank Score
5017,Cao_2013_females_sections_olig_cells_quantitation,Analysis,0.15
5018,Cao_2013_males_sections_olig_cells_quantitation,Analysis,0.15
5019,Chakraborty_2014_sections_spines_quantitation,Analysis,0.15
5020,Chakraborty_2014_sections_mushroom_spines_quan...,Analysis,0.15
5021,Champy_2003_sections_chat_neurons_quantitation,Analysis,0.15
5022,Champy_2003_sections_DARRP_neurons_quantitation,Analysis,0.15
5023,Champy_2003_sections_ED1_cells_quantitation,Analysis,0.15
5024,Champy_2003_sections_GFAP_cells_quantitation,Analysis,0.15
5025,Champy_2003_sections_NPY_cells_quantitation,Analysis,0.15
5026,Covey_2007_sections_neurons_quantitation,Analysis,0.15


From this, it seems like the analyses have the same score, but let us check:

In [199]:
all_page_rank.loc[all_page_rank["Label"] == "Analysis"]["Page Rank Score"].unique()

array([0.15])

As we can see, all analyses have the same score. A score of 0.15 is what PageRank assigns with the default damping factor of 0.85 when a node receives nothing from other nodes, which suggests that no relationships point to the analysis nodes in this projection. The analyses are, however, connected to experiments, and those might have different scores:

In [200]:
all_page_rank.loc[all_page_rank["Label"] == "Experiment"].head(10)

Unnamed: 0,Name,Label,Page Rank Score
716,Ariano_1997,Experiment,0.332143
1107,Löscher_2006,Experiment,0.240758
1173,Lolova_1995_3m,Experiment,0.235
1174,Lolova_1995_12m,Experiment,0.235
1181,Pickel_2006,Experiment,0.233583
1192,Champy_2003,Experiment,0.231458
1216,Eilam_2003,Experiment,0.228432
1264,Kosinski_1997,Experiment,0.221923
1458,Larsson_2001,Experiment,0.208846
1461,Zoli_1993,Experiment,0.208438


Here, we observe that the experiments are ranked differently. This is probably because one experiment can have many analyses. Ariano, at the top of the list, has 20 analyses recorded, Löscher, 2006 has 15, and Zoli, 1993 has 4. This does not account for everything, though: Wissman, 2012, for example, has 8 analyses reported. The difference is that all experiments in the top list are connected to the species "Rattus norvegicus", which has a much higher PageRank score than "Mus musculus".

#### Sources

Looking at number of experiments, most are collected from Neuroscience (47), Journal of Comparative Neurology (38), and Brain Research (36).

In [201]:
all_page_rank.loc[all_page_rank["Label"] == "SourceOrigin"].head(3)

Unnamed: 0,Name,Label,Page Rank Score
29,Neuroscience,SourceOrigin,5.224971
50,Brain Research,SourceOrigin,3.491674
54,Journal of Comparative Neurology,SourceOrigin,3.287326


#### Methods

“Bright-field microscope” is the most used microscope type.

In [202]:
all_page_rank.loc[all_page_rank["Label"] == "Microscope"].head(3)

Unnamed: 0,Name,Label,Page Rank Score
8,Bright-field microscope,Microscope,14.666387
36,Transmission electron microscope,Microscope,4.274363
140,Fluorescence microscope,Microscope,1.435038


Immunohistochemistry is the most used visualization method, histochemistry the second most. No difference between species (found from visualization)

In [203]:
all_page_rank.loc[all_page_rank["Label"] == "VisualizationProtocol"].head(3)

Unnamed: 0,Name,Label,Page Rank Score
53,Immunohistochemistry,VisualizationProtocol,3.394475
81,Histochemistry,VisualizationProtocol,2.288063
309,Immunofluorescence,VisualizationProtocol,0.701826


Tyrosine hydroxylase and Rabbit antibody are the most used reporter targets.

In [204]:
all_page_rank.loc[all_page_rank["Label"] == "ReporterTarget"].head(3)

Unnamed: 0,Name,Label,Page Rank Score
82,Rabbit antibody,ReporterTarget,2.249094
99,Tyrosine hydroxylase,ReporterTarget,1.815922
175,Mouse antibody,ReporterTarget,1.252259


"Goat anti rabbit_biotin" is the most used Reporter.

In [205]:
all_page_rank.loc[all_page_rank["Label"] == "Reporter"].head(3)

Unnamed: 0,Name,Label,Page Rank Score
284,Goat anti rabbit_biotin,Reporter,0.767074
454,Rabbit anti TH_3,Reporter,0.504921
482,Mouse anti TH_1,Reporter,0.480064


The most used sectioning instrument, excluding "Unspecified", is "Cryostat", followed closely by “Freezing microtome” (17 and 15 percent). Cryostat is the most used in mouse, while Freezing microtome is the most used in rat.

In [206]:
all_page_rank.loc[all_page_rank["Label"] == "SectioningInstrument"].head(3)

Unnamed: 0,Name,Label,Page Rank Score
106,Unspecified,SectioningInstrument,1.74211
112,Freezing microtome,SectioningInstrument,1.697884
119,Cryostat,SectioningInstrument,1.578651


The most used software is “Stereo Investigator”. The second most used is “Custom”.

In [208]:
all_page_rank.loc[all_page_rank["Label"] == "Software"].head(3)

Unnamed: 0,Name,Label,Page Rank Score
19,Stereo Investigator,Software,8.028368
22,Custom,Software,6.888384
48,Neurolucida,Software,3.525006


The most influential solution is Paraformaldehyde.

In [209]:
all_page_rank.loc[all_page_rank["Label"] == "Solution"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
6,Paraformaldehyde,Solution,19.480766
10,Sucrose,Solution,13.737328
11,Unspecified,Solution,13.032988
21,Ethanol,Solution,7.094888


However, solutions are used as *Perfusion fix medium* and *Anaesthetic* from the experiment, as *Mounting medium* from microscopes, and as *Treatment* and *Visualization medium* from the specimens (see picture).
![image-2.png](attachment:image-2.png)

It is interesting to observe them separately. To do this, we need separate projections that each include only one of these relationships at a time. We create three projections:
- one where only Anaesthetic is included
- one where only Perfusion fix medium is included
- one where only visualization medium is included
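
The relationship queries for these projections differ only in which solution relationship is kept, so they can be generated from one list. The sketch below is a suggestion, not part of the original analysis: the helper name `relationship_query` is hypothetical, and the relationship type names are copied as they are spelled in the notebook's queries.

```python
# Relationship types that link to Solution nodes, spelled as in the queries.
SOLUTION_RELATIONSHIPS = [
    "ANAESTHETIC",
    "PERFUSION_FIX_MEDIUM",
    "TREATMENT",
    "VISULIZATION_MEDIUM",
    "MOUNTING_MEDIUM",
]


def relationship_query(keep):
    """Build a Cypher relationship query that excludes every solution
    relationship type except `keep`."""
    if keep not in SOLUTION_RELATIONSHIPS:
        raise ValueError(f"Unknown solution relationship: {keep}")
    excluded = [t for t in SOLUTION_RELATIONSHIPS if t != keep]
    conditions = " AND ".join(f"type(r)<>'{t}'" for t in excluded)
    return (
        f"MATCH (a)-[r]->(b) WHERE {conditions} "
        "RETURN id(a) AS source, id(b) AS target"
    )


anaesthetic_query = relationship_query("ANAESTHETIC")
print(anaesthetic_query)
```

The returned string can be passed as the relationship query of `gds.graph.create.cypher`. Generating it from a single list reduces the risk of the three queries drifting apart when a relationship type is added or renamed.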

In [210]:
with driver.session() as session:
    session.run("""
        CALL gds.graph.create.cypher(
            "anaesthetics", 
            "MATCH (n) return id(n) as id", 
            "MATCH (a)-[r]->(b) WHERE type(r)<>'PERFUSION_FIX_MEDIUM' AND type(r)<>'TREATMENT' AND type(r)<>'VISULIZATION_MEDIUM' AND type(r)<>'MOUNTING_MEDIUM' RETURN id(a) AS source, id(b) AS target"
        )
    """)
                
    session.run("""
        CALL gds.graph.create.cypher(
            "perfusion-fix", 
            "MATCH (n) return id(n) as id", 
            "MATCH (a)-[r]->(b) WHERE type(r)<>'ANAESTHETIC' AND type(r)<>'TREATMENT' AND type(r)<>'VISULIZATION_MEDIUM' AND type(r)<>'MOUNTING_MEDIUM' RETURN id(a) AS source, id(b) AS target"
        )
    """)
                
                
    session.run("""
        CALL gds.graph.create.cypher(
            "visualization-method", 
             "MATCH (n) return id(n) as id", 
            "MATCH (a)-[r]->(b) WHERE type(r)<>'ANAESTHETIC' AND type(r)<>'TREATMENT' AND type(r)<>'PERFUSION_FIX_MEDIUM' AND type(r)<>'MOUNTING_MEDIUM' RETURN id(a) AS source, id(b) AS target"
        )
    """)

In [211]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('anaesthetics')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, labels(gds.util.asNode(nodeId))[0] as label, score
        ORDER BY score DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["name"], rec["label"], rec["score"]])

df = pd.DataFrame(label_prop_table, columns=["Name", "Label", "Page Rank Score"])
df.loc[df["Label"] == "Solution"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
68,Sodium pentobarbitone,Solution,2.569385
73,Unspecified,Solution,2.297328
162,Ketamine-xylacine,Solution,1.241465
179,Chloral hydrate,Solution,1.080898


In [212]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('perfusion-fix')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, labels(gds.util.asNode(nodeId))[0] as label, score
        ORDER BY score DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["name"], rec["label"], rec["score"]])

df = pd.DataFrame(label_prop_table, columns=["Name", "Label", "Page Rank Score"])
df.loc[df["Label"] == "Solution"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
26,Paraformaldehyde,Solution,5.853241
113,Paraformaldehyde-glutaraldehyde,Solution,1.547687
210,Biocytin,Solution,0.897844
291,Unspecified,Solution,0.661041


In [213]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.pageRank.stream('visualization-method')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, labels(gds.util.asNode(nodeId))[0] as label, score
        ORDER BY score DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["name"], rec["label"], rec["score"]])

df = pd.DataFrame(label_prop_table, columns=["Name", "Label", "Page Rank Score"])
df.loc[df["Label"] == "Solution"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
208,Biocytin,Solution,0.924018
448,Neurobiotin,Solution,0.467573
453,DiI particles,Solution,0.461095
465,PalGFP-Sindbis virus,Solution,0.450005


In [214]:
with driver.session() as session:
    session.run("CALL gds.graph.drop('anaesthetics')")
    session.run("CALL gds.graph.drop('perfusion-fix')")
    session.run("CALL gds.graph.drop('visualization-method')")

#### Specimen findings

“Male” is the most influential sex, both for rat and mouse.

In [215]:
all_page_rank.loc[all_page_rank["Label"] == "Sex"]

Unnamed: 0,Name,Label,Page Rank Score
20,Male,Sex,7.24061
116,Female,Sex,1.630765
163,Both,Sex,1.333344


“Adult” is the most often used age category, both for rat and mouse.

In [216]:
all_page_rank.loc[all_page_rank["Label"] == "AgeCategory"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
32,Adult,AgeCategory,4.653285
33,Adult,AgeCategory,4.624871
314,Juvenile,AgeCategory,0.690787
448,Adolescent,AgeCategory,0.509559


Most influential strain: Wistar for rats, C57BL/6 for mice

In [217]:
all_page_rank.loc[all_page_rank["Label"] == "Strain"].head(4)

Unnamed: 0,Name,Label,Page Rank Score
7,Wistar,Strain,15.444185
13,C57BL/6,Strain,9.820796
15,Sprague-Dawley,Strain,9.279906
233,Recombinant inbred mouse strain,Strain,0.938945


### Betweenness Centrality

Another centrality measure is the betweenness centrality. Instead of measuring the nodes' direct influence, it measures the nodes' influence in the graph's information flow. Betweenness centrality measures to what extent a node lies on the path between other nodes.

Neo4j's implementation uses Brandes' approximation of the betweenness centrality
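
For intuition, the sketch below implements the exact Brandes algorithm for an unweighted, directed graph in plain Python. It is not the sampled approximation that the database procedure runs, and the toy graph is hypothetical.

```python
from collections import deque


def betweenness(adjacency):
    """Exact betweenness centrality (Brandes) for an unweighted directed graph.

    adjacency: dict mapping every node to the list of nodes it points to.
    """
    scores = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {v: [] for v in adjacency}
        path_count = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v is on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, starting from the farthest nodes.
        dependency = {v: 0.0 for v in adjacency}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    return scores


# Hypothetical toy graph: every path from a or b to c or d passes through "hub".
toy_adjacency = {"a": ["hub"], "b": ["hub"], "hub": ["c", "d"], "c": [], "d": []}
betweenness_scores = betweenness(toy_adjacency)
print(betweenness_scores)
```

Only `hub` lies on shortest paths between other nodes here, so it is the only node with a non-zero score. The same effect explains why nodes that sit between many analyses and brain regions can rank highly in the results below.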

In [218]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    new_call = """
        CALL gds.betweenness.stream('all-nodes')
        YIELD nodeId, score
        RETURN nodeId, gds.util.asNode(nodeId).name AS name, labels(gds.util.asNode(nodeId)) AS label, score
        ORDER BY name ASC
    """
    ## The sandbox database is not updated with newest production ready algorithms so we call the old way
    res = session.run("""
        CALL gds.alpha.betweenness.sampled.stream('all-nodes')
        YIELD nodeId, centrality
        RETURN nodeId, gds.util.asNode(nodeId).name AS name, labels(gds.util.asNode(nodeId))[0] AS label, centrality
        ORDER BY centrality DESC
    """)
    
    
    for rec in res:
        label_prop_table.append([rec["nodeId"], rec["name"], rec["label"], rec["centrality"]])

pd.DataFrame(label_prop_table, columns=["NodeId", "Name", "Label", "Betweenness Score"])

Unnamed: 0,NodeId,Name,Label,Betweenness Score
0,2503,Caudoputamen,BrainRegion,100393.598727
1,660,Neurons,NeuralStructure,72498.703371
2,2504,Caudoputamen,BrainRegion,50088.967141
3,2521,Striatum,BrainRegion,28120.507868
4,2508,Pars compacta,BrainRegion,24592.760300
...,...,...,...,...
6309,9705,Tg(Th–EGFP)1Gsat/MNmnc,TransgenicLine,0.000000
6310,9706,D1-tdTomato x D2-eGFP,TransgenicLine,0.000000
6311,9707,BAC D1R-tomato,TransgenicLine,0.000000
6312,9708,BAC D2R-eGFP,TransgenicLine,0.000000


We observe from this that the most influential brain regions found by PageRank are still the most influential.

Let's look at the top 10, excluding the brain region and neural structures:

In [219]:
_.loc[(_["Label"] != "BrainRegion") & (_["Label"] != "NeuralStructure")].head(10)

Unnamed: 0,NodeId,Name,Label,Betweenness Score
5,2918,Rostral,RegionZone,21816.773824
6,364,Neuron,CellType,20169.820725
10,344,Medium spiny neuron,CellType,12257.302328
11,2927,Central,RegionZone,8974.979766
13,2916,Dorsomedial,RegionZone,8860.840125
14,2914,Dorsolateral,RegionZone,8168.447479
15,9591,C57BL/6,Strain,7915.447572
16,2919,Caudal,RegionZone,7697.499342
17,345,Tyrosine hydroxylase expressing,CellType,7631.344862
18,419,Dopaminergic neuron,CellType,7494.81486


Here, we observe that the region zones are influential, and that the same cell types as in PageRank still rank highly. Region zones might be more influential here because they lie on the path between quantitations and brain regions. This can easily be the shortest path, as the alternative route goes through a region record. The two paths are at least equally long.
However, this is not very interesting.

## Similarity algorithms

The final group of graph algorithms presented in this thesis is similarity algorithms. These algorithms measure the similarity of nodes by comparing node pairs.

### Node similarity 

An intuitive similarity algorithm is the node similarity algorithm. This algorithm compares two nodes based on their neighboring nodes. In this algorithm, nodes receive a high similarity score if they connect to many of the same nodes. Node similarity defines the similarity of node _i_ and _j_ as the number of neighbor nodes common for _i_ and _j_, divided by the number of distinct neighbor nodes of _i_ and _j_. This measure is called the **Jaccard coefficient**.
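
The Jaccard coefficient described above can be written as a short function. The neighbor sets in the example are hypothetical and only illustrate the calculation.

```python
def jaccard(neighbors_i, neighbors_j):
    """Jaccard coefficient of two neighbor collections: |A & B| / |A | B|."""
    a, b = set(neighbors_i), set(neighbors_j)
    union = a | b
    if not union:
        return 0.0  # two nodes without neighbors share nothing
    return len(a & b) / len(union)


# Hypothetical neighbor sets of two analysis nodes.
neighbors_1 = {"Neuron", "Caudoputamen", "Rattus norvegicus"}
neighbors_2 = {"Neuron", "Caudoputamen", "Mus musculus"}

print(jaccard(neighbors_1, neighbors_2))
print(jaccard(neighbors_1, neighbors_1))
```

Two of the four distinct neighbors are shared in the first comparison, so the score is 0.5. A score of 1.0 means that the two nodes have exactly the same neighbors.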

In [220]:
import pandas as pd

label_prop_table = []
with driver.session() as session:
    res = session.run("""
        CALL gds.nodeSimilarity.stream(
          'all-nodes',
          {
            degreeCutoff: 3,
            similarityCutoff: 0.5,
            topK: 2
          }
        )
        YIELD node1, node2, similarity
        RETURN gds.util.asNode(node1).name as node1, labels(gds.util.asNode(node1))[0] as node1Label, gds.util.asNode(node2).name as node2, labels(gds.util.asNode(node2))[0] as node2Label, similarity
        ORDER BY similarity DESC
    """)
    
    for rec in res:
        label_prop_table.append([rec["node1"], rec["node1Label"], rec["node2"], rec["node2Label"], rec["similarity"]])

pd.DataFrame(label_prop_table, columns=["node1", "node1Label", "node2", "node2Label", "Similarity"])

Unnamed: 0,node1,node1Label,node2,node2Label,Similarity
0,Low huntingtin expressing/calbindin expressing,CellType,Strong huntingtin expressing/calbindin expressing,CellType,1.0
1,Low huntingtin expressing/calbindin expressing,CellType,Moderate huntingtin expressing/calbindin expre...,CellType,1.0
2,Moderate huntingtin expressing/calbindin expre...,CellType,Strong huntingtin expressing/calbindin expressing,CellType,1.0
3,Moderate huntingtin expressing/calbindin expre...,CellType,Low huntingtin expressing/calbindin expressing,CellType,1.0
4,Strong huntingtin expressing/calbindin expressing,CellType,Moderate huntingtin expressing/calbindin expre...,CellType,1.0
...,...,...,...,...,...
3469,Zhang_2016_semithin,Specimen,San Jose_2001_3m_sections_GFAP,Specimen,0.5
3470,Zhang_2016_ultrathin,Specimen,Zheng_2018_ultrathin_D1,Specimen,0.5
3471,Zhang_2016_ultrathin,Specimen,Zhao_2002_ultrathin,Specimen,0.5
3472,Löscher_2006_sham_sections_GAD67heavy_neurons_...,Analysis,Löscher_2006_sections_GAD67_heavily_neurons_di...,Analysis,0.5


Let's store the result in a CSV and investigate the top 10.

In [221]:
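# `_` holds the output of the previous cell, i.e. the full node similarity DataFrame
node_similarity = _
# Use forward slashes so that no backslash is read as an escape sequence
node_similarity.to_csv("../Data/csvs/graphAnalysis/node_similarity_all_nodes.csv")
node_similarity.head(10)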

Unnamed: 0,node1,node1Label,node2,node2Label,Similarity
0,Low huntingtin expressing/calbindin expressing,CellType,Strong huntingtin expressing/calbindin expressing,CellType,1.0
1,Low huntingtin expressing/calbindin expressing,CellType,Moderate huntingtin expressing/calbindin expre...,CellType,1.0
2,Moderate huntingtin expressing/calbindin expre...,CellType,Strong huntingtin expressing/calbindin expressing,CellType,1.0
3,Moderate huntingtin expressing/calbindin expre...,CellType,Low huntingtin expressing/calbindin expressing,CellType,1.0
4,Strong huntingtin expressing/calbindin expressing,CellType,Moderate huntingtin expressing/calbindin expre...,CellType,1.0
5,Strong huntingtin expressing/calbindin expressing,CellType,Low huntingtin expressing/calbindin expressing,CellType,1.0
6,Internal segment,BrainRegion,External segment,BrainRegion,1.0
7,External segment,BrainRegion,Internal segment,BrainRegion,1.0
8,Pallidum,BrainRegion,Ventral tegmental area,BrainRegion,1.0
9,Ventral tegmental area,BrainRegion,Pallidum,BrainRegion,1.0


These are all natural similarities.

### Similarity of analyses

We are also interested in observing similarities between *analyses* specifically.

First, we had a specific use case for the website: the notebook `/2. Extending the data with extracted information - GraphAnalysis/AnalysesSimilarity.ipynb` presents the similarity measure between analyses that is used on the website.
That measure is specific to analyses matching on cell type, region and species, so analyses need a similarity score of 1.0 to be included there.

However, we are also interested in observing whether there are any unexpected similarities between the analyses based on all the other methodologies connected to the analysis nodes. We present these efforts below.

First, we list the analysis similarities from the algorithm run above:

In [223]:
node_similarity.loc[node_similarity["node1Label"]== "Analysis"].head(10)

Unnamed: 0,node1,node1Label,node2,node2Label,Similarity
1686,Wang_2007_sections_SP/ENK_neurons_patch_quanti...,Analysis,Wang_2007_sections_SP/ENK_neuron_matrix_quanti...,Analysis,0.882353
1687,Wang_2007_sections_SP/ENK_neuron_matrix_quanti...,Analysis,Wang_2007_sections_SP/ENK_neurons_patch_quanti...,Analysis,0.882353
1714,Baquet_2009_sections_recon_TH_neurons_quantita...,Analysis,Baquet_2009_sections_recon-2D_TH_neurons_distr...,Analysis,0.833333
1715,Baquet_2009_sections_recon-2D_TH_neurons_distr...,Analysis,Baquet_2009_sections_recon_TH_neurons_quantita...,Analysis,0.833333
1770,Saylor_2006_sections_proximal_spines_quantitation,Analysis,Saylor_2006_sections_all_spines_quantitation,Analysis,0.818182
1771,Saylor_2006_sections_proximal_spines_quantitation,Analysis,Saylor_2006_sections_distal_spines_quantitation,Analysis,0.818182
1772,Saylor_2006_sections_distal_spines_quantitation,Analysis,Saylor_2006_sections_all_spines_quantitation,Analysis,0.818182
1773,Saylor_2006_sections_distal_spines_quantitation,Analysis,Saylor_2006_sections_proximal_spines_quantitation,Analysis,0.818182
1774,Saylor_2006_sections_all_spines_quantitation,Analysis,Saylor_2006_sections_distal_spines_quantitation,Analysis,0.818182
1775,Saylor_2006_sections_all_spines_quantitation,Analysis,Saylor_2006_sections_proximal_spines_quantitation,Analysis,0.818182


Here, we observe the pattern that analyses from the same experiment are similar to each other. Is there a way to move beyond this and only observe similar analyses from different experiments?

In [224]:
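def get_name(analysis_name):
    # Experiment prefix: the first two "_"-separated parts, e.g. "Saylor_2006"
    return analysis_name.str.split("_").str[:2].str.join("_")

# Analysis pairs, most similar first
similar_analyses = node_similarity.loc[node_similarity["node1Label"] == "Analysis"].sort_values(["Similarity", "node1"], ascending=[False, True])
# Keep only the pairs whose two analyses have different experiment prefixes
similar_analyses.loc[get_name(similar_analyses["node1"]) != get_name(similar_analyses["node2"])]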

Unnamed: 0,node1,node1Label,node2,node2Label,Similarity
2848,Chalimoniuk_2006_TH_sections_neurons_quantitation,Analysis,Vidyadhara_2017_cd1_sections_th_neurons_quanti...,Analysis,0.555556
2849,Chalimoniuk_2006_TH_sections_neurons_quantitation,Analysis,Vidyadhara_2017_c57bl/6_sections_th_neurons_qu...,Analysis,0.555556
2831,Lauber_2016_sections_PV_neurons_quantitation,Analysis,Lauber_2018_sections_PV_neurons_quantitation,Analysis,0.555556
2830,Lauber_2016_sections_VVA_neurons_quantitation,Analysis,Lauber_2018_sections_VVA_quantitation,Analysis,0.555556
2828,Lauber_2018_sections_PV_neurons_quantitation,Analysis,Lauber_2016_sections_PV_neurons_quantitation,Analysis,0.555556
2829,Lauber_2018_sections_VVA_quantitation,Analysis,Lauber_2016_sections_VVA_neurons_quantitation,Analysis,0.555556
3379,Kincaid_2001_sections_TH_neurons_quantitation,Analysis,Komnig_2016b_sections_TH_neurons_quantitation,Analysis,0.5
3380,Komnig_2016_24w_sections_TH_neurons_quantitation,Analysis,Komnig_2016b_sections_TH_neurons_quantitation,Analysis,0.5
3381,Komnig_2016b_sections_TH_neurons_quantitation,Analysis,Komnig_2016_24w_sections_TH_neurons_quantitation,Analysis,0.5
3392,Rodrigues_2001_sections_TH_neurons_quantitation,Analysis,Aguirre_1999_sections_TH_neurons_quantitation,Analysis,0.5


These are all analyses that are similar (though only by approximately 50 percent) and are not from the same experiment.
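
The filter above keeps a pair only when the first two underscore-separated tokens of the two names differ. A minimal pure-Python sketch of that comparison, assuming (as the names suggest) that those two tokens are author and year and that they work as a proxy for the experiment. The helper name `experiment_key` is hypothetical; the sample names are taken from the outputs above.

```python
def experiment_key(analysis_name):
    """Return the first two underscore-separated tokens of an analysis name."""
    return tuple(analysis_name.split("_")[:2])


# Pairs of analysis names copied from the similarity outputs above.
pairs = [
    ("Saylor_2006_sections_distal_spines_quantitation",
     "Saylor_2006_sections_all_spines_quantitation"),
    ("Lauber_2016_sections_PV_neurons_quantitation",
     "Lauber_2018_sections_PV_neurons_quantitation"),
    ("Komnig_2016_24w_sections_TH_neurons_quantitation",
     "Komnig_2016b_sections_TH_neurons_quantitation"),
]

# Keep only pairs whose keys differ: the Saylor pair is dropped,
# the Lauber and Komnig pairs are kept.
cross_experiment = [(a, b) for a, b in pairs
                    if experiment_key(a) != experiment_key(b)]
```

This is only a naming heuristic: `Komnig_2016` and `Komnig_2016b` count as different experiments simply because the tokens differ.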

#### Specific analysis nodes
Not all of the nodes that the analyses are connected to are equally relevant, so we can create a projection with only the desired nodes and relationships and run the same analysis again.

We are interested in the following:
- CellType, Brain Region and Neural Structure
- The data type
- Sectioning Instrument
- Reporter
- Microscope
- Species

Some of these relationships do not exist, so we create them before making the projection. In addition, we need to create nodes that specify the data type:

In [248]:
with driver.session() as session:  ## TODO: add a specific relation to all so only analyses are compared, no interlinked relationships
    ## data types
    session.run("CREATE (:AnalysisDataType {id: 1, name: 'Quantitation'})")
    session.run("CREATE (:AnalysisDataType {id: 2, name: 'Distribution'})")
    session.run("CREATE (:AnalysisDataType {id: 3, name: 'Morphology'})")
    
    session.run("""
        MATCH (n:Analysis)
        MATCH (m:AnalysisDataType)
        WHERE n.dataType = m.name
        MERGE (n)-[:NODE_SIMILARITY]->(m)
    """)
    ## brain region
    session.run("""
        MATCH (n:Analysis)-->(:DataType)-->(:RegionRecord)-[:PRIMARY_REGION]->(b:BrainRegion)
        MERGE (n)-[:NODE_SIMILARITY]->(b)
    """)
    ## Specie
    session.run("""
        MATCH (n:Analysis)-->(:Specimen)-->(s:Specie)
        MERGE (n)-[:NODE_SIMILARITY]->(s)
    """)
    ## Microscope
    session.run("""
        MATCH (n:Analysis)-->()-->(m:Microscope)
        MERGE (n)-[:NODE_SIMILARITY]->(m)
    """)
    ## Reporter
    session.run("""
        MATCH (n:Analysis)-->(:ReporterIncubation)-->(r:Reporter)
        MERGE (n)-[:NODE_SIMILARITY {strength: 1}]->(r)
    """)
    ## CellularRegion
    session.run("""
        MATCH (n:Analysis)-->(:DataType)-->(r:CellularRegion)
        MERGE (n)-[:NODE_SIMILARITY]->(r)
    """)
    ## Software
    session.run("""
        MATCH (n:Analysis)-->(:DataType)-->(s:Software)
        MERGE (n)-[:NODE_SIMILARITY]->(s)
    """)
    ## RegionZone
    session.run("""
        MATCH (n:Analysis)-->(:DataType)-->(s:RegionZone)
        MERGE (n)-[:NODE_SIMILARITY]->(s)
    """)
    ## SectioningInstrument
    session.run("""
        MATCH (n:Analysis)-->(s:SectioningInstrument)
        MERGE (n)-[:NODE_SIMILARITY]->(s)
    """)

    ## CellType
    session.run("""
        MATCH (n:Analysis)-->(c:CellType)
        MERGE (n)-[:NODE_SIMILARITY]->(c)
    """)

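The statements in the cell above all share one shape: match a path from an `Analysis` to an attribute node, then `MERGE` a direct `NODE_SIMILARITY` relationship. A minimal sketch of how such statements could be generated from a list of path patterns follows. The helper names `build_shortcut_query` and `create_shortcuts`, the variable name `t` for the target node, and the `RecordingSession` stand-in are all hypothetical; only a few of the patterns from the cell are shown.

```python
# Path patterns copied from the cell above; `t` marks the target attribute node.
SHORTCUT_PATTERNS = [
    "(n:Analysis)-->(:DataType)-->(:RegionRecord)-[:PRIMARY_REGION]->(t:BrainRegion)",
    "(n:Analysis)-->(:Specimen)-->(t:Specie)",
    "(n:Analysis)-->()-->(t:Microscope)",
    "(n:Analysis)-->(t:CellType)",
]


def build_shortcut_query(pattern, rel_type="NODE_SIMILARITY"):
    """Build a Cypher statement linking each Analysis directly to its target node."""
    return f"MATCH {pattern}\nMERGE (n)-[:{rel_type}]->(t)"


def create_shortcuts(session, patterns=SHORTCUT_PATTERNS):
    """Run one MERGE statement per pattern on any object that has a .run() method."""
    for pattern in patterns:
        session.run(build_shortcut_query(pattern))


class RecordingSession:
    """Stand-in for a driver session: records queries instead of executing them."""

    def __init__(self):
        self.queries = []

    def run(self, query):
        self.queries.append(query)


# Dry run: inspect the generated Cypher without touching the database.
dry_run = RecordingSession()
create_shortcuts(dry_run)
print(dry_run.queries[1])
```

Passing a real session from `driver.session()` instead of the stand-in would execute the same statements against the database.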

Then we can create the projection:

In [249]:
with driver.session() as session:
    # Project only the selected node labels, connected by the NODE_SIMILARITY relationships created above.
    res = session.run("""
        CALL gds.graph.create(
            'analyses', 
            ["CellType", "CellularRegion", "NeuralStructure", "Analysis", "SectioningInstrument", "Reporter","VisualizationProtocol", "AnalysisDataType", "Microscope", "Software", "BrainRegion"], 
            'NODE_SIMILARITY'
        )
    """)
    for rec in res:
        print(rec)

<Record graphName='analyses' nodeProjection={'Microscope': {'properties': {}, 'label': 'Microscope'}, 'CellularRegion': {'properties': {}, 'label': 'CellularRegion'}, 'BrainRegion': {'properties': {}, 'label': 'BrainRegion'}, 'CellType': {'properties': {}, 'label': 'CellType'}, 'Analysis': {'properties': {}, 'label': 'Analysis'}, 'Reporter': {'properties': {}, 'label': 'Reporter'}, 'SectioningInstrument': {'properties': {}, 'label': 'SectioningInstrument'}, 'Software': {'properties': {}, 'label': 'Software'}, 'VisualizationProtocol': {'properties': {}, 'label': 'VisualizationProtocol'}, 'NeuralStructure': {'properties': {}, 'label': 'NeuralStructure'}, 'AnalysisDataType': {'properties': {}, 'label': 'AnalysisDataType'}} relationshipProjection={'NODE_SIMILARITY': {'orientation': 'NATURAL', 'aggregation': 'DEFAULT', 'type': 'NODE_SIMILARITY', 'properties': {}}} nodeCount=1213 relationshipCount=3831 createMillis=6>


In [250]:
import pandas as pd

# Collect one row per similar pair returned by the node similarity stream.
similarity_rows = []
with driver.session() as session:
    res = session.run("""
        CALL gds.nodeSimilarity.stream(
          'analyses',
          {
            degreeCutoff: 3,
            similarityCutoff: 0.5,
            topK: 2
          }
        )
        YIELD node1, node2, similarity
        RETURN gds.util.asNode(node1).name as node1, labels(gds.util.asNode(node1))[0] as node1Label, gds.util.asNode(node2).name as node2, labels(gds.util.asNode(node2))[0] as node2Label, similarity
        ORDER BY similarity DESC
    """)

    for rec in res:
        similarity_rows.append([rec["node1"], rec["node1Label"], rec["node2"], rec["node2Label"], rec["similarity"]])

analysis_similarity = pd.DataFrame(similarity_rows, columns=["node1", "node1Label", "node2", "node2Label", "Similarity"])

def get_name(analysis_name):
    return analysis_name.str.split("_").str[:2]

similar_analyses = analysis_similarity.loc[(analysis_similarity["node1Label"]== "Analysis")].sort_values(["Similarity", "node1"], ascending=[False, True])
similar_analyses.loc[(get_name(similar_analyses["node1"]) != get_name(similar_analyses["node2"]) )]

Unnamed: 0,node1,node1Label,node2,node2Label,Similarity
559,Andreassen_2000_sections_ppSS_neurons_quantita...,Analysis,Salin_1990_sections_som_neurons_quantitation,Analysis,1.0
560,Antzoulatos_2011_sections_spines_quantitation,Analysis,Fogarty_2017_young_slices_MSN_spines_quantitation,Analysis,1.0
561,Antzoulatos_2011_sections_spines_quantitation,Analysis,Fogarty_2017_old_slices_MSN_spines_quantitation,Analysis,1.0
200,Baker_1980_BALB/cJ_sections_neuron_quantitation,Analysis,Guidetti_2001_sections_spines_primary_quantita...,Analysis,1.0
201,Baker_1980_BALB/cJ_sections_neuron_quantitation,Analysis,Dodds_2014_sections_neurons_quantitation,Analysis,1.0
...,...,...,...,...,...
1288,Svingos_1999_ultrathin_KOR_terminals_quantitation,Analysis,Pickel_1998_sections_NPY_axonterminals_quantit...,Analysis,0.5
1291,Talavera_1997_mountedsections_weak_NADPHd_neur...,Analysis,Kopp_1992_sections_D2_neurons_quantitation,Analysis,0.5
1270,Uehara-Kunugi_1991_sections_nadph_neurons_quan...,Analysis,Lenz_1994_sections_PV_GAD67_neurons_quantitation,Analysis,0.5
1293,Yang_2008_sections_XPA_cells_quantitation,Analysis,Murata_2003_P10_sections_NADPH_neurons_quantit...,Analysis,0.5
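
The call above scores pairs of nodes by the Jaccard similarity of their neighbor sets, which is the metric Node Similarity is based on in this GDS version. The pure-Python sketch below illustrates, on made-up data, how the three parameters used in the cell shape the result. It is a simplification for illustration, not the GDS implementation, and the function names and toy node names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two neighbor sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def node_similarity(neighbors, degree_cutoff, similarity_cutoff, top_k):
    """Return (node1, node2, score) rows, mimicking the three parameters above."""
    # degreeCutoff: ignore nodes with fewer neighbors than the cutoff.
    candidates = {n: s for n, s in neighbors.items() if len(s) >= degree_cutoff}
    rows = []
    for node, targets in candidates.items():
        scored = [(node, other, jaccard(targets, other_targets))
                  for other, other_targets in candidates.items() if other != node]
        # similarityCutoff: drop pairs that score below the cutoff.
        scored = [row for row in scored if row[2] >= similarity_cutoff]
        # topK: keep only the k best-scoring partners for each node.
        scored.sort(key=lambda row: (-row[2], row[1]))
        rows.extend(scored[:top_k])
    return rows


# Toy neighbor sets (hypothetical attribute names, not taken from the database).
toy = {
    "analysis_a": {"Mouse", "Striatum", "Confocal", "Neuron"},
    "analysis_b": {"Mouse", "Striatum", "Confocal", "Spine"},
    "analysis_c": {"Rat", "Striatum", "Confocal", "Neuron"},
    "analysis_d": {"Rat", "Cortex"},  # only 2 neighbors, below degree_cutoff=3
}

result = node_similarity(toy, degree_cutoff=3, similarity_cutoff=0.5, top_k=2)
for node1, node2, score in result:
    print(node1, node2, round(score, 6))
```

In this toy data `analysis_a` shares 3 of 5 distinct neighbors with both `analysis_b` and `analysis_c` (score 0.6), while `analysis_b` and `analysis_c` share only 2 of 6 and fall below the cutoff.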


In [251]:
similar_analyses.to_csv("../Data/csvs/graphAnalysis/node_similarity_analyses.csv")

Finally, we clean up:

In [252]:
with driver.session() as session:
    session.run("""
        MATCH ()-[r:NODE_SIMILARITY]->()
        DELETE r
    """)
    session.run("""
        MATCH (n:AnalysisDataType)
        DETACH DELETE n
    """)
    session.run("CALL gds.graph.drop('analyses')")

## Clean-up

Remove the graph projections from the database:

In [253]:
with driver.session() as session:
    session.run("CALL gds.graph.drop('all-nodes')")

ClientError: {code: Neo.ClientError.Procedure.ProcedureCallFailed} {message: Failed to invoke procedure `gds.graph.drop`: Caused by: java.lang.IllegalArgumentException: Graph with name `all-nodes` does not exist and can't be removed.}
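
The error above indicates that the `all-nodes` projection was no longer in the graph catalog when the cell ran (for example, because it had already been dropped, or because the database had been restarted since the projection was created). A minimal sketch of a guarded drop follows, assuming the `gds.graph.exists` procedure is available in the installed GDS version. The helper name `drop_graph_if_exists` and the stub classes are hypothetical; the stubs only allow a dry run without a database.

```python
def drop_graph_if_exists(session, graph_name):
    """Drop a named GDS projection only if the catalog reports that it exists."""
    record = session.run(
        "CALL gds.graph.exists($name) YIELD exists RETURN exists",
        name=graph_name,
    ).single()
    if record["exists"]:
        session.run("CALL gds.graph.drop($name)", name=graph_name)
        return True
    return False


class StubResult:
    """Minimal stand-in for a driver result with a .single() method."""

    def __init__(self, record):
        self._record = record

    def single(self):
        return self._record


class StubSession:
    """Stand-in session that tracks an in-memory set of graph names."""

    def __init__(self, existing):
        self.existing = set(existing)
        self.calls = []

    def run(self, query, **params):
        self.calls.append(query)
        if "gds.graph.exists" in query:
            return StubResult({"exists": params["name"] in self.existing})
        if "gds.graph.drop" in query:
            self.existing.discard(params["name"])
        return StubResult(None)


stub = StubSession(existing={"analyses"})
dropped_first = drop_graph_if_exists(stub, "analyses")  # exists, so it is dropped
dropped_again = drop_graph_if_exists(stub, "analyses")  # already gone, no error raised
```

With the real driver, the same helper would be called as `drop_graph_if_exists(session, 'all-nodes')` inside a `with driver.session() as session:` block.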