# Entries G6, 7, 8 a notebook: Unimodal Graph Global Metrics

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

import warnings
warnings.filterwarnings('once')

from neo4j import GraphDatabase

## Connect to the database

The title of the post should give away which version of the database we're using: the unimodal version.

#### 1. Open Neo4j and start the unimodal graph:

Click `Start` to start the database

<img src='https://github.com/julielinx/datascience_diaries/blob/master/graph/images/import_data1.png?raw=true'>

<img src='https://github.com/julielinx/datascience_diaries/blob/master/graph/images/import_data2.png?raw=true'>

#### 2. Connect the Jupyter notebook to the database

For this I use the official driver. The `py2neo` package is nice, but the official driver keeps pace with Neo4j releases better. In particular, at the time of this writing, `py2neo` couldn't handle the changes in version 4.X that allow multiple graphs on the same cluster. That feature is one of the major benefits at my work, and I need to be able to accommodate it.

In [3]:
uri = "bolt://localhost:7687"

driver = GraphDatabase.driver(uri, auth=('neo4j', 'password'))
session = driver.session()
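
Since multiple graphs per cluster is the reason for choosing the official driver, here is a minimal sketch of picking a database per session. The helper name `run_query` is my own, not part of the driver; it only relies on `driver.session(database=...)`, `session.run()`, and `.data()`, and it assumes that leaving `database` as `None` falls back to the server's default database.

```python
def run_query(driver, cypher, database=None, **params):
    """Run one query in a short-lived session and return a list of dicts.

    `run_query` is a hypothetical helper, not a driver function.
    """
    # The session is closed automatically when the with-block ends.
    with driver.session(database=database) as session:
        return session.run(cypher, **params).data()


# Usage (needs a running Neo4j instance; the database name is hypothetical):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# run_query(driver, "CALL db.labels", database="unimodal")
```

Opening a session per query keeps the database choice explicit at each call, which is handy when several graphs live on the same server.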

#### 3. Look at the output

I used `py2neo` right up to the version 4.X breaking changes, so I'm not terribly familiar with what I get back from the different methods. Let's explore.

In [4]:
session.run("CALL db.labels").data()

[{'label': 'Hero'}, {'label': 'Comic'}]

In [12]:
session.run("CALL db.labels").value()

['Hero', 'Comic']

In [13]:
session.run("CALL db.labels").values()

[['Hero'], ['Comic']]

In [15]:
session.run("CALL db.labels").keys()

['label']
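
The four calls above return the same records in different shapes. A minimal sketch of how those shapes relate, in plain Python with a hand-written list of dicts standing in for the driver result (`records` is illustrative, not a driver object, and no database is needed):

```python
# Stand-in for the records that "CALL db.labels" returned above.
records = [{"label": "Hero"}, {"label": "Comic"}]

as_data = [dict(r) for r in records]               # like .data(): one dict per row
as_value = [list(r.values())[0] for r in records]  # like .value(): first column only
as_values = [list(r.values()) for r in records]    # like .values(): one list per row
as_keys = list(records[0].keys())                  # like .keys(): the column names

print(as_data)    # [{'label': 'Hero'}, {'label': 'Comic'}]
print(as_value)   # ['Hero', 'Comic']
print(as_values)  # [['Hero'], ['Comic']]
print(as_keys)    # ['label']
```

The list-of-dicts shape from `.data()` is the one that drops straight into `pd.DataFrame`, which is why it gets used for most of the queries here.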

In [16]:
pd.DataFrame(session.run("CALL db.labels").data())

Unnamed: 0,label
0,Hero
1,Comic


In [26]:
pd.DataFrame(session.run('CALL db.relationshipTypes').data())

Unnamed: 0,relationshipType
0,KNOWS


In [48]:
session.run("call apoc.meta.stats() YIELD nodeCount, relCount, labels, relTypesCount").data()

[{'nodeCount': 6439,
  'relCount': 171613,
  'labels': {'Hero': 6439},
  'relTypesCount': {'KNOWS': 171613}}]

In [49]:
session.run("call apoc.meta.stats() YIELD stats").data()

[{'stats': {'relTypeCount': 2,
   'propertyKeyCount': 2,
   'labelCount': 2,
   'nodeCount': 6439,
   'relCount': 171613,
   'labels': {'Hero': 6439},
   'relTypes': {'()-[:KNOWS]->(:Hero)': 171613,
    '()-[:KNOWS]->()': 171613,
    '(:Hero)-[:KNOWS]->()': 171613}}}]

In [40]:
session.run("call apoc.meta.stats() YIELD labels").value()

[{'Hero': 6439}]

In [41]:
pd.DataFrame(session.run("call apoc.meta.stats() YIELD labels").value())

Unnamed: 0,Hero
0,6439


## Node Count

This is a very basic way to evaluate the size of a graph. You'll see as we go that the density of relationships says a lot more about the overall size of the graph, but a straight node count will at least tell you how many noun-things are in the graph.

There are a couple of ways to get counts. I'll include several so that I have options when I get into the portion of this series where I evaluate how quickly I can gather this info.

1. Explicitly spell out what I want.

In [31]:
pd.DataFrame(session.run('''MATCH (c)
RETURN count(c) as node_count''').data())

Unnamed: 0,node_count
0,6439


2. The `count()` function doesn't require variable names, so we can leave the node pattern empty and count every match with `count(*)`.

In [32]:
pd.DataFrame(session.run('''MATCH ()
RETURN count(*) as node_count''').data())

Unnamed: 0,node_count
0,6439


3. Counts by label

If we only want to get counts for a specific label we can put that in the query.

In [50]:
pd.DataFrame(session.run('''MATCH (:Hero)
RETURN count(*) as node_count''').data())

Unnamed: 0,node_count
0,6439


4. Use meta stats

This is one of those handy things that the APOC library does for you. All you have to do is call the correct function and specify what you want.

For a full list of what `apoc.meta.stats()` will yield see the [APOC Documentation: apoc.meta.stats](https://neo4j.com/labs/apoc/4.1/overview/apoc.meta/apoc.meta.stats/)

In [44]:
pd.DataFrame.from_dict(session.run("call apoc.meta.stats() YIELD labels").value())

Unnamed: 0,Hero
0,6439


## Isolate Count

In [34]:
pd.DataFrame(session.run('''MATCH (n) WHERE NOT (n)--()
WITH count(DISTINCT n) AS isolates_count
MATCH ()-[r]->()
WITH count(r) AS relation_ct, isolates_count
MATCH (c)
WITH count(DISTINCT c) AS node_count, isolates_count, relation_ct
RETURN node_count, relation_ct, isolates_count,
       round(toFloat(isolates_count) / node_count * 10000) / 100 AS isolates_pct''').data())

Unnamed: 0,node_count,relation_ct,isolates_count,isolates_pct
0,6439,171613,18,0.28
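
An isolate is a node with no relationships at all, and the query above reports isolates as a percentage rounded to two decimals by multiplying by 10000, rounding, then dividing by 100. A minimal sketch of both ideas in plain Python, using a made-up toy edge list (the names `nodes`, `edges`, and `pct` are mine):

```python
# Toy graph: six nodes, three relationships; node 5 touches nothing.
nodes = range(6)
edges = [(0, 1), (1, 2), (3, 4)]

# A node is an isolate if it never appears at either end of a relationship.
connected = {n for edge in edges for n in edge}
isolates = [n for n in nodes if n not in connected]


def pct(part, whole):
    """Percentage rounded to two decimals, mirroring round(x * 10000) / 100."""
    return round(part / whole * 10000) / 100


print(isolates)                         # [5]
print(pct(len(isolates), len(nodes)))   # 16.67
print(pct(18, 6439))                    # 0.28, the figures from the output above
```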


## Relationship Count

Counting relationships is very similar to counting nodes. The catch when using the count store, however, is that the direction of the relationship must be specified. Since it doesn't matter if the relationship is incoming or outgoing in our Marvel graph, an undirected pattern would be the natural choice, but that pattern can't use the count store.

One of the nice things about using the count store with relationships is that it can look for multiple types (ex: `MATCH ()-[r:KNOWS|APPEARED_IN]->()`). [Neo4j's Knowledge Base: Fast counts using the count store](https://neo4j.com/developer/kb/fast-counts-using-the-count-store/) tells us it just adds the counts together for each type.

In [10]:
session.run('''MATCH (c1)-[]-(c2)
with count(c2) as degree
RETURN degree/2 as total_degree_count''').data()

[{'total_degree_count': 171613}]

In [11]:
session.run('''MATCH (c1)-[]-(c2)
WHERE id(c1) < id(c2)
RETURN count(c2) as degree''').data()

[{'degree': 171613}]
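
The two queries above deal with the same issue: an undirected pattern matches every relationship once from each end, so a raw count comes out doubled. A minimal sketch of why both fixes give the same answer, in plain Python with a made-up edge list (integers stand in for node ids):

```python
# Toy directed relationships as (start_id, end_id); each one is stored once.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

# An undirected pattern such as (c1)-[]-(c2) sees each relationship from both ends.
undirected_matches = edges + [(b, a) for a, b in edges]

# First query: count everything, then divide by 2.
halved = len(undirected_matches) // 2

# Second query: keep only the match where the first id is the smaller one.
filtered = sum(1 for a, b in undirected_matches if a < b)

print(len(undirected_matches))  # 8
print(halved, filtered)         # 4 4
```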

In [13]:
session.run('''MATCH ()-[r]->()
RETURN count(r) as count''').data()

[{'count': 171613}]

In [14]:
pd.DataFrame.from_dict(session.run("call apoc.meta.stats() YIELD relCount").value())

Unnamed: 0,0
0,171613


## Number of possible relationships

In [82]:
rel_ct = pd.DataFrame(session.run('''MATCH ()-[r]->()
RETURN count(r) as count''').data())

node_ct = pd.DataFrame(session.run('''MATCH ()
RETURN count(*) as node_count''').data())

possible_rels = node_ct['node_count'] * (node_ct['node_count'] - 1)/2

possible_rels

0    20727141.0
Name: node_count, dtype: float64

## Global density

In [83]:
global_density = rel_ct['count'] / possible_rels

global_density

0    0.00828
dtype: float64
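
The two cells above boil down to two small formulas: an undirected graph with n nodes can hold at most n * (n - 1) / 2 relationships, and global density is the actual relationship count divided by that maximum. A minimal sketch in plain Python that reproduces the numbers above without pandas (the function names are mine):

```python
def possible_relationships(node_count):
    """Maximum number of undirected relationships between distinct nodes."""
    return node_count * (node_count - 1) // 2


def global_density(relationship_count, node_count):
    """Share of possible relationships that actually exist."""
    return relationship_count / possible_relationships(node_count)


print(possible_relationships(6439))             # 20727141
print(round(global_density(171613, 6439), 5))   # 0.00828
```

A density under 1% means fewer than 1 in 100 possible hero pairs are actually connected.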

## Component count

In [62]:
component_df = pd.DataFrame(session.run('''CALL gds.wcc.stream({
  nodeProjection: 'Hero',
  relationshipProjection: {
    KNOWS: {
      type: 'KNOWS',
      orientation: 'UNDIRECTED'
    }
  }
})
YIELD componentId
RETURN componentId, count(*) AS component_size
ORDER BY component_size DESC''').data())

component_df.head()

Unnamed: 0,componentId,component_size
0,0,6403
1,239,9
2,92,7
3,3504,2
4,465,1


In [73]:
len(component_df)

22
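
A weakly connected component is a group of nodes that can all reach each other when relationship direction is ignored. As a rough illustration of what the component query computes, here is a sketch using union-find on a made-up toy graph. This is my own stand-in for the idea, not a description of how the Graph Data Science library implements it.

```python
from collections import Counter


def component_sizes(nodes, edges):
    """Sizes of the connected components, largest first (direction ignored)."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two groups

    return sorted(Counter(find(n) for n in nodes).values(), reverse=True)


# Toy graph: a triangle, a pair, and two isolates.
sizes = component_sizes(range(7), [(0, 1), (1, 2), (2, 0), (3, 4)])
percents = [round(s / sum(sizes) * 100, 2) for s in sizes]

print(sizes)     # [3, 2, 1, 1]
print(percents)  # [42.86, 28.57, 14.29, 14.29]
```

Every isolate shows up as its own component of size 1, which is why the isolate count feeds directly into the component count.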

## Component size

In [74]:
component_df

Unnamed: 0,componentId,component_size
0,0,6403
1,239,9
2,92,7
3,3504,2
4,465,1
5,576,1
6,832,1
7,1084,1
8,1381,1
9,1829,1


## Component percent

In [81]:
component_df['component_pct'] = round(component_df['component_size'] / component_df['component_size'].sum() *100, 2)

component_df

Unnamed: 0,componentId,component_size,component_pct
0,0,6403,99.44
1,239,9,0.14
2,92,7,0.11
3,3504,2,0.03
4,465,1,0.02
5,576,1,0.02
6,832,1,0.02
7,1084,1,0.02
8,1381,1,0.02
9,1829,1,0.02


## Diameter

While I'd really like to see the diameter, there isn't an optimized way to do it in Neo4j. This is a problem because diameter is the longest shortest path from one node to another.

Shortest path is the fastest way to get from one node to another. We're looking for the longest one of these in the graph.

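The way to calculate that is to find the shortest path from each node to every other node. For an undirected graph with n nodes, that means n * (n - 1) / 2 node pairs, the same formula used for the number of possible relationships above, and every pair needs its own path search. This is much smaller than the factorial of the node count (reminder: factorial is a number multiplied by every positive whole number smaller than itself, so 6! is 6 * 5 * 4 * 3 * 2 * 1 = 720), but it still adds up fast.

Remember our node count for this relatively small graph is 6,439. That works out to 20,727,141 pairs, which means the query will take a long time to run.

For curiosity's sake, I included the factorial of 6,439 below to show how quickly numbers built from the node count get out of hand: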

In [57]:
import math

# math.factorial needs a plain integer, so pull the scalar out of the one-row column
math.factorial(int(node_ct['node_count'].iloc[0]))

7297514397592690558249029978403606858420973686519642889082904928906524517796309765654119706092518848036618308646125589416747829961785515709268629477931707336569672039413392629448857296876933808163975191879494893414939541069145573569690833031742649148537837095920136737129884916978026877478156228803260867909356768515318907340744606968775568921078902533812198652883726499196664730982004818860595029079955149847353481791572966264308197354944525171860738817083685482495856616172956038139872549158431425548487098516665887931425249118331171339351882781983468006287247589839939469093050557904571879016786179464970230178411179305454767630618336408874898290654375279259021044132605049386361245263803039392592872253970664145866547525392492988579055985856571674483586712291016173613571107040293738729510175764794350850507462431114510110156779397101296242870354856579151048871206539622763619262016389058959083341409383073675306662981302732910644360631222063237540405917505119743578500373937263675825554682231194

The code to do this is below, but it takes so long it just isn't practical outside toy datasets that are even smaller than the projected unimodal Marvel Universe Social Network graph.

In [47]:
session.run('''MATCH (n), (m)
WHERE id(n) < id(m)
WITH n, m
MATCH p=shortestPath( (n)-[*]-(m) )
RETURN n.name, m.name, length(p)
ORDER BY length(p) desc LIMIT 1''').data()
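
To make "longest shortest path" concrete, here is a minimal sketch in plain Python on a made-up toy graph: run a breadth-first search from every node and keep the largest distance found. It follows the same all-pairs idea as the query above, ignores pairs with no path between them, and is meant as an illustration rather than a replacement for the Cypher.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest hop count from source to every node it can reach."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def diameter(adjacency):
    """Longest shortest path among pairs that are connected at all."""
    return max(max(bfs_distances(adjacency, n).values()) for n in adjacency)


# Toy undirected graph: a chain a-b-c-d plus an isolate e.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": []}

print(bfs_distances(adjacency, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
print(diameter(adjacency))            # 3
```

Each search touches every node and relationship it can reach, and there is one search per node, which is why the cost climbs so quickly as the graph grows.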

