# Entries G6, 7, 8 b notebook: Bimodal Graph Global Metrics

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

import warnings
warnings.filterwarnings('once')

from neo4j import GraphDatabase

## Connect to the database

The title of the post should give away which version of the database we're using: the bimodal version.

#### 1. Open Neo4j and start the bimodal graph:

Click `Start` to start the database

<img src='https://github.com/julielinx/datascience_diaries/blob/master/graph/images/import_data1.png?raw=true'>

<img src='https://github.com/julielinx/datascience_diaries/blob/master/graph/images/import_data2.png?raw=true'>

#### 2. Connect the Jupyter notebook to the database

For this I use the official driver. The `py2neo` package is nice, but the official driver keeps up with Neo4j releases better. In particular, at the time of this writing, `py2neo` couldn't handle the changes in version 4.X that allow multiple graphs on the same cluster. That is one of the major benefits at my work, so I need to be able to accommodate it.

In [44]:
uri = "bolt://localhost:7687"

driver = GraphDatabase.driver(uri, auth=('neo4j', 'password'))
session = driver.session()
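The session above stays open for the whole notebook. As a hedged alternative, here is a minimal sketch of a helper that opens a short-lived session per query and closes it automatically. The name `fetch_records` is hypothetical (it is not part of the driver), and the sketch assumes the official driver's `driver.session(database=...)` and `Result.data()` calls used elsewhere in this notebook.

```python
def fetch_records(driver, query, database=None):
    """Run `query` in a short-lived session and return a list of dicts.

    `fetch_records` is a hypothetical helper, not part of the neo4j driver.
    """
    # database=None lets the server choose its default database; pass a
    # name to target one graph on a multi-database (4.X) server.
    with driver.session(database=database) as session:
        # .data() materializes every record as a plain dict before the
        # session closes at the end of the `with` block.
        return session.run(query).data()

# Usage with the `driver` created above (requires a running database):
# labels = fetch_records(driver, "CALL db.labels()")
```

Because the function only needs an object with a `session()` method, it can be exercised with a stand-in driver when no database is running.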

#### 3. Look at the output

I used `py2neo` right up to the version 4.X breaking changes, so I'm not terribly familiar with what the official driver's different methods return. Let's explore.

In [45]:
pd.DataFrame(session.run("CALL db.labels").data())

Unnamed: 0,label
0,Hero
1,Comic


In [9]:
pd.DataFrame(session.run('CALL db.relationshipTypes').data())

Unnamed: 0,relationshipType
0,APPEARS_IN


In [11]:
session.run("call apoc.meta.stats() YIELD stats").data()

[{'stats': {'relTypeCount': 2,
   'propertyKeyCount': 1,
   'labelCount': 2,
   'nodeCount': 19090,
   'relCount': 96104,
   'labels': {'Hero': 6439, 'Comic': 12651},
   'relTypes': {'()-[:APPEARS_IN]->(:Comic)': 96104,
    '(:Hero)-[:APPEARS_IN]->()': 96104,
    '()-[:APPEARS_IN]->()': 96104}}}]

In [14]:
pd.DataFrame(session.run("call apoc.meta.stats() YIELD relTypes").value())

Unnamed: 0,()-[:APPEARS_IN]->(:Comic),(:Hero)-[:APPEARS_IN]->(),()-[:APPEARS_IN]->()
0,96104,96104,96104


## Node Count

This is a very basic way to evaluate the size of a graph. You'll see as we go that the density of relationships says a lot more about the overall size of the graph, but a straight node count will at least tell you how many noun-things are in the graph.

There are a couple of ways to get counts. I'll include several so that I have options when I get into the portion of this series where I evaluate how quickly I can gather this info.

1. Explicitly spell out what I want.

In [15]:
pd.DataFrame(session.run('''MATCH (c)
RETURN count(c) as node_count''').data())

Unnamed: 0,node_count
0,19090


2. The `count()` function doesn't require a variable name, so we can leave the node pattern empty and count every matched row with `*`.

In [18]:
pd.DataFrame(session.run('''MATCH ()
RETURN count(*) as node_count''').data())

Unnamed: 0,node_count
0,19090


3. Use meta stats

This is one of those handy things that the APOC library does for you. All you have to do is call the correct function and specify what you want.

For a full list of what `apoc.meta.stats()` will yield see the [APOC Documentation: apoc.meta.stats](https://neo4j.com/labs/apoc/4.1/overview/apoc.meta/apoc.meta.stats/)

In [16]:
session.run("call apoc.meta.stats() YIELD nodeCount").data()

[{'nodeCount': 19090}]

We can also get the node count for each label:

In [19]:
pd.DataFrame(session.run("call apoc.meta.stats() YIELD labels").value())

Unnamed: 0,Hero,Comic
0,6439,12651


## Isolate Count

In [20]:
pd.DataFrame(session.run('''MATCH (n) WHERE NOT (n)--() 
WITH COUNT(distinct n) as isolates_count
MATCH ()-[r]->()
WITH count(r) as relation_ct, isolates_count
MATCH (c)
with count(distinct c) as node_count, isolates_count, relation_ct
return node_count, relation_ct, isolates_count, round(toFloat(isolates_count)/node_count*10000) / 100 as isolates_pct''').data())

Unnamed: 0,node_count,relation_ct,isolates_count,isolates_pct
0,19090,96104,0,0.0
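To make the query above easier to follow, here is a small sketch of the same bookkeeping on a toy graph in plain Python: a node is an isolate when it touches no relationship, and the percentage is rounded to two decimals the same way (`round(x * 10000) / 100`). The node names and the `isolate_stats` helper are made up for illustration.

```python
def isolate_stats(nodes, edges):
    """Return (node_count, relation_ct, isolates_count, isolates_pct)."""
    # Any node that appears at either end of an edge is not an isolate.
    connected = set()
    for source, target in edges:
        connected.add(source)
        connected.add(target)
    isolates = [n for n in nodes if n not in connected]
    # Same two-decimal rounding trick as the Cypher query.
    pct = round(len(isolates) / len(nodes) * 10000) / 100
    return len(nodes), len(edges), len(isolates), pct

# Hypothetical toy data: hero_c and comic_2 have no relationships.
toy_nodes = ["hero_a", "hero_b", "hero_c", "comic_1", "comic_2"]
toy_edges = [("hero_a", "comic_1"), ("hero_b", "comic_1")]

print(isolate_stats(toy_nodes, toy_edges))  # (5, 2, 2, 40.0)
```

In the Marvel graph the isolate count is 0, so every hero appears in at least one comic and every comic has at least one hero.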


## Relationship Count

Counting relationships is very similar to counting nodes. The catch when using the count store, however, is that the direction of the relationship must be specified. Since it doesn't matter whether a relationship is incoming or outgoing in our Marvel graph, the undirected patterns below won't be able to use the count store.

One of the nice things about using the count store with relationships is that it can look for multiple types (ex: `MATCH ()-[r:KNOWS|APPEARED_IN]->()`). [Neo4j's Knowledge Base: Fast counts using the count store](https://neo4j.com/developer/kb/fast-counts-using-the-count-store/) tells us it just adds the counts together for each type.

In [21]:
session.run('''MATCH (c1)-[]-(c2)
with count(c2) as degree
RETURN degree/2 as total_degree_count''').data()

[{'total_degree_count': 96104}]

In [22]:
session.run('''MATCH (c1)-[]-(c2)
WHERE id(c1) < id(c2)
RETURN count(c2) as degree''').data()

[{'degree': 96104}]
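The two undirected queries above handle the same issue: a pattern without a direction matches every relationship once from each end. Here is a toy sketch in plain Python of why halving the count and filtering on `id(c1) < id(c2)` give the same answer. The integer ids are made up, and the sketch assumes no self-loops.

```python
# Hypothetical toy graph: heroes have ids 0 and 1, comics have ids 2 and 3.
toy_rels = [(0, 2), (0, 3), (1, 2)]

# An undirected pattern like (c1)-[]-(c2) yields each relationship twice,
# once starting from each end.
undirected_rows = toy_rels + [(end, start) for start, end in toy_rels]

# Approach 1: count every row, then halve.
halved_count = len(undirected_rows) // 2

# Approach 2: keep only the orientation where the first id is smaller.
ordered_count = sum(1 for c1, c2 in undirected_rows if c1 < c2)

print(len(undirected_rows), halved_count, ordered_count)  # 6 3 3
```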

In [23]:
session.run('''MATCH ()-[r]->()
RETURN count(r) as count''').data()

[{'count': 96104}]

In [24]:
pd.DataFrame.from_dict(session.run("call apoc.meta.stats() YIELD relCount").value())

Unnamed: 0,0
0,96104


#### Count by relationship type

Using the `apoc.meta.stats()` function we can see that the only relationship is `APPEARS_IN` and that this is a directed graph of `(:Hero)-->(:Comic)`. Strangely, the function reports that information three ways: by end label, by start label, and with no labels at all.

In [25]:
pd.DataFrame.from_dict(session.run("call apoc.meta.stats() YIELD relTypes").value())

Unnamed: 0,()-[:APPEARS_IN]->(:Comic),(:Hero)-[:APPEARS_IN]->(),()-[:APPEARS_IN]->()
0,96104,96104,96104


## Number of possible relationships

In [46]:
rel_ct = pd.DataFrame(session.run('''MATCH ()-[r]->()
RETURN count(r) as count''').data())

node_ct = pd.DataFrame(session.run("call apoc.meta.stats() YIELD labels").value())

possible_rels = node_ct['Hero'] * node_ct['Comic']

possible_rels

0    81459789
dtype: int64
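The product above works because, in a bimodal graph, a relationship can only connect a hero to a comic, so every hero-comic pair is one possible relationship. The sketch below reproduces the number from the label counts and, purely for contrast, shows the usual formula for a simple undirected unimodal graph. The function names are hypothetical, and the unimodal formula assumes no self-loops or parallel relationships.

```python
def possible_bimodal_rels(n_heroes, n_comics):
    """Every (hero, comic) pair is one possible relationship."""
    return n_heroes * n_comics

def possible_unimodal_rels(n_nodes):
    """Possible pairs in a simple undirected graph: n * (n - 1) / 2."""
    return n_nodes * (n_nodes - 1) // 2

# Label counts taken from the apoc.meta.stats() output above.
print(possible_bimodal_rels(6439, 12651))  # 81459789

# Treating all 19090 nodes as one mode would allow far more pairs.
print(possible_unimodal_rels(19090))  # 182204505
```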

## Global density

In [47]:
global_density = rel_ct['count'] / possible_rels

global_density

0    0.00118
dtype: float64
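Global density is the share of possible relationships that actually exist. As a quick check on the value above, this sketch recomputes it from the raw counts reported earlier in the notebook. The `bimodal_density` name is hypothetical.

```python
def bimodal_density(rel_count, n_heroes, n_comics):
    """Actual relationships divided by possible hero-comic pairs."""
    return rel_count / (n_heroes * n_comics)

# 96104 relationships, 6439 heroes, 12651 comics (from the outputs above).
density = bimodal_density(96104, 6439, 12651)
print(round(density, 5))  # 0.00118
```

In other words, roughly 0.12% of all possible hero-comic appearances are present in the graph.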

## Component count

In [34]:
component_df = pd.DataFrame(session.run('''CALL gds.wcc.stream({
  nodeProjection: ['*'],
  relationshipProjection: '*'})
YIELD componentId
RETURN componentId, count(*) as component_size
Order by component_size DESC''').data())

component_df.head()

Unnamed: 0,componentId,component_size
0,0,19029
1,2667,11
2,700,8
3,15007,4
4,14297,3


In [35]:
len(component_df)

22
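For intuition about what a weakly connected component is, here is a minimal union-find sketch on a toy graph. It illustrates the concept only and is not a description of how the GDS library computes its result. The node names and the `component_sizes` helper are made up. Relationship direction is ignored, which is what "weakly" connected means.

```python
from collections import Counter

def component_sizes(nodes, edges):
    """Return component sizes, largest first, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Merge the components at both ends of every relationship.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    sizes = Counter(find(n) for n in nodes)
    return sorted(sizes.values(), reverse=True)

# Hypothetical toy data: hero_c and comic_3 form their own small component.
toy_nodes = ["hero_a", "hero_b", "hero_c", "comic_1", "comic_2", "comic_3"]
toy_edges = [("hero_a", "comic_1"), ("hero_b", "comic_1"),
             ("hero_b", "comic_2"), ("hero_c", "comic_3")]

sizes = component_sizes(toy_nodes, toy_edges)
print(len(sizes), sizes)  # 2 [4, 2]
```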

## Component size and percent

In [37]:
component_df['component_pct'] = round(component_df['component_size'] / component_df['component_size'].sum() *100, 2)

component_df

Unnamed: 0,componentId,component_size,component_pct
0,0,19029,99.68
1,2667,11,0.06
2,700,8,0.04
3,15007,4,0.02
4,14297,3,0.02
5,17041,3,0.02
6,4544,2,0.01
7,5744,2,0.01
8,7048,2,0.01
9,8580,2,0.01
