
In [3]:
from neo4j.v1 import GraphDatabase, basic_auth
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import pandas as pd

In [4]:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=basic_auth("neo4j", "neo"))

## Betweenness Centrality

Betweenness centrality identifies nodes that are strategically positioned in the network, meaning that information will often travel through them. In a social network, such an intermediary position gives a person power and influence.

Betweenness centrality is a raw count of the number of shortest paths that go through a given node. When several shortest paths tie between the same pair of nodes, each one contributes only a fraction, which is why the scores are not always whole numbers. For example, if a node is located on a bottleneck between two large communities, then it will have high betweenness.
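
To make the bottleneck idea concrete, here is a minimal pure-Python sketch of Brandes’ algorithm on a toy graph: two triangles joined through a single bridge node `d`. The graph, the `betweenness` function and the convention of counting each unordered pair once are assumptions made for illustration; this is not how the `algo.betweenness` procedure is implemented, and its scaling may differ.

```python
from collections import deque

def betweenness(graph):
    """Betweenness for an unweighted, undirected graph given as
    {node: [neighbours]} (hypothetical helper, Brandes-style)."""
    scores = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search: count shortest paths and remember predecessors.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing credit along tied paths.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the totals.
    return {v: score / 2 for v, score in scores.items()}

# Toy graph: triangle a-b-c, triangle e-f-g, and d bridging c and e.
toy = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f", "g"], "f": ["e", "g"], "g": ["e", "f"],
}
scores = betweenness(toy)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, scores[node])
```

The bridge node `d` scores highest because all nine pairs that span the two triangles must pass through it, while the corner nodes `a`, `b`, `f` and `g` score zero.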

In [5]:
query = """\
CALL algo.betweenness.stream("Character", "INTERACTS1", {direction: "BOTH"})
YIELD nodeId, centrality
MATCH (c:Character) WHERE ID(c) = nodeId
RETURN c.name, centrality
ORDER BY centrality DESC
LIMIT 10
"""

with driver.session() as session:
    result = session.run(query)
    # Build the DataFrame while the session is still open
    df = pd.DataFrame([dict(record) for record in result])

In [6]:
df.sort_values(by=["centrality"], ascending=False).head()

Unnamed: 0,c.name,centrality
0,Eddard-Stark,4638.534951
1,Robert-Baratheon,3682.391036
2,Tyrion-Lannister,3272.606016
3,Jon-Snow,2952.057282
4,Catelyn-Stark,2604.755647


In [7]:
query = """\
CALL algo.betweenness.stream("Character", "INTERACTS1", {direction: "BOTH"})
YIELD nodeId, centrality
MATCH (c:Character) WHERE ID(c) = nodeId
WITH c, centrality, [(c)-[r:INTERACTS1]-(other) | {character: other.name, weight: r.weight}] AS interactions
RETURN c.name, centrality,
       apoc.coll.sum([i in interactions | i.weight]) AS totalInteractions,
       [i in apoc.coll.reverse(apoc.coll.sortMaps(interactions, 'weight'))[..5] | i.character] as charactersInteractedWith
ORDER BY centrality DESC
LIMIT 10
"""

with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])  

In [8]:
df.sort_values(by=["centrality"], ascending=False).head(10)

Unnamed: 0,c.name,centrality,charactersInteractedWith,totalInteractions
0,Eddard-Stark,4638.534951,"[Robert-Baratheon, Petyr-Baelish, Cersei-Lanni...",1284.0
1,Robert-Baratheon,3682.391036,"[Eddard-Stark, Cersei-Lannister, Renly-Barathe...",941.0
2,Tyrion-Lannister,3272.606016,"[Bronn, Jon-Snow, Catelyn-Stark, Tywin-Lannist...",650.0
3,Jon-Snow,2952.057282,"[Jeor-Mormont, Samwell-Tarly, Bran-Stark, Tyri...",784.0
4,Catelyn-Stark,2604.755647,"[Eddard-Stark, Robb-Stark, Tyrion-Lannister, L...",520.0
5,Daenerys-Targaryen,1484.278023,"[Drogo, Jorah-Mormont, Viserys-Targaryen, Mirr...",443.0
6,Robb-Stark,1255.689656,"[Bran-Stark, Jon-Snow, Catelyn-Stark, Theon-Gr...",516.0
7,Drogo,1115.094639,"[Daenerys-Targaryen, Viserys-Targaryen, Illyri...",256.0
8,Bran-Stark,960.031914,"[Robb-Stark, Luwin, Jon-Snow, Rickon-Stark, Ty...",531.0
9,Sansa-Stark,639.076914,"[Arya-Stark, Joffrey-Baratheon, Mordane, Eddar...",545.0


In [9]:
df.sort_values(by=["totalInteractions"], ascending=False).head(10)

Unnamed: 0,c.name,centrality,charactersInteractedWith,totalInteractions
0,Eddard-Stark,4638.534951,"[Robert-Baratheon, Petyr-Baelish, Cersei-Lanni...",1284.0
1,Robert-Baratheon,3682.391036,"[Eddard-Stark, Cersei-Lannister, Renly-Barathe...",941.0
3,Jon-Snow,2952.057282,"[Jeor-Mormont, Samwell-Tarly, Bran-Stark, Tyri...",784.0
2,Tyrion-Lannister,3272.606016,"[Bronn, Jon-Snow, Catelyn-Stark, Tywin-Lannist...",650.0
9,Sansa-Stark,639.076914,"[Arya-Stark, Joffrey-Baratheon, Mordane, Eddar...",545.0
8,Bran-Stark,960.031914,"[Robb-Stark, Luwin, Jon-Snow, Rickon-Stark, Ty...",531.0
4,Catelyn-Stark,2604.755647,"[Eddard-Stark, Robb-Stark, Tyrion-Lannister, L...",520.0
6,Robb-Stark,1255.689656,"[Bran-Stark, Jon-Snow, Catelyn-Stark, Theon-Gr...",516.0
5,Daenerys-Targaryen,1484.278023,"[Drogo, Jorah-Mormont, Viserys-Targaryen, Mirr...",443.0
7,Drogo,1115.094639,"[Daenerys-Targaryen, Viserys-Targaryen, Illyri...",256.0


## Storing betweenness centrality

Although the betweenness centrality algorithm runs very quickly on this dataset, we wouldn’t usually run this type of algorithm in the normal request/response flow of a web or mobile app. Instead, we can store the result of the calculation as a property on the node and refer to it in future queries.

Each of the algorithms has a variant that saves its output to the database rather than returning a stream. Let’s run the betweenness centrality algorithm and store the result as a property named `book1BetweennessCentrality`:

In [10]:
query = """\
CALL algo.betweenness("Character", "INTERACTS1", {direction: "BOTH", writeProperty: "book1BetweennessCentrality"})
"""
with driver.session() as session:
    session.run(query)

We can write the following query to find the most influential characters:

In [11]:
query = """\
MATCH (c:Character)
RETURN c.name, c.book1BetweennessCentrality AS centrality
"""

with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])  

In [12]:
df.sort_values(by=["centrality"], ascending=False).head()

Unnamed: 0,c.name,centrality
79,Eddard-Stark,4638.534951
161,Robert-Baratheon,3682.391036
173,Tyrion-Lannister,3272.606016
116,Jon-Snow,2952.057282
59,Catelyn-Stark,2604.755647


## Exercise: Betweenness Centrality for books 2-5

Now we want to calculate the betweenness centrality for the other books in the series and store the results in the database.

* Write queries that call algo.betweenness for the INTERACTS2, INTERACTS3, and INTERACTS45 relationship types.

After you’ve done that see if you can write queries to answer the following questions:

* Which character had the biggest increase in influence from book 1 to 5?

* Which character had the biggest decrease?

Bonus question:

* Which characters who were in the top 10 influencers in book 1 are also in the top 10 influencers in book 5?
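
As a possible starting point for the first part of the exercise, the sketch below only builds the Cypher strings for the remaining relationship types. The property names (`book2BetweennessCentrality` and so on) are assumptions chosen to mirror `book1BetweennessCentrality`, and `betweenness_query` is a hypothetical helper, so rename them as you see fit.

```python
# Relationship type -> property to write. The property names are assumptions
# that mirror the book1BetweennessCentrality property used earlier.
BOOKS = {
    "INTERACTS2": "book2BetweennessCentrality",
    "INTERACTS3": "book3BetweennessCentrality",
    "INTERACTS45": "book45BetweennessCentrality",
}

def betweenness_query(rel_type, write_property):
    """Return the same call as for book 1, with a different relationship
    type and write property (hypothetical helper)."""
    return (
        'CALL algo.betweenness("Character", "%s", '
        '{direction: "BOTH", writeProperty: "%s"})' % (rel_type, write_property)
    )

queries = [betweenness_query(rel, prop) for rel, prop in BOOKS.items()]
for q in queries:
    print(q)

# To execute them, reuse the driver created at the top of the notebook:
# with driver.session() as session:
#     for q in queries:
#         session.run(q)
```

The comparison questions are left open; once the properties exist they can be answered with ordinary `MATCH ... RETURN` queries over the stored values.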

## Page Rank

PageRank is another version of weighted degree centrality with a feedback loop. This time, you only get your “fair share” of your neighbor’s importance: your neighbor’s importance is split between their neighbors, proportional to the number of interactions with each neighbor.

Intuitively, PageRank captures how effectively you are taking advantage of your network contacts. In our context, PageRank centrality nicely captures narrative tension. Indeed, major developments occur when two important characters interact.
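
The “fair share” idea can be sketched in a few lines of plain Python. This is a textbook power iteration under stated assumptions: the toy edge list and the `weighted_pagerank` function are made up for illustration, scores are normalized to sum to 1 (so their scale differs from the values the procedure writes), and each node passes its rank to its neighbors in proportion to the interaction weight. It is not the implementation behind `algo.pageRank`.

```python
from collections import defaultdict

def weighted_pagerank(edges, damping=0.85, iterations=20):
    """edges: list of (u, v, weight) treated as undirected interactions."""
    out = defaultdict(dict)
    for u, v, w in edges:
        out[u][v] = out[u].get(v, 0.0) + w
        out[v][u] = out[v].get(u, 0.0) + w
    nodes = sorted(out)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Everyone starts each round with the same small baseline...
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            total = sum(out[u].values())
            for v, w in out[u].items():
                # ...and receives a weight-proportional share of each neighbor's rank.
                new_rank[v] += damping * rank[u] * w / total
        rank = new_rank
    return rank

# Toy interactions: hub talks to everyone, and most often to a.
toy_edges = [("hub", "a", 3), ("hub", "b", 1), ("hub", "c", 1), ("a", "b", 1)]
rank = weighted_pagerank(toy_edges)
for name in sorted(rank, key=rank.get, reverse=True):
    print(name, round(rank[name], 3))
```

In this toy graph `a` ends up ranked above `b` even though both have two neighbors, because `hub` spends more of its interactions on `a`.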

In [13]:
with driver.session() as session:
    result = session.run("""CALL algo.pageRank('Character', 'INTERACTS1', 
                              {direction: 'BOTH', writeProperty:'book1PageRank'})""")
    print(result.peek())
    
    result = session.run("""CALL algo.pageRank('Character', 'INTERACTS2', 
                              {direction: 'BOTH', writeProperty:'book2PageRank'})""")
    print(result.peek()) 
    
    result = session.run("""CALL algo.pageRank('Character', 'INTERACTS3', 
                              {direction: 'BOTH', writeProperty:'book3PageRank'})""")
    print(result.peek())     
    
    result = session.run("""CALL algo.pageRank('Character', 'INTERACTS45', 
                              {direction: 'BOTH', writeProperty:'book45PageRank'})""")
    print(result.peek())         

<Record nodes=796 iterations=20 loadMillis=4 computeMillis=0 writeMillis=3 dampingFactor=0.85 write=True writeProperty='book1PageRank'>
<Record nodes=796 iterations=20 loadMillis=5 computeMillis=0 writeMillis=2 dampingFactor=0.85 write=True writeProperty='book2PageRank'>
<Record nodes=796 iterations=20 loadMillis=4 computeMillis=0 writeMillis=3 dampingFactor=0.85 write=True writeProperty='book3PageRank'>
<Record nodes=796 iterations=20 loadMillis=3 computeMillis=0 writeMillis=3 dampingFactor=0.85 write=True writeProperty='book45PageRank'>


In [14]:
query = """\
MATCH (c:Character)
WITH c, [(c)-[r:INTERACTS1]-(other) | {character: other.name, weight: r.weight}] AS interactions
RETURN c.name, c.book1PageRank AS pageRank, c.book1BetweennessCentrality as centrality, 
       apoc.coll.sum([i in interactions | i.weight]) AS totalInteractions
"""

with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])      

In [15]:
df.sort_values(by=["centrality"], ascending = False).head(10)

Unnamed: 0,c.name,centrality,pageRank,totalInteractions
79,Eddard-Stark,4638.534951,0.660145,1284.0
161,Robert-Baratheon,3682.391036,2.073507,941.0
173,Tyrion-Lannister,3272.606016,4.367963,650.0
116,Jon-Snow,2952.057282,1.187026,784.0
59,Catelyn-Stark,2604.755647,0.18372,520.0
69,Daenerys-Targaryen,1484.278023,0.266875,443.0
159,Robb-Stark,1255.689656,1.300611,516.0
77,Drogo,1115.094639,0.243356,256.0
55,Bran-Stark,960.031914,0.164824,531.0
165,Sansa-Stark,639.076914,1.931982,545.0


In [16]:
df.sort_values(by=["pageRank"], ascending = False).head(10)

Unnamed: 0,c.name,centrality,pageRank,totalInteractions
173,Tyrion-Lannister,3272.606016,4.367963,650.0
183,Varys,185.720272,3.542741,231.0
174,Tywin-Lannister,450.607138,2.982616,181.0
161,Robert-Baratheon,3682.391036,2.073507,941.0
165,Sansa-Stark,639.076914,1.931982,545.0
192,Walder-Frey,556.385931,1.882878,41.0
159,Robb-Stark,1255.689656,1.300611,516.0
190,Willis-Wode,6.669261,1.209483,28.0
116,Jon-Snow,2952.057282,1.187026,784.0
189,Vardis-Egen,20.434495,1.181059,37.0


You’ll notice that there are some characters who have a high page rank but a very low betweenness centrality score.

This suggests that they aren’t necessarily influential in their own right, but are friends with important people. Varys is a good example of a character that fits this profile.
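
A quick way to pick out such characters is to filter the PageRank top 10 by betweenness. The sketch below copies the rows from the table above; the cutoff of 200 is an arbitrary threshold chosen for illustration, not something the data prescribes.

```python
# (name, betweenness centrality, PageRank) copied from the PageRank top 10 above
top_by_pagerank = [
    ("Tyrion-Lannister", 3272.606016, 4.367963),
    ("Varys", 185.720272, 3.542741),
    ("Tywin-Lannister", 450.607138, 2.982616),
    ("Robert-Baratheon", 3682.391036, 2.073507),
    ("Sansa-Stark", 639.076914, 1.931982),
    ("Walder-Frey", 556.385931, 1.882878),
    ("Robb-Stark", 1255.689656, 1.300611),
    ("Willis-Wode", 6.669261, 1.209483),
    ("Jon-Snow", 2952.057282, 1.187026),
    ("Vardis-Egen", 20.434495, 1.181059),
]

CENTRALITY_CUTOFF = 200.0  # assumption: arbitrary threshold for "very low"

well_connected = [
    name for name, centrality, _ in top_by_pagerank
    if centrality < CENTRALITY_CUTOFF
]
print(well_connected)  # ['Varys', 'Willis-Wode', 'Vardis-Egen']
```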

## Community Detection

We can detect communities in our data by running an algorithm which traverses the graph structure to find highly connected subgraphs with fewer connections to other subgraphs.
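
The label propagation idea itself is simple enough to sketch in plain Python: every node starts in its own community and repeatedly adopts the label that carries the most weight among its neighbors. The version below is a deterministic toy under stated assumptions (fixed node order, ties broken by keeping the current label or else taking the smallest one, undirected edges); the `label_propagation` function and the toy edges are hypothetical, and the procedure used in this notebook may break ties and handle direction differently.

```python
from collections import defaultdict

def label_propagation(edges, iterations=10):
    """edges: list of (u, v, weight) treated as undirected."""
    neighbours = defaultdict(dict)
    for u, v, w in edges:
        neighbours[u][v] = neighbours[u].get(v, 0) + w
        neighbours[v][u] = neighbours[v].get(u, 0) + w
    labels = {n: n for n in neighbours}  # each node starts as its own community
    for _ in range(iterations):
        changed = False
        for node in sorted(neighbours):
            # Sum the weight behind each label seen among the neighbours.
            votes = defaultdict(float)
            for other, w in neighbours[node].items():
                votes[labels[other]] += w
            best = max(votes.values())
            candidates = sorted(l for l, v in votes.items() if v == best)
            # Keep the current label on a tie, otherwise take the smallest winner.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # converged: no node changed its label this round
    return labels

# Two tightly knit triangles joined by one weak interaction (c-x).
toy_edges = [
    ("a", "b", 5), ("a", "c", 5), ("b", "c", 5),
    ("x", "y", 5), ("x", "z", 5), ("y", "z", 5),
    ("c", "x", 1),
]
labels = label_propagation(toy_edges)
print(labels)
```

The weak `c`-`x` link is outvoted on both sides, so the two triangles settle into separate communities after a couple of passes.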

Run the following query to calculate the communities that exist based on interactions across all the books.

In [17]:
query = """\
CALL algo.labelPropagation(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c:Character)-[rel]->(c2) RETURN id(c) as source, id(c2) as target, SUM(rel.weight) as weight',
  'OUTGOING',
  {graph:'cypher', partitionProperty: 'community', iterations: 10})
"""

with driver.session() as session:
    result = session.run(query)
    print(result.peek())

<Record nodes=796 iterations=7 loadMillis=49 computeMillis=12 writeMillis=6 write=True didConverge=True weightProperty='weight' partitionProperty='community'>


In [18]:
query = """\
MATCH (c:Character)
WHERE exists(c.community)
RETURN c.community AS community, count(*) AS count
"""

with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])   

In [19]:
df.sort_values(by=["count"], ascending = False).head(10)

Unnamed: 0,community,count
151,192,210
94,495,51
65,489,28
111,354,28
9,738,21
75,752,18
77,743,17
80,178,11
59,361,11
170,751,11


## Querying Communities

It’d be good to know who the influential people are in each community. To do that we’ll need to calculate a Page Rank score for each character across all the books:

In [20]:
query = """\
CALL algo.pageRank(
  'MATCH (c:Character) RETURN id(c) as id',
  'MATCH (c:Character)-[rel]->(c2) RETURN id(c) as source,id(c2) as target, SUM(rel.weight) as weight',
  {graph:'cypher', writeProperty: 'pageRank', iterations: 10})
"""

with driver.session() as session:
    result = session.run(query)
    print(result.peek())

<Record nodes=796 iterations=10 loadMillis=33 computeMillis=2 writeMillis=9 dampingFactor=0.85 write=True writeProperty='pageRank'>


In [21]:
query = """\
MATCH (c:Character)
WHERE exists(c.community)
WITH c ORDER BY c.pageRank DESC
RETURN c.community as cluster, count(*) AS count, collect(c.name)[0] AS mostInfluential
"""

with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])   

In [22]:
df.sort_values(by=["count"], ascending = False).head(10)

Unnamed: 0,cluster,count,mostInfluential
151,192,210,Tyrion-Lannister
94,495,51,Jon-Snow
109,354,28,Theon-Greyjoy
65,489,28,Samwell-Tarly
9,738,21,Victarion-Greyjoy
75,752,18,Tyene-Sand
77,743,17,Skahaz-mo-Kandaq
59,361,11,Wyman-Manderly
80,178,11,Rakharo
170,751,11,Quentyn-Martell


## Intra-community Page Rank

We can also calculate the Page Rank within communities.
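
Restricting the calculation to one community amounts to keeping only the relationships whose two endpoints both belong to it. A minimal Python sketch of that projection, using made-up toy data and a hypothetical `edges_within` helper:

```python
def edges_within(edges, community, wanted):
    """Keep only the edges whose two endpoints are both in community `wanted`."""
    return [
        (u, v, w) for u, v, w in edges
        if community.get(u) == wanted and community.get(v) == wanted
    ]

# Toy data (assumption): node -> community id, plus weighted interactions.
community = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2}
edges = [("a", "b", 4), ("b", "c", 2), ("c", "x", 1), ("x", "y", 3)]

inside = edges_within(edges, community, 1)
print(inside)  # [('a', 'b', 4), ('b', 'c', 2)]
```

The `c`-`x` interaction crosses the community boundary, so it is dropped and no longer contributes to the scores of either endpoint.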

Run the following query to calculate the Page Rank for the second-largest community:

In [23]:
query = """\
MATCH (c:Character) WHERE EXISTS(c.community)
WITH c.community AS communityId, COUNT(*) AS count
ORDER BY count DESC
SKIP 1 LIMIT 1
CALL apoc.cypher.doIt(
  "CALL algo.pageRank(
    'MATCH (c:Character) WHERE c.community =" + communityId + " RETURN id(c) as id',
    'MATCH (c:Character)-[rel]->(c2) WHERE c.community =" + communityId + " AND c2.community =" + communityId + " RETURN id(c) as source,id(c2) as target, sum(rel.weight) as weight',
    {graph:'cypher', writeProperty: 'communityPageRank'}) YIELD nodes RETURN count(*)", {})
YIELD value
RETURN value
"""

with driver.session() as session:
    result = session.run(query)
    print(result.peek())

<Record value={'count(*)': 1}>


In [25]:
query = """\
MATCH (c:Character) WHERE EXISTS(c.community)
WITH c.community AS communityId, COUNT(*) AS count
ORDER BY count DESC
SKIP 1 LIMIT 1
MATCH (c:Character) WHERE c.community = communityId
RETURN c.name AS character, c.communityPageRank as pageRank
"""

with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])       

In [26]:
df.sort_values(by=["pageRank"], ascending = False).head(10)

Unnamed: 0,character,pageRank
43,Val,4.542044
42,Tormund,2.736099
13,Jon-Snow,2.2958
28,Shireen-Baratheon,1.385169
23,Melisandre,1.155507
26,Rattleshirt,1.066376
27,Selyse-Florent,1.066113
14,Mance-Rayder,0.976081
41,Styr,0.919938
40,Ryk,0.64725


Let's now calculate the intra-community Page Rank for all the communities:

In [27]:
query = """\
CALL algo.pageRank(
  'MATCH (c:Character) WHERE c.community=%d RETURN id(c) as id',
  'MATCH (c:Character)-[rel]->(c2) WHERE c.community=%d AND c2.community =%d RETURN id(c) as source,id(c2) as target, sum(rel.weight) as weight',
  {graph:'cypher', writeProperty: 'communityPageRank'}) 
YIELD nodes 
RETURN count(*)
"""

with driver.session() as session:
    for row in session.run("MATCH (c:Character) WHERE EXISTS(c.community) RETURN DISTINCT c.community AS communityId"):        
        community_id = row["communityId"]
        session.run(query % (community_id, community_id, community_id))
print("Page Ranks calculated")        

Page Ranks calculated


We can now work out who the most influential people are inside and outside a community:

In [28]:
query = """\
MATCH (c:Character)
WHERE exists(c.community)
WITH c ORDER BY c.pageRank DESC
WITH  c.community as cluster, count(*) AS count, collect(c) AS characters
RETURN cluster, count, 
       apoc.coll.reverse(apoc.coll.sortNodes(characters, "pageRank"))[0].name AS overallInfluential,
       apoc.coll.reverse(apoc.coll.sortNodes(characters, "communityPageRank"))[0].name AS communityInfluential
ORDER BY count DESC
"""

with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])   

In [29]:
df.sort_values(by=["count"], ascending = False).head(10)

Unnamed: 0,cluster,communityInfluential,count,overallInfluential
0,192,Tywin-Lannister,210,Tyrion-Lannister
1,495,Val,51,Jon-Snow
2,489,Small-Paul,28,Samwell-Tarly
3,354,Wex-Pyke,28,Theon-Greyjoy
4,738,Victarion-Greyjoy,21,Victarion-Greyjoy
5,752,Tyene-Sand,18,Tyene-Sand
6,743,Skahaz-mo-Kandaq,17,Skahaz-mo-Kandaq
7,361,Wyman-Manderly,11,Wyman-Manderly
8,178,Rakharo,11,Rakharo
9,751,Tattered-Prince,11,Quentyn-Martell
