<a href="https://colab.research.google.com/github/mneedham/data-science-training/blob/master/03_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendations: Part 2

In this second part of the recommendations notebook, we're going to use the PageRank algorithm to recommend articles to an author. Let's start by importing the libraries we need, in case they aren't already loaded from the previous notebooks:

In [1]:
from neo4j import GraphDatabase
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)


Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [2]:
driver = GraphDatabase.driver("bolt://data-science-training-neo4j", auth=("neo4j", "admin"))        
print(driver.address)

Address(host='data-science-training-neo4j', port=7687)
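
Every cell below follows the same pattern: open a session, run a Cypher query, and turn the records into a table. As an optional convenience, that pattern could be wrapped in a small helper. This is only a sketch: `fetch_rows` is a hypothetical name that does not appear in the original notebook, and it assumes the driver's records can be converted with `dict(record)`, as the cells below do.

```python
def fetch_rows(driver, query, params=None):
    """Run a Cypher query and return its records as a list of dicts.

    `fetch_rows` is a hypothetical helper, not part of the neo4j driver.
    """
    with driver.session() as session:
        result = session.run(query, params or {})
        # Read every record while the session is still open
        return [dict(record) for record in result]
```

With this helper in place, a cell could build its table with `pd.DataFrame(fetch_rows(driver, query))`.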


## PageRank

Before we use PageRank, let's first get up to speed on how the algorithm works.

PageRank is an algorithm that measures the transitive influence or connectivity of nodes. It can be computed by either iteratively distributing one node’s rank (originally based on degree) over its neighbours or by randomly traversing the graph and counting the frequency of hitting each node during these walks.
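
To make the iterative formulation concrete, here is a minimal pure-Python sketch on a toy citation graph. It is an illustration only, not the Graph Data Science library's implementation: the function name `pagerank`, the toy edge list, the damping factor of 0.85, and the handling of nodes without outgoing links are all assumptions. The scores are normalised to sum to 1, so they are not on the same scale as the scores returned by the library.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Return a dict of node -> score for a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small base share (the "random jump")
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                # Distribute this node's rank evenly over the nodes it links to
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node with no outgoing links spreads its rank over all nodes
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy citation graph: each pair is (citing article, cited article)
toy_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "B")]
toy_scores = pagerank(toy_edges)
print(sorted(toy_scores, key=toy_scores.get, reverse=True))
```

An article ranks highly when it is cited by articles that themselves rank highly, which is the "transitive influence" described above.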

We can run PageRank over the whole graph to find the most influential articles in terms of citations:

In [3]:
query = """
CALL gds.pageRank.write({
  nodeProjection: "Article",
  relationshipProjection: "CITED",
  writeProperty: "pagerank"
})
"""

with driver.session() as session:
    result = session.run(query)
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])

df

Unnamed: 0,computeMillis,configuration,createMillis,didConverge,nodePropertiesWritten,ranIterations,writeMillis
0,842,"{'maxIterations': 20, 'writeConcurrency': 4, 'sourceNodes': [], 'writeProperty': 'pagerank', 're...",421,False,51956,20,839


This query stores a 'pagerank' property on each 'Article' node. We can write the following query to view the most influential articles:

In [4]:
query = """
MATCH (a:Article) 
RETURN a.title as article, a.pagerank as score 
ORDER BY score DESC 
LIMIT 10 
""" 

with driver.session() as session:
    result = session.run(query)
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])

df

Unnamed: 0,article,score
0,A method for obtaining digital signatures and public-key cryptosystems,93.943
1,Secure communications over insecure channels,79.869
2,Rough sets,25.609
3,An axiomatic basis for computer programming,23.029
4,"Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems",21.47
5,SCRIBE: The Design of a Large-Scale Event Notification Infrastructure,19.486
6,A field study of the software design process for large systems,19.028
7,Productivity factors and programming environments,18.499
8,Analyzing medium-scale software development,16.453
9,A Calculus of Communicating Systems,15.431


## Personalized PageRank

Personalized PageRank is a variant of PageRank that allows us to find influential nodes based on a set of source nodes.

For example, rather than finding the overall most influential articles, we could instead find the most influential articles with respect to a given author.
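
The difference from plain PageRank can be sketched in a few lines of pure Python: the random jump always returns to the source nodes instead of to any node in the graph. This is a simplified illustration under stated assumptions, not the library's implementation. The name `personalized_pagerank`, the toy graph, and the choice to send the rank of nodes without outgoing links back to the source nodes are all hypothetical.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=20):
    """PageRank whose random jump always lands on one of `source_nodes`."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    wanted = set(source_nodes)
    sources = [node for node in nodes if node in wanted]
    if not sources:
        raise ValueError("source_nodes must contain at least one node of the graph")

    # Restart distribution: uniform over the source nodes, zero elsewhere
    restart = {node: (1.0 / len(sources) if node in wanted else 0.0) for node in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) * restart[node] for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: rank of a node without outgoing links returns to the sources
                share = damping * rank[node] / len(sources)
                for source in sources:
                    new_rank[source] += share
        rank = new_rank
    return rank


# Toy graph: X and Y cannot be reached from the source node A
pp_edges = [("A", "B"), ("B", "C"), ("X", "C"), ("X", "Y")]
pp_scores = personalized_pagerank(pp_edges, source_nodes=["A"])
print(pp_scores)
```

Nodes that cannot be reached from the source set end up with a score of zero, which is why the results are biased towards the neighbourhood of the chosen author's articles.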

In [11]:
query = """
MATCH (a:Author {name: $author})<-[:AUTHOR]-(article)-[:CITED]->(other)
WITH collect(article) + collect(other) AS sourceNodes
CALL gds.pageRank.stream({
  nodeProjection: 'Article', 
  relationshipProjection: 'CITED', 
  sourceNodes: sourceNodes
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, score
WHERE not node in sourceNodes
RETURN node.title AS article, score, [(node)-[:AUTHOR]->(author) | author.name] AS authors
ORDER BY score DESC
LIMIT 10
"""

author_name = "Whitfield Diffie"
with driver.session() as session:
    result = session.run(query, {"author": author_name})
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])

df

Unnamed: 0,article,authors,score
0,A method for obtaining digital signatures and public-key cryptosystems,"[Adi Shamir, Leonard M. Adleman, Ronald L. Rivest]",93.943
1,Secure communications over insecure channels,[Ralph C. Merkle],79.869
2,Rough sets,"[Roman Słowiński, Wojciech Ziarko, Jerzy W. Grzymala-Busse, Zdzisław Pawlak]",25.609
3,An axiomatic basis for computer programming,[C. A. R. Hoare],23.029
4,"Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems","[Antony I. T. Rowstron, Peter Druschel]",21.47
5,SCRIBE: The Design of a Large-Scale Event Notification Infrastructure,"[Peter Druschel, Anne-Marie Kermarrec, Antony I. T. Rowstron, Miguel Castro]",19.486
6,A field study of the software design process for large systems,"[Bill Curtis, Herb Krasner, Neil Iscoe]",19.028
7,Productivity factors and programming environments,"[S. Hoben, Ray W. Wolverton, John R. Vosburgh, B. Albert, H Malec, Y. Liu, Bill Curtis]",18.499
8,Analyzing medium-scale software development,"[Victor R. Basili, Marvin V. Zelkowitz]",16.453
9,A Calculus of Communicating Systems,[Robin Milner],15.431


## Topic Sensitive Search

We can also use Personalized PageRank to do 'Topic Specific PageRank'.

When an author is searching for articles to read, they want that search to take them into account. Two authors using the same search term would expect to see different results depending on their area of research.

We'll start by creating a full text search index on the 'title' and 'abstract' properties of all nodes that have the label 'Article':


In [12]:
query = """
CALL db.index.fulltext.createNodeIndex('articles', ['Article'], ['title', 'abstract'])
"""

with driver.session() as session:
    # consume() waits for the procedure to finish and returns the result summary
    session.run(query).consume()

ClientError: There already exists an index NODE:label[0](property[2], property[5]).

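The ClientError above just means that the index already existed from an earlier run, so it can be ignored. We can check that our full text index is in place by running the following query: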

In [15]:
query = """
CALL db.indexes()
YIELD description, indexName, tokenNames, properties, state, type, progress
WHERE type = "node_fulltext"
RETURN *
"""
with driver.session() as session:
    result = session.run(query)
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,description,indexName,progress,properties,state,tokenNames,type
0,"INDEX ON NODE:Article(title, abstract)",articles,100.0,"[title, abstract]",ONLINE,[Article],node_fulltext


We can search the full text index like this:

In [21]:
query = """
CALL db.index.fulltext.queryNodes("articles", $searchTerm)
YIELD node, score
RETURN node.title, score, [(author)<-[:AUTHOR]-(node) | author.name] AS authors
LIMIT 10
"""
with driver.session() as session:
    result = session.run(query, {"searchTerm": "educating"})
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,authors,node.title,score
0,[Peter J. Denning],Educating a new engineer,3.497
1,[Robert B. Schnabel],Educating computing's next generation,3.497
2,"[F. Díaz del Río, D. Cascado, José Luis Sevillano, F. J. Hurtado-Núñez, F. J. Moriana-García, Lu...","eDiab: a system for monitoring, assisting and educating people with diabetes",3.353
3,"[Barry W. Boehm, Supannika Koolmanojwong Mobasser]",System thinking: educating t-shaped software engineers,2.623
4,"[Daniel Port, Barry W. Boehm]",Educating software engineering students to manage risk,2.623
5,"[Ana M. Moreno, Lawrence Peters]",Educating software engineering managers: revisited: what software project managers need to know ...,1.748
6,"[Kenji Taguchi, Hironori Washizaki, Shinichi Honiden, Yasuyuki Tahara, Nobukazu Yoshioka]",Top SE: Educating Superarchitects Who Can Apply Software Engineering Tools to Practical Developm...,1.748
7,"[Thomas Delahunty, Katherine Sewell, Gregory M. P. O'Hare, A. Murphy]",ECHOES: An Immersive Training Experience,0.835
8,"[Ivica Crnkovic, Marin Orlic, Igor Čavrak]",Collaboration patterns in distributed software development projects,0.73
9,[Bruria Haberman],Teaching computing in secondary schools in a dynamic world: challenges and directions,0.626


We can write the following query to find the authors whose articles on 'open source' have the highest combined search score:

In [35]:
query = """
CALL db.index.fulltext.queryNodes("articles", "open source")
YIELD node AS article, score
MATCH (article)-[:AUTHOR]->(author)
RETURN author.name, sum(score) AS totalScore, collect(article.title) AS articles
ORDER BY totalScore DESC
LIMIT 20
"""
with driver.session() as session:
    result = session.run(query)
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,articles,author.name,totalScore
0,"[Open source application spaces: the 5th workshop on open source software engineering, The 3rd w...",Brian Fitzgerald,16.119
1,"[Open source application spaces: the 5th workshop on open source software engineering, The 3rd w...",Joseph Feller,16.012
2,"[Open source application spaces: the 5th workshop on open source software engineering, The futur...",Walt Scacchi,10.731
3,"[Open source-style collaborative development practices in commercial projects using GitHub, Mach...",Daniel M. German,10.687
4,"[Open source application spaces: the 5th workshop on open source software engineering, The 3rd w...",Scott A. Hissam,10.642
5,"[A case study of a corporate open source development model, Managing a corporate open source sof...",James D. Herbsleb,10.476
6,"[Machine learning-based detection of open source license exceptions, Recommending source code fo...",Denys Poshyvanyk,8.907
7,"[Understanding broadcast based peer review on open source software projects, Peer Review on Open...",Margaret-Anne D. Storey,8.181
8,"[Understanding broadcast based peer review on open source software projects, Peer Review on Open...",Peter C. Rigby,7.649
9,"[An automated tool for generating change report from open-source software, Cross project change ...",Ruchika Malhotra,7.132


We can now use Full Text Search and Personalized PageRank to find interesting articles for different authors.

In [34]:
query = """
MATCH (a:Article)-[:AUTHOR]->(author:Author)
WHERE author.name=$authorName
WITH author, collect(a) as articles
CALL gds.pageRank.stream({
  nodeQuery: 'CALL db.index.fulltext.queryNodes("articles", $searchTerm)
              YIELD node, score
              RETURN id(node) as id',
  relationshipQuery: 'MATCH (a1:Article)-[:CITED]->(a2:Article) 
                      RETURN id(a1) as source,id(a2) as target', 
  sourceNodes: articles, 
  validateRelationships: false,
  parameters: {searchTerm: $searchTerm}
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS n, score
WHERE not(exists((author)<-[:AUTHOR]-(n)))
RETURN n.title as article, score, [(n)-[:AUTHOR]->(author) | author.name][..5] AS authors
ORDER BY score DESC
LIMIT 10
"""

params = {"authorName": "Tao Xie", "searchTerm": "open source"}

with driver.session() as session:
    result = session.run(query, params)
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,article,authors,score
0,Static detection of cross-site scripting vulnerabilities,"[Gary Wassermann, Zhendong Su]",0.236
1,Characterizing logging practices in open-source software,"[Ding Yuan, Soyeon Park, Yuanyuan Zhou]",0.128
2,"Automated, contract-based user testing of commercial-off-the-shelf components","[Michal M. Sówka, Lionel C. Briand, Yvan Labiche]",0.128
3,Concern graphs: finding and describing concerns using structural program dependencies,"[Gail C. Murphy, Martin P. Robillard]",0.128
4,Who should fix this bug,"[Gail C. Murphy, Lyndon Hiew, John Anvik]",0.128
5,Conceptual module querying for software reengineering,"[Gail C. Murphy, Elisa L. A. Baniassad]",0.108
6,IBM's pragmatic embrace of open source,[Pamela Samuelson],0.0
7,Open courseware and open source software,"[Hauke Heier, Anett Mehler-Bicher, Stefan Baldi]",0.0
8,Reusing Open-Source Software and Practices: The Impact of Open-Source on Commercial Vendors,"[Alan W. Brown, Grady Booch]",0.0
9,From Research Software to Open Source,[Susan L. Graham],0.0
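
The query above projects only the articles matched by the full text search, and `validateRelationships: false` lets the projection skip citations that point to or from articles outside that set rather than failing. The effect on the citation graph can be sketched in pure Python. The function name `induced_edges` and the toy data are hypothetical and only illustrate the idea.

```python
def induced_edges(edges, kept_nodes):
    """Keep only the edges whose source and target are both in `kept_nodes`."""
    kept = set(kept_nodes)
    return [(source, target) for source, target in edges if source in kept and target in kept]


# Toy citation graph; pretend the search term matched p1, p2 and p4
citations = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p4")]
matches = {"p1", "p2", "p4"}
print(induced_edges(citations, matches))
```

Personalized PageRank then runs on this smaller, topic-specific graph, so citations that pass through articles unrelated to the search term do not contribute to the scores.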


In [31]:
params = {"authorName": "Margus Veanes", "searchTerm": "open source"}
with driver.session() as session:
    result = session.run(query, params)
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,article,authors,score
0,Crypto policy perspectives,"[Susan Landau, Dorothy E. Denning, Anthony G. Lauck, David Sobel, Scott Charney]",0.278
1,Reassessing the crypto debate,[Peter G. Neumann],0.15
2,Enabling crypto: how radical innovations occur,[Arnd Weber],0.15
3,A novel hybrid crypto-biometric authentication scheme for ATM based banking applications,"[Jie Zhou, Jiankun Hu, Yong Feng, Fengling Han, Xinhuo Yu]",0.15
4,Case study of a fault attack on asynchronous DES crypto-processors,"[Christophe Clavier, Yannick Monnet, Pascal Moitrel, Marc Renaudin, Régis Leveugle]",0.15
5,A Survey of Name-Passing Calculi and Crypto-Primitives,"[Giuseppe Castagna, Riccardo Focardi, Michele Bugliesi, Silvia Crafa, Vladimiro Sassone]",0.15
6,Crypto key management,[Peter G. Neumann],0.15
7,Neuroscience meets cryptography: crypto primitives secure against rubber hose attacks,"[Patrick Lincoln, Daniel J. Sanchez, Dan Boneh, Hristo Bojinov, Paul J. Reber]",0.15
8,A plain text on crypto policy,[John Perry Barlow],0.15
9,Crypto backup and key escrow,[David P. Maher],0.15


Let's try the same query with a different author:

In [None]:
params = {"authorName": "Marco Aurélio Gerosa", "searchTerm": "open source"}

with driver.session() as session:
    result = session.run(query, params)
    # Read the records while the session is still open
    df = pd.DataFrame([dict(record) for record in result])
df