
# Recommendations: Part 2

In the second part of our recommendations notebook, we're going to use the PageRank algorithm to recommend articles to an author. Let's start by importing the libraries we need, in case they aren't already loaded from the previous notebooks:

In [1]:
from neo4j import GraphDatabase
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)


Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [2]:
driver = GraphDatabase.driver("bolt://data-science-training-neo4j", auth=("neo4j", "admin"))        
print(driver.address)

Address(host='data-science-training-neo4j', port=7687)
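
Every cell below follows the same pattern: open a session, run a query, and turn the records into a DataFrame. Here is a minimal sketch of a helper that wraps that pattern. The name `run_query` is hypothetical and not part of the neo4j driver. It copies the records into plain dictionaries while the session is still open, which is the safer habit because some driver versions won't let you read a result after its session has closed.

```python
def run_query(driver, query, params=None):
    """Run a Cypher query and return its records as a list of dicts.

    The records are copied while the session is still open, so the
    returned list stays usable after the session has been closed.
    """
    with driver.session() as session:
        result = session.run(query, params or {})
        return [dict(record) for record in result]

# Possible usage with the driver created above (needs a running database):
# pd.DataFrame(run_query(driver, "MATCH (a:Article) RETURN count(a) AS articles"))
```

The helper returns plain dictionaries rather than a DataFrame, so it can be used without pandas as well.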


## PageRank

We're going to use the PageRank algorithm, so let's first get up to speed on this algorithm.

PageRank is an algorithm that measures the transitive influence or connectivity of nodes. It can be computed by either iteratively distributing one node’s rank (originally based on degree) over its neighbours or by randomly traversing the graph and counting the frequency of hitting each node during these walks.
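
To make the iterative version concrete, here is a small sketch in plain Python on a made-up citation graph. It is not the library's implementation. The damping factor of 0.85, the even spreading of rank from articles that cite nothing, and the normalisation of scores to sum to 1 are all assumptions chosen for illustration, so only the ordering of the scores is meaningful.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start with the rank spread evenly over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share (the "random jump").
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            # Nodes without outgoing links spread their rank over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: ("B", "A") means article B cites article A.
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
scores = pagerank(citations)
for article in sorted(scores, key=scores.get, reverse=True):
    print(article, round(scores[article], 3))
```

Article A is cited by all the others, so it ends up with the highest score, followed by B, which is cited once.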

We can run PageRank over the whole citation graph and store each article's score, which will let us find the most influential articles in terms of citations:

In [3]:
query = """
CALL gds.pageRank.write({
  nodeProjection: "Article",
  relationshipProjection: "CITED",
  writeProperty: "pagerank"
})
"""

with driver.session() as session:
    result = session.run(query)

pd.DataFrame([dict(record) for record in result])

Unnamed: 0,computeMillis,configuration,createMillis,didConverge,nodePropertiesWritten,ranIterations,writeMillis
0,1004,"{'maxIterations': 20, 'writeConcurrency': 4, 'sourceNodes': [], 'writeProperty': 'pagerank', 're...",607,False,51956,20,732


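This query stores a 'pagerank' property on each node. The summary above shows that 51,956 properties were written and that the algorithm stopped after its maximum of 20 iterations without converging (`didConverge` is False). We can write the following query to view the most influential articles: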

In [6]:
query = """
MATCH (a:Article) 
RETURN a.title as article, a.pagerank as score, a.year, [(a)-[:AUTHOR]->(au) | au.name]
ORDER BY score DESC 
LIMIT 10 
""" 

with driver.session() as session:
    result = session.run(query)

pd.DataFrame([dict(record) for record in result])

Unnamed: 0,[(a)-[:AUTHOR]->(au) | au.name],a.year,article,score
0,"[Adi Shamir, Leonard M. Adleman, Ronald L. Rivest]",1978,A method for obtaining digital signatures and public-key cryptosystems,93.943
1,[Ralph C. Merkle],1978,Secure communications over insecure channels,79.869
2,"[Roman Słowiński, Wojciech Ziarko, Jerzy W. Grzymala-Busse, Zdzisław Pawlak]",1995,Rough sets,25.609
3,[C. A. R. Hoare],1969,An axiomatic basis for computer programming,23.029
4,"[Antony I. T. Rowstron, Peter Druschel]",2001,"Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems",21.47
5,"[Peter Druschel, Anne-Marie Kermarrec, Antony I. T. Rowstron, Miguel Castro]",2001,SCRIBE: The Design of a Large-Scale Event Notification Infrastructure,19.486
6,"[Bill Curtis, Herb Krasner, Neil Iscoe]",1988,A field study of the software design process for large systems,19.028
7,"[S. Hoben, Ray W. Wolverton, John R. Vosburgh, B. Albert, H Malec, Y. Liu, Bill Curtis]",1984,Productivity factors and programming environments,18.499
8,"[Victor R. Basili, Marvin V. Zelkowitz]",1978,Analyzing medium-scale software development,16.453
9,[Robin Milner],1982,A Calculus of Communicating Systems,15.431


## Personalized PageRank

Personalized PageRank is a variant of PageRank that allows us to find influential nodes based on a set of source nodes.

For example, rather than finding the overall most influential articles, we could instead find the most influential articles with respect to a given author.
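
The difference from plain PageRank can be sketched in a few lines of Python: the random jump no longer lands on an arbitrary node but always returns to one of the source nodes. This is a toy illustration rather than the library's implementation. The function name, the damping factor, and the choice to send rank from dead-end nodes back to the sources are assumptions.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=20):
    """PageRank whose random jumps always restart at the source nodes."""
    sources = sorted(set(source_nodes))
    nodes = sorted({n for edge in edges for n in edge} | set(sources))
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # All of the restart probability sits on the source nodes.
    restart = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            # Assumption: dead-end nodes hand their rank back to the sources.
            targets = out_links[node] or sources
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: "my_paper" cites X, X cites Y, and an unrelated paper cites Z.
edges = [("my_paper", "X"), ("X", "Y"), ("unrelated", "Z")]
scores = personalized_pagerank(edges, source_nodes=["my_paper"])
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

In this sketch, nodes that can't be reached from the source nodes (here `unrelated` and `Z`) get a score of exactly 0, while everything downstream of `my_paper` gets a positive score.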

In [9]:
query = """
MATCH (a:Author {name: $author})<-[:AUTHOR]-(article)-[:CITED]->(other)
WITH collect(article) + collect(other) AS sourceNodes
CALL gds.pageRank.stream({
  nodeProjection: 'Article', 
  relationshipProjection: 'CITED', 
  sourceNodes: sourceNodes
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, score
WHERE not node in sourceNodes
RETURN node.title AS article, score, [(node)-[:AUTHOR]->(author) | author.name] AS authors
ORDER BY score DESC
LIMIT 10
"""

author_name = "Josef Kittler"
with driver.session() as session:
    result = session.run(query, {"author": author_name})

pd.DataFrame([dict(record) for record in result])

Unnamed: 0,article,authors,score
0,BioEVA: An Evaluation Tool for Biometric Algorithms,"[Luiz H. R. Sucupira, Miguel Gustavo Lizárraga, Lee Luan Ling, Lívia C. F. Araújo]",0.0
1,Using Classpects for Integrating Non-Functional and Functional Requirements.,"[Doo-Hwan Bae, Tegegne Marew]",0.0
2,Solution Proposals for Japan-Oriented Offshore Software Development in China,"[Lei Zhang, Jun Hosoya, Ryota Mibe, Xuan Zhang, Meiping Chai, Yoji Taniguchi, Yibing Tan, Shiger...",0.0
3,Mapping Requirements to Software Architecture by Feature-Orientation.,"[Dongyun Liu, Hong Mei]",0.0
4,QoS Signaling for Parameterized Traffic in IEEE 802.11e Wireless LANs,"[Sunghyun Choi, N. Sai Shankar]",0.0
5,Specification of Adleman's Restricted Model Using an Automated Reasoning System: Verification of...,"[C. Graciani Díaz, Mario J. Pérez-Jiménez, Francisco-Jesús Martín-Mateos]",0.0
6,Utility function induced by fuzzy target in probabilistic decision making,"[Van-Nam Huynh, Yoshiteru Nakamori, Tu Bao Ho]",0.0
7,Distributed Process Management System Based on Object-Centered Process Modeling,"[Makoto Oshita, Katsuro Inoue, Hajimu Iida, Makoto Matsushita]",0.0
8,Seizing Power: Shaders and Storytellers,[Kevin Bjorke],0.0
9,Experience with Global Analysis: A Practical Method for Analyzing Factors that Influence Softwar...,"[Dilip Soni, Robert L. Nord]",0.0


## Topic Sensitive Search

We can also use Personalized PageRank to do 'Topic Specific PageRank'.

When an author searches for articles to read, they want the results to take their own research into account. Two authors using the same search term would expect to see different results depending on their area of research.
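
The idea can be sketched on a toy graph in plain Python: filter the articles down to those that mention the search term, keep only the citations between them, and run Personalized PageRank from the author's own articles. Everything here is made up for illustration, including the titles, the substring match that stands in for full text search, and the assumption that only the author's on-topic articles act as source nodes.

```python
def topic_sensitive_scores(titles, citations, term, author_articles,
                           damping=0.85, iterations=20):
    """Personalized PageRank restricted to articles that mention `term`."""
    # 1. Stand-in for full text search: keep articles whose title has the term.
    topic = {a for a, title in titles.items() if term.lower() in title.lower()}
    # 2. Keep only citations between two on-topic articles.
    edges = [(s, t) for s, t in citations if s in topic and t in topic]
    # 3. Random walks restart at the author's own on-topic articles.
    sources = [a for a in author_articles if a in topic]
    if not sources:
        return {}
    out_links = {a: [t for s, t in edges if s == a] for a in topic}
    restart = {a: (1.0 / len(sources) if a in sources else 0.0) for a in topic}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {a: (1.0 - damping) * restart[a] for a in topic}
        for a in topic:
            # Articles that cite nothing on-topic send rank back to the sources.
            targets = out_links[a] or sources
            for t in targets:
                new_rank[t] += damping * rank[a] / len(targets)
        rank = new_rank
    # Recommend on-topic articles the author did not write, best first.
    ranked = sorted(rank.items(), key=lambda item: -item[1])
    return {a: s for a, s in ranked if a not in author_articles and s > 0}

titles = {
    1: "Open source onboarding",
    2: "Motivation of open source developers",
    3: "Bug triage in open source projects",
    4: "Elliptic curve cryptosystems",
    5: "Open source licensing",
}
citations = [(1, 2), (1, 3), (3, 2), (1, 4), (4, 2)]

recs_first_author = topic_sensitive_scores(titles, citations, "open source", [1])
recs_second_author = topic_sensitive_scores(titles, citations, "open source", [3])
print(list(recs_first_author))
print(list(recs_second_author))
```

The two authors use the same search term but get different recommendations, because the walks start from different articles. Article 4 never appears because it doesn't match the term, and article 5 is dropped because nothing the authors wrote leads to it.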

We'll start by creating a full text search index on the 'title' and 'abstract' properties of all nodes that have the label 'Article':


In [10]:
query = """
CALL db.index.fulltext.createNodeIndex('articles', ['Article'], ['title', 'abstract'])
"""

with driver.session() as session:
    session.run(query).summary()

ClientError: There already exists an index NODE:label[0](property[2], property[5]).

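The `ClientError` above means that the index already existed when this cell was run, so nothing new was created. We can check that our full text index is in place by running the following query: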

In [11]:
query = """
CALL db.indexes()
YIELD description, indexName, tokenNames, properties, state, type, progress
WHERE type = "node_fulltext"
RETURN *
"""
with driver.session() as session:
    result = session.run(query)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,description,indexName,progress,properties,state,tokenNames,type
0,"INDEX ON NODE:Article(title, abstract)",articles,100.0,"[title, abstract]",ONLINE,[Article],node_fulltext


We can search the full text index like this:

In [16]:
query = """
CALL db.index.fulltext.queryNodes("articles", $searchTerm)
YIELD node, score
RETURN node.title, score, [(author)<-[:AUTHOR]-(node) | author.name] AS authors
LIMIT 10
"""
with driver.session() as session:
    result = session.run(query, {"searchTerm": "cryptosystems"})
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,authors,node.title,score
0,"[Katsunori Tanaka, Shigenori Uchiyama, Tatsuaki Okamoto]",Quantum public-key cryptosystems,3.782
1,"[Wen Ping Ma, Moon Ho Lee]",Group oriented cryptosystems based on linear access structures,3.396
2,[Tatsuaki Okamoto],On pairing-based cryptosystems,3.122
3,[Ross J. Anderson],Why cryptosystems fail,3.122
4,"[Michael J. Wiener, Robert J. Zuccherato]",Faster Attacks on Elliptic Curve Cryptosystems,3.113
5,[Patrick Felke],On the affine transformations of HFE-cryptosystems and systems with branches,2.988
6,[Kristian Gjøsteen],Homomorphic cryptosystems based on subgroup membership problems,2.951
7,"[Sachar Paulus, Detlef Hühnlein]",On the Implementation of Cryptosystems Based on Real Quadratic Number Fields,2.88
8,"[Moti Yung, Jean-Jacques Quisquater, Marc Joye, Sung-Ming Yen]",Observability Analysis - Detecting When Improved Cryptosystems Fail,2.799
9,"[Siddika Berna Örs, G Geeke Bruin-Muurling, Lejla Batina]",Flexible Hardware Design for RSA and Elliptic Curve Cryptosystems,2.799
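
One thing to keep in mind when searching for more than one word: the full text index is backed by Apache Lucene, and as far as we know the search string is parsed with Lucene's query syntax, where an unquoted string such as `open source` matches articles containing either word. Wrapping the string in double quotes asks for the exact phrase instead. The helper below is a hypothetical sketch (`as_phrase` is not part of Neo4j or the driver) that quotes a search term and escapes only backslashes and double quotes.

```python
def as_phrase(text):
    """Wrap `text` in double quotes so Lucene treats it as one phrase."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

# The quoted term can be passed as the $searchTerm parameter.
params = {"searchTerm": as_phrase("open source")}
print(params["searchTerm"])
```

Whether phrase matching gives better recommendations than matching either word depends on the search, so it's worth trying both.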


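We can write the following query to rank authors by the full text score of their articles on 'cryptosystems'. Note that `size(article.title)` returns the length of the title in characters rather than a count of articles. Because it isn't an aggregate, it also acts as a grouping key, so an author appears once per matching article (Tatsuaki Okamoto appears twice below). Replacing it with `count(article)` would give one row per author with a real article count: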

In [19]:
query = """
CALL db.index.fulltext.queryNodes("articles", "cryptosystems")
YIELD node AS article, score
MATCH (article)-[:AUTHOR]->(author)
RETURN author.name, sum(score) AS totalScore, size(article.title) AS articles
ORDER BY totalScore DESC
LIMIT 20
"""
with driver.session() as session:
    result = session.run(query)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,articles,author.name,totalScore
0,32,Tatsuaki Okamoto,3.782
1,32,Katsunori Tanaka,3.782
2,32,Shigenori Uchiyama,3.782
3,62,Wen Ping Ma,3.396
4,62,Moon Ho Lee,3.396
5,22,Ross J. Anderson,3.122
6,30,Tatsuaki Okamoto,3.122
7,46,Michael J. Wiener,3.113
8,46,Robert J. Zuccherato,3.113
9,76,Patrick Felke,2.988


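We can now combine Full Text Search and Personalized PageRank to find interesting articles for different authors. In the query below, the node query limits the graph to articles that match the search term, the relationship query supplies the citations between articles, and the author's own articles are used as the source nodes: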

In [24]:
query = """
MATCH (a:Article)-[:AUTHOR]->(author:Author)
WHERE author.name=$authorName
WITH author, collect(a) as articles
CALL gds.pageRank.stream({
  nodeQuery: 'CALL db.index.fulltext.queryNodes("articles", $searchTerm)
              YIELD node, score
              RETURN id(node) as id',
  relationshipQuery: 'MATCH (a1:Article)-[:CITED]->(a2:Article) 
                      RETURN id(a1) as source,id(a2) as target', 
  sourceNodes: articles, 
  validateRelationships: false,
  parameters: {searchTerm: $searchTerm}
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS n, score
WHERE not(exists((author)<-[:AUTHOR]-(n))) AND score > 0
RETURN n.title as article, score, [(n)-[:AUTHOR]->(author) | author.name][..5] AS authors, n.year as year
ORDER BY score DESC LIMIT 10
"""

params = {"authorName": "Josef Kittler", "searchTerm": "encryption"}

with driver.session() as session:
    result = session.run(query, params)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,article,authors,score,year
0,A method for obtaining digital signatures and public-key cryptosystems,"[Adi Shamir, Leonard M. Adleman, Ronald L. Rivest]",3.277,1978
1,REACT: Rapid Enhanced-Security Asymmetric Cryptosystem Transform,"[David Pointcheval, Tatsuaki Okamoto]",0.759,2001
2,Using encryption for authentication in large networks of computers,"[Michael D. Schroeder, Roger M. Needham]",0.687,1978
3,Key-insulated public key cryptosystems,"[Jonathan Katz, Moti Yung, Yevgeniy Dodis, Shouhuai Xu]",0.523,2002
4,The Oracle Diffie-Hellman Assumptions and an Analysis of DHIES,"[Michel Abdalla, Mihir Bellare, Phillip Rogaway]",0.472,2001
5,Computing arbitrary functions of encrypted data,[Craig Gentry],0.405,2010
6,Long-lived broadcast encryption,"[Avishai Wool, Jessica Staddon, Juan A. Garay]",0.405,2000
7,Fast Implementation and Fair Comparison of the Final Candidates for Advanced Encryption Standard...,"[Kris Gaj, Pawel Chodowiec]",0.278,2001
8,Padding Oracle Attacks on the ISO CBC Mode Encryption Standard,"[Arnold K. L. Yau, Kenneth G. Paterson]",0.278,2004
9,A Designer’s Guide to KEMs,[Alexander W. Dent],0.278,2003


We can reuse the query with a different author and search term. This time let's look for 'open source' articles for Margus Veanes:


In [25]:
params = {"authorName": "Margus Veanes", "searchTerm": "open source"}
with driver.session() as session:
    result = session.run(query, params)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,article,authors,score,year
0,"TEG: A High-Performance, Scalable, Multi-network Point-to-Point Communications Methodology","[Edgar Gabriel, Thara Angskun, Graham E. Fagg, David J. Daniel, Vishal Sahay]",3.319,2004
1,"Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation","[Andrew Lumsdaine, Edgar Gabriel, David J. Daniel, Prabhanjan Kambadur, Richard L. Graham]",2.87,2004
2,Concern graphs: finding and describing concerns using structural program dependencies,"[Gail C. Murphy, Martin P. Robillard]",2.626,2002
3,Conceptual module querying for software reengineering,"[Gail C. Murphy, Elisa L. A. Baniassad]",2.542,1998
4,Who should fix this bug,"[Gail C. Murphy, Lyndon Hiew, John Anvik]",2.342,2006
5,Hipikat: recommending pertinent software development artifacts,"[Davor Cubranic, Gail C. Murphy]",2.205,2003
6,Version Sensitive Editing: Change History as a Programming Tool,[David L. Atkins],2.151,1998
7,DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones,"[Zhendong Su, Stéphane Glondu, Ghassan Misherghi, Lingxiao Jiang]",2.145,2007
8,Recovering documentation-to-source-code traceability links using latent semantic indexing,"[Jonathan I. Maletic, Andrian Marcus]",2.006,2003
9,Coverage is not strongly correlated with test suite effectiveness,"[Reid Holmes, Laura Inozemtseva]",1.928,2014


Let's try the same search term with a different author:

In [26]:
params = {"authorName": "Marco Aurélio Gerosa", "searchTerm": "open source"}

with driver.session() as session:
    result = session.run(query, params)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,article,authors,score,year
0,Toward an understanding of the motivation of open source software developers,"[Yunwen Ye, Kouichi Kishida]",0.111,2003
1,Which bug should I fix: helping new developers onboard a new project,"[Anita Sarma, Jianguo Wang]",0.089,2011
2,Hipikat: recommending pertinent software development artifacts,"[Davor Cubranic, Gail C. Murphy]",0.08,2003
3,Tesseract: Interactive visual exploration of socio-technical relationships in software development,"[Anita Sarma, James D. Herbsleb, Larry Maccherone, Patrick Wagstrom]",0.076,2009
4,Version Sensitive Editing: Change History as a Programming Tool,[David L. Atkins],0.068,1998
5,Unifying artifacts and activities in a visual tool for distributed software development teams,"[Paul Dourish, Jon Froehlich]",0.064,2004
6,Progressive open source,"[Rob Miller, Jamie Dinkelacker, Pankaj K. Garg, Dean Nelson]",0.031,2002
7,A case study of open source software development: the Apache server,"[James D. Herbsleb, Roy Fielding, Audris Mockus]",0.031,2000
8,A case study of the evolution of Jun: an object-oriented open-source 3D multimedia library,"[Kumiyo Nakakoji, Atsushi Aoki, Y. Yamamoto, Yoshiyuki Nishinaka, Kouichi Kishida]",0.031,2001
9,An empirical study of global software development: distance and speed,"[Rebecca E. Grinter, Thomas A. Finholt, Audris Mockus, James D. Herbsleb]",0.027,2001
