<a href="https://colab.research.google.com/github/mneedham/data-science-training/blob/master/03_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendations: Part 2

In the second part of our recommendations notebook, we'll use the PageRank algorithm to recommend articles to an author. Let's import our libraries in case they aren't already loaded from the previous notebooks:

In [3]:
from py2neo import Graph
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)

In [4]:
# Change the line of code below to use the IP Address, Bolt Port, and Password of your Sandbox.
# graph = Graph("bolt://<IP Address>:<Bolt Port>", auth=("neo4j", "<Password>")) 

# graph = Graph("bolt://18.234.168.45:33679", auth=("neo4j", "daybreak-cosal-rumbles")) 
graph = Graph("bolt://localhost", auth=("neo4j", "neo")) 

## PageRank

Before we use PageRank, let's get up to speed on how it works.

PageRank is an algorithm that measures the transitive influence or connectivity of nodes. It can be computed either by iteratively distributing each node's rank (originally based on degree) over its neighbours, or by randomly traversing the graph and counting how often each node is hit during these walks.
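
The iterative formulation can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind the graph algorithms library: the `pagerank` function, the toy `citations` edge list, the uniform starting rank, and the handling of nodes without outgoing links are all assumptions made for this example. The damping factor of 0.85 and the 20 iterations match the values that algo.pageRank reports when we run it in this notebook.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Toy PageRank by power iteration over (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start from a uniform distribution over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives an equal teleport share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                # Split this node's rank evenly over the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Node without outgoing links: spread its rank over all
                # nodes so that the total rank stays at 1.
                share = damping * rank[node] / len(nodes)
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical citation edges: (citing article, cited article).
citations = [("A", "C"), ("B", "C"), ("C", "D"), ("E", "D"), ("D", "C")]
scores = pagerank(citations)
for article, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{article}: {score:.3f}")
```

Articles that are cited often, or cited by other highly ranked articles, end up with the largest scores. In this toy graph those are C and D. The scores here sum to 1, so they are not on the same scale as the values the library writes to the graph.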

We can run PageRank over the whole graph to find the most influential articles in terms of citations:

In [6]:
query = """
CALL algo.pageRank('Article', 'CITED')
"""
graph.run(query).data()

[{'nodes': 51956,
  'iterations': 20,
  'loadMillis': 35,
  'computeMillis': 29,
  'writeMillis': 45,
  'dampingFactor': 0.85,
  'write': True,
  'writeProperty': 'pagerank'}]

This query stores each score in a 'pagerank' property on the corresponding 'Article' node. We can write the following query to view the most influential articles:

In [8]:
query = """
MATCH (a:Article)
RETURN a.title as article,
       a.pagerank as score
ORDER BY score DESC 
LIMIT 10
"""
graph.run(query).to_data_frame()

Unnamed: 0,article,score
0,A method for obtaining digital signatures and public-key cryptosystems,93.938
1,Secure communications over insecure channels,79.865
2,Rough sets,25.608
3,An axiomatic basis for computer programming,23.022
4,"Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems",21.469
5,SCRIBE: The Design of a Large-Scale Event Notification Infrastructure,19.486
6,A field study of the software design process for large systems,19.023
7,Productivity factors and programming environments,18.494
8,Analyzing medium-scale software development,16.448
9,A Calculus of Communicating Systems,15.428


## Personalized PageRank

Personalized PageRank is a variant of PageRank that allows us to find influential nodes based on a set of source nodes.

For example, rather than finding the overall most influential articles, we could instead find the most influential articles with respect to a given author.

In [15]:
query = """
MATCH (a:Author {name: $author})<-[:AUTHOR]-(article)-[:CITED]->(other)
WITH collect(article) + collect(other) AS sourceNodes
CALL algo.pageRank.stream('Article', 'CITED', {sourceNodes: sourceNodes})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).title AS article, score
ORDER BY score DESC
LIMIT 10
"""

author_name = "Peter G. Neumann"
graph.run(query, {"author": author_name}).to_data_frame()

Unnamed: 0,article,score
0,A technique for software module specification with examples,0.358
1,A messy state of the union: taming the composite state machines of TLS,0.332
2,Crypto policy perspectives,0.278
3,Public interest and the NII,0.278
4,Risks of automation: a cautionary total-system perspective of our cyberfuture,0.278
5,The foresight saga,0.278
6,Risks of e-voting,0.278
7,Password security: a case history,0.278
8,The challenges of partially automated driving,0.268
9,Proof techniques for hierarchically structured programs,0.248
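
To see what the sourceNodes parameter changes, here is a minimal plain-Python sketch of Personalized PageRank. It is an illustration under stated assumptions rather than the library's implementation: the `personalized_pagerank` function and the toy edge list are hypothetical, and sending the rank of nodes without outgoing links back to the source nodes is just one common choice. The only difference from ordinary PageRank is that the random walk restarts at the source nodes instead of at any node in the graph.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=20):
    """Toy Personalized PageRank: restarts land only on source_nodes."""
    sources = set(source_nodes)
    if not sources:
        raise ValueError("at least one source node is required")

    nodes = sorted({n for edge in edges for n in edge} | sources)
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Restart distribution: uniform over the source nodes, zero elsewhere.
    restart = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}

    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node without outgoing links hands its rank
                # back to the source nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] * restart[other]
        rank = new_rank
    return rank


# Hypothetical citation graph with two separate clusters of articles.
toy_edges = [
    ("A", "B"), ("B", "C"), ("C", "A"),  # cluster around article A
    ("X", "Y"), ("Y", "Z"), ("Z", "X"),  # unrelated cluster
]
ppr = personalized_pagerank(toy_edges, source_nodes=["A"])
for article, score in sorted(ppr.items(), key=lambda kv: -kv[1]):
    print(f"{article}: {score:.3f}")
```

Because every restart lands on article A, the unrelated cluster (X, Y, Z) never receives any rank, while ordinary PageRank would score both clusters equally. This is why the query above surfaces articles close to the chosen author's own work.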


## Topic Sensitive Search

We can also use Personalized PageRank for topic-sensitive search, also known as 'Topic Specific PageRank'.

When an author is searching for articles to read, they want the search to take their own research into account. Two authors using the same search term would expect to see different results depending on their area of research.

We'll start by creating a full text search index on the 'title' and 'abstract' properties of all nodes that have the label 'Article':

In [None]:
query = """
CALL db.index.fulltext.createNodeIndex('articles', ['Article'], ['title', 'abstract'])
"""
graph.run(query).data()

We can search the full text index like this:

In [25]:
query = """
CALL db.index.fulltext.queryNodes("articles", "open source")
YIELD node, score
RETURN node.title, score, [(author)<-[:AUTHOR]-(node) | author.name] AS authors
LIMIT 10
"""
graph.run(query).to_data_frame()

Unnamed: 0,authors,node.title,score
0,"[Rob Miller, Pankaj K. Garg, Dean Nelson, Jamie Dinkelacker]",Progressive open source,4.252
1,"[Joseph Feller, Brian Fitzgerald, Walt Scacchi, Krishna K Lakhani, Scott A. Hissam]",Open source application spaces: the 5th workshop on open source software engineering,4.081
2,"[Alan W. Brown, Grady Booch]",Reusing Open-Source Software and Practices: The Impact of Open-Source on Commercial Vendors,4.071
3,[Roy T. Fielding],Software architecture in an open source world,3.815
4,[Susan L. Graham],From Research Software to Open Source,3.784
5,"[Anett Mehler-Bicher, Hauke Heier, Stefan Baldi]",Open courseware and open source software,3.693
6,[Pamela Samuelson],IBM's pragmatic embrace of open source,3.69
7,"[Zeheng Li, LiGuo Huang]",When to release in open source project,3.543
8,"[Jaap-Henk Hoepman, Bart Jacobs]",Increased security through open source,3.515
9,[Peter G. Neumann],Robust open-source software,3.492


We can write the following query to find the authors whose articles on 'open source' have the highest combined search score:

In [46]:
query = """
CALL db.index.fulltext.queryNodes("articles", "open source")
YIELD node, score
MATCH (node)-[:AUTHOR]->(author)
RETURN author.name, sum(score) AS totalScore, collect(node.title) AS articles
ORDER BY totalScore DESC
LIMIT 20
"""

graph.run(query).to_data_frame()

Unnamed: 0,articles,author.name,totalScore
0,"[Open source application spaces: the 5th workshop on open source software engineering, The 3rd w...",Brian Fitzgerald,16.119
1,"[Open source application spaces: the 5th workshop on open source software engineering, The 3rd w...",Joseph Feller,16.012
2,"[Open source application spaces: the 5th workshop on open source software engineering, The futur...",Walt Scacchi,10.731
3,"[Open source-style collaborative development practices in commercial projects using GitHub, Mach...",Daniel M. German,10.687
4,"[Open source application spaces: the 5th workshop on open source software engineering, The 3rd w...",Scott A. Hissam,10.642
5,"[A case study of a corporate open source development model, Managing a corporate open source sof...",James D. Herbsleb,10.476
6,"[Machine learning-based detection of open source license exceptions, Recommending source code fo...",Denys Poshyvanyk,8.907
7,"[Understanding broadcast based peer review on open source software projects, Peer Review on Open...",Margaret-Anne D. Storey,8.181
8,"[Understanding broadcast based peer review on open source software projects, Peer Review on Open...",Peter C. Rigby,7.649
9,"[An automated tool for generating change report from open-source software, Cross project change ...",Ruchika Malhotra,7.132


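We can now combine Full Text Search and Personalized PageRank to find interesting articles for different authors. The first Cypher statement passed to algo.pageRank.stream restricts the graph to articles that match the search term, and the second supplies the citation relationships between articles.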

In [48]:
query = """
MATCH (a:Author {name: $author})<-[:AUTHOR]-(article)-[:CITED]->(other)
WITH a, collect(article) + collect(other) AS sourceNodes
CALL algo.pageRank.stream(
  'CALL db.index.fulltext.queryNodes("articles", $searchTerm)
   YIELD node, score
   RETURN id(node) as id',
  'MATCH (a1:Article)-[:CITED]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: sourceNodes,graph:'cypher', params: {searchTerm: $searchTerm}})
YIELD nodeId, score
WITH a, algo.getNodeById(nodeId) AS n, score
WHERE not(exists((a)<-[:AUTHOR]-(n))) AND score > 0
RETURN n.title AS article, score, [(n)-[:AUTHOR]->(author) | author.name][..5] AS authors
ORDER BY score DESC
LIMIT 10
"""

params = {"author": "Tao Xie", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

Unnamed: 0,article,authors,score
0,Static detection of cross-site scripting vulnerabilities,"[Gary Wassermann, Zhendong Su]",0.386
1,Concern graphs: finding and describing concerns using structural program dependencies,"[Gail C. Murphy, Martin P. Robillard]",0.278
2,Characterizing logging practices in open-source software,"[Soyeon Park, Ding Yuan, Yuanyuan Zhou]",0.278
3,"Automated, contract-based user testing of commercial-off-the-shelf components","[Yvan Labiche, Michal M. Sówka, Lionel C. Briand]",0.278
4,Who should fix this bug,"[John Anvik, Gail C. Murphy, Lyndon Hiew]",0.278
5,Conceptual module querying for software reengineering,"[Gail C. Murphy, Elisa L. A. Baniassad]",0.236
6,Semantics-based code search,[Steven P. Reiss],0.15
7,Bandera: extracting finite-state models from Java source code,"[John Hatcliff, Robby, Corina S. Pasareanu, Matthew B. Dwyer, James C. Corbett]",0.15
8,AsDroid: detecting stealthy behaviors in Android applications by user interface and program beha...,"[Lin Tan, Peng Wang, Bin Liang, Jianjun Huang, Xiangyu Zhang]",0.15
9,EXSYST: search-based GUI testing,"[Andreas Zeller, Gordon Fraser, Florian Gross]",0.128


In [49]:
params = {"author": "Marco Aurélio Gerosa", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

Unnamed: 0,article,authors,score
0,Toward an understanding of the motivation of open source software developers,"[Yunwen Ye, Kouichi Kishida]",0.388
1,Hipikat: recommending pertinent software development artifacts,"[Gail C. Murphy, Davor Cubranic]",0.322
2,Version Sensitive Editing: Change History as a Programming Tool,[David L. Atkins],0.274
3,Which bug should I fix: helping new developers onboard a new project,"[Anita Sarma, Jianguo Wang]",0.239
4,Tesseract: Interactive visual exploration of socio-technical relationships in software development,"[Anita Sarma, Larry Maccherone, Patrick Wagstrom, James D. Herbsleb]",0.203
5,Role Migration and Advancement Processes in OSSD Projects: A Comparative Case Study,"[Chris Jensen, Walt Scacchi]",0.176
6,Does the initial environment impact the future of developers,"[Minghui Zhou, Audris Mockus]",0.176
7,Unifying artifacts and activities in a visual tool for distributed software development teams,"[Paul Dourish, Jon Froehlich]",0.173
8,A case study of open source software development: the Apache server,"[Audris Mockus, Roy Fielding, James D. Herbsleb]",0.11
9,A case study of the evolution of Jun: an object-oriented open-source 3D multimedia library,"[Kumiyo Nakakoji, Brent Reeves, A. Takasbima, Kaoru Hayashi, Y. Yamamoto]",0.11
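
The queries above run PageRank on a projected graph: one Cypher statement picks the nodes that match the search term, another picks the citation relationships. The sketch below mimics that projection step in plain Python so the idea is easier to see. Everything in it is hypothetical: the `project_topic_graph` function, the made-up titles and citations, and the case-insensitive substring match that stands in for the full text index. It also assumes that a relationship is kept only when both of its endpoints are in the node set.

```python
def project_topic_graph(titles, citations, search_term):
    """Keep only the articles matching search_term and the citations between them."""
    term = search_term.lower()

    # Crude stand-in for a full text index: substring match on the title.
    matching = {
        article_id
        for article_id, title in titles.items()
        if term in title.lower()
    }

    # Assumption: a citation survives only if both articles matched.
    edges = [
        (source, target)
        for source, target in citations
        if source in matching and target in matching
    ]
    return matching, edges


# Made-up articles and citations, for illustration only.
titles = {
    1: "Open source licensing in practice",
    2: "Testing open source projects",
    3: "Garbage collection basics",
}
citations = [(1, 2), (1, 3), (3, 2)]

topic_nodes, topic_edges = project_topic_graph(titles, citations, "open source")
print(topic_nodes)
print(topic_edges)
```

Running Personalized PageRank on this smaller graph, with an author's own articles and their citations as source nodes, is what makes the ranking specific to both the topic and the author.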
