<a href="https://colab.research.google.com/github/mneedham/data-science-training/blob/master/02_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendations

In this notebook we're going to learn how to make recommendations using Neo4j. As with the other notebooks, let's get our environment set up.

In [None]:
!pip install py2neo pandas matplotlib

And let's import those libraries:

In [9]:
from py2neo import Graph
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)


Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [4]:
# Change the line of code below to use the IP Address, Bolt Port, and Password of your Sandbox.
# graph = Graph("bolt://<IP Address>:<Bolt Port>", auth=("neo4j", "<Password>")) 

# graph = Graph("bolt://18.234.168.45:33679", auth=("neo4j", "daybreak-cosal-rumbles")) 
graph = Graph("bolt://localhost", auth=("neo4j", "neo")) 
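
If you would rather not keep a password in the notebook itself, one option is to read the connection details from environment variables. The sketch below is only an illustration: the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `sandbox_settings` are hypothetical, and the fallbacks are the local values used in the cell above.

```python
import os

def sandbox_settings(env=os.environ):
    """Collect connection settings, falling back to the local defaults used above.

    The environment variable names are assumptions made for this sketch.
    """
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "neo"),
    }

settings = sandbox_settings()

# With py2neo installed, the connection is then created as in the cell above:
# graph = Graph(settings["uri"], auth=(settings["user"], settings["password"]))
```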

## Finding popular authors

Since we're going to make collaborator suggestions, let's find the authors who have written the most articles so that we have some data to work with.

In [5]:
popular_authors_query = """
MATCH (author:Author)
RETURN author.name, size((author)<-[:AUTHOR]-()) AS articlesPublished
ORDER BY articlesPublished DESC
LIMIT 10
"""

graph.run(popular_authors_query).to_data_frame()

Unnamed: 0,articlesPublished,author.name
0,176,Anil K. Jain
1,160,Moshe Y. Vardi
2,151,Barry W. Boehm
3,140,Miroslav Krstic
4,139,Edwin R. Hancock
5,138,Thomas A. Henzinger
6,137,Josef Kittler
7,132,Tao Xie
8,130,Mark Harman
9,126,Ahmed E. Hassan


Let's pick one of these authors...

In [13]:
author_name = "Tao Xie"

And let's have a look at which articles they've published and how many citations each one has received:

In [19]:
author_articles_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)
RETURN article.title AS article, article.year AS year, size((article)<-[:CITED]-()) AS citations
ORDER BY citations DESC
LIMIT 20
"""

graph.run(author_articles_query,  {"authorName": author_name}).to_data_frame()

Unnamed: 0,article,citations,year
0,Parseweb: a programmer assistant for reusing open source code on the web,94,2007
1,An approach to detecting duplicate bug reports using natural language and execution information,82,2008
2,MAPO: Mining and Recommending API Usage Patterns,72,2009
3,Mining API patterns as partial orders from source code: from usage scenarios to specifications,66,2007
4,Rostra: a framework for detecting redundant object-oriented unit tests,48,2004
5,Fitness-guided path exploration in dynamic symbolic execution,46,2009
6,Inferring Resource Specifications from Natural Language API Documentation,42,2009
7,Improving Structural Testing of Object-Oriented Programs via Integrating Evolutionary Testing an...,41,2008
8,DSD-Crasher: A hybrid analysis tool for bug finding,35,2008
9,Time-aware test-case prioritization using integer linear programming,31,2009


Next, let's find the author's collaborators:

In [16]:
collaborations_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor)
RETURN coauthor.name AS coauthor, count(*) AS collaborations
ORDER BY collaborations DESC
LIMIT 10
"""

graph.run(collaborations_query,  {"authorName": author_name}).to_data_frame()

Unnamed: 0,coauthor,collaborations
0,Nikolai Tillmann,28
1,Jonathan de Halleux,21
2,Lu Zhang,17
3,Dongmei Zhang,17
4,Hong Mei,14
5,Xusheng Xiao,13
6,Suresh Thummalapenta,13
7,Kunal Taneja,10
8,David Notkin,9
9,Wolfram Schulte,8


How would we suggest some future collaborators for this author? One way is by looking at the collaborators of their collaborators!

In [18]:
collaborations_query = """
MATCH (author:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor),
      (coauthor)<-[:AUTHOR]-()-[:AUTHOR]->(coc)
WHERE not((coc)<-[:AUTHOR]-()-[:AUTHOR]->(author)) AND coc <> author      
RETURN coc.name AS coauthor, count(*) AS collaborations
ORDER BY collaborations DESC
LIMIT 10
"""

graph.run(collaborations_query,  {"authorName": author_name}).to_data_frame()

Unnamed: 0,coauthor,collaborations
0,Bing Xie,323
1,Margus Veanes,288
2,Michał Moskal,247
3,Wolfgang Grieskamp,246
4,Manuel Fähndrich,231
5,Mike Barnett,162
6,Sarfraz Khurshid,159
7,Junjie Chen,157
8,Gregg Rothermel,152
9,Haiyan Zhao,139
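
To see what the collaborators-of-collaborators query is counting, here is a minimal plain-Python sketch of the same idea on a made-up dataset. The article IDs, author names, and helper functions are all hypothetical. Like `count(*)` in the Cypher query, it counts one hit per path from the author through a coauthor to a candidate, which is why the scores can be much larger than the number of distinct shared coauthors.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: each article maps to its list of authors.
articles = {
    "a1": ["Alice", "Bob"],
    "a2": ["Alice", "Carol"],
    "a3": ["Bob", "Dave"],
    "a4": ["Bob", "Dave"],
    "a5": ["Carol", "Dave"],
    "a6": ["Carol", "Erin"],
}

def coauthor_pairs(articles):
    """Yield one (author, coauthor) pair per shared article, in both directions."""
    for authors in articles.values():
        yield from permutations(authors, 2)

def suggest_collaborators(articles, author):
    """Rank collaborators of collaborators that `author` has not yet worked with."""
    pairs = list(coauthor_pairs(articles))
    direct = {b for a, b in pairs if a == author}
    counts = Counter()
    for a, coauthor in pairs:
        if a != author:
            continue
        # Walk one more hop: everyone this coauthor has written with.
        for c, candidate in pairs:
            if c == coauthor and candidate != author and candidate not in direct:
                counts[candidate] += 1
    return counts.most_common()

suggestions = suggest_collaborators(articles, "Alice")
print(suggestions)
```

In this toy data Dave is reached through both Bob (two shared articles) and Carol (one), so he outranks Erin, who is reached by a single path.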


In [None]:
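# Create a full-text index over article titles and abstracts.
# This only needs to be run once per database.
query = """
CALL db.index.fulltext.createNodeIndex('articles', ['Article'], ['title', 'abstract'])
"""

graph.run(query)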

In [35]:
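# Personalized PageRank over the citation graph: the author's own articles are the
# source nodes, and only articles matching the full-text search term are ranked.
# Articles the author has already written are filtered out of the results.
query = """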
MATCH (a:Article)-[:AUTHOR]->(author:Author)
WHERE author.name=$authorName
WITH author, collect(a) as articles
CALL algo.pageRank.stream(
  'CALL db.index.fulltext.queryNodes("articles", $searchTerm)
   YIELD node, score
   RETURN id(node) as id',
  'MATCH (a1:Article)-[:CITED]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher', params: {searchTerm: $searchTerm}})
YIELD nodeId, score
WITH author, nodeId, score 
WITH algo.getNodeById(nodeId) AS n, score
WHERE not(exists((author)<-[:AUTHOR]-(n)))
RETURN n.title as article, score, [(n)-[:AUTHOR]->(author) | author.name][..5] AS authors
order by score desc limit 10
"""

params = {"authorName": "Tao Xie", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

Unnamed: 0,article,authors,score
0,The source code control system,[Marc J. Rochkind],22.13
1,Program Improvement by Source-to-Source Transformation,[David B. Loveman],16.746
2,Make — a program for maintaining computer programs,[Stuart I. Feldman],16.325
3,Two case studies of open source software development: Apache and Mozilla,"[Audris Mockus, Roy Fielding, James D. Herbsleb]",15.716
4,Improving and refining programs by program manipulation,"[Dennis F. Kibler, James Milne Neighbors, Thomas A. Standish]",15.708
5,Equivariant adaptive source separation,"[Beate Hvam Laheld, J.-F. Cardoso]",15.261
6,StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks,"[Qian Zhang, Perry Wagle, Aaron Grier, Steve Beattie, P. Bakke]",10.425
7,Clone detection using abstract syntax trees,"[Lorraine Bier, Marcelo M. SantAnna, Leonardo Mendonça de Moura, Andrew Yahin, Ira D. Baxter]",10.25
8,A New Learning Algorithm for Blind Signal Separation,"[Shun-ichi Amari, Andrzej Cichocki, Howard Hua Yang]",9.989
9,Building diverse computer systems,"[Stephanie Forrest, Anil Somayaji, David H. Ackley]",9.729
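
The `sourceNodes` option in the query above asks for a personalized PageRank, where the random walk restarts at the given articles rather than at any node. The following is a rough plain-Python sketch of that idea on a made-up citation graph. It is not Neo4j's implementation: the damping factor, iteration count, handling of articles that cite nothing, and score scaling are all assumptions here, so the numbers are not comparable with the scores in the table.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=20):
    """Power-iteration PageRank whose restarts go only to `source_nodes`.

    `edges` is a list of (citing, cited) pairs. Scores sum to 1.
    """
    sources = set(source_nodes)
    nodes = sorted({n for edge in edges for n in edge} | sources)
    out_links = {n: [] for n in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    # Restart distribution: uniform over the source nodes, zero elsewhere.
    teleport = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # An article that cites nothing hands its score back to the sources.
                for s in sources:
                    new_rank[s] += damping * rank[n] / len(sources)
        rank = new_rank
    return rank

# Hypothetical citation graph: "mine" stands in for the author's own article.
edges = [
    ("mine", "p1"),
    ("mine", "p2"),
    ("p1", "p3"),
    ("p2", "p3"),
    ("other", "p4"),
]

ranks = personalized_pagerank(edges, ["mine"])
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Articles that cannot be reached by following citations from the source article, such as `p4` here, end up with a score of zero. In principle, that is how the ranking is tailored to one author.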


In [36]:
params = {"authorName": "Margus Veanes", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

Unnamed: 0,article,authors,score
0,The source code control system,[Marc J. Rochkind],22.13
1,Program Improvement by Source-to-Source Transformation,[David B. Loveman],16.746
2,Make — a program for maintaining computer programs,[Stuart I. Feldman],16.325
3,Two case studies of open source software development: Apache and Mozilla,"[Audris Mockus, Roy Fielding, James D. Herbsleb]",15.716
4,Improving and refining programs by program manipulation,"[Dennis F. Kibler, James Milne Neighbors, Thomas A. Standish]",15.708
5,Equivariant adaptive source separation,"[Beate Hvam Laheld, J.-F. Cardoso]",15.261
6,StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks,"[Qian Zhang, Perry Wagle, Aaron Grier, Steve Beattie, P. Bakke]",10.425
7,Clone detection using abstract syntax trees,"[Lorraine Bier, Marcelo M. SantAnna, Leonardo Mendonça de Moura, Andrew Yahin, Ira D. Baxter]",10.25
8,A New Learning Algorithm for Blind Signal Separation,"[Shun-ichi Amari, Andrzej Cichocki, Howard Hua Yang]",9.989
9,Building diverse computer systems,"[Stephanie Forrest, Anil Somayaji, David H. Ackley]",9.729
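
Since the same query is run once per author, the two cells above could be folded into a small helper. This is just a convenience sketch: `run_for_authors` is a hypothetical name, and it assumes only the `graph.run(query, params).to_data_frame()` calls already used in this notebook.

```python
def run_for_authors(graph, query, author_names, search_term):
    """Run the same parameterised query once per author.

    Returns a dict mapping each author name to the result of `to_data_frame()`.
    """
    results = {}
    for name in author_names:
        params = {"authorName": name, "searchTerm": search_term}
        results[name] = graph.run(query, params).to_data_frame()
    return results

# Usage in this notebook (needs the live `graph` and `query` defined above):
# frames = run_for_authors(graph, query, ["Tao Xie", "Margus Veanes"], "open source")
# frames["Tao Xie"].equals(frames["Margus Veanes"])  # compare the two result sets
```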
