<a href="https://colab.research.google.com/github/mneedham/data-science-training/blob/master/03_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendations: Part 2

In the second part of our recommendations notebook, we're going to use the PageRank algorithm to recommend articles to an author. Let's import our libraries in case they aren't already loaded from the previous notebooks:

In [None]:
from neo4j import GraphDatabase
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)


Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [None]:
driver = GraphDatabase.driver("bolt://data-science-training-neo4j", auth=("neo4j", "admin"))        
print(driver.address)

## PageRank

Before we use the PageRank algorithm, let's get up to speed on how it works.

PageRank is an algorithm that measures the transitive influence or connectivity of nodes. It can be computed by either iteratively distributing one node’s rank (originally based on degree) over its neighbours or by randomly traversing the graph and counting the frequency of hitting each node during these walks.
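
To make the iterative description above concrete, here is a minimal pure-Python sketch. It is not how the Graph Data Science library computes PageRank, and its scores won't match the `pagerank` values GDS writes. The toy graph, the `pagerank` function name, the damping factor of 0.85 and the 20 iterations are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Iteratively distribute each node's rank over the nodes it cites."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start with the rank spread evenly over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share (the "random jump").
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            # Nodes that cite nothing spread their rank evenly over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy citation graph: (citing article, cited article)
toy_citations = [("A", "C"), ("B", "C"), ("C", "D"), ("E", "D"), ("E", "C")]
scores = pagerank(toy_citations)
for article, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{article}: {score:.3f}")
```

Articles that are cited by well-cited articles end up with the highest scores, which is the "transitive influence" described above.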

We can run PageRank over the whole graph to find out the most influential article in terms of citations:

In [None]:
query = """
CALL gds.pageRank.write({
  nodeProjection: "Article",
  relationshipProjection: "CITED",
  writeProperty: "pagerank"
})
"""

with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query)]

pd.DataFrame(rows)

This query stores a 'pagerank' property on each node. We can write the following query to view the most influential articles:

In [None]:
query = """
MATCH (a:Article) 
RETURN a.title as article, a.pagerank as score 
ORDER BY score DESC 
LIMIT 10 
""" 

with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query)]

pd.DataFrame(rows)

## Personalized PageRank

Personalized PageRank is a variant of PageRank that allows us to find influential nodes based on a set of source nodes.

For example, rather than finding the overall most influential articles, we could instead find the most influential articles with respect to a given author.

In [None]:
query = """
MATCH (a:Author {name: $author})<-[:AUTHOR]-(article)-[:CITED]->(other)
WITH collect(article) + collect(other) AS sourceNodes
CALL gds.pageRank.stream({
  nodeProjection: 'Article', 
  relationshipProjection: 'CITED', 
  sourceNodes: sourceNodes
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).title AS article, score
ORDER BY score DESC
LIMIT 10
"""

author_name = "Peter G. Neumann"
with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query, {"author": author_name})]

pd.DataFrame(rows)
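
As a rough illustration of what "based on a set of source nodes" means, the sketch below changes plain PageRank in one way: the random jump always lands on a source node instead of on any node. This is a simplified model under assumed parameters, not the GDS implementation. The function name `personalized_pagerank` and the toy graph are hypothetical.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=50):
    """PageRank where the random jump always lands on one of the source nodes."""
    nodes = sorted({n for edge in edges for n in edge} | set(source_nodes))
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    sources = sorted(set(source_nodes))
    # Jump probability: spread evenly over the source nodes, zero elsewhere.
    jump = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * jump[n] for n in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: send its rank back to the source nodes.
                for s in sources:
                    new_rank[s] += damping * rank[node] / len(sources)
        rank = new_rank
    return rank


# Hypothetical toy citation graph with two disconnected citation chains
toy_citations = [("A", "C"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G")]
scores = personalized_pagerank(toy_citations, source_nodes=["A"])
for article, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{article}: {score:.3f}")
```

Articles that cannot be reached from the source node keep a score of zero, so the ranking reflects only the part of the citation graph around the chosen author.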

## Topic Sensitive Search

We can also use Personalized PageRank to do 'Topic Specific PageRank'.

When an author is searching for articles to read, they want the results to reflect their own interests. Two authors using the same search term would expect to see different results depending on their area of research.

We'll start by creating a full text search index on the 'title' and 'abstract' properties of all nodes that have the label 'Article':

In [None]:
query = """
CALL db.index.fulltext.createNodeIndex('articles', ['Article'], ['title', 'abstract'])
"""

with driver.session() as session:
    session.run(query).consume()

We can check that our full text index has been created by running the following query:

In [None]:
query = """
CALL db.indexes()
YIELD description, indexName, tokenNames, properties, state, type, progress
WHERE type = "node_fulltext"
RETURN *
"""
with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query)]
pd.DataFrame(rows)

We can search the full text index like this:

In [None]:
query = """
CALL db.index.fulltext.queryNodes("articles", "open source")
YIELD node, score
RETURN node.title, score, [(author)<-[:AUTHOR]-(node) | author.name] AS authors
LIMIT 10
"""
with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query)]
pd.DataFrame(rows)

We can write the following query to find the authors that have published the most articles on 'open source':

In [None]:
query = """
CALL db.index.fulltext.queryNodes("articles", "open source")
YIELD node, score
MATCH (node)-[:AUTHOR]->(author)
RETURN author.name, sum(score) AS totalScore, collect(node.title) AS articles
ORDER BY totalScore DESC
LIMIT 20
"""
with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query)]
pd.DataFrame(rows)

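We can now combine full text search and Personalized PageRank to find interesting articles for different authors. The query below runs PageRank only over articles that match the search term, using the author's own articles as source nodes: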

In [None]:
query = """
MATCH (a:Article)-[:AUTHOR]->(author:Author)
WHERE author.name=$authorName
WITH author, collect(a) as articles
CALL gds.pageRank.stream({
  nodeQuery: 'CALL db.index.fulltext.queryNodes("articles", $searchTerm)
              YIELD node, score
              RETURN id(node) as id',
  relationshipQuery: 'MATCH (a1:Article)-[:CITED]->(a2:Article) 
                      RETURN id(a1) as source,id(a2) as target', 
  sourceNodes: articles, 
  validateRelationships: false,
  parameters: {searchTerm: $searchTerm}
})
YIELD nodeId, score
WITH author, gds.util.asNode(nodeId) AS n, score
WHERE not(exists((author)<-[:AUTHOR]-(n)))
RETURN n.title AS article, score, [(n)-[:AUTHOR]->(writer) | writer.name][..5] AS authors
ORDER BY score DESC
LIMIT 10
"""

params = {"authorName": "Tao Xie", "searchTerm": "open source"}

with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query, params)]
pd.DataFrame(rows)

In [None]:
params = {"authorName": "Margus Veanes", "searchTerm": "open source"}
with driver.session() as session:
    rows = [dict(record) for record in session.run(query, params)]
pd.DataFrame(rows)

Let's try the same query with a different author:

In [None]:
params = {"authorName": "Marco Aurélio Gerosa", "searchTerm": "open source"}

with driver.session() as session:
    # Collect the records while the session is still open
    rows = [dict(record) for record in session.run(query, params)]
pd.DataFrame(rows)
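
The `nodeQuery` and `relationshipQuery` used above restrict the graph to articles that match the search term. As far as we understand it, setting `validateRelationships: false` means citations whose endpoints fall outside the node query are skipped rather than raising an error. The sketch below mimics that filtering step in plain Python. The function name `topic_subgraph` and the toy data are hypothetical.

```python
def topic_subgraph(edges, matching_nodes):
    """Keep only the citations where both articles matched the search term."""
    matching = set(matching_nodes)
    return [(source, target) for source, target in edges
            if source in matching and target in matching]


# Hypothetical toy data: citations between articles, and the search hits
citations = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
search_hits = {"A", "C", "D"}

filtered = topic_subgraph(citations, search_hits)
print(filtered)
```

Personalized PageRank then runs over this smaller graph, so only on-topic articles can receive a score.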