
# Recommendations: Part 1

In this notebook we're going to learn how to make recommendations using Neo4j. As with the other notebooks, let's start by setting up our environment.

Let's import the libraries we need:

In [1]:
from neo4j import GraphDatabase
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)


Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [2]:
driver = GraphDatabase.driver("bolt://data-science-training-neo4j", auth=("neo4j", "admin"))        
print(driver.address)

Address(host='data-science-training-neo4j', port=7687)
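
Depending on the driver version, a query result may no longer be readable once its session has closed. A minimal sketch of a helper that reads every record inside the `with` block is shown below; the name `run_query` is made up for illustration and is not part of the Neo4j driver:

```python
def run_query(driver, query, params=None):
    """Run a Cypher query and return its rows as a list of dicts.

    `driver` is assumed to be the object returned by GraphDatabase.driver(...).
    """
    with driver.session() as session:
        result = session.run(query, params or {})
        # Materialise the records while the session is still open.
        return [dict(record) for record in result]
```

The returned list can be passed straight to `pd.DataFrame(...)`, for example `pd.DataFrame(run_query(driver, "MATCH (a:Author) RETURN a.name LIMIT 5"))`.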


## Finding popular authors

Since we're going to make collaborator suggestions, let's first find the authors who have written the most articles so that we have some data to work with.

In [3]:
popular_authors_query = """
MATCH (author:Author)
RETURN author.name, size((author)<-[:AUTHOR]-()) AS articlesPublished
ORDER BY articlesPublished DESC
LIMIT 10
"""

with driver.session() as session:
    result = session.run(popular_authors_query)
    # Build the DataFrame while the session is still open
    popular_authors = pd.DataFrame([dict(record) for record in result])

popular_authors

| | articlesPublished | author.name |
|---|---|---|
| 0 | 89 | Peter G. Neumann |
| 1 | 80 | Peter J. Denning |
| 2 | 72 | Moshe Y. Vardi |
| 3 | 71 | Pamela Samuelson |
| 4 | 65 | Bart Preneel |
| 5 | 56 | Vinton G. Cerf |
| 6 | 53 | Barry W. Boehm |
| 7 | 49 | Mark Guzdial |
| 8 | 47 | Edwin R. Hancock |
| 9 | 46 | Josef Kittler |


Let's pick one of these authors...

In [7]:
author_name = "Peter J. Denning"

Now let's look at which articles they've published and how many citations each has received:

In [8]:
author_articles_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)
RETURN article.title AS article, article.year AS year, size((article)<-[:CITED]-()) AS citations
ORDER BY citations DESC
LIMIT 20
"""

with driver.session() as session:
    result = session.run(author_articles_query, {"authorName": author_name})
    # Build the DataFrame while the session is still open
    author_articles = pd.DataFrame([dict(record) for record in result])

author_articles

| | article | citations | year |
|---|---|---|---|
| 0 | Computing as a discipline | 18 | 1989 |
| 1 | Certification of programs for secure information flow | 11 | 1977 |
| 2 | Educating a new engineer | 10 | 1992 |
| 3 | Is computer science science | 7 | 2005 |
| 4 | Properties of the working-set model | 6 | 1972 |
| 5 | Computing is a natural science | 6 | 2007 |
| 6 | A debate on teaching computing science | 5 | 1989 |
| 7 | Recentering computer science | 4 | 2005 |
| 8 | The social life of innovation | 4 | 2004 |
| 9 | The profession of IT Beyond computational thinking | 4 | 2009 |


Next, let's find the author's collaborators:

In [9]:
collaborations_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor)
RETURN coauthor.name AS coauthor, count(*) AS collaborations
ORDER BY collaborations DESC
LIMIT 10
"""

with driver.session() as session:
    result = session.run(collaborations_query, {"authorName": author_name})
    # Build the DataFrame while the session is still open
    collaborations = pd.DataFrame([dict(record) for record in result])

collaborations

| | coauthor | collaborations |
|---|---|---|
| 0 | Robert Dunham | 3 |
| 1 | Dorothy E. Denning | 3 |
| 2 | Nicholas Dew | 2 |
| 3 | Rick Hayes-Roth | 2 |
| 4 | David H. Brandin | 2 |
| 5 | Jim Horning | 1 |
| 6 | Lauren Weinstein | 1 |
| 7 | Ted G. Lewis | 1 |
| 8 | David Lorge Parnas | 1 |
| 9 | Jack B. Dennis | 1 |


How would we suggest some future collaborators for this author? One way is by looking at the collaborators of their collaborators!

In [10]:
collaborations_query = """
MATCH (author:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor),
      (coauthor)<-[:AUTHOR]-()-[:AUTHOR]->(coc)
WHERE not((coc)<-[:AUTHOR]-()-[:AUTHOR]->(author)) AND coc <> author      
RETURN coc.name AS coauthor, count(*) AS collaborations
ORDER BY collaborations DESC
LIMIT 10
"""

with driver.session() as session:
    result = session.run(collaborations_query, {"authorName": author_name})
    # Build the DataFrame while the session is still open
    suggestions = pd.DataFrame([dict(record) for record in result])

suggestions

| | coauthor | collaborations |
|---|---|---|
| 0 | Whitfield Diffie | 6 |
| 1 | Susan Landau | 6 |
| 2 | Anthony G. Lauck | 4 |
| 3 | Clinton C. Brooks | 4 |
| 4 | Scott Charney | 4 |
| 5 | David Sobel | 4 |
| 6 | Douglas Miller | 4 |
| 7 | Stephen T. Kent | 4 |
| 8 | Jeffrey G. Long | 3 |
| 9 | Dennis K. Branstad | 3 |
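
To make the logic of the collaborators-of-collaborators query more concrete, here is a rough pure-Python sketch of the same idea on a tiny in-memory dataset. The article IDs, author names, and the function name `suggest_collaborators` are all invented for illustration; like `count(*)` in the Cypher query, the score counts connecting paths rather than distinct articles:

```python
from collections import Counter

# Hypothetical toy data: article id -> set of author names.
articles = {
    "A1": {"Ann", "Bob"},
    "A2": {"Ann", "Cat"},
    "A3": {"Bob", "Dan"},
    "A4": {"Bob", "Dan", "Eve"},
    "A5": {"Cat", "Eve"},
    "A6": {"Cat", "Ann", "Fay"},
}

def suggest_collaborators(articles, author, limit=10):
    # Everyone the author has already published with.
    direct = set()
    for authors in articles.values():
        if author in authors:
            direct |= authors - {author}

    counts = Counter()
    for authors in articles.values():
        if author not in authors:
            continue
        # First hop: a coauthor on one of the author's articles.
        for coauthor in authors - {author}:
            # Second hop: people that coauthor has published with.
            for other_authors in articles.values():
                if coauthor not in other_authors:
                    continue
                for coc in other_authors - {coauthor}:
                    # Skip the author and anyone they already work with.
                    if coc != author and coc not in direct:
                        counts[coc] += 1
    return counts.most_common(limit)

print(suggest_collaborators(articles, "Ann"))
# [('Eve', 3), ('Dan', 2)]
```

Here Fay is excluded because she has already coauthored with Ann, while Eve scores highest because she is reachable through both Bob and Cat.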


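Let's wrap the Cypher query in a function so that we can reuse it for other authors. This version also returns the names of the coauthors who connect the author to each suggestion:

In [20]: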
def find_collaborators(author_name):
    collaborations_query = """
    MATCH (author:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor),
          (coauthor)<-[:AUTHOR]-()-[:AUTHOR]->(coc)
    WHERE not((coc)<-[:AUTHOR]-()-[:AUTHOR]->(author)) AND coc <> author      
    RETURN coc.name AS coauthor, count(*) AS collaborations, collect(DISTINCT coauthor.name) AS coauthors
    ORDER BY collaborations DESC
    LIMIT 10
    """

    with driver.session() as session:
        result = session.run(collaborations_query, {"authorName": author_name})
        # Build the DataFrame while the session is still open
        return pd.DataFrame([dict(record) for record in result])

In [21]:
df = find_collaborators("John Ioannidis")
df[df.collaborations > 5]

| | coauthor | coauthors | collaborations |
|---|---|---|---|
| 0 | Susan Landau | [Matt Blaze, Steven Michael Bellovin] | 16 |
| 1 | Kostas G. Anagnostakis | [Angelos D. Keromytis, Sotiris Ioannidis] | 14 |
| 2 | Whitfield Diffie | [Matt Blaze, Steven Michael Bellovin] | 11 |
| 3 | Peter G. Neumann | [Matt Blaze, Steven Michael Bellovin] | 10 |
| 4 | Ross J. Anderson | [Matt Blaze, Steven Michael Bellovin, Virgil D. Gligor] | 10 |
| 5 | Michael B. Greenwald | [Angelos D. Keromytis, Sotiris Ioannidis] | 7 |
| 6 | Jonathan M. Smith | [Angelos D. Keromytis, Sotiris Ioannidis] | 7 |
| 7 | Jennifer Rexford | [Matt Blaze, Steven Michael Bellovin] | 6 |
| 8 | Michael Hicks | [Angelos D. Keromytis, Sotiris Ioannidis] | 6 |


Each of these people has collaborated with someone that John Ioannidis has worked with before, so the shared contacts listed in the `coauthors` column might be able to make an introduction.
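
To see who could make each introduction, the `coauthors` lists can be inverted so that every shared contact maps to the people they could introduce. A minimal sketch follows, using three rows copied from the output above as plain dictionaries; the function name `introductions_by_contact` is made up for illustration:

```python
from collections import defaultdict

# Three rows copied from the output above:
# "coauthor" is the suggested person, "coauthors" are the shared contacts.
rows = [
    {"coauthor": "Susan Landau",
     "coauthors": ["Matt Blaze", "Steven Michael Bellovin"],
     "collaborations": 16},
    {"coauthor": "Kostas G. Anagnostakis",
     "coauthors": ["Angelos D. Keromytis", "Sotiris Ioannidis"],
     "collaborations": 14},
    {"coauthor": "Whitfield Diffie",
     "coauthors": ["Matt Blaze", "Steven Michael Bellovin"],
     "collaborations": 11},
]

def introductions_by_contact(rows):
    """Map each shared contact to the suggested people they could introduce."""
    by_contact = defaultdict(list)
    for row in rows:
        for contact in row["coauthors"]:
            by_contact[contact].append(row["coauthor"])
    return dict(by_contact)

for contact, people in introductions_by_contact(rows).items():
    print(contact, "->", people)
```

With the DataFrame returned by `find_collaborators`, `df.to_dict("records")` should produce rows in this shape.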


## Exercise

* Can you find the top 20 suggested collaborators for 'Brian Fitzgerald' or 'Peter G. Neumann' instead of 'John Ioannidis'?
* How many of these potential collaborators have collaborated with Brian's collaborators more than 3 times?
