<a href="https://colab.research.google.com/github/mneedham/data-science-training/blob/master/03_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendations: Part 1

In this notebook we're going to learn how to make recommendations using Neo4j. As with the other notebooks, let's start by setting up our environment.

First, let's import the libraries we need:

In [1]:
from neo4j import GraphDatabase
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)


Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [2]:
driver = GraphDatabase.driver("bolt://data-science-training-neo4j", auth=("neo4j", "admin"))        
print(driver.address)

Address(host='data-science-training-neo4j', port=7687)


## Finding popular authors

Since we're going to make collaborator suggestions, let's first find the authors who have written the most articles so that we have some data to work with.

In [3]:
popular_authors_query = """
MATCH (author:Author)
RETURN author.name, size((author)<-[:AUTHOR]-()) AS articlesPublished
ORDER BY articlesPublished DESC
LIMIT 10
"""

with driver.session() as session:
    result = session.run(popular_authors_query)
    # Read the records while the session is still open
    records = [dict(record) for record in result]

pd.DataFrame(records)

| | articlesPublished | author.name |
|---|---|---|
| 0 | 89 | Peter G. Neumann |
| 1 | 80 | Peter J. Denning |
| 2 | 72 | Moshe Y. Vardi |
| 3 | 71 | Pamela Samuelson |
| 4 | 65 | Bart Preneel |
| 5 | 56 | Vinton G. Cerf |
| 6 | 53 | Barry W. Boehm |
| 7 | 49 | Mark Guzdial |
| 8 | 47 | Edwin R. Hancock |
| 9 | 46 | Josef Kittler |
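The cells in this notebook all follow the same pattern: open a session, run a query, and turn the records into rows. As a minimal sketch, that pattern could be wrapped in a small helper. The name `run_query` is hypothetical (it is not part of the Neo4j driver), and the helper only assumes that `driver` exposes `session()` and that each record can be passed to `dict()`, as in the cell above. Reading the records inside the `with` block matters because, as far as I know, recent versions of the Python driver don't allow a result to be read once its session has closed.

```python
def run_query(driver, query, params=None):
    """Run a Cypher query and return its rows as a list of plain dicts.

    Hypothetical helper, not part of the neo4j package.
    """
    with driver.session() as session:
        result = session.run(query, params or {})
        # Materialise every record before the session closes
        return [dict(record) for record in result]
```

With a helper like this, the query above would become `pd.DataFrame(run_query(driver, popular_authors_query))`.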


Let's pick one of these authors...

In [8]:
author_name = "Josef Kittler"

And let's have a look at which articles they've published and how many citations each one has received:

In [9]:
author_articles_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)
RETURN article.title AS article, article.year AS year, size((article)<-[:CITED]-()) AS citations
ORDER BY citations DESC
LIMIT 20
"""

with driver.session() as session:
    result = session.run(author_articles_query, {"authorName": author_name})
    # Read the records while the session is still open
    records = [dict(record) for record in result]
pd.DataFrame(records)

| | article | citations | year |
|---|---|---|---|
| 0 | The BANCA database and evaluation protocol | 11 | 2003 |
| 1 | An Experimental Comparison of Classifier Fusion Rules for Multimodal Personal Identity Verificat... | 4 | 2002 |
| 2 | A Framework for Classifier Fusion: Is It Still Needed? | 4 | 2000 |
| 3 | A comparative study of automatic face verification algorithms on the BANCA database | 3 | 2003 |
| 4 | Classifier Combination as a Tomographic Process | 3 | 2001 |
| 5 | Combined Classifier Optimisation via Feature Selection | 3 | 2000 |
| 6 | Information Analysis of Multiple Classifier Fusion | 2 | 2001 |
| 7 | Reliable Classification of Chrysanthemum Leaves through Curvature Scale Space | 2 | 1997 |
| 8 | Face verification competition on the XM2VTS database | 2 | 2003 |
| 9 | Face Detection by Learned Affine Correspondences | 2 | 2002 |


Next, let's find the author's collaborators:

In [12]:
collaborations_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor)
RETURN coauthor.name AS coauthor, count(*) AS collaborations, collect(article.title)
ORDER BY collaborations DESC
LIMIT 10
"""

with driver.session() as session:
    result = session.run(collaborations_query, {"authorName": author_name})
    # Read the records while the session is still open
    records = [dict(record) for record in result]
pd.DataFrame(records)

| | coauthor | collaborations | collect(article.title) |
|---|---|---|---|
| 0 | Kieron Messer | 10 | [Photometric normalisation for face verification, Model Validation for Model Selection, Adaptive... |
| 1 | David Windridge | 9 | [Serial multiple classifier systems exploiting a coarse to fine output coding, The practical per... |
| 2 | Mohammad T. Sadeghi | 8 | [Approximate Gradient Direction Metric for Face Authentication, Modified Predictive Validation T... |
| 3 | William J. Christmas | 5 | [Building classifier ensembles for automatic sports classification, Fusion of Multiple Cue Detec... |
| 4 | Miroslav Hamouz | 4 | [Face authentication competition on the BANCA database, Face Detection by Learned Affine Corresp... |
| 5 | Luc Vandendorpe | 4 | [Face authentication competition on the BANCA database, Decision Level Fusion of Intramodal Pers... |
| 6 | Jacek Czyz | 4 | [Face authentication competition on the BANCA database, Decision Level Fusion of Intramodal Pers... |
| 7 | Alireza Ahmadyfard | 3 | [Serial multiple classifier systems exploiting a coarse to fine output coding, On Matching Algor... |
| 8 | Petr Somol | 3 | [Combining Multiple Classifiers in Probabilistic Neural Networks, Information Analysis of Multip... |
| 9 | Edward Jaser | 3 | [Building classifier ensembles for automatic sports classification, Fusion of Multiple Cue Detec... |


How would we suggest some future collaborators for this author? One way is by looking at the collaborators of their collaborators!

In [13]:
collaborations_query = """
MATCH (author:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor),
      (coauthor)<-[:AUTHOR]-()-[:AUTHOR]->(coc)
WHERE not((coc)<-[:AUTHOR]-()-[:AUTHOR]->(author)) AND coc <> author      
RETURN coc.name AS coauthor, count(*) AS collaborations
ORDER BY collaborations DESC
LIMIT 10
"""

with driver.session() as session:
    result = session.run(collaborations_query, {"authorName": author_name})
    # Read the records while the session is still open
    records = [dict(record) for record in result]
pd.DataFrame(records)

| | coauthor | collaborations |
|---|---|---|
| 0 | Elzbieta Pekalska | 24 |
| 1 | Giorgio Giacinto | 23 |
| 2 | Marina Skurichina | 22 |
| 3 | David M. J. Tax | 18 |
| 4 | Gian Luca Marcialis | 18 |
| 5 | Pavel Paclík | 18 |
| 6 | Jana Novovicová | 15 |
| 7 | Serguei Verzakov | 10 |
| 8 | Stéphane Pigeon | 8 |
| 9 | Thomas C. W. Landgrebe | 8 |
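To get a feel for what the `collaborations` number means here, it can help to replay the same idea on a tiny in-memory graph. The sketch below is plain Python on made-up data (the article ids and author names are hypothetical, as is the function name `suggest_collaborators`). It counts one hit for every path author → article → coauthor → other article → candidate, and drops candidates who have already written with the author. Skipping the first article on the second hop is my reading of how Cypher avoids reusing a relationship within one `MATCH`, so treat it as an approximation of the query rather than an exact equivalent.

```python
from collections import Counter

# Hypothetical toy data: article id -> set of author names
ARTICLES = {
    "A1": {"Alice", "Bob"},
    "A2": {"Alice", "Carol"},
    "A3": {"Bob", "Dave"},
    "A4": {"Bob", "Dave", "Erin"},
    "A5": {"Carol", "Frank"},
    "A6": {"Bob", "Carol"},
}

def suggest_collaborators(articles, author, limit=10):
    """Rank collaborators-of-collaborators by the number of connecting paths."""
    own_articles = {aid for aid, names in articles.items() if author in names}
    # Everyone the author has already written with
    direct = set().union(*(articles[aid] for aid in own_articles)) - {author}

    paths = Counter()
    for first in own_articles:
        for coauthor in articles[first] - {author}:
            for second, names in articles.items():
                # Second hop: a different article written by the coauthor
                if second == first or coauthor not in names:
                    continue
                # Keep only people the author has not worked with yet
                for candidate in names - direct - {author, coauthor}:
                    paths[candidate] += 1

    # Most paths first, ties broken alphabetically
    ranked = sorted(paths.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]

suggestions = suggest_collaborators(ARTICLES, "Alice")
print(suggestions)  # [('Dave', 2), ('Erin', 1), ('Frank', 1)]
```

Dave scores 2 because Bob wrote two articles with him, which shows that the count reflects the number of connecting paths rather than the number of distinct people or articles.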


In [14]:
def find_collaborators(author_name):
    collaborations_query = """
    MATCH (author:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor),
          (coauthor)<-[:AUTHOR]-()-[:AUTHOR]->(coc)
    WHERE not((coc)<-[:AUTHOR]-()-[:AUTHOR]->(author)) AND coc <> author      
    RETURN coc.name AS coauthor, count(*) AS collaborations, collect(DISTINCT coauthor.name) AS coauthors
    ORDER BY collaborations DESC
    LIMIT 10
    """

    with driver.session() as session:
        result = session.run(collaborations_query, {"authorName": author_name})
        # Read the records while the session is still open
        records = [dict(record) for record in result]
    return pd.DataFrame(records)

In [20]:
df = find_collaborators("Stéphane Pigeon")
df[df.collaborations > 5]

| | coauthor | coauthors | collaborations |
|---|---|---|---|
| 0 | Jacek Czyz | [Luc Vandendorpe] | 10 |
| 1 | Josef Kittler | [Luc Vandendorpe] | 8 |
| 2 | Samy Bengio | [Luc Vandendorpe] | 6 |
| 3 | Mohammad T. Sadeghi | [Luc Vandendorpe] | 6 |


Each of these people has collaborated with someone that Stéphane has worked with before (the `coauthors` column shows who), so that person might be able to make an introduction.
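The `coauthors` column tells us who could make each introduction. As a rough sketch, the rows can be regrouped by introducer so that we can see whom each existing collaborator could introduce. The rows below are copied from the output above; the function name `introductions` and its `min_collaborations` parameter are hypothetical.

```python
from collections import defaultdict

# Rows copied from the output above
rows = [
    {"coauthor": "Jacek Czyz", "coauthors": ["Luc Vandendorpe"], "collaborations": 10},
    {"coauthor": "Josef Kittler", "coauthors": ["Luc Vandendorpe"], "collaborations": 8},
    {"coauthor": "Samy Bengio", "coauthors": ["Luc Vandendorpe"], "collaborations": 6},
    {"coauthor": "Mohammad T. Sadeghi", "coauthors": ["Luc Vandendorpe"], "collaborations": 6},
]

def introductions(rows, min_collaborations=0):
    """Map each existing collaborator to the people they could introduce."""
    by_introducer = defaultdict(list)
    for row in rows:
        # Same kind of threshold as df[df.collaborations > 5]
        if row["collaborations"] > min_collaborations:
            for introducer in row["coauthors"]:
                by_introducer[introducer].append(row["coauthor"])
    return dict(by_introducer)

intros = introductions(rows, min_collaborations=5)
print(intros)
```

Starting from the DataFrame instead of a hand-written list, `df.to_dict("records")` should produce rows in this shape.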


## Exercise

* Can you find the top 20 suggested collaborators for 'Brian Fitzgerald' or 'Peter G. Neumann' instead of 'Stéphane Pigeon'?
* How many of these potential collaborators have collaborated with Brian's collaborators more than 3 times?
