# GDMA Project
Author: Julian Schelb (1069967)

In [1]:
from neo4j import GraphDatabase
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Connection to the database instance

In [2]:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "subatomic-shrank-Respond"))
database_name = "cddb"
session = driver.session(database = database_name)

### Task 4: Searching and Ranking

Implement a simple search engine that enables search by artist, album and
song name/title. The results must be ranked based on importance. It is up to
you to come up with how the importance of each result is computed and you
must justify your decision (it goes without saying that you need to come up
with a meaningful definition). However, the importance should ideally take into
account user preferences/likes. As such, this task is split in two parts:

**1.** Write a Cypher query that adds a relationship :LIKES between a node with
label :User and an artist, album, or song. Every user should be identified
just by a numerical userID (no more information is necessary). If a user
already exists in the system, no additional node should be added. After
coming up with the necessary Cypher query, add a significant number of
users and likes.

**2.** Implement a simple Python function that has the following arguments:

- the userID of the user submitting the search (the user ID may not
  exist in the database),
- a string that contains one or more keywords for the search, and
- an optional argument that indicates whether the search is on all or
  a specific field, i.e., artist, album, song.

The search must return exactly 10 results.

Python must only be used to call the database. You should not write any
code in Python that implements functionality necessary for the task. However,
submitting multiple queries in the same function call is allowed. Also, for this
task of the project, you are not only allowed but also encouraged to use functions
from the GDS library of Neo4j. Hence, before making any decisions, have a
careful look at the available functions. Again, you have to justify the use of any
function that you employ.
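A minimal sketch of the :LIKES statement from part 1, assuming the property names used elsewhere in this notebook (`artist`, `album`, `song`, and `id` on :User). The helper names `build_like_query` and `add_like` are hypothetical. `MERGE` on the user node means an existing user is reused rather than duplicated.

```python
# Name property per likeable label, as used in the queries of this notebook.
LIKEABLE = {"Artist": "artist", "Album": "album", "Song": "song"}


def build_like_query(label):
    """Return a Cypher statement that links a user to nodes of one label."""
    if label not in LIKEABLE:
        raise ValueError(f"unsupported label: {label}")
    prop = LIKEABLE[label]
    # Labels cannot be passed as Cypher parameters, so the whitelisted label
    # is interpolated; the user id and the name stay query parameters.
    return (
        f"MATCH (n:{label} {{{prop}: $name}}) "
        "MERGE (u:User {id: $user_id}) "
        "MERGE (u)-[:LIKES]->(n)"
    )


def add_like(session, user_id, label, name):
    """Hypothetical helper: run the statement on an open neo4j session."""
    return session.run(build_like_query(label), user_id=user_id, name=name)
```

If no node matches the given name, the statement creates neither a user nor a relationship, which may or may not be the desired behaviour.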

In [3]:
query = """
MATCH (c:CD)-[r:CONTAINS]->(ar:Artist)
WHERE c.ayear = 2000
RETURN DISTINCT ar.artist
"""
        
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

Unnamed: 0,ar.artist
0,the frequency benders
1,boyz ii men
2,adriana calcanhoto
3,syl johnson
4,cheo feliciano
...,...
5619,garry harrison
5620,catuaba com amendoim
5621,jimmy powells
5622,ofra haza


#### Processing

**Ideas:**

Implicit measures (to avoid the cold-start problem):
- Centrality in the subgraph of artists, albums, and songs that match the search input (exact and partial matches), possibly extended by liked nodes:
    1) Select the subgraph by filtering artists, albums, and songs by the input
    2) Calculate centrality (multiple measures)
    3) A CD with higher centrality is considered better
    

User preference:

- Use Node2Vec embeddings to compute similarity to already liked nodes:
    1) Include nodes similar to already liked ones
    2) Average similarity of artist/album/song nodes compared to previous likes
- Use the average distance to liked nodes (basically already included)
- Centrality in the subgraph of liked nodes (basically already included)

Output:

CD id, contained songs, contained artists, contained albums, centralities, cumulated scores
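To make the "average similarity to previous likes" idea concrete, here is a small illustration of the arithmetic, assuming each node already has an embedding vector. It is only meant to show what the score would mean; in the project itself the computation would have to happen inside the database, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def preference_score(candidate, liked):
    """Average similarity of a candidate embedding to all liked embeddings."""
    if not liked:
        # Cold start: a user without likes contributes no preference signal.
        return 0.0
    return sum(cosine(candidate, vec) for vec in liked) / len(liked)


# Toy 2-dimensional embeddings, made up for illustration only.
print(preference_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```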

#### Adding Example Users

**Function to create Example User:**

In [9]:
def createExampleUser(user_id = 1, genre = "rock", limit = 50):
    
    ####### DELETE USER NODE #######
    query1 = """
    MATCH (u:User)
    WHERE u.id = $user_id
    DETACH DELETE u
    """
    
    ####### CREATE USER NODE #######
    query2 = """
    MERGE (u:User {id:  $user_id})
    RETURN u.id as user_id
    """
    
    ####### LINK TO LIKED SONGS #######
    query3 = """
    MATCH (g:Genre)<-[r:BELONGS_TO]-(c:CD)
    MATCH (c)-[r2:CONTAINS]->(s:Song)
    WHERE g.genre = $genre
    WITH s 
    LIMIT $limit
    MATCH (u:User)
    WHERE u.id = $user_id
    MERGE (u)-[r3:LIKES]->(s)
    """
    
    ####### LINK TO LIKED ARTISTS #######
    query4 = """
    MATCH (g:Genre)<-[r:BELONGS_TO]-(c:CD)
    MATCH (c)-[r2:CONTAINS]->(t:Artist)
    WHERE g.genre = $genre
    WITH t 
    LIMIT $limit
    MATCH (u:User)
    WHERE u.id = $user_id
    MERGE (u)-[r3:LIKES]->(t)
    """
    
    ####### LINK TO LIKED ALBUMS #######
    query5 = """
    MATCH (g:Genre)<-[r:BELONGS_TO]-(c:CD)
    MATCH (c)-[r2:CONTAINS]->(t:Album)
    WHERE g.genre = $genre
    WITH t 
    LIMIT $limit
    MATCH (u:User)
    WHERE u.id = $user_id
    MERGE (u)-[r3:LIKES]->(t)
    """

    with driver.session(database = database_name) as session:
        session.run(query1, user_id = user_id)
        session.run(query2, user_id = user_id)
        session.run(query3, user_id = user_id, genre = genre, limit = limit)
        session.run(query4, user_id = user_id, genre = genre, limit = limit)
        session.run(query5, user_id = user_id, genre = genre, limit = limit)
    

**Adding some Likes to model User Preference:**

In [10]:
# User 1 likes "Rock" music
createExampleUser(user_id = 1, genre = "rock", limit = 50)
# User 2 likes "classic" music
createExampleUser(user_id = 2, genre = "classic", limit = 50)
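The task asks for a significant number of users. One possible way to scale the two calls above is sketched here: a single statement that links a user to songs, artists, and albums of a genre, run once per user. This is an assumption-laden variant, not the function defined above: it applies one shared `LIMIT` instead of one per node type, and `plan_users` and `create_users` are hypothetical names.

```python
# One statement for all three node types; MERGE keeps users and likes unique.
LIKE_BY_GENRE = """
MATCH (g:Genre {genre: $genre})<-[:BELONGS_TO]-(:CD)-[:CONTAINS]->(n)
WHERE n:Song OR n:Artist OR n:Album
WITH DISTINCT n LIMIT $limit
MERGE (u:User {id: $user_id})
MERGE (u)-[:LIKES]->(n)
"""


def plan_users(first_id, count, genres):
    """Assign genres round-robin to consecutive user ids."""
    return [(first_id + i, genres[i % len(genres)]) for i in range(count)]


def create_users(session, first_id, count, genres, limit=50):
    """Run the statement once per planned (user id, genre) pair."""
    for user_id, genre in plan_users(first_id, count, genres):
        session.run(LIKE_BY_GENRE, user_id=user_id, genre=genre, limit=limit)


print(plan_users(3, 4, ["rock", "classic"]))
```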

***

#### Implementing Search Engine

##### Feature 1: User Preference

In [36]:
query = """
// DELETE EXISTING PROJECTION
CALL gds.graph.drop('searchdomain_preference', false) 
YIELD graphName 
RETURN graphName
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

Unnamed: 0,graphName
0,searchdomain_preference


In [37]:
query = """
// CREATE NEW PROJECTION WITH SEARCH RELEVANT SUB GRAPH
CALL gds.graph.project.cypher(
  'searchdomain_preference',
  ' // Liked Artists, Albums and Songs
    MATCH (u:User)-[:LIKES]->(n) 
    WHERE u.id = 2 
        AND (n:Song OR n:Album OR n:Artist) 
    RETURN id(n) AS id, labels(n) AS labels 
    LIMIT 1000
    
    UNION
    
    // CDs linked to liked Artists, Albums and Songs
    MATCH (u:User)-[:LIKES]->(x)-[:APPEARED_ON]->(n:CD) 
    WHERE u.id = 2  
    RETURN id(n) AS id, labels(n) AS labels 
    LIMIT 1000',
    
    'MATCH (u:User)-[:LIKES]->(n)
    WHERE u.id = 2 
    AND (n:CD OR n:Song OR n:Album OR n:Artist) 
    MATCH (n)-[r:APPEARED_ON]->(m:CD) 
    RETURN id(n) AS source, id(m) AS target, type(r) AS type 
    LIMIT 1000' 
)
YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

Unnamed: 0,graphName,nodes,rels
0,searchdomain_preference,1135,1000
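The projection above hard-codes `u.id = 2`. A sketch of a reusable variant follows, assuming the `parameters` option of `gds.graph.project.cypher` is used to hand the user id to the inner queries. The function name is hypothetical, and the `LIMIT 1000` clauses are left out for brevity.

```python
# The inner queries are string literals, so their $user_id is resolved through
# the `parameters` map, not directly by the driver.
PROJECT_PREFERENCE = """
CALL gds.graph.project.cypher(
  $graph_name,
  'MATCH (u:User {id: $user_id})-[:LIKES]->(n)
   WHERE n:Song OR n:Album OR n:Artist
   RETURN id(n) AS id, labels(n) AS labels
   UNION
   MATCH (u:User {id: $user_id})-[:LIKES]->()-[:APPEARED_ON]->(n:CD)
   RETURN id(n) AS id, labels(n) AS labels',
  'MATCH (u:User {id: $user_id})-[:LIKES]->(n)-[r:APPEARED_ON]->(m:CD)
   WHERE n:Song OR n:Album OR n:Artist
   RETURN id(n) AS source, id(m) AS target, type(r) AS type',
  {parameters: {user_id: $user_id}}
)
YIELD graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels
"""


def project_preference_graph(session, user_id,
                             graph_name="searchdomain_preference"):
    """Project the liked-nodes subgraph of one user; returns the summary rows."""
    result = session.run(PROJECT_PREFERENCE,
                         graph_name=graph_name, user_id=user_id)
    return [dict(record) for record in result]
```

For a user without any likes the node query returns nothing, so the caller would need to handle that case before running centrality algorithms.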


In [38]:
# https://neo4j.com/docs/graph-data-science/current/algorithms/eigenvector-centrality/
query = """
CALL gds.eigenvector.mutate('searchdomain_preference',  {
  mutateProperty: 'score_eig'
})
YIELD centralityDistribution, nodePropertiesWritten, ranIterations
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,minimumScore,meanScore,nodePropertiesWritten
0,0.001371,0.027647,1135


In [39]:
# https://neo4j.com/docs/graph-data-science/current/algorithms/closeness-centrality/#algorithms-closeness-centrality-examples-mutate
query = """
CALL gds.beta.closeness.mutate('searchdomain_preference',  {
  mutateProperty: 'score_cln'
})
YIELD centralityDistribution, nodePropertiesWritten
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

ClientError: {code: Neo.ClientError.Procedure.ProcedureCallFailed} {message: Failed to invoke procedure `gds.beta.closeness.mutate`: Caused by: java.lang.IllegalArgumentException: Node property `score_cln` already exists in the in-memory graph.}
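The error message says that `score_cln` already exists in the in-memory graph, which happens when the mutate cell is run a second time on the same projection. One way to make the cell re-runnable, sketched here under the assumption that the installed GDS version also offers the stream mode of the same procedure, is to stream the scores instead of writing them. The function name is hypothetical.

```python
# Stream mode returns the scores without adding a property to the projection,
# so the same call can be repeated on an existing in-memory graph.
CLOSENESS_STREAM = """
CALL gds.beta.closeness.stream($graph_name)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS id, score
ORDER BY score DESC
LIMIT $limit
"""


def closeness_scores(session, graph_name, limit=30):
    """Return the top closeness scores of a projected graph as dictionaries."""
    result = session.run(CLOSENESS_STREAM, graph_name=graph_name, limit=limit)
    return [dict(record) for record in result]
```

Alternatively, dropping and re-creating the projection before each run, as done at the start of this section, avoids the conflict as well.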

In [40]:
# https://neo4j.com/docs/graph-data-science/current/algorithms/degree-centrality/
query = """
CALL gds.degree.mutate('searchdomain_preference',  {
  mutateProperty: 'score_deg'
})
YIELD centralityDistribution, nodePropertiesWritten
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,minimumScore,meanScore,nodePropertiesWritten
0,0.0,0.88106,1135


In [42]:
query = """
CALL gds.graph.streamNodeProperties('searchdomain_preference', ['score_eig', 'score_deg', 'score_cln'])
YIELD nodeId, nodeProperty, propertyValue
WITH nodeId, gds.util.asNode(nodeId).id AS id, nodeProperty, propertyValue

ORDER BY nodeId, nodeProperty
WITH  nodeId, gds.util.asNode(nodeId).id as id, 
collect(nodeProperty) as properties , collect(propertyValue) as values, sum(propertyValue) as sum_values

MATCH (n:CD)
WHERE n.id = id
MATCH (n)-[:CONTAINS]->(ar:Artist)
MATCH (n)-[:CONTAINS]->(ab:Album)
MATCH (n)-[:CONTAINS]->(so:Song)
RETURN id, 
values, sum_values, 
collect(DISTINCT ar.artist) as artists,
collect(DISTINCT ab.album) as albums, 
collect(DISTINCT so.song) as songs
ORDER BY sum_values DESC
LIMIT 100
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,id,values,sum_values,artists,albums,songs
0,38495,"[0.0, 68.0, 0.0013705075147236233]",68.001371,[gaetano donizetti],[anna bolena],[sinfonia overture]
1,39838,"[0.0, 48.0, 0.0013705075147236233]",48.001371,[ravel],[orchestral works (seiji ozawa - boston sympho...,[boléro]
2,40815,"[0.0, 37.0, 0.0013705075147236233]",37.001371,[schnebel],[chamber music],"[lamah, quintessenz, vier stuecke, oktett mit ..."
3,39931,"[0.0, 14.0, 0.0013705075147236233]",14.001371,[adam zwierz],[recital],"[przeciez bylo to co bylo, ej, dogonie wiatr, ..."
4,54725,"[0.0, 5.0, 0.0013705075147236233]",5.001371,[maurice ravel],[the best of ravel],"[la valse, poème choréographique, valses noble..."
5,130511,"[0.0, 4.0, 0.0013705075147236233]",4.001371,[bert kaemphert],[instrumental collection],"[sweet caroline, caravan, hold me, blue midnig..."
6,125099,"[0.0, 3.0, 0.0013705075147236233]",3.001371,[tchaikovsky],[sleeping beauty],"[act , sc 1 blindman's bluff, prologue dance s..."
7,47493,"[0.0, 3.0, 0.0013705075147236233]",3.001371,[david garrett],[pure ecstasy],"[ale. allegro vivacissimo, e espressivo, onett..."
8,40920,"[0.0, 2.0, 0.0013705075147236233]",2.001371,[lorenzo perosi],[il natale del redentore],"[l'annunciazione, il natale]"
9,113113,"[0.0, 2.0, 0.0013705075147236233]",2.001371,[bach johann sebastian],[kunst der fuge - erich bergel],"[14 canone alla duodecima, 15 mirror canon wit..."
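In the output above the three values per row are ordered by property name (`score_cln`, `score_deg`, `score_eig`), and `sum_values` is dominated by the degree score because the measures live on very different scales. A possible remedy, shown here only as an illustration of the arithmetic, is to min-max scale each measure before summing. The rows are taken from the first three lines of the table (eigenvector value rounded); the function names are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; constant input maps to 0.0 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(rows, measures):
    """Sum the min-max scaled measures per id."""
    ids = list(rows)
    totals = {i: 0.0 for i in ids}
    for measure in measures:
        scaled = min_max([rows[i][measure] for i in ids])
        for i, value in zip(ids, scaled):
            totals[i] += value
    return totals


rows = {
    38495: {"score_cln": 0.0, "score_deg": 68.0, "score_eig": 0.00137},
    39838: {"score_cln": 0.0, "score_deg": 48.0, "score_eig": 0.00137},
    40815: {"score_cln": 0.0, "score_deg": 37.0, "score_eig": 0.00137},
}
totals = combined_scores(rows, ["score_cln", "score_deg", "score_eig"])
print(totals)
```

With these three rows only the degree varies, so the ranking is unchanged, but each measure can now contribute at most 1.0 to the total.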


In [31]:
query = """
CALL gds.graph.streamNodeProperty('searchdomain_preference', 'score_eig')
YIELD nodeId as nodeId_eig, propertyValue as score_eig

//CALL gds.graph.streamNodeProperty('searchdomain', 'score_deg')
//YIELD nodeId as nodeId_deg, propertyValue as propertyValue_deg

//CALL gds.graph.streamNodeProperty('searchdomain_preference', 'score_cln')
//YIELD nodeId as nodeId_cln, propertyValue as score_cln

WITH gds.util.asNode(nodeId_eig).id AS id, 
score_eig,  collect(score_eig)  as t//avg(score_eig) as score_eig_mean
RETURN *
ORDER BY score_eig DESC
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,id,score_eig,t
0,53708,0.061017,[0.06101741687030329]
1,135693,0.061017,[0.06101741687030329]
2,109958,0.061017,[0.06101741687030329]
3,44549,0.061017,[0.06101741687030329]
4,47493,0.061017,[0.06101741687030329]
5,131914,0.061017,[0.06101741687030329]
6,116208,0.061017,[0.06101741687030329]
7,51143,0.061017,[0.06101741687030329]
8,53698,0.061017,[0.06101741687030329]
9,111637,0.061017,[0.06101741687030329]


In [13]:
query = """
CALL gds.eigenvector.stream('searchdomain_preference')
YIELD nodeId, score
WITH gds.util.asNode(nodeId).id AS nodeId, score
MATCH (n:CD)-[r:CONTAINS]->(a:Artist)
WHERE n.id = nodeId
SET n.eigenvector = score
RETURN n.id, score, 
collect(a.artist) as artists
ORDER BY score DESC
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,n.id,score,artists
0,53708,0.061017,"[bruch, smetena, gluck, mahler, glinka]"
1,135693,0.061017,[stanislav bunin]
2,109958,0.061017,[stanislav bunin]
3,44549,0.061017,[jacques saint_yves]
4,47493,0.061017,[david garrett]
5,131914,0.061017,[lorenzo perosi]
6,116208,0.061017,[marina paccagnella - thomas haug]
7,51143,0.061017,[rimsky-korsakov]
8,53698,0.061017,[nassri chamssedine]
9,111637,0.061017,[bechara el khoury]


##### Feature 2: Content Match with Search Input

In [43]:
query = """
// DELETE EXISTING PROJECTION
CALL gds.graph.drop('searchdomain_content', false) 
YIELD graphName 
RETURN graphName
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

Unnamed: 0,graphName
0,searchdomain_content


In [44]:
query = """
// CREATE NEW PROJECTION WITH SEARCH RELEVANT SUB GRAPH
CALL gds.graph.project.cypher(
  "searchdomain_content",
  
  " // Artists, Albums and Songs which match query
    MATCH (n) 
    WHERE 
    (n:Song OR n:Album OR n:Artist) 
    AND 
    (n.artist =~ '.*jimi hendrix.*' 
    OR n.album  =~ '.*Are You Experienced.*' 
    OR n.song  =~ '.*purple haze.*')
    RETURN id(n) AS id, labels(n) AS labels 
    LIMIT 1000
    
    UNION
    
    // CDS linked to Artists, Albums and Songs which match query
    MATCH (n)-[:APPEARED_ON]->(c:CD) 
    WHERE 
    (n:Song OR n:Album OR n:Artist) 
    AND 
    (n.artist =~ '.*jimi hendrix.*' 
    OR n.album  =~ '.*Are You Experienced.*' 
    OR n.song  =~ '.*purple haze.*')
    RETURN id(c) AS id, labels(c) AS labels 
    LIMIT 1000",
    
    "MATCH (n)-[r:APPEARED_ON]->(c:CD) 
    WHERE 
    (n:Song OR n:Album OR n:Artist) 
    AND 
    (n.artist =~ '.*jimi hendrix.*' 
    OR n.album  =~ '.*Are You Experienced.*' 
    OR n.song  =~ '.*purple haze.*')
    RETURN id(n) AS source, id(c) AS target, type(r) AS type 
    LIMIT 1000"
)
YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels
"""


dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)]) 
dtf_data

Unnamed: 0,graphName,nodes,rels
0,searchdomain_content,140,162
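The projection above repeats the hard-coded search terms three times. A sketch of how the filter could be generated from the optional field argument is given below. It deliberately swaps the regular expressions for a case-insensitive `CONTAINS` on a `$keywords` parameter, which is an assumption and not the matching used above, and the helper names are hypothetical. The keyword itself would be supplied through the `parameters` option of `gds.graph.project.cypher`.

```python
# Search field -> (node label, name property), following this notebook's schema.
FIELDS = {
    "artist": ("Artist", "artist"),
    "album": ("Album", "album"),
    "song": ("Song", "song"),
}


def build_match_clause(field=None):
    """Return the WHERE condition for one field, or for all fields if None."""
    selected = [field] if field else list(FIELDS)
    parts = []
    for name in selected:
        if name not in FIELDS:
            raise ValueError(f"unknown search field: {name}")
        label, prop = FIELDS[name]
        parts.append(
            f"(n:{label} AND toLower(n.{prop}) CONTAINS toLower($keywords))"
        )
    return " OR ".join(parts)


def build_node_query(field=None):
    """Node query of the projection, restricted to matching nodes."""
    return (
        "MATCH (n) WHERE " + build_match_clause(field)
        + " RETURN id(n) AS id, labels(n) AS labels"
    )


print(build_node_query("song"))
```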


In [45]:
# https://neo4j.com/docs/graph-data-science/current/algorithms/eigenvector-centrality/
query = """
CALL gds.eigenvector.mutate('searchdomain_content',  {
  mutateProperty: 'score_eig'
})
YIELD centralityDistribution, nodePropertiesWritten, ranIterations
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,minimumScore,meanScore,nodePropertiesWritten
0,0.002642,0.071719,140


In [46]:
# https://neo4j.com/docs/graph-data-science/current/algorithms/closeness-centrality/#algorithms-closeness-centrality-examples-mutate
query = """
CALL gds.beta.closeness.mutate('searchdomain_content',  {
  mutateProperty: 'score_cln'
})
YIELD centralityDistribution, nodePropertiesWritten
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,minimumScore,meanScore,nodePropertiesWritten
0,0.0,0.85,140


In [47]:
# https://neo4j.com/docs/graph-data-science/current/algorithms/degree-centrality/
query = """
CALL gds.degree.mutate('searchdomain_content',  {
  mutateProperty: 'score_deg'
})
YIELD centralityDistribution, nodePropertiesWritten
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,minimumScore,meanScore,nodePropertiesWritten
0,0.0,1.157146,140


In [50]:
query = """
CALL gds.graph.streamNodeProperties('searchdomain_content', ['score_eig', 'score_deg', 'score_cln'])
YIELD nodeId, nodeProperty, propertyValue
WITH nodeId, gds.util.asNode(nodeId).id AS id, nodeProperty, propertyValue

ORDER BY nodeId, nodeProperty
WITH  nodeId, gds.util.asNode(nodeId).id as id, 
collect(nodeProperty) as properties , collect(propertyValue) as values, sum(propertyValue) as sum_values

MATCH (n:CD)
WHERE n.id = id
MATCH (n)-[:CONTAINS]->(ar:Artist)
MATCH (n)-[:CONTAINS]->(ab:Album)
MATCH (n)-[:CONTAINS]->(so:Song)
RETURN id, 
values, sum_values, 
collect(DISTINCT ar.artist) as artists,
collect(DISTINCT ab.album) as albums, 
collect(DISTINCT so.song) as songs
ORDER BY sum_values DESC
LIMIT 100
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(45)

Unnamed: 0,id,values,sum_values,artists,albums,songs
0,529,"[0.0, 74.0, 0.002642306670565559]",74.002642,[jimi hendrix],[bbc sessions],"[driving south, foxy lady, hear my train a com..."
1,3248,"[0.0, 55.0, 0.002642306670565559]",55.002642,[tamba trio],[tamba trio],"[procissao, consolacao, influencia do jazz, de..."
2,452,"[0.0, 8.0, 0.002642306670565559]",8.002642,[the jimi hendrix experience],[bbc sessions],"[voodoo child (slight return);, driving south ..."
3,20709,"[0.0, 5.0, 0.002642306670565559]",5.002642,[jimi hendrix experience],[the vpro archive - it must be dusty],"[spoken word, mockingbird (mpeg1 video data);,..."
4,2835,"[0.0, 2.0, 0.002642306670565559]",2.002642,[jimi hendrix (w. lonnie youngblood);],[two great experiencesremastered);],"[sweet thang (4); aka. wipe the sweat 3 (4);, ..."
5,110321,"[0.0, 2.0, 0.002642306670565559]",2.002642,[rajyashree josyer shrikanth],[sangeetha sowrabha],"[rtp pallavi shanmukhapriye (shanmukhapriya, k..."
6,7923,"[1.0, 0.0, 0.3011216110364633]",1.301122,[signature licks],[jimi hendrix],"[purple haze (solo);, purple haze (outro);, he..."
7,30138,"[1.0, 0.0, 0.18172988929010417]",1.18173,[jimi hendrix],[astro man box set],"[i don't live today takes (pre unre);, fire (p..."
8,33321,"[1.0, 0.0, 0.18172988929010417]",1.18173,[jimi hendrix],[astro man(alchemy); - studio outtakes 1966-68],"[la pouppee qui fait non (no no no no);, cat t..."
9,20709,"[1.0, 0.0, 0.18172988929010417]",1.18173,[jimi hendrix experience],[the vpro archive - it must be dusty],"[spoken word, mockingbird (mpeg1 video data);,..."


In [48]:
query = """
CALL gds.eigenvector.stream('searchdomain_content')
YIELD nodeId, score
WITH gds.util.asNode(nodeId).id AS nodeId, score
MATCH (n:CD)-[r:CONTAINS]->(a:Artist)
WHERE n.id = nodeId
SET n.eigenvector = score
RETURN n.id, score, 
collect(a.artist) as artists
ORDER BY score DESC
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

Unnamed: 0,n.id,score,artists
0,7923,0.301122,[signature licks]
1,33321,0.18173,[jimi hendrix]
2,30138,0.18173,[jimi hendrix]
3,20709,0.18173,[jimi hendrix experience]
4,9197,0.122034,[the jimi hendrix experience]
5,14391,0.122034,[the jimi hendrix experience]
6,8660,0.122034,[the jimi hendrix experience]
7,452,0.122034,[the jimi hendrix experience]
8,136907,0.122034,[jimi hendrix]
9,164258,0.122034,[jimi hendrix]


Node Embeddings:

In [None]:
query = """
CALL gds.beta.node2vec.stream('cdRanking', {embeddingDimension: 2})
YIELD nodeId, embedding
RETURN nodeId, embedding
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])

In [None]:
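query = """
CALL gds.fastRP.stream('cdRanking',
  {
    embeddingDimension: 4,
    randomSeed: 42
  }
)
YIELD nodeId, embedding
RETURN nodeId, embedding
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)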

In [None]:
# https://neo4j.com/docs/graph-data-science/current/graph-project-cypher/
# Reference example for passing parameters to a Cypher projection.
# It uses labels and properties that are not part of the cddb graph, so it is kept as a string only.
query = """
MATCH (n)
WHERE n.age < 20 AND NOT n.name STARTS WITH "V"
WITH collect(n) AS olderPersons
CALL gds.graph.project.cypher(
  'personSubsetViaParameters',
  'UNWIND $nodes AS n RETURN id(n) AS id, labels(n) AS labels',
  'MATCH (n)-[r:KNOWS]->(m)
    WHERE (n IN $nodes) AND (m IN $nodes)
    RETURN id(n) AS source, id(m) AS target, type(r) AS type, r.numberOfPages AS numberOfPages',
  { parameters: { nodes: olderPersons} }
)
YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels
"""