# GDMA Project
Author: Julian Schelb (1069967)

In [1]:
from neo4j import GraphDatabase
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Connection to the database instance

In [2]:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "subatomic-shrank-Respond"))
database_name = "cddb"
session = driver.session(database = database_name)

### Task 4: Searching and Ranking

Implement a simple search engine that enables search by artist, album and
song name/title. The results must be ranked based on importance. It is up to
you to come up with how the importance of each result is computed and you
must justify your decision (it goes without saying that you need to come up
with a meaningful definition). However, the importance should ideally take into
account user preferences/likes. As such, this task is split in two parts:

##### 1. Write a Cypher query that adds a relationship :LIKES between a node with label :User and an artist, album, or song.

Every user should be identified just by a numerical userID (no more
information is necessary). If a user already exists in the system, no
additional node should be added. After coming up with the necessary Cypher
query, add a significant number of users and likes.
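
A minimal sketch of such a query, wrapped in a Python call, could look like the following. The helper name `add_like` and the `LIKEABLE` mapping are hypothetical; the property names (`artist`, `album`, `song`) are assumed from the queries used later in this notebook. `MERGE` on the user node ensures that an existing user is reused rather than duplicated.

```python
# Hypothetical mapping from node label to the property used to look a node up.
# The property names are assumed from the other queries in this notebook.
LIKEABLE = {"Artist": "artist", "Album": "album", "Song": "song"}


def add_like(session, user_id, label, name):
    """MERGE the user and a :LIKES relationship to all `label` nodes named `name`."""
    if label not in LIKEABLE:
        raise ValueError(f"label must be one of {sorted(LIKEABLE)}")
    # Labels and property keys cannot be passed as Cypher parameters,
    # so they are interpolated from the whitelist above.
    query = f"""
    MERGE (u:User {{id: $user_id}})
    WITH u
    MATCH (n:{label})
    WHERE n.{LIKEABLE[label]} = $name
    MERGE (u)-[:LIKES]->(n)
    RETURN count(n) AS liked
    """
    return session.run(query, user_id=user_id, name=name)
```

With an open session, `add_like(session, 1, "Song", "purple haze")` would create user 1 if needed and link it to every matching song.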

##### 2. Implement a simple Python function that has the following arguments:

- the userID of the user submitting the search (the user ID may not
exist in the database),
- a string that contains one or more keywords for the search, and
- an optional argument that indicates whether the search is on all or
a specific field, i.e., artist, album, song.

The search must return exactly 10 results.

Python must only be used to call the database. You should not write any
code in Python that implements functionality necessary for the task. However,
submitting multiple queries in the same function call is allowed. Also, for this
task of the project, you are not only allowed but also encouraged to use functions
from the GDS library of Neo4j. Hence, before making any decisions, have a
careful look at the available functions. Again, you have to justify the use of any
function that you employ.
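
The function signature described above can be sketched as follows. This is only an illustration under stated assumptions: the names `SEARCH_FIELDS`, `build_search_query` and `search` are hypothetical, and the ranking (the searching user's own likes first, then total likes) is a placeholder, not the final importance measure. Keyword splitting and matching happen in Cypher, so Python only calls the database. Note that `LIMIT 10` returns at most 10 rows; fewer matches yield fewer results.

```python
# Assumed mapping from search field to (node label, name property); the
# property names follow the ones used elsewhere in this notebook.
SEARCH_FIELDS = {
    "artist": ("Artist", "artist"),
    "album": ("Album", "album"),
    "song": ("Song", "song"),
}


def build_search_query(field=None):
    """Return a Cypher query over one field, or over all fields if field is None."""
    fields = [field] if field is not None else list(SEARCH_FIELDS)
    branches = []
    for name in fields:
        label, prop = SEARCH_FIELDS[name]  # raises KeyError for unknown fields
        branches.append(f"""
    MATCH (n:{label})
    WHERE all(k IN split(toLower($keywords), ' ') WHERE n.{prop} CONTAINS k)
    OPTIONAL MATCH (n)<-[:LIKES]-(u:User)
    RETURN id(n) AS node_id, '{name}' AS field, n.{prop} AS name,
           count(u) AS likes,
           sum(CASE WHEN u.id = $user_id THEN 1 ELSE 0 END) AS own_likes""")
    # UNION inside a CALL subquery so that ORDER BY / LIMIT apply to all branches.
    subquery = "\n    UNION ALL".join(branches)
    return f"""
CALL {{{subquery}
}}
RETURN node_id, field, name, likes, own_likes
ORDER BY own_likes DESC, likes DESC, name
LIMIT 10
"""


def search(session, user_id, keywords, field=None):
    """Run the search; all matching and ranking is done by the database."""
    result = session.run(build_search_query(field), user_id=user_id, keywords=keywords)
    return [dict(record) for record in result]
```

A user ID that does not exist in the database simply produces `own_likes = 0` for every row, so the ranking falls back to the total number of likes.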

In [3]:
query = """
MATCH (c:CD)-[r:CONTAINS]->(ar:Artist)
WHERE c.ayear = 2000
RETURN DISTINCT ar.artist
"""
        
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

| | ar.artist |
|---|---|
| 0 | the frequency benders |
| 1 | boyz ii men |
| 2 | adriana calcanhoto |
| 3 | syl johnson |
| 4 | cheo feliciano |
| ... | ... |
| 5619 | garry harrison |
| 5620 | catuaba com amendoim |
| 5621 | jimmy powells |
| 5622 | ofra haza |


#### Processing

**Ideas:** 

Implicit measures (to avoid the cold-start problem):
- Centrality in the subgraph of artists, albums, and songs that match the search input (using exact and partial matches, and possibly the liked nodes):
    1) Select the subgraph by filtering artists, albums, and songs by the input
    2) Calculate centrality (multiple measures)
    3) A CD with higher centrality is considered better

User preference:

- Use Node2Vec embeddings to compute the similarity to already liked nodes:
    1) Include nodes similar to already liked ones
    2) Average similarity of artist/album/song nodes compared to previous likes
- Use the average distance to liked nodes (basically already included)
- Centrality in the subgraph of liked nodes (basically already included)

Output:

CD id, contained songs, contained artists, contained albums, centralities, cumulated scores
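
One possible way to combine these signals into a cumulated score is a weighted sum. The weight $\alpha$ and the symbols below are illustrative assumptions and are not fixed by the task or by the notes above:

$$
\text{score}(c) = \alpha \cdot \text{centrality}(c) + (1 - \alpha) \cdot \frac{1}{|L_u|} \sum_{l \in L_u} \text{sim}(c, l), \qquad 0 \le \alpha \le 1
$$

Here $c$ is a candidate CD, $L_u$ is the set of nodes liked by user $u$, and $\text{sim}$ is a similarity between node embeddings (for example cosine similarity). For a user without likes ($L_u = \emptyset$), the second term is undefined, so the score would fall back to centrality alone ($\alpha = 1$), which is how the implicit measure covers the cold-start case.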

#### Adding Example Users

**Function to create Example User:**

In [6]:
def createExampleUser(user_id = 1, genre = "rock", limit = 50):
    
    ####### DELETE USER NODE #######
    query1 = """
    MATCH (u:User)
    WHERE u.id = $user_id
    DETACH DELETE u
    """
    
    ####### CREATE USER NODE #######
    query2 = """
    MERGE (u:User {id:  $user_id})
    RETURN u.id as user_id
    """
    
    ####### LINK TO LIKED SONGS #######
    query3 = """
    MATCH (g:Genre)<-[r:BELONGS_TO]-(c:CD)
    MATCH (c)-[r2:CONTAINS]->(s:Song)
    WHERE g.genre = $genre
    WITH s 
    LIMIT $limit
    MATCH (u:User)
    WHERE u.id = $user_id
    MERGE (u)-[r3:LIKES]->(s)
    """
    
    ####### LINK TO LIKED ARTISTS #######
    query4 = """
    MATCH (g:Genre)<-[r:BELONGS_TO]-(c:CD)
    MATCH (c)-[r2:CONTAINS]->(t:Artist)
    WHERE g.genre = $genre
    WITH t 
    LIMIT $limit
    MATCH (u:User)
    WHERE u.id = $user_id
    MERGE (u)-[r3:LIKES]->(t)
    """
    
    ####### LINK TO LIKED ALBUMS #######
    query5 = """
    MATCH (g:Genre)<-[r:BELONGS_TO]-(c:CD)
    MATCH (c)-[r2:CONTAINS]->(t:Album)
    WHERE g.genre = $genre
    WITH t 
    LIMIT $limit
    MATCH (u:User)
    WHERE u.id = $user_id
    MERGE (u)-[r3:LIKES]->(t)
    """

    with driver.session(database = database_name) as session:
        session.run(query1, user_id = user_id)
        session.run(query2, user_id = user_id)
        session.run(query3, user_id = user_id, genre = genre, limit = limit)
        session.run(query4, user_id = user_id, genre = genre, limit = limit)
        session.run(query5, user_id = user_id, genre = genre, limit = limit)
    

**Adding some Likes to model User Preference:**

In [11]:
# User 1 likes "Rock" music
createExampleUser(user_id = 1, genre = "rock", limit = 5)
# User 2 likes "classic" music
createExampleUser(user_id = 2, genre = "classic", limit = 5)
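
To check that the example users were linked as intended, the likes can be counted per node label. The helper name `count_likes` is hypothetical, and the sketch assumes each liked node carries a single label, so `labels(n)[0]` identifies it.

```python
def count_likes(session, user_id):
    """Return a dict mapping node label to the number of nodes the user likes."""
    query = """
    MATCH (u:User {id: $user_id})-[:LIKES]->(n)
    RETURN labels(n)[0] AS label, count(n) AS likes
    ORDER BY label
    """
    return {record["label"]: record["likes"] for record in session.run(query, user_id=user_id)}
```

Because `createExampleUser` applies `LIMIT` to rows rather than to distinct nodes, the counts returned for a user may be lower than the `limit` argument if the same node was matched more than once.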

***

#### Implementing Search Engine

**Feature 1: Centrality**

In [25]:
query = """
// DELETE EXISTING PROJECTION
//CALL gds.graph.drop('searchdomain', false) 
//YIELD graphName as graphDeleted

// CREATE NEW PROJECTION WITH SEARCH RELEVANT SUB GRAPH
CALL gds.graph.project.cypher(
  'searchdomain',
  ' // Liked Artists, Albums and Songs
    MATCH (u:User)-[:LIKES]->(n) 
    WHERE u.id = 2 
        AND (n:Song OR n:Album OR n:Artist) 
    RETURN id(n) AS id, labels(n) AS labels 
    LIMIT 1000
    
    UNION
    
    // CDs linked to liked Artists, Albums and Songs
    MATCH (u:User)-[:LIKES]->(x)-[:APPEARED_ON]->(n:CD) 
    WHERE u.id = 2  
    RETURN id(n) AS id, labels(n) AS labels 
    LIMIT 1000',
    
    'MATCH (u:User)-[:LIKES]->(n)
    WHERE u.id = 2 
    AND (n:CD OR n:Song OR n:Album OR n:Artist) 
    MATCH (n)-[r:APPEARED_ON]->(m:CD) 
    RETURN id(n) AS source, id(m) AS target, type(r) AS type 
    LIMIT 1000' 
)
YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

| | graphName | nodes | rels |
|---|---|---|---|
| 0 | searchdomain | 543 | 588 |
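
The projection above hard-codes `u.id = 2`. A sketch of a parameterized variant is shown below. It relies on the `parameters` option of `gds.graph.project.cypher` to pass the user ID into the inner node and relationship queries. The function name `project_search_domain` is hypothetical, and the `LIMIT 1000` clauses of the original are omitted for brevity.

```python
def project_search_domain(session, user_id, graph_name="searchdomain"):
    """Drop any stale projection, then project the liked nodes and their CDs."""
    # failIfMissing = false: dropping is a no-op if the graph does not exist.
    session.run("CALL gds.graph.drop($graph_name, false)", graph_name=graph_name)

    query = """
    CALL gds.graph.project.cypher(
      $graph_name,
      'MATCH (u:User)-[:LIKES]->(n)
       WHERE u.id = $user_id AND (n:Song OR n:Album OR n:Artist)
       RETURN id(n) AS id, labels(n) AS labels
       UNION
       MATCH (u:User)-[:LIKES]->()-[:APPEARED_ON]->(n:CD)
       WHERE u.id = $user_id
       RETURN id(n) AS id, labels(n) AS labels',
      'MATCH (u:User)-[:LIKES]->(n)-[r:APPEARED_ON]->(m:CD)
       WHERE u.id = $user_id AND (n:Song OR n:Album OR n:Artist)
       RETURN id(n) AS source, id(m) AS target, type(r) AS type',
      {parameters: {user_id: $user_id}}
    )
    YIELD graphName, nodeCount AS nodes, relationshipCount AS rels
    RETURN graphName, nodes, rels
    """
    result = session.run(query, graph_name=graph_name, user_id=user_id)
    return [dict(record) for record in result]
```

Running the drop as its own query avoids chaining it with the projection in a single statement, where a drop that yields no rows would also prevent the projection from running.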


In [29]:
query = """
CALL gds.eigenvector.stream('searchdomain')
YIELD nodeId, score
WITH gds.util.asNode(nodeId).id AS nodeId, score
MATCH (n:CD)-[r:CONTAINS]->(a:Artist)
WHERE n.id = nodeId
SET n.eigenvector = score
RETURN n.id, score, 
collect(a.artist) as artists
ORDER BY score DESC
"""
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data.head(30)

| | n.id | score | artists |
|---|---|---|---|
| 0 | 50271 | 0.1777 | [ludwig van beethoven] |
| 1 | 125757 | 0.142477 | [bach j.s - mozart w.a. - rachmaninov s.] |
| 2 | 110015 | 0.107254 | [ludwig van beethoven] |
| 3 | 51796 | 0.107254 | [ludwig van beethoven] |
| 4 | 50278 | 0.107254 | [ludwig van beethoven] |
| 5 | 50479 | 0.107254 | [ludwig van beethoven] |
| 6 | 47774 | 0.107254 | [ludwig van beethoven] |
| 7 | 54951 | 0.107254 | [ludwig van beethoven] |
| 8 | 48320 | 0.107254 | [ludwig van beethoven] |
| 9 | 48130 | 0.107254 | [ludwig van beethoven] |


In [None]:
// Query input:
MATCH (u:User)-[:LIKES]->(n) 
WHERE u.id = 1 AND (n:CD OR n:Song OR n:Album OR n:Artist) 
RETURN id(n) AS id, labels(n) AS labels 
LIMIT 1000
UNION ALL
MATCH (n) 
WHERE 
(n:CD OR n:Song OR n:Album OR n:Artist) 
AND 
(n.artist =~ '.*jimi hendrix.*' 
OR n.album  =~ '(?i).*Are You Experienced.*' 
OR n.song  =~ '.*purple haze.*')
RETURN id(n) AS id, labels(n) AS labels 
LIMIT 1000

In [None]:
// DELETE EXISTING PROJECTION
CALL gds.graph.drop('searchdomain', false) 
YIELD graphName as graphDeleted

// CREATE NEW PROJECTION WITH SEARCH RELEVANT SUB GRAPH
CALL gds.graph.project.cypher(
  'searchdomain',
  "MATCH (n) WHERE n:CD OR n:Song RETURN id(n) AS id, labels(n) AS labels",
  'MATCH (n:Song)-[r:APPEARED_ON]->(m:CD) RETURN id(n) AS source, id(m) AS target, type(r) AS type'
)
YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels

In [None]:
CALL gds.graph.project.cypher(
  'cds',
  'MATCH (n) WHERE n:CD OR n:Artist OR n:Song OR n:Album  RETURN id(n) AS id, labels(n) AS labels',
  'MATCH (n)-[r:APPEARED_ON]->(m) RETURN id(n) AS source, id(m) AS target, type(r) AS type',
  {validateRelationships: false})
YIELD
  graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipCount AS rels

In [None]:
MATCH (u:User )
WHERE u.id = 1
MATCH (s:Song) 
WHERE 
    s.song =~ '.*purple haze.*' 
    OR s.song =~ '.*whole lotta love.*'
    OR s.song =~ '.*sympathy for the devil.*' 
    OR s.song =~ '.*under pressure.*'
    OR s.song =~ '.*iron man.*'
    OR s.song =~ '.*we will rock you.*'
    OR s.song =~ '.*bohemian rhapsody.*'
    OR s.song =~ '.*we are the champions.*'
    OR s.song =~ '.*another one bites the dust.*'    
MERGE  (u)-[r:LIKES]->(s)      
RETURN *

In [None]:
CALL gds.graph.project(
  'cdRanking',            
  ['CD', 'Album', 'Artist', 'Song'],             
  ['CONTAINS', 'APPEARED_ON']               
)
YIELD
  graphName AS graph, nodeProjection, nodeCount AS nodes, relationshipProjection, relationshipCount AS rels

In [None]:
CALL gds.eigenvector.stream('cdRanking')
YIELD nodeId, score
WITH gds.util.asNode(nodeId).id AS nodeId, score
RETURN nodeId, score
ORDER BY score DESC
LIMIT 100

Node Embeddings:

In [None]:
CALL gds.beta.node2vec.stream('cdRanking', {embeddingDimension: 2})
YIELD nodeId, embedding
RETURN nodeId, embedding

In [None]:
CALL gds.fastRP.stream('cdRanking',
  {
    embeddingDimension: 4,
    randomSeed: 42
  }
)
YIELD nodeId, embedding
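
The user-preference idea listed earlier (average similarity of a candidate to the nodes a user already likes) could be sketched on top of these embeddings as follows. This is an assumption-laden illustration: it writes FastRP embeddings to a node property named `embedding` (a name chosen here), and uses the GDS function `gds.similarity.cosine` to compare each CD with the liked nodes. The function names are hypothetical, and the CD-by-liked-node cross product may be too slow on the full graph without first narrowing the candidates.

```python
def write_embeddings(session, graph_name="cdRanking"):
    """Write FastRP embeddings to the `embedding` property of the projected nodes."""
    query = """
    CALL gds.fastRP.write($graph_name, {
        embeddingDimension: 4,
        randomSeed: 42,
        writeProperty: 'embedding'
    })
    YIELD nodePropertiesWritten
    RETURN nodePropertiesWritten
    """
    return [dict(record) for record in session.run(query, graph_name=graph_name)]


def rank_cds_by_similarity(session, user_id):
    """Rank CDs by their average cosine similarity to the user's liked nodes."""
    query = """
    MATCH (u:User {id: $user_id})-[:LIKES]->(liked)
    WHERE liked.embedding IS NOT NULL
    MATCH (c:CD)
    WHERE c.embedding IS NOT NULL
    RETURN c.id AS cd_id,
           avg(gds.similarity.cosine(c.embedding, liked.embedding)) AS similarity
    ORDER BY similarity DESC
    LIMIT 10
    """
    return [dict(record) for record in session.run(query, user_id=user_id)]
```

For a user without any likes the second query returns no rows, so a ranking based on it would need a fallback such as the centrality score.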

In [None]:
// Source: https://neo4j.com/docs/graph-data-science/current/graph-project-cypher/

MATCH (n)
WHERE n.age < 20 AND NOT n.name STARTS WITH "V"
WITH collect(n) AS olderPersons
CALL gds.graph.project.cypher(
  'personSubsetViaParameters',
  'UNWIND $nodes AS n RETURN id(n) AS id, labels(n) AS labels',
  'MATCH (n)-[r:KNOWS]->(m)
    WHERE (n IN $nodes) AND (m IN $nodes)
    RETURN id(n) AS source, id(m) AS target, type(r) AS type, r.numberOfPages AS numberOfPages',
  { parameters: { nodes: olderPersons} }
)
 YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
 RETURN graphName, nodes, rels