# GDMA Project
Author: Julian Schelb (1069967)

In [None]:
from neo4j import GraphDatabase
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Connection to the database instance

In [None]:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "subatomic-shrank-Respond"))
database_name = "cddb"
session = driver.session(database = database_name)

### Task 4: Searching and Ranking

Implement a simple search engine that enables search by artist, album and
song name/title. The results must be ranked based on importance. It is up to
you to come up with how the importance of each result is computed and you
must justify your decision (it goes without saying that you need to come up
with a meaningful definition). However, the importance should ideally take into
account user preferences/likes. As such, this task is split in two parts:

##### 1. Write a Cypher query that adds a relationship :LIKES between a node with label :User and an artist, album, or song.

Every user should be identified
just by a numerical userID (no more information is necessary). If a user
already exists in the system, no additional node should be added. After
coming up with the necessary Cypher query, add a significant number of
users and likes.

##### 2. Implement a simple Python function that has the following arguments:

- the userID of the user submitting the search (the user ID may not exist in the database),
- a string that contains one or more keywords for the search, and
- an optional argument that indicates whether the search is on all or a specific field, i.e., artist, album, song.

The search must return exactly 10 results.

Python must only be used to call the database. You should not write any
code in Python that implements functionality necessary for the task. However,
submitting multiple queries in the same function call is allowed. Also, for this
task of the project, you are not only allowed but also encouraged to use functions
from the GDS library of Neo4j. Hence, before making any decisions, have a
careful look at the available functions. Again, you have to justify the use of any
function that you employ.
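As a rough starting point, the function described above could be sketched as follows. This is not the final ranking: ordering by like counts is only a placeholder, and `FIELDS`, `build_search_query`, `search`, and the `Album.album` property are hypothetical names introduced for illustration. Keyword splitting and matching happen inside Cypher, so Python only submits the query. Note that `LIMIT 10` yields at most 10 rows, not necessarily exactly 10.

```python
# Hypothetical mapping from search field to (node label, name property).
# Artist.artist and Song.song appear in this notebook; Album.album is an assumption.
FIELDS = {
    "artist": ("Artist", "artist"),
    "album": ("Album", "album"),
    "song": ("Song", "song"),
}


def build_search_query(field=None):
    """Build a Cypher query for one field, or for all fields if field is None."""
    if field is not None and field not in FIELDS:
        raise ValueError(f"unknown search field: {field!r}")
    selected = [field] if field else list(FIELDS)
    branches = []
    for name in selected:
        label, prop = FIELDS[name]
        branches.append(
            f"  MATCH (n:{label})\n"
            f"  WHERE any(k IN split(toLower($text), ' ') WHERE n.{prop} CONTAINS k)\n"
            f"  RETURN n AS node, '{name}' AS field, n.{prop} AS title"
        )
    # Placeholder ranking: nodes liked by the searching user first, then by total likes.
    return (
        "CALL {\n" + "\n  UNION\n".join(branches) + "\n}\n"
        "OPTIONAL MATCH (anyone:User)-[:LIKES]->(node)\n"
        "WITH node, field, title, count(anyone) AS likes\n"
        "OPTIONAL MATCH (me:User {id: $user_id})-[:LIKES]->(node)\n"
        "RETURN field, title, likes, me IS NOT NULL AS liked_by_user\n"
        "ORDER BY liked_by_user DESC, likes DESC\n"
        "LIMIT 10"
    )


def search(session, user_id, text, field=None):
    """Submit the search to the database and return the rows as dictionaries."""
    result = session.run(build_search_query(field), user_id=user_id, text=text)
    return [dict(record) for record in result]
```

A call such as `search(session, 1, "hendrix", field="artist")` would then return up to 10 rows with the columns `field`, `title`, `likes`, and `liked_by_user`. An unknown user ID does not cause an error, because the user is only looked up with `OPTIONAL MATCH`.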

In [None]:
query = """
MATCH (c:CD)-[r:CONTAINS]->(ar:Artist)
WHERE c.ayear = 2000
RETURN DISTINCT ar.artist
"""
        
dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

Unnamed: 0,ar.artist
0,the frequency benders
1,boyz ii men
2,adriana calcanhoto
3,syl johnson
4,cheo feliciano
...,...
5619,garry harrison
5620,catuaba com amendoim
5621,jimmy powells
5622,ofra haza


#### Processing

**Ideas:** 

Implicit measures (to avoid the cold-start problem):
- Centrality in the subgraph of artists, albums, and songs matching the search input (use exact and partial matches; possibly also include liked nodes):
    1) Select the subgraph by filtering artists, albums, and songs by the input
    2) Calculate centrality (multiple measures)
    3) A CD with higher centrality is considered better


User preference:

- Use node2vec embeddings to compute similarity to already liked nodes:
    1) Include nodes similar to already liked ones
    2) Average similarity of artist/album/song nodes compared to previous likes
- Use the average distance to liked nodes (basically already included).
- Centrality in the subgraph of liked nodes (basically already included).

Output:

CD ID, contained songs, contained artists, contained albums, centralities, cumulative scores
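To make the "average similarity compared to previous likes" idea concrete, here is a small NumPy illustration of the measure itself. It is only meant to show the arithmetic on toy vectors; the function name `mean_cosine_similarity` is hypothetical, and for the actual task the computation would have to happen in the database (the GDS library offers a `gds.similarity.cosine` function that might serve this purpose).

```python
import numpy as np


def mean_cosine_similarity(candidate, liked):
    """Average cosine similarity between one embedding and a set of liked embeddings.

    candidate: 1-D sequence of length d
    liked:     2-D sequence of shape (n, d), one row per liked node
    """
    candidate = np.asarray(candidate, dtype=float)
    liked = np.asarray(liked, dtype=float)
    # Dot product of every liked embedding with the candidate, normalised by both lengths.
    sims = liked @ candidate / (np.linalg.norm(liked, axis=1) * np.linalg.norm(candidate))
    return float(sims.mean())


# Toy 2-dimensional embeddings: one liked node points the same way as the
# candidate (similarity 1), the other is orthogonal to it (similarity 0).
score = mean_cosine_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)
```

A candidate whose embedding lies close to many liked nodes gets a score near 1, while an unrelated candidate scores near 0, so the value could be used as one component of the cumulative score.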

#### Adding Example Users

**Create User Node:**

In [12]:
user_id = 1 # User ID

query = """
MERGE (u:User {id:  $user_id})
RETURN u.id as user_id
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query,  user_id = user_id)])
dtf_data

Unnamed: 0,user_id
0,1


**Adding some Likes to model User Preference:**

In [16]:
query = """
MATCH (u:User)
WHERE u.id = 1
MATCH (a:Artist) 
WHERE 
    a.artist =~ '.*jimi hendrix.*' 
    OR a.artist =~ '.*led zeppelin.*'
    OR a.artist =~ '.*rolling stones.*' 
    OR a.artist =~ 'queen'
    OR a.artist =~ '.*david bowie.*' 
    OR a.artist =~ '.*black sabbath.*'
    OR a.artist =~ '.*thin lizzy.*' 
MERGE  (u)-[r:LIKES]->(a)      
RETURN a.artist
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

Unnamed: 0,a.artist
0,the jimi hendrix experience
1,jimi hendrix
2,queen
3,the rolling stones
4,thin lizzy
5,black sabbath
6,jimi hendrix (w. lonnie youngblood);
7,led zeppelin
8,rolling stones
9,david bowie
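The two steps above (create the user if missing, then attach likes) can also be combined into a single parameterized query, which makes it easier to add many users and likes. The following is a minimal sketch under some assumptions: `LIKEABLE`, `build_like_query`, and `add_like` are hypothetical helpers, the target is matched by exact name, and `Album.album` is an assumed property name.

```python
# Labels cannot be passed as Cypher parameters, so they are whitelisted here.
# Artist.artist and Song.song appear in this notebook; Album.album is an assumption.
LIKEABLE = {"Artist": "artist", "Album": "album", "Song": "song"}


def build_like_query(label):
    """Build a query that creates the user if needed and adds a :LIKES relationship."""
    if label not in LIKEABLE:
        raise ValueError(f"cannot like nodes with label {label!r}")
    prop = LIKEABLE[label]
    return (
        "MERGE (u:User {id: $user_id})\n"   # no duplicate user node is created
        "WITH u\n"
        f"MATCH (t:{label} {{{prop}: $name}})\n"
        "MERGE (u)-[:LIKES]->(t)\n"          # no duplicate relationship is created
        f"RETURN u.id AS user_id, t.{prop} AS liked"
    )


def add_like(session, user_id, label, name):
    """Run the like query and return the created or matched likes as dictionaries."""
    result = session.run(build_like_query(label), user_id=user_id, name=name)
    return [dict(record) for record in result]
```

For example, `add_like(session, 2, "Artist", "queen")` would create user 2 if necessary and link it to every artist node named exactly `queen`. Because both steps use `MERGE`, repeating the call leaves the graph unchanged.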


In [24]:
query = """
MATCH (u:User)-[r:LIKES]->(a:Artist)
MATCH (a)-[r2:APPEARED_ON]->(c:CD)
MATCH (c)-[r3:CONTAINS]->(s:Song)
WHERE u.id = 1 and a.artist =~ '.*jimi hendrix.*' 
WITH u, s
LIMIT 20
MERGE (u)-[r4:LIKES]->(s)      
RETURN s.song
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

Unnamed: 0,s.song
0,simon says
1,welcome home
2,hush now
3,gotta have a new dress
4,how would you feel
5,no business
6,strange things
7,you got me floatin
8,ain't no telling
9,exp


In [None]:
query = """
MATCH (u:User)-[r:LIKES]->(a:Artist)
MATCH (a)-[r2:APPEARED_ON]->(c:CD)
MATCH (c)-[r3:CONTAINS]->(s:Song)
WHERE u.id = 1 and a.artist =~ '.*led zeppelin.*' 
WITH u, s
LIMIT 20
MERGE (u)-[r4:LIKES]->(s)      
RETURN s.song
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

In [None]:
# Like a few well-known songs directly, matched by partial title.
query = """
MATCH (u:User)
WHERE u.id = 1
MATCH (s:Song) 
WHERE 
    s.song =~ '.*purple haze.*' 
    OR s.song =~ '.*whole lotta love.*'
    OR s.song =~ '.*sympathy for the devil.*' 
    OR s.song =~ '.*under pressure.*'
    OR s.song =~ '.*iron man.*'
    OR s.song =~ '.*we will rock you.*'
    OR s.song =~ '.*bohemian rhapsody.*'
    OR s.song =~ '.*we are the champions.*'
    OR s.song =~ '.*another one bites the dust.*'    
MERGE  (u)-[r:LIKES]->(s)      
RETURN *
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data

In [None]:
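# Project the CD, Album, Artist, and Song nodes and their relationships into the GDS graph catalog.
query = """
CALL gds.graph.project(
  'cdRanking',
  ['CD', 'Album', 'Artist', 'Song'],
  ['CONTAINS', 'APPEARED_ON']
)
YIELD
  graphName AS graph, nodeProjection, nodeCount AS nodes, relationshipProjection, relationshipCount AS rels
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data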
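Re-running a projection cell fails once a graph with the same name already exists in the catalog. One way around this is to drop the old projection first. The helper below is a sketch only: `reproject` is a hypothetical name, and it assumes the GDS 2.x procedure names used in this notebook (`gds.graph.project`, and `gds.graph.drop` with its second argument `failIfMissing` set to `false`).

```python
def reproject(session, name, labels, rel_types):
    """Drop the named projection if it exists, then create it again."""
    # failIfMissing = false: dropping a graph that does not exist is not an error.
    session.run("CALL gds.graph.drop($name, false)", name=name).consume()
    record = session.run(
        "CALL gds.graph.project($name, $labels, $rel_types) "
        "YIELD graphName, nodeCount, relationshipCount "
        "RETURN graphName, nodeCount, relationshipCount",
        name=name,
        labels=labels,
        rel_types=rel_types,
    ).single()
    return dict(record)
```

With the session from the top of the notebook, `reproject(session, "cdRanking", ["CD", "Album", "Artist", "Song"], ["CONTAINS", "APPEARED_ON"])` would rebuild the same projection and return its name along with the node and relationship counts.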

In [None]:
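# Alternative projection via Cypher queries, restricted to APPEARED_ON relationships.
query = """
CALL gds.graph.project.cypher(
  'cds',
  'MATCH (n) WHERE n:CD OR n:Artist OR n:Song OR n:Album  RETURN id(n) AS id, labels(n) AS labels',
  'MATCH (n)-[r:APPEARED_ON]->(m) RETURN id(n) AS source, id(m) AS target, type(r) AS type',
  {validateRelationships: false})
YIELD
  graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipCount AS rels
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data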

In [None]:
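# Eigenvector centrality on the projected graph; list the 100 highest-scoring nodes.
query = """
CALL gds.eigenvector.stream('cdRanking')
YIELD nodeId, score
WITH gds.util.asNode(nodeId).id AS nodeId, score
RETURN nodeId, score
ORDER BY score DESC
LIMIT 100
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data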

Node Embeddings:

In [None]:
query = """
CALL gds.beta.node2vec.stream('cdRanking', {embeddingDimension: 2})
YIELD nodeId, embedding
RETURN nodeId, embedding
"""
pd.DataFrame([dict(_) for _ in session.run(query)])

In [None]:
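# FastRP embeddings as an alternative to node2vec; the fixed seed keeps runs reproducible.
query = """
CALL gds.fastRP.stream('cdRanking',
  {
    embeddingDimension: 4,
    randomSeed: 42
  }
)
YIELD nodeId, embedding
"""

dtf_data = pd.DataFrame([dict(_) for _ in session.run(query)])
dtf_data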

In [None]:
# Reference: https://neo4j.com/docs/graph-data-science/current/graph-project-cypher/

MATCH (n)
WHERE n.age < 20 AND NOT n.name STARTS WITH "V"
WITH collect(n) AS olderPersons
CALL gds.graph.project.cypher(
  'personSubsetViaParameters',
  'UNWIND $nodes AS n RETURN id(n) AS id, labels(n) AS labels',
  'MATCH (n)-[r:KNOWS]->(m)
    WHERE (n IN $nodes) AND (m IN $nodes)
    RETURN id(n) AS source, id(m) AS target, type(r) AS type, r.numberOfPages AS numberOfPages',
  { parameters: { nodes: olderPersons} }
)
 YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
 RETURN graphName, nodes, rels