# **Knowledge Representation in RAG methods**

Contributors:
* Szymon Pająk
* Tomasz Ogiołda

## Temporary notes

### Plan

1. Introduction
2. Background
  - What is RAG? Why is it used?
  - What kinds of knowledge representations can RAG use?
    - Vectorized embeddings
    - Knowledge graph
    - Combination of both
    - Comparison https://neo4j.com/blog/genai/graphrag-manifesto/

  - Explain the dataflow for both knowledge representations (the whole process, from raw data to querying the knowledge database)
3. Demo
  - Tools to be used:
    - langchain?
    - neo4j
4. Resources

- https://neo4j.com/blog/genai/graphrag-manifesto/
- https://neo4j.com/blog/developer/langchain4j-graphrag-vector-stores-retrievers/
- https://neo4j.com/blog/genai/what-is-retrieval-augmented-generation-rag/
- https://neo4j.com/blog/developer/knowledge-graph-rag-application/
- https://neo4j.com/blog/news/graphrag-ecosystem-tools/

## **RAG quickstart & Motivation**

Some text

In [21]:
!pip install neo4j google-generativeai



In [5]:
from google.colab import userdata

NEO4J_URI = userdata.get('NEO4J_URI')
NEO4J_PASS = userdata.get('NEO4J_PASS')
NEO4J_DB_USER = userdata.get('NEO4J_DB_USER')
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

In [110]:
from neo4j import GraphDatabase
import google.generativeai as genai

genai.configure(api_key=GOOGLE_API_KEY)

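# text-embedding-004 is an embedding model, so it is used through
# genai.embed_content(model=embedding_model, content=...) rather than
# being wrapped in GenerativeModel.
embedding_model = 'models/text-embedding-004'
generative_llm = genai.GenerativeModel('gemini-1.5-flash-latest')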

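def get_db():
    # Return an open driver. Returning from inside a `with` block would close
    # the driver before the caller can use it, so the caller is responsible
    # for calling driver.close() when done.
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_DB_USER, NEO4J_PASS))
    driver.verify_connectivity()
    return driver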

## **Data preparation & Indexing**

In [24]:
import kagglehub

path = kagglehub.dataset_download("devdope/900k-spotify")

Downloading from https://www.kaggle.com/api/v1/datasets/download/devdope/900k-spotify?dataset_version_number=3...


100%|██████████| 1.00G/1.00G [00:25<00:00, 41.8MB/s]

Extracting files...





In [44]:
import numpy as np
import pandas as pd

songs_csv = path + '/spotify_dataset.csv'

full_df = pd.read_csv(songs_csv)

In [98]:
np.random.seed(9)

df = full_df.sample(20000)

# Keep only the columns used to build the graph.
columns = ['Artist(s)', 'song', 'text', 'emotion', 'Length', 'Album', 'Genre',
           'Energy', 'Popularity', 'Danceability', 'Positiveness']
score_columns = ['Energy', 'Popularity', 'Danceability', 'Positiveness']

df = df[columns].copy()
# Rescale the score columns by dividing by 100.
df[score_columns] = df[score_columns].astype(int) / 100

In [102]:
df.head()

Unnamed: 0,Artist(s),song,text,emotion,Length,Tempo,Album,Genre,Release Date,Explicit,Energy,Popularity,Danceability,Positiveness,Liveness
465551,The Cast of Mary Poppins,Mary Poppins,"[Verse 1] Rose gold, rose quartz Stone cold, w...",anger,02:02,80,Mary Poppins Original Soundtrack,hip hop,6th February 2018,No,0.61,0.35,0.53,0.92,0.23
5902,98º,If Only She Knew,"If only she knew What was going right, I...",joy,04:27,96,98 Degrees And Rising,hip hop,1st January 1998,No,0.46,0.29,0.76,0.59,0.11
84105,CHASETHEMONEY,Been Dat,[Intro: Lil Yachty] Chase the money 'til a nig...,joy,01:42,115,Slim.E and Friends,hip hop,31st October 2020,Yes,0.42,0.22,0.86,0.51,0.15
31291,"Arthur Sullivan,Richard Lewis/Ian Wallace/Pro ...",And Have I Journeyd for a Month,"[NANKI-POO] And have I journeyed for a month, ...",joy,00:50,86,Gilbert & Sullivan: The Mikado,hip hop,13th June 2011,No,0.04,0.0,0.45,0.31,0.56
48636,Beyond Creation,Ethereal Kingdom,Alone among the living Man is haunted by the k...,sadness,05:19,96,Algorythm,"rock,garage rock",12th October 2018,No,0.96,0.2,0.37,0.09,0.35


In [118]:
def ingest_music_data_from_dataframe(db, df):
    print(f"Starting data ingestion for {len(df)} songs...")
    cnt = 0
    with db.session() as session:
        for idx, row in df.iterrows():
            try:
                song_id = str(idx)
                song_title = row['song']
                lyrics = str(row['text'])
                emotion = row['emotion']
                time_length = row['Length']
                album_name = row['Album']
                genre_name = row['Genre']
                energy = row['Energy']
                popularity = row['Popularity']
                danceability = row['Danceability']
                positiveness = row['Positiveness']

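                # Create/Merge Song
                song_props = {
                    "id": song_id,
                    "title": song_title,
                    "lyrics": lyrics, # Store original lyrics
                    "time_length": time_length,
                    "energy": float(energy) if pd.notna(energy) else None,
                    "popularity": float(popularity) if pd.notna(popularity) else None,
                    "danceability": float(danceability) if pd.notna(danceability) else None,
                    "positiveness": float(positiveness) if pd.notna(positiveness) else None,
                }
                # Write the Song node first: every MATCH (s:Song ...) below
                # relies on it existing, otherwise no relationships are created.
                session.run("""
                    MERGE (s:Song {id: $song_id})
                    SET s += $props
                """, song_id=song_id, props=song_props)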

                # Artists - handle multiple artists if separated by comma, etc.
                artist_names = []
                if pd.notna(row['Artist(s)']):
                    artist_names = [name.strip() for name in str(row['Artist(s)']).split(',')]

                for artist_name in artist_names:
                    if artist_name: # Ensure not empty string
                        session.run("""
                            MERGE (ar:Artist {name: $artist_name})
                            WITH ar
                            MATCH (s:Song {id: $song_id})
                            MERGE (ar)-[:PERFORMED]->(s)
                        """, artist_name=artist_name, song_id=song_id)

                # Album
                if pd.notna(album_name) and album_name.strip():
                    session.run("""
                        MERGE (al:Album {name: $album_name})
                        WITH al
                        MATCH (s:Song {id: $song_id})
                        MERGE (s)-[:APPEARS_ON]->(al)
                    """, album_name=album_name.strip(), song_id=song_id)

                # Genre
                if pd.notna(genre_name) and genre_name.strip():
                    session.run("""
                        MERGE (g:Genre {name: $genre_name})
                        WITH g
                        MATCH (s:Song {id: $song_id})
                        MERGE (s)-[:HAS_GENRE]->(g)
                    """, genre_name=genre_name.strip(), song_id=song_id)

                # Emotion
                if pd.notna(emotion) and emotion.strip():
                    session.run("""
                        MERGE (e:Emotion {name: $emotion})
                        WITH e
                        MATCH (s:Song {id: $song_id})
                        MERGE (s)-[:EVOKES]->(e)
                    """, emotion=emotion.strip(), song_id=song_id)

                if (cnt + 1) % 1000 == 0:
                    print(f"Processed {cnt + 1}/{len(df)} songs.")

                cnt+=1

            except Exception as e:
                # Report the failing row and continue with the next one.
                print(f"Error while ingesting row {idx}: {e}")


db = get_db()
ingest_music_data_from_dataframe(db, df)

Starting data ingestion for 20000 songs...




Processed 1000/20000 songs.
Processed 2000/20000 songs.
Processed 3000/20000 songs.
Processed 4000/20000 songs.
Processed 5000/20000 songs.
Processed 6000/20000 songs.
Processed 7000/20000 songs.
Processed 8000/20000 songs.
Processed 9000/20000 songs.
Processed 10000/20000 songs.
Processed 11000/20000 songs.
Processed 12000/20000 songs.
Processed 13000/20000 songs.
Processed 14000/20000 songs.
Processed 15000/20000 songs.
Processed 16000/20000 songs.
Processed 17000/20000 songs.
Processed 18000/20000 songs.
Processed 19000/20000 songs.
Processed 20000/20000 songs.
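
The ingestion loop above sends several separate queries per song. A common way to cut down on round trips is to send rows in batches and let Cypher's `UNWIND` expand them on the server. The sketch below only illustrates that idea and is not the notebook's method: `chunked`, `upsert_songs_in_batches`, the `UPSERT_SONGS` query and the batch size of 500 are hypothetical, and `session` is assumed to be any object with a `run(query, **params)` method, such as a Neo4j driver session.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical query: a single UNWIND statement upserts a whole batch of songs.
UPSERT_SONGS = """
UNWIND $rows AS row
MERGE (s:Song {id: row.id})
SET s += row.props
"""


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def upsert_songs_in_batches(
    session: Any, rows: Iterable[Dict[str, Any]], batch_size: int = 500
) -> int:
    """Send one query per batch of rows and return the number of rows sent."""
    sent = 0
    for batch in chunked(rows, batch_size):
        session.run(UPSERT_SONGS, rows=batch)
        sent += len(batch)
    return sent
```

Each row is assumed to look like `{"id": "48636", "props": {"title": "Ethereal Kingdom"}}`. The same pattern could be extended to artists, albums, genres and emotions with one query per relationship type.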


## **Retrieval**

Some text

In [None]:
# Some code
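
Retrieval over vectorized embeddings comes down to ranking stored vectors by their similarity to the query vector. The following is a minimal sketch of that ranking step, assuming each song already has an embedding stored next to its id. `cosine_similarity`, `top_k`, the song ids and the toy three-dimensional vectors are all hypothetical; a real setup would more likely delegate this to a vector index in the database.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], embeddings: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(song_id, cosine_similarity(query, vec)) for song_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: made-up ids and vectors, for illustration only.
toy_embeddings = {
    "song-a": [1.0, 0.0, 0.0],
    "song-b": [0.0, 1.0, 0.0],
    "song-c": [0.9, 0.1, 0.0],
}
results = top_k([1.0, 0.0, 0.0], toy_embeddings, k=2)
print(results)
```

With these toy vectors, `song-a` ranks first and `song-c` second, because `song-b` is orthogonal to the query.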

## **Generation**

Some text

In [None]:
# Some code
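
In the generation step, the retrieved records are typically placed into the prompt as context ahead of the user's question. The following is a minimal sketch under stated assumptions: `build_prompt`, the prompt wording and the record fields (`title`, `artist`, `emotion`) are hypothetical. The resulting string could then be passed to `generative_llm.generate_content(...)` from the setup cell.

```python
from typing import Dict, List


def build_prompt(question: str, records: List[Dict[str, str]], max_records: int = 5) -> str:
    """Format retrieved records as numbered context lines followed by the question."""
    context_lines = [
        f"{i}. {r.get('title', '?')} by {r.get('artist', '?')} (emotion: {r.get('emotion', '?')})"
        for i, r in enumerate(records[:max_records], start=1)
    ]
    context = "\n".join(context_lines) if context_lines else "(no matching songs found)"
    return (
        "Answer the question using only the songs listed in the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example record taken from the sampled dataset shown earlier.
prompt = build_prompt(
    "Which of these songs is sad?",
    [{"title": "Ethereal Kingdom", "artist": "Beyond Creation", "emotion": "sadness"}],
)
print(prompt)
```

Capping the context with `max_records` keeps the prompt short. The explicit "no matching songs found" line gives the model a clear signal when retrieval returns nothing.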

## **Challenges & Future Development**

Some text

In [None]:
# Some code