### Lesson 3: Preparing Text Data for RAG

- source: https://learn.deeplearning.ai/courses/knowledge-graphs-rag/lesson/j4mw1/preparing-text-for-rag

#### Import packages and set up Neo4j

In [None]:
from dotenv import load_dotenv
import os

from langchain_community.graphs import Neo4jGraph

# Warning control
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

# Note: the line below is specific to this course environment and is not a
# standard part of Neo4j's integration with OpenAI. Remove it if you are
# running in your own environment.
OPENAI_ENDPOINT = os.getenv('OPENAI_BASE_URL') + '/embeddings'
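
If `OPENAI_BASE_URL` is not set, `os.getenv` returns `None` and the concatenation above raises a `TypeError`. A minimal sketch of a guard follows; `build_embeddings_endpoint` is a hypothetical helper introduced here for illustration and is not part of the course code.

```python
from typing import Optional


def build_embeddings_endpoint(base_url: Optional[str]) -> Optional[str]:
    """Return '<base_url>/embeddings', or None when no base URL is configured.

    Hypothetical helper (an assumption, not from the lesson): it only guards
    against a missing environment variable and a trailing slash.
    """
    if not base_url:
        return None
    # Strip a trailing slash so the result never contains '//embeddings'.
    return base_url.rstrip("/") + "/embeddings"
```

In the notebook this could be called as `build_embeddings_endpoint(os.getenv('OPENAI_BASE_URL'))`, with the caller deciding what to do when the result is `None`.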

In [None]:
# Connect to the knowledge graph instance using LangChain
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

#### Create a vector index

In [None]:
kg.query("""
    CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
    FOR (m:Movie) ON (m.taglineEmbedding) 
    OPTIONS { indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
    }}"""
)


In [None]:
kg.query("""
    SHOW VECTOR INDEXES
    """
)

#### Populate the vector index

Calculate a vector representation of each movie tagline using OpenAI.

Add the vector to the `Movie` node as the `taglineEmbedding` property.

In [None]:
kg.query("""
    MATCH (movie:Movie) WHERE movie.tagline IS NOT NULL
    WITH movie, genai.vector.encode(
        movie.tagline, 
        "OpenAI", 
        {
            token: $openAiApiKey,
            endpoint: $openAiEndpoint
        }) AS vector
    CALL db.create.setNodeVectorProperty(movie, "taglineEmbedding", vector)
    """, 
    params={"openAiApiKey":OPENAI_API_KEY, "openAiEndpoint": OPENAI_ENDPOINT} )

In [None]:
result = kg.query("""
    MATCH (m:Movie) 
    WHERE m.tagline IS NOT NULL
    RETURN m.tagline, m.taglineEmbedding
    LIMIT 1
    """
)

In [None]:
result[0]['m.tagline']

In [None]:
result[0]['m.taglineEmbedding'][:10]

In [None]:
len(result[0]['m.taglineEmbedding'])
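
The length printed above should match the `vector.dimensions` value (1536) given when the index was created. A small sketch of that check, assuming a hypothetical helper named `has_expected_dimensions` that is not part of the lesson:

```python
# Matches `vector.dimensions` in the index definition earlier in this lesson.
EXPECTED_DIMENSIONS = 1536


def has_expected_dimensions(embedding, expected=EXPECTED_DIMENSIONS):
    """Return True when the embedding exists and has the expected length.

    Hypothetical helper for illustration only.
    """
    return embedding is not None and len(embedding) == expected
```

In the notebook, `has_expected_dimensions(result[0]['m.taglineEmbedding'])` would apply the check to the row fetched above.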

#### Similarity search

Calculate an embedding for the question.

Identify matching movies based on the similarity between the question embedding and the `taglineEmbedding` vectors.

In [None]:
question = "What movies are about love?"

In [None]:
kg.query("""
    WITH genai.vector.encode(
        $question, 
        "OpenAI", 
        {
            token: $openAiApiKey,
            endpoint: $openAiEndpoint
        }) AS question_embedding
    CALL db.index.vector.queryNodes(
        'movie_tagline_embeddings', 
        $top_k, 
        question_embedding
        ) YIELD node AS movie, score
    RETURN movie.title, movie.tagline, score
    """, 
    params={"openAiApiKey":OPENAI_API_KEY,
            "openAiEndpoint": OPENAI_ENDPOINT,
            "question": question,
            "top_k": 5
            })
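
The index was created with the `cosine` similarity function, so the query above ranks taglines by how closely their embeddings point in the same direction as the question embedding. The sketch below illustrates the idea with a brute-force ranking in plain Python. The function names and the toy three-dimensional vectors are made up for illustration. A real vector index uses its own search strategy, and the `score` it returns may be scaled differently from the raw cosine value computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, candidates, k):
    """Return the k (name, score) pairs most similar to query_vector.

    `candidates` maps a name to its vector. This compares the query against
    every candidate, which is only practical for small collections.
    """
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, invented for this example (real embeddings have 1536 dimensions).
toy_taglines = {
    "tagline_a": [1.0, 0.0, 0.0],
    "tagline_b": [0.0, 1.0, 0.0],
    "tagline_c": [1.0, 1.0, 0.0],
}
top_matches = rank_by_similarity([1.0, 0.0, 0.0], toy_taglines, k=2)
```

Here `top_matches` lists `tagline_a` first, because its vector is identical in direction to the query, followed by `tagline_c`.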

#### Try for yourself: ask your own question!

Change the question below and run the graph query to find different movies.
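
To make it easier to try several questions, the similarity query from the previous section can be wrapped in a function. This is a sketch under stated assumptions: `find_similar_movies` and `SIMILARITY_QUERY` are hypothetical names introduced here, and the function expects an object with a `query(query, params=...)` method, such as the `kg` instance created earlier.

```python
# Same Cypher as the similarity search cell earlier in this lesson.
SIMILARITY_QUERY = """
    WITH genai.vector.encode(
        $question,
        "OpenAI",
        {
            token: $openAiApiKey,
            endpoint: $openAiEndpoint
        }) AS question_embedding
    CALL db.index.vector.queryNodes(
        'movie_tagline_embeddings',
        $top_k,
        question_embedding
        ) YIELD node AS movie, score
    RETURN movie.title, movie.tagline, score
"""


def find_similar_movies(kg, question, api_key, endpoint, top_k=5):
    """Run the tagline similarity search for one question.

    Hypothetical helper: `kg` is any object exposing query(query, params=...).
    """
    return kg.query(
        SIMILARITY_QUERY,
        params={
            "openAiApiKey": api_key,
            "openAiEndpoint": endpoint,
            "question": question,
            "top_k": top_k,
        },
    )
```

In the notebook, a call might look like `find_similar_movies(kg, "What movies are about adventure?", OPENAI_API_KEY, OPENAI_ENDPOINT)`.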