# What is RAG?

## Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is an approach that enhances the responses of large language models (LLMs) by providing them with relevant, up-to-date information retrieved from external sources.

RAG helps generate more accurate and tailored answers, especially when the required information is not present in the model’s training data.

The RAG process typically involves three main steps:


1. **Understanding the User Query**

    The system first interprets the user’s input or question to determine what information is needed.

2. **Information Retrieval**

    A retriever searches external data sources (such as documents, databases, or knowledge graphs) to find relevant information based on the user’s query.

3. **Response Generation**

    The retrieved information is inserted into the prompt, and the language model uses this context to generate a more accurate and relevant response.
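The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the keyword extraction, the in-memory document list, and the `fake_llm` placeholder are all assumptions made for the example, and a real system would call an actual retriever and model.

```python
import re

# Hypothetical in-memory "external data source" used only for this sketch
DOCUMENTS = [
    "Neo4j stores data as nodes and relationships.",
    "Cypher is the query language used by Neo4j.",
    "RAG inserts retrieved context into the prompt before generation.",
]

STOPWORDS = {"what", "is", "the", "a", "an", "of", "by", "how", "does"}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def understand_query(question: str) -> set[str]:
    """Step 1: reduce the question to keywords (a stand-in for real query understanding)."""
    return tokenize(question) - STOPWORDS


def retrieve(keywords: set[str], documents: list[str], top_k: int = 1) -> list[str]:
    """Step 2: rank documents by keyword overlap and keep the best matches."""
    def score(doc: str) -> int:
        return len(keywords & tokenize(doc))

    ranked = sorted(documents, key=score, reverse=True)
    return [doc for doc in ranked[:top_k] if score(doc) > 0]


def build_prompt(question: str, context: list[str]) -> str:
    """Step 3a: insert the retrieved information into the prompt."""
    context_block = "\n".join(f"- {item}" for item in context)
    return f"Context:\n{context_block}\n\nQuestion: {question}"


def fake_llm(prompt: str) -> str:
    """Step 3b: placeholder for a real model call (assumption for this sketch)."""
    return f"[model response based on a {len(prompt)}-character prompt]"


def answer(question: str) -> str:
    """Run the three steps in order."""
    keywords = understand_query(question)
    context = retrieve(keywords, DOCUMENTS)
    return fake_llm(build_prompt(question, context))


print(build_prompt("What is Cypher?", retrieve(understand_query("What is Cypher?"), DOCUMENTS)))
```

Keeping each step in its own function makes it straightforward to swap the toy retriever for full-text or vector search later without touching the prompt-building code.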


## Grounding

Grounding is the process of providing context to an LLM to improve the accuracy of its responses and reduce the likelihood of hallucinations.

<img 
    src="https://graphacademy.neo4j.com/courses/genai-fundamentals/2-rag/1-what-is-rag/images/llm-news-agency.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>
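As a rough illustration of grounding, the sketch below contrasts a bare question with a prompt that carries supporting context. The prompt wording and the sample fact are hypothetical; the point is only that the model is told to answer from the supplied context and to admit when the context is insufficient.

```python
def ungrounded_prompt(question: str) -> str:
    """The model sees only the question and must rely on its training data."""
    return question


def grounded_prompt(question: str, facts: list[str]) -> str:
    """The model sees numbered context and is asked to stay within it."""
    numbered = "\n".join(f"[{i}] {fact}" for i, fact in enumerate(facts, start=1))
    return (
        "Answer the question using only the numbered context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )


# Hypothetical fact that a model could not know from its training data
facts = ["The support desk is open from 09:00 to 17:00 on weekdays."]
print(grounded_prompt("When is the support desk open?", facts))
```

Numbering the context items is a design choice that lets the model, or a later checking step, refer back to the specific fact an answer relied on.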


## Retrievers

The retriever is a key component of the RAG process. A retriever is responsible for searching and retrieving relevant information from external data sources based on the user’s query.

A retriever typically takes an unstructured input (like a question or prompt) and searches for structured data that can provide context or answers.

Neo4j supports various methods for building retrievers, including:

- **Full-text search**

- **Vector search**

- **Text to Cypher**

You will explore these methods in the rest of the course.
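To give a feel for how the three methods differ, here is a hedged sketch that builds the Cypher (or prompt) each one would send. The procedures `db.index.fulltext.queryNodes` and `db.index.vector.queryNodes` are Neo4j built-ins, but the index names, the returned `title` property, and the prompt wording are assumptions for illustration, and the indexes would need to be created first.

```python
def fulltext_search(index_name: str, search_text: str, top_k: int = 5) -> tuple[str, dict]:
    """Keyword-style retrieval through a Neo4j full-text index."""
    query = (
        "CALL db.index.fulltext.queryNodes($index_name, $search_text) "
        "YIELD node, score "
        "RETURN node.title AS title, score "
        "LIMIT $top_k"
    )
    return query, {"index_name": index_name, "search_text": search_text, "top_k": top_k}


def vector_search(index_name: str, embedding: list[float], top_k: int = 5) -> tuple[str, dict]:
    """Similarity retrieval through a Neo4j vector index."""
    query = (
        "CALL db.index.vector.queryNodes($index_name, $top_k, $embedding) "
        "YIELD node, score "
        "RETURN node.title AS title, score"
    )
    return query, {"index_name": index_name, "top_k": top_k, "embedding": embedding}


def text_to_cypher_prompt(schema: str, question: str) -> str:
    """Prompt asking an LLM to translate a question into Cypher (hypothetical wording)."""
    return (
        "Write a Cypher query that answers the question.\n"
        f"Graph schema:\n{schema}\n"
        f"Question: {question}\n"
        "Return only the Cypher statement."
    )


# Hypothetical index name; the embedding would normally come from an embedding model
query, params = vector_search("moviePlots", [0.1, 0.2, 0.3], top_k=3)
print(query)
print(params)
```

With the official Python driver, a `(query, params)` pair like this could be run with `driver.execute_query(query, parameters_=params)`.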

## Data sources

The data sources used in the RAG process can vary widely, depending on the application and the type of information needed. Common data sources include:


- **Documents**: Textual data sources, such as articles, reports, or manuals, that can be searched for relevant information.

- **APIs**: External services that can provide real-time data or specific information based on user queries.

- **Knowledge Graphs**: Graph-based representations of information that can provide context and relationships between entities.


## Semantic Search

One of the challenges of RAG is understanding what the user is asking for and finding the correct information to pass to the LLM.

Semantic search aims to understand the intent and contextual meaning of a search phrase, rather than focusing on individual keywords.

Traditional keyword search, by contrast, often depends on exact-match keywords or proximity-based algorithms that find similar words.

<img 
    src="https://graphacademy.neo4j.com/courses/genai-fundamentals/2-rag/2-vector-search/images/llm-rag-vector-process.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>
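The difference between keyword and semantic matching can be shown with a small sketch. The three-dimensional vectors below are hand-made stand-ins for real embeddings (which typically have hundreds or thousands of dimensions), chosen only to illustrate how cosine similarity can rank a text highly even when it shares few words with the query.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_overlap(query: str, text: str) -> int:
    """Number of words the query and the text have in common."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Made-up query and documents with made-up vectors (assumptions for this sketch)
query_text = "films about playthings"
query_vector = [0.9, 0.1, 0.0]
documents = {
    "a story about toys": [0.8, 0.2, 0.1],
    "films about finance": [0.1, 0.1, 0.9],
}

best_by_keywords = max(documents, key=lambda text: keyword_overlap(query_text, text))
best_by_vectors = max(documents, key=lambda text: cosine_similarity(query_vector, documents[text]))

print("Keyword search picks:", best_by_keywords)
print("Semantic search picks:", best_by_vectors)
```

Here the keyword match favors the document that repeats the query's words, while the vector comparison favors the document whose (toy) embedding points in the same direction as the query.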

In [1]:
import os

from dotenv import load_dotenv

load_dotenv()

import textwrap
from neo4j import GraphDatabase
from utils import execute_query

neo4j_uri = os.getenv("NEO4J_URI")
neo4j_user = os.getenv("NEO4J_USERNAME")
neo4j_pass = os.getenv("NEO4J_PASSWORD")
neo4j_db = os.getenv("NEO4J_DATABASE")

neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_pass))

cypher = textwrap.dedent("""
MATCH (p:Person) DETACH DELETE p;
MATCH (m:Movie) DETACH DELETE m;

DROP CONSTRAINT Person_tmdbId IF EXISTS;
DROP CONSTRAINT Movie_movieId IF EXISTS;

CREATE CONSTRAINT Person_tmdbId IF NOT EXISTS
FOR (x:Person)
REQUIRE x.tmdbId IS UNIQUE;

CREATE CONSTRAINT Movie_movieId IF NOT EXISTS
FOR (x:Movie)
REQUIRE x.movieId IS UNIQUE;

LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing-cypher/persons.csv' AS row
MERGE (p:Person {tmdbId: toInteger(row.person_tmdbId)})
SET
p.imdbId = toInteger(row.person_imdbId),
p.bornIn = row.bornIn,
p.name = row.name,
p.bio = row.bio,
p.poster = row.poster,
p.url = row.url,
p.born = date(row.born),
p.died = date(row.died);

LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing-cypher/movies.csv' AS row
MERGE (m:Movie {movieId: toInteger(row.movieId)})
SET
m.tmdbId = toInteger(row.movie_tmdbId),
m.imdbId = toInteger(row.movie_imdbId),
m.released = date(row.released),
m.title = row.title,
m.year = toInteger(row.year),
m.plot = row.plot,
m.budget = toInteger(row.budget),
m.imdbRating = toFloat(row.imdbRating),
m.poster = row.movie_poster,
m.runtime = toInteger(row.runtime),
m.imdbVotes = toInteger(row.imdbVotes),
m.revenue = toInteger(row.revenue),
m.url = row.movie_url,
m.countries = split(row.countries, '|'),
m.languages = split(row.languages, '|');

LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing-cypher/acted_in.csv' AS row
MATCH (p:Person {tmdbId: toInteger(row.person_tmdbId)})
MATCH (m:Movie {movieId: toInteger(row.movieId)})
MERGE (p)-[r:ACTED_IN]->(m)
SET r.role = row.role;

LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing-cypher/directed.csv' AS row
MATCH (p:Person {tmdbId: toInteger(row.person_tmdbId)})
MATCH (m:Movie {movieId: toInteger(row.movieId)})
MERGE (p)-[r:DIRECTED]->(m);

MATCH (p:Person)-[:ACTED_IN]->()
WITH DISTINCT p SET p:Actor;

MATCH (p:Person)-[:DIRECTED]->()
WITH DISTINCT p SET p:Director;
""")

# Split the Cypher script into individual statements.
# A plain split on ';' is enough here because no statement contains a literal semicolon.
cypher_statements = [stmt.strip() for stmt in cypher.split(';') if stmt.strip()]

# Execute each statement individually
for i, statement in enumerate(cypher_statements, 1):
    print(f"Executing statement {i}/{len(cypher_statements)}...")
    try:
        res = execute_query(neo4j_driver, statement)
        print(f"Statement {i} completed successfully")
    except Exception as e:
        print(f"Error executing statement {i}: {e}")
        break

neo4j_driver.close()

Executing statement 1/12...
Statement 1 completed successfully
Executing statement 2/12...
Statement 2 completed successfully
Executing statement 3/12...
Statement 3 completed successfully
Executing statement 4/12...
Statement 4 completed successfully
Executing statement 5/12...
Statement 5 completed successfully
Executing statement 6/12...
Statement 6 completed successfully
Executing statement 7/12...
Statement 7 completed successfully
Executing statement 8/12...
Statement 8 completed successfully
Executing statement 9/12...
Statement 9 completed successfully
Executing statement 10/12...
Statement 10 completed successfully
Executing statement 11/12...
Statement 11 completed successfully
Executing statement 12/12...
Statement 12 completed successfully
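The cell above imports `execute_query` from a local `utils` module that is not shown. As an assumption about what such a helper might look like, the sketch below wraps the Neo4j Python driver's `Driver.execute_query` method and returns each record as a dictionary. The parameter names `parameters` and `database` are hypothetical choices for this example.

```python
def execute_query(driver, statement, parameters=None, database=None):
    """Run one Cypher statement and return its records as a list of dictionaries.

    `driver` is expected to be a neo4j.Driver (or anything exposing the same
    execute_query method); `database` selects the target database if given.
    """
    result = driver.execute_query(
        statement,
        parameters_=parameters or {},
        database_=database,
    )
    return [record.data() for record in result.records]
```

A helper with this shape would accept the call used in the cell, `execute_query(neo4j_driver, statement)`, and would also allow the otherwise unused `neo4j_db` value to be passed as `database=neo4j_db`.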


In [4]:
import pandas as pd

df = pd.read_csv("https://data.neo4j.com/importing-cypher/movies.csv")
print(df.shape)
df.head()

(93, 17)


Unnamed: 0,movieId,title,budget,countries,movie_imdbId,imdbRating,imdbVotes,languages,plot,movie_poster,released,revenue,runtime,movie_tmdbId,movie_url,year,genres
0,1,Toy Story,30000000.0,USA,114709,8.3,591836,English,A cowboy doll is profoundly threatened and jea...,https://image.tmdb.org/t/p/w440_and_h660_face/...,1995-11-22,373554033.0,81,862,https://themoviedb.org/movie/862,1995,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,65000000.0,USA,113497,6.9,198355,English|French,When two kids find and play a magical board ga...,https://image.tmdb.org/t/p/w440_and_h660_face/...,1995-12-15,262797249.0,104,8844,https://themoviedb.org/movie/8844,1995,Adventure|Children|Fantasy
2,3,Grumpier Old Men,,USA,113228,6.6,18615,English,John and Max resolve to save their beloved bai...,https://image.tmdb.org/t/p/w440_and_h660_face/...,1995-12-22,,101,15602,https://themoviedb.org/movie/15602,1995,Comedy|Romance
3,4,Waiting to Exhale,16000000.0,USA,114885,5.6,7210,English,"Based on Terry McMillan's novel, this film fol...",https://image.tmdb.org/t/p/w440_and_h660_face/...,1995-12-22,81452156.0,124,31357,https://themoviedb.org/movie/31357,1995,Comedy|Romance|Drama
4,5,Father of the Bride Part II,,USA,113041,5.9,25938,English,"In this sequel, George Banks deals not only wi...",https://image.tmdb.org/t/p/w440_and_h660_face/...,1995-12-08,76578911.0,106,11862,https://themoviedb.org/movie/11862,1995,Comedy


In [5]:
df = pd.read_csv("https://data.neo4j.com/rec-embed/movie-plot-embeddings-1k.csv")
print(df.shape)
df.head()

(1000, 3)


Unnamed: 0.1,Unnamed: 0,movieId,embedding
0,0,1,"[-0.026989128, -0.02415501, 0.0060582533, -0.0..."
1,1,2,"[-0.0016367682, -0.022421483, 0.004689293, 0.0..."
2,2,3,"[0.008853926, -0.02395768, 0.009994454, -0.023..."
3,3,4,"[-0.024737105, -0.034573562, -0.0049023638, -0..."
4,4,5,"[-0.0040508406, -0.024880992, -0.01273487, -0...."
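When a CSV like this is read, the `embedding` column arrives as text rather than as a list of numbers. The sketch below shows one way to parse it, assuming each value is a JSON-style list as the preview suggests. The two sample rows are made up and far shorter than real embeddings.

```python
import csv
import io
import json

# Two short made-up rows in the same layout as the movieId and embedding columns
SAMPLE_CSV = '''movieId,embedding
1,"[0.1, -0.2, 0.3]"
2,"[0.0, 0.5, -0.5]"
'''


def load_embeddings(csv_text: str) -> dict[int, list[float]]:
    """Map each movieId to its embedding, parsed from a string into a list of floats."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {int(row["movieId"]): json.loads(row["embedding"]) for row in rows}


embeddings = load_embeddings(SAMPLE_CSV)
print(len(embeddings[1]), "dimensions for movie 1")
```

With pandas, the same conversion could be applied to the DataFrame column with `df["embedding"].apply(json.loads)`, again assuming the stored strings are valid JSON lists.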
