# Vector Indexes

In the last lesson, you learned about embeddings, vectors and their role in RAG.

In this lesson, you will learn how to use a vector index in Neo4j to compare embeddings and find similar data.

## Movie Plots

GraphAcademy created a Neo4j sandbox of movie recommendations when you enrolled in this course. The recommendations database contains over 9000 movies, 15000 actors, and over 100000 user ratings.

Each movie has a .plot property.

In [13]:
import os
from dotenv import load_dotenv
load_dotenv()

import textwrap
from neo4j import GraphDatabase
from utils import execute_query, create_embedding

neo4j_uri = os.getenv("NEO4J_URI")
neo4j_user = os.getenv("NEO4J_USERNAME")
neo4j_pass = os.getenv("NEO4J_PASSWORD")
neo4j_db = os.getenv("NEO4J_DATABASE")

neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_pass))

neo4j_driver.verify_connectivity()
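
The `execute_query` helper is imported from a local `utils` module that this lesson does not show. The sketch below is one plausible shape for it, not the course's actual implementation: the `parameters` and `database` arguments are assumptions, and it relies on the driver's own `execute_query()` method, which returns the records, a summary, and the result keys.

```python
from typing import Any, Optional


def execute_query(driver: Any, cypher: str,
                  parameters: Optional[dict] = None,
                  database: Optional[str] = None) -> list:
    """Run one Cypher statement and return each record as a plain dict.

    Hypothetical stand-in for utils.execute_query; the real helper may differ.
    """
    # The driver returns a (records, summary, keys) triple.
    records, summary, keys = driver.execute_query(
        cypher,
        parameters_=parameters or {},
        database_=database,
    )
    # Record.data() converts a record into a dict keyed by the RETURN aliases.
    return [record.data() for record in records]
```

Returning plain dictionaries is what produces the list-of-dict outputs shown in the cells that follow.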

In [14]:
cypher = textwrap.dedent("""
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/rec-embed/movie-plot-embeddings-1k.csv'
AS row
MATCH (m:Movie {movieId: toInteger(row.movieId)})
CALL db.create.setNodeVectorProperty(
  m,
  'plotEmbedding',
  apoc.convert.fromJsonList(row.embedding)
);
""")
result = execute_query(neo4j_driver, cypher)

In [12]:
cypher = textwrap.dedent("""
MATCH (m:Movie)
RETURN count(m) AS movies,
       count(m.plotEmbedding) AS movies_with_embedding,
       head(collect(keys(m))) AS sample_keys
LIMIT 1;
""")

result = execute_query(neo4j_driver, cypher)

result

[{'movies': 93,
  'movies_with_embedding': 93,
  'sample_keys': ['revenue',
   'plotEmbedding',
   'imdbRating',
   'runtime',
   'imdbVotes',
   'title',
   'plot',
   'budget',
   'movieId',
   'imdbId',
   'released',
   'year',
   'countries',
   'languages',
   'genres',
   'tmdbId']}]

In [4]:
cypher = textwrap.dedent("""
MATCH (m:Movie)
WHERE m.plotEmbedding IS NOT NULL
RETURN m.title, m.plot
LIMIT 1
""")

result = execute_query(neo4j_driver, cypher)

result

[{'m.title': 'Toy Story',
  'm.plot': "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room."}]

In [5]:
cypher = textwrap.dedent("""
CREATE VECTOR INDEX moviePlots IF NOT EXISTS
FOR (m:Movie)
ON m.plotEmbedding
OPTIONS {indexConfig: {
 `vector.dimensions`: 1536,
 `vector.similarity_function`: 'cosine'
}};
""")

result = execute_query(neo4j_driver, cypher)

In [6]:
cypher = textwrap.dedent("""
MATCH (m:Movie {title: "Toy Story"})
RETURN m.title AS title, m.plot AS plot
""")

result = execute_query(neo4j_driver, cypher)

result

[{'title': 'Toy Story',
  'plot': "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room."}]

## Plot Embeddings

Embeddings have been created for 1000 movie plots. The embedding is stored in the .plotEmbedding property of the Movie nodes.

In [7]:
cypher = textwrap.dedent("""
MATCH (m:Movie {title: "Toy Story"})
RETURN m.title AS title, m.plot AS plot, m.plotEmbedding
LIMIT 1
""")

result = execute_query(neo4j_driver, cypher)

result

[{'title': 'Toy Story',
  'plot': "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
  'm.plotEmbedding': [-0.026989128440618515,
   -0.024155009537935257,
   0.006058253347873688,
   -0.024324016645550728,
   -0.022516941651701927,
   -0.0050864629447460175,
   -0.013442561961710453,
   -0.004462436772882938,
   0.001889954088255763,
   -0.017147717997431755,
   0.00504421116784215,
   -0.007975833490490913,
   0.03221534565091133,
   -0.012272513471543789,
   0.01178499311208725,
   0.02133389189839363,
   0.028627198189496994,
   -0.0005025522550567985,
   0.014040587469935417,
   -0.014157592318952084,
   0.0014495606301352382,
   0.008027835749089718,
   -0.0222049281001091,
   -0.025013046339154243,
   0.004394183866679668,
   -0.00825534574687481,
   0.023660989478230476,
   -0.025416063144803047,
   0.037181556224823,
   0.00314450659789145,
   0.008619360625743866,
   -0.012064504437148571,
   0.0060257520

In [8]:
cypher = textwrap.dedent("""
MATCH (m:Movie)
WHERE m.plotEmbedding IS NOT NULL
RETURN m.title, m.plot
LIMIT 5
""")

result = execute_query(neo4j_driver, cypher)

result

[{'m.title': 'Toy Story',
  'm.plot': "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room."},
 {'m.title': 'Jumanji',
  'm.plot': 'When two kids find and play a magical board game, they release a man trapped for decades in it and a host of dangers that can only be stopped by finishing the game.'},
 {'m.title': 'Grumpier Old Men',
  'm.plot': "John and Max resolve to save their beloved bait shop from turning into an Italian restaurant, just as its new female owner catches Max's attention."},
 {'m.title': 'Waiting to Exhale',
  'm.plot': "Based on Terry McMillan's novel, this film follows four very different African-American women and their relationships with the male gender."},
 {'m.title': 'Father of the Bride Part II',
  'm.plot': 'In this sequel, George Banks deals not only with the pregnancy of his daughter, but also with the unexpected pregnancy of his wife.'}]

## Querying Vector Indexes

You can query the moviePlots index using the [db.index.vector.queryNodes()](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/#query-vector-index) procedure.

The procedure returns the requested number of approximate nearest neighbor nodes and their similarity score, ordered by the score.

```cypher
CALL db.index.vector.queryNodes(
    indexName :: STRING,
    numberOfNearestNeighbours :: INTEGER,
    query :: LIST<FLOAT>
) YIELD node, score
```


The procedure accepts three parameters:

1. indexName - The name of the vector index

2. numberOfNearestNeighbours - The number of results to return

3. query - A list of floats that represent an embedding

The procedure yields two values:

1. node - A node that matches the query

2. score - A similarity score ranging from 0.0 to 1.0

You can use this procedure to find the closest embedding value to a given value.
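
To make the idea of a nearest-neighbour query concrete, here is a minimal pure-Python sketch that scores every candidate vector against a query vector and keeps the top results. It is an exact brute-force search over toy two-dimensional vectors, whereas the index performs an approximate search over 1536 dimensions. The mapping of cosine similarity into the 0.0 to 1.0 range as `(1 + cosine) / 2` is an assumption about how the score is normalized, and the function names are illustrative only.

```python
import math


def cosine_score(a, b):
    """Cosine similarity mapped from [-1, 1] to [0, 1] (assumed normalization)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return (1.0 + dot / (norm_a * norm_b)) / 2.0


def nearest_neighbours(query, candidates, k):
    """Exact top-k: score every candidate, then sort by score, highest first."""
    scored = [(name, cosine_score(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors: same direction, orthogonal, and opposite to the query.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [-1.0, 0.0],
}

print(nearest_neighbours([1.0, 0.0], toy_vectors, 2))
# [('a', 1.0), ('b', 0.5)]
```

A vector compared with itself scores 1.0, which is why the movie used as the query appears first in its own results.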

## Querying Similar Movie Plots

You can use the moviePlots vector index to find movies with similar plots.

Review this Cypher before running it.

```cypher
MATCH (m:Movie {title: 'Toy Story'})

CALL db.index.vector.queryNodes('moviePlots', 6, m.plotEmbedding)
YIELD node, score

RETURN node.title AS title, node.plot AS plot, score
```

The query finds the Toy Story Movie node and uses the .plotEmbedding property to find the most similar plots.

The `db.index.vector.queryNodes()` procedure uses the moviePlots vector index to find similar embeddings.

Run the query. The procedure returns the requested number of nodes and their similarity score, ordered by the score.

In [9]:
cypher = textwrap.dedent("""
MATCH (m:Movie {title: 'Toy Story'})

CALL db.index.vector.queryNodes('moviePlots', 6, m.plotEmbedding)
YIELD node, score

RETURN node.title AS title, node.plot AS plot, score
""")

result = execute_query(neo4j_driver, cypher)

result

[{'title': 'Toy Story',
  'plot': "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
  'score': 1.0},
 {'title': 'Indian in the Cupboard, The',
  'plot': 'On his ninth birthday a boy receives many presents. Two of them first seem to be less important: an old cupboard from his brother and a little Indian figure made of plastic from his best ...',
  'score': 0.9169706106185913},
 {'title': 'Powder',
  'plot': 'A young bald albino boy with unique powers shakes up the rural community he lives in.',
  'score': 0.9130690097808838},
 {'title': 'Jumanji',
  'plot': 'When two kids find and play a magical board game, they release a man trapped for decades in it and a host of dangers that can only be stopped by finishing the game.',
  'score': 0.9099509716033936},
 {'title': 'Babe',
  'plot': 'Babe, a pig raised by sheepdogs, learns to herd sheep with a little help from Farmer Hoggett.',
  'score': 0.9046024084091187},
 {'tit

## Generate Embeddings

You can generate a new embedding in Cypher using the [genai.vector.encode](https://neo4j.com/docs/cypher-manual/current/genai-integrations/#single-embedding) function:

```cypher
WITH genai.vector.encode(
    "Text to create embeddings for",
    "OpenAI",
    { token: "sk-..." }) AS embedding
RETURN embedding
```

## Generate a Plot Embedding

You can use the embedding to query the vector index to find similar movies.

This query creates an embedding for the text "A mysterious spaceship lands Earth" and uses it to query the moviePlots vector index for the 6 most similar movie plots.

```cypher
WITH genai.vector.encode(
    "A mysterious spaceship lands Earth",
    "OpenAI",
    { token: "sk-..." }) AS myMoviePlot
CALL db.index.vector.queryNodes('moviePlots', 6, myMoviePlot)
YIELD node, score
RETURN node.title, node.plot, score
```
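
When running this query from Python, it is usually safer to pass the API token as a query parameter than to paste it into the Cypher text. The sketch below builds the statement and its parameters; the function name `build_plot_search` and the `OPENAI_API_KEY` environment variable are assumptions made for illustration, not part of the course code.

```python
import os
import textwrap


def build_plot_search(text, top_k=6, index_name="moviePlots"):
    """Return (cypher, parameters) for an embedding-based plot search."""
    cypher = textwrap.dedent("""
        WITH genai.vector.encode($text, "OpenAI", { token: $token }) AS myMoviePlot
        CALL db.index.vector.queryNodes($indexName, $topK, myMoviePlot)
        YIELD node, score
        RETURN node.title AS title, node.plot AS plot, score
    """).strip()
    parameters = {
        "text": text,
        # Assumed variable name; use whatever your environment defines.
        "token": os.getenv("OPENAI_API_KEY", ""),
        "indexName": index_name,
        "topK": top_k,
    }
    return cypher, parameters


cypher, parameters = build_plot_search("A mysterious spaceship lands Earth")
# With a connected driver, one way to run it would be:
# records, summary, keys = neo4j_driver.execute_query(cypher, parameters_=parameters)
```

Keeping the token out of the query string also keeps it out of notebook outputs and query logs that record the statement text.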

## Considerations

Using embeddings and vectors is relatively straightforward and can quickly yield results. The downside to this approach is that it relies heavily on the embeddings and similarity function to produce valid results.

This approach is also a black box. Each embedding has 1536 dimensions, so it is impractical to work out what an individual dimension represents or how it influenced the similarity score.

The movies returned look similar, but without reading and comparing them, you would have no way of verifying that the results are correct.

Vectors work well for:

- Contextual or meaning-based questions

- Fuzzy or vague queries

- Broad or open-ended questions

- Complex queries with multiple concepts

Vectors are ineffective for:

- Highly specific or fact-based questions

- Numerical or exact-match queries

- Boolean or logical queries

- Ambiguous or unclear queries without context

- Specialized knowledge

In the next lesson, you will learn how to improve the results by combining vector and graph queries.