## Prerequisites

- **Docker**  
  Required to run Memgraph, as Memgraph is a native Linux application and cannot be installed directly on Windows or macOS.

- **pandas**  
  A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

- **kagglehub**  
  A library for downloading datasets and models from Kaggle; used here to fetch the Wikipedia Movie Plots dataset.

- **sentence_transformers**  
  A library for computing state-of-the-art sentence embeddings; used here to turn movie plots into vectors.

- **neo4j**  
  The Neo4j Python driver, used here to query Memgraph over the Bolt protocol.

# Build a Movie Similarity Search Engine with Vector Search in Memgraph

In this example, we will demonstrate how vector search can be used to find movies based on their plots or short descriptions. For this, we will use the Wikipedia Movie Plots dataset, available on Kaggle.
To get started, launch Memgraph with the `--experimental-enabled=vector-search` flag and the appropriate `--experimental-config` flag.

To start Memgraph, run:
`docker run -p 7687:7687 -p 7444:7444 memgraph/memgraph:latest --experimental-enabled=vector-search --experimental-config='{"vector-search": {"movies_index": {"label": "Movie","property": "embedding","dimension": 384,"capacity": 100, "metric": "cos"}}}'`

This creates a vector index named `movies_index` on the label `Movie` and the property `embedding`. The index is configured for 384-dimensional vectors and the cosine (`cos`) metric.
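
Hand-writing the JSON for `--experimental-config` inside shell quotes is easy to get wrong. As a minimal sketch, the same configuration can be built as a Python dictionary and serialized. The variable names `index_config`, `config_json`, and `flag` are our own; the keys and values are copied from the command above.

```python
import json
import shlex

# Keys and values copied from the docker command above.
index_config = {
    "vector-search": {
        "movies_index": {
            "label": "Movie",
            "property": "embedding",
            "dimension": 384,
            "capacity": 100,
            "metric": "cos",
        }
    }
}

# Serialize compactly, then shell-quote so the JSON survives the command line.
config_json = json.dumps(index_config, separators=(",", ":"))
flag = "--experimental-config=" + shlex.quote(config_json)
print(flag)
```

The printed flag can be pasted into the `docker run` command in place of the hand-written one.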

First, let’s load the dataset:

In [1]:
import pandas as pd
import kagglehub

dataset_path = kagglehub.dataset_download("jrobischon/wikipedia-movie-plots")
df = pd.read_csv(dataset_path + "/wiki_movie_plots_deduped.csv")
print(df.head())

   Release Year                             Title Origin/Ethnicity  \
0          1901            Kansas Saloon Smashers         American   
1          1901     Love by the Light of the Moon         American   
2          1901           The Martyred Presidents         American   
3          1901  Terrible Teddy, the Grizzly King         American   
4          1902            Jack and the Beanstalk         American   

                             Director Cast    Genre  \
0                             Unknown  NaN  unknown   
1                             Unknown  NaN  unknown   
2                             Unknown  NaN  unknown   
3                             Unknown  NaN  unknown   
4  George S. Fleming, Edwin S. Porter  NaN  unknown   

                                           Wiki Page  \
0  https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...   
1  https://en.wikipedia.org/wiki/Love_by_the_Ligh...   
2  https://en.wikipedia.org/wiki/The_Martyred_Pre...   
3  https://en.wikipedia.

This dataset consists of 32,432 movies. To keep the example small and easy to follow, we will keep only the movies directed by Christopher Nolan.

In [2]:
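# Keep only Christopher Nolan's movies and renumber the rows from 0,
# so that row positions line up with the embeddings array computed later
nolan_movies = df[df['Director'] == 'Christopher Nolan'].reset_index(drop=True)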
print(nolan_movies.shape)

(9, 8)


We also need a function to compute embeddings:

In [3]:
from sentence_transformers import SentenceTransformer

# Load the model once and reuse it for every call
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def compute_embeddings(texts):
    # Returns one embedding vector per input text
    return model.encode(texts)
    
embeddings = compute_embeddings(nolan_movies['Plot'].values)
print(embeddings.shape)

(9, 384)
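
The second dimension of the output, 384, matches the `dimension` value in the index configuration. A small client-side guard can make that assumption explicit before any data is imported. This is only a sketch: `check_dimension` is a hypothetical helper of our own, and how Memgraph itself treats vectors of the wrong length is not covered here.

```python
def check_dimension(vectors, expected_dim=384):
    """Raise ValueError if any vector's length differs from expected_dim.

    expected_dim defaults to 384, the dimension configured for movies_index.
    Returns the number of vectors checked.
    """
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has length {len(vec)}, expected {expected_dim}"
            )
    return len(vectors)
```

Calling `check_dimension(embeddings)` on the array computed above should return the number of rows, since iterating over a 2-D array yields one row at a time.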


Now, let’s import these movies into Memgraph:

In [4]:
import neo4j

driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))
with driver.session() as session:
    for index, row in nolan_movies.iterrows():
        # remove quotes from the title to avoid parsing issues
        title = row["Title"].replace('"', '')
        
        embedding = embeddings[index].tolist()
        embeddings_str = ",".join([str(x) for x in embedding])
        query = f'CREATE (m:Movie {{title: "{title}", year: {row["Release Year"]}, embedding: [{embeddings_str}]}})'
        print(query)
        session.run(query)

CREATE (m:Movie {title: "Memento", year: 2000, embedding: [-0.26310715079307556,0.21804100275039673,-0.21692103147506714,0.042848847806453705,0.15567198395729065,0.29619085788726807,0.11049992591142654,0.0012278594076633453,0.08074574172496796,-0.16938811540603638,0.1006438285112381,0.2729353606700897,0.0809355080127716,0.11047548055648804,-0.10480464994907379,-0.20997054874897003,-0.081610307097435,0.4574175477027893,0.1803778111934662,0.2987024486064911,0.23225107789039612,0.02696278877556324,0.36068809032440186,-0.10776342451572418,0.20817792415618896,-0.02895597741007805,0.017136691138148308,-0.0403948575258255,-0.1979285180568695,-0.0825573205947876,0.13494744896888733,-0.08231435716152191,0.20165666937828064,-0.1465110033750534,0.02957802638411522,-0.08240166306495667,-0.2437560111284256,0.1404237598180771,-0.294454962015152,-0.3005193769931793,-0.17102986574172974,0.4151918590068817,0.008151259273290634,0.07082422822713852,-0.05849827080965042,0.21061791479587555,-0.334326773881

As you can see, we computed the embeddings with the `compute_embeddings` function and stored each vector in the `embedding` property of the corresponding node.
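
Building the query with an f-string requires stripping quotes from titles by hand. A possible alternative is to pass the values as query parameters, which the `neo4j` driver supports through keyword arguments to `session.run`. The sketch below assumes Memgraph accepts a list parameter for the `embedding` property in the same way as a list literal; `CREATE_MOVIE` and `insert_movie` are names of our own.

```python
# Parameterized version of the CREATE query used above.
CREATE_MOVIE = "CREATE (m:Movie {title: $title, year: $year, embedding: $embedding})"

def insert_movie(session, title, year, embedding):
    """Create one Movie node, passing all values as query parameters."""
    session.run(
        CREATE_MOVIE,
        title=str(title),
        year=int(year),  # convert NumPy integers to plain int
        embedding=[float(x) for x in embedding],  # plain list of floats
    )
```

Inside the import loop, this could be called as `insert_movie(session, row["Title"], row["Release Year"], embeddings[index])`, with no need to remove quotes from the title.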

With the data imported into Memgraph, we can start experimenting!

Let’s define a function that finds the movies most similar to a given plot description:

In [5]:
def find_movie(plot):
    embeddings = compute_embeddings([plot])
    embeddings_str = ",".join([str(x) for x in embeddings[0]])
    with driver.session() as session:
        # Ask the index for the 3 stored vectors closest to the query embedding
        query = f"CALL vector_search.search('movies_index', 3, [{embeddings_str}]) YIELD node, similarity RETURN node.title, similarity"
        result = session.run(query)
        for record in result:
            print(record)
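
Conceptually, this search ranks the stored vectors by how close they are to the query vector and returns the top few. The sketch below shows a brute-force version of that idea in plain Python, using cosine similarity because the index is configured with the `cos` metric. It is an illustration only: `cosine_similarity` and `top_k` are our own helpers, and the index inside Memgraph may use a different (possibly approximate) algorithm and score definition.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query, items, k=3):
    """Return the k (title, score) pairs with the highest similarity.

    items is a list of (title, vector) pairs.
    """
    scored = [(title, cosine_similarity(query, vec)) for title, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy two-dimensional data, not real movie embeddings.
toy = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [1.0, 1.0])]
print(top_k([1.0, 0.0], toy, k=2))
```

With the toy data, `A` points in the same direction as the query and ranks first, followed by `C`. A brute-force scan like this compares the query with every stored vector, which is the cost a vector index is meant to avoid on larger datasets.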

Now, let’s try to find Inception using the following plot description:

In [6]:
plot = "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O."
find_movie(plot)

<Record node.title='Inception' similarity=0.5250678062438965>
<Record node.title='Interstellar' similarity=0.2907602787017822>
<Record node.title='The Dark Knight' similarity=0.2784501910209656>


Next, let’s attempt to find Memento:

In [7]:
plot = "An insurance investigator suffers from anterograde amnesia, leaving him unable to form new memories, and uses notes and tattoos to track down his wife's killer."
find_movie(plot)

<Record node.title='Memento' similarity=0.37598633766174316>
<Record node.title='Insomnia' similarity=0.26347029209136963>
<Record node.title='Inception' similarity=0.23714923858642578>
