# Semantic Search with Movie Plots

How do you find movies based on what they're about? Semantic search.

We can search a movie database with a free-text phrase and return the movies whose plots are most similar to it. In this example, we build semantic search over the Wikipedia-Movie-Plots dataset found on [Kaggle](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots), using a vector database and the sentence-transformers library. We use [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to run the vector database locally.

We begin by installing the necessary libraries:

In [None]:
! pip install pymilvus==2.2.5 sentence-transformers gdown milvus

Next, we download the data and unzip it.

In [None]:
import gdown
url = 'https://drive.google.com/uc?id=11ISS45aO2ubNCGaC3Lvd3D7NT8Y7MeO8'
output = './movies.zip'
gdown.download(url, output)

import zipfile

with zipfile.ZipFile("./movies.zip","r") as zip_ref:
    zip_ref.extractall("./movies")

We need to establish some constants for our vector database.

In [None]:
COLLECTION_NAME = 'movies_db'  # Collection name
DIMENSION = 384  # Embeddings size

# Inference Arguments
BATCH_SIZE = 128

# Search Arguments
TOP_K = 3

With our constants established for consistency, we spin up an instance of Milvus to locally run a vector database, making sure that we're not duplicating any existing collection.

In [None]:
from milvus import default_server
from pymilvus import connections, utility

# (OPTIONAL) Set this if you want to store all related data in a specific location
# Default location:
#   %APPDATA%/milvus-io/milvus-server on Windows
#   ~/.milvus-io/milvus-server on Linux
# default_server.set_base_dir('milvus_data')

# (OPTIONAL) Uncomment if you want to clean up data from a previous run
# default_server.cleanup()

# Start your milvus server
default_server.start()

# Connect to the server on localhost
# The port is given by default_server.listen_port
connections.connect(host='127.0.0.1', port=default_server.listen_port)

# Check if the server is ready.
print(utility.get_server_version())

In [None]:
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

Now we have an instance of a vector database spun up. Let's define our schema and create a collection.

For these movies, each object in the database needs three components: an ID, a title, and the embedding.

In [None]:
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection


# Create collection which includes the id, title, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=200),  # VARCHARS need a maximum length, so for this example they are set to 200 characters
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)
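
The schema caps `title` at 200, so a longer title would presumably be rejected at insert time. The following is a minimal sketch of a guard, assuming that truncating long titles is acceptable for this example. `fit_title` and `MAX_TITLE_LENGTH` are hypothetical names that do not appear in the original notebook. The cut is made on UTF-8 bytes so that it holds whether the limit is counted in characters or in bytes.

```python
MAX_TITLE_LENGTH = 200  # hypothetical constant mirroring max_length in the schema


def fit_title(title: str, limit: int = MAX_TITLE_LENGTH) -> str:
    """Trim a title so that its UTF-8 encoding is at most `limit` bytes."""
    encoded = title.encode("utf-8")
    if len(encoded) <= limit:
        return title
    # errors="ignore" drops a multi-byte character that the cut split in half
    return encoded[:limit].decode("utf-8", errors="ignore")


print(len(fit_title("A" * 250)))
```

Such a helper could be applied to each title before it is added to a batch.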

Next, we need to define the vector index. For this example, we use an IVF_FLAT index with the L2 distance metric and `nlist` set to 128, which partitions the vectors into 128 clusters, just like we do in the [reverse image search example notebook](../vision/reverse_painting_search.ipynb).

In [None]:
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
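
To make the roles of `nlist` and `nprobe` more concrete, here is a toy, pure-Python sketch of the inverted-file idea: vectors are grouped by nearest centroid, and a query scans only the closest groups. This is an illustration under simplifying assumptions, not how Milvus implements it. The centroids are hard-coded as a stand-in for a clustering step, and `build_ivf` and `ivf_search` are hypothetical names.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in buckets[c]]
    ranked = sorted(candidates, key=lambda idx: l2(query, vectors[idx]))
    return ranked[:top_k]


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.1)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # assumed clustering output; nlist = 3
buckets = build_ivf(vectors, centroids)
print(ivf_search((0.1, 0.1), vectors, centroids, buckets, nprobe=1, top_k=2))
```

A larger `nprobe` scans more buckets, which generally trades speed for recall.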

With our local vector database set up, we can dive into creating vectors out of movie plots and putting them into a vector space.

For this example, we use the [MiniLM L6 v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) sentence transformer.

In [None]:
import csv
from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer('all-MiniLM-L6-v2')

With our embeddings extractor loaded, we need the movie titles and plots to embed. Taking a look at the data from the [Kaggle page](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots), we see that the data contains eight columns. We are only interested in the title (column 2) and the plot (column 8) so our `csv_load` function extracts just those.

The second function we write in the block below takes a batch made of two parallel lists, the titles and the plots collected from `csv_load`, embeds the plots, and inserts the titles and embeddings into our vector database.

In [None]:
# Extract the movie titles and plots, skipping rows where either is empty
def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if '' in (row[1], row[7]):
                continue
            yield (row[1], row[7])


# Embed a batch of plots with SentenceTransformer and insert it with the titles
def embed_insert(data: list):
    # data[0] holds the titles, data[1] holds the plots
    embeds = transformer.encode(data[1])
    ins = [
        data[0],
        list(embeds)
    ]
    collection.insert(ins)
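
To see the column extraction in isolation, here is a small sketch that applies the same logic as `csv_load` to an in-memory CSV instead of the downloaded file. The sample rows are made up for illustration, and `iter_title_plot` with its `skip_header` flag is a hypothetical variant. Whether the downloaded file starts with a header row is an assumption worth checking, since a header would otherwise be embedded like any other movie.

```python
import csv
import io


def iter_title_plot(lines, skip_header=True):
    """Yield (title, plot) pairs, skipping rows where either field is empty."""
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)  # assumption: the first row holds column names
    for row in reader:
        if len(row) < 8 or '' in (row[1], row[7]):
            continue
        yield (row[1], row[7])


# Made-up sample data with the eight-column layout described above
sample = io.StringIO(
    'Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot\n'
    '1985,Rocky IV,American,Sylvester Stallone,,drama,,"A boxer fights a Soviet champion, far from home."\n'
    '1999,Untitled,American,,,,,\n'
)
print(list(iter_title_plot(sample)))
```

The second sample row has an empty plot, so only the first row is yielded.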

The vector database is set up, we have a model for the embeddings, and we have the functions we need to get embeddings from our text. The next step to be able to semantically search movie plots is to get the embeddings and populate the database.

*This step takes over 12 minutes on an M1 Mac with 16GB of RAM because we are processing about 35,000 movies!*

In [None]:
data_batch = [[],[]]

for title, plot in csv_load('./movies/plots.csv'):
    data_batch[0].append(title)
    data_batch[1].append(plot)
    if len(data_batch[0]) % BATCH_SIZE == 0:
        embed_insert(data_batch)
        data_batch = [[],[]]

# Embed and insert the remainder
if len(data_batch[0]) != 0:
    embed_insert(data_batch)

# Flush so that any unsealed segments are sealed and indexed.
collection.flush()
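
The batching pattern used in the loop above can be sketched as a standalone generator, which makes the handling of the final partial batch easier to see. `batch_pairs` is a hypothetical helper, not part of the original notebook, and it runs on made-up data rather than the movie file.

```python
def batch_pairs(pairs, batch_size):
    """Group (title, plot) pairs into [titles, plots] batches of at most batch_size."""
    titles, plots = [], []
    for title, plot in pairs:
        titles.append(title)
        plots.append(plot)
        if len(titles) == batch_size:
            yield [titles, plots]
            titles, plots = [], []
    if titles:  # emit the remainder, which is smaller than batch_size
        yield [titles, plots]


pairs = [(f"title {i}", f"plot {i}") for i in range(5)]
sizes = [len(batch[0]) for batch in batch_pairs(pairs, batch_size=2)]
print(sizes)
```

With five items and a batch size of two, the last batch carries the single leftover item.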

Now we are ready to run a semantic search on our movies and their plots. 

We start by coming up with some search terms and embedding them using the same transformer we used to get the embeddings for the movie plots before. Then, we search the collection for these embeddings and output the titles for the top 3 results.

For this example, we search for two movies. The two example phrases are "We do not talk about fight club." and "Boxing with a Russian." Ideally, the results include Fight Club for the first phrase and Rocky IV for the second.

In [None]:
import time

# Search for titles whose plots closely match these phrases.
search_terms = ['We do not talk about fight club.', 'Boxing with a Russian.']

# Embed the search phrases with the same model used for the plots
def embed_search(data):
    embeds = transformer.encode(data)
    return list(embeds)

search_data = embed_search(search_terms)

start = time.time()
res = collection.search(
    data=search_data,  # Embedded search phrases
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "L2",
           "params": {"nprobe": 10}},
    limit=TOP_K,  # Limit to top_k results per search
    output_fields=['title']  # Include title field in result
)
end = time.time()

for hits_i, hits in enumerate(res):
    print('Query:', search_terms[hits_i])
    print('Search Time (all queries):', end-start)
    print('Results:')
    for hit in hits:
        print(hit.entity.get('title'), '----', hit.distance)
    print()
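
With the L2 metric, a smaller distance means a closer match. If the embeddings are unit-length and the reported distance is the squared Euclidean distance (both are assumptions here, worth verifying against the model card and the Milvus metric documentation), the distance can be converted to a cosine similarity using the identity ||a - b||^2 = 2 - 2*cos(a, b). The sketch below checks that identity on two hand-made unit vectors; `cosine_from_squared_l2` is a hypothetical helper.

```python
import math


def cosine_from_squared_l2(squared_distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - squared_distance / 2.0


# Two unit vectors that are 60 degrees apart
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
squared = sum((x - y) ** 2 for x, y in zip(a, b))
print(round(cosine_from_squared_l2(squared), 6))
```

Under those assumptions, a distance near 0 corresponds to a cosine similarity near 1.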

In [None]:
# cleanup
default_server.stop()