# Vector Search of Movie Plots with Milvus

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/milvus/milvus_2_movie_search.ipynb)


This notebook demonstrates how to do 'semantic search' over movie plots. For example, we can query for movies like:

- "Where humans fight aliens"
- "Relationship drama between two good friends"

We will:

- 👉 Use this [movie data](https://huggingface.co/datasets/MongoDB/embedded_movies)
- 👉 Compute embeddings for the plot text using an embedding model
- 👉 Load the embedded data into [Milvus](https://milvus.io/), a popular vector database
- 👉 Run queries

References
- [Milvus quick start](https://milvus.io/docs/quickstart.md)

**This notebook is designed to run in a local Python environment and in Google Colab 😄**

## Configuration

### Embedding models

Hugging Face hosts many embedding models (sentence transformers): https://huggingface.co/models?library=sentence-transformers&sort=trending

Here is a selection of models for comparison, taken from the leaderboard: https://huggingface.co/spaces/mteb/leaderboard

| model name                              | overall score | model params | model size | embedding length | url                                                            |
|-----------------------------------------|---------------|--------------|------------|------------------|----------------------------------------------------------------|
| intfloat/e5-mistral-7b-instruct         | 66.x          | 7.11 B       | 15 GB      | 4096             | https://huggingface.co/intfloat/e5-mistral-7b-instruct         |
| BAAI/bge-large-en-v1.5                  | 64.x          | 335 M        | 1.34 GB    | 1024             | https://huggingface.co/BAAI/bge-large-en-v1.5                  |
| BAAI/bge-small-en-v1.5                  | 62.x          | 33.5 M       | 133 MB     | 384              | https://huggingface.co/BAAI/bge-small-en-v1.5                  |
| sentence-transformers/all-mpnet-base-v2 | 57.8          |              | 438 MB     | 768              | https://huggingface.co/sentence-transformers/all-mpnet-base-v2 |
| sentence-transformers/all-MiniLM-L12-v2 | 56.x          |              | 134 MB     | 384              | https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 |
| sentence-transformers/all-MiniLM-L6-v2  | 56.x          |              | 91 MB      | 384              | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2  |
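The embedding length in the table also determines how much space the stored vectors take. The sketch below estimates the raw vector size for a few of the models, assuming each value is stored as a 32-bit float and ignoring index and metadata overhead. The function name `raw_vector_megabytes` is ours, and 1473 is the number of movies with plots loaded later in this notebook.

```python
def raw_vector_megabytes(num_vectors: int, embedding_length: int,
                         bytes_per_value: int = 4) -> float:
    """Size of the vectors alone, in MB (1 MB = 1,000,000 bytes).

    Assumes float32 values (4 bytes each); real storage adds overhead.
    """
    return num_vectors * embedding_length * bytes_per_value / 1_000_000


# Embedding lengths copied from the comparison table
models = [
    ("BAAI/bge-small-en-v1.5", 384),
    ("sentence-transformers/all-mpnet-base-v2", 768),
    ("BAAI/bge-large-en-v1.5", 1024),
    ("intfloat/e5-mistral-7b-instruct", 4096),
]

for name, dim in models:
    print(f"{name:45s} {raw_vector_megabytes(1473, dim):8.2f} MB")
```

For a dataset this small every option fits comfortably in memory, so the choice is mostly about model size and encoding speed.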


In [1]:
class MyConfig:
    pass

MY_CONFIG = MyConfig()

## different embedding models to try

# MY_CONFIG.MODEL_NAME = "BAAI/bge-large-en-v1.5"
# MY_CONFIG.EMBEDDING_LENGTH = 1024

MY_CONFIG.MODEL_NAME = "BAAI/bge-small-en-v1.5"
MY_CONFIG.EMBEDDING_LENGTH = 384

# MY_CONFIG.MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
# MY_CONFIG.EMBEDDING_LENGTH = 768

# MY_CONFIG.MODEL_NAME = "all-MiniLM-L6-v2"
# MY_CONFIG.EMBEDDING_LENGTH = 384
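
`MODEL_NAME` and `EMBEDDING_LENGTH` have to agree, otherwise the collection dimension will not match the vectors. One way to keep them in sync, sketched here, is a single lookup table so that switching models is a one-line change. `EMBEDDING_LENGTHS` and `select_model` are hypothetical names, and the lengths are copied from the table above.

```python
from types import SimpleNamespace

# Embedding length for each model, copied from the comparison table
EMBEDDING_LENGTHS = {
    "BAAI/bge-large-en-v1.5": 1024,
    "BAAI/bge-small-en-v1.5": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
}


def select_model(model_name: str) -> SimpleNamespace:
    """Return a config object whose name and embedding length always match."""
    name = model_name.strip()  # guard against stray whitespace in the name
    if name not in EMBEDDING_LENGTHS:
        raise ValueError(f"Unknown model: {name!r}")
    return SimpleNamespace(MODEL_NAME=name,
                           EMBEDDING_LENGTH=EMBEDDING_LENGTHS[name])


config = select_model("BAAI/bge-small-en-v1.5")
print(config.MODEL_NAME, config.EMBEDDING_LENGTH)
```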



## Determine Runtime

In [2]:
# are we running in Colab?
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   MY_CONFIG.RUNNING_IN_COLAB = True
else:
   print("NOT running in Colab")
   MY_CONFIG.RUNNING_IN_COLAB = False

NOT running in Colab


## Install Dependencies (If required)

In [3]:
if MY_CONFIG.RUNNING_IN_COLAB:
  !pip install pymilvus  'pymilvus[model]'  datasets

## Load Data

We will load the movie dataset. Each record has the following fields:

- plot: A brief summary of the movie's plot.
- title: The title of the movie.
- and many more ...

See the [dataset description](https://huggingface.co/datasets/MongoDB/embedded_movies) for the full description.


In [4]:
from datasets import load_dataset
import random

dataset = load_dataset("MongoDB/embedded_movies")['train']

# convert the dataset to a list of dicts; we only want movies with plots
movies = [row for row in dataset if row['plot']]
print (f'Loaded {len(movies)} movies')

# select a few attributes
movies = [{k : v for k, v in m.items() if k in ['title', 'plot']} for m in movies ]

random.sample(movies, 5)

Loaded 1473 movies


[{'plot': 'A young humanoid alien who gets stranded on earth hooks up with a grizzled old sheriff in a western town and tries to help him solve a tough case, but the sheriff doesn\'t want any help from a "kid."',
  'title': 'Uno sceriffo extraterrestre... poco extra e molto terrestre'},
 {'plot': 'Archeologist Jack keeps having reoccurring dreams of a past life, where he is the great General Meng Yi, whom is sworn to protect a Korean Princess named OK-soo. Jack decides to go investigate everything with his friend William.',
  'title': 'The Myth'},
 {'plot': 'A corrupt businessman commits a murder and the only witness is the girlfriend of another businessman with close connections to the Chinese government, so a bodyguard from Beijing is ...',
  'title': 'The Defender'},
 {'plot': 'Col. Paul Tibbetts piloted the plane that dropped the atomic bomb on Hiroshima in World War 2.',
  'title': 'Above and Beyond'},
 {'plot': 'The tough biker Harley and his no less tough cowboy friend Marlboro 

## Setup Embedded Database

Milvus can run as an embedded database, which makes it easy to use.

After we execute this code, you will see the `milvus_demo.db` and `.milvus_demo.db.lock` files in the current folder.

In [5]:
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")

## Create a Collection



In [6]:
# if we already have a collection, clear it first
if client.has_collection(collection_name="movies"):
    client.drop_collection(collection_name="movies")

client.create_collection(
    collection_name="movies",
    dimension=MY_CONFIG.EMBEDDING_LENGTH
)


## Calculate Embeddings for Plots

In [7]:
import torch

# Set the default device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print ('Using device : ', device)

Using device :  cpu


In [8]:
from pymilvus import model
import random

# If the connection to https://huggingface.co/ fails, uncomment the following lines to use a mirror
# import os
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# embedding_fn = model.DefaultEmbeddingFunction()

## initialize the SentenceTransformerEmbeddingFunction
embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(
    model_name=MY_CONFIG.MODEL_NAME,
    device=device 
)

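# calculate embeddings for plots
# embedding_fn takes a list of texts and returns one vector per text,
# so we pass a one-element list and keep the first (only) vector as a plain list
for i, m in enumerate (movies):
    m['id'] = i
    m['vector'] = list(embedding_fn ([m['plot']])[0])
    # m['vector'] = list(embedding_fn.encode_documents ([m['plot']])[0])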

# print a sample
for m in random.sample (movies, 3):
    print ('id:', m['id'] )
    print ('title: ', m['title'])
    print ('plot : ', m['plot'])
    print ('vector dim :',  len(m["vector"]))
    print ('vector[:10] :', m["vector"][:10])
    print()

id: 567
title:  The Getaway
plot :  Doc McCoy is put in prison because his partners chickened out and flew off without him after exchanging a prisoner with a lot of money. Doc knows Jack Benyon, a rich "business"-man, is up ...
vector dim : 384
vector[:10] : [-0.04328088, 0.034391157, -0.017175872, -0.009192432, 0.07682802, -0.06499585, 0.084120356, -0.005084703, 0.012017756, -0.017768722]

id: 181
title:  Circle of Iron
plot :  A young martial artist embarks on an adventure, encountering other martial artists in battle until one day he meets an aging blind man who will show him the true meaning of martial arts and life.
vector dim : 384
vector[:10] : [-0.04944184, 0.032928616, -0.0026674287, 0.03647252, -0.039496258, -0.016607348, 0.071427315, -0.062343694, -0.021074358, -0.07305081]

id: 509
title:  El mariachi
plot :  A traveling mariachi is mistaken for a murderous criminal and must hide from a gang bent on killing him.
vector dim : 384
vector[:10] : [0.0135182245, 0.006492788, -0.
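
Encoding one plot per call is easy to follow, but the embedding function takes a list of texts, and embedding models usually handle several texts per call more efficiently. The sketch below shows a batched variant of the loop. It uses a stand-in `fake_embed` function so it runs without downloading a model; in this notebook you would pass `embedding_fn` instead. `embed_in_batches` and the batch size of 64 are illustrative choices, not part of pymilvus.

```python
from typing import Callable, Dict, List


def embed_in_batches(rows: List[Dict],
                     embed: Callable[[List[str]], List],
                     batch_size: int = 64) -> List[Dict]:
    """Assign an 'id' and a 'vector' to every row, encoding plots in batches."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # one call per batch: a list of texts in, one vector per text out
        vectors = embed([row["plot"] for row in batch])
        for offset, (row, vector) in enumerate(zip(batch, vectors)):
            row["id"] = start + offset
            row["vector"] = vector
    return rows


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for a real model: a 2-value 'vector' per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


sample = [
    {"title": "A", "plot": "aliens invade"},
    {"title": "B", "plot": "two friends fall out"},
    {"title": "C", "plot": "a heist goes wrong"},
]
embed_in_batches(sample, fake_embed, batch_size=2)
for row in sample:
    print(row["id"], row["title"], row["vector"])
```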

## Insert data

In [9]:
res = client.insert(collection_name="movies", data=movies)

print('inserted # rows', res['insert_count'])
print('cost', res['cost'])

inserted # rows 1473
cost 0
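
Before inserting, it can help to confirm that every row has what the collection expects. The sketch below assumes the collection was created with only a name and a dimension, as above, so that each row needs a unique integer `id` and a `vector` whose length equals that dimension. `validate_rows` is a hypothetical helper, not a pymilvus function.

```python
from typing import Dict, List


def validate_rows(rows: List[Dict], dimension: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_ids = set()
    for pos, row in enumerate(rows):
        row_id = row.get("id")
        if not isinstance(row_id, int):
            problems.append(f"row {pos}: missing or non-integer id")
        elif row_id in seen_ids:
            problems.append(f"row {pos}: duplicate id {row_id}")
        else:
            seen_ids.add(row_id)

        vector = row.get("vector")
        if vector is None:
            problems.append(f"row {pos}: missing vector")
        elif len(vector) != dimension:
            problems.append(
                f"row {pos}: vector length {len(vector)}, expected {dimension}")
    return problems


toy_rows = [
    {"id": 0, "vector": [0.1, 0.2, 0.3]},
    {"id": 0, "vector": [0.1, 0.2]},   # duplicate id and wrong length
    {"vector": [0.4, 0.5, 0.6]},       # no id
]
for problem in validate_rows(toy_rows, dimension=3):
    print(problem)
```

In this notebook the equivalent call would be `validate_rows(movies, MY_CONFIG.EMBEDDING_LENGTH)` just before `client.insert`.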


## Perform Vector Search (the FUN part!)

Let's do a semantic search on plot lines
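
Conceptually, a vector search embeds the query and ranks the stored vectors by how similar they are to it. The sketch below does this by brute force on toy 3-dimensional vectors. It assumes cosine similarity as the metric (the Milvus quick start describes this as the default), so a higher score means a closer match; a vector database is free to use an index instead of this exhaustive loop. `cosine_similarity` and `top_k` are illustrative names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity with a zero vector is undefined; treat as 0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: List[Dict], k: int = 5) -> List[Tuple[float, str]]:
    """Score every row against the query and return the k best (score, title) pairs."""
    scored = [(cosine_similarity(query, row["vector"]), row["title"]) for row in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


toy_rows = [
    {"title": "Space war", "vector": [0.9, 0.1, 0.0]},
    {"title": "Buddy drama", "vector": [0.1, 0.9, 0.2]},
    {"title": "Alien comedy", "vector": [0.7, 0.3, 0.4]},
]
for score, title in top_k([1.0, 0.0, 0.0], toy_rows, k=2):
    print(f"{score:.3f}  {title}")
```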

In [10]:
from pprint import pprint

## helper function to perform vector search
def  do_vector_search (query):
    # query_vectors = embedding_fn.encode_queries([query])
    query_vectors = embedding_fn([query])

    results = client.search(
        collection_name="movies",  # target collection
        data=query_vectors,  # query vectors
        limit=5,  # number of returned entities
        output_fields=["title", "plot"],  # specifies fields to be returned
    )
    return results
## ----


def  print_search_results (results):
    # pprint (results)
    print ('num results : ', len(results[0]))

    for i, r in enumerate (results[0]):
        #pprint(r, indent=4)
        print (i+1)
        print ('search score:', r['distance'])
        print ('title:', r['entity']['title'])
        print ('plot:', r['entity']['plot'])
        print()

In [11]:
query = "Where humans fight aliens"

results = do_vector_search (query)
print_search_results (results)

num results :  5
1
search score: 0.8236286640167236
title: Independence Day
plot: The aliens are coming and their goal is to invade and destroy Earth. Fighting superior technology, mankind's best weapon is the will to survive.

2
search score: 0.7872649431228638
title: Starship Troopers
plot: Humans in a fascistic, militaristic future do battle with giant alien bugs in a fight for survival.

3
search score: 0.7458122372627258
title: V: The Final Battle
plot: A small group of human resistance fighters fight a desperate guerilla war against the genocidal extra-terrestrials who dominate Earth.

4
search score: 0.7356772422790527
title: Enemy Mine
plot: A soldier from Earth crash-lands on an alien world after sustaining battle damage. Eventually he encounters another survivor, but from the enemy species he was fighting; they band together ...

5
search score: 0.7283806204795837
title: Battlefield Earth
plot: After enslavement & near extermination by an alien race in the year 3000, humanity begi

In [12]:
query = "Relationship drama between friends"

results = do_vector_search (query)
print_search_results (results)

num results :  5
1
search score: 0.7513974905014038
title: Varalaaru
plot: Relationships become entangled in an emotional web.

2
search score: 0.6959847211837769
title: Once a Thief
plot: A romantic and action packed story of three best friends, a group of high end art thieves, who come into trouble when a love-triangle forms between them.

3
search score: 0.6907370090484619
title: Dark Blue World
plot: The friendship of two men becomes tested when they both fall for the same woman.

4
search score: 0.6907370090484619
title: Dark Blue World
plot: The friendship of two men becomes tested when they both fall for the same woman.

5
search score: 0.690660834312439
title: Harsh Times
plot: A tough-minded drama about two friends in South Central Los Angeles and the violence that comes between them.
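
In the results above, "Dark Blue World" appears twice with an identical score, which suggests the same title and plot were inserted more than once. One way to handle this, sketched below, is to request a few extra hits and drop repeats before display. `dedupe_hits` is a hypothetical helper; it assumes each hit is a dict with `distance` and `entity` keys, as accessed in `print_search_results`.

```python
from typing import Dict, List


def dedupe_hits(hits: List[Dict], limit: int = 5) -> List[Dict]:
    """Keep the first occurrence of each (title, plot) pair, up to `limit` hits."""
    seen = set()
    unique = []
    for hit in hits:
        key = (hit["entity"]["title"], hit["entity"]["plot"])
        if key in seen:
            continue  # same movie already kept with an equal or better rank
        seen.add(key)
        unique.append(hit)
        if len(unique) == limit:
            break
    return unique


# Toy hits shaped like the search results printed above
toy_hits = [
    {"distance": 0.75, "entity": {"title": "Varalaaru", "plot": "p1"}},
    {"distance": 0.69, "entity": {"title": "Dark Blue World", "plot": "p2"}},
    {"distance": 0.69, "entity": {"title": "Dark Blue World", "plot": "p2"}},
    {"distance": 0.68, "entity": {"title": "Harsh Times", "plot": "p3"}},
]
for hit in dedupe_hits(toy_hits, limit=3):
    print(hit["distance"], hit["entity"]["title"])
```

Because the input order is preserved, the de-duplicated list stays sorted by score.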

