# IMDB Vector Search using Milvus Client

First, import some common libraries and define the data reading functions.

In [1]:
# For colab install these libraries in this order:
# !pip install pymilvus langchain torch transformers sentence-transformers python-dotenv

# Import common libraries.
import sys, time, pprint
import pandas as pd
import numpy as np

# Import custom functions for splitting and search.
sys.path.append("..")  # Adds higher directory to python modules path.
import milvus_utilities as _utils

## Start up a Zilliz free tier cluster.

Code in this notebook uses fully-managed Milvus on [Zilliz Cloud free trial](https://cloud.zilliz.com/login).  
  1. Choose the default "Starter" option when you provision > Create collection > Give it a name > Create cluster and collection.  
  2. On the Cluster main page, copy your `API Key` and store it locally as a variable in a `.env` file.  See the note below for how to do that.
  3. Also on the Cluster main page, copy the `Public Endpoint URI`.

💡 Note: To keep your tokens private, best practice is to use an **env variable**.  See [how to save api key in env variable](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). <br>

In Jupyter, you also need a `.env` file (in the same directory as the notebook) containing lines like this:
- VARIABLE_NAME=value
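
As a rough illustration of how such a file is read, the sketch below parses `KEY=value` lines using only the standard library. `parse_env_lines` is a hypothetical helper written for this note, not part of `python-dotenv`, and the key value is a placeholder. In the notebook itself, `load_dotenv()` does this job.

```python
def parse_env_lines(lines):
    """Parse KEY=value lines, skipping blanks and # comments (hypothetical helper)."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

# Placeholder content; a real .env file would hold your actual API key.
sample = ["# Zilliz credentials", "ZILLIZ_API_KEY=your-api-key-here", ""]
env = parse_env_lines(sample)
print(sorted(env))
```

Keeping the key in a file that is excluded from version control means the notebook can be shared without exposing the token.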


In [2]:
# STEP 1. CONNECT TO MILVUS

# !pip install pymilvus #python sdk for milvus
from pymilvus import connections, utility

import os
from dotenv import load_dotenv
load_dotenv()
TOKEN = os.getenv("ZILLIZ_API_KEY")

# Connect to Zilliz cloud using endpoint URI and API key TOKEN.
# TODO change this.
CLUSTER_ENDPOINT="https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443"
connections.connect(
  alias='default',
  #  Public endpoint obtained from Zilliz Cloud
  uri=CLUSTER_ENDPOINT,
  # API key or a colon-separated cluster username and password
  token=TOKEN,
)

# Check that the server is ready by printing its version.
print(f"Type of server: {utility.get_server_version()}")

Type of server: zilliz_cloud


## Load the Embedding Model checkpoint and use it to create vector embeddings
**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) available on HuggingFace to encode the documentation text.  We will download the model from HuggingFace and run it locally. 

Two model parameters of note below:
1. EMBEDDING_LENGTH refers to the dimensionality or length of the embedding vector. In this case, every input text, however long, is encoded into a single vector of the SAME length = 1024 (the model produces a 1024-dimensional embedding per token and mean-pools them, as the `Pooling` layer in the model printout shows). This size is typical of BERT-large-based models, where the embeddings are used for downstream tasks such as classification, question answering, or text generation. <br><br>
2. MAX_SEQ_LENGTH is the maximum length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off.  This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input.
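
To make the silent truncation concrete, here is a minimal sketch that uses whitespace splitting as a toy stand-in for the tokenizer. This is an assumption for illustration only: the real model uses a subword tokenizer and reserves room for special tokens, so actual counts will differ. `truncate_tokens` is a hypothetical helper.

```python
def truncate_tokens(text, max_seq_length):
    """Toy stand-in for a tokenizer: split on whitespace, keep the first tokens."""
    tokens = text.split()
    kept = tokens[:max_seq_length]
    dropped = len(tokens) - len(kept)
    return kept, dropped

# A synthetic 600-"token" review, longer than the 512-token limit.
review = " ".join(f"word{i}" for i in range(600))
kept, dropped = truncate_tokens(review, max_seq_length=512)
print(f"kept {len(kept)} tokens, silently dropped {dropped}")
```

Nothing warns you about the dropped tail, which is why long reviews are split into chunks before embedding.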

In [3]:
# STEP 2. DOWNLOAD AN OPEN SOURCE EMBEDDING MODEL.

# Import torch.
import torch
from torch.nn import functional as F
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
# Use the default GPU if one is available, otherwise fall back to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"device: {DEVICE}")

# Load the model from huggingface model hub.
# python -m pip install -U angle-emb
model_name = "WhereIsAI/UAE-Large-V1"
encoder = SentenceTransformer(model_name, device=DEVICE)
print(type(encoder))
print(encoder)

# Get the model parameters and save for later.
EMBEDDING_LENGTH = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length() 
# # Assume tokens are 3 characters long.
# MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS * 3
# HF_EOS_TOKEN_LENGTH = 1 * 3
# Test with 512 sequence length.
MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS
HF_EOS_TOKEN_LENGTH = 1

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_LENGTH: {EMBEDDING_LENGTH}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")

device: cpu


No sentence-transformers model found with name /Users/christybergman/.cache/torch/sentence_transformers/WhereIsAI_UAE-Large-V1. Creating a new one with MEAN pooling.


<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
model_name: WhereIsAI/UAE-Large-V1
EMBEDDING_LENGTH: 1024
MAX_SEQ_LENGTH: 512


## Create a Milvus collection

You can think of a collection in Milvus like a "table" in SQL databases.  The **collection** will contain the following:
- **Schema** (or [no-schema Milvus client](https://milvus.io/docs/using_milvusclient.md)).  
💡 You'll need the vector `EMBEDDING_LENGTH` parameter from your embedding model.
Typical values are:
   - 768 for many sbert embedding models
   - 1024 for the `WhereIsAI/UAE-Large-V1` model used in this notebook
   - 1536 for ada-002 OpenAI embedding models
- **Vector index** for efficient vector search
- **Vector distance metric** for measuring nearest neighbor vectors
- **Consistency level**  
In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), stronger consistency costs some latency. 💡 Searching movie reviews is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.


### Exercise #1 (2 min):
Create a collection named "movies".  Use the default AUTOINDEX.
> 💡 AUTOINDEX works on both Milvus and Zilliz Cloud (where it is the fastest!)

In [None]:
from pymilvus import MilvusClient

# Set the Milvus collection name.
COLLECTION_NAME = # TODO (exercise): code here

# The no-schema Milvus client uses a flexible json key:value format.
# https://milvus.io/docs/using_milvusclient.md
mc = MilvusClient(
    uri=CLUSTER_ENDPOINT,
    # API key or a colon-separated cluster username and password
    token=TOKEN)

mc.drop_collection(COLLECTION_NAME)
mc.create_collection(COLLECTION_NAME, 
                     EMBEDDING_LENGTH, 
                    )

print(mc.describe_collection(COLLECTION_NAME))
print(f"Created collection: {COLLECTION_NAME}")

## Add a Vector Index

The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  

Most vector indexes use different sets of parameters depending on whether the database is:
- **inserting vectors** (creation mode) - vs - 
- **searching vectors** (search mode) 

Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Inverted-file (cluster-based) index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - Automatically determined based on OSS vs [Zilliz cloud](https://docs.zilliz.com/docs/autoindex-explained), type of GPU, size of data.
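
To show what "exhaustive search" means for FLAT, the sketch below scores every stored vector against the query and keeps the best `top_k`. It is a conceptual illustration in plain Python with made-up 2-d vectors, not how Milvus implements the index; `flat_search` and `cosine_similarity` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flat_search(query, vectors, top_k):
    """Score every stored vector against the query; return the best top_k as (index, score)."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-d vectors, invented for illustration.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
hits = flat_search([1.0, 0.1], vectors, top_k=2)
print(hits)
```

The cost grows linearly with the number of stored vectors, which is the motivation for approximate indexes such as IVF and HNSW that examine only part of the data.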

Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered "close" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Angular distance

💡 Most use cases work better with normalized embeddings. When every vector has length 1, IP and COSINE give identical scores, and L2 produces the same ranking as well (squared L2 distance equals 2 - 2 × cosine similarity), so L2 adds nothing.  Only choose L2 if you plan to keep your embeddings unnormalized.
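
The relationship between the metrics on unit-length vectors can be checked numerically. The sketch below uses small made-up vectors and plain Python. For normalized vectors the inner product is the cosine similarity, and the squared L2 distance is `2 - 2 * IP`, so ranking by any of the three gives the same order.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors, invented for illustration, normalized to unit length.
query = normalize([3.0, 4.0, 0.0])
docs = [normalize(v) for v in ([1.0, 2.0, 2.0], [4.0, 3.0, 0.0], [0.0, 0.0, 5.0])]

for d in docs:
    ip = inner_product(query, d)
    # For unit vectors: ||q - d||^2 = 2 - 2 * (q . d)
    assert math.isclose(squared_l2(query, d), 2 - 2 * ip, abs_tol=1e-12)

# Higher IP is better; lower L2 is better. Both orderings agree.
rank_by_ip = sorted(range(len(docs)), key=lambda i: -inner_product(query, docs[i]))
rank_by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
print(rank_by_ip, rank_by_l2)
```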

In [5]:
# STEP 3. CREATE A NO-SCHEMA MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.

# Re-run create collection and add vector index specifying custom params.
from pymilvus import MilvusClient

# For vector length, use the embedding length from the embedding model.
print(f"Embedding length: {EMBEDDING_LENGTH}")

# Set the Milvus collection name.
COLLECTION_NAME = "movies"

# Add custom HNSW search index to the collection.
# M = max number graph connections per layer. Large M = denser graph.
# Choice of M: 4~64, larger M for larger data and larger embedding lengths.
M = 16
# efConstruction = num_candidate_nearest_neighbors per layer. 
# Use Rule of thumb: int. 8~512, efConstruction = M * 2.
efConstruction = M * 2
# Define the HNSW index parameters.
INDEX_PARAMS = {
    "M": M,
    "efConstruction": efConstruction}
index_params = {
    "index_type": "HNSW", 
    "metric_type": "COSINE", 
    "params": INDEX_PARAMS
    }

# Use no-schema Milvus client (flexible json key:value format).
# https://milvus.io/docs/using_milvusclient.md
mc = MilvusClient(
    uri=CLUSTER_ENDPOINT,
    # API key or a colon-separated cluster username and password
    token=TOKEN)

# Check if collection already exists, if so drop it.
has = utility.has_collection(COLLECTION_NAME)
if has:
    drop_result = utility.drop_collection(COLLECTION_NAME)
    print(f"Successfully dropped collection: `{COLLECTION_NAME}`")

mc.create_collection(
    COLLECTION_NAME, 
    EMBEDDING_LENGTH, 
    consistency_level="Eventually", 
    auto_id=True,  
    overwrite=True,
    # skip setting params below, if using AUTOINDEX
    params=index_params
    )

print(f"Created collection: {COLLECTION_NAME}")
print(mc.describe_collection(COLLECTION_NAME))

Embedding length: 1024
Successfully dropped collection: `movies`
Created collection: movies
{'collection_name': 'movies', 'auto_id': True, 'num_shards': 1, 'description': '', 'fields': [{'field_id': 100, 'name': 'id', 'description': '', 'type': 5, 'params': {}, 'element_type': 0, 'auto_id': True, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': 101, 'params': {'dim': 1024}, 'element_type': 0}], 'aliases': [], 'collection_id': 446268198622108304, 'consistency_level': 3, 'properties': {}, 'num_partitions': 1, 'enable_dynamic_field': True}


## Read CSV data into a pandas dataframe

The data used in this notebook is the [IMDB large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) from the Stanford AI Lab. It is a conveniently preprocessed dataset of 50,000 reviews (sampled 50:50 between Positive and Negative). The data has these columns: movie_index, raw review text, and movie rating.

In [6]:
# 1. Download data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# 2. Move .csv file to data/ folder.

# citation:  ACL 2011, @InProceedings{maas-EtAl:2011:ACL-HLT2011,
#   author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
#   title     = {Learning Word Vectors for Sentiment Analysis},
#   booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
#   month     = {June},
#   year      = {2011},
#   address   = {Portland, Oregon, USA},
#   publisher = {Association for Computational Linguistics},
#   pages     = {142--150},
#   url       = {http://www.aclweb.org/anthology/P11-1015}
# }

In [7]:
# Read locally stored data.
filepath = "data/movie_data.csv"

df = pd.read_csv(f"{filepath}")

# Drop duplicates
df.drop_duplicates(keep='first', inplace=True)

# Change label column names.
df.columns = ['text', 'label_int']

# Map numbers to text 'Positive' and 'Negative' for sentiment labels.
df["label"] = df["label_int"].apply(_utils.sentiment_score_to_name)

# Split data into train/valid/test.
columns = ['movie_index', 'text', 'label_int', 'label']
df, df_train, df_val, df_test = _utils.partition_dataset(df, columns, smoke_test=False)
print(f"original df shape: {df.shape}")
print(f"df_train shape: {df_train.shape}, df_val shape: {df_val.shape}, df_test shape: {df_test.shape}")
assert df_train.shape[0] + df_val.shape[0] + df_test.shape[0] == df.shape[0]

# Inspect data.
print(f"Example text length: {len(df.text[0])}")
print(f"Example text: {df.text[0]}")
display(df.head(2))


original df shape: (100, 4)
df_train shape: (100, 4), df_val shape: (0, 4), df_test shape: (0, 4)
Example text length: 1113
Example text: The whole town of Blackstone is afraid, because they lynched Bret Dixon's brother - and he is coming back for revenge! At least that's what they think.<br /><br />A great Johnny Hallyday and a very interesting, early Mario Adorf star in this Italo-Western, obviously filmed in the Alps.<br /><br />Bret Dixon is coming back to Blackstone to investigate why his brother was lynched. He is a loner and gunslinger par excellance, everybody is afraid of him - the Mexican bandits (fighting the Gringos that took their land!) as well as the "decent" citizens that lynched Bret's brother. They lynched him, because they thought he stole their money instead of bringing it to Dallas to the safety of the bank there. But this is is only half the truth, as we find out in the course of this psychologically interesting western.<br /><br />But beware, it's kind of a depre

| | movie_index | text | label_int | label |
|---|---|---|---|---|
| 0 | 80 | The whole town of Blackstone is afraid, becaus... | 1 | Positive |
| 1 | 84 | This Harold Lloyd short wasn't really much; no... | 0 | Negative |


In [8]:
# Check if approx. equal number training examples for each class.
class1 = df_train.loc[(df_train.label == "Positive"), :].copy()
class2 = df_train.loc[(df_train.label == "Negative"), :].copy()
print(f"Count samples positive: {class1.shape[0]}")
print(f"Count samples negative: {class2.shape[0]}")

Count samples positive: 50
Count samples negative: 50


In [9]:
# Uncomment this to create the small sample of data for github.
# df_small = df.head(100)[['text', 'label_int']].copy()
# display(df_small.head())
# df_small.to_csv("data/movie_data_small.csv", index=False)

## Chunking

Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  In this demo, I will use:
- **Strategy** = Keep movie reviews as single chunks unless they are too long.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = Langchain's convenient `RecursiveCharacterTextSplitter` to split up long reviews recursively.
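
The effect of chunk size and overlap on the number of chunks can be sketched with a plain sliding window over characters. This is a simplification: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraphs and sentences, so its chunk boundaries and counts can differ. `chunk_with_overlap` is a hypothetical helper, and the review is a dummy string of 1113 characters, the length of the example review printed earlier.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

review = "x" * 1113  # dummy text with the same length as the example review
for size in (511, 256):
    overlap = round(size * 0.10)
    print(size, len(chunk_with_overlap(review, size, overlap)))
```

Smaller chunks mean more windows per review, and therefore more vectors to embed, store, and search.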


⚠️ **Demo batch size = 100 rows for demonstration purposes.**

This means the question results could be better with more data!

### Exercise #2 (2 min):
Change the chunk_size and see what happens.  The default, derived from the model, is 511.

- What do your observations imply about changing the chunk_size and the number of vectors?
- How many vectors are there with chunk_size=256?

In [None]:
###############
## EXERCISE #2: Change chunk_size to 256 below.  How many chunks (vectors) does this create?
## ANSWER:  542
## BONUS:   Can you explain why the number of vectors changed from 290 to 542?  
##          Hint:  What is the default chunk overlap?  290 * (2 - 0.10) approx. equals 542.
###############
# Default chunk_size and overlap are calculated from embedding model parameters.
chunk_size = # TODO (exercise): code here
chunk_overlap = np.round(chunk_size * 0.10, 0)
BATCH_SIZE = 100

# Chunk a batch of data from pandas DataFrame and inspect it.
batch = _utils.imdb_chunk_text( # TODO (exercise): code here )

In [11]:
# Don't forget to re-run using the better batch size!  

# Use the embedding model parameters to calculate chunk_size and overlap.
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH
chunk_overlap = np.round(chunk_size * 0.10, 0)
BATCH_SIZE = 100

# Chunk a batch of data from pandas DataFrame and inspect it.
batch = _utils.imdb_chunk_text(encoder, BATCH_SIZE, df, chunk_size, chunk_overlap)

chunk size: 511
original shape: (100, 4)
new shape: (290, 5)
Chunking + embedding time for 100 docs: 20.689467906951904 sec


| | movie_index | text | chunk | vector | label_int | label |
|---|---|---|---|---|---|---|
| 0 | 80 | The whole town of Blackstone is afraid, becaus... | The whole town of Blackstone is afraid, becaus... | [0.023260135, 0.03262592, 0.0071149827, 0.0475... | 1 | Positive |
| 1 | 80 | The whole town of Blackstone is afraid, becaus... | Mexican bandits (fighting the Gringos that too... | [0.024261247, 0.018350782, -0.005168957, 0.020... | 1 | Positive |
| 2 | 80 | The whole town of Blackstone is afraid, becaus... | and definitely everybody is bad to the bone...... | [0.034700453, 0.011013481, -0.022261137, 0.003... | 1 | Positive |
| 3 | 84 | This Harold Lloyd short wasn't really much; no... | This Harold Lloyd short wasn't really much; no... | [0.01173156, 0.01819113, 0.03528512, 0.0179632... | 0 | Negative |
| 4 | 84 | This Harold Lloyd short wasn't really much; no... | part was the last four or five minutes when th... | [0.05225119, 0.033677388, 0.011586295, 0.00569... | 0 | Negative |


type embeddings: <class 'pandas.core.series.Series'> of <class 'numpy.ndarray'>
of numbers: <class 'numpy.float32'>


## Insert data into Milvus

For each text chunk, we'll write the row (`movie_index, text, chunk, vector, label_int, label`) into the database.

<div>
<img src="../../pics/db_insert.png" width="80%"/>
</div>

**The Milvus Client wrapper can only handle loading data from a list of dictionaries.**

Otherwise, in general, Milvus supports loading data from:
- pandas dataframes 
- list of dictionaries
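
As a small illustration of the "list of dictionaries" shape, the sketch below turns column-oriented data into one dictionary per row using plain Python. The values are toy data (3-d vectors instead of 1024-d) chosen to resemble the batch above.

```python
# Column-oriented toy data; real vectors would have EMBEDDING_LENGTH entries.
columns = {
    "movie_index": [80, 84],
    "chunk": ["The whole town of Blackstone is afraid...", "This Harold Lloyd short..."],
    "label": ["Positive", "Negative"],
    "vector": [[0.02, 0.03, 0.01], [0.01, 0.02, 0.04]],
}

# One dictionary per row: {"movie_index": ..., "chunk": ..., "label": ..., "vector": ...}
names = list(columns)
rows = [dict(zip(names, values)) for values in zip(*columns.values())]
print(rows[0]["movie_index"], len(rows))
```

Each dictionary becomes one entity in the collection, with the `vector` key holding the embedding and the other keys stored as metadata.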

The embeddings were already computed during chunking, using the HuggingFace model checkpoint downloaded and run locally as the encoder, so the cell below only converts the batch and inserts it.

In [12]:
# STEP 5. INSERT CHUNKS AND EMBEDDINGS IN ZILLIZ.

# Convert DataFrame to a list of dictionaries
dict_list = []
for _, row in batch.iterrows():
    dictionary = row.to_dict()
    dict_list.append(dictionary)

print("Start inserting entities")
start_time = time.time()
insert_result = mc.insert(
    COLLECTION_NAME,
    data=dict_list,
    progress_bar=True)
end_time = time.time()
print(f"Milvus insert time for {batch.shape[0]} vectors: {end_time - start_time} seconds")

# After final entity is inserted, call flush to stop growing segments left in memory.
mc.flush(COLLECTION_NAME)

Start inserting entities


100%|██████████| 1/1 [00:05<00:00,  5.90s/it]


Milvus insert time for 290 vectors: 5.9037556648254395 seconds


## Run a Semantic Search

Now we can run a very fast search over all the movie review embeddings to find the `TOP_K` movie reviews with the closest embeddings to a user's query.
- In this example, we'll search for a movie recommendation for a medical doctor.

💡 The same model should always be used for consistency for all the embeddings.

## Ask a question about your data

So far in this demo notebook: 
1. Your custom data has been mapped into a vector embedding space
2. Those vector embeddings have been saved into a vector database

Next, you can ask a question about your custom data!

💡 With LLMs:
> **Query** is the generic term for user questions.  
A query is a list of multiple individual questions, up to maybe 1000 different questions!

> **Question** usually refers to a single user question.  
In our example below, the user question is "I'm a medical doctor, what movie should I watch?"

In [13]:
# Define a sample question about your data.
question = "I'm a medical doctor, what movie should I watch?"
query = [question]

# Inspect the length of the query.
QUERY_LENGTH = len(query[0])
print(f"query length: {QUERY_LENGTH}")

query length: 48


**Embed the question using the same embedding model you used earlier**

In order for vector search to work, the question itself should be embedded with the same model used to create the collection you want to search.

In [14]:
# Embed the query using same embedding model used to create the Milvus collection.
query_embeddings = _utils.embed_query(encoder, query)

# Inspect data.
print(type(query_embeddings), len(query_embeddings), type(query_embeddings[0]))
print(type(query_embeddings[0][0]) ) 

<class 'list'> 1 <class 'numpy.ndarray'>
<class 'numpy.float32'>


## Execute a vector search

Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).

💡 By their nature, vector searches are "semantic" searches.  For example, if you were to search for "leaky faucet": 
> **Traditional keyword search** - either or both words "leaky", "faucet" would have to match some text in order to return a web page or link text to the document.

> **Semantic search** - results containing the words "drippy" and "taps" would be returned as well, because these words mean the same thing even though they are different words.
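
The contrast can be sketched in a few lines. The keyword check below is a real (if naive) exact-word match. The "embeddings" are hand-made 2-d vectors invented purely for illustration, standing in for what an embedding model would produce.

```python
import math

def keyword_match(query, document):
    """True if any query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return any(word in doc_words for word in query.lower().split())

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Invented toy embeddings: axis 0 is roughly "plumbing", axis 1 roughly "cooking".
toy_embeddings = {
    "leaky faucet": [0.9, 0.1],
    "how to fix drippy taps": [0.8, 0.2],
    "best pasta recipes": [0.1, 0.9],
}

query = "leaky faucet"
for doc in ("how to fix drippy taps", "best pasta recipes"):
    similarity = cosine(toy_embeddings[query], toy_embeddings[doc])
    print(doc, keyword_match(query, doc), round(similarity, 2))
```

The keyword check misses the "drippy taps" document because no word overlaps, while the vector comparison ranks it well above the unrelated one.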

### Exercise #3 (2 min):
Search Milvus using the default search index.


In [15]:
# Run semantic vector search using your query and the vector database.

# # Not needed with Milvus Client API.
# mc.load()

# Uses default search algorithm:  HNSW and top_k=10.
start_time = time.time()
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    )

elapsed_time = time.time() - start_time
print(f"Search time: {elapsed_time} sec")

# Inspect search result.
print(f"type: {type(results)}, count: {len(results[0])}")

Search time: 0.17416715621948242 sec
type: <class 'list'>, count: 10


In [16]:
# Re-run the search using custom settings.

# Return top k results with HNSW index.
TOP_K = 3
OUTPUT_FIELDS=["movie_index", "chunk", "label"]
SEARCH_PARAMS = dict({
    # Re-use index param for num_candidate_nearest_neighbors.
    "ef": INDEX_PARAMS['efConstruction']
    })

# Run the search and time it.
start_time = time.time()
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    search_params=SEARCH_PARAMS,
    output_fields=OUTPUT_FIELDS, 
    # Milvus can utilize metadata in boolean expressions to filter search.
    # filter="",
    limit=TOP_K,
    consistency_level="Eventually",
    )

elapsed_time = time.time() - start_time
print(f"Milvus search time: {elapsed_time} sec")

# Inspect search result.
print(f"type: {type(results)}, count: {len(results[0])}")

Milvus search time: 0.0590059757232666 sec
type: <class 'list'>, count: 3


## Assemble and inspect the search result

The search result is in the variable `results`, a list with one entry per query. Since we asked a single question, `results[0]` holds its `TOP_K` hits.  
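
The sketch below walks through a mocked result to show one way of flattening the hits. It assumes each hit is a dictionary with `id`, `distance`, and `entity` keys, where `entity` holds the requested output fields. That shape is an assumption to verify against your pymilvus version. The distances are invented, and `flatten_hits` is a hypothetical helper, not the `_utils` function used in this notebook.

```python
# Mock of the assumed shape: one list of hits per query, each hit a dict.
mock_results = [[
    {"id": 1, "distance": 0.71,
     "entity": {"movie_index": "56", "chunk": "Dr. K ...", "label": "Negative"}},
    {"id": 2, "distance": 0.69,
     "entity": {"movie_index": "44", "chunk": "is not a horror movie ...", "label": "Positive"}},
]]

def flatten_hits(results, fields):
    """Return one flat dict per hit for the first query, keeping only the given fields."""
    rows = []
    for hit in results[0]:
        row = {"distance": hit["distance"]}
        row.update({f: hit["entity"].get(f) for f in fields})
        rows.append(row)
    return rows

rows = flatten_hits(mock_results, ["movie_index", "label"])
print(rows)
```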

In [17]:
# Assemble the top `num_shot_answers` retrieved contexts and their metadata.
METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']
formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(
    results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)
print(f"Length context: {len(context[0])}, Number of contexts: {len(context)}")

# Loop through each context and its metadata and print them.
for i in range(len(context)):
    print(f"Retrieved result #{i+1}")
    print(f"Context: {context[i][:150]}")
    print(f"Metadata: {context_metadata[i]}")
    print()

Length context: 507, Number of contexts: 3
Retrieved result #1
Context: Dr. K(David H Hickey)has been trying to master a formula that would end all disease and handicaps, but needs live donors to complete his work. His doc
Metadata: {'movie_index': '56', 'label': 'Negative'}

Retrieved result #2
Context: is not a horror movie, although it does contain some violent scenes, but is rather a comedy. A satire to be precise. And it never runs out of steam! T
Metadata: {'movie_index': '44', 'label': 'Positive'}

Retrieved result #3
Context: a good movie with a real good story. The fact that there are so many other big stars who all also had great performances is just an added BONUS! So do
Metadata: {'movie_index': '67', 'label': 'Positive'}



## Same question, but add Metadata filter.

Keeping the same question, add a SQL-like filter on metadata.

We expect the same answers as above, but omitting any "Negative" labeled movies.
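
Conceptually, a metadata filter restricts the candidates and the ranking then fills `TOP_K` from what remains. The sketch below mimics that in plain Python with made-up scores. It only illustrates the expected outcome and says nothing about how Milvus applies filters internally. `filtered_top_k` is a hypothetical helper.

```python
def filtered_top_k(hits, top_k, keep):
    """Rank hits by descending score, keeping only those that pass the filter."""
    passing = [h for h in hits if keep(h)]
    return sorted(passing, key=lambda h: h["score"], reverse=True)[:top_k]

# Toy hits; the scores are invented for illustration.
hits = [
    {"movie_index": "56", "label": "Negative", "score": 0.71},
    {"movie_index": "44", "label": "Positive", "score": 0.69},
    {"movie_index": "67", "label": "Positive", "score": 0.66},
    {"movie_index": "88", "label": "Positive", "score": 0.64},
]
unfiltered = filtered_top_k(hits, 3, keep=lambda h: True)
positive_only = filtered_top_k(hits, 3, keep=lambda h: h["label"].startswith("Positive"))
print([h["movie_index"] for h in unfiltered])
print([h["movie_index"] for h in positive_only])
```

When a "Negative" hit is removed, the next-best "Positive" hit moves up to fill the third slot.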

In [18]:
# Same question, but add Metadata filter only positive movies.
metadata_filter = "(label like 'Positive%')"

# Run the search and time it.
start_time = time.time()
new_results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    search_params=SEARCH_PARAMS,
    output_fields=OUTPUT_FIELDS, 
    filter=metadata_filter,
    limit=TOP_K,
    consistency_level="Eventually",
    )

elapsed_time = time.time() - start_time
print(f"Milvus search time: {elapsed_time} sec")

# Assemble the top `num_shot_answers` retrieved contexts and their metadata.
METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']
formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(
    new_results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)
print(f"Length context: {len(context[0])}, Number of contexts: {len(context)}")

# Loop through each context and its metadata and print them.
for i in range(len(context)):
    print(f"Retrieved result #{i+1}")
    print(f"Context: {context[i][:150]}")
    print(f"Metadata: {context_metadata[i]}")
    print()

# As expected, same answers, except 'Negative' movies are omitted.

Milvus search time: 0.0647287368774414 sec
Length context: 457, Number of contexts: 3
Retrieved result #1
Context: is not a horror movie, although it does contain some violent scenes, but is rather a comedy. A satire to be precise. And it never runs out of steam! T
Metadata: {'movie_index': '44', 'label': 'Positive'}

Retrieved result #2
Context: a good movie with a real good story. The fact that there are so many other big stars who all also had great performances is just an added BONUS! So do
Metadata: {'movie_index': '67', 'label': 'Positive'}

Retrieved result #3
Context: This movie took the Jerry Springer approach to super-human power. "Wilder Napalm" is the kind of theme-based movie that I love, addressing the idea th
Metadata: {'movie_index': '88', 'label': 'Positive'}



## Try another question

This time, change the profession in the question from **medical doctor** to **computer scientist** and see whether the answers are any different.  

For semantically different questions, we expect the answers to be different.

In [19]:
# Take a user question as input and run a semantic vector search with it.
question = "I'm a medical doctor, what movie should I watch?"
new_question = "I'm a computer scientist, what movie should I watch?"
print(f"Question: {new_question}")
# Embed the query using same embedding model used to create the Milvus collection.
new_query_embeddings = _utils.embed_query(encoder, [new_question])

# Run the search and time it.
start_time = time.time()
new_results = mc.search(
    COLLECTION_NAME,
    data=new_query_embeddings, 
    search_params=SEARCH_PARAMS,
    output_fields=OUTPUT_FIELDS, 
    # Milvus can utilize metadata in boolean expressions to filter search.
    # filter="",
    limit=TOP_K,
    consistency_level="Eventually",
    )

elapsed_time = time.time() - start_time
print(f"Milvus search time: {elapsed_time} sec")

# Assemble the top `num_shot_answers` retrieved contexts and their metadata.
METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']
formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(
    new_results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)
print(f"Length context: {len(context[0])}, Number of contexts: {len(context)}")

# Loop through each context and its metadata and print them.
for i in range(len(context)):
    print(f"Retrieved result #{i+1}")
    print(f"Context: {context[i][:150]}")
    print(f"Metadata: {context_metadata[i]}")
    print()

Question: I'm a computer scientist, what movie should I watch?
Milvus search time: 0.14014911651611328 sec
Length context: 133, Number of contexts: 3
Retrieved result #1
Context: i would be curious what kids think of this movie. Maybe they would enjoy it? But as for adults, safe bet they wont, even if a CS fan.
Metadata: {'movie_index': '37', 'label': 'Negative'}

Retrieved result #2
Context: Bears about as much resemblance to Dean Koontz's novel as Jessica Simpson does to a rocket scientist. If you've read the book, I suggest you put it as
Metadata: {'movie_index': '21', 'label': 'Positive'}

Retrieved result #3
Context: a good movie with a real good story. The fact that there are so many other big stars who all also had great performances is just an added BONUS! So do
Metadata: {'movie_index': '67', 'label': 'Positive'}



In [20]:
# Drop collection
utility.drop_collection(COLLECTION_NAME)

In [21]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p torch,transformers,milvus,pymilvus,langchain --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.11.6
IPython version      : 8.18.1

torch       : 2.1.1
transformers: 4.35.2
milvus      : 2.3.3
pymilvus    : 2.3.4
langchain   : 0.0.322

conda environment: py311

