### Install Python Dependencies

**Note: After running this cell, ensure you restart the kernel.**

In [None]:
!pip install -q langchain==0.1.4
!pip install -q langchain-community==0.0.16
!pip install -q transformers
!pip install -q pymilvus
!pip install -q sentence-transformers

### Import Python Modules

Import the required Python modules from Langchain.

* **RecursiveCharacterTextSplitter** - Text splitter. This is required to split our Nutanix Bible content into chunks.
* **HuggingFaceEmbeddings** - This provides access to the embedding models hosted on Hugging Face.
* **Milvus** - This allows the code to use and manage our Milvus Vector DB.
* **WebBaseLoader** - This loads HTML content into a document format we can use with our DB.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus
from langchain.document_loaders import WebBaseLoader

### Initialize an instance of HuggingFaceEmbeddings

This code initializes an instance of [HuggingFaceEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceEmbeddings.html) from LangChain, which gives us access to the [Sentence Transformers](https://huggingface.co/sentence-transformers) embedding models on Hugging Face.

An embedding is a numerical representation of an object for use in machine learning systems. The text content from the Nutanix Bible needs to be translated into multi-dimensional vectors. Like the foundational models that can be used for chatbots (e.g., Llama 2), there are many pre-trained embedding models that can do this translation for us. These models have learned from a huge amount of text how words are normally used and in what contexts they appear.

In this lab, we are using the [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) embedding model.

#### Note
When running this code, you can ignore the following warning:

```
/home/jovyan/.local/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
```

In [None]:
modelPath = "sentence-transformers/all-mpnet-base-v2"

model_kwargs = {}

# Create a dictionary with encoding options, setting 'normalize_embeddings' to True so each vector is scaled to unit length
encode_kwargs = {'normalize_embeddings': True}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)
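
To illustrate what the `normalize_embeddings` option refers to, here is a minimal sketch of L2 normalization in plain Python. It is not the code that Sentence Transformers runs internally; the helper names `l2_normalize` and `dot` and the two-element vector are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # a zero vector has no direction, so leave it unchanged
    return [x / norm for x in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

raw = [3.0, 4.0]          # toy vector with length 5
unit = l2_normalize(raw)  # same direction, length 1
print(unit)               # [0.6, 0.8]
```

Once vectors have unit length, their dot product equals their cosine similarity, which is one common reason to normalize embeddings before storing them in a vector database.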

### Experiment with Embedding Calculations

The following code demonstrates how the embedding model works. This code embeds the text "pear" into a 768-dimensional vector. If you were to remove the `[:3]`, it would print out all 768 elements of the vector.

In [None]:
example = embeddings.embed_query("pear")
print(len(example))
print(example[:3])

In [None]:
example = embeddings.embed_query("banana")
print(example[:3])

In [None]:
example = embeddings.embed_query("computer")
print(example[:3])

### Experiment with Distance Calculations

The following code demonstrates how to calculate the distance between two vectors. A lower score means a shorter distance (i.e., the two texts are more closely related).

In [None]:
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("embedding_distance", embeddings=embeddings)

evaluator.evaluate_strings(prediction="apple", reference="pear")

In [None]:
evaluator.evaluate_strings(prediction="pear", reference="computer")

In [None]:
evaluator.evaluate_strings(prediction="banana", reference="computer")
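
The scores above can be reasoned about with a small hand-written distance function. The sketch below assumes the evaluator is using cosine distance (which we understand to be its default metric) and computes it as 1 minus the cosine similarity. The function name `cosine_distance` and the toy vectors are hypothetical and are not part of LangChain.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot_product / (norm_a * norm_b)

# Vectors pointing in the same direction have a distance close to 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Vectors at a right angle to each other have a distance of 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Under this metric, only the direction of the vectors matters, not their length.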

### Load the Nutanix Bible content

This code uses the [WebBaseLoader](https://python.langchain.com/docs/integrations/document_loaders/web_base) component of LangChain to load our Nutanix Bible content into a document format that can be ingested into the database.

In [None]:
loader = WebBaseLoader("https://www.nutanixbible.com/classic")
data = loader.load()

### Optional - Print Contents

The `data` object is a Python list with 1 element. If you uncomment the following cell, it will print out the Document object that contains the entire contents of the Nutanix Bible.

In [None]:
#print(data)

### Split contents of the data object

We don’t want to store the data as one vector. In this code, we’ll split the data into chunks of at most 500 characters each, with no overlap between chunks, using the [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html) component of LangChain.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(data)
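
To make the `chunk_size` and `chunk_overlap` parameters more concrete, here is a deliberately simplified sketch of fixed-width chunking. It is not the algorithm used by RecursiveCharacterTextSplitter, which tries to break on separators such as paragraphs and spaces before falling back to raw characters. The function `split_fixed` is hypothetical and exists only for this illustration.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into pieces of at most chunk_size characters.

    Each piece starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring pieces share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

sample = "abcdefghij"
print(split_fixed(sample, 4))     # ['abcd', 'efgh', 'ij']
print(split_fixed(sample, 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of 0, as in this lab, every character lands in exactly one chunk. A non-zero overlap repeats some text across neighbouring chunks, which can help preserve context that would otherwise be cut at a chunk boundary.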

### Check length of docs list

All of our chunks are stored in a Python list. We can see how many chunks (documents) there are with the following code.

In [None]:
len(docs)

### Optional - View one of the document's contents

Feel free to look at different elements by adjusting the index.

In [None]:
print(docs[10].page_content)

### Ingest the documents into Milvus

In this code, we'll pass our documents and our embeddings object to Milvus. LangChain uses the embeddings object to create a vector embedding of each document and stores the results in the `nutanixbible_web` collection.

Note that we are connecting with the internal DNS name of the database service instead of an IP address.

In [None]:
vector_db = Milvus.from_documents(
    docs,
    embeddings,
    collection_name="nutanixbible_web",
    connection_args={"host":"milvus-vectordb.milvus.svc.cluster.local","port":"19530"}
)

### Search for similar content in the database

Now that we've ingested our data, we can search it using the [similarity_search()](https://python.langchain.com/docs/modules/data_connection/vectorstores/#similarity-search) function.

In [None]:
question = "What is Nutanix Kubernetes Engine?"
result_docs = vector_db.similarity_search(question)
print(result_docs[0].page_content)
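
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it (LangChain's `similarity_search()` returns the top 4 by default). The sketch below shows the idea as a brute-force search over a tiny in-memory list. It is not how Milvus works internally, since a vector database relies on indexes to avoid comparing against every vector. The names `top_k` and `toy_store`, and the three-dimensional vectors, are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)

def top_k(query_vector, stored, k=4):
    """Return the texts of the k stored vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "database": (chunk text, made-up embedding) pairs.
toy_store = [
    ("chunk about Kubernetes", [0.9, 0.1, 0.0]),
    ("chunk about storage", [0.1, 0.9, 0.0]),
    ("chunk about networking", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]  # stands in for the embedded question

print(top_k(toy_query, toy_store, k=2))
# ['chunk about Kubernetes', 'chunk about storage']
```

In the lab, the same ranking idea is applied to the 768-dimensional vectors produced by the embedding model, and `result_docs[0]` is the best match.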

### Return to the Lab Guide

Move on to the **View Documents in Milvus** section of the lab guide.