# Privacera AI Governance - Milvus Vector Database Filter

This notebook shows how to use the Privacera Shield library with a LangChain application that uses the Milvus vector database. To run this notebook, you will need the following:

## Prerequisites

1.  Sign up for a free account at [Privacera AI Governance (PAIG)](https://privacera.ai). Signing up is simple; all you need is your email address.
2.  An OpenAI API key. This will allow you to create your first OpenAI application governed by Privacera AI Governance.

## Details

This notebook does the following:

1. Installs Milvus and runs the server within Google Colab.
2. Creates a VectorDB collection called PrivaceraSampleCollection with the required columns for Vector Search and PAIG's access controls.
3. Generates sample documents and associates access permissions and classifications with these documents.
4. Uses LangChain to create embeddings and store them in VectorDB along with the access control permissions and classifications from the original documents.
5. Sets up the GenAI application and VectorDB in the PAIG portal.
6. Writes a GenAI application using LangChain, which uses Milvus as VectorDB and PAIG for Safety, Security, and Observability.
7. Tries out various use cases to ensure that data leakage doesn't happen.


# 0. Reset grpcio system library and restart the environment

The grpcio version bundled with Google Colab can be incompatible with Milvus. If it is newer than 1.63, the cell below uninstalls it and restarts the runtime.
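
The check relies on comparing parsed versions rather than raw strings. As a minimal sketch of why that matters, the hypothetical helper `as_tuple` below handles only plain dotted numeric versions; unlike `packaging.version`, it does not treat `1.63.0` and `1.63` as equal, so it is an illustration and not a replacement:

```python
def as_tuple(v: str) -> tuple:
    """Turn a plain dotted version such as '1.64.0' into (1, 64, 0)."""
    return tuple(int(part) for part in v.split("."))

threshold = "1.63"

# String comparison is lexicographic, so "1.9.0" wrongly looks newer than "1.63".
print("1.9.0" > threshold)                       # True (misleading)

# Comparing parsed numeric components gives the intended ordering.
print(as_tuple("1.9.0") > as_tuple(threshold))   # False
print(as_tuple("1.64.0") > as_tuple(threshold))  # True
```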


In [None]:
!pip install packaging

import subprocess
from packaging import version
import grpc

# Get the current version of grpcio
current_version = grpc.__version__
print(f"Current grpcio version: {current_version}")

# Define the version to compare against
target_version = version.parse("1.63")

# Compare the versions
if version.parse(current_version) > target_version:
    print(f"grpcio version is {version.parse(current_version)} which is greater than {target_version}, so uninstalling it")
    # Uninstall grpcio if the version is greater than 1.63
    subprocess.check_call(["pip", "uninstall", "-y", "grpcio"])
    print("grpcio has been successfully uninstalled.")
    print("Restarting runtime. No action needed from your side!!!")
    # We need to restart the runtime
    # Ignore the warning at the bottom that says the runtime crashed
    exit()
else:
    print("grpcio version is not greater than 1.63. No action needed.")

# 1. Install the Python packages
This will take several seconds, up to a minute. It installs LangChain, Milvus, and PAIG.

In [None]:
!pip -q install  \
  milvus \
  pymilvus \
  langchain==0.2.0 \
  langchain-core==0.2.0 \
  langchain-community==0.2.0 \
  langchain-openai==0.1.7 \
  langchain-text-splitters==0.2.0 \
  privacera_shield==1.1.9


# 2. Set your OpenAI API key in the environment
Enter your OpenAI API key so that it is set in the environment. The key is held in memory only; it is not uploaded to the PAIG portal or used by PAIG components.

In [None]:
import os
from getpass import getpass

#if os.environ.get("OPENAI_API_KEY") is None:
openai_api_key = getpass("🔑 Enter your OpenAI API key and hit Enter:")
os.environ["OPENAI_API_KEY"] = openai_api_key
print("OpenAI key has been entered")

# 3. Create Privacera AI Application and the VectorDB configuration

In this step, we will create an AI Application configuration in PAIG that will be used to associate PAIG with a sample RAG LangChain application.

1. Log into your PAIG account to configure the VectorDB and the GenAI application.
1. **Add VectorDB in PAIG**: Click on Application -> Vector DB, create a Vector DB named **Product Catalog - Milvus**, and save it. Note: This only adds the reference in PAIG. You still need to configure and start the vector database, which is done in subsequent steps.
1. **Enable User/Groups Access Control**: Go to the **Permissions** tab, click on the **pencil** icon, and toggle **User/Group Access-Limited Retrieval** to enable it. Save after toggling it. This enforces document-level access control while retrieving embeddings from VectorDB.
1. **Add GenAI Application in PAIG**: Navigate back to Application -> AI Application and create a new application called **Product Catalog - Milvus**.
1. **Associate VectorDB with GenAI Application**: Click on the Associated VectorDB drop-down, select the **Product Catalog - Milvus** vector database, and then click on the **Create** button.
> If you missed associating the VectorDB while creating the GenAI application, associate it by clicking on the pencil icon in the Information panel, clicking on the Enabled toggle to enable it, selecting the **Product Catalog - Milvus** vector database, and then clicking on the **Save** button.
1. **Download Config File**: Click **DOWNLOAD APP CONFIG** to download your application configuration file to your local disk.
> By default, it is generally saved in the Downloads folder of your laptop.


# 4. Upload the PAIG Application Config file to Colab

Your GenAI application will need the configuration file you downloaded from PAIG. Upload it to the Colab instance by running this cell and clicking on the **Choose Files** button. Select the application config file from your local disk and it will be uploaded into Colab. This configuration file is used when PAIG initializes for the first time in your GenAI application.

> Generally, the file will be downloaded with the name privacera-shield-Product-Catalog---Milvus-config.json
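
Before handing the uploaded content to PAIG, it can help to confirm that it is well-formed JSON. The following is a minimal sketch, assuming only that the config file is a JSON object; the helper name `looks_like_json_object` is hypothetical and no particular config keys are assumed:

```python
import json

def looks_like_json_object(content: str) -> bool:
    """Return True if content parses as a JSON object (a dict)."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict)

# Placeholder strings for illustration only; a real config has its own fields.
print(looks_like_json_object('{"example": "value"}'))  # True
print(looks_like_json_object("not json"))              # False
```

In this notebook, the string to check would be `app_config_file_content` once the upload cell has run.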

In [None]:
from google.colab import files

uploaded = files.upload()
# Use a separate name so the `files` module is not shadowed when the cell is re-run
uploaded_names = list(uploaded.keys())
if len(uploaded_names) != 1:
    print("Upload only the application config json file")
else:
    app_config_file_content = uploaded[uploaded_names[0]].decode('UTF-8')

# 5. Start Milvus Vector Database
This step starts Milvus within Colab. It should take less than a minute. A few connection errors may appear while Milvus starts, but the cell should eventually print 'Connected to Milvus'.
> Ignore errors like `Connection failed: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/packaging-24.1.dist-info/METADATA'`
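
The retry loop in the cell below keeps trying until the connection succeeds. If a bounded wait is preferred, one option is a small helper with a deadline. This is a sketch only: `wait_until_ready` and `flaky_connect` are hypothetical names, and the stand-in connect function simulates a server that is still starting.

```python
import time

def wait_until_ready(connect, timeout_s=60.0, interval_s=1.0):
    """Call connect() until it succeeds or timeout_s elapses.

    Returns the number of attempts made; raises TimeoutError on give-up.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            connect()
            return attempts
        except Exception as exc:  # broad on purpose: any startup error is retried
            if time.monotonic() >= deadline:
                raise TimeoutError(f"not ready after {attempts} attempts") from exc
            time.sleep(interval_s)

# Stand-in for a real connection call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server still starting")

print(wait_until_ready(flaky_connect, timeout_s=5.0, interval_s=0.01))  # 3
```

In the notebook, `connect` could be a lambda wrapping `connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)`.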


In [None]:
get_ipython().system_raw('milvus-server &')
!while ! (ps aux | grep -q '[m]ilvus' && ps aux | grep -q '[m]ilvus-server'); do sleep 1; done; echo 'Milvus is ready'

# Replace with your actual Milvus server parameters if different
MILVUS_HOST = "127.0.0.1"
MILVUS_PORT = "19530"

while True:
    try:
        import time
        from pymilvus import connections

        connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
        print("Connected to Milvus")
        break
    except Exception as e:
        print(f"Connection failed: {e}")
        time.sleep(1)

# 6. Create Collection in Milvus Vector Database for GenAI Application

In this step, we will create a collection in the Milvus vector database with the
following schema:
- source - name of the document file
- text - content of the document
- pk - primary key
- vector - embedding vector of the content
- users - list of users that have access to this document
- groups - list of groups that have access to this document
- metadata - additional metadata associated with this document

> The columns **users**, **groups** and **metadata** are used by PAIG to enforce access permissions to individual chunks.

In [None]:
from pymilvus import CollectionSchema, FieldSchema, DataType

COLLECTION_NAME = "PrivaceraSampleCollection"

def create_collection():
    source = FieldSchema(
        name="source",
        dtype=DataType.VARCHAR,
        max_length=65535
    )
    text = FieldSchema(
        name="text",
        dtype=DataType.VARCHAR,
        max_length=65535
    )
    pk = FieldSchema(
        name="pk",
        dtype=DataType.INT64,
        is_primary=True,
        auto_id=True
    )
    vector = FieldSchema(
        name="vector",
        dtype=DataType.FLOAT_VECTOR,
        dim=1536
    )
    # The following columns are used by PAIG for enforcing Fine Grained Access Control
    users = FieldSchema(
        name="users",
        dtype=DataType.ARRAY,
        element_type=DataType.VARCHAR,
        max_length=65535,
        max_capacity=1024
    )
    groups = FieldSchema(
        name="groups",
        dtype=DataType.ARRAY,
        element_type=DataType.VARCHAR,
        max_length=65535,
        max_capacity=1024
    )
    metadata = FieldSchema(
        name="metadata",
        dtype=DataType.JSON
    )

    schema = CollectionSchema(
        fields=[source, text, pk, vector, users, groups, metadata],
        description="Sample Privacera Milvus Collection",
        enable_dynamic_field=True
    )

    from pymilvus import connections, Collection

    connections.connect(
        alias="default",
        host=MILVUS_HOST,
        port=MILVUS_PORT
    )

    # Passing a schema creates the collection if it does not already exist
    collection = Collection(
        name=COLLECTION_NAME,
        schema=schema,
        using='default'
    )

    index_params = {
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {
            "M": 10,
            "efConstruction": 8
        }
    }

    collection.create_index(
        field_name="vector",
        index_params=index_params,
        index_name="index"
    )
    print(f"Collection = {COLLECTION_NAME} created")

create_collection()

# 7. Create sample documents in a folder

In this notebook, we create the sample documents dynamically in a local folder named `raw_data` within Colab. Ideally, these documents would be loaded from their actual sources.

- x10.txt - Contains the existing product specification and is accessible by everyone
- x11.txt - Contains the specification of the product that is under development. This is highly classified data and only team members from R&D have access to this file
- x10-salesdata.txt - Sales numbers for the product x10. Only the Sales team has access to it.
- customer-feedback.txt - Customer feedback that contains PII data. Only a few people can see the PII data

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

def create_raw_data():
    raw_data_dir = "raw_data"

    file_contents = {
        "x10.txt": """
Product Specification Sheet of x10
Display: Size and resolution - 6.5" AMOLED, 120Hz refresh rate
Processor: Model name  Snapdragon 8 Gen 1
RAM: Options 8GB/12GB
Storage: Options 128GB/256GB
Camera: rear camera system with multiple lenses, front-facing camera
Battery: Capacity 5000mAh
Operating System: Version Android 13
Key Features: long battery life, fast performance, high-quality camera
        """
        , "x11.txt": """
Product Specification Sheet of x11
Display: Size and resolution - 7.5" AMOLED, 360Hz refresh rate
Processor: Model name  Snapdragon 10 Gen 3
RAM: Options 16GB/24GB
Storage: Options 256GB/512GB
Camera: 360 camera system with multiple lenses, front-facing camera
Battery: Capacity 10000mAh
Operating System: Version Android 13
Key Features: super long battery life, ultra fast performance, 360 camera
        """
        , "x10-salesdata.txt": """
Sales Data for X10 Model:
Monthly Sales Report (Internal)
Region	Units Sold	Revenue
North America	20,000	$10,000,000
Europe	15,000	$7,500,000
Asia Pacific	10,000	$5,000,000
Total	45,000	$22,500,000
    """
        , "customer-feedback.txt": """
Customer Feedback Analysis - X10 Model

Positive Feedback for X10 Model:

"The X10's battery life is amazing! I can finally ditch the portable charger."

Sarah Jones, Busy Professional
Email: sarah.jones@samplemail.com
Phone: (123) 456-7890
"The camera takes crystal-clear pictures, even in low-light conditions. Perfect for capturing memories on the go!"

David Lee, Travel Blogger
Email: david.lee@travelblogger.com
Phone: (234) 567-8901
"The phone's design is sleek and feels luxurious in hand. The user interface is user-friendly and easy to navigate, even for non-tech-savvy users like me."

Emily Garcia, Teacher
Email: emily.garcia@schoolmail.com
Phone: (345) 678-9012

Areas for Improvement for X10 Model:

"The phone is a bit bulky for one-handed use. It can be challenging to reach the top of the screen comfortably."

Michael Chen, Gamer
Email: michael.chen@gamermail.com
Phone: (456) 789-0123
"I've encountered a few minor software bugs that require restarting the phone. Hopefully, future updates will address these."

Olivia Rodriguez, Social Media Manager
Email: olivia.rodriguez@socialhub.com
Phone: (567) 890-1234
"The current storage options are a bit limiting for someone who stores a lot of photos and videos. A higher storage tier or microSD card support would be ideal."

William Smith, Content Creator
Email: william.smith@creatorhub.com
Phone: (678) 901-2345

Feature Requests for X10 Model:

"Wireless charging would be a fantastic addition for convenience. No more fumbling with cables!" (Multiple Users)
"A built-in fingerprint sensor would be a welcome security feature for added peace of mind." (Several Users)
"The ability to expand storage with a microSD card would be incredibly helpful for users who need more space." (Content Creators & Photographers)
"""
    }

    os.makedirs(raw_data_dir, exist_ok=True)

    for file_path, content in file_contents.items():
        file_path_with_dir = raw_data_dir + "/" + file_path
        with open(file_path_with_dir, 'w') as file:
            file.write(content)

    print(f"Files created in {raw_data_dir}")


create_raw_data()

# 8. Associate metadata with the documents

Ideally, the access permissions will be carried from the source document. For this exercise, since we are dynamically creating the files, we will also set up the permissions for the files according to the use cases we want to try out.

In this cell, we create a custom loader class called **PrivaceraTextLoader** by extending LangChain's class **TextLoader** that will add additional metadata for each document in the collection. For each document, we have a list of users who are allowed to access the document, a list of groups that are allowed to access the document, and additional metadata such as classification associated with the document.

We will use the users, groups, and metadata attributes to filter the documents based on the user querying the vector database.
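
To make the intended filtering concrete, here is a plain-Python sketch of the rule. It assumes a chunk is visible when the user is listed in `users` or one of the user's groups is listed in `groups`; that is an assumption about the semantics for illustration, not PAIG's actual implementation. The helper names and the `R&D` group name are hypothetical, and the permissions mirror three of the `file_metadata` entries defined in the next cell.

```python
def can_access(chunk: dict, username: str, user_groups: list) -> bool:
    """Visible if the user, or one of the user's groups, is listed on the chunk."""
    return username in chunk["users"] or any(g in chunk["groups"] for g in user_groups)

# Permissions mirroring three entries of the notebook's file_metadata.
chunks = [
    {"source": "x10.txt", "users": ["sally", "peter", "emily", "mark"], "groups": []},
    {"source": "x11.txt", "users": ["mark", "peter"], "groups": []},
    {"source": "x10-salesdata.txt", "users": ["sally"], "groups": ["Sales"]},
]

def visible_sources(username: str, user_groups: list) -> list:
    return [c["source"] for c in chunks if can_access(c, username, user_groups)]

# Group memberships here are assumed for the example.
print(visible_sources("peter", ["R&D"]))    # ['x10.txt', 'x11.txt']
print(visible_sources("sally", ["Sales"]))  # ['x10.txt', 'x10-salesdata.txt']
```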

In [None]:
import os

from typing import Optional, Iterator
from langchain_community.document_loaders import TextLoader
from langchain.schema import Document

# Define the permissions and classifications for the files
file_metadata = {
    "x10.txt": {
        "users": ["sally", "peter", "emily", "mark"],
        "groups": [],
        "metadata": {"file_name": "x10.txt"}
    },
    "x11.txt": {
        "users": ["mark", "peter"],
        "groups": [],
        "metadata": {"SECURITY_LEVEL": "CONFIDENTIAL", "file_name": "x11.txt"}
    },
    "x10-salesdata.txt": {
        "users": ["sally"],
        "groups": ["Sales"],
        "metadata": {"file_name": "x10-salesdata.txt"}
    },
    "customer-feedback.txt": {
        "users": ["emily", "sally", "peter", "mark"],
        "groups": ["Sales"],
        "metadata": {"file_name": "customer-feedback.txt"}
    }
}

# Subclass LangChain's TextLoader to inject additional metadata
class PrivaceraTextLoader(TextLoader):
    def __init__(self, file_path: str, encoding: Optional[str] = None, autodetect_encoding: bool = False):
        super().__init__(file_path, encoding, autodetect_encoding)
        print(f"inside PrivaceraTextLoader init, file_path={file_path}")

    def lazy_load(self) -> Iterator[Document]:
        documents = super().lazy_load()

        for doc in documents:
            file_name = os.path.basename(self.file_path)
            print(f"lazy_load: file_name={file_name}")
            metadata = file_metadata.get(file_name)
            if metadata:
                # Attach the access-control fields so LangChain stores them with the document
                doc.metadata["users"] = metadata["users"]
                doc.metadata["groups"] = metadata["groups"]
                doc.metadata["metadata"] = metadata["metadata"]

            yield doc

print("PrivaceraTextLoader is ready")


# 9. Load the sample documents into Milvus vector database
Now the sample documents are loaded into the Milvus vector database using LangChain and the OpenAI embeddings API.
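
When LangChain splits a document, each resulting chunk carries a copy of the parent document's metadata, which is how `users`, `groups`, and `metadata` end up on every stored chunk. The plain-Python sketch below illustrates only that idea; it uses fixed-size slicing and a hypothetical `split_with_metadata` helper rather than the separator-based logic of `CharacterTextSplitter`.

```python
import copy

def split_with_metadata(text: str, metadata: dict, chunk_size: int) -> list:
    """Slice text into fixed-size chunks, each carrying its own metadata copy."""
    return [
        {"text": text[i:i + chunk_size], "metadata": copy.deepcopy(metadata)}
        for i in range(0, len(text), chunk_size)
    ]

doc_metadata = {"users": ["mark", "peter"], "groups": []}
pieces = split_with_metadata("a" * 2500, doc_metadata, chunk_size=1024)

print(len(pieces))                                          # 3
print(all(p["metadata"] == doc_metadata for p in pieces))   # True
```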

In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.milvus import Milvus

text_loader_kwargs = {'autodetect_encoding': True}
# The custom PrivaceraTextLoader is passed here. The loaders can be customized
# to meet your requirements
loader = DirectoryLoader("raw_data", glob="**/*.txt",
                         loader_cls=PrivaceraTextLoader,
                         loader_kwargs=text_loader_kwargs)
docs = loader.load()

print(f"len docs = {len(docs)}")

text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(docs)

# Create OpenAI Embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

vector_store = Milvus.from_documents(
    docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT}
)

print(f"Loaded {len(docs)} chunks into collection {COLLECTION_NAME} successfully.")

# 10. LangChain RAG bot

This is a simple LangChain application that uses Milvus as the VectorDB and PAIG to prevent data leakage to unauthorized users.

Integrating PAIG requires adding a couple of lines to your LangChain application. PAIG Shield automatically intercepts all calls to the RAG/VectorDB and the LLM, and performs validation, guardrails, and data filtering.

> Note: Look for comments marked **PAIG** for the changes needed to integrate PAIG

In [None]:
# PAIG: Add the following 2 imports
import privacera_shield
from privacera_shield import client as privacera_shield_client
from langchain.memory import ConversationBufferWindowMemory
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain

# Create Milvus vector store
vector_store = Milvus(embeddings, COLLECTION_NAME,
                      connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT})

# expose this index in a retriever interface
milvus_retriever = vector_store.as_retriever(
    search_type="similarity", search_kwargs={"k": 100}
)

# PAIG: Add the below line to initialize Privacera Shield with milvus and
#       langchain. This needs to be done only one time and the PAIG config file needs
#       to be passed to it. The config contains the shared secret and URL to PAIG
#       server to get the policies and send the audit logs
privacera_shield_client.setup(frameworks=["milvus", "langchain"], application_config=app_config_file_content)

llm = ChatOpenAI(openai_api_key=openai_api_key, model_name="gpt-3.5-turbo")
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

def query_as_user(username, query):
    print(f"Prompt: {query}")
    print()

    memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True, k=3)

    llm_chain = ConversationalRetrievalChain.from_llm(llm=llm,
                                                      retriever=milvus_retriever,
                                                      memory=memory,
                                                      verbose=False)
    try:
        # PAIG: Before LangChain invoke is called, set the PAIG context with the
        #       user who is making the call
        with privacera_shield_client.create_shield_context(username=username):
            response = llm_chain.invoke({"question": query})
            print("LLM Response:")
            print(f"{response.get('answer')}")
            #wrap_text(f"{response.get('answer')}")
    # PAIG: This handles access being denied to the GenAI application, or the
    #       prompt or the reply containing inappropriate or unauthorized content
    except privacera_shield.exception.AccessControlException as e:
        # If access is denied, then this exception will be thrown. You can handle it accordingly.
        print(f"AccessControlException: {e}")

# utility function to wrap the output
def wrap_text(text, width=80):
    words = text.split()
    character_count = 0
    for word in words:
        if character_count + len(word) + 1 > width:  # Check if adding the word would exceed the width
            print("\n", end="")  # Start a new line
            character_count = 0  # Reset the character count for the new line
        print(word, end=" ")  # Print the word followed by a space
        character_count += len(word) + 1  # Update the character count

print("RAG Bot is ready")

# 11. Ask a question about product X11, which is under development


Peter belongs to the R&D team and has access to the details of an unreleased product called X11, so he should be able to compare all the phone models.

Sally belongs to the Sales team and doesn't have access to the details of X11, so she shouldn't be able to compare the phone models.

> Note: We are explicitly passing the username. Ideally, this would be the logged-in user.

In [None]:
query_as_user("peter", "Compare the product specifications for X10 and X11")
# Peter can access both documents, so the response compares the two models

In [None]:
query_as_user("sally", "Compare the product specifications for X10 and X11")
# Sally doesn't have access to the X11 document, so she won't be able to compare the models

# 12. Check Audit logs

1. In the PAIG portal, go to **Security**->**Access Audits**
2. Click on the **eye** icon for peter's request. You should see the sequence of events; expand them all to see the contexts that were retrieved from VectorDB. You should see the documents from X10 and X11.
3. Similarly, you can view the audit record for **sally**; in the **Context Documents** you won't see any reference to X11 documents.

This demonstrates that, for the same prompt, the response differs depending on which documents the asking user has access to. This prevents unintentional data leakage when documents from multiple data sources with different access controls are stored centrally.

# 13. Ask for sales details as members of Sales and other teams

Sally belongs to the Sales team and has access to the sales numbers.

Peter, who belongs to R&D, doesn't have access to the sales data.

Only the Sales team has access to the sales documents; these permissions are carried forward into the VectorDB and enforced there.

In [None]:
query_as_user("sally", "Give me the monthly sales data for X10?")

In [None]:
query_as_user("peter", "Give me the monthly sales data for X10?")

# 14. Let's redact PII data based on policy
Sally belongs to the Sales team and can see customer details.

Peter, who belongs to R&D, can't see customer PII data but can see the feedback.

1. Go to **Application -> AI Applications** and select the **AI Application** you created
2. Select the **PERMISSIONS** tab
3. Click the pencil icon for the **Personal Identifier Redaction** policy
4. Remove **Everyone** and add **peter**
5. On the right side, for **Prompt**, select the dropdown value **Allow**
6. Leave **Reply** as **Redact**
7. Save the policy
8. **Enable** the policy by toggling the **Status** toggle


In [None]:
query_as_user("sally", "Give me the feedbacks and their contact information")

In [None]:
query_as_user("peter", "Give me the feedbacks and their contact information")

# 15. Check Audit Logs

You can also check the access audit logs for these prompts and replies. You will see that even though the LLM responded with PII, the PII is redacted in the reply to peter because he shouldn't have access to it. This ensures that the appropriate privacy and compliance requirements are enforced.
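
As a rough illustration of what redacting a reply means, the sketch below masks email addresses and phone numbers in the format used by the sample feedback file. It is not PAIG's implementation: the regular expressions, the `redact_pii` helper, and the `<EMAIL>` and `<PHONE>` placeholders are all assumptions made for this example.

```python
import re

# Simplified patterns that cover the sample data's formats only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def redact_pii(text: str) -> str:
    """Replace emails and (NNN) NNN-NNNN phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

sample = "Email: sarah.jones@samplemail.com\nPhone: (123) 456-7890"
print(redact_pii(sample))
# Email: <EMAIL>
# Phone: <PHONE>
```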