# Ways to increase the diversity of documents retrieved from ChromaDB in LangChain

- Use MMR to balance relevance and diversity.
- Increase the number of documents retrieved (k).
- Experiment with different embedding models.
- Apply metadata filters to ensure varied results.
- Consider hybrid search to combine vector and keyword search.
- Use query augmentation to generate diverse queries.

In [None]:
# Use MMR to balance relevance and diversity.
# MMR is selected with search_type="mmr"; it is not a search_kwargs flag.
retriever = chromadb.as_retriever(search_type="mmr", search_kwargs={"k": 10, "lambda_mult": 0.5})

"""
Maximal Marginal Relevance (MMR) is a common technique to balance relevance and diversity in retrieval. Instead of only selecting the most similar documents, it selects documents that are relevant to the query while minimizing redundancy among the results.

You can control the diversity vs. relevance trade-off with MMR's lambda parameter (`lambda_mult` in LangChain): lambda = 0 maximizes diversity, and lambda = 1 maximizes relevance.
"""

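To make the effect of lambda concrete, here is a minimal pure-Python sketch of greedy MMR selection. It is an illustration under stated assumptions, not LangChain's implementation: the function names `cosine` and `mmr` and the toy 2-D vectors are made up for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick up to k document indices, trading relevance for novelty."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Hypothetical toy data: vectors 0 and 1 are near-duplicates, vector 2 differs.
query_vec = [1.0, 0.0]
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

print(mmr(query_vec, toy_vectors, k=2, lambda_mult=1.0))  # relevance only: [0, 1]
print(mmr(query_vec, toy_vectors, k=2, lambda_mult=0.3))  # favors diversity: [0, 2]
```

With lambda at 1.0 the near-duplicate is kept because only relevance counts; with lambda at 0.3 the redundancy penalty pushes the second pick to the dissimilar vector.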
In [None]:
# Increase the number of documents retrieved (k).
retriever = chromadb.as_retriever(search_kwargs={"k": 10})  # Return 10 results instead of the default of 4

In [None]:
# Experiment with different embedding models.

In [None]:
# Apply metadata filters to ensure varied results.
# A single filter narrows results to one category; vary the filter to cover several.
retriever = chromadb.as_retriever(search_kwargs={"k": 10, "filter": {"category": "tech"}})
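A filter on one value restricts results rather than diversifying them, so one possible way to get variety is to run a filtered search per category and combine the results. The sketch below assumes a LangChain vector store whose `similarity_search` accepts `k` and `filter`, and a `category` metadata field as in the cell above; the helper name `retrieve_across_categories` is hypothetical.

```python
def retrieve_across_categories(store, query, categories, k_per_category=2):
    """Run one metadata-filtered search per category and concatenate the hits."""
    results = []
    for category in categories:
        # Each call only considers documents whose metadata matches the filter
        hits = store.similarity_search(
            query, k=k_per_category, filter={"category": category}
        )
        results.extend(hits)
    return results
```

For example, `retrieve_across_categories(vectorstore, "login outside Finland", ["tech", "hr"])` would return up to two hits per category, where the category values are placeholders for whatever metadata the collection actually carries.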

In [None]:
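# Consider hybrid search to combine vector and keyword search.
# sparse_retriever is assumed to be a keyword retriever (e.g. BM25Retriever) defined elsewhere.
from langchain.retrievers import EnsembleRetriever

retriever = EnsembleRetriever(
    retrievers=[chromadb.as_retriever(), sparse_retriever],
    weights=[0.5, 0.5],
)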
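One common way to merge a dense (vector) ranking with a sparse (keyword) ranking is reciprocal rank fusion (RRF). The sketch below is a minimal pure-Python illustration of that idea, not any library's implementation; the document IDs and the constant `c=60` are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list accumulate a larger score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a vector search and a keyword search
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise to the top, while documents found by only one retriever ("b", "d") still make it into the fused result, which is where the added diversity comes from.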

# Query augmentation

In [12]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_core.output_parsers import StrOutputParser

In [58]:


# Step 1: Define your LLM (here, a Llama 3.1 model served through NVIDIA AI endpoints)
llm = ChatNVIDIA(model="meta/llama-3.1-70b-instruct", temperature=0)

# Step 2: Define a prompt template for query augmentation
prompt_template = """
You are an assistant tasked with generating different variations of a search query to help retrieve more diverse information.

Original query: "{query}"

Provide 3 different rephrased versions of this query that capture different ways of asking for the same or related information.
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["query"])

# Step 3: Create the query augmentation chain (prompt -> LLM -> string output)
aug_chain = prompt | llm | StrOutputParser()

docs = CSVLoader(file_path="/data/test/data/Closed Incidents/Closed_Incidents_2.csv").load()
users = CSVLoader(file_path="/data/test/data/users/exportUsers.csv").load()
docs = docs + users

# Step 4: Define your embedding model and ChromaDB
vectorstore = Chroma.from_documents(
        documents=docs,
        collection_name="test-chroma",
        embedding=NVIDIAEmbeddings(model='NV-Embed-QA'),
    )

# Step 5: Define a function to perform augmented search
def augmented_search(query):
    # Generate augmented queries using the LLM (returned as a single text block)
    augmented_queries = aug_chain.invoke({"query": query})

    # Retrieval step, currently disabled. aug_chain returns one string, so it
    # must be split into individual queries before it can be searched.
    # retriever = vectorstore.as_retriever()
    # all_results = []
    # for q in [query] + augmented_queries.splitlines():
    #     all_results.extend(retriever.invoke(q))
    return augmented_queries



In [57]:
# Drop the collection through the public method instead of the private _client
vectorstore.delete_collection()

In [10]:
incident = CSVLoader(file_path="/data/test/data/sample input/LoginOutsideFinland_Emilia_SG_True.csv").load()[0]

In [59]:
# Step 6: Perform the search with query augmentation
#query = "What are the health benefits of green tea?"
results = augmented_search(incident)


print(results)

Here are three rephrased versions of the original query:

**Version 1: Focus on User Activity**
"User login activity outside of Finland on September 24, 2024, with medium severity alert"

This version focuses on the user's activity and the specific date, while removing some of the technical details like IP address and metadata.

**Version 2: Emphasize Geolocation**
"Login attempts from Singapore (SG) on September 24, 2024, violating Finland login policy"

This version highlights the geolocation aspect of the alert, specifically mentioning Singapore as the location of the login attempt, and frames it as a policy violation.

**Version 3: Focus on Alert Details**
"Medium severity alerts for user logins outside of allowed locations on September 24, 2024, with details on user and IP address"

This version takes a more general approach, focusing on the alert severity and the fact that the login occurred outside of an allowed location, while still including some details about the user and IP 
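The LLM output above is one text block in which each rephrased query sits on its own line inside double quotes. A possible next step, sketched here under that formatting assumption, is to extract those lines and search with each of them, dropping duplicate hits. The helper names are hypothetical, and `retriever` is assumed to be any object with an `invoke` method returning documents that carry `source` and `row` metadata, as the CSV loader produces.

```python
import re

def extract_queries(llm_text):
    """Return the contents of lines that consist of one double-quoted string."""
    return re.findall(r'^[ \t]*"(.+)"[ \t]*$', llm_text, flags=re.MULTILINE)

def multi_query_search(retriever, original_query, llm_text):
    """Search with the original and rephrased queries, keeping unique documents."""
    seen, merged = set(), []
    for q in [original_query] + extract_queries(llm_text):
        for doc in retriever.invoke(q):
            # (source, row) identifies a CSV-loaded document uniquely
            key = (doc.metadata.get("source"), doc.metadata.get("row"))
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged
```

If the model changes its output format, `extract_queries` may return an empty list, in which case only the original query is searched.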

In [76]:
# maximal marginal relevance
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 5, 'lambda_mult': 0.5, "fetch_k": 19}
)

documents = retriever.invoke(results)

for doc in documents:
    print(doc.metadata)

{'row': 4, 'source': '/data/test/data/Closed Incidents/Closed_Incidents_2.csv'}
{'row': 12, 'source': '/data/test/data/Closed Incidents/Closed_Incidents_2.csv'}
{'row': 5, 'source': '/data/test/data/Closed Incidents/Closed_Incidents_2.csv'}
{'row': 6, 'source': '/data/test/data/Closed Incidents/Closed_Incidents_2.csv'}
{'row': 1, 'source': '/data/test/data/users/exportUsers.csv'}
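The metadata printed above shows four hits from the closed-incidents file and one from the users file. A quick way to compare retriever settings is to count hits per source; the sketch below assumes documents expose a `metadata` dict with a `source` key, and the helper name `source_counts` is hypothetical.

```python
from collections import Counter

def source_counts(documents):
    """Count retrieved documents per metadata 'source' value."""
    return Counter(doc.metadata.get("source", "unknown") for doc in documents)
```

Calling `source_counts(documents)` after each retriever change (for example a different `lambda_mult` or `fetch_k`) gives a rough signal of how the mix of sources shifts.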


In [49]:
docs


[Document(metadata={'source': '/data/test/data/Closed Incidents/Closed_Incidents_2.csv', 'row': 0}, page_content=': 0\nTimeGenerated [UTC]: 9/20/2024, 9:17:53.315 PM\nTitle: User logged in outside of Finland\nDescription: The user has logged in outside of Finland, which is not allowed.\nSeverity: Medium\nStatus: Closed\nClassification: TruePositive\nClassificationComment: Joel logged from Japan to Azure Portal from the IP address: . The IP Address was also flagged as malicious. User has been normally logged in from Finland. This was also confirmed from the user and he did not traveled to Japan during that time. User password was reset and sessions revoked.\nClassificationReason: SuspiciousActivity'),
 Document(metadata={'source': '/data/test/data/Closed Incidents/Closed_Incidents_2.csv', 'row': 1}, page_content=': 1\nTimeGenerated [UTC]: 9/20/2024, 9:57:55.832 PM\nTitle: User logged in outside of Finland\nDescription: The user has logged in outside of Finland, which is not allowed.\nSe