## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project will use RAG (Retrieval Augmented Generation) to ensure our question/answering assistant has high accuracy.
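
To make the idea concrete, here is a minimal sketch of the retrieval step in RAG. It is not the method used later in this notebook: a toy bag-of-words vector stands in for a real embedding model, and the sample texts and the names `embed`, `cosine`, and `retrieve` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical mini knowledge base (illustrative text, not from the real documents)
snippets = [
    "Our marketplace product connects consumers with insurers",
    "The employee handbook describes the annual leave policy",
    "Each contract lists renewal terms and pricing",
]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, texts, k=1):
    """Return the k texts most similar to the question."""
    q = embed(question)
    return sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]

question = "renewal terms and pricing for a contract"
context = retrieve(question, snippets, k=1)

# The retrieved text is then placed in the prompt sent to the LLM
prompt = f"Use this context to answer.\n\nContext: {context[0]}\n\nQuestion: {question}"
print(prompt)
```

The rest of the notebook builds the real version of `embed` and the searchable store; only the overall shape (embed, compare, pick the closest text) carries over from this sketch.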

## TODAY:

- Part A: We will divide our documents into CHUNKS
- Part B: We will encode our CHUNKS into VECTORS and store them in Chroma
- Part C: We will visualize our vectors

### PART A: Divide our documents into chunks

In [1]:
# Install required packages
%pip install langchain-huggingface langchain-openai langchain-chroma langchain-community langchain-text-splitters tiktoken scikit-learn plotly chromadb sentence-transformers

/Users/nitin.aggarwal/Documents/llm_learning_2025_bootcamp/llm_engineering/.venv/bin/python: No module named pip
Note: you may need to restart the kernel to use updated packages.


In [6]:
import os
import glob
import tiktoken
import numpy as np
from dotenv import load_dotenv

# LangChain packages: OpenAI and Hugging Face embeddings, the Chroma vector store, document loaders, and text splitters
from langchain_openai import OpenAIEmbeddings 
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings # provided by the langchain-huggingface package installed above
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter # text splitter to split large documents into smaller chunks
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [7]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4.1-nano"
db_name = "vector_db"
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")


OpenAI API Key exists and begins sk-proj-


In [8]:
# How many characters in all the documents?

knowledge_base_path = "knowledge-base/**/*.md"
files = glob.glob(knowledge_base_path, recursive=True)
print(f"Found {len(files)} files in the knowledge base")

entire_knowledge_base = ""

for file_path in files:
    with open(file_path, 'r', encoding='utf-8') as f:
        entire_knowledge_base += f.read()
        entire_knowledge_base += "\n\n"

print(f"Total characters in knowledge base: {len(entire_knowledge_base):,}")

Found 31 files in the knowledge base
Total characters in knowledge base: 88,151


In [9]:
# How many tokens in all the documents?

encoding = tiktoken.encoding_for_model(MODEL)
tokens = encoding.encode(entire_knowledge_base)
token_count = len(tokens)
print(f"Total tokens for {MODEL}: {token_count:,}")

Total tokens for gpt-4.1-nano: 18,160
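
As a quick sanity check, the two totals reported above give the average number of characters per token for this knowledge base. This is plain arithmetic on the printed figures; the variable names are only illustrative.

```python
# Figures reported by the two cells above
total_characters = 88_151
total_tokens = 18_160

chars_per_token = total_characters / total_tokens
print(f"Average characters per token: {chars_per_token:.2f}")
```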


In [10]:
# Load everything in the knowledge base using LangChain's loaders

folders = glob.glob("knowledge-base/*")

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs={'encoding': 'utf-8'}) # langchain document loader to load all markdown files from the knowledge base.
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

print(f"Loaded {len(documents)} documents")

Loaded 31 documents


In [11]:
documents[1]

Document(metadata={'source': 'knowledge-base/products/Markellm.md', 'doc_type': 'products'}, page_content="# Product Summary\n\n# Markellm\n\n## Summary\n\nMarkellm is an innovative two-sided marketplace designed to seamlessly connect consumers with insurance companies. Powered by advanced matching AI, Markellm transforms the insurance shopping experience, making it more efficient, personalized, and accessible. Whether you're a homeowner searching for the best rates on home insurance or an insurer looking to reach new customers, Markellm acts as the ultimate bridge, delivering tailored solutions for all parties involved. With a user-friendly interface and powerful algorithms, Markellm not only saves time but also enhances decision-making in the often-complex insurance landscape.\n\n## Features\n\n- **AI-Powered Matching**: Markellm utilizes sophisticated AI algorithms to match consumers with the most suitable insurance products based on their individual needs and preferences. This ensu

In [15]:
# Divide into chunks using the RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

print(f"Divided into {len(chunks)} chunks")
print(f"First chunk:\n\n{chunks[0]}")

Divided into 231 chunks
First chunk:

page_content='# Product Summary

# Rellm: AI-Powered Enterprise Reinsurance Solution

## Summary' metadata={'source': 'knowledge-base/products/Rellm.md', 'doc_type': 'products'}
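
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts fixed-size windows with a shared overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break at separators such as paragraph and line breaks, which is why the first chunk above is much shorter than 600 characters. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in demo_chunks:
    print(piece)
```

Each printed piece starts with the last four characters of the previous one. The overlap serves the same purpose in the real splitter: a sentence that straddles a boundary still appears whole in at least one chunk.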


In [13]:
chunks[100]

Document(metadata={'source': 'knowledge-base/employees/Maxine Thompson.md', 'doc_type': 'employees'}, page_content='## Compensation History\n- **2017**: $70,000 (Junior Data Engineer)  \n- **2018**: $75,000 (Junior Data Engineer)  \n- **2019**: $80,000 (Data Engineer)  \n- **2020**: $84,000 (Data Engineer)  \n- **2021**: $95,000 (Senior Data Engineer)  \n- **2022**: $110,000 (Senior Data Engineer)  \n- **2023**: $120,000 (Senior Data Engineer)  \n\n## Other HR Notes\n- Maxine participated in various company-sponsored trainings related to big data technologies and cloud infrastructure.  \n- She was recognized for her contributions with the “Insurellm Innovator Award” in 2022.  \n- Maxine is currently involved in the women-in-tech initiative and participates in mentorship programs to guide junior employees.  \n- Future development areas include improving her stakeholder communication skills to ensure smoother project transitions and collaboration.')

### PART B: Make vectors and store in Chroma

In Week 3, you set up a Hugging Face account and got an `HF_TOKEN`. At this point, you might want to add it to your `.env` file and run `load_dotenv(override=True)`, although this shouldn't actually be required.

In [26]:
# Pick an embedding model

# Create the embedding model object. Two options:
# 1. Hugging Face (runs locally): all-MiniLM-L6-v2 from sentence-transformers
# embeddings_models_object = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# 2. OpenAI (API), used here. To switch to Hugging Face, comment out the line below and uncomment the line above.
embeddings_models_object = OpenAIEmbeddings(model="text-embedding-3-large")

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings_models_object).delete_collection() # wipe out existing collection if it exists

# chroma database to store the embeddings
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings_models_object, persist_directory=db_name) # chunks created above, and with the model object, could be any embedding model
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 231 documents


In [28]:
# Let's investigate the vectors

collection = vectorstore._collection # underlying Chroma collection (roughly analogous to a table)
count = collection.count() # number of items (vectors) in the collection

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0] # get first embedding
dimensions = len(sample_embedding)
print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store") # one vector per chunk (231); text-embedding-3-large gives 3,072 dimensions, all-MiniLM-L6-v2 would give 384

There are 231 vectors with 3,072 dimensions in the vector store
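
Since cost matters for this project, a rough back-of-the-envelope estimate of the raw embedding data may be useful. The sketch below assumes each value is stored as a 32-bit float, which is an assumption rather than something shown in this notebook, and it ignores the index, documents, and metadata that Chroma also stores.

```python
n_vectors = 231        # from the output above
n_dims = 3_072         # from the output above
bytes_per_value = 4    # assumption: each value is stored as a 32-bit float

raw_bytes = n_vectors * n_dims * bytes_per_value
print(f"{raw_bytes:,} bytes (~{raw_bytes / 1024**2:.1f} MiB) of raw embedding data")
```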


### PART C: Visualize!

In [29]:
# Prework

result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
metadatas = result['metadatas']
doc_types = [metadata['doc_type'] for metadata in metadatas]
color_map = {'products': 'blue', 'employees': 'green', 'contracts': 'red', 'company': 'orange'} # one color per document type
colors = [color_map[t] for t in doc_types]

In [30]:
# We humans find it easier to visualize things in 2D!
# Reduce the dimensionality of the vectors to 2D using t-SNE
# (t-distributed stochastic neighbor embedding).
# t-SNE is a dimensionality-reduction technique well suited to visualizing high-dimensional data.
# We cannot see the thousands of dimensions in each embedding, so we reduce them to 2D.
# It converts similarities between data points into joint probabilities and minimizes the
# Kullback-Leibler divergence between the low-dimensional and high-dimensional distributions.
# It is particularly good at preserving local structure, so related chunks tend to form clusters.
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(title='2D Chroma Vector Store Visualization',
    xaxis_title='x', yaxis_title='y',  # 2D axes are set on the layout; `scene` applies only to 3D plots
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()
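
The t-SNE comment in the cell above mentions minimizing the Kullback-Leibler divergence. As a rough illustration of that quantity only, the sketch below computes it for small hand-made distributions. The probabilities are hypothetical numbers, and real t-SNE derives them from distances between all pairs of points rather than from a short list like this.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical neighbor probabilities for one point (illustrative numbers only)
p_high = [0.7, 0.2, 0.1]   # similarities in the original high-dimensional space
q_good = [0.6, 0.3, 0.1]   # a 2D layout that roughly preserves them
q_bad = [0.1, 0.2, 0.7]    # a 2D layout that reverses them

print(f"KL(p || q_good) = {kl_divergence(p_high, q_good):.3f}")
print(f"KL(p || q_bad)  = {kl_divergence(p_high, q_bad):.3f}")
```

The layout that keeps close neighbors close yields the smaller divergence, which is the direction t-SNE pushes the 2D points during optimization.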

In [31]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=10, b=10, l=10, t=40)
)

fig.show()