# Lab 1 - Overview of embeddings-based retrieval


## Objective
Develop an **embeddings-based retrieval system** to extract relevant information from the **Microsoft Annual Report 2022** using ChromaDB and OpenAI's GPT-3.5 Turbo.


## Workflow

1. **Data Loading and Preprocessing**  
   Extract text content from the PDF and clean it for further processing.

2. **Text Splitting**  
   - Use **character-level splitting** to segment text into smaller chunks.  
   - Further refine with **token-level splitting** to ensure token limit compliance and context overlap.

3. **Generate Embeddings**  
   Compute embeddings for the text chunks using a pre-trained SentenceTransformer model.

4. **Store Data in Chroma**  
   Create a ChromaDB collection and store the text chunks with their corresponding embeddings.

5. **Query and Retrieval**  
   Query the ChromaDB collection to retrieve relevant text chunks based on user input.

6. **LLM Augmented Generation**  
   Feed the retrieved chunks into OpenAI’s GPT-3.5 Turbo for generating final answers.


## Conclusion
The lab walks through a complete pipeline for document-based **retrieval-augmented generation (RAG)**: the report is split into chunks, the chunks are embedded and stored, the chunks most similar to a question are retrieved, and the LLM answers using only those chunks.


In [1]:
from helper_utils import word_wrap
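
`helper_utils` is a local module that is not shown in this notebook. As a rough stand-in, assuming `word_wrap` simply wraps text to a fixed width, a sketch built on the standard library could look like the following. The name `word_wrap_sketch` and the 72-character default are assumptions, not the module's actual code.

```python
import textwrap


def word_wrap_sketch(text: str, n_chars: int = 72) -> str:
    """Wrap every existing line of `text` to at most `n_chars` characters."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill returns "" for an empty line, so blank lines survive.
        wrapped_lines.append(textwrap.fill(line, width=n_chars))
    return "\n".join(wrapped_lines)


sample = "word " * 40
print(word_wrap_sketch(sample, n_chars=30))
```

Wrapping only affects how text is printed; the chunks stored and embedded later are not modified by it.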

In [2]:
from pypdf import PdfReader

reader = PdfReader("microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0]))

1 Dear shareholders, colleagues, customers, and partners:  
We are
living through a period of historic economic, societal, and
geopolitical change. The world in 2022 looks nothing like 
the world in
2019. As I write this, inflation is at a 40 -year high, supply chains
are stretched, and the war in Ukraine is 
ongoing. At the same time, we
are entering a technological era with the potential to power awesome
advancements 
across every sector of our economy and society. As the
world’s largest software company, this places us at a historic

intersection of opportunity and responsibility to the world around us.
 
Our mission to empower every person and every organization on the
planet to achieve more has never been more 
urgent or more necessary.
For all the uncertainty in the world, one thing is clear: People and
organizations in every 
industry are increasingly looking to digital
technology to overcome today’s challenges and emerge stronger. And no

company is better positioned to help th

In [3]:
len(pdf_texts)

90

You can view the PDF in your browser [here](./microsoft_annual_report_2022.pdf) if you would like.
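
The extracted page text above contains stray line breaks and trailing spaces. The workflow mentions cleaning the text, but the notebook only strips each page and drops empty ones. A minimal sketch of one possible extra cleaning step, assuming whitespace normalization is what is wanted, is shown here. The helper names `clean_page_text` and `clean_pages` are hypothetical.

```python
import re


def clean_page_text(raw: str) -> str:
    """Collapse every run of whitespace (including line breaks) into one space."""
    return re.sub(r"\s+", " ", raw).strip()


def clean_pages(pages: list[str]) -> list[str]:
    """Clean each page and drop pages that end up empty."""
    cleaned = [clean_page_text(page) for page in pages]
    return [page for page in cleaned if page]


raw_pages = ["Dear shareholders,  \nWe are\nliving through change.", "   \n  "]
print(clean_pages(raw_pages))
```

One trade-off: collapsing all newlines also removes the `"\n\n"` and `"\n"` boundaries inside a page, so a separator-based splitter would then fall back to sentence and word boundaries within each page.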

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter


`separators`: a list of strings used to split the text. The splitter tries each separator in order and falls back to the next one only when a piece is still longer than `chunk_size`:
* `"\n\n"` → double newlines (paragraph splits).
* `"\n"` → single newlines (line splits).
* `". "` → a period followed by a space (sentence splits).
* `" "` → a space (word-level splits).
* `""` → the empty string (character-level splits, as a last resort).
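
To make the fallback order concrete, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's implementation: it ignores `chunk_overlap`, drops the separator at every split point, and the function name `recursive_split` is hypothetical.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into chunks of at most `chunk_size` characters.

    Separators are tried in order; a later separator is used only for
    pieces that are still too long after splitting on the earlier ones.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return [chunk for chunk in chunks if chunk.strip()]


demo = "aaa bbb ccc. ddd eee\n\nfff ggg hhh iii"
for chunk in recursive_split(demo, ["\n\n", "\n", ". ", " ", ""], chunk_size=10):
    print(repr(chunk))
```

In the demo, the paragraph break is tried first, then the sentence break, and only the pieces that are still longer than 10 characters are split on spaces.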

In [22]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=500,
    chunk_overlap=50
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))


print("Chunk 1:", word_wrap(character_split_texts[0]))
print("\nChunk 2:", word_wrap(character_split_texts[1]))
print(f"\nTotal chunks: {len(character_split_texts)}")

Chunk 1: 1 Dear shareholders, colleagues, customers, and partners:  
We are
living through a period of historic economic, societal, and
geopolitical change. The world in 2022 looks nothing like 
the world in
2019. As I write this, inflation is at a 40 -year high, supply chains
are stretched, and the war in Ukraine is 
ongoing. At the same time, we
are entering a technological era with the potential to power awesome
advancements

Chunk 2: across every sector of our economy and society. As the world’s largest
software company, this places us at a historic 
intersection of
opportunity and responsibility to the world around us.  
Our mission to
empower every person and every organization on the planet to achieve
more has never been more 
urgent or more necessary. For all the
uncertainty in the world, one thing is clear: People and organizations
in every

Total chunks: 691


In [23]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=50, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print("Chunk 1:", word_wrap(token_split_texts[0]))
print("\nChunk 2:", word_wrap(token_split_texts[1]))
print(f"\nTotal chunks: {len(token_split_texts)}")

Chunk 1: 1 dear shareholders, colleagues, customers, and partners : we are
living through a period of historic economic, societal, and
geopolitical change. the world in 2022 looks nothing like the world in
2019. as i write this, inflation is at a 40 - year high, supply chains
are stretched, and the war in ukraine is ongoing. at the same time, we
are entering a technological era with the potential to power awesome
advancements

Chunk 2: across every sector of our economy and society. as the world ’ s
largest software company, this places us at a historic intersection of
opportunity and responsibility to the world around us. our mission to
empower every person and every organization on the planet to achieve
more has never been more urgent or more necessary. for all the
uncertainty in the world, one thing is clear : people and organizations
in every

Total chunks: 691
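
The chunk count is unchanged after this step, which is consistent with each 500-character chunk already fitting inside a single 256-token window. To illustrate what token-level splitting with overlap does when a text is longer than the window, here is a sliding-window sketch. It uses whitespace-separated words as stand-in tokens, which is an assumption: the real splitter uses the SentenceTransformer model's own tokenizer. The function name `split_tokens_with_overlap` is hypothetical.

```python
def split_tokens_with_overlap(tokens, tokens_per_chunk, chunk_overlap):
    """Return windows of `tokens_per_chunk` tokens that share `chunk_overlap` tokens."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in split_tokens_with_overlap(words, tokens_per_chunk=4, chunk_overlap=1):
    print(" ".join(chunk))
# the quick brown fox
# fox jumps over the
# the lazy dog again
```

The repeated token at each boundary is the overlap: it gives neighbouring chunks some shared context so that a sentence cut at a boundary is not lost entirely.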


In [36]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

## Delete the existing collection
If you need to create a new collection with the same name, you must first delete the existing one.

In [26]:
# The client must exist before its collections can be listed.
chroma_client = chromadb.Client()
collections = chroma_client.list_collections()
print("Existing collections:", [c.name for c in collections])

Existing collections: ['microsoft_annual_report_2022']


In [27]:
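# Guard the delete: removing a collection that does not exist raises an error.
if "microsoft_annual_report_2022" in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name="microsoft_annual_report_2022")
    print("Collection 'microsoft_annual_report_2022' deleted.")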

Collection 'microsoft_annual_report_2022' deleted.


## Create collection

In [28]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

691

In [29]:
results = chroma_collection.get()

In [30]:
print("Documents:", results['documents'][:3]) 

Documents: ['1 dear shareholders, colleagues, customers, and partners : we are living through a period of historic economic, societal, and geopolitical change. the world in 2022 looks nothing like the world in 2019. as i write this, inflation is at a 40 - year high, supply chains are stretched, and the war in ukraine is ongoing. at the same time, we are entering a technological era with the potential to power awesome advancements', 'across every sector of our economy and society. as the world ’ s largest software company, this places us at a historic intersection of opportunity and responsibility to the world around us. our mission to empower every person and every organization on the planet to achieve more has never been more urgent or more necessary. for all the uncertainty in the world, one thing is clear : people and organizations in every', 'industry are increasingly looking to digital technology to overcome today ’ s challenges and emerge stronger. and no company is better positi

In [32]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print("the document is: ", word_wrap(document))
    print('\n')

the document is  total cost of revenue 62, 650 52, 232 46, 078 gross margin 135, 620
115, 856 96, 937 research and development 24, 512 20, 716 19, 269 sales
and marketing 21, 825 20, 117 19, 598 general and administrative 5, 900
5, 107 5, 111 operating income 83, 383 69, 916 52, 959 other income,
net 333 1, 186 77 income before income taxes 83, 716 71, 102 53, 036
provision for income taxes 10, 978 9, 831 8, 755


the document is  annually with revenue recognized upfront.


the document is  balance, beginning of period $ 44, 141 deferral of revenue 110, 455
recognition of unearned revenue ( 106, 188 ) balance, end of period $
48, 408 revenue allocated to remaining performance obligations, which
includes unearned revenue and amounts that will be invoiced and
recognized as revenue in future periods, was $ 193 billion as of june
30, 2022, of which $ 189 billion is


the document is  47 financial statements and supplementary data income statements ( in
millions, except per share amounts ) 
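
Conceptually, the query above embeds the question and returns the stored chunks whose embeddings are closest to it. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings have hundreds of dimensions, and the collection may be configured with a different distance metric. The names `cosine_similarity` and `top_k` are hypothetical helpers, not Chroma APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, document_vectors, k):
    """Return the ids of the `k` documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in document_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up vectors standing in for chunk embeddings.
toy_vectors = {
    "revenue": [0.9, 0.1, 0.0],
    "employees": [0.1, 0.9, 0.1],
    "income": [0.7, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))  # ['revenue', 'income']
```

Similarity of this kind is purely geometric, which helps explain why the retrieved chunks above include passages that mention revenue without stating the total.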

In [33]:
import os
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read the local .env file into the environment

# The client instance takes its key directly, so no module-level `openai.api_key` assignment is needed.
openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
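
Reading `os.environ['OPENAI_API_KEY']` raises a bare `KeyError` when the variable is missing, for example when no `.env` file was found. A small sketch of a friendlier check is given below. The helper name `require_env` and the demo variable `DEMO_API_KEY` are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Demo with a throwaway variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "demo-value"
print(require_env("DEMO_API_KEY"))  # demo-value
```

In this notebook the same check would be applied to `OPENAI_API_KEY` before the client is created.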

In [34]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # Adjacent string literals are concatenated, so the first one must end with a space.
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}\n\nInformation: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
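
When an answer looks wrong, it helps to see exactly what the model receives. One option is to build the messages in a separate function that makes no API call, so the prompt can be printed and checked. The sketch below assumes a shortened system prompt, and `build_rag_messages` is a hypothetical helper rather than part of the notebook.

```python
def build_rag_messages(query, retrieved_documents):
    """Assemble the chat messages for a RAG call without contacting any API."""
    information = "\n\n".join(retrieved_documents)
    system_prompt = (
        "You are a helpful expert financial research assistant. "
        "Answer the user's question using only the information provided."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}\n\nInformation: {information}"},
    ]


messages = build_rag_messages(
    "What was the total revenue?",
    ["chunk one about revenue", "chunk two about costs"],
)
for message in messages:
    print(message["role"], "->", message["content"])
```

Printing the messages this way shows whether the retrieved chunks actually contain the figure being asked about before any tokens are spent on a completion.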

In [35]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output))


The total revenue for the year ended June 30, 2022, was $198,270
million.
