# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings pop up when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or any operation that uses generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

In [1]:
#!pip install pypdf
#!pip install sentence-transformers
#!pip install ollama
#!pip install chromadb
#!pip install langchain langchain-community

In [3]:
from pypdf import PdfReader

# Local path to the report; adjust for your machine. A raw string takes single backslashes.
path = r"D:\Coding_Stuff\GitHub\DLAI\Advanced Retrieval with ChromaDB\2022_Annual_Report.pdf"
reader = PdfReader(path)
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

print(pdf_texts[0][:300])

1 Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2019. As I write this, inflation is at a 40 -year high, supply chains are stretched, and the war in U


In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

# Split the text into chunks on natural boundaries
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order: "\n\n" first, then "\n", then ". ", and so on
    chunk_size=1000,  # maximum number of characters in a chunk
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(character_split_texts[90])  # print one example chunk (index 90)
print(f"\nTotal chunks: {len(character_split_texts)}")

Windows Server, and revenue is reported along with the associated server product.  
Nuance and GitHub include both cloud and on -premises offerings. Nuance provides healthcare and enterprise AI 
solutions. GitHub provides a collaboration platform and code hosting service for developers.  
Enterprise Services  
Enterprise Services, including Enterprise Support Services, Microsoft Consulting Services, and Nuance Professional 
Services, assist customers in developing, deploying, and managing Microsoft server solutions, Microsoft desktop 
solutions, and Nuance conversational AI and ambient intelligent solutions, along with providing training and certification to  
developers and IT professionals on various Microsoft products.  
Competition  
Azure faces diverse competition from companies such as Amazon, Google, IBM, Oracle, VMware, and open source 
offerings. Our Enterprise Mobility + Security offerings also compete with products from a range of competitors including

Total chunks: 347
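
The separator list above is tried in order: split on the coarsest separator first, and fall back to finer ones only for pieces that are still too long. The following is a minimal pure-Python sketch of that idea, not LangChain's actual implementation; the function name `recursive_split` is hypothetical, and overlap handling is left out.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if len(part) > chunk_size:
            # Too long on its own: flush the buffer, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, rest, chunk_size))
            continue
        # Greedily merge neighbouring parts while the result still fits.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


example = recursive_split("aaaa bbbb\n\ncccc dddd eeee", ["\n\n", " ", ""], 10)
print(example)
```

The first paragraph fits within the limit and stays whole, while the second is too long and is split again on spaces.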


In [5]:
import warnings
warnings.filterwarnings('ignore')

token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(token_split_texts[10])
print(f"\nTotal chunks: {len(token_split_texts)}")

increased, due in large part to significant global datacenter expansions and the growth in xbox sales and usage. despite these increases, we remain dedicated to achieving a net - zero future. we recognize that progress won ’ t always be linear, and the rate at which we can implement emissions reductions is dependent on many factors that can fluctuate over time. on the path to becoming water positive, we invested in 21 water replenishment projects that are expected to generate over 1. 3 million cubic meters of volumetric benefits in nine water basins around the world. progress toward our zero waste commitment included diverting more than 15, 200 metric tons of solid waste otherwise headed to landfills and incinerators, as well as launching new circular centers to increase reuse and reduce e - waste at our datacenters. we contracted to protect over 17, 000 acres of land ( 50 % more than the land we use to operate ), thus achieving our

Total chunks: 349
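
The token splitter caps each chunk by token count rather than character count. The lowercased text with spaces around punctuation in the output above is consistent with chunks being decoded back from tokens. A rough sketch of the windowing step follows, assuming whitespace-separated words as a stand-in for the model's real tokenizer; `split_by_token_window` is a hypothetical helper, not part of LangChain.

```python
def split_by_token_window(text, tokens_per_chunk, tokenize=str.split, detokenize=" ".join):
    """Cut text into consecutive windows of at most tokens_per_chunk tokens.

    tokenize/detokenize default to whitespace splitting and joining, which is
    only an approximation of a model tokenizer (assumption for illustration).
    """
    tokens = tokenize(text)
    return [
        detokenize(tokens[i:i + tokens_per_chunk])
        for i in range(0, len(tokens), tokens_per_chunk)
    ]


windows = split_by_token_window("one two three four five", tokens_per_chunk=2)
print(windows)
```

With the real splitter, most 1000-character chunks already fit in one 256-token window, which matches the chunk count rising only from 347 to 349.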


In [6]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

[[0.042562663555145264, 0.03321182355284691, 0.030340107157826424, -0.034866590052843094, 0.06841651350259781, -0.08090908825397491, -0.015474383719265461, -0.0014509352622553706, -0.016744472086429596, 0.06770770251750946, -0.05054137110710144, -0.04919538274407387, 0.05139993876218796, 0.09192726761102676, -0.07177837938070297, 0.03951967507600784, -0.01283353753387928, -0.024947477504611015, -0.046228617429733276, -0.024357546120882034, 0.03394966199994087, 0.025502420961856842, 0.02731713280081749, -0.004126240964978933, -0.03633835166692734, 0.003690901678055525, -0.027430450543761253, 0.004796717781573534, -0.028896238654851913, -0.018870694562792778, 0.03666628897190094, 0.02569584734737873, 0.03131282702088356, -0.06393436342477798, 0.05394405126571655, 0.08225347846746445, -0.041756849735975266, -0.006995786912739277, -0.023486044257879257, -0.030747946351766586, -0.0029792170971632004, -0.07790940254926682, 0.00935310684144497, 0.0031628680881112814, -0.022257015109062195, -0

In [12]:
chroma_client = chromadb.PersistentClient('Microsoft_Report')
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function,
                                                   get_or_create=True)

# Note: only the first 166 chunks are indexed here, so text from later chunks cannot be retrieved
ids = [str(i) for i in range(len(token_split_texts[:166]))]

chroma_collection.add(ids=ids, documents=token_split_texts[:166])
chroma_collection.count()

166

In [13]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=2)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(document)
    print('\n')

37 general and administrative expenses include payroll, employee benefits, stock - based compensation expense, and other headcount - related expenses associated with finance, legal, facilities, certain human resources and other administrative personnel, certain taxes, and legal and other administrative fees. general and administrative expenses increased $ 793 million or 16 % driven by investments in corporate functions. other income ( expense ), net the components of other income ( expense ), net were as follows : ( in millions ) year ended june 30, 2022 2021 interest and dividends income $ 2, 094 $ 2, 131 interest expense ( 2, 063 ) ( 2, 346 ) net recognized gains on investments 461 1, 232 net gains ( losses ) on derivatives ( 52 ) 17 net gains ( losses ) on foreign currency remeasurements ( 75 ) 54 other, net ( 32 ) 98 total $ 333 $ 1, 186


( in millions, except percentages ) 2022 2021 percentage change sales and marketing $ 21, 825 $ 20, 117 8 % as a percent of revenue 11 % 12 % ( 
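
Conceptually, the query step embeds the question and returns the stored chunks whose embeddings lie closest to it. The sketch below shows a brute-force version of that ranking using cosine similarity on toy 2-dimensional vectors. It is an illustration only: Chroma uses its own index and a configurable distance metric, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy vectors standing in for real embeddings (made up for illustration).
doc_vecs = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```

Similarity in embedding space does not guarantee that a chunk answers the question: both results above mention financial figures, yet neither states total revenue.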

In [17]:
# LangChain supports many other chat models. Here, we're using Ollama
from langchain_community.chat_models import ChatOllama
import ollama

# supports many more optional parameters. Hover on your `ChatOllama(...)`
# class to view the latest available supported parameters
chat = ChatOllama(model="llama3:text")
chat

ChatOllama(model='llama3:text')

In [20]:
def rag(query, retrieved_documents):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = ollama.chat(messages=messages, model="llama3:text")
    # response = chat(messages=messages)
    content = response['message']['content']
    return content
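
Building the messages in a separate function makes it possible to inspect the exact prompt without a running Ollama server. This is a sketch under that assumption: `SYSTEM_PROMPT` and `build_rag_messages` are hypothetical names, and the user-message layout is one reasonable choice rather than a required format.

```python
SYSTEM_PROMPT = (
    "You are a helpful expert financial research assistant. "
    "Your users are asking questions about information contained in an annual report. "
    "You will be shown the user's question, and the relevant information from the annual report. "
    "Answer the user's question using only this information."
)


def build_rag_messages(query, retrieved_documents):
    """Return the chat messages for a RAG call without contacting any model."""
    information = "\n\n".join(retrieved_documents)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Question: {query}\n\nInformation: {information}"},
    ]


messages = build_rag_messages("What was the total revenue?", ["chunk one", "chunk two"])
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list can be passed as the `messages` argument of the same `ollama.chat` call used in `rag`.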

In [21]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(output)

 general and administrative expenses include payroll, employee benefits, stock - based compensation expense, and other headcount - related expenses associated with finance, legal, facilities, certain human resources and other administrative personnel, certain taxes, and legal and other administrative fees. general and administrative expenses increased $ 793 million or 16 % driven by investments in corporate functions.
( in millions ) year ended june 30, 20xx 20yy 2022 2021 ( increase / decrease ) sales $ 198, 300 $ 195, 600 1 % revenue 197, 200 193, 900 2 % cost of sales ( 48, 400 ) ( 47, 000 ) 3% gross profit 148, 800 146, 900 1 % selling and marketing expenses ( 21, 825 ) ( 20, 117 ) 8 % general and administrative expenses ( 5, 900 ) ( 5, 107 ) 16 % research and development expenses ( 6, 600 ) ( 6, 700 ) ( -2% ) other income ( expense ), net 333 1, 186 ( ( 72 % ) ) interest and dividends income 2, 094 2, 131 ( ( 3 % ) ) interest expense ( 2, 063 ) ( 2, 346 ) ( ( 10 % ) ) recognized g