# Part 1 - Overview of embedding-based retrieval

Welcome! A few notes about the notebook:
 - Some warnings may appear while the cells run. They can safely be ignored.
 - Some operations, such as calling an LLM or generating data, give non-deterministic results, so it is perfectly normal if your output differs from your neighbor's.

Enjoy!

In [1]:
# helper_utils contains a few utility functions, just to simplify things a bit.
from helper_utils import word_wrap

![Simple architecture sketch](images/architecture.png)

As you may remember from a workshop last year, RAG is a powerful method for feeding LLM applications with proprietary data and improving accuracy. A quick explanation of the sketch:
1. We first take our raw data, whether it is structured in tables or unstructured in files such as PDFs, and convert it to a readable format, i.e. strings.
2. We then split these texts into chunks of a given size and run them through an embedder that turns them into vectors. These vectors hopefully capture semantic similarity. They are then stored in a vector database together with their metadata. Once all of this is done, the foundation is in place.
3. Queries come in from the other side. We can embed these as vectors too and then measure similarity. Semantically similar texts will hopefully be placed geometrically close to each other, so retrieving the most similar ones is a simple operation.
4. With the most similar texts in hand, we can synthesize a "smarter" query for the LLM. For example, we take the most similar text from the vector database and add it to the query as context before it is sent to an LLM for a response.
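Step 3 can be illustrated with a minimal sketch. It assumes cosine similarity as the measure (one common choice; a vector database may use a different metric) and uses made-up three-dimensional vectors in place of real embeddings. The function names `cosine_similarity` and `top_k` are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0: points almost the same way as the query
    [0.0, 1.0, 0.0],  # doc 1: orthogonal to the query
    [0.7, 0.3, 0.1],  # doc 2: fairly close to the query
]
query_vec = [1.0, 0.0, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)  # [0, 2]
```

A brute-force scan like this is fine for a few hundred chunks; vector databases exist to make the same lookup fast at larger scale.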

A short walkthrough in code follows.

Extract the text from the PDF.

In [2]:
from pypdf import PdfReader

reader = PdfReader("microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter out empty strings
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0]))

1 Dear shareholders, colleagues, customers, and partners:  
We are
living through a period of historic economic, societal, and
geopolitical change. The world in 2022 looks nothing like 
the world in
2019. As I write this, inflation is at a 40 -year high, supply chains
are stretched, and the war in Ukraine is 
ongoing. At the same time, we
are entering a technological era with the potential to power awesome
advancements 
across every sector of our economy and society. As the
world’s largest software company, this places us at a historic

intersection of opportunity and responsibility to the world around us.
 
Our mission to empower every person and every organization on the
planet to achieve more has never been more 
urgent or more necessary.
For all the uncertainty in the world, one thing is clear: People and
organizations in every 
industry are increasingly looking to digital
technology to overcome today’s challenges and emerge stronger. And no

company is better positioned to help th

Split the text into chunks. As we will come back to later, we do this because we do not want to bring too much into the synthesis step. Only what is relevant is useful; the rest is noise and can, in the worst case, make the response worse. We use both RecursiveCharacterTextSplitter and SentenceTransformersTokenTextSplitter to get each chunk into the desired format and token length. The latter matters because the embedding model accepts 256 tokens, and anything beyond that is truncated.

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [4]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(word_wrap(character_split_texts[10]))
print(f"\nTotal chunks: {len(character_split_texts)}")

increased, due in large part to significant global datacenter
expansions and the growth in Xbox sales and usage. Despite 
these
increases, we remain dedicated to achieving a net -zero future. We
recognize that progress won’t always be linear, 
and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time.  
On the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate 
over 1.3  million cubic meters of volumetric benefits in nine
water basins around the world. Progress toward our zero waste

commitment included diverting more than 15,200 metric tons of solid
waste otherwise headed to landfills and incinerators, 
as well as
launching new Circular Centers to increase reuse and reduce e -waste at
our datacenters.  
We contracted to protect over 17,000 acres of land
(50% more than the land we use to operate), thus achieving our

Total chunks: 347


Next, split by tokens so that every chunk fits within the embedding model's limit before we embed.

In [5]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}")



increased, due in large part to significant global datacenter
expansions and the growth in xbox sales and usage. despite these
increases, we remain dedicated to achieving a net - zero future. we
recognize that progress won ’ t always be linear, and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time. on the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate over 1. 3 million cubic meters of volumetric benefits in nine
water basins around the world. progress toward our zero waste
commitment included diverting more than 15, 200 metric tons of solid
waste otherwise headed to landfills and incinerators, as well as
launching new circular centers to increase reuse and reduce e - waste
at our datacenters. we contracted to protect over 17, 000 acres of land
( 50 % more than the land we use to operate ), thus achieving our

Total chunks: 349
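The count rising from 347 to 349 suggests that a couple of the character-based chunks exceeded the token budget and were split again. The idea of a token budget can be sketched as follows, under the loud assumption that whitespace-separated words stand in for the model's real tokens (the actual splitter uses the model's tokenizer, so its counts will differ). The function name `split_by_token_budget` is hypothetical.

```python
def split_by_token_budget(text, tokens_per_chunk=256):
    """Split text into consecutive chunks of at most tokens_per_chunk tokens.

    Assumption: whitespace-separated words stand in for the model's tokens.
    """
    words = text.split()
    return [
        " ".join(words[i:i + tokens_per_chunk])
        for i in range(0, len(words), tokens_per_chunk)
    ]

# 600 toy "tokens": too long for a single 256-token chunk.
sample = " ".join(f"w{i}" for i in range(600))
chunks = split_by_token_budget(sample, tokens_per_chunk=256)

print([len(c.split()) for c in chunks])  # [256, 256, 88]
```

Without this step, everything past the first 256 tokens of an oversized chunk would simply be dropped by the embedder and never become searchable.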


![Sentence Transformer](images/SentenceTransformer.png)

We use a simple sentence transformer model for the embeddings. It is essentially an extension of BERT: where BERT vectorizes each individual token and returns those vectors, the sentence transformer vectorizes a whole sentence into a single vector.

In [6]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

[[0.04256267473101616, 0.0332118384540081, 0.030340105295181274, -0.03486659750342369, 0.0684165358543396, -0.08090916275978088, -0.015474417246878147, -0.001450875774025917, -0.01674446277320385, 0.06770766526460648, -0.050541382282972336, -0.04919533431529999, 0.051399923861026764, 0.09192726761102676, -0.07177833467721939, 0.039519742131233215, -0.012833529151976109, -0.024947531521320343, -0.046228647232055664, -0.024357518181204796, 0.033949632197618484, 0.025502441450953484, 0.02731708437204361, -0.00412622420117259, -0.03633838891983032, 0.003690858604386449, -0.027430452406406403, 0.0047967275604605675, -0.028896227478981018, -0.01887073740363121, 0.036666277796030045, 0.02569585293531418, 0.031312838196754456, -0.06393437087535858, 0.053944025188684464, 0.08225345611572266, -0.04175688326358795, -0.006995797622948885, -0.023485984653234482, -0.030747996643185616, -0.002979220123961568, -0.07790941745042801, 0.009353134781122208, 0.0031628424767404795, -0.022257106378674507, -0
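How do per-token vectors become one vector per chunk? A minimal sketch, assuming mean pooling (a common choice for sentence transformer models; the pooling used by the default model here is not stated in this notebook). The token vectors are made up and the function name `mean_pool` is illustrative.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Hypothetical 2-dimensional token vectors for a three-token sentence.
token_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # [1.0, 1.0]
```

The output has the same dimension no matter how many tokens go in, which is what makes chunks of different lengths comparable.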

With the model loaded, all that remains is to set up a vector database and store the documents. For this workshop we have chosen ChromaDB, but there are countless others you can use.

In [7]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

349

With 349 chunks in the database, it is time to try it out.

In [8]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

revenue, classified by significant product and service offerings, was
as follows : ( in millions ) year ended june 30, 2022 2021 2020 server
products and cloud services $ 67, 321 $ 52, 589 $ 41, 379 office
products and cloud services 44, 862 39, 872 35, 316 windows 24, 761 22,
488 21, 510 gaming 16, 230 15, 370 11, 575 linkedin 13, 816 10, 289 8,
077 search and news advertising 11, 591 9, 267 8, 524 enterprise
services 7, 407 6, 943 6, 409 devices 6, 991 6, 791 6, 457 other 5, 291
4, 479 3, 768 total $ 198, 270 $ 168, 088 $ 143, 015 we have recast
certain previously reported amounts in the table above to conform to
the way we internally manage and monitor our business.


74 note 13 — unearned revenue unearned revenue by segment was as
follows : ( in millions ) june 30, 2022 2021 productivity and business
processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more
personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in
unearned revenue were as follows : ( in milli
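The `[0]` in `results['documents'][0]` is there because `query` takes a list of queries and returns one list of hits per query. The mock below imitates that nested shape for two queries with two hits each; all ids, texts and distances are made up for illustration.

```python
# Mock of the result shape for two queries with n_results=2 (values are made up).
mock_results = {
    "ids": [["12", "7"], ["3", "40"]],
    "documents": [["chunk a", "chunk b"], ["chunk c", "chunk d"]],
    "distances": [[0.41, 0.58], [0.33, 0.72]],
}

# Hits for the first (index 0) query, paired with their distances.
first_query_hits = list(
    zip(mock_results["documents"][0], mock_results["distances"][0])
)

for document, distance in first_query_hits:
    print(f"{distance:.2f}  {document}")
```

Printing the distance next to each document is a quick way to see how sharply relevance drops off between the first and last hit.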

As you can see above, most of the pieces appear relevant. Let's test this against the OpenAI API and see whether the response looks useful.

In [9]:
import os
import openai
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [10]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [11]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The total revenue for the year ended June 30, 2022, was $198,270
million.


I have not double-checked the answer against the report, but $198,270 million (about $198 billion) matches the total in the retrieved table, so it looks reasonable to me at least!
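One cheap sanity check is to add up the 2022 line items from the first retrieved chunk and compare the sum with the stated total. The figures below are copied by hand from that chunk (in millions of USD); the dictionary name `revenue_2022` is just illustrative.

```python
# Fiscal year 2022 revenue by offering, in millions of USD,
# copied by hand from the first retrieved chunk.
revenue_2022 = {
    "server products and cloud services": 67_321,
    "office products and cloud services": 44_862,
    "windows": 24_761,
    "gaming": 16_230,
    "linkedin": 13_816,
    "search and news advertising": 11_591,
    "enterprise services": 7_407,
    "devices": 6_991,
    "other": 5_291,
}

total_millions = sum(revenue_2022.values())
print(total_millions)          # 198270
print(total_millions / 1_000)  # 198.27, i.e. billions of USD
```

The line items add up to the total the model reported, so the answer is at least consistent with the retrieved context.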