## Del 4 - Cross-encoder re-ranking

![MultiQuery](images/Reranking.png)

Kort fortalt er reranking som ordet sier en rerangering av de mest relevante resultatene våre fra databasen.

![MultiQuery](images/CrossEncoder.png)

Som vist i bildet har vi i prinsippet embeddet hver chunk som vi splittet ut fra pdfen hver for seg og embedded spørringen vår til vektorer. Vi har så bare hentet de n nærmeste vektorene i databasen basert på cosinuslikhet. Måten Cross-Encoder måler likhet på er litt mer kompleks og nyansert og gir oss en ny skår vi kan rerangere resultatene våre på.

La oss se litt videre på det!

In [1]:
from helper_utils import load_chroma, word_wrap, project_embeddings
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
import numpy as np

In [2]:
embedding_function = SentenceTransformerEmbeddingFunction()

chroma_collection = load_chroma(filename='microsoft_annual_report_2022.pdf', collection_name='microsoft_annual_report_2022', embedding_function=embedding_function)
chroma_collection.count()



349

# Re-ranking the long tail

Vi prøver oss på et basic eksempel. Vi gjør en enkel spørring, men øker antallet resultater vi henter til 10 fra 5

In [3]:
query = "What has been the investment in research and development?"
results = chroma_collection.query(query_texts=query, n_results=10, include=['documents', 'embeddings'])

retrieved_documents = results['documents'][0]

for document in results['documents'][0]:
    print(word_wrap(document))
    print('')

• operating expenses increased $ 1. 5 billion or 14 % driven by
investments in gaming, search and news advertising, and windows
marketing. operating expenses research and development ( in millions,
except percentages ) 2022 2021 percentage change research and
development $ 24, 512 $ 20, 716 18 % as a percent of revenue 12 % 12 %
0ppt research and development expenses include payroll, employee
benefits, stock - based compensation expense, and other headcount -
related expenses associated with product development. research and
development expenses also include third - party development and
programming costs, localization costs incurred to translate software
for international markets, and the amortization of purchased software
code and services content. research and development expenses increased
$ 3. 8 billion or 18 % driven by investments in cloud engineering,
gaming, and linkedin. sales and marketing

competitive in local markets and enables us to continue to attract top
talent from ac

Oppretter så CrossEncoder modellen vår og bruker den til å skåre resultatene våre fra forrige spørring.


Finnes for øvrig mange forskjellige modeller man kan bruke til det her så her er det bare å leke seg :)

In [9]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [5]:
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)
print("Scores:")
for score in scores:
    print(score)

Scores:
0.98693526
2.6445775
-0.26803187
-10.731592
-7.706605
-5.6469955
-4.2970333
-10.933231
-7.038426
-7.3246937


In [6]:
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o+1)

New Ordering:
2
1
3
7
6
9
10
5
4
8


Som vi kan se av den nye rangering her så var kanskje ikke det nærmeste resultatet basert på cosinuslikhet det mest relevante. Det nest nærmeste er nå på toppen. Videre kan også se at resultater vi tidligere ikke hentet i 6 og 7 kanskje er mer relevante enn 4 og 5.

# Re-ranking with Query Expansion

Som du kanskje allerede har begynt å tenke på er dette en teknikk vi kan bruke for å "kvitte" oss med nedsiden fra forrige del.

I cellene under har jeg en spørring og 5 pregenererte spørringer vi kan prøve med for å se the in action!

In [31]:
original_query = "What were the most important factors that contributed to increases in revenue?"
generated_queries = [
    "What were the major drivers of revenue growth?",
    "Were there any new product launches that contributed to the increase in revenue?",
    "Did any changes in pricing or promotions impact the revenue growth?",
    "What were the key market trends that facilitated the increase in revenue?",
    "Did any acquisitions or partnerships contribute to the revenue growth?"
]

In [32]:
queries = [original_query] + generated_queries

results = chroma_collection.query(query_texts=queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']

In [33]:
# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

unique_documents = list(unique_documents)

In [34]:
pairs = []
for doc in unique_documents:
    pairs.append([original_query, doc])

In [35]:
scores = cross_encoder.predict(pairs)


In [36]:
print("Scores:")
for score in scores:
    print(score)

Scores:
-10.042843
-5.2747493
-10.711213
-9.918428
-11.079268
-4.341771
-7.490656
-4.651889
-9.8078785
-9.768024
-3.7681506
-6.9020934
-3.794862
-7.754101
-1.1369964
-10.08394
-4.818484
-10.000137
-7.917178
-10.148884
-9.357723
-8.505106
-5.141831


In [37]:
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o)

New Ordering:
14
10
12
5
7
16
22
1
11
6
13
18
21
20
9
8
3
17
0
15
19
2
4


Litt vanskelig å sammenligne nummeringen her og på den forrige, men som vi kan se er det ganske store forskjeller på hva rerangering svarer oss tilbake med som mest relevant. 13 er på første som kanskje aldri hadde blitt med uten query expansion.

La oss sjekke svaret nå!

In [38]:
import os
import openai
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [39]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report."
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [40]:
rearranged_documents = [pairs[13][1], 
                        pairs[20][1], 
                        pairs[11][1], 
                        pairs[0][1],
                        pairs[22][1]
                        ]

output = rag(query=original_query, retrieved_documents=rearranged_documents)

print(word_wrap(output))

The most important factors that contributed to the increases in revenue
were the following:

1. Sales and marketing expenses increased by $1.7
billion or 8%, driven by investments in commercial sales and
LinkedIn.
2. Intelligent cloud revenue increased by $15.2 billion or
25%, with server products and cloud services revenue growing by $14.7
billion or 28%, driven by Azure and other cloud services.
3. Dynamics
products and cloud services revenue increased by 25%, driven by
Dynamics 365 growth of 39%.
4. Gross margin increased by $7.3 billion
or 17%, driven by growth in Office 365 commercial and LinkedIn.
5.
Operating income increased by $5.3 billion or 22%.
6. Operating
expenses increased by $2.0 billion or 11%, driven by investments in
LinkedIn and cloud engineering.


Det ser jo veldig lovende ut. Dere kan jo sette opp en enkel rag-loop og sammenligne med om dette svaret virker bedre eller teste med noen av de andre spørringen fra tidligere deler 