## Setup

Import the necessary modules.

In [None]:
from ranking import rerank_documents
import pandas as pd
import os

## Configuration

Set up your Google Cloud project ID.

In [None]:
# Replace with your actual project ID
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT", "your-project-id")
print(f"Using project: {PROJECT_ID}")
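
The default above silently falls back to the placeholder `your-project-id` when `GOOGLE_CLOUD_PROJECT` is unset, which would only surface later as an API error. A minimal sketch of a fail-fast check follows; `resolve_project_id` and `PLACEHOLDER_PROJECT_ID` are hypothetical names introduced for illustration and are not part of the `ranking` module.

```python
import os

# Hypothetical constant mirroring the placeholder default used in this notebook
PLACEHOLDER_PROJECT_ID = "your-project-id"


def resolve_project_id(env=None):
    """Return the project ID from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT", PLACEHOLDER_PROJECT_ID).strip()
    if not project_id or project_id == PLACEHOLDER_PROJECT_ID:
        raise ValueError(
            "Set GOOGLE_CLOUD_PROJECT (or edit PROJECT_ID) before calling the ranker."
        )
    return project_id
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.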

## Example Question

We'll use the same complex question about dokumentavgift, the Norwegian document duty on transfers of real property.

In [None]:
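# English translation of the query (the query itself stays in Norwegian):
# "Hi, I'm wondering when you don't have to pay document duty when a home or
# property is transferred, for example if cohabitants split up or one of them
# dies, how it works if you inherit real property, and whether it also applies
# if you receive property through a will in a private settlement of the estate?"
question = "Hei, jeg lurer på når man slipper å betale dokumentavgift hvis bolig eller eiendom overføres, for eksempel hvis samboere går fra hverandre eller en av dem dør, hvordan det er hvis man arver fast eiendom, og om det også gjelder dersom man får eiendom gjennom testament ved privat skifte?"

print("Question:", question)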

## Load Retrieved Documents

This step assumes you already have a DataFrame of documents retrieved by your initial search.
The DataFrame should have the columns `id`, `content`, `embedding`, `metadata`, and `created_at`.

In [None]:
# Load your retrieved chunks here, for example from a JSON Lines file:
# df = pd.read_json("path/to/your/chunks.jsonl", lines=True)

# The rest of this notebook assumes `df` is already defined;
# the next line raises a NameError if it is not.
# Display the first few rows
df.head(5)
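
Since `df` is not created in this notebook, it can help to verify that each record in the JSON Lines file carries the expected columns before loading it. The sketch below is a plain-Python check, assuming the file holds one JSON object per line; `missing_columns` and the sample records are hypothetical placeholders, not real retrieval output.

```python
import json

# Columns this notebook expects in the retrieved-documents DataFrame
REQUIRED_COLUMNS = ("id", "content", "embedding", "metadata", "created_at")


def missing_columns(jsonl_text):
    """Map 1-based line number -> list of required columns absent on that line."""
    problems = {}
    for line_no, line in enumerate(jsonl_text.splitlines(), 1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        missing = [col for col in REQUIRED_COLUMNS if col not in record]
        if missing:
            problems[line_no] = missing
    return problems


# Placeholder records for illustration only (not real retrieval output)
sample = "\n".join([
    json.dumps({"id": "doc-1", "content": "placeholder text", "embedding": [0.1, 0.2],
                "metadata": {"title": "Placeholder"}, "created_at": "2024-01-01"}),
    json.dumps({"id": "doc-2", "content": "placeholder text"}),
])
print(missing_columns(sample))  # {2: ['embedding', 'metadata', 'created_at']}
```

An empty result means every line has all five columns and the file should load into the shape the later cells expect.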

## Prepare Documents for Re-ranking

Convert the DataFrame rows into a list of dictionaries with the required format:
- `id`: unique identifier
- `content`: the text content to rank
- `title` (optional): a short title or summary

In [None]:
# Convert top N retrieved documents to the format expected by the ranker
# Let's say you retrieved the top 20 documents from your initial search
top_n = 20

docs_to_rank = []
for _, row in df.head(top_n).iterrows():
    metadata = row.get("metadata")
    docs_to_rank.append({
        "id": row["id"],
        "content": row["content"],
        # Use an empty title when metadata is missing or not a dict
        "title": metadata.get("title", "") if isinstance(metadata, dict) else "",
    })

print(f"Prepared {len(docs_to_rank)} documents for re-ranking")
print("\nFirst document:")
print(f"ID: {docs_to_rank[0]['id']}")
print(f"Content preview: {docs_to_rank[0]['content'][:200]}...")
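
The conversion above can also be factored into a small helper that is easier to test on its own. This is a minimal sketch operating on plain dictionaries; `to_ranker_doc` is a hypothetical name, and the sample records are placeholders.

```python
def to_ranker_doc(record):
    """Build the {id, content, title} dict from one retrieved record."""
    metadata = record.get("metadata")
    # Only dict-shaped metadata can carry a title; anything else yields ""
    title = metadata.get("title", "") if isinstance(metadata, dict) else ""
    return {"id": record["id"], "content": record["content"], "title": title}


# Placeholder records for illustration only
records = [
    {"id": "a", "content": "first chunk", "metadata": {"title": "Intro"}},
    {"id": "b", "content": "second chunk", "metadata": None},
]
docs = [to_ranker_doc(r) for r in records]
```

With a DataFrame, the same helper could be applied as `[to_ranker_doc(r) for r in df.head(top_n).to_dict("records")]`.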

## Re-rank Documents

Now we'll use the Vertex AI Ranking API to re-rank these documents based on their relevance to our query.

In [None]:
# Re-rank the documents
ranked_docs, elapsed_time = rerank_documents(
    project_id=PROJECT_ID,
    query=question,
    docs=docs_to_rank,
    model="semantic-ranker-default@latest",
    top_n=10  # Return top 10 most relevant documents
)

print(f"Re-ranking completed in {elapsed_time:.2f} seconds")
print(f"Returned {len(ranked_docs)} documents")
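
To exercise the rest of the notebook without network access or credentials, a local stand-in with a similar call shape can be useful. The sketch below is not the Vertex AI ranker: `fake_rerank_documents` is a hypothetical helper that scores documents by simple token overlap with the query, and its return shape (a list of dicts with a `rerank_score` key, plus the elapsed time) is inferred from how this notebook uses `rerank_documents`.

```python
import re
import time


def _tokens(text):
    """Lowercase word tokens as a set (Unicode-aware, so æ/ø/å are kept)."""
    return set(re.findall(r"\w+", text.lower()))


def fake_rerank_documents(query, docs, top_n=10):
    """Offline stand-in: rank docs by the fraction of query tokens they contain."""
    start = time.perf_counter()
    query_tokens = _tokens(query)
    scored = []
    for doc in docs:
        overlap = query_tokens & _tokens(doc["content"])
        score = len(overlap) / len(query_tokens) if query_tokens else 0.0
        scored.append({**doc, "rerank_score": score})
    # list.sort is stable, so ties keep their original retrieval order
    scored.sort(key=lambda d: d["rerank_score"], reverse=True)
    elapsed = time.perf_counter() - start
    return scored[:top_n], elapsed
```

Scores from this stand-in say nothing about how the real ranker would order the same documents; it only lets the display and comparison cells run end to end.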

## View Re-ranked Results

Let's examine the top re-ranked documents and their relevance scores.

In [None]:
print("Top 5 re-ranked documents:\n")
print("=" * 80)

for i, doc in enumerate(ranked_docs[:5], 1):
    print(f"\nRank {i}:")
    print(f"ID: {doc['id']}")
    print(f"Re-rank Score: {doc['rerank_score']:.4f}")
    print(f"Content preview: {doc['content'][:300]}...")
    print("-" * 80)
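
If low-scoring documents should not reach the LLM at all, a score cutoff can be applied after re-ranking. This is a minimal sketch; `filter_by_score` is a hypothetical helper and the `0.5` cutoff is an arbitrary illustration rather than a recommended value, so tune it against your own data.

```python
def filter_by_score(ranked_docs, min_score):
    """Keep only documents whose rerank_score is at least min_score, preserving order."""
    return [doc for doc in ranked_docs if doc["rerank_score"] >= min_score]


# Placeholder scores for illustration only
example = [
    {"id": "a", "rerank_score": 0.91},
    {"id": "b", "rerank_score": 0.48},
    {"id": "c", "rerank_score": 0.12},
]
kept = filter_by_score(example, min_score=0.5)
```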

## Compare Before and After

Let's create a comparison view showing how document rankings changed.

In [None]:
# Create a comparison DataFrame
# Map each document ID to its 1-based position in the original retrieval order
original_rank = {d['id']: i for i, d in enumerate(docs_to_rank, 1)}

comparison_data = []

for new_rank, doc in enumerate(ranked_docs, 1):
    doc_id = doc['id']
    old_rank = original_rank.get(doc_id)

    comparison_data.append({
        'Document ID': doc_id,
        'Original Rank': old_rank,
        'Re-ranked Position': new_rank,
        # Positive values mean the document moved up after re-ranking
        'Rank Change': old_rank - new_rank if old_rank is not None else None,
        'Re-rank Score': doc['rerank_score'],
        'Content Preview': doc['content'][:100] + '...'
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df

## Analyze Ranking Changes

Let's see which documents moved up or down the most.

In [None]:
# Documents that improved the most
improved = comparison_df[comparison_df['Rank Change'] > 0].sort_values('Rank Change', ascending=False)

print("Documents that improved most (moved up):")
print(improved[['Document ID', 'Original Rank', 'Re-ranked Position', 'Rank Change', 'Re-rank Score']].head())

print("\n" + "=" * 80 + "\n")

# Documents that dropped
dropped = comparison_df[comparison_df['Rank Change'] < 0].sort_values('Rank Change')

print("Documents that dropped (moved down):")
print(dropped[['Document ID', 'Original Rank', 'Re-ranked Position', 'Rank Change', 'Re-rank Score']].head())
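
A compact summary can complement the two tables above. The sketch below counts how many documents moved up, moved down, or stayed in place, and reports the mean absolute shift; `summarize_rank_changes` is a hypothetical helper that takes the values of the `Rank Change` column as a plain list.

```python
def summarize_rank_changes(rank_changes):
    """Summarize rank shifts; positive = moved up, negative = moved down."""
    # Skip missing values (None, or NaN once pandas has converted the column)
    changes = [c for c in rank_changes if c is not None and c == c]
    moved_up = sum(1 for c in changes if c > 0)
    moved_down = sum(1 for c in changes if c < 0)
    mean_abs = sum(abs(c) for c in changes) / len(changes) if changes else 0.0
    return {
        "moved_up": moved_up,
        "moved_down": moved_down,
        "unchanged": len(changes) - moved_up - moved_down,
        "mean_abs_shift": mean_abs,
    }


# Placeholder rank changes for illustration only
summary = summarize_rank_changes([3, -1, 0, None, -2])
```

With the DataFrame above, this could be called as `summarize_rank_changes(comparison_df['Rank Change'].tolist())`.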

## Summary

Re-ranking provides several benefits:

- **Improved relevance**: The ranker scores each document against the query directly, which is typically more precise than vector similarity alone
- **Query-aware**: Takes into account the specific wording and intent of the query
- **Cost-effective**: Only a small candidate set is re-ranked (e.g., the top 20 to 200 results), not the whole corpus

The re-ranked documents can now be used as context for your LLM to generate a more accurate answer.
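
As one way to hand the re-ranked documents to an LLM, the sketch below concatenates them in rank order until a character budget is reached. `build_context`, the `[id]` header format, and the `max_chars` budget are assumptions for illustration; a real pipeline would more likely budget by tokens.

```python
def build_context(ranked_docs, max_chars=4000):
    """Join documents in rank order, stopping before the budget is exceeded."""
    parts = []
    used = 0
    for doc in ranked_docs:
        block = f"[{doc['id']}]\n{doc['content']}"
        # +2 accounts for the blank-line separator between blocks
        cost = len(block) + (2 if parts else 0)
        if used + cost > max_chars:
            break  # lower-ranked documents are dropped whole, never truncated
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```

Stopping at the first document that does not fit keeps the context aligned with the ranking: everything included outranks everything left out.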