# Personalized Query Rewriter: Demonstration

This notebook provides a demonstration of the personalized query rewriter system. We will walk through three key parts:

1.  **The Dataset:** A look at the synthetic user data and query logs.
2.  **The Rewriter:** A live example of how the query rewriter personalizes a query for a specific user.
3.  **The Evaluation:** A summary of the evaluation results, comparing the rewritten queries to the ground truth.

In [2]:
import json
import pandas as pd
from src.rewriter import SimpleQueryRewriter
from src.evaluation import Evaluator

# Set pandas display options for better readability
pd.set_option('display.max_colwidth', 100)


## Part 1: The Synthetic Dataset

To build and evaluate our system, we first generate a synthetic dataset. This dataset includes user profiles with defined `preferences` and a list of `queries`. Each query has an `original_query` and a `ground_truth_rewrite`, which serves as our ideal target for personalization.
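The keys the notebook reads (`users`, `queries`, `name`, `preferences`, and the three query fields) imply a particular layout for `data/synthetic_data.json`. The following is a minimal sketch of that inferred layout; the real file may contain more fields, and the `validate_dataset` helper is hypothetical, not part of the project.

```python
# Layout inferred from the keys this notebook reads; the real file may carry more fields.
sample_data = {
    "users": {
        "user_101": {
            "name": "Python Developer",
            "preferences": ["python", "api", "backend", "performance", "docker"],
        },
    },
    "queries": [
        {
            "user_id": "user_101",
            "original_query": "how to read a file",
            "ground_truth_rewrite": "how to read a file in python with performance in mind",
        },
    ],
}


def validate_dataset(data):
    """Hypothetical helper: return a list of problems found in the dataset."""
    problems = []
    required = ("user_id", "original_query", "ground_truth_rewrite")
    for i, query in enumerate(data.get("queries", [])):
        missing = [key for key in required if key not in query]
        if missing:
            problems.append(f"query {i} is missing {missing}")
        elif query["user_id"] not in data.get("users", {}):
            problems.append(f"query {i} references unknown user {query['user_id']}")
    return problems


# An empty list means every query has the expected fields and a known user.
print(validate_dataset(sample_data))
```

A check like this is cheap to run before evaluation, since a query that points at a missing user would otherwise only surface as an error inside the rewriter.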

In [4]:
# Load the dataset from the JSON file
with open('data/synthetic_data.json', 'r') as f:
    data = json.load(f)

# Display the user profiles
print("--- User Profiles ---")
for user_id, profile in data['users'].items():
    print(f"ID: {user_id}, Name: {profile['name']}, Preferences: {profile['preferences']}")

print("\n--- Sample Queries ---")
# Display the queries in a structured DataFrame
queries_df = pd.DataFrame(data['queries'])
display(queries_df.head())

--- User Profiles ---
ID: user_101, Name: Python Developer, Preferences: ['python', 'api', 'backend', 'performance', 'docker']
ID: user_102, Name: Data Analyst, Preferences: ['pandas', 'sql', 'statistics', 'visualization', 'matplotlib']

--- Sample Queries ---


| | user_id | original_query | ground_truth_rewrite |
| --- | --- | --- | --- |
| 0 | user_101 | how to read a file | how to read a file in python with performance in mind |
| 1 | user_101 | best way to build a web service | best way to build a python backend api with docker |
| 2 | user_102 | clean up my dataset | how to clean up a dataset using pandas |
| 3 | user_102 | show trends in sales data | how to show trends in sales data with matplotlib visualization |


## Part 2: The Query Rewriter in Action

Our rewriter (`SimpleQueryRewriter`) uses a straightforward, rule-based approach.

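**Assumption:** Personalization can be achieved by appending keywords from a user's profile to their original query. As the output below shows, the current implementation appends every preference keyword, in profile order, without filtering for relevance to the query.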
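As a rough illustration of that assumption, keyword appending can be sketched in a few lines. This is not the source of `SimpleQueryRewriter`; `append_preferences` is a hypothetical stand-in that takes the preference list directly instead of looking it up by user ID.

```python
def append_preferences(query, preferences):
    """Hypothetical stand-in: append every preference keyword to the query."""
    return " ".join([query.strip(), *preferences])


prefs = ["python", "api", "backend", "performance", "docker"]
print(append_preferences("best way to build a web service", prefs))
```

The sketch makes the trade-off visible: the rewrite is trivial to compute and fully predictable, but it adds keywords whether or not they relate to the query.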

Let's see how it works with a live example. We'll take a generic query and personalize it for two different users.

In [5]:
# Initialize the rewriter
rewriter = SimpleQueryRewriter('data/synthetic_data.json')

# --- Example 1: Python Developer ---
user1_id = "user_101"
user1_query = "best way to build a web service"
user1_prefs = data['users'][user1_id]['preferences']
rewritten1 = rewriter.rewrite_query(user1_id, user1_query)

print("--- Personalizing for a Python Developer ---")
print(f"User Preferences: {user1_prefs}")
print(f"Original Query:    '{user1_query}'")
print(f"Rewritten Query:   '{rewritten1}'")
print("-" * 40)

# --- Example 2: Data Analyst ---
user2_id = "user_102"
user2_query = "show trends in sales data"
user2_prefs = data['users'][user2_id]['preferences']
rewritten2 = rewriter.rewrite_query(user2_id, user2_query)

print("--- Personalizing for a Data Analyst ---")
print(f"User Preferences: {user2_prefs}")
print(f"Original Query:    '{user2_query}'")
print(f"Rewritten Query:   '{rewritten2}'")

--- Personalizing for a Python Developer ---
User Preferences: ['python', 'api', 'backend', 'performance', 'docker']
Original Query:    'best way to build a web service'
Rewritten Query:   'best way to build a web service python api backend performance docker'
----------------------------------------
--- Personalizing for a Data Analyst ---
User Preferences: ['pandas', 'sql', 'statistics', 'visualization', 'matplotlib']
Original Query:    'show trends in sales data'
Rewritten Query:   'show trends in sales data pandas sql statistics visualization matplotlib'


## Part 3: Evaluation Results

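After running the rewriter on our entire dataset, we evaluate its performance using three metrics:

1.  **BERTScore (F1):** An automatic metric that measures the semantic similarity between our rewritten query and the ground truth. A higher score is better.
2.  **ROUGE-L:** An automatic metric based on the longest common subsequence of words shared by the rewritten query and the ground truth. A higher score is better.
3.  **Qualitative Heuristic:** A simple score from 1 to 3 that judges the degree of personalization.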

The full evaluation results, including these scores for every query, are stored in `data/evaluation_report.csv`.
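The report also carries a `rouge_l` column. As a rough guide to what such a score measures, here is a minimal sketch of a ROUGE-L style F1 built on the longest common subsequence (LCS) of tokens. The helper names `lcs_length` and `rouge_l_f1` are hypothetical, and lowercase whitespace tokenization is an assumption; the project's evaluation code may tokenize or weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L style F1 over lowercase whitespace tokens (assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "how to read a file python api backend performance docker",
    "how to read a file in python with performance in mind",
)
print(round(score, 6))
```

Under these assumptions the first query pair scores about 0.6667, in line with the `rouge_l` value for that row in the report below. The appended keywords that never appear in the ground truth lower precision, which is why the keyword-heavy rewrites stay well below 1.0.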

In [12]:
# Load the evaluation report into a pandas DataFrame
try:
    evaluation_df = pd.read_csv('data/evaluation_report.csv')
    print("--- Full Evaluation Report ---")
    display(evaluation_df)
except FileNotFoundError:
    print("Evaluation report not found. Please run 'python main.py' and 'python -m src.evaluation' first.")

--- Full Evaluation Report ---


| | original_query | generated_rewrite | ground_truth | bert_f1 | rouge_l | qualitative_score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | how to read a file | how to read a file python api backend performance docker | how to read a file in python with performance in mind | 0.911595 | 0.666667 | 3 |
| 1 | best way to build a web service | best way to build a web service python api backend performance docker | best way to build a python backend api with docker | 0.942462 | 0.727273 | 3 |
| 2 | clean up my dataset | clean up my dataset pandas sql statistics visualization matplotlib | how to clean up a dataset using pandas | 0.870507 | 0.470588 | 2 |
| 3 | show trends in sales data | show trends in sales data pandas sql statistics visualization matplotlib | how to show trends in sales data with matplotlib visualization | 0.915342 | 0.600000 | 3 |


In [13]:
# Calculate and display the average scores from the report
if 'evaluation_df' in locals():
    avg_bert_f1 = evaluation_df['bert_f1'].mean()
    avg_rouge_l = evaluation_df['rouge_l'].mean()
    avg_qual_score = evaluation_df['qualitative_score'].mean()

    print("--- Average Performance Metrics ---")
    print(f"Average BERT F1 Score:     {avg_bert_f1:.4f}")
    print(f"Average ROUGE-L Score:     {avg_rouge_l:.4f}")
    print(f"Average Qualitative Score: {avg_qual_score:.4f}")

--- Average Performance Metrics ---
Average BERT F1 Score:     0.9100
Average ROUGE-L Score:     0.6161
Average Qualitative Score: 2.7500


## Conclusion

This notebook demonstrated the core components of the query rewriting system. We saw how synthetic data can be used to simulate user profiles, how a simple rule-based rewriter can inject personalization, and how a combination of automatic and heuristic-based metrics can be used to evaluate the quality of the rewritten queries.

**Potential Next Steps:**

*   **Advanced Rewriting Logic:** Instead of simple keyword appending, a small language model could be used to generate more natural-sounding, context-aware rewrites.
*   **Dynamic Preferences:** User preferences could be dynamically updated based on their query history rather than being static.
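
The dynamic-preferences idea could start from something as simple as counting terms in a user's recent queries. The sketch below is purely illustrative: `update_preferences`, the stopword list, and the `top_k` cutoff are all assumptions, not part of the current system.

```python
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"how", "to", "a", "an", "the", "in", "my", "of", "with", "best", "way", "up"}


def update_preferences(current, query_history, top_k=5):
    """Hypothetical helper: merge frequent query terms into a preference list."""
    counts = Counter(
        token
        for query in query_history
        for token in query.lower().split()
        if token not in STOPWORDS
    )
    # Keep existing preferences first, then add the most frequent new terms
    # until the list holds top_k entries.
    merged = list(current)
    for term, _ in counts.most_common():
        if len(merged) >= top_k:
            break
        if term not in merged:
            merged.append(term)
    return merged


history = ["pandas groupby example", "plot pandas dataframe", "pandas merge"]
print(update_preferences(["sql"], history, top_k=3))
```

Existing preferences are never dropped here; a fuller version would likely decay old entries so the profile can follow a user's changing interests.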