# Real Estate Search and Personalization System

This notebook demonstrates a complete real estate search and personalization pipeline that:

1. **Extracts structured preferences** from natural language buyer responses
2. **Performs semantic search** using vector embeddings and metadata filtering
3. **Reranks results** using cross-encoder models for better relevance
4. **Personalizes listings** by emphasizing features that match buyer preferences

The system uses OpenAI's GPT models for preference extraction and personalization, OpenAI embeddings with ChromaDB for vector search, and a sentence-transformers cross-encoder for reranking, so that the final recommendations are both relevant and tailored to the buyer.


In [1]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage
from typing import List, Optional, Literal
from pydantic import BaseModel, Field
from langchain_openai import OpenAIEmbeddings
from enum import Enum
import json
import chromadb
import random
from langchain.output_parsers import PydanticOutputParser
from sentence_transformers import CrossEncoder
import numpy as np
import textwrap

# Load environment variables from .env file
load_dotenv()

True

## Buyer Questions and Responses

Define the questionnaire used to gather buyer preferences. The system asks five questions covering property size, priorities, amenities, transportation, and how urban the neighborhood should be. The answers are free-form natural language and are processed in the next steps to extract structured preferences.


In [2]:
questions = [   
    "How big do you want your house to be?",
    "What are 3 most important things for you in choosing this property?", 
    "Which amenities would you like?", 
    "Which transportation options are important to you?",
    "How urban do you want your neighborhood to be?",   
]

"""
answers = [
    "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
    "A quiet neighborhood, good local schools, and convenient shopping options.",
    "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
    "Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.",
    "A balance between suburban tranquility and access to urban amenities like restaurants and theaters."
]
"""

answers = [
    "A modern one-bedroom loft or studio, around 60 square meters, with an open layout and good soundproofing.",
    "Proximity to nightlife, stylish modern design, and a spacious living area for hosting friends.",
    "Rooftop access, a balcony with a city view, high-speed internet, and a well-equipped kitchen.",
    "Close to metro and night bus routes, easy access to downtown, and availability of bike-sharing nearby.",
    "Very urban — in the heart of the city, surrounded by bars, restaurants, and cultural spots."
]

## Buyer Preferences Data Model

Define the structured data model for buyer preferences using Pydantic. The model gives the LLM a fixed schema to fill in from the buyer's natural-language answers. It captures the relevant property criteria (location, size, price, features, transportation needs, and lifestyle priorities) that are later used to filter and rank search results.

In [3]:
class BuyerPreferences(BaseModel):
    """Structured buyer preferences extracted from natural language answers."""

    city: Optional[str] = Field(None, description="Preferred city if specified. None means any city")
    property_type: Optional[Literal["apartment", "house", "loft", "duplex", "studio"]] = Field(
        None, description="Preferred type of property. None means any property type"
    )

    bedrooms: Optional[int] = Field(None, description="Desired number of bedrooms. None means any number of bedrooms")
    bathrooms: Optional[int] = Field(None, description="Desired number of bathrooms. None means any number of bathrooms")
    min_size_sqm: Optional[int] = Field(None, description="Minimum preferred house size in square meters. None means no minimum size")
    max_price_pln: Optional[int] = Field(None, description="Maximum budget in PLN. None means no maximum price")

    must_haves: List[str] = Field(default_factory=list, description="Important required features, e.g., garden, garage")
    nice_to_haves: List[str] = Field(default_factory=list, description="Optional but desirable features")

    transport: List[str] = Field(default_factory=list, description="Preferred transport options, e.g., public transport, highway, bike lanes")
    priorities: List[str] = Field(default_factory=list, description="Top decision factors like quiet neighborhood, good schools")

    urban_level: Optional[Literal["low", "medium", "high"]] = Field(
        None, description="How urban the buyer wants the neighborhood to be"
    )

## Preference Extraction Process

Extract structured buyer preferences from natural language responses. The system analyzes the Q&A pairs to identify specific requirements like property type, size, location, must-have features, and lifestyle priorities.


In [4]:
llm = ChatOpenAI(
    model="gpt-4.1",
    api_key=os.getenv('OPENAI_API_KEY'),
    temperature=0.7
)

preference_llm = llm.with_structured_output(BuyerPreferences)

In [5]:
SYSTEM_PROMPT_PREFERENCES = """
You are an assistant that extracts structured buyer preferences for real estate listings
based on questions and answers provided by a potential home buyer.

You must return a JSON object strictly following the BuyerPreferences schema.

# RULES
1. Use ONLY the information provided in the user's answers. Do not invent details.
2. Estimate approximate numbers only if they are clearly implied.
3. Use meters squared (m²) for size, PLN for price.
4. Property type: apartment, house, loft, duplex, or studio.
5. Urban level:
   - "quiet", "green", "suburban" → "low"
   - "balanced", "residential area" → "medium"
   - "downtown", "vibrant", "urban" → "high"
6. Extract must-haves and nice-to-haves from the user's language.
7. Extract transportation preferences (bus, metro, tram, bike lanes, highway, airport, train).
8. Extract priorities like “quiet neighborhood”, “good schools”, “shopping nearby”.
9. If information is missing, leave the field null or empty.
10. Output must be a single valid JSON matching the BuyerPreferences schema. No explanation, no extra text.
"""

USER_PROMPT_PREFERENCES = """
The following are questions and the buyer's answers.

# QUESTIONS AND ANSWERS:
{qa_text}

Please analyze them carefully and return the extracted BuyerPreferences as JSON only.
"""

qa_text = "\n".join(
    f"Q: {q}\nA: {a}\n" for q, a in zip(questions, answers)
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT_PREFERENCES},
    {"role": "user", "content": USER_PROMPT_PREFERENCES.format(qa_text=qa_text)}
]

response = preference_llm.invoke(messages)
preferences_params = response.model_dump()


## Embedding Generation and Search Query

Convert the extracted preferences into a natural-language description and generate a vector embedding of it for semantic search. The description is a short text that summarizes the buyer's requirements and priorities.


In [6]:
parts = []
# Room counts are included only when the buyer gave a positive number
bedrooms = preferences_params['bedrooms']
bathrooms = preferences_params['bathrooms']
has_bedrooms = bedrooms is not None and bedrooms > 0
has_bathrooms = bathrooms is not None and bathrooms > 0

if has_bedrooms and has_bathrooms:
    parts.append(f"A {bedrooms}-bedroom, {bathrooms}-bathroom ")
elif has_bedrooms:
    parts.append(f"A {bedrooms}-bedroom ")
elif has_bathrooms:
    parts.append(f"A {bathrooms}-bathroom ")
else:
    parts.append("Any ")

parts.append((preferences_params['property_type'] or 'home') + " ")

if preferences_params['min_size_sqm'] is not None and preferences_params['min_size_sqm'] > 0:
    parts.append(f"around {preferences_params['min_size_sqm']} m² ")

if preferences_params['city'] is not None:
    parts.append(f"in {preferences_params['city']}.\n")
else:
    parts.append("in any location.\n") 

parts.append(f"Urban level: {preferences_params['urban_level'] or 'medium'}.\n")
parts.append(f"Must-haves: {', '.join(preferences_params['must_haves']) or 'none'}.\n")
parts.append(f"Transport: {', '.join(preferences_params['transport']) or 'none'}.\n")
parts.append(f"Priorities: {', '.join(preferences_params['priorities']) or 'none'}.")

preferences_description = "".join(parts)

model = OpenAIEmbeddings(model="text-embedding-3-small")

preferences_embedding = model.embed_query(preferences_description)

print(preferences_description)
print(json.dumps(preferences_params, indent=2))


A 1-bedroom loft around 60 m² in any location.
Urban level: high.
Must-haves: modern design, open layout, good soundproofing, spacious living area, rooftop access, balcony with city view, high-speed internet, well-equipped kitchen.
Transport: metro, night bus, bike-sharing.
Priorities: proximity to nightlife, stylish modern design, spacious living area for hosting friends, easy access to downtown.
{
  "city": null,
  "property_type": "loft",
  "bedrooms": 1,
  "bathrooms": null,
  "min_size_sqm": 60,
  "max_price_pln": null,
  "must_haves": [
    "modern design",
    "open layout",
    "good soundproofing",
    "spacious living area",
    "rooftop access",
    "balcony with city view",
    "high-speed internet",
    "well-equipped kitchen"
  ],
  "nice_to_haves": [],
  "transport": [
    "metro",
    "night bus",
    "bike-sharing"
  ],
  "priorities": [
    "proximity to nightlife",
    "stylish modern design",
    "spacious living area for hosting friends",
    "easy access to downto

## Vector Database Search

Perform semantic search using ChromaDB with both vector similarity and metadata filtering. The search combines:
- **Semantic matching** using embeddings to find properties with similar characteristics
- **Metadata filtering** to enforce hard constraints such as city, property type, urban level, bedrooms, bathrooms, size, and price
- **Hybrid approach**: the metadata filter narrows the candidate set, and vector similarity ranks the listings that remain
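
For intuition about the semantic-matching step: the collection is configured with `"hnsw:space": "cosine"`, so listings are ranked by cosine distance (1 minus cosine similarity) between the query embedding and each listing embedding. The sketch below computes that metric in pure Python on toy 3-dimensional vectors. The function name and the vector values are illustrative assumptions; the real embeddings in this notebook have 1536 dimensions.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
similar_listing = [0.9, 0.1, 0.8]
unrelated_listing = [0.0, 1.0, 0.0]

print(cosine_distance(query, similar_listing))    # small distance: close match
print(cosine_distance(query, unrelated_listing))  # orthogonal vectors: distance 1.0
```

A smaller distance means a closer match, which is why the nearest listings come first in the query results.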


In [7]:
# Create the persist directory if it doesn't exist
persist_directory = os.path.join(os.getcwd(), "../data/.chroma_db")
os.makedirs(persist_directory, exist_ok=True)

# Initialize Chroma client with proper persistence settings
client = chromadb.PersistentClient(path=persist_directory)
collection = client.get_or_create_collection(
    name="listings", 
    embedding_function=None,
    metadata={"hnsw:space": "cosine", "dimension": 1536}
)


conditions = []

if preferences_params['city'] is not None:
    conditions.append({"city": preferences_params['city']})

if preferences_params['bedrooms'] is not None:
    conditions.append({"bedrooms": {"$gte": preferences_params['bedrooms']}})

if preferences_params['bathrooms'] is not None:
    conditions.append({"bathrooms": {"$gte": preferences_params['bathrooms']}})

if preferences_params['min_size_sqm'] is not None:
    conditions.append({"size_sqm": {"$gte": int(preferences_params['min_size_sqm'])}})

if preferences_params['max_price_pln'] is not None:
    # Listing metadata stores the asking price under "price" (see the display cell)
    conditions.append({"price": {"$lte": preferences_params['max_price_pln']}})

if preferences_params['property_type'] is not None:
    conditions.append({"property_type": preferences_params['property_type']})

if preferences_params['urban_level'] is not None:
    conditions.append({"urban_level": preferences_params['urban_level']})

if len(conditions) > 1:
    query_params = {"$and": conditions}
elif len(conditions) == 1:
    query_params = conditions[0]
else:
    query_params = None

res = collection.query(
    query_embeddings=[preferences_embedding], 
    n_results=50,
    where=query_params
)

listings = [{"id": i, "text": d, "metadata": m} for i, d, m in zip(res['ids'][0], res['documents'][0], res['metadatas'][0])]

print(f"Listings found: {len(listings)}")

Listings found: 14
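
The filter construction in the cell above can be packaged as a function, which makes it easy to test without a database. This is a minimal sketch, not the notebook's own code: `build_where_filter` is a hypothetical name, the metadata keys `city`, `bedrooms`, `bathrooms`, `size_sqm`, `property_type`, and `urban_level` are taken from the cell above, and the `price` key is an assumption based on the field printed in the display cell.

```python
def build_where_filter(prefs):
    """Translate a preferences dict into a Chroma-style `where` clause."""
    conditions = []

    # Exact-match fields.
    for field in ("city", "property_type", "urban_level"):
        if prefs.get(field) is not None:
            conditions.append({field: prefs[field]})

    # Lower bounds: the listing must offer at least what the buyer asked for.
    lower_bounds = (
        ("bedrooms", "bedrooms"),
        ("bathrooms", "bathrooms"),
        ("min_size_sqm", "size_sqm"),
    )
    for pref_key, meta_key in lower_bounds:
        if prefs.get(pref_key) is not None:
            conditions.append({meta_key: {"$gte": prefs[pref_key]}})

    # Upper bound on price ("price" is an assumed metadata key).
    if prefs.get("max_price_pln") is not None:
        conditions.append({"price": {"$lte": prefs["max_price_pln"]}})

    # `$and` is only used when there are at least two conditions.
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


example_prefs = {
    "city": None,
    "property_type": "loft",
    "bedrooms": 1,
    "bathrooms": None,
    "min_size_sqm": 60,
    "max_price_pln": None,
    "urban_level": "high",
}
print(build_where_filter(example_prefs))
```

Every condition is a hard constraint, so one overly specific extracted value (for example `property_type`) can shrink the candidate set considerably before similarity ranking happens.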


## Reranking

Apply cross-encoder reranking to the search results to get the most relevant properties. The system:
1. **Reranks** all retrieved listings using the cross-encoder model
2. **Selects** the top N most relevant results
3. **Combines** fast vector retrieval for recall with cross-encoder scoring of each (preferences, listing) pair for precision


In [8]:
reranker = CrossEncoder("BAAI/bge-reranker-base")

In [9]:
top_n = 3

docs = [listing['text'] for listing in listings]
pairs = [(preferences_description, d) for d in docs]
scores = reranker.predict(pairs)
order = np.argsort(-scores)[:top_n]

top_listings = [{"score": float(scores[i]), "text": docs[i], "listing": listings[i]} for i in order]
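
The select-top-N logic can be illustrated without loading the model. In the sketch below, `toy_score` is a hypothetical word-overlap scorer standing in for `reranker.predict`; only the score, sort, and slice steps correspond to the cell above. The guard for an empty candidate list is an addition, since a strict metadata filter may return no listings.

```python
def toy_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)

def rerank(query, docs, top_n=3, score_fn=toy_score):
    """Score every (query, doc) pair and return the top_n as (score, doc), best first."""
    if not docs:
        return []
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

docs = [
    "quiet house with garden in the suburbs",
    "modern loft near metro and nightlife",
    "studio near metro",
]
for score, doc in rerank("modern loft near metro", docs, top_n=2):
    print(f"{score:.2f}  {doc}")
```

Unlike the embedding search, which compares two independently computed vectors, a cross-encoder reads the query and the document together, which is why it is applied only to the small candidate set.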

## Listing Personalization and Display

Personalize the top search results by rewriting property descriptions to emphasize features that match the buyer's preferences. The system:
1. **Highlights** relevant aspects from the original listing
2. **Emphasizes** features that align with buyer priorities
3. **Maintains** factual accuracy while improving relevance
4. **Presents** results in an engaging, personalized format

This creates a tailored experience where each listing feels specifically chosen for the buyer's needs.
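
Because the rewrite is produced by an LLM, the "maintains factual accuracy" goal is enforced only by the prompt. One optional, coarse safeguard is to check that every number in the rewritten text also appears in the original listing. The helper names below are hypothetical and are not part of this notebook; this is a sketch of the idea, not a complete fact checker.

```python
import re

# Matches digit runs that may contain thousands or decimal separators, e.g. "1,240,000".
NUMBER_PATTERN = re.compile(r"\d[\d,.]*\d|\d")

def extract_numbers(text):
    """Return the set of numbers in text with ',' and '.' separators removed.

    Coarse by design: "1.5" and "15" normalize to the same string.
    """
    return {re.sub(r"[,.]", "", match) for match in NUMBER_PATTERN.findall(text)}

def unsupported_numbers(original, rewritten):
    """Numbers that appear in the rewritten text but not in the original listing."""
    return extract_numbers(rewritten) - extract_numbers(original)

original = '{"size_sqm": 92, "price": 1240000}'
rewritten = "A 92 m² loft for 1,240,000 PLN with 3 balconies."
print(unsupported_numbers(original, rewritten))  # {'3'}
```

A non-empty result does not prove a hallucination, but it is a cheap signal that a rewritten listing deserves a second look.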


In [10]:
REPHRASE_PROMPT_MESSAGE = """
You are a real estate assistant. Your task is to personalize a property listing based on specific buyer preferences.

# CONTEXT:
You will receive:
1. The buyer's preferences.
2. The original listing description, containing all factual details about the property.

# INSTRUCTIONS:
Rewrite the listing to emphasize aspects that align with the buyer's preferences.

STRICT RULES:
- Use ONLY information explicitly present in the original listing description.
- Do NOT invent, infer, or assume any new facts, features, or amenities.
- Do NOT include details not mentioned in the original listing.
- You MAY rephrase, reorder, or highlight existing details to better match the buyer's interests.
- Maintain factual accuracy at all times.

STYLE GUIDELINES:
- Tone: warm, vivid, and engaging — as if written for a personalized real estate brochure.
- Length: 3–5 sentences.
- Focus on elements that are most relevant to the buyer's stated preferences.

# INPUTS:
BUYER PREFERENCES:
{buyer_preferences}

ORIGINAL LISTING DESCRIPTION:
{listing_description}

# OUTPUT:
Return ONLY the rewritten, personalized listing description (no explanations or meta text).
"""


print("=== USER PREFERENCES: ==========================================================")
print(qa_text)

for data in top_listings:
    listing = data['listing']
    message = REPHRASE_PROMPT_MESSAGE.format(buyer_preferences=qa_text, listing_description=json.dumps(listing, indent=2))
    rephrased_listing = llm.invoke([HumanMessage(content=message)])


    print("")
    print("=" * 80)
    print("| " + listing['metadata']['title'] + " " * (80 - len(listing['metadata']['title']) - 4) + " |")
    print("=" * 80)
    print(f"ID:{listing['id']} | {listing['metadata']['city']}, {listing['metadata']['neighborhood']} | {listing['metadata']['size_sqm']} m² {listing['metadata']['property_type']} | {listing['metadata']['price']} PLN")
    print("")
    print(textwrap.fill(rephrased_listing.content, width=80))



Q: How big do you want your house to be?
A: A modern one-bedroom loft or studio, around 60 square meters, with an open layout and good soundproofing.

Q: What are 3 most important things for you in choosing this property?
A: Proximity to nightlife, stylish modern design, and a spacious living area for hosting friends.

Q: Which amenities would you like?
A: Rooftop access, a balcony with a city view, high-speed internet, and a well-equipped kitchen.

Q: Which transportation options are important to you?
A: Close to metro and night bus routes, easy access to downtown, and availability of bike-sharing nearby.

Q: How urban do you want your neighborhood to be?
A: Very urban — in the heart of the city, surrounded by bars, restaurants, and cultural spots.


| Minimalist Scandinavian Loft with Panoramic Views in Katowice Srodmiescie    |
ID:listing_313.json | Katowice, Srodmiescie | 92 m² loft | 1240000 PLN

Experience the pulse of Katowice’s Srodmiescie in this striking Scandinavian
loft, wh