# Vector Database Creation for Real Estate Listings

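This notebook builds a vector database from real estate listings using ChromaDB and OpenAI embeddings. The process involves loading the raw listings, preparing a text document and metadata for each one, generating embeddings with a file-based cache, storing everything in a persistent ChromaDB collection, and running a sample filtered search.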

**Note:** All generated embeddings are stored in the `data/embeddings/` directory as individual JSON files.


In [42]:
import chromadb
import os
import random
from chromadb.config import Settings
import json
from langchain_openai import OpenAIEmbeddings

## Data Loading

Load raw real estate listing data from JSON files in the `data/raw` directory. Each file contains property details including title, description, features, and metadata.


In [43]:
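# Directory holding one JSON file per listing
raw_data_dir = os.path.join(os.getcwd(), "../data/raw")

raw_data = []

for file in os.listdir(raw_data_dir):
    # Skip anything that is not a listing file (for example, hidden OS files)
    if not file.endswith(".json"):
        continue
    with open(os.path.join(raw_data_dir, file), "r", encoding="utf-8") as f:
        data = json.load(f)
    # Keep the filename so it can serve as the document ID later
    data['filename'] = file
    raw_data.append(data)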
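
Since the later cells index each listing by key, a quick check for missing fields can surface malformed files before any embeddings are requested. The following is a minimal sketch: the field names are taken from the document-preparation cell below, while `REQUIRED_FIELDS` and `missing_fields` are hypothetical names introduced here for illustration.

```python
# Fields that the document-preparation cell reads from every listing
REQUIRED_FIELDS = (
    "title", "property_type", "bedrooms", "bathrooms", "size_sqm",
    "city", "neighborhood", "description", "key_features",
    "lifestyle_benefits", "neighborhood_description", "target_buyer",
    "market_positioning", "price", "urban_level", "transport_options",
)


def missing_fields(entry):
    """Return the required field names that are absent from one listing dict."""
    return [name for name in REQUIRED_FIELDS if name not in entry]


# Hypothetical listing with most fields missing, for illustration only
sample = {"title": "Example listing", "city": "Krakow"}
print(missing_fields(sample))
```

In the notebook this could be run over `raw_data` right after loading, reporting `entry['filename']` for every listing where the returned list is not empty.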


## Document Preparation

Transform raw listing data into structured documents suitable for embedding. Each document combines property details into a comprehensive text format while preserving key metadata for filtering and retrieval.

In [44]:
all_docs = []
for entry in raw_data:
    doc = {}
    doc['filename'] = entry['filename']
    doc['text'] = f"""TITLE: {entry['title']}
PROPERTY: {entry['property_type']}, {entry['bedrooms']} bedrooms, {entry['bathrooms']} bathrooms, {entry['size_sqm']} square meters
LOCATION: {entry['city']}, {entry['neighborhood']}
DESCRIPTION: {entry['description']}
FEATURES: {", ".join(entry['key_features'])}
LIFESTYLE: {", ".join(entry['lifestyle_benefits'])}
NEIGHBORHOOD: {entry['neighborhood_description']}
BUYER PROFILE: {entry['target_buyer']}
MARKET POSITIONING: {entry['market_positioning']}
"""
    doc['metadata'] = {
        'city': entry['city'],
        'neighborhood': entry['neighborhood'],
        'property_type': entry['property_type'],
        'bedrooms': entry['bedrooms'],
        'bathrooms': entry['bathrooms'],
        'size_sqm': entry['size_sqm'],
        'price': entry['price'],
        'urban_level': entry['urban_level'],
        'has_highway_access': 'highway' in entry['transport_options'],
        'has_bike_lines_access': 'bike lines' in entry['transport_options'],
        'has_public_transport_access': 'public transport' in entry['transport_options'],
        'has_airport_access': 'airport' in entry['transport_options'],
    }
    all_docs.append(doc)
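
The four `has_*_access` flags above follow one pattern, so they can be derived from a single list of transport modes. This is a sketch under the assumption that `transport_options` is a list of strings such as `"highway"` or `"bike lines"`; `TRANSPORT_MODES` and `transport_flags` are hypothetical names, not part of the original notebook.

```python
# Transport modes checked in the metadata cell above
TRANSPORT_MODES = ("highway", "bike lines", "public transport", "airport")


def transport_flags(transport_options):
    """Map each known transport mode to a boolean metadata flag.

    Assumes transport_options is a list of strings (an assumption about
    the raw data, which this notebook does not show).
    """
    flags = {}
    for mode in TRANSPORT_MODES:
        # "bike lines" becomes "has_bike_lines_access", and so on
        key = f"has_{mode.replace(' ', '_')}_access"
        flags[key] = mode in transport_options
    return flags


# Illustrative input only
print(transport_flags(["highway", "airport"]))
```

Flattening the list into booleans keeps every metadata value a simple scalar, which is the form the `where` filters in the search cell operate on.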

## Embedding Generation

Generate vector embeddings for each document using OpenAI's `text-embedding-3-small` model. The process includes caching to avoid regenerating embeddings for existing documents.


In [45]:
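embeddings_data_path = os.path.join(os.getcwd(), "../data/embeddings/")
# Make sure the cache directory exists before reading from or writing to it
os.makedirs(embeddings_data_path, exist_ok=True)

model = OpenAIEmbeddings(model="text-embedding-3-small")

for i, doc in enumerate(all_docs):
    filename = doc['filename']
    filepath = os.path.join(embeddings_data_path, filename)

    print(f"Generating embedding for {filename} [{i+1}/{len(all_docs)}]... ", end="", flush=True)

    if os.path.exists(filepath):
        # Cache hit: reuse the embedding stored on a previous run
        with open(filepath, "r") as f:
            loaded_doc = json.load(f)
        doc['embedding'] = loaded_doc['embedding']
        print("restored from file")
    else:
        # Cache miss: request a new embedding from the API
        doc['embedding'] = model.embed_query(doc['text'])
        print("generated")

    # Persist the document together with its embedding for later runs
    with open(filepath, "w") as f:
        json.dump(doc, f, indent=4)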


Generating embedding for listing_137.json [1/500]... restored from file
Generating embedding for listing_422.json [2/500]... restored from file
Generating embedding for listing_072.json [3/500]... restored from file
Generating embedding for listing_160.json [4/500]... restored from file
Generating embedding for listing_025.json [5/500]... restored from file
Generating embedding for listing_475.json [6/500]... restored from file
Generating embedding for listing_249.json [7/500]... restored from file
Generating embedding for listing_176.json [8/500]... restored from file
Generating embedding for listing_463.json [9/500]... restored from file
Generating embedding for listing_199.json [10/500]... restored from file
Generating embedding for listing_033.json [11/500]... restored from file
Generating embedding for listing_121.json [12/500]... restored from file
Generating embedding for listing_064.json [13/500]... restored from file
Generating embedding for listing_434.json [14/500]... restor
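
The caching logic can be isolated into a small function that takes the embedder as a parameter, which makes it possible to exercise the cache without calling the OpenAI API. The sketch below is an illustration rather than the notebook's implementation: `embed_with_cache` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that does not produce real embeddings.

```python
import json
import os
import tempfile


def embed_with_cache(doc, cache_dir, embed_fn):
    """Return (embedding, was_cached) for doc, using one JSON file per document."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, doc["filename"])
    if os.path.exists(path):
        # Cache hit: read the stored embedding instead of recomputing it
        with open(path, "r") as f:
            return json.load(f)["embedding"], True
    # Cache miss: compute the embedding and store it next to the document
    embedding = embed_fn(doc["text"])
    with open(path, "w") as f:
        json.dump({**doc, "embedding": embedding}, f, indent=4)
    return embedding, False


def fake_embed(text):
    """Stand-in embedder for demonstration; NOT a real embedding model."""
    return [float(len(text)), 0.0]


with tempfile.TemporaryDirectory() as tmp:
    doc = {"filename": "listing_000.json", "text": "TITLE: Example"}
    first = embed_with_cache(doc, tmp, fake_embed)   # computed and written
    second = embed_with_cache(doc, tmp, fake_embed)  # read back from the cache
    print(first[1], second[1])
```

Because the cache is keyed by filename only, a change to a listing's text does not invalidate its stored embedding; deleting the corresponding file in `data/embeddings/` forces regeneration.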

## Vector Database Setup

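Initialize ChromaDB with persistent storage and create a `listings` collection that compares vectors by cosine distance. Because the embeddings are precomputed, no embedding function is attached to the collection. The `dimension` entry in the collection metadata documents the 1536-dimensional size of `text-embedding-3-small` vectors.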


In [46]:
# Create the persist directory if it doesn't exist
persist_directory = os.path.join(os.getcwd(), "../data/.chroma_db")
os.makedirs(persist_directory, exist_ok=True)

# Initialize Chroma client with proper persistence settings
client = chromadb.PersistentClient(path=persist_directory)
collection = client.get_or_create_collection(
    name="listings", 
    embedding_function=None,
    metadata={"hnsw:space": "cosine", "dimension": 1536}
)
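
With `"hnsw:space": "cosine"`, query results are ranked by cosine distance, which is 1 minus the cosine similarity of two vectors. The pure-Python sketch below shows that calculation for intuition only; `cosine_distance` is a hypothetical helper, and ChromaDB computes the distance internally rather than through code like this.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same direction, orthogonal, and opposite vectors
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Smaller values mean closer matches: vectors pointing the same way give a distance near 0, orthogonal vectors give 1, and opposite vectors give 2. Vector length does not affect the result.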

### Adding Documents to the Collection

Populate the vector database with all prepared documents, embeddings, and metadata. This creates a searchable index of all real estate listings.


In [47]:
ids = [doc['filename'] for doc in all_docs]
documents = [doc['text'] for doc in all_docs]
metadatas = [doc['metadata'] for doc in all_docs]
embeddings = [doc['embedding'] for doc in all_docs]
collection.add(ids=ids, documents=documents, metadatas=metadatas, embeddings=embeddings)

print(f"Number of documents in collection: {collection.count()}")


Number of documents in collection: 500
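
All 500 listings fit comfortably in one `collection.add` call here. For much larger datasets, it may be safer to insert in fixed-size chunks. The sketch below shows one way to slice the parallel lists; `batched` and the batch size of 200 are assumptions made for illustration, and the `collection.add` call is left as a comment so the block runs without a database.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Illustrative IDs only; in the notebook these would be the real `ids` list
example_ids = [f"listing_{n:03d}.json" for n in range(5)]

for chunk in batched(example_ids, 2):
    # In the notebook, the matching slices of documents, metadatas, and
    # embeddings would be passed together, for example:
    # collection.add(ids=..., documents=..., metadatas=..., embeddings=...)
    print(chunk)
```

The same `start` and `batch_size` offsets would need to be applied to `ids`, `documents`, `metadatas`, and `embeddings` so that the four lists stay aligned.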


## Search Demonstration

Test the vector database with a sample query that demonstrates both semantic search capabilities and metadata filtering.


In [48]:
query = "I'm looking for a 2-3 bedroom property in Krakow for me and my three cats. I love animals and I need a big balcony"
embedded_query = model.embed_query(query)

In [49]:
res = collection.query(
    query_embeddings=[embedded_query],
    n_results=3,
    # Metadata filter: listings in Krakow with 2 to 3 bedrooms
    where={
        "$and": [
            {"city": "Krakow"},
            {"bedrooms": {"$gte": 2}},
            {"bedrooms": {"$lte": 3}},
        ]
    }
)


print("== RESULTS ==============")

for i in range(len(res['documents'][0])):
    print(f"ID: {res['ids'][0][i]}")
    print(f"Metadata: {json.dumps(res['metadatas'][0][i], indent=4)}")
    print(f"Document: {res['documents'][0][i]}")
    print("---")


ID: listing_130.json
Metadata: {
    "city": "Krakow",
    "bedrooms": 3,
    "price": 1495000,
    "has_highway_access": true,
    "has_public_transport_access": true,
    "bathrooms": 3,
    "neighborhood": "Bronowice",
    "urban_level": "medium",
    "size_sqm": 135,
    "property_type": "duplex",
    "has_bike_lines_access": true,
    "has_airport_access": true
}
Document: TITLE: Sleek Scandinavian Duplex With Garden Views in Bronowice
PROPERTY: duplex, 3 bedrooms, 3 bathrooms, 135 square meters
LOCATION: Krakow, Bronowice
DESCRIPTION: Experience pared-back elegance in this newly renovated Scandinavian-style duplex. Offering 3 airy bedrooms, 3 minimalist bathrooms, and a sun-drenched living area, the property blends natural textures with clean lines. Glass doors open onto a private balcony, while floor-to-ceiling windows welcome tranquil garden glimpses inside. Contemporary finishes throughout invite both calm and comfort.
FEATURES: newly renovated Scandinavian interior, landscape
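
The `where` filter in the query above is hard-coded. A small helper can assemble the same structure from optional parameters. This is a sketch: `build_where` is a hypothetical name, and it covers only the `city` and `bedrooms` fields used in this notebook. It returns a bare condition when only one is given, on the assumption that `$and` expects at least two conditions.

```python
def build_where(city=None, min_bedrooms=None, max_bedrooms=None):
    """Build a ChromaDB-style metadata filter from optional constraints."""
    conditions = []
    if city is not None:
        conditions.append({"city": city})
    if min_bedrooms is not None:
        conditions.append({"bedrooms": {"$gte": min_bedrooms}})
    if max_bedrooms is not None:
        conditions.append({"bedrooms": {"$lte": max_bedrooms}})

    if not conditions:
        return None           # no filtering at all
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no "$and" wrapper
    return {"$and": conditions}


# Reproduces the filter used in the search demonstration
print(build_where(city="Krakow", min_bedrooms=2, max_bedrooms=3))
```

The result can be passed as the `where` argument of `collection.query`, leaving the semantic part of the search to the query embedding.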