Step 1: Setting Up the Python Application

In [19]:
# Installed
#%pip install pandas
#%pip install chromadb
#%pip install sentence-transformers

# Created a local env
# python3 -m venv ./venv
# source ./venv/bin/activate
# Registered my local venv
# ipython kernel install --name "homematch-venv-kernel" --user

# Or run
# pip install -r ./requirements.txt

# %pip freeze > ./requirements.txt



In [2]:
import os
import re
import pandas as pd
from PIL import Image
import gradio as gr
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.agents import Tool, initialize_agent
from langchain.agents.agent_types import AgentType
from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer
from pydantic import BaseModel, Field
from pprint import pprint
import numpy as np
import json

os.environ["OPENAI_API_KEY"] = ""
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

CHROMA_PATH = './chroma_db'
IMAGES_FOLDER = "./images"
COLLECTION_NAME = 'listings'
COLLECTION_NAME_IMAGES = 'images'
clip_model = SentenceTransformer("clip-ViT-B-32")



Step 2: Generating Real Estate Listings

Generate real estate listings using a Large Language Model; produce at least 10 listings. This can involve creating prompts for the LLM to produce descriptions of various properties. An example of a listing might be:

Neighborhood: Green Oaks
Price: 800000 €
Bedrooms: 3
Bathrooms: 2
House Size: 2000 m2
Description: Welcome to this eco-friendly oasis nestled in the heart of Green Oaks. This charming 3-bedroom, 2-bathroom home boasts energy-efficient features such as solar panels and a well-insulated structure. Natural light floods the living spaces, highlighting the beautiful hardwood floors and eco-conscious finishes. The open-concept kitchen and dining area lead to a spacious backyard with a vegetable garden, perfect for the eco-conscious family. Embrace sustainable living without compromising on style in this Green Oaks gem.
Neighborhood Description: Green Oaks is a close-knit, environmentally-conscious community with access to organic grocery stores, community gardens, and bike paths. Take a stroll through the nearby Green Oaks Park or grab a cup of coffee at the cozy Green Bean Cafe. With easy access to public transportation and bike lanes, commuting is a breeze.
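A listing in this "Key: value" layout is easy to turn back into a structured record. The sketch below is only an illustration: `parse_listing` is a hypothetical helper that is not part of the notebook, and it assumes one field per line with the numeric fields starting with digits.

```python
import re

def parse_listing(text):
    """Parse 'Key: value' lines into a dict; numeric fields become ints."""
    record = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that do not have a 'Key: value' shape
        record[key.strip()] = value.strip()
    # Pull the leading integer out of fields such as '800000 €' or '2000 m2'
    for field in ("Price", "Bedrooms", "Bathrooms", "House Size"):
        if field in record:
            match = re.match(r"\d+", record[field])
            if match:
                record[field] = int(match.group())
    return record

sample = """Neighborhood: Green Oaks
Price: 800000 €
Bedrooms: 3
Bathrooms: 2
House Size: 2000 m2
Description: Welcome to this eco-friendly oasis.
Neighborhood Description: Green Oaks is a close-knit community."""

listing = parse_listing(sample)
print(listing["Neighborhood"], listing["Price"], listing["House Size"])
```

Keeping the numeric fields as integers matters later, because the search step compares bedrooms, bathrooms, and size against minimum and maximum values.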


In [8]:
# 1. Load CSV
df = pd.read_csv('listings.csv')

# 2. Define the prompt template
template = """
You are a real estate listing generator. Given the following data, generate a professional and engaging house listing and a separate neighborhood description.
In the description, be specific about the house's amenities, the ambience of the house and the neighborhood, and access to transportation (including the types available), keeping in mind the condition and the location type.

Neighborhood: {neighborhood}
Price: {price} €
Bedrooms: {bedrooms}
Bathrooms: {bathrooms}
House Size: {house_size} m2
Location Type: {location_type}
Condition: {condition}

Output format:

Description:
[Generated description]

Neighborhood Description:
[Generated neighborhood description]
"""

prompt = PromptTemplate(
    input_variables=["neighborhood", "price", "bedrooms", "bathrooms", "house_size", "location_type", "condition"],
    template=template,
)

# 3. Create LLMChain
llm = ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=500)
chain = LLMChain(llm=llm, prompt=prompt)

# 4. Generate the listing descriptions
generated_docs = []
for _, row in df.iterrows():
    result = chain.run({
        "neighborhood": row['neighborhood'],
        "price": row['price'],
        "bedrooms": row['bedrooms'],
        "bathrooms": row['bathrooms'],
        "house_size": row['house_size'],
        "location_type": row['location_type'],
        "condition":  row['condition']
    })

    parts = re.split(r'\n\s*Neighborhood Description:\s*', result.strip(), flags=re.IGNORECASE)
    description = parts[0].replace("Description:", "").strip()
    neighborhood_description = parts[1].strip() if len(parts) > 1 else ""

    metadata = {
        "id": row['id'],
        "neighborhood": row['neighborhood'],
        "price": row['price'],
        "bedrooms": row['bedrooms'],
        "bathrooms": row['bathrooms'],
        "house_size": row['house_size'],
        "location_type": row['location_type'],
        "condition":  row['condition'],
        "reserved":  row['reserved']
    }

    content = f"{description}\n\n{neighborhood_description}"
    generated_docs.append(Document(page_content=content, metadata=metadata))
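The splitting step in the loop above can be tried without calling the LLM. This is a minimal sketch on a hand-written sample response; `split_generated_listing` is a hypothetical name, and it strips only a leading "Description:" label rather than every occurrence.

```python
import re

def split_generated_listing(result):
    """Split a model response into (description, neighborhood_description)."""
    parts = re.split(
        r"\n\s*Neighborhood Description:\s*", result.strip(), flags=re.IGNORECASE
    )
    # Remove only the leading 'Description:' label from the first part
    description = re.sub(r"^\s*Description:\s*", "", parts[0]).strip()
    neighborhood = parts[1].strip() if len(parts) > 1 else ""
    return description, neighborhood

sample = """
Description:
A bright three-bedroom home with a garden.

Neighborhood Description:
A quiet area close to bus and train stops.
"""

desc, hood = split_generated_listing(sample)
print(desc)
print(hood)
```

If the model omits the "Neighborhood Description:" heading, the second element comes back as an empty string, which matches the fallback used in the loop.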

Step 3: Storing Listings in a Vector Database

* Vector Database Setup: Initialize and configure ChromaDB or a similar vector database to store real estate listings.
* Generating and Storing Embeddings: Convert the LLM-generated listings into suitable embeddings that capture the semantic content of each listing, and store these embeddings in the vector database.
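To make the idea concrete before using ChromaDB, here is a toy in-memory stand-in for a vector database. It is a sketch under stated assumptions: `TinyVectorStore` is hypothetical, and the three-dimensional vectors are hand-made placeholders for real embeddings.

```python
import numpy as np

class TinyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, doc_id, vector):
        v = np.asarray(vector, dtype=float)
        self.ids.append(doc_id)
        self.vectors.append(v / np.linalg.norm(v))  # store unit vectors

    def query(self, vector, k=1):
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        # For unit vectors the dot product equals the cosine similarity
        scores = np.stack(self.vectors) @ q
        order = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in order]

store = TinyVectorStore()
store.add("1", [0.9, 0.1, 0.0])  # made-up vector for a "mountain ranch" listing
store.add("2", [0.1, 0.9, 0.2])  # made-up vector for a "city apartment" listing

print(store.query([1.0, 0.0, 0.0], k=1))
```

A real vector database adds persistence, metadata, and approximate nearest-neighbor indexing on top of this basic store-and-rank behavior.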

Insert text data

In [9]:
def is_normalized(vec, tolerance=1e-6):
    norm = np.linalg.norm(vec)
    return abs(norm - 1.0) < tolerance

def normalize(vec, tolerance=1e-6):
    norm = np.linalg.norm(vec)
    if norm < tolerance:
        return np.array(vec)
    return np.array(vec) / norm

# Embeddings wrapper class to automatically normalize text vectors based on OpenAIEmbeddings
class NormalizedOpenAIEmbeddings(Embeddings):
    def __init__(self, **kwargs):
        self.embedder = OpenAIEmbeddings(**kwargs)

    def embed_documents(self, texts):
        embeddings = self.embedder.embed_documents(texts)
        # Return plain lists of floats, as the Embeddings interface expects
        return [normalize(vec).tolist() for vec in embeddings]

    def embed_query(self, text):
        vec = self.embedder.embed_query(text)
        return normalize(vec).tolist()

embedding_model = NormalizedOpenAIEmbeddings()

vector_store = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embedding_model,
    persist_directory=CHROMA_PATH
)

for doc in generated_docs:
    vector_store.add_documents(documents=[doc], ids=[str(doc.metadata["id"])])

def get_all():
    return vector_store.get()

# Mark a listing as reserved by updating its metadata
def reserve_by_id(listing_id):
    results = vector_store.get(ids=[listing_id], include=["metadatas"])

    if len(results["metadatas"]) == 0:
        return False

    # Merge the flag into the existing metadata so the other fields are preserved
    current_meta = results["metadatas"][0]
    collection = vector_store._collection
    collection.update(
        ids=[listing_id],
        metadatas=[{**current_meta, "reserved": True}]
    )
    return True


In [10]:
# Check that the data was added (uncomment the next line to delete all of it)
#vector_store._collection.delete(ids=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'])
print(vector_store.get())

{'ids': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], 'embeddings': None, 'documents': ["Welcome to this charming ranch house nestled in the serene neighborhood of Green Oaks. This perfect condition home boasts 3 bedrooms, 2 bathrooms, and a spacious 2000 m2 lot, offering plenty of room for outdoor activities and relaxation. The interior features a cozy ambience with natural light flooding in through the large windows. The kitchen is equipped with modern appliances and the living room is perfect for cozy nights by the fireplace. With a mountain backdrop, this home provides a peaceful retreat from the hustle and bustle of city life. Accessibility to transportation is convenient, with easy access to nearby bus stops and train stations.\n\nGreen Oaks is a picturesque mountain neighborhood known for its tranquil surroundings and stunning views. Residents of Green Oaks enjoy a close-knit community atmosphere, perfect for families and nature lovers alike. The neighborhood offers plent

Insert image data

In [16]:
# Embeddings wrapper class that encodes text queries with CLIP and normalizes the vectors,
# so they can be compared against the image vectors stored in the images collection
class CLIPEmbeddings(Embeddings):
    def embed_documents(self, texts):
        return [normalize(v).tolist() for v in clip_model.encode(texts)]

    def embed_query(self, text):
        return normalize(clip_model.encode([text])[0]).tolist()

embedding_model_images = CLIPEmbeddings()

vectorstore = Chroma(
    collection_name=COLLECTION_NAME_IMAGES,
    embedding_function=embedding_model_images,
    persist_directory=CHROMA_PATH
)

def get_filename_without_extension(file_path):
    base_name = os.path.basename(file_path)
    name_without_ext = os.path.splitext(base_name)[0]
    return name_without_ext

def add_images_to_vectorstore(image_dir):
    for file in os.listdir(image_dir):
        if file.lower().endswith((".png", ".jpg", ".jpeg")):
            image_path = os.path.join(image_dir, file)
            image = Image.open(image_path).convert("RGB")
            # Normalize the image vector and pass it to Chroma as a plain list of floats
            embedding = normalize(clip_model.encode([image])[0]).tolist()

            doc = Document(
                page_content=f"Image: {file}",
                metadata={
                    "file_name": file,
                    "id": get_filename_without_extension(file)
                }
            )

            vectorstore._collection.add(
                embeddings=[embedding],
                documents=[doc.page_content],
                metadatas=[doc.metadata],
                ids=[file]
            )
    print("Images added to vectorstore and persisted.")

add_images_to_vectorstore(IMAGES_FOLDER)

Images added to vectorstore and persisted.


Define functions to search our vector databases

In [12]:
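def convert_euclidean_to_cosine_similarity(distance):
    # Convert Euclidean distance between unit vectors to cosine similarity.
    # Assumes `distance` is the plain (non-squared) Euclidean distance; if the
    # collection reports squared L2 instead, the conversion is 1 - distance / 2.
    return 1 - (distance ** 2) / 2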

def similarity_search_with_score_on_chroma(collection, embedding_function, query, number_of_listings):
    db = Chroma(
        collection_name=collection,
        embedding_function=embedding_function,
        persist_directory=CHROMA_PATH
    )
    results = db.similarity_search_with_score(query, k=number_of_listings)

    converted_results = []
    for doc, score in results:
        cosine_score = convert_euclidean_to_cosine_similarity(score)
        converted_results.append((doc, cosine_score))

    return converted_results

def merge_and_rank_results(text_results, image_results, number_of_top_results=2):
    merged = {}

    for doc, score in text_results:
        # Use integer keys so text and image results for the same listing line up
        merged[int(doc.metadata["id"])] = {"text_score": score, "doc": doc}

    for doc, score in image_results:
        doc_id = int(doc.metadata["id"])
        if doc_id in merged:
            merged[doc_id]["image_score"] = score
        else:
            merged[doc_id] = {"image_score": score, "doc": doc}
    
    for m in merged.values():
        m.setdefault("text_score", -1.0)
        m.setdefault("image_score", -1.0)
    
    # Give more weight to the text score, since the descriptions carry more information about the listing and its surroundings
    alpha = 0.8
    ranked = sorted(
        merged.values(),
        key=lambda x: alpha * x["text_score"] + (1 - alpha) * x["image_score"],
        reverse=True
    )

    return ranked[:number_of_top_results]
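The distance-to-similarity conversion relies on an identity for unit vectors: ||a - b||^2 = 2 - 2 (a . b), so cos(a, b) = 1 - ||a - b||^2 / 2. The check below is a small numeric sketch with random unit vectors. One caveat worth verifying for your setup: Chroma's documentation describes its default `l2` space as squared L2, in which case the returned score would already be squared and should not be squared again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors scaled to unit length
a = rng.normal(size=8)
a = a / np.linalg.norm(a)
b = rng.normal(size=8)
b = b / np.linalg.norm(b)

cosine = float(a @ b)
euclidean = float(np.linalg.norm(a - b))
squared = euclidean ** 2

# Conversion when the store returns the plain Euclidean distance
from_euclidean = 1 - euclidean ** 2 / 2
# Conversion when the store returns the squared distance
from_squared = 1 - squared / 2

print(cosine, from_euclidean, from_squared)
```

All three printed values agree, which also shows why the vectors are normalized before they are stored: the identity holds only for unit-length vectors.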


Combine all the search methods implemented above into one

In [13]:
# Complexity could be improved
# merge_and_rank_results + 2 * similarity_search_with_score_on_chroma = O(N) + 2 * searches on ChromaDB
# First improvement would be to make only one search
def search_listings(text_query, image_query, min_bedrooms=0, min_bathrooms=0, min_size=0, max_size=99999, number_of_listings=1, number_of_top_results=1):
    
    text_results = similarity_search_with_score_on_chroma(COLLECTION_NAME, embedding_model, text_query, number_of_listings)
    image_results = similarity_search_with_score_on_chroma(COLLECTION_NAME_IMAGES, embedding_model_images, image_query, number_of_listings)

    results = merge_and_rank_results(text_results, image_results, number_of_top_results)
    
    filtered = []
    for doc in results:
        d = doc["doc"]
        meta = d.metadata
        bedrooms = meta.get("bedrooms", 0)
        bathrooms = meta.get("bathrooms", 0)
        size = meta.get("house_size", 0)

        if bedrooms >= min_bedrooms and bathrooms >= min_bathrooms and min_size <= size <= max_size:
            filtered.append(doc)
    
    return filtered
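Because `search_listings` truncates to the top results before applying the numeric filters, it can return fewer listings than requested even when suitable ones exist further down the ranking. The toy example below illustrates the effect with made-up scores; `rank` and `meets_filters` are hypothetical helpers that mirror the weighting (alpha = 0.8, missing score = -1.0) used above.

```python
def rank(candidates, alpha=0.8):
    """Sort by the weighted text/image score; a missing score counts as -1.0."""
    def combined(c):
        return alpha * c.get("text_score", -1.0) + (1 - alpha) * c.get("image_score", -1.0)
    return sorted(candidates, key=combined, reverse=True)

def meets_filters(candidate, min_bedrooms):
    return candidate["bedrooms"] >= min_bedrooms

candidates = [
    {"id": 1, "bedrooms": 1, "text_score": 0.90, "image_score": 0.30},
    {"id": 2, "bedrooms": 3, "text_score": 0.80, "image_score": 0.25},
    {"id": 3, "bedrooms": 4, "text_score": 0.70},  # no image match
]

# Order used in search_listings: keep the top result, then filter it
top_then_filter = [c["id"] for c in rank(candidates)[:1] if meets_filters(c, 2)]

# Alternative order: filter first, then keep the top result
eligible = [c for c in candidates if meets_filters(c, 2)]
filter_then_top = [c["id"] for c in rank(eligible)[:1]]

print(top_then_filter, filter_then_top)
```

Filtering first, or passing the constraints to the vector store as a metadata filter, is one possible way to avoid empty result sets; whether it is worth the change depends on how strict the buyer's filters usually are.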


In [17]:
# Check that the data was added (uncomment the next line to delete some of it)
#vectorstore._collection.delete(ids=['8.png', '9.png', '10.png', '5.png', '7.png', '6.png', '3.png'])
print(vectorstore.get())

{'ids': ['8.png', '9.png', '10.png', '4.png', '5.png', '7.png', '6.png', '2.png', '3.png', '1.png'], 'embeddings': None, 'documents': ['Image: 8.png', 'Image: 9.png', 'Image: 10.png', 'Image: 4.png', 'Image: 5.png', 'Image: 7.png', 'Image: 6.png', 'Image: 2.png', 'Image: 3.png', 'Image: 1.png'], 'uris': None, 'included': ['metadatas', 'documents'], 'data': None, 'metadatas': [{'file_name': '8.png', 'id': '8'}, {'file_name': '9.png', 'id': '9'}, {'file_name': '10.png', 'id': '10'}, {'file_name': '4.png', 'id': '4'}, {'id': '5', 'file_name': '5.png'}, {'id': '7', 'file_name': '7.png'}, {'id': '6', 'file_name': '6.png'}, {'file_name': '2.png', 'id': '2'}, {'id': '3', 'file_name': '3.png'}, {'file_name': '1.png', 'id': '1'}]}


Step 4: Building the User Preference Interface & Step 5: Searching Based on Preferences & Step 6: Personalizing Listing Descriptions

* Collect buyer preferences, such as the number of bedrooms, bathrooms, location, and other specific requirements, either through a set of questions or by asking the buyer to enter their preferences in natural language. You can hard-code the buyer preferences as questions and answers, or collect them interactively however you'd like.
* Buyer Preference Parsing: Implement logic to interpret and structure these preferences for querying the vector database.
* Semantic Search Implementation: Use the structured buyer preferences to perform a semantic search on the vector database, retrieving listings that most closely match the user's requirements.
* Listing Retrieval Logic: Fine-tune the retrieval algorithm to ensure that the most relevant listings are selected based on the semantic closeness to the buyer’s preferences.
* LLM Augmentation: For each retrieved listing, use the LLM to augment the description, tailoring it to resonate with the buyer’s specific preferences. This involves subtly emphasizing aspects of the property that align with what the buyer is looking for.
* Maintaining Factual Integrity: Ensure that the augmentation process enhances the appeal of the listing without altering factual information.
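One way to structure the preferences is to separate the free-text parts, which feed the semantic query, from the numeric constraints, which are checked against listing metadata. The sketch below assumes a hypothetical `BuyerPreferences` dataclass; the field names follow the form inputs used later in this notebook, and the values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class BuyerPreferences:
    min_bedrooms: int = 0
    min_bathrooms: int = 0
    important_factors: str = ""
    amenities: list = field(default_factory=list)
    transportation: list = field(default_factory=list)
    urban_description: str = ""

    def text_query(self):
        """Build the natural-language query used for semantic search."""
        return (
            f"Important: {self.important_factors}. "
            f"Amenities: {', '.join(self.amenities)}. "
            f"Transportation: {', '.join(self.transportation)}. "
            f"Urban feel: {self.urban_description}."
        )

    def metadata_ok(self, metadata):
        """Check the hard numeric constraints against a listing's metadata."""
        return (
            metadata.get("bedrooms", 0) >= self.min_bedrooms
            and metadata.get("bathrooms", 0) >= self.min_bathrooms
        )

prefs = BuyerPreferences(
    min_bedrooms=3,
    min_bathrooms=2,
    important_factors="quiet, garden, good schools",
    amenities=["Garden", "Parking"],
    transportation=["Bus"],
    urban_description="suburban",
)

print(prefs.text_query())
print(prefs.metadata_ok({"bedrooms": 3, "bathrooms": 2}))
```

Keeping the hard constraints out of the text query means a listing is never recommended just because its description sounds similar while its bedroom count is too low.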

In [18]:
# Input schema for agent tools that take no arguments
class NoInput(BaseModel):
    pass

# Input schema for agent tools that take a listing ID
class ID(BaseModel):
    id: str = Field(..., description="The ID to look up")

def format_listings_context(docs):
    formatted = ""
    for doc in docs:
        d = doc["doc"]
        listing = d.page_content
        metadata = d.metadata
        formatted += f"Listing: {listing}\n"
        formatted += f"Listing Metadata: {metadata}\n\n\n"
    return formatted

MAX_RESULTS = 3

def rag_listings_search(listings, user_input):
    listing_context = format_listings_context(listings)

    rag_prompt = PromptTemplate(
    input_variables=["listing_context", "user_input"],
    template = """
        You are a helpful and friendly real estate assistant. Your job is to help a homebuyer find the perfect property from the listings provided.
        Based on these listings and their metadata, write a tailored response summarizing the best options or recommendations, subtly emphasizing aspects of the property that align with what the buyer is looking for.
        Stay true to the facts and only use the information provided in the context.
        
        Here are the available listings:
        {listing_context}
        
        User's request: "{user_input}"
        
        Analyze the user's request and select the listings that match it.
        Your response MUST be a clean JSON array, ordered from most to least relevant for the buyer, with one object per matching listing. Each object contains exactly two keys:
        1.  "recommendation_text": A tailored summary for the user. Explain why this listing is a good fit based on their request.
        2.  "image_path": The file name of the recommended listing's image, built from the "id" value in the metadata followed by ".png".
        
        Example Response:
        [{{
          "recommendation_text": "Based on your request for a spacious city center apartment, I recommend this 120 sqm property. It has been recently fixed up and offers great value for its size and location.",
          "image_path": "1.png"
        }}]
        
        Now, generate the JSON response for the user's request.
        """
    )

    chain = LLMChain(llm=llm, prompt=rag_prompt)

    llm_json_output = chain.run(listing_context=listing_context, user_input=user_input)
    
    # Cap the list at MAX_RESULTS so it never exceeds the number of output slots
    response_data = json.loads(llm_json_output)[:MAX_RESULTS]

    text_updates = []
    image_updates = []
    for result in response_data:
        text_updates.append(gr.update(value=result["recommendation_text"], visible=True))
        image_updates.append(gr.update(value=IMAGES_FOLDER + "/" + result["image_path"], visible=True))

    # Fill remaining outputs with invisible placeholders
    for _ in range(MAX_RESULTS - len(response_data)):
        text_updates.append(gr.update(value="", visible=False))
        image_updates.append(gr.update(value=None, visible=False))

    return text_updates + image_updates 
 

def list_available_listings(_query=""):
    # The agent passes a string argument to every tool; it is ignored here
    docs = get_all()
    
    results = []
    for listing_id, doc, meta in zip(docs['ids'], docs['documents'], docs['metadatas']):
        results.append(f"- id: {listing_id}: {doc} : metadata{meta}\n\n")
    return "\n".join(results) or "No available listings."

def reserve_listing(listing_id: str):
    reserved = reserve_by_id(listing_id)

    if reserved:
        return f"Listing {listing_id} updated."

    return f"Listing {listing_id} not found."

tools = [
    Tool(name="ListListings", func=list_available_listings, description="List all available property listings."),
    Tool(name="ReserveListing", func=reserve_listing, description="Reserve a listing by ID. Input should be the listing ID.", args_schema=ID)
]

agent_executor = initialize_agent(
    tools, 
    llm, 
    agent=AgentType.OPENAI_FUNCTIONS, 
    verbose=True
)

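def realtor_chat(user_msg, chat_history=None):
    # Avoid a mutable default argument, which would be shared across calls
    chat_history = chat_history or []
    response = agent_executor.run(user_msg)
    
    chat_history.append((user_msg, response))
    return chat_history, chat_history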

# Tab representing the realtor assistant chat bot
def create_chat_tab():
    with gr.Blocks() as chat_ui:
        gr.Markdown("## 🧑‍💼 Assistant for the Realtor")

        chatbot = gr.Chatbot()
        msg = gr.Textbox(label="Ask the realtor assistant...")
        clear = gr.Button("Clear Chat")

        state = gr.State([])

        msg.submit(realtor_chat, [msg, state], [chatbot, state])
        clear.click(lambda: ([], []), None, [chatbot, state])

    return chat_ui

def handle_form(
    min_size_str,
    max_size_str,
    min_bedrooms,
    min_bathrooms,
    important_factors,
    amenities,
    transportation,
    urban_description
):
    min_size = 0 if min_size_str == "No limit" else int(min_size_str)
    max_size = 99999 if max_size_str == "No limit" else int(max_size_str)
    
    text_query = (
        f"Important: {important_factors}. "
        f"Amenities: {', '.join(amenities)}. "
        f"Transportation: {', '.join(transportation)}. "
        f"Urban feel: {urban_description}."
    )

    # Remove transportation from image query because most likely it is not extracted from the image and could add noise
    image_query = (
        f"Important: {important_factors}. "
        f"Amenities: {', '.join(amenities)}. "
        f"Urban feel: {urban_description}."
    )

    # Retrieve twice as many candidates as needed (MAX_RESULTS * 2),
    # then keep only the MAX_RESULTS most relevant listings
    results = search_listings(
        text_query,
        image_query,
        min_bedrooms=min_bedrooms,
        min_bathrooms=min_bathrooms,
        min_size=min_size,
        max_size=max_size,
        number_of_listings=MAX_RESULTS*2,
        number_of_top_results=MAX_RESULTS
    )

    if not results:
        # Show a single message and hide every other output slot
        return (
            [gr.update(value="No listings found", visible=True)]
            + [gr.update(value="", visible=False) for _ in range(MAX_RESULTS - 1)]
            + [gr.update(value=None, visible=False) for _ in range(MAX_RESULTS)]
        )

    rag_result = rag_listings_search(results, text_query)

    return rag_result

# Tab representing the user search
def create_user_searching_tab():
    # Size options for dropdowns
    size_options = [str(i) for i in range(40, 181, 20)]
    size_options.extend([str(i) for i in range(200, 451, 50)])
    size_options.extend([str(i) for i in range(500, 1501, 100)])
    size_options.append("No limit")
    
    # Amenities and transport options
    amenity_choices = [
        "Gym", "Swimming Pool", "Parking", "Garden", "Rooftop",
        "Security", "Smart Home Features"
    ]
    
    transport_choices = [
        "Subway", "Bus", "Bike Paths", "Walking Trails",
        "Electric Charging", "Highway Access"
    ]
    
    with gr.Blocks() as demo:
        with gr.Row():
            with gr.Column(scale=1):
                gr.Markdown("### 🧰 Filters")
    
                min_size = gr.Dropdown(choices=size_options, label="Minimum size of the house?", value="No limit")
                max_size = gr.Dropdown(choices=size_options, label="Maximum size of the house?", value="No limit")
                min_bedrooms = gr.Slider(0, 100, value=1, step=1, label="Minimum Bedrooms")
                min_bathrooms = gr.Slider(0, 100, value=1, step=1, label="Minimum Bathrooms")
    
                important_factors = gr.Textbox(label="Top 3 things you care about")
                amenities = gr.Dropdown(choices=amenity_choices, multiselect=True, label="Preferred Amenities")
                transportation = gr.Dropdown(choices=transport_choices, multiselect=True, label="Preferred Transportation")
                urban_description = gr.Textbox(label="How urban should the neighborhood feel?")
    
                search_btn = gr.Button("🔍 Search Listings")

            text_data = gr.Markdown("", visible=False)
            image_paths = gr.Image(label="Recommended Property",visible=False, type="filepath", interactive=False)
            with gr.Column(scale=2):
                gr.Markdown("### 📋 Matching Listings")
                text_outputs = []
                image_outputs = []
                for i in range(MAX_RESULTS):
                    text = gr.Markdown(visible=False)
                    image = gr.Image(label=f"Property {i+1}", type="filepath", interactive=False, visible=False)
                    text_outputs.append(text)
                    image_outputs.append(image)

        outputs = text_outputs + image_outputs
        search_btn.click(
            fn=handle_form,
            inputs=[
                min_size, max_size, min_bedrooms, min_bathrooms,
                important_factors, amenities, transportation, urban_description
            ],
            outputs=outputs
        )

    return demo

with gr.Blocks() as app:
    with gr.Tabs():
        with gr.Tab("🏠 Search Listings"):
            create_user_searching_tab()

        with gr.Tab("💬 Realtor Chatbot"):
            create_chat_tab()
    
# Run it
app.launch()



* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.

> Entering new AgentExecutor chain...
I can help you with the following functionalities:

1. List all available property listings.
2. Reserve a property listing by its ID.

Feel free to ask me for assistance with any of these tasks!

> Finished chain.


> Entering new AgentExecutor chain...

Invoking: `ListListings` with `house`


- id: 1: Welcome to this charming ranch house nestled in the serene neighborhood of Green Oaks. This perfect condition home boasts 3 bedrooms, 2 bathrooms, and a spacious 2000 m2 lot, offering plenty of room for outdoor activities and relaxation. The interior features a cozy ambience with natural light flooding in through the large windows. The kitchen is equipped with modern appliances and the living room is perfect for cozy nights by the fireplace. With a mountain backdrop, this home provides a peaceful retreat from the hustle and bustle of city life. Accessibility to transportation is