#### Step 1: Setting Up the Python Application
Initialize a Python Project: Create a new Python project, setting up a virtual environment and installing necessary packages like LangChain, a suitable LLM library (e.g., OpenAI's GPT), and a vector database package compatible with Python (e.g., ChromaDB or LanceDB). If you don't wish to create your files from scratch, starter files are available in the workspace on the next page as an application skeleton.
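
Before running the rest of the notebook, it can help to confirm that the required packages are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the module list are illustrative assumptions, not part of the starter files.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Assumed list of top-level modules used by this notebook; adjust to your requirements.txt
required_modules = ["langchain", "openai", "chromadb", "pandas"]
missing = missing_packages(required_modules)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it in the virtual environment before continuing.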

In [2]:
# Install necessary packages
# Ensure you have these installed in your environment.
# You might need to run these commands in your terminal if not already installed.
#!pip install --quiet -r requirements.txt

In [15]:
# Import necessary libraries
import os
import getpass
import warnings
import pandas as pd

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders.csv_loader import CSVLoader


warnings.filterwarnings("ignore")

# Set up OpenAI API Key
# It's recommended to set this as an environment variable for security.
# If not set, it will prompt for the key.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

# Set up OpenAI API Base URL
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

# Initialize the LLM (OpenAI's gpt-3.5-turbo) and the embeddings model.
# Note: gpt-3.5-turbo is a chat model; recent LangChain versions recommend ChatOpenAI for it.
llm = OpenAI(temperature=0.7, model_name='gpt-3.5-turbo', max_tokens=1000)
embeddings_model = OpenAIEmbeddings()

#### Step 2: Generating Real Estate Listings
Use a Large Language Model to generate at least 10 real estate listings. This typically involves writing prompts that ask the LLM to produce descriptions of various properties.
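
LLM output does not always follow the requested format, so when listings are requested as CSV rows it is worth parsing each row before keeping it. The sketch below shows one possible check using the standard `csv` module. The column names match the ones used in this notebook; the helper `parse_listing_row` and the sample strings are hypothetical.

```python
import csv
import io

# Column names used for the listings in this notebook
LISTING_COLUMNS = [
    "neighborhood", "price", "bedrooms", "bathrooms",
    "house_size", "description", "neighborhood_description",
]

def parse_listing_row(raw_row, columns=LISTING_COLUMNS):
    """Parse one CSV row; return a dict keyed by column, or None if malformed."""
    rows = list(csv.reader(io.StringIO(raw_row.strip())))
    # Exactly one record with exactly one field per column is expected
    if len(rows) != 1 or len(rows[0]) != len(columns):
        return None
    return dict(zip(columns, (field.strip() for field in rows[0])))

# Made-up examples: one well-formed row and one chatty, malformed response
good = '"Green Oaks","$800,000","3","2","140 m²","Solar panels, garden.","Quiet, bike-friendly."'
bad = "Sure! Here is your listing: Green Oaks, $800,000"

print(parse_listing_row(good))
print(parse_listing_row(bad))
```

Compared with only checking that a response starts and ends with a double quote, parsing also catches rows with the wrong number of fields.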

In [None]:
# Prompt template for generating real estate listings in CSV format
# We will generate one row at a time and combine them later.
# It's crucial to be explicit about format, delimiter, and quoting.
listing_generation_prompt_template_csv = """
Generate a single real estate listing as a single row of CSV data.
The CSV should have the following columns, IN THIS EXACT ORDER AND CASE:
neighborhood,price,bedrooms,bathrooms,house_size,description,neighborhood_description

Make sure to include:
1. Neighborhood name (be creative, e.g., 'Willow Creek Estates', 'Sunset Bluffs', 'Maplewood District')
2. Price (e.g., $350,000, $1,200,000)
3. Number of Bedrooms (between 1 and 5)
4. Number of Bathrooms (between 1 and 3)
5. House Size in square meters (e.g., 100 m², 250 m²)
6. A compelling property Description (highlight unique features, style, condition, etc.)
7. A short Neighborhood Description (amenities, vibe, schools, parks, etc.)


For the CSV output:
- Use a comma (,) as the delimiter.
- Enclose each field in double quotes (") to handle commas or newlines within the text.
- If a double quote appears within a field, escape it by doubling it ("").
- Do NOT include a header row in your response.
- Provide details matching the characteristic if specified, otherwise generate a diverse property.

Characteristic for this listing: {characteristic}

Example CSV row:
"Green Oaks","$800,000","3","2","140 m²","Welcome to this eco-friendly oasis nestled in the heart of Green Oaks. This charming 3-bedroom, 2-bathroom home boasts energy-efficient features such as solar panels and a well-insulated structure. Natural light floods the living spaces, highlighting the beautiful hardwood floors and eco-conscious finishes. The open-concept kitchen and dining area lead to a spacious backyard with a vegetable garden, perfect for the eco-conscious family. Embrace sustainable living without compromising on style in this Green Oaks gem.","Green Oaks is a close-knit, environmentally-conscious community with access to organic grocery stores, community gardens, and bike paths. Take a stroll through the nearby Green Oaks Park or grab a cup of coffee at the cozy Green Bean Cafe. With easy access to public transportation and bike lanes, commuting is a breeze."

Generate ONLY the single CSV row for the listing.
---
Output:
"""

listing_prompt_csv = PromptTemplate(
    input_variables=["characteristic"],
    template=listing_generation_prompt_template_csv
)

listing_generation_chain_csv = LLMChain(llm=llm, prompt=listing_prompt_csv)

# Generate 10 diverse listings as CSV rows
num_listings_to_generate = 10
generated_listings_csv_rows = []
listing_characteristics = [
    "A modern downtown apartment with city views",
    "A sprawling ranch-style home with a large backyard",
    "A cozy cottage perfect for a small family",
    "A luxurious penthouse with high-end amenities",
    "A historic Victorian home with original features",
    "A suburban family home with 4 bedrooms and a pool",
    "An eco-friendly house with solar panels and a vegetable garden",
    "A minimalist loft in an up-and-coming artistic neighborhood",
    "A lakefront property with a private dock",
    "A townhouse in a gated community with shared facilities"
]

print(f"Generating {num_listings_to_generate} listings in CSV format...\n")
# Generate listings
for i in range(num_listings_to_generate):
    characteristic = listing_characteristics[i % len(listing_characteristics)] # Cycle through characteristics
    print(f"Generating listing {i+1} with characteristic: {characteristic}")
    try:
        response_csv_row = listing_generation_chain_csv.run(characteristic=characteristic)
        # Basic check if the output looks like a CSV row (starts and ends with quotes)
        # This is not foolproof, but a simple validation.
        cleaned_row = response_csv_row.strip()
        if cleaned_row.startswith('"') and cleaned_row.endswith('"') and ',' in cleaned_row:
            generated_listings_csv_rows.append(cleaned_row)
            print(f"Listing {i+1} generated successfully.")
        else:
            print(f"Warning: Listing {i+1} output did not look like a valid CSV row. Skipping.")
            print(f"Raw response: {response_csv_row}")

    except Exception as e:
        print(f"Error generating listing {i+1}: {e}")
        # Optionally, add a placeholder or retry
    print("---")


In [None]:
# Define the CSV file path and header
csv_file_path = "listings.csv"
csv_header = "neighborhood,price,bedrooms,bathrooms,house_size,description,neighborhood_description"

# Save the generated CSV rows to a file
if generated_listings_csv_rows:
    with open(csv_file_path, "w", encoding="utf-8") as f:
        f.write(csv_header + "\n") # Write the header first
        for row in generated_listings_csv_rows:
            f.write(row + "\n")
    print(f"\nGenerated {len(generated_listings_csv_rows)} listings saved to {csv_file_path}")
else:
    print("\nNo valid CSV listings were generated.")

#### Step 3: Storing Listings in a Vector Database
Vector Database Setup: Initialize and configure ChromaDB or a similar vector database to store real estate listings.
Generating and Storing Embeddings: Load the listings from the generated CSV file using `CSVLoader`, convert them into suitable embeddings that capture the semantic content of each listing, and store these embeddings in the vector database.
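
In this notebook, `CSVLoader` yields one document per row whose `page_content` holds `column: value` pairs, and the structured fields have to be recovered separately for the metadata. As an alternative sketch, the embedding text and the metadata dictionary can both be built directly from each row with the standard `csv` module. The helper name `rows_to_records` and the sample data are illustrative assumptions, and the newline-joined `column: value` layout only mirrors the format observed here; it may differ across LangChain versions.

```python
import csv
import io

def rows_to_records(csv_text):
    """Turn CSV text (with a header row) into (embedding_text, metadata, id) triples."""
    records = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        # Skip overflow fields (key None) and treat missing values as empty strings
        metadata = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        # One "column: value" pair per line, similar to the loader's page_content
        embedding_text = "\n".join(f"{key}: {value}" for key, value in metadata.items())
        records.append((embedding_text, metadata, f"listing_{i}"))
    return records

# Made-up sample with a reduced set of columns
sample_csv = (
    "neighborhood,price,bedrooms\n"
    '"Green Oaks","$800,000","3"\n'
    '"Sunset Bluffs","$1,200,000","4"\n'
)

for text, metadata, listing_id in rows_to_records(sample_csv):
    print(listing_id, metadata["neighborhood"], metadata["price"])
```

The three elements of each triple correspond to the `texts`, `metadatas`, and `ids` arguments passed to `add_texts` later in this step.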

In [39]:
# Define the path to the generated CSV file
csv_file_path = "listings.csv"
csv_header = "neighborhood,price,bedrooms,bathrooms,house_size,description,neighborhood_description"

if not os.path.exists(csv_file_path) or os.path.getsize(csv_file_path) <= len(csv_header):
    loaded_documents = []
else:
    try:
        loader = CSVLoader(
            file_path=csv_file_path,
            csv_args={'delimiter': ','},
            encoding='utf-8'
        )
        loaded_documents = loader.load()
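    except Exception as e:
        # Report the failure instead of silently continuing with no documents
        print(f"Error loading {csv_file_path} with CSVLoader: {e}")
        loaded_documents = []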

In [None]:
# Prepare data for ChromaDB. The loaded_documents are ALMOST ready.
# We need to add unique IDs and potentially enrich metadata if needed later,
# but the main text for embedding is doc.page_content.
documents_for_db = []
metadatas_for_db = []
ids_for_db = []

if loaded_documents:
    print("Preparing loaded documents for ChromaDB...")

    # Store the original loaded documents or just the page_content and metadata
    # Let's create simplified entries for ChromaDB
    for i, doc in enumerate(loaded_documents):
        # The text to be embedded is the page_content from the loader
        embedding_text = doc.page_content

        # Create a unique ID for each listing
        listing_id = f"listing_{i+1}"

        # We need the original metadata (neighborhood, price, etc.) for the personalization step later.
        # Since CSVLoader didn't put it in doc.metadata, we need to re-parse the page_content here
        # to get that structured data for the *metadata* store in Chroma.
        # This brings back some parsing complexity, but *only* for the metadata dictionary,
        # not for the main text that gets embedded.

        # --- Parsing key:value from page_content to get metadata for Chroma ---
        raw_key_value_string = doc.page_content
        parsed_listing_metadata = {}
        # Column names, in the order they appear as "key: value" pairs in page_content
        # (format observed from debug output)
        final_metadata_keys = ['neighborhood', 'price', 'bedrooms', 'bathrooms', 'house_size', 'description', 'neighborhood_description']

        # Find the start index for each key string and extract its value
        current_pos = 0
        for j, final_key in enumerate(final_metadata_keys):
            # Assuming the format is exactly "key: " + value + "next_key: "
            key_string_in_page_content = final_key + ': '

            start_index_of_key_string = raw_key_value_string.find(key_string_in_page_content, current_pos)

            if start_index_of_key_string != -1:
                start_index_of_value = start_index_of_key_string + len(key_string_in_page_content)

                # Find the start of the *next* key string to determine the end of the current value
                end_index_of_value = len(raw_key_value_string)  # Assume value goes to the end by default
                if j + 1 < len(final_metadata_keys):
                    next_key_string = final_metadata_keys[j+1] + ': '
                    next_key_start_index = raw_key_value_string.find(next_key_string, start_index_of_value)
                    if next_key_start_index != -1:
                        end_index_of_value = next_key_start_index

                # Extract the value
                value = raw_key_value_string[start_index_of_value:end_index_of_value].strip()
                parsed_listing_metadata[final_key] = value if value != '' else 'N/A'

                # Update current position for the next search
                current_pos = end_index_of_value

            else:
                # Key string was not found, so the structure is broken for the remaining keys
                # in this row: record N/A and stop searching
                parsed_listing_metadata[final_key] = 'N/A'
                break


        # Add default N/A for any keys not found (if break happened before all keys were processed)
        for key in final_metadata_keys:
            if key not in parsed_listing_metadata:
                parsed_listing_metadata[key] = 'N/A'


        # Add the generated ID and other basic metadata
        parsed_listing_metadata['id'] = listing_id
        # Optionally keep source and row from original doc.metadata
        parsed_listing_metadata['source'] = doc.metadata.get('source', 'N/A')
        parsed_listing_metadata['row'] = doc.metadata.get('row', i)

        # --- DEBUG PRINT: Inspect parsed metadata before adding to lists ---
        if i < 3:
            print(f"Doc {i+1} parsed metadata for DB: {parsed_listing_metadata}")
        # --- END DEBUG PRINT ---


        # Check if crucial metadata was parsed
        if parsed_listing_metadata.get('description', 'N/A') == 'N/A' or parsed_listing_metadata.get('neighborhood', 'N/A') == 'N/A':
            print(f"Warning: Parsed metadata for doc {i+1} is missing description or neighborhood. Skipping document.")
            continue  # Skip adding this document if essential data is missing


        # Add the text for embedding, the parsed metadata, and the id
        documents_for_db.append(embedding_text) # page_content is the text to embed
        metadatas_for_db.append(parsed_listing_metadata) # This is the structured data for retrieval info
        ids_for_db.append(listing_id)


    # Check if any documents were successfully processed
    if not documents_for_db:
        print("No documents were successfully processed into valid entries for the database.")


    print(f"Prepared {len(documents_for_db)} documents and metadatas for ChromaDB.")
else:
    print("No documents were successfully loaded by CSVLoader.")


Preparing loaded documents for ChromaDB...
Doc 1 parsed metadata for DB: {'neighborhood': 'Cityscape Heights', 'price': '$650,000', 'bedrooms': '2', 'bathrooms': '2', 'house_size': '120 m²', 'description': 'Step into luxury living in this modern downtown apartment at Cityscape Heights. This 2-bedroom, 2-bathroom unit offers breathtaking city views from every room. The sleek and stylish design features high-end finishes, floor-to-ceiling windows, and a gourmet kitchen with top-of-the-line appliances. The spacious master bedroom includes a walk-in closet and en-suite bathroom with a soaking tub. Enjoy the convenience of urban living with easy access to trendy restaurants, shopping, and entertainment options just steps away from your doorstep.', 'neighborhood_description': 'Cityscape Heights is a vibrant urban neighborhood known for its bustling nightlife, upscale dining options, and cultural attractions. Residents can take advantage of the nearby fitness centers, parks, and walking trail

In [25]:
# Initialize ChromaDB
# Using an in-memory collection for simplicity, but you can persist it.
collection_name = "home_match_listings"
# Check if the collection exists and delete if necessary (workaround)
try:
    # Attempt to get the collection to see if it exists
    chroma_client = Chroma(embedding_function=embeddings_model)._client
    existing_collections = chroma_client.list_collections()
    if any(coll.name == collection_name for coll in existing_collections):
         print(f"Clearing existing collection: {collection_name}")
         chroma_client.delete_collection(collection_name)

    # Initialize Chroma instance linked to the client (or create new if deleted)
    print(f"Initializing/Re-initializing ChromaDB collection: {collection_name}")
    vector_db = Chroma(
        collection_name=collection_name,
        embedding_function=embeddings_model,
        client=chroma_client,
        # persist_directory="./chroma_db_homematch"
    )

except Exception as e:
    print(f"Error during ChromaDB initialization/clearing (may be first run or client issue): {e}")
    vector_db = Chroma(
        collection_name=collection_name,
        embedding_function=embeddings_model,
        # persist_directory="./chroma_db_homematch"
    )

Clearing existing collection: home_match_listings
Initializing/Re-initializing ChromaDB collection: home_match_listings


In [26]:
# Add documents to ChromaDB if there are any to add
if documents_for_db:
    # We pass the page_content (embedding_text) as the text to embed,
    # and the parsed_listing_metadata as the metadata.
    vector_db.add_texts(texts=documents_for_db, metadatas=metadatas_for_db, ids=ids_for_db)
    print(f"\nAdded {len(documents_for_db)} listings to ChromaDB collection '{collection_name}'.")
    print(f"Total documents in collection: {vector_db._collection.count()}")
else:
    print("\nNo valid listings were processed and added to the database.")


Added 10 listings to ChromaDB collection 'home_match_listings'.
Total documents in collection: 10


#### Step 4: Building the User Preference Interface
Collect buyer preferences, such as the number of bedrooms, bathrooms, location, and other specific requirements, either by asking a set of questions or by having the buyer enter their preferences in natural language. You can hard-code the preferences as questions and answers, or collect them interactively in whatever way you prefer.
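
For the interactive option, one possible approach is to ask each question in turn and fall back to a default answer when the buyer leaves a reply blank. The sketch below is an assumption about how that could look; `collect_preferences` and its `ask` parameter are hypothetical names, and the demonstration swaps `input` for canned replies so it runs without a prompt.

```python
def collect_preferences(questions, default_answers, ask=input):
    """Ask each question; fall back to the default answer when the reply is blank."""
    answers = []
    for question, default in zip(questions, default_answers):
        reply = ask(f"{question}\n> ").strip()
        answers.append(reply if reply else default)
    # Join the answers into a single string suitable for embedding and search
    return " ".join(answers)

# Non-interactive demonstration: a stand-in for input() that returns canned replies
canned_replies = iter(["Three bedrooms with a big kitchen.", ""])
preferences = collect_preferences(
    ["How big do you want your house to be?", "Which amenities would you like?"],
    ["A comfortable three-bedroom house.", "A backyard for gardening."],
    ask=lambda prompt: next(canned_replies),
)
print(preferences)
```

Calling `collect_preferences(questions, answers)` without the `ask` argument would prompt on the console instead.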

In [28]:
# Hard-coded buyer preferences (as per project instructions example)
questions = [
    "How big do you want your house to be?",
    "What are the 3 most important things for you in choosing this property?",
    "Which amenities would you like?",
    "Which transportation options are important to you?",
    "How urban do you want your neighborhood to be?",
]
answers = [
    "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
    "A quiet neighborhood, good local schools, and convenient shopping options.",
    "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
    "Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.",
    "A balance between suburban tranquility and access to urban amenities like restaurants and theaters.",
]

# Buyer Preference Parsing:
# Combine answers into a single string representing buyer preferences for embedding and search.
buyer_preferences_string = " ".join(answers)

print("Buyer Preferences String for Searching:")
print(buyer_preferences_string)

Buyer Preferences String for Searching:
A comfortable three-bedroom house with a spacious kitchen and a cozy living room. A quiet neighborhood, good local schools, and convenient shopping options. A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system. Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads. A balance between suburban tranquility and access to urban amenities like restaurants and theaters.


#### Step 5: Searching Based on Preferences
Semantic Search Implementation: Use the structured buyer preferences to perform a semantic search on the vector database, retrieving listings that most closely match the user's requirements.
Listing Retrieval Logic: Fine-tune the retrieval algorithm to ensure that the most relevant listings are selected based on the semantic closeness to the buyer’s preferences.
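
To make "semantic closeness" concrete, the sketch below ranks a few listings by cosine similarity to a query vector in plain Python. The three-dimensional vectors are made up for illustration and are not real embeddings, and the function names are hypothetical. The vector database may use a different metric (for example L2 distance), so treat this as an illustration of the idea rather than a description of what Chroma does internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, listing_vectors, k=3):
    """Return the k listing ids most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), listing_id)
        for listing_id, vector in listing_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [listing_id for _, listing_id in scored[:k]]

# Toy "embeddings" (made up): listings 2 and 3 point in roughly the query's direction
toy_listings = {
    "listing_1": [0.9, 0.1, 0.0],
    "listing_2": [0.1, 0.9, 0.2],
    "listing_3": [0.2, 0.8, 0.1],
}
toy_query = [0.0, 1.0, 0.1]

print(top_k(toy_query, toy_listings, k=2))
```

With a similarity measure, higher scores mean closer matches; with a distance measure, lower scores do.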

In [37]:
# Perform semantic search
if vector_db._collection.count() > 0:
    print(f"\nSearching for listings based on preferences (using {vector_db._collection.count()} listings in DB)...")
    # The number of results to retrieve
    num_results_to_retrieve = 3
    if vector_db._collection.count() < num_results_to_retrieve:
        num_results_to_retrieve = vector_db._collection.count()  # Adjust if fewer docs than desired results
        print(f"Adjusted number of results to retrieve to {num_results_to_retrieve} due to DB size.")


    if num_results_to_retrieve > 0:
        # The similarity search compares the query against the embeddings created from each
        # listing's page_content and returns the matching Document objects with their metadata
        retrieved_docs_with_scores = vector_db.similarity_search_with_score(
            buyer_preferences_string,
            k=num_results_to_retrieve
        )
        print(f"\nTop {len(retrieved_docs_with_scores)} Retrieved Listings (with similarity scores):")
        retrieved_listings_for_personalization = []
        for doc, score in retrieved_docs_with_scores:
            # Accessing metadata which was loaded from CSV and processed
            print(f"Listing ID: {doc.metadata.get('id')}")
            print(f"Neighborhood: {doc.metadata.get('neighborhood')}")
            print(f"Price: {doc.metadata.get('price')}")
            print(f"Bedrooms: {doc.metadata.get('bedrooms')}, Bathrooms: {doc.metadata.get('bathrooms')}\n") # Added newline for readability
            print(f"Original Description (Snippet): {doc.metadata.get('description', '')[:200]}...") # Show a snippet
            print(f"Similarity Score (lower is better for L2, higher for cosine): {score}")
            print("------------------------------------")
            retrieved_listings_for_personalization.append(doc.metadata) # Store metadata for next step
    else:
        print("No documents in database to search against.")
        retrieved_listings_for_personalization = []
else:
    print("\nVector database is empty. Cannot perform search.")
    retrieved_listings_for_personalization = []


Searching for listings based on preferences (using 10 listings in DB)...

Top 3 Retrieved Listings (with similarity scores):
Listing ID: listing_6
Neighborhood: Sunset Meadows
Price: $650,000
Bedrooms: 4, Bathrooms: 2

Original Description (Snippet): Welcome to this spacious 4-bedroom, 2-bathroom family home in the desirable Sunset Meadows neighborhood. This property features a large backyard with a sparkling pool, perfect for outdoor entertaining...
Similarity Score (lower is better for L2, higher for cosine): 0.33005866408348083
------------------------------------
Listing ID: listing_5
Neighborhood: Victorian Heights
Price: $950,000
Bedrooms: 4, Bathrooms: 2

Original Description (Snippet): Step back in time with this stunning Victorian home in the sought-after neighborhood of Victorian Heights. This 4-bedroom, 2-bathroom property exudes historical charm with original hardwood floors, in...
Similarity Score (lower is better for L2, higher for cosine): 0.338046133518219
------------

#### Step 6: Personalizing Listing Descriptions
LLM Augmentation: For each retrieved listing, use the LLM to augment the description, tailoring it to resonate with the buyer’s specific preferences. This involves subtly emphasizing aspects of the property that align with what the buyer is looking for.
Maintaining Factual Integrity: Ensure that the augmentation process enhances the appeal of the listing without altering factual information.
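
One rough way to spot altered facts is to check that every number in the personalized text also appears in the original listing details. The sketch below is a crude heuristic under that assumption, not a complete fact checker: it will miss spelled-out numbers such as "three" and any non-numeric inventions. The function name and sample texts are hypothetical.

```python
import re

# Matches integers with optional thousands separators and decimals, e.g. 4, 650,000, 2.5
NUMBER_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def unsupported_numbers(source_text, rewritten_text):
    """Return numbers that appear in the rewrite but not in the source text."""
    source_numbers = set(NUMBER_PATTERN.findall(source_text))
    rewritten_numbers = NUMBER_PATTERN.findall(rewritten_text)
    return sorted({n for n in rewritten_numbers if n not in source_numbers})

# Made-up texts for illustration
source = (
    "Price: $650,000. Bedrooms: 4. Bathrooms: 2. "
    "Spacious 4-bedroom, 2-bathroom family home with a pool."
)
faithful = "This 4-bedroom, 2-bathroom home at $650,000 has a pool for summer fun."
invented = "This 5-bedroom home at $650,000 includes a 3-car garage."

print(unsupported_numbers(source, faithful))
print(unsupported_numbers(source, invented))
```

A non-empty result is a signal to review or regenerate the personalized description.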

In [38]:
personalization_prompt_template = """
You are a helpful real estate assistant.
Given the following original property description and the buyer's preferences,
rewrite the description to highlight aspects that match the buyer's preferences.
Do NOT invent new facts about the property. Only rephrase and emphasize existing details from the original description
that align with the buyer's needs.
If the original description is very short or lacks details relevant to preferences,
simply state that the original description is concise and then present it.

Buyer's Preferences:
{buyer_preferences}

Original Property Listing Details:
Neighborhood: {neighborhood}
Price: {price}
Bedrooms: {bedrooms}
Bathrooms: {bathrooms}
House Size: {house_size}
Original Description: {original_description}
Original Neighborhood Description: {neighborhood_description}

Personalized Property Description (emphasizing aspects relevant to buyer preferences):
"""
# The prompt template for personalization
# This template will be used to generate personalized descriptions based on buyer preferences
personalization_prompt = PromptTemplate(
    input_variables=["buyer_preferences", "neighborhood", "price", "bedrooms", "bathrooms", "house_size", "original_description", "neighborhood_description"],
    template=personalization_prompt_template
)

personalization_chain = LLMChain(llm=llm, prompt=personalization_prompt)

# Personalize the retrieved listings based on buyer preferences
print("\nPersonalized Listing Descriptions:")
# Check if we have any listings to personalize
if not retrieved_listings_for_personalization:
    print("No listings were retrieved to personalize.")
else:
    for listing_metadata in retrieved_listings_for_personalization:
        print(f"\n--- Personalizing Listing ID: {listing_metadata.get('id')} ---")
        try:
            # Passing the metadata dictionary loaded from CSV and retrieved from DB
            personalized_description = personalization_chain.run(
                buyer_preferences=buyer_preferences_string,
                neighborhood=listing_metadata.get('neighborhood', 'N/A'),
                price=listing_metadata.get('price', 'N/A'),
                bedrooms=listing_metadata.get('bedrooms', 'N/A'),
                bathrooms=listing_metadata.get('bathrooms', 'N/A'),
                house_size=listing_metadata.get('house_size', 'N/A'),
                original_description=listing_metadata.get('description', 'N/A'),
                neighborhood_description=listing_metadata.get('neighborhood_description', 'N/A') 
            )
            print(f"Original Description:\n{listing_metadata.get('description', 'N/A')}")
            print(f"\nPersonalized Description:\n{personalized_description.strip()}")
        except Exception as e:
            print(f"Error personalizing listing {listing_metadata.get('id')}: {e}")
        print("------------------------------------")


Personalized Listing Descriptions:

--- Personalizing Listing ID: listing_6 ---
Original Description:
Welcome to this spacious 4-bedroom, 2-bathroom family home in the desirable Sunset Meadows neighborhood. This property features a large backyard with a sparkling pool, perfect for outdoor entertaining and summer fun. The open floor plan includes a modern kitchen with granite countertops, stainless steel appliances, and a breakfast bar. The master suite offers a walk-in closet and en-suite bathroom with a soaking tub and dual vanity. Enjoy the suburban lifestyle in this charming Sunset Meadows home!

Personalized Description:
This spacious 4-bedroom, 2-bathroom family home in the desirable Sunset Meadows neighborhood is perfect for those seeking a comfortable living space. The modern kitchen with granite countertops, stainless steel appliances, and breakfast bar is ideal for cooking and entertaining. The large backyard, complete with a sparkling pool, offers a great space for gardening

#### Step 7: Deliverables and Testing
Test your "HomeMatch" application and make sure it meets all of the requirements in the rubric. Your project code will be run when it's assessed. Enter different "buyer preferences" and ensure it works.
Jupyter Notebook/Python Program: Compile the application code in a Jupyter notebook or a standalone Python program. Ensure the code is well-commented and logically structured.
Example Outputs: Include example outputs showcasing how user preferences are processed and how the application generates personalized listing descriptions. You can include these in comments in your application or in a Jupyter notebook that's saved with outputs.
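
A small harness can make it easier to try several buyer profiles in one run. The sketch below assumes the search and personalization steps are wrapped in a single callable that takes a preference string and a result count; `run_smoke_test`, `fake_match`, and the sample profiles are hypothetical stand-ins rather than part of the project.

```python
def run_smoke_test(match_fn, preference_sets, k=3):
    """Call match_fn for each buyer profile and check the basic shape of the result."""
    report = {}
    for name, preferences in preference_sets.items():
        matches = match_fn(preferences, k)
        assert isinstance(matches, list), f"{name}: expected a list of results"
        assert len(matches) <= k, f"{name}: more than {k} results returned"
        report[name] = len(matches)
    return report

# Stand-in for the real search-and-personalize pipeline (hypothetical)
def fake_match(preferences, k):
    return [f"listing for: {preferences[:20]}"][:k]

profiles = {
    "family": "Three bedrooms, good schools, a quiet street.",
    "urban": "A downtown loft near restaurants and transit.",
}
report = run_smoke_test(fake_match, profiles, k=3)
print(report)
```

Replacing `fake_match` with a function that runs the real search and personalization would exercise the application end to end.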