## Project Documentation

1. **Environment Setup**  
   Installs and configures the OpenAI API, LangChain, and ChromaDB clients.

2. **Synthetic Data Generation**  
   Uses GPT-3.5 to generate 12 property listings, saved to `listings.json` and, one JSON object per line, to `listings`.

3. **Embeddings & Indexing**  
   Computes text embeddings with OpenAI’s `text-embedding-ada-002` and indexes them in ChromaDB.

4. **Semantic Search**  
   Defines `search_listings(preferences, top_k)` to embed buyer preferences and retrieve the top-k most similar listings.

5. **Personalization**  
   Uses an `LLMChain` and prompt template to generate 2–3 paragraph descriptions tailored to each buyer’s needs while preserving factual details.

6. **Interactive Demo**  
   Prompts the user for preferences and displays the top 3 personalized listings.

7. **Files to Submit**  
   - `HomeMatch.ipynb` (this notebook)  
   - `listings.json` & `listings` (synthetic listings)  
   - `requirements.txt`  
   - `README.md` (project overview & setup instructions)


This is a starter notebook for the project. Import the libraries you need; the ones available in this workspace are listed in `requirements.txt`.

## 1. Environment Setup 

In [1]:
%pip install --user -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os

# Placeholder: replace "my key" with your own API key (do not commit a real key)
os.environ["OPENAI_API_KEY"] = "my key"
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

# All libraries
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
import chromadb
from chromadb.config import Settings
import json
from langchain import PromptTemplate, LLMChain
import pandas as pd
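
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is either already exported in the shell or typed in at run time; the helper name `load_api_key` is hypothetical and not part of the original notebook:

```python
import os
from getpass import getpass


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in cell output
        key = getpass(f"Enter {var_name}: ").strip()
        os.environ[var_name] = key
    return key
```

Calling `load_api_key()` before the LangChain imports would leave `OPENAI_API_KEY` set for the rest of the session without the value ever appearing in the saved notebook.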

In [3]:
# Initialize LLM and embedder
llm = OpenAI(model_name='gpt-3.5-turbo')
embedder = OpenAIEmbeddings()

# Initialize ChromaDB client
chroma_client = chromadb.Client()

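# Create or connect to the collection.
# Note: embeddings are computed with `embedder` and passed explicitly to
# add() and query() below, so embedding_function is not invoked in this notebook.
collection = chroma_client.get_or_create_collection(
    name="home_match_listings",
    embedding_function=embedder
)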

print("Environment setup complete.")

Environment setup complete.




## 2. Synthetic Data Generation  

In [4]:
def generate_listings(n=10):
    base_prompt = (
        "Generate a realistic real estate listing in JSON format with fields: "
        "neighborhood, price, bedrooms, bathrooms, size_sqft, description, neighborhood_description. "
        "Ensure diverse locations, styles, and budgets."
    )
    listings = []
    for i in range(n):
        resp = llm(base_prompt + f"\n\nListing {i+1}:", temperature=0.7, max_tokens=400)
        try:
            data = json.loads(resp)
            listings.append(data)
        except json.JSONDecodeError:
            print(f"Couldn’t parse listing {i+1}")
    return listings

# generate and save
listings = generate_listings(12)
with open("listings.json", "w") as f:
    json.dump(listings, f, indent=2)
print(f"Saved {len(listings)} listings to listings.json.")


Saved 12 listings to listings.json.


In [5]:
# Save listings to a line-delimited text file (one JSON object per line), per the instructor's instructions
with open("listings", "w") as f:
    for lst in listings:
        f.write(json.dumps(lst) + "\n")
print("Saved synthetic listings to 'listings'")

Saved synthetic listings to 'listings'
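
`generate_listings` silently skips any response that `json.loads` cannot parse, and it does not check that the requested fields are present. The sketch below shows one way to be more tolerant of responses that wrap the JSON in extra text, and stricter about missing fields. `parse_listing` and `REQUIRED_FIELDS` are hypothetical names introduced here for illustration; the field list is copied from the prompt above.

```python
import json
import re

# Fields requested in the generation prompt
REQUIRED_FIELDS = {
    "neighborhood", "price", "bedrooms", "bathrooms",
    "size_sqft", "description", "neighborhood_description",
}


def parse_listing(raw: str):
    """Parse one model response into a dict, or return None if it is unusable."""
    # Models sometimes surround the JSON with prose; keep only the outermost {...}
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Reject listings that omit any of the requested fields
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data
```

Inside the generation loop, `parse_listing(resp)` could replace the bare `json.loads(resp)` call, with a `None` result treated the same way as the current parse failure.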


## 3. Embeddings & Vector Store  

In [6]:
# 1. Load generated listings
with open('listings.json', 'r') as f:
    listings = json.load(f)
print(f"Loaded {len(listings)} listings from listings.json.")

Loaded 12 listings from listings.json.


In [7]:
# 2. Prepare each listing as a single text document
documents, metadatas, ids = [], [], []
for idx, lst in enumerate(listings, start=1):
    text = (
        f"Neighborhood: {lst['neighborhood']}. "
        f"Price: {lst['price']}. "
        f"Bedrooms: {lst['bedrooms']}, Bathrooms: {lst['bathrooms']}. "
        f"Size: {lst['size_sqft']} sqft. "
        f"Description: {lst['description']}"
    )
    documents.append(text)
    metadatas.append(lst)
    ids.append(f"listing_{idx}")

In [8]:
# 3. Compute embeddings
embeddings = embedder.embed_documents(documents)
print("Computed embeddings for documents, vector length:", len(embeddings[0]))

Computed embeddings for documents, vector length: 1536


In [9]:
# 4. Add them to your ChromaDB collection
collection.add(
    ids=ids,
    embeddings=embeddings,
    metadatas=metadatas,
    documents=documents
)


print(f"Indexed {len(documents)} listings in ChromaDB.")

Indexed 12 listings in ChromaDB.
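
The indexing step passes each raw listing dict as metadata. Chroma generally expects metadata values to be scalars (strings, numbers, or booleans), so a listing in which the model returned a nested value, such as a list of amenities, may be rejected. A hedged sketch of a clean-up step follows; `to_chroma_metadata` is a hypothetical helper, not part of the notebook, and serializing nested values to JSON strings is just one possible choice.

```python
import json


def to_chroma_metadata(listing: dict) -> dict:
    """Coerce values so every metadata entry is a str, int, float, or bool."""
    clean = {}
    for key, value in listing.items():
        if value is None:
            continue  # drop missing values rather than store a null
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        else:
            # Lists and dicts are stored as JSON text so they can be decoded later
            clean[key] = json.dumps(value)
    return clean
```

With this in place, `metadatas.append(to_chroma_metadata(lst))` would be a drop-in change for the metadata line in the preparation loop.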


## 4. Semantic Search  

In [10]:
def search_listings(preferences: str, top_k: int = 5):
    # 1. Embed the preferences
    pref_embedding = embedder.embed_query(preferences)
    # 2. Query the vector store
    results = collection.query(
        query_embeddings=[pref_embedding],
        n_results=top_k
    )
    # 3. Collect matches (Chroma returns distances, so a lower score means a closer match)
    matches = []
    for metadata, distance in zip(results['metadatas'][0], results['distances'][0]):
        entry = metadata.copy()
        entry['score'] = distance
        matches.append(entry)
    return matches

In [11]:
# 4. Run an example query with sample buyer preferences
prefs = "A quiet suburban home with 3 bedrooms, a large backyard, and proximity to good schools."

results = search_listings(prefs, top_k=5)

# Convert to DataFrame for readability
def to_df(matches):
    return pd.DataFrame([
        {"neighborhood": m["neighborhood"],
         "price": m["price"],
         "score": m["score"]}
        for m in matches
    ])

df = to_df(results)

print("Top 5 matches for Suburban Family Home")
display(df)

Top 5 matches for Suburban Family Home


|   | neighborhood | price | score |
|---|---|---|---|
| 0 | West Hollywood | $1,200,000 | 0.364597 |
| 1 | West Hollywood | $1,250,000 | 0.365515 |
| 2 | Park Slope | $1,200,000 | 0.368098 |
| 3 | Williamsburg | $1,200,000 | 0.368478 |
| 4 | Brooklyn Heights | $1,200,000 | 0.384759 |
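
The `score` column holds a distance, so the results are ordered from smallest to largest. The toy example below illustrates the idea with plain Python and two-dimensional vectors. It is only a sketch: it assumes Euclidean distance for simplicity, whereas the metric Chroma actually uses depends on how the collection is configured, and `top_k_by_distance` and `toy_index` are hypothetical names.

```python
import math


def top_k_by_distance(query, vectors, k=3):
    """Return (id, distance) pairs for the k vectors closest to the query."""
    scored = [(vid, math.dist(query, vec)) for vid, vec in vectors.items()]
    # Smaller distance means more similar, so sort in ascending order
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Made-up 2-D "embeddings" standing in for real 1536-dimensional vectors
toy_index = {
    "listing_1": [1.0, 0.0],
    "listing_2": [0.0, 1.0],
    "listing_3": [0.9, 0.1],
}
print(top_k_by_distance([1.0, 0.0], toy_index, k=2))
```

Here `listing_1` ranks first with a distance of zero and `listing_3` second. In the real results above, the five distances are very close together, which suggests that none of the listings stands out as a strong match for the query.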


## 5. Personalized Descriptions  

In [12]:
# 1. Define a prompt template that injects buyer prefs + listing facts
template = (
    "Buyer Preferences: {prefs}\n"
    "Listing Data: Neighborhood: {neighborhood}, Price: {price}, Bedrooms: {bedrooms}, "
    "Bathrooms: {bathrooms}, Size: {size_sqft} sqft. Description: {description}\n\n"
    "Write a personalized 2–3 paragraph property description that highlights the buyer’s "
    "priorities without altering any factual details."
)
prompt = PromptTemplate(
    input_variables=["prefs", "neighborhood", "price", "bedrooms", "bathrooms", "size_sqft", "description"],
    template=template
)

In [13]:
# 2. Create an LLMChain
chain = LLMChain(llm=llm, prompt=prompt)

In [14]:
# 3. Helper function to run it
def personalize_listing(listing: dict, prefs: str) -> str:
    return chain.run(
        prefs=prefs,
        neighborhood=listing["neighborhood"],
        price=listing["price"],
        bedrooms=listing["bedrooms"],
        bathrooms=listing["bathrooms"],
        size_sqft=listing["size_sqft"],
        description=listing["description"]
    )
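
The prompt asks the model not to alter factual details, but nothing in the notebook verifies that it complied. A deliberately crude check is sketched below: it only tests whether selected field values appear verbatim in the generated text. `missing_facts` is a hypothetical helper; a substring match can miss rephrased facts (for example "three bedrooms") and cannot detect invented claims, so treat it as a first-pass filter at best.

```python
def missing_facts(listing: dict, text: str,
                  fields=("neighborhood", "price", "bedrooms")):
    """Return the fields whose values do not appear verbatim in the text."""
    lowered = text.lower()
    return [
        field for field in fields
        if str(listing[field]).lower() not in lowered
    ]
```

A possible use is `missing_facts(match, personalize_listing(match, prefs))`, regenerating or flagging the description when the returned list is not empty.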

In [15]:
# 4. Example usage
print("\n--- Personalized Descriptions ---\n")
matches = search_listings(prefs)
for i, match in enumerate(matches, start=1):
    print(f"Match {i}: {match['neighborhood']} at {match['price']}")
    print(personalize_listing(match, prefs))


--- Personalized Descriptions ---

Match 1: West Hollywood at $1,200,000
Welcome home to this stunning modern property located in the quiet suburban neighborhood of West Hollywood. Boasting 3 bedrooms and 2 bathrooms, this newly renovated home offers a spacious 1800 sqft of living space, perfect for a growing family. The large backyard with a pool provides ample space for outdoor activities and entertaining guests. Situated in proximity to top-rated schools, this property is ideal for families looking to settle down in a peaceful and family-friendly community. Don't miss out on the opportunity to own your dream home in this sought-after neighborhood for $1,200,000.
Match 2: West Hollywood at $1,250,000
Welcome to your quiet suburban oasis in the heart of West Hollywood! This beautiful modern home boasts 3 bedrooms, 2.5 bathrooms, and a spacious backyard, perfect for enjoying the California sunshine. With high-end finishes and an open floor plan, this recently renovated property is ide

## 6. Interactive Demo  

In [16]:
prefs = input("Enter your home preferences (e.g. 3 beds, big yard, quiet suburb, etc):\n> ")
matches = search_listings(prefs, top_k=3)

print("\nHere are your top matches:\n")
for i, match in enumerate(matches, start=1):
    print(f"--- Match {i}: {match['neighborhood']} at {match['price']} ---")
    print(personalize_listing(match, prefs))
    print()
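
The demo passes whatever the user types straight to `search_listings`, including an empty string, which is not a meaningful query. One possible guard is sketched below. `read_preferences` is a hypothetical helper, and the `ask` parameter exists only so the function can be exercised without a live prompt.

```python
def read_preferences(prompt_text: str, ask=input, max_attempts: int = 3) -> str:
    """Ask until a non-empty answer is given, or raise after max_attempts."""
    for _ in range(max_attempts):
        answer = ask(prompt_text).strip()
        if answer:
            return answer
        print("Please describe at least one preference.")
    raise ValueError("No preferences were provided.")
```

In the demo cell, `prefs = read_preferences("Enter your home preferences:\n> ")` could stand in for the direct `input(...)` call.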

Enter your home preferences (e.g. 3 beds, big yard, quiet suburb, etc):
> A big house with a big yard. I prefer to live in a quiet neighborhood and must have 3 bedrooms

Here are your top matches:

--- Match 1: West Hollywood at $1,200,000 ---
Welcome to your dream home in West Hollywood! This stunning property offers everything you're looking for, including a big house with 3 bedrooms and 2 bathrooms, perfect for you and your family. With a size of 1800 sqft, there's plenty of space for everyone to enjoy. The quiet neighborhood provides a peaceful and serene atmosphere, ideal for relaxing after a long day. Plus, the spacious backyard with a pool is perfect for hosting gatherings or simply unwinding in your own private oasis. Don't miss out on this opportunity to live in style and luxury in the heart of West Hollywood!

--- Match 2: West Hollywood at $1,250,000 ---
Welcome to your dream home in West Hollywood! This stunning property offers everything you've been searching for - a big h