## Synthetic Data Generation

### Generating Real Estate Listings with an LLM

The submission must demonstrate using a Large Language Model (LLM) to generate at least 10 diverse and realistic real estate listings containing facts about the real estate.


In [3]:
import os
import json
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = os.getenv("VOC_OPENAI_API_KEY") 
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

model_name = 'gpt-3.5-turbo'
llm = ChatOpenAI(model_name=model_name, temperature=0, max_tokens=2000)

In [2]:
# kickstart the listing creation with an example listing

listings = ["""
Neighborhood: Green Oaks
Price: $800,000
Bedrooms: 3
Bathrooms: 2
House Size: 2,000 sqft

Description: Welcome to this eco-friendly oasis nestled in the heart of Green Oaks. This charming 3-bedroom, 2-bathroom home boasts energy-efficient features such as solar panels and a well-insulated structure. Natural light floods the living spaces, highlighting the beautiful hardwood floors and eco-conscious finishes. The open-concept kitchen and dining area lead to a spacious backyard with a vegetable garden, perfect for the eco-conscious family. Embrace sustainable living without compromising on style in this Green Oaks gem.

Neighborhood Description: Green Oaks is a close-knit, environmentally-conscious community with access to organic grocery stores, community gardens, and bike paths. Take a stroll through the nearby Green Oaks Park or grab a cup of coffee at the cozy Green Bean Cafe. With easy access to public transportation and bike lanes, commuting is a breeze.
"""]

In [3]:
from langchain_core.messages import HumanMessage, SystemMessage

def invent_listing(example_listings):
    history = [
        SystemMessage(content="You are a real estate agent creating a listings for a homes. Make sure to vary between neighborhoods in the city center and close to nature, large and small homes, and cheap and luxury homes."),
        HumanMessage(content="Here are examples of listings:\n\n" + "\n".join(example_listings)+ "Invent another listing for a home following the same format as the example above.")
    ]
    return llm.invoke(history).content

In [4]:
for i in range(10):
    # get the last two listings and generate a new one
    new_listing = invent_listing(listings[-2:])
    print(new_listing)
    listings.append(new_listing)

# write the listings to a json file to avoid having to recreate them every time
with open('listings.json', 'w') as f:
    json.dump(listings, f, indent=2)

Neighborhood: Downtown City Center
Price: $1,500,000
Bedrooms: 4
Bathrooms: 3.5
House Size: 3,500 sqft

Description: Step into luxury living in the heart of the bustling Downtown City Center. This exquisite 4-bedroom, 3.5-bathroom home offers unparalleled elegance and sophistication. The grand foyer welcomes you into a spacious living room with high ceilings and a stunning crystal chandelier. The gourmet kitchen features top-of-the-line appliances and a large island, perfect for entertaining guests. Retreat to the master suite with a spa-like bathroom and a private balcony overlooking the city skyline. Enjoy the convenience of urban living without compromising on comfort and style in this downtown masterpiece.

Neighborhood Description: Downtown City Center is a vibrant hub of culture, dining, and entertainment. Explore world-class restaurants, boutique shops, and art galleries just steps away from your doorstep. Catch a show at the historic theater or take a leisurely stroll along the

In [4]:
# load the listings from the json file
with open('listings.json', 'r') as f:
    listings = json.load(f)

Example of an invented listing:



## Semantic Search

### Creating a Vector Database and Storing Listings

The project must demonstrate the creation of a vector database and successfully storing real estate listing embeddings within it. The database should effectively store and organize the embeddings generated from the LLM-created listings.



In [26]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_chroma import Chroma

# split the listings into chunks of 4000 characters. The chunks should be large enough to contain the entire listing. 
# Neverthelss we chunk just to make sure we don't exceed the token limit of the model.
splitter = CharacterTextSplitter(chunk_size=4000, chunk_overlap=0)
split_docs = splitter.split_documents([Document(listing, metadata={"source":f"listing {i}"}) for i, listing in enumerate(listings)])

embeddings = OpenAIEmbeddings()

vector_store = Chroma(
    collection_name="listings_collection",
    embedding_function=embeddings,
)
vector_store.add_documents(documents=split_docs)

def print_results(results):
    for i, (doc, score) in enumerate(results):
        print(f"Rank {i+1}: {doc.metadata['source']} with score {score:.2f}\n=======================================\n{doc.page_content}\n\n")

# https://python.langchain.com/v0.2/docs/versions/migrating_chains/retrieval_qa/



### Semantic Search of Listings Based on Buyer Preferences

The application must include a functionality where listings are semantically searched based on given buyer preferences. The search should return listings that closely match the input preferences.

In [27]:

results = vector_store.similarity_search_with_relevance_scores("An affordable house in the woods", k=3)
print_results(results)

Rank 1: listing 8 with score 0.79
Neighborhood: Tranquil Woods Retreat
Price: $600,000
Bedrooms: 4
Bathrooms: 3
House Size: 2,500 sqft

Description: Escape to the peaceful surroundings of Tranquil Woods Retreat, where this charming 4-bedroom, 3-bathroom home offers a serene sanctuary amidst nature's beauty. The cozy living room features a stone fireplace and large windows that frame the lush greenery outside, creating a warm and inviting atmosphere for relaxation. The country-style kitchen boasts wood cabinetry, granite countertops, and a breakfast nook overlooking the tranquil backyard. The spacious master suite is a private haven with a spa-like ensuite bathroom and a walk-in closet for added convenience. Step outside to the expansive deck surrounded by towering trees, perfect for enjoying morning coffee or hosting outdoor gatherings with loved ones. Experience the tranquility of nature in this Tranquil Woods Retreat home.

Neighborhood Description: Tranquil Woods Retreat is a seclud


## Augmented Response Generation

### Logic for Searching and Augmenting Listing Descriptions

The project must demonstrate a logical flow where buyer preferences are used to search and then augment the description of real estate listings. The augmentation should personalize the listing without changing factual information.


In [11]:
# Establish buyer preferences

questions = [   
                "How big do you want your house to be?" 
                "What are 3 most important things for you in choosing this property?", 
                "Which amenities would you like?", 
                "Which transportation options are important to you?",
                "How urban do you want your neighborhood to be?",   
            ]
answers = [
    "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
    "A quiet neighborhood, good local schools, and convenient shopping options.",
    "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
    "Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.",
    "A balance between suburban tranquility and access to urban amenities like restaurants and theaters."
]

In [14]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

llm = ChatOpenAI(model_name=model_name, temperature=0, max_tokens=2000)

history = []
for q, a in zip(questions, answers):
    history.append(AIMessage(content=q))
    history.append(HumanMessage(content=a))
history.append(AIMessage(content="Thank you for sharing your preferences. I will now summarize them."))

# summarize the user preferences
summary = llm.invoke(history).content
print(summary)

Summary of Preferences:
- A comfortable three-bedroom house with a spacious kitchen and a cozy living room.
- Amenities: Quiet neighborhood, good local schools, and convenient shopping options.
- Desired features: Backyard for gardening, two-car garage, and modern, energy-efficient heating system.
- Transportation preferences: Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.
- Urban preference: Access to a reliable bus line, proximity to a major highway, and bike-friendly roads.


In [28]:
results = vector_store.similarity_search_with_relevance_scores(summary, k=3)
print_results(results)

Rank 1: listing 0 with score 0.76
Neighborhood: Green Oaks
Price: $800,000
Bedrooms: 3
Bathrooms: 2
House Size: 2,000 sqft

Description: Welcome to this eco-friendly oasis nestled in the heart of Green Oaks. This charming 3-bedroom, 2-bathroom home boasts energy-efficient features such as solar panels and a well-insulated structure. Natural light floods the living spaces, highlighting the beautiful hardwood floors and eco-conscious finishes. The open-concept kitchen and dining area lead to a spacious backyard with a vegetable garden, perfect for the eco-conscious family. Embrace sustainable living without compromising on style in this Green Oaks gem.

Neighborhood Description: Green Oaks is a close-knit, environmentally-conscious community with access to organic grocery stores, community gardens, and bike paths. Take a stroll through the nearby Green Oaks Park or grab a cup of coffee at the cozy Green Bean Cafe. With easy access to public transportation and bike lanes, commuting is a b

### Use of LLM for Generating Personalized Descriptions

The submission must utilize an LLM to generate personalized descriptions for the real estate listings based on buyer preferences. The descriptions should be unique, appealing, and tailored to the preferences provided.

In [32]:
from langchain_core.messages import HumanMessage, SystemMessage

top_listings = "\n===========================\n".join([doc.page_content for doc, _ in results])

system_prompt = f"""
You are a real estate agent who helps users to find the right real estate listing based on their preferences. 
These are the available listings:

{top_listings}

Present the best listing to the user based on their preferences. 
Highlight the features that match their preferences. 
Only use the information provided in the listings."""

history = [
        SystemMessage(content=system_prompt),
        HumanMessage(content="My preferences are:\n" + summary + "\nWhich listing do you recommend?")
    ]

response = llm.invoke(history).content
print(response)

Based on your preferences, I recommend the first listing in Green Oaks priced at $800,000.

Here's how this listing matches your preferences:
- It is a comfortable three-bedroom house with a spacious kitchen and a cozy living room.
- The house has a backyard with a vegetable garden, perfect for gardening.
- While it doesn't mention a two-car garage, it does emphasize eco-friendly features like solar panels, which align with your desire for modern, energy-efficient systems.
- Green Oaks is described as a close-knit, quiet, and environmentally-conscious community with access to organic grocery stores and community gardens.
- The neighborhood offers easy access to public transportation and bike lanes, meeting your transportation preferences.
- The house is located in an urban area with access to public transportation and bike-friendly roads.

Overall, this listing in Green Oaks aligns well with your preferences for a comfortable home in a quiet neighborhood with convenient amenities and t