Utilisation de pydantic pour forcer le LLM à mieux respecter un format qui sera transformé grâce à des mécanismes de LangChain.

In [1]:
import os
os.environ["OPENAI_API_KEY"] = "voc-8162499801266773377505669655d3c05508.40840521"
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"

In [None]:
#from langchain_openai import OpenAI
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
import json

In [3]:
model_name="gpt-3.5-turbo-instruct"
#model_name="gpt-4o-mini"
temperature = 0.0
llm = OpenAI(
    model_name=model_name, temperature=temperature, max_tokens=3500
)

In [4]:
class PropertyAdvertClass(BaseModel):
    location: str = Field(
        description = "location in USA including the name the neighborhood"
    )
    style: str = Field(
        description = "style of construction"
    )
    rooms: int = Field(
        description = "number of rooms"
    )
    bedrooms: int = Field(
        description = "number of bedrooms"
    )
    bathrooms: int = Field(
        description = "number of bathrooms"
    )
    floors: int = Field(
        description = "number of floors"
    )
    house_size: int = Field(
        description = "surface area in square feet"
    )
    price: int = Field(
        description = "price in dollars"
    )
    property_description : str = Field(
        description = "a detailed description including its surface area in square feet, the number of rooms, bedrooms and bathrooms, the number of floors, if there are a garage and a garden, the style of construction and its price in dollars"
    )
    neighborhood_description : str = Field(
        description = "the neighborhood description"
    )

class ListOfAdvertsClass(BaseModel):
    adverts_list: list[PropertyAdvertClass]

```python
complete_advert : str = Field(
    description = "the complete detailled description of this advertisement, including the neightborhood location, style, rooms, bedrooms, bathrooms, floors, house_size, price, and property and nieghtborhood descriptions"
)

description = "the complete detailled description of this advertisement, including the neightborhood location, style, rooms, bedrooms, bathrooms, floors, house_size, price, and property and nieghtborhood descriptions"

```

In [5]:
parser = PydanticOutputParser(pydantic_object=ListOfAdvertsClass)
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"$defs": {"PropertyAdvertClass": {"properties": {"location": {"description": "location in USA including the name the neighborhood", "title": "Location", "type": "string"}, "style": {"description": "style of construction", "title": "Style", "type": "string"}, "rooms": {"description": "number of rooms", "title": "Rooms", "type": "integer"}, "bedrooms": {"description": "number of bedrooms", "title": "Bedrooms", "type": "integer"}, "bathrooms": {"description": "number of bathrooms", "title": "Bathrooms", "type": "integer"}, "floors": {"descriptio

In [6]:
gen_prompt = PromptTemplate(
    template="{question}.{context}\n{format_instructions}",
    input_variables=["question", "context"],
    partial_variables={"format_instructions": parser.get_format_instructions},
)

In [None]:
# each information and descriptions of an advertisement must be repeated in a new complete description set in the 'complete_advert' property.

requests_data = [
    {
        "num_ads": 5,
        "context": "the following is a list of properties for sale in the west coast of USA.",
    },
    {
        "num_ads": 5,
        "context": "the following is a list of properties for sale in the midlle west of USA.",
    },
    {
        "num_ads": 5,
        "context": "the following is a list of properties for sale in the midlle east of USA.",
    },
    {
        "num_ads": 5,
        "context": "the following is a list of properties for sale in the east coast of USA.",
    },
]

generated_adverts = []
for request in requests_data:
    num_ads = request["num_ads"]
    context = request["context"]
    
    adverts_query = f"""
        generate {num_ads} real estate advertisements for middle-class buyers, each respecting the output schema, and all gathered in a unique array. be creative in your descriptions but consistent and realistic.
    """

    prompt = gen_prompt.format(question=adverts_query, context=context)

    generated_adverts = llm.invoke(prompt)
    print(generated_adverts)

    adverts_query = f"""
        generate {num_ads} real estate advertisements for middle-class buyers, each respecting the output schema, and all gathered in a unique array. be creative in your descriptions but consistent and realistic.
    """
    context_query = f"the following is a list of properties for sale in {context}."

    prompt = gen_prompt.format(question=adverts_query, context=context_query)
    #print(prompt)

    generated_adverts = llm.invoke(prompt)
    #print(generated_adverts)


    generate 5 real estate advertisements for middle-class buyers, each respecting the output schema, and all gathered in a unique array. be creative in your descriptions but consistent and realistic.
.the following is a list of properties for sale in the west of USA.
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"$defs": {"PropertyAdvertClass": {"properties": {"location": {"description": "location in USA including the name the neighborhood", "title": "Location", "type": "string"}, "style": {"description": "style of construction", "title": "Style", "type": "string"}, "rooms": {"descript

In [8]:
generated_adverts = parser.parse(generated_adverts)

In [9]:
print(">", generated_adverts.adverts_list[0].property_description, end="\n\n")
print(">", generated_adverts.adverts_list[0].neighborhood_description, end="\n\n")

> This beautiful modern home is located in the heart of Los Angeles. It features 3 bedrooms, 2 bathrooms, and a spacious living area. The house has 2 floors and a total size of 2000 square feet. It also has a garage and a small garden. The price for this property is $500,000.

> The neighborhood is known for its vibrant culture and diverse community. It is also conveniently located near popular restaurants, shopping centers, and entertainment venues.



In [None]:
filename = "generated_adverts_b.jsonl"
with open(filename, "w") as save_file:
    for advert in generated_adverts.adverts_list:
        json.dump(advert.model_dump(mode="json"), save_file)
        save_file.write('\n')
save_file.close()

In [11]:
with open(filename, "r") as file:
    for line in file:
        data_entry = json.loads(line)
        # Process each data_entry as a Python dict
        print(data_entry)

{'location': 'Los Angeles, California', 'style': 'Modern', 'rooms': 6, 'bedrooms': 3, 'bathrooms': 2, 'floors': 2, 'house_size': 2000, 'price': 500000, 'property_description': 'This beautiful modern home is located in the heart of Los Angeles. It features 3 bedrooms, 2 bathrooms, and a spacious living area. The house has 2 floors and a total size of 2000 square feet. It also has a garage and a small garden. The price for this property is $500,000.', 'neighborhood_description': 'The neighborhood is known for its vibrant culture and diverse community. It is also conveniently located near popular restaurants, shopping centers, and entertainment venues.'}
{'location': 'Phoenix, Arizona', 'style': 'Ranch', 'rooms': 5, 'bedrooms': 4, 'bathrooms': 3, 'floors': 1, 'house_size': 3000, 'price': 400000, 'property_description': 'This charming ranch-style home is located in Phoenix, Arizona. It has 4 bedrooms, 3 bathrooms, and a total of 3000 square feet. The house has a single floor and a spacious

In [12]:
vector_store_directory = "./chroma_langchain_db"

In [13]:
!rm -rf {vector_store_directory}

In [14]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vector_store = Chroma(
    collection_name="real_estate",
    embedding_function=embeddings,
    persist_directory=vector_store_directory,  # Where to save data locally, remove if not necessary
)

In [15]:
from uuid import uuid4
from langchain_core.documents import Document

documents = []
for i, advert in enumerate(generated_adverts.adverts_list, start=1):
    metadata = {}
    metadata["source"] = model_name
    metadata["id"] = i
    metadata["location"] = advert.location
    metadata["style"] = advert.style
    metadata["rooms"] = advert.rooms
    metadata["bedrooms"] = advert.bedrooms
    metadata["bathrooms"] = advert.bathrooms
    metadata["floors"] = advert.floors
    metadata["house_size"] = advert.house_size
    metadata["price"] = advert.price

    page_content = advert.property_description + advert.neighborhood_description
    documents.append(
        Document(page_content=page_content, metadata=metadata)
        )
    
uuids = [str(uuid4()) for _ in range(len(documents))]
vector_store.add_documents(documents=documents, ids=uuids)

['14a17187-b613-4a0d-af49-79e5579499e4',
 'edb652ed-5dc1-4d55-af9e-223c1faff3ee',
 'bb6c7f39-572a-447f-99a3-14b1f6dd6231',
 '1af81d2d-14eb-4264-a373-62e5aeddf6a9',
 'f3be6216-b7e8-4c65-a4a8-85ecfb219ef4']

In [16]:
request = "I'm looking for a modern-style house with a sea view for a large family (at least 3 bedrooms)"

In [17]:
results = vector_store.similarity_search(
    request,
    k=2,
    filter={"source": "generated_adverts"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

In [18]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=2, filter={"source": "generated_adverts"}
)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

In [19]:
results = vector_store.similarity_search_by_vector(
    embedding=embeddings.embed_query(request), k=2
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

* This beautiful modern home is located in the heart of Los Angeles. It features 3 bedrooms, 2 bathrooms, and a spacious living area. The house has 2 floors and a total size of 2000 square feet. It also has a garage and a small garden. The price for this property is $500,000.The neighborhood is known for its vibrant culture and diverse community. It is also conveniently located near popular restaurants, shopping centers, and entertainment venues. [{'house_size': 2000, 'bedrooms': 3, 'price': 500000, 'style': 'Modern', 'bathrooms': 2, 'floors': 2, 'id': 1, 'rooms': 6, 'location': 'Los Angeles, California', 'source': 'gpt-3.5-turbo-instruct'}]
* This luxurious Mediterranean home is located in the bustling city of Las Vegas, Nevada. It has 6 bedrooms, 5 bathrooms, and a total of 5000 square feet. The house has 2 floors and a spacious backyard with a pool. The price for this property is $800,000.The neighborhood is known for its vibrant nightlife and entertainment options. It is also close