<a href="https://colab.research.google.com/github/pris25123/synthetic-data-generation/blob/main/colabfile.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Links to the Dataset and Fine-Tuned Model on Hugging Face
1. [Dataset](https://huggingface.co/datasets/foreseeitwithme/real-estate-qa-synthetic)

2. [Fine-tuned model](https://huggingface.co/foreseeitwithme/real-estate-qa-synthetic)

## Synthetic Data Generation and LLM Fine-tuning

### Overview
1. Create a synthetic dataset for a use case of your choice
2. Fine-tune a small LLM using this dataset
3. Evaluate the model performance before and after fine-tuning



I wanted to create a real estate Q&A system that helps people get quick, accurate answers about property buying, selling, and legal processes. Since real estate data is often limited or scattered, I generated my own synthetic dataset and fine-tuned a small language model (Qwen1.5-0.5B) to understand and respond well to these specific questions. This makes it easier for users, whether buyers, sellers, or agents, to get reliable information quickly without digging through complicated documents.

## 2. Environment Setup

In this section, we'll install all the necessary dependencies for our project. This includes libraries for:
- Data processing and manipulation
- LLM access and fine-tuning
- Evaluation metrics
- Hugging Face integration for dataset upload and model download

Run the cell below to set up your environment.

In [None]:
# Install necessary dependencies
!pip install -q transformers datasets evaluate peft bitsandbytes accelerate
!pip install -q huggingface_hub
!pip install -q trl
!pip install -q nltk rouge-score sacrebleu

# Optional: For specific use cases
# !pip install -q sentencepiece tokenizers
# !pip install -q gradio # For demo creation

# Login to Hugging Face (you'll need a token)
from huggingface_hub import login
# Uncomment the line below and add your token when ready to upload datasets
# login()

# Verify installations
import transformers
import datasets
import peft

print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
print(f"PEFT version: {peft.__version__}")

# Check available GPU
!nvidia-smi
# ideally a T4 or A100 GPU

## 3. Synthetic Data Generation

In this section, we'll generate a synthetic dataset for our selected use case. The process involves:

1. Defining the data structure and schema
2. Setting up data generation techniques (LLM prompting, rules-based generation, etc.)
3. Creating the dataset
4. Validating data quality
5. Uploading to Hugging Face Datasets

Some libraries you can use for data generation:
- https://github.com/meta-llama/synthetic-data-kit
- https://github.com/argilla-io/distilabel
- https://github.com/argilla-io/synthetic-data-generator

For LLMs, you can use a local model, free APIs such as [Groq](https://groq.com/), or anything else you can find.
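Step 4, validating data quality, can be sketched with a small check over the generated records. The helper name `validate_qa_records` and the sample rows below are hypothetical and only illustrate the idea; the field names (`id`, `question`, `answer`) match the records produced later in this notebook.

```python
from collections import Counter

def validate_qa_records(records, required_keys=("id", "question", "answer")):
    """Return a small quality report for a list of QA dicts (hypothetical helper)."""
    # Records that lack one of the expected fields
    missing = [r for r in records if any(k not in r for k in required_keys)]
    # Records whose question or answer is blank
    empty = [
        r for r in records
        if not str(r.get("question", "")).strip() or not str(r.get("answer", "")).strip()
    ]
    # Case-insensitive duplicate detection on the question text
    question_counts = Counter(str(r.get("question", "")).strip().lower() for r in records)
    duplicates = sum(count - 1 for count in question_counts.values() if count > 1)
    return {
        "total": len(records),
        "missing_fields": len(missing),
        "empty_text": len(empty),
        "unique_questions": len(question_counts),
        "duplicate_questions": duplicates,
    }

# Illustrative rows only, not taken from the real dataset
sample = [
    {"id": "0000", "question": "What is a sale deed?", "answer": "The final transfer document."},
    {"id": "0001", "question": "what is a sale deed?", "answer": "The final transfer document."},
    {"id": "0002", "question": "What is RERA?", "answer": ""},
]
report = validate_qa_records(sample)
print(report)
```

Running the same function on `qa_data` after generation would show how many of the sampled questions are repeats before the dataset is uploaded.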

In [None]:
import random
import pandas as pd

# Question templates
questions_templates = [
    "What documents are required for {scenario}?",
    "How do I verify the authenticity of a {entity}?",
    "What is the average price of a {property_type} in {location}?",
    "Are there any legal considerations when buying {property_type} in {location}?",
    "What is the difference between a {term1} and a {term2}?"
]

# Options to fill the templates
scenarios = [
    "buying a flat in Bangalore",
    "registering a new property",
    "applying for a home loan",
    "transferring property ownership",
    "selling a commercial property"
]

entities = ["property", "title deed", "builder", "property tax receipt"]
property_types = ["apartment", "villa", "plot", "commercial space"]
locations = ["Bangalore", "Mumbai", "Delhi", "Hyderabad", "Chennai"]
terms = [("carpet area", "built-up area"), ("freehold", "leasehold"), ("agreement", "sale deed")]

def generate_synthetic_answer(question):
    q = question.lower()

    if "documents" in q:
        if "selling" in q:
            return ("You need the sale deed, property ownership papers, government-issued ID proof, "
                    "tax receipts, and a no-objection certificate if applicable. Make sure all documents "
                    "are original and verified.")
        elif "buying" in q or "transferring" in q:
            return ("Essential documents include property registration papers, identity proof (Aadhar, PAN), "
                    "sale deed, encumbrance certificate, and latest tax receipts.")
        elif "registering" in q:
            return ("You will need the title deed, latest property tax receipts, a government-issued ID, "
                    "and proof of possession for registration.")
        else:
            return ("Relevant documents typically include sale deed, ownership papers, tax receipts, "
                    "and identity proof such as Aadhar or passport.")

    elif "verify the authenticity" in q:
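        # Check longer entity names first so "property tax receipt" is not
        # mistaken for the shorter substring "property".
        entity = None
        for e in sorted(entities, key=len, reverse=True):
            if e in q:
                entity = e
                break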
        if entity == "builder":
            return ("Check the builder’s RERA registration number, completed project details, "
                    "customer reviews, and official approvals from local authorities.")
        elif entity == "title deed":
            return ("Cross-check the title deed with local land records office or registrar's database "
                    "to confirm ownership and absence of liens.")
        elif entity == "property":
            return ("Verify ownership through land registry records, confirm no pending dues, and "
                    "consult a legal expert if needed.")
        elif entity == "property tax receipt":
            return ("Ensure the receipt matches the official municipal records, and check for the latest payment date.")
        else:
            return ("Verify with government records, cross-check all details, and seek legal advice if uncertain.")

    elif "average price" in q:
        city_prices = {
            "bangalore": "₹70 lakhs to ₹1.2 crore, varying by locality and amenities",
            "mumbai": "₹1.2 crore to ₹2.5 crore depending on the neighborhood and building quality",
            "delhi": "₹80 lakhs to ₹1.5 crore depending on area and furnishing",
            "hyderabad": "₹60 lakhs to ₹1 crore based on locality and builder reputation",
            "chennai": "₹65 lakhs to ₹1.1 crore, influenced by proximity to city center and schools"
        }
        for city in city_prices:
            if city in q:
                return (f"The average price ranges between {city_prices[city]}. "
                        "Prices can fluctuate based on amenities, builder, and market conditions.")
        return ("Average prices depend on city, locality, property type, and current market trends.")

    elif "legal considerations" in q:
        return ("Ensure clear and marketable title, check encumbrance certificate for liens, verify property tax status, "
                "and confirm there are no pending legal disputes before purchasing property.")

    elif "difference between" in q:
        if "freehold" in q and "leasehold" in q:
            return ("Freehold ownership means you own the property and the land indefinitely, "
                    "while leasehold means you have rights to use the property for a fixed period, "
                    "after which ownership reverts to the landlord.")
        elif "carpet area" in q and "built-up area" in q:
            return ("Carpet area refers to the actual usable floor area within the walls of the property, "
                    "whereas built-up area includes the carpet area plus the thickness of walls, balconies, "
                    "and common areas proportionately.")
        elif "agreement" in q and "sale deed" in q:
            return ("An agreement to sell is a preliminary contract where the seller agrees to sell the property, "
                    "while the sale deed is the final legal document that transfers ownership from seller to buyer.")
        else:
            return ("These terms have distinct legal definitions and implications; please refer to a legal expert for details.")

    return ("This is a general real estate related answer, covering common queries and advice "
            "for buyers and sellers.")

def generate_qa_pairs(n=200):
    data = []
    for i in range(n):
        template = random.choice(questions_templates)
        # Pick one term pair so term1 and term2 always belong together;
        # two independent choices could pair "carpet area" with "leasehold".
        term1, term2 = random.choice(terms)
        question = template.format(
            scenario=random.choice(scenarios),
            entity=random.choice(entities),
            property_type=random.choice(property_types),
            location=random.choice(locations),
            term1=term1,
            term2=term2
        )
        answer = generate_synthetic_answer(question)
        data.append({"id": str(i).zfill(4), "question": question, "answer": answer})
    return data


qa_data = generate_qa_pairs(200)
df = pd.DataFrame(qa_data)
json_path = "real_estate_qa_improved.json"
df.to_json(json_path, orient="records", indent=2)

print(f"Synthetic dataset saved to {json_path}")


In [None]:
from huggingface_hub import login


login()


In [None]:
from datasets import Dataset
import pandas as pd

# Load the synthetic dataset
df = pd.read_json("real_estate_qa_improved.json")

# Create Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Push to HF Hub
dataset.push_to_hub("foreseeitwithme/real-estate-qa-synthetic", private=False)

## 4. Model Fine-tuning

Now that we have our synthetic dataset, let's fine-tune a small LLM using PEFT/LoRA techniques. This approach allows us to efficiently adapt the pre-trained model to our specific task without excessive computational requirements.

We'll:
1. Load the pre-trained model
2. Prepare the dataset in the correct format
3. Configure LoRA adapters
4. Fine-tune the model
5. Save the fine-tuned model

This section uses Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to update only a small number of parameters, making it suitable for running on Colab's T4 GPU.
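To make "only a small number of parameters" concrete, here is a rough back-of-the-envelope count. LoRA adds two low-rank matrices to each targeted weight: `A` with shape `r × d_in` and `B` with shape `d_out × r`, so each adapted module contributes `r * (d_in + d_out)` trainable parameters. The hidden size (1024) and layer count (24) below are assumptions about the Qwen1.5-0.5B configuration; check `base_model.config` for the actual values, and treat `model.print_trainable_parameters()` as the authoritative figure.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A (r x d_in) + B (d_out x r)."""
    return r * (d_in + d_out)

HIDDEN_SIZE = 1024      # assumed hidden size of Qwen1.5-0.5B
NUM_LAYERS = 24         # assumed number of transformer layers
RANK = 8                # r=8, as in the LoRA config used in this notebook
TARGETS_PER_LAYER = 2   # q_proj and v_proj, both assumed to be HIDDEN_SIZE x HIDDEN_SIZE

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total_lora = per_module * TARGETS_PER_LAYER * NUM_LAYERS
full_module = HIDDEN_SIZE * HIDDEN_SIZE  # parameters in one full projection matrix

print(f"LoRA parameters per module: {per_module:,}")       # 16,384
print(f"LoRA parameters in total:   {total_lora:,}")       # 786,432
print(f"Fraction of one full projection: {per_module / full_module:.2%}")
```

Under these assumptions, each adapter is a small fraction of the matrix it modifies, which is why the approach fits on a single T4.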

In [None]:
!pip install -q transformers datasets peft accelerate bitsandbytes

In [None]:
from datasets import load_dataset

dataset = load_dataset("foreseeitwithme/real-estate-qa-synthetic")
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-0.5B"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

# Load model
base_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Preprocess dataset
def preprocess(example):
    # A causal LM learns from a single sequence, so the answer must be part of
    # input_ids. DataCollatorForLanguageModeling(mlm=False) rebuilds labels from
    # input_ids, so a separately tokenized answer would be discarded.
    text = f"Question: {example['question']}\nAnswer: {example['answer']}{tokenizer.eos_token}"
    tokens = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(preprocess, remove_columns=dataset["train"].column_names)

# Print relevant modules for LoRA targeting
def print_relevant_modules(model):
    print("Relevant modules in base model:\n")
    for name, module in model.named_modules():
        if any(x in name.lower() for x in ["attn", "mlp", "ffn", "q_proj", "v_proj", "k_proj", "out_proj", "proj"]):
            print(name)

print_relevant_modules(base_model)


In [None]:
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj"     # MLP
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)



In [None]:
from peft import get_peft_model, LoraConfig, TaskType

# Define LoRA config
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "v_proj"]  # Adjust this based on print_relevant_modules output
)

# Apply LoRA
model = get_peft_model(base_model, peft_config)

# Print trainable params
model.print_trainable_parameters()


In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import torch

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    logging_dir="./logs",
    logging_steps=10,
    num_train_epochs=10,
    save_strategy="epoch",
    report_to="none",
    fp16=torch.cuda.is_available()
)

# Define data collator for causal LM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    args=training_args,
    data_collator=data_collator,
)

# Start training
trainer.train()


## 5. Model Evaluation

Now that we have fine-tuned our model, let's evaluate its performance by comparing it with the base model. We'll assess how well our synthetic data helped improve the model's abilities on our target task.

We'll:
1. Load both the base and fine-tuned models
2. Define appropriate evaluation metrics
3. Perform inference on test examples
4. Compare and analyze the results
5. Visualize performance differences
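For step 2, a lightweight token-overlap F1 (in the spirit of the SQuAD answer metric) can complement BERTScore when comparing the base and fine-tuned models. This is a minimal sketch rather than the metric reported in this notebook; the function names and the example strings are hypothetical.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_token_f1(predictions, references):
    """Average token F1 over paired predictions and references."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical outputs, for illustration only
references = ["the sale deed transfers ownership"]
base_preds = ["it depends on the city"]
tuned_preds = ["sale deed transfers ownership"]

print("base :", round(mean_token_f1(base_preds, references), 4))
print("tuned:", round(mean_token_f1(tuned_preds, references), 4))
```

Calling `mean_token_f1` once with the base model's predictions and once with the fine-tuned model's predictions on the same test questions gives a direct before/after number.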

In [None]:
!pip install -q evaluate
import evaluate
import numpy as np

In [None]:
metric = evaluate.load("squad")

In [None]:
print(dataset)

In [None]:
from datasets import load_dataset

# Reload and split the dataset
dataset = load_dataset("foreseeitwithme/real-estate-qa-synthetic")["train"]
dataset = dataset.train_test_split(test_size=0.1, seed=42)

In [None]:
tokenized_dataset = dataset.map(preprocess, remove_columns=dataset["train"].column_names)

In [None]:
model.save_pretrained("qwen1.5b-realestate-lora")
tokenizer.save_pretrained("qwen1.5b-realestate-lora")

In [None]:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("qwen1.5b-realestate-lora", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "qwen1.5b-realestate-lora")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/content/qwen1.5b-realestate-lora"
repo_name = "foreseeitwithme/real-estate-qa-synthetic"

# Load model and tokenizer from local directory
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Push model and tokenizer to Hugging Face Hub
model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)

print(f"Model and tokenizer pushed to https://huggingface.co/{repo_name}")


In [None]:
def preprocess(example):
    prompt = f"""Below is an instruction that describes a task. Write a short response that appropriately completes the request using only real estate terms with examples.

### Instruction:
You are a helpful real estate assistant providing clear and accurate answers.
Answer the following real estate question helpfully and with accurate numbers.

### Question:
{example['question']}

### Response:
{example['answer']}"""

    input_ids = tokenizer(prompt, truncation=True, padding="max_length", max_length=512)
    input_ids["labels"] = input_ids["input_ids"].copy()
    return input_ids

tokenized_dataset = dataset.map(preprocess, remove_columns=dataset["train"].column_names)

In [None]:
import re  # needed by extract_first_sentence below

def normalize_text(s):
    return s.lower().strip()

def extract_first_sentence(text):
    return re.split(r'(?<=[.!?])\s', text.strip())[0]

def extract_answer_from_output(text, question):
    norm_text = text.lower().strip()
    norm_question = question.lower().strip()
    if norm_text.startswith(norm_question):
        answer = norm_text[len(norm_question):].strip()
    elif "answer:" in norm_text:
        answer = norm_text.split("answer:")[-1].strip()
    else:
        answer = norm_text
    return extract_first_sentence(answer)

In [None]:
def generate_predictions(model, tokenizer, dataset, max_samples=50, device=None):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    inputs = dataset.select(range(min(max_samples, len(dataset))))
    prompts = [
        f"Question: {ex['question']}\nProvide a specific answer using only real estate keywords and single sentence.\nAnswer:"
        for ex in inputs
    ]
    tokenized = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = model.generate(
            **tokenized,
            max_new_tokens=100,
            num_beams=5,
            early_stopping=True,
            do_sample=True,
            temperature=0.8,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    preds_clean = [normalize_text(extract_answer_from_output(d, prompts[i])) for i, d in enumerate(decoded)]
    refs = [normalize_text(ex["answer"]) for ex in inputs]
    return preds_clean, refs, inputs

In [None]:
!pip install -q gradio

In [None]:
import gradio as gr

def answer_question(user_question):
    input_prompt = f"""You are a real estate expert. Provide a short, clear answer in 1-2 sentences.

### Question:
{user_question}

### Answer:"""
    inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=150)
    # Decode only the newly generated tokens so the prompt is not echoed back
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

gr.Interface(fn=answer_question, inputs="text", outputs="text", title="Real Estate Q&A").launch()


In [None]:
!pip install bert-score

In [None]:
import evaluate
import re
from bert_score import score


preds_clean, refs, inputs = generate_predictions(model, tokenizer, dataset["test"], max_samples=50)


P, R, F1 = score(preds_clean, refs, lang="en", verbose=True)


avg_precision = P.mean().item()
avg_recall = R.mean().item()
avg_f1 = F1.mean().item()

print("BERTScore:")
print(f"Precision: {avg_precision:.4f}")
print(f"Recall:    {avg_recall:.4f}")
print(f"F1 Score:  {avg_f1:.4f}")

# Print some examples
for i in range(5):
    print(f"\nQ: {inputs[i]['question']}")
    print(f"Predicted Answer: {preds_clean[i]}")
    print(f"Reference Answer: {refs[i]}")


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


bert_results = {
    "Precision": 0.8549,
    "Recall": 0.8638,
    "F1 Score": 0.8591
}


def plot_bert_scores(scores_dict):
    sns.set(style="whitegrid")
    plt.figure(figsize=(8, 5))

    metrics = list(scores_dict.keys())
    values = list(scores_dict.values())

    ax = sns.barplot(x=metrics, y=values, palette="viridis")
    plt.ylim(0, 1)
    plt.title("BERTScore Evaluation")
    for i, v in enumerate(values):
        ax.text(i, v + 0.01, f"{v:.4f}", ha='center', fontweight='bold')

    plt.ylabel("Score")
    plt.show()

plot_bert_scores(bert_results)


## 6. Final Thoughts and Project Analysis


### Project Report

---

#### **Project Summary**

**Use Case:**
I built a Q\&A assistant focused on the **Indian real estate domain**, which frequently deals with queries about property documentation, legal terminology, and pricing. This use case was chosen because of its repetitive structure and the need for domain-specific accuracy — making it ideal for fine-tuning a language model.

**Synthetic Dataset:**
I first generated 200 question-answer pairs using templated questions such as:

* “What documents are required for {scenario}?”
* “What is the difference between a {term1} and a {term2}?”
* “What is the average price of a {property\_type} in {location}?”

These were filled with combinations of common Indian real estate scenarios and legal terms (like “sale deed” vs “agreement”, “freehold” vs “leasehold”). The answers were designed to simulate domain expertise, using structured, informative text.
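Because every question comes from one of five templates filled from short option lists, the number of distinct questions is bounded. The sketch below enumerates that space using the same lists as the generation cell. It assumes each term pair is kept together (for example "freehold" always paired with "leasehold"); if the two terms are sampled independently, the count is higher.

```python
from itertools import product

# Option lists copied from the generation cell
scenarios = [
    "buying a flat in Bangalore",
    "registering a new property",
    "applying for a home loan",
    "transferring property ownership",
    "selling a commercial property",
]
entities = ["property", "title deed", "builder", "property tax receipt"]
property_types = ["apartment", "villa", "plot", "commercial space"]
locations = ["Bangalore", "Mumbai", "Delhi", "Hyderabad", "Chennai"]
terms = [("carpet area", "built-up area"), ("freehold", "leasehold"), ("agreement", "sale deed")]

unique_questions = set()
for scenario in scenarios:
    unique_questions.add(f"What documents are required for {scenario}?")
for entity in entities:
    unique_questions.add(f"How do I verify the authenticity of a {entity}?")
for property_type, location in product(property_types, locations):
    unique_questions.add(f"What is the average price of a {property_type} in {location}?")
    unique_questions.add(
        f"Are there any legal considerations when buying {property_type} in {location}?"
    )
for term1, term2 in terms:
    unique_questions.add(f"What is the difference between a {term1} and a {term2}?")

print(len(unique_questions))  # 5 + 4 + 20 + 20 + 3 = 52
```

With only 52 distinct questions under this assumption, 200 sampled rows necessarily contain repeats, so a random train/test split is likely to place the same question on both sides. Deduplicating before splitting is one way to get a cleaner held-out set.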

**Model Fine-Tuned:**
I fine-tuned **Qwen/Qwen1.5-0.5B**, a 0.5B-parameter model, using **LoRA** (Low-Rank Adaptation) through PEFT, which enabled efficient training with minimal GPU memory. Training was configured for 10 epochs (`num_train_epochs=10`) using the Hugging Face Transformers + PEFT ecosystem.

**Evaluation Metric:**
I used **BERTScore** to measure the semantic similarity between generated and reference answers on up to 50 held-out test examples. The fine-tuned model achieved:

* **Precision:** 0.8549
* **Recall:** 0.8638
* **F1 Score:** 0.8591

---

#### **Analysis of Results**

**Did Fine-Tuning Improve Performance?**
Yes, based on a qualitative comparison of outputs. The fine-tuned model became more consistent and domain-aware: the base model struggled with specific legal distinctions, while the fine-tuned model correctly differentiated concepts like “agreement vs sale deed” or “freehold vs leasehold”.

**Where Was Improvement More Noticeable?**

* **Legal differentiation** questions showed the most gain.
* **Document requirements** became clearer and more structured.
* **Average pricing** answers became more aligned with the per-city price ranges in the synthetic dataset.

**Limitations Observed:**

* Responses still lack nuance in very context-specific questions (e.g., based on sub-localities).
* Some answers are overgeneralized due to the templated nature of the dataset.
* Lack of reasoning or multi-step inference in a few queries.

---

#### **Improvement Ideas**

* Use more **diverse templates** and add real user queries from forums for better generalization.
* Generate answers using **LLMs + domain expert review** to simulate more realistic variability.
* Explore **instruction tuning** or **RAG (Retrieval-Augmented Generation)** approaches for more grounded answers.
* With more compute, fine-tune a larger model (1.3B–3B) and increase dataset size to 5k+ QA pairs.

---

#### **Learning Outcomes**

* Learned how **synthetic data**, when thoughtfully designed, can significantly improve model performance in a narrow domain.
* Understood the efficiency of **LoRA + PEFT** for fine-tuning language models with limited hardware.
* Surprised by how well a 0.5B model performed after fine-tuning — producing domain-specific, coherent answers.
* Gained experience in evaluating models with **semantic similarity metrics** like BERTScore, which are more suitable than traditional token-based metrics for open-ended QA.



## 7. References

1. [Hugging Face PEFT Documentation](https://huggingface.co/docs/peft/index)
2. [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
3. [Parameter-Efficient Fine-Tuning Methods](https://huggingface.co/blog/peft)
4. [Synthetic Data Generation Techniques](https://arxiv.org/abs/2111.02739)
5. [Evaluating Large Language Models](https://arxiv.org/abs/2307.03109)

### Video tutorials that also helped me complete the assignment
1. [Synthetic data generation](https://www.youtube.com/watch?v=iogrvDu5K0k)
2. [Fine-tuning an LLM](https://www.youtube.com/watch?v=kWooqJKJO7k)