# 01: Dataset Generation

## Objective
Generate a realistic synthetic dataset for e-commerce customer churn prediction, then enrich it with AI-generated customer feedback from Google Gemini.

## Methodology
1. **Statistical Simulation**: Generate behavioral and transactional data using `src.data_preparation`.
2. **AI Enhancement**: Use an LLM (Google Gemini) to generate contextual customer feedback conditioned on each simulated behavioral profile.
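
The internals of `simulate_ecommerce_dataset` are not shown in this notebook. As a rough illustration of what step 1 could look like, here is a minimal sketch of seeded statistical simulation. The function name `simulate_customers_sketch`, every column name, and every distribution parameter below are assumptions made for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd


def simulate_customers_sketch(n_samples: int = 1800, random_seed: int = 42) -> pd.DataFrame:
    """Illustrative stand-in only; NOT the real simulate_ecommerce_dataset."""
    rng = np.random.default_rng(random_seed)

    # Hypothetical behavioral features drawn from simple distributions.
    tenure_months = rng.integers(1, 61, size=n_samples)  # 1..60 inclusive
    orders_last_90d = rng.poisson(lam=3.0, size=n_samples)
    avg_order_value = rng.gamma(shape=2.0, scale=40.0, size=n_samples).round(2)
    days_since_last_order = (
        rng.exponential(scale=30.0, size=n_samples).round().astype(int)
    )

    # Assumed logistic link: churn gets more likely with inactivity
    # and less likely with frequent orders.
    logit = -1.0 + 0.03 * days_since_last_order - 0.4 * orders_last_90d
    churn_prob = 1.0 / (1.0 + np.exp(-logit))
    churned = rng.binomial(1, churn_prob)

    return pd.DataFrame(
        {
            "customer_id": np.arange(1, n_samples + 1),
            "tenure_months": tenure_months,
            "orders_last_90d": orders_last_90d,
            "avg_order_value": avg_order_value,
            "days_since_last_order": days_since_last_order,
            "churned": churned,
        }
    )
```

Fixing the seed makes the simulated table reproducible, which matters here because the LLM-generated text in step 2 has to be matched back to the same customer rows.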

In [None]:
# 1. Setup
!pip install -q pandas numpy scikit-learn matplotlib transformers torch google-genai python-dotenv

import os
import json
import pandas as pd
from dotenv import load_dotenv

# Modular source
from src.data_preparation import simulate_ecommerce_dataset
from api_key_loader import load_api_key

# Load API Key
os.environ["GOOGLE_API_KEY"] = load_api_key()

In [None]:
# 2. Simulate Base Dataset
print("Simulating behavioral data...")
df = simulate_ecommerce_dataset(n_samples=1800, random_seed=42)
print(f"Generated {len(df)} rows.")
df.head()

## 3. AI Enhancement (LLM Feedback Generation)
We use the Gemini API to generate realistic Malaysian e-commerce customer feedback.
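
The extraction helpers are not included in this notebook. The sketch below shows one way they might look: build a per-customer prompt, pull the JSON object out of the model reply, and retry on failure. The names `build_prompt`, `extract_json`, and `generate_feedback`, the retry policy, and the `generate` callable (any function that takes a prompt string and returns the reply text) are all hypothetical.

```python
import json
import re
import time
from typing import Callable, Optional

REQUIRED_KEYS = ("customer_feedback", "support_chat_excerpt", "reason_for_low_activity")


def build_prompt(base_prompt: str, profile: dict) -> str:
    """Append one customer's profile (serialized as JSON) to the shared instructions."""
    profile_json = json.dumps(profile, ensure_ascii=False, default=str)
    return f"{base_prompt}\n\nCustomer profile:\n{profile_json}"


def extract_json(reply: str) -> Optional[dict]:
    """Return the three expected fields from a model reply, or None if unusable."""
    # Greedy match from the first '{' to the last '}', which also skips
    # any markdown fence or chatter wrapped around the object.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return None
    return {k: data[k] for k in REQUIRED_KEYS}


def generate_feedback(
    generate: Callable[[str], str],
    base_prompt: str,
    profile: dict,
    max_retries: int = 3,
    backoff_s: float = 1.0,
) -> Optional[dict]:
    """Call the model up to max_retries times until a valid JSON reply arrives."""
    prompt = build_prompt(base_prompt, profile)
    for attempt in range(max_retries):
        try:
            parsed = extract_json(generate(prompt))
        except Exception:
            parsed = None  # treat API errors like unparseable replies
        if parsed is not None:
            return parsed
        if attempt < max_retries - 1:
            time.sleep(backoff_s * (attempt + 1))  # linear backoff before retrying
    return None
```

With the `google-genai` client, `generate` could be something like `lambda p: client.models.generate_content(model=MODEL_NAME, contents=p).text`, where `MODEL_NAME` is a placeholder for whichever Gemini model is in use. Passing the call in as a function keeps the parsing and retry logic testable without network access.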

In [None]:
import re
import time

from google import genai

# genai.Client() reads the key from the GOOGLE_API_KEY environment variable set in step 1
client = genai.Client()
OUTPUT_PATH = "llm_text_generation_output.jsonl"

TEXT_PROMPT = """
You are generating realistic Malaysian e-commerce customer text.
Given the customer profile and behaviour summary, output a JSON object with keys:
- customer_feedback: 1-2 sentences (casual English with occasional Malay, realistic typos ok)
- support_chat_excerpt: 1-2 lines like a chat message (customer side)
- reason_for_low_activity: short phrase
Return ONLY valid JSON.
""".strip()

# Helper functions for extraction and merging are omitted here; they can live in this cell or in src.preprocessing.
print(f"LLM integration ready. (Data already exists in {OUTPUT_PATH} for demonstration)")

In [None]:
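# 4. Final Dataset Assembly
# Note: df holds only the simulated columns unless the LLM text from OUTPUT_PATH was merged into it beforehand.
df.to_csv("ecommerce_churn_llm_final.csv", index=False)
print("Final dataset saved to ecommerce_churn_llm_final.csv")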
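
The merging helper is not shown in this notebook. A minimal sketch of joining the generated text onto the simulated rows before saving follows. It assumes each line of the JSONL file is a JSON object with a `customer_id` plus the three text keys from the prompt, and that `df` has a matching `customer_id` column. Neither schema is confirmed by the source, and the function names are hypothetical.

```python
import json

import pandas as pd

TEXT_COLUMNS = ["customer_feedback", "support_chat_excerpt", "reason_for_low_activity"]


def load_llm_records(path: str) -> pd.DataFrame:
    """Read a JSONL file into a frame with one row per customer_id (assumed schema)."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines rather than abort the load
            if isinstance(obj, dict) and "customer_id" in obj:
                records.append(obj)
    text_df = pd.DataFrame(records, columns=["customer_id", *TEXT_COLUMNS])
    # If a customer was generated more than once, keep the most recent record.
    return text_df.drop_duplicates(subset="customer_id", keep="last")


def merge_llm_text(df: pd.DataFrame, text_df: pd.DataFrame) -> pd.DataFrame:
    """Left-join the text columns so customers without LLM output are kept (as NaN)."""
    return df.merge(text_df, on="customer_id", how="left", validate="one_to_one")
```

Under those assumptions, `df = merge_llm_text(df, load_llm_records(OUTPUT_PATH))` would run before `df.to_csv(...)`. The left join keeps every simulated customer, and `validate="one_to_one"` raises an error if `customer_id` is duplicated on either side.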