# 🚗 Multilingual Car Problem Dataset Generator

## 🎯 Use Case

This notebook generates a **realistic multilingual dataset** for training and testing **RAG (Retrieval-Augmented Generation)** systems that need to handle automotive technical support queries across different languages and writing styles.

### 💡 Why This Dataset?

Real-world technical support systems face challenges like:
- 🌍 **Multilingual queries**: Users speak different languages
- ✍️ **Inconsistent writing**: Varying grammar quality and terminology
- 🔀 **Mixed language data**: Brand names may stay in English while descriptions are in local languages
- 📱 **Natural variations**: Same problem described in multiple ways

This dataset simulates these real-world conditions to help build robust RAG systems that can:
- Understand queries in 7 different languages
- Handle imperfect grammar and colloquial expressions
- Match semantic meaning across language barriers
- Retrieve relevant solutions regardless of language mixing

### 📊 Generated Data Structure

```mermaid
graph TB
    A[🏭 Car Brands<br/>5 Manufacturers] --> B[🚙 Models<br/>2 per Brand]
    B --> C[⚠️ Problem Types<br/>3 Common Issues]
    C --> D[🌐 Languages<br/>7 Languages]
    D --> E[💬 Variations<br/>4 per Problem]
    E --> F[📝 Final Dataset<br/>60 Records]
    
    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff
    style B fill:#F5A623,stroke:#C87D0E,stroke-width:3px,color:#fff
    style C fill:#E74C3C,stroke:#C0392B,stroke-width:3px,color:#fff
    style D fill:#27AE60,stroke:#1E8449,stroke-width:3px,color:#fff
    style E fill:#9B59B6,stroke:#7D3C98,stroke-width:3px,color:#fff
    style F fill:#F39C12,stroke:#D68910,stroke-width:3px,color:#fff
```

### 🗣️ Supported Languages

🇬🇧 English | 🇫🇷 French | 🇪🇸 Spanish | 🇯🇵 Japanese | 🇨🇳 Chinese (Simplified) | 🇬🇷 Greek | 🇮🇱 Hebrew

In [1]:
import pandas as pd
import random
from typing import List, Dict, Tuple
import uuid

# Define car brands and models with translations
car_data = {
    "Toyota": ["Corolla", "Camry"],
    "Honda": ["Civic", "Accord"],
    "Ford": ["Focus", "Fusion"],
    "Volkswagen": ["Golf", "Passat"],
    "Nissan": ["Altima", "Sentra"]
}

# Translations for car brands (keeping original if no translation exists)
brand_translations = {
    "Toyota": {
        "en": "Toyota", "fr": "Toyota", "es": "Toyota", 
        "ja": "トヨタ", "zh": "丰田", "el": "Toyota", "he": "טויוטה"
    },
    "Honda": {
        "en": "Honda", "fr": "Honda", "es": "Honda",
        "ja": "ホンダ", "zh": "本田", "el": "Honda", "he": "הונדה"
    },
    "Ford": {
        "en": "Ford", "fr": "Ford", "es": "Ford",
        "ja": "フォード", "zh": "福特", "el": "Ford", "he": "פורד"
    },
    "Volkswagen": {
        "en": "Volkswagen", "fr": "Volkswagen", "es": "Volkswagen",
        "ja": "フォルクスワーゲン", "zh": "大众", "el": "Volkswagen", "he": "פולקסווגן"
    },
    "Nissan": {
        "en": "Nissan", "fr": "Nissan", "es": "Nissan",
        "ja": "日産", "zh": "日产", "el": "Nissan", "he": "ניסאן"
    }
}

# Translations for car models (keeping original for most, transliterating for some)
model_translations = {
    "Corolla": {
        "en": "Corolla", "fr": "Corolla", "es": "Corolla",
        "ja": "カローラ", "zh": "卡罗拉", "el": "Corolla", "he": "קורולה"
    },
    "Camry": {
        "en": "Camry", "fr": "Camry", "es": "Camry",
        "ja": "カムリ", "zh": "凯美瑞", "el": "Camry", "he": "קאמרי"
    },
    "Civic": {
        "en": "Civic", "fr": "Civic", "es": "Civic",
        "ja": "シビック", "zh": "思域", "el": "Civic", "he": "סיוויק"
    },
    "Accord": {
        "en": "Accord", "fr": "Accord", "es": "Accord",
        "ja": "アコード", "zh": "雅阁", "el": "Accord", "he": "אקורד"
    },
    "Focus": {
        "en": "Focus", "fr": "Focus", "es": "Focus",
        "ja": "フォーカス", "zh": "福克斯", "el": "Focus", "he": "פוקוס"
    },
    "Fusion": {
        "en": "Fusion", "fr": "Fusion", "es": "Fusion",
        "ja": "フュージョン", "zh": "蒙迪欧", "el": "Fusion", "he": "פיוז'ן"
    },
    "Golf": {
        "en": "Golf", "fr": "Golf", "es": "Golf",
        "ja": "ゴルフ", "zh": "高尔夫", "el": "Golf", "he": "גולף"
    },
    "Passat": {
        "en": "Passat", "fr": "Passat", "es": "Passat",
        "ja": "パサート", "zh": "帕萨特", "el": "Passat", "he": "פאסאט"
    },
    "Altima": {
        "en": "Altima", "fr": "Altima", "es": "Altima",
        "ja": "アルティマ", "zh": "天籁", "el": "Altima", "he": "אלטימה"
    },
    "Sentra": {
        "en": "Sentra", "fr": "Sentra", "es": "Sentra",
        "ja": "セントラ", "zh": "轩逸", "el": "Sentra", "he": "סנטרה"
    }
}

# Define 3 common problems with translations
problem_types = {
    "engine_overheating": {
        "en": "Engine Overheating", "fr": "Surchauffe moteur", "es": "Sobrecalentamiento motor",
        "ja": "エンジンオーバーヒート", "zh": "发动机过热", "el": "Υπερθέρμανση κινητήρα", "he": "התחממות יתר מנוע"
    },
    "brake_noise": {
        "en": "Brake Noise", "fr": "Bruit de frein", "es": "Ruido de frenos",
        "ja": "ブレーキノイズ", "zh": "刹车噪音", "el": "Θόρυβος φρένων", "he": "רעש בלמים"
    },
    "battery_drain": {
        "en": "Battery Drain", "fr": "Décharge batterie", "es": "Descarga batería",
        "ja": "バッテリー消耗", "zh": "电池耗电", "el": "Εκφόρτιση μπαταρίας", "he": "ריקון סוללה"
    }
}

problems = list(problem_types.keys())

In [2]:
# Define translations with intentional variations to simulate real user input
translations = {
    "engine_overheating": {
        "fault": {
            "en": ["Engine overheating", "engine is overheating", "Engine gets too hot", "overheating problem"],
            "fr": ["Moteur surchauffe", "le moteur chauffe trop", "Problème de surchauffe moteur", "moteur trop chaud"],
            "es": ["Motor sobrecalentado", "el motor se calienta mucho", "Problema de sobrecalentamiento", "motor muy caliente"],
            "ja": ["エンジンオーバーヒート", "エンジンが熱い", "エンジン過熱問題", "エンジンが熱くなる"],
            "zh": ["发动机过热", "引擎太热了", "发动机温度过高", "引擎过热问题"],
            "el": ["Υπερθέρμανση κινητήρα", "ο κινητήρας υπερθερμαίνεται", "Πρόβλημα υπερθέρμανσης", "κινητήρας πολύ ζεστός"],
            "he": ["התחממות יתר של המנוע", "המנוע מתחמם", "בעיית חימום יתר", "מנוע חם מדי"]
        },
        "fix": {
            "en": ["Check coolant level and radiator", "refill coolant and check radiator", "Add coolant, inspect radiator for leaks", "coolant low - add more"],
            "fr": ["Vérifier niveau liquide refroidissement", "ajouter du liquide de refroidissement", "Verifier radiateur et liquide", "remplir liquide refroidissement"],
            "es": ["Revisar nivel de refrigerante", "añadir refrigerante y revisar radiador", "Verificar radiador y liquido", "poner mas refrigerante"],
            "ja": ["冷却水レベルチェック", "冷却液を補充する", "ラジエーター確認", "クーラント追加"],
            "zh": ["检查冷却液水平", "添加冷却液", "检查散热器", "加冷却液"],
            "el": ["Έλεγχος ψυκτικού υγρού", "προσθήκη ψυκτικού", "Έλεγχος ψυγείου", "βάλε ψυκτικό υγρό"],
            "he": ["בדוק רמת נוזל קירור", "הוסף נוזל קירור", "בדיקת רדיאטור", "מלא נוזל קירור"]
        }
    },
    "brake_noise": {
        "fault": {
            "en": ["Brake making noise", "brakes squeak", "Squeaking brakes when stopping", "brake noise problem"],
            "fr": ["Bruit de frein", "les freins grincent", "Freins qui font du bruit", "probleme bruit freins"],
            "es": ["Ruido en frenos", "frenos hacen ruido", "Frenos chirrian", "ruido al frenar"],
            "ja": ["ブレーキ音がする", "ブレーキがキーキー鳴る", "ブレーキノイズ", "ブレーキの音"],
            "zh": ["刹车有噪音", "刹车声音大", "制动器噪音", "刹车响"],
            "el": ["Θόρυβος φρένων", "τα φρένα κάνουν θόρυβο", "Φρένα τρίζουν", "θόρυβος στα φρένα"],
            "he": ["רעש בבלמים", "בלמים מרעישים", "צריחת בלמים", "רעש בזמן בלימה"]
        },
        "fix": {
            "en": ["Replace brake pads", "change brake pads", "New brake pads needed", "brake pads worn - replace"],
            "fr": ["Remplacer plaquettes de frein", "changer les plaquettes", "Nouvelles plaquettes necessaires", "plaquettes usées"],
            "es": ["Cambiar pastillas de freno", "reemplazar pastillas", "Pastillas nuevas necesarias", "cambiar las pastillas"],
            "ja": ["ブレーキパッド交換", "パッド交換必要", "新しいブレーキパッド", "パッド替える"],
            "zh": ["更换刹车片", "换新刹车片", "需要新刹车片", "刹车片要换"],
            "el": ["Αλλαγή τακάκια", "αντικατάσταση τακάκια φρένων", "Νέα τακάκια", "άλλαξε τακάκια"],
            "he": ["החלף רפידות בלמים", "רפידות חדשות", "צריך רפידות בלם חדשות", "להחליף רפידות"]
        }
    },
    "battery_drain": {
        "fault": {
            "en": ["Battery draining fast", "battery dies quickly", "Battery won't hold charge", "battery drain issue"],
            "fr": ["Batterie se vide vite", "batterie se décharge", "Batterie tient pas la charge", "probleme batterie"],
            "es": ["Batería se descarga rápido", "bateria no dura", "Batería no mantiene carga", "bateria se agota"],
            "ja": ["バッテリーが早く減る", "バッテリーすぐ切れる", "充電持たない", "バッテリー問題"],
            "zh": ["电池耗电快", "电池不耐用", "电池充不进电", "电池问题"],
            "el": ["Μπαταρία αδειάζει γρήγορα", "η μπαταρία δεν κρατάει", "Πρόβλημα μπαταρίας", "μπαταρία αδειάζει"],
            "he": ["סוללה מתרוקנת מהר", "הסוללה לא מחזיקה", "בעיית סוללה", "סוללה נגמרת מהר"]
        },
        "fix": {
            "en": ["Test alternator and replace battery", "check alternator", "Replace battery or alternator", "new battery needed"],
            "fr": ["Tester alternateur et remplacer batterie", "verifier alternateur", "Changer batterie", "nouvelle batterie"],
            "es": ["Probar alternador y cambiar batería", "revisar alternador", "Cambiar batería", "bateria nueva"],
            "ja": ["オルタネーター確認", "バッテリー交換", "新しいバッテリー", "バッテリー替える"],
            "zh": ["检查发电机和电池", "更换电池", "换新电池", "需要新电池"],
            "el": ["Έλεγχος δυναμό και μπαταρία", "αλλαγή μπαταρίας", "Νέα μπαταρία", "άλλαξε μπαταρία"],
            "he": ["בדוק אלטרנטור והחלף סוללה", "החלף סוללה", "סוללה חדשה", "צריך סוללה חדשה"]
        }
    }
}

In [None]:
# Generate the dataset
def generate_dataset() -> List[Dict[str, str]]:
    dataset = []
    
    # Language distribution for each problem to ensure all languages appear
    language_assignments = {
        "engine_overheating": ["en", "fr", "es", "ja", "zh", "el", "he", "en", "fr", "es"],
        "brake_noise": ["ja", "zh", "el", "he", "en", "fr", "es", "ja", "zh", "el"],
        "battery_drain": ["he", "en", "fr", "es", "ja", "zh", "el", "he", "en", "fr"]
    }
    
    for brand, models in car_data.items():
        for model in models:
            for problem in problems:
                # Generate entries for this problem with assigned languages
                languages = language_assignments[problem][:2]  # Take 2 languages per problem per model
                
                for lang in languages:
                    # Pick random variations for fault and fix
                    fault_text = random.choice(translations[problem]["fault"][lang])
                    fix_text = random.choice(translations[problem]["fix"][lang])
                    
                    # Randomly decide: 70% use same language as fault/fix, 30% keep English
                    use_local_language = random.random() < 0.7
                    
                    if use_local_language:
                        brand_text = brand_translations[brand][lang]
                        model_text = model_translations[model][lang]
                        problem_type_text = problem_types[problem][lang]
                    else:
                        brand_text = brand
                        model_text = model
                        problem_type_text = problem_types[problem]["en"]
                    
                    dataset.append({
                        "Id": str(uuid.uuid4()),
                        "Brand": brand_text,
                        "Model": model_text,
                        "ProblemType": problem_type_text,
                        "Fault": fault_text,
                        "Fix": fix_text
                    })
    
    # Shuffle to make it more realistic
    random.shuffle(dataset)
    return dataset

# Generate the data
data = generate_dataset()
print(f"Generated {len(data)} records")

In [4]:
# Create DataFrame and display sample
df = pd.DataFrame(data)
print("Sample of generated data:")
print(df.head(10))
print(f"\nTotal records: {len(df)}")
print(f"Unique brands (all languages): {df['Brand'].nunique()}")
print(f"Unique models (all languages): {df['Model'].nunique()}")
print(f"\nBrand examples:")
print(df['Brand'].value_counts().head(15))

Sample of generated data:
                                     Id       Brand    Model  \
0  4a0f47a5-2410-4bd6-8dc6-981027e592bc  Volkswagen   Passat   
1  a0a1ad1f-05e6-4f1e-bd2d-f380cf318eee       Honda    Civic   
2  8da8e4a4-e217-411e-8c1e-7c4c768eeb83      Nissan   Altima   
3  e27c818e-df8d-4f70-8fd3-1064f9ae5142       Honda    Civic   
4  dc4bb767-f6ae-40ee-b89f-b0f5d50672d1   פולקסווגן     גולף   
5  b58dc4c3-7e3e-4dfe-a993-5385809a8a29  Volkswagen   Passat   
6  69580655-2091-47ac-96c0-84c576a9d7a4          丰田      卡罗拉   
7  e7a13b1b-e06c-4014-b49e-5259ac8ccff3      Toyota  Corolla   
8  f35a4427-a021-455a-b591-e44f660dbfdb   フォルクスワーゲン      ゴルフ   
9  18330958-10d9-4e6d-bcda-9aa3bbbad452  Volkswagen   Passat   

          ProblemType                          Fault  \
0   Surchauffe moteur  Problème de surchauffe moteur   
1       Battery Drain          Battery draining fast   
2       Battery Drain          Battery draining fast   
3  Engine Overheating            overheating 

In [None]:
# Save to Excel file
output_file = "car_problems_multilingual.xlsx"
df.to_excel(output_file, index=False, engine='openpyxl')
print(f"\nData saved to {output_file}")

## 📋 Dataset Summary

### 🎲 Dataset Composition

The generated dataset includes:

| Component | Count | Details |
|-----------|-------|---------|
| 🏭 **Car Brands** | 5 | Toyota, Honda, Ford, Volkswagen, Nissan |
| 🚙 **Models** | 10 | 2 models per brand |
| ⚠️ **Problem Types** | 3 | Engine Overheating, Brake Noise, Battery Drain |
| 🌐 **Languages** | 7 | English, French, Spanish, Japanese, Chinese, Greek, Hebrew |
| 📝 **Total Records** | 60 | Fully multilingual dataset |

### 🔀 Language Distribution Strategy

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'pie1':'#4A90E2', 'pie2':'#E74C3C'}}}%%
pie title Language Mixing Strategy
    "Same Language (Brand/Model/Problem = Fault/Fix)" : 70
    "Mixed Language (English Brand/Model, Local Fault/Fix)" : 30
```

**Key Features:**
- ✅ **70% Same Language**: Brand, model, and problem type match the fault/fix language
- ✅ **30% Mixed Language**: English brand/model with localized fault/fix descriptions
- ✅ **Natural Variations**: 4 different ways to express each problem/solution
- ✅ **Grammar Variations**: Intentional imperfections to simulate real user input

### 🎯 Example Records

**Same Language Record (Japanese):**
```
Brand: ホンダ (Honda)
Model: シビック (Civic)
ProblemType: ブレーキノイズ (Brake Noise)
Fault: ブレーキがキーキー鳴る (Brakes squeak)
Fix: ブレーキパッド交換 (Replace brake pads)
```

**Mixed Language Record:**
```
Brand: Ford
Model: Focus
ProblemType: Battery Drain
Fault: Batterie se vide vite (Battery drains fast - French)
Fix: Changer batterie (Change battery - French)
```

### 💾 Output

The dataset is saved as **`car_problems_multilingual.xlsx`** and contains:
- 📊 Structured columns: Id, Brand, Model, ProblemType, Fault, Fix
- 🔤 UTF-8 encoding for proper multilingual character support
- 🎲 Shuffled records for realistic distribution
- ✨ Ready for RAG system training and testing

### 🚀 Use Cases

This dataset is perfect for:
- 🤖 Training multilingual RAG chatbots
- 🔍 Testing semantic search across languages
- 📊 Evaluating embedding models on mixed-language data
- 🌐 Building automotive technical support systems
- 🧪 Testing retrieval accuracy with language variations