# Generate synthetic data

## Why do we need synthetic data?
1. Data Scarcity:
   - Limited availability of Darija (Moroccan Arabic) text data
   - Most available data is from social media or YouTube comments, which are:
     * Short in length
     * Often contain informal language
     * May include spelling errors and non-standard writing

2. Quality Control:
   - Synthetic data allows us to:
     * Generate longer, more structured text
     * Control the language style and formality
     * Ensure grammatical correctness
     * Create diverse topics and contexts

3. Training Benefits:
   - Larger training dataset
   - Better coverage of language patterns
   - More consistent quality
   - Ability to generate specific types of content

4. Cost Efficiency:
   - Using GCP credits for inference
   - More cost-effective than manual data collection
   - Scalable solution for generating large datasets


## Prompt best practices

For worked examples, see Google Cloud's prompt engineering guide: https://cloud.google.com/discover/what-is-prompt-engineering#use-cases-and-examples-of-prompt-engineering

Example prompt structure:

In [1]:
"""
Context: [Domain/Topic]
Audience: [Target readers]
Style: [Formal/Informal]
Length: [Word count]
Language: Darija (Moroccan Arabic)
Format: [Structure requirements]

Examples:
[2-3 example inputs and outputs]

Constraints:
- Avoid [specific issues]
- Include [required elements]
- Maintain [style guidelines]
"""

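The structure above can be filled in programmatically, so each request varies only in the fields that matter. The following is a minimal sketch: `build_prompt` and its parameter names are hypothetical helpers introduced for illustration, not part of any library.

```python
def build_prompt(context, audience, style, length, fmt, examples, constraints,
                 language="Darija (Moroccan Arabic)"):
    """Fill the prompt structure with concrete values (hypothetical helper)."""
    # Each example is an (input, output) pair rendered on two lines
    example_lines = "\n".join(
        f"Input: {inp}\nOutput: {out}" for inp, out in examples
    )
    # Each constraint becomes one bullet
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Context: {context}\n"
        f"Audience: {audience}\n"
        f"Style: {style}\n"
        f"Length: {length}\n"
        f"Language: {language}\n"
        f"Format: {fmt}\n\n"
        f"Examples:\n{example_lines}\n\n"
        f"Constraints:\n{constraint_lines}\n"
    )

prompt = build_prompt(
    context="Daily life",
    audience="Adult native speakers",
    style="Informal",
    length="100-150 words",
    fmt='JSON with a single key "paragraph"',
    examples=[("Write about going to the market", '{"paragraph": "[Example in Darija]"}')],
    constraints=["Avoid formal Arabic", "Include common expressions"],
)
print(prompt)
```

Keeping the fields as function arguments makes it easy to loop over several topics or styles later without editing the template text by hand.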

## Generate data using the OpenAI GPT-4o model

In [2]:
!pip install openai -q

In [3]:
from openai import OpenAI
from google.colab import userdata
import base64
import os
import json

In [31]:
# Send the messages to GPT-4o and return the parsed JSON response
def generate(messages):
    client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))
    model = "gpt-4o"

    resp = client.chat.completions.create(
        model=model,
        messages=(
            [{"role": "system", "content": "You are a helpful assistant. Output must be valid JSON."}]
            + [{"role": message["role"], "content": message["content"]}
               for message in messages]
        ),
        temperature=0.7,
        top_p=0.8,  # the Chat Completions API has no top_k parameter
        max_tokens=1024,
        response_format={"type": "json_object"},  # ask for a JSON object
    )

    output = resp.choices[0].message.content
    # Parse the JSON output into a Python dictionary
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        print("Warning: Model did not return valid JSON.")
        return output  # Fall back to the raw text if JSON parsing fails
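Because `generate` can return either a dictionary or raw text, it can help to validate the result before using it. The sketch below assumes the response shape used in this notebook (a JSON object with a `"paragraph"` key); `extract_paragraph` is a hypothetical helper name.

```python
def extract_paragraph(response, key="paragraph"):
    """Return the stripped text under `key`, or raise ValueError if it is unusable."""
    # generate() falls back to a raw string when JSON parsing fails
    if not isinstance(response, dict):
        raise ValueError("response is not a JSON object")
    text = response.get(key)
    # Reject missing keys, non-string values, and whitespace-only text
    if not isinstance(text, str) or not text.strip():
        raise ValueError(f"missing or empty {key!r} field")
    return text.strip()

print(extract_paragraph({"paragraph": "  كنفيق بكري  "}))
```

Raising a `ValueError` with a clear message makes failures easier to read in logs than the `TypeError` or `KeyError` that direct indexing would produce.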

In [32]:
# Example prompt template modified to request JSON output
prompt_template = """
Generate a paragraph in Moroccan Arabic (Darija), written in Arabic script, about daily life.
Style: Informal
Length: 100-150 words
Topic: Random daily activities

Format: JSON with a single key "paragraph" containing the generated text.

Example:
Input: Write about going to the market
Output:
{
  "paragraph": "[Example in Darija]"
}


Constraints:
- Use natural, conversational Darija
- Include common expressions
- Avoid formal Arabic
"""

messages=[
  {
      "role": "user",
      "content": prompt_template
  }
]

In [33]:
from pprint import pprint  # pretty-prints nested structures such as dictionaries
# Call generate and print the output (which will be a dictionary if JSON is parsed)
json_output = generate(messages)
pprint(json_output)

{'paragraph': 'كل صباح كنفيق مع الستة ديال الصباح، كنغسل وجهي ونشرب قهيوة، '
              'ونخرج نخدم. فالطريق كنلقى بزاف ديال الناس كيجريو باش يوصلو '
              'للخدمة ديالهم. بعد ما نسالي الخدمة، كنمشي نتغدى مع صحابي فشي '
              'كافيتيريا قريبة. كنضحكو ونهدرو على شنو وقع لينا فالصباح. من بعد '
              'كنرجع نخدم شوية حتى يجي العشية. فالطريق للدار كنوقف نشري شوية '
              'الخضرة والخبز. فالعشية كنكون تعبان، كنحبس فالتلفازة نشوف شوية '
              'ديال المسلسلات ولا نشوف شنو واقع فالدنيا. فالعشاء كنكون مع '
              'العائلة، كنضحكو ونهضرو على نهارنا. من بعد كنقرا شوية ولا نسمع '
              'الموسيقى حتى نعس. هادي هي الحياة اليومية ديالي، بسيطة ولكن '
              'زوينة.'}


## Generate many samples

In [34]:
from datetime import datetime
from tqdm import tqdm
from time import sleep

generated_data = []
num_samples = 10

for i in tqdm(range(num_samples)):
    messages = [{"role": "user", "content": prompt_template}]
    try:
        # Generate one sample per request
        response = generate(messages)

        # Store the paragraph with basic metadata; a missing "paragraph" key
        # raises here and is reported by the except block below
        generated_data.append({
            "text": response["paragraph"],
            "metadata": {
                "generated_at": datetime.now().isoformat(),
                "sample_id": i + 1,
            },
        })
    except Exception as e:
        print(f"Error in sample {i+1}: {e}")
        continue

100%|██████████| 10/10 [01:30<00:00,  9.06s/it]
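The loop above skips a sample when a request fails. For transient errors such as rate limits, retrying with a growing delay may recover the sample instead. This is a minimal sketch of exponential backoff: `with_retries` and its parameters are hypothetical, and `flaky_generate` is a stand-in for `generate(messages)` so the example runs without an API key.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep_fn=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep_fn(delay)

# Stand-in for generate(messages): fails twice, then succeeds
calls = {"count": 0}

def flaky_generate():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return {"paragraph": "ok"}

# sleep_fn is replaced with a no-op here so the demo finishes instantly
result = with_retries(flaky_generate, sleep_fn=lambda seconds: None)
print(result)
```

In the notebook, the call inside the loop could become `with_retries(lambda: generate(messages))`, keeping the existing `try`/`except` as the last line of defense.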


In [35]:
# Load the generated samples into a pandas DataFrame
import pandas as pd
df = pd.DataFrame(generated_data)
df.head()

Unnamed: 0,text,metadata
0,كل صباح كنفيق مع السبعة ديال الصباح، كنشرب واح...,"{'generated_at': '2025-07-08T11:54:03.408969',..."
1,نهار كامل كيف العادة، كنفيق بكري مع السبعة د ا...,"{'generated_at': '2025-07-08T11:54:11.306880',..."
2,فنهار عادي فحياة الناس فالمغرب، كتلقى الناس كي...,"{'generated_at': '2025-07-08T11:54:21.498979',..."
3,نهار عادي في المغرب كيبدأ مع الصباح، كنفيقو بك...,"{'generated_at': '2025-07-08T11:54:32.130860',..."
4,نهار عادي فحياة الناس فالمغرب كيبدا مع الفجر، ...,"{'generated_at': '2025-07-08T11:54:42.469281',..."
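In the table above, the `metadata` column holds a whole dictionary per row, which is awkward to filter or sort on. One option is to flatten each record before building the DataFrame. The sketch below uses a hypothetical `flatten_record` helper and a made-up placeholder row rather than real generated data.

```python
def flatten_record(record, prefix="metadata"):
    """Copy top-level fields and lift nested metadata keys to 'metadata_<key>'."""
    flat = {key: value for key, value in record.items() if key != prefix}
    for key, value in record.get(prefix, {}).items():
        flat[f"{prefix}_{key}"] = value
    return flat

# Placeholder row with the same shape as the entries in generated_data
rows = [
    {"text": "example text",
     "metadata": {"generated_at": "2025-07-08T11:54:03", "sample_id": 1}},
]
flat_rows = [flatten_record(row) for row in rows]
print(flat_rows[0])
```

Passing `flat_rows` to `pd.DataFrame` then gives one column per metadata field. pandas also provides `pd.json_normalize`, which performs a similar flattening directly on a list of nested dictionaries.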


## Push data to Hugging Face Hub

In [None]:
# Install required packages
!uv pip install datasets huggingface_hub -q

In [None]:
from huggingface_hub import login
login(token="")  # paste your Hugging Face access token here (it needs write access to push)

In [None]:
from datasets import Dataset

In [None]:
dataset=Dataset.from_list(generated_data)
dataset

In [None]:
dataset.push_to_hub("username/dataset_title")
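Since each sample costs an API call, it may be worth keeping a local copy of the data as well as the copy on the Hub. The sketch below writes records as JSON Lines using only the standard library. The helper names, the file name, and the sample record are illustrative assumptions.

```python
import json
import os
import tempfile

def save_jsonl(records, path):
    """Write one JSON object per line, keeping Arabic text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False stores Arabic characters as-is instead of \uXXXX escapes
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read the file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Illustrative record with the same shape as the entries in generated_data
sample_records = [{"text": "كنفيق بكري", "metadata": {"sample_id": 1}}]
path = os.path.join(tempfile.mkdtemp(), "generated_data.jsonl")
save_jsonl(sample_records, path)
restored = load_jsonl(path)
print(restored == sample_records)
```

In the notebook, `save_jsonl(generated_data, "generated_data.jsonl")` would write the real samples to the working directory.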

## Challenge

Try translating 10 **random** samples (not the first 10) from [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) into Moroccan Arabic. Check that the quality is good before translating the full dataset and pushing it to the Hub.

In [7]:
!pip install -Uq datasets fsspec==2023.9.2

In [None]:
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories", split="train")
ds = ds.shuffle(seed=42).select(range(10))

In [20]:
prompt_template = """
Translate the following paragraph into Moroccan Arabic (Darija), written in Arabic script.

Text to translate:
{input_text}

Format: JSON with a single key "paragraph" containing the translated text.

Expected output:
{{
  "paragraph": "[translation in Darija]"
}}


Constraints:
- Use natural, conversational Darija
- Include common expressions
- Avoid formal Arabic
"""


In [27]:
# Start from an empty list so the translations are not mixed with the
# daily-life samples generated earlier
generated_data = []

for i, item in enumerate(tqdm(ds)):
    prompt = prompt_template.format(input_text=item["text"])
    messages = [{"role": "user", "content": prompt}]
    try:
        # Translate one story per request
        response = generate(messages)

        generated_data.append({
            "text": response["paragraph"],
            "metadata": {
                "generated_at": datetime.now().isoformat(),
                "sample_id": i + 1,  # index in this loop, not the stale `i` from the earlier cell
            },
        })
    except Exception as e:
        print(f"Error in sample {i+1}: {e}")
        continue

100%|██████████| 10/10 [01:14<00:00,  7.46s/it]


In [30]:
pd.DataFrame(generated_data)

Unnamed: 0,text,metadata
0,كان يا ما كان، كانت واحد البنت صغيرة سميتها سي...,"{'generated_at': '2025-07-08T11:50:20.596678',..."
1,مرة كان وحد الولد سميتو تيم كيعيش فدار صغيرة. ...,"{'generated_at': '2025-07-08T11:50:31.357354',..."
2,كان يا ما كان وحد البنت صغيرة سميتها ميا اللي ...,"{'generated_at': '2025-07-08T11:50:39.991854',..."
3,نهار واحد، شي قط مشاكس سميتو توم لقى واحد الحل...,"{'generated_at': '2025-07-08T11:50:45.583529',..."
4,كان يا ما كان، كانت واحد المرا كبيرة كتعيش فدا...,"{'generated_at': '2025-07-08T11:50:52.247360',..."
5,مرة واحد الوقت، فمدينة صغيرة، كان واحد الولد ص...,"{'generated_at': '2025-07-08T11:50:59.907968',..."
6,مرة كانو جوج ديال الأصدقاء اللي كانوا ديما مع ...,"{'generated_at': '2025-07-08T11:51:07.501227',..."
7,كان يا ما كان، كان واحد الولد صغير سميتو تيم. ...,"{'generated_at': '2025-07-08T11:51:12.993895',..."
8,مرة كان واحد الولد صغير سميتو جو. ديما كان فرح...,"{'generated_at': '2025-07-08T11:51:18.890752',..."
9,نهار بارد ديال الشتاء، تيلي كانت برا كتعاون فا...,"{'generated_at': '2025-07-08T11:51:27.637085',..."
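Before translating the full dataset, a cheap automatic check can flag outputs that are obviously wrong, such as text left in English or written in Latin script. The sketch below measures the share of letters that fall in the basic Arabic Unicode block. The function names and the 0.9 threshold are assumptions for illustration. The check cannot tell Darija from formal Arabic, so manual review of a sample is still needed.

```python
def arabic_ratio(text):
    """Fraction of alphabetic characters in the Arabic block (U+0600 to U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)

def looks_like_arabic_script(text, min_ratio=0.9):
    """True if at least min_ratio of the letters are Arabic (threshold is a guess)."""
    return arabic_ratio(text) >= min_ratio

print(arabic_ratio("كنفيق بكري"))            # all letters are Arabic
print(looks_like_arabic_script("hello"))     # Latin-only text is rejected
```

Names such as Tim or Mia in the stories should come out transliterated, so a well-formed translation is expected to score close to 1.0. Rows that score much lower are worth reading by hand.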
