# Flow

Dummy personal information (label/ground truth) -> Hiragana Pronunciation -> Phrase -> Audio file

# Usage

1. Generate the dummy personal information CSV: https://testdata.userlocal.jp/.
2. Get an ElevenLabs API key for `.env` (a free account works): https://elevenlabs.io/app/settings/api-keys
3. Configure the label column (the CSV column read in cell 2, `電話番号` in this notebook).
4. Modify the pronunciation -> phrase template as needed.
5. Run the notebook. It takes about 0.6 seconds per file.

Note that userlocal purports to generate data that reflects the demographics of Japan as a whole. If our client's customer base has a different demographic (e.g. the elderly in small towns), we may need to resample to match.
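
One way to do that resampling is a weighted draw over age bands. The following is a minimal sketch, not part of the notebook: the function name `resample_by_age`, the band boundaries, and the target shares are all hypothetical, and it assumes the `年齢` column from the userlocal CSV holds integer ages.

```python
import pandas as pd


def resample_by_age(df, target_share, n, age_col="年齢", seed=0):
    """Draw n rows (with replacement) so age bands roughly follow target_share.

    target_share maps a band name to the desired fraction of the sample.
    The bands below are illustrative assumptions, not values from the source.
    """
    bins = [0, 40, 65, 200]
    labels = ["under_40", "40_to_64", "65_plus"]
    # right=False gives the half-open bands [0, 40), [40, 65), [65, 200).
    band = pd.cut(df[age_col], bins=bins, labels=labels, right=False).astype(str)
    # Share of each band in the generated data.
    observed = band.value_counts(normalize=True)
    # Row weight = desired share / observed share of the row's band.
    weights = band.map(target_share) / band.map(observed)
    return df.sample(n=n, weights=weights, replace=True, random_state=seed)
```

For example, `resample_by_age(df, {"under_40": 0.2, "40_to_64": 0.2, "65_plus": 0.6}, n=100)` would skew the sample towards older customers. Sampling with replacement means the same row can appear more than once, which matters if the label is also used as a file name.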

## Changing the voice
This is configured with `voice_id`. See https://elevenlabs.io/app/voice-library. Note some voices may not be compatible with your model of choice.

## Sound not being articulated correctly
I would first try modifying the label -> pronunciation template. If that does not help, you might need to fiddle with the arguments of the `convert()` call.
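
For phone numbers specifically, a rule-based conversion can serve as a reference to compare the LLM output against when a sample sounds wrong. This is a sketch of an alternative technique, not what the notebook does: the function name `phone_to_hiragana` is hypothetical, the digit readings are the common ones, and reading the hyphen as 「の」 is an assumed convention.

```python
# Common hiragana readings for single digits when read out one by one.
DIGIT_READINGS = {
    "0": "ぜろ", "1": "いち", "2": "に", "3": "さん", "4": "よん",
    "5": "ご", "6": "ろく", "7": "なな", "8": "はち", "9": "きゅう",
}


def phone_to_hiragana(number: str, separator: str = " の ") -> str:
    """Spell out a hyphen-separated phone number digit by digit.

    Groups are joined with `separator`; the half-width spaces around it
    mark short pauses, mirroring the instruction in the LLM prompt.
    """
    groups = [g for g in number.strip().split("-") if g]
    for group in groups:
        if not all(d in DIGIT_READINGS for d in group):
            raise ValueError(f"Unexpected character in phone number: {number!r}")
    return separator.join(
        "".join(DIGIT_READINGS[d] for d in group) for group in groups
    )
```

If the LLM output for a failing sample differs from this reading, the template is the likelier culprit; if they match, the voice or model settings are worth a look.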

In [1]:
import os
import traceback
import wave

import pandas as pd
from elevenlabs.client import ElevenLabs
from jinja2 import Template
from tqdm import tqdm

from ut.aoai import gpt_call
from ut.para import process_list_in_parallel

client = ElevenLabs()

folder = "data/phone"
clear = True  # DELETES EVERYTHING in the folder first. For debugging
samples = 100

In [2]:
# "氏名","氏名（ひらがな）","年齢","生年月日","性別",
# "血液型","メールアドレス","電話番号","携帯電話番号",
# "郵便番号","住所","会社名","クレジットカード","有効期限","マイナンバー"
df = pd.read_csv("dummy.csv")
sample = df.sample(samples)["電話番号"].values.tolist()

In [3]:
# Convert text to its pronunciation.
# ElevenLabs' model does not seem to know how to pronounce certain kanji and phone numbers,
# and place names can be tricky in Japanese too,
# so we give it some help with an LLM.

template = Template(
  """Convert the following Japanese text into how it would be pronounced in hiragana.
Insert half-width spaces if there would be a short pause.

{{address}}"""
)

converted = process_list_in_parallel(lambda x: gpt_call(template.render(address=x)), sample)

100%|██████████| 100/100 [00:19<00:00,  5.10it/s]


In [4]:
# Wrapping the pronunciation in a full phrase, instead of setting
# previous_text / next_text in convert(), produces more natural results.

phrase = Template("""電話番号は {{address}} です。""")
converted = [phrase.render(address=x) for x in converted]

In [5]:
os.makedirs(folder, exist_ok=True)  # create the output folder if it is missing

if clear:
  for file in os.listdir(folder):
    os.remove(f"{folder}/{file}")

for label, pronunciation in tqdm(list(zip(sample, converted))):
  try:
    audio = client.text_to_speech.convert(
      text=pronunciation,
      voice_id="3JDquces8E8bkmvbh6Bc",
      model_id="eleven_flash_v2_5",
      output_format="pcm_16000",  # https://github.com/elevenlabs/elevenlabs-python/blob/main/src/elevenlabs/types/output_format.py
      language_code="ja",  # https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
    )

    with wave.open(f"{folder}/{label}.wav", "wb") as wav_file:
      # wave has no defaults for these: mono, 16-bit samples, 16 kHz (matching pcm_16000).
      wav_file.setnchannels(1)
      wav_file.setsampwidth(2)
      wav_file.setframerate(16000)
      for chunk in audio:
        wav_file.writeframes(chunk)
  except Exception as e:
    print(f"Failed to generate audio for {label}: {e}")
    traceback.print_exc()
    continue  # sometimes the remote will have a random error

100%|██████████| 100/100 [01:04<00:00,  1.55it/s]
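
The loop above skips a sample whenever the remote returns a random error. If dropped samples are a problem, the call can be retried a few times first. The helper below is a minimal sketch under that assumption: `with_retries` and its parameters are hypothetical names, not part of the ElevenLabs SDK or of this notebook.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, with exponential backoff.

    The last failure is re-raised so the caller can still log and skip.
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Inside the loop, the wrapped function should read all chunks itself, for example `with_retries(lambda: b"".join(client.text_to_speech.convert(...)))`, so that an error raised while the audio is being streamed is retried as well. The joined bytes can then be passed to `wav_file.writeframes()` in a single call.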
