### Voice Translation of Audio Files into Different Languages 

Have you ever wanted to translate a podcast into your native language? Translating and dubbing audio content can make it more accessible to a global audience. This guide will walk you through the process of converting an English audio file into Hindi using OpenAI's APIs.

Dubbing audio content involves four stages: transcribing the audio, translating the text into the target language, converting the translated text into speech, and benchmarking the result to check quality and accuracy.

The process flow can be illustrated as follows:

![Voice Translations Steps](./images/voice-translation-steps.png)

A note on terminology: this cookbook distinguishes between **language** and written **script**. The two words are often used interchangeably, but the distinction matters for the task at hand.
- **Language** refers to the spoken or written system of communication. For instance, Hindi and Marathi are different languages, but both use the Devanagari script. Similarly, English and French are different languages, but both are written in the Latin script.
- **Script** refers to the set of characters or symbols used to write a language. For example, Serbian is traditionally written in the Cyrillic script but is also written in the Latin script.
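The distinction can be made concrete with a small sketch. It is not part of the original pipeline: the helper name `script_counts` is hypothetical, and it relies on the assumption that the first word of a character's Unicode name identifies its script, which holds for the scripts discussed here.

```python
import unicodedata
from collections import Counter

def script_counts(text: str) -> Counter:
    """Count alphabetic characters by script (hypothetical helper for illustration)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "DEVANAGARI LETTER NA"
            # or "LATIN SMALL LETTER A"; combining vowel signs are skipped by isalpha().
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts

# One string, two scripts: Hindi in Devanagari next to an English name in Latin
print(script_counts("नमस्ते OpenAI"))

# One language, two scripts: Serbian in Cyrillic and in Latin
print(script_counts("Србија"), script_counts("Srbija"))
```

A mixed-script string like the first one is what the translation step later in this cookbook produces.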


In this cookbook, we will walk through the following 5 steps to dub an audio podcast from English to Hindi. 

1. **Transcribe** the audio file into text with Whisper 
2. **Translate** the English language text to Hindi in Devanagari script    
3. **Text-to-speech** conversion of the Devanagari script into spoken Hindi language 
4. **Translation benchmarking** (BLEU or ROUGE) 
5. **Interpret and improve** scores by adjusting prompting parameters in steps 1-3 as needed  


Before we get started, make sure you have the `openai` library installed, and your OpenAI API key is configured as an environment variable. 
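A quick pre-flight check can catch a missing key before any API call is made. This is a minimal sketch: the helper name `api_key_configured` is an assumption introduced here, while `OPENAI_API_KEY` is the environment variable the `openai` client reads by default.

```python
import os

def api_key_configured(env=os.environ) -> bool:
    """Return True when OPENAI_API_KEY is present and non-empty (hypothetical helper)."""
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not api_key_configured():
    print("OPENAI_API_KEY is not set; export it before running the cells below.")
```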

### Step 1. Transcribe the audio file into text with Whisper

[Whisper](https://github.com/openai/whisper) is an automatic speech recognition (ASR) system developed by OpenAI, accessible through both an API and an open-source model. It can transcribe audio files with high accuracy across multiple languages. The OpenAI Whisper API provides two speech-to-text endpoints, `transcriptions` and `translations`, based on the open-source Whisper large-v2 model. The code below calls the `transcriptions` endpoint, passes the audio file as a parameter, and receives the English transcription in return.

![Text-to-speech](./images/Whisper.png)

[Learn more about Whisper here](https://platform.openai.com/docs/guides/speech-to-text)


In [1]:
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# We'll use pre-recorded English-language audio of the 2023 OpenAI Dev Day keynote
# You can change this path to the audio file you want to translate
audio_path = "../gpt4o/data/keynote_recap.mp3"

# Get the transcription from the Whisper model; the file is closed when the block exits
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Retrieve the transcribed text and print the output
transcription_english_first_pass = transcription.text

print(transcription_english_first_pass)

Welcome to our first-ever OpenAI Dev Day. Today, we are launching a new model, GPT-4 Turbo. GPT-4 Turbo supports up to 128,000 tokens of context. We have a new feature called JSON mode, which ensures that the model will respond with valid JSON. You can now call many functions at once, and it'll do better at following instructions in general. You want these models to be able to access better knowledge about the world. So do we. So we're launching retrieval in the platform. You can bring knowledge from outside documents or databases into whatever you're building. GPT-4 Turbo has knowledge about the world up to April of 2023, and we will continue to improve that over time. Dolly 3, GPT-4 Turbo with Vision, and the new Text-to-Speech model are all going into the API today. Today, we're launching a new program called Custom Models. With Custom Models, our researchers will work closely with the company to help them make a great custom model, especially for them and their use case using our t

**Note:** the model transcribed "Dall-e" as "Dolly". Speech-to-text systems commonly misidentify similar-sounding words when they lack additional context.

There are two techniques to provide context to the model to correct such transcription errors: 

#### Option 1: Provide the correct spelling in the `prompt` parameter
Use the `prompt` parameter in the Whisper API call to improve the quality of the transcripts generated by the model. Prompts are particularly helpful for correcting specific words or acronyms that the model may otherwise misspell in the transcription.

You could build a "glossary" of terms that the model is likely to transcribe inaccurately, given the context of the text. Over time, terms can be added to the glossary, improving accuracy.

[Learn more about prompting Whisper model here](https://platform.openai.com/docs/guides/speech-to-text/prompting)
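As the glossary grows, it may help to keep it as a list and build the comma-separated prompt from it. The sketch below is one possible way to do that; the helper name `build_transcription_prompt` and the extra sample terms are assumptions for illustration, not part of the original notebook.

```python
def build_transcription_prompt(terms):
    """Join glossary terms into a comma-separated prompt string.

    Blank entries and case-insensitive duplicates are dropped; the first
    spelling seen for each term is kept.
    """
    seen = set()
    unique_terms = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique_terms.append(cleaned)
    return ", ".join(unique_terms)

# Hypothetical glossary; only "Dall-e" comes from the example in this cookbook
glossary = ["Dall-e", "GPT-4 Turbo", "dall-e", " JSON mode "]
print(build_transcription_prompt(glossary))
```

The resulting string could be passed as the `prompt` argument of the transcription call in the same way as the single-term glossary used in the next cell.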

In [2]:
# Path to the source audio file
audio_path = "../gpt4o/data/keynote_recap.mp3"

# Terms that Whisper tends to get wrong can be stored in a comma-separated glossary
glossary_of_transcription_errors = "Dall-e"

# Invoke the Whisper model again, passing the glossary as the prompt
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        prompt=glossary_of_transcription_errors,
        file=audio_file,
    )

# Retrieve the transcribed text and print the output
english_transcription_second_pass = transcription.text
print(english_transcription_second_pass)

Welcome to our first ever OpenAI Dev Day. Today, we are launching a new model, GPT-4 Turbo. GPT-4 Turbo supports up to 128,000 tokens of context. We have a new feature called JSON mode, which ensures that the model will respond with valid JSON. You can now call many functions at once, and it will do better at following instructions in general. You want these models to be able to access better knowledge about the world. So do we. So we're launching retrieval in the platform. You can bring knowledge from outside documents or databases into whatever you're building. GPT-4 Turbo has knowledge about the world up to April of 2023, and we will continue to improve that over time. Dall-e 3, GPT-4 Turbo with Vision, and the new Text-to-Speech model are all going into the API today. Today, we're launching a new program called Custom Models. With Custom Models, our researchers will work closely with a company to help them make a great custom model, especially for them and their use case using our 

The model now transcribes "Dall-e" correctly because the prompt supplied the intended spelling for this context.
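To see exactly what a prompt changed, the two transcripts can be compared word by word. The following is a minimal sketch using the standard-library `difflib` module; the helper name `changed_words` and the shortened sample sentences are assumptions for illustration.

```python
import difflib

def changed_words(before: str, after: str):
    """Return (old, new) word-span pairs that differ between two transcripts."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Either side may be empty for pure insertions or deletions
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

# Shortened sample sentences modelled on the two outputs above
first = "Dolly 3 and the new Text-to-Speech model are going into the API today."
second = "Dall-e 3 and the new Text-to-Speech model are going into the API today."
print(changed_words(first, second))
```

In the notebook, the same function could be called with `transcription_english_first_pass` and `english_transcription_second_pass`. A prompt can shift other wording as well, so the diff may list more than the glossary terms.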
 

#### Option 2: Use a GPT model to post-process the output
Another popular technique for improving output quality and correcting spelling mistakes is to post-process the transcript with a GPT model, as shown below.

In [3]:
system_prompt = (
    "You are a helpful assistant. Your task is to correct any spelling discrepancies and grammar in the transcribed text. "
    "Only add necessary punctuation such as periods, commas, and capitalization, and use only the context provided. "
    f"Some words that may be misspelled include: {glossary_of_transcription_errors}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcription_english_first_pass},
    ],
)

transcription_english_third_pass = response.choices[0].message.content

print(transcription_english_third_pass)

Welcome to our first-ever OpenAI Dev Day. Today, we are launching a new model, GPT-4 Turbo. GPT-4 Turbo supports up to 128,000 tokens of context. We have a new feature called JSON mode, which ensures that the model will respond with valid JSON. You can now call many functions at once, and it'll do better at following instructions in general. You want these models to be able to access better knowledge about the world. So do we. So we're launching retrieval in the platform. You can bring knowledge from outside documents or databases into whatever you're building. GPT-4 Turbo has knowledge about the world up to April of 2023, and we will continue to improve that over time. DALL-E 3, GPT-4 Turbo with Vision, and the new Text-to-Speech model are all going into the API today. Today, we're launching a new program called Custom Models. With Custom Models, our researchers will work closely with the company to help them make a great custom model, especially for them and their use case using our 

### Step 2. Translate the English language text to Hindi in Devanagari script

The next step is to translate the text from the source language to the target language script. In this case, we prompt the GPT-4o model to translate the text from English (written in Latin script) to Hindi (written in Devanagari script).

For certain new words in a language, there may not be a direct translation in the target language. In such cases, we prompt the model to retain these words in the original language. We can also explicitly provide examples of words to keep in the original script (e.g., English). These terms can be stored as a glossary, which can be expanded as we continue translating the text.

In [4]:
glossary_of_terms_to_keep_in_original_language = "Some words to keep in English include - Turbo, OpenAI, token, GPT, Dall-e, Python"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"You are an assistant that translates text from English to Hindi. The user will provide content for you to translate. You may keep certain words in English when a direct translation doesn't exist. {glossary_of_terms_to_keep_in_original_language} "},
        # Translate the prompt-corrected transcription from Option 1
        {"role": "user", "content": english_transcription_second_pass},
    ],
)

hindi_transcription = response.choices[0].message.content

print(hindi_transcription)

हमारे पहले OpenAI Dev Day में आपका स्वागत है। आज, हम एक नया मॉडल लॉन्च कर रहे हैं, GPT-4 Turbo। GPT-4 Turbo 128,000 tokens तक के संदर्भ को समर्थन करता है। हमारे पास एक नई फीचर है, जिसे JSON mode कहा जाता है, जो सुनिश्चित करती है कि मॉडल वैध JSON के साथ प्रतिक्रिया देगा। आप अब कई functions को एक साथ कॉल कर सकते हैं, और यह सामान्य निर्देशों का पालन करने में भी बेहतर होगा। आप चाहते हैं कि ये मॉडल दुनिया के बारे में बेहतर ज्ञान तक पहुंच सकें। हम भी ऐसा ही चाहते हैं। इसलिए हम प्लेटफ़ॉर्म में retrieval लॉन्च कर रहे हैं। आप बाहरी दस्तावेजों या डेटाबेस से ज्ञान को अपने निर्माण में ला सकते हैं। GPT-4 Turbo के पास अप्रैल 2023 तक का दुनिया का ज्ञान है, और हम इसे समय के साथ सुधारते रहेंगे। Dall-e 3, GPT-4 Turbo with Vision, और नया Text-to-Speech मॉडल आज API में जा रहे हैं। 

आज, हम एक नया कार्यक्रम लॉन्च कर रहे हैं जिसे Custom Models कहा जाता है। Custom Models के साथ, हमारे शोधकर्ता किसी कंपनी के साथ मिलकर काम करेंगे ताकि वे उनके उपकरणों का उपयोग करके उनके विशेष उपयोग के मामले के लिए एक महान कस्टम

The translated text is a combination of Hindi and English, each written in its own script: Devanagari for Hindi and Latin for English. Keeping each word in its native script helps the text-to-speech step pronounce words from both languages correctly, which makes the speech sound more natural.
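One way to confirm that the glossary was respected is to check which terms still appear in Latin script in the translated text. This is a rough sketch with a hypothetical helper, `missing_terms`; the sample string is an excerpt of the translation shown above.

```python
def missing_terms(translated_text, terms_to_keep):
    """Return glossary terms that do not appear (case-insensitively) in the translation."""
    lowered = translated_text.lower()
    return [term for term in terms_to_keep if term.lower() not in lowered]

# Excerpt of the Hindi translation printed above
sample_translation = (
    "आज, हम एक नया मॉडल लॉन्च कर रहे हैं, GPT-4 Turbo। "
    "GPT-4 Turbo 128,000 tokens तक के संदर्भ को समर्थन करता है।"
)
print(missing_terms(sample_translation, ["Turbo", "GPT", "token", "Python"]))
```

A missing term is not necessarily an error: here "Python" is reported only because the source audio never mentions it. The check is most useful when restricted to glossary terms that occur in the English transcript.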

### Step 3. Text-to-speech conversion of the written script into spoken language

OpenAI's text-to-speech (TTS) model can take the written script as input and produce spoken output with a native-sounding Hindi accent (where the script represents the Hindi language), intermingled with a native-sounding English accent (where the script represents the English language).    

![Text-to-speech](./images/Text-to-speech.png)
 

[Learn more about the TTS model here](https://platform.openai.com/docs/guides/text-to-speech)
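Longer transcripts may need to be split before synthesis. At the time of writing, the speech endpoint documents a 4096-character limit on `input`; treat that figure as an assumption and check the current API reference. The sketch below, with the hypothetical helper `split_for_tts`, packs whole sentences into chunks and treats the Devanagari danda (`।`) as a sentence terminator alongside Latin punctuation.

```python
import re

def split_for_tts(text: str, max_chars: int = 4096):
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk,
    so callers should still validate chunk lengths for unusual input.
    """
    # Split on whitespace that follows a danda or Latin sentence punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[।.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Small limit used only to demonstrate the splitting behaviour
demo_text = "पहला वाक्य। दूसरा वाक्य। Third sentence."
for chunk in split_for_tts(demo_text, max_chars=30):
    print(len(chunk), chunk)
```

Each chunk could then be sent to the speech endpoint separately and the resulting audio segments concatenated; that concatenation step is outside the scope of this sketch.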

In [5]:
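import os

output_file_path = "./sounds/output.mp3"

# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(output_file_path), exist_ok=True)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=hindi_transcription,
)

# Save the generated audio as an MP3 file
response.write_to_file(output_file_path)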

In [6]:
from playsound import playsound

playsound(output_file_path)

### Step 4. Translation benchmarking (e.g., BLEU or ROUGE) 

We can assess the quality of the translated text by comparing it to a reference translation using evaluation metrics like BLEU and ROUGE. 

**BLEU (Bilingual Evaluation Understudy)**: Measures the overlap of n-grams between the candidate and reference translations. As reported by the `sacrebleu` library used below, scores range from 0 to 100, with higher scores indicating better quality.

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Commonly used for summarization evaluation. Measures the overlap of n-grams and the longest common subsequence between the candidate and reference texts.
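To build intuition for what these metrics count, the sketch below computes two of their building blocks by hand on a pair of short sentences: the clipped n-gram precision that BLEU is built on, and the longest common subsequence that ROUGE-L is built on. It is a simplified illustration with hypothetical helper names, not a replacement for `sacrebleu` or `rouge_score`. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, and both libraries apply their own tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """BLEU-style modified precision: candidate n-gram counts clipped by reference counts."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (basis of ROUGE-L)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

reference = "the model will respond with valid JSON"
candidate = "the model will react with valid JSON"

print(clipped_precision(candidate, reference, 1))  # 6 of 7 unigrams match
print(clipped_precision(candidate, reference, 2))  # 4 of 6 bigrams match
print(lcs_length(candidate.split(), reference.split()))  # 6 tokens in common order
```

A single substituted word costs one unigram but two bigrams, which is why BLEU tends to drop faster than ROUGE-1 when wording changes.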

Ideally, a reference translation (a human-translated version) of the original text is needed for an accurate evaluation. However, developing such evaluations can be challenging, as it requires time and effort from bilingual humans proficient in both languages.

An alternative is to translate the output audio file from the target language back into the original language and compare the result with the original text. The Whisper API provides a `translations` endpoint that translates audio into English.

In [7]:
# Translate the audio output file generated by TTS into English text using Whisper

with open(output_file_path, "rb") as audio_file:
    transcription = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

english_re_transcription = transcription.text

print(english_re_transcription)

Welcome to our first OpenAI Dev Day. Today, we are launching a new model, GPT-4 Turbo. GPT-4 Turbo supports up to 128,000 tokens. We have a new feature called JSON mode, which ensures that the model will react with a method. You can now call multiple functions at once, and it will be great for following general instructions. You want this model to reach the world with the best knowledge possible. We want that too. That's why we are launching Retrieval on the platform. You can bring your knowledge from external documents or databases into your development. GPT-4 Turbo has the knowledge of the world up to April 22, 2023, and we will keep improving it over time. DALI 3, GPT-4 Turbo with Vision, and a new text-to-speech model is going into the API today. Today, we are launching a new program called Custom Models. With Custom Models, our researchers will work with a company so that they can use their products to create a great custom model for their special needs. Uchdar Seemai. We are doub

With the text re-translated into English from the Hindi audio, we can run the evaluation metrics by comparing it to the original English transcription.

In [8]:
import sacrebleu
from rouge_score import rouge_scorer

# We'll use the GPT-corrected English transcription (third pass) as the reference text
reference_text = transcription_english_third_pass

candidate_text = english_re_transcription

# BLEU Score Evaluation
bleu = sacrebleu.corpus_bleu([candidate_text], [[reference_text]])
print(f"BLEU Score: {bleu.score}")

# ROUGE Score Evaluation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_text, candidate_text)
print(f"ROUGE-1 Score: {scores['rouge1'].fmeasure}")
print(f"ROUGE-L Score: {scores['rougeL'].fmeasure}")


BLEU Score: 47.553973306995736
ROUGE-1 Score: 0.8320463320463322
ROUGE-L Score: 0.7297297297297297


### Step 5. Interpret and improve scores by adjusting prompting parameters in steps 1-3 as needed

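In this example, the BLEU score of about 47.6 falls in the "very good" band of the scale below, and the ROUGE-1 (0.83) and ROUGE-L (0.73) scores are well above the ranges usually considered good. Note that these scores measure the whole round trip (translation, text-to-speech, and Whisper's translation back into English), so errors from any of those stages lower them.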

**Interpreting BLEU Scores:** While there is no universally accepted scale, some interpretations suggest:

- 0 to 10: Poor quality translation; significant errors and lack of fluency.
- 10 to 20: Low quality; understandable in parts but contains many errors.
- 20 to 30: Fair quality; conveys the general meaning but lacks precision and fluency.
- 30 to 40: Good quality; understandable and relatively accurate with minor errors.
- 40 to 50: Very good quality; accurate and fluent with very few errors.
- 50 and above: Excellent quality; closely resembles human translation.
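The bands above can be expressed as a small lookup, which is handy when scoring many files. This is a sketch with hypothetical names (`BLEU_BANDS`, `interpret_bleu`). Because the listed ranges share their endpoints, it assumes that a boundary value such as 40 belongs to the higher band.

```python
# Upper bound (exclusive) and label for each informal band listed above
BLEU_BANDS = [
    (10, "Poor"),
    (20, "Low"),
    (30, "Fair"),
    (40, "Good"),
    (50, "Very good"),
]

def interpret_bleu(score: float) -> str:
    """Map a 0-100 BLEU score to one of the informal quality bands."""
    for upper_bound, label in BLEU_BANDS:
        if score < upper_bound:
            return label
    return "Excellent"

# The score obtained in Step 4
print(interpret_bleu(47.553973306995736))
```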

**Interpreting ROUGE scores:** What counts as a "good" ROUGE score varies with the task, dataset, and domain. As rough guidelines:

- ROUGE-1 (unigram overlap): Scores between 0.5 and 0.6 are generally considered good for abstractive summarization tasks.
- ROUGE-L (longest common subsequence): Scores of around 0.4 to 0.5 are often regarded as good, reflecting the model's ability to capture the structure of the reference text.

If the score for your translation is unsatisfactory, consider the following questions:

#### 1. Is the source audio accurately transcribed? 
Revisit `glossary_of_transcription_errors` and add any terms that were transcribed incorrectly.

#### 2. Is the transcribed text free of grammatical errors?
Consider adding a post-processing step with a GPT model to refine the transcription by correcting grammatical mistakes and adding appropriate punctuation.

#### 3. Are there words that make sense to keep in the original language?
There may be new terms or concepts for which a translation in the target language does not exist or is not widely understood. Revisit `glossary_of_terms_to_keep_in_original_language` and add such terms.

### Conclusion

To recap, this cookbook provides a step-by-step process for translating and dubbing audio from one language to another, making content more accessible to a global audience. The example we used is the voice translation of an audio file from English to Hindi. 

The steps are as follows:

**1. Transcription**: Using OpenAI's Whisper to transcribe the English audio into text.

**2. Translation**: Converting the transcribed English text into Hindi, specifically in the Devanagari script.

**3. Text-to-Speech**: Transforming the translated text into spoken Hindi using OpenAI's text-to-speech model.

**4. Benchmarking**: Evaluating the quality and accuracy of the translation with metrics like BLEU or ROUGE, and refining the results by adjusting parameters throughout the process.

This guide also clarifies the distinction between "language" and "script," which are often used interchangeably but have specific meanings critical to the translation task. Language refers to the spoken or written system of communication, while script refers to the characters used to write the language. Understanding this distinction is key to effectively translating and dubbing content.

By using the techniques outlined in this cookbook, you can translate and dub a wide range of content—such as podcasts, training videos, tutorials, and even full-length films—into multiple languages. This method can be applied across various industries, from entertainment and education to business and global communication efforts, enabling creators to expand their reach to diverse linguistic audiences.