# [text gen] Plain Text Generation with Phi-3.5
This sample demonstrates how to generate a plain text file for training a custom speech model.

> ✨ ***Note*** <br>
> Please check the custom speech support for each language before you get started - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt#:~:text=Custom%20speech%20support 


## Prerequisites
Clone the repository to your local machine:

```bash
git clone https://github.com/hyogrin/Azure_OpenAI_samples.git
```

* A subscription key for the Speech service. See [Try the speech service for free](https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started).
* Python 3.10 or later must be installed, to match the virtual environment configured below. Downloads are available [here](https://www.python.org/downloads/).
* The Python Speech SDK package is available for Windows (x64 or x86) and Linux (x64; Ubuntu 16.04 or Ubuntu 18.04).
* On Ubuntu 16.04 or 18.04, install the required packages with the following commands:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl3 libasound2
  ```
* On Debian 9, install the required packages with the following commands:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.2 libasound2
  ```
* On Windows, you need the [Microsoft Visual C++ Redistributable for Visual Studio 2017](https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads) for your platform.

Configure a Python virtual environment (3.10 or later) in Visual Studio Code:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for **Python: Create Environment**.
 1. Select Venv or Conda and choose where to create the new environment.
 1. Select the Python interpreter version (3.10 or later).

Then install the required packages:

```bash
pip install -r requirements.txt
```

Create a `.env` file based on the `.env-sample` file. Copy the new `.env` file to the folder containing your notebook and update the variables.
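
A quick check that the variables were actually filled in can save a confusing failure later. The sketch below is only an illustration: the helper name `missing_env_vars` is hypothetical, and the variable names are the ones the notebook's first cell reads with `os.getenv`. It assumes `load_dotenv()` has already been called, so the values from `.env` are present in `os.environ`.

```python
import os

# Variable names read by the notebook's first cell
REQUIRED_VARS = [
    "AZURE_AI_SPEECH_API_KEY",
    "AZURE_AI_SPEECH_REGION",
    "AZURE_PHI3.5_ENDPOINT",
    "AZURE_PHI3.5_API_KEY",
    "AZURE_PHI3.5_DEPLOYMENT_NAME",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or contain only whitespace."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing or empty variables:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the helper easy to try out without touching the real environment.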

## 1. Generate synthetic dataset with Phi-3.5

### Set up the environment variables

In [2]:
import azure.cognitiveservices.speech as speechsdk
import os
import time
import json
from dotenv import load_dotenv
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

load_dotenv()


speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")

phi_api_endpoint = os.getenv("AZURE_PHI3.5_ENDPOINT")
phi_api_key = os.getenv("AZURE_PHI3.5_API_KEY")
phi_deployment_name = os.getenv("AZURE_PHI3.5_DEPLOYMENT_NAME")

try:
    client = ChatCompletionsClient(
        # Use the base models endpoint from .env, e.g. https://aoai-services1.services.ai.azure.com/models.
        # The full URL ending in /chat/completions?api-version=2024-05-01-preview returns a 500 error.
        endpoint=phi_api_endpoint,
        credential=AzureKeyCredential(phi_api_key),
    )
except (ValueError, TypeError) as e:
    print(e)

phi_deployment_name

'Phi-3.5-MoE-instruct'
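
The note in the cell above says that the full chat-completions URL produces a 500 error, while the base URL ending in `/models` works. If the endpoint is copied from a portal page, it may arrive in the longer form. The sketch below trims it using only the standard library; the helper name `normalize_models_endpoint` is hypothetical, and it assumes the only things to remove are a trailing `/chat/completions` segment and the query string.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_models_endpoint(url: str) -> str:
    """Drop the query string and a trailing /chat/completions path segment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    suffix = "/chat/completions"
    if path.endswith(suffix):
        path = path[: -len(suffix)]
    # Rebuild the URL without query or fragment
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


full_url = (
    "https://aoai-services1.services.ai.azure.com/models/chat/completions"
    "?api-version=2024-05-01-preview"
)
print(normalize_models_endpoint(full_url))
```

A URL that already ends in `/models` passes through unchanged, apart from any trailing slash.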

### Prepare the prompt to generate text

In [3]:
import os
from azure.ai.inference.models import SystemMessage
from azure.ai.inference.models import UserMessage

topic = """
LG Electronics call center Q&A: expected spoken utterances in Vietnamese and English.
"""
question = """
Create 10 lines of JSONL on the topic in Vietnamese and English. JSONL format is required. Use no as the number key and vi-VN, en-US as the language keys.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result. 
"""


response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized. 
        Use text data that's close to the expected spoken utterances. The number of utterances per line should be 1.
        """),

       
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    # Change the model name as needed, for example "Phi-3.5-mini-instruct" or "Phi-3.5-vision-instruct"
    model=phi_deployment_name, 
    temperature=0.8,
    max_tokens=2048,
    top_p=0.1
)

content = response.choices[0].message.content
print(content)
print("Usage Information:")
#print(f"Cached Tokens: {response.usage.prompt_tokens_details.cached_tokens}") #only o1 models support this
print(f"Completion Tokens: {response.usage.completion_tokens}")
print(f"Prompt Tokens: {response.usage.prompt_tokens}")
print(f"Total Tokens: {response.usage.total_tokens}")

 {"no":1,"vi-VN":"Chào bộ phận chăm sóc khách hàng LG, tôi muốn hỏi về sản phẩm của công ty.", "en-US":"Hello LG customer service, I would like to inquire about the company's products."}
{"no":2,"vi-VN":"Tôi gặp vấn đề với máy giặt LG, có thể bạn giúp tôi khắc phục không?", "en-US":"I'm having an issue with my LG washing machine, can you help me fix it?"}
{"no":3,"vi-VN":"Làm thế nào để tôi tắt chế độ chạy tự động trên máy giặt LG?", "en-US":"How do I turn off the automatic wash mode on my LG washing machine?"}
{"no":4,"vi-VN":"Tôi muốn biết thời gian bảo hành của máy lạnh LG là bao lâu?", "en-US":"I would like to know how long the warranty period is for the LG air conditioner."}
{"no":5,"vi-VN":"Tôi muốn đặt lịch bảo dưỡng cho máy lạnh LG, có thể bạn giúp tôi không?", "en-US":"I would like to schedule a maintenance appointment for my LG air conditioner, can you help me?"}
{"no":6,"vi-VN":"Tôi muốn biết cách thay lốp xe máy LG, có thể bạn giúp tôi không?", "en-US":"I would like to know

In [4]:
synthetic_text_file = "cc_support_expressions.jsonl"
with open(synthetic_text_file, 'w', encoding='utf-8') as f:
    f.write(content)

%store synthetic_text_file

Stored 'synthetic_text_file' (str)
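
Model output does not always follow the requested format, so it may be worth checking the generated JSONL before relying on it. The following is a minimal sketch, not part of the original notebook: the function name `validate_jsonl` is hypothetical, and the required keys (`no`, `vi-VN`, `en-US`) are taken from the prompt above. It skips blank lines and stray Markdown fences, and reports lines that are not valid JSON or lack a key.

```python
import json

REQUIRED_KEYS = ("no", "vi-VN", "en-US")
FENCE = "`" * 3  # Markdown code fence the model may add despite the prompt


def validate_jsonl(text, required_keys=REQUIRED_KEYS):
    """Return (records, problems); problems holds (line_number, message) pairs."""
    records, problems = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith(FENCE):
            continue  # ignore blank lines and stray fences
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "not a JSON object"))
            continue
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((line_no, f"missing keys: {missing}"))
        else:
            records.append(record)
    return records, problems


sample = '{"no":1,"vi-VN":"a","en-US":"b"}\n\n{"no":2,"vi-VN":"c"}\n{"no":3,"vi-VN":"x", "en-US":"cut off'
records, problems = validate_jsonl(sample)
print(len(records), [line_no for line_no, _ in problems])
```

In the notebook, calling `validate_jsonl(content)` before writing the file would show whether any of the generated lines need to be regenerated.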


## 2. Create a plain text file to train custom speech models

In [5]:
import datetime

languages = ['vi-VN']  # Languages to export as plain text files
output_dir = "plain_text"

os.makedirs(output_dir, exist_ok=True)

# Use one timestamp for the whole run so every line for a language lands in the same file
timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")

with open(synthetic_text_file, 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        try:
            expression = json.loads(line)
            for lang in languages:
                text = expression[lang]
                file_name = f"{lang}_{timestamp}.txt"
                with open(os.path.join(output_dir, file_name), 'a', encoding='utf-8') as plain_text:
                    plain_text.write(f"{text}\n")
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON on line: {line}")
            print(e)

In [6]:
output_dir = "plain_text"  # Redefine output_dir if necessary

with open(os.path.join(output_dir, file_name), 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)

Chào bộ phận chăm sóc khách hàng LG, tôi muốn hỏi về sản phẩm của công ty.
Tôi gặp vấn đề với máy giặt LG, có thể bạn giúp tôi khắc phục không?
Làm thế nào để tôi tắt chế độ chạy tự động trên máy giặt LG?
Tôi muốn biết thời gian bảo hành của máy lạnh LG là bao lâu?
Tôi muốn đặt lịch bảo dưỡng cho máy lạnh LG, có thể bạn giúp tôi không?
Tôi muốn biết cách thay lốp xe máy LG, có thể bạn giúp tôi không?
Tôi muốn biết cách sạc pin cho điện thoại LG, có thể bạn giúp tôi không?
Tôi muốn biết cách khôi phục dữ liệu cho máy tính bảng LG, có thể bạn giúp tôi không?
Tôi muốn biết cách cài đặt ứng dụng trên điện thoại LG, có thể bạn giúp tôi không?
Tôi muốn biết cách khởi động lại máy tính bảng LG, có thể bạn giúp tôi không?
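
Since the prompt asks for one utterance per line, a light clean-up pass before training can help keep the file tidy. The sketch below is an assumption about what such a pass might look like, not a requirement stated by the sample: the helper name `clean_utterances` is hypothetical. It collapses repeated whitespace, drops empty lines, and removes exact duplicates while keeping the original order.

```python
def clean_utterances(lines):
    """Normalize whitespace, drop empty lines, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        text = " ".join(line.split())  # trims and collapses internal whitespace
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


raw_lines = [
    "  Tôi gặp vấn đề với   máy giặt LG.\n",
    "\n",
    "Tôi gặp vấn đề với máy giặt LG.\n",
    "Tôi muốn đặt lịch bảo dưỡng.\n",
]
for utterance in clean_utterances(raw_lines):
    print(utterance)
```

The cleaned list can be written back with one utterance per line, in the same way the cell above writes the plain text file.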



### Store the generated text file path in the variable `plain_text_path` 

In [8]:
plain_text_path = os.path.join(output_dir, file_name)
%store plain_text_path

Stored 'plain_text_path' (str)
