# Enhancing Whisper transcriptions: pre- and post-processing techniques

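This notebook offers a guide to improving Whisper's transcriptions. We'll streamline the audio data through trimming and segmentation, which improves Whisper's transcription quality. After transcription, we'll refine the output by adding punctuation, adjusting product terminology (e.g., 'five two nine' to '529'), and mitigating Unicode issues. These strategies help improve the clarity of your transcriptions, but keep in mind that customizing them for your specific use case may also be beneficial.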



## Setup

To get started, let's import a few different libraries:

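- [PyDub](http://pydub.com/) is a simple and easy-to-use Python library for audio processing tasks such as slicing, concatenating, and exporting audio files.

- The `Audio` class from the `IPython.display` module lets you create an audio control that plays sound in Jupyter notebooks, providing a straightforward way to play audio data directly in your notebook.

- For our audio file, we'll use a fictional earnings call written by ChatGPT and read aloud by the author. This audio file is relatively short, but it should give you an illustrative idea of how these pre- and post-processing steps can be applied to any audio file.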

In [1]:
import openai
import os
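import urllib.request  # import the submodule explicitly; `import urllib` alone does not guarantee urllib.request is loaded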
from IPython.display import Audio
from pathlib import Path
from pydub import AudioSegment

In [2]:
# set download paths
EarningsCall_remote_filepath = "https://cdn.openai.com/API/examples/data/EarningsCall.wav"

# set local save locations
EarningsCall_filepath = "data/EarningsCall.wav"

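# make sure the local directory exists, then download the example audio file and save it locally
os.makedirs(os.path.dirname(EarningsCall_filepath), exist_ok=True)
urllib.request.urlretrieve(EarningsCall_remote_filepath, EarningsCall_filepath)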


('data/EarningsCall.wav', <http.client.HTTPMessage at 0x11a92fbe0>)

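At times, a long silence at the beginning of a file can cause Whisper to transcribe the audio incorrectly. We'll use Pydub to detect and trim that leading silence.

Here, the silence threshold is set to -20 dBFS (decibels relative to full scale); chunks quieter than this are treated as silence. You can change this value if you like.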

In [3]:
# Function to detect leading silence
# Returns the number of milliseconds until the first sound (chunk averaging more than X decibels)
def milliseconds_until_sound(sound, silence_threshold_in_decibels=-20.0, chunk_size=10):
    trim_ms = 0  # ms

    assert chunk_size > 0  # to avoid infinite loop
    # check the bounds first so we never measure an empty slice past the end of the audio
    while trim_ms < len(sound) and sound[trim_ms:trim_ms+chunk_size].dBFS < silence_threshold_in_decibels:
        trim_ms += chunk_size

    return trim_ms
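
To make the threshold concrete, here is a minimal sketch of the same leading-silence search on a plain list of samples, with no audio library involved. It assumes 16-bit PCM (a full scale of 32768) and the usual definition of dBFS as 20·log10(RMS / full scale), which is our understanding of what pydub's `dBFS` reports. The names `dbfs` and `leading_silence_ms` are illustrative and are not part of pydub or of this notebook.

```python
import math

MAX_AMPLITUDE = 32768  # assumed full scale for 16-bit signed PCM


def dbfs(samples):
    """Return the RMS level of `samples` in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / MAX_AMPLITUDE)


def leading_silence_ms(samples, sample_rate, threshold_db=-20.0, chunk_ms=10):
    """Count milliseconds until the first chunk at or above the threshold."""
    chunk_len = sample_rate * chunk_ms // 1000
    assert chunk_len > 0  # avoid an infinite loop
    trim_ms = 0
    start = 0
    while start < len(samples) and dbfs(samples[start:start + chunk_len]) < threshold_db:
        trim_ms += chunk_ms
        start += chunk_len
    return trim_ms


# Synthetic signal: 50 ms of digital silence, then 50 ms of a 440 Hz tone
# at half of full scale, sampled at 8 kHz.
rate = 8000
n = rate * 50 // 1000
silence = [0] * n
tone = [int(0.5 * MAX_AMPLITUDE * math.sin(2 * math.pi * 440 * k / rate)) for k in range(n)]
signal = silence + tone

print(round(dbfs(tone), 1))             # level of the tone in dBFS
print(leading_silence_ms(signal, rate))  # milliseconds of leading silence
```

On this scale, -20 dBFS corresponds to an RMS level of 10% of full scale, so a quiet recording may need a lower threshold (for example `-40.0`) to avoid trimming soft speech.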

In [4]:
def trim_start(filepath):
    path = Path(filepath)
    directory = path.parent
    filename = path.name
    audio = AudioSegment.from_file(filepath, format="wav")
    start_trim = milliseconds_until_sound(audio)
    trimmed = audio[start_trim:]
    new_filename = directory / f"trimmed_{filename}"
    trimmed.export(new_filename, format="wav")
    return trimmed, new_filename
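
The output location used by `trim_start` can be checked without touching any audio. The sketch below reproduces only its path logic; `trimmed_path` is a hypothetical helper introduced for illustration.

```python
from pathlib import Path


def trimmed_path(filepath):
    """Return the path trim_start writes to: same directory, 'trimmed_' prefix."""
    path = Path(filepath)
    return path.parent / f"trimmed_{path.name}"


print(trimmed_path("data/EarningsCall.wav"))
```

Note that the second value returned by `trim_start` is a `Path` object rather than a string.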

In [5]:
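# Transcribe one audio file with Whisper and return the text
# (this uses the pre-1.0 `openai` Python SDK interface)
def transcribe_audio(file, output_dir):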
    audio_path = os.path.join(output_dir, file)
    with open(audio_path, 'rb') as audio_data:
        transcription = openai.Audio.transcribe("whisper-1", audio_data)
        return transcription['text']

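Occasionally, Unicode characters are injected into transcripts; removing any non-ASCII characters helps mitigate this issue.

Keep in mind that you should not use this function if you are transcribing in Greek, Cyrillic, Arabic, Chinese, or any other non-Latin script.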

In [6]:
# Define function to remove non-ascii characters
def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i)<128)
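
The short sketch below shows what this filter does to a few inputs and why the warning above matters. `fold_to_ascii` is a hypothetical variant, not part of the notebook: it decomposes accented letters with `unicodedata.normalize` before filtering, so the base letter survives. Whether that is preferable depends on your data.

```python
import unicodedata


def remove_non_ascii(text):
    """Same filter as in the notebook: keep only code points below 128."""
    return ''.join(ch for ch in text if ord(ch) < 128)


def fold_to_ascii(text):
    """Hypothetical variant: split accented letters into base + accent first."""
    decomposed = unicodedata.normalize("NFKD", text)
    return ''.join(ch for ch in decomposed if ord(ch) < 128)


sample = "Caf\u00e9 revenue was \u201cstellar\u201d"   # accented e and curly quotes
print(remove_non_ascii(sample))   # the accented letter and the curly quotes are dropped
print(fold_to_ascii(sample))      # the accent is dropped but the letter e is kept

cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"       # a Cyrillic word
print(repr(remove_non_ascii(cyrillic)))                 # nothing is left
```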

This function adds formatting and punctuation to our transcript. Whisper generates a transcript with punctuation but without formatting.

In [7]:
# Define function to add punctuation
def punctuation_assistant(ascii_transcript):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that adds punctuation to text. Preserve the original words and only insert necessary punctuation such as periods, commas, capitalization, symbols like dollar signs or percentage signs, and formatting. Use only the context provided. If there is no context provided say, 'No context provided'\n"
            },
            {
                "role": "user", 
                "content": ascii_transcript  
            }
        ]
    )
    return response

Our audio file is a recording of a fake earnings call that mentions many financial products. This function helps ensure that any financial product names Whisper transcribes incorrectly are corrected.

In [8]:
# Define function to fix product misspellings
def product_assistant(ascii_transcript):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system", 
                "content": "You are an intelligent assistant specializing in financial products; your task is to process transcripts of earnings calls, ensuring that all references to financial products and common financial terms are in the correct format. For each financial product or common term that is typically abbreviated as an acronym, the full term should be spelled out followed by the acronym in parentheses. For example, '401k' should be transformed to '401(k) retirement savings plan', 'HSA' should be transformed to 'Health Savings Account (HSA)', 'ROA' should be transformed to 'Return on Assets (ROA)', 'VaR' should be transformed to 'Value at Risk (VaR)', and 'PB' should be transformed to 'Price to Book (PB) ratio'. Similarly, transform spoken numbers representing financial products into their numeric representations, followed by the full name of the product in parentheses. For instance, 'five two nine' to '529 (Education Savings Plan)' and 'four zero one k' to '401(k) (Retirement Savings Plan)'. However, be aware that some acronyms can have different meanings based on the context (e.g., 'LTV' can stand for 'Loan to Value' or 'Lifetime Value'). You will need to discern from the context which term is being referred to and apply the appropriate transformation. In cases where numerical figures or metrics are spelled out but do not represent specific financial products (like 'twenty three percent'), these should be left as is. Your role is to analyze and adjust financial product terminology in the text. Once you've done that, produce the adjusted transcript and a list of the words you've changed"
            },
            {
                "role": "user", 
                "content": ascii_transcript  
            }
        ]
    )
    return response

The `trim_start` function creates a new file whose name is the original file name prefixed with 'trimmed_'.

In [9]:
# Trim the start of the original audio file; trim_start returns both the trimmed audio and the new file path
trimmed_audio, trimmed_filename = trim_start(EarningsCall_filepath)


Our fake earnings call audio file is fairly short, so we'll choose the segment length accordingly. Keep in mind that you can adjust the segment length as needed.

In [11]:
# Segment audio
trimmed_audio = AudioSegment.from_wav(trimmed_filename)  # Load the trimmed audio file

one_minute = 1 * 60 * 1000  # Duration for each segment (in milliseconds)

start_time = 0  # Start time for the first segment

i = 0  # Index for naming the segmented files

output_dir_trimmed = "TrimmedEarningsDirectory"  # Output directory for the segmented files

if not os.path.isdir(output_dir_trimmed):  # Create the output directory if it does not exist
    os.makedirs(output_dir_trimmed)

while start_time < len(trimmed_audio):  # Loop over the trimmed audio file
    segment = trimmed_audio[start_time:start_time + one_minute]  # Extract a segment
    segment.export(os.path.join(output_dir_trimmed, f"trimmed_{i:02d}.wav"), format="wav")  # Save the segment
    start_time += one_minute  # Update the start time for the next segment
    i += 1  # Increment the index for naming the next file
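
To see which segments the loop above produces for a given recording length, the boundaries can be computed on their own. This is a sketch with a hypothetical helper, `segment_bounds`, and an assumed recording length of 150 seconds; it clamps the last segment explicitly, whereas the loop above relies on slicing past the end of the audio simply returning the remainder.

```python
def segment_bounds(total_ms, segment_ms):
    """Return (start, end) millisecond pairs that cover [0, total_ms) in order."""
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


one_minute = 1 * 60 * 1000
bounds = segment_bounds(150_000, one_minute)  # assumed 2.5-minute recording

for i, (start, end) in enumerate(bounds):
    print(f"trimmed_{i:02d}.wav", start, end)
```

The final segment is shorter than the others whenever the recording length is not a multiple of the segment length.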


In [12]:
# Get list of trimmed and segmented audio files and sort them numerically
audio_files = sorted(
    (f for f in os.listdir(output_dir_trimmed) if f.endswith(".wav")),
    key=lambda f: int(''.join(filter(str.isdigit, f)))
)
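
The numeric sort key matters once file names are no longer zero-padded to the same width. The sketch below uses made-up file names to compare the default lexicographic order with the digit-based key used above; `numeric_key` is just a named version of that lambda.

```python
names = ["trimmed_10.wav", "trimmed_2.wav", "trimmed_1.wav"]  # made-up, unpadded names


def numeric_key(filename):
    """Join every digit in the file name and interpret the result as an integer."""
    return int(''.join(filter(str.isdigit, filename)))


lexicographic = sorted(names)
numeric = sorted(names, key=numeric_key)

print(lexicographic)  # "trimmed_10.wav" sorts before "trimmed_2.wav"
print(numeric)        # segments in playback order
```

Because the key joins every digit in the name, it would also pick up digits in an extension such as `.mp3`; with `.wav` files and a single index in the name, as here, that is not an issue.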

In [13]:
# Use a loop to apply the transcribe function to all audio files
transcriptions = [transcribe_audio(file, output_dir_trimmed) for file in audio_files]

In [14]:
# Concatenate the transcriptions
full_transcript = ' '.join(transcriptions)

In [15]:
print(full_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 bill

In [16]:
# Remove non-ascii characters from the transcript
ascii_transcript = remove_non_ascii(full_transcript)

In [17]:
print(ascii_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 bill

In [18]:
# Use punctuation assistant function
response = punctuation_assistant(ascii_transcript)

In [19]:
# Extract the punctuated transcript from the model's response
punctuated_transcript = response['choices'][0]['message']['content']

In [20]:
print(punctuated_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 b
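
The system prompt asks the model to preserve the original words. A rough way to check that it did is to compare the two texts after stripping case and punctuation. This is a sketch under that simple normalization, not a full diff; `normalized_words` and `words_preserved` are hypothetical helpers, and the two sample strings are shortened stand-ins for the real transcripts.

```python
import re


def normalized_words(text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def words_preserved(before, after):
    """True if both texts contain the same word sequence, ignoring case and punctuation."""
    return normalized_words(before) == normalized_words(after)


before = "we've had a stellar q2 with a revenue of 125 million a 25% increase"
after = "We've had a stellar Q2, with a revenue of $125 million, a 25% increase."

print(words_preserved(before, after))
print(words_preserved("net income rose", "Net income fell."))
```

In the notebook this could be called as `words_preserved(ascii_transcript, punctuated_transcript)`; a `False` result only signals that some word changed, so the two texts would still need to be compared by eye.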

In [21]:
# Use product assistant function
response = product_assistant(punctuated_transcript)

In [22]:
# Extract the final transcript from the model's response
final_transcript = response['choices'][0]['message']['content']

In [23]:
print(final_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar second quarter (Q2) with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA) has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in Collateralized Debt Obligations (CDOs), and Residential Mortgage-Backed Securities (RMBS). We've also invested $25 million in AAA rated corporate 