# Speech-to-Text and Summarization Workflow

In this notebook, we will explore a practical workflow for converting speech to text and summarizing the generated transcripts using HuggingFace models. This process combines automatic speech recognition (ASR) and text summarization, demonstrating how Generative AI can handle audio-to-text workflows efficiently.

### Outline

In this walkthrough, we will:

1. **Read in an Audio File:** Load an audio file for transcription. We recommend a publicly available audio sample, such as one from the LibriSpeech corpus hosted by [Open Speech and Language Resources](https://www.openslr.org/12/). HuggingFace also provides some common benchmarks through its `datasets` package.
2. **Stand Up an Automatic Speech Recognition (ASR) Pipeline:** Use HuggingFace's `automatic-speech-recognition` pipeline for transcription.
3. **Generate and Save the Transcript:** Transcribe the audio file and save the output as a text file.
4. **Read and Explore the Transcript:** Load the transcript, read a sample, and prepare it for summarization.
5. **Summarize the Transcript:** Stand up a text-generation pipeline with the `tiiuae/Falcon3-3B-Instruct` model and summarize the transcript.
6. **Evaluate the Summary:** Compute the BERTScore to evaluate the quality of the generated summary.

By the end of this notebook, you will have seen how to integrate ASR and summarization into a single workflow.

## Configure the Environment

In [None]:
! pip install transformers
! pip install datasets
! pip install bert-score
! pip install soundfile
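
After installation, it can help to confirm that the packages are importable before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and note that the `bert-score` package is imported as `bert_score`.

```python
import importlib.util

# Import names for the packages installed above.
# The pip name "bert-score" corresponds to the import name "bert_score".
REQUIRED_MODULES = ["transformers", "datasets", "bert_score", "soundfile"]


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```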

## Read in an Audio File

__Prompt__: Provide Python code to read in an audio file `"2902-9008-0000.flac"` using the `soundfile` package, and play the file within a Jupyter notebook.

In [1]:
import soundfile as sf

audio, sampling_rate = sf.read("2902-9008-0000.flac") 


In [None]:
from IPython.display import Audio
# Play the audio
Audio(data=audio, rate=sampling_rate)
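
Before transcribing, it is often useful to check the clip's duration, channel count, and sampling rate. The sketch below assumes the array layout returned by `sf.read` (shape `(frames,)` for mono, `(frames, channels)` otherwise); the helper `describe_audio` is a hypothetical name, and the example input is synthetic silence rather than the real file.

```python
import numpy as np


def describe_audio(audio, sampling_rate):
    """Summarize a decoded audio array: duration, channel count, and dtype."""
    audio = np.asarray(audio)
    # soundfile returns shape (frames,) for mono and (frames, channels) otherwise.
    channels = 1 if audio.ndim == 1 else audio.shape[1]
    return {
        "duration_s": audio.shape[0] / sampling_rate,
        "channels": channels,
        "sampling_rate": sampling_rate,
        "dtype": str(audio.dtype),
    }


# Synthetic stand-in for the notebook's `audio`: two seconds of silence at 16 kHz.
example = describe_audio(np.zeros(32000), 16000)
print(example)
```

In the notebook, the same helper could be called as `describe_audio(audio, sampling_rate)`.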


## Stand Up an Automatic Speech Recognition (ASR) Pipeline

__Prompt__: Provide Python code to set up an ASR pipeline using HuggingFace.


In [None]:
from transformers import pipeline

# Initialize the ASR pipeline
asr_pipeline = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

## Generate and Save the Transcript

__Prompt__: Provide Python code to transcribe the audio file read in by this code: `audio, sampling_rate = sf.read("2902-9008-0000.flac")`

In [None]:
# Transcribe the audio file
transcription = asr_pipeline(audio)["text"]

print(transcription)
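
Passing a bare array gives the pipeline no information about the sampling rate, and `facebook/wav2vec2-base-960h` was trained on 16 kHz audio. The pipeline also accepts a dict with `"raw"` and `"sampling_rate"` keys, and it expects a single-channel signal. The sketch below prepares such an input; `to_asr_input` is a hypothetical helper, averaging the channels is one simple choice for downmixing, and the example clip is synthetic.

```python
import numpy as np


def to_asr_input(audio, sampling_rate):
    """Build a dict input for the ASR pipeline from a decoded audio array."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Average the channels to get the single-channel signal the pipeline expects.
        audio = audio.mean(axis=1)
    return {"raw": audio, "sampling_rate": sampling_rate}


# Synthetic stereo clip standing in for real audio: one second at 44.1 kHz.
stereo = np.zeros((44100, 2))
asr_input = to_asr_input(stereo, 44100)
print(asr_input["raw"].shape, asr_input["sampling_rate"])
```

In the notebook, this would be used as `asr_pipeline(to_asr_input(audio, sampling_rate))["text"]`. When the sampling rate differs from what the model expects, the pipeline may need additional audio packages installed to resample.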


__Prompt__: Provide Python code to transcribe all `.flac` files in the current directory, append their transcriptions, and save them to a single text file.

In [None]:
import os

# Initialize the transcript file
transcript_file = "transcript.txt"

# Open the transcript file for writing
with open(transcript_file, "w") as f:
    # Iterate through all .flac files in the current directory in sorted order
    # (os.listdir() returns entries in arbitrary order)
    for file_name in sorted(os.listdir()):
        if file_name.endswith(".flac"):
            print(f"Processing: {file_name}")
            # Transcribe the audio file
            audio, sampling_rate = sf.read(file_name) 
            transcription = asr_pipeline(audio)["text"]
            # Append the transcription to the file with a newline
            f.write(transcription + "\n")

print(f"All transcriptions saved to {transcript_file}")
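
The same loop can be written as a reusable function that takes the transcription step as an argument, which makes it possible to try the file handling without loading a model. This is a sketch under stated assumptions: `transcribe_directory` and its parameters are hypothetical names, and the demo uses empty placeholder files with a fake transcriber.

```python
import tempfile
from pathlib import Path


def transcribe_directory(directory, transcribe_fn, output_path, pattern="*.flac"):
    """Transcribe matching files in sorted order and write one transcript per line."""
    audio_paths = sorted(Path(directory).glob(pattern))
    with open(output_path, "w", encoding="utf-8") as out:
        for path in audio_paths:
            out.write(transcribe_fn(path) + "\n")
    return [path.name for path in audio_paths]


# Demo with empty placeholder files and a fake transcriber (no model involved).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.flac", "a.flac"):
        (Path(tmp) / name).touch()
    out_file = Path(tmp) / "transcript.txt"
    processed = transcribe_directory(tmp, lambda p: p.stem.upper(), out_file)
    demo_lines = out_file.read_text(encoding="utf-8").splitlines()

print(processed, demo_lines)
```

With the real pipeline, the transcriber could be `lambda p: asr_pipeline(sf.read(str(p))[0])["text"]`.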

In [None]:
transcription

## Read and Explore the Transcript

__Prompt__: Provide Python code to read a transcript file called "transcript.txt" and print some sentences.

In [None]:
with open(transcript_file, "r") as f:
    transcript_text = f.read()

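# Print the first few transcriptions. wav2vec2 output has no punctuation,
# so split on the newlines written between files rather than on periods.
print("Sample from the transcript:")
print("\n".join(transcript_text.splitlines()[:5]))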
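
Beyond printing a sample, a few simple statistics give a sense of how much text will be passed to the summarizer. The following sketch uses only the standard library; `transcript_stats` is a hypothetical helper and the sample text is made up for illustration.

```python
from collections import Counter


def transcript_stats(text, top_n=5):
    """Return line count, word count, and the most frequent words in a transcript."""
    lines = [line for line in text.splitlines() if line.strip()]
    words = text.lower().split()
    return {
        "lines": len(lines),
        "words": len(words),
        "top_words": Counter(words).most_common(top_n),
    }


# Made-up sample standing in for `transcript_text`.
sample_text = "THE CAT SAT ON THE MAT\nTHE DOG BARKED\n"
stats = transcript_stats(sample_text)
print(stats)
```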

## Summarize the Transcript

__Prompt__: Provide Python code to create a text-generation pipeline using the `tiiuae/Falcon3-3B-Instruct` model from HuggingFace on the 0th GPU. When creating the pipeline, specify `max_new_tokens=500` and make sure the pipeline outputs only the generated text, not the prompt.

In [None]:
from transformers import pipeline

# Define the model name
model_name = "tiiuae/Falcon3-3B-Instruct"

# Create a text-generation pipeline
text_gen_pipeline = pipeline(
    "text-generation",
    model=model_name,
    device=0,  # Use the 0th GPU
    max_new_tokens=500,
    return_full_text=False  # Ensure the output only includes the generated text
)
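
Hard-coding `device=0` fails on machines without a GPU. One option, sketched below, is to choose the device at runtime; `pick_device` is a hypothetical helper, and it relies on the pipeline convention that `-1` selects the CPU.

```python
def pick_device():
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA check to run, so fall back to CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Using device index:", device)
```

The result can then be passed as `device=pick_device()` when creating the pipeline. Expect generation with a 3B-parameter model to be much slower on CPU.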

__Prompt__: Using a HuggingFace text-generation pipeline called `text_gen_pipeline`, create a prompt with an f-string to summarize the text in `transcript_text`, and run that prompt.

In [None]:
prompt = f"Summarize the following transcript in one sentence. Please make the summary as concise as possible:\n{transcript_text}..."
summary = text_gen_pipeline(prompt)[0]["generated_text"]

print("Summary:")
print(summary)
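
A long transcript can exceed what the model handles well in a single prompt. The sketch below builds the same kind of prompt but caps the transcript at a fixed number of words; `build_summary_prompt` and the `max_words` default are hypothetical, and a word count is only a rough proxy for the model's token limit. Splitting on whitespace also collapses the line breaks between transcriptions.

```python
def build_summary_prompt(text, max_words=1500):
    """Build a one-sentence summarization prompt, truncating long transcripts."""
    words = text.split()
    truncated = len(words) > max_words
    body = " ".join(words[:max_words])
    if truncated:
        # Mark that the transcript was cut off.
        body += " ..."
    return (
        "Summarize the following transcript in one sentence. "
        "Please make the summary as concise as possible:\n"
        f"{body}"
    )


demo_prompt = build_summary_prompt("one two three four five", max_words=3)
print(demo_prompt)
```

In the notebook, the prompt would be built with `build_summary_prompt(transcript_text)` and passed to `text_gen_pipeline` as before.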

## Evaluate the Summary

__Prompt__: Provide Python code to calculate the BERTScore of a text called "summary". The original text is in a variable called "transcript_text". Please use the CPU for the embedding model.

In [None]:
from bert_score import score

# Calculate BERTScore
P, R, F1 = score(
    [summary],  # Candidate (e.g., summary)
    [transcript_text],  # Reference (e.g., transcript)
    lang="en",  # Specify language
    device="cpu",
    verbose=True  # Optional: enable verbose logging for debugging
)

# Display results
print(f"BERTScore Precision: {P.item():.4f}")
print(f"BERTScore Recall: {R.item():.4f}")
print(f"BERTScore F1: {F1.item():.4f}")
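
To interpret these numbers, it helps to see the matching step behind BERTScore: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. The sketch below reproduces only that greedy matching on toy vectors. It is not the `bert_score` implementation, which uses contextual embeddings from a pretrained model and supports options such as idf weighting; `greedy_match_scores` is a hypothetical name.

```python
import numpy as np


def greedy_match_scores(candidate_emb, reference_emb):
    """Compute greedy-matching precision, recall, and F1 from token embeddings."""
    # Normalize rows so that dot products are cosine similarities.
    candidate = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    reference = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    similarity = candidate @ reference.T  # shape (n_candidate, n_reference)
    precision = float(similarity.max(axis=1).mean())  # best match per candidate token
    recall = float(similarity.max(axis=0).mean())  # best match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings: a one-token candidate against a two-token reference.
toy_candidate = np.array([[1.0, 0.0]])
toy_reference = np.array([[1.0, 0.0], [0.0, 1.0]])
p, r, f = greedy_match_scores(toy_candidate, toy_reference)
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")
```

In this toy case precision is 1.0 and recall is 0.5, which illustrates why a short summary scored against a much longer transcript tends to show lower recall than precision.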