# Notebook Documentation: Audio Transcription and Summarization


This notebook demonstrates the process of transcribing audio files using OpenAI's Whisper model and summarizing the transcriptions. The notebook is divided into several sections, each performing specific tasks such as loading models, processing audio files, and summarizing the transcriptions.


## Sections:


### 1. Environment Setup
This section includes the necessary imports and environment configurations required for the notebook. It ensures that all dependencies are installed and the environment is ready for processing.
```python
import whisper
import os
from transformers import pipeline
```



### 2. Load Whisper Model
Here, the Whisper model from OpenAI is loaded. Whisper is an automatic speech recognition (ASR) system that can transcribe spoken language into text.
```python
model = whisper.load_model("base")
```



### 3. Load and Transcribe Audio
In this section, the audio file is loaded and transcribed into text using the Whisper model. The transcription is saved into a variable for further processing.
```python
# Load audio file
audio = whisper.load_audio("path_to_audio_file")
# Transcribe audio
result = model.transcribe(audio)
transcription = result['text']
```



### 4. Summarize Transcription
The transcription obtained in the previous section is split into smaller chunks and summarized using a transformer-based summarization pipeline.
```python
from transformers import pipeline

# Load summarization pipeline
summarizer = pipeline("summarization")

# Split transcription into chunks
chunks = [transcription[i:i + 512] for i in range(0, len(transcription), 512)]

# Summarize each chunk
summaries = [summarizer(chunk, max_length=60, min_length=5)[0]['summary_text'] for chunk in chunks]
```



### 5. Combine Summaries
The individual summaries are combined to form a coherent summary of the entire transcription.
```python
# Combine summaries
combined_summary = " ".join(summaries)
```



### 6. Display Results
Finally, the original transcription and the combined summary are displayed for comparison.
```python
print("Transcription:", transcription)
print("Summary:", combined_summary)
```


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/testing-audio/Y2meta.app-I Built a Personal Speech Recognition System for my AI Assistant-(480p).mp4
/kaggle/input/testing-audio/videoplayback.weba
/kaggle/input/audio1/ANIMAL_PEHLE BHI MAIN (Lyrical)  Ranbir KapoorTripti Dimri  Sandeep V  Vishal Mishra  Bhushan K.mp3


In [2]:
!pip install --upgrade pip
!pip install transformers jupyter 
!pip install pytube3

Collecting pip
  Downloading pip-24.0-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.0-py3-none-any.whl (2.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m42.0 MB/s[0m eta [36m0:00:00[0m00:01[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.2
    Uninstalling pip-23.3.2:
      Successfully uninstalled pip-23.3.2
Successfully installed pip-24.0
Collecting jupyter
  Downloading jupyter-1.0.0-py2.py3-none-any.whl.metadata (995 bytes)
Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Installing collected packages: jupyter
Successfully installed jupyter-1.0.0
Collecting pytube3
  Downloading pytube3-9.6.4-py3-none-any.whl.metadata (16 kB)
Downloading pytube3-9.6.4-py3-none-any.whl (38 kB)
Installing collected packages: pytube3
Successfully installed pytube3-9.6.4


In [3]:
import requests
from transformers import pipeline





2024-04-12 19:07:38.417619: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-12 19:07:38.417717: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-12 19:07:38.548484: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [4]:
# URL of the audio file
audio_url = "https://www.uclass.psychol.ucl.ac.uk/Release2/Conversation/AudioOnly/wav/F_0101_10y4m_1.wav"

# Download the audio file
response = requests.get(audio_url)
if response.status_code == 200:
    with open("audio_file.wav", "wb") as f:
        f.write(response.content)

In [5]:
# Initialize the ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3", device = 0)

config.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.90k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.07k [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

In [6]:
transcription = pipe("audio_file.wav")

# Print the transcription
print(transcription)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


{'text': " What happened after she got out of the orphanage? She found the old building where she used to live. And she went inside it. Um, and she started to have a look around. Um... What does she find in the building? Um, these are pictures on the wall. and she found this old sort of like case thing. You've forgotten about the puppy. Where did you find the puppy? After she was thrown out, a puppy came running up. And who did she meet in the palace? A boy. Two men. What were they doing? Do you remember? No. They were trying to find, they were trying to audition, weren't they? People to pretend to be her. Yep. So they could get the money from the Grandmother. Great story. Yeah. and people to pretend to be her. Yep. So they could get the money from the grandma. It's a great story. Yeah. Where are we?"}


In [7]:
# audio_file_path = "/kaggle/input/audio1/ANIMAL_PEHLE BHI MAIN (Lyrical)  Ranbir KapoorTripti Dimri  Sandeep V  Vishal Mishra  Bhushan K.mp3"

# trans2 = pipe(audio_file_path)
# print(trans2)



In [9]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe2 = pipeline("summarization", model="facebook/bart-large-cnn", device=0)

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [10]:
summary = pipe2([transcription['text']])

In [11]:
print(summary)

[{'summary_text': "After she was thrown out, a puppy came running up. Who did she meet in the palace? A boy. Two men. They were trying to find, they were tried to audition, weren't they? People to pretend to be her. So they could get the money from the grandma."}]


In [12]:
trans2 = pipe('/kaggle/input/testing-audio/videoplayback.weba')

In [13]:
print(trans2)

{'text': " Here we are again with another English test for you, this time with the movie Ratatouille. Get ready because we have 21 vocabulary questions for you to answer today. I'm sure you will... Excuse me. Hello? Ethan! Hey man, how's it going? All good, yeah, I'm just filming a lesson here for the channel. Test your English with Ratatouille, you know? Mm-hmm. Okay, so I should tell the viewers to subscribe to the channel, because every week we put out videos to help them understand their favorite movies and TV series, right? And also test their English from time to time, correct? Uh-huh. Without getting lost, yeah, without missing the jokes, and without subtitles. Got it. Will do. Thanks, man. Yeah, talk to you soon. Ethan, you know, I guess he's making sure I'm doing my job correctly here and well, So yeah, I don't have to tell you again what I just shared with him, right? I think you got the idea and the message, so please subscribe. Now let's get started with the test. Which wor

In [14]:
input_text = trans2['text']

# Split the input text into smaller chunks
max_chunk_length = 512  # Define the maximum length for each chunk
chunks = [input_text[i:i + max_chunk_length] for i in range(0, len(input_text), max_chunk_length)]

# Summarize each chunk individually
summaries = []
for chunk in chunks:
    summary = pipe2(chunk,max_length=60)
    summaries.append(summary)

# Combine the summaries if needed
# combined_summary = " ".join(summaries)



--- Logging error ---
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/opt/conda/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/opt/conda/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/opt/conda/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.10/site-packages/traitlets/config/application.py", line 104

In [15]:
for i in summaries:
    print(i[0]['summary_text'])

Ethan: "Get ready because we have 21 vocabulary questions for you to answer today" Ethan: "Test your English with Ratatouille, you know? Mm-hmm. Okay, so I should tell the viewers to subscribe to the channel, because every week we
Ethan: I guess he's making sure I'm doing my job correctly here and well, So yeah, I don't have to tell you again what I just shared with him, right? I think you got the idea and the message, so please subscribe. Thanks, man.
Dazzling refers to something that is extremely impressive or brilliant, often in a visually captivating way. Beyonce's concerts are known for their dazzling choreography and extravagant stage effects. A dazzling fireworks display can light up the night sky with bursts of color and spectacle, leaving people amazed
If something is useless, it has no value or practical application. On the other hand, if something is useful, it is valuable and helps to do something well. Knowing how to swim is a useful life skill to have. If you are what you

In [26]:
trans3 = pipe('/kaggle/input/testing-audio/Y2meta.app - I Built a Personal Speech Recognition System for my AI Assistant (128 kbps).mp3')
print(trans3)

{'text': " Speech. It's the most natural form of human communication. This is my demo of a real-time speech recognition system using deep learning. Yo, what's up world? Michael here, and today we're going to be talking about speech recognition, why it's hard, and how deep learning can help solve it. Later in this video, we're going to build our own neural network and train the speech recognition model from scratch. So humans are really good at understanding speech, so you would also think it's easy for computers to do too as well, right? Speech recognition is actually really hard for computers. Speech is essentially sound waves, which lives in the physical world with their own physical properties. For example, a person's age, gender, style, personality, accent, all affects how they speak and the physical properties of sound. A computer also got to consider the environmental noise around the speaker and the type of microphones they're using to record. So because there's so many variatio

In [27]:
input_text2 = trans3['text']

# Split the input text into smaller chunks
max_chunk_length2 = 512  # Define the maximum length for each chunk
chunks23 = [input_text2[i:i + max_chunk_length2] for i in range(0, len(input_text), max_chunk_length2)]

# Summarize each chunk individually
summaries2 = []
for chunk in chunks23:
    summary = pipe2(chunk,max_length=60,min_length=5)
    summaries2.append(summary)

# Combine the summaries if needed
# combined_summary = " ".join(summaries)

Your max_length is set to 60, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)
Your max_length is set to 60, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)
