To-do:
- Audio extraction from video file
- Find subtitles.
- Audio to text.
- Open model and change the last layers.
- Own simple models.
- Summary for topic.
- Detailed summary of big text.
- Summarizing big texts.

# Data source

We will use a [video](https://www.youtube.com/watch?v=FrDnPTPgEmk) on YouTube. This video, created by IBM Research (Generative AI for Business, IBM Think 2023, Dario Gil), is about generative AI and is published under a Creative Commons Licence ([CC BY](https://support.google.com/youtube/answer/2797468)).

- Download the [video](https://www.youtube.com/watch?v=FrDnPTPgEmk) and save it in the `data` folder.
- Copy the transcript and save it as `transcript.txt` in the same folder.


In [4]:
from IPython.display import HTML
# The YouTube video ID
video_id = "FrDnPTPgEmk"
# Creating the HTML iframe code
iframe_code = f"""
<div style="display: flex; justify-content: center;">
    <iframe width="560" height="315" src="https://www.youtube.com/embed/{video_id}" 
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; 
    encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
"""
# Displaying the YouTube video
HTML(iframe_code)

## Convert the TXT transcript into an SRT file

Convert the TXT transcript into an [SRT (SubRip Subtitle)](https://docs.fileformat.com/video/srt/) file to enable the use of subtitles in various media players, including VLC (see the [txt_to_srt module](utils/txt_to_srt.py)).

In [2]:
from utils.txt_to_srt import txt_to_srt

In [3]:
srt = txt_to_srt("data/transcript.txt", "data/transcript.srt")
print(srt[:1000])

1
00:00:09,000 --> 00:00:17,000
>> Welcome to IBM THINK 2023!

2
00:00:17,000 --> 00:00:23,000
>> AI generated art, AI generated songs.

3
00:00:23,000 --> 00:00:31,000
AI, what is that? It sure is a lot of fun. But when foundation models are applied to big business, well,

4
00:00:31,000 --> 00:00:36,000
you need to think bigger. Because AI and business needs to be held to a higher standard.

5
00:00:36,000 --> 00:00:42,000
Built to be trusted, secured, and adaptable. This isn't simple automation that is only

6
00:00:42,000 --> 00:00:48,000
trained to do one thing. This is AI that is built and focused to work across your organization.

7
00:00:48,000 --> 00:00:54,000
This isn't committing to a single system. This is hybrid ready AI that can scale across your systems.

8
00:00:54,000 --> 00:01:00,000
This isn't wondering where an answer came from. This is AI that can show its work.

9
00:01:00,000 --> 00:01:07,000
When you build AI into the core of your business, you can go so much fu
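
The output above follows the SRT layout: a running index, a `start --> end` line with `HH:MM:SS,mmm` timestamps, the caption text, and a blank line. The `txt_to_srt` module itself is not shown here, so the following is only a minimal sketch of how such a layout can be produced. It assumes the transcript has already been parsed into `(start_seconds, text)` pairs, that each cue ends where the next one begins (as the timestamps above suggest), and that the last cue lasts five seconds. The names `format_timestamp`, `cues_to_srt`, and `last_cue_duration` are hypothetical.

```python
def format_timestamp(seconds: int) -> str:
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"


def cues_to_srt(cues, last_cue_duration=5):
    """Build SRT text from a list of (start_seconds, text) pairs.

    Assumption: a cue ends where the next one starts; the final cue
    gets `last_cue_duration` seconds because it has no successor.
    """
    blocks = []
    for index, (start, text) in enumerate(cues):
        if index + 1 < len(cues):
            end = cues[index + 1][0]
        else:
            end = start + last_cue_duration
        blocks.append(
            f"{index + 1}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text}\n"
        )
    # A blank line separates consecutive SRT blocks.
    return "\n".join(blocks)


example = cues_to_srt([
    (9, ">> Welcome to IBM THINK 2023!"),
    (17, ">> AI generated art, AI generated songs."),
])
print(example)
```

Parsing the raw transcript into pairs is left out on purpose, because it depends on how the transcript text was copied.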

## Download and Install FFmpeg

[Download](https://www.ffmpeg.org/download.html) the FFmpeg multimedia framework for your operating system (macOS in our case).

On macOS, FFmpeg can be installed from the Terminal with Homebrew:

```bash
brew update
```

```bash
brew upgrade
```

```bash
brew install ffmpeg
```
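
Before moving on, it may help to confirm from Python that the `ffmpeg` executable is visible on the `PATH`. The check below is a small sketch using only the standard library; `find_executable` is a hypothetical helper name, not part of any tool used in this notebook.

```python
import shutil


def find_executable(name: str):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


ffmpeg_path = find_executable("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg was not found on PATH; install it before continuing.")
else:
    print(f"ffmpeg found at {ffmpeg_path}")
```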

## Install pydub

`pydub` is a Python library for audio processing tasks. It uses FFmpeg under the hood.

In [4]:
!pip install pydub



## Extract audio

Many ASR libraries and models expect audio in the following format (see, for example, the [whisper.cpp quick start](https://github.com/ggerganov/whisper.cpp#quick-start)):
- file format: WAV
- bit depth: 16 bit
- frame rate (sample rate): 16 kHz
- channels: 1 (mono)
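
Whether a given file meets these requirements can be checked with Python's standard `wave` module. The sketch below is illustrative only: `describe_wav`, `meets_requirements`, and `REQUIRED` are hypothetical names, and the demo file is one second of generated silence rather than real audio. Note that `wave` reads uncompressed PCM WAV files only.

```python
import os
import tempfile
import wave

# Target properties taken from the list above.
REQUIRED = {"channels": 1, "frame_rate": 16000, "bit_depth": 16}


def describe_wav(path):
    """Read a PCM WAV header and return its basic properties."""
    with wave.open(path, "rb") as wav_file:
        frame_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "frame_rate": frame_rate,
            "bit_depth": wav_file.getsampwidth() * 8,  # bytes -> bits
            "duration": wav_file.getnframes() / frame_rate,  # seconds
        }


def meets_requirements(info):
    """True if channels, frame rate and bit depth match REQUIRED."""
    return all(info[key] == value for key, value in REQUIRED.items())


# Demo: write one second of 16-bit mono silence at 16 kHz, then inspect it.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "silence.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 16000)
    info = describe_wav(demo_path)

print(info, meets_requirements(info))
```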

Extract the audio from the video using [extract_audio.py](utils/extract_audio.py):

In [1]:
from utils.extract_audio import extract_audio

In [2]:
extract_audio("data/Generative AI for business.mp4", "data/audio.wav")

{'file_name': 'data/audio.wav',
 'duration': '2060.539 s',
 'channels': 1,
 'frame_rate': '16000 Hz',
 'bit_depth': '16 bit'}
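
The `extract_audio` module is not reproduced here. As an alternative illustration, the same conversion can be expressed as a direct FFmpeg call: drop the video stream, downmix to mono, resample to 16 kHz, and encode as 16-bit PCM. This is a sketch under that assumption, not the module's actual implementation; `build_ffmpeg_command` and `extract_audio_with_ffmpeg` are hypothetical names.

```python
import shlex
import subprocess


def build_ffmpeg_command(video_path, wav_path):
    """Return the ffmpeg argument list for a 16 kHz, mono, 16-bit WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", video_path,     # input video
        "-vn",                # ignore the video stream
        "-ac", "1",           # one audio channel (mono)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        wav_path,
    ]


def extract_audio_with_ffmpeg(video_path, wav_path):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)


command = build_ffmpeg_command(
    "data/Generative AI for business.mp4", "data/audio.wav"
)
# Show the shell-quoted command without executing it.
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with the space in the video file name.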

## Audio to text

There are several strategies for converting an audio file to a text file:

1. **Cloud Services Offered by Big Tech Companies**:
   - [Google Speech-to-Text](https://cloud.google.com/speech-to-text/)
   - [IBM Watson Speech to Text](https://www.ibm.com/products/speech-to-text)
   - [Microsoft Azure Speech to Text](https://azure.microsoft.com/en-us/products/ai-services/speech-to-text)
   - [Amazon Transcribe](https://aws.amazon.com/transcribe/)
   - [Yandex SpeechKit](https://cloud.yandex.com/en/services/speechkit)
   
   These companies offer various plans with user-friendly interfaces. You can upload an audio file and receive the transcription without any additional steps. However, these services can be quite expensive for large volumes of audio. Some offer a limited amount of free minutes each month, but this is often insufficient for substantial projects. Additionally, some companies prefer not to use cloud services due to security concerns.

2. **Built-in Services in Operating Systems on Smartphones and Laptops** 

   macOS, iOS, Windows, Android, etc., have built-in options for working with audio. However, they are generally geared toward live dictation (speech-to-text) rather than transcribing existing audio files (audio-to-text). These tools are tied to specific operating systems, which can be inconvenient in some instances.

3. **Professional Transcription Software** 

   Examples include Express Scribe, InqScribe, Riverside, etc. Most of these are paid and proprietary.

4. **Various Apps and Online Services for Different Platforms** 

   Many of them have low quality or collect user data.

5. **Open Source Projects, Libraries, and Models (Most Work Offline)**:
   - [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech)
   - [OpenAI Whisper](https://github.com/openai/whisper)
   - [SeamlessM4T](https://github.com/facebookresearch/seamless_communication)
   - [Kaldi](https://github.com/kaldi-asr/kaldi)
   - [Vosk](https://github.com/alphacep/vosk-api)
   - [PocketSphinx](https://github.com/cmusphinx/pocketsphinx)
   - [Julius](https://github.com/julius-speech/julius)

   These tools are often developed by communities of researchers and engineers. They provide a cost-effective way for anyone to transcribe audio, especially since they work offline, ensuring data privacy. However, they might require technical expertise to set up and use.

6. **Fine-Tuning an Open Model**

   This involves adjusting the parameters of an existing speech-to-text model to improve its performance, often using a new dataset that's more relevant to the specific use case. It requires some machine learning expertise and resources but can lead to more accurate transcriptions.

7. **Building Your Own Model**

   For those with the necessary resources and expertise, creating a custom speech-to-text model from scratch allows for the most control over the transcription process, accommodating unique requirements and data privacy concerns. However, this is the most resource-intensive option and requires a deep understanding of machine learning and audio processing.

We will focus on points 1, 5, and 6.

Useful libraries and models for our goal:
- [SpeechRecognition library](https://github.com/Uberi/speech_recognition)
- [Hugging Face](https://github.com/huggingface)
- [wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
- [jiwer](https://jitsi.github.io/jiwer/)
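
The last link, jiwer, is a library for evaluating transcriptions with metrics such as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. The sketch below illustrates the metric in plain Python. It is not jiwer's implementation, it applies no text normalization (case and punctuation count as differences), and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edits needed to turn the reference words seen
    # so far into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


reference = "this is AI that can show its work"
hypothesis = "this is AI that can show it works"
print(word_error_rate(reference, hypothesis))
```

Here the transcript saved earlier could serve as the reference text and the output of an ASR model as the hypothesis.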