<a href="https://colab.research.google.com/github/pearl-yu/twitch_project/blob/main/streamer_action_wav2vec_video_captions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[GitHub](https://github.com/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb)
# Creating YouTube Captions with Wav2Vec

---

The Wav2Vec 2.0 model was introduced by Facebook AI in [this paper](https://arxiv.org/abs/2006.11477). Thanks to 🤗 Transformers, we can load it in seconds and build cool applications on top of it!

This notebook's aim is to serve as inspiration for just that. We will build a simple script that creates captions for YouTube videos! The notebook can be run on a CPU. If you have any questions, feel free to raise an issue at the GitHub link above.

## Setup

---

In [1]:
!pip -q install transformers 
!pip -q install youtube_dl
!pip install moviepy
!pip3 install imageio==2.4.1


In [2]:
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
from IPython.display import Audio

import moviepy.editor as mp
import torch
import librosa
import os


## Get Clip

---

Choose your favorite clip from YouTube & paste in the YouTube link. Ideally make it a short clip, as it will take some time to download. Choose the start & end seconds for the sequence whose caption you'd like to create. You can also give it a run with the default first 😊

In [None]:
# Substitute your video link below
clip = "https://www.twitch.tv/videos/1012207207"

# Substitute the start/end seconds below
start = 1
end = 60

In [None]:
# Download the clip as mp4 & rename it for usability
os.system('youtube-dl {} --recode-video mp4'.format(clip))
os.system('mv *.mp4 clip.mp4')
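
The download step can also be sketched without a shell, which avoids quoting problems when the link contains characters such as `&`. This is a minimal sketch, not the notebook's method: the helper names `build_download_command` and `download_clip` are hypothetical, and it assumes that `youtube-dl` is on the `PATH` and that the `-o` output template combined with `--recode-video mp4` leaves a file named `clip.mp4`.

```python
import subprocess


def build_download_command(url, out_stem="clip"):
    """Return the youtube-dl argument list (hypothetical helper).

    Passing a list instead of one shell string means the URL is never
    interpreted by a shell.
    """
    return [
        "youtube-dl",
        url,
        "--recode-video", "mp4",        # same flag as in the cell above
        "-o", f"{out_stem}.%(ext)s",    # name the output up front instead of `mv`
    ]


def download_clip(url, out_stem="clip"):
    """Run the download and return the expected mp4 path (assumption)."""
    # check=True raises CalledProcessError if youtube-dl exits with an error,
    # whereas os.system would only return a status code that is easy to ignore.
    subprocess.run(build_download_command(url, out_stem), check=True)
    return f"{out_stem}.mp4"
```

Calling `download_clip(clip)` with the link chosen above would then replace both `os.system` calls.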

## Model and tokenizer

---

Load the Wav2Vec model from 🤗 Transformers. See [here](https://huggingface.co/transformers/model_doc/wav2vec2.html) for the model's documentation.

In [3]:
# Load Wav2Vec from huggingface
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.



Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Extract Audio

---

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


First we'll extract the audio in mp3 format from the clip, as the Wav2Vec model expects audio input. We do this in subclips of 10 seconds each to save some memory later on.
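
To make the 10-second splitting concrete, the boundary arithmetic can be isolated into a small pure function. This is only a sketch: `segment_bounds` is a hypothetical helper name, and it mirrors the `range(start, int(end), 10)` loop used in the extraction cell further down, without touching any video file.

```python
def segment_bounds(start, end, step=10):
    """Return (segment_start, segment_end) pairs, in seconds.

    Segments begin at start, start + step, ... and the last one is
    clamped to `end`, as in the extraction loop of this notebook.
    """
    bounds = []
    for seg_start in range(start, int(end), step):
        bounds.append((seg_start, min(seg_start + step, end)))
    return bounds


# A 29.5-second clip read from second 1 yields three segments.
print(segment_bounds(1, 29.5))
```

Because the loop stops at `int(end)`, a leftover of less than one second that begins exactly on a step boundary is not emitted as its own segment.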

In [5]:
%cd /content/drive/MyDrive/twitch_data/videos

/content/drive/MyDrive/twitch_data/videos


In [7]:
start = 1
end = 1000000000

clip = mp.VideoFileClip("1013713136.mp4")
end = min(clip.duration, end)

# Save the paths for later
clip_paths = []

# Extract Audio-only from mp4
for i in range(start, int(end), 10):
  sub_end = min(i+10, end)
  sub_clip = clip.subclip(i,sub_end)

  sub_clip.audio.write_audiofile("audio_" + str(i) + ".mp3")
  clip_paths.append("audio_" + str(i) + ".mp3")

[MoviePy] Writing audio in audio_1.mp3
[MoviePy] Done.
[MoviePy] Writing audio in audio_11.mp3
[MoviePy] Done.
[MoviePy] Writing audio in audio_21.mp3
[MoviePy] Done.

In [8]:
# Play Audio 
Audio(clip_paths[0])

## Transcribe Audio

---

The last step is turning the audio into text! The Wav2Vec model does most of the job here for us. We process the 10-second clips one by one to save memory.

In [9]:
cc = ""

for path in clip_paths:
    # Load the audio with the librosa library
    input_audio, _ = librosa.load(path, sr=16000)

    # Tokenize the audio
    input_values = tokenizer(input_audio, return_tensors="pt", padding="longest").input_values

    # Feed it through Wav2Vec & choose the most probable tokens
    with torch.no_grad():
        logits = model(input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)

    # Decode & add to our caption string
    transcription = tokenizer.batch_decode(predicted_ids)[0]
    cc += transcription + " "
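
The "choose the most probable tokens, then decode" step above is greedy CTC decoding. The sketch below shows the idea in plain Python on a toy example; it is not the library's implementation. The function name `greedy_ctc_decode` and the four-symbol vocabulary are assumptions made for illustration, with `<pad>` playing the role of the CTC blank and `|` the word separator.

```python
def greedy_ctc_decode(frame_scores, vocab, blank="<pad>", word_sep="|"):
    """Greedy CTC decoding over per-frame scores (illustrative sketch)."""
    # 1. Pick the highest-scoring vocabulary index for every frame (argmax).
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    # 2. Collapse runs of the same index into a single symbol.
    collapsed = [idx for i, idx in enumerate(best)
                 if i == 0 or idx != best[i - 1]]
    # 3. Drop blanks and map the remaining indices to characters.
    chars = [vocab[idx] for idx in collapsed if vocab[idx] != blank]
    # 4. Turn the word separator into spaces.
    return "".join(chars).replace(word_sep, " ").strip()


# Toy vocabulary (hypothetical, much smaller than the real one).
vocab = ["<pad>", "|", "H", "I"]
# Frame-wise winners: H H <pad> I | H I I
frame_ids = [2, 2, 0, 3, 1, 2, 3, 3]
frame_scores = [[1.0 if j == i else 0.0 for j in range(len(vocab))]
                for i in frame_ids]
print(greedy_ctc_decode(frame_scores, vocab))  # prints "HI HI"
```

The blank symbol is what allows a doubled letter to survive: two `H` frames separated by a blank decode to `HH`, while two adjacent `H` frames collapse to a single `H`.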






In [10]:
# Here's your caption!
# Note that there may be mistakes especially if the audio is noisy or there are uncommon words
# If you picked the default video and change start to 0, you will see that the model gets confused by the word "Anakin"
print(cc)

OQUE QUET WORZON GET A LITTLE BUGGY IN WHYU REPLYIN AH AH  AH A   WAS A WOG I A IL I 
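
The caption above is one long string without timing. Since every transcription comes from a known 10-second segment, it could be written out as timed captions instead. The sketch below formats segments in the SubRip (`.srt`) layout; the helper names `format_timestamp` and `to_srt` are hypothetical, the example texts are placeholders, and it assumes each transcription has been stored together with its segment's start and end seconds.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


# Placeholder texts; in the notebook these would be the per-clip transcriptions.
example = [(1, 11, "FIRST SEGMENT"), (11, 21, "SECOND SEGMENT")]
print(to_srt(example))
```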
