<a href="https://colab.research.google.com/github/jermwatt/whisper_clipper/blob/main/whisper_clipper_walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os
from IPython.display import HTML
from base64 import b64encode

# if running in Colab, clone the repo and install requirements
if os.getenv("COLAB_RELEASE_TAG"):
    !git clone https://github.com/jermwatt/bleep_that_sht.git
    %cd bleep_that_sht
    !pip install -r requirements.txt


# embed the video as a base64 data URL so it plays inline in the notebook (including on ubuntu)
def display_video(path):
    with open(path, "rb") as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    display(
        HTML(
            """
          <video width=200 controls>
                <source src="%s" type="video/mp4">
          </video>
      """
            % data_url
        )
    )

# Whisper clipper

Effortlessly clip your favorite moments from YouTube videos with Whisper Clipper.  Simply enter a YouTube URL and a rough plain-text estimate of the spoken moment you want to clip, then sit back while your clip is generated for you.


## How it works

Whisper Clipper first uses the popular Whisper transcription model to transcribe an input YouTube video.  To make any moment capturable, this transcript is then chunked into a set of substrings of varying length, and each chunk is embedded as a vector (to enable semantic search).
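
The chunking step is not shown in this notebook, so the following is only a minimal sketch of one way it could work, assuming the transcript has already been split into words. The function name `chunk_words` and its `min_len` / `max_len` parameters are hypothetical, and the embedding step is left out entirely.

```python
def chunk_words(words, min_len=2, max_len=4):
    """Return every contiguous run of min_len..max_len words as (start, end, text)."""
    chunks = []
    for size in range(min_len, max_len + 1):
        # slide a window of `size` words across the transcript
        for start in range(len(words) - size + 1):
            end = start + size
            chunks.append((start, end, " ".join(words[start:end])))
    return chunks


words = "I love treats so much".split()
for chunk in chunk_words(words, min_len=2, max_len=3):
    print(chunk)
```

Keeping the word indices alongside each chunk's text means a matching chunk can later be mapped back to word-level timestamps.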



In [2]:
# play the *bleep* sound
from IPython.display import Audio, display

display(Audio("bleep_that_sht/bleep.mp3", autoplay=True))

- we can bleep out chosen words in any given video in 5 steps:

1.  strip the audio from the video
2.  transcribe the audio with whisper (or any transcription model), returning a timestamped transcript indicating the temporal position of each spoken word in the audio
3.  choose a set of "bleep words" to replace in the audio
4.  replace each instance of a "bleep word" with a suitably sized slice of the bleep sound above, creating a bleeped version of the audio
5.  replace the original audio in the video with this new bleeped version

# 1.  strip audio off video

- we use the following short test video for this and subsequent steps

In [3]:
# define path to short test video file
og_video_path = "data/input/bleep_test_1.mp4"

# display video
display_video(og_video_path)

- we can then use [moviepy](https://pypi.org/project/moviepy/) to extract its audio

In [4]:
from moviepy.editor import VideoFileClip


def extract_audio(*, local_file_path: str, audio_filepath: str) -> None:
    try:
        video = VideoFileClip(local_file_path)
        try:
            audio = video.audio
            # videos without an audio track have audio set to None
            if audio is not None:
                audio.write_audiofile(audio_filepath, verbose=False, logger=None)
        finally:
            # always release the video (and its audio reader), even if writing fails
            video.close()
    except Exception as e:
        raise ValueError(f"error extracting audio from video {local_file_path}, exception: {e}") from e

- leveraging this function we extract the audio from this video
- we define a local savepath for the extracted mp3

In [5]:
# extract audio
og_audio_path = "data/input/bleep_test_1.mp3"
extract_audio(local_file_path=og_video_path, audio_filepath=og_audio_path)

# 2.  transcribe the audio with whisper

- now let's use whisper to transcribe our audio
- there are many versions of whisper models out there
- we will use [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped), as it requires minimal installs and conveniently returns timestamps for each spoken word
- it also has some other nice built-in features we won't use, like voice activity detection

In [6]:
import whisper_timestamped as whisper

# we will use the base whisper model - which is accurate enough for our purposes -  but all sizes are available
model = whisper.load_model("base")
process_output = whisper.transcribe(model, og_audio_path, verbose=False)

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.

Detected language: English


100%|██████████| 3093/3093 [00:03<00:00, 816.54frames/s]


- we can extract both the complete transcript and the timestamped version from `process_output`, as shown below

In [7]:
# extract transcript and timestamped transcript from process_output
transcript = process_output["text"]
timestamped_transcript = process_output["segments"]

- let's print the transcript

In [8]:
# print the transcript
print(transcript)

 My sister would stop talking about this every time we'd mention Costa Rica. Threats is so good. Threats, treats. No context. Turns out, treats is like an f***ing sandwich that you'd only find in Costa Rica. I mean, if I can get it elsewhere, let me know, because I am a hooked. There's kind of like a thin, crumbly f***ing sandwich that's decadent than a l***ing between. It's laced with a simple There's nothing pretentious about this. Everything's simple, but it was so good that I bought five more that day. I love treats. Have you heard of treats? One has some treats. Sweet, sweet, sweet.


- and the first few words from the timestamped transcript

In [9]:
# print the first few words from the timestamped transcript
timestamped_transcript[0]["words"]

[{'text': 'My', 'start': 0.06, 'end': 0.14, 'confidence': 0.918},
 {'text': 'sister', 'start': 0.14, 'end': 0.4, 'confidence': 0.993},
 {'text': 'would', 'start': 0.4, 'end': 0.58, 'confidence': 0.595},
 {'text': 'stop', 'start': 0.58, 'end': 0.76, 'confidence': 0.462},
 {'text': 'talking', 'start': 0.76, 'end': 1.2, 'confidence': 0.991},
 {'text': 'about', 'start': 1.2, 'end': 1.42, 'confidence': 0.977},
 {'text': 'this', 'start': 1.42, 'end': 1.64, 'confidence': 0.99},
 {'text': 'every', 'start': 1.64, 'end': 1.96, 'confidence': 0.789},
 {'text': 'time', 'start': 1.96, 'end': 2.28, 'confidence': 0.992},
 {'text': "we'd", 'start': 2.28, 'end': 2.58, 'confidence': 0.652},
 {'text': 'mention', 'start': 2.58, 'end': 2.8, 'confidence': 0.956},
 {'text': 'Costa', 'start': 2.8, 'end': 3.14, 'confidence': 0.975},
 {'text': 'Rica.', 'start': 3.14, 'end': 3.46, 'confidence': 0.966}]
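
- each word carries a `confidence` score, which hints at where the transcript may be wrong (note "would stop" above, which scores noticeably lower than its neighbors)
- as a rough sketch, we could list low-confidence words to review before choosing bleep words; the helper name `low_confidence_words` and the `0.6` threshold are arbitrary assumptions, and the sample data is copied from the output above

```python
# a few entries copied from the output above, wrapped in a single segment
segments = [
    {
        "words": [
            {"text": "would", "start": 0.4, "end": 0.58, "confidence": 0.595},
            {"text": "stop", "start": 0.58, "end": 0.76, "confidence": 0.462},
            {"text": "talking", "start": 0.76, "end": 1.2, "confidence": 0.991},
        ]
    }
]


def low_confidence_words(segments, threshold=0.6):
    # flatten all segments and keep words scoring below the threshold
    return [w for seg in segments for w in seg["words"] if w["confidence"] < threshold]


for w in low_confidence_words(segments):
    print(f'{w["text"]!r} {w["start"]:.2f}-{w["end"]:.2f}s confidence={w["confidence"]}')
```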

# 3.  choose a set of bleep words

- now that we have the transcript we can choose a set of words to bleep
- this can include both correctly and incorrectly transcribed words
- for example, the product mentioned in our test video, "Tritz", is the name of a bespoke local product in Costa Rica, and so won't be in the vocabulary of a whisper model that hasn't been fine-tuned
- it will be transcribed as a phonetically similar word like "treats" or "treatz"
- that's fine for our use case: if we want to *bleep* it out we can simply choose the word(s) it's been transcribed as by our whisper model

In [16]:
# choose a set of words - ideally from the transcript of course - to bleep out
bleep_words = ["treats", "treetz", "threats", "ice", "cream", "chocolate", "syrup", "cookie", "hooked"]
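
- if we're not sure how a word was transcribed, one option (an assumption on our part, not something this repo does) is to search the transcript for similarly spelled words using the standard library's `difflib`
- the helper name `suggest_bleep_words` and the `cutoff` value are hypothetical; spelling similarity is only a rough proxy for phonetic similarity, so treat the suggestions as candidates to review

```python
from difflib import get_close_matches


def suggest_bleep_words(target, transcript, cutoff=0.5):
    # build a vocabulary of cleaned, unique transcript words
    vocab = {"".join(ch for ch in w if ch.isalnum()).lower() for w in transcript.split()}
    vocab = sorted(w for w in vocab if w)
    # return up to 5 transcript words whose spelling is close to the target
    return get_close_matches(target.lower(), vocab, n=5, cutoff=cutoff)


sample = "Threats is so good. Threats, treats. No context."
print(suggest_bleep_words("tritz", sample))
```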

# 4.  replace each "bleep word" with a *bleep*

- using the timestamped transcript we loop through the audio, and replace each instance of a "bleep word" with a slice of *bleep* sound
- we'll first create two helper functions: 
    - a `word_cleaner` that cleans up words from the timestamped transcript by removing punctuation and lowercasing; this ensures we properly *bleep* out every instance of a "bleep word", regardless of whether it is capitalized or attached to a punctuation mark
    - for example, if we want to bleep out the word "cookie", we also bleep out variations like "Cookie", "cookie,", and "cookie?!"

    - a `query_transcript` that takes in a list of bleep words, and returns a list of all timestamped instances of those words in the transcript

In [17]:
# simple word cleaner - remove punctuation etc.,
def word_cleaner(word: str) -> str:
    return "".join(e for e in word if e.isalnum()).lower().strip()


# collect all timestamped instances of bleep_word in transcript
def query_transcript(bleep_words: list, timestamped_transcript: list) -> list:
    transcript_words = sum([timestamped_transcript[i]["words"] for i in range(len(timestamped_transcript))], [])
    detected_bleep_words = []
    for bleep_word in bleep_words:
        detected_bleep_words += [v for v in transcript_words if word_cleaner(v["text"]) == word_cleaner(bleep_word)]
    detected_bleep_words = sorted(detected_bleep_words, key=lambda d: d["start"])
    return detected_bleep_words
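
- as an alternative sketch, the same lookup can be done in a single pass over the transcript by cleaning the bleep words once into a set
- the name `query_transcript_single_pass` is hypothetical; unlike `query_transcript` it does not sort, so it returns hits in transcript order (assumed here to already be chronological), and it never returns the same word twice if `bleep_words` contains duplicates

```python
def word_cleaner(word: str) -> str:
    # same cleaning rule as above: keep alphanumerics only, then lowercase
    return "".join(ch for ch in word if ch.isalnum()).lower()


def query_transcript_single_pass(bleep_words: list, timestamped_transcript: list) -> list:
    # clean the bleep words once, so each transcript word needs one set lookup
    targets = {word_cleaner(w) for w in bleep_words}
    return [
        word
        for segment in timestamped_transcript
        for word in segment["words"]
        if word_cleaner(word["text"]) in targets
    ]


# toy transcript in the same shape as the timestamped transcript
toy_transcript = [
    {
        "words": [
            {"text": "Cookie,", "start": 0.0, "end": 0.4},
            {"text": "please", "start": 0.4, "end": 0.8},
            {"text": "cookie?!", "start": 0.8, "end": 1.3},
        ]
    }
]
hits = query_transcript_single_pass(["cookie"], toy_transcript)
print([h["text"] for h in hits])
```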

- we now use these in our `splice_audio_with_bleeps` function below
- first this takes in our chosen `bleep_words`, and finds all timestamped instances of these words in the timestamped transcript of our original audio using `query_transcript`
- it then splices into the original audio *bleep* sounds for each word detected
- a list of contiguous audio segments - with bleeps replacing our chosen bleep word instances - is then returned
- this uses [pydub](https://github.com/jiaaro/pydub) to load in the *bleep* pythonically, and dissects a re-usable one second portion of it

In [18]:
from pydub import AudioSegment

bleep_sound = AudioSegment.from_mp3("bleep_that_sht/bleep.mp3")
# re-usable one second slice of the bleep (from 1000 ms to 2000 ms into the file)
bleep_first_sec = bleep_sound[1 * 1000 : 2 * 1000]


def splice_audio_with_bleeps(og_audio_path: str, bleep_words: list) -> list:
    # input original audio file for splicing
    test_sound = AudioSegment.from_mp3(og_audio_path)

    # find bleep_words in timestamped transcript
    bleep_word_instances = query_transcript(bleep_words, timestamped_transcript)

    # start creation of test_sound_bleeped - by splicing in instance 0
    audio_clip = test_sound[:1]
    contiguous_audio_clips = [audio_clip]

    # loop over instances, thread in clips of bleep
    prev_end_time = 1
    for instance in bleep_word_instances:
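        # unpack bleep_word start / end times - converted to milliseconds, with 50 ms of padding
        # clamp start so it never precedes the previous clip (a negative index would count from the end of the audio)
        start_time = max(int(instance["start"] * 1000) - 50, prev_end_time)
        end_time = max(int(instance["end"] * 1000) + 50, start_time)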

        # collect clip of test starting at previous end time, and leading to start_time of next bleep
        audio_clip = test_sound[prev_end_time:start_time]

        # create bleep clip for this instance
        bleep_clip = bleep_first_sec[: (end_time - start_time)]

        # store test and bleep clips
        contiguous_audio_clips.append(audio_clip)
        contiguous_audio_clips.append(bleep_clip)

        # update prev_end_time
        prev_end_time = end_time

    # create final clip from test
    audio_clip = test_sound[prev_end_time:]
    contiguous_audio_clips.append(audio_clip)
    return contiguous_audio_clips
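
- the timing logic can be checked on its own, without loading any audio
- the sketch below (the name `bleep_intervals` and its parameters are hypothetical) converts word timestamps to padded millisecond intervals, clamps them to the audio length, and merges intervals that overlap once padding is added, so back-to-back bleep words produce one continuous bleep

```python
def bleep_intervals(instances, audio_len_ms, pad_ms=50):
    """Return sorted, non-overlapping (start_ms, end_ms) intervals to bleep."""
    intervals = []
    for inst in sorted(instances, key=lambda d: d["start"]):
        # seconds -> milliseconds, padded and clamped to the audio bounds
        start = max(0, round(inst["start"] * 1000) - pad_ms)
        end = min(audio_len_ms, round(inst["end"] * 1000) + pad_ms)
        if intervals and start <= intervals[-1][1]:
            # overlaps (or touches) the previous interval: extend it
            intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
        else:
            intervals.append((start, end))
    return intervals


toy_instances = [
    {"start": 0.02, "end": 0.30},
    {"start": 0.32, "end": 0.60},
    {"start": 2.0, "end": 2.4},
]
print(bleep_intervals(toy_instances, audio_len_ms=3000))
```

- note that `bleep_first_sec` is only one second long, so an interval longer than 1000 ms would need the bleep to be repeated to cover it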

# 5.  replace original audio with the new bleeped version

- finally our `bleep_that_sht` function below ties everything together
- in it we call `splice_audio_with_bleeps` returning a contiguous sequence of audio clips - with bleeps replacing our chosen words
- it then sews these segments together, and replaces the original audio of our video with the updated bleeped version
- this uses [moviepy](https://github.com/Zulko/moviepy) to swap in the new bleeped audio
- this function takes in our original audio (to splice and replace bleep words), original video (to make the final audio replacement), save paths for bleeped versions of the audio and video, and of course our chosen `bleep_words`

In [19]:
from moviepy.editor import VideoFileClip, AudioFileClip, CompositeAudioClip


def bleep_that_sht(og_video_path: str, og_audio_path: str, final_video_path: str, final_audio_path: str, bleep_words: list) -> None:
    # input og audio file for splicing
    test_sound = AudioSegment.from_mp3(og_audio_path)

    # create list of new audio clips replacing all bleep words
    contiguous_audio_clips = splice_audio_with_bleeps(og_audio_path, bleep_words)

    # merge and save bleeped audio
    bleeped_test_clip = sum(contiguous_audio_clips)
    bleeped_test_clip.export(final_audio_path, format="mp3")

    # load in og video, overlay with bleeped audio
    og_video = VideoFileClip(og_video_path)
    bleep_audio = AudioFileClip(final_audio_path)
    new_audioclip = CompositeAudioClip([bleep_audio])
    og_video.audio = new_audioclip
    og_video.write_videofile(final_video_path, codec="libx264", audio_codec="aac", temp_audiofile="temp-audio.m4a", remove_temp=True)

- let's run everything and watch/listen to the result

In [20]:
# define path to saved bleep audio and video
final_video_path = "data/output/bleep_test_1.mp4"
final_audio_path = "data/output/bleep_test_1.mp3"

# create bleeped audio and video
bleep_that_sht(og_video_path, og_audio_path, final_video_path, final_audio_path, bleep_words)

Moviepy - Building video data/output/bleep_test_1.mp4.
MoviePy - Writing audio in temp-audio.m4a
MoviePy - Done.
Moviepy - Writing video data/output/bleep_test_1.mp4

Moviepy - Done !
Moviepy - video ready data/output/bleep_test_1.mp4




- let's watch the bleeped video!

In [21]:
# display bleeped video
display_video(final_video_path)

- want to play around with this more?  fire up the Streamlit app locally (see this repo's `README.md`) to start playing immediately