<a href="https://colab.research.google.com/github/kirbah/genai-chaptercraft/blob/main/GenAI_ChapterCraft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenAI ChapterCraft: Automated Video Chapter Generation

**Overview:**

GenAI ChapterCraft is a free tool for automatically generating video chapters using AI. This Colab notebook demonstrates how to transcribe video audio and use a Large Language Model (LLM) to create SEO-friendly chapters with timestamps.

**Key Features:**

*   **Audio Transcription:** Uses `openai/whisper-large-v3-turbo` (or `openai/whisper-tiny` for CPU) for audio-to-text conversion.
*   **Audio Download:** Downloads audio from video URLs (e.g., Vimeo) using `yt_dlp`.
*   **LLM Chapter Creation:** Employs an LLM (`Qwen/Qwen2.5-Coder-32B-Instruct` by default) to generate chapters from the transcript.
*   **SRT Output:** Converts the transcript to SRT format.
*   **GPU-optimized:** If a GPU is available in the Colab runtime, the notebook uses it automatically.
*   **Free and open-source.**

**Process:**

1.  **Video Input:** Takes a video URL; for YouTube videos, uses the existing transcript when one is available.
2.  **Audio Extraction:** If needed, downloads audio as an MP3 using `yt_dlp`.
3.  **Transcription:** Transcribes audio to text using `openai/whisper`.
4.  **SRT Conversion:** Converts the transcript into SRT format.
5.  **Chapter Generation:**  LLM analyzes the transcript and creates chapters with timestamps.
6.  **Hugging Face Usage:** Integrates the LLM through the Hugging Face Inference API. HF\_TOKEN is required.

**Benefits:**

*   Saves time on manual chapter creation.
*   Improves video searchability with SEO-friendly chapters.
*   Enhances the viewer's experience.

**Requirements:**

*   Google Colab.
*   Google AI Studio API key (GEMINI\_API\_KEY). Create a new key in [AI Studio](https://aistudio.google.com/apikey).
*   Hugging Face account and API token (HF\_TOKEN). Create a new token on the [Hugging Face tokens page](https://huggingface.co/settings/tokens).
*   Supadata account token (SUPADATA\_TOKEN) to retrieve YouTube transcripts. Create a [new account and token](https://supadata.ai/).
*   GPU runtime for faster runs (highly recommended).

**How to use it:**

1. Enter the video URL.
2. Set SUPADATA\_TOKEN in the Colab secrets (key icon in the left sidebar).
3. Set HF\_TOKEN in the secrets.
4. Set GEMINI\_API\_KEY in the secrets.
5. Run all code blocks sequentially (Ctrl + F9 or Runtime -> Run all).
6. Enjoy the generated transcript and chapters.

Specify the video URL.  You can change this to any supported video URL.

**Note:** For YouTube videos, ensure that a transcript *already exists* to enable direct retrieval.  Direct video downloading from YouTube may be unreliable due to ongoing issues with `yt-dlp` (see https://github.com/yt-dlp/yt-dlp/issues/11868).  If no transcript exists, the notebook will attempt to download and transcribe, which is considerably slower.

In [3]:
video_url = "https://www.youtube.com/watch?v=A9WY_HZUK8Q"
#video_url = "https://vimeo.com/1013516281"  # Dutch language
#video_url = "https://vimeo.com/821101511"

Specify the desired number of chapters to be generated (minimum of 3).

In [70]:
num_chapters = 8
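
The minimum of three chapters could be enforced before any transcript or API work starts. The following is a minimal sketch; `validate_num_chapters` is a hypothetical helper and not part of the notebook.

```python
def validate_num_chapters(value, minimum=3):
    """Return value as an int, or raise ValueError if it is below the minimum."""
    count = int(value)
    if count < minimum:
        raise ValueError(f"num_chapters must be at least {minimum}, got {count}")
    return count

# Hypothetical usage: fail early instead of sending a bad request to the LLM.
num_chapters = validate_num_chapters(8)
```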

Install the required libraries.

In [5]:
!pip install -q yt_dlp
!pip install -q transformers
!pip install -q huggingface_hub
!pip install -q supadata
!pip install -q safetensors


Import the necessary libraries.

In [31]:
import gc
import google.generativeai as genai

import re
from typing import Any, Dict

import torch
import yt_dlp
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from google.colab import userdata
from huggingface_hub import login, InferenceClient
from supadata import Supadata

Retrieve your Hugging Face API token. Ensure that `HF_TOKEN` is set in your Colab secrets. You can create a new token at https://huggingface.co/settings/tokens.

In [51]:
# Get the API token from the user data
huggingface_api_token = userdata.get('HF_TOKEN')

Retrieve your Google AI Studio API key.  Ensure that `GEMINI_API_KEY` is set in your Colab Secrets.

In [37]:
gemini_api_key = userdata.get('GEMINI_API_KEY')

Retrieve your Supadata API token (`SUPADATA_TOKEN`), which you can create at https://supadata.ai/. This is required *only* to fetch transcripts for YouTube videos automatically. If the video is from another source (or you're providing a local file), it is not needed.

In [76]:
supadata_api_token = userdata.get('SUPADATA_TOKEN')

# Attempt to retrieve an existing YouTube transcript.

In [None]:
def extract_video_id(video_url):
    """
    Extract video ID from a YouTube video URL.
    """
    match = re.search(r"(?:v=|youtu\.be\/)([^&?]+)", video_url)
    if match:
        return match.group(1)
    return None

# For a YouTube video, download its transcript text
youtube_video_id = extract_video_id(video_url)

if youtube_video_id:
    if supadata_api_token is None:
      raise ValueError(
          "SUPADATA_TOKEN user data variable not set. "
          "Please set it in Colab's secrets: "
          "click the key icon in the left sidebar."
      )
    supadata = Supadata(api_key=supadata_api_token)
    try:
        transcript = supadata.youtube.transcript(video_id=youtube_video_id)
    except Exception as e:
        print(f"Error fetching transcript: {e}")
else:
    print("Not a YouTube URL; the audio will be downloaded and transcribed instead.")
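
The regular expression in `extract_video_id` matches any URL that contains `v=`, including URLs that are not from YouTube. As a hedged alternative, the sketch below checks the host name first using only the standard library. `extract_youtube_id` and the `YOUTUBE_HOSTS` set are illustrative assumptions, and the host list is not exhaustive.

```python
from urllib.parse import urlparse, parse_qs

# Assumed set of host names; extend it if other YouTube domains are needed.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

def extract_youtube_id(video_url):
    """Return the video ID for a YouTube URL, or None for any other URL."""
    parsed = urlparse(video_url)
    host = (parsed.hostname or "").lower()
    if host not in YOUTUBE_HOSTS:
        return None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    # Standard watch links carry the ID in the "v" query parameter.
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None

print(extract_youtube_id("https://www.youtube.com/watch?v=A9WY_HZUK8Q"))
print(extract_youtube_id("https://vimeo.com/1013516281"))
```

With the two URLs used in this notebook, the first call returns the ID and the second returns `None`, so a Vimeo link would fall through to the audio download path.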


In [10]:
transcript

Transcript(content=[TranscriptChunk(text='this 400-year old book should have', offset=199, duration=4441, lang='en'), TranscriptChunk(text='changed mathematics Forever This Is The', offset=2159, duration=4680, lang='en'), TranscriptChunk(text="Swiss clockmaker Jos bergy's arithmetic", offset=4640, duration=4760, lang='en'), TranscriptChunk(text='and geometric progression tables the', offset=6839, duration=4401, lang='en'), TranscriptChunk(text='book contains an ingenious mathematical', offset=9400, duration=4800, lang='en'), TranscriptChunk(text='Hack That Bergie called red numbers and', offset=11240, duration=4479, lang='en'), TranscriptChunk(text='the design of a powerful Computing', offset=14200, duration=3360, lang='en'), TranscriptChunk(text='device that uses these red numbers', offset=15719, duration=4521, lang='en'), TranscriptChunk(text="hiding on its title page bergy's hack", offset=17560, duration=4440, lang='en'), TranscriptChunk(text='works by constructing an enormous table

In [11]:
def format_time(milliseconds):
    """Formats the time from milliseconds to HH:MM:SS,mmm format."""

    seconds = milliseconds / 1000
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    milliseconds = int(round((seconds - int(seconds)) * 1000))
    seconds = int(seconds)

    return f"{int(hours):02}:{int(minutes):02}:{int(seconds):02},{milliseconds:03}"

def convert_to_srt(transcript_content):
    """Converts the transcript content to SRT format."""

    srt_text = ""
    for i, segment in enumerate(transcript_content):

        start_time = segment['offset']
        end_time = segment['offset'] + segment['duration']
        text = segment['text']

        # Format the timestamps
        start_time_formatted = format_time(start_time)
        end_time_formatted = format_time(end_time)

        srt_text += f"{i+1}\n"
        srt_text += f"{start_time_formatted} --> {end_time_formatted}\n"
        srt_text += f"{text}\n\n"
    return srt_text

# Convert the YouTube transcript to SRT only if one was actually retrieved.
if 'transcript' in locals() and hasattr(transcript, 'content') and transcript.content is not None and len(transcript.content)>0:
  srt_text = convert_to_srt([{"offset":item.offset, "duration":item.duration, "text":item.text} for item in transcript.content])
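
Because the transcript offsets and durations are integer milliseconds, the timestamp conversion can also be done with integer arithmetic only, which avoids any floating-point rounding. This is a sketch of that alternative; `ms_to_srt_time` is a hypothetical name, not a function from the notebook.

```python
def ms_to_srt_time(milliseconds):
    """Format integer milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(milliseconds)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"

# First transcript chunk from the sample video: offset=199, duration=4441.
print(ms_to_srt_time(199))         # 00:00:00,199
print(ms_to_srt_time(199 + 4441))  # 00:00:04,640
```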


In [12]:
srt_text

"1\n00:00:00,199 --> 00:00:04,640\nthis 400-year old book should have\n\n2\n00:00:02,159 --> 00:00:06,839\nchanged mathematics Forever This Is The\n\n3\n00:00:04,640 --> 00:00:09,400\nSwiss clockmaker Jos bergy's arithmetic\n\n4\n00:00:06,839 --> 00:00:11,240\nand geometric progression tables the\n\n5\n00:00:09,400 --> 00:00:14,200\nbook contains an ingenious mathematical\n\n6\n00:00:11,240 --> 00:00:15,719\nHack That Bergie called red numbers and\n\n7\n00:00:14,200 --> 00:00:17,560\nthe design of a powerful Computing\n\n8\n00:00:15,719 --> 00:00:20,240\ndevice that uses these red numbers\n\n9\n00:00:17,560 --> 00:00:22,000\nhiding on its title page bergy's hack\n\n10\n00:00:20,240 --> 00:00:24,240\nworks by constructing an enormous table\n\n11\n00:00:22,000 --> 00:00:26,760\nof numbers where each number is simply\n\n12\n00:00:24,240 --> 00:00:29,199\nthe previous number time\n\n13\n00:00:26,760 --> 00:00:31,759\n1.001 starting at one and repeating this\n\n14\n00:00:29,199 --> 00:00:34

# Download the audio if no YouTube transcript was found.

If no transcript was retrieved from YouTube, download the video's audio track, convert it to MP3, and get the resulting filename.

In [13]:
def download_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '128',
        }],
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        audio_file = ydl.prepare_filename(info)
        if not audio_file.endswith('.mp3'):
             audio_file = audio_file.rsplit('.', 1)[0] + '.mp3'

    return audio_file

# Download and get the filename only if srt_text is not defined.
if 'srt_text' not in locals():
  audio_file = download_audio(video_url)

  if audio_file:
    print("Audio file saved as:", audio_file)
  else:
    print("Failed to get audio file.")

Prepare to perform speech recognition locally using the [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) model.  The notebook will automatically select a smaller model (`openai/whisper-tiny`) if a GPU is not available.

In [14]:
if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
    model_id = "openai/whisper-large-v3-turbo"
    batch_size = 16
else:
    device = "cpu"
    torch_dtype = torch.float32
    model_id = "openai/whisper-tiny"
    batch_size = 2

Run speech recognition. The audio is split into 10-second chunks that are processed in batches, which lets the pipeline handle long recordings efficiently.

In [15]:
# Only if srt_text is not defined.
if 'srt_text' not in locals():
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )
    transcribed_text = pipe(audio_file,
                            chunk_length_s=10,
                            batch_size=batch_size,
                            return_timestamps=True)

Display the transcribed text (if generated).

In [16]:
if 'transcribed_text' in locals():
    # A bare expression inside an if block is not displayed, so print it explicitly.
    print(transcribed_text)

In [17]:
# Clear the GPU cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Run garbage collection
gc.collect()

60

Convert the transcribed text to the SRT (SubRip Subtitle) format.

In [18]:
def seconds_to_srt_time(seconds):
    """Convert seconds to SRT time format (HH:MM:SS,mmm)."""
    # Check if seconds is None or not a number
    if seconds is None or not isinstance(seconds, (int, float)):
        return "00:00:00,000"  # Default value for unknown time

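    # Round to whole milliseconds first; truncating the float fraction
    # can lose a millisecond (e.g. 4.64 s would become 04,639).
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"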

def convert_to_srt(transcribed_text):
    """Convert Whisper transcription chunks into SRT format."""
    srt_output = []
    # Check if chunks exist in the result.
    if "chunks" in transcribed_text:
        for i, chunk in enumerate(transcribed_text["chunks"], start=1):
            if chunk.get("timestamp") is not None:
                 start_time = seconds_to_srt_time(chunk["timestamp"][0])
                 end_time = seconds_to_srt_time(chunk["timestamp"][1])
                 srt_output.append(f"{i}\n{start_time} --> {end_time}\n{chunk['text']}\n")
            else:
                 srt_output.append(f"{i}\n{chunk['text']}\n")
        return "\n".join(srt_output)
    else:
        print("No chunks found in transcription. returning plain text")
        return transcribed_text["text"]

# Convert transcript chunks to SRT format
if 'transcribed_text' in locals():
    srt_text = convert_to_srt(transcribed_text)

In [19]:
srt_text

"1\n00:00:00,199 --> 00:00:04,640\nthis 400-year old book should have\n\n2\n00:00:02,159 --> 00:00:06,839\nchanged mathematics Forever This Is The\n\n3\n00:00:04,640 --> 00:00:09,400\nSwiss clockmaker Jos bergy's arithmetic\n\n4\n00:00:06,839 --> 00:00:11,240\nand geometric progression tables the\n\n5\n00:00:09,400 --> 00:00:14,200\nbook contains an ingenious mathematical\n\n6\n00:00:11,240 --> 00:00:15,719\nHack That Bergie called red numbers and\n\n7\n00:00:14,200 --> 00:00:17,560\nthe design of a powerful Computing\n\n8\n00:00:15,719 --> 00:00:20,240\ndevice that uses these red numbers\n\n9\n00:00:17,560 --> 00:00:22,000\nhiding on its title page bergy's hack\n\n10\n00:00:20,240 --> 00:00:24,240\nworks by constructing an enormous table\n\n11\n00:00:22,000 --> 00:00:26,760\nof numbers where each number is simply\n\n12\n00:00:24,240 --> 00:00:29,199\nthe previous number time\n\n13\n00:00:26,760 --> 00:00:31,759\n1.001 starting at one and repeating this\n\n14\n00:00:29,199 --> 00:00:34

Ensure that a transcript (in SRT format) is available before proceeding.

In [20]:
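# Check for existence first: srt_text is undefined if both transcript paths failed.
if 'srt_text' not in locals() or not srt_text:
    raise ValueError("No transcript available. Please check your video URL or tokens.")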

# Generate Chapters using an LLM.

Construct the prompt that will be sent to the LLM for chapter generation. This prompt includes specific instructions and constraints to ensure the desired output format.

In [68]:
# Prepare a refined prompt with explicit instructions and constraints.
prompt = (
    "Based on the following transcript, please perform the following steps internally and then output a single final chapter list:\n"
    "1. Analyze the transcript and identify its key segments with approximate starting timestamps in MM:SS format.\n"
    f"2. Generate exactly {num_chapters} distinct, non-overlapping chapters that cover different aspects of the video.\n"
    "3. The very first chapter must start at 00:00. All subsequent chapters should use the timestamp corresponding to when the segment begins, and the timestamps must be in ascending order with a minimum gap of 10 seconds between chapters.\n"
    "4. Format each chapter on its own line using the format '<timestamp> <chapter title>'. For example, '00:00 Introduction'.\n"
    "5. Do not include any additional commentary, explanations, chain-of-thought, or intermediate reasoning—only the final chapter list.\n"
    "6. Ensure that only one chapter list is generated and that there are no duplicate chapters or timestamps.\n\n"
    "Example:\n"
    "00:00 Introduction\n"
    "01:24 Key Concepts Overview\n"
    "08:56 Comparative Insights\n"
    "09:31 Conclusion and Next Steps\n\n"
    "### Transcript:\n"
    f"{srt_text}\n\n"
    "Chapters:"
)
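
An LLM may not follow every constraint in the prompt, so the returned list can be checked before it is used. The sketch below parses `<MM:SS> <title>` lines and reports violations of the rules stated above (exact count, first chapter at 00:00, ascending order with a 10-second minimum gap). `parse_chapters` and `check_chapters` are hypothetical helpers, and the pattern assumes `MM:SS` timestamps only, so it would need extending for `HH:MM:SS`.

```python
import re

# Matches lines such as "01:24 Key Concepts Overview".
CHAPTER_LINE = re.compile(r"^(\d{1,2}):(\d{2})\s+(.+)$")

def parse_chapters(text):
    """Parse '<MM:SS> <title>' lines into (seconds, title) tuples; skip other lines."""
    chapters = []
    for line in text.splitlines():
        match = CHAPTER_LINE.match(line.strip())
        if match:
            minutes, seconds, title = match.groups()
            chapters.append((int(minutes) * 60 + int(seconds), title.strip()))
    return chapters

def check_chapters(chapters, expected_count, min_gap=10):
    """Return a list of rule violations; an empty list means all checks passed."""
    problems = []
    if len(chapters) != expected_count:
        problems.append(f"expected {expected_count} chapters, got {len(chapters)}")
    if chapters and chapters[0][0] != 0:
        problems.append("first chapter does not start at 00:00")
    # A gap below min_gap also covers out-of-order and duplicate timestamps.
    for (prev, _), (curr, title) in zip(chapters, chapters[1:]):
        if curr - prev < min_gap:
            problems.append(f"'{title}' starts less than {min_gap}s after the previous chapter")
    return problems

# The example list from the prompt passes all checks.
sample = (
    "00:00 Introduction\n"
    "01:24 Key Concepts Overview\n"
    "08:56 Comparative Insights\n"
    "09:31 Conclusion and Next Steps"
)
chapters = parse_chapters(sample)
print(check_chapters(chapters, expected_count=4))  # prints []
```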

In [71]:
prompt

"Based on the following transcript, please perform the following steps internally and then output a single final chapter list:\n1. Analyze the transcript and identify its key segments with approximate starting timestamps in MM:SS format.\n2. Generate exactly 8 distinct, non-overlapping chapters that cover different aspects of the video.\n3. The very first chapter must start at 00:00. All subsequent chapters should use the timestamp corresponding to when the segment begins, and the timestamps must be in ascending order with a minimum gap of 10 seconds between chapters.\n4. Format each chapter on its own line using the format '<timestamp> <chapter title>'. For example, '00:00 Introduction'.\n5. Do not include any additional commentary, explanations, chain-of-thought, or intermediate reasoning—only the final chapter list.\n6. Ensure that only one chapter list is generated and that there are no duplicate chapters or timestamps.\n\nExample:\n00:00 Introduction\n01:24 Key Concepts Overview\n

# Generate chapters using the Gemini API.

In [72]:
if gemini_api_key:
    genai.configure(api_key=gemini_api_key)

    # Create the model
    generation_config = {
      "temperature": 1,
      "top_p": 0.95,
      "top_k": 64,
      "max_output_tokens": 500,
      "response_mime_type": "text/plain",
    }

    model = genai.GenerativeModel(
      model_name="gemini-2.0-pro-exp-02-05",
      generation_config=generation_config,
    )

    chat_session = model.start_chat(
      history=[
      ]
    )

    gemini_response = chat_session.send_message(prompt)

In [75]:
if 'gemini_response' in locals():
    print("Generated Chapters:\n")
    print(gemini_response.text)
    print("\nGenerated using free 'GenAI ChapterCraft' tool.")

Generated Chapters:

00:00 Introduction to Bergy's Discovery
00:20 Bergy's Number Table Construction
01:15 Using the Table for Calculation
02:42 Converting Operations with the Table
03:32 The Circular Slide Rule Design
04:57 Slide Rule Mechanics and Logarithms
06:10 Bergy's Secrecy and Napier's Logarithms
07:33 Logarithms in Modern Applications


Generated using free 'GenAI ChapterCraft' tool.


# Generate Chapters using Hugging Face

Log in to Hugging Face.

In [34]:
# Log in using the API token from the HF_TOKEN
if huggingface_api_token:
  login(token=huggingface_api_token)

Generate text using the Hugging Face API (an alternative to Gemini).

In [35]:
if huggingface_api_token:
    # Replace with your actual model ID
    model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"

    # Create an InferenceClient instance
    client = InferenceClient(model=model_id)

    # Define the parameters for the text generation request
    generation_parameters = {
        "max_new_tokens": 300,  # Maximum length of generated text
        "temperature": 0.2,
        "top_p": 0.8,
        "do_sample": False,
    }

    # Send the request to the Inference API
    hf_response = client.text_generation(prompt, **generation_parameters)

Print the generated chapters.

In [52]:
if 'hf_response' in locals():
    print("Generated Chapters:")
    print(hf_response)
    print("\nGenerated using free 'GenAI ChapterCraft' tool.")

Generated Chapters:
 
00:00 Introduction
00:30 Bergie's Arithmetic and Geometric Progression Tables
01:00 Red Numbers and Computing Device
01:30 Construction of the Table
02:00 Mapping Black and Red Numbers
02:30 Multiplication Using Bergie's Table
03:00 Other Mathematical Operations
03:30 Historical Context and Impact

00:00 Introduction
00:30 Bergie's Tables and Red Numbers
01:00 Construction of the Table
01:30 Mapping Black and Red Numbers
02:00 Multiplication Using Bergie's Table
02:30 Other Mathematical Operations
03:00 Historical Context and Impact
03:30 Conclusion and Next Steps

00:00 Introduction
00:30 Bergie's Tables and Red Numbers
01:00 Construction of the Table
01:30 Mapping Black and Red Numbers
02:00 Multiplication Using Bergie's Table
02:30 Other Mathematical Operations
03:00 Historical Context and Impact
03:30 Conclusion

00:00 Introduction
00:30 Bergie's Tables and Red Numbers
01:00 Construction of the Table
01:30 Mapping Black and Red Numbers
02:00 Multiplication Usi
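
The output above repeats the chapter list several times, separated by blank lines, until the token limit cuts it off. One way to handle this is to keep only the first block of non-blank lines. This is a minimal sketch; `first_chapter_list` is a hypothetical helper, and it assumes the first block is the one to keep.

```python
def first_chapter_list(raw_text):
    """Return the first block of consecutive non-blank lines from raw_text."""
    block = []
    for line in raw_text.splitlines():
        if line.strip():
            block.append(line.strip())
        elif block:
            break  # A blank line after content ends the first block.
    return "\n".join(block)

# Shortened, illustrative stand-in for a response that repeats itself.
raw = " \n00:00 Introduction\n00:30 Red Numbers\n\n00:00 Introduction\n00:30 Red Num"
print(first_chapter_list(raw))
```

In the notebook this could be applied as `first_chapter_list(hf_response)` before printing.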