# Generating YouTube Transcripts

## Overview
This notebook extracts transcripts and metadata from YouTube videos and saves each video's data as JSON in a file with a `.md` extension. These files can later be added to your vector database pipeline alongside PDFs and other documents.

## Section 1: Setup and Installation


### 1.1 Mount Google Drive

In [None]:
# Import Google Colab's drive module
from google.colab import drive

# Mount Google Drive to access personal files
drive.mount('/content/drive')

### 1.2 Installing Required Libraries

- `youtube_transcript_api`:
    - Downloads video transcripts/subtitles
    - Accesses the same captions you see on YouTube
    - Includes timestamps for each text segment
    - Works with auto-generated and manual captions

- `yt_dlp`:
    - Extracts video metadata (title, uploader, view count, duration, upload date)
    - Fork of youtube-dl that is more actively maintained
    - Reads metadata without downloading the video itself

In [None]:
# Install the YouTube Transcript API
!pip install youtube_transcript_api

# Install yt-dlp for extracting video metadata (the video itself is not downloaded)
!pip install yt_dlp
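
The processing loop later in this notebook relies on the instance-based `fetch()` call, which older releases of the transcript library may not provide, so it can help to confirm which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI
for name in ("youtube-transcript-api", "yt-dlp"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports `not installed`, rerun the install cell above before continuing.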

### 1.3 Importing Python Libraries
Here we import Python libraries needed for working with files, JSON, and YouTube content.

- `os`: Create directories, build file paths
- `json`: Structure data in JSON format for easy parsing later
- `YouTubeTranscriptApi`: Core transcript retrieval
- `yt_dlp`: Video information extraction

In [None]:
import os        # For handling file paths and directories
import json      # For writing data in JSON format
from youtube_transcript_api import YouTubeTranscriptApi  # Transcript access
import yt_dlp    # For metadata extraction from YouTube videos

## Section 2: Configuration and Processing

### 2.1 Defining Video IDs and Output Directory

`video_ids`: List of video IDs to process
- The ID is the value that follows `watch?v=` in the YouTube URL
- Add as many as you need
- Videos are processed sequentially, one after another

`output_directory`: Where the output files will be saved
- Use the same folder as your PDF documents (for the vector database pipeline)
- `exist_ok=True` prevents an error if the folder already exists

In [None]:
# List of YouTube video IDs to process
video_ids = ["MkETkRv9tG8", "HR7xaHv3Ias"] # ADD YOUR OWN VIDEO IDS IF YOU WANT

# Path to the output directory on Google Drive
output_directory = "/content/drive/Shareddrives/# MDSC-30801-Language-Processing-in-Practice/Fall 2025/Week 12/Chronicles_Articles_PDFs" # CHANGE TO YOUR OWN DIRECTORY

# Create the directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)
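
If you have full URLs rather than bare IDs, the ID can be pulled out programmatically. The sketch below is one possible approach using `urllib.parse`; the function name `extract_video_id` and the variable `video_ids_from_urls` are hypothetical. It only handles the `watch?v=` form described above plus `youtu.be` short links. Other URL shapes return `None`.

```python
from urllib.parse import parse_qs, urlparse

# Hostnames treated as standard YouTube watch pages (an assumption of this sketch)
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.strip("/") or None

    # Standard links: https://www.youtube.com/watch?v=<id>
    if host in YOUTUBE_HOSTS:
        values = parse_qs(parsed.query).get("v")
        if values:
            return values[0]

    return None


urls = [
    "https://www.youtube.com/watch?v=MkETkRv9tG8",
    "https://youtu.be/HR7xaHv3Ias",
]
video_ids_from_urls = [extract_video_id(u) for u in urls]
print(video_ids_from_urls)  # ['MkETkRv9tG8', 'HR7xaHv3Ias']
```

The resulting list can be assigned to `video_ids` in place of the hand-typed IDs.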

### 2.2 Looping Over Each Video to Extract Transcript and Metadata

1. Loop through each video ID in the list
2. Construct full YouTube URL
3. Fetch transcript with timestamps using YouTube Transcript API
4. Format transcript with timestamp markers
5. Extract metadata (title, uploader, etc.) using yt-dlp
6. Combine everything into structured dictionary
7. Save as JSON in markdown file (.md extension)
8. Print confirmation for each successful save
9. Continue to next video
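
Step 4 writes each segment's start time as raw seconds. If a clock-style marker is easier to read, the seconds can be converted first. This is an optional sketch, not what the loop below does: `format_timestamp` is a hypothetical helper, and the sample snippets are made-up dictionaries standing in for the snippet objects returned by the API (which expose `.start` and `.text` as attributes).

```python
def format_timestamp(seconds):
    """Convert a start time in seconds to an HH:MM:SS string."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Made-up snippets for illustration only
sample_snippets = [
    {"start": 0.0, "text": "Welcome back."},
    {"start": 75.4, "text": "Let's look at the data."},
    {"start": 3725.9, "text": "Thanks for watching."},
]

lines = [f"[{format_timestamp(s['start'])}] {s['text']}" for s in sample_snippets]
print("\n".join(lines))
# [00:00:00] Welcome back.
# [00:01:15] Let's look at the data.
# [01:02:05] Thanks for watching.
```

To use it in the loop, replace `snippet.start` in the formatting line with `format_timestamp(snippet.start)`.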

In [None]:
# Loop through each video ID in the list
for video_id in video_ids:
    # Construct the full YouTube video URL
    url = f"https://www.youtube.com/watch?v={video_id}"

    # Attempt to retrieve the transcript using the YouTube Transcript API
    try:
        # Create an API instance and call .fetch() (this replaces the older .get_transcript() call)
        ytt_api = YouTubeTranscriptApi()
        fetched_transcript = ytt_api.fetch(video_id)

        # Format each snippet as "[timestamp <start in seconds>] <text>"
        # Each FetchedTranscriptSnippet object exposes .text, .start, and .duration
        lines = [f"[timestamp {snippet.start}] {snippet.text}" for snippet in fetched_transcript]

        # Combine the lines into a single string
        full_transcript = "\n".join(lines)
    except Exception as e:
        # If transcript retrieval fails, log the error
        full_transcript = f"Error retrieving transcript: {e}"

    # Set up options for yt_dlp (can be customized if needed)
    ydl_opts = {}

    # Attempt to retrieve metadata using yt_dlp
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=False)

            # Extract desired metadata fields
            title = info.get('title')
            uploader = info.get('uploader')
            upload_date = info.get('upload_date')  # Format: YYYYMMDD
            duration = info.get('duration')         # Duration in seconds
            view_count = info.get('view_count')
    except Exception as e:
        # If metadata retrieval fails, use the error message as placeholder values
        title = uploader = upload_date = duration = view_count = f"Error retrieving metadata: {e}"

    # Create a dictionary to hold all video information
    video_data = {
        "title": title,
        "uploader": uploader,
        "url": url,
        "upload_date": upload_date,
        "duration": duration,
        "view_count": view_count,
        "transcript": full_transcript
    }

    # Define the output path for the .md file
    output_file_path = os.path.join(output_directory, f"YT_Transcript_{video_id}.md")

    # Write the video data dictionary to the file as JSON
    with open(output_file_path, "w", encoding="utf-8") as f:
        json.dump(video_data, f, ensure_ascii=False, indent=4)

    # Print a confirmation message
    print(f"Data successfully saved to {output_file_path}")
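
Because a failed transcript retrieval is stored as an error message inside the saved file, it can be useful to read the files back and filter out those entries before sending them to the vector database pipeline. The following is a minimal sketch under that assumption. `load_transcript_files` and `has_transcript` are hypothetical helpers, and the demonstration writes made-up records to a temporary folder rather than touching your Drive.

```python
import json
import os
import tempfile
from pathlib import Path


def load_transcript_files(directory):
    """Load every YT_Transcript_*.md file in `directory` as a dictionary."""
    records = []
    for path in sorted(Path(directory).glob("YT_Transcript_*.md")):
        with open(path, encoding="utf-8") as f:
            records.append(json.load(f))
    return records


def has_transcript(record):
    """Return False when the transcript field holds the stored error message."""
    text = str(record.get("transcript", ""))
    return not text.startswith("Error retrieving transcript")


# Demonstration with a temporary folder and made-up records (not real video data)
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "aaa": {"title": "Example A", "transcript": "[timestamp 0.0] Hello"},
        "bbb": {"title": "Example B", "transcript": "Error retrieving transcript: no captions"},
    }
    for vid, data in samples.items():
        file_path = os.path.join(demo_dir, f"YT_Transcript_{vid}.md")
        with open(file_path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    loaded = load_transcript_files(demo_dir)
    usable = [r["title"] for r in loaded if has_transcript(r)]
    print(usable)  # ['Example A']
```

Calling `load_transcript_files(output_directory)` after the loop above would return one dictionary per processed video.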