<a href="https://colab.research.google.com/github/jermwatt/youtube_transcript_downloader/blob/main/transcript_downloader_walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import os
# if running in Colab, pull the repo and install requirements
if os.getenv("COLAB_RELEASE_TAG"):
    !git clone https://github.com/jermwatt/youtube_transcript_downloader.git
    %cd youtube_transcript_downloader
    !pip install -r requirements.txt

# YouTube transcript downloader

The problem:
- this expert YouTuber [has downloaded THOUSANDS of transcripts BY HAND](https://youtu.be/As7abwNhG7Y?si=dSzEf6Hk0glROYhu&t=340) to analyze them
- say each copy/paste takes 10 seconds: for 1,000 videos that's 10,000 seconds, or roughly 167 minutes (nearly 3 hours), of nothing but copying and pasting transcripts
- let's save anyone who wants to do this in the future a bunch of time

Let's build:
- a simple app you can run locally
- that lets you upload a .txt file of YouTube urls and extracts their transcripts
- and, once complete, lets you download a .csv or .xlsx file with the transcripts nicely organized by url / video id

Tools we'll need:
- [youtube_transcript_api](https://github.com/jdepoix/youtube-transcript-api) - a simple Python API that uses the requests library and the YouTube web client
- [streamlit](https://github.com/streamlit/streamlit) for the app framework (components, web UI, etc.)
- custom code to organize data, batch process, create and organize output, etc.

## 1.  Pulling transcripts

- we'll create simple wrappers around [youtube_transcript_api](https://github.com/jdepoix/youtube-transcript-api) to pull transcripts
- with batch mode built in 🤩 - so we can pull multiple video transcripts at once

- the functions below
    - `is_valid_youtube_url` checks that an input url has the expected format (`https://www.youtube.com/watch?v=` followed by an 11-character video id)
    - `get_single_transcript` pulls the transcript of a single video
    - `get_batch_transcripts` pulls the transcripts of multiple input videos

In [3]:
import re
from typing import List, Dict
from youtube_transcript_api import YouTubeTranscriptApi


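def is_valid_youtube_url(potential_url: str) -> bool:
    # only the standard watch-url format is accepted; youtube ids are always 11 chars long
    pattern = r'https://www\.youtube\.com/watch\?v=[A-Za-z0-9_-]{11}'
    # fullmatch also rejects a trailing newline, which `$` with re.match would let through
    return re.fullmatch(pattern, potential_url) is not None
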
def get_single_transcript(youtube_url: str) -> dict:
    try:
        if is_valid_youtube_url(youtube_url):
            video_id = youtube_url.split("?v=")[-1]
            video_transcript = YouTubeTranscriptApi.get_transcript(video_id)
            entry = {}
            entry["youtube_url"] = youtube_url
            entry["video_id"] = video_id
            entry["transcript"] = video_transcript
            return entry
        else:
            print(f"FAILURE: youtube_url is not valid - {youtube_url}")
            return {}
    except Exception as e:
        print(f"FAILURE: transcript pull for youtube_url - {youtube_url} - failed with exception {e}")
        return {}

def get_batch_transcripts(youtube_urls: List[str]) -> List[Dict]:
    # keep only well-formed urls, along with their video ids
    valid_urls = []
    valid_vids = []
    for url in youtube_urls:
        if is_valid_youtube_url(url):
            valid_urls.append(url)
            valid_vids.append(url.split("?v=")[-1])
    try:
        # get_transcripts returns a tuple:
        # (dict mapping video id -> transcript, list of unretrievable video ids)
        video_transcripts, _ = YouTubeTranscriptApi.get_transcripts(valid_vids, languages=["en"])
        entries = []
        for url, vid in zip(valid_urls, valid_vids):
            entry = {}
            entry["youtube_url"] = url
            entry["video_id"] = vid
            entry["transcript"] = video_transcripts[vid]  # look up by video id, not by position
            entries.append(entry)
        return entries
    except Exception as e:
        print(f"FAILURE: batch transcription fetch failed with exception {e}")
        return []
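
- note that `is_valid_youtube_url` only accepts the standard `watch?v=` format, so a share link like the `youtu.be` one at the top of this notebook would be rejected
- a minimal sketch of how such links could be normalized first is below; `extract_video_id`, `to_canonical_url` and the list of accepted hosts are assumptions for illustration, not part of the repo

```python
import re
from typing import Optional
from urllib.parse import urlparse, parse_qs

# same id shape the validator above expects: 11 url-safe characters
VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")

# assumption: the hostnames we are willing to treat as YouTube watch pages
WATCH_HOSTS = ("www.youtube.com", "youtube.com", "m.youtube.com")


def extract_video_id(url: str) -> Optional[str]:
    """Return the 11-char video id from a watch or youtu.be url, else None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # share links carry the id in the path: https://youtu.be/<id>?si=...
        candidate = parsed.path.lstrip("/")
    elif host in WATCH_HOSTS and parsed.path == "/watch":
        # watch links carry the id in the `v` query parameter
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return candidate if VIDEO_ID_PATTERN.fullmatch(candidate) else None


def to_canonical_url(url: str) -> Optional[str]:
    """Rewrite a recognized url into the format is_valid_youtube_url accepts."""
    video_id = extract_video_id(url)
    if video_id is None:
        return None
    return f"https://www.youtube.com/watch?v={video_id}"
```

- urls that come back as `None` can be reported to the user, and the rest passed on to `get_batch_transcripts` unchanged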

- test our functionality with two videos

In [4]:
# test transcript pull functionality with two test videos
youtube_url_1 = "https://www.youtube.com/watch?v=As7abwNhG7Y"
youtube_url_2 = "https://www.youtube.com/watch?v=0TolBiTrUg4"
youtube_urls = [youtube_url_1, youtube_url_2]
single_transcript = get_single_transcript(youtube_url_1)
batch_transcripts = get_batch_transcripts(youtube_urls)

In [8]:
# print the first few hundred characters of return object
print(str(single_transcript)[:400])

{'youtube_url': 'https://www.youtube.com/watch?v=As7abwNhG7Y', 'video_id': 'As7abwNhG7Y', 'transcript': [{'text': "I don't ask if it'll go viral I can", 'start': 0.04, 'duration': 3.399}, {'text': 'figure out how to make it viral that is', 'start': 1.599, 'duration': 4.361}, {'text': 'an absurd thing to', 'start': 3.439, 'duration': 5.921}, {'text': "say that's Jenny Hoyos an 18-year-old", 'start'
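
- as the output above shows, a transcript comes back as a list of segments, each with `text`, `start` and `duration`
- for analysis it is often handier to have one plain string per video; a small sketch of that is below (`transcript_to_text` is a hypothetical helper, not part of the library or the repo)

```python
from typing import Dict, List


def transcript_to_text(segments: List[Dict]) -> str:
    """Join the `text` field of each segment into one whitespace-normalized string."""
    # collapse newlines / repeated spaces inside each segment
    parts = [" ".join(segment["text"].split()) for segment in segments]
    # drop segments that end up empty, then join with single spaces
    return " ".join(part for part in parts if part)


# the first three segments from the output above
sample_segments = [
    {"text": "I don't ask if it'll go viral I can", "start": 0.04, "duration": 3.399},
    {"text": "figure out how to make it viral that is", "start": 1.599, "duration": 4.361},
    {"text": "an absurd thing to", "start": 3.439, "duration": 5.921},
]
print(transcript_to_text(sample_segments))
```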


## 2.  Input and output

- let's design and build two helper functions for I/O
- input will be a text file with one youtube url per line - `parse_input_file` implements this
- output will be a csv file - `save_output` implements this

In [None]:
import pandas as pd


def parse_input_file(input_file_path: str) -> list:
    # read one youtube url per line, skipping blank lines
    youtube_urls = []
    with open(input_file_path, 'r') as file:
        for line in file:
            url = line.strip()
            if url:
                youtube_urls.append(url)
    return youtube_urls


def save_output(data: list,
                output_file_path: str) -> None:
    df = pd.DataFrame(data)
    df.to_csv(output_file_path, index=False)
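
- because each entry's `transcript` is a list of dicts, `pd.DataFrame(data)` puts that whole list in a single cell, and the csv ends up holding its Python string form
- one possible alternative, sketched below, is to flatten the entries into one row per transcript segment before saving; `entries_to_segment_rows` is a hypothetical helper, and the column names are an assumption

```python
from typing import Dict, List


def entries_to_segment_rows(entries: List[Dict]) -> List[Dict]:
    """Flatten transcript entries into one flat row per segment."""
    rows = []
    for entry in entries:
        if not entry:
            # get_single_transcript returns {} on failure, so skip empty entries
            continue
        for segment in entry["transcript"]:
            rows.append({
                "youtube_url": entry["youtube_url"],
                "video_id": entry["video_id"],
                "start": segment["start"],
                "duration": segment["duration"],
                "text": segment["text"],
            })
    return rows


# a tiny hand-built example in the same shape the functions above return
example_entries = [
    {
        "youtube_url": "https://www.youtube.com/watch?v=As7abwNhG7Y",
        "video_id": "As7abwNhG7Y",
        "transcript": [
            {"text": "I don't ask if it'll go viral I can", "start": 0.04, "duration": 3.399},
            {"text": "figure out how to make it viral that is", "start": 1.599, "duration": 4.361},
        ],
    },
    {},  # a failed pull
]
example_rows = entries_to_segment_rows(example_entries)
```

- the resulting rows are plain dicts, so they can be passed straight to `save_output(example_rows, output_file_path)`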