<a href="https://colab.research.google.com/github/jermwatt/youtube_transcript_downloader/blob/main/transcript_downloader_walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os

# if running in Colab, pull the repo and install requirements
if os.getenv("COLAB_RELEASE_TAG"):
    !git clone https://github.com/jermwatt/youtube_transcript_downloader.git
    %cd youtube_transcript_downloader
    !pip install -r requirements.txt

# Youtube transcript downloader

The problem:
- this pro youtuber [has downloaded THOUSANDS of youtube shorts transcripts BY HAND](https://youtu.be/As7abwNhG7Y?si=dSzEf6Hk0glROYhu&t=340) to analyze them
- each one required adjusting the shorts url to a standard format (where transcripts live), clicking through buttons to show the transcript, and then copying and pasting that text somewhere
- let's say that takes 60 secs per video - for 6000 videos that's 100 hours of manual labor
- let's help anyone else in the future who wants to do this save a bunch of time


Let's build:
- a simple app you can run locally or via a free service
- it lets you upload a .txt file of youtube shorts urls, transforms each url to the standard youtube viewer format, and extracts the transcripts
- once complete, it lets you download a .csv or .xlsx file with the transcripts nicely organized by url / video id

We can build this in 3 steps.  These are:

1.  I/O functionality to upload shorts urls and download associated transcripts
2.  type and formatting checks to ensure user inputs are proper shorts urls
3.  a way to pull youtube shorts transcripts

# 1.  I/O functionality

- first let's design simple I/O functionality
- we'll allow our users to upload `.txt` files containing shorts urls
- one url per line
- an input `.txt` file can look like this

```text
https://www.youtube.com/shorts/xaRRZKgj5aQ
https://www.youtube.com/shorts/xK9_V9LF4PE
...
```

- our function for loading input will be called `parse_input_file`
- it takes in a file path, reads in the file, and stores each line in an output list
- our function for saving output will be called `save_output`
- it will save the output - which will contain urls, video ids, and video transcripts - as a `.csv` file using `pandas`

In [2]:
import pandas as pd


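def parse_input_file(input_file_path: str) -> list:
    # read the input file and collect one url per line, stripped of surrounding whitespace
    youtube_urls = []
    with open(input_file_path, "r") as file:
        for line in file:
            youtube_urls.append(line.strip())
    return youtube_urls


def save_output(data: list, output_file_path: str) -> None:
    # each entry (a dict) becomes one row; the dict keys become the column names
    df = pd.DataFrame(data)
    df.to_csv(output_file_path, index=False)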
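- real input files often contain blank lines or repeated urls
- a minimal sketch of one way to handle that: a hypothetical `read_unique_urls` helper (not part of the repo, name and behavior are assumptions for illustration) that skips blanks and drops duplicates while keeping the original order

```python
import os
import tempfile


def read_unique_urls(input_file_path: str) -> list:
    # hypothetical helper: one url per line, skipping blank lines and repeats
    seen = set()
    urls = []
    with open(input_file_path, "r") as file:
        for line in file:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# write a small throwaway input file to try it on
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_input.txt")
    with open(demo_path, "w") as file:
        file.write(
            "https://www.youtube.com/shorts/xaRRZKgj5aQ\n"
            "\n"
            "https://www.youtube.com/shorts/xK9_V9LF4PE\n"
            "https://www.youtube.com/shorts/xaRRZKgj5aQ\n"
        )
    demo_urls = read_unique_urls(demo_path)

print(demo_urls)
```

- the blank line and the repeated first url are dropped, leaving two urls in their original order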

# 2. Type and formatting checks 

- we need to ensure that input contains valid shorts urls
- no point in trying to pull transcripts from an invalid url
- a single function `is_valid_youtube_shorts_url` checks that each line is a string, then uses a regular expression to validate that it matches the shorts url format

In [3]:
import re


def is_valid_youtube_shorts_url(url: str) -> bool:
    if not isinstance(url, str):
        return False
    pattern = r"^https://www\.youtube\.com/shorts/[A-Za-z0-9_-]{11}$"  # youtube video ids are always 11 chars long
    return re.match(pattern, url) is not None
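- one subtlety: with `re.match`, a trailing `$` also matches just before a final newline, so an unstripped line would still pass the check
- a minimal sketch of a stricter variant that also hands back the video id, using `re.fullmatch` and a capture group; `extract_shorts_video_id` and `SHORTS_PATTERN` are hypothetical names, not part of the repo

```python
import re
from typing import Optional

# same url shape as the pattern above, with the 11-char video id captured
SHORTS_PATTERN = re.compile(r"https://www\.youtube\.com/shorts/([A-Za-z0-9_-]{11})")


def extract_shorts_video_id(url: object) -> Optional[str]:
    # return the video id for a valid shorts url, otherwise None
    if not isinstance(url, str):
        return None
    match = SHORTS_PATTERN.fullmatch(url)
    return match.group(1) if match else None


print(extract_shorts_video_id("https://www.youtube.com/shorts/xaRRZKgj5aQ"))
print(extract_shorts_video_id("https://www.youtube.com/shorts/xaRRZKgj5aQ\n"))
print(extract_shorts_video_id("https://www.youtube.com/watch?v=xaRRZKgj5aQ"))
```

- only the first call returns an id; the url with a trailing newline and the non-shorts url both give `None`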

# 3.  Pulling youtube shorts transcripts

- we'll create simple wrappers around the fantastic [youtube_transcript_api](https://github.com/jdepoix/youtube-transcript-api) to pull transcripts; it is itself built on the requests library
- it has a built-in batch mode 🤩 - so we can pull multiple video transcripts at once

- simple functionality below to fetch one or multiple transcripts
    - `get_single_transcript` pulls the transcript of a single video
    - `get_batch_transcripts` pulls transcripts of multiple input videos

- note: we won't have to transform the shorts url as described in [the motivating video](https://youtu.be/As7abwNhG7Y?si=dSzEf6Hk0glROYhu&t=340) - all we need are the video ids to gather the transcripts
- more manual labor saved!
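- for reference, the manual url change from the video amounts to moving the video id into the standard `watch?v=` form
- a minimal sketch, in case you want that standard link in your output; `shorts_to_watch_url` is a hypothetical helper and assumes its input is already a valid shorts url

```python
def shorts_to_watch_url(shorts_url: str) -> str:
    # the video id is the last path segment of a shorts url
    video_id = shorts_url.rstrip("/").split("/")[-1]
    return f"https://www.youtube.com/watch?v={video_id}"


print(shorts_to_watch_url("https://www.youtube.com/shorts/xaRRZKgj5aQ"))
```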

In [4]:
from typing import List, Dict
from youtube_transcript_api import YouTubeTranscriptApi


def get_single_transcript(youtube_url: str) -> dict:
    try:
        if is_valid_youtube_shorts_url(youtube_url):
            video_id = youtube_url.split("/")[-1]
            video_transcript = YouTubeTranscriptApi.get_transcript(video_id)
            entry = {}
            entry["youtube_url"] = youtube_url
            entry["video_id"] = video_id
            entry["transcript"] = video_transcript
            return entry
        else:
            print(f"FAILURE: youtube_url is not valid - {youtube_url}")
            return {}
    except Exception as e:
        print(f"FAILURE: transcript pull for youtube_url - {youtube_url} - failed with exception {e}")
        return {}


def get_batch_transcripts(youtube_urls: List[str]) -> List[Dict]:
    # keep only the valid shorts urls, along with their video ids
    valid_urls = []
    valid_vids = []
    for url in youtube_urls:
        if is_valid_youtube_shorts_url(url):
            valid_urls.append(url)
            valid_vids.append(url.split("/")[-1])
    try:
        # get_transcripts returns a tuple; its first element maps video id -> transcript
        video_transcripts = YouTubeTranscriptApi.get_transcripts(valid_vids, languages=["en"])[0]
        entries = []
        for url, vid in zip(valid_urls, valid_vids):
            entry = {}
            entry["youtube_url"] = url
            entry["video_id"] = vid
            entry["transcript"] = video_transcripts[vid]
            entries.append(entry)
        return entries
    except Exception as e:
        print(f"FAILURE: batch transcription fetch failed with exception {e}")
        return []
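- in `get_batch_transcripts`, any exception inside the `try` block means the whole batch comes back empty
- one possible alternative is to fetch per video and record failures separately
- a minimal sketch of that idea: `fetch_fn` stands in for whatever pulls a single transcript (here a fake, so the example runs without network access); `fetch_each` and `fake_fetch` are hypothetical names, not part of the repo or the library

```python
from typing import Callable, Dict, List, Tuple


def fetch_each(
    video_ids: List[str], fetch_fn: Callable[[str], list]
) -> Tuple[Dict[str, list], Dict[str, str]]:
    # try every video id on its own so one failure does not sink the rest
    transcripts = {}
    failures = {}
    for video_id in video_ids:
        try:
            transcripts[video_id] = fetch_fn(video_id)
        except Exception as e:
            failures[video_id] = str(e)
    return transcripts, failures


def fake_fetch(video_id: str) -> list:
    # stand-in for a real transcript pull; fails for one made-up id
    if video_id == "missing_id_":
        raise ValueError("no transcript available")
    return [{"text": f"hello from {video_id}", "start": 0.0, "duration": 1.0}]


transcripts, failures = fetch_each(["xaRRZKgj5aQ", "missing_id_"], fake_fetch)
print(sorted(transcripts))
print(failures)
```

- the good id still gets its transcript, and the failing id is reported with its error message instead of emptying the batch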

In [5]:
# test transcript pull functionality with a single test video
youtube_url_1 = "https://www.youtube.com/shorts/xaRRZKgj5aQ"
single_transcript = get_single_transcript(youtube_url_1)

- test our functionality with two videos

In [6]:
# test transcript pull functionality with two test videos
youtube_url_1 = "https://www.youtube.com/shorts/xaRRZKgj5aQ"
youtube_url_2 = "https://www.youtube.com/shorts/xK9_V9LF4PE"
youtube_urls = [youtube_url_1, youtube_url_2]
single_transcript = get_single_transcript(youtube_url_1)
batch_transcripts = get_batch_transcripts(youtube_urls)

In [7]:
# print the first few hundred characters of return object
print(str(single_transcript)[:400])

{'youtube_url': 'https://www.youtube.com/shorts/xaRRZKgj5aQ', 'video_id': 'xaRRZKgj5aQ', 'transcript': [{'text': 'what happens if you pull a shot of', 'start': 0.08, 'duration': 2.839}, {'text': 'espresso through an orange I thought', 'start': 1.24, 'duration': 3.119}, {'text': 'this could potentially taste pretty good', 'start': 2.919, 'duration': 2.801}, {'text': "so let's try it out and see if 
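- each transcript is a list of segments, each with `text`, `start`, and `duration` keys, as the printout above shows
- a minimal sketch for collapsing those segments into one string; `transcript_to_text` is a hypothetical helper, and the sample segments are copied from the output above

```python
from typing import Dict, List


def transcript_to_text(segments: List[Dict]) -> str:
    # join the text of every segment, in order, separated by single spaces
    return " ".join(segment["text"].strip() for segment in segments)


sample_segments = [
    {"text": "what happens if you pull a shot of", "start": 0.08, "duration": 2.839},
    {"text": "espresso through an orange I thought", "start": 1.24, "duration": 3.119},
    {"text": "this could potentially taste pretty good", "start": 2.919, "duration": 2.801},
]
print(transcript_to_text(sample_segments))
```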


# Putting it all together

- let's put it all together
- we upload a test file of input youtube shorts urls (we'll print it out below)
- then we use `get_batch_transcripts`
- finally we save the result using `save_output`

In [11]:
# print out first few lines of input
with open("data/input/test_input.txt") as myfile:
    first_few_lines = myfile.readlines(1024)[0:3]
print(first_few_lines)

['https://www.youtube.com/shorts/xaRRZKgj5aQ\n', 'https://www.youtube.com/shorts/xK9_V9LF4PE\n', 'https://www.youtube.com/shorts/1RV8RhUvJjo\n']


In [8]:
# test functionality
input_file_path = "data/input/test_input.txt"
output_file_path = "data/output/test_output.csv"
youtube_urls = parse_input_file(input_file_path)
batch_transcripts = get_batch_transcripts(youtube_urls)
save_output(batch_transcripts, output_file_path)

- let's check the output by printing out the first few rows

In [9]:
pd.read_csv(output_file_path).head(2)

,youtube_url,video_id,transcript
0,https://www.youtube.com/shorts/xaRRZKgj5aQ,xaRRZKgj5aQ,[{'text': 'what happens if you pull a shot of'...
1,https://www.youtube.com/shorts/xK9_V9LF4PE,xK9_V9LF4PE,[{'text': 'here are snacks I packed for my fli...
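- note that the `transcript` column comes back from the `.csv` as a string (the printed form of the list), not as a list
- a minimal sketch of turning such a string back into Python objects with `ast.literal_eval`; the `cell` value is a shortened stand-in for one entry of that column, built from the first segment shown earlier

```python
import ast

# shortened stand-in for one value of the transcript column
cell = "[{'text': 'what happens if you pull a shot of', 'start': 0.08, 'duration': 2.839}]"

# literal_eval parses Python literals only, so it does not execute arbitrary code
segments = ast.literal_eval(cell)
print(type(segments).__name__, segments[0]["text"])
```

- with pandas, the same conversion could be applied to the whole column via `df["transcript"].apply(ast.literal_eval)`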


# Streamlit app

- this functionality is wrapped up into an easy-to-use streamlit app
- to run it locally, just type this at your terminal

```bash
python -m streamlit run youtube_shorts_transcript_downloader/app.py
```

- to run via HF Spaces click this badge [![HuggingFace Space](https://img.shields.io/badge/🤗-HuggingFace%20Space-cyan.svg)](https://huggingface.co/spaces/neonwatty/youtube_shorts_transcript_downloader)