lschd/scrapeyt

scrapeyt

Download YouTube transcripts and clean manually copied ones — one video, an entire playlist, or a batch of files.

Features

  • Single video: download a clean transcript from any YouTube URL
  • Playlist bulk download: fetches all videos in a playlist, one file per video inside a named folder
  • Manual transcript cleaner: strips timestamps and chapter headers from transcripts copied by hand from YouTube — single file or entire folder in one pass
  • Resumes automatically: already-downloaded files are skipped, so it is safe to re-run after an interruption
  • GDPR-aware: handles the YouTube cookie-consent redirect (common in Switzerland / EU) automatically
  • IP block protection: browser cookies + optional proxy support; aborts immediately on block with actionable fix instructions
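The resume behaviour described above can be sketched as a simple existence check before each download. This is a minimal illustration, not the script's actual code: the function name `pending_downloads` and the `<video_id>.txt` naming are assumptions made for the example.

```python
from pathlib import Path


def pending_downloads(video_ids, out_dir):
    """Yield (video_id, target_path) for videos that have no saved transcript yet."""
    out_dir = Path(out_dir)
    for video_id in video_ids:
        # Hypothetical naming scheme; the real script also uses the video title.
        target = out_dir / f"{video_id}.txt"
        if target.exists() and target.stat().st_size > 0:
            continue  # a non-empty file from an earlier run: skip it
        yield video_id, target
```

Checking for a non-empty file, rather than mere existence, means a file truncated by an interruption is downloaded again on the next run.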

Requirements

Python 3.10+

pip install -r requirements.txt

Or with uv:

uv pip install -r requirements.txt

Usage

Interactive mode (no arguments)

python scrapeyt.py

Prompts for a URL on each run. Default language: en.

CLI flags

Flag             Short  Description
--url URL        -u     YouTube video or playlist URL to download
--input PATH     -i     File or folder of .txt transcripts to clean
--output PATH    -o     Custom output file or directory (optional)
--language LANG  -l     Comma-separated language codes in order of preference (default: en)
--help           -h     Show usage and exit
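
For readers who want to see how such a flag set fits together, here is a minimal sketch using `argparse` from the standard library. It assumes the script parses its flags this way, which the README does not state; `build_parser` and `parse_languages` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags listed above (assumed layout)."""
    parser = argparse.ArgumentParser(prog="scrapeyt.py")
    parser.add_argument("-u", "--url", metavar="URL",
                        help="YouTube video or playlist URL to download")
    parser.add_argument("-i", "--input", metavar="PATH",
                        help="File or folder of .txt transcripts to clean")
    parser.add_argument("-o", "--output", metavar="PATH",
                        help="Custom output file or directory (optional)")
    parser.add_argument("-l", "--language", metavar="LANG", default="en",
                        help="Comma-separated language codes in order of preference")
    return parser


def parse_languages(value):
    """Split 'it,en' into ['it', 'en'], keeping the order of preference."""
    return [code.strip() for code in value.split(",") if code.strip()]
```

With this sketch, `-l it,en` yields `["it", "en"]`, so a caller can try Italian first and fall back to English.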

Download a single video

python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID"
python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID" -l it
python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID" -l it,en -o my_transcript.txt

Download a full playlist

python scrapeyt.py -u "https://youtube.com/playlist?list=PLAYLIST_ID"
python scrapeyt.py -u "https://youtube.com/playlist?list=PLAYLIST_ID" -l it -o ./my_folder

Clean a manually copied transcript

YouTube's transcript copy format concatenates each line's timestamp (0:03), its human-readable equivalent (3 seconds, 3 secondi, …), and the spoken text — all without any separator. Chapter and section headers are interspersed between lines. This mode strips all of that.

The cleaner is language-agnostic: it detects the timestamp structure (MM:SS) rather than matching language-specific time words, so it works for any YouTube UI language.
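
The idea of keying on the timestamp structure can be sketched as follows. This is an illustration under assumptions, not the script's implementation: the pattern, `split_timestamp`, and `drop_headers` are hypothetical, and the sketch only removes the leading timestamp and the header lines. It does not attempt to remove the fused human-readable duration.

```python
import re

# Assumed pattern: optional hours, then minutes and two-digit seconds (0:03, 12:45, 1:02:03).
TIMESTAMP = re.compile(r"^(?:(\d+):)?(\d{1,2}):(\d{2})")


def split_timestamp(line):
    """Return (total_seconds, rest_of_line) if the line starts with a timestamp, else None."""
    line = line.strip()
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total, line[match.end():]


def drop_headers(lines):
    """Keep only lines that begin with a timestamp; chapter headers and blanks are dropped."""
    kept = []
    for line in lines:
        parts = split_timestamp(line)
        if parts is not None:
            kept.append(parts[1])
    return kept
```

Because the check looks only at digits and colons, it behaves the same whether the YouTube interface was in English, Italian, or any other language.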

Single file:

python scrapeyt.py -i "transcript.txt"
python scrapeyt.py -i "transcript.txt" -o clean.txt

Entire folder (batch):

python scrapeyt.py -i ./my_transcripts/
python scrapeyt.py -i ./my_transcripts/ -o ./cleaned/

All .txt files in the folder are processed; each output file is named <original>_cleaned.txt.
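
The batch naming rule can be sketched with `pathlib`. The helper `cleaned_targets` is a hypothetical name, and the sketch assumes `<original>` means the file name without its `.txt` extension.

```python
from pathlib import Path


def cleaned_targets(input_dir, output_dir):
    """Pair every .txt file in input_dir with its <original>_cleaned.txt destination."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        (src, output_dir / f"{src.stem}_cleaned.txt")  # assumes <original> is the stem
        for src in sorted(input_dir.glob("*.txt"))     # non-.txt files are ignored
    ]
```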


Output defaults

All files are written inside an output/ folder that is created automatically next to the script.

Mode                 Default output path
Single video         output/<video_id>_<title>.txt
Playlist             output/<playlist title>/<video title>.txt
--input single file  output/<original name>_cleaned.txt
--input folder       output/<original name>_cleaned.txt for each file

Use -o to override the destination to any file or directory.


Avoiding IP blocks

YouTube rate-limits unauthenticated transcript requests (HTTP 429). When a block is detected the script stops immediately with a clear message — it never hangs retrying dead paths.
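
The fail-fast behaviour can be sketched like this. It is a minimal illustration with hypothetical names (`IPBlockedError`, `download_all`, and the `fetch` and `save` callbacks are not taken from the script); the only detail drawn from the text above is that HTTP 429 signals the block.

```python
class IPBlockedError(RuntimeError):
    """Raised when a transcript request is rejected with HTTP 429."""


def download_all(video_ids, fetch, save):
    """Fetch transcripts in order and stop at the first 429 instead of retrying.

    fetch(video_id) is assumed to return (status_code, text);
    save(video_id, text) is assumed to persist one transcript.
    """
    for video_id in video_ids:
        status, text = fetch(video_id)
        if status == 429:
            # No retry loop: remaining videos would hit the same block.
            raise IPBlockedError(
                f"Blocked while fetching {video_id}. Change IP or configure a proxy, then re-run."
            )
        save(video_id, text)
```

Saving each transcript as soon as it arrives is what makes a later re-run cheap: everything fetched before the block is already on disk.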

Layer 1 — Browser cookies (always active)

rookiepy reads cookies directly from Chrome, Edge, Firefox, or Brave. No setup needed as long as you are logged into YouTube in one of those browsers.

Alternatively, place a cookies.txt file (Netscape format) next to the script:

  1. Install "Get cookies.txt LOCALLY" (Chrome/Edge) or "cookies.txt" (Firefox)
  2. Open youtube.com while logged in
  3. Export cookies → save as cookies.txt next to scrapeyt.py
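
A Netscape-format `cookies.txt` can be read with Python's standard library. The sketch below shows one way to do it; `load_cookies` is a hypothetical helper, and whether the script loads the file this way is an assumption.

```python
from http.cookiejar import MozillaCookieJar
from pathlib import Path


def load_cookies(path="cookies.txt"):
    """Return a cookie jar loaded from a Netscape-format file, or None if the file is absent."""
    path = Path(path)
    if not path.is_file():
        return None
    jar = MozillaCookieJar(str(path))
    # Keep session cookies and expired entries so the export is loaded as-is.
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar
```

The file must begin with the `# Netscape HTTP Cookie File` header line, which the browser extensions listed above write automatically.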

If you still get blocked

Three options, in order of effort:

  1. Restart your router — most ISPs assign a new IP on reconnect. Turn it off, wait ~30 seconds, turn it back on, then re-run. Already-downloaded files are skipped automatically.

  2. Use a VPN — connect first, then set PROXY_URL at the top of scrapeyt.py:

    PROXY_URL = "socks5://127.0.0.1:1080"
  3. Webshare rotating residential proxy (most reliable) — sign up at webshare.io, buy a "Residential" plan (not "Proxy Server" or "Static Residential"), then set the credentials at the top of scrapeyt.py:

    WEBSHARE_USERNAME = "your-proxy-username"
    WEBSHARE_PASSWORD = "your-proxy-password"

    Webshare rotates the exit IP on every request, so a blocked IP is not reused for the next one.
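
As an illustration of how a `PROXY_URL` value is typically consumed, the sketch below turns it into the scheme-to-proxy mapping accepted by HTTP clients such as `requests` (its `proxies` parameter). `build_proxies` is a hypothetical helper, and the README does not say which HTTP client the script uses.

```python
def build_proxies(proxy_url):
    """Map one proxy URL onto both schemes, or return None when no proxy is set."""
    if not proxy_url:
        return None  # empty string or None: connect directly
    return {"http": proxy_url, "https": proxy_url}


# Example value taken from the VPN option above.
PROXY_URL = "socks5://127.0.0.1:1080"
proxies = build_proxies(PROXY_URL)
```

With `requests`, a `socks5://` URL additionally needs the optional SOCKS dependency (`pip install "requests[socks]"`).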


Notes

  • YouTube only renders the first ~100 videos of a playlist page server-side, so playlists longer than ~100 videos are only partially downloaded.
  • Transcripts are auto-generated captions; availability depends on the video.

