Download YouTube transcripts and clean manually copied ones — one video, an entire playlist, or a batch of files.
- Single video: download a clean transcript from any YouTube URL
- Playlist bulk download: fetches the videos in a playlist (only the first ~100 for long playlists), one file per video inside a folder named after the playlist
- Manual transcript cleaner: strips timestamps and chapter headers from transcripts copied by hand from YouTube — single file or entire folder in one pass
- Resumes automatically: already-downloaded files are skipped, safe to re-run after an interruption
- GDPR-aware: handles the YouTube cookie-consent redirect (common in Switzerland / EU) automatically
- IP block protection: browser cookies + optional proxy support; aborts immediately on block with actionable fix instructions
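The resume behaviour listed above can be pictured as an existence check run before each download. The following is a minimal sketch, not the script's actual code: `pending_videos` and the `planned` mapping are hypothetical names introduced for illustration, and it assumes a transcript counts as downloaded as soon as its output file exists.

```python
from pathlib import Path


def pending_videos(planned: dict[str, Path]) -> list[str]:
    """Return the video IDs whose transcript file does not exist yet.

    `planned` maps a video ID to the path its transcript would be written to
    (hypothetical structure, used here only to illustrate the idea).
    """
    return [video_id for video_id, path in planned.items() if not path.is_file()]
```

Under this assumption, re-running after an interruption only touches the IDs returned by `pending_videos`; everything already on disk is left alone.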
Python 3.10+
pip install -r requirements.txt
Or with uv:
uv pip install -r requirements.txt
python scrapeyt.py
With no arguments, the script prompts for a URL on each run. Default language: en.
| Flag | Short | Description |
|---|---|---|
| --url URL | -u | YouTube video or playlist URL to download |
| --input PATH | -i | File or folder of .txt transcripts to clean |
| --output PATH | -o | Custom output file or directory (optional) |
| --language LANG | -l | Comma-separated language codes in order of preference (default: en) |
| --help | -h | Show usage and exit |
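For readers who want to see how a flag set like this fits together, here is a rough sketch using `argparse` from the standard library. It is an assumption that the script parses its arguments this way; `build_parser` and `language_preferences` are hypothetical helper names, and only the flags and the `en` default come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; argparse adds -h/--help on its own."""
    parser = argparse.ArgumentParser(prog="scrapeyt.py")
    parser.add_argument("-u", "--url", metavar="URL",
                        help="YouTube video or playlist URL to download")
    parser.add_argument("-i", "--input", metavar="PATH",
                        help="File or folder of .txt transcripts to clean")
    parser.add_argument("-o", "--output", metavar="PATH",
                        help="Custom output file or directory (optional)")
    parser.add_argument("-l", "--language", metavar="LANG", default="en",
                        help="Comma-separated language codes in order of preference")
    return parser


def language_preferences(raw: str) -> list[str]:
    """Split 'it,en' into ['it', 'en'], dropping blanks and stray spaces."""
    return [code.strip() for code in raw.split(",") if code.strip()]


args = build_parser().parse_args(
    ["-u", "https://youtube.com/watch?v=VIDEO_ID", "-l", "it,en"]
)
print(language_preferences(args.language))  # ['it', 'en']
```

Keeping the language list as an ordered sequence preserves the "order of preference" semantics: the first code that has a transcript would win.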
python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID"
python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID" -l it
python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID" -l it,en -o my_transcript.txt
python scrapeyt.py -u "https://youtube.com/playlist?list=PLAYLIST_ID"
python scrapeyt.py -u "https://youtube.com/playlist?list=PLAYLIST_ID" -l it -o ./my_folder
YouTube's transcript copy format concatenates each line's timestamp (0:03), its human-readable equivalent (3 seconds, 3 secondi, …), and the spoken text — all without any separator. Chapter and section headers are interspersed between lines. This mode strips all of that.
The cleaner is language-agnostic: it detects the timestamp structure (MM:SS) rather than matching language-specific time words, so it works for any YouTube UI language.
Single file:
python scrapeyt.py -i "transcript.txt"
python scrapeyt.py -i "transcript.txt" -o clean.txt
Entire folder (batch):
python scrapeyt.py -i ./my_transcripts/
python scrapeyt.py -i ./my_transcripts/ -o ./cleaned/
All .txt files in the folder are processed; each output file is named <original>_cleaned.txt.
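The naming rule for batch mode can be sketched with `pathlib`. This is an illustration under stated assumptions, not the script's own code: `cleaned_targets` is a hypothetical helper, and it assumes the folder is scanned non-recursively.

```python
from pathlib import Path


def cleaned_targets(folder: Path, out_dir: Path) -> dict[Path, Path]:
    """Map every .txt file in `folder` to `<original>_cleaned.txt` in `out_dir`.

    Only the top level of `folder` is scanned (an assumption of this sketch).
    """
    return {
        src: out_dir / f"{src.stem}_cleaned.txt"
        for src in sorted(folder.glob("*.txt"))
    }
```

For a folder containing `talk.txt`, calling `cleaned_targets(Path("my_transcripts"), Path("cleaned"))` would pair it with `cleaned/talk_cleaned.txt`; files with other extensions are ignored.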
All files are written inside an output/ folder that is created automatically next to the script.
| Mode | Default output path |
|---|---|
| Single video | output/<video_id>_<title>.txt |
| Playlist | output/<playlist title>/<video title>.txt |
| --input single file | output/<original name>_cleaned.txt |
| --input folder | output/<original name>_cleaned.txt for each file |
Use -o to override the destination to any file or directory.
YouTube rate-limits unauthenticated transcript requests (HTTP 429). When a block is detected the script stops immediately with a clear message — it never hangs retrying dead paths.
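The fail-fast behaviour described above might look roughly like the sketch below. All names here (`IPBlockedError`, `check_response`, `download_all`, `fetch_status`) are hypothetical, and the sketch assumes a block shows up as an HTTP 429 status code; the real script may detect blocks differently.

```python
from typing import Callable, Iterable


class IPBlockedError(RuntimeError):
    """Raised when YouTube answers with a rate-limit response."""


def check_response(status_code: int) -> None:
    """Abort on HTTP 429 instead of retrying."""
    if status_code == 429:
        raise IPBlockedError(
            "YouTube returned HTTP 429: this IP is rate-limited. "
            "Log into YouTube in your browser, or configure a proxy, then re-run."
        )


def download_all(video_ids: Iterable[str],
                 fetch_status: Callable[[str], int]) -> list[str]:
    """Process videos in order and stop at the first block.

    `fetch_status` stands in for the real request and returns its status code.
    """
    done = []
    for video_id in video_ids:
        check_response(fetch_status(video_id))  # raises on the first 429
        done.append(video_id)
    return done
```

Raising on the first 429 means no further requests are sent from a blocked IP, and the skip-on-rerun behaviour picks up from where the run stopped.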
rookiepy reads cookies directly from Chrome, Edge, Firefox, or Brave. No setup needed as long as you are logged into YouTube in one of those browsers.
Alternatively, place a cookies.txt file (Netscape format) next to the script:
- Install "Get cookies.txt LOCALLY" (Chrome/Edge) or "cookies.txt" (Firefox)
- Open youtube.com while logged in
- Export cookies → save as cookies.txt next to scrapeyt.py
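A Netscape-format cookies.txt can be read with the standard library alone. The sketch below shows one way to do it; `load_cookies_txt` is a hypothetical helper, and whether the script loads the file this way is an assumption.

```python
from http.cookiejar import MozillaCookieJar
from pathlib import Path
from typing import Optional


def load_cookies_txt(path: Path) -> Optional[MozillaCookieJar]:
    """Load a Netscape-format cookies.txt, or return None if the file is absent."""
    if not path.is_file():
        return None
    jar = MozillaCookieJar(str(path))
    # Keep session cookies and cookies whose expiry has passed; browser
    # exports often contain both.
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar
```

Returning `None` for a missing file lets the caller fall back to another cookie source without treating the absence as an error.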
Three options, in order of effort:
- Restart your router: most ISPs assign a new IP on reconnect. Turn it off, wait ~30 seconds, turn it back on, then re-run. Already-downloaded files are skipped automatically.
- Use a VPN: connect first, then set PROXY_URL at the top of scrapeyt.py:

      PROXY_URL = "socks5://127.0.0.1:1080"

- Webshare rotating residential proxy (most reliable): sign up at webshare.io, buy a "Residential" plan (not "Proxy Server" or "Static Residential"), then set the credentials at the top of scrapeyt.py:

      WEBSHARE_USERNAME = "your-proxy-username"
      WEBSHARE_PASSWORD = "your-proxy-password"

Webshare rotates the exit IP on every request, so a blocked IP is never retried.
- YouTube only renders the first ~100 videos in a playlist page server-side. Playlists longer than ~100 videos will be partially downloaded.
- Transcripts are auto-generated captions; availability depends on the video.