scrapeyt

Download YouTube transcripts and clean manually copied ones — one video, an entire playlist, or a batch of files.

Features

Single video: download a clean transcript from any YouTube URL
Playlist bulk download: fetches all videos in a playlist, one file per video inside a named folder
Manual transcript cleaner: strips timestamps and chapter headers from transcripts copied by hand from YouTube — single file or entire folder in one pass
Resumes automatically: already-downloaded files are skipped, safe to re-run after an interruption
GDPR-aware: handles the YouTube cookie-consent redirect (common in Switzerland / EU) automatically
IP block protection: browser cookies + optional proxy support; aborts immediately on block with actionable fix instructions

Requirements

Python 3.10+

pip install -r requirements.txt

Or with uv:

uv pip install -r requirements.txt

Usage

Interactive mode (no arguments)

python scrapeyt.py

Prompts for a URL on each run. Default language: en.

CLI flags

| Flag | Short | Description |
| --- | --- | --- |
| `--url URL` | `-u` | YouTube video or playlist URL to download |
| `--input PATH` | `-i` | File or folder of `.txt` transcripts to clean |
| `--output PATH` | `-o` | Custom output file or directory (optional) |
| `--language LANG` | `-l` | Comma-separated language codes in order of preference (default: `en`) |
| `--help` | `-h` | Show usage and exit |
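
For readers who want to see how these flags fit together, here is a minimal argparse sketch. It mirrors the table above but is not taken from scrapeyt.py; the helper names `build_parser` and `split_languages` are illustrative assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags listed in the table (-h is added by argparse)."""
    parser = argparse.ArgumentParser(prog="scrapeyt.py")
    parser.add_argument("-u", "--url", help="YouTube video or playlist URL to download")
    parser.add_argument("-i", "--input", help="File or folder of .txt transcripts to clean")
    parser.add_argument("-o", "--output", help="Custom output file or directory")
    parser.add_argument(
        "-l", "--language",
        default="en",
        help="Comma-separated language codes in order of preference",
    )
    return parser


def split_languages(value: str) -> list[str]:
    """Turn 'it,en' into ['it', 'en'], dropping empty entries (hypothetical helper)."""
    return [code.strip() for code in value.split(",") if code.strip()]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-u", "https://youtube.com/watch?v=VIDEO_ID", "-l", "it,en"])
languages = split_languages(args.language)
```

Splitting the language string after parsing keeps the flag a plain string on the command line while giving the rest of the program an ordered list.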

Download a single video

python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID"
python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID" -l it
python scrapeyt.py -u "https://youtube.com/watch?v=VIDEO_ID" -l it,en -o my_transcript.txt

Download a full playlist

python scrapeyt.py -u "https://youtube.com/playlist?list=PLAYLIST_ID"
python scrapeyt.py -u "https://youtube.com/playlist?list=PLAYLIST_ID" -l it -o ./my_folder
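
The same `-u` flag accepts both video and playlist URLs, so the script has to tell them apart. One way to do that with the standard library is sketched below. The function name `classify_url` and the support for `youtu.be` short links are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlparse


def classify_url(url: str) -> tuple[str, str]:
    """Return ('playlist', id) or ('video', id); raise ValueError otherwise."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    path = parsed.path.rstrip("/")
    if path == "/playlist" and "list" in query:
        return "playlist", query["list"][0]
    if path == "/watch" and "v" in query:
        return "video", query["v"][0]
    # Assumption: short links of the form https://youtu.be/<id> are also videos.
    if parsed.netloc.endswith("youtu.be") and parsed.path.strip("/"):
        return "video", parsed.path.strip("/")
    raise ValueError(f"Not a recognised YouTube video or playlist URL: {url}")
```

In this sketch a watch URL that also carries a `list=` parameter is treated as a single video; whether scrapeyt.py does the same is not stated here.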

Clean a manually copied transcript

YouTube's transcript copy format concatenates each line's timestamp (0:03), its human-readable equivalent (3 seconds, 3 secondi, …), and the spoken text — all without any separator. Chapter and section headers are interspersed between lines. This mode strips all of that.

The cleaner is language-agnostic: it detects the timestamp structure (minutes:seconds, as in 0:03) rather than matching language-specific time words, so it works for any YouTube UI language.
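
To make the structural idea concrete, here is a partial sketch that separates caption lines from chapter headers by checking for a leading timestamp only. It is not the script's actual cleaner: the regular expression and the name `split_caption_lines` are assumptions, and it deliberately leaves the fused duration phrase (such as "3 seconds") in place, since how the real cleaner finds that boundary is not described here.

```python
import re

# A timestamp at the start of a line: M:SS, MM:SS, or H:MM:SS.
TIMESTAMP = re.compile(r"^\s*(?:\d+:)?\d{1,2}:\d{2}")


def split_caption_lines(raw: str) -> list[str]:
    """Keep lines that begin with a timestamp, with the timestamp removed.

    Lines without a leading timestamp are treated as chapter headers and dropped.
    """
    kept = []
    for line in raw.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            kept.append(line[match.end():])
    return kept


sample = "Intro\n0:033 secondsHello everyone\nChapter 2\n1:051 minute, 5 secondsnext line"
captions = split_caption_lines(sample)
```

Because the check relies only on digits and colons, it behaves the same whatever language the time words are written in.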

Single file:

python scrapeyt.py -i "transcript.txt"
python scrapeyt.py -i "transcript.txt" -o clean.txt

Entire folder (batch):

python scrapeyt.py -i ./my_transcripts/
python scrapeyt.py -i ./my_transcripts/ -o ./cleaned/

All .txt files in the folder are processed; each output file is named <original>_cleaned.txt.

Output defaults

All files are written inside an output/ folder that is created automatically next to the script.

| Mode | Default output path |
| --- | --- |
| Single video | `output/<video_id>_<title>.txt` |
| Playlist | `output/<playlist title>/<video title>.txt` |
| `--input` single file | `output/<original name>_cleaned.txt` |
| `--input` folder | `output/<original name>_cleaned.txt` for each file |

Use -o to override the destination to any file or directory.
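
The naming rule for cleaned files and the skip-if-present behaviour can be sketched with pathlib as follows. Both helper names are hypothetical. The sketch also assumes that an empty file does not count as already downloaded, and it uses a relative `output` path, whereas the script places the folder next to itself.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # assumption: relative stand-in for the folder next to the script


def default_cleaned_path(source: Path, output_dir: Path = OUTPUT_DIR) -> Path:
    """Map 'notes.txt' to 'output/notes_cleaned.txt'."""
    return output_dir / f"{source.stem}_cleaned.txt"


def needs_download(target: Path) -> bool:
    """Return False when the target already exists and is non-empty."""
    return not (target.exists() and target.stat().st_size > 0)
```

A check like `needs_download` is what makes a re-run after an interruption cheap: finished files are skipped and only the missing ones are fetched.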

Avoiding IP blocks

YouTube rate-limits unauthenticated transcript requests (HTTP 429). When a block is detected, the script stops immediately with a clear message instead of retrying requests that will keep failing.
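
The fail-fast behaviour might look roughly like the sketch below. The exception name `IPBlockedError`, the helper names, and the `fetch_status` callable are all illustrative assumptions; the real script's error handling may be structured differently.

```python
from collections.abc import Callable, Iterable


class IPBlockedError(RuntimeError):
    """Raised when a response indicates rate limiting (hypothetical name)."""


def check_response(status_code: int) -> None:
    """Abort on HTTP 429 instead of retrying."""
    if status_code == 429:
        raise IPBlockedError(
            "YouTube returned HTTP 429 (rate limited). "
            "Try browser cookies, a new IP, or a proxy, then re-run."
        )


def download_all(video_ids: Iterable[str], fetch_status: Callable[[str], int]) -> list[str]:
    """Process videos in order and stop at the first block.

    fetch_status is a stand-in that returns an HTTP status code for a video ID.
    """
    done = []
    for video_id in video_ids:
        check_response(fetch_status(video_id))
        done.append(video_id)
    return done
```

Raising on the first 429 avoids spending further requests on an address that is already blocked.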

Layer 1 — Browser cookies (always active)

rookiepy reads cookies directly from Chrome, Edge, Firefox, or Brave. No setup needed as long as you are logged into YouTube in one of those browsers.

Alternatively, place a cookies.txt file (Netscape format) next to the script:

Install "Get cookies.txt LOCALLY" (Chrome/Edge) or "cookies.txt" (Firefox)
Open youtube.com while logged in
Export cookies → save as cookies.txt next to scrapeyt.py
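
A Netscape-format cookies.txt can be read with Python's standard library, as in this minimal sketch. It only illustrates the file format; `load_cookies` is a hypothetical helper and the script may load the file differently.

```python
from http.cookiejar import MozillaCookieJar
from pathlib import Path


def load_cookies(path: Path) -> MozillaCookieJar:
    """Load a Netscape-format cookies.txt, keeping session and expired cookies."""
    jar = MozillaCookieJar(str(path))
    # ignore_discard keeps session cookies; ignore_expires keeps stale ones
    # so that an old export still loads without raising.
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar
```

The returned jar can then be handed to whichever HTTP client makes the requests.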

If you still get blocked

Three options, in order of effort:

Restart your router — most ISPs assign a new IP on reconnect. Turn it off, wait ~30 seconds, turn it back on, then re-run. Already-downloaded files are skipped automatically.
Use a VPN — connect first, then set PROXY_URL at the top of scrapeyt.py:
```
PROXY_URL = "socks5://127.0.0.1:1080"
```
Webshare rotating residential proxy (most reliable) — sign up at webshare.io, buy a "Residential" plan (not "Proxy Server" or "Static Residential"), then set the credentials at the top of scrapeyt.py:
```
WEBSHARE_USERNAME = "your-proxy-username"
WEBSHARE_PASSWORD = "your-proxy-password"
```
Webshare rotates the exit IP on every request, so a blocked IP is not reused on the next attempt.

Notes

YouTube only renders the first ~100 videos in a playlist page server-side. For playlists longer than that, only the first ~100 videos are downloaded.
Transcripts are auto-generated captions; availability depends on the video.
