# 📹 YouTube Transcript Summarizer with Oxylabs Proxy

This notebook:
1️⃣ Loads Oxylabs credentials from a **.env** file.
2️⃣ Pulls the transcript of each YouTube video via `youtube-transcript-api` **through the Oxylabs proxy**.
3️⃣ Sends the transcripts to GPT-OSS 120B (via OpenRouter / Cerebras) for a structured summary.

In [22]:
%pip install -q youtube-transcript-api python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [23]:
# -------------------------------------------------
# 1️⃣  Imports & global constants (patched)
# -------------------------------------------------
import os, time, json, pathlib, sys
import requests
from urllib.parse import urlparse, parse_qs, quote_plus
from datetime import datetime

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.proxies import GenericProxyConfig
from youtube_transcript_api.formatters import TextFormatter
from IPython.display import Markdown, display

from dotenv import load_dotenv

# ---------- Load .env (search upwards) ----------
cwd = pathlib.Path.cwd()
env_path = None
for parent in [cwd] + list(cwd.parents):
    candidate = parent / ".env"
    if candidate.is_file():
        env_path = candidate
        break

if env_path is None:
    print("⚠️  No .env file found – you must set env vars manually.")
else:
    print(f"🔑 Loading environment from {env_path}")
    load_dotenv(env_path)

# ---------- Expected environment variables ----------
SECRET__INCEPTION_LABS__API_KEY = os.getenv("SECRET__INCEPTION_LABS__API_KEY")
OX_USERNAME   = os.getenv("SECRET__OXYLABS__PROXY_USERNAME")                       # base username
OX_PASSWORD   = os.getenv("SECRET__OXYLABS__PROXY_PASSWORD")                       # password
OX_COUNTRY    = os.getenv("OX_COUNTRY", "").strip().upper()    # e.g. "US"
OX_CITY       = os.getenv("OX_CITY", "").strip().lower()       # e.g. "los_angeles"
OX_STATE      = os.getenv("OX_STATE", "").strip().lower()      # optional US state, e.g. "us_california"
OX_SESSID     = os.getenv("OX_SESSID", "").strip()            # optional session id
OX_SESTIME    = os.getenv("OX_SESTIME", "").strip()           # optional session length (minutes)
OX_HOST       = os.getenv("OX_HOST", "pr.oxylabs.io")
OX_PORT       = int(os.getenv("OX_PORT", "7777"))               # 7777 = residential entry point on pr.oxylabs.io
SECRET__OPENROUTER__API_KEY = os.getenv('SECRET__OPENROUTER__API_KEY')

# ---------- Validate required keys ----------
# The summariser calls OpenRouter, so that is the key that must be present.
if not SECRET__OPENROUTER__API_KEY:
    raise SystemExit("❌  Missing SECRET__OPENROUTER__API_KEY environment variable.")

USE_PROXY = bool(OX_USERNAME and OX_PASSWORD)
if USE_PROXY:
    print("✅ Oxylabs credentials loaded – proxy will be used for YouTube calls.")
else:
    print("⚠️  Oxylabs credentials missing – proceeding **without** proxy (you may hit IP blocks).")

# ---------- Helper: build the http:// proxy URL ----------
def _build_oxylabs_proxy_url():
    """
    Returns a single string suitable for both http and https entries.
    It follows Oxylabs’ “customer‑USERNAME‑cc‑US‑city‑london‑sessid‑ABC‑sesstime‑5”
    pattern, but any of the optional pieces can be omitted.
    """
    if not USE_PROXY:
        return None

    # Base parts list – always starts with the fixed keyword "customer"
    parts = ["customer", OX_USERNAME]

    if OX_COUNTRY:
        parts.append(f"cc-{OX_COUNTRY}")

    if OX_CITY:
        parts.append(f"city-{OX_CITY}")

    if OX_STATE:
        parts.append(f"st-{OX_STATE}")

    if OX_SESSID:
        parts.append(f"sessid-{OX_SESSID}")

    if OX_SESTIME:
        parts.append(f"sesstime-{OX_SESTIME}")

    # Assemble the *username* part Oxylabs expects
    ox_user = "-".join(parts)

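    # Percent-encode credentials (password may contain special chars).
    # quote() is used instead of quote_plus(): in the userinfo part of a URL
    # a space must become %20, because "+" is not decoded back to a space.
    from urllib.parse import quote
    auth = f"{quote(ox_user, safe='')}:{quote(OX_PASSWORD, safe='')}"
    return f"http://{auth}@{OX_HOST}:{OX_PORT}"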

PROXY_URL = _build_oxylabs_proxy_url()

# ---------- Optional: proxy health‑check (soft‑fail) ----------
def _check_proxy():
    """Ping httpbin.org through the proxy. Returns True on success."""
    if not PROXY_URL:
        return True
    proxies = {"http": PROXY_URL, "https": PROXY_URL}
    try:
        r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
        r.raise_for_status()
        print("🔍 Proxy test successful – external IP:", r.json()["origin"])
        return True
    except Exception as exc:
        print(f"⚠️  Proxy test failed (will continue): {exc}")
        return False

# Run health‑check (won’t abort the notebook)
if PROXY_URL and not _check_proxy():
    print("⚠️  Continuing – the later YouTube calls will still try the proxy.")

# ---------- Helper to build a GenericProxyConfig for youtube_transcript_api ----------
def make_proxy_config():
    """Return a GenericProxyConfig that uses the same http:// URL for both http and https."""
    if not PROXY_URL:
        return None
    return GenericProxyConfig(http_url=PROXY_URL, https_url=PROXY_URL)

# -------------------------------------------------
# 2️⃣ Initialise the YouTube API (proxy‑aware)
# -------------------------------------------------
proxy_cfg = make_proxy_config()
ytt_api = YouTubeTranscriptApi(proxy_config=proxy_cfg) if proxy_cfg else YouTubeTranscriptApi()
formatter = TextFormatter()
print("✅ YouTubeTranscriptApi ready.")

🔑 Loading environment from /home/superdev/projects/OpenMates/.env
✅ Oxylabs credentials loaded – proxy will be used for YouTube calls.
🔍 Proxy test successful – external IP: 37.19.197.185
✅ YouTubeTranscriptApi ready.
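
The username assembly done by `_build_oxylabs_proxy_url` can be tried in isolation. The sketch below is a simplified stand-in, not the notebook's helper: `build_proxy_url` and the sample credentials are hypothetical, and the flag order simply mirrors the cell above. It shows how the optional flags are appended and why the credentials are percent-encoded.

```python
from urllib.parse import quote, unquote, urlsplit


def build_proxy_url(username, password, host="pr.oxylabs.io", port=7777,
                    country="", city="", state="", sessid="", sesstime=""):
    """Assemble a proxy URL from a base username plus optional flags (illustrative)."""
    parts = ["customer", username]
    # Optional targeting/session flags are appended only when they are set.
    for flag, value in (("cc", country), ("city", city), ("st", state),
                        ("sessid", sessid), ("sesstime", sesstime)):
        if value:
            parts.append(f"{flag}-{value}")
    user = "-".join(parts)
    # Percent-encode both halves so characters such as "@" or ":" in the
    # password cannot be mistaken for URL delimiters.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Made-up credentials, for demonstration only.
url = build_proxy_url("alice", "p@ss:word", country="US", city="los_angeles")
parsed = urlsplit(url)
print(url)
print(unquote(parsed.username), unquote(parsed.password))
```

Round-tripping through `urlsplit` and `unquote` is a cheap way to confirm that the password survives encoding before it is sent to the proxy.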


In [24]:
# -------------------------------------------------
# 3️⃣ List of video URLs (feel free to edit)
# -------------------------------------------------
youtube_urls = [
    "https://www.youtube.com/watch?v=xlEQ6Y3WNNI",
    "https://www.youtube.com/watch?v=UjboGsztHd8",
    # "https://www.youtube.com/watch?v=67a5yrKH-nI",
    # "https://www.youtube.com/watch?v=Ac4LiuoJT20",
    # "https://www.youtube.com/watch?v=XSZP9GhhuAc",
    # "https://www.youtube.com/watch?v=ysPbXH0LpIE",
    # "https://www.youtube.com/watch?v=j8NlbEWAsmc",
    # "https://www.youtube.com/watch?v=HNzH5Us1Rvg",
    # "https://www.youtube.com/watch?v=gv0WHhKelSE",
    # "https://www.youtube.com/watch?v=dRsjO-88nBs",
]

print(f"📦 {len(youtube_urls)} URLs loaded.")

📦 2 URLs loaded.


In [25]:
# -------------------------------------------------
# 4️⃣ Helper utilities (ID extraction, safe fetch, …)
# -------------------------------------------------
def extract_video_id(url: str) -> str | None:
    """Return the YouTube video ID or ``None`` if the URL is malformed."""
    parsed = urlparse(url)
    if parsed.hostname and "youtube" in parsed.hostname:
        return parse_qs(parsed.query).get("v", [None])[0]
    if parsed.hostname and "youtu.be" in parsed.hostname:
        return parsed.path.lstrip("/")
    return None

def safe_fetch_transcript(video_id: str, url: str, max_retries: int = 3, base_delay: float = 2.0):
    """Fetch a transcript with exponential back‑off.
    Returns a dict with the same shape used later in the notebook.
    """
    for attempt in range(max_retries):
        try:
            if attempt > 0:
                delay = base_delay * (2 ** (attempt - 1))
                print(f"   ⏳ retry {attempt+1}/{max_retries} – sleeping {delay:.1f}s")
                time.sleep(delay)
            # request up to four languages – the API will pick the first that exists
            fetched = ytt_api.fetch(video_id, languages=["en", "de", "es", "fr"])
            text = formatter.format_transcript(fetched)
            return {
                "success": True,
                "video_id": video_id,
                "url": url,
                "transcript": text,
                "word_count": len(text.split()),
                "language": fetched.language,
                "is_generated": fetched.is_generated,
            }
        except Exception as exc:
            # Log the error but keep trying (unless it’s the last attempt)
            print(f"   ❌ attempt {attempt+1} failed: {type(exc).__name__}: {exc}")
            if attempt == max_retries - 1:
                return {
                    "success": False,
                    "video_id": video_id,
                    "url": url,
                    "error": str(exc),
                }

    # Should never reach here
    return {"success": False, "video_id": video_id, "url": url, "error": "Unknown"}


In [26]:
# -------------------------------------------------
# 5️⃣ Fetch all transcripts (progress printed in the notebook)
# -------------------------------------------------
all_transcripts = []
all_transcripts_md = ""
success_count = 0
failed_count = 0
total_word_count = 0
failed_videos = []

print(f"🔎 Starting fetch for {len(youtube_urls)} videos …")

for i, url in enumerate(youtube_urls, start=1):
    video_id = extract_video_id(url)
    print(f"\n[{i}/{len(youtube_urls)}] {url}")
    if not video_id:
        print("   ❌ Could not extract video ID – skipping")
        failed_count += 1
        failed_videos.append({"url": url, "error": "Invalid video ID"})
        continue

    result = safe_fetch_transcript(video_id, url)

    if result["success"]:
        success_count += 1
        total_word_count += result["word_count"]
        all_transcripts.append(result)

        # Build a markdown block that will be fed to the LLM later
        all_transcripts_md += f"## Video {i}: {url}\n"
        all_transcripts_md += f"**Language:** {result['language']} | **Generated:** {result['is_generated']} | **Words:** {result['word_count']:,}\n\n"
        all_transcripts_md += result['transcript'] + "\n\n---\n\n"

        print(f"   ✅ OK – {result['word_count']:,} words, lang={result['language']}, generated={result['is_generated']}")
    else:
        failed_count += 1
        failed_videos.append({"url": url, "video_id": video_id, "error": result.get("error", "unknown")})
        print(f"   ❌ Failed – {result.get('error','unknown')[:120]}")

    # Be nice to YouTube – a short pause between calls
    if i < len(youtube_urls):
        time.sleep(1)

# -------------------------------------------------
# 6️⃣ Summary statistics
# -------------------------------------------------
print("\n" + "="*60)
print("📊 FINAL STATISTICS")
print("="*60)
print(f"✅ Successfully fetched: {success_count}/{len(youtube_urls)} ({success_count/len(youtube_urls)*100:.1f} %) ")
print(f"❌ Failed: {failed_count}")
print(f"🧮 Total word count: {total_word_count:,}")
if success_count:
    print(f"📈 Avg words / transcript: {total_word_count//success_count:,}")
if failed_videos:
    print("\n🔎 Failed videos:")
    for f in failed_videos:
        print(f"  • {f['url']} – {f.get('error','?')[:120]}")

🔎 Starting fetch for 2 videos …

[1/2] https://www.youtube.com/watch?v=xlEQ6Y3WNNI
   ✅ OK – 3,592 words, lang=English, generated=False

[2/2] https://www.youtube.com/watch?v=UjboGsztHd8
   ✅ OK – 4,419 words, lang=English, generated=False

📊 FINAL STATISTICS
✅ Successfully fetched: 2/2 (100.0 %) 
❌ Failed: 0
🧮 Total word count: 8,011
📈 Avg words / transcript: 4,005
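
Both videos succeeded on the first attempt here, so the retry path in `safe_fetch_transcript` never ran. The sketch below isolates the same back-off schedule (no wait before the first attempt, then `base_delay`, `2 * base_delay`, ...) so it can be checked without network access. `retry_with_backoff` and `flaky` are hypothetical names, and the injectable `sleep` parameter is an addition for testing, not something the notebook has.

```python
import time


def retry_with_backoff(fn, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fn() up to max_retries times, doubling the wait between attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_exc = None
    for attempt in range(max_retries):
        if attempt > 0:
            # Waits grow as base_delay * 1, 2, 4, ... before each retry.
            sleep(base_delay * (2 ** (attempt - 1)))
        try:
            return fn()
        except Exception as exc:  # broad on purpose, like the notebook helper
            last_exc = exc
    raise last_exc


calls = {"n": 0}


def flaky():
    """Fail twice, then succeed, to exercise the retry path."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `waits.append` as the sleep function records the delays instead of blocking, which keeps the demonstration instant.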


In [27]:
import os
import json
import textwrap
import requests
from IPython.display import display, Markdown

# -----------------------------------------------------------------
# Model – just the model slug (provider is chosen via the `provider` block)
# -----------------------------------------------------------------
MODEL = "openai/gpt-oss-120b"

# -----------------------------------------------------------------
# Provider‑routing overrides – Cerebras only, no fallbacks.
# Set to None if you prefer the default price‑based load‑balancing.
# -----------------------------------------------------------------
PROVIDER_OVERRIDES = {
    "order": ["cerebras"],   # try Cerebras first (and only)
    "allow_fallbacks": False # fail fast if Cerebras is down
}

def summarize_all_transcripts(
    md: str,
    *,
    model: str = MODEL,
    max_tokens: int = 3000,
    temperature: float = 0.7,
    provider_overrides: dict | None = PROVIDER_OVERRIDES,
) -> str:
    """
    Sends *md* to OpenRouter (optionally with provider overrides) and
    returns the assistant's answer.
    """
    # ------------------------------------------------------------------
    # Prompt
    # ------------------------------------------------------------------
    prompt = textwrap.dedent(f"""
        You are an expert analyst. Below are transcripts from {success_count}
        YouTube videos (total {total_word_count:,} words).

        ---
        {md}
        ---

        Please produce a concise, structured summary containing:
        1️⃣ Executive summary
        2️⃣ Key themes & recurring topics
        3️⃣ Notable insights / quotes
        4️⃣ Technology & innovation mentions
        5️⃣ Actionable take‑aways for a product‑team audience
    """)

    # ------------------------------------------------------------------
    # Payload – exact fields OpenRouter expects
    # ------------------------------------------------------------------
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

    # Only add the provider block if we actually have overrides
    if provider_overrides:
        payload["provider"] = provider_overrides

    # ------------------------------------------------------------------
    # Headers
    # ------------------------------------------------------------------
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {SECRET__OPENROUTER__API_KEY}",
        # optional, nice for analytics on the OpenRouter dashboard:
        # "HTTP-Referer": "https://my-notebook.example.com",
        # "X-Title": "YouTube‑Transcript‑Summariser",
    }

    # ------------------------------------------------------------------
    # Perform request
    # ------------------------------------------------------------------
    try:
        print(f"🧠 Sending {len(md):,} chars → model `{model}`")
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            json=payload,
            headers=headers,
            timeout=180,
        )
        resp.raise_for_status()
        data = resp.json()
        return data["choices"][0]["message"]["content"].strip()
    except requests.exceptions.HTTPError:
        # Show the *exact* JSON error body returned by OpenRouter
        try:
            err_body = resp.json()
        except Exception:
            err_body = resp.text
        return (
            f"❌ Summarisation failed – HTTP {resp.status_code}\n\n"
            f"**OpenRouter says:** {json.dumps(err_body, indent=2)}"
        )
    except Exception as exc:
        return f"❌ Summarisation failed – {type(exc).__name__}: {exc}"

# ----------------------------------------------------------------------
# 7️⃣  Run & display
# ----------------------------------------------------------------------
if success_count:
    summary_text = summarize_all_transcripts(all_transcripts_md)
else:
    summary_text = "❌ No transcripts were fetched – nothing to summarize."

display(Markdown("# 🧠 YouTube Video Analysis Summary"))
# Report the proxy status from the flag set during setup instead of hard-coding it.
proxy_label = "✅ Yes (Oxylabs)" if USE_PROXY else "❌ No (direct request)"
display(
    Markdown(
        f"""**Videos processed**: {success_count}/{len(youtube_urls)}  
**Total words**: {total_word_count:,}  
**Proxy used**: {proxy_label}"""
    )
)
display(Markdown("---"))
display(Markdown(summary_text))

🧠 Sending 43,593 chars → model `openai/gpt-oss-120b`


# 🧠 YouTube Video Analysis Summary

**Videos processed**: 2/2  
**Total words**: 8,011  
**Proxy used**: ✅ Yes (Oxylabs)

---

## 1️⃣ Executive Summary  
Both talks illustrate how large‑scale product teams are moving from “AI‑as‑a‑brain” (pure LLM chat) to **AI‑as‑both‑brain‑and‑hands** – i.e., tightly coupling large language models (LLMs) with deterministic tooling and workflow orchestration.  

* **Shopify (Obie Fernandez)** – built **RoAST**, an open‑source Ruby‑centric workflow engine that stitches together *agentic* Claude‑code calls with *structured* scripts, test‑coverage tooling, type‑checking (Sorbet) and other CI steps.  RoAST lets developers replay individual steps, cache function calls and keep entropy low while still benefiting from Claude’s reasoning.  

* **Manus AI (Tao “hik”)** – built **Manus**, a cloud‑hosted “hand” for LLMs.  Each Manus instance runs a full virtual machine (Linux, soon Windows/Android) with a real browser, filesystem, VS Code, and a library of 27 pre‑wired tools.  The product is deliberately **workflow‑free**: no hard‑coded pipelines, the LLM decides which tool to call (search, browse, PDF‑extract, interior‑design, etc.).  Manus is powered by Anthropic’s Claude Sonnet models (Claude 3.5 → Claude 4) and relies on a custom chain‑of‑thought (CoT) injection step to improve function‑calling reliability.

Both teams stress **scale (hundreds of thousands of requests per day), internal culture of tinkering, and open‑source sharing** as the engines that let AI improve developer productivity and end‑user experience at enterprise scale.

---

## 2️⃣ Key Themes & Recurring Topics  

| Theme | Shopify (RoAST) | Manus AI (Manus) |
|-------|----------------|-----------------|
| **Agentic vs. Deterministic** | Distinguishes *agentic* Claude‑code (exploratory, non‑deterministic) from *structured* deterministic workflows (test grading, migrations). | No explicit split; the LLM is given “hands” (tools) and decides autonomously, but the underlying VM provides deterministic execution of those tools. |
| **Workflow Orchestration** | RoAST: Ruby DSL, inline prompts, bash steps, replay, function‑call caching. | Manus: implicit workflow built from tool calls; CoT injection adds a planning step before each function call. |
| **Tool Integration** | Claude Code SDK, Sorbet type‑checker, coverage tools, test runners. | Browser (text + screenshot + bounding‑box), VS Code, file‑system, PDF unzip/parse, private APIs (finance, maps). |
| **Scale & Metrics** | 500 DAU, 250 k requests / sec, ~0.5 M PRs / yr. | $1 M spent on Claude 4 usage in 14 days; 20 % of users do “deep‑research” loops of 30‑50 steps. |
| **Open‑Source & Culture** | RoAST released publicly; internal “tinkering” culture encourages shared tooling. | Manus builds on open‑source “browser‑use” protocol; the team openly discusses architecture (CoT injection, VM). |
| **User‑Facing Value** | Faster test‑coverage fixing, automated migrations, repeatable CI pipelines. | Office‑search & accommodation recommendation, interior‑design from a room photo, bulk PDF extraction, autonomous research. |
| **Future Directions** | More SDK hooks, richer Ruby DSL, tighter Claude Code ↔ RoAST loop. | Windows/Android VMs, deeper model integration, broader private‑API catalogue. |

---

## 3️⃣ Notable Insights / Quotes  

| Quote | Insight |
|-------|----------|
| “**Agentic tools shine when the path to the solution is not known in advance**.” – Obie | Use LLMs for exploratory, ambiguous tasks; don’t force deterministic pipelines on them. |
| “**Interleaving deterministic and nondeterministic processes is the perfect combination**.” – Obie | Hybrid approach reduces error propagation while keeping LLM creativity. |
| “**We don’t train models; we build *hands* for them**.” – Tao | The competitive edge lies in giving LLMs actionable tool access, not in model training. |
| “**Less structure, more intelligence**.” – Tao | Minimal hard‑coded workflows; let the model orchestrate given rich context. |
| “**We spent $1 M on Claude 4 in the first 14 days**.” – Tao | Cloud‑based LLM usage at scale can be financially significant; budgeting is a product‑level decision. |
| “**Replay from step 4 instead of re‑running steps 1‑3 saves massive time**.” – Obie | Step‑level persistence is a high‑ROI feature for developer tooling. |
| “**CoT injection = planner agent reasoning + function call**.” – Tao | Adding a lightweight planning layer before each tool call dramatically improves function‑calling success. |

---

## 4️⃣ Technology & Innovation Mentions  

| Category | Specifics |
|----------|-----------|
| **LLMs** | Claude 3.5 and Claude 4 (Anthropic’s Claude Sonnet line); Claude Code (Anthropic’s agentic coding tool). |
| **Workflow Engine** | **RoAST** – Ruby DSL, inline ERB templating, function‑call caching, step replay. |
| **Coding‑Agent SDK** | **Claude Code SDK** – lets scripts invoke Claude Code via the CLI; permissions‑skip flag for prototyping. |
| **Tooling Integration** | Sorbet (type system), coverage tools, test runners, browser‑use protocol, 27 custom VM tools, CoT injector. |
| **Infrastructure** | Large‑scale VM fleet (Linux, future Windows/Android), real (non‑headless) Chromium, VS Code, filesystem. |
| **Metrics / Observability** | 500 DAU, 250 k RPS, $1 M cloud‑model spend, PR volume (0.5 M / yr). |
| **Open‑Source** | RoAST (GitHub), browser‑use (open source), potential upcoming Manus SDK. |
| **Security / Permissions** | “dangerously skip permissions” flag for rapid prototyping (Shopify). |
| **Data Sources** | Private APIs (finance, maps), pre‑paid data feeds, PDF archives, image‑to‑furniture pipelines. |

---

## 5️⃣ Actionable Take‑aways for a Product‑Team Audience  

| Area | What to Do |
|------|------------|
| **Hybrid Architecture** | Design your AI product as a *two‑layer* system: an **agentic LLM** for reasoning + a **deterministic orchestration layer** (DSL, workflow engine) for repeatable steps. |
| **Expose a Stable SDK** | Provide a lightweight CLI/SDK (like Claude Code) that lets product engineers call the LLM, skip permissions in dev, and inject custom tooling. |
| **Step‑Level Persistence** | Implement **step replay & caching** (store intermediate outputs, cache function calls). This reduces latency and cost when debugging long pipelines. |
| **Tool‑First Mindset** | Build a **catalog of first‑class tools** (browser, file system, domain‑specific CLI). Let the LLM pick tools via function‑calling rather than hard‑coding pipelines. |
| **Monitoring & Cost Guardrails** | Track usage metrics (DAU, RPS, token spend) and set **budget alerts**; high‑scale LLM usage can quickly become $‑heavy. |
| **Open‑Source & Community** | Release internal tooling (e.g., workflow DSL) as open source to attract contributions and avoid duplicated effort across teams. |
| **Internal Tinkering Culture** | Encourage “skunk‑works” projects; give teams ownership of small AI utilities that can later be unified under a shared framework. |
| **User‑Facing Simplicity** | For non‑technical users, hide the workflow complexity behind a **single “assistant” UI** that internally runs the deterministic pipeline. |
| **Future‑Proofing** | Keep the orchestration layer **model‑agnostic** (plug‑in any new LLM via a thin adapter) so you can switch from Claude 3.5 → Claude 4 → future models without re‑architecting workflows. |
| **Security & Privacy** | When offering cloud‑only VM tools, consider **data residency** and **credential isolation** (e.g., separate VM per user, token‑scoped APIs). |
| **Iterative Research Feature** | Provide a **“deep‑research” mode** that automatically expands a short query into a multi‑step plan; observe emergent capabilities rather than building them manually. |

---  

**Bottom line:**  
Successful AI‑powered products at scale blend the *creativity* of LLMs with the *reliability* of deterministic tooling. Build a lightweight orchestration layer, give the model rich, well‑instrumented “hands,” and expose reusable SDKs—while tracking cost, encouraging internal tool‑sharing, and keeping the architecture flexible enough to swap out the underlying model as it evolves.
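
For reference, the request body that `summarize_all_transcripts` posts to OpenRouter can be built and inspected without making a network call. The sketch below pulls that step into a hypothetical helper, `build_chat_payload`. The field names and the `provider` block are taken from the cell above, and the prompt text is a placeholder.

```python
import json


def build_chat_payload(prompt, model="openai/gpt-oss-120b", max_tokens=3000,
                       temperature=0.7, provider_overrides=None):
    """Return the JSON body for a chat-completions request (illustrative)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if provider_overrides:
        # Copy so later edits to the payload do not mutate the caller's dict.
        payload["provider"] = dict(provider_overrides)
    return payload


# Pinned to a single provider, as in the notebook's PROVIDER_OVERRIDES.
pinned = build_chat_payload(
    "Summarise the transcripts ...",
    provider_overrides={"order": ["cerebras"], "allow_fallbacks": False},
)
# Without overrides the "provider" key is omitted entirely.
default = build_chat_payload("Summarise the transcripts ...")

print(json.dumps(pinned, indent=2))
print("provider" in default)
```

Keeping payload construction separate from the HTTP call makes it easy to check routing settings in a test or to log the exact request when a call fails.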