# YouTube RAG Pipeline

Pipeline
1. **Loading Data**
2. **Chunking**
3. **Embeddings** (OpenAI-Embeddings)
4. **VectorDB** (Chroma)
5. **Retriever**
6. **LLM** (OpenAI Chat-Model)
7. **Chain** (Conversational Retrieval)
8. **Memory**


## 0. YouTube videos / transcripts

Imports

In [8]:
!pip install youtube-transcript-api chromadb sentence-transformers transformers accelerate torch



In [15]:
from urllib.parse import urlparse, parse_qs

from youtube_transcript_api import (
    YouTubeTranscriptApi,
    TranscriptsDisabled,
    NoTranscriptFound,
)

import pandas as pd
import chromadb
from chromadb.utils import embedding_functions

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

YouTube ingestion
- URL -> video_id
- video_id -> transcript (list)
- transcript -> plain text

In [16]:
# Extract the YouTube video ID from the supported URL formats (youtu.be and youtube.com/watch)
def extract_video_id(url: str) -> str:
    parsed = urlparse(url)

    # Short youtu.be links
    if parsed.netloc in ("youtu.be", "www.youtu.be"):
        return parsed.path.lstrip("/")

    # Regular youtube.com links
    if parsed.netloc in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        qs = parse_qs(parsed.query)
        vid = qs.get("v", [None])[0]
        if vid:
            return vid

    raise ValueError(f"Could not extract video_id from URL: {url}")

# Convert a transcript (list of {text, start, duration}) to a single text string
def transcript_to_text(transcript, include_timestamps: bool = False) -> str:
    lines = []
    for entry in transcript:
        if include_timestamps:
            start = entry["start"]
            lines.append(f"[{start:.1f}s] {entry['text']}")
        else:
            lines.append(entry["text"])
    return " ".join(lines)


# Fetch the transcript for a single video_id and turn it into plain text
def fetch_transcript_text(video_id: str, languages=None) -> str:
    try:
        ytt_api = YouTubeTranscriptApi()

        # If you don't care about language, you can call ytt_api.fetch(video_id) without languages
        if languages is None:
            fetched = ytt_api.fetch(video_id)
        else:
            fetched = ytt_api.fetch(video_id, languages=languages)

        # `fetched` is a FetchedTranscript object with `.snippets`
        # Convert to the same structure transcript_to_text() expects
        transcript = [
            {"text": s.text, "start": s.start, "duration": s.duration}
            for s in fetched.snippets
        ]

        return transcript_to_text(transcript, include_timestamps=False)

    except TranscriptsDisabled:
        raise RuntimeError(f"Transcripts are disabled for video_id={video_id}")
    except NoTranscriptFound:
        raise RuntimeError(f"No transcript found for video_id={video_id} in languages={languages}")
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise RuntimeError(f"Error fetching transcript for {video_id}: {e}") from e
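
The driver cell further down calls `ingest_youtube_videos(video_urls, languages=["en"])`, but no definition of that function appears in the cells shown here. The following is a minimal sketch of what it might look like, assuming it simply loops over the URLs, reuses `extract_video_id` and `fetch_transcript_text` from the cell above, and skips videos that fail. The helper `collect_transcript_rows` and the skip-on-error behaviour are assumptions, not the author's original code.

```python
from typing import Callable


def collect_transcript_rows(
    video_urls: list[str],
    extract_id: Callable[[str], str],
    fetch_text: Callable[[str], str],
) -> list[dict]:
    """Return one {'video_id', 'url', 'transcript'} dict per successfully fetched video."""
    rows = []
    for url in video_urls:
        try:
            video_id = extract_id(url)
            print(f"Processing: {url} -> video_id={video_id}")
            rows.append({"video_id": video_id, "url": url, "transcript": fetch_text(video_id)})
        except (ValueError, RuntimeError) as err:
            # Assumption: skip a bad URL or missing transcript instead of aborting the run
            print(f"Skipping {url}: {err}")
    return rows


def ingest_youtube_videos(video_urls, languages=None):
    """Build the DataFrame that build_chroma_collection_from_df() expects."""
    import pandas as pd  # imported here so collect_transcript_rows() stays dependency-free

    rows = collect_transcript_rows(
        video_urls,
        extract_id=extract_video_id,  # defined in the notebook cell above
        fetch_text=lambda vid: fetch_transcript_text(vid, languages=languages),
    )
    # Fixing the column order keeps the frame usable even when no video succeeded
    return pd.DataFrame(rows, columns=["video_id", "url", "transcript"])
```

Passing the two lookups in as callables keeps the loop easy to try out with stand-in functions before hitting the YouTube API.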


Store transcripts in Chroma

In [17]:
# Create a Chroma collection from a DataFrame with columns 'video_id', 'url', 'transcript'
def build_chroma_collection_from_df(df: pd.DataFrame, collection_name: str = "youtube_videos"):

    # 2.1 Set up Chroma client (in-memory for now; for persistence use PersistentClient)
    client = chromadb.Client()

    # 2.2 Define an embedding function (SentenceTransformer)
    embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )

    # 2.3 Create (or recreate) the collection
    # If collection exists, delete it to start fresh
    existing = [c.name for c in client.list_collections()]
    if collection_name in existing:
        client.delete_collection(collection_name)

    collection = client.create_collection(
        name=collection_name,
        embedding_function=embedding_func,
    )

    # 2.4 Add documents to collection
    # Use video_id as id, transcript as document
    documents = df["transcript"].tolist()
    ids = df["video_id"].tolist()
    metadatas = df[["video_id", "url"]].to_dict(orient="records")

    collection.add(
        documents=documents,
        ids=ids,
        metadatas=metadatas,
    )

    print(f"Added {len(documents)} transcripts to Chroma collection '{collection_name}'.")
    return collection

Ingest videos and build the collection

In [18]:
if __name__ == "__main__":
    # ---- 1) Ingest multiple videos ----
    video_urls = [
        # add YouTube URLs here:
        "https://www.youtube.com/watch?v=enD8mK9Zvwo",
        "https://www.youtube.com/watch?v=ZdjJdoEwCY4",

    ]

    df_videos = ingest_youtube_videos(video_urls, languages=["en"])
    print("\nIngested videos DataFrame:")
    print(df_videos.head())

    if df_videos.empty:
        raise SystemExit("No videos ingested successfully – check URLs or transcripts settings.")

    # ---- 2) Build Chroma collection ----
    collection = build_chroma_collection_from_df(df_videos, collection_name="youtube_videos")

    print("\nChroma collection is ready.")

Processing: https://www.youtube.com/watch?v=enD8mK9Zvwo -> video_id=enD8mK9Zvwo
Processing: https://www.youtube.com/watch?v=ZdjJdoEwCY4 -> video_id=ZdjJdoEwCY4

Ingested videos DataFrame:
      video_id                                          url  \
0  enD8mK9Zvwo  https://www.youtube.com/watch?v=enD8mK9Zvwo   
1  ZdjJdoEwCY4  https://www.youtube.com/watch?v=ZdjJdoEwCY4   

                                          transcript  
0  One of the most important parts of a job appli...  
1  so going on a job interview has got to be one ...  


Added 2 transcripts to Chroma collection 'youtube_videos'.

Chroma collection is ready.


In [20]:
df_videos.head()

| | video_id | url | transcript |
|---|---|---|---|
| 0 | enD8mK9Zvwo | https://www.youtube.com/watch?v=enD8mK9Zvwo | One of the most important parts of a job appli... |
| 1 | ZdjJdoEwCY4 | https://www.youtube.com/watch?v=ZdjJdoEwCY4 | so going on a job interview has got to be one ... |


In [None]:
!pip install -q --upgrade langchain langchain-openai langchain-community langchain-text-splitters chromadb tiktoken python-dotenv

In [None]:
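import textwrap  # used below to shorten and wrap printed text
from pathlib import Path

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Legacy chain used in section 6. This import path is an assumption that holds for
# LangChain 0.x; newer releases may ship it in the separate langchain-classic package.
from langchain.chains import ConversationalRetrievalChain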


In [None]:
import os
from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
    print("✅ API Key loaded successfully!")


✅ API Key loaded successfully!


## 1. Loading Data

Place your transcript files as **`.txt`** files in a folder, e.g. `data/`.
Each file can hold, for example, the transcript text of one YouTube video.
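
Section 0 leaves the transcripts in the `df_videos` DataFrame, while this section reads `.txt` files from disk, and the notebook does not show how one becomes the other. A minimal sketch of that bridge follows, assuming one file per video named after its `video_id`. The function name `save_transcripts` and the file naming scheme are hypothetical.

```python
from pathlib import Path


def save_transcripts(rows, data_dir):
    """Write one UTF-8 text file per row, named <video_id>.txt; return the paths written."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in rows:
        text = row["transcript"].strip()
        if not text:
            continue  # nothing to save for an empty transcript
        path = data_dir / f"{row['video_id']}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

With the DataFrame from section 0 this could be called as `save_transcripts(df_videos.to_dict(orient="records"), "data")`.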

In [None]:
# Folder for transcripts
DATA_DIR = Path("data")  # adjust the path if necessary
DATA_DIR.mkdir(exist_ok=True)

def load_transcripts(data_dir: Path):
    docs = []
    for path in data_dir.glob('*.txt'):
        with path.open('r', encoding='utf-8') as f:
            text = f.read()
        if not text.strip():
            continue
        docs.append(
            Document(
                page_content=text,
                metadata={"source": path.name}
            )
        )
    return docs

documents = load_transcripts(DATA_DIR)
print(f"Documents loaded: {len(documents)}")
if documents:
    print("Example document:")
    print("Source:", documents[0].metadata)
    print(textwrap.shorten(documents[0].page_content, width=400, placeholder=" ..."))

## 2. Chunking

We split the long transcripts into smaller pieces of text (chunks) so that vector search works better.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum chunk size, measured in characters by default
    chunk_overlap=200,    # overlap so that context is not cut off at chunk boundaries
    separators=["\n\n", "\n", ".", " ", ""]
)

splits = text_splitter.split_documents(documents)
print(f"Number of chunks: {len(splits)}")
if splits:
    print("Example chunk:")
    print("Source:", splits[0].metadata)
    print(textwrap.shorten(splits[0].page_content, width=300, placeholder=" ..."))
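
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified illustration using fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at the listed separators, so its chunks vary in length. The function `sliding_chunks` is a hypothetical helper for illustration only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of its predecessor, which is the same idea that keeps a sentence straddling a boundary retrievable from either side.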

## 3. Embeddings & 4. VectorDB (Chroma)

We generate embeddings with an OpenAI embedding model and store them in a Chroma database.

In [None]:
# Embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Create the Chroma DB
PERSIST_DIR = "chroma_db"

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=PERSIST_DIR
)

# With chromadb >= 0.4 the data is written to persist_directory automatically,
# so the deprecated vectordb.persist() call is not needed.
print("✅ Vector database created and saved.")

## 5. Retriever

For a given question, the retriever fetches the **most relevant chunks** from the VectorDB.

In [None]:
retriever = vectordb.as_retriever(search_kwargs={"k": 4})
print("✅ Retriever ready.")
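
Conceptually, "most relevant" means "closest in embedding space". The sketch below ranks toy vectors by cosine similarity and returns the best `k`, mirroring the `k` in `search_kwargs`. It is a conceptual illustration only: Chroma uses an approximate nearest-neighbour index and a configurable distance metric, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=4):
    """Return the indices of the k chunk vectors most similar to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings"; real ones come from the embedding model
chunks = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], chunks, k=2))
# [0, 1]
```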

## 6. LLM, 7. Chain & 8. Memory

We use a fast OpenAI chat model (`gpt-4.1-mini`) and build a
**ConversationalRetrievalChain** that also takes the chat history (memory) into account.

The chat history is kept in a Python list, `chat_history`.

In [None]:
# LLM
llm = ChatOpenAI(
    model="gpt-4.1-mini",  # good price/performance ratio
    temperature=0.2,        # keeps answers fairly factual
)

# Conversational Retrieval Chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)

# Simple memory structure: list of (question, answer) tuples
chat_history = []

def ask(question: str):
    """Send a question to the RAG chain and update the chat history."""
    # .invoke() replaces the deprecated direct call qa_chain({...})
    result = qa_chain.invoke({
        "question": question,
        "chat_history": chat_history,
    })

    answer = result["answer"]
    source_docs = result.get("source_documents", [])

    # Update the chat history
    chat_history.append((question, answer))

    # Pretty-print the answer and its sources
    print("Question:\n", question)
    print("\nAnswer:\n", textwrap.fill(answer, width=100))
    if source_docs:
        print("\nSources used:")
        for i, d in enumerate(source_docs, 1):
            short = textwrap.shorten(d.page_content, width=120, placeholder=" ...")
            print(f"[{i}] {d.metadata.get('source')} :: {short}")

    return answer
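
The reason `chat_history` matters is that a conversational retrieval chain first condenses the follow-up question together with the history into a standalone question, and only then queries the retriever. The sketch below shows roughly what that condensing step feeds to the LLM. The prompt wording, the sample history, and the helper names `format_chat_history` and `build_condense_prompt` are illustrative assumptions, not the template LangChain actually ships.

```python
def format_chat_history(chat_history):
    """Render (question, answer) tuples as alternating Human/Assistant lines."""
    lines = []
    for question, answer in chat_history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking an LLM to rewrite follow_up as a standalone question."""
    # Illustrative wording only; the real chain uses its own built-in template
    return (
        "Given the conversation below, rephrase the follow-up question "
        "so that it can be understood on its own.\n\n"
        f"Chat history:\n{format_chat_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


# Made-up sample history in the same (question, answer) format that ask() appends
history = [("Which topics do the videos cover?", "Job applications and interviews.")]
print(build_condense_prompt(history, "Which one should I watch first?"))
```

Without the history, a follow-up such as "Which one should I watch first?" would give the retriever nothing meaningful to search for.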


## 9. Test: Ask your RAG system a question

Now you can test your system. Ask a question that can be answered from the content of your transcripts.

In [None]:
# Example question (adapt it!)
example_question = "Roughly, what is the first video about?"
ask(example_question)