# LectureMitra: Your Conversational YouTube Tutor


Welcome to LectureMitra! This notebook contains a complete, voice-enabled AI tutor that can help you study any YouTube video.

How it Works:
Input a URL: You provide a link to any YouTube video that has English captions.
Build a Brain: The app fetches the transcript and builds a temporary "knowledge base" or "brain" for that video.
Ask Anything: You can ask questions using text or your voice.
Get Answers: The tutor answers your questions using both text and voice, based only on the content of the video you provided.

-----------------------------------------------------------

Instructions for Use:
To start the application, simply run all the cells in order.


Go to the menu and click Runtime -> Run all.

When a box appears asking for an API key, paste in your Sarvam AI API key and press Enter.

Scroll to the bottom of the notebook. An input box will appear asking for a YouTube video URL. Paste the link and press Enter to begin your session!

**If you run into any problems while running the code, refer to the demo video (https://youtu.be/EXim3NoRhRI).**

# Step 0: Installing Dependencies

This first cell installs all the necessary Python libraries required for the project.

yt-dlp: To fetch captions directly from YouTube.

langchain, langchain-community, faiss-cpu, sentence-transformers: These form the core of our Retrieval-Augmented Generation (RAG) system. They help in splitting the text, creating numerical embeddings, and building a searchable vector database.

requests: A widely used third-party library for making the HTTP calls to the Sarvam AI API.

ipywidgets, IPython: Utilities for handling audio recording and playback within this Colab notebook.

In [None]:
print("Installing necessary libraries...")

# For YouTube transcripts
#pip install --upgrade youtube-transcript-api -q
!pip install yt-dlp -q

# For the RAG pipeline (vector store, embeddings, text splitting)
!pip install langchain langchain-community faiss-cpu sentence-transformers -q

# For making API calls to Sarvam
!pip install requests -q

# For audio recording and playback in Colab
!pip install ipywidgets IPython -q

print("Installations complete!")

# Import necessary libraries
import os
import requests
from getpass import getpass
from IPython.display import display, Javascript, Audio
from google.colab.output import eval_js
from base64 import b64decode
import io

# Securely get your Sarvam API Key
# When you run this cell, it will prompt you to enter your key.
# This is much safer than pasting it directly into the code.
SARVAM_API_KEY = getpass('Please enter your Sarvam AI API Key: ')
os.environ['SARVAM_API_KEY'] = SARVAM_API_KEY
print("API Key has been set up successfully!")

# Step 0.5: Logging Setup


Before we define our main functions, we'll set up a logger. Unlike print(), the logging module produces structured, timestamped output with severity levels (e.g., INFO, ERROR), which makes debugging easier. The transcript-fetching function in Step 1 reports its status through this logger, so run this cell before Step 1; the later steps report their status with print().

In [None]:
import logging

# Configure the logger
logging.basicConfig(
    level=logging.INFO, # Set the minimum level of messages to display
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)

# Create a logger instance
logger = logging.getLogger(__name__)

logger.info("Logger configured successfully.")
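
To see what the severity threshold does in practice, here is a minimal sketch that uses the same format string as the cell above but writes to an in-memory buffer. The logger name `lecturemitra.demo` and the messages are made up for illustration; they are not part of the application.

```python
import io
import logging

# Hypothetical logger name, kept separate from the notebook's own logger.
demo_logger = logging.getLogger("lecturemitra.demo")
demo_logger.setLevel(logging.INFO)   # same threshold as basicConfig(level=logging.INFO)
demo_logger.propagate = False        # keep demo messages out of the root handlers

# Capture the output in memory so it can be inspected afterwards.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(
    "%(asctime)s - %(levelname)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S"))
demo_logger.addHandler(handler)

demo_logger.debug("Below the INFO threshold, so this is dropped.")
demo_logger.info("Transcript fetched.")
demo_logger.error("Transcript file missing.")

captured = buffer.getvalue()
print(captured)
```

Only the INFO and ERROR lines reach the buffer; lowering the level to `logging.DEBUG` would let the first message through as well.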

# Step 1: Fetching the YouTube Transcript

This function fetches the video's transcript, which every later step depends on. This version uses yt-dlp, a widely used and actively maintained command-line downloader, which makes it more resilient to changes in YouTube's website structure.


It works by:


Using a command-line call to yt-dlp to download only the English subtitle file (in .vtt format).

Reading this file with Python.

Parsing the file's content to extract only the clean transcript text, removing timestamps and other metadata.
Cleaning up the downloaded file after it's been processed.

In [None]:
import subprocess
import os
import re

def get_youtube_transcript(video_url: str) -> str | None:
    """
    Fetches the transcript for a given YouTube video URL using the robust yt-dlp library.
    Returns the transcript as a single string.
    """
    logger.info(f"Attempting to fetch transcript for {video_url} using yt-dlp.")

    # Define the output filename for the transcript
    transcript_file = "transcript.en.vtt"

    # Command to download the English auto-captions, skipping the video download
    command = [
        "yt-dlp",
        "--write-auto-sub",       # Get auto-generated subtitles
        "--sub-lang", "en",       # Specify English
        "--skip-download",        # Don't download the video
        "-o", "transcript",       # Base name for the output file
        video_url
    ]

    try:
        # Run the command. We use text=True and capture_output=True for better error logging.
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        logger.info("yt-dlp command executed successfully.")

        if not os.path.exists(transcript_file):
            logger.error("yt-dlp ran but the transcript file was not created. This video might not have English captions.")
            return None

        # Read the downloaded VTT file
        with open(transcript_file, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        logger.info("Successfully read the downloaded VTT transcript file.")

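        # Parse the VTT file to extract only the text
        transcript_text_lines = []
        for line in lines:
            stripped = line.strip()
            # Skip blank lines, cue timings, numeric cue IDs, and the header block
            if not stripped or "-->" in stripped or stripped.isdigit():
                continue
            if stripped.startswith(("WEBVTT", "Kind:", "Language:")):
                continue
            # Remove inline tags such as <c> or <00:00:01.000>
            cleaned_line = re.sub(r'<[^>]+>', '', stripped).strip()
            if cleaned_line:
                transcript_text_lines.append(cleaned_line)

        # Auto-captions often repeat lines across consecutive cues; keep the first occurrence of each, in order
        unique_lines = list(dict.fromkeys(transcript_text_lines))
        full_transcript = " ".join(unique_lines)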

        logger.info("Transcript parsed successfully.")
        return full_transcript

    except subprocess.CalledProcessError as e:
        # This error happens if yt-dlp fails (e.g., video not found, no captions)
        logger.error(f"yt-dlp failed. Error message: {e.stderr.strip()}")
        return None
    except Exception as e:
        logger.error(f"An unexpected error occurred with yt-dlp: {e}")
        return None
    finally:
        # Clean up by deleting the downloaded file
        if os.path.exists(transcript_file):
            os.remove(transcript_file)
            logger.info("Cleaned up temporary transcript file.")
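
The parsing step can be tried in isolation, without calling yt-dlp. The following is a minimal sketch on a made-up VTT snippet; `vtt_to_text` and `SAMPLE_VTT` are hypothetical names used only here. It assumes the file begins with `WEBVTT`, `Kind:` and `Language:` header lines, which is the layout YouTube subtitle files usually have, and it skips those along with the cue timings.

```python
import re

# Made-up sample in the WebVTT layout; real YouTube files are much longer.
SAMPLE_VTT = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.000
hello <c>and</c> welcome

00:00:02.000 --> 00:00:04.000
hello and welcome
to the lecture
"""

def vtt_to_text(vtt: str) -> str:
    """Hypothetical helper: strip headers, cue timings, and inline tags from VTT text."""
    kept = []
    for raw in vtt.splitlines():
        line = raw.strip()
        if not line or "-->" in line or line.isdigit():
            continue
        if line.startswith(("WEBVTT", "Kind:", "Language:")):
            continue
        text = re.sub(r"<[^>]+>", "", line).strip()
        if text:
            kept.append(text)
    # dict.fromkeys keeps the first occurrence of each line, in order.
    return " ".join(dict.fromkeys(kept))

print(vtt_to_text(SAMPLE_VTT))
```

Note that de-duplicating by whole line also drops any sentence the speaker genuinely repeats word for word, which is an acceptable trade-off for question answering but would not be for a verbatim transcript.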

# Step 2: Building the Knowledge Base (RAG Core)

This is the heart of the application. The build_rag_pipeline function takes the raw transcript text and turns it into a structured, searchable knowledge base. This function will be called by our main app after a user provides a URL.

This process involves:

Chunking: The long transcript is broken into smaller, overlapping text chunks.
Embedding: Each chunk is converted into a numerical vector (an "embedding") using a sentence-transformer model. This represents the semantic meaning of the text.
Storing: These vectors are stored in a FAISS vector database, which allows for extremely fast similarity searches.


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Global variable to hold our vector database
vector_db = None

def build_rag_pipeline(transcript):
    """
    Takes the transcript, chunks it, creates embeddings, and builds a vector store.
    """
    global vector_db
    if not transcript:
        print("Transcript is empty. Cannot build RAG pipeline.")
        return

    print("Building RAG pipeline...")
    # 1. Chunk the text
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        length_function=len
    )
    chunks = text_splitter.split_text(transcript)
    print(f"Transcript split into {len(chunks)} chunks.")

    # 2. Create Embeddings
    # We use a popular open-source model for creating the embeddings (vectors)
    print("Loading embedding model...")
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # 3. Create Vector Store
    # We'll use FAISS, a fast in-memory vector store
    print("Creating vector store...")
    vector_db = FAISS.from_texts(texts=chunks, embedding=embeddings)
    print("RAG pipeline is ready!")
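
To make the idea of overlapping chunks concrete, here is a simplified sketch that cuts fixed-size character windows. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to split on separators such as paragraph breaks and spaces); `chunk_text` is a hypothetical helper written only to show how `chunk_size` and `chunk_overlap` interact.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Hypothetical fixed-window chunker: each window starts chunk_size - overlap after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=3)
print(chunks)
```

Each chunk begins with the last three characters of the previous one. With the notebook's settings (1000 and 150), the same overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks.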



# Step 3: The Answering Engine

The ask_lecturemitra function is responsible for generating an answer. It does not simply ask the AI the question. Instead, it follows the Retrieval-Augmented Generation (RAG) process:

Retrieve: It takes the user's question, converts it to a vector, and searches the FAISS database to find the most relevant text chunks from the transcript.

Augment: It creates a detailed prompt for the Sarvam LLM, including the user's question and the relevant chunks as "context".

Generate: It instructs the AI to answer the question using only the provided context, which discourages it from making things up or drawing on outside knowledge.


In [None]:
def ask_lecturemitra(question):
    """
    Answers a question based ONLY on the transcript context.
    Retrieves the most relevant chunks, then sends them with the question
    to the Sarvam chat completions endpoint using the 'sarvam-m' model.
    """
    global vector_db
    if not vector_db:
        return "The RAG pipeline has not been built yet. Please provide a video link first."

    print(f"\nSearching for context for the question: '{question}'")

    # 1. Retrieve relevant documents
    retriever = vector_db.as_retriever(search_kwargs={'k': 4})
    docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in docs])

    # 2. Define the prompt for the Chat model
    system_prompt = """You are 'LectureMitra', a helpful AI tutor. Your task is to answer the user's question STRICTLY and ONLY based on the provided 'Transcript Context'.
    Do not use any external knowledge.
    If the answer is not found within the context, you MUST say "I cannot answer this question based on the provided transcript." """

    user_prompt_content = f"""
    ---
    Transcript Context:
    {context}
    ---
    User's Question:
    {question}
    """

    # 3. Call the Sarvam LLM API
    print("Asking the LLM...")
    try:
        headers = {
            "Authorization": f"Bearer {os.environ['SARVAM_API_KEY']}",
            "Content-Type": "application/json"
        }

        data = {

            "model": "sarvam-m",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt_content}
            ],
            "max_tokens": 250,
            "temperature": 0.1
        }

        response = requests.post("https://api.sarvam.ai/v1/chat/completions", headers=headers, json=data)
        response.raise_for_status()

        result = response.json()
        answer = result['choices'][0]['message']['content'].strip()
        return answer

    except requests.exceptions.HTTPError as http_err:
        return f"HTTP error occurred while contacting Sarvam API: {http_err} - {response.text}"
    except Exception as e:
        return f"An error occurred: {e}"
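
The retrieve and augment steps can be illustrated without any model or API call. The sketch below swaps in a toy word-count similarity in place of the sentence-transformer embeddings and FAISS, so the ranking logic is visible; `bag_of_words`, `cosine`, `retrieve`, and `build_messages` are hypothetical helpers and the sample chunks are invented.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = bag_of_words(question)
    return sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)[:k]

def build_messages(question: str, context_chunks: list[str]) -> list[dict]:
    """Assemble chat messages in the same system/user shape used by ask_lecturemitra."""
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system", "content": "Answer ONLY from the provided Transcript Context."},
        {"role": "user", "content": f"Transcript Context:\n{context}\n---\nUser's Question:\n{question}"},
    ]

chunks = [
    "Gradient descent updates the weights using the gradient of the loss.",
    "The lecturer thanks the audience and announces a short break.",
    "A learning rate that is too large makes gradient descent diverge.",
]
question = "What happens when the learning rate is too large?"
top = retrieve(question, chunks, k=2)
messages = build_messages(question, top)
print(top[0])
```

Real embeddings capture meaning rather than shared words, so they would also match a paraphrased question; the word-count version is only meant to show the rank-then-prompt flow.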

# Step 4: Enabling Voice Input (Speech-to-Text)

To allow users to ask questions with their voice, we define two functions:

record_audio: This uses JavaScript within Colab to access the browser's microphone and record a short audio clip.
transcribe_audio_with_sarvam: This function takes the recorded audio file and sends it to Sarvam's Speech-to-Text (STT) API, which returns the transcribed text.

In [None]:
RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

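def record_audio(filename="recorded_audio.wav", seconds=5):
    """Records `seconds` of microphone audio in the browser and saves it to `filename`.

    Note: the bytes are saved exactly as the browser's MediaRecorder produced them,
    which is typically a compressed format such as WebM rather than true WAV,
    despite the default file extension.
    """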
    print(f"Recording for {seconds} seconds... Get ready to speak!")
    display(Javascript(RECORD))
    s = eval_js(f'record({seconds * 1000})')
    b = b64decode(s.split(',')[1])
    with open(filename, 'wb') as f:
        f.write(b)
    print(f"Recording finished. Audio saved as {filename}")
    return filename

def transcribe_audio_with_sarvam(audio_filename):
    """
    Sends the recorded audio file to Sarvam's Speech-to-Text API and returns
    the transcribed text, or an error message string if the request fails.
    """
    print("Transcribing audio with Sarvam AI (using 'saarika' model)...")
    api_key = os.environ.get("SARVAM_API_KEY")
    if not api_key:
        return "Sarvam API key not found."

    url = "https://api.sarvam.ai/speech-to-text"
    headers = {
        "api-subscription-key": api_key
    }

    data = {

        "language_code": "en-IN",
        "model": "saarika:v2"
    }

    with open(audio_filename, "rb") as f:
        files = {
            'file': (audio_filename, f, 'audio/wav')
        }

        try:
            response = requests.post(url, headers=headers, files=files, data=data)
            response.raise_for_status()
            transcribed_text = response.json().get("transcript", "")
            print("Transcription successful!")
            return transcribed_text
        except requests.exceptions.HTTPError as http_err:
            return f"HTTP error during transcription: {http_err} - {response.text}"
        except Exception as e:
            return f"An error occurred during transcription: {e}"
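
The JavaScript recorder hands the audio back to Python as a base64 data URL, which record_audio splits on the comma before decoding. Here is a small sketch of that decoding step on a fabricated data URL, so it runs without a microphone; `data_url_to_bytes` is a hypothetical helper and the `audio/webm` type is an assumption made for the example.

```python
import base64

def data_url_to_bytes(data_url: str) -> tuple[str, bytes]:
    """Hypothetical helper: split a base64 data URL into its header and decoded payload."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return header, base64.b64decode(payload)

# Build a small fake data URL so the sketch runs without recording anything.
fake_audio = b"\x00\x01\x02fake-audio-bytes"
fake_url = "data:audio/webm;base64," + base64.b64encode(fake_audio).decode("ascii")

header, audio_bytes = data_url_to_bytes(fake_url)
print(header, len(audio_bytes))
```

Inspecting the header this way is also a quick check on which media type the browser reported for the recording.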

# Step 5: Enabling Voice Output (Text-to-Speech)

This function creates the voice of our tutor. It takes the text answer generated by the ask_lecturemitra function and sends it to Sarvam's Text-to-Speech (TTS) API. The API returns an audio file, which is then played back directly in the notebook.

In [None]:
from base64 import b64decode

def generate_and_play_speech(text, language_code="en-IN"):
    """
    Sends `text` to Sarvam's Text-to-Speech API (the 'bulbul' model) and plays
    the returned audio in the notebook. Non-answers are skipped.
    """
    if not text or "cannot answer" in text.lower():
        print("\nWon't generate speech for a non-answer.")
        return

    print("\nGenerating speech with Sarvam AI (using 'bulbul' model)...")
    api_key = os.environ.get("SARVAM_API_KEY")
    if not api_key:
        print("Sarvam API key not found.")
        return

    url = "https://api.sarvam.ai/text-to-speech"
    headers = {
        "api-subscription-key": api_key,
        "Content-Type": "application/json"
    }
    data = {
        "text": text,
        "target_language_code": language_code,
        "model": "bulbul:v2"
    }

    try:
        response = requests.post(url, headers=headers, json=data)
        response.raise_for_status()
        response_data = response.json()
        base64_audio = response_data['audios'][0]
        audio_data = b64decode(base64_audio)

        print("Speech generated successfully. Playing audio now...")
        display(Audio(audio_data, autoplay=True))

    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error during speech generation: {http_err} - {response.text}")
    except Exception as e:
        print(f"An error occurred during speech generation: {e}")
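
The decoding half of this function can be exercised offline. The sketch below fabricates a response with the same `audios` field that the code above reads, using a short silent clip generated with Python's `wave` module. It assumes the clip is WAV data, which is an assumption for this example only; `make_silent_wav` and `first_audio_clip` are hypothetical helpers.

```python
import base64
import io
import wave

def make_silent_wav(seconds: float = 0.1, rate: int = 8000) -> bytes:
    """Create a tiny mono 16-bit WAV of silence to stand in for real TTS audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()

def first_audio_clip(response_data: dict) -> bytes:
    """Decode the first base64 clip from a response shaped like {'audios': [...]}."""
    audios = response_data.get("audios") or []
    if not audios:
        raise ValueError("response contains no audio")
    return base64.b64decode(audios[0])

# Fake response mirroring the 'audios' field that generate_and_play_speech reads.
fake_response = {"audios": [base64.b64encode(make_silent_wav()).decode("ascii")]}
clip = first_audio_clip(fake_response)

with wave.open(io.BytesIO(clip), "rb") as w:
    duration = w.getnframes() / w.getframerate()
print(f"{len(clip)} bytes, {duration:.2f} s")
```

Checking for an empty `audios` list before indexing gives a clearer error than the bare `IndexError` that `response_data['audios'][0]` would raise.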

# Step 6: The Main Application Orchestrator

The start_lecturemitra function acts as the conductor for the entire orchestra. It defines the complete user journey from start to finish:

It first prompts the user to enter a YouTube URL.
It calls the transcript and RAG pipeline functions to set everything up.
Finally, it launches the interactive chat session.

In [None]:
def start_lecturemitra():
    """
    The main orchestrator for the LectureMitra application.
    It handles the complete user journey from URL input to Q&A.
    """
    # 1. Get the YouTube URL from the user
    video_url = input("Welcome to LectureMitra! Please paste the YouTube video URL you want to study: ")

    if not video_url:
        print("No URL provided. Exiting.")
        return

    # 2. Fetch the transcript for that URL using 'get_youtube_transcript' (Step 1)
    transcript = get_youtube_transcript(video_url)

    # 3. Check if transcript was fetched successfully
    if not transcript:
        print("\nCould not process this video. Please try another one with available English captions.")
        return

    # 4. Build the RAG pipeline for this specific transcript using 'build_rag_pipeline' (Step 2)
    # This sets the global 'vector_db' variable for this session
    build_rag_pipeline(transcript)

    # 5. Start the interactive Q&A session
    # 'lecture_mitra_qa_session' is defined in Step 7; it only has to exist
    # by the time start_lecturemitra() is actually called.
    lecture_mitra_qa_session()

# Step 7: The Interactive Chat Session

The lecture_mitra_qa_session function creates the main interactive loop of our application. It continuously prompts the user for input and manages the conversation flow. It handles both text input and voice commands (by calling the recording and transcription functions when the user types "voice"). It also includes a small time.sleep(1) delay to ensure the input prompt doesn't break after audio playback.

In [None]:
import time  # Used for the short pause after audio playback

def lecture_mitra_qa_session():
    """
    The main interactive loop for the LectureMitra tutor.
    Pauses briefly after each answer so the next input prompt renders correctly.
    """
    global vector_db
    if vector_db is None:
        print("The RAG pipeline has not been built yet. Please run the setup cells first.")
        return

    print("\n--- Welcome to the LectureMitra QA Session! ---")
    print("You can ask questions about the video. Type 'exit' to end the session.")
    print("To ask a question with your voice, type 'voice' and press Enter.")

    while True:
        user_input = input("\nYou (or type 'voice'): ")

        if user_input.lower() == 'exit':
            print("Session ended. Goodbye!")
            break

        user_question = ""
        if user_input.lower() == 'voice':
            audio_file = record_audio(seconds=5)
            if os.path.exists(audio_file):
                transcribed_question = transcribe_audio_with_sarvam(audio_file)
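                # Failures come back as strings containing "error"; note that this check
                # also rejects a genuine question that happens to contain that word.
                if transcribed_question and "error" not in transcribed_question.lower():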
                    user_question = transcribed_question
                    print(f"You (via voice): {user_question}")
                else:
                    print(f"Could not understand audio. Result: {transcribed_question}")
                    continue
            else:
                continue
        else:
            user_question = user_input

        # Answer the question, print the reply, and speak it aloud
        answer_text = ask_lecturemitra(user_question)
        print(f"\nLectureMitra: {answer_text}")
        generate_and_play_speech(answer_text)

        # Add a small delay to allow the output stream to recover
        time.sleep(1)
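
The branching inside the loop above (exit, voice, or a typed question) can be separated from input() so it is easy to test. This is a minimal sketch under that assumption; `route_input` and `run_session` are hypothetical helpers, the answer function is a stand-in, and unlike the notebook's loop this version trims whitespace and skips blank lines.

```python
def route_input(user_input: str) -> tuple[str, str]:
    """Hypothetical helper: classify one line of input as 'exit', 'voice', or 'text'."""
    command = user_input.strip().lower()
    if command == "exit":
        return "exit", ""
    if command == "voice":
        return "voice", ""
    return "text", user_input.strip()

def run_session(inputs, answer):
    """Drive the loop from a list of inputs instead of input(); voice is skipped here."""
    transcript = []
    for line in inputs:
        action, question = route_input(line)
        if action == "exit":
            break
        if action == "voice" or not question:
            continue  # the notebook would record and transcribe at this point
        transcript.append((question, answer(question)))
    return transcript

log = run_session(
    ["What is RAG?", "voice", "  ", "EXIT", "ignored"],
    answer=lambda q: f"echo: {q}",
)
print(log)
```

Keeping the routing in a pure function means the exit and voice keywords can be checked without a microphone, an API key, or a live prompt.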

# Let's Get Started!

All the components have been defined. Running the final cell below will start the LectureMitra application.

What to Expect:

The application will prompt you to enter a YouTube URL.

Paste a link and press Enter. **Make sure the link you paste here is the shareable link from YouTube's Share button, not the URL copied from the browser's address bar.**

After a few moments of processing, the Q&A session will begin. You can ask questions by typing them, or by typing 'voice' and then speaking.

In [None]:
start_lecturemitra()