# Environment Setup

This notebook prepares the environment for the RAG-based
Signals & Systems AI Teaching Assistant project.

It performs:
- Google Drive mounting
- System dependency installation
- Python library installation
- Project directory creation
- GPU availability check

NOTE:
Run this notebook ONCE per session.


### Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


#### Base Project Paths

In [2]:
import os

BASE_DIR = "/content/drive/MyDrive/RAG_BAS_PROJECT"

VIDEO_DIR = os.path.join(BASE_DIR, "VIDEOS")
AUDIO_DIR = os.path.join(BASE_DIR, "AUDIOS")
JSON_DIR  = os.path.join(BASE_DIR, "jsons")

os.makedirs(VIDEO_DIR, exist_ok=True)
os.makedirs(AUDIO_DIR, exist_ok=True)
os.makedirs(JSON_DIR, exist_ok=True)

print("Project folders created successfully")


Project folders created successfully


#### Install System Dependencies

#### Install FFmpeg for audio/video processing

In [3]:

!apt-get update -qq
!apt-get install -y ffmpeg


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 110 not upgraded.


#####  Verify FFmpeg installation

In [4]:

!ffmpeg -version


ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-l

### Install Python Dependencies

In [5]:
# Core utilities
!pip install -U yt-dlp tqdm

# Whisper for transcription
!pip install -U openai-whisper

# Embeddings & ML
!pip install -U sentence-transformers scikit-learn

# Gemini API
!pip install -U google-generativeai


Collecting yt-dlp
  Downloading yt_dlp-2026.1.29-py3-none-any.whl.metadata (181 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/182.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m182.0/182.0 kB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
Downloading yt_dlp-2026.1.29-py3-none-any.whl (3.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.3/3.3 MB[0m [31m86.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: yt-dlp
Successfully installed yt-dlp-2026.1.29
Collecting openai-whisper
  Downloading openai_whisper-20250625.tar.gz (803 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m803.2/803.2 kB[0m [31m47.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: openai

#### Verify yt-dlp Installation

In [6]:
!yt-dlp --version


2026.01.29


In [7]:
print("Environment setup complete. You may proceed to the next notebook.")


Environment setup complete. You may proceed to the next notebook.


# 02a_video_to_audio.ipynb


# Automatic Video Download (YouTube Playlist)

This notebook:
- Downloads lecture videos from a YouTube playlist
- Automatically skips already-downloaded videos
- Supports partial or full playlist downloads
- Stores videos in a structured format for further processing

NOTE:
- Safe to re-run
- Downloads only new videos
- Does NOT perform audio conversion


#### Imports & Paths

In [8]:
import os
import yt_dlp


In [9]:
BASE_DIR = "/content/drive/MyDrive/RAG_BAS_PROJECT"
VIDEO_DIR = os.path.join(BASE_DIR, "VIDEOS")

os.makedirs(VIDEO_DIR, exist_ok=True)

print("Video directory ready")


Video directory ready


In [10]:
# Playlist URL
PLAYLIST_URL = "https://www.youtube.com/playlist?list=PLBlnK6fEyqRhG6s3jYIU48CqsT5cyiDTO"

# Playlist range
# Examples:
# None       → full playlist
# "1-10"     → first 10 videos
# "16-25"    → specific range
PLAYLIST_ITEMS = "1-25"


In [11]:
def download_playlist_videos():
    ydl_opts = {
        # Output filename: keeps playlist order
        "outtmpl": f"{VIDEO_DIR}/%(playlist_index)s - %(title)s.%(ext)s",

        # Best quality video + audio
        "format": "bv*+ba/b",
        "merge_output_format": "mp4",

        # Playlist selection
        "playlist_items": PLAYLIST_ITEMS,

        # Skip already-downloaded files
        "overwrites": False,
        "continuedl": True,

        # Stability & retries
        "retries": 10,
        "fragment_retries": 10,
        "sleep_interval": 1,
        "max_sleep_interval": 3,

        # Cleaner logs
        "quiet": False,
        "no_warnings": False,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([PLAYLIST_URL])

    print("Video download completed successfully")


In [12]:
download_playlist_videos()


[youtube:tab] Extracting URL: https://www.youtube.com/playlist?list=PLBlnK6fEyqRhG6s3jYIU48CqsT5cyiDTO
[youtube:tab] PLBlnK6fEyqRhG6s3jYIU48CqsT5cyiDTO: Downloading webpage
[youtube:tab] PLBlnK6fEyqRhG6s3jYIU48CqsT5cyiDTO: Redownloading playlist API JSON with unavailable videos
[download] Downloading playlist: Signals and Systems
[youtube:tab] Playlist Signals and Systems: Downloading 25 items of 313
[download] Downloading item 1 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=s8rsR_TStaA
[youtube] s8rsR_TStaA: Downloading webpage




[youtube] s8rsR_TStaA: Downloading android vr player API JSON
[youtube] s8rsR_TStaA: Downloading ios downgraded player API JSON
[youtube] s8rsR_TStaA: Downloading m3u8 information
[info] s8rsR_TStaA: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/01 - Introduction to Signals and Systems.mp4 has already been downloaded
[download] Downloading item 2 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=H4hk6N5vC1Q
[youtube] H4hk6N5vC1Q: Downloading webpage




[youtube] H4hk6N5vC1Q: Downloading android vr player API JSON
[youtube] H4hk6N5vC1Q: Downloading ios downgraded player API JSON
[youtube] H4hk6N5vC1Q: Downloading m3u8 information
[info] H4hk6N5vC1Q: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/02 - Continuous and Discrete Time Signals.mp4 has already been downloaded
[download] Downloading item 3 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=TmkTwJT79yc
[youtube] TmkTwJT79yc: Downloading webpage




[youtube] TmkTwJT79yc: Downloading android vr player API JSON
[youtube] TmkTwJT79yc: Downloading ios downgraded player API JSON
[youtube] TmkTwJT79yc: Downloading m3u8 information
[info] TmkTwJT79yc: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/03 - Addition of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 4 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=jPCgU4ghB8Q
[youtube] jPCgU4ghB8Q: Downloading webpage




[youtube] jPCgU4ghB8Q: Downloading android vr player API JSON
[youtube] jPCgU4ghB8Q: Downloading ios downgraded player API JSON
[youtube] jPCgU4ghB8Q: Downloading m3u8 information
[info] jPCgU4ghB8Q: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/04 - Multiplication of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 5 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=jnB-U5KBvN4
[youtube] jnB-U5KBvN4: Downloading webpage




[youtube] jnB-U5KBvN4: Downloading android vr player API JSON
[youtube] jnB-U5KBvN4: Downloading ios downgraded player API JSON
[youtube] jnB-U5KBvN4: Downloading m3u8 information
[info] jnB-U5KBvN4: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/05 - Time Scaling of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 6 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=sTHbXeiAB_c
[youtube] sTHbXeiAB_c: Downloading webpage




[youtube] sTHbXeiAB_c: Downloading android vr player API JSON
[youtube] sTHbXeiAB_c: Downloading ios downgraded player API JSON
[youtube] sTHbXeiAB_c: Downloading m3u8 information
[info] sTHbXeiAB_c: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/06 - Amplitude Scaling of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 7 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=9Cd5nVCFfc0
[youtube] 9Cd5nVCFfc0: Downloading webpage




[youtube] 9Cd5nVCFfc0: Downloading android vr player API JSON
[youtube] 9Cd5nVCFfc0: Downloading ios downgraded player API JSON
[youtube] 9Cd5nVCFfc0: Downloading m3u8 information
[info] 9Cd5nVCFfc0: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/07 - Time Shifting of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 8 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=dnrJlCD2MLc
[youtube] dnrJlCD2MLc: Downloading webpage




[youtube] dnrJlCD2MLc: Downloading android vr player API JSON
[youtube] dnrJlCD2MLc: Downloading ios downgraded player API JSON
[youtube] dnrJlCD2MLc: Downloading m3u8 information
[info] dnrJlCD2MLc: Downloading 1 format(s): 244+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/08 - Amplitude Shifting of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 9 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=BzAbZfT6RxQ
[youtube] BzAbZfT6RxQ: Downloading webpage




[youtube] BzAbZfT6RxQ: Downloading android vr player API JSON
[youtube] BzAbZfT6RxQ: Downloading ios downgraded player API JSON
[youtube] BzAbZfT6RxQ: Downloading m3u8 information
[info] BzAbZfT6RxQ: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/09 - Reversal of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 10 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=3mQr7rU4lnA
[youtube] 3mQr7rU4lnA: Downloading webpage




[youtube] 3mQr7rU4lnA: Downloading android vr player API JSON
[youtube] 3mQr7rU4lnA: Downloading ios downgraded player API JSON
[youtube] 3mQr7rU4lnA: Downloading m3u8 information
[info] 3mQr7rU4lnA: Downloading 1 format(s): 244+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/10 - Multiple Transformations of Continuous-Time Signals.mp4 has already been downloaded
[download] Downloading item 11 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=6O2oEQ2UwxY
[youtube] 6O2oEQ2UwxY: Downloading webpage




[youtube] 6O2oEQ2UwxY: Downloading android vr player API JSON
[youtube] 6O2oEQ2UwxY: Downloading ios downgraded player API JSON
[youtube] 6O2oEQ2UwxY: Downloading m3u8 information
[info] Testing format 616
[info] 6O2oEQ2UwxY: Downloading 1 format(s): 616+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/11 - Multiple Transformations of CTS (Important Point & Shortcut).mp4 has already been downloaded
[download] Downloading item 12 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=zsJmqRY1uW0
[youtube] zsJmqRY1uW0: Downloading webpage




[youtube] zsJmqRY1uW0: Downloading android vr player API JSON
[youtube] zsJmqRY1uW0: Downloading ios downgraded player API JSON
[youtube] zsJmqRY1uW0: Downloading m3u8 information
[info] zsJmqRY1uW0: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/12 - Multiple Transformations of CTS (Solved Problem 1).mp4 has already been downloaded
[download] Downloading item 13 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=TiC4P9WLzo8
[youtube] TiC4P9WLzo8: Downloading webpage




[youtube] TiC4P9WLzo8: Downloading android vr player API JSON
[youtube] TiC4P9WLzo8: Downloading ios downgraded player API JSON
[youtube] TiC4P9WLzo8: Downloading m3u8 information
[info] TiC4P9WLzo8: Downloading 1 format(s): 248+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/13 - Multiple Transformations of CTS (Solved Problem 2).mp4 has already been downloaded
[download] Downloading item 14 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=c_YLkvk_ZXI
[youtube] c_YLkvk_ZXI: Downloading webpage




[youtube] c_YLkvk_ZXI: Downloading android vr player API JSON
[youtube] c_YLkvk_ZXI: Downloading ios downgraded player API JSON
[youtube] c_YLkvk_ZXI: Downloading m3u8 information
[info] c_YLkvk_ZXI: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/14 - Even and Odd Signals.mp4 has already been downloaded
[download] Downloading item 15 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=d0DbwTJUy5c
[youtube] d0DbwTJUy5c: Downloading webpage




[youtube] d0DbwTJUy5c: Downloading android vr player API JSON
[youtube] d0DbwTJUy5c: Downloading ios downgraded player API JSON
[youtube] d0DbwTJUy5c: Downloading m3u8 information
[info] d0DbwTJUy5c: Downloading 1 format(s): 244+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/15 - Even and Odd Components of a Signal.mp4 has already been downloaded
[download] Downloading item 16 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=wH-R5ao8Wmg
[youtube] wH-R5ao8Wmg: Downloading webpage




[youtube] wH-R5ao8Wmg: Downloading android vr player API JSON
[youtube] wH-R5ao8Wmg: Downloading ios downgraded player API JSON
[youtube] wH-R5ao8Wmg: Downloading m3u8 information
[info] wH-R5ao8Wmg: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/16 - Properties of Even and Odd Signals.mp4 has already been downloaded
[download] Downloading item 17 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=bb_z42C-1-E
[youtube] bb_z42C-1-E: Downloading webpage




[youtube] bb_z42C-1-E: Downloading android vr player API JSON
[youtube] bb_z42C-1-E: Downloading ios downgraded player API JSON
[youtube] bb_z42C-1-E: Downloading m3u8 information
[info] bb_z42C-1-E: Downloading 1 format(s): 244+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/17 - Even and Odd Signals (Solved Problem 1).mp4 has already been downloaded
[download] Downloading item 18 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=5fX9RfTTvqI
[youtube] 5fX9RfTTvqI: Downloading webpage




[youtube] 5fX9RfTTvqI: Downloading android vr player API JSON
[youtube] 5fX9RfTTvqI: Downloading ios downgraded player API JSON
[youtube] 5fX9RfTTvqI: Downloading m3u8 information
[info] 5fX9RfTTvqI: Downloading 1 format(s): 244+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/18 - Even and Odd Signals (Solved Problem 2).mp4 has already been downloaded
[download] Downloading item 19 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=BBwgM_UNkFU
[youtube] BBwgM_UNkFU: Downloading webpage




[youtube] BBwgM_UNkFU: Downloading android vr player API JSON
[youtube] BBwgM_UNkFU: Downloading ios downgraded player API JSON
[youtube] BBwgM_UNkFU: Downloading m3u8 information
[info] BBwgM_UNkFU: Downloading 1 format(s): 399+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/19 - Even and Odd Signals (Solved Problem 3).mp4 has already been downloaded
[download] Downloading item 20 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=_f2vBo-ZDdI
[youtube] _f2vBo-ZDdI: Downloading webpage




[youtube] _f2vBo-ZDdI: Downloading android vr player API JSON
[youtube] _f2vBo-ZDdI: Downloading ios downgraded player API JSON
[youtube] _f2vBo-ZDdI: Downloading m3u8 information
[info] _f2vBo-ZDdI: Downloading 1 format(s): 248+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/20 - Even and Odd Signals (Solved Problem 4).mp4 has already been downloaded
[download] Downloading item 21 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=Gj9pDC_OBQU
[youtube] Gj9pDC_OBQU: Downloading webpage




[youtube] Gj9pDC_OBQU: Downloading android vr player API JSON
[youtube] Gj9pDC_OBQU: Downloading ios downgraded player API JSON
[youtube] Gj9pDC_OBQU: Downloading m3u8 information
[info] Gj9pDC_OBQU: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/21 - Periodic and Non-Periodic Signals.mp4 has already been downloaded
[download] Downloading item 22 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=zkHs90fmVf4
[youtube] zkHs90fmVf4: Downloading webpage




[youtube] zkHs90fmVf4: Downloading android vr player API JSON
[youtube] zkHs90fmVf4: Downloading ios downgraded player API JSON
[youtube] zkHs90fmVf4: Downloading m3u8 information
[info] zkHs90fmVf4: Downloading 1 format(s): 248+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/22 - Periodic and Non-Periodic Signals (Important Point).mp4 has already been downloaded
[download] Downloading item 23 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=4kEZQ9RtItk
[youtube] 4kEZQ9RtItk: Downloading webpage




[youtube] 4kEZQ9RtItk: Downloading android vr player API JSON
[youtube] 4kEZQ9RtItk: Downloading ios downgraded player API JSON
[youtube] 4kEZQ9RtItk: Downloading m3u8 information
[info] 4kEZQ9RtItk: Downloading 1 format(s): 244+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/23 - Calculation of Fundamental Period.mp4 has already been downloaded
[download] Downloading item 24 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=A3Il3OyNoso
[youtube] A3Il3OyNoso: Downloading webpage




[youtube] A3Il3OyNoso: Downloading android vr player API JSON
[youtube] A3Il3OyNoso: Downloading ios downgraded player API JSON
[youtube] A3Il3OyNoso: Downloading m3u8 information
[info] A3Il3OyNoso: Downloading 1 format(s): 397+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/24 - Periodic and Non-Periodic Signals (Solved Problems).mp4 has already been downloaded
[download] Downloading item 25 of 25
[youtube] Extracting URL: https://www.youtube.com/watch?v=SfIQ6fLk6co
[youtube] SfIQ6fLk6co: Downloading webpage




[youtube] SfIQ6fLk6co: Downloading android vr player API JSON
[youtube] SfIQ6fLk6co: Downloading ios downgraded player API JSON
[youtube] SfIQ6fLk6co: Downloading m3u8 information
[info] SfIQ6fLk6co: Downloading 1 format(s): 248+251
[download] /content/drive/MyDrive/RAG_BAS_PROJECT/VIDEOS/25 - Effect of Time-Shifting & Time-Scaling on Fundamental Time Period.mp4 has already been downloaded
[download] Finished downloading playlist: Signals and Systems
Video download completed successfully


In [13]:
print("02a_auto_video_download.ipynb completed.")


02a_auto_video_download.ipynb completed.


# 02_video_to_audio.ipynb

# Video to Audio (Data Acquisition)

This notebook:
- Downloads lecture videos from a YouTube playlist
- Stores videos in MP4 format
- Converts MP4 videos to MP3 audio using FFmpeg

NOTE:
- This notebook should be run ONLY ONCE.
- Re-run ONLY if the playlist changes.


#### Imports & Paths

In [14]:
import os
import subprocess
from yt_dlp import YoutubeDL


# Base project paths

In [15]:

BASE_DIR = "/content/drive/MyDrive/RAG_BAS_PROJECT"
VIDEO_DIR = os.path.join(BASE_DIR, "VIDEOS")
AUDIO_DIR = os.path.join(BASE_DIR, "AUDIOS")

os.makedirs(VIDEO_DIR, exist_ok=True)
os.makedirs(AUDIO_DIR, exist_ok=True)

print("Directories ready")


Directories ready


#### Convert MP4 → MP3 (Clean & Safe)

In [16]:
video_files = sorted([
    f for f in os.listdir(VIDEO_DIR)
    if f.endswith(".mp4")
])

print(f"Found {len(video_files)} video files for conversion")


Found 25 video files for conversion


In [17]:
for video in video_files:
    input_path = os.path.join(VIDEO_DIR, video)

    lecture_number = video.split(" - ")[0]
    lecture_title = video.split(" - ", 1)[1].replace(".mp4", "")
    output_audio = f"{lecture_number}_{lecture_title}.mp3"
    output_path = os.path.join(AUDIO_DIR, output_audio)

    # Skip if MP3 already exists (IMPORTANT)
    if os.path.exists(output_path):
        print(f"Skipping (already exists): {output_audio}")
        continue

    print(f"Converting: {video}")

    subprocess.run(
        [
            "ffmpeg",
            "-nostdin",
            "-y",
            "-i", input_path,
            output_path
        ],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.STDOUT
    )

print("All videos converted to MP3 successfully!")


Skipping (already exists): 01_Introduction to Signals and Systems.mp3
Skipping (already exists): 02_Continuous and Discrete Time Signals.mp3
Skipping (already exists): 03_Addition of Continuous-Time Signals.mp3
Skipping (already exists): 04_Multiplication of Continuous-Time Signals.mp3
Skipping (already exists): 05_Time Scaling of Continuous-Time Signals.mp3
Skipping (already exists): 06_Amplitude Scaling of Continuous-Time Signals.mp3
Skipping (already exists): 07_Time Shifting of Continuous-Time Signals.mp3
Skipping (already exists): 08_Amplitude Shifting of Continuous-Time Signals.mp3
Skipping (already exists): 09_Reversal of Continuous-Time Signals.mp3
Skipping (already exists): 10_Multiple Transformations of Continuous-Time Signals.mp3
Skipping (already exists): 11_Multiple Transformations of CTS (Important Point & Shortcut).mp3
Skipping (already exists): 12_Multiple Transformations of CTS (Solved Problem 1).mp3
Skipping (already exists): 13_Multiple Transformations of CTS (Solved

In [18]:
print("02_video_to_audio.ipynb completed successfully.")


02_video_to_audio.ipynb completed successfully.


# 03_audio_to_json.ipynb

# Audio to JSON (Transcription & Chunking)

This notebook converts lecture audio files into structured JSON documents.

Pipeline:
1. Load MP3 lecture audio files
2. Transcribe audio using OpenAI Whisper (large-v2)
3. Merge Whisper segments into RAG-optimized chunks
4. Store timestamps, lecture metadata, and text
5. Save output as reusable JSON files

Chunking Strategy:
- Time-based chunking (15–25 seconds)
- Small overlap to preserve context
- Designed for embedding and retrieval performance


### Imports & Paths

In [19]:
import os
import json
import whisper
from tqdm import tqdm


In [20]:
BASE_DIR = "/content/drive/MyDrive/RAG_BAS_PROJECT"
AUDIO_DIR = os.path.join(BASE_DIR, "AUDIOS")
JSON_DIR = os.path.join(BASE_DIR, "jsons")

os.makedirs(JSON_DIR, exist_ok=True)

print("Directories ready")


Directories ready


### Load Whisper Model

In [21]:
model = whisper.load_model("large-v2")
print("Whisper large-v2 model loaded")


100%|██████████████████████████████████████| 2.87G/2.87G [00:23<00:00, 130MiB/s]


Whisper large-v2 model loaded


## Chunking Parameters

- MAX_DURATION: Target chunk length (seconds)
- OVERLAP: Context overlap between chunks


In [22]:
MAX_DURATION = 20   # seconds
OVERLAP = 3         # seconds


### Chunk Creation Function

In [23]:
def create_chunks(segments, lecture_number, lecture_title):
    chunks = []
    buffer_text = []
    chunk_start = None

    for seg in segments:
        if chunk_start is None:
            chunk_start = seg["start"]

        buffer_text.append(seg["text"].strip())
        duration = seg["end"] - chunk_start

        if duration >= MAX_DURATION:
            chunks.append({
                "Number": lecture_number,
                "Title": lecture_title,
                "Start": round(chunk_start, 2),
                "End": round(seg["end"], 2),
                "Text": " ".join(buffer_text)
            })

            # Overlap handling
            chunk_start = seg["end"] - OVERLAP
            buffer_text = []

    # Handle leftover text
    if buffer_text:
        chunks.append({
            "Number": lecture_number,
            "Title": lecture_title,
            "Start": round(chunk_start, 2),
            "End": round(segments[-1]["end"], 2),
            "Text": " ".join(buffer_text)
        })

    return chunks


#### List Audio Files

In [24]:
audio_files = sorted([
    f for f in os.listdir(AUDIO_DIR)
    if f.endswith(".mp3")
])

print(f"Found {len(audio_files)} audio files")


Found 25 audio files


#### Transcribe & Save JSON

In [25]:
for audio in tqdm(audio_files):
    lecture_number = audio.split("_")[0]
    lecture_title = audio.split("_", 1)[1].replace(".mp3", "")
    audio_path = os.path.join(AUDIO_DIR, audio)

    output_json_path = os.path.join(
        JSON_DIR,
        f"{lecture_number}_{lecture_title}.json"
    )

    # Skip if already processed
    if os.path.exists(output_json_path):
        print(f"Skipping (already exists): {lecture_number}_{lecture_title}")
        continue

    print(f"\nTranscribing: {audio}")

    result = model.transcribe(
        audio=audio_path,
        language="en",
        task="transcribe",
        word_timestamps=False
    )

    chunks = create_chunks(
        segments=result["segments"],
        lecture_number=lecture_number,
        lecture_title=lecture_title
    )

    output = {
        "lecture_number": lecture_number,
        "lecture_title": lecture_title,
        "full_text": result["text"],
        "chunks": chunks
    }

    with open(output_json_path, "w") as f:
        json.dump(output, f, indent=2)

    print(f"Saved JSON: {output_json_path}")


  4%|▍         | 1/25 [00:00<00:07,  3.09it/s]

Skipping (already exists): 01_Introduction to Signals and Systems
Skipping (already exists): 02_Continuous and Discrete Time Signals
Skipping (already exists): 03_Addition of Continuous-Time Signals
Skipping (already exists): 04_Multiplication of Continuous-Time Signals
Skipping (already exists): 05_Time Scaling of Continuous-Time Signals
Skipping (already exists): 06_Amplitude Scaling of Continuous-Time Signals
Skipping (already exists): 07_Time Shifting of Continuous-Time Signals
Skipping (already exists): 08_Amplitude Shifting of Continuous-Time Signals
Skipping (already exists): 09_Reversal of Continuous-Time Signals
Skipping (already exists): 10_Multiple Transformations of Continuous-Time Signals
Skipping (already exists): 11_Multiple Transformations of CTS (Important Point & Shortcut)
Skipping (already exists): 12_Multiple Transformations of CTS (Solved Problem 1)

Transcribing: 13_Multiple Transformations of CTS (Solved Problem 2).mp3


 52%|█████▏    | 13/25 [01:58<01:51,  9.31s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/13_Multiple Transformations of CTS (Solved Problem 2).json

Transcribing: 14_Even and Odd Signals.mp3


 56%|█████▌    | 14/25 [03:15<02:58, 16.20s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/14_Even and Odd Signals.json

Transcribing: 15_Even and Odd Components of a Signal.mp3


 60%|██████    | 15/25 [03:49<03:03, 18.36s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/15_Even and Odd Components of a Signal.json

Transcribing: 16_Properties of Even and Odd Signals.mp3


 64%|██████▍   | 16/25 [05:27<04:35, 30.62s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/16_Properties of Even and Odd Signals.json

Transcribing: 17_Even and Odd Signals (Solved Problem 1).mp3


 68%|██████▊   | 17/25 [06:22<04:40, 35.01s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/17_Even and Odd Signals (Solved Problem 1).json

Transcribing: 18_Even and Odd Signals (Solved Problem 2).mp3


 72%|███████▏  | 18/25 [07:50<05:19, 45.65s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/18_Even and Odd Signals (Solved Problem 2).json

Transcribing: 19_Even and Odd Signals (Solved Problem 3).mp3


 76%|███████▌  | 19/25 [09:33<05:51, 58.63s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/19_Even and Odd Signals (Solved Problem 3).json

Transcribing: 20_Even and Odd Signals (Solved Problem 4).mp3


 80%|████████  | 20/25 [11:13<05:44, 68.87s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/20_Even and Odd Signals (Solved Problem 4).json

Transcribing: 21_Periodic and Non-Periodic Signals.mp3


 84%|████████▍ | 21/25 [12:48<05:01, 75.40s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/21_Periodic and Non-Periodic Signals.json

Transcribing: 22_Periodic and Non-Periodic Signals (Important Point).mp3


 88%|████████▊ | 22/25 [13:41<03:28, 69.40s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/22_Periodic and Non-Periodic Signals (Important Point).json

Transcribing: 23_Calculation of Fundamental Period.mp3


 92%|█████████▏| 23/25 [14:19<02:01, 60.68s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/23_Calculation of Fundamental Period.json

Transcribing: 24_Periodic and Non-Periodic Signals (Solved Problems).mp3


 96%|█████████▌| 24/25 [15:09<00:57, 57.63s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/24_Periodic and Non-Periodic Signals (Solved Problems).json

Transcribing: 25_Effect of Time-Shifting & Time-Scaling on Fundamental Time Period.mp3


100%|██████████| 25/25 [16:29<00:00, 39.58s/it]

Saved JSON: /content/drive/MyDrive/RAG_BAS_PROJECT/jsons/25_Effect of Time-Shifting & Time-Scaling on Fundamental Time Period.json





In [26]:
print("03_audio_to_json.ipynb completed successfully.")


03_audio_to_json.ipynb completed successfully.


# 04_embeddings.ipynb

# Embeddings Generation

This notebook:
- Loads chunked lecture JSON files
- Converts text chunks into dense vector embeddings
- Uses a state-of-the-art embedding model (BAAI/bge-m3)
- Stores embeddings with metadata in a reusable CSV file

NOTE:
- Run this notebook ONLY when chunks change
- Do NOT recompute embeddings on every query


### Imports & Paths

In [27]:
import os
import json
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer


In [28]:
BASE_DIR = "/content/drive/MyDrive/RAG_BAS_PROJECT"
JSON_DIR = os.path.join(BASE_DIR, "jsons")
EMBEDDING_CSV = os.path.join(BASE_DIR, "embeddings.csv")

print("Paths configured")


Paths configured


#### Load Embedding Model

In [30]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer(
    "BAAI/bge-m3",
    device=device
)

print("Embedding model loaded")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/123 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/54.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/687 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.27G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.27G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/444 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/191 [00:00<?, ?B/s]

Embedding model loaded


## Embedding Function

- Batch encoding for efficiency
- Normalized embeddings for cosine similarity


In [32]:
def create_embeddings(texts, batch_size=8):
    return model.encode(
        texts,
        batch_size=batch_size,
        normalize_embeddings=True,
        show_progress_bar=False
    )


#### Load JSON Files


In [33]:
json_files = sorted([
    f for f in os.listdir(JSON_DIR)
    if f.endswith(".json")
])

print(f"Found {len(json_files)} JSON files")


Found 25 JSON files


#### Generate Embeddings

In [34]:
records = []
chunk_id = 0

for json_file in json_files:
    json_path = os.path.join(JSON_DIR, json_file)

    with open(json_path, "r") as f:
        content = json.load(f)

    print(f"Creating embeddings for: {json_file}")

    texts = [chunk["Text"] for chunk in content["chunks"]]
    embeddings = create_embeddings(texts)

    for i, chunk in enumerate(content["chunks"]):
        records.append({
            "chunk_id": chunk_id,
            "Number": chunk["Number"],
            "Title": chunk["Title"],
            "Start": chunk["Start"],
            "End": chunk["End"],
            "Text": chunk["Text"],
            "embedding": embeddings[i].tolist()
        })
        chunk_id += 1

    torch.cuda.empty_cache()


Creating embeddings for: 01_Introduction to Signals and Systems.json
Creating embeddings for: 02_Continuous and Discrete Time Signals.json
Creating embeddings for: 03_Addition of Continuous-Time Signals.json
Creating embeddings for: 04_Multiplication of Continuous-Time Signals.json
Creating embeddings for: 05_Time Scaling of Continuous-Time Signals.json
Creating embeddings for: 06_Amplitude Scaling of Continuous-Time Signals.json
Creating embeddings for: 07_Time Shifting of Continuous-Time Signals.json
Creating embeddings for: 08_Amplitude Shifting of Continuous-Time Signals.json
Creating embeddings for: 09_Reversal of Continuous-Time Signals.json
Creating embeddings for: 10_Multiple Transformations of Continuous-Time Signals.json
Creating embeddings for: 11_Multiple Transformations of CTS (Important Point & Shortcut).json
Creating embeddings for: 12_Multiple Transformations of CTS (Solved Problem 1).json
Creating embeddings for: 13_Multiple Transformations of CTS (Solved Problem 2).js

#### Create DataFrame

In [35]:
df = pd.DataFrame.from_records(records)
print(df.head())
print(f"Total chunks embedded: {len(df)}")


   chunk_id Number                                Title  Start     End  \
0         0     01  Introduction to Signals and Systems   0.00   21.08   
1         1     01  Introduction to Signals and Systems  18.08   40.88   
2         2     01  Introduction to Signals and Systems  37.88   62.78   
3         3     01  Introduction to Signals and Systems  59.78   80.56   
4         4     01  Introduction to Signals and Systems  77.56  101.50   

                                                Text  \
0  Welcome to the first lecture in the lecture se...   
1  The first one is the introductional part in wh...   
2  etc. In the third part we need to learn the ba...   
3  Signals. CTS stands for Continuous Time Signal...   
4  The eighth part is the Z-transform or Z-transf...   

                                           embedding  
0  [0.005194071214646101, 0.0008971200441010296, ...  
1  [-0.022420773282647133, 0.01766463927924633, -...  
2  [-0.023778000846505165, -0.0033304006792604923... 

#### Save Embeddings to CSV

In [36]:
# Convert embedding list → JSON string for CSV storage
df["embedding"] = df["embedding"].apply(json.dumps)

df.to_csv(EMBEDDING_CSV, index=False)

print(f"Embeddings saved to: {EMBEDDING_CSV}")


Embeddings saved to: /content/drive/MyDrive/RAG_BAS_PROJECT/embeddings.csv


In [37]:
print("04_embeddings.ipynb completed successfully.")


04_embeddings.ipynb completed successfully.


# 05_retrieval.ipynb

```
This notebook performs semantic retrieval over lecture transcripts.
Given a user query, it finds the most relevant transcript chunks
using vector similarity search. The retrieved chunks are later
used as context for a RAG-based AI teaching assistant.
```

In [38]:
import json
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import torch
from sentence_transformers import SentenceTransformer


#### Load Embeddings CSV

In [39]:
CSV_PATH = "/content/drive/MyDrive/RAG_BAS_PROJECT/embeddings.csv"

df = pd.read_csv(CSV_PATH)

print("CSV loaded successfully")
print("Total chunks:", len(df))


CSV loaded successfully
Total chunks: 627


#### Convert Embedding Strings → NumPy Matrix

In [40]:
df["embedding"] = df["embedding"].apply(json.loads)

embedding_matrix = np.vstack(df["embedding"].values)

print("Embedding matrix shape:", embedding_matrix.shape)


Embedding matrix shape: (627, 1024)


#### Load Same Embedding Model

In [41]:
# model = SentenceTransformer(
#     "BAAI/bge-m3",
#     device=device
# )

# because the model previously uplodade

### Encode User Query

In [45]:
def encode_query(text,batch_size=8):
     embedding = model.encode(
        [text],
         batch_size=batch_size,
         normalize_embeddings=True,
         show_progress_bar=False
      )
     return embedding



#### Cosine Similarity + Top-K Retrieval

In [46]:
def retrieve_top_k(query_embedding, embedding_matrix, df, k=5):
    similarities = cosine_similarity(
        embedding_matrix,
        query_embedding
    ).flatten()

    top_indices = similarities.argsort()[::-1][:k]

    results = df.iloc[top_indices].copy()
    results["similarity_score"] = similarities[top_indices]

    return results


### Run Retrieval

In [47]:
user_question = input("Ask a question: ")

query_embedding = encode_query(user_question)

top_k = retrieve_top_k(
    query_embedding,
    embedding_matrix,
    df,
    k=3
)

top_k[[
    "Number",
    "Title",
    "Start",
    "End",
    "Text",
    "similarity_score"
]]


Ask a question: what is amplitude of a signal?


Unnamed: 0,Number,Title,Start,End,Text,similarity_score
55,2,Continuous and Discrete Time Signals,527.16,551.02,"correctly. You know the amplitudes 1, 0, 1, 2,...",0.712385
120,6,Amplitude Scaling of Continuous-Time Signals,22.28,47.16,"In case of amplitude scaling, we multiply the ...",0.640954
53,2,Continuous and Discrete Time Signals,485.52,510.12,inside these two curly braces will be the ampl...,0.623273


#### Modify 05_retrieval.ipynb to save top-k chunks

In [48]:
TOP_K_PATH = "/content/drive/MyDrive/RAG_BAS_PROJECT/top_k.json"
top_k.to_json(TOP_K_PATH, orient="records", indent=2)
print(f"Top-K retrieval results saved to: {TOP_K_PATH}")

# Save the user question for 06_rag_gemini
QUESTION_PATH = "/content/drive/MyDrive/RAG_BAS_PROJECT/user_question.json"
with open(QUESTION_PATH, "w") as f:
    f.write(json.dumps({"question": user_question}))
print(f"User question saved to: {QUESTION_PATH}")

Top-K retrieval results saved to: /content/drive/MyDrive/RAG_BAS_PROJECT/top_k.json
User question saved to: /content/drive/MyDrive/RAG_BAS_PROJECT/user_question.json


# 06_rag_gemini.ipynb

# RAG-based AI Teaching Assistant (Gemini)

#### Install & Import Libraries

In [49]:
!pip install -q -U google-generativeai


In [51]:
import google.generativeai as genai
import pandas as pd


#### Configure Gemini API Key

In [53]:
from google.colab import userdata

api_key = userdata.get("GOOGLE_API_KEY")
genai.configure(api_key=api_key)

print("Gemini API configured successfully")


Gemini API configured successfully


In [54]:
def build_context(top_k_df):
    context = ""

    for _, row in top_k_df.iterrows():
        context += f"""
Lecture Title: {row['Title']}
Lecture Number: {row['Number']}
Start Time: {row['Start']} seconds
End Time: {row['End']} seconds
Lecture Content:
{row['Text']}
----------------------------------
"""
    return context


In [55]:
def build_prompt(context, user_question):
    prompt = f"""
You are an AI teaching assistant for a Signals and Systems course.

IMPORTANT RULES:
- Use ONLY the lecture content below.
- Do NOT use outside knowledge.
- You may combine multiple lecture chunks.
- If the topic is NOT covered, say:
  "This question is not covered in the provided lectures."

You MUST:
- Explain in simple student-friendly language
- Clearly mention:
  - Lecture Title
  - Lecture Number
  - Start Time
  - End Time

Lecture Content:
{context}

User Question:
{user_question}

Answer Format (STRICT):

Answer:
<simple explanation>

Where it is taught:
- Lecture Title:
- Lecture Number:
- Start Time:
- End Time:
"""
    return prompt


In [56]:
def gemini_rag_answer(prompt):
    model = genai.GenerativeModel("gemini-2.5-flash")
    response = model.generate_content(prompt)
    return response.text


In [57]:
# Load the saved user question from 05_file
QUESTION_PATH = "/content/drive/MyDrive/RAG_BAS_PROJECT/user_question.json"
with open(QUESTION_PATH, "r") as f:
    user_question = json.load(f)["question"]

print(f"Loaded user question: {user_question}")

Loaded user question: what is amplitude of a signal?


In [58]:
import pandas as pd

TOP_K_PATH = "/content/drive/MyDrive/RAG_BAS_PROJECT/top_k.json"

# Load the top-k chunks retrieved in 05_retrieval.ipynb
top_k = pd.read_json(TOP_K_PATH)

print("Top-K retrieval data loaded successfully")
print(top_k.head())


Top-K retrieval data loaded successfully
   chunk_id  Number                                         Title   Start  \
0        55       2          Continuous and Discrete Time Signals  527.16   
1       120       6  Amplitude Scaling of Continuous-Time Signals   22.28   
2        53       2          Continuous and Discrete Time Signals  485.52   

      End                                               Text  \
0  551.02  correctly. You know the amplitudes 1, 0, 1, 2,...   
1   47.16  In case of amplitude scaling, we multiply the ...   
2  510.12  inside these two curly braces will be the ampl...   

                                           embedding  similarity_score  
0  [0.0094819423, 0.0447933003, -0.0621751137, -0...          0.712385  
1  [-0.026730001000000003, 0.018699931, -0.056353...          0.640954  
2  [0.0094759092, 0.0575332083, -0.0436175913, 0....          0.623273  


In [59]:

context_text = build_context(top_k)
prompt = build_prompt(context_text, user_question)

answer = gemini_rag_answer(prompt)

print("\nAI Teaching Assistant Answer:\n")
print(answer)



AI Teaching Assistant Answer:

Answer:
The amplitude of a signal refers to its numerical values. For example, values like 1, 0, 1, 2, 2, 1 are described as the amplitudes of a signal. For discrete-time signals, these amplitude values are typically found inside curly braces.

Where it is taught:
- Lecture Title: Continuous and Discrete Time Signals
- Lecture Number: 2
- Start Time: 485.52 seconds
- End Time: 551.02 seconds


In [60]:
SAVE_PATH = "/content/drive/MyDrive/RAG_BAS_PROJECT/response.txt"

with open(SAVE_PATH, "w") as f:
    f.write(answer)

print("Response saved successfully")


Response saved successfully
