# ***Understanding and Preparing Data***

In this section, I developed a comprehensive web-scraping and text-preparation pipeline designed to collect, clean, and structure content from the University of Chicago’s Master of Science in Applied Data Science (MSADS) program pages.

The workflow combines an automated crawler that discovers relevant subpages, main-text extraction with trafilatura, and overlapping fixed-size chunking to prepare the text for embedding and retrieval-augmented generation (RAG).

This approach keeps program-related content, cleans it, and splits it into context-preserving text segments for downstream retrieval.

In [None]:
!pip install requests beautifulsoup4 trafilatura tqdm

Collecting trafilatura
  Downloading trafilatura-2.0.0-py3-none-any.whl.metadata (12 kB)
Collecting courlan>=1.3.2 (from trafilatura)
  Downloading courlan-1.3.2-py3-none-any.whl.metadata (17 kB)
Collecting htmldate>=1.9.2 (from trafilatura)
  Downloading htmldate-1.9.4-py3-none-any.whl.metadata (10 kB)
Collecting justext>=3.0.1 (from trafilatura)
  Downloading justext-3.0.2-py2.py3-none-any.whl.metadata (7.3 kB)
Collecting tld>=0.13 (from courlan>=1.3.2->trafilatura)
  Downloading tld-0.13.1-py2.py3-none-any.whl.metadata (10 kB)
Collecting dateparser>=1.1.2 (from htmldate>=1.9.2->trafilatura)
  Downloading dateparser-1.2.2-py3-none-any.whl.metadata (29 kB)
Collecting lxml_html_clean (from lxml[html_clean]>=4.4.2->justext>=3.0.1->trafilatura)
  Downloading lxml_html_clean-0.4.3-py3-none-any.whl.metadata (2.3 kB)
Downloading trafilatura-2.0.0-py3-none-any.whl (132 kB)


**Web Crawler**

In this step, I implemented a focused crawler that begins at the main MSADS program page and explores internal links up to four levels deep (MAX_DEPTH = 4), capped at 300 pages (MAX_PAGES).

The crawler follows a breadth-first search (BFS) pattern:

1. Starts from the seed URL

2. Collects internal links related to MSADS content and resolves them to absolute URLs

3. Skips non-HTML or irrelevant files (e.g., PDFs, images)

4. Saves discovered pages with metadata such as title, depth, and child links

This design keeps the scraper restricted to a single domain ([datascience.uchicago.edu](https://datascience.uchicago.edu)) and to its /education/ and /programs/ paths, so it captures sections such as admissions, capstones, and career outcomes without drifting into unrelated parts of the site.
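The traversal described above can be illustrated on a toy link graph. This is a minimal sketch rather than the crawler itself: `normalize_url`, `bfs_crawl_order`, and `toy_site` are hypothetical names, the example.edu URLs are made up, and a dictionary lookup stands in for fetching and parsing a page. It also shows one possible way to collapse `#fragment` and trailing `%20` variants of a URL before the `seen` check, since such variants appear in the crawl log below.

```python
from collections import deque
from urllib.parse import urldefrag


def normalize_url(url: str) -> str:
    """Drop the #fragment and trailing encoded spaces so that
    variants of the same page collapse to one key."""
    url, _fragment = urldefrag(url.strip())
    while url.endswith("%20"):
        url = url[:-3]
    return url


def bfs_crawl_order(graph: dict, seed: str, max_depth: int) -> list:
    """Breadth-first traversal over a link graph.
    A dict lookup stands in for fetching and parsing a page."""
    seen = set()
    queue = deque([(normalize_url(seed), 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        order.append((url, depth))
        for href in graph.get(url, []):
            child = normalize_url(href)
            if child not in seen:
                queue.append((child, depth + 1))
    return order


# Hypothetical toy site; note the "#main", "#top" and "%20" variants.
toy_site = {
    "https://example.edu/ms/": [
        "https://example.edu/ms/#main",
        "https://example.edu/ms/apply/",
        "https://example.edu/ms/online/%20",
    ],
    "https://example.edu/ms/apply/": ["https://example.edu/ms/deadlines/"],
    "https://example.edu/ms/online/": ["https://example.edu/ms/apply/#top"],
    "https://example.edu/ms/deadlines/": ["https://example.edu/ms/archive/"],
}

crawl_order = bfs_crawl_order(toy_site, "https://example.edu/ms/", max_depth=2)
for url, depth in crawl_order:
    print(depth, url)
```

Each page is visited once, in order of increasing depth, and the page at depth 3 is never fetched. Whether stripping fragments is appropriate depends on the site; for pages that render different content per fragment it would not be.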

In [None]:
import requests, json, time
from urllib.parse import urlparse, urljoin
from collections import deque
from random import uniform

import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

# Crawler
SEED_URLS = [
    "https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/"
]

MAX_PAGES = 300      # hard cap so it doesn’t run forever
MAX_DEPTH = 4        # clicks away from seed
DOMAIN = urlparse(SEED_URLS[0]).netloc

# # Keywords to decide if a path is relevant to MSADS
# PATH_KEYWORDS = [
#     "ms-in-applied-data-science",
#     "masters-programs",
#     "tuition-fees-aid",
#     "explore-the-ms-ads-campus",
#     "education",
# ]

# File extensions to ignore
SKIP_EXTS = [".pdf", ".jpg", ".jpeg", ".png", ".gif", ".mp4", ".zip", ".docx", ".pptx"]


# def is_relevant_link(link: str) -> bool:
#     """Keep only links on the same domain and with MSADS-related paths."""
#     parsed = urlparse(link)
#     if parsed.netloc != DOMAIN:
#         return False

#     path = parsed.path.lower()
#     if any(path.endswith(ext) for ext in SKIP_EXTS):
#         return False

#     # keep only URLs that look related to education / MSADS
#     if any(kw in path for kw in PATH_KEYWORDS):
#         return True

#     return False

def is_relevant_link(link: str) -> bool:
    """
    Keep ALL pages related to MS in Applied Data Science program.
    Don't try to predict what questions users will ask!
    """
    parsed = urlparse(link)

    if parsed.netloc != DOMAIN:
        return False

    path = parsed.path.lower()

    # Skip file downloads
    if any(path.endswith(ext) for ext in SKIP_EXTS):
        return False

    # BROAD APPROACH: Keep everything under the program and education sections
    relevant_paths = [
        "/education/masters-programs/ms-in-applied-data-science/",  # Main program
        "/education/masters-programs/",                             # Masters programs
        "/education/",                                              # General education
    ]

    # Keep if path contains any relevant pattern
    if any(relevant_path in path for relevant_path in relevant_paths):
        return True

    # Also keep Data Science Institute pages that might be relevant
    if "/education/" in path or "/programs/" in path:
        return True

    return False


session = requests.Session()
session.headers.update({
    "User-Agent": "MSADS-RAG-Crawler/1.0 (educational project)"
})

seen = set()
queue = deque((u, 0) for u in SEED_URLS)
pages = []

print("Starting crawl...")

while queue and len(pages) < MAX_PAGES:
    url, depth = queue.popleft()
    if url in seen:
        continue
    if depth > MAX_DEPTH:
        continue
    seen.add(url)

    try:
        resp = session.get(url, timeout=20)
        print(f"[HTTP {resp.status_code}] depth={depth} {url}")
        resp.raise_for_status()
    except Exception as e:
        print(f"[SKIP] {url} ({e})")
        continue

    # Only process HTML pages
    ctype = resp.headers.get("Content-Type", "")
    if "html" not in ctype:
        print(f"[SKIP] Non-HTML content at {url} ({ctype})")
        continue

    html = resp.text
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else ""

    new_links = set()

    # Discover new links
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        full = urljoin(url, href)
        if is_relevant_link(full) and full not in seen:
            new_links.add(full)
            queue.append((full, depth + 1))

    pages.append({
        "depth": depth,
        "title": title,
        "url_found": url,
        "url_final": resp.url,
        "num_child_links": len(new_links),
        "child_links": list(new_links),
    })

    time.sleep(uniform(0.5, 1.5))

# Build dataframe of crawled pages
crawl_df = pd.DataFrame(pages).drop_duplicates(subset="url_final").reset_index(drop=True)
print("\n Crawler finished.")
print("Pages discovered:", crawl_df.shape[0])

crawl_df.head()


Starting crawl...
[HTTP 200] depth=0 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/#main
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/undergrad-major/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/phd-in-data-science/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/data-science-clinic/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/summer-research-programs/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/in-person-program/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/online-program/%20
[HTTP 200] depth=1 https://datascienc

|   | depth | title | url_found | url_final | num_child_links | child_links |
|---|---|---|---|---|---|---|
| 0 | 0 | Master's in Applied Data Science - DSI | https://datascience.uchicago.edu/education/mas... | https://datascience.uchicago.edu/education/mas... | 13 | [https://datascience.uchicago.edu/education/, ... |
| 1 | 1 | Master's in Applied Data Science - DSI | https://datascience.uchicago.edu/education/mas... | https://datascience.uchicago.edu/education/mas... | 12 | [https://datascience.uchicago.edu/education/, ... |
| 2 | 1 | Education - DSI | https://datascience.uchicago.edu/education/ | https://datascience.uchicago.edu/education/ | 7 | [https://datascience.uchicago.edu/education/ph... |
| 3 | 1 | Undergraduate Data Science Major - DSI | https://datascience.uchicago.edu/education/und... | https://datascience.uchicago.edu/education/und... | 5 | [https://datascience.uchicago.edu/education/un... |
| 4 | 1 | Master's Programs - DSI | https://datascience.uchicago.edu/education/mas... | https://datascience.uchicago.edu/education/mas... | 4 | [https://datascience.uchicago.edu/education/ph... |


After crawling, the collected data contains duplicate links, nested in per-page lists across multiple depth levels.

This step flattens all child-link lists into a single collection, removes duplicates while preserving first-seen order, and stores the unique URLs in a pandas DataFrame.


In [None]:
# Flatten child_links + include the main url_final itself
all_links = []

# child_links column may contain lists or NaN
for links in crawl_df["child_links"]:
    if isinstance(links, list):
        all_links.extend(links)

all_links.extend(crawl_df["url_final"].tolist())

# Deduplicate while preserving order
unique_links = list(dict.fromkeys(all_links))

links_df = pd.DataFrame({
    "id": range(1, len(unique_links) + 1),
    "url": unique_links
})

print("Total unique URLs:", len(unique_links))
links_df.head()


Total unique URLs: 153


|   | id | url |
|---|---|---|
| 0 | 1 | https://datascience.uchicago.edu/education/ |
| 1 | 2 | https://datascience.uchicago.edu/education/phd... |
| 2 | 3 | https://datascience.uchicago.edu/education/dat... |
| 3 | 4 | https://datascience.uchicago.edu/education/mas... |
| 4 | 5 | https://datascience.uchicago.edu/education/mas... |


Here, I used trafilatura to extract the main readable content from each discovered page.

Unlike basic HTML parsing, trafilatura automatically removes navigation bars, menus, and sidebars—preserving only the central, human-readable article text.

In [None]:
import trafilatura

urls = links_df["url"].tolist()
records = []

print("Extracting main content with trafilatura...")

for url in tqdm(urls, desc="Extracting"):
    try:
        downloaded = trafilatura.fetch_url(url)
        if not downloaded:
            print(f"[!] Failed to fetch: {url}")
            continue

        text = trafilatura.extract(
            downloaded,
            include_comments=False,
            include_tables=False
        )

        if not text:
            print(f"[!] No main text extracted: {url}")
            continue

        meta = trafilatura.extract_metadata(downloaded)
        title = meta.title if meta and meta.title else ""

        records.append({
            "url": url,
            "page_title": title,
            "content": text,
            "word_count": len(text.split())
        })
    except Exception as e:
        print(f"[!] Error processing {url}: {e}")
        continue

content_df = pd.DataFrame(records)
print(f"\nExtracted {len(content_df)} pages successfully.")

content_df.head()


Extracting main content with trafilatura...


Extracted 153 pages successfully.


|   | url | page_title | content | word_count |
|---|---|---|---|---|
| 0 | https://datascience.uchicago.edu/education/ | Education - DSI | Building the foundations of data science, cons... | 223 |
| 1 | https://datascience.uchicago.edu/education/phd... | PhD in Data Science - DSI | PhD in Data Science\nStudents conduct research... | 178 |
| 2 | https://datascience.uchicago.edu/education/dat... | Data Science Clinic - DSI | Data Science Clinic\nApplied experiential lear... | 815 |
| 3 | https://datascience.uchicago.edu/education/mas... | In-Person Program - DSI | In-Person Program\nTailor Your Data Science Jo... | 4484 |
| 4 | https://datascience.uchicago.edu/education/mas... | Online Program - DSI | Online Program\nAcademic Rigor Meets Work/Life... | 4482 |


**Smart Chunking and Metadata Enrichment**

In this section, I split the long extracted texts into overlapping chunks sized for embedding and retrieval.

The logic includes:

1. Word-based chunking: segments of up to 500 words with a 100-word overlap for context continuity (word counts are only a proxy; the code does not count model tokens)

2. Short-text handling: texts of 80 words or fewer are kept as a single chunk rather than split or discarded

3. Noise cleaning: removes boilerplate phrases such as “Cookie Policy” or “All rights reserved” and collapses whitespace

4. Metadata tagging: each chunk is labeled with its page URL, title, inferred page type (e.g., “capstone”, “faq”, “career_outcomes”), and chunk index
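To make the windowing arithmetic concrete, here is a small sketch that works on word offsets only. The defaults mirror MAX_WORDS and OVERLAP_WORDS from the notebook; `window_spans` and `expected_chunk_count` are hypothetical helper names introduced for illustration. With a window of 500 words and an overlap of 100, each new chunk starts 400 words after the previous one, so a text of n > 500 words should produce 1 + ceil((n - 500) / 400) chunks.

```python
import math


def window_spans(n_words: int, max_words: int = 500, overlap: int = 100) -> list:
    """Return (start, end) word offsets for each overlapping window."""
    if n_words == 0:
        return []
    spans = []
    start = 0
    while start < n_words:
        end = min(start + max_words, n_words)
        spans.append((start, end))
        if end == n_words:
            break
        # Step back by `overlap` words so neighbouring chunks share context
        start = end - overlap
    return spans


def expected_chunk_count(n_words: int, max_words: int = 500, overlap: int = 100) -> int:
    """Closed-form chunk count for the same windowing scheme."""
    if n_words == 0:
        return 0
    if n_words <= max_words:
        return 1
    stride = max_words - overlap
    return 1 + math.ceil((n_words - max_words) / stride)


# Compare the loop against the closed form for a few text lengths
for n in (80, 500, 501, 900, 4484):
    spans = window_spans(n)
    print(n, len(spans), expected_chunk_count(n), spans[-1])
```

For example, a 4,484-word page yields 11 chunks under these settings. The sketch assumes the overlap is smaller than the window; if it were not, the start offset would never advance.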

In [None]:
import re, uuid

OUTPUT_JSONL = "/content/msads_chunks_trafilatura.jsonl"

MAX_WORDS = 500          # large chunks
OVERLAP_WORDS = 100      # overlapping window
MIN_WORDS_SECTION = 80   # don't split very short text

NOISY_PHRASES = [
    "Cookie Policy",
    "Privacy Notice",
    "All rights reserved",
]


def infer_page_type(url: str) -> str:
    u = url.lower()
    if "how-to-apply" in u:
        return "how_to_apply"
    if "faqs" in u:
        return "faq"
    if "capstone-projects" in u:
        return "capstone"
    if "course-progressions" in u:
        return "course_progressions"
    if "events-deadlines" in u:
        return "events_deadlines"
    if "tuition-fees-aid" in u:
        return "tuition_fees_aid"
    if "instructors-staff" in u:
        return "instructors_staff"
    if "career-outcomes" in u:
        return "career_outcomes"
    if "in-person-program" in u:
        return "in_person_program"
    if "online-program" in u:
        return "online_program"
    if "explore-the-ms-ads-campus" in u:
        return "explore_campus"
    if "ms-in-applied-data-science" in u:
        return "msads_main"
    if "/education/" in u:
        return "education_general"
    if "/about/" in u:
        return "about"
    if "/research/" in u:
        return "research"
    return "general"


def remove_noise(text: str) -> str:
    for phrase in NOISY_PHRASES:
        text = text.replace(phrase, " ")
    text = re.sub(r"\s+", " ", text).strip()
    return text


def split_chunks(text: str,
                 max_words: int = MAX_WORDS,
                 overlap: int = OVERLAP_WORDS,
                 min_words: int = MIN_WORDS_SECTION):
    words = text.split()
    n = len(words)
    if n == 0:
        return []
    if n <= min_words:
        return [" ".join(words)]

    chunks = []
    start = 0
    while start < n:
        end = min(start + max_words, n)
        chunk_words = words[start:end]
        chunks.append(" ".join(chunk_words))
        if end == n:
            break
        start = max(0, end - overlap)
    return chunks


chunk_records = []

for _, row in content_df.iterrows():
    url = row["url"]
    page_title = row.get("page_title", "") or ""
    full_text = remove_noise(str(row["content"] or ""))

    page_type = infer_page_type(url)
    chunks = split_chunks(full_text)

    for idx, chunk in enumerate(chunks):
        chunk_records.append({
            "id": str(uuid.uuid4()),
            "url": url,
            "page_title": page_title,
            "page_type": page_type,
            "chunk_index": idx,
            "text": chunk
        })

print("Total chunks:", len(chunk_records))

# Save to JSONL
with open(OUTPUT_JSONL, "w", encoding="utf-8") as f:
    for r in chunk_records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

print(f" Saved chunks to {OUTPUT_JSONL}")


Total chunks: 1166
 Saved chunks to /content/msads_chunks_trafilatura.jsonl


In [None]:
import pandas as pd

df_chunks = pd.read_json("/content/msads_chunks_trafilatura.jsonl", lines=True)
print("Chunk rows:", df_chunks.shape[0])
df_chunks.head()


Chunk rows: 1166


|   | id | url | page_title | page_type | chunk_index | text |
|---|---|---|---|---|---|---|
| 0 | 78d93e81-2024-4132-aa65-1804db69524e | https://datascience.uchicago.edu/education/ | Education - DSI | education_general | 0 | Building the foundations of data science, cons... |
| 1 | c097a882-a3c1-406f-b26b-e9c32fb7801b | https://datascience.uchicago.edu/education/phd... | PhD in Data Science - DSI | education_general | 0 | PhD in Data Science Students conduct research ... |
| 2 | f9d59bc2-446d-48e8-9b60-c42a773658a5 | https://datascience.uchicago.edu/education/dat... | Data Science Clinic - DSI | education_general | 0 | Data Science Clinic Applied experiential learn... |
| 3 | f7632cf8-0d78-4236-8c6e-4d41f6aca282 | https://datascience.uchicago.edu/education/dat... | Data Science Clinic - DSI | education_general | 1 | Data Science Clinic, Data Science Institute Dr... |
| 4 | d28164cc-3183-4a52-9477-b0a370e4be31 | https://datascience.uchicago.edu/education/mas... | In-Person Program - DSI | in_person_program | 0 | In-Person Program Tailor Your Data Science Jou... |


In [None]:
def check_coverage(df, phrases):
    for p in phrases:
        # regex=False: treat each phrase as a literal string, not a pattern
        hits = df["text"].str.contains(p, case=False, na=False, regex=False).sum()
        print(f"'{p}' -> {hits} chunks")

check_coverage(
    df_chunks,
    [
        "capstone",
        "career outcomes",
        "tuition",
        "application deadline",
        "online program",
        "in-person program",
        "MS in Applied Data Science",
        "visa",
        "graduation"
    ]
)


'capstone' -> 338 chunks
'career outcomes' -> 2 chunks
'tuition' -> 160 chunks
'application deadline' -> 48 chunks
'online program' -> 256 chunks
'in-person program' -> 177 chunks
'MS in Applied Data Science' -> 240 chunks
'visa' -> 101 chunks
'graduation' -> 82 chunks


# ***Implementing Retrieval-Augmented Generation (RAG)***

In [None]:
!pip install -q sentence-transformers chromadb pandas tqdm

In [None]:
import pandas as pd
import numpy as np
import json
from typing import List, Dict
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# STEP 1: Load the Chunked Data from Part 1
print("STEP 1: Loading Chunked Data from Web Scraping")

# Load the JSONL file created in Part 1 (web scraping)
CHUNKS_FILE = "/content/msads_chunks_trafilatura.jsonl"

df_chunks = pd.read_json(CHUNKS_FILE, lines=True)

print(f"Loaded {len(df_chunks)} chunks")
print(f"Columns: {df_chunks.columns.tolist()}")
print(f"\nFirst few rows:")
print(df_chunks.head(3)[['page_title', 'page_type', 'text']])

# Data statistics
print(f"\n Statistics:")
print(f"Total chunks: {len(df_chunks)}")
print(f"Unique pages: {df_chunks['url'].nunique()}")
print(f"Page types: {df_chunks['page_type'].nunique()}")
print(f"Page type distribution:")
for page_type, count in df_chunks['page_type'].value_counts().head(10).items():
    print(f"    - {page_type}: {count}")

STEP 1: Loading Chunked Data from Web Scraping
Loaded 1166 chunks
Columns: ['id', 'url', 'page_title', 'page_type', 'chunk_index', 'text']

First few rows:
                  page_title          page_type  \
0            Education - DSI  education_general   
1  PhD in Data Science - DSI  education_general   
2  Data Science Clinic - DSI  education_general   

                                                text  
0  Building the foundations of data science, cons...  
1  PhD in Data Science Students conduct research ...  
2  Data Science Clinic Applied experiential learn...  

 Statistics:
Total chunks: 1166
Unique pages: 153
Page types: 12
Page type distribution:
    - in_person_program: 418
    - online_program: 363
    - faq: 140
    - education_general: 124
    - course_progressions: 52
    - instructors_staff: 36
    - msads_main: 14
    - how_to_apply: 6
    - events_deadlines: 4
    - capstone: 4


In [None]:
# STEP 1.5: DEDUPLICATE CHUNKS

print("STEP 1.5: Deduplicating Chunks")

# Remove URL anchors to deduplicate
df_chunks['url_clean'] = df_chunks['url'].str.split('#').str[0]

# Keep only unique URL+chunk combinations
df_chunks_deduped = df_chunks.drop_duplicates(subset=['url_clean', 'chunk_index'])

print(f"Before deduplication: {len(df_chunks)} chunks")
print(f"After deduplication: {len(df_chunks_deduped)} chunks")
print(f"Removed: {len(df_chunks) - len(df_chunks_deduped)} duplicates")

# Save deduplicated version
df_chunks_deduped.to_json("/content/msads_chunks_trafilatura_deduped.jsonl",
                           orient='records',
                           lines=True)

print(f"Saved deduplicated chunks to /content/msads_chunks_trafilatura_deduped.jsonl")

# Use deduplicated data for the rest of the pipeline
df_chunks = df_chunks_deduped.copy()

STEP 1.5: Deduplicating Chunks
Before deduplication: 1166 chunks
After deduplication: 167 chunks
Removed: 999 duplicates
Saved deduplicated chunks to /content/msads_chunks_trafilatura_deduped.jsonl
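The step above keys on the cleaned URL and chunk index, so identical text that is reachable under two different URLs (for example a path with a trailing `%20`) may survive it. One possible complement, sketched here under that assumption, is to deduplicate on a hash of the normalized chunk text instead. `dedupe_by_text` and the sample records are hypothetical and not part of the notebook.

```python
import hashlib


def dedupe_by_text(records: list) -> list:
    """Keep the first record for each distinct chunk text.
    Text is compared after lowercasing and collapsing whitespace."""
    seen_hashes = set()
    unique = []
    for rec in records:
        normalized = " ".join(rec["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique.append(rec)
    return unique


# Hypothetical records: the first two differ only in URL and spacing
sample_records = [
    {"url": "https://example.edu/online/", "text": "Online  Program overview"},
    {"url": "https://example.edu/online/%20", "text": "online program overview"},
    {"url": "https://example.edu/tuition/", "text": "Tuition details"},
]

deduped = dedupe_by_text(sample_records)
print(len(sample_records), "->", len(deduped))
```

Exact-hash matching only removes chunks whose text is identical after normalization; chunks that overlap partially are left alone, which is the intended behavior for overlapping windows.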


In [None]:
# STEP 2: Initialize Embedding Model
print("STEP 2: Initialize Embedding Model")

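# Load the embedding model
# Using 'all-MiniLM-L6-v2': compact and effective for semantic search.
# Note: by default this model truncates inputs longer than 256 word pieces,
# so only the first part of a 500-word chunk contributes to its embedding.
MODEL_NAME = "all-MiniLM-L6-v2"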

print(f"Loading embedding model: {MODEL_NAME}...")
embedding_model = SentenceTransformer(MODEL_NAME)

print(f"Model loaded successfully")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

STEP 2: Initialize Embedding Model
Loading embedding model: all-MiniLM-L6-v2...

Model loaded successfully
Embedding dimension: 384


In [None]:
# STEP 2.1: Generate Embeddings for All Chunks

print("STEP 2.1: Generating Embeddings")

# Extract all text chunks
texts = df_chunks['text'].tolist()

# Generate embeddings in batches for efficiency
BATCH_SIZE = 32

def generate_embeddings(texts: List[str], model, batch_size: int = 32):
    all_embeddings = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
        batch = texts[i:i + batch_size]
        embeddings = model.encode(batch,
                                   show_progress_bar=False,
                                   convert_to_numpy=True,
                                   normalize_embeddings=True)
        all_embeddings.append(embeddings)

    return np.vstack(all_embeddings)

# Generate embeddings
embeddings = generate_embeddings(texts, embedding_model, BATCH_SIZE)

print(f"Generated embeddings for {embeddings.shape[0]} chunks")
print(f"Embedding shape: {embeddings.shape}")

# Add embeddings to dataframe
df_chunks['embedding'] = list(embeddings)

STEP 2.1: Generating Embeddings


Generating embeddings: 100%|██████████| 6/6 [00:01<00:00,  5.50it/s]

Generated embeddings for 167 chunks
Embedding shape: (167, 384)





In [None]:
# STEP 3: Setup Vector Database (ChromaDB)

print("STEP 3: Setting up Vector Database (ChromaDB)")

# Initialize ChromaDB client with persistence
PERSIST_DIR = "./msads_chroma_db"
COLLECTION_NAME = "msads_knowledge_base"

chroma_client = chromadb.Client(Settings(
    anonymized_telemetry=False,
    is_persistent=True,
    persist_directory=PERSIST_DIR
))

# Delete existing collection if it exists (for fresh start)
try:
    chroma_client.delete_collection(name=COLLECTION_NAME)
    print("Deleted existing collection")
except Exception:
    # The collection did not exist yet, so there is nothing to delete
    pass

# Create new collection
collection = chroma_client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "MS in Applied Data Science program knowledge base"}
)

print(f"Created collection: {COLLECTION_NAME}")

STEP 3: Setting up Vector Database (ChromaDB)
Created collection: msads_knowledge_base


In [None]:
# STEP 3.1: Store Embeddings in ChromaDB

print("STEP 3.1: Storing Embeddings in Vector Database")

# Add documents to ChromaDB in batches
STORE_BATCH_SIZE = 100

for i in tqdm(range(0, len(df_chunks), STORE_BATCH_SIZE), desc="Storing in ChromaDB"):
    batch_df = df_chunks.iloc[i:i + STORE_BATCH_SIZE]

    # Prepare data
    ids = batch_df['id'].tolist()
    documents = batch_df['text'].tolist()
    embeddings_batch = batch_df['embedding'].tolist()

    # Prepare metadata
    metadatas = []
    for _, row in batch_df.iterrows():
        metadatas.append({
            "url": row['url'],
            "page_title": row['page_title'],
            "page_type": row['page_type'],
            "chunk_index": int(row['chunk_index'])
        })

    # Add to collection
    collection.add(
        ids=ids,
        documents=documents,
        embeddings=embeddings_batch,
        metadatas=metadatas
    )

print(f"Stored {collection.count()} documents in vector database")

STEP 3.1: Storing Embeddings in Vector Database


Storing in ChromaDB: 100%|██████████| 2/2 [00:00<00:00,  5.86it/s]

Stored 167 documents in vector database





In [None]:
# STEP 4: Implement RAG Retrieval Function

print("STEP 4: Implementing RAG Retrieval System")

def retrieve_context(query: str, top_k: int = 5):
    # Generate query embedding
    query_embedding = embedding_model.encode(query,
                                              convert_to_numpy=True,
                                              normalize_embeddings=True)

    # Query the vector database
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    return results

def format_rag_response(query: str, top_k: int = 5):
    # Retrieve relevant chunks
    results = retrieve_context(query, top_k)

    # Format context
    context_parts = []
    sources = []

    for i, (doc, meta, dist) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
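        # Convert distance to cosine similarity (higher is better).
        # The collection uses Chroma's default metric, squared L2. For
        # unit-normalized embeddings, squared L2 = 2 - 2*cos, so cos = 1 - dist / 2.
        similarity = 1 - dist / 2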

        context_parts.append(f"[Source {i+1}]: {doc}")
        sources.append({
            "source_number": i + 1,
            "page_title": meta['page_title'],
            "url": meta['url'],
            "page_type": meta['page_type'],
            "similarity_score": round(similarity, 3),
            "text_preview": doc[:200] + "..."
        })

    context = "\n\n".join(context_parts)

    return {
        "query": query,
        "context": context,
        "sources": sources,
        "num_sources": len(sources)
    }

print("RAG retrieval functions implemented")

STEP 4: Implementing RAG Retrieval System
RAG retrieval functions implemented
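When a returned distance is turned into a similarity score, the right conversion depends on the collection's distance metric. To my knowledge, ChromaDB collections use squared L2 distance unless another metric is configured. For unit-length vectors a and b, the squared L2 distance equals 2 - 2(a · b), so the cosine similarity can be recovered as 1 - distance / 2. The sketch below checks that identity in plain Python; the two vectors are arbitrary examples, not real embeddings.

```python
import math


def normalize(v: list) -> list:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: list, b: list) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: list, b: list) -> float:
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

distance = squared_l2(a, b)
cosine_direct = dot(a, b)
cosine_from_distance = 1 - distance / 2

print(f"squared L2 distance : {distance:.4f}")
print(f"cosine (dot product): {cosine_direct:.4f}")
print(f"1 - distance / 2    : {cosine_from_distance:.4f}")
print(f"1 - distance        : {1 - distance:.4f}")
```

Under a squared L2 metric, `1 - distance` still ranks results in the same order but understates the cosine similarity; under a cosine metric, `1 - distance` would be the correct conversion.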


In [None]:
# STEP 5: Test the RAG System
print("STEP 5: Testing the RAG System")

def test_rag_query(query: str, top_k: int = 3):
    print(f"QUERY: {query}")

    response = format_rag_response(query, top_k)

    print(f"Retrieved {response['num_sources']} relevant sources:\n")

    for source in response['sources']:
        print(f"[{source['source_number']}] {source['page_title']}")
        print(f"Similarity: {source['similarity_score']:.3f}")
        print(f"URL: {source['url']}")
        print(f"Type: {source['page_type']}")
        print(f"Preview: {source['text_preview']}")
        print()

    print(f"\nFULL CONTEXT FOR LLM:")
    print(response['context'][:500] + "...")

    return response

# Test with sample queries
test_queries = [
    "What are the core courses in the MS in Applied Data Science program?",
    "What are the admission requirements for the MS in Applied Data Science program?",
    "Can you provide information about the capstone project?"
]

print("\nTesting with sample queries:\n")

for query in test_queries[:2]:
    test_rag_query(query, top_k=3)

STEP 5: Testing the RAG System

Testing with sample queries:

QUERY: What are the core courses in the MS in Applied Data Science program?
Retrieved 3 relevant sources:

[1] Master's Programs - DSI
Similarity: 0.432
URL: https://datascience.uchicago.edu/education/masters-programs/
Type: education_general
Preview: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciti...

[2] Online Program - DSI
Similarity: 0.373
URL: https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/online-program/%20
Type: online_program
Preview: Science by successfully completing 12 courses (6 core, 4 elective, 2 Capstone) and our tailored Career Seminar*. Our rigorous curriculum is designed by and for data science innovators and leaders. Cou...

[3] Online Program - DSI
Similarity: 0.373
URL: https://datascience.uchicago.edu/education/m

In [None]:
# STEP 6: Create Simple Q&A Function (Without LLM)

print("STEP 6: Simple Q&A Function (Context Retrieval Only)")

def answer_question(question: str, top_k: int = 5, verbose: bool = True):

    response = format_rag_response(question, top_k)

    if verbose:
        print(f"\nQuestion: {question}\n")
        print(f"Answer based on {response['num_sources']} sources:\n")
        print(response['context'])
        print(f"\n\nSources:")
        for src in response['sources']:
            print(f"{src['page_title']} ({src['similarity_score']:.2f} relevance)")
            print(f"{src['url']}")

    return response

print("Q&A function ready")

# Example usage
print("Example: Answering a Question")

answer_question("What are the core courses in the MS in Applied Data Science program?", top_k=3)

STEP 6: Simple Q&A Function (Context Retrieval Only)
Q&A function ready
Example: Answering a Question

Question: What are the core courses in the MS in Applied Data Science program?

Answer based on 3 sources:

[Source 1]: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciting field of data science. Rigorous classes, expert instructors, leading-edge technology, and an unparalleled network support your student experience as a full- or part-time learner. MS in Computational Analysis and Public Policy (MSCAPP) A rigorous, two-year program offered jointly by the Harris School of Public Policy and the UChicago Department of Computer Science. MSCAPP students work with external social impact organizations through our Data Science Clinic and Community Data Fellows program. MS in Data Science The Master’s in Data Science (MSDS) has been developed fo

{'query': 'What are the core courses in the MS in Applied Data Science program?',
 'context': '[Source 1]: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciting field of data science. Rigorous classes, expert instructors, leading-edge technology, and an unparalleled network support your student experience as a full- or part-time learner. MS in Computational Analysis and Public Policy (MSCAPP) A rigorous, two-year program offered jointly by the Harris School of Public Policy and the UChicago Department of Computer Science. MSCAPP students work with external social impact organizations through our Data Science Clinic and Community Data Fellows program. MS in Data Science The Master’s in Data Science (MSDS) has been developed for students interested in pursuing a research career in data science with courses taught by faculty from the departme

In [None]:
# STEP 7: Save the RAG System Configuration
print("STEP 7: Saving RAG System Configuration")

config = {
    "embedding_model": MODEL_NAME,
    "collection_name": COLLECTION_NAME,
    "persist_directory": PERSIST_DIR,
    "total_chunks": len(df_chunks),
    "embedding_dimension": embedding_model.get_sentence_embedding_dimension(),
    "unique_pages": df_chunks['url'].nunique(),
    "batch_size": BATCH_SIZE
}

# Save configuration
with open("/content/rag_config.json", "w") as f:
    json.dump(config, f, indent=2)

print("Configuration saved to /content/rag_config.json")
print("\nConfiguration:")
for key, value in config.items():
    print(f"  {key}: {value}")

STEP 7: Saving RAG System Configuration
Configuration saved to /content/rag_config.json

Configuration:
  embedding_model: all-MiniLM-L6-v2
  collection_name: msads_knowledge_base
  persist_directory: ./msads_chroma_db
  total_chunks: 167
  embedding_dimension: 384
  unique_pages: 34
  batch_size: 32


In [None]:
# STEP 8: Create Reusable RAG Class
print("STEP 8: Creating Reusable RAG System Class")

class MSADSRagSystem:

    def __init__(self):
        """Initialize the RAG system with existing data."""
        print("Initializing MSADS RAG System...")

        self.embedding_model = embedding_model
        self.collection = collection

        print(f"System ready with {self.collection.count()} documents")

    def search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Search for relevant information."""
        query_embedding = self.embedding_model.encode(
            query,
            convert_to_numpy=True,
            normalize_embeddings=True
        )

        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )

        formatted_results = []
        for doc, meta, dist in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            formatted_results.append({
                "text": doc,
                "page_title": meta['page_title'],
                "url": meta['url'],
                "page_type": meta['page_type'],
                # Assumes the collection uses cosine distance, so 1 - distance is cosine similarity
                "similarity_score": round(1 - dist, 3)
            })

        return formatted_results

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """
        Ask a question and get context with sources.

        Returns:
            Dictionary with question, context, and sources
        """
        results = self.search(question, top_k)

        context = "\n\n".join([
            f"[{r['page_title']}]: {r['text']}"
            for r in results
        ])

        return {
            "question": question,
            "context": context,
            "sources": results
        }

    def display_answer(self, question: str, top_k: int = 5):
        """Ask a question and display formatted answer."""
        print(f"\n{'='*80}")
        print(f"Q: {question}")
        print(f"{'='*80}\n")

        response = self.ask(question, top_k)

        print("RELEVANT CONTEXT:\n")
        print(response['context'][:1000])
        if len(response['context']) > 1000:
            print("...\n(truncated)")

        print(f"\n\nSOURCES ({len(response['sources'])}):")
        for i, src in enumerate(response['sources'], 1):
            print(f"\n{i}. {src['page_title']}")
            print(f"   Relevance: {src['similarity_score']:.2f}")
            print(f"   URL: {src['url']}")

        return response

# Initialize the system
rag = MSADSRagSystem()

print("\nRAG System Class created and initialized")

STEP 8: Creating Reusable RAG System Class
Initializing MSADS RAG System...
System ready with 167 documents

RAG System Class created and initialized


In [None]:
# STEP 9: Demo Usage
print("STEP 9: Demo - Using the RAG System")

# Demo queries
demo_questions = [
    "What are the core courses in the MS in Applied Data Science program?",
    "What are the admission requirements for the MS in Applied Data Science program?",
    "Can you provide information about the capstone project?",
]

print("\nDemo: Answering questions about MSADS program\n")

for question in demo_questions[:1]:  # Show 1 example
    rag.display_answer(question, top_k=3)

STEP 9: Demo - Using the RAG System

Demo: Answering questions about MSADS program


Q: What are the core courses in the MS in Applied Data Science program?

RELEVANT CONTEXT:

[Master's Programs - DSI]: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciting field of data science. Rigorous classes, expert instructors, leading-edge technology, and an unparalleled network support your student experience as a full- or part-time learner. MS in Computational Analysis and Public Policy (MSCAPP) A rigorous, two-year program offered jointly by the Harris School of Public Policy and the UChicago Department of Computer Science. MSCAPP students work with external social impact organizations through our Data Science Clinic and Community Data Fellows program. MS in Data Science The Master’s in Data Science (MSDS) has been developed for students interest

# ***Deploy RAG Chatbot***
- In this section, we first add a large language model to the RAG system. The prompt, which contains both the retrieved context and the original question, is sent to an OpenAI GPT model. The model is instructed to use only the provided context to formulate a natural, accurate answer.
- Second, to make the chatbot usable, we launch it as a public web application using Gradio. This provides a simple chat interface and a shareable URL for evaluation.
- **The link to the chatbot interface is:** [MS in Applied Data Science Chatbot](https://d9eb8155416df7dc7c.gradio.live)
- Finally, we include a script that runs the project's sample questions against the bot, allowing us to evaluate its accuracy and prepare the results for our presentation.
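
As a rough illustration of the first step, the sketch below shows one way retrieved chunks and the user's question might be packed into chat-style messages before they are sent to the model. The function name `build_messages`, the sample system prompt, and the sample chunks are hypothetical and are not the notebook's own classes, which are defined in the cells that follow.

```python
from typing import Dict, List


def build_messages(question: str, chunks: List[str], system_prompt: str) -> List[Dict[str, str]]:
    """Pack retrieved chunks and a question into chat-style messages (illustrative only)."""
    # Label each chunk so the retrieved passages stay distinguishable in the prompt.
    context = "\n\n".join(f"Source {i}: {chunk}" for i, chunk in enumerate(chunks, 1))
    user_prompt = f"---\nCONTEXT:\n{context}\n---\nQUESTION:\n{question}\n---"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


# Hypothetical inputs, for illustration only.
messages = build_messages(
    question="How many courses are required?",
    chunks=["Students complete 12 courses.", "The program includes a Capstone."],
    system_prompt="Answer only from the provided CONTEXT.",
)
print(messages[1]["content"])
```

Keeping the instructions in the system message and the context plus question in the user message makes it easy to change the retrieval depth without touching the instructions.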

In [None]:
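# The full `chromadb` package is needed to load the on-disk collection below;
# `chromadb-client` is the thin HTTP-only client.
!pip install -q chromadb openai gradio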


In [None]:
import openai
from google.colab import userdata
import textwrap

# --- Configure the OpenAI API Key ---
client = None  # stays None if the key cannot be loaded
try:
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY is None:
        raise ValueError("Key not found")

    client = openai.OpenAI(api_key=OPENAI_API_KEY)
    print("OpenAI API client configured successfully.")

except Exception as e:
    print(f"Error: Could not find or configure OPENAI_API_KEY ({e}).")
    print("Please create a Colab Secret named 'OPENAI_API_KEY'.")
    print("You can get a key from: https://platform.openai.com/api-keys")

# --- Alias the client as 'llm_model' (for consistency in naming) ---
llm_model = client
if llm_model is not None:
    print("Initialized OpenAI client.")

OpenAI API client configured successfully.
Initialized OpenAI client.


### Deploy the Chatbot on a user-friendly interface

In [None]:
# --- COMBINED CELL: Load System & Launch UI ---

print("--- Initializing System & Launching Chatbot ---")
print("This may take a moment...")

import pandas as pd
import numpy as np
import json
import warnings
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import openai
from google.colab import userdata
import textwrap
import gradio as gr
import time

warnings.filterwarnings('ignore')

# --- 1. Load Models & DB ---
try:
    print("Loading embedding model...")
    MODEL_NAME = "all-MiniLM-L6-v2"
    embedding_model = SentenceTransformer(MODEL_NAME)

    print("Loading Vector DB from Disk...")
    PERSIST_DIR = "./msads_chroma_db"
    COLLECTION_NAME = "msads_knowledge_base"
    chroma_client = chromadb.Client(Settings(
        anonymized_telemetry=False,
        is_persistent=True,
        persist_directory=PERSIST_DIR
    ))
    collection = chroma_client.get_collection(name=COLLECTION_NAME)
    print(f"Loaded collection '{COLLECTION_NAME}' with {collection.count()} documents.")

    print("Configuring OpenAI API Key...")
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY is None: raise ValueError("Key not found")
    llm_model = openai.OpenAI(api_key=OPENAI_API_KEY)
    print("OpenAI client configured.")

except Exception as e:
    print(f"CRITICAL ERROR during setup: {e}")
    print("This cell cannot continue. Did you run Cell 2 (Build Data) first?")
    print("Do you have the 'OPENAI_API_KEY' secret set?")


# --- 2. Define RAG and Chatbot Classes ---

class MSADSRagSystem:
    def __init__(self, model, collection):
        self.embedding_model = model
        self.collection = collection

    def search(self, query: str, top_k: int = 5):
        query_embedding = self.embedding_model.encode(
            query, convert_to_numpy=True, normalize_embeddings=True
        )
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
        formatted_results = []
        for doc, meta, dist in zip(
            results['documents'][0], results['metadatas'][0], results['distances'][0]
        ):
            formatted_results.append({
                "text": doc, "page_title": meta['page_title'], "url": meta['url'],
                "page_type": meta['page_type'], "similarity_score": round(1 - dist, 3)
            })
        return formatted_results

    def ask(self, question: str, top_k: int = 5):
        results = self.search(question, top_k)
        context = "\n\n".join([f"Source: {r['text']}" for r in results])
        return {"question": question, "context": context, "sources": results}

class GenerativeMSADSChatbot:
    def __init__(self, rag_system, llm_client: openai.OpenAI):
        self.rag = rag_system
        self.client = llm_client
        self.system_prompt = textwrap.dedent("""
            You are an expert assistant for the University of Chicago's
            Master of Science in Applied Data Science (MSADS) program.
            Your task is to answer the user's QUESTION based *only* on the
            provided CONTEXT.
            - Do not use any information outside of the CONTEXT.
            - Be concise and directly answer the question.
            - If the CONTEXT does not contain the answer, state:
              "I'm sorry, I don't have enough information from the website
               to answer that question."
            - Do not make up information or add conversational fluff.
        """)

    def _build_user_prompt(self, question: str, context: str) -> str:
        user_prompt_template = "---\nCONTEXT:\n{context}\n---\nQUESTION:\n{question}\n---"
        return textwrap.dedent(user_prompt_template).format(context=context, question=question)

    def answer(self, question: str, top_k: int = 5):
        rag_response = self.rag.ask(question, top_k=top_k)
        context = rag_response['context']
        sources = rag_response['sources']
        if not sources:
            # Use the same fallback wording as the system prompt
            return {
                "question": question,
                "answer": "I'm sorry, I don't have enough information from the website to answer that question.",
                "sources": []
            }

        user_prompt = self._build_user_prompt(question, context)
        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0.0,
                messages=[
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": user_prompt}
                ]
            )
            generated_answer = response.choices[0].message.content
        except Exception as e:
            generated_answer = f"Error: The generative model could not process this request. {e}"
        return {"question": question, "answer": generated_answer.strip(), "sources": sources}

# --- 3. Initialize Chatbot ---
# 'chatbot' is a module-level variable; it stays None if initialization fails
chatbot = None

try:
    rag = MSADSRagSystem(model=embedding_model, collection=collection)
    chatbot = GenerativeMSADSChatbot(rag_system=rag, llm_client=llm_model)
    print("Chatbot is initialized and ready.")
except Exception as e:
    print(f"Error initializing chatbot: {e}")


# --- 4. Define Gradio Functions ---

def format_sources_for_ui(sources):
    if not sources: return "No sources found."
    output = "Sources:\n"
    for i, src in enumerate(sources, 1):
        output += f"{i}. {src['page_title']} (Relevance: {src['similarity_score']:.2f})\n"
        output += f"   URL: {src['url']}\n\n"
    return output

def chat_interface_fn(message, history):
    # Guard against a failed setup before answering
    if chatbot is None:
        return "Error: Chatbot is not initialized. Please re-run the setup cell."

    response = chatbot.answer(message, top_k=5)
    answer = response['answer']
    sources_text = format_sources_for_ui(response['sources'])
    full_response = f"{answer}\n\n---\n{sources_text}"
    return full_response

# --- 5. Launch Gradio ---

if chatbot is not None:
    print("Launching Gradio Chat Interface...")
    gr.ChatInterface(
        fn=chat_interface_fn,
        title="MS in Applied Data Science Chatbot",
        description="Ask me questions about the UChicago MSADS program.",
        examples=[
            "What scholarships are available for the program?",
            "What are the minimum scores for the TOEFL?",
            "How many courses must you complete to graduate?",
        ]
    ).launch(share=True, debug=False)
else:
    print("CANNOT LAUNCH GRADIO: Chatbot failed to initialize. Check errors above.")

--- Initializing System & Launching Chatbot ---
This may take a moment...
Loading embedding model...
Loading Vector DB from Disk...
Loaded collection 'msads_knowledge_base' with 167 documents.
Configuring OpenAI API Key...
OpenAI client configured.
Chatbot is initialized and ready.
Launching Gradio Chat Interface...
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://d9eb8155416df7dc7c.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


### Evaluate chatbot results
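
The comparison in this section is made by reading each generated answer next to its ground truth. As a possible complement, the sketch below computes a token-overlap F1 score between two strings. This metric is an assumption added for illustration and is not part of the original evaluation; the helper names `tokenize` and `token_f1` are hypothetical. Lexical overlap is only a rough proxy, since a correct paraphrase can score low, so manual review still matters.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts only as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy strings, not taken from the evaluation set.
identical = token_f1("The program is STEM eligible", "the program is STEM eligible")
disjoint = token_f1("yes", "no")
partial = token_f1("twelve courses", "twelve courses and a seminar")
print(identical, disjoint, round(partial, 3))
```

In the notebook, a function like this could be applied row by row to the `Generated Answer` and `Ground Truth` columns of `df_eval`.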

In [None]:
import pandas as pd

# 1. Define your evaluation set from the project description
evaluation_set = [
    {
        "question": "What scholarships are available for the program?",
        "ground_truth": "The Data Science Institute Scholarship, MS in Applied Data Science Alumni Scholarship etc"
    },
    {
        "question": "What are the minimum scores for the TOEFL and IELTS English Language Requirement?",
        "ground_truth": "Minimum scores for the Master’s in Applied Data Science program: TOEFL, 102 (no subscore requirement); IELTS, 7 (no subscore requirement)."
    },
    {
        "question": "Is there an application fee waiver?",
        "ground_truth": "For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy."
    },
    {
        "question": "What are the deadlines for the in-person program?",
        "ground_truth": "Lists various deadlines (Priority, Scholarship, International, etc.)"
    },
    {
        "question": "How long will it take for me to receive a decision on my application?",
        "ground_truth": "In-Person application decisions are released approximately 1 to 2 months after each respected deadline. Online application decisions are released on a rolling basis"
    },
    {
        "question": "Can I set up an advising appointment with the enrollment management team?",
        "ground_truth": "Yes, meet your admissions counselor by scheduling an appointment https://apply-psd.uchicago.edu/portal/applied-data-science"
    },
    {
        "question": "Where can I mail my official transcripts?",
        "ground_truth": "The University of Chicago\nAttention: MS in Applied Data Science Admissions\n455 N Cityfront Plaza Dr., Suite 950\nChicago, Illinois 6011"
    },
    {
        "question": "Does the Master’s in Applied Data Science Online program provide visa sponsorship?",
        "ground_truth": "Only our In-Person, Full-Time program is Visa eligible"
    },
    {
        "question": "How do I apply to the MBA/MS program?",
        "ground_truth": "Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process... Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest"
    },
    {
        "question": "Is the MS in Applied Data Science program STEM/OPT eligible?",
        "ground_truth": "The MS in Applied Data Science program is STEM/OPT eligible"
    },
    {
        "question": "How many courses must you complete to earn UChicago’s Master’s in Applied Data Science?",
        "ground_truth": "To earn the MS-ADS degree students must successfully complete 12 courses (6 core, 4 elective, 2 Capstone) and our tailored Career Seminar"
    }
]

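# 2. Run each question through the chatbot
print("Running evaluation...")
evaluation_results = []

# 'chatbot' must exist and must not be None (it is None when setup failed)
if globals().get('chatbot') is not None: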
    for item in evaluation_set:
        question = item['question']
        ground_truth = item['ground_truth']

        # Get the bot's response
        response = chatbot.answer(question, top_k=5)
        generated_answer = response['answer']

        evaluation_results.append({
            "Question": question,
            "Ground Truth": ground_truth,
            "Generated Answer": generated_answer,
            "Sources": response['sources']
        })

    print("Evaluation complete.")

    # 3. Display results in a clean DataFrame
    df_eval = pd.DataFrame(evaluation_results)

    # Optional: Save to a file to copy into your presentation
    # df_eval.to_csv("evaluation_results.csv")

    # Display for review
    from IPython.display import display
    pd.set_option('display.max_colwidth', None)
    pd.set_option('display.width', 1000)

    print("--- Evaluation Results ---")
    display(df_eval[['Question', 'Ground Truth', 'Generated Answer']])

else:
    print("Cannot run evaluation: Chatbot not initialized.")

Running evaluation...
Evaluation complete.
--- Evaluation Results ---


Unnamed: 0,Question,Ground Truth,Generated Answer
0,What scholarships are available for the program?,"The Data Science Institute Scholarship, MS in Applied Data Science Alumni Scholarship etc",The MS in Applied Data Science program offers partial tuition scholarships to top applicants. These scholarships do not require a separate application but it is recommended that candidates submit their applications ahead of the early deadline to maximize their chances of securing a scholarship.
1,What are the minimum scores for the TOEFL and IELTS English Language Requirement?,"Minimum scores for the Master’s in Applied Data Science program: TOEFL, 102 (no subscore requirement); IELTS, 7 (no subscore requirement).","The minimum TOEFL iBT score required for admission is 104, and the minimum IELTS score required is 7."
2,Is there an application fee waiver?,"For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy.","For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy."
3,What are the deadlines for the in-person program?,"Lists various deadlines (Priority, Scholarship, International, etc.)","The deadlines for the in-person program are:\n- December 4, 2025 - Scholarship Priority Deadline for the 1-year (12-15 months; 12 courses)\n- Final Application Deadline for the 2-year Thesis Track (21 months; 18 courses)\n- January 26, 2026 – International Application Deadline\n- March 4, 2026 – Second Priority Application Deadline\n- May 6, 2026 – Third Priority Application Deadline\n- June 23, 2026 – Final Application Deadline"
4,How long will it take for me to receive a decision on my application?,In-Person application decisions are released approximately 1 to 2 months after each respected deadline. Online application decisions are released on a rolling basis,Admissions decisions for the Master's in Applied Data Science program are typically released 1-2 months after each application deadline.
5,Can I set up an advising appointment with the enrollment management team?,"Yes, meet your admissions counselor by scheduling an appointment https://apply-psd.uchicago.edu/portal/applied-data-science","Yes, you can schedule an advising appointment with Patrick Vonesh and/or Jose Alvarado from the enrollment management team for the Master of Science in Applied Data Science program."
6,Where can I mail my official transcripts?,"The University of Chicago\nAttention: MS in Applied Data Science Admissions\n455 N Cityfront Plaza Dr., Suite 950\nChicago, Illinois 6011","You can have your official transcripts sent to the following mailing address:\nThe University of Chicago\nAttention: MS in Applied Data Science Admissions\n455 N Cityfront Plaza Dr., Suite 2800\nChicago, Illinois 60611"
7,Does the Master’s in Applied Data Science Online program provide visa sponsorship?,"Only our In-Person, Full-Time program is Visa eligible","I'm sorry, I don't have enough information from the website to answer that question."
8,How do I apply to the MBA/MS program?,"Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process... Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest","I'm sorry, I don't have enough information from the website\nto answer that question."
9,Is the MS in Applied Data Science program STEM/OPT eligible?,The MS in Applied Data Science program is STEM/OPT eligible,"Yes, the full-time, in-person MS in Applied Data Science program is STEM/OPT eligible."
