# ***Understanding and Preparing Data***

In this section, I developed a comprehensive web-scraping and text-preparation pipeline designed to collect, clean, and structure content from the University of Chicago’s Master of Science in Applied Data Science (MSADS) program pages.

The workflow combines an automated crawler to dynamically discover all relevant subpages, main-text extraction using trafilatura for high-quality content retrieval, and intelligent chunking to prepare data for embedding and retrieval-augmented generation (RAG).

This approach is designed to capture only meaningful program-related information, clean it, and split it into context-preserving text segments for downstream analysis.
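As a rough illustration of what "context-preserving text segments" can mean, here is a minimal sketch of word-based chunking with overlap. The function name `chunk_words` and the `chunk_size` and `overlap` values are assumptions chosen for illustration; they are not necessarily the settings or the method used later in this pipeline.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into word windows; consecutive windows share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny demonstration with made-up tokens (illustrative only).
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears, at least partly, in the neighboring chunk, which tends to help retrieval when an answer spans the boundary.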

In [26]:
!pip install requests beautifulsoup4 trafilatura tqdm



**Web Crawler**

In this step, I implemented a focused crawler that begins at the main MSADS program page and explores internal links up to four levels deep (MAX_DEPTH = 4 in the code below).

The crawler follows a breadth-first search (BFS) pattern:

1. Starts from the seed URL

2. Collects and normalizes internal links related to MSADS content

3. Skips non-HTML or irrelevant files (e.g., PDFs, images)

4. Saves discovered pages with metadata such as title, depth, and child links

This design keeps the scraper restricted to a single domain ([datascience.uchicago.edu](https://datascience.uchicago.edu)) and captures the education-related sections, such as admissions, capstones, and career outcomes, without drifting into unrelated parts of the site.
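The steps above can be sketched on a small in-memory link graph. This is a minimal sketch, not the crawler used in this notebook: `normalize_url`, `bfs_order`, and the `example.edu` URLs are hypothetical names introduced for illustration. It shows one way to normalize links (dropping `#fragment` anchors and a trailing encoded space) so that variants of the same page are visited only once, combined with a depth-limited breadth-first traversal.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the fragment and stray whitespace so page variants collapse to one URL."""
    url = url.strip()
    # An encoded trailing space ("%20") survives .strip(); remove it explicitly.
    while url.endswith("%20"):
        url = url[:-3]
    parts = urlsplit(url)
    # Rebuild the URL without its fragment ("#main" points at the same document).
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, parts.query, ""))


def bfs_order(links: dict, seed: str, max_depth: int) -> list:
    """Return (url, depth) pairs in breadth-first order, visiting each URL once."""
    seed = normalize_url(seed)
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for child in links.get(url, []):
            child = normalize_url(child)
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return visited


# Hypothetical link graph standing in for fetched pages (not real crawl data).
toy_links = {
    "https://example.edu/program/": [
        "https://example.edu/program/#main",
        "https://example.edu/program/faqs/",
        "https://example.edu/program/online/%20",
    ],
    "https://example.edu/program/faqs/": ["https://example.edu/program/apply/"],
}

for url, depth in bfs_order(toy_links, "https://example.edu/program/", max_depth=1):
    print(depth, url)
```

Marking URLs as seen when they are enqueued, rather than when they are dequeued, keeps the same link from entering the queue several times when many pages point to it.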

In [27]:
import requests, json, time
from urllib.parse import urlparse, urljoin
from collections import deque
from random import uniform

import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

# Crawler
SEED_URLS = [
    "https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/"
]

MAX_PAGES = 300      # hard cap so it doesn’t run forever
MAX_DEPTH = 4        # clicks away from seed
DOMAIN = urlparse(SEED_URLS[0]).netloc

# File extensions to ignore
SKIP_EXTS = [".pdf", ".jpg", ".jpeg", ".png", ".gif", ".mp4", ".zip", ".docx", ".pptx"]


def is_relevant_link(link: str) -> bool:
    """
    Keep ALL pages related to MS in Applied Data Science program.
    Don't try to predict what questions users will ask!
    """
    parsed = urlparse(link)

    if parsed.netloc != DOMAIN:
        return False

    path = parsed.path.lower()

    # Skip file downloads
    if any(path.endswith(ext) for ext in SKIP_EXTS):
        return False

    # BROAD APPROACH: keep everything under the education and program sections.
    # "/education/" already covers the masters-programs and MSADS subpaths,
    # so a single substring check is equivalent to the longer path list.
    return "/education/" in path or "/programs/" in path


session = requests.Session()
session.headers.update({
    "User-Agent": "MSADS-RAG-Crawler/1.0 (educational project)"
})

seen = set()
queue = deque((u, 0) for u in SEED_URLS)
pages = []

print("Starting crawl...")

while queue and len(pages) < MAX_PAGES:
    url, depth = queue.popleft()
    if url in seen:
        continue
    if depth > MAX_DEPTH:
        continue
    seen.add(url)

    try:
        resp = session.get(url, timeout=20)
        print(f"[HTTP {resp.status_code}] depth={depth} {url}")
        resp.raise_for_status()
    except Exception as e:
        print(f"[SKIP] {url} ({e})")
        continue

    # Only process HTML pages
    ctype = resp.headers.get("Content-Type", "")
    if "html" not in ctype:
        print(f"[SKIP] Non-HTML content at {url} ({ctype})")
        continue

    html = resp.text
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else ""

    new_links = set()

    # Discover new links
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        full = urljoin(url, href)
        if is_relevant_link(full) and full not in seen:
            new_links.add(full)
            queue.append((full, depth + 1))

    pages.append({
        "depth": depth,
        "title": title,
        "url_found": url,
        "url_final": resp.url,
        "num_child_links": len(new_links),
        "child_links": list(new_links),
    })

    time.sleep(uniform(0.5, 1.5))

# Build dataframe of crawled pages
crawl_df = pd.DataFrame(pages).drop_duplicates(subset="url_final").reset_index(drop=True)
print("\n Crawler finished.")
print("Pages discovered:", crawl_df.shape[0])

crawl_df.head()


Starting crawl...
[HTTP 200] depth=0 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/#main
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/undergrad-major/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/phd-in-data-science/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/data-science-clinic/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/summer-research-programs/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/in-person-program/
[HTTP 200] depth=1 https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/online-program/%20
[HTTP 200] depth=1 https://datascienc

Unnamed: 0,depth,title,url_found,url_final,num_child_links,child_links
0,0,Master's in Applied Data Science - DSI,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/,13,"[https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/, https://datascience.uchicago.edu/education/undergrad-major/, https://datascience.uchicago.edu/education/, https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/#main, https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/in-person-program/, https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/course-progressions/, https://datascience.uchicago.edu/education/data-science-clinic/, https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/online-program/%20, https://datascience.uchicago.edu/education/masters-programs/in-person-program/, https://datascience.uchicago.edu/education/masters-programs/, https://datascience.uchicago.edu/education/summer-research-programs/, https://datascience.uchicago.edu/education/phd-in-data-science/, https://datascience.uchicago.edu/education/masters-programs/online-program/]"
1,1,Master's in Applied Data Science - DSI,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/#main,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/#main,12,"[https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/, https://datascience.uchicago.edu/education/undergrad-major/, https://datascience.uchicago.edu/education/, https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/in-person-program/, https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/course-progressions/, https://datascience.uchicago.edu/education/data-science-clinic/, https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/online-program/%20, https://datascience.uchicago.edu/education/masters-programs/in-person-program/, https://datascience.uchicago.edu/education/masters-programs/, https://datascience.uchicago.edu/education/summer-research-programs/, https://datascience.uchicago.edu/education/phd-in-data-science/, https://datascience.uchicago.edu/education/masters-programs/online-program/]"
2,1,Education - DSI,https://datascience.uchicago.edu/education/,https://datascience.uchicago.edu/education/,7,"[https://datascience.uchicago.edu/education/#main, https://datascience.uchicago.edu/education/undergrad-major/, https://datascience.uchicago.edu/education/summerlab/, https://datascience.uchicago.edu/education/data-science-clinic/, https://datascience.uchicago.edu/education/masters-programs/, https://datascience.uchicago.edu/education/summer-research-programs/, https://datascience.uchicago.edu/education/phd-in-data-science/]"
3,1,Undergraduate Data Science Major - DSI,https://datascience.uchicago.edu/education/undergrad-major/,https://datascience.uchicago.edu/education/undergrad-major/,5,"[https://datascience.uchicago.edu/education/undergrad-major/#main, https://datascience.uchicago.edu/education/data-science-clinic/, https://datascience.uchicago.edu/education/summer-research-programs/, https://datascience.uchicago.edu/education/masters-programs/, https://datascience.uchicago.edu/education/phd-in-data-science/]"
4,1,Master's Programs - DSI,https://datascience.uchicago.edu/education/masters-programs/,https://datascience.uchicago.edu/education/masters-programs/,4,"[https://datascience.uchicago.edu/education/summer-research-programs/, https://datascience.uchicago.edu/education/phd-in-data-science/, https://datascience.uchicago.edu/education/masters-programs/#main, https://datascience.uchicago.edu/education/data-science-clinic/]"


After crawling, the collected data contains duplicate and nested links across multiple levels.

The next step flattens all child link lists into a single collection, removes duplicates while preserving the original order, and stores the unique URLs in a pandas DataFrame.


In [28]:
# Flatten child_links + include the main url_final itself
all_links = []

# child_links column may contain lists or NaN
for links in crawl_df["child_links"]:
    if isinstance(links, list):
        all_links.extend(links)

all_links.extend(crawl_df["url_final"].tolist())

# Deduplicate while preserving order
unique_links = list(dict.fromkeys(all_links))

links_df = pd.DataFrame({
    "id": range(1, len(unique_links) + 1),
    "url": unique_links
})

print("Total unique URLs:", len(unique_links))
links_df.head()


Total unique URLs: 153


Unnamed: 0,id,url
0,1,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/
1,2,https://datascience.uchicago.edu/education/undergrad-major/
2,3,https://datascience.uchicago.edu/education/
3,4,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/#main
4,5,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/in-person-program/


Here, I used trafilatura to extract the main readable content from each discovered page.

Unlike basic HTML parsing, trafilatura automatically removes navigation bars, menus, and sidebars—preserving only the central, human-readable article text.

In [29]:
import trafilatura

urls = links_df["url"].tolist()
records = []

print("Extracting main content with trafilatura...")

for url in tqdm(urls, desc="Extracting"):
    try:
        downloaded = trafilatura.fetch_url(url)
        if not downloaded:
            print(f"[!] Failed to fetch: {url}")
            continue

        text = trafilatura.extract(
            downloaded,
            include_comments=False,
            include_tables=False
        )

        if not text:
            print(f"[!] No main text extracted: {url}")
            continue

        meta = trafilatura.extract_metadata(downloaded)
        title = meta.title if meta and meta.title else ""

        records.append({
            "url": url,
            "page_title": title,
            "content": text,
            "word_count": len(text.split())
        })
    except Exception as e:
        print(f"[!] Error processing {url}: {e}")
        continue

content_df = pd.DataFrame(records)
print(f"\nExtracted {len(content_df)} pages successfully.")

content_df.head()


Extracting main content with trafilatura...




Extracted 153 pages successfully.


Unnamed: 0,url,page_title,content,word_count
0,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,"FAQs\nMaster’s in Applied Data Science FAQs\nLearn more about what makes our program unique.\nDavid Uminsky, PhD – UChicago Data Science Institute, Executive Director\n-\nApplication Process\n-\nWhen will I receive my Master's in Applied Data Science admission decision?\nAdmissions decisions are typically released 1-2 months after each application deadline. Only completed applications are reviewed. Please refer to the How to Apply page for guidelines.\n-\nIf I finish my Master's in Applied Data Science application before the deadline, will I receive my decision early?\nNo, admissions decisions for the in-person program are typically released 1-2 months after each application deadline. Your application must be complete to be considered for review.\n-\nHow do I submit the materials that will accompany my Master's in Applied Data Science application?\nPlease review the How to Apply page.\n-\nDoes the admissions office allow recommenders to email their letter directly as an attachment to be included in an applicant’s file?\nUnfortunately, no. Recommenders must upload their letter of support by using the URL that is sent to them electronically by our online application system.\n-\nDo I need to provide my recommenders with instructions?\nNo. Recommendation forms and instructions are sent electronically to recommenders once their names are entered within the online application.\n-\nMy recommender did not receive notification, can I resend it?\nYes. 
If a recommender does not receive a URL, the applicant can resend the link through the online application or ask the recommender to check their spam folder.\n-\nWhat materials do I need to submit to accompany my application for admission to the Masters in Applied Data Science program?\nPlease review the How to Apply page.\n-\nOnce I upload my unofficial transcripts to my application, do I still need to provide an official transcript?\nYou must upload one unofficial transcript from each university you attended within your application. An unofficial undergraduate transcript is required, even if you hold advanced degrees. Do not mail transcripts with your application; only uploads are needed for evaluation. If admitted, you will need to submit official transcripts from each university before matriculation.\n-\nIs the GRE or GMAT required for the Master's in Applied Data Science program?\nNo, the GRE/GMAT is not required for admissions.\n-\nI took the GRE and/or GMAT and want to include my score(s) with my Master's in Applied Data Science application.\nWhile the GRE/GMAT is not required, applicants can still submit their scores. 
The GRE school code is 1832; the GMAT school code is H9X-WG-70.\n-\nWho is exempt from providing proof of English proficiency?\nPlease refer to the University of Chicago’s English Language Proficiency requirements.\n-\nHow will I be notified that I am admitted to the Master's in Applied Data Science program?\nApplicants will be notified to check their application portal via the email they used to submit their application.\n-\nIf I am admitted to the Master's in Applied Data Science program, what do I do next?\nHave official e-transcripts sent to applieddatascience-admissions@uchicago.edu.\nIf your institution cannot send your documents electronically, please have them send your transcripts to the following mailing address:\nThe University of Chicago\nAttention: MS in Applied Data Science Admissions455 N Cityfront Plaza Dr., Suite 2800Chicago, Illinois 60611 -\nWhat test scores does UChicago accept as proof of English proficiency?\nPlease refer to the Proof of English Proficiency guidelines.\n-\nWhat are the minimum scores required?\nPlease refer to the Required Minimum Score guidelines.\n-\nWhere do I send my test scores?\nPlease send TOEFL scores to the University of Chicago using these instructions at the bottom of the page.\n-\nI took the TOEFL over two years ago. Can I still use those TOEFL results?\nPlease refer to the Validity guidelines.\n-\nWhat’s the difference between the MS in Applied Data Science and the MS in Data Science programs at UChicago?\nThe two programs share a strong foundation in data science but differ in structure, location, and focus.\n- The MS in Data Science is a 10-course program based on UChicago’s Hyde Park campus. It includes a comprehensive research project and is designed for students who want to dive deeper into the theoretical and research side of data science—often as preparation for PhD study or research-focused careers.\n- The MS in Applied Data Science program offers both a 12-course track and an 18-course thesis track. 
It is offered in online and in-person formats, with in-person classes held at the NBC Tower in downtown Chicago. The program focuses on applying data science and machine learning methods to real-world problems through hands-on coursework, industry collaborations, and a two-quarter capstone or research project.\nFor a more detailed breakdown of the differences between the two programs, check out this article here.\n-\nWhen will I receive my Master's in Applied Data Science admission decision?\n-\nInternational Students\n-\nWhich Master's in Applied Data Science provides students with visas?\nThe full-time, In-Person 1-Year 12-Course Program & 2-Year Thesis Track 18-Course Program are visa eligible.\n-\nWhat is the total cost of tuition for the Master's in Applied Data Science program?\nPlease refer to the Tuition, Fees, and Aid webpage.\n-\nIs the Master's in Applied Data Science an approved OPT/STEM program?\nYes, the full-time, In-Person Master’s in Applied Data Science program is listed as a STEM-designated degree by the U.S. Department of Homeland Security for the purposes of the STEM OPT extension, allowing eligible students to apply. However, approval of STEM OPT is at the discretion of U.S. Citizenship & Immigration Services.\n-\nDoes the Master's in Applied Data Science program offer Curricular Practical Training (CPT)?\nPlease refer to the Curriculum Practical Training webpage.\n-\nI have worked in the U.S. for more than two years. Does that mean that I am exempt from the TOEFL/IELTS requirement?\nPlease refer to the English Language Proficiency guidelines.\n-\nWhich Master's in Applied Data Science provides students with visas?\n-\nOnline Program\n-\nIf I am a student in the In-Person Master's in Applied Data Science Program, may I take courses in the Online Program? 
Conversely, if I am a student in the Online Program, may I take courses in the In-Person program?\nCurrently, students may only take Master’s in Applied Data Science courses in the modality in which they are officially enrolled.\n-\nDo I need to be a US citizen or permanent resident to apply to Master's in Applied Data Science Online Program?\nNo, students do not have to be US citizen or resident to partake in the Online Program. Please note that the Online Program is not eligible for visa sponsorship.\n-\nHow will enrolling in Master's in Applied Data Science Online Program impact my schedule? Are classes held synchronously, asynchronously, or both?\nClasses generally take place on evenings and weekends in order to allow our students and instructors to maintain their professional schedules. The Master’s in Applied Data Science Online Program is both synchronous and asynchronous. The same as our In-Person program, students are required to participate in weekly, live meetings with their instructors and peers, complete readings and coursework, and engage in discussion.\n-\nWill enrolling in Master's in Applied Data Science Online Program give me the opportunity to network with on-campus students, faculty/instructors, and advisors?\nYes. All Master’s in Applied Data Science Online Program students are invited to an annual ‘Immersion Weekend’ where attendees have opportunities to network and participate in other activities. On a rolling basis, our Career Services team will advertise additional opportunities to connect with employers and peers (e.g., virtual career fairs, virtual career advising/coaching appointments, and more).\n-\nWhat value do employers place on the Master's in Applied Data Science Online degree?\nThe value employers place on the Master’s in Applied Data Science degree is significant. 
As they hire Data Scientist, Data Engineers, and Data Analysts from the University of Chicago the expectations for technical competence, communication and influence skills, and exposure to advanced Data Science evolving technologies is high. The skills learned in the program translate directly into practice due to the program’s balance between theory and rigorous application experience developed in coursework and the Capstone project work delivered across the curriculum.\n-\nIs the Master's in Applied Data Science Online program equally academically rigorous as the In-Person program?\nYes. The Online Program curriculum is overseen by the same faculty curriculum committee as the In-Person program. Both programs are jointly reviewed and are held to the same high standards. Additionally, both programs are granted by the University of Chicago Physical Sciences Division.\n-\nWill my diploma indicate I completed the Master's in Applied Data Science Program Online?\nNo, your diploma will not include ‘Online’ in the name of your degree.\n-\nIf I am a student in the In-Person Master's in Applied Data Science Program, may I take courses in the Online Program? Conversely, if I am a student in the Online Program, may I take courses in the In-Person program?\n-\nMBA/MS\n-\nHow do I apply to the MBA/MS joint degree program?\nApplicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process. Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest. An MBA/MS program supplement will be available for completion within your Booth application. The supplement contains Applied Data Science specific questions that will be reviewed by the Applied Data Science admissions team along with your full Booth application. 
For complete consideration, applicants should complete the MBA application and the joint degree program supplement in the same application round prior to submitting the application.\n-\nWhat courses will I take in the MBA/MS program?\nAs a student in the joint-degree MBA and Applied Data Science program, you’ll take the equivalent of 23 100-unit courses:\n- 14 MBA classes\n- 9 data science courses\n- Leadership Effectiveness and Development (LEAD)\n- Qualified Work Experience, a noncredit professional internship experience\nYour Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA.\n-\nWill MBA/MS courses be in-person or online?\nYour Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA. A combination of online and in-person courses gives you flexibility in course scheduling, and you’ll earn two degrees in the time it would take to complete the MBA alone.\n-\nAre standardized tests required for admission?\nAs part of the online application, candidates will be required to submit a GMAT or GRE score for the joint program. International applicants may be required to submit proof of English language proficiency by submitting a TOEFL iBT or IELTS test score. The minimum TOEFL iBT score required for admission is 104; the minimum IELTS score required is 7. Proof of English proficiency may be waived under certain criteria noted by UChicago GRAD Admissions.\n-\nWhat are the main differences in programs and outcomes between the MBA/MS in Applied Data Science compared to Computer Science?\nThe fields of Statistics, Mathematics, and Computer Science intersect with industry domains in different ways. 
The MPCS program focuses on the center of Computer Science, including Software Engineering, High Performance Computing, Data Analytics, and Application Development. The MS-ADS Program focuses at the intersection of multiple fields, such as Computer Science, Mathematics, and Statistics (including Statistical Inference, Linear/Non-Linear Models, Machine Learning, Natural Language Processing, and Deep Learning). The outcomes for MPCS students include Software Engineer (Developer), Senior Software Engineering Management, Software/Hardware Architect, and Senior Cyber Security Engineer. The outcomes for students in MS-ADS include roles as Data Scientist (most common), Senior Data Science Consultant, Business Intelligence (BI) Director, Data Visualization Manager, Data Analytics Engineer, and AI Solution Architect.\n-\nHow do I apply to the MBA/MS joint degree program?\n-\n2-Year Thesis Track Program\n-\nWhat is the new 2-year thesis track program (18 courses), and how is it different from the 1-year program (12 courses)?\nBeginning in academic 2026-27, a limited number of In-Person, Full-Time students will have the opportunity to complete a 2-year version of the MS in Applied Data Science program. The 2-year program is completed over 21 months (2 academic years). Students in the 2-year program will complete 18 instead of 12 courses. The additional 6 courses consist of 4 additional elective courses (100 units each); and 2 required thesis courses (100 units each) that culminate in the completion of a required written thesis or thesis project. The thesis will be an extension of students’ previously completed Capstone Project. 
2-year program students are highly encouraged to complete a research Capstone (as compared to a traditional, industry Capstone) for various reasons.\nThe longer degree timeline offers more time to engage in academic and professional development while maintaining the same rigorous core curriculum and access to UChicago’s faculty and data science network. The first 12 of 18 courses that students complete in the 2-year program follow exactly the same course progression as those in the 1-year, 12-course program.\nThe 2-year program is not currently available to those in the Online program and/or those pursuing the MS in Applied Data Science on a part-time basis.\n-\nWhen can I apply for the 2-year thesis track program?\nThe application portal for the 2-year full-time option will open in September 2025. Applicants apply through the same portal as the standard MS in Applied Data Science program. The deadline to be considered for this track is December 4, 2025. There are a limited number of spots available in this new 2-year thesis track (18 courses) program. Those who are not admitted to the 2-year program will be automatically considered for the 1-year 12 course program unless they choose to opt-out of consideration.\n-\nMay I switch between the 1- and 2-year programs after officially enrolling in the MS-ADS program?\nNo. Once admitted to the 1-year program, there is no opportunity to enroll in the 2-year program. Those admitted to the 2-year program might, under extenuating case by case circumstances, graduate early after completing the initial 12 courses of the MS degree (4-5 quarters). 
Any such student must meet with program faculty and staff by the required deadlines in order to maintain good academic standing and/or visa compliance (if applicable).\n-\nIf I am admitted to the In-Person, Full-Time 1-year program, may I take any of the 2-year program electives?\nNo, due to the limited number of spots available for the new 2-year program, the electives are reserved for only those students admitted to the 2-year program specifically. Should anything change, students will be notified.\n-\nWhat is the 2-year program's thesis requirement?\nThose admitted to the full-time, in-person thesis track pathway (18 courses, 6 quarters of coursework) are required to complete and submit a written thesis. The thesis will be an individually authored text based on an area of interest from students’ Research Capstone Project. The thesis will be managed through the required thesis courses (2, 100 units each). A lead faculty member with knowledge of the student’s area of interest will supervise the thesis project and provide resources and support as needed.\nAs an alternative to a traditional written thesis, students may opt to complete a thesis project that results in a ‘minimum viable product’ project. Through this option, students will complete a rigorous, multi-step ‘build’ and ‘marketing’ phase throughout the required 2 thesis courses. The thesis project will also be supervised by a program faculty member.\n-\nHow many spots are available in the new 2-year program?\nWhile subject to change, there are a projected ~50-55 spots available for the 2-year program. For application year 2025-26 (for entrance in autumn 2026), applicants must apply by the round two application deadline (December 4, 2025) in order to be considered for the 2-year program. 
Students admitted to the 2-year program will take the same 12 courses as those in the 1-year program, but they will complete 6 more courses over 2 additional quarters in their 2nd year of study for a total of 18 courses.\n-\nWhat is the new 2-year thesis track program (18 courses), and how is it different from the 1-year program (12 courses)?",2707
1,https://datascience.uchicago.edu/education/undergrad-major/,Undergraduate Data Science Major - DSI,"Undergraduate Data Science Major\nUndergraduate Data Science Staff\n-\nDavid Biron\nDirector of Undergraduate Data Science; Assistant Senior Instructional Professor -\nMaria Lema (she/her)\nAssistant Director of Undergraduate Data Science, Data Science Institute\nMaria Lema is the Assistant Director of Undergraduate Data Science Studies. Previously, Maria served as Academic Adviser in the College Academic Advising Office on campus. She holds a MA in Higher and Postsecondary Education from Teachers College, Columbia University and a BA in Mathematics and Philosophy from SUNY Buffalo State.",83
2,https://datascience.uchicago.edu/education/,Education - DSI,"Building the foundations of data science, considering its ethical and societal implications, and communicating its discoveries to make the most powerful and positive real-world impact.\nUndergraduate Data Science Major\nA curriculum combining computational and analytical skills, domain knowledge, communication skills, and ethics. See the Committee for Data Science for more.\nThe PhD curriculum combines training in mathematical foundations of data science, responsible data use and communication, and advanced computational methods.\nDesigned for professionals with backgrounds in technical fields who want to become data scientists, a program with rigorous classes, expert instructors, leading-edge technology, and an unparalleled network of industry professionals.\nEarn a high-powered joint degree at the intersection of business and technology. An MBA and Applied Data Science degree equips you to bridge the gap between tech and management and provide effective leadership in data-centric environments.\nThe Master’s in Data Science (MSDS) has been developed for students interested in pursuing a research career in data science with courses taught by faculty from the departments of statistics, computer science, and many other departments across the university.\nAn experiential project-based course where students work in teams as data scientists with real-world clients from industry, academia, and social impact organizations.\nA summer research opportunity for undergraduate students (and Chicago-area high school students) focusing on rigorous, applied, interdisciplinary data science research and rooted in a cohort community.",223
3,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/#main,Master's in Applied Data Science - DSI,"Elevate Your Expertise in Data Science\nApply Today!\nHow to Apply\nIndustry Leading Faculty\nData in Action - Capstone Projects\nInside the MS-ADS Program\nStart Your Application\nRelated News, Insights, and Past Events\nDSI NewsNov 10, 2025\nInaugural Margot and Tom Pritzker Prize for AI in Science Research Excellence Announces Winners\nPaperNov 07, 2025\nNew Research Charts Changes in Global Scientific Leadership\nPaperNov 06, 2025\nNew AI Model Explores Massive Chemical Space with Minimal Data\nDSI NewsOct 31, 2025\nTwo Postdoctoral Scholars Awarded Fellowships\nPaperOct 28, 2025\nNew Research Explores What Makes Emotional Memories Stick\nPaperOct 24, 2025\nWhy Can’t Powerful LLMs Learn Multiplication?\nDSI NewsOct 22, 2025\nA Data Science Solution to Amplify Nonprofit Impact at Land Together\nDSI NewsOct 16, 2025\n2025-26 Distinguished Speaker Series\nOct\n24\nPast EventOct 24, 2025\nQuarterly Fireside AI | The Practical Impact of GenAI\nOct\n28\nPast EventOct 28, 2025",148
4,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/in-person-program/,In-Person Program - DSI,"In-Person Program\nTailor Your Data Science Journey\nComplete your master’s degree full- or part-time with UChicago’s In-Person MS in Applied Data Science program. While many In-Person students are early in their careers, the most competitive applicants will have at least 1 year of full-time work experience and 1 or more relevant, sustained internships.\nFor those seeking part-time study and those with more years of full-time work experience (3+), our Online program is optimized for you.\n1-Year 12 Course Program (Full- and Part-Time Options)\nThe signature 1-year 12 course program is ideal for those who wish to fast-track their graduate studies. Full-time students typically graduate in 4–5 quarters; part-time students may take up to 6 quarters. All students must graduate within 4 years maximum.\nAll students complete:\n-\n12 courses (6 Core, 4 Electives, 2 Capstone).\n-\nA required Career Seminar throughout the program. Those with 3+ years of full-time work experience may petition for exemption.\n-\nA 2-quarter Capstone Project with a real industry partner; some students opt to complete a research Capstone.\nCourses are primarily offered in the evenings and on weekends to support working professionals. The program is STEM/OPT eligible.\n2-Year Thesis Track, 18 Course Program (Full-Time Only)\nBeginning in academic year 2026-27, UChicago will offer an additional 2-year (21 month) pathway for In-Person, Full-Time applicants.\nThe 2-year Thesis Track (18 courses) pathway is optimized for those who want a longer master’s program, are interested in taking more elective courses, and have the capacity to complete a master’s thesis. The 2-year (21 month) pathway is completed over 6 academic quarters. 
The summer in between years 1 and 2 serves as a vacation quarter.\nWithin the application portal, applicants will indicate their selection of this pathway. Read more about applying to the 2-Year program.\nAll students complete:\n-\n18 total courses (6 Core, 8 Electives, 2 Capstone, 2 Independent Study).\n-\nA required Career Seminar throughout the program. Those with 3+ years of full-time work experience may petition for exemption.\n- A 2-quarter Capstone Project.\n-\nA Master’s Thesis (traditional written thesis or project-based thesis).\nThis option is ideal for students seeking a longer academic experience with more time to complete more elective courses. Students must also have the capacity to complete a traditional written thesis or thesis project.\nComparison Chart\nThis program is listed as a STEM designated degree by the U.S. Department of Homeland Security for the purposes of the STEM OPT extension allowing eligible students to apply. However, approval of STEM OPT is at the discretion of U.S. Citizenship & Immigration Services.\nYour Career Success\nTake the next step to advance your career with UChicago’s MS in Applied Data Science.\nThe In-Person program admits full- and part-time students for entrance in Autumn quarter annually. The full-time, in-person MS in Applied Data Science program is STEM/OPT eligible. Please visit the Online program page if you are interested in those full- and part-time options.\nYour Engagement\nIf you learn best in an in-person classroom environment and prefer to live in or near to Chicago, IL, the Master’s in Applied Data Science In-Person program is ideal for you. Your high-tech classrooms are located in downtown Chicago (NBC Tower, Gleacher Center), and you will have access to tailored, in-person student services and program amenities. Most courses are from 6-9pm Monday through Thursday with some offered on Fridays and Saturdays. This allows you to work in an internship and/or job during the program. 
Select courses are offered during the day. Learn more about Tuition, Fees, & Aid.\nYour Student Experience\nAs an In-Person program student, you will have access to expert faculty and instructors with industry expertise, a full-service student affairs team, and an unparalleled network of global alumni. Our team is passionate about supporting a Signature Student Experience tailored to your needs.\nProgram Director, Greg Green, PhD\nYour Outcomes\nYour success is our success. Graduates of UChicago’s Master’s in Applied Data Science program consistently demonstrate competitive outcomes. You will have full access to our tailored career services and external partnerships team to help you advance your career in data science–whether you are launching your career, interested in pivoting, or want to move up within your current company. You can take advantage of in-house career services advising and coaching, tailored networking events, career fairs to connect directly with employers, internship placement support, and more.\nBy and For Data Science Innovators\nYou will earn UChicago’s Master’s in Applied Data Science by successfully completing the required curriculum and our tailored, multi-quarter Career Seminar.\nTo keep up with the rapidly evolving field and job market, you will be challenged by our rigorous curriculum that is designed by and for data science innovators and leaders. Courses are reviewed annually to ensure the content keeps pace with the rapidly evolving landscape of data science.\nYou have the flexibility to pursue the Master’s in Applied Data Science degree on a part- or full-time schedule. Part-time students enroll in two courses each quarter and take their courses in the evenings or on Saturdays. Full-time students take three courses per quarter. Some of their courses may be offered during the day. 
All courses are taught at the NBC Tower or Gleacher Center in downtown Chicago.\nGet in Touch\n-\nNoncredit Courses\n-\nCareer Seminar (Seminar, required)\nThe Pass/Fail Career Seminar supports the development of industry professional skills, job and/or internship searches, and other in-demand areas of competency among today’s employers. Students enroll in the Career Seminar each quarter in order to engage in unique content throughout their degree program. Students with significant full-time work experience may be eligible to waive this course. 0 units, no cost.\n-\nIntroduction to Statistical Concepts (Foundational, optional)\nThis course is held in the five weeks leading up to the start of your first quarter and provides general exposure to basic statistical concepts necessary for success in advanced courses in the program. 0 units, no cost.\n-\nR for Data Science (Foundational, optional)\nThis course is held the five weeks leading up to the start of your first quarter and is an introduction to the essential concepts and techniques for the statistical computing language R. 0 units, no cost.\n-\nPython for Data Science (Foundational, optional)\nThis course is held concurrently with the first five weeks of your first quarter in the program and starts with an introduction to the Python programming language basic syntax and environment. 0 units, no cost.\n-\nAdvanced Linear Algebra for Machine Learning (Foundational, optional)\nThis course is held concurrently with the second five weeks of your first quarter in the program and is focused on the theoretical concepts and real-life applications of linear algebra for machine learning. 
0 units, no cost.\n-\nBrush up on the Basics (Optional resource)\nIf you would like to gauge your preparation in Foundational course topics, we recommend specific Coursera courses that cover very similar topics.\nIf you would like to gauge your preparation in Foundational course topics, we recommend specific Coursera courses that cover very similar topics.\nFour Coursera courses cover very similar topics. You can review the Coursera curricula to see if you are already well-prepared, or if you like, study their materials to brush up on some or all of these topics.\nMathematics for Machine Learning: Linear Algebra (offered by University College London)\n-\nCareer Seminar (Seminar, required)\n-\nCore Courses\n-\nTime Series Analysis and Forecasting\nTime Series Analysis is a science as well as the art of making rational predictions based on previous records. It is widely used in various fields in today’s business settings.\n-\nStatistical Models for Data Science\nIn a traditional linear model, the observed response follows a normal distribution, and the expected response value is a linear combination of the predictors. Since Carl Friedrich Gauss (1777-1855) and Adrien-Marie Legendre (1752-1833) created this linear model framework in the early 1800s, the “Linear Normal” assumption has been the norm in statistics/data science for almost two centuries. New methods based on probability distributions other than Gaussian appeared only in the second half of the twentieth century. These methods allowed working with variables that span a broader variety of domains and probability distributions. Besides, methods for the analysis of general associations were developed that are different from the Pearson correlation.\n-\nMachine Learning I\nThis course is aimed at providing students an introduction to machine learning with data mining techniques and algorithms. 
It gives a rigorous methodological foundation in analytical and software tools to successfully undertake projects in Data Science. Students are exposed to concepts of exploratory analyses for uncovering and detecting patterns in multivariate data, hypothesizing and detecting relationships among variables, conducting confirmatory analyses, and building models for predictive and descriptive purposes. It will present predictive modeling in the context of balancing predictive and descriptive accuracies.\n-\nMachine Learning II\nThe objective of this course is three-folds–first, to extend student understanding of predictive modeling with machine learning concepts and methodologies from Machine Learning 1 into the realm of Deep Learning and Generative AI. Second, to develop the ability to apply those concepts and methodologies to diverse practical applications, evaluate the results and recommend the next best action. Third, to discuss and understand state-of-the machine learning and deep learning research and development and their applications.\n-\nData Engineering Platforms for Analytics or Big Data and Cloud Computing\nData Engineering Platforms teaches effective data engineering—an essential first step in building an analytics-driven competitive advantage in the market.\nBig Data and Cloud Computing teaches students how to approach big data and large-scale machine learning applications. 
There is no single definition of big data and multiple emerging software packages exist to work with it, and we will cover the most popular approaches.\n-\nLeadership and Consulting for Data Science\nThe Leadership and Consulting for Data Scientist course is focused on:\n• Learning techniques and proven methods to effectively grasp the business domain including organizational dynamics of consultancies and client organizations\n• Developing relevant solutions to enterprise problems using the sampling methods, traditional statistical techniques and modern machine learning models that deliver value to the organization\n• Practicing successful project delivery through effective data discovery, influential team membership and leadership, project management, and communication at every stageThis course will not only make you a better data scientist; it will make you and your analyses more approachable, more persuasive, and ultimately more successful.\n-\nData Science Capstone Project\nThe required Capstone Project is completed over two quarters and covers research design, implementation, and writing. Full-time students start their capstone project in their third quarter. Part-time students generally begin the capstone project in their fifth quarter.\n-\nTime Series Analysis and Forecasting\n-\nSample Elective Courses\n-\nAdvanced Computer Vision with Deep Learning\nComputer vision is the field of computer science that focuses on creating digital systems that can process, analyze, and make sense of visual data in the same way that humans do. Deep learning is a subset of machine learning and a branch of Artificial Intelligence (AI). It involves the training, deployment, and application of large complex neural network architectures to solve cutting-edge problems. 
Deep Learning has become the primary approach for solving cognitive problems such as Computer Vision and Natural Language Processing (NLP) and has had a massive impact on various industries such as healthcare, retail, automotive, industrial automation, and agriculture. This course will enable students to build Deep Learning models and apply them to computer vision tasks such as object recognition, detection, and segmentation. Students will gain an in-depth understanding of the Deep Learning model development process, tools, and frameworks. Although the focus of the course will primarily be computer vision, students will work on both image and nonimage datasets during class exercises and assignments. Students will gain hands-on experience in popular libraries such as Tensorflow, Keras, and PyTorch. Students will also learn to apply state of the art models such as ResNet, EfficientNet, RCNNs, YOLO, Vision Transformers, etc. for computer vision and work on datasets such as CIFAR, ImageNet, MS COCO, and MPII Human Poses.\n-\nAdvanced Machine Learning and Artificial Intelligence\nSince the era of big data started, challenges associated with data analysis have grown significantly in different directions: First, the technological infrastructure had to be developed that can hold and process large amounts of data from different sources and of multiple not always well formalized formats. Second, data analysis methods had to be reviewed, selected and modified to work in distributed computational environments like combinations of in-house clusters of servers and cloud. But the biggest challenge of all is learning to think differently in order to ask new types of questions that could not be answered by analyses of less complex data streams with less complex technological infrastructure. In recent years significant progress has been achieved in creating technological ecosystems for big data analysis. 
Innovative technologies such as open source projects MapReduce, Hadoop, Spark, Storm, Kafka, TensorFlow, H2O, etc. allowed us to look at depths of data unseen before. We now have a growing number of sources and educational courses introducing these new tools. However, developing new data analysis methods appropriate to these new data ecosystems is more difficult than it appears.\n-\nApplied Generative AI: Agents and Multimodal Intelligence\nThis course explores Advanced Generative AI with a focus on multimodal modeling, a transformative AI paradigm integrating diverse data types—text, images, audio, video, time-series, and point clouds. Multimodal AI is reshaping industries, from autonomous systems and e-commerce to healthcare and intelligent media applications. Students will gain a deep understanding of generative AI, including image generation, transformers in vision, knowledge distillation, vision-language modeling, multimodal fusion, video generation, audio synthesis, and time-series analysis. A key focus is integrating Large Language Models (LLMs) with other modalities to develop next-generation multimodal conversational AI.The course blends theoretical depth with hands-on experience, covering cross-modal alignment, data fusion, and multimodal reasoning using cutting-edge tools. Industry-driven labs connect concepts to real-world applications, equipping students to design innovative AI solutions in autonomous navigation, robotics, healthcare, and finance. Additionally, the course introduces Agentic Systems and Vertical AI Agents, highlighting specialized AI frameworks for intelligent decision-making and adaptive, industry-specific AI agents. 
Ethical considerations and deployment strategies for autonomous agents are also explored, preparing students to lead AI-driven transformation across industries.\n-\nBayesian Machine Learning with Generative AI Applications\nThis course provides a strong theoretical and practical skillset for probabilistic machine learning applications. Bayesian inference and modeling methods are important for several areas including prediction, decision making, and risk assessment where modeling the uncertainty is needed. The course begins with an introduction to Bayesian statistical analysis, covering the foundations of Bayesian inference and the application of Bayes’ theorem for statistical inference. We then introduce Bayesian networks, which offer a powerful graphical tool for modeling complex systems and making probabilistic inferences. The course then advances to cover more sophisticated topics such as Markov Chain Monte Carlo (MCMC) methods for sampling from complex probability distributions, hierarchical models, and model selection techniques. The final three weeks are dedicated to cutting-edge methodologies like Generative Deep Learning, Variational Autoencoders, and Bayesian Neural Networks, all rooted in Bayesian Machine Learning. Upon completion, students will be equipped to apply Bayesian methods to a wide range of real-world problems in fields such as engineering, business, finance, and public policy, addressing challenges like missing data or training AI models that are able to say ‘I don’t know’.\n-\nCausal Models for Data Science\nThis course is designed to equip students with the knowledge and skills to perform causal inference with machine learning. Students learn practical skills for designing and analyzing experiments. The course begins with a quick overview of the basics of correlational and cross-sectional analytical techniques. It then introduces the importance of randomization in explainability and causal inference. 
The issues of bias in observational studies are examined. Students use AI/ML models to quantify randomization errors and correct violations of non-randomization. Finally, counterfactuals for individual predictions are examined.\n-\nData Science for Algorithmic Marketing\nThis course focuses on marketing science methods and algorithms for undertaking competitive analysis in the digital landscape: market segmentation, mining databases for effective digital marketing, design of new digital and traditional products, forecasting sales and product diffusion, real time product positioning, intra omni-channel optimization and inter omni-channel resource allocation, and pricing across both omni-channel marketing effectiveness and ROI. The course will use a combination of lecture, in-class discussions, group assignments, and a final group project. The course lays special emphasis on algorithms. Hence it draws heavily from the fields of optimization, machine-learning based recommendation systems, association rules, consumer choice models, Bayesian estimation, experimentation and analysis of covariance, advanced visualization techniques for mapping brand perceptions, and analysis of social media data using advanced NLP techniques.\n-\nData Science for Healthcare\nGiven the breadth of the field of health analytics, this course will provide an overview of the development and rapid expansion of analytics in healthcare, major and emerging topical areas, and current issues related to research methods to improve human health. We will cover such topics as security concerns unique to the field, research design strategies, and the integration of epidemiologic and quality improvement methodologies to operationalize data for continuous improvement. Students will be introduced to the application of predictive analytics to healthcare. 
Students will understand factors impacting the delivery of quality and safe patient care and the application of data-driven methods to improve care at the healthcare system level, design approaches to answering a research question at the population level, become familiar with the application of data analytics to impacting care at the provider level through Clinical Decision Systems, and understand the process of a Clinical Trial.\n-\nData Visualization Techniques\nIn today’s data driven enterprise, data storytelling using effective visualization strategies is an essential skill for analytics practitioners in almost every field to explore and present data. This course focuses on modern data visualization technologies, tools, and techniques to convert raw data into actionable information. Modern data visualization tools are at the forefront of the “self-service analytics” architectures which are decentralizing analytics and breaking down IT bottlenecks for business experts. Moreover, with its foundations rooted in statistics, psychology, and computer science, data visualization shows you how to better understand the data, present clear evidence of your findings to your intended audience and tell engaging data stories through charts and graphics. This course is designed to introduce data visualization as a medium of effective communication using strategic storytelling, and the basis for interactive information dashboards.\n-\nDigital Marketing Analytics in Theory and Practice\nSuccessfully marketing brands today requires a well-balanced blend of art and science. This course introduces students to the science of web analytics while casting a keen eye toward the artful use of numbers found in the digital space. The goal is to provide marketers with the foundation needed to apply data analytics to real-world challenges they confront daily in their professional lives. 
Students will learn to identify the web analytic tool right for their specific needs; understand valid and reliable ways to collect, analyze, and visualize data from the web; and utilize data in decision making for their agencies, organizations or clients. By completing this course, students will gain an understanding of the motivations behind data collection and analysis methods used by marketing professionals; learn to evaluate and choose appropriate web analytics tools and techniques; understand frameworks and approaches to measuring consumers’ digital actions; earn familiarity with the unique measurement opportunities and challenges presented by New Media; gain hands-on, working knowledge of a step-by-step approach to planning, collecting, analyzing, and reporting data; utilize tools to collect data using today’s most important online techniques: performing bulk downloads, tapping APIs, and scraping webpages; and understand approaches to visualizing data effectively.\n-\nDeep Reinforcement Learning\nThis course is an introduction to reinforcement learning, also known as neuro-dynamic programming. It discusses basic and advanced concepts in reinforcement learning and provides several practical applications. Reinforcement learning refers to a system or agent interacting with an environment and learning how to behave optimally in such an environment. An environment typically includes time, actions, states, uncertainty and rewards. Reinforcement learning combines neural networks and dynamic programming to find an optimal behavior or policy of the system or agent in a complex environment setting. Neural network approximations are used to circumvent the well-known ‘curse of dimensionality’ which has been a barrier to solving many practical applications. Dynamic programming is the key learning mechanism that the system or the agent uses to interact with the environment and improve its performance. 
Students will master key learning techniques and will become proficient in applying these techniques to complex stochastic decision processes and intelligent control.\n-\nGenerative AI: Principles and Applications\nThis course dives into the realm of Generative AI, offering a comprehensive look into the world of Large Language Models (LLMs), image generation techniques, and the fusion of vision and text through multimodal models. Drawing from core concepts in neural networks, transformers, and advanced techniques such as prompt engineering, vision prompting, and multimodality representation, students will explore the capabilities, applications, and ethical considerations of generative models. This course culminates in hands-on projects, allowing participants to apply theory to practical scenarios.\n-\nMachine Learning Operations\nThe objective of this course is two-fold: first, to understand what Machine Learning Operations (MLOps) is and why it is a key component in enterprise production deployment of machine learning projects, and second, to expose students to software engineering, model engineering and state-of-the-art deployment engineering with hands-on platform and tools experience. This course crosses the chasm that separates machine learning projects/experiments and enterprise production deployment. It covers three pillars in MLOps: software engineering such as software architecture, Continuous Integration/Continuous Delivery and data versioning; model engineering such as AutoML and A/B experimentation; and deployment engineering such as docker containers and model monitoring. The course focuses on best practices in the industry that are critical to enterprise production deployment of machine learning projects. Having completed this course, a student understands the machine learning lifecycle and what it takes to go from ideation to operationalization in an enterprise environment. 
Furthermore, students get exposure to state-of-the-art MLOps platforms such as allegro, xpresso, Dataiku, LityxIQ, DataRobot, AWS Sagemaker, and technologies such as gitHub, Jenkins, slack, docker, and kubernetes.\n-\nNext-Gen NLP: LLM and Agentic AI in Practice\nExtracting actionable insights from unstructured text and designing cognitive applications have become significant areas of application for analytics. Students in this course will learn foundations of natural language processing, including: concept extraction; text summarization and topic modeling; part of speech tagging; named entity recognition; semantic roles and sentiment analysis. For advanced NLP applications, we will focus on feature extraction from unstructured text, including word and paragraph embedding and representing words and paragraphs as vectors. For cognitive analytics section of the course, students will practice designing question answering systems with intent classification, semantic knowledge extraction and reasoning under uncertainty. Students will gain hands-on expertise applying Python for text analysis tasks, as well as practice with multiple IBM Watson services, including: Watson Discovery, Watson Conversation, Watson Natural Language Classification and Watson Natural Language Understanding.\n-\nOptimization and Simulation Methods for Data Science\nThis course introduces students to how optimization and simulation techniques can be used to solve many real-life problems. It will cover two classes of optimization methods. First class has been developed to optimize real, non- simulated systems or to find the optimal solution of a mathematical model. The methods that belong to this class include liner programming, quadratic programming and mixed-integer programming. Second class of methods has been developed to optimize a simulation model. 
The difference with the classical mathematical programming methods is that the objective function (which is the function to be minimized or maximized) is not known explicitly and is defined by the simulation model (computer code). The course will demonstrate multiple approaches to build simulation models, such as discrete event simulations and agent-based simulations. Then, it will show how stochastic optimization and heuristic approaches can be used to analyze the simulated system and design a sequence of computational experiments that allow to develop a basic understanding of a particular simulation model or system through exploration of the parameter space, to find robust plausible behaviors and conditions and robust near-optimal solutions that are not prone to being unstable under small perturbations.\n-\nQuantitative Finance: Methods and Applications\nThis course concentrates on the following topics: review of financial markets and assets traded on them; main characteristics of financial analytics: returns, yields, volatility; review of stochastic models of market price and their statistical representations; concept of arbitrage, elements of arbitrage pricing approach; principles of volatility analyses, implied vs. realized volatility; correlation, cointegration and other relationships between various financial assets; market risk analytics and management of portfolios of financial assets. The course puts special emphasis on covering main steps of building analytics from visualizing data and building intuition about their structure and patterns to selecting appropriate statistical method to interpretation of the results and building analytical models. Topics are illustrated by data analysis projects using R. Basic familiarity with R is a requirement.\n-\nReal Time Intelligent Systems\nDeveloping end-to-end automation and intelligent systems is now the most advanced area of application for analytics. 
Building such systems requires proficiency in programming, understanding of computer systems, as well as knowledge of related analytical methodologies, which are the skills that this course aims to teach to students. The course focuses on python and is tailored for students with basic programming knowledge in python. The course is partially project based. During the first three sessions, we will review basic python concepts and then learn more advanced python and the ways to use python to handle large data flows. The later sessions are project based and will focus on developing end-to-end analytical solutions in the following areas: Finance and trading, blockchains and crypto-currencies, image recognition, and video surveillance systems.\n-\nSupply Chain Optimization\n“Big Data” continues to grow exponentially in our large-scale transactional world where 100,000s of SKUs and millions of customers are interacting with 1:1 offers that include differential pricing, shipping timing/costs and even made to order “custom” product configurations. These consumer behaviors are quickly advancing the availability of new data and techniques within the discipline of Data Science. This elective course will give students the opportunity to apply their skills in data visualization, data mining tools, predictive modeling, and advanced optimization techniques to address Supply Chain challenges. The course focuses on the use of Advanced Predictive Modeling, Machine Learning, AI and other Data Science insight and activation tools to automate and optimize the performance of the Supply Chain. Students will also learn how to optimize the performance of the Supply Chain from the lens of multiple related disciplines including: Sales Forecasting, Warehousing/Inventory Management, Promotion, Pricing, Logistics Network Optimization, Freight Cost Management, Manufacturing, Retail POS Information, Ecommerce, Consumer Data, and Product Design/Packaging. 
After completing this course, you will be prepared to work in any of the numerous specialty areas possible in the world of Supply Chain Management.\n-\nAdvanced Computer Vision with Deep Learning",4484


**Smart Chunking and Metadata Enrichment**

In this section, I transformed long extracted texts into overlapping chunks optimized for embedding and retrieval-based models.

The logic includes:

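- Word-based chunking: segments of up to 500 words, each sharing 100 words with the previous segment for context continuity. Lengths are counted in whitespace-separated words, not model tokens.

- Short-text handling: texts of 80 words or fewer are kept as a single chunk instead of being split

- Noise cleaning: removes boilerplate phrases such as “Cookie Policy”, “Privacy Notice”, and “All rights reserved”, then collapses repeated whitespace

- Metadata tagging: each chunk is labeled with a unique ID, page URL, title, inferred page type (e.g., “capstone”, “faq”, “career_outcomes”), and its index within the page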
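
To make the window arithmetic concrete, the following is a minimal sketch that computes only the word-index spans produced by a 500-word window with a 100-word overlap. The helper name `chunk_spans` is hypothetical and is not part of the pipeline; it assumes the same stride of 500 - 100 = 400 words described above.

```python
def chunk_spans(n_words: int, max_words: int = 500, overlap: int = 100):
    """Return (start, end) word-index pairs for overlapping windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + max_words, n_words)
        spans.append((start, end))
        if end == n_words:
            break
        start = end - overlap  # each new window begins 400 words after the last
    return spans


print(chunk_spans(1200))
# [(0, 500), (400, 900), (800, 1200)]
```

A 1,200-word page therefore yields three chunks, and words 400-499 and 800-899 each appear in two of them.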

In [30]:
import re, uuid

OUTPUT_JSONL = "/content/msads_chunks_trafilatura.jsonl"

MAX_WORDS = 500          # large chunks
OVERLAP_WORDS = 100      # overlapping window
MIN_WORDS_SECTION = 80   # don't split very short text

NOISY_PHRASES = [
    "Cookie Policy",
    "Privacy Notice",
    "All rights reserved",
]


def infer_page_type(url: str) -> str:
    u = url.lower()
    if "how-to-apply" in u:
        return "how_to_apply"
    if "faqs" in u:
        return "faq"
    if "capstone-projects" in u:
        return "capstone"
    if "course-progressions" in u:
        return "course_progressions"
    if "events-deadlines" in u:
        return "events_deadlines"
    if "tuition-fees-aid" in u:
        return "tuition_fees_aid"
    if "instructors-staff" in u:
        return "instructors_staff"
    if "career-outcomes" in u:
        return "career_outcomes"
    if "in-person-program" in u:
        return "in_person_program"
    if "online-program" in u:
        return "online_program"
    if "explore-the-ms-ads-campus" in u:
        return "explore_campus"
    if "ms-in-applied-data-science" in u:
        return "msads_main"
    if "/education/" in u:
        return "education_general"
    if "/about/" in u:
        return "about"
    if "/research/" in u:
        return "research"
    return "general"


def remove_noise(text: str) -> str:
    for phrase in NOISY_PHRASES:
        text = text.replace(phrase, " ")
    text = re.sub(r"\s+", " ", text).strip()
    return text


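def split_chunks(text: str,
                 max_words: int = MAX_WORDS,
                 overlap: int = OVERLAP_WORDS,
                 min_words: int = MIN_WORDS_SECTION):
    """Split text into overlapping word windows.

    Texts of at most `min_words` words are returned as a single chunk.
    """
    if overlap >= max_words:
        # Otherwise `start` would never advance and the loop would not end.
        raise ValueError("overlap must be smaller than max_words")

    words = text.split()
    n = len(words)
    if n == 0:
        return []
    if n <= min_words:
        return [" ".join(words)]

    chunks = []
    start = 0
    while start < n:
        end = min(start + max_words, n)
        chunks.append(" ".join(words[start:end]))
        if end == n:
            break
        # Step back by `overlap` words so consecutive chunks share context.
        start = end - overlap
    return chunks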


chunk_records = []

for _, row in content_df.iterrows():
    url = row["url"]
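    # pd.isna guards against NaN, which is truthy and would otherwise pass through `or ""`.
    page_title = row.get("page_title", "")
    page_title = "" if pd.isna(page_title) else str(page_title)
    content = row["content"]
    full_text = remove_noise("" if pd.isna(content) else str(content))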

    page_type = infer_page_type(url)
    chunks = split_chunks(full_text)

    for idx, chunk in enumerate(chunks):
        chunk_records.append({
            "id": str(uuid.uuid4()),
            "url": url,
            "page_title": page_title,
            "page_type": page_type,
            "chunk_index": idx,
            "text": chunk
        })

print("Total chunks:", len(chunk_records))

# Save to JSONL
with open(OUTPUT_JSONL, "w", encoding="utf-8") as f:
    for r in chunk_records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

print(f" Saved chunks to {OUTPUT_JSONL}")


Total chunks: 1164
 Saved chunks to /content/msads_chunks_trafilatura.jsonl
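
One way to sanity-check the overlap is to confirm that every chunk begins with the last 100 words of the chunk before it. The sketch below is illustrative only: `check_overlap` is a hypothetical helper, and the demo uses synthetic text rather than the scraped pages.

```python
def check_overlap(chunks, overlap=100):
    """Return True if each chunk starts with the last `overlap` words of the previous one."""
    for prev, cur in zip(chunks, chunks[1:]):
        if prev.split()[-overlap:] != cur.split()[:overlap]:
            return False
    return True


# Synthetic 1,200-word text, windowed with 500-word chunks and a 100-word overlap.
words = [f"w{i}" for i in range(1200)]
demo_chunks = [" ".join(words[s:s + 500]) for s in (0, 400, 800)]
print(check_overlap(demo_chunks))  # True
```

On the real data, the same check would be applied per page: group the records by `url`, order them by `chunk_index`, and pass each page's `text` values to the helper.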


In [31]:
import pandas as pd

jsonl_path = "/content/msads_chunks_trafilatura.jsonl"
df = pd.read_json(jsonl_path, lines=True)
print("Total chunks:", df.shape[0])
df.head(5)

Total chunks: 1164


Unnamed: 0,id,url,page_title,page_type,chunk_index,text
0,af09e1b6-a74f-4aa7-9c2c-57beaff5b025,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,0,"FAQs Master’s in Applied Data Science FAQs Learn more about what makes our program unique. David Uminsky, PhD – UChicago Data Science Institute, Executive Director - Application Process - When will I receive my Master's in Applied Data Science admission decision? Admissions decisions are typically released 1-2 months after each application deadline. Only completed applications are reviewed. Please refer to the How to Apply page for guidelines. - If I finish my Master's in Applied Data Science application before the deadline, will I receive my decision early? No, admissions decisions for the in-person program are typically released 1-2 months after each application deadline. Your application must be complete to be considered for review. - How do I submit the materials that will accompany my Master's in Applied Data Science application? Please review the How to Apply page. - Does the admissions office allow recommenders to email their letter directly as an attachment to be included in an applicant’s file? Unfortunately, no. Recommenders must upload their letter of support by using the URL that is sent to them electronically by our online application system. - Do I need to provide my recommenders with instructions? No. Recommendation forms and instructions are sent electronically to recommenders once their names are entered within the online application. - My recommender did not receive notification, can I resend it? Yes. If a recommender does not receive a URL, the applicant can resend the link through the online application or ask the recommender to check their spam folder. - What materials do I need to submit to accompany my application for admission to the Masters in Applied Data Science program? Please review the How to Apply page. 
- Once I upload my unofficial transcripts to my application, do I still need to provide an official transcript? You must upload one unofficial transcript from each university you attended within your application. An unofficial undergraduate transcript is required, even if you hold advanced degrees. Do not mail transcripts with your application; only uploads are needed for evaluation. If admitted, you will need to submit official transcripts from each university before matriculation. - Is the GRE or GMAT required for the Master's in Applied Data Science program? No, the GRE/GMAT is not required for admissions. - I took the GRE and/or GMAT and want to include my score(s) with my Master's in Applied Data Science application. While the GRE/GMAT is not required, applicants can still submit their scores. The GRE school code is 1832; the GMAT school code is H9X-WG-70. - Who is exempt from providing proof of English proficiency? Please refer to the University of Chicago’s English Language Proficiency requirements. - How will I be notified that I am admitted to the Master's in Applied Data Science program? Applicants will be notified to check their application portal via the email they used to submit their application. - If I am admitted to the Master's in Applied Data Science program, what do I do next? Have official e-transcripts sent to"
1,71aed5b5-c0b2-4173-ac54-1467efaaec4e,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,1,"not required, applicants can still submit their scores. The GRE school code is 1832; the GMAT school code is H9X-WG-70. - Who is exempt from providing proof of English proficiency? Please refer to the University of Chicago’s English Language Proficiency requirements. - How will I be notified that I am admitted to the Master's in Applied Data Science program? Applicants will be notified to check their application portal via the email they used to submit their application. - If I am admitted to the Master's in Applied Data Science program, what do I do next? Have official e-transcripts sent to applieddatascience-admissions@uchicago.edu. If your institution cannot send your documents electronically, please have them send your transcripts to the following mailing address: The University of Chicago Attention: MS in Applied Data Science Admissions455 N Cityfront Plaza Dr., Suite 2800Chicago, Illinois 60611 - What test scores does UChicago accept as proof of English proficiency? Please refer to the Proof of English Proficiency guidelines. - What are the minimum scores required? Please refer to the Required Minimum Score guidelines. - Where do I send my test scores? Please send TOEFL scores to the University of Chicago using these instructions at the bottom of the page. - I took the TOEFL over two years ago. Can I still use those TOEFL results? Please refer to the Validity guidelines. - What’s the difference between the MS in Applied Data Science and the MS in Data Science programs at UChicago? The two programs share a strong foundation in data science but differ in structure, location, and focus. - The MS in Data Science is a 10-course program based on UChicago’s Hyde Park campus. 
It includes a comprehensive research project and is designed for students who want to dive deeper into the theoretical and research side of data science—often as preparation for PhD study or research-focused careers. - The MS in Applied Data Science program offers both a 12-course track and an 18-course thesis track. It is offered in online and in-person formats, with in-person classes held at the NBC Tower in downtown Chicago. The program focuses on applying data science and machine learning methods to real-world problems through hands-on coursework, industry collaborations, and a two-quarter capstone or research project. For a more detailed breakdown of the differences between the two programs, check out this article here. - When will I receive my Master's in Applied Data Science admission decision? - International Students - Which Master's in Applied Data Science provides students with visas? The full-time, In-Person 1-Year 12-Course Program & 2-Year Thesis Track 18-Course Program are visa eligible. - What is the total cost of tuition for the Master's in Applied Data Science program? Please refer to the Tuition, Fees, and Aid webpage. - Is the Master's in Applied Data Science an approved OPT/STEM program? Yes, the full-time, In-Person Master’s in Applied Data Science program is listed as a STEM-designated degree by the U.S. Department of Homeland Security for the purposes of the STEM OPT extension, allowing"
2,f94e1b97-acf1-4d5d-9041-56f274cdcfec,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,2,"admission decision? - International Students - Which Master's in Applied Data Science provides students with visas? The full-time, In-Person 1-Year 12-Course Program & 2-Year Thesis Track 18-Course Program are visa eligible. - What is the total cost of tuition for the Master's in Applied Data Science program? Please refer to the Tuition, Fees, and Aid webpage. - Is the Master's in Applied Data Science an approved OPT/STEM program? Yes, the full-time, In-Person Master’s in Applied Data Science program is listed as a STEM-designated degree by the U.S. Department of Homeland Security for the purposes of the STEM OPT extension, allowing eligible students to apply. However, approval of STEM OPT is at the discretion of U.S. Citizenship & Immigration Services. - Does the Master's in Applied Data Science program offer Curricular Practical Training (CPT)? Please refer to the Curriculum Practical Training webpage. - I have worked in the U.S. for more than two years. Does that mean that I am exempt from the TOEFL/IELTS requirement? Please refer to the English Language Proficiency guidelines. - Which Master's in Applied Data Science provides students with visas? - Online Program - If I am a student in the In-Person Master's in Applied Data Science Program, may I take courses in the Online Program? Conversely, if I am a student in the Online Program, may I take courses in the In-Person program? Currently, students may only take Master’s in Applied Data Science courses in the modality in which they are officially enrolled. - Do I need to be a US citizen or permanent resident to apply to Master's in Applied Data Science Online Program? No, students do not have to be US citizen or resident to partake in the Online Program. Please note that the Online Program is not eligible for visa sponsorship. 
- How will enrolling in Master's in Applied Data Science Online Program impact my schedule? Are classes held synchronously, asynchronously, or both? Classes generally take place on evenings and weekends in order to allow our students and instructors to maintain their professional schedules. The Master’s in Applied Data Science Online Program is both synchronous and asynchronous. The same as our In-Person program, students are required to participate in weekly, live meetings with their instructors and peers, complete readings and coursework, and engage in discussion. - Will enrolling in Master's in Applied Data Science Online Program give me the opportunity to network with on-campus students, faculty/instructors, and advisors? Yes. All Master’s in Applied Data Science Online Program students are invited to an annual ‘Immersion Weekend’ where attendees have opportunities to network and participate in other activities. On a rolling basis, our Career Services team will advertise additional opportunities to connect with employers and peers (e.g., virtual career fairs, virtual career advising/coaching appointments, and more). - What value do employers place on the Master's in Applied Data Science Online degree? The value employers place on the Master’s in Applied Data Science degree is significant. As they hire Data Scientist, Data Engineers, and Data Analysts"
3,692def8b-60a7-47e7-adaa-b84ca6d06248,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,3,"on-campus students, faculty/instructors, and advisors? Yes. All Master’s in Applied Data Science Online Program students are invited to an annual ‘Immersion Weekend’ where attendees have opportunities to network and participate in other activities. On a rolling basis, our Career Services team will advertise additional opportunities to connect with employers and peers (e.g., virtual career fairs, virtual career advising/coaching appointments, and more). - What value do employers place on the Master's in Applied Data Science Online degree? The value employers place on the Master’s in Applied Data Science degree is significant. As they hire Data Scientist, Data Engineers, and Data Analysts from the University of Chicago the expectations for technical competence, communication and influence skills, and exposure to advanced Data Science evolving technologies is high. The skills learned in the program translate directly into practice due to the program’s balance between theory and rigorous application experience developed in coursework and the Capstone project work delivered across the curriculum. - Is the Master's in Applied Data Science Online program equally academically rigorous as the In-Person program? Yes. The Online Program curriculum is overseen by the same faculty curriculum committee as the In-Person program. Both programs are jointly reviewed and are held to the same high standards. Additionally, both programs are granted by the University of Chicago Physical Sciences Division. - Will my diploma indicate I completed the Master's in Applied Data Science Program Online? No, your diploma will not include ‘Online’ in the name of your degree. - If I am a student in the In-Person Master's in Applied Data Science Program, may I take courses in the Online Program? 
Conversely, if I am a student in the Online Program, may I take courses in the In-Person program? - MBA/MS - How do I apply to the MBA/MS joint degree program? Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process. Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest. An MBA/MS program supplement will be available for completion within your Booth application. The supplement contains Applied Data Science specific questions that will be reviewed by the Applied Data Science admissions team along with your full Booth application. For complete consideration, applicants should complete the MBA application and the joint degree program supplement in the same application round prior to submitting the application. - What courses will I take in the MBA/MS program? As a student in the joint-degree MBA and Applied Data Science program, you’ll take the equivalent of 23 100-unit courses: - 14 MBA classes - 9 data science courses - Leadership Effectiveness and Development (LEAD) - Qualified Work Experience, a noncredit professional internship experience Your Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA. - Will MBA/MS courses be in-person or online? Your Booth courses will be in person, while your"
4,e1a5e210-dd24-4016-adef-19141a3dadc9,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,4,"courses will I take in the MBA/MS program? As a student in the joint-degree MBA and Applied Data Science program, you’ll take the equivalent of 23 100-unit courses: - 14 MBA classes - 9 data science courses - Leadership Effectiveness and Development (LEAD) - Qualified Work Experience, a noncredit professional internship experience Your Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA. - Will MBA/MS courses be in-person or online? Your Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA. A combination of online and in-person courses gives you flexibility in course scheduling, and you’ll earn two degrees in the time it would take to complete the MBA alone. - Are standardized tests required for admission? As part of the online application, candidates will be required to submit a GMAT or GRE score for the joint program. International applicants may be required to submit proof of English language proficiency by submitting a TOEFL iBT or IELTS test score. The minimum TOEFL iBT score required for admission is 104; the minimum IELTS score required is 7. Proof of English proficiency may be waived under certain criteria noted by UChicago GRAD Admissions. - What are the main differences in programs and outcomes between the MBA/MS in Applied Data Science compared to Computer Science? The fields of Statistics, Mathematics, and Computer Science intersect with industry domains in different ways. The MPCS program focuses on the center of Computer Science, including Software Engineering, High Performance Computing, Data Analytics, and Application Development. 
The MS-ADS Program focuses at the intersection of multiple fields, such as Computer Science, Mathematics, and Statistics (including Statistical Inference, Linear/Non-Linear Models, Machine Learning, Natural Language Processing, and Deep Learning). The outcomes for MPCS students include Software Engineer (Developer), Senior Software Engineering Management, Software/Hardware Architect, and Senior Cyber Security Engineer. The outcomes for students in MS-ADS include roles as Data Scientist (most common), Senior Data Science Consultant, Business Intelligence (BI) Director, Data Visualization Manager, Data Analytics Engineer, and AI Solution Architect. - How do I apply to the MBA/MS joint degree program? - 2-Year Thesis Track Program - What is the new 2-year thesis track program (18 courses), and how is it different from the 1-year program (12 courses)? Beginning in academic 2026-27, a limited number of In-Person, Full-Time students will have the opportunity to complete a 2-year version of the MS in Applied Data Science program. The 2-year program is completed over 21 months (2 academic years). Students in the 2-year program will complete 18 instead of 12 courses. The additional 6 courses consist of 4 additional elective courses (100 units each); and 2 required thesis courses (100 units each) that culminate in the completion of a required written thesis or thesis project. The thesis will be an extension of"
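
The preview above shows only `faq` rows, so a per-type summary can help confirm that the other page types were tagged as expected. The following is a small sketch using only the standard library; `summarize_by_page_type` and the sample records are hypothetical, and the records only need the `page_type` and `text` keys written to the JSONL file.

```python
from collections import defaultdict


def summarize_by_page_type(records):
    """Map page_type -> (chunk count, mean words per chunk)."""
    lengths = defaultdict(list)
    for rec in records:
        lengths[rec["page_type"]].append(len(rec["text"].split()))
    return {pt: (len(v), sum(v) / len(v)) for pt, v in lengths.items()}


sample = [
    {"page_type": "faq", "text": "one two three four"},
    {"page_type": "faq", "text": "five six"},
    {"page_type": "capstone", "text": "seven eight nine"},
]
print(summarize_by_page_type(sample))
# {'faq': (2, 3.0), 'capstone': (1, 3.0)}
```

Passing the list of chunk dictionaries built in the chunking cell would give the same summary for the full crawl.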


In [32]:
import pandas as pd

df_chunks = pd.read_json("/content/msads_chunks_trafilatura.jsonl", lines=True)
print("Chunk rows:", df_chunks.shape[0])
df_chunks.head()


Chunk rows: 1164


Unnamed: 0,id,url,page_title,page_type,chunk_index,text
0,af09e1b6-a74f-4aa7-9c2c-57beaff5b025,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,0,"FAQs Master’s in Applied Data Science FAQs Learn more about what makes our program unique. David Uminsky, PhD – UChicago Data Science Institute, Executive Director - Application Process - When will I receive my Master's in Applied Data Science admission decision? Admissions decisions are typically released 1-2 months after each application deadline. Only completed applications are reviewed. Please refer to the How to Apply page for guidelines. - If I finish my Master's in Applied Data Science application before the deadline, will I receive my decision early? No, admissions decisions for the in-person program are typically released 1-2 months after each application deadline. Your application must be complete to be considered for review. - How do I submit the materials that will accompany my Master's in Applied Data Science application? Please review the How to Apply page. - Does the admissions office allow recommenders to email their letter directly as an attachment to be included in an applicant’s file? Unfortunately, no. Recommenders must upload their letter of support by using the URL that is sent to them electronically by our online application system. - Do I need to provide my recommenders with instructions? No. Recommendation forms and instructions are sent electronically to recommenders once their names are entered within the online application. - My recommender did not receive notification, can I resend it? Yes. If a recommender does not receive a URL, the applicant can resend the link through the online application or ask the recommender to check their spam folder. - What materials do I need to submit to accompany my application for admission to the Masters in Applied Data Science program? Please review the How to Apply page. 
- Once I upload my unofficial transcripts to my application, do I still need to provide an official transcript? You must upload one unofficial transcript from each university you attended within your application. An unofficial undergraduate transcript is required, even if you hold advanced degrees. Do not mail transcripts with your application; only uploads are needed for evaluation. If admitted, you will need to submit official transcripts from each university before matriculation. - Is the GRE or GMAT required for the Master's in Applied Data Science program? No, the GRE/GMAT is not required for admissions. - I took the GRE and/or GMAT and want to include my score(s) with my Master's in Applied Data Science application. While the GRE/GMAT is not required, applicants can still submit their scores. The GRE school code is 1832; the GMAT school code is H9X-WG-70. - Who is exempt from providing proof of English proficiency? Please refer to the University of Chicago’s English Language Proficiency requirements. - How will I be notified that I am admitted to the Master's in Applied Data Science program? Applicants will be notified to check their application portal via the email they used to submit their application. - If I am admitted to the Master's in Applied Data Science program, what do I do next? Have official e-transcripts sent to"
1,71aed5b5-c0b2-4173-ac54-1467efaaec4e,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,1,"not required, applicants can still submit their scores. The GRE school code is 1832; the GMAT school code is H9X-WG-70. - Who is exempt from providing proof of English proficiency? Please refer to the University of Chicago’s English Language Proficiency requirements. - How will I be notified that I am admitted to the Master's in Applied Data Science program? Applicants will be notified to check their application portal via the email they used to submit their application. - If I am admitted to the Master's in Applied Data Science program, what do I do next? Have official e-transcripts sent to applieddatascience-admissions@uchicago.edu. If your institution cannot send your documents electronically, please have them send your transcripts to the following mailing address: The University of Chicago Attention: MS in Applied Data Science Admissions455 N Cityfront Plaza Dr., Suite 2800Chicago, Illinois 60611 - What test scores does UChicago accept as proof of English proficiency? Please refer to the Proof of English Proficiency guidelines. - What are the minimum scores required? Please refer to the Required Minimum Score guidelines. - Where do I send my test scores? Please send TOEFL scores to the University of Chicago using these instructions at the bottom of the page. - I took the TOEFL over two years ago. Can I still use those TOEFL results? Please refer to the Validity guidelines. - What’s the difference between the MS in Applied Data Science and the MS in Data Science programs at UChicago? The two programs share a strong foundation in data science but differ in structure, location, and focus. - The MS in Data Science is a 10-course program based on UChicago’s Hyde Park campus. 
It includes a comprehensive research project and is designed for students who want to dive deeper into the theoretical and research side of data science—often as preparation for PhD study or research-focused careers. - The MS in Applied Data Science program offers both a 12-course track and an 18-course thesis track. It is offered in online and in-person formats, with in-person classes held at the NBC Tower in downtown Chicago. The program focuses on applying data science and machine learning methods to real-world problems through hands-on coursework, industry collaborations, and a two-quarter capstone or research project. For a more detailed breakdown of the differences between the two programs, check out this article here. - When will I receive my Master's in Applied Data Science admission decision? - International Students - Which Master's in Applied Data Science provides students with visas? The full-time, In-Person 1-Year 12-Course Program & 2-Year Thesis Track 18-Course Program are visa eligible. - What is the total cost of tuition for the Master's in Applied Data Science program? Please refer to the Tuition, Fees, and Aid webpage. - Is the Master's in Applied Data Science an approved OPT/STEM program? Yes, the full-time, In-Person Master’s in Applied Data Science program is listed as a STEM-designated degree by the U.S. Department of Homeland Security for the purposes of the STEM OPT extension, allowing"
2,f94e1b97-acf1-4d5d-9041-56f274cdcfec,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,2,"admission decision? - International Students - Which Master's in Applied Data Science provides students with visas? The full-time, In-Person 1-Year 12-Course Program & 2-Year Thesis Track 18-Course Program are visa eligible. - What is the total cost of tuition for the Master's in Applied Data Science program? Please refer to the Tuition, Fees, and Aid webpage. - Is the Master's in Applied Data Science an approved OPT/STEM program? Yes, the full-time, In-Person Master’s in Applied Data Science program is listed as a STEM-designated degree by the U.S. Department of Homeland Security for the purposes of the STEM OPT extension, allowing eligible students to apply. However, approval of STEM OPT is at the discretion of U.S. Citizenship & Immigration Services. - Does the Master's in Applied Data Science program offer Curricular Practical Training (CPT)? Please refer to the Curriculum Practical Training webpage. - I have worked in the U.S. for more than two years. Does that mean that I am exempt from the TOEFL/IELTS requirement? Please refer to the English Language Proficiency guidelines. - Which Master's in Applied Data Science provides students with visas? - Online Program - If I am a student in the In-Person Master's in Applied Data Science Program, may I take courses in the Online Program? Conversely, if I am a student in the Online Program, may I take courses in the In-Person program? Currently, students may only take Master’s in Applied Data Science courses in the modality in which they are officially enrolled. - Do I need to be a US citizen or permanent resident to apply to Master's in Applied Data Science Online Program? No, students do not have to be US citizen or resident to partake in the Online Program. Please note that the Online Program is not eligible for visa sponsorship. 
- How will enrolling in Master's in Applied Data Science Online Program impact my schedule? Are classes held synchronously, asynchronously, or both? Classes generally take place on evenings and weekends in order to allow our students and instructors to maintain their professional schedules. The Master’s in Applied Data Science Online Program is both synchronous and asynchronous. The same as our In-Person program, students are required to participate in weekly, live meetings with their instructors and peers, complete readings and coursework, and engage in discussion. - Will enrolling in Master's in Applied Data Science Online Program give me the opportunity to network with on-campus students, faculty/instructors, and advisors? Yes. All Master’s in Applied Data Science Online Program students are invited to an annual ‘Immersion Weekend’ where attendees have opportunities to network and participate in other activities. On a rolling basis, our Career Services team will advertise additional opportunities to connect with employers and peers (e.g., virtual career fairs, virtual career advising/coaching appointments, and more). - What value do employers place on the Master's in Applied Data Science Online degree? The value employers place on the Master’s in Applied Data Science degree is significant. As they hire Data Scientist, Data Engineers, and Data Analysts"
3,692def8b-60a7-47e7-adaa-b84ca6d06248,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,3,"on-campus students, faculty/instructors, and advisors? Yes. All Master’s in Applied Data Science Online Program students are invited to an annual ‘Immersion Weekend’ where attendees have opportunities to network and participate in other activities. On a rolling basis, our Career Services team will advertise additional opportunities to connect with employers and peers (e.g., virtual career fairs, virtual career advising/coaching appointments, and more). - What value do employers place on the Master's in Applied Data Science Online degree? The value employers place on the Master’s in Applied Data Science degree is significant. As they hire Data Scientist, Data Engineers, and Data Analysts from the University of Chicago the expectations for technical competence, communication and influence skills, and exposure to advanced Data Science evolving technologies is high. The skills learned in the program translate directly into practice due to the program’s balance between theory and rigorous application experience developed in coursework and the Capstone project work delivered across the curriculum. - Is the Master's in Applied Data Science Online program equally academically rigorous as the In-Person program? Yes. The Online Program curriculum is overseen by the same faculty curriculum committee as the In-Person program. Both programs are jointly reviewed and are held to the same high standards. Additionally, both programs are granted by the University of Chicago Physical Sciences Division. - Will my diploma indicate I completed the Master's in Applied Data Science Program Online? No, your diploma will not include ‘Online’ in the name of your degree. - If I am a student in the In-Person Master's in Applied Data Science Program, may I take courses in the Online Program? 
Conversely, if I am a student in the Online Program, may I take courses in the In-Person program? - MBA/MS - How do I apply to the MBA/MS joint degree program? Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process. Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest. An MBA/MS program supplement will be available for completion within your Booth application. The supplement contains Applied Data Science specific questions that will be reviewed by the Applied Data Science admissions team along with your full Booth application. For complete consideration, applicants should complete the MBA application and the joint degree program supplement in the same application round prior to submitting the application. - What courses will I take in the MBA/MS program? As a student in the joint-degree MBA and Applied Data Science program, you’ll take the equivalent of 23 100-unit courses: - 14 MBA classes - 9 data science courses - Leadership Effectiveness and Development (LEAD) - Qualified Work Experience, a noncredit professional internship experience Your Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA. - Will MBA/MS courses be in-person or online? Your Booth courses will be in person, while your"
4,e1a5e210-dd24-4016-adef-19141a3dadc9,https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/faqs/,FAQs - DSI,faq,4,"courses will I take in the MBA/MS program? As a student in the joint-degree MBA and Applied Data Science program, you’ll take the equivalent of 23 100-unit courses: - 14 MBA classes - 9 data science courses - Leadership Effectiveness and Development (LEAD) - Qualified Work Experience, a noncredit professional internship experience Your Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA. - Will MBA/MS courses be in-person or online? Your Booth courses will be in person, while your MS courses will be online. Most students will earn both degrees in seven quarters—the same time it takes to earn the MBA. A combination of online and in-person courses gives you flexibility in course scheduling, and you’ll earn two degrees in the time it would take to complete the MBA alone. - Are standardized tests required for admission? As part of the online application, candidates will be required to submit a GMAT or GRE score for the joint program. International applicants may be required to submit proof of English language proficiency by submitting a TOEFL iBT or IELTS test score. The minimum TOEFL iBT score required for admission is 104; the minimum IELTS score required is 7. Proof of English proficiency may be waived under certain criteria noted by UChicago GRAD Admissions. - What are the main differences in programs and outcomes between the MBA/MS in Applied Data Science compared to Computer Science? The fields of Statistics, Mathematics, and Computer Science intersect with industry domains in different ways. The MPCS program focuses on the center of Computer Science, including Software Engineering, High Performance Computing, Data Analytics, and Application Development. 
The MS-ADS Program focuses at the intersection of multiple fields, such as Computer Science, Mathematics, and Statistics (including Statistical Inference, Linear/Non-Linear Models, Machine Learning, Natural Language Processing, and Deep Learning). The outcomes for MPCS students include Software Engineer (Developer), Senior Software Engineering Management, Software/Hardware Architect, and Senior Cyber Security Engineer. The outcomes for students in MS-ADS include roles as Data Scientist (most common), Senior Data Science Consultant, Business Intelligence (BI) Director, Data Visualization Manager, Data Analytics Engineer, and AI Solution Architect. - How do I apply to the MBA/MS joint degree program? - 2-Year Thesis Track Program - What is the new 2-year thesis track program (18 courses), and how is it different from the 1-year program (12 courses)? Beginning in academic 2026-27, a limited number of In-Person, Full-Time students will have the opportunity to complete a 2-year version of the MS in Applied Data Science program. The 2-year program is completed over 21 months (2 academic years). Students in the 2-year program will complete 18 instead of 12 courses. The additional 6 courses consist of 4 additional elective courses (100 units each); and 2 required thesis courses (100 units each) that culminate in the completion of a required written thesis or thesis project. The thesis will be an extension of"


In [33]:
def check_coverage(df, phrases):
    # Count chunks containing each phrase (literal, case-insensitive match)
    for p in phrases:
        hits = df["text"].str.contains(p, case=False, na=False, regex=False).sum()
        print(f"'{p}' -> {hits} chunks")

check_coverage(
    df_chunks,
    [
        "capstone",
        "career outcomes",
        "tuition",
        "application deadline",
        "online program",
        "in-person program",
        "MS in Applied Data Science",
        "visa",
        "graduation"
    ]
)


'capstone' -> 338 chunks
'career outcomes' -> 2 chunks
'tuition' -> 160 chunks
'application deadline' -> 46 chunks
'online program' -> 256 chunks
'in-person program' -> 177 chunks
'MS in Applied Data Science' -> 238 chunks
'visa' -> 99 chunks
'graduation' -> 82 chunks


# ***Implementing Retrieval-Augmented Generation (RAG)***

In [34]:
!pip install -q sentence-transformers chromadb pandas tqdm

In [35]:
import pandas as pd
import numpy as np
import json
from typing import List, Dict
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# STEP 1: Load the Chunked Data from Part 1
print("STEP 1: Loading Chunked Data from Web Scraping")

# Load the JSONL file created in Part 1 (web scraping)
CHUNKS_FILE = "/content/msads_chunks_trafilatura.jsonl"

df_chunks = pd.read_json(CHUNKS_FILE, lines=True)

print(f"Loaded {len(df_chunks)} chunks")
print(f"Columns: {df_chunks.columns.tolist()}")
print(f"\nFirst few rows:")
print(df_chunks.head(3)[['page_title', 'page_type', 'text']])

# Data statistics
print(f"\n Statistics:")
print(f"Total chunks: {len(df_chunks)}")
print(f"Unique pages: {df_chunks['url'].nunique()}")
print(f"Page types: {df_chunks['page_type'].nunique()}")
print(f"Page type distribution:")
for page_type, count in df_chunks['page_type'].value_counts().head(10).items():
    print(f"    - {page_type}: {count}")

STEP 1: Loading Chunked Data from Web Scraping
Loaded 1164 chunks
Columns: ['id', 'url', 'page_title', 'page_type', 'chunk_index', 'text']

First few rows:
   page_title page_type  \
0  FAQs - DSI       faq   
1  FAQs - DSI       faq   
2  FAQs - DSI       faq   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               

In [36]:
# STEP 1.5: DEDUPLICATE CHUNKS

print("STEP 1.5: Deduplicating Chunks")

# Remove URL anchors to deduplicate
df_chunks['url_clean'] = df_chunks['url'].str.split('#').str[0]

# Keep only unique URL+chunk combinations
df_chunks_deduped = df_chunks.drop_duplicates(subset=['url_clean', 'chunk_index'])

print(f"Before deduplication: {len(df_chunks)} chunks")
print(f"After deduplication: {len(df_chunks_deduped)} chunks")
print(f"Removed: {len(df_chunks) - len(df_chunks_deduped)} duplicates")

# Save deduplicated version
df_chunks_deduped.to_json("/content/msads_chunks_trafilatura_deduped.jsonl",
                           orient='records',
                           lines=True)

print(f"Saved deduplicated chunks to /content/msads_chunks_trafilatura_deduped.jsonl")

# Use deduplicated data for the rest of the pipeline
df_chunks = df_chunks_deduped.copy()

STEP 1.5: Deduplicating Chunks
Before deduplication: 1164 chunks
After deduplication: 166 chunks
Removed: 998 duplicates
Saved deduplicated chunks to /content/msads_chunks_trafilatura_deduped.jsonl
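
The retrieval test later in this notebook returns two "Online Program" results with identical scores, one of whose URLs ends in `%20`, which hints that stripping the `#` fragment alone may leave some duplicates behind. Below is a minimal sketch of a stricter pass, assuming that chunks with identical text can safely be treated as duplicates. `normalize_url` and `dedupe_chunks` are hypothetical helpers, not part of the original pipeline, and the example URLs are made up.

```python
import hashlib
from urllib.parse import unquote, urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Drop the fragment, lowercase the host, and trim percent-encoded whitespace."""
    parts = urlsplit(url.strip())
    path = unquote(parts.path).strip()  # "/page/%20" becomes "/page/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def dedupe_chunks(chunks):
    """Keep the first chunk for each distinct (whitespace-normalized) text."""
    seen = set()
    unique = []
    for chunk in chunks:
        canonical_text = " ".join(chunk["text"].split())
        digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append({**chunk, "url_clean": normalize_url(chunk["url"])})
    return unique

# Illustrative input only
chunks = [
    {"url": "https://Example.edu/online-program/", "text": "Twelve courses."},
    {"url": "https://example.edu/online-program/%20", "text": "Twelve  courses."},
    {"url": "https://example.edu/faqs/#top", "text": "FAQ text."},
]
deduped = dedupe_chunks(chunks)
print(len(deduped), [c["url_clean"] for c in deduped])
```

Hashing the text rather than the `(url, chunk_index)` pair also catches the same content served under different URLs, at the cost of dropping legitimately repeated passages.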


In [37]:
# STEP 2: Initialize Embedding Model
print("STEP 2: Initialize Embedding Model")

# Load the embedding model
# Using 'all-MiniLM-L6-v2' - efficient and good quality for semantic search
MODEL_NAME = "all-MiniLM-L6-v2"

print(f"Loading embedding model: {MODEL_NAME}...")
embedding_model = SentenceTransformer(MODEL_NAME)

print(f"Model loaded successfully")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

STEP 2: Initialize Embedding Model
Loading embedding model: all-MiniLM-L6-v2...
Model loaded successfully
Embedding dimension: 384


In [38]:
# STEP 2.1: Generate Embeddings for All Chunks

print("STEP 2.1: Generating Embeddings")

# Extract all text chunks
texts = df_chunks['text'].tolist()

# Generate embeddings in batches for efficiency
BATCH_SIZE = 32

def generate_embeddings(texts: List[str], model, batch_size: int = 32):
    all_embeddings = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
        batch = texts[i:i + batch_size]
        embeddings = model.encode(batch,
                                   show_progress_bar=False,
                                   convert_to_numpy=True,
                                   normalize_embeddings=True)
        all_embeddings.append(embeddings)

    return np.vstack(all_embeddings)

# Generate embeddings
embeddings = generate_embeddings(texts, embedding_model, BATCH_SIZE)

print(f"Generated embeddings for {embeddings.shape[0]} chunks")
print(f"Embedding shape: {embeddings.shape}")

# Add embeddings to dataframe
df_chunks['embedding'] = list(embeddings)

STEP 2.1: Generating Embeddings


Generating embeddings: 100%|██████████| 6/6 [00:00<00:00, 21.24it/s]

Generated embeddings for 166 chunks
Embedding shape: (166, 384)





In [39]:
# STEP 3: Setup Vector Database (ChromaDB)

print("STEP 3: Setting up Vector Database (ChromaDB)")

# Initialize ChromaDB client with persistence
PERSIST_DIR = "./msads_chroma_db"
COLLECTION_NAME = "msads_knowledge_base"

chroma_client = chromadb.Client(Settings(
    anonymized_telemetry=False,
    is_persistent=True,
    persist_directory=PERSIST_DIR
))

# Delete existing collection if it exists (for fresh start)
try:
    chroma_client.delete_collection(name=COLLECTION_NAME)
    print("Deleted existing collection")
except Exception:
    # The collection does not exist yet, so there is nothing to delete
    pass

# Create new collection
collection = chroma_client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "MS in Applied Data Science program knowledge base"}
)

print(f"Created collection: {COLLECTION_NAME}")

STEP 3: Setting up Vector Database (ChromaDB)
Deleted existing collection
Created collection: msads_knowledge_base


In [40]:
# STEP 3.1: Store Embeddings in ChromaDB

print("STEP 3.1: Storing Embeddings in Vector Database")

# Add documents to ChromaDB in batches
STORE_BATCH_SIZE = 100

for i in tqdm(range(0, len(df_chunks), STORE_BATCH_SIZE), desc="Storing in ChromaDB"):
    batch_df = df_chunks.iloc[i:i + STORE_BATCH_SIZE]

    # Prepare data
    ids = batch_df['id'].tolist()
    documents = batch_df['text'].tolist()
    embeddings_batch = batch_df['embedding'].tolist()

    # Prepare metadata
    metadatas = []
    for _, row in batch_df.iterrows():
        metadatas.append({
            "url": row['url'],
            "page_title": row['page_title'],
            "page_type": row['page_type'],
            "chunk_index": int(row['chunk_index'])
        })

    # Add to collection
    collection.add(
        ids=ids,
        documents=documents,
        embeddings=embeddings_batch,
        metadatas=metadatas
    )

print(f"Stored {collection.count()} documents in vector database")

STEP 3.1: Storing Embeddings in Vector Database


Storing in ChromaDB: 100%|██████████| 2/2 [00:00<00:00,  5.92it/s]

Stored 166 documents in vector database





In [41]:
# STEP 4: Implement RAG Retrieval Function

print("STEP 4: Implementing RAG Retrieval System")

def retrieve_context(query: str, top_k: int = 5):
    # Generate query embedding
    query_embedding = embedding_model.encode(query,
                                              convert_to_numpy=True,
                                              normalize_embeddings=True)

    # Query the vector database
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    return results

def format_rag_response(query: str, top_k: int = 5):
    # Retrieve relevant chunks
    results = retrieve_context(query, top_k)

    # Format context
    context_parts = []
    sources = []

    for i, (doc, meta, dist) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
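        # Convert distance to a score where higher is better. Note: the collection uses
        # Chroma's default squared-L2 space, so for unit-normalized embeddings
        # dist = 2 - 2*cos and this score equals 2*cos - 1 rather than cosine itself.
        similarity = 1 - dist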

        context_parts.append(f"[Source {i+1}]: {doc}")
        sources.append({
            "source_number": i + 1,
            "page_title": meta['page_title'],
            "url": meta['url'],
            "page_type": meta['page_type'],
            "similarity_score": round(similarity, 3),
            "text_preview": doc[:200] + "..."
        })

    context = "\n\n".join(context_parts)

    return {
        "query": query,
        "context": context,
        "sources": sources,
        "num_sources": len(sources)
    }

print("RAG retrieval functions implemented")

STEP 4: Implementing RAG Retrieval System
RAG retrieval functions implemented
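
How to read `similarity = 1 - dist` depends on the distance the collection uses. The collection in this notebook is created without an `hnsw:space` setting, so it falls back to Chroma's default of squared L2, and for unit-length vectors squared L2 equals `2 - 2·cos`. Below is a small sketch of that relationship, assuming unit-normalized embeddings as produced by `normalize_embeddings=True`. The helper names and the toy vectors are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(dist):
    """Recover cosine similarity from squared L2 on unit vectors."""
    return 1.0 - dist / 2.0

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
dist = squared_l2(a, b)

print(f"cosine            = {cosine(a, b):.4f}")
print(f"1 - dist / 2      = {cosine_from_squared_l2(dist):.4f}")
print(f"1 - dist (as-is)  = {1 - dist:.4f}")
```

Under this assumption, a reported score of 0.432 would correspond to a cosine similarity of roughly 0.716. If the collection were instead created with `metadata={"hnsw:space": "cosine"}`, Chroma would return cosine distance and `1 - dist` would be the cosine similarity directly.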


In [42]:
# STEP 5: Test the RAG System
print("STEP 5: Testing the RAG System")

def test_rag_query(query: str, top_k: int = 3):
    print(f"QUERY: {query}")

    response = format_rag_response(query, top_k)

    print(f"Retrieved {response['num_sources']} relevant sources:\n")

    for source in response['sources']:
        print(f"[{source['source_number']}] {source['page_title']}")
        print(f"Similarity: {source['similarity_score']:.3f}")
        print(f"URL: {source['url']}")
        print(f"Type: {source['page_type']}")
        print(f"Preview: {source['text_preview']}")
        print()

    print(f"\nFULL CONTEXT FOR LLM:")
    print(response['context'][:500] + "...")

    return response

# Test with sample queries
test_queries = [
    "What are the core courses in the MS in Applied Data Science program?",
    "What are the admission requirements for the MS in Applied Data Science program?",
    "Can you provide information about the capstone project?"
]

print("\nTesting with sample queries:\n")

for query in test_queries[:2]:
    test_rag_query(query, top_k=3)

STEP 5: Testing the RAG System

Testing with sample queries:

QUERY: What are the core courses in the MS in Applied Data Science program?
Retrieved 3 relevant sources:

[1] Master's Programs - DSI
Similarity: 0.432
URL: https://datascience.uchicago.edu/education/masters-programs/
Type: education_general
Preview: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciti...

[2] Online Program - DSI
Similarity: 0.373
URL: https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/online-program/%20
Type: online_program
Preview: Science by successfully completing 12 courses (6 core, 4 elective, 2 Capstone) and our tailored Career Seminar*. Our rigorous curriculum is designed by and for data science innovators and leaders. Cou...

[3] Online Program - DSI
Similarity: 0.373
URL: https://datascience.uchicago.edu/education/m

In [43]:
# STEP 6: Create Simple Q&A Function (Without LLM)

print("STEP 6: Simple Q&A Function (Context Retrieval Only)")

def answer_question(question: str, top_k: int = 5, verbose: bool = True):

    response = format_rag_response(question, top_k)

    if verbose:
        print(f"\nQuestion: {question}\n")
        print(f"Answer based on {response['num_sources']} sources:\n")
        print(response['context'])
        print(f"\n\nSources:")
        for src in response['sources']:
            print(f"{src['page_title']} ({src['similarity_score']:.2f} relevance)")
            print(f"{src['url']}")

    return response

print("Q&A function ready")

# Example usage
print("Example: Answering a Question")

answer_question("What are the core courses in the MS in Applied Data Science program?", top_k=3)

STEP 6: Simple Q&A Function (Context Retrieval Only)
Q&A function ready
Example: Answering a Question

Question: What are the core courses in the MS in Applied Data Science program?

Answer based on 3 sources:

[Source 1]: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciting field of data science. Rigorous classes, expert instructors, leading-edge technology, and an unparalleled network support your student experience as a full- or part-time learner. MS in Computational Analysis and Public Policy (MSCAPP) A rigorous, two-year program offered jointly by the Harris School of Public Policy and the UChicago Department of Computer Science. MSCAPP students work with external social impact organizations through our Data Science Clinic and Community Data Fellows program. MS in Data Science The Master’s in Data Science (MSDS) has been developed fo

{'query': 'What are the core courses in the MS in Applied Data Science program?',
 'context': '[Source 1]: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciting field of data science. Rigorous classes, expert instructors, leading-edge technology, and an unparalleled network support your student experience as a full- or part-time learner. MS in Computational Analysis and Public Policy (MSCAPP) A rigorous, two-year program offered jointly by the Harris School of Public Policy and the UChicago Department of Computer Science. MSCAPP students work with external social impact organizations through our Data Science Clinic and Community Data Fellows program. MS in Data Science The Master’s in Data Science (MSDS) has been developed for students interested in pursuing a research career in data science with courses taught by faculty from the departme

In [44]:
# STEP 7: Save the RAG System Configuration
print("STEP 7: Saving RAG System Configuration")

config = {
    "embedding_model": MODEL_NAME,
    "collection_name": COLLECTION_NAME,
    "persist_directory": PERSIST_DIR,
    "total_chunks": len(df_chunks),
    "embedding_dimension": embedding_model.get_sentence_embedding_dimension(),
    "unique_pages": df_chunks['url'].nunique(),
    "batch_size": BATCH_SIZE
}

# Save configuration
with open("/content/rag_config.json", "w") as f:
    json.dump(config, f, indent=2)

print("Configuration saved to /content/rag_config.json")
print("\nConfiguration:")
for key, value in config.items():
    print(f"  {key}: {value}")

STEP 7: Saving RAG System Configuration
Configuration saved to /content/rag_config.json

Configuration:
  embedding_model: all-MiniLM-L6-v2
  collection_name: msads_knowledge_base
  persist_directory: ./msads_chroma_db
  total_chunks: 166
  embedding_dimension: 384
  unique_pages: 34
  batch_size: 32
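
A saved configuration is most useful if a later session checks it before reconnecting to the vector store. The sketch below shows one way to reload and validate such a file. It is a hedged illustration: `load_rag_config` and `REQUIRED_KEYS` are hypothetical names, and the round trip uses a temporary directory rather than `/content`.

```python
import json
import os
import tempfile

# Keys a later session would need in order to reopen the same collection
REQUIRED_KEYS = {
    "embedding_model",
    "collection_name",
    "persist_directory",
    "embedding_dimension",
}

def load_rag_config(path: str) -> dict:
    """Read the JSON config and fail early if an expected key is missing."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"Config is missing keys: {sorted(missing)}")
    return config

# Round-trip demo using the values printed above
example = {
    "embedding_model": "all-MiniLM-L6-v2",
    "collection_name": "msads_knowledge_base",
    "persist_directory": "./msads_chroma_db",
    "embedding_dimension": 384,
}
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "rag_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(example, f, indent=2)
    loaded = load_rag_config(config_path)

print(loaded["embedding_model"], loaded["embedding_dimension"])
```

Comparing `embedding_dimension` against the model loaded at query time is a cheap guard against querying a collection with a different embedding model than the one used to build it.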


In [45]:
# STEP 8: Create Reusable RAG Class
print("STEP 8: Creating Reusable RAG System Class")

class MSADSRagSystem:

    def __init__(self):
        """Initialize the RAG system with existing data."""
        print("Initializing MSADS RAG System...")

        self.embedding_model = embedding_model
        self.collection = collection

        print(f"System ready with {self.collection.count()} documents")

    def search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Search for relevant information."""
        query_embedding = self.embedding_model.encode(
            query,
            convert_to_numpy=True,
            normalize_embeddings=True
        )

        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )

        formatted_results = []
        for doc, meta, dist in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            formatted_results.append({
                "text": doc,
                "page_title": meta['page_title'],
                "url": meta['url'],
                "page_type": meta['page_type'],
                "similarity_score": round(1 - dist, 3)
            })

        return formatted_results

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """
        Ask a question and get context with sources.

        Returns:
            Dictionary with question, context, and sources
        """
        results = self.search(question, top_k)

        context = "\n\n".join([
            f"[{r['page_title']}]: {r['text']}"
            for r in results
        ])

        return {
            "question": question,
            "context": context,
            "sources": results
        }

    def display_answer(self, question: str, top_k: int = 5):
        """Ask a question and display formatted answer."""
        print(f"\n{'='*80}")
        print(f"Q: {question}")
        print(f"{'='*80}\n")

        response = self.ask(question, top_k)

        print("RELEVANT CONTEXT:\n")
        print(response['context'][:1000])
        if len(response['context']) > 1000:
            print("...\n(truncated)")

        print(f"\n\nSOURCES ({len(response['sources'])}):")
        for i, src in enumerate(response['sources'], 1):
            print(f"\n{i}. {src['page_title']}")
            print(f"   Relevance: {src['similarity_score']:.2f}")
            print(f"   URL: {src['url']}")

        return response

# Initialize the system
rag = MSADSRagSystem()

print("\nRAG System Class created and initialized")

STEP 8: Creating Reusable RAG System Class
Initializing MSADS RAG System...
System ready with 166 documents

RAG System Class created and initialized


In [46]:
# STEP 9: Demo Usage
print("STEP 9: Demo - Using the RAG System")

# Demo queries
demo_questions = [
    "What are the core courses in the MS in Applied Data Science program?",
    "What are the admission requirements for the MS in Applied Data Science program?",
    "Can you provide information about the capstone project?",
]

print("\nDemo: Answering questions about MSADS program\n")

for question in demo_questions[:1]:  # Show 1 example
    rag.display_answer(question, top_k=3)

STEP 9: Demo - Using the RAG System

Demo: Answering questions about MSADS program


Q: What are the core courses in the MS in Applied Data Science program?

RELEVANT CONTEXT:

[Master's Programs - DSI]: Master’s Programs The Data Science Institute supports master’s-level education through three programs: MS in Applied Data Science Our Online and In-Person degree will advance your career in the exciting field of data science. Rigorous classes, expert instructors, leading-edge technology, and an unparalleled network support your student experience as a full- or part-time learner. MS in Computational Analysis and Public Policy (MSCAPP) A rigorous, two-year program offered jointly by the Harris School of Public Policy and the UChicago Department of Computer Science. MSCAPP students work with external social impact organizations through our Data Science Clinic and Community Data Fellows program. MS in Data Science The Master’s in Data Science (MSDS) has been developed for students interest

# ***Deploy RAG Chatbot***
- In this section, we first add a large language model to the RAG system. The prompt, which contains both the retrieved context and the original question, is sent to an OpenAI GPT model that is instructed to use only the provided context when formulating a natural, accurate answer.
- Second, to make the chatbot usable, we launch it as a public web application using Gradio, which provides a simple chat interface and a shareable URL for evaluation.
- **The link to the chatbot interface is:** [MS in Applied Data Science Chatbot](https://d9eb8155416df7dc7c.gradio.live)
- Finally, we include a script that automatically runs the project's sample questions against the bot, so that we can evaluate its accuracy and prepare the results for our presentation.
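
The prompt assembly described in the first bullet can be sketched without calling any API. The snippet below is a simplified illustration, assuming the retrieved chunks arrive as plain strings. `build_messages` is a hypothetical helper, and the system text is abbreviated rather than the exact prompt used in this notebook.

```python
from typing import Dict, List

# Abbreviated stand-in for the full system prompt
SYSTEM_PROMPT = "Answer the QUESTION using only the provided CONTEXT."

def build_messages(question: str, chunks: List[str]) -> List[Dict[str, str]]:
    """Combine retrieved chunks and the question into chat messages."""
    context = "\n\n".join(f"Source: {chunk}" for chunk in chunks)
    user_prompt = f"---\nCONTEXT:\n{context}\n---\nQUESTION:\n{question}\n---"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages(
    "How many courses are required?",
    ["Students complete 12 courses.", "Courses include core and elective options."],
)
for message in messages:
    print(f"[{message['role']}]\n{message['content']}\n")
```

Keeping the instructions in the system message and the retrieved text in the user message makes it easy to change `top_k` or the context format without touching the instructions.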

In [47]:
!pip install -q chromadb openai gradio

In [48]:
import openai
from google.colab import userdata
import textwrap

# --- Configure the OpenAI API Key ---
client = None  # stays None if the key cannot be loaded
try:
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if not OPENAI_API_KEY:
        raise ValueError("Key not found")

    client = openai.OpenAI(api_key=OPENAI_API_KEY)
    print("OpenAI API client configured successfully.")

except Exception as e:
    print(f"Error: Could not find or configure OPENAI_API_KEY ({e}).")
    print("Please create a Colab Secret named 'OPENAI_API_KEY'.")
    print("You can get a key from: https://platform.openai.com/api-keys")

# --- Initialize the Generative Model (for consistency in naming) ---
# We just need the 'client' object; skip the success message if setup failed
llm_model = client
if llm_model is not None:
    print("Initialized OpenAI client.")

OpenAI API client configured successfully.
Initialized OpenAI client.


### Deploy the Chatbot on a user-friendly interface

In [49]:
# --- COMBINED CELL: Load System & Launch UI ---

print("--- Initializing System & Launching Chatbot ---")
print("This may take a moment...")

import pandas as pd
import numpy as np
import json
import warnings
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import openai
from google.colab import userdata
import textwrap
import gradio as gr
import time

warnings.filterwarnings('ignore')

# --- 1. Load Models & DB ---
try:
    print("Loading embedding model...")
    MODEL_NAME = "all-MiniLM-L6-v2"
    embedding_model = SentenceTransformer(MODEL_NAME)

    print("Loading Vector DB from Disk...")
    PERSIST_DIR = "./msads_chroma_db"
    COLLECTION_NAME = "msads_knowledge_base"
    chroma_client = chromadb.Client(Settings(
        anonymized_telemetry=False,
        is_persistent=True,
        persist_directory=PERSIST_DIR
    ))
    collection = chroma_client.get_collection(name=COLLECTION_NAME)
    print(f"Loaded collection '{COLLECTION_NAME}' with {collection.count()} documents.")

    print("Configuring OpenAI API Key...")
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY is None: raise ValueError("Key not found")
    llm_model = openai.OpenAI(api_key=OPENAI_API_KEY)
    print("OpenAI client configured.")

except Exception as e:
    print(f"CRITICAL ERROR during setup: {e}")
    print("This cell cannot continue. Did you run Cell 2 (Build Data) first?")
    print("Do you have the 'OPENAI_API_KEY' secret set?")


# --- 2. Define RAG and Chatbot Classes ---

class MSADSRagSystem:
    def __init__(self, model, collection):
        self.embedding_model = model
        self.collection = collection

    def search(self, query: str, top_k: int = 5):
        query_embedding = self.embedding_model.encode(
            query, convert_to_numpy=True, normalize_embeddings=True
        )
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
        formatted_results = []
        for doc, meta, dist in zip(
            results['documents'][0], results['metadatas'][0], results['distances'][0]
        ):
            formatted_results.append({
                "text": doc, "page_title": meta['page_title'], "url": meta['url'],
                "page_type": meta['page_type'], "similarity_score": round(1 - dist, 3)
            })
        return formatted_results

    def ask(self, question: str, top_k: int = 5):
        results = self.search(question, top_k)
        context = "\n\n".join([f"Source: {r['text']}" for r in results])
        return {"question": question, "context": context, "sources": results}

class GenerativeMSADSChatbot:
    def __init__(self, rag_system, llm_client: openai.OpenAI):
        self.rag = rag_system
        self.client = llm_client
        self.system_prompt = textwrap.dedent("""
            You are an expert assistant for the University of Chicago's
            Master of Science in Applied Data Science (MSADS) program.
            Your task is to answer the user's QUESTION based *only* on the
            provided CONTEXT.
            - Do not use any information outside of the CONTEXT.
            - Be concise and directly answer the question.
            - If the CONTEXT does not contain the answer, state:
              "I'm sorry, I don't have enough information from the website
               to answer that question."
            - Do not make up information or add conversational fluff.
        """)

    def _build_user_prompt(self, question: str, context: str) -> str:
        user_prompt_template = "---\nCONTEXT:\n{context}\n---\nQUESTION:\n{question}\n---"
        return textwrap.dedent(user_prompt_template).format(context=context, question=question)

    def answer(self, question: str, top_k: int = 5):
        rag_response = self.rag.ask(question, top_k=top_k)
        context = rag_response['context']
        sources = rag_response['sources']
        if not sources:
            return {"question": question, "answer": "I'm sorry, I don't have enough information... to answer that question.", "sources": []}

        user_prompt = self._build_user_prompt(question, context)
        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0.0,
                messages=[
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": user_prompt}
                ]
            )
            generated_answer = response.choices[0].message.content
        except Exception as e:
            generated_answer = f"Error: The generative model could not process this request. {e}"
        return {"question": question, "answer": generated_answer.strip(), "sources": sources}

# --- 3. Initialize Chatbot ---
# Module-level handle; it stays None if the initialization below fails
chatbot = None

try:
    rag = MSADSRagSystem(model=embedding_model, collection=collection)
    chatbot = GenerativeMSADSChatbot(rag_system=rag, llm_client=llm_model)
    print("Chatbot is initialized and ready.")
except Exception as e:
    print(f"Error initializing chatbot: {e}")


# --- 4. Define Gradio Functions ---

def format_sources_for_ui(sources):
    if not sources: return "No sources found."
    output = "Sources:\n"
    for i, src in enumerate(sources, 1):
        output += f"{i}. {src['page_title']} (Relevance: {src['similarity_score']:.2f})\n"
        output += f"   URL: {src['url']}\n\n"
    return output

def chat_interface_fn(message, history):
    # Guard against a failed initialization in the setup cell
    if chatbot is None:
        return "Error: Chatbot is not initialized. Please re-run the setup cell."

    response = chatbot.answer(message, top_k=5)
    answer = response['answer']
    sources_text = format_sources_for_ui(response['sources'])
    full_response = f"{answer}\n\n---\n{sources_text}"
    return full_response

# --- 5. Launch Gradio ---

if chatbot is not None:
    print("Launching Gradio Chat Interface...")
    gr.ChatInterface(
        fn=chat_interface_fn,
        title="MS in Applied Data Science Chatbot",
        description="Ask me questions about the UChicago MSADS program.",
        examples=[
            "What scholarships are available for the program?",
            "What are the minimum scores for the TOEFL?",
            "How many courses must you complete to graduate?",
        ]
    ).launch(share=True, debug=False)
else:
    print("CANNOT LAUNCH GRADIO: Chatbot failed to initialize. Check errors above.")

--- Initializing System & Launching Chatbot ---
This may take a moment...
Loading embedding model...
Loading Vector DB from Disk...
Loaded collection 'msads_knowledge_base' with 166 documents.
Configuring OpenAI API Key...
OpenAI client configured.
Chatbot is initialized and ready.
Launching Gradio Chat Interface...
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://761f7ec4e63166aee3.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
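
The live launch above depends on an OpenAI key and a temporary share link, so it can be handy to check the reply formatting offline first. The following is a minimal sketch, not part of the original notebook: `StubChatbot` and `render_reply` are hypothetical names, the canned answer and the example.com URL are made-up placeholders, and `format_sources_for_ui` is repeated only so the block runs on its own.

```python
def format_sources_for_ui(sources):
    """Same formatting as the notebook's helper, repeated for self-containment."""
    if not sources:
        return "No sources found."
    output = "Sources:\n"
    for i, src in enumerate(sources, 1):
        output += f"{i}. {src['page_title']} (Relevance: {src['similarity_score']:.2f})\n"
        output += f"   URL: {src['url']}\n\n"
    return output


class StubChatbot:
    """Hypothetical stand-in that mimics the dict shape returned by answer()."""

    def answer(self, question, top_k=5):
        return {
            "question": question,
            "answer": "Canned answer.",  # placeholder text, not real model output
            "sources": [
                {
                    "page_title": "Example page",
                    "url": "https://example.com/page",  # placeholder URL
                    "similarity_score": 0.8712,
                }
            ],
        }


def render_reply(bot, message):
    """Combine the answer and its sources the same way chat_interface_fn does."""
    response = bot.answer(message, top_k=5)
    sources_text = format_sources_for_ui(response["sources"])
    return f"{response['answer']}\n\n---\n{sources_text}"


reply = render_reply(StubChatbot(), "Is there an application fee waiver?")
print(reply)
```

Swapping `StubChatbot()` for the real `chatbot` object should exercise the same code path, assuming its `answer` method keeps returning `answer` and `sources` keys.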


### Evaluate chatbot result
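
The evaluation in this section compares each generated answer with its ground truth by eye. As a rough complement, a token-overlap F1 score can flag answers that drift from the reference. This is a swapped-in technique (the SQuAD-style overlap metric), not something the notebook itself computes, and the helper names `normalize` and `token_f1` are assumptions made for this sketch. The sample strings are taken from the evaluation set and the chatbot's fallback message.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


fee_waiver = (
    "For questions regarding an application fee waiver, please refer to "
    "the Physical Sciences Division fee waiver policy."
)
refusal = (
    "I'm sorry, I don't have enough information from the website "
    "to answer that question."
)
visa_truth = "Only our In-Person, Full-Time program is Visa eligible"

exact_score = token_f1(fee_waiver, fee_waiver)   # identical strings
refusal_score = token_f1(refusal, visa_truth)    # no shared tokens
print(exact_score, refusal_score)
```

Token overlap rewards shared wording rather than factual correctness, so a score like this is best treated as a triage signal alongside manual review, not as a replacement for it.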

In [50]:
import pandas as pd

# 1. Define your evaluation set from the project description
evaluation_set = [
    {
        "question": "What scholarships are available for the program?",
        "ground_truth": "The Data Science Institute Scholarship, MS in Applied Data Science Alumni Scholarship etc"
    },
    {
        "question": "What are the minimum scores for the TOEFL and IELTS English Language Requirement?",
        "ground_truth": "Minimum scores for the Master’s in Applied Data Science program: TOEFL, 102 (no subscore requirement); IELTS, 7 (no subscore requirement)."
    },
    {
        "question": "Is there an application fee waiver?",
        "ground_truth": "For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy."
    },
    {
        "question": "What are the deadlines for the in-person program?",
        "ground_truth": "Lists various deadlines (Priority, Scholarship, International, etc.)"
    },
    {
        "question": "How long will it take for me to receive a decision on my application?",
        "ground_truth": "In-Person application decisions are released approximately 1 to 2 months after each respected deadline. Online application decisions are released on a rolling basis"
    },
    {
        "question": "Can I set up an advising appointment with the enrollment management team?",
        "ground_truth": "Yes, meet your admissions counselor by scheduling an appointment https://apply-psd.uchicago.edu/portal/applied-data-science"
    },
    {
        "question": "Where can I mail my official transcripts?",
        "ground_truth": "The University of Chicago\nAttention: MS in Applied Data Science Admissions\n455 N Cityfront Plaza Dr., Suite 950\nChicago, Illinois 6011"
    },
    {
        "question": "Does the Master’s in Applied Data Science Online program provide visa sponsorship?",
        "ground_truth": "Only our In-Person, Full-Time program is Visa eligible"
    },
    {
        "question": "How do I apply to the MBA/MS program?",
        "ground_truth": "Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process... Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest"
    },
    {
        "question": "Is the MS in Applied Data Science program STEM/OPT eligible?",
        "ground_truth": "The MS in Applied Data Science program is STEM/OPT eligible"
    },
    {
        "question": "How many courses must you complete to earn UChicago’s Master’s in Applied Data Science?",
        "ground_truth": "To earn the MS-ADS degree students must successfully complete 12 courses (6 core, 4 elective, 2 Capstone) and our tailored Career Seminar"
    }
]

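print("Running evaluation...")
evaluation_results = []

# 2. Run every question through the chatbot. The setup cell assigns
#    chatbot = None on failure, so test for None rather than for existence.
if globals().get('chatbot') is not None:
    for item in evaluation_set: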
        question = item['question']
        ground_truth = item['ground_truth']

        # Get the bot's response
        response = chatbot.answer(question, top_k=5)
        generated_answer = response['answer']

        evaluation_results.append({
            "Question": question,
            "Ground Truth": ground_truth,
            "Generated Answer": generated_answer,
            "Sources": response['sources']
        })

    print("Evaluation complete.")

    # 3. Display results in a clean DataFrame
    df_eval = pd.DataFrame(evaluation_results)

    # Optional: Save to a file to copy into your presentation
    # df_eval.to_csv("evaluation_results.csv")

    # Display for review
    from IPython.display import display
    pd.set_option('display.max_colwidth', None)
    pd.set_option('display.width', 1000)

    print("--- Evaluation Results ---")
    display(df_eval[['Question', 'Ground Truth', 'Generated Answer']])

else:
    print("Cannot run evaluation: Chatbot not initialized.")

Running evaluation...
Evaluation complete.
--- Evaluation Results ---


Index,Question,Ground Truth,Generated Answer
0,What scholarships are available for the program?,"The Data Science Institute Scholarship, MS in Applied Data Science Alumni Scholarship etc","The program offers partial tuition scholarships to top applicants, including the Data Science Institute Scholarship and the MS in Applied Data Science Alumni Scholarship."
1,What are the minimum scores for the TOEFL and IELTS English Language Requirement?,"Minimum scores for the Master’s in Applied Data Science program: TOEFL, 102 (no subscore requirement); IELTS, 7 (no subscore requirement).","The minimum TOEFL iBT score required for admission is 104, and the minimum IELTS score required is 7."
2,Is there an application fee waiver?,"For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy.","For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy."
3,What are the deadlines for the in-person program?,"Lists various deadlines (Priority, Scholarship, International, etc.)","The deadlines for the in-person program are as follows:\n- December 4, 2025 - Scholarship Priority Deadline for the 1-year (12-15 months; 12 courses)\n- Final Application Deadline for the 2-year Thesis Track (21 months; 18 courses)\n- January 26, 2026 – International Application Deadline\n- March 4, 2026 – Second Priority Application Deadline\n- May 6, 2026 – Third Priority Application Deadline\n- June 23, 2026 – Final Application Deadline"
4,How long will it take for me to receive a decision on my application?,In-Person application decisions are released approximately 1 to 2 months after each respected deadline. Online application decisions are released on a rolling basis,Admissions decisions for the Master's in Applied Data Science program are typically released 1-2 months after each application deadline.
5,Can I set up an advising appointment with the enrollment management team?,"Yes, meet your admissions counselor by scheduling an appointment https://apply-psd.uchicago.edu/portal/applied-data-science","Yes, you can set up an advising appointment with the enrollment management team for the Master of Science in Applied Data Science program."
6,Where can I mail my official transcripts?,"The University of Chicago\nAttention: MS in Applied Data Science Admissions\n455 N Cityfront Plaza Dr., Suite 950\nChicago, Illinois 6011","I'm sorry, I don't have enough information from the website to answer that question."
7,Does the Master’s in Applied Data Science Online program provide visa sponsorship?,"Only our In-Person, Full-Time program is Visa eligible","I'm sorry, I don't have enough information from the website to answer that question."
8,How do I apply to the MBA/MS program?,"Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process... Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest","I'm sorry, I don't have enough information from the website\nto answer that question."
9,Is the MS in Applied Data Science program STEM/OPT eligible?,The MS in Applied Data Science program is STEM/OPT eligible,"Yes, the full-time, in-person MS in Applied Data Science program is STEM/OPT eligible."
