<a href="https://colab.research.google.com/github/malhar-123/Gies-DSRS-Knowledgebase/blob/main/DSRS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DSRS Knowledge Base Project - README

## Overview

The Data Science Research Services (DSRS) at the University of Illinois provides researchers with data science consulting, secure infrastructure, licensed datasets, and advanced computing tools. This project creates an automated, structured knowledge base from DSRS's publicly available resources to aid new joiners, collaborators, and external researchers.

## Project Purpose

This notebook crawls, indexes, and organizes DSRS-related content from:

https://dsrs.illinois.edu/

In future phases, it will expand to include:

https://github.com/giesdsrs

https://dsrs.illinois.edu/datahub/category

https://giesbusiness.illinois.edu/disruption-lab

The core output is a searchable, structured JSON and SQLite knowledge base.

## Functionality Summary

Web Crawler: Extracts paragraph text from each page and classifies the page into a logical section based on its URL and title.

Keyword Extractor: Identifies and indexes top keywords from each page.

JSON Export: Saves the knowledge base in a structured .json format.

Keyword Search: Allows users to retrieve relevant pages based on keyword matches.

Graph Visualization: Depicts conceptual data flows across DSRS components (Datasets → Projects → Services → Tools, etc.).
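The keyword extractor and keyword search listed above can be sketched together without pandas. This is a minimal illustration rather than the notebook's implementation: `top_keywords` and `search_entries` are hypothetical names, the sample entries are made up, and the "words of five or more characters" rule mirrors the regular expression used in the code cells of this notebook.

```python
import re
from collections import Counter


def top_keywords(text, top_k=5, min_len=5):
    """Return the top_k most frequent words with at least min_len characters."""
    words = re.findall(r"\b\w{%d,}\b" % min_len, text.lower())
    return [word for word, _count in Counter(words).most_common(top_k)]


def search_entries(entries, keyword):
    """Return entries whose comma-separated keyword string contains the keyword."""
    needle = keyword.strip().lower()
    return [e for e in entries if needle in e.get("keywords", "").lower()]


# Made-up sample data, for illustration only.
sample_text = "research research research compute compute datasets"
entries = [
    {"title": "Sample A", "keywords": ", ".join(top_keywords(sample_text, top_k=2))},
    {"title": "Sample B", "keywords": "consulting, analytics"},
]

for hit in search_entries(entries, "Compute"):
    print(hit["title"], "|", hit["keywords"])
```

Matching is a case-insensitive substring test against the stored keyword string, so a query such as "comp" would also match "compute". Whether that is desirable depends on how strict the search should be.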

### Data Schema

Each crawled entry is structured using the following fields:

 **url**: Source URL of the crawled page  
 **title**: Page or section title  
 **section**: Classified section (e.g., Services, Tools)  
 **content**: Main body text extracted from the page  
 **keywords**: Auto-extracted keywords (top 5) for quick retrieval  
 **last_crawled**: Timestamp of the last time the page was crawled
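One way to make the schema above explicit in code is a small dataclass. This is only a sketch: `KBEntry` is a hypothetical name that does not appear in the notebook (which passes plain tuples around), and the `content` and `keywords` values below are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class KBEntry:
    """One knowledge base record, with the fields listed in the schema."""
    url: str
    title: str
    section: str
    content: str
    keywords: str       # comma-separated, top 5 by frequency
    last_crawled: str   # ISO 8601 timestamp


entry = KBEntry(
    url="https://dsrs.illinois.edu/about",
    title="Introduction | DSRS",
    section="Overview",
    content="Placeholder body text.",      # placeholder, not real page content
    keywords="research, services",         # placeholder keywords
    last_crawled=datetime.now(timezone.utc).isoformat(),
)

# asdict() preserves field order, so the JSON keys follow the schema order.
print(json.dumps(asdict(entry), indent=4))
```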

## How Users Can Interact

Search: Enter any keyword to retrieve matching knowledge entries.

Explore Graph: View how different DSRS resources conceptually relate.

Extend or Update: Add more URLs or re-run the notebook to refresh content.

## Maintenance Strategy

Each crawl stores last_crawled timestamps.

Duplicate pages are updated using ON CONFLICT logic.

The notebook can be scheduled (via cron or GitHub Actions) to refresh the knowledge base monthly.
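The ON CONFLICT update described above could look roughly like the following. Treat it as a sketch under stated assumptions: `upsert_entries` is a hypothetical helper, the table layout copies the `knowledge_base` schema with a `UNIQUE` constraint on `url`, the rows and timestamps are made up, and the upsert syntax needs SQLite 3.24 or newer. An in-memory database is used so the example leaves no file behind.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS knowledge_base (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT UNIQUE,
    title TEXT,
    section TEXT,
    content TEXT,
    keywords TEXT,
    last_crawled TEXT
)
"""

UPSERT = """
INSERT INTO knowledge_base (url, title, section, content, keywords, last_crawled)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
    title = excluded.title,
    section = excluded.section,
    content = excluded.content,
    keywords = excluded.keywords,
    last_crawled = excluded.last_crawled
"""


def upsert_entries(conn, rows):
    """Insert new pages; refresh every column of pages whose URL already exists."""
    conn.executemany(UPSERT, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

url = "https://dsrs.illinois.edu/about"
# Two made-up crawls of the same page, one month apart.
upsert_entries(conn, [(url, "Old title", "Overview", "old text", "alpha", "2025-01-01T00:00:00")])
upsert_entries(conn, [(url, "New title", "Overview", "new text", "beta", "2025-02-01T00:00:00")])

row_count = conn.execute("SELECT COUNT(*) FROM knowledge_base").fetchone()[0]
title, last_crawled = conn.execute(
    "SELECT title, last_crawled FROM knowledge_base WHERE url = ?", (url,)
).fetchone()
print(row_count, title, last_crawled)
```

The second call updates the existing row instead of adding a duplicate, so the table keeps one row per URL and `last_crawled` reflects the most recent crawl.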

## Future Scope

Iterative (BFS) Crawl: Switch to a breadth-first strategy to better prioritize public-facing summary pages.

Full Site Expansion: Crawl GitHub repositories, DataHub catalogs, and the Gies Disruption Lab.

Semantic Search: Add OpenAI or embedding-based models for natural language queries.

Onboarding Assistant: Build an FAQ/chat interface for new DSRS members or interns.

## Knowledge Architecture: How DSRS Components Relate

**Services** → use **Technical Tools** like Azure, JupyterHub, and MinIO.

**Projects** → are supported by **Services** (consulting, analytics) and **Tools** (secure compute, databases).

**Datasets** → power **Projects** and are curated through **DataHub**.

**Internships** and **Collaborations** → emerge from real-world application of **Projects** and **Services**.

This structure ensures a connected ecosystem: tools and services support research, which in turn enables learning, collaboration, and innovation.
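The relationships above can be captured as a small directed graph. The sketch below uses a plain adjacency mapping rather than a graph library. The edge list is one reading of the four statements above, with each arrow pointing from the component named first to the one it depends on, and `build_graph` and `reachable` are hypothetical helper names.

```python
from collections import defaultdict, deque

# (source, relation, target) triples transcribed from the statements above.
EDGES = [
    ("Services", "uses", "Technical Tools"),
    ("Projects", "supported by", "Services"),
    ("Projects", "supported by", "Technical Tools"),
    ("Datasets", "powers", "Projects"),
    ("Datasets", "curated through", "DataHub"),
    ("Internships", "emerges from", "Projects"),
    ("Internships", "emerges from", "Services"),
    ("Collaborations", "emerges from", "Projects"),
    ("Collaborations", "emerges from", "Services"),
]


def build_graph(edges):
    """Group outgoing (relation, target) pairs by source node."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return graph


def reachable(graph, start):
    """Return every node reachable from start, found breadth-first."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relation, target in graph.get(node, []):
            if target != start and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


graph = build_graph(EDGES)
for relation, target in graph["Datasets"]:
    print(f"Datasets --{relation}--> {target}")
print(sorted(reachable(graph, "Datasets")))
```

Following the arrows from Datasets reaches Projects, Services, and Technical Tools, which matches the "Datasets → Projects → Services → Tools" flow mentioned in the functionality summary.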


In [13]:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import sqlite3
import time
import re
import pandas as pd
import json
from datetime import datetime
from collections import deque  # FIFO queue for the crawl frontier (urls_to_visit)

# Initialize
START_URLS = ["https://dsrs.illinois.edu/"]
urls_to_visit = deque(START_URLS)
visited_urls = set()
extracted_data = []

# Section classifiers
SECTION_MAP = {
    "about": "Overview",
    "services": "Services",
    "projects": "Projects",
    "news": "Events",
    "tools": "Technical Resources",
    "team": "Team",
    "students": "Internships",
    "datahub": "Datasets",
    "github": "Open Source Projects",
    "disruption-lab": "Collaborations"
}

# DB Setup
def init_db():
    conn = sqlite3.connect("dsrs_kb.db")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS knowledge_base (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE,
            title TEXT,
            section TEXT,
            content TEXT,
            keywords TEXT,
            last_crawled TEXT
        )
    """)
    conn.commit()
    conn.close()

def export_to_json(data, filename="dsrs_kb1.json"):
    json_data = [
        {
            "url": entry[0],
            "title": entry[1],
            "section": entry[2],
            "content": entry[3],
            "keywords": entry[4],
            "last_crawled": entry[5]
        }
        for entry in data
    ]
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(json_data, f, indent=4, ensure_ascii=False)
    print(f"\n Exported knowledge base to {filename}")

def extract_keywords(text, top_k=5):
    words = re.findall(r'\b\w{5,}\b', text.lower())
    freq = pd.Series(words).value_counts()
    return ', '.join(freq.head(top_k).index)

def classify_section(url, title):
    for key in SECTION_MAP:
        if key in url.lower() or key in title.lower():
            return SECTION_MAP[key]
    return "Uncategorized"

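def is_valid_url(url):
    # Accept only dsrs.illinois.edu and its subdomains; a plain endswith() on
    # netloc would also accept hosts such as "notdsrs.illinois.edu"
    host = urlparse(url).hostname or ""
    return host == "dsrs.illinois.edu" or host.endswith(".dsrs.illinois.edu")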

def crawl(url):
    if url in visited_urls:
        return
    visited_urls.add(url)
    try:
        print(f"Processing: {url}")
        response = requests.get(url, timeout=10)  # do not hang on an unresponsive server
        response.raise_for_status()
    except requests.RequestException as e:
        print(f" Failed to fetch {url}: {e}")
        return

    soup = BeautifulSoup(response.text, 'html.parser')
    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string.strip() if soup.title and soup.title.string else "No Title"
    section = classify_section(url, title)

    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p') if len(p.get_text(strip=True)) > 30]
    full_text = " ".join(paragraphs)
    if not full_text:
        return

    keywords = extract_keywords(full_text)
    last_crawled = datetime.now().isoformat()
    extracted_data.append((url, title, section, full_text, keywords, last_crawled))

    print(f" Indexed: {url} | Section: {section} | Title: {title}")
    if section in ["Overview", "Services", "Projects", "Technical Resources"]:
        print("\n--- Summary for Knowledge Base Users ---")
        print(f"Section: {section}")
        print(f"Title: {title}")
        print(f"Main Info: {full_text[:300]}...")
        print("---\n")

    for link_tag in soup.find_all('a', href=True):
        full_url = urljoin(url, link_tag['href'])
        if is_valid_url(full_url) and full_url not in visited_urls:
            urls_to_visit.append(full_url)
    time.sleep(0.5)

# Execute
if __name__ == "__main__":
    init_db()
    while urls_to_visit:
        crawl(urls_to_visit.popleft())
    # store_to_db(extracted_data)  # disabled: store_to_db is not defined in this cell
    export_to_json(extracted_data)
    print(f"\n Stored {len(extracted_data)} entries in dsrs_kb1.json")


Processing: https://dsrs.illinois.edu/
 Indexed: https://dsrs.illinois.edu/ | Section: Uncategorized | Title: DSRS | DSRS
Processing: https://dsrs.illinois.edu/#__docusaurus_skipToContent_fallback
 Indexed: https://dsrs.illinois.edu/#__docusaurus_skipToContent_fallback | Section: Uncategorized | Title: DSRS | DSRS
Processing: https://dsrs.illinois.edu/about
 Indexed: https://dsrs.illinois.edu/about | Section: Overview | Title: Introduction | DSRS

--- Summary for Knowledge Base Users ---
Section: Overview
Title: Introduction | DSRS
Main Info: Situated at the heart of Gies College of Business, the Data Science Research Services (DSRS) exemplifies our institution's commitment to pioneering research and technological innovation. DSRS has been intricately designed to provide an unparalleled research support framework, ensuring our academic c...
---

Processing: https://dsrs.illinois.edu/faculty
 Indexed: https://dsrs.illinois.edu/faculty | Section: Uncategorized | Title: Overview | DSRS
Pr

### Search by keywords

In [16]:
import json

def search_by_keyword_from_json(json_path="dsrs_kb1.json"):
    keyword = input("Enter keyword to search: ")
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    results = [entry for entry in data if keyword.lower() in entry.get("keywords", "").lower()]

    for entry in results:
        print(f"Title: {entry['title']}\nSection: {entry['section']}\nURL: {entry['url']}\nKeywords: {entry['keywords']}\nContent Snippet: {entry['content'][:200]}\n{'-'*80}")


search_by_keyword_from_json()


KeyboardInterrupt: Interrupted by user

### This is not the submission

You can try this cell: it accepts multiple start URLs and uses an iterative crawl function with a breadth-first (BFS) queue, so it does not get stuck in a recursive loop. I am not including it as my deliverable for this submission because it takes a very long time to execute. My focus now is on making this code run faster.
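Part of the long runtime may come from the queue accepting the same page more than once. The crawl output in this notebook shows `https://dsrs.illinois.edu/` indexed again as `https://dsrs.illinois.edu/#__docusaurus_skipToContent_fallback`, which differs only by its fragment. A possible mitigation, sketched below with hypothetical helper names (`normalize_url`, `is_allowed`, `enqueue`) and an assumed single allowed host, is to strip fragments and deduplicate at enqueue time.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

ALLOWED_HOST = "dsrs.illinois.edu"  # assumption: restrict the crawl to this host


def normalize_url(url):
    """Drop the #fragment so in-page anchors map to the same page URL."""
    clean, _fragment = urldefrag(url)
    return clean


def is_allowed(url):
    """Accept http(s) URLs on the allowed host or one of its subdomains."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    on_host = host == ALLOWED_HOST or host.endswith("." + ALLOWED_HOST)
    return parsed.scheme in ("http", "https") and on_host


def enqueue(url, queue, seen):
    """Add url to the queue once; return True only if it was newly added."""
    url = normalize_url(url)
    if not is_allowed(url) or url in seen:
        return False
    seen.add(url)
    queue.append(url)
    return True


queue, seen = deque(), set()
added_root = enqueue("https://dsrs.illinois.edu/", queue, seen)
added_anchor = enqueue("https://dsrs.illinois.edu/#__docusaurus_skipToContent_fallback", queue, seen)
added_other = enqueue("https://example.com/page", queue, seen)
print(added_root, added_anchor, added_other, list(queue))
```

Marking a URL as seen when it is enqueued, rather than when it is fetched, also keeps the same link from piling up in the queue many times before its turn comes.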

In [17]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import sqlite3
import time
import re
import pandas as pd
import json
from collections import deque # Import deque for breadth-first search

# Initialize
START_URLS = [
    "https://dsrs.illinois.edu/",
    #"https://github.com/giesdsrs",
    #"https://dsrs.illinois.edu/datahub/category",
    #"https://giesbusiness.illinois.edu/disruption-lab"
]

# Use a deque as a queue for URLs to visit
urls_to_visit = deque(START_URLS)
visited_urls = set()
extracted_data = []

# Section classifiers
SECTION_MAP = {
    "about": "Overview",
    "services": "Services",
    "projects": "Projects",
    "news": "Events",
    "tools": "Technical Resources",
    "team": "Team",
    "students": "Internships",
    "datahub": "Datasets",
    "github": "Open Source Projects",
    "disruption-lab": "Collaborations"
}

# DB Setup
def init_db():
    conn = sqlite3.connect("dsrs_kb.db")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS knowledge_base (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT,
            title TEXT,
            section TEXT,
            content TEXT,
            keywords TEXT
        )
    """)
    conn.commit()
    conn.close()

def store_to_db(data):
    conn = sqlite3.connect("dsrs_kb.db")
    cursor = conn.cursor()
    cursor.executemany("""
        INSERT INTO knowledge_base (url, title, section, content, keywords)
        VALUES (?, ?, ?, ?, ?)
    """, data)
    conn.commit()
    conn.close()

def export_to_json(data, filename="dsrs_kb.json"):
    json_data = [
        {"url": entry[0], "title": entry[1], "section": entry[2], "content": entry[3], "keywords": entry[4]}
        for entry in data
    ]
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(json_data, f, indent=4, ensure_ascii=False)
    print(f"\n Exported knowledge base to {filename}")

def extract_keywords(text, top_k=5):
    words = re.findall(r'\b\w{5,}\b', text.lower())
    freq = pd.Series(words).value_counts()
    return ', '.join(freq.head(top_k).index)

def classify_section(url, title):
    for key in SECTION_MAP:
        if key in url.lower() or key in title.lower():
            return SECTION_MAP[key]
    return "Uncategorized"

def is_valid_url(url):
    parsed = urlparse(url)
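    # Check that the host is, or is a subdomain of, one of the allowed domains
    # (a substring test on netloc would also match hosts like "github.com.example.org")
    host = parsed.hostname or ""
    return any(host == d or host.endswith("." + d) for d in ["illinois.edu", "github.com"])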


def crawl_iterative():
    """
    Iteratively crawl the websites starting from the given URLs using a queue.
    """
    while urls_to_visit:
        url = urls_to_visit.popleft() # Get the next URL from the left of the queue

        if url in visited_urls:
            continue
        visited_urls.add(url)

        try:
            print(f"Processing: {url}") # Added for visibility
            response = requests.get(url, timeout=10)  # do not hang on an unresponsive server
            response.raise_for_status()
        except requests.RequestException as e:
            print(f" Failed to fetch {url}: {e}")
            continue # Skip to the next URL in the queue

        soup = BeautifulSoup(response.text, 'html.parser')
        # soup.title.string is None when <title> is empty or contains nested tags
        title = soup.title.string.strip() if soup.title and soup.title.string else "No Title"
        section = classify_section(url, title)

        paragraphs = [p.get_text(strip=True) for p in soup.find_all('p') if len(p.get_text(strip=True)) > 30]
        full_text = " ".join(paragraphs)
        if not full_text:
            continue # Skip if no significant text is found

        keywords = extract_keywords(full_text)
        extracted_data.append((url, title, section, full_text, keywords))
        print(f" Indexed: {url} | Section: {section} | Title: {title}")

        if section in ["Overview", "Services", "Projects", "Technical Resources"]:
            print("\n--- Summary for Knowledge Base Users ---")
            print(f"Section: {section}")
            print(f"Title: {title}")
            print(f"Main Info: {full_text[:300]}...")
            print("---\n")

        # Add new valid links to the queue
        for link_tag in soup.find_all('a', href=True):
            full_url = urljoin(url, link_tag['href'])
            if is_valid_url(full_url) and full_url not in visited_urls:
                urls_to_visit.append(full_url)

        time.sleep(0.5)


# Execute
if __name__ == "__main__":
    init_db()
    # Call the iterative crawl function
    crawl_iterative()
    store_to_db(extracted_data)
    export_to_json(extracted_data)
    print(f"\n Stored {len(extracted_data)} entries in dsrs_kb.db and exported to dsrs_kb.json")

Processing: https://dsrs.illinois.edu/
 Indexed: https://dsrs.illinois.edu/ | Section: Uncategorized | Title: DSRS | DSRS
Processing: https://dsrs.illinois.edu/#__docusaurus_skipToContent_fallback
 Indexed: https://dsrs.illinois.edu/#__docusaurus_skipToContent_fallback | Section: Uncategorized | Title: DSRS | DSRS
Processing: https://dsrs.illinois.edu/about
 Indexed: https://dsrs.illinois.edu/about | Section: Overview | Title: Introduction | DSRS

--- Summary for Knowledge Base Users ---
Section: Overview
Title: Introduction | DSRS
Main Info: Situated at the heart of Gies College of Business, the Data Science Research Services (DSRS) exemplifies our institution's commitment to pioneering research and technological innovation. DSRS has been intricately designed to provide an unparalleled research support framework, ensuring our academic c...
---

Processing: https://dsrs.illinois.edu/faculty
 Indexed: https://dsrs.illinois.edu/faculty | Section: Uncategorized | Title: Overview | DSRS
Pr

KeyboardInterrupt: 