# Web Scraping Project

This notebook contains the web scraping process for extracting content from specified Northeastern University catalog pages, including their subpages. The content will be converted from HTML to Markdown format.

## Task 1: Scrape Content from Specified URLs and Subpages

This task involves:
- Downloading the main pages at the provided URLs.
- Identifying and downloading the subpages for each main URL.
- Converting each page's HTML content to Markdown format.

Each page is represented as a dictionary with two keys:
- `url`: URL of the page.
- `content`: Content of the page in Markdown format as a string.

In [7]:

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import re
import time


In [8]:

def get_page_content(url):
    """Fetch HTML content of the given URL and return as text."""
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Failed to retrieve {url}: {e}")
        return None

# Function to find subpage URLs within the page content
def find_subpages(main_url, html_content):
    """Extract subpage URLs that start with the main URL from HTML content."""
    soup = BeautifulSoup(html_content, 'html.parser')
    subpage_links = set()
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.endswith('/'):
            href = 'https://catalog.northeastern.edu' + href
        # Only add links that start with the main URL
        if href.startswith(main_url) and not href.endswith('.pdf') and href not in subpage_links:
            subpage_links.add(href)
    return list(subpage_links)

def convert_to_markdown(html_content):
    """Convert HTML content to Markdown format."""
    return md(html_content)


# Recursive function for scraping main page and subpages
def scrape_recursive(url, visited_pages, scraped_pages):
    """Recursively scrape a page and its subpages."""
    if url in visited_pages:
        return  # Skip already visited pages

    print(f"Scraping: {url}")
    content = get_page_content(url)
    if content:
        page_data = {
            "url": url,
            "content": convert_to_markdown(content)
        }
        scraped_pages.append(page_data)  # Add scraped page to list
        visited_pages.add(url)  # Mark the page as visited

        # Find subpages and recurse into each one
        subpages = find_subpages(url, content)
        print(f"Found {len(subpages)} subpages for {url}")

        for subpage_url in subpages:
            scrape_recursive(subpage_url, visited_pages, scraped_pages)  # Recurse for each subpage
            time.sleep(1)  # Delay to avoid server overload

# Initial URLs and starting the recursive scraping
main_urls = [
    "https://catalog.northeastern.edu/undergraduate/computer-information-science/",
    "https://catalog.northeastern.edu/graduate/computer-information-science/"
]

# Sample URLs
main_urls = [
    "https://catalog.northeastern.edu/undergraduate/computer-information-science/",
    "https://catalog.northeastern.edu/graduate/computer-information-science/"
]


In [9]:

# Task 1: Download main pages, find subpages, and convert content to markdown

scraped_pages = []
visited_pages = set()

# Start recursive scraping for each main URL
for url in main_urls:
    scrape_recursive(url, visited_pages, scraped_pages)

print("Recursive scraping completed.")


Scraping: https://catalog.northeastern.edu/undergraduate/computer-information-science/
Found 5 subpages for https://catalog.northeastern.edu/undergraduate/computer-information-science/
Scraping: https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-information-science-combined-majors/
Found 48 subpages for https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-information-science-combined-majors/
Scraping: https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-information-science-combined-majors/computer-science-theatre-bs/
Found 0 subpages for https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-information-science-combined-majors/computer-science-theatre-bs/
Scraping: https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-information-science-combined-majors/computer-science-political-science-bs/
Found 0 subpages for https://catalog.n

In [10]:

# Display first few results to verify
# scraped_pages  # Display first two items as a sample


## Task 2

In [11]:
minor = [entry['content'] for entry in scraped_pages if entry['url']=='https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-science/minor/'][0]
print(minor)






Computer Science, Minor \< Northeastern University Academic Catalog

























* [Skip to Content](#contentarea)
* [AZ Index](/azindex/)
* [Catalog Home](/)
* [Institution Home](https://northeastern.edu)







[![Northeastern University](/images/logo.png)](https://www.northeastern.edu/)



![](/images/search.svg)
Toggle Search Visibility



Close Search
![](/images/close.svg)


Search catalog

Go




---








Academic Catalog 2024\-2025
---------------------------






* [Home](/)›
* [Undergraduate](/undergraduate/)›
* [Khoury College of Computer Sciences](/undergraduate/computer-information-science/)›
* [Computer Science](/undergraduate/computer-information-science/computer-science/)›
* Computer Science, Minor






Computer Science, Minor










2024\-2025 Edition



[2024\-2025 Edition](/)
-----------------------



* [Delivery of Services](/delivery-services/)
* [General Information](/general-information/)
* [Undergraduate](/undergraduate/)
	+ [Admission]

## Task 3

In [12]:
def chunker(text):
    splitter = text.split('\n\n\n')
    clean_chunk = ['\n'.join([line.strip() for line in m.splitlines() if line.strip()]) for m in splitter if m != '' and m != '\n']
    
    return clean_chunk

# print(chunker(minor))

In [13]:
# data = scraped_pages
# cleaned_articles = []

# for article in data:
#     clean_content = '\n'.join([line.strip() for line in article['content'].splitlines() if line.strip()])
#     cleaned_articles.append({'url':article['url'], 'content':clean_content})

# print(cleaned_articles)

In [14]:
# [entry['content'] for entry in cleaned_articles if entry['url']=='https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-science/minor/'][0]

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = []

for sp in scraped_pages:
    content = sp['content']
    sectionized = chunker(content)
    pages.append(sectionized)

len_pages = [len(p) for p in pages]
flat_pages = [x for xs in pages for x in xs]
vectorizer = TfidfVectorizer()
vector_pages = vectorizer.fit_transform(flat_pages)

kmeans = KMeans(n_clusters=100, random_state=0, n_init="auto").fit(vector_pages)

In [24]:
from collections import defaultdict

counter = defaultdict(int)
for l in kmeans.labels_:
    counter[l] += 1

{k: v for k, v in sorted(counter.items(), key=lambda item: item[1], reverse=True)}

{0: 861,
 30: 318,
 5: 190,
 3: 95,
 17: 95,
 4: 95,
 13: 95,
 15: 95,
 2: 95,
 1: 95,
 10: 95,
 25: 95,
 9: 95,
 19: 95,
 7: 95,
 11: 95,
 6: 95,
 12: 95,
 27: 75,
 8: 74,
 14: 71,
 21: 71,
 29: 68,
 45: 52,
 18: 42,
 78: 40,
 42: 32,
 94: 32,
 72: 28,
 31: 25,
 54: 21,
 77: 20,
 33: 19,
 98: 18,
 23: 18,
 16: 16,
 20: 14,
 37: 14,
 80: 14,
 66: 13,
 85: 13,
 39: 12,
 32: 12,
 22: 11,
 96: 10,
 59: 10,
 40: 10,
 99: 10,
 93: 9,
 49: 8,
 68: 8,
 47: 8,
 63: 7,
 41: 7,
 50: 7,
 88: 6,
 70: 6,
 60: 6,
 90: 6,
 83: 6,
 74: 6,
 73: 6,
 62: 6,
 92: 6,
 71: 6,
 34: 5,
 51: 5,
 48: 5,
 82: 5,
 76: 5,
 44: 4,
 58: 4,
 35: 4,
 26: 4,
 69: 4,
 53: 4,
 65: 4,
 36: 4,
 57: 4,
 55: 4,
 79: 4,
 46: 4,
 43: 4,
 52: 3,
 86: 3,
 87: 3,
 56: 3,
 28: 3,
 67: 3,
 75: 3,
 84: 3,
 24: 3,
 89: 3,
 38: 3,
 61: 3,
 64: 2,
 91: 2,
 81: 2,
 95: 2,
 97: 2}

In [25]:
for i, idx in enumerate(list({k: v for k, v in sorted(counter.items(), key=lambda item: item[1], reverse=True)}.keys())[:20], start=1):
    print(f"{i}: {flat_pages[list(kmeans.labels_).index(idx)]}")

1: ---
2: Computing has transformed the way people work and live, and its applications are limitless. Today, an understanding of computing is critical in business, healthcare, science, digital art, and other areas of our information\-driven society. Computing knowledge and computing technology also contribute to resolving major issues in an increasingly complex world.
3: 2024\-2025 Edition
4: * [Skip to Content](#contentarea)
* [AZ Index](/azindex/)
* [Catalog Home](/)
* [Institution Home](https://northeastern.edu)
5: [![Northeastern University](/images/logo.png)](https://www.northeastern.edu/)
6: ![](/images/search.svg)
Toggle Search Visibility
7: Close Search
![](/images/close.svg)
8: Search catalog
Go
9: Academic Catalog 2024\-2025
---------------------------
10: Print Options
11: ### Campus Locations
12: * [Arlington
VAOpens New Window](https://arlington.northeastern.edu/)
* [Boston
MAOpens New Window](https://www.northeastern.edu/campuses/boston/)
* [Burlington
MAOpens New Window]

In [26]:
import cluster_calculation as cc

similarity_scores = cc.calculate_cluster_similarity(kmeans.labels_, kmeans.cluster_centers_, vector_pages)

ccs = dict()
for idx, score in enumerate(similarity_scores):
    # print(f"Cluster {idx}: {score:.4f}")
    ccs[idx] = score

# print(len([f for f in similarity_scores if f > 0.99]))

common = [k for k, v in counter.items() if v > 40]
ccs_lst = list({k: v for k, v in sorted(ccs.items(), key=lambda item: item[1], reverse=True)}.keys())
new_lst = [c for c in ccs_lst if c in common]

for i, idx in enumerate(new_lst[:20], start=1):
    print(f"{i}: Cluster {idx}: Count {counter[idx]}: {flat_pages[list(kmeans.labels_).index(idx)]}")

1: Cluster 9: Count 95: * [Facebook. Opens New Window](https://www.facebook.com/northeastern/)
* [X. Opens New Window](https://x.com/Northeastern)
* [YouTube. Opens New Window](https://www.youtube.com/user/Northeastern)
* [Linkedin. Opens New Window](https://www.linkedin.com/school/northeastern-university/)
* [Instagram. Opens New Window](https://www.instagram.com/northeastern/)
* [TikTok. Opens New Window](https://www.tiktok.com/@northeasternu)
2: Cluster 1: Count 95: Print Options
3: Cluster 4: Count 95: ![](/images/search.svg)
Toggle Search Visibility
4: Cluster 6: Count 95: * [Send Page to Printer](#)
5: Cluster 11: Count 95: Close this window
Print Options
-------------
6: Cluster 13: Count 95: Close Search
![](/images/close.svg)
7: Cluster 14: Count 71: ### Quick Links
8: Cluster 15: Count 95: Search catalog
Go
9: Cluster 17: Count 95: [![Northeastern University](/images/logo.png)](https://www.northeastern.edu/)
10: Cluster 19: Count 95: Copyright 2024\-2025 Northeastern Universi

In [19]:
len_pages

[52,
 40,
 30,
 30,
 30,
 30,
 30,
 29,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 29,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 30,
 29,
 30,
 30,
 29,
 30,
 30,
 30,
 30,
 30,
 30,
 387,
 30,
 30,
 27,
 29,
 31,
 25,
 113,
 30,
 124,
 30,
 27,
 39,
 133,
 29,
 29,
 34,
 29,
 25,
 26,
 37,
 26,
 28,
 26,
 30,
 37,
 27,
 27,
 432,
 29,
 29,
 29,
 29,
 29,
 29,
 29,
 29,
 29,
 29,
 34,
 30,
 28,
 29,
 30,
 30]

In [27]:
cluster_rank_rm = [1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,19,20]
cluster_rank_rm = [x-1 for x in cluster_rank_rm]
cluster_rm = [new_lst[i] for i in cluster_rank_rm]

for i, k in enumerate(kmeans.labels_):
    if k in cluster_rm:
        flat_pages[i] = None

combined_pages = []
print(flat_pages[0:len_pages[1]])
len_pages_0 = [0] + len_pages
for i in range(1, len(len_pages_0)):
    combined_pages.append(flat_pages[len_pages_0[i-1]:len_pages_0[i-1]+len_pages_0[i]])

# combined_pages = ['\n'.join(page) for pages in combined_pages for page in pages if page != None]
combined_pages = ['\n'.join(page for page in pages if page is not None) for pages in combined_pages]



In [28]:
combined_pages

 '#### First\\-Year Students\n* Not maintaining an overall cumulative GPA of at least 1\\.800 at the end of each full\\-term semester (fall, spring) of the first\\-year curriculum and a GPA of at least 2\\.000 in the major at the end of the second academic full\\-term semester of the curriculum (spring)\n#### Upperclass and Transfer Students\n* Not maintaining an overall cumulative GPA of at least 2\\.000 and a GPA of at least 2\\.000 in the major at the end of the second academic full\\-term semester of the curriculum completed on campus (fall or spring) and at the end of each full\\-term academic semester thereafter (fall, spring)\n### Academic Dismissal from Major\nNot maintaining a GPA of at least 2\\.000 in the major at the end of the third academic full\\-term semester and at the end of each full\\-term academic semester (fall, spring) thereafter will result in dismissal from Khoury College of Computer Sciences.\nStudents not following a program of study approved by the student’s

## Task 4

In [29]:
from datasets import Dataset
dataset_dict = {'url': [entry['url'] for entry in scraped_pages], 'content': combined_pages}
dataset = Dataset.from_dict(dataset_dict)
dataset.to_json('cleaned_dataset.jsonl')

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1060833