## Python code to find all the URLs in a web page using breadth-first search

In [2]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

Requests fetches each page over HTTP.
Beautiful Soup parses the returned HTML and extracts the links.
Time is used to pause between requests (rate limiting) so that the crawler does not overload the server. Rate limiting is strongly advised on both legal and ethical grounds and is part of polite crawling practice.
Urljoin converts relative links into absolute URLs, so the crawler can follow both relative and absolute links.
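As a small illustration of what urljoin does, the sketch below resolves a few kinds of links against the page they were found on. The helper name `resolve_link` and the example paths are assumptions made for this illustration, not part of the crawler itself. The helper also uses `urldefrag` to drop `#fragment` parts, since a fragment points to a position inside a page rather than to a different page.

```python
from urllib.parse import urljoin, urldefrag


def resolve_link(page_url, href):
    """Resolve href against the page it appeared on and drop any #fragment."""
    absolute = urljoin(page_url, href)         # relative -> absolute
    without_fragment, _ = urldefrag(absolute)  # strip "#main", "#", ...
    return without_fragment


# Hypothetical page and links, chosen only to show each case.
page = "https://web.mit.edu/education/"
print(resolve_link(page, "schools/"))               # relative to the current page
print(resolve_link(page, "/about"))                 # relative to the site root
print(resolve_link(page, "https://example.org/x"))  # already absolute
print(resolve_link(page, "#main"))                  # fragment-only link
```

Note that a relative link is resolved against the page on which it appears, which is not necessarily the URL where the crawl started.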

Disallowed paths from https://web.mit.edu/robots.txt

In [3]:
DISALLOWED_PATHS = [
    "/afs/", "/cgi-bin/", "/user/", "/org/",
    "/activity/", "/contrib/", "/dept/",
    "/software/", "/bin/"
]

Check whether a URL is allowed, based on the disallowed paths taken from robots.txt

In [4]:
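from urllib.parse import urlparse

def is_allowed(url):
    # robots.txt Disallow rules are prefixes of the URL path, so compare
    # against the path only instead of searching the whole URL string.
    path = urlparse(url).path
    for prefix in DISALLOWED_PATHS:
        if path.startswith(prefix):
            return False
    return True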
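An alternative to a hand-copied list is Python's standard `urllib.robotparser` module, which reads rules written in robots.txt syntax. The following is a minimal sketch, not the approach used in this notebook: `SAMPLE_ROBOTS_TXT` is an illustrative excerpt (not the full MIT file) and `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: a short excerpt in robots.txt syntax.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /afs/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("*", "https://web.mit.edu/afs/athena/"))  # disallowed path
print(parser.can_fetch("*", "https://web.mit.edu/about"))        # no matching rule
```

In a live crawl, the parser can instead download the file itself by calling `set_url(...)` with the robots.txt address followed by `read()`, which keeps the rules in sync with the site.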

In [5]:
def bfs_crawl(start_url, max_pages=20):
    visited = []
    queue = [start_url]

    while queue and len(visited) < max_pages:
        url = queue.pop(0)

        if url in visited or not is_allowed(url):
            continue

        try:
            r = requests.get(url, timeout=5)
            if r.status_code != 200:
                continue

            visited.append(url)
            print("Visited:", url)

            soup = BeautifulSoup(r.text, "html.parser")
            for link in soup.find_all("a", href=True):
                href = link["href"]
                full_url = urljoin(url, href)  # resolve relative links against the current page
                if full_url.startswith("https://web.mit.edu") and full_url not in visited and full_url not in queue:
                    if is_allowed(full_url):
                        queue.append(full_url)

            time.sleep(1)  # polite crawling

        except requests.exceptions.RequestException:
            continue

    return visited

# Start crawling
start_url = "https://web.mit.edu/"
urls = bfs_crawl(start_url)

print("\nCrawled URLs:")
for url in urls:
    print(url)

Visited: https://web.mit.edu/
Visited: https://web.mit.edu/#main
Visited: https://web.mit.edu/education
Visited: https://web.mit.edu/research
Visited: https://web.mit.edu/innovation
Visited: https://web.mit.edu/admissions-aid
Visited: https://web.mit.edu/campus-life
Visited: https://web.mit.edu/alumni
Visited: https://web.mit.edu/about
Visited: https://web.mit.edu/building-a-better-world
Visited: https://web.mit.edu/search
Visited: https://web.mit.edu/#
Visited: https://web.mit.edu/feedback
Visited: https://web.mit.edu/visitmit
Visited: https://web.mit.edu/search/?redirect-origin=legacy&tab=directory
Visited: https://web.mit.edu/contact
Visited: https://web.mit.edu/privacy
Visited: https://web.mit.edu/accessibility
Visited: https://web.mit.edu/education/schools-and-departments/
Visited: https://web.mit.edu/about/mission-statement/

Crawled URLs:
https://web.mit.edu/
https://web.mit.edu/#main
https://web.mit.edu/education
https://web.mit.edu/research
https://web.mit.edu/innovation
https

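queue.pop(0) is a list operation that removes and returns the first element of the list. Because new links are appended to the end and pages are taken from the front, URLs are visited in the order they were added (first in, first out), which is how BFS (Breadth-First Search) works. Calling pop() with no argument removes the last element instead, which would turn the traversal into a depth-first one.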
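To see the first-in, first-out order without making any network requests, here is a minimal sketch that runs the same traversal over a small in-memory link graph. The graph `LINKS` and the function name `bfs_order` are made up for this illustration. It also swaps the list for `collections.deque`, whose `popleft()` takes constant time, whereas removing the first element of a list takes time proportional to its length.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}


def bfs_order(start, links, max_pages=20):
    """Return pages in breadth-first order, visiting at most max_pages."""
    visited = []            # pages in the order they were visited
    seen = {start}          # everything already queued or visited
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()      # take from the front (FIFO)
        visited.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:     # set lookup avoids rescanning lists
                seen.add(nxt)
                queue.append(nxt)   # add to the back
    return visited


print(bfs_order("/", LINKS))
# prints ['/', '/a', '/b', '/c', '/d']
```

Both pages linked from the start page ("/a" and "/b") are visited before any page one level deeper ("/c" and "/d"), which is the level-by-level order that defines breadth-first search.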