
In [4]:
import urllib.request, urllib.error
from bs4 import BeautifulSoup
import csv

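def get_soup(url):
    """Fetch url and return a BeautifulSoup tree, or None if the request fails."""
    # Send a browser-like User-Agent; without it Wikipedia answered 403 Forbidden.
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0"}
    )
    try:
        with urllib.request.urlopen(req) as resp:
            html = resp.read()
        return BeautifulSoup(html, "html.parser")
    except urllib.error.URLError as e:
        # URLError also covers HTTPError (e.g. 403 or 404 responses).
        print("Error fetching page:", e)
        return None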


# Part 1 – Tutorial-style scraping on AI Wikipedia page
url_ai = "https://en.wikipedia.org/wiki/Artificial_intelligence"
soup_ai = get_soup(url_ai)

if not soup_ai:
    raise SystemExit("Could not load AI page.")

# --- Title ---
title = soup_ai.title.get_text(strip=True) if soup_ai.title else "NO TITLE"
print("Page title:", title)

# --- "Content": collect all section headings as our 'content list' ---
headings = []
for tag in soup_ai.find_all(["h2", "h3"]):
    span = tag.find("span", class_="mw-headline")
    text = span.get_text(strip=True) if span else tag.get_text(strip=True)
    text = text.replace("[edit]", "").strip()
    if text:
        headings.append(text)

print("\n=== Content section (first 10 headings) ===")
for h in headings[:10]:
    print(h)

with open("content.txt", "w", encoding="utf-8") as f:
    for h in headings:
        f.write(h + "\n")

print(f"\nSaved {len(headings)} headings to content.txt")



see_also = []

# 1) Collect links from every hatnote whose text contains "See also"
for note in soup_ai.find_all("div", class_="hatnote"):
    text = note.get_text(" ", strip=True)
    if "see also" in text.lower():
        for a_tag in note.find_all("a", href=True):
            href = a_tag["href"]
            label = a_tag.get_text(strip=True)
            if label:
                see_also.append([href, label])

# 2) Collect the page-level "See also" section if it exists
see_header = None
for tag in soup_ai.find_all(["h2", "h3"]):
    heading_text = tag.get_text(" ", strip=True).lower()
    if "see also" in heading_text:
        see_header = tag
        break

if see_header:
    for sib in see_header.find_next_siblings():
        if sib.name in ["h2", "h3"]:
            break  # next section, stop
        if sib.name == "ul":
            for li in sib.find_all("li"):
                a_tag = li.find("a", href=True)
                if not a_tag:
                    continue
                href = a_tag["href"]
                label = a_tag.get_text(strip=True)
                if label:
                    see_also.append([href, label])

print("\n=== See also (first 10 links) ===")
for href, label in see_also[:10]:
    print(href, "->", label)

# Save to CSV
with open("see_also.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["href", "text"])
    writer.writerows(see_also)

print(f"\nSaved {len(see_also)} 'See also' links to see_also.csv")


Page title: Artificial intelligence - Wikipedia

=== Content section (first 10 headings) ===
Contents
Goals
Reasoning and problem-solving
Knowledge representation
Planning and decision-making
Learning
Natural language processing
Perception
Social intelligence
General intelligence

Saved 53 headings to content.txt

=== See also (first 10 links) ===
/wiki/Environmental_impacts_of_artificial_intelligence -> Environmental impacts of artificial intelligence
/wiki/Content_moderation -> Content moderation
/wiki/Explainable_AI -> Explainable AI
/wiki/Algorithmic_transparency -> Algorithmic transparency
/wiki/Right_to_explanation -> Right to explanation
/wiki/Human-AI_interaction -> Human-AI interaction
/wiki/Lists_of_open-source_artificial_intelligence_software -> Lists of open-source artificial intelligence software
/wiki/Synthetic_intelligence -> Synthetic intelligence
/wiki/Intelligent_agent -> Intelligent agent
/wiki/Artificial_mind_(disambiguation) -> Artificial mind

Saved 12 'See also' links to see_also.csv

**Web Scraping Learning Experience**

In this project, I explored the fundamentals of web scraping using Python, urllib.request, and BeautifulSoup. My goal was to follow a tutorial that scraped the “Content” and “See also” sections from a Wikipedia page, and then apply and enhance this method on my own. Although the tutorial appeared straightforward, my actual experience differed significantly because the HTML layout of Wikipedia pages has changed over time. This forced me to move beyond copy-and-paste scripting and actually understand how web scraping works under dynamic, real-world conditions.

**What I Learned**

The most important thing I learned is that web scraping is not about memorizing code; it is about understanding structure. HTML changes, websites update their layouts, and hard-coded patterns from tutorials often fail. I learned how to:

- Inspect live HTML and search for patterns (h2, h3, span id attributes, class="hatnote", etc.).
- Build the request with urllib.request.Request(...) and a User-Agent header to avoid 403 Forbidden errors.
- Parse webpages using BeautifulSoup and extract tags, text, and attributes.
- Handle unexpected or missing elements defensively (None checks, conditional extraction).
- Use multiple fallback patterns when a single selector doesn’t work.

This project forced me to think like a scraper developer rather than a code copier.
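
The None checks and fallback patterns listed above can be sketched in a few lines. This is only a minimal illustration, not the code from the tutorial or from the cell above: the selector list, the helper names first_match and extract_links, and the sample HTML are all assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical selector list, ordered from most to least specific.
SEE_ALSO_SELECTORS = ["div.div-col", "div.hatnote"]

def first_match(soup, selectors):
    """Return the first element matched by any selector, or None."""
    for css in selectors:
        node = soup.select_one(css)
        if node is not None:
            return node
    return None

def extract_links(html, selectors=SEE_ALSO_SELECTORS):
    """Return (href, label) pairs from the first matching container."""
    soup = BeautifulSoup(html, "html.parser")
    container = first_match(soup, selectors)
    if container is None:
        # No selector matched: return an empty list instead of crashing.
        return []
    links = []
    for a_tag in container.find_all("a", href=True):
        label = a_tag.get_text(strip=True)
        if label:
            links.append((a_tag["href"], label))
    return links

# Made-up fragment: there is no div.div-col, so the hatnote fallback is used.
sample = '<div class="hatnote">See also: <a href="/wiki/Explainable_AI">Explainable AI</a></div>'
print(extract_links(sample))
print(extract_links("<p>No matching container</p>"))
```

Ordering the selectors from most to least specific means a layout change usually degrades to the next pattern rather than silently producing zero results.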

**Challenges I Encountered**

I encountered several meaningful challenges:

- Wikipedia changed its HTML structure, so the tutorial’s code for "See also" no longer worked at all. I kept getting: “Saved 0 ‘See also’ links to see_also.csv.”
- The tutorial relied on div class="div-col", which no longer exists on the page I scraped.
- The page had many hatnotes with small “See also:” hints inside sections, but did not have a bottom “See also” section structured the way it was on older pages.
- I had to debug why soup.find('div', class_='div-col') returned None and why see_header.find_next_siblings() didn’t locate any lists.
- I also had to deal with relative URLs (/wiki/...) and later learned how to convert them into full URLs using urljoin.
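
The relative-URL conversion mentioned in the last item can be sketched with urllib.parse.urljoin from the standard library. The helper name absolutize and the sample rows are hypothetical; the base URL is the page scraped above.

```python
from urllib.parse import urljoin

# Base URL of the scraped page; relative hrefs are resolved against it.
BASE_URL = "https://en.wikipedia.org/wiki/Artificial_intelligence"

def absolutize(rows, base=BASE_URL):
    """Return [href, label] rows with every href converted to a full URL."""
    return [[urljoin(base, href), label] for href, label in rows]

# Made-up rows in the same [href, label] shape as the see_also list.
rows = [
    ["/wiki/Explainable_AI", "Explainable AI"],    # site-relative path
    ["#History", "History"],                       # fragment on the same page
    ["https://example.org/page", "External"],      # already absolute
]

for href, label in absolutize(rows):
    print(href, "->", label)
```

Because urljoin leaves absolute URLs unchanged, the same helper can be applied to every row without first checking which hrefs are relative.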

These failures were actually the best part of the learning process, because they pushed me to analyze HTML more carefully and redesign the extraction logic.
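
One possible explanation for find_next_siblings() finding no lists is that the heading sits inside its own wrapper element, so the lists are siblings of the wrapper rather than of the heading. The sketch below assumes that layout; the wrapper class name mw-heading, the helper section_links, and both HTML fragments are assumptions for illustration and were not verified against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each heading is wrapped in its own div (assumed layout).
WRAPPED = """
<div class="mw-heading"><h2 id="See_also">See also</h2></div>
<ul><li><a href="/wiki/Intelligent_agent">Intelligent agent</a></li></ul>
<div class="mw-heading"><h2 id="Notes">Notes</h2></div>
<ul><li><a href="/wiki/Ignored">Ignored</a></li></ul>
"""

# Hypothetical fragment: headings and lists are direct siblings.
FLAT = (
    '<h2>See also</h2><ul><li><a href="/wiki/A">A</a></li></ul>'
    '<h2>Notes</h2><ul><li><a href="/wiki/B">B</a></li></ul>'
)

def section_links(html, title):
    """Collect (href, label) pairs between a heading and the next heading."""
    soup = BeautifulSoup(html, "html.parser")
    header = None
    for tag in soup.find_all(["h2", "h3"]):
        if title.lower() in tag.get_text(" ", strip=True).lower():
            header = tag
            break
    if header is None:
        return []
    # If the heading has a wrapper div, walk the wrapper's siblings instead.
    start = header
    parent = header.parent
    if parent is not None and parent.name == "div" and "mw-heading" in (parent.get("class") or []):
        start = parent
    links = []
    for sib in start.find_next_siblings():
        # Stop at the next heading, whether it is bare or wrapped.
        if sib.name in ("h2", "h3") or sib.find(["h2", "h3"]) is not None:
            break
        for a_tag in sib.find_all("a", href=True):
            label = a_tag.get_text(strip=True)
            if label:
                links.append((a_tag["href"], label))
    return links

print(section_links(WRAPPED, "See also"))
print(section_links(FLAT, "See also"))
```

In the wrapped fragment the h2 is the only child of its div, so calling find_next_siblings() on the h2 itself yields nothing, which would match the symptom described above.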

**Enhancements I Made**

I implemented several enhancements beyond the tutorial, all of which strengthened the scraper:

Added a Realistic User-Agent Header

Without this header, Wikipedia returned a 403 Forbidden error. Sending it makes the scraper more stable and more “browser-like.”


**Overall Learning Experience**

This assignment transformed my understanding of web scraping. I began by expecting to simply replicate the tutorial, but the failures forced me to analyze HTML deeply, think critically, and improve the scraper’s logic. I learned that web scraping is fundamentally a problem-solving activity, where the developer must adapt to imperfect or changing HTML structures.

This experience also clarified the difference between:

- Theoretical examples that work in controlled tutorials
- Real-world pages that evolve, break older selectors, and require flexible strategies

In the end, I created a significantly improved version of the code with modern, reliable extraction of both headings and “See also” links. The debugging process improved my HTML literacy, my BeautifulSoup skills, and my ability to design fallback logic.

Overall, this was an extremely valuable and realistic introduction to web scraping.