
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


In [6]:
# 1) Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")

Dependencies installed.


### 2) Common Imports & Polite Headers

In [7]:
# Common Imports & Polite Headers
import re
import sys

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}

def fetch_html(url: str, timeout: int = 20) -> str:
    """GET the URL with HEADERS and return the body text; raise on HTTP errors."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text

def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Flatten MultiIndex columns into single strings and strip column names."""
    if isinstance(df.columns, pd.MultiIndex):
        # Join the levels of each column tuple, skipping "nan" placeholders
        df.columns = [" ".join([str(x) for x in tup if str(x) != "nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df

print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

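**Tip:** You can use `pandas.read_html` to read tables and then pick one with ≥3 columns. Wrap the HTML string in `io.StringIO` first, because recent pandas versions deprecate passing literal HTML directly.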
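As a minimal sketch of this tip, the example below reads tables from an HTML string, picks the first one with at least three columns, and applies the cleaning rules listed above. The `SAMPLE_HTML` string and the `pick_table` helper are hypothetical, made up for illustration; the live page may be laid out differently.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Alpha-2 code</th><th>Alpha-3 code</th><th>Numeric</th></tr>
  <tr><td>Zambia</td><td>zm</td><td>zmb</td><td>894</td></tr>
  <tr><td>Yemen</td><td>ye</td><td>yem</td><td>unknown</td></tr>
</table>
"""

def pick_table(html: str, min_cols: int = 3) -> pd.DataFrame:
    """Return the first parsed table that has at least min_cols columns."""
    # StringIO gives read_html a file-like object instead of a literal string.
    tables = pd.read_html(StringIO(html))
    for table in tables:
        if table.shape[1] >= min_cols:
            return table
    raise ValueError(f"No table with at least {min_cols} columns found")

df = pick_table(SAMPLE_HTML)

# Uppercase the code columns after trimming whitespace.
for col in ["Alpha-2 code", "Alpha-3 code"]:
    df[col] = df[col].str.strip().str.upper()

# Non-numeric values become <NA>; "Int64" (capital I) is the nullable integer dtype.
df["Numeric"] = pd.to_numeric(df["Numeric"], errors="coerce").astype("Int64")

print(df)
```

The nullable `Int64` dtype keeps the column as integers even when some rows are missing. A plain `int` cast would raise an error on the missing value.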


In [9]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")


In [12]:
# Q1 — Write your answer here

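from io import StringIO  # read_html expects a file-like object for literal HTML

# Step 1: Fetch HTML content from the given URL
url_q1 = "https://www.iban.com/country-codes"
html_q1 = fetch_html(url_q1)

# Step 2: Implement q1_read_table
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML."""
    # Wrap the string in StringIO: passing literal HTML is deprecated in recent pandas.
    # keep_default_na=False keeps the text "NA" (Namibia's Alpha-2 code) from being read as missing.
    tables = pd.read_html(StringIO(html), keep_default_na=False)
    # Pick the first table with at least 3 columns (the expected one has 4)
    for table in tables:
        if table.shape[1] >= 3:
            return flatten_headers(table)
    raise ValueError("No table with at least 3 columns found")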

# Step 3: Implement q1_clean
def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: trim spaces, uppercase Alpha codes, cast Numeric to int (nullable)."""
    # Strip whitespace from all string cells
    # (DataFrame.map needs pandas >= 2.1; older versions use applymap)
    df = df.map(lambda x: x.strip() if isinstance(x, str) else x)

    # Strip whitespace from column names
    df.columns = [c.strip() for c in df.columns]

    # Keep only the expected columns that are present; copy so the
    # assignments below do not trigger SettingWithCopyWarning
    expected_cols = ['Country', 'Alpha-2 code', 'Alpha-3 code', 'Numeric']
    df = df[[col for col in expected_cols if col in df.columns]].copy()

    # Standardize Alpha-2 and Alpha-3 columns
    for col in ['Alpha-2 code', 'Alpha-3 code']:
        if col in df.columns:
            df[col] = df[col].str.upper()

    # Convert Numeric column to integer (nullable)
    if 'Numeric' in df.columns:
        df['Numeric'] = pd.to_numeric(df['Numeric'], errors='coerce').astype('Int64')

    # Drop rows missing essential fields
    df = df.dropna(subset=['Country'])
    return df

# Step 4: Implement q1_sort_top
def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N."""
    df_sorted = df.sort_values(by='Numeric', ascending=False, na_position='last')
    return df_sorted.head(top)

# Step 5: Run pipeline
df_q1 = q1_read_table(html_q1)
df_q1_clean = q1_clean(df_q1)
df_q1_top15 = q1_sort_top(df_q1_clean)

# Step 6: Output results
print("Top 15 Countries by Numeric Code (Descending):")
print(df_q1_top15)

# Step 7: Save CSV
df_q1_clean.to_csv("data_q1.csv", index=False)
print("\nFile saved as data_q1.csv")


Top 15 Countries by Numeric Code (Descending):
                                               Country Alpha-2 code  \
247                                             Zambia           ZM   
246                                              Yemen           YE   
192                                              Samoa           WS   
244                                  Wallis and Futuna           WF   
240                 Venezuela (Bolivarian Republic of)           VE   
238                                         Uzbekistan           UZ   
237                                            Uruguay           UY   
35                                        Burkina Faso           BF   
243                              Virgin Islands (U.S.)           VI   
236                     United States of America (the)           US   
219                       Tanzania, United Republic of           TZ   
108                                        Isle of Man           IM   
113                           



## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
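The row pairing described in this tip can be sketched on a small inline snippet. The markup in `SAMPLE_HTML` is hypothetical and simplified, modeled only on the class names mentioned above, and `to_int` and `parse_sample` are illustrative helper names. The sketch uses Python's built-in `html.parser` so it runs without extra parser packages.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the live page may differ.
SAMPLE_HTML = """
<table>
  <tr class="athing">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td>
  </tr>
  <tr><td class="subtext">
    <span class="score">120 points</span> by <a class="hnuser">alice</a>
    | <a href="item?id=1">45&nbsp;comments</a>
  </td></tr>
  <tr class="athing">
    <td><span class="rank">2.</span></td>
    <td><span class="titleline"><a href="https://example.com/b">Job posting</a></span></td>
  </tr>
  <tr><td class="subtext"><a href="item?id=2">2 hours ago</a></td></tr>
</table>
"""

def to_int(text: str) -> int:
    """Return the first run of digits in text as an int, or 0 if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else 0

def parse_sample(html: str) -> list[dict]:
    """Pair each .athing row with the .subtext cell in the following <tr>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("tr.athing"):
        rank_tag = item.select_one(".rank")
        title_tag = item.select_one(".titleline a")

        # The details live in the next <tr>; either lookup may come back empty.
        sub_row = item.find_next_sibling("tr")
        subtext = sub_row.select_one(".subtext") if sub_row else None
        score_tag = subtext.select_one(".score") if subtext else None
        links = subtext.find_all("a") if subtext else []

        # Treat the last link as the comment count only if its text says so.
        last_text = links[-1].get_text() if links else ""
        comments_text = last_text if "comment" in last_text else ""

        rows.append({
            "rank": to_int(rank_tag.get_text() if rank_tag else ""),
            "title": title_tag.get_text(strip=True) if title_tag else "",
            "link": title_tag.get("href", "") if title_tag else "",
            "points": to_int(score_tag.get_text() if score_tag else ""),
            "comments": to_int(comments_text),
        })
    return rows

rows = parse_sample(SAMPLE_HTML)
for row in rows:
    print(row)
```

The second sample row has no score and no comments link, which is how job postings commonly appear, so both fields fall back to 0 instead of raising an error.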


In [None]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")


In [13]:
# Q2 — Write your answer here

# Step 1: Fetch HTML
url_q2 = "https://news.ycombinator.com/"
html_q2 = fetch_html(url_q2)

# Step 2: Implement q2_parse_items
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    """
    soup = BeautifulSoup(html, "lxml")
    items = soup.select(".athing")

    data = []
    for item in items:
        rank_tag = item.select_one(".rank")
        title_tag = item.select_one(".titleline a")

        rank = rank_tag.text.replace(".", "").strip() if rank_tag else ""
        title = title_tag.text.strip() if title_tag else ""
        link = title_tag["href"] if title_tag and title_tag.has_attr("href") else ""

        # Subtext row (next sibling <tr>); guard against a missing sibling
        sub_row = item.find_next_sibling("tr")
        subtext = sub_row.select_one(".subtext") if sub_row else None

        # Defaults used when the subtext row or one of its fields is absent
        points, user, comments = "", "", "0"
        if subtext:
            points_tag = subtext.select_one(".score")
            user_tag = subtext.select_one(".hnuser")
            links = subtext.find_all("a")  # last <a> is usually the comments link

            points = points_tag.text if points_tag else ""
            user = user_tag.text if user_tag else ""
            if links and "comment" in links[-1].text:
                comments = links[-1].text

        data.append({
            "rank": rank,
            "title": title,
            "link": link,
            "points": points,
            "comments": comments,
            "user": user
        })

    return pd.DataFrame(data)


# Step 3: Implement q2_clean
def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values."""
    # Fill missing text fields
    df = df.fillna("")

    # Extract numeric values safely; replace non-digits with 0
    def extract_int(x):
        nums = re.findall(r"\d+", str(x))
        return int(nums[0]) if nums else 0

    df["rank"] = df["rank"].apply(extract_int)
    df["points"] = df["points"].apply(extract_int)
    df["comments"] = df["comments"].apply(extract_int)

    # Fill text fields where blank
    for col in ["title", "link", "user"]:
        df[col] = df[col].replace("", "N/A")

    return df


# Step 4: Implement q2_sort_top
def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N."""
    df_sorted = df.sort_values(by="points", ascending=False)
    return df_sorted.head(top)


# Step 5: Run pipeline
df_q2 = q2_parse_items(html_q2)
df_q2_clean = q2_clean(df_q2)
df_q2_top15 = q2_sort_top(df_q2_clean)

# Step 6: Output results
print("Top 15 Hacker News Stories (sorted by points):")
print(df_q2_top15[["rank", "title", "points", "comments", "user"]])

# Step 7: Save CSV
df_q2_clean.to_csv("data_q2.csv", index=False)
print("\nFile saved as data_q2.csv")

Top 15 Hacker News Stories (sorted by points):
    rank                                              title  points  comments  \
10    11  YouTube Removes Windows 11 Bypass Tutorials, C...     469       176   
8      9                            Why I love OCaml (2023)     307       205   
24    25  VLC's Jean-Baptiste Kempf Receives the Europea...     272        44   
25    26                              James Watson has died     266       147   
5      6  Myna: Monospace typeface designed for symbol-h...     224        84   
0      1                                Why is Zig so cool?     211        92   
7      8                             Ruby Solved My Problem     194        73   
6      7                                How did I get here?     162        33   
3      4                       Becoming a Compiler Engineer     156        63   
1      2  Snapchat open-sources Valdi a cross-platform U...     132        32   
20    21                     Angel Investors, a Field Guide   