
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


In [3]:
#1) Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


Dependencies installed.


### 2) Common Imports & Polite Headers

In [4]:
# Common Imports & Polite Headers
import re
import sys

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request.
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}

def fetch_html(url: str, timeout: int = 20) -> str:
    """Download a page and return its HTML; raise on an HTTP error status."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text

def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Flatten MultiIndex columns into single strings (modifies df in place)."""
    if isinstance(df.columns, pd.MultiIndex):
        # Join the levels of each column tuple, skipping "nan" levels.
        df.columns = [" ".join([str(x) for x in tup if str(x) != "nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df

print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)
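
The cleaning rules above can be sketched on a few hand-written rows. This is a minimal illustration, not the required solution: the sample rows (including the made-up `Nowhere` entry) are assumptions, and only the column names come from the task.

```python
import pandas as pd

# Hypothetical sample rows; the real table comes from the IBAN page.
raw = pd.DataFrame({
    "Country": ["  Zambia ", "Yemen", "Nowhere"],
    "Alpha-2": ["zm", " YE", "xx"],
    "Alpha-3": ["zmb", "yem ", "xxx"],
    "Numeric": ["894", "887", "n/a"],
})

clean = raw.copy()
clean["Country"] = clean["Country"].str.strip()
for col in ["Alpha-2", "Alpha-3"]:
    # Trim first, then uppercase the code.
    clean[col] = clean[col].str.strip().str.upper()

# errors="coerce" turns unparseable values into NaN; "Int64" (capital I) is
# pandas' nullable integer dtype, so the column stays integer and holds <NA>.
clean["Numeric"] = pd.to_numeric(clean["Numeric"], errors="coerce").astype("Int64")

print(clean.sort_values("Numeric", ascending=False))
```

Using the nullable `Int64` dtype avoids the float conversion that a plain `int` cast would force once a missing value appears.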

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
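
As a rough sketch of this tip, the snippet below runs `pandas.read_html` on a small inline page instead of the live site. The two-table HTML is a made-up stand-in, and the `keep_default_na=False` argument is a suggestion that the assignment text does not require; it assumes an HTML parser such as `lxml` is installed, as in the install cell.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page standing in for the fetched HTML.
html = """
<table><tr><th>Note</th></tr><tr><td>layout only</td></tr></table>
<table>
  <tr><th>Country</th><th>Alpha-2</th><th>Alpha-3</th><th>Numeric</th></tr>
  <tr><td>Namibia</td><td>NA</td><td>NAM</td><td>516</td></tr>
  <tr><td>Zambia</td><td>ZM</td><td>ZMB</td><td>894</td></tr>
</table>
"""

# StringIO wraps the literal HTML; keep_default_na=False stops pandas from
# reading the string "NA" (Namibia's Alpha-2 code) as a missing value.
tables = pd.read_html(StringIO(html), keep_default_na=False)

# Keep only tables wide enough to hold the country codes.
wide = [t for t in tables if t.shape[1] >= 3]
df = wide[0]
print(df)
```

Without `keep_default_na=False`, the `NA` cell would be parsed as missing, and a later `dropna` would silently remove that row.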


In [10]:
# --- Q1 Skeleton (fill the TODOs) ---
from io import StringIO

def q1_read_table(html: str) -> pd.DataFrame:
    # Wrap the HTML in StringIO (newer pandas versions warn on literal HTML strings).
    # keep_default_na=False keeps Namibia's Alpha-2 code "NA" from being read as missing.
    tables = pd.read_html(StringIO(html), keep_default_na=False)
    for table in tables:
        # q1_clean renames the first four columns, so require at least four.
        if table.shape[1] >= 4:
            return table
    raise ValueError("No table with at least 4 columns found in the HTML")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Rename the first four columns by position to the names the task requires.
    rename = {
        df.columns[0]: "Country",
        df.columns[1]: "Alpha-2",
        df.columns[2]: "Alpha-3",
        df.columns[3]: "Numeric"
    }
    df = df.rename(columns=rename)
    df["Country"] = df["Country"].str.strip()
    # Trim and uppercase both code columns.
    for col in ["Alpha-2", "Alpha-3"]:
        df[col] = df[col].str.strip().str.upper()
    # Unparseable values become <NA> in the nullable Int64 column.
    df["Numeric"] = pd.to_numeric(df["Numeric"], errors="coerce").astype("Int64")
    df = df.dropna(subset=["Country", "Alpha-2", "Alpha-3", "Numeric"])
    return df

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    # Do not name the result `sorted`; that would shadow the built-in.
    return df.sort_values("Numeric", ascending=False).head(top)


In [11]:
# Q1 — Write your answer here

url = "https://www.iban.com/country-codes"
html = fetch_html(url)
df_raw = q1_read_table(html)
df_clean = q1_clean(df_raw)
df_top15 = q1_sort_top(df_clean, 15)
df_clean.to_csv("data_q1.csv", index=False)
print(df_top15)




                                               Country Alpha-2 Alpha-3  \
247                                             Zambia      ZM     ZMB   
246                                              Yemen      YE     YEM   
192                                              Samoa      WS     WSM   
244                                  Wallis and Futuna      WF     WLF   
240                 Venezuela (Bolivarian Republic of)      VE     VEN   
238                                         Uzbekistan      UZ     UZB   
237                                            Uruguay      UY     URY   
35                                        Burkina Faso      BF     BFA   
243                              Virgin Islands (U.S.)      VI     VIR   
236                     United States of America (the)      US     USA   
219                       Tanzania, United Republic of      TZ     TZA   
108                                        Isle of Man      IM     IMN   
113                                   



## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
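
The row pairing described in the tip can be sketched on a hand-written snippet. The markup below is a simplified, hypothetical imitation of the front page (the real page has more cells and links), and `to_int` is an illustrative helper name, not part of the skeleton.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the story/details row pairing.
html = """
<table>
  <tr class="athing"><td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">120 points</span>
    <a href="item?id=1">2 hours ago</a> | <a href="item?id=1">45 comments</a></td></tr>
  <tr class="athing"><td><span class="rank">2.</span></td>
    <td><span class="titleline"><a href="https://example.com/b">Job post</a></span></td></tr>
  <tr><td class="subtext"><a href="item?id=2">1 hour ago</a></td></tr>
</table>
"""

def to_int(text: str) -> int:
    """Keep only the digits of text; return 0 when there are none."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else 0

soup = BeautifulSoup(html, "html.parser")
stories = []
for row in soup.select("tr.athing"):
    sub = row.find_next_sibling("tr")  # the paired details row
    score = sub.select_one(".score") if sub else None
    # Several links can point at item?id=..., so pick the one about comments.
    links = sub.select("a[href*='item?id=']") if sub else []
    comments = next((a for a in links if "comment" in a.get_text()), None)
    rank = row.select_one(".rank")
    title = row.select_one(".titleline > a")
    stories.append({
        "rank": to_int(rank.get_text()) if rank else 0,
        "title": title.get_text(strip=True) if title else "",
        "link": title.get("href", "") if title else "",
        "points": to_int(score.get_text()) if score else 0,
        "comments": to_int(comments.get_text()) if comments else 0,
    })

for story in stories:
    print(story)
```

Rows without a score or a comments link fall back to 0, which matches the "non-digits → 0" cleaning rule.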


In [12]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for row in soup.select("tr.athing"):
        rank = row.select_one(".rank")
        title_link = row.select_one(".titleline > a")
        # Points and comments live in the next <tr>, inside the .subtext cell.
        subtext_row = row.find_next_sibling("tr")
        sub_points = subtext_row.select_one(".score") if subtext_row else None
        # More than one link can match item?id= (the age link also does),
        # so keep the one whose text mentions comments.
        item_links = subtext_row.select("a[href*='item?id=']") if subtext_row else []
        sub_comments = next((a for a in item_links if "comment" in a.text), None)

        items.append({
            "rank": rank.text.replace(".", "").strip() if rank else "",
            "title": title_link.text.strip() if title_link else "",
            "link": title_link.get("href", "").strip() if title_link else "",
            # Raw text such as "123 points" or "45 comments"; q2_clean keeps the digits.
            "points": sub_points.text.strip() if sub_points else "",
            "comments": sub_comments.text.strip() if sub_comments else "",
        })
    return pd.DataFrame(items, columns=["rank", "title", "link", "points", "comments"])

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in ["rank", "points", "comments"]:
        # Keep only the digits; values without any digits become 0.
        digits = df[col].astype(str).str.replace(r"\D", "", regex=True)
        df[col] = digits.replace("", "0").astype(int)
    for col in ["title", "link"]:
        df[col] = df[col].fillna("")
    return df

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    return df.sort_values("points", ascending=False).head(top)


In [13]:
# Q2 — Write your answer here

url = "https://news.ycombinator.com/"
html = fetch_html(url)
df_raw = q2_parse_items(html)
df_clean = q2_clean(df_raw)
df_top15 = q2_sort_top(df_clean, 15)
df_clean.to_csv("data_q2.csv", index=False)
print(df_top15)



    rank                                              title  \
9     10                   Solarpunk is happening in Africa   
17    18                          End of Japanese community   
4      5                             Ratatui – App Showcase   
29    30  New gel restores dental enamel and could revol...   
23    24                   Why aren't smart people happier?   
15    16      Dillo, a multi-platform graphical web browser   
18    19  ChatGPT terms disallow its use in providing le...   
16    17  Firefox profiles: Private, focused spaces for ...   
0      1  Open Source Implementation of Apple's Private ...   
28    29                  Ruby and Its Neighbors: Smalltalk   
21    22   The trust collapse: Infinite AI content is awful   
7      8  Cloudflare Tells U.S. Govt That Foreign Site B...   
20    21  IKEA launches new smart home range with 21 Mat...   
5      6    Mathematical exploration and discovery at scale   
13    14                  How I am deeply integrating E