
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.



## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.
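
The Runner step described above boils down to saving the full table and previewing the top rows. A minimal sketch, assuming a small made-up DataFrame and a hypothetical `save_and_preview` helper (neither is part of this notebook, and the actual Runner cell may differ):

```python
import os
import tempfile

import pandas as pd


def save_and_preview(df: pd.DataFrame, sort_col: str, path: str, top: int = 15) -> pd.DataFrame:
    """Write the full frame to CSV and return the top rows by sort_col (descending).

    Hypothetical helper for illustration only.
    """
    df.to_csv(path, index=False)  # save every row, without the index column
    return df.sort_values(by=sort_col, ascending=False).head(top)


# Made-up sample data, not scraped from any site.
sample = pd.DataFrame({"title": ["a", "b", "c"], "points": [5, 42, 17]})
out_path = os.path.join(tempfile.gettempdir(), "data_demo.csv")
preview = save_and_preview(sample, "points", out_path, top=2)
print(preview.to_string(index=False))
```

The CSV holds all rows, while the preview is only a sorted slice, which matches the expected outputs listed above.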


In [9]:
# Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


Dependencies installed.


## Common Imports & Polite Headers




In [10]:
# Common Imports & Polite Headers
import re
import sys

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request.
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}

def fetch_html(url: str, timeout: int = 20) -> str:
    """Download a page and return its HTML text; raise on an HTTP error status."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text

def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse MultiIndex column headers into single stripped strings."""
    if isinstance(df.columns, pd.MultiIndex):
        # Join the header levels with spaces, skipping "nan" placeholders.
        df.columns = [" ".join([str(x) for x in tup if str(x) != "nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df

print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)
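
The cleaning rules listed above can be tried on a toy table before touching the real page. A minimal sketch, assuming made-up rows (the values below are illustrative, not scraped data):

```python
import pandas as pd

# Made-up rows that mimic messy scraped values; not real IBAN data.
raw = pd.DataFrame({
    "Country": ["  Zambia ", "Yemen", "Nowhere"],
    "Alpha-2": ["zm", " ye", "xx"],
    "Alpha-3": ["zmb", "yem ", "xxx"],
    "Numeric": ["894", " 887 ", "n/a"],
})

clean = raw.copy()
for col in ["Country", "Alpha-2", "Alpha-3"]:
    clean[col] = clean[col].str.strip()  # trim surrounding spaces
for col in ["Alpha-2", "Alpha-3"]:
    clean[col] = clean[col].str.upper()  # codes in UPPERCASE

# errors="coerce" turns unparseable text into NaN; Int64 keeps integers nullable.
clean["Numeric"] = pd.to_numeric(clean["Numeric"].str.strip(), errors="coerce").astype("Int64")

print(clean.to_string(index=False))
```

The nullable `Int64` dtype lets the column stay integer-typed even when a row has no valid code, which is what "nullable OK" allows.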

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
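
To illustrate the tip, here is a minimal sketch that runs `pandas.read_html` on a made-up HTML snippet (not the IBAN page) and picks the first table with at least 3 columns. It assumes `lxml` is installed, as in the install cell of this notebook.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; only the second has >= 3 columns.
html = """
<table><tr><th>k</th><th>v</th></tr><tr><td>a</td><td>1</td></tr></table>
<table>
  <tr><th>Country</th><th>Alpha-2</th><th>Alpha-3</th><th>Numeric</th></tr>
  <tr><td>Zambia</td><td>ZM</td><td>ZMB</td><td>894</td></tr>
  <tr><td>Yemen</td><td>YE</td><td>YEM</td><td>887</td></tr>
</table>
"""

# Wrapping the string in StringIO avoids the deprecation warning that
# recent pandas versions emit for literal HTML strings.
tables = pd.read_html(StringIO(html))

# next(..., None) returns None instead of raising when no table qualifies.
wide = next((t for t in tables if t.shape[1] >= 3), None)
print(wide)
```

Checking for `None` before continuing gives a clearer failure than an uncaught `StopIteration` if the page layout ever changes.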


In [11]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")


In [16]:
# Q1 — Write your answer here

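from io import StringIO  # lets read_html treat the HTML string as a file-like object

URL_Q1 = "https://www.iban.com/country-codes"  # link to pull country code table from

def q1_read_table(html: str) -> pd.DataFrame:  # function to read table from HTML
    """Snags the main table, tries to use the first row as headers."""

    # Wrap the string in StringIO: passing literal HTML to read_html is deprecated.
    # keep_default_na=False keeps text such as Namibia's code "NA" from being read as missing.
    tables = pd.read_html(StringIO(html), header=0, keep_default_na=False)  # read all HTML tables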

    df = next(df for df in tables if df.shape[1] >= 3)  # grab first table with enough columns

    df = flatten_headers(df)  # flatten messy multi-row headers

    df.columns = ['Country', 'Alpha-2', 'Alpha-3', 'Numeric']  # rename columns cleanly

    return df  # return cleaned raw table

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:  # function to clean table
    """Cleaning time! Stripping spaces, upper-casing, and making 'Numeric' an actual number."""

    for col in df.select_dtypes(include='object').columns:  # loop over string columns
         df[col] = df[col].str.strip()  # remove extra spaces

    for col in ['Alpha-2', 'Alpha-3']:  # enforce uppercase for codes
        df[col] = df[col].str.upper()  # convert to uppercase

    df['Numeric'] = df['Numeric'].astype(str).str.replace(r'[^\d]', '', regex=True)  # strip non-digits
    df['Numeric'] = pd.to_numeric(df['Numeric'], errors='coerce').astype('Int64')  # convert to numeric

    df.dropna(subset=['Numeric'], inplace=True)  # remove rows missing Numeric code

    return df  # return cleaned dataframe

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:  # function to sort by numeric code
    """Sort descending by Numeric and return the top N rows, ez."""

    df_sorted = df.sort_values(by='Numeric', ascending=False)  # sort biggest numeric codes first

    return df_sorted.head(top)  # return top N

try:
    html_q1 = fetch_html(URL_Q1)  # fetch HTML text from website
except requests.exceptions.RequestException as e:  # catch HTTP errors, timeouts, and connection failures
    print(f"Could not fetch the page: {e}")  # print error message
    sys.exit(1)  # exit program

df_q1 = q1_read_table(html_q1)  # read raw table from HTML

df_q1_clean = q1_clean(df_q1.copy())  # clean a copy of the table

output_filename = 'data_q1.csv'  # choose filename for saving

df_q1_clean.to_csv(output_filename, index=False)  # save cleaned data to CSV
print(f"Data done! Full data saved to {output_filename}")  # confirm save

df_q1_top15 = q1_sort_top(df_q1_clean)  # get top 15 countries by numeric code
print(df_q1_top15.to_string())  # print formatted output

Data done! Full data saved to data_q1.csv
                                                        Country Alpha-2 Alpha-3  Numeric
247                                                      Zambia      ZM     ZMB      894
246                                                       Yemen      YE     YEM      887
192                                                       Samoa      WS     WSM      882
244                                           Wallis and Futuna      WF     WLF      876
240                          Venezuela (Bolivarian Republic of)      VE     VEN      862
238                                                  Uzbekistan      UZ     UZB      860
237                                                     Uruguay      UY     URY      858
35                                                 Burkina Faso      BF     BFA      854
243                                       Virgin Islands (U.S.)      VI     VIR      850
236                              United States of America (the)     



## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
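
The row pairing described in the tip can be sketched on a small made-up snippet. The markup below only imitates the `.athing` / `.subtext` layout and is an assumption; the live page may differ. The built-in `html.parser` is used so the example needs nothing beyond `beautifulsoup4`.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the row pairing described in the tip.
html = """
<table>
  <tr class="athing"><td><span class="rank">1.</span></td>
      <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr><td class="subtext"><span class="score">120 points</span>
      <a href="item?id=1">45&nbsp;comments</a></td></tr>
  <tr class="athing"><td><span class="rank">2.</span></td>
      <td><span class="titleline"><a href="https://example.com/b">Job post</a></span></td></tr>
  <tr><td class="subtext"><a href="item?id=2">2 hours ago</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install
rows = []
for story in soup.select("tr.athing"):
    title_tag = story.select_one(".titleline a")
    detail = story.find_next_sibling("tr")  # the row holding .subtext
    score = detail.select_one(".score") if detail else None
    links = detail.find_all("a") if detail else []
    last = links[-1].get_text() if links else ""
    rows.append({
        "title": title_tag.get_text(strip=True) if title_tag else "",
        "link": title_tag.get("href", "") if title_tag else "",
        # str.split() with no argument also splits on non-breaking spaces
        "points": score.get_text().split()[0] if score else "",
        "comments": last.split()[0] if "comment" in last else "",
    })

print(rows)
```

The second story has no score and no comments link, so both fields stay empty and can be turned into 0 during cleaning.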


In [13]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")
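
The "non-digits -> 0" rule from the skeleton can be checked on a toy frame first. A minimal sketch, assuming made-up scraped strings; it uses `str.extract` to pull the first run of digits, which is one possible approach and not the only way to meet the requirement:

```python
import pandas as pd

# Made-up scraped strings, including blanks and non-numeric text.
raw = pd.DataFrame({
    "rank": ["1.", "2.", ""],
    "points": ["120 points", "", "7 points"],
    "comments": ["45 comments", "discuss", None],
})

clean = raw.copy()
for col in ["rank", "points", "comments"]:
    # Pull out the first run of digits; rows without digits become NaN.
    digits = clean[col].astype(str).str.extract(r"(\d+)", expand=False)
    # NaN -> 0, then cast to a plain int column.
    clean[col] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(clean.to_string(index=False))
```

Filling with 0 before the cast matters because a column containing NaN cannot be converted to a plain `int` dtype.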


In [17]:
# Q2 — Write your answer here

URL_Q2 = "https://news.ycombinator.com/"  # URL for Hacker News front page


def q2_parse_items(html: str) -> pd.DataFrame:  # function to parse story info from HTML
    """Parse front page items into DataFrame columns: rank, title, link, points, comments."""

    soup = BeautifulSoup(html, 'lxml')  # load HTML into BeautifulSoup
    items = []  # list to hold parsed story dictionaries

    title_rows = soup.select('tr.athing')  # find all rows that represent a story

    for row in title_rows:  # loop through each story row
        rank_tag = row.select_one('.rank')  # grab the rank element
        rank = rank_tag.text.strip().replace('.', '') if rank_tag else ''  # clean the rank

        title_tag = row.select_one('.titleline a')  # grab the title link
        title = title_tag.text.strip() if title_tag else ''  # extract text
        link = title_tag['href'] if title_tag and title_tag.has_attr('href') else ''  # extract URL

        subtext_row = row.find_next_sibling('tr')  # get row containing points/comments

        points = ''  # default points
        comments = ''  # default comment count

        if subtext_row:  # ensure subtext row exists
            score_tag = subtext_row.select_one('.score')  # points element
            if score_tag:
                points = score_tag.text.strip().split(' ')[0]  # extract number part

            links = subtext_row.find_all('a')  # all links in the subtext row
            comment_text = links[-1].text.lower() if links else ''  # last link is usually the comments link

            if 'comment' in comment_text:  # e.g. "156 comments"
                comments = comment_text.split()[0]  # split() handles regular and non-breaking spaces
            elif 'discuss' in comment_text:
                comments = '0'  # treat discuss as zero comments

        items.append({  # build row dictionary
            'rank': rank,
            'title': title,
            'link': link,
            'points': points,
            'comments': comments,
        })

    return pd.DataFrame(items)  # return DataFrame of results


def q2_clean(df: pd.DataFrame) -> pd.DataFrame:  # function to clean parsed data
    """Clean numeric fields and fill missing values."""

    df['title'] = df['title'].fillna('').replace('', 'No Title')  # fill missing or empty titles
    df['link'] = df['link'].fillna('').replace('', 'No Link')  # fill missing or empty links

    for col in ['rank', 'points', 'comments']:  # numeric-like columns
        df[col] = df[col].astype(str).str.replace(r'[^\d]', '', regex=True)  # strip non-digits
        df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)  # convert to int

    return df  # return cleaned DataFrame


def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:  # function to sort top stories
    """Sort by points desc and return Top-N."""

    df_sorted = df.sort_values(by='points', ascending=False)  # sort by points

    return df_sorted.head(top)  # return top N rows


try:
    html_q2 = fetch_html(URL_Q2)  # fetch the Hacker News HTML
except requests.exceptions.RequestException as e:  # catch HTTP errors, timeouts, and connection failures
    print(f"Could not fetch the page: {e}")  # error message
    sys.exit(1)  # exit if URL fails

df_q2 = q2_parse_items(html_q2)  # parse the HTML into a table

df_q2_clean = q2_clean(df_q2.copy())  # clean up numeric and text fields

output_filename = 'data_q2.csv'  # set output filename

df_q2_clean.to_csv(output_filename, index=False)  # save cleaned data to CSV
print(f"Q2 data saved to {output_filename}")  # confirm save

df_q2_top15 = q2_sort_top(df_q2_clean)  # get top 15 stories
print(df_q2_top15[['rank', 'title', 'points', 'comments']].to_string())  # print final table

Q2 data saved to data_q2.csv
    rank                                                                        title  points  comments
13    14  YouTube Removes Windows 11 Bypass Tutorials, Claims 'Risk of Physical Harm'     411       156
9     10                                                      Why I love OCaml (2023)     297       203
22    23               VLC's Jean-Baptiste Kempf Receives the European SFS Award 2025     249        41
24    25                                                        James Watson has died     241       134
5      6     Myna: Monospace typeface designed for symbol-heavy programming languages     206        81
7      8                                                       Ruby Solved My Problem     173        68
0      1                                                          Why is Zig so cool?     161        56
6      7                                                          How did I get here?     127        33
2      3                           

