%%html

<style>
table {display: block;}
td {
  font-size: 20px
}
.rendered_html { font-size: 20px; }
*{ line-height: 200%; }

.task {
    color: green;
    font-weight: bold;
    background-color: #f0f0f0;
    padding: 2px;
}

.question {
    color: red;
    font-size: 20px;
    font-weight: bold;
}
span.task {
    color: green !important;
    font-weight: bold;
    background-color: #f0f0f0;
    padding: 2px;
}
</style>

# <span style="color:blue">Natural Language Processing and the Web WS25/26</span>  
# Assignment 01
### Deadline: Friday October 24
## <span class='question'>Submission Instructions</span>
- Make only one submission per group.
- Provide the full names of all group members as displayed in Moodle.
- Submit the notebook, including both the solution and its output.
- Do not submit any output files.

### <span style="color:blue">Group Members:</span>

Gaurika Chopra

Nasrul Huda

In [13]:
### Importing Libraries

import requests
from bs4 import BeautifulSoup

import csv
import json
import re
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import random

# Download required NLTK data files (only once)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

## <span class='task'>TASK-01</span> - 10%
Update the `scrape_hacker_news` function from Example 02, Practice Class 01. The current function only scrapes article titles from the first page of Hacker News.

1. Modify the function to scrape articles and their URLs from multiple pages. Add a parameter that lets the user specify the maximum number of pages to scrape, with a default value of 1.
2. Add a parameter that lets the user specify the name of the file in which the extracted results are saved in CSV format. Strip any leading and trailing whitespace characters before saving the results.
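
As a minimal sketch of the whitespace-stripping requirement in item 2, the snippet below writes a few made-up rows to an in-memory buffer instead of a file. The helper name `write_articles_csv` and the sample rows are assumptions for illustration only; they are not part of Example 02.

```python
import csv
import io


def write_articles_csv(rows, handle):
    """Write (title, url) pairs as CSV, stripping surrounding whitespace."""
    writer = csv.writer(handle)
    writer.writerow(["title", "url"])
    for title, url in rows:
        # str.strip() removes leading and trailing whitespace, including
        # tabs and newlines; whitespace inside the value is left untouched.
        writer.writerow([title.strip(), url.strip()])


# Hypothetical sample rows with stray whitespace around the values.
sample_rows = [
    ("  Show HN: A tiny parser \n", " https://example.com/parser "),
    ("\tAsk HN: What are you reading?", "item?id=1  "),
]

buffer = io.StringIO(newline="")
write_articles_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a real file works the same way if `handle` comes from `open(path, "w", newline="", encoding="utf-8")`.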


In [5]:
# Solution

# Base URL
BASE_URL = "https://news.ycombinator.com/"
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}


def scrape_hacker_news(start_url=BASE_URL, max_pages=1, output_file="hacker_news.csv"):
    """
    Scrapes article titles and URLs from Hacker News across multiple pages
    and saves the results to a CSV file.
    
    Parameters:
        start_url (str): Starting URL of Hacker News.
        max_pages (int): Maximum number of pages to scrape. Default is 1.
        output_file (str): File name to save the extracted results in CSV format.

    Returns:
        list[dict] | None: One {"title", "url"} dict per article, or None
        if a request fails.
    """
    all_articles = []
    current_url = start_url

    try:
        for page in range(1, max_pages + 1):
            print(f"\n--- Scraping page {page}: {current_url} ---")

            # Fetch the page
            response = requests.get(current_url, headers=HEADERS, timeout=10)
            response.raise_for_status()

            # Parse HTML
            soup = BeautifulSoup(response.content, "html.parser")

            # Extract all article rows
            articles = soup.find_all(class_="athing")
            if not articles:
                print("No articles found on this page. Stopping.")
                break

            # Process articles
            for article in articles:
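                # Guard against rows without a "titleline" element so that a
                # missing node falls back to "N/A" instead of raising AttributeError.
                title_line = article.find(class_="titleline")
                title_tag = title_line.find("a") if title_line else None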
                title = title_tag.text.strip() if title_tag else "N/A"
                url = title_tag["href"].strip() if title_tag else "N/A"
                all_articles.append({"title": title, "url": url})

            # Find the "More" link to get next page
            more_link = soup.find("a", string="More")
            if more_link and more_link.get("href"):
                current_url = BASE_URL + more_link["href"]
            else:
                print("No further pages found.")
                break

        # Save to CSV
        output_file = output_file.strip()  # Remove any leading/trailing spaces
        with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=["title", "url"])
            writer.writeheader()
            writer.writerows(all_articles)

        print(f"\n--- Scraping complete! Extracted {len(all_articles)} articles ---")
        print(f"Results saved to: {output_file}\n")

        return all_articles

    except requests.exceptions.RequestException as e:
        print(f"An error occurred while fetching {current_url}: {e}")
        return None


# Example usage
if __name__ == "__main__":
    scrape_hacker_news(BASE_URL, max_pages=7, output_file="hn_articles.csv")



--- Scraping page 1: https://news.ycombinator.com/ ---

--- Scraping page 2: https://news.ycombinator.com/?p=2 ---

--- Scraping page 3: https://news.ycombinator.com/?p=3 ---

--- Scraping page 4: https://news.ycombinator.com/?p=4 ---

--- Scraping page 5: https://news.ycombinator.com/?p=5 ---

--- Scraping page 6: https://news.ycombinator.com/?p=6 ---

--- Scraping page 7: https://news.ycombinator.com/?p=7 ---

--- Scraping complete! Extracted 210 articles ---
Results saved to: hn_articles.csv



## <span class='task'>TASK-02</span> - 50%
Use `scrape_hacker_news` to scrape 15 pages for this task and save the results to `hackernews-articles.csv`. Assume the file is structured as follows:
```
Title, URL
title1, url1
title2, url2
...
```

Perform the following tasks on the article titles:

#### 1. Build a lookup index using words as keys
- Tokenize the original titles into individual words/tokens.
- Collect a list of all unique words that appear across all articles.
- Create a dictionary where:
     * Key = token/word
     * Value = list of article IDs (or indices) containing that word
     
       For example:
       ```
       {
           "AI": [0, 2, 5],      # IDs/indices of articles containing 'AI'
           "Python": [0, 2, 9],  # IDs/indices of articles containing 'Python'
       }
       ```
       > Here 0, 2, 5, and 9 are indices of articles in `hackernews-articles.csv`.

- Save the resulting dictionary as `word-index.json`.
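
The steps above can be sketched on a toy corpus as follows. This is only an illustration under stated assumptions: the titles are made up, the function name `build_word_index` is hypothetical, a simple regex stands in for a real tokenizer, and keys are lowercased (unlike the `"AI"` key in the example above).

```python
import json
import re


def build_word_index(titles):
    """Map each lowercased word to the list of title indices containing it."""
    index = {}
    for article_id, title in enumerate(titles):
        # A simple regex tokenizer stands in for a real tokenizer here.
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        # set() ensures an article is listed once per word even if the
        # word occurs several times in the same title.
        for token in set(tokens):
            index.setdefault(token, []).append(article_id)
    return index


# Made-up titles for illustration only.
toy_titles = [
    "Python and AI: a Python primer",
    "Why Rust?",
    "AI in Python notebooks",
]

toy_index = build_word_index(toy_titles)
print(toy_index["python"])  # [0, 2]
print(toy_index["ai"])      # [0, 2]

# JSON round trip: keys stay strings and posting lists stay lists of ints.
restored = json.loads(json.dumps(toy_index))
print(restored == toy_index)  # True
```

Because articles are visited in order, each posting list comes out in ascending index order without an explicit sort.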

#### 2. Repeat the same steps to build two additional lookup indices

- Stem Index: Apply a stemmer (e.g., PorterStemmer) to each token. Use the stemmed form of the token as the dictionary key. Save the resulting dictionary as `stem-index.json`.
- Lemma Index: Apply a lemmatizer (e.g., WordNetLemmatizer) using appropriate POS tags. Use the lemmatized form of each token as the dictionary key. Save the resulting dictionary as `lemma-index.json`.

#### 3. Print vocab size of all three indices
After building all three indices (word-index, stem-index, lemma-index), print the vocabulary size (number of unique keys) for each. 
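
One way to think about the three indices is as the same index-building loop with a different key normalizer plugged in. The sketch below assumes a hypothetical `build_index` helper and a deliberately crude `toy_stem` function that is not the Porter algorithm; in the assignment the normalizer would be something like `PorterStemmer().stem`, and lemmatization additionally needs POS tags computed per title. It also shows why normalization tends to shrink the vocabulary.

```python
import re


def build_index(titles, normalize):
    """Build an inverted index whose keys are normalize(token)."""
    index = {}
    for article_id, title in enumerate(titles):
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        # Normalize first, then de-duplicate, because two different tokens
        # in one title can collapse onto the same key.
        for key in {normalize(token) for token in tokens}:
            index.setdefault(key, []).append(article_id)
    return index


def toy_stem(token):
    """Crude stand-in for a stemmer: strips a trailing 'ing' or 's'.

    This is NOT the Porter algorithm; it only illustrates key merging.
    """
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# Made-up titles for illustration only.
toy_titles = [
    "Testing cloud levels",
    "Cloud level tests",
    "Clouds test",
]

word_idx = build_index(toy_titles, lambda token: token)
stem_idx = build_index(toy_titles, toy_stem)

print(len(word_idx))  # 7 distinct surface forms
print(len(stem_idx))  # 3 keys after toy stemming
```

Here `testing`, `tests`, and `test` collapse onto one key, so the stemmed vocabulary is smaller while each posting list grows.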

In [7]:
# Solution

# ------------------------------------------------------
# STEP 1: Scrape 15 pages and save to hackernews-articles.csv
# ------------------------------------------------------

scrape_hacker_news(max_pages=15, output_file="hackernews-articles.csv")

# ------------------------------------------------------
# STEP 2: Read the CSV data
# ------------------------------------------------------

titles = []
with open("hackernews-articles.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        titles.append(row["title"])

print(f"Loaded {len(titles)} article titles.")

# ------------------------------------------------------
# STEP 3: Helper functions
# ------------------------------------------------------

def preprocess_text(text):
    """Lowercase, remove punctuation, and tokenize."""
    text = text.lower()
    tokens = word_tokenize(re.sub(r"[^a-zA-Z0-9]+", " ", text))
    return [t for t in tokens if t.isalnum()]

def get_wordnet_pos(tag):
    """Map POS tag to WordNet POS for better lemmatization."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# ------------------------------------------------------
# STEP 4: Build indices
# ------------------------------------------------------

word_index = {}
stem_index = {}
lemma_index = {}

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for idx, title in enumerate(titles):
    tokens = preprocess_text(title)

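    # --- Word Index ---
    # dict.fromkeys() de-duplicates while preserving order, so a word that
    # occurs twice in one title contributes its article index only once.
    for token in dict.fromkeys(tokens):
        word_index.setdefault(token, []).append(idx)

    # --- Stem Index ---
    # Different tokens can share a stem, so de-duplicate after stemming.
    for stem in dict.fromkeys(stemmer.stem(token) for token in tokens):
        stem_index.setdefault(stem, []).append(idx)

    # --- Lemma Index ---
    # Tag the whole title at once so the tagger sees each token in context.
    pos_tags = nltk.pos_tag(tokens)
    lemmas = (lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tags)
    for lemma in dict.fromkeys(lemmas):
        lemma_index.setdefault(lemma, []).append(idx)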

# ------------------------------------------------------
# STEP 5: Save indices as JSON files
# ------------------------------------------------------

with open("word-index.json", "w", encoding="utf-8") as f:
    json.dump(word_index, f, indent=4, ensure_ascii=False)

with open("stem-index.json", "w", encoding="utf-8") as f:
    json.dump(stem_index, f, indent=4, ensure_ascii=False)

with open("lemma-index.json", "w", encoding="utf-8") as f:
    json.dump(lemma_index, f, indent=4, ensure_ascii=False)

# ------------------------------------------------------
# STEP 6: Print vocabulary sizes
# ------------------------------------------------------

print("\n--- Vocabulary Sizes ---")
print(f"Word Index  : {len(word_index)} unique words")
print(f"Stem Index  : {len(stem_index)} unique stems")
print(f"Lemma Index : {len(lemma_index)} unique lemmas")



--- Scraping page 1: https://news.ycombinator.com/ ---

--- Scraping page 2: https://news.ycombinator.com/?p=2 ---

--- Scraping page 3: https://news.ycombinator.com/?p=3 ---

--- Scraping page 4: https://news.ycombinator.com/?p=4 ---

--- Scraping page 5: https://news.ycombinator.com/?p=5 ---

--- Scraping page 6: https://news.ycombinator.com/?p=6 ---

--- Scraping page 7: https://news.ycombinator.com/?p=7 ---

--- Scraping page 8: https://news.ycombinator.com/?p=8 ---

--- Scraping page 9: https://news.ycombinator.com/?p=9 ---

--- Scraping page 10: https://news.ycombinator.com/?p=10 ---

--- Scraping page 11: https://news.ycombinator.com/?p=11 ---

--- Scraping page 12: https://news.ycombinator.com/?p=12 ---

--- Scraping page 13: https://news.ycombinator.com/?p=13 ---

--- Scraping page 14: https://news.ycombinator.com/?p=14 ---

--- Scraping page 15: https://news.ycombinator.com/?p=15 ---

--- Scraping complete! Extracted 450 articles ---
Results saved to: hackernews-articles.csv

## <span class='task'>TASK-03</span> - 25%
Assuming you have successfully built the lookup indices:

#### Define a search function
The function should:
- Take a keyword as an argument.
- Search for matching articles in all three indices — original, stemmed, and lemmatized.
- For each index: 
  - Return the titles of those matching articles from `hackernews-articles.csv`.
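
The key point of such a search function is that the query must be normalized in the same way as the keys of the index being probed. The following is a minimal sketch under stated assumptions: `search_indices`, the hand-built toy indices, and the toy "stem" normalizer are all hypothetical and only illustrate the lookup pattern.

```python
def search_indices(keyword, titles, indices, normalizers):
    """Return {index_name: [matching titles]} for one keyword.

    indices     maps an index name to its inverted index (key -> list of ids)
    normalizers maps the same name to the function applied to keys at build time
    """
    query = keyword.lower().strip()
    results = {}
    for name, index in indices.items():
        # The query must be normalized exactly like the index keys were,
        # otherwise a stemmed index would be probed with an unstemmed word.
        key = normalizers[name](query)
        results[name] = [titles[i] for i in index.get(key, [])]
    return results


# Hand-built toy data; the "stem" normalizer is a hypothetical stand-in.
toy_titles = ["Cloud pricing", "Clouds over Berlin", "Rust release"]
toy_indices = {
    "word": {"cloud": [0], "clouds": [1], "pricing": [0],
             "over": [1], "berlin": [1], "rust": [2], "release": [2]},
    "stem": {"cloud": [0, 1], "pricing": [0],
             "over": [1], "berlin": [1], "rust": [2], "release": [2]},
}
toy_normalizers = {
    "word": lambda w: w,
    "stem": lambda w: w[:-1] if w.endswith("s") else w,
}

hits = search_indices(" Clouds ", toy_titles, toy_indices, toy_normalizers)
print(hits["word"])  # ['Clouds over Berlin']
print(hits["stem"])  # ['Cloud pricing', 'Clouds over Berlin']
```

Using `index.get(key, [])` means a keyword that is absent from an index simply yields an empty list for that index.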

In [9]:
# Solution

# ------------------------------------------------------------
# Helper function for WordNet POS tagging
# ------------------------------------------------------------
def get_wordnet_pos(tag):
    """Map POS tag to WordNet POS for better lemmatization."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


# ------------------------------------------------------------
# Load Data
# ------------------------------------------------------------
# Load the article titles from CSV
titles = []
with open("hackernews-articles.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        titles.append(row["title"])

# Load indices
with open("word-index.json", "r", encoding="utf-8") as f:
    word_index = json.load(f)

with open("stem-index.json", "r", encoding="utf-8") as f:
    stem_index = json.load(f)

with open("lemma-index.json", "r", encoding="utf-8") as f:
    lemma_index = json.load(f)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


# ------------------------------------------------------------
# Define the Search Function
# ------------------------------------------------------------
def search_keyword(keyword):
    """
    Search for a keyword in all three indices and return matching article titles.
    """
    keyword = keyword.lower().strip()

    # Get stem and lemma forms
    stem = stemmer.stem(keyword)
    pos_tag = nltk.pos_tag([keyword])[0][1]
    lemma = lemmatizer.lemmatize(keyword, get_wordnet_pos(pos_tag))

    print(f"\n Searching for: '{keyword}'")
    print(f"Stemmed form : {stem}")
    print(f"Lemmatized form : {lemma}")

    # Look up each normalized form in its index; .get() returns an empty
    # list when the key is absent, so a missing key yields no titles.
    results = {
        "word_index": [titles[i] for i in word_index.get(keyword, [])],
        "stem_index": [titles[i] for i in stem_index.get(stem, [])],
        "lemma_index": [titles[i] for i in lemma_index.get(lemma, [])],
    }

    # Print results neatly
    for idx_name, titles_list in results.items():
        print(f"\n--- Matches in {idx_name} ({len(titles_list)} results) ---")
        for t in titles_list[:10]:  # show first 10
            print(f"• {t}")
        if len(titles_list) > 10:
            print(f"... and {len(titles_list) - 10} more")

    return results


In [12]:
search_keyword("cloud")


 Searching for: 'cloud'
Stemmed form : cloud
Lemmatized form : cloud

--- Matches in word_index (2 results) ---
• Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system
• Hetzner: The Simple Cloud just got more flexible and more affordable

--- Matches in stem_index (2 results) ---
• Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system
• Hetzner: The Simple Cloud just got more flexible and more affordable

--- Matches in lemma_index (2 results) ---
• Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system
• Hetzner: The Simple Cloud just got more flexible and more affordable


{'word_index': ['Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system',
  'Hetzner: The Simple Cloud just got more flexible and more affordable'],
 'stem_index': ['Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system',
  'Hetzner: The Simple Cloud just got more flexible and more affordable'],
 'lemma_index': ['Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system',
  'Hetzner: The Simple Cloud just got more flexible and more affordable']}

## <span class='task'>TASK-04</span> - 15%
Randomly select 5 keywords from your list of article titles and use your search function to compare results across all three indices (original, stemmed, and lemmatized):
1. Print the number of matches and the titles of the articles for each keyword. You may display them as a table or simply print them as text.

2. Did stemming or lemmatization help retrieve more relevant articles?
3. Were there any false matches (irrelevant results)?
4. When might stemming be better than lemmatization (or vice versa)?
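
To support questions 2 and 3, it can help to list which titles the stem and lemma indices return beyond the exact-word matches, since those extra titles are the ones to inspect for relevance. The helper `compare_results` below is a hypothetical sketch; it assumes a dictionary shaped like the return value of `search_keyword`, and the sample data is copied from the recorded run for the keyword `level`.

```python
def compare_results(results):
    """Summarize how stem and lemma results differ from the exact-word results.

    `results` is expected to have the keys 'word_index', 'stem_index' and
    'lemma_index', each mapping to a list of titles.
    """
    exact = set(results["word_index"])
    summary = {}
    for name in ("stem_index", "lemma_index"):
        found = set(results[name])
        summary[name] = {
            "total": len(found),
            # Titles the normalized index found that exact matching missed;
            # these are the candidates to inspect for false matches.
            "extra": sorted(found - exact),
            # Titles exact matching found but the normalized index lost.
            "missing": sorted(exact - found),
        }
    return summary


# Titles from the recorded search for 'level'.
llm_title = "LLMs are getting better at character-level text manipulation"
egg_title = "Medical student ate 700 eggs in a month and his cholesterol levels dropped"
level_results = {
    "word_index": [llm_title],
    "stem_index": [llm_title, egg_title],
    "lemma_index": [llm_title, egg_title],
}

summary = compare_results(level_results)
for name, info in summary.items():
    print(name, info["total"], info["extra"], info["missing"])
```

Whether an "extra" title counts as a relevant hit or a false match still has to be judged by reading it; the helper only narrows down what to read.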

In [14]:
# Solution

# ------------------------------------------------------------
# STEP 1: Extract keywords (unique words) from article titles
# ------------------------------------------------------------
keywords = set()

for title in titles:
    words = re.findall(r"\b[a-zA-Z]{3,}\b", title.lower())  # 3+ letter words
    keywords.update(words)

# Randomly select 5 unique keywords
sample_keywords = random.sample(list(keywords), 5)
print(f"\n Randomly selected keywords: {sample_keywords}\n")

# ------------------------------------------------------------
# STEP 2: Search each keyword and print comparison
# ------------------------------------------------------------
comparison_results = {}

for word in sample_keywords:
    print("=" * 80)
    print(f"\n Keyword: '{word}'")
    print("=" * 80)
    results = search_keyword(word)

    # Store counts for summary
    comparison_results[word] = {
        "Word Index": len(results["word_index"]),
        "Stem Index": len(results["stem_index"]),
        "Lemma Index": len(results["lemma_index"]),
    }

# ------------------------------------------------------------
# STEP 3: Summary Table
# ------------------------------------------------------------
print("\n" + "=" * 80)
print(" Summary of Match Counts")
print("=" * 80)
print(f"{'Keyword':<15}{'Word Index':<15}{'Stem Index':<15}{'Lemma Index':<15}")
print("-" * 80)

for kw, counts in comparison_results.items():
    print(f"{kw:<15}{counts['Word Index']:<15}{counts['Stem Index']:<15}{counts['Lemma Index']:<15}")



 Randomly selected keywords: ['level', 'passes', 'threaded', 'multithreading', 'confirms']


 Keyword: 'level'

 Searching for: 'level'
Stemmed form : level
Lemmatized form : level

--- Matches in word_index (1 results) ---
• LLMs are getting better at character-level text manipulation

--- Matches in stem_index (2 results) ---
• LLMs are getting better at character-level text manipulation
• Medical student ate 700 eggs in a month and his cholesterol levels dropped

--- Matches in lemma_index (2 results) ---
• LLMs are getting better at character-level text manipulation
• Medical student ate 700 eggs in a month and his cholesterol levels dropped

 Keyword: 'passes'

 Searching for: 'passes'
Stemmed form : pass
Lemmatized form : pass

--- Matches in word_index (1 results) ---
• Grandmaster, Popular Commentator Daniel Naroditsky Tragically Passes Away at 29

--- Matches in stem_index (1 results) ---
• Grandmaster, Popular Commentator Daniel Naroditsky Tragically Passes Away at 29

--- M