# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name: Matthew Block
[Web Scraping Repository](https://github.com/matthewpblock/620-mod6-web-scraping/)  

Perform the tasks described in the Markdown cells below.  When you have completed the assignment, make sure all of your code cells have been run (and have output beneath them) and that you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a Python file (`.py`), then import and run the appropriate code to answer the question.

In [None]:
# Imports
import requests
from bs4 import BeautifulSoup
import pickle
import spacy
from collections import Counter
import re
import sys
import os
from urllib.parse import urljoin




1. Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

In [15]:
import requests
from bs4 import BeautifulSoup
import sys
from urllib.parse import urljoin

def extract_and_save_article(url: str, output_filename: str):
    """
    Fetches an article from a URL, extracts the main article content using BeautifulSoup,
    and saves it to an HTML file.

    Args:
        url (str): The URL of the web page containing the article.
        output_filename (str): The name of the file to save the extracted article to.
    """
    print(f"Fetching article from: {url}")
    try:
        # Using a user-agent can help avoid being blocked by some sites
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers, timeout=15)
        # Raise an exception for bad status codes (4xx or 5xx)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}", file=sys.stderr)
        sys.exit(1)

    print("Parsing HTML content...")
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the main article tag. Based on inspection of Hackaday pages,
    # the main content is within an <article> tag.
    article_tag = soup.find('article')

    if not article_tag:
        print("Could not find the <article> tag on the page.", file=sys.stderr)
        sys.exit(1)

    print("Article content found. Processing content...")

    # Convert relative URLs to absolute URLs for links, images, and scripts
    for tag in article_tag.find_all(['a', 'link'], href=True):
        tag['href'] = urljoin(url, tag['href'])

    for tag in article_tag.find_all(['img', 'script', 'video', 'audio', 'source'], src=True):
        tag['src'] = urljoin(url, tag['src'])

    # Extract the title from the article header for the new HTML's <title> tag
    title_text = "Extracted Article"
    title_tag = article_tag.find('h1', class_='entry-title')
    if title_tag:
        title_text = title_tag.get_text(strip=True)

    # Use prettify() to get a nicely formatted string of the article tag
    article_html = article_tag.prettify()

    # Create a complete HTML document for the output
    # I've added some basic CSS for better readability
    output_html = f"""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{title_text}</title>
    <style>
        body {{
            font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue", sans-serif;
            line-height: 1.6;
            color: #333;
            background-color: #fdfdfd;
            max-width: 800px;
            margin: 20px auto;
            padding: 0 20px;
        }}
        h1, h2, h3, h4, h5, h6 {{
            line-height: 1.2;
        }}
        a {{
            color: #0073aa;
        }}
        img, video {{
            max-width: 100%;
            height: auto;
            display: block;
            margin: 1em 0;
        }}
        pre {{
            background-color: #f0f0f0;
            padding: 1em;
            overflow-x: auto;
            border-radius: 4px;
        }}
        code {{
            font-family: "Courier New", Courier, monospace;
        }}
    </style>
</head>
<body>
    {article_html}
</body>
</html>
"""

    try:
        print(f"Saving extracted article to {output_filename}...")
        with open(output_filename, 'w', encoding='utf-8') as f:
            f.write(output_html)
        print(f"Successfully saved the article to '{output_filename}'.")
    except IOError as e:
        print(f"Error writing to file {output_filename}: {e}", file=sys.stderr)
        sys.exit(1)

# The URL from the request
TARGET_URL = "https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/"
OUTPUT_FILE = "laser_headlights_article.html"


extract_and_save_article(TARGET_URL, OUTPUT_FILE)


Fetching article from: https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/
Parsing HTML content...
Article content found. Processing content...
Saving extracted article to laser_headlights_article.html...
Successfully saved the article to 'laser_headlights_article.html'.
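
Question 1 mentions a `.pkl` file as one option, while the cell above writes a standalone HTML file. If a pickle is preferred, the round trip might look like the minimal sketch below. The helper names `dump_article_html` and `load_article_html` are hypothetical and not part of the notebook; the string passed in is assumed to be the article markup (for example `str(article_tag)`).

```python
import pickle


def dump_article_html(article_html: str, path: str) -> None:
    """Serialize the article HTML string to a pickle file (hypothetical helper)."""
    with open(path, "wb") as f:  # pickle requires binary mode
        pickle.dump(article_html, f)


def load_article_html(path: str) -> str:
    """Read the article HTML string back from a pickle file (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Possible usage in the notebook (assumes `article_tag` from the cell above):
# dump_article_html(str(article_tag), "laser_headlights_article.pkl")
# html_content = load_article_html("laser_headlights_article.pkl")
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.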


2. Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

In [17]:
# Parser passed to BeautifulSoup (the built-in one, so no extra install is needed)
parser = 'html.parser'
html_file = 'laser_headlights_article.html'

try:
    # Open and read the file
    with open(html_file, 'r', encoding='utf-8') as f:
        html_content = f.read()

    # Create a BeautifulSoup object
    soup = BeautifulSoup(html_content, parser)

    # First, find the <article> tag
    article_tag = soup.find('article')

    if article_tag:
        # Now, get the text only from within that article tag
        article_text = article_tag.get_text(strip=True, separator=' ')
        
        # Print the first 500 characters of the article text
        print(article_text[:500])
    else:
        print("Could not find an <article> tag in the HTML.")

except FileNotFoundError:
    print(f"Error: The file was not found at {html_file}")
except Exception as e:
    print(f"An error occurred: {e}")

How Laser Headlights Work 130 Comments by: Lewin Day March 22, 2021 When we think about the onward march of automotive technology, headlights aren’t usually the first thing that come to mind. Engines, fuel efficiency, and the switch to electric power are all more front of mind. However, that doesn’t mean there aren’t thousands of engineers around the world working to improve the state of the art in automotive lighting day in, day out. Sealed beam headlights gave way to more modern designs once r


3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [18]:
def analyze_article_text():
    """
    Uses spaCy to find the most frequent tokens in the article text.

    Takes no arguments; it reads the global `article_text` string that the
    previous code cell extracted from the saved HTML file.
    """
    # --- 1. HTML file was loaded in previous code block ---

    # --- 2. Load spaCy Model ---
    print("Loading spaCy model 'en_core_web_sm'...")
    try:
        nlp = spacy.load("en_core_web_sm")
    except OSError:
        print(
            "spaCy model 'en_core_web_sm' not found. Please run:\n"
            "python -m spacy download en_core_web_sm",
            file=sys.stderr
        )
        sys.exit(1)

    # --- 3. Process Text and Filter Tokens ---
    print("Processing text and counting token frequencies...")
    doc = nlp(article_text)

    # Create a list of tokens, converted to lower case,
    # but only if they are not stopwords, punctuation, or whitespace.
    filtered_tokens = [
        token.text.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]

    # --- 4. Count Frequencies and Find Most Common ---
    word_freq = Counter(filtered_tokens)
    most_common_tokens = word_freq.most_common(5)

    # --- 5. Print the Results ---
    print("\n--- Analysis Results ---")
    
    # Print the list of the 5 most common tokens
    common_token_list = [token for token, freq in most_common_tokens]
    print("\nThe 5 most frequent tokens are:")
    print(common_token_list)

    # Print the tokens and their frequencies
    print("\nFrequency of the 5 most common tokens:")
    for token, freq in most_common_tokens:
        print(f"- '{token}': {freq}")
    print("------------------------")



analyze_article_text()


Loading spaCy model 'en_core_web_sm'...
Processing text and counting token frequencies...

--- Analysis Results ---

The 5 most frequent tokens are:
['laser', 'headlights', 'headlight', 'technology', 'led']

Frequency of the 5 most common tokens:
- 'laser': 35
- 'headlights': 19
- 'headlight': 11
- 'technology': 10
- 'led': 10
------------------------


4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).
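
One way to approach question 4 is to mirror the token count from question 3 but read `token.lemma_` instead of `token.text`. The sketch below is not the notebook's own solution: `most_common_lemmas` and `print_lemma_report` are hypothetical names, and `doc` is assumed to be a spaCy `Doc` built from `article_text`.

```python
from collections import Counter


def most_common_lemmas(doc, n=5):
    """Return the n most frequent lower-cased lemmas as (lemma, count) pairs.

    `doc` is assumed to be a spaCy Doc (any iterable of tokens exposing
    lemma_, is_stop, is_punct and is_space works).
    """
    lemmas = [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    return Counter(lemmas).most_common(n)


def print_lemma_report(doc, n=5):
    """Print the common lemmas, then each lemma with its frequency."""
    common = most_common_lemmas(doc, n)
    print(f"The {n} most frequent lemmas are:")
    print([lemma for lemma, _ in common])
    print(f"\nFrequency of the {n} most common lemmas:")
    for lemma, freq in common:
        print(f"- '{lemma}': {freq}")


# Possible usage in the notebook (assumes spaCy, the en_core_web_sm model,
# and `article_text` from question 2 are available):
# nlp = spacy.load("en_core_web_sm")
# print_lemma_report(nlp(article_text))
```

Because lemmatization maps inflected forms to a shared base form, counts for words such as "headlights" and "headlight" would be expected to merge, though the exact result depends on the model.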

5. Define the following methods:
    * `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting tokens appear in the sentence divided by the number of words in the sentence
    * `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence
    
You may find some of the code from the in-class notes useful; feel free to reuse those methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in questions 3 and 4.
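
The two scoring methods described above could be sketched as follows. This is one possible reading, not the in-class implementation: it assumes `sentence` is a spaCy `Span` (for example an item from `doc.sents`) and that "number of words" means tokens that are neither punctuation nor whitespace. The helper `_words` is a hypothetical name.

```python
def _words(sentence):
    """Tokens counted as words: everything except punctuation and whitespace.

    This definition of "word" is an assumption; the question does not fix it.
    """
    return [t for t in sentence if not (t.is_punct or t.is_space)]


def score_sentence_by_token(sentence, interesting_token):
    """Fraction of words whose lower-cased text is in interesting_token."""
    words = _words(sentence)
    if not words:  # avoid dividing by zero on empty sentences
        return 0.0
    interesting = set(interesting_token)
    hits = sum(1 for t in words if t.text.lower() in interesting)
    return hits / len(words)


def score_sentence_by_lemma(sentence, interesting_lemmas):
    """Fraction of words whose lower-cased lemma is in interesting_lemmas."""
    words = _words(sentence)
    if not words:
        return 0.0
    interesting = set(interesting_lemmas)
    hits = sum(1 for t in words if t.lemma_.lower() in interesting)
    return hits / len(words)


# Possible usage in the notebook (assumes `doc` is the spaCy Doc for the
# article and the two lists come from questions 3 and 4):
# first_sentence = next(doc.sents)
# print(score_sentence_by_token(first_sentence, common_token_list))
# print(score_sentence_by_lemma(first_sentence, common_lemma_list))
```

Converting the interesting list to a `set` keeps each membership check constant-time, which matters little for five items but costs nothing.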

6. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

7. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
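
Questions 6 and 7 differ only in which scoring method and which list of interesting words they use, so both could share the helpers sketched below. The names `collect_scores` and `plot_score_histogram` are hypothetical, the scorer is assumed to take `(sentence, interesting)` as in question 5, and matplotlib is assumed to be installed for the plotting step.

```python
def collect_scores(sentences, scorer, interesting):
    """Return a list with scorer(sentence, interesting) for every sentence."""
    return [scorer(sentence, interesting) for sentence in sentences]


def plot_score_histogram(scores, title, bins=10):
    """Plot a histogram of sentence scores with a title and axis labels."""
    # Imported here so that collect_scores works even without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(scores, bins=bins)
    ax.set_title(title)
    ax.set_xlabel("Sentence score (interesting words / total words)")
    ax.set_ylabel("Number of sentences")
    plt.show()
    return ax


# Possible usage in the notebook (assumes `doc`, the scoring methods from
# question 5, and the lists from questions 3 and 4 are available):
# token_scores = collect_scores(doc.sents, score_sentence_by_token, common_token_list)
# plot_score_histogram(token_scores, "Sentence Scores by Token")
# lemma_scores = collect_scores(doc.sents, score_sentence_by_lemma, common_lemma_list)
# plot_score_histogram(lemma_scores, "Sentence Scores by Lemma")
```

The most common score range has to be read from the resulting plots, so it is left for the comment after each code cell.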

8. Which tokens and lemmas would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double clicking it).
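
For the code change asked about in question 8, one option is to add a part-of-speech check to the existing filter. The sketch below is an assumption about how that might look, with `most_common_nouns` as a hypothetical name; it relies on spaCy's coarse `token.pos_` tag, where `"NOUN"` covers common nouns and proper nouns are tagged separately as `"PROPN"`.

```python
from collections import Counter


def most_common_nouns(doc, n=5, use_lemma=False):
    """Return the n most frequent nouns as (word, count) pairs.

    `doc` is assumed to be a spaCy Doc. Set use_lemma=True to count lemmas
    (question 4 style) instead of token text (question 3 style).
    """
    items = [
        (token.lemma_ if use_lemma else token.text).lower()
        for token in doc
        if token.pos_ == "NOUN"  # add "PROPN" here to keep proper nouns too
        and not (token.is_stop or token.is_punct or token.is_space)
    ]
    return Counter(items).most_common(n)


# Possible usage in the notebook (assumes `doc` is the spaCy Doc for the article):
# print(most_common_nouns(doc))                  # noun tokens
# print(most_common_nouns(doc, use_lemma=True))  # noun lemmas
```

Comparing this output with the lists from questions 3 and 4 shows which entries drop out; which ones do depends on how the model tags each word in context.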