# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name:  Nicole Hansen
GitHub repo:  https://github.com/nhansen23/mod6-web-scraping

Perform the tasks described in the Markdown cells below.  When you have completed the assignment make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

In [5]:
#run the following once
#%pip install requests
#%pip install beautifulsoup4
%pip install spacy

Note: you may need to restart the kernel to use updated packages.





## Question 1

1. Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

In [None]:
import requests
import pickle
from bs4 import BeautifulSoup

#url of website
url="https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/"

#Fetch the page
response = requests.get(url)

#Store file name
filename = 'article-html.pkl'

#Check if the response was successful
if response.status_code==200:
    #Keep the raw bytes; parsing happens when the file is read back
    html_content = response.content
    #Write to a file
    with open(filename,'wb') as file:
        pickle.dump(html_content, file)
    print(f"The contents have been written to {filename}")
else:
    print(f'Failed to retrieve the web page (status {response.status_code}).')



The contents have been written to article-html.pkl


In [19]:
#Check for valid HTML content  
def is_valid_html(content):

    if not content:
        return False
    soup = BeautifulSoup(content, 'html.parser')
    return bool(soup.find())

def read_and_validate_pickle(file_path): 
    try: 
        with open(file_path, 'rb') as file: 
            data = pickle.load(file) 
            if is_valid_html(data): 
                print("Valid HTML content") 
                print(data) 
            else: 
                print("Invalid HTML content. Using default template.") 
                data = "<html><body>This is a default template.</body></html>" 
                print(data) 
    except Exception as e: 
        print(f"Error reading pickle file: {e}") 

# Example usage
read_and_validate_pickle('article-html.pkl')

Valid HTML content
b'<!DOCTYPE html>\n<html itemscope="itemscope" itemtype="http://schema.org/Article" lang="en-US">\n<head><script type="text/javascript" src="/_static/js/bundle-playback.js?v=HxkREWBo" charset="utf-8"></script>\n<script type="text/javascript" src="/_static/js/wombat.js?v=txqj7nKC" charset="utf-8"></script>\n<script>window.RufflePlayer=window.RufflePlayer||{};window.RufflePlayer.config={"autoplay":"on","unmuteOverlay":"hidden"};</script>\n<script type="text/javascript" src="/_static/js/ruffle/ruffle.js"></script>\n<script type="text/javascript">\n    __wm.init("https://web.archive.org/web");\n  __wm.wombat("https://hackaday.com/2021/03/22/how-laser-headlights-work/","20210327165005","https://web.archive.org/","web","/_static/",\n\t      "1616863805");\n</script>\n<link rel="stylesheet" type="text/css" href="/_static/css/banner-styles.css?v=S1zqJCYt" />\n<link rel="stylesheet" type="text/css" href="/_static/css/iconochive.css?v=3PDvdIFv" />\n<!-- End Wayback Rewrite JS 
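The pickled bytes above hold the whole archived page, including the Wayback Machine scripts and site navigation. If only the post body is wanted, one option is to select a single element before saving. The following is a minimal sketch, assuming the page wraps the post in an `<article>` tag (not verified here against the archived page); `sample_html` and `extract_article_html` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (bytes, like response.content)
sample_html = b"""<html><head><script>var x = 1;</script></head>
<body><nav>Primary Menu</nav>
<article><h1>How Laser Headlights Work</h1><p>Body text.</p></article>
<footer>Site footer</footer></body></html>"""

def extract_article_html(page_html):
    """Return the markup of the first <article> element, or None if absent."""
    soup = BeautifulSoup(page_html, "html.parser")
    article = soup.find("article")
    return str(article) if article is not None else None

article_html = extract_article_html(sample_html)
print(article_html)
```

The returned string could be pickled in place of the full page, so that later text extraction skips menus, footers, and scripts.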

## Question 2

2. Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

In [4]:
import pickle
from bs4 import BeautifulSoup

filename = 'article-html.pkl'

#Open the file in read mode
with open(filename,'rb') as file:
    article_content = pickle.load(file)

soup = BeautifulSoup(article_content, 'html.parser')

#Print the content
article_text = soup.get_text()
print(article_text)

How Laser Headlights Work | Hackaday

Skip to content






Hackaday


Primary Menu

Home
Blog
Hackaday.io
Tindie
Hackaday Prize
Submit
About


Search for:



 March 27, 2021 






How Laser Headlights Work


                130 Comments            

by:
Lewin Day



March 22, 2021








When we think about the onward march of automotive technology, headlights aren’t usually the first thing that come to mind. Engines, fuel efficiency, and the switch to electric power are all more front of mind. However, that doesn’t mean there aren’t thousands of engineers around the world working to improve the state of the art in automotive lighting day in, day out.
Sealed beam headlights gave way to more modern designs once regulations loosened up, while bulbs moved from simple halogens to xenon HIDs and, more recently, LEDs. Now, a new technology is on the scene, with lasers!

Laser Headlights?!
BWM’s prototype 

## Question 3

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [11]:
from bs4 import BeautifulSoup
import spacy
from collections import Counter

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Process the text content
doc = nlp(article_text)

# Filter out punctuation, stop words, and whitespace
tokens = [token.text.lower() for token in doc if not token.is_punct and not token.is_stop and not token.is_space]

# Count the frequency of each token 
token_freq = Counter(tokens)

# Get the 5 most frequent tokens
most_common_tokens = token_freq.most_common(5)

# Print the results
print("The 5 most frequent tokens are:")
for token, freq in most_common_tokens:
    print(f"{token}: {freq}")


The 5 most frequent tokens are:
comment: 136
march: 133
2021: 133
says: 132
report: 130


## Question 4

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [8]:
from bs4 import BeautifulSoup
import spacy
from collections import Counter

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Process the text content
doc = nlp(article_text)

# Extract lemmas, filtering out punctuation, stopwords, and whitespace
lemmas = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_space]

# Count the frequency of each lemma
lemma_freq = Counter(lemmas)

# Get the 5 most frequent lemmas
most_common_lemmas = lemma_freq.most_common(5)

# Print the most common lemmas with their frequencies
print("The 5 most frequent lemmas are:")
for lemma, freq in most_common_lemmas:
    print(f"{lemma}: {freq}")


The 5 most frequent lemmas are:
comment: 157
say: 134
march: 133
2021: 133
report: 130


## Question 5

5. Define the following methods:
    * `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting words appear in the sentence divided by the number of words in the sentence
    * `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence
    
You may find some of the code from the in class notes useful; feel free to use methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in questions 3 and 4.

In [24]:
# Define scoring for tokens
def score_sentence_by_token(sentence, interesting_tokens):
    # Process the sentence with spaCy
    doc = nlp(sentence)

    # Keep the words: drop punctuation and whitespace
    tokens = [token.text.lower() for token in doc if not token.is_punct and not token.is_space]

    # Count the number of interesting tokens
    interesting_count = sum(1 for token in tokens if token in interesting_tokens)

    # Calculate the score (0 for a sentence with no words)
    score = interesting_count / len(tokens) if tokens else 0
    return score

# Define scoring for lemmas
def score_sentence_by_lemma(sentence, interesting_lemmas):
    # Process the sentence with spaCy
    doc = nlp(sentence)

    # Extract lower-cased lemmas, dropping punctuation and whitespace
    lemmas = [token.lemma_.lower() for token in doc if not token.is_punct and not token.is_space]

    # Count the number of interesting lemmas
    interesting_count = sum(1 for lemma in lemmas if lemma in interesting_lemmas)

    # Calculate the score (0 for a sentence with no words)
    score = interesting_count / len(lemmas) if lemmas else 0
    return score

#Extract the sentences from the article (doc was built in question 4)
sentence = [sent.text for sent in doc.sents]
first_sentence = sentence[0]

#Convert the most common tokens and lemmas to lists of words
interesting_tokens = [token for token, _ in most_common_tokens]
interesting_lemmas = [lemma for lemma, _ in most_common_lemmas]

#Calculate the scores using the word lists, not the (word, count) pairs
token_score = score_sentence_by_token(first_sentence, interesting_tokens)
lemma_score = score_sentence_by_lemma(first_sentence, interesting_lemmas)

#Print the results
print("\nFirst sentence token score:")
print(token_score)
print("\nFirst sentence lemma score:")
print(lemma_score)




First sentence token score:
0.0

First sentence lemma score:
0.0
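The score is a simple ratio: the number of interesting words found, divided by the number of words counted. The sketch below works through that arithmetic with plain whitespace splitting instead of spaCy (an assumption made so the example runs anywhere); `score_words` and `example_words` are hypothetical names. It also shows that `Counter.most_common` returns `(item, count)` pairs, which never match plain strings, so the words need to be unpacked before scoring.

```python
from collections import Counter

def score_words(words, interesting):
    """Fraction of words that appear in the interesting collection."""
    if not words:
        return 0.0
    wanted = set(interesting)
    hits = sum(1 for w in words if w.lower() in wanted)
    return hits / len(words)

example_words = "Laser headlights use a laser diode".split()

# 'Laser', 'laser' and 'diode' match: 3 of 6 words
print(score_words(example_words, ["laser", "diode"]))  # 0.5

# most_common returns (word, count) pairs
top = Counter(w.lower() for w in example_words).most_common(1)
print(top)  # [('laser', 2)]

# Passing the pairs directly finds no matches
print(score_words(example_words, top))  # 0.0

# Unpacking the words first gives the intended ratio (2 of 6 words)
print(score_words(example_words, [w for w, _ in top]))
```

A score of exactly 0.0 on a sentence that visibly contains a frequent word is a hint that pairs, rather than words, were passed in.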


## Question 6

6. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

In [25]:
import matplotlib.pyplot as plt

# Score every sentence with the function and token list from question 5
scores = [score_sentence_by_token(s, interesting_tokens) for s in sentence]

#Plot a Histogram of the Scores 
plt.hist(scores, bins=10, edgecolor='black') 
plt.title('Histogram of Sentence Scores Based on Frequent Tokens') 
plt.xlabel('Score') 
plt.ylabel('Frequency') 
plt.show()


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Nicole\AppData\Roaming\Python\Python312\site-packages\ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "C:\Users\Nicole\AppData\Roaming\Python\Python312\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "C:\Users\Nicole\AppData\Roaming\Python\Python312\site-packages\ipykernel\kernelapp.py", line 739, in start
    self.io


ImportError: numpy.core.multiarray failed to import
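The traceback above reports that a compiled module was built against NumPy 1.x while NumPy 2.0.2 is installed, so the plot never rendered. A quick way to confirm which NumPy is present, without importing it, is to read the installed package metadata. This is a minimal sketch; `numpy_major` and `describe_numpy` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version

def numpy_major(version_string):
    """Return the major version number from a string such as '2.0.2'."""
    return int(version_string.split(".")[0])

def describe_numpy():
    """Summarize the installed NumPy version without importing numpy."""
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        return "numpy is not installed"
    if numpy_major(installed) >= 2:
        return f"numpy {installed}: packages built against 1.x may fail to import"
    return f"numpy {installed}: 1.x series"

print(describe_numpy())
```

As the error message itself suggests, the two usual remedies are downgrading (for example `%pip install "numpy<2"` in a notebook cell) or upgrading the affected module, followed by a kernel restart.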

## Question 7

7. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

In [2]:
import matplotlib.pyplot as plt

# Score every sentence with the lemma-based function from question 5
scores = [score_sentence_by_lemma(s, interesting_lemmas) for s in sentence]

#Plot a Histogram of the Scores 
plt.hist(scores, bins=10, edgecolor='black') 
plt.title('Histogram of Sentence Scores Based on Frequent Lemmas') 
plt.xlabel('Score') 
plt.ylabel('Frequency') 
plt.show()


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Nicole\AppData\Roaming\Python\Python312\site-packages\ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "C:\Users\Nicole\AppData\Roaming\Python\Python312\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "C:\Users\Nicole\AppData\Roaming\Python\Python312\site-packages\ipykernel\kernelapp.py", line 739, in start
    self.io


ImportError: numpy.core.multiarray failed to import
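Once the scores exist, the most common range can also be found without reading it off the plot. The sketch below bins scores into equal-width intervals on [0, 1] using only the standard library. Note that this fixed range is an assumption: `plt.hist` by default spans the minimum to the maximum of the data, so its bin edges can differ. `most_common_range` and `example_scores` are hypothetical, and the values are made up for illustration.

```python
from collections import Counter

def most_common_range(scores, bins=10):
    """Return (low, high, count) for the fullest equal-width bin on [0, 1]."""
    if not scores:
        return None
    # A score of exactly 1.0 is placed in the last bin
    indices = [min(int(s * bins), bins - 1) for s in scores]
    index, count = Counter(indices).most_common(1)[0]
    return (index / bins, (index + 1) / bins, count)

example_scores = [0.0, 0.0, 0.05, 0.12, 0.33, 1.0]  # made-up values
low, high, count = most_common_range(example_scores)
print(f"Most common range: {low:.1f} to {high:.1f} ({count} sentences)")
```

The same helper could be applied to the token-based and lemma-based score lists to compare their most populated bins.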

## Question 8

8. Which tokens and lexemes would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double clicking it).
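One way to restrict the counts to nouns is to add a part-of-speech check to the existing filter. spaCy tokens expose the coarse tag as `token.pos_`, where `"NOUN"` marks common nouns and `"PROPN"` marks proper nouns. The sketch below uses a small stand-in class, `Tok` (hypothetical, mimicking only the token attributes used here), so that it runs without a loaded model; `noun_lemma_counts` is likewise a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Tok:
    """Stand-in carrying the spaCy Token attributes used below (hypothetical)."""
    text: str
    lemma_: str
    pos_: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False

def noun_lemma_counts(tokens):
    """Count lower-cased lemmas of tokens tagged as common nouns."""
    return Counter(
        t.lemma_.lower()
        for t in tokens
        if t.pos_ == "NOUN" and not (t.is_stop or t.is_punct or t.is_space)
    )

sample = [
    Tok("Lasers", "laser", "NOUN"),
    Tok("are", "be", "AUX", is_stop=True),
    Tok("bright", "bright", "ADJ"),
    Tok(".", ".", "PUNCT", is_punct=True),
    Tok("Headlights", "headlight", "NOUN"),
    Tok("use", "use", "VERB"),
    Tok("lasers", "laser", "NOUN"),
]
print(noun_lemma_counts(sample).most_common(2))  # [('laser', 2), ('headlight', 1)]
```

With a real pipeline, the `doc` object can be passed in directly, since iterating over a spaCy `Doc` yields tokens with these same attributes. Which of the frequent words from questions 3 and 4 drop out depends on the tags the model assigns in context.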