# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name:

Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure all of your code cells have been run (and have output beneath them), and ensure you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a Python file (`.py`), then import and run the appropriate code to answer the question.

In [1]:
# Create and activate a Python virtual environment. 
# Before starting the project, try all these imports FIRST
# Address any errors you get running this code cell 
# by installing the necessary packages into your active Python environment.
# Try to resolve issues using your materials and the web.
# If that doesn't work, ask for help in the discussion forums.
# You can't complete the exercises until you import these - start early! 
# We also import pickle and Counter (included in the Python Standard Library).
#Erin Swan-Siegel

from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

!pip list

print('All prereqs installed.')

Package                       Version
----------------------------- --------
anyio                         3.5.0
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
astroid                       2.11.5
asttokens                     2.0.5
attrs                         21.4.0
Babel                         2.10.1
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
beautifulsoup4                4.11.1
bleach                        5.0.0
blis                          0.7.9
brotlipy                      0.7.0
catalogue                     2.0.7
certifi                       2023.5.7
cffi                          1.15.0
charset-normalizer            2.0.4
click                         8.0.4
colorama                      0.4.4
conda                         4.12.0
conda-package-handling        1.8.1
confection                    0.0.4
cryptography                  37.0.1
cycler                        0.11.0
cymem                         2.0.6
debugpy   

# Question 1 
Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

In [2]:
import requests
import json
from bs4 import BeautifulSoup

In [24]:
article_page = requests.get('https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/')
article_html = article_page.text

# pickle works similarly to json, but stores information in a binary format
# json files are readable by humans; pickle files, not so much

# BeautifulSoup objects don't pickle well, so cache the text of the web page instead
# (this is also polite to web developers), or just dump it to an html file you can read in later as a regular file
import pickle
with open('python-match.pkl', 'wb') as f:
    pickle.dump(article_page.text, f)

with open('python-match.pkl', 'rb') as f:
    article_html = pickle.load(f)

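# Parse the cached html first; soup must exist before the headers can be searched
soup = BeautifulSoup(article_html, 'html.parser')
for header in soup.find_all('h1'):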
    print('h1 header:', header)
    print('h1 text:', header.text)

h1 header: <h1 class="site-title">
<a href="https://web.archive.org/web/20210327165005/https://hackaday.com/" rel="home">Hackaday</a>
</h1>
h1 text: 
Hackaday

h1 header: <h1 class="entry-title" itemprop="name">How Laser Headlights Work</h1>
h1 text: How Laser Headlights Work
h1 header: <h1 class="screen-reader-text">Post navigation</h1>
h1 text: Post navigation
h1 header: <h1 class="widget-title">Search</h1>
h1 text: Search
h1 header: <h1 class="widget-title">Never miss a hack</h1>
h1 text: Never miss a hack
h1 header: <h1 class="widget-title">Subscribe</h1>
h1 text: Subscribe
h1 header: <h1 class="widget-title">If you missed it</h1>
h1 text: If you missed it
h1 header: <h1 class="widget-title">Our Columns</h1>
h1 text: Our Columns
h1 header: <h1 class="widget-title">Search</h1>
h1 text: Search
h1 header: <h1 class="widget-title">Never miss a hack</h1>
h1 text: Never miss a hack
h1 header: <h1 class="widget-title">Subscribe</h1>
h1 text: Subscribe
h1 header: <h1 class="widget-title">I
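
The comment in the cell above mentions dumping the page to an html file as an alternative to pickle. A minimal sketch of that round trip follows, assuming a short stand-in string (`sample_html`) in place of the downloaded page and a hypothetical file name (`article.html`) inside a temporary directory:

```python
import tempfile
from pathlib import Path

# Stand-in for article_page.text; the real page is far longer (assumption for illustration)
sample_html = "<html><body><article><h1>How Laser Headlights Work</h1></article></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "article.html"  # hypothetical file name
    # Write the raw text once so later cells do not need to hit the network again
    html_path.write_text(sample_html, encoding="utf-8")
    # Read it back as a regular text file
    cached_html = html_path.read_text(encoding="utf-8")

print(cached_html == sample_html)
```

Unlike a pickle file, the resulting file can be opened in a browser or text editor, which may make it easier to inspect what was actually cached.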

# Question 2 
Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

In [5]:
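# 'html.parser' is Python's built-in parser; the original `parser` variable is not defined in this notebook
soup = BeautifulSoup(article_html, 'html.parser')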

article_element = soup.find('article')
# Uncomment to see the entire article element html; again, it's long
# print(article_element)

print(article_element.get_text())



How Laser Headlights Work


                130 Comments            

by:
Lewin Day



March 22, 2021








When we think about the onward march of automotive technology, headlights aren’t usually the first thing that come to mind. Engines, fuel efficiency, and the switch to electric power are all more front of mind. However, that doesn’t mean there aren’t thousands of engineers around the world working to improve the state of the art in automotive lighting day in, day out.
Sealed beam headlights gave way to more modern designs once regulations loosened up, while bulbs moved from simple halogens to xenon HIDs and, more recently, LEDs. Now, a new technology is on the scene, with lasers!

Laser Headlights?!
BWM’s prototype laser headlight assemblies undergoing testing.
The first image brought to mind by the phrase “laser headlights” is that of laser beams firing out the front of an automobile. Obviously, coherent beams of monochromatic light would make for poor illumination outside o

# Question 3
Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [6]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [11]:
nlp = spacy.load('en_core_web_sm')
# why not, let's add some fun sentiment analysis, because we can
nlp.add_pipe('spacytextblob')
doc = nlp(article_element.get_text())

non_ws_tokens = []
for token in doc:
    if not token.is_space:
        non_ws_tokens.append(token)
        
def we_care_about(token):
    return not (token.is_space or token.is_punct)

interesting_tokens = [token for token in doc if we_care_about(token)]
from collections import Counter
word_freq = Counter(map(str,interesting_tokens))
print(word_freq.most_common(5))

[('the', 68), ('to', 37), ('of', 36), ('laser', 29), ('in', 24)]


# Question 4
Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [14]:
interesting_lemmas = [token.lemma_.lower() for token in doc if we_care_about(token)]
lemma_freq = Counter(interesting_lemmas)
print(lemma_freq.most_common(5))

[('the', 75), ('laser', 40), ('to', 38), ('of', 36), ('be', 35)]
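
The counts printed for questions 3 and 4 still include stopwords such as "the" and "to", because `we_care_about` only drops whitespace and punctuation. A minimal sketch of a stricter filter follows. It assumes hand-built stand-in objects (`make_token` is a hypothetical helper) that expose the same attribute names spaCy tokens provide (`text`, `lemma_`, `is_stop`, `is_punct`, `is_space`), so it runs without a downloaded model:

```python
from collections import Counter
from types import SimpleNamespace


def make_token(text, lemma, is_stop=False, is_punct=False, is_space=False):
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    return SimpleNamespace(text=text, lemma_=lemma, is_stop=is_stop,
                           is_punct=is_punct, is_space=is_space)


def is_content_word(token):
    # Drop whitespace, punctuation, and stopwords, as the questions ask
    return not (token.is_space or token.is_punct or token.is_stop)


# Hand-built tokens for "The lasers power the laser headlights."
sample_tokens = [
    make_token("The", "the", is_stop=True),
    make_token("lasers", "laser"),
    make_token("power", "power"),
    make_token("the", "the", is_stop=True),
    make_token("laser", "laser"),
    make_token("headlights", "headlight"),
    make_token(".", ".", is_punct=True),
]

# Lower-case before counting so "The" and "the" would share one entry
token_freq = Counter(t.text.lower() for t in sample_tokens if is_content_word(t))
lemma_freq_sample = Counter(t.lemma_.lower() for t in sample_tokens if is_content_word(t))

for word, count in token_freq.most_common(5):
    print(f"token: {word!r}  frequency: {count}")
for lemma, count in lemma_freq_sample.most_common(5):
    print(f"lemma: {lemma!r}  frequency: {count}")
```

With a real spaCy `doc`, the same `is_content_word` test could be used in place of `we_care_about`; the frequencies shown above would then change, since the stopwords would no longer be counted.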


# Question 5
Define the following methods:
* `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting words appear in the sentence divided by the number of words in the sentence
* `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence

You may find some of the code from the in-class notes useful; feel free to use methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in questions 3 and 4.

In [51]:
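def score_sentence_by_token(sentence, interesting_token):
    # Count only real words: skip whitespace and punctuation tokens
    words = [token for token in sentence if not (token.is_space or token.is_punct)]
    if not words:
        return 0.0  # avoid dividing by zero on an empty sentence
    interesting = {str(word).lower() for word in interesting_token}
    matches = sum(1 for token in words if token.text.lower() in interesting)
    return matches / len(words)

def score_sentence_by_lemma(sentence, interesting_lemmas):
    # Same ratio as above, but compare each word's lemma instead of its surface text
    words = [token for token in sentence if not (token.is_space or token.is_punct)]
    if not words:
        return 0.0
    interesting = {str(lemma).lower() for lemma in interesting_lemmas}
    matches = sum(1 for token in words if token.lemma_.lower() in interesting)
    return matches / len(words)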
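
The score described in this question is a ratio: the number of matching words divided by the total number of words in the sentence. A rough worked example on plain strings follows. Whitespace splitting is an assumption here, standing in for spaCy tokenization, and the helper name `score_words` is hypothetical:

```python
def score_words(words, interesting_words):
    """Fraction of `words` that appear in `interesting_words` (case-insensitive)."""
    if not words:
        return 0.0  # avoid dividing by zero on an empty sentence
    interesting = {w.lower() for w in interesting_words}
    matches = sum(1 for w in words if w.lower() in interesting)
    return matches / len(words)


sample_words = "Laser headlights use a blue laser".split()
sample_score = score_words(sample_words, ["laser", "headlights"])
print(sample_score)  # 3 matches out of 6 words -> 0.5
```

Lower-casing both sides means "Laser" and "laser" count as the same word, which matches the lower-cased frequency lists requested in the earlier questions.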

# Question 6
Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

# Question 7
Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

In [53]:
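# NOTE: cool_words is not defined anywhere in this notebook; it is assumed to be
# a collection of lower-case lemmas created in an earlier cell
sentences = list(doc.sents) # Thanks spaCy for just giving us our sentences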
for sentence in sentences:
    count = 0
    for token in sentence:
        if token.lemma_.lower() in cool_words:
            count += 1
    # because there's a bunch of junk newlines, we'll replace those with nothing, as well as a little bit of whitespace
    sent_str = str(sentence).replace('\n','').replace('  ',' ')
    print(count,':', sent_str)

0 : How Laser Headlights Work        130 Comments      by:Lewin DayMarch 22, 2021
0 : When we think about the onward march of automotive technology, headlights aren’t usually the first thing that come to mind.
0 : Engines, fuel efficiency, and the switch to electric power are all more front of mind.
0 : However, that doesn’t mean there aren’t thousands of engineers around the world working to improve the state of the art in automotive lighting day in, day out.
0 : Sealed beam headlights gave way to more modern designs once regulations loosened up, while bulbs moved from simple halogens to xenon HIDs and, more recently, LEDs.
0 : Now, a new technology is on the scene, with lasers!Laser Headlights?!
0 : BWM’s prototype laser headlight assemblies undergoing testing.
0 : The first image brought to mind by the phrase “laser headlights” is that of laser beams firing out the front of an automobile.
0 : Obviously, coherent beams of monochromatic light would make for poor illumination outside o
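
Questions 6 and 7 also ask for the most common range of scores. One way to cross-check what a histogram appears to show is to bin the scores directly. The sketch below uses made-up scores (`sample_scores`) and an assumed bin width of 0.1; real values would come from scoring every sentence in `doc.sents`:

```python
from collections import Counter

# Hypothetical sentence scores, for illustration only
sample_scores = [0.0, 0.04, 0.08, 0.11, 0.12, 0.13, 0.18, 0.25]
bin_width = 0.1  # assumed bin size

# Map each score to the index of the bin that contains it
bin_counts = Counter(int(score / bin_width) for score in sample_scores)
top_bin, top_count = bin_counts.most_common(1)[0]
low, high = top_bin * bin_width, (top_bin + 1) * bin_width
print(f"Most common range: {low:.1f} to {high:.1f} ({top_count} sentences)")
```

The same list of scores could then be passed to `plt.hist`, with a title and axis labels added, to produce the plot the questions ask for.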

# Question 8
Which tokens and lexemes would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double-clicking it).
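
One possible approach to the second part of this question is to filter on each token's coarse part-of-speech tag, which spaCy exposes as `token.pos_`. The sketch below is only an illustration: the tokens are hand-tagged stand-ins rather than output from a trained pipeline, and `is_noun` is a hypothetical helper name.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for spaCy tokens, tagged by hand for illustration
sample_tokens = [
    SimpleNamespace(text="Lasers", lemma_="laser", pos_="NOUN"),
    SimpleNamespace(text="power", lemma_="power", pos_="VERB"),
    SimpleNamespace(text="bright", lemma_="bright", pos_="ADJ"),
    SimpleNamespace(text="headlights", lemma_="headlight", pos_="NOUN"),
]


def is_noun(token):
    # "NOUN" covers common nouns; add "PROPN" to keep proper nouns as well
    return token.pos_ == "NOUN"


noun_lemmas = Counter(t.lemma_.lower() for t in sample_tokens if is_noun(t))
print(noun_lemmas.most_common(5))
```

Applied to the article, a filter like this would drop verbs, adjectives, and function words from the frequency lists, leaving only the nouns.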