# Fetching Content from the Web

## Melissa Stone Rogers, [GitHub](https://github.com/meldstonerogers/620-mod6-web-scraping)

## When you have no API... That's annoying!

Sometimes we need to access data on a webpage that does not have a helpful API that bundles and returns our data in a usable manner.  When that happens, we are left with no choice but to resort to web scraping: downloading the actual webpage (or parts of it) and manipulating the text or elements of the HTML to obtain the information we want.

## Tools to install:

Use either `conda` or `pip` to install the following packages (e.g. `pip install beautifulsoup4`) depending on your environment:

* `beautifulsoup4`
* `html5lib` (optional parser - can use the built-in 'html.parser' instead)

BS4 is a package that allows you to fully read and do horrible, horrible things to the DOM of a webpage (it's an HTML/XML parser that builds a parse tree for you to play with).  Two related tools are worth knowing about, although they are not in the install list above: Boilerpipe is a Python wrapper around a popular Java library that can streamline some of the process of extracting parts of a webpage, and Feedparser lets you get RSS and Atom feeds (remember those?).

Many examples are motivated from the content at https://github.com/mikhailklassen/Mining-the-Social-Web-3rd-Edition/blob/master/notebooks/Chapter%206%20-%20Mining%20Web%20Pages.ipynb

## Getting page text with `requests` and `BeautifulSoup4`

Most of the web does not follow a specific standard (RSS/Atom feeds, etc.), or the data we want is in a particularly odd format.  In those cases we need more flexible tools that let us obtain a full web page and extract meaning from it.  We will use the `requests` module to download the page and `BeautifulSoup` to pull out our information.

## Question 1

In [2]:
# Create and activate a Python virtual environment. 
# Before starting the project, try all these imports FIRST
# Address any errors you get running this code cell 
# by installing the necessary packages into your active Python environment.
# Try to resolve issues using your materials and the web.
# If that doesn't work, ask for help in the discussion forums.
# You can't complete the exercises until you import these - start early! 
# We also import pickle and Counter (included in the Python Standard Library).

from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

!pip list

print('All prereqs installed.')

import requests

response = requests.get('https://en.wikipedia.org/wiki/Data_mining')

print(response.status_code)
print(response.headers['content-type'])
# Uncomment next line to print the full HTML text;  it's long so when done, recomment
# print(response.text)


Package             Version
------------------- -----------
annotated-types     0.7.0
appnope             0.1.4
asttokens           2.4.1
backcall            0.2.0
beautifulsoup4      4.12.3
blis                0.7.11
catalogue           2.0.10
certifi             2024.8.30
charset-normalizer  3.4.0
click               8.1.7
cloudpathlib        0.20.0
comm                0.2.2
confection          0.1.5
contourpy           1.1.1
cycler              0.12.1
cymem               2.0.10
debugpy             1.8.9
decorator           5.1.1
executing           2.1.0
fonttools           4.55.0
html5lib            1.1
idna                3.10
importlib_metadata  8.5.0
importlib_resources 6.4.5
ipykernel           6.29.5
ipython             8.12.3
jedi                0.19.2
Jinja2              3.1.4
joblib              1.4.2
jupyter_client      8.6.3
jupyter_core        5.7.2
kiwisolver          1.4.7
langcodes           3.4.1
language_data       1.3.0
marisa-trie         1.2.1
markdown-it-py     

Hooray... raw HTML.  This will clearly be a good day.  Let's... find a way to **not** deal with raw html encoded text:

## Question 2

In [None]:
from bs4 import BeautifulSoup

# parser = 'html5lib'
parser = 'html.parser'

soup = BeautifulSoup(response.text, parser)
# Uncomment next lines to explore full page contents; it's long so when done, recomment
#print(soup)
#print(soup.prettify())


<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Data mining - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-client

From our soup we can extract information by finding tags, searching for ids, and in general treating the HTML as a tree that we can navigate and search.  `BeautifulSoup4` allows us to search for elements of our text.
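As a rough illustration of searching the parse tree, here is a minimal sketch on a tiny hand-written page. The HTML string and the name `demo_soup` are made up for this example (so the `soup` built from Wikipedia is left alone); `find`, `find_all`, and `get_text` are the BeautifulSoup calls being shown.

```python
from bs4 import BeautifulSoup

# A small, invented page so we can see exactly what each search returns
html = """
<html><body>
  <h1 id="title">Soup Demo</h1>
  <p class="lead">First paragraph.</p>
  <p>Second paragraph with a <a href="https://example.com">link</a>.</p>
</body></html>
"""
demo_soup = BeautifulSoup(html, "html.parser")

title = demo_soup.find(id="title")            # search by id
paragraphs = demo_soup.find_all("p")          # every <p> tag
lead = demo_soup.find("p", class_="lead")     # first <p> with class "lead"
link_targets = [a["href"] for a in demo_soup.find_all("a")]  # attribute lookup

print(title.get_text())
print(len(paragraphs))
print(lead.get_text())
print(link_targets)
```

`find` returns the first match (or `None` if nothing matches), while `find_all` returns a list, so it is worth checking for `None` before calling methods on a `find` result.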

## Question 3

In [20]:
for header in soup.find_all('h1'):  # find_all is the current name; findAll is a legacy alias
    print('h1 header:', header)
    print('h1 text:', header.text)

h1 header: <h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Data mining</span></h1>
h1 text: Data mining


## Getting some workable material

Let's start with an article and learn some information about it.  Our first step should be to obtain our web page.

In [21]:
article_page = requests.get('http://web.archive.org/web/20210415020310/https://hackaday.com/2021/04/02/python-will-soon-support-switch-statements/')
article_html = article_page.text

# pickle works similarly to json, but stores information in a binary format;
# json files are readable by humans, pickle files not so much.
#
# BeautifulSoup objects don't pickle well, so we cache the text of the web page instead.
# Caching is also polite to web developers: we don't re-download the page on every run.
# (You could also just dump the text to an html file and read it in later as a regular file.)
import pickle
with open('python-match.pkl', 'wb') as f:
    pickle.dump(article_page.text, f)

In [22]:
with open('python-match.pkl', 'rb') as f:
    article_html = pickle.load(f)
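The cache-then-reload pattern above could be wrapped in a small helper, sketched here under some assumptions: the function name `load_or_fetch` and its `fetch` parameter are hypothetical, and the cache is a plain text file rather than a pickle. Passing the download step in as a callable keeps the helper independent of any particular HTTP library.

```python
from pathlib import Path

def load_or_fetch(url, cache_path, fetch):
    """Return cached page text if cache_path exists; otherwise fetch and cache it.

    `fetch` is any callable that takes a URL and returns the page text,
    for example: lambda u: requests.get(u, timeout=10).text
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Cache hit: no network request is made
        return cache_file.read_text(encoding="utf-8")
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

With `requests` imported, a call might look like `load_or_fetch(url, 'python-match.html', lambda u: requests.get(u, timeout=10).text)`; only the first call touches the network.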

Our next step is to make our HTML page searchable/usable.  Parsing HTML by hand would be a waste of time, so let's ask BeautifulSoup to make us some delicious soup.

Use either parser - how can you find the pros and cons of 'html5lib' or the built-in 'html.parser'? 

Where did we set the value of the parser variable used below?

In [23]:
soup = BeautifulSoup(article_html, parser)

At this point, we want to extract the article text from the page.  Luckily, we have the ability to investigate and inspect individual elements.  By using the browser's inspector we can find that the article is contained in an `article` element; how convenient!  This might not be the case for every page.  One problem with web scraping is that you need to specialize your code for whatever page you are scraping.

In [24]:
article_element = soup.find('article')
# Uncomment to see the entire article element html; again, it's long
# print(article_element)

If you print `article_element`, you see that we get the HTML contained in the article element. While this is the content we want, there's a lot of HTML cruft in there that won't help us (we are probably interested in the text).  Luckily for us, BeautifulSoup lets us extract the text (roughly) as a web browser would display it using the `get_text()` method.

In [25]:
print(article_element.get_text())



Python Will Soon Support Switch Statements


                112 Comments            

by:
Adam Zeloof



April 2, 2021








Rejoice! Gone are the long chains of if…else statements, because switch statements will soon be here — sort of. What the Python gods are actually giving us are match statements. match statements are awfully similar to switch statements, but have a few really cool and unique features, which I’ll attempt to illustrate below.

Flip The Switch
A switch statement is often used in place of an if…else ladder. Here’s a quick example of the same logic in C, first executed with an if statement, and then with a switch statement:

Essentially, a switch statement takes a variable and tests it for equality against a number of different cases. If none of the cases match, then the default case is invoked. Notice in the example that each case is terminated by a break. This protects against more than one case matching (or allows for cascading), as the cases are checked in the
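The extracted text carries a lot of blank lines and padding. One simple way to tidy it, sketched here on a made-up string that imitates the top of the output, is to split on whitespace and rejoin with single spaces. (BeautifulSoup's `get_text()` also accepts `separator` and `strip` arguments that can reduce this noise at extraction time.)

```python
# Invented sample that mimics the whitespace-heavy start of the article text
raw_text = (
    "\n\n\nPython Will Soon Support Switch Statements\n\n\n"
    "                112 Comments            \n\nby:\nAdam Zeloof\n"
)

# str.split() with no arguments splits on any run of whitespace,
# so rejoining with single spaces collapses newlines and padding
clean_text = " ".join(raw_text.split())
print(clean_text)
```

This throws away paragraph boundaries, so it suits quick inspection better than anything that depends on the original layout.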

We have raw article text! Let the NLP begin! (Also introducing/reintroducing [f-strings](https://docs.python.org/3/tutorial/inputoutput.html#tut-f-strings)!)

In [26]:
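import spacy
from spacytextblob.spacytextblob import SpacyTextBlob  # importing registers the 'spacytextblob' pipe

# Requires the small English model; if spacy.load raises OSError [E050],
# install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
# why not, let's add some fun sentiment analysis, because we can
nlp.add_pipe('spacytextblob')
doc = nlp(article_element.get_text())
# Note: some spacytextblob releases expose this as doc._.blob.polarity instead
print(f'Polarity: {doc._.polarity}')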

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

So does the article have an overall positive overtone? What polarity scores indicate positive tone? What scores indicate negative? Is -0.5 more negative than -0.2?

I'm actually pretty excited about match statements in Python, but I'm a nerd, so... you know.  What else can we do?

## NLP: Finding important terms and introducing "stopwords"

Our `doc` object contains information about every single lexeme in the text we had it parse.  Let's look at what these are:

In [None]:
for lexeme in doc[:10]: # just the first 10 for now
    print('---',lexeme)

--- 


--- Python
--- Will
--- Soon
--- Support
--- Switch
--- Statements
--- 


                
--- 112
--- Comments


We can see that the large blocks of whitespace are considered tokens; that's not something we particularly care about.  Let's start by going through our document and keeping only the tokens that aren't whitespace (I'll do this one as a loop constructing the list).  Notice that each token in the document has a flag (`is_space`) for whether it is whitespace or not!

In [None]:
non_ws_tokens = []
for token in doc:
    if not token.is_space:
        non_ws_tokens.append(token)
print(non_ws_tokens)

[Python, Will, Soon, Support, Switch, Statements, 112, Comments, by, :, Adam, Zeloof, April, 2, ,, 2021, Rejoice, !, Gone, are, the, long, chains, of, if, …, else, statements, ,, because, switch, statements, will, soon, be, here, —, sort, of, ., What, the, Python, gods, are, actually, giving, us, are, match, statements, ., match, statements, are, awfully, similar, to, switch, statements, ,, but, have, a, few, really, cool, and, unique, features, ,, which, I, ’ll, attempt, to, illustrate, below, ., Flip, The, Switch, A, switch, statement, is, often, used, in, place, of, an, if, …, else, ladder, ., Here, ’s, a, quick, example, of, the, same, logic, in, C, ,, first, executed, with, an, if, statement, ,, and, then, with, a, switch, statement, :, Essentially, ,, a, switch, statement, takes, a, variable, and, tests, it, for, equality, against, a, number, of, different, cases, ., If, none, of, the, cases, match, ,, then, the, default, case, is, invoked, ., Notice, in, the, example, that, each

It appears that punctuation is also its own token, so let's get rid of it.  At this point it starts to make sense to define a function to determine if we care about a particular token.  Let's build our list with a comprehension this time.

In [None]:
def we_care_about(token):
    return not (token.is_space or token.is_punct)

interesting_tokens = [token for token in doc if we_care_about(token)]
print(interesting_tokens)

[Python, Will, Soon, Support, Switch, Statements, 112, Comments, by, Adam, Zeloof, April, 2, 2021, Rejoice, Gone, are, the, long, chains, of, if, else, statements, because, switch, statements, will, soon, be, here, sort, of, What, the, Python, gods, are, actually, giving, us, are, match, statements, match, statements, are, awfully, similar, to, switch, statements, but, have, a, few, really, cool, and, unique, features, which, I, ’ll, attempt, to, illustrate, below, Flip, The, Switch, A, switch, statement, is, often, used, in, place, of, an, if, else, ladder, Here, ’s, a, quick, example, of, the, same, logic, in, C, first, executed, with, an, if, statement, and, then, with, a, switch, statement, Essentially, a, switch, statement, takes, a, variable, and, tests, it, for, equality, against, a, number, of, different, cases, If, none, of, the, cases, match, then, the, default, case, is, invoked, Notice, in, the, example, that, each, case, is, terminated, by, a, break, This, protects, agains

Let's determine the most frequent terms; that's a simple way to estimate the most important parts of a body of text.  Note the use of the `Counter` collection data structure; it's essentially a superpowered dictionary.  Because each token carries its context (its position in the document), two tokens with the same text are still different objects, so counting the tokens directly would give everything a frequency of 1.  To avoid that, we turn all of them into strings using the `map` function.
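To see why the conversion to strings matters, here is a small sketch that needs no spaCy model. The class `Tok` is a hypothetical stand-in for a token, not spaCy's actual implementation: every instance is a distinct object even when the text is the same, which is enough to reproduce the effect.

```python
from collections import Counter

class Tok:
    """Hypothetical stand-in for a token: same text, but each instance is a distinct object."""
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

tokens = [Tok(w) for w in ["switch", "match", "switch"]]

by_object = Counter(tokens)          # keys are the objects themselves
by_text = Counter(map(str, tokens))  # keys are the token texts

print(max(by_object.values()))  # every object is counted once
print(by_text.most_common(1))   # identical texts are now grouped
```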

In [None]:
from collections import Counter
word_freq = Counter(map(str,interesting_tokens))
print(word_freq.most_common(10))

[('the', 29), ('a', 24), ('in', 16), ('is', 14), ('statement', 13), ('switch', 12), ('match', 12), ('of', 11), ('and', 11), ('to', 10)]


Hrm; most of the top 10 most common words are not particularly interesting.  The words "the", "a", etc. do little to add to the meaning of a sentence but they do help with human comprehension.  We call such words "stopwords".  Let's modify our code so we exclude stopwords in our count.
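The idea can be sketched without spaCy by filtering against a hand-made stopword set. The set `STOPWORDS` and the sample sentence are invented for illustration and are far smaller than a real stopword list; with spaCy, the same decision is available per token through the `is_stop` flag.

```python
from collections import Counter

# Tiny illustrative stopword list; real lists contain hundreds of words
STOPWORDS = {"the", "a", "is", "of", "in", "and", "to"}

words = "the match statement is a cousin of the switch statement".split()

# Keep only the words that are not stopwords before counting
freq = Counter(w for w in words if w not in STOPWORDS)
print(freq.most_common(1))
```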

In [None]:
def we_care_about(token):
    return not (token.is_space or token.is_punct or token.is_stop)

interesting_tokens = [token for token in doc if we_care_about(token)]
word_freq = Counter(map(str,interesting_tokens))
print(word_freq.most_common(10))

[('statement', 13), ('switch', 12), ('match', 12), ('statements', 9), ('Python', 8), ('case', 8), ('example', 5), ('break', 4), ('matching', 4), ('Switch', 3)]


Now we see that the most common terms in an article about Python supporting switch statements include "statement", "statements", "Python", "Switch", and "match".  You might notice that there are a few issues with this.

* "Switch" and "switch" are both included; we could fix this by converting every string to lowercase.
* "statement" and "statements", as well as "match" and "matching", are counted as different terms even though they have the same base.

The second problem is harder to solve, or it would be if we weren't using such a powerful library. As you learned in your initial exploration of spaCy, part of the pipeline is Lemmatization.  This is the process of finding the base form of each word.  Let's see what happens when we use the base form of each word instead of the token itself:
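As a toy illustration of what normalizing to a base form buys us, the sketch below maps a few inflected words to their base with a hand-written lookup table and lowercases everything. The table `TOY_LEMMAS` is hypothetical and covers only these words; spaCy's lemmatizer is far more general and does not work this way internally.

```python
from collections import Counter

# Hypothetical lookup table covering only the words in this example
TOY_LEMMAS = {"statements": "statement", "matching": "match"}

words = ["statement", "statements", "match", "matching", "Switch", "switch"]

# Lowercase first, then replace the word with its base form when we know one
normalized = [TOY_LEMMAS.get(w.lower(), w.lower()) for w in words]

raw_counts = Counter(words)
base_counts = Counter(normalized)
print(raw_counts)
print(base_counts)
```

Six distinct raw strings collapse into three base forms, each counted twice, which is the effect we want on the real article.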

In [None]:
interesting_lemmas = [token.lemma_ for token in doc if we_care_about(token)]
lemma_freq = Counter(interesting_lemmas)
print(lemma_freq.most_common(10))

[('statement', 22), ('match', 14), ('switch', 13), ('case', 11), ('Python', 8), ('example', 5), ('break', 5), ('go', 3), ('feature', 3), ('execute', 3)]


We still run into case sensitivity, but we can fix that by converting the string to lower case:

In [None]:
interesting_lemmas = [token.lemma_.lower() for token in doc if we_care_about(token)]
lemma_freq = Counter(interesting_lemmas)
print(lemma_freq.most_common(10))

[('statement', 22), ('switch', 15), ('match', 14), ('case', 11), ('python', 9), ('example', 5), ('break', 5), ('matching', 5), ('pattern', 5), ('go', 3)]


Cool!  Let's store the 5 most common words in a set and try some experiments

In [None]:
cool_words = set()
for lemma, freq in lemma_freq.most_common(5):
    cool_words.add(lemma)
print(cool_words)

{'switch', 'python', 'statement', 'case', 'match'}


As an experiment, let's see how many words in each sentence are important (or "cool", as I've named the set).

In [None]:
sentences = list(doc.sents) # Thanks spaCy for just giving us our sentences
for sentence in sentences:
    count = 0
    for token in sentence:
        if token.lemma_.lower() in cool_words:
            count += 1
    # because there's a bunch of junk newlines, we'll replace those with nothing, as well as a little bit of whitespace
    sent_str = str(sentence).replace('\n','').replace('  ',' ')
    print(count,':', sent_str)

2 : Python Will Soon Support Switch Statements        112 Comments      by:Adam ZeloofApril 2, 2021Rejoice!
3 : Gone are the long chains of if…else statements, because switch statements will soon be here — sort of.
3 : What the Python gods are actually giving us are match statements.
4 : match statements are awfully similar to switch statements, but have a few really cool and unique features, which I’ll attempt to illustrate below.
3 : Flip The SwitchA switch statement is often used in place of an if…else ladder.
6 : Here’s a quick example of the same logic in C, first executed with an if statement, and then with a switch statement:Essentially, a switch statement takes a variable and tests it for equality against a number of different cases.
3 : If none of the cases match, then the default case is invoked.
1 : Notice in the example that each case is terminated by a break.
2 : This protects against more than one case matching (or allows for cascading), as the cases are checked in the or
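For comparison, here is a rough plain-string version of the same count, assuming a hypothetical helper `count_important` and two sentences copied from the output above. It matches exact lowercase words only, so it undercounts relative to the lemma-based loop ("cases" is not recognized as "case").

```python
import re

def count_important(sentence, important_words):
    """Count how many words in a sentence appear in the set of important words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for w in words if w in important_words)

important = {"switch", "python", "statement", "case", "match"}
sentences_demo = [
    "If none of the cases match, then the default case is invoked.",
    "Notice in the example that each case is terminated by a break.",
]

scores = [(count_important(s, important), s) for s in sentences_demo]
for score, text in scores:
    print(score, ":", text)
```

The first sentence scores 2 here but 3 with lemmas, which is one concrete reason to compare on `token.lemma_` rather than on the raw text.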

### Bonus: How many words in a sentence?

When counting words, it's probably best to skip whitespace and punctuation tokens.

In [None]:
def sentence_length(sent):
    count = 0
    for token in sent:
        if not (token.is_space or token.is_punct):
            count += 1
    return count
print(sentence_length(sentences[0]), sentences[0])

15 

Python Will Soon Support Switch Statements


                112 Comments            

by:
Adam Zeloof



April 2, 2021








Rejoice!
