# Introduction to web-scraping: Solutions


## Outline

* [URL collection with automated Google search](#URLs)
    * [Scraping school URLs](#school_URLs)
    * [Scraping URLs using an exclusion list](#exclusionlist)
* [Making `Requests`](#request)
* [Parsing HTML](#parsing)
    * [Pretty parsing with `BeautifulSoup`](#BS)
    * [Getting human-readable text](#readable)

**__________________________________**


# URL collection with automated Google search<a id='URLs'></a>

## Scraping school URLs<a id='school_URLs'></a>

In [3]:
# Import automated Google search package
from googlesearch import search

# Define metadata for a single entity: a DC charter school
school_name = 'Capital City Public Charter School'
school_address = '100 Peabody Street NW, Washington, DC 20011'

# Search for first 10 Google results using joined metadata, show each one
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.ccpcs.org/
https://www.ccpcs.org/about/our-staff
https://www.ccpcs.org/about/our-staff/join-our-team
https://www.ccpcs.org/current-families/calendar
https://www.ccpcs.org/about/our-staff/high-school
https://www.facebook.com/CapitalCityPCS/
https://www.myschooldc.org/schools/profile/143
https://www.niche.com/k12/capital-city-public-charter-school-washington-dc/
https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&State=11&details=5&ID2=1100035
https://www.usnews.com/education/k12/district-of-columbia/capital-city-pcs-lower-school-226373


### Challenge

Collect the first 10 results from Google for Dr. David C. Walker Intermediate School located at 6500 Ih 35 N Ste C, San Antonio, TX 78218. What do you notice about the results? How do they compare to the previous set of results?

In [4]:
# Solution

# Define metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'

# Automated search
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
https://www.homefacts.com/schools/Texas/Bexar-County/San-Antonio/Dr-David-C-Walker-El.html


These results are much less clear and organized: Each one points to a different site, and all of them are third parties. Why would this be the case? 

## Scraping URLs using an exclusion list<a id='exclusionlist'></a>

In [5]:
# Initial version of code

# Define excluded domains to filter out: third-party domains/false positives that we DON'T want to scrape 
exclusions = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com']

# Define search metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'
#school_name = "River City Scholars Charter Academy"
#school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid excluded domains
print("Successfully collected Google search results.")

# Initialize exclusions match counter: How many excluded domains has this search encountered?
excluded_num = 0 
good_url = None # Stays None if every result is on the exclusion list

# Loop through Google search output to find first good result:
for url in urls:
    if any(domain in url for domain in exclusions):
        print(f'Bad site detected: {url}') 
        excluded_num += 1 # Add one to exclusions match counter
    else:
        good_url = url
        print(f"Success! URL obtained by Google search with {excluded_num} bad URLs avoided.")
        break # Exit for loop after first good url is found

print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Success! URL obtained by Google search with 1 bad URLs avoided.
Quality URL: https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037


What do you think of [the "quality" URL we landed on](https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037)? What does this tell us about our exclusion list?

### Challenge

Improve our automated searching to try to get the genuine URL of Dr. David C. Walker Intermediate School. <br/>
_Hint_: You could try (A) adding more domains to the exclusions list or (B) running a simple search that returns more URLs.

In [6]:
# Solution

# Option A: Define expanded exclusions
exclusions = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
              'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com', 
              'castro.tea.state.tx.us', 'yellow.place', 'trueschools.com', 'mapquest.com', 'schoolsnearme.net', 
              'homefacts.com']

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid excluded domains
print("Successfully collected Google search results.")

# Initialize exclusions match counter
excluded_num = 0 
good_url = None # Reset so a stale URL from an earlier cell is not reported

# Get first good search result:
for url in urls:
    if any(domain in url for domain in exclusions):
        print(f'Bad site detected: {url}') 
        excluded_num += 1 # Add one to exclusions match counter
    else:
        good_url = url
        print(f"Success! URL obtained by Google search with {excluded_num} bad URLs avoided.")
        break # Exit for loop after first good url is found

print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa
Bad site detected: https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Bad site detected: https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869
Bad site detected: https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
Bad site dete

In [7]:
# Option B: Expanded simple search
for url in search(school_name + ' ' + school_address, \
                  stop=20, pause=5.0, num=20): # Get first 20 results: stop at 20, and get 20 in first page of results
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
https://www.homefacts.com/schools/Texas/Bexar-County/San-Antonio/Dr-David-C-Walker-El.html
https://www.donorschoose.org/schools/texas/school-of-excellence-in-education/dr-david-walker-elementary-school/95612
https://texas.hometown

As it turns out, Dr. David C. Walker Intermediate School closed recently; that's why Google doesn't find what used to be its [official website](https://excellence-sa.org/walker/) (though you could probably find it on the Internet Archive's Wayback Machine). How could we avoid such dead ends in the future? One option: use your expanded exclusion list and, if no quality URL appears in the first 10 results, consider the school closed. How else could you do this?
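
The rule of thumb above (treat a school as possibly closed when no quality URL shows up in the first 10 results) can be sketched as a small helper. This is only one possible approach: the function names `is_excluded` and `first_good_url` are hypothetical, and the sample lists are shortened from the results shown earlier so that the example runs offline. As an assumption that differs from the cells above, it compares each URL's hostname against the exclusion list instead of using a substring match, so an excluded domain appearing in a query string does not trigger a false match.

```python
from urllib.parse import urlparse

def is_excluded(url, exclusions):
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in exclusions)

def first_good_url(urls, exclusions, max_results=10):
    """Return the first non-excluded URL among the first max_results, else None."""
    for position, url in enumerate(urls):
        if position >= max_results:
            break  # Give up after checking max_results URLs
        if not is_excluded(url, exclusions):
            return url
    return None

# Offline demonstration with a shortened exclusion list and result set
sample_exclusions = ["niche.com", "mapquest.com"]
sample_results = [
    "https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/",
    "https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037",
]

result = first_good_url(sample_results, sample_exclusions)
print(result or "No quality URL found; the school may be closed.")
```

In a live run, the generator returned by `search(...)` could be passed in place of `sample_results`. A return value of `None` is only a hint that the school may be closed, so it would be worth checking by hand before dropping the school from a dataset.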

## Making `Requests` <a id='request'></a>

### Challenge

Get the HTML for [this claim review by fact checking site PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/). 
Print out the first 1000 characters and compare it to the HTML you see when you view the source HTML in your browser.

In [8]:
# solution
import requests 

url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
html = response.text

html[:1000]

'\n<!DOCTYPE html>\n<html lang="en-US" dir="ltr">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="x-ua-compatible" content="ie=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>PolitiFact | Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t.</title>\n<meta name="description" content="Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal. Senate Minority Leader " />\n<meta property="og:url" content="https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/" />\n<meta property="og:image" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />\n<meta property="og:image:secure_url" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />\n<meta property="og:title" content="PolitiFact - Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t." />\

# Parsing HTML <a id='parsing'></a>

## Pretty parsing with `BeautifulSoup` <a id='BS'></a>

### Challenge

Find all the links in the above claim review page using the `<a>` tags and their `href` attributes. Print every 10th link. What do you notice about where these links point?

In [None]:
# Import BeautifulSoup for parsing
from bs4 import BeautifulSoup
import requests # for making web requests

url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser') # name the parser explicitly to avoid a warning

# solution
for link in soup.find_all('a')[::10]: # every 10th element
    print(link.get('href'))

We see lots of relative links (e.g., `/pennsylvania/`), places where the `href` seems to point nowhere (e.g., `#`), and sharing shortcuts (e.g., `https://twitter.com/share?text=PolitiFact - Citizens United calls...`). This could be cleaned up by joining relative links to the site's base URL (`https://www.politifact.com/`) and keeping only the URL itself (and nothing after it).
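
Here is a minimal sketch of that cleanup using only the standard library. The helper name `clean_link` is hypothetical, and reading "nothing after" as "drop the query string and fragment" is an assumption. The sample `href` values are the ones quoted above, so no web request is needed.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://www.politifact.com/"

def clean_link(href, base=base_url):
    """Return an absolute http(s) URL without query or fragment, or None."""
    if not href or href.startswith("#"):
        return None  # Missing href or a link that points nowhere
    absolute = urljoin(base, href)  # Resolves relative links against the base URL
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # Skip mailto:, javascript:, and similar schemes
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

sample_hrefs = [
    "/pennsylvania/",
    "#",
    "https://twitter.com/share?text=PolitiFact - Citizens United calls...",
    None,
]

cleaned = [clean_link(href) for href in sample_hrefs]
print(cleaned)
```

With the soup from the cell above, the same helper could be applied to `link.get('href')` for each `<a>` tag, discarding the `None` results.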

## Getting human-readable text <a id='readable'></a>

Not all websites use the `<p>` tag to indicate the important, human-readable text. Sometimes we need to approach HTML parsing from the other end: By finding and removing all non-informative tags. Let's use `BeautifulSoup` to build such a method. 

### Challenge

Use `decompose()` to remove from the soup all tags showing anything other than human-readable text. Below is a list of such junk tags to use as an exclusion list.

```
"b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
"samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
"title", "[document]", "script", "style", "meta", "noscript"
```

_Hint:_ Iterate over these tags to identify each one in the soup and remove it.

In [None]:
# solution

# Define inline tags for cleaning out HTML
tags_exclusions = ["b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
                  "samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
                  "title", "[document]", "script", "style", "meta", "noscript"]

# Get HTML and then soup
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser') # name the parser explicitly to avoid a warning

# Remove non-visible tags from soup with two for-loops:
for tag in tags_exclusions:
    for elem in soup.find_all(tag):
        elem.decompose()
        
# Show result
visible = soup.get_text(strip=True)
print(visible[1000:3000])

You might have noticed that word boundaries get clobbered in the output above. This is because we called `get_text()` with `strip=True`, which tells `BeautifulSoup` to strip whitespace (of any kind) from the beginning and end of each bit of text before joining the bits together with no separator. The default, `strip=False`, preserves word boundaries but leaves lots of extra whitespace (usually newlines), which takes some regular expressions to clean up.
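
To see the effect of `strip=True` on a small scale, here is a brief sketch on a made-up HTML snippet (the snippet and variable names are illustrative only). It also shows the `separator` argument of `get_text()`, which is one way to keep word boundaries while still stripping each bit of text.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document with two adjacent paragraphs
sample_html = "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# strip=True alone: pieces are stripped, then joined with no separator
clobbered = sample_soup.get_text(strip=True)

# separator=" " puts a space between the stripped pieces
spaced = sample_soup.get_text(separator=" ", strip=True)

print(clobbered)
print(spaced)
```

A single-space separator flattens paragraph breaks along with everything else, so the regular-expression approach is still useful when line structure matters.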

### Challenge

Using the above tags exclusion list and `decompose()` as before, this time use the `strip=False` parameter when calling `get_text()` to avoid combining words across whitespace boundaries. Instead, use regular expressions to clean up extra whitespaces.

In [None]:
# solution

url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser') # name the parser explicitly to avoid a warning

# More concise way to remove non-visible tags: soup(...) is shorthand for soup.find_all(...)
for s in soup(tags_exclusions):
    s.decompose()

# Don't strip spaces in-between elements, to avoid clobbering word boundaries
visible = soup.get_text(strip=False)

import re
#visible = re.sub(r"\n+", "\n", visible) # This works, but less extensible than below

import regex # better unicode support than Python's built-in re package

# Use regex to replace all consecutive spaces (including in unicode), tabs, or "|"s with a single space
visible = regex.sub(r"[ \t\h\|]+", " ", visible)
# Replace any consecutive linebreaks with a single newline
visible = regex.sub(r"[\n\r\f\v]+", "\n", visible)

print(visible)
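
If the third-party `regex` package is not installed, a similar cleanup can be sketched with Python's built-in `re` module. This is a swapped-in variant, not the solution above: the helper name `normalize_whitespace` is hypothetical, and it assumes the only horizontal whitespace worth handling is spaces, tabs, and non-breaking spaces. It also trims spaces around line breaks, which the solution above does not do.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of horizontal whitespace and of line breaks."""
    # Runs of spaces, tabs, non-breaking spaces, or "|" become a single space
    text = re.sub(r"[ \t\u00a0|]+", " ", text)
    # Runs of line breaks, plus any whitespace around them, become one newline
    text = re.sub(r"\s*[\n\r\f\v]\s*", "\n", text)
    return text.strip()

sample = "Fact  check\t|  PolitiFact\n\n\n  Mostly\u00a0\u00a0False \r\n"
cleaned_text = normalize_whitespace(sample)
print(cleaned_text)
```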

### Extra Challenge

You might have noticed that when we scraped HTML above from [this claim review by PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/), we got headers and tags like this:
```html
<p>Misinformation isn't going away just because it's a new year. Support trusted, factual information with a tax deductible contribution to PolitiFact.</p>
<p>
<a class="m-disruptor-content__link" href="/membership/">More Info</a>
</p>
<p class="c-image__caption-inner copy-xs">
The White House infrastructure plan has $111 billion to improve water and sewer systems. (Shutterstock)
</p>
```
Use what you now know about identifying HTML, removing tags, and cleaning spacing to scrape a clean explanation from the body of this article. 

_Hint:_ Use your browser to inspect this website's HTML and identify any unique types and/or classes that enclose the explanation (and nothing else).

In [None]:
# solution

# Set URL to scrape
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'

# Scrape HTML with requests and beautifulsoup
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser') # name the parser explicitly to avoid a warning

explanation = soup.find('article', class_='m-textblock').get_text() # identify this class from looking at HTML

import re
explanation = re.sub(r"\n+", "\n", explanation)

print(explanation)

Compare the output from this focused, site-specific scraping approach with that from the exclusion list method above. <br/>
**Which method gives the cleaner output? Which method is more extensible?**