# Introduction to web-scraping: Solutions


## Outline

* [Structured queries with APIs](#apis)
    * [Google Fact Check API](#factapi)
* [URL collection with automated Google search](#URLs)
    * [Scraping school URLs](#school_URLs)
    * [Scraping URLs using an exclusion list](#exclusionlist)
* [Making `Requests`](#request)

**__________________________________**


# Structured queries with APIs<a id='apis'></a>

## Google Fact Check API<a id='factapi'></a>

In [1]:
# ONLY IF YOU HAVE API KEY: Get key from file
api_key_fp = '../extra/api_key.txt'
with open(api_key_fp) as keyfile:
    key = keyfile.read().strip()
    
# Import package for making web requests
import requests

In [2]:
# Define what to search for in API
query = 'infrastructure'

# Set backend URL for requesting data from API
search_url = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

# Make data request (first page of results only)
response = requests.get(
    url=search_url, 
    params=dict(
        key=key, 
        languageCode='en-US', 
        query=query)).json()

# Show the result
print(response)

{'claims': [{'text': '“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”', 'claimant': 'Republican National Committee', 'claimDate': '2021-04-20T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/21/faulty-math-claim-that-bidens-infrastructure-plan-costs-850000-per-job/', 'title': "Analysis | Faulty math: The claim that Biden's infrastructure plan ...", 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': 'Infrastructure only comprises of roads and bridges.', 'claimant': 'Leader McConnell', 'claimDate': '2021-04-03T02:33:00Z', 'claimReview': [{'publisher': {'site': 'misbar.com'}, 'url': 'https://misbar.com/en/factcheck/2021/04/03/infrastructure-is-more-than-roads-and-bridges', 'title': 'Infrastructure is More than Roads and Bridges | Misbar', 'reviewDate': '2021-04-03T02:33:00Z', 'textualRating': 'Fake', 'languageCode':

Notice that the claims in the query response only sometimes include `claimant` and `claimDate` (and the publisher's `name` is occasionally missing as well), but in this sample they _always_ include these keys: 
```
text, claimReview[publisher[site], url, title, textualRating]
```
Here, `text` is the claim (often wrong), `publisher` is the fact-checking site (e.g., Snopes), and `textualRating` is the fact checker's evaluation of the claim (e.g., 'Mostly True', 'Pants on Fire', 'Two Pinocchios').
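The request above fetches only the first page of results. The full response copied in the challenge below ends with a `nextPageToken`; to my understanding, sending that value back as a `pageToken` parameter asks the API for the next page. Here is a minimal sketch under that assumption. The helper names `build_search_params` and `next_page_token`, and the placeholder key, are hypothetical. It uses only the standard library to show the URL that would be requested, so nothing is sent over the network.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def build_search_params(key, query, page_token=None, language_code="en-US"):
    """Build the query parameters for one page of results (hypothetical helper)."""
    params = {"key": key, "languageCode": language_code, "query": query}
    if page_token:
        # Assumption: the API accepts the previous page's token as `pageToken`
        params["pageToken"] = page_token
    return params

def next_page_token(response):
    """Return the token for the next page, or None if this was the last page."""
    return response.get("nextPageToken")

# Stand-in for a first-page response; only the token matters here
first_page = {"claims": [], "nextPageToken": "CAo"}

params = build_search_params("YOUR_API_KEY", "infrastructure",
                             page_token=next_page_token(first_page))
print(SEARCH_URL + "?" + urlencode(params))
```

To fetch the page for real, you would pass the same dictionary to `requests.get(url=SEARCH_URL, params=params)` and repeat until `next_page_token` returns `None`.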

## Challenge

Show the `claimant`, `text`, `claimDate`, and `textualRating` features--_when available_--for the first 10 claims in the API response. I've copied the raw response output below for you to play with.<br/>
_Hint:_ `claimReview` is a list data type. How do you need to call the list to access the dictionary within it?

In [1]:
response = {'claims': [{'text': '“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”', 'claimant': 'Republican National Committee', 'claimDate': '2021-04-20T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/21/faulty-math-claim-that-bidens-infrastructure-plan-costs-850000-per-job/', 'title': "Analysis | Faulty math: The claim that Biden's infrastructure plan ...", 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': 'Infrastructure only comprises of roads and bridges.', 'claimant': 'Leader McConnell', 'claimDate': '2021-04-03T02:33:00Z', 'claimReview': [{'publisher': {'site': 'misbar.com'}, 'url': 'https://misbar.com/en/factcheck/2021/04/03/infrastructure-is-more-than-roads-and-bridges', 'title': 'Infrastructure is More than Roads and Bridges | Misbar', 'reviewDate': '2021-04-03T02:33:00Z', 'textualRating': 'Fake', 'languageCode': 'en'}]}, {'text': '“These figures are what you would consider regular appropriations-plus. So it’s baseline-plus.”', 'claimant': 'Shelley Moore Capito', 'claimDate': '2021-04-22T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/26/apples-apples-senate-gop-infrastructure-proposal-is-smaller-than-it-appears/', 'title': 'Analysis | Apples to apples, the Senate GOP infrastructure proposal ...', 'textualRating': 'Correct', 'languageCode': 'en'}]}, {'text': 'Says Joe Biden’s infrastructure plan “is the Green New Deal.”', 'claimant': 'Citizens United', 'claimDate': '2021-03-31T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/', 'title': "Citizens United calls Biden's infrastructure plan the Green New Deal ...", 'textualRating': 'Mostly False', 'languageCode': 'en'}]}, {'text': 'There was no public infrastructure built during the Benigno “Noynoy” Aquino III administration.', 'claimReview': [{'publisher': {'name': 'Rappler', 'site': 'rappler.com'}, 'url': 'https://www.rappler.com/newsbreak/fact-check/no-infrastructure-built-under-noynoy-aquino', 'title': 'FALSE: No infrastructure built under Noynoy Aquino', 'reviewDate': '2021-03-30T06:19:28Z', 'textualRating': 'False', 'languageCode': 'en'}]}, {'text': '“Only about 6% of the president’s proposal actually goes" to infrastructure, meaning "water, wastewater ... highways, roads, bridges, perhaps broadband.”', 'claimant': 'John Thune', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'FactCheck.org', 'site': 'factcheck.org'}, 'url': 'https://www.factcheck.org/2021/04/underselling-the-infrastructure-in-infrastructure-plan/', 'title': 'Underselling the Infrastructure in Infrastructure Plan', 'textualRating': '6% is too low', 'languageCode': 'en'}]}, {'text': '“The proposed tax increases in the Biden administration’s infrastructure plan could lead to 1 million fewer jobs in the first two years.”', 'claimant': 'Roy Blunt', 'claimDate': '2021-04-13T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/14/pair-misleading-gop-attacks-bidens-infrastructure-plan/', 'title': "Analysis | A pair of misleading GOP attacks on Biden's infrastructure ...", 'reviewDate': '2021-04-14T12:49:17Z', 'textualRating': 'Mostly False', 'languageCode': 'en'}]}, {'text': '“This is a massive social welfare spending program combined with a massive tax increase on small-business job creators.”', 'claimant': 'Roger Wicker', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/14/pair-misleading-gop-attacks-bidens-infrastructure-plan/', 'title': "Analysis | A pair of misleading GOP attacks on Biden's infrastructure ...", 'reviewDate': '2021-04-14T12:49:17Z', 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': '“Something less than 6% ... of this proposal that President Biden has put forward is actually focused on infrastructure.”', 'claimant': 'Liz Cheney', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/13/liz-cheney/liz-cheneys-dubious-claim-just-6-biden-plan-infras/', 'title': "Liz Cheney's dubious claim that just 6% of Biden plan is ...", 'textualRating': 'Pants on Fire', 'languageCode': 'en'}]}, {'text': 'President Joe Biden’s infrastructure proposal “is fully paid for. Across 15 years, it would raise all of the revenue needed for these once-in-a-lifetime investments."', 'claimant': 'Pete Buttigieg', 'claimDate': '2021-04-04T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/07/pete-buttigieg/joe-bidens-infrastructure-bill-fully-paid/', 'title': "Is Joe Biden's infrastructure proposal fully paid for?", 'textualRating': 'Mostly True', 'languageCode': 'en'}]}], 'nextPageToken': 'CAo'}

In [2]:
# Solution

for claim in response['claims'][:10]:
    if 'claimant' in claim and 'claimDate' in claim:
        print(f"Claim by {claim['claimant']} on {claim['claimDate']}, rated {claim['claimReview'][0]['textualRating']}: \n{claim['text']}")
    else:
        print(f"Claim rated {claim['claimReview'][0]['textualRating']}: \n{claim['text']}")
    print()

Claim by Republican National Committee on 2021-04-20T00:00:00Z, rated Three Pinocchios: 
“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”

Claim by Leader McConnell on 2021-04-03T02:33:00Z, rated Fake: 
Infrastructure only comprises of roads and bridges.

Claim by Shelley Moore Capito on 2021-04-22T00:00:00Z, rated Correct: 
“These figures are what you would consider regular appropriations-plus. So it’s baseline-plus.”

Claim by Citizens United on 2021-03-31T00:00:00Z, rated Mostly False: 
Says Joe Biden’s infrastructure plan “is the Green New Deal.”

Claim rated False: 
There was no public infrastructure built during the Benigno “Noynoy” Aquino III administration.

Claim by John Thune on 2021-04-11T00:00:00Z, rated 6% is too low: 
“Only about 6% of the president’s proposal actually goes" to infrastructure, meaning "water, wastewater ... highways, roads, bridges, perhaps broadband.”

Claim by Roy Blunt on 2021-04-13T00:00:00Z, rated Mostly Fal
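The solution above prints `claimant` and `claimDate` only when _both_ are present. One possible variant checks each optional key on its own, so a claim with a claimant but no date still shows the claimant. This is a sketch, not the only way to do it: `describe_claim` is a hypothetical helper name, and the second sample claim is made up to exercise the claimant-without-date case.

```python
def describe_claim(claim):
    """Format one claim, including claimant and date only when each is present."""
    # claimReview is a list, so index into it to reach the review dictionary
    rating = claim['claimReview'][0]['textualRating']
    who = f" by {claim['claimant']}" if 'claimant' in claim else ''
    when = f" on {claim['claimDate']}" if 'claimDate' in claim else ''
    comma = ',' if who or when else ''
    return f"Claim{who}{when}{comma} rated {rating}:\n{claim['text']}"

sample_claims = [
    # Trimmed from the response above: no claimant and no claimDate
    {'text': 'There was no public infrastructure built during the Benigno “Noynoy” Aquino III administration.',
     'claimReview': [{'textualRating': 'False'}]},
    # Hypothetical claim: has a claimant but no claimDate
    {'text': 'Example claim text.', 'claimant': 'Example Claimant',
     'claimReview': [{'textualRating': 'Mostly False'}]},
]

for claim in sample_claims:
    print(describe_claim(claim))
    print()
```

With the full `response` from the challenge, the loop would instead iterate over `response['claims'][:10]`.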

# URL collection with automated Google search<a id='URLs'></a>

## Scraping school URLs<a id='school_URLs'></a>

In [3]:
# Import automated Google search package
from googlesearch import search

# Define metadata for a single entity: a DC charter school
school_name = 'Capital City Public Charter School'
school_address = '100 Peabody Street NW, Washington, DC 20011'

# Search for first 10 Google results using joined metadata, show each one
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.ccpcs.org/
https://www.ccpcs.org/about/our-staff
https://www.ccpcs.org/about/our-staff/join-our-team
https://www.ccpcs.org/current-families/calendar
https://www.ccpcs.org/about/our-staff/high-school
https://www.facebook.com/CapitalCityPCS/
https://www.myschooldc.org/schools/profile/143
https://www.niche.com/k12/capital-city-public-charter-school-washington-dc/
https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&State=11&details=5&ID2=1100035
https://www.usnews.com/education/k12/district-of-columbia/capital-city-pcs-lower-school-226373


### Challenge

Collect the first 10 results from Google for Dr. David C. Walker Intermediate School located at 6500 Ih 35 N Ste C, San Antonio, TX 78218. What do you notice about the results? How do they compare to the previous set of results?

In [4]:
# Solution

# Define metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'

# Automated search
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
https://www.homefacts.com/schools/Texas/Bexar-County/San-Antonio/Dr-David-C-Walker-El.html


These results are much less clear and organized: Each one points to a different site, and all of them are third parties. Why would this be the case? 
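One way to make this comparison concrete is to count how many results each domain contributes. The sketch below does that for the two result lists printed above, using only the standard library. `domain_counts` is a hypothetical helper name, and stripping a leading `www.` is a simplifying assumption about what counts as "the same site".

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """Count search results per domain, ignoring a leading 'www.'."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith('www.'):
            host = host[4:]
        counts[host] += 1
    return counts

# Results copied from the two searches above
capital_city_urls = [
    'https://www.ccpcs.org/',
    'https://www.ccpcs.org/about/our-staff',
    'https://www.ccpcs.org/about/our-staff/join-our-team',
    'https://www.ccpcs.org/current-families/calendar',
    'https://www.ccpcs.org/about/our-staff/high-school',
    'https://www.facebook.com/CapitalCityPCS/',
    'https://www.myschooldc.org/schools/profile/143',
    'https://www.niche.com/k12/capital-city-public-charter-school-washington-dc/',
    'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&State=11&details=5&ID2=1100035',
    'https://www.usnews.com/education/k12/district-of-columbia/capital-city-pcs-lower-school-226373',
]
walker_urls = [
    'https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/',
    'https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037',
    'https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298',
    'https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/',
    'https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa',
    'https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/',
    'https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile',
    'https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869',
    'https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80',
    'https://www.homefacts.com/schools/Texas/Bexar-County/San-Antonio/Dr-David-C-Walker-El.html',
]

capital_counts = domain_counts(capital_city_urls)
walker_counts = domain_counts(walker_urls)
print(capital_counts.most_common(1))  # the domain that dominates the first search
print(len(walker_counts), 'distinct domains in the second search')
```

A single domain that owns several of the top results is a reasonable hint that it is the entity's own site; ten results from ten different domains suggests there is no such site to find.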

## Scraping URLs using an exclusion list<a id='exclusionlist'></a>

In [5]:
# Initial version of code

# Define excluded domains to filter out: third-party domains/false positives that we DON'T want to scrape 
exclusions = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com']

# Define search metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'
#school_name = "River City Scholars Charter Academy"
#school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid excluded domains
print("Successfully collected Google search results.")

# Initialize exclusions match counter: How many excluded domains has this search encountered?
excluded_num = 0 

# Loop through Google search output to find first good result:
good_url = None # Stays None if every result is on the exclusion list
for url in urls:
    if any(domain in url for domain in exclusions):
        print(f'Bad site detected: {url}') 
        excluded_num += 1 # Add one to exclusions match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(excluded_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found

# Only report a URL if the loop actually found one
if good_url is None:
    print('No quality URL found.')
else:
    print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Success! URL obtained by Google search with 1 bad URLs avoided.
Quality URL: https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037


What do you think of [the "quality" URL we landed on](https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037)? What does this tell us about our exclusion list?

### Challenge

Improve our automated searching to try to get the genuine URL of Dr. David C. Walker Intermediate School. <br/>
_Hint_: You could try (A) adding more domains to the exclusions list OR (B) running a simple search that returns more URLs.

In [6]:
# Solution

# Option A: Define expanded exclusions
exclusions = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
              'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com', 
              'castro.tea.state.tx.us', 'yellow.place', 'trueschools.com', 'mapquest.com', 'schoolsnearme.net', 
              'homefacts.com']

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid excluded domains
print("Successfully collected Google search results.")

# Initialize exclusions match counter
excluded_num = 0 

# Get first good search result:
good_url = None # Reset so a URL from an earlier cell is not reported by mistake
for url in urls:
    if any(domain in url for domain in exclusions):
        print(f'Bad site detected: {url}') 
        excluded_num += 1 # Add one to exclusions match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(excluded_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found

# Only report a URL if the loop actually found one
if good_url is None:
    print('No quality URL found.')
else:
    print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa
Bad site detected: https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Bad site detected: https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869
Bad site detected: https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
Bad site dete

In [7]:
# Option B: Expanded simple search
for url in search(school_name + ' ' + school_address, \
                  stop=20, pause=5.0, num=20): # Get first 20 results: stop at 20, and get 20 in first page of results
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
https://www.homefacts.com/schools/Texas/Bexar-County/San-Antonio/Dr-David-C-Walker-El.html
https://www.donorschoose.org/schools/texas/school-of-excellence-in-education/dr-david-walker-elementary-school/95612
https://texas.hometown

As it turns out, Dr. David C. Walker Intermediate School closed recently; that's why Google doesn't find what used to be its [official website](https://excellence-sa.org/walker/) (though you could probably find it on the Internet Archive's Wayback Machine). How could we avoid dead ends like this in the future? One option: use your expanded exclusion list and, if no quality URL appears in the first 10 results, treat the school as closed. How else could you do this?
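The "no quality URL in the first 10 results" rule can be sketched as a small function that returns `None` when every result is excluded. The names `is_excluded` and `first_quality_url` are hypothetical. The sketch also swaps the substring test used earlier for a comparison against the URL's hostname, which is a different technique: it avoids false matches when an excluded domain such as `har.com` happens to appear inside an unrelated URL. The URLs are the search results printed above, so the example runs without calling Google.

```python
from itertools import islice
from urllib.parse import urlparse

def is_excluded(url, exclusions):
    """True if the URL's hostname is an excluded domain or one of its subdomains."""
    host = urlparse(url).netloc.lower()
    return any(host == domain or host.endswith('.' + domain) for domain in exclusions)

def first_quality_url(urls, exclusions, max_results=10):
    """Return the first non-excluded URL among the first max_results, else None."""
    for url in islice(urls, max_results):
        if not is_excluded(url, exclusions):
            return url
    return None # Caller can treat this as "school probably closed"

exclusions = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com',
              'publicschoolreview.com', 'nces.ed.gov', 'dnb.com', 'schooldigger.com',
              'elementaryschools.org', 'closelocation.com', 'castro.tea.state.tx.us',
              'yellow.place', 'trueschools.com', 'mapquest.com', 'schoolsnearme.net',
              'homefacts.com']

# First 10 results from the search above
walker_urls = [
    'https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/',
    'https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037',
    'https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298',
    'https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/',
    'https://yellow.place/en/dr-david-c-walker-int-san-antonio-tx-usa',
    'https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/',
    'https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile',
    'https://www.schoolsnearme.net/en/public/dr-david-c-walker-elementary/79869',
    'https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80',
    'https://www.homefacts.com/schools/Texas/Bexar-County/San-Antonio/Dr-David-C-Walker-El.html',
]

result = first_quality_url(walker_urls, exclusions)
if result is None:
    print('No quality URL in the first 10 results; the school may be closed.')
else:
    print(f'Quality URL: {result}')
```

Because `first_quality_url` accepts any iterable, the generator returned by `search(...)` could be passed in directly in place of `walker_urls`.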

## Making `Requests` <a id='request'></a>

### Challenge

Get the HTML for [this claim review by fact checking site PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/). 
Print out the first 1000 characters and compare it to the HTML you see when you view the source HTML in your browser.

In [8]:
# solution
import requests 

url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
html = response.text

html[:1000]

'\n<!DOCTYPE html>\n<html lang="en-US" dir="ltr">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="x-ua-compatible" content="ie=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>PolitiFact | Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t.</title>\n<meta name="description" content="Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal. Senate Minority Leader " />\n<meta property="og:url" content="https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/" />\n<meta property="og:image" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />\n<meta property="og:image:secure_url" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />\n<meta property="og:title" content="PolitiFact - Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t." />\