# Introduction to web-scraping

It's 2021. The web is everywhere.

* If you want to buy a house, real estate agents have [websites](https://www.wendytlouie.com/) where they list the houses they're currently selling. 
* If you want to know whether to wear a rain jacket or shorts, you check the weather on a [website](https://weather.com/weather/tenday/l/Berkeley+CA+USCA0087:1:US). 
* If you want to know what's happening in the world, you read the news [online](https://www.sfchronicle.com/). 
* If you've forgotten which city is the capital of Australia, you check [Wikipedia](https://en.wikipedia.org/wiki/Australia).

**The point is this: there is an enormous amount of information (also known as data) on the web.**

If we (in our capacities as, for example, data scientists, social scientists, digital humanists, businesses, public servants or members of the public) can get our hands on this information, **we can answer all sorts of interesting questions or solve important problems**.

* Maybe you're studying gender bias in student evaluations of professors. One option would be to scrape ratings from [Rate My Professors](https://www.ratemyprofessors.com/) (provided you follow their [terms of service](https://www.ratemyprofessors.com/TermsOfUse_us.jsp#use)).
* Perhaps you want to build an app that shows users articles relating to their specified interests. You could scrape stories from various news websites and then use NLP methods to decide which articles to show which users.
* [Geoff Boeing](https://geoffboeing.com/) and [Paul Waddell](https://ced.berkeley.edu/ced/faculty-staff/paul-waddell) recently published [a great study](https://arxiv.org/pdf/1605.05397.pdf) of the US housing market by scraping millions of Craigslist rental listings. Among other insights, their study shows which metropolitan areas in the US are more or less affordable to renters.

This first day's workshop is a one-hour beginner's introduction to web scraping. 


## Learning Goals
*   

## Outline

* [Structured queries with APIs](#apis)
* [URL collection with automated Google search](#URLs)
* [Mirroring websites with `wget`](#wget)
* [Template code: See Google Places API in action](#Places)

## Background

We will do some review, but this notebook assumes you have basic familiarity with Python. If you need a beginner's introduction to coding in Python, please walk through the intro to Python notebook at `extra/intro-to-python.ipynb` and/or [this one](https://github.com/lknelson/text-analysis-course/blob/master/scripts/01.25.02_PythonBasics.ipynb) *before* the workshop. 

## Vocabulary

* *Uniform Resource Locator (URL)*: 
    * The address of a resource on the web, including directions for how to get there. A URL usually points to the files needed to show a website, but it can also point to other resources such as images, PDFs, or data files.
* *Domain name*:
    * The part of a URL that identifies the website: for instance, in https://www.example.com/ the domain name is `www.example.com`, which comes right after the `https://` scheme.
* *web-scraping* (i.e., *screen-scraping*):
    * Extracting structured information from the files that make up websites (i.e., what's shown in web browsers), relying on their HTML, CSS, and sometimes JS files. 
* *Hyper-Text Markup Language (HTML)*: 
    * The standard markup language for websites, the "nuts and bolts" of WHAT a website will display, including text.
* *Cascading Style Sheets (CSS)*: 
    * A technology used to format the layout of a webpage, i.e. HOW to make it pretty. Not usually relevant for web-scraping.
* *web-crawling*:
    * Finding web pages through links, automated search, etc. Once discovered, pages can be checked (is this website still up?), downloaded, or scraped. 
* *website mirroring*:
    * Creating a complete local copy of the files needed to display and host a website. 
* *Application Programming Interface (API)*:
    * A tool used to access structured data provided by an organization. Examples include Twitter, Reddit, Wikipedia, and the New York Times. When an API is available (not always the case), this is usually the preferred way to access data (over web-scraping).
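
To make the URL and domain name definitions concrete, here is a minimal sketch that splits a URL into its parts with Python's standard `urllib.parse` module. The URL is the weather link from the top of this notebook; nothing is downloaded.

```python
from urllib.parse import urlparse

# Example URL taken from the weather link near the top of this notebook
url = "https://weather.com/weather/tenday/l/Berkeley+CA+USCA0087:1:US"

parts = urlparse(url)
print(parts.scheme)  # the protocol used to fetch the resource
print(parts.netloc)  # the domain name (host)
print(parts.path)    # where the resource lives on that host
```

The `netloc` attribute is what this notebook calls the domain name; the `scheme` is the `https` part in front of it.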

**__________________________________**


# Structured queries with APIs<a id='apis'></a>

As an example, let's try out the [Google Fact Check API](https://developers.google.com/fact-check/tools/api/), which can be easily explored [in a browser](https://toolbox.google.com/factcheck/explorer). 
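
Before running the full script, it may help to see how a single request URL is put together. This is a minimal sketch that only builds the URL and sends nothing. The endpoint, path, and parameter names are the ones used in the code cell below; the helper name `build_search_url` and the placeholder key are assumptions for illustration.

```python
from urllib.parse import urlencode

endpoint = "https://factchecktools.googleapis.com"
search_path = "/v1alpha1/claims:search"

def build_search_url(query, site, key, page_token=None):
    """Assemble a claims:search URL (hypothetical helper, no request is sent)."""
    params = {"query": query, "reviewPublisherSiteFilter": site, "key": key}
    if page_token:  # only include a page token when we actually have one
        params["pageToken"] = page_token
    return endpoint + search_path + "?" + urlencode(params)

# "YOUR_API_KEY" is a placeholder, not a working key
url = build_search_url("rigged election", "snopes.com", key="YOUR_API_KEY")
print(url)
```

Note that `urlencode` takes care of escaping spaces and other special characters in the query, which is why the script below uses it rather than gluing strings together by hand.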

In [None]:
# Import libraries

import csv
from tqdm import tqdm
import requests # for downloading
from bs4 import BeautifulSoup, NavigableString, Tag # for html scraping
import regex as re # Regex module with Unicode support
import html5lib # slower but more accurate bs4 parser for messy HTML # lxml faster
import urllib.parse # for encoding query parameters into a URL
import json

# Import functions to scrape fact check web pages
from scrape_helpers import load_api_key, clean_text, scrape_politifact, scrape_factcheck, scrape_snopes

In [None]:
######################################################
# Call API
######################################################

# Elements in query response: text, claimDate, claimReview[publisher[name], url, textualRating]
# Columns in output CSV: date (DD-MM-YYYY), claim, truth rating, url, source (publisher), fact, explanation

if __name__ == '__main__':
    page_token = 0
    domains = ['covid', 'blm', 'election']
    query_sets = [
        ["masks", "Chinese bioweapon", "China virus"],
        ["George Floyd", "Antifa", "Black Lives Matter"],
        ["Hunter Biden", "rigged election", "mail-in", "election ballots"],
    ]

    api_key_fp = "api_key.txt"
    key = load_api_key(api_key_fp)
    endpoint = 'https://factchecktools.googleapis.com'
    search = '/v1alpha1/claims:search'

    sites = ['politifact.com', 'factcheck.org', 'snopes.com']
    site_scrapers = [scrape_politifact, scrape_factcheck, scrape_snopes]
    site_switches = ['politifact', 'factcheck.org', 'snopes']


    for i in range(0, len(domains)):
        domain = domains[i]
        queries = query_sets[i]

        for query in queries:
            claims = [] # initialize list of claims for this query (each query is saved to its own file)
            urls = set() # initialize set of fact check URLs already seen for this query

            for site in tqdm(sites, desc='Collecting data for {} via API'.format(query)):
                params = {
                    'pageToken': page_token,
                    'query': query,
                    'reviewPublisherSiteFilter': site,
                    'key': key
                }

                nextToken = True
                while nextToken:
                    url = endpoint + search + '?' + urllib.parse.urlencode(params)
                    response = requests.get(url)
                    data = response.json()

                    if 'claims' in data:
                        for claim in data['claims']:
                            if not site == 'snopes.com':
                                claims.append([claim['claimDate'],
                                               claim['text'],
                                               claim['claimReview'][0]['textualRating'],
                                               claim['claimReview'][0]['url'],
                                               claim['claimReview'][0]['publisher']['name']])
                            else:
                                claims.append([claim['claimReview'][0]['reviewDate'],
                                               claim['text'],
                                               claim['claimReview'][0]['textualRating'],
                                               claim['claimReview'][0]['url'],
                                               claim['claimReview'][0]['publisher']['name']
                                              .replace('.com', '')])

                    if 'nextPageToken' in data:
                        params['pageToken'] = data['nextPageToken']
                    else:
                        nextToken = False

            for j in tqdm(range(0, len(claims)), desc='Scraping websites for {}'.format(query)):
                claim = claims[j]
                switch = site_switches.index(claim[4].lower()) # use fact to get publisher site then index (site name is 5th element)
                scraper = site_scrapers[switch] # get scraper using index
                claim.extend(scraper(claim[3])) # scrape URL using scraper (URL is 4th element), add to existing claim info
                claims[j] = claim # record fact in list

            # Duplicate fact check URLs are skipped below when writing the CSV

            # Save output for this query
            query_string = query.replace(' ', '-')
            with open('data/fact_checker_data_{}.csv'.format(query_string), 'w', newline='') as f:
                csv_writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_MINIMAL)
                csv_writer.writerow(['date', 'claim', 'truth_rating', 'url', 'source', 'fact', 'explanation'])
                for claim in claims: # Each claim gets its own row
                    if claim[3] not in urls: # don't add if fact check URL already seen
                        csv_writer.writerow(claim) # save row
                        urls.add(claim[3]) # add to set of urls already saved

            print('Saved {} claims for {} query.'.format(str(len(urls)), query))
            print()
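
The `nextPageToken` loop in the script above is a common pattern for paginated APIs, so here is a stripped-down sketch of just that part. To keep it runnable without an API key, the real request is replaced by a stand-in function over made-up pages; `collect_all_pages` and `fake_pages` are hypothetical names, and only the `claims` and `nextPageToken` keys come from the script.

```python
def collect_all_pages(fetch_page):
    """Keep requesting pages until the response has no nextPageToken.

    fetch_page is any function that takes a page token (None for the
    first page) and returns the decoded JSON response as a dict.
    """
    items, token = [], None
    while True:
        data = fetch_page(token)
        items.extend(data.get("claims", []))
        token = data.get("nextPageToken")
        if not token:
            return items

# Stand-in for the real API: two made-up pages keyed by page token
fake_pages = {
    None: {"claims": [{"text": "claim A"}, {"text": "claim B"}], "nextPageToken": "p2"},
    "p2": {"claims": [{"text": "claim C"}]},
}

all_claims = collect_all_pages(lambda token: fake_pages[token])
print([c["text"] for c in all_claims])
```

With a real API, `fetch_page` would wrap a `requests.get(...)` call and return `response.json()`.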

# URL collection with automated Google search<a id='URLs'></a>

If you want to crawl and/or scrape an online community of websites, there's a good chance you will find yourself needing to collect their URLs. If you're lucky, you have comprehensive metadata describing these entities, something like their name and physical address. Your next step in this scenario would be to automate a Google search to collect the best URL matching each entity. 

How can you scrape URLs from Google? There are two fairly easy ways.

First, the **Google Places API**, which is the best option to do this at scale. You would need to apply for an API key from Google: go to the [Google cloud console](https://console.cloud.google.com/), create a project, and request an API key for each service you want to use. Approval may take a few days, but once done there is a [handy Python wrapper](https://github.com/slimkrazy/python-google-places) to make this easy to use in Python. See [Google Web Services](https://developers.google.com/places/web-service/) for general documentation and [Google Developers](https://developers.google.com/places/web-service/details) for details on Place Details requests.

Be aware that Google APIs are not a free service, and they may not work at all unless you sign up for billing. However, if you apply for access under an education account or for research purposes, Google offers you credit to start with (200 dollars last I checked). Nonetheless, to avoid excessive charges (I have experience with this!), check what exact requests you're making and set up account alerts before making API calls at scale.

The second option is **automated Google search**, which is not nearly as reliable and may get you blocked if used repeatedly. This method tends to get lots of false positives and third-party website aggregators (e.g., yellowpages.com, trulia.com), so using a blacklist to manually filter results is a good idea. Check out [the source code](https://github.com/MarioVilas/googlesearch) and [documentation](https://python-googlesearch.readthedocs.io/en/latest/). _Thanks Mario Vilas for this package!_

Because this second option is free and has no waiting period, we will practice using it politely. You can see template code for running the Google Places API at the bottom of this notebook.

## Scraping school URLs

To see how this works, let's start by searching for the best URL for a charter school in Washington, D.C. Assume we have the name and address of the school.

To prevent overwhelming Google search with rapid requests--and likely getting our IP address blocked by Google as a result--let's search only for the first 10 results and include a five-second pause in between each request.

In [50]:
# Import automated Google search package
from googlesearch import search

# Define metadata for a single entity: a DC charter school
school_name = 'Capital City Public Charter School'
school_address = '100 Peabody Street NW, Washington, DC 20011'

# Search for first 10 Google results using joined metadata, show each one
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.ccpcs.org/
https://www.ccpcs.org/current-families/calendar
https://www.ccpcs.org/about/mission-and-history
https://www.ccpcs.org/admissions/applying-capital-city
https://www.ccpcs.org/program/el-education
https://www.ccpcs.org/about/our-staff
https://www.facebook.com/CapitalCityPCS/
https://www.greatschools.org/washington-dc/washington/282-Capital-City-PCS---Lower-School/
https://www.greatschools.org/washington-dc/washington/591-Capital-City-High-School-PCS/
https://www.greatschools.org/washington-dc/washington/591-Capital-City-High-School-PCS/#College_readiness


This is a pretty strong result: the first six matches share the domain of https://www.ccpcs.org/, so this is probably the best match. We identified a URL without even visiting any websites!
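
The "most results share a domain" reasoning can also be automated. Here is a minimal sketch that counts how often each domain appears among the ten results printed above, using only the standard library. Picking the most frequent domain is just one possible heuristic, not a guaranteed way to find the official site.

```python
from collections import Counter
from urllib.parse import urlparse

# The ten search results printed above
results = [
    "https://www.ccpcs.org/",
    "https://www.ccpcs.org/current-families/calendar",
    "https://www.ccpcs.org/about/mission-and-history",
    "https://www.ccpcs.org/admissions/applying-capital-city",
    "https://www.ccpcs.org/program/el-education",
    "https://www.ccpcs.org/about/our-staff",
    "https://www.facebook.com/CapitalCityPCS/",
    "https://www.greatschools.org/washington-dc/washington/282-Capital-City-PCS---Lower-School/",
    "https://www.greatschools.org/washington-dc/washington/591-Capital-City-High-School-PCS/",
    "https://www.greatschools.org/washington-dc/washington/591-Capital-City-High-School-PCS/#College_readiness",
]

# Count results per domain and take the most frequent one
domain_counts = Counter(urlparse(u).netloc for u in results)
best_domain, n_hits = domain_counts.most_common(1)[0]
print(best_domain, n_hits)
```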

Notice that results 7-10 are about the right school, but they don't point to its genuine website--with all its descriptive language, images, and subpages. Even in this case with a strong topline result, we can already get a feel for what websites will pollute our automated searches: Facebook and greatschools.org are a good start to making a blacklist to filter the results. 

Now let's try something harder to find.

### Challenge

Collect the first 10 results from Google for Dr. David C. Walker Intermediate School located at 6500 Ih 35 N Ste C, San Antonio, TX 78218. What do you notice about the results? How do they compare to the previous set of results?

In [53]:
# Solution

# Define metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'

# Automated search
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html
https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80


These results are much less clear and organized: Each one points to a different site, and all of them are third parties. Interestingly, the [first result](https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/) (with domain of https://www.niche.com) does point to the [official website](https://excellence-sa.org/walker/), but extracting this information systematically would mean web-scraping--which we will get to tomorrow! 

## Scraping URLs using a blacklist

To provide cleaner search results, let's filter out the third-party websites from the previous two examples. 

Many of these websites can show up with either 'http' or 'https', often with or without a 'www', but usually have a consistent top-level domain (e.g., 'com'). Exact string matching would fail to capture matches across these variations. Regular expressions could do this, but for now let's just filter out those search results that contain the core of any blacklisted domain name (e.g., niche.com). 
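
A slightly stricter alternative to substring matching is to compare only the host part of each URL against the blacklist. A plain substring check can also fire when a blacklisted name merely appears in the path or query string. The sketch below is one way to do this with the standard library; the helper name `is_blacklisted` and the `example.org` URL are made up for illustration.

```python
from urllib.parse import urlparse

def is_blacklisted(url, blacklist):
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blacklist)

blacklist = ["niche.com", "har.com"]

# 'www.' prefix and http vs. https do not matter, because only the host is compared
print(is_blacklisted("https://www.niche.com/k12/some-school/", blacklist))
print(is_blacklisted("http://har.com/school/015806106/", blacklist))
# A blacklisted name inside the query string is not a match
print(is_blacklisted("https://example.org/page?ref=niche.com", blacklist))
```

The rest of this notebook sticks with the simpler substring check, which is usually good enough for a first pass.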

Let's get the first result for the previous school (Dr. David C. Walker Intermediate School) that doesn't match any blacklisted domains. 

In [59]:
# Define blacklisted domains to filter out: third-party domains/false positives that we DON'T want to scrape 
blacklist = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com']

# Define search metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'
#school_name = "River City Scholars Charter Academy"
#school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid blacklisted domains
print("Successfully collected Google search results.")

# Initialize blacklist match counter: How many blacklisted domains has this search encountered?
blacklisted_num = 0 

# Loop through Google search output to find first good result:
good_url = None # Stays None if every result is blacklisted
for url in urls:
    if any(domain in url for domain in blacklist):
        print(f'Bad site detected: {url}') 
        blacklisted_num += 1 # Add one to blacklist match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(blacklisted_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found
        
print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Bad site detected: https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
Bad site detected: https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html
Bad site detected: https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
Bad site detected: https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
Bad site detected:

What do you think of [the "quality" URL we landed on](http://castro.tea.state.tx.us/charter_apps/content/downloads/Renewals/015806_2.pdf)? Looks like we need to expand our blacklist!
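
If you plan to run this filter over many schools, it can be handy to wrap the loop in a small function that makes the "nothing survived the blacklist" case explicit. This is a minimal sketch using the same substring check as above; the function name `first_good_url` is an assumption, and the short blacklist and result list are trimmed from the outputs in this notebook.

```python
def first_good_url(urls, blacklist):
    """Return (first URL matching no blacklisted domain, number of URLs skipped).

    The URL is None when every result is blacklisted.
    """
    skipped = 0
    for url in urls:
        if any(domain in url for domain in blacklist):
            skipped += 1
        else:
            return url, skipped
    return None, skipped

blacklist = ["niche.com", "har.com"]
results = [
    "https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/",
    "https://www.har.com/school/015806106/dr-david-c-walker-elementary-school",
    "https://excellence-sa.org/walker/",
]

print(first_good_url(results, blacklist))      # a good URL after two skipped results
print(first_good_url(results[:2], blacklist))  # every result blacklisted
```

Because the function only iterates over `urls`, it should also accept the generator returned by `search(...)` directly.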

### Challenge

Improve our automated searching to get the genuine URL of Dr. David C. Walker Intermediate School. <br/>
_Hint_: You could try (A) adding more URLs to the blacklist OR (B) running a simple search that returns more URLs.

In [60]:
# Your solution here

# Option A: Define expanded blacklist
blacklist = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com', 
             'castro.tea.state.tx.us']

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid blacklisted domains
print("Successfully collected Google search results.")

# Initialize blacklist match counter
blacklisted_num = 0 

# Get first good search result:
good_url = None # Stays None if every result is blacklisted
for url in urls:
    if any(domain in url for domain in blacklist):
        print(f'Bad site detected: {url}') 
        blacklisted_num += 1 # Add one to blacklist match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(blacklisted_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found
        
print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Bad site detected: https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
Bad site detected: https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html
Bad site detected: https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
Bad site detected: https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
Bad site detected:

In [61]:
# Option B: Expanded simple search
for url in search(school_name + ' ' + school_address, \
                  stop=20, pause=5.0, num=20): # Get first 20 results: stop at 20, and get 20 in first page of results
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html
https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
http://castro.tea.state.tx.us/charter_apps/content/downloads/Renewals/015806_2.pdf
https://excellence-sa.org/walker/
https

# Mirroring websites with `wget`<a id='wget'></a>

`wget` is classic (circa 1996, but still updated) [free software](https://www.gnu.org/philosophy/free-sw) for downloading web content non-interactively from the shell. It's often used for basic one-time downloads, much like `curl` in the shell or `urllib.request.urlretrieve` in Python's standard library. But where `wget` really shines is in its extensive customization, including retrying failed connections, following links, and duplicating a remote website's files and structure to the point of having an identical local copy (website mirroring). 
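
For comparison, here is what a basic one-time download looks like with Python's built-in `urllib.request.urlretrieve`. So that this sketch runs without a network connection, it "downloads" a small local file through a `file://` URL; the file names and page contents are made up, and for a real page you would pass an `http(s)` URL instead.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Create a small local "page" so the example runs offline
workdir = Path(tempfile.mkdtemp())
source = workdir / "page.html"
source.write_text("<html><body>Hello, web!</body></html>", encoding="utf-8")

# urlretrieve(url, filename) saves whatever the URL points to into filename
saved_path, headers = urlretrieve(source.as_uri(), str(workdir / "copy.html"))
print(Path(saved_path).read_text(encoding="utf-8"))
```

This fetches exactly one file, which is the kind of simple download that `wget` goes well beyond.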

Let's try using the nice Python wrapper for `wget` to download the MDI News page nested in the McCourt School of Public Policy site:

In [33]:
import wget 
wget.download(url='https://mccourt.georgetown.edu/research/mdi-news/')

'download.wget'

We can check out the contents of this (rather poorly named) file using the Jupyter interface in the previous tab. 

We got some HTML--cool! But what if we want something clickable and interactive? This is easiest to do with `wget` run via its native shell, rather than this simple Python wrapper--which also doesn't allow for `wget`'s more advanced functionality. We can use the helpful `!` prefix to run shell commands straight from this notebook. 

Let's make a new `wget` request to download a version of the same page that's easier to see in your browser. 

In [41]:
!wget https://mccourt.georgetown.edu/research/mdi-news/

--2021-04-23 14:43:11--  https://mccourt.georgetown.edu/research/mdi-news/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8001::1, 2620:12a:8000::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157499 (154K) [text/html]
Saving to: ‘index.html.1’


2021-04-23 14:43:12 (8.80 MB/s) - ‘index.html.1’ saved [157499/157499]



Use your Jupyter browser to check out the results: just click on `index.html` in your current folder (probably this is `day-1/`) to view the page. (If a file with that name already existed, `wget` saves the new copy as `index.html.1`, as in the output above.) What do you notice? How does it compare to viewing https://mccourt.georgetown.edu/research/mdi-news/ in your browser? Try clicking the links. Where can you go on the actual page that your local copy can't show you? Do you have local copies of the images?

You might have noticed that we only ended up with some HTML--we didn't download any of the files associated with the webpage. So, this isn't a true copy; we couldn't host the page ourselves, analyze its images, or easily use its content for purposes other than viewing. How do we mirror the full site?

To do this, we need only the `--page-requisites` option, which downloads all the resources needed to render the page in a browser: CSS, JavaScript, image files, etc. To keep from overloading the server, let's pause for a few seconds between downloads using the `--wait` option. 
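
To get a feel for what counts as a "page requisite", here is a rough sketch that lists the images, scripts, and stylesheets a page refers to, using Python's built-in HTML parser. It is a simplified illustration of the idea, not how `wget` itself works (`wget` handles more cases), and the HTML snippet and class name are made up.

```python
from html.parser import HTMLParser

class RequisiteFinder(HTMLParser):
    """Collect URLs of images, scripts, and stylesheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.requisites = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.requisites.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.requisites.append(attrs["href"])

# Made-up page for illustration
html = """
<html><head>
  <link rel="stylesheet" href="/css/site.css">
  <script src="/js/app.js"></script>
</head><body>
  <img src="/images/logo.png" alt="logo">
  <a href="/about/">About</a>
</body></html>
"""

finder = RequisiteFinder()
finder.feed(html)
print(finder.requisites)
```

Notice that the ordinary link to `/about/` is not a requisite: it leads to another page rather than to a file needed to display this one.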

Let's use some other features as well for politeness and subtlety (i.e. to avoid getting blocked). Here is an explanation of each:

```shell
--page-requisites             Grabs all of the linked resources necessary to render the page (images, CSS, javascript, etc.)
--wait                        Pauses between downloads (in seconds)
--tries=3                     Retries failed downloads 3 times
--user-agent=Mozilla          Makes wget look like a Mozilla browser by masking its user agent
--header="Accept:text/html"   Sends header with each HTML request, looks more browser-ish
--no-check-certificate        Doesn't check authenticity of website server (use only with trusted websites!)
```

In [38]:
!wget --page-requisites --wait=2 --tries=3 --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/mdi-news/

--2021-04-23 14:16:52--  https://mccourt.georgetown.edu/research/mdi-news/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8001::1, 2620:12a:8000::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157499 (154K) [text/html]
Saving to: ‘index.html’


2021-04-23 14:16:52 (9.30 MB/s) - ‘index.html’ saved [157499/157499]



Check out the results--what's similar and what's different? See `/research/mdi-news/` for the `index.html` (sometimes this is `default.html`) page we saw earlier. 

`wget` has a rich array of options. Here are some of the most useful ones in addition to those above:

```shell
--mirror                      Turns on recursion and time-stamping with infinite depth, suited to mirroring a full website
--recursive                   Recursively downloads files and follows links
--no-parent                   Does not follow links above the hierarchical level of the input URL
--convert-links               Rewrites links in downloaded files to point to local copies where available
--accept                      Downloads only files with suffixes in this list (e.g., .html)
--execute robots=off          Turns off automatic robots.txt checking, so the server's crawler exclusions are ignored (use with care)
--random-wait                 Randomizes the defined wait period to between 0.5 and 1.5x that value
--background                  Runs the download in the background (useful for huge downloads)
--spider                      Checks whether the remote file exists without downloading it (mimics web spiders)
--domains                     Follows links only to the listed domains
--user --password             Supplies a username and password for password-protected sites
```
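
When you combine many of these options, it can be easier to assemble the command in Python than to type it by hand. The sketch below only builds and prints a command; it does not run `wget`. The function name `build_wget_command` and its defaults are assumptions, while the flags themselves come from the two lists above.

```python
import shlex

def build_wget_command(url, wait=2, tries=3, accept=None, recursive=False):
    """Return a wget command as a list of arguments (nothing is executed here)."""
    cmd = ["wget", "--page-requisites", f"--wait={wait}", "--random-wait", f"--tries={tries}"]
    if recursive:
        cmd += ["--recursive", "--no-parent", "--convert-links"]
    if accept:
        cmd += ["--accept", ",".join(accept)]
    cmd.append(url)
    return cmd

cmd = build_wget_command("https://mccourt.georgetown.edu/research/",
                         accept=[".html"], recursive=True)
print(shlex.join(cmd))
# To actually run it, you could pass the list to subprocess.run(cmd, check=True)
```

Keeping the command as a list of arguments avoids quoting problems if a URL contains characters the shell treats specially.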

### Challenge

Download only `.html` files from https://mccourt.georgetown.edu/research/ and links below that.

In [42]:
# Solution
!wget --accept .html --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/

--2021-04-23 15:03:17--  https://mccourt.georgetown.edu/research/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8000::1, 2620:12a:8001::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157616 (154K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/index.html’


2021-04-23 15:03:18 (9.33 MB/s) - ‘mccourt.georgetown.edu/research/index.html’ saved [157616/157616]

Loading robots.txt; please ignore errors.
--2021-04-23 15:03:20--  https://mccourt.georgetown.edu/robots.txt
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 116 [text/plain]
Saving to: ‘mccourt.georgetown.edu/robots.txt.tmp’


2021-04-23 15:03:20 (3.37 MB/s) - ‘mccourt.georgetown.edu/robots.txt.tmp’ saved [116/116]

--2021-04-23 15:03:22--  https://mccourt.georgetown.edu/research/featured-publications/
Reusing existing conn

HTTP request sent, awaiting response... 404 Not Found
2021-04-23 15:03:55 ERROR 404: Not Found.

--2021-04-23 15:03:57--  https://mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 143331 (140K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/index.html’


2021-04-23 15:03:58 (8.29 MB/s) - ‘mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/index.html’ saved [143331/143331]

--2021-04-23 15:04:00--  https://mccourt.georgetown.edu/research/the-massive-data-institute/shape-the-policy-conversation/
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 142753 (139K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/the-massive-data-institute/shape-the-policy-conversation/index.html’


2021-0

Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/resources/dp-resources/index.html... 23-8
Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/resources/index.html... 34-8
Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/mdi-conferences-and-panels/index.html... 32-8
Converting links in mccourt.georgetown.edu/research/mccourt-centers/index.html... 24-8
Converted links in 24 files in 0.06 seconds.


### Challenge

Use advanced options for `wget` (listed above) to mirror a website you use often. Be sure to use a polite `--wait` and avoid downloading anything with massive numbers of links, files, or pages (e.g., don't try YouTube.com or Wikipedia.org). If you want to download a segment or specific page within a website (e.g., a single YouTube channel or Wikipedia page), use the `--recursive` option with `--no-parent` (to follow only links within the input URL).

While you let `wget` run, read more about it in its [manual](https://www.gnu.org/software/wget/manual/wget.html) and see other examples of `wget` usage [here](https://gist.github.com/bueckl/bd0a1e7a30bc8e2eeefd) and [here](https://phoenixnap.com/kb/wget-command-with-examples). 
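
Part of being polite is respecting a site's `robots.txt`, which `wget` checks automatically during recursive downloads (as the "Loading robots.txt" lines in the output above suggest). If you want to check the rules yourself from Python, the standard library has a parser for them. This sketch parses a made-up `robots.txt` so it runs offline; for a real site you could instead point the parser at the site's `robots.txt` with `set_url(...)` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL
print(parser.can_fetch("*", "https://example.com/research/"))
print(parser.can_fetch("*", "https://example.com/private/data.html"))
# Requested pause between requests, if the file specifies one
print(parser.crawl_delay("*"))
```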

In [40]:
# Solution
!wget --mirror --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://www.jarenhaber.com/

--2021-04-23 14:39:08--  https://www.gnu.org/software/wget/
Resolving www.gnu.org (www.gnu.org)... 209.51.188.148, 2001:470:142:3::a
Connecting to www.gnu.org (www.gnu.org)|209.51.188.148|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.gnu.org/software/wget/index.html’

www.gnu.org/softwar     [ <=>                ]  10.46K  --.-KB/s    in 0.03s   

Last-modified header missing -- time-stamps turned off.
2021-04-23 14:39:08 (363 KB/s) - ‘www.gnu.org/software/wget/index.html’ saved [10708]

Loading robots.txt; please ignore errors.
--2021-04-23 14:39:10--  https://www.gnu.org/robots.txt
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 1135 (1.1K) [text/plain]
Saving to: ‘www.gnu.org/robots.txt’


2021-04-23 14:39:10 (29.2 MB/s) - ‘www.gnu.org/robots.txt’ saved [1135/1135]

--2021-04-23 14:39:12--  https://www.gnu.org/mini.css
Reusing existing connection to www.gnu.org:

--2021-04-23 14:39:46--  https://www.gnu.org/software/wget/manual/wget.texi.tar.gz
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 68053 (66K) [application/x-gzip]
Saving to: ‘www.gnu.org/software/wget/manual/wget.texi.tar.gz’


2021-04-23 14:39:46 (1.13 MB/s) - ‘www.gnu.org/software/wget/manual/wget.texi.tar.gz’ saved [68053/68053]

--2021-04-23 14:39:48--  https://www.gnu.org/software/wget/manual/dir.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 404 Not Found
2021-04-23 14:39:48 ERROR 404: Not Found.

--2021-04-23 14:39:50--  https://www.gnu.org/software/gnulib/manual.css
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 2083 (2.0K) [text/css]
Saving to: ‘www.gnu.org/software/gnulib/manual.css’


2021-04-23 14:39:50 (47.9 MB/s) - ‘www.gnu.org/software/gnulib/manual.css’ saved [2083/2083]

--2021-04-23 14:39:52--  https://www.gnu.or

--2021-04-23 14:40:21--  https://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 15938 (16K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:21 (566 KB/s) - ‘www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html’ saved [15938/15938]

--2021-04-23 14:40:23--  https://www.gnu.org/software/wget/manual/html_node/Recursive-Accept_002fReject-Options.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 9306 (9.1K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Recursive-Accept_002fReject-Options.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:23 (64.5 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Recursiv

--2021-04-23 14:40:51--  https://www.gnu.org/software/wget/manual/html_node/Wgetrc-Location.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 3677 (3.6K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Location.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:51 (88.9 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Location.html’ saved [3677/3677]

--2021-04-23 14:40:53--  https://www.gnu.org/software/wget/manual/html_node/Wgetrc-Syntax.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 3731 (3.6K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Syntax.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:53 (86.0 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Syntax.html’ saved [3731/3731]

--2021-04-23 14:40:55--  https://www.gnu.org/softw

--2021-04-23 14:41:22--  https://www.gnu.org/software/wget/manual/html_node/Portability.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 4458 (4.4K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Portability.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:41:22 (114 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Portability.html’ saved [4458/4458]

--2021-04-23 14:41:24--  https://www.gnu.org/software/wget/manual/html_node/Signals.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 3420 (3.3K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Signals.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:41:24 (73.4 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Signals.html’ saved [3420/3420]

--2021-04-23 14:41:26--  https://www.gnu.org/software/wget/manual/html_node/Appen
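
A long `wget` log like the one above is hard to read by eye, and it is easy to miss failures such as the `404 Not Found` for `dir.html`. One way to get an overview is to save the log to a file and summarize it with a few lines of Python. The sketch below is only an illustration: the function name `summarize_wget_log` is made up, the sample text is a short excerpt copied from the output above, and the regular expressions assume the log format shown in this notebook.

```python
import re
from collections import Counter

# Short excerpt in the same format as the wget output shown above
SAMPLE_LOG = """\
--2021-04-23 14:39:46--  https://www.gnu.org/software/wget/manual/wget.texi.tar.gz
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
--2021-04-23 14:39:48--  https://www.gnu.org/software/wget/manual/dir.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 404 Not Found
"""

# A request line looks like: --<date> <time>--  <url>
REQUEST_RE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)$")
# A response line ends with the HTTP status code and reason
STATUS_RE = re.compile(r"awaiting response\.\.\. (\d{3})")

def summarize_wget_log(log_text):
    """Return (Counter of HTTP status codes, list of URLs with status >= 400)."""
    counts = Counter()
    failed_urls = []
    current_url = None
    for line in log_text.splitlines():
        request = REQUEST_RE.match(line)
        if request:
            current_url = request.group(1)  # Remember which URL is being fetched
            continue
        status = STATUS_RE.search(line)
        if status and current_url is not None:
            code = int(status.group(1))
            counts[code] += 1
            if code >= 400:
                failed_urls.append(current_url)
    return counts, failed_urls

counts, failed_urls = summarize_wget_log(SAMPLE_LOG)
print(dict(counts))
print(failed_urls)
```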

# Template code: See Google Places API in action<a id='Places'></a>

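For your reference, this is the code you would use to look up an organization's website URL with the Google Places API (through the `googleplaces` Python package). You need your own API key to run it.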

In [None]:
# Import packages
from googleplaces import GooglePlaces, types  # Google Places API: 'types' lets us define what kind of entity to look for (e.g., schools)

# Initialize Google Places API key
api_fp = 'define_me.txt' # Replace with API key filepath
with open(api_fp) as f:
    places_api_key = f.read().strip()  # Drop the trailing newline and any stray whitespace
google_places = GooglePlaces(places_api_key)

In [None]:
# See Google Places API in action
school_name = "River City Scholars Charter Academy"
school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

query_result = google_places.nearby_search(
    location=school_address, name=school_name,
    radius=15000, types=[types.TYPE_SCHOOL], rankby='distance') # Search for schools within 15,000 meters (15 km) of input location

for place in query_result.places:
    print(place.name)
    place.get_details()  # makes further API call
    print(place.details) # A dict matching the JSON response from Google.
    print(place.website)
    print(place.formatted_address)

# Are there any additional pages of results?
if query_result.has_next_page_token:
    query_result_next_page = google_places.nearby_search(
        pagetoken=query_result.next_page_token)

The results look like this (the long `place.details` dictionary is left out here):
```python
River City Scholars Charter Academy
http://rivercityscholars.org/
944 Evergreen St SE, Grand Rapids, MI 49507, USA
```
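
The pagination check in the cell above fetches at most one extra page of results. If you want to walk through every page, one option is a small generator that keeps following `next_page_token` until none is left. The sketch below is an assumption-laden illustration, not part of the `googleplaces` package: `iter_all_places` is a made-up helper, and `fake_search` is a stand-in that mimics only the three attributes used above (`places`, `has_next_page_token`, `next_page_token`) so the example runs without an API key.

```python
import time
from types import SimpleNamespace

def iter_all_places(search, max_pages=3, pause_seconds=2, **search_kwargs):
    """Yield places from each page of results, following page tokens.

    `search` is a callable such as `google_places.nearby_search`.
    """
    result = search(**search_kwargs)
    for page in range(max_pages):
        yield from result.places
        if page == max_pages - 1 or not result.has_next_page_token:
            break
        # Assumption: a new page token needs a short moment before it is usable
        time.sleep(pause_seconds)
        result = search(pagetoken=result.next_page_token)

# Hypothetical offline stand-in for the real API call, for demonstration only
def fake_search(pagetoken=None, **kwargs):
    if pagetoken is None:
        return SimpleNamespace(places=["School A", "School B"],
                               has_next_page_token=True,
                               next_page_token="page-2")
    return SimpleNamespace(places=["School C"],
                           has_next_page_token=False,
                           next_page_token=None)

names = list(iter_all_places(fake_search, pause_seconds=0, name="example"))
print(names)
```

With a real API key, you would pass `google_places.nearby_search` in place of `fake_search`, along with the same keyword arguments used in the cell above. The `max_pages` cap keeps the number of API calls (and any associated cost) bounded.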

In [None]:
# More robust code with a blacklist
blacklist = ['facebook.com', 'yelp.com']  # Hypothetical examples of third-party domains to skip; replace with your own list
blacklisted_num = 0  # Initialize blacklist match counter (outside the loop, so it accumulates across results)
good_url = None  # Stays None if no acceptable URL is found

query_result = google_places.nearby_search(
    location=school_address, name=school_name,
    radius=15000, types=[types.TYPE_SCHOOL], rankby='distance') # Search radius for Google Places API (in meters)

for place in query_result.places:
    place.get_details()  # Make further API call to get detailed info on this place

    found_name = place.name  # Compare this name in Places API to school's name on file
    found_address = place.formatted_address  # Compare this address in Places API to address on file

    url = place.website  # Grab school URL from Google Places API, if it's there
    if not url:
        continue  # No website listed for this place, so try the next result

    if any(domain in url for domain in blacklist):
        blacklisted_num += 1    # If this URL matches the blacklist, add 1 to counter and move on
        print("URL in Google Places API is a third-party domain. Moving on.")

    else:
        good_url = url
        print("Success! URL obtained from Google Places API with " + str(blacklisted_num) + " bad URLs avoided.")
        break # Exit for-loop after finding first good result

print(f'Quality URL: {good_url}') # Show valid URL of the Place discovered in Google Places API (None if nothing passed the blacklist)
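
One caveat about the `domain in url` check: it is a plain substring test, so it can flag a URL that merely contains a blacklisted name somewhere in its text. A possible refinement is to compare against the URL's hostname instead. The sketch below uses only the standard library; the blacklist entries and the helper names (`is_blacklisted`, `first_good_url`) are hypothetical and chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of third-party domains, for illustration only
BLACKLIST = {"facebook.com", "yelp.com", "greatschools.org"}

def is_blacklisted(url, blacklist=BLACKLIST):
    """True if the URL's hostname is a blacklisted domain or one of its subdomains.

    Assumes the URL includes a scheme (http:// or https://); without one,
    urlparse cannot find the hostname and the URL is treated as not blacklisted.
    """
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain)
               for domain in blacklist)

def first_good_url(urls, blacklist=BLACKLIST):
    """Return the first non-empty URL that is not blacklisted, or None."""
    for url in urls:
        if url and not is_blacklisted(url, blacklist):
            return url
    return None

candidates = [
    None,                                   # A place with no website listed
    "https://www.facebook.com/rivercity",   # Third-party page (made-up path)
    "http://rivercityscholars.org/",        # The school's own site
]
print(first_good_url(candidates))

# A substring test would wrongly reject this made-up hostname; the hostname test does not
print(is_blacklisted("http://notfacebook.com.example.org/"))
```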