# Introduction to web-scraping: Solutions


## Outline

* [Structured queries with APIs](#apis)
    * [Google Fact Check API](#factapi)
* [URL collection with automated Google search](#URLs)
   * [Scraping school URLs](#school_URLs)
   * [Scraping URLs using a blacklist](#blacklist)
* [Mirroring websites with `wget`](#wget)
   * [Features of `wget`](#wget_features)
* [Template code: See Google Places API in action](#Places)

**__________________________________**


# Structured queries with APIs<a id='apis'></a>

## Google Fact Check API<a id='factapi'></a>

In [1]:
# ONLY IF YOU HAVE API KEY: Get key from file
api_key_fp = '../extra/api_key.txt'
with open(api_key_fp) as keyfile:
    key = keyfile.read().strip()
    
# Import package for making web requests
import requests

In [2]:
# Define what to search for in API
query = 'infrastructure'

# Set backend URL for requesting data from API
search_url = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

# Make data request (first page of results only)
response = requests.get(
    url=search_url, 
    params=dict(
        key=key, 
        languageCode='en-US', 
        query=query)).json()

# Show the result
print(response)

{'claims': [{'text': '“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”', 'claimant': 'Republican National Committee', 'claimDate': '2021-04-20T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/21/faulty-math-claim-that-bidens-infrastructure-plan-costs-850000-per-job/', 'title': "Analysis | Faulty math: The claim that Biden's infrastructure plan ...", 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': 'Infrastructure only comprises of roads and bridges.', 'claimant': 'Leader McConnell', 'claimDate': '2021-04-03T02:33:00Z', 'claimReview': [{'publisher': {'site': 'misbar.com'}, 'url': 'https://misbar.com/en/factcheck/2021/04/03/infrastructure-is-more-than-roads-and-bridges', 'title': 'Infrastructure is More than Roads and Bridges | Misbar', 'reviewDate': '2021-04-03T02:33:00Z', 'textualRating': 'Fake', 'languageCode':

Notice that the claims in the query response only sometimes include `claimant` and `claimDate` (and a review may lack the publisher `name` or a `reviewDate`), but in the output above they _always_ include these keys: 
```
text, claimReview[publisher[site], url, title, textualRating]
```
Here `text` is the claim (often wrong), `publisher` is the fact-checking site (e.g., Snopes), and `textualRating` is the fact checker's evaluation of the claim (e.g., 'Mostly True', 'Pants on Fire', 'Two Pinocchios').
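The request above fetches only the first page of results. When more results exist, the response carries a `nextPageToken` value. Below is a minimal sketch of following that token, assuming the API accepts it back as a `pageToken` request parameter. The helper names `fetch_page` and `collect_claims` and the `max_pages` cap are illustrative choices, not part of the API.

```python
# Same endpoint as in the request above
SEARCH_URL = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def fetch_page(params):
    """Request one page of results and return the parsed JSON (hypothetical helper)."""
    import requests  # imported here so the paging logic below can run without it
    reply = requests.get(url=SEARCH_URL, params=params, timeout=30)
    reply.raise_for_status()  # stop early on HTTP errors (e.g., a bad API key)
    return reply.json()

def collect_claims(query, key, max_pages=3, fetch=fetch_page):
    """Gather claims across up to `max_pages` pages of results."""
    params = {'key': key, 'languageCode': 'en-US', 'query': query}
    claims = []
    for _ in range(max_pages):
        page = fetch(params)
        claims.extend(page.get('claims', []))
        token = page.get('nextPageToken')
        if not token:  # no token means this was the last page
            break
        # Assumption: the token is sent back as the `pageToken` parameter
        params = {**params, 'pageToken': token}
    return claims
```

With a valid key loaded as above, `collect_claims('infrastructure', key)` would return one flat list of claim dictionaries. Keeping `max_pages` small is a polite default while experimenting.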

### Challenge

Show the `claimant`, `text`, `claimDate`, and `textualRating` features--_when available_--for the first 10 claims in the API response. I've copied the raw response output below for you to play with.<br/>
_Hint:_ `claimReview` is a list data type. How do you need to index the list to access the dictionary within it?

In [3]:
response = {'claims': [{'text': '“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”', 'claimant': 'Republican National Committee', 'claimDate': '2021-04-20T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/21/faulty-math-claim-that-bidens-infrastructure-plan-costs-850000-per-job/', 'title': "Analysis | Faulty math: The claim that Biden's infrastructure plan ...", 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': 'Infrastructure only comprises of roads and bridges.', 'claimant': 'Leader McConnell', 'claimDate': '2021-04-03T02:33:00Z', 'claimReview': [{'publisher': {'site': 'misbar.com'}, 'url': 'https://misbar.com/en/factcheck/2021/04/03/infrastructure-is-more-than-roads-and-bridges', 'title': 'Infrastructure is More than Roads and Bridges | Misbar', 'reviewDate': '2021-04-03T02:33:00Z', 'textualRating': 'Fake', 'languageCode': 'en'}]}, {'text': '“These figures are what you would consider regular appropriations-plus. So it’s baseline-plus.”', 'claimant': 'Shelley Moore Capito', 'claimDate': '2021-04-22T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/26/apples-apples-senate-gop-infrastructure-proposal-is-smaller-than-it-appears/', 'title': 'Analysis | Apples to apples, the Senate GOP infrastructure proposal ...', 'textualRating': 'Correct', 'languageCode': 'en'}]}, {'text': 'Says Joe Biden’s infrastructure plan “is the Green New Deal.”', 'claimant': 'Citizens United', 'claimDate': '2021-03-31T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/', 'title': "Citizens United calls Biden's infrastructure plan the Green New Deal ...", 'textualRating': 'Mostly False', 'languageCode': 'en'}]}, {'text': 'There was no public infrastructure built during the Benigno “Noynoy” Aquino III administration.', 'claimReview': [{'publisher': {'name': 'Rappler', 'site': 'rappler.com'}, 'url': 'https://www.rappler.com/newsbreak/fact-check/no-infrastructure-built-under-noynoy-aquino', 'title': 'FALSE: No infrastructure built under Noynoy Aquino', 'reviewDate': '2021-03-30T06:19:28Z', 'textualRating': 'False', 'languageCode': 'en'}]}, {'text': '“Only about 6% of the president’s proposal actually goes" to infrastructure, meaning "water, wastewater ... highways, roads, bridges, perhaps broadband.”', 'claimant': 'John Thune', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'FactCheck.org', 'site': 'factcheck.org'}, 'url': 'https://www.factcheck.org/2021/04/underselling-the-infrastructure-in-infrastructure-plan/', 'title': 'Underselling the Infrastructure in Infrastructure Plan', 'textualRating': '6% is too low', 'languageCode': 'en'}]}, {'text': '“The proposed tax increases in the Biden administration’s infrastructure plan could lead to 1 million fewer jobs in the first two years.”', 'claimant': 'Roy Blunt', 'claimDate': '2021-04-13T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/14/pair-misleading-gop-attacks-bidens-infrastructure-plan/', 'title': "Analysis | A pair of misleading GOP attacks on Biden's infrastructure ...", 'reviewDate': '2021-04-14T12:49:17Z', 'textualRating': 'Mostly False', 'languageCode': 'en'}]}, {'text': '“This is a massive social welfare spending program combined with a massive tax increase on small-business job creators.”', 'claimant': 'Roger Wicker', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/14/pair-misleading-gop-attacks-bidens-infrastructure-plan/', 'title': "Analysis | A pair of misleading GOP attacks on Biden's infrastructure ...", 'reviewDate': '2021-04-14T12:49:17Z', 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': '“Something less than 6% ... of this proposal that President Biden has put forward is actually focused on infrastructure.”', 'claimant': 'Liz Cheney', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/13/liz-cheney/liz-cheneys-dubious-claim-just-6-biden-plan-infras/', 'title': "Liz Cheney's dubious claim that just 6% of Biden plan is ...", 'textualRating': 'Pants on Fire', 'languageCode': 'en'}]}, {'text': 'President Joe Biden’s infrastructure proposal “is fully paid for. Across 15 years, it would raise all of the revenue needed for these once-in-a-lifetime investments."', 'claimant': 'Pete Buttigieg', 'claimDate': '2021-04-04T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/07/pete-buttigieg/joe-bidens-infrastructure-bill-fully-paid/', 'title': "Is Joe Biden's infrastructure proposal fully paid for?", 'textualRating': 'Mostly True', 'languageCode': 'en'}]}], 'nextPageToken': 'CAo'}

In [4]:
# Solution

for claim in response['claims'][:10]:
    if 'claimant' in claim and 'claimDate' in claim:
        print(f"Claim by {claim['claimant']} on {claim['claimDate']}, rated {claim['claimReview'][0]['textualRating']}: \n{claim['text']}")
    else:
        print(f"Claim rated {claim['claimReview'][0]['textualRating']}: \n{claim['text']}")
    print()

Claim by Republican National Committee on 2021-04-20T00:00:00Z, rated Three Pinocchios: 
“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”

Claim by Leader McConnell on 2021-04-03T02:33:00Z, rated Fake: 
Infrastructure only comprises of roads and bridges.

Claim by Shelley Moore Capito on 2021-04-22T00:00:00Z, rated Correct: 
“These figures are what you would consider regular appropriations-plus. So it’s baseline-plus.”

Claim by Citizens United on 2021-03-31T00:00:00Z, rated Mostly False: 
Says Joe Biden’s infrastructure plan “is the Green New Deal.”

Claim rated False: 
There was no public infrastructure built during the Benigno “Noynoy” Aquino III administration.

Claim by John Thune on 2021-04-11T00:00:00Z, rated 6% is too low: 
“Only about 6% of the president’s proposal actually goes" to infrastructure, meaning "water, wastewater ... highways, roads, bridges, perhaps broadband.”

Claim by Roy Blunt on 2021-04-13T00:00:00Z, rated Mostly Fal
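The solution above shows `claimant` and `claimDate` only when both are present. As a variation, here is a minimal sketch that treats each optional key on its own by using `dict.get`, so a claim with just one of the two still shows what it has. The function name `summarize_claim` and the `'unrated'` fallback are illustrative assumptions. The sample claim is copied from the response above.

```python
def summarize_claim(claim):
    """Describe one claim, skipping any optional field the API left out."""
    # claimReview is a list; take its first review, or an empty dict if missing
    reviews = claim.get('claimReview') or [{}]
    rating = reviews[0].get('textualRating', 'unrated')  # fallback is an assumption

    details = []
    if claim.get('claimant'):
        details.append(f"by {claim['claimant']}")
    if claim.get('claimDate'):
        details.append(f"on {claim['claimDate']}")

    if details:
        header = "Claim " + " ".join(details) + f", rated {rating}:"
    else:
        header = f"Claim rated {rating}:"
    return header + "\n" + claim.get('text', '')

# One claim from the response above that has no claimant or claimDate
sample = {'text': 'There was no public infrastructure built during the Benigno “Noynoy” Aquino III administration.',
          'claimReview': [{'publisher': {'name': 'Rappler', 'site': 'rappler.com'},
                           'textualRating': 'False', 'languageCode': 'en'}]}
print(summarize_claim(sample))
```

To reproduce the challenge output, loop over `response['claims'][:10]` and print `summarize_claim(claim)` for each one.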

# URL collection with automated Google search<a id='URLs'></a>

## Scraping school URLs<a id='school_URLs'></a>

In [5]:
# Import automated Google search package
from googlesearch import search

# Define metadata for a single entity: a DC charter school
school_name = 'Capital City Public Charter School'
school_address = '100 Peabody Street NW, Washington, DC 20011'

# Search for first 10 Google results using joined metadata, show each one
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.ccpcs.org/
https://www.ccpcs.org/current-families/calendar
https://www.ccpcs.org/about/mission-and-history
https://www.ccpcs.org/admissions/applying-capital-city
https://www.ccpcs.org/program/el-education
https://www.ccpcs.org/about/our-staff
https://www.greatschools.org/washington-dc/washington/282-Capital-City-PCS---Lower-School/
https://www.facebook.com/CapitalCityPCS/
https://www.niche.com/k12/capital-city-public-charter-school-washington-dc/
https://www.niche.com/k12/capital-city-public-charter-school-middle-school-washington-dc/


### Challenge

Collect the first 10 results from Google for Dr. David C. Walker Intermediate School located at 6500 Ih 35 N Ste C, San Antonio, TX 78218. What do you notice about the results? How do they compare to the previous set of results?

In [6]:
# Solution

# Define metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'

# Automated search
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://www.excellence-sa.org/walker
https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80


These results are much less clear and organized: Each one points to a different site, and nearly all of them are third parties. In the run shown above, only the sixth result (https://www.excellence-sa.org/walker) is the school's own site. Interestingly, the [first result](https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/) (with domain of https://www.niche.com) does point to the [official website](https://excellence-sa.org/walker/), but extracting this information systematically would mean web-scraping, which we will get to tomorrow! 

## Scraping URLs using a blacklist<a id='blacklist'></a>

In [7]:
# Define blacklisted domains to filter out: third-party domains/false positives that we DON'T want to scrape 
blacklist = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com']

# Define search metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'
#school_name = "River City Scholars Charter Academy"
#school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid blacklisted domains
print("Successfully collected Google search results.")

# Initialize blacklist match counter: How many blacklisted domains has this search encountered?
blacklisted_num = 0 

# Loop through google search output to find first good result:
good_url = None # Stays None if every result is blacklisted (avoids a NameError below)
for url in urls:
    if any(domain in url for domain in blacklist):
        print(f'Bad site detected: {url}') 
        blacklisted_num += 1 # Add one to blacklist match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(blacklisted_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found
        
print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Success! URL obtained by Google search with 5 bad URLs avoided.
Quality URL: https://www.excellence-sa.org/walker


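What do you think of the "quality" URL we landed on? Google results change from run to run: in the output above, the first non-blacklisted URL is the school's own site, but an earlier run of this search landed on [this PDF](http://castro.tea.state.tx.us/charter_apps/content/downloads/Renewals/015806_2.pdf) on a Texas state government domain. When that happens, we need to expand our blacklist!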

### Challenge

Improve our automated searching to get the genuine URL of Dr. David C. Walker Intermediate School. <br/>
_Hint_: You could try (A) adding more domains to the blacklist OR (B) running a simple search that returns more URLs.

In [8]:
# Solution

# Option A: Define expanded blacklist
blacklist = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com', 
             'castro.tea.state.tx.us']

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid blacklisted domains
print("Successfully collected Google search results.")

# Initialize blacklist match counter
blacklisted_num = 0 

# Get first good search result:
good_url = None # Stays None if every result is blacklisted (avoids a NameError below)
for url in urls:
    if any(domain in url for domain in blacklist):
        print(f'Bad site detected: {url}') 
        blacklisted_num += 1 # Add one to blacklist match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(blacklisted_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found
        
print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Success! URL obtained by Google search with 5 bad URLs avoided.
Quality URL: https://www.excellence-sa.org/walker


In [9]:
# Option B: Expanded simple search
for url in search(school_name + ' ' + school_address, \
                  stop=20, pause=5.0, num=20): # Get first 20 results: stop at 20, and get 20 in first page of results
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://excellence-sa.org/walker/
https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80
http://castro.tea.state.tx.us/charter_apps/content/downloads/Renewals/015806_2.pdf
https://www.homesnap.com/schools/TX/San_Antonio/Dr_David_C_Walker_Intermediate_School
https://texas.hometownlocator.com/schools/profi
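One caveat about the `domain in url` test used in the blacklist loops above: it matches the domain text anywhere in the URL, so a good site whose path or query string happens to mention a blacklisted domain would be rejected. Below is a minimal sketch of a stricter check that compares against the URL's hostname only, using the standard library's `urllib.parse`. The helper names `is_blacklisted` and `first_good_url` are illustrative, and `example.org` is a made-up URL for the demonstration.

```python
from urllib.parse import urlparse

def is_blacklisted(url, blacklist):
    """True if the URL's hostname is a blacklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain)
               for domain in blacklist)

def first_good_url(urls, blacklist):
    """Return the first URL that is not blacklisted, or None if there is none."""
    for url in urls:
        if not is_blacklisted(url, blacklist):
            return url
    return None

blacklist = ['facebook.com', 'niche.com', 'har.com']
results = ['https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/',
           'https://www.har.com/school/015806106/dr-david-c-walker-elementary-school',
           'https://www.excellence-sa.org/walker']
print(first_good_url(results, blacklist))

# A substring test would wrongly reject this hypothetical URL; the hostname test keeps it
print(is_blacklisted('https://example.org/share?via=facebook.com', blacklist))
```

Returning `None` when every result is blacklisted lets the caller decide what to do next, for instance widening the search.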

# Mirroring websites with `wget`<a id='wget'></a>

In [10]:
!wget https://mccourt.georgetown.edu/research/mdi-news/

--2021-04-27 14:10:11--  https://mccourt.georgetown.edu/research/mdi-news/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8000::1, 2620:12a:8001::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148325 (145K) [text/html]
Saving to: ‘index.html’


2021-04-27 14:10:12 (4.25 MB/s) - ‘index.html’ saved [148325/148325]



## Features of `wget`<a id='wget_features'></a>


```shell
--page-requisites             Grabs all of the linked resources necessary to render the page (images, CSS, JavaScript, etc.)
--wait                        Pauses between downloads (in seconds)
--tries=3                     Tries each download up to 3 times
--user-agent=Mozilla          Makes wget look like a Mozilla browser by masking its user agent
--header="Accept:text/html"   Sends this header with each HTTP request, looks more browser-ish
--no-check-certificate        Doesn't check authenticity of website server (use only with trusted websites!)
--mirror                      Turns on options suited to mirroring a full website (recursion, time-stamping)
--recursive                   Recursively downloads files and follows links
--no-parent                   Does not follow links above hierarchical level of input URL
--convert-links               Turns links into local links as appropriate
--accept                      Downloads only file suffixes in this list (e.g., .html)
--execute robots=off          Turns off automatic robots.txt checking, so wget ignores the server's exclusion rules
--random-wait                 Randomizes the defined wait period to between .5 and 1.5x that value
--background                  For a huge download, puts the download in the background
--spider                      Checks whether the remote file exists without downloading it (mimics web spiders)
--domains                     Follows links only to the listed domains (comma-separated)
--user --password             Supplies credentials for downloading files from password-protected sites
```
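When the same `wget` options get reused across many sites, it can help to assemble the command in Python instead of retyping it. The sketch below builds an argument list from a few of the options listed above. The function name `build_wget_command` and its defaults are illustrative assumptions, not a standard interface.

```python
import shlex

def build_wget_command(url, wait=2, tries=3, recursive=False, accept=None):
    """Assemble a polite wget command as a list of arguments (hypothetical helper)."""
    command = ['wget', '--page-requisites',
               f'--wait={wait}', '--random-wait', f'--tries={tries}']
    if recursive:
        # Follow links below the input URL and rewrite them for local viewing
        command += ['--recursive', '--no-parent', '--convert-links']
    if accept:
        # Comma-separated list of file suffixes to keep
        command.append('--accept=' + ','.join(accept))
    command.append(url)
    return command

command = build_wget_command('https://mccourt.georgetown.edu/research/',
                             recursive=True, accept=['.html'])
print(shlex.join(command))  # Show the command as it would be typed in a shell
```

The printed string can be pasted after `!` in a notebook cell, or the list can be passed to `subprocess.run(command, check=True)` to run it from Python.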

In [11]:
!wget --page-requisites --wait=2 --tries=3 --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/mdi-news/

--2021-04-27 14:10:13--  https://mccourt.georgetown.edu/research/mdi-news/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8001::1, 2620:12a:8000::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148325 (145K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/mdi-news/index.html’


2021-04-27 14:10:13 (6.83 MB/s) - ‘mccourt.georgetown.edu/research/mdi-news/index.html’ saved [148325/148325]

Loading robots.txt; please ignore errors.
--2021-04-27 14:10:15--  https://mccourt.georgetown.edu/robots.txt
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 116 [text/plain]
Saving to: ‘mccourt.georgetown.edu/robots.txt’


2021-04-27 14:10:15 (2.01 MB/s) - ‘mccourt.georgetown.edu/robots.txt’ saved [116/116]

--2021-04-27 14:10:17--  https://mccourt.georgetown.edu/wp-content/plugins/gu-whnu-blocks/

Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 2372233 (2.3M) [image/png]
Saving to: ‘mccourt.georgetown.edu/wp-content/uploads/2020/05/us-capitol-building-dc-1600x900.png’


2021-04-27 14:10:42 (54.7 MB/s) - ‘mccourt.georgetown.edu/wp-content/uploads/2020/05/us-capitol-building-dc-1600x900.png’ saved [2372233/2372233]

--2021-04-27 14:10:44--  https://mccourt.georgetown.edu/wp-content/uploads/2019/10/mdi.1-1600x900.jpg
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 359987 (352K) [image/jpeg]
Saving to: ‘mccourt.georgetown.edu/wp-content/uploads/2019/10/mdi.1-1600x900.jpg’


2021-04-27 14:10:45 (104 MB/s) - ‘mccourt.georgetown.edu/wp-content/uploads/2019/10/mdi.1-1600x900.jpg’ saved [359987/359987]

--2021-04-27 14:10:47--  https://mccourt.georgetown.edu/wp-content/uploads/2019/12/20191017-Protocol_0472-1-1600x900.jpg
Reusing existing connection to mccou

--2021-04-27 14:11:11--  https://mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/icon-swiper-arrow-right.svg
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 462 [image/svg+xml]
Saving to: ‘mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/icon-swiper-arrow-right.svg’


2021-04-27 14:11:11 (12.4 MB/s) - ‘mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/icon-swiper-arrow-right.svg’ saved [462/462]

--2021-04-27 14:11:13--  https://mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/icon-chapters-white.svg
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 945 [image/svg+xml]
Saving to: ‘mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/icon-chapters-white.svg’


2021-04-27 14:11:13 (26.1 MB/s)

--2021-04-27 14:11:37--  https://mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/btn-play-carousel.svg
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 397 [image/svg+xml]
Saving to: ‘mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/btn-play-carousel.svg’


2021-04-27 14:11:37 (11.6 MB/s) - ‘mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/btn-play-carousel.svg’ saved [397/397]

--2021-04-27 14:11:39--  https://mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/icon-arrow-left.svg
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 464 [image/svg+xml]
Saving to: ‘mccourt.georgetown.edu/wp-content/themes/georgetown/pattern_lab/source/images/icons/icon-arrow-left.svg’


2021-04-27 14:11:39 (13.6 MB/s) - ‘mccourt.georgetown.edu

### Challenge

Download only `.html` files from https://mccourt.georgetown.edu/research/ and the pages linked beneath it.

In [12]:
# Solution
!wget --accept .html --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/

--2021-04-27 14:11:51--  https://mccourt.georgetown.edu/research/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8000::1, 2620:12a:8001::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157616 (154K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/index.html’


2021-04-27 14:11:52 (7.32 MB/s) - ‘mccourt.georgetown.edu/research/index.html’ saved [157616/157616]

Loading robots.txt; please ignore errors.
--2021-04-27 14:11:54--  https://mccourt.georgetown.edu/robots.txt
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 116 [text/plain]
Saving to: ‘mccourt.georgetown.edu/robots.txt.tmp’


2021-04-27 14:11:54 (3.19 MB/s) - ‘mccourt.georgetown.edu/robots.txt.tmp’ saved [116/116]

--2021-04-27 14:11:56--  https://mccourt.georgetown.edu/research/featured-publications/
Reusing existing conn

HTTP request sent, awaiting response... 404 Not Found
2021-04-27 14:12:29 ERROR 404: Not Found.

--2021-04-27 14:12:31--  https://mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 143331 (140K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/index.html’


2021-04-27 14:12:32 (8.55 MB/s) - ‘mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/index.html’ saved [143331/143331]

--2021-04-27 14:12:34--  https://mccourt.georgetown.edu/research/the-massive-data-institute/shape-the-policy-conversation/
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 142753 (139K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/the-massive-data-institute/shape-the-policy-conversation/index.html’


2021-0

### Challenge

Use advanced options for `wget` (listed above) to mirror a website you use often. Be sure to use a polite `--wait` and avoid downloading anything with massive numbers of links, files, or pages (e.g., don't try YouTube.com or Wikipedia.org). If you want to download a segment or specific page within a website (e.g., a single YouTube channel or Wikipedia page), use the `--recursive` option with `--no-parent` (to follow only links within the input URL).

While you let `wget` run, read more about it in its [manual](https://www.gnu.org/software/wget/manual/wget.html) and see other examples of `wget` usage [here](https://gist.github.com/bueckl/bd0a1e7a30bc8e2eeefd) and [here](https://phoenixnap.com/kb/wget-command-with-examples). 

In [13]:
# Solution
!wget --mirror --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://www.jarenhaber.com/

--2021-04-27 14:12:55--  https://www.jarenhaber.com/
Resolving www.jarenhaber.com (www.jarenhaber.com)... 52.73.153.209, 54.205.240.192, 2604:a880:400:d0::1756:9001, ...
Connecting to www.jarenhaber.com (www.jarenhaber.com)|52.73.153.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.jarenhaber.com/index.html’

www.jarenhaber.com/     [ <=>                ]  21.41K  --.-KB/s    in 0.03s   

Last-modified header missing -- time-stamps turned off.
2021-04-27 14:12:55 (730 KB/s) - ‘www.jarenhaber.com/index.html’ saved [21928]

Loading robots.txt; please ignore errors.
--2021-04-27 14:12:57--  https://www.jarenhaber.com/robots.txt
Reusing existing connection to www.jarenhaber.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-04-27 14:12:57 ERROR 404: Not Found.

--2021-04-27 14:12:59--  https://www.jarenhaber.com/css/academic.min.ccb935070d24cc8b4be2d0c581d1f687.css
Reusing existing connection to www.jaren

--2021-04-27 14:13:30--  https://www.jarenhaber.com/talk/sorting-schools/featured_hu4a598c5d9d41a801e0d33745df71852e_394678_150x0_resize_lanczos_2.png
Reusing existing connection to www.jarenhaber.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 14322 (14K) [image/png]
Saving to: ‘www.jarenhaber.com/talk/sorting-schools/featured_hu4a598c5d9d41a801e0d33745df71852e_394678_150x0_resize_lanczos_2.png’


Last-modified header missing -- time-stamps turned off.
2021-04-27 14:13:30 (551 KB/s) - ‘www.jarenhaber.com/talk/sorting-schools/featured_hu4a598c5d9d41a801e0d33745df71852e_394678_150x0_resize_lanczos_2.png’ saved [14322/14322]

--2021-04-27 14:13:32--  https://www.jarenhaber.com/talk/dictionaries/
Reusing existing connection to www.jarenhaber.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.jarenhaber.com/talk/dictionaries/index.html’

www.jarenhaber.com/     [ <=>                ]  17.99K  --.-KB/s    in 0.03s   

Last

HTTP request sent, awaiting response... 404 Not Found
2021-04-27 14:14:03 ERROR 404: Not Found.

--2021-04-27 14:14:05--  https://www.jarenhaber.com/talk/sorting-schools/featured_hu4a598c5d9d41a801e0d33745df71852e_394678_800x0_resize_lanczos_2.png
Reusing existing connection to www.jarenhaber.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 181862 (178K) [image/png]
Saving to: ‘www.jarenhaber.com/talk/sorting-schools/featured_hu4a598c5d9d41a801e0d33745df71852e_394678_800x0_resize_lanczos_2.png’


Last-modified header missing -- time-stamps turned off.
2021-04-27 14:14:05 (1.47 MB/s) - ‘www.jarenhaber.com/talk/sorting-schools/featured_hu4a598c5d9d41a801e0d33745df71852e_394678_800x0_resize_lanczos_2.png’ saved [181862/181862]

--2021-04-27 14:14:07--  https://www.jarenhaber.com/talk/sorting-schools/featured_hu4a598c5d9d41a801e0d33745df71852e_394678_680x500_fill_q90_lanczos_right_2.png
Reusing existing connection to www.jarenhaber.com:443.
HTTP request sent, awaiting respon