# Introduction to web-scraping

It's 2021. The web is everywhere.

* If you want to buy a house, real estate agents have [websites](https://www.wendytlouie.com/) where they list the houses they're currently selling. 
* If you want to know whether to wear a rain jacket or shorts, you check the weather on a [website](https://weather.com/weather/tenday/l/Berkeley+CA+USCA0087:1:US). 
* If you want to know what's happening in the world, you read the news [online](https://www.sfchronicle.com/). 
* If you've forgotten which city is the capital of Australia, you check [Wikipedia](https://en.wikipedia.org/wiki/Australia).

**The point is this: there is an enormous amount of information (also known as data) on the web.**

If we (in our capacities as, for example, data scientists, social scientists, digital humanists, businesses, public servants or members of the public) can get our hands on this information, **we can answer all sorts of interesting questions or solve important problems**.

* Maybe you're studying gender bias in student evaluations of professors. One option would be to scrape ratings from [Rate My Professors](https://www.ratemyprofessors.com/) (provided you follow their [terms of service](https://www.ratemyprofessors.com/TermsOfUse_us.jsp#use)).
* Perhaps you want to build an app that shows users articles relating to their specified interests. You could scrape stories from various news websites and then use NLP methods to decide which articles to show which users.
* [Geoff Boeing](https://geoffboeing.com/) and [Paul Waddell](https://ced.berkeley.edu/ced/faculty-staff/paul-waddell) recently published [a great study](https://arxiv.org/pdf/1605.05397.pdf) of the US housing market by scraping millions of Craigslist rental listings. Among other insights, their study shows which metropolitan areas in the US are more or less affordable to renters.

This first day's workshop is a one-hour beginner's introduction to web scraping. 


## Learning Goals
*   

## Outline

* [Structured queries with APIs](#apis)
* [Domain collection with automated google search](#domain)
* [Mirroring websites with `wget`](#wget)

## Background

We will do some review, but this notebook assumes you have basic familiarity with Python. If you need a beginner's introduction to coding in Python, please walk through the intro to Python notebook at `extra/intro-to-python.ipynb` and/or [this one](https://github.com/lknelson/text-analysis-course/blob/master/scripts/01.25.02_PythonBasics.ipynb) *before* the workshop. 

## Vocabulary

* *domain*: 
    * The address of information on the web and directions to get there. Strictly speaking, the domain is the host-name part of a URL (Uniform Resource Locator), such as `weather.com`; the full URL points to a specific resource--usually the files needed to show a web page, but sometimes other files such as PDFs or data. 
* *web-scraping* (i.e., *screen-scraping*):
    * Extracting structured information from the files that make up websites (i.e., what's shown in web browsers), relying on their HTML, CSS, and sometimes JS files. 
* *Hyper-Text Markup Language (HTML)*: 
    * The standard markup language for websites, the "nuts and bolts" of WHAT a website will display, including text.
* *Cascading Style Sheets (CSS)*: 
    * A technology used to format the layout of a webpage, i.e. HOW to make it pretty. Not usually relevant for web-scraping.
* *web-crawling*:
    * Finding web pages through links, automated search, etc. Once discovered, pages can be checked (is this website still up?), downloaded, or scraped. 
* *website mirroring*:
    * Creating a complete local copy of the files needed to display and host a website. 
* *Application Programming Interface (API)*:
    * A tool used to access structured data provided by an organization. Examples include Twitter, Reddit, Wikipedia, and the New York Times. When an API is available (not always the case), this is usually the preferred way to access data (over web-scraping).
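
To make the vocabulary concrete, here is a minimal sketch of web-scraping: pulling structured items out of raw HTML. It uses only Python's built-in `html.parser` so it runs without extra installs (the rest of this notebook uses BeautifulSoup instead). The HTML snippet, the `listing` class, and the `ListingParser` name are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content, invented for this example
SAMPLE_HTML = """
<html>
  <head><link rel="stylesheet" href="style.css"></head>
  <body>
    <h1>Houses for sale</h1>
    <ul>
      <li class="listing">12 Oak St: $500,000</li>
      <li class="listing">34 Elm Ave: $650,000</li>
    </ul>
  </body>
</html>
"""

class ListingParser(HTMLParser):
    """Collect the text inside every <li class="listing"> element."""

    def __init__(self):
        super().__init__()
        self.in_listing = False
        self.listings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "listing") in attrs:
            self.in_listing = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_listing = False

    def handle_data(self, data):
        if self.in_listing and data.strip():
            self.listings.append(data.strip())

parser = ListingParser()
parser.feed(SAMPLE_HTML)
print(parser.listings)
```

Notice that the HTML tags and the `class` attribute do the work here; the stylesheet referenced in the `<head>` plays no role, which is why CSS rarely matters for scraping.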

**__________________________________**


# Structured queries with APIs<a id='apis'></a>

As an example, let's try out the [Google Fact Check API](https://developers.google.com/fact-check/tools/api/), which can be easily explored [in a browser](https://toolbox.google.com/factcheck/explorer). By searching this Google service, the Fact Identifier collects fact checks relevant to the queries entered by the user (or built in by default, as in the current version).
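
Before the full script, it may help to see what a single API request looks like. The sketch below only builds the request URL (it does not contact Google), using the same endpoint and parameter names as the script that follows. The `build_search_url` helper and the `YOUR_API_KEY` placeholder are assumptions made for this example.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint and path as used in the script below
ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def build_search_url(query, site, key, page_token=None):
    """Build a claims:search URL; page_token is only sent when we have one."""
    params = {"query": query, "reviewPublisherSiteFilter": site, "key": key}
    if page_token:
        params["pageToken"] = page_token
    return ENDPOINT + "?" + urlencode(params)  # urlencode escapes spaces etc.

url = build_search_url("rigged election", "politifact.com", "YOUR_API_KEY")
print(url)

# Decoding the query string recovers the original parameters
print(parse_qs(urlparse(url).query))
```

Passing a URL like this to `requests.get()` with a valid key is what the loop below does for every query and publisher site.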

In [None]:
# Import libraries

import csv
from tqdm import tqdm
import requests # for downloading
from bs4 import BeautifulSoup, NavigableString, Tag # for html scraping
import regex as re # Regex module with Unicode support
import html5lib # slower but more accurate bs4 parser for messy HTML # lxml faster
import urllib.parse # for urlencode (a bare `import urllib` does not reliably load the parse submodule)
import json

# Import functions to scrape fact check web pages
from scrape_helpers import load_api_key, clean_text, scrape_politifact, scrape_factcheck, scrape_snopes

In [None]:
######################################################
# Call API
######################################################

# Elements in query response: text, claimDate, claimReview[publisher[name], url, textualRating]
# Columns in output CSV: date (DD-MM-YYYY), claim, truth rating, url, source (publisher), fact, explanation

if __name__ == '__main__':
    page_token = 0
    domains = ['covid', 'blm', 'election']
    query_sets = [
        ["masks", "Chinese bioweapon", "China virus"],
        ["George Floyd", "Antifa", "Black Lives Matter"],
        ["Hunter Biden", "rigged election", "mail-in", "election ballots"],
    ]

    api_key_fp = "api_key.txt"
    key = load_api_key(api_key_fp)
    endpoint = 'https://factchecktools.googleapis.com'
    search = '/v1alpha1/claims:search'

    sites = ['politifact.com', 'factcheck.org', 'snopes.com']
    site_scrapers = [scrape_politifact, scrape_factcheck, scrape_snopes]
    site_switches = ['politifact', 'factcheck.org', 'snopes']


    for i in range(0, len(domains)):
        domain = domains[i]
        queries = query_sets[i]
        for query in queries:
            claims = [] # initialize list of claims for this query (otherwise earlier queries' claims get re-scraped and re-saved)
            urls = set() # initialize set of fact check URLs already seen for this query

            for site in tqdm(sites, desc='Collecting data for {} via API'.format(query)):
                params = {
                    'pageToken': page_token,
                    'query': query,
                    'reviewPublisherSiteFilter': site,
                    'key': key
                }

                nextToken = True
                while nextToken:
                    url = endpoint + search + '?' + urllib.parse.urlencode(params)
                    response = requests.get(url)
                    data = response.json()

                    if 'claims' in data:
                        for claim in data['claims']:
                            if not site == 'snopes.com':
                                claims.append([claim['claimDate'],
                                               claim['text'],
                                               claim['claimReview'][0]['textualRating'],
                                               claim['claimReview'][0]['url'],
                                               claim['claimReview'][0]['publisher']['name']])
                            else:
                                claims.append([claim['claimReview'][0]['reviewDate'],
                                               claim['text'],
                                               claim['claimReview'][0]['textualRating'],
                                               claim['claimReview'][0]['url'],
                                               claim['claimReview'][0]['publisher']['name']
                                              .replace('.com', '')])

                    if 'nextPageToken' in data:
                        params['pageToken'] = data['nextPageToken']
                    else:
                        nextToken = False

            for j in tqdm(range(0, len(claims)), desc='Scraping websites for {}'.format(query)):
                claim = claims[j]
                switch = site_switches.index(claim[4].lower()) # use fact to get publisher site then index (site name is 5th element)
                scraper = site_scrapers[switch] # get scraper using index
                claim.extend(scraper(claim[3])) # scrape URL using scraper (URL is 4th element), add to existing claim info
                claims[j] = claim # record fact in list

            # Duplicate fact check URLs are filtered out below, when rows are written to the CSV

            # Save output for this query
            query_string = query.replace(' ', '-')
            with open('data/fact_checker_data_{}.csv'.format(query_string), 'w', newline='') as f:
                csv_writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_MINIMAL)
                csv_writer.writerow(['date', 'claim', 'truth_rating', 'url', 'source', 'fact', 'explanation'])
                for claim in claims: # Each claim gets its own row
                    if claim[3] not in urls: # don't add if fact check URL already seen
                        csv_writer.writerow(claim) # save row
                        urls.add(claim[3]) # add to set of urls already saved

            print('Saved {} claims for {} query.'.format(str(len(urls)), query))
            print()
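
The `while nextToken` loop above follows a common API pattern: keep requesting pages until the response no longer includes a `nextPageToken`. Here is a stripped-down sketch of just that pattern. To keep it runnable offline, `fetch_page` and `FAKE_PAGES` are hypothetical stand-ins for `requests.get(url).json()`; only the `claims` and `nextPageToken` field names come from the script above.

```python
# Hypothetical paged responses, shaped like the fields used in the script above
FAKE_PAGES = {
    None: {"claims": [{"text": "claim A"}, {"text": "claim B"}], "nextPageToken": "p2"},
    "p2": {"claims": [{"text": "claim C"}], "nextPageToken": "p3"},
    "p3": {"claims": [{"text": "claim D"}]},  # last page: no nextPageToken
}

def fetch_page(page_token=None):
    """Stand-in for one API call; returns one page of results as a dict."""
    return FAKE_PAGES[page_token]

def collect_all_claims(fetch):
    """Request pages until the response stops offering a next-page token."""
    claims = []
    token = None  # no token on the first request
    while True:
        data = fetch(token)
        claims.extend(data.get("claims", []))  # a page may have no claims
        token = data.get("nextPageToken")
        if not token:
            break
    return claims

all_claims = collect_all_claims(fetch_page)
print([claim["text"] for claim in all_claims])
```

Starting with no token at all (rather than a placeholder value) keeps the first request identical to a plain, unpaged search.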

# Domain collection with automated google search<a id='domain'></a>

This script uses two related functions to scrape the best URL from online sources: 
> The Google Places API. See the [GitHub page](https://github.com/slimkrazy/python-google-places) for the Python wrapper and sample code, [Google Web Services](https://developers.google.com/places/web-service/) for general documentation, and [here](https://developers.google.com/places/web-service/details) for details on Place Details requests.

> The Google Search function (manually filtered). See [here](https://pypi.python.org/pypi/google) for source code and [here](http://pythonhosted.org/google/) for documentation.

To get an API key for the Google Places API (or Knowledge Graph API), go to the [Google API Console](http://code.google.com/apis/console).
To upgrade your quota limits, sign up for billing--it's free and raises your daily request quota from 1K to 150K (!!).

The code below doesn't use Google's Knowledge Graph (KG) Search API because it turns out NOT to reveal websites related to search results (despite these being displayed in the KG cards visible at right in a standard Google search). The KG API is only useful for scraping KG id, description, name, and other basic or irrelevant info. To see examples of how the KG API constructs a search URL, etc., see [here](http://searchengineland.com/cool-tricks-hack-googles-knowledge-graph-results-featuring-donald-trump-268231).

Possibly useful note on debugging: An issue causing the GooglePlaces package to unnecessarily give a "ValueError" and stop was resolved in [July 2017](https://github.com/slimkrazy/python-google-places/issues/59). <br>
Other instances of this error may occur if Google Places API cannot identify a location as given. Dealing with this is a matter of proper Exception handling (which seems to be working fine below).

In [None]:
!pip install google # For automated Google searching 
!pip install https://github.com/slimkrazy/python-google-places/zipball/master # Google Places API

## Define helper functions

In [None]:
def dicts_to_csv(list_of_dicts, file_name, header):
    '''This helper function writes a list of dictionaries to a csv called file_name, with column names decided by 'header'.'''
    
    with open(file_name, 'w', newline='') as output_file:  # newline='' avoids blank rows on Windows
        print("Saving to " + str(file_name) + " ...")
        dict_writer = csv.DictWriter(output_file, header)
        dict_writer.writeheader()
        dict_writer.writerows(list_of_dicts)

In [None]:
def count_left(list_of_dicts, varname):
    '''This helper function determines how many dicts in list_of_dicts don't have a valid key/value pair with key varname.'''
    
    count = 0
    for school in list_of_dicts:
        if school.get(varname) in ("", None):  # missing key, empty string, or None
            count += 1

    print(str(count) + " schools in this data are missing " + str(varname) + "s.")

# Note: call count_left(sample, 'URL') only after `sample` has been read in (see "Read in data" below)

## Initialize Python search environment

In [None]:
# IMPORTING KEY PACKAGES
from googlesearch import search # automated Google Search package
from googleplaces import GooglePlaces, types, lang  # Google Places API

import csv, re, os  # Standard packages
import pandas as pd  # for working with csv files
import urllib, requests  # for scraping
from tqdm import tqdm # for progress tracking in for loops

In [None]:
# Initializing Google Places API search functionality
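with open("../data/places_api_key.txt") as key_file:
    places_api_key = key_file.read().strip()  # drop the trailing newline
# Avoid printing the key itself: notebook outputs are easy to share by accident
print("Loaded Places API key of length " + str(len(places_api_key)))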

google_places = GooglePlaces(places_api_key)

In [None]:
# Here's a list of sites we DON'T want to spider, 
# but that an automated Google search might return...
# and we might thus accidentally spider unless we filter them out (as below)!

bad_sites = []
with open('../data/bad_sites.csv', 'r', encoding = 'utf-8') as csvfile:
    for row in csvfile:
        bad_sites.append(re.sub('\n', '', row))

print(bad_sites)
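
To show how a list like `bad_sites` gets used, here is a small sketch of the filtering step: walk through search results in order, skip anything that matches a bad site, and keep the first URL that survives. The site list, the result URLs, and the `first_good_url` helper are made up for illustration. Note one deliberate difference from the function defined later in this notebook: this sketch matches against the host name only (via `urlparse`) rather than the whole URL, which is one possible way to avoid false matches in the path.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration; the real list is read from bad_sites.csv
example_bad_sites = ["trulia.com", "greatschools.org", "mapquest"]
example_results = [
    "https://www.greatschools.org/michigan/grand-rapids/some-school/",
    "https://www.trulia.com/schools/MI-grand_rapids/",
    "https://www.example-charter-school.org/about",
    "https://www.mapquest.com/us/michigan/some-school",
]

def first_good_url(urls, bad_sites):
    """Return (number of bad URLs skipped, first URL whose host matches no bad site)."""
    skipped = 0
    for url in urls:
        host = urlparse(url).netloc.lower()  # e.g. 'www.trulia.com'
        if any(bad in host for bad in bad_sites):
            skipped += 1  # bad site: count it and move on
        else:
            return skipped, url
    return skipped, ""  # nothing usable found

skipped, good_url = first_good_url(example_results, example_bad_sites)
print(skipped, good_url)
```

The count of skipped results is worth keeping: a "good" URL that only appears after many bad ones is more likely to be wrong and deserves a manual check.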

In [None]:
# See the Google Places API wrapper at work!
school_name = "River City Scholars Charter Academy"
address = "944 Evergreen Street, Grand Rapids, MI 49507"

query_result = google_places.nearby_search(
        location=address, name=school_name,
        radius=15000, types=[types.TYPE_SCHOOL], rankby='distance')

for place in query_result.places:
    print(place.name)
    place.get_details()  # makes further API call
    #print(place.details) # A dict matching the JSON response from Google.
    print(place.website)
    print(place.formatted_address)

# Are there any additional pages of results?
if query_result.has_next_page_token:
    query_result_next_page = google_places.nearby_search(
            pagetoken=query_result.next_page_token)

In [None]:
# Example of using the google search function:
for url in search('DR DAVID C WALKER INT 6500 IH 35 N STE C, SAN ANTONIO, TX 78218', \
                  stop=5, pause=5.0):
    print(url)

## Read in data

In [None]:
sample = []  # make empty list in which to store the dictionaries

if os.path.exists('../data/filtered_schools.csv'):  # first, check if file containing search results is available on disk
    file_path = '../data/filtered_schools.csv'
else:  # use original data if no existing results are available on disk
    file_path = '../../data_management/data/charters_unscraped_noURL_2015.csv'

with open(file_path, 'r', encoding = 'utf-8') as csvfile: # open file                      
    print('Reading in ' + str(file_path) + ' ...')
    reader = csv.DictReader(csvfile)  # create a reader
    for row in reader:  # loop through rows
        sample.append(row)  # append each row to the list

print("\nColumns in data: ")
print(list(sample[0]))
sample = sample[0:5]
sample

In [None]:
# Create new "URL" and "QUERY_RANKING" variables for each school, without overwriting any with data there already:
for school in sample:
    school.setdefault("URL", "")  # only adds the key if it is missing
    school.setdefault("QUERY_RANKING", "")

In [None]:
#### Take a look at one entry's contents (index 1, i.e. the second school) and the variables list in our sample (a list of dictionaries)
print(sample[1]["SCH_NAME"], "\n", sample[1]["ADDRESSES"], "\n", sample[1]["NCESSCH"], "\n")
print(sample[1].keys())

## Getting URLs

In [None]:
def getURL(school_name, address, bad_sites_list): # manual_url
    
    '''This function finds the one best URL for a school using two methods:
    
    1. If a school with this name can be found within 15 km (to account for proximal relocations) in
    the Google Maps database (using the Google Places API), AND
    if this school has a website on record, then this website is returned.
    If no school is found, the school discovered has missing data in Google's database (latitude/longitude, 
    address, etc.), or the address on record is unreadable, this passes to method #2. 
    
    2. An automated Google search using the school's name + address. This is an essential backup plan to 
    Google Places API, because sometimes the address on record (courtesy of Dept. of Ed. and our tax dollars) is not 
    in Google's database. For example, look at: "3520 Central Pkwy Ste 143 Mezz, Cincinnati, OH 45223". 
    No wonder Google Maps can't find this. How could it intelligibly interpret "Mezz"?
    
    Whether using the first or second method, this function excludes URLs with any of the 62 bad_sites defined above, 
    e.g. trulia.com, greatschools.org, mapquest. It returns the number of excluded URLs (from either method) 
    and the first non-bad URL discovered.'''
    
    
    ## INITIALIZE
    
    new_urls = []    # start with empty list
    good_url = ""    # output goes here
    k = 0    # initialize counter for number of URLs skipped
    
    radsearch = 15000  # define radius of Google Places API search, in meters (15 km)
    numgoo = 20  # define number of google results to collect for method #2
    wait_time = 20.0  # define length of pause between Google searches (longer is better for big catches like this)
    
    search_terms = school_name + " " + address
    print("Getting URL for " + school_name + ", " + address + "...")    # show school name & address
    
    
    
    ## FIRST URL-SCRAPE ATTEMPT: GOOGLE PLACES API
    # Search for nearest school with this name within radsearch meters of this address
    
    try:
        query_result = google_places.nearby_search(
            location=address, name=school_name,
            radius=radsearch, types=[types.TYPE_SCHOOL], rankby='distance')
        
        for place in query_result.places:
            place.get_details()  # Make further API call to get detailed info on this place

            found_name = place.name  # Compare this name in Places API to school's name on file
            found_address = place.formatted_address  # Compare this address in Places API to address on file

            try: 
                url = place.website  # Grab school URL from Google Places API, if it's there

                if any(domain in url for domain in bad_sites_list):
                    k+=1    # If this url is in bad_sites_list, add 1 to counter and move on
                    #print("  URL in Google Places API is a bad site. Moving on.")

                else:
                    good_url = url
                    print("    Success! URL obtained from Google Places API with " + str(k) + " bad URLs avoided.")
                    
                    '''
                    # For testing/ debugging purposes:
                    
                    print("  VALIDITY CHECK: Is the discovered URL of " + good_url + \
                          " consistent with the known URL of " + manual_url + " ?")
                    print("  Also, is the discovered name + address of " + found_name + " " + found_address + \
                          " consistent with the known name/address of: " + search_terms + " ?")
                    
                    if manual_url != "":
                        if manual_url == good_url:
                            print("    Awesome! The known and discovered URLs are the SAME!")
                    '''
                            
                    return(k, good_url)  # Returns valid URL of the Place discovered in Google Places API
        
            except Exception:  # No URL in the Google database? Then try next API result or move on to Google searching.
                print("  Error collecting URL from Google Places API. Moving on.")
    
    except Exception:  # unlike a bare except, this still lets Ctrl-C interrupt a long run
        print("  Google Places API search failed. Moving on to Google search.")
    
    

    ## SECOND URL-SCRAPE ATTEMPT: FILTERED GOOGLE SEARCH
    # Automate Google search and take first result that doesn't have a bad_sites_list element in it.
    
    
    # Loop through google search output to find first good result:
    try:
        new_urls = list(search(search_terms, stop=numgoo, pause=wait_time))  # Grab first numgoo Google results (URLs)
        print("  Successfully collected Google search results.")
        
        for url in new_urls:
            if any(domain in url for domain in bad_sites_list):
                k+=1    # If this url is in bad_sites_list, add 1 to counter and move on
                #print("  Bad site detected. Moving on.")
            else:
                good_url = url
                print("    Success! URL obtained by Google search with " + str(k) + " bad URLs avoided.")
                break    # Exit for loop after first good url is found
                
    
    except Exception:
        print("  Problem with collecting Google search results. Try this by hand instead.")
            
        
    '''
    # For testing/ debugging purposes:
    
    if k>2:  # Print warning messages depending on number of bad sites preceding good_url
        print("  WARNING!! CHECK THIS URL!: " + good_url + \
              "\n" + str(k) + " bad Google results have been omitted.")
    if k>1:
        print(str(k) + " bad Google results have been omitted. Check this URL!")
    elif k>0:
        print(str(k) + " bad Google result has been omitted. Check this URL!")
    else: 
        print("  No bad sites detected. Reliable URL!")
    
    if manual_url != "":
        if manual_url == good_url:
            print("    Awesome! The known and discovered URLs are the SAME!")
    '''
    
    if good_url == "":
        print("  WARNING! No good URL found via API or google search.\n")
    
    return(k + 1, good_url)

In [None]:
numschools = 0  # initialize scraping counter

keys = sample[0].keys()  # define keys for writing function
fname = "../data/final_schools.csv"  # define file name for writing function

In [None]:
for school in sample[:2]:
    print(school["URL"])

In [None]:
# Now to call the above function and actually scrape these things!
for school in tqdm(sample): # loop through list of schools (sample)
    if not school["URL"]:  # if URL is missing (empty string or None), fill that in by scraping
        numschools += 1
        school["QUERY_RANKING"], school["URL"] = getURL(school["SCH_NAME"], school["ADDRESSES"], bad_sites) # school["MANUAL_URL"]
    # If URL exists, don't bother scraping it again

print("\n\nURL search attempted for " + str(numschools) + " schools.")

In summer 2017, the above approach worked to get a good URL for 6,677 out of the 6,752 schools in this data set. Not bad! <br>
For some reason, the Google search algorithm (method #2) is less likely to work after passing from the Google Places API. <br>
To fill in for the remaining 75, let's skip the function's layers of code and just call the google search function by hand.

In [None]:
for school in sample:
    school["SEARCH"] = school["SCH_NAME"] + " " + school["ADDRESSES"]
    if school["URL"] == "":
        k = 0  # initialize counter for number of URLs skipped
        school["QUERY_RANKING"] = ""

        
        print("Scraping URL for " + school["SEARCH"] + "...")
        urls_list = list(search(school["SEARCH"], stop=20, pause=10.0))
        print("  URLs list collected successfully!")

        for url in urls_list:
            if any(domain in url for domain in bad_sites):
                k+=1    # If this url is in bad_sites_list, add 1 to counter and move on
                # print("  Bad site detected. Moving on.")
            else:
                good_url = url
                print("    Success! URL obtained by Google search with " + str(k) + " bad URLs avoided.")

                school["URL"] = good_url
                school["QUERY_RANKING"] = k + 1
                
                count_left(sample, 'URL')
                dicts_to_csv(sample, fname, keys)
                print()
                break    # Exit for loop after first good url is found                               
                                           
    else:
        pass

count_left(sample, 'URL')
dicts_to_csv(sample, fname, keys)

In [None]:
# Save sample to file (can continue to load and add to it):
count_left(sample, 'URL')
dicts_to_csv(sample, fname, keys)

In [None]:
# CHECK OUT RESULTS
# TO DO: Make a histogram of 'QUERY_RANKING'
# systematic way to look at problem URLs (with k > 0)?

f = 0
for school in sample:
    ranking = school.get('QUERY_RANKING')  # set during the URL search; "" if no search was run
    if ranking not in ("", None) and int(ranking) > 14:
        print(school["SEARCH"], "\n", school["URL"], "\n")
        f += 1

print(str(f))

# Mirroring websites with `wget`<a id='wget'></a>

`wget` is classic (circa 1996, but still updated) [free software](https://www.gnu.org/philosophy/free-sw) for non-interactively downloading web content from the shell. It's often used for basic one-time downloads, as `curl` is in the shell or `urllib.request.urlretrieve` is within Python. But where `wget` really shines is in its extensive customization, including retrying failed connections, following links, and duplicating a remote website's files and structure to the point of having an identical local copy (website mirroring). 

Let's try using the nice Python wrapper for `wget` to download the MDI News page nested in the McCourt School of Public Policy site:

In [33]:
import wget 
wget.download(url='https://mccourt.georgetown.edu/research/mdi-news/')

'download.wget'

We can check out the contents of this (rather poorly named) file using the Jupyter interface in the previous tab. 

We got some HTML--cool! But what if we want something clickable and interactive? This is easiest to do with `wget` run via its native shell, rather than this simple Python wrapper--which also doesn't allow for `wget`'s more advanced functionality. We can use the helpful `!` prefix to run shell commands straight from this notebook. 

Let's make a new `wget` request to download a version of the same page that's easier to see in your browser. 

In [41]:
!wget https://mccourt.georgetown.edu/research/mdi-news/

--2021-04-23 14:43:11--  https://mccourt.georgetown.edu/research/mdi-news/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8001::1, 2620:12a:8000::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157499 (154K) [text/html]
Saving to: ‘index.html.1’


2021-04-23 14:43:12 (8.80 MB/s) - ‘index.html.1’ saved [157499/157499]



Use your Jupyter browser to check out the results: just click on the downloaded file in your current folder (probably this is `day-1/`) to view the page. The file is named `index.html`, or `index.html.1` if an `index.html` was already there (as in the output above). What do you notice? How does it compare to viewing https://mccourt.georgetown.edu/research/mdi-news/ in your browser? Try clicking the links. Where can you go on the actual page that your local copy can't show you? Do you have local copies of the images?

You might have noticed that we only ended up with some HTML--we didn't download any of the files associated with the webpage. So, this isn't a true copy; we couldn't host the page ourselves, analyze its images, or easily use its content for purposes other than viewing. How do we mirror the full site?

To do this, we need only the `--page-requisites` option, which makes sure to download all the resources needed to render the page in a browser: that means CSS, JavaScript, image files, etc. To keep from overloading the server, let's pause for a few seconds in between downloads using the `--wait` option. 

Let's use some other features as well for politeness and subtlety (i.e. to avoid getting blocked). Here is an explanation of each option:

```shell
--page-requisites             Grabs all of the linked resources necessary to render the page (images, CSS, JavaScript, etc.)
--wait                        Pauses between downloads (in seconds)
--tries=3                     Retries failed downloads 3 times
--user-agent=Mozilla          Makes wget look like a Mozilla browser by masking its user agent
--header="Accept:text/html"   Sends header with each HTML request, looks more browser-ish
--no-check-certificate        Doesn't check authenticity of website server (use only with trusted websites!)
```

In [38]:
!wget --page-requisites --wait=2 --tries=3 --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/mdi-news/

--2021-04-23 14:16:52--  https://mccourt.georgetown.edu/research/mdi-news/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8001::1, 2620:12a:8000::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157499 (154K) [text/html]
Saving to: ‘index.html’


2021-04-23 14:16:52 (9.30 MB/s) - ‘index.html’ saved [157499/157499]



Check out the results--what's similar and what's different? See `/research/mdi-news/` for the `index.html` (sometimes this is `default.html`) page we saw earlier. 
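
What counts as a "page requisite"? As a rough illustration (not `wget`'s actual logic), the sketch below scans a page's HTML for the files a browser would need to render it: stylesheets, scripts, and images. Ordinary links to other pages (`<a href=...>`) are not requisites, so they are ignored. The HTML snippet, the `example.org` URLs, and the `RequisiteFinder` class are all invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page, invented for this example
SAMPLE_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="/static/site.css">
    <script src="https://cdn.example.org/app.js"></script>
  </head>
  <body>
    <img src="img/logo.png" alt="logo">
    <a href="/research/other-page/">Another page</a>
  </body>
</html>
"""

class RequisiteFinder(HTMLParser):
    """Collect absolute URLs of stylesheets, scripts, and images on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.requisites = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            target = attrs["src"]
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            target = attrs["href"]
        else:
            return  # e.g. <a> links lead to other pages, not to requisites
        # Resolve relative paths against the page's own URL
        self.requisites.append(urljoin(self.base_url, target))

finder = RequisiteFinder("https://example.org/research/news/")
finder.feed(SAMPLE_HTML)
for resource in finder.requisites:
    print(resource)
```

This is why the second download creates a folder structure: each requisite is saved under a path that mirrors its URL.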

`wget` has a rich array of options. Here are some of the most useful ones in addition to those above:

```shell
--mirror                      Downloads a full website and makes it available for local viewing
--recursive                   Recursively downloads files and follows links
--no-parent                   Does not follow links above hierarchical level of input URL
--convert-links               Turns links into local links as appropriate
--accept                      Downloads only file suffixes in this list (e.g., .html)
--execute robots=off          Turns off automatic robots.txt checking, so wget ignores the server's crawler exclusions
--random-wait                 Randomizes the defined wait period to between 0.5 and 1.5x that value
--background                  For a huge download, puts the download in the background
--spider                      Checks whether the remote file exists without downloading it (mimics web spiders)
--domains                     Follows links only to the listed domains (comma-separated)
--user --password             Downloads files from password-protected sites
```
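
Since `--execute robots=off` switches off `wget`'s `robots.txt` check, it helps to know what that check does. The sketch below uses Python's built-in `urllib.robotparser` on a made-up `robots.txt` (no network access needed) to ask whether a crawler may fetch two URLs and how long it is asked to wait between requests. The file contents, the `example.org` URLs, and the `workshop-bot` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(EXAMPLE_ROBOTS_TXT.splitlines())  # parse from text instead of downloading

# 'workshop-bot' is a made-up crawler name; the '*' rules apply to it
allowed = rp.can_fetch("workshop-bot", "https://example.org/research/")
blocked = rp.can_fetch("workshop-bot", "https://example.org/private/data.html")
delay = rp.crawl_delay("workshop-bot")  # seconds requested between downloads

print("May fetch /research/:", allowed)
print("May fetch /private/data.html:", blocked)
print("Requested delay (seconds):", delay)
```

For a real site, `rp.set_url(...)` followed by `rp.read()` would download the live `robots.txt` instead; a requested crawl delay is a sensible value to pass to `--wait`.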

### Challenge

Download only `.html` files from https://mccourt.georgetown.edu/research/ and links below that.

In [42]:
# Solution
!wget --accept .html --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/

--2021-04-23 15:03:17--  https://mccourt.georgetown.edu/research/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8000::1, 2620:12a:8001::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157616 (154K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/index.html’


2021-04-23 15:03:18 (9.33 MB/s) - ‘mccourt.georgetown.edu/research/index.html’ saved [157616/157616]

Loading robots.txt; please ignore errors.
--2021-04-23 15:03:20--  https://mccourt.georgetown.edu/robots.txt
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 116 [text/plain]
Saving to: ‘mccourt.georgetown.edu/robots.txt.tmp’


2021-04-23 15:03:20 (3.37 MB/s) - ‘mccourt.georgetown.edu/robots.txt.tmp’ saved [116/116]

--2021-04-23 15:03:22--  https://mccourt.georgetown.edu/research/featured-publications/
Reusing existing conn

HTTP request sent, awaiting response... 404 Not Found
2021-04-23 15:03:55 ERROR 404: Not Found.

--2021-04-23 15:03:57--  https://mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 143331 (140K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/index.html’


2021-04-23 15:03:58 (8.29 MB/s) - ‘mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/index.html’ saved [143331/143331]

--2021-04-23 15:04:00--  https://mccourt.georgetown.edu/research/the-massive-data-institute/shape-the-policy-conversation/
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 142753 (139K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/the-massive-data-institute/shape-the-policy-conversation/index.html’


2021-0

Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/resources/dp-resources/index.html... 23-8
Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/resources/index.html... 34-8
Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/mdi-conferences-and-panels/index.html... 32-8
Converting links in mccourt.georgetown.edu/research/mccourt-centers/index.html... 24-8
Converted links in 24 files in 0.06 seconds.
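
Long `wget` logs like the one above are easier to digest as a summary. The following is a minimal sketch, assuming the log uses the same line formats and curly quotes shown above (other locales may print plain quotes instead). The function name `summarize_wget_log` is hypothetical, and the sample text is a handful of lines copied from the output above.

```python
import re

# A few lines copied from the wget output above
sample_log = """\
--2021-04-23 15:03:17--  https://mccourt.georgetown.edu/research/
HTTP request sent, awaiting response... 200 OK
2021-04-23 15:03:18 (9.33 MB/s) - ‘mccourt.georgetown.edu/research/index.html’ saved [157616/157616]
HTTP request sent, awaiting response... 404 Not Found
2021-04-23 15:03:55 ERROR 404: Not Found.
2021-04-23 15:03:58 (8.29 MB/s) - ‘mccourt.georgetown.edu/research/the-massive-data-institute/mdi-research/index.html’ saved [143331/143331]
"""


def summarize_wget_log(log_text):
    """Return the saved file paths and the number of HTTP errors in a wget log."""
    # Saved files appear as: ‘path’ saved [bytes/bytes]
    saved = re.findall(r"‘([^’]+)’ saved \[", log_text)
    # Failed requests appear as: ERROR 404: Not Found.
    errors = re.findall(r"ERROR (\d{3}):", log_text)
    return saved, len(errors)


saved_files, n_errors = summarize_wget_log(sample_log)
print(f"{len(saved_files)} files saved, {n_errors} errors")
```

For a real run, `wget`'s `--output-file` option writes the log to a file, which could then be read and passed to the same function.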


### Challenge

Use the advanced `wget` options listed above to mirror a website you use often. Be sure to set a polite `--wait`, and avoid sites with massive numbers of links, files, or pages (e.g., don't try YouTube.com or Wikipedia.org). To download only one section or page of a website (e.g., a single YouTube channel or Wikipedia page), combine `--recursive` with `--no-parent`, which keeps `wget` from following links above the starting URL's directory.
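
To build intuition for what `--no-parent` allows, here is a rough Python sketch of the rule. It is a simplified illustration rather than `wget`'s actual implementation; the function name `within_start_directory` is hypothetical, and the `/admissions/` URL is a made-up example of a link outside the starting directory.

```python
from urllib.parse import urlparse


def within_start_directory(start_url, link):
    """Roughly mimic the --no-parent rule: keep links at or below the start directory.

    Simplified illustration, not wget's actual implementation.
    """
    start, target = urlparse(start_url), urlparse(link)
    # The start directory is everything up to and including the last "/"
    start_dir = start.path.rsplit("/", 1)[0] + "/"
    return target.netloc == start.netloc and target.path.startswith(start_dir)


start = "https://mccourt.georgetown.edu/research/"

# Below the starting directory, so it would be followed
inside = within_start_directory(start, "https://mccourt.georgetown.edu/research/mccourt-centers/")

# Hypothetical link above the starting directory, so it would be skipped
outside = within_start_directory(start, "https://mccourt.georgetown.edu/admissions/")

print(inside, outside)
```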

While `wget` runs, read more about it in its [manual](https://www.gnu.org/software/wget/manual/wget.html) and see other examples of `wget` usage [here](https://gist.github.com/bueckl/bd0a1e7a30bc8e2eeefd) and [here](https://phoenixnap.com/kb/wget-command-with-examples).

In [40]:
# Solution
!wget --mirror --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://www.gnu.org/software/wget/

--2021-04-23 14:39:08--  https://www.gnu.org/software/wget/
Resolving www.gnu.org (www.gnu.org)... 209.51.188.148, 2001:470:142:3::a
Connecting to www.gnu.org (www.gnu.org)|209.51.188.148|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.gnu.org/software/wget/index.html’

www.gnu.org/softwar     [ <=>                ]  10.46K  --.-KB/s    in 0.03s   

Last-modified header missing -- time-stamps turned off.
2021-04-23 14:39:08 (363 KB/s) - ‘www.gnu.org/software/wget/index.html’ saved [10708]

Loading robots.txt; please ignore errors.
--2021-04-23 14:39:10--  https://www.gnu.org/robots.txt
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 1135 (1.1K) [text/plain]
Saving to: ‘www.gnu.org/robots.txt’


2021-04-23 14:39:10 (29.2 MB/s) - ‘www.gnu.org/robots.txt’ saved [1135/1135]

--2021-04-23 14:39:12--  https://www.gnu.org/mini.css
Reusing existing connection to www.gnu.org:

--2021-04-23 14:39:46--  https://www.gnu.org/software/wget/manual/wget.texi.tar.gz
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 68053 (66K) [application/x-gzip]
Saving to: ‘www.gnu.org/software/wget/manual/wget.texi.tar.gz’


2021-04-23 14:39:46 (1.13 MB/s) - ‘www.gnu.org/software/wget/manual/wget.texi.tar.gz’ saved [68053/68053]

--2021-04-23 14:39:48--  https://www.gnu.org/software/wget/manual/dir.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 404 Not Found
2021-04-23 14:39:48 ERROR 404: Not Found.

--2021-04-23 14:39:50--  https://www.gnu.org/software/gnulib/manual.css
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 2083 (2.0K) [text/css]
Saving to: ‘www.gnu.org/software/gnulib/manual.css’


2021-04-23 14:39:50 (47.9 MB/s) - ‘www.gnu.org/software/gnulib/manual.css’ saved [2083/2083]

--2021-04-23 14:39:52--  https://www.gnu.or

--2021-04-23 14:40:21--  https://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 15938 (16K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:21 (566 KB/s) - ‘www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html’ saved [15938/15938]

--2021-04-23 14:40:23--  https://www.gnu.org/software/wget/manual/html_node/Recursive-Accept_002fReject-Options.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 9306 (9.1K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Recursive-Accept_002fReject-Options.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:23 (64.5 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Recursiv

--2021-04-23 14:40:51--  https://www.gnu.org/software/wget/manual/html_node/Wgetrc-Location.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 3677 (3.6K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Location.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:51 (88.9 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Location.html’ saved [3677/3677]

--2021-04-23 14:40:53--  https://www.gnu.org/software/wget/manual/html_node/Wgetrc-Syntax.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 3731 (3.6K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Syntax.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:40:53 (86.0 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Wgetrc-Syntax.html’ saved [3731/3731]

--2021-04-23 14:40:55--  https://www.gnu.org/softw

--2021-04-23 14:41:22--  https://www.gnu.org/software/wget/manual/html_node/Portability.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 4458 (4.4K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Portability.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:41:22 (114 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Portability.html’ saved [4458/4458]

--2021-04-23 14:41:24--  https://www.gnu.org/software/wget/manual/html_node/Signals.html
Reusing existing connection to www.gnu.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 3420 (3.3K) [text/html]
Saving to: ‘www.gnu.org/software/wget/manual/html_node/Signals.html’


Last-modified header missing -- time-stamps turned off.
2021-04-23 14:41:24 (73.4 MB/s) - ‘www.gnu.org/software/wget/manual/html_node/Signals.html’ saved [3420/3420]

--2021-04-23 14:41:26--  https://www.gnu.org/software/wget/manual/html_node/Appen