# Introduction to web-scraping

It's 2021. The web is everywhere.

* If you want to learn about the different types of vaccinations for COVID-19, you check [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_vaccine#Vaccine_types).
* If you want to find a low-income apartment or public housing, you have access to a [treasure trove](https://affordablehousingonline.com/public-housing-waiting-lists) of listings, waitlists, and data. 
* If you want to buy, sell, or rent something, or get a ride, or meet people, you check [craigslist](https://washingtondc.craigslist.org/).  
* If you want to know whether to wear a rain jacket or shorts, you check the weather on a [website](https://www.wunderground.com/forecast/us/dc/washington). 
* If you want to know what's happening in the world, you read the news [online](https://www.nytimes.com/). 
* If you want to find a charter school in your community, you check [Google Maps](https://www.google.com/maps/search/charter+school/@38.8837025,-77.0199357,14398m/data=!3m1!1e3).

**The point is this: there is an enormous amount of information (also known as data) on the web.**

If we--as data scientists, social scientists, digital humanists, businesses, public servants or members of the public--can get our hands on this information, **we can answer all sorts of interesting questions or solve important problems**.

* Maybe you're studying gender bias in student evaluations of professors. One option would be to scrape ratings from [Rate My Professors](https://www.ratemyprofessors.com/) (provided you follow their [terms of service](https://www.ratemyprofessors.com/TermsOfUse_us.jsp#use)).
* Perhaps you want to build an app that shows users articles relating to their specified interests. You could scrape stories from various news websites and then use NLP methods to decide which articles to show which users.
* [Geoff Boeing](https://geoffboeing.com/) and [Paul Waddell](https://ced.berkeley.edu/ced/faculty-staff/paul-waddell) recently published [a great study](https://arxiv.org/pdf/1605.05397.pdf) of the US housing market by scraping millions of Craigslist rental listings. Among other insights, their study shows which metropolitan areas in the US are more or less affordable to renters.

Our goal is to get you started down the path to exciting research like this in the digital era.

This first day's workshop is a ~45 min beginner's introduction to web scraping. 


## Outline

* [Structured queries with APIs](#apis)
    * [Google Fact Check API](#factapi)
* [URL collection with automated Google search](#URLs)
   * [Scraping school URLs](#school_URLs)
   * [Scraping URLs using a blacklist](#blacklist)
* [Mirroring websites with `wget`](#wget)
   * [Features of `wget`](#wget_features)
* [Template code: See Google Places API in action](#Places)

## Background

This workshop assumes you have basic familiarity with Python. If you need a beginner's introduction to coding in Python, please walk through the intro to Python notebook at `extra/intro-to-python.ipynb` and/or [this one](https://github.com/lknelson/text-analysis-course/blob/master/scripts/01.25.02_PythonBasics.ipynb) *before* the workshop. 

## Vocabulary

* *Uniform Resource Locator (URL)*: 
    * The address of a resource on the web, along with directions for getting there. A URL usually points to the files needed to show a website, but it can also point to other files such as PDFs or images.
* *Domain name*:
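    * The human-readable name that identifies a website within a URL: for instance, in https://www.example.com/ this is `www.example.com` (where `example.com` is the registered domain and `www` is a subdomain). It comes after the `https://` scheme and before the path.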
* *web-scraping* (i.e., *screen-scraping*):
    * Extracting structured information from the files that make up websites (i.e., what's shown in web browsers), relying on their HTML, CSS, and sometimes JS files. 
* *Hyper-Text Markup Language (HTML)*: 
    * The standard markup language for websites, the "nuts and bolts" of WHAT a website will display, including text.
* *Cascading Style Sheets (CSS)*: 
    * A technology used to format the layout of a webpage, i.e. HOW to make it pretty. Not usually relevant for web-scraping.
* *web-crawling*:
    * Finding web pages through links, automated search, etc. Once discovered, pages can be checked (is this website still up?), downloaded, or scraped. 
* *website mirroring*:
    * Creating a complete local copy of the files needed to display and host a website. 
* *Application Programming Interface (API)*:
    * A tool used to access structured data provided by an organization. Examples include Twitter, Reddit, Wikipedia, and the New York Times. When an API is available (not always the case), this is usually the preferred way to access data (over web-scraping).
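
To make the URL vocabulary concrete, here is a minimal sketch that splits a URL into its parts with Python's standard `urllib.parse` module. The URL itself is made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only to illustrate the pieces
url = "https://www.example.com/research/news/index.html?page=2"
parts = urlparse(url)

print(parts.scheme)  # the protocol, here "https"
print(parts.netloc)  # the host, here "www.example.com"
print(parts.path)    # where the resource lives on that host
print(parts.query)   # optional parameters after the "?"
```

The host is the piece we will lean on later when comparing search results across sites.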

**__________________________________**


# Structured queries with APIs<a id='apis'></a>

APIs are used to access data collections in a structured and efficient way. APIs are offered by many organizations like the New York Times, YouTube, Twitter, eBay, and WordPress. Some are [totally free and open-access](https://apilist.fun/), while most require registration and may even charge you. [Many are public](https://www.computersciencezone.org/50-most-useful-apis-for-developers/), but it's also common for data science teams to use these in-house to share data between programs (e.g., [RESTful APIs](https://restfulapi.net/) are used to access data on servers). 

When a public API is available, it is usually a more reliable way to collect web data than web-scraping. Just as importantly, the organization providing the API surely prefers that you browse their websites via browser (this makes sense, given the name) and that you have your applications programmatically interface with their online data via their Application Programming Interface or API (this also makes sense, given the name). 

As just one example, [Wikipedia offers an API](https://www.mediawiki.org/wiki/REST_API) to access their pages. There's even a [Python package](https://pypi.org/project/wikipedia/) that wraps around this API to make it even easier to use. Wikipedia also makes all of its content [available for direct download](https://dumps.wikimedia.org/). With all these offerings, you have little reason to scrape Wikipedia--and they may not like it if you did.

Another great example is the [large family of Google APIs](https://developers.google.com/apis-explorer/): this includes Google Maps, Speech-to-Text, Translate, and even their Machine Learning Engine. Google APIs are a good example of a non-free service, but they do offer free credit to folks using their tools for education/research. In general--and especially to avoid excessive account charges, I can tell you from experience--check what exact requests you're making before making API calls at scale. Also, when using any API, it's important to read their Terms of Service. 
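
One cheap way to check what you are about to request is a dry run: build every request URL first, look them over, and count them before sending anything. Below is a minimal sketch using only the standard library. The endpoint, parameter names, and the `preview_request` helper are all hypothetical stand-ins, not part of any real API.

```python
from urllib.parse import urlencode

def preview_request(base_url, params):
    """Return the full URL a GET request would hit, without sending anything."""
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint and parameters, for illustration only
base_url = "https://api.example.com/v1/search"
queries = ["infrastructure", "vaccines", "housing"]

planned = [preview_request(base_url, {"query": q, "pageSize": 10}) for q in queries]
for planned_url in planned:
    print(planned_url)
print(f"{len(planned)} requests planned")
```

If the count or the URLs look off, you find out before any quota or credit is spent.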

## Google Fact Check API<a id='factapi'></a>

Let's see the [Google Fact Check API](https://developers.google.com/fact-check/tools/api/) in action. This service aggregates claim reviews by fact checking websites (e.g., [Snopes](https://www.snopes.com/), [PolitiFact](https://www.politifact.com/)) and can be easily perused [in a browser](https://toolbox.google.com/factcheck/explorer). The API documentation [describes the search parameters and lets you try different ones out](https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims/search), and also tells you [what fields make up the response object](https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims#Claim).

If you want to use this or any Google API yourself, you'll need to apply for an API key on [the Google Cloud Console](https://console.cloud.google.com/), but approval can take a few days. Since I assume you don't have an API key at the moment, you won't make live requests for now--but you can follow along with the code and play with the output yourself. 

Let's find claim reviews related to the _infrastructure_ bill currently being considered by the U.S. Congress. To do this, we will make an HTTP request, something we will cover more tomorrow.

In [1]:
# ONLY IF YOU HAVE API KEY: Get key from file
api_key_fp = '../extra/api_key.txt'
with open(api_key_fp) as keyfile:
    key = keyfile.read().strip()
    
# Import package for making web requests
import requests

In [2]:
# Define what to search for in API
query = 'infrastructure'

# Set backend URL for requesting data from API
search_url = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

# Make data request (first page of results only)
response = requests.get(
    url=search_url, 
    params=dict(
        key=key, 
        languageCode='en-US', 
        query=query)).json()

# Show the result
print(response)

{'claims': [{'text': '“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”', 'claimant': 'Republican National Committee', 'claimDate': '2021-04-20T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/21/faulty-math-claim-that-bidens-infrastructure-plan-costs-850000-per-job/', 'title': "Analysis | Faulty math: The claim that Biden's infrastructure plan ...", 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': 'Infrastructure only comprises of roads and bridges.', 'claimant': 'Leader McConnell', 'claimDate': '2021-04-03T02:33:00Z', 'claimReview': [{'publisher': {'site': 'misbar.com'}, 'url': 'https://misbar.com/en/factcheck/2021/04/03/infrastructure-is-more-than-roads-and-bridges', 'title': 'Infrastructure is More than Roads and Bridges | Misbar', 'reviewDate': '2021-04-03T02:33:00Z', 'textualRating': 'Fake', 'languageCode':

Notice that the claim reviews in the query response only sometimes include `claimDate`, `claimant`, `reviewDate`, and the publisher's `name`, but in this response they _always_ include these keys: 
```
text, claimReview[publisher[site], url, title, textualRating, languageCode]
```
Where `text` is the claim (often wrong), `publisher` is the fact checking site (e.g., Snopes), and `textualRating` is the fact checker's evaluation of the claim (e.g., 'Mostly True', 'Pants on Fire', 'Two Pinocchios').
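
The request above asks for the first page of results only. A response with more results carries a `nextPageToken`, and the API reference lists a `pageToken` request parameter for fetching the following page. Here is a minimal sketch of that loop. It assumes a hypothetical `fetch_page` function that you supply, and it runs here against two made-up pages instead of the live API.

```python
def collect_claims(fetch_page, max_pages=3):
    """Gather claims across pages by following nextPageToken.

    fetch_page is any function that takes a page token (None for the
    first page) and returns the parsed JSON response as a dict.
    """
    claims, token = [], None
    for _ in range(max_pages):  # hard cap so we never request pages endlessly
        page = fetch_page(token)
        claims.extend(page.get('claims', []))
        token = page.get('nextPageToken')
        if not token:  # no token means no more pages
            break
    return claims

# Stand-in for the live API: two fake pages keyed by page token
fake_pages = {
    None: {'claims': [{'text': 'claim A'}, {'text': 'claim B'}], 'nextPageToken': 'CAo'},
    'CAo': {'claims': [{'text': 'claim C'}]},
}
all_claims = collect_claims(lambda token: fake_pages[token])
print(len(all_claims))
```

With an API key, `fetch_page` could wrap the same `requests.get` call as above with `pageToken=token` added to `params`. The `max_pages` cap is a design choice to keep the number of billable requests predictable.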

## Challenge

Show the `claimant`, `text`, `claimDate`, and `textualRating` features--_when available_--for the first 10 claims in the API response. I've copied the raw response output below for you to play with.<br/>
_Hint:_ `claimReview` is a list data type. How do you need to call the list to access the dictionary within it?

In [None]:
response = {'claims': [{'text': '“Each job created in Biden’s ‘infrastructure plan’ will cost the American people $850,000.”', 'claimant': 'Republican National Committee', 'claimDate': '2021-04-20T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/21/faulty-math-claim-that-bidens-infrastructure-plan-costs-850000-per-job/', 'title': "Analysis | Faulty math: The claim that Biden's infrastructure plan ...", 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': 'Infrastructure only comprises of roads and bridges.', 'claimant': 'Leader McConnell', 'claimDate': '2021-04-03T02:33:00Z', 'claimReview': [{'publisher': {'site': 'misbar.com'}, 'url': 'https://misbar.com/en/factcheck/2021/04/03/infrastructure-is-more-than-roads-and-bridges', 'title': 'Infrastructure is More than Roads and Bridges | Misbar', 'reviewDate': '2021-04-03T02:33:00Z', 'textualRating': 'Fake', 'languageCode': 'en'}]}, {'text': '“These figures are what you would consider regular appropriations-plus. \nSo it’s baseline-plus.”', 'claimant': 'Shelley Moore Capito', 'claimDate': '2021-04-22T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/26/apples-apples-senate-gop-infrastructure-proposal-is-smaller-than-it-appears/', 'title': 'Analysis | Apples to apples, the Senate GOP infrastructure proposal ...', 'textualRating': 'Correct', 'languageCode': 'en'}]}, {'text': 'Says Joe Biden’s infrastructure plan “is the Green New Deal.”', 'claimant': 'Citizens United', 'claimDate': '2021-03-31T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/', 'title': "Citizens United calls Biden's infrastructure plan the Green New Deal ...", 'textualRating': 'Mostly False', 'languageCode': 'en'}]}, {'text': 'There was no public infrastructure built during the Benigno “Noynoy” Aquino III administration.', 'claimReview': [{'publisher': {'name': 'Rappler', 'site': 'rappler.com'}, 'url': 'https://www.rappler.com/newsbreak/fact-check/no-infrastructure-built-under-noynoy-aquino', 'title': 'FALSE: No infrastructure built under Noynoy Aquino', 'reviewDate': '2021-03-30T06:19:28Z', 'textualRating': 'False', 'languageCode': 'en'}]}, {'text': '“Only about 6% of the president’s proposal actually goes" to infrastructure, meaning "water, wastewater ... \nhighways, roads, bridges, perhaps broadband.”', 'claimant': 'John Thune', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'FactCheck.org', 'site': 'factcheck.org'}, 'url': 'https://www.factcheck.org/2021/04/underselling-the-infrastructure-in-infrastructure-plan/', 'title': 'Underselling the Infrastructure in Infrastructure Plan', 'textualRating': '6% is too low', 'languageCode': 'en'}]}, {'text': '“The proposed tax increases in the Biden administration’s infrastructure plan could lead to 1 million fewer jobs in the first two years.”', 'claimant': 'Roy Blunt', 'claimDate': '2021-04-13T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/14/pair-misleading-gop-attacks-bidens-infrastructure-plan/', 'title': "Analysis | A pair of misleading GOP attacks on Biden's infrastructure ...", 'reviewDate': '2021-04-14T12:49:17Z', 'textualRating': 'Mostly False', 'languageCode': 'en'}]}, {'text': '“This is a massive social welfare spending program combined with a massive tax increase on small-business job creators.”', 'claimant': 'Roger Wicker', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'The Washington Post', 'site': 'washingtonpost.com'}, 'url': 'https://www.washingtonpost.com/politics/2021/04/14/pair-misleading-gop-attacks-bidens-infrastructure-plan/', 'title': "Analysis | A pair of misleading GOP attacks on Biden's infrastructure ...", 'reviewDate': '2021-04-14T12:49:17Z', 'textualRating': 'Three Pinocchios', 'languageCode': 'en'}]}, {'text': '“Something less than 6% ... \nof this proposal that President Biden has put forward is actually focused on infrastructure.”', 'claimant': 'Liz Cheney', 'claimDate': '2021-04-11T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/13/liz-cheney/liz-cheneys-dubious-claim-just-6-biden-plan-infras/', 'title': "Liz Cheney's dubious claim that just 6% of Biden plan is ...", 'textualRating': 'Pants on Fire', 'languageCode': 'en'}]}, {'text': 'President Joe Biden’s infrastructure proposal “is fully paid for. Across 15 years, it would raise all of the revenue needed for these once-in-a-lifetime investments."', 'claimant': 'Pete Buttigieg', 'claimDate': '2021-04-04T00:00:00Z', 'claimReview': [{'publisher': {'name': 'PolitiFact', 'site': 'politifact.com'}, 'url': 'https://www.politifact.com/factchecks/2021/apr/07/pete-buttigieg/joe-bidens-infrastructure-bill-fully-paid/', 'title': "Is Joe Biden's infrastructure proposal fully paid for?", 'textualRating': 'Mostly True', 'languageCode': 'en'}]}], 'nextPageToken': 'CAo'}

In [None]:
# Your solution here


# URL collection with automated Google search<a id='URLs'></a>

If you want to crawl and/or scrape an online community of websites, there's a good chance you'll find yourself needing to collect their URLs. If you're lucky, you have comprehensive metadata describing these entities, something like their name and physical address. Your next step in this scenario would be to automate a Google search to collect the best URL matching each entity. 

How can you scrape URLs from Google? There are two fairly easy ways.

First, the **Google Places API**, which is the best option to do this at scale. You would need to apply for an API key from Google: go to the [Google cloud console](https://console.cloud.google.com/), create a project, and request an API key for each service you want to use. Approval may take a few days, but once done there is a [handy Python wrapper](https://github.com/slimkrazy/python-google-places) to make this easy to use in Python. See [Google Web Services](https://developers.google.com/places/web-service/) for general documentation and [Google Developers](https://developers.google.com/places/web-service/details) for details on Place Details requests.

The second option is **automated Google search**, which is not nearly as reliable and may get you blocked if used repeatedly. This method tends to get lots of false positives and third-party website aggregators (e.g., yellowpages.com, trulia.com), so using a blacklist to manually filter results is a good idea. Check out [the source code](https://github.com/MarioVilas/googlesearch) and [documentation](https://python-googlesearch.readthedocs.io/en/latest/). _Thanks Mario Vilas for this package!_

Because this second option is free and has no waiting period, we will practice using it in a polite way. In case you want to pursue the first option further, there is template code for running the Google Places API at the bottom of this notebook.

_Note_: Remember what I said about following the Terms of Service for APIs? You might find real gems in there--like this extract from the [Google Maps Platform Terms of Service](https://developers.google.com/terms/) that prohibits scraping data you intend to store:

```
3.2.3 Restrictions Against Misusing the Services.

(a)  No Scraping. Customer will not export, extract, or otherwise scrape Google Maps Content for use outside the Services. For example, Customer will not: (i) pre-fetch, index, store, reshare, or rehost Google Maps Content outside the services; (ii) bulk download Google Maps tiles, Street View images, geocodes, directions, distance matrix results, roads information, places information, elevation values, and time zone details; (iii) copy and save business names, addresses, or user reviews; or (iv) use Google Maps Content with text-to-speech services.
```

Keep this in mind should you consider using the Google Places API for URL scraping (as my template code below does): The same terms apply, so be nice!

## Scraping school URLs<a id='school_URLs'></a>

To see how this works, let's start by searching for the best URL for a charter school in Washington, D.C. Assume we have the name and address of the school.

To prevent overwhelming Google search with rapid requests--and likely getting our IP address blocked by Google as a result--let's search only for the first 10 results and include a five-second pause in between each request.

In [None]:
# Import automated Google search package
from googlesearch import search

# Define metadata for a single entity: a DC charter school
school_name = 'Capital City Public Charter School'
school_address = '100 Peabody Street NW, Washington, DC 20011'

# Search for first 10 Google results using joined metadata, show each one
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=5.0):
    print(url)

This is a pretty strong result: the first six matches share the domain of https://www.ccpcs.org/, so this is probably the best match. We identified a URL without even visiting any websites!
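
The "most results share one domain" heuristic can be automated by counting hosts across the results. Here is a minimal sketch of that idea. The URLs in `results` are made up for illustration and are not the actual search output.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical search results, made up for illustration
results = [
    'https://www.ccpcs.org/',
    'https://www.ccpcs.org/about',
    'https://www.ccpcs.org/admissions',
    'https://www.facebook.com/some-page',
    'https://www.greatschools.org/some-listing',
]

# Count how many results each host accounts for
domain_counts = Counter(urlparse(u).netloc for u in results)
top_domain, top_count = domain_counts.most_common(1)[0]
print(top_domain, top_count)
```

A host that dominates the results is a reasonable candidate for the official site, though a large aggregator could also dominate, so this works best alongside a blacklist.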

Notice that results 7-10 are about the right school, but they don't point to its genuine website--with all its descriptive language, images, and subpages. Even in this case with a strong topline result, we can already get a feel for what websites will pollute our automated searches: Facebook and greatschools.org are a good start to making a blacklist to filter the results. 

Now let's try something harder to find.

### Challenge

Collect the first 10 results from Google for Dr. David C. Walker Intermediate School located at 6500 Ih 35 N Ste C, San Antonio, TX 78218. What do you notice about the results? How do they compare to the previous set of results?

In [None]:
# Your solution here


These results are much less clear and organized: Each one points to a different site, and all of them are third parties. Interestingly, the [first result](https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/) (with domain of https://www.niche.com) does point to the [official website](https://excellence-sa.org/walker/), but extracting this information systematically would mean web-scraping--which we will get to tomorrow! 

## Scraping URLs using a blacklist<a id='blacklist'></a>

To provide cleaner search results, let's filter out the third-party websites from the previous two examples. 

Many of these websites can show up with either 'http' or 'https', often with or without a 'www', but usually have a consistent top-level domain (e.g., 'com'). Exact string matching would fail to capture matches across these variations. Regular expressions could do this, but for now let's just filter out those search results that contain the core of any blacklisted domain name (e.g., niche.com). 
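
One caveat with plain substring matching: it can flag unrelated sites whose names happen to contain a blacklisted string. As a possible alternative, here is a minimal sketch that compares only the URL's host against each domain. The helper `is_blacklisted` and the short `sample_blacklist` are illustrative names, not part of the workshop code.

```python
from urllib.parse import urlparse

def is_blacklisted(url, blacklist):
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain) for domain in blacklist)

sample_blacklist = ['niche.com', 'har.com']

print(is_blacklisted('https://www.niche.com/k12/some-school/', sample_blacklist))  # True: subdomain of niche.com
print(is_blacklisted('http://niche.com/', sample_blacklist))                       # True: exact host match
print(is_blacklisted('https://www.char.com/', sample_blacklist))                   # False: char.com is a different domain
print('har.com' in 'https://www.char.com/')                                        # True: the substring test misfires here
```

This handles the `http`/`https` and `www` variations without regular expressions, at the cost of a few extra lines.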

Let's get the first result for the previous school (Dr. David C. Walker Intermediate School) that doesn't match any blacklisted domains. 

In [None]:
# Define blacklisted domains to filter out: third-party domains/false positives that we DON'T want to scrape 
blacklist = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com']

# Define search metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'
#school_name = "River City Scholars Charter Academy"
#school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid blacklisted domains
print("Successfully collected Google search results.")

# Initialize blacklist match counter: How many blacklisted domains has this search encountered?
blacklisted_num = 0 
good_url = None # Stays None if every search result is blacklisted

# Loop through google search output to find first good result:
for url in urls:
    if any(domain in url for domain in blacklist):
        print(f'Bad site detected: {url}') 
        blacklisted_num += 1 # Add one to blacklist match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(blacklisted_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found
        
if good_url is None:
    print("No quality URL found: every search result matched the blacklist.")
else:
    print(f'Quality URL: {good_url}')

What do you think of [the "quality" URL we landed on](http://castro.tea.state.tx.us/charter_apps/content/downloads/Renewals/015806_2.pdf)? Looks like we need to expand our blacklist!

### Challenge

Improve our automated searching to get the genuine URL of Dr. David C. Walker Intermediate School. <br/>
_Hint_: You could try (A) adding more URLs to the blacklist OR (B) try a simple search but for more URLs.

In [None]:
# Your solution here


# Mirroring websites with `wget`<a id='wget'></a>

`wget` is classic (circa 1996, but still updated) [free software](https://www.gnu.org/philosophy/free-sw) for non-interactively downloading web content from the shell. It's often used for basic one-time downloads, much like `curl` in the shell or `urllib.request.urlretrieve` in Python's standard library. But where `wget` really shines is in its extensive customization, including retrying failed connections, following links, and duplicating a remote website's files and structure to the point of having an identical local copy (website mirroring). 

Let's try using the nice Python wrapper for `wget` to download the MDI News page nested in the McCourt School of Public Policy site:

In [None]:
import wget 
wget.download(url='https://mccourt.georgetown.edu/research/mdi-news/')

We can check out the contents of this (rather poorly named) file using the Jupyter interface in the previous tab. 

We got some HTML--cool! But what if we want something clickable and interactive? This is easiest to do with `wget` run via its native shell, rather than this simple Python wrapper--which also doesn't allow for `wget`'s more advanced functionality. We can use the helpful `!` prefix to run shell commands straight from this notebook. 

Let's make a new `wget` request to download a version of the same page that's easier to see in your browser. 

In [None]:
!wget https://mccourt.georgetown.edu/research/mdi-news/

Use your Jupyter browser to check out the results: just click on `index.html` in your current folder (probably this is `day-1/`) to view the page. What do you notice? How does it compare to viewing https://mccourt.georgetown.edu/research/mdi-news/ in your browser? Try clicking the links. Where can you go on the actual page that your local copy can't show you? Do you have local copies of the images?

## Features of `wget`<a id='wget_features'></a>

You might have noticed that we only ended up with some HTML--we didn't download any of the files associated with the webpage. So, this isn't a true copy; we couldn't host the page ourselves, analyze its images, or easily use its content for purposes other than viewing. How do we mirror the full site?

To get a working copy of this page, we need only the `--page-requisites` option, which downloads all the resources needed to render the page in a browser: that means CSS, JavaScript, image files, etc. To keep from overloading the server, let's pause for a few seconds in between downloads using the `--wait` option. 

Let's use some other features as well for politeness and subtlety (i.e. to avoid getting blocked). Here is an explanation of each:

```shell
--page-requisites             Grabs all of the linked resources necessary to render the page (images, CSS, javascript, etc.)
--wait                        Pauses between downloads (in seconds)
--tries=3                     Retries failed downloads 3 times
--user-agent=Mozilla          Makes wget look like a Mozilla browser by masking its user agent
--header="Accept:text/html"   Sends header with each HTML request, looks more browser-ish
--no-check-certificate        Doesn't check authenticity of website server (use only with trusted websites!)
```

In [None]:
!wget --page-requisites --wait=2 --tries=3 --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/mdi-news/

Check out the results--what's similar and what's different? See `/research/mdi-news/` for the `index.html` (sometimes this is `default.html`) page we saw earlier. 
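
To get a feel for which files count as page requisites, here is a minimal sketch that lists the images, stylesheets, and scripts a page refers to, using Python's built-in `html.parser`. The `RequisiteFinder` class and the sample HTML are made up for illustration. This is not how `wget` is implemented, only the same idea in miniature.

```python
from html.parser import HTMLParser

class RequisiteFinder(HTMLParser):
    """Collects URLs of images, stylesheets, and scripts referenced by a page."""

    def __init__(self):
        super().__init__()
        self.requisites = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('img', 'script') and attrs.get('src'):
            self.requisites.append(attrs['src'])
        elif tag == 'link' and attrs.get('rel') == 'stylesheet' and attrs.get('href'):
            self.requisites.append(attrs['href'])

# Hypothetical page source, for illustration only
html = """
<html><head>
  <link rel="stylesheet" href="/css/main.css">
  <script src="/js/app.js"></script>
</head><body>
  <img src="/images/logo.png" alt="logo">
  <a href="/research/">Research</a>
</body></html>
"""

finder = RequisiteFinder()
finder.feed(html)
print(finder.requisites)
```

Notice that the ordinary link to `/research/` is not collected: a page requisite is a file needed to display this page, not another page it links to.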

`wget` has a rich array of options. Here are some of the most useful ones in addition to those above:

```shell
--mirror                      Downloads a full website and makes it available for local viewing
--recursive                   Recursively downloads files and follows links
--no-parent                   Does not follow links above hierarchical level of input URL
--convert-links               Turns links into local links as appropriate
--accept                      Downloads only files with suffixes in this list (e.g., html)
--execute robots=off          Turns off automatic robots.txt checking, so the server's exclusion rules are ignored (use with care)
--random-wait                 Randomizes the defined wait period to between 0.5 and 1.5x that value
--background                  For a huge download, puts the download in the background
--spider                      Checks whether the remote file exists without downloading it (mimics web spiders)
--domains                     Follows links only within the listed domains
--user --password             Downloads files from password-protected sites
```
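
Before turning off `robots.txt` checking, it helps to know what a site's `robots.txt` actually asks of crawlers. Here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `robots_txt` are made up for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL
print(parser.can_fetch('*', 'https://www.example.com/research/'))
print(parser.can_fetch('*', 'https://www.example.com/private/report.html'))

# Requested pause between requests, in seconds (None if not specified)
print(parser.crawl_delay('*'))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file instead of the inline text.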

### Challenge

Download only `.html` files from https://mccourt.georgetown.edu/research/ and links below that.

In [None]:
# Your solution here


### Challenge

Use advanced options for `wget` (listed above) to mirror a website you use often. Be sure to use a polite `--wait` and avoid downloading anything with massive numbers of links, files, or pages (e.g., don't try YouTube.com or Wikipedia.org). If you want to download a segment or specific page within a website (e.g., a single YouTube channel or Wikipedia page), use the `--recursive` option with `--no-parent` (to follow only links within the input URL).

While you let `wget` run, read more about it in its [manual](https://www.gnu.org/software/wget/manual/wget.html) and see other examples of `wget` usage [here](https://gist.github.com/bueckl/bd0a1e7a30bc8e2eeefd) and [here](https://phoenixnap.com/kb/wget-command-with-examples). 

In [None]:
# Your solution here


# Template code: See Google Places API in action<a id='Places'></a>

For your reference, this is the code you would use to do URL scraping with the Google Places API.

In [None]:
# Import packages
from googleplaces import GooglePlaces, types  # Google Places API: 'types' lets us define what kind of entity to look for (e.g., schools)
import re

# Initialize Google Places API key
api_fp = 'define_me.txt' # Replace with API key filepath
places_api_key = re.sub("\n", "", open(api_fp).read())
google_places = GooglePlaces(places_api_key)

In [None]:
# See Google Places API in action
school_name = "River City Scholars Charter Academy"
school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

query_result = google_places.nearby_search(
        location=school_address, name=school_name,
        radius=15000, types=[types.TYPE_SCHOOL], rankby='distance') # Search for schools within 15000 meters (15 km) of input location

for place in query_result.places:
    print(place.name)
    place.get_details()  # makes further API call
    print(place.details) # A dict matching the JSON response from Google.
    print(place.website)
    print(place.formatted_address)

# Are there any additional pages of results?
if query_result.has_next_page_token:
    query_result_next_page = google_places.nearby_search(
        pagetoken=query_result.next_page_token)

The output looks like this:
```
River City Scholars Charter Academy
http://rivercityscholars.org/
944 Evergreen St SE, Grand Rapids, MI 49507, USA
```

In [None]:
# More robust code with a blacklist (uses `blacklist` as defined in the blacklist section above)
query_result = google_places.nearby_search(
    location=school_address, name=school_name,
    radius=15000, types=[types.TYPE_SCHOOL], rankby='distance') # Search for schools within 15000 meters (15 km)

# Initialize blacklist match counter and result BEFORE the loop so they persist across places
blacklisted_num = 0 
good_url = None # Stays None if no place has a usable website

for place in query_result.places:
    place.get_details()  # Make further API call to get detailed info on this place
    
    found_name = place.name  # Compare this name in Places API to school's name on file
    found_address = place.formatted_address  # Compare this address in Places API to address on file

    url = place.website  # Grab school URL from Google Places API, if it's there
    if not url:
        continue  # No website listed for this place, so try the next one

    if any(domain in url for domain in blacklist):
        blacklisted_num += 1    # If this URL matches the blacklist, add 1 to counter and move on
        print("URL in Google Places API is a third-party domain. Moving on.")

    else:
        good_url = url
        print("Success! URL obtained from Google Places API with " + str(blacklisted_num) + " bad URLs avoided.")
        break # Exit for-loop after finding first good result
        
print(f'Quality URL: {good_url}') # Show valid URL of the Place discovered in Google Places API (None if nothing was found)