# Mirroring websites with `wget`<a id='wget'></a>

`wget` is a classic (circa 1996, but still updated) piece of [free software](https://www.gnu.org/philosophy/free-sw) for downloading web content non-interactively from the shell. It's often used for basic one-time downloads, much as `curl` is in the shell or `urllib.request.urlretrieve` is in Python. But where `wget` really shines is in its extensive customization, including retrying failed connections, following links, and duplicating a remote website's files and structure to the point of having an identical local copy (website mirroring).

Let's try using the Python `wget` package (a small download utility that shares the name) to download the MDI News page:

In [1]:
import wget 
wget.download(url='https://mdi.georgetown.edu/news/')

'download (1).wget'

We can check out the contents of this (rather poorly named) file using the Jupyter interface in the previous tab. 
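If you would rather peek at the file without leaving the notebook, here is a minimal sketch. The helper name `preview` is made up for illustration, and the filename is simply the one returned in the output above, so yours may differ.

```python
from pathlib import Path


def preview(path, n_chars=300):
    """Return the first n_chars characters of a text file."""
    # errors="replace" keeps the preview from failing on unexpected bytes
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return text[:n_chars]


# Filename taken from the wget.download() output above; adjust it to match yours.
saved = Path("download (1).wget")
if saved.exists():
    print(preview(saved))
```

Seeing `<!DOCTYPE html>` or `<html` near the top is a quick sign that the download is a web page rather than an error message.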

We got some HTML--cool! But what if we want something clickable and interactive? That is easiest with `wget` run natively in the shell rather than through this simple Python package, which doesn't offer `wget`'s more advanced functionality. We can use the helpful `!` prefix to run shell commands straight from this notebook.
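Outside a notebook the `!` prefix isn't available, but the standard library's `subprocess` module can launch the same shell programs. The sketch below uses a hypothetical helper, `run_command`, and a stand-in command (the Python interpreter itself) so that it runs without network access. Swapping in `["wget", "https://mdi.georgetown.edu/news/"]` should behave like the cell that follows, assuming `wget` is installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments; return (exit_code, stdout)."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


# Stand-in command so the sketch runs anywhere, even offline.
code, out = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(code, out.strip())
```

Note that `wget` writes its progress log to standard error, so with a real `wget` call you would want to look at `result.stderr` as well.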

Let's make a new `wget` request to download a version of the same page that's easier to see in your browser. 

In [2]:
!wget https://mdi.georgetown.edu/news/

--2022-07-18 20:35:13--  https://mdi.georgetown.edu/news/
Resolving mdi.georgetown.edu (mdi.georgetown.edu)... 23.185.0.4, 2620:12a:8001::4, 2620:12a:8000::4
Connecting to mdi.georgetown.edu (mdi.georgetown.edu)|23.185.0.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66738 (65K) [text/html]
Saving to: ‘index.html.1’


2022-07-18 20:35:14 (1.81 MB/s) - ‘index.html.1’ saved [66738/66738]



Use your Jupyter browser to check out the results: just click on `index.html` in your current folder (probably `day-1/`) to view the page. If a copy with that name already existed, `wget` saves the new one as `index.html.1`, as in the output above. What do you notice? How does it compare to viewing https://mdi.georgetown.edu/news/ in your browser? Try clicking the links. Where can you go on the actual page that your local copy can't show you? Do you have local copies of the images?
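One way to answer those questions programmatically is to list the links and images the saved HTML refers to and see which ones still point back to the web. This is a rough sketch using only the standard library. The class name, helper name, and sample HTML (with `example.com` URLs) are made up for illustration; in practice you would feed it the text of your saved `index.html`.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceCollector(HTMLParser):
    """Collect link targets (<a href>) and image sources (<img src>)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def is_remote(url):
    """True if the URL names a host, so following it leaves the local copy."""
    return bool(urlparse(url).netloc)


# Hypothetical stand-in for the contents of a saved page.
sample_html = """
<a href="https://example.com/about/">About</a>
<a href="#top">Back to top</a>
<img src="https://example.com/images/banner.png">
"""

collector = ResourceCollector()
collector.feed(sample_html)
print("remote links: ", [u for u in collector.links if is_remote(u)])
print("remote images:", [u for u in collector.images if is_remote(u)])
```

Any image that shows up as remote here is still being loaded from the original server when you open your local copy.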

## Features of `wget`<a id='wget_features'></a>

You might have noticed that we only ended up with some HTML--we didn't download any of the files associated with the webpage. So, this isn't a true copy; we couldn't host the page ourselves, analyze its images, or easily use its content for purposes other than viewing. How do we mirror the full site?

To do this, we need only the `--page-requisites` option, which makes sure to download all the resources needed to render the page in a browser: that means CSS, JavaScript, image files, etc. To keep from overloading the server, let's pause for a few seconds in between downloads using the `--wait` option. 

Let's use some other features as well for politeness and subtlety (i.e., to avoid getting blocked). Here is an explanation of each:

```shell
--page-requisites             Grabs all of the linked resources necessary to render the page (images, CSS, JavaScript, etc.)
--wait=2                      Pauses between downloads (in seconds)
--tries=3                     Attempts each download up to 3 times before giving up
--user-agent=Mozilla          Makes wget look like a Mozilla browser by masking its user agent
--header="Accept:text/html"   Sends this header with each HTTP request, looks more browser-ish
--no-check-certificate        Doesn't verify the server's TLS certificate (use only with trusted websites!)
```

In [3]:
!wget --page-requisites --wait=2 --tries=3 --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mdi.georgetown.edu/news/

--2022-07-18 20:35:14--  https://mdi.georgetown.edu/news/
Resolving mdi.georgetown.edu (mdi.georgetown.edu)... 23.185.0.4, 2620:12a:8001::4, 2620:12a:8000::4
Connecting to mdi.georgetown.edu (mdi.georgetown.edu)|23.185.0.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66738 (65K) [text/html]
Saving to: ‘mdi.georgetown.edu/news/index.html’


2022-07-18 20:35:14 (2.31 MB/s) - ‘mdi.georgetown.edu/news/index.html’ saved [66738/66738]

Loading robots.txt; please ignore errors.
--2022-07-18 20:35:16--  https://mdi.georgetown.edu/robots.txt
Reusing existing connection to mdi.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 6224 (6.1K) [text/plain]
Saving to: ‘mdi.georgetown.edu/robots.txt’


2022-07-18 20:35:16 (39.4 MB/s) - ‘mdi.georgetown.edu/robots.txt’ saved [6224/6224]

--2022-07-18 20:35:18--  https://mdi.georgetown.edu/wp-content/plugins/embed-gutenberg-block-google-maps/assets/css/style.836e5da587e9ec9692c0.css?ver=1658047951
Reusi

HTTP request sent, awaiting response... 200 OK
Length: 7106 (6.9K) [application/x-javascript]
Saving to: ‘mdi.georgetown.edu/wp-content/themes/wp-theme-1789/build/js/scripts.min.js?ver=1658047952’


2022-07-18 20:35:47 (123 MB/s) - ‘mdi.georgetown.edu/wp-content/themes/wp-theme-1789/build/js/scripts.min.js?ver=1658047952’ saved [7106/7106]

--2022-07-18 20:35:49--  https://mdi.georgetown.edu/wp-content/plugins/page-links-to/dist/new-tab.js?ver=3.3.6
Reusing existing connection to mdi.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 24734 (24K) [application/x-javascript]
Saving to: ‘mdi.georgetown.edu/wp-content/plugins/page-links-to/dist/new-tab.js?ver=3.3.6’


2022-07-18 20:35:49 (11.1 MB/s) - ‘mdi.georgetown.edu/wp-content/plugins/page-links-to/dist/new-tab.js?ver=3.3.6’ saved [24734/24734]

--2022-07-18 20:35:51--  https://mdi.georgetown.edu/wp-content/pattern-library/build/webfonts/fa-brands-400.eot
Reusing existing connection to mdi.georgetown.edu:443.
HT

--2022-07-18 20:36:19--  https://mdi.georgetown.edu/wp-content/pattern-library/build/webfonts/fa-regular-400.eot?
Reusing existing connection to mdi.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 34390 (34K) [application/vnd.ms-fontobject]
Saving to: ‘mdi.georgetown.edu/wp-content/pattern-library/build/webfonts/fa-regular-400.eot?’


2022-07-18 20:36:19 (13.4 MB/s) - ‘mdi.georgetown.edu/wp-content/pattern-library/build/webfonts/fa-regular-400.eot?’ saved [34390/34390]

--2022-07-18 20:36:21--  https://mdi.georgetown.edu/wp-content/pattern-library/build/webfonts/fa-regular-400.woff2
Reusing existing connection to mdi.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 13584 (13K) [font/woff2]
Saving to: ‘mdi.georgetown.edu/wp-content/pattern-library/build/webfonts/fa-regular-400.woff2’


2022-07-18 20:36:21 (14.0 MB/s) - ‘mdi.georgetown.edu/wp-content/pattern-library/build/webfonts/fa-regular-400.woff2’ saved [13584/13584]

--2022-07-18

Check out the results--what's similar and what's different? See `mdi.georgetown.edu/news/` for the `index.html` (sometimes this is `default.html`) page we saw earlier. 
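To get a quick inventory of what came down, you could count the downloaded files by type. The sketch below is one way to do that, with made-up helper names. It assumes the files landed in a folder named `mdi.georgetown.edu`, as the "Saving to:" lines above suggest, and it strips the `?ver=...` suffix that `wget` kept in some filenames.

```python
from collections import Counter
from pathlib import Path


def extension_of(filename):
    """Return the lowercased extension, ignoring any '?query' suffix kept in the name."""
    name = filename.split("?", 1)[0]
    return Path(name).suffix.lower() or "(none)"


def count_by_extension(root):
    """Count the files under root, grouped by extension."""
    return Counter(extension_of(p.name) for p in Path(root).rglob("*") if p.is_file())


# Folder name taken from the "Saving to:" lines above; adjust if yours differs.
root = Path("mdi.georgetown.edu")
if root.exists():
    for ext, n in count_by_extension(root).most_common():
        print(f"{ext:10} {n}")
```

Compare the counts with the single `.html` file from the earlier download to see how much `--page-requisites` added.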

`wget` has a rich array of options. Here are some of the most useful ones in addition to those above:

```shell
--mirror                      Turns on recursion and time-stamping with unlimited depth, for keeping a local copy of a site
--recursive                   Recursively downloads files and follows links
--no-parent                   Does not follow links above the hierarchical level of the input URL
--convert-links               Rewrites links in downloaded files so they work for local viewing
--accept                      Downloads only files with suffixes in this list (e.g., .html)
--execute robots=off          Turns off automatic robots.txt checking, so the server's exclusion rules are ignored
--random-wait                 Randomizes the defined wait period to between 0.5 and 1.5x that value
--background                  Sends the download to the background (useful for huge downloads)
--spider                      Checks whether the remote file exists without downloading it (mimics web spiders)
--domains                     Follows links only to the listed domains
--user --password             Supply credentials for password-protected sites
```
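With this many options, it can help to assemble the command in Python and inspect it before running anything. Here is a minimal sketch. The function `build_wget_command` and its defaults are made up for illustration, and `https://example.com/docs/` is a placeholder URL.

```python
import shlex


def build_wget_command(url, accept=None, wait=2, random_wait=True, no_parent=True):
    """Assemble a recursive wget command as a list of arguments."""
    cmd = ["wget", "--recursive", "--convert-links"]
    if no_parent:
        cmd.append("--no-parent")
    if accept:
        # --accept takes a comma-separated list of suffixes
        cmd.append("--accept=" + ",".join(accept))
    cmd.append(f"--wait={wait}")
    if random_wait:
        cmd.append("--random-wait")
    cmd.append(url)
    return cmd


cmd = build_wget_command("https://example.com/docs/", accept=["html", "pdf"])
# shlex.join (Python 3.8+) renders the list as a single shell-ready string
print(shlex.join(cmd))
```

Keeping the command as a list also makes it easy to pass to `subprocess.run` later, without worrying about shell quoting.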

### Challenge

Download only `.html` files from https://mccourt.georgetown.edu/research/ and links below that.

In [4]:
# Solution
!wget --accept .html --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://mccourt.georgetown.edu/research/

--2022-07-18 20:36:32--  https://mccourt.georgetown.edu/research/
Resolving mccourt.georgetown.edu (mccourt.georgetown.edu)... 23.185.0.1, 2620:12a:8001::1, 2620:12a:8000::1
Connecting to mccourt.georgetown.edu (mccourt.georgetown.edu)|23.185.0.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 169630 (166K) [text/html]
Saving to: ‘mccourt.georgetown.edu/research/index.html’


2022-07-18 20:36:32 (1.87 MB/s) - ‘mccourt.georgetown.edu/research/index.html’ saved [169630/169630]

Loading robots.txt; please ignore errors.
--2022-07-18 20:36:34--  https://mccourt.georgetown.edu/robots.txt
Reusing existing connection to mccourt.georgetown.edu:443.
HTTP request sent, awaiting response... 200 OK
Length: 116 [text/plain]
Saving to: ‘mccourt.georgetown.edu/robots.txt.tmp’


2022-07-18 20:36:34 (1.30 MB/s) - ‘mccourt.georgetown.edu/robots.txt.tmp’ saved [116/116]

--2022-07-18 20:36:36--  https://mccourt.georgetown.edu/research/featured-publications/
Reusing existing conn

Converting links in mccourt.georgetown.edu/research/research-data-center/administrative-data-metadata/index.html... 34-8
Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/index.html... 19-3
Converting links in mccourt.georgetown.edu/research/featured-publications/index.html... 25-8
Converting links in mccourt.georgetown.edu/research/mccourt-centers/index.html... 31-8
Converting links in mccourt.georgetown.edu/research/research-data-center/index.html... 41-8
Converting links in mccourt.georgetown.edu/research/index.html... 28-12
Converting links in mccourt.georgetown.edu/research/research-data-center/research/index.html... 34-8
Converting links in mccourt.georgetown.edu/research/research-data-center/data-and-software-available/index.html... 34-8
Converting links in mccourt.georgetown.edu/research/mccourt-centers/research-center-directors/index.html... 25-8
Converting links in mccourt.georgetown.edu/research/research-data-center/contact-us/index.html... 34-8
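The "Converting links" lines above are regular enough to parse if you want a per-file summary. The sketch below pulls out the path and the two counts. My reading is that the first number counts links rewritten to point at local files and the second counts links rewritten to full remote URLs, but treat that interpretation as an assumption and check the `wget` manual. The helper name is made up for illustration.

```python
import re

# Matches lines such as:
#   Converting links in some/path/index.html... 34-8
LINE = re.compile(r"^Converting links in (?P<path>.+?)\.\.\. (?P<first>\d+)-(?P<second>\d+)$")


def parse_conversion_log(lines):
    """Return (path, first_count, second_count) for each matching log line."""
    rows = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            rows.append((m["path"], int(m["first"]), int(m["second"])))
    return rows


# Three lines copied from the output above.
log = [
    "Converting links in mccourt.georgetown.edu/research/the-massive-data-institute/index.html... 19-3",
    "Converting links in mccourt.georgetown.edu/research/research-data-center/index.html... 41-8",
    "Converting links in mccourt.georgetown.edu/research/index.html... 28-12",
]

for path, first, second in parse_conversion_log(log):
    print(f"{first:3d} {second:3d}  {path}")
```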


### Challenge

Use advanced options for `wget` (listed above) to mirror a website you use often. Be sure to use a polite `--wait` and avoid downloading anything with massive numbers of links, files, or pages (e.g., don't try YouTube.com or Wikipedia.org). If you want to download a segment or specific page within a website (e.g., a single YouTube channel or Wikipedia page), use the `--recursive` option with `--no-parent` (to follow only links at or below the input URL).

While you let `wget` run, read more about it in its [manual](https://www.gnu.org/software/wget/manual/wget.html) and see other examples of `wget` usage [here](https://gist.github.com/bueckl/bd0a1e7a30bc8e2eeefd) and [here](https://phoenixnap.com/kb/wget-command-with-examples). 
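To build intuition for what `--no-parent` keeps and what it skips, here is a simplified approximation of the rule in Python. It is not `wget`'s actual implementation: the function name `within_scope` is made up, and the same-host check stands in for `wget`'s default of not following links to other hosts during recursion.

```python
from urllib.parse import urlparse


def within_scope(start_url, candidate_url):
    """Approximate --no-parent: same host, and path at or below the start directory."""
    start, cand = urlparse(start_url), urlparse(candidate_url)
    if start.netloc != cand.netloc:
        return False
    # Directory part of the start path: "/research/" for both
    # ".../research/" and ".../research/index.html"
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    return cand.path.startswith(base_dir)


start = "https://mccourt.georgetown.edu/research/"
for url in [
    "https://mccourt.georgetown.edu/research/featured-publications/",  # below the start URL
    "https://mccourt.georgetown.edu/",                                 # above it
    "https://www.gnu.org/software/wget/",                              # different host
]:
    print(within_scope(start, url), url)
```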

In [5]:
# Solution
!wget --mirror --recursive --no-parent --page-requisites --convert-links --wait=2 --tries=3 \
    --user-agent=Mozilla --header="Accept:text/html" --no-check-certificate \
    https://www.jarenhaber.com/

--2022-07-18 20:37:05--  https://www.jarenhaber.com/
Resolving www.jarenhaber.com (www.jarenhaber.com)... 68.183.29.183, 157.245.84.7, 2604:a880:400:d0::1561:9001, ...
Connecting to www.jarenhaber.com (www.jarenhaber.com)|68.183.29.183|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘www.jarenhaber.com/index.html’ not modified on server. Omitting download.

Converted links in 0 files in 0 seconds.
