A simple Bash script to download websites locally, including all assets needed to display them properly. The script uses Wget to retrieve the files.
On the command line:
- Clone the repository, e.g. `git clone git@gitlab.com:jonasjacek/website-downloader.git`
- Change into the repository, e.g. `cd website-downloader/`
- Add the list of URLs to retrieve to `website-downloader_urls.txt` (a sketch of the file follows this list).
- Adjust the script options as needed. See Options.
- Run the website downloader, e.g. `. website-downloader.sh`
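The expected format of `website-downloader_urls.txt` is an assumption here: with Wget's `--input-file`, each line is read as one URL, so a plain list like the following should work:

```
https://example.com/
https://example.org/docs/
```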
Options
- `--restrict-file-names=modes`: Change which characters found in remote URLs must be escaped during generation of local filenames. Values: `unix`, `windows`.
- `-r`, `--recursive`: Turn on recursive retrieving. See Recursive Download for more details. The default maximum depth is 5.
- `-x`, `--force-directories`: The opposite of `-nd`: create a hierarchy of directories, even if one would not have been created otherwise.
- `-k`, `--convert-links`: After the download is complete, convert the links in the document to make them suitable for local viewing.
- `-p`, `--page-requisites`: Download all the files that are necessary to properly display a given HTML page.
- `-E`, `--adjust-extension`: If a file of type `application/xhtml+xml` or `text/html` is downloaded and the URL does not end with the regexp `\.[Hh][Tt][Mm][Ll]?`, append the suffix `.html` to the local filename.
- `--no-cache`: Disable server-side cache.
- `-w seconds`, `--wait=seconds`: Wait the specified number of seconds between retrievals.
- `-e robots=off`: Ignore robots.txt exclusions; do not download robots.txt files.
- `--show-progress`: Force Wget to display the progress bar in any verbosity.
- `--progress=type`: Select the type of progress indicator to use. Legal indicators are "dot" and "bar".
- `-i file`, `--input-file=file`: Read URLs from a local or external file.
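Taken together, the options above suggest a core retrieval command along the following lines. This is a sketch assembled from the documented options, not necessarily the script's exact invocation; the 2-second wait and the `unix` file-name mode are assumptions:

```sh
#!/usr/bin/env bash
# Sketch: one Wget call combining the options documented above.
# The wait time and restrict-file-names mode are assumed values.
wget \
  --restrict-file-names=unix \
  --recursive \
  --force-directories \
  --convert-links \
  --page-requisites \
  --adjust-extension \
  --no-cache \
  --wait=2 \
  -e robots=off \
  --show-progress \
  --progress=bar \
  --input-file=website-downloader_urls.txt
```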
Further Options
- `-np`, `--no-parent`: Do not ever ascend to the parent directory when retrieving recursively.
- `-H`, `--span-hosts`: Enable spanning across hosts when doing recursive retrieving (see Spanning Hosts).
- `-D domain-list`, `--domains=domain-list`: Set domains to be followed. `domain-list` is a comma-separated list of domains. Note that it does not turn on `-H`.
- `-a logfile`, `--append-output=logfile`: Append to `logfile`.
- `-q`, `--quiet`: Turn off Wget's output.
- `-t number`, `--tries=number`: Set the number of tries to `number`. Specify 0 or `inf` for infinite retrying.
- `-nd`, `--no-directories`: Do not create a hierarchy of directories when retrieving recursively.
- `--no-check-certificate`: Don't check the server certificate against the available certificate authorities.
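For illustration, the further options can be layered onto a run when a page pulls assets from another host. The domain names and the log file name below are placeholders, not part of the script:

```sh
# Hypothetical run: follow page requisites onto a known CDN host,
# retry each file up to 3 times, and append Wget's output to a log.
wget --recursive --page-requisites --span-hosts \
  --domains=example.com,cdn.example.com \
  --tries=3 --append-output=wget.log \
  https://example.com/
```

Note that `--span-hosts` is passed explicitly: as documented above, `--domains` restricts which hosts may be followed but does not enable host spanning by itself.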
You can find this repository at:
- GitLab: https://gitlab.com/jonasjacek/website-downloader
- GitHub: https://github.com/jonasjacek/website-downloader
Website Downloader is a small, private project. The author makes no claims or representations as to warranties regarding the accuracy or completeness of the information provided. You may use the information in this repository AT YOUR OWN RISK.
Website Downloader by Jonas Jacek is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available upon request.
Found a mistake? Open an issue or send a merge request. Want to help in another way? Contact me.