paultraf/makestaticsite

MakeStaticSite — a Bash shell script to generate and deploy static websites

MakeStaticSite (project site https://makestaticsite.sh/) is a set of Bash shell scripts that configure and use Wget to generate a static website from a (typically dynamic) website, with various options to tailor and deploy the output. It aims to improve the performance and security of public-facing websites, whilst allowing continuity in the way they are developed and maintained, without requiring technical know-how on behalf of users.

About

MakeStaticSite provides a convenient means to set up and manage the automated creation and deployment of static versions of websites. These include content management systems (such as WordPress and Drupal) that can, for example, be administered locally and then deployed remotely to a hosting provider or Content Distribution Network (CDN). Strictly speaking, this is not an archival tool as the output is not an exact mirror, but a version of the site that preserves content while aiming to remain current. For example, RSS feeds are saved and then renamed with .xml extensions, and further files may be added.

The goal is for anyone who has a little familiarity with the command line to be able to use the tool to assist in maintaining their sites. Similarly, a scripting-based approach has been chosen to make the code widely accessible for developers to further fine-tune; a number of refinements are already included that augment the standard use of Wget, such as support for arbitrary attributes and, in the case of WordPress, the use of WP-CLI to prepare sites beforehand.

MakeStaticSite is available under the GNU Affero General Public License (AGPL) version 3. See the COPYING file for more information.

Requirements

This software should work on version 3 of GNU Bash, though version 4+ is recommended.

It depends on GNU Wget and rsync, for which the latest versions are recommended. For optimising WordPress sites ahead of running Wget, WP-CLI is needed; and HTML Tidy is used to refine HTML output for better conformance with W3C standards. Otherwise, apart from Internet connectivity, there are few dependencies beyond what the shell already provides.

Please note that the system is not designed for Wget2, though it would be useful to support that in future.
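
As a quick pre-flight check for the dependencies listed above, something like the following could be run before setup. It is only a sketch and not part of MakeStaticSite: the function names are hypothetical, and the command names wget, rsync, wp (WP-CLI) and tidy (HTML Tidy) are assumptions about how those tools are installed on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of MakeStaticSite.

# Print the name of every given command that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed when the running Bash has at least the given major version.
bash_at_least() {
  [ "${BASH_VERSION%%.*}" -ge "$1" ]
}

# Example usage (command names are assumptions):
#   missing_tools wget rsync    # required tools
#   missing_tools wp tidy       # optional: WP-CLI and HTML Tidy
#   bash_at_least 4 || echo "Bash 4+ is recommended" >&2
```

Any names printed by missing_tools are tools that still need installing. An empty result means every listed command was found.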

Features

  • A straightforward command line interface
  • Able to create static versions of a wide range of dynamic websites
  • Support for managing multiple sites, each with custom settings
  • Setup script guides users with an interactive dialogue and automatically generates a configuration file.
  • During setup, choose from three run levels to determine the amount of customisation - from minimal to advanced.
  • In addition to the main host domain, additional assets such as JavaScript, CSS and images can be retrieved from other domains and subdomains.
  • A phase-based workflow separating different aspects of the build process
  • Each site's custom settings are defined in its own configuration file, and a site can have more than one configuration file.
  • Suitable for batch processes, allowing operations to be scaled up so that any or all of the sites are updated in one process.
  • Support for HTTP basic authentication and/or CMS login (experimental, only tested with WordPress)
  • Runtime settings include verbosity (amount of information) for terminal output and logging to file
  • Option of a downloadable copy of the entire site (zip file) for offline use
  • For WordPress installations, optional WP-CLI-based site streamlining with a drop-in search replacement (WP Offline Search plugin) that works offline.
  • Snippets: chunks of HTML for tweaking pages with offline variants.
  • Deep search for orphaned Web assets (phase 3), later retrieved in further runs of Wget (phase 5)
  • Assistance for W3C standards compliance with HTML Tidy. The system also generates a sitemap XML and robots.txt file to match the outputted files.

Limitations

  • MakeStaticSite is prototype software, provided as-is and tested on only a few sites, but in the hope that it will prove useful and become community-supported
  • The system is designed for the original GNU Wget, whereas most development effort is now on GNU Wget2.
  • Not a general crawler, but designed to retrieve from a single site, with supporting assets (CSS, multimedia, etc.) incorporated from other domains and subdomains
  • It is not a good fit for sites that use query strings extensively, as is the case for collections databases with a large inventory. Whilst query strings are supported in the initial run of Wget, requests for URLs in the post-processing do not currently include query strings.
  • Links generated dynamically by JavaScript are not included.
  • The script can only provide a snapshot of comments, discussions, surveys, etc. that are provided by the website itself; the interactivity of such components is generally lost. Interactivity of this kind then needs to be provided by third parties, typically through embedded JavaScript.
  • Performance: MakeStaticSite output is not instant. It typically takes up to a few minutes to build a site, which, depending on usage scenario, may or may not be a significant duration. Some acceleration is possible by running Wget threads in parallel (see the wget_threads option).
  • For WordPress sites, using WP-CLI remotely over ssh may not be fully supported by hosting providers running jailed shells for shared hosting. In that case, WordPress updates need to be done manually.

Acknowledgements

Many thanks to various developers for sharing their knowledge on shell scripting, particularly on blogs and Q&A websites such as Stack Exchange and to those who have tested, commented on and otherwise supported MakeStaticSite.

Installing

The source distribution is made available as a gzipped tar file. Download the latest version from:

https://makestaticsite.sh/download/makestaticsite_latest.tar.gz

Once downloaded, from the command line run the following to extract it:

tar -xzvf makestaticsite_latest.tar.gz

This will create a makestaticsite directory. Enter it and then make the scripts executable:

chmod u+x *.sh

Layout

.
├── config/                 # site configuration files
├── lib/                    # library files
├── log/                    # log files (generated)
├── tmp/                    # temporary files (generated)
├── makestaticsite.sh       # main script
├── setup.sh                # setup script
├── version_history.txt     # summary of changes for each version
├── COPYING                 # software license
└── README.md

How to use

For first use, enter the makestaticsite directory at the command line and run ./setup.sh. You will be asked a series of questions (with suggested defaults) about the site you are mirroring, including, for WordPress, options to tweak it beforehand. Further questions cover the precise Wget options used to create the mirror, how it should be deployed (locally or on a remote server), whether to create a zip file, and various other options.

Once you have set up a configuration (say, mysite.cfg for the domain example.com), you can build the static version with:

./makestaticsite.sh -i mysite

It will proceed to generate a static mirror in the following directory:

mirror/mirror_id/example.com

where mirror_id is a site identifier based on mysite; when the archive option is set, it is mysite concatenated with a timestamp.

For other command-line options, run:

./makestaticsite.sh -h

Manual intervention should be minimal — mainly required when Wget encounters errors or when you are using WordPress and opt to add an offline search facility, in which case you will be prompted to go to the WordPress dashboard and create the search index.
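
For batch use, the single-site command above can be wrapped in a loop over the configuration files. The sketch below is an illustration rather than a feature of MakeStaticSite. The function names config_names and build_all are hypothetical. It assumes that configuration files live in config/, that their names contain no whitespace, and that -i takes the file name without the .cfg extension, as in the mysite example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of MakeStaticSite.

# Print the name (file name minus .cfg) of each configuration file
# in the given directory, one per line.
config_names() {
  local dir="${1:-config}" cfg
  for cfg in "$dir"/*.cfg; do
    [ -e "$cfg" ] || continue   # unmatched glob stays literal: skip it
    basename "$cfg" .cfg
  done
}

# Build every configured site in turn, reporting failures and carrying on.
# Assumes it is run from the makestaticsite directory.
build_all() {
  local name
  for name in $(config_names "${1:-config}"); do
    ./makestaticsite.sh -i "$name" || printf 'Build failed for %s\n' "$name" >&2
  done
}

# Example usage:
#   config_names config   # dry run: list what would be built
#   build_all config
```

Running config_names first gives a cheap way to confirm which sites a batch run would touch.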

Workflow

MakeStaticSite divides its work into ten phases, which may be regarded as a pipeline:

  1. Prepare the CMS
  2. Generate static site
  3. Augment static site
  4. Refine static site
  5. Add extras
  6. Optimise
  7. Use snippets
  8. Create offline zip
  9. Deploy
  10. Conclude (summary report)

Accordingly, you can run the script with arguments p and q, specifying start and end phases respectively such that: 1 <= p <= q <= 10

There are broadly two use cases.

(Case 1) When creating a site for the first time, you can opt to stop at any phase, up to and including the final one:

./makestaticsite.sh -i mysite -q END_NUM

(where END_NUM is the phase where it stops.)

Thus, to just carry out an initial run of Wget and not carry out further processing, set END_NUM to 2.

(Case 2) An existing mirror may be modified, perhaps subsequent to a run abbreviated as above. Here, both the start and end phases may be specified:

./makestaticsite.sh -m mirror_id -p START_NUM -q END_NUM

(where the argument -m expects a mirror ID, START_NUM is the phase where the script starts processing, and END_NUM is the phase where it stops.)
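
The constraint 1 <= p <= q <= 10 can be checked before a run is launched, for example from a wrapper script. The following is a minimal sketch. The function names are hypothetical, and MakeStaticSite may validate its arguments differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of MakeStaticSite.

# Succeed when the argument is a phase number from 1 to 10.
is_phase() {
  case "$1" in
    [1-9]|10) return 0 ;;
    *)        return 1 ;;
  esac
}

# Succeed when start phase p and end phase q satisfy 1 <= p <= q <= 10.
valid_phase_range() {
  local p="$1" q="$2"
  is_phase "$p" && is_phase "$q" && [ "$p" -le "$q" ]
}

# Example usage:
#   if valid_phase_range 3 9; then
#     ./makestaticsite.sh -m mirror_id -p 3 -q 9
#   fi
```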

Options

MakeStaticSite is customised through two sets of options. Only a brief description is given here, except for the options relating to Wget, which are covered in more detail because Wget is core to the whole operation.

  • Configuration options define the target, i.e. the site you are capturing, any authentication requirements, options for Wget, what kinds of refinement to carry out and how to deploy the end result.

    The options are stored in .cfg files in the config directory. They can be created manually, but it's recommended to use the setup script and then tweak as needed.

    Details: https://makestaticsite.sh/help/configuration/

  • Runtime options set the general parameters for running MakeStaticSite on a particular system. These settings, stored in lib/constants.sh, apply to any configuration file supplied, so are to be treated as universal constants. They can be tweaked on any given run, but it is strongly recommended that a backup be made first.

    Details: https://makestaticsite.sh/help/options/

Wget

Wget is at the heart of MakeStaticSite and needs to be precisely configured with multiple command-line arguments to make a faithful snapshot of a site. This is why a warning is given if the version used is not very recent. Also, a single run might not be sufficient to capture everything, particularly orphaned links, so MakeStaticSite provides a separate process (when wget_extra_urls is set) to gather additional URLs and then Wget is called again for each URL that is discovered.

Several variables contribute arguments: some are basic and should be included in every run, whilst others are site-specific.

(1) Configuration options

  • wget_extra_options (default: -X/wp-json,/wp-admin --reject xmlrpc*) specifies which directories to exclude (-X) and which file name suffixes or patterns to reject (-R or --reject). A default setup of a CMS such as WordPress typically exposes various APIs for data retrieval, which depend on server-side scripting. These are redundant in a static site and should be removed, ideally within the CMS, with these Wget arguments acting as a fallback.

A couple of other parameters that could be supplied here:

  • --spider for just testing the wget operation without downloading files. This will still report errors (and also create the directory structure).

  • --limit-rate=100k limits the wget download rate to about 100KB per second.

Please refer to the Wget manual for details of these and other options.

(2) Runtime options

  • wget_core_options (default: --mirror --convert-links --adjust-extension --page-requisites) is a fairly standard set of arguments for generating a static mirror (phase 2); --adjust-extension generates files with .html extension, making the output suitable for offline browsing, which is one of the main goals of the project.

  • wget_extra_core_options (default: -r -l inf -nc --adjust-extension) is a trimmed-down version of wget_core_options to be used in phase 3 when Wget is rerun with the assumption that hidden URLs are assets, not Web pages, which should be left alone to preserve navigation integrity.

  • wget_reject_clause (default: *login*,*logout*) is added automatically to wget_extra_options (login/logout links are redundant in a static site and should not be followed).

Another option to facilitate crawling a remote site:

  • wget_user_agent (default is empty) can be specified in the case that a host server is configured to forbid access to content without the receipt of a user agent string in a certain format (not usually including Wget). To circumvent this issue a suitable string can be specified here.
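
To see how these variables might combine into a single Wget invocation, the sketch below prints one argument per line using the default values quoted above. It is illustrative only: how MakeStaticSite itself assembles the command is not shown here, the function name wget_mirror_args is hypothetical, and merging the reject clause into one --reject list is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of MakeStaticSite.

# Print a Wget argument list, one argument per line, for mirroring a URL.
# The second parameter is an optional user agent string.
wget_mirror_args() {
  local url="$1" user_agent="${2:-}"
  # wget_core_options (default)
  printf '%s\n' --mirror --convert-links --adjust-extension --page-requisites
  # wget_extra_options (default), with wget_reject_clause merged in
  printf '%s\n' '-X/wp-json,/wp-admin' '--reject=xmlrpc*,*login*,*logout*'
  # wget_user_agent, only when one is supplied
  [ -z "$user_agent" ] || printf '%s\n' "--user-agent=$user_agent"
  printf '%s\n' "$url"
}

# Example usage: collect the lines into an array and pass them to Wget.
#   args=()
#   while IFS= read -r arg; do args+=("$arg"); done \
#     < <(wget_mirror_args https://example.com/)
#   wget "${args[@]}"
```

Printing the arguments rather than running Wget directly makes it easy to inspect the command first. Quoting patterns such as xmlrpc* keeps the shell from expanding them against local file names.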

Snippets

Snippets provide a means to make changes to the web pages generated by Wget. For example, when mirroring a CMS, there may be (links to) login pages that should be hidden.

A snippet is a chunk of HTML to be substituted for another chunk in the original web page on the host ($url). Each one is assigned a numerical ID written as three zero-padded digits, i.e., between 000 and 999. Snippets are stored as files in the snippets/ directory inside MakeStaticSite's top-level directory, with filenames matching their IDs. Thus, for a snippet with ID 001 the corresponding file is snippet001.html. A snippet may be used in more than one site, hence they are stored together. To differentiate sets of snippets, a numbering convention may be used, e.g., 1xx for site 1, 2xx for site 2.

To incorporate snippets, a pair of tags needs to be included in the source HTML of any page where a replacement is to be made (in WordPress, when using the Gutenberg editor, you can insert them by using the <code> block). For ID 001, say, insert the following HTML before the content to be changed:

<!--SNIPPET001BEGIN-->

And insert the other immediately after the content:

<!--SNIPPET001END-->

An index to all the snippets is stored in snippets.data, which lists the path to each file to be modified followed by a list of snippet identifiers. A simple tag is used to demarcate sets of snippets for a particular site, where the element name corresponds to the local site name.

The following code specifies three snippets for one site and one for another:

<sigalaresearch>
index.html:1
about/website/index.html:2,3
</sigalaresearch>
<ptworld_local>
contact/index.html:4
</ptworld_local>

After a Wget mirror is created, the script matches on the <$local_sitename> tag and works through the lines inside the tag pair, extracting the file path and snippet IDs from each. It then creates a temporary copy of the file and its path and applies the relevant snippet substitutions. Depending on the settings, the revised file may be deployed and/or included in the zip file.
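
Reading the index for one site can be sketched with awk as follows. This is not the project's own parser: the function name snippet_entries is hypothetical, and the mapping from an index ID such as 1 to the file name snippet001.html is an assumption based on the naming convention described earlier. File paths containing a colon are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of MakeStaticSite.

# For the given site name, print one "page-path snippet-file" pair per line
# from a snippets.data style index.
snippet_entries() {
  awk -v site="$1" '
    $0 == ("<" site ">")  { inside = 1; next }   # opening tag for this site
    $0 == ("</" site ">") { inside = 0; next }   # closing tag
    inside && (i = index($0, ":")) > 0 {
      path = substr($0, 1, i - 1)
      n = split(substr($0, i + 1), ids, ",")
      for (k = 1; k <= n; k++)
        printf "%s snippet%03d.html\n", path, ids[k]
    }
  ' "$2"
}

# Example usage:
#   snippet_entries sigalaresearch snippets.data
```

With the index shown above, the site sigalaresearch yields three pairs: one for index.html and two for about/website/index.html.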

Once a snippet has been applied, the SNIPPET tag is removed, whereas if a SNIPPET tag is visible, the snippet has not yet been applied. The latter is true for content within the mirror/ directory, which contains the 'raw' snapshot before applying snippets; files which have been changed are stored in the subs/ directory.
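
The substitution itself can be illustrated with a short awk-based function that replaces everything between a BEGIN/END tag pair, tags included, with the contents of a snippet file. This is a sketch of the idea only, and MakeStaticSite's actual implementation may differ. The function name apply_snippet is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of MakeStaticSite.

# Usage: apply_snippet ID PAGE SNIPPET_FILE
# Writes PAGE to standard output with the span from <!--SNIPPET<ID>BEGIN-->
# to <!--SNIPPET<ID>END--> replaced by the contents of SNIPPET_FILE.
apply_snippet() {
  local id="$1" page="$2" snippet="$3"
  awk -v open_tag="<!--SNIPPET${id}BEGIN-->" \
      -v close_tag="<!--SNIPPET${id}END-->" \
      -v snip="$snippet" '
    {
      line = $0
      if (!skipping) {
        i = index(line, open_tag)
        if (i == 0) { print line; next }         # ordinary line: keep it
        printf "%s", substr(line, 1, i - 1)      # text before the open tag
        while ((getline s < snip) > 0) print s   # insert the snippet
        close(snip)
        skipping = 1
        line = substr(line, i + length(open_tag))
      }
      j = index(line, close_tag)
      if (j > 0) {
        skipping = 0
        rest = substr(line, j + length(close_tag))
        if (rest != "") print rest               # text after the close tag
      }
    }
  ' "$page"
}

# Example usage (paths are placeholders):
#   apply_snippet 001 page.html snippets/snippet001.html > page.new.html
```

Because the output goes to standard output, the original file in the mirror is left untouched, in keeping with the separation between raw and substituted files described above.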

Newsfeeds

MakeStaticSite attempts to adjust Wget output to maintain support for RSS feeds. Currently targeted at WordPress, it renames files inside feed/ directories from index.html to index.xml. To properly support this in deployment, on the web server, add index.xml as the last entry to the DirectoryIndex directive in .htaccess at the site's root.
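
The renaming step can be pictured with the following sketch, which uses find to locate index.html files sitting directly inside feed/ directories. It is not the project's own code: the function name rename_feeds is hypothetical, and it does not update any links that point at the old file names.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of MakeStaticSite.

# Rename every feed/index.html under the given directory to feed/index.xml.
rename_feeds() {
  local root="$1"
  find "$root" -type f -path '*/feed/index.html' -exec sh -c '
    for f do
      mv -- "$f" "${f%.html}.xml"
    done
  ' sh {} +
}

# Example usage:
#   rename_feeds mirror/mirror_id/example.com
```

On the server side, the matching .htaccess line might then read DirectoryIndex index.html index.xml, assuming index.html is the only existing entry.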

Further work

The code could surely be improved in quality as well as extended, with internationalisation (i18n) being a high priority. Another key requirement is to add support for Wget2, and HTTrack should also be considered. A properly implemented modular architecture would also enable enhanced support for a variety of content management systems (CMS). Whilst MakeStaticSite is authored in Bash, versions for other shells should be possible and might not require a great deal of modification.

Details: https://makestaticsite.sh/developers/further-work/
