
sysadmin publishqueue

Violet edited this page Dec 6, 2010 · 7 revisions

MSAG: Using Melody's "Publish Queue"

The Melody Publish Queue is an essential component of any large-scale Melody-powered website because it plays a crucial role in optimizing publishing performance. Using the publish queue has a number of benefits:

  • It eliminates redundant, duplicated and unnecessary publication of files.

  • It offloads publishing to a standalone process that can be throttled and scaled independently of the Melody web application itself.

  • It speeds up the commenting experience by reducing the number of files that must be published before a commenter can continue navigating the website.

How it Works

It might be best to describe how the publish queue works by examining a scenario in which it would be utilized: republishing the necessary files in response to a comment.

Adding Jobs to the Queue

When a comment comes in to Melody, several files often need to be updated: the comment must be published to the entry’s permalink page, and any other pages that display a comment count for that entry may need to be updated as well.

Each of those pages (assuming they are configured to be published via the publish queue) will then be added to the “publish queue.” When this happens, a publishing “job” is created and added to the database for each page that needs to be published. There is one row in the database for each individual job in the system.

Now let’s assume for a moment that shortly after receiving the first comment, a second one is published by a different visitor to your website. This action also results in pages needing to be republished. This time, however, before those pages are added to the queue as jobs, the system checks whether a job corresponding to each page is already on the queue. If there is, the new job is discarded because it would only duplicate work that is already scheduled. If there is not, the job is added. This ensures that no unnecessary work is performed by the system.
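
The duplicate check described above can be sketched in a few lines. This is a minimal illustration in Python, not Melody's implementation (Melody stores jobs as database rows); the `PublishQueue` class, its `enqueue` method, and the file paths are hypothetical names invented for the example.

```python
class PublishQueue:
    """Toy in-memory stand-in for the job table: one entry per queued page."""

    def __init__(self):
        self.jobs = {}  # maps a page's unique key to its job record

    def enqueue(self, page_key, priority=1):
        """Add a job for page_key unless one is already waiting.

        Returns True if a job was added, False if it was discarded as a duplicate.
        """
        if page_key in self.jobs:
            return False  # an equivalent job is already on the queue
        self.jobs[page_key] = {"page": page_key, "priority": priority}
        return True


queue = PublishQueue()

# First comment: the entry page and the main index both need republishing.
first = [queue.enqueue("entry/hello.html"), queue.enqueue("index.html")]

# Second comment on the same entry: both pages are already queued.
second = [queue.enqueue("entry/hello.html"), queue.enqueue("index.html")]
```

After both comments the queue still holds only two jobs, which is the saving the publish queue is designed to provide.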

In addition, each page that is added to the publish queue is given a priority which dictates the order in which the corresponding job will be processed. The higher the priority, the sooner the system will work on the job. Melody assigns priority based upon the following criteria:

| Page/Template Type | Priority |
| --- | --- |
| Preferred Page and Entry archives | 10 |
| Index templates with a filename beginning with “index” or “default” | 9 |
| Feed index templates | 9 |
| All other index templates | 8 |
| Non-preferred Page and Entry archives | 5 |
| Daily archives | 4 |
| Weekly archives | 3 |
| Monthly archives | 2 |
| Any Category archive | 1 |
| Any Author archive | 1 |
| Yearly archives | 1 |

And that is how jobs are added to the queue. A separate process is then responsible for publishing them.
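
To make the effect of the priorities concrete, here is a small Python sketch that orders a handful of queued pages. The priority values come from the table above; the dictionary keys, the file paths, and the tie-breaking rule (equal priorities keep their insertion order) are assumptions made for the example, since the text does not say how ties are resolved.

```python
# Priorities copied from the table above; the keys are labels invented for this sketch.
PRIORITY = {
    "preferred_entry_or_page": 10,
    "main_index": 9,
    "feed_index": 9,
    "other_index": 8,
    "non_preferred_entry_or_page": 5,
    "daily": 4,
    "weekly": 3,
    "monthly": 2,
    "category": 1,
    "author": 1,
    "yearly": 1,
}


def processing_order(jobs):
    """Return page paths sorted from highest to lowest priority.

    sorted() is stable, so jobs with equal priority keep the order
    in which they were added (an assumption of this sketch).
    """
    ranked = sorted(jobs, key=lambda job: -PRIORITY[job[1]])
    return [path for path, _kind in ranked]


jobs = [
    ("2010/index.html", "yearly"),
    ("atom.xml", "feed_index"),
    ("2010/12/hello.html", "preferred_entry_or_page"),
    ("2010/12/index.html", "monthly"),
    ("index.html", "main_index"),
]
order = processing_order(jobs)
```

In this example the entry page is published first and the yearly archive last, regardless of the order in which the jobs were queued.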

Creating Publish Queue Workers

One or more publish queue “workers” can be created to process jobs on the queue. The number of workers needed by a system is based largely upon two variables:

  • The capacity of any one worker to process jobs on the queue.
  • The volume of jobs being added to the queue over time.
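
As a rough back-of-the-envelope aid, the two variables above can be combined into an estimate. The formula, the function name `workers_needed`, and the `headroom` parameter are assumptions made for this sketch, and the numbers are invented; measure your own system before relying on any of them.

```python
import math


def workers_needed(jobs_added_per_minute, jobs_per_worker_per_minute, headroom=1.0):
    """Estimate how many workers keep the queue from growing.

    headroom > 1.0 adds spare capacity for bursts (hypothetical tuning knob).
    At least one worker is always returned.
    """
    required = jobs_added_per_minute * headroom / jobs_per_worker_per_minute
    return max(1, math.ceil(required))


# 120 jobs arrive per minute and one worker publishes about 50 per minute.
steady = workers_needed(120, 50)
# The same load with 50% spare capacity for comment bursts.
bursty = workers_needed(120, 50, headroom=1.5)
```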

A worker is created by running the “run-periodic-tasks” script that comes with every copy of Melody. This script can be run in three modes:

  • daemon mode - in this mode the script never quits; instead it constantly monitors the job queue for work to be done, and almost the instant a job becomes available the script begins work on it.

  • run-once - in this mode the script is run via the command line and will quit only after there is no more work on the queue to be done.

  • scheduled task - in this mode the script is executed in the “run-once” mode periodically according to a schedule defined by cron or a similar service.
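
The difference between the first two modes comes down to what the worker does when the queue is empty. The following Python sketch models that difference only; it is not the `run-periodic-tasks` script, and the function names, the `should_stop` callback, and the `idle_sleep` parameter are hypothetical.

```python
import time
from collections import deque


def run_once(queue, publish):
    """Run-once mode: drain the queue, then exit. Returns the number of jobs processed."""
    done = 0
    while queue:
        publish(queue.popleft())
        done += 1
    return done


def run_daemon(queue, publish, should_stop, idle_sleep=1.0):
    """Daemon mode: keep polling until should_stop() returns True.

    When the queue is empty the worker sleeps briefly instead of exiting.
    """
    done = 0
    while not should_stop():
        if queue:
            publish(queue.popleft())
            done += 1
        else:
            time.sleep(idle_sleep)
    return done


published = []
count = run_once(deque(["index.html", "atom.xml"]), published.append)
```

Scheduled-task mode corresponds to calling `run_once` at fixed intervals, with cron playing the role of the scheduler.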

Processing Jobs on the Queue

Each worker will monitor the queue for jobs. When one becomes available it is pulled off the queue to be worked on. Once it is "off the queue" no other workers can claim it. This makes sure that no two workers are trying to work on the same job at the same time.

In the event that something goes wrong during the publishing process and the file is not published, the system will notice, saying something akin to, "uh-oh, look at this job that was claimed on the queue, but was never successfully finished," and then free up the job for a worker to pick up and try again. If the task is retried more than 5 times, the job is marked as failed and left on the queue. In this state it is possible for a similar job to be placed on the queue, and if the problem that caused the publishing failure is not transient, that job is likely to fail again.
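
The claim, retry, and fail cycle can be modeled as a small state machine. This Python sketch is a simplification under stated assumptions: a failure is signaled by an exception rather than detected afterwards, the job is a plain dictionary with invented field names, and "retried more than 5 times" is interpreted as one initial attempt plus five retries.

```python
MAX_RETRIES = 5  # from the text above; the exact counting rule is an assumption


def work(job, publish):
    """Claim a job, try to publish it, and return the job's new state."""
    job["state"] = "claimed"  # once claimed, no other worker may take it
    try:
        publish(job["page"])
    except Exception:
        if job["retries"] >= MAX_RETRIES:
            job["state"] = "failed"  # left on the queue, no longer retried
        else:
            job["retries"] += 1
            job["state"] = "queued"  # released so a worker can try again
    else:
        job["state"] = "done"
    return job["state"]


def always_fails(page):
    """Stand-in for a publish step with a non-transient problem."""
    raise IOError("cannot write " + page)


job = {"page": "index.html", "retries": 0, "state": "queued"}
attempts = 0
while job["state"] == "queued":
    work(job, always_fails)
    attempts += 1
```

With a problem that never goes away, the job is attempted six times in this model and then stays on the queue in the failed state.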

An important thing to note is that once a job has been pulled off the queue by a worker, it remains possible for that same page to be added to the queue again in response to another comment. The rationale is that by the time the page has finished being rebuilt it is most likely out of date, and so needs to be published again.

What Powers It?

The Publish Queue is powered by a standalone job/queue management library called “The Schwartz.” The Schwartz is actually a more generic and abstract job management system capable of processing any number of tasks via a similar queuing mechanism.

For the time being, Melody only utilizes the Schwartz for publishing, but in the future may use this framework for sending emails or other non-critical system tasks.

Publish Queue Tools

There is one tool in particular that is recommended for most systems that utilize the Publish Queue, aptly named the Publish Queue Manager.

This tool provides a user interface within Melody that allows administrators to monitor and inspect jobs on the queue. Each job can be deleted, or have its priority changed.

For more information, visit the plugin’s website.

Using RSync and/or NFS in Multi-Server Environments

In large multi-server environments it often becomes necessary to take a single file that has been published by a Publish Queue worker and somehow get it to show up on multiple front end web servers.

Confused? Consider the following scenario. Suppose your system employs multiple machines for the express purpose of processing jobs on the publish queue. Now let’s say one of those machines updates one set of files and another updates a different set of files. How, then, does one get these disparate files to a machine intended to serve them to your readers?

In a single server environment this is never a problem because the machine serving the files and the machine publishing them are one and the same. Publish queue workers therefore publish directly to your web server’s document root, which makes updated pages available immediately. In a multi-server environment there are generally two ways to solve this problem:

  • Link your front end web servers and your publish queue machine together via a shared filesystem like NFS.
  • Physically copy files from your publishing machines to your front end web servers via rsync or scp.

Now, let’s explain each of these options in more detail.

NFS

In the NFS solution, all of your publishing servers (or Publish Queue workers) write files to an external NFS mount. These files never physically reside on the publishing server; they only appear to be local thanks to NFS, which lets different servers share the same set of files. The front end web server then mounts this shared NFS directory for reading.

| Pros | Cons |
| --- | --- |
| Scales better because each file is written once and is immediately visible on the front end web server. | Very poor performance in a geographically dispersed setup (e.g. Amazon EC2 or other cloud services). |
| Easier to set up, IMHO. | Single point of failure. If something were to go wrong with your shared filesystem, then much of your system will be hosed. This can be mitigated with a solid RAID config or other highly reliable disk storage. |

Note: “NFS” here is used only for illustrative purposes. Technically any shared file system technology will do.

RSync

When using rsync, Melody will invoke a command line utility designed for keeping two different file systems in sync with one another. This is what happens when Melody is configured to use RSync:

  1. User leaves a comment.
  2. Job is created in Publish Queue.
  3. Worker pulls job off queue and publishes file to local file system.
  4. Worker then begins to rsync (usually over SSH) to each of the designated servers.

| Pros | Cons |
| --- | --- |
| Failure tolerance - by replicating your published content you ensure that if one file system or server goes bad, you still have something to fall back on. | Slightly harder to set up, IMHO. |
| Great for cloud hosting services like Amazon EC2, or any time your publishing server and front end web servers are not likely to be on the same subnet. | Scalability - the more front end web servers you have, the more servers you will need to synchronize with. This can add latency to your publishing process and cause some servers to have slightly different content from one another for a brief period of time. |
| | Only works in Unix environments. |
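
The final step of the rsync workflow listed earlier in this section, copying one published file to every front end server, can be sketched as follows. This Python example only builds the command lines and does not run them. It is an illustration, not Melody's actual invocation, which may differ: the function name, host names, and file path are hypothetical, and the use of rsync's `-a` and `--relative` options (which recreate the file's full path under the target) is an assumption of this sketch.

```python
import shlex


def rsync_commands(published_file, sync_targets, rsync_options="-e ssh"):
    """Build one rsync argument list per target server; nothing is executed."""
    commands = []
    for target in sync_targets:
        commands.append(
            ["rsync", "-a", "--relative"]
            + shlex.split(rsync_options)  # e.g. "-e ssh" becomes ["-e", "ssh"]
            + [published_file, target]
        )
    return commands


cmds = rsync_commands(
    "/var/www/html/2010/12/hello.html",
    ["username@web1.example.com:/", "username@web2.example.com:/"],
)
# Render the first command as a shell-safe string for inspection.
first_line = shlex.join(cmds[0])
```

Because one command is needed per target, the work grows with the number of front end servers, which is the scalability cost noted in the table above.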

Setting Up Publish Queue and Rsync

To get started using Publish Queue and rsync you will need to follow these steps:

  1. Make sure that your publishing servers are configured to publish files to the exact same path as your front end web servers are configured to read from. In other words, your publishing server should mirror exactly the file/directory/path structure of your front end web server.

  2. Set up a user on your front end web server that has write access to the directory that serves your blog’s published files to the outside world. Make sure this user can connect via SSH to your front end web server from each of your publishing servers without having to supply a password. This is often done using SSH’s authorized_keys file (typically ~/.ssh/authorized_keys).

Testing Your Setup

Once this is complete, it is best to test that you can transfer files between the two hosts. To do so, run the following commands from one of your publishing servers and confirm that they succeed without prompting for a password:

prompt> cd /
prompt> scp /path/to/a/file.txt username@someserver.com:/path/to/a/

Make sure to replace “/path/to/a/file.txt” with the absolute path to a file in your blog’s document root, and replace “username” and “someserver.com” with the username and server address to transfer files to.

Your config.cgi file

Once you have tested that files can be transferred between hosts without being prompted for a password, then add this to the config.cgi file on each of your publish queue servers:

SyncTarget username@someserver.com:/
RsyncOptions -e ssh

Common Cron Settings

The following are the most common cron settings for configuring the publish queue. These can be copied and pasted directly into the crontab editor once you have fixed the path to correspond to your user account on your host (for example, changing /home/username/www/melody to /home/johndoe/www/melody if you log into your web host's service panel as "johndoe").

  • Every 5 minutes:

    */5 * * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every 10 minutes:

    */10 * * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every 30 minutes:

    */30 * * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every hour:

    0 * * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every other hour:

    0 */2 * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every six hours:

    0 */6 * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every twelve hours:

    0 */12 * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every day:

    0 0 * * * cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every week:

    0 0 * * 0 cd /home/username/www/melody/tools; perl run-periodic-tasks

  • Every month:

    0 0 1 * * cd /home/username/www/melody/tools; perl run-periodic-tasks
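
If you manage several accounts or hosts, the path substitution described above can be automated. The Python sketch below is only a convenience for generating the lines shown in this list; the `SCHEDULES` dictionary and the `cron_line` function are hypothetical helpers, not part of Melody.

```python
# Schedule expressions copied from the list above (a subset, for brevity).
SCHEDULES = {
    "every 5 minutes": "*/5 * * * *",
    "every hour": "0 * * * *",
    "every day": "0 0 * * *",
    "every week": "0 0 * * 0",
    "every month": "0 0 1 * *",
}


def cron_line(schedule, melody_dir):
    """Return a crontab line that runs the worker from melody_dir/tools."""
    return f"{SCHEDULES[schedule]} cd {melody_dir}/tools; perl run-periodic-tasks"


line = cron_line("every hour", "/home/johndoe/www/melody")
```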

If you have SSH access to your web host, you can do the following to configure these from the command line:

  1. Copy the setting you prefer into Notepad or TextEdit (Windows and MacOS respectively).
  2. Update the path portion of the string to the full path for Melody on your host.
  3. Connect to your host via SSH.
    • Mac users, open Terminal.app under Applications/Utilities and type "ssh yourusername@yourdomainhere"
    • Windows users, download PuTTY and use it to connect to your host.
  4. Type in the following command: crontab -e
  5. Press the "i" key. (The editor this usually opens is called Vi. It is very temperamental if you are not accustomed to its behavior. Just follow these steps verbatim.)
  6. Mac users, just copy and paste like normal. Windows users using PuTTY, copy the setting you modified into the clipboard (control+c) and right click on the main area of the PuTTY terminal (this is an easy way to paste into PuTTY, for future reference). You should now see the string in there.
  7. Hit the "escape" key on your keyboard.
  8. Type the following: :wq
  9. If you encounter any problems, contact your host. This is something which any good host can fix in a matter of seconds for their users.

If your host provides a "CPanel" interface to their service, this process is substantially easier. Here are the steps:

  1. Log into CPanel.
  2. Search for "Cron Jobs" and click on that link.
  3. Select the "Common Settings" value that you want to use from the "Add New Cron Job" field.
  4. Where it says "Command," copy and paste in the command portion of the string (ex. cd /home/username/www/melody/tools; perl run-periodic-tasks)
  5. Click on the "Add New Cron Job" button.

Additional Reading

To learn more about the Publish Queue, consider reading the following resources:



Questions, comments, can't find something? Let us know at our community outpost on Get Satisfaction.

Credits

  • Authors: Byrne Reese
  • Edited by: Violet Bliss Dietz