Generate a redirect map from two sitemaps for website migration.
.gitignore Ignore some paths from git May 4, 2018
LICENSE Initial commit Apr 28, 2017
Pipfile Add dependencies to pipfile May 4, 2018
Pipfile.lock Add dependencies to pipfile May 4, 2018
README.md Fix command in readme May 4, 2018
aggregate.py Formatting in aggregate.py May 4, 2018
map.py Python 3 compatibility with argparse May 4, 2018
setup.py Initial commit Apr 28, 2017

redirect-mapper

Takes two lists of URLs and outputs a mapping that assigns each entry in list 1 an item from list 2 along with a score that indicates how likely the two refer to the same thing.

Use case

This script was created to automatically generate a map of redirects when migrating a website. The input lists would be sitemaps of the old and the new website, both plain text files containing one URL per line. The URLs are required to be "pretty", meaning not just /post.php?id=123 but rather something like /blog/why-wordpress-sucks, and should ideally have their protocol and domain parts removed.
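If your sitemap entries still carry protocol and domain, they can be stripped before feeding them to map.py. The following is a minimal sketch using only the standard library; the function name `to_path` and the example URLs are made up for illustration and are not part of this repository.

```python
from urllib.parse import urlsplit


def to_path(url):
    """Return only the path of a URL, dropping protocol, domain and query string."""
    path = urlsplit(url.strip()).path
    # A bare domain such as "https://example.com" has an empty path; treat it as "/".
    return path or "/"


# Hypothetical input lines, as they might appear in a raw sitemap dump.
urls = [
    "https://old-website.com/blog/why-wordpress-sucks",
    "http://old-website.com/about/",
    "/contact",
]

for url in urls:
    print(to_path(url))
```

Note that this sketch also discards query strings, which is only appropriate for "pretty" URLs as described above.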

It can of course be used as a generic tool to fuzzy match two sets of strings. It uses the Levenshtein distance metric as implemented by python-Levenshtein.

Warning: Always check the results manually. Never trust the output of the script blindly. It will assign each item in list 1 one item from list 2, even if it's a really bad match.
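To illustrate the idea (and the reason for the warning), here is a minimal pure-Python sketch of Levenshtein-based matching. It is not the code used by map.py, which relies on python-Levenshtein; the normalisation in `similarity` and the helper names are assumptions made for this example, so the scores will not necessarily equal those printed by the script.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, approaching 0.0 for unrelated ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(item, candidates):
    """Return (candidate, score) for the highest-scoring candidate. Always returns one."""
    return max(((c, similarity(item, c)) for c in candidates), key=lambda pair: pair[1])


new_urls = ["/posts/hello-world", "/about"]
print(best_match("/blog/hello-world", new_urls))  # a plausible match
print(best_match("/imprint", new_urls))           # still gets a match, just a poor one
```

The second call shows why manual review matters: every item receives a partner, and only the score reveals that the pairing is weak.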

map.py usage

  1. Clone this repository: git clone https://github.com/jsphpl/redirect-mapper
  2. Enter it: cd redirect-mapper
  3. Install dependencies: python setup.py install
  4. Use it:
$ python map.py [-h] [-t VALUE] [-c PATH] [-d] list1 list2

Generates a redirect map from two sitemaps for website migration.

By default, all matches are dumped on the standard output. If an item
from list1 is exactly contained in list2, it will be assigned right
away, without calculating distance or checking for ambiguity.

Issues & Documentation: https://github.com/jsphpl/redirect-mapper

positional arguments:
  list1                 List of target items for which to find matches. (1 item per line)
  list2                 List of search items on which to search for matches. (1 item per line)

optional arguments:
  -h, --help            show this help message and exit
  -t VALUE, --threshold VALUE
                        Range within which two scores are considered equal. (default: 0.05)
  -c PATH, --csv PATH   If specified, the output will be formatted as CSV and written to PATH
  -d, --drop-exact      If specified, exact matches will be omitted from the output

Examples

Generate a list of redirects

Say you want to know where to redirect all the URLs from old_sitemap.txt. Pass it as the first argument, like so:

python map.py old_sitemap.txt new_sitemap.txt

Adjust ambiguity threshold

To influence the level at which two matches are considered equally good, use the -t VALUE argument.

python map.py -t 0.1 old_sitemap.txt new_sitemap.txt
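One way to picture the threshold: if the runner-up's score lies within VALUE of the best score, the two candidates count as equally good and the match is ambiguous. The sketch below is an assumption about how such a check could look, not a copy of the logic in map.py; the function name and the sample scores are invented.

```python
def pick_with_ambiguity(scored, threshold=0.05):
    """Return (best, best_score, rivals) from a list of (candidate, score) pairs.

    rivals holds every other candidate whose score is within `threshold`
    of the best score, i.e. candidates considered equally good.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best, best_score = ranked[0]
    rivals = [cand for cand, score in ranked[1:] if best_score - score <= threshold]
    return best, best_score, rivals


# Hypothetical scores for one old URL against three new URLs.
scored = [("/posts/hello", 0.90), ("/posts/hello-2", 0.87), ("/about", 0.60)]

print(pick_with_ambiguity(scored, threshold=0.05))  # runner-up counts as a rival
print(pick_with_ambiguity(scored, threshold=0.01))  # stricter: no rivals
```

A larger threshold therefore flags more matches as ambiguous, while a smaller one flags fewer.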

Omit exact matches

If the results are used to set up 301 redirects on the new website to catch all traffic arriving at old URLs, exact matches can be omitted. They will be handled by actual pages existing on the new site (list2). Use the -d flag here.

python map.py -d old_sitemap.txt new_sitemap.txt
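Conceptually, dropping exact matches means splitting list1 into URLs that already exist in list2 and URLs that still need a fuzzy match. A rough sketch of that split, with a hypothetical helper name and sample data rather than the script's own code:

```python
def split_exact(list1, list2):
    """Split list1 into (exact, remaining) by plain membership in list2."""
    targets = set(list2)  # set lookup keeps this fast for large sitemaps
    exact = [item for item in list1 if item in targets]
    remaining = [item for item in list1 if item not in targets]
    return exact, remaining


old = ["/about", "/blog/hello-world", "/contact"]
new = ["/about", "/posts/hello-world", "/contact"]

exact, remaining = split_exact(old, new)
print(exact)      # already served by pages on the new site, no redirect needed
print(remaining)  # only these need a redirect target
```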

Save output to CSV file

Specify the output filename with -c PATH.

python map.py -c results.csv old_sitemap.txt new_sitemap.txt

Aggregating URLs from an XML sitemap

A helper script, aggregate.py, crawls an XML sitemap and prints a flat list of URLs in the format map.py expects as input. Together with that helper, the whole process of generating a redirect map could look like the following. Afterwards, check results.csv manually, taking special care with low-scoring matches (score ≤ 0.8).

python aggregate.py https://old-website.com/sitemap.xml > old.txt
python aggregate.py https://new-website.com/sitemap.xml > new.txt
python map.py --drop-exact --csv results.csv old.txt new.txt
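The manual review can be made a little easier by listing the weakest matches first. The sketch below assumes a CSV with three columns (old URL, new URL, score) and no header row; that layout is a guess for illustration only, so check it against your actual results.csv and adjust the unpacking accordingly.

```python
import csv
import io

# Hypothetical CSV content; the real column layout of results.csv may differ.
SAMPLE = """\
/blog/old-post,/posts/old-post,0.91
/team,/about,0.42
/contact-us,/contact,0.80
"""


def rows_to_review(csv_text, cutoff=0.8):
    """Return (source, target, score) rows scoring at or below cutoff, worst first."""
    flagged = []
    for source, target, score in csv.reader(io.StringIO(csv_text)):
        if float(score) <= cutoff:
            flagged.append((source, target, float(score)))
    return sorted(flagged, key=lambda row: row[2])


for row in rows_to_review(SAMPLE):
    print(row)
```

To run this against a real file, read its contents with `open("results.csv").read()` and pass the text to `rows_to_review`.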

aggregate.py usage

$ python aggregate.py [-h] URL/PATH

Aggregates URLs from a set of XML sitemaps listed under the entry path.

This script processes the XML file at given path, opens all sitemaps
listed inside, and prints all URLs inside those maps to stdout.
It should support most sitemaps that comply with the spec at
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd.

It was tested with sitemaps generated by the following WP plugins:
 - [Google XML Sitemaps](https://wordpress.org/plugins/google-sitemap-generator/)
 - [XML Sitemap & Google News feeds](https://wordpress.org/plugins/xml-sitemap-feed/)
 - [Yoast SEO](https://wordpress.org/plugins/wordpress-seo/)

Issues & Documentation: https://github.com/jsphpl/redirect-mapper

positional arguments:
  URL/PATH    Path or URL of the root sitemap.

optional arguments:
  -h, --help  show this help message and exit
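For readers curious about what such a crawl involves, here is a small sketch of extracting `<loc>` entries from a sitemap index and from a regular sitemap with the standard library. It works on inline sample XML instead of fetching anything, and it is not the implementation in aggregate.py; the function name and the sample documents are invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample documents standing in for downloaded files.
INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://old-website.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://old-website.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>
"""

URLSET = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://old-website.com/blog/hello-world</loc></url>
  <url><loc>https://old-website.com/about/</loc></url>
</urlset>
"""


def extract_locs(xml_text):
    """Return (kind, locs), where kind is the root tag name without its namespace."""
    root = ET.fromstring(xml_text.strip())
    kind = root.tag.split("}")[-1]
    if kind == "sitemapindex":
        # An index lists further sitemaps that would have to be fetched in turn.
        locs = root.findall("sm:sitemap/sm:loc", NS)
    else:
        # A urlset lists the page URLs themselves.
        locs = root.findall("sm:url/sm:loc", NS)
    return kind, [loc.text.strip() for loc in locs if loc.text]


print(extract_locs(INDEX))
print(extract_locs(URLSET))
```

A full crawler would download each sitemap listed in the index and collect the URLs from every resulting urlset.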