Takes two lists of URLs and outputs a mapping that assigns each entry in list 1 an item from list 2 along with a score that indicates how likely the two refer to the same thing.
This script was created to automatically generate a map of redirects when migrating a website. The input lists would be the sitemaps of the old and the new website, each a plain text file containing one URL per line. The URLs are required to be "pretty", meaning not just /post.php?id=123 but rather something like /blog/why-wordpress-sucks, and should ideally have their protocol and domain parts removed.
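If your sitemaps contain full URLs, the protocol and domain parts can be stripped beforehand. The following is a minimal sketch using only the Python standard library. The helper names to_path and normalize_lines are hypothetical and not part of this repository.

```python
from urllib.parse import urlsplit


def to_path(url):
    """Return only the path of a URL, dropping scheme, domain, query and fragment."""
    path = urlsplit(url.strip()).path
    # A bare domain such as https://example.com has an empty path; treat it as the root.
    return path or "/"


def normalize_lines(lines):
    """Yield one cleaned path per non-empty input line."""
    for line in lines:
        if line.strip():
            yield to_path(line)


if __name__ == "__main__":
    for path in normalize_lines(["https://old-website.com/blog/why-wordpress-sucks\n"]):
        print(path)
```

Applied to every line of a sitemap file, this yields input in the domain-less form described above.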
Warning: Always check the results manually. Never trust the output of the script blindly. It will assign each item in list 1 one item from list 2, even if it's a really bad match.
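To illustrate why manual checking matters, here is a minimal sketch of a best-match assignment. It is not the script's actual implementation: it uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure, and the function names are hypothetical. Note how every item gets a match, however poor.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score between 0.0 and 1.0 (assumption: not the script's metric)."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(item, candidates):
    """Return (match, score) for the candidate most similar to item."""
    # An exact match is assigned right away, without scoring the other candidates.
    if item in candidates:
        return item, 1.0
    score, match = max((similarity(item, c), c) for c in candidates)
    return match, score


def build_map(list1, list2):
    """Assign every entry of list1 its best match from list2, good or bad."""
    return {item: best_match(item, list2) for item in list1}


if __name__ == "__main__":
    for old, (new, score) in build_map(["/about", "/zzz"], ["/about", "/about-us"]).items():
        print(old, new, round(score, 2))
```

In this sketch, /zzz is still mapped to one of the candidates even though the two have almost nothing in common, which is exactly the case the warning is about.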
- Clone this repository
git clone https://github.com/jsphpl/redirect-mapper
- Enter it
cd redirect-mapper
- Install dependencies
python setup.py install
- Use it:
$ python map.py [-h] [-t VALUE] [-c PATH] [-d] list1 list2

Generates a redirect map from two sitemaps for website migration. By default,
all matches are dumped on the standard output. If an item from list1 is
exactly contained in list2, it will be assigned right away, without
calculating distance or checking for ambiguity.

Issues & Documentation: https://github.com/jsphpl/redirect-mapper

positional arguments:
  list1                 List of target items for which to find matches. (1 item per line)
  list2                 List of search items on which to search for matches. (1 item per line)

optional arguments:
  -h, --help            show this help message and exit
  -t VALUE, --threshold VALUE
                        Range within which two scores are considered equal. (default: 0.05)
  -c PATH, --csv PATH   If specified, the output will be formatted as CSV and written to PATH
  -d, --drop-exact      If specified, exact matches will be ommited from the output
Generate a list of redirects
Say you want to know where to redirect all the URLs from old_sitemap.txt. Pass it as the first argument and the new sitemap as the second, like so:
python map.py old_sitemap.txt new_sitemap.txt
Adjust ambiguity threshold
To influence the level at which two matches are considered equally good, use the -t VALUE argument. VALUE is the range within which two scores are considered equal (default: 0.05).
python map.py -t 0.1 old_sitemap.txt new_sitemap.txt
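The idea behind the threshold can be sketched as follows. This is an illustration under assumptions, not the script's code: the function name ambiguous_candidates and the dictionary layout are hypothetical, and what map.py does with ambiguous matches is not shown here.

```python
def ambiguous_candidates(scores, threshold=0.05):
    """Return all candidates whose score lies within `threshold` of the best score.

    `scores` maps each candidate URL to its similarity score (assumed layout).
    If more than one candidate is returned, the match is considered ambiguous.
    """
    best = max(scores.values())
    return sorted(c for c, s in scores.items() if best - s <= threshold)


if __name__ == "__main__":
    scores = {"/blog/a": 0.90, "/blog/b": 0.87, "/contact": 0.50}
    print(ambiguous_candidates(scores))        # two candidates count as equally good
    print(ambiguous_candidates(scores, 0.01))  # a tighter threshold leaves one
```

A larger threshold treats more candidates as equally good, a smaller one fewer.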
Omit exact matches
If the results are used to set up 301 redirects on the new website to catch all traffic arriving at old URLs, exact matches can be omitted, since they will be handled by actual pages existing on the new site (list2). Use the -d flag for this.
python map.py -d old_sitemap.txt new_sitemap.txt
Save output to CSV file
Specify the output filename with the -c PATH argument.
python map.py -c results.csv old_sitemap.txt new_sitemap.txt
Aggregating URLs from an XML sitemap
A helper script, aggregate.py, crawls an XML sitemap and outputs a flat list of URLs, as required as input by map.py. Together with that tool, the whole process of generating a redirect map could look like the following. Afterwards, you would of course manually check results.csv, taking special care of matches with a low score (≤0.8).
python aggregate.py https://old-website.com/sitemap.xml > old.txt
python aggregate.py https://new-website.com/sitemap.xml > new.txt
python map.py --drop-exact --csv results.csv old.txt new.txt
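The manual review of low-scoring matches can be supported by a small filter. The sketch below assumes, without confirmation from the script, that each row of results.csv holds the old URL, the new URL and the score in that order. Check the actual column layout of your file and adjust the indices. The function name low_score_rows is hypothetical.

```python
import csv
import io


def low_score_rows(csv_file, cutoff=0.8):
    """Return (old, new, score) rows with score <= cutoff, worst matches first.

    Assumed column order: old URL, new URL, score.
    """
    rows = []
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue
        try:
            score = float(row[2])
        except ValueError:
            continue  # skip a header line or a malformed row
        if score <= cutoff:
            rows.append((row[0], row[1], score))
    return sorted(rows, key=lambda r: r[2])


if __name__ == "__main__":
    # In practice, pass open("results.csv", newline="") instead of this sample.
    sample = io.StringIO("/old/a,/new/a,0.95\n/old/b,/new/x,0.42\n")
    for old, new, score in low_score_rows(sample):
        print(old, new, score)
```

Sorting by score puts the most doubtful matches at the top of the review list.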
$ python aggregate.py [-h] URL/PATH

Aggregates URLs from a set of XML sitemaps listed under the entry path. This
script processes the XML file at given path, opens all sitemaps listed inside,
and prints all URLs inside those maps to stdout.

It should support most sitemaps that comply with the spec at
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd. It was tested with
sitemaps generated by the following WP plugins:

- (Google XML Sitemaps)[https://wordpress.org/plugins/google-sitemap-generator/]
- (XML Sitemap & Google News feeds)[https://wordpress.org/plugins/xml-sitemap-feed/]
- (Yoast SEO)[https://wordpress.org/plugins/wordpress-seo/]

Issues & Documentation: https://github.com/jsphpl/redirect-mapper

positional arguments:
  URL/PATH    Path or URL of the root sitemap.

optional arguments:
  -h, --help  show this help message and exit
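The general approach of walking a sitemap index can be sketched with the standard library's ElementTree. This is an illustration, not the code of aggregate.py: the function collect_urls and its fetch parameter are hypothetical, and fetch stands for any callable that returns the XML text for a given location (a file read or an HTTP request).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(location, fetch):
    """Recursively collect page URLs, starting at a sitemap or a sitemap index."""
    root = ET.fromstring(fetch(location))
    if root.tag.endswith("sitemapindex"):
        # An index lists further sitemaps; descend into each of them.
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(loc.text.strip(), fetch))
        return urls
    # A plain sitemap lists the page URLs directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


if __name__ == "__main__":
    xmlns = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
    docs = {
        "index.xml": f"<sitemapindex {xmlns}><sitemap><loc>posts.xml</loc></sitemap></sitemapindex>",
        "posts.xml": f"<urlset {xmlns}><url><loc>https://example.com/blog/a</loc></url></urlset>",
    }
    for url in collect_urls("index.xml", docs.__getitem__):
        print(url)
```

Passing the fetch step in as a callable keeps the parsing logic independent of whether the sitemaps come from disk or from the network.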