# Link Checker for OU-XML Documents

OU-XML is the gold master document format for teaching materials published on the OU VLE and elsewhere.

This script can be used to extract and check links referenced within OU-XML documents.

Reports are generated as CSV files (`.csv` file suffix) that may be viewed in a text editor, or loaded into a spreadsheet application.

## Utility Functions

Some utility functions for working with the OU-XML document and processing XML tags.

In [1]:
# Extract all the text inside a tag, including text inside child tags

import unicodedata

def flatten(el):
    ''' Utility function for flattening XML tags. '''
    if el is None: return
    result = [ (el.text or "") ]
    for sel in el:
        result.append(flatten(sel))
        result.append(sel.tail or "")
    return unicodedata.normalize("NFKD", "".join(result)) or ' '
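
For comparison, here is a minimal sketch of the same idea using the `itertext()` method that element objects provide in both `lxml` and the standard library. The sample element is made up for illustration, and the standard library parser is used only to keep the example free of dependencies.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample element with a nested child tag
el = ET.fromstring("<Title>Part 3 <i>Data</i> preparation</Title>")

# itertext() yields the element text, child text and tail text in document order
text = unicodedata.normalize("NFKD", "".join(el.itertext()))
print(text)
```

Unlike `flatten()`, this does not fall back to a single space when the element is empty, so the two are not drop-in replacements for each other.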

Iterating through a list of links and making a web request for each one can happen quickly, especially if we are just requesting page headers and not the complete contents of a page.

A trick many screenscrapers use is to add a small delay between requests, often randomising the delay times to make it less obvious that a scrape is in play. The following minimal function helps us play nice when making lots of web requests.

In [2]:
# Wait for a short random interval
# This is so we don't clobber a particular website or the network too badly!

import time
from random import random

def play_nice(delay=0.1, min_delay=0.01):
    """Add a random delay between consecutive web requests."""
    # It would probably make more sense to do this to throttle multiple requests to the same domain
    time.sleep(min_delay + delay*random())

We're going to want to get lists of XML files somehow; the following function will grab the names of files with a particular file suffix (by default, `.xml`) from a directory with a specified path:

In [3]:
# Get a list of xml files in a folder/directory

import os
from pathlib import Path

def get_xml_files(path_str='.', suffix='.xml'):
    path = Path(path_str)
    docs = [path/x for x in os.listdir(path) if x.endswith(suffix)]
    return docs

In [4]:
# Example usage
get_xml_files('./xml_test')

[PosixPath('xml_test/tm351-week3.xml'), PosixPath('xml_test/tm351-week5.xml')]
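
Note that `os.listdir()` returns names in arbitrary order and does not distinguish files from directories. If a stable ordering matters for the reports, a variant along the following lines may help. It is a sketch only: the name `get_files_by_suffix` and the throwaway directory are invented for the demonstration.

```python
import tempfile
from pathlib import Path

def get_files_by_suffix(path_str='.', suffix='.xml'):
    """Return a sorted list of files in a directory whose names end with suffix."""
    return sorted(p for p in Path(path_str).iterdir()
                  if p.is_file() and p.name.endswith(suffix))

# Demonstrate on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ['b.xml', 'a.xml', 'notes.txt']:
        (Path(tmp) / name).write_text('<Item/>')
    found = [p.name for p in get_files_by_suffix(tmp)]

print(found)
```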

As a nod to user feedback on long-running jobs, where we may be waiting for lots of links to be tested, the `tqdm` package provides a simple way to create progress bars.

In [5]:
# Import a package that displays progress bars nicely
from tqdm import tqdm

## Extract Information From OU-XML

Tool(s) to extract links and other information from OU-XML documents.

To make use of the XML content in a document, we need to open the document, read its contents and then parse it into a structure we can work with. Sometimes, we need to clean the document text up a bit before we can parse it...

In [6]:
# Open an XML file/document, read its contents and parse them as an XML etree object

from lxml import etree

def get_xml_from_doc(doc, clean=True):
    """Read file and parse as XML object."""
    xml = doc.read_text()
    
    if clean:
        for cleaner in ['<?sc-transform-do-oumusic-to-unicode?>',
                        '<?sc-transform-do-oxy-pi?>',
                        '<?xml version="1.0" encoding="utf-8"?>',
                        '<?xml version="1.0" encoding="UTF-8"?>',
                        '<?xml version="1.0" encoding="UTF-8" standalone="no"?>']:
            xml = xml.replace(cleaner, '')
            
    xml_root = etree.fromstring(xml)
    return xml_root
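
The cleaner list above only matches the exact declaration strings it names. As a hedged alternative, the sketch below uses regular expressions to strip any leading XML declaration and any `sc-transform-...` processing instruction. The pattern names, the `strip_declarations` helper and the sample string are assumptions for illustration, and the standard library parser stands in for `lxml` so the example runs without extra packages.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns: a leading XML declaration and sc-transform processing instructions
XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>')
SC_PI = re.compile(r'<\?sc-transform-[^>]*\?>')

def strip_declarations(xml):
    """Remove the XML declaration and sc-transform instructions from a document string."""
    xml = XML_DECL.sub('', xml, count=1)
    return SC_PI.sub('', xml)

sample = ('<?xml version="1.0" encoding="UTF-8"?>'
          '<?sc-transform-do-oxy-pi?>'
          '<Item><ItemTitle>Demo</ItemTitle></Item>')

cleaned = strip_declarations(sample)
root = ET.fromstring(cleaned)
print(root.find('ItemTitle').text)
```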

Having got the XML into a structure we can work with, we can start to treat it like a simple database from which we can extract different sorts of information.

OU-XML documents are structured in a particular way, so we can use prior knowledge of that structure to do things like pull out a course/module code and name, as well as the names of different sections in the document. We can use this information to help us report on where in a document we might find a particular broken link if we do discover that it is broken and in need of repair...

We also want to extract the links referenced in the document; that is the point of this recipe, after all...

In [7]:
# Extract metadata and links from a parsed OU-XML document

def parse_ouxml_metadata(courseRoot):
    """Extract some metadata from the OU-XML file."""
    _coursecode = courseRoot.find('.//CourseCode')
    coursecode = flatten(_coursecode)
    _coursetitle = courseRoot.find('.//CourseTitle')
    coursetitle = flatten(_coursetitle)

    _itemtitle = courseRoot.find('.//ItemTitle')
    itemtitle = flatten(_itemtitle)

    metadata = {'coursecode': coursecode,
                'coursetitle': coursetitle,
                'itemtitle': itemtitle,
               }
    return metadata


def extract_links_from_doc(courseRoot, unique_links=None):
    """Extract links from OU-XML document."""
    unique_links = [] if unique_links is None else unique_links
    links = {}

    # Grab some metadata
    metadata = parse_ouxml_metadata(courseRoot)
    sessions = courseRoot.findall('.//Session')
    
    for session in sessions:
        _links = []
        session_title = flatten(session.find('.//Title'))
        
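        for l in session.findall('.//a'):
            # Skip anchors without an href; strip the OU library proxy suffix
            _lhref = (l.get('href') or '').replace('.libezproxy.open.ac.uk','')
            if not _lhref:
                continue
            # Record every link found in this session
            _links.append((flatten(l), _lhref))
            # Track each URL (not the element) once in the shared list
            if _lhref not in unique_links:
                unique_links.append(_lhref)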

        links[session_title] = _links

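    # <BackMatter>
    backmatter = courseRoot.find('.//BackMatter')
    _links = []
    # Not every document is guaranteed to contain a BackMatter element
    for l in (backmatter.findall('.//a') if backmatter is not None else []):
        _lhref = (l.get('href') or '').replace('.libezproxy.open.ac.uk','')
        if not _lhref:
            continue
        _links.append((flatten(l), _lhref))
        # Track each URL (not the element) once in the shared list
        if _lhref not in unique_links:
            unique_links.append(_lhref)
    links['BackMatter'] = _links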
    
    doc_links = {'metadata': metadata, 'sessions': links}
    
    return doc_links, unique_links
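
To make the shape of the per-session result concrete, here is a small self-contained sketch run against a made-up, heavily simplified fragment. The element names `Session`, `Title`, `BackMatter` and `a` follow the code above; the fragment itself, the `Paragraph` wrapper and the URLs are invented, and the standard library parser is used in place of `lxml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified OU-XML fragment
sample = """<Item>
  <Session>
    <Title>1 Introduction</Title>
    <Paragraph>See <a href="http://example.com/a">Example A</a>.</Paragraph>
  </Session>
  <BackMatter>
    <a href="http://example.com.libezproxy.open.ac.uk/b">Example B</a>
  </BackMatter>
</Item>"""

root = ET.fromstring(sample)
links = {}
for block in root.findall('.//Session') + root.findall('.//BackMatter'):
    # Sessions are keyed by their title, the back matter by its tag name
    key = block.findtext('Title', default=block.tag)
    links[key] = [("".join(a.itertext()),
                   a.get('href', '').replace('.libezproxy.open.ac.uk', ''))
                  for a in block.findall('.//a')]

print(links)
```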

We can extract the links from a set of documents by extracting the links from each separate document in turn.

For convenience, we return two objects:

- a growing list of unique URLs we need to test;
- a list of URLs associated with each OU-XML document, broken down by session.

Note: there is also the notion of a `Unit` block element in the OU-XML definition but that is not currently parsed as a particular unit of organisation.

In [8]:
# Extract all the links from a set of documents

def extract_links_from_docs(docs):
    """Process a set of OU-XML documents and extract unique links from them all."""
    docs = docs if isinstance(docs, list) else [docs]
    unique_links = []
    doc_links = []
    for doc in docs:
        xml = get_xml_from_doc(doc)
        _doc_links, unique_links = extract_links_from_doc(xml, unique_links)
        _doc_links['metadata']['file'] = str(doc)
        doc_links.append(_doc_links)
        
    return doc_links, unique_links

In [9]:
docs = get_xml_files('./xml_test')
doc_links, unique_links = extract_links_from_docs(docs)

unique_links, doc_links

(['http://library.open.ac.uk',
  'https://www.open.ac.uk/libraryservices/resource/website:104152&f=28335',
  'https://www.open.ac.uk/libraryservices/resource/website:104153&f=28335',
  'https://www.open.ac.uk/libraryservices/resource/website:104154&f=28335',
  'https://www.open.ac.uk/libraryservices/resource/website:104223&f=28335',
  'https://www.open.ac.uk/libraryservices/resource/website:104155&f=28335',
  'http://www.digitalhealth.net/news/25742/',
  'http://edition.cnn.com/TECH/space/9909/30/mars.metric/',
  'https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/213809/dh_120579.pdf',
  'https://www.youtube.com/watch?v=k4gj_RdtKCw',
  'https://github.com/drichard/mindmaps',
  'http://blog.ouseful.info/2012/02/03/several-takes-on-the-definition-of-data-laundering/',
  'http://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/',
  'http://dc-pubs.dbs.uni-leipzig.de/files/Rahm2000DataCleaningProblemsand.pdf',
  'http://www.nytimes.

## Link Checker

A simple routine to check links. The link check will report on:

- broken links (404);
- temporary and permanent redirects;
- links using OU authentication (libezproxy; link domains are suffixed by `.libezproxy.open.ac.uk`) will have the libezproxy component stripped;
- links using liblinks (of the form `https://www.open.ac.uk/libraryservices/resource/website:ID&f=FID`; the `f=FID` seems redundant - what does it do?)

*TO DO: where appropriate/informative, consider reports for the full range of [`2xx` success](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#2xx_success) response codes, [`3xx` redirection](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#3xx_redirection) codes, [`4xx` client error](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_client_errors) codes and [`5xx` server error](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#5xx_server_errors) codes.*
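
As a starting point for that TO DO, a status code can be mapped onto its class with integer division. The helper name `status_class` and the label strings below are assumptions chosen for illustration.

```python
def status_class(code):
    """Map an HTTP status code to a coarse class label (hypothetical helper)."""
    if code is None:
        # The link checker uses None when the URL could not be resolved at all
        return 'unresolved'
    labels = {2: 'success', 3: 'redirection', 4: 'client error', 5: 'server error'}
    return labels.get(code // 100, 'other')

for code in (200, 301, 404, 503, None):
    print(code, status_class(code))
```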

The `link_reporter()` function defined next checks whatever URL it is given, duplicate or not; de-duplication is left to the routines that call it.

In [10]:
# Run a link check on a single link

import requests

def link_reporter(url, display=False, redirect_log=True):
    """Attempt to resolve a URL and report on how it was resolved."""
    if display:
        print(f"Checking {url}...")
    
    # Make request and follow redirects
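    try:
        # A timeout stops one unresponsive server from stalling the whole run
        r = requests.head(url, allow_redirects=True, timeout=10)
    except Exception:
        # Treat any failure (DNS, connection, timeout, malformed URL) as unresolved
        r = None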
    
    if r is None:
        return [(False, url, None, "Error resolving URL")]

    # Optionally create a report including each step of redirection/resolution
    steps = r.history + [r] if redirect_log else [r]
    
    step_reports = []
    for step in steps:
        step_report = (step.ok, step.url, step.status_code, step.reason)
        step_reports.append( step_report )
        if display:
            txt_report = f'\tok={step.ok} :: {step.url} :: {step.status_code} :: {step.reason}\n'
            print(txt_report)

    return step_reports

We can run some simple tests to see examples of particular sorts of errors. Other errors may be discovered when running the link checker over a wider range of resources.

We could capture and report on different sorts of status code in specific ways, if that is useful.

In [11]:
## Test link check reports
url_direct='https://rallydatajunkie.com/visualising-rally-stages/'
url_redirect='http://rallydatajunkie.github.io/visualising-rally-stages/'
url_liblink='https://www.open.ac.uk/libraryservices/resource/website:106223&f=28585'
url_404='http://www.digitalhealth.net/news/25742/'
url_ezproxy='http://ieeexplore.ieee.org.libezproxy.open.ac.uk/xpl/articleDetails.jsp?arnumber=4376143'
url_ezproxy_clean=url_ezproxy.replace('.libezproxy.open.ac.uk', '')

link_reporter(url_direct)
link_reporter(url_redirect)
link_reporter(url_liblink)
link_reporter(url_404)
link_reporter(url_ezproxy)
link_reporter(url_ezproxy_clean)

[(True,
  'http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4376143',
  301,
  'Moved Permanently'),
 (True,
  'https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4376143',
  302,
  'Moved Temporarily'),
 (True,
  'https://ieeexplore.ieee.org/document/4376143/?arnumber=4376143',
  200,
  'OK')]
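
Each step report is an `(ok, url, status_code, reason)` tuple, with the final resolution last. A short sketch of how such a trace might be summarised follows; `summarise_chain` is a hypothetical helper, and the data is copied from the output above.

```python
def summarise_chain(step_reports):
    """Summarise a list of (ok, url, status_code, reason) tuples."""
    ok, final_url, status, _reason = step_reports[-1]
    return {'hops': len(step_reports) - 1,  # number of redirects followed
            'final_url': final_url,
            'final_status': status,
            'ok': ok}

# Step reports copied from the output above
steps = [
    (True, 'http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4376143',
     301, 'Moved Permanently'),
    (True, 'https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4376143',
     302, 'Moved Temporarily'),
    (True, 'https://ieeexplore.ieee.org/document/4376143/?arnumber=4376143',
     200, 'OK'),
]

summary = summarise_chain(steps)
print(summary)
```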

## Checking Links from One or More Documents

A routine to check links from OU-XML documents found in a directory:

In [12]:
# Run link checks over a set of links

def check_multiple_links(urls, display=False, redirect_log=True):
    """Check multiple links."""
    
    # Use unique links
    urls = list(set(urls)) if isinstance(urls, list) else [urls]
    
    link_reports = {}
    for url in tqdm(urls):
        play_nice()
        link_report = link_reporter(url, display, redirect_log)
        link_reports[url] = link_report
        
    return link_reports
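
One side effect of using `set()` to remove duplicates is that the original ordering of the links is lost. If order matters, `dict.fromkeys()` is a common alternative because dictionaries preserve insertion order. The URLs below are placeholders.

```python
# Placeholder URLs, one of them repeated
urls = ['http://example.com/b', 'http://example.com/a', 'http://example.com/b']

# Keys are unique and keep the order in which they were first seen
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)
```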

In [13]:
# Test a set of links
test_links = [url_direct, url_redirect, url_404]

link_reports = check_multiple_links(test_links, display=True)
link_reports

 33%|███████████████                              | 1/3 [00:00<00:00,  7.52it/s]

Checking https://rallydatajunkie.com/visualising-rally-stages/...
	ok=True :: https://rallydatajunkie.com/visualising-rally-stages/ :: 200 :: OK

Checking http://rallydatajunkie.github.io/visualising-rally-stages/...


 67%|██████████████████████████████               | 2/3 [00:00<00:00,  7.03it/s]

	ok=True :: http://rallydatajunkie.github.io/visualising-rally-stages/ :: 301 :: Moved Permanently

	ok=True :: https://rallydatajunkie.com/visualising-rally-stages/ :: 200 :: OK

Checking http://www.digitalhealth.net/news/25742/...


100%|█████████████████████████████████████████████| 3/3 [00:03<00:00,  1.17s/it]

	ok=True :: http://www.digitalhealth.net/news/25742/ :: 301 :: Moved Permanently

	ok=True :: https://www.digitalhealth.net/news/25742/ :: 301 :: Moved Permanently

	ok=False :: https://www.digitalhealth.net/404-error-page/ :: 404 :: Not Found






{'https://rallydatajunkie.com/visualising-rally-stages/': [(True,
   'https://rallydatajunkie.com/visualising-rally-stages/',
   200,
   'OK')],
 'http://rallydatajunkie.github.io/visualising-rally-stages/': [(True,
   'http://rallydatajunkie.github.io/visualising-rally-stages/',
   301,
   'Moved Permanently'),
  (True, 'https://rallydatajunkie.com/visualising-rally-stages/', 200, 'OK')],
 'http://www.digitalhealth.net/news/25742/': [(True,
   'http://www.digitalhealth.net/news/25742/',
   301,
   'Moved Permanently'),
  (True,
   'https://www.digitalhealth.net/news/25742/',
   301,
   'Moved Permanently'),
  (False, 'https://www.digitalhealth.net/404-error-page/', 404, 'Not Found')]}

From the set of links, we can create a report showing any dead links:

In [14]:
# Find dead links

def dead_link_report(link_reports):
    """Create a list of dead links."""
    dead_links = {}
    
    for link in link_reports:
        link_report = link_reports[link]
        if not link_report[-1][0]:
            dead_links[link] = link_report

    return dead_links


In [15]:
dead_link_report(link_reports)

{'http://www.digitalhealth.net/news/25742/': [(True,
   'http://www.digitalhealth.net/news/25742/',
   301,
   'Moved Permanently'),
  (True,
   'https://www.digitalhealth.net/news/25742/',
   301,
   'Moved Permanently'),
  (False, 'https://www.digitalhealth.net/404-error-page/', 404, 'Not Found')]}
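
A quick tally of final status codes can complement the dead link list. The sketch below assumes the same `{url: [step tuples]}` structure as above; `final_status_counts` is a hypothetical helper and the sample data is abbreviated from the earlier test output.

```python
from collections import Counter

def final_status_counts(link_reports):
    """Count the status code of the last step in each link report."""
    return Counter(report[-1][2] for report in link_reports.values())

# Abbreviated sample in the same shape as the link reports above
sample_reports = {
    'https://rallydatajunkie.com/visualising-rally-stages/': [
        (True, 'https://rallydatajunkie.com/visualising-rally-stages/', 200, 'OK')],
    'http://www.digitalhealth.net/news/25742/': [
        (True, 'http://www.digitalhealth.net/news/25742/', 301, 'Moved Permanently'),
        (False, 'https://www.digitalhealth.net/404-error-page/', 404, 'Not Found')],
}

counts = final_status_counts(sample_reports)
print(counts)
```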

We can also create a report per document. This is perhaps more useful because we can see which sections contain which dead links, if any.

In [18]:
def link_reporter_by_docs(doc_links):
    """Link reports by document."""
    doc_links_reports = []
    doc_links_nok_reports = []
    
    unique_link_reports = {}
    
    for doc in doc_links:

        doc_links_report = {'metadata': doc['metadata'], 'sessions': {}}
        doc_links_nok_report = {'metadata': doc['metadata'], 'sessions': {}}
        
        for session in tqdm(doc['sessions']):
            link_reports = []
            nok_link_reports = []
            for (title, url) in doc['sessions'][session]:
                # Only request each unique URL once
                if url in unique_link_reports:
                    link_report = unique_link_reports[url]
                else:
                    play_nice()
                    link_report = link_reporter(url)
                    unique_link_reports[url] = link_report
                ok = link_report[-1][0]
                link_reports.append( (title, url, link_report, ok))
                
                if not ok:
                    nok_link_reports.append( (title, url, link_report, ok))

            doc_links_report['sessions'][session] = link_reports
            if nok_link_reports:
                doc_links_nok_report['sessions'][session] = nok_link_reports

        doc_links_reports.append(doc_links_report)
        doc_links_nok_reports.append(doc_links_nok_report)
        
    return doc_links_reports, doc_links_nok_reports, unique_link_reports

In [19]:
link_reports, bad_link_reports, unique_link_reports = link_reporter_by_docs(doc_links)

100%|█████████████████████████████████████████████| 7/7 [00:11<00:00,  1.61s/it]
100%|█████████████████████████████████████████████| 7/7 [00:29<00:00,  4.24s/it]


In [20]:
bad_link_reports

[{'metadata': {'coursecode': 'TM351',
   'coursetitle': 'Data management and analysis',
   'itemtitle': 'Part 3 Data preparation',
   'file': 'xml_test/tm351-week3.xml'},
  'sessions': {'BackMatter': [('http://www.digitalhealth.net/news/25742/',
     'http://www.digitalhealth.net/news/25742/',
     [(True,
       'http://www.digitalhealth.net/news/25742/',
       301,
       'Moved Permanently'),
      (True,
       'https://www.digitalhealth.net/news/25742/',
       301,
       'Moved Permanently'),
      (False,
       'https://www.digitalhealth.net/404-error-page/',
       404,
       'Not Found')],
     False),
    ('http://www.nytimes.com/2006/02/15/national/15house.html?_r=0',
     'http://www.nytimes.com/2006/02/15/national/15house.html?_r=0',
     [(True,
       'http://www.nytimes.com/2006/02/15/national/15house.html?_r=0',
       301,
       'Moved Permanently'),
      (False,
       'https://www.nytimes.com/2006/02/15/national/15house.html?_r=0',
       403,
       'Forbidde

In [23]:
unique_link_reports

{'http://library.open.ac.uk': [(True,
   'http://library.open.ac.uk/',
   301,
   'Moved Permanently'),
  (True, 'https://www.open.ac.uk/library/', 200, 'OK')],
 'https://www.open.ac.uk/libraryservices/resource/website:104152&f=28335': [(True,
   'https://www.open.ac.uk/libraryservices/resource/website:104152&f=28335',
   302,
   'Found'),
  (True,
   'https://www.digitalhealth.net/2010/04/inquiry-into-transplant-database-errors/',
   200,
   'OK')],
 'https://www.open.ac.uk/libraryservices/resource/website:104153&f=28335': [(True,
   'https://www.open.ac.uk/libraryservices/resource/website:104153&f=28335',
   302,
   'Found'),
  (True,
   'https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/213809/dh_120579.pdf',
   301,
   'Moved Permanently'),
  (True,
   'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/213809/dh_120579.pdf',
   200,
   'OK')],
 'https://www.open.ac.uk/libraryservices/resource/website:104154&f=283

Whilst the JSON file is convenient for storing data, and easy to work with if you know how, a tabular CSV report is probably more useful in many cases.

The following is a first quick attempt at a report that contains enough information to be useful when it comes to finding which section of which file each broken link appears in.

In [24]:
# Crude CSV report of broken links

import csv

def simple_csv_report(links_report, outf='link_report.csv'):
    """Generate a simple CSV link check report."""

    # newline='' is what the csv module expects, and avoids blank rows on Windows
    with open(outf, 'w', newline='') as f:
        write = csv.writer(f)
        cols = ['file', 'code', 'title', 'item', 'session', 'linktext', 'link', 'error']
        write.writerow(cols)

        for link_report in links_report:
            row_base = [link_report['metadata']['file'],
                        link_report['metadata']['coursecode'],
                        link_report['metadata']['coursetitle'],
                        link_report['metadata']['itemtitle']]

            rows = []
            for session in link_report['sessions']:
                for link in link_report['sessions'][session]:
                    row = row_base + [session, link[0], link[1], link[2][-1][-2]]
                    rows.append(row)
 
            write.writerows(rows) 

In [25]:
simple_csv_report(bad_link_reports)
!cat link_report.csv

file,code,title,item,session,linktext,link,error
xml_test/tm351-week3.xml,TM351,Data management and analysis,Part 3 Data preparation,BackMatter,http://www.digitalhealth.net/news/25742/,http://www.digitalhealth.net/news/25742/,404
xml_test/tm351-week3.xml,TM351,Data management and analysis,Part 3 Data preparation,BackMatter,http://www.nytimes.com/2006/02/15/national/15house.html?_r=0,http://www.nytimes.com/2006/02/15/national/15house.html?_r=0,403
xml_test/tm351-week5.xml,TM351,Data management and analysis,Part 5 Presentation: telling the story,3 Finding the story,When the data you don’t collect affects the data you do,http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/diary-data-sleuth-when-the-data-you-dont-collect-affects-the-data-you-do ,404
xml_test/tm351-week5.xml,TM351,Data management and analysis,Part 5 Presentation: telling the story,BackMatter,http://www.jstor.org.libezproxy.open.ac.uk/stable/2245820?seq=1#page_scan_tab_content
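
The report can also be read back programmatically. The following sketch parses CSV text with `csv.DictReader` and groups the broken links by file; the two data rows are copied from the output above, and an in-memory string stands in for `link_report.csv`.

```python
import csv
import io

# Header and two complete rows copied from the report output above
sample = (
    "file,code,title,item,session,linktext,link,error\n"
    "xml_test/tm351-week3.xml,TM351,Data management and analysis,Part 3 Data preparation,"
    "BackMatter,http://www.digitalhealth.net/news/25742/,"
    "http://www.digitalhealth.net/news/25742/,404\n"
    "xml_test/tm351-week3.xml,TM351,Data management and analysis,Part 3 Data preparation,"
    "BackMatter,http://www.nytimes.com/2006/02/15/national/15house.html?_r=0,"
    "http://www.nytimes.com/2006/02/15/national/15house.html?_r=0,403\n"
)

# Group (link, error) pairs by the file they were found in
by_file = {}
for row in csv.DictReader(io.StringIO(sample)):
    by_file.setdefault(row['file'], []).append((row['link'], row['error']))

print(by_file)
```

Note that every value read from a CSV file is a string, so the status codes come back as `'404'` rather than `404`.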

To try to make things easier to use, we also define (in the `cli.py` file) a simple command line interface that can call the following function, with a directory path, to run the link checker over links extracted from the `.xml` files in the specified directory.

Usage on the command line is then of the form:

`ouxml_tools link-check PATH/TO/DIR/CONTAINING/OU-XML_FILES`

with three output reports generated:

- a csv file report containing minimal broken link information (`broken_links_report.csv`);
- complete reports of all link resolution traces and just the broken link resolution traces in `all_links_report.json` and `broken_links_report.json` respectively.
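
The contents of `cli.py` are not shown here, so the following is only a hypothetical sketch of how a `link-check` subcommand taking a directory path could be wired up with the standard library `argparse` module. The actual implementation may use a different library and different options.

```python
import argparse

def build_parser():
    """Build a hypothetical parser for: ouxml_tools link-check PATH"""
    parser = argparse.ArgumentParser(prog='ouxml_tools')
    subparsers = parser.add_subparsers(dest='command', required=True)
    link_check = subparsers.add_parser('link-check',
                                       help='Check links in OU-XML files')
    link_check.add_argument('path', help='Directory containing OU-XML files')
    return parser

# Parse an example command line rather than sys.argv
args = build_parser().parse_args(['link-check', './xml_test'])
print(args.command, args.path)
```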

## Submitting Links to the Internet Archive

To ensure that web page content remains available even if a website disappears, we can try to archive the content on the Internet Archive.

The following function will attempt to submit a URL for archiving on the Internet Archive:

In [26]:
import json
import urllib.parse

def archive_link(url):
    """Submit link archive request to Interet Archive."""

    quoted_url = urllib.parse.quote(url)
    url_ = f"https://web.archive.org/save/{quoted_url}"
    # Should probably capture response and generate archive request report
    r = requests.get(url_, allow_redirects=True)
    return url, r
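
It can be useful to see exactly what request URL is built before sending anything. With its default settings, `urllib.parse.quote()` leaves `/` alone but percent-encodes characters such as `:`, `?` and `=`. The helper name `build_save_url` and the example URL below are assumptions for illustration; whether the archive service treats the encoded and unencoded forms identically is not something this sketch establishes.

```python
import urllib.parse

def build_save_url(url):
    """Build the save request URL the same way archive_link() does."""
    return f"https://web.archive.org/save/{urllib.parse.quote(url)}"

save_url = build_save_url("http://example.com/page?id=1")
print(save_url)
```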

It is more likely that we will want to submit multiple URLs to the archiver.

Links with particular HTTP status codes, as described in the link check status report, can be included in, or excluded from, the list of links that is submitted.

In [48]:
def get_valid_links(link_reports, include=None, exclude=None):
    """Generate a list of valid links from a dict with URL keys and report values."""
    include = [] if include is None else include
    exclude = [] if exclude is None else exclude
    not_valid_url = []
    excluded_url = []
    link_reports_ = []
    
    print("Finding valid links for this process...")
    # Generate a list of links we want to archive
    # Note that this is not a list of unique URLs
    for link_ in link_reports:
        # Get status report of last step in forwarded request
        status = link_reports[link_][-1][2]
        if status is None:
            not_valid_url.append(link_)
            continue
        # We don't want excluded report status links
        if status not in exclude:
            # We want all links or just included links
            if not include or status in include:
                link_reports_.append(link_)
        else:
            excluded_url.append(link_)
            
    return link_reports_, excluded_url, not_valid_url
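
The include/exclude logic can be easier to follow when reduced to a decision on a single status code. The sketch below mirrors the branches in `get_valid_links()`; the `classify` helper and its label strings are hypothetical. It also makes visible that a link whose status is neither excluded nor in a non-empty include list is dropped without being recorded anywhere.

```python
def classify(status, include=(), exclude=()):
    """Mirror the include/exclude decision for one final status code."""
    if status is None:
        return 'not_valid'   # URL could not be resolved at all
    if status in exclude:
        return 'excluded'
    if not include or status in include:
        return 'kept'
    return 'dropped'         # not excluded, but not in the include list either

print(classify(200, include=[200]))   # kept
print(classify(404, exclude=[404]))   # excluded
print(classify(403, include=[200]))   # dropped
```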

In [55]:
def archive_links(link_reports, include=None, exclude=None):
    """Submit each unique link in a link status report dictionary to the Internet Archive."""
    archived = []
    not_archived = []

    link_reports_, excluded_url, not_valid_url = get_valid_links(link_reports, include, exclude)

    for link_ in tqdm(link_reports_):
        print(f"Archiving: {link_}")
        url, response = archive_link(link_)
        if response.ok:
            archived.append(url)
        else:
            not_archived.append(url)
        play_nice()

    for url in archived:
        print(f"Archived: {url}")
    for url in not_archived:
        print(f"Not archived: {url}")
    for url in excluded_url:
        print(f"Excluded: {url}")
    for url in not_valid_url:
        print(f"Not valid URLs: {url}")

In [56]:
archive_links(unique_link_reports)

Finding valid links for this process...


  0%|                                                    | 0/61 [00:00<?, ?it/s]

Archiving: http://library.open.ac.uk


  2%|▋                                           | 1/61 [00:07<07:56,  7.94s/it]

Archiving: https://www.open.ac.uk/libraryservices/resource/website:104152&f=28335


  3%|█▍                                          | 2/61 [00:08<03:48,  3.88s/it]

Archiving: https://www.open.ac.uk/libraryservices/resource/website:104153&f=28335


  5%|██▏                                         | 3/61 [00:09<02:26,  2.53s/it]

Archiving: https://www.open.ac.uk/libraryservices/resource/website:104154&f=28335


  7%|██▉                                         | 4/61 [00:11<02:13,  2.33s/it]

Archiving: https://www.open.ac.uk/libraryservices/resource/website:104223&f=28335


  8%|███▌                                        | 5/61 [00:12<01:42,  1.83s/it]

Archiving: https://www.open.ac.uk/libraryservices/resource/website:104155&f=28335


 10%|████▎                                       | 6/61 [00:13<01:23,  1.51s/it]

Archiving: http://www.digitalhealth.net/news/25742/


 11%|█████                                       | 7/61 [00:14<01:13,  1.35s/it]

Archiving: http://edition.cnn.com/TECH/space/9909/30/mars.metric/


 13%|█████▊                                      | 8/61 [00:16<01:20,  1.51s/it]

Archiving: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/213809/dh_120579.pdf


 15%|██████▍                                     | 9/61 [00:17<01:10,  1.36s/it]

Archiving: https://www.youtube.com/watch?v=k4gj_RdtKCw


 16%|███████                                    | 10/61 [00:18<01:03,  1.24s/it]

Archiving: https://github.com/drichard/mindmaps


 18%|███████▊                                   | 11/61 [00:19<00:56,  1.13s/it]

Archiving: http://blog.ouseful.info/2012/02/03/several-takes-on-the-definition-of-data-laundering/


 20%|████████▍                                  | 12/61 [00:21<01:05,  1.34s/it]

Archiving: http://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/


 21%|█████████▏                                 | 13/61 [00:22<01:03,  1.32s/it]

Archiving: http://dc-pubs.dbs.uni-leipzig.de/files/Rahm2000DataCleaningProblemsand.pdf


 23%|█████████▊                                 | 14/61 [00:24<01:10,  1.51s/it]

Archiving: http://www.nytimes.com/2006/02/15/national/15house.html?_r=0


 25%|██████████▌                                | 15/61 [00:26<01:17,  1.68s/it]

Archiving: http://www.nwitimes.com/news/local/a-million-home/article_899b39b3-fbdf-522f-a6fc-e8ae73eb8c96.html


 25%|██████████▌                                | 15/61 [00:28<01:27,  1.91s/it]


KeyboardInterrupt: 

We can generate screenshots for the links:

In [61]:
def screenshot_grabber(link_reports, include=None, exclude=None):
    """Grab screenshots for links."""
    import unicodedata
    import string

    valid_filename_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    char_limit = 255

    def clean_filename(filename, whitelist=valid_filename_chars, replace=None):
        # replace spaces and . by default
        replace = [" ", "."] if replace is None else replace
        filename = filename.split("://")[-1]
        for r in replace:
            filename = filename.replace(r, '_')

        # keep only valid ascii chars
        cleaned_filename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore').decode()

        # keep only whitelisted chars
        cleaned_filename = ''.join(c for c in cleaned_filename if c in whitelist)
        if len(cleaned_filename)>char_limit:
            print("Warning, filename truncated because it was over {}. Filenames may no longer be unique".format(char_limit))
        return cleaned_filename[:char_limit]

    links, excluded_url, not_valid_url = get_valid_links(link_reports, include, exclude)
    
    from playwright.sync_api import sync_playwright
    img_path = "grab_link_screenshots"
    # Make sure the output directory exists
    Path(img_path).mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.webkit.launch()
        page = browser.new_page()
        for shot_url in links:
            try:
                page.goto(shot_url)
                page.screenshot(path=Path(img_path) / f"{clean_filename(shot_url)}.png")
            except Exception:
                # Carry on with the remaining links if one page fails to load
                print(f"\t- failed to grab screenshot for {shot_url}")
        browser.close()
        print(f"\nScreenshots saved to {img_path}")

In [62]:
#screenshot_grabber(unique_link_reports)

Finding valid links for this process...


Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.
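
The error arises because a notebook already runs an asyncio event loop. A hedged sketch of an Async API counterpart follows. The function name `grab_screenshots_async` and the numbered filenames are assumptions for illustration, and the function has not been run against the links in this document.

```python
import asyncio
from pathlib import Path

async def grab_screenshots_async(urls, img_path="grab_link_screenshots"):
    """Hypothetical async counterpart to screenshot_grabber()."""
    # Imported here so the definition loads even where Playwright is not installed
    from playwright.async_api import async_playwright

    out_dir = Path(img_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    async with async_playwright() as pw:
        browser = await pw.webkit.launch()
        page = await browser.new_page()
        for i, url in enumerate(urls):
            try:
                await page.goto(url)
                # Numbered filenames keep the sketch short; reuse clean_filename() if preferred
                await page.screenshot(path=out_dir / f"shot_{i}.png")
            except Exception:
                print(f"\t- failed to grab screenshot for {url}")
        await browser.close()

# In a notebook cell: await grab_screenshots_async(["https://www.open.ac.uk/"])
# In a plain script:  asyncio.run(grab_screenshots_async(["https://www.open.ac.uk/"]))
```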

We can generate the link status report for links extracted from one or more files, optionally calling the archiver, using the following function:

In [29]:
def link_check_reporter(path,
                        archive=False,
                        strong_archive=False,
                        grab_screenshots=False,
                        display=False, redirect_log=True):
    """Run link checks."""
    print("Getting files...")
    docs = get_xml_files(path)
    doc_links, unique_links = extract_links_from_docs(docs)

    print("Getting link statuses for each document section...")
    link_reports, bad_link_reports, unique_link_reports = link_reporter_by_docs(doc_links)

    print("Writing status reports...")
    with open('all_links_report.json', 'w') as f:
        json.dump(link_reports, f)
    with open('broken_links_report.json', 'w') as f:
        json.dump(bad_link_reports, f)
    simple_csv_report(bad_link_reports, outf='broken_links_report.csv')

    if archive or strong_archive:
        print("Archiving links...")
        if strong_archive:
            archive_links(unique_link_reports, exclude=[404])
        elif archive:
            archive_links(unique_link_reports, include=[200])

    if grab_screenshots:
        print("Screengrabbing links...")
        screenshot_grabber(unique_link_reports, include=None, exclude=None)