<a href="https://colab.research.google.com/github/nationalarchives/UKGWA-computational-access/blob/main/Extracting_data_from_the_UKGWA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load library functions and data

In [1]:
import sys
import os
import shutil
from matplotlib import pyplot
from matplotlib.pyplot import figure
from operator import itemgetter
import datetime
from gensim.summarization import summarizer
# Function to convert snapshot timestamp to a date
snap_to_date = lambda d : datetime.datetime.strptime(str(d), '%Y%m%d%H%M%S') # %I:%M%p')

In [2]:
if os.path.isdir('UKGWA-computational-access'):
    shutil.rmtree('UKGWA-computational-access')
!git clone https://github.com/nationalarchives/UKGWA-computational-access.git
sys.path.insert(0, 'UKGWA-computational-access/Code')
data_folder = "./UKGWA-computational-access/Data/"

Cloning into 'UKGWA-computational-access'...
remote: Enumerating objects: 181, done.[K
remote: Counting objects: 100% (181/181), done.[K
remote: Compressing objects: 100% (136/136), done.[K
remote: Total 181 (delta 91), reused 110 (delta 42), pack-reused 0[K
Receiving objects: 100% (181/181), 15.00 MiB | 20.45 MiB/s, done.
Resolving deltas: 100% (91/91), done.


In [19]:
from ukgwa_textindex import UKGWATextIndex
from ukgwa_cdx_indexer import TemporalIndexer
from ukgwa_index import UKGWAIndex
from ukgwa_url import UKGWAurl
from disco_search import DiscoSearch
from bs4 import BeautifulSoup
import requests
import ipywidgets

In [4]:
!pip install boilerpy3
import boilerpy3
from boilerpy3 import extractors

Collecting boilerpy3
  Downloading boilerpy3-1.0.5-py3-none-any.whl (22 kB)
Installing collected packages: boilerpy3
Successfully installed boilerpy3-1.0.5


In [5]:
ukgwa_prefix = "https://webarchive.nationalarchives.gov.uk/ukgwa/"
examples = [["Environment Agency", "http://www.environment-agency.gov.uk:80","19961104034437"],
            ["Salt", "http://www.salt.gov.uk", "20090810121540"],
            ["Butterflies", "http://www.ukbms.org", "20140402194840"],
            ["Space Education", "http://www.esero.org.uk", "20140508193217"],
            ["Professional Standards", "http://www.professionalstandards.org.uk", "20160506161604"],
            ["Investors in People", "http://www.investorsinpeople.co.uk/Pages/Home.aspx", "20120418183519"]]
example_dropdown = ipywidgets.widgets.Dropdown(options=[e[0] for e in examples])

## Extracting links from archived web pages

Opening and extracting from a web page is made easy by the Python library BeautifulSoup. It converts an HTML page into structural elements and provides functions for navigating the structure and extracting elements by type.
One element which is of interest is the hyperlink. This consists of a descriptive label and address of a web page. By extracting all of the links from a page, or a set of pages, we can create the data to perform network analysis.
Before doing that we need to be aware of the varying formats of urls which appear on a page.


The Environment Agency page has two formats of link: snapshot + page, or ./page. The latter is known as relative addressing which means it refers to files which are in a sub-folder of the current page address. The former we will call snapshot format.

The Salt home page also includes two formats: this time the relative addresses do not include the ./. There are also two links to a page in a different domain (food.gov.uk). In this case the address not only includes the snapshot but also has the prefix of the web archive domain. We will refer to this a complete address. An address in this format can be immediately navigated to (or crawled) without any pre-processing of the address.

The remainder of the sites follow similar patterns.

Begin by running the first code block to refresh the dropdown menu of example web pages. Once that is done you can select a page from the list and run the following block to see links from the selected site, as many times as you wish.

In [6]:
example_dropdown

Dropdown(options=('Environment Agency', 'Salt', 'Butterflies', 'Space Education', 'Professional Standards', 'I…

In [9]:
ex_id = example_dropdown.index

ex = examples[ex_id]
ex_url = UKGWAurl(ex[2] + "/" + ex[1])
page = requests.get(ex_url.get_url(crawl=True))
soup = BeautifulSoup(page.content, 'html.parser')

print("Links from", ex_url.get_url(prefix=False))
links = soup.find_all('a')
for ln in links:
    if ln.has_attr('href'):
        print("\t",ln['href'])

https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80
Links from 19961104034437/http://www.environment-agency.gov.uk:80
	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/cgi-bin/imagemap/maps/ea-home.map
	 ./who.html
	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/text/home.html
	 ./s-enviro.html
	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/new.html
	 ./info.html
	 ./who/hotline.html
	 ./publications.html
	 ./education.html
	 ./feedback.html
	 ./search.html
	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/cgi-bin/imagemap/maps/ea-nav.map
	 mailto:webmaster@environment-agency.gov.uk
	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/new.html


What the preceding code has demonstrated is that a certain amount of pre-processing is necessary to resolve link addresses to valid web archive addresses (complete addresses). The solution is simple but is a step that users should be aware of. The following steps should be taken to standardise addresses:

1. Complete addresses: leave as they are
2. Snapshot addresses: prefix with "https://webarchive.nationalarchives.gov.uk/ukgwa/"
3. Relative addresses: prefix with the complete address of the page containing the link which must also end in a "/"

The next thing to be aware of is that the snapshot timestamp in a web archive URL does not necessarily point to an existing page. The timestamp of a page refers to the time it was crawled which may have occurred seconds or even years after the timestamp in a hyperlink to that page.

We can use the CDX data to further analyse this. From the Environment Agency homepage we see that their State of the Environment report was captured 1154 seconds (19 minutes) later than the link address.

In [10]:
T = TemporalIndexer()
url = "http://www.environment-agency.gov.uk:80/s-enviro.html"
snapshot = 19961104034437
T.add_entry(url)
nearest_snapshot = snap_to_date(T.get_field(url, 'CDX').nearest_to(19961104034437))
(nearest_snapshot - snap_to_date(snapshot))

datetime.timedelta(seconds=1154)

In constrast, the "Who we are" page was first captured over 467 days later than the link address states. This is an extreme example but makes the point that whether a page linked to is crawled or not depends on both the depth of the crawl and the success of crawling an individual page.

In [11]:
T = TemporalIndexer()
url = "http://www.environment-agency.gov.uk:80/who.html"
snapshot = 19961104034437
T.add_entry(url)
nearest_snapshot = snap_to_date(T.get_field(url, 'CDX').nearest_to(19961104034437))
(nearest_snapshot - snap_to_date(snapshot))

datetime.timedelta(days=467, seconds=26444)

## Extracting content from archived web pages

One challenge with working with HTML pages is extracting text from those pages. Using the same examples from before, we can use the BeautifulSoup library to extract text from each page.

It should be immediately obvious that this is messy as the text also includes a javascript program which is used to render the page in the web archive interface.

As before, run the code to refresh the dropdown list before running the text extraction code.

In [12]:
example_dropdown

Dropdown(options=('Environment Agency', 'Salt', 'Butterflies', 'Space Education', 'Professional Standards', 'I…

In [13]:
ex_id = example_dropdown.index
ex = examples[ex_id]
ex_url = UKGWAurl(ex[2] + "/" + ex[1])
page = requests.get(ex_url.get_url(crawl=True))
soup = BeautifulSoup(page.content, 'html.parser')
soup.get_text()

'\n\n\n\n  wbinfo = {};\n  wbinfo.top_url = "https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/";\n  // Fast Top-Frame Redirect\n  if (window == window.top && wbinfo.top_url) {\n    var loc = window.location.href.replace(window.location.hash, "");\n    loc = decodeURI(loc);\n \n    if (loc != decodeURI(wbinfo.top_url)) {\n        window.location.href = wbinfo.top_url + window.location.hash;\n    }\n  }\n  wbinfo.url = "http://www.environment-agency.gov.uk:80/";\n  wbinfo.timestamp = "19961104034437";\n  wbinfo.request_ts = "19961104034437";\n  wbinfo.prefix = decodeURI("https://webarchive.nationalarchives.gov.uk/ukgwa/");\n  wbinfo.mod = "mp_";\n  wbinfo.is_framed = true;\n  wbinfo.is_live = false;\n  wbinfo.coll = "ukgwa";\n  wbinfo.proxy_magic = "";\n  wbinfo.static_prefix = "https://webarchive.nationalarchives.gov.uk/static/";\n  wbinfo.enable_auto_fetch = false;\n\n \n\n  wbinfo.wombat_ts = "19961104034437";\n  wbinfo.wombat_sec

Thankfully BeautifulSoup makes it easy to remove script elements from the HTML. This time we get pure text, however, if we were to compare this with the original web page we see that text from menus is also included. This might be ok for an individual page but when performing text analysis across a site the menu items will be repeated for every page thereby creating unwanted noise.

In [14]:
example_dropdown  # Please run me

Dropdown(options=('Environment Agency', 'Salt', 'Butterflies', 'Space Education', 'Professional Standards', 'I…

In [15]:
ex_id = example_dropdown.index
ex = examples[ex_id]
ex_url = UKGWAurl(ex[2] + "/" + ex[1])
page = requests.get(ex_url.get_url(crawl=True))
soup = BeautifulSoup(page.content, 'html.parser')
for s in soup.select('script'):
    s.extract()
soup.get_text().replace("\n"," ").replace("  "," ")

"    Environment Agency Home Page      Welcome to the Environment Agency of England and Wales Web Site. A text only version of this site is available. A key feature of this site is our State of the Environment report: this is a snapshot look at the pressures on the environment and how the quality of the environment has changed over the last twenty-five or so years.  This web site will be regularly updated and enhanced to provide comprehensive information and advice on the protection and management of the environment. Look in the information and resource pages for details on how to contact us.  You can also request copies of our publications and leaflets on line and review the educational services we will be offering. And, of course we encourage you to give us your comments on any aspect of the Agency.      To find specific information you can directly search the entire content of this site. An index is also available to aid navigation. Please enter your search request here:       © Env

A more sophisticated approach to extracting content is to use a boilerplate removal tool, such as boilerpy3. This tool will strip out all of the menus and only return the textual content of the page.

The tool needs to be used with care though. While the first example looks great, we see that example 2 returns nothing at all, while example 3 returns a copyright notice. This is because home pages are entry points to a site and are often to help find pages within the site. They are therefore light on content and heavy on boilerplate.

In [16]:
example_dropdown  # Please run me

Dropdown(options=('Environment Agency', 'Salt', 'Butterflies', 'Space Education', 'Professional Standards', 'I…

In [17]:
ex_id = example_dropdown.index
ex = examples[ex_id]
ex_url = UKGWAurl(ex[2] + "/" + ex[1])
page = requests.get(ex_url.get_url(crawl=True))
soup = BeautifulSoup(page.content, 'html.parser')
extractor = extractors.ArticleExtractor()
content = extractor.get_content(str(soup))
content

'Welcome to the Environment Agency of England and Wales Web Site.\nA text only version of this site is available.\nA key feature of this site is our State of the Environment report: this is a snapshot look at the pressures on the environment and how the quality of the environment has changed over the last twenty-five or so years.\nThis web site will be regularly updated and enhanced to provide comprehensive information and advice on the protection and management of the environment.  Look in the information and resource pages for details on how to contact us.\nYou can also request copies of our publications and leaflets on line and review the educational services we will be offering.  And, of course we encourage you to give us your comments on any aspect of the Agency.\nTo find specific information you can directly search the entire content of this site. An index is also available to aid navigation.\nPlease enter your search request here:\n'

## Auto-generating website summaries

In the first notebook we looked at the varios descriptions available in Discovery and the A-Z list. The most descriptive of these was the Administrative History from Discovery. Currently this only exists for around one-third of website. The content of the web archive is open and fully indexed so it can be searched but one feature of a good catalogue description is that it provides an overview of a document to let the user know whether that document is worth opening or requesting from the reading room.

The great thing about computational access is that in the absence of a catalogue summary we can automate the creation of our own. Using some of the code from earlier to find links, combined with the boilerplate removal tool, we can extract the content from pages linked to on the home page of a site, which are part of that site.

Then we can apply a machine learning algorithm called Document Summarisation (from the gensim NLP library) to summarise all of the pages into a short paragraph.
We will try it for one of the earlier examples. First, we crawl each linked to page and extract the content into a list:

We can compare some of these results with the Administrative History from the first notebook. First we need to set up the data including extracting from the Discovery API, as in notebook 1.

<b>Estimated running time 40 seconds</b>

In [20]:
W = UKGWAIndex()
W.indexfromweb()
W.discoveryfromfile(data_folder + 'ukgwa_catrefs.txt')
disco_web_lookup = {}
ex_urls = [ex[1] for ex in examples]
for w in W:
    url = W.get_field(w, 'URL')
    if url[-1] == "/":
        url = url[:-1]
    if url in ex_urls:
        catref = W.get_field(w, 'CATREF')
        disco_web_lookup[catref] = w

D = DiscoSearch()
D.add_entry('web AND snapshots')
example_admin = {}
for d in D:
    if D.get_field(d, 'reference') in disco_web_lookup:
        web_ref = disco_web_lookup[D.get_field(d, 'reference')]
        url = W.get_field(web_ref, 'URL')
        if url[-1] == "/":
            url = url[:-1]
        example_admin[W.get_field(web_ref, 'URL')] = D.get_field(d, 'adminHistory')


This code performs the crawling of the pages linked to from the home page (within the same site). It then summarises each page using the summarization function from gensim, and stores a list of the summaries.

In [21]:
example_dropdown

Dropdown(options=('Environment Agency', 'Salt', 'Butterflies', 'Space Education', 'Professional Standards', 'I…

In [23]:
ex_id = example_dropdown.index
ex = examples[ex_id]
ex_url = UKGWAurl(ex[2] + "/" + ex[1])
page = requests.get(ex_url.get_url(crawl=True))
soup = BeautifulSoup(page.content, 'html.parser')

print("Links from", ex_url.get_url(prefix=False))
links = soup.find_all('a')
summaries = []


for ln in links:
    if ln.has_attr('href'):
        href = ln['href']
        if href[0] == "#":
            continue
        if href[0:2] == "./":
            href = href[2:]
        parent_url = UKGWAurl(ex[2] + "/" + ex[1])
        link_url = UKGWAurl(href, parent_url)
        if link_url.get_domain() != parent_url.get_domain():
            continue
        print("\t", href, link_url)
        page = requests.get(link_url.get_url(crawl=True))
        soup = BeautifulSoup(page.content, 'html.parser')
        extractor = extractors.ArticleExtractor()
        content = extractor.get_content(str(soup))
        summary = summarizer.summarize(content, ratio = 0.2)
        summaries.append(summary)


Links from 19961104034437/http://www.environment-agency.gov.uk:80
	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/cgi-bin/imagemap/maps/ea-home.map https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/ukgwa/19961104034437mp_/http:/www.environment-agency.gov.uk:80/cgi-bin/imagemap/maps/ea-home.map
	 who.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/who.html




	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/text/home.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/ukgwa/19961104034437mp_/http:/www.environment-agency.gov.uk:80/text/home.html
	 s-enviro.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/s-enviro.html




	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/new.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/ukgwa/19961104034437mp_/http:/www.environment-agency.gov.uk:80/new.html
	 info.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/info.html
	 publications.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/publications.html
	 education.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/education.html




	 feedback.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/feedback.html




	 search.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/search.html




	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/cgi-bin/imagemap/maps/ea-nav.map https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/ukgwa/19961104034437mp_/http:/www.environment-agency.gov.uk:80/cgi-bin/imagemap/maps/ea-nav.map
	 /ukgwa/19961104034437mp_/http://www.environment-agency.gov.uk:80/new.html https://webarchive.nationalarchives.gov.uk/ukgwa/19961104034437/http://www.environment-agency.gov.uk:80/ukgwa/19961104034437mp_/http:/www.environment-agency.gov.uk:80/new.html


Next we concatenate all of those summaries together into one big document and summarise that. You can experiment with the length of the summary using the word count parameter but 100 words produces a nice summary which provides an overview of the web site.

It is not a catalogue description, it is built from heuristics not knowledge, but it is a powerful technique and this notebook environment provides an ideal playground to try it out.

Run this code to see the result and to compare it against the human curated Administrative History:

In [24]:
all_summaries = "\n".join(summaries)
print(ex[1])
print("")
print("Auto summary")
print(summarizer.summarize(all_summaries, word_count=100))
print("")
print("Admin History")
if ex[0] in example_admin:
    print(example_admin[ex[1]])
else:
    print(example_admin[ex[1] + "/"])



http://www.environment-agency.gov.uk:80

Auto summary


Admin History


KeyError: ignored