<a href="https://colab.research.google.com/github/nationalarchives/UKGWA-computational-access/blob/main/Extracting_data_from_the_UKGWA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load library functions and data

In [None]:
import sys
import os
import shutil
from matplotlib import pyplot
from matplotlib.pyplot import figure
from operator import itemgetter
import datetime
from gensim.summarization import summarizer  # needs gensim < 4.0; the summarization module was removed in 4.0
# Function to convert a 14-digit snapshot timestamp (YYYYMMDDhhmmss) to a datetime
snap_to_date = lambda d : datetime.datetime.strptime(str(d), '%Y%m%d%H%M%S')

In [None]:
if os.path.isdir('AURA-Article'):
    shutil.rmtree('AURA-Article')
!git clone https://github.com/mark-bell-tna/AURA-Article.git
sys.path.insert(0, 'AURA-Article/Code')
data_folder = "./AURA-Article/Data/"

Cloning into 'AURA-Article'...
remote: Enumerating objects: 110, done.
remote: Counting objects: 100% (110/110), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 110 (delta 44), reused 93 (delta 32), pack-reused 0
Receiving objects: 100% (110/110), 14.94 MiB | 38.73 MiB/s, done.
Resolving deltas: 100% (44/44), done.


In [None]:
from disco_search import DiscoSearch
from ukgwa_textindex import UKGWATextIndex
from ukgwa_cdx_indexer import TemporalIndexer
from ukgwa_index import UKGWAIndex
from ukgwa_url import UKGWAurl
from bs4 import BeautifulSoup
import requests

In [None]:
!pip install boilerpy3
import boilerpy3
from boilerpy3 import extractors



## Extracting links from archived web pages

Opening and extracting from a web page is made easy by the Python library BeautifulSoup. It converts an HTML page into structural elements and provides functions for navigating the structure and extracting elements by type.
One element which is of interest is the hyperlink. This consists of a descriptive label and address of a web page. By extracting all of the links from a page, or a set of pages, we can create the data to perform network analysis.
Before doing that we need to be aware of the varying formats of URLs which appear on a page.

Change the ex_id variable in the following code to print out the links from different pages.

The Environment Agency page has two formats of link: snapshot + page, or ./page. The latter is known as relative addressing, which means the link is resolved relative to the address of the current page. The former we will call snapshot format.

The Salt home page also includes two formats: this time the relative addresses do not include the ./. There are also two links to a page in a different domain (food.gov.uk). In this case the address not only includes the snapshot but also has the prefix of the web archive domain. We will refer to this as a complete address. An address in this format can be navigated to (or crawled) immediately, without any pre-processing.

The Office for Rail Regulation (orr.gov.uk) links are almost exclusively in the format of a complete address with only a few in snapshot format.

In [None]:
ex_id = 3  # Change this to a number between 1 and 6 to change the example

ukgwa_prefix = "https://webarchive.nationalarchives.gov.uk/"
examples = [["http://www.environment-agency.gov.uk:80","19961104034437"],
            ["http://www.salt.gov.uk", "20090810121540"],
            ["http://www.ukbms.org", "20140402194840"],
            ["http://www.esero.org.uk", "20140508193217"],
            ["http://www.professionalstandards.org.uk", "20160506161604"],
            ["http://www.investorsinpeople.co.uk/Pages/Home.aspx", "20120418183519"]]

ex = examples[ex_id-1]
page = requests.get(ukgwa_prefix + ex[1] + "/" + ex[0])
soup = BeautifulSoup(page.content, 'html.parser')

print("Links from", ex[1] + "/" + ex[0])
links = soup.find_all('a')
for ln in links:
    if ln.has_attr('href'):
        print("\t",ln['href'])

Links from 20140402194840/http://www.ukbms.org
	 #NavigationMenu_SkipLink
	 Default.aspx
	 involved.aspx
	 wcbs.aspx
	 indicators.aspx
	 KeyFindings.aspx
	 news.aspx
	 reportsAndPublications.aspx
	 resources.aspx
	 Specieslist.aspx
	 map.aspx
	 About.aspx
	 Methods.aspx
	 Obtaining.aspx
	 Links.aspx
	 Contacts.aspx
	 #SiteMapDataProvider_SkipLink
	 wcbs.aspx
	 about.aspx
	 Methods.aspx
	 Obtaining.aspx
	 https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.butterfly-conservation.org/bnm/atlas/index.html
	 https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.naturebureau.co.uk/shop/books/StateofButterflies.html
	 docs/reports/2013/Country-level Summary of changes Table 2013.pdf
	 docs/reports/2013/UK Summary of changes Table 2013.pdf
	 Downloads\Wider_Countryside\Newsletter_WCBS2013.pdf
	 downloads/National Butterfly Recorders' Meeting 2014 Programme.pdf
	 https://webarchive.nationalarchives.gov.uk/20140402194840/http://butterfly-conservation.org/244-4890

What the preceding code has demonstrated is that a certain amount of pre-processing is necessary to resolve link addresses to valid web archive addresses (complete addresses). The solution is simple but is a step that users should be aware of. The following steps should be taken to standardise addresses:

1. Complete addresses: leave as they are
2. Snapshot addresses: prefix with "https://webarchive.nationalarchives.gov.uk/"
3. Relative addresses: prefix with the complete address of the page containing the link which must also end in a "/"
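
The three rules above can be sketched as a small helper. This is a minimal illustration, not the logic used by the UKGWAurl class: the function name standardise_address and the pattern used to recognise a snapshot address (a 14-digit timestamp at the start of the link, optionally preceded by "/") are assumptions made for this example.

```python
import re

UKGWA_PREFIX = "https://webarchive.nationalarchives.gov.uk/"

# Assumption: a snapshot address starts with a 14-digit timestamp,
# optionally followed by a modifier such as "tf_", then "/".
SNAPSHOT_RE = re.compile(r"^/?\d{14}[a-z_]*/")


def standardise_address(href, page_address):
    """Resolve a link found on an archived page to a complete address."""
    # 1. Complete addresses: leave as they are
    if href.startswith(UKGWA_PREFIX):
        return href
    # 2. Snapshot addresses: prefix with the web archive domain
    if SNAPSHOT_RE.match(href):
        return UKGWA_PREFIX + href.lstrip("/")
    # 3. Relative addresses: prefix with the address of the containing page,
    #    which must end in "/"
    if href.startswith("./"):
        href = href[2:]
    if not page_address.endswith("/"):
        page_address += "/"
    return page_address + href


page = UKGWA_PREFIX + "20140402194840/http://www.ukbms.org"
print(standardise_address("wcbs.aspx", page))
```

The sketch deliberately ignores in-page anchors (links starting with "#") and root-relative links (starting with a single "/" followed by a path), which would need their own rules.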

The next thing to be aware of is that the snapshot timestamp in a web archive URL does not necessarily point to an existing page. The timestamp of a page refers to the time it was crawled which may have occurred seconds or even years after the timestamp in a hyperlink to that page.

We can use the CDX data to further analyse this. From the Environment Agency homepage we see that their State of the Environment report was captured 1154 seconds (19 minutes) later than the link address.

In [None]:
T = TemporalIndexer()
url = "http://www.environment-agency.gov.uk:80/s-enviro.html"
snapshot = 19961104034437  # timestamp used in the link on the home page
T.add_entry(url)
# Find the capture of the linked page nearest to the link's timestamp
nearest_snapshot = snap_to_date(T.get_field(url, 'CDX').nearest_to(snapshot))
nearest_snapshot - snap_to_date(snapshot)

datetime.timedelta(seconds=1154)

In contrast, the "Who we are" page was first captured over 467 days later than the link address states. This is an extreme example, but it makes the point that whether a linked page is crawled or not depends on both the depth of the crawl and the success of crawling an individual page.

In [None]:
T = TemporalIndexer()
url = "http://www.environment-agency.gov.uk:80/who.html"
snapshot = 19961104034437  # timestamp used in the link on the home page
T.add_entry(url)
# Find the capture of the linked page nearest to the link's timestamp
nearest_snapshot = snap_to_date(T.get_field(url, 'CDX').nearest_to(snapshot))
nearest_snapshot - snap_to_date(snapshot)

datetime.timedelta(days=467, seconds=26444)
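
The arithmetic behind these two comparisons can be reproduced without the CDX index. The sketch below is an illustration only and does not claim to be how nearest_to is implemented: the helper nearest_capture is hypothetical, and the capture timestamps in the list are made-up values chosen for the example.

```python
import datetime


def snap_to_date(snapshot):
    """Convert a 14-digit snapshot timestamp to a datetime."""
    return datetime.datetime.strptime(str(snapshot), "%Y%m%d%H%M%S")


def nearest_capture(target, captures):
    """Return the capture timestamp closest in time to the target timestamp."""
    target_date = snap_to_date(target)
    return min(captures, key=lambda c: abs(snap_to_date(c) - target_date))


link_snapshot = 19961104034437
# Hypothetical capture timestamps for a linked page
captures = [19961104040351, 19980214110521]

nearest = nearest_capture(link_snapshot, captures)
gap = snap_to_date(nearest) - snap_to_date(link_snapshot)
print(nearest, gap)
```

Taking the absolute difference means the nearest capture may fall before or after the timestamp in the link.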

## Extracting content from archived web pages

One challenge with working with HTML pages is extracting text from those pages. Using the same examples from before, we can use the BeautifulSoup library to extract text from each page.

The result is messy: the text also includes a JavaScript program which is used to render the page in the web archive interface.

In [None]:
ex_id = 1
ex = examples[ex_id-1]
page = requests.get(ukgwa_prefix + ex[1] + "/" + ex[0])
soup = BeautifulSoup(page.content, 'html.parser')
soup.get_text()

'\n\n\n\n    // This block sets up a series of server-defined variables for use within Wombat\n    wbinfo = {\n        url           : "http://www.environment-agency.gov.uk:80/",\n        timestamp     : "19961104034437",\n        request_ts    : "19961104034437",\n        prefix        : decodeURI("https://webarchive.nationalarchives.gov.uk/"),\n        mod           : "",\n        origurl       : "19961104034437/www.environment-agency.gov.uk:80/",\n        type          : "replay",\n        top_url       : "https://webarchive.nationalarchives.gov.uk/19961104034437tf_/http://www.environment-agency.gov.uk:80/",\n        is_framed     : true,\n        is_live       : false,\n        coll          : "",\n        proxy_magic   : "",\n        static_prefix : "https://webarchive.nationalarchives.gov.uk/static/__pywb"\n    };\n\n\n \n \n\n \n\n\n    // Since we\'re in rewrite mode, add a few more variables for later use\n    wbinfo.wombat_ts = "19961104034437";\n    wbinfo.wombat_scheme = "h

Thankfully BeautifulSoup makes it easy to remove script elements from the HTML. This time we get pure text. However, if we compare it with the original web page, we see that text from menus is also included. This might be acceptable for an individual page, but when performing text analysis across a site the menu items will be repeated for every page, creating unwanted noise.

In [None]:
ex_id = 3
ex = examples[ex_id-1]
page = requests.get(ukgwa_prefix + ex[1] + "/" + ex[0])
soup = BeautifulSoup(page.content, 'html.parser')
for s in soup.select('script'):
    s.extract()
soup.get_text().replace("\n"," ").replace("  "," ")

"          Home | Office of Rail Regulation             Skip to content    Search Menu        About ORR  < Back  Who we are  < Back  The board  < Back  Committees  Register of interests  Board meeting minutes   Executive directors   What we do  < Back  Our functions  The law and our duties  < Back  Our duties  Railway regulatory law  UK law  EU law   Our vision and strategy  < Back  Corporate strategy  Health and safety strategy  < Back  Health and safety regulatory approach    How we work  < Back  Business plan  < Back  Impact assessments   Accountability  Sustainable development    Who we work with  < Back  Rail infrastructure  < Back  Mainline network  Underground railways  Light rail and tramways  Minor and heritage railways   Government  < Back  Department for Transport  Transport for London  Transport Scotland  Welsh Government  Department of Regional Development, Northern Ireland  Passenger transport executives  The European Railway Agency  Channel Tunnel   Safety bodies  < Back

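The long runs of spaces in the output above remain because a chained replace only halves each run of spaces once. A minimal sketch of a fuller clean-up using only the standard library follows; the helper name collapse_whitespace and the sample string are illustrative and not part of the notebook's code.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading and trailing whitespace.
    return " ".join(text.split())


sample = "\n\n   Home | Office of Rail Regulation \n\n\n  Skip to content    Search Menu  "
print(collapse_whitespace(sample))
# prints: Home | Office of Rail Regulation Skip to content Search Menu
```
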
A more sophisticated approach to extracting content is to use a boilerplate removal tool, such as boilerpy3. This tool aims to strip out the menus and return only the textual content of the page.

The tool needs to be used with care though. The first example works well, but example 2 returns nothing at all and example 3 returns only a copyright notice. This is because home pages are entry points to a site and often exist mainly to help users find pages within it. They are therefore light on content and heavy on boilerplate.

In [None]:
ex_id = 2
ex = examples[ex_id-1]
page = requests.get(ukgwa_prefix + ex[1] + "/" + ex[0])
soup = BeautifulSoup(page.content, 'html.parser')
extractor = extractors.ArticleExtractor()
content = extractor.get_content(str(soup))
content

''
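
Because boilerplate removal can return an empty string or only a short notice, it may help to check the extracted text before using it. A minimal sketch, where the helper name has_enough_content and the threshold of 30 words are arbitrary assumptions to tune for your own data, and the sample strings are invented:

```python
def has_enough_content(text, min_words=30):
    """Return True if the extracted text has at least min_words words."""
    return len(text.split()) >= min_words


# Invented extraction results standing in for real pages
extracted = {
    "empty page": "",
    "notice only": "Copyright 2014",
    "article": "butterfly " * 40,
}

for name, text in extracted.items():
    print(name, has_enough_content(text))
```

A word count is a crude signal, so pages just above the threshold may still be mostly boilerplate.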

## Auto-generating website summaries

In the first notebook we looked at the various descriptions available in Discovery and the A-Z list. The most descriptive of these was the Administrative History from Discovery. Currently this only exists for around one-third of websites. The content of the web archive is open and fully indexed so it can be searched, but one feature of a good catalogue description is that it provides an overview of a document to let the user know whether that document is worth opening or requesting from the reading room.

The great thing about computational access is that in the absence of a catalogue summary we can automate the creation of our own. Using some of the code from earlier to find links, combined with the boilerplate removal tool, we can extract the content from pages linked to on the home page of a site, which are part of that site.

Then we can apply a document summarisation algorithm (from the gensim NLP library), which selects the most representative sentences, to condense all of the pages into a short paragraph.
We will try it for one of the earlier examples, crawling each linked page and extracting its content into a list.

So that we can compare the results with the Administrative History from the first notebook, we first need to set up the data, including extracting from the Discovery API, as in notebook 1.

<b>Estimated running time 40 seconds</b>

In [None]:
W = UKGWAIndex()
W.indexfromweb()
W.discoveryfromfile(data_folder + 'ukgwa_catrefs.txt')
disco_web_lookup = {}
ex_urls = [ex[0] for ex in examples]
for w in W:
    url = W.get_field(w, 'URL')
    if url[-1] == "/":
        url = url[:-1]
    if url in ex_urls:
        catref = W.get_field(w, 'CATREF')
        disco_web_lookup[catref] = w

D = DiscoSearch()
D.add_entry('web AND snapshots')
example_admin = {}
for d in D:
    if D.get_field(d, 'reference') in disco_web_lookup:
        web_ref = disco_web_lookup[D.get_field(d, 'reference')]
        # Key the Administrative History by the URL without a trailing "/",
        # matching the format used in the examples list
        url = W.get_field(web_ref, 'URL')
        if url[-1] == "/":
            url = url[:-1]
        example_admin[url] = D.get_field(d, 'adminHistory')


http://www.investorsinpeople.co.uk/Pages/Home.aspx
http://www.professionalstandards.org.uk
http://www.salt.gov.uk
http://www.ukbms.org
http://www.esero.org.uk
IAID: C17986 URL: http://www.investorsinpeople.co.uk/Pages/Home.aspx Data: Investors in People UK was formed in 1993 to provide business leadership and development for the investors in people standard, and to lead and undertake national promotion of the standard. The snapshots in this series begin in the period before June 2007 when it was the responsibility of the Department for Education and Skills, before becoming a non-departmental public body which receives funding from the Department for Business, Innovation and Skills (BIS).
IAID: C11689433 URL: http://www.esero.org.uk/ Data: ESERO-UK, also known as the UK Space Education Office, aims to promote the use of space to enhance and support the teaching and learning of Science, Technology, Engineering and Mathematics (STEM) in schools and colleges throughout the UK. ESERO-UK is 

This code performs the crawling of the pages linked to from the home page (within the same site). It then summarises each page using the summarization function from gensim, and stores a list of the summaries.

In [None]:
ex_id = 3
ex = examples[ex_id-1]
page = requests.get(ukgwa_prefix + ex[1] + "/" + ex[0])
soup = BeautifulSoup(page.content, 'html.parser')

print("Links from", ex[1] + "/" + ex[0])
links = soup.find_all('a')
summaries = []


parent_url = UKGWAurl(ukgwa_prefix + ex[1] + "/" + ex[0])
extractor = extractors.ArticleExtractor()

for ln in links:
    if ln.has_attr('href'):
        href = ln['href']
        # Skip empty links and in-page anchors
        if not href or href[0] == "#":
            continue
        # Strip the "./" from relative addresses
        if href[0:2] == "./":
            href = href[2:]
        link_url = UKGWAurl(href, parent_url)
        # Only follow links within the same site
        if link_url.get_domain() != parent_url.get_domain():
            continue
        print("\t", href, link_url)
        link_page = requests.get(link_url)
        link_soup = BeautifulSoup(link_page.content, 'html.parser')
        content = extractor.get_content(str(link_soup))
        try:
            summary = summarizer.summarize(content, ratio = 0.2)
        except ValueError:
            # gensim raises ValueError when the text is a single sentence
            continue
        summaries.append(summary)


Links from 20140402194840/http://www.ukbms.org
	 Default.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/Default.aspx
	 involved.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/involved.aspx
	 wcbs.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/wcbs.aspx
	 indicators.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/indicators.aspx
	 KeyFindings.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/KeyFindings.aspx
	 news.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/news.aspx
	 reportsAndPublications.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/reportsAndPublications.aspx
	 resources.aspx https://webarchive.nationalarchives.gov.uk/20140402194840/http://www.ukbms.org/resources.aspx
	 Specieslist.aspx https://webarchive.nationalarchives.gov.
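
The same-site test in the loop relies on UKGWAurl.get_domain. As a rough illustration of what such a comparison involves, the sketch below extracts the original host from a complete address using only the standard library. It is not the UKGWAurl implementation: the helper name original_domain and the timestamp pattern are assumptions made for this example.

```python
import re
from urllib.parse import urlsplit

UKGWA_PREFIX = "https://webarchive.nationalarchives.gov.uk/"


def original_domain(complete_address):
    """Return the host of the archived page in a complete address, or None."""
    if not complete_address.startswith(UKGWA_PREFIX):
        return None
    rest = complete_address[len(UKGWA_PREFIX):]
    # Assumption: the snapshot is 14 digits plus an optional modifier like "tf_"
    match = re.match(r"\d{14}[a-z_]*/(.*)", rest)
    if match is None:
        return None
    return urlsplit(match.group(1)).hostname


parent = UKGWA_PREFIX + "20140402194840/http://www.ukbms.org"
links = [
    UKGWA_PREFIX + "20140402194840/http://www.ukbms.org/wcbs.aspx",
    UKGWA_PREFIX + "20140402194840/http://www.butterfly-conservation.org/bnm/atlas/index.html",
]
for link in links:
    same_site = original_domain(link) == original_domain(parent)
    print(original_domain(link), same_site)
```

Comparing hosts exactly treats www.ukbms.org and ukbms.org as different sites, which may or may not be what you want.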

Next we concatenate all of those summaries into one big document and summarise that. You can experiment with the length of the summary using the word_count parameter, but 100 words produces a concise summary which provides an overview of the web site.

This is not a catalogue description: it is built from heuristics, not knowledge. It is nonetheless a powerful technique, and this notebook environment provides an ideal playground to try it out.

Run this code to see the result and to compare it against the human-curated Administrative History:

In [None]:
all_summaries = "\n".join(summaries)
print(ex[0])
print("")
print("Auto summary")
print(summarizer.summarize(all_summaries, word_count=100))
print("")
print("Admin History")
if ex[0] in example_admin:
    print(example_admin[ex[0]])
else:
    print(example_admin[ex[0] + "/"])



http://www.ukbms.org

Auto summary
The Wider Countryside Butterfly Survey (WCBS) is the main scheme for monitoring population changes of the UK's common and widespread butterflies.
The Wider Countryside Butterfly Survey (WCBS) is the main scheme for monitoring population changes of the UK's common and widespread butterflies.
The WCBS is the most comprehensive UK-wide survey of insect abundance to use a robust random sampling framework and is important in assessing the changing status of butterfly species in the wider countryside and in providing an indicator of the health of nature.

Admin History
The UK Butterfly Monitoring Scheme (UKBMS) is run as a partnership between the Centre for Ecology and Hydrology (CEH), Butterfly Conservation (BC) and the Joint Nature Conservation Committee (JNCC).
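
The auto summary above repeats its first sentence, most likely because the same text was extracted from more than one crawled page (the home page links to wcbs.aspx twice). One option is to drop exact duplicate lines before or after summarising. A minimal sketch, assuming each sentence sits on its own line as in the output above; the helper name unique_lines and the sample text are illustrative:

```python
def unique_lines(text):
    """Remove repeated lines while keeping the first occurrence and the order."""
    lines = (line.strip() for line in text.splitlines())
    # dict.fromkeys keeps insertion order, so it acts as an ordered set
    return "\n".join(dict.fromkeys(line for line in lines if line))


sample = (
    "The WCBS is the main monitoring scheme.\n"
    "The WCBS is the main monitoring scheme.\n"
    "The WCBS uses a random sampling framework."
)
print(unique_lines(sample))
```

This only removes exact repeats; near-duplicate sentences with small wording differences would pass through unchanged.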
