# Scraping Low Hanging Fruit on the UK Register of Members' Financial Interests 

This notebook describes a recipe for starting to scrape the UK MP register of interests.

The register entries for each member contains *semi-structured text data*, which is to say that there are some recognisable patterns in the text that makes up the register entries.

![](img/hoc_reg_mp_fin_int.png)

The entries are made in different sections, and have a form that repeats, ish...

![](img/hoc_reg_mp_fin_int2.png)

We can use these repeating structures as the basis of a *scraper* that will extract the information from the page and put it into a form we can work with, such as a spreadsheet or simple database.

## The Structure of a Single Register Entry Page

The page you see in your web browser is a rendering of a structured HTML document. You can look at the "code" that defines the page using browser developer tools.

In Chrome, you can view the source code of a web page from the *View -> Developer -> View Source* menu.

Looking at the raw HTML of a page can be confusing, but many browsers have built in tools to make it easier to inspect the source code of a web page.

In Chrome, you can launch the developer tools from he *View -> Developer -> Developer Tools* menu option.

### Exploring the page

In Chrome developer tools, if you click on the arrow / pointer icon in the top left corner of the tools panel, you can use it to highlight areas of the rendered web page; the HTML used to define that block is then highlighted.

![](img/hoc_reg_devtools1.png)

One of the tricks to scraping is to try to identify structural elements or patterns in the HTML that identify the things we are interestd in and that we can grab hold of and use as the basis for our scrape.

In the MPs' register of interest pages, we notice that the `div` tag with the id `mainTextBlock` contains all the elements that describe the register of entries. In particular, we also notice that each spearate entry is contained within its own `<p>` tag.

![](img/hoc_reg_devtools2.png)

This gives us one strategy for scraping the page:

- grab the parent `div` tag - and its contents - that contains all the separate register entries;
- look at each `<p>` tag - and its contents - in turn and try to pull out the member interests.

### What structured or semi-structured data can we see?

Some of the register entries have a largely *unstructured* form. For example, the entry:

```
13 December 2016, received £2,500 from Hampshire Cricket, The Ageas Bowl, Botley Road, West End, Southampton SO30 3XH, for media and communications training. Hours: 16 hrs including travel and preparation. (Registered 19 December 2016)
```

is largely free text. We can see some structure in there (a date, followed by an amount, then a name and an address) but we get the feeling that this entry could be made up of arbitrary text.

An entry such as:

```
Name of donor: VGC Group
Address of donor: Cardinal House, Bury Street, Ruislip HA4 7GD
Amount of donation or nature and value if donation in kind: £1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.
Donor status: company, registration 5741473
(Registered 04 May 2016)
```

is semi-structured, in that we have structural items of the form `attribute: value` where the value term may or may not itself be structured.

For example, the `Name of donor` attribute is simply that - a name - which explicitly represents the name of the donor. But the text associated with the `Amount of donation or nature and value if donation in kind` is more unstructured. For sure, we can see recognisable thinks in the description, but the way they are presented is leargely as free text. Which is to say, the way it's presented is arbitrary, which makes it harder to extract information from in a reliable way.

What this means is that there is some low hanging fruit in this register that we *can* extract reasonably reliably (the name of a donor, for example), but there is also information in that that we may have to parse by hand if we want to do it reliably.

## Getting Started with the Scrape

There are many tools available to help you scrape a web page or set of web pages, but I tend to use code becuase it gives me the most control over the scrape, albeit at the cost of added complexity compared to point and click style applications.

A couple of Python packages that crossed my radar recently provide a relatively easy way in to getting started with Python based scrapers, so we'll use those in this recipe.

### `kennethreitz/requests-html`

The first package is Kenneth Reitz' ([requests-html](https://github.com/kennethreitz/requests-html), straplined *"Pythonic HTML Parsing for Humans™"*. This package helps us grab an HTML page and extract the text from it.

In [1]:
#https://github.com/kennethreitz/requests-html
#!~/anaconda3/bin/pip install requests_html

### `r1chardj0n3s/parse`

The second package is Richard Jones' `parse` package ([r1chardj0n3s/parse](https://github.com/r1chardj0n3s/parse)).

This package provides a set of tools that make it relatively easy to extract the rendered / visually structured information, such as the attributes/values identified in the MPs' register above.

In [2]:
#https://github.com/r1chardj0n3s/parse
#Install the package
#!~/anaconda3/bin/pip install parse

### Grabbing a test page

We'll start by working with a single test page: [https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm](https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm)


This was selected from the [last register of the 2015 Parliament](https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm). For now, I'm just hoping that the register for the current 2017 Parliament has the same structural form!

*Disclaimer: the page I picked from the register was picked becuase it had a range of content structures, not for any reason relating to the member it relates to or the actual content of an entry. Which is to say: nothing is implied by the selection of the test page, etc etc.*

In [3]:
#Set the url of the test page
url='https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm'

The first thing to do is get hold of the page HTML. The `requests_html` makes this easy:

In [4]:
#Import the package - we only need to do this once
from requests_html import HTMLSession

#Create a session - we conly need to do this once
session = HTMLSession()

In [5]:
#Grab the page
r = session.get(url)

To find the `p` tags *within* an HTML block with a given id, such as `mainTextBlock`, we can apply the `.html.find()` method to the page and get a list of items in return.

In [6]:
ptags = r.html.find('#mainTextBlock > p')

#View the contents of the 7th p tag in the list (the index starts at 0)
ptags[6].text

'Name of donor: VGC Group\nAddress of donor: Cardinal House, Bury Street, Ruislip HA4 7GD\nAmount of donation or nature and value if donation in kind: £1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.\nDonor status: company, registration 5741473\n(Registered 04 May 2016)'

Now we can start to extract some information *as data* using the `parse` package.

In [7]:
#Import everything from the package - this is not best practice!
from parse import *

The `parse` package encourages you to split a text string into recognisable components. The text we want to extract is wrapped using braces ({}). The contents of the braces may be a name we want to assigned to the extracted text, and / or a pattern that describes the text we want to extract "into" that pair of braces.

The expression we need to us has the form:

```
parse(stringExtractionPattern, stringWeWantToParse)
```

In [8]:
pattern = '''Name of donor: {name}\nAddress of donor: {addr}\nAmount of donation or nature and value if donation in kind: {txt}\nDonor status: {status}\n(Registered {date})'''

pr = parse(pattern, ptags[6].text)

#View the results of parsing the string
pr

<Result () {'name': 'VGC Group', 'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD', 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'status': 'company, registration 5741473', 'date': '04 May 2016'}>

If the the pattern matcher doesn't match the string, nothing is returned.

In [9]:
pr = parse(pattern,"Some arbitrary text that is unlikely to match...")
print('This returns >>', pr, '<<')

This returns >> None <<


We can pass the "named" items we have extracted from the text string into its own python `dict`.

The python expression used to do this is known as a "list comprehension" (or more specifically in this case, a "dict comprehension"). Essentially what it does is take the contents of one `dict` and use them to create another. *Don't worry about it: it's voodoo magic...*

In [10]:
pr = parse(pattern, ptags[6].text)
extractedItems = {k:pr[k] for k in pr.named}

#Preview items
extractedItems

{'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD',
 'date': '04 May 2016',
 'name': 'VGC Group',
 'status': 'company, registration 5741473',
 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.'}

We can mask the complexity of the dict comprehension by creating a function that deploys it for us:

In [11]:
def todict(result):
    #If there's no match, return an empty dict
    if not result: return {}
    #If there is a match, add the named results items to the returned dict
    return {k:result[k] for k in result.named}

In [12]:
todict(pr)

{'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD',
 'date': '04 May 2016',
 'name': 'VGC Group',
 'status': 'company, registration 5741473',
 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.'}

If the parse expression does not match the string, then it won't return anything.

Looking at the extracted text, we can see that there are some other elements in there that we might be able to extract.

For example, if the `status` is a company, it looks like we might be able to extract out the company number:

In [13]:
#View status
extractedItems['status']

'company, registration 5741473'

We can try to extract the fact the entity is a company, along with the company number:

In [14]:
parse('{status2}, registration {cn}', extractedItems['status'])

<Result () {'status2': 'company', 'cn': '5741473'}>

We could directly create a `dict` from that:

In [15]:
todict( parse('{status2}, registration {cn}', extractedItems['status']) )

{'cn': '5741473', 'status2': 'company'}

Or we could write a new function that adds any newly extracted items to the `extractedItems` dict, *and* does the parsing:

In [16]:
def todict2(pattern, string, extracted=None):
    if extracted is None: extracted = {}
    newextract = todict( parse(pattern, string) )
    
    #Add the contents of newextract to the original dict and return
    #Note that this updates any dict passed in via the extracted argument and doesn't strictly need to return it 
    extracted.update(newextract)
    return extracted

In [17]:
todict2('{status2}, registration {cn}', extractedItems['status'], extractedItems)
extractedItems

{'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD',
 'cn': '5741473',
 'date': '04 May 2016',
 'name': 'VGC Group',
 'status': 'company, registration 5741473',
 'status2': 'company',
 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.'}

Alternatively, we might write a function to handle the company number extraction and return the company number (and extracted status) as a `dict`.

In [55]:
def companynumber(string):
    return todict( parse('{status2}, registration {cn}', string) )

In [57]:
print( companynumber(extractedItems['status']) )
print( companynumber('Arbitrary text.') )
#Check to see if it works with a company number that starts with a leading 0
print( companynumber('company, registration 05741473') )

{'status2': 'company', 'cn': '5741473'}
{}
{'status2': 'company', 'cn': '05741473'}


Let's look at another entry - if we also print it, the end of line (`\n`) characters will be rendered make things easier to read:

In [20]:
print(ptags[11].text)

ptags[11].text

Name of donor: Balmoral Tanks Ltd
Address of donor: Balmoral Park, Aberdeen AB12 3GY
Amount of donation or nature and value if donation in kind: £2,000 to support my Primary School Christmas Card Competition
Date received: 8 December 2016
Date accepted: 8 December 2016
Donor status: company, registration 300656
(Registered 09 December 2016)


'Name of donor: Balmoral Tanks Ltd\nAddress of donor: Balmoral Park, Aberdeen AB12 3GY\nAmount of donation or nature and value if donation in kind: £2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016\nDonor status: company, registration 300656\n(Registered 09 December 2016)'

This entry has `Date received` and `Date accepted` fields that were not in the entry we scraped first. After parsing, they form part of the `txt` item:

In [21]:
#If we don't pass a dict in to todict2(), one will be created for us
extracteditems = todict2(pattern, ptags[11].text)

extracteditems

{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'date': '09 December 2016',
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

So let's parse that item and grab the dates.

(The reason we don't add them to the pattern we used earlier is becuase that pattern would then *not* match the entries that do not contain the received and accepted dates.)

In [22]:
#Here's what we're going to parse
extracteditems['txt']

'£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'

In [23]:
datepattern = '{}\nDate received: {dateRxd}\nDate accepted: {dateAccd}'

#This updates extracteditems
todict2(datepattern, extracteditems['txt'], extracteditems)
extracteditems

{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateRxd': '8 December 2016',
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

We could also clean the text field a bit by splitting the string on `\nDate` fragments and just retaining the first part:

In [24]:
extracteditems['txt'].split('\nDate')

['£2,000 to support my Primary School Christmas Card Competition',
 ' received: 8 December 2016',
 ' accepted: 8 December 2016']

In [25]:
#The split returns a list of items - just grab the first one with index 0
extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
extracteditems

{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateRxd': '8 December 2016',
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

One thing we might notice at this part is that we have some dates. These are represented as text strings, but we can also part them into a "date-timey" computational thing that identifies the date *as a date* and let's us do datey things to it.

In [26]:
#The dateutil package extends the standard Python datetime  module and helps us parse dates
#~/anaconda3/bin/pip install python-dateutil
from dateutil import parser as dtparser

In [27]:
def parsedate(string):
    #This is not best practice - if the parse fails, return None
    try:
        dt = dtparser.parse(string)
    except:
        dt = None
    return dt

In [28]:
parsedate('8 December 2016')

datetime.datetime(2016, 12, 8, 0, 0)

Having things in `datetime` format lets us work with them as such. For example, we can display them in a variety of ways:

In [29]:
print( parsedate('8 December 2016').strftime("%d/%m/%y") )
print( parsedate('8 December 2016').strftime("%B %d, %Y") )
print( parsedate('8 December 2016').strftime("%A, %B %, %Y") )
print( parsedate('8 December 2016').isoformat() )

08/12/16
December 08, 2016
Thursday, December , 2016
2016-12-08T00:00:00


There is a reference card for `strftime` modifiers / formatters here: [http://strftime.org/](http://strftime.org/)

We can use the formatter to format our dates for us:

In [30]:
extracteditems['date_f'] = parsedate(extracteditems['date'])
extracteditems

{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateRxd': '8 December 2016',
 'date_f': datetime.datetime(2016, 12, 9, 0, 0),
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

We can also automate a bit if date items have 'date' to start their name and haven't already been formatted (identified using the `_f` suffix as part of their name).

Part of the automation requires creating dict attributes, named after date attributes but with the additional `_f` suffix as part of the name. We can use a Python string formatter to help us do this:

In [31]:
'{}_f'.format('date')

'date_f'

In [32]:
def parsedates(record):
    #This looks complicated but what it basically does is look for attributes called date* and not ending _f
    for k in [k for k in record.keys() if k.lower().startswith('date') and not k.lower().endswith('_f') ]:
        #The record is a dict and is mutable - that is, the dict we passed in is changed by the function
        record['{}_f'.format(k)] = parsedate( record[k] )

In [33]:
#Remember - this automatically updates the dict we pass to it
parsedates(extracteditems)

#Show dict updated with parsed dates
extracteditems

{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateAccd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'dateRxd': '8 December 2016',
 'dateRxd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'date_f': datetime.datetime(2016, 12, 9, 0, 0),
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

### Where's the money?

We've now grabbed quite a lot of the low hanging fruit from the page, but what about the money?

Ideally, there should only be one monetary amount specified in an entry (this is not always the case), but for now we'll just make an attempt at grabbing the first. We're also going to assume amounts are converted to, and given as, £ equivalents. To make life easier for the parser, we remove any commas (which is to say, commas used as thouseands separators) from the parsed string by replacing them with an empty string.

In [34]:
print( '£1,250,000.00 #loadsamoney'.replace(',','') )

moneystring='£1,250,000.00 #loadsamoney'
print( moneystring.replace(',','') )

£1250000.00 #loadsamoney
£1250000.00 #loadsamoney


In [35]:
#Define a simple helper function
def commaclean(string):
    return string.replace(',','')

In [36]:
#Demo the helper function
commaclean( moneystring )

'£1250000.00 #loadsamoney'

Let's test a cash amount detecting pattern to see how it works with different strings.

In [37]:
ukppattern = '{?}£{ukp:g}{}'

def testcashparse(string):
    print('{}: {}'.format(string, todict2(ukppattern, commaclean(string)) ))
    

testcashparse('I got £5,000, okay?')
testcashparse('£5,000')
testcashparse('£5001')
testcashparse('They paid me, in 2015, £5000')
testcashparse('£1,250,000.00. #loadsamoney')
testcashparse('The sum of £2.75 for a coffee which cost £2.75')

testcashparse('I got £1000. In two lots: £200 and £800')
testcashparse('Gaming the system: I got £1.25 then £50,000 on top')

I got £5,000, okay?: {'ukp': 5000.0}
£5,000: {'ukp': 500.0}
£5001: {'ukp': 500.0}
They paid me, in 2015, £5000: {'ukp': 2015.0}
£1,250,000.00. #loadsamoney: {'ukp': 1250000.0}
The sum of £2.75 for a coffee which cost £2.75: {'ukp': 2.75}
I got £1000. In two lots: £200 and £800: {'ukp': 1000.0}
Gaming the system: I got £1.25 then £50,000 on top: {'ukp': 1.25}


There are some issues with this:

- where the cash item is at the end of the string, it looks like we can lose the last digit because of the final match requirement. We can get round this by adding whitespace at the end (and perhaps also the start) of the string to provide a match opportunity;
- things other than financial amounts, which are identified with the preceding £, are returned as numbers. (I'm not sure why?); one of the things the `parse` funciton returns is an index of where the match took place in the parsed string. So if we check the character before a match, if it's a `£` we know we're quids in...
- in some cases there may be multiple amounts, so rather than use the `parse()` function let's use the `findall()` function to see if we can find all the sterling amounts.

In [38]:
def moneyfudge(string):
    return ' {} '.format( commaclean(string))

In [39]:
moneystring = 'Gaming the system: I got £1.25 then £50,000 on top in 2015'

items=findall(ukppattern, moneyfudge(moneystring))
for i in items:
    print(i, moneyfudge(moneystring)[i.spans['ukp'][0]-1] )

<Result (' Gaming the system: I got £', ' ') {'ukp': 1.25}> £
<Result ('then £', ' ') {'ukp': 50000.0}> £
<Result ('on top in', ' ') {'ukp': 2015.0}> n


Let's create a function to get the financial amounts from a string.

In [40]:
def getamounts(string, numpattern = '{}{num:g}{}'):
    response = {}
    response['amounts'] = []
    for amount in findall(numpattern, moneyfudge(string) ):
        if moneyfudge(string)[amount.spans['num'][0]-1]=='£':
            response['amounts'].append(amount['num'])
        
    return response

In [41]:
getamounts(moneystring)

{'amounts': [1.25, 50000.0]}

Right - so we can pull out amounts. Let's start to think about generating some new items for our `extracteditems` record. We'll create function that returns the maximum, summed and itemised amounts that we can use variously if we want to go digging in the data.

We can also try to be clever, and where more that two items are listed, calculate the difference between the total sum and the the maximum amount. If these are equal (or if we are being more elaborate, *nearly* equal) then we might want to check the text to see if the larger amount is specifying the sum of the smaller amounts (in which case, the total sum item is meaningless).

Where there are mutliple amounts, we can create a serialised version of the list.

In [42]:
amounts = [100, 250.25, 1000]
#serialise the amounts - making sure they are represented as strings first
'::'.join([str(amount) for amount in amounts])

'100::250.25::1000'

In [43]:
def getcheckamounts(string, numpattern = '{}{num:g}{?D}'):
    response = getamounts(string, numpattern)
    
    response['maxamount'] = max(response['amounts']) if response['amounts'] else 0
    response['numamounts'] = len(response['amounts'])
    response['sumamounts'] = sum(response['amounts']) if amounts else 0
    response['sumlessmax'] = response['sumamounts'] - response['maxamount'] if len(response['amounts'])>2 else 0
    
    response['amounts'] = '::'.join([str(amount) for amount in response['amounts']])
    return response

In [44]:
print( getcheckamounts('Gaming the system: I got £1.25 then £50,000 on top in 2015') )
print( getcheckamounts('£1.25') )
print( getcheckamounts('No money') )
print( getcheckamounts('Four payments, £100, £200 and £400 to give £701 total') )

{'amounts': '1.25::50000.0', 'maxamount': 50000.0, 'numamounts': 2, 'sumamounts': 50001.25, 'sumlessmax': 0}
{'amounts': '1.25', 'maxamount': 1.25, 'numamounts': 1, 'sumamounts': 1.25, 'sumlessmax': 0}
{'amounts': '', 'maxamount': 0, 'numamounts': 0, 'sumamounts': 0, 'sumlessmax': 0}
{'amounts': '100.0::200.0::400.0::701.0', 'maxamount': 701.0, 'numamounts': 4, 'sumamounts': 1401.0, 'sumlessmax': 700.0}


Let's have a go at adding that information to the `extracteditems` record. The text we want to parse is in the `txt` field, and perhaps also in simplified form in the `cleanertxt` field.

In [45]:
extracteditems['txt']

'£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'

In [46]:
# The .update() method updates the contents of the dict directly
extracteditems.update( getcheckamounts( extracteditems['txt']) )
extracteditems

{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'amounts': '2000.0',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateAccd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'dateRxd': '8 December 2016',
 'dateRxd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'date_f': datetime.datetime(2016, 12, 9, 0, 0),
 'maxamount': 2000.0,
 'name': 'Balmoral Tanks Ltd',
 'numamounts': 1,
 'status': 'company, registration 300656',
 'sumamounts': 2000.0,
 'sumlessmax': 0,
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

### Finding More Structure From the Original Page

Looking back at the original page, we notice that the entries are grouped acorrding to different sorts of interest, such as *2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation* or *3. Gifts, benefits and hospitality from UK sources*.

We can detect when we enter a section by detecting a paragraph that starts with one of these headings:

In [47]:
for p in ptags:
    #The .startswith() method accepts a tuple (enumeration of things in a pair of brackets) to check against
    #If the text startswith any of the strings in the tuple, the condition evaluates true for that text
    if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.')):
        print(p.text)

1. Employment and earnings
2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation
2. (b) Any other support not included in Category 2(a)
3. Gifts, benefits and hospitality from UK sources
4. Visits outside the UK
8. Miscellaneous


## Putting the Bits Together

So no we pretty much have all the bits we need in order to scrape a register page for a particular member. Let's put them together and see if we can make a dataset, for that member, from the pieces we've assembled above.

In [48]:
section = ''

#iterate through the paragraphs
for p in ptags:
    #Check to see if we're in a new section. If so, capture the section
    if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.')):
        section = p.text
        
    #Do the preliminary parsing of a paragraph
    extracteditems = todict2(pattern, p.text)
    
    #Identify the section
    extracteditems['section'] = section
    
    #Look for the data - checking first there's a txt tag that's been extracted...
    if extracteditems and 'txt' in extracteditems:
        #Dates
        todict2(datepattern, extracteditems['txt'], extracteditems)
        #Get a simmplified version of the text string, without dates, to potentially make life easier in the future
        extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
        #Dateify any dates
        parsedates(extracteditems)
        
        #Extract any company numbers that are declared
        extracteditems.update( companynumber(extractedItems['status']) )
        
        #Money
        extracteditems.update( getcheckamounts( extracteditems['txt']) )
    
        print(extracteditems)
    

{'name': 'VGC Group', 'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD', 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'status': 'company, registration 5741473', 'date': '04 May 2016', 'section': '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'cleanertxt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'date_f': datetime.datetime(2016, 5, 4, 0, 0), 'status2': 'company', 'cn': '5741473', 'amounts': '1800.0', 'maxamount': 1800.0, 'numamounts': 1, 'sumamounts': 1800.0, 'sumlessmax': 0}
{'name': 'Edward Maurice Watkins', 'addr': 'private', 'txt': '£755 in purchasing tickets for a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which wil

## Making a tabular dataset

Well it looks like we're scraping something!

Let's see if we can now tidy that up a bit and put it into a datastructure designed for working with tabular datasets. A *dataframe* from the *pandas* package.

What we're going to do is add each record as a row to a dataframe using the *pandas* `pd.concat()` function.

In [49]:
#!~/anaconda3/bin/pip install pandas
import pandas as pd

In [50]:
#To make the code reusable, let's wrap it in a function:

def scrapeData(ptags):

    #The code is pretty much as it was before...
    #...except that at first we create an empty dataframe
    df = pd.DataFrame()

    section = ''

    #iterate through the paragraphs
    for p in ptags:
        #Check to see if we're in a new section. If so, capture the section
        if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.')):
            section = p.text

        #Do the preliminary parsing of a paragraph
        extracteditems = todict2(pattern, p.text)

        #Identify the section
        extracteditems['section'] = section

        #Look for the data - checking first there's a txt tag that's been extracted...
        if extracteditems and 'txt' in extracteditems:
            #Dates
            todict2(datepattern, extracteditems['txt'], extracteditems)
            #Get a simmplified version of the text string, without dates, to potentially make life easier in the future
            extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
            #Dateify any dates
            parsedates(extracteditems)

            #Extract any company numbers that are declared
            extracteditems.update( companynumber(extractedItems['status']) )
        
            #Money
            extracteditems.update( getcheckamounts( extracteditems['txt']) )

            #...and then we add each record to it
            df = pd.concat([df,pd.DataFrame([extracteditems])])

    #Return the dataframe... resetting the index for the whole dataframe
    #We can ignore (drop) the original index:
    #  it contains "dummy" values created when the single row dataframe was made from each record
    return df.reset_index(drop=True)

In [51]:
#Let's if it works...
scrapeData(ptags)

Unnamed: 0,addr,amounts,cleanertxt,cn,date,dateAccd,dateAccd_f,dateRxd,dateRxd_f,date_f,maxamount,name,numamounts,section,status,status2,sumamounts,sumlessmax,txt
0,"Cardinal House, Bury Street, Ruislip HA4 7GD",1800.0,"£1,800 in a successful auction bid at a fundra...",5741473,04 May 2016,,NaT,,NaT,2016-05-04 00:00:00,1800.0,VGC Group,1,2. (a) Support linked to an MP but received by...,"company, registration 5741473",company,1800.0,0,"£1,800 in a successful auction bid at a fundra..."
1,private,755.0,£755 in purchasing tickets for a fundraising d...,5741473,23 May 2016,,NaT,,NaT,2016-05-23 00:00:00,755.0,Edward Maurice Watkins,1,2. (a) Support linked to an MP but received by...,individual,company,755.0,0,£755 in purchasing tickets for a fundraising d...
2,private,2000.0,"£2,000 in a successful auction bid at a fundra...",5741473,23 May 2016,,NaT,,NaT,2016-05-23 00:00:00,2000.0,Edward Maurice Watkins,1,2. (a) Support linked to an MP but received by...,individual,company,2000.0,0,"£2,000 in a successful auction bid at a fundra..."
3,private,1900.0,"£1,900 in a successful auction bid at a fundra...",5741473,22 September 2016,,NaT,,NaT,2016-09-22 00:00:00,1900.0,Chris Chenn,1,2. (a) Support linked to an MP but received by...,individual,company,1900.0,0,"£1,900 in a successful auction bid at a fundra..."
4,"Balmoral Park, Aberdeen AB12 3GY",2000.0,"£2,000 to support my Primary School Christmas ...",5741473,09 December 2016,8 December 2016,2016-12-08,8 December 2016,2016-12-08,2016-12-09 00:00:00,2000.0,Balmoral Tanks Ltd,1,2. (b) Any other support not included in Categ...,"company, registration 300656",company,2000.0,0,"£2,000 to support my Primary School Christmas ..."
5,"4th Floor, 49 Whitehall, London SW1A 2BX",444.0,ticket and hospitality at the Ivor Novello Awa...,5741473,27 May 2016,19 May 2016,2016-05-19,19 May 2016,2016-05-19,2016-05-27 00:00:00,444.0,UK Music,1,"3. Gifts, benefits and hospitality from UK sou...","company, registration no 3245288",company,444.0,0,ticket and hospitality at the Ivor Novello Awa...
6,"Headingley Cricket Ground, Leeds LS6 3DP",200.0::400.0,two tickets and hospitality at Yorkshire Count...,5741473,07 June 2016,21 May 2016,2016-05-21,21 May 2016,2016-05-21,2016-06-07 00:00:00,400.0,The Yorkshire County Cricket Club,2,"3. Gifts, benefits and hospitality from UK sou...","company, registration IP28929R",company,600.0,0,two tickets and hospitality at Yorkshire Count...
7,"30 Gloucester Place, London W1U 8PL",259.0,Ticket and hospitality for a concert at Wemble...,5741473,13 June 2016,5 June 2016,2016-06-05,5 June 2016,2016-06-05,2016-06-13 00:00:00,259.0,Football Association Premier League,1,"3. Gifts, benefits and hospitality from UK sou...","company, registration no 2719699",company,259.0,0,Ticket and hospitality for a concert at Wemble...
8,"Botley Road, West End, Southampton SO30 3XH",499.0::1197.6,"two tickets, and accompanying hospitality, to ...",5741473,08 July 2016; updated 13 July 2016,5 July 2016,2016-07-05,5 July 2016,2016-07-05,,1197.6,Hampshire Cricket Ltd,2,"3. Gifts, benefits and hospitality from UK sou...","company, registration no 4343355",company,1696.6,0,"two tickets, and accompanying hospitality, to ..."
9,"4th Floor, 49 Whitehall, London SW1A 2BX",380.0,four tickets to attend a music concert at Wemb...,5741473,20 September 2016,10 September 2016,2016-09-10,10 September 2016,2016-09-10,2016-09-20 00:00:00,380.0,UK Music,1,"3. Gifts, benefits and hospitality from UK sou...","company, registration 3245288",company,380.0,0,four tickets to attend a music concert at Wemb...


If everything has gone to plan, we should be able to scrape the data from another page...

In [52]:
url2='https://publications.parliament.uk/pa/cm/cmregmem/170502/gray_james.htm'
r2 = session.get(url2)
ptags2 = r2.html.find('#mainTextBlock > p')
scrapeData(ptags2)

Unnamed: 0,addr,amounts,cleanertxt,cn,date,dateAccd,dateAccd_f,dateRxd,dateRxd_f,date_f,maxamount,name,numamounts,section,status,status2,sumamounts,sumlessmax,txt
0,private,2500.0,"£2,500 for local EU Referendum campaigning",5741473,21 July 2016,14 June 2016,2016-06-14 00:00:00,14 June 2016,2016-06-14 00:00:00,2016-07-21 00:00:00,2500.0,Lady A T Keswick,1,2. (b) Any other support not included in Categ...,individual,company,2500.0,0.0,"£2,500 for local EU Referendum campaigning\nDa..."
1,"House of Commons, London SW1A 0AA",40.0::14.25::425.25,six dinners value £40 each; thirteen breakfast...,5741473,24 October 2016; updated 05 December 2016,1 January 2016 to 2 December 2016,,1 January 2016 to 2 December 2016,,,425.25,APPG for the Armed Forces,3,"3. Gifts, benefits and hospitality from UK sou...",other,company,479.5,54.25,six dinners value £40 each; thirteen breakfast...
2,"PO Box 1010, Uxbridge UB8 9NT",944.0,12 month season ticket for car parking at Chip...,5741473,07 November 2016,7 November 2016,2016-11-07 00:00:00,7 November 2016,2016-11-07 00:00:00,2016-11-07 00:00:00,944.0,APCOA Parking,1,"3. Gifts, benefits and hospitality from UK sou...","company, registration 2572947",company,944.0,0.0,12 month season ticket for car parking at Chip...


## Further Work on the Single Page Scrape

Looking through some of the member pages, there are other elements of structure that we should be able to pull on. For example, many pages have content of the following form in the `Employment and earnings` section which we should be able to pull out:

```
1. Employment and earnings

Payments from ComRes, 4 Millbank, London SW1P 3JA:
26 January 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)
4 April 2016, £75 for participating in Parliamentary Panel Survey. Hours: 20 mins. (Registered 03 August 2016)
20 May 2016, £100 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)

Payments from YouGov, 50 Featherstone St, London EC1Y 8RT:
13 Jan 2016, £50 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)
4 Mar 2016, £30 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)
```

If the section predominately contains information in that form, it may be worth recording it in a separate dataframe.

## Summary

This notebook has walked through the creation of a recipe for scraping the contents of the register of financial interests of a single MP.

In the next installment, we'll look at how to grab the data for *all* the members listed in a particular register. This will be followed - possibly! - by a look at how to query the data once we have scraped it, as well as how to enrich the dataset and make it "linkable" to other datasets using common identifiers, such as MNIS / MP ids and company numbers.