# Python case study: tribunals (inc. dealing with dates)

When we import or grab dates they are often treated as text strings, when we actually want to treat them *as* dates.

In this notebook we scrape some data that includes dates, and then convert the dates to a 'datetime' variable so we can perform related actions (such as extracting a month, or identifying which day of the week a date fell on).

The page we want to scrape is [employment tribunal decisions](https://www.gov.uk/employment-tribunal-decisions).

## Scrape the page

First we fetch the page of tribunal decisions and parse its HTML.

In [2]:
#importing requests to fetch URLs
import requests
#importing beautiful soup's library for drilling into webpages
from bs4 import BeautifulSoup
#and pandas for storing data
import pandas as pd


#fetch URL
page = requests.get("https://www.gov.uk/employment-tribunal-decisions")

#command beautiful soup to parse the page
soup = BeautifulSoup(page.content,'html.parser')
#show the soup!
soup

<!DOCTYPE html>

<!--[if lt IE 9]><html class="lte-ie8 govuk-template" lang="en"><![endif]--><!--[if gt IE 8]><!--><html class="govuk-template" lang="en">
<!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Find decisions on Employment Tribunal cases in England, Wales and Scotland." property="og:description"/>
<meta content="Employment tribunal decisions" property="og:title"/>
<meta content="https://www.gov.uk/employment-tribunal-decisions" property="og:url"/>
<meta content="article" property="og:type"/>
<meta content="GOV.UK" property="og:site_name"/>
<meta content="" data-module="ga4-scroll-tracker" name="govuk:scroll-tracker"/>
<meta content="Employment tribunal decisions - GOV.UK" name="govuk:base_title"/>
<meta content="103723" name="govuk:search-result-count"/>
<meta content="summary" name="twitter:card"/>
<meta content="&lt;EA73&gt;&lt;CO1133&gt;" name="govuk:analytics:organisations"/>
<meta content="2023-02-07T11:26:26Z" n

## Grabbing all the link text

We check we can grab the text which is linked to each case. To do this we add `.select()` to the `soup` object, and specify a **selector** inside the brackets.

To grab all the links on the page we would target the `<a>` tag. The selector for that is simply `'a'`.

In [3]:
#select all the <a> tags inside the soup object (the webpage)
#because we are not storing this inside a variable it will simply display the results
soup.select('a')

[<a class="gem-c-skip-link govuk-skip-link govuk-!-display-none-print" data-module="govuk-skip-link" href="#content">Skip to main content</a>,
 <a class="govuk-link" data-module="gem-track-click" data-track-action="Cookie banner settings clicked from confirmation" data-track-category="cookieBanner" href="/help/cookies">change your cookie settings</a>,
 <a class="govuk-link" data-module="gem-track-click" data-track-action="Cookie banner settings clicked from confirmation" data-track-category="cookieBanner" href="/help/cookies">change your cookie settings</a>,
 <a class="govuk-link" href="/help/cookies">View cookies</a>,
 <a class="govuk-header__link govuk-header__link--homepage" data-ga4-link='{"event_name":"navigation","type":"header menu bar","external":"false","text":"GOV.UK","section":"Logo","index":{"index_link":1,"index_section":0,"index_section_count":2},"index_total":1}' data-track-action="logoLink" data-track-category="headerClicked" data-track-dimension="GOV.UK" data-track-dim

### Measuring the length, and showing just one result

We might want to know how many results we have got. To do that we can use Python's `len()` function (short for 'length') which tells you how many items there are in a list (how long it is).

We also might want to look at the first or last item in the list to see which item it grabs first or last. To do that we can use an index like `[0]` for 'the first' and `[-1]` for 'the last'.

In both cases it makes sense to store the results in a variable first - we will call this 'cases'.

In [5]:
#create a variable called 'cases'
#and use it to store the results of selecting all <a> tags in 'soup'
cases = soup.select('a')

#print the length
print(len(cases))

#print the first match
print(cases[0])

133
<a class="gem-c-skip-link govuk-skip-link govuk-!-display-none-print" data-module="govuk-skip-link" href="#content">Skip to main content</a>


### Refining our selector

We now know that the selector grabbed over 100 links. And looking at what has been grabbed, we can see that it's getting links we don't want too.

We don't just want *all* the links - we just want the links to tribunal decisions.

We need to look at the HTML for those links to see if they have some additional attributes we can target - or if we can target a parent tag that contains the same information.

### Using the Inspector


You can do that by viewing the page's source HTML, but [using the Inspector](https://blog.hubspot.com/website/how-to-inspect) is even quicker.

If you right-click on a link (or anything else) and select **Inspect**, this will open the Inspector with the specific HTML for that element highlighted.

The HTML for one of the tribunal links is quite long, with lots of attributes. It helps to copy it into a separate file and break it up by putting each new attribute on its own line, like so:

```
<a
data-ga4-ecommerce-path="/employment-tribunal-decisions/mrs-p-marques-v-just-kidd-inn-ltd-3200680-slash-2023"
data-ecommerce-path="/employment-tribunal-decisions/mrs-p-marques-v-just-kidd-inn-ltd-3200680-slash-2023"
data-ecommerce-row="1"
data-ecommerce-index="2"
data-track-category="navFinderLinkClicked"
data-track-action="Employment tribunal decisions.2"
data-track-label="/employment-tribunal-decisions/mrs-p-marques-v-just-kidd-inn-ltd-3200680-slash-2023"
data-track-options="{&quot;dimension28&quot;:50,&quot;dimension29&quot;:&quot;Mrs P Marques v Just Kidd Inn Ltd: 3200680/2023&quot;}"
class="  govuk-link"
href="/employment-tribunal-decisions/mrs-p-marques-v-just-kidd-inn-ltd-3200680-slash-2023">
Mrs P Marques v Just Kidd Inn Ltd: 3200680/2023</a>
```

There are 10 different attributes here, including the `href=` attribute that defines which URL the linked text will take you to when clicked.



### Using a parent tag

If the specific tag you want to target is problematic, it's worth looking at the tags around it. In particular, look for tags which *contain* your tag - called **parent** tags. Parent tags can often be targeted to gather the same information.

The Inspector helpfully formats the HTML to make it easy to see if a tag is a parent tag (or, conversely, a **child** of a parent tag), by indenting tags:

* A tag indented below another is a child tag. The tag above (which is indented less) is its parent tag.
* Two tags which are indented the same amount are **siblings**: that is, they both share the same parent, and neither is a parent to the other.
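
The same relationships can also be checked in code. The sketch below uses a simplified, made-up HTML fragment (an assumption for illustration, not the real page's markup) and two of Beautiful Soup's navigation tools: `.parent` and `.find_next_sibling()`.

```python
from bs4 import BeautifulSoup

#a simplified, made-up fragment (not copied from the real page)
html = """<li>
  <div class="title"><a href="/case-1">Case 1</a></div>
  <time datetime="2023-08-04">4 August 2023</time>
</li>"""
snippet = BeautifulSoup(html, 'html.parser')

#grab the <a> tag
link = snippet.select('a')[0]
#the tag that contains the <a> tag is its parent
parent_name = link.parent.name
#the <time> tag shares a parent with the <div>, so they are siblings
sibling_name = link.parent.find_next_sibling('time').name
print(parent_name, sibling_name)
```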

In this case the `<a>` tag is indented below a `<div>` tag, so we might turn our attention to that.

Here's the `div` tag in full:

`<div class="gem-c-document-list__item-title">`

This tag contains the `<a>` tag and nothing else, and as it's much simpler it is easier to target.

The `class=` attribute is something we will need to include because `<div>` is a frequently used tag and there are likely to be others elsewhere on the page that we don't want.

In [8]:
#grab all the <div> tags with class="gem-c-document-list__item-title"
divswewant = soup.select('div[class="gem-c-document-list__item-title"]')
#show how many items we got
print(len(divswewant))
#print the first item
print(divswewant[0])
#print the last item
print(divswewant[-1])

50
<div class="gem-c-document-list__item-title"> <a class="govuk-link" data-ecommerce-index="1" data-ecommerce-path="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022" data-ecommerce-row="1" data-ga4-ecommerce-path="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022" data-track-action="Employment tribunal decisions.1" data-track-category="navFinderLinkClicked" data-track-label="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022" data-track-options='{"dimension28":50,"dimension29":"Ms I M de Araújo Ramos Fernandes v Eden Brook Home Care Ltd: 3205112/2022"}' href="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022">Ms I M de Araújo Ramos Fernandes v Eden Brook Home Care Ltd: 3205112/2022</a>
</div>
<div class="gem-c-document-list__item-title"> <a class="g

This is much better: we get the number of results we want, and the first and last items match what we would expect them to be (the first and last cases).
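
One detail worth knowing: the selector `div[class="..."]` only matches when the `class=` attribute is *exactly* that text. CSS also has a dot shorthand (`div.classname`) which matches any tag that has that class, even alongside other classes. The sketch below compares the two on a made-up fragment (the `highlighted` class is invented for illustration and is not on the real page).

```python
from bs4 import BeautifulSoup

#a made-up fragment: the second <div> has an extra class
html = """
<div class="gem-c-document-list__item-title"><a href="/a">Case A</a></div>
<div class="gem-c-document-list__item-title highlighted"><a href="/b">Case B</a></div>
<div class="something-else"><a href="/c">Not a case</a></div>
"""
snippet = BeautifulSoup(html, 'html.parser')

#the attribute selector matches only when class= is exactly this text
exact = snippet.select('div[class="gem-c-document-list__item-title"]')
#the dot shorthand matches any <div> that has this class among its classes
shorthand = snippet.select('div.gem-c-document-list__item-title')
print(len(exact), len(shorthand))
```

Either approach works on this page; the shorthand may be a little more forgiving if extra classes are added to the tag later.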

### Shifting our attention to another tag: `<time>`

We can repeat the same process for another piece of information: the date.

The HTML containing that information is:

`<time datetime="2023-08-04">4 August 2023</time>`

This is very promising: `<time>` is a very specific tag that we might hope isn't used anywhere else on the page.

Let's find out - we're hoping to see 50 results when we grab them.

In [15]:
#this grabs the <time> tags
times = soup.select('time')
#check the length
print(len(times))
#check the first
print(times[0])
#check the last
print(times[-1])

50
<time datetime="2023-08-04">4 August 2023</time>
<time datetime="2023-08-03">3 August 2023</time>


## Grabbing just the text: `.get_text()`

So far we've grabbed tags *and* their contents. That might be fine - we could clean it up in a spreadsheet, for example - but there is a way to grab just the text inside the tags: `.get_text()`

Here's an example of using it on an item from each list:

In [11]:
#grab the first item from the list 'divswewant'
#apply the .get_text() method to it to grab the text
divswewant[0].get_text()

' Ms I M de Araújo Ramos Fernandes v Eden Brook Home Care Ltd: 3205112/2022\n'

In [16]:
#grab the first item from the list 'times'
#apply the .get_text() method to it to grab the text
times[0].get_text()

'4 August 2023'
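
Notice that the case title comes back with a stray space at the start and a line break (`\n`) at the end. One way to tidy that up, sketched below on a made-up fragment modelled on the output above, is to pass `strip=True` to `.get_text()`, or to use Python's own `.strip()` string method.

```python
from bs4 import BeautifulSoup

#a fragment modelled on the output above, with the same stray space and line break
html = '<div class="gem-c-document-list__item-title"> <a href="/case">Mrs P Marques v Just Kidd Inn Ltd: 3200680/2023</a>\n</div>'
snippet = BeautifulSoup(html, 'html.parser')
div = snippet.select('div')[0]

#the text as it comes, including the extra whitespace
raw = div.get_text()
#option 1: ask Beautiful Soup to strip the whitespace
stripped = div.get_text(strip=True)
#option 2: use Python's own string method on the result
also_stripped = raw.strip()
print(repr(raw))
print(repr(stripped))
```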

## Using a loop to create a new list of text entries

We can generate a new list from our existing list of tag-and-text entries by applying `.get_text()` to each of the entries in turn, then adding the resulting text to a new list.

To do this we:

* Create an empty list
* Loop through the items in the existing list (of tags-and-text)
* Apply `.get_text()` to each item
* Add the resulting text to the empty list (which is now not empty any more)

Once the loop has finished, our previously empty list should be full of the extracted text.

Here's the code:

In [14]:
#create an empty list
casetitles = []

#loop through the divswewant list
for i in divswewant:
  #extract the text
  casename = i.get_text()
  #add the text to the previously empty list
  casetitles.append(casename)

#show the first 5 rows
casetitles[:5]

[' Ms I M de Araújo Ramos Fernandes v Eden Brook Home Care Ltd: 3205112/2022\n',
 ' Mrs P Marques v Just Kidd Inn Ltd: 3200680/2023\n',
 ' Mrs K Janusz v ABC Distribution Ltd: 3200710/2023\n',
 ' Mr D Singh v Wanis Management Services LLP: 3200018/2023\n',
 ' Mr A Sesay v Bardwood Support Services Ltd: 3205655/2022\n']

In [18]:
#create an empty list
datelist = []

#loop through the times list
for i in times:
  #extract the text
  timetext = i.get_text()
  #add the text to the previously empty list
  datelist.append(timetext)

#show the first 5 rows
datelist[:5]

['4 August 2023',
 '15 August 2023',
 '15 August 2023',
 '2 August 2023',
 '15 August 2023']
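
The loops above can also be written more compactly as a **list comprehension**, which builds the new list in a single line. This is an optional alternative rather than a replacement; the sketch uses a small made-up fragment and the hypothetical name `sample_times` so that it runs on its own.

```python
from bs4 import BeautifulSoup

#two made-up <time> tags in the same format as the page
html = '<time datetime="2023-08-04">4 August 2023</time><time datetime="2023-08-15">15 August 2023</time>'
snippet = BeautifulSoup(html, 'html.parser')
sample_times = snippet.select('time')

#the loop version, as used above
loop_dates = []
for i in sample_times:
  loop_dates.append(i.get_text())

#the list comprehension version: the same steps in one line
comprehension_dates = [i.get_text() for i in sample_times]
print(comprehension_dates)
```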

## Creating a dataframe from those lists

Lists can be used to make the columns in a dataframe.

To create a dataframe use the `pandas` function `DataFrame()`.

Note the capitalisation on the function name: the D and F must both be upper case for it to work (with no spaces).

Because it's a `pandas` function, we need to name the library as well. And because the `pandas` library was imported as `pd`, that's the name we use: so, `pd.DataFrame()`

The main ingredient for the function `pd.DataFrame()` is the data you want to add. This can be in the form of a **dictionary**.

The dictionary object that we create below has two keys - `"case name"` and `"date"`. The keys are used as column names.

Against each key is a value - and in this case each value is not just a single item, but a whole list. A list will be treated as a column of values to go under each column name.

In [20]:
#create a new dataframe which uses those two lists as its two columns
casedataframe = pd.DataFrame({"case name" : casetitles, "date" : datelist})

#show the first 5 rows
casedataframe[:5]

| | case name | date |
|---|---|---|
| 0 | Ms I M de Araújo Ramos Fernandes v Eden Brook... | 4 August 2023 |
| 1 | Mrs P Marques v Just Kidd Inn Ltd: 3200680/20... | 15 August 2023 |
| 2 | Mrs K Janusz v ABC Distribution Ltd: 3200710/... | 15 August 2023 |
| 3 | Mr D Singh v Wanis Management Services LLP: 3... | 2 August 2023 |
| 4 | Mr A Sesay v Bardwood Support Services Ltd: 3... | 15 August 2023 |
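
This only works if every list is the same length: `pandas` raises an error if one column would be longer than another. That is one reason it was worth checking that both selectors returned 50 results. A minimal sketch, using made-up sample lists in place of `casetitles` and `datelist`:

```python
import pandas as pd

#made-up sample lists standing in for casetitles and datelist
sample_titles = ['Case A', 'Case B', 'Case C']
sample_dates = ['4 August 2023', '15 August 2023']

#lists of different lengths cannot be combined into columns
try:
  pd.DataFrame({"case name": sample_titles, "date": sample_dates})
  mismatch_caught = False
except ValueError:
  mismatch_caught = True

#once the lists are the same length the dataframe is created
sample_dates.append('2 August 2023')
sample_df = pd.DataFrame({"case name": sample_titles, "date": sample_dates})
print(mismatch_caught, sample_df.shape)
```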


## Export the data

Now we have a dataframe, we can export it. To do that add `.to_csv()` to the dataframe, and put the name of the file you want to export it as inside the brackets, remembering to include quotation marks.

In [21]:
casedataframe.to_csv("tribunaldata.csv")

That data can now be downloaded from the **Files** area (the folder icon) on the left of the Colab notebook.
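
By default `.to_csv()` also writes the dataframe's index (the row numbers) as an extra, unnamed first column. If you don't want that column, you can add `index=False`. The sketch below writes a small made-up dataframe to an in-memory buffer rather than a file, purely so the two header rows can be compared.

```python
import io
import pandas as pd

#a small made-up dataframe
sample_df = pd.DataFrame({"case name": ['Case A', 'Case B'],
                          "date": ['4 August 2023', '15 August 2023']})

#write to in-memory text buffers instead of files, so we can inspect the result
with_index = io.StringIO()
sample_df.to_csv(with_index)
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)

#compare the header rows
header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```

With a real file the equivalent would be `casedataframe.to_csv("tribunaldata.csv", index=False)`.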

## Bonus: extract the `datetime` attribute

In our code above we grabbed the text for the date of each tribunal decision - but the same date is also encoded in the `datetime=` attribute of the HTML tag that contains it, in YYYY-MM-DD format. For example:

`<time datetime="2022-02-07">7 February 2022</time>`

Quite often the data that we want (or would prefer) is stored in this way.

For example, if we want the URLs for any linked text, we would need to grab the value of the `href=` attribute.

This means we need something other than `.get_text()` to grab it.

To extract an attribute you need to add the name of the attribute as a string inside square brackets after a match item (*not after the whole list*), like so:

`['datetime']`

`['href']`

Here's some code putting that into practice in different ways:

In [23]:
#show the first match in full
print(times[0])
#now show the datetime attribute of that match
print(times[0]['datetime'])

<time datetime="2023-08-04">4 August 2023</time>
2023-08-04
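
Square brackets assume the attribute is there: if a tag doesn't have it, Python raises a `KeyError`. Beautiful Soup tags also have a `.get()` method, which returns `None` instead. The sketch below shows the difference on a made-up fragment where the second `<time>` tag has no `datetime=` attribute (on the real page every `<time>` tag we saw had one).

```python
from bs4 import BeautifulSoup

#made-up fragment: the second <time> tag has no datetime= attribute
html = '<time datetime="2023-08-04">4 August 2023</time><time>date unknown</time>'
snippet = BeautifulSoup(html, 'html.parser')
sample_times = snippet.select('time')

#.get() returns the value, or None when the attribute is missing
first_value = sample_times[0].get('datetime')
missing_value = sample_times[1].get('datetime')

#square brackets raise a KeyError when the attribute is missing
try:
  sample_times[1]['datetime']
  raised_error = False
except KeyError:
  raised_error = True
print(first_value, missing_value, raised_error)
```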


And now, a loop to grab them all and put them in a list.

In [38]:
#create an empty list
timeslist = []

#loop through the times
for i in times:
  #extract the text inside datetime=""
  datetimevalue = i['datetime']
  #add it to our list
  timeslist.append(datetimevalue)

#check the first 5 results
timeslist[:5]

['2023-08-04', '2023-08-15', '2023-08-15', '2023-08-02', '2023-08-15']

## Bonus: extracting parts of a string

It's worth pointing out that you can do additional work in Python to extract just part of the data you have.

Here's some more code where we extract characters at particular positions in a string, using indices.

In [30]:
#store the datetime value of the first match
testdatetime = times[0]['datetime']
#now extract the first 4 characters (the year)
print(testdatetime[0:4])
#and the characters at positions 8 and 9 (the day)
print(testdatetime[8:10])
#and the characters at positions 5 and 6 (the month)
#converting to an integer at the same time
print(int(testdatetime[5:7]))

2023
04
8


In [31]:
#create 4 empty lists
datetimes = []
years = []
days = []
months = []

#loop through those <time> tag matches
for i in times:
  #extract the datetime= value
  casedate = i['datetime']
  #add to the datetimes list
  datetimes.append(casedate)
  #'slice' the string to extract the year, month and day, adding each to a different list
  years.append(int(casedate[0:4]))
  months.append(int(casedate[5:7]))
  days.append(int(casedate[8:10]))
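
Slicing works because the format is always YYYY-MM-DD, but the strings are still just text. To treat them *as* dates, one option is Python's built-in `datetime` module. The sketch below uses two sample values copied from the results above; `date.fromisoformat()` converts a YYYY-MM-DD string into a date object, which can then tell us things like the day of the week.

```python
from datetime import date

#sample values in the same YYYY-MM-DD format as the datetime= attribute
sample_datetimes = ['2023-08-04', '2023-08-15']

#convert each string into a date object
parsed = [date.fromisoformat(d) for d in sample_datetimes]

first = parsed[0]
#year, month and day are now available as numbers
print(first.year, first.month, first.day)
#.weekday() gives the day of the week as a number, where Monday is 0
print(first.weekday())
#strftime('%A') gives the name of the day of the week
print(first.strftime('%A'))
```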

## Bonus: grabbing the link URL

The same principles can be applied to extract the URL that each case title linked *to*.

In this case we have a little problem, because we targeted the `<div>` tag when scraping the case titles, but the `<div>` tag doesn't contain the URL that each case title linked to - that information is contained inside an `<a>` tag, which is a **child** of the `<div>` tag.

The good news is that we can use `.select()` on these results too.

For example:

In [33]:
#grab all the <div> tags with class="gem-c-document-list__item-title"
divswewant = soup.select('div[class="gem-c-document-list__item-title"]')
#grab all the <a> tags inside the first match
print(divswewant[0].select('a'))
#drill down to the first match (there's only one)
print(divswewant[0].select('a')[0])
#grab the href= attribute value from that
print(divswewant[0].select('a')[0]['href'])

[<a class="govuk-link" data-ecommerce-index="1" data-ecommerce-path="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022" data-ecommerce-row="1" data-ga4-ecommerce-path="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022" data-track-action="Employment tribunal decisions.1" data-track-category="navFinderLinkClicked" data-track-label="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022" data-track-options='{"dimension28":50,"dimension29":"Ms I M de Araújo Ramos Fernandes v Eden Brook Home Care Ltd: 3205112/2022"}' href="/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022">Ms I M de Araújo Ramos Fernandes v Eden Brook Home Care Ltd: 3205112/2022</a>]
<a class="govuk-link" data-ecommerce-index="1" data-ecommerce-path="/employment-tribunal-decisions/ms-i-m-de-ar
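
Two shortcuts may be useful here, sketched on a shortened, made-up version of one result (the `href` value is invented). `.select_one()` returns the first match directly, so no `[0]` is needed, and a space inside a selector means 'inside', so the `<a>` tags can be targeted from the `soup` level in one step.

```python
from bs4 import BeautifulSoup

#a shortened, made-up version of one result
html = '<div class="gem-c-document-list__item-title"> <a class="govuk-link" href="/employment-tribunal-decisions/example-case">Example case</a>\n</div>'
snippet = BeautifulSoup(html, 'html.parser')
div = snippet.select('div')[0]

#.select() returns a list, so we need [0] to get the tag itself
from_select = div.select('a')[0]['href']
#.select_one() returns the first match directly (or None if nothing matches)
from_select_one = div.select_one('a')['href']
#a space in a selector means 'inside': <a> tags inside the matching <div> tags
from_descendant = snippet.select('div[class="gem-c-document-list__item-title"] a')[0]['href']
print(from_select_one)
```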

Once we've got that working with one item, we can put it inside a loop to create a new list.

In [36]:
#create an empty list
url_list = []

#loop through the div matches
for i in divswewant:
  #select <a> tags inside that, and grab the first match
  atag = i.select('a')[0]
  #extract the URL inside href=""
  hrefvalue = atag['href']
  #add it to our list
  url_list.append(hrefvalue)

#check the first 5 results
url_list[:5]

['/employment-tribunal-decisions/ms-i-m-de-araujo-ramos-fernandes-v-eden-brook-home-care-ltd-3205112-slash-2022',
 '/employment-tribunal-decisions/mrs-p-marques-v-just-kidd-inn-ltd-3200680-slash-2023',
 '/employment-tribunal-decisions/mrs-k-janusz-v-abc-distribution-ltd-3200710-slash-2023',
 '/employment-tribunal-decisions/mr-d-singh-v-wanis-management-services-llp-3200018-slash-2023',
 '/employment-tribunal-decisions/mr-a-sesay-v-bardwood-support-services-ltd-3205655-slash-2022']
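
These are *relative* URLs: they start with `/` and leave out the site's address. To turn them into full URLs that can be fetched or clicked, one option is `urljoin()` from Python's built-in `urllib.parse` module, sketched here with two of the links above and the address of the page we scraped.

```python
from urllib.parse import urljoin

#the address of the page that was scraped
base_url = "https://www.gov.uk/employment-tribunal-decisions"
#two of the relative links extracted above
relative_urls = ['/employment-tribunal-decisions/mrs-p-marques-v-just-kidd-inn-ltd-3200680-slash-2023',
                 '/employment-tribunal-decisions/mrs-k-janusz-v-abc-distribution-ltd-3200710-slash-2023']

#combine the base address with each relative link
full_urls = [urljoin(base_url, u) for u in relative_urls]
print(full_urls[0])
```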

## Store these in the dataframe too

We can expand the earlier code to include all four lists we created.



In [39]:
#create a new dataframe which uses those four lists as its four columns
casedataframe = pd.DataFrame({"case name" : casetitles,
                              "date" : datelist,
                              "datetime":timeslist,
                              "url": url_list})

#show the first 5 rows
casedataframe[:5]

| | case name | date | datetime | url |
|---|---|---|---|---|
| 0 | Ms I M de Araújo Ramos Fernandes v Eden Brook... | 4 August 2023 | 2023-08-04 | /employment-tribunal-decisions/ms-i-m-de-arauj... |
| 1 | Mrs P Marques v Just Kidd Inn Ltd: 3200680/20... | 15 August 2023 | 2023-08-15 | /employment-tribunal-decisions/mrs-p-marques-v... |
| 2 | Mrs K Janusz v ABC Distribution Ltd: 3200710/... | 15 August 2023 | 2023-08-15 | /employment-tribunal-decisions/mrs-k-janusz-v-... |
| 3 | Mr D Singh v Wanis Management Services LLP: 3... | 2 August 2023 | 2023-08-02 | /employment-tribunal-decisions/mr-d-singh-v-wa... |
| 4 | Mr A Sesay v Bardwood Support Services Ltd: 3... | 15 August 2023 | 2023-08-15 | /employment-tribunal-decisions/mr-a-sesay-v-ba... |
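
At this point the `datetime` column still holds text. To convert it so that `pandas` treats the values as dates, one option is `pd.to_datetime()`. The sketch below uses a small made-up dataframe (`sample_df`) with three of the values above; the `.dt` accessor then lets us pull out the month, or the name of the day each date fell on.

```python
import pandas as pd

#a small made-up dataframe with values in the same format as the datetime column
sample_df = pd.DataFrame({"datetime": ['2023-08-04', '2023-08-15', '2023-08-02']})

#convert the text column into a proper datetime column
sample_df['datetime'] = pd.to_datetime(sample_df['datetime'], format='%Y-%m-%d')
#the .dt accessor gives access to parts of each date
sample_df['month'] = sample_df['datetime'].dt.month
sample_df['weekday'] = sample_df['datetime'].dt.day_name()
print(sample_df)
```

The same two steps could be applied to `casedataframe['datetime']`.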
