# Get Hansard data for 15 March 2021

Below is a screenshot of the Hansard page for 15 March 2021. We are interested in the area highlighted by the red box. The blue box shows a **folder** and the orange box shows a **subfolder**. Within each folder or subfolder are **cards** (marked with stars), each of which links to a page containing the text.

![Folders](meta/folders.png)

Below is an example of the page that a card URL links to. We take all paragraph text, as highlighted by the green boxes.

![Card](meta/card.png)

All results are then saved to a dataframe.

Below we break the code down step by step so you can see it in action. The original code is in `src/main.py`.

In [74]:
from datetime import datetime, timedelta

import requests
from bs4 import BeautifulSoup as Soup
import pandas as pd

In [75]:
# Set a date (YYYY-MM-DD, as used in the Hansard URL)
date = '2021-03-15'
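
The imports include `timedelta`, which the cells shown here never use. One plausible use, which is an assumption rather than something this notebook confirms, is stepping through several consecutive days. A minimal sketch, where `date_strings` and its parameters are hypothetical names introduced only for illustration:

```python
from datetime import datetime, timedelta


def date_strings(start, days):
    """Return `days` consecutive dates as 'YYYY-MM-DD' strings, starting at `start`.

    Hypothetical helper: not part of the original code.
    """
    first = datetime.strptime(start, "%Y-%m-%d")
    return [(first + timedelta(days=i)).strftime("%Y-%m-%d") for i in range(days)]


dates = date_strings("2021-03-15", 3)

# Same URL pattern as the notebook: base path plus the date string
urls = ["https://hansard.parliament.uk/commons/" + d for d in dates]
```

Parsing with `strptime` also rejects a mistyped date early, before any request is made.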

In [76]:
# Download data from the page
url = 'https://hansard.parliament.uk/commons/' + date
response = requests.get(url, timeout=5)
page_content = Soup(response.content, "html.parser")
folders = page_content.find_all("div", attrs={"class": "card-folder"})

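`folders` holds the HTML for the selected areas of the page (the blue areas in the first screenshot). We select them with BeautifulSoup's attribute filter, searching for `div` elements with the `card-folder` class. Because subfolders also carry the `card-folder` class (see the output below), they appear in this list as well.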
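
To see the class matching on something small, here is a sketch that runs the same `find_all` call on a hand-written HTML fragment. The fragment is a simplified assumption modelled on the output shown in this notebook, not the real page.

```python
from bs4 import BeautifulSoup as Soup

# Simplified, hand-written fragment (assumption): one folder containing one subfolder
sample_html = """
<div class="card-folder">
  <h2>Commons Chamber</h2>
  <div class="card-folder card-folder-section">
    <h3>Oral Answers to Questions</h3>
  </div>
</div>
"""

soup = Soup(sample_html, "html.parser")

# A single class name matches any element whose class list contains it,
# so both the folder and the nested subfolder are returned.
matches = soup.find_all("div", attrs={"class": "card-folder"})

# First h2 or h3 inside each match gives the folder or subfolder title
headings = [m.find(["h2", "h3"]).get_text(strip=True) for m in matches]
print(headings)
```

This is why the later loop has to tell folders and subfolders apart by checking for an `h2`.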

In [77]:
folders

[<div class="card-folder">
 <a class="toggle" data-expand-on-load="true" href="#">
 <h2>Commons Chamber</h2>
 <p>Record of debates that happened in the House of Commons main chamber.</p>
 </a>
 <div class="contents collapse">
 <a class="card card-section" href="/commons/2021-03-15/debates/72DFF73F-9E31-4651-935A-47EE0BFFD2B9/HouseOfCommons">
 <div class="card-inner">
 <div class="content">
 <div class="primary-info">
                     House of Commons
                     <span class="sr-only">(item number 1)</span>
 </div>
 </div>
 </div>
 </a>
 <a class="card card-section" href="/commons/2021-03-15/debates/0AA3FB1E-8056-49FC-923D-BBAF53D50E66/Prayers">
 <div class="card-inner">
 <div class="content">
 <div class="primary-info">
                     Prayers
                     <span class="sr-only">(item number 2)</span>
 </div>
 </div>
 </div>
 </a>
 <div class="card-folder card-folder-section">
 <a class="toggle" href="#">
 <h3>Oral Answers to Questions</h3>
 </a>
 <div class="c

In [78]:
folder_name = ''
folder_desc = ''

folder_name_lst = []
folder_desc_lst = []
sub_folder_name_lst = []
card_lst = []
link_lst = []
text_lst = []

For each folder (blue box) we look for subfolders and cards. We then follow the URL of each card and collect its text.

In [79]:
for folder in folders:
    # Searching just for H2 text which denotes the folder name of the publication
    if len(folder.find_all('h2')) > 0:
        folder_name = folder.text.split("\n")[2]
        folder_desc = folder.text.split("\n")[3]
        sub_folder_name = ''

    else:
        # Otherwise it's a subfolder
        sub_folder_name = folder.text.split("\n")[2]

    cards = folder.find_all("a", attrs={"class": "card card-section"})
    for card in cards:
        # Gets name of the card
        card_lst.append(card.text.split('\r\n')[1].strip())

        # Gets each card in each folder and finds its URL link
        link = 'https://hansard.parliament.uk/' + card.attrs.get('href')
        folder_name_lst.append(folder_name)
        folder_desc_lst.append(folder_desc)
        sub_folder_name_lst.append(sub_folder_name)
        link_lst.append(link)

        # Goes to the relevant URL
        response = requests.get(link, timeout=5)
        page_content = Soup(response.content, "html.parser")
        content = page_content.find("div", attrs={"class": "col-lg-9 primary-content"})
        paras = content.find_all("p")

        text_out = ''
        for para in paras:
            # Collects all relevant text for the page
            if len(para.text) > 5:
                text_out = text_out + ' ' + para.text
        text_lst.append(text_out)
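
The loop above assumes every card page contains the `col-lg-9 primary-content` div; if `find` returns `None`, the call to `find_all` would raise an error. The sketch below isolates the text-collection step and adds a guard. The name `extract_text` and the sample HTML are hypothetical, and unlike the loop it joins paragraphs without a leading space.

```python
from bs4 import BeautifulSoup as Soup


def extract_text(html, min_len=5):
    """Join the paragraph text from the main content div of a card page.

    Hypothetical helper: returns an empty string if the div is missing.
    """
    soup = Soup(html, "html.parser")
    content = soup.find("div", attrs={"class": "col-lg-9 primary-content"})
    if content is None:
        return ""
    paras = [p.get_text() for p in content.find_all("p")]
    # Same length filter as the loop: keep paragraphs longer than min_len characters
    return " ".join(t for t in paras if len(t) > min_len)


# Hand-written sample page (assumption), not real Hansard markup
sample_page = (
    '<div class="col-lg-9 primary-content">'
    "<p>Short</p>"
    "<p>The House met at half-past Two o'clock</p>"
    "<p>Prayers were read.</p>"
    "</div>"
)

text = extract_text(sample_page)
missing = extract_text("<p>No content div on this page</p>")
```

Here `text` skips the five-character paragraph, and `missing` is an empty string rather than an exception.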

We now have several lists holding the folder names, card names, URL links and card text. For example, the entries at index 3 are:

In [80]:
print('\n'.join([
    folder_name_lst[3],
    folder_desc_lst[3],
    sub_folder_name_lst[3],
    card_lst[3],
    link_lst[3],
    text_lst[3],
]))

Commons Chamber
Record of debates that happened in the House of Commons main chamber.

Military Housing: Annington Homes
https://hansard.parliament.uk//commons/2021-03-15/debates/6ECDE577-00AD-4C01-BB1D-64272D5005A8/MilitaryHousingAnningtonHomes
  What recent discussions his Department has had with representatives of Annington Homes on the sale of military housing.  (913319) Before I turn to Question 1, on behalf of the Government I wish to pay tribute to Sergeant Gavin Hillier of the Welsh Guards, who tragically died in an accident during live-firing exercises in Wales earlier this month. Sergeant Hillier’s distinguished service throughout his career was a tribute not only to his own dedication to duty but to his family and to his regiment, who continue to prepare for operations in Iraq later this year. I thank my hon. Friend the Member for Orpington (Gareth Bacon) for his close interest in this issue, which is also actively pursued by my right hon. Friend the Member for Preseli Pembr
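
The link printed above contains a double slash (`//commons/...`) because the base URL ends with `/` and the card's `href` starts with `/`. A minimal sketch of one way to avoid this, using `urljoin` from the standard library; `BASE_URL` is a name introduced here for illustration, and the `href` is copied from the `folders` output earlier in this notebook.

```python
from urllib.parse import urljoin

# Illustrative constant (assumption): the site root used throughout the notebook
BASE_URL = "https://hansard.parliament.uk/"

# href value as it appears in the card HTML
href = "/commons/2021-03-15/debates/0AA3FB1E-8056-49FC-923D-BBAF53D50E66/Prayers"

# urljoin resolves the absolute path against the base, so no slash is duplicated
link = urljoin(BASE_URL, href)
print(link)
```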

Finally, we combine these lists into a dataframe, with one row per card.

In [81]:
df = pd.DataFrame(list(zip(folder_name_lst, folder_desc_lst, sub_folder_name_lst, card_lst, link_lst,
                           text_lst)),
                  columns=['Folder', 'Description', 'Subfolder', 'Card', 'Link', 'Text'])

df['date_collected'] = datetime.now().strftime('%d/%m/%Y')

df.head(5)

| | Folder | Description | Subfolder | Card | Link | Text | date_collected |
|---|---|---|---|---|---|---|---|
| 0 | Commons Chamber | Record of debates that happened in the House o... | | House of Commons | https://hansard.parliament.uk//commons/2021-03... | Monday 15 March 2021 The House met at half-pa... | 30/03/2021 |
| 1 | Commons Chamber | Record of debates that happened in the House o... | | Prayers | https://hansard.parliament.uk//commons/2021-03... | [Mr Speaker in the Chair] Virtual participati... | 30/03/2021 |
| 2 | Commons Chamber | Record of debates that happened in the House o... | | | https://hansard.parliament.uk//commons/2021-03... | The Secretary of State was asked— What recen... | 30/03/2021 |
| 3 | Commons Chamber | Record of debates that happened in the House o... | | Military Housing: Annington Homes | https://hansard.parliament.uk//commons/2021-03... | What recent discussions his Department has h... | 30/03/2021 |
| 4 | Commons Chamber | Record of debates that happened in the House o... | | Defence Estate Optimisation Programme | https://hansard.parliament.uk//commons/2021-03... | What plans he has to review the defence esta... | 30/03/2021 |


As shown, row 2 has no Card name. This is because the page it links to combines the text of a number of subsequent cards, such as Military Housing: Annington Homes and Defence Estate Optimisation Programme. We therefore drop all rows where the Card field is empty.

In [82]:
df = df[df.Card != '']
df.head(5)

| | Folder | Description | Subfolder | Card | Link | Text | date_collected |
|---|---|---|---|---|---|---|---|
| 0 | Commons Chamber | Record of debates that happened in the House o... | | House of Commons | https://hansard.parliament.uk//commons/2021-03... | Monday 15 March 2021 The House met at half-pa... | 30/03/2021 |
| 1 | Commons Chamber | Record of debates that happened in the House o... | | Prayers | https://hansard.parliament.uk//commons/2021-03... | [Mr Speaker in the Chair] Virtual participati... | 30/03/2021 |
| 3 | Commons Chamber | Record of debates that happened in the House o... | | Military Housing: Annington Homes | https://hansard.parliament.uk//commons/2021-03... | What recent discussions his Department has h... | 30/03/2021 |
| 4 | Commons Chamber | Record of debates that happened in the House o... | | Defence Estate Optimisation Programme | https://hansard.parliament.uk//commons/2021-03... | What plans he has to review the defence esta... | 30/03/2021 |
| 5 | Commons Chamber | Record of debates that happened in the House o... | | Cyber Warfare | https://hansard.parliament.uk//commons/2021-03... | What steps his Department is taking to reduc... | 30/03/2021 |
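
The filter keeps the original index labels, which is why the output jumps from 1 to 3. The sketch below shows the same filter on a small made-up dataframe (the `toy` data is an assumption for illustration), followed by an optional `reset_index` if consecutive row numbers are preferred.

```python
import pandas as pd

# Made-up rows (assumption): the middle row has an empty Card name
toy = pd.DataFrame({
    "Card": ["Prayers", "", "Cyber Warfare"],
    "Text": ["first", "second", "third"],
})

# Same filter as the notebook: drop rows whose Card field is an empty string
kept = toy[toy.Card != ""]

# Optional: renumber the remaining rows from 0
renumbered = kept.reset_index(drop=True)

print(list(kept.index))
print(list(renumbered.index))
```

Whether to renumber depends on whether the original row positions are needed later, for example to trace a row back to the unfiltered lists.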
