# Data acquisition: Speeches BoE 2022 - 2025
To get the most recent speeches for our analysis, we scrape the www.bankofengland.co.uk website. This allows us to obtain each speech's date, title, author, and a link to its content.

In a separate step, we create a dataframe with this information and then request every link to obtain the actual text of each speech.

We will later merge our initial speeches dataset with the one resulting from this procedure. This ensures the analysis is based on the most comprehensive and up-to-date information available, in line with the project's objectives outlined in the employer brief (Bank of England Project…).

In [1]:
# Import libraries
import pandas as pd               # for data handling
import requests                   # for web scraping
from bs4 import BeautifulSoup
import re                         # for regular expressions
from datetime import datetime
from time import sleep            # for introducing pauses 


### Retrieve speeches metadata from BoE website (scraping)

The scraping focuses on the BoE speeches section (https://www.bankofengland.co.uk/news/speeches).
The cookie contains a session verification token that the site's API requires. Sending it with every request ensures that subsequent calls to the API are accepted, so the scraper can retrieve structured data directly rather than parsing full HTML pages.

In [2]:
# Endpoint and headers
url = "https://www.bankofengland.co.uk/_api/News/RefreshPagedNewsList"

headers = {
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.6",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "https://www.bankofengland.co.uk",
    "Referer": "https://www.bankofengland.co.uk/news/speeches",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

# Cookie with session token (replace if expired)
cookies = {
    "shell#lang": "en",
    "__RequestVerificationToken": "F0TRyiMm0Nwv7WY9BlRedjFyTyzk5BExaj_WC8N-TXLOB75rrftgDCk55SpI9VN0uoMCkj0FqJk3ZD36jWZnPiilGoE1"
}


In [3]:
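# Payload for the first page of results (same fields as in the paging loop further down)
payload = {
    "SearchTerm": "", "Id": "{CE377CC8-BFBC-418B-B4D9-DBC1C64774A8}", "PageSize": "30",
    "NewsTypesAvailable[]": "f949c64a4c88448b9e269d10080b0987", "Page": "1",
    "Direction": "1", "Grid": "false", "InfiniteScrolling": "false"
}

response = requests.post(url, headers=headers, cookies=cookies, data=payload)
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
else:
    print(f"All Good. Status code: {response.status_code}")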

#### Inspect the HTML structure with prettify()

Using prettify() allows us to see the entire structure of the HTML, so that we can identify the specific elements that hold the data we need.

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')
# Let's see how this looks
print(soup.prettify())

- Each individual speech item is contained within a "div" element with a specific class, which groups the title, link, date, and sometimes the speaker's name.
- The speech title and the hyperlink to the full speech page are found inside <a> tags.
- Publication dates are in a separate "div" or "time" element.
- Speakers/authors are inconsistently available: not all speeches include a named speaker in the listing; some require fetching the name from the detailed speech page, while others do not list the speaker at all.
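
To make the structure described above concrete, here is a minimal sketch that parses a single listing item. The HTML fragment is a hand-written stand-in, shaped only after the selectors used in the scraping cell of this notebook (`div.release-content`, `time.release-date`, `h3.list`, a parent `<a>`); the title, date, and URL are invented, and the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: NOT copied from the BoE site, only shaped like
# the selectors used elsewhere in this notebook.
sample_html = """
<a href="/speech/2024/example-speech">
  <div class="release-content">
    <time class="release-date">15 March 2024</time>
    <h3 class="list">Example title - speech by Jane Doe</h3>
  </div>
</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")
item = soup.select_one("div.release-content")

# Date and title live inside the item; the link is on the enclosing <a> tag
date_text = item.select_one("time.release-date").get_text(strip=True)
title = item.select_one("h3.list").get_text(strip=True)
href = item.find_parent("a")["href"]

print(date_text, "|", title, "|", href)
```

Note that the link is read from the parent of the "div", not from inside it, which is why `find_parent("a")` is used instead of another `select_one` call.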

Let's print the HTTP response status code to verify if the request was successful.

In [None]:
print(response.status_code)
print(response.text[:500])  # Only first 500 characters to keep it short

### Scraping the metadata for every speech item
- In this section, we create a dataframe with the title, date, and link for every speech listed, using BeautifulSoup.
- The approach uses a loop to fetch and parse each page in turn, stopping when no further speeches are found.
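
The stopping rule can be sketched in isolation, without any network access. In the sketch below, `collect_until` and `fake_pages` are illustrative names that do not appear in the notebook, and the pages are invented lists standing in for parsed API responses. It assumes, as the listing appears to, that items arrive newest first.

```python
from datetime import datetime

def collect_until(pages, start_date):
    """Collect (date, title) pairs page by page, newest first.

    `pages` is an iterable of lists of (date_string, title) tuples; it stands
    in for the parsed API responses (an assumption for illustration only).
    """
    records = []
    for items in pages:
        reached_cutoff = False
        kept = 0
        for date_text, title in items:
            date = datetime.strptime(date_text, "%d %B %Y")
            if date < start_date:
                reached_cutoff = True   # everything after this item is older
                break
            records.append((date.strftime("%Y-%m-%d"), title))
            kept += 1
        # Stop on an empty page or once the cutoff date has been passed
        if kept == 0 or reached_cutoff:
            break
    return records

# Invented data: the third page is never reached
fake_pages = [
    [("02 May 2024", "A"), ("10 January 2023", "B")],
    [("01 November 2022", "C"), ("05 October 2022", "D")],
    [("01 September 2022", "E")],
]
result = collect_until(fake_pages, datetime(2022, 10, 21))
print(result)
```

Using an explicit `reached_cutoff` flag, rather than re-checking the last parsed date after the inner loop, avoids relying on a variable that may be stale or undefined when a page yields no valid items.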

In [4]:
# Set target date range
start_date = datetime.strptime("2022-10-21", "%Y-%m-%d")
today = datetime.today()



# Cookie with session token (replace if expired)
cookies = {
    "shell#lang": "en",
    "__RequestVerificationToken": "F0TRyiMm0Nwv7WY9BlRedjFyTyzk5BExaj_WC8N-TXLOB75rrftgDCk55SpI9VN0uoMCkj0FqJk3ZD36jWZnPiilGoE1"
}

# List to store results
records = []

# Loop through pages
for page in range(1, 160):  # Cap at 159 pages to avoid an infinite loop
    payload = {
        "SearchTerm": "",
        "Id": "{CE377CC8-BFBC-418B-B4D9-DBC1C64774A8}",
        "PageSize": "30",
        "NewsTypesAvailable[]": "f949c64a4c88448b9e269d10080b0987",
        "Page": str(page),
        "Direction": "1",
        "Grid": "false",
        "InfiniteScrolling": "false"
    }

    response = requests.post(url, headers=headers, cookies=cookies, data=payload)

    if response.status_code != 200:
        break

    data = response.json()
    html = data.get("Results", "")
    soup = BeautifulSoup(html, "html.parser")
    items_found = 0

    for div in soup.select("div.release-content"):
        date_text = div.select_one("time.release-date").text.strip()
        try:
            date = datetime.strptime(date_text, "%d %B %Y")
        except ValueError:
            continue
        if date < start_date:
            break
        if date > today:
            continue

        title = div.select_one("h3.list").text.strip()
        link = "https://www.bankofengland.co.uk" + div.find_parent("a")["href"]
        records.append({"date": date.strftime("%Y-%m-%d"), "title": title, "link": link})
        items_found += 1

    if items_found == 0 or date < start_date:
        break                                           # stop when there are no more items

# Create dataframe
df = pd.DataFrame(records)

# Save results to a CSV file
df.to_csv("boe_speeches_metadata.csv", index=False)
print("Saved", len(df), "speeches to 'boe_speeches_metadata.csv'")

Saved 168 speeches to 'boe_speeches_metadata.csv'


In [8]:
df.to_csv(r'C:\Users\germa\OneDrive\LSE DATA ANALYTICS CAREER ACCELERATOR\EMPLOYER PROJECT BoE\speech assigning for test\speeches.csv', index=False)



### Expanding the data with actual text from each link

For each speech URL:
- The full speech text was extracted.
- A time delay was inserted between requests to respect ethical scraping standards.
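
One way to guarantee a pause between consecutive requests is to loop over the URLs explicitly and sleep inside the loop. The sketch below is a hedged illustration: `fetch_with_delay` is a hypothetical helper, and the `fetch` and `pause` arguments are injected so the example runs without touching the network or actually waiting.

```python
import random
from time import sleep

def fetch_with_delay(urls, fetch, min_wait=2.5, max_wait=20.0, pause=sleep):
    """Call `fetch(url)` for each URL, pausing a random interval between calls.

    `fetch` and `pause` are injected so the sketch can run offline; both
    names are illustrative and not part of the notebook.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:  # no need to wait before the first request
            pause(random.uniform(min_wait, max_wait))
        results.append(fetch(url))
    return results

# Dry run with stand-ins: record the waits instead of sleeping
waits = []
texts = fetch_with_delay(
    ["u1", "u2", "u3"],
    fetch=lambda u: f"text of {u}",
    pause=waits.append,
)
print(texts, len(waits))
```

In the notebook, the same helper could be called as `fetch_with_delay(df['link'], extract_speech_text)`, leaving `pause` at its default so that a real delay is applied.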


In [None]:
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
    "Mozilla/5.0 (X11; Linux x86_64)...",
]


# Function to extract the full text of a speech from a given BoE speech page
def extract_speech_text(url):
    headers = {"User-Agent": random.choice(user_agents)}
    
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            return f"HTTP {response.status_code}"
        
        soup = BeautifulSoup(response.text, "html.parser")
        main_section = soup.select_one("div#output > section.page-section")
        if not main_section:
            return "Speech section not found"

        paragraphs = main_section.find_all(['p', 'h2', 'h3', 'h4', 'ul', 'ol'])
        return "\n\n".join(p.get_text(separator=" ", strip=True) for p in paragraphs)
    except Exception as e:
        return f"Error: {str(e)}"

# Apply to all rows, pausing between requests
speech_texts = []
for link in df['link']:
    speech_texts.append(extract_speech_text(link))
    sleep(random.uniform(2.5, 20))  # Wait between 2.5 and 20 seconds before the next request
df['speech_text'] = speech_texts

# Save to new CSV (this file is loaded again in the speaker-name step)
df.to_csv("boe_speeches_speech_text.csv", index=False)
print("Saved", len(df), "speeches to boe_speeches_speech_text.csv")

In [None]:
df


### Extracting speaker name

In most cases, the speaker's name appears in the 'title' column after the string 'speech by '. We can use this to isolate the author and create a new column.
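
As a quick illustration of what such a pattern captures, the sketch below applies the regular expression on its own to three invented titles. `speaker_after_by` is a hypothetical helper, and the titles are not taken from the dataset.

```python
import re

# Capture the run of word characters, spaces, dots, hyphens and apostrophes
# that follows the word "by"
pattern = re.compile(r"\bby ([\w\s\.\-']+)", re.IGNORECASE)

def speaker_after_by(text):
    """Return the text captured after 'by', or None if there is no match."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None

# Hypothetical titles, invented for illustration
titles = [
    "Monetary policy outlook - speech by Jane Doe",
    "Remarks on financial stability",
    "Speech by Jane Doe at the Example Forum",
]
extracted = [speaker_after_by(t) for t in titles]

for t, name in zip(titles, extracted):
    print(repr(t), "->", repr(name))
```

The third title shows a limitation: because the character class allows spaces, the capture runs on past the name when the title continues after it. This is one reason a manual check of the extracted authors is still worthwhile.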

In [None]:
# Load the dataset
df = pd.read_csv("boe_speeches_speech_text.csv")

# Combined function to extract speaker's name from title
def extract_author_from_title(title):
    # Try each dash variant used as a separator (minus sign, hyphen, en dash)
    for separator in [" − ", " - ", " – "]:
        if separator in title:
            event_part = title.split(separator)[-1]
            match = re.search(r"\bby ([\w\s\.\-']+)", event_part, re.IGNORECASE)
            if match:
                return match.group(1).strip()
    return None

# Extract the author from the title for every row
df['author'] = df['title'].apply(extract_author_from_title)


# Preview results
df[['title', 'author']].head()

In [None]:
# Count nulls in 'author' column
null_author_count = df['author'].isnull().sum()
print(null_author_count)

In [None]:
df.info()


### Applying data types

In [None]:
# Apply correct data types
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Parse dates
df['title'] = df['title'].astype(str)
df['link'] = df['link'].astype(str)
df['speech_text'] = df['speech_text'].astype(str)
df['author'] = df['author'].astype('string')  # Nullable string type in pandas

In [None]:
df.info()

In [None]:
# Drop the 'link' column and reorder the remaining columns
df2 = df[['date', 'title', 'author', 'speech_text']].copy()

df2

### Adding the 'is_gov' column
The BoE governor for the period in the dataframe (2022-10-20 - present) is Andrew Bailey.


Let's write a function that checks the authors' last names: if one matches the governor's, is_gov=1; otherwise, is_gov=0.

In [None]:
# Last names of BoE Governors for the period covered (2022 onwards)
governors_last_names = ["Bailey"]

# Function to check if the author is a governor
def is_governor(author):
    if pd.isna(author):
        return 0
    return int(any(last_name in author for last_name in governors_last_names))

# Apply the function and insert the column after 'author'
df2.insert(loc=df2.columns.get_loc('author') + 1, column='is_gov', value=df2['author'].apply(is_governor))

df2

In [None]:
# check
print(df2[df2['is_gov'] == 1])

In [None]:
df2.head()

### Sorting and saving

In [None]:
# rename columns
df2 = df2.rename(columns={'speech_text': 'text'})

# Sort ascending
df2 = df2.sort_values(by='date', ascending=True)

# save the file
df2.to_csv("recent_speeches_clean.csv", index=False)

# Manual tweaks
After inspecting the saved file, some manual steps were performed to complete the data:
- 20 names for 'author' were added where scraping was unsuccessful. The names were found in the 'speech by' part of the titles.
- 38 speeches showed 'Speech section not found'. This was probably due to inconsistencies in the HTML structure of the individual speech pages. The contents were manually copied and pasted into the CSV in Excel.
- A speech by David Bailey was labelled as true in the 'is_gov' column. This mistake was corrected (only Andrew Bailey's speeches should be true for the period examined).
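
The David Bailey mislabel comes from matching on the last name alone. A stricter check on the full name would avoid it; the following is a minimal sketch of that alternative, not the method used to build the dataset. `GOVERNOR_NAMES` and `is_governor_strict` are hypothetical names.

```python
# Assumption: the governor is identified by full name rather than last name
GOVERNOR_NAMES = {"andrew bailey"}

def is_governor_strict(author):
    """Return 1 if the author string contains a governor's full name, else 0."""
    if author is None:
        return 0
    # Lower-case and collapse repeated whitespace before comparing
    normalised = " ".join(str(author).lower().split())
    return int(any(name in normalised for name in GOVERNOR_NAMES))

for name in ["Andrew Bailey", "David Bailey", None]:
    print(name, is_governor_strict(name))
```

When applied to a pandas column, missing values would more typically be detected with `pd.isna`, as in the notebook's own `is_governor` function.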

# Final Dataset
After completing the missing data in Excel, we exported the final 'recent_speeches' file:

In [None]:
recent_speeches = pd.read_excel('recent speeches_v2.xlsx')
recent_speeches