<a href="https://colab.research.google.com/github/jesusvillota/CSS_DataScience_2025/blob/main/Session2/2_1_BIS_Scraper_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="max-width: 880px; margin: 20px auto 22px; padding: 0px; border-radius: 18px; border: 1px solid #e5e7eb; background: linear-gradient(180deg, #ffffff 0%, #f9fafb 100%); box-shadow: 0 8px 26px rgba(0,0,0,0.06); overflow: hidden;">

  <!-- Banner Header -->
  <div style="padding: 34px 32px 14px; text-align: center; line-height: 1.38;">
    <div style="font-size: 13px; letter-spacing: 0.14em; text-transform: uppercase; color: #6b7280; font-weight: bold; margin-bottom: 5px;">
      Session #2
    </div>
    <div style="font-size: 29px; font-weight: 800; color: #14276c; margin-bottom: 4px;">
      Scraping Central Bank Speeches from the BIS
    </div>
    <div style="font-size: 29px; font-weight: 800; color: #14276c; margin-bottom: 4px;">
      Part I
    </div>
    <div style="font-size: 16.5px; color: #374151; font-style: italic; margin-bottom: 0;">
      Using Textual Data in Empirical Monetary Economics
    </div>
  </div>

  <!-- Logo Section -->
  <div style="background: none; text-align: center; margin: 30px 0 10px;">
    <img src="https://www.cemfi.es/images/Logo-Azul.png" alt="CEMFI Logo" style="width: 158px; filter: drop-shadow(0 2px 12px rgba(56,84,156,0.05)); margin-bottom: 0;">
  </div>

  <!-- Name -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1.22em; font-weight: bold; margin-bottom: 0px;">
    Jesus Villota Miranda © 2025
  </div>

  <!-- Contact info -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1em; margin-top: 7px; margin-bottom: 20px;">
    <a href="mailto:jesus.villota@cemfi.edu.es" style="color: #38549c; text-decoration: none; margin-right:8px;" title="Email">
      <!-- Email logo -->
      <!-- <img src="https://cdn-icons-png.flaticon.com/512/11679/11679732.png" alt="Email" style="width:18px; vertical-align:middle; margin-right:5px;"> -->
      jesus.villota@cemfi.edu.es
    </a>
    <span style="color:#9fa7bd;">|</span>
    <a href="https://www.linkedin.com/in/jesusvillotamiranda/" target="_blank" style="color: #38549c; text-decoration: none; margin-left:7px;" title="LinkedIn">
      <!-- LinkedIn logo -->
      <!-- <img src="https://1.bp.blogspot.com/-onvhHUdW1Us/YI52e9j4eKI/AAAAAAAAE4c/6s9wzOpIDYcAo4YmTX1Qg51OlwMFmilFACLcBGAsYHQ/s1600/Logo%2BLinkedin.png" alt="LinkedIn" style="width:17px; vertical-align:middle; margin-right:5px;"> -->
      LinkedIn
    </a>
  </div>
</div>


**IMPORTANT**: **Are you running this notebook in Google Colab?**

- If so, please make sure that in the cell below `running_in_colab` is set to `True`

- And, of course, make sure to **run the cell**!

In [1]:
# ARE YOU RUNNING THIS IN GOOGLE COLAB? If YES, type True below
running_in_colab = False

In [2]:
# --- Conditional install ---
if running_in_colab:
    # Install the scraping dependencies (bs4, requests, selenium) when running in Colab
    !pip install bs4 requests selenium

# **0. Introduction**

In this notebook, we will explore how to automatically collect and process central bank speeches published by the Bank for International Settlements (BIS). We will demonstrate practical techniques for scraping web content, handling dynamic pages, and extracting information from documents.

<div style="text-align: center;">
    <img src="https://www.pngitem.com/pimgs/m/586-5865614_bis-bank-for-international-settlements-logo-hd-png.png" alt="Bank for International Settlements" width="250"/>
</div>

## **Understanding the Website Structure**

If you go to the website ([https://www.bis.org/cbspeeches](https://www.bis.org/cbspeeches)), you will see a dynamic pane listing the speeches, which can show at most 25 speeches at a time. This means our scraper needs to handle pagination to reach all available speeches. If you ran your scraping procedure on the plain link ([https://www.bis.org/cbspeeches](https://www.bis.org/cbspeeches)), you would only be able to scrape the first 10 speeches (the default page length is 10), and nothing else!

> **IMPORTANT**: A crucial step when scraping this type of website is to tweak the parameters of the selection menu and watch how the link changes. From those changes you can infer the URL pattern and use it to scrape the content of every page.

Try it out yourself! Go to the BIS webpage and make adjustments in the menu. Then look at the link again. You will see that it is now different!

**Modus Operandi**

1) Go to the original link: [https://www.bis.org/cbspeeches](https://www.bis.org/cbspeeches)
2) Make modifications in the selection menu

> ![](images/selection1.png)

3) Scroll down to modify the number of speeches shown per page

> ![](images/selection2.png)

If you now look at the link, you will see that it has changed!

> ![](images/link.png)
> https://www.bis.org/cbspeeches?fromDate=01%2F01%2F2008&tillDate=19%2F08%2F2025&authors=2366&authors=2864&institutions=1&institutions=29&countries=19&countries=168&cbspeeches_page=3&cbspeeches_page_length=25


Let's break down this link:


> `https://www.bis.org/cbspeeches`
>
> ?`fromDate`=**01**%2F**01**%2F**2008**
> 
> &`tillDate`=**19**%2F**08**%2F**2025**
>
> &`authors`=**2366**&`authors`=**2864**
>
> &`institutions`=**1**&`institutions`=**29**
>
> &`countries`=**19**&`countries`=**168**
>
> &`cbspeeches_page`=**3**
>
> &`cbspeeches_page_length`=**25**


As you can see, whatever you do in the selection pane is mapped into the link! This gives us an easy way to scrape: we simply modify the URL parameters.

For this purpose, let's define a function that maps a set of arguments (`initial_date`, `final_date`, `page`, `page_length`) to the link to be scraped:

In [3]:
def bis_link(initial_date, final_date, page, page_length):
    """Build the BIS speeches index URL for a date range (dd/mm/yyyy) and a page."""
    index_url = (
        "https://www.bis.org/cbspeeches"
        f"?fromDate={initial_date}"
        f"&tillDate={final_date}"
        f"&cbspeeches_page={page}"
        f"&cbspeeches_page_length={page_length}"
    )
    return index_url

Let's try it!

In [4]:
bis_link(
    "01/01/2010",
    "01/01/2020",
    2,
    10
)

'https://www.bis.org/cbspeeches?fromDate=01/01/2010&tillDate=01/01/2020&cbspeeches_page=2&cbspeeches_page_length=10'
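
Note that the output above keeps the slashes in the dates as they are, whereas the link copied from the browser encodes them as `%2F` and repeats keys such as `authors` when several values are selected. A minimal sketch of how both could be handled with the standard library's `urllib.parse`; the helper name `bis_link_encoded` and its `filters` argument are hypothetical and not part of this notebook:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def bis_link_encoded(initial_date, final_date, page, page_length, **filters):
    """Hypothetical variant of bis_link that percent-encodes its parameters.

    Extra keyword arguments (e.g. authors=[2366, 2864]) are assumed to map
    one-to-one onto the query keys seen in the browser's address bar.
    """
    params = {
        "fromDate": initial_date,
        "tillDate": final_date,
        **filters,
        "cbspeeches_page": page,
        "cbspeeches_page_length": page_length,
    }
    # doseq=True expands list values into repeated keys: authors=1&authors=2
    return "https://www.bis.org/cbspeeches?" + urlencode(params, doseq=True)

url = bis_link_encoded(
    "01/01/2008", "19/08/2025", 3, 25,
    authors=[2366, 2864], institutions=[1, 29], countries=[19, 168],
)
print(url)

# Round trip: decode the query string back into a dict of lists
query = parse_qs(urlsplit(url).query)
print(query["fromDate"], query["authors"])
```

The round trip with `parse_qs` is also a convenient way to inspect any link you copy from the browser: every key maps to a list, so repeated keys such as `authors` keep all their values.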

Now, let's define the parameters as global variables to be used throughout this notebook.

In [None]:
from pathlib import Path

# --- Setup params ---
BASE_URL = "https://www.bis.org"
INITIAL_DATE = "01/01/2000"
FINAL_DATE = "11/08/2025"
PAGE_LENGTH = 25
MAX_PAGE = 1
DOWNLOAD_DIR = Path("output/2_2/comparison")

if running_in_colab:
  from google.colab import drive
  drive.mount('/content/gdrive')
  DOWNLOAD_DIR = Path("/content/gdrive/My Drive") / DOWNLOAD_DIR
else: 
  DOWNLOAD_DIR = Path("../") / DOWNLOAD_DIR

# Create directories using pathlib
DOWNLOAD_DIR.mkdir(parents=True, exist_ok=True)

# **1. Scraping from the raw HTML (*Naive Approach*)**

- Let's try to apply what we learned on Monday in Session #1.
- If you remember, we learned how to scrape a static webpage ([www.cemfi.es](https://www.cemfi.es)) using `requests` and `BeautifulSoup`. 
- Now, we'll attempt to scrape the BIS website of central bank speeches ([https://www.bis.org/cbspeeches](https://www.bis.org/cbspeeches)) using the same approach.

In [6]:
import requests
from bs4 import BeautifulSoup

# Let's try the naive approach using just requests and BeautifulSoup
for i in range(1, MAX_PAGE+1):
    index_url = bis_link(INITIAL_DATE, FINAL_DATE, i, PAGE_LENGTH)
    
    print(f"\n=== Processing page {i} with naive approach ===")
    print(f"🔗 Requesting URL: {index_url}")
    
    # Step 1. Send a simple HTTP request - (this won't execute any JavaScript)
    response = requests.get(index_url)
    print(f"🌐 HTTP Status: {response.status_code}")
    
    # Step 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Step 3. Try to find the speeches container and links
    container = soup.find(id="cbspeeches_list")
    
    if container:
        print("✅ Found container with id 'cbspeeches_list'!")
        # Try to find speech links
        review_links = container.select("a.dark[href^='/review/']")
        print(f"✅ Found {len(review_links)} review links on page {i}.")
        
        # Print the first few links to see what we got
        for link in review_links[:3]:
            print(f"Link: {link.get('href')}")
    else:
        print("❌ Could not find container with id 'cbspeeches_list'")
    
    # Save the HTML
    output_file = DOWNLOAD_DIR / f"page_{i}_naive.html"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(response.text)
    print(f"💾 Saved HTML to {output_file} for inspection")


=== Processing page 1 with naive approach ===
🔗 Requesting URL: https://www.bis.org/cbspeeches?fromDate=01/01/2000&tillDate=11/08/2025&cbspeeches_page=1&cbspeeches_page_length=25
🌐 HTTP Status: 200
❌ Could not find container with id 'cbspeeches_list'
💾 Saved HTML to ../output/2_2/comparison/page_1_naive.html for inspection
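
A 200 status together with a missing container is the typical signature of JavaScript-rendered content: the server answered, but the element we want is not in the HTML it sent. The same check can be reproduced offline on any saved page. Below is a minimal sketch that uses only the standard library (the notebook itself uses BeautifulSoup for this); `IdCollector`, `has_element_id`, and the two toy HTML strings are hypothetical and made up for illustration:

```python
from html.parser import HTMLParser

class IdCollector(HTMLParser):
    """Collect every id attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)

def has_element_id(html, element_id):
    """Return True if some tag in `html` carries id=element_id."""
    collector = IdCollector()
    collector.feed(html)
    return element_id in collector.ids

# Toy documents standing in for the raw and the rendered page
raw_html = "<html><body><div id='content'></div><script src='app.js'></script></body></html>"
rendered_html = "<html><body><div id='content'><table id='cbspeeches_list'></table></div></body></html>"

print(has_element_id(raw_html, "cbspeeches_list"))
print(has_element_id(rendered_html, "cbspeeches_list"))
```

If the target id only shows up in the rendered version, plain `requests` will not be enough for that page.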


# **2. Overcoming the Challenge of Scraping JavaScript-Heavy Websites**

1. **The Problem**: When using simple HTTP requests (via `requests` or similar libraries), we only get the initial HTML from the server, which often lacks the content we see in the browser.

2. **Why This Happens**: Modern websites load content dynamically after the initial page load using JavaScript. The browser executes this JavaScript to fetch and display data, but basic HTTP requests don't.

3. **Solutions**:
   - Analyzing potential API endpoints that the website might be using
   - Using `Selenium` to automate a real browser that executes JavaScript

4. **Best Practices**:
   - Always check if a website uses JavaScript to load content before choosing a scraping approach
   - Use browser dev tools (especially the Network tab) to understand how data is loaded
   - Consider browser automation for complex sites or direct API requests for efficiency
   - Be respectful of websites' terms of service and implement rate limiting

Conclusion: The approach you choose depends on the specific website, your performance requirements, and the complexity of the data you need to extract.
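
As an illustration of the rate-limiting advice, here is a minimal sketch of a loop that pauses between consecutive requests. The helper `fetch_politely`, its parameters, and the one-second default delay are assumptions chosen for the example, not values recommended by the BIS:

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Demo with a stand-in fetcher and a recording "sleep", so nothing is downloaded
urls = [f"https://www.bis.org/cbspeeches?cbspeeches_page={p}" for p in (1, 2, 3)]
pauses = []
pages = fetch_politely(
    urls,
    fetch=lambda u: f"<html for {u}>",
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

In real use you would pass an actual fetcher (for example `requests.get`) and leave `sleep` at its default; injecting both keeps the loop easy to test without touching the network.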

## **2.1. Looking for an API endpoint**

In [7]:
# Let's try to find and use the API endpoint that the BIS website might be using
# This is a common approach for dealing with JavaScript-heavy sites

import requests
import json

# Based on network analysis (done in browser), we might find that the site uses an API endpoint
# For demonstration purposes, let's try a common pattern for such endpoints

# Use bis_link function to generate the URL instead of hardcoding
api_url = bis_link(INITIAL_DATE, FINAL_DATE, 1, PAGE_LENGTH)

print(f"Attempting to access potential API endpoint: {api_url}")

try:
    # Some APIs require headers that look like a browser to prevent scraping
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "application/json",
        "Referer": "https://www.bis.org/cbspeeches/index.htm"
    }
    
    response = requests.get(api_url, headers=headers)
    print(f"🌐 API Response Status: {response.status_code}")
    
    # Try to parse as JSON
    if response.status_code == 200:
        try:
            data = response.json()
            print("✅ Successfully parsed JSON response!")
            print(f"Response structure: {json.dumps(data, indent=2)[:500]}...")  # Print first 500 chars
        except json.JSONDecodeError:
            print("Response is not JSON. Content type:", response.headers.get('Content-Type'))
            print("\nFirst 500 characters of response:", response.text[:500])
    else:
        print(f"❌ Failed to access API endpoint: {response.status_code}")
        print("\nFirst 500 characters of response:", response.text[:500])
        
    # Note to students: If the above doesn't work, you would need to:
    # 1. Use browser dev tools (Network tab) to see what requests are being made when the page loads
    # 2. Look for XHR or Fetch requests that return the data you need
    # 3. Analyze those requests and replicate them in your code
    
except Exception as e:
    print(f"❌ Error occurred: {e}")

print("\nNote: Finding the exact API endpoint requires analyzing network traffic in browser dev tools.")
print("The URL used above is just a guess for demonstration purposes!")

Attempting to access potential API endpoint: https://www.bis.org/cbspeeches?fromDate=01/01/2000&tillDate=11/08/2025&cbspeeches_page=1&cbspeeches_page_length=25
🌐 API Response Status: 200
Response is not JSON. Content type: text/html; charset=UTF-8

First 500 characters of response: <!DOCTYPE html>
<html class='no-js' lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
<head>
<meta content='IE=edge' http-equiv='X-UA-Compatible'>
<meta content='width=device-width, initial-scale=1.0' name='viewport'>
<meta content='text/html; charset=utf-8' http-equiv='Content-Type'>
<meta content='Central bankers&#39; speeches' property='og:title'>
<meta content='Central bankers&#39; speeches' property='og:description'>
<meta content='https://www.bis.org/cbspeeches/index.htm' prope

Note: Finding the exact API endpoint requires analyzing network traffic in browser dev tools.
The URL used above is just a guess for demonstration purposes!
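
The run above returns HTML even though the request asked for JSON through the `Accept` header, so this URL serves the page itself rather than a data endpoint. When probing candidate endpoints found in the Network tab, it helps to look at the `Content-Type` before attempting to parse. A small sketch of that check; `parse_if_json` is a hypothetical helper and the JSON payload in the demo is made up:

```python
import json

def parse_if_json(content_type, body):
    """Return the decoded JSON payload, or None if the response is not JSON."""
    # Cheap first check: the declared media type, ignoring parameters like charset
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "application/json" and not media_type.endswith("+json"):
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None  # header claimed JSON but the body does not parse

print(parse_if_json("text/html; charset=UTF-8", "<!DOCTYPE html><html>...</html>"))
print(parse_if_json("application/json; charset=utf-8", '{"speeches": []}'))
```

With `requests`, the two arguments would come from `response.headers.get("Content-Type")` and `response.text`.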


## **2.2. Selenium approach: JavaScript Rendering**

As we just saw, the naive approach using `requests` and `BeautifulSoup` failed to capture the content we need. This is because the BIS website uses JavaScript to dynamically load the speeches data after the initial page load.

To overcome this limitation, we'll use **Selenium** - a browser automation tool that:

1. Opens a real browser instance
2. Navigates to the BIS website 
3. Allows JavaScript to fully execute and render the page
4. Gives us access to the complete HTML content that users actually see

This approach simulates how a human would interact with the website, ensuring we can access all the dynamically loaded content. In the following sections, we'll:

- Set up Selenium with Chrome WebDriver
- Visit the BIS speeches page
- Wait for the JavaScript to execute and render the content
- Extract the links to individual speeches
- Compare the results with our naive approach

In [8]:
# Let's use Selenium to get the "real" HTML after JavaScript execution
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

if running_in_colab:
    from selenium.webdriver.chrome.options import Options
    chrome_options = Options()
    chrome_options.add_argument('--headless') # Run in headless mode
    chrome_options.add_argument('--no-sandbox') # Bypass OS security model
    chrome_options.add_argument('--disable-dev-shm-usage') # Overcome limited resource problems
    driver = webdriver.Chrome(options=chrome_options)
else: 
    driver = webdriver.Chrome()

try:
    # Use the bis_link function instead of hardcoding the URL
    page = 1  # defined here so the cell does not rely on `i` left over from an earlier loop
    index_url = bis_link(INITIAL_DATE, FINAL_DATE, page, PAGE_LENGTH)
    print(f"\n=== Processing page {page} with Selenium ===")
    
    # Navigate to the page and wait for JavaScript to execute
    driver.get(index_url)
    time.sleep(5)  # Wait for JS to load content
    
    # Save the HTML after JavaScript execution
    selenium_html = driver.page_source
    output_file = DOWNLOAD_DIR / "page_1_selenium.html"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(selenium_html)
    print(f"💾 Saved JavaScript-rendered HTML to {output_file}")
    
    # Try to find the speeches container and links
    try:
        container = driver.find_element(By.ID, "cbspeeches_list")
        review_links = container.find_elements(By.CSS_SELECTOR, "a.dark[href^='/review/']")
        review_hrefs = [link.get_attribute("href") for link in review_links]
        print(f"✅ Found {len(review_hrefs)} review links with Selenium.")

        # Print every link found
        for href in review_hrefs:
            print(f"Link: {href}")
    except Exception as e:
        print(f"❌ Could not find review links with Selenium: {e}")
        
finally:
    driver.quit()


=== Processing page 1 with Selenium ===
💾 Saved JavaScript-rendered HTML to ../output/2_2/comparison/page_1_selenium.html
✅ Found 25 review links with Selenium.
Link: https://www.bis.org/review/r250728g.htm
Link: https://www.bis.org/review/r250728f.htm
Link: https://www.bis.org/review/r250728e.htm
Link: https://www.bis.org/review/r250717g.htm
Link: https://www.bis.org/review/r250728i.htm
Link: https://www.bis.org/review/r250728h.htm
Link: https://www.bis.org/review/r250717f.htm
Link: https://www.bis.org/review/r250728d.htm
Link: https://www.bis.org/review/r250717b.htm
Link: https://www.bis.org/review/r250728j.htm
Link: https://www.bis.org/review/r250728k.htm
Link: https://www.bis.org/review/r250714a.htm
Link: https://www.bis.org/review/r250717h.htm
Link: https://www.bis.org/review/r250728l.htm
Link: https://www.bis.org/review/r250709f.htm
Link: https://www.bis.org/review/r250716a.htm
Link: https://www.bis.org/review/r250717e.htm
Link: https://www.bis.org/review/r250701c.htm
Link:
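
A fixed `time.sleep(5)` is simple, but it wastes time when the page loads quickly and may be too short on a slow connection. The usual alternative is to poll until the content appears, up to a timeout. The sketch below shows the idea in plain Python so that it runs without a browser; `wait_until` and the simulated `content_is_ready` check are hypothetical:

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Stand-in for a page whose content appears on the third check
checks = {"count": 0}

def content_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(content_is_ready, timeout=2.0, poll_interval=0.01)
print(f"Ready after {checks['count']} checks")
```

Selenium ships this pattern as `WebDriverWait(driver, timeout).until(...)` in `selenium.webdriver.support.ui`, commonly combined with `expected_conditions.presence_of_element_located((By.ID, "cbspeeches_list"))`, which could replace the fixed sleep in the cell above.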

# **3. Comparison of the HTML contents: *Naive vs Selenium***

The naive approach only captures the initial HTML, which lacks the dynamically loaded content. In contrast, the Selenium approach provides the fully rendered HTML, including all JavaScript-generated elements.

This comparison highlights the importance of using the right tools for web scraping, especially when dealing with modern websites that rely heavily on JavaScript for content delivery.

In [9]:
# Let's analyze the differences between the two HTML files in more detail
from bs4 import BeautifulSoup

# Only run this if both files exist
naive_file = DOWNLOAD_DIR / "page_1_naive.html"
selenium_file = DOWNLOAD_DIR / "page_1_selenium.html"

if naive_file.exists() and selenium_file.exists():
    # Read both HTML files
    with open(naive_file, "r", encoding="utf-8") as f:
        naive_html = f.read()
    
    with open(selenium_file, "r", encoding="utf-8") as f:
        selenium_html = f.read()
    
    # Parse with BeautifulSoup for better analysis
    naive_soup = BeautifulSoup(naive_html, 'html.parser')
    selenium_soup = BeautifulSoup(selenium_html, 'html.parser')

    print("--- Size comparison ---")

    # Compare file sizes
    naive_size = len(naive_html)
    selenium_size = len(selenium_html)
    size_diff = selenium_size - naive_size
    print(f"Naive HTML size: {naive_size:,} bytes")
    print(f"Selenium HTML size: {selenium_size:,} bytes")
    print(f"Difference: {size_diff:,} bytes ({size_diff/naive_size*100:.1f}% more content in Selenium version)")
    
    # Check for specific content we're interested in
    naive_speeches = naive_soup.find(id="cbspeeches_list")
    selenium_speeches = selenium_soup.find(id="cbspeeches_list")
    
    print("\n--- Content Analysis ---")
    if naive_speeches:
        print("✅ Naive HTML has the speeches container")
    else:
        print("❌ Naive HTML does NOT have the speeches container")

    if selenium_speeches:
        print("✅ Selenium HTML has the speeches container")
    else:
        print("❌ Selenium HTML does NOT have the speeches container")

    # Find DOM differences that might explain where the dynamic content goes
    print("\n--- DOM Structure Differences ---")
    
    # Look for elements that exist in selenium but not in naive HTML
    selenium_ids = [el.get('id') for el in selenium_soup.find_all(id=True)]
    naive_ids = [el.get('id') for el in naive_soup.find_all(id=True)]
    
    added_ids = [el_id for el_id in selenium_ids if el_id not in naive_ids]  # avoid shadowing the built-in id()
    print(f"IDs present in Selenium HTML but not in naive HTML: {added_ids[:10]}")

--- Size comparison ---
Naive HTML size: 20,924 bytes
Selenium HTML size: 114,818 bytes
Difference: 93,894 bytes (448.7% more content in Selenium version)

--- Content Analysis ---
❌ Naive HTML does NOT have the speeches container
✅ Selenium HTML has the speeches container

--- DOM Structure Differences ---
IDs present in Selenium HTML but not in naive HTML: ['nav_main_menu', 'main_menu', 'dtmenu', 'menuline', 'toptitle', 'dthome', 'breadcrumbs', 'nav_local_menu', 'local_menu', 'cbspeeches']
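
Two details of this comparison are worth knowing. First, `len()` on a Python string counts characters, so the sizes printed above are character counts; the size in bytes can be larger when the page contains non-ASCII characters. Second, testing membership against a list scans it on every lookup, while a set makes each lookup constant-time and also lets us drop duplicates. A short sketch with two hypothetical helpers, `size_report` and `ids_only_in_rendered`, run on made-up inputs:

```python
def size_report(html):
    """Return (characters, utf8_bytes) for an HTML string."""
    return len(html), len(html.encode("utf-8"))

def ids_only_in_rendered(rendered_ids, raw_ids):
    """IDs in rendered_ids but not in raw_ids, deduplicated, in first-seen order."""
    raw = set(raw_ids)  # set membership is O(1) on average
    seen = set()
    result = []
    for element_id in rendered_ids:
        if element_id not in raw and element_id not in seen:
            seen.add(element_id)
            result.append(element_id)
    return result

print(size_report("<p>café ©</p>"))
print(ids_only_in_rendered(
    ["header", "cbspeeches", "cbspeeches_list", "cbspeeches"],
    ["header"],
))
```

For pages of this size the speed difference is negligible; the main benefit is a cleaner result without repeated ids.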
