<a href="https://colab.research.google.com/github/jesusvillota/BIS_Scraper/blob/main/tutorial/tutorial2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="max-width: 880px; margin: 20px auto 22px; padding: 0px; border-radius: 18px; border: 1px solid #e5e7eb; background: linear-gradient(180deg, #ffffff 0%, #f9fafb 100%); box-shadow: 0 8px 26px rgba(0,0,0,0.06); overflow: hidden;">

  <!-- Banner Header -->
  <div style="padding: 34px 32px 14px; text-align: center; line-height: 1.38;">
    <div style="font-size: 13px; letter-spacing: 0.14em; text-transform: uppercase; color: #6b7280; font-weight: bold; margin-bottom: 5px;">
      Session #2
    </div>
    <div style="font-size: 29px; font-weight: 800; color: #14276c; margin-bottom: 4px;">
      Scraping Central Bank Speeches from the BIS (Part II)
    </div>
    <div style="font-size: 16.5px; color: #374151; font-style: italic; margin-bottom: 0;">
      Data Science for Economics: Mastering Unstructured Data
    </div>
  </div>

  <!-- Logo Section -->
  <div style="background: none; text-align: center; margin: 30px 0 10px;">
    <img src="https://www.cemfi.es/images/Logo-Azul.png" alt="CEMFI Logo" style="width: 158px; filter: drop-shadow(0 2px 12px rgba(56,84,156,0.05)); margin-bottom: 0;">
  </div>

  <!-- Name -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1.22em; font-weight: bold; margin-bottom: 0px;">
    Jesus Villota Miranda © 2025
  </div>

  <!-- Contact info -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1em; margin-top: 7px; margin-bottom: 20px;">
    <a href="mailto:jesus.villota@cemfi.edu.es" style="color: #38549c; text-decoration: none; margin-right:8px;" title="Email">
      <!-- Email logo -->
      <img src="https://cdn-icons-png.flaticon.com/512/11679/11679732.png" alt="Email" style="width:18px; vertical-align:middle; margin-right:5px;">
      jesus.villota@cemfi.edu.es
    </a>
    <span style="color:#9fa7bd;">|</span>
    <a href="https://www.linkedin.com/in/jesusvillotamiranda/" target="_blank" style="color: #38549c; text-decoration: none; margin-left:7px;" title="LinkedIn">
      <!-- LinkedIn logo -->
      <img src="https://1.bp.blogspot.com/-onvhHUdW1Us/YI52e9j4eKI/AAAAAAAAE4c/6s9wzOpIDYcAo4YmTX1Qg51OlwMFmilFACLcBGAsYHQ/s1600/Logo%2BLinkedin.png" alt="LinkedIn" style="width:22px; vertical-align:middle; margin-right:5px;">
      LinkedIn
    </a>
  </div>
</div>


**IMPORTANT**: **Are you running this notebook in Google Colab?**

- If so, make sure that `running_in_colab` is set to `True` in the cell below.

- Then **run the cell** before moving on.

In [17]:
# ARE YOU RUNNING THIS IN GOOGLE COLAB? If YES, type True below
running_in_colab = False
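
If you would rather not set the flag by hand, the environment can usually be detected automatically. The sketch below assumes that Colab kernels load the `google.colab` package at startup, which is a widely used convention rather than something this notebook guarantees; `detect_colab` is a hypothetical helper name. Treat it as an optional convenience.

```python
import sys


def detect_colab() -> bool:
    """Return True when the `google.colab` package is already loaded.

    Assumption: Colab kernels import `google.colab` at startup, so its
    presence in `sys.modules` signals that the notebook runs in Colab.
    """
    return "google.colab" in sys.modules


# Optional: replace the manual flag with the detected value.
running_in_colab = detect_colab()
print(f"running_in_colab = {running_in_colab}")
```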

# **1. Initial Setup**

In this section, we set up the necessary parameters and environment for our web scraping project. We'll:

1. Define constants like base URL, download directory, and date ranges
2. Import required libraries (`requests`, `selenium`, etc.)
3. Create directories for storing downloaded PDFs and extracted text
4. Define a utility function to generate BIS (Bank for International Settlements) search URLs with pagination parameters

This setup gives every later cell the same configuration: which dates to search, how many result pages to visit, and where to store the downloaded PDFs and the extracted text.

In [18]:
# --- Setup params ---
BASE_URL = "https://www.bis.org"
DOWNLOAD_DIR = "downloads"
TEXT_DIR = 'texts'
INITIAL_DATE = "01/01/2000"
FINAL_DATE = "11/08/2025"
PAGE_LENGTH = 10
MAX_PAGE = 5

# --- Conditional install ---
if running_in_colab:
    # Install selenium if running in Colab
    !pip install selenium PyPDF2

# --- Imports ---
import time
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
import os
import PyPDF2

os.makedirs(TEXT_DIR, exist_ok=True)
os.makedirs(DOWNLOAD_DIR, exist_ok=True)

# --- BIS Link generator ---
def bis_link(initial_date, final_date, page, page_length):
    """Build the BIS speeches search URL for one page of results."""
    index_url = (
        f"{BASE_URL}/cbspeeches"
        f"?fromDate={initial_date}"
        f"&tillDate={final_date}"
        f"&cbspeeches_page={page}"
        f"&cbspeeches_page_length={page_length}"
    )
    return index_url
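
As a quick sanity check, the sketch below lists the index URLs that the scraper will visit. `bis_link` is repeated in compact form so the snippet runs on its own; `index_urls` is a hypothetical helper introduced only for illustration.

```python
def bis_link(initial_date, final_date, page, page_length):
    """Compact copy of the URL builder defined above."""
    return (
        "https://www.bis.org/cbspeeches"
        f"?fromDate={initial_date}"
        f"&tillDate={final_date}"
        f"&cbspeeches_page={page}"
        f"&cbspeeches_page_length={page_length}"
    )


def index_urls(initial_date, final_date, max_page, page_length):
    """Return the search URLs for pages 1..max_page (pages are 1-indexed)."""
    return [
        bis_link(initial_date, final_date, page, page_length)
        for page in range(1, max_page + 1)
    ]


urls = index_urls("01/01/2000", "11/08/2025", max_page=5, page_length=10)
print(len(urls))  # 5
print(urls[0])
# https://www.bis.org/cbspeeches?fromDate=01/01/2000&tillDate=11/08/2025&cbspeeches_page=1&cbspeeches_page_length=10
```

With `MAX_PAGE = 5` and `PAGE_LENGTH = 10`, the scraper therefore visits at most 50 speeches.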

# **2. Scraping PDF files with Selenium**

This section demonstrates how to use `Selenium` WebDriver to automate the process of finding and downloading PDF files from the BIS website. The process involves:

1. Initializing a Chrome WebDriver to automate browser interactions
2. Navigating through multiple pages of search results with the `bis_link` URL builder defined in Section 1
3. Finding and extracting links to speech review pages
4. Visiting each review page to locate the PDF download links
5. Downloading the PDFs and saving them to our local directory
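
Steps 4 and 5 rely on two small pieces of URL handling: turning an `href` into an absolute URL and choosing a local file name. The sketch below does both with the standard library. `resolve_pdf_url` and `local_pdf_path` are hypothetical helper names, not part of the notebook.

```python
import posixpath
from pathlib import Path
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.bis.org"
DOWNLOAD_DIR = "downloads"


def resolve_pdf_url(href: str, base_url: str = BASE_URL) -> str:
    """Return an absolute URL; absolute inputs pass through unchanged."""
    return urljoin(base_url, href)


def local_pdf_path(pdf_url: str, download_dir: str = DOWNLOAD_DIR) -> Path:
    """Map a PDF URL to its local path, ignoring any query string."""
    filename = posixpath.basename(urlparse(pdf_url).path)
    return Path(download_dir) / filename


url = resolve_pdf_url("/review/r250728g.pdf")
print(url)                       # https://www.bis.org/review/r250728g.pdf
print(local_pdf_path(url).name)  # r250728g.pdf
```

Using `urljoin` avoids hand-written prefix checks, and parsing the URL before taking the base name keeps query strings out of the file name.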

`Selenium` is particularly useful here because it can handle JavaScript-rendered content that might not be accessible with simple HTTP requests. This allows us to interact with dynamic elements on the page and navigate through the site's structure programmatically.
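
When content is rendered by JavaScript, the scraper has to wait until the elements exist before reading them. A fixed pause is simple but tends to be either too long or too short; polling until a condition holds is a common alternative. The sketch below shows the idea in plain Python so that it runs without a browser. `wait_until` is a hypothetical helper, not part of Selenium, which offers its own `WebDriverWait` class for the same purpose, and `fake_find_links` is a stand-in for a real page lookup.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a page that needs three polls before its links appear.
attempts = {"count": 0}


def fake_find_links():
    attempts["count"] += 1
    return ["/review/a.htm"] if attempts["count"] >= 3 else []


links = wait_until(fake_find_links, timeout=5, poll_interval=0.01)
print(links, attempts["count"])  # ['/review/a.htm'] 3
```

With Selenium, `condition` could be `lambda: driver.find_elements(By.CSS_SELECTOR, "a.dark[href^='/review/']")`, since `find_elements` returns an empty list while nothing matches.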

In [19]:
if running_in_colab:
    from selenium.webdriver.chrome.options import Options
    chrome_options = Options()
    chrome_options.add_argument('--headless') # Run in headless mode
    chrome_options.add_argument('--no-sandbox') # Bypass OS security model
    chrome_options.add_argument('--disable-dev-shm-usage') # Overcome limited resource problems
    driver = webdriver.Chrome(options=chrome_options)
else: 
    driver = webdriver.Chrome()

for i in range(1, MAX_PAGE + 1):
    index_url = bis_link(INITIAL_DATE, FINAL_DATE, i, PAGE_LENGTH)
    print(f"\n==============[ Processing page {i} ]==============")
    driver.get(index_url)
    time.sleep(5)  # Wait for JS

    try:
        container = driver.find_element(By.ID, "cbspeeches_list")
        review_links = container.find_elements(By.CSS_SELECTOR, "a.dark[href^='/review/']")
        review_hrefs = [link.get_attribute("href") for link in review_links]
        print(f"---> 🔗 Found {len(review_hrefs)} review links on page {i}")
    except Exception as e:
        print(f"---> ❌ Could not find review links on page {i}: {e}")
        continue

    # --- Iterate over each review link ---
    for review_url in review_hrefs:
        print(f"\n🌐 Visiting: {review_url}")
        driver.get(review_url)
        time.sleep(2)  # Wait for detail page JS (adjust if necessary)

        # Look for pdf link on the detail page
        try:
            pdf_link = driver.find_element(By.CSS_SELECTOR, "a.pdftitle_link[href$='.pdf']")
            pdf_href = pdf_link.get_attribute("href")
            if not pdf_href.startswith("http"):
                pdf_href = BASE_URL + pdf_href
            print("📄 PDF found:", pdf_href)

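            # Download the PDF (time out on stalled requests and raise on
            # HTTP errors so an error page is never saved as a .pdf)
            response = requests.get(pdf_href, timeout=30)
            response.raise_for_status()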
            filename = os.path.basename(pdf_href)
            save_path = os.path.join(DOWNLOAD_DIR, filename)
            with open(save_path, "wb") as f:
                f.write(response.content)
            print(f"📩 Downloaded PDF to {save_path}")
        except Exception as e:
            print(f"❌ No PDF found or error: {e}")

driver.quit()



---> 🔗 Found 10 review links on page 1

🌐 Visiting: https://www.bis.org/review/r250728g.htm
📄 PDF found: https://www.bis.org/review/r250728g.pdf
📩 Downloaded PDF to downloads/r250728g.pdf

🌐 Visiting: https://www.bis.org/review/r250728f.htm
📄 PDF found: https://www.bis.org/review/r250728f.pdf
📩 Downloaded PDF to downloads/r250728f.pdf

🌐 Visiting: https://www.bis.org/review/r250728e.htm
📄 PDF found: https://www.bis.org/review/r250728e.pdf
📩 Downloaded PDF to downloads/r250728e.pdf

🌐 Visiting: https://www.bis.org/review/r250717g.htm
📄 PDF found: https://www.bis.org/review/r250717g.pdf
📩 Downloaded PDF to downloads/r250717g.pdf

🌐 Visiting: https://www.bis.org/review/r250728i.htm
📄 PDF found: https://www.bis.org/review/r250728i.pdf
📩 Downloaded PDF to downloads/r250728i.pdf

🌐 Visiting: https://www.bis.org/review/r250728h.htm
📄 PDF found: https://www.bis.org/review/r250728h.pdf
📩 Downloaded PDF to downloads/r250728h.pdf

🌐 Visiting: https://www.bis.org/review/r250717f.htm
📄 PDF found: 

# **3. Extracting text from PDF files**

After downloading the PDF files, we need to extract their textual content for analysis. This section covers:

1. Using the `PyPDF2` library to process PDF documents
2. Iterating through each downloaded PDF file in our directory
3. Extracting text content from all pages of each PDF
4. Saving the extracted text to corresponding text files in our designated text directory
5. Implementing error handling to manage potential issues in PDF processing

Text extraction converts each PDF into plain text that can be processed directly for tasks such as natural language processing, sentiment analysis, or topic modeling. `PyPDF2` is a pure-Python library, so it needs no external system tools. It only reads text that is embedded in the PDF, so a scanned document without a text layer yields an empty string rather than an error.
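
Steps 2 and 4 amount to pairing every PDF with an output path. The sketch below does this with `pathlib` and also skips PDFs whose text file already exists, which is an addition not present in the notebook's loop. `pending_extractions` is a hypothetical helper name, and the demo uses a temporary directory with empty placeholder files rather than real speeches.

```python
import tempfile
from pathlib import Path


def pending_extractions(download_dir, text_dir):
    """Return sorted (pdf_path, txt_path) pairs that still lack a text file."""
    pairs = []
    for pdf_path in sorted(Path(download_dir).iterdir()):
        if pdf_path.suffix.lower() != ".pdf":
            continue  # ignore non-PDF files
        txt_path = Path(text_dir) / f"{pdf_path.stem}.txt"
        if not txt_path.exists():
            pairs.append((pdf_path, txt_path))
    return pairs


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    downloads, texts = Path(tmp, "downloads"), Path(tmp, "texts")
    downloads.mkdir()
    texts.mkdir()
    for name in ("r250702c.pdf", "r250715d.pdf", "notes.md"):
        (downloads / name).touch()
    (texts / "r250702c.txt").touch()  # pretend this one is already done

    todo = [(p.name, t.name) for p, t in pending_extractions(downloads, texts)]

print(todo)  # [('r250715d.pdf', 'r250715d.txt')]
```

Sorting the directory listing also makes the processing order reproducible across runs.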

In [20]:
for pdf_file in os.listdir(DOWNLOAD_DIR):
    if pdf_file.lower().endswith('.pdf'):
        pdf_path = os.path.join(DOWNLOAD_DIR, pdf_file)
        print(f"\n👁️‍🗨️ Extracting: {pdf_path}")
        try:
            with open(pdf_path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                text = ""
                for page in reader.pages:
                    text += page.extract_text() or ""
            txt_filename = os.path.splitext(pdf_file)[0] + ".txt"
            txt_path = os.path.join(TEXT_DIR, txt_filename)
            with open(txt_path, "w", encoding="utf-8") as f:
                f.write(text)
            print(f"📩 Saved text to {txt_path}")
        except Exception as e:
            print(f"❌ Error processing {pdf_path}: {e}")



👁️‍🗨️ Extracting: downloads/r250702c.pdf
📩 Saved text to texts/r250702c.txt

👁️‍🗨️ Extracting: downloads/r250715d.pdf
📩 Saved text to texts/r250715d.txt

👁️‍🗨️ Extracting: downloads/r250714a.pdf
📩 Saved text to texts/r250714a.txt

👁️‍🗨️ Extracting: downloads/r250715a.pdf
📩 Saved text to texts/r250715a.txt

👁️‍🗨️ Extracting: downloads/r250703a.pdf
📩 Saved text to texts/r250703a.txt

👁️‍🗨️ Extracting: downloads/r250702e.pdf
📩 Saved text to texts/r250702e.txt

👁️‍🗨️ Extracting: downloads/r250717h.pdf
📩 Saved text to texts/r250717h.txt

👁️‍🗨️ Extracting: downloads/r250715b.pdf
📩 Saved text to texts/r250715b.txt

👁️‍🗨️ Extracting: downloads/r250703b.pdf
📩 Saved text to texts/r250703b.txt

👁️‍🗨️ Extracting: downloads/r250703c.pdf
📩 Saved text to texts/r250703c.txt

👁️‍🗨️ Extracting: downloads/r250715c.pdf
📩 Saved text to texts/r250715c.txt

👁️‍🗨️ Extracting: downloads/r250710h.pdf
📩 Saved text to texts/r250710h.txt

👁️‍🗨️ Extracting: downloads/r250708c.pdf
📩 Saved text to texts/r250708c.txt