# Section 4. Python Web Scraping with Beautiful Soup

#### Instructor: Pierre Biscaye 

The objective of this notebook is to introduce how to extract data from the web using Python web scraping tools.

### Learning Objectives
1. Extract and parse HTML using Beautiful Soup
2. Understand the difference between tags, attributes, and attribute values
3. Apply these tools in the context of scraping information about 2025 development job market paper blog posts
4. Practice scraping downloadable files from a website

### Libraries loaded
* beautifulsoup4
* datetime
* requests
* time
* lxml
* pandas
* re
* os

# Introduction

When we want to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. The next notebook introduces APIs, using the NY Times API as a case study.

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our first case study will be scraping information on the [World Bank Development Impact Job Market Blogs](https://blogs.worldbank.org/en/impactevaluations/wrap-up-of-job-market-series-2025). 

Before we get started, let's peruse the link and view the page source to take a look at the structure of the blogs. 

**Question**: What do you observe, both about the structure of the web pages and about the structure of the URLs?

# 1. Extracting and Parsing HTML 

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). We'll also need the `lxml` package, which Beautiful Soup uses behind the scenes for parsing, but we do not need to import it explicitly.

In [1]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time
import pandas as pd
import re

In order to successfully scrape and analyse HTML, we'll be going through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. But here we are making a request directly to the website, and we are going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output via an API.

In [2]:
# Specify URL
url = "https://blogs.worldbank.org/en/impactevaluations/wrap-up-of-job-market-series-2025"
# Make a GET request
req = requests.get(url)
req.raise_for_status()  # Ensures the request was successful/not blocked
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])


<!DOCTYPE HTML>
<html lang="en">
    <head>

    <meta content="text/html; charset=UTF-8" http-equiv="content-type"/>
    
    
    
    <script defer="defer" type="text/javascript" src="https://rum.hlx.page/.rum/@adobe/helix-rum-js@%5E2/dist/micro.js" data-routing="env=stage,tier=publish,ams=World Bank"></script>
<link href="/content/dam/wbr-redesign/logos/wbg-favicon.png" rel="shortcut icon" type="image/png"/>
    

    
    <title>Wrap-up of Job Market Series 2025</title>
    <meta name="keywords" content="Development Impact"/>

    
    <meta content="blog-details-page" name="template"/>
    <meta content="/content/worldbankgroup/blogs/en/blogs/impactevaluations/wrap-up-of-job-market-series-2025" name="pagepath"/>
    <meta content="width=device-width, initial-scale=1" name="viewport"/>
    

    

    
        <link crossorigin="" href="https://fonts.googleapis.com" rel="preconnect"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link crossorigin="" hr

## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` constructor to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

In order to parse the HTML, we need to choose a **parser**. This is the 'engine' that decides how to read, break down, and structure the HTML code.

Two common ones are `html.parser` and `lxml`. `html.parser` is built into Python and offers decent speed and moderate leniency. In contrast, `lxml` must be installed separately but is very fast and lenient. `lxml` is a widely used choice for scraping.

We probably won't notice a speed difference with the World Bank blog posts, but a faster parser would save significant time if we were scraping thousands of pages.

Leniency matters because real-world websites rarely have perfect HTML. For example, there might be a `<div>` that opens but never closes. More lenient parsers will guess where tags should close and build a logical tree anyway. Stricter parsers might stop early and return incomplete content.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.
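
To make the leniency point concrete, here is a small sketch using a made-up snippet of broken HTML in which the `<div>` and `<p>` tags are never closed. It uses the built-in `html.parser` so that it runs without extra installs. Swapping in `'lxml'` may give a slightly different tree, since each parser repairs bad markup in its own way.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed HTML: the <div> and both <p> tags are never closed
broken_html = "<div><p>Hello<p>World"

# html.parser ships with Python; other parsers may repair the markup differently
broken_soup = BeautifulSoup(broken_html, "html.parser")

# The parser still builds a tree that we can search
print(broken_soup.prettify())
print(len(broken_soup.find_all("p")))   # number of <p> tags recovered
print(broken_soup.div.get_text())       # all text inside the unclosed <div>
```

Comparing the `prettify()` output across parsers is a quick way to see how each one chose to close the open tags.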

In [3]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

<!DOCTYPE HTML>
<html lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <script data-routing="env=stage,tier=publish,ams=World Bank" defer="defer" src="https://rum.hlx.page/.rum/@adobe/helix-rum-js@%5E2/dist/micro.js" type="text/javascript">
  </script>
  <link href="/content/dam/wbr-redesign/logos/wbg-favicon.png" rel="shortcut icon" type="image/png"/>
  <title>
   Wrap-up of Job Market Series 2025
  </title>
  <meta content="Development Impact" name="keywords"/>
  <meta content="blog-details-page" name="template"/>
  <meta content="/content/worldbankgroup/blogs/en/blogs/impactevaluations/wrap-up-of-job-market-series-2025" name="pagepath"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link crossorigin="" href="https://fonts.googleapis.com" rel="preconnect"/>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <link crossorigin="" href="https://assets.adobedtm.com" rel="preconnect"/>
  <lin

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.

## Step 3: Search for HTML Elements

Beautiful Soup offers several methods for finding useful components on a page. You can find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [4]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

[<a class="visually-hidden focusable" href="#main" tabindex="0">Skip to Main Navigation</a>, <a class="toplink" href="http://www.worldbank.org" target="_self">worldbank.org</a>, <a class="lp__navbar_brand" href="/en/home" target="_self">
<img alt="World Bank Blogs Logo" src="https://s7d1.scene7.com/is/image/wbcollab/logo-en?qlt=90&amp;fmt=webp&amp;resMode=sharp2" title="World Bank Blogs Logo"/>
</a>, <a aria-current="page" class="lp__nav_link" href="/en/home" target="_self">Home</a>, <a aria-current="page" class="lp__nav_link" href="/en/blogs" target="_self">All Blogs</a>, <a aria-current="page" class="lp__nav_link" href="/en/topics" target="_self">Topics</a>, <a class="lp__nav_link" href="mailto:voicesen@worldbank.org">Contact</a>, <a class="lp__nav_link" href="https://live.worldbank.org/en/home?intcid=wbw_nav_WBL_en_ext" target="_self">
<span class="sr-only">WB Live Logo</span>
<img alt="WB Live Logo" aria-hidden="true" height="35" src="/content/dam/sites/blogs/logos/wblive-logo-ENG-

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [5]:
a_tags = soup.find_all("a")
a_tags_alt = soup("a")
print(a_tags[0])
print(a_tags_alt[0])

<a class="visually-hidden focusable" href="#main" tabindex="0">Skip to Main Navigation</a>
<a class="visually-hidden focusable" href="#main" tabindex="0">Skip to Main Navigation</a>


How many links did we obtain?

In [6]:
print(len(a_tags))

70


That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get many hits, many of which you might not want. Remember, the `a` tag defines a hyperlink (which you can see from the `href` attribute in the above output), so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? What classes do you see in the above list of the first set of HTML tags?

We can restrict our search to certain classes by adding an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="lp__nav_link"`. The keyword is `class_`, with a trailing underscore, because `class` is a reserved word in Python.

In [7]:
# Get only the 'a' tags in 'lp__nav_link' class
nav_link = soup("a", class_="lp__nav_link")
nav_link[:5]

[<a aria-current="page" class="lp__nav_link" href="/en/home" target="_self">Home</a>,
 <a aria-current="page" class="lp__nav_link" href="/en/blogs" target="_self">All Blogs</a>,
 <a aria-current="page" class="lp__nav_link" href="/en/topics" target="_self">Topics</a>,
 <a class="lp__nav_link" href="mailto:voicesen@worldbank.org">Contact</a>,
 <a class="lp__nav_link" href="https://live.worldbank.org/en/home?intcid=wbw_nav_WBL_en_ext" target="_self">
 <span class="sr-only">WB Live Logo</span>
 <img alt="WB Live Logo" aria-hidden="true" height="35" src="/content/dam/sites/blogs/logos/wblive-logo-ENG-short_white2.png" title="WB Live Logo" width="95"/>
 </a>]

Another, often more concise, way to search for elements on a website is via a **CSS selector**. For this we use a different method called `select()`. Pass a CSS selector as a string to `.select()` to get all elements that match it.

In the example above, we can use `"a.lp__nav_link"` as a CSS selector, which returns all `a` tags with class `lp__nav_link`.

In [8]:
# Get elements with the "a.lp__nav_link" CSS selector
selected = soup.select("a.lp__nav_link")
selected[:5]

[<a aria-current="page" class="lp__nav_link" href="/en/home" target="_self">Home</a>,
 <a aria-current="page" class="lp__nav_link" href="/en/blogs" target="_self">All Blogs</a>,
 <a aria-current="page" class="lp__nav_link" href="/en/topics" target="_self">Topics</a>,
 <a class="lp__nav_link" href="mailto:voicesen@worldbank.org">Contact</a>,
 <a class="lp__nav_link" href="https://live.worldbank.org/en/home?intcid=wbw_nav_WBL_en_ext" target="_self">
 <span class="sr-only">WB Live Logo</span>
 <img alt="WB Live Logo" aria-hidden="true" height="35" src="/content/dam/sites/blogs/logos/wblive-logo-ENG-short_white2.png" title="WB Live Logo" width="95"/>
 </a>]
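
As a quick check that the two approaches return the same elements, here is a minimal sketch on a made-up HTML snippet (the markup is an assumption for illustration, not copied from the World Bank page). It uses `html.parser` so that it runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not taken from the World Bank page
html = """
<nav>
  <a class="lp__nav_link" href="/en/home">Home</a>
  <a class="lp__nav_link" href="/en/blogs">All Blogs</a>
  <a class="toplink" href="http://www.worldbank.org">worldbank.org</a>
</nav>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Tag name plus class filter
by_find_all = demo_soup.find_all("a", class_="lp__nav_link")

# The same query written as a CSS selector
by_select = demo_soup.select("a.lp__nav_link")

print([a.text for a in by_find_all])
print(list(by_find_all) == list(by_select))
```

Which form to use is largely a matter of taste. CSS selectors tend to be handier once a query combines several conditions, such as nested tags or multiple classes.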

In this case, browsing the HTML, we notice that the list of links to all the blog posts is inside a div tag. We can look at all div tags, and also identify just the one we want by looking at its attributes.

In [9]:
div_tags = soup.find_all("div")
print(len(div_tags))

159


In [10]:
main_div = soup.find('div', class_='text aem-GridColumn aem-GridColumn--default--12')
main_div

<div class="text aem-GridColumn aem-GridColumn--default--12">
<div class="cmp-text" id="text-631b7d20eb">
<p>This year was the 15<sup>th</sup> year of our tradition of posting blogs written by PhD students on the job market summarizing their job market papers. We received a record 74 submissions this year from students at 48 universities in 14 countries. The submitted posts were based on work in 30 different countries plus several multi-country studies, with India (20 posts) by far the most common setting, followed by Nigeria and Pakistan (5 each). There were more posts submitted based on RCTs this year that last year (42% vs 30%), and about the same percentage using DiD and event studies (27%), with the remainder a mix of IV, descriptive, structural models, and some that were unclear.</p>
<p>We ended up selecting 30 to publish, the most ever, and we could have taken more except for running out of calendar days before the end of the year. We again collaborated with the Cornell Economic
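
One detail worth knowing about the search above: when `class_` is given a string containing several class names, Beautiful Soup compares it against the full `class` attribute exactly as written, so the order of the classes matters. The sketch below illustrates this on a made-up `<div>` that borrows the same class string. The CSS selector at the end is an alternative that does not depend on order.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the class string mirrors the one used above
html = '<div class="text aem-GridColumn aem-GridColumn--default--12"><p>Body</p></div>'
demo_soup = BeautifulSoup(html, "html.parser")

# Exact class string, same order as in the HTML: matches
exact = demo_soup.find_all("div", class_="text aem-GridColumn aem-GridColumn--default--12")

# Same classes in a different order: no match, because the whole string is compared
reordered = demo_soup.find_all("div", class_="aem-GridColumn text aem-GridColumn--default--12")

# A single class name matches any tag that carries that class
single = demo_soup.find_all("div", class_="text")

# A CSS selector matches on the set of classes, regardless of order
selected = demo_soup.select("div.aem-GridColumn.text")

print(len(exact), len(reordered), len(single), len(selected))
```

If a site later reorders or adds classes, the exact-string search would quietly return nothing, so a single distinctive class or a CSS selector can be the more robust choice.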

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in them. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` attribute of a `Tag` object:

In [11]:
# Get all navigation links as a list
navigation_links = soup.select("a.lp__nav_link")

# Examine the first link
first_link = navigation_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

<a aria-current="page" class="lp__nav_link" href="/en/home" target="_self">Home</a>
Class:  <class 'bs4.element.Tag'>


It's a Beautiful Soup tag! This means it has a `text` attribute.

In [12]:
print(first_link.text)

Home


Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes. You can access a tag’s attributes by treating the tag like a dictionary.

In [13]:
print(first_link['href'])
print(first_link.get('href')) # equivalent

/en/home
/en/home
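
The two forms differ when the attribute is missing, which matters once we start looping over many tags. A minimal sketch, using a made-up link that has no `href`:

```python
from bs4 import BeautifulSoup

# Made-up link without an href attribute, for illustration
demo_soup = BeautifulSoup('<a class="lp__nav_link">No destination</a>', "html.parser")
link = demo_soup.a

# .attrs holds all of the tag's attributes as a dictionary
print(link.attrs)

# .get() returns None (or a default you supply) when the attribute is missing
print(link.get("href"))
print(link.get("href", "no link"))

# Square brackets raise a KeyError instead
try:
    link["href"]
except KeyError:
    print("KeyError: this tag has no href")
```

For this reason `.get()` is usually the safer choice inside loops and list comprehensions.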


What about the main `<div>` section we are interested in?

In [14]:
main_div.text

'\n\nThis year was the 15th\xa0year of our tradition of posting blogs written by PhD students on the job market summarizing their job market papers. We received a record 74 submissions this year from students at 48 universities in 14 countries. The submitted posts were based on work in 30 different countries plus several multi-country studies, with India (20 posts) by far the most common setting, followed by Nigeria and Pakistan (5 each). There were more posts submitted based on RCTs this year that last year (42% vs 30%), and about the same percentage using DiD and event studies (27%), with the remainder a mix of IV, descriptive, structural models, and some that were unclear.\nWe ended up selecting 30 to publish, the most ever, and we could have taken more except for running out of calendar days before the end of the year. We again collaborated with the Cornell Economics that Really Matters blog, which has published some more posts in their series – so far they have 10 posts up in thei

How do we get the links from within this section? 

We need to search for the 'a' tags.

In [15]:
main_div.find_all('a')

[<a href="https://www.econthatmatters.com/">10 posts up in their series</a>,
 <a href="https://blogs.worldbank.org/en/impactevaluations/when-risk-aversion-keeps-firms-small--evidence-from-kenyan-retai0">When Risk Aversion Keeps Firms Small: Evidence from Kenyan Retailers</a>,
 <a href="https://blogs.worldbank.org/en/impactevaluations/the-price-of-flexibility--revealing-salaries-in-job-postings--gu">The Price of Flexibility: Revealing Salaries in Job Postings</a>,
 <a href="https://blogs.worldbank.org/en/impactevaluations/when-every-child-is-stunted--no-child-is--how-local-norms-distor">When every child is stunted, no child Is? How local norms distort perceptions of growth</a>,
 <a href="https://blogs.worldbank.org/en/impactevaluations/priced-for-development--how-price-controls-spread-technology-but">Priced for Development? How Price Controls Spread Technology but Stall Innovation</a>,
 <a href="https://blogs.worldbank.org/en/impactevaluations/escaping-the-poverty-trap--why-some-househo

# 2. Scraping WB Development Impact blog posts

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [World Bank Development Impact 2025 JMP blog posts](https://blogs.worldbank.org/en/impactevaluations/wrap-up-of-job-market-series-2025).

Specifically, our goal is to scrape information on each blog post, including the link, title, author, website, institution, and blog text.

Right now what we have is the page hosting the links to all our target blog posts. Our first task is to collect those links, and then we will scrape information for each one.

## Extract all links

We've already scraped and souped the content of the page linking to all the blogs. We identified the section containing the links and saw how we can focus on just the link content. Now we want to extract those links. 

We'll use a list comprehension (more concise than an explicit loop, and usually a little faster) to do this.

In [16]:
# list comprehension: get the URL href for every tag a in the list main_div.find_all('a'), if the href exists
links = [a.get('href') for a in main_div.find_all('a') if a.get('href')]
links

['https://www.econthatmatters.com/',
 'https://blogs.worldbank.org/en/impactevaluations/when-risk-aversion-keeps-firms-small--evidence-from-kenyan-retai0',
 'https://blogs.worldbank.org/en/impactevaluations/the-price-of-flexibility--revealing-salaries-in-job-postings--gu',
 'https://blogs.worldbank.org/en/impactevaluations/when-every-child-is-stunted--no-child-is--how-local-norms-distor',
 'https://blogs.worldbank.org/en/impactevaluations/priced-for-development--how-price-controls-spread-technology-but',
 'https://blogs.worldbank.org/en/impactevaluations/escaping-the-poverty-trap--why-some-households-stay-poor---and-o',
 'https://blogs.worldbank.org/en/impactevaluations/lower-prices--lower-chances--how-misbeliefs-keep-freelancers-out',
 'https://blogs.worldbank.org/en/impactevaluations/fostering-trust-to-save-lives--evidence-from-organ-donation-in-t',
 'https://blogs.worldbank.org/en/impactevaluations/what-if-the-train-brought-the-job-to-you--how-public-transit-mov',
 'https://blogs.wo

We need to get rid of the first link to `econthatmatters`. We could do this structurally by searching on the strings, but since it's the first result we'll just drop that one. 

How can we do that?

In [24]:
# Drop the first element by slicing from index 1 onward
links = links[1:]
links

['https://blogs.worldbank.org/en/impactevaluations/when-risk-aversion-keeps-firms-small--evidence-from-kenyan-retai0',
 'https://blogs.worldbank.org/en/impactevaluations/the-price-of-flexibility--revealing-salaries-in-job-postings--gu',
 'https://blogs.worldbank.org/en/impactevaluations/when-every-child-is-stunted--no-child-is--how-local-norms-distor',
 'https://blogs.worldbank.org/en/impactevaluations/priced-for-development--how-price-controls-spread-technology-but',
 'https://blogs.worldbank.org/en/impactevaluations/escaping-the-poverty-trap--why-some-households-stay-poor---and-o',
 'https://blogs.worldbank.org/en/impactevaluations/lower-prices--lower-chances--how-misbeliefs-keep-freelancers-out',
 'https://blogs.worldbank.org/en/impactevaluations/fostering-trust-to-save-lives--evidence-from-organ-donation-in-t',
 'https://blogs.worldbank.org/en/impactevaluations/what-if-the-train-brought-the-job-to-you--how-public-transit-mov',
 'https://blogs.worldbank.org/en/impactevaluations/shou
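
For comparison, here is a sketch of the structural alternative mentioned above: keeping only the URLs that start with the blog's prefix. The list and the post URLs in it are made up to mimic the shape of `links` before the first entry was dropped.

```python
# Hypothetical list mimicking the structure of `links` before dropping the first entry
example_links = [
    "https://www.econthatmatters.com/",
    "https://blogs.worldbank.org/en/impactevaluations/example-post-one",
    "https://blogs.worldbank.org/en/impactevaluations/example-post-two",
]

# Keep only URLs that start with the Development Impact blog prefix
prefix = "https://blogs.worldbank.org/en/impactevaluations/"
blog_links = [url for url in example_links if url.startswith(prefix)]

print(blog_links)
```

This version does not depend on where the unwanted link sits in the list, so it would still work if the page layout changed.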

## Scraping blog content

Our goal is to obtain data from the contents of the individual blog pages. 

Here is how our data extraction process will work. We want:
1. To loop through each link in the list of links
2. To scrape and soup the content of the page
3. To extract specific information from the page: title, author, author's website, institution, and blog text

Let's look at the code for one of these to understand how it is set up. In particular, we need to identify how we will target the information we want to extract.

Let's start digging through one blog. Then we'll write a loop to extract data systematically.

In [25]:
# Fetch the content
page_res = requests.get(links[0])
page_soup = BeautifulSoup(page_res.text, 'lxml')
page_soup.prettify()[:1000]

'<!DOCTYPE HTML>\n<html lang="en">\n <head>\n  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>\n  <script data-routing="env=stage,tier=publish,ams=World Bank" defer="defer" src="https://rum.hlx.page/.rum/@adobe/helix-rum-js@%5E2/dist/micro.js" type="text/javascript">\n  </script>\n  <link href="/content/dam/wbr-redesign/logos/wbg-favicon.png" rel="shortcut icon" type="image/png"/>\n  <title>\n   When Risk Aversion Keeps Firms Small: Evidence from Kenyan Retailers: Guest post by Grady Killeen\n  </title>\n  <meta content="Development Impact" name="keywords"/>\n  <meta content="blog-details-page" name="template"/>\n  <meta content="/content/worldbankgroup/blogs/en/blogs/impactevaluations/when-risk-aversion-keeps-firms-small--evidence-from-kenyan-retai0" name="pagepath"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <link crossorigin="" href="https://fonts.googleapis.com" rel="preconnect"/>\n  <link crossorigin="" href="https://fonts.gst

This first extract of the data already tells us we can extract some information from the `<title>` tag.

In [26]:
page_soup.find_all('title')

[<title>When Risk Aversion Keeps Firms Small: Evidence from Kenyan Retailers: Guest post by Grady Killeen</title>]

The title is also available directly as an attribute of the soup (`page_soup.title`). Here we use `.string`, which returns the tag's single text child; `.text` would return the same content for this tag.


In [27]:
page_soup.title.string

'When Risk Aversion Keeps Firms Small: Evidence from Kenyan Retailers: Guest post by Grady Killeen'
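
The difference between `.string` and `.text` shows up when a tag has more than one child. A minimal sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = "<title>Example title</title><p>Hello <b>world</b></p>"
demo_soup = BeautifulSoup(html, "html.parser")

# <title> has a single text child, so .string and .text agree
print(demo_soup.title.string)
print(demo_soup.title.text)

# <p> has two children (a string and a <b> tag), so .string is None
print(demo_soup.p.string)

# .text (or .get_text()) joins all the text nested inside the tag
print(demo_soup.p.text)
```

Because `.string` can be `None`, it is worth guarding against that case before calling string methods on the result.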

How do we extract the title and author from this string? We can use string methods.

In [28]:
# Split by "Guest post by "
if "Guest post by " in page_soup.title.string:
    title_part, author_name = page_soup.title.string.split("Guest post by ", 1)
    # Remove trailing colon/spaces from title
    title_part = title_part.strip().rstrip(':')
else:
    title_part = page_soup.title.string
    author_name = "Unknown"
print(title_part)
print(author_name)

When Risk Aversion Keeps Firms Small: Evidence from Kenyan Retailers
Grady Killeen


Looking at the blog source code, it looks like the rest of what we want is in a `div` tag for the main body, with a specific class.

In [29]:
# There is only one div with the class we are targeting, so we will use find
content_div = page_soup.find('div', class_='text aem-GridColumn aem-GridColumn--default--12')
content_div

<div class="text aem-GridColumn aem-GridColumn--default--12">
<div class="cmp-text" id="text-0d51320499">
<p><i>This is the first in this year’s series of posts by PhD students on the job market.</i></p>
<p>Across low- and middle-income countries, most businesses are small and grow slowly. Economists have long pointed to credit constraints, poor management, or limited demand as explanations. But even when small firms have opportunities to expand, many hesitate. My research investigates a different barrier to private enterprise growth: firm risk aversion.<i></i></p>
<p>The prevailing view of firms in high-income settings is that they are risk neutral because investors in firms have diversified portfolios and managers typically do not own corporations. In practice, this means that enterprises will pursue investments with positive average returns even if they are risky. But the modal firm in low and middle-income countries is small and owner-operated, meaning that business losses may dire

We can notice the following things:

* The first `<p>` tag relates to the blog post series.
* The last `<p>` tag is the author bio.
* The main body text of the blog post is in between.

We can therefore extract the main text of the blog by selecting specific `<p>` tags, and then extract author information from the last `<p>` tag.

In [30]:
# Combine main text into one string
paragraphs = content_div.find_all('p')
post_text = "\n".join([p.get_text() for p in paragraphs[1:-1]])
post_text

'Across low- and middle-income countries, most businesses are small and grow slowly. Economists have long pointed to credit constraints, poor management, or limited demand as explanations. But even when small firms have opportunities to expand, many hesitate. My research investigates a different barrier to private enterprise growth: firm risk aversion.\nThe prevailing view of firms in high-income settings is that they are risk neutral because investors in firms have diversified portfolios and managers typically do not own corporations. In practice, this means that enterprises will pursue investments with positive average returns even if they are risky. But the modal firm in low and middle-income countries is small and owner-operated, meaning that business losses may directly impact owners’ consumption. This may deter small firms from pursuing high average return but uncertain investments, constraining growth.   In my job market paper, I test whether fear of short-term losses discourage

Some formatting would be needed to clean this up, but that is a problem for later - in particular if we wanted to do any text analysis on these blogs.

Now, let's extract author bio information.

In [31]:
bio_p = paragraphs[-1]

# Link: Extract the href from the <a> tag in the bio
author_site = bio_p.find('a').get('href')
                
# Institution: extract the text after "at " and clean up &nbsp;
# In Python, the HTML entity &nbsp; appears as \xa0 (a non-breaking space).
# Replace those non-breaking spaces with standard spaces
bio_text = bio_p.get_text().replace('\xa0', ' ')
institution = ""  # default in case "at " is not found
if "at " in bio_text:
    # Capture everything after the first "at " up to the next period
    institution = bio_text.split("at ", 1)[1].split('.', 1)[0].strip()

print(author_site)
print(institution)

https://gkilleen33.github.io/
the University of California, Berkeley
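
Note that the `"at "` split is a heuristic: it cuts at the first place those three characters appear, even inside another word. The sketch below uses a made-up bio to show one way this can go wrong, along with a possible alternative using the `re` package. Both the bio text and the regular expression are assumptions for illustration, and the regex would still be tripped up by institution names that contain periods.

```python
import re

# Hypothetical bio text; "what " contains the substring "at "
bio_text = "Jane Doe studies what matters at Example University. Her website is online."

# Simple split: cuts at the first "at ", which here falls inside "what "
naive = bio_text.split("at ", 1)[1].split(".", 1)[0].strip()

# Regex: \b requires "at" to start at a word boundary, so "what" is skipped
match = re.search(r"\bat\s+(.+?)\.", bio_text)
institution = match.group(1).strip() if match else ""

print(naive)
print(institution)
```

Whichever approach is used, it is worth spot-checking the extracted institutions after scraping.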


## Systematic scraping protocol

Now that we've figured out where to get the information we want, and assuming all the blog posts are set up the same way in the code, we can formalize the extraction code.

We will do a few things in case there are differences, notably writing `if` statements that let the code move on if it doesn't find what we are looking for. 

We will also code in a short lag between requests, to not overload the server. This is a best practice for web scraping. We will use the `time` package to do this.

We will start by creating a list of dictionaries, as this will be faster than creating a data frame and appending content in the loop. Then we will loop over each link in our list from before.

In [32]:
data_list = []

print(f"Starting extraction for {len(links)} links...")

for link in links:
    try: # defensive coding - if it doesn't work, returns message below
        
        # Be polite to the server by building a lag between requests
        time.sleep(1) 
        
        # Fetch the content
        page_res = requests.get(link)
        page_soup = BeautifulSoup(page_res.text, 'lxml')
        
        # Extract Title and Author from <title>
        full_title = page_soup.title.string if page_soup.title else "" # dealing with unexpected cases
        # Split by "Guest post by "
        if "Guest post by " in full_title: # dealing with unexpected cases
            title_part, author_name = full_title.split("Guest post by ", 1)
            # Remove trailing colon/spaces from title
            title_part = title_part.strip().rstrip(':')
        else:
            title_part = full_title
            author_name = "Unknown"

        # Extract Main Content and Author Bio 
        # Find the div containing the post body
        content_div = page_soup.find('div', class_='text aem-GridColumn aem-GridColumn--default--12')

        # Set blank in case following code doesn't work as intended
        post_text = ""
        author_site = ""
        institution = ""
        
        if content_div:
            # Find all <p> tags inside this div
            paragraphs = content_div.find_all('p')
            
            if len(paragraphs) > 2:
                # The first <p> tag is the "This is the Xth post..." intro
                # The last <p> tag is the author bio
                main_body_paragraphs = paragraphs[1:-1]
                
                # Combine main text into one string
                post_text = "\n".join([p.get_text() for p in main_body_paragraphs])
                
                # Link: Extract the href from the <a> tag in the last paragraph
                bio_link = paragraphs[-1].find('a')
                if bio_link:
                    author_site = bio_link.get('href')
                
                # Institution: Extract text after "at " and clean &nbsp; in the last paragraph
                # In Python, the HTML entity &nbsp; is represented as \xa0.
                # This line replaces those "non-breaking spaces" with standard spaces
                bio_text = paragraphs[-1].get_text().replace('\xa0', ' ') # Remove &nbsp;
                if "at " in bio_text:
                    # Capture everything after "at " until the period
                    institution = bio_text.split("at ", 1)[1].split('.', 1)[0].strip()

        # Save to our list
        data_list.append({
            'URL': link,
            'Title': title_part,
            'Author': author_name,
            'Author_Website': author_site,
            'Institution': institution,
            'Full_Text': post_text
        })
        print(f"Successfully processed: {title_part[:30]}...")

    except Exception as e: # what happens if our scrape doesn't work as intended
        print(f"Error skipping {link}: {e}")


Starting extraction for 30 links...
Successfully processed: When Risk Aversion Keeps Firms...
Successfully processed: The Price of Flexibility: Reve...
Successfully processed: When every child is stunted, n...
Successfully processed: Priced for Development? How Pr...
Successfully processed: Escaping the Poverty Trap: Why...
Successfully processed: Lower Prices, Lower Chances? H...
Successfully processed: Fostering Trust to Save Lives:...
Successfully processed: What if the train brought the ...
Successfully processed: Should firms subsidize worker-...
Successfully processed: Why would employees work harde...
Successfully processed: Expecting the worst: Household...
Successfully processed: Scaling short days: Even limit...
Successfully processed: Planning for Which Future? Sea...
Successfully processed: Enforcement Matters: How Niger...
Successfully processed: A framework for social network...
Successfully processed: Beyond sinking sand: How housi...
Successfully processed: Can Better S
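
One possible refinement, sketched below, is to separate the parsing logic from the requests so that it can be tried on a small piece of HTML without hitting the server. The function name `parse_post`, the constant `BODY_CLASS`, and the sample page are all hypothetical. The logic mirrors the loop above, and `html.parser` is used only so that the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

BODY_CLASS = "text aem-GridColumn aem-GridColumn--default--12"


def parse_post(html, parser="html.parser"):
    """Return a dict of fields parsed from one blog post's HTML.

    Hypothetical helper: it mirrors the logic of the loop above but takes
    HTML text as input, so it can be tried without making a request.
    """
    page_soup = BeautifulSoup(html, parser)

    # Title and author come from the <title> tag
    full_title = page_soup.title.string if page_soup.title else ""
    full_title = full_title or ""  # .string can be None
    if "Guest post by " in full_title:
        title_part, author_name = full_title.split("Guest post by ", 1)
        title_part = title_part.strip().rstrip(":")
        author_name = author_name.strip()
    else:
        title_part, author_name = full_title, "Unknown"

    record = {"Title": title_part, "Author": author_name,
              "Author_Website": "", "Institution": "", "Full_Text": ""}

    content_div = page_soup.find("div", class_=BODY_CLASS)
    paragraphs = content_div.find_all("p") if content_div else []
    if len(paragraphs) > 2:
        # Skip the series intro (first <p>) and the author bio (last <p>)
        record["Full_Text"] = "\n".join(p.get_text() for p in paragraphs[1:-1])
        bio_p = paragraphs[-1]
        bio_link = bio_p.find("a")
        if bio_link:
            record["Author_Website"] = bio_link.get("href", "")
        bio_text = bio_p.get_text().replace("\xa0", " ")
        if "at " in bio_text:
            record["Institution"] = bio_text.split("at ", 1)[1].split(".", 1)[0].strip()
    return record


# Made-up page for illustration; real posts are much longer
sample_html = """
<html><head><title>An Example Post: Guest post by Jane Doe</title></head>
<body><div class="text aem-GridColumn aem-GridColumn--default--12">
<p>This is the first post in the series.</p>
<p>Main paragraph one.</p>
<p>Main paragraph two.</p>
<p><a href="https://example.com/jane">Jane Doe</a> is a PhD candidate at Example University.</p>
</div></body></html>
"""
sample_record = parse_post(sample_html)
print(sample_record)
```

With a helper like this, the loop body would reduce to fetching the page, calling the function on `page_res.text`, adding the URL, and appending the result to `data_list`.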

Success! Now let's convert this to a data frame and export it in case we want to use it later.

In [33]:
# Export
results_df = pd.DataFrame(data_list)
results_df.to_csv('Data/world_bank_job_market_blogs_2025.csv', index=False)
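
The export above assumes that a `Data` folder already exists in the working directory; if it does not, `to_csv` will raise an error. Below is a minimal sketch of creating the folder first with the `os` package. The records and file name are made up, and the example writes to a temporary folder so that it does not touch your project files.

```python
import os
import tempfile

import pandas as pd

# Made-up records with some of the same keys as data_list
records = [
    {"URL": "https://example.com/post-1", "Title": "Post one", "Author": "Jane Doe"},
    {"URL": "https://example.com/post-2", "Title": "Post two", "Author": "John Roe"},
]

# A list of dictionaries converts to a data frame in one step
demo_df = pd.DataFrame(records)

# Write into a temporary folder here so the example does not touch your project
out_dir = os.path.join(tempfile.mkdtemp(), "Data")
os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

out_path = os.path.join(out_dir, "example_blogs.csv")
demo_df.to_csv(out_path, index=False)

print(demo_df.shape)
print(os.path.exists(out_path))
```

In the notebook itself, the equivalent step would be calling `os.makedirs('Data', exist_ok=True)` before the export.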


Some things we could consider for next steps in text analysis when we get to that part of the class:
* Standardizing Text: Stripping punctuation, handling those "stray" HTML characters, and lowercasing.
* Tokenization: Breaking the posts into individual words or sentences.
* Frequency Analysis: Seeing which topics (like "RCT," "India," or "Labor") appear most often in the 2025 series.
* Contextual Filtering: Finding all sentences that mention "Nigeria" or "Flooding" to see how researchers are currently discussing the topics we are focused on.


# 3. Scraping downloadable files

Another useful application of web scraping is bulk downloading files that are stored following some predictable format. This can save time if you know you have to download many files and don't want to go through the process of navigating to each page and manually clicking on download links.

We'll apply these tools to the case of downloading country boundary shapefiles from [GADM](https://gadm.org/download_country.html). Let's first look at the website and observe how the download page is organized.

**Question**: What do you observe?

Now we will set up our code to download shapefiles.

In [34]:
# GADM download page URL
base_url = "https://gadm.org/download_country.html"

# Start a session
session = requests.Session()

# Get the HTML content of the page
response = session.get(base_url)
soup = BeautifulSoup(response.text, "lxml")
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>\n  <link href="include/bootstrap-4.0.0-dist/css/bootstrap-united.min.css" rel="stylesheet"/>\n </head>\n <title>\n  GADM\n </title>\n <body>\n  <style type="text/css">\n   .main-container {\r\n  max-width: 940px;\r\n  margin-left: auto;\r\n  margin-right: auto;\r\n}\r\ncode {\r\n  color: inherit;\r\n  background-color: rgba(0, 0, 0, 0.04);\r\n}\r\nimg { \r\n  max-width:100%; \r\n  height: auto; \r\n}\n  </style>\n  <div class="container-fluid main-container">\n   <div class="container">\n    <div class="row">\n     <div class="col">\n      <nav class="navbar navbar-expand-lg navbar-dark bg-primary">\n       <a class="navbar-brand" href="/index.html">\n        GADM\n       </a>\n       <button aria-controls="navbarColor01" aria-expanded="false" aria-label="Toggle navigation" class="navbar-toggler" data-target="#navbarColor01" d

How do we know what country codes to use to build our URLs? We can use the information in the country dropdown menu.

In [35]:
# Extract country codes from the dropdown menu
# Observe that the dropdown is formatted with name="country" as an attribute
# Observe that each country is associated with an option value giving the GADM country code
country_options = soup.select("select[name=country] option")
country_dict = {option.text.strip(): option["value"] for option in country_options if option["value"]}
country_dict

{'Afghanistan': 'AFG_Afghanistan_3',
 'Akrotiri and Dhekelia': 'XAD_Akrotiri and Dhekelia_2',
 'Ã\x85land': 'ALA_Ã\x85land_3',
 'Albania': 'ALB_Albania_4',
 'Algeria': 'DZA_Algeria_3',
 'American Samoa': 'ASM_American Samoa_4',
 'Andorra': 'AND_Andorra_2',
 'Angola': 'AGO_Angola_4',
 'Anguilla': 'AIA_Anguilla_2',
 'Antarctica': 'ATA_Antarctica_1',
 'Antigua and Barbuda': 'ATG_Antigua and Barbuda_2',
 'Argentina': 'ARG_Argentina_3',
 'Armenia': 'ARM_Armenia_2',
 'Aruba': 'ABW_Aruba_1',
 'Australia': 'AUS_Australia_3',
 'Austria': 'AUT_Austria_5',
 'Azerbaijan': 'AZE_Azerbaijan_3',
 'Bahamas': 'BHS_Bahamas_2',
 'Bahrain': 'BHR_Bahrain_2',
 'Bangladesh': 'BGD_Bangladesh_5',
 'Barbados': 'BRB_Barbados_2',
 'Belarus': 'BLR_Belarus_3',
 'Belgium': 'BEL_Belgium_5',
 'Belize': 'BLZ_Belize_2',
 'Benin': 'BEN_Benin_4',
 'Bermuda': 'BMU_Bermuda_2',
 'Bhutan': 'BTN_Bhutan_3',
 'Bolivia': 'BOL_Bolivia_4',
 'Bonaire, Saint Eustatius and Saba': 'BES_Bonaire, Saint Eustatius and Saba_2',
 'Bosnia and 
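Note the key `'Ã\x85land'` in the output above. This pattern usually means UTF-8 bytes were decoded with the wrong encoding. The page declares `charset="utf-8"`, so a likely cause (an assumption, not verified here) is that Requests fell back to ISO-8859-1 when building `response.text`. A minimal sketch of how to repair such a string after the fact:

```python
# The garbled key exactly as it appears in the output above
garbled = "Ã\x85land"

# Undo the wrong decoding: map the characters back to their original bytes
# using Latin-1 (ISO-8859-1), then decode those bytes as UTF-8
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)

# In the notebook, an alternative is to set the encoding before parsing,
# i.e. response.encoding = "utf-8" before reading response.text
```

Setting the encoding before parsing avoids the problem at the source, so the dictionary keys match the country names as displayed on the website.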

We now have a dictionary of country names and associated GADM country values. But these are not exactly what we need to create the URLs for the shapefiles: we need just the first three characters of each value, which give the country code. Extracting them is straightforward. Then we can loop through the countries we want, build the appropriate download URL, and download the shapefiles.

In [36]:
country_dict['Brazil'][:3]

'BRA'
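The same idea can be wrapped in a small helper that goes from a dropdown value to a download URL. This is only a sketch: the function name `gadm_shapefile_url` is hypothetical, and the URL pattern is the one used in the download cell of this notebook, which may change if GADM releases a new version.

```python
def gadm_shapefile_url(dropdown_value):
    """Build the shapefile URL from a dropdown value such as 'BRA_Brazil_4'."""
    # The country code is the part before the first underscore
    country_code = dropdown_value.split("_")[0]
    # URL pattern assumed from the GADM 4.1 download links used in this notebook
    return f"https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_{country_code}_shp.zip"

# Values copied from the dropdown output above
print(gadm_shapefile_url("BGD_Bangladesh_5"))
print(gadm_shapefile_url("XAD_Akrotiri and Dhekelia_2"))
```

Splitting on the underscore gives the same result as slicing the first three characters for the values shown above, and it makes the intent a little more explicit.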

In [37]:
# List of countries to download shapefiles for
countries = ["Bangladesh", "Brazil", "Burundi"]

# Create a directory to store downloaded shapefiles
output_dir = "Data"
import os
os.makedirs(output_dir, exist_ok=True)

# Iterate over the list of desired countries
for country in countries:
    if country in country_dict:

        # Constructing the target URL
        country_code = country_dict[country][:3]
        download_url = f"https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_{country_code}_shp.zip"

        # File path for saving the zip file
        file_path = os.path.join(output_dir, f"gadm41_{country_code}_shp.zip")

        print(f"Downloading {country} shapefile")

        # Download and save the file
        # stream=True defers downloading the body so large files can be read in small pieces
        response = session.get(download_url, stream=True)
        if response.status_code == 200: # confirms the file was found
            # Memory-efficient download; "wb" is write-binary mode, needed for a zip file
            with open(file_path, "wb") as f:
                # Write 1 MB (1024*1024 bytes) at a time until the file is complete
                for chunk in response.iter_content(chunk_size=1048576):
                    f.write(chunk)
            print(f"Successfully downloaded: {country}\n")
        else:
            print(f"Failed to download: {country}\n")
    else:
        print(f"Country not found in dropdown: {country}\n")

print("All downloads complete!")

Downloading Bangladesh shapefile
Successfully downloaded: Bangladesh

Downloading Brazil shapefile
Successfully downloaded: Brazil

Downloading Burundi shapefile
Successfully downloaded: Burundi

All downloads complete!


We're done! We've successfully downloaded all the shapefiles we wanted.

You can see how this is a powerful tool for bulk downloading data where the URLs follow a common format.

We could also speed this up with parallel processing, downloading several countries simultaneously. Below is sample code for how we would do this. The time savings are largest when we want data for many countries.

In [38]:
# from concurrent.futures import ThreadPoolExecutor

# def download_country(country):
#     # Code for one country, based on the above
#     ...

# # Download up to 5 countries at once
# with ThreadPoolExecutor(max_workers=5) as executor:
#     executor.map(download_country, countries)
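
To make the pattern above concrete, here is a minimal runnable sketch. The names `download_many` and `fake_download` are hypothetical. `fake_download` is a stand-in that does no network access, so in practice you would pass in a function containing the single-country download code from the earlier cell. Threads suit this task because downloads spend most of their time waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor

def download_many(countries, download_one, max_workers=5):
    """Run download_one(country) for each country using a pool of threads.

    Returns a dict mapping each country to True if download_one finished
    without raising an exception, and False otherwise.
    """
    def safe_download(country):
        try:
            download_one(country)
            return True
        except Exception as e:
            print(f"Failed to download {country}: {e}")
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs
        outcomes = list(executor.map(safe_download, countries))
    return dict(zip(countries, outcomes))

# Hypothetical stand-in for a real download function (no network access)
def fake_download(country):
    if country == "Atlantis":
        raise ValueError("country not found")

results = download_many(["Bangladesh", "Brazil", "Atlantis"], fake_download)
print(results)
```

Catching exceptions inside each worker keeps one failed country from stopping the rest. When adapting this to real downloads, keep `max_workers` small to avoid overloading the server.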