### Web Scraping Tutorial: Scraping Podcasts from eenbeetjenederlands.nl

---

# Introduction

## What is Web Scraping?
Web scraping is the process of extracting data from websites. This tutorial will guide you through building a simple web scraper to collect podcast episode data and transcripts from the website [eenbeetjenederlands.nl](https://www.eenbeetjenederlands.nl/).

---

## Overview of the Steps
In this tutorial, we will:
1. **Import the Required Libraries** – Set up the necessary Python packages.
2. **Configure Logging** – Track the scraper's progress and log errors.
3. **Initialize Variables** – Define the base URL and data storage structures.
4. **Scrape Episode Information** – Loop through web pages to extract episode data.
5. **Save Data to CSV** – Store the collected data in a CSV file.
6. **Scrape Transcripts** – Visit each episode page and extract the transcript.
7. **Download Audio Previews** – Retrieve and save audio previews for each episode.

---

# Setup and Prerequisites

Before we begin, ensure you have the following libraries installed:
```bash
pip install requests beautifulsoup4 pandas tqdm
```

---

# Step 1: Import Required Libraries

### Description:
- **requests**: For making HTTP requests to fetch webpage content. 
  - `requests.get(url)`: Sends a GET request to the specified URL and retrieves the page's content.
  - `response.raise_for_status()`: Raises an exception if the request returns an HTTP error status.
- **BeautifulSoup**: For parsing HTML and extracting data from the webpage.
  - `BeautifulSoup(response.content, 'html.parser')`: Parses the HTML content of the webpage.
  - `soup.find()`: Finds the first matching HTML element.
  - `soup.find_all()`: Finds all matching elements.
- **pandas**: For handling and saving data.
  - `pd.DataFrame()`: Creates a structured DataFrame from a list or dictionary.
  - `df.to_csv()`: Saves the DataFrame to a CSV file.
- **logging**: For logging errors and progress.
  - `logging.info()`: Logs informational messages.
  - `logging.error()`: Logs error messages.
- **tqdm**: For creating progress bars during iteration.
  - `tqdm.tqdm()`: Wraps an iterable to show progress during loops.
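
As a quick illustration of the difference between `find()` and `find_all()`, the sketch below parses a small HTML string. The markup is invented for this example; it only mimics the kind of structure the later steps look for and is not copied from the site.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real pages may differ.
SAMPLE_HTML = """
<div id="afleveringen">
  <article><h2>Episode A</h2><p>Summary A</p><a class="button" href="/a">Listen</a></article>
  <article><h2>Episode B</h2><p>Summary B</p><a class="button" href="/b">Listen</a></article>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_article = soup.find("article")     # first matching element only
all_articles = soup.find_all("article")  # list of every matching element

first_title = first_article.find("h2").text
all_links = [a["href"] for a in soup.find_all("a", class_="button")]
missing = soup.find("table")  # find() returns None when nothing matches

print(first_title)
print(all_links)
```

Because `find()` returns `None` when there is no match, chaining another call directly onto its result fails on pages that lack the expected element.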

---


In [1]:
import logging
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import os
import tqdm

# Step 2: Configure Logging

### Description:
- Configures a logging system to track errors and information throughout the scraping process.
- Logs are saved to `eenbeetjenederlands.log` for debugging and tracking purposes.

In [2]:
logging.basicConfig(
    filename='eenbeetjenederlands.log', 
    level=logging.INFO, 
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
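
To see what a record looks like with this format string, the sketch below sends messages to an in-memory buffer instead of a file. The logger name `demo` and the `StringIO` handler are assumptions made for illustration; the tutorial itself logs to `eenbeetjenederlands.log`.

```python
import io
import logging

# Write log records to an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
)

demo_logger = logging.getLogger("demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep demo messages out of the root logger's handlers

demo_logger.info("Scraper started")
demo_logger.debug("Not recorded: DEBUG is below the INFO threshold")

log_line = buffer.getvalue().strip()
print(log_line)
```

Only the `INFO` message is written, prefixed with a timestamp, the logger name, and the level.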

# Step 3: Initialize Variables

### Description:
- Initializes the starting URL for the scraper and an empty list to store episode data.

In [3]:
base_url = 'https://www.eenbeetjenederlands.nl/'
episodes_list = []

# Step 4: Scraping Episodes List

### Description:
- Loops through each page of the podcast site.
- Extracts episode titles, summaries, and links, appending them to a list.
- Stops when no next page is found.
- A short delay is added between requests to prevent overwhelming the server.

__Methods:__

- **requests.get(base_url)**: Sends a GET request to fetch the current page.
- **response.raise_for_status()**: Raises an exception if the response has an HTTP error status; the `except` block then logs the error and stops the loop.
- **soup.find()**: Locates the `div` with `id='afleveringen'`, which contains the episode `article` elements.
- **find_all()**: Collects every `article` element inside that `div`.
- **time.sleep(1)**: Pauses for one second between requests to avoid overloading the server.
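
The control flow described above (fetch a page, collect its items, follow the next link, stop when there is none) can be sketched without touching the network. `crawl_pages`, `fetch_page`, and the `FAKE_SITE` dictionary are hypothetical names used only for this illustration.

```python
import time

# Hypothetical stand-in for the website: URL -> (items on the page, next URL or None)
FAKE_SITE = {
    "page-1": (["ep3", "ep2"], "page-2"),
    "page-2": (["ep1"], None),
}

def fetch_page(url):
    """Pretend to download and parse a page; returns (items, next_url)."""
    return FAKE_SITE[url]

def crawl_pages(start_url, fetch, delay=0.0):
    """Follow 'next' links from start_url until a page has none."""
    collected = []
    url = start_url
    while url:
        items, next_url = fetch(url)
        collected.extend(items)
        url = next_url
        if url:
            time.sleep(delay)  # pause only when another request follows
    return collected

all_items = crawl_pages("page-1", fetch_page)
print(all_items)
```

Passing the fetch function in as an argument keeps the loop logic separate from the HTTP and parsing details, which makes it easy to try out on fake data first.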

In [4]:
import time

while base_url:
    try:
        response = requests.get(base_url, timeout=30)
        response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as error:
        # RequestException also covers connection errors and timeouts
        logger.error(error)
        break

    soup = BeautifulSoup(response.content, 'html.parser')

    # The episode list lives in <div id="afleveringen">; stop if it is missing
    container = soup.find('div', id='afleveringen')
    if container is None:
        logger.error(f"No episode list found on {base_url}")
        break

    for episode in container.find_all('article'):
        episode_link = episode.find('a', class_='button')['href']
        episode_title = episode.find('h2').text
        episode_summary = episode.find('p').text

        episodes_list.append({'title': episode_title, 'summary': episode_summary, 'link': episode_link})

    # Follow the "next" pagination link; the loop ends on the last page
    next_page_link = soup.find('a', class_='next page-numbers')
    base_url = next_page_link['href'] if next_page_link else None

    time.sleep(1)  # Avoid overwhelming the server

# Step 5: Save Episodes Data to CSV

### Description:
- Converts the episode data list to a DataFrame and saves it to a CSV file.
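
A small round trip shows why `index=False` matters: without it, `to_csv()` writes the row index as an extra unnamed column, which comes back as `Unnamed: 0` when the file is read again. The two sample rows and the in-memory buffers are placeholders for illustration, not scraped data.

```python
import io

import pandas as pd

# Placeholder rows; the real list is filled by the scraping loop.
sample_rows = [
    {"title": "Episode A", "summary": "Summary A", "link": "https://example.com/a"},
    {"title": "Episode B", "summary": "Summary B", "link": "https://example.com/b"},
]
sample_df = pd.DataFrame(sample_rows)

# Write without the index, then read the CSV text back.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

# Write with the default index=True for comparison.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

print(columns_without_index)
print(columns_with_index)
```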


In [6]:
episodes_df = pd.DataFrame(data=episodes_list)
episodes_df.to_csv('eenbeetjenederlands.csv', index=False)

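# Steps 6 and 7: Scraping Transcripts and Audio Previews

### Description:
- Iterates over each episode link, scrapes the transcript, and logs a warning if none is found.
- Step 7 runs in the same loop: it follows the episode's embedded player `iframe`, reads the audio preview URL from the player page's `__NEXT_DATA__` JSON, and saves the clip as an `.mp3` file in `./data/clips/`.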
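
Two details of this step are easy to get wrong: reading a deeply nested key from JSON whose shape may change, and turning an episode title into a file name. The sketch below shows one defensive way to handle each. `get_nested`, `safe_filename`, and the sample JSON are hypothetical; the key path matches the one used in the tutorial's code, but the payload is invented.

```python
import json
import re

def get_nested(data, keys, default=None):
    """Walk a chain of dictionary keys; return default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

def safe_filename(title, replacement="_"):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    return cleaned.strip()

# Key path used by the tutorial to reach the audio preview URL.
AUDIO_PATH = ["props", "pageProps", "state", "data", "entity", "audioPreview", "url"]

# Invented payload with the same nesting, for illustration only.
sample_payload = {
    "props": {"pageProps": {"state": {"data": {"entity": {
        "audioPreview": {"url": "https://example.com/clip.mp3"}
    }}}}}
}
next_data = json.loads(json.dumps(sample_payload))  # simulate parsing the script text

found_url = get_nested(next_data, AUDIO_PATH)
missing_url = get_nested({"props": {}}, AUDIO_PATH)
file_name = safe_filename("#68: Tachtigjarige Oorlog (deel 2)") + ".mp3"

print(found_url)
print(missing_url)
print(file_name)
```

Returning a default instead of raising `KeyError` lets the loop log a warning and move on to the next episode when a player page has an unexpected layout.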


In [11]:
df = pd.read_csv('eenbeetjenederlands.csv')

clip_dir = './data/clips/'
os.makedirs(clip_dir, exist_ok=True)  # Create the output folder once

transcripts = []
for episode_url, episode_title in tqdm.tqdm(zip(df['link'], df['title']), total=len(df)):
    # '/' is a path separator, so replace it before using the title as a file name
    title = episode_title.replace('/', '_')

    try:
        response = requests.get(episode_url, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        logger.error(err)
        transcripts.append("")  # Keep one entry per row so the column still fits df
        continue

    soup = BeautifulSoup(response.content, 'html.parser')

    text_finder = soup.find('div', class_='wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow')

    if text_finder:
        transcript = "\n\n".join(p.text for p in text_finder.find_all('p'))
    else:
        transcript = ""
        logger.warning(f"Transcript not found for episode {title}")
    transcripts.append(transcript)

    # Step 7: Download Audio Previews
    player_iframe = soup.find('iframe', class_='player')
    if not player_iframe:
        logger.warning(f"No player found for episode {title}")
        continue

    try:
        player_response = requests.get(player_iframe['src'], timeout=30)
        player_response.raise_for_status()
        player_soup = BeautifulSoup(player_response.content, 'html.parser')

        # The player page embeds its state as JSON in <script id="__NEXT_DATA__">
        next_data_script = player_soup.find('script', id="__NEXT_DATA__")
        next_data = json.loads(next_data_script.text)
        audio_preview_url = next_data['props']['pageProps']['state']['data']['entity']['audioPreview']['url']

        audio_response = requests.get(audio_preview_url, timeout=30)
        audio_response.raise_for_status()
    except (requests.exceptions.RequestException, AttributeError, KeyError, TypeError, ValueError) as err:
        # Covers failed requests, a missing script tag, and unexpected or invalid JSON
        logger.warning(f"Could not download audio preview for episode {title}: {err}")
        continue

    with open(os.path.join(clip_dir, title + ".mp3"), "wb") as file:
        file.write(audio_response.content)

100%|███████████████████████████████████████████| 74/74 [00:54<00:00,  1.36it/s]


In [12]:
df['transcript'] = transcripts
df.to_csv('eenbeetjenederlands.csv', index=False)

In [13]:
df.head()

| | title | summary | link | transcript |
|---|---|---|---|---|
| 0 | #70: Natuur in Nederland | Ook al is Nederland een klein en dichtbevolkt ... | https://www.eenbeetjenederlands.nl/podcast/nat... | Hallo allemaal! Dit is Een Beetje Nederlands, ... |
| 1 | #69: Anna Maria van Schurman | Veel mensen denken dat Aletta Jacobs de eerste... | https://www.eenbeetjenederlands.nl/podcast/ann... | Hallo allemaal! Dit is Een Beetje Nederlands, ... |
| 2 | #68: Tachtigjarige Oorlog (deel 2) | Het jaar 1568 is een van de belangrijkste jaar... | https://www.eenbeetjenederlands.nl/podcast/68-... | Hallo allemaal! Dit is Een Beetje Nederlands, ... |
| 3 | #67: Tachtigjarige Oorlog (deel 1) | Het jaar 1568 is een van de belangrijkste jaar... | https://www.eenbeetjenederlands.nl/podcast/67-... | Hallo allemaal! Dit is Een Beetje Nederlands, ... |
| 4 | #66: Zwangerschap in Nederland | In deze aflevering: hoe gaat een zwangerschap ... | https://www.eenbeetjenederlands.nl/podcast/zwa... | Hallo allemaal! Dit is Een Beetje Nederlands, ... |
