# Scraping Open Data Documents from Rijksoverheid

This notebook scrapes and processes ICT-related reports from the Dutch government's open data portal. Information on how to access these documents is available on the [Rijksoverheid (Dutch government) website](https://www.rijksoverheid.nl/opendata/documenten). The goal is to compile a dataset of these documents, including their metadata and the local path of each downloaded PDF, into a CSV file.

### Importing libraries

In [27]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import xml.etree.ElementTree as ET
from tqdm import tqdm
import os
from pathlib import Path

### Fetching the list of documents

In [28]:
# create directory
Path("../Data/Rijksoverheid").mkdir(parents=True, exist_ok=True)

In [29]:
# Function to fetch documents
def fetch_documents(subject, initial_date, type, offset, rows):
    """
    Fetches one page of documents from the Rijksoverheid API based on the specified parameters.
    :param subject: the subject of the documents to fetch
    :param initial_date: the initial date from which to fetch documents (format: YYYYMMDD)
    :param type: the document type to fetch (e.g. 'rapport')
    :param offset: the offset to start fetching documents from
    :param rows: the number of rows to fetch
    :return: the XML response text if successful, None otherwise
    """
    base_url = "https://opendata.rijksoverheid.nl/v1/documents"
    params = {
        "subject": subject,
        "initialdatesince": initial_date,
        "type": type,
        "offset": offset,
        "rows": rows
    }
    # A timeout keeps the notebook from hanging if the API does not respond
    response = requests.get(base_url, params=params, timeout=30)
    if response.status_code == 200:
        return response.text
    return None
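
To see which request `fetch_documents` sends, the URL can be built offline. The following is a minimal sketch using only the standard library. It assumes `requests` encodes the `params` dictionary the same way `urllib.parse.urlencode` does for these simple string and integer values, and the helper name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://opendata.rijksoverheid.nl/v1/documents"

def build_query_url(subject, initial_date, doc_type, offset, rows):
    """Return the full request URL for one page of results (hypothetical helper)."""
    params = {
        "subject": subject,
        "initialdatesince": initial_date,
        "type": doc_type,
        "offset": offset,
        "rows": rows,
    }
    return f"{BASE_URL}?{urlencode(params)}"

# Same arguments as the scrape run in this notebook, first page
url = build_query_url("ict", "20240201", "rapport", 1, 200)
print(url)
```

Printing the URL before running a long scrape is a cheap way to check the parameters in a browser first.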

### Parsing the XML response

In [30]:
# Function to parse XML and extract document metadata
def parse_xml(xml_data):
    """
    Parses the XML data and extracts the metadata for each document.
    :param xml_data: XML data to parse
    :return: a list of dictionaries containing the metadata for each document
    """
    documents = []
    root = ET.fromstring(xml_data)
    for doc in root.findall('document'):
        metadata = {
            "id": doc.find('id').text,
            "type": doc.find('type').text,
            "title": doc.find('title').text,
            "canonical": doc.find('canonical').text,
            "introduction": "",
            "lastmodified": doc.find('lastmodified').text,
            "available": doc.find('available').text,
            "initialdate": doc.find('initialdate').text,
        }
        
        # Extract the introduction text from the <p> tags of the embedded HTML.
        # findtext returns None when the element is missing, so this does not raise.
        intro_html = doc.findtext('introduction')
        if intro_html:
            intro_soup = BeautifulSoup(intro_html, 'html.parser')
            paragraphs = [p.get_text() for p in intro_soup.find_all('p')]
            metadata['introduction'] = " ".join(paragraphs).strip()
        
        documents.append(metadata)
    return documents
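
The parsing step can be tried without calling the API. The sketch below runs on a small hand-written XML string. The sample data and the helper name `extract_fields` are hypothetical, and the sample assumes that the introduction is HTML escaped inside the `<introduction>` element. It uses `findtext` with a default so that a missing element yields an empty string instead of an error.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real response contains more fields per document
SAMPLE_XML = """<documents>
  <document>
    <id>example-1</id>
    <title>Example report</title>
    <introduction>&lt;p&gt;First paragraph.&lt;/p&gt;</introduction>
  </document>
  <document>
    <id>example-2</id>
    <title>Report without introduction</title>
  </document>
</documents>"""

def extract_fields(xml_data, fields=("id", "title", "introduction")):
    """Return one dict per <document>, using "" for missing elements."""
    root = ET.fromstring(xml_data)
    return [
        {name: doc.findtext(name, default="") for name in fields}
        for doc in root.findall("document")
    ]

records = extract_fields(SAMPLE_XML)
for record in records:
    print(record)
```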

### Downloading and extracting text from the PDFs

In [88]:
def download_pdf(url):
    """
    Downloads a PDF file from the specified URL and returns the path to the downloaded file.
    :param url: the URL of the PDF file to download
    :return: the path to the downloaded PDF file if successful, None otherwise
    """
    # Ensure the directory exists
    Path("../Data/Rijksoverheid/rapporten").mkdir(parents=True, exist_ok=True)
    
    # Extract parts of the URL to create a more descriptive filename
    parts = url.strip().split('/')
    # Attempt to create a filename using the last two segments before '.pdf'
    if len(parts) > 2:
        file_name = parts[-2] + "_" + parts[-1]
    else:
        file_name = parts[-1]

    # Ensure the file ends with '.pdf'
    if not file_name.lower().endswith('.pdf'):
        file_name += ".pdf"

    file_path = f"../Data/Rijksoverheid/rapporten/{file_name}"
    
    try:
        # Make a request to the PDF URL
        response = requests.get(url)
        response.raise_for_status()
        
        # Save the pdf file
        with open(file_path, 'wb') as f:
            f.write(response.content)
        return file_path
    except requests.RequestException as e:
        print(f"Failed to download {url}: {str(e)}")
        return None
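
The file-naming rule in `download_pdf` can be checked in isolation. This sketch copies that rule into a separate function, hypothetically named `pdf_filename_from_url`, and applies it to made-up URLs. No request is made.

```python
def pdf_filename_from_url(url):
    """Build a file name from the last two path segments of a URL."""
    parts = url.strip().split("/")
    # Prefix the last segment with its parent segment when there is one
    file_name = parts[-2] + "_" + parts[-1] if len(parts) > 2 else parts[-1]
    if not file_name.lower().endswith(".pdf"):
        file_name += ".pdf"
    return file_name

# Hypothetical URLs, used only to illustrate the naming rule
name_a = pdf_filename_from_url("https://example.org/files/abc123/report.pdf")
name_b = pdf_filename_from_url("https://example.org/files/abc123/file")
print(name_a)  # abc123_report.pdf
print(name_b)  # abc123_file.pdf
```

Prefixing the parent segment reduces the chance that two documents with the same final segment overwrite each other on disk.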

### Scraping the content

In [89]:
# Scrape a document page and download the PDF it links to
def scrape_content(url):
    """
    Scrapes the document page at the URL. The content is usually a PDF file linked from the intro section of the page, so we download that PDF.
    :param url: the URL to scrape content from
    :return: the path to the downloaded PDF if successful, None if the download failed, or a message string if no PDF link was found
    """
    response = requests.get(url)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    intro_div = soup.find('div', class_='intro')
    # Take the first link in the intro div; it is assumed to point to the PDF
    if intro_div:
        link = intro_div.find('a', href=True)
        if link:
            return download_pdf(link['href'])
    # No intro div, or no link inside it
    return "No PDF found or extraction failed."
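
`scrape_content` passes the `href` it finds straight to `download_pdf`. If a page ever used a relative link, that request would fail. Whether the site does so is not verified here. A small sketch of how `urllib.parse.urljoin` could resolve such links against the page URL, using hypothetical URLs:

```python
from urllib.parse import urljoin

# Hypothetical page URL and link targets, for illustration only
page_url = "https://example.org/documenten/rapporten/2024/01/01/example"
relative_href = "/binaries/example/report.pdf"
absolute_href = "https://files.example.org/report.pdf"

# A relative link is resolved against the page; an absolute link is kept as-is
resolved_relative = urljoin(page_url, relative_href)
resolved_absolute = urljoin(page_url, absolute_href)
print(resolved_relative)
print(resolved_absolute)
```

Because `urljoin` leaves absolute URLs unchanged, it could be applied to every link without special-casing.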

### Fetching and processing documents

In [90]:
# Main function: fetch all matching documents page by page and append them to a CSV file
def fetch_and_save_documents(subject, initial_date, type, csv_file_path):
    """
    Fetches and processes documents from the Rijksoverheid API based on the specified parameters and saves them to a CSV file.
    :param subject: the subject of the documents to fetch (see https://www.rijksoverheid.nl/opendata/documenten)
    :param initial_date: the initial date from which to fetch documents (format: YYYYMMDD)
    :param type: the document type to fetch (e.g. 'rapport')
    :param csv_file_path: the path to save the CSV file
    :return: None
    """
    offset = 1
    rows = 200
    file_exists = os.path.exists(csv_file_path)
    
    while True:
        xml_data = fetch_documents(subject, initial_date, type, offset, rows)
        if xml_data:
            documents = parse_xml(xml_data)
            if not documents:
                break
            batch_documents = []
            for doc in tqdm(documents, desc=f"Processing documents {offset} to {offset + len(documents) - 1}"):
                doc['content'] = scrape_content(doc['canonical'])
                batch_documents.append(doc)
                
            # Convert the batch documents to a DataFrame
            batch_df = pd.DataFrame(batch_documents)
            # Append to the CSV file
            if not file_exists:
                batch_df.to_csv(csv_file_path, mode='w', header=True, index=False)
                file_exists = True # Ensure header is written only once
            else:
                batch_df.to_csv(csv_file_path, mode='a', header=False, index=False)
            
            offset += rows
        else:
            break
    print("All documents processed and saved to CSV file.")
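
The paging logic of the loop above can be exercised without network access. The sketch below isolates it in a generator, hypothetically named `iter_pages`, and feeds it a stand-in fetch function over five fake records. Offsets are treated as 1-based to match `offset = 1` in the notebook. Whether the real API counts from 0 or 1 is not verified here.

```python
def iter_pages(fetch_page, start=1, rows=200):
    """Yield (offset, page) pairs until fetch_page returns an empty or None result."""
    offset = start
    while True:
        page = fetch_page(offset, rows)
        if not page:
            break
        yield offset, page
        offset += rows

# Stand-in for the API: five fake records (assumption for the demo)
FAKE_RECORDS = ["doc1", "doc2", "doc3", "doc4", "doc5"]

def fake_fetch_page(offset, rows):
    # Offsets are 1-based here, so offset 1 maps to list index 0
    return FAKE_RECORDS[offset - 1 : offset - 1 + rows]

pages = list(iter_pages(fake_fetch_page, start=1, rows=2))
print(pages)
```

The loop stops on the first empty page, which mirrors the `if not documents: break` check in `fetch_and_save_documents`.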

### Scrape

In [92]:
csv_file_path = '../Data/Rijksoverheid/reports_ict_20240201.csv'
fetch_and_save_documents('ict', '20240201', 'rapport', csv_file_path)

Processing documents 1 to 7: 100%|██████████| 7/7 [00:03<00:00,  2.14it/s]

All documents processed and saved to CSV file.





In [93]:
# Load the saved dataset and display it
df = pd.read_csv('../Data/Rijksoverheid/reports_ict_20240201.csv')
df

| | id | type | title | canonical | introduction | lastmodified | available | initialdate | content |
|---|---|---|---|---|---|---|---|---|---|
| 0 | deb57a9f-0038-4056-9e03-7dc187f5178b | rapport | Raamwerk Online Leeftijdsverificatie | https://www.rijksoverheid.nl/documenten/rappor... | Dit afwegingskader is bedoeld om ontwikkelaars... | 2024-04-09T13:40:29.065Z | 2024-04-09T13:36:30.546Z | 2024-04-09T13:50:36.074+02:00 | ../Data/Rijksoverheid/rapporten/b93b2880-0c26-... |
| 1 | 9bf78cb9-2edd-4f99-9903-6241a65f901a | rapport | Arbeidsvoorwaardenonderzoek Rijksoverheid ICT | https://www.rijksoverheid.nl/documenten/rappor... | Arbeidsvoorwaardenonderzoek Rijksoverheid ICT | 2024-04-09T13:40:39.307Z | 2024-04-09T13:36:29.187Z | 2024-04-09T13:50:52.163+02:00 | ../Data/Rijksoverheid/rapporten/f8ab09b6-7ea9-... |
| 2 | 836dc993-196d-40cf-9dbc-11c0271a8ab2 | rapport | Afschrift brief Definitief BIT advies beheer e... | https://www.rijksoverheid.nl/documenten/rappor... | Afschrift van de brief van het Adviescollege I... | 2024-03-07T09:00:01.905Z | 2024-03-07T08:53:41.311Z | 2024-03-06T18:35:08.908+01:00 | ../Data/Rijksoverheid/rapporten/dpc-0dad420545... |
| 3 | d6e48953-6c86-4e8d-89cc-4b6bca11e03e | rapport | Onderzoek digitale competenties (DIGCOM) | https://www.rijksoverheid.nl/documenten/rappor... | In dit rapport worden de digitale competenties... | 2024-04-09T13:40:13.816Z | 2024-04-09T13:36:30.262Z | 2024-04-09T13:50:06.529+02:00 | ../Data/Rijksoverheid/rapporten/2d466bc4-67b1-... |
| 4 | 6371ce32-11db-44f2-ba94-45789f077da3 | rapport | Jaarmonitor Digitale Toegankelijkheid 2023 | https://www.rijksoverheid.nl/documenten/rappor... | Jaarmonitor Digitale Toegankelijkheid 2023 | 2024-04-09T13:40:35.725Z | 2024-04-09T13:36:29.961Z | 2024-04-09T13:50:46.781+02:00 | ../Data/Rijksoverheid/rapporten/ebf701d1-bc49-... |
| 5 | 52449578-0a0f-4bd1-9b8a-5b72f6497895 | rapport | Too late to act? Europe's quest for cloud sove... | https://www.rijksoverheid.nl/documenten/rappor... | Dit document is enkel in het Engels beschikbaa... | 2024-04-23T08:35:02.092Z | 2024-04-23T08:31:38.617Z | 2024-04-23T09:55:13.157+02:00 | ../Data/Rijksoverheid/rapporten/8e449da5-0a87-... |
| 6 | 5bd73236-d17d-4ec6-8354-06040cb582d4 | rapport | Kabinetsappreciatie witboek over digitale infr... | https://www.rijksoverheid.nl/documenten/rappor... | Kabinetsreactie op het witboek ‘How to master ... | 2024-04-05T13:35:13.712Z | 2024-04-05T13:31:36.663Z | 2024-04-05T15:05:05.980+02:00 | ../Data/Rijksoverheid/rapporten/4fbe64a7-b945-... |


In [12]:
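# Inspect column types and non-null counts.
# Note: the output below comes from a run over a larger dataset (427 rows,
# several document types) than the 7-row example shown above.
df.info()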

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427 entries, 0 to 426
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            427 non-null    object
 1   type          427 non-null    object
 2   title         427 non-null    object
 3   canonical     427 non-null    object
 4   introduction  427 non-null    object
 5   lastmodified  427 non-null    object
 6   available     427 non-null    object
 7   initialdate   427 non-null    object
 8   content       422 non-null    object
dtypes: object(9)
memory usage: 30.2+ KB


In [13]:
# Which rows have NaN values in the content column?
df[df['content'].isna()]

| | id | type | title | canonical | introduction | lastmodified | available | initialdate | content |
|---|---|---|---|---|---|---|---|---|---|
| 167 | 031c9ba6-b36f-444f-a683-9a2d134a8021 | beleidsnota | Beslisnota bij antwoorden op Kamervragen over ... | https://www.rijksoverheid.nl/documenten/beleid... | In een beslisnota staat achtergrondinformatie ... | 2022-11-22T14:10:28.258Z | 2022-11-18T09:32:00.000Z | 2022-11-18T10:35:08.494+01:00 | NaN |
| 193 | c9a0943e-a057-4db2-bf52-a2f1ac8969d1 | beleidsnota | Beslisnota bij antwoorden Kamervragen over toe... | https://www.rijksoverheid.nl/documenten/beleid... | In een beslisnota staat achtergrondinformatie ... | 2022-11-28T09:19:38.056Z | 2022-11-11T16:55:00.000Z | 2022-11-11T18:12:56.474+01:00 | NaN |
| 205 | 12741548-91f5-4f41-b0cc-d380bf4125b1 | rapport | Convenant waarborging .nl-domein 2022-2029 | https://www.rijksoverheid.nl/documenten/rappor... | Het convenant tussen EZK en de Stichting Inter... | 2022-12-07T14:26:05.870Z | 2022-11-23T10:14:00.000Z | 2022-11-23T11:20:07.248+01:00 | NaN |
| 208 | e8c135f0-bade-4272-bdfe-1cfef4532e89 | kamerstuk | Kamerbrief over advies van het Adviescollege I... | https://www.rijksoverheid.nl/documenten/kamers... | Minister Yeşilgöz-Zegerius (JenV) stuurt de Tw... | 2022-12-07T09:12:45.219Z | 2022-11-24T18:28:00.000Z | 2022-11-24T18:40:07.314+01:00 | NaN |
| 335 | c69b41f5-8fdd-4409-8d00-8b2827a059f8 | beleidsnota | Beslisnota bij Kamerbrief over planning en voo... | https://www.rijksoverheid.nl/documenten/beleid... | In een beslisnota staat achtergrondinformatie ... | 2024-02-14T10:55:13.395Z | 2024-02-14T10:48:57.349Z | 2024-02-14T08:50:04.933+01:00 | NaN |
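
To get a count of the missing values rather than the rows themselves, `isna()` can be combined with `sum()`. A minimal sketch on toy data standing in for the scraped CSV. The values are hypothetical, and only the `type` and `content` columns are reproduced.

```python
import pandas as pd

# Toy data standing in for the scraped CSV (values are hypothetical)
toy_df = pd.DataFrame({
    "type": ["rapport", "beleidsnota", "rapport", "kamerstuk"],
    "content": ["a.pdf", None, "b.pdf", None],
})

# Number of rows without a downloaded file
missing_count = int(toy_df["content"].isna().sum())

# Breakdown of the missing rows by document type
missing_by_type = toy_df[toy_df["content"].isna()]["type"].value_counts().to_dict()

print(missing_count)
print(missing_by_type)
```

Applied to the real `df`, the same two expressions would show how many downloads failed and which document types they belong to.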
