## Web scraping data on Arbeitsstellen from the website of the BA

This code automatically downloads all Excel files (2020-2025) and all PDF files (before 2020) from the BA's website, extracts the relevant tables from the files, and merges them into one usable data frame.

The first step is to install the required packages.

In [1]:
# os ships with Python and is not a PyPI package, so it must not be passed to pip.
# pymupdf provides the fitz module used in step 3.
!pip install selenium pandas openpyxl requests pymupdf
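
`os` is part of Python's standard library rather than a PyPI package, so pip has nothing to install for it. As a minimal sketch, the helper below checks which modules can be imported in the current environment. `missing_modules` is a name introduced here for illustration and is not part of the notebook; `fitz` is the import name of PyMuPDF, which step 3 uses.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Third-party modules imported in this notebook (fitz is provided by PyMuPDF)
required = ["selenium", "pandas", "openpyxl", "requests", "fitz"]
print("Missing third-party modules:", missing_modules(required))

# Standard-library modules never need pip
print("Missing standard-library modules:", missing_modules(["os", "re", "time"]))
```

Any name printed in the first list would have to be installed with pip before the later cells can run.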


1. Step: Automated download of all Excel files (February 2020 - February 2025) from the BA website (https://statistik.arbeitsagentur.de/SiteGlobals/Forms/Suche/Einzelheftsuche_Formular.html?topic_f=analyse-gemeldete-arbeitsstellen-kldb2010)

The code first pages through all result pages of the website, collects every link that ends in 'xls' or 'xlsx' (i.e. links to Excel files), and stores these links. It then requests each stored link and saves all downloaded Excel files in one folder.
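
The link handling can be tried out without a browser. The sketch below is an assumption-laden illustration, not the notebook's own code: `is_excel_link` and `filename_from_url` are hypothetical helper names, and the `example.org` URLs are made up. It checks the extension on the URL path only (so query strings do not interfere), removes duplicates while keeping the order, and derives a file name from each link.

```python
import urllib.parse

def url_path(href):
    """Return the path part of a URL without query string or ';...' suffix."""
    return urllib.parse.urlparse(href).path.split(";")[0]

def is_excel_link(href):
    """True if the URL path ends in .xls or .xlsx."""
    return url_path(href).lower().endswith((".xls", ".xlsx"))

def filename_from_url(href):
    """Take the last path segment and decode percent-escapes such as %20."""
    return urllib.parse.unquote(url_path(href).rsplit("/", 1)[-1])

# Made-up links standing in for the hrefs collected by Selenium
links = [
    "https://example.org/docs/report%20A.xlsx?__blob=publicationFile&v=3",
    "https://example.org/docs/report.pdf;jsessionid=ABC?v=1",
    "https://example.org/docs/report%20A.xlsx?__blob=publicationFile&v=3",
]

# dict.fromkeys removes duplicates and preserves the original order
excel_links = [link for link in dict.fromkeys(links) if is_excel_link(link)]
for link in excel_links:
    print(filename_from_url(link))
```

Compared with a substring test on the whole `href`, checking the path suffix avoids false positives when '.xls' merely appears somewhere inside a longer URL.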

In [None]:
# === Load Packages ===

from selenium import webdriver # Selenium is used for the automatic download of files from the web browser
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import os
import requests
import urllib.parse
import time

# === Setup Edge ===

# Selenium needs the Microsoft Edge driver (msedgedriver.exe). The following lines start it:
service = Service("C:\\Users\\jhummels\\OneDrive - DIW Berlin\\Gehlen, Annica's files - retirement-labor-shortages\\edgedriver\\msedgedriver.exe")
options = webdriver.EdgeOptions()
options.add_argument("start-maximized")
driver = webdriver.Edge(service=service, options=options)

# === Open results page ===

# Open the BA website from which we want to download all the Excel files
driver.get("https://statistik.arbeitsagentur.de/SiteGlobals/Forms/Suche/Einzelheftsuche_Formular.html?topic_f=analyse-gemeldete-arbeitsstellen-kldb2010")
wait = WebDriverWait(driver, 20)

# === Accept cookie banner ===

# The cookie banner that appears when the page opens blocks the scraping, so it has to be dismissed first.
# The automatic click below does not work reliably: when the banner appears, accept the cookies manually,
# after which the code runs without errors. Apart from accepting cookies, you should not do anything in
# the browser window while the code is running. The window closes automatically once the cell has finished.
try:
    cookie_button = wait.until(EC.element_to_be_clickable((By.ID, "cc-all")))
    cookie_button.click()
    print("✅ Cookies accepted.")
    time.sleep(2)
except Exception:
    # This message also appears if you accepted the cookies manually
    print("ℹ️ No cookie banner found, or it was already closed.")

# === Navigation through all subpages and collection of all Excel-Links ===

# The following code browses through all subpages on the website and obtains all links, which initiate the download of excel files

all_excel_links = []
page_number = 1

# '.xls' is a substring of '.xlsx', so this single condition matches both extensions
excel_xpath = "//a[contains(@href, '.xls')]"

while True:
    print(f"\n🔄 Loading page {page_number}...")

    try:
        wait.until(EC.presence_of_all_elements_located((By.XPATH, excel_xpath)))
    except Exception:
        print("❌ No Excel links found on this page.")
        break

    elements = driver.find_elements(By.XPATH, excel_xpath)
    for el in elements:
        href = el.get_attribute("href")
        if href and href not in all_excel_links:
            all_excel_links.append(href)

    # Look for the 'next page' button and click it
    try:
        next_link = driver.find_element(By.XPATH, "//a[contains(@class, 'forward') and contains(@class, 'button')]")
        ActionChains(driver).move_to_element(next_link).perform()
        next_link.click()
        time.sleep(2)
        page_number += 1
    except Exception:
        print("✅ No further page found, or the button is disabled.")
        break

driver.quit()

# === Print the number of Excel links ===
print(f"\n🔗 Found {len(all_excel_links)} Excel files in total.")

# === Prepare download folder ===
download_folder = r"C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~1"
os.makedirs(download_folder, exist_ok=True)
failed_links = []

# === Helper function: retry logic ===
def download_with_retries(url, retries=3, delay=5):
    """Return the response for `url`, or None if all attempts fail."""
    headers = {"User-Agent": "Mozilla/5.0"}
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=20)
            if response.status_code == 200:
                return response
            print(f"⚠️ Attempt {attempt} for {url} returned HTTP {response.status_code}")
        except requests.RequestException as e:
            print(f"⚠️ Attempt {attempt} failed for {url}: {e}")
        # Wait before the next attempt (also after a non-200 response)
        if attempt < retries:
            time.sleep(delay)
    return None

# === Download Excel files ===
for link in all_excel_links:
    # Take the last path segment and strip query string and session suffix
    filename = link.split("/")[-1].split("?")[0].split(";")[0]
    filename = urllib.parse.unquote(filename)
    filepath = os.path.join(download_folder, filename)

    # Skip the file if it already exists
    if os.path.exists(filepath):
        print(f"⏩ Skipping existing file: {filename}")
        continue

    print(f"⬇️ Downloading: {filename}")
    response = download_with_retries(link)
    if response is not None:
        try:
            with open(filepath, "wb") as f:
                f.write(response.content)
            print(f"✅ Saved successfully: {filename}")
        except Exception as e:
            print(f"❌ Error while saving {filename}: {e}")
            failed_links.append(link)
    else:
        print(f"❌ Failed after all retries: {filename}")
        failed_links.append(link)

# === Save all failed links ===
if failed_links:
    with open("failed_excels.txt", "w", encoding="utf-8") as f:
        for link in failed_links:
            f.write(link + "\n")
    print(f"\n⚠️ {len(failed_links)} files could not be downloaded. Saved in 'failed_excels.txt'")
else:
    print("\n🎉 All Excel files downloaded successfully!")


2. Step: Automated download of all PDF files (October 2011 - February 2020) from the BA website (https://statistik.arbeitsagentur.de/SiteGlobals/Forms/Suche/Einzelheftsuche_Formular.html?topic_f=analyse-gemeldete-arbeitsstellen-kldb2010)

I use the same procedure as for the download of the Excel files, except that here I look for links ending in 'pdf'.
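
Both download steps depend on retrying failed requests. The retry idea can be sketched independently of `requests` and of the network: `call_with_retries`, `flaky_fetch`, and the injectable `sleep` argument are hypothetical names introduced for this illustration. Unlike a helper that only pauses after an exception, this version waits between all attempts, including those that return no usable result.

```python
import time

def call_with_retries(fetch, retries=3, delay=5, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or the retries run out."""
    for attempt in range(1, retries + 1):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception as exc:
            print(f"Attempt {attempt} failed: {exc}")
        # Pause before the next attempt, but not after the last one
        if attempt < retries:
            sleep(delay)
    return None

# Stand-in fetcher that fails twice before succeeding
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return b"file content"

# Record the requested waits instead of actually sleeping
waits = []
content = call_with_retries(flaky_fetch, retries=3, delay=5, sleep=waits.append)
print(content, waits)
```

Passing `sleep` as a parameter keeps the sketch fast to run; in a real download the default `time.sleep` would be used.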

In [None]:
# === Load Packages ===

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import os
import requests
import urllib.parse
import time

# === Edge Setup ===
service = Service("C:\\Users\\jhummels\\OneDrive - DIW Berlin\\Gehlen, Annica's files - retirement-labor-shortages\\edgedriver\\msedgedriver.exe")
options = webdriver.EdgeOptions()
options.add_argument("start-maximized")
driver = webdriver.Edge(service=service, options=options)

# === Open results page ===
driver.get("https://statistik.arbeitsagentur.de/SiteGlobals/Forms/Suche/Einzelheftsuche_Formular.html?topic_f=analyse-gemeldete-arbeitsstellen-kldb2010")
wait = WebDriverWait(driver, 20)


# === Accept cookies ===
try:
    cookie_button = wait.until(EC.element_to_be_clickable((By.ID, "cc-all")))
    cookie_button.click()
    print("✅ Cookies accepted.")
    time.sleep(2)
except Exception:
    print("ℹ️ No cookie banner found, or it was already closed.")

# === Collect all PDF links from all pages ===
all_pdf_links = []
page_number = 1
pdf_xpath = "//a[contains(@href, '.pdf')]"

while True:
    print(f"\n🔄 Loading page {page_number}...")

    try:
        wait.until(EC.presence_of_all_elements_located((By.XPATH, pdf_xpath)))
    except Exception:
        print("❌ No PDF links found on this page.")
        break

    elements = driver.find_elements(By.XPATH, pdf_xpath)
    for el in elements:
        href = el.get_attribute("href")
        if href and href not in all_pdf_links:
            all_pdf_links.append(href)

    # Go to the next page
    try:
        next_link = driver.find_element(By.XPATH, "//a[contains(@class, 'forward') and contains(@class, 'button')]")
        ActionChains(driver).move_to_element(next_link).perform()
        next_link.click()
        time.sleep(2)
        page_number += 1
    except Exception:
        print("✅ No further page found, or the button is disabled.")
        break

driver.quit()

# === Prepare download folder ===
download_folder = r"C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~2"
os.makedirs(download_folder, exist_ok=True)
failed_links = []

# === Helper function: retry logic ===
def download_with_retries(url, retries=3, delay=5):
    """Return the response for `url`, or None if all attempts fail."""
    headers = {"User-Agent": "Mozilla/5.0"}
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=20)
            if response.status_code == 200:
                return response
            print(f"⚠️ Attempt {attempt} for {url} returned HTTP {response.status_code}")
        except requests.RequestException as e:
            print(f"⚠️ Attempt {attempt} failed for {url}: {e}")
        # Wait before the next attempt (also after a non-200 response)
        if attempt < retries:
            time.sleep(delay)
    return None

# === Download PDFs ===
for link in all_pdf_links:
    # Take the last path segment and strip query string and session suffix
    filename = link.split("/")[-1].split("?")[0].split(";")[0]
    filename = urllib.parse.unquote(filename)
    filepath = os.path.join(download_folder, filename)

    # Skip the file if it already exists
    if os.path.exists(filepath):
        print(f"⏩ Skipping existing file: {filename}")
        continue

    print(f"⬇️ Downloading: {filename}")
    response = download_with_retries(link)
    if response is not None:
        try:
            with open(filepath, "wb") as f:
                f.write(response.content)
            print(f"✅ Saved successfully: {filename}")
        except Exception as e:
            print(f"❌ Error while saving {filename}: {e}")
            failed_links.append(link)
    else:
        print(f"❌ Failed after all retries: {filename}")
        failed_links.append(link)

# === Save failed links ===
if failed_links:
    with open("failed_pdfs.txt", "w", encoding="utf-8") as f:
        for link in failed_links:
            f.write(link + "\n")
    print(f"\n⚠️ {len(failed_links)} files could not be downloaded. Saved in 'failed_pdfs.txt'")
else:
    print("\n🎉 All PDFs downloaded successfully!")


3. Step: Read all PDF files and extract the relevant tables with the Engpass indicators, then convert the data into a machine-readable format (CSV)

I loop over all PDF files and use the `fitz` module from PyMuPDF to extract the page text. Keywords and a regular-expression pattern then detect the rows of the desired table in each PDF file. In the web scraping for the labor tightness data, I use the camelot package instead, which is somewhat more advanced when used with the 'network' algorithm.
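
To make the row detection concrete, here is a small sketch that uses a shortened version of the pattern (occupation code, occupation name, and only the first two numeric columns) on a single line of text. The sample line is made up to mirror the first row of the later preview; the real text layout of the PDFs may differ. `german_to_float` is a helper name introduced here.

```python
import re

# Shortened pattern: BKZ, Beruf, and the first two numeric columns only
row_pattern = re.compile(
    r"(?P<BKZ>\d{3})\s+(?P<Beruf>[\wäöüÄÖÜß\-,.()/&\s]+?)\s+"
    r"(?P<Zugang>[\d.]+)\s+(?P<Zugang_V>[-+.,\d]+)"
)

def german_to_float(text):
    """Convert German number formatting ('1.234,5') to a float (1234.5)."""
    return float(text.replace(".", "").replace(",", "."))

# Illustrative line, not copied from an actual PDF
sample = "814 Human- und Zahnmedizin 147 10,5"

match = row_pattern.search(sample)
row = match.groupdict()
row["Zugang"] = german_to_float(row["Zugang"])
row["Zugang_V"] = german_to_float(row["Zugang_V"])
print(row)
```

The lazy quantifier `+?` in the `Beruf` group lets the occupation name grow word by word until the remaining text fits the numeric columns, which is why multi-word names are captured in full.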

In [None]:
import os
import re
import fitz  # PyMuPDF
import pandas as pd

# Input and output folders
input_folder = r"C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~2"
output_folder = r"C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~3"
os.makedirs(output_folder, exist_ok=True)

# Table detection pattern
pattern = re.compile(
    r"(?P<BKZ>\d{3})\s+(?P<Beruf>[\wäöüÄÖÜß\-,.()\/&\s]+?)\s+"
    r"(?P<Zugang>[\d.]+)\s+(?P<Zugang_V>[-+.,\d]+)\s+"
    r"(?P<Bestand>[\d.]+)\s+(?P<Bestand_V>[-+.,\d]+)\s+"
    r"(?P<Anteil>[\d.,]+)\s+(?P<Anteil_V>[-+.,\d]+)\s+"
    r"(?P<Vakanzzeit>[\d.,]+)\s+(?P<Vakanzzeit_V>[-+.,\d]+)\s+"
    r"(?P<Arbeitslose>[\d.]+)\s+(?P<Arbeitslose_V>[-+.,\d]+)\s+"
    r"(?P<Relation>[\d.,]+)\s+(?P<Relation_V>[-+.,\d]+)"
)

# Loop through all PDFs
for filename in os.listdir(input_folder):
    if filename.lower().endswith(".pdf"):
        pdf_path = os.path.join(input_folder, filename)

        # Extract text from pages likely to contain the table;
        # the context manager closes the PDF once its text has been read
        text = ""
        with fitz.open(pdf_path) as doc:
            for page in doc:
                page_text = page.get_text()
                # Look for BKZ + numeric pattern
                if re.search(r"\b\d{3}\s+[A-Za-zÄÖÜäöüß]", page_text) and re.search(r"\d+\s+[-+,.0-9]+\s+\d+", page_text):
                    text += page_text + "\n"

        # Match rows using regex
        rows = []
        for match in pattern.finditer(text):
            row = match.groupdict()
            for key in row:
                row[key] = row[key].replace(".", "").replace(",", ".") if key != "Beruf" else row[key].strip()
            rows.append(row)

        if rows:
            df = pd.DataFrame(rows)
            for col in df.columns:
                if col != "Beruf":
                    df[col] = pd.to_numeric(df[col], errors="coerce")

            # Save CSV
            output_path = os.path.join(output_folder, os.path.splitext(filename)[0] + ".csv")
            df.to_csv(output_path, index=False, encoding="utf-8-sig")
            print(f"✓ Saved: {output_path}")
        else:
            print(f"⚠️ No table found in: {filename}")


4. Merge all tables extracted from the PDFs into one data frame and add Bundesland, year, and month columns based on the file names
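
The metadata comes entirely from the file name. As a minimal sketch, the regular expression used in this step can be tried on its own; `parse_filename` is a hypothetical helper and the first file name is invented to fit the pattern.

```python
import re

# Two-digit Bundesland code, then a six-digit date (YYYYMM)
filename_pattern = re.compile(r"kldb2010-(\d{2})-0-(\d{6})")

def parse_filename(name):
    """Return Bundesland, year, and month from a file name, or None if it does not fit."""
    match = filename_pattern.search(name)
    if match is None:
        return None
    code, yyyymm = match.groups()
    return {"Bundesland": code, "Year": int(yyyymm[:4]), "Month": int(yyyymm[4:])}

# Invented file name that follows the expected pattern
print(parse_filename("example-kldb2010-01-0-201110.csv"))

# Files without the pattern, such as the combined export, yield None
print(parse_filename("combined_arbeitsagentur_data.csv"))
```

Returning `None` for non-matching names is what allows the loop to skip files such as the combined export when the cell is run a second time.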


In [15]:
# Load packages
from pathlib import Path
import os
import pandas as pd
import re

# Folder with individual CSVs
csv_folder = r"C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~3"

# List to collect all dataframes
all_dfs = []

# Regex to extract Bundesland and date info
filename_pattern = re.compile(r"kldb2010-(\d{2})-0-(\d{6})")

for file in os.listdir(csv_folder):
    if file.endswith(".csv"):
        match = filename_pattern.search(file)
        if match:
            bundesland = match.group(1)
            year = match.group(2)[:4]
            month = match.group(2)[4:]
        else:
            print(f"⚠️ Skipping file with unexpected name: {file}")
            continue

        # Load CSV and add metadata
        file_path = os.path.join(csv_folder, file)
        df = pd.read_csv(file_path)
        df["Bundesland"] = bundesland
        df["Year"] = int(year)
        df["Month"] = int(month)
        all_dfs.append(df)

# Merge all
combined_df = pd.concat(all_dfs, ignore_index=True)

# Export
output_csv = os.path.join(csv_folder, "combined_arbeitsagentur_data.csv")
output_excel = os.path.join(csv_folder, "combined_arbeitsagentur_data.xlsx")

combined_df.to_csv(output_csv, index=False, encoding="utf-8-sig")
combined_df.to_excel(output_excel, index=False)

print(f"✓ Combined CSV saved to: {output_csv}")
print(f"✓ Combined Excel saved to: {output_excel}")


⚠️ Skipping file with unexpected name: combined_arbeitsagentur_data.csv
✓ Combined CSV saved to: C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~3\combined_arbeitsagentur_data.csv
✓ Combined Excel saved to: C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~3\combined_arbeitsagentur_data.xlsx


5. Clean the PDF data frame and remove small inaccuracies: re-extract the BKZ from the occupation label where necessary, delete rows that do not hold any information, and rename the columns so that they match the column names of the Excel tables
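
The BKZ recovery in this step can be illustrated on plain strings. The sketch below assumes that a damaged label consists of stray text followed by the three-digit code and the occupation name; `split_code_and_label` is a name introduced here, and the inputs are invented examples.

```python
import re

# Three-digit code, whitespace, then a label that starts with a letter
label_pattern = re.compile(r"(\d{3})\s+([A-ZÄÖÜa-zäöüß].+)")

def split_code_and_label(beruf_raw):
    """Return (BKZ, Beruf) if a code is embedded in the label, else (None, label)."""
    text = str(beruf_raw).replace("\n", " ").strip()
    match = label_pattern.search(text)
    if match:
        return match.group(1), match.group(2).strip()
    return None, text

# Invented examples: a label with leading residue and a clean label
print(split_code_and_label("Seite 4\n814 Human- und Zahnmedizin"))
print(split_code_and_label("Altenpflege"))
```

In the second case no code is found, so the caller would keep the BKZ that was already parsed for that row.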


In [16]:
# Import relevant libraries

import pandas as pd
import re

data = combined_df.copy()

# Show all rows
pd.set_option('display.max_rows', 2000)

# Optional: also widen column display if needed
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Keep only rows where BKZ consists of exactly three digits
data = data[data["BKZ"].astype(str).str.fullmatch(r"\d{3}")]

def extract_clean_bkz_and_beruf(row):
    beruf_raw = str(row.get("Beruf", "")).replace("\n", " ").strip()

    # Match pattern like: "814 Human- und Zahnmedizin" (ignore any garbage before it)
    match = re.search(r"(\d{3})\s+([A-ZÄÖÜa-zäöüß].+)", beruf_raw)
    if match:
        row["BKZ"] = match.group(1)
        row["Beruf"] = match.group(2).strip()
    return row

data = data.apply(extract_clean_bkz_and_beruf, axis=1)
# Keep rows where Beruf starts with a letter (i.e., likely a real label)
data = data[
    data["Beruf"].notna() &
    data["Beruf"].astype(str).str.match(r"^[A-ZÄÖÜa-zäöüß]")
]

# Clean up whitespace and hidden characters in all string columns
for col in ["BKZ", "Beruf", "Bundesland"]:
    data[col] = data[col].astype(str).str.strip().str.replace(r"\s+", " ", regex=True)
data = data.drop_duplicates(subset=["BKZ", "Beruf", "Bundesland", "Year", "Month"], keep="first")

# Rename columns of PDF data to match column names of the Excel Data 

data = data.rename(columns={
    "Anteil": "3_Monate_Vakant_Anteil",
    "Anteil_V" : "3_Monate_Vakant_V_abs",
    "Vakanzzeit" : "abgesch_Vakanzzeit_Tage",
    "Vakanzzeit_V" : "abgesch_Vakanzzeit_V_abs"
})

data.head()

Unnamed: 0,BKZ,Beruf,Zugang,Zugang_V,Bestand,Bestand_V,3_Monate_Vakant_Anteil,3_Monate_Vakant_V_abs,abgesch_Vakanzzeit_Tage,abgesch_Vakanzzeit_V_abs,Arbeitslose,Arbeitslose_V,Relation,Relation_V,Bundesland,Year,Month
2,814,Human- und Zahnmedizin,147,10.5,66,-1.4,48.3,-4.7,167,50,93,-9.4,140,-12,1,2011,10
3,921,Werbung und Marketing,1949,4.2,595,17.4,46.5,5.4,113,29,783,-1.0,132,-24,1,2011,10
4,821,Altenpflege,1644,-13.5,561,-3.6,50.6,5.3,110,13,748,-25.4,133,-39,1,2011,10
5,721,Versicherungs- u. Finanzdienstleistungen,437,6.8,136,-5.6,42.8,1.9,102,6,500,-10.6,368,-20,1,2011,10
6,813,"Gesundh.,Krankenpfl.,Rettungsd.Geburtsh.",1563,24.9,453,17.9,38.8,-3.0,87,-17,708,-6.2,156,-40,1,2011,10


6. Extract the relevant tables from all Excel files, clean and reformat them, then merge all Excel tables into one data frame.

I select the relevant table in each Excel file by the name of the sheet that contains it.


In [18]:
# Load relevant libraries

import os
import pandas as pd
import re

# Path to folder with Excel files
excel_folder = r"C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\ARBEIT~1"

# Collect cleaned DataFrames
all_dfs = []

# Loop through Excel files
for file in os.listdir(excel_folder):
    if file.endswith(".xlsx"):
        filepath = os.path.join(excel_folder, file)
        try:
            # Load the sheet (header=7 is zero-based, i.e. the 8th row of the sheet holds the header)
            df = pd.read_excel(filepath, sheet_name="3.1 Engpass_Tab1", header=7)

            # Drop last 4 columns
            df = df.iloc[:, :-4]

            # Rename columns (up to 24)
            df.columns = [
             "Drop", "Beruf",
             "Zugang", "Zugang_V",
             "Bestand", "Bestand_V",
             "3_Monate_Vakant_abs", "3_Monate_Vakant_V",
             "3_Monate_Vakant_Anteil", "3_Monate_Vakant_V_abs",
             "abgesch_Vakanzzeit_Tage", "abgesch_Vakanzzeit_V_abs",
             "Arbeitslose", "Arbeitslose_V",
             "SGBIII_abs", "SGBIII_V",
             "SGBII_abs", "SGBII_V",
             "Relation", "Relation_V",
             "SGBIII_abs_2", "SGBIII_V_2",
             "SGBII_abs_2", "SGBII_V_2"
            ]

            # Drop the first column (empty)
            df = df.drop(columns="Drop")

            # Remove footer or malformed rows
            df = df[df["Beruf"].astype(str).str.match(r"^\d{3}\s+.+")]

            # Split BKZ from Beruf
            df[["BKZ", "Beruf"]] = df["Beruf"].str.extract(r"^(\d{3})\s+(.+)$")

            # Convert numeric columns
            numeric_cols = [col for col in df.columns if col not in ["Beruf", "BKZ"] and df[col].ndim == 1]
            for col in numeric_cols:
                df[col] = pd.to_numeric(df[col], errors="coerce")

            # Drop "Insgesamt" row if present (check the occupation label, not a numeric column)
            df = df[~df["Beruf"].astype(str).str.contains("Insgesamt", na=False)]

            # Add metadata from filename
            match = re.search(r"kldb2010-(\d{2})-0-(\d{6})", file)
            if match:
                df["Bundesland"] = match.group(1)
                df["Year"] = int(match.group(2)[:4])
                df["Month"] = int(match.group(2)[4:])
            else:
                df["Bundesland"] = df["Year"] = df["Month"] = None

            df["source_file"] = file

            all_dfs.append(df)

        except Exception as e:
            print(f"❌ Error processing {file}: {e}")

# Combine all cleaned data
combined_excel_df = pd.concat(all_dfs, ignore_index=True)

# Preview
print(f"✓ Combined {len(all_dfs)} files. Total rows: {combined_excel_df.shape[0]}")
combined_excel_df.head()


✓ Combined 969 files. Total rows: 58452


Unnamed: 0,Beruf,Zugang,Zugang_V,Bestand,Bestand_V,3_Monate_Vakant_abs,3_Monate_Vakant_V,3_Monate_Vakant_Anteil,3_Monate_Vakant_V_abs,abgesch_Vakanzzeit_Tage,abgesch_Vakanzzeit_V_abs,Arbeitslose,Arbeitslose_V,SGBIII_abs,SGBIII_V,SGBII_abs,SGBII_V,Relation,Relation_V,SGBIII_abs_2,SGBIII_V_2,SGBII_abs_2,SGBII_V_2,BKZ,Bundesland,Year,Month,source_file
0,"Klempnerei,Sanitär,Heizung,Klimatechnik",555,-5.290102,374.916667,0.581265,257.333333,0.553566,68.6,-0.018908,211.668508,7.072048,144.666667,-6.816962,,,,,38.6,-3.063547,65.017411,-2.191057,20.1509,-2.596093,342,1,2020,2,analyse-gemeldete-arbeitsstellen-kldb2010-01-0...
1,Bau- und Transportgeräteführung,111,-31.481481,69.166667,-14.871795,44.916667,-12.924071,64.9,1.45258,210.373913,68.227572,113.75,-0.582666,,,,,164.5,23.637319,44.95276,-33.063326,41.81999,-22.154541,525,1,2020,2,analyse-gemeldete-arbeitsstellen-kldb2010-01-0...
2,Metallbearbeitung,218,-36.443149,121.416667,-20.512821,79.666667,-7.899807,65.6,8.985798,195.694118,79.882523,234.083333,18.473218,,,,,192.8,63.44262,100.128507,-4.556622,159.049047,10.092294,242,1,2020,2,analyse-gemeldete-arbeitsstellen-kldb2010-01-0...
3,Bodenverlegung,224,22.404372,128.0,1.520159,80.666667,-10.618652,63.0,-8.55881,193.036,17.831,90.833333,3.122044,,,,,71.0,1.102339,170.862262,-0.966677,254.669653,13.461432,331,1,2020,2,analyse-gemeldete-arbeitsstellen-kldb2010-01-0...
4,Energietechnik,901,-5.157895,514.583333,0.931677,327.166667,4.276228,63.6,2.039229,189.345251,14.621151,289.333333,9.423259,,,,,56.2,4.363367,268.888889,18.445203,150.222222,-4.385286,262,1,2020,2,analyse-gemeldete-arbeitsstellen-kldb2010-01-0...


Preview of the Excel data frame

In [None]:
combined_excel_df.head()

Count the missing values in the web-scraped PDF and Excel data frames

In [None]:
print(data.isna().sum())
print(combined_excel_df.isna().sum())

SGBIII_abs, SGBIII_V, SGBII_abs, and SGBII_V each have 58452 missing values, i.e. they are empty in every row, so they can be dropped without losing any information.
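
Columns that are empty in every row can also be found programmatically instead of being listed by hand. The following sketch runs on a toy data frame (`toy` and its values are illustrative only) and uses `isna().all()` to pick out the columns to drop.

```python
import numpy as np
import pandas as pd

# Toy frame: one column is empty in every row, one only in some rows
toy = pd.DataFrame({
    "Zugang": [555, 111],
    "SGBIII_abs": [np.nan, np.nan],
    "Relation": [38.6, np.nan],
})

# isna().all() is True only for columns without a single value
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

cleaned = toy.drop(columns=empty_cols)
print(cleaned.columns.tolist())
```

A column with only some missing values, like `Relation` here, is kept.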

In [19]:
combined_excel_df = combined_excel_df.drop(columns=[
    "SGBIII_abs", "SGBIII_V", "SGBII_abs", "SGBII_V"
])

Merging PDF and Excel data frames into one data frame
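
The merge relies on first giving the older data the same columns as the newer data. A minimal sketch on two toy frames (`old` and `new` are stand-ins with made-up contents) shows what `reindex` and `concat` do here.

```python
import pandas as pd

# Stand-ins for the PDF data (fewer columns) and the Excel data (more columns)
old = pd.DataFrame({"BKZ": ["814"], "Zugang": [147], "Year": [2011]})
new = pd.DataFrame({
    "BKZ": ["342"], "Zugang": [555], "Year": [2020],
    "source_file": ["example.xlsx"],
})

# reindex adds the missing columns as NaN and puts all columns in the same order
old_aligned = old.reindex(columns=new.columns)

# Stack the older rows on top of the newer ones with a fresh index
merged = pd.concat([old_aligned, new], ignore_index=True)
print(merged)
```

Note that `reindex` would also silently discard any column that exists only in the older frame, so it is worth checking that no such column is still needed.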

In [20]:

# Reindex `data` to match the column structure of `combined_excel_df`
data_aligned = data.reindex(columns=combined_excel_df.columns)

# Append older `data` before `combined_excel_df`
full_df = pd.concat([data_aligned, combined_excel_df], ignore_index=True)

# Preview result
print(full_df.shape)
full_df.head()


(149419, 24)


Unnamed: 0,Beruf,Zugang,Zugang_V,Bestand,Bestand_V,3_Monate_Vakant_abs,3_Monate_Vakant_V,3_Monate_Vakant_Anteil,3_Monate_Vakant_V_abs,abgesch_Vakanzzeit_Tage,abgesch_Vakanzzeit_V_abs,Arbeitslose,Arbeitslose_V,Relation,Relation_V,SGBIII_abs_2,SGBIII_V_2,SGBII_abs_2,SGBII_V_2,BKZ,Bundesland,Year,Month,source_file
0,Human- und Zahnmedizin,147,10.5,66.0,-1.4,,,48.3,-4.7,167.0,50.0,93.0,-9.4,140.0,-12.0,,,,,814,1,2011,10,
1,Werbung und Marketing,1949,4.2,595.0,17.4,,,46.5,5.4,113.0,29.0,783.0,-1.0,132.0,-24.0,,,,,921,1,2011,10,
2,Altenpflege,1644,-13.5,561.0,-3.6,,,50.6,5.3,110.0,13.0,748.0,-25.4,133.0,-39.0,,,,,821,1,2011,10,
3,Versicherungs- u. Finanzdienstleistungen,437,6.8,136.0,-5.6,,,42.8,1.9,102.0,6.0,500.0,-10.6,368.0,-20.0,,,,,721,1,2011,10,
4,"Gesundh.,Krankenpfl.,Rettungsd.Geburtsh.",1563,24.9,453.0,17.9,,,38.8,-3.0,87.0,-17.0,708.0,-6.2,156.0,-40.0,,,,,813,1,2011,10,


Put the columns into a more readable order


In [21]:
# Arrange the columns in a more readable order


desired_order = [
    'Bundesland', 'Year', 'Month',
    'BKZ', 'Beruf',
    'Zugang', 'Zugang_V',
    'Bestand', 'Bestand_V',
    '3_Monate_Vakant_abs', '3_Monate_Vakant_V',
    '3_Monate_Vakant_Anteil', '3_Monate_Vakant_V_abs',
    'abgesch_Vakanzzeit_Tage', 'abgesch_Vakanzzeit_V_abs',
    'Arbeitslose', 'Arbeitslose_V',
    'Relation', 'Relation_V',
    'SGBIII_abs_2', 'SGBIII_V_2',
    'SGBII_abs_2', 'SGBII_V_2',
    'source_file'
]

# Keep only the columns that are actually in the DataFrame
existing_columns = [col for col in desired_order if col in full_df.columns]

# Reorder
full_df = full_df[existing_columns]


full_df.head()



Unnamed: 0,Bundesland,Year,Month,BKZ,Beruf,Zugang,Zugang_V,Bestand,Bestand_V,3_Monate_Vakant_abs,3_Monate_Vakant_V,3_Monate_Vakant_Anteil,3_Monate_Vakant_V_abs,abgesch_Vakanzzeit_Tage,abgesch_Vakanzzeit_V_abs,Arbeitslose,Arbeitslose_V,Relation,Relation_V,SGBIII_abs_2,SGBIII_V_2,SGBII_abs_2,SGBII_V_2,source_file
0,1,2011,10,814,Human- und Zahnmedizin,147,10.5,66.0,-1.4,,,48.3,-4.7,167.0,50.0,93.0,-9.4,140.0,-12.0,,,,,
1,1,2011,10,921,Werbung und Marketing,1949,4.2,595.0,17.4,,,46.5,5.4,113.0,29.0,783.0,-1.0,132.0,-24.0,,,,,
2,1,2011,10,821,Altenpflege,1644,-13.5,561.0,-3.6,,,50.6,5.3,110.0,13.0,748.0,-25.4,133.0,-39.0,,,,,
3,1,2011,10,721,Versicherungs- u. Finanzdienstleistungen,437,6.8,136.0,-5.6,,,42.8,1.9,102.0,6.0,500.0,-10.6,368.0,-20.0,,,,,
4,1,2011,10,813,"Gesundh.,Krankenpfl.,Rettungsd.Geburtsh.",1563,24.9,453.0,17.9,,,38.8,-3.0,87.0,-17.0,708.0,-6.2,156.0,-40.0,,,,,


Export the full data frame of all web-scraped data for Arbeitsstellen nach Berufsgruppen to a CSV file

In [23]:
full_df.to_csv(r"C:\Users\jhummels\OneDrive - DIW Berlin\GEHLEN~1\Data\BA_data\A_GEME~1\Merged_data.csv", index=False)